# Table of contents

1. [Introduction](#intro)
2. [Import dependencies](#import_dep)
3. [Dataset construction](#data_constr)

# 1. Introduction
<a class="anchor" id="intro"></a>

Modern drug discovery requires researchers to: a) identify a therapeutic target, that is, a molecular structure whose abnormal activity is associated with a disease, and b) design drugs that act on that target. Both stages are costly and time-intensive, so it is crucial to devise strategies that make the search more efficient. One such strategy is to develop a more comprehensive understanding of therapeutic targets and their interaction properties. For example, the number of publications on drug-target interactions has increased over the past several years, as has the number of efforts to consolidate this information into useful databases.

Nevertheless, much of the data remains unstructured, waiting to be extracted and curated by human experts. The growing number of scientific publications makes it infeasible to manually sift through all the available literature. Instead, we first need a way to separate the relevant documents from the mass of irrelevant ones. To that end, this Jupyter notebook introduces a document classifier that distinguishes articles containing drug-target interaction information from those that do not.

# 2. Import dependencies
<a class="anchor" id="import_dep"></a>

In [1]:
# Base packages
import os
import settings
import numpy as np
import pandas as pd
from google.cloud import bigquery

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# 3. Dataset construction
<a class="anchor" id="data_constr"></a>

We will generate our classification model using the ChEMBL database. ChEMBL is an Open Data database containing binding, functional, and ADMET information for a large number of drug-like bioactive compounds. More detail can be found in the following [article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245175/). The ChEMBL database can be efficiently queried through the Google Cloud Platform; we define the credentials needed to access this platform here.

In [2]:
# Define relative paths
NOTEBOOKS = os.getcwd()
WKDIR = NOTEBOOKS.replace('/Notebooks', '')
DATA = WKDIR + '/Data'

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = WKDIR + "/" + settings.GOOGLE_CLOUD_CREDENTIALS
EBI_CHEMBL = "patents-public-data.ebi_chembl"
client = bigquery.Client()
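
Before creating the client, it can help to confirm that the credentials file actually exists, since a wrong path tends to surface as a less obvious authentication error. The sketch below is one way to do that with `pathlib`; the helper name `resolve_credentials` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_credentials(workdir, relative_path):
    """Return the absolute path to a credentials file, or raise if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(workdir) / relative_path
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return str(path.resolve())
```

In this notebook it could be called as `resolve_credentials(WKDIR, settings.GOOGLE_CLOUD_CREDENTIALS)` and the result assigned to `os.environ['GOOGLE_APPLICATION_CREDENTIALS']`.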

The ChEMBL corpus is structured as a relational database. That is, the corpus is composed of several tables that relate to one another via specific variables, or keys. For instance, the "assays" table and the "docs" table are related via the "doc_id" variable. To create our analysis dataset, we will merge the relevant tables via the appropriate keys and query the merged table to isolate the observations we are interested in. This section provides a brief overview of the tables and fields we will be using:

- We start with the "docs" table, which contains all scientific publications (journal articles or patents) from which assays have been extracted. We will use the "title" and "abstract" fields from this table to generate our model.


- Next, given that we want our model to distinguish between articles that contain drug-target information and those that do not, we need to incorporate target-level information into the analysis dataset. This information can be found in the "target_dictionary" and "target_type" tables. The "target_dictionary" table contains the target IDs ("tid") that we can map to specific assays, and the "target_type" table categorizes the targets into distinct, high-level classes (e.g., "PROTEIN", "MOLECULAR", "NON-MOLECULAR", etc.). We are only interested in querying documents associated with "PROTEIN" targets.


- The "assays" table stores the list of all the assays extracted from the documents. This table contains a variety of useful fields that map to other tables. Specifically, the "doc_id" field maps to the "doc_id" field in the "docs" table, and the "tid" field maps to the "tid" field in the "target_dictionary" table. Note that the majority of assays associated with a "PROTEIN" target are binding assays (we confirm this with the query below). To simplify our analysis dataset, we will only query documents associated with binding assays.
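
The key relationships described above can be made concrete with a few toy rows. The sketch below mimics the three joins in pandas; the rows are invented for illustration, and only the column names (`doc_id`, `tid`, `target_type`, `parent_type`, `assay_type`) follow the tables described in this section.

```python
import pandas as pd

# Toy stand-ins for the ChEMBL tables; all rows are invented.
docs = pd.DataFrame({"doc_id": [1, 2, 3], "title": ["Doc A", "Doc B", "Doc C"]})
assays = pd.DataFrame({
    "assay_id": [10, 11, 12],
    "doc_id": [1, 2, 3],
    "tid": [100, 101, 102],
    "assay_type": ["B", "F", "B"],
})
target_dictionary = pd.DataFrame({
    "tid": [100, 101, 102],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "ORGANISM"],
})
target_type = pd.DataFrame({
    "target_type": ["SINGLE PROTEIN", "ORGANISM"],
    "parent_type": ["PROTEIN", "NON-MOLECULAR"],
})

# docs -> assays via doc_id, assays -> target_dictionary via tid,
# target_dictionary -> target_type via target_type.
merged = (
    docs.merge(assays, on="doc_id")
        .merge(target_dictionary, on="tid")
        .merge(target_type, on="target_type")
)

# Keep documents with a binding assay ("B") against a PROTEIN target.
selected = merged[(merged["parent_type"] == "PROTEIN") & (merged["assay_type"] == "B")]
print(selected[["doc_id", "title"]])
```

Only the first toy document survives both filters: the second has a non-binding assay, and the third maps to a non-protein target.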

In [3]:
assay_test = f"""
SELECT DISTINCT assay.assay_id, assay.assay_type 
FROM `{EBI_CHEMBL}.assays_24` assay
INNER JOIN `{EBI_CHEMBL}.target_dictionary_24` target_dict
    ON target_dict.tid = assay.tid
INNER JOIN `{EBI_CHEMBL}.target_type_24` target_type
    ON target_type.target_type = target_dict.target_type
WHERE target_type.parent_type = "PROTEIN"
"""
assay_test = client.query(assay_test).to_dataframe()
assay_test['assay_type'].value_counts()

B    221245
F     35374
A     13530
U       126
T        73
Name: assay_type, dtype: int64
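
To put the counts above in proportion, the sketch below converts them to shares. The counts are copied from the output above; the type labels are an assumption based on the ChEMBL assay-type codes (B = binding, F = functional, A = ADMET, T = toxicity, U = unassigned) and are worth checking against the schema documentation.

```python
# Counts copied from the value_counts() output above.
assay_counts = {"B": 221245, "F": 35374, "A": 13530, "U": 126, "T": 73}

# Labels are an assumption based on the ChEMBL assay-type codes.
assay_labels = {
    "B": "binding",
    "F": "functional",
    "A": "ADMET",
    "U": "unassigned",
    "T": "toxicity",
}

total = sum(assay_counts.values())
shares = {code: count / total for code, count in assay_counts.items()}

for code, share in shares.items():
    print(f"{code} ({assay_labels[code]}): {share:.1%}")
```

On these counts, binding assays make up roughly four fifths of the assays linked to "PROTEIN" targets.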

A more thorough description of the schema can be found [here](http://uk.brahma.top/Assets/chembl/schema.html). Using the above information, we define the following queries:

In [4]:
# First, we define a query that captures all targets ("tid") from the target_dictionary table whose high-level class 
# is "PROTEIN."
protein_target_query = f"""
SELECT target_dict.tid, target_parent.parent_type 
FROM `{EBI_CHEMBL}.target_dictionary_24` target_dict 
INNER JOIN `{EBI_CHEMBL}.target_type_24` target_parent 
    ON target_parent.target_type = target_dict.target_type 
WHERE target_parent.parent_type = "PROTEIN" 
"""

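# Query the relevant documents with drug-target information: documents that have a title and an abstract
# and at least one binding assay ("B") against a "PROTEIN" target with a confidence score of 8 or higher.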
data_pos_query = f"""
SELECT DISTINCT docs.doc_id, docs.title, docs.abstract
FROM `{EBI_CHEMBL}.docs_24` docs 
INNER JOIN `{EBI_CHEMBL}.assays_24` assays 
    ON assays.doc_id = docs.doc_id 
INNER JOIN ({protein_target_query}) targets 
    ON targets.tid = assays.tid 
WHERE 
    docs.title                IS NOT NULL AND 
    docs.abstract             IS NOT NULL AND 
    assays.confidence_score >= "8"        AND 
    assays.assay_type        = "B" 
"""
data_pos = client.query(data_pos_query).to_dataframe()
data_pos['target'] = 1

# Query the remaining documents, i.e., those with a title and an abstract that the positive query did not return.
data_neg_query = f"""
SELECT DISTINCT docs.doc_id, docs.title, docs.abstract
FROM `{EBI_CHEMBL}.docs_24` docs
WHERE
    docs.doc_id NOT IN
    (
        SELECT doc_id
        FROM ({data_pos_query})
    )                         AND
    docs.title    IS NOT NULL AND
    docs.abstract IS NOT NULL
"""
data_neg = client.query(data_neg_query).to_dataframe()
data_neg['target'] = 0
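
The negative query relies on `NOT IN` to take every document that did not appear in the positive set. The same complement logic can be sketched in pandas on toy data; the rows below are invented, and `isin` stands in for the SQL subquery.

```python
import pandas as pd

# Invented rows standing in for the docs table and the positive query result.
all_docs = pd.DataFrame({
    "doc_id": [1, 2, 3, 4],
    "title": ["Doc A", "Doc B", None, "Doc D"],
    "abstract": ["Abstract A", "Abstract B", "Abstract C", "Abstract D"],
})
positive_ids = [1, 4]

# Complement of the positive set, keeping only rows with a title and an abstract.
negatives = all_docs[~all_docs["doc_id"].isin(positive_ids)]
negatives = negatives.dropna(subset=["title", "abstract"]).assign(target=0)
print(negatives)
```

Note what this definition implies: a document is labelled negative whenever it lacks a qualifying binding assay against a protein target, so documents whose assays are functional or ADMET also fall into the negative class.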

Let's examine the number of documents we have in each category.

In [5]:
print("Number of 'positive' articles: " + str(data_pos.shape[0]))
print("Number of 'negative' articles: " + str(data_neg.shape[0]))

# Save out
data_pos.to_csv(DATA + '/ChEMBL document data (positive).csv', index=False)
data_neg.to_csv(DATA + '/ChEMBL document data (negative).csv', index=False)
data_pos.head()

Number of 'positive' articles: 28288
Number of 'negative' articles: 32650


| | doc_id | title | abstract | target |
|---|---|---|---|---|
| 0 | 11595 | Para-substituted Phe3 deltorphin analogues: en... | The delta-selective opioid peptide deltorphin ... | 1 |
| 1 | 11597 | Computer-aided mapping of the beta-adrenocepto... | Anomalously low affinities for the beta-1-adre... | 1 |
| 2 | 11598 | Radiosynthesis, cerebral distribution, and bin... | An analog of 1,3-di-o-tolylguanidine (DTG), [1... | 1 |
| 3 | 11601 | Synthesis, configuration, and activity of isom... | The novel semirigid derivatives (+)-cis-1-[2-p... | 1 |
| 4 | 11607 | Dihydropyrimidine angiotensin II receptor anta... | The discovery of the nonpeptide angiotensin II... | 1 |
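
With both sets labelled, a natural next step is to stack them into one frame and check the class balance. The sketch below does this on invented rows; the combined `text` column is an assumption about how the title and abstract might be fed to a classifier, not something this section specifies.

```python
import pandas as pd

# Invented rows standing in for data_pos and data_neg.
pos = pd.DataFrame({
    "doc_id": [1, 4],
    "title": ["Doc A", "Doc D"],
    "abstract": ["Abstract A", "Abstract D"],
    "target": 1,
})
neg = pd.DataFrame({
    "doc_id": [2],
    "title": ["Doc B"],
    "abstract": ["Abstract B"],
    "target": 0,
})

# Stack the two sets and rebuild a clean 0..n-1 index.
data = pd.concat([pos, neg], ignore_index=True)

# Hypothetical text field: title and abstract joined with a space.
data["text"] = data["title"] + " " + data["abstract"]

# Fraction of documents in each class.
class_balance = data["target"].value_counts(normalize=True)
print(class_balance)
```

On the counts reported above, 28,288 of 60,938 documents (about 46%) are positive, so the two classes are reasonably balanced.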
