# Installing The Environment

Any analysis starts with the right tools.
Here we install DeepChem and Jupyter, which we will use for the rest of the workshop.


```bash
# Install DeepChem
git clone https://github.com/lilleswing/deepchem-slim.git
cd deepchem-slim
bash scripts/install_deepchem_conda.sh future_of_care
source activate future_of_care  # on newer conda versions: conda activate future_of_care
python setup.py install
cd ..

# Install jupyter
conda install jupyter jupyterlab

# Start your notebook
jupyter lab
```
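
Before starting the notebook, it can help to confirm that the packages are importable from the active environment. The following is a minimal sketch: the helper name `check_packages` is hypothetical (it is not part of DeepChem or Jupyter), and it assumes the packages import under the names `deepchem`, `jupyterlab`, and `pandas`.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if it can be found here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report which of the workshop packages are visible to this interpreter.
for name, found in check_packages(["deepchem", "jupyterlab", "pandas"]).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If a package is reported as missing, the most likely cause is that the `future_of_care` environment is not the active one.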

## What We Are Doing Today

This paper crossed my radar last week:  
https://pubs.acs.org/doi/abs/10.1021/acs.molpharmaceut.8b00110  
TODO (Some filler about why p450 is important)





# Grabbing The Data

Training Set
https://pubchem.ncbi.nlm.nih.gov/bioassay/1851

Test Sets
* https://pubchem.ncbi.nlm.nih.gov/bioassay/410
* https://pubchem.ncbi.nlm.nih.gov/bioassay/883
* https://pubchem.ncbi.nlm.nih.gov/bioassay/899
* https://pubchem.ncbi.nlm.nih.gov/bioassay/891
* https://pubchem.ncbi.nlm.nih.gov/bioassay/884


In [94]:
import pandas as pd

In [98]:
# Load the full PubChem data table for the training assay (AID 1851).
df = pd.read_csv('data/AID_1851_datatable_all.csv')
# Keep only the compound ID, the curve-class columns, and the activity-score columns.
curve_columns = [c for c in df.columns if 'CurveClass' in c]
score_columns = [c for c in df.columns if 'Score' in c]
ID_COLUMN = ['PUBCHEM_CID']
df = df[ID_COLUMN + curve_columns + score_columns]

In [99]:
df.columns.tolist()

['PUBCHEM_CID',
 'p450-cyp2c19-Fit_CurveClass',
 'p450-cyp2d6-Fit_CurveClass',
 'p450-cyp3a4-Fit_CurveClass',
 'p450-cyp1a2-Fit_CurveClass',
 'p450-cyp2c9-Fit_CurveClass',
 'Activity Score',
 'Activity Score.1',
 'Activity Score.2',
 'Activity Score.3',
 'Activity Score.4']
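
The listing shows five curve-class columns followed by five score columns, which pandas has de-duplicated with `.1` to `.4` suffixes. A minimal sketch of pairing them by position follows. It assumes the score columns appear in the same order as the curve-class columns, which matches the notebook's use of `'Activity Score'` with the CYP2C19 column but should be checked against the raw file. The helper name `pair_columns` is hypothetical.

```python
def pair_columns(columns):
    """Pair each curve-class column with a score column by position."""
    curve_columns = [c for c in columns if "CurveClass" in c]
    score_columns = [c for c in columns if "Score" in c]
    if len(curve_columns) != len(score_columns):
        raise ValueError("expected one score column per curve-class column")
    return dict(zip(curve_columns, score_columns))


# Column names copied from the listing above.
columns = [
    "PUBCHEM_CID",
    "p450-cyp2c19-Fit_CurveClass",
    "p450-cyp2d6-Fit_CurveClass",
    "p450-cyp3a4-Fit_CurveClass",
    "p450-cyp1a2-Fit_CurveClass",
    "p450-cyp2c9-Fit_CurveClass",
    "Activity Score",
    "Activity Score.1",
    "Activity Score.2",
    "Activity Score.3",
    "Activity Score.4",
]

pairs = pair_columns(columns)
for curve, score in pairs.items():
    print(curve, "->", score)
```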

In [117]:
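def label_column(score, curve_class):
    """Return 1 (active), 0 (inactive), or -1 (ambiguous, dropped later)."""
    # Cast both inputs to numbers; they may arrive as strings.
    score = float(score)
    # Ignore the sign of the curve class, so -1.1 and 1.1 are treated alike.
    curve_class = str(abs(float(curve_class)))
    active_classes = ("1.1", "1.2", "2.1")

    if score >= 40 and curve_class in active_classes:
        return 1
    if curve_class not in active_classes:
        return 0
    # Curve class is in the active set but the score is below 40.
    return -1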

def read_smiles_lookup_map(fname):
    """Read a two-column CSV (CID, SMILES) into a dict mapping CID to SMILES."""
    with open(fname) as fin:
        rows = [line.strip().split(',') for line in fin]
    return {row[0]: row[1] for row in rows}
        

def df_to_single_task(df, score_column, curve_column):
    """Build a list of [smiles, label] rows for a single task."""
    # The first three rows hold column metadata rather than measurements.
    cids = df['PUBCHEM_CID'].tolist()[3:]
    curve_classes = df[curve_column].tolist()[3:]
    scores = df[score_column].tolist()[3:]

    smiles_lookup = read_smiles_lookup_map('data/smiles_lookup.csv')
    final_table = []
    for cid, curve_class, score in zip(cids, curve_classes, scores):
        # Skip rows with any missing value.
        if pd.isna(cid) or pd.isna(curve_class) or pd.isna(score):
            continue
        # CIDs are parsed as floats (e.g. 3232584.0); the lookup keys are integer strings.
        cid = str(int(cid))
        if cid not in smiles_lookup:
            continue
        label = label_column(score, curve_class)
        # Drop ambiguous compounds.
        if label == -1:
            continue
        final_table.append([smiles_lookup[cid], str(label)])
    return final_table


table = df_to_single_task(df, 'Activity Score', 'p450-cyp2c19-Fit_CurveClass')
# Note: fields are comma-separated even though the file is named .tsv.
with open('assets/2c19_train.tsv', 'w') as fout:
    for row in table:
        fout.write("%s\n" % ",".join(row))
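
To make the labeling rule concrete, here is a small sketch that applies it to a few hand-made (score, curve class) pairs. The input values are invented for illustration only, and the function is repeated so that the block runs on its own.

```python
def label_column(score, curve_class):
    """Return 1 (active), 0 (inactive), or -1 (ambiguous)."""
    score = float(score)
    curve_class = str(abs(float(curve_class)))
    active_classes = ("1.1", "1.2", "2.1")
    if score >= 40 and curve_class in active_classes:
        return 1
    if curve_class not in active_classes:
        return 0
    return -1


# Made-up inputs covering each branch of the rule.
examples = [
    ("85", "-1.1"),  # high score, curve class in the active set
    ("10", "4"),     # curve class outside the active set
    ("20", "-2.1"),  # curve class in the active set, score below 40
    ("90", "-3"),    # high score, but curve class outside the active set
]
labels = [label_column(score, curve_class) for score, curve_class in examples]
print(labels)  # [1, 0, -1, 0]
```

Note that a high activity score alone does not make a compound active: the curve class decides first, and only compounds labeled `-1` are dropped from the output file.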

In [118]:
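# Class balance of the training set: total, positives, negatives.
# These counts don't line up with the paper, which is very common in this field.
y = [int(label) for _, label in table]  # assumption: y holds the labels from `table`
len(y), sum(y), len(y) - sum(y)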

(9323, 5103, 4220)
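
The same kind of count can be recomputed from the file written earlier, which is a useful check that nothing was lost on the way to disk. This is a minimal sketch that assumes each line has the form `smiles,label`, as produced by the write loop. The helper name `count_labels` is hypothetical, and the demo uses a temporary file with made-up rows rather than the real training file.

```python
import os
import tempfile
from collections import Counter


def count_labels(path):
    """Count how often each label occurs in a `smiles,label` file."""
    counts = Counter()
    with open(path) as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            # The label is the last comma-separated field.
            label = line.rsplit(",", 1)[1]
            counts[label] += 1
    return counts


# Demo on a throwaway file with three invented rows.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("CCO,1\nc1ccccc1,0\nCC(=O)O,0\n")
demo_counts = count_labels(tmp.name)
os.remove(tmp.name)
print(demo_counts["1"], demo_counts["0"])
```

For the real data, the call would be `count_labels('assets/2c19_train.tsv')`.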

In [119]:
# Load the data table for the test assay (AID 899) and keep the same kinds of columns.
df = pd.read_csv('data/AID_899_datatable_all.csv')
curve_columns = [c for c in df.columns if 'CurveClass' in c]
score_columns = [c for c in df.columns if 'Score' in c]
ID_COLUMN = ['PUBCHEM_CID']
df = df[ID_COLUMN + curve_columns + score_columns]

In [120]:
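# If this table names the curve-class column differently (the preview below
# shows a 'Fit_CurveClass' column), pass that name instead.
table = df_to_single_task(df, 'Activity Score', 'p450-cyp2c19-Fit_CurveClass')
# Note: fields are comma-separated even though the file is named .tsv.
with open('assets/2c19_test.tsv', 'w') as fout:
    for row in table:
        fout.write("%s\n" % ",".join(row))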

| Unnamed: 0 | PUBCHEM_CID | Fit_CurveClass |
|---|---|---|
| 0 |  | FLOAT |
| 1 |  | Numerical encoding of curve description for th... |
| 2 |  | NONE |
| 3 |  |  |
| 4 |  |  |
| 5 | 3232584.0 | 4 |
| 6 | 3232585.0 | -3 |
| 7 | 3232586.0 | -3 |
| 8 | 3232587.0 | -3 |
| 9 | 3232588.0 | -2.2 |


In [121]:
cids = df['PUBCHEM_CID'].tolist()[3:]

In [125]:
cids = [x for x in cids if not pd.isna(x)]

In [None]:
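# CIDs are parsed as floats; convert them to integer strings before the lookup.
cids = [str(int(x)) for x in cids]
# get_smiles is a helper that is not defined in this section.
smiles = get_smiles(cids)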
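
`get_smiles` is not shown in this section. One possible way to implement such a lookup is PubChem's PUG REST service. The sketch below is an assumption about how that could look, not the helper used in the workshop. The names `fetch_smiles`, `chunked`, and `http_get` are hypothetical, the batch size is arbitrary, and the response is parsed by column position rather than by header name. The demo uses a canned response instead of a network call.

```python
import csv
import io
import urllib.request

# PUG REST property endpoint; {cids} is a comma-separated list of CIDs.
PUG_URL = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
           "{cids}/property/CanonicalSMILES/CSV")


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fetch_smiles(cids, fetch=http_get, batch_size=100):
    """Return a dict mapping CID string to SMILES string."""
    lookup = {}
    for batch in chunked(list(cids), batch_size):
        text = fetch(PUG_URL.format(cids=",".join(batch)))
        rows = list(csv.reader(io.StringIO(text)))
        for row in rows[1:]:  # skip the header row
            if len(row) >= 2:
                lookup[row[0]] = row[1]
    return lookup


def fake_fetch(url):
    """Canned response for the demo, so no network access is needed."""
    return '"CID","CanonicalSMILES"\n2244,"CC(=O)OC1=CC=CC=C1C(=O)O"\n'


demo = fetch_smiles(["2244"], fetch=fake_fetch)
print(demo)
```

The returned dict has the same CID to SMILES shape that `read_smiles_lookup_map` produces, so it could be written out as `data/smiles_lookup.csv` with one `cid,smiles` pair per line.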