# Compile ADMET Data
The [admetSAR website](http://lmmd.ecust.edu.cn/admetsar1/download/) provides some of its training data as a collection of text files and Excel spreadsheets. Here, we convert each of them into a format in which the meaning of the output column is clear.

In [1]:
from dlhub_sdk.models.datasets import TabularDataset
from glob import glob
import pandas as pd
import json
import os

## Handle TXT files
Some of the files are plain-text `.txt` files.

In [2]:
txt_files = glob(os.path.join('..', 'databases', 'ADMET', '**', '*.txt'), recursive=True)
print(f'Found {len(txt_files)} files in txt format')

Found 6 files in txt format


Print the first few lines of one file:

In [3]:
with open(txt_files[0]) as fp:
    for l, _ in zip(fp, range(5)):
        print(f': {l[:64]}')

: Chemical name 	CAS RN	SMILES	End Points (pIGCC50,ug/L )	Species	
: 2,4-Dichloro-6-nitroaniline	2683434	C1=C(C=C(C(=C1Cl)N)[N+](=O)[
: 2,4-Dibromo-6-nitroaniline	827236	C1=C(C=C(C(=C1Br)N)[N+](=O)[O-
: 6-Chloro-2,4-dinitroaniline	3531199	C1=C(C=C(C(=C1[N+](=O)[O-])N
: 2-Bromo-4,6-dinitroaniline	1817738	C1=C(C=C(C(=C1[N+](=O)[O-])N)


The file shown above is tab-delimited, so we read all of the text files with a tab delimiter.

In [4]:
data = {}
for f in txt_files:
    tag = os.path.basename(f)[:-4]
    data[tag] = pd.read_csv(f, encoding='GB18030', delimiter='\t')
    print(f'Loaded {tag}. Shape: {data[tag].shape}')

Loaded R_T_TPT_I. Shape: (1571, 6)
Loaded M_CYP3A4I_I. Shape: (18561, 3)
Loaded R_T_FHMT_I. Shape: (554, 9)
Loaded R_A_Caco2_I. Shape: (674, 5)
Loaded M_CYPPro_I. Shape: (11578, 11)
Loaded WS_RM. Shape: (1708, 3)


The `GB18030` encoding, a Chinese national standard character encoding, is needed because these files were saved on a computer with a Chinese locale.
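To illustrate why the `encoding` argument matters, the sketch below round-trips a made-up two-line table through GB18030 and shows that the same bytes cannot be decoded as UTF-8. The sample text is hypothetical and is not taken from the admetSAR files.

```python
import csv
import io

# Hypothetical tab-delimited content containing a non-ASCII chemical name
text = "Chemical name\tSMILES\n苯 (benzene)\tc1ccccc1\n"
raw = text.encode("gb18030")  # bytes as they would sit on disk

# Decoding with the wrong codec raises an error on the first non-ASCII byte
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec used for writing recovers the original table
rows = list(csv.reader(io.StringIO(raw.decode("gb18030")), delimiter="\t"))
print(utf8_ok, rows)
```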

## Loading the XLS files
The other files are saved as Excel workbooks.

In [5]:
xls_files = glob(os.path.join('..', 'databases', 'ADMET', '**', '*.xls*'), recursive=True)
print(f'Found {len(xls_files)} files in Excel format')

Found 22 files in Excel format


In [6]:
for f in xls_files:
    # Drop the extension with splitext so both .xls and .xlsx names yield a clean tag
    tag = os.path.splitext(os.path.basename(f))[0]
    data[tag] = pd.read_excel(f)
    print(f'Loaded {tag}. Shape: {data[tag].shape}')

Loaded T_hERG_II. Shape: (806, 3)
Loaded E_OCT2I_I. Shape: (907, 2)
Loaded T_FHMT_I. Shape: (554, 4)
Loaded A_BBB_I. Shape: (1839, 4)
Loaded A_PgpI_I. Shape: (1273, 2)
Loaded M_BIO_I. Shape: (1604, 3)
Loaded A_Caco2_I. Shape: (674, 2)
Loaded M_CYP3A4S_I. Shape: (674, 3)
Loaded T_Carc_I. Shape: (293, 3)
Loaded A_HIA_I. Shape: (578, 3)
Loaded A_PgpI_II. Shape: (1275, 2)
Loaded T_AMES_I. Shape: (8445, 4)
Loaded T_HBT_I. Shape: (195, 4)
Loaded T_hERG_I. Shape: (368, 3)
Loaded M_CYP1A2I_I. Shape: (14903, 3)
Loaded A_PgpS_I. Shape: (332, 2)
Loaded M_CYP2D6S_I. Shape: (671, 4)
Loaded M_CYP2D6I_I. Shape: (14741, 3)
Loaded M_CYP2C9S_I. Shape: (673, 3)
Loaded M_CYP2C19I_I. Shape: (14576, 3)
Loaded T_TPT_I. Shape: (1571, 4)
Loaded M_CYP2C9I_I. Shape: (14709, 3)
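The dataset tags are derived from the file names. Because the glob pattern `*.xls*` also matches `.xlsx` files, stripping a fixed number of characters would leave a trailing dot on those names. A minimal sketch of an extension-agnostic helper follows; `tag_from_path` and the example paths are hypothetical.

```python
import os


def tag_from_path(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


# Hypothetical paths, used only to compare the two approaches
xls_path = os.path.join('ADMET', 'T_hERG_I.xls')
xlsx_path = os.path.join('ADMET', 'T_hERG_I.xlsx')

# Slicing off four characters works for .xls but not for .xlsx
fixed_slice = os.path.basename(xlsx_path)[:-4]

print(tag_from_path(xls_path), tag_from_path(xlsx_path), fixed_slice)
```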


## Standardize: Define Inputs and Outputs, Add Description, etc.
The following sections go through the datasets one group at a time to make sure we understand what each one contains.

### Toxicity
We have a few different toxicity data sources from admetSAR.

In [7]:
tox_subset = dict((k, v) for k, v in data.items() if k.startswith('T') or k.startswith('R_T'))
print(f'We have {len(tox_subset)} toxicity datasets: {" ".join(tox_subset.keys())}')

We have 9 toxicity datasets: R_T_TPT_I R_T_FHMT_I T_hERG_II T_FHMT_I T_Carc_I T_AMES_I T_HBT_I T_hERG_I T_TPT_I


Translating the names (an `R_` prefix marks a regression dataset). The translations were cross-referenced against the dataset sizes reported in the paper.
Those which are ~crossed out~ are not relevant to our tasks:
- ~R_T_TPT_I~: Toxicity toward _Tetrahymena pyriformis_, regression version (not relevant here)
- ~R_T_FHMT_I~: Fathead minnow toxicity, regression version
- T_hERG_II: Version 2 of the human Ether-à-go-go-Related Gene (hERG) inhibition dataset
- ~T_FHMT_I~: Fathead minnow toxicity, classification version
- T_Carc_I: Carcinogens (not sure exactly what this means)
- T_AMES_I: Ames test results
- ~T_HBT_I~: Honey bee toxicity
- T_hERG_I: Another version of the hERG inhibition dataset
- ~T_TPT_I~: Toxicity toward _Tetrahymena pyriformis_, classification version
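The naming convention described above can be checked programmatically. The sketch below assumes only the `R_` prefix rule stated in the text; `describe_tag` is a hypothetical helper, and the tag list is copied from the printed output.

```python
def describe_tag(tag):
    """Split a dataset tag into a regression flag and the remaining base name."""
    is_regression = tag.startswith('R_')
    base = tag[2:] if is_regression else tag
    return {'tag': tag, 'is_regression': is_regression, 'base': base}


# Tags copied from the toxicity subset printed above
tags = ['R_T_TPT_I', 'R_T_FHMT_I', 'T_hERG_II', 'T_FHMT_I', 'T_Carc_I',
        'T_AMES_I', 'T_HBT_I', 'T_hERG_I', 'T_TPT_I']

summary = [describe_tag(t) for t in tags]
regression_tags = [s['tag'] for s in summary if s['is_regression']]

# Regression datasets whose base name also appears as a separate tag
paired = sorted(s['base'] for s in summary
                if s['is_regression'] and s['base'] in tags)

print(regression_tags)
print(paired)
```

Both regression tags have a counterpart without the prefix, and the paired files have matching row counts in the load output (1571 and 554 rows).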

List the relevant datasets and mark up the key details for each: What does the name mean? Where did the data come from? Which column is the label? Which label value is "bad"?

In [8]:
relevant_tox = {
    'T_hERG_I': {
        'title': 'human Ether-à-go-go-Related Gene inhibitor, dataset 1',
        'source': '10.1002/minf.201000159',
        'label_col': 'Labels',
        'toxic_class': 1
    },
    'T_hERG_II': {
        'title': 'human Ether-à-go-go-Related Gene inhibitor, dataset 2',
        'source': '10.1021/mp300023x',
        'label_col': 'Labels',
        'toxic_class': 1
    },
    'T_Carc_I': {
        'title': 'Chemical carcinogen in rats',
        'source': '10.1002/qsar.200860192',
        'label_col': 'Lable',
        'toxic_class': 1
    },
    'T_AMES_I': {
        'title': 'Ames Assay',
        'source': '10.1021/ci300400a',
        'label_col': 'Label',
        'toxic_class': 1
    }
}
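Since this dictionary is maintained by hand, a quick consistency check can catch a missing field before anything is written to disk. The following is a sketch, not part of the original workflow: `REQUIRED_KEYS`, `find_problems`, and the `broken_entry` record are hypothetical, and the DOI test only checks for the `10.` prefix.

```python
REQUIRED_KEYS = {'title', 'source', 'label_col', 'toxic_class'}


def find_problems(entries):
    """Return a list of human-readable problems found in the metadata."""
    problems = []
    for name, details in entries.items():
        missing = REQUIRED_KEYS - details.keys()
        if missing:
            problems.append(f'{name}: missing {sorted(missing)}')
        elif not details['source'].startswith('10.'):
            problems.append(f'{name}: source does not look like a DOI')
    return problems


example = {
    # Copied from the dictionary above
    'T_AMES_I': {
        'title': 'Ames Assay',
        'source': '10.1021/ci300400a',
        'label_col': 'Label',
        'toxic_class': 1
    },
    # Hypothetical record with the DOI left out
    'broken_entry': {
        'title': 'Entry without a source',
        'label_col': 'Label',
        'toxic_class': 1
    }
}

problems = find_problems(example)
print(problems)
```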

Save the datasets and write descriptions

In [9]:
for k, details in relevant_tox.items():
    # Create a column where toxic-or-not is a boolean
    dataset = data[k]
    dataset['is_toxic'] = dataset[details['label_col']] == details['toxic_class']
    
    # Save it to disk as a CSV, creating the output directory if needed
    os.makedirs('datasets', exist_ok=True)
    out_path = os.path.join('datasets', f'{k}.csv')
    dataset.to_csv(out_path, index=False)
    
    # Start the description
    desc = TabularDataset.create_model(out_path)
    desc.set_title(details['title'])
    desc.set_name(k)
    
    # Mark the input/output columns
    desc.mark_inputs(['SMILES'])
    desc.mark_labels(['is_toxic'])
        
    # Add citations
    desc.add_related_identifier("10.1021/ci300367a", "DOI", "IsDescribedBy")
    desc.add_related_identifier(details["source"], "DOI", "IsSourceOf")

    # Write the description to a JSON file next to the CSV
    with open(os.path.join(os.path.dirname(out_path), f'{k}-description.json'), 'w') as fp:
        json.dump(desc.to_dict(), fp)
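The boolean `is_toxic` column is built by comparing the label column against the toxic class. The sketch below repeats that comparison on a toy table. The molecules and labels are made up for illustration and say nothing about real toxicity.

```python
import pandas as pd

# Toy table with made-up labels; the column name 'Label' mirrors T_AMES_I
toy = pd.DataFrame({
    'SMILES': ['CCO', 'c1ccccc1', 'CC(=O)O'],
    'Label': [0, 1, 0],
})

# Same comparison as in the loop above, with toxic_class = 1
toy['is_toxic'] = toy['Label'] == 1
n_toxic = int(toy['is_toxic'].sum())

# Caveat: if the labels were stored as text, comparing against the integer 1
# would silently mark every row as non-toxic
mismatched = toy['Label'].astype(str) == 1

print(toy)
print(n_toxic, bool(mismatched.any()))
```

Checking the dtype of the label column (for example with `dataset[details['label_col']].dtype`) before the comparison is one way to guard against that silent mismatch.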