# Create Drug Repurposing Hub annotation file

This notebook builds a compound annotation file from the Drug Repurposing Hub samples and drugs tables.

Steps

- Read in SMILES and pert_iname from the samples file
- Standardize SMILES and get the corresponding InChIKey
- Create a table of InChIKey to pert_iname pairs
- Read in drug annotations from the drugs file, keyed by pert_iname
- Create a single dataframe with pert_iname, InChIKey, and drug annotations -- save this as `data/compound_annot_drug_full.csv.gz`
- Get InChIKey to JCP2022 mapping
- Filter `data/compound_annot_drug_full.csv.gz` to only include compounds that have a JCP2022 mapping -- save this as `data/compound_annot_drug.csv.gz`

# Prepare data


In [14]:
%%bash
# Make sure the output directory exists before downloading
mkdir -p data

# Samples table: download, drop lines starting with "!", and compress
wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt -O data/repurposing_samples_20200324.txt
grep -v '^!' data/repurposing_samples_20200324.txt | gzip > data/repurposing_samples_20200324_cleaned.txt.gz
rm data/repurposing_samples_20200324.txt

# Drugs table: same treatment
wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_drugs_20200324.txt -O data/repurposing_drugs_20200324.txt
grep -v '^!' data/repurposing_drugs_20200324.txt | gzip > data/repurposing_drugs_20200324_cleaned.txt.gz
rm data/repurposing_drugs_20200324.txt
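
For readers who prefer to stay in Python, the cleaning step in the cell above (drop every line that starts with `!`, then gzip the rest) could be sketched with the standard library alone. The helper name `strip_bang_lines` and the toy file contents are hypothetical and only illustrate the idea; the notebook itself uses `grep` and `gzip`.

```python
import gzip
import os
import tempfile


def strip_bang_lines(src_path, dst_path, prefix="!"):
    """Copy src_path into a gzip file at dst_path, skipping lines that start with prefix.

    Hypothetical helper: mirrors the grep/gzip pipeline, not part of the notebook.
    Returns the number of lines kept.
    """
    kept = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if line.startswith(prefix):
                continue  # same effect as grep -v '^!'
            dst.write(line)
            kept += 1
    return kept


# Demo on a throwaway file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.txt")
    cleaned = os.path.join(tmp, "cleaned.txt.gz")
    with open(raw, "w", encoding="utf-8") as fh:
        fh.write("!comment line\npert_iname\tsmiles\ndrug-a\tCCO\n")
    n_kept = strip_bang_lines(raw, cleaned)
    with gzip.open(cleaned, "rt", encoding="utf-8") as fh:
        cleaned_text = fh.read()
```

Writing the cleaned file as gzip keeps it directly readable by `pd.read_csv`, which infers the compression from the `.gz` suffix.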

- Read in SMILES and pert_iname from the samples file
- Standardize SMILES and get the corresponding InChIKey

In [17]:
%%bash
python \
    StandardizeMolecule.py \
    --num_cpu 10 \
    --limit_rows 10 \
    --augment \
    --input data/repurposing_samples_20200324_cleaned.txt.gz \
    --output data/repurposing_samples_20200324_standardized.csv.gz \
    run

INFO:root:Number of CPUs: 10
100%|██████████| 4/4 [00:01<00:00,  2.58it/s]
data/repurposing_samples_20200324_standardized.csv.gz

# Create annotation file

In [7]:
import pandas as pd

- Create a table of InChIKey to pert_iname pairs

In [18]:
# Load the standardized samples and keep one row per (InChIKey, pert_iname) pair.
# Chaining avoids in-place edits on a column slice, which can trigger SettingWithCopyWarning.
inchikey__pert_iname = (
    pd.read_csv("data/repurposing_samples_20200324_standardized.csv.gz")
    [["InChIKey_standardized", "pert_iname"]]
    .drop_duplicates()
    .rename(columns={"InChIKey_standardized": "InChIKey"})
)
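
Dropping duplicate rows does not guarantee one name per InChIKey: if two samples standardize to the same structure, the same key can still appear with several `pert_iname` values. A minimal sketch of how one might check for that, using made-up keys and names rather than real Repurposing Hub rows:

```python
import pandas as pd

# Toy data: the InChIKeys and names below are invented for illustration.
samples = pd.DataFrame(
    {
        "InChIKey": [
            "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
            "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
            "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
            "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
        ],
        "pert_iname": ["drug-a", "drug-a-salt", "drug-b", "drug-a"],
    }
)

# One row per distinct (InChIKey, pert_iname) pair
pairs = samples.drop_duplicates()

# Count distinct names per key and keep the keys that have more than one
names_per_key = pairs.groupby("InChIKey")["pert_iname"].nunique()
ambiguous = names_per_key[names_per_key > 1]
print(ambiguous)
```

Keys that show up in `ambiguous` will produce several annotation rows after the merge on `pert_iname`, which is worth knowing before joining against other tables by InChIKey.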


- Read in drug annotations from the drugs file, keyed by pert_iname
- Create a single dataframe with pert_iname, InChIKey, and drug annotations -- save this as `data/compound_annot_drug_full.csv.gz`


In [19]:
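# The drugs file is tab-separated, with one row of annotations per pert_iname
pert_iname__annotations = pd.read_csv("data/repurposing_drugs_20200324_cleaned.txt.gz", sep="\t")

# Inner join: keep only names present in both the samples and the drugs tables
inchikey__annotations = pd.merge(
    inchikey__pert_iname,
    pert_iname__annotations,
    on="pert_iname",
    how="inner",
)

# Save the full annotation table as data/compound_annot_drug_full.csv.gz
inchikey__annotations.to_csv("data/compound_annot_drug_full.csv.gz", index=False)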
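
An inner join silently drops names that appear in only one of the two tables. As a hedged sketch (toy frames with invented values, not the real files), an outer join with `indicator=True` shows which rows would be lost:

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up.
left = pd.DataFrame(
    {"InChIKey": ["K1", "K2", "K3"], "pert_iname": ["drug-a", "drug-b", "drug-c"]}
)
right = pd.DataFrame(
    {"pert_iname": ["drug-a", "drug-b", "drug-d"], "moa": ["moa-1", "moa-2", "moa-4"]}
)

# What the notebook's inner join would keep
inner = pd.merge(left, right, on="pert_iname", how="inner")

# Outer join with an indicator column to audit what the inner join discards
audit = pd.merge(left, right, on="pert_iname", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["pert_iname", "_merge"]])
```

Rows marked `left_only` have a structure but no annotation, and rows marked `right_only` have an annotation but no standardized structure.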


- Get InChIKey to JCP2022 mapping
- Filter `data/compound_annot_drug_full.csv.gz` to only include compounds that have a JCP2022 mapping -- save this as `data/compound_annot_drug.csv.gz`

In [20]:
# Pin the JUMP compound metadata to a specific commit so the result is reproducible
commit = "0682dd2d52e4d68208ab4af3a0bd114ca557cb0e"
url = f"https://raw.githubusercontent.com/jump-cellpainting/datasets/{commit}/metadata/compound.csv.gz"
compound = pd.read_csv(url)

# Prefix the annotation columns with "Metadata_" to match the JUMP column naming
inchikey__annotations_filtered = inchikey__annotations.copy()
inchikey__annotations_filtered.columns = "Metadata_" + inchikey__annotations_filtered.columns

# Keep only compounds whose InChIKey appears in the JUMP compound table
in_jump = inchikey__annotations_filtered["Metadata_InChIKey"].isin(compound["Metadata_InChIKey"])
inchikey__annotations_filtered = inchikey__annotations_filtered[in_jump]

inchikey__annotations_filtered.to_csv("data/compound_annot_drug.csv.gz", index=False)

# Show the first retained row as a sanity check
inchikey__annotations_filtered.iloc[0]

Metadata_InChIKey            HJORMJIFDVBMOB-UHFFFAOYSA-N
Metadata_pert_iname                     (R)-(-)-rolipram
Metadata_clinical_phase                          Phase 1
Metadata_moa                 phosphodiesterase inhibitor
Metadata_target            PDE4A|PDE4B|PDE4C|PDE4D|PDE5A
Metadata_disease_area                                NaN
Metadata_indication                                  NaN
Name: 1, dtype: object
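
The filter above keeps rows by InChIKey membership but does not attach the JCP2022 identifier itself. A minimal sketch of the same filter on toy data, followed by an optional join that adds the identifier. It assumes the compound table has a `Metadata_JCP2022` column, which this notebook does not show; the keys and IDs are made up.

```python
import pandas as pd

# Toy annotation table and toy compound table; all values are invented.
annotations = pd.DataFrame(
    {"InChIKey": ["K1", "K2", "K3"], "pert_iname": ["drug-a", "drug-b", "drug-c"]}
)
compound = pd.DataFrame(
    {
        # Assumption: the identifier column is named Metadata_JCP2022
        "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_000002"],
        "Metadata_InChIKey": ["K1", "K3"],
    }
)

# Same effect as prepending "Metadata_" to every column name
prefixed = annotations.add_prefix("Metadata_")

# Boolean mask: True where the InChIKey is present in the compound table
in_jump = prefixed["Metadata_InChIKey"].isin(compound["Metadata_InChIKey"])
filtered = prefixed[in_jump]

# Optional: attach the identifier so each row carries its JCP2022 ID
with_ids = filtered.merge(
    compound[["Metadata_InChIKey", "Metadata_JCP2022"]],
    on="Metadata_InChIKey",
    how="left",
)
print(f"kept {int(in_jump.sum())} of {len(prefixed)} rows")
```

Printing the retained fraction is a cheap way to notice when a standardization or naming mismatch causes most compounds to fall out of the filter.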