# Create Drug Repurposing Hub annotation file

This notebook builds a compound annotation file from the Drug Repurposing Hub samples and drugs files (release 20200324).

Steps

- Read in SMILES and pert_iname from samples file
- Standardize SMILES and get the corresponding InChIKey
- Create a dictionary of InChIKey to pert_iname
- Read in drug annotations from drug file, indexed by pert_iname
- Create a single dataframe with pert_iname, InChIKey, and drug annotations, and save it as `data/compound_rephub_annot_full.csv.gz`
- Get InChIKey to JCP2022 mapping
- Filter the full annotation file to only include compounds that have a JCP2022 mapping, and save the result as `data/compound_rephub_annot.csv.gz`

# Prepare data


These steps are run ahead of time and the results are saved to the `data` directory.

```
%%bash
wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt -O data/repurposing_samples_20200324.txt

grep -v '^!' data/repurposing_samples_20200324.txt | gzip > data/repurposing_samples_20200324_cleaned.txt.gz

rm data/repurposing_samples_20200324.txt

wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_drugs_20200324.txt -O data/repurposing_drugs_20200324.txt

grep -v '^!' data/repurposing_drugs_20200324.txt | gzip > data/repurposing_drugs_20200324_cleaned.txt.gz

rm data/repurposing_drugs_20200324.txt
```
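
The cleaning step above drops every line that starts with `!` before compressing the file. A minimal Python sketch of the same filter follows, for cases where the shell tools are unavailable. The function names `clean_repurposing_text` and `gzip_bytes`, the `smiles` column name, and the sample text are hypothetical and only illustrate the idea.

```python
import gzip
import io


def clean_repurposing_text(raw_text: str) -> str:
    """Drop every line that starts with '!' and keep the rest unchanged."""
    kept = [
        line
        for line in raw_text.splitlines(keepends=True)
        if not line.startswith("!")
    ]
    return "".join(kept)


def gzip_bytes(text: str) -> bytes:
    """Compress text to gzip bytes, mirroring the `gzip` step of the pipeline."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as handle:
        handle.write(text.encode("utf-8"))
    return buffer.getvalue()


# Hypothetical input: two '!' header lines followed by a tab-separated table.
sample = (
    "!version 1\n"
    "!note\n"
    "pert_iname\tsmiles\n"
    "aspirin\tCC(=O)Oc1ccccc1C(O)=O\n"
)
cleaned = clean_repurposing_text(sample)
compressed = gzip_bytes(cleaned)
```

Only lines whose first character is `!` are removed, which matches the anchored pattern `^!` used with `grep -v`.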

- Read in SMILES and pert_iname from samples file
- Standardize SMILES and get the corresponding InChIKey

```
%%bash
python \
    StandardizeMolecule.py \
    --num_cpu 14 \
    --augment \
    --input data/repurposing_samples_20200324_cleaned.txt.gz \
    --output data/repurposing_samples_20200324_standardized.csv.gz \
    run
```

# Create annotation file

In [1]:
import pandas as pd

- Create a dictionary of InChIKey to pert_iname

In [2]:
samples = pd.read_csv("data/repurposing_samples_20200324_standardized.csv.gz")

# Keep one row per (InChIKey, pert_iname) pair.
# Chaining avoids in-place edits on a slice, which can trigger SettingWithCopyWarning.
inchikey__pert_iname = (
    samples[['InChIKey_standardized', 'pert_iname']]
    .drop_duplicates()
    .rename(columns={'InChIKey_standardized': 'InChIKey'})
)


- Read in drug annotations from drug file, indexed by pert_iname
- Create a single dataframe with pert_iname, InChIKey, and drug annotations, and save it as `data/compound_rephub_annot_full.csv.gz`


In [3]:
pert_iname__annotations = pd.read_csv("data/repurposing_drugs_20200324_cleaned.txt.gz", sep="\t")

inchikey__annotations = pd.merge(
    inchikey__pert_iname, 
    pert_iname__annotations, 
    on='pert_iname', 
    how='inner')

# Save the full (unfiltered) annotation file

inchikey__annotations.to_csv("data/compound_rephub_annot_full.csv.gz", index=False)
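
An inner merge silently drops any `pert_iname` that is missing from either table. The sketch below uses toy data (all values are hypothetical) to show two optional checks: `validate="one_to_one"` raises an error if `pert_iname` is duplicated on either side, and an outer merge with `indicator=True` lists the names that the inner merge would drop. Whether a one-to-one relationship holds for the real files is an assumption to verify, not something this notebook establishes.

```python
import pandas as pd

# Toy stand-ins for the InChIKey mapping and the drug annotations.
mapping = pd.DataFrame({
    "InChIKey": ["KEY-A", "KEY-A", "KEY-B", "KEY-C"],
    "pert_iname": ["drug-1", "drug-1-alias", "drug-2", "drug-3"],
})
annotations = pd.DataFrame({
    "pert_iname": ["drug-1", "drug-1-alias", "drug-2"],
    "moa": ["inhibitor", "inhibitor", "agonist"],
})

# Inner merge: keeps only names present in both tables.
# validate raises pandas.errors.MergeError if pert_iname repeats on either side.
merged = pd.merge(
    mapping, annotations, on="pert_iname", how="inner", validate="one_to_one"
)

# Outer merge with an indicator column to audit what the inner merge drops.
audit = pd.merge(mapping, annotations, on="pert_iname", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "pert_iname"].tolist()
```

Here `unmatched` contains `drug-3`, the only name without an annotation row.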


- Get InChIKey to JCP2022 mapping
- Filter the full annotation file to only include compounds that have a JCP2022 mapping, and save the result as `data/compound_rephub_annot.csv.gz`

In [4]:
commit="0682dd2d52e4d68208ab4af3a0bd114ca557cb0e"

url = f"https://raw.githubusercontent.com/jump-cellpainting/datasets/{commit}/metadata/compound.csv.gz"

compound = pd.read_csv(url)

inchikey__annotations_filtered = inchikey__annotations.copy()

inchikey__annotations_filtered.columns = "Metadata_" + inchikey__annotations_filtered.columns

inchikey__annotations_filtered = inchikey__annotations_filtered[inchikey__annotations_filtered["Metadata_InChIKey"].isin(compound["Metadata_InChIKey"])]

inchikey__annotations_filtered.to_csv("data/compound_rephub_annot.csv.gz", index=False)
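
The filter above keeps annotation rows whose InChIKey appears in the JCP2022 compound table but does not attach the JCP2022 identifiers themselves. The sketch below reproduces the filter on toy data and shows one possible way to add the identifiers with a left merge. The column name `Metadata_JCP2022` and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy annotation table (hypothetical values).
annotations = pd.DataFrame({
    "InChIKey": ["KEY-A", "KEY-B", "KEY-C"],
    "pert_iname": ["drug-1", "drug-2", "drug-3"],
})

# Toy compound table; `Metadata_JCP2022` is an assumed column name.
compound = pd.DataFrame({
    "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_000002"],
    "Metadata_InChIKey": ["KEY-A", "KEY-C"],
})

# Prefix every column with "Metadata_", then keep rows with a known InChIKey.
prefixed = annotations.add_prefix("Metadata_")
filtered = prefixed[prefixed["Metadata_InChIKey"].isin(compound["Metadata_InChIKey"])]

# Optional: attach the JCP2022 identifier to each retained row.
with_ids = filtered.merge(compound, on="Metadata_InChIKey", how="left")
```

`DataFrame.add_prefix` is equivalent to the string concatenation on `.columns` used in the cell above.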

What does the annotation file look like?


In [5]:
inchikey__annotations_filtered.iloc[0]

Metadata_InChIKey            HJORMJIFDVBMOB-UHFFFAOYSA-N
Metadata_pert_iname                     (R)-(-)-rolipram
Metadata_clinical_phase                          Phase 1
Metadata_moa                 phosphodiesterase inhibitor
Metadata_target            PDE4A|PDE4B|PDE4C|PDE4D|PDE5A
Metadata_disease_area                                NaN
Metadata_indication                                  NaN
Name: 1, dtype: object

Annotation file dimensions (unfiltered version)

In [16]:
inchikey__annotations.shape

(6776, 7)

Annotation file dimensions (filtered version)

In [17]:
inchikey__annotations_filtered.shape

(4846, 7)

How many unique `InChIKeys` in the annotation file?

In [18]:
inchikey__annotations_filtered["Metadata_InChIKey"].nunique()

4724

How many unique `pert_iname` in the annotation file?

In [9]:
inchikey__annotations_filtered["Metadata_pert_iname"].nunique()

4846

Mapping file dimensions

In [10]:
inchikey__pert_iname.shape

(6776, 2)

How many unique `pert_iname` in the mapping file?

In [11]:
inchikey__pert_iname["pert_iname"].nunique()

6776

How many unique `InChIKey` in the mapping file?

In [12]:
inchikey__pert_iname["InChIKey"].nunique()

6629

Report the `InChIKey`s that map to more than one `pert_iname`

In [25]:
# Count pert_inames per InChIKey and keep the keys that occur more than once
inchikey_dup_counts = inchikey__pert_iname["InChIKey"].value_counts()
inchikey_dup_counts = inchikey_dup_counts[inchikey_dup_counts > 1]

# Join the pert_inames that share each duplicated InChIKey with ":"
inchikey__pert_iname_l = inchikey__pert_iname[inchikey__pert_iname["InChIKey"].isin(inchikey_dup_counts.index)]
inchikey__pert_iname_l = inchikey__pert_iname_l.groupby("InChIKey")["pert_iname"].apply(lambda x: ":".join(x)).reset_index()
inchikey__pert_iname_l

Unnamed: 0,InChIKey,pert_iname
0,ACWBQPMHZXGDFX-UHFFFAOYSA-N,LCZ696:valsartan
1,AHUXISVXKYLQOD-UHFFFAOYSA-N,cefradine:cephradine
2,AQHHHDLHHXJYJD-UHFFFAOYSA-N,propranolol:propranolol-(R):propranolol-(S)
3,AUYYCJSJGJYCDS-UHFFFAOYSA-N,liothyronine:liothyronine-(isomer)
4,AYEOSGBMQHXVER-UHFFFAOYSA-N,"2,3-cis/exo-camphanediol:cis-exo-camphanediol-2,3"
...,...,...
128,ZCIXFSVENQDMCK-UHFFFAOYSA-J,antimony-potassium:antimonyl
129,ZKLPARSLTMPFCP-UHFFFAOYSA-N,cetirizine:levocetirizine
130,ZSTCZWJCLIRCOJ-UHFFFAOYSA-N,RU-42173:zilpaterol
131,ZXERDUOLZKYMJM-UHFFFAOYSA-N,INT-747:obeticholic-acid
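
The grouping logic above can be tried on a small example. The sketch below is a compact variant that uses `DataFrame.duplicated(..., keep=False)` instead of `value_counts` to select the shared keys. The cetirizine pair is taken from the table above; the `KEY-B` row is a hypothetical placeholder.

```python
import pandas as pd

mapping = pd.DataFrame({
    "InChIKey": [
        "ZKLPARSLTMPFCP-UHFFFAOYSA-N",
        "ZKLPARSLTMPFCP-UHFFFAOYSA-N",
        "KEY-B",  # hypothetical key with a single name
    ],
    "pert_iname": ["cetirizine", "levocetirizine", "aspirin"],
})

# keep=False marks every row of a duplicated key, not only the repeats.
dups = mapping[mapping.duplicated("InChIKey", keep=False)]

# One row per duplicated InChIKey, with its names joined by ":".
joined = (
    dups.groupby("InChIKey")["pert_iname"]
    .agg(lambda names: ":".join(names))
    .reset_index()
)
```

The result has a single row for the shared key, with `pert_iname` equal to `cetirizine:levocetirizine`.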


Do the same for the filtered annotation file and save the duplicates to `data/repurposing_duplicates.csv`.
Note that some of these annotations could be unreliable: two or more `pert_iname`s that share an `InChIKey` may in fact be distinct compounds, such as the stereoisomers `propranolol-(R)` and `propranolol-(S)` listed above.

In [26]:
inchikey__annotations_filtered_l = inchikey__annotations_filtered[inchikey__annotations_filtered["Metadata_InChIKey"].isin(inchikey_dup_counts.index)]
inchikey__annotations_filtered_l = inchikey__annotations_filtered_l.fillna('')
inchikey__annotations_filtered_l = inchikey__annotations_filtered_l.groupby("Metadata_InChIKey").agg(lambda x: ":".join(x)).reset_index()


inchikey__annotations_filtered_l.to_csv("data/repurposing_duplicates.csv", index=False)

inchikey__annotations_filtered_l

Unnamed: 0,Metadata_InChIKey,Metadata_pert_iname,Metadata_clinical_phase,Metadata_moa,Metadata_target,Metadata_disease_area,Metadata_indication
0,ACWBQPMHZXGDFX-UHFFFAOYSA-N,LCZ696:valsartan,Launched:Launched,angiotensin receptor antagonist:angiotensin re...,:AGTR1,cardiology:cardiology,angioedema|hypotension:hypertension|congestive...
1,AHUXISVXKYLQOD-UHFFFAOYSA-N,cefradine:cephradine,Launched:Launched,bacterial cell wall synthesis inhibitor:bacter...,CYP3A4:,infectious disease|otolaryngology:infectious d...,respiratory tract infections|otitis|skin infec...
2,AQHHHDLHHXJYJD-UHFFFAOYSA-N,propranolol:propranolol-(R):propranolol-(S),Launched:Preclinical:Preclinical,adrenergic receptor antagonist:adrenergic rece...,ADRB1|ADRB2:ADRB2|ADRB3:ADRB1|HTR1A|HTR5A|SLC10A1,cardiology|neurology/psychiatry::,hypertension|angina pectoris|migraine headache::
3,AUYYCJSJGJYCDS-UHFFFAOYSA-N,liothyronine:liothyronine-(isomer),Launched:Preclinical,thyroid hormone stimulant:,THRA|THRB:,endocrinology:,hypothyroidism|myxedema coma:
4,BFOWVMZUKTYNPH-UHFFFAOYSA-N,cyanocobalamin:hydroxocobalamin:methylcobalami...,Launched:Launched:Phase 3:Launched,methylmalonyl CoA mutase stimulant|vitamin B:v...,MUT:::,hematology|infectious disease|gastroenterology...,anemia|fish tapeworm infestation|celiac diseas...
...,...,...,...,...,...,...,...
105,YUTJCNNFTOIOGT-UHFFFAOYSA-N,anthralin:dithranol,Launched:Launched,DNA synthesis inhibitor:DNA synthesis inhibitor,:,dermatology:dermatology,psoriasis:psoriasis
106,YXSLJKQTIDHPOT-UHFFFAOYSA-N,atracurium:cisatracurium,Launched:Launched,acetylcholine receptor antagonist:acetylcholin...,:CHRNA2,critical care|neurology/psychiatry:neurology/p...,endotracheal intubation|muscle relaxant:muscle...
107,ZKLPARSLTMPFCP-UHFFFAOYSA-N,cetirizine:levocetirizine,Launched:Launched,histamine receptor antagonist:histamine recept...,HRH1:HRH1,allergy:allergy,allergic rhinitis:allergic rhinitis|urticaria
108,ZXERDUOLZKYMJM-UHFFFAOYSA-N,INT-747:obeticholic-acid,Phase 3:Launched,FXR agonist:FXR agonist,:NR1H4,:gastroenterology,:primary biliary cholangitis
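
Because names that share an InChIKey can carry different annotations, it may help to flag the keys whose annotations disagree. The sketch below does this for `Metadata_clinical_phase` using two pairs taken from the table above. It is one possible check, not part of the original workflow, and the same pattern could be applied to any other annotation column.

```python
import pandas as pd

# Two duplicated InChIKeys, with clinical phases as reported in the table above.
dups = pd.DataFrame({
    "Metadata_InChIKey": [
        "ZKLPARSLTMPFCP-UHFFFAOYSA-N",
        "ZKLPARSLTMPFCP-UHFFFAOYSA-N",
        "ZXERDUOLZKYMJM-UHFFFAOYSA-N",
        "ZXERDUOLZKYMJM-UHFFFAOYSA-N",
    ],
    "Metadata_pert_iname": [
        "cetirizine", "levocetirizine", "INT-747", "obeticholic-acid",
    ],
    "Metadata_clinical_phase": ["Launched", "Launched", "Phase 3", "Launched"],
})

# Number of distinct clinical phases recorded for each InChIKey.
n_phases = dups.groupby("Metadata_InChIKey")["Metadata_clinical_phase"].nunique()

# Keys whose names disagree on clinical phase.
conflicting = n_phases[n_phases > 1].index.tolist()
```

Only the `INT-747` / `obeticholic-acid` key is flagged, since its two names are listed as `Phase 3` and `Launched`.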
