# Create Drug Repurposing Hub annotation file

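This notebook builds a compound annotation file from the Drug Repurposing Hub samples and drugs tables, then restricts it to compounds that also appear in the JUMP Cell Painting (JCP2022) compound list.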

Steps

- Read in SMILES and `pert_iname` from the samples file
- Standardize SMILES and get the corresponding InChIKey
- Create a mapping of InChIKey to `pert_iname`
- Read in drug annotations from the drug file, keyed by `pert_iname`
- Create a single dataframe with `pert_iname`, InChIKey, and drug annotations, and save it as `compound_annot_drug_full.csv.gz`
- Get InChIKey to JCP2022 mapping
- Filter `compound_annot_drug_full.csv.gz` to only include compounds that have a JCP2022 mapping, and save the result as `compound_annot_drug.csv.gz`

# Prepare data


In [4]:
%%bash
wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_samples_20200324.txt -O data/repurposing_samples_20200324.txt

# Drop the "!"-prefixed header lines and compress the result.
grep -v '^!' data/repurposing_samples_20200324.txt | gzip > data/repurposing_samples_20200324_cleaned.txt.gz

rm data/repurposing_samples_20200324.txt
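
The same cleanup can be sketched in Python for environments without `grep`. This is a minimal illustration, not part of the notebook: `strip_comment_lines` is a hypothetical helper, the input text is made up, and an in-memory buffer stands in for the real files.

```python
import gzip
import io


def strip_comment_lines(lines, prefix="!"):
    """Yield only the lines that do not start with the comment prefix."""
    for line in lines:
        if not line.startswith(prefix):
            yield line


# Toy input mimicking a file with a "!"-prefixed header block (made-up content).
raw = "!version 1\n!source example\npert_iname\tsmiles\naspirin\tCC(=O)Oc1ccccc1C(=O)O\n"

# Write the cleaned lines to an in-memory gzip stream instead of a real file.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as out:
    out.writelines(strip_comment_lines(io.StringIO(raw)))

# Read it back to confirm that only the column header and the data row remain.
cleaned = gzip.decompress(buffer.getvalue()).decode("utf-8")
print(cleaned)
```

For the real file, the `io.StringIO` and `io.BytesIO` objects would be replaced by the downloaded text file and the `.txt.gz` output path.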


- Read in SMILES and `pert_iname` from the samples file
- Standardize SMILES and get the corresponding InChIKey

In [5]:
%%bash
python \
    StandardizeMolecule.py \
    --num_cpu 7 \
    --limit_rows 20 \
    --augment \
    --input data/repurposing_samples_20200324_cleaned.txt.gz \
    --output data/repurposing_samples_20200324_standardized.csv \
    run

Traceback (most recent call last):
  File "/Users/shsingh/work/projects/2019_07_11_JUMP-CP/workspace/software/compound-annotator/StandardizeMolecule.py", line 260, in <module>
    fire.Fire(StandardizeMolecule)
  File "/Users/shsingh/mambaforge/envs/compound-annotator/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/Users/shsingh/mambaforge/envs/compound-annotator/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/Users/shsingh/mambaforge/envs/compound-annotator/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/Users/shsingh/work/projects/2019_07_11_JUMP-CP/workspace/software/compound-annotator/StandardizeMolecule.py", line 221, in run
    self._load_input()
  File "/Users/shsingh/work/projects/2019_07_11_JUMP-CP/workspace/software/compound-

CalledProcessError: Command 'b'python \\\n    StandardizeMolecule.py \\\n    --num_cpu 7 \\\n    --limit_rows 20 \\\n    --augment \\\n    --input data/repurposing_samples_20200324_cleaned.txt.gz \\\n    --output data/repurposing_samples_20200324_standardized.csv \\\n    run\n'' returned non-zero exit status 1.
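
Once the standardization step does produce output, a cheap format check on the `InChIKey_standardized` column can catch empty or truncated keys before they reach the merge. The sketch below is not part of `StandardizeMolecule.py`. `is_valid_inchikey` is a hypothetical helper that tests only the 27-character layout of a standard InChIKey (14 letters, a hyphen, 10 letters, a hyphen, 1 letter), not whether the key corresponds to the right molecule.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_inchikey(value):
    """Return True if value is a string with the 27-character InChIKey layout."""
    return isinstance(value, str) and INCHIKEY_PATTERN.fullmatch(value) is not None


# Aspirin's InChIKey passes; a truncated key and a missing value do not.
print(is_valid_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(is_valid_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA"))
print(is_valid_inchikey(None))
```

Applied to a dataframe column, the helper could be passed to `Series.map` to flag the rows that need a closer look.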

# Create annotation file

In [None]:
import pandas as pd

- Create a mapping of InChIKey to `pert_iname`

In [None]:
# Keep one row per (InChIKey, pert_iname) pair. Chaining avoids calling
# inplace methods on a slice, which can trigger SettingWithCopyWarning.
inchikey__pert_iname = (
    pd.read_csv("data/repurposing_samples_20200324_standardized.csv")[
        ["InChIKey_standardized", "pert_iname"]
    ]
    .drop_duplicates()
    .rename(columns={"InChIKey_standardized": "InChIKey"})
)
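
Deduplicating pairs does not guarantee one name per InChIKey, since salts or synonyms may standardize to the same structure. The sketch below uses made-up rows and placeholder keys to show one way to detect such cases and to build a dictionary that keeps every name. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for the standardized samples table (made-up rows).
samples = pd.DataFrame(
    {
        "InChIKey_standardized": ["KEY-A", "KEY-A", "KEY-B", "KEY-B"],
        "pert_iname": ["drug-1", "drug-1", "drug-2", "drug-2-salt"],
    }
)

pairs = (
    samples[["InChIKey_standardized", "pert_iname"]]
    .drop_duplicates()
    .rename(columns={"InChIKey_standardized": "InChIKey"})
)

# Count distinct names per InChIKey; a count above 1 means a plain
# key-to-name dict would silently drop names.
names_per_key = pairs.groupby("InChIKey")["pert_iname"].nunique()
ambiguous_keys = names_per_key[names_per_key > 1].index.tolist()

# Collect every name per key so that nothing is lost.
inchikey_to_names = pairs.groupby("InChIKey")["pert_iname"].apply(list).to_dict()

print(ambiguous_keys)
print(inchikey_to_names)
```

Keeping the table in long form, as the notebook does, has the same effect: the later merge on `pert_iname` produces one annotated row per name.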


In [None]:
%%bash
wget -q https://s3.amazonaws.com/data.clue.io/repurposing/downloads/repurposing_drugs_20200324.txt -O data/repurposing_drugs_20200324.txt

grep -v '^!' data/repurposing_drugs_20200324.txt | gzip > data/repurposing_drugs_20200324_cleaned.txt.gz

rm data/repurposing_drugs_20200324.txt

- Read in drug annotations from the drug file, keyed by `pert_iname`
- Create a single dataframe with `pert_iname`, InChIKey, and drug annotations, and save it as `compound_annot_drug_full.csv.gz`


In [None]:
pert_iname__annotations = pd.read_csv("data/repurposing_drugs_20200324_cleaned.txt.gz", sep="\t")

inchikey__annotations = pd.merge(
    inchikey__pert_iname, 
    pert_iname__annotations, 
    on='pert_iname', 
    how='inner')

# Save the merged table as `compound_annot_drug_full.csv.gz`.

inchikey__annotations.to_csv("data/compound_annot_drug_full.csv.gz", index=False)
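
An inner join drops rows without a matching `pert_iname` on either side and does not report them. One way to audit what is lost, sketched here on made-up rows, is an outer join with pandas' `indicator=True` option. The table contents and the `moa` column are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up rows).
inchikey_names = pd.DataFrame(
    {"InChIKey": ["KEY-A", "KEY-B", "KEY-C"], "pert_iname": ["drug-1", "drug-2", "drug-3"]}
)
annotations = pd.DataFrame(
    {"pert_iname": ["drug-1", "drug-2", "drug-4"], "moa": ["moa-x", "moa-y", "moa-z"]}
)

# An outer join with indicator=True labels each row by the table it came from.
audit = pd.merge(inchikey_names, annotations, on="pert_iname", how="outer", indicator=True)

# Rows found in both tables are exactly what the inner join would keep.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
samples_only = audit.loc[audit["_merge"] == "left_only", "pert_iname"].tolist()
annotations_only = audit.loc[audit["_merge"] == "right_only", "pert_iname"].tolist()

print(len(matched), samples_only, annotations_only)
```

If either unmatched list is unexpectedly long, the names in the two source files probably differ in spelling or case.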


- Get InChIKey to JCP2022 mapping
- Filter `compound_annot_drug_full.csv.gz` to only include compounds that have a JCP2022 mapping, and save the result as `compound_annot_drug.csv.gz`

In [None]:
# Pin the JUMP Cell Painting metadata to a specific commit for reproducibility.
commit = "0682dd2d52e4d68208ab4af3a0bd114ca557cb0e"

url = f"https://raw.githubusercontent.com/jump-cellpainting/datasets/{commit}/metadata/compound.csv.gz"

compound = pd.read_csv(url)

inchikey__annotations_filtered = inchikey__annotations.copy()

# Prefix every column with "Metadata_" so the key column matches the compound table.
inchikey__annotations_filtered.columns = "Metadata_" + inchikey__annotations_filtered.columns

# Keep only compounds whose InChIKey appears in the JUMP compound table.
inchikey__annotations_filtered = inchikey__annotations_filtered[
    inchikey__annotations_filtered["Metadata_InChIKey"].isin(compound["Metadata_InChIKey"])
]

inchikey__annotations_filtered.to_csv("data/compound_annot_drug.csv.gz", index=False)

inchikey__annotations_filtered.iloc[0]
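
The `isin` filter keeps the right rows but does not carry the JCP2022 identifier into the output. If that identifier is wanted as a column, an inner merge on the InChIKey does both in one step. The sketch below uses made-up rows. It assumes the compound table exposes the identifier in a column named `Metadata_JCP2022`, which should be checked against the actual file.

```python
import pandas as pd

# Toy stand-ins. Metadata_JCP2022 is an assumed column name and all values are made up.
annotations = pd.DataFrame(
    {
        "Metadata_InChIKey": ["KEY-A", "KEY-B", "KEY-C"],
        "Metadata_pert_iname": ["drug-1", "drug-2", "drug-3"],
    }
)
compound = pd.DataFrame(
    {
        "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_000002"],
        "Metadata_InChIKey": ["KEY-A", "KEY-C"],
    }
)

# The inner join keeps only annotated compounds present in the compound table
# and attaches their identifier.
with_ids = annotations.merge(
    compound[["Metadata_JCP2022", "Metadata_InChIKey"]].drop_duplicates(),
    on="Metadata_InChIKey",
    how="inner",
)

print(with_ids)
```

Unlike `isin`, a merge produces one output row per matching identifier, so an InChIKey shared by several compound entries would appear several times in the result.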