# Databases download

This notebook generates the terminal commands required to download most of the databases. \
After setting up the correct data paths, execute the code cells to obtain the bash code required to download and process the files.

### Disclaimer
- The ChEMBL database requires the generation of the `.csv` file from the [ChEMBL website](https://www.ebi.ac.uk/chembl/web_components/explore/activities/STATE_ID:GnsHH1n7OPU1JAjGvKLgvQ==). More information is provided in the respective section below.
- The DrugBank database requires an academic license to download; please refer to the [DrugBank website](https://go.drugbank.com/releases/latest) for further instructions.

In [20]:
from IPython.display import Markdown as md
from datetime import datetime
import os
import pandas as pd

Data paths

In [21]:
# We recommend downloading the databases in the data/ folder
data = "data" 
# replace root path with appropriate location (e.g., /home/username/)
md(f"""```bash
   cd root/path/C-PIE
   mkdir -p {data}
   ```""")

```bash
   cd root/path/C-PIE
   mkdir -p data
   ```

## BindingDB

BDB usually updates its database every month. The link includes the year and the month. \
If the link returns a 404 error, try manually updating the link with the previous month.
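
When the current month's file is not yet available, the fallback tag can be computed rather than typed by hand. The sketch below is only an illustration: `bindingdb_tags` is a hypothetical helper (not part of this notebook) that returns the most recent `YYYYMM` tags, newest first, and handles the January to December rollover.

```python
from datetime import date

def bindingdb_tags(today, n=2):
    """Return the n most recent 'YYYYMM' tags, newest first (hypothetical helper)."""
    tags = []
    year, month = today.year, today.month
    for _ in range(n):
        tags.append(f"{year}{month:02d}")
        # Step back one month, rolling over to December of the previous year
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return tags

print(bindingdb_tags(date(2024, 1, 15)))  # ['202401', '202312']
```

Each tag can then be substituted for `{y}{m}` in the download link until one of them resolves.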

In [22]:
now = datetime.now()
m = now.strftime('%m')
y = now.strftime('%Y')
# wget: -O sets the output file; lowercase -o would only name the log file
md(f"""```bash
   wget -O {data}/BindingDB.tsv.zip https://www.bindingdb.org/bind/downloads/BindingDB_All_{y}{m}_tsv.zip
   unzip {data}/BindingDB.tsv.zip -d {data} && rm {data}/BindingDB.tsv.zip
   ```""")

```bash
   wget -O data/BindingDB.tsv.zip https://www.bindingdb.org/bind/downloads/BindingDB_All_202405_tsv.zip
   unzip data/BindingDB.tsv.zip -d data && rm data/BindingDB.tsv.zip
   ```

## ChEMBL

The ChEMBL database requires the generation of the `.csv` file from the [ChEMBL website](https://www.ebi.ac.uk/chembl/web_components/explore/activities/). On the page, select the *Homo sapiens* target organism in the filtering section on the right. Then click the CSV download button to start the file generation. Once the generation is complete, press the download button. The images below show the step-by-step procedure to download the file. The commands in this section assume the downloaded archive is saved as `ChEMBL.zip` in the data folder.

![Filtering section button for Homo sapiens](images/ChEMBL-filtering.png)

![CSV button to generate annotations file](images/ChEMBL-csv.png)

![Button to download annotations file](images/ChEMBL-download.png)


The downloaded `.zip` file contains multiple CSV files that need to be merged.

Unzip file

In [17]:
md(f"""```bash
   mkdir -p {data}/chembl
   unzip {data}/ChEMBL.zip -d {data}/chembl 
   rm {data}/ChEMBL.zip
   ```""")

```bash
   mkdir -p data/chembl
   unzip data/ChEMBL.zip -d data/chembl 
   rm data/ChEMBL.zip
   ```

Merge files with the following code

In [18]:
def merge_csv_files(folder_path, output_file):
    """Concatenate every .csv file in folder_path into a single CSV file."""
    # Sort the file names so the row order of the merged file is reproducible
    csv_files = sorted(file for file in os.listdir(folder_path) if file.endswith('.csv'))

    if len(csv_files) == 0:
        print("No CSV files found in the folder.")
        return

    # Read all parts first and concatenate once, which avoids re-copying
    # the growing DataFrame on every iteration
    frames = [pd.read_csv(os.path.join(folder_path, file)) for file in csv_files]
    merged_df = pd.concat(frames, ignore_index=True)

    merged_df.to_csv(output_file, index=False)

folder_path = f'{data}/chembl'
output_file = f'{data}/ChEMBL.csv'
merge_csv_files(folder_path, output_file)
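
Because `pd.concat` takes the union of columns and fills the gaps with missing values, parts with different headers would merge without any error. A minimal sketch of a header check that could be run before merging is shown below. The helpers `csv_headers` and `headers_match` are hypothetical, the toy files are invented, and a comma delimiter is assumed.

```python
import csv
import tempfile
from pathlib import Path

def csv_headers(folder):
    """Map each .csv file name in `folder` to its header row (hypothetical helper)."""
    headers = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="") as handle:
            # Assumes comma-separated files; an empty file yields an empty header
            headers[path.name] = next(csv.reader(handle), [])
    return headers

def headers_match(folder):
    """Return True when all .csv files in `folder` share the same header."""
    distinct = {tuple(header) for header in csv_headers(folder).values()}
    return len(distinct) <= 1

# Toy demonstration on temporary files (invented content)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part1.csv").write_text("id,value\n1,a\n")
    Path(tmp, "part2.csv").write_text("id,value\n2,b\n")
    same = headers_match(tmp)
    Path(tmp, "part3.csv").write_text("id,other\n3,c\n")
    different = headers_match(tmp)

print(same, different)  # True False
```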


Remove downloaded data

In [27]:
md(f"""```bash
   rm -r {data}/chembl/
   ```""")

```bash
   rm -r data/chembl/
   ```

## CTD

In [12]:
# wget: -O sets the output file; lowercase -o would only name the log file
md(f"""```bash
   wget -O {data}/CTD.csv.gz https://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz
   gzip -d {data}/CTD.csv.gz
   ```""")

```bash
   wget -O data/CTD.csv.gz https://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz
   gzip -d data/CTD.csv.gz
   ```

## DrugBank

The DrugBank database requires an academic license to download; please refer to the [DrugBank website](https://go.drugbank.com/releases/latest) for further instructions.
Once the academic license has been issued to your account, download the complete database from the [DrugBank download page](https://go.drugbank.com/releases/latest), or use the commands below. \
Then unzip the file to obtain the `.xml` file, which can be converted to `.csv` using the code provided below.

![DrugBank complete database download](images/Drugbank.png)

Download the zip file

In [18]:
email = 'YOURUSERNAME'
password = 'YOURPASSWORD'

# wget expects long options with two dashes (--user, --password);
# -O sets the output file, whereas lowercase -o only names the log file
md(f"""```bash
    wget --user={email} --password={password} -O {data}/drugbank_all_full_database.xml.zip https://go.drugbank.com/releases/latest/downloads/all-full-database
   ```""")

```bash
    wget --user=YOURUSERNAME --password=YOURPASSWORD -O data/drugbank_all_full_database.xml.zip https://go.drugbank.com/releases/latest/downloads/all-full-database
   ```

Unzip the file

In [7]:
md(f"""```bash
    unzip {data}/drugbank_all_full_database.xml.zip -d {data}
    rm {data}/drugbank_all_full_database.xml.zip
   ```""")

```bash
    unzip data/drugbank_all_full_database.xml.zip -d data
    rm data/drugbank_all_full_database.xml.zip
   ```

In [None]:
import pandas as pd
import collections
import xml.etree.ElementTree as ET

def collapse_list_values(row):
    for key, value in row.items():
        if isinstance(value, list):
            row[key] = '|'.join(value)
    return row

def xml2csv(file_path, output_file):
    tree = ET.parse(file_path)
    root = tree.getroot()

    ns = '{http://www.drugbank.ca}'
    inchikey_template = "{ns}calculated-properties/{ns}property[{ns}kind='InChIKey']/{ns}value"
    inchi_template = "{ns}calculated-properties/{ns}property[{ns}kind='InChI']/{ns}value"

    rows = list()
    for i, drug in enumerate(root):
        row = collections.OrderedDict()
        assert drug.tag == ns + 'drug'
        row['type'] = drug.get('type')
        row['drugbank-id'] = drug.findtext(ns + "drugbank-id[@primary='true']")
        row['name'] = drug.findtext(ns + "name")
        row['description'] = drug.findtext(ns + "description")
        row['InChIKey'] = drug.findtext(inchikey_template.format(ns = ns))
        rows.append(row)

    rows = list(map(collapse_list_values, rows))

    columns = ['drugbank-id', 'name', 'type', 'InChIKey', 'description']
    drugbank_df = pd.DataFrame.from_dict(rows)[columns]

    protein_rows = list()
    for i, drug in enumerate(root):
        drugbank_id = drug.findtext(ns + "drugbank-id[@primary='true']")
        for category in ['target', 'enzyme', 'carrier', 'transporter']:
            proteins = drug.findall('{ns}{cat}s/{ns}{cat}'.format(ns=ns, cat=category))
            for protein in proteins:
                row = {'drugbank-id': drugbank_id, 'protein_type': category}
                row['protein_name'] = protein.findtext('{}name'.format(ns))
                row['organism'] = protein.findtext('{}organism'.format(ns))
                actions = protein.findall('{ns}actions/{ns}action'.format(ns=ns))
                row['actions'] = '|'.join(action.text for action in actions)
                uniprot_ids = [polypep.text for polypep in protein.findall(
                    "{ns}polypeptide/{ns}external-identifiers/{ns}external-identifier[{ns}resource='UniProtKB']/{ns}identifier".format(ns=ns))]            
                if len(uniprot_ids) == 1:
                    row['uniprot_id'] = uniprot_ids[0]
                hgnc_ids = [polypep.text for polypep in protein.findall(
                    "{ns}polypeptide/{ns}external-identifiers/{ns}external-identifier[{ns}resource='HUGO Gene Nomenclature Committee (HGNC)']/{ns}identifier".format(ns=ns))]            
                if len(hgnc_ids) == 1:
                    row['HGNC'] = hgnc_ids[0]
                protein_rows.append(row)

    protein_df = pd.DataFrame.from_dict(protein_rows)

    drugbank = pd.merge(drugbank_df, protein_df, on='drugbank-id', how='left')

    drugbank.to_csv(output_file, sep=',', index=False)

file_path = f'{data}/full database.xml'
output_file = f'{data}/DB.csv'
xml2csv(file_path, output_file)
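
The DrugBank XML declares a default namespace, which is why every tag in the paths above carries the `ns` prefix: an unqualified tag name simply finds nothing. The sketch below illustrates this on a hand-written snippet. The ids, names, and actions are invented toy values, not real DrugBank content, and the layout only mimics the elements that `xml2csv` reads.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.drugbank.ca}"

# Toy document mimicking the layout read by xml2csv; all values are made up
TOY_XML = """<drugbank xmlns="http://www.drugbank.ca">
  <drug type="small molecule">
    <drugbank-id primary="true">DB_TOY_1</drugbank-id>
    <drugbank-id>ALT_TOY_1</drugbank-id>
    <name>toy-drug</name>
    <targets>
      <target>
        <name>toy-protein</name>
        <actions><action>inhibitor</action><action>binder</action></actions>
      </target>
    </targets>
  </drug>
</drugbank>"""

root = ET.fromstring(TOY_XML)
drug = root[0]

# The attribute predicate selects the primary id among several drugbank-id tags
primary_id = drug.findtext(NS + "drugbank-id[@primary='true']")
# Without the namespace prefix the lookup returns None
unqualified = drug.findtext("name")
qualified = drug.findtext(NS + "name")
# Multiple actions are collapsed into one pipe-separated string, as in xml2csv
actions = "|".join(
    action.text
    for action in drug.findall(f"{NS}targets/{NS}target/{NS}actions/{NS}action")
)

print(primary_id, unqualified, qualified, actions)
# DB_TOY_1 None toy-drug inhibitor|binder
```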


Remove downloaded data

In [19]:
# The file name contains a space, so the path must be quoted in bash
md(f"""```bash
   rm "{data}/full database.xml"
   ```""")

```bash
   rm "data/full database.xml"
   ```

## DrugCentral

In [15]:
# wget: -O sets the output file; lowercase -o would only name the log file
md(f"""```bash
   wget -O {data}/DC.tsv.gz https://unmtid-dbs.net/download/DrugCentral/2021_09_01/drug.target.interaction.tsv.gz
   gzip -d {data}/DC.tsv.gz
   wget -O {data}/DC_comps.tsv https://unmtid-dbs.net/download/DrugCentral/2021_09_01/structures.smiles.tsv
   ```""")

```bash
   wget -O data/DC.tsv.gz https://unmtid-dbs.net/download/DrugCentral/2021_09_01/drug.target.interaction.tsv.gz
   gzip -d data/DC.tsv.gz
   wget -O data/DC_comps.tsv https://unmtid-dbs.net/download/DrugCentral/2021_09_01/structures.smiles.tsv
   ```

Merge the two files to add SMILES and InChI information for compounds

In [20]:
DC = pd.read_csv(f'{data}/DC.tsv', sep='\t')
DC_comps = pd.read_csv(f'{data}/DC_comps.tsv', sep='\t')

DC_comps = DC_comps[['ID', 'SMILES', 'InChi', 'InChiKey']].rename(columns={'ID':'STRUCT_ID'})

DC = pd.merge(DC, DC_comps, on='STRUCT_ID', how='left')

DC.to_csv(f'{data}/DrugCentral.csv', index=False)
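
A left merge keeps every interaction row, so compounds without a structure entry end up with empty SMILES and InChI fields rather than being dropped. The sketch below shows one way to list such compounds using the `indicator` option of `pd.merge`. The two small frames are invented stand-ins for `DC.tsv` and `DC_comps.tsv`, and the `TARGET_NAME` column is only a placeholder.

```python
import pandas as pd

# Toy stand-ins for the two DrugCentral tables; all values are invented
interactions = pd.DataFrame({"STRUCT_ID": [1, 2, 3], "TARGET_NAME": ["A", "B", "C"]})
structures = pd.DataFrame({"STRUCT_ID": [1, 2], "SMILES": ["CCO", "CC(=O)O"]})

# indicator=True adds a _merge column telling where each row was found
merged = pd.merge(interactions, structures, on="STRUCT_ID", how="left", indicator=True)

# Rows marked left_only have no matching structure entry
missing = merged.loc[merged["_merge"] == "left_only", "STRUCT_ID"].tolist()

print(len(merged), missing)  # 3 [3]
```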

Remove downloaded data

In [26]:
md(f"""```bash
   rm {data}/DC.tsv && rm {data}/DC_comps.tsv 
   ```""")

```bash
   rm data/DC.tsv && rm data/DC_comps.tsv 
   ```

## DTC

In [32]:
# wget: -O sets the output file; lowercase -o would only name the log file
md(f"""```bash
   wget -O {data}/DTC.csv https://drugtargetcommons.fimm.fi/static/Excell_files/DTC_data.csv
   ```""")

```bash
   wget -O data/DTC.csv https://drugtargetcommons.fimm.fi/static/Excell_files/DTC_data.csv
   ```

## STITCH

In [34]:
# wget: -O sets the output file; the name must match the one passed to gzip
md(f"""```bash
   wget -O {data}/STITCH.tsv.gz http://stitch.embl.de/download/protein_chemical.links.detailed.v5.0/9606.protein_chemical.links.detailed.v5.0.tsv.gz
   gzip -d {data}/STITCH.tsv.gz
   ```""")

```bash
   wget -O data/STITCH.tsv.gz http://stitch.embl.de/download/protein_chemical.links.detailed.v5.0/9606.protein_chemical.links.detailed.v5.0.tsv.gz
   gzip -d data/STITCH.tsv.gz
   ```