## Extraction of Vitamin D Targets

This notebook demonstrates how to use **CPIExtract** to retrieve the protein targets of Vitamin $D_3$ and save them as gene names (HGNC symbols).
 
Cholecalciferol (Vitamin $D_3$) is a fat-soluble vitamin that plays a crucial role in calcium and phosphorus metabolism, supporting bone health and immune function. It is synthesized in the skin upon exposure to sunlight and can also be obtained from dietary sources or supplements.

### Required Packages
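Ensure that the following Python packages are installed before running this notebook: `pandas`, `numpy` and `cpiextract`. The notebook also imports the helper module `BioNetTools`, which must be importable from the notebook's environment, and the standard-library modules `pickle`, `zipfile`, `os` and `shutil`, which ship with Python and need no installation.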

In [None]:
import pandas as pd
import numpy as np
import pickle
import zipfile
import os
import shutil

from cpiextract import Comp2Prot
import BioNetTools as tools

### Unzipping Database Files
We use **CPIExtract** to retrieve the targets of Cholecalciferol. 

**CPIExtract** incorporates and unifies drug targets from different sources:

- **BindingDB**
- **STITCH**
- **ChEMBL**
- **CTD** (Comparative Toxicogenomics Database)
- **DTC** (Drug Target Commons)
- **DrugBank**
- **DrugCentral**

These databases have been downloaded and collected in the file `examples/VitaminD/sup_data/cpie_databases/Databases.zip`. Consult the [CPIExtract](https://github.com/menicgiulia/CPIExtract) repository for further information and documentation.

First, we extract the databases:

In [None]:
tools.unzip_file("../../sup_data/cpie_databases/Databases.zip", "../../sup_data/cpie_databases/Databases")
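
The implementation of `tools.unzip_file` is not shown in this notebook. As a minimal sketch, assuming it simply extracts every member of the archive into the target directory, the same step could be done with the standard-library `zipfile` module. The helper name `unzip_archive` and the demo file names are hypothetical and used only for illustration.

```python
import os
import tempfile
import zipfile


def unzip_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical stand-in for tools.unzip_file; the real helper may differ.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Self-contained demonstration on a throwaway archive with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("example.csv", "col_a,col_b\n1,2\n")

    extracted = unzip_archive(demo_zip, os.path.join(tmp, "Databases"))
    demo_exists = os.path.exists(os.path.join(tmp, "Databases", "example.csv"))
```

With the real archive, the call would presumably take the same two paths passed to `tools.unzip_file` above.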

### Loading Databases
Next, we load the extracted databases into a dictionary for further processing.

In [None]:
# Define database directory path
data_path = "../../sup_data/cpie_databases/Databases"

# Load databases into pandas DataFrames

# BindingDB (downloaded on 03/30/2023)
file_path = os.path.join(data_path, 'BindingDB.csv')
BDB_data = pd.read_csv(file_path, sep=',', usecols=['CID', 'Ligand SMILES', 'Ligand InChI', 'BindingDB MonomerID',
                                                    'Ligand InChI Key', 'BindingDB Ligand Name',
                                                    'Target Name Assigned by Curator or DataSource',
                                                    'Target Source Organism According to Curator or DataSource',
                                                    'Ki (nM)', 'IC50 (nM)', 'Kd (nM)', 'EC50 (nM)', 'pH', 'Temp (C)',
                                                    'Curation/DataSource',
                                                    'UniProt (SwissProt) Entry Name of Target Chain',
                                                    'UniProt (SwissProt) Primary ID of Target Chain'],
                         on_bad_lines='skip')

# STITCH (downloaded on 02/22/2023)
file_path = os.path.join(data_path, 'STITCH.tsv')
sttch_data = pd.read_csv(file_path, sep='\t')

# ChEMBL (downloaded on 02/01/2024)
file_path = os.path.join(data_path, 'ChEMBL.csv')
chembl_data = pd.read_csv(file_path, sep=',')

# CTD
file_path = os.path.join(data_path, 'CTD.csv')
CTD_data = pd.read_csv(file_path, sep=',')

# DTC (downloaded on 02/24/2023)
file_path = os.path.join(data_path, 'DTC.csv')
DTC_data = pd.read_csv(file_path, sep=',', usecols=['CID', 'compound_id', 'standard_inchi_key', 'target_id',
                                                    'gene_names', 'wildtype_or_mutant', 'mutation_info',
                                                    'standard_type', 'standard_relation', 'standard_value',
                                                    'standard_units', 'activity_comment', 'pubmed_id', 'doc_type'])

# DrugBank (downloaded on 03/02/2022)
file_path = os.path.join(data_path, 'DB.csv')
DB_data = pd.read_csv(file_path, sep=',')

# DrugCentral (downloaded on 02/25/2024)
file_path = os.path.join(data_path, 'DrugCentral.csv')
DC_data = pd.read_csv(file_path, sep=',')

# Store all databases in a dictionary
dbs = {
    'chembl': chembl_data,
    'bdb': BDB_data,
    'stitch': sttch_data,
    'ctd': CTD_data,
    'dtc': DTC_data,
    'db': DB_data,
    'dc': DC_data
}
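
Before passing the dictionary on, it can help to confirm that every table actually loaded. The following is a minimal sketch of such a check; the helper `summarize_databases` is hypothetical, and the toy DataFrames stand in for the real `dbs` entries so that the block runs on its own.

```python
import pandas as pd


def summarize_databases(dbs):
    """Return one row per database with its row and column counts."""
    records = [
        {"database": name, "rows": len(df), "columns": df.shape[1]}
        for name, df in dbs.items()
    ]
    return pd.DataFrame(records, columns=["database", "rows", "columns"])


# Toy stand-ins for the real tables (placeholder values only).
toy_dbs = {
    "ctd": pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    "dc": pd.DataFrame({"a": []}),
}

summary = summarize_databases(toy_dbs)

# Names of databases that loaded without any rows.
empty = summary.loc[summary["rows"] == 0, "database"].tolist()
```

Calling `summarize_databases(dbs)` on the dictionary built above would flag any source that came back empty, for example because of a wrong separator or a skipped file.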

### Retrieving Targets for Cholecalciferol
A compound can be specified by its International Chemical Identifier (InChI) or by its [PubChem](https://pubchem.ncbi.nlm.nih.gov) CID. Here we identify Cholecalciferol by its CID, `5280795`.

In [None]:
# Cholecalciferol (PubChem CID: 5280795)
comp_id = 5280795

# Initialize Comp2Prot
C2P = Comp2Prot('local', dbs=dbs)

# Search for interactions
comp_dat, status = C2P.comp_interactions(input_id=comp_id)

### Extracting Target Information
CPIExtract retrieves all Cholecalciferol targets from the different databases, along with relevant interaction data.

In [None]:
comp_dat

### Saving Extracted Targets
Finally, we extract and save the HGNC symbols of Cholecalciferol targets.

In [None]:
# Extract HGNC symbols
vd_targets = set(comp_dat.hgnc_symbol)

# Save extracted targets
with open("../../../data/input/drug_targets/vitd_targets_cpie.pkl", 'wb') as file:
    pickle.dump(vd_targets, file)

# Clean up extracted database files
shutil.rmtree("../../sup_data/cpie_databases/Databases") 
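
If the `hgnc_symbol` column contains missing values, `set(comp_dat.hgnc_symbol)` keeps them in the saved set. Whether that happens for this compound is not shown here, so the following is only a sketch of a defensive variant: it drops missing symbols, writes the set, and reloads it to confirm the round trip. The DataFrame `toy_comp_dat` and its gene symbols are placeholders, not CPIExtract output.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for comp_dat; the symbols are placeholders, not real results.
toy_comp_dat = pd.DataFrame({"hgnc_symbol": ["GENE_A", "GENE_B", "GENE_A", None]})

# Drop missing symbols before building the set of unique targets.
toy_targets = set(toy_comp_dat["hgnc_symbol"].dropna())

# Write the set to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy_targets.pkl")
    with open(out_path, "wb") as file:
        pickle.dump(toy_targets, file)
    with open(out_path, "rb") as file:
        reloaded = pickle.load(file)
```

The same `dropna()` call could be applied to `comp_dat["hgnc_symbol"]` before saving, if missing symbols turn out to be present.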