<a href="https://colab.research.google.com/github/kattens/ChemBridge/blob/main/Interaction_and_Pathways_Data_Retrival.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🟢 1. Configuration and Paths**

In [1]:
#installing biopython
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


In [2]:
#import libraries
import pandas as pd
import numpy as np
import os
import requests
import json
from Bio.PDB import PDBParser, PPBuilder, is_aa
from Bio.Blast import NCBIWWW, NCBIXML
import concurrent.futures

In [3]:
# --- 1. Configuration and Paths ---

# Define the project paths; create the output folders if they do not exist
BASE_PATH = '/content/drive/MyDrive/Drug Repurposing Project/Interactions_Results'
SAVE_PATH = '/content/drive/MyDrive/Drug Repurposing Project/PubChem_PDB_Results'
PDB_SAVE_PATH = os.path.join(SAVE_PATH, "pdb_matches")

os.makedirs(SAVE_PATH, exist_ok=True)
os.makedirs(PDB_SAVE_PATH, exist_ok=True)


In [None]:
# Combine all per-compound interaction and pathway CSV files into one table.
# Each file is named <PubChem ID>.csv, so the file name supplies the pubchem_id.
# Later steps use the target PDB IDs (`pdbid`) from this table.
Interaction_folder_path = BASE_PATH

# 🔄 Combine all CSV files into one DataFrame
# Collect the frames first and concatenate once; growing a DataFrame inside the loop copies it every time.
frames = []

for filename in sorted(os.listdir(Interaction_folder_path)):
    if filename.endswith(".csv"):
        file_path = os.path.join(Interaction_folder_path, filename)
        df = pd.read_csv(file_path)
        df['source_file'] = filename  # Optional: keep track of original source
        frames.append(df)

combined_df = pd.concat(frames, ignore_index=True)

# ➕ Add pubchem_id by removing '.csv' and converting to integer
combined_df['pubchem_id'] = combined_df['source_file'].str.replace('.csv', '', regex=False).astype(int)

# 💾 Save the combined DataFrame to a new CSV
# os.path.join discards earlier components when a later one is absolute, so use the absolute path directly
output_path = '/content/drive/MyDrive/Drug Repurposing Project/Combined_Interaction_PDB_Targets.csv'
combined_df.to_csv(output_path, index=False)

print(f"✅ Combined CSV saved to: {output_path}")

In [5]:
df = pd.read_csv('/content/drive/MyDrive/Drug Repurposing Project/Combined_Interaction_PDB_Targets.csv')
df.columns

Index(['resolution', 'pdbid', 'title', 'expmethod', 'lignme', 'glytoucan',
       'cids', 'protacxns', 'geneids', 'pmid', 'dois', 'pmcids', 'pclids',
       'citations', 'source_file', 'pubchem_id'],
      dtype='object')

### 🧬 Key Column Descriptions – `Combined_Interaction_PDB_Targets.csv`

| Column Name   | Description                                                                  |
|---------------|------------------------------------------------------------------------------|
| `pdbid`       | PDB ID of the protein-ligand structure (e.g., `4I24`)                        |
| `title`       | Title/description of the structure (often includes protein and ligand info)  |
| `expmethod`   | Method used to determine the structure (e.g., X-RAY DIFFRACTION)             |
| `resolution`  | Resolution of the structure in Ångströms; lower values = better quality      |
| `lignme`      | Ligand(s) present in the structure (e.g., `CLQ` = chloroquine)               |
| `cids`        | PubChem Compound IDs (CIDs) for ligands in the structure                     |
| `protacxns`   | UniProt accession ID(s) or protein IDs involved in the interaction           |
| `geneids`     | NCBI Gene IDs corresponding to the proteins                                  |
| `pmid`        | PubMed ID of the publication describing the structure                        |
| `dois`        | DOI (Digital Object Identifier) for the structure's publication              |
| `pmcids`      | PubMed Central ID, if available                                              |
| `pclids`      | PubChem Literature IDs                                                       |
| `citations`   | Full text reference or author list                                           |
| `source_file` | Name of the original CSV file (e.g., `444810.csv`) that this row came from   |
| `pubchem_id`  | PubChem ID taken from `source_file` by dropping `.csv` and casting to int    |
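
As a small illustration of how these columns can be used, the sketch below keeps the best-resolved structure for each `pubchem_id`. It works on plain dictionaries (the shape `csv.DictReader` yields) so it runs without the project files. The helper name `best_structure_per_compound` and the sample rows with placeholder PDB IDs are hypothetical, and skipping rows that lack a numeric `resolution` is an assumption about how such entries should be treated.

```python
import math


def best_structure_per_compound(rows):
    """Return {pubchem_id: row}, keeping the row with the smallest resolution value."""
    best = {}
    for row in rows:
        try:
            resolution = float(row["resolution"])
        except (KeyError, TypeError, ValueError):
            continue  # assumption: ignore rows without a numeric resolution
        if math.isnan(resolution):
            continue  # pandas writes missing values as NaN
        key = row["pubchem_id"]
        if key not in best or resolution < best[key][0]:
            best[key] = (resolution, row)
    return {key: row for key, (_, row) in best.items()}


# Hypothetical rows with placeholder PDB IDs, for illustration only
sample_rows = [
    {"pubchem_id": 1, "pdbid": "AAA1", "resolution": "2.5"},
    {"pubchem_id": 1, "pdbid": "AAA2", "resolution": "1.8"},
    {"pubchem_id": 2, "pdbid": "BBB1", "resolution": ""},
]

selected = best_structure_per_compound(sample_rows)
print({cid: row["pdbid"] for cid, row in selected.items()})  # {1: 'AAA2'}
```

With the real table, the same rows could come from `df.to_dict("records")`.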


✅ **Retrieve PDB files for each PubChem ID**  
- For each PubChem ID in the combined interactions-and-pathways CSV, query the RCSB database for a corresponding PDB structure and download the PDB file if one exists. The BLAST step takes its input sequences from these files.

✅ **Extract protein sequences from PDB chains**  
- To prepare for BLAST, extract the individual protein chains from each PDB file. Each chain is a distinct protein sequence, so it is submitted to BLAST as its own query.

✅ **Run BLASTp against PDB, restricted to *Plasmodium falciparum* (taxonomy ID: 5833)**  
- Run BLASTp for each extracted chain against the PDB database, restricted to *Plasmodium falciparum* (taxID 5833). This finds parasite proteins whose sequences resemble our targets; a strong match suggests the drug may also bind a related protein in the organism.

✅ **Save top 3 BLAST hits to JSON**  
- Store the top 3 matches per query, including key metadata such as alignment score, E-value, and percent identity. Include the matched organism’s PDB ID if available; otherwise, record 'NAN'.

✅ **Download PDB structures (if available) or AlphaFold predictions**  
- For each matched entry in the results, attempt to download the corresponding PDB file. If not available (i.e., PDB ID is 'NAN'), fall back to downloading AlphaFold predictions for the matched protein to analyze structural similarities.
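
The per-hit record described in the steps above can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: `parse_pdb_hit_id` and `summarize_hit` are hypothetical helper names, `1ABC` is a placeholder ID, and a `SimpleNamespace` stands in for a Biopython HSP object, using the attribute names `identities`, `align_length`, `score`, and `expect` that Biopython's BLAST records expose. Percent identity is computed as identical positions divided by alignment length.

```python
from types import SimpleNamespace


def parse_pdb_hit_id(hit_id):
    """Split a BLAST hit ID such as 'pdb|1ABC|A' into (pdb_id, chain)."""
    parts = hit_id.split("|")
    if len(parts) >= 3 and parts[0] == "pdb":
        return parts[1], parts[2]
    return "NAN", "NAN"  # no PDB ID available for this hit


def summarize_hit(hit_id, hsp):
    """Build the JSON-ready record for one hit from its best HSP."""
    pdb_id, chain = parse_pdb_hit_id(hit_id)
    percent_identity = 100.0 * hsp.identities / hsp.align_length
    return {
        "pdb_id": pdb_id,
        "chain": chain,
        "score": hsp.score,
        "e_value": hsp.expect,
        "percent_identity": round(percent_identity, 2),
    }


# Stand-in HSP with made-up numbers, for illustration only
fake_hsp = SimpleNamespace(identities=45, align_length=60, score=210.0, expect=1e-20)
record = summarize_hit("pdb|1ABC|A", fake_hsp)
print(record["pdb_id"], record["percent_identity"])  # 1ABC 75.0
```

A record whose `pdb_id` is `'NAN'` marks a hit that needs the AlphaFold fallback.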

In [6]:
# --- 2. Sequence Extraction ---
def extract_protein_chains(pdb_file_path):
    """Extracts protein chain sequences from a PDB file."""
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("structure", pdb_file_path)
    ppb = PPBuilder()
    chain_sequences = {}
    for model in structure:
        for chain in model:
            residues = [res for res in chain if is_aa(res)]
            if residues:
                peptides = ppb.build_peptides(chain)
                if peptides:
                    sequence = ''.join(str(peptide.get_sequence()) for peptide in peptides)
                    if sequence:
                        chain_sequences[chain.id] = sequence
    return chain_sequences

# --- 3. BLAST Functions ---
def blast_sequence(sequence, tax_id="5833"):
    """Performs a BLAST search against the PDB database."""
    result_handle = NCBIWWW.qblast(
        program="blastp",
        database="pdb",
        sequence=sequence,
        entrez_query=f"txid{tax_id}[ORGN]",
        hitlist_size=3
    )
    return NCBIXML.read(result_handle)

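def safe_blast_with_timeout(sequence, timeout=600):
    """Performs BLAST with a timeout to prevent indefinite waiting."""
    # Avoid `with ThreadPoolExecutor()`: leaving the block calls shutdown(wait=True),
    # which blocks until the BLAST call returns and defeats the timeout.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(blast_sequence, sequence)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        print("⏰ Timeout reached! Skipping this chain.")
        return None
    finally:
        # Return immediately; the worker thread finishes in the background.
        executor.shutdown(wait=False)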

# --- 4. UniProt PDB Mapping ---
def get_pdb_for_accession(accession):
    """Retrieves PDB IDs associated with a UniProt accession."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
    if response.ok:
        data = response.json()
        pdbs = []
        for xref in data.get("uniProtKBCrossReferences", []):
            if xref["database"] == "PDB":
                pdbs.append(xref["id"])
        return pdbs
    return []

# --- 5. PDB and AlphaFold Download ---
def download_file(url, dest_path, description):
    """Downloads a file from a URL."""
    if os.path.exists(dest_path):
        print(f"📦 Skipping already downloaded {description}")
        return dest_path
    r = requests.get(url, timeout=60)  # avoid hanging forever on a stalled connection
    if r.ok:
        with open(dest_path, 'w') as f:
            f.write(r.text)
        print(f"📥 Downloaded {description}")
        return dest_path
    print(f"⚠️ Failed to download {description}")
    return None

def download_pdb(pdb_id, dest_folder):
    """Downloads a PDB file."""
    dest_path = os.path.join(dest_folder, f"{pdb_id}.pdb")
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    return download_file(url, dest_path, f"PDB {pdb_id}")

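def download_alphafold(uniprot_id, dest_folder, version="v4"):
    """Downloads an AlphaFold model."""
    # AlphaFold DB file names carry a fragment number and model version,
    # e.g. AF-<accession>-F1-model_v4.pdb. The default version is an assumption;
    # check the entry page on alphafold.ebi.ac.uk if the download fails.
    file_name = f"AF-{uniprot_id}-F1-model_{version}.pdb"
    dest_path = os.path.join(dest_folder, file_name)
    url = f"https://alphafold.ebi.ac.uk/files/{file_name}"
    return download_file(url, dest_path, f"AlphaFold model for {uniprot_id}")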

# --- 6. Main Processing Loop ---
for folder_name in os.listdir(BASE_PATH):
    folder_path = os.path.join(BASE_PATH, folder_name)
    if os.path.isdir(folder_path):
        for file in os.listdir(folder_path):
            if file.endswith('.pdb'):
                pdb_file_path = os.path.join(folder_path, file)
                pdb_id = file.replace('.pdb', '')
                pubchem_id = folder_name

                print(f"\n📄 Processing {pdb_id} from PubChem {pubchem_id}")
                protein_chains = extract_protein_chains(pdb_file_path)

                for chain_id, sequence in protein_chains.items():
                    json_name = f"{pubchem_id}_{pdb_id}_{chain_id}.json"
                    json_file_path = os.path.join(SAVE_PATH, json_name)

                    if os.path.exists(json_file_path):
                        print(f"⏭️ Skipping chain {chain_id} (already processed)")
                        continue

                    print(f"\n🧬 Chain {chain_id} | Length: {len(sequence)}")
                    result = safe_blast_with_timeout(sequence)

                    if result and result.alignments:
                        top_hits = []
                        for alignment in result.alignments[:3]:
                            hsp = alignment.hsps[0]
                            pdb_code, chain_code = None, None
                            parts = alignment.hit_id.split('|')
                            if alignment.hit_id.startswith('pdb|') and len(parts) >= 3:
                                pdb_code, chain_code = parts[1], parts[2]

                            hit = {
                                "hit_def": alignment.hit_def,
                                "e_value": hsp.expect,
                                "score": hsp.score,
                                "query": hsp.query[:60],
                                "subject": hsp.sbjct[:60],
                                "pdb_code": pdb_code,
                                "chain": chain_code
                            }

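                            if pdb_code:
                                # Hits from the "pdb" database already carry a PDB ID
                                # (not a UniProt accession), so download it directly.
                                hit['mapped_pdbs'] = [pdb_code]
                                download_pdb(pdb_code, PDB_SAVE_PATH)
                            elif len(parts) >= 2:
                                # Otherwise treat the second field as a UniProt accession.
                                acc = parts[1]
                                mapped_pdbs = get_pdb_for_accession(acc)
                                hit['mapped_pdbs'] = mapped_pdbs
                                for pdb_id_hit in mapped_pdbs:
                                    download_pdb(pdb_id_hit, PDB_SAVE_PATH)
                                if not mapped_pdbs:
                                    download_alphafold(acc, PDB_SAVE_PATH)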

                            top_hits.append(hit)

                        with open(json_file_path, 'w') as f:
                            json.dump(top_hits, f, indent=2)
                        print(f"💾 Saved results to {json_file_path}")

                    else:
                        print(f"❌ No valid BLAST result for {pdb_id} Chain {chain_id}")
                        no_hits_file = os.path.join(SAVE_PATH, "no_pdb_hits.txt")
                        with open(no_hits_file, 'a') as f:
                            f.write(f"{pubchem_id}_{pdb_id}_{chain_id}\n")
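
Once the loop has run, the per-chain JSON files can be read back into one dictionary for inspection. The sketch below is one possible way to do that, assuming the `{pubchem_id}_{pdb_id}_{chain_id}.json` naming used above and that none of the three parts contains an underscore. `collect_results` is a hypothetical helper, and the demo writes a placeholder file to a temporary folder rather than touching `SAVE_PATH`.

```python
import json
import os
import tempfile


def collect_results(results_dir):
    """Map (pubchem_id, pdb_id, chain_id) to the list of hits saved for that chain."""
    results = {}
    for name in sorted(os.listdir(results_dir)):
        if not name.endswith(".json"):
            continue  # skips no_pdb_hits.txt and the pdb_matches folder
        parts = name[: -len(".json")].split("_")
        if len(parts) != 3:
            continue  # assumption: ignore files that do not follow the naming scheme
        with open(os.path.join(results_dir, name)) as f:
            results[tuple(parts)] = json.load(f)
    return results


# Demo on a temporary folder with one placeholder result file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "123_1ABC_A.json"), "w") as f:
        json.dump([{"pdb_code": "1ABC", "e_value": 1e-20}], f)
    with open(os.path.join(tmp, "no_pdb_hits.txt"), "w") as f:
        f.write("123_1ABC_B\n")
    demo = collect_results(tmp)

print(list(demo))  # [('123', '1ABC', 'A')]
```

In the notebook, `collect_results(SAVE_PATH)` would return every chain processed so far.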