<a href="https://colab.research.google.com/github/kattens/ChemBridge/blob/main/Dataset_Pubchem_targets_blast_proteins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Our Goal**

We want a dataset structure that:

- Stores each triplet (PubChem compound, target protein chain, matching malaria protein) clearly
- Allows us to:
    - Run sequence alignments between target and malaria proteins
    - Analyze which drugs may bind to malaria proteins
    - Potentially train or evaluate a model later

✅ Recommended Dataset Format (JSONL or DataFrame)

🧬 Each row represents a single target–malaria protein match



In [None]:
{
  "pubchem_id": "4735",
  "target_chain_id": "1RKW_A",
  "target_sequence": "MVLSPADKTN...",
  "malaria_match_id": "3D7A_A",
  "malaria_sequence": "MVLSPADKTV...",
  "percent_identity": 85.2,
  "alignment_length": 120,
  "evalue": 1e-50,
  "bitscore": 240.0
}

{'pubchem_id': '4735',
 'target_chain_id': '1RKW_A',
 'target_sequence': 'MVLSPADKTN...',
 'malaria_match_id': '3D7A_A',
 'malaria_sequence': 'MVLSPADKTV...',
 'percent_identity': 85.2,
 'alignment_length': 120,
 'evalue': 1e-50,
 'bitscore': 240.0}


Collect these records in a list of rows, then save them as either:

- A .jsonl file (one JSON object per line)
- A .csv file / Pandas DataFrame
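The JSONL option can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual saving code: the helper names `write_jsonl` and `read_jsonl` and the variable `example_rows` are hypothetical, and the single record simply repeats the example row shown above.

```python
import json


def write_jsonl(rows, path):
    """Write a list of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative record copied from the example row above (hypothetical variable name).
example_rows = [
    {
        "pubchem_id": "4735",
        "target_chain_id": "1RKW_A",
        "target_sequence": "MVLSPADKTN...",
        "malaria_match_id": "3D7A_A",
        "malaria_sequence": "MVLSPADKTV...",
        "percent_identity": 85.2,
        "alignment_length": 120,
        "evalue": 1e-50,
        "bitscore": 240.0,
    }
]
```

Calling `write_jsonl(example_rows, "rows.jsonl")` would produce a file with one line per hit. If a DataFrame is preferred, the same list of dicts can be passed to `pd.DataFrame(...)` and written with `to_csv`, which is the route the rest of this notebook takes.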

# **Start the Code here**

In [12]:
json_path = '/content/drive/MyDrive/Drug Repurposing Project/PubChem_PDB_Results'

In [13]:
#Create the Ultimate Dataset:
import pandas as pd
import json
import os

#Make the main dataset
columns = ['pubchem_id', 'target_chain_id', 'target_sequence', 'malaria_match_id', 'malaria_sequence', 'percent_identity', 'alignment_length', 'evalue', 'bitscore']
df = pd.DataFrame(columns=columns)

✅ What the next cell does:

- Each folder = a PubChem ID
- Each file = a target name (e.g., a protein chain)
- Each file contains a list of BLAST hit dictionaries
- The loop produces one row per hit, with all relevant info

In [14]:
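# Traverse each PubChem ID folder and collect one row per BLAST hit.
# Rows are gathered in a list and turned into a DataFrame once at the end;
# calling pd.concat inside the loop is slow and triggers a FutureWarning
# when concatenating onto an empty DataFrame.
rows = []
for pubchem_id in os.listdir(json_path):
    folder_path = os.path.join(json_path, pubchem_id)
    if not os.path.isdir(folder_path):
        continue

    for file in os.listdir(folder_path):
        if not file.endswith('.json'):
            continue
        # File name without the extension is the target chain ID (e.g. 4I24_A)
        target_chain_id = os.path.splitext(file)[0]
        file_path = os.path.join(folder_path, file)

        try:
            with open(file_path, 'r') as f:
                data = json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            print(f"Error reading {file_path}: {e}")
            continue

        # Each file is expected to hold a list of hit dictionaries
        if not isinstance(data, list):
            print(f"Unexpected format in {file_path}, skipping")
            continue

        # Skip if message says "No result found."
        if len(data) > 0 and "message" in data[0]:
            if data[0]["message"] == "No result found.":
                continue

        for hit in data:
            # target_sequence and malaria_sequence are not set from the hit
            # dictionaries, so those columns stay empty (NaN)
            rows.append({
                'pubchem_id': pubchem_id,
                'target_chain_id': target_chain_id,
                'malaria_match_id': hit.get("subject_id"),
                'percent_identity': hit.get("percent_identity"),
                'alignment_length': hit.get("alignment_length"),
                'evalue': hit.get("evalue"),
                'bitscore': hit.get("bitscore")
            })

# Build the DataFrame in one step, keeping the column order defined above
df = pd.DataFrame(rows, columns=columns)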

# Preview or save
print(df.head())
# df.to_csv("malaria_alignment_dataset.csv", index=False)



  pubchem_id target_chain_id target_sequence malaria_match_id  \
0   11511120          4I24_A             NaN           5DYK_A   
1   11511120          4I24_A             NaN           1V0O_A   
2   11511120          4I24_A             NaN           1OB3_A   
3   11511120          4I24_B             NaN           5DYK_A   
4   11511120          4I24_B             NaN           1V0O_A   

  malaria_sequence  percent_identity alignment_length        evalue  bitscore  
0              NaN         27.403846              208  1.415680e-13     159.0  
1              NaN         24.074074              216  2.054470e-10     131.0  
2              NaN         24.074074              216  2.112270e-10     131.0  
3              NaN         28.571429              210  7.233640e-13     154.0  
4              NaN         22.685185              216  1.192090e-11     141.0  


In [15]:
df.to_csv("final_dataset.csv", index=False)
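
As a possible next step toward the analysis goal stated at the top (finding drugs whose targets resemble malaria proteins), the saved rows could be filtered by alignment quality and grouped by compound. The sketch below is only an illustration: the function names `strong_hits` and `malaria_matches_by_compound` are hypothetical, and the cutoffs `max_evalue=1e-10` and `min_identity=25.0` are arbitrary assumptions, not values chosen in this project. The sample rows reuse the five hits printed in the preview above.

```python
from collections import defaultdict


def strong_hits(rows, max_evalue=1e-10, min_identity=25.0):
    """Keep hits at or below the e-value cutoff and at or above the identity cutoff.

    Both cutoffs are placeholder assumptions and should be tuned for real use.
    """
    return [
        r for r in rows
        if r["evalue"] <= max_evalue and r["percent_identity"] >= min_identity
    ]


def malaria_matches_by_compound(rows):
    """Map each pubchem_id to the sorted, de-duplicated malaria chains it hits."""
    grouped = defaultdict(set)
    for r in rows:
        grouped[r["pubchem_id"]].add(r["malaria_match_id"])
    return {pid: sorted(ids) for pid, ids in grouped.items()}


# The five hits shown in the preview output (only the fields used here).
preview_rows = [
    {"pubchem_id": "11511120", "malaria_match_id": "5DYK_A",
     "percent_identity": 27.403846, "evalue": 1.415680e-13},
    {"pubchem_id": "11511120", "malaria_match_id": "1V0O_A",
     "percent_identity": 24.074074, "evalue": 2.054470e-10},
    {"pubchem_id": "11511120", "malaria_match_id": "1OB3_A",
     "percent_identity": 24.074074, "evalue": 2.112270e-10},
    {"pubchem_id": "11511120", "malaria_match_id": "5DYK_A",
     "percent_identity": 28.571429, "evalue": 7.233640e-13},
    {"pubchem_id": "11511120", "malaria_match_id": "1V0O_A",
     "percent_identity": 22.685185, "evalue": 1.192090e-11},
]

filtered = strong_hits(preview_rows)
summary = malaria_matches_by_compound(filtered)
```

To run this on the full dataset, the rows could be obtained from the DataFrame with `df.to_dict("records")`, assuming `evalue` and `percent_identity` contain no missing values.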