<a href="https://colab.research.google.com/github/kattens/ChemBridge/blob/main/Dataset_Pubchem_targets_blast_proteins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Our Goal**

We want a dataset structure that:

- Stores each triplet clearly: the PubChem drug ID, the target PDB chain, and the matched malaria protein
- Allows you to:
    - Run sequence alignments between target and malaria proteins
    - Analyze which drugs may bind to malaria proteins
    - Potentially train or evaluate a model later

✅ Recommended Dataset Format (JSONL or DataFrame)

🧬 Each row represents a single target–malaria protein match



In [None]:
{
  "pubchem_id": "4735",
  "target_chain_id": "1RKW_A",
  "target_sequence": "MVLSPADKTN...",
  "malaria_match_id": "3D7A_A",
  "malaria_sequence": "MVLSPADKTV...",
  "percent_identity": 85.2,
  "alignment_length": 120,
  "evalue": 1e-50,
  "bitscore": 240.0
}

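
With records of this shape, the goal of finding drugs that may bind malaria proteins could be approached by keeping only the strong hits and grouping them by drug. The sketch below is one way to do that. The function name `strong_hits_by_drug`, the cutoffs (`min_identity=40.0`, `max_evalue=1e-5`), and the second toy record (`XXXX_B`) are illustrative assumptions, not values from this notebook.

```python
from collections import defaultdict

def strong_hits_by_drug(rows, min_identity=40.0, max_evalue=1e-5):
    """Group malaria matches by PubChem ID, keeping hits that pass both cutoffs.

    The default cutoffs are placeholders (assumptions), not recommended values.
    """
    grouped = defaultdict(list)
    for row in rows:
        identity = row.get("percent_identity")
        evalue = row.get("evalue")
        if identity is None or evalue is None:
            continue  # skip incomplete records
        if identity >= min_identity and evalue <= max_evalue:
            grouped[row["pubchem_id"]].append(row["malaria_match_id"])
    return dict(grouped)

# Toy records in the format shown above; the second one is made up
example_rows = [
    {"pubchem_id": "4735", "malaria_match_id": "3D7A_A",
     "percent_identity": 85.2, "evalue": 1e-50},
    {"pubchem_id": "4735", "malaria_match_id": "XXXX_B",
     "percent_identity": 22.0, "evalue": 0.5},
]
print(strong_hits_by_drug(example_rows))  # {'4735': ['3D7A_A']}
```

Suitable cutoffs depend on the question being asked, so treat both thresholds as parameters to tune rather than fixed choices.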


Collect these records in a list of rows, then save them as either:

- A `.jsonl` file (one JSON object per line)
- A `.csv` file or pandas DataFrame
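
As a minimal sketch of the JSONL option, the snippet below writes a list of row dictionaries to a file with one JSON object per line and reads them back. It uses only the standard library. The helper names `write_jsonl` and `read_jsonl` and the file name `matches.jsonl` are assumptions made for illustration.

```python
import json
import os
import tempfile

def write_jsonl(rows, path):
    """Write each row as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries, ignoring blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

rows = [{
    "pubchem_id": "4735",
    "target_chain_id": "1RKW_A",
    "malaria_match_id": "3D7A_A",
    "percent_identity": 85.2,
    "alignment_length": 120,
    "evalue": 1e-50,
    "bitscore": 240.0,
}]

path = os.path.join(tempfile.mkdtemp(), "matches.jsonl")  # hypothetical file name
write_jsonl(rows, path)
loaded = read_jsonl(path)
print(loaded == rows)  # True
```

With pandas, `DataFrame.to_json(path, orient="records", lines=True)` and `pd.read_json(path, lines=True)` cover the same round trip. Note that pandas may infer numeric types for ID-like strings such as `"4735"` when reading, so check the resulting dtypes.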

# Example code (generated, saved here for reference)

In [None]:
import os
import json
import pandas as pd

def load_fasta_sequence(fasta_path):
    """Reads a FASTA file and returns the amino acid sequence as a string."""
    with open(fasta_path, "r") as f:
        lines = f.readlines()
    sequence = ''.join([line.strip() for line in lines if not line.startswith(">")])
    return sequence if sequence else None

def build_dataset(base_dir):
    """
    Builds a dataset of PubChem drug IDs, target PDB chains, and matched malaria proteins.

    Args:
        base_dir (str): Path to the top-level folder with PubChem ID folders.

    Returns:
        pd.DataFrame: Compiled dataset.
    """
    rows = []

    for pubchem_id in os.listdir(base_dir):
        folder_path = os.path.join(base_dir, pubchem_id)
        if not os.path.isdir(folder_path) or pubchem_id.startswith('.'):
            continue

        for file in os.listdir(folder_path):
            if not file.endswith(".json") or file.startswith('.'):
                continue

            json_path = os.path.join(folder_path, file)
            # Swap only the extension; str.replace would also alter ".json" elsewhere in the path
            fasta_path = os.path.splitext(json_path)[0] + ".fasta"

            if not os.path.exists(fasta_path):
                continue

            target_sequence = load_fasta_sequence(fasta_path)
            if not target_sequence:
                continue

            with open(json_path, "r") as f:
                try:
                    results = json.load(f)
                except ValueError:  # malformed JSON (json.JSONDecodeError) or undecodable bytes
                    continue

            # Skip anything that is not a list of hits: an error payload (dict)
            # or a single-entry "no result found" message
            if not isinstance(results, list) or (
                len(results) == 1 and "message" in results[0]
            ):
                continue

            target_chain_id = os.path.splitext(file)[0]  # filename without .json

            for hit in results:
                rows.append({
                    "pubchem_id": pubchem_id,
                    "target_chain_id": target_chain_id,
                    "target_sequence": target_sequence,
                    "malaria_match_id": hit.get("subject_id"),
                    "malaria_sequence": None,  # placeholder for future step
                    "percent_identity": hit.get("percent_identity"),
                    "alignment_length": hit.get("alignment_length"),
                    "evalue": hit.get("evalue"),
                    "bitscore": hit.get("bitscore")
                })

    return pd.DataFrame(rows)


# Example usage:
# dataset = build_dataset("/your/full/path/to/PubChem_PDB_Results")
# dataset.to_csv("drug_target_malaria_dataset.csv", index=False)
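
`build_dataset` leaves `malaria_sequence` as `None`. One possible way to fill it in later is sketched below. It assumes the malaria proteins are available as a multi-record FASTA file whose header IDs (the first token after `>`) match `malaria_match_id`. The file layout, that ID convention, and the helper names `parse_fasta_records` and `fill_malaria_sequences` are assumptions; the notebook does not specify them.

```python
import io

def parse_fasta_records(handle):
    """Parse a multi-record FASTA stream into {record_id: sequence}."""
    sequences = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences

def fill_malaria_sequences(rows, sequences):
    """Return copies of rows with malaria_sequence looked up by malaria_match_id."""
    return [
        {**row, "malaria_sequence": sequences.get(row.get("malaria_match_id"))}
        for row in rows
    ]

# Toy FASTA content standing in for a real malaria protein file
fasta_text = ">3D7A_A example header\nMVLSPADK\nTVKAAW\n>OTHER_B\nMKT\n"
sequences = parse_fasta_records(io.StringIO(fasta_text))

rows = [{"pubchem_id": "4735", "malaria_match_id": "3D7A_A", "malaria_sequence": None}]
filled = fill_malaria_sequences(rows, sequences)
print(filled[0]["malaria_sequence"])  # MVLSPADKTVKAAW
```

On the DataFrame returned by `build_dataset`, the same lookup can be written as `dataset["malaria_sequence"] = dataset["malaria_match_id"].map(sequences)`. IDs missing from the FASTA file stay empty, which makes unmatched hits easy to spot.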
