
## Recap:

Starting from an initial file named `cdot_targets`, which contains PubChem IDs, we prepared a CSV file. For each PubChem ID, we queried PubChem for interacting targets and recorded them in the "Target Names" column. We also added an "Accession IDs" column holding the UniProt accession for each target, and a "Target Gene Name" column with gene names to improve the accuracy of BLAST searches.

## Goal:

Our objective is to take each UniProt accession ID in the CSV file, fetch its protein sequence, and run a BLAST search restricted to *Plasmodium* species (the parasites that cause malaria) to find similar proteins.

## Tools and Libraries:

- **NCBIWWW**: Facilitates online BLAST searches.
- **NCBIXML**: Parses the XML results from BLAST into a manageable Python format.
- **SeqIO**: Utilized for reading and writing sequences across various bioinformatics file formats.

## Implementation:

The base code already fetches a sequence and runs BLAST for a single accession ID. The next step is to save the BLAST results in one JSON file per accession ID, so the results can be reloaded and analyzed without repeating the search.
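A minimal sketch of the per-accession JSON step, assuming the hits are already available as a list of plain dictionaries. The function name `save_hits_as_json`, the `blast_results` directory, and the JSON layout are illustrative choices, not part of the existing code.

```python
import json
from pathlib import Path


def save_hits_as_json(accession_id, hits, out_dir="blast_results"):
    """Write the hits for one accession ID to <out_dir>/<accession_id>.json.

    `hits` is assumed to be a list of JSON-serializable dicts.
    Returns the path of the file that was written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    file_path = out_path / f"{accession_id}.json"
    with file_path.open("w", encoding="utf-8") as handle:
        json.dump({"accession_id": accession_id, "hits": hits}, handle, indent=2)
    return file_path
```

Writing one file per accession keeps a failed or interrupted run from losing results that were already saved.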

## List of Plasmodium Species on UniProt:

- **Plasmodium falciparum** - Taxon ID: 5833
- **Plasmodium malariae** - Taxon ID: 5858
- **Plasmodium vivax** - Taxon ID: 5855
- **Plasmodium ovale** - Taxon ID: 36330
- **Plasmodium berghei** - Taxon ID: 5821
- **Plasmodium reichenowi** - Taxon ID: 5854
- **Plasmodium gonderi** - Taxon ID: 77519
- **Plasmodium chabaudi** - Taxon ID: 5825
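The taxon IDs above can be combined into a single Entrez organism filter, so that one BLAST request covers all listed species instead of one species at a time. The sketch below only builds the query string; `PLASMODIUM_TAXA` and `build_entrez_query` are hypothetical names introduced here, and the `txid…[ORGN]` form matches the one already used in the notebook's BLAST function.

```python
# Taxon IDs copied from the list above.
PLASMODIUM_TAXA = {
    "Plasmodium falciparum": 5833,
    "Plasmodium malariae": 5858,
    "Plasmodium vivax": 5855,
    "Plasmodium ovale": 36330,
    "Plasmodium berghei": 5821,
    "Plasmodium reichenowi": 5854,
    "Plasmodium gonderi": 77519,
    "Plasmodium chabaudi": 5825,
}


def build_entrez_query(taxon_ids):
    """Join taxon IDs into an Entrez filter such as 'txid5833[ORGN] OR txid5858[ORGN]'."""
    return " OR ".join(f"txid{taxon_id}[ORGN]" for taxon_id in taxon_ids)


query = build_entrez_query(PLASMODIUM_TAXA.values())
print(query)
```

The resulting string could be passed as the `entrez_query` argument of `NCBIWWW.qblast` in place of the single-taxon filter.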

## Next Steps:

- Iterate through all UniProt IDs in the CSV file, perform a BLAST search for each, and store the results in a JSON file for subsequent analysis.
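The iteration described above could look roughly like the sketch below. It assumes a callable `blast_fn` that takes an accession ID and returns a list of hits; the notebook's current function prints its results rather than returning them, so it would need a small change before it fits. `run_batch` and `pause_seconds` are hypothetical names, and the pause is a conservative guess intended to avoid overloading the NCBI servers.

```python
import time


def run_batch(accession_ids, blast_fn, pause_seconds=10.0):
    """Run blast_fn once per unique accession ID.

    Returns (results, failures): results maps accession ID -> hits,
    failures maps accession ID -> error message.
    """
    results, failures = {}, {}
    for index, accession_id in enumerate(accession_ids):
        if accession_id in results or accession_id in failures:
            continue  # the same accession can appear for several PubChem IDs
        if index > 0 and pause_seconds:
            time.sleep(pause_seconds)  # space out consecutive requests
        try:
            results[accession_id] = blast_fn(accession_id)
        except Exception as exc:  # keep going if a single search fails
            failures[accession_id] = str(exc)
    return results, failures
```

Collecting failures separately makes it possible to retry only the accession IDs that did not complete.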


In [1]:
# Install Biopython (provides the Bio package)
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


In [2]:
import csv
import pandas as pd
from Bio.Blast import NCBIWWW,NCBIXML
from Bio import SeqIO
import requests

In [3]:
path = '/content/drive/MyDrive/target_results.csv'
df = pd.read_csv(path)

In [4]:
df.head()

Unnamed: 0,PubChem ID,Target Names,Accession IDs,Target Gene Name
0,5330175,"['Tyrosineprotein', 'NTMT1', 'FH', 'Chain', 'N...","['A0A0K2VLS4', 'S4R3J7', 'P07954', 'P0C023', '...","['NTMT1', 'FH', 'NSD2', 'KDR', 'GPX4', 'COMT',..."
1,5311340,"['ID4', 'ALDH1A1', 'EZH2', 'MYC', 'GLA', 'APOB...","['P47928', 'Q5SYQ8', 'Q921E6', 'A0A8A5GQJ2', '...","['ID4', 'ALDH1A1', 'EZH2', 'MYC', 'GLA', 'APOB..."
2,11511120,"['AcylCoA', 'Epidermal', 'Mitogenactivated', '...","['B2BXS0', 'Q9Z0P7', 'L8GZV5', 'P05067', 'Q9ZN...","['NADH', 'MAP', 'CYP2C9', 'NSD2', 'ERBB4', 'GP..."
3,221354,"['CYP2D6', 'lethal', 'ALDH1A1', 'RGS12', 'ALOX...","['P10635', 'A1Z198', 'Q5SYQ8', 'E9Q652', 'I3L1...","['CYP2D6', 'ALDH1A1', 'RGS12', 'ALOX15B', 'HPG..."
4,6806409,[],[],[]
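The preview shows that the list columns are stored as strings such as `"['P10635', 'A1Z198', ...]"`, so they need to be parsed before the accession IDs can be looped over. The sketch below is one way to do that with the standard library; `parse_list_cell` and `unique_accessions` are hypothetical helper names, and the example cells reuse accession IDs visible in the preview.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified list like "['P10635', 'A1Z198']" into a real list."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # NaN or empty cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return []


def unique_accessions(cells):
    """Flatten parsed cells into a de-duplicated list, keeping first-seen order."""
    seen = {}
    for cell in cells:
        for accession in parse_list_cell(cell):
            seen.setdefault(accession, None)
    return list(seen)


example_cells = ["['P10635', 'A1Z198', 'Q5SYQ8']", "[]", "['Q5SYQ8', 'P47928']"]
accessions = unique_accessions(example_cells)
print(accessions)
```

On the DataFrame itself, `unique_accessions(df["Accession IDs"])` would apply the same parsing to every row.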


In [5]:
import requests
from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def fetch_and_blast_sequence(accession_id, email="your_email@example.com", blast_db="nr",
                             blast_type="blastp", expect=0.01, matrix_name="BLOSUM62",
                             alignments=50, hitlist_size=50, filter="F", gapcosts="11 1",
                             taxonomy=5858, identity_threshold=80):
    """
    Fetch a sequence from UniProt and run BLAST with custom settings, filtering results by identity percentage and taxonomy.

    Parameters:
        accession_id (str): UniProt accession ID.
        email (str): Email address for NCBI Entrez.
        blast_db (str): BLAST database to search against.
        blast_type (str): Type of BLAST search.
        expect (float): E-value threshold for BLAST.
        matrix_name (str): Scoring matrix name.
        alignments (int): Number of alignments to show.
        hitlist_size (int): Number of hits to return.
        filter (str): Filter options.
        gapcosts (str): Gap costs.
        taxonomy (int or str): NCBI taxonomy ID used to restrict the search (default 5858, Plasmodium malariae); pass None to disable.
        identity_threshold (int): Minimum percentage of identity for hits.
    """
    # Set Entrez email
    Entrez.email = email

    # Fetch sequence from UniProt
    url = f"https://rest.uniprot.org/uniprotkb/{accession_id}.fasta"
    response = requests.get(url, timeout=30)  # timeout avoids hanging indefinitely
    if response.status_code == 200:
        sequence_data = response.text
    else:
        print(f"Failed to fetch sequence from UniProt for {accession_id} with status {response.status_code}")
        return

    # Format Entrez query if taxonomy is specified
    entrez_query = f"txid{taxonomy}[ORGN]" if taxonomy else None

    # Perform BLAST search
    print(f"Running {blast_type} for Accession ID: {accession_id}...")
    result_handle = NCBIWWW.qblast(blast_type, blast_db, sequence_data,
                                   expect=expect, matrix_name=matrix_name,
                                   alignments=alignments, hitlist_size=hitlist_size,
                                   filter=filter, gapcosts=gapcosts,
                                   entrez_query=entrez_query)

    # Parse and filter BLAST results
    blast_record = NCBIXML.read(result_handle)
    result_handle.close()  # release the handle once the XML is parsed
    if not blast_record.alignments:
        print(f"No BLAST results found for Accession ID: {accession_id}")
        return

    print(f"Results for Accession ID: {accession_id}:")
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            identity_perc = (hsp.identities / hsp.align_length) * 100
            if identity_perc >= identity_threshold:
                print(f"  Hit ID: {alignment.hit_id}, Identity: {identity_perc:.2f}%")
                print(f"  Hit Description: {alignment.hit_def}")
                print(f"    E-value: {hsp.expect}")
                print(f"    Score: {hsp.score}")
                print(f"    Query Alignment: {hsp.query[:50]}...")
                print(f"    Subject Alignment: {hsp.sbjct[:50]}...")
                print("-" * 80)
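The function above prints each qualifying hit. To store results instead, the same filtering loop can collect hits into plain dictionaries. This is a sketch under that assumption: `collect_hits` is a hypothetical helper, the dictionary keys are illustrative, and the demo record is a stand-in built with `SimpleNamespace`, not real BLAST output. It relies only on the attributes already used above (`alignments`, `hsps`, `identities`, `align_length`, `expect`, `score`, `hit_id`, `hit_def`).

```python
from types import SimpleNamespace


def collect_hits(blast_record, identity_threshold=80):
    """Return hits at or above the identity threshold as a list of dicts."""
    hits = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if not hsp.align_length:
                continue  # guard against division by zero
            identity_perc = hsp.identities / hsp.align_length * 100
            if identity_perc >= identity_threshold:
                hits.append({
                    "hit_id": alignment.hit_id,
                    "hit_def": alignment.hit_def,
                    "identity_percent": round(identity_perc, 2),
                    "e_value": hsp.expect,
                    "score": hsp.score,
                })
    return hits


# Stand-in record with made-up values, used only to exercise the function.
demo_record = SimpleNamespace(alignments=[
    SimpleNamespace(hit_id="demo|1|", hit_def="placeholder hit", hsps=[
        SimpleNamespace(identities=95, align_length=100, expect=1e-50, score=500.0),
        SimpleNamespace(identities=40, align_length=100, expect=1e-5, score=80.0),
    ]),
])
demo_hits = collect_hits(demo_record)
```

Because the returned dictionaries contain only strings and numbers, they can be serialized to JSON directly.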



In [10]:
# Example usage
fetch_and_blast_sequence("P10635")

Running blastp for Accession ID: P10635...
Results for Accession ID: P10635:
  Hit ID: ref|NP_000097.3|
  Hit Description: cytochrome P450 2D6 isoform 1 [Homo sapiens] >sp|P10635.2| RecName: Full=Cytochrome P450 2D6; AltName: Full=CYPIID6; AltName: Full=Cholesterol 25-hydroxylase; AltName: Full=Cytochrome P450-DB1; AltName: Full=Debrisoquine 4-hydroxylase [Homo sapiens] >gb|AAA53500.1| cytochrome P450 IID6 [Homo sapiens] >gb|AAH75023.1| Cytochrome P450, family 2, subfamily D, polypeptide 6 [Homo sapiens] >gb|AAH75024.1| Cytochrome P450, family 2, subfamily D, polypeptide 6 [Homo sapiens] >gb|AAS55001.1| cytochrome P4502D6 [Homo sapiens] >gb|ABB77895.1| cytochrome P450 2D6 [Homo sapiens]
    E-value: 0.0
    Score: 2617.0
    Query Alignment: MGLEALVPLAVIVAIFLLLVDLMHRRQRWAARYPPGPLPLPGLGNLLHVD...
    Subject Alignment: MGLEALVPLAVIVAIFLLLVDLMHRRQRWAARYPPGPLPLPGLGNLLHVD...
--------------------------------------------------------------------------------
  Hit ID: gb|ACY39277.1|
  Hit Descr

In [8]:
# Example usage
fetch_and_blast_sequence("L8GZV5")

Running blastp for Accession ID: L8GZV5...
Results for Accession ID: L8GZV5:
  Hit ID: ref|XP_004340546.1|
  Hit Description: mitogenactivated (MAP) kinase [Acanthamoeba castellanii str. Neff] >gb|ELR18507.1| mitogenactivated (MAP) kinase [Acanthamoeba castellanii str. Neff]
    E-value: 0.0
    Score: 2416.0
    Query Alignment: MHAPPGPTASMSVTSSSSSASSSSSSISSPSPLLRLASHNLSPRPTTPGH...
    Subject Alignment: MHAPPGPTASMSVTSSSSSASSSSSSISSPSPLLRLASHNLSPRPTTPGH...
--------------------------------------------------------------------------------
  Hit ID: gb|KAL0087681.1|
  Hit Description: hypothetical protein J3Q64DRAFT_1676315 [Phycomyces blakesleeanus]
    E-value: 3.52184e-55
    Score: 511.0
    Query Alignment: YDLQHVIGQGAYGVVWLALDRRSGQRVAVKKIADVFGDSKEAKRTLREVR...
    Subject Alignment: YQFIREMGQGSYGVVCAAKDSETDEQVAIKKVCRVFEKSILSKRALREVK...
--------------------------------------------------------------------------------
  Hit ID: ref|XP_018287602.1|
  Hit Description: hypothetical protei

In [9]:
# Example usage
fetch_and_blast_sequence("Q9Z0P7")

Running blastp for Accession ID: Q9Z0P7...
Results for Accession ID: Q9Z0P7:
  Hit ID: ref|NP_001020562.1|
  Hit Description: suppressor of fused homolog isoform 2 [Mus musculus] >sp|Q9Z0P7.1| RecName: Full=Suppressor of fused homolog [Mus musculus] >gb|AAH48168.1| Suppressor of fused homolog (Drosophila) [Mus musculus] >gb|AAH56997.1| Suppressor of fused homolog (Drosophila) [Mus musculus] >emb|CAB38081.1| Su(fu) protein, partial [Mus musculus] >emb|CAC34258.1| suppressor of fused homolog (Drosophila) [Mus musculus] >emb|CAC34271.1| suppressor of fused homolog (Drosophila) [Mus musculus]
    E-value: 0.0
    Score: 2576.0
    Query Alignment: MAELRPSVAPGPAAPPASGPSAPPAFASLFPPGLHAIYGECRRLYPDQPN...
    Subject Alignment: MAELRPSVAPGPAAPPASGPSAPPAFASLFPPGLHAIYGECRRLYPDQPN...
--------------------------------------------------------------------------------
  Hit ID: ref|XP_021048491.1|
  Hit Description: suppressor of fused homolog isoform X2 [Mus pahari]
    E-value: 0.0
    Score: 2567.0
