<a href="https://colab.research.google.com/github/kattens/PubChem-Data-Handler/blob/main/Pubchem_Downloader_Phase_one.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install PubChemPy

In [2]:
import csv
import pandas as pd
import pubchempy as pcp
import os
import requests


We process **PubChem IDs (CIDs)** listed in a CSV file, where each CID represents a chemical compound, such as a drug, metabolite, or other small molecule. Our objective is to identify **targets** associated with these molecules.

---

### Data Downloaded for Each CID

#### 1. **Biological Assay Summary**  
**Format:** UniProt  
This dataset provides **experimental data** from bioassays where the molecule was tested for biological activity. It includes:
- **Interacting Targets:** Proteins, enzymes, or other biological entities that the molecule binds to or affects.  
- **Bioactivity Data:** Quantitative metrics such as IC50, EC50, or binding affinity.  
- **Toxicity and Pharmacology:** Insights into the molecule’s drug potential, safety, and biological effects.  

---

#### 2. **Interactions and Pathways**  
**Format:** PDB  
This dataset highlights biological **pathways and targets** (e.g., proteins or enzymes) that the molecule interacts with, showcasing its role in:  
- **Metabolic Pathways**  
- **Signaling Cascades**  
- **Other Cellular Processes**

---

### Detailed Breakdown

#### A. **Biological Assay Summary (Experimental Evidence)**  
- **Purpose:** Provides experimental results from bioassays where the molecule was directly tested.  
- **Key Details:**  
  - **Targets:** Proteins or enzymes directly shown to interact with the molecule, based on experimental evidence.  
  - **Bioactivity:** Quantitative measures like IC50 (inhibitory concentration), EC50 (effective concentration), or binding affinities.  
  - **Toxicology/Pharmacology:** Information on drug potential and safety profiles.  

This is **direct experimental evidence** answering:  
*"Which proteins or enzymes does this molecule bind to, and how effectively?"*

---

#### B. **Interactions and Pathways (Biological Context)**  
- **Purpose:** Provides contextual information on how the molecule participates in biological processes.  
- **Key Details:**  
  - **Pathways:** Metabolic or signaling pathways where the molecule plays a role.  
  - **Protein Interactions:** Proteins or enzymes the molecule interacts with within these pathways.  

This dataset gives **broader biological context**, often inferred, rather than direct experimental evidence.

---

### Summary of Differences  
- **Biological Assay Summary (A):** Direct experimental evidence of molecule-target interactions.  
- **Interactions and Pathways (B):** Contextual, sometimes inferred, information about pathways and broader biological roles.

---

### Final Objective  
Using this information, we generate a list of sequences to perform **BLAST** against malaria orthologs.  

#Example:

https://pubchem.ncbi.nlm.nih.gov/compound/10219#section=Protein-Bound-3D-Structures

In [3]:
file_path = '/content/drive/MyDrive/cdot_actives_50 1.xlsx'

df = pd.read_excel(file_path)

In [4]:
#based on a experiment theres some NaN values in the column that we should remove
#remove the rows with NaN as the pubchem_cid value
df = df.dropna(subset=['pubchem_cid'])

In [5]:
#just a representation of the cdot csv file:
df

Unnamed: 0,CDoT Plate No.,*.1 Barcode_Well address,Unnamed: 2,Hit rate,pert_iname,pert_iname.1,concentration_mM,vendor,catalog_no,smiles,...,pubchem_cid,clinical_phase,moa,target,disease_area,indication,DMPNN_prediction,max_tanimoto_active,MAIP_pred_score,max_topological_tanimoto
0,19,BR00130947K18,BRD-K51329597-001-02-9,85.069661,-AZMÂ 475271.00,-AZMÂ 475271.00,10.000000,Tocris,3963,COc1ccc(Cl)c(Nc2ncnc3cc(OCC4CCN(C)CC4)c(OC)cc2...,...,5330175.0,Preclinical,SRC inhibitor,SRC,,,1.00,0.458333,75.354837,0.768537
1,11,BR00130917J06,BRD-K17705806-003-03-6,100.303774,JTC-801,JTC-801,10.000000,Selleck,S2722,CCc1ccc(OCc2ccccc2C(=O)Nc2ccc3nc(C)cc(N)c3c2)cc1,...,5311340.0,Phase 2,opioid receptor antagonist,OPRL1,,,1.00,0.538462,42.501938,0.706559
2,12,BR00130916N19,BRD-K57169635-001-05-9,94.288306,dacomitinib,dacomitinib,10.000000,Tocris,6231,COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1NC(=O)\C=C\C...,...,11511120.0,Launched,EGFR inhibitor,EGFR|ERBB2|ERBB4,,,1.00,0.393617,80.514961,0.615859
3,20,BR00130945P03,BRD-K73799155-001-01-5,97.792777,NSC-5844,NSC-5844,10.000048,MedChemEx,HY-100033,Clc1ccc2c(NCCNc3ccnc4cc(Cl)ccc34)ccnc2c1,...,221354.0,Preclinical,CC chemokine receptor agonist,CCR1,,,1.00,0.966667,52.324497,1.000000
4,14,BR00130957L17,BRD-K26122255-001-01-7,94.788143,VLX600,VLX600,10.000000,Enamine,EN300-395210,CC(=NNc1nnc2c(n1)[nH]c1c(C)cccc21)c1ccccn1,...,6806409.0,Phase 1,antitumor agent|ubiquitin C-terminal hydrolase...,TP53|USP14,,,1.00,0.385965,14.558880,0.730836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60,22,BR00130941G13,BRD-K39359968-001-03-9,95.331905,L-755507,L-755507,10.000000,Tocris,2197,CCCCCCNC(=O)Nc1ccc(cc1)S(=O)(=O)Nc1ccc(CCNC[C@...,...,9829836.0,Preclinical,adrenergic receptor agonist,ADRB1|ADRB2|ADRB3,,,0.97,0.337662,42.797476,0.555102
61,22,BR00130941P04,BRD-K32289541-001-02-8,99.823256,EHop-016,EHop-016,10.000000,Selleck,S7319,CCn1c2ccccc2c2cc(Nc3ccnc(NCCCN4CCOCC4)n3)ccc12,...,51031035.0,Preclinical,Ras GTPase inhibitor,RAC1|RAC3,,,0.97,0.457143,83.399485,0.512428
62,5,BR00130929L09,BRD-M87138257-003-01-1,96.772553,acriflavine,acriflavine,9.583359,Sigma,MFCD00064307,Nc1ccc2cc3ccc(N)cc3nc2c1.C[n+]1c2cc(N)ccc2cc2c...,...,452192.0,Launched,hypoxia inducible factor inhibitor,HIF1A,infectious disease,fungal infection,0.97,0.371429,,0.540603
63,7,BR00130925O14,BRD-K43389675-003-20-9,102.670071,daunorubicin,daunorubicin,10.000000,Tocris,1467,COc1cccc2C(=O)c3c(O)c4C[C@](O)(C[C@H](O[C@H]5C...,...,30323.0,Launched,RNA synthesis inhibitor|topoisomerase inhibitor,TOP2A|TOP2B,hematologic malignancy,acute myeloid leukemia (AML)|acute lymphoblast...,0.97,1.000000,80.946386,1.000000


In [6]:
#access to the pubchem_cid column
df['pubchem_cid']

Unnamed: 0,pubchem_cid
0,5330175.0
1,5311340.0
2,11511120.0
3,221354.0
4,6806409.0
...,...
60,9829836.0
61,51031035.0
62,452192.0
63,30323.0


In [7]:
#make a list
pubchem_ids = df['pubchem_cid'].tolist() #type = float
pubchem_targets = df['target'].tolist() #type = string

#convert the float to int
pubchem_ids = [int(i) for i in pubchem_ids]

#make every element in pubchem_targets a list and if see | break and make a new entry in the list
for i in range(len(pubchem_targets)):
  if isinstance(pubchem_targets[i], str): #checking if the element is a string before continuing
    if '|' in pubchem_targets[i]:
        pubchem_targets[i] = pubchem_targets[i].strip().split('|')
    else:
        #remove the '' from the entry
        pubchem_targets[i] = pubchem_targets[i].strip().replace(' ', '') #Added strip before replace
        pubchem_targets[i] = [pubchem_targets[i]]

'''
#just to make sure the values are correct to make a dict
print(len(pubchem_ids))
print(len(pubchem_targets))
print(type(pubchem_targets[2][0]))
'''

#make a dictionary such that ids are the keys and targets are the values
pubchem_dict = dict(zip(pubchem_ids , pubchem_targets))
print(pubchem_dict)


{5330175: ['SRC'], 5311340: ['OPRL1'], 11511120: ['EGFR', 'ERBB2', 'ERBB4'], 221354: ['CCR1'], 6806409: ['TP53', 'USP14'], 5329480: ['EGFR', 'ERBB2'], 12947: ['TLR7', 'TLR9'], 444810: ['MRGPRX1'], 135421339: ['BRAF'], 9939609: ['PLA2G7'], 42627755: ['ERNÂ\xa01.00'], 53464483: ['TRPV4'], 3647519: ['HNMT'], 9810709: ['TOP2A'], 6413301: ['TP53', 'USP14'], 119081415: ['CDK7'], 5311382: ['EGFR', 'FGFR1', 'PDGFRB', 'PKMYT1', 'SRC', 'WEE1'], 53315868: ['EHMT2'], 10219: ['RPS2'], 9914412: ['AURKA', 'AURKB'], 2993: ['KCNN1', 'KCNN3'], 24756910: ['EGFR', 'ERBB2'], 6918097: ['ADORA3'], 4534086: ['SLC8A1', 'TRPC3', 'TRPC5', 'TRPC6'], 73416445: ['ATP1A1'], 132928: ['BDKRB2'], 5281035: ['AR'], 121750: ['TYMS'], 9852185: ['BCL2'], 51000408: ['ATR'], 73602827: ['CDK7'], 3499: ['PRKCA', 'PRKCB', 'PRKCD', 'PRKCG', 'PRKCZ'], 9809926: ['CACNA2D1'], 41867: ['CHD1', 'TOP2A'], 6918837: ['HDAC1', 'HDAC2', 'HDAC3', 'HDAC4', 'HDAC6', 'HDAC7', 'HDAC8', 'HDAC9'], 24858111: ['DNMT1', 'DNMT3A', 'DNMT3B'], 33630: ['

# Now we have a dictionary of the ids and targets.


In [8]:
for cid in pubchem_ids:
  print(cid)

5330175
5311340
11511120
221354
6806409
5329480
12947
444810
135421339
9939609
42627755
53464483
3647519
9810709
6413301
119081415
5311382
53315868
10219
9914412
2993
24756910
6918097
4534086
73416445
132928
5281035
121750
9852185
51000408
73602827
3499
9809926
41867
6918837
24858111
33630
2798
441074
4735
3034034
676352
49855250
154257
24795070
11978790
439530
6445562
9829526
5583
446536
46848036
9549305
25150857
53315882
6197
51358113
16007391
9829836
51031035
452192
30323
11785878


#Biological Essay Code:

In [10]:
folder_path = '/content/drive/MyDrive/IDS_Target_Result'

'''
Next step is to download the ids biological test results csv files from pubchem website
We tried to use the PUG API but it wasnt downloading the correct files so we did it manually

'''
#folder to save the csv files in:
folder_path = '/content/drive/MyDrive/IDS_Target_Result'

def fetch_bioassay_results(pubchem_ids, folder_path):
    # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)
    for cid in pubchem_ids:
        # The url, using f-string formatting -> same for all the files
        url = f"https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={{%22download%22:%22*%22,%22collection%22:%22bioactivity%22,%22order%22:[%22acvalue,asc%22],%22start%22:1,%22limit%22:10000000,%22downloadfilename%22:%22pubchem_cid_{cid}_bioactivity%22,%22nullatbottom%22:1,%22where%22:{{%22ands%22:[{{%22cid%22:%22{cid}%22}}]}}}}"

        # Send a GET request to the API
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            # Define the file path
            file_path = os.path.join(folder_path, f"{cid}.csv")
            # Write the content to a CSV file
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Bioassay data for CID {cid} has been downloaded and saved to {file_path}")
        else:
            print(f"Failed to retrieve bioassay data for CID {cid}. HTTP Status Code: {response.status_code}")



#Interaction and Pathways:

In [11]:
'''
def fetch_pathways_results(pathway_ids, folder_path):
  # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)
    for cid in pubchem_ids:
        # The url, using f-string formatting -> same for all the files
        url = f"https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={{\"download\":\"*\",\"collection\":\"pdb\",\"order\":[\"resolution,asc\"],\"start\":1,\"limit\":10000000,\"downloadfilename\":\"pubchem_cid_{cid}_pdb\",\"where\":{{\"ands\":[{{\"cid\":\"{cid}\"}}]}}}}"

        # Send a GET request to the API
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            # Define the file path
            file_path = os.path.join(folder_path, f"{cid}.csv")
            # Write the content to a CSV file
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Pathways data for CID {cid} has been downloaded and saved to {file_path}")
        else:
            print(f"Failed to retrieve pathways data for CID {cid}. HTTP Status Code: {response.status_code}")
'''

'\ndef fetch_pathways_results(pathway_ids, folder_path):\n  # Ensure the folder exists\n    os.makedirs(folder_path, exist_ok=True)\n    for cid in pubchem_ids:\n        # The url, using f-string formatting -> same for all the files\n        url = f"https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={{"download":"*","collection":"pdb","order":["resolution,asc"],"start":1,"limit":10000000,"downloadfilename":"pubchem_cid_{cid}_pdb","where":{{"ands":[{{"cid":"{cid}"}}]}}}}"\n\n        # Send a GET request to the API\n        response = requests.get(url)\n\n        # Check if the request was successful\n        if response.status_code == 200:\n            # Define the file path\n            file_path = os.path.join(folder_path, f"{cid}.csv")\n            # Write the content to a CSV file\n            with open(file_path, \'wb\') as file:\n                file.write(response.content)\n            print(f"Pathways data for CID {cid} has been downloaded and saved to

In [12]:
#uncomment if you need to download the ids targets
fetch_bioassay_results(pubchem_ids , folder_path)

folder_path_interaction = '/content/drive/MyDrive/IDS_Target_Result_Interaction'
#uncomment if you need to download the ids targets
#fetch_pathways_results(pubchem_ids , folder_path_interaction)


Bioassay data for CID 5330175 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5330175.csv
Bioassay data for CID 5311340 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5311340.csv
Bioassay data for CID 11511120 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/11511120.csv
Bioassay data for CID 221354 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/221354.csv
Bioassay data for CID 6806409 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/6806409.csv
Bioassay data for CID 5329480 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5329480.csv
Bioassay data for CID 12947 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/12947.csv
Bioassay data for CID 444810 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/444810.csv
Bioassay data for CID 135421339 has been downloaded and saved to /content/driv

#Now that we have all the csv files we should extract and create the dictionary for the new ids and targets.

In [13]:
#open csv and get the column names
new_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result/10219.csv')
print(new_df.columns)

Index([' baid', 'acvalue', 'aid', 'sid', 'cid', 'geneid', 'pmid', 'aidtype',
       'aidmdate', 'hasdrc', 'rnai', 'activity', 'protacxn', 'acname',
       'acqualifier', 'aidsrcname', 'aidname', 'cmpdname', 'targetname',
       'targeturl', 'ecs', 'repacxn', 'taxid', 'cellids', 'targettaxid',
       'anatomyid', 'anatomy', 'dois', 'pmcids', 'pclids', 'citations'],
      dtype='object')


#Biological Essay Download Code

In [14]:
#create an empty dictionary
bioessay_dict = {}

#go through all the csv files in the folder, have the name of the file as key and the column 'targetname' as value
for file in os.listdir(folder_path):
  if file.endswith('.csv'):
    file_path = os.path.join(folder_path , file)
    df = pd.read_csv(file_path)
    file = file.replace('.csv', '')  # Remove .csv from the key
    #if the value is equal to nan or not a string remove it
    df = df[df['targetname'].notna()]
    #modify the bioessay_dict in a way that the values are only unique elements -> use set() function
    bioessay_dict[file] = list(set(df['targetname']))

    #step 2: remove dashes (-) from the list of the values
    for i in range(len(bioessay_dict[file])):
      if '-' in bioessay_dict[file][i]:
        bioessay_dict[file][i] = bioessay_dict[file][i].replace('-', '')
        print(bioessay_dict[file][i])  # for debugging -> its working

'''

'''

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Glycogen synthase kinase3 beta (human)
CYP2C9  cytochrome P450 family 2 subfamily C member 9 (human)
CA2  carbonic anhydrase 2 (human)
HNMT  histamine Nmethyltransferase (human)
GPX1  glutathione peroxidase 1 (human)
ATR  ATR serine/threonine kinase (human)
UL53  nuclear egress lamina protein (Human betaherpesvirus 5)
Mitogenactivated protein kinase 1 (human)
ALOX12  arachidonate 12lipoxygenase, 12S type (human)
GNMT  glycine Nmethyltransferase (human)
Tyrosineprotein kinase ABL1 (human)
Cyclindependent kinase 2 (human)
LASVsSgp1  nucleoprotein (Mammarenavirus lassaense)
PRKDC  protein kinase, DNAactivated, catalytic subunit (human)
Transcription intermediary factor 1alpha (human)
CYP3A4  cytochrome P450 family 3 subfamily A member 4 (human)
GAMT  guanidinoacetate Nmethyltransferase (human)
GPX4  glutathione peroxidase 4 (human)
Luciferin 4monooxygenase (Luciola mingrelica)
Cyclindependent kinase 13 (human)
IP6K1  inosito

'\n\n'

In [None]:
#to see if its working
print(bioessay_dict['5330175'])
print(bioessay_dict.keys())
#print(bioessay_dict.values()[61])
print(list(bioessay_dict.values())[0])
print(len(list(bioessay_dict.values())[0]))

['Luciferin 4monooxygenase (Luciola mingrelica)', 'Severe acute respiratory syndrome coronavirus 2', 'CYP3A4  cytochrome P450 family 3 subfamily A member 4 (human)', 'NNMT  nicotinamide Nmethyltransferase (human)', 'Canis lupus familiaris (dog)', 'Mus musculus (house mouse)', 'CYP2D6  cytochrome P450 family 2 subfamily D member 6 (gene/pseudogene) (human)', 'FYN  FYN protooncogene, Src family tyrosine kinase (human)', 'PTK6  protein tyrosine kinase 6 (human)', 'Chain A, Calcium and integrinbinding protein 1 (human)', 'COMT  catecholOmethyltransferase (human)', 'GPX1  glutathione peroxidase 1 (human)', 'IFNB1  interferon beta 1 (human)', 'CYP2C9  cytochrome P450 family 2 subfamily C member 9 (human)', 'NSD2  nuclear receptor binding SET domain protein 2 (human)', 'CYP3A7  cytochrome P450 family 3 subfamily A member 7 (human)', 'S  surface glycoprotein (Severe acute respiratory syndrome coronavirus 2)', 'CIB1  calcium and integrin binding 1 (human)', 'LCK  LCK protooncogene, Src family t

### **Why did we decide to remove the '-' from the target names?**
1. **Removing the Dash (`-`) Broadens the Search Scope**:
   - Correct: Removing the dash makes the query less strict, allowing UniProt to search for terms independently rather than as a single specific entity.
   - With the dash, UniProt interprets the query as a combined condition (`DRD1 - dopamine receptor D1`), which might not match any entry directly or narrows the search results significantly.

2. **Dashes in Names Indicate Context, Not the Full Name**:
   - Partially correct: While the dash *can* separate parts of the query (e.g., gene name vs. organism name), its meaning is dependent on the context. For example:
     - `"DRD1 - dopamine receptor D1 (human)"` might be interpreted as looking for a human protein explicitly labeled in that format.
     - Without the dash (`DRD1 dopamine receptor D1`), UniProt searches for "DRD1" and "dopamine receptor D1" more freely, increasing potential matches.

3. **Why Removing the Dash Expands Results**:
   - UniProt's search treats the dash (`-`) as part of the query, which could limit results to exact matches with the dash present. When you remove the dash, the system searches for terms individually, yielding broader matches.

---

### **Refined Explanation**
- If the dash exists in your query (`DRD1 - dopamine receptor D1`), UniProt interprets it as a specific, tightly coupled phrase. This might not yield results unless an entry explicitly matches this exact structure.
- Without the dash (`DRD1 dopamine receptor D1`), UniProt treats the terms as separate keywords and attempts to find matches containing any or all of them. This is why removing the dash often yields broader and more relevant results.

---

### **Source**

The behavior you've observed when searching UniProt with and without a dash (`-`) in your query stems from how UniProt's search engine interprets special characters and query syntax. While UniProt's documentation doesn't explicitly detail the impact of dashes in search queries, we can infer the following based on general search engine behaviors and available information:

**1. Special Characters in Search Queries:**
- Search engines often treat special characters like dashes as operators or delimiters. In some contexts, a dash can signify exclusion (e.g., `term1 -term2` searches for entries containing "term1" but not "term2"). However, without explicit documentation from UniProt on this behavior, it's unclear how a dash is processed in your specific query.

**2. Broadening Search Results:**
- Removing the dash from your query (`DRD1 dopamine receptor D1`) allows the search engine to interpret the terms more flexibly, potentially returning a broader set of results. This approach aligns with general search practices where simplifying queries can yield more comprehensive results.

**3. Organism Specification:**
- Including terms like "human" in your query helps specify the organism of interest. UniProt entries often include the organism name, so adding this term can refine your search to entries related to human proteins.

**Recommendations:**
- **Simplify Queries:** Use straightforward terms without special characters unless you're certain of their function in the search syntax.
- **Specify Organism:** Include the organism name (e.g., "human") to narrow down results to the species of interest.
- **Consult Documentation:** For complex queries, refer to UniProt's [Advanced Search Help](https://www.uniprot.org/help/advanced_search) for guidance on query syntax and field-specific searches.


# About the Names:  
Perform the search first using only the first word in the name, and then run the search again using the full name.


#Interactions and Pathways Download Code

In [15]:
'''
#open csv and get the column names
new_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result_Interaction/10219.csv')
print(new_df.columns)
'''

"\n#open csv and get the column names\nnew_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result_Interaction/10219.csv')\nprint(new_df.columns)\n"

- we have to work with the pdbid column and

In [None]:
#for interactions we will use the pdbid column to download the pdb files from the rcsb website


# Summary of Changes  
We removed `.csv` from the keys, ensuring that the list of values for each key is unique, and removed dashes from the values in the target name list.

# Next Step  
1. Create a folder named after each ID (key in the dictionary).  
2. For each value in the list corresponding to a key, download the UniProt sequence file (using the first entry in the search query).  
3. Place the downloaded files into the respective folder.  

The final result should be 63 folders, each containing multiple UniProt sequence files.

In [None]:
'''

https://www.uniprot.org/uniprotkb/P21964/entry


https://www.uniprot.org/uniprotkb?query=COMT+-+catecholOmethyltransferase+%28human%29

https://www.uniprot.org/uniprotkb?query=SYK+spleen+associated+tyrosine+kinase+%28human%29

https://www.uniprot.org/uniprotkb/P61073/entry

'''

'\n\nhttps://www.uniprot.org/uniprotkb/P21964/entry\n\n\nhttps://www.uniprot.org/uniprotkb?query=COMT+-+catecholOmethyltransferase+%28human%29\n\nhttps://www.uniprot.org/uniprotkb?query=SYK+spleen+associated+tyrosine+kinase+%28human%29\n\nhttps://www.uniprot.org/uniprotkb/P61073/entry\n\n'

#UniProt API

In [None]:
# Step 1: Function to search UniProt and get the accession ID
def search_uniprot(query):
    """
    Search UniProt for a given query and return the accession ID of the first result.
    """
    base_search_url = "https://rest.uniprot.org/uniprotkb/search"
    search_params = {
        "query": query,
        "fields": "accession",
        "size": 1,  # Limit to the first result
    }
    response = requests.get(base_search_url, params=search_params)

    if response.status_code == 200:
        results = response.json().get("results", [])
        if results:
            accession = results[0]["primaryAccession"]
            print(f"Found entry: {accession}")
            return accession
        else:
            print("No results found.")
            return None
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

# Step 2: Function to download the FASTA sequence using accession ID
def download_uniprot_sequence(accession, folder_path):
    """
    Download the FASTA sequence for a given accession ID and save it in the specified folder.
    """
    # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)

    # Construct the file path
    file_path = os.path.join(folder_path, f"{accession}.fasta")

    # Base URL for UniProt sequence download
    base_url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    response = requests.get(base_url)

    if response.status_code == 200:
        # Save the sequence to the specified folder
        with open(file_path, "w") as fasta_file:
            fasta_file.write(response.text)
        print(f"Sequence saved at {file_path}")
    else:
        print(f"Error: {response.status_code}, {response.text}")

# Step 3: Combined function to search and download
def get_uniprot_fasta(query, folder_path):
    """
    Search UniProt for a query, retrieve the first result's accession ID,
    and download the FASTA sequence to the specified folder.
    """
    # Search the entry
    accession = search_uniprot(query)
    if accession:
        # Download the sequence
        download_uniprot_sequence(accession, folder_path)


In [None]:
'''
This is the pipeline of making the folders and downloading the files
'''

FolderPath = '/content/drive/MyDrive/ExamplePath'

first_key, first_value = next(iter(bioessay_dict.items()))

bioessay_dict[first_key] = first_value

print(bioessay_dict)
print(len(first_value))
print(len(first_key))

{'5330175': ['ORF1ab  ORF1a polyprotein;ORF1ab polyprotein (Severe acute respiratory syndrome coronavirus 2)', 'NTMT1  Nterminal XaaProLys Nmethyltransferase 1 (human)', 'Chain A, Calcium and integrinbinding protein 1 (human)', 'SRC  SRC protooncogene, nonreceptor tyrosine kinase (human)', 'HNMT  histamine Nmethyltransferase (human)', 'Canis lupus familiaris (dog)', 'FYN  FYN protooncogene, Src family tyrosine kinase (human)', 'GNMT  glycine Nmethyltransferase (human)', 'Mus musculus (house mouse)', 'CSK  Cterminal Src kinase (human)', 'KDR  kinase insert domain receptor (human)', 'CIB1  calcium and integrin binding 1 (human)', 'S  surface glycoprotein (Severe acute respiratory syndrome coronavirus 2)', 'NNMT  nicotinamide Nmethyltransferase (human)', 'CYP2C9  cytochrome P450 family 2 subfamily C member 9 (human)', 'Severe acute respiratory syndrome coronavirus 2', 'Rattus norvegicus (Norway rat)', 'PTK6  protein tyrosine kinase 6 (human)', 'PNMT  phenylethanolamine Nmethyltransferase 

In [None]:
#go through the dictionary
for key in bioessay_dict:
    #create a folder with the key name in FolderPath
    folder_path = os.path.join(FolderPath, key)
    #create the folder
    os.makedirs(folder_path, exist_ok=True)
    #after making folder download the values in the dict
    for value in bioessay_dict[key]:
      for entry in value:
        get_uniprot_fasta(entry, folder_path)



Found entry: Q6E7F2
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/Q6E7F2.fasta
Found entry: A0A4P7VJP0
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/A0A4P7VJP0.fasta
Found entry: M9MSB2
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/M9MSB2.fasta


KeyboardInterrupt: 

In [None]:
ppath = '/content/drive/MyDrive/ExamplePath/5330175'

#check how many files are ther in ppath
print(len(os.listdir(ppath)))

33


#errors and problems:
1. downloading pdb files for each interaction and pathways material
2. some of the interaction and pathways return empty csv files while manually its not the case
3. some interaction and pathways pdb results are mostly protein complexes, how do we solve or extract the target to do the blast?


# **Do the blast**
### Steps:
1. make a dictionary with the name of the uniprot entries and their ids
OR
2. add a column to the csv with the uniprot ids
3. run the blast in uniprot against plasmodium malariae (5858) and save the results