<a href="https://colab.research.google.com/github/kattens/PubChem-Data-Handler/blob/main/Pubchem_Downloader_Phase_one.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
%%capture
!pip install PubChemPy

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
import csv
import pandas as pd
import pubchempy as pcp
import os
import requests


We process **PubChem IDs (CIDs)** listed in a CSV file, where each CID represents a chemical compound, such as a drug, metabolite, or other small molecule. Our objective is to identify **targets** associated with these molecules.

---

### Data Downloaded for Each CID

#### 1. **Biological Assay Summary**  
**Format:** UniProt  
This dataset provides **experimental data** from bioassays where the molecule was tested for biological activity. It includes:
- **Interacting Targets:** Proteins, enzymes, or other biological entities that the molecule binds to or affects.  
- **Bioactivity Data:** Quantitative metrics such as IC50, EC50, or binding affinity.  
- **Toxicity and Pharmacology:** Insights into the molecule’s drug potential, safety, and biological effects.  

---

#### 2. **Interactions and Pathways**  
**Format:** PDB  
This dataset highlights biological **pathways and targets** (e.g., proteins or enzymes) that the molecule interacts with, showcasing its role in:  
- **Metabolic Pathways**  
- **Signaling Cascades**  
- **Other Cellular Processes**

---

### Detailed Breakdown

#### A. **Biological Assay Summary (Experimental Evidence)**  
- **Purpose:** Provides experimental results from bioassays where the molecule was directly tested.  
- **Key Details:**  
  - **Targets:** Proteins or enzymes directly shown to interact with the molecule, based on experimental evidence.  
  - **Bioactivity:** Quantitative measures like IC50 (inhibitory concentration), EC50 (effective concentration), or binding affinities.  
  - **Toxicology/Pharmacology:** Information on drug potential and safety profiles.  

This is **direct experimental evidence** answering:  
*"Which proteins or enzymes does this molecule bind to, and how effectively?"*

---

#### B. **Interactions and Pathways (Biological Context)**  
- **Purpose:** Provides contextual information on how the molecule participates in biological processes.  
- **Key Details:**  
  - **Pathways:** Metabolic or signaling pathways where the molecule plays a role.  
  - **Protein Interactions:** Proteins or enzymes the molecule interacts with within these pathways.  

This dataset gives **broader biological context**, often inferred, rather than direct experimental evidence.

---

### Summary of Differences  
- **Biological Assay Summary (A):** Direct experimental evidence of molecule-target interactions.  
- **Interactions and Pathways (B):** Contextual, sometimes inferred, information about pathways and broader biological roles.

---

### Final Objective  
Using this information, we generate a list of sequences to perform **BLAST** against malaria orthologs.  

#Example:

https://pubchem.ncbi.nlm.nih.gov/compound/10219#section=Protein-Bound-3D-Structures

In [16]:
file_path = '/content/drive/MyDrive/Drug Repurposing Project/cdot_actives_50.xlsx'

df = pd.read_excel(file_path)

In [20]:
#based on a experiment theres some NaN values in the column that we should remove
#remove the rows with NaN as the pubchem_cid value
df = df.dropna(subset=['pubchem_cid'])

#make a list
pubchem_ids = df['pubchem_cid'].tolist() #type = float
pubchem_targets = df['target'].tolist() #type = string

#convert the float to int
pubchem_ids = [int(i) for i in pubchem_ids]

#make every element in pubchem_targets a list and if see | break and make a new entry in the list
for i in range(len(pubchem_targets)):
  if isinstance(pubchem_targets[i], str): #checking if the element is a string before continuing
    if '|' in pubchem_targets[i]:
        pubchem_targets[i] = pubchem_targets[i].strip().split('|')
    else:
        #remove the '' from the entry
        pubchem_targets[i] = pubchem_targets[i].strip().replace(' ', '') #Added strip before replace
        pubchem_targets[i] = [pubchem_targets[i]]


#make a dictionary such that ids are the keys and targets are the values
pubchem_dict = dict(zip(pubchem_ids , pubchem_targets))
print(pubchem_dict)


{5330175: ['SRC'], 5311340: ['OPRL1'], 11511120: ['EGFR', 'ERBB2', 'ERBB4'], 221354: ['CCR1'], 6806409: ['TP53', 'USP14'], 5329480: ['EGFR', 'ERBB2'], 12947: ['TLR7', 'TLR9'], 444810: ['MRGPRX1'], 135421339: ['BRAF'], 9939609: ['PLA2G7'], 42627755: ['ERNÂ\xa01.00'], 53464483: ['TRPV4'], 3647519: ['HNMT'], 9810709: ['TOP2A'], 6413301: ['TP53', 'USP14'], 119081415: ['CDK7'], 5311382: ['EGFR', 'FGFR1', 'PDGFRB', 'PKMYT1', 'SRC', 'WEE1'], 53315868: ['EHMT2'], 10219: ['RPS2'], 9914412: ['AURKA', 'AURKB'], 2993: ['KCNN1', 'KCNN3'], 24756910: ['EGFR', 'ERBB2'], 6918097: ['ADORA3'], 4534086: ['SLC8A1', 'TRPC3', 'TRPC5', 'TRPC6'], 73416445: ['ATP1A1'], 132928: ['BDKRB2'], 5281035: ['AR'], 121750: ['TYMS'], 9852185: ['BCL2'], 51000408: ['ATR'], 73602827: ['CDK7'], 3499: ['PRKCA', 'PRKCB', 'PRKCD', 'PRKCG', 'PRKCZ'], 9809926: ['CACNA2D1'], 41867: ['CHD1', 'TOP2A'], 6918837: ['HDAC1', 'HDAC2', 'HDAC3', 'HDAC4', 'HDAC6', 'HDAC7', 'HDAC8', 'HDAC9'], 24858111: ['DNMT1', 'DNMT3A', 'DNMT3B'], 33630: ['

In [None]:
#access to the pubchem_cid column
#df['clinical_phase'] # focus on those that are Launched
#df['target'] # ask if we need only these or should we add more
#df['pert_iname.1'] # focus on those that has a commercial name

#df['pubchem_cid']

for cid in pubchem_ids:
  print(cid)

# **Now we have a dictionary of the ids and targets.**


#Biological Essay Code:

In [23]:
'''
Next step is to download the ids biological test results csv files from pubchem website
We tried to use the PUG API but it wasnt downloading the correct files so we did it manually

'''
#folder to save the csv files in:

def fetch_bioassay_results(pubchem_ids, folder_path):
    # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)
    for cid in pubchem_ids:
        # The url, using f-string formatting -> same for all the files
        url = f"https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={{%22download%22:%22*%22,%22collection%22:%22bioactivity%22,%22order%22:[%22acvalue,asc%22],%22start%22:1,%22limit%22:10000000,%22downloadfilename%22:%22pubchem_cid_{cid}_bioactivity%22,%22nullatbottom%22:1,%22where%22:{{%22ands%22:[{{%22cid%22:%22{cid}%22}}]}}}}"

        # Send a GET request to the API
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            # Define the file path
            file_path = os.path.join(folder_path, f"{cid}.csv")
            # Write the content to a CSV file
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Bioassay data for CID {cid} has been downloaded and saved to {file_path}")
        else:
            print(f"Failed to retrieve bioassay data for CID {cid}. HTTP Status Code: {response.status_code}")



#Interaction and Pathways:

In [24]:
def fetch_pathways_results(pathway_ids, folder_path):
  # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)
    for cid in pubchem_ids:
        # The url, using f-string formatting -> same for all the files
        url = f"https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={{\"download\":\"*\",\"collection\":\"pdb\",\"order\":[\"resolution,asc\"],\"start\":1,\"limit\":10000000,\"downloadfilename\":\"pubchem_cid_{cid}_pdb\",\"where\":{{\"ands\":[{{\"cid\":\"{cid}\"}}]}}}}"

        # Send a GET request to the API
        response = requests.get(url)

        # Check if the request was successful
        if response.status_code == 200:
            # Define the file path
            file_path = os.path.join(folder_path, f"{cid}.csv")
            # Write the content to a CSV file
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"Pathways data for CID {cid} has been downloaded and saved to {file_path}")
        else:
            print(f"Failed to retrieve pathways data for CID {cid}. HTTP Status Code: {response.status_code}")


In [None]:
#uncomment if you need to download the ids targets
folder_path = '/content/drive/MyDrive/Drug Repurposing Project/Interactions_Results'

fetch_pathways_results(pubchem_ids , folder_path)

Bioassay data for CID 5330175 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5330175.csv
Bioassay data for CID 5311340 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5311340.csv
Bioassay data for CID 11511120 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/11511120.csv
Bioassay data for CID 221354 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/221354.csv
Bioassay data for CID 6806409 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/6806409.csv
Bioassay data for CID 5329480 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/5329480.csv
Bioassay data for CID 12947 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/12947.csv
Bioassay data for CID 444810 has been downloaded and saved to /content/drive/MyDrive/IDS_Target_Result/444810.csv
Bioassay data for CID 135421339 has been downloaded and saved to /content/driv

#Now that we have all the csv files we should extract and create the dictionary for the new ids and targets.

In [None]:
#open csv and get the column names
new_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result/10219.csv')
print(new_df.columns)

Index([' baid', 'acvalue', 'aid', 'sid', 'cid', 'refsid', 'geneid', 'pmid',
       'aidtype', 'aidmdate', 'hasdrc', 'rnai', 'activity', 'protacxn',
       'acname', 'acqualifier', 'aidsrcname', 'aidname', 'cmpdname',
       'targetname', 'targeturl', 'ecs', 'repacxn', 'taxid', 'cellids',
       'targettaxid', 'anatomyid', 'anatomy', 'dois', 'pmcids', 'pclids',
       'citations'],
      dtype='object')


# **NOTE**
We need to filter the targets based on the target tax id to get the only human ones which is {"targettaxid":9606}

In [4]:
# Folder containing the CSV files
folder_path = '/content/drive/MyDrive/Drug Repurposing Project/Interactions_Results'
folder_path_interaction = '/content/drive/MyDrive/Drug Repurposing Project/Human_Interaction_Targets'

# Create the target folder if it doesn't exist
os.makedirs(folder_path_interaction, exist_ok=True)
missing_taxid = []
for file in os.listdir(folder_path):
    if file.endswith('.csv'):
        file_path = os.path.join(folder_path, file)
        interaction_file_path = os.path.join(folder_path_interaction, file)

        # Read CSV and clean up column names (strip spaces to prevent key errors)
        df = pd.read_csv(file_path)
        df.columns = df.columns.str.strip()  # Ensures column names are correctly formatted
        #check if the column exists
        if 'taxid' not in df.columns:
          missing_taxid.append(file)
        else:
          # Filter rows where targettaxid is 9606 (Homo sapiens)
          df_filtered = df[df['taxid'] == 9606]

          # Save the filtered version to the interaction folder
          df_filtered.to_csv(interaction_file_path, index=False)

          # Optionally, save the original file again if needed (this step can be omitted if the file should remain unchanged)
          # df.to_csv(file_path, index=False)

          print(f"Original saved: {file_path}")
          print(f"Filtered and saved: {interaction_file_path}")


In [None]:
print(missing_taxid)

['5330175.csv', '5311340.csv', '11511120.csv', '221354.csv', '6806409.csv', '5329480.csv', '12947.csv', '444810.csv', '135421339.csv', '9939609.csv', '42627755.csv', '53464483.csv', '3647519.csv', '9810709.csv', '6413301.csv', '119081415.csv', '5311382.csv', '53315868.csv', '10219.csv', '9914412.csv', '2993.csv', '24756910.csv', '6918097.csv', '4534086.csv', '73416445.csv', '132928.csv', '5281035.csv', '121750.csv', '9852185.csv', '51000408.csv', '73602827.csv', '3499.csv', '9809926.csv', '41867.csv', '6918837.csv', '24858111.csv', '33630.csv', '2798.csv', '441074.csv', '4735.csv', '3034034.csv', '676352.csv', '49855250.csv', '154257.csv', '24795070.csv', '11978790.csv', '439530.csv', '6445562.csv', '9829526.csv', '5583.csv', '446536.csv', '46848036.csv', '9549305.csv', '25150857.csv', '53315882.csv', '6197.csv', '51358113.csv', '16007391.csv', '9829836.csv', '51031035.csv', '452192.csv', '30323.csv', '11785878.csv']


#Biological Essay Download Code

In [6]:
#create an empty dictionary
bioessay_dict = {}
#change the path to be the Human Target ones
folder_path = folder_path

#go through all the csv files in the folder, have the name of the file as key and the column 'targetname' as value
for file in os.listdir(folder_path):
  if file.endswith('.csv'):
    file_path = os.path.join(folder_path , file)
    df = pd.read_csv(file_path)
    file = file.replace('.csv', '')  # Remove .csv from the key
    #if the value is equal to nan or not a string remove it
    df = df[df['targetname'].notna()]
    #modify the bioessay_dict in a way that the values are only unique elements -> use set() function
    bioessay_dict[file] = list(set(df['targetname']))

    #step 2: remove dashes (-) from the list of the values
    for i in range(len(bioessay_dict[file])):
      if '-' in bioessay_dict[file][i]:
        bioessay_dict[file][i] = bioessay_dict[file][i].replace('-', '')
        print(bioessay_dict[file][i])  # for debugging -> its working


KeyError: 'targetname'

# '''
#BIG ASSUMPTION: genes are in capslock!
'''


In [7]:
#WHATS bioessay_dict: A dictionary such that keys are pubchem ids and values are a list of target names
#WHAT TO DO: search for the accesion id for each target name

'''
#BIG ASSUMPTION: genes are in
'''

for key in bioessay_dict:
  bioessay_dict[key] = [s.split()[0] for s in bioessay_dict[key]]
  print(bioessay_dict[key])

In [8]:
bioessay_dict

{}

In [None]:
#to see if its working
print(bioessay_dict['5330175'])
print(bioessay_dict.keys())
#print(bioessay_dict.values()[61])
print(list(bioessay_dict.values())[0])
print(len(list(bioessay_dict.values())[0]))
print(type(bioessay_dict['5330175']))

for i in bioessay_dict['5330175']:
  print(i)
  print(type(i))

['IFNB1', 'NTMT1', 'Tyrosineprotein', 'KDR', 'Chain', 'GNMT', 'GAMT', 'COMT', 'SRC', 'Tyrosineprotein', 'IP6K1', 'GPX4', 'Tyrosineprotein', 'Tyrosineprotein', 'CYP3A7', 'Tyrosineprotein', 'NNMT', 'PNMT', 'NSD2', 'Protooncogene', 'GPX1', 'GPT2', 'CYP3A4', 'HNMT', 'Tyrosineprotein', 'CIB1', 'FH', 'Proteintyrosine', 'CYP2D6', 'CYP2C9', 'Histone']
dict_keys(['444810', '135421339', '9939609', '42627755', '53464483', '9810709', '3647519', '6413301', '119081415', '5311382', '53315868', '10219', '9914412', '2993', '24756910', '6918097', '4534086', '73416445', '132928', '5281035', '121750', '9852185', '51000408', '73602827', '9809926', '41867', '6918837', '24795070', '11978790', '439530', '6445562', '9829526', '5583', '446536', '46848036', '9549305', '25150857', '53315882', '6197', '51358113', '16007391', '9829836', '51031035', '452192', '30323', '11785878', '5330175', '5311340', '11511120', '6806409', '221354', '5329480', '12947', '24858111', '33630', '2798', '441074', '4735', '3034034', '6763

#zip the gene name with the accession id for the blast part

In [9]:
import csv

def create_csv(file_path, data):
    """
    Create a CSV file from a dictionary where:
    - The keys are integers (PubChem IDs).
    - The values are lists of strings (Target Names).

    Parameters:
    - file_path: The path to save the CSV file.
    - data: The dictionary containing the data.
    """
    headers = ["PubChem ID", "Target Names", "Accession IDs", "Target Gene Name"]

    # Open the file for writing
    with open(file_path, mode="w", newline="") as csvfile:
        writer = csv.writer(csvfile)

        # Write the header row
        writer.writerow(headers)

        # Process and write each entry in the data dictionary
        for pubchem_id, target_names in data.items():
            # Filter Target Gene Names: Keep only uppercase or numeric elements
            target_gene_name = [name for name in target_names if name.isupper() or name.isdigit()]

            # Write the data to the CSV
            writer.writerow([pubchem_id, target_names, "", target_gene_name])



In [10]:
file_path = '/content/drive/MyDrive/Drug Repurposing Project/Human_target_results.csv'
create_csv(file_path , bioessay_dict)

In [11]:
df = pd.read_csv(file_path)

In [12]:
df.head(63)

Unnamed: 0,PubChem ID,Target Names,Accession IDs,Target Gene Name


In [None]:
print(type(df['Target Names'][53]))

<class 'str'>


In [None]:
df['Target Gene Name'][22]

"['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'CA9', 'EWS/FLI', 'GNMT', 'GAMT', 'CGAS', 'COMT', 'ATR', 'IP6K1', 'TP53', 'GPX4', 'CA12', 'SNCA', 'CYP3A7', 'PRKDC', 'ALOX12', 'PNMT', 'NSD2', 'NNMT', 'GPX1', 'GPT2', 'CYP3A4', 'ATXN2', 'CA2', 'ATM', 'HNMT', 'CA1', 'CIB1', 'FH', 'CYP2D6', 'CYP2C9']"

In [None]:
df['Target Names'][22]

"['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'Glycogen', 'CA9', 'Serine/threonineprotein', 'Chain', 'EWS/FLI', 'Peregrin', 'GNMT', 'GAMT', 'Bromodomaincontaining', 'Aurora', 'CGAS', 'Transcription', 'COMT', 'Fibroblast', 'ATR', 'IP6K1', 'TP53', 'Casein', 'GPX4', 'CA12', 'SNCA', 'CYP3A7', 'Tyrosineprotein', 'PRKDC', 'ALOX12', 'PNMT', 'NSD2', 'NNMT', 'GPX1', 'GPT2', 'CYP3A4', 'ATXN2', 'CA2', 'ATM', 'HNMT', 'Mitogenactivated', 'CA1', 'CIB1', 'FH', 'Cyclindependent', 'CYP2D6', 'CYP2C9', 'Histone']"

In [None]:
df['Target Gene Name'][12]

"['MAPK14', 'GRK5', 'MCOLN1', 'HTT', 'GAPDH', 'IFNG', 'TRIM24', 'RET', 'NTRK1', 'AURKB', 'NTRK2', 'GAMT', 'TAOK1', 'CSNK1D', 'CAMK1D', 'CLK2', 'RGS12', 'LCK', 'IP6K1', 'FLT3', 'BLM', 'JAK2', 'PRKCQ', 'MAPK9', 'HPGD', 'ALOX12', 'PNMT', 'MAP4K4', 'MAP2K1', 'ATXN2', 'GSK3B', 'MET', 'HNMT', 'CHEK1', 'CDK2', 'KMT2A', 'CDK1', 'PRKD3', 'CSNK1A1', 'MAPT', 'DAPK3', 'CSF1R', 'PIM1', 'APOBEC3G', 'PDGFRA', 'KDR', 'BRPF1', 'DYRK1B', 'MEN1', 'LRRK2', 'FEN1', 'TYRO3', 'ABCB1', 'TBK1', 'RAF1', 'SNCA', 'NNMT', 'PDGFRB', 'GPX1', 'MAP4K2', 'IGF1R', 'AURKA', 'PLK1', 'ABL1', 'EGFR', 'IFNB1', 'BTK', 'CDK8', 'ROCK2', 'NTMT1', 'DYRK1A', 'TNK2', 'NEK2', 'EWS/FLI', 'JAK3', 'LTK', 'SMPD1', 'VDR', 'PRKAA1', 'PRKACA', 'PRKX', 'TP53', 'RPS6KA3', 'MAP4K5', 'KAT2A', 'PTK2', 'NTRK3', 'MARK3', 'FYN', 'BRD4', 'AURKC', 'CIB1', 'GBA1', 'ROCK1', 'CDK7', 'FGFR3', 'IKBKE', 'CDK9', 'CLK4', 'KCNJ6', 'LIMK1', 'ALOX15B', 'MAP2K2', 'PINK1', 'GNMT', 'GNAI1', 'CAMK2A', 'FLT1', 'CGAS', 'POLI', 'ALDH1A1', 'COMT', 'MAPK8', 'CDK5', 'MS

#Detailed explanation of the data handling:
- some of the values (target_names) do not contain gene name. Just to keep the track of these entries, we will add a new column in the csv file ['gene name'] and the corresponding values would be 1 if it exists and if not 0

In [None]:
for key in bioessay_dict:
  print(key)
  print(bioessay_dict[key])

444810
['Homo']
135421339
['Phosphatidylinositol', 'AKT3', 'Protooncogene', 'Calcium/calmodulindependent', 'Insulin', 'Bone', 'TRIM24', 'Serine/threonineprotein', 'PRKAB1', 'MAP2K7', 'Serine/threonineprotein', 'Serine/threonineprotein', 'Aurora', 'Phosphatidylinositol', '[Pyruvate', 'Interferoninduced,', 'Cyclindependent', 'Serine/threonineprotein', 'CSNK1D', 'Activated', 'Hepatocyte', 'MyosinIIIb', 'Tyrosineprotein', 'EPHA7', 'Serine/threonineprotein', 'Serine/threonineprotein', 'Mitogenactivated', 'Serine/threonineprotein', 'MAP2K4', 'Mitogenactivated', 'CLK3', 'Serine/threonineprotein', 'Misshapenlike', 'MAP2K1', 'NQO2', 'GSK3B', 'Phosphatidylinositol', 'Protein', 'Serine/threonineprotein', 'Wee1like', 'Leucinerich', 'Insulinlike', 'BMX', 'Mitogenactivated', 'Dual', 'PIM1', 'Serine/threonineprotein', 'Serine/threonineprotein', 'Serine/threonineprotein', 'NEK7', 'TLK1', 'MAPK10', 'Mitogenactivated', 'TGFbeta', 'Cyclindependent', 'Tyrosineprotein', 'Transient', 'Receptorinteracting', 

### **Why did we decide to remove the '-' from the target names?**
1. **Removing the Dash (`-`) Broadens the Search Scope**:
   - Correct: Removing the dash makes the query less strict, allowing UniProt to search for terms independently rather than as a single specific entity.
   - With the dash, UniProt interprets the query as a combined condition (`DRD1 - dopamine receptor D1`), which might not match any entry directly or narrows the search results significantly.

2. **Dashes in Names Indicate Context, Not the Full Name**:
   - Partially correct: While the dash *can* separate parts of the query (e.g., gene name vs. organism name), its meaning is dependent on the context. For example:
     - `"DRD1 - dopamine receptor D1 (human)"` might be interpreted as looking for a human protein explicitly labeled in that format.
     - Without the dash (`DRD1 dopamine receptor D1`), UniProt searches for "DRD1" and "dopamine receptor D1" more freely, increasing potential matches.

3. **Why Removing the Dash Expands Results**:
   - UniProt's search treats the dash (`-`) as part of the query, which could limit results to exact matches with the dash present. When you remove the dash, the system searches for terms individually, yielding broader matches.

---

### **Refined Explanation**
- If the dash exists in your query (`DRD1 - dopamine receptor D1`), UniProt interprets it as a specific, tightly coupled phrase. This might not yield results unless an entry explicitly matches this exact structure.
- Without the dash (`DRD1 dopamine receptor D1`), UniProt treats the terms as separate keywords and attempts to find matches containing any or all of them. This is why removing the dash often yields broader and more relevant results.

---

### **Source**

The behavior you've observed when searching UniProt with and without a dash (`-`) in your query stems from how UniProt's search engine interprets special characters and query syntax. While UniProt's documentation doesn't explicitly detail the impact of dashes in search queries, we can infer the following based on general search engine behaviors and available information:

**1. Special Characters in Search Queries:**
- Search engines often treat special characters like dashes as operators or delimiters. In some contexts, a dash can signify exclusion (e.g., `term1 -term2` searches for entries containing "term1" but not "term2"). However, without explicit documentation from UniProt on this behavior, it's unclear how a dash is processed in your specific query.

**2. Broadening Search Results:**
- Removing the dash from your query (`DRD1 dopamine receptor D1`) allows the search engine to interpret the terms more flexibly, potentially returning a broader set of results. This approach aligns with general search practices where simplifying queries can yield more comprehensive results.

**3. Organism Specification:**
- Including terms like "human" in your query helps specify the organism of interest. UniProt entries often include the organism name, so adding this term can refine your search to entries related to human proteins.

**Recommendations:**
- **Simplify Queries:** Use straightforward terms without special characters unless you're certain of their function in the search syntax.
- **Specify Organism:** Include the organism name (e.g., "human") to narrow down results to the species of interest.
- **Consult Documentation:** For complex queries, refer to UniProt's [Advanced Search Help](https://www.uniprot.org/help/advanced_search) for guidance on query syntax and field-specific searches.


# About the Names:  
Perform the search first using only the first word in the name, and then run the search again using the full name.


#Interactions and Pathways Download Code

In [None]:
'''
#open csv and get the column names
new_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result_Interaction/10219.csv')
print(new_df.columns)
'''

"\n#open csv and get the column names\nnew_df = pd.read_csv('/content/drive/MyDrive/IDS_Target_Result_Interaction/10219.csv')\nprint(new_df.columns)\n"

- we have to work with the pdbid column and

In [None]:
#for interactions we will use the pdbid column to download the pdb files from the rcsb website

# Summary of Changes  
We removed `.csv` from the keys, ensuring that the list of values for each key is unique, and removed dashes from the values in the target name list.

# Next Step  
1. Create a folder named after each ID (key in the dictionary).  
2. For each value in the list corresponding to a key, download the UniProt sequence file (using the first entry in the search query).  
3. Place the downloaded files into the respective folder.  

The final result should be 63 folders, each containing multiple UniProt sequence files.

In [None]:
'''

https://www.uniprot.org/uniprotkb/P21964/entry


https://www.uniprot.org/uniprotkb?query=COMT+-+catecholOmethyltransferase+%28human%29

https://www.uniprot.org/uniprotkb?query=SYK+spleen+associated+tyrosine+kinase+%28human%29

https://www.uniprot.org/uniprotkb/P61073/entry

'''

'\n\nhttps://www.uniprot.org/uniprotkb/P21964/entry\n\n\nhttps://www.uniprot.org/uniprotkb?query=COMT+-+catecholOmethyltransferase+%28human%29\n\nhttps://www.uniprot.org/uniprotkb?query=SYK+spleen+associated+tyrosine+kinase+%28human%29\n\nhttps://www.uniprot.org/uniprotkb/P61073/entry\n\n'

#UniProt API

In [None]:
# Step 1: Function to search UniProt and get the accession ID
#NEW APPROAH:
def search_uniprot(query):
    """
    Search UniProt for a given query and return the accession ID of the first result.
    """
    base_search_url = "https://rest.uniprot.org/uniprotkb/search"
    search_params = {
        "query": query,
        "fields": "accession",
        "size": 1,  # Limit to the first result
    }
    response = requests.get(base_search_url, params=search_params)

    if response.status_code == 200:
        results = response.json().get("results", [])
        if results:
            accession = results[0]["primaryAccession"]
            print(f"Found entry: {accession}")
            return accession
        else:
            print("No results found.")
            return None
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None



#modifications should be deleted later

In [None]:

'''
#here is the code we dont need to use now since we dont need to download the sequences we can blast only with the accesion id


# Step 2: Function to download the FASTA sequence using accession ID
def download_uniprot_sequence(accession, folder_path):
    """
    Download the FASTA sequence for a given accession ID and save it in the specified folder.
    """
    # Ensure the folder exists
    os.makedirs(folder_path, exist_ok=True)

    # Construct the file path
    file_path = os.path.join(folder_path, f"{accession}.fasta")

    # Base URL for UniProt sequence download
    base_url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    response = requests.get(base_url)

    if response.status_code == 200:
        # Save the sequence to the specified folder
        with open(file_path, "w") as fasta_file:
            fasta_file.write(response.text)
        print(f"Sequence saved at {file_path}")
    else:
        print(f"Error: {response.status_code}, {response.text}")

# Step 3: Combined function to search and download
def get_uniprot_fasta(query, folder_path):
    """
    Search UniProt for a query, retrieve the first result's accession ID,
    and download the FASTA sequence to the specified folder.
    """
    # Search the entry
    accession = search_uniprot(query)
    if accession:
        # Download the sequence
        download_uniprot_sequence(accession, folder_path)
'''

In [None]:
df

Unnamed: 0,PubChem ID,Target Names,Accession IDs,Target Gene Name
0,444810,['Homo'],,[]
1,135421339,"['Phosphatidylinositol', 'AKT3', 'Protooncogen...",,"['AKT3', 'TRIM24', 'PRKAB1', 'MAP2K7', 'CSNK1D..."
2,9939609,"['Homo', 'NTMT1', 'IFNG', 'Chain', 'GNMT', 'GA...",,"['NTMT1', 'IFNG', 'GNMT', 'GAMT', 'COMT', 'IP6..."
3,42627755,"['IFNB1', 'NTMT1', 'Chain', 'GNMT', 'GAMT', 'C...",,"['IFNB1', 'NTMT1', 'GNMT', 'GAMT', 'COMT', 'IP..."
4,53464483,"['NR1I2', 'TRPC6', 'NTMT1', 'Chain', 'GNMT', '...",,"['NR1I2', 'TRPC6', 'NTMT1', 'GNMT', 'TRPC3', '..."
...,...,...,...,...
58,3034034,"['NR1I2', 'HTT', 'IFNG', 'HDAC9', 'AHR', 'Solu...",,"['NR1I2', 'HTT', 'IFNG', 'HDAC9', 'AHR', 'GAMT..."
59,676352,"['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'RAD52', 'Ch...",,"['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'RAD52', 'EW..."
60,49855250,"['IFNB1', 'KCNJ6', 'Homo', 'NTMT1', 'IFNG', 'H...",,"['IFNB1', 'KCNJ6', 'NTMT1', 'IFNG', 'HDAC9', '..."
61,154257,"['Prostaglandin', 'Neuronal', 'IFNG', 'TRIM24'...",,"['IFNG', 'TRIM24', 'CSNK1D', 'LIFR', 'GSK3B', ..."


In [None]:
'''
#example
protein_name = "LASVsLgp1"
accession_id = search_uniprot(protein_name)
print(f"Accession ID for {protein_name}: {accession_id}")
'''

#we wanna add the accesion_id to the csv (df)


def create_csv(file_path, data):
    """
    Create a CSV file from a dictionary where:
    - The keys are integers (PubChem IDs).
    - The values are lists of strings (Target Names).

    Parameters:
    - file_path: The path to save the CSV file.
    - data: The dictionary containing the data.
    """
    headers = ["PubChem ID", "Target Names", "Accession IDs", "Target Gene Name"]

    # Open the file for writing
    with open(file_path, mode="w", newline="") as csvfile:
        writer = csv.writer(csvfile)

        # Write the header row
        writer.writerow(headers)

        # Process and write each entry in the data dictionary
        for pubchem_id, target_names in data.items():
            # Filter Target Gene Names: Keep only uppercase or numeric elements
            target_gene_name = [name for name in target_names if name.isupper() or name.isdigit()]
            accession_ids = [search_uniprot(name) for name in target_names]
            # Write the data to the CSV
            writer.writerow([pubchem_id, target_names, accession_ids , target_gene_name])


In [None]:
file_path = '/content/drive/MyDrive/Human_target_results.csv'
create_csv(file_path , bioessay_dict)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Found entry: L8GNA7
Found entry: P05067
Found entry: A0A0K2VLS4
Found entry: Q42512
Found entry: Q9ZNZ7
Found entry: Q84128
Found entry: L8GZV5
No results found.
Found entry: Q84128
Found entry: L8GL10
Found entry: P45452
Found entry: A0A0P1AX45
Found entry: Q9P206
Found entry: O95278
Found entry: Q9ZNZ7
Found entry: L8HHY1
Found entry: L8GN38
Found entry: A0A0K2VLS4
Found entry: A0A0M4BXD2
Found entry: L8GN38
Found entry: H3BU54
Found entry: A0A2P2LEY3
Found entry: A0A1V9YMQ6
Found entry: A0A1V9YMQ6
Found entry: L8GZV5
No results found.
Found entry: L8GL10
Found entry: O14757
Found entry: L8GN38
Found entry: A0A1V9YMQ6
Found entry: P01579
Found entry: A0A1V9YMQ6
Found entry: P9WPP9
Found entry: A0A1V9YMQ6
Found entry: A0A2P2LL28
Found entry: L8GIY7
Found entry: A0A0K2VLS4
Found entry: L8GN38
Found entry: A0A0K2VLS4
Found entry: Q9T0I8
Found entry: A0A0K2VLS4
Found entry: O95278
Found entry: A0A0P1A9P5
Found entry: P61996

In [None]:
df = pd.read_csv(file_path)

In [None]:
df

Unnamed: 0,PubChem ID,Target Names,Accession IDs,Target Gene Name
0,444810,['Homo'],['W8R5U2'],[]
1,135421339,"['Phosphatidylinositol', 'AKT3', 'Protooncogen...","['O08967', 'P56279', 'Q95M86', 'L8GL10', 'P013...","['AKT3', 'TRIM24', 'PRKAB1', 'MAP2K7', 'CSNK1D..."
2,9939609,"['Homo', 'NTMT1', 'IFNG', 'Chain', 'GNMT', 'GA...","['W8R5U2', 'S4R3J7', 'P01579', 'P0C023', 'Q147...","['NTMT1', 'IFNG', 'GNMT', 'GAMT', 'COMT', 'IP6..."
3,42627755,"['IFNB1', 'NTMT1', 'Chain', 'GNMT', 'GAMT', 'C...","['P01574', 'S4R3J7', 'P0C023', 'Q14749', 'A0A0...","['IFNB1', 'NTMT1', 'GNMT', 'GAMT', 'COMT', 'IP..."
4,53464483,"['NR1I2', 'TRPC6', 'NTMT1', 'Chain', 'GNMT', '...","['Q9CRZ0', 'Q99N78', 'S4R3J7', 'P0C023', 'Q147...","['NR1I2', 'TRPC6', 'NTMT1', 'GNMT', 'TRPC3', '..."
...,...,...,...,...
58,3034034,"['NR1I2', 'HTT', 'IFNG', 'HDAC9', 'AHR', 'Solu...","['Q9CRZ0', 'H0YA07', 'P01579', 'A0A0R4J1F3', '...","['NR1I2', 'HTT', 'IFNG', 'HDAC9', 'AHR', 'GAMT..."
59,676352,"['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'RAD52', 'Ch...","['P01574', 'S4R3J7', 'H0YA07', 'P01579', 'Q3MI...","['IFNB1', 'NTMT1', 'HTT', 'IFNG', 'RAD52', 'EW..."
60,49855250,"['IFNB1', 'KCNJ6', 'Homo', 'NTMT1', 'IFNG', 'H...","['P01574', 'P48051', 'W8R5U2', 'S4R3J7', 'P015...","['IFNB1', 'KCNJ6', 'NTMT1', 'IFNG', 'HDAC9', '..."
61,154257,"['Prostaglandin', 'Neuronal', 'IFNG', 'TRIM24'...","['G5EFH8', 'P04629', 'P01579', 'O15164', 'P353...","['IFNG', 'TRIM24', 'CSNK1D', 'LIFR', 'GSK3B', ..."


In [None]:
'''
This is the pipeline of making the folders and downloading the files
'''

FolderPath = '/content/drive/MyDrive/ExamplePath'

first_key, first_value = next(iter(bioessay_dict.items()))

bioessay_dict[first_key] = first_value

print(bioessay_dict)
print(len(first_value))
print(len(first_key))

{'5330175': ['ORF1ab  ORF1a polyprotein;ORF1ab polyprotein (Severe acute respiratory syndrome coronavirus 2)', 'NTMT1  Nterminal XaaProLys Nmethyltransferase 1 (human)', 'Chain A, Calcium and integrinbinding protein 1 (human)', 'SRC  SRC protooncogene, nonreceptor tyrosine kinase (human)', 'HNMT  histamine Nmethyltransferase (human)', 'Canis lupus familiaris (dog)', 'FYN  FYN protooncogene, Src family tyrosine kinase (human)', 'GNMT  glycine Nmethyltransferase (human)', 'Mus musculus (house mouse)', 'CSK  Cterminal Src kinase (human)', 'KDR  kinase insert domain receptor (human)', 'CIB1  calcium and integrin binding 1 (human)', 'S  surface glycoprotein (Severe acute respiratory syndrome coronavirus 2)', 'NNMT  nicotinamide Nmethyltransferase (human)', 'CYP2C9  cytochrome P450 family 2 subfamily C member 9 (human)', 'Severe acute respiratory syndrome coronavirus 2', 'Rattus norvegicus (Norway rat)', 'PTK6  protein tyrosine kinase 6 (human)', 'PNMT  phenylethanolamine Nmethyltransferase 

In [None]:
#go through the dictionary
for key in bioessay_dict:
    #create a folder with the key name in FolderPath
    folder_path = os.path.join(FolderPath, key)
    #create the folder
    os.makedirs(folder_path, exist_ok=True)
    #after making folder download the values in the dict
    for value in bioessay_dict[key]:
      for entry in value:
        get_uniprot_fasta(entry, folder_path)



Found entry: Q6E7F2
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/Q6E7F2.fasta
Found entry: A0A4P7VJP0
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/A0A4P7VJP0.fasta
Found entry: M9MSB2
Sequence saved at /content/drive/MyDrive/ExamplePath/5330175/M9MSB2.fasta


KeyboardInterrupt: 

In [None]:
ppath = '/content/drive/MyDrive/ExamplePath/5330175'

#check how many files are ther in ppath
print(len(os.listdir(ppath)))

33


#errors and problems:
1. downloading pdb files for each interaction and pathways material
2. some of the interaction and pathways return empty csv files while manually its not the case
3. some interaction and pathways pdb results are mostly protein complexes, how do we solve or extract the target to do the blast?


# **Do the blast**
### Steps:
1. make a dictionary with the name of the uniprot entries and their ids
OR
2. add a column to the csv with the uniprot ids
3. run the blast in uniprot against plasmodium malariae (5858) and save the results

#✅put all the targets from interaction and pathways file together:
- after we downloaded the whole thing, we will add them to one big csv file so we can go through them and download


In [None]:

# We are combining all interaction and pathway files along with PubChem IDs and other relevant information, we will also focus on using the target PDB ID for further processing.
Interaction_folder_path = '/content/drive/MyDrive/IDS_Target_Result_Interaction'

# 🔄 Combine all CSV files into one DataFrame
combined_df = pd.DataFrame()

for filename in os.listdir(Interaction_folder_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(Interaction_folder_path, filename)
        df = pd.read_csv(file_path)
        df['source_file'] = filename  # Optional: keep track of original source
        combined_df = pd.concat([combined_df, df], ignore_index=True)

# ➕ Add pubchem_id by removing '.csv' and converting to integer
combined_df['pubchem_id'] = combined_df['source_file'].str.replace('.csv', '', regex=False).astype(int)

# 💾 Save the combined DataFrame to a new CSV
output_path = os.path.join(Interaction_folder_path, 'combined_targets.csv')
combined_df.to_csv(output_path, index=False)

print(f"✅ Combined CSV saved to: {output_path}")