# Leveraging Machine Learning To Predict Binding Affinity Of Drugs To Their Target Based Only On Their Structural Embeddings.

# 1. Data Filtering for important compounds
This notebook covers data filtering and cleaning. The data file lists too many compounds and interactions to work with in a simple project, so I will trim the set to only those interactions for which data is easy to find on ChEMBL.

In [2]:
import pandas as pd
df = pd.read_csv('data/DTC_data.csv', low_memory=True)



In [3]:
# check the columns
df.columns

Index(['compound_id', 'standard_inchi_key', 'compound_name', 'synonym',
       'target_id', 'target_pref_name', 'gene_names', 'wildtype_or_mutant',
       'mutation_info', 'pubmed_id', 'standard_type', 'standard_relation',
       'standard_value', 'standard_units', 'activity_comment',
       'ep_action_mode', 'assay_format', 'assaytype', 'assay_subtype',
       'inhibitor_type', 'detection_tech', 'assay_cell_line',
       'compound_concentration_value', 'compound_concentration_value_unit',
       'substrate_type', 'substrate_relation', 'substrate_value',
       'substrate_units', 'assay_description', 'title', 'journal', 'doc_type',
       'annotation_comments'],
      dtype='object')

In [5]:
# Subset the following columns for now, as they represent the drug-target interaction
df_set = df[['compound_id', 'standard_inchi_key', 
       'target_id', 'gene_names', 'wildtype_or_mutant', 'standard_type', 'standard_relation',
       'standard_value', 'standard_units']]

In [6]:
df_set.isna().sum() 
# check the null values associated with these columns to decide which columns to drop

compound_id            134222
standard_inchi_key      91914
target_id               14833
gene_names            1220291
wildtype_or_mutant    5805369
standard_type             350
standard_relation     2288713
standard_value         378702
standard_units         458412
dtype: int64

In [7]:
df_set.head()

Unnamed: 0,compound_id,standard_inchi_key,target_id,gene_names,wildtype_or_mutant,standard_type,standard_relation,standard_value,standard_units
0,CHEMBL3545284,,Q9Y4K4,MAP4K5,,KDAPP,=,19155.14,NM
1,CHEMBL3545284,,Q9Y478,PRKAB1,,KDAPP,=,1565.72,NM
2,CHEMBL3545284,,Q9Y2U5,MAP3K2,,KDAPP,=,746.77,NM
3,CHEMBL3545284,,Q9Y2K2,SIK3,,KDAPP,=,13558.67,NM
4,CHEMBL3545284,,Q9UL54,TAOK2,,KDAPP,=,2220.98,NM


## Selecting concentration values

I will keep only those compounds whose standard type is reported as Kd or Ki. I will also keep only measurements whose units are NM, MM or NMOL/L and whose standard relation is '='. I would rather not work with mutated proteins, so I will filter those out as well.

In [8]:
# filtering with boolean masks

df_set = df_set[(df_set.standard_type == 'KD') | (df_set.standard_type == 'Kd')| 
                (df_set.standard_type == 'KI') | (df_set.standard_type == 'Ki')]

df_set = df_set[(df_set.standard_units == 'NM')|(df_set.standard_units == 'MM') | 
                (df_set.standard_units == 'NMOL/L')]

df_set = df_set[(df_set.standard_relation == '=')]
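The unit filter keeps NM, MM and NMOL/L rows in the same `standard_value` column, so the numbers are only comparable after they are put on one scale. A minimal sketch of that conversion follows, assuming that `MM` means millimolar (the data file itself does not confirm this) and using hypothetical helper names `to_nanomolar` and `p_affinity`. The optional log transform is a common way to compress the wide range of Kd/Ki values, not something the original notebook does.

```python
import math

# Multipliers to nanomolar. Reading 'MM' as millimolar is an assumption.
TO_NM = {"NM": 1.0, "NMOL/L": 1.0, "MM": 1e6}


def to_nanomolar(value, units):
    """Convert a concentration to nM using the assumed unit table above."""
    return value * TO_NM[units.upper()]


def p_affinity(value_nm):
    """Return -log10 of the concentration in mol/L (pKd or pKi style)."""
    return 9.0 - math.log10(value_nm)


# Example: a 2.5 'MM' measurement and a 1000 nM measurement
print(to_nanomolar(2.5, "MM"))
print(p_affinity(1000.0))
```

With pandas, the same table could be applied through `df_set.standard_units.map(TO_NM)` before multiplying by `standard_value`.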


In [9]:
df_set = df_set[(df_set.wildtype_or_mutant != 'mutated')]

In [10]:
# drop those rows where there are null values for all columns
df_set.dropna(how='all', inplace=True)

In [11]:
df_set.drop_duplicates(inplace=True)

In [12]:
df_set.info()

<class 'pandas.core.frame.DataFrame'>
Index: 443163 entries, 179 to 5980809
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   compound_id         414330 non-null  object 
 1   standard_inchi_key  432589 non-null  object 
 2   target_id           437324 non-null  object 
 3   gene_names          314300 non-null  object 
 4   wildtype_or_mutant  12229 non-null   object 
 5   standard_type       443163 non-null  object 
 6   standard_relation   443163 non-null  object 
 7   standard_value      443163 non-null  float64
 8   standard_units      443163 non-null  object 
dtypes: float64(1), object(8)
memory usage: 33.8+ MB


In [13]:
# look at how many unique compounds are present
df_set.compound_id.nunique()

164662

In [14]:
# drop those rows where there is null value associated with target, 
# no point keeping a compound if there is no target data associated with it
df_set.dropna(axis=0, subset=['target_id', 'gene_names'], inplace=True)

In [15]:
imp_comp = set(df_set.compound_id.values)

In [16]:
len(imp_comp)

129482

In [17]:
''' Will save these compounds to extract chemical information for them in a different notebook.
'''
with open('cleaned_data/imp_comp.txt', 'w') as f:
    for item in imp_comp:
        f.write("%s\n" % item)

├── concentration value       <- Filtered compounds based on their concentration type.
│   ├── target type           <- Removed mutant proteins from the dataset.
│   └── dropping values       <- Dropped duplicates, nulls, and columns not used for analysis.

# 2. Smiles retrieval

In this notebook we will retrieve the SMILES information from the ChEMBL database. I will be using their Python web resource client to retrieve this information, but it can be retrieved much more easily by installing a SQL client and querying their database directly. The SMILES information will be the structural feature we work with.

First we load the data. We will remove all compounds for which SMILES information is not available. It may be available in other databases, but for this project I would prefer to work strictly with ChEMBL data.

In [1]:
import pandas as pd

with open('cleaned_data/imp_comp.txt', 'r') as file:
    lines = file.readlines()

# Strip newline characters
lines = [line.strip() for line in lines]

# Convert to DataFrame
comp = pd.DataFrame(lines, columns=['chem_id'])

In [2]:
comp.head()

Unnamed: 0,chem_id
0,CHEMBL1914462
1,CHEMBL2169974
2,CHEMBL434959
3,CHEMBL24352
4,CHEMBL461667


## Load chembl client and download data

In [3]:
import logging
from operator import itemgetter
from IPython.display import Image, display
from chembl_webresource_client.new_client import new_client

In [4]:
'''
We will look at all the available resources in the chembl database and how many molecules are present in the database
'''


available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
print(available_resources)
print(len(available_resources))

molecule = new_client.molecule
molecule.set_format('json')
print("%s molecules available in ChEMBL" % len(molecule.all()))

35
2431025 molecules available in ChEMBL


In [5]:
# we can get information about a drug by using the compound id
record = molecule.get('CHEMBL3904876') # testing with an example ChEMBL ID

In [6]:
# the record is a dictionary containing chemical information on the molecules
type(record), record.keys()

(dict,

In [7]:
# the smiles information we need is in the following keys
record['molecule_structures']['canonical_smiles']

'CCc1c(N)ncnc1N1CCC(c2nc(-c3ccnc(OC)c3)cn2CCN2CCC2)CC1'

In [8]:
# we will create a new column to hold smiles values
comp['Smiles'] = ''
comp.head()

Unnamed: 0,chem_id,Smiles
0,CHEMBL1914462,
1,CHEMBL2169974,
2,CHEMBL434959,
3,CHEMBL24352,
4,CHEMBL461667,


In [9]:
'''
We run the following code to get the smiles for our compound
'''

# import time

# start = 60000  # row to start (or resume) from

# for ind in range(start, len(comp)):
#     chem_id = comp.loc[ind, 'chem_id']
#     try:
#         record = molecule.get(chem_id)
#         comp.loc[ind, 'Smiles'] = record['molecule_structures']['canonical_smiles']
#     except Exception:
#         # no record, or no structure, for this ID
#         comp.loc[ind, 'Smiles'] = 'None'
#     if (ind + 1) % 1000 == 0:
#         time.sleep(5)  # pause between batches of requests
#         print(f"Done with {ind + 1} compounds")

'\nWe run the following code to get the smiles for our compound\n'
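Fetching one compound per request is slow for more than a hundred thousand IDs. A rough sketch of a batched alternative follows. The helper names `chunked` and `fetch_smiles` and the `fetch_batch` callable are hypothetical. With the real client, `fetch_batch` could wrap something like `molecule.filter(molecule_chembl_id__in=batch)`, which is an assumption about how you would wire it up rather than what this notebook ran. The stand-in fetcher below only imitates the record layout seen earlier.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_smiles(ids, fetch_batch, batch_size=50):
    """Map each ID to its canonical SMILES, or None when nothing is returned."""
    smiles = {chem_id: None for chem_id in ids}
    for batch in chunked(ids, batch_size):
        for record in fetch_batch(batch):
            # 'molecule_structures' can be missing or None for some records
            structures = record.get("molecule_structures") or {}
            smiles[record["molecule_chembl_id"]] = structures.get("canonical_smiles")
    return smiles


def fake_fetch(batch):
    """Stand-in for the ChEMBL client, used only for this demonstration."""
    return [
        {"molecule_chembl_id": cid, "molecule_structures": {"canonical_smiles": "CCO"}}
        for cid in batch
        if cid != "CHEMBL_MISSING"
    ]


print(fetch_smiles(["CHEMBL1", "CHEMBL2", "CHEMBL_MISSING"], fake_fetch, batch_size=2))
```

Keeping `None` for unresolved IDs, instead of the string `'None'`, makes it easy to drop them later with `dropna`.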

Once we have collected all the SMILES information, we will create a new dataframe that holds the molecular fingerprint for each SMILES string. The fingerprints will help our model because they contain binary information about each molecule's substructures. We will then merge them into a single file, which we will save later.

In [11]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem

# Load your DataFrame from a CSV file (uncomment and set the correct path)
# df_test_k_sep = pd.read_csv('path_to_your_file.csv')

# For demonstration purposes, I'll create a sample DataFrame
# Replace this with your actual DataFrame loaded from the CSV
data = {
    "Compound_SMILES": [
        "CCO",       # Ethanol
        "CC(=O)O",   # Acetic acid
        "CCN(CC)CC", # Triethylamine
        "C1=CC=CC=C1" # Benzene
    ]
}
df_test_k_sep = pd.DataFrame(data)

# Create a list holding the ECFP4 bits (as integers, not '0'/'1' characters) for each compound
ECFP4 = [
    [int(bit) for bit in AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=1024).ToBitString()]
    for smiles in df_test_k_sep["Compound_SMILES"]
]

# Turn the list into a DataFrame
ecfp4_df = pd.DataFrame(ECFP4, columns=[f'Bit_{i}' for i in range(1024)])

# Optionally, you can add the SMILES strings back to the DataFrame
ecfp4_df['Compound_SMILES'] = df_test_k_sep["Compound_SMILES"]

# Display the resulting DataFrame
print(ecfp4_df)


  Bit_0 Bit_1 Bit_2 Bit_3 Bit_4 Bit_5 Bit_6 Bit_7 Bit_8 Bit_9  ... Bit_1015  \
0     0     0     0     0     0     0     0     0     0     0  ...        0   
1     0     0     0     0     0     0     0     0     0     0  ...        0   
2     0     0     0     0     0     0     0     0     0     0  ...        0   
3     0     0     0     0     0     0     0     0     0     0  ...        0   

  Bit_1016 Bit_1017 Bit_1018 Bit_1019 Bit_1020 Bit_1021 Bit_1022 Bit_1023  \
0        0        0        0        0        0        0        0        0   
1        0        1        0        0        0        0        0        0   
2        0        0        0        0        0        0        0        0   
3        0        0        0        0        0        0        0        0   

  Compound_SMILES  
0             CCO  
1         CC(=O)O  
2       CCN(CC)CC  
3     C1=CC=CC=C1  

[4 rows x 1025 columns]


In [12]:
comp.to_csv('cleaned_data/ECFP4.csv')

├── smiles_retrieval          <- Obtained SMILES data on filtered set of compounds using chemblID

# 3. Protein Sequence

In this notebook we will collect the protein sequences of our targets, based on their UniProt IDs. I will be using a different method to get the protein sequences: making one request for each UniProt ID.

In [13]:
# load the libraries
import pandas as pd
import requests

In [14]:
'''
On the UniProt website I noticed that if I append the UniProt ID (plus ".fasta") to the URL
"https://www.uniprot.org/uniprot/", I get the sequence in text format. I will use this method
to retrieve the sequence and see what changes it needs to be usable.
'''

text = requests.get('https://www.uniprot.org/uniprot/O00238.fasta')

In [15]:
seq = text.text
seq # we look at the text

'>sp|O00238|BMR1B_HUMAN Bone morphogenetic protein receptor type-1B OS=Homo sapiens OX=9606 GN=BMPR1B PE=1 SV=1\nMLLRSAGKLNVGTKKEDGESTAPTPRPKVLRCKCHHHCPEDSVNNICSTDGYCFTMIEED\nDSGLPVVTSGCLGLEGSDFQCRDTPIPHQRRSIECCTERNECNKDLHPTLPPLKNRDFVD\nGPIHHRALLISVTVCSLLLVLIILFCYFRYKRQETRPRYSIGLEQDETYIPPGESLRDLI\nEQSQSSGSGSGLPLLVQRTIAKQIQMVKQIGKGRYGEVWMGKWRGEKVAVKVFFTTEEAS\nWFRETEIYQTVLMRHENILGFIAADIKGTGSWTQLYLITDYHENGSLYDYLKSTTLDAKS\nMLKLAYSSVSGLCHLHTEIFSTQGKPAIAHRDLKSKNILVKKNGTCCIADLGLAVKFISD\nTNEVDIPPNTRVGTKRYMPPEVLDESLNRNHFQSYIMADMYSFGLILWEVARRCVSGGIV\nEEYQLPYHDLVPSDPSYEDMREIVCIKKLRPSFPNRWSSDECLRQMGKLMTECWAHNPAS\nRLTALRVKKTLAKMSESQDIKL\n'

This is quite good. I just need to make minor changes to the text to get the information I need. We can do it in the following steps.

In [16]:
# find the index of the first newline (the end of the header line)
seq.find('\n')

110

In [17]:
protein = seq[111:] 
# save the information after the first line

In [18]:
protein.replace('\n', '') # remove all newlines so it becomes a single continuous sequence

'MLLRSAGKLNVGTKKEDGESTAPTPRPKVLRCKCHHHCPEDSVNNICSTDGYCFTMIEEDDSGLPVVTSGCLGLEGSDFQCRDTPIPHQRRSIECCTERNECNKDLHPTLPPLKNRDFVDGPIHHRALLISVTVCSLLLVLIILFCYFRYKRQETRPRYSIGLEQDETYIPPGESLRDLIEQSQSSGSGSGLPLLVQRTIAKQIQMVKQIGKGRYGEVWMGKWRGEKVAVKVFFTTEEASWFRETEIYQTVLMRHENILGFIAADIKGTGSWTQLYLITDYHENGSLYDYLKSTTLDAKSMLKLAYSSVSGLCHLHTEIFSTQGKPAIAHRDLKSKNILVKKNGTCCIADLGLAVKFISDTNEVDIPPNTRVGTKRYMPPEVLDESLNRNHFQSYIMADMYSFGLILWEVARRCVSGGIVEEYQLPYHDLVPSDPSYEDMREIVCIKKLRPSFPNRWSSDECLRQMGKLMTECWAHNPASRLTALRVKKTLAKMSESQDIKL'
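The three steps above (find the header, slice it off, join the remaining lines) can be folded into one small function. This is a minimal sketch with a hypothetical name, `fasta_to_sequence`. It also checks that the text actually starts with a FASTA header, so an error page returned for a bad ID would raise instead of being stored as a sequence.

```python
def fasta_to_sequence(fasta_text):
    """Return the sequence of the first record in a FASTA-formatted string."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not start with a FASTA header line")
    seq_lines = []
    for line in lines[1:]:
        if line.startswith(">"):
            break  # stop at the next record, if there is one
        seq_lines.append(line.strip())
    return "".join(seq_lines)


# Shortened, made-up record in the same layout as the UniProt response above
example = ">sp|O00238|BMR1B_HUMAN example header\nMLLRSAGKLN\nVGTKKEDGES\n"
print(fasta_to_sequence(example))
```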

In [19]:
# list of all the target UniProt IDs; we will get the sequence for each of them

targets = ['P06213', 'P78368', 'Q9H2K8', 'P49336', 'Q6DT37', 'P09619', 'Q9Y463', 'P04626', 
           'P11802', 'P49674', 'Q05397', 'Q07912', 'Q13873', 'P45985', 'Q96D53', 'P42345', 
           'Q13233', 'P27037', 'P43405', 'O15075', 'P49759', 'P53355', 'O94804', 'P19525', 
           'P53671', 'Q9NRP7', 'P57078', 'Q92630', 'Q13237', 'P45984', 'Q02779', 'P52564', 
           'O60674', 'P07332', 'P33981', 'P04629', 'P30291', 'Q9UK32', 'Q15835', 'P53350', 
           'Q96RR4', 'Q8IVW4', 'P80192', 'Q9Y2U5', 'O15530', 'O15146', 'Q9BYP7', 'Q13163', 
           'P68400', 'P14616', 'P07949', 'O94768', 'Q12851', 'Q9UHD2', 'Q13470', 'P35916', 
           'P36888', 'O75676', 'Q16832', 'P37173', 'Q13188', 'P48729', 'Q8NE63', 'Q16539', 
           'P35968', 'Q8NI60', 'P19784', 'Q8WTQ7', 'Q15349', 'P22455', 'Q02763', 'Q13705', 
           'P42338', 'P00519', 'P29597', 'P27361', 'Q9BZL6', 'P0C1S8', 'P24723', 'O00444', 
           'Q08345', 'Q04759', 'P31751', 'P49840', 'P45983', 'Q9NYY3', 'O14965', 'Q92772', 
           'Q9UEE5', 'O75914', 'P06239', 'Q9BWU1', 'Q9Y243', 'O14578', 'P48736', 'P35590', 
           'O60331', 'P51813', 'O43683', 'Q9NSY1', 'Q96GD4', 'O14730', 'O15197', 'Q92918', 
           'P53779', 'Q8TF76', 'P29376', 'Q8IVH8', 'P30530', 'O76039', 'Q9H1R3', 'P15735', 
           'Q8TD08', 'P07333', 'Q00526', 'Q86Z02', 'Q9NYL2', 'Q6XUX3', 'P17612', 'Q9H2G2', 
           'P16591', 'Q8WXR4', 'P10721', 'Q86YV6', 'P51617', 'P49841', 'P31749', 'Q86Y07', 
           'Q99759', 'P29323', 'Q8TBX8', 'P50613', 'Q9C098', 'Q8IY84', 'P48730', 'Q9H4B4', 
           'P12931', 'Q05655', 'Q9UBF8', 'Q15418', 'O00141', 'O96017', 'Q96PY6', 'O94806', 
           'P49760', 'Q9UIK4', 'P51956', 'Q9UM73', 'Q9Y616', 'Q9Y4K4', 'O14757', 'P22607', 
           'P46734', 'Q9H422', 'P16234', 'Q8N5S9', 'O75716', 'P42685', 'Q09013', 'Q13043', 
           'Q9UBE8', 'Q6PHR2', 'P25098', 'Q96Q40', 'O14936', 'P42684', 'Q9BVS4', 'Q13627', 
           'O00506', 'Q6P3R8', 'Q9P2K8', 'P51817', 'P51451', 'O43293', 'Q9BRS2', 'Q14289', 
           'O14920', 'P21860', 'Q7L7X3', 'Q9Y6E0', 'Q9HAZ1', 'Q56UN5', 'Q02156', 'Q16288', 
           'Q15759', 'Q12866', 'Q9H2X6', 'P54753', 'P08922', 'P52333', 'Q9UQB9', 'P08581', 
           'Q9Y6R4', 'Q16584', 'P21802', 'Q15746', 'P49137', 'Q8N568', 'Q9H093', 'P23458', 
           'Q9P286', 'O14976', 'P17948', 'Q16620']

In [20]:
# We will store all the sequence information in a new dataframe

proteinset = pd.DataFrame({'target_id': [],
                           'Sequence':[]
                          })

In [21]:
# run this cell to get all the sequences

for target in targets:
    sequence = requests.get(f'https://www.uniprot.org/uniprot/{target}.fasta').text
    ind = sequence.find('\n')
    raw = sequence[ind+1:]
    protein = raw.replace('\n', '')
    row = proteinset.shape[0]
    proteinset.loc[row] = [target, protein]

In [22]:
proteinset.head()

Unnamed: 0,target_id,Sequence
0,P06213,MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTR...
1,P78368,MDFDKKGGKGETEEGRRMSKAGGGRSSHGIRSSGTSSGVLMVGPNF...
2,Q9H2K8,MRKGVLKDPEIADLFYKDDPEELFIGLHEIGHGSFGAVYFATNAHT...
3,P49336,MDYDFKVKLSSERERVEDLFEYEGCKVGRGTYGHVYKAKRKDGKDD...
4,Q6DT37,MERRLRALEQLARGEAGGCPGLDGLLDLLLALHHELSSGPLRRERS...


In [23]:
# we save this
proteinset.to_csv('cleaned_data/seq_data.csv')

## Featurising the sequence

Our model cannot work with the raw sequence information, so we need a way to represent the sequences in a numerical format. I am going to use a simple method: mapping each amino acid letter in a sequence to a number.

In [24]:
# write a function that converts a sequence into numbers, one per amino acid

seq_rdic = ['A','I','L','V','F','W','Y','N','C','Q','M','S','T','D','E','R','H','K','G','P','O','U','X','B','Z']
seq_dic = {w: i+1 for i,w in enumerate(seq_rdic)}

def encodeSeq(seq, seq_dic=seq_dic):
    if pd.isnull(seq):
        return [0]
    else:
        return [seq_dic[aa] for aa in seq]
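As written, `encodeSeq` raises a `KeyError` for any letter that is not in `seq_dic` (for example the ambiguity code `J`). A possible variant, sketched below under the assumption that unknown letters should be treated like `X`, falls back to the code for `X` instead of failing. The name `encode_seq_tolerant` is hypothetical and the alphabet is copied from the cell above.

```python
AMINO_ACIDS = ['A', 'I', 'L', 'V', 'F', 'W', 'Y', 'N', 'C', 'Q', 'M', 'S', 'T',
               'D', 'E', 'R', 'H', 'K', 'G', 'P', 'O', 'U', 'X', 'B', 'Z']
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding


def encode_seq_tolerant(seq, unknown='X'):
    """Encode a sequence; letters outside the alphabet get the code for `unknown`."""
    if not isinstance(seq, str) or not seq:
        return [0]  # same placeholder the original function uses for missing sequences
    fallback = AA_TO_INT[unknown]
    return [AA_TO_INT.get(aa, fallback) for aa in seq.upper()]


print(encode_seq_tolerant("MDFJ"))
```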


In [25]:
seq = proteinset.Sequence[1] 

# test on the second sequence (index 1)
encode = encodeSeq(seq, seq_dic)
print(encode)

[11, 14, 5, 14, 18, 18, 19, 19, 18, 19, 15, 13, 15, 15, 19, 16, 16, 11, 12, 18, 1, 19, 19, 19, 16, 12, 12, 17, 19, 2, 16, 12, 12, 19, 13, 12, 12, 19, 4, 3, 11, 4, 19, 20, 8, 5, 16, 4, 19, 18, 18, 2, 19, 9, 19, 8, 5, 19, 15, 3, 16, 3, 19, 18, 8, 3, 7, 13, 8, 15, 7, 4, 1, 2, 18, 3, 15, 20, 2, 18, 12, 16, 1, 20, 10, 3, 17, 3, 15, 7, 16, 5, 7, 18, 10, 3, 12, 1, 13, 15, 19, 4, 20, 10, 4, 7, 7, 5, 19, 20, 9, 19, 18, 7, 8, 1, 11, 4, 3, 15, 3, 3, 19, 20, 12, 3, 15, 14, 3, 5, 14, 3, 9, 14, 16, 13, 5, 13, 3, 18, 13, 4, 3, 11, 2, 1, 2, 10, 3, 2, 13, 16, 11, 15, 7, 4, 17, 13, 18, 12, 3, 2, 7, 16, 14, 4, 18, 20, 15, 8, 5, 3, 4, 19, 16, 20, 19, 13, 18, 16, 10, 17, 1, 2, 17, 2, 2, 14, 5, 19, 3, 1, 18, 15, 7, 2, 14, 20, 15, 13, 18, 18, 17, 2, 20, 7, 16, 15, 17, 18, 12, 3, 13, 19, 13, 1, 16, 7, 11, 12, 2, 8, 13, 17, 3, 19, 18, 15, 10, 12, 16, 16, 14, 14, 3, 15, 1, 3, 19, 17, 11, 5, 11, 7, 5, 3, 16, 19, 12, 3, 20, 6, 10, 19, 3, 18, 1, 14, 13, 3, 18, 15, 16, 7, 10, 18, 2, 19, 14, 13, 18, 16, 1, 13, 20, 2

It works, so now we make a new column called encoded and store all the numerical sequences in it.

In [26]:
proteinset['encoded'] = proteinset['Sequence'].map(lambda x: encodeSeq(x, seq_dic))

In [27]:
proteinset.head()

Unnamed: 0,target_id,Sequence,encoded
0,P06213,MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTR...,"[11, 1, 13, 19, 19, 16, 16, 19, 1, 1, 1, 1, 20..."
1,P78368,MDFDKKGGKGETEEGRRMSKAGGGRSSHGIRSSGTSSGVLMVGPNF...,"[11, 14, 5, 14, 18, 18, 19, 19, 18, 19, 15, 13..."
2,Q9H2K8,MRKGVLKDPEIADLFYKDDPEELFIGLHEIGHGSFGAVYFATNAHT...,"[11, 16, 18, 19, 4, 3, 18, 14, 20, 15, 2, 1, 1..."
3,P49336,MDYDFKVKLSSERERVEDLFEYEGCKVGRGTYGHVYKAKRKDGKDD...,"[11, 14, 7, 14, 5, 18, 4, 18, 3, 12, 12, 15, 1..."
4,Q6DT37,MERRLRALEQLARGEAGGCPGLDGLLDLLLALHHELSSGPLRRERS...,"[11, 15, 16, 16, 3, 16, 1, 3, 15, 10, 3, 1, 16..."


The problem is that some sequences are long and some are very short, and we need all of them to be the same length. So I will define a function called padding that extends each list to a length of 3000 by appending zeros at the end. I am choosing 3000 because the maximum length is 2549.

In [28]:
seqlen = []
for seq in proteinset.encoded.values:
    seqlen.append(len(seq))
print(max(seqlen))

2549


In [29]:
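def padding(seq, length=3000):
    # pad in place with trailing zeros up to `length`, then return the list
    # (without the return, the 'padded' column would be filled with None)
    seq.extend([0] * (length - len(seq)))
    return seq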

In [30]:
proteinset['padded'] = proteinset['encoded'].map(lambda x: padding(x))

In [31]:
seqlen = []
for seq in proteinset.encoded.values:
    seqlen.append(len(seq))
print(max(seqlen))

3000
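Because `padding` appends to each list in place, the `encoded` column itself ends up padded, which is why the check above reports 3000. If you would rather keep the original encodings untouched, a non-mutating version is one option. This is only a sketch with a hypothetical name, `pad_sequence`. It also truncates anything longer than the target length, which the original function does not need to do for this dataset.

```python
def pad_sequence(encoded, length=3000, pad_value=0):
    """Return a new list of exactly `length` items; the input list is not modified."""
    trimmed = list(encoded[:length])  # copy, and cut off anything past `length`
    return trimmed + [pad_value] * (length - len(trimmed))


original = [11, 14, 5]
padded = pad_sequence(original, length=6)
print(padded)
print(original)  # still the unpadded list
```

It could be applied with `proteinset['encoded'].map(pad_sequence)` and stored in the `padded` column.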


We need to create a new dataframe that has the target ID and each position of the encoded sequence as separate columns. We do that in the following way.

In [32]:
protdata = proteinset[['target_id']]

In [33]:
encodes = proteinset['encoded'].apply(pd.Series)

In [34]:
encodes = encodes.rename(columns = lambda x : 'encode_' + str(x))

In [35]:
protdata = pd.concat([protdata, encodes], axis=1)

In [36]:
protdata.head()

Unnamed: 0,target_id,encode_0,encode_1,encode_2,encode_3,encode_4,encode_5,encode_6,encode_7,encode_8,...,encode_2990,encode_2991,encode_2992,encode_2993,encode_2994,encode_2995,encode_2996,encode_2997,encode_2998,encode_2999
0,P06213,11,1,13,19,19,16,16,19,1,...,0,0,0,0,0,0,0,0,0,0
1,P78368,11,14,5,14,18,18,19,19,18,...,0,0,0,0,0,0,0,0,0,0
2,Q9H2K8,11,16,18,19,4,3,18,14,20,...,0,0,0,0,0,0,0,0,0,0
3,P49336,11,14,7,14,5,18,4,18,3,...,0,0,0,0,0,0,0,0,0,0
4,Q6DT37,11,15,16,16,3,16,1,3,15,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# we will save them into a csv file

protdata.to_csv('cleaned_data/seqdata.csv')

├── protein_retrieval         <- Obtained protein sequence FASTA files from UniProt using TargetID