 Alexandros bioinformatics Project – Cancer Bioactivity Prediction

The project is organized into four parts:<br>
1. **Data Retrieval**: Collecting bioactivity data for a cancer-related target from ChEMBL using `chembl_webresource_client`. <br>
2. **Data Pre-processing**: Classifying each compound's bioactivity by its IC50 value (active, inactive or intermediate), keeping only the relevant columns of the raw data in a new dataframe, and saving it to a separate .csv file. <br>
3. **Making and adding descriptors**: Computing the relevant molecular descriptors (ALogP, PSA, molecular weight, number of H-bond acceptors and H-bond donors) and combining them with the pre-processed data. <br>
4. **Applying different ML models on the data**: Splitting the data into training and testing sets, applying different ML models and checking their accuracy.<br>

### Part 1: Data Retrieval

In [500]:
# Importing the ChEMBL web resource client for searching targets and downloading data.

import pandas as pd
from chembl_webresource_client.new_client import new_client

In [501]:
# Searching ChEMBL for the gene of interest (here 'TP53'; any cancer gene can be used)
# and saving all the matching targets in the `targets` DataFrame.

target_query = new_client.target.search('TP53')

# The number of results will vary based on your gene choice.

targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Tumor suppressor p53-binding protein 1,15.0,False,CHEMBL2424509,"[{'accession': 'Q12888', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Mus musculus,TP53-binding protein 1,15.0,False,CHEMBL4295790,"[{'accession': 'P70399', 'component_descriptio...",SINGLE PROTEIN,10090
2,[],Homo sapiens,"Fructose-2,6-bisphosphatase TIGAR",15.0,False,CHEMBL4295958,"[{'accession': 'Q9NQ88', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Homo sapiens,TP53-regulating kinase,14.0,False,CHEMBL1938223,"[{'accession': 'Q96S44', 'component_descriptio...",SINGLE PROTEIN,9606
4,"[{'xref_id': 'P02340', 'xref_name': None, 'xre...",Mus musculus,Cellular tumor antigen p53,13.0,False,CHEMBL4164,"[{'accession': 'P02340', 'component_descriptio...",SINGLE PROTEIN,10090
5,"[{'xref_id': 'P04637', 'xref_name': None, 'xre...",Homo sapiens,Cellular tumor antigen p53,12.0,False,CHEMBL4096,"[{'accession': 'P04637', 'component_descriptio...",SINGLE PROTEIN,9606
6,[],Homo sapiens,Cellular tumor antigen p53/Death-associated pr...,10.0,False,CHEMBL3885543,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
7,[],Homo sapiens,Tumour suppressor protein p53/Mdm4,9.0,False,CHEMBL2221344,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
8,[],Homo sapiens,CREB-binding protein/p53,9.0,False,CHEMBL3301383,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
9,[],Homo sapiens,Tumour suppressor p53/oncoprotein Mdm2,8.0,False,CHEMBL1907611,"[{'accession': 'P04637', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


_**Note**: a single-protein target usually works best._
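
A minimal sketch of how the search results could be narrowed to single-protein targets before picking one. The small `targets` table here is hand-copied from a few rows of the output above so that the snippet runs without a network call; the organism filter is an extra assumption, not something the notebook does.

```python
import pandas as pd

# Stand-in for a few rows of the `targets` table shown above.
targets = pd.DataFrame({
    "organism": ["Homo sapiens", "Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": [
        "Tumor suppressor p53-binding protein 1",
        "Cellular tumor antigen p53",
        "Cellular tumor antigen p53",
        "Tumour suppressor p53/oncoprotein Mdm2",
    ],
    "target_chembl_id": ["CHEMBL2424509", "CHEMBL4164", "CHEMBL4096", "CHEMBL1907611"],
    "target_type": [
        "SINGLE PROTEIN",
        "SINGLE PROTEIN",
        "SINGLE PROTEIN",
        "PROTEIN-PROTEIN INTERACTION",
    ],
})

# Keep only human single-protein targets (the organism filter is an assumption).
mask = (targets["target_type"] == "SINGLE PROTEIN") & (targets["organism"] == "Homo sapiens")
single_human = targets[mask].reset_index(drop=True)

# Take the first remaining hit, as the notebook does with index 0.
selected_target = single_human["target_chembl_id"].iloc[0]
print(selected_target)
```

With these rows the first remaining hit is the same target the notebook selects by position.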

In [502]:
# Selecting the target: the first hit (index 0), Tumor suppressor p53-binding protein 1.

selected_target = targets.target_chembl_id[0]

In [503]:
# Keeping only the IC50 activity records for the selected target.

activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,13444885,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
1,,13444886,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,3.5
2,,13444887,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
3,,13444888,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
4,,13444889,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,,22489792,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,31.32
90,,22489793,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,50.0
91,,22489794,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,50.0
92,,22489795,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,22.27


In [504]:
# Removing all the chemical compound entries where the IC50 value is missing.

df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,13444885,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
1,,13444886,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,3.5
2,,13444887,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
3,,13444888,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
4,,13444889,[],CHEMBL2429072,Inhibition of 53BP1 (unknown origin) after 30 ...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,,22489792,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,31.32
90,,22489793,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,50.0
91,,22489794,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,50.0
92,,22489795,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL4689468,Inhibition of recombinant human N-terminal His...,B,,,BAO_0000190,BAO_0000219,...,Homo sapiens,Tumor suppressor p53-binding protein 1,9606,,,IC50,uM,UO_0000065,,22.27


In [505]:
# Re-indexing the data frame after removal of missing data rows.

df2.index=range(len(df2))
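
The re-indexing step matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A small sketch with made-up IDs and values showing what happens with and without it; `reset_index(drop=True)` is used here as an equivalent of assigning `range(len(df2))`.

```python
import pandas as pd

df = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],  # made-up IDs
    "standard_value": [10000.0, None, 3500.0, 50000.0],
})
labels = pd.Series(["inactive", "intermediate", "inactive"], name="bioactivity_class")

# Without re-indexing, the filtered frame keeps labels 0, 2, 3,
# so concat pads the mismatched labels with NaN.
filtered = df[df["standard_value"].notna()]
misaligned = pd.concat([filtered, labels], axis=1)

# Dropping missing values and renumbering rows 0..n-1 keeps both objects aligned.
df2 = filtered.reset_index(drop=True)
combined = pd.concat([df2, labels], axis=1)

print(misaligned.shape, combined.shape)
```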

In [507]:
# Saving the raw data, filtered to IC50 records with no missing values.

df2.to_csv('TP53_raw.csv', index=False)

### Part 2: Data Pre-processing

In [509]:
# Classifying the bioactivity of each compound (inactive, active or intermediate) by its IC50 value.
# Note that standard_value holds the IC50 in nM (10000 nM = 10 uM).

bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:  # strictly between 1000 and 10000 nM
    bioactivity_class.append("intermediate")
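
The same thresholds can also be applied to the whole column at once. A minimal sketch, assuming the column has no missing values (they were dropped in Part 1); `classify_ic50` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd

def classify_ic50(ic50_nm: pd.Series) -> pd.Series:
    """Label IC50 values (in nM) with the thresholds used in the loop above."""
    values = pd.to_numeric(ic50_nm).to_numpy()
    labels = np.select(
        [values >= 10000, values <= 1000],  # conditions, checked in order
        ["inactive", "active"],             # label for each condition
        default="intermediate",             # everything in between
    )
    return pd.Series(labels, index=ic50_nm.index, name="bioactivity_class")

# ChEMBL returns standard_value as strings, hence the string inputs here.
demo = pd.Series(["10000.0", "3500.0", "1000", "999.9", "50000"])
print(classify_ic50(demo).tolist())
```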

In [510]:
# Combining the data columns needed for computing the descriptors and building the models.

bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df2 = pd.concat([df2, bioactivity_class], axis=1)
selection = ['molecule_chembl_id','canonical_smiles','standard_value','bioactivity_class','value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,value
0,CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,10000.0,inactive,10.0
1,CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,3500.0,intermediate,3.5
2,CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,10000.0,inactive,10.0
3,CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CC...,10000.0,inactive,10.0
4,CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)...,10000.0,inactive,10.0
...,...,...,...,...,...
89,CHEMBL4798544,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccccn3)ccc2n1,31320.0,inactive,31.32
90,CHEMBL4787042,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccc4ccoc4c3)cc...,50000.0,inactive,50.0
91,CHEMBL4753650,CN(C)c1ccc2cc(NC(=O)/C=C/c3cccc4cccnc34)ccc2n1,50000.0,inactive,50.0
92,CHEMBL4746119,Cc1cc2cc(NC(=O)/C=C/c3ccco3)ccc2[nH]1,22270.0,inactive,22.27


In [512]:
# Saving the pre-processed data.

df3.to_csv('TP53_preprocessed.csv', index=False)

### Part 3: Making and adding Data Descriptors

In [513]:
# Importing rdkit to make the descriptors from SMILES of each compound.

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [515]:
# Defining a function that computes the descriptors used in Lipinski's Rule of Five.
# These are global molecular properties that are used to judge the drug-likeness of a compound.
# The function takes the SMILES notation of each compound as the input.
# More on this is written in the "Project_Description.pdf" file in the root directory.

def lipinski(smiles, verbose=False):
    # `verbose` is unused; it is kept so that existing calls keep working.

    rows = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)  # Parse the SMILES string into an RDKit molecule.

        rows.append([Descriptors.MolWt(mol),        # Molecular weight
                     Descriptors.MolLogP(mol),      # logP (lipophilicity)
                     Descriptors.TPSA(mol),         # Topological polar surface area
                     Lipinski.NumHDonors(mol),      # Number of H-bond donors
                     Lipinski.NumHAcceptors(mol)])  # Number of H-bond acceptors

    columnNames = ["MW","ALogP","PSA","NumHDonors","NumHAcceptors"] # The descriptors we are choosing for the compounds.
    descriptors = pd.DataFrame(data=rows, columns=columnNames, dtype=float)

    return descriptors

# Note: Although TPSA is not strictly a Lipinski descriptor, we included TPSA (aka PSA) too, as it was mentioned by Parthiban Sir.
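
For reference, the Rule of Five itself flags a compound when its molecular weight exceeds 500, its logP exceeds 5, it has more than 5 H-bond donors, or more than 10 H-bond acceptors. A sketch of counting those violations from a descriptor table with the column names used above; `lipinski_violations` is a hypothetical helper, and the three rows are copied from the descriptor output in this notebook.

```python
import pandas as pd

def lipinski_violations(desc: pd.DataFrame) -> pd.Series:
    """Count how many of the four Rule-of-Five limits each compound exceeds."""
    checks = pd.concat(
        [
            desc["MW"] > 500,            # molecular weight limit
            desc["ALogP"] > 5,           # lipophilicity limit
            desc["NumHDonors"] > 5,      # H-bond donor limit
            desc["NumHAcceptors"] > 10,  # H-bond acceptor limit
        ],
        axis=1,
    )
    return checks.sum(axis=1).rename("ro5_violations")

desc = pd.DataFrame({
    "MW": [537.708, 446.595, 266.300],
    "ALogP": [5.62220, 3.87860, 3.72122],
    "PSA": [76.71, 64.68, 58.03],
    "NumHDonors": [3.0, 2.0, 2.0],
    "NumHAcceptors": [5.0, 4.0, 2.0],
})
print(lipinski_violations(desc).tolist())  # [2, 0, 0]
```

PSA is carried along in the table but is not part of the count, in line with the note above.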

In [516]:
# Making the dataframe of descriptors for the pre-processed data (descriptors are found for our data, using 
# the lipinski() function above).

df_lipinski = lipinski(df3.canonical_smiles)
df_lipinski

Unnamed: 0,MW,ALogP,PSA,NumHDonors,NumHAcceptors
0,537.708,5.62220,76.71,3.0,5.0
1,537.708,5.62220,76.71,3.0,5.0
2,446.595,3.87860,64.68,2.0,4.0
3,472.633,4.22080,55.89,1.0,4.0
4,446.595,3.97480,55.89,1.0,4.0
...,...,...,...,...,...
89,360.461,4.43632,58.12,1.0,4.0
90,399.494,5.78752,58.37,1.0,4.0
91,368.440,4.50090,58.12,1.0,4.0
92,266.300,3.72122,58.03,2.0,2.0


In [517]:
# Combining the pre-processed data and the descriptors dataframe we made in the above step, into a single dataframe.

df_combined = pd.concat([df3,df_lipinski],axis=1)
df_combined

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class,value,MW,ALogP,PSA,NumHDonors,NumHAcceptors
0,CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,10000.0,inactive,10.0,537.708,5.62220,76.71,3.0,5.0
1,CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,3500.0,intermediate,3.5,537.708,5.62220,76.71,3.0,5.0
2,CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,10000.0,inactive,10.0,446.595,3.87860,64.68,2.0,4.0
3,CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CC...,10000.0,inactive,10.0,472.633,4.22080,55.89,1.0,4.0
4,CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)...,10000.0,inactive,10.0,446.595,3.97480,55.89,1.0,4.0
...,...,...,...,...,...,...,...,...,...,...
89,CHEMBL4798544,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccccn3)ccc2n1,31320.0,inactive,31.32,360.461,4.43632,58.12,1.0,4.0
90,CHEMBL4787042,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccc4ccoc4c3)cc...,50000.0,inactive,50.0,399.494,5.78752,58.37,1.0,4.0
91,CHEMBL4753650,CN(C)c1ccc2cc(NC(=O)/C=C/c3cccc4cccnc34)ccc2n1,50000.0,inactive,50.0,368.440,4.50090,58.12,1.0,4.0
92,CHEMBL4746119,Cc1cc2cc(NC(=O)/C=C/c3ccco3)ccc2[nH]1,22270.0,inactive,22.27,266.300,3.72122,58.03,2.0,2.0


In [520]:
# Defining a function that calculates pIC50 from IC50 values.
# We convert IC50 to pIC50 (the negative log10 of the IC50 in molar units) so that the values are more evenly distributed.

def pIC50(input):
    pIC50 = []

    for i in input['standard_value_norm']:
        molar = i*(10**-9) # Converts nM to M
        pIC50.append(-np.log10(molar))

    input['pIC50'] = pIC50
    x = input.drop(columns='standard_value_norm') # Keyword form; the positional axis argument is no longer accepted by recent pandas.
        
    return x

In [521]:
# Before computing pIC50, we cap IC50 values above 100,000,000 nM at 100,000,000 nM.
# Very large IC50 values give very small pIC50 values (pIC50 turns negative once the IC50 exceeds 1 M, i.e. 10^9 nM);
# with the cap at 10^8 nM, pIC50 can never fall below 1.

# Defining a function that caps IC50 values at 100,000,000.

df_combined.standard_value = pd.to_numeric(df_combined.standard_value) # standard_value has dtype=object, so it is converted to numeric first.
def norm_value(input):
    norm = []

    for i in input['standard_value']:
        if i > 100000000:
            i = 100000000
        norm.append(i)

    input['standard_value_norm'] = norm
    x = input.drop(columns='standard_value') # Keyword form; the positional axis argument is no longer accepted by recent pandas.
        
    return x
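
Since the IC50 is given in nM, the conversion reduces to pIC50 = 9 - log10(IC50 in nM). A compact sketch that applies the cap and the conversion in one vectorized step; `to_pic50` and its `cap_nm` parameter are hypothetical names used for illustration only.

```python
import numpy as np
import pandas as pd

def to_pic50(ic50_nm: pd.Series, cap_nm: float = 1e8) -> pd.Series:
    """Cap IC50 values (in nM) and convert them to pIC50."""
    capped = pd.to_numeric(ic50_nm).clip(upper=cap_nm)
    # -log10(x * 1e-9) is the same as 9 - log10(x)
    return (9.0 - np.log10(capped)).rename("pIC50")

ic50 = pd.Series([10000.0, 3500.0, 22270.0, 5e9])
pic50 = to_pic50(ic50)
print(pic50.round(6).tolist())
```

The first three inputs are IC50 values from the table above; the last one is made up to show that a value beyond the cap ends up at a pIC50 of 1 rather than going negative.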

In [522]:
# Applying the above function on the IC50 value column of our current dataframe.

df_norm = norm_value(df_combined)
df_norm


Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,value,MW,ALogP,PSA,NumHDonors,NumHAcceptors,standard_value_norm
0,CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10.0,537.708,5.62220,76.71,3.0,5.0,10000.0
1,CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,intermediate,3.5,537.708,5.62220,76.71,3.0,5.0,3500.0
2,CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10.0,446.595,3.87860,64.68,2.0,4.0,10000.0
3,CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CC...,inactive,10.0,472.633,4.22080,55.89,1.0,4.0,10000.0
4,CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)...,inactive,10.0,446.595,3.97480,55.89,1.0,4.0,10000.0
...,...,...,...,...,...,...,...,...,...,...
89,CHEMBL4798544,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccccn3)ccc2n1,inactive,31.32,360.461,4.43632,58.12,1.0,4.0,31320.0
90,CHEMBL4787042,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccc4ccoc4c3)cc...,inactive,50.0,399.494,5.78752,58.37,1.0,4.0,50000.0
91,CHEMBL4753650,CN(C)c1ccc2cc(NC(=O)/C=C/c3cccc4cccnc34)ccc2n1,inactive,50.0,368.440,4.50090,58.12,1.0,4.0,50000.0
92,CHEMBL4746119,Cc1cc2cc(NC(=O)/C=C/c3ccco3)ccc2[nH]1,inactive,22.27,266.300,3.72122,58.03,2.0,2.0,22270.0


In [524]:
# Finding and adding the pIC50 column, using the normalised IC50 data column

df_final = pIC50(df_norm)
df_final


Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,value,MW,ALogP,PSA,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL2424677,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10.0,537.708,5.62220,76.71,3.0,5.0,5.000000
1,CHEMBL2426376,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,intermediate,3.5,537.708,5.62220,76.71,3.0,5.0,5.455932
2,CHEMBL2426375,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(C5CCCN5)CC4)...,inactive,10.0,446.595,3.87860,64.68,2.0,4.0,5.000000
3,CHEMBL2426374,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CC5CCC(C4)C5N4CC...,inactive,10.0,472.633,4.22080,55.89,1.0,4.0,5.000000
4,CHEMBL2426373,CCN1Cc2ccc(NC(=O)c3ccc(C(=O)N4CCC(N5CCCC5)CC4)...,inactive,10.0,446.595,3.97480,55.89,1.0,4.0,5.000000
...,...,...,...,...,...,...,...,...,...,...
89,CHEMBL4798544,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccccn3)ccc2n1,inactive,31.32,360.461,4.43632,58.12,1.0,4.0,4.504178
90,CHEMBL4787042,CCN(CC)c1cc(C)c2cc(NC(=O)/C=C/c3ccc4ccoc4c3)cc...,inactive,50.0,399.494,5.78752,58.37,1.0,4.0,4.301030
91,CHEMBL4753650,CN(C)c1ccc2cc(NC(=O)/C=C/c3cccc4cccnc34)ccc2n1,inactive,50.0,368.440,4.50090,58.12,1.0,4.0,4.301030
92,CHEMBL4746119,Cc1cc2cc(NC(=O)/C=C/c3ccco3)ccc2[nH]1,inactive,22.27,266.300,3.72122,58.03,2.0,2.0,4.652280


In [525]:
# Saving the compounds with their ID, SMILES, bioactivity class, descriptors and pIC50 into a single .csv file.

df_final.to_csv('TP53_final.csv', index=False)

In [526]:
# Only taking the descriptors and the pIC50 value columns, and making it to a new dataframe.

selection=["MW","ALogP","PSA","NumHDonors","NumHAcceptors","pIC50"]
df4=df_final[selection]

In [527]:
# Saving the descriptors and pIC50 values into a separate .csv file.

df4.to_csv('TP53_descriptors.csv', index=False)

### Part 4: Building the Model

In [528]:
# Importing relevant sklearn modules

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix

We ran **three ML models** on the X and Y data: <br>
1. Multiple Linear Regression<br>
2. Random Forest Model<br>
3. SVM Classification Model<br>
<br>The **R2 score/accuracy** obtained for each model is shown below its cell.

In [529]:
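# We choose the final descriptors dataframe from Part 3 (df4) as X,
# and the 'value' column of the original data (the reported IC50) as Y.
# Note: df4 also contains pIC50, which is computed from the same IC50 measurement as Y,
# so the regression scores below are likely optimistic.

X,Y=df4,df_final.value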
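
One possible way to avoid feeding the target back in as a feature, sketched on three rows copied from `df_final` above: keep only the five descriptors in X and use pIC50 as Y. This is an alternative set-up, not what the notebook ran, so the scores reported below do not apply to it. The last lines check that `value` (in uM) can be recovered from pIC50 alone, which is why keeping pIC50 in X leaks the target.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df_final output above.
df_final_demo = pd.DataFrame({
    "value": [10.0, 3.5, 31.32],  # reported IC50 in uM
    "MW": [537.708, 537.708, 360.461],
    "ALogP": [5.62220, 5.62220, 4.43632],
    "PSA": [76.71, 76.71, 58.12],
    "NumHDonors": [3.0, 3.0, 1.0],
    "NumHAcceptors": [5.0, 5.0, 4.0],
    "pIC50": [5.000000, 5.455932, 4.504178],
})

# Alternative split: descriptors only as features, pIC50 as the target.
feature_cols = ["MW", "ALogP", "PSA", "NumHDonors", "NumHAcceptors"]
X_demo = df_final_demo[feature_cols]
Y_demo = df_final_demo["pIC50"]

# pIC50 = 6 - log10(IC50 in uM), so the IC50 follows directly from pIC50.
recovered_value = 10 ** (6.0 - df_final_demo["pIC50"])
print(np.allclose(recovered_value, df_final_demo["value"], rtol=1e-4))
```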

In [530]:
# Multiple linear Regression Model

r2_sum,i=0,0
for i in range(1000): # Repeating the model 1000 times and finding the average R2 Score
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
    
    model = linear_model.LinearRegression()
    model.fit(X_train, Y_train)
    
    Y_pred = model.predict(X_test)
    r2_sum = r2_sum + r2_score(Y_test,Y_pred)
    
print("The R2 score (mean) that we are getting for this model is :",r2_sum/1000)

The R2 score (mean) that we are getting for this model is : 0.8593623949701596
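
The averaging loop above draws a new random split on every run, so the mean R2 changes slightly each time. A sketch of one way to make it repeatable: pass a different but fixed `random_state` to each split and report the spread as well as the mean. The data here are synthetic (`X_syn` and `y_syn` are made up), and `repeated_r2` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def repeated_r2(X, y, n_repeats=100, test_size=0.5):
    """Fit a linear model on n_repeats seeded splits; return mean and std of R2."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed  # fixed seed per repeat
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(r2_score(y_te, model.predict(X_te)))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()

# Synthetic stand-in data: 93 rows and 5 features, like the descriptor table.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(93, 5))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=93)

mean_r2, std_r2 = repeated_r2(X_syn, y_syn, n_repeats=20)
print(f"R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

Reporting the standard deviation alongside the mean also shows how much a single 50/50 split can vary on a data set of this size.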


In [531]:
# Random Forest Model
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestRegressor(n_estimators=1000)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

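errors = np.abs(Y_pred-(Y_test).astype(float)) # Absolute error of each prediction
mape = 100 * (errors/((Y_test).astype(float))) # Absolute percentage error of each prediction
accuracy = 100-np.mean(mape) # 100 minus the mean absolute percentage error (MAPE); not a classification accuracy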

print("The accuracy of the model is: ",round(accuracy, 2), '%.')

The accuracy of the model is:  97.94 %.
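
The "accuracy" printed above is 100 minus the mean absolute percentage error (MAPE) of the regression, not a fraction of correct predictions. A small worked sketch of that calculation on made-up numbers; `accuracy_from_mape` is a hypothetical helper name.

```python
import numpy as np

def accuracy_from_mape(y_true, y_pred):
    """Return 100 minus the mean absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Percentage error of each prediction, relative to the true value.
    ape = 100.0 * np.abs(y_pred - y_true) / np.abs(y_true)
    return 100.0 - ape.mean()

# Each prediction is off by 10% of its true value, so the score is 90.
print(accuracy_from_mape([10.0, 20.0], [9.0, 22.0]))  # 90.0
```

Because each error is divided by the true value, the measure is undefined when a true value is zero and is dominated by compounds with small true values.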


In [532]:
# Support Vector Machine-Classification Model

from sklearn import svm
def support_vector(x_data,y_data):
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.50)
    model = svm.SVC(kernel ='linear')
    model.fit(x_train,y_train)
    y_pred = model.predict(x_test)

    return [confusion_matrix(y_test,y_pred),accuracy_score(y_test,y_pred)]

acc_score = support_vector(df_final[['ALogP','MW','NumHDonors','NumHAcceptors','PSA']],df_final.bioactivity_class)[1]
support_vector(df_final[['ALogP','MW','NumHDonors','NumHAcceptors','PSA']],df_final.bioactivity_class)

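# In this model, we use only 'ALogP', 'MW', 'NumHDonors', 'NumHAcceptors' and 'PSA' as features and classify by 'bioactivity_class'.
# Note: support_vector() is called twice above and each call makes a new random split, so the accuracy stored in
# acc_score can differ from the accuracy shown next to the confusion matrix.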

[array([[44,  0],
        [ 3,  0]]),
 0.9361702127659575]

In [533]:
print("The accuracy of the model is: ",round(acc_score*100,2), '%.')

The accuracy of the model is:  91.49 %.
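
The confusion matrix above can be read in more detail than the single accuracy figure. In scikit-learn's convention, rows are the true classes and columns the predicted classes. A sketch that recomputes the accuracy from the matrix and adds the per-class recall and the share of the largest class; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the SVM output above
# (rows: true class, columns: predicted class).
cm = np.array([[44, 0],
               [3, 0]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # fraction of each true class recovered
majority_share = cm.sum(axis=1).max() / cm.sum() # score of always predicting the largest class

print(accuracy, recall_per_class, majority_share)
```

Here the second column is all zeros, meaning every test compound was assigned to the first class, so the accuracy equals the share of that class in the test set.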


### Conclusion:
The Random Forest model gave the highest score of the three models (97.94%) for the TP53 search, using the first target in the results, Tumor suppressor p53-binding protein 1 (index [0]). Note that the three scores are different metrics (R2, 100 minus the mean absolute percentage error, and classification accuracy), so they are not strictly comparable. The other models (SVM and multiple linear regression) may produce different results depending on the gene selected. It would have been great if we had had the time to incorporate these elements into our ANN. Regardless, we gave it all we had in the time frame. I wish you all the best, good health and happy graduation!

This program is based on the Data Professor's Bioinformatics YouTube tutorial series on drug discovery, and most of the elements of that series were used.
<br>Link: https://www.youtube.com/playlist?list=PLtqF5YXg7GLlQJUv9XJ3RWdd5VYGwBHrP<br>

Nantasenamat Ph.D, C. (n.d.). Data professor. Bioinformatics Project. Retrieved November 2, 2022, from https://www.youtube.com/c/DataProfessor/featured 