<a href="https://colab.research.google.com/github/mozey256/OSCAAR/blob/main/ML_Chemogenomics_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Orthology-Based Design for Synergistic Drug Combinations Against ***Aspergillus fumigatus*** Using Chemogenomic and Drug Structural Data for ***Cryptococcus neoformans***

**Ainembabazi Moses 2022/HD07/2039U Makerere University IDI and ACE**

Chemogenomics data is a rich resource for designing synergistic drug combinations, but it is not available for all fungal pathogens. This notebook addresses that gap with an orthology-based design.

In [10]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip



Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Downloading https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ydata-profiling
  Building wheel for ydata-profiling (setup.py) ... done
  Created wheel for ydata-profiling: filename=ydata_profiling-0.0.dev0-py2.py3-none-any.whl size=357942 sha256=9a3def02d046b12231d31a0a2b4cee5cd0cb21084986e6ba05ef07912793a8f0
  Stored in directory: /tmp/pip-ephem-wheel-cache-_imnk004/wheels/07/29/61/f533cc7cbd0a97efb2d1b94d3254a3e859a949367ba842577b
Successfully built ydata-profiling
Installing collected packages: ydata-profiling
  Attempting uninstall: ydata-profiling
    Found existing installation: ydata-profiling 4.7.0
    Uninstalling ydata-profiling-4.7.0:
      Successfully uninstalled ydata-profiling-4.7.0
Su

In [35]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2023.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2023.9.5


Import all the required packages here

In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
#---------------------- RDKit packages
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors
from pydantic_settings import BaseSettings
# pandas-profiling was renamed to ydata-profiling (the package installed above)
from ydata_profiling import ProfileReport
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit import DataStructs
import matplotlib.pyplot as plt
from rdkit.ML.Cluster import Butina
#------------------- progress bar
from tqdm import tqdm
#------------------- hide warning
import warnings
warnings.filterwarnings('ignore')

# CHEMOGENOMICS DATA

In [3]:
data = pd.read_csv('/content/Chemical-Genetics Dataset mmc3.csv')

Check the loaded data.


In [6]:
df = data


In [7]:
df.head()

Unnamed: 0,NAME,2aminobenzothiazole_conc1_T30,2aminobenzothiazole_conc2_T30,2aminobenzothiazole_conc3_T30,2hydroxyethylhydrazine_conc3_T30,3aminotriazole_conc1_T30,3aminotriazole_conc2_T30,3aminotriazole_conc3_T30,4hydroxytamoxifene_conc1_T30,4hydroxytamoxifene_conc2_T30,...,verrucarin_conc1_T30,verrucarin_conc2_T30,verrucarin_conc3_T30,ZnCl2_conc1_T30,ZnCl2_conc2_T30,ZnCl2_conc3_T30,average,stdev,-2.5 stdev,+2.5 stdev
0,CNAG_02695,1.056383,-27.976239,-12.390959,99.463196,-0.184176,-21.307448,-47.518637,13.259051,-7.443089,...,-107.722709,-4.442773,33.778516,89.765777,84.699476,80.812676,-7.219154,67.225201,-175.282156,160.843848
1,CNAG_06761,-9.030092,10.790969,26.735925,,35.430513,33.511175,18.639025,-214.559477,,...,,,,-182.055423,-134.930253,-121.099541,-34.617671,107.448156,-303.238061,234.002719
2,CNAG_01862,-106.748572,-1.341236,-0.944822,,-51.078138,23.378876,118.536041,,-213.627304,...,-428.691317,-261.446249,,-136.171531,-105.2876,-169.371448,-26.393266,130.335674,-352.232451,299.445918
3,CNAG_03664,-44.008538,-41.25612,-29.498142,-72.341975,12.642868,-83.95712,-110.532908,-24.765466,19.55412,...,-30.260754,-56.972615,-24.080345,105.303404,-66.122509,-105.01988,-6.469043,73.885705,-191.183305,178.245219
4,CNAG_01181,-3.578873,18.579128,-47.704215,123.927046,-71.78615,-29.939403,72.732075,19.046087,5.637677,...,-2.853405,94.638273,34.692855,-63.216421,-18.818562,6.887958,-2.804432,53.876986,-137.496897,131.888032


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1452 entries, 0 to 1451
Columns: 444 entries, NAME to +2.5 stdev
dtypes: float64(443), object(1)
memory usage: 4.9+ MB


Select every column that represents a lower dosage of the drug (`conc3`) and save the selection to a new file.

In [16]:

# Select only the columns containing 'conc3' in their name
conc3_columns = [col for col in df.columns if 'conc3' in col]

# Select the entries of these columns
conc3_data = df[conc3_columns]

# Save the selected data to a new CSV file
conc3_data.to_csv('selected_dosage.csv', index=False)
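The selection above keeps only the `conc3` score columns, so the gene identifiers in `NAME` are not carried into `selected_dosage.csv`. If those identifiers are needed later (for instance to map genes to orthologs), a variant along the following lines keeps them next to the scores. This is a minimal sketch on a made-up frame: `toy`, `keep`, the drug names, and all values are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy frame mimicking the layout of the chemogenomics table (all values are made up)
toy = pd.DataFrame({
    "NAME": ["gene_a", "gene_b"],
    "drugA_conc1_T30": [1.0, -2.0],
    "drugA_conc3_T30": [0.5, -1.5],
    "drugB_conc3_T30": [3.0, None],
    "average": [1.5, -1.75],
})

# Keep the gene identifier column plus every conc3 column
keep = ["NAME"] + [col for col in toy.columns if "conc3" in col]
toy_conc3 = toy[keep]

print(toy_conc3.columns.tolist())
```

Whether to keep `NAME` depends on the downstream step: numeric-only summaries such as `describe()` work either way, while any join against an ortholog table would need the identifier.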


In [17]:
df1 = pd.read_csv('/content/selected_dosage.csv')

In [18]:
df1.head()


Unnamed: 0,2aminobenzothiazole_conc3_T30,2hydroxyethylhydrazine_conc3_T30,3aminotriazole_conc3_T30,4hydroxytamoxifene_conc3_T30,A23187_conc3_T30,abietic-acid_conc3_T30,acifluorofen-methyl_conc3_T30,aconitine_conc3_T30,agelasine_conc3_T30,alamethicin_conc3_T30,...,thiabendazole_conc3_T30,thozonium-bromide_conc3_T30,tomatine_conc3_T30,trichostatinA_conc3_T30,trimethoprim_conc3_T30,tunicamycin_conc3_T30,usnic-acid_conc3_T30,valinomycin_conc3_T30,verrucarin_conc3_T30,ZnCl2_conc3_T30
0,-12.390959,99.463196,-47.518637,-8.810942,4.29748,-17.59699,-11.436236,-10.310285,35.952584,104.501338,...,-27.897476,34.798421,6.03769,-125.625952,29.12957,70.102756,-17.918227,-2.355947,33.778516,80.812676
1,26.735925,,18.639025,-229.150985,73.488603,,-48.800696,,-297.873286,,...,38.358847,2.335099,14.473617,,56.453545,-3.168274,-49.071984,,,-121.099541
2,-0.944822,,118.536041,-108.495497,117.847484,,38.534702,,,,...,13.051399,101.798766,-19.548246,,-24.539292,-194.570508,18.674033,,,-169.371448
3,-29.498142,-72.341975,-110.532908,34.647391,-88.657068,11.5953,-20.262663,-14.02531,37.577135,54.973262,...,29.391496,-38.587139,-92.325338,-67.183398,-15.017046,-44.875864,76.052004,-142.573033,-24.080345,-105.01988
4,-47.704215,123.927046,72.732075,18.603897,-5.247973,27.610765,-37.733011,-35.406175,-3.568782,7.880451,...,-1.64131,-1.832095,-54.107417,86.496364,-22.360077,40.110359,-3.45998,-33.460913,34.692855,6.887958


In [24]:
columns = list(df1.columns)
columns

['2aminobenzothiazole_conc3_T30',
 '2hydroxyethylhydrazine_conc3_T30',
 '3aminotriazole_conc3_T30',
 '4hydroxytamoxifene_conc3_T30',
 'A23187_conc3_T30',
 'abietic-acid_conc3_T30',
 'acifluorofen-methyl_conc3_T30',
 'aconitine_conc3_T30',
 'agelasine_conc3_T30',
 'alamethicin_conc3_T30',
 'alexidine_conc3_T30',
 'allantoin_conc3_T30',
 'alternariol_conc3_T30',
 'aluminum-sulfate_conc3_T30',
 'amantadine_conc3_T30',
 'amiodarone_conc3_T30',
 'ammonium-persulfate_conc3_T30',
 'amphotericinB_conc3_T30',
 'andrastin_conc3_T30',
 'antimycin_conc3_T30',
 'apicidin_conc3_T30',
 'artemesinin_conc3_T30',
 'azide_conc3_T30',
 'BaCl2_conc3_T30',
 'bafilomycin_conc3_T30',
 'BCS_conc3_T30',
 'betulinic-acid_conc3_T30',
 'bifonazole_conc3_T30',
 'borate_conc3_T30',
 'brefeldinA_conc3_T30',
 'CaCl2_conc3_T30',
 'caffeine_conc3_T30',
 'calcofluor-white_conc3_T30',
 'camptothecin_conc3_T30',
 'castanospermine_conc3_T30',
 'cerulenin_conc3_T30',
 'chloroquine_conc3_T30',
 'chlorpromazine_conc3_T30',
 'c

In [22]:
df1.describe()

Unnamed: 0,2aminobenzothiazole_conc3_T30,2hydroxyethylhydrazine_conc3_T30,3aminotriazole_conc3_T30,4hydroxytamoxifene_conc3_T30,A23187_conc3_T30,abietic-acid_conc3_T30,acifluorofen-methyl_conc3_T30,aconitine_conc3_T30,agelasine_conc3_T30,alamethicin_conc3_T30,...,thiabendazole_conc3_T30,thozonium-bromide_conc3_T30,tomatine_conc3_T30,trichostatinA_conc3_T30,trimethoprim_conc3_T30,tunicamycin_conc3_T30,usnic-acid_conc3_T30,valinomycin_conc3_T30,verrucarin_conc3_T30,ZnCl2_conc3_T30
count,1447.0,1447.0,1446.0,1451.0,1448.0,1446.0,1450.0,1443.0,1448.0,1448.0,...,1448.0,1451.0,1449.0,1372.0,1447.0,1451.0,1446.0,1447.0,1446.0,1452.0
mean,-3.259671,-1.900895,1.927553,-2.58686,0.52022,0.910867,-0.469433,-1.171001,-0.236123,-2.94166,...,-1.329431,-0.102921,-3.095922,1.344102,0.174904,-2.540051,-1.293033,-0.713006,-2.688197,-1.297769
std,33.123578,42.66655,45.864301,28.960188,24.49619,23.93225,24.699677,26.891972,26.886021,27.525268,...,38.997733,21.438528,37.268917,21.267259,26.667256,57.230765,21.456906,30.789559,26.565339,27.50217
min,-371.385938,-425.235728,-395.478995,-400.171641,-198.099914,-376.301691,-400.885844,-437.180055,-365.125912,-210.499068,...,-188.729555,-146.387238,-383.835539,-358.366759,-118.804684,-429.860597,-102.41594,-438.528773,-440.556178,-242.398371
25%,-18.127331,-16.889843,-16.588563,-10.213033,-12.211059,-10.387951,-12.686126,-9.72361,-9.733992,-14.952158,...,-23.93424,-10.606191,-20.975288,-7.455343,-14.37887,-27.554736,-13.152196,-11.037369,-10.487528,-13.126277
50%,-1.724132,-1.746717,2.761458,-0.764871,-0.261006,0.938718,0.208273,-0.116559,1.205677,-2.711213,...,-0.769562,-0.819104,-0.711218,0.769675,1.184842,0.077744,-0.564275,0.845338,0.298391,-0.138352
75%,14.602861,18.173445,21.539027,9.303977,12.484823,12.810821,12.671962,10.026371,11.416434,10.240174,...,22.268803,10.154084,17.046668,9.435652,16.146399,29.160638,10.842204,12.765992,8.953357,11.167245
max,108.770371,223.063183,268.118561,75.530561,118.59704,119.521476,88.409488,142.516313,155.004199,152.066598,...,134.360324,101.798766,154.291009,186.841095,124.446323,321.110502,82.523704,161.446366,134.951335,158.757308


In [26]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1452 entries, 0 to 1451
Columns: 148 entries, 2aminobenzothiazole_conc3_T30 to ZnCl2_conc3_T30
dtypes: float64(148)
memory usage: 1.6 MB


# Check for missing data and handle it

In [21]:

missing_data = df1.isnull().sum()

# Print the count of missing values for each column
print("Missing data count:")
print(missing_data)


Missing data count:
2aminobenzothiazole_conc3_T30       5
2hydroxyethylhydrazine_conc3_T30    5
3aminotriazole_conc3_T30            6
4hydroxytamoxifene_conc3_T30        1
A23187_conc3_T30                    4
                                   ..
tunicamycin_conc3_T30               1
usnic-acid_conc3_T30                6
valinomycin_conc3_T30               5
verrucarin_conc3_T30                6
ZnCl2_conc3_T30                     0
Length: 148, dtype: int64


Fill the missing values with the mean of each column.


In [31]:
# Fill missing values with the mean of the column
df1.fillna(df1.mean(), inplace=True)

df1.to_csv('filledmissing_chemo_data.csv', index=False)
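Mean imputation replaces every gap in a column with that column's average, which leaves the column mean unchanged but reduces its variance. Before imputing, it can help to see what share of each column would be filled. The following is a minimal sketch on a made-up frame; `toy`, `missing_fraction`, `filled`, and the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (all values are made up)
toy = pd.DataFrame({
    "drugA_conc3_T30": [1.0, np.nan, 3.0, 5.0],
    "drugB_conc3_T30": [np.nan, np.nan, 2.0, 4.0],
})

# Fraction of missing values per column, to see how much would be imputed
missing_fraction = toy.isnull().mean()

# Column-wise mean imputation; returns a new frame and leaves `toy` unchanged
filled = toy.fillna(toy.mean())

print(missing_fraction)
print(filled)
```

Columns with a large missing fraction may deserve a closer look before they are treated the same way as nearly complete ones.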

In [15]:
df2 = pd.read_csv('/content/filledmissing_chemo_data.csv')

Check for missing data again.


In [16]:
missing_data = df2.isnull().sum()

# Print the count of missing values for each column
print("Missing data count:")
print(missing_data)

Missing data count:
2aminobenzothiazole_conc3_T30       0
2hydroxyethylhydrazine_conc3_T30    0
3aminotriazole_conc3_T30            0
4hydroxytamoxifene_conc3_T30        0
A23187_conc3_T30                    0
                                   ..
tunicamycin_conc3_T30               0
usnic-acid_conc3_T30                0
valinomycin_conc3_T30               0
verrucarin_conc3_T30                0
ZnCl2_conc3_T30                     0
Length: 148, dtype: int64


Check for duplicate rows and remove them if any are present.

In [17]:
# Check for duplicates
print(df2.duplicated().sum())

# Remove duplicates if necessary
# df2.drop_duplicates(inplace=True)

0


# Visualize the data to understand distributions, relationships, and trends.

In [8]:
# Distribution of each feature
# for column in df2.columns:
#     sns.histplot(df2[column].dropna())
#     plt.title(f'Distribution of {column}')
#     plt.show()

# Pairplot to visualize relationships between features
# sns.pairplot(df2)
# plt.show()

# Correlation heatmap
# corr = df2.corr()
# sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
# plt.title('Correlation Heatmap')
# plt.show()
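With 148 columns, an annotated correlation heatmap is hard to read, so a numeric ranking of the most correlated column pairs can be a useful complement. The following is a minimal sketch on synthetic data; `toy`, `pairs`, the column names, and the random values are assumptions made for illustration, and the same steps could be applied to `df2`.

```python
import numpy as np
import pandas as pd

# Synthetic scores: drugB is built to track drugA closely, drugC is independent
rng = np.random.default_rng(0)
base = rng.normal(size=50)
toy = pd.DataFrame({
    "drugA": base,
    "drugB": base + rng.normal(scale=0.1, size=50),
    "drugC": rng.normal(size=50),
})

# Pairwise Pearson correlation between columns
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head())
```

The result is a Series indexed by column pairs, so `pairs.head(n)` and `pairs.tail(n)` give the most positively and most negatively correlated pairs.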

In [19]:
profile = ProfileReport(df2, title='C. neoformans chemogenomics report', html={'style': {'full_width': True}})

In [1]:
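# ProfileReport must be imported in the current session before it is used;
# df2 must also have been loaded (see the read_csv cell above)
from ydata_profiling import ProfileReport

profile = ProfileReport(df2, title="Pandas Profiling Report")

profile.to_file("outputReport.html")

# To display the report in the notebook
profile.to_notebook_iframe()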