## Installing RDKit

We will first install the conda package manager using [condacolab](https://pypi.org/project/condacolab/), then install the packages that we need, in this case RDKit. You can use this method to install other packages in Colab.

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

After you see the kernel restarting (you will get a notification that the session has crashed for an unknown reason), run the following cell. The expected output is "**Everything looks OK!**"

In [None]:
import condacolab
condacolab.check()

Now that the conda environment is set up, let us install the package that we need, rdkit. We will use **conda install** to get the **rdkit** package from the conda-forge channel (**-c**).

In [None]:
!conda install -c conda-forge rdkit

## Importing required definitions

In [None]:
from rdkit import Chem # A core definition
from rdkit.Chem.Draw import MolsToGridImage # For displaying multiple molecules
from rdkit.Chem.AllChem import * # conformer generation and adding H 
from rdkit.Chem.rdMolDescriptors import * # To calculate descriptors
from rdkit.Chem.Draw import IPythonConsole # This displays the molecule in-line
from rdkit.Chem import PandasTools # for pandas dataframe with rdkit
import pandas as pd 
from rdkit.DataStructs.cDataStructs import ConvertToNumpyArray
import numpy as np
import matplotlib.pyplot as plt

Molecules can be represented as strings with SMILES. The [simplified molecular-input line-entry system (SMILES)](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) is a string-based representation of a molecule. For example, n-butane is represented as CCCC. An interactive RDKit demo for SMILES is available at https://rdkit.org/temp/demo/demo.html

In [None]:
# creating a molecule object from the SMILES of 2-butene (CC=CC), which has both single and double bonds
mol = Chem.MolFromSmiles("CC=CC")
mol
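
As a small sketch of how SMILES parsing behaves, the cell below parses the n-butane string from the text, compares two different spellings of ethanol, and tries one deliberately broken string. The ethanol and broken strings are illustrative choices, not part of the exercise data. `Chem.MolFromSmiles` returns `None` when it cannot parse its input, so checking for `None` is a simple way to catch bad entries.

```python
from rdkit import Chem

# n-butane, written as in the text above
butane = Chem.MolFromSmiles("CCCC")
print("Heavy atoms in n-butane:", butane.GetNumAtoms())

# Two different SMILES strings for ethanol (illustrative) give the same
# canonical SMILES once they are parsed and written back out
ethanol_a = Chem.MolToSmiles(Chem.MolFromSmiles("OCC"))
ethanol_b = Chem.MolToSmiles(Chem.MolFromSmiles("CCO"))
print("Same canonical SMILES:", ethanol_a == ethanol_b)

# A string with an unclosed ring (illustrative) cannot be parsed;
# RDKit logs an error message and returns None
bad = Chem.MolFromSmiles("C1CC")
print("Invalid SMILES parsed to None:", bad is None)
```

Hydrogens are implicit at this stage, which is why `GetNumAtoms()` counts only the four carbon atoms.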

The molecule object contains atom objects, which can be used for further operations.

In [None]:
# Get the number of atoms and types of atoms

for idx, atom in enumerate(mol.GetAtoms()):
  print("Atom", idx + 1,"has atomic number of",atom.GetAtomicNum())

print()
# available methods for atoms
# dir(atom)

Similar to atoms, the bonds list can also be retrieved.

In [None]:
# Here we will get the type of each bond (single, double, ...)
for idx,bond in enumerate(mol.GetBonds()):
  print("Bond",idx+1,"the type of bond is",bond.GetBondType().name)

print()
# available methods for bonds
# dir(bond)
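
The atom and bond loops above can also be summarised per molecule. The following is a minimal sketch, assuming the same 2-butene SMILES as before; the variable names are illustrative. It tallies element symbols and bond types with `collections.Counter` and prints which atoms each bond connects.

```python
from collections import Counter
from rdkit import Chem

butene = Chem.MolFromSmiles("CC=CC")

# Tally element symbols and bond types for the molecule
symbols = Counter(atom.GetSymbol() for atom in butene.GetAtoms())
bond_types = Counter(bond.GetBondType().name for bond in butene.GetBonds())
print("Elements:", dict(symbols))
print("Bond types:", dict(bond_types))

# Each bond knows the (0-based) indices of the two atoms it connects
for bond in butene.GetBonds():
    print(bond.GetBeginAtomIdx(), "-", bond.GetEndAtomIdx(), bond.GetBondType().name)
```

Note that RDKit atom indices start at 0, whereas the loops above print `idx + 1` for readability.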

## **Try it yourself!**
Find the molecule that has:

 1) the highest number of atoms

 2) the highest number of rings

 3) the highest number of double bonds

 4) the largest number of non-carbon atoms (atoms other than C or H)

 The list of SMILES is given to you. There are 133885 SMILES. You can consider a smaller set of SMILES by slicing the list (e.g. smiles_list_500 = smiles_list[:500])

In [None]:
# DO NOT EDIT THE CODE IN THE CELL
# run this code before you work on the solution to the exercise
! wget https://raw.githubusercontent.com/vinayak2019/chemistry_python_intermediate/main/H_smiles.dat

# read the file with smiles
with open("H_smiles.dat","r") as f:
    smiles_file = f.read()

# clean the files to generate list of smiles
smiles_list = smiles_file.strip().split("\n")
print("The number of smiles in the list is",len(smiles_list))


In [None]:
# YOUR CODE HERE

## Generating the 3D structure
The molecule generated from SMILES has neither explicit hydrogen atoms nor coordinates for the atoms (a conformer). To generate 3D descriptors of a molecule for machine learning, we need a 3D structure (conformer). 2D descriptors generally do not need a conformer.

In [None]:
# Calculation of the exact molecular weight (implicit hydrogen atoms are counted)
CalcExactMolWt(mol)

In [None]:
# Computing a 3D descriptor - radius of gyration
# (expected to raise an error here, because mol has no conformer yet)
CalcRadiusOfGyration(mol)

In [None]:
# checking whether the molecule object has a conformer
mol.GetNumConformers()

In [None]:
# Let's add hydrogens to the molecule; it still has no conformer
mol_h = AddHs(mol)
print("Number of conformers is ", mol_h.GetNumConformers())
print(Chem.MolToMolBlock(mol_h))
mol_h

Always add hydrogens before generating conformers.

In [None]:
# adding conformer
EmbedMolecule(mol_h)
print("Number of conformers is ", mol_h.GetNumConformers())
print(Chem.MolToMolBlock(mol_h))

We can also add multiple conformers. You can use a force field to optimize each structure and compute its energy.

In [None]:
# Generating 50 conformers for the molecule
EmbedMultipleConfs(mol_h,numConfs=50)
print("Number of conformers is ", mol_h.GetNumConformers())
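
The force-field step mentioned above could look like the sketch below. The text does not name a force field, so MMFF94 (via `AllChem.MMFFOptimizeMoleculeConfs`) is an assumption here, as are the choices of 10 conformers, `randomSeed=42`, and `maxIters=500`. The function returns one `(not_converged, energy)` pair per conformer, with energies in kcal/mol.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Build 2-butene with explicit hydrogens (hydrogens first, then conformers)
mol_h = Chem.AddHs(Chem.MolFromSmiles("CC=CC"))

# Embed a small set of conformers; the seed makes the run repeatable
conf_ids = list(AllChem.EmbedMultipleConfs(mol_h, numConfs=10, randomSeed=42))

# Optimize every conformer with MMFF94 (an assumed choice of force field).
# Each result is (not_converged, energy); not_converged == 0 means converged.
results = AllChem.MMFFOptimizeMoleculeConfs(mol_h, maxIters=500)
energies = [energy for _, energy in results]

# Pick the conformer with the lowest force-field energy
best = min(range(len(energies)), key=energies.__getitem__)
print("Lowest-energy conformer id:", conf_ids[best])
print("Energy (kcal/mol):", round(energies[best], 3))
```

Several conformers may relax to the same minimum, so identical energies in `energies` are not a sign of a problem.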

## **Try it yourself**

Plot the distribution of molecular volume (ComputeMolVolume) for the molecules in smiles_list. ComputeMolVolume needs a 3D conformer, so add hydrogens and embed each molecule first.


In [None]:
# YOUR CODE HERE


## Dataframe
PandasTools lets us store RDKit molecule objects in a pandas dataframe.

In [None]:
# First create a pandas dataframe with SMILES as a column. sample(500) draws
# 500 random entries from the larger set of 133885 molecules
df = pd.DataFrame(smiles_list, columns=["smiles"]).sample(500)
df.head() # to look at the first 5 entries

In [None]:
# using pandastools to create molecule from smiles within the dataframe
PandasTools.AddMoleculeColumnToFrame(df,smilesCol="smiles")
df.head() 

Now that we have the molecule objects, we can generate the input and the target values for machine learning. We will use the number of rings as the target value and a molecular fingerprint as the input. You can find more details on molecular fingerprints [here](https://docs.chemaxon.com/display/docs/chemical-fingerprints.md).



In [None]:
# generating the target values - number of rings.
# we use the CalcNumRings function from rdkit
df["target"] = df["ROMol"].apply(CalcNumRings)
df.head()

## Generate the Morgan fingerprints


In [None]:
# we define a function to generate a vector from a molecule object

def get_input(mol):
  fp = GetMorganFingerprintAsBitVect(mol, 2, nBits=100) # gets the vector
  arr = np.zeros((0,))
  ConvertToNumpyArray(fp,arr)  # converts the vector to numpy array
  return arr

df["input"] = df["ROMol"].apply(get_input) # adding the input column to the dataframe
df.head()
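
To see what these fingerprint vectors look like outside the dataframe, here is a minimal, self-contained sketch on three hand-picked molecules (the SMILES and the helper name `fingerprint_array` are illustrative). It uses the same RDKit calls as `get_input` above and stacks the results into a 2D array with one row per molecule, which is the shape most scikit-learn models expect.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint_array(mol, n_bits=100):
    """Return the radius-2 Morgan fingerprint of mol as a 0/1 numpy vector."""
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)  # resizes arr to n_bits
    return arr

# Three illustrative molecules: n-butane, 2-butene, benzene
mols = [Chem.MolFromSmiles(s) for s in ("CCCC", "CC=CC", "c1ccccc1")]

# One row per molecule, one column per fingerprint bit
X_demo = np.stack([fingerprint_array(m) for m in mols])
print("Shape:", X_demo.shape)
print("Bits set per molecule:", X_demo.sum(axis=1))
```

With only 100 bits, different substructures can be hashed onto the same bit, so a larger `n_bits` generally gives fewer collisions at the cost of a longer input vector.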

## Machine learning model

In the previous cells we generated the input and target values for the machine learning model. Let us now get the data into the right format to train a model.

In [None]:
# The input values must be in the form of a vector/list
# Here we assign the values from the dataframe to X and y
X = df["input"].values.tolist()
y = df["target"].values.tolist()
print("Input",X[0])
print("Target",y[0])

We always split the data into a train set and a test set. The train set is used for training, while the test set is used to evaluate the model. We will use a random forest classifier as our model. You can find more details on random forests [here](https://en.wikipedia.org/wiki/Random_forest).

In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
import seaborn as sns

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2,random_state=42) # split data for training and testing
model = RandomForestClassifier(random_state=42) # initialize the model
model.fit(X_train, y_train) # train the model
y_predict = model.predict(X_test) # get prediction on the test set


To evaluate the model we use a confusion matrix. The x-axis is the true value and the y-axis is the predicted value.

In [None]:
confusion_mat = metrics.confusion_matrix(y_predict,y_test)
sns.heatmap(confusion_mat,annot=True,cmap="Blues",cbar=False)
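
A single number can complement the confusion matrix. The sketch below uses small made-up label lists (not results from this notebook) to show `metrics.accuracy_score` next to `metrics.confusion_matrix`. scikit-learn's documented argument order is `(y_true, y_pred)`, which puts true labels on the rows and predictions on the columns. The cell above passes the predictions first, which is why its rows (the y-axis of the heatmap) are the predicted values.

```python
from sklearn import metrics

# Made-up labels for illustration only (e.g. number of rings per molecule)
y_true_demo = [0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 1, 2, 2, 2, 1]

# Fraction of test molecules whose label was predicted exactly
accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", round(accuracy, 3))

# Rows are true labels, columns are predicted labels in this argument order
cm_demo = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
```

The diagonal of the matrix counts correct predictions, so the accuracy equals the diagonal sum divided by the total number of samples.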

## **Try it yourself**

1) Use a random forest classifier for classification based on the number of rings, but try changing the nBits value to check whether the model improves. In the above example, nBits was 100.

In [None]:
# YOUR CODE HERE

2) Use a random forest regressor to predict the molecular mass. You can use metrics.mean_squared_error to evaluate the model.

In [None]:
# YOUR CODE HERE