# Chemoinformatics and Chemical Descriptors

So we've had some practice with Python and with handling chemical data. We need to fill in one more big step before we can build our own models.

## SMILES

### Drawing Structures

If we want to train a model to predict some property--say, solubility--we need a way for our program to uniquely identify a molecule. One of the more common tools is called the SMILES string (simplified molecular-input line-entry system), which translates a chemical line drawing into a computer-readable string. SMILES strings were invented before the days of ChemDraw, but they've maintained relevance in chemoinformatics. In this notebook assignment, you will learn how to write, read, and translate SMILES strings.

We will make use of some existing Python libraries to help. Execute the cell below, and see your instructor if the cell produces an error.

In [None]:
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdDepictor
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
from rdkit.DataStructs import FingerprintSimilarity
import numpy as np

IPythonConsole.ipython_useSVG = True
IPythonConsole.molSize = 300, 300
rdDepictor.SetPreferCoordGen(True)
%matplotlib inline

SMILES strings are written using atomic symbols, punctuation, and numbers to specify the connectivity of a molecule. Atomic symbols for common elements (B, C, N, O, S, F, Cl, Br, I) are written as-is, while less common elements must be enclosed in brackets, for example [Au].

Single bonds are implied when atoms are adjacent, and double and triple bonds require different notation. We can use `rdkit` to view the line structure generated by SMILES. The function we'll call generates a `Molecule` object, which we'll use for many things later on.
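
As a small illustration of this notation, here is a sketch that parses a few simple molecules and reports their bond orders. The molecules are chosen for illustration only and are not part of the exercises; `#` is the SMILES symbol for a triple bond.

```python
from rdkit import Chem

# Illustrative molecules only (not the exercise answers)
examples = {
    "ethane": "CC",       # adjacent atoms: implied single bond
    "ethene": "C=C",      # '=' marks a double bond
    "ethyne": "C#C",      # '#' marks a triple bond
    "gold atom": "[Au]",  # less common elements go in brackets
}

bond_orders = {}
for name, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    # GetBondTypeAsDouble() reports each bond order as a float
    bond_orders[name] = [bond.GetBondTypeAsDouble() for bond in mol.GetBonds()]
    print(f"{name:10s} {smiles:5s} bond orders: {bond_orders[name]}")
```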

Let's start by viewing propane by inputting its SMILES string below.

In [None]:
propane = Chem.MolFromSmiles()
propane

We can signify a double bond using `=`. In the cell below, create a drawing of 2-pentene using its SMILES string.

You won't be asked to draw SMILES strings on your own, but it can be useful to be able to read them. In the cell below, plot the drawings of ethanol, benzene, caffeine, and acetate. 

RDKit has a lot of useful functionality. One thing it can do is plot many molecules in a grid. To do this:

1. Put all molecules in a list.
2. Create a new list with the molecule names. It should be in the same order as your molecule list.
3. Create a grid using the rdkit function.
    
In the cell below, fill in the missing code to perform these tasks.

In [None]:
mols = []  # contains the molecule objects
names = [] # contains the names

#Now we create the GridImage
grid = Draw.MolsToGridImage( , legends=names) #pass the 'mols' list here and create the image

####END

grid #visualize your molecules!
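
For comparison, the three steps can be sketched with a different set of molecules. The molecules and the `example_*` variable names are illustrative assumptions, not part of the assignment.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Illustrative molecules only; the names and SMILES are kept together in a dict
example_smiles = {
    "methanol": "CO",
    "acetic acid": "CC(=O)O",
    "pyridine": "c1ccncc1",
}

# Step 1: list of molecule objects
example_mols = [Chem.MolFromSmiles(s) for s in example_smiles.values()]
# Step 2: list of names, in the same order as the molecules
example_names = list(example_smiles.keys())
# Step 3: build the grid image
example_grid = Draw.MolsToGridImage(example_mols, molsPerRow=3, legends=example_names)

example_grid  # in a notebook, this displays the grid
```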

### The Molecule Object

Molecule objects in rdkit are very useful beyond visualizing structures. For example, they can return structural data, like numbers of atoms or bonds, so that users can do some data mining.

Let's take an example--caffeine--and see how to analyze these properties.

In [None]:
caffeine = Chem.MolFromSmiles('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
caffeine

In [None]:
# let's save and print some features
n_atoms = caffeine.GetNumAtoms()
print(f"Caffeine has {n_atoms} atoms")

In [None]:
n_bonds = caffeine.GetNumBonds()
print(f"Caffeine has {n_bonds} bonds")

We can even get some atomic data. Let's use a function to get a list of each atom:

In [None]:
atoms = list(caffeine.GetAtoms()) # we'll make it a list

Let's now loop through our list and print some atomic data, including element symbol, number of bonding partners (degree), and hybridization.

In [None]:
for atom in atoms:
    symbol = atom.GetSymbol()
    degree = atom.GetDegree()
    hybridization = atom.GetHybridization()
    print(f"Atom: {symbol}, Degree: {degree}, Hybridization: {hybridization}")
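
The same loop can feed a tally instead of a printout. The sketch below counts caffeine's heavy atoms by element and shows one way to test an atom's hybridization against RDKit's `HybridizationType` values. The choice to count $sp^3$ carbons here is only for illustration.

```python
from collections import Counter
from rdkit import Chem

caffeine = Chem.MolFromSmiles('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')

# Tally heavy atoms by element symbol (hydrogens are implicit in the Mol)
element_counts = Counter(atom.GetSymbol() for atom in caffeine.GetAtoms())
print(element_counts)

# Hybridization values are enum members, so compare against the enum
sp3 = Chem.rdchem.HybridizationType.SP3
n_sp3_carbons = sum(
    1
    for atom in caffeine.GetAtoms()
    if atom.GetSymbol() == "C" and atom.GetHybridization() == sp3
)
print(f"Caffeine has {n_sp3_carbons} sp3 carbons")
```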

In the cell below, create a $\beta$-carotene Molecule object using the SMILES string. Then, determine and print the number of $sp^2$ hybridized carbon atoms in the molecule.

## Chemical Descriptors: Fingerprints

SMILES strings are useful for identifying molecules, but they cannot generally be used in building machine learning models. ML algorithms need to operate on arrays of numbers, so we need a way to translate a chemical structure.

![alternative text](desc.png)


One common way to accomplish this translation is to use molecular fingerprints. Molecular fingerprints are arrays of 1s and 0s translated from a chemical structure, typically using a SMILES string. There are a lot of ways to do this translation, and we'll be using a common one called Morgan fingerprints.

Luckily, `rdkit` has built-in tools to generate these fingerprints automatically. Let's see what the Morgan fingerprint for caffeine looks like:

In [None]:
caffeine_fp = AllChem.GetMorganFingerprintAsBitVect(caffeine, 2, nBits=1024)
print(caffeine_fp.ToList())
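
Printing all 1024 values is hard to read. As a hedged alternative, the sketch below summarizes the same fingerprint by its length and by the positions of the bits that are set to 1.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

caffeine = Chem.MolFromSmiles('CN1C=NC2=C1C(=O)N(C(=O)N2C)C')
caffeine_fp = AllChem.GetMorganFingerprintAsBitVect(caffeine, 2, nBits=1024)

n_bits = caffeine_fp.GetNumBits()        # total length of the bit vector
on_bits = list(caffeine_fp.GetOnBits())  # indices of the bits set to 1

print(f"Fingerprint length: {n_bits}")
print(f"Number of bits set: {len(on_bits)}")
print(f"Positions of set bits: {on_bits}")
```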

Morgan fingerprints allow us to more easily compare the similarity of two molecules. This type of task is crucial in categorizing very large collections of molecules. One common way to compare the similarity of two structures is to use the Tanimoto similarity. The Tanimoto similarity reduces the comparison of two fingerprints to a single number between 0 (no bits in common) and 1 (identical fingerprints). On its own, that number can be hard to interpret, and it is really only useful for characterizing the similarities within a collection of structures.
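
To make the definition concrete: for bit-vector fingerprints, the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either one. The sketch below computes it by hand for two illustrative molecules (ethanol and 1-propanol, chosen only as an example) and checks it against `FingerprintSimilarity`.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import FingerprintSimilarity

# Illustrative pair only (not the exercise molecules)
ethanol = Chem.MolFromSmiles("CCO")
propanol = Chem.MolFromSmiles("CCCO")

fp_a = AllChem.GetMorganFingerprintAsBitVect(ethanol, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(propanol, 2, nBits=1024)

# Sets of bit positions that are switched on in each fingerprint
on_a = set(fp_a.GetOnBits())
on_b = set(fp_b.GetOnBits())

# Tanimoto = |A intersect B| / |A union B|
manual = len(on_a & on_b) / len(on_a | on_b)
library = FingerprintSimilarity(fp_a, fp_b)  # Tanimoto is the default metric

print(f"By hand: {manual:.3f}, from rdkit: {library:.3f}")
```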

In the cell below, make two new molecules from their SMILES strings, say toluene and theobromine.

In [None]:
toluene = 
theobromine = 

Then, make the Morgan fingerprints:

In [None]:
toluene_fp = 
theobromine_fp = 

The function `FingerprintSimilarity(f1,f2)` calculates the Tanimoto similarity between two structures given the fingerprints `f1` and `f2`. It returns a number that characterizes the similarity. 

In the cell below, use this function to determine whether theobromine or toluene is more structurally similar to caffeine. 

## Assignment

I've provided a tab-separated (.tsv) data file containing information on a set of molecules, including an identifier, IUPAC name, and SMILES string. In this assignment, you are tasked with generating a similarity plot so that we can visually inspect the similarity between all pairs of molecules in the dataset (don't worry, I'll do most of the plotting).

1. The data is in a file called `example2.tsv`. Use the cell below to load it into a Pandas DataFrame.

In [None]:
import pandas as pd
datafile = pd.read_csv('example2.tsv', sep='\t')

2. Print the first 5 rows of the dataframe below.

3. Now we're going to add a column to this dataframe that creates a Molecule object from the SMILES string. As it turns out, this is a very common thing to do, so `rdkit` has a single function to carry this out. I'll provide the call, but print the first five rows again to ensure that the change was made.

In [None]:
PandasTools.AddMoleculeColumnToFrame(datafile,'SMILES_parent', 'Molecule',includeFingerprints=True)
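
To see what this call does in isolation, here is a sketch on a tiny made-up DataFrame. The two rows are invented for illustration; only the column names `Mol_ID` and `SMILES_parent` are taken from the calls in this notebook.

```python
import pandas as pd
from rdkit.Chem import PandasTools

# Toy data, invented for illustration (not the contents of example2.tsv)
toy = pd.DataFrame({
    "Mol_ID": ["toy-1", "toy-2"],
    "SMILES_parent": ["CCO", "c1ccccc1"],
})

# Adds a 'Molecule' column holding one rdkit molecule object per row
PandasTools.AddMoleculeColumnToFrame(toy, 'SMILES_parent', 'Molecule')

# The new column behaves like any other: we can loop over its molecules
toy_atom_counts = [mol.GetNumAtoms() for mol in toy['Molecule']]
print(toy_atom_counts)
```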



4. Use the cell below to print the number of molecules in this dataset.

5. Let's take a look at these molecules. Execute the cell below (without editing) to see the chemical structures. 

In [None]:
PandasTools.FrameToGridImage(datafile,column= 'Molecule', molsPerRow=4,subImgSize=(150,150),legendsCol="Mol_ID")



6. We can see that some of these molecules look similar. We ultimately want to calculate the similarities. First, we'll need the Morgan fingerprints. In the cell below, generate the Morgan fingerprint for each molecule in the dataset, and store it in a list.

In [None]:
## Here's an empty list
fplist = []

## Below, loop through each molecule, generate the fingerprint,
## and put it in the list



7. In the cell below, add this list to our DataFrame. Use 'mfp' as the column name. Then, print the first 5 rows to ensure the change was made.

8. Here, we'll need to build a 2-dimensional array to store the similarities between each pair of molecules. I think it's simplest to use a 2-D numpy array.

    I'll initialize the array to be the correct size, but full of zeros. Your task is then to loop through *each pair* of molecules, calculate the Tanimoto similarity, and store it in the correct spot in the array. 
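
The pairwise pattern can be sketched on a toy set of three molecules. The SMILES strings and the `toy_*` names are illustrative assumptions; your version should use the fingerprints from the DataFrame instead.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.DataStructs import FingerprintSimilarity

# Toy molecules, for illustration only
toy_smiles = ["CCO", "CCCO", "c1ccccc1"]
toy_fps = [
    AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=1024)
    for s in toy_smiles
]

n = len(toy_fps)
toy_ts = np.zeros((n, n))

# Visit every (row, column) pair and store its Tanimoto similarity
for i in range(n):
    for j in range(n):
        toy_ts[i, j] = FingerprintSimilarity(toy_fps[i], toy_fps[j])

print(np.round(toy_ts, 2))
```

Each molecule is identical to itself, so the diagonal is all 1s, and the matrix is symmetric because the similarity of (i, j) equals that of (j, i).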

In [None]:
# First initialize the array to the correct size,
ts = np.zeros((len(datafile), len(datafile)))

## Loop through your molecules/fingerprints here,
## determine the similarity,
## then store it in the array



9. With your array completed, run the following cell without editing to generate the plot.

In [None]:
import matplotlib as mpl
from matplotlib import pyplot as plt

plt.imshow(ts, interpolation='none',cmap='Purples')
plt.colorbar(orientation='vertical')
plt.xlabel('Molecule Index')
plt.ylabel('Molecule Index')

 10. Which two molecules are the most similar? What is the similarity score? Plot both molecules.

 11. Which molecules are the least similar? What is the similarity score? Plot both molecules. 

 12. Which molecule is the most unique of the set? How can you quantify this?
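
When searching a similarity matrix for extremes, remember that the diagonal always holds 1s (each molecule compared with itself). One possible approach, sketched here on a made-up 3x3 matrix, is to mask the diagonal before looking for the largest entry. The numbers below are invented for illustration.

```python
import numpy as np

# Made-up symmetric similarity matrix, for illustration only
toy_ts = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Work on a copy so the original matrix is left untouched
off_diag = toy_ts.copy()
np.fill_diagonal(off_diag, -np.inf)  # exclude self-comparisons

# argmax returns a flat index; unravel_index converts it to (row, column)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
best_pair = (int(i), int(j))
best_score = float(toy_ts[i, j])

print(f"Most similar pair: {best_pair} with score {best_score}")
```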