# AI in medicine - CADD exercise

- **Tutor:** David Schaller, AG Volkamer, Charité - Universitätsmedizin Berlin (david.schaller@charite.de)
- **Target audience**: Medical students from Charité

This notebook is based on [TeachOpenCADD](https://github.com/volkamerlab/TeachOpenCADD/) and the scikit-learn [intro](https://github.com/volkamerlab/ai_in_medicine) from week 1.

## Aim

In this notebook, the experience gained in the first week will be applied to perform a virtual screening experiment for inhibitors of the epidermal growth factor receptor ([EGFR](https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor)) via machine learning. First, computer-friendly molecular representations will be introduced, which allow the training of machine learning models. Next, a support vector machine will be trained to classify molecules as active or inactive. The trained model will be used to predict the activity of a small molecule set. Finally, successful participants can check their hits for potential activity against EGFR via online resources.

## Learning goals

- apply knowledge from first week
- represent molecules in a computer-friendly fashion
- perform a virtual screening experiment
- check online resources for potential activities

## Theory
The essential theory of machine learning algorithms was covered in the first week. The concept of virtual screening will be presented separately via slides.

## References
- [epidermal growth factor receptor](https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor)
- [molecular fingerprints](http://infochim.u-strasbg.fr/CS3/program/material/Bajorath.pdf)
- [support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine)
- [virtual screening](https://en.wikipedia.org/wiki/Virtual_screening)
- [TeachOpenCADD](https://github.com/volkamerlab/TeachOpenCADD/)
- [scikit-learn Intro from week 1](https://github.com/volkamerlab/ai_in_medicine)

## Python packages
- [scikit-learn](https://scikit-learn.org/stable/)
- [rdkit](https://www.rdkit.org/)
- [pandas](https://pandas.pydata.org/)
- [numpy](https://numpy.org/)
- [matplotlib](https://matplotlib.org/)

## Practical

**Content**

1. Install RDKit  
2. Import modules  
3. Data preparation  
 3.1 Load data  
 3.2 Interpret molecules  
4. Classify data  
5. Split data  
6. Train a support vector classifier
7. Assess performance
8. Apply to unknown molecules

### 1. Install RDKit

RDKit is not preinstalled in Google Colab. In this notebook we install it with `conda` (a package manager), which is not available on Colab by default either. To provide RDKit, we therefore need to (1) install conda (we will use `condacolab` for that) and (2) install RDKit using `mamba`.

In [None]:
!pip install condacolab
import condacolab
condacolab.install()

In [None]:
!mamba install -yq rdkit

### 2. Import modules

These modules are needed to perform all parts of this exercise. Feel free to add other modules, since there are multiple solutions to succeed.

In [None]:
# data handling
import numpy as np
import pandas as pd

# chemistry
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from rdkit.Chem.Draw import IPythonConsole, rdMolDraw2D

# machine learning
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# plotting
import matplotlib.pyplot as plt

### 3. Data preparation

#### 3.1 Load data

We will use a subset of molecules retrieved from the [ChEMBL](https://www.ebi.ac.uk/chembl/) database, which contains ~5k molecules with reported activity against EGFR. The whole ChEMBL database currently contains ~16 million datapoints for ~2 million compounds.

***Insert code to load the data found in `data/egfr_chembl25.csv` into a pandas dataframe named 'df' and display the first few rows.***

In [None]:
# Read activity data for EGFR into a pandas dataframe named df
egfr_chembl25_link = 'https://github.com/volkamerlab/ai_in_medicine/raw/master/data/egfr_chembl25.csv'
#################### <-- insert code below

#################### <-- insert code above

The dataframe contains the ChEMBL ID, which can be used to query the ChEMBL database, the molecule in the form of a SMILES string, and an activity value in the form of an IC50 in nM. The IC50 is the molar concentration of a compound that results in 50 percent inhibition in vitro, so a lower IC50 means a more potent inhibitor.
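To get a feeling for this table layout without touching the real data, the following minimal sketch builds a three-row toy table and inspects it with pandas. Only the `smiles` column name is taken from this notebook; the column names `chembl_id` and `IC50`, the placeholder IDs, and the activity values are assumptions made up for illustration.

```python
import io

import pandas as pd

# Toy CSV content (NOT real ChEMBL data): placeholder IDs, invented IC50 values
toy_csv = io.StringIO(
    "chembl_id,IC50,smiles\n"
    "EXAMPLE_1,12.0,CCO\n"
    "EXAMPLE_2,750.0,c1ccccc1\n"
    "EXAMPLE_3,4300.0,Cc1ccccc1\n"
)

# pd.read_csv accepts a file path, a URL, or a file-like object such as StringIO
toy_df = pd.read_csv(toy_csv)

# head() returns the first rows (five by default) for a quick visual check
print(toy_df.head())
print(toy_df.dtypes)
```

Because `pd.read_csv` also accepts a URL, the same call pattern should carry over to a link to a hosted CSV file.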

#### 3.2 Interpret molecules

Next, we will interpret the molecules that are stored as SMILES strings and transform them into a format that a machine learning algorithm can handle. The [RDKit](https://www.rdkit.org/) library is a free open-source framework for working with molecular data. In the following cells you will learn a few basic RDKit functionalities and how to store substructures of molecules in computer-friendly bit vectors that can later be used to train your model.

In [None]:
# pick the first SMILES stored in the dataframe and display the molecule with RDKit
print(df['smiles'][0])
mol = Chem.MolFromSmiles(df['smiles'][0])
mol

The SMILES (**S**implified **M**olecular **I**nput **L**ine **E**ntry **S**ystem) representation stores the types and connectivity of atoms in a single string.  
**Atom types** are represented by their atomic symbols: upper case letters represent aliphatic atoms and lower case letters represent aromatic atoms. Hydrogens are usually omitted, since they can be inferred from atom type and connectivity:  
`C` - aliphatic carbon (sp3 unless it carries a double or triple bond)  
`n` - aromatic (sp2) nitrogen  
**Bonds** are only represented if needed:  
`-` - single bond (`CC` and `C-C` are the same, since single bonds are used by default)  
`=` - double bond (`C=C` is ethene; the Kekulé form `C1=CC=CC=C1` and the aromatic form `c1ccccc1` both describe benzene)  
`#` - triple bond  
**Ring** opening and closures are represented with numbers:  
`c1ccccc1` - benzene  
**Substituents** leaving a chain or ring are represented with brackets:  
`c1cc(C)ccc1` - methyl-substituted benzene.  
`CC(F)(Br)Cl`  - ethane substituted with fluorine, chlorine and bromine
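
The rules above can be checked directly with RDKit. The following is a minimal sketch that parses a few of the example strings and compares their canonical SMILES; the helper name `canonical` is introduced here for illustration only.

```python
from rdkit import Chem


def canonical(smiles):
    """Parse a SMILES string and write it back in RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol)


# Single bonds are implicit, so both strings describe ethane
same_single = canonical('CC') == canonical('C-C')

# Kekulé (alternating double bonds) and aromatic notation both describe benzene
same_benzene = canonical('C1=CC=CC=C1') == canonical('c1ccccc1')

# Hydrogens are implicit: RDKit only counts the heavy atoms written in the string
toluene = Chem.MolFromSmiles('c1cc(C)ccc1')
n_heavy = toluene.GetNumAtoms()
n_aromatic = sum(atom.GetIsAromatic() for atom in toluene.GetAtoms())

print('CC equals C-C:', same_single)
print('Kekulé equals aromatic benzene:', same_benzene)
print('Toluene heavy atoms:', n_heavy, '| aromatic atoms:', n_aromatic)
```

Comparing canonical SMILES is a convenient way to test whether two differently written strings encode the same molecule.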

**With the rules from above, you should be able to create the SMILES for acetylsalicylic acid, the active ingredient of Aspirin.**

<img src='images/aspirin.png'>

2D representation of acetylsalicylic acid taken from [Wikipedia](https://en.wikipedia.org/wiki/Aspirin#/media/File:Aspirin-skeletal.svg).

In [None]:
# Write the smiles for acetylsalicylic acid
####################

####################

Molecules can be represented in the form of [molecular fingerprints](http://infochim.u-strasbg.fr/CS3/program/material/Bajorath.pdf), which store the presence of substructures in a bit vector consisting of zeros and ones. Here we will use Molecular ACCess System (MACCS) keys, which are implemented in RDKit and record the presence of a predefined set of substructures.

In [None]:
maccs_keys = list(MACCSkeys.GenMACCSKeys(mol))
print(maccs_keys)
print('Zeros:', len(maccs_keys) - sum(maccs_keys))
print('Ones:', sum(maccs_keys))

Let's explore which substructures can be found in our sample molecule.

In [None]:
# Get indices of ones
maccs_key_series = pd.Series(maccs_keys)
maccs_key_series[maccs_key_series==1].index

The underlying substructures are provided in RDKit via a dictionary and are represented as [SMARTS](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), which is an extension of the SMILES language and especially useful for substructure searches. The SMARTS of bit 80 represents the substructure `[#7]~*~*~*~[#7]`.
- `#7` - any nitrogen
- `~` - any bond
- `*` - any atom
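
To see how such a pattern relates to a fingerprint bit, the following minimal sketch matches the SMARTS of bit 80 against two small diamines chosen here purely for illustration (they are not part of the data set): `NCCCN`, where three atoms separate the nitrogens, and `NCCN`, where only two do.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

# SMARTS behind MACCS bit 80: two nitrogens separated by any three atoms
pattern = Chem.MolFromSmarts('[#7]~*~*~*~[#7]')

# Illustrative molecules (assumption: picked only to show a hit and a miss)
diamine_3 = Chem.MolFromSmiles('NCCCN')  # three carbons between the nitrogens
diamine_2 = Chem.MolFromSmiles('NCCN')   # only two carbons between the nitrogens

# HasSubstructMatch returns True if the pattern occurs at least once
match_3 = diamine_3.HasSubstructMatch(pattern)
match_2 = diamine_2.HasSubstructMatch(pattern)

# The MACCS fingerprint should reflect the same finding at index 80
fp = MACCSkeys.GenMACCSKeys(diamine_3)
bit_80_on = bool(fp.GetBit(80))

print('NCCCN matches:', match_3)
print('NCCN matches:', match_2)
print('Fingerprint length:', fp.GetNumBits(), '| bit 80 set:', bit_80_on)
```

Each set bit in the fingerprint can be traced back to a substructure match of this kind.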

We can also visualize substructures with the following lines. Just replace the `bit_index` with any of the indices found above.

In [None]:
bit_index = 80
smarts = MACCSkeys.smartsPatts[bit_index][0]
Chem.MolFromSmarts(smarts)

Next, let's use the following function to add an RDKit representation and the MACCS keys for each of the SMILES in the data set.

In [None]:
def add_mols_and_maccs(df, smiles_column='smiles'):
    """
    Generate RDKit molecule objects and MACCS keys and add them to the given dataframe.

    The dataframe is modified in place; nothing is returned.

    Parameters
    ----------
    df: pandas.DataFrame
        A data frame containing a column with SMILES.
    smiles_column: str
        Name of the column holding the SMILES strings.
    """
    df['mol'] = df[smiles_column].apply(Chem.MolFromSmiles)
    df['maccs'] = df['mol'].apply(MACCSkeys.GenMACCSKeys)

In [None]:
# add columns for rdkit molecules and maccs keys
add_mols_and_maccs(df)
display(df.head())

### 4. Classify data

To train a machine learning model to classify molecules as active or inactive, we need to add an activity label to our data set.

***Insert code below that adds a column named 'active' to the dataframe that holds the value 1.0 if the IC50 is lower than 500 and otherwise 0.0.***
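
As a hint, the thresholding idea can be sketched on a toy table. The column name `IC50` and the values below are assumptions for illustration and may differ from the column name in your dataframe.

```python
import numpy as np
import pandas as pd

# Toy activity values in nM (invented for illustration, not from ChEMBL)
toy = pd.DataFrame({'IC50': [12.0, 499.9, 500.0, 8000.0]})

cutoff_nm = 500  # activity cutoff used in this exercise

# np.where picks 1.0 where the condition holds and 0.0 elsewhere;
# the comparison is strict, so a value of exactly 500 counts as inactive
toy['active'] = np.where(toy['IC50'] < cutoff_nm, 1.0, 0.0)

print(toy)
```
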

In [None]:
# Mark every molecule with an IC50 < 500 nM as active
####################

####################

The following lines should find 2762 actives and 2147 inactives.

In [None]:
print('Actives:', int(df['active'].sum()))
print('Inactives:', int(len(df)-df['active'].sum()))

### 5. Split data

***Split the data into a training set and a test set by using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.***

In [None]:
# specify features and label
x, y = df['maccs'].to_list(), df['active'].to_list()
# Split the features and labels into training and test sets
####################

####################

### 6. Train a support vector classifier

***Train a [support vector classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).***

In [None]:
# train model
####################

####################

### 7. Assess performance

- ***[predict](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict) the activity of the test set***  
- ***assess the performance of your model by plotting a [ROC curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) and calculating the [AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)***
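
As a reminder of the week 1 workflow, the following minimal sketch runs the split, train, and assess steps on synthetic data from `make_classification`. It is not the EGFR data, and the chosen parameters (`test_size`, `random_state`, the RBF kernel) are assumptions rather than prescribed settings. One detail worth noting: a ROC curve needs a continuous score per sample, so the sketch uses `decision_function` for the curve and `predict` only for the hard 0/1 labels.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for fingerprints (features) and activity labels
x_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, class_sep=2.0, random_state=0
)

# Hold out 20 % of the samples; stratify keeps the class ratio similar in both sets
x_train, x_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Fit a support vector classifier on the training set only
model = SVC(kernel='rbf')
model.fit(x_train, y_train)

# Hard class labels for the test set
y_pred = model.predict(x_test)

# Signed distance to the decision boundary, used as a ranking score
y_score = model.decision_function(x_test)

# Area under the ROC curve and the points of the curve itself
auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

print(f'AUC on the toy test set: {auc:.2f}')
```

The arrays `fpr` and `tpr` can be passed to `plt.plot(fpr, tpr)` to draw the curve; a diagonal from (0, 0) to (1, 1) marks the performance of random guessing.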

In [None]:
# predict the activity of the test set
####################

####################

In [None]:
# calculate AUC
####################

####################

In [None]:
# plot the ROC curve
####################

####################

### 8. Apply to unknown molecules

***Use your model and predict the activity of a set of unknown molecules located at*** `data/egfr_candidates.csv`***.***

In [None]:
# load data and assign maccs keys 
egfr_candidates_link = 'https://github.com/volkamerlab/ai_in_medicine/raw/master/data/egfr_candidates.csv'
####################

####################

In [None]:
# predict the activity
####################

####################

***Visit [PubChem](https://pubchem.ncbi.nlm.nih.gov/), an online resource for chemical information, and query the database with the SMILES of one predicted active and one predicted inactive molecule. What can you find out about the molecules?***