# PSL Challenge: Drug-Protein Interaction

## Introduction

The goal of this challenge is to predict the inhibitory capacity of molecules on 3 proteins:
- Protein 
- Protein 
- Protein 

You have three separate datasets, one for each protein. Each dataset consists of a training set, containing the molecules (SMILES) and the corresponding inhibition value (pIC50), and a test set containing only the molecules. You must predict the inhibitory capacity of the molecules in this test set and submit your predictions on the platform, following the syntax of the submission file (available here). The datasets must be processed independently of one another.




In this notebook you will find an introduction to handling molecules for predicting biological properties using *fingerprint*-type molecular descriptors. An example predictive model built on this encoding is provided and can serve as a starting point, but you are free to use whatever methods you like to complete this challenge.

Your results, and above all the clarity and reproducibility of your code, will be taken into account when grading this challenge!

### Imports

In [16]:
# Linear algebra and data handling
import numpy as np
import pandas as pd

# RDKit
import rdkit 
from rdkit import Chem
from rdkit.Chem import AllChem

# Machine learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Viz
import matplotlib.pyplot as plt

### Dataset : CHRM2

First, we need to load the dataset. Feel free to explore the data further on your own!

In [62]:
# Loading the train and test csv files
train = pd.read_csv('/root/cloud-data/snf-mgln-idd-workspace/I0549447/code/github/ChallengePSL/data/CHRM2/train.csv')
test = pd.read_csv('/root/cloud-data/snf-mgln-idd-workspace/I0549447/code/github/ChallengePSL/data/CHRM2/test.csv')

# You can check the shapes of the train and test sets
print(train.shape, test.shape)

# You can check what it looks like
train.head()

(304, 3) (131, 2)


|   | drug_id | smiles | y |
|---|---|---|---|
| 0 | 0 | `CCN1CC[C@@H](c2ccccc2)[C@H](C(=O)OC2CCCCC2)C1` | 5.387206 |
| 1 | 1 | `CCCCCCCCCCOC(=O)C1=C(c2ccccc2)CCN(CC)C1` | 7.675718 |
| 2 | 2 | `C[N+]1(CC(=O)c2cccc(F)c2)CC[C@@H](N2CC(c3ccc(F...` | 5.999957 |
| 3 | 3 | `C#CC(C)O/N=C1\CN2CCC1C2` | 5.277358 |
| 4 | 4 | `CCN1CCN(c2ccc(NC(=O)c3cc(C(F)(F)F)ccc3C(F)(F)F...` | 5.56862 |
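As a starting point for that exploration, here is a minimal sketch of a few sanity checks. It runs on a toy table rather than the real `train.csv`: three rows are copied from the preview above, and a fourth row (with the made-up `drug_id` 99) duplicates a molecule purely for illustration. The helper name `quick_eda` is hypothetical.

```python
import pandas as pd

# Toy stand-in for train.csv: three rows copied from the preview above,
# plus one deliberately duplicated molecule (added only for illustration).
toy_train = pd.DataFrame({
    "drug_id": [0, 1, 3, 99],
    "smiles": [
        "CCN1CC[C@@H](c2ccccc2)[C@H](C(=O)OC2CCCCC2)C1",
        "CCCCCCCCCCOC(=O)C1=C(c2ccccc2)CCN(CC)C1",
        r"C#CC(C)O/N=C1\CN2CCC1C2",
        "CCCCCCCCCCOC(=O)C1=C(c2ccccc2)CCN(CC)C1",
    ],
    "y": [5.387206, 7.675718, 5.277358, 7.675718],
})


def quick_eda(df: pd.DataFrame) -> dict:
    """Collect a few basic sanity checks on a training table."""
    return {
        "n_rows": len(df),
        "n_missing": int(df.isna().sum().sum()),
        # Compares raw strings only; two different SMILES strings can
        # still describe the same molecule.
        "n_duplicate_smiles": int(df["smiles"].duplicated().sum()),
        "y_min": float(df["y"].min()),
        "y_max": float(df["y"].max()),
        "y_median": float(df["y"].median()),
    }


report = quick_eda(toy_train)
print(report)
```

On the real data, the same function can be called as `quick_eda(train)`. A histogram of `train['y']` would be a natural next step.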


As you can see, molecules are encoded using the SMILES notation. For RDKit to work with them, we first need to convert them to "Mol" objects.

In [66]:
# Convert SMILES strings to RDKit Mol objects
train_mols = [Chem.MolFromSmiles(smile) for smile in train['smiles']]
test_mols = [Chem.MolFromSmiles(smile) for smile in test['smiles']]

type(train_mols[0])

rdkit.Chem.rdchem.Mol

After converting SMILES to Mol, we can compute fingerprints. For this example, we use Morgan fingerprints, a popular type of circular fingerprint, but many others exist.

In [67]:
# Convert Mol to fingerprints
train_fps = np.array([AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048) for mol in train_mols])
test_fps = np.array([AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048) for mol in test_mols])

print(train_fps.shape, test_fps.shape)

(304, 2048) (131, 2048)
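The conversion above assumes that every SMILES string parses. `Chem.MolFromSmiles` returns `None` when it cannot parse its input, and a `None` entry would make the fingerprint step fail. The sketch below is one possible way to guard against that. It is not the notebook's method: it assumes a reasonably recent RDKit release that provides the `rdFingerprintGenerator` module, and the helper name `smiles_to_morgan` and the invalid example string are hypothetical.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def smiles_to_morgan(smiles_list, radius=2, n_bits=2048):
    """Return (fingerprint matrix, indices of SMILES that failed to parse)."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows, failed = [], []
    for i, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)  # None when the string cannot be parsed
        if mol is None:
            failed.append(i)
            continue
        rows.append(generator.GetFingerprintAsNumPy(mol))
    if rows:
        fps = np.vstack(rows)
    else:
        fps = np.zeros((0, n_bits), dtype=np.uint8)
    return fps, failed


# Two molecules from the training preview and one deliberately invalid string.
demo_smiles = [
    "CCN1CC[C@@H](c2ccccc2)[C@H](C(=O)OC2CCCCC2)C1",
    "not_a_smiles",
    r"C#CC(C)O/N=C1\CN2CCC1C2",
]
demo_fps, demo_failed = smiles_to_morgan(demo_smiles)
print(demo_fps.shape, demo_failed)
```

Returning the failed indices, rather than silently dropping rows, makes it possible to keep the feature matrix aligned with the target values and the `drug_id` column.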


We will use scikit-learn's Random Forest implementation as the predictive model. We can now define our features (X) and targets (y).

In [68]:
X_train = train_fps
y_train = train['y']

X_test = test_fps

Cross-validation is good practice for estimating the "true" performance of a model on unseen data. If the cross-validation scores are satisfactory, we can then train the model on the entire training set and predict the test set. For this introduction we use the default parameters of the Random Forest, but keep in mind that they need to be optimized for the challenge.

In [70]:
model = RandomForestRegressor()
n_folds = 20
cv = cross_validate(estimator=model, 
                    X=X_train, 
                    y=y_train, 
                    cv=n_folds, 
                    scoring=["neg_median_absolute_error", "neg_root_mean_squared_error"],
                    n_jobs=-1,
                    verbose=0)

# scikit-learn reports negated errors, so flip the sign before printing
print(f"Mean median absolute error over {n_folds} folds = {-cv['test_neg_median_absolute_error'].mean():.4f}")
print(f"Mean RMSE over {n_folds} folds = {-cv['test_neg_root_mean_squared_error'].mean():.4f}")

Mean median absolute error over 20 folds = 0.3966
Mean RMSE over 20 folds = 0.7028
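The baseline above uses default hyperparameters. One possible way to tune them is a small grid search, sketched below on synthetic data so that it runs on its own. The binary matrix `X_demo` merely stands in for fingerprints, and the grid values, the fold count, and the fixed `random_state` are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for fingerprints: 60 "molecules" with 32 binary features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 32))
# The target depends on the first four bits, plus a little noise.
y_demo = X_demo[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=60)

# Illustrative grid only; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "max_features": ["sqrt", 1.0]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),  # fixed seed for reproducibility
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # scores are negated errors
print(search.best_params_, round(best_rmse, 4))
```

On the challenge data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`. Fixing `random_state` in both the model and the splitter keeps the search reproducible from one run to the next.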


We can now predict X_test and create a submission file, following the submission template.

In [71]:
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [72]:
# Build the submission table: one predicted value per test molecule
df = pd.DataFrame({'drug_id': test['drug_id'], 'y': y_pred})

# Write the submission file without the index column
df.to_csv('/root/cloud-data/snf-mgln-idd-workspace/I0549447/code/github/ChallengePSL/data/CHRM2/baseline_rf.csv', index=False)
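Before uploading, it can help to read the file back and check it against the test set. The sketch below assumes the template expects exactly the two columns written above (`drug_id` and `y`), which should be confirmed against the official submission file. The helper `check_submission` and the toy data are hypothetical, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd


def check_submission(submission: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return a list of problems found in a submission table (empty if none)."""
    problems = []
    if list(submission.columns) != ["drug_id", "y"]:
        problems.append("columns should be exactly ['drug_id', 'y']")
        return problems
    if len(submission) != len(test):
        problems.append("row count differs from the test set")
    if set(submission["drug_id"]) != set(test["drug_id"]):
        problems.append("drug_id values differ from the test set")
    if submission["y"].isna().any():
        problems.append("missing predictions")
    return problems


# Toy test set and predictions (made-up values, for illustration only).
toy_test = pd.DataFrame({"drug_id": [0, 1, 2], "smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})
toy_submission = pd.DataFrame({"drug_id": toy_test["drug_id"], "y": [6.1, 5.8, 5.4]})

# Round-trip through CSV, as the real file would be written and read back.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

good_report = check_submission(reloaded, toy_test)
bad_report = check_submission(reloaded.assign(y=[6.1, None, 5.4]), toy_test)
print(good_report, bad_report)
```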