# Building a QSAR model using the Morgan fingerprint in RDKit and Random Forest from Spark's machine learning library

## QSAR
We know that the structure of a molecule determines its properties, but in general we cannot calculate molecular properties directly from molecular structure. One approach to this problem comes from the field of Quantitative Structure-Activity Relationships (QSAR) / Quantitative Structure-Property Relationships (QSPR), where a molecular structure is described mathematically and machine learning is then used to predict molecular activity or molecular properties. The two names are often used loosely and interchangeably; QSAR seems to be the more common term, but QSPR is arguably a bit more general. So we need a strategy for describing molecules numerically, a machine learning algorithm, and a lot of data on which to train that algorithm.

### Molecular descriptor
A molecular descriptor is a mathematical representation of a molecule, obtained by transforming the symbolic representation of the molecule into numbers. A simple approach is to count the number of different atoms or fragments. There are many other approaches: some are calculated from the matrix of pairwise distances between all atoms in the molecule, others are fragment based. The idea is to create a vector of numbers calculated in such a way that similar compounds have similar vectors.

Two approaches are worth mentioning. Either each position in the vector is predefined and maps to one thing, so we know exactly what each position means (_e.g._, if a position in the vector holds the number 1, we know that the molecule contains exactly the substructure that corresponds to that position), or a _hashing_ algorithm is used to go from a descriptor to a position in the vector. The hashing algorithm is a one-way algorithm that turns, _e.g._, a string into a number (corresponding to the position in the vector). The hashing approach has the benefit that nobody has to predefine which structures should be used: anything found can be hashed and used in the vector. The drawbacks are that we cannot easily know what each position means, and that there is a risk of hash collisions (_i.e._, the hashing algorithm mapping several different things to the same number, making it impossible to say which one was originally present).

In this lab we will use a circular fingerprint known as the Morgan fingerprint, as implemented in the Python library RDKit, and hash it down to a vector. A circular fingerprint takes each atom in a molecule and describes the atom's neighbours out to a certain distance or "radius".
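
The hashing idea can be illustrated without any chemistry library. The following is a minimal sketch, not RDKit's actual procedure: the fragment strings and the helper `hashed_fingerprint` are hypothetical, and SHA-256 is used only as a convenient stable hash. It shows how arbitrary fragments are folded into a fixed-length bit vector, and why a short vector forces collisions.

```python
import hashlib


def hashed_fingerprint(fragments, n_bits=16):
    """Fold a collection of fragment strings into a fixed-length bit vector.

    Hypothetical helper for illustration only; RDKit uses its own
    atom-environment identifiers and hashing internally.
    """
    bits = [0] * n_bits
    for fragment in fragments:
        # One-way step: string -> large integer -> position in the vector.
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits


# Made-up fragment labels standing in for atom environments.
fragments = ["C-C", "C-O", "C=O"]

wide = hashed_fingerprint(fragments, n_bits=16)
print(wide)

# With only 2 positions, 3 fragments cannot all get their own bit,
# so at least two of them collide.
narrow = hashed_fingerprint(fragments, n_bits=2)
print(narrow)
```

Looking at `narrow` alone, there is no way to tell which fragments set each bit, which is the loss of interpretability described above. Longer vectors (such as the 4096 bits used later in this lab) make collisions less frequent.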

### Machine learning algorithm
We will use the Random Forest algorithm, which is a good algorithm to start with. We will not go into detail on how the algorithm works in this lab, but it constructs a multitude of decision trees and combines their predictions (for regression, by averaging them).
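
To make "many trees, combined" concrete, here is a deliberately simplified sketch in plain Python. It is not how Spark implements Random Forest: each "tree" is a single split on one randomly chosen binary feature, trained on a bootstrap sample, whereas a real forest grows deeper trees and considers a random subset of features at every split. The function names and the toy data are assumptions made for this illustration.

```python
import random
from statistics import mean


def fit_stump(rows, feature):
    """Fit a one-split regression tree: the mean label for each value (0/1) of one feature."""
    fallback = mean(y for _, y in rows)
    leaves = {}
    for value in (0, 1):
        labels = [y for x, y in rows if x[feature] == value]
        # If the bootstrap sample has no row with this value, use the overall mean.
        leaves[value] = mean(labels) if labels else fallback
    return feature, leaves


def fit_forest(rows, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample rows with replacement
        feature = rng.randrange(n_features)        # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict(forest, x):
    """Average the predictions of all trees."""
    return mean(leaves[x[feature]] for feature, leaves in forest)


# Toy data: (bit-vector features, label). The values are made up.
rows = [
    ((1, 0, 1), 2.0),
    ((1, 1, 0), 1.5),
    ((0, 0, 1), -0.5),
    ((0, 1, 0), -1.0),
]

forest = fit_forest(rows, n_trees=10, seed=1)
print(predict(forest, (1, 0, 1)))
```

Because every tree sees a different sample and a different feature, individual trees disagree; averaging them smooths out those differences.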

### Dataset
The data we will look at in this lab is the distribution coefficient, log D. We will use calculated values extracted from a database in which log D has been computed for many substances; because we want to work with big data, we use calculated rather than measured values. Of course, the quality of the model depends heavily on the number of training examples, so we will try different dataset sizes in this lab and see how the model improves.

In [1]:
# Cheminformatics: SMILES parsing and Morgan fingerprints
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np

# Spark ML: pipeline, model, feature indexing and evaluation
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

from pyspark.ml.linalg import Vectors
import time

# Used at the end of the notebook to report the total run time
start_time = time.time()

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

In [None]:
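# Read the tab-separated file and keep a random sample of roughly 2% of the rows.
# Change the fraction to train on a smaller or larger part of the dataset.
df = spark.read.option("header","true")\
               .option("delimiter", '\t').csv("acd_logd.smiles")\
               .sample(0.02)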

In [None]:
# Parse each SMILES, skip molecules RDKit cannot parse, compute a 4096-bit
# Morgan fingerprint (radius 2), and pair it as a dense vector with the log D label.
data = df.select("canonical_smiles", "acd_logd").rdd.map( lambda row: (row.canonical_smiles, float(row.acd_logd)) )\
         .map( lambda x: (Chem.MolFromSmiles(x[0]), x[1]) )\
         .filter( lambda x: x[0] is not None )\
         .map( lambda x: (AllChem.GetMorganFingerprintAsBitVect(x[0], 2, nBits=4096), x[1]) )\
         .map( lambda x: (np.array(x[0]),x[1]) )\
         .map( lambda x: (Vectors.dense(x[0].tolist()),x[1]) )\
         .toDF(["features", "label"])

In [None]:
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only

spark.stop()
print("--- %s seconds ---" % (time.time() - start_time))
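
The metric reported above, RMSE, is the square root of the mean squared difference between predicted and true labels, so it is expressed in the same units as log D. As a minimal sketch of the arithmetic (the function `rmse` and the numbers below are illustrative and are not taken from the dataset or from Spark's implementation):

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: two perfect predictions and one that is off by 2.
labels = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 5.0]
print(rmse(labels, predictions))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.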