# Data Camp 2016

<h1> <a href=http://www.datascience-paris-saclay.fr/>Paris Saclay Center for Data Science</a> </h1>

<h2> RAMP on qualitative and quantitative non-invasive monitoring of anti-cancer drugs </h2>

<i>Camille Marini (LTCI/CNRS), Alex Gramfort (LTCI/Télécom ParisTech), Sana Tfaili (Lip(Sys)²/UPSud), Laetitia Le (Lip(Sys)²/UPSud), Mehdi Cherti (LAL/CNRS), Balázs Kégl (LAL/CNRS)</i>

<i>Report by: OUMOUSS EL MEHDI</i>
Github: oumoussmehdi

## Approach 

New to the field of data science, yet very passionate about it, I learned a lot from this
experience.

At the beginning, my approach was more or less what I would call a brute-force approach
to the problem, mainly due to my lack of knowledge of the use cases of machine learning algorithms.
I therefore tried several algorithms independently and sometimes combined them, which,
with a few exceptions, did not yield good results. I even created a
table to track the performance of each algorithm I used, in order to compare them.
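
A tracking table of this kind can be built with cross-validation. The following is only a minimal sketch on synthetic data: the dataset, the two candidate models and their settings are assumptions chosen for illustration, not the models or data actually compared in the camp.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the spectra (assumption): 200 samples, 4 classes.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Hypothetical candidates to compare.
candidates = {
    "sgd": SGDClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50,
                                                    random_state=0),
}

# Error = 1 - mean accuracy over 3 cross-validation folds.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
    results[name] = 1.0 - scores.mean()

# Print the table, best (lowest error) first.
for name, error in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} error = {error:.3f}")
```

Scoring every candidate on the same folds keeps the comparison fair, which a table of ad hoc runs does not guarantee.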

So I changed my approach: I read about the algorithms I had come across and tried to understand their concepts properly, and especially when each one is recommended. In other words, their use cases.

At this point, I started treating the problem in a more scientific way.
Here are the steps I followed:

1. Analyse the data and the provided plots, in order to get insights.
2. Come up with a classifier with a low error.
3. Create a good regressor with a low mare (mean absolute relative error).
4. Extract features, in order to improve the performance of the predictive model.
5. Fine-tune (I did not reach this step).

## Performance

Before going into the details of my work, here is the performance I achieved across my submissions:

* 1st: (Modified the Classifier)
* 2nd: (Modified the Classifier)
* 3rd: (Modified the Classifier, Regressor and the Classifier's Feature Extraction)      

Performance:             
      
|     | combined | error | mare  |
|-----|----------|-------|-------|
| 1st | 0.109    | 0.065 | 0.195 |
| 2nd | 0.153    | 0.120 | 0.219 |
| 3rd | 0.165    | 0.058 | 0.378 |
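
The combined column appears consistent with a weighted sum of the other two columns. The sketch below checks this under the assumption that the weights are 2/3 for the classification error and 1/3 for the mare. This weighting is not stated in this report, and `combined_score` is a hypothetical helper name.

```python
# Assumed weighting (not stated in this report):
# combined = 2/3 * error + 1/3 * mare
def combined_score(error, mare, w_error=2.0 / 3.0):
    """Weighted sum of classification error and regression mare."""
    return w_error * error + (1.0 - w_error) * mare


# (error, mare) pairs copied from the table above.
submissions = {
    "1st": (0.065, 0.195),
    "2nd": (0.120, 0.219),
    "3rd": (0.058, 0.378),
}

for name, (error, mare) in submissions.items():
    print(name, round(combined_score(error, mare), 3))
```

Under this assumption the recomputed values agree with the table to within about 0.001, the difference being attributable to rounding of the inputs. It also shows why the 3rd submission scores worst overall despite having the lowest classification error: its mare is much higher.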

Time spent on the project:
one and a half days (the camp overlapped with other courses and labs in my master's programme).

## What I learned: Takeaways

* Working on real data.
* Expanding my knowledge of different machine learning algorithms.
* Becoming familiar with feature extraction, a technique that I had not worked on much.
* I am sure I will learn a lot from the collaborative session.

## Classification

The algorithms I used (definitions from Wikipedia):

Stochastic gradient descent:
A stochastic approximation of the gradient descent optimization method
for minimizing an objective function that is written as a sum of differentiable functions.

Principal component analysis (PCA):
A statistical procedure that uses an orthogonal transformation to convert
a set of observations of possibly correlated variables into a set of values
of linearly uncorrelated variables called principal components.

Gradient boosting:
A technique for regression and classification problems,
which produces a prediction model in the form of an ensemble of weak prediction models.
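
To make the PCA definition concrete, here is a small sketch on synthetic data (the data and variable names are assumptions for illustration only). It projects three variables, two of which are strongly correlated, onto two principal components and checks that the components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated synthetic variables plus one independent one.
a = rng.normal(size=500)
X = np.column_stack([
    a,
    2.0 * a + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# Covariance matrix of the components: off-diagonal entries are
# numerically zero, i.e. the components are linearly uncorrelated.
cov = np.cov(Z, rowvar=False)
print(np.round(cov, 6))

# Share of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```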

In [None]:
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.n_components = 10
        self.n_estimators = 500

        # Intermediate pipeline steps must be transformers, so the SGD
        # classifier is wrapped in SelectFromModel to act as a feature selector.
        sgd = SelectFromModel(SGDClassifier())
        pca = PCA(n_components=self.n_components)
        gbc = GradientBoostingClassifier(
            n_estimators=self.n_estimators, learning_rate=1.0,
            max_depth=1, random_state=0)

        # Feature selection -> dimensionality reduction -> boosted classifier.
        self.clf = Pipeline([
            ('sgd', sgd),
            ('pca', pca),
            ('clf', gbc)
        ])

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

## Regression

As for the regressor, after several experiments with different regressors (xxxxx),
the combination of PCA and GradientBoostingRegressor simply performed best,
even when the classifier was different.

That is why I kept this regressor throughout my experiments. However, in my last submission, after noticing that changing the parameters (number of components, etc.) plays a big role in
enhancing performance, I tried to follow the same approach with the regressor, but that
yielded a bigger mare.
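
Rather than changing parameters by hand, the effect of the number of components and of the boosting settings could be explored with a cross-validated grid search. The following is a minimal sketch on synthetic data. The parameter grid, the data and the scoring choice are assumptions for illustration, not settings used in my submissions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the spectra, with a strictly positive target
# so that the log transform is defined.
X = rng.normal(size=(120, 30))
y = np.exp(0.5 * X[:, 0] + 0.1 * rng.normal(size=120))

pipe = Pipeline([
    ("pca", PCA()),
    ("gbr", GradientBoostingRegressor(random_state=42)),
])

# Hypothetical grid; step names prefix the parameter names.
param_grid = {
    "pca__n_components": [5, 10],
    "gbr__n_estimators": [50, 100],
    "gbr__learning_rate": [0.1, 0.2],
}

search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
# Fit on the log of the target, as the regressor in this report does.
search.fit(X, np.log(y))

print(search.best_params_)
```

Because each combination is scored on held-out folds, a parameter change that increases the error shows up before a submission is made.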

In [None]:
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline


class Regressor(BaseEstimator):
    def __init__(self):
        self.n_components = 10
        self.n_estimators = 200
        self.learning_rate = 0.2
        self.list_molecule = ['A', 'B', 'Q', 'R']
        self.dict_reg = {}

        # Build a separate PCA and regressor per molecule; sharing one instance
        # across pipelines would leave all four with the last molecule's fit.
        for mol in self.list_molecule:
            self.dict_reg[mol] = Pipeline([
                ('r0', PCA(n_components=self.n_components)),
                ('r1', GradientBoostingRegressor(
                    n_estimators=self.n_estimators,
                    learning_rate=self.learning_rate,
                    random_state=42)),
            ])

    def fit(self, X, y):
        for i, mol in enumerate(self.list_molecule):
            # Rows whose last four columns peak at index i belong to molecule i.
            ind_mol = np.where(np.argmax(X[:, -4:], axis=1) == i)[0]
            XX_mol = X[ind_mol]
            y_mol = y[ind_mol]
            # Fit on the log of the target; predict() inverts it with exp.
            self.dict_reg[mol].fit(XX_mol, np.log(y_mol))
        return self

    def predict(self, X):
        y_pred = np.zeros(X.shape[0])
        for i, mol in enumerate(self.list_molecule):
            ind_mol = np.where(np.argmax(X[:, -4:], axis=1) == i)[0]
            XX_mol = X[ind_mol]
            y_pred[ind_mol] = np.exp(self.dict_reg[mol].predict(XX_mol))
        return y_pred

## Feature Extraction

In my 3rd submission, I wanted to explore how my models react when I change some features.
The plots "Raman spectra for each type of molecule" and
"Mean Raman spectra for each concentration value" (see drug_spectra_starting_kit)
were very interesting, in the sense that the region approximately between 1000 and 1600
showed a different behavior. I therefore wanted to emphasize this behavior
by slicing the spectra down to that region.

In [None]:
import numpy as np


class FeatureExtractorClf():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        XX = np.array([np.array(dd) for dd in X_df['spectra']])
        # Keep only columns 1000..1599 (array indices, not frequency values).
        # Slicing inside a for loop would only rebind the loop variable
        # and return XX unchanged.
        return XX[:, 1000:1600]

I think this helped reduce the classifier's error, since it was the lowest of all
my submissions. Yet the regressor's mare was higher, which needs further investigation.
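
One check that may help such an investigation is confirming the shape that a feature extractor actually returns. The sketch below uses a tiny synthetic array (an assumption for illustration). It shows that reassigning the loop variable inside a `for` loop leaves the array untouched, whereas slicing the columns directly does narrow it.

```python
import numpy as np

# Tiny synthetic stand-in for the spectra matrix: 3 samples, 4 points each.
XX = np.arange(12.0).reshape(3, 4)

# Rebinding the loop variable only changes the local name `item`;
# the rows stored in XX are not modified.
for item in XX:
    item = item[1:3]

unchanged_shape = XX.shape

# Slicing the columns directly returns the narrowed array.
sliced = XX[:, 1:3]

print(unchanged_shape, sliced.shape)
```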

## Possible Improvements

I was also planning to run the following experiments:

* See how the classifier reacts when 'vial' and 'solute' are added
  to the dataframe.
* Investigate how intensity and frequency can help enhance
  the model's performance.
* Try to create a feature from existing ones and see whether it
  helps improve the model.
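
The first experiment could be sketched as follows. This is only an illustration on a made-up two-row dataframe: it assumes that 'vial' and 'solute' are numeric columns sitting next to the 'spectra' column, and the values shown are invented.

```python
import numpy as np
import pandas as pd

# Made-up dataframe (assumption): each row holds a spectrum plus two
# numeric columns named 'vial' and 'solute'.
X_df = pd.DataFrame({
    "spectra": [np.linspace(0.0, 1.0, 5), np.linspace(1.0, 2.0, 5)],
    "vial": [1, 2],
    "solute": [3, 1],
})

# Stack the spectra into a 2-D array, one row per sample.
spectra = np.array([np.array(s) for s in X_df["spectra"]])

# Append the two extra columns to the right of the spectra.
extra = X_df[["vial", "solute"]].to_numpy(dtype=float)
features = np.hstack([spectra, extra])

print(features.shape)
```

If these columns turn out to be categorical codes rather than quantities, one-hot encoding them would likely be a better choice than appending the raw integers.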

### Acknowledgement

My work isn't the best; however, I learned a lot from this camp, and I am eager to improve my data science skills.
So, thank you!