# Tech test

The aim of this test is to evaluate some of the skills that you will use in your day-to-day activities at Sensyne Health.
We collaborate as a team and the output of the Analytics side of the team has to be usable by others who might not necessarily be fluent in ML-ese.
The aim of this task is to complete the assignment by focussing on key elements such as code reusability, clarity, conciseness, and use of best practices.

In order to complete this assignment please consider the following classification problem given the dataset below (you are free to add and remove steps as you feel is required). 

Data contains information about mothers who may or may not develop diabetes (Outcome).

1. Explore the data, identify and clarify any assumption you will make
2. Consider any change/operation you will do based on your assumptions
3. Your colleagues have used a Logistic regression classifier. Review the code and apply all the changes that you feel are required
4. Compare this outcome with two other classifiers. Which one is the best out of the three?
5. You are afraid of overfitting. How do you adjust your program to take care of that?
6. Which classifier would you pick?

At every step, git commit a different version of the Notebook to show the changes. Please do so on a local git repository. Don't worry about branches.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random as rnd # this is an unusual way of importing this--I would consider just import random

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats



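# Use a fixed seed so the split and model fits are reproducible (42 is an arbitrary choice);
# seeding from rnd.random() produced a different seed on every run.
np.random.seed(42)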
dataset = pd.read_csv("./dataset.csv")

## Question
Can you please explore the data and provide some valid assumptions on them?

One of the main assumptions I am making is that the samples are independent.  There really isn't a good way for me to test this based on the features provided.

There are a lot of missing values for some of the features.  One assumption is that these data are missing at random.  A full discussion of this problem is outside the scope of this tech test, but we can look to see how these values are distributed with respect to the outcome variable (which itself is somewhat imbalanced):

In [None]:
dataset.groupby('Outcome').agg(lambda x: len(x) - x.astype(bool).sum(axis=0))

Generally speaking, these missing values (0s in this dataset) are distributed roughly in proportion to the outcome measure (about 2 to 1--i.e., there are about twice as many non-diabetes as diabetes outcomes).

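Looking at histograms of the individual features, we can see that a number of them are not normally distributed. Logistic regression itself does not require normally distributed predictors, but regularization, if used, penalizes all coefficients together, so heavily skewed or differently scaled features can distort it: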

In [None]:
for col in dataset.columns:
    fig,ax=plt.subplots()
    dataset[col].plot.hist(ax=ax)
    ax.set_xlabel(col)

Another assumption of logistic regression is no multicollinearity. We can get some idea of this by looking at the heatmap of correlations between features. We drop the outcome as we will look at that using point-biserial correlation. Because of the large number of zeros, we really need to do some sort of imputation before we look at this, because we will be feeding our models imputed data.

In [None]:
import seaborn as sns

# Zeros in these columns stand for missing values, so mark them as NaN.
cols_with_missing = ['Glucose', 'BloodPressure', 'BMI', 'Insulin', 'SkinThickness']
dataset[cols_with_missing] = dataset[cols_with_missing].replace(0, np.nan)

# Mean-impute (SimpleImputer's default strategy) and rebuild the frame as floats,
# which avoids writing float values into integer columns in place.
imp = SimpleImputer()
dataset = pd.DataFrame(imp.fit_transform(dataset), columns=dataset.columns, index=dataset.index)

corr_mat = dataset.drop(columns=['Outcome']).corr()
sns.heatmap(corr_mat, annot=True)

There is a high and somewhat understandable relationship between Age and number of pregnancies.  Otherwise, no serious problems.  We can test for multicollinearity by looking at the variance inflation factor (VIF):

In [None]:
X = dataset.drop(columns=['Outcome'])
X.loc[:,'const'] = 1

vif = pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

vif.drop('const')

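These values are all completely fine; as a common rule of thumb, a VIF above 5 to 10 is taken as a sign of problematic multicollinearity.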
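For readers unfamiliar with the VIF, it is defined as 1 / (1 - R²), where R² comes from regressing one feature on all the others. The sketch below spells that out with plain NumPy on synthetic columns; `vif_by_regression` and the columns `a`, `b`, `c` are hypothetical names used only for illustration, and the function is not a replacement for the statsmodels call used above.

```python
import numpy as np

def vif_by_regression(features):
    """Return 1 / (1 - R^2) for each column regressed on the remaining columns."""
    n_rows, n_cols = features.shape
    vifs = np.empty(n_cols)
    for i in range(n_cols):
        target = features[:, i]
        # Regress column i on the other columns plus an intercept.
        others = np.column_stack([np.ones(n_rows), np.delete(features, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[i] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)             # independent of a
c = a + 0.1 * rng.normal(size=300)   # nearly a copy of a
demo_vifs = vif_by_regression(np.column_stack([a, b, c]))
print(demo_vifs)
```

In this toy example the two nearly identical columns get a very large VIF, while the independent column stays close to 1.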

In [None]:
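# Correlate each predictor with the binary outcome (skip Outcome itself,
# which would trivially correlate perfectly with itself).
for col in dataset.columns.drop('Outcome'):
    r, p = stats.pointbiserialr(dataset['Outcome'].values, dataset[col].values)
    print('{}: r value is {}, p={}'.format(col, r, p))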

Almost all of the predictors have a significant relationship to the outcome variable.
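As a side note on the statistic used here, the point-biserial correlation is mathematically the same as a Pearson correlation in which one variable is binary. A small sketch on synthetic data (the `outcome` and `feature` arrays are made up for illustration) shows the two SciPy functions agreeing:

```python
import numpy as np
from scipy import stats

# Synthetic binary outcome and a feature whose mean shifts with the outcome.
rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=200)
feature = rng.normal(size=200) + 0.8 * outcome

r_pb, p_pb = stats.pointbiserialr(outcome, feature)
r_pearson, p_pearson = stats.pearsonr(outcome, feature)

print(r_pb, r_pearson)
```

Note that running one test per predictor means several p-values are being read at once, so borderline results deserve some caution.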

## Question
Anything that we need to do based on your assumptions?

We will apply `PowerTransformer` to Age, DiabetesPedigreeFunction, Insulin and SkinThickness.

The train-test split in the original code is very messy, especially the repeated use of magic numbers like 0.3 and 0.7. Instead we will extract a holdout set using train_test_split, then run cross-validation using GridSearchCV. In an ideal world we would fit the imputation on the training set only and then apply it to the test set to prevent leakage, but for the purposes of this test I'm going to just do it once on the full dataset and move on from there (and actually, it has already been done above).

In [None]:

non_normal_cols = ['Age', 'DiabetesPedigreeFunction', 'Insulin', 'SkinThickness']

powt = PowerTransformer()

dataset.loc[:, non_normal_cols] = powt.fit_transform(dataset.loc[:, non_normal_cols])

X,y = dataset.drop(columns=['Outcome']), dataset.loc[:,'Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
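The leakage concern mentioned above could be handled by putting the imputer and the power transform inside a scikit-learn `Pipeline`, so that both are re-fitted on the training folds only during cross-validation. The following is a minimal sketch under stated assumptions, not what this notebook actually does: the data are synthetic, and the names `X_demo`, `y_demo`, `leak_free` and `demo_search` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in data: skewed features, a binary target, ~10% missing values.
rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 1.5).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Every preprocessing step lives inside the pipeline, so each CV fold fits the
# imputer and the transformer on its own training portion only.
leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("transform", PowerTransformer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

demo_search = GridSearchCV(leak_free, {"clf__C": [0.01, 0.1, 1]}, scoring="roc_auc", cv=5)
demo_search.fit(X_tr, y_tr)

# The holdout set is only touched once, after the search has finished.
holdout_auc = demo_search.score(X_te, y_te)
print(demo_search.best_params_, holdout_auc)
```

This sketch transforms every column; restricting the power transform to a subset of columns, as done in this notebook, would need something like a `ColumnTransformer` in place of the plain `PowerTransformer` step.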



## Logistic regression

Here is a function that allows us to pass a constructed classifier and a param grid to carry out cross-validation and find the best parameter settings. It also plots ROC curves for the train and test data.

In [None]:
def run_grid_search(classifier, params):
    """Tune `classifier` over `params` with cross-validation and plot train/test ROC curves.

    Uses the module-level X_train, y_train, X_test and y_test.
    """
    search = GridSearchCV(classifier, params, n_jobs=-1, scoring='roc_auc')
    search.fit(X_train, y_train)
    print("Best parameter (CV score=%0.3f):" % search.best_score_)
    print(search.best_params_)

    fig, ax = plt.subplots()

    # plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator
    # (available since 1.0) draws the same curve.
    metrics.RocCurveDisplay.from_estimator(search, X_test, y_test, name="test data", ax=ax)
    metrics.RocCurveDisplay.from_estimator(search, X_train, y_train, name="train data", ax=ax)

    return search

In [None]:

param_grid = {
    'C': [0.01, 0.1, 0.25, 0.5, 0.75, 1]
}

classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
grid_search = run_grid_search(classifier, param_grid)

In [None]:
classifier = svm.SVC(probability=True, class_weight='balanced', kernel='linear')

param_grid = {
    'C': [0.01, 0.1, 0.25, 0.5, 0.75, 1],
}

grid_search = run_grid_search(classifier, param_grid)

In [None]:
classifier = RandomForestClassifier()

param_grid = {
    'max_depth': [3, 5, 10],
    'max_features': [2, 5],
    'min_samples_leaf': [2, 5],
    'min_samples_split': [5, 10],
    'n_estimators': [25, 100, 500]
}


grid_search = run_grid_search(classifier, param_grid)

Based on the results here I would stick with logistic regression as it is fast and the most straightforward to interpret.  Obviously I have not explored the full parameter space for these models.  Additionally, you could play with adding features using sklearn's PolynomialFeatures(), explore different imputation strategies etc.  It seems likely that some of these predictors might interact strongly with age, such that for instance a certain glucose reading or blood pressure when you are 25 is no big deal, but is very bad when you are older.

Overfitting is generally addressed by regularization in logistic regression and SVMs, and by restricting tree depth and the number of samples required for a split or in a leaf for tree-based learners. Also, cross-validation and the use of a holdout set ensure that we select our parameters correctly (i.e., not just picking the ones that happen to do best on the test set by chance).
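One practical way to see overfitting is to compare the training score with the cross-validated score at different regularization strengths. The sketch below does this with `cross_validate(..., return_train_score=True)` on synthetic data in which only the first of 20 features carries signal; the data and the `gaps` dictionary are hypothetical and only illustrate the check.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data: 20 features, only the first one is related to the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

gaps = {}
for C in [0.01, 1.0, 100.0]:  # smaller C means stronger regularization
    scores = cross_validate(
        LogisticRegression(C=C, max_iter=1000),
        X_demo, y_demo,
        cv=5, scoring="roc_auc", return_train_score=True,
    )
    # A large positive gap between train and validation AUC hints at overfitting.
    gaps[C] = scores["train_score"].mean() - scores["test_score"].mean()

for C, gap in gaps.items():
    print(f"C={C}: train minus validation AUC = {gap:.3f}")
```

The same information is available from `GridSearchCV` by passing `return_train_score=True` and inspecting `cv_results_`.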