# Tech test

The aim of this test is to evaluate some of the skills that you will use in your day-to-day activities at Sensyne Health.
We collaborate as a team and the output of the Analytics side of the team has to be usable by others who might not necessarily be fluent in ML-ese.
The aim of this task is to complete the assignment by focussing on key elements such as code reusability, clarity, conciseness, and use of best practices.

In order to complete this assignment please consider the following classification problem given the dataset below (you are free to add and remove steps as you feel is required). 

The data contain information about mothers who may or may not develop diabetes (Outcome).

1. Explore the data, and identify and clarify any assumptions you make
2. Consider any changes or operations you will apply based on your assumptions
3. Your colleagues have used a logistic regression classifier. Review the code and apply all the changes that you feel are required
4. Compare this outcome with two other classifiers. Which one is the best out of the three?
5. You are afraid of overfitting. How do you adjust your program to take care of that?
6. Which classifier would you pick?

At every step, git commit a different version of the Notebook to show the changes. Please do so on a local git repository. Don't worry about branches.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random as rnd

# Fixed seed so that runs are reproducible (42 is an arbitrary choice)
np.random.seed(42)
dataset = pd.read_csv("./dataset.csv")

## Question
Can you please explore the data and provide some valid assumptions about them?

One of the main assumptions I am making is that the samples are independent.  There really isn't a good way for me to test this based on the features provided.

There are a lot of missing values for some of the features.  One assumption is that these data are missing at random.  A full discussion of this problem is outside the scope of this tech test, but we can look to see how these values are distributed with respect to the outcome variable (which itself is somewhat imbalanced):

In [None]:
# Count the zeros (used here as the missing-value code) per feature within each outcome class
dataset.groupby('Outcome').agg(lambda x: (x == 0).sum())

Generally speaking, these missing values (coded as 0 in this dataset) are distributed roughly in proportion to the outcome classes: about 2 to 1, i.e. there are about twice as many non-diabetes as diabetes outcomes.
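Raw counts are hard to compare across classes of different sizes, so fractions may be easier to read. The following is only a minimal sketch: the helper name `zero_fraction_by_outcome` and the toy frame are hypothetical and not part of the original notebook, and a 0 is not necessarily a missing value in every column.

```python
import pandas as pd


def zero_fraction_by_outcome(df: pd.DataFrame, outcome_col: str = "Outcome") -> pd.DataFrame:
    """Return, for each outcome class, the fraction of rows equal to 0 in each feature."""
    features = df.drop(columns=[outcome_col])
    # (features == 0) is a boolean frame; the group mean of booleans is the fraction of zeros.
    return (features == 0).groupby(df[outcome_col]).mean()


# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame({
    "Insulin": [0, 0, 10, 20, 0, 30],
    "BMI": [25.0, 0.0, 30.0, 31.0, 28.0, 0.0],
    "Outcome": [0, 0, 0, 0, 1, 1],
})
zero_fractions = zero_fraction_by_outcome(toy)
print(zero_fractions)
```

If the fractions are similar across the two classes, that is at least consistent with missingness being unrelated to the outcome, although it does not prove it.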

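Looking at histograms of the individual features, we can see that a number of them are not normally distributed. Logistic regression itself does not require normally distributed predictors, but heavy skew and differences in scale do matter once regularization is used, because the penalty treats all coefficients alike: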

In [None]:
for col in dataset.columns:
    fig,ax=plt.subplots()
    dataset[col].plot.hist(ax=ax)
    ax.set_xlabel(col)

Another assumption of logistic regression is the absence of multicollinearity. We can get some idea of this by looking at a heatmap of the correlations between features. We drop the outcome here because its relationship with each feature is examined separately using point-biserial correlation. Because of the large number of zeros, we need to do some sort of imputation before we look at the correlations, since we will be feeding our models imputed data.

In [None]:
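import seaborn as sns
from sklearn.impute import SimpleImputer

# Zeros in these columns stand for missing values, so recode them as NaN
cols_with_missing = ['Glucose', 'BloodPressure', 'BMI', 'Insulin', 'SkinThickness']
dataset[cols_with_missing] = dataset[cols_with_missing].replace(0, np.nan)

# Mean imputation (the SimpleImputer default), fitted on the full dataset
imp = SimpleImputer()
dataset = pd.DataFrame(imp.fit_transform(dataset), columns=dataset.columns, index=dataset.index)

# Correlations between features only; Outcome is handled separately
corr_mat = dataset.drop(columns=['Outcome']).corr()
sns.heatmap(corr_mat, annot=True)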

The correlation between Age and number of pregnancies is high, which is understandable. Otherwise, there are no serious problems. We can test for multicollinearity more formally by looking at the variance inflation factor (VIF):

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = dataset.drop(columns=['Outcome'])
# variance_inflation_factor does not add an intercept itself, so include a constant column
X['const'] = 1

vif = pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

vif.drop('const')

None of these values suggests problematic multicollinearity.
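For readers unfamiliar with the VIF: for feature j it is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The following is a minimal sketch of that definition using NumPy only; the function name `vif_manual` and the synthetic data are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def vif_manual(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    # With an intercept the residuals have zero mean, so this ratio is SS_res / SS_tot.
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)


# Synthetic example: c is nearly a copy of a, while b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
toy = np.column_stack([a, b, c])
vifs = [vif_manual(toy, j) for j in range(toy.shape[1])]
print(vifs)
```

In this toy example the two nearly collinear columns get a large VIF, while the independent column stays close to 1.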

In [None]:
from scipy import stats

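# Point-biserial correlation between the binary outcome and each feature
# (Outcome itself is excluded, since its correlation with itself is trivially 1)
for col in dataset.columns.drop('Outcome'):
    r, p = stats.pointbiserialr(dataset['Outcome'].values, dataset[col].values)
    print('{}: r value is {}, p={}'.format(col, r, p))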

Almost all of the predictors have a statistically significant relationship with the outcome variable.

## Question
Anything that we need to do based on your assumptions?

We will apply a power transform (for example scikit-learn's `PowerTransformer`) to Age, Pedigree, Insulin and Skin Thickness, since these are the most skewed features.
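A minimal sketch of what such a transform could look like, assuming scikit-learn's `PowerTransformer` with the Yeo-Johnson method. The toy data are synthetic and the two column names are used only for illustration. The transformer is fitted on the training portion only and then applied to the test portion, so that no information leaks from the test set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data for illustration only (not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.lognormal(mean=3.4, sigma=0.5, size=300),
    "Insulin": rng.lognormal(mean=4.5, sigma=0.7, size=300),
})
train, test = toy.iloc[:200], toy.iloc[200:]

# Yeo-Johnson also accepts zeros and negative values; standardize=True rescales
# the transformed training data to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
train_t = pd.DataFrame(pt.fit_transform(train), columns=train.columns, index=train.index)
test_t = pd.DataFrame(pt.transform(test), columns=test.columns, index=test.index)

skew_before = train.skew()
skew_after = train_t.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Comparing the skewness before and after is a quick check that the transform did what was intended for each column.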

In [None]:
class0 = dataset.Outcome==0
class1 = dataset.Outcome==1

In [None]:
data_class0 = dataset[class0]
data_class1 = dataset[class1]

In [None]:
# Split train and test: the first 70% of each class goes to train,
# and the remaining 30% goes to test

train_split_0 = int(np.floor(0.7 * len(data_class0)))
train_split_1 = int(np.floor(0.7 * len(data_class1)))

# The test slices start exactly where the train slices end, so no sample is dropped
train_data = pd.concat([ data_class0[ :train_split_0 ], data_class1[ :train_split_1 ] ])
test_data  = pd.concat([ data_class0[  train_split_0: ], data_class1[  train_split_1: ] ])

assert abs(0.7 - (len(train_data) / (len(train_data) + len(test_data)))) < 0.01, "There must be a problem with the train/test split of data"
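The manual split above keeps the class proportions but does not shuffle the rows, so it depends on the order of the file. As a hedged alternative, here is a minimal sketch using scikit-learn's `train_test_split` with `stratify`. The toy frame and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a 2:1 class imbalance, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=300),
    "Outcome": np.repeat([0, 1], [200, 100]),
})

# stratify keeps the class proportions equal in both parts; shuffle avoids
# depending on the row order; random_state makes the split reproducible.
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy["Outcome"], shuffle=True, random_state=0
)

print(len(train_toy), len(test_toy))
print(train_toy["Outcome"].mean(), test_toy["Outcome"].mean())
```

Both parts should show roughly the same fraction of positive outcomes as the full toy frame.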

## Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Separate features and target once instead of repeating the column masks
X_train, y_train = train_data.drop(columns=['Outcome']), train_data['Outcome']
X_test,  y_test  = test_data.drop(columns=['Outcome']),  test_data['Outcome']

classifier = LogisticRegression().fit(X_train, y_train)

prediction_test             = classifier.predict(X_test)
prediction_probability_test = classifier.predict_proba(X_test)

fig, ax = plt.subplots()

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
metrics.RocCurveDisplay.from_estimator(classifier, X_test, y_test, name="test data", ax=ax)
metrics.RocCurveDisplay.from_estimator(classifier, X_train, y_train, name="train data", ax=ax)
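The two ROC curves can also be summarised as single numbers by computing the area under the curve (AUC) for the train and test sets. A large gap between the two is one possible sign of overfitting. The following is a minimal sketch on synthetic data, not the notebook's dataset; all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem in which only the first feature carries signal.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)
model = LogisticRegression().fit(X_tr, y_tr)

# roc_auc_score expects the score of the positive class, i.e. column 1 of predict_proba.
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"train AUC = {auc_train:.3f}, test AUC = {auc_test:.3f}")
```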
