# KEN3450, Data Analysis 2020 

**Kaggle Competition 2020**<br>

In [110]:
import numpy as np
import pandas as pd
import scipy as sp
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


#import your classifiers here

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Diagnosing the Maastricht Flu 

You are given the early data for an outbreak of a dangerous virus originating from a group of primates being kept in a Maastricht biomedical research lab in the basement of Henri-Paul Spaaklaan building, this virus is dubbed the "Maastricht Flu".

You have the medical records of $n$ number of patients in `flu_train.csv`. There are two general types of patients in the data, flu patients and healthy (this is recorded in the column labeled `flu`, a 0 indicates the absences of the virus and a 1 indicates presence). Notice that the dataset is unbalanced and you can expect a similar imbalance in the testing set.

**Your task:** build a model to predict if a given patient has the flu. Your goal is to catch as many flu patients as possible without misdiagnosing too many healthy patients.

**The deliverable:** submit your final solution via Kaggle competition using the `flu_test.csv` data.

Maastricht Gemeente will use your model to diagnose sets of future patients (held by us). You can expect that there will be an increase in the number of flu patients in any groups of patients in the future.

Here are some benchmarks for comparison and for expectation management. Notice that because the dataset is unbalanced, we expect that there is going to be a large difference in the accuracy for each class, thus `accuracy` is a metric that might be misleading in this case (see also below). That's why the baselines below are based on the expected accuracy **per class** and also they give you an estimate for the AUROC on all patients in the testing data. This is the score you see in the Kaggle submission as well.

**Baseline Model:** 
- ~50% expected accuracy on healthy patients in training data
- ~50% expected accuracy on flu patients in training data
- ~50% expected accuracy on healthy patients in testing data (future data, no info on the labels)
- ~50% expected accuracy on flu patients in testing data (future data, no info on the labels)
- ~50% expected AUROC on all patients in testing data (future data, no info on the labels)

**Reasonable Model:** 
- ~70% expected accuracy on healthy patients in training data
- ~55% expected accuracy on flu patients, in training data
- ~70% expected accuracy on healthy patients in testing data (future data, no info on the labels, to be checked upon your submission)
- ~57% expected accuracy on flu patients, in testing data (future data, no info on the labels, to be checked upon your submission)
- ~65% expected AUROC on all patients, in testing data (future data, no info on the labels, to be checked from Kaggle)

**Grading:**
Your grade will be based on:
1. your model's ability to out-perform the benchmarks (they are kind of low, so we won't care much about this)
2. your ability to carefully and thoroughly follow the data analysis pipeline
3. the extend to which all choices are reasonable and defensible by methods you have learned in this class

## Step 1: Read the data, clean and explore the data

There are a large number of missing values in the data. Nearly all predictors have some degree of missingness. Not all missingness are alike: NaN in the `'pregnancy'` column is meaningful and informative, as patients with NaN's in the pregnancy column are males, where as NaN's in other predictors may appear randomly. 


**What do you do?:** We make no attempt to interpret the predictors and we make no attempt to model the missing values in the data in any meaningful way. We replace all missing values with 0.

However, it would be more complete to look at the data and allow the data to inform your decision on how to address missingness. For columns where NaN values are informative, you might want to treat NaN as a distinct value; You might want to drop predictors with too many missing values and impute the ones with few missing values using a model. There are many acceptable strategies here, as long as the appropriateness of the method in the context of the task and the data is discussed.

In [22]:
#Train
df = pd.read_csv('data/flu_train.csv')
df = df[~np.isnan(df['flu'])]
df.head()

Unnamed: 0,ID,Gender,Age,Race1,Education,MaritalStatus,HHIncome,HHIncomeMid,Poverty,HomeRooms,...,AgeRegMarij,HardDrugs,SexEver,SexAge,SexNumPartnLife,SexNumPartYear,SameSex,SexOrientation,PregnantNow,flu
0,51624,male,34,White,High School,Married,25000-34999,30000.0,1.36,6.0,...,,Yes,Yes,16.0,8.0,1.0,No,Heterosexual,,0
1,51630,female,49,White,Some College,LivePartner,35000-44999,40000.0,1.91,5.0,...,,Yes,Yes,12.0,10.0,1.0,Yes,Heterosexual,,0
2,51638,male,9,White,,,75000-99999,87500.0,1.84,6.0,...,,,,,,,,,,0
3,51646,male,8,White,,,55000-64999,60000.0,2.33,7.0,...,,,,,,,,,,0
4,51647,female,45,White,College Grad,Married,75000-99999,87500.0,5.0,6.0,...,,No,Yes,13.0,20.0,0.0,Yes,Bisexual,,0


In [23]:
#Test
df_test = pd.read_csv('data/flu_test.csv')
df_test.head()

Unnamed: 0,ID,Gender,Age,Race1,Education,MaritalStatus,HHIncome,HHIncomeMid,Poverty,HomeRooms,...,RegularMarij,AgeRegMarij,HardDrugs,SexEver,SexAge,SexNumPartnLife,SexNumPartYear,SameSex,SexOrientation,PregnantNow
0,51625,male,4,Other,,,20000-24999,22500.0,1.07,9.0,...,,,,,,,,,,
1,51678,male,60,White,High School,Married,15000-19999,17500.0,1.03,5.0,...,,,No,Yes,20.0,1.0,,No,,
2,51694,male,38,White,Some College,Married,20000-24999,22500.0,1.15,6.0,...,No,,No,Yes,23.0,1.0,1.0,No,Heterosexual,
3,51695,male,8,White,,,65000-74999,70000.0,3.55,5.0,...,,,,,,,,,,
4,51711,female,59,Other,8th Grade,Widowed,20000-24999,22500.0,1.37,4.0,...,,,,,,,,,,


In [24]:
#What's up in each set

x = df.values[:, :-1]
y = df.values[:, -1]

x_test = df_test.values[:, :-1]

print('x train shape:', x.shape)
print('x test shape:', x_test.shape)
print('train class 0: {}, train class 1: {}'.format(len(y[y==0]), len(y[y==1])))

x train shape: (5246, 71)
x test shape: (1533, 70)
train class 0: 4936, train class 1: 310


## Step 2: Model Choice

The first task is to decide which classifier to use (from the ones that we learned this block), i.e. which one would best suit our task and our data. Note that our data are heavily unbalanced, thus you need to do some exploration on how different classifiers handle inbalances in the data (we will discuss some of these techniques during week 3 lecture).

It would be possible to do brute force model comparison here - i.e. tune all models and compare which does best with respect to various benchmarks. However, it is also reasonable to do a first round of model comparison by running models (with out of the box parameter settings) on the training data and eliminating some models which performed very poorly.

Let the best model win!

# Cleaning, normalization and data splits

In [146]:
def fill_bin_num(dataframe, feature, bin_feature, bin_size, stat_measure, min_bin=None, max_bin=None, default_val='No'):
    if min_bin is None:
        min_bin = dataframe[bin_feature].min()
    if max_bin is None:
        max_bin = dataframe[bin_feature].max()
    new_dataframe = dataframe.copy()
    df_meancat = pd.DataFrame(columns=['interval', 'stat_measure'])
    for num_bin, subset in dataframe.groupby(pd.cut(dataframe[bin_feature], np.arange(min_bin, max_bin+bin_size, bin_size), include_lowest=True)):
        if stat_measure is 'mean':
            row = [num_bin, subset[feature].mean()]
        elif stat_measure is 'mode': 
            mode_ar = subset[feature].mode().values
            if len(mode_ar) > 0:
                row = [num_bin, mode_ar[0]]
            else:
                row = [num_bin, default_val]
        else:
            raise Exception('Unknown statistical measure: ' + stat_measure)
        df_meancat.loc[len(df_meancat)] = row
    for index, row_df in dataframe[dataframe[feature].isna()].iterrows():
        for _, row_meancat in df_meancat.iterrows():
            if row_df[bin_feature] in row_meancat['interval']:
                new_dataframe.at[index, feature] = row_meancat['stat_measure']
    return new_dataframe


def make_dummy_cols(dataframe, column, prefix, drop_dummy):
    dummy = pd.get_dummies(dataframe[column], prefix=prefix)
    dummy = dummy.drop(columns=prefix+'_'+drop_dummy)
    dataframe = pd.concat([dataframe, dummy], axis=1)
    dataframe = dataframe.drop(columns=column)
    return dataframe


def cleaning(dataframe_raw):
    dataframe = dataframe_raw.copy()

    dataframe = dataframe.set_index('ID')

    dataframe.loc[(dataframe['Age']<=13) & (dataframe['Education'].isna()), 'Education'] = 'Lower School/Kindergarten'
    dataframe.loc[(dataframe['Age']==14) & (dataframe['Education'].isna()), 'Education'] = '8th Grade'
    dataframe.loc[(dataframe['Age']<=17) & (dataframe['Education'].isna()), 'Education'] = '9 - 11th Grade'
    dataframe.loc[(dataframe['Age']<=21) & (dataframe['Education'].isna()), 'Education'] = 'High School'
    dataframe['Education'] = dataframe['Education'].fillna('Some College')

    dataframe.loc[(dataframe['Age']<=20) & (dataframe['MaritalStatus'].isna()), 'MaritalStatus'] = 'NeverMarried'
    dataframe.at[dataframe['MaritalStatus'].isna(), 'MaritalStatus'] = fill_bin_num(dataframe, 'MaritalStatus', 'Age', 5, 'mode',20)

    dataframe = dataframe.drop(columns=['HHIncome'])

    dataframe.loc[dataframe['HHIncomeMid'].isna(), 'HHIncomeMid'] = dataframe['HHIncomeMid'].median()

    dataframe.loc[dataframe['Poverty'].isna(), 'Poverty'] = dataframe['Poverty'].median()

    dataframe.loc[dataframe['HomeRooms'].isna(), 'HomeRooms'] = dataframe['HomeRooms'].mean()

    dataframe.loc[dataframe['HomeOwn'].isna(), 'HomeOwn'] = dataframe['HomeOwn'].mode().values[0]

    dataframe.loc[(dataframe['Work'].isna()) & (dataframe['Education'].isna()) & (dataframe['Age']<=20), 'Work'] = 'NotWorking'

    dataframe.loc[dataframe['Work'].isna(), 'Work'] = dataframe['Work'].mode().values[0]

    dataframe = fill_bin_num(dataframe, 'Weight', 'Age', 2, 'mean')

    dataframe = dataframe.drop(columns=['HeadCirc'])

    for index, row in dataframe.iterrows():
        if np.isnan(row['Height']) and not np.isnan(row['Length']):
            dataframe.at[index, 'Height'] = row['Length']
    dataframe = fill_bin_num(dataframe, 'Height', 'Age', 2, 'mean')

    dataframe = dataframe.drop(columns=['Length'])

    for index, row in dataframe[dataframe['BMI'].isna()].iterrows():
        dataframe.at[index, 'BMI'] = row['Weight'] / ((row['Height']/100)**2)

    dataframe = dataframe.drop(columns='BMICatUnder20yrs')

    dataframe = dataframe.drop(columns='BMI_WHO')

    dataframe = fill_bin_num(dataframe, 'Pulse', 'Age', 10, 'mean')

    dataframe.loc[(dataframe['Age']<10) & (dataframe['BPSysAve'].isna()), 'BPSysAve'] = 105
    dataframe = fill_bin_num(dataframe, 'BPSysAve', 'Age', 5, 'mean', 10)

    dataframe.loc[(dataframe['Age']<10) & (dataframe['BPDiaAve'].isna()), 'BPDiaAve'] = 60
    dataframe = fill_bin_num(dataframe, 'BPDiaAve', 'Age', 5, 'mean', 10)

    dataframe = dataframe.drop(columns='BPSys1')

    dataframe = dataframe.drop(columns='BPDia1')

    dataframe = dataframe.drop(columns='BPSys2')

    dataframe = dataframe.drop(columns='BPDia2')

    dataframe = dataframe.drop(columns='BPSys3')

    dataframe = dataframe.drop(columns='BPDia3')

    dataframe = dataframe.drop(columns=['Testosterone'])

    dataframe.loc[(dataframe['Age']<10) & (dataframe['DirectChol'].isna()), 'DirectChol'] = 0 
    dataframe = fill_bin_num(dataframe, 'DirectChol', 'Age', 5, 'mean', 10)

    dataframe.loc[(dataframe['Age']<10) & (dataframe['TotChol'].isna()), 'TotChol'] = 0
    dataframe = fill_bin_num(dataframe, 'TotChol', 'Age', 5, 'mean', 10)
    
    dataframe.loc[dataframe['UrineVol1'].isna(), 'UrineVol1'] = dataframe['UrineVol1'].median()

    dataframe.loc[dataframe['UrineFlow1'].isna(), 'UrineFlow1'] = dataframe['UrineFlow1'].median()

    dataframe = dataframe.drop(columns=['UrineVol2'])

    dataframe = dataframe.drop(columns=['UrineFlow2'])

    dataframe['Diabetes'] = dataframe['Diabetes'].fillna('No')

    dataframe['DiabetesAge'] = dataframe['DiabetesAge'].fillna(0)

    dataframe.loc[(dataframe['Age']<=12) & (dataframe['HealthGen'].isna()), 'HealthGen'] = 'Good'
    dataframe = fill_bin_num(dataframe, 'HealthGen', 'Age', 5, 'mode', 10)

    dataframe.loc[(dataframe['Age']<=12) & (dataframe['DaysMentHlthBad'].isna()), 'DaysMentHlthBad'] = 0
    dataframe = fill_bin_num(dataframe, 'DaysMentHlthBad', 'Age', 5, 'mean', 10)

    dataframe.loc[(dataframe['Age']<=15) & (dataframe['LittleInterest'].isna()), 'LittleInterest'] = 'None'
    dataframe = fill_bin_num(dataframe, 'LittleInterest', 'Age', 5, 'mode', 15)

    dataframe.loc[(dataframe['Age']<=12) & (dataframe['DaysMentHlthBad'].isna()), 'DaysMentHlthBad'] = 0
    dataframe = fill_bin_num(dataframe, 'DaysMentHlthBad', 'Age', 5, 'mean', 10)

    for index, row in dataframe.iterrows():
        if np.isnan(row['nBabies']) and not np.isnan(row['nPregnancies']):
            dataframe.at[index, 'nBabies'] = row['nPregnancies']
    dataframe['nBabies'] = dataframe['nBabies'].fillna(0)

    dataframe['nPregnancies'] = dataframe['nPregnancies'].fillna(0)

    dataframe['Age1stBaby'] = dataframe['Age1stBaby'].fillna(0)

    dataframe.loc[(dataframe['Age']==0) & (dataframe['SleepHrsNight'].isna()), 'SleepHrsNight'] = 14
    dataframe.loc[(dataframe['Age']<=2) & (dataframe['SleepHrsNight'].isna()), 'SleepHrsNight'] = 12
    dataframe.loc[(dataframe['Age']<=5) & (dataframe['SleepHrsNight'].isna()), 'SleepHrsNight'] = 10
    dataframe.loc[(dataframe['Age']<=10) & (dataframe['SleepHrsNight'].isna()), 'SleepHrsNight'] = 9
    dataframe.loc[(dataframe['Age']<=15) & (dataframe['SleepHrsNight'].isna()), 'SleepHrsNight'] = 8
    dataframe['SleepHrsNight'] = dataframe['SleepHrsNight'].fillna(dataframe_raw['SleepHrsNight'].mean())

    dataframe['SleepTrouble'] = dataframe['SleepTrouble'].fillna('No')

    dataframe.loc[(dataframe['Age']<=4) & (dataframe['PhysActive'].isna()), 'PhysActive'] = 'No'
    dataframe = fill_bin_num(dataframe, 'PhysActive', 'Age', 2, 'mode', 16)
    dataframe['PhysActive'] = dataframe['PhysActive'].fillna('Yes') # Big assumption here. All kids between 4 and 16 are physically active

    dataframe = dataframe.drop(columns=['PhysActiveDays'])

    dataframe = dataframe.drop(columns=['TVHrsDay'])

    dataframe = dataframe.drop(columns=['TVHrsDayChild'])

    dataframe = dataframe.drop(columns=['CompHrsDay'])

    dataframe = dataframe.drop(columns=['CompHrsDayChild'])

    dataframe.loc[(dataframe['Age']<18) & (dataframe['Alcohol12PlusYr'].isna()), 'Alcohol12PlusYr'] = 'No'
    dataframe = fill_bin_num(dataframe, 'Alcohol12PlusYr', 'Age', 5, 'mode', 18)

    dataframe.loc[(dataframe['Age']<18) & (dataframe['AlcoholDay'].isna()), 'AlcoholDay'] = 0
    dataframe = fill_bin_num(dataframe, 'AlcoholDay', 'Age', 5, 'mean', 18)

    dataframe.loc[(dataframe['Age']<18) & (dataframe['AlcoholYear'].isna()), 'AlcoholYear'] = 0
    dataframe = fill_bin_num(dataframe, 'AlcoholYear', 'Age', 5, 'mean', 18)

    dataframe.loc[(dataframe['Age']<20) & (dataframe['SmokeNow'].isna()), 'SmokeNow'] = 'No'
    dataframe = fill_bin_num(dataframe, 'SmokeNow', 'Age', 5, 'mode', 20)

    dataframe['Smoke100'] = dataframe['Smoke100'].fillna('No')

    dataframe['Smoke100n'] = dataframe['Smoke100n'].fillna('No')

    dataframe.loc[(dataframe['SmokeNow']=='No') & (dataframe['SmokeAge'].isna()), 'SmokeAge'] = 0
    dataframe = fill_bin_num(dataframe, 'SmokeAge', 'Age', 5, 'mean', 20)

    dataframe.loc[(dataframe['Age']<18) & (dataframe['Marijuana'].isna()), 'Marijuana'] = 'No'
    dataframe.loc[(dataframe['Marijuana'].isna()) & (dataframe['SmokeNow']=='No'), 'Marijuana'] = 'No'
    dataframe = fill_bin_num(dataframe, 'Marijuana', 'Age', 5, 'mode', 20)

    dataframe.loc[(dataframe['Marijuana']=='No') & (dataframe['AgeFirstMarij'].isna()), 'AgeFirstMarij'] = 0
    dataframe = fill_bin_num(dataframe, 'AgeFirstMarij', 'Age', 5, 'mean', 20)

    dataframe.loc[(dataframe['Marijuana']=='No') & (dataframe['RegularMarij'].isna()), 'RegularMarij'] = 'No'
    dataframe = fill_bin_num(dataframe, 'RegularMarij', 'Age', 5, 'mode', 20)

    dataframe.loc[(dataframe['RegularMarij']=='No') & (dataframe['AgeRegMarij'].isna()), 'AgeRegMarij'] = 0
    dataframe = fill_bin_num(dataframe, 'AgeRegMarij', 'Age', 5, 'mean', 20)

    dataframe.loc[(dataframe['Age']<18) & (dataframe['HardDrugs'].isna()), 'HardDrugs'] = 'No'
    dataframe = fill_bin_num(dataframe, 'HardDrugs', 'Age', 5, 'mode', 18)

    mode_sex_age = dataframe['SexAge'].mode()[0]
    dataframe.loc[(dataframe['Age']<=mode_sex_age) & (dataframe['SexEver'].isna()), 'SexEver'] = 'No'
    dataframe['SexEver'] = dataframe['SexEver'].fillna('Yes')

    dataframe.loc[(dataframe['SexEver']=='No') & (dataframe['SexAge'].isna()), 'SexAge'] = 0
    dataframe['SexAge'] = dataframe['SexAge'].fillna(mode_sex_age)

    dataframe.loc[(dataframe['SexEver']=='No') & (dataframe['SexNumPartnLife'].isna()), 'SexNumPartnLife'] = 0
    dataframe = fill_bin_num(dataframe, 'SexNumPartnLife', 'Age', 5, 'mean')
    dataframe['SexNumPartnLife'] = dataframe_raw.loc[(dataframe_raw['Age'] >= 60) & (dataframe_raw['Age'] <= 70), 'SexNumPartnLife'].mode()[0] # Missing values for the elderly. Assumed that lifetime sex partners do not increase after 60.

    dataframe.loc[(dataframe['SexEver']=='No') & (dataframe['SexNumPartYear'].isna()), 'SexNumPartYear'] = 0
    dataframe = fill_bin_num(dataframe, 'SexNumPartYear', 'Age', 10, 'mean')
    dataframe['SexNumPartYear'] = dataframe['SexNumPartYear'].fillna(0)

    dataframe['SameSex'] = dataframe['SameSex'].fillna('No')

    dataframe = dataframe.drop(columns=['SexOrientation'])

    dataframe['PregnantNow'] = dataframe['PregnantNow'].fillna('No')


    # Making dummy variables
    dataframe['male'] = 1*(dataframe['Gender'] ==  'male')
    dataframe = dataframe.drop(columns=['Gender'])

    dataframe['white'] = np.where(dataframe['Race1'] == 'white',1,0)
    dataframe = dataframe.drop(columns=['Race1'])

    dataframe = make_dummy_cols(dataframe, 'Education', 'education', '8th Grade')

    dataframe = make_dummy_cols(dataframe, 'MaritalStatus', 'maritalstatus', 'Separated')

    dataframe = make_dummy_cols(dataframe, 'HomeOwn', 'homeown', 'Other')

    dataframe = make_dummy_cols(dataframe, 'Work', 'work', 'Looking')

    dataframe['Diabetes'] = np.where(dataframe['Diabetes'] == 'Yes',1,0)

    dataframe = make_dummy_cols(dataframe, 'HealthGen', 'healthgen', 'Poor')

    dataframe = make_dummy_cols(dataframe, 'LittleInterest', 'littleinterest', 'None')

    dataframe = make_dummy_cols(dataframe, 'Depressed', 'depressed', 'None')

    dataframe['SleepTrouble'] = np.where(dataframe['SleepTrouble'] == 'Yes',1,0)

    dataframe['PhysActive'] = np.where(dataframe['PhysActive'] == 'Yes',1,0)

    dataframe['Alcohol12PlusYr'] = np.where(dataframe['Alcohol12PlusYr'] == 'Yes',1,0)

    dataframe['SmokeNow'] = np.where(dataframe['SmokeNow'] == 'Yes',1,0)
    
    dataframe['Smoke100'] = np.where(dataframe['Smoke100'] == 'Yes',1,0)

    dataframe['Smoke100n'] = np.where(dataframe['Smoke100n'] == 'Yes',1,0)

    dataframe['Marijuana'] = np.where(dataframe['Marijuana'] == 'Yes',1,0)

    dataframe['RegularMarij'] = np.where(dataframe['RegularMarij'] == 'Yes',1,0)

    dataframe['HardDrugs'] = np.where(dataframe['HardDrugs'] == 'Yes',1,0)

    dataframe['SexEver'] = np.where(dataframe['SexEver'] == 'Yes',1,0)

    dataframe['SameSex'] = np.where(dataframe['SameSex'] == 'Yes',1,0)

    dataframe['PregnantNow'] = np.where(dataframe['PregnantNow'] == 'Yes',1,0)

    return dataframe

In [147]:
from sklearn import preprocessing
data = cleaning(df).select_dtypes(include = 'number')
norm = preprocessing.MinMaxScaler()
data_n = norm.fit_transform(data.drop('flu', axis=1))
ndata = pd.DataFrame(norm.fit_transform(data.drop('flu', axis=1)), index=data.index)
ndata['flu'] = data['flu']
num_test = cleaning(df_test).select_dtypes(include='number')
ntest = pd.DataFrame(norm.fit_transform(num_test), index=num_test.index)


In [148]:
train, test = train_test_split(ndata, stratify=ndata['flu'], test_size=0.1)

X_train = train.drop('flu', axis=1)
X_test = test.drop('flu', axis=1)
y_train = train['flu']
y_test = test['flu']

# Support Vector Machine

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV


cw = []
for i in np.linspace(start = 0.006, stop = 0.08, num = 5):
    cw.append({0:i, 1:1-i})
cw.append('balanced')
C = [x for x in np.linspace(start = 0.2, stop = 1.5, num = 5)]
C.append(1)

param_grid = {
    'C':C,
    'kernel':['linear', 'rbf', 'poly', 'sigmoid'],
    'degree':[2,3,4,5,6,7,8],
    'gamma':['auto'],
    'shrinking':[True, False],
    'class_weight': cw
}

In [None]:
sv = SVC()
sv_r = RandomizedSearchCV(sv, param_grid, scoring=scorel, cv=3, return_train_score=True, verbose=2, random_state=42, n_jobs=-2, n_iter=300)
sv_r.fit(X_train, y_train)

In [None]:
params = sv_r.best_params_
print('The best parameters are {} giving an average Balanced Accuract of {:.4f}'.format(params, sv_r.best_score_))

In [None]:
a = np.array(sv_r.best_estimator_.predict(ntest))
result = pd.DataFrame(np.array([num_test.index, a], dtype=np.int32).T, columns=['ID', 'Prediction'])
result.to_csv('result_svm.csv', index=False)

# Logistic Regression

In [149]:
import telegram 
import json
import os
def notify_me(message='Done'):
    filename = os.environ['HOME']+'/.telegram'
    with open(filename) as f:
        json_blob = f.read()
        credentials = json.loads(json_blob)
    bot = telegram.Bot(token=credentials['api_key'])
    bot.send_message(chat_id=credentials['chat_id'], text=message)

In [150]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

cw = []
for i in np.linspace(start = 0.001, stop = 0.4, num = 20):
    cw.append({0:i, 1:1-i})
cw.append('balanced')

In [151]:
w0 = 0.0599
param_grid = {
    'C':[x for x in np.linspace(start = 0.001, stop = 20, num = 40)],
    'penalty':['l1', 'l2', 'elasticnet'],
    'max_iter':[10, 100, 1000, 10000],
    'class_weight': cw
}

In [157]:
lr = LogisticRegression()
lr_r = GridSearchCV(lr, param_grid, scoring='balanced_accuracy', cv=3, return_train_score=True, verbose=0, n_jobs=-1)
lr_r.fit(ndata.drop('flu', axis=1), ndata['flu'])

GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [0.001, 0.5137948717948717, 1...
                                          {0: 0.274, 1: 0.726},
                                          {0: 0.29500000000000004, 1: 0.705},
                                          {0: 0.316, 1: 0.6839999999999999},
                                          {0: 0.337, 1: 0.663},
        

In [158]:
params = lr_r.best_params_

notify_me('The best parameters are {} giving an average ROC AUC score of {:.4f}'.format(params, lr_r.best_score_))

In [159]:
pd.DataFrame(lr_r.cv_results_).sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_class_weight,param_max_iter,param_penalty,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
748,0.107531,0.007842,0.004651,0.000961,1.02659,balanced,100,l2,"{'C': 1.0265897435897433, 'class_weight': 'bal...",0.658212,0.665610,0.728881,0.684234,0.031714,1,0.728949,0.734793,0.703191,0.722311,0.013729
751,0.135089,0.006675,0.003793,0.000261,1.02659,balanced,1000,l2,"{'C': 1.0265897435897433, 'class_weight': 'bal...",0.658212,0.665306,0.728881,0.684133,0.031774,2,0.728949,0.734945,0.703343,0.722412,0.013704
754,0.141895,0.014802,0.003962,0.000437,1.02659,balanced,10000,l2,"{'C': 1.0265897435897433, 'class_weight': 'bal...",0.658212,0.665306,0.728881,0.684133,0.031774,2,0.728949,0.734945,0.703343,0.722412,0.013704
547,0.073622,0.012917,0.005176,0.002330,1.02659,"{0: 0.064, 1: 0.9359999999999999}",1000,l2,"{'C': 1.0265897435897433, 'class_weight': {0: ...",0.643673,0.675751,0.730723,0.683382,0.035945,4,0.726831,0.717528,0.695613,0.713324,0.013087
544,0.069095,0.004190,0.003588,0.000091,1.02659,"{0: 0.064, 1: 0.9359999999999999}",100,l2,"{'C': 1.0265897435897433, 'class_weight': {0: ...",0.643673,0.675751,0.730723,0.683382,0.035945,4,0.726831,0.717528,0.695613,0.713324,0.013087
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4020,0.001642,0.000467,0.000000,0.000000,7.69292,balanced,10,l1,"{'C': 7.692923076923076, 'class_weight': 'bala...",,,,,,10076,,,,,
4019,0.003879,0.002107,0.000000,0.000000,7.69292,"{0: 0.4, 1: 0.6}",10000,elasticnet,"{'C': 7.692923076923076, 'class_weight': {0: 0...",,,,,,10077,,,,,
4017,0.002412,0.000586,0.000000,0.000000,7.69292,"{0: 0.4, 1: 0.6}",10000,l1,"{'C': 7.692923076923076, 'class_weight': {0: 0...",,,,,,10078,,,,,
4037,0.002332,0.000649,0.000000,0.000000,8.20572,"{0: 0.001, 1: 0.999}",100,elasticnet,"{'C': 8.205717948717947, 'class_weight': {0: 0...",,,,,,10079,,,,,


In [155]:
w = 0.064
fin = LogisticRegression(class_weight={0:w,1:1-w}, C=2.02, penalty='l2')
fin.fit(X_train, y_train)
np.mean(cross_val_score(fin, X_train, y_train, scoring='balanced_accuracy', cv=3))

0.6617766270986633

In [85]:
lo = lr_r.best_estimator_

In [160]:
a = np.array(lr_r.best_estimator_.predict(ntest))
result = pd.DataFrame(np.array([num_test.index, a], dtype=np.int32).T, columns=['ID', 'Prediction'])
result.to_csv('result_lr_n.csv', index=False)

In [64]:
def scorel(model, X_test, y_test):
    return 0.6*np.mean(cross_val_score(model,X_train,y_train,scoring='balanced_accuracy', cv=5))+0.4*score(model,X_test, y_test)[2]

# XGBoost

In [None]:
import xgboost as xgb

xg_c = xgb.XGBClassifier(max_depth=3)
param_grid = {
    'objective':['reg:squarederror', 'reg:logistic', 'binary:logistic'],
    'scale_pos_weight':[20,21,22],
    'colsample_bytree':[0.3],
    'eval_metric':['aucpr', 'auc', 'mae', 'map'],
    'alpha':[5, 10, 20],
    'n_estimators': [5, 10, 25, 40, 50, 100, 125],
    'learning_rate': [0.05, 0.1, 0.15]
}
xg_s = GridSearchCV(xg_c, param_grid, scoring='balanced_accuracy', cv=3, return_train_score=True)

In [None]:
xg_s.fit(ndata.drop('flu', axis=1), ndata['flu'])

In [None]:
params = xg_s.best_params_
print('The best parameters are {} giving an average ROC AUC score of {:.4f}'.format(params, xg_s.best_score_))
xg = xg_s.best_estimator_

In [None]:
a = np.array(xg.predict(ntest.values))
result = pd.DataFrame(np.array([num_test.index, a], dtype=np.int32).T, columns=['ID', 'Prediction'])
result.to_csv('result_xg.csv', index=False)

# Random Forests

In [111]:
from sklearn.ensemble import RandomForestClassifier

w0 = 0.0599

cw = []
for i in np.linspace(start = 0.001, stop = 0.15, num = 10):
    cw.append({0:i, 1:1-i})
cw.append('balanced')

param_grid = {
    'n_estimators' : [20,50,70,110, 130, 150, 200],
    'max_features' : ['auto', 'sqrt'], 
    'max_depth':[3, 5, 7, 10, 15, None],
    'criterion' : ['gini', 'entropy'],
    'min_samples_split' : [2, 3, 5, 7],
    'min_samples_leaf' : [2, 3, 5, 7],
    'class_weight': cw
}

In [112]:
rfs = RandomForestClassifier()
rfs_random = RandomizedSearchCV(rfs, param_grid, scoring='balanced_accuracy', cv=3, return_train_score=True, random_state=42, n_jobs=-1, n_iter=1000)
rfs_random.fit(X_train, y_train)

RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [113]:
params = rfs_random.best_params_
notify_me('The best parameters are {} giving an average ROC AUC score of {:.4f}'.format(params, rfs_random.best_score_))

In [116]:
rf = RandomForestClassifier(**rfs_random.best_params_)
rf.fit(ndata.drop('flu', axis=1), ndata['flu'])

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.050666666666666665,
                                     1: 0.9493333333333334},
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [117]:
a = np.array(rf.predict(ntest))
result = pd.DataFrame(np.array([num_test.index, a], dtype=np.int32).T, columns=['ID', 'Prediction'])
result.to_csv('result_rf.csv', index=False)

# Decision Tree

In [None]:
param_grid = {
    'max_features' : ['auto', 'sqrt'], 
    'max_depth':[3, 4, 5,6, 7, 10, None],
    'criterion' : ['gini', 'entropy'],
    'min_samples_split' : [2, 3,4, 5, 7],
    'min_samples_leaf' : [2, 3,4, 5, 7],
    'class_weight': cw
}
clf = tree.DecisionTreeClassifier()

In [None]:
clf_r = RandomizedSearchCV(clf, param_grid, scoring='balanced_accuracy', cv=3, return_train_score=True, verbose=0, n_iter=2000)
clf_r.fit(X_train, y_train)

In [None]:
params = clf_r.best_params_
print('The best parameters are {} giving an average ROC AUC score of {:.4f}'.format(params, clf_r.best_score_))

In [None]:
a = np.array(clf_r.best_estimator_.predict(ntest))
result = pd.DataFrame(np.array([num_test.index, a], dtype=np.int32).T, columns=['ID', 'Prediction'])
result.to_csv('result_dt.csv', index=False)

# Final Results

In [20]:
pd.DataFrame({'LR':0.69452, 'SVM':0.69214, 'XGBoost':0.67610}, index=[0])

Unnamed: 0,LR,SVM,XGBoost
0,0.69452,0.69214,0.6761


# Scoring

## On evaluation

### AUROC

As mentioned abbove, we will use the accuracy scores for each class and for the whole dataset, as well as the AUROC score from Kaggle platform. You can coimpute AUROC locally (e.g. on your train/validation set) by calling the relevant scikit learn function:

In [None]:
###AUROC locally

#score = roc_auc_score(real_labels, predicted_labels)

#real_labels: the ground truth (0 or 1)
#predicted_labels: labels predicted by your algorithm (0 or 1)

### Accuracy (per class)

Below there is a function that will be handy for your models. It computes the accuracy per-class, based on a model you pass as parameter and a dataset (split to x/y)

In [None]:
def extended_score(model, x_test, y_test):
    overall = 0
    class_0 = 0
    class_1 = 0
    for i in range(100):
        sample = np.random.choice(len(x_test), len(x_test))
        x_sub_test = x_test[sample]
        y_sub_test = y_test[sample]
        
        overall += model.score(x_sub_test, y_sub_test)
        class_0 += model.score(x_sub_test[y_sub_test==0], y_sub_test[y_sub_test==0])
        class_1 += model.score(x_sub_test[y_sub_test==1], y_sub_test[y_sub_test==1])

    return pd.Series([overall / 100., 
                      class_0 / 100.,
                      class_1 / 100.],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

In [44]:
#same job as before, but faster?

score = lambda model, x_val, y_val: pd.Series([model.score(x_val, y_val), 
                                                 model.score(x_val[y_val==0], y_val[y_val==0]),
                                                 model.score(x_val[y_val==1], y_val[y_val==1])], 
                                                index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

In [79]:
def scorel(model, X_test, y_test):
    return 0.8*np.mean(cross_val_score(model,X_train,y_train,scoring='balanced_accuracy', cv=2))+0.2*score(model,X_test, y_test)[2]

## Solution extraction for Kaggle

Make sure that you extract your solutions (predictions) in the correct format required by Kaggle

## Step 3: Conclusions

Highlight at the end of your notebook, which were the top-3 approaches that produced the best scores for you. That is, provide a table with the scores you got (on the AUROC score you get from Kaggle) and make sure that you judge these in relation to your work on the training set