# KEN3450, Data Analysis 2020 

**Kaggle Competition 2020**<br>

Team: MammaMia!

Members:
- Lucas Giovanni Uberti-Bona Marín
- Giacomo Anerdi

In [None]:
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

#import your classifiers here

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Diagnosing the Maastricht Flu 

You are given the early data for an outbreak of a dangerous virus originating from a group of primates kept in a Maastricht biomedical research lab in the basement of the Henri-Paul Spaaklaan building. This virus is dubbed the "Maastricht Flu".

You have the medical records of $n$ patients in `flu_train.csv`. There are two general types of patients in the data, flu patients and healthy patients (this is recorded in the column labeled `flu`: a 0 indicates the absence of the virus and a 1 indicates its presence). Notice that the dataset is unbalanced and you can expect a similar imbalance in the testing set.

**Your task:** build a model to predict if a given patient has the flu. Your goal is to catch as many flu patients as possible without misdiagnosing too many healthy patients.

**The deliverable:** submit your final solution via Kaggle competition using the `flu_test.csv` data.

Maastricht Gemeente will use your model to diagnose sets of future patients (held by us). You can expect that there will be an increase in the number of flu patients in any groups of patients in the future.

Here are some benchmarks for comparison and for expectation management. Notice that because the dataset is unbalanced, we expect that there is going to be a large difference in the accuracy for each class, thus `accuracy` is a metric that might be misleading in this case (see also below). That's why the baselines below are based on the expected accuracy **per class** and also they give you an estimate for the AUROC on all patients in the testing data. This is the score you see in the Kaggle submission as well.

**Baseline Model:** 
- ~50% expected accuracy on healthy patients in training data
- ~50% expected accuracy on flu patients in training data
- ~50% expected accuracy on healthy patients in testing data (future data, no info on the labels)
- ~50% expected accuracy on flu patients in testing data (future data, no info on the labels)
- ~50% expected AUROC on all patients in testing data (future data, no info on the labels)

**Reasonable Model:** 
- ~70% expected accuracy on healthy patients in training data
- ~55% expected accuracy on flu patients, in training data
- ~70% expected accuracy on healthy patients in testing data (future data, no info on the labels, to be checked upon your submission)
- ~57% expected accuracy on flu patients, in testing data (future data, no info on the labels, to be checked upon your submission)
- ~65% expected AUROC on all patients, in testing data (future data, no info on the labels, to be checked from Kaggle)
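
To make the difference between overall accuracy, per-class accuracy and AUROC concrete, here is a minimal sketch on made-up labels and scores (the arrays below are illustrative assumptions, not values from the flu data). It shows that a trivial "always healthy" prediction can reach a high overall accuracy on unbalanced data while missing every flu patient.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data, made up for illustration: 8 healthy patients (0) and 2 flu patients (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical predicted probabilities of flu from some model
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

# Overall accuracy hides how each class is doing
overall_acc = (y_pred == y_true).mean()

# Accuracy per class: fraction of healthy (resp. flu) patients labelled correctly
acc_healthy = (y_pred[y_true == 0] == 0).mean()
acc_flu = (y_pred[y_true == 1] == 1).mean()

# AUROC uses the scores directly, so it does not depend on the 0.5 threshold
auroc = roc_auc_score(y_true, y_prob)

# Predicting "healthy" for everyone: good overall accuracy, but no flu patient is caught
always_healthy_acc = (y_true == 0).mean()

print(overall_acc, acc_healthy, acc_flu, auroc, always_healthy_acc)
```

In this toy example the model and the "always healthy" rule have the same overall accuracy, which is why the benchmarks are stated per class and as AUROC.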

**Grading:**
Your grade will be based on:
1. your model's ability to out-perform the benchmarks (they are kind of low, so we won't care much about this)
2. your ability to carefully and thoroughly follow the data analysis pipeline
3. the extent to which all choices are reasonable and defensible by methods you have learned in this class

## Step 1: Read the data, clean and explore the data

There are a large number of missing values in the data. Nearly all predictors have some degree of missingness. Not all missingness is alike: NaN in the `'pregnancy'` column is meaningful and informative, as patients with NaNs in the pregnancy column are males, whereas NaNs in other predictors may appear randomly. 


**What do you do?:** We make no attempt to interpret the predictors and we make no attempt to model the missing values in the data in any meaningful way. We replace all missing values with 0.

However, it would be more complete to look at the data and allow the data to inform your decision on how to address missingness. For columns where NaN values are informative, you might want to treat NaN as a distinct value; you might want to drop predictors with too many missing values and impute the ones with few missing values using a model. There are many acceptable strategies here, as long as the appropriateness of the method in the context of the task and the data is discussed.
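
The strategies mentioned above can be sketched on a small toy frame. Everything below is an illustrative assumption: the toy values, the 0.5 drop threshold, the `'Missing'` label and the choice of which column counts as informative are not prescribed by the assignment.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    'Age': [3, 25, 40, 61],
    'PregnantNow': [np.nan, 'No', np.nan, np.nan],      # NaN is informative here
    'Testosterone': [np.nan, np.nan, np.nan, 410.0],    # mostly missing
    'Pulse': [np.nan, 72.0, 68.0, 80.0],                # only a few values missing
})

# Fraction of missing values per column
missing_frac = toy.isna().mean()

# Columns where NaN carries meaning are kept and get their own category
informative = ['PregnantNow']

# Hypothetical threshold: drop non-informative columns with more than 50% missing
to_drop = [c for c in toy.columns
           if c not in informative and missing_frac[c] > 0.5]
cleaned = toy.drop(columns=to_drop)

# Treat NaN as a distinct value in the informative columns
cleaned[informative] = cleaned[informative].fillna('Missing')

# Simple imputation for a column with few missing values
cleaned['Pulse'] = cleaned['Pulse'].fillna(cleaned['Pulse'].mean())

print(missing_frac)
print(cleaned)
```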

In [None]:
#Train
df = pd.read_csv('data/flu_train.csv')
df = df[~np.isnan(df['flu'])]
df.head()

In [None]:
#Test
df_test = pd.read_csv('data/flu_test.csv')
df_test.head()

In [None]:
#What's up in each set

x = df.values[:, :-1]
y = df.values[:, -1]

x_test = df_test.values[:, :-1]

print('x train shape:', x.shape)
print('x test shape:', x_test.shape)
print('train class 0: {}, train class 1: {}'.format(len(y[y==0]), len(y[y==1])))

---
### Data Exploration ###

Initial inspection of the data's missing values, quartiles, min/max and standard deviation.

In [None]:
df.describe()

As can be seen, many features contain missing values. However, in some of these columns the missing value has meaning. For example, a missing value in `SmokeAge` means that the individual has never smoked.

In [None]:
df.dtypes

### Analysing different features ###

**Gender**

In [None]:
df['Gender'].isna().sum()

In [None]:
df['Gender'].value_counts().plot.bar(rot=0)
plt.show()

There are no null values in this column. As can be seen, two genders are present in the dataset and they are quite balanced.

**Age**

In [None]:
df['Age'].isna().sum()

In [None]:
df['Age'].hist(bins=16)
plt.show()

No missing values. It can be observed that the age is not normally distributed in the dataset.

**Race**

In [None]:
df['Race1'].isna().sum()

In [None]:
df['Race1'].value_counts().plot.bar(rot=0)
plt.show()

No missing values. The classes are quite unbalanced, with `White` making up the majority of instances, so the other classes are aggregated together. This means that the cleaned data only records whether a given individual is white or not.
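
A minimal sketch of this aggregation on toy data. The column name `IsWhite` and the category labels other than `White` are assumptions made for illustration.

```python
import pandas as pd

# Toy values, made up for illustration
toy = pd.DataFrame({'Race1': ['White', 'Black', 'Mexican', 'White', 'Hispanic', 'Other']})

# 1 if the individual is recorded as White, 0 for every other category
toy['IsWhite'] = (toy['Race1'] == 'White').astype(int)

# The original multi-class column is no longer needed
toy = toy.drop(columns='Race1')
print(toy)
```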

**Education**

In [None]:
df['Education'].isna().sum()

In [None]:
df['Education'].value_counts().plot.bar(rot=45)
plt.show()

In which age groups is the data missing?

In [None]:
df.loc[df['Education'].isna()]['Age'].hist(bins=16)
plt.show()

In [None]:
dummies = pd.get_dummies(df['Education'])
pd.concat([dummies, df['Age']], axis=1).corr()['Age']

1672 instances don't have a value in the `Education` feature. It looks like this feature records the point at which a given individual stopped his/her education. This means that the values are missing for young people who are still at school or at university. For these individuals, the missing value is replaced by the education level achieved so far, which is estimated from their age.
For the remaining few people with missing values, who are older than 35, it can be assumed that they have finished their education. Their values are filled in with the most common category in their age group. The level of education that people typically reached differs between time periods. For example, in the 60s going to university was much less likely than it is today. It can be assumed that the data was collected within a single period of at most a few years, which means that `Age` is closely related to when a given individual was born.

**Marital Status**

In [None]:
df['MaritalStatus'].isna().sum()

In [None]:
df['MaritalStatus'].value_counts().plot.bar(rot=45)
plt.show()

In [None]:
df.loc[df['MaritalStatus'].isna()]['Age'].hist(bins=16)
plt.show()

Similarly to the education example, most missing values arise from young individuals, for whom it can be assumed that they never married. For the remaining missing values, which belong to older people, the most common label in the age group is used.

**HHIncome**

In [None]:
df['HHIncome'].isna().sum()

In [None]:
df['HHIncome'].value_counts().plot.bar(rot=90)
plt.show()

This column is not needed as there is already the `HHIncomeMid` column for each category. As a result it is dropped.

**HHIncomeMid**

In [None]:
df['HHIncomeMid'].isna().sum()

In [None]:
df['HHIncomeMid'].value_counts().sort_index().plot.bar(rot=90)
plt.show()

The missing values are filled in by taking the mean.

**Poverty**

In [None]:
df['Poverty'].isna().sum()

In [None]:
df['Poverty'].hist(bins=10)
plt.show()

A clear distribution cannot be seen. The missing values are filled in by taking the mean.

**Home Rooms**

In [None]:
df['HomeRooms'].isna().sum()

In [None]:
df['HomeRooms'].hist(bins=12)
plt.show()

The data appears to be normally distributed. There are very few data points missing. These are filled in by taking the mean.

**Home Own**

In [None]:
df['HomeOwn'].isna().sum()

In [None]:
df['HomeOwn'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['HomeOwn'].isna()]['Age'].hist(bins=8)
plt.show()

Similarly, there are very few missing values which are filled by using the most common label.

**Work**

In [None]:
df['Work'].isna().sum()

In [None]:
df['Work'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
len(df[(df['Work'].isna()) & (df['Education'].isna()) & (df['Age']<=20)])

It looks like most of the individuals with missing values in the `Work` column have not finished their education yet. These are filled with `NotWorking`, so that the model can more easily pick up the correlation between `Working` and `flu` among the individuals who are no longer studying. The remaining missing values are filled in with the most common category.

**Weight**

In [None]:
df['Weight'].isna().sum()

In [None]:
df['Weight'].hist(bins=20)
plt.show()

In [None]:
df.loc[df['Weight'].isna()]['Age'].hist(bins=8)
plt.show()

There are few values missing. These are filled in by using the mean of individuals grouped by age.
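
Filling with the mean per age group can be sketched as follows. The toy values and the 10-year bins are assumptions for illustration; the notebook does not state which bin width is used for `Weight`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Age':    [4, 6, 8, 32, 35, 38],
    'Weight': [17.0, np.nan, 25.0, 70.0, 80.0, np.nan],
})

# Assumed 10-year age bins: [0, 10), [10, 20), ...
age_group = pd.cut(toy['Age'], bins=list(range(0, 101, 10)), right=False)

# Mean of the observed weights in each age group, aligned with the original rows
group_mean = toy.groupby(age_group, observed=True)['Weight'].transform('mean')

# Only the missing entries are replaced
toy['Weight'] = toy['Weight'].fillna(group_mean)
print(toy)
```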

**Length**

In [None]:
df['Length'].isna().sum()

In [None]:
df['Length'].hist()
plt.show()

In [None]:
df.loc[df['Length'].notna()]['Age'].value_counts().plot.bar()
plt.show()

This column is only for kids between 0 and 3 years old. This column is merged with the `Height` column as they practically contain the same information.

**Head Circumference**

In [None]:
df['HeadCirc'].notna().sum()

In [None]:
df['HeadCirc'].hist()
plt.show()

In [None]:
df.loc[df['HeadCirc'].notna()]['Age'].hist(bins=32)
plt.show()

In [None]:
len(df.loc[df['Age']==0])

Values are present only for babies. However, many of these instances still have missing values. Therefore, it was decided to drop this feature.

**Height**

In [None]:
df['Height'].isna().sum()

In [None]:
df['Height'].hist(bins=30)
plt.show()

In [None]:
df.loc[df['Height'].isna()]['Age'].hist(bins=32)
plt.show()

In [None]:
df.loc[(df['Height'].isna()) & (df['Length'].isna())]['Age'].hist(bins=16)
plt.show()

In [None]:
print('Values present in both Height and Length:', len(df[df['Height'].notna() & df['Length'].notna()]))
print('Values present in neither Height nor Length:', len(df[df['Height'].isna() & df['Length'].isna()]))

The data seems to be normally distributed. Most of the missing values are from young children. Where possible the `Length` value is used to fill in the missing values. The remaining missing values are filled in by taking the mean per age group the individual is part of.
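
The two-step fill described here (first take `Length` where available, then fall back to the age-group mean) might look like the sketch below. The toy values and the 10-year bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: Length is recorded only for small children
toy = pd.DataFrame({
    'Age':    [1, 2, 2, 30, 34],
    'Length': [75.0, 88.0, np.nan, np.nan, np.nan],
    'Height': [np.nan, np.nan, 90.0, 170.0, np.nan],
})

# Step 1: use Length wherever Height is missing
toy['Height'] = toy['Height'].fillna(toy['Length'])

# Step 2: fill what is still missing with the mean of the (assumed 10-year) age group
age_group = pd.cut(toy['Age'], bins=list(range(0, 101, 10)), right=False)
group_mean = toy.groupby(age_group, observed=True)['Height'].transform('mean')
toy['Height'] = toy['Height'].fillna(group_mean)

# Length is now redundant
toy = toy.drop(columns='Length')
print(toy)
```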

**BMI**

In [None]:
df['BMI'].isna().sum()

In [None]:
df['BMI'].hist()
plt.show()

In [None]:
df.loc[df['BMI'].isna()]['Age'].hist(bins=8)
plt.show()

The missing values are calculated using the BMI formula, which uses the `Height` and `Weight` columns; these no longer have missing values at this point.
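
As a sketch, the standard formula is $\mathrm{BMI} = \text{weight} / \text{height}^2$ with weight in kilograms and height in metres. The code below assumes that `Weight` is stored in kilograms and `Height` in centimetres; the units are not stated in this notebook, so this should be checked against the data before use.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (assumed units: kg and cm)
toy = pd.DataFrame({
    'Weight': [70.0, 20.0],
    'Height': [175.0, 110.0],
    'BMI':    [np.nan, 16.5],
})

# BMI = weight [kg] / (height [m])^2, with height converted from cm to m
computed_bmi = toy['Weight'] / (toy['Height'] / 100) ** 2

# Existing BMI values are kept; only the missing ones are computed
toy['BMI'] = toy['BMI'].fillna(computed_bmi)
print(toy)
```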

**BMI Category Under 20 years**

In [None]:
df.loc[df['Age'] < 20]['BMICatUnder20yrs'].isna().sum()

In [None]:
df['BMICatUnder20yrs'].value_counts().plot.bar(rot=0)
plt.show()

Column is dropped as there is a numerical BMI value already which can be seen as being more informative.

**BMI WHO**

In [None]:
df['BMI_WHO'].isna().sum()

In [None]:
df['BMI_WHO'].value_counts().plot.bar(rot=0)
plt.show()

Column is dropped as there is a numerical BMI value already which can be seen as being more informative.

**Pulse**

In [None]:
df['Pulse'].isna().sum()

In [None]:
df['Pulse'].hist()
plt.show()

In [None]:
df.loc[df['Pulse'].isna()]['Age'].hist(bins=32)
plt.show()

Most missing values are from kids. 

In [None]:
df.plot.scatter('Age', 'Pulse')
plt.show()

Values are filled in with the mean for each age group. A bin size of 10 years is used because, as seen from the graph, the values do not change drastically with age.

**BPSysAve**

In [None]:
df['BPSysAve'].isna().sum()

In [None]:
df.loc[df['BPSysAve'].isna()]['Age'].hist(bins=32)
plt.show()

A mean of 105 is manually filled in for kids between 0 and 10 years old, as all instances within this age group have `BPSysAve` missing. The remaining missing values, for people older than 10 years, are filled in with the mean per age group. Only `BPSysAve` is kept, while `BPSys1`, `BPSys2` and `BPSys3` are dropped.

**BPDiaAve**

In [None]:
df['BPDiaAve'].isna().sum()

In [None]:
df.loc[df['BPDiaAve'].isna()]['Age'].hist(bins=32)
plt.show()

A mean of 60 is manually filled in for kids between 0 and 10 years old, as all instances within this age group have `BPDiaAve` missing. The remaining missing values, for people older than 10 years, are filled in with the mean per age group. Only `BPDiaAve` is kept, while `BPDia1`, `BPDia2` and `BPDia3` are dropped.

**Testosterone**

In [None]:
df['Testosterone'].isna().sum()

In [None]:
df['Testosterone'].hist()
plt.show()

In [None]:
df.loc[df['Testosterone'].isna()]['Age'].hist(bins=16)
plt.show()

There are too many missing values. This column is dropped.

**DirectChol**

In [None]:
df['DirectChol'].isna().sum()

In [None]:
df.loc[df['DirectChol'].isna()]['Age'].hist(bins=16)
plt.show()

In [None]:
df.loc[df['DirectChol'].notna(), 'DirectChol'].head()

**TotChol**

In [None]:
df['TotChol'].isna().sum()

In [None]:
df.loc[df['TotChol'].isna()]['Age'].hist(bins=16)
plt.show()

In [None]:
df.loc[df['TotChol'].notna(), 'TotChol'].head()

**UrineVol1**

In [None]:
df['UrineVol1'].isna().sum()

In [None]:
df['UrineVol1'].hist(bins=20)
plt.show()

In [None]:
df.plot.scatter('UrineVol1', 'flu')
plt.show()

In [None]:
df[['UrineVol1', 'flu']].corr()

**UrineFlow1**

In [None]:
df['UrineFlow1'].isna().sum()

In [None]:
df['UrineFlow1'].hist(bins=20)
plt.show()

In [None]:
df.plot.scatter('UrineFlow1', 'flu')
plt.show()

**UrineVol2** and **UrineFlow2**

In [None]:
df['UrineVol2'].isna().sum()

In [None]:
df['UrineFlow2'].isna().sum()

These two columns are dropped as there are too many missing values.

**Diabetes**

In [None]:
df['Diabetes'].isna().sum()

In [None]:
df['Diabetes'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['Age'] < 1]['Diabetes'].isna().sum()

Most missing values are from babies where it can be assumed they don't have diabetes. The few remaining ones are also filled with the most common label `No`.

In [None]:
df.loc[df['Diabetes'].isna()]['Age'].hist(bins=32)
plt.show()

**Diabetes Age**

In [None]:
len(df[(df['DiabetesAge'].isna()) & (df['Diabetes'].isna())])

In [None]:
df['DiabetesAge'].hist(bins=16)
plt.show()

Most missing values for this feature occur because the individuals don't have diabetes. These entries are filled in with a 0. The instances where the data is actually missing belong to the same individuals that had a missing value in the `Diabetes` column. Since those missing values were filled with `No`, the NaN values in the `DiabetesAge` column are also filled with 0. In short, all missing values in this column are filled in with the value 0.

**Health General**

In [None]:
df['HealthGen'].isna().sum()

In [None]:
df['HealthGen'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['HealthGen'].isna()]['Age'].hist(bins=8)
plt.show()

In [None]:
df.loc[df['Age'] <= 11]['HealthGen'].notna().sum()

In [None]:
df.loc[df['Age'] <= 20]['HealthGen'].value_counts().plot.bar(rot=0)
plt.show()

Many missing values. Individuals with `Age` $< 12$ are filled in with `Good` as it is the most common label for the entire population. The remaining ones are filled by using the most common label per age group.
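
A rule-based fill followed by the most common label per age group can be sketched as below. The toy values, the category labels other than `Good` and the 10-year bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Age': [5, 9, 23, 25, 27, 29, 45],
    'HealthGen': [np.nan, np.nan, 'Vgood', 'Vgood', 'Good', np.nan, 'Fair'],
})

# Rule from the text: children younger than 12 get the label 'Good'
is_child = toy['Age'] < 12
toy.loc[is_child & toy['HealthGen'].isna(), 'HealthGen'] = 'Good'


def most_common(labels):
    """Return the most frequent label of a group, or NaN if the group has none."""
    modes = labels.mode()
    return modes.iloc[0] if not modes.empty else np.nan


# Assumed 10-year age bins: [0, 10), [10, 20), ...
age_group = pd.cut(toy['Age'], bins=list(range(0, 101, 10)), right=False)

# Most common label per age group, broadcast back to every row of the group
group_mode = toy.groupby(age_group, observed=True)['HealthGen'].transform(most_common)
toy['HealthGen'] = toy['HealthGen'].fillna(group_mode)
print(toy)
```

The same pattern applies to the other categorical columns that are filled with the most common label per age group.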

**Days Mental Health Bad**

In [None]:
df['DaysMentHlthBad'].isna().sum()

In [None]:
df['DaysMentHlthBad'].hist()
plt.show()

In [None]:
df.loc[df['DaysMentHlthBad'].isna()]['Age'].hist(bins=16)
plt.show()

Most missing values are in younger individuals. For individuals with `Age` $\le 12$ a value of 0 is filled in. The remaining ones are filled in using the mean value per age group.

**Little Interest**

In [None]:
df['LittleInterest'].isna().sum()

In [None]:
df['LittleInterest'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['LittleInterest'].isna()]['Age'].hist(bins=16)
plt.show()

Lots of missing data but most of it is from the younger individuals. It is assumed that people with `Age` $\le 15$ have a value of `None` and the rest are filled in by taking the most common label per age group. 

**Depressed**

In [None]:
df['Depressed'].isna().sum()

In [None]:
df['Depressed'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['Depressed'].isna()]['Age'].hist(bins=16)
plt.show()

Similarly to the `LittleInterest` feature, a large amount of the missing data is from the younger individuals. It is assumed that people with `Age` $\le 15$ have a value of 0 and the rest are filled in by taking the mean per age group. 

**Number Pregnancies**

In [None]:
len(df[(df['nPregnancies'].isna()) & (df['Gender']=='female')])

In [None]:
len(df[df['nPregnancies']==0])

In [None]:
len(df[(df['nPregnancies'].isna()) & (df['nBabies']>0)])

In [None]:
df['nPregnancies'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
len(df[(df['nPregnancies'].notna()) & (df['Gender']=='male')])

It looks like there is never the value 0. It is then assumed that a null value means that the individual has had 0 pregnancies. As a sanity check, there are no males that have had a pregnancy.

**Number Babies**

In [None]:
len(df[(df['nPregnancies'].notna()) & (df['nBabies']).isna()])

In [None]:
len(df[df['nBabies']==0])

In [None]:
df['nBabies'].value_counts().plot.bar(rot=0)
plt.show()

How many instances have the amount of pregnancies different from the amount of babies?

In [None]:
len(df[(df['nPregnancies'] != df['nBabies']) & (df['nBabies'].notna())])

Are there instances where the amount of pregnancies is less than the amount of babies?

In [None]:
len(df[(df['nPregnancies'] < df['nBabies']) & (df['nBabies'].notna())])

In [None]:
len(df[(df['nPregnancies'].notna()) & (df['nBabies'].isna())])

Similarly to `nPregnancies`, instances with missing `nBabies` and 0 `nPregnancies` are filled in with 0.

**Age 1st Baby**

In [None]:
len(df[(df['nBabies']!=0) & (df['Age1stBaby']).isna()])

In [None]:
df['Age1stBaby'].hist(bins=12)
plt.show()

In [None]:
df.loc[(df['nBabies'].notna()) & (df['Age1stBaby'].isna())]['Age'].hist(bins=16)
plt.show()

The data appears to be normally distributed. Missing values where `nBabies` is 0 are filled in with 0.

**Sleep Hours Night**

In [None]:
df['SleepHrsNight'].isna().sum()

In [None]:
df['SleepHrsNight'].hist()
plt.show()

In [None]:
df.loc[df['SleepHrsNight'].isna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['SleepHrsNight'].notna()) & (df['Age']==i)]))]
print(age_count)

All individuals with `Age` $\le 15$ have missing values. For these instances the values are manually inputted by using the average sleep time got from. For the remaining people, the sleep hours per night are found by taking the mean of the age group a particular person is part of.

**Sleep Trouble**

In [None]:
df['SleepTrouble'].isna().sum()

In [None]:
df['SleepTrouble'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['SleepTrouble'].isna()]['Age'].hist(bins=14)
plt.show()

All missing values come from individuals aged 16 or younger. These values can arguably be assumed to be `No`, as younger people are less likely to have sleep trouble.

**Physically Active**

In [None]:
df['PhysActive'].isna().sum()

In [None]:
df['PhysActive'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['PhysActive'].isna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['PhysActive'].notna()) & (df['Age']==i)]))]
print(age_count)

In [None]:
fig = plt.figure(figsize=(9,4))
ax1 = fig.add_subplot(1,2,1)
ax1 = sns.boxplot(x='PhysActive', y='Age', data=df, ax=ax1)  # keyword arguments are required by recent seaborn versions
ax1.set_title('original dataset')
plt.tight_layout()

For kids with `Age` $\le 4$, `No` is inserted. All individuals with $5 \le$ `Age` $\le 15$ are assumed to be physically active. For the remaining missing values, which come from the population older than $15$, the most common value per age group is inserted.

**Physically Active Days**

In [None]:
df['PhysActiveDays'].isna().sum()

In [None]:
df['PhysActiveDays'].hist(bins=6)
plt.show()

There are too many missing values and as a result this column is dropped.

**TV Hours per Day**

In [None]:
df['TVHrsDay'].isna().sum()

In [None]:
df['TVHrsDay'].value_counts().plot.bar(rot=45)
plt.show()

There are too many missing values and as a result this column is dropped.

**Computer Hours per Day**

In [None]:
df['CompHrsDay'].isna().sum()

In [None]:
df['CompHrsDay'].value_counts().plot.bar(rot=45)
plt.show()

There are too many missing values and as a result this column is dropped.

**TV Hours per Day Child**

In [None]:
df['TVHrsDayChild'].isna().sum()

In [None]:
df['TVHrsDayChild'].value_counts().plot.bar(rot=0)
plt.show()

There are too many missing values and as a result this column is dropped.

**Computer Hours per Day Child**

In [None]:
df['CompHrsDayChild'].isna().sum()

In [None]:
df['CompHrsDayChild'].hist(bins=6)
plt.show()

There are too many missing values and as a result this column is dropped.

**Alcohol 12+ Years**

In [None]:
df['Alcohol12PlusYr'].isna().sum()

In [None]:
df['Alcohol12PlusYr'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['Alcohol12PlusYr'].isna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['Alcohol12PlusYr'].notna()) & (df['Age']==i)]))]
print(age_count)

All values are missing for people younger than 18 years old. These are filled with `No`. The remaining ones are filled by taking the most common value in the age group each individual is part of.

**Alcohol Day**

In [None]:
df['AlcoholDay'].isna().sum()

In [None]:
df['AlcoholDay'].hist(bins=20)
plt.show()

In [None]:
df.loc[df['AlcoholDay'].notna()]['Age'].hist(bins=14)
plt.show()

Again, people younger than 18 years old have all the values missing which are then filled with a 0. The remaining ones are filled by taking the most common value per age group each given individual is part of.

**Alcohol Year**

In [None]:
df['AlcoholYear'].isna().sum()

In [None]:
df['AlcoholYear'].hist()
plt.show()

Same as the `AlcoholDay` column, people younger than 18 years old have all the values missing which are then filled with a 0. The remaining ones are filled by taking the most common value per age group each given individual is part of.

**Smoke Now**

In [None]:
df['SmokeNow'].isna().sum()

In [None]:
df['SmokeNow'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['SmokeNow'].notna()]['Age'].hist(bins=14)
plt.show()

In [None]:
fig = plt.figure(figsize=(9,4))
ax1 = fig.add_subplot(1,2,1)
ax1 = sns.boxplot(x='SmokeNow', y='Age', data=df, ax=ax1)  # keyword arguments are required by recent seaborn versions
ax1.set_title('original dataset')
plt.tight_layout()

In [None]:
df.loc[df['SmokeNow']=='Yes']['Age'].hist(bins=8)
plt.show()

**Smoked at Least 100 Cigarettes**

In [None]:
df['Smoke100'].isna().sum()

In [None]:
df['Smoke100'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
len(df.loc[(df['Smoke100'].isna()) & (df['Age']>=20)])

All missing values are from individuals younger than 20 years old. As these people are assumed not to smoke, it follows that they should not identify as smokers. This means that all the missing values are filled in with `No`.

**Smoker Identify**

In [None]:
df['Smoke100n'].isna().sum()

In [None]:
df['Smoke100n'].value_counts().plot.bar(rot=0)
plt.show()

Just like the `Smoke100` feature, all missing values are from individuals younger than 20 years old. Therefore, all these people are assumed not to smoke.

**Smoke Age**

In [None]:
df['SmokeAge'].isna().sum()

In [None]:
df['SmokeAge'].hist(bins=30)
plt.show()

In [None]:
len(df.loc[(df['SmokeNow']=='Yes') & (df['SmokeAge'].isna())])

The data appears to be normally distributed. The individuals who have the value `No` in the `SmokeNow` column have their entry filled with a 0. The few `SmokeAge` entries that are missing for people who do smoke are filled in by taking the mean of the age group a given individual is part of.

**Marijuana**

In [None]:
df['Marijuana'].isna().sum()

In [None]:
df['Marijuana'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['Marijuana'].notna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['Marijuana'].notna()) & (df['Age']==i)]))]
print(age_count)

In [None]:
len(df.loc[(df['Marijuana'].isna()) & (df['SmokeNow']=='Yes')])

The individuals which have the value `No` in the `SmokeNow` column have their entry filled with a `No` in the `Marijuana` column as well.  

**Age First Marijuana**

In [None]:
df['AgeFirstMarij'].isna().sum()

In [None]:
df['AgeFirstMarij'].hist(bins=20)
plt.show()

In [None]:
len(df.loc[(df['Marijuana']=='Yes') & (df['AgeFirstMarij'].isna())])

Most missing values are from people that do not smoke marijuana. The entries for these instances are therefore filled with a 0. The remaining missing values are filled with the mean of the age group the given individual is part of.

**Regular Marijuana**

In [None]:
df['RegularMarij'].isna().sum()

In [None]:
df['RegularMarij'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
len(df.loc[(df['Marijuana']=='Yes') & (df['RegularMarij'].isna())])

Instances that have `No` in the `Marijuana` feature have their missing value in `RegularMarij` set to `No` as well. The remaining missing values, if any, are filled in by using the most common value of the age group the given individual is part of.

**Age Regular Marijuana**

In [None]:
df['AgeRegMarij'].isna().sum()

In [None]:
df['AgeRegMarij'].hist(bins=20)
plt.show()

In [None]:
len(df[(df['AgeRegMarij'].isna()) & (df['Marijuana'] == 'Yes')])

Similarly, instances that have `No` in the `Marijuana` feature have their missing value in `AgeRegMarij` set to 0. The remaining missing values are filled in by using the mean value of the age group the given individual is part of.

**Hard Drugs**

In [None]:
df['HardDrugs'].isna().sum()

In [None]:
df['HardDrugs'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['HardDrugs'].isna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['HardDrugs'].notna()) & (df['Age']==i)]))]
print(age_count)

Instances where the value is missing and `Age` $\le 18$ have the `HardDrugs` attribute set to `No`. The remaining missing values are filled in by using the most common value of the age group the given individual is part of.

**Sex Ever**

In [None]:
df['SexEver'].isna().sum()

In [None]:
df['SexEver'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['SexEver'].isna()]['Age'].hist(bins=14)
plt.show()

The missing values are filled in the following way. If the individual's age is less than the mode of `SexAge`, then the entry is filled with `No`. Otherwise, it is filled with `Yes`. 
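
This age rule can be sketched on toy data as follows. The values are made up for illustration, and the tie-breaking choice (taking the smallest mode if several values are equally common) is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Age':     [10, 15, 20, 35, 50, 64],
    'SexAge':  [np.nan, np.nan, 17.0, 17.0, 19.0, np.nan],
    'SexEver': [np.nan, np.nan, 'Yes', 'Yes', 'Yes', np.nan],
})

# Most common observed value of SexAge (mode() sorts ties, so iloc[0] is the smallest)
typical_age = toy['SexAge'].mode().iloc[0]

# The mask is computed once, before any value is filled in
missing = toy['SexEver'].isna()

# Younger than the mode -> 'No', otherwise -> 'Yes'
toy.loc[missing & (toy['Age'] < typical_age), 'SexEver'] = 'No'
toy.loc[missing & (toy['Age'] >= typical_age), 'SexEver'] = 'Yes'
print(toy)
```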

**Sex Age**

In [None]:
df['SexAge'].isna().sum()

In [None]:
df['SexAge'].hist(bins=20)
plt.show()

In [None]:
len(df.loc[(df['SexAge'].isna()) & (df['SexEver'].notna())])

This feature appears to be normally distributed. Instances where there is a missing value and `SexEver` is `No` are filled with 0. The remaining missing values are filled in with `Yes`.

**Sex Number of Partners Life**

In [None]:
df['SexNumPartnLife'].isna().sum()

In [None]:
df['SexNumPartnLife'].hist(bins=20)
plt.show()

In [None]:
len(df[(df['SexNumPartnLife'].isna()) & (df['SexEver'] == 'Yes')])

In [None]:
df.loc[df['SexNumPartnLife'].isna()]['Age'].hist(bins=14)
plt.show()

In [None]:
age_count = list()
for i in range(df['Age'].max() + 1):  # + 1 so that the oldest age is included
    age_count += [(i, len(df.loc[(df['SexNumPartnLife'].notna()) & (df['Age']==i)]))]
print(age_count)

Missing values for people that have the `SexEver` attribute set to `No` are filled in with 0. The remaining missing values are filled in by using the mean of the age group that the given individual is part of. However, the dataset doesn't contain any value for people of 70 years or older. These values are filled in by using the mean of the instances with $60 \le$ `Age` $< 70$, as it is presumed that this value does not greatly change between the two age groups.
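
The three steps of this fill, including the fallback for the oldest age group, might be sketched as below. The toy values and the 10-year bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: nobody aged 70 or older has a recorded value
toy = pd.DataFrame({
    'Age':             [16, 25, 28, 62, 66, 68, 74],
    'SexEver':         ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'],
    'SexNumPartnLife': [np.nan, 4.0, np.nan, 5.0, 7.0, np.nan, np.nan],
})
col = 'SexNumPartnLife'

# Step 1: no partners for people whose SexEver is 'No'
toy.loc[toy[col].isna() & (toy['SexEver'] == 'No'), col] = 0

# Mean of the observed values for ages 60 to 69 (ages are integers here,
# so between(60, 69) is the same as 60 <= Age < 70); used as fallback in step 3
sixties_mean = toy.loc[toy['Age'].between(60, 69), col].mean()

# Step 2: mean per (assumed 10-year) age group; a group without values stays NaN
age_group = pd.cut(toy['Age'], bins=list(range(0, 101, 10)), right=False)
group_mean = toy.groupby(age_group, observed=True)[col].transform('mean')
toy[col] = toy[col].fillna(group_mean)

# Step 3: people aged 70 or older borrow the mean of the 60 to 69 group
toy.loc[toy[col].isna() & (toy['Age'] >= 70), col] = sixties_mean
print(toy)
```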

**Sex Number of Partners Year**

In [None]:
df['SexNumPartYear'].isna().sum()

In [None]:
df['SexNumPartYear'].hist(bins=20)
plt.show()

In [None]:
len(df[(df['SexNumPartYear'].isna()) & (df['SexEver'] == 'Yes')])

In [None]:
df.loc[(df['SexNumPartYear'].isna()) & (df['SexEver'] == 'Yes')]['Age'].hist(bins=14)
plt.show()

Similarly to `SexNumPartnLife`, missing values for people that have the `SexEver` attribute set to `No` are filled in with 0. The remaining missing values are filled in by using the mean of the age group that the given individual is part of. However, the dataset doesn't contain any value for people of 70 years or older. These values are filled in by using the mean of the instances with $60 \le$ `Age` $< 70$, as it is presumed that this value does not greatly change between the two age groups.

**Same Sex**

In [None]:
df['SameSex'].isna().sum()

In [None]:
len(df.loc[(df['SameSex'].isna()) & (df['SexEver']=='Yes')])

In [None]:
df['SameSex'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['SameSex'].isna()]['Age'].hist(bins=16)
plt.show()

Lots of missing values. Mostly for the younger and older age groups.

**Sex Orientation**

In [None]:
len(df.loc[(df['SexOrientation'].isna()) & (df['Age'] >= 14)])

In [None]:
df['SexOrientation'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[df['SexOrientation'].isna()]['Age'].hist(bins=16)
plt.show()

This column is dropped as there are too many missing values.

**Pregnant Now**

In [None]:
len(df.loc[(df['PregnantNow'].isna()) & (df['Gender'] == 'female')])

In [None]:
df['PregnantNow'].value_counts().plot.bar(rot=0)
plt.show()

In [None]:
df.loc[(df['PregnantNow'].isna()) & (df['Gender'] == 'female')]['Age'].hist(bins=16)
plt.show()

All missing values are either from males or people that are less than 20 years old or more than 45 years old. As such it can be assumed that all individuals with missing values are not pregnant.

**Flu**

In [None]:
df['flu'].isna().sum()

In [None]:
df['flu'].value_counts().plot.bar(rot=0)
plt.show()

There are no rows with flu value missing that would need to be dropped.

## Step 2: Model Choice

The first task is to decide which classifier to use (from the ones that we learned this block), i.e. which one would best suit our task and our data. Note that our data are heavily unbalanced, thus you need to do some exploration on how different classifiers handle imbalances in the data (we will discuss some of these techniques during week 3 lecture).

It would be possible to do a brute-force model comparison here, i.e. tune all models and compare which does best with respect to various benchmarks. However, it is also reasonable to do a first round of model comparison by running models (with out-of-the-box parameter settings) on the training data and eliminating those that perform very poorly.

Let the best model win!
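A first elimination round of this kind could look like the sketch below. It runs on synthetic imbalanced data rather than the flu data, and the choice of candidate classifiers is an assumption (the course's list is not given here). Settings are close to the defaults; `max_iter=1000` is raised only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data: roughly 10% positives.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical candidate list; swap in whichever classifiers are relevant.
candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'logistic regression (balanced)': LogisticRegression(max_iter=1000,
                                                         class_weight='balanced'),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'k-nearest neighbours': KNeighborsClassifier(),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, scoring='roc_auc', cv=cv).mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst mean AUROC.
for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:32s} {auc:.3f}')
```

Scoring with `roc_auc` rather than plain accuracy keeps the ranking meaningful on unbalanced data, where a model that always predicts the majority class would otherwise look strong.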

In [None]:
def expected_score(model, x_test, y_test):
    """Average overall and per-class accuracy over 100 bootstrap resamples.

    x_test and y_test are indexed positionally, so pass NumPy arrays.
    """
    overall = 0
    class_0 = 0
    class_1 = 0
    for i in range(100):
        # Draw a bootstrap sample (with replacement) of the test indices.
        sample = np.random.choice(len(x_test), len(x_test))
        x_sub_test = x_test[sample]
        y_sub_test = y_test[sample]

        overall += model.score(x_sub_test, y_sub_test)
        class_0 += model.score(x_sub_test[y_sub_test==0], y_sub_test[y_sub_test==0])
        class_1 += model.score(x_sub_test[y_sub_test==1], y_sub_test[y_sub_test==1])

    return pd.Series([overall / 100., 
                      class_0 / 100.,
                      class_1 / 100.],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

score = lambda model, x_test, y_test: pd.Series([model.score(x_test, y_test), 
                                                 model.score(x_test[y_test==0], y_test[y_test==0]),
                                                 model.score(x_test[y_test==1], y_test[y_test==1])], 
                                                index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

In [None]:
from sklearn.linear_model import LogisticRegression

# Candidate class weights: class 0 gets a weight i in [0.001, 0.4] and
# class 1 (flu) gets 1 - i, so the flu class always weighs more.
cw = [{0: i, 1: 1 - i} for i in np.linspace(start=0.001, stop=0.4, num=40)]
cw.append('balanced')

param_grid = {
    'C': list(np.linspace(start=0.01, stop=100, num=20)),
    'penalty': ['l1', 'l2', 'elasticnet'],
    'max_iter': [10, 100, 1000, 10000],
    'class_weight': cw
}

In [None]:
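from sklearn.model_selection import GridSearchCV

# Note: the default 'lbfgs' solver only supports the 'l2' penalty, so the
# 'l1' and 'elasticnet' candidates fail to fit and are scored as NaN.
# A solver such as 'saga' (plus an l1_ratio for 'elasticnet') would be
# needed to actually evaluate them.
lr = LogisticRegression()
lr_r = GridSearchCV(lr, param_grid, scoring='roc_auc', cv=3, return_train_score=True, verbose=0)
lr_r.fit(ndata.drop('flu', axis=1), ndata['flu'])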

In [None]:
params = lr_r.best_params_

print('The best parameters are {} giving an average ROC AUC score of {:.4f}'.format(params, lr_r.best_score_))

In [None]:
lr_r.best_estimator_.predict(ntest)

In [None]:
### fancy models that solve the problem

## On evaluation

### AUROC

As mentioned above, we will use the accuracy scores for each class and for the whole dataset, as well as the AUROC score from the Kaggle platform. You can compute AUROC locally (e.g. on your train/validation set) by calling the relevant scikit-learn function:

In [None]:
###AUROC locally

#score = roc_auc_score(real_labels, predicted_labels)

#real_labels: the ground truth (0 or 1)
#predicted_labels: labels predicted by your algorithm (0 or 1)
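As a hedged illustration of a local AUROC check, the sketch below holds out a validation split from synthetic data and scores it in two ways: from hard 0/1 labels, as in the template above, and from predicted probabilities. Whether the Kaggle score is computed from labels or probabilities is not stated here, so the second variant is an assumption about what may be useful, not a description of the competition's scoring.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the flu training set.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=1)

# Stratify so the validation split keeps the same class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)

# AUROC from hard 0/1 predictions: the ROC curve has a single operating point.
auc_from_labels = roc_auc_score(y_val, model.predict(X_val))

# AUROC from the predicted probability of class 1: uses the full ranking.
auc_from_scores = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(f'AUROC from labels:        {auc_from_labels:.3f}')
print(f'AUROC from probabilities: {auc_from_scores:.3f}')
```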

### Accuracy (per class)

Below is a function that will be handy for your models. It computes the per-class accuracy for a model you pass as a parameter and a dataset (split into x/y).

In [None]:
def extended_score(model, x_test, y_test):
    overall = 0
    class_0 = 0
    class_1 = 0
    for i in range(100):
        sample = np.random.choice(len(x_test), len(x_test))
        x_sub_test = x_test[sample]
        y_sub_test = y_test[sample]
        
        overall += model.score(x_sub_test, y_sub_test)
        class_0 += model.score(x_sub_test[y_sub_test==0], y_sub_test[y_sub_test==0])
        class_1 += model.score(x_sub_test[y_sub_test==1], y_sub_test[y_sub_test==1])

    return pd.Series([overall / 100., 
                      class_0 / 100.,
                      class_1 / 100.],
                      index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])

In [None]:
#same job as before, but faster?

score = lambda model, x_val, y_val: pd.Series([model.score(x_val, y_val), 
                                                 model.score(x_val[y_val==0], y_val[y_val==0]),
                                                 model.score(x_val[y_val==1], y_val[y_val==1])], 
                                                index=['overall accuracy', 'accuracy on class 0', 'accuracy on class 1'])
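For reference, the accuracy on a single class is the same quantity as the recall of that class, so the per-class numbers can also be obtained from existing predictions with scikit-learn's metric functions. The sketch below uses a small hand-made label vector purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hand-made example: 6 healthy (0) and 4 flu (1) patients.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Fraction of all patients classified correctly.
overall = accuracy_score(y_true, y_pred)

# average=None returns one recall value per class, ordered by label (0, 1).
per_class = recall_score(y_true, y_pred, average=None)

print('overall accuracy   :', overall)
print('accuracy on class 0:', per_class[0])
print('accuracy on class 1:', per_class[1])
```

Here 7 of 10 predictions are correct overall, 5 of the 6 healthy patients are recognised, and 2 of the 4 flu patients are caught, which shows how a reasonable overall accuracy can hide weak performance on the smaller class.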

## Solution extraction for Kaggle

Make sure that you extract your solutions (predictions) in the correct format required by Kaggle.
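A minimal sketch of writing predictions to a CSV file is shown below. The column names `ID` and `Prediction` and the id values are hypothetical; the actual names and order must be taken from the competition's sample submission. The prediction array stands in for the output of a fitted model, e.g. `lr_r.best_estimator_.predict(ntest)`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test-set ids and stand-in predictions (0 = healthy, 1 = flu).
test_ids = np.arange(1, 6)
predictions = np.array([0, 1, 0, 0, 1])

# Hypothetical column names; check the competition's sample submission.
submission = pd.DataFrame({'ID': test_ids, 'Prediction': predictions})

# Written to an in-memory buffer here; in practice pass a file name such as
# 'submission.csv' to to_csv instead. index=False omits the row index column.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```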

## Step 3: Conclusions

At the end of your notebook, highlight the top-3 approaches that produced the best scores for you. That is, provide a table with the scores you got (the AUROC score you get from Kaggle), and make sure that you judge these in relation to your work on the training set.