# INFO-F-422 - Statistical Foundations of Machine Learning

### Alexandre Flachs - __[alexandre.flachs@ulb.be](mailto:alexandre.flachs@ulb.be) - Student ID 474748__
### Marie Giot - __[marie.giot@ulb.be](mailto:marie.giot@ulb.be) - Student ID 474915__
### Jeanne Szpirer - __[jeanne.szpirer@ulb.be](mailto:jeanne.szpirer@ulb.be) - Student ID 477286__

### Video presentation: www.youtube.com/abcd1234

## Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines


# Introduction


*Add introductory text with nice images?*

# Data preprocessing

Before fitting any model, we need to preprocess the data to make it usable. This pipeline is divided into three parts:
1. **Missing value imputation**: Replace missing values, possibly using other known values.
2. **Feature engineering**: Define useful features from the available ones.
3. **Feature selection**: Some features might be useless or misleading for the model, so we might need to remove them.

Let's start by importing our data, then develop each of the above parts.

In [None]:
# Training set features
training_set_features <- read.csv("training_set_features.csv", stringsAsFactors = TRUE)
dim(training_set_features)

# Test set features
test_set_features <- read.csv("test_set_features.csv", stringsAsFactors = TRUE)
dim(test_set_features)

# Training set labels
training_set_labels <- read.csv("training_set_labels.csv", stringsAsFactors = TRUE)
dim(training_set_labels)

We can see that the training features set and the training labels set have the same number of rows. This is a good first sign, because it means that we have an "answer" for every training row.
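
Equal row counts do not by themselves guarantee that each feature row is paired with the right label row. The following is a minimal sketch of a stricter check on toy data frames. The key column `respondent_id`, the other column names and all values are assumptions made for illustration, not taken from the files above.

```r
# Toy stand-ins for the feature and label tables (hypothetical values).
features <- data.frame(respondent_id = c(0, 1, 2), feature_a = c(1, 3, NA))
labels   <- data.frame(respondent_id = c(0, 1, 2), target = c(0, 1, 0))

# Same number of rows AND the same identifiers in the same order.
same_size  <- nrow(features) == nrow(labels)
same_order <- same_size && identical(features$respondent_id, labels$respondent_id)
same_order
```

If such a check fails on the real data, merging the two tables on the key column would be safer than relying on row order.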

## Missing value imputation


We summarize our data before doing any work.

In [None]:
summary(training_set_features)

We see that most features contain missing values. To measure how severe this is, we can count how many rows would be left if we removed every row containing at least one missing value.

In [None]:
# First method
cat("Training set : ", dim(training_set_features)[1], "->", dim(na.omit(training_set_features))[1], "\n")
cat("Test set     : ", dim(test_set_features)[1], "->", dim(na.omit(test_set_features))[1], "\n")
cat("Training labs: ", dim(training_set_labels)[1], "->", dim(na.omit(training_set_labels))[1])

# Second method (when TRUE, there is at least one missing value in the column)
# apply(is.na(training_set_features),2,any)
# apply(is.na(test_set_features),2,any)
# apply(is.na(training_set_labels),2,any)

At least no row of the training labels has a missing value, so we can use every entry of the training set for both targets.
Counting the number of missing values per feature lets us see whether some of them could be useless. The health insurance feature is the emptiest (almost half of the rows lack this value), but intuitively it might be a major factor in the vaccination decision, so we keep it for now.
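
Raw counts are easier to compare across features once they are expressed as a share of the rows. A minimal sketch on a toy data frame, where the column names and values are hypothetical:

```r
# Toy data frame with missing values (hypothetical columns and values).
toy <- data.frame(
  feature_a = c(1, NA, 3, 4),
  feature_b = c(NA, NA, 1, 0),
  feature_c = c(1, 2, 3, 4)
)

# Fraction of missing values per column, sorted from emptiest to fullest.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share
```

The same expression applied to `training_set_features` would rank the real features by their share of missing values.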

In [None]:
# We can also check whether some columns have almost no values; in that case they are not really worth keeping
print(sapply(training_set_features, function(x) sum(is.na(x))))

# From here on, I just copy-pasted what Jeanne had done

### I do it first with training_set_features, but I think the same steps will have to be applied to test_set_features

In [None]:
head(training_set_features)

In [None]:
sapply(training_set_features[1,],class)

In [None]:
dim(training_set_features)

# Indices of the factor (categorical) columns
factor_variables <- which(sapply(training_set_features, is.factor))
factor_variables

# Categorical part of the data
data_factor <- training_set_features[, factor_variables]
dim(data_factor)

# Numeric part of the data, handled first
data_preprocessed <- training_set_features[, -factor_variables]
head(data_preprocessed)
dim(data_preprocessed)

### Histograms of the numeric features processed so far

In [None]:
# Not a great method, but it works

# install.packages("Hmisc")
# library(Hmisc)
# hist.data.frame(na.omit(training_set_features[, -c(1)]))
# hist(training_set_features[33])

# Better method, but the missing values have not been replaced yet
library(reshape2)
library(ggplot2)
d <- melt(as.data.frame(data_preprocessed[,-c(1)]))
ggplot(d,aes(x = value)) + 
    facet_wrap(~variable,scales = "free_x") + 
    geom_histogram()

We can select several variables that should influence the output variables, based on our own judgment and on the scientific articles we have read.
### Add the references here once we have them
In my opinion, however, "marital_status", "rent_or_own", "education" and "employment_status" should not have much influence. In addition, "employment_industry" and "employment_occupation" are random short strings with a very large number of distinct values, so I am not convinced they are very useful.
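
The one-hot encoding applied in the following cells can also be sketched in base R, without an extra package. This is only an illustration under assumptions: the toy data frame and its level values are hypothetical, and `model.matrix` is swapped in for the `dummies` package used in the notebook.

```r
# Toy factor columns (hypothetical level values).
toy <- data.frame(
  sex        = factor(c("Female", "Male", "Female")),
  census_msa = factor(c("Non-MSA", "MSA", "MSA"))
)

# Ask for full indicator (identity) contrasts so that every level of
# every factor gets its own 0/1 column.
full_contrasts <- lapply(toy, contrasts, contrasts = FALSE)
design <- model.matrix(~ ., data = toy, contrasts.arg = full_contrasts)

# Drop the intercept column to keep only the indicator columns.
onehot <- design[, -1, drop = FALSE]
onehot
```

Note that `model.matrix` drops rows containing missing values by default, so this variant would need the factors to be imputed first or the `na.action` option to be changed.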

In [None]:
# We need the dummies package to one-hot encode the factor variables.
# install.packages('dummies')
library(dummies)

In [None]:
# We keep only some factor variables
variables_to_keep<-c("age_group","race","sex","income_poverty","hhs_geo_region","census_msa")
data_factor_onehot <- dummy.data.frame(data_factor[,variables_to_keep], sep="_")

In [None]:
dim(data_factor_onehot)

In [None]:
data_factor_onehot[1:2,]

In [None]:
data_preprocessed_extended<-cbind(data_preprocessed,data_factor_onehot)
dim(data_preprocessed_extended)

We can now move on to handling the missing values. For the moment, I implement the same approach as in the practical sessions (replacing missing values by the mean), but we can certainly do more research and use something else. I have seen that the median is often used.
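
As a point of comparison, here is a minimal sketch of median imputation in which the fill values are computed on the training data only and then reused for the test data. The helper `replace_na_with` and the toy data frames are hypothetical and are not part of the notebook's pipeline.

```r
# Replace the missing values of `vec` by a given fill value.
replace_na_with <- function(vec, fill) {
  vec[is.na(vec)] <- fill
  vec
}

# Toy training and test sets (hypothetical values).
train <- data.frame(x = c(1, 2, NA, 10), y = c(0, NA, 1, 1))
test  <- data.frame(x = c(NA, 3),        y = c(1, NA))

# Medians are computed on the training set only...
train_medians <- sapply(train, median, na.rm = TRUE)

# ...and reused column by column for both sets.
train_imputed <- as.data.frame(Map(replace_na_with, train, train_medians))
test_imputed  <- as.data.frame(Map(replace_na_with, test, train_medians))
test_imputed
```

Reusing the training statistics keeps the test set from influencing the preprocessing, and the median is less sensitive than the mean to extreme values such as the `10` in the toy column `x`.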

In [None]:
# Replace the missing values of a numeric vector by the mean of its observed values
replace_na_with_mean_value <- function(vec) {
    mean_vec <- mean(vec, na.rm = TRUE)
    vec[is.na(vec)] <- mean_vec
    vec
}

In [None]:
data_preprocessed_extended<-data.frame(apply(data_preprocessed_extended,2,replace_na_with_mean_value))
summary(data_preprocessed_extended)

In [None]:
dim(na.omit(data_preprocessed_extended))

At this stage, we have a data set containing only numeric variables that make sense to keep (in my opinion) and no more missing values, so what is mainly missing is feature engineering. We can then perform feature selection using a measure of the correlation between inputs and outputs.
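
A correlation-based ranking of the features could look like the sketch below. It uses the Pearson correlation on toy data; the feature names, the target values and the 0.5 threshold are all hypothetical choices for illustration.

```r
# Toy numeric features and a binary target (hypothetical values).
X <- data.frame(
  f1 = c(1, 2, 3, 4, 5, 6),
  f2 = c(2, 1, 2, 1, 2, 1),
  f3 = c(6, 5, 4, 3, 2, 1)
)
y <- c(0, 0, 0, 1, 1, 1)

# Pearson correlation between each feature and the target.
cors <- sapply(X, function(col) cor(col, y))

# Rank features by absolute correlation and keep those above a threshold.
ranking  <- sort(abs(cors), decreasing = TRUE)
selected <- names(cors)[abs(cors) > 0.5]
selected
```

Such a filter only captures linear, one-feature-at-a-time relationships, so it is best treated as a first screening step rather than a final selection.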

## Feature engineering


## Feature selection


# Model selection

## Model 1

## Model 2

## Model 3

#### Example of simple equation
\begin{equation}
e = mc^2
\end{equation}

#### Example of matrix equation - Cross product formula:

\begin{equation*}
\mathbf{V}_1 \times \mathbf{V}_2 =  \begin{vmatrix}
\mathbf{i} & \mathbf{j} & \mathbf{k} \\
\frac{\partial X}{\partial u} &  \frac{\partial Y}{\partial u} & 0 \\
\frac{\partial X}{\partial v} &  \frac{\partial Y}{\partial v} & 0
\end{vmatrix}
\end{equation*}

#### Example of multiline equation - The Lorenz Equations:

\begin{align}
\dot{x} & = \sigma(y-x) \\
\dot{y} & = \rho x - y - xz \\
\dot{z} & = -\beta z + xy
\end{align}

#### Example of Markdown Table:

| This | is   |
|------|------|
|   a  | table|


# Alternative models





# Conclusions