# INFO-F-422 - Statistical Foundations of Machine Learning

### Alexandre Flachs - __[alexandre.flachs@ulb.be](mailto:alexandre.flachs@ulb.be) - Student ID 474748__
### Marie Giot - __[marie.giot@ulb.be](mailto:marie.giot@ulb.be) - Student ID 474915__
### Jeanne Szpirer - __[jeanne.szpirer@ulb.be](mailto:jeanne.szpirer@ulb.be) - Student ID 477286__

### Video presentation: www.youtube.com/abcd1234

## Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines


# Introduction


The goal of this project is to propose efficient machine learning methods to predict how likely a person is to receive the H1N1 or the seasonal flu vaccine. The challenge was initiated by the DrivenData platform. Four different techniques are presented below, and their costs and results are discussed. The group also submitted its results on the DrivenData platform in order to receive a score and compare it with those of the other participants.

# Data preprocessing

Before training any model we need to preprocess the data to make it usable. This pipeline is divided into three parts:
1. **Missing value imputation**: Replace missing values, possibly using other known values.
2. **Feature engineering**: Define useful features from the available ones.
3. **Feature selection**: Some features might be useless or misleading for the model, so we might need to remove them.

Let's start by importing our data, then develop each of the above parts.

In [None]:
# Training set features
training_set_features <- read.csv("training_set_features.csv", stringsAsFactors = T, na.strings = c("NA", ""))
dim(training_set_features)

# Test set features
test_set_features <- read.csv("test_set_features.csv", stringsAsFactors = T, na.strings = c("NA", ""))
dim(test_set_features)

# Training set labels
training_set_labels <- read.csv("training_set_labels.csv", stringsAsFactors = T, na.strings = c("NA", ""))
dim(training_set_labels)

We can see that the training feature set and the training label set have the same number of rows. This is a good first sign: it means that we have an "answer" for every training row.

## Missing value imputation


We summarize our data before doing any work.

In [None]:
summary(training_set_features)

We see that most features have missing values. We can compare the number of rows left if we remove every row containing at least one missing value.

In [None]:
# First method
cat("Training set : ", dim(training_set_features)[1], "->", dim(na.omit(training_set_features))[1], "\n")
cat("Test set     : ", dim(test_set_features)[1], "->", dim(na.omit(test_set_features))[1], "\n")
cat("Training labs: ", dim(training_set_labels)[1], "->", dim(na.omit(training_set_labels))[1])


At least no row of the training labels has a missing value, so we can use every entry of the training set for both targets.
Counting the number of missing values per feature lets us see whether some of them could be useless. The health insurance, employment occupation and employment industry columns are the emptiest (almost half of the rows lack this information), but intuitively they could weigh heavily in the vaccination decision, so we keep them for now.

In [None]:
# We can also check whether some columns have almost no values at all; in that case they are not really worth keeping
a <- sapply(training_set_features, function(x) sum(is.na(x)))
print(a[order(a, decreasing=T)])

What happens if we remove these fields?

In [None]:
useful_nas <- which(sapply(test_set_features, function(x) sum(is.na(x))) > 10000)

cat("Training set : ",
    dim(training_set_features[,-useful_nas])[1], "->",
    dim(na.omit(training_set_features[,-useful_nas]))[1], "\n")
cat("Test set     : ", dim(test_set_features[,-useful_nas])[1], "->",
    dim(na.omit(test_set_features[,-useful_nas]))[1], "\n")
cat("Training labels: ", dim(training_set_labels)[1], "->", dim(na.omit(training_set_labels))[1])


Far fewer rows contain missing values. We will therefore handle the three fields with the most missing values differently.

## Imputation of missing numerical values

In [None]:
# Imputation will be the same for training and testing so we merge them
features_set <- rbind(training_set_features, test_set_features)


# Note the indexes of each dataset to get them back after
tr_indexes <- 1:nrow(training_set_features)
ts_indexes <- (nrow(training_set_features)+1):(nrow(training_set_features) + nrow(test_set_features))


### Integer encoding of some variables

In [None]:
# as.numeric() on a factor returns the internal codes 1..n and ignores the level
# labels, so we convert through as.character() to keep the orders assigned below.
levels(features_set[, "age_group"])
levels(features_set[, "age_group"]) <- 0:4
features_set[, "age_group"] <- as.numeric(as.character(features_set[, "age_group"]))/4

levels(features_set[, "education"])
levels(features_set[, "education"]) <- 0:3
features_set[, "education"] <- as.numeric(as.character(features_set[, "education"]))/3

levels(features_set[, "income_poverty"])
levels(features_set[, "income_poverty"]) <- c(1, 2, 0)
features_set[, "income_poverty"] <- as.numeric(as.character(features_set[, "income_poverty"]))/2

levels(features_set[, "census_msa"])
levels(features_set[, "census_msa"]) <- c(1, 2, 0)
features_set[, "census_msa"] <- as.numeric(as.character(features_set[, "census_msa"]))/2
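The integer encoding above relies on the order in which R sorts the factor levels. As a minimal sketch of an alternative, the snippet below encodes an ordinal variable from an explicitly stated order and rescales it to [0, 1]. The variable `income` and its level names are hypothetical and are not taken from the data set.

```r
# Hypothetical ordinal variable; the level names are illustrative only
income <- factor(c("mid", "low", "high", NA, "low"))

# Explicit order from lowest to highest, independent of alphabetical sorting
ordered_levels <- c("low", "mid", "high")
codes <- match(as.character(income), ordered_levels) - 1  # 0, 1, 2 (NA stays NA)

# Rescale to [0, 1]
income_scaled <- codes / (length(ordered_levels) - 1)
print(income_scaled)
```

Stating the order explicitly avoids depending on the locale-specific sorting of the level labels.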


### Non-negative matrix factorization based imputation

Non-negative matrix factorization is a procedure that approximates a matrix with non-negative values as the product of two matrices of smaller dimensions. More formally, a matrix $M$ of dimension $m\times n$ is decomposed into two non-negative matrices $W$ and $H$ of dimensions $m\times p$ and $p\times n$ respectively such that $M \approx WH$.

More details concerning the imputation method can be found here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510447/
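As a minimal sketch of the factorization itself (not of the imputation pipeline), the following base R snippet factorizes a small, fully observed matrix with the classic multiplicative updates of Lee and Seung. The toy matrix, the rank `p`, the iteration count and the constant `tiny` are illustrative assumptions.

```r
set.seed(1)
m <- 6; n <- 4; p <- 2
# Build a non-negative matrix of exact rank p so that a good fit exists
M <- matrix(runif(m * p), m, p) %*% matrix(runif(p * n), p, n)

# Random non-negative starting point
W <- matrix(runif(m * p), m, p)
H <- matrix(runif(p * n), p, n)
tiny <- 1e-9  # guards against division by zero

err_start <- sum((M - W %*% H)^2)
for (it in 1:500) {
  # Multiplicative updates keep W and H non-negative
  H <- H * (t(W) %*% M) / (t(W) %*% W %*% H + tiny)
  W <- W * (M %*% t(H)) / (W %*% H %*% t(H) + tiny)
}
err_end <- sum((M - W %*% H)^2)

cat("squared error:", err_start, "->", err_end, "\n")
```

Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative without any explicit projection step.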

PseudoCode :

```pseudocode
Consider M of dim m x n having missing values
Fix N > 0
Normalize M and replace NAs using mean imputation
rM <- rank(M)
k1 <- floor(max(abs(rM - N/2), 1))
kN <- min(abs(rM + N/2), m, n)

# Compute approximations
initialize X array of size kN-k1+1
for k in 1...kN-k1+1 do
    Compute the NMF of rank k1+k-1 of M on non missing values and store the result in X[k]

## Based on all X[k], reconstruct M
# Weights based only on non missing values of M
initialize d array of size kN-k1+1
for k in 1...kN-k1+1 do
    Compute d[k] = sum(|X[k]_ij - M_ij|) / nb_nonNA   (sum over non missing entries)

# Reconstruct
Initialize denum = 0, M_hat of 0 with same dim as M
for k in 1...kN-k1+1 do
    M_hat += exp(-d[k])*X[k]
    denum += exp(-d[k])
M_hat <- M_hat/denum
```


After these steps, M_hat does not contain missing values and the imputation is based on local and global information.
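The weighting step of the pseudocode can be illustrated on a toy case. The sketch below assumes three hypothetical reconstructions of a small matrix and shows that a reconstruction with a smaller error on the observed entries receives a larger weight $e^{-d_k}$ in the final average. All values are made up for illustration.

```r
# Three hypothetical reconstructions of a 2 x 2 matrix (one per rank k)
M <- matrix(c(0.2, 0.4, 0.6, 0.8), 2, 2)
X_list <- list(M + 0.01, M + 0.10, M - 0.30)

# Mean absolute error of each reconstruction on the observed entries
d <- sapply(X_list, function(X) mean(abs(X - M)))

# Normalized weights exp(-d[k]) / sum(exp(-d)): smaller error, larger weight
w <- exp(-d) / sum(exp(-d))

# Weighted average of the reconstructions
M_hat <- Reduce(`+`, Map(function(X, wk) wk * X, X_list, w))
print(round(w, 3))
```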

In [None]:
library(NMFN)

# We work only on the columns of numeric type
numeric_variables_idx<-which(sapply(features_set[1,],class)!="factor")
numeric_variables_idx <- numeric_variables_idx[-c(1, 16)] # Remove resp ID and health insur

# We need to normalize this data to use efficient nNMF
library("caret")

ss <- preProcess(as.data.frame(features_set[, numeric_variables_idx]), method=c("range"))
features_set[,numeric_variables_idx] <- predict(ss, as.data.frame(features_set[, numeric_variables_idx]))
summary(features_set[, numeric_variables_idx])


In [None]:
# Define the function that will compute nmf
nfm_mult_upd <- function(R, K, missing_idx, maxit=800, eps=2.2204e-16) {
    # Weighted multiplicative update rule (Zhu 2016): the 0/1 mask M removes the
    # missing entries from both the numerator and the denominator of each update
    # init random W and H
    print(paste("[INFO] : NMF with k=", K))
    R <- as.matrix(R)
    I <- dim(R)[1]
    J <- dim(R)[2]
    M <- matrix(1, nrow = I, ncol = J)
    M[missing_idx] <- 0
    X <- R # Store original R
    R <- R*M
    W <- matrix(runif(I*K), nrow = I, ncol = K)
    H <- matrix(runif(K*J), nrow = K, ncol = J)
    
    n <- 0
    d1 <- 1000
    d2 <- 1000
    while(n < maxit && !(d1 < eps && d2 < eps)) {
        if (n %% 100 == 0) {
            print(paste("[INFO] : iter", n, " Relative error is :", distance2(X, W%*%H)/distance2(X, R*0)))
        }
        # R is already masked; eps avoids divisions by zero
        newH <- H * (t(W) %*% R) / (t(W) %*% (M * (W %*% H)) + eps)
        newW <- W * (R %*% t(newH)) / ((M * (W %*% newH)) %*% t(newH) + eps)
        
        # Stop when neither factor changes any more
        d1 <- distance2(newH, H)
        d2 <- distance2(newW, W)
        
        H <- newH
        W <- newW
        n <- n+1
    }
    
    Res <- W%*%H
    #Res[missing_idx] <- X[missing_idx]
    # Relative error measured on the observed entries only
    nmf <- list("res"=Res, "dst"=distance2(R, M*Res)/distance2(R, 0))
    return(nmf)
}


In [None]:
# 1: initialize by defining N and replace NAs by mean
N <- 4

replace_na_with_mean_value<-function(vec) {
    mean_vec <- mean(as.numeric(vec), na.rm=TRUE)
    vec[is.na(vec)]<-mean_vec
    vec
}

X<-data.frame(apply(features_set[, numeric_variables_idx], MARGIN=2, replace_na_with_mean_value))
miss_idx = which(is.na(features_set[,numeric_variables_idx]), arr.ind = T)
non_miss_idx <- which(!is.na(features_set[,numeric_variables_idx]), arr.ind = T)



In [None]:
# 2: nNMF for k=k1 to K=kN with masking missing values
library(Matrix) # For rankMatrix 
rX <- rankMatrix(X)
k1 = floor(max(abs(rX - N/2), 1))
kN = min(abs(rX + N/2), dim(X)[1], dim(X)[2])

X_hat = array(0, dim = c(kN-k1+1, nrow(X), ncol(X)))
dim(X_hat)
for (K in k1:kN) {
    # compute NMF
    nmf <- nfm_mult_upd(X, K, missing_idx = miss_idx, maxit=800)
    print(paste("Computed nmf, final dist is", nmf$dst))
    X_hat[K-k1+1,,] <- nmf$res
}


In [None]:
# 3: weighted reconstruction
d = array(0, dim = c(kN-k1+1))
for (K in 1:(kN-k1+1)) {
    # Reconstruction error based on non missing values
    d[K] <- sum(abs(X_hat[K,,][non_miss_idx] - X[non_miss_idx]))/nrow(non_miss_idx)
}

X_hat_f <- matrix(0, nrow = nrow(X), ncol = ncol(X))
denum <- 0
for (K in 1:(kN-k1+1)) {
    # Reconstruction matrix
    X_hat_f <- X_hat_f + exp(-d[K])*X_hat[K,,]
    denum <- denum + exp(-d[K])
}
X_hat_f <- X_hat_f / sum(exp(-d))

In [None]:
# Replace in the feature set
X[miss_idx] <- X_hat_f[miss_idx]
features_set[, numeric_variables_idx] <- X

## Factor variables

Now that the numerical variables are handled, we can encode the categorical ones, which we have not processed yet.

Going back to the columns with many missing values, we use one-hot encoding and create a "missing" category, on the assumption that the absence of this piece of information could itself be meaningful.

In [None]:
useful_nas <- which(sapply(test_set_features, function(x) sum(is.na(x))) > 10000)

library(fastDummies)
features_set <- dummy_cols(features_set, 
                           select_columns = names(useful_nas),
                           remove_selected_columns = T)

## Change NAs to 0 for the new one hot encoded columns
replace_na_with_0<-function(vec) {
    vec[is.na(vec)]<-0
    vec
}

useful_nas_feat_idx <- c(grep("health_insurance_*", colnames(features_set)))
useful_nas_feat_idx <- c(useful_nas_feat_idx, grep("employment_industry*", colnames(features_set)))
useful_nas_feat_idx <- c(useful_nas_feat_idx, grep("employment_occup*", colnames(features_set)))

features_set[,useful_nas_feat_idx] <- replace_na_with_0(features_set[,useful_nas_feat_idx])
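For comparison, the "missing as its own category" idea can be sketched in base R without `fastDummies`. The column `occupation` and its values are hypothetical; the snippet only illustrates the encoding, not the exact columns produced above.

```r
# Hypothetical column with missing values
occupation <- factor(c("a", NA, "b", "a", NA))

# addNA() turns NA into a regular level, which we then rename
occupation <- addNA(occupation)
levels(occupation)[is.na(levels(occupation))] <- "missing"

# One indicator column per level (no intercept, so no level is dropped)
one_hot <- model.matrix(~ occupation - 1)
colnames(one_hot) <- sub("^occupation", "occupation_", colnames(one_hot))
print(one_hot)
```

Each row has exactly one indicator set to 1, so rows with a missing value are kept and flagged instead of being dropped.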

The remaining factor columns are one-hot encoded in the same way.

In [None]:
factor_variables<-which(sapply(features_set[1,],class)=="factor")
factor_variables
# which() returns a named vector, so the column names come from names(), not colnames()
factor_col_names <- names(factor_variables)

features_set_int_encoded <- dummy_cols(features_set,
                                       select_columns = factor_col_names,
                                       ignore_na = F,
                                       remove_selected_columns = T)


features_set_int_encoded <- replace_na_with_0(features_set_int_encoded)

tr_set_enc <- features_set_int_encoded[tr_indexes,]
ts_set_enc <- features_set_int_encoded[ts_indexes,]

summary(features_set_int_encoded)

## Feature engineering


## Feature selection


PCA was used to reduce the number of features. Rather than picking a subset of the existing features, it builds a smaller set of new features (the principal components) that capture most of the variance of the original ones.

In [None]:
training_features <- tr_set_enc[, -c(1)] # remove respondent_id (first column)
summary(training_features)
training_labels <- training_set_labels
label_1 <- training_labels[,"h1n1_vaccine"]
label_2 <- training_labels[, "seasonal_vaccine"]
labels <- cbind(h1n1_vaccine=label_1, seasonal_vaccine=label_2)

testing_features <- ts_set_enc[, -c(1)] # remove respondent_id (first column)
# respondent_id is already removed above, so no further column is dropped here
feature_set <- rbind(training_features, testing_features)
pca <- prcomp(feature_set)
summary(pca)

pca_feat_set <- pca$x[,1:54]

library("caret")

ss <- preProcess(as.data.frame(pca_feat_set), method=c("range"))
pca_feat_set <- predict(ss, as.data.frame(pca_feat_set))

# Split back using the row counts instead of hard-coded indexes
n_tr <- nrow(training_features)
pca_tr_set <- pca_feat_set[1:n_tr,]
pca_ts_set <- pca_feat_set[(n_tr+1):nrow(pca_feat_set),]


The choice of 54 components is based on the PCA summary: the standard deviations and proportions of variance indicate that 54 components are sufficient to train a model. The cumulative proportion shows that the first 54 components explain about 95% of the total variance, which seems sufficient.
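The number of components can also be read programmatically instead of from the printed summary. The sketch below uses hypothetical random data and a 95% target; it recomputes the cumulative proportion of variance from `sdev` and picks the smallest number of components that reaches the target.

```r
set.seed(42)
# Hypothetical data: 200 rows, 10 correlated features driven by 3 latent factors
Z <- matrix(rnorm(200 * 3), 200, 3)
X_toy <- Z %*% matrix(rnorm(3 * 10), 3, 10) + 0.1 * matrix(rnorm(200 * 10), 200, 10)

pca_toy <- prcomp(X_toy)
# Cumulative proportion of variance, as reported by summary(pca_toy)
cum_var <- cumsum(pca_toy$sdev^2) / sum(pca_toy$sdev^2)

# Smallest number of components reaching the 95% target
n_comp <- which(cum_var >= 0.95)[1]
cat("components kept:", n_comp, "\n")
```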

# Model selection


One of the first things to define is how the models will be compared and evaluated. To match the way the DrivenData platform scores submissions, the ROC curve and the AUC are used to estimate the performance of a method. Both are computed with the following code.

In [None]:
AUROC <- function(Y_pred, Y, title_of_curve) {
    thresholds <- seq(0,0.99,0.05)
    FPR <- c()
    TPR <- c()

    for(threshold in thresholds){
      Y_hat <- ifelse(Y_pred > threshold,1,0) 
      # Fixing the levels keeps the table 2x2 even when only one class is predicted
      confusion_matrix <- table(factor(Y_hat, levels=c(0,1)), factor(Y, levels=c(0,1)))

      FP <- confusion_matrix[2,1]
      TP <- confusion_matrix[2,2]
      N_N <- sum(confusion_matrix[,1]) # Total number of 0's
      N_P <- sum(confusion_matrix[,2]) # Total number of 1's

      FPR <- c(FPR,FP/N_N)
      TPR <- c(TPR,TP/N_P)
    }
    # Add the end points (1,1) and (0,0) of the curve
    FPR <- c(1, FPR, 0)
    TPR <- c(1, TPR, 0)
    plot.new()
    plot(FPR,TPR)
    lines(FPR,TPR,col="blue")
    lines(thresholds,thresholds,lty=2) # diagonal of a random classifier
    title(title_of_curve)
    # Trapezoidal rule over the ROC points
    AUC <- sum(abs(diff(FPR)) * (head(TPR,-1)+tail(TPR,-1)))/2
    return (AUC)
}
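
As a cross-check on the trapezoidal estimate above, the AUC can also be computed from ranks, since it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The helper `auc_rank` below is a hypothetical name introduced for this sketch, and the toy scores are made up.

```r
# Rank-based AUC: probability that a random positive is scored above a random negative
auc_rank <- function(scores, y) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy check on hypothetical scores: 3 of the 4 positive/negative pairs are ordered correctly
y_toy <- c(0, 0, 1, 1)
s_toy <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(s_toy, y_toy))  # 0.75
```

Unlike the threshold grid, this estimate does not depend on the predictions lying in [0, 1], which is useful for a linear model whose outputs are not bounded.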

## Model 1
The first model is the linear model. It is one of the fastest models to train and it gives a good idea of the kind of relationship that exists between the features and the labels. The only difference from the model seen in the practicals is that two outputs are predicted at once, so the training set must contain both labels.
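
To illustrate what predicting two outputs at once means for `lm()`, the sketch below fits a two-column response on hypothetical toy data and compares it with one regression per target. The data frame `toy` and its columns are assumptions made for the example.

```r
set.seed(7)
# Hypothetical data: two features, two binary targets
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y1 <- as.numeric(toy$x1 + rnorm(50) > 0)
toy$y2 <- as.numeric(toy$x2 + rnorm(50) > 0)

# One call with a two-column response ...
fit_joint <- lm(cbind(y1, y2) ~ x1 + x2, data = toy)
# ... gives the same coefficients as one regression per target
fit_y1 <- lm(y1 ~ x1 + x2, data = toy)
fit_y2 <- lm(y2 ~ x1 + x2, data = toy)

print(coef(fit_joint))
```

The joint fit is therefore a convenience: each target gets its own least-squares coefficients, stored as one column of the coefficient matrix.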

In [None]:
total_train_set <- cbind(pca_tr_set, labels)

Before predicting on the given test set, it is important to check that the model fits well by running a cross-validation and computing the ROC curve and the AUC. We start with the cross-validation of the multivariate linear model.

In [None]:
# Separation of the train in two groups in order to calculate the error between prediction and reality.
vaccine_idx <- sample(1:nrow(total_train_set))
half_split <- floor(nrow(total_train_set)/2)
train_data_set <- total_train_set[vaccine_idx[1:half_split],]
test_data <- total_train_set[vaccine_idx[(half_split+1):nrow(total_train_set)],]
target_idx <- ncol(train_data_set)
targets <- c(target_idx, target_idx-1)

# Linear model & prediction
model <- lm(cbind(h1n1_vaccine, seasonal_vaccine)~., data=train_data_set)
Y_pred <- predict(model,test_data[,-targets])

In [None]:
k = 10
accuracy_vec_h1n1 <- array(0,k)
accuracy_vec_seasonal <- array(0,k)
threshold <- 0.5

# 1. Shuffle the dataset randomly.
vaccine_idx <- sample(1:nrow(train_data_set))

# 2. Split the dataset into k groups
fold_size <- ceiling(nrow(train_data_set)/k) # named fold_size to avoid masking base::max
splits <- split(vaccine_idx, ceiling(seq_along(vaccine_idx)/fold_size))

# 3. For each unique group:
for (i in 1:k){
  #3.1 Take the group as a hold out or test data set
  test_data <- train_data_set[splits[[i]],]
  
  #3.2 Take the remaining groups as a training data set
  train_data <- train_data_set[-splits[[i]],]
  print(paste("[INFO] - Training set size:",dim(train_data)[1],"- Testing set size",dim(test_data)[1]))
  
  #3.3 Fit a model on the training set and evaluate it on the test set
  model <- lm(cbind(h1n1_vaccine, seasonal_vaccine) ~ ., data=train_data)
  Y_pred <- predict(model,test_data[,-targets])
  Y_h1n1 <- test_data[,targets[2]]
  Y_seasonal <- test_data[,targets[1]]
  
  #3.4 Threshold the predictions of the linear model to get class labels
  Y_hat <- ifelse(Y_pred > threshold,1,0) 
  # Need one confusion matrix for h1n1 and one for seasonal
  confusion_matrix_h1n1 <- table(Y_hat[,1],Y_h1n1)
  confusion_matrix_seasonal <- table(Y_hat[,2],Y_seasonal)
  
  #3.5 Retain the evaluation score and discard the model
  accuracy_vec_h1n1[i] = (confusion_matrix_h1n1[1,1]+confusion_matrix_h1n1[2,2])/sum(confusion_matrix_h1n1)
  misclassification_rate = 1 - accuracy_vec_h1n1[i]
  print(paste("[INFO] - Misclassification rate h1n1 -",i,"fold:",misclassification_rate))
  
  accuracy_vec_seasonal[i] = (confusion_matrix_seasonal[1,1]+confusion_matrix_seasonal[2,2])/sum(confusion_matrix_seasonal)
  misclassification_rate = 1 - accuracy_vec_seasonal[i]
  print(paste("[INFO] - Misclassification rate seasonal -",i,"fold:",misclassification_rate))
  
}

#4. Summarize the skill of the model using the sample of model evaluation scores
print(paste("[INFO] - Mean misclassification rate h1n1:",1-mean(accuracy_vec_h1n1)))
print(paste("[INFO] - Mean misclassification rate seasonal:",1-mean(accuracy_vec_seasonal)))

This gives us the misclassification rate for each target. It is not perfect, but it is good enough to keep this model as a first attempt.
Next, we compute the ROC curve and the AUC for each target.

In [None]:
# ROC for seasonal vaccine

AUC_seasonal = AUROC(Y_pred[,2], test_data[,targets[1]],"ROC Curve for seasonal vaccine")

# ROC for h1n1 vaccine

AUC_h1n1 =  AUROC(Y_pred[,1], test_data[,targets[2]],"ROC Curve for h1n1 vaccine")

Now that the linear model is validated, it can be used to predict the outcome with the test set.

In [None]:
model <- lm(cbind(h1n1_vaccine, seasonal_vaccine)~., data=total_train_set)
summary(model)
Y_pred <- predict(model,pca_ts_set)

The summary of the model gives a better understanding of how it behaves with 26707 rows and 54 features. The output is split in two parts, one per target. The p-values are a useful indicator of the relevance of the selected features: for the h1n1 vaccine, 32 features have a p-value indicating that they are statistically highly significant, against 37 for the seasonal vaccine. The AUC of the seasonal vaccine is indeed higher. Overall, the summary confirms that the linear model is not the best possible choice but works quite well for this problem.

## Model 2
Then we use the nnet package to make another model.

## Model 3

The third model is the SVM (Support Vector Machine), from the e1071 package. This method usually gives good results even on data sets with many dimensions, which is convenient with the 54 features produced by the PCA. The kernel and its parameters are also easy to change, which lets us test multiple combinations in order to find the one with the best results. Because the kernel can be non-linear (polynomial, sigmoid or radial), the SVM can also handle data that are difficult to separate linearly.
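
To make the role of the kernel parameters concrete, the four kernels named above can be written as plain R functions, following the formulas given in the e1071 `svm()` documentation. The function names and the default values `gamma = 1` and `coef0 = 0` are assumptions made for this sketch (e1071 itself defaults `gamma` to one over the data dimension).

```r
# Kernel formulas as documented for e1071::svm (u, v are feature vectors)
k_linear  <- function(u, v) sum(u * v)
k_poly    <- function(u, v, gamma = 1, coef0 = 0, degree = 3) (gamma * sum(u * v) + coef0)^degree
k_radial  <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
k_sigmoid <- function(u, v, gamma = 1, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

# Two hypothetical points
u <- c(1, 2)
v <- c(0.5, -1)
print(c(linear = k_linear(u, v), poly = k_poly(u, v),
        radial = k_radial(u, v), sigmoid = k_sigmoid(u, v)))
```

This shows where `gamma`, `coef0` and `degree` enter each kernel, which is what the parameter combinations tested later actually change.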

In [None]:
library("e1071")
#prediction for H1N1 column
data_svm1 = svm(label_1 ~ ., data = pca_tr_set, kernel = "polynomial",degree=3, cost = 1, scale = FALSE)
Y_pred1 <- predict(data_svm1,pca_ts_set)
ss <- preProcess(as.data.frame(Y_pred1), method=c("range"))
Y_pred1 <- predict(ss, as.data.frame(Y_pred1))

#prediction for Seasonal Flu column
data_svm2 = svm(label_2 ~ ., data = pca_tr_set, kernel = "polynomial",degree=3, cost = 1, scale = FALSE)
Y_pred2 <- predict(data_svm2,pca_ts_set)
ss <- preProcess(as.data.frame(Y_pred2), method=c("range"))
Y_pred2 <- predict(ss, as.data.frame(Y_pred2))

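# testing_features no longer holds the id column, so it is taken from ts_set_enc
final_pred_SVM <- cbind(respondent_id = ts_set_enc[, c(1)], Y_pred1, Y_pred2)
colnames(final_pred_SVM) <- c("respondent_id", "h1n1_vaccine", "seasonal_vaccine")
write.csv(final_pred_SVM, "test_submission_SVM.csv", row.names = F)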

The `svm()` function accepts many arguments, such as the kernel type and its parameters. The arguments used in the code were selected after testing multiple combinations and computing their ROC curve and AUC. The results are shown in the following table. When a parameter is not specified, the default value was used. Some AUC values for the seasonal flu vaccine are missing: given the poor H1N1 result on those rows, computing them was not deemed necessary.

| cost     | kernel                | gamma  | coef0 | AUC H1N1 | AUC seasonal |
|----------|-----------------------|--------|-------|----------|--------------|
| 0.00001  | linear                |        |       | 0.804    | 0.738        |
| 0.00001  | polynomial (degree 2) |        |       | 0.804    | 0.739        |
| 0.00001  | polynomial (degree 3) |        |       | 0.840    | 0.732        |
| 0.00001  | polynomial (degree 3) | 0.0001 | 5     | 0.812    | 0.732        |
| 0.000001 | polynomial (degree 3) |        |       | 0.813    |              |
| 0.00001  | polynomial (degree 4) |        |       | 0.811    | 0.725        |
| 0.1      | polynomial (degree 3) |        |       | 0.849    | 0.729        |
| 0.01     | polynomial (degree 3) |        |       | 0.844    |              |
| 10       | polynomial (degree 3) |        |       | 0.840    | 0.732        |
| 1        | polynomial (degree 3) |        |       | 0.849    | 0.739        |
| 0.01     | radial                |        |       | 0.849    | 0.735        |
| 0.1      | radial                |        |       | 0.849    | 0.741        |
| 1        | radial                |        |       | 0.839    | 0.733        |
| 10       | radial                |        |       | 0.842    | 0.733        |
| 100      | radial                |        |       | 0.849    | 0.730        |
| 0.01     | sigmoid               |        |       | 0.848    | 0.736        |
| 0.1      | sigmoid               |        |       | 0.849    | 0.740        |
| 1        | sigmoid               |        |       | 0.839    | 0.734        |
| 10       | sigmoid               |        |       | 0.838    | 0.731        |
# Alternative models





## MRMR

The MRMR feature selection method is used as an alternative to PCA, in order to implement a method with a library that is not in the guidelines' list. MRMR stands for Minimum Redundancy Maximal Relevancy filter. It keeps fewer features than the PCA, with predictions that should normally be just as good, or even better.
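
To show what the filter optimizes, here is a minimal base R sketch of the usual greedy MRMR criterion: at each step it picks the feature with the highest mutual information with the target, minus its mean mutual information with the features already selected. The helpers `mutual_info` and `mrmr_sketch` and the toy data are hypothetical; this is not the `praznik` implementation and it only handles discrete variables.

```r
# Mutual information (in nats) between two discrete vectors
mutual_info <- function(a, b) {
  p  <- table(a, b) / length(a)
  pa <- rowSums(p)
  pb <- colSums(p)
  nz <- p > 0
  sum(p[nz] * log(p[nz] / outer(pa, pb)[nz]))
}

# Greedy MRMR: relevance to y minus mean redundancy with already selected features
mrmr_sketch <- function(X, y, k) {
  relevance <- sapply(X, mutual_info, b = y)
  selected <- which.max(relevance)
  while (length(selected) < k) {
    candidates <- setdiff(seq_along(X), selected)
    score <- sapply(candidates, function(j) {
      redundancy <- mean(sapply(selected, function(s) mutual_info(X[[j]], X[[s]])))
      relevance[j] - redundancy
    })
    selected <- c(selected, candidates[which.max(score)])
  }
  names(X)[selected]
}

# Hypothetical toy data: f2 duplicates f1, f3 carries independent signal
set.seed(3)
f1 <- sample(0:1, 300, replace = TRUE)
f3 <- sample(0:1, 300, replace = TRUE)
y  <- f1 + f3
X_toy <- data.frame(f1 = f1, f2 = f1, f3 = f3)
print(mrmr_sketch(X_toy, y, 2))
```

On this toy data the duplicate `f2` is as relevant as `f1` but fully redundant with it, so the two selected features are `f1` and `f3`.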

In [None]:
#install.packages("praznik")
library(praznik)

test_set <- read.csv("ts_set_imputed.csv")[, -c(1)]


X <- read.csv("tr_set_imputed.csv")[, -c(1)]
Y <- read.csv("training_set_labels.csv")
summary(Y)

results <- MRMR(X[, -c(1)], Y[, c(2)], 40, positive = T) # arbitrary number of cols to keep for h1n1
results$selection
results$score

results <- MRMR(X[, -c(1)], Y[, c(3)], 40, positive = T) # arbitrary number of cols to keep for seasonal
results$selection
results$score

features_to_keep = c(3,6,7,10,11,12,13,14,15,16,17,19,20,22,28,29,30,32,36,38,41,45,46,47,55,56,66,67,70,72,77,80,87,91,94)

mrmr_tr_set <- X[, features_to_keep]
mrmr_ts_set <- test_set[, features_to_keep]

The selection is performed for one output and then for the other. We then had to choose which of the resulting features to keep. 19 features were common to both labels, so they were kept. For the h1n1_vaccine label, 8 additional features were selected, but only 7 had a score high enough to be considered interesting. For seasonal_vaccine, 13 additional features were selected, of which 9 are interesting. In total, 35 features are kept for both outputs, which is fewer than with the PCA.

### Linear model with MRMR

The steps are the same as with PCA; only the results may change.

In [None]:
total_train_set <- cbind(mrmr_tr_set, labels)

In [None]:
# Separation of the train in two groups in order to calculate the error between prediction and reality.
vaccine_idx <- sample(1:nrow(total_train_set))
half_split <- floor(nrow(total_train_set)/2)
train_data_set <- total_train_set[vaccine_idx[1:half_split],]
test_data <- total_train_set[vaccine_idx[(half_split+1):nrow(total_train_set)],]
target_idx <- ncol(train_data_set)
targets <- c(target_idx, target_idx-1)

# Linear model & prediction
model <- lm(cbind(h1n1_vaccine, seasonal_vaccine)~., data=train_data_set)
Y_pred <- predict(model,test_data[,-targets])

In [None]:
k = 10
accuracy_vec_h1n1 <- array(0,k)
accuracy_vec_seasonal <- array(0,k)
threshold <- 0.5

# 1. Shuffle the dataset randomly.
vaccine_idx <- sample(1:nrow(train_data_set))

# 2. Split the dataset into k groups
fold_size <- ceiling(nrow(train_data_set)/k) # named fold_size to avoid masking base::max
splits <- split(vaccine_idx, ceiling(seq_along(vaccine_idx)/fold_size))

# 3. For each unique group:
for (i in 1:k){
  #3.1 Take the group as a hold out or test data set
  test_data <- train_data_set[splits[[i]],]
  
  #3.2 Take the remaining groups as a training data set
  train_data <- train_data_set[-splits[[i]],]
  print(paste("[INFO] - Training set size:",dim(train_data)[1],"- Testing set size",dim(test_data)[1]))
  
  #3.3 Fit a model on the training set and evaluate it on the test set
  model <- lm(cbind(h1n1_vaccine, seasonal_vaccine) ~ ., data=train_data)
  Y_pred <- predict(model,test_data[,-targets])
  Y_h1n1 <- test_data[,targets[2]]
  Y_seasonal <- test_data[,targets[1]]
  
  #3.4 Threshold the predictions of the linear model to get class labels
  Y_hat <- ifelse(Y_pred > threshold,1,0) 
  # Need one confusion matrix for h1n1 and one for seasonal
  confusion_matrix_h1n1 <- table(Y_hat[,1],Y_h1n1)
  confusion_matrix_seasonal <- table(Y_hat[,2],Y_seasonal)
  
  #3.5 Retain the evaluation score and discard the model
  accuracy_vec_h1n1[i] = (confusion_matrix_h1n1[1,1]+confusion_matrix_h1n1[2,2])/sum(confusion_matrix_h1n1)
  misclassification_rate = 1 - accuracy_vec_h1n1[i]
  print(paste("[INFO] - Misclassification rate h1n1 -",i,"fold:",misclassification_rate))
  
  accuracy_vec_seasonal[i] = (confusion_matrix_seasonal[1,1]+confusion_matrix_seasonal[2,2])/sum(confusion_matrix_seasonal)
  misclassification_rate = 1 - accuracy_vec_seasonal[i]
  print(paste("[INFO] - Misclassification rate seasonal -",i,"fold:",misclassification_rate))
  
}

#4. Summarize the skill of the model using the sample of model evaluation scores
print(paste("[INFO] - Mean misclassification rate h1n1:",1-mean(accuracy_vec_h1n1)))
print(paste("[INFO] - Mean misclassification rate seasonal:",1-mean(accuracy_vec_seasonal)))

In [None]:
# ROC for seasonal vaccine

AUC_seasonal = AUROC(Y_pred[,2], test_data[,targets[1]],"ROC Curve for seasonal vaccine")

# ROC for h1n1 vaccine

AUC_h1n1 =  AUROC(Y_pred[,1], test_data[,targets[2]],"ROC Curve for h1n1 vaccine")

The AUC is better for seasonal_vaccine but worse for h1n1_vaccine. The linear model is therefore better suited to the PCA features when predicting h1n1_vaccine.

In [None]:
model <- lm(cbind(h1n1_vaccine, seasonal_vaccine)~., data=total_train_set)
summary(model)
Y_pred <- predict(model,mrmr_ts_set) # the model was trained on the MRMR features

# Conclusions