# Making predictions on the test dataset

This notebook uses the trained, hyperparameter-tuned models to predict which passengers in the test set survived. The first step is to load the libraries and the functions we wrote to pre-process the test data and generate model predictions.

In [27]:
import pandas as pd

#pre process
from modelling.pre_process import pre_processing

# modelling
from modelling.fit import *

import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Pre processing test dataset

As the counts below show, some variables in the dataset contain NaN values. This is a problem because most scikit-learn estimators cannot handle missing values. Let's fix it!

In [40]:
# Load test dataset
to_predict = pd.read_csv('test.csv')

# Keep the passenger Ids as the index
to_predict.index = to_predict['PassengerId']

# Check how many missing values each variable has
for c in to_predict.columns:
    print(f'{c}: {to_predict[c].isna().sum()}')

PassengerId: 0
Pclass: 0
Name: 0
Sex: 0
Age: 86
SibSp: 0
Parch: 0
Ticket: 0
Fare: 1
Cabin: 327
Embarked: 0
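
The same per-column count can also be obtained without an explicit loop. The sketch below uses a small made-up frame (the values are not from `test.csv`) and keeps only the columns that actually have missing values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the test set (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Count missing values per column in a single call
missing = toy.isna().sum()

# Keep only the columns that actually need attention
missing_only = missing[missing > 0]
print(missing_only)
```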


Age, Fare and Cabin have missing values. First, let's fill the single missing value in Fare with the column mean.

In [41]:
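# Assign the result back: calling fillna(..., inplace=True) on a column selection is deprecated in recent pandas
to_predict['Fare'] = to_predict['Fare'].fillna(to_predict['Fare'].mean())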
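
As a quick illustration of what mean imputation does, here is a minimal sketch on made-up fares (the frame `fares` is an assumption for illustration, not part of the notebook's data). Note that `Series.mean()` skips NaN by default, so the mean is computed over the observed values only:

```python
import numpy as np
import pandas as pd

fares = pd.DataFrame({"Fare": [10.0, 20.0, np.nan, 30.0]})  # made-up values

# Series.mean() skips NaN by default, so only observed fares contribute
fare_mean = fares["Fare"].mean()

# Fill the gap and assign the result back to the column
fares["Fare"] = fares["Fare"].fillna(fare_mean)
print(fares)
```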

OK! Let's check again:

In [42]:
for c in to_predict.columns:
    print(f'{c}: {to_predict[c].isna().sum()}')

PassengerId: 0
Pclass: 0
Name: 0
Sex: 0
Age: 86
SibSp: 0
Parch: 0
Ticket: 0
Fare: 0
Cabin: 327
Embarked: 0


We don't have to worry about the missing values in Cabin, because the entire column will be dropped along with PassengerId, Name, Ticket and Embarked.

That leaves the missing values in Age. This will not be a problem, since we have built a method to deal with them: it fills the NaN values with predictions from an OLS regression!
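
The implementation of that method lives in `modelling.pre_process` and is not shown here. As a rough sketch of the general idea only, the hypothetical helper `fill_nan_with_ols` below regresses the target column on the remaining columns using the rows where the target is observed, then predicts it for the rows where it is missing. It assumes all other columns are numeric and complete, which may differ from what the project's method does:

```python
import numpy as np
import pandas as pd

def fill_nan_with_ols(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Hypothetical helper: impute NaNs in `target` by OLS on the other columns."""
    out = df.copy()
    known = out[target].notna().to_numpy()
    features = [c for c in out.columns if c != target]

    # Design matrix with an intercept column (features assumed numeric, no NaN)
    X = np.column_stack([np.ones(len(out)), out[features].to_numpy(dtype=float)])
    y = out.loc[known, target].to_numpy(dtype=float)

    # Least-squares fit on the rows where the target is observed
    beta, *_ = np.linalg.lstsq(X[known], y, rcond=None)

    # Predict the target for the rows where it is missing
    out.loc[~known, target] = X[~known] @ beta
    return out

# Made-up data in which Age = 10 + 2 * Fare for the observed rows
toy = pd.DataFrame({
    "Fare": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Age": [12.0, 14.0, np.nan, 18.0, 20.0],
})
filled = fill_nan_with_ols(toy, "Age")
print(filled)
```

The helper works on a copy, so the input frame keeps its NaN values.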

Now we are ready to run all the pre-processing steps. Let's do it!

In [43]:
df = pre_processing(to_predict) # create the pre-processing object
df.select_columns(drop=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked']) # drop unnecessary columns
df.Create_I_Matrix() # create the design matrix (one-hot encoding)
df.Standardize() # standardize numerical columns
# df.FactorCategorical()
# df.RemoveOutliers() # remove outliers
df.fill_nan_ols('Age') # fill NaN values in the Age column with a regression
df.df.drop('Parch_0', axis=1, inplace=True) # drop an extra column of the Parch category

Detecting and removing outliers...
Total rows deleted containing outliers: 99
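
For readers without access to the `pre_processing` class, the two central transformations (one-hot encoding and standardization) can be sketched with plain pandas. This is an illustration on made-up rows and is not the class's actual implementation; which columns it treats as categorical is an assumption here:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
    "Fare": [10.0, 20.0, 30.0],
})

# One-hot encode the categorical columns (one 0/1 column per category)
encoded = pd.get_dummies(toy, columns=["Sex", "Pclass"], dtype=int)

# Standardize a numerical column to zero mean and unit variance
fare = encoded["Fare"]
encoded["Fare"] = (fare - fare.mean()) / fare.std(ddof=0)
print(encoded)
```

In practice the mean and standard deviation used for a test set usually come from the training data, so that both sets share the same scale.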


## Predictions

At this point our test dataset is ready to be used to predict the survivors. To do that, let's use the models that we have already tuned and trained on the train dataset: Logistic Regression, Naive Bayes, Random Forest, Multilayer Perceptron, XGBoost and Support Vector Machine.

In [46]:
# Choosing trained models to do predictions
models = ['Logistic', 
          'Naive_Bayes', 
          'Random_Forest', 
          'MLP',
          'XGBoost',
          'SVC']

# Create the prediction object
prediction = predict_from_load_model(X_test=df.df, 
                                     choosen_models=models)

# Making predictions!
y_hats = prediction.do_it()
y_hats.head()

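|   | Naive_Bayes | SVC | XGBoost | Random_Forest | MLP | Logistic | PassengerId |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 892 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 893 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 894 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 895 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 896 |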
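
The internals of `predict_from_load_model` are not shown in this notebook. A minimal sketch of the general pattern, assuming each fitted model exposes a `predict` method, is to collect one prediction column per model and attach the passenger Ids. The names `ConstantModel` and `predict_with_models` are hypothetical and exist only for this illustration:

```python
import pandas as pd

class ConstantModel:
    """Stand-in for a fitted classifier; it only needs a predict method."""
    def __init__(self, label: int):
        self.label = label

    def predict(self, X: pd.DataFrame) -> list:
        return [self.label] * len(X)

def predict_with_models(models: dict, X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one prediction column per model, plus the Ids."""
    y_hats = pd.DataFrame({name: m.predict(X) for name, m in models.items()})
    # The Ids are assumed to be stored in the index of X
    y_hats["PassengerId"] = X.index.to_numpy()
    return y_hats

X_toy = pd.DataFrame(
    {"Fare": [0.1, -0.3]},
    index=pd.Index([892, 893], name="PassengerId"),
)
y_toy = predict_with_models({"A": ConstantModel(0), "B": ConstantModel(1)}, X_toy)
print(y_toy)
```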


As we saw in the model comparison notebook, the best model according to the $F_\beta$ score was the Random Forest classifier. We will therefore submit this model's predictions to Kaggle!
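
As a reminder of what the score measures, $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, where $P$ is precision and $R$ is recall. The sketch below computes it from made-up precision and recall values; the value of $\beta$ used in the comparison notebook is not stated here, so the ones below are only examples:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall; beta > 1 weights recall more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up precision and recall, evaluated for two example betas
print(f_beta(0.8, 0.6, beta=1.0))
print(f_beta(0.8, 0.6, beta=2.0))
```

With precision higher than recall, raising $\beta$ lowers the score, because recall gets more weight.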

In [51]:
# Select the Ids and the Random Forest predictions, renaming the latter to 'Survived'
predictions = y_hats[['PassengerId', 'Random_Forest']].rename(columns={'Random_Forest': 'Survived'})
predictions.to_csv('Survived_predictons.csv', index=False)
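
Before uploading, it can be worth checking that the file has the expected shape. The hypothetical helper `check_submission` below is a sketch, assuming the submission needs exactly the columns `PassengerId` and `Survived`, one row per test passenger and binary predictions:

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids) -> None:
    """Hypothetical sanity check for a submission frame."""
    assert list(sub.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert sub["PassengerId"].is_unique, "duplicate passenger Ids"
    assert set(sub["PassengerId"]) == set(expected_ids), "Ids do not match the test set"
    assert sub["Survived"].isin([0, 1]).all(), "predictions must be 0 or 1"

# Made-up submission with two passengers
toy_sub = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
check_submission(toy_sub, expected_ids=[892, 893])
```

In this notebook the call would look like `check_submission(predictions, to_predict['PassengerId'])`, which would flag any rows lost during pre-processing.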