<a href="https://colab.research.google.com/github/giuliaries/MachineLearning/blob/main/Titanic_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Giulia Santoiemma](mailto:giulia.santoiemma@studenti.unipd.it) 2004775<br/>
Machine Learning<br/> 
Master's Degree in Computer Science<br/>
19 November 2021

In [None]:
# Import libraries
from google.colab import files
from matplotlib import pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC
from tabulate import tabulate

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

## Dataset

This is the [Titanic dataset by Kaggle](https://www.kaggle.com/c/titanic).

I have used machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

In [None]:
# Import the train.csv file from Titanic dataset by Kaggle
uploaded = files.upload()

Saving train.csv to train (1).csv


In [None]:
# Read the CSV and show the first and last rows of the dataset
# titanic = pd.read_csv(io.BytesIO(uploaded['train.csv']))
titanic = pd.read_csv("./train.csv")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
print("Examples:", titanic.shape[0])
print("Features:", titanic.shape[1])
print("\nNon-missing examples per feature:")
print(titanic.count())

Examples: 891
Features: 12

Non-missing examples per feature:
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64


<table>
  <thead>
    <tr>
      <th><b>Variable</b></th>
      <th><b>Definition</b></th>
      <th><b>Type</b></th>
      <th><b>Key</b></th>
      <th><b># Missing Values</b></th>
      <th><b>Relevant</b></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PassengerId</td>
      <td>Identification for each passenger within the dataset</td>
      <td>Progressive integer</td>
      <td></td>
      <td>-</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Survived</td>
      <td>The passenger survived the shipwreck or not</td>
      <td>Binary number</td>
      <td>0 = No,<br/>1 = Yes</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>Pclass</td>
      <td>Class of the ticket purchased by the passenger</td>
      <td>Integer</td>
      <td>1 = 1st,<br/>2 = 2nd,<br/>3 = 3rd</td>
      <td>-</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Name</td>
      <td>Full name of the passenger</td>
      <td>Alphanumeric string</td>
      <td></td>
      <td>-</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Sex</td>
      <td>Passenger's sex</td>
      <td>Alphanumeric string</td>
      <td>male,<br/>female</td>
      <td>-</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Age</td>
      <td>Passenger's age in years</td>
      <td>Decimal number<br/>(decimal to indicate uncertain ages<br/>or ages less than one year of life)</td>
      <td></td>
      <td>177</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>SibSp</td>
      <td># of siblings / spouses aboard the Titanic</td>
      <td>Integer</td>
      <td></td>
      <td>-</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Parch</td>
      <td># of parents / children aboard the Titanic</td>
      <td>Integer</td>
      <td></td>
      <td>-</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Ticket</td>
      <td>Number of the ticket purchased by the passenger</td>
      <td>Alphanumeric string</td>
      <td></td>
      <td>-</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Fare</td>
      <td>Fare paid by the passenger for the purchase of the ticket</td>
      <td>Decimal</td>
      <td></td>
      <td>-</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Cabin</td>
      <td>Number of the cabin occupied by the passenger</td>
      <td>Alphanumeric string</td>
      <td></td>
      <td>687</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Embarked</td>
      <td>Port of Embarkation</td>
      <td>Alphanumeric string</td>
      <td>C = Cherbourg,<br/>Q = Queenstown,<br/>S = Southampton</td>
      <td>2</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

In the table above we can see the description of the Titanic dataset.
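
The "# Missing Values" column can be derived directly from the data. The snippet below is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from `train.csv`); applying the same `isna().sum()` call to the real `titanic` frame should reproduce the counts listed in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": pd.Series([np.nan, "C85", np.nan, np.nan], dtype=object),
    "Embarked": ["S", "C", "S", "S"],
})

# count() gives the non-missing entries per column,
# isna().sum() gives the missing ones
non_missing = toy.count()
missing = toy.isna().sum()

print(missing)
```

For every column the two counts add up to the number of rows, which is why the table can also be filled in by subtracting the `count()` output from 891.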

The dataset is made up of 891 distinct examples, each described by 12 features.

I have identified the features that I believe are relevant to a passenger's chance of survival.

The `Survived` feature is not rated for relevance because it is used as the target value.

I then built 5 different feature subsets by combining the relevant features, to see how the quality of the predictions changes with the features used.

## Missing Values

To train learning models on the dataset and evaluate their predictions, every example must have a value for each feature used.

Therefore I have filled in the features that contain missing values.
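
As a minimal sketch of the two imputation strategies used in this section, the toy example below fills a numeric column with its mean and a categorical column with its most frequent value. The frame and its values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": pd.Series(["S", np.nan, "S"], dtype=object),
})

# Numeric feature: replace the missing value with the mean of the observed ones
age_imputer = SimpleImputer(strategy="mean")
toy["Age"] = age_imputer.fit_transform(toy[["Age"]]).ravel()

# Categorical feature: replace the missing value with the most frequent category
port_imputer = SimpleImputer(strategy="most_frequent")
toy["Embarked"] = port_imputer.fit_transform(toy[["Embarked"]]).ravel()

print(toy)
```

`fit_transform` returns a two-dimensional array with one column, so `ravel()` flattens it before it is stored back in the frame.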

In [None]:
# Feature "Age"
# Missing values are replaced with the mean of the observed ages
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
titanic['Age'] = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(titanic[['Age']]).ravel()

# Feature "Embarked"
# Missing values are replaced with the most frequent port
titanic['Embarked'] = SimpleImputer(missing_values=np.nan, strategy='most_frequent').fit_transform(titanic[['Embarked']]).ravel()

## Encoding

scikit-learn models expect numeric input, so string features must be converted to numbers before training.

Therefore I have encoded the string features as numeric features.
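
The difference between the two encoders used in this section can be sketched on toy data. The arrays below are hypothetical. The sketch shows that one-hot encoding produces one column per category, while ordinal encoding produces a single column of integer codes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy columns, shaped (n_samples, 1) as the encoders expect
sex = np.array([["male"], ["female"], ["female"]], dtype=object)
port = np.array([["S"], ["C"], ["Q"]], dtype=object)

# One-hot: one binary column per category, categories sorted alphabetically,
# so the columns are [female, male]
one_hot = OneHotEncoder().fit_transform(sex).toarray()

# Ordinal: one integer code per category, also sorted alphabetically,
# so C -> 0, Q -> 1, S -> 2
ordinal = OrdinalEncoder().fit_transform(port)

print(one_hot)
print(ordinal.ravel())
```

Note that the ordinal codes impose an order (C < Q < S) on the ports that has no natural meaning. Distance-based models such as K-Nearest Neighbors will treat that order as real.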

In [None]:
# Encoding of the "Sex" feature with one-hot encoding.
# One-hot (aka 'one-of-K' or 'dummy') encoding creates one binary column per category
# and returns a sparse matrix by default, hence the call to toarray().
# "Sex" has only two categories, so drop='if_binary' keeps a single 0/1 column
# (1 = male) that fits back into the original column.
titanic['Sex'] = OneHotEncoder(categories='auto', drop='if_binary').fit_transform(titanic[['Sex']]).toarray().ravel()

# Encoding of the "Embarked" feature with ordinal encoding.
# Each category is mapped to an integer code (C = 0, Q = 1, S = 2).
titanic['Embarked'] = OrdinalEncoder(categories='auto').fit_transform(titanic[['Embarked']]).ravel()

## Study of correlations among variables

I have selected the target feature and defined the 5 feature subsets (datasets) used for training the prediction models.

I have then split each dataset into a training set and a test set, to train and evaluate the learning models.
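
Because every split in this section uses the same `test_size` and `random_state`, each feature subset is divided along the same rows, which keeps the results comparable. The sketch below illustrates this with two hypothetical arrays of 891 rows (the values are placeholders, not Titanic data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two hypothetical "feature subsets" over the same 891 rows:
# X_a holds the row index, X_b holds the row index times 10
X_a = np.arange(891).reshape(-1, 1)
X_b = np.arange(891).reshape(-1, 1) * 10
y = np.zeros(891)

# Same test_size and random_state for both splits
a_train, a_test, _, _ = train_test_split(X_a, y, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(X_b, y, test_size=0.3, random_state=1)

print(len(a_train), len(a_test))
# The same rows end up in both test sets
print(np.array_equal(b_test, a_test * 10))
```

With 891 examples and `test_size=0.3`, scikit-learn rounds the test set up to 268 examples and leaves 623 for training.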

In [None]:
# Target value to evaluate performance
target = titanic['Survived']

# Number of datasets used
dataset_number = 5

headers = {}
dataset = {}
i = 0

# Personal features
headers[i] = "Personal"
dataset[i] = titanic.filter(['Sex','Age'], axis=1)
i += 1

# Personal features and features about family
headers[i] = "Personal + Family"
dataset[i] = titanic.filter(['Sex','Age','SibSp','Parch'], axis=1)
i += 1

# Personal features and features about the ticket
headers[i] = "Personal + Ticket"
dataset[i] = titanic.filter(['Sex','Age','Pclass','Embarked','Fare'], axis=1)
i += 1

# Personal features and features about family and ticket
headers[i] = "Personal + Family + Ticket"
dataset[i] = titanic.filter(['Sex','Age','SibSp','Parch','Pclass','Embarked','Fare'], axis=1)
i += 1

# Features about the ticket
headers[i] = "Ticket"
dataset[i] = titanic.filter(['Pclass','Embarked','Fare'], axis=1)

# Split each dataset into random train and test subsets
X_train, X_test, y_train, y_test = {}, {}, {}, {}
for i in range(dataset_number):
  X_train[i], X_test[i], y_train[i], y_test[i] = train_test_split(dataset[i], target, test_size=0.3, random_state=1)

## Choice of the predictor and Model Selection

To evaluate the quality of the 5 feature subsets, I have trained 3 different learning algorithms on each training set:

* Support Vector Machine
* K Nearest Neighbors
* Neural Network

Then I verified how the quality of the prediction changes, depending on the model and preprocessing used.

To compare the performance I have used 3 metrics:

* accuracy
* precision
* recall

Each metric compares, over the examples of the test set, the class predicted by the model with the true target value.
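
As a worked example of the three metrics, the sketch below computes them by hand from the counts of true/false positives and negatives and checks the result against scikit-learn. The labels are hypothetical and are not taken from the Titanic data.

```python
import math
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of correct predictions
precision = tp / (tp + fp)         # share of predicted survivors that survived
recall = tp / (tp + fn)            # share of actual survivors that were found

# The hand-computed values agree with the scikit-learn functions
assert math.isclose(accuracy, accuracy_score(y_true, y_pred))
assert math.isclose(precision, precision_score(y_true, y_pred))
assert math.isclose(recall, recall_score(y_true, y_pred))

print(accuracy, precision, recall)
```

Here 2 of the 4 actual survivors are found (recall 0.5), and 2 of the 3 passengers predicted as survivors did survive (precision 2/3). A model can therefore have reasonable precision while still missing most of the positive examples.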

In [None]:
# Names of the classifiers
classifier = ["Support Vector Machine", "K-Nearest Neighbors", "Neural Network"]
# Pad the names to the same length so that all tables have the same width
for key, value in enumerate(classifier):
  classifier[key] = value.ljust(22)

# For each classifier
for k, model in enumerate([SVC(), KNeighborsClassifier(), MLPClassifier()]):

  report = [["Accuracy"], ["Precision"], ["Recall"]]

  # For each dataset
  for i in range(dataset_number):

    # Fit the current model according to the given training dataset
    model.fit(X_train[i], y_train[i])

    # Predict the classification for the provided data (the test set)
    y_pred = model.predict(X_test[i])

    # Accuracy classification score.
    # The accuracy is the fraction of test examples whose predicted label matches the true label in y_test.
    # The best value is 1 and the worst value is 0.
    report[0].append(accuracy_score(y_test[i], y_pred))

    # Compute the precision.
    # The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. 
    # The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
    # The best value is 1 and the worst value is 0.
    report[1].append(precision_score(y_test[i], y_pred))

    # Compute the recall.
    # The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. 
    # The recall is intuitively the ability of the classifier to find all the positive samples.
    # The best value is 1 and the worst value is 0.
    report[2].append(recall_score(y_test[i], y_pred))

  # Show the results for the current Classifier
  print(tabulate(report, headers=[classifier[k]] + list(headers.values()), tablefmt="rst"), "\n")

Support Vector Machine      Personal    Personal + Family    Personal + Ticket    Personal + Family + Ticket    Ticket
Accuracy                   0.567164             0.578358              0.641791                      0.652985  0.626866
Precision                  0.470588             0.571429              0.666667                      0.703704  0.682927
Recall                     0.0695652            0.0695652             0.330435                      0.330435  0.243478

K-Nearest Neighbors         Personal    Personal + Family    Personal + Ticket    Personal + Family + Ticket    Ticket
Accuracy                    0.731343             0.716418             0.716418                      0.705224  0.645522
Precision                   0.721649             0.729412             0.724138                      0.7       0.638889
Recall                      0.608696             0.53913              0.547826                      0.547826  0.4

Neural Network              Personal    Personal + 

## Conclusions

As the outputs show, the SVM improves as the number of relevant features increases, but its scores stay low: accuracy never exceeds about 0.65 and recall never exceeds about 0.33.

The other two models, on the other hand, have better and roughly constant results as the number of features increases.
The only dataset on which they perform worse is the one with only the ticket features.

So we can infer that the personal features and the features about the number of family members are important.
This shows that removing relevant features degrades the quality of the predictions.
