# Titanic - Machine Learning from Disaster (Skeleton)

In this laboratory, you will implement a full machine learning pipeline, from preprocessing to model evaluation. 
The dataset we use is a famous one: [the Titanic dataset](https://www.kaggle.com/competitions/titanic/data) (yes, the big boat that sank...).

The idea is simple: use ML to predict which passengers survived the Titanic shipwreck. The dataset is quite simple to understand but presents some real-world challenges (e.g., missing values). The explanation of the dataset is available at the link above.

## Goals
The goal of this lab is to guide you towards a higher level of autonomy when dealing with ML problems, in particular, classification problems (and in the later part, you will deal with a regression problem). 
This document provides just the skeleton of your program, reminding you of the main steps to be accomplished.
At the end of this lab, you will be able to:
- Work in a Jupyter notebook on an ML problem.
- Develop a full Machine Learning pipeline starting from a skeleton.
- Perform data exploration and data preparation.
- Train, tune and **properly** evaluate different ML models (decision tree, random forest, etc.)

## 1 Data exploration 

Check the instructions in the readme.md file of the assignment *W2_HD1 - Data exploration*

### 1.1 Load the data

In [1]:
# The library jupyter_black is used to format the code in the Jupyter Notebook in a format called "Black"
# By using it, you agree to cede control over minutiae of hand-formatting.
# You will save time and mental energy for more important matters.
# You can make Jupyter auto-format every cell upon execution simply by adding the following lines at the top of the notebook
import jupyter_black

jupyter_black.load()

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Load the data (this may change depending on the location of the data: download the data from https://www.kaggle.com/competitions/titanic/data)
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

### 1.2 Explore the data

##### To Do

1. **Explore the Dataset**:
   - Utilize functions and methods available in pandas (e.g., `.head()`, `.info()`, `.describe()`) or general Python techniques.
   - Each time you gain new insights from the data, add a Markdown section to explain your observations and **discuss** the results.
   - Refer to the questions in the `readme.md` file to deepen your analysis and answer them at the end of this notebook.

##### Suggestions

- **Start with the Training Set**:
  - Perform initial analysis on the training set. Repeat the same operations on the test set, but remember to treat the test set as unavailable during training.
  
- **Initial Data Examination**:
  - Inspect the first few rows of the data. Identify the features and the target variable. Determine which features might be relevant for predicting survival and which might not.
  
- **Dataset Size**:
  - Assess the size of the dataset. Is it large or small? (Hint: It might be too early to tell at this stage.)
  
- **Feature Types**:
  - Determine whether the features are numerical, categorical, or a mix of both. Identify their data types (e.g., float, int, string).
  
- **Dataset Balance**:
  - Check if the dataset is balanced. Why or why not?
  
- **Missing Values**:
  - Identify any missing values in the features. Are there missing values in the target variable `y`?
  
- **Feature Correlation**:
  - Examine the correlation between features. Identify which features are correlated with the target variable `y`. Discuss the implications of these correlations. Is it beneficial or detrimental? Display the correlation matrix.
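
The suggestions above can be sketched on a small stand-in table. This is only an illustrative sketch: `toy_df` and its values are made up for the example (they are not real Titanic rows), and in the lab you would run the same calls on `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (illustrative values, NOT real Titanic data)
toy_df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 0, 1],
        "Pclass": [3, 1, 3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Age": [22.0, 38.0, None, 35.0, None, 14.0],
        "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 30.07],
    }
)

# Feature types: separate numerical columns from the others
numerical_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Dataset balance: share of each class in the target
class_balance = toy_df["Survived"].value_counts(normalize=True)

# Missing values: number of NaN entries per column
missing_counts = toy_df.isna().sum()

# Correlation matrix, computed on the numerical columns only
corr_matrix = toy_df[numerical_cols].corr()

print(numerical_cols, categorical_cols)
print(class_balance)
print(missing_counts)
print(corr_matrix)
```

Restricting `.corr()` to the numerical columns avoids errors on string columns such as `Sex`; categorical features need to be encoded before they can appear in a correlation matrix.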

#### 1.2.1 Explore the training set

#### 1.2.2 Explore the test set

### Answers to the questions (Data Exploration)
- **Why is it important to know the type (numerical vs. categorical) of a feature?**

*ToDo*

- **Why is it important to understand correlations? Is it good to have features highly correlated with each other? Is it good to have features highly correlated with the target?**

*ToDo*
 
- **List possible problems related to an unbalanced dataset.**

*ToDo*

- **List possible solutions for dealing with an unbalanced dataset.**

*ToDo*

---

## 2 Data Preparation

Check the instructions in the readme.md file of the assignment *W2_HD2 - Data preparation*

### 2.1 Split the data into training and test set
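
One possible way to hold out part of the labelled data is scikit-learn's `train_test_split`. The sketch below is an assumption about how you might do it, not the required solution: `toy_df` is a made-up stand-in for `train_df`, and the 80/20 ratio and `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled data (illustrative values only)
toy_df = pd.DataFrame(
    {
        "Survived": [0, 1] * 10,
        "Fare": [float(i) for i in range(20)],
    }
)

# Hold out 20% of the rows; stratify keeps the class proportions
# of "Survived" similar in both parts.
train_part, holdout_part = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["Survived"],
    random_state=42,
)

print(len(train_part), len(holdout_part))
print(holdout_part["Survived"].value_counts())
```

Stratifying on the label is especially useful when the classes are not perfectly balanced, because a purely random split could otherwise leave one part with very few positive examples.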

### 2.2 Fill missing values (if any)

##### Suggestions:
- Take care not to lose the original loaded dataset: work on a copy, then drop the features you consider useless.
- Check for missing values in both the training set and the test set. If you decide to *impute* some values, apply to the test set the same policy you used on the training set.
- Take time to understand the difference between these scikit-learn methods:
    - .fit()
    - .transform()
    - .fit_transform()  
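
As a minimal sketch of the difference between these three methods, the example below uses `SimpleImputer` with a median strategy on made-up data (`toy_train` and `toy_test` are hypothetical stand-ins; the median strategy is just one possible imputation policy).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for one column of the training and test sets
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="median")

# .fit() only LEARNS the statistic (here, the median of the training column)
imputer.fit(toy_train)
print(imputer.statistics_)

# .transform() APPLIES the learned statistic, without re-learning anything
train_filled = imputer.transform(toy_train)
test_filled = imputer.transform(toy_test)  # uses the TRAINING median

# .fit_transform() is a shortcut for fit() followed by transform()
# on the same data; use it on the training set only.
train_filled_again = SimpleImputer(strategy="median").fit_transform(toy_train)

print(train_filled.ravel())
print(test_filled.ravel())
```

Note that the missing value in `toy_test` is filled with the median learned from `toy_train`, which is what "the same policy" means in practice.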

##### Warning:
Pandas behaviour is not the same when you use these two instructions:

In [5]:
test = train_df["Embarked"]
type(test)  # print the type of the object (in this case, a pandas series)

pandas.core.series.Series

Or: 

In [6]:
test = train_df[["Embarked"]]
type(test)  # print the type of the object (in this case, it is a DataFrame)

pandas.core.frame.DataFrame

In the first case, pandas returns a Series; in the second case, a DataFrame. [They are not the same thing](https://www.geeksforgeeks.org/dataframe-vs-series-in-pandas/). Most of the time, you need a DataFrame.
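
One practical consequence is the shape of the object: a Series is one-dimensional, while a DataFrame is two-dimensional, and scikit-learn transformers generally expect two-dimensional input. A small sketch on a made-up column (`toy_df` is a hypothetical stand-in for `train_df`):

```python
import pandas as pd

# Toy stand-in for train_df with a single column
toy_df = pd.DataFrame({"Embarked": ["S", "C", "Q"]})

as_series = toy_df["Embarked"]  # single brackets: Series, shape (n,)
as_frame = toy_df[["Embarked"]]  # double brackets: DataFrame, shape (n, 1)

print(as_series.shape)
print(as_frame.shape)
```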

### 2.3 Separate the features X, from the label y

### 2.4 Scale and encode the features

#### 2.4.1 Scale the numerical features

NOTE: it is arguable whether we should scale the `Pclass`, `SibSp` and `Parch` columns. We will do it here, but you can try removing them from the list of columns to scale and see whether this improves (or not) the performance of the model.
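
A possible way to scale a chosen list of columns is sketched below with `StandardScaler`. The data, the column list `num_cols`, and the choice of `StandardScaler` over other scalers are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the training part and a held-out part
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})
toy_holdout = pd.DataFrame({"Age": [30.0], "Fare": [30.0]})

# Hypothetical list of columns to scale; edit it to test other choices
num_cols = ["Age", "Fare"]

scaler = StandardScaler()

# Work on copies so the original frames stay untouched
train_scaled = toy_train.copy()
holdout_scaled = toy_holdout.copy()

# Learn mean and standard deviation on the training part only...
train_scaled[num_cols] = scaler.fit_transform(toy_train[num_cols])
# ...and reuse them, unchanged, on the held-out part
holdout_scaled[num_cols] = scaler.transform(toy_holdout[num_cols])

print(train_scaled)
print(holdout_scaled)
```

Removing a column name from `num_cols` is enough to leave that feature unscaled, which makes it easy to run the comparison suggested in the note.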

#### 2.4.2 Encode the categorical features (one hot encoder)
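
A minimal sketch of one-hot encoding with scikit-learn's `OneHotEncoder`, on made-up data. The category `"X"` is a hypothetical value that never appears in the training part; `handle_unknown="ignore"` is one possible policy for such cases, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins; "X" is a category unseen during fitting
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_holdout = pd.DataFrame({"Embarked": ["C", "X"]})

# With handle_unknown="ignore", unseen categories become all-zero rows
encoder = OneHotEncoder(handle_unknown="ignore")

# The encoder returns a sparse matrix by default; .toarray() makes it dense
train_encoded = encoder.fit_transform(toy_train).toarray()
holdout_encoded = encoder.transform(toy_holdout).toarray()

print(encoder.categories_)  # categories learned from the training part
print(train_encoded)
print(holdout_encoded)
```

Each learned category gets its own 0/1 column, so no artificial ordering is imposed between `C`, `Q` and `S`.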

### 2.5 Encode the labels

In [7]:
# Encode the labels (label encoder)
# we do not need to do this, as the labels are already binary (survived: 0 = No, 1 = Yes)

### Answers to the questions (Data preparation)
- **Why and when is it important to rescale/standardize the features?**


- **What is the difference between Min-Max Scaler, Standard Scaler, and Robust Scaler?**
    - Understand the formulas used for the Min-Max Scaler, Standard Scaler.
    - Explain the possible impact of outliers on rescaling.


- **Should I "fit" my scaler**:
    - On the training set only?
    - On the test set only?
    - On both datasets (conjointly or separately)?
    (and why)


- **Feature Encoding**:
    - What does the `OneHotEncoder` do?
    - How does it differ from the `LabelEncoder`? When should you use one instead of the other?

---

## 3 Training and evaluation
Check the instructions in the readme.md file of the assignment *W2_HD3 - Classification with Decision Tree*

Suggestion:
- Structure the code below so that you can easily try different classifiers and compare their performance. Later on, we will use Random Forest on the same data.
- Review the readme.md file for the tasks to perform.
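
One way to keep the code easy to extend is to store the classifiers in a dictionary and loop over it. The sketch below is only a suggestion: it uses synthetic data from `make_classification` instead of the Titanic features, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features X and labels y
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Adding a new model later (e.g., a random forest) is a one-line change here
classifiers = {
    "k-nn": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, clf.predict(X_train)),
        "val_accuracy": accuracy_score(y_val, clf.predict(X_val)),
    }

for name, scores in results.items():
    print(name, scores)
```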

In [8]:
# Import libraries (we will use GridSearchCV to find the best hyperparameters for each classifier)
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

**NOTE**: it is very important to print scores related to the training set **and** to the validation set. This allows verifying possible overfitting/underfitting conditions.
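
With `GridSearchCV`, training scores are not stored by default; passing `return_train_score=True` makes them available next to the validation scores. The sketch below assumes synthetic data and an arbitrary grid over `max_depth`; note that scikit-learn names the validation-fold scores `mean_test_score`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Arbitrary example grid; None lets the tree grow until its leaves are pure
param_grid = {"max_depth": [2, 4, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X, y)

# Compare training and validation accuracy for each candidate
for params, train_score, val_score in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_train_score"],
    search.cv_results_["mean_test_score"],
):
    print(params, f"train={train_score:.3f}", f"validation={val_score:.3f}")

print("Best parameters:", search.best_params_)
```

A training score far above the validation score for a given candidate is a typical sign of overfitting; two low and similar scores point towards underfitting.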

### 3.1 k-nn classifier

### 3.2 Decision tree classifier

### Answers to the questions (Classification with Decision Tree)
- **What are the hyperparameters of a Decision Tree? What do they represent?**


- **How can you use k-fold cross-validation to select the best hyperparameters?**


- **What is the difference between grid search and random search? Advantages and disadvantages?**


- **How can the learning curve help you understand if your model is overfitting or underfitting the data? What are possible solutions when using a decision tree?**
 **Hint**: which hyperparameter(s) should you change to increase/decrease the likelihood of overfitting?


- **What is a confusion matrix for a 2-class problem? How does it differ for a multiclass problem?**

---

### 3.3 Random Forest classifier

Check the instructions in the readme.md file of the assignment *W2_HD4 - Classification with Decision Tree*

### Answers to the questions (Classification with Random Forest)

- **Present a step-by-step approach to implementing a Random Forest (RF) from Decision Trees.**


- **What are the hyperparameters of a Random Forest? What do they represent?**


- **How can the learning curve help you understand if your model is overfitting or underfitting the data?**


- **What are the values in the x axis of a learning curve?**


- **What are possible solutions to overfitting/underfitting when using a Random Forest?**


- **Talking about model evaluation, why is considering only "accuracy" not enough? Can you provide an example?**


#### Random Forest (RF) vs. Decision Tree (DT)
   - **Which approach is better for explainability and why?**


   - **Do these approaches require feature rescaling? Why or why not?**


   - **How should feature encoding be handled in each approach, and why?**

   - **Which of these models provided a better classification? Why? Is the difference statistically significant? How could you prove it?**



---

## Conclusion

In this notebook, you implemented a full ML pipeline, from data exploration to the evaluation of multiple ML models (k-nn, decision tree, and random forest).
We encourage you to apply this same pipeline to all your ML projects. Most of the time, the model that you use is not as important as how you processed the data.