# Titanic ML Competition
Goal: Predict whether a passenger survives or not

## Exploratory Data Analysis (EDA)

Reasons why it is useful to conduct EDA:
- What kind of columns does the dataset have? Which ones are numeric, which ones are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? This matters for the choice of evaluation metric: if the classes are imbalanced, accuracy might be misleading and precision/recall/f1 might be better
- Get an overview of the data
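
The questions above can be answered with a few pandas calls. The following is only a minimal sketch on a made-up toy frame standing in for `data/train.csv`; it is not what `run_eda()` does internally, and the variable names are chosen for illustration.

```python
import pandas as pd

# Toy stand-in for data/train.csv; the values are made up for illustration
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, None],
})

# Which columns are numeric, which are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column and number of fully duplicated rows
missing_per_col = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Share of each class in the ground truth
class_share = df["Survived"].value_counts(normalize=True)

print(numeric_cols, categorical_cols)
print(missing_per_col)
print(n_duplicates)
print(class_share)
```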

In [None]:
from src.pipeline import MLPipeline
from IPython.display import Image

ml_pipeline = MLPipeline(
    df_path_train='data/train.csv', 
    df_path_test='data/test.csv', 
    id_col='PassengerId', 
    ground_truth='Survived',
    results_dir='results'
)
ml_pipeline.run_eda()

2021-02-17 23:35:55,146; src.utils; INFO; Root logger is set up
2021-02-17 23:35:55,151; src.pipeline; INFO; Cleared results dir
2021-02-17 23:35:55,153; src.pipeline; INFO; Created results dir
2021-02-17 23:35:55,161; src.preprocessing; INFO; Train samples: "891"
2021-02-17 23:35:55,169; src.preprocessing; INFO; Test samples: "418"
2021-02-17 23:35:55,188; src.preprocessing; INFO; Total samples: "1309"
2021-02-17 23:35:55,190; src.preprocessing; INFO; Joined train and test sets together for the preprocessing. For training and testing, they will be separated again
2021-02-17 23:35:55,194; src.preprocessing; INFO; Generating the profiling report



### Note: 
The results of the EDA might not be properly displayed in the browser. To get a better view of the profiling report, clone the repo, install the conda environment and rerun the code. 

### EDA Analysis
- The ground truth (`Survived`) is not equally distributed, so also use precision, recall and f1 when evaluating the model. Should hyper-parameter tuning be performed, the best model will still be chosen based on accuracy, because this is the main evaluation metric used in the Kaggle competition
- Survival might depend on socio-economic status, which might be reflected in the person's title, so extract the title from the `Name` column
- Lots of missing values in the `Age` column, and its distribution looks skewed with some outliers, so fill the missing values with the median, which in skewed distributions is often a better representation of a "common" value than e.g. the mean
- Survival might depend on gender, because women and children were supposed to board the lifeboats first, so `Sex` is probably a strong predictor
- Extract some more information from the `Ticket` column. Passengers from lower decks might have lower survival chances, because lower decks could have been flooded first
- Some missing values in the `Fare` column. Fill these with the median as well, because the distribution is highly skewed
- Around 75% missing values in the `Cabin` column, so ignore this column completely
- Few missing values in the `Embarked` column, so fill these with the mode
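
Some of the steps listed above could look as follows in pandas. This is a rough sketch under assumptions, not the code in `src.preprocessing`: the regular expression, the `Title` column name and the three sample rows (written in the format of the `Name` column) are chosen for illustration.

```python
import pandas as pd

# Three sample rows in the format of the dataset; Age and Embarked gaps are made up
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
    ],
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", None, "S"],
})

# The title sits between the comma and the first period of the name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Median for the skewed numeric column, mode for the categorical column
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode().iloc[0])

print(df[["Title", "Age", "Embarked"]])
```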

## Models
In this challenge, I want to experiment with 3 different models:
- Support Vector Machines (SVM)
- Decision Tree (DT)
- Random Forest (RF)

In a nutshell, the SVM transforms the input space (our training set) into an enlarged feature space and then fits a linear discriminant function in the enlarged feature space. Why would it be helpful to transform the input space in this enlarged feature space? Consider the following distribution of data in the input space,

In [None]:
Image(filename='resources/img1.png') 

where each color represents data from a different category. Clearly, the data isn't linearly separable in the input space, but imagine we could transform the 2-dimensional input space into a 3-dimensional feature space, which might take the following form,

In [None]:
Image(filename='resources/img2.png')

where the blue feature vectors are supposed to be located directly "above" the red feature vectors, and where the green plane indicates that we separate the data by "slicing" horizontally through the feature space.
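
To make the lifting idea concrete, here is a small sketch. It assumes the two categories form concentric rings (the actual data in the figures may look different) and uses the hypothetical extra feature `z = x^2 + y^2` as third dimension. After the lift, a horizontal plane separates the two rings.

```python
import math

def circle_points(radius, n=8):
    """Return n points evenly spread on a circle around the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

def lift(point):
    """Map a 2-dimensional point to 3 dimensions by adding its squared radius."""
    x, y = point
    return (x, y, x * x + y * y)

inner = circle_points(1.0)  # one category, not linearly separable from the other in 2D
outer = circle_points(3.0)  # the other category, surrounding the first one

threshold = 5.0  # the horizontal plane z = 5 in the lifted space
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)
print(inner_below, outer_above)
```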

The above idea has been formalized by Vapnik & Chervonenkis (VC). For linear classifiers with a bias term, N data vectors in general position can always be perfectly linearly separated, i.e. "shattered", whatever their labels, if the space has at least N-1 dimensions. So one strategy is to transform the input space into a feature space of at least N-1 dimensions and then fit a linear discriminant function in that feature space.
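
The shattering statement can be checked by hand for the smallest interesting case, N = 3 points in 2 dimensions. The sketch below uses three hand-picked points in general position and enumerates all 8 possible labelings; for each one it constructs a separating line directly from the labels. The points and helper names are chosen for illustration only.

```python
from itertools import product

# N = 3 points in general position in an (N - 1) = 2 dimensional space
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def separating_hyperplane(labels):
    """Return (w1, w2, b) such that w1*x1 + w2*x2 + b equals the label at each point."""
    y0, y1, y2 = labels
    b = y0        # follows from the point (0, 0)
    w1 = y1 - b   # follows from the point (1, 0)
    w2 = y2 - b   # follows from the point (0, 1)
    return w1, w2, b

def classify(w1, w2, b, point):
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1

shattered = True
for labels in product([-1, 1], repeat=3):
    w1, w2, b = separating_hyperplane(labels)
    predicted = tuple(classify(w1, w2, b, p) for p in points)
    shattered = shattered and predicted == labels

print(shattered)
```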

Apart from the risk of overfitting when the training data is separated perfectly, there is also a computational constraint to this idea, because the dimensionality of the feature space (M) may be much higher than N. With the radial basis function (RBF) kernel, the feature space is even infinite-dimensional, so the transformation cannot be carried out explicitly. Luckily, this constraint can be overcome with the so-called kernel trick, which lets us skip the explicit transformation into the enlarged feature space by applying well-defined kernel functions instead. One property that all these kernel functions must satisfy is $k(\textbf{x}, \textbf{x}') = \phi(\textbf{x})^T \phi(\textbf{x}')$, i.e. the result of the kernel function must equal the dot product of the feature vectors.
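
The kernel property can be verified numerically for a kernel whose feature map is small enough to write down. The sketch below uses the homogeneous polynomial kernel of degree 2 on 2-dimensional inputs (not the RBF kernel used later, whose feature map cannot be written out); the function names are chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-dimensional inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Same quantity, computed in the input space without ever building phi."""
    return dot(x, y) ** 2

x = (1.0, 2.0)
y = (3.0, -1.0)

explicit = dot(phi(x), phi(y))  # dot product in the 3-dimensional feature space
implicit = poly_kernel(x, y)    # kernel evaluated in the 2-dimensional input space
print(explicit, implicit)
```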

So, with the kernel trick, the training cost depends on the number of samples N and no longer on the feature-space dimensionality M. This pays off when M >> N and N itself is relatively small. The Titanic dataset is small (891 training samples), so the SVM shouldn't take long to train.

Besides the SVM, I also want to use the good old decision tree, which separates the input space into non-overlapping, mutually exclusive regions. To classify a new input vector, we first check which "leaf", i.e. terminal node, of the tree it falls into, and then assign it to the category that the majority of the training vectors in that leaf belong to. Ties may need to be handled by random assignment. When fitting a decision tree, one needs to make sure that the tree is not too deep, because otherwise it is very likely to overfit.
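
The prediction rule inside a single leaf can be sketched in a few lines. This is a simplified illustration of the majority vote with random tie-breaking described above, not how any particular library implements it; `leaf_prediction` is a hypothetical helper.

```python
import random
from collections import Counter

def leaf_prediction(labels_in_leaf, rng=random):
    """Return the majority label of the training samples in a leaf; break ties randomly."""
    counts = Counter(labels_in_leaf)
    top = max(counts.values())
    winners = sorted(label for label, count in counts.items() if count == top)
    if len(winners) == 1:
        return winners[0]
    return rng.choice(winners)

print(leaf_prediction([1, 1, 0]))  # clear majority for label 1
print(leaf_prediction([0, 1]))     # tie, resolved randomly
```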

The random forest estimator extends decision trees with a seemingly odd trick. Multiple, very deep trees are fit to the training data, but when choosing the feature for the next split, only a random subset of the features may be considered. As a result, each individual tree heavily overfits the training data, but differs from the other trees. Combining all individual trees in an ensemble averages out the high variance of the individual trees, so the ensemble generalizes quite well to unseen test data.
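
Why combining different trees helps can be illustrated with a hand-made example. The predictions below are invented so that each "tree" is wrong on a different sample; real trees make partly correlated errors, so the gain is usually smaller than in this idealized sketch.

```python
from collections import Counter

true_labels = [1, 0, 1, 1, 0]

# Hand-made predictions of three hypothetical trees, each wrong on a different sample
tree_predictions = [
    [0, 0, 1, 1, 0],  # wrong on sample 0
    [1, 1, 1, 1, 0],  # wrong on sample 1
    [1, 0, 0, 1, 0],  # wrong on sample 2
]

def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def majority_vote(predictions):
    """Combine per-tree predictions sample by sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

single_accuracies = [accuracy(pred, true_labels) for pred in tree_predictions]
ensemble_accuracy = accuracy(majority_vote(tree_predictions), true_labels)
print(single_accuracies, ensemble_accuracy)
```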

In what follows, I will conduct three iterations which all build on each other and become more and more complex. 

## Iteration 1
- Fill `Age` with median
- Fill `Embarked` with mode
- Fill `Fare` with median
- Label-encode `Pclass` such that class 1 (for the rich and famous) gets the highest value and class 3 (for working-class people) gets the lowest value
- Encode the remaining categorical variables: `Sex` as a binary 0/1 variable and `Embarked` one-hot
- Use the above models with default parameters
- When training a model, min-max scale all features into the range `[0, 1]` if the model calls for it. Whether or not to scale features depends on the model, so I put the scaling method into the model config. E.g. the decision tree and random forest are not very sensitive to scaling, but if we didn't scale the inputs for the SVM, the results would be considerably worse

In [None]:
# Setup
MISSING_VALUE_CONFIG = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
ENCODING_CONFIG = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': {'male': 1, 'female': 0},
    'Embarked': 'one_hot'
}

MODEL_CONFIG = {
    'svm': {'kernel': 'rbf', 'scaling_mode': 'min_max'},
    'decision_tree': {'scaling_mode': None},
    'random_forest': {'scaling_mode': 'min_max'},
}

# Iteration 1
ml_pipeline.run(
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    advanced_preprocessing=False,
    model_config=MODEL_CONFIG
)

### Results iteration 1
- SVM
    - Results on the validation set: accuracy 0.80, precision 0.86, recall 0.61, f1 0.72
    - Results on the Kaggle test set: accuracy 0.77751
- DT
    - Results on the validation set: accuracy 0.75, precision 0.69, recall 0.71, f1 0.70
    - Results on the Kaggle test set: accuracy 0.74880
- RF
    - Results on the validation set: accuracy 0.78, precision 0.74, recall 0.71, f1 0.72
    - Results on the Kaggle test set: accuracy 0.77511
    - While the test score is close to that of the SVM, I verified that the submission files are actually different

Note: Model performances may vary slightly with each run due to a little bit of randomness in the model training process
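
For reference, precision, recall and f1 follow directly from the counts of true positives, false positives and false negatives. The sketch below uses hypothetical counts (not the pipeline's actual confusion matrix) and then recomputes f1 from the rounded precision and recall reported above for DT and RF.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """f1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=50, fp=10, fn=20)
print(round(precision, 2), round(recall, 2), round(f1, 2))

# Recomputing f1 from the rounded validation figures reported above
dt_f1 = round(f1_from(0.69, 0.71), 2)
rf_f1 = round(f1_from(0.74, 0.71), 2)
print(dt_f1, rf_f1)
```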

## Iteration 2
- Build upon iteration 1
- Extract the title from the `Name` column
- Extract more information from the `Ticket` column
- Generate continuous variables, e.g. `family_size`
- Generate indicator variables, e.g. `has_cabin`, `is_along`
- Bin continuous variables, e.g. `fare_category`, `age_category`

In [None]:
# Iteration 2
ml_pipeline.run(
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    advanced_preprocessing=True,
    model_config=MODEL_CONFIG
)

### Results iteration 2
- SVM
    - Results on the validation set: accuracy 0.83, precision 0.84, recall 0.72, f1 0.78
    - Results on the Kaggle test set: accuracy 0.78947
- DT
    - Results on the validation set: accuracy 0.78, precision 0.75, recall 0.70, f1 0.73
    - Results on the Kaggle test set: accuracy 0.75598
- RF
    - Results on the validation set: accuracy 0.82, precision 0.84, recall 0.71, f1 0.77
    - Results on the Kaggle test set: accuracy 0.78708
    
Note: Model performances may vary slightly with each run due to a little bit of randomness in the model training process

## Iteration 3
On top of iteration 2, I also want to experiment a little with hyper-parameter tuning. When specifying the hyper-parameter values to tune, I made sure to include the default values, so that the default configuration from iteration 2 is always among the candidates and the best cross-validated score should not be worse than with the defaults.

In the SVM, I tuned `C`, which controls how strongly training points on the wrong side of the margin are penalized. A higher value of `C` means a larger penalty for margin violations, which forces the SVM to classify more training points correctly and in turn may lead to more overfitting. I also tried a linear kernel, which is just the dot product, and tuned `gamma`, which is a coefficient in the RBF kernel.
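
As a reminder of where `C` and `gamma` enter, here is the standard textbook formulation of the soft-margin SVM in the notation used above. The slack variables $\xi_i$ (a symbol not used elsewhere in this notebook) measure how far training point $i$ lies on the wrong side of its margin, and $y_i \in \{-1, +1\}$ is its label:

$$\min_{\textbf{w}, b, \xi} \; \frac{1}{2} \|\textbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i \left( \textbf{w}^T \phi(\textbf{x}_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

The RBF kernel is

$$k(\textbf{x}, \textbf{x}') = \exp\left(-\gamma \, \|\textbf{x} - \textbf{x}'\|^2\right)$$

A larger `C` makes margin violations more expensive, and a larger `gamma` makes the kernel more local, so both push the model towards fitting the training data more closely. If the pipeline wraps scikit-learn's `SVC`, which the parameter names suggest but which is an assumption here, `gamma='scale'` stands for `1 / (n_features * X.var())` and `gamma='auto'` for `1 / n_features`.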

In the decision tree, I tuned the minimum number of observations required in each leaf node (`min_samples_leaf`), the splitting criterion (`criterion`), and the minimum improvement needed to split the current node any further (`min_impurity_decrease`). The splitting criterion is just a function measuring the quality of a split.
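
The two splitting criteria and the notion of impurity decrease can be written down in a few lines. This is a simplified sketch for a single split at the root; library implementations such as scikit-learn additionally weight the decrease by the share of samples reaching the node. All function names are chosen for illustration.

```python
import math
from collections import Counter

def class_shares(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced binary node."""
    return 1.0 - sum(p * p for p in class_shares(labels))

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for a balanced binary node."""
    return -sum(p * math.log2(p) for p in class_shares(labels))

def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

parent = [0, 0, 1, 1]
left, right = [0, 0], [1, 1]  # a perfect split

print(gini(parent), entropy(parent))
print(impurity_decrease(parent, left, right))
print(impurity_decrease(parent, left, right, impurity=entropy))
```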

In my opinion, the random forest classifier does not need much hyper-parameter tuning, as long as the number of individual estimators (`n_estimators`) is large enough to counteract the overfitting of each individual estimator. So, I only added a higher number of estimators and tried different splitting criteria.

In [None]:
# Iteration 3
MODEL_CONFIG = {
    'svm': {
        'kernel': ['rbf', 'linear'],
        'C': [0.1, 1.0, 10.0],
        'gamma': ['scale', 'auto'], 
        'scaling_mode': 'min_max'
    },
    'decision_tree': {
        'min_samples_leaf': [1, 10, 100],
        'criterion': ['gini', 'entropy'],
        'min_impurity_decrease': [0.0, 0.1], 
        'scaling_mode': None
    },
    'random_forest': {
        'n_estimators': [100, 1000],
        'criterion': ['gini', 'entropy'],
        'scaling_mode': 'min_max'
    },
}
ml_pipeline.run(
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    advanced_preprocessing=True,
    model_config=MODEL_CONFIG,
    hp_tuning=True
)
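
To get a feeling for the size of the search, the grids above can be expanded into their individual candidates. The sketch below assumes an exhaustive grid search (the search strategy inside the pipeline is not shown here) and leaves out `scaling_mode`, which is fixed per model. It also checks that the configuration used in iterations 1 and 2, taken here to be the scikit-learn defaults, is among the candidates.

```python
from itertools import product

# Tunable part of MODEL_CONFIG from this iteration (scaling_mode left out)
grids = {
    "svm": {
        "kernel": ["rbf", "linear"],
        "C": [0.1, 1.0, 10.0],
        "gamma": ["scale", "auto"],
    },
    "decision_tree": {
        "min_samples_leaf": [1, 10, 100],
        "criterion": ["gini", "entropy"],
        "min_impurity_decrease": [0.0, 0.1],
    },
    "random_forest": {
        "n_estimators": [100, 1000],
        "criterion": ["gini", "entropy"],
    },
}

def expand(grid):
    """Return every combination of hyper-parameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

n_candidates = {name: len(expand(grid)) for name, grid in grids.items()}
print(n_candidates)

# Assumed scikit-learn defaults of the SVM, which should be part of the grid
svm_default = {"kernel": "rbf", "C": 1.0, "gamma": "scale"}
print(svm_default in expand(grids["svm"]))
```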

### Results iteration 3
- SVM
    - The best cross-validated score was an accuracy of 0.85
    - Results on the Kaggle test set: accuracy 0.79425
- DT
    - The best cross-validated score was an accuracy of 0.82
    - Results on the Kaggle test set: accuracy 0.74880
- RF
    - The best cross-validated score was an accuracy of 0.83
    - Results on the Kaggle test set: accuracy 0.77272

## Conclusion
So, the winner is the SVM with advanced preprocessing and hyper-parameter tuning, which reached the highest Kaggle test accuracy of all runs (0.79425).