# Titanic ML Competition
Goal: Predict whether a passenger survives or not

## Exploratory Data Analysis
- What kinds of columns does the dataset have? Which ones are numeric, which ones are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? This matters for the choice of evaluation metric: if the classes are imbalanced, accuracy can be misleading, and precision, recall, and F1 are more informative
- Get an overview of the data
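
The questions above can be answered with a few pandas calls. The following is a minimal sketch on a small made-up DataFrame that only borrows the Titanic column names; it is not the code behind `run_eda()`, whose internals are not shown here.

```python
import pandas as pd

# Tiny made-up sample using Titanic column names (values are illustrative only).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, None],
})

# Which columns are numeric, which are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column and number of fully duplicated rows.
missing_per_col = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Relative frequency of each class in the ground truth.
class_share = df["Survived"].value_counts(normalize=True)

print(numeric_cols, categorical_cols)
print(missing_per_col)
print(n_duplicates)
print(class_share)
```

If `class_share` is far from an even split, accuracy alone says little, which is why the class distribution is checked before choosing a metric.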

In [1]:
from src.pipeline import MLPipeline

ml_pipeline = MLPipeline(
    df_path_train='data/train.csv', 
    df_path_test='data/test.csv', 
    id_col='PassengerId', 
    ground_truth='Survived'
)
ml_pipeline.run_eda()

2021-02-16 22:44:12,375; src.utils; INFO; Root logger is set up
2021-02-16 22:44:12,383; src.preprocessing; INFO; Train samples: "891"
2021-02-16 22:44:12,389; src.preprocessing; INFO; Test samples: "418"
2021-02-16 22:44:12,406; src.preprocessing; INFO; Total samples: "1309"
2021-02-16 22:44:12,408; src.preprocessing; INFO; Joined train and test sets together for the preprocessing. For training and testing, they will be separated again
2021-02-16 22:44:12,411; src.preprocessing; INFO; A profiling report was already generated and will be loaded from ``results/ds_profile_report.html``


- The ground truth (`Survived`) is not evenly distributed, so precision, recall, and F1 are reported alongside accuracy when evaluating the model. If hyper-parameter tuning is performed, the best model is still chosen by accuracy, because that is the main evaluation metric of the Kaggle competition
- Survival might depend on socio-economic status, which may be reflected in a person's title, so extract the title from the `Name` column
- The `Age` column has many missing values and its distribution looks skewed with some outliers, so fill the missing values with the median; in a skewed distribution the median is usually a better representation of a typical value than the mean
- Survival might depend on gender, because women and children were supposed to board the lifeboats first, so `Sex` is probably a strong predictor
- Extract more information from the `Ticket` column. Passengers on lower decks might have had lower survival chances, because lower decks could have flooded first
- The `Fare` column has some missing values. Fill them with the median as well, because its distribution is highly skewed
- Around 75% of the values in the `Cabin` column are missing, so ignore this column completely
- The `Embarked` column has a few missing values, so fill these with the mode
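
One way to get the title out of the `Name` column is a regular expression that captures the token between the comma and the first period. The helper name `extract_title` and the fallback value `"Unknown"` are assumptions made for this sketch, not part of the pipeline.

```python
import re

def extract_title(name: str) -> str:
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

# Names in the format used by the Titanic dataset: "Surname, Title. Given names".
names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(n) for n in names]
print(titles)
```

Rare titles could later be grouped into a single category so that one-hot encoding does not create many near-empty columns.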

## Iteration 1
- Fill `Age` with median
- Fill `Embarked` with mode
- Fill `Fare` with median
- Label-encode `Pclass` so that class 1 (the wealthiest passengers) gets the highest value and class 3 (working-class passengers) the lowest
- Encode `Sex` as a binary variable (`male` = 1, `female` = 0) and one-hot encode `Embarked`
- Use the following models with default parameters
    - Support Vector Machines
    - Decision Tree
    - Random Forest
- When training the model, min-max scale all features into the range `[0, 1]`
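
The preprocessing steps listed above can be sketched in plain pandas. This is only an illustration on made-up rows, under the assumption that the pipeline does something equivalent internally; the actual implementation in `src.preprocessing` is not shown here.

```python
import pandas as pd

# Made-up rows using the Titanic column names; values are illustrative only.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 54.0],
    "Fare": [7.25, 71.28, None, 13.0],
    "Embarked": ["S", "C", None, "S"],
})

# Median for the skewed numeric columns, mode for the categorical one.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Ordinal mapping for Pclass (class 1 -> highest value), binary mapping for Sex.
df["Pclass"] = df["Pclass"].map({1: 3, 2: 2, 3: 1})
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

# One-hot encode Embarked.
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

# Min-max scale every column into [0, 1]; constant columns are mapped to 0.
col_range = (df.max() - df.min()).replace(0, 1)
scaled = (df - df.min()) / col_range
print(scaled)
```

In a real training run the minimum and maximum should be computed on the training split only and then reused for the validation and test data, so that no information leaks from the held-out rows.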

In [2]:
# Setup
MISSING_VALUE_CONFIG = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
ENCODING_CONFIG = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': {'male': 1, 'female': 0},
    'Embarked': 'one_hot'
}

MODEL_CONFIG = {
    'svm': {'kernel': 'rbf', 'scaling_mode': 'min_max'},
    'decision_tree': {'scaling_mode': 'min_max'},
    'random_forest': {'scaling_mode': 'min_max'},
}

# Iteration 1
ml_pipeline.run(
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    advanced_preprocessing=False,
    model_config=MODEL_CONFIG
)

2021-02-16 22:44:12,463; src.preprocessing; INFO; The median of column ``Age`` equals: 28.0
2021-02-16 22:44:12,468; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-16 22:44:12,479; src.preprocessing; INFO; The median of column ``Fare`` equals: 14.4542
2021-02-16 22:44:12,492; src.preprocessing; INFO; Converted column ``Pclass`` using the custom mapping ``{1: 3, 2: 2, 3: 1}``
2021-02-16 22:44:12,505; src.preprocessing; INFO; Converted column ``Sex`` using the custom mapping ``{'male': 1, 'female': 0}``
2021-02-16 22:44:12,535; src.preprocessing; INFO; One-hot encoded the column ``Embarked``
2021-02-16 22:44:12,539; src.preprocessing; INFO; Preprocessing finished
2021-02-16 22:44:12,542; src.preprocessing; INFO; Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']


train dataframe:


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,1,0.0,1,1,22.0,1,0,7.2500,0,0,1
1,2,1.0,3,0,38.0,1,0,71.2833,1,0,0
2,3,1.0,1,0,26.0,0,0,7.9250,0,0,1
3,4,1.0,3,0,35.0,1,0,53.1000,0,0,1
4,5,0.0,1,1,35.0,0,0,8.0500,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0.0,2,1,27.0,0,0,13.0000,0,0,1
887,888,1.0,3,0,19.0,0,0,30.0000,0,0,1
888,889,0.0,1,0,28.0,1,2,23.4500,0,0,1
889,890,1.0,3,1,26.0,0,0,30.0000,1,0,0



test dataframe:


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,892,,1,1,34.5,0,0,7.8292,0,1,0
1,893,,1,0,47.0,1,0,7.0000,0,0,1
2,894,,2,1,62.0,0,0,9.6875,0,1,0
3,895,,1,1,27.0,0,0,8.6625,0,0,1
4,896,,1,0,22.0,1,1,12.2875,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,1,1,28.0,0,0,8.0500,0,0,1
414,1306,,3,0,39.0,0,0,108.9000,1,0,0
415,1307,,1,1,38.5,0,0,7.2500,0,0,1
416,1308,,1,1,28.0,0,0,8.0500,0,0,1


2021-02-16 22:44:12,646; src.training; INFO; Training ``svm`` started
2021-02-16 22:44:12,652; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:12,670; src.training; INFO; Training finished
2021-02-16 22:44:12,680; src.training; INFO; Results on the validation set: accuracy 0.80, precision 0.86, recall 0.61, f1 0.72
2021-02-16 22:44:12,682; src.training; INFO; Saved the trained model to ``results/svm.pickle``
2021-02-16 22:44:12,686; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:12,688; src.training; INFO; Generating predictions on the test set
2021-02-16 22:44:12,699; src.training; INFO; Saved submission to ``results/svm_basic.csv``
2021-02-16 22:44:12,700; src.training; INFO; Training ``decision_tree`` started
2021-02-16 22:44:12,704; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:12,708; src.training; INFO; Training finished
2021-02-16 22:44:12,716; src.

### Results iteration 1
- Results on the validation set: accuracy 0.80, precision 0.86, recall 0.61, f1 0.72
- After uploading the submission file `results/basic_preprocessing_submission.csv` to Kaggle, the accuracy on the test set was `0.77751`
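
To make the relationship between the four reported metrics explicit, here is a small sketch that computes them from true and predicted labels. The labels are made up and the function name `classification_metrics` is hypothetical; scikit-learn offers the same metrics in `sklearn.metrics`.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed survivors

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 survivors, 6 non-survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A precision that is clearly higher than recall, as in the validation results above, means the model rarely predicts survival wrongly but misses a noticeable share of the actual survivors.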

## Iteration 2
- Build upon iteration 1
- Use feature engineering ideas from https://www.kaggle.com/imoore/titanic-the-only-notebook-you-need-to-see

In [3]:
ml_pipeline.run(
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    advanced_preprocessing=True,
    model_config=MODEL_CONFIG
)

2021-02-16 22:44:13,053; src.preprocessing; INFO; Preprocessing was already conducted
2021-02-16 22:44:13,122; src.preprocessing; INFO; Number of columns: 483


train dataframe:


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,...,ticket_group_39207,ticket_group_39208,ticket_group_39209,ticket_group_39414,ticket_group_310126,ticket_group_310127,ticket_group_310128,ticket_group_310129,ticket_group_310130,ticket_group_310131
0,1,0.0,1,1,22.0,1,0,7.2500,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,1.0,3,0,38.0,1,0,71.2833,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1.0,1,0,26.0,0,0,7.9250,0,0,...,0,0,0,0,0,0,1,0,0,0
3,4,1.0,3,0,35.0,1,0,53.1000,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0.0,1,1,35.0,0,0,8.0500,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0.0,2,1,27.0,0,0,13.0000,0,0,...,0,0,0,0,0,0,0,0,0,0
887,888,1.0,3,0,19.0,0,0,30.0000,0,0,...,0,0,0,0,0,0,0,0,0,0
888,889,0.0,1,0,28.0,1,2,23.4500,0,0,...,0,0,0,0,0,0,0,0,0,0
889,890,1.0,3,1,26.0,0,0,30.0000,1,0,...,0,0,0,0,0,0,0,0,0,0



test dataframe:


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,...,ticket_group_39207,ticket_group_39208,ticket_group_39209,ticket_group_39414,ticket_group_310126,ticket_group_310127,ticket_group_310128,ticket_group_310129,ticket_group_310130,ticket_group_310131
0,892,,1,1,34.5,0,0,7.8292,0,1,...,0,0,0,0,0,0,0,0,0,0
1,893,,1,0,47.0,1,0,7.0000,0,0,...,0,0,0,0,0,0,0,0,0,0
2,894,,2,1,62.0,0,0,9.6875,0,1,...,0,0,0,0,0,0,0,0,0,0
3,895,,1,1,27.0,0,0,8.6625,0,0,...,0,0,0,0,0,0,0,0,0,0
4,896,,1,0,22.0,1,1,12.2875,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,1,1,28.0,0,0,8.0500,0,0,...,0,0,0,0,0,0,0,0,0,0
414,1306,,3,0,39.0,0,0,108.9000,1,0,...,0,0,0,0,0,0,0,0,0,0
415,1307,,1,1,38.5,0,0,7.2500,0,0,...,0,0,0,0,1,0,0,0,0,0
416,1308,,1,1,28.0,0,0,8.0500,0,0,...,0,0,0,0,0,0,0,0,0,0


2021-02-16 22:44:13,214; src.training; INFO; Training ``svm`` started
2021-02-16 22:44:13,234; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:13,300; src.training; INFO; Training finished
2021-02-16 22:44:13,354; src.training; INFO; Results on the validation set: accuracy 0.83, precision 0.84, recall 0.72, f1 0.78
2021-02-16 22:44:13,360; src.training; INFO; Saved the trained model to ``results/svm.pickle``
2021-02-16 22:44:13,368; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:13,370; src.training; INFO; Generating predictions on the test set
2021-02-16 22:44:13,448; src.training; INFO; Saved submission to ``results/svm_advanced.csv``
2021-02-16 22:44:13,450; src.training; INFO; Training ``decision_tree`` started
2021-02-16 22:44:13,460; src.training; INFO; Saved the trained model to ``results/min_max.pickle``
2021-02-16 22:44:13,480; src.training; INFO; Training finished
2021-02-16 22:44:13,490; s

### Results iteration 2
- Results on the validation set: accuracy 0.82, precision 0.85, recall 0.68, f1 0.76
- After uploading the `results/advanced_preprocessing_submission.csv` file to Kaggle, the accuracy on the test set was `0.77751`
- While the results on the validation set improved slightly, the test set score, surprisingly, remained exactly the same. I verified that `results/basic_preprocessing_submission.csv` and `results/advanced_preprocessing_submission.csv` actually contain different predictions, so this must have happened by chance
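
Two submissions can differ row by row and still reach the same accuracy, because accuracy only counts how many predictions are correct, not which ones. The sketch below uses made-up labels to show this. It also checks, under the assumption that the score is computed on all 418 test rows, that `0.77751` corresponds to 325 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and two made-up prediction vectors.
y_true = [1, 0, 1, 0, 1, 0]
pred_a = [1, 0, 0, 0, 1, 1]  # wrong at positions 2 and 5
pred_b = [0, 0, 1, 1, 1, 0]  # wrong at positions 0 and 3

acc_a = accuracy(y_true, pred_a)
acc_b = accuracy(y_true, pred_b)
n_different = sum(a != b for a, b in zip(pred_a, pred_b))
print(acc_a, acc_b, n_different)

# Assumption: the score is computed on all 418 test rows.
implied_score = round(325 / 418, 5)
print(implied_score)
```

Here the two prediction vectors disagree on four of six rows but both get four rows right, so their accuracy is identical.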

## Iteration 3
- Build upon iteration 2
- Use hyper-parameter tuning to find optimal values for `C, gamma`
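
A common way to tune `C` and `gamma` for an RBF-kernel SVM is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the grid values are assumptions chosen for illustration, and the real features would come from the preprocessing step instead of `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data; not the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold fits the scaler
# on its own training part only.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Hypothetical grid; the value ranges are assumptions, not recommendations.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1],
}

# Select by accuracy, matching the main metric of the competition.
search = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the grid search already cross-validates, `best_score_` is a mean over folds and is a more stable estimate than a single validation split.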

## TODO: 
- Explain the model used
- Extract numbers and letters from the `Ticket` column
- Use Decision Tree Model
- Use Random Forest Model
- Simple meta modelling
- Precision-recall plot
- Confusion matrix
- Update docstrings
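
For the `Ticket` item in the list above, one possible approach is to split each ticket into an alphanumeric prefix and a trailing number. The function name `split_ticket` and the normalisation rules (strip punctuation, upper-case the prefix) are assumptions made for this sketch.

```python
import re

def split_ticket(ticket: str):
    """Split a ticket string into (prefix, number).

    The prefix is '' when the ticket is purely numeric; the number is None
    when the ticket has no trailing numeric part.
    """
    parts = ticket.strip().split()
    number = parts[-1] if parts and parts[-1].isdigit() else None
    prefix_parts = parts[:-1] if number is not None else parts
    # Drop punctuation such as '/' and '.' and normalise the case.
    prefix = re.sub(r"[^A-Za-z0-9]", "", "".join(prefix_parts)).upper()
    return prefix, (int(number) if number is not None else None)

# Ticket strings in the formats that occur in the Titanic data.
examples = ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282", "LINE"]
parsed = [split_ticket(t) for t in examples]
print(parsed)
```

The prefix could then be one-hot encoded, and the number binned or used as a grouping key for passengers who travelled on the same ticket.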