# Titanic ML Competition
Goal: Predict whether a passenger survives or not

In [1]:
from src.pipeline import MLPipeline

2021-02-16 18:11:22,916; src.utils; INFO; Root logger is set up


## Exploratory Data Analysis
- What kinds of columns does the dataset have? Which ones are numeric, which ones are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? ==> Important for the choice of evaluation metric. If the classes are imbalanced, accuracy can be misleading, and precision, recall and F1 may be more informative
- Gain an understanding of the data
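
The questions above can be answered with a few pandas calls. The following is a minimal sketch on a tiny made-up sample with Titanic-like columns; it is not what `ml_pipeline.run_eda()` does internally (that method generates a profiling report), and the sample values are hypothetical.

```python
import pandas as pd

# Tiny hypothetical sample with Titanic-like columns (not the real data).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 4],
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 35.0],
})

# Which columns are numeric, which ones are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column and number of fully duplicated rows.
missing_per_col = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Relative frequency of each ground-truth class.
class_balance = df["Survived"].value_counts(normalize=True)

print(numeric_cols, categorical_cols)
print(missing_per_col.to_dict(), n_duplicates)
print(class_balance.to_dict())
```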

In [2]:
# Setup
MISSING_VALUE_CONFIG = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
ENCODING_CONFIG = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': {'male': 1, 'female': 0},
    'Embarked': 'one_hot'
}

MODEL_CONFIG = {
    'svm': {'kernel': 'rbf'},
    'decision_tree': {},
    'random_forest': {}
}
ml_pipeline = MLPipeline(
    df_path_train='data/train.csv', 
    df_path_test='data/test.csv', 
    id_col='PassengerId', 
    ground_truth='Survived', 
    missing_value_config=MISSING_VALUE_CONFIG, 
    encoding_config=ENCODING_CONFIG, 
    model_config=MODEL_CONFIG
)

ml_pipeline.run_eda()

2021-02-16 18:11:22,933; src.preprocessing; INFO; Train samples: "891"
2021-02-16 18:11:22,940; src.preprocessing; INFO; Test samples: "418"
2021-02-16 18:11:22,958; src.preprocessing; INFO; Total samples: "1309"
2021-02-16 18:11:22,960; src.preprocessing; INFO; Joined train and test sets together for the preprocessing. For training and testing, they will be separated again
2021-02-16 18:11:22,961; src.preprocessing; INFO; A profiling report was already generated and is located at ``results/ds_profile_report.html``


- The ground truth (`Survived`) is not evenly distributed ==> Also report precision, recall and F1 when evaluating the model. If hyper-parameter tuning is performed, the best model will still be chosen by accuracy, because that is the main evaluation metric of the Kaggle competition
- Survival might depend on socio-economic status, which may be reflected in a person's name or title ==> Try to split the `Name` column into `first_name`, `middle_name`, `last_name` and `title`
- `Age` has many missing values and its distribution looks skewed with some outliers ==> Fill with the median, because in a skewed distribution the median is usually a better representation of a "typical" value than the mean
- Survival might depend on gender, because women and children were supposed to board the lifeboats first
- Extract more information from `Ticket`, such as sections, floors, etc. Passengers from lower decks might have had lower survival chances, because the lower decks could have been flooded first
- 0.2% of the `Fare` values in the test dataset are missing ==> Fill with the median, because the distribution is highly skewed
- More than 75% of the `Cabin` values are missing ==> Either ignore the column completely or fill with the mode
- `Embarked` has a few missing values ==> Fill with the mode
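
Splitting the name column could look roughly like the sketch below. It assumes that names follow the `Last, Title. First Middle` pattern; the helper `split_name` is hypothetical and not part of `src`. The title is lower-cased here only for convenience.

```python
import re


def split_name(name: str) -> dict:
    """Split a 'Last, Title. First Middle' string into its parts (hypothetical helper)."""
    last_name, _, rest = name.partition(",")
    # The title is the first word that is directly followed by a period.
    match = re.search(r" ([A-Za-z]+)\.", name)
    title = match.group(1).lower() if match else ""
    # Take everything after the title and drop any parenthesised alternative name.
    given = re.sub(r"\(.*?\)", "", rest.split(".", 1)[-1]).split()
    return {
        "last_name": last_name.strip(),
        "title": title,
        "first_name": given[0] if given else "",
        "middle_name": " ".join(given[1:]),
    }


for example in ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]:
    print(split_name(example))
```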

## Iteration 1
- Fill `Age` and `Fare` with the median
- Fill `Embarked` with the mode
- Encode `Pclass` ordinally such that class 1 (for the rich and famous people) has the highest value and class 3 (for working class people) has the lowest value
- Encode `Sex` as a binary variable (`male` = 1, `female` = 0) and one-hot encode `Embarked`
- Only use the following predictors: `Pclass, Sex, Age, SibSp, Parch, Fare, Embarked`
- Use an SVM with default parameters (RBF kernel)
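
The filling and encoding steps listed above can be sketched in plain pandas as follows. This only illustrates the idea on a hypothetical four-row sample; the actual logic lives in `ds.do_basic_preprocessing` and may differ.

```python
import pandas as pd

# Hypothetical mini-sample; the real pipeline works on data/train.csv and data/test.csv.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", "S", None],
})

# Fill missing values: median for Age, mode for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Custom mappings (original_value: encoding_value).
df["Pclass"] = df["Pclass"].map({1: 3, 2: 2, 3: 1})
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

# One-hot encode Embarked and append the indicator columns.
one_hot = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
df = pd.concat([df, one_hot], axis=1)
print(df)
```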

In [3]:
# Configurations
col_name_to_fill_method = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
col_name_to_encoding = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': {'male': 1, 'female': 0},
    'Embarked': 'one_hot'
}

# Preprocessing
ds.do_basic_preprocessing(col_name_to_fill_method, col_name_to_encoding)
cols_to_drop = ['Name', 'Ticket', 'Cabin', 'Embarked']
train_df = ds.select(cols_to_drop, mode='training')
display(train_df)
test_df = ds.select(cols_to_drop, mode='testing')

# Training, predicting and evaluating
model = Model(
    model_name='svm',
    model_path='results/svm_model.pickle', 
    ground_truth='Survived', 
    id_col_name='PassengerId',
    scaling_mode='min_max',
    scaler_path='results/min_max_scaler.pickle',
    kernel='rbf'
)
model.train_and_evaluate(train_df)
model.gen_submission_file(test_df, submission_path='results/basic_preprocessing_submission.csv')

2021-02-16 18:11:23,007; src.preprocessing; INFO; The median of column ``Age`` equals: 28.0
2021-02-16 18:11:23,013; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-16 18:11:23,020; src.preprocessing; INFO; The median of column ``Fare`` equals: 14.4542
2021-02-16 18:11:23,032; src.preprocessing; INFO; Converted column ``Pclass`` using the custom mapping ``{1: 3, 2: 2, 3: 1}``
2021-02-16 18:11:23,041; src.preprocessing; INFO; Converted column ``Sex`` using the custom mapping ``{'male': 1, 'female': 0}``
2021-02-16 18:11:23,061; src.preprocessing; INFO; One-hot encoded the column ``Embarked``
2021-02-16 18:11:23,062; src.preprocessing; INFO; Preprocessing finished
2021-02-16 18:11:23,068; src.preprocessing; INFO; Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']


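| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 1 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 1 |
| 1 | 2 | 1.0 | 3 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 |
| 2 | 3 | 1.0 | 1 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1.0 | 3 | 0 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0.0 | 1 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0.0 | 2 | 1 | 27.0 | 0 | 0 | 13.0000 | 0 | 0 | 1 |
| 887 | 888 | 1.0 | 3 | 0 | 19.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 |
| 888 | 889 | 0.0 | 1 | 0 | 28.0 | 1 | 2 | 23.4500 | 0 | 0 | 1 |
| 889 | 890 | 1.0 | 3 | 1 | 26.0 | 0 | 0 | 30.0000 | 1 | 0 | 0 |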


2021-02-16 18:11:23,125; src.training; INFO; Training ``svm`` started
2021-02-16 18:11:23,129; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-16 18:11:23,148; src.training; INFO; Training finished
2021-02-16 18:11:23,158; src.training; INFO; Results on the validation set: accuracy 0.80, precision 0.86, recall 0.61, f1 0.72
2021-02-16 18:11:23,159; src.training; INFO; Saved the trained model to ``results/svm_model.pickle``
2021-02-16 18:11:23,164; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-16 18:11:23,166; src.training; INFO; Generating predictions on the test set
2021-02-16 18:11:23,177; src.training; INFO; Saved submission to ``results/basic_preprocessing_submission.csv``


### Results iteration 1
- Results on the validation set: accuracy 0.80, precision 0.86, recall 0.61, f1 0.72
- After uploading our submission file `results/basic_preprocessing_submission.csv` to Kaggle, the accuracy on the test set was `0.77751`
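
For reference, the four validation metrics reported above all derive from the counts of true/false positives and negatives. The sketch below computes them by hand on made-up labels; `classification_metrics` is a hypothetical helper and not the function used in `src.training`. A precision that is clearly higher than the recall, as in the results above, means the model rarely predicts survival wrongly but misses a fair share of the actual survivors.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 survivors, 6 non-survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 2) for name, value in metrics.items()})
```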

## Iteration 2
- Build upon iteration 1
- Use feature engineering ideas from https://www.kaggle.com/imoore/titanic-the-only-notebook-you-need-to-see
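
A rough sketch of how a few such engineered features could be derived is shown below. The column names `name_len`, `family_size`, `is_alone` and `has_cabin` match those that appear in the preprocessing log, but the formulas are assumptions based on common practice for this dataset, not a copy of `ds.do_advanced_preprocessing`. The two sample rows are made up.

```python
import pandas as pd

# Two made-up passengers with Titanic-like columns.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Cabin": [None, "C85"],
})

# Length of the full name string.
df["name_len"] = df["Name"].str.len()
# Siblings/spouses + parents/children + the passenger themselves.
df["family_size"] = df["SibSp"] + df["Parch"] + 1
# 1 if the passenger travels without any family, else 0.
df["is_alone"] = (df["family_size"] == 1).astype(int)
# 1 if a cabin is recorded, else 0 (avoids filling the mostly missing column).
df["has_cabin"] = df["Cabin"].notna().astype(int)
print(df)
```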

In [4]:
# Preprocessing
ds.do_advanced_preprocessing()
cols_to_drop += ['title']
train_df = ds.select(cols_to_drop, mode='training')
display(train_df)
test_df = ds.select(cols_to_drop, mode='testing')

# Training, predicting and evaluating
model = Model(
    model_name='svm',
    model_path='results/svm_model.pickle', 
    ground_truth='Survived', 
    id_col_name='PassengerId',
    scaling_mode='min_max',
    scaler_path='results/min_max_scaler.pickle',
    kernel='rbf'
)
model.train_and_evaluate(train_df)
model.gen_submission_file(test_df, submission_path='results/advanced_preprocessing_submission.csv')

2021-02-16 18:11:23,223; src.preprocessing; INFO; Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'name_len', 'family_size', 'has_cabin', 'is_alone', 'fare_category', 'age_category', 'title', 'title_capt', 'title_col', 'title_countess', 'title_don', 'title_dona', 'title_dr', 'title_jonkheer', 'title_lady', 'title_major', 'title_master', 'title_miss', 'title_mlle', 'title_mme', 'title_mr', 'title_mrs', 'title_ms', 'title_rev', 'title_sir']


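| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | ... | title_major | title_master | title_miss | title_mlle | title_mme | title_mr | title_mrs | title_ms | title_rev | title_sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 1 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 1.0 | 3 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 1.0 | 1 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 1.0 | 3 | 0 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | 0.0 | 1 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0.0 | 2 | 1 | 27.0 | 0 | 0 | 13.0000 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 887 | 888 | 1.0 | 3 | 0 | 19.0 | 0 | 0 | 30.0000 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 888 | 889 | 0.0 | 1 | 0 | 28.0 | 1 | 2 | 23.4500 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 889 | 890 | 1.0 | 3 | 1 | 26.0 | 0 | 0 | 30.0000 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |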


2021-02-16 18:11:23,278; src.training; INFO; Training ``svm`` started
2021-02-16 18:11:23,286; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-16 18:11:23,308; src.training; INFO; Training finished
2021-02-16 18:11:23,318; src.training; INFO; Results on the validation set: accuracy 0.82, precision 0.85, recall 0.68, f1 0.76
2021-02-16 18:11:23,320; src.training; INFO; Saved the trained model to ``results/svm_model.pickle``
2021-02-16 18:11:23,324; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-16 18:11:23,325; src.training; INFO; Generating predictions on the test set
2021-02-16 18:11:23,339; src.training; INFO; Saved submission to ``results/advanced_preprocessing_submission.csv``


### Results iteration 2
- Results on the validation set: accuracy 0.82, precision 0.85, recall 0.68, f1 0.76
- After uploading the `results/advanced_preprocessing_submission.csv` file to Kaggle, the accuracy on the test set was `0.77751`
- While the results on the validation set improved slightly, the test set score surprisingly remained exactly the same. I verified that `results/basic_preprocessing_submission.csv` and `results/advanced_preprocessing_submission.csv` actually contain different predictions, so this must have happened by chance
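
Two submissions can reach the same accuracy with different predictions when the changed predictions cancel out, i.e. as many passengers flip from wrong to right as from right to wrong. A check like the one described above could be sketched as follows; `count_differing_predictions` is a hypothetical helper, and in-memory strings stand in for the two CSV files.

```python
import csv
import io


def count_differing_predictions(file_a, file_b, id_col="PassengerId", label_col="Survived"):
    """Count passengers whose predicted label differs between two submission files."""
    preds_a = {row[id_col]: row[label_col] for row in csv.DictReader(file_a)}
    preds_b = {row[id_col]: row[label_col] for row in csv.DictReader(file_b)}
    shared_ids = preds_a.keys() & preds_b.keys()
    return sum(preds_a[pid] != preds_b[pid] for pid in shared_ids)


# Made-up stand-ins for the two submission files.
basic = io.StringIO("PassengerId,Survived\n892,0\n893,1\n894,0\n")
advanced = io.StringIO("PassengerId,Survived\n892,0\n893,0\n894,1\n")
n_different = count_differing_predictions(basic, advanced)
print(n_different)  # prints 2
```

With real files, the same function can be called on handles opened via `open(path, newline="")`.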

## Iteration 3
- Build upon iteration 2
- Use hyper-parameter tuning to find optimal values for `C, gamma`
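
One possible way to do this is a cross-validated grid search with scikit-learn, sketched below. The grid values, the synthetic data from `make_classification` and the use of `GridSearchCV` are assumptions for illustration; the project's `Model` class may implement tuning differently. Accuracy is used for model selection, in line with the competition metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling is part of the pipeline so it is re-fitted on each training fold.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Hypothetical search grid; step names come from make_pipeline (lower-cased class names).
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```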

## TODO: 
- Explain the model used
- Extract numbers and letters from the `Ticket` column
- Use Decision Tree Model
- Use Random Forest Model
- Simple meta modelling
- Precision-recall plot
- Confusion matrix
- Update docstrings