# Titanic ML Competition
Goal: Predict whether a passenger survives or not

In [1]:
import logging
from src.preprocessing import Dataset
from src.training import Model
from src.utils import set_root_logger
from IPython.display import display, HTML

logger = logging.getLogger(__name__)
set_root_logger()

2021-02-15 21:38:09,907; src.utils; INFO; Root logger is set up


## Exploratory Data Analysis
- What kinds of columns does the dataset have? Which ones are numeric and which are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? ==> Important for the choice of evaluation metric. If the classes are imbalanced, accuracy can be misleading, and precision, recall, and F1 may be more informative
- Gain an understanding of the data
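
The questions above can be answered with a few pandas calls. The following is a minimal sketch on a tiny stand-in frame, not the profiling report that `ds.profile` generates; the sample rows are made up for illustration, and only the column names are taken from the dataset.

```python
import pandas as pd

# Tiny stand-in frame (illustrative values only); with the real data this
# would come from pd.read_csv on the training file.
df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, None, 35.0],
})

# Which columns are numeric, which are categorical?
numeric_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

# Missing values per column and number of fully duplicated rows
missing_counts = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Relative class frequencies of the ground truth
class_shares = df['Survived'].value_counts(normalize=True)

print('numeric:', numeric_cols)
print('categorical:', categorical_cols)
print(missing_counts)
print('duplicates:', n_duplicates)
print(class_shares)
```

If `class_shares` is far from an even split, accuracy alone gives an overly optimistic picture of the model.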

In [2]:
# Exploratory data analysis
ds = Dataset(
    df_path_train='/home/kevinsuedmersen/dev/titanic/data/train.csv',
    df_path_test='/home/kevinsuedmersen/dev/titanic/data/test.csv',
    id_col='PassengerId',
    ground_truth='Survived'
)
ds.profile(title='ds_profile_report', html_path='results/ds_profile_report.html')

2021-02-15 21:38:09,933; src.preprocessing; INFO; Train samples: "891"
2021-02-15 21:38:09,950; src.preprocessing; INFO; Test samples: "418"
2021-02-15 21:38:09,989; src.preprocessing; INFO; Total samples: "1309"
2021-02-15 21:38:09,993; src.preprocessing; INFO; Joined train and test sets together for the preprocessing. For training and testing, they will be separated again
2021-02-15 21:38:09,996; src.preprocessing; INFO; Generating the profiling report


2021-02-15 21:38:25,668; src.preprocessing; INFO; Saved the pandas-profiling report to ``results/ds_profile_report.html``


- The ground truth (`Survived`) is not evenly distributed. ==> Also use precision, recall, and F1 when evaluating the model. If hyper-parameter tuning is performed, the best model will still be chosen by accuracy, because that is the main evaluation metric of the Kaggle competition
- Survival might depend on socio-economic status, which may be reflected in a passenger's name or title. ==> Try splitting the `Name` column into `first_name`, `middle_name`, `last_name`, and `title`
- `Age` has many missing values, and its distribution looks skewed with some outliers. ==> Fill with the median, because in a skewed distribution the median is usually a better representation of a "typical" value than the mean
- Survival might depend on gender, because women and children were supposed to board the lifeboats first
- Extract more information from `Ticket`, such as sections or floors. Passengers from lower decks might have had lower survival chances, because lower decks could have flooded first
- 0.2% of the `Fare` values in the test dataset are missing. ==> Fill with the median, because the distribution is highly skewed
- More than 75% of the `Cabin` values are missing. ==> Either ignore the column completely or fill with the mode
- `Embarked` has few missing values. ==> Fill with the mode
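
Splitting the name column can be sketched with pandas string methods. This is an illustrative sketch, not the code in `src.preprocessing`: it assumes names follow the `Last, Title. First ...` pattern, and the sample names and the regular expression are assumptions made for the example.

```python
import pandas as pd

# Sample names in the assumed "Last, Title. First ..." format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# The last name is everything before the first comma
last_name = names.str.split(',').str[0].str.strip()

# The title is the text between the comma and the next period
title = (
    names.str.extract(r',\s*([^.]+)\.', expand=False)
    .str.strip()
    .str.lower()
)

print(last_name.tolist())
print(title.tolist())
```

Rare titles would likely need to be grouped or cleaned further before encoding; that step is left out here.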

## Iteration 1
- Fill `Age` with the median
- Fill `Fare` with the median
- Fill `Embarked` with the mode
- Label-encode `Pclass` such that class 1 (for the rich and famous) has the highest value and class 3 (for working-class passengers) has the lowest value
- Encode `Sex` as a binary variable (`male`: 1, `female`: 0) and one-hot encode `Embarked`
- Only use the following predictors: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, and the one-hot encoded `Embarked` columns
- Use an SVM with an RBF kernel and otherwise default parameters, trained on min-max scaled features
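
For readers without access to the `src` package, the plan above could look roughly like this with plain pandas and scikit-learn. This is a sketch under assumptions, not the implementation behind `Dataset` and `Model`: the toy rows are made up, and the pipeline layout is one possible choice.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Toy rows (illustrative values only) using the dataset's column names
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, 26.0, None, 35.0, None],
    'Embarked': ['S', 'C', 'S', None, 'Q', 'S'],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Impute: median for the skewed numeric column, mode for the categorical one
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode: ordinal mapping for Pclass, binary mapping for Sex, one-hot for Embarked
df['Pclass'] = df['Pclass'].map({1: 3, 2: 2, 3: 1})
df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})
df = pd.get_dummies(df, columns=['Embarked'], dtype=int)

X = df.drop(columns='Survived')
y = df['Survived']

# Min-max scaling followed by an RBF-kernel SVM
clf = make_pipeline(MinMaxScaler(), SVC(kernel='rbf'))
clf.fit(X, y)
preds = clf.predict(X)
print(preds)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking validation statistics into training when the pipeline is used with a train/validation split.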

In [3]:
# Configurations
col_name_to_fill_method = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
col_name_to_encoding = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': {'male': 1, 'female': 0},
    'Embarked': 'one_hot'
}

# Preprocessing
ds.do_basic_preprocessing(col_name_to_fill_method, col_name_to_encoding)
cols_to_drop = ['Name', 'Ticket', 'Cabin', 'Embarked']
train_df = ds.select(cols_to_drop, mode='training')
test_df = ds.select(cols_to_drop, mode='testing')

# Training, predicting and evaluating
model = Model(
    model_name='svm',
    model_path='results/svm_model.pickle', 
    ground_truth='Survived', 
    id_col_name='PassengerId',
    scaling_mode='min_max',
    scaler_path='results/min_max_scaler.pickle',
    kernel='rbf'
)
model.train_and_evaluate(train_df)
model.gen_submission_file(test_df, submission_path='results/basic_preprocessing_submission.csv')

2021-02-15 21:38:25,689; src.preprocessing; INFO; The median of column ``Age`` equals: 28.0
2021-02-15 21:38:25,692; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-15 21:38:25,701; src.preprocessing; INFO; The median of column ``Fare`` equals: 14.4542
2021-02-15 21:38:25,707; src.preprocessing; INFO; Converted column ``Pclass`` using the custom mapping ``{1: 3, 2: 2, 3: 1}``
2021-02-15 21:38:25,710; src.preprocessing; INFO; Converted column ``Sex`` using the custom mapping ``{'male': 1, 'female': 0}``
2021-02-15 21:38:25,725; src.preprocessing; INFO; One-hot encoded the column ``Embarked``
2021-02-15 21:38:25,729; src.preprocessing; INFO; Preprocessing finished
2021-02-15 21:38:25,730; src.preprocessing; INFO; Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
2021-02-15 21:38:25,747; src.training; INFO; Training ``svm`` started
2021

### Results iteration 1
After uploading our submission file `results/basic_preprocessing_submission.csv` to Kaggle, the accuracy on the test set was `0.77751`

## Iteration 2
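- Use the same configuration as in iteration 1
- Additionally run `ds.do_advanced_preprocessing()`, which adds engineered columns such as `name_len`, `family_size`, `has_cabin`, `is_alone`, `fare_category`, `age_category`, and a one-hot encoded `title` (see the available columns in the log output of the next cell)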
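
The iteration 2 log lists engineered columns such as `family_size`, `is_alone`, `has_cabin`, and `name_len`. How `ds.do_advanced_preprocessing()` computes them is not shown in this notebook, so the definitions below are assumptions based on the column names, sketched on made-up rows.

```python
import pandas as pd

# Toy rows (illustrative values only) using the dataset's column names
df = pd.DataFrame({
    'Name': ['Cumings, Mrs. John Bradley', 'Heikkinen, Miss. Laina', 'Palsson, Master. Gosta Leonard'],
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Cabin': ['C85', None, None],
})

# Assumed definition: siblings/spouses + parents/children + the passenger
df['family_size'] = df['SibSp'] + df['Parch'] + 1

# Assumed definition: travelling alone means a family size of one
df['is_alone'] = (df['family_size'] == 1).astype(int)

# Assumed definition: 1 if a cabin is recorded, 0 if it is missing
df['has_cabin'] = df['Cabin'].notna().astype(int)

# Assumed definition: number of characters in the full name
df['name_len'] = df['Name'].str.len()

print(df[['family_size', 'is_alone', 'has_cabin', 'name_len']])
```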

In [4]:
# Preprocessing
ds.do_advanced_preprocessing()
cols_to_drop += ['title']
train_df = ds.select(cols_to_drop, mode='training')
test_df = ds.select(cols_to_drop, mode='testing')

# Training, predicting and evaluating
model = Model(
    model_name='svm',
    model_path='results/svm_model.pickle', 
    ground_truth='Survived', 
    id_col_name='PassengerId',
    scaling_mode='min_max',
    scaler_path='results/min_max_scaler.pickle',
    kernel='rbf'
)
model.train_and_evaluate(train_df)
model.gen_submission_file(test_df, submission_path='results/advanced_preprocessing_submission.csv')

2021-02-15 21:38:25,891; src.preprocessing; INFO; Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'name_len', 'family_size', 'has_cabin', 'is_alone', 'fare_category', 'age_category', 'title', 'title_capt', 'title_col', 'title_countess', 'title_don', 'title_dona', 'title_dr', 'title_jonkheer', 'title_lady', 'title_major', 'title_master', 'title_miss', 'title_mlle', 'title_mme', 'title_mr', 'title_mrs', 'title_ms', 'title_rev', 'title_sir']
2021-02-15 21:38:25,896; src.training; INFO; Training ``svm`` started
2021-02-15 21:38:25,902; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-15 21:38:25,927; src.training; INFO; Training finished
2021-02-15 21:38:25,942; src.training; INFO; Results on the validation set: accuracy 0.82, precision 0.85, recall 0.68, f1 0.76
2021-02-15 21:38:25,944; src.training; INFO; Saved the train

### Results iteration 2
After uploading the `results/advanced_preprocessing_submission.csv` file to Kaggle, the accuracy on the test set was ``
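
As a quick sanity check on the validation metrics logged for this iteration (precision 0.85, recall 0.68, F1 0.76): F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two. The logged values are already rounded, so the check is only approximate.

```python
# Values copied from the iteration 2 validation log
precision = 0.85
recall = 0.68

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.76, matching the logged F1
```

The gap between precision and recall shows that the model misses a fair share of actual survivors, which accuracy alone would not reveal.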

## TODO: 
- Explain the model used
- Feature engineering
- Precision-recall plot
- Confusion matrix
- Update docstrings