# Titanic ML Competition
Goal: Predict whether a given passenger survived

In [1]:
import logging
from src.preprocessing import Dataset
from src.training import Model
from src.utils import set_root_logger
from IPython.display import display, HTML

logger = logging.getLogger(__name__)
set_root_logger()

2021-02-14 20:58:16,577; src.utils; INFO; Root logger is set up


## Exploratory Data Analysis
- What kinds of columns does the dataset have? Which ones are numeric, and which ones are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? ==> Important for the choice of evaluation metric. If the classes are imbalanced, accuracy can be misleading, and precision, recall and F1 may be more informative
- Gain an understanding of the data
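
The questions above can be answered with a few pandas calls. The following is only a minimal sketch: the small `df` is made-up sample data with Titanic-like column names, not the real training set, and the profiling report generated by `Dataset.profile` may present the same information differently.

```python
import pandas as pd

# Hypothetical sample with Titanic-like columns (not the real data)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 4],
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, None, 35.0, 35.0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", None, "Q", "Q"],
})

# Which columns are numeric, which are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column
missing_counts = df.isna().sum()

# Number of fully duplicated rows
n_duplicates = int(df.duplicated().sum())

# Ground-truth distribution as relative frequencies
class_balance = df["Survived"].value_counts(normalize=True)

print(numeric_cols, categorical_cols)
print(missing_counts)
print(n_duplicates)
print(class_balance)
```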

In [2]:
train_ds = Dataset(df_path='/home/kevinsuedmersen/dev/titanic/data/train.csv')
train_ds.profile(title='train_ds_profile', html_path='results/train_ds_profiling.html')
test_ds = Dataset(df_path='/home/kevinsuedmersen/dev/titanic/data/test.csv')
test_ds.profile(title='test_ds_profile', html_path='results/test_ds_profiling.html')

2021-02-14 20:58:16,596; src.preprocessing; INFO; Read dataframe from ``/home/kevinsuedmersen/dev/titanic/data/train.csv`` into memory
2021-02-14 20:58:16,598; src.preprocessing; INFO; A profiling report was already generated and is located at ``results/train_ds_profiling.html``
2021-02-14 20:58:16,611; src.preprocessing; INFO; Read dataframe from ``/home/kevinsuedmersen/dev/titanic/data/test.csv`` into memory
2021-02-14 20:58:16,614; src.preprocessing; INFO; A profiling report was already generated and is located at ``results/test_ds_profiling.html``


- Ground truth (`Survived`) is not evenly distributed ==> Also use precision, recall and F1 when evaluating the model. Should hyper-parameter tuning be performed, the best model will still be chosen based on accuracy, because this is the main evaluation metric used in the Kaggle competition
- Survival might depend on socio-economic status, which might be reflected in the person's name or title. ==> Try to split the name column into `first_name`, `middle_name`, `last_name`, `title`
- Lots of missing values in `Age`, and the distribution looks skewed with some outliers. ==> Fill with median, because in skewed distributions the median might be a better representation of a "common" value
- Survival might depend on gender, because women and children were supposed to board the lifeboats first
- Extract some more information from `Ticket`, such as sections, floors, etc. Passengers from lower decks might have lower survival chances, because lower decks could have been flooded first
- 0.2% missing values in `Fare` in the test dataset ==> Fill with median, because its distribution is highly skewed
- More than 75% missing values in `Cabin` ==> Either ignore that column completely, or fill with mode
- Few missing values in `Embarked`. ==> Fill with mode
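
The idea of splitting the name column could be sketched as follows. This assumes the names follow the layout `<last_name>, <title>. <given names>` (for example `Braund, Mr. Owen Harris`); the function `split_name` and the regular expression are illustrative and not part of `src.preprocessing`.

```python
import re

# Assumed layout: "<last_name>, <title>. <given names>"
NAME_PATTERN = re.compile(
    r"^\s*(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<given>.*)$"
)


def split_name(name):
    """Split a raw name into last_name, title, first_name and middle_name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        # The name does not follow the assumed layout
        return {"last_name": None, "title": None,
                "first_name": None, "middle_name": None}
    given = match.group("given").split()
    return {
        "last_name": match.group("last_name").strip(),
        "title": match.group("title").strip(),
        "first_name": given[0] if given else None,
        # Everything after the first given name is treated as middle name(s)
        "middle_name": " ".join(given[1:]) or None,
    }


parts = split_name("Braund, Mr. Owen Harris")
print(parts)
```

Names with additional parts, such as maiden names in parentheses, end up in `middle_name` with this simple rule and would probably need extra handling.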

## Iteration 1
- Fill `Age` with median
- Fill `Embarked` with mode
- Fill `Fare` with median (only missing in the test dataset)
- Label-encode `Pclass` so that class 1 (for the rich and famous) has the highest value and class 3 (for working-class people) has the lowest value
- One-hot encode the following categorical variables: `Sex, Embarked`
- Only use the following predictors: `Pclass, Sex, Age, SibSp, Parch, Fare, Embarked`
- Use an SVM (RBF kernel, min-max scaled features) with default parameters
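
For illustration, the preprocessing steps listed above might look roughly like this in plain pandas. This is a sketch under assumptions, not the implementation behind `Dataset.do_basic_preprocessing`: the function `basic_preprocess` and the `sample` frame are hypothetical, and the fill values are computed from whichever frame is passed in.

```python
import pandas as pd


def basic_preprocess(df):
    """Fill missing values, re-map Pclass and one-hot encode Sex and Embarked."""
    out = df.copy()
    # Median for the skewed numeric columns
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    # Mode (most frequent value) for the categorical column
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    # Custom mapping: original_value -> encoding_value
    out["Pclass"] = out["Pclass"].map({1: 3, 2: 2, 3: 1})
    # One-hot encoding creates columns such as Sex_female or Embarked_S
    return pd.get_dummies(out, columns=["Sex", "Embarked"], dtype=int)


# Hypothetical sample rows, not the real data
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "Fare": [7.25, 71.2833, 7.925, None],
    "Embarked": ["S", "C", None, "S"],
})
processed = basic_preprocess(sample)
print(processed)
```

Note that `pd.get_dummies` only creates columns for categories present in the frame, so a small sample like this one has no `Embarked_Q` column.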

In [3]:
# Basic preprocessing configuration
predictors = [
    'Pclass', 
    'Age', 
    'SibSp', 
    'Parch', 
    'Fare', 
    'Sex_female', 
    'Sex_male', 
    'Embarked_C', 
    'Embarked_Q', 
    'Embarked_S'
]
col_name_to_fill_method = {
    'Age': 'median', 
    'Embarked': 'mode',
    'Fare': 'median'
}
col_name_to_encoding = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': 'one_hot',
    'Embarked': 'one_hot'
}

# Preprocess the training data and get a subset of predictors
train_ds.do_basic_preprocessing(col_name_to_fill_method, col_name_to_encoding)
train_df = train_ds.get_df_subset(predictors, id_col='PassengerId', ground_truth='Survived')

# Preprocess the test data and get a subset of predictors
test_ds.do_basic_preprocessing(col_name_to_fill_method, col_name_to_encoding)
test_df = test_ds.get_df_subset(predictors, id_col='PassengerId')

# Display the training data
display(train_df)

2021-02-14 20:58:16,635; src.preprocessing; INFO; The median of column ``Age`` equals: 28.0
2021-02-14 20:58:16,643; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-14 20:58:16,652; src.preprocessing; INFO; The median of column ``Fare`` equals: 14.4542
2021-02-14 20:58:16,661; src.preprocessing; INFO; Converted column ``Pclass`` using the custom mapping ``{1: 3, 2: 2, 3: 1}``
2021-02-14 20:58:16,685; src.preprocessing; INFO; One-hot encoded the column ``Sex``
2021-02-14 20:58:16,701; src.preprocessing; INFO; One-hot encoded the column ``Embarked``
2021-02-14 20:58:16,705; src.preprocessing; INFO; Preprocessing finished
2021-02-14 20:58:16,714; src.preprocessing; INFO; The median of column ``Age`` equals: 27.0
2021-02-14 20:58:16,717; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-14 20:58:16,724; src.preprocessing; INFO; The median of column ``Fare`` equals: 14.4542
2021-02-14 20:58:16,727; src.preprocessing; INFO; Converted column

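| | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Survived | PassengerId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 3 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 | 1 | 2 |
| 2 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 | 1 | 3 |
| 3 | 3 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 | 1 | 4 |
| 4 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 | 0 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 1 | 0 | 0 | 1 | 0 | 887 |
| 887 | 3 | 19.0 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 | 1 | 888 |
| 888 | 1 | 28.0 | 1 | 2 | 23.4500 | 1 | 0 | 0 | 0 | 1 | 0 | 889 |
| 889 | 3 | 26.0 | 0 | 0 | 30.0000 | 0 | 1 | 1 | 0 | 0 | 1 | 890 |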


In [4]:
# Train the model and generate the submission file
model = Model(
    model_name='svm',
    model_path='results/svm_model.pickle', 
    ground_truth='Survived', 
    id_col_name='PassengerId',
    scaling_mode='min_max',
    scaler_path='results/min_max_scaler.pickle',
    kernel='rbf'
)
model.train_and_evaluate(train_df)
model.gen_submission_file(test_df, submission_path='results/submission.csv')

2021-02-14 20:58:16,792; src.training; INFO; Training ``svm`` started
2021-02-14 20:58:16,801; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-14 20:58:16,825; src.training; INFO; Training finished
2021-02-14 20:58:16,841; src.training; INFO; Results on the validation set: accuracy 0.7985074626865671, precision 0.8607594936708861, recall 0.6126126126126126, f1 0.7157894736842104
2021-02-14 20:58:16,847; src.training; INFO; Saved the trained model to ``results/svm_model.pickle``
2021-02-14 20:58:16,852; src.training; INFO; Saved the trained model to ``results/min_max_scaler.pickle``
2021-02-14 20:58:16,854; src.training; INFO; Generating predictions on the test set
2021-02-14 20:58:16,869; src.training; INFO; Saved submission to ``results/submission.csv``
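
What `Model.train_and_evaluate` does internally is not shown in this notebook. As a rough sketch of the same idea in scikit-learn, the following trains an RBF-kernel SVM on min-max scaled features and reports accuracy, precision, recall and F1 on a held-out validation set. The synthetic data from `make_classification`, the 30% validation split and the random seeds are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed predictors and the ground truth
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a validation set (the split size is an assumption)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The scaler is fitted on the training data only and reused for validation
clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_val)

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
}
print(metrics)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling parameters tied to the training data, so the same transformation is applied when predicting on new data.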


## Iteration 2
- Use the same configuration as in Iteration 1

## TODO: 
- Explain the model used
- Feature engineering
- Precision-recall plot
- Confusion matrix