# Titanic ML Competition
Goal: Predict whether a passenger survived.

In [1]:
from src.preprocessing import Dataset
from src.utils import set_root_logger
from IPython.display import display, HTML

In [2]:
set_root_logger()
train_ds = Dataset(
    df_path='/home/kevinsuedmersen/dev/titanic/data/train.csv',
    ground_truth='Survived',
    id_col='PassengerId'
)

2021-02-14 16:04:54,539; src.utils; INFO; Root logger is set up
2021-02-14 16:04:54,551; src.preprocessing; INFO; Read dataframe from ``/home/kevinsuedmersen/dev/titanic/data/train.csv`` into memory


## Exploratory Data Analysis
- What kinds of columns does the dataset have? Which ones are numeric and which are categorical?
- Are there missing values?
- Are there duplicates?
- How is the ground truth distributed? ==> Important for the choice of evaluation metric. If the classes are imbalanced, accuracy can be misleading, and precision, recall and F1 may be more informative
- Gain an understanding of the data ("Datenverständnis")
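The questions above can also be answered by hand with a few pandas calls. The following is a minimal sketch on a made-up toy frame (the rows are illustrative and not taken from `train.csv`); it uses plain pandas and is not the code behind `Dataset.profile`.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from train.csv.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 4],
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, None, 26.0, 35.0, 35.0],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Which columns are numeric, which are categorical?
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
missing_per_col = toy_df.isna().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy_df.duplicated().sum())

# Share of each class in the ground truth column
target_share = toy_df["Survived"].value_counts(normalize=True)

print(numeric_cols, categorical_cols)
print(missing_per_col)
print(n_duplicates)
print(target_share)
```

If one class clearly dominates `target_share`, a model that always predicts that class already reaches a high accuracy, which is why the additional metrics are worth reporting.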

In [3]:
train_ds.profile(html_path='results/train_ds_profiling.html')

2021-02-14 16:04:54,563; src.preprocessing; INFO; A profiling report was already generated and is located at ``results/train_ds_profiling.html``


- The ground truth (`Survived`) is not evenly distributed ==> Also report precision, recall and F1 when evaluating the model. If hyper-parameter tuning is performed, the best model will still be chosen by accuracy, because that is the main evaluation metric of the Kaggle competition
- Survival might depend on socio-economic status, which may be reflected in a person's name or title ==> Try to split the name column into `first_name`, `middle_name`, `last_name` and `title`
- Many missing values in `Age`, and the distribution looks skewed with some outliers ==> Fill with the median, because in a skewed distribution the median is usually a better representation of a "typical" value than the mean
- Survival might depend on gender and age, because women and children were supposed to board the lifeboats first
- Extract more information from `Ticket`, such as sections, floors, etc. Passengers on lower decks might have had lower survival chances, because the lower decks could have flooded first
- 77.1% missing values in `Cabin` ==> Either ignore the column completely or fill it with the mode
- Few missing values in `Embarked`. ==> Fill with mode
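One possible way to split the name column is a regular expression. The sketch below assumes names follow the layout `<last_name>, <title>. <first_name> <middle names>`; both that layout and the helper `split_name` are assumptions made for illustration and are not part of `src.preprocessing`.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<last_name>, <title>. <first_name> [<middle names>]"
NAME_PATTERN = re.compile(
    r"^\s*(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<rest>.*)$"
)


def split_name(full_name: str) -> Dict[str, Optional[str]]:
    """Hypothetical helper: split a full name into its parts.

    Returns None for every part if the name does not match the assumed layout.
    """
    match = NAME_PATTERN.match(full_name)
    if match is None:
        return {"last_name": None, "title": None,
                "first_name": None, "middle_name": None}
    given_names = match.group("rest").split()
    return {
        "last_name": match.group("last_name").strip(),
        "title": match.group("title").strip(),
        "first_name": given_names[0] if given_names else None,
        # Everything after the first given name is treated as middle name(s)
        "middle_name": " ".join(given_names[1:]) or None,
    }


parts = split_name("Braund, Mr. Owen Harris")
print(parts)
```

Names that deviate from the assumed layout (for example, with nicknames in parentheses or quotes) would need extra handling before the resulting `title` could be used as a feature.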

## Iteration 1
- Fill `Age` with median
- Fill `Embarked` with mode
- Encode `Pclass` with a custom ordinal mapping such that class 1 (the wealthiest passengers) gets the highest value and class 3 (working-class passengers) gets the lowest value
- One-hot encode the following categorical variables: `Sex`, `Embarked`
- Only use the following predictors: `Pclass`, `Sex`, `Age`, `SibSp`, `Parch`, `Fare`, `Embarked`
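For reference, the steps listed above could look roughly like this in plain pandas. This is a sketch on made-up rows, not the implementation of `preprocess_df_iteration_1` (whose source is not shown here); only the column names and the `Pclass` mapping are taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from train.csv.
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Fill missing ages with the median and missing ports with the mode
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode().iloc[0])

# Custom ordinal mapping: original_value -> encoding_value
toy_df["Pclass"] = toy_df["Pclass"].map({1: 3, 2: 2, 3: 1})

# One-hot encode the categorical columns
encoded_df = pd.get_dummies(toy_df, columns=["Sex", "Embarked"])

print(encoded_df)
```

`mode()` returns a Series because several values can tie for the most frequent one; taking `.iloc[0]` simply picks the first of them.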

In [4]:
# Preprocessing configuration
predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
col_name_to_fill_method = {
    'Age': 'median', 
    'Embarked': 'mode'
}
col_name_to_encoding = {
    'Pclass': {1: 3, 2: 2, 3: 1}, # original_value: encoding_value
    'Sex': 'one_hot',
    'Embarked': 'one_hot'
}
train_ds.preprocess_df_iteration_1(col_name_to_fill_method, col_name_to_encoding, predictors)
display(train_ds.df)
print(train_ds.df.info())

2021-02-14 16:04:54,605; src.preprocessing; INFO; The median of column ``Age`` equals: 28.0
2021-02-14 16:04:54,614; src.preprocessing; INFO; The mode of column ``Embarked`` equals: S
2021-02-14 16:04:54,621; src.preprocessing; INFO; Converted column ``Pclass`` using the custom mapping ``{1: 3, 2: 2, 3: 1}``
2021-02-14 16:04:54,629; src.preprocessing; INFO; One-hot encoded the column ``Sex``
2021-02-14 16:04:54,633; src.preprocessing; INFO; One-hot encoded the column ``Embarked``
2021-02-14 16:04:54,634; src.preprocessing; INFO; Preprocessing finished


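| | Pclass | Age | SibSp | Parch | Fare | Survived | PassengerId | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 3 | 38.0 | 1 | 0 | 71.2833 | 1 | 2 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 35.0 | 1 | 0 | 53.1000 | 1 | 4 | 1 | 0 | 0 | 0 | 1 |
| 4 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 5 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 887 | 0 | 1 | 0 | 0 | 1 |
| 887 | 3 | 19.0 | 0 | 0 | 30.0000 | 1 | 888 | 1 | 0 | 0 | 0 | 1 |
| 888 | 1 | 28.0 | 1 | 2 | 23.4500 | 0 | 889 | 1 | 0 | 0 | 0 | 1 |
| 889 | 3 | 26.0 | 0 | 0 | 30.0000 | 1 | 890 | 0 | 1 | 1 | 0 | 0 |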


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Pclass       891 non-null    int64  
 1   Age          891 non-null    float64
 2   SibSp        891 non-null    int64  
 3   Parch        891 non-null    int64  
 4   Fare         891 non-null    float64
 5   Survived     891 non-null    int64  
 6   PassengerId  891 non-null    int64  
 7   Sex_female   891 non-null    uint8  
 8   Sex_male     891 non-null    uint8  
 9   Embarked_C   891 non-null    uint8  
 10  Embarked_Q   891 non-null    uint8  
 11  Embarked_S   891 non-null    uint8  
dtypes: float64(2), int64(5), uint8(5)
memory usage: 53.2 KB
None


In [5]:
train_ds.preprocessing_finished

True

## TODO
- Explain the model used
- Feature engineering