# Titanic: Survival Model (XGBoost Version)

Build and train an XGBoost model to predict survival on the Titanic using a [cleaned and split dataset](https://huggingface.co/datasets/jamieoliver/titanic-2410), and upload the model to Hugging Face.

Based on https://github.com/jamieoliver/titanic-model-2410.

## Plan
- [x] Download [cleaned and split dataset](https://huggingface.co/datasets/jamieoliver/titanic-2410) from Hugging Face
- [x] Prepare data for model
  - [x] Load dataset splits as Pandas DataFrames
- [ ] Build and train initial model
- [ ] Tune model hyperparameters
- [ ] Test final model
- [ ] Upload model to Hugging Face

## Download Cleaned and Split Dataset From Hugging Face

In [1]:
from datasets import load_dataset

# Download the train/validation/test splits from the Hugging Face Hub
datasetDict = load_dataset('jamieoliver/titanic-2410')
datasetDict

DatasetDict({
    train: Dataset({
        features: ['survived', 'name', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'log_fare', 'pclass_1', 'pclass_2', 'pclass_3', 'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S'],
        num_rows: 1047
    })
    validation: Dataset({
        features: ['survived', 'name', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'log_fare', 'pclass_1', 'pclass_2', 'pclass_3', 'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S'],
        num_rows: 131
    })
    test: Dataset({
        features: ['survived', 'name', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'log_fare', 'pclass_1', 'pclass_2', 'pclass_3', 'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S'],
        num_rows: 131
    })
})
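
As a quick sanity check on the output above, the row counts can be turned into split fractions. This is a minimal sketch that uses only the `num_rows` values printed above. The names `split_rows`, `total_rows`, and `split_fractions` are introduced here for illustration and are not part of the original notebook.

```python
# Row counts copied from the DatasetDict output above
split_rows = {'train': 1047, 'validation': 131, 'test': 131}

total_rows = sum(split_rows.values())

# Fraction of all rows that falls in each split
split_fractions = {split: rows / total_rows for split, rows in split_rows.items()}

for split, fraction in split_fractions.items():
    print(f'{split}: {fraction:.1%}')
```

The counts work out to roughly an 80/10/10 split across train, validation, and test.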

## Prepare Data for Model

### Load Dataset Splits as Pandas DataFrames

The dependent variable is the variable we are predicting, i.e. `survived`:

In [2]:
# Map each split name to its `survived` column as a pandas Series
dependent_var = {split: dataset.to_pandas()['survived'] for split, dataset in datasetDict.items()}
dependent_var

{'train': 0        True
 1       False
 2       False
 3       False
 4       False
         ...  
 1042     True
 1043    False
 1044     True
 1045    False
 1046    False
 Name: survived, Length: 1047, dtype: bool,
 'validation': 0       True
 1      False
 2      False
 3       True
 4      False
        ...  
 126    False
 127    False
 128    False
 129    False
 130     True
 Name: survived, Length: 131, dtype: bool,
 'test': 0       True
 1      False
 2      False
 3      False
 4      False
        ...  
 126    False
 127    False
 128     True
 129    False
 130     True
 Name: survived, Length: 131, dtype: bool}
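
Because each split's target is a boolean Series, its mean gives the fraction of `True` values, which is one way to compare class balance across splits. The sketch below uses a made-up toy dictionary (`toy_dependent_var`, a hypothetical stand-in for `dependent_var`) rather than the real data, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy stand-in for `dependent_var`; the values are made up for illustration
toy_dependent_var = {
    'train': pd.Series([True, False, False, False], name='survived'),
    'validation': pd.Series([True, False], name='survived'),
}

# The mean of a boolean Series is the fraction of True values
survival_rate = {split: float(series.mean()) for split, series in toy_dependent_var.items()}
print(survival_rate)
```

Applying the same comprehension to `dependent_var` would, under the assumption that it is in scope, give the survival rate for each real split.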

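The independent variables are the features we will use to make the prediction. The `name`, `ticket`, and `cabin` columns are left out, and `log_fare` is used in place of the raw `fare`: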

In [3]:
independent_cols = ['age', 'sibsp', 'parch', 'log_fare', 'pclass_1', 'pclass_2', 'pclass_3', 'sex_female', 'sex_male',
                    'embarked_C', 'embarked_Q', 'embarked_S']

# Map each split name to a DataFrame containing only the feature columns
independent_vars = {split: dataset.to_pandas()[independent_cols] for split, dataset in datasetDict.items()}
independent_vars

{'train':        age  sibsp  parch  log_fare  pclass_1  pclass_2  pclass_3  sex_female  \
 0      4.0      1      1  3.178054     False      True     False        True   
 1     20.0      0      0  2.188856     False     False      True       False   
 2     32.5      0      0  5.358942      True     False     False       False   
 3     23.0      0      0  2.775447     False      True     False       False   
 4     47.0      0      0  3.970292      True     False     False       False   
 ...    ...    ...    ...       ...       ...       ...       ...         ...   
 1042  24.0      1      2  4.189655     False      True     False        True   
 1043  24.0      0      0  2.775447     False      True     False       False   
 1044  45.0      0      1  4.164466      True     False     False        True   
 1045  24.0      1      0  2.738146     False     False      True        True   
 1046  24.0      0      0  2.145931     False     False      True       False   
 
       sex_male  