# Final Project - Titanic Dataset 
**Jason "Scott" Person**

This project uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

Fields include:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class
- **Sex** (str) - Sex of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Ticket price paid
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Todo Items
- Figure out how to get best random out of random forest
- Investigate weights for each feature parameter
- SMOTE (optional)
- Sex - should it be labeled or OHE?
- Bin Deck
- Pclass string kludge
- Serialize data preprocessing
- Serialize model

## Talking Points
### Goals:
1. Learn 
1. Meet Requirements
1. Consider production implications

### Lessons Learned:
1. Model training is not deterministic
1. There's a lot of art here - it's not just data science!
1. In the weeds:<br>
  3.1. Calculate bins and imputed values on the training set only<br>
  3.2. Training and test columns need to align (OHE issue)<br>
  3.3. OHE via `pd.get_dummies` skips integer columns such as Pclass unless they are cast to strings first<br>
  3.4. Label encoding scored about 1% higher than OHE in my runs.


# Preliminaries

In [0]:
# import libraries required to load, transform, analyze and plot data
# this is from the churn analysis notebook, which is the foundation for this project solution

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
%matplotlib inline
import re
import seaborn as sns
sns.set(context='paper', style='darkgrid', 
        rc={'figure.facecolor':'white'}, font_scale=1.2)

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, LabelEncoder

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import make_scorer, precision_recall_curve, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, auc, f1_score, roc_auc_score
# from sklearn import tree
# from sklearn.dummy import DummyClassifier
import statsmodels.api as sm

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

In [0]:
# Customize seaborn plot styles
# Seaborn docs: https://seaborn.pydata.org/tutorial/aesthetics.html

# Adjust to retina quality

matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

# Adjust dpi and font size
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook', font_scale = 0.8)

# Display tick marks
sns.set_style('ticks')

# Remove borders
plt.rc('axes.spines', top=False, right=False, left=False, bottom=False)

In [0]:
# Color palettes for plots
# Named colors: https://matplotlib.org/stable/gallery/color/named_colors.html
# Seaborn color palette docs: https://seaborn.pydata.org/tutorial/color_palettes.html
# Seaborn palette chart: https://www.codecademy.com/article/seaborn-design-ii

# cp1 Color Palette - a binary blue/orange palette
blue = 'deepskyblue' # Use 'skyblue' for a lighter blue
orange = 'orange'
cp1 = [blue, orange]

# cp2 Color Palette - 5 colors for use with categorical data
turquoise = 'mediumaquamarine'
salmon = 'darksalmon'
tan = 'tan'
gray = 'darkgray'
cp2 = [blue, turquoise, salmon, tan, gray]

# cp3 Color Palette - blue-to-orange diverging palette for correlation heatmaps
cp3 = sns.diverging_palette(242, 39, s=100, l=65, n=11)

# Set the default palette
sns.set_palette(cp1)

In [0]:
df = pd.read_csv('titanic.csv')
df.head(10)

In [0]:
# View dataframe fundamentals
df.info()

# Explore Categorical Features

In [0]:
# check value counts by column
col_list = ['Pclass', 'Sex', 'Fare', 'Embarked']

for col in col_list:
     print(f'\nValue Counts | column = {col}')
     print(df[col].value_counts(normalize=True, dropna=False))

# Discuss Modeling and Data Prep Decisions

## Analyze Age for Bins

In [0]:
# Histogram: Age Distribution Comparisons by Survival
plt.title("Age Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x='Age', hue='Survived', binwidth=5, alpha=0.7);
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:** We took a look at this during our first module. The interesting characteristic is the outsized survival rate of the younger passengers. To capture this, and to create bins that correspond to social norms, I chose the following bins and labels:

`age_edges = [0, 12, 18, 30, 50, 100]`  
`age_labels = ['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior']`

During testing I did try a bin for infants in order to try to capture the even higher survival rate there, but it resulted in lower accuracy overall. My current hypothesis is that this led to overfitting of the model.
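
As a small illustration of how these edges behave, the sketch below applies `pd.cut` to a few made-up ages (not taken from the dataset). With the default `right=True`, each bin includes its right edge, so a 12-year-old is a Child and an 18-year-old is a Teen; missing ages stay missing.

```python
import pandas as pd

age_edges = [0, 12, 18, 30, 50, 100]
age_labels = ['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior']

# Hypothetical sample ages, chosen to sit on and around the bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 30, 64, None], name='Age')

# With right=True (the default) every bin is (left, right]
age_bins = pd.cut(ages, bins=age_edges, labels=age_labels)
print(pd.DataFrame({'Age': ages, 'age_bin': age_bins}))
```

Note that an age of exactly 0 would fall outside the first bin, since the lowest edge is exclusive unless `include_lowest=True` is passed.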

## Analyze Fare for Bins

In [0]:
# Histogram: Fare Distribution Comparisons by Survival
plt.title("Fare Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x='Fare', hue='Survived', binwidth=25, alpha=0.7);
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:** I spent a lot of time fiddling with fares. I wanted to achieve a couple of things:
1. Use cut or qcut to figure out the bins rather than hard coding them.
2. Persist the bins created with the training set only.

From the chart we can see that this is a widely dispersed dataset with outliers on the high end. `pandas.qcut` divides the data into approximately equal-sized quantile bins, so it should handle outliers better than fixed-width bins. I started with 5 bins, which gave me an 82.x% accuracy with logistic regression using default parameters. When I switched to 15 bins I got better results (83.x%) with Random Forest, but only with 5-fold cross validation.

I'll stick with 15 because that gives me a better result, but my takeaway here is that I need to do more research on the various algorithms.

**Coding Note:** I originally left the bins with their default names. This resulted in dataframe columns with names like fare_[xx.x-yy.y]. Some of the modeling algorithms did not like brackets and parentheses in the column names, so I renamed them to the form fare_n, where n is an incrementing integer.
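
One way to meet both goals is sketched below, assuming made-up fares and 5 bins rather than the project's data: learn the edges with `pd.qcut` on the training fares, widen the outer edges so unseen test fares still land in a bin, and pass `labels=False` so the result is an integer code instead of a bracketed interval name. The widening step and the variable names are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical fares standing in for the train and test splits
train_fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.3, 71.3, 151.6, 512.3])
test_fares = pd.Series([0.0, 7.0, 30.0, 600.0])

# Learn quantile edges on the training fares only
_, raw_edges = pd.qcut(train_fares, q=5, retbins=True, duplicates='drop')

# Copy the edges, then widen the outer ones so fares outside the training range still get a bin
edges = np.array(raw_edges, dtype=float)
edges[0], edges[-1] = -np.inf, np.inf

# labels=False returns integer bin codes, which avoids bracketed interval names
train_bins = pd.cut(train_fares, bins=edges, labels=False)
test_bins = pd.cut(test_fares, bins=edges, labels=False)
print(test_bins.tolist())
```

In a production setting the `edges` array is what would be persisted, so the test data never influences where the cuts fall.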

## Analyze Family Count and Binning

In [0]:
# Histogram: Family Count Distribution Comparisons by Survival
plt.title("Family Count Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x=df['SibSp'] + df['Parch'], hue='Survived', bins=10, alpha=0.7)
ax.set(xlabel='Family Count')
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:** There are three notable breaks in the data, which split passengers into four groups:
- Alone: About average survival rate. 
- 1-3 other family members: High chance of survival
- 4-6: Small chance of survival
- Greater than 6: Much lower chance of survival.

`family_map = {0: 'Alone', 1: 'Small', 2: 'Small', 3: 'Small', 4: 'Medium', 5: 'Medium', 6: 'Medium', 7: 'Large', 8: 'Large', 9: 'Large', 10: 'Large'}`
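
The same grouping can also be written as intervals with `pd.cut`, which is a possible alternative if a family count above 10 ever shows up (with `family_map`, `.map()` would return NaN for it). This is only a sketch with made-up counts; the open-ended top bin is an assumption.

```python
import pandas as pd

# Hypothetical family counts, including one above the largest key in family_map
family_count = pd.Series([0, 1, 3, 4, 6, 7, 10, 12])

# Interval-based equivalent of family_map: (-1, 0], (0, 3], (3, 6], (6, inf]
family_bins = pd.cut(
    family_count,
    bins=[-1, 0, 3, 6, float('inf')],
    labels=['Alone', 'Small', 'Medium', 'Large'],
)
print(family_bins.tolist())
```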

## Title Feature and Grouping

In [0]:
# Histogram: Title Distribution Comparisons by Survival
plt.title("Title Distributions Comparison", fontsize=14, fontweight='bold')
ax = sns.histplot(data=df, x=df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0], hue='Survived', bins=10, alpha=0.7)
ax.set(xlabel='Title Count')
# ax.set(xlabel = 'Custom x axis label', ylabel='Custom y axis label');

**Interpretation:** I did not plot this before making some assumptions and doing a lot of experimentation. I tried "military" and "nobility" groups, male/female combinations, and eventually settled on the mapping below. This lowers the number of categories as instructed and works the best based on my analysis.

`title_mapping = {
'Mr': 'Mr',
'Miss': 'Miss/Mrs/Ms',
'Mrs': 'Miss/Mrs/Ms',
'Master': 'Master',
'Dr': 'Special',
'Rev': 'Special',
'Col': 'Special',
'Major': 'Special',
'Capt': 'Special',
'Sir': 'Special',
'Don': 'Special',
'Lady': 'Special',
'the Countess': 'Miss/Mrs/Ms',
'Jonkheer': 'Special',
'Mlle': 'Miss/Mrs/Ms',
'Ms': 'Miss/Mrs/Ms',
'Mme': 'Miss/Mrs/Ms'
}`
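
A sketch of how the title can be pulled out of `Name` and grouped, using a few sample names in the dataset's `Surname, Title. Given names` layout and a shortened copy of the mapping. Titles that are not keys in the mapping come back as NaN from `.map()`; the fallback to 'Special' here is an assumption, not something the pipeline below does.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout; "Dona" is not a key in the mapping
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Oliva y Ocana, Dona. Fermina',
])

# Abbreviated copy of title_mapping, for illustration only
short_mapping = {'Mr': 'Mr', 'Mrs': 'Miss/Mrs/Ms', 'the Countess': 'Miss/Mrs/Ms'}

# Capture everything between the first ", " and the following "."
raw_titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Unmapped titles become NaN under .map(); fall back to 'Special' (an assumption)
titles = raw_titles.map(short_mapping).fillna('Special')
print(titles.tolist())
```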

# Calculate Required Data

## Median age by Sex and Pclass

In [0]:
def calculate_median_age(X_df):
    """Calculates the median age for each Sex and Pclass group.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    median_ages_df (pd.DataFrame): dataframe with median ages for each Sex and Pclass group
    """
    median_ages = X_df.groupby(['Sex', 'Pclass'])['Age'].median().reset_index()
    median_ages_df = median_ages.rename(columns={'Age': 'Median_Age'})
    return median_ages_df

In [0]:
def process_and_persist_median_ages(X_df):
    """Calculates the median ages and persists them to median_ages.csv.

    Parameters:
    X_df (pd.DataFrame): Dataframe containing the data

    Returns:
    None
    """
    median_ages_df = calculate_median_age(X_df)
    median_ages_df.to_csv('median_ages.csv', index=False)

## Median third class fare

In [0]:
def calculate_class_median_fare(X_df):
    """Calculates the median fare for each Pclass. There is one missing Fare value and it belongs to a third class passenger, so it is later filled with the median fare for that class.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    median_fare_df (pd.DataFrame): dataframe with the median fare for each Pclass
    """

    med_fare = X_df.groupby(['Pclass']).Fare.median().reset_index()
    median_fare_df = med_fare.rename(columns={'Fare': 'Median_Fare'})

    return median_fare_df

In [0]:
def process_and_persist_fare(X_df):
    """Calls calculate_class_median_fare and persists the per-class median fares to median_class_fares.csv.

    Parameters:
    X_df (pd.DataFrame): Dataframe containing the data

    Returns:
    None
    """
    median_class_fares = calculate_class_median_fare(X_df)
    median_class_fares.to_csv('median_class_fares.csv', index=False)

## Splitter persister

In [0]:
def calculate_bins(X_df, column, bin_count):
    """Calculates the bins for the specified column using qcut and returns the splits in a dataframe.

    Parameters:
    X_df (pd.DataFrame): Dataframe containing the data
    column (str): The column to calculate bins for
    bin_count (int): The number of bins to split the data into

    Returns:
    bins_df (pd.DataFrame): Dataframe with the bin edges
    """
    bin_edges = pd.qcut(X_df[column], q=bin_count, retbins=True)[1]
    bins_df = pd.DataFrame({'Bin_Edges': bin_edges})
    
    return bins_df

In [0]:
def calculate_and_persist_fare_bins(X_df):
    """Calls calculate_bins on the Fare column with 15 bins and persists the returned dataframe to fare_splits.csv.

    Parameters:
    X_df (pd.DataFrame): Dataframe containing the data

    Returns:
    None
    """
    bin_count = 15
    bins_df = calculate_bins(X_df, 'Fare', bin_count)
    bins_df.to_csv('fare_splits.csv', index=False)



# Data Pipeline Code

## Functions

### fill_age

In [0]:
# fill age with median from group
def fill_age(X_df):
    """Fills missing values in the age field using the persisted median ages from median_ages.csv.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with replaced values
    """
    
    # Load ages from storage - this would go in production pipeline
    median_ages_df = pd.read_csv('median_ages.csv')

    # Create a dictionary for median ages from median_ages_df
    median_ages = median_ages_df.set_index(['Sex', 'Pclass'])['Median_Age'].to_dict()
    
    # Setting Age to the median value based on Sex and Pclass only when Age is not a number
    X_df['age'] = X_df.apply(lambda row: median_ages.get((row['sex'], row['pclass'])) if pd.isnull(row['age']) else row['age'], axis=1)
    
    return X_df
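
The row-wise `apply` above works, but the same fill can be sketched with a left merge, which may scale better on larger data. The frames below are made-up stand-ins for `median_ages.csv` and a pipeline slice.

```python
import numpy as np
import pandas as pd

# Hypothetical training medians in the same layout as median_ages.csv
median_ages_df = pd.DataFrame({
    'Sex': ['female', 'male'],
    'Pclass': [3, 3],
    'Median_Age': [21.5, 25.0],
})

# Hypothetical slice with lowercased column names, as inside the pipeline
X_df = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'pclass': [3, 3, 3],
    'age': [np.nan, np.nan, 40.0],
})

# A left merge keeps the row order of X_df and attaches each row's group median
merged = X_df.merge(median_ages_df, how='left',
                    left_on=['sex', 'pclass'], right_on=['Sex', 'Pclass'])

# Re-attach the original index before filling so alignment is by position, then fill only the gaps
group_median = pd.Series(merged['Median_Age'].to_numpy(), index=X_df.index)
X_df['age'] = X_df['age'].fillna(group_median)
print(X_df)
```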

### fill_embarked

In [0]:
# fill embarked with 'S'

def fill_embarked(X_df):
    """Fills missing values in the Embarked field. We use S based on analysis by Evitan that showed that these two passengers actually embarked at Southampton.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with replaced values
    """

    # Filling the missing values in Embarked with S
    X_df['embarked'] = X_df['embarked'].fillna('S')
    
    return X_df

### fill_fare

In [0]:
def fill_fare(X_df):
    """Fills missing values in the Fare field using the persisted median fare for the same class.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with replaced values
    """

    # Load median fares from storage - this would go in production pipeline
    median_class_fares_df = pd.read_csv('median_class_fares.csv')

    # Create a dictionary for median fares from median_class_fares_df
    median_class_fares = median_class_fares_df.set_index('Pclass')['Median_Fare'].to_dict()
    
    # Filling the missing value in Fare with the median fare for the same class
    X_df['fare'] = X_df.apply(lambda row: median_class_fares.get(row['pclass']) if pd.isna(row['fare']) else row['fare'], axis=1)

    return X_df

### bin_fare_age

In [0]:
def bin_age(X_df):
    """Bins age into labeled groups using predefined edges.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with new columns
    """

    # Define age edges
    age_edges = [0, 12, 18, 30, 50, 100]
    age_labels = ['Child', 'Teen', 'YoungAdult', 'Adult', 'Senior']

    # Bin Age using predefined edges
    X_df['age_bin'] = pd.cut(X_df['age'], bins=age_edges, labels=age_labels)

    return X_df

In [0]:
def bin_fare(X_df):
    """Bins fare using the training-set edges persisted in fare_splits.csv.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with new columns
    """

    # Load fare edges from fare_splits.csv
    fare_edges_df = pd.read_csv('fare_splits.csv')
    fare_edges = fare_edges_df['Bin_Edges'].tolist()

    # Bin fare using predefined edges; include_lowest keeps the minimum training fare in the first bin
    X_df['fare_bin'] = pd.cut(X_df['fare'], bins=fare_edges, include_lowest=True)

    return X_df

### bin_family_count


In [0]:
def bin_family_count(X_df):
    """Creates the calculated field family_count and groups it into bins.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with new columns
    """
    # Family count
    X_df['family_count'] = X_df['sibsp'] + X_df['parch']

    # Bin family size
    family_map = {0: 'Alone', 1: 'Small', 2: 'Small', 3: 'Small', 4: 'Medium', 5: 'Medium', 6: 'Medium', 7: 'Large', 8: 'Large', 9: 'Large', 10: 'Large'}
    X_df['family_count_bin'] = X_df['family_count'].map(family_map)

    # Add "is_alone" feature - this made random forest slightly worse
    # X_df['is_alone'] = X_df['family_count'].map(lambda x: 1 if x == 0 else 0)

    return X_df

### create_title_feature

In [0]:
def create_title_feature(X_df):
    """Creates the title feature and groups it; also derives the is_married feature from the raw title.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with new columns
    """

    title_mapping = {
        'Mr': 'Mr',
        'Miss': 'Miss/Mrs/Ms',
        'Mrs': 'Miss/Mrs/Ms',
        'Master': 'Master',
        'Dr': 'Special',
        'Rev': 'Special',
        'Col': 'Special',
        'Major': 'Special',
        'Capt': 'Special',
        'Sir': 'Special',
        'Don': 'Special',
        'Lady': 'Special',
        'the Countess': 'Miss/Mrs/Ms',
        'Jonkheer': 'Special',
        'Mlle': 'Miss/Mrs/Ms',
        'Ms': 'Miss/Mrs/Ms',
        'Mme': 'Miss/Mrs/Ms'
    }

    X_df['title'] = X_df['name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    
    # Flag married women from the raw title; a single .loc call avoids chained assignment
    X_df['is_married'] = 0
    X_df.loc[X_df['title'] == 'Mrs', 'is_married'] = 1

    X_df['title'] = X_df['title'].map(title_mapping)
    

    return X_df

### create_deck_feature

In [0]:
def create_deck_feature(X_df):
    """Creates the deck feature from the first letter of the cabin; missing cabins become 'M'.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with new columns
    """

    X_df['deck'] = X_df['cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')

    return X_df

### one_hot_encode_categories

In [0]:
# convert categorical columns to one-hot encoding features
def ohe_categories(X_df):
    """Creates one-hot encoded (OHE) features for a list of categorical columns
    and simplifies column names.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with OHE columns
    """

    # create list of multi-class variables for one-hot encoding
    categoricals = ['pclass', 'sex', 'embarked', 'title', 'deck', 'family_count_bin', 'fare_bin','age_bin']

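    # pd.get_dummies only encodes non-numeric (object, string or category) columns by default, so the integer pclass column must be cast to str first.
    # Without this line, I was just getting another pclass column. Took about an hour to figure out. This feels really kludgy.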
    X_df['pclass'] = X_df['pclass'].astype(str)

    # create one-hot encoded dummy variables for categoricals
    X_df_ohe = pd.get_dummies(X_df[categoricals], drop_first=False, dtype=int)

    # leaving this in so that I have an example in the future 
    X_df_ohe.rename(
        columns={'sex_male' : 'sex_male',
                 'sex_female' : 'sex_female'
                }, inplace = True)
    
    # concatenate OHE with original df, and drop original category columns
    X_df = pd.concat([X_df, X_df_ohe], axis=1)
    X_df.drop(categoricals, axis=1, inplace=True)
    
    return X_df

### label_encode_categories

In [0]:
# convert categorical columns to label encoding features
def label_encode_categories(X_df):
    """Creates label encoded features for a list of categorical columns,
    replacing each column's values with integer codes.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df (pd.DataFrame): same dataframe with label encoded columns
    """

    # create list of multi-class variables for label encoding
    categoricals = ['sex', 'embarked', 'title', 'deck', 'family_count_bin', 'fare_bin', 'age_bin']

    # create label encoded variables for categoricals
    le = LabelEncoder()
    for col in categoricals:
        X_df[col] = le.fit_transform(X_df[col].astype(str))

    return X_df
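
One caveat with the function above: because `fit_transform` runs separately on each slice, the integer assigned to a category depends on which values happen to be present in that slice. A possible alternative, sketched here with made-up data, is to fit scikit-learn's `OrdinalEncoder` on the training slice only and reuse it on the test slice, sending unseen categories to -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical slices: the test slice contains categories the training slice never saw
train = pd.DataFrame({'embarked': ['S', 'C', 'S'], 'deck': ['M', 'B', 'C']})
test = pd.DataFrame({'embarked': ['C', 'Q'], 'deck': ['M', 'T']})

# Learn the category-to-integer mapping from the training slice only
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = encoder.fit_transform(train)

# Reuse the same mapping on the test slice; unseen values ('Q', 'T') become -1
test_enc = encoder.transform(test)
print(test_enc)
```

The fitted `encoder` is also a single object that could be serialized alongside the model, which fits the "Serialize data preprocessing" todo item.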

### lower_case_column_names

In [0]:
def rename_columns_lowercase(X_df):
    """Renames all column names to lowercase in place.

    Parameters:
    X_df (pd.DataFrame): DataFrame whose columns need to be renamed

    Returns:
    X_df (pd.DataFrame): same dataframe with lowercase column names
    """
    X_df.columns = [col.lower() for col in X_df.columns]

    return X_df

### Drop Unneeded Columns

In [0]:
def drop_columns(X_df, columns_to_drop):
    """Drops specified columns from the dataframe.

    Parameters:
    X_df (pd.DataFrame): DataFrame from which columns need to be dropped
    columns_to_drop (list): List of column names to drop

    Returns:
    X_df (pd.DataFrame): DataFrame with specified columns dropped
    """
    X_df.drop(columns=columns_to_drop, axis=1, inplace=True)
    return X_df

In [0]:
def drop_specified_columns(X_df):
    """Drops specified columns from the dataframe.

    Parameters:
    X_df (pd.DataFrame): DataFrame from which columns need to be dropped

    Returns:
    X_df (pd.DataFrame): DataFrame with specified columns dropped
    """
    columns_to_drop = ['passengerid', 'name', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'family_count']
    return drop_columns(X_df, columns_to_drop)

### Rename Columns with Bin Ranges

In [0]:
def clean_range_column_names(X_df):
    """
    Renames columns that contain numeric range patterns (e.g., '(0.0, 7.854]') 
    by replacing them with a simplified, numbered version using the base column name.
    For example: 'fare_bin_(0.0, 7.854]' → 'fare_bin_1'
    """
    new_columns = {}
    range_pattern = re.compile(r'[\(\[].*?[\)\]]')  # matches ranges like (x, y] or [x, y)

    # Group columns by prefix before the range
    prefix_groups = {}

    for col in X_df.columns:
        match = range_pattern.search(col)
        if match:
            prefix = col[:match.start()].rstrip('_')
            prefix_groups.setdefault(prefix, []).append(col)

    # For each group, assign sequential numbers
    for prefix, cols in prefix_groups.items():
        for i, old_col in enumerate(sorted(cols), start=1):
            new_col = f"{prefix}_{i}"
            new_columns[old_col] = new_col

    # Rename in the dataframe
    X_df = X_df.rename(columns=new_columns)
    return X_df

## Data Prep Pipeline

In [0]:
# function holds data preparation pipeline for X predictors dataframe
def data_prep_pipe(X_df):
    """Executes data preparation pipeline of steps to clean and transform
    an X features dataframe.

    Parameters:
    X_df (pd.DataFrame): train or test slice containing the predictors

    Returns:
    X_df_tr (pd.DataFrame): train or test dataframe, transformed
    """
    
    # instantiate custom transformer functions
    get_fill_age = FunctionTransformer(fill_age, validate=False)
    get_fill_embarked = FunctionTransformer(fill_embarked, validate=False)
    get_fill_fare = FunctionTransformer(fill_fare, validate=False)
    get_bin_age = FunctionTransformer(bin_age, validate=False)
    get_bin_fare = FunctionTransformer(bin_fare, validate=False)
    get_bin_family_count = FunctionTransformer(bin_family_count, validate=False)
    get_create_title_feature = FunctionTransformer(create_title_feature, validate=False)
    get_create_deck_feature = FunctionTransformer(create_deck_feature, validate=False)
    get_ohe_categories = FunctionTransformer(ohe_categories, validate=False)
    get_label_encode_categeories = FunctionTransformer(label_encode_categories, validate=False)
    get_rename_columns_lowercase = FunctionTransformer(rename_columns_lowercase, validate=False)
    get_drop_specified_columns = FunctionTransformer(drop_specified_columns, validate=False)
    get_clean_range_column_names = FunctionTransformer(clean_range_column_names, validate=False)

    # instantiate data prep pipeline object and steps
    prep_pipe = Pipeline(memory=None, 
                         steps=[('rename_columns_lowercase', get_rename_columns_lowercase),
                                ('fill_age', get_fill_age),
                                ('fill_embarked', get_fill_embarked),
                                ('fill_fare', get_fill_fare),
                                ('bin_age', get_bin_age),
                                ('bin_fare', get_bin_fare),
                                ('bin_family_count', get_bin_family_count),
                                ('create_title_feature', get_create_title_feature),
                                ('create_deck_feature', get_create_deck_feature),
                                # Switched to label encoding for everything - improved accuracy by 1%
                                #('ohe_categories', get_ohe_categories),
                                ('label_encode_categories', get_label_encode_categeories),
                                ('drop_specified_columns', get_drop_specified_columns),
                                ('rename_columns_lowercase_again', get_rename_columns_lowercase),
                                ('clean_range_column_names', get_clean_range_column_names)
                                ])
    
    # apply data prep pipeline to df and store/return new df
    X_df_tr = prep_pipe.fit_transform(X_df)
    return X_df_tr

# Run Pipeline

## Train Test Split

In [0]:
# Create X predictors and y target variable
y = df['Survived']
X = df.drop(columns=['Survived'], axis=1)

# Split into training and test sets
SEED = 42

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED)

## Calculate Persisted Data Prep Values

In [0]:
# Calculate and persist the median ages and fares to fill in the missing data. This is done against the training set only to prevent leakage.
# Eventually I would refactor this to use the pipeline - just don't have time right now!
# Note: Evitan used the union of training and test data to calculate averages. I believe this may be an error. https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial block In [7]

process_and_persist_median_ages(X_train)
process_and_persist_fare(X_train)
calculate_and_persist_fare_bins(X_train)

## Prep Data

In [0]:
# send both X_train and X_test through data prep steps
X_train = data_prep_pipe(X_train)
X_test = data_prep_pipe(X_test)

## Align Train and Test Column Names

In [0]:
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
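
To show why this step matters when OHE is used, here is a minimal sketch with made-up data: the test slice is missing two ports, so `pd.get_dummies` produces fewer columns until `reindex` restores the training layout.

```python
import pandas as pd

# Hypothetical slices: the test slice only contains Southampton passengers
train = pd.DataFrame({'embarked': ['S', 'C', 'Q']})
test = pd.DataFrame({'embarked': ['S', 'S']})

train_ohe = pd.get_dummies(train, dtype=int)
test_ohe = pd.get_dummies(test, dtype=int)  # only embarked_S exists at this point

# Add the missing columns as zeros and match the training column order
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_ohe)
```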


In [0]:
X_train.info()

In [0]:
X_test.info()

# Train Models

## Run Models with Default Parameters

In [0]:
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Define a models list with parameter stubs
models_default = DataFrame({
    'model_name': ['LogisticRegression', 'DecisionTreeClassifier', 'RandomForestClassifier', 'GradientBoostingClassifier', 'XGBClassifier', 'LGBMClassifier'],
    'model_instance': [
        LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100),
        DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None),
        RandomForestClassifier(max_depth=10, min_samples_leaf=2),
        #RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=10, min_samples_leaf=2),
        #RandomForestClassifier(criterion='gini', bootstrap=True, max_depth=10, max_features=None,min_samples_leaf=2, min_samples_split=2, n_estimators=100),

        #RandomForestClassifier(criterion='gini',bootstrap=True, max_depth=5, max_features=None, min_samples_leaf=2, min_samples_split=5, n_estimators=200),
        GradientBoostingClassifier(loss='log_loss', learning_rate=0.1, n_estimators=100),
        XGBClassifier(objective='binary:logistic', learning_rate=0.1, n_estimators=100),
        LGBMClassifier(boosting_type='gbdt', learning_rate=0.1, n_estimators=100)
    ]
})

# Train the models using the training features and labels
for index, row in models_default.iterrows():
    model = row['model_instance']
    model.fit(X_train, y_train)
    # Report trained model
    print(f'Trained and ready: {row["model_name"]}')
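
Because none of the estimators above set `random_state`, repeated runs can give slightly different scores. A minimal sketch on synthetic data (from `make_classification`, not the Titanic set) suggests that fixing the seed makes the random forest repeatable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

SEED = 42

# Synthetic stand-in data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=SEED)

def fit_and_predict(seed):
    """Fits a small random forest with the given seed and returns class-1 probabilities."""
    model = RandomForestClassifier(n_estimators=50, max_depth=10,
                                   min_samples_leaf=2, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict_proba(X_demo)[:, 1]

# Two fits with the same seed should produce identical probabilities
same_seed_matches = np.array_equal(fit_and_predict(SEED), fit_and_predict(SEED))
print(same_seed_matches)
```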

In [0]:
# Test all models by running models list through a for loop

accuracy_scores = []
auc_scores = []
f1_scores = []

for index, row in models_default.iterrows():
    model = row['model_instance']
    # Use the model to generate predictions for the Test split, based on its features only
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    # Compare model's predictive performance to the provided test labels
    accuracy = accuracy_score(y_test, y_pred) * 100
    accuracy_scores.append(accuracy)
    
    # Named auc_pct so the imported sklearn.metrics.auc function is not shadowed
    auc_pct = roc_auc_score(y_test, y_pred_proba) * 100 if y_pred_proba is not None else None
    auc_scores.append(auc_pct)
    
    f1 = f1_score(y_test, y_pred) * 100
    f1_scores.append(f1)

    # Report the model and its scores
    """print(row['model_name'])
    print(f'  Accuracy: {accuracy}')
    print(f'  AUC: {auc}')
    print(f'  F1: {f1}\n')"""

# Add the accuracy, AUC, and F1 scores to the models dataframe
models_default['accuracy_score'] = accuracy_scores
models_default['auc_score'] = auc_scores
models_default['f1_score'] = f1_scores

In [0]:
# Create ROC Curve and Confusion Matrix for each model

for index, row in models_default.iterrows():
    model = row['model_instance']
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    # ROC Curve
    if y_pred_proba is not None:
        fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, label=f'{row["model_name"]} (AUC = {row["auc_score"]:.2f})')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'ROC Curve for {row["model_name"]}')
        plt.legend(loc='best')
        plt.show()

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'Confusion Matrix for {row["model_name"]}')
    plt.show()

## Model Results

In [0]:
models_default

In [0]:
# Calculate average survival grouped by distinct values in each column of X_train
average_survival_by_column = {}

for column in X_train.columns:
    avg_survival = X_train.join(y_train).groupby(column)[y_train.name].mean()
    average_survival_by_column[column] = avg_survival

# Display the results
for column, avg_survival in average_survival_by_column.items():
    print(f"Average survival grouped by {column}:")
    display(avg_survival)

# Grid Search for Hyperparameter Tuning

In [0]:
# Grid Search for Hyperparameter Tuning
# Define your models
models = DataFrame({
    'model_name': [#'LogisticRegression',
                   #'DecisionTreeClassifier', 
                   'RandomForestClassifier'#, 
                   #'GradientBoostingClassifier', 
                   #'XGBClassifier', 
                   #'LGBMClassifier'
                   ],
    'model_instance': [#LogisticRegression(),
                       #DecisionTreeClassifier(),
                       RandomForestClassifier()#, 
                       #GradientBoostingClassifier(),
                       #XGBClassifier(),
                       #LGBMClassifier()
                       ]
})

# Define parameter grids
param_grids = {
    'LogisticRegression': {
        'C': [0.1, 1.0, 10.0],
        'solver': ['lbfgs']
    },
    'DecisionTreeClassifier': {
        'max_depth': [3, 5, 10, None],
        'min_samples_split': [2, 10],
        'criterion': ['gini', 'entropy']
    },
    'RandomForestClassifier': {
        'n_estimators': [100, 200, 500, 1000],  # More trees → better performance, longer training
        'max_depth': [5, 10, 20],             # Controls how deep trees grow
        'min_samples_leaf': [1, 2, 4],              # Prevents tiny splits (reduces overfitting)
        'min_samples_split': [2, 5, 10],            # Controls when to split a node
        'max_features': ['sqrt', 'log2', None],     # How many features to consider per split
        'bootstrap': [True, False]                  # Whether to sample data with replacement
    },
    'GradientBoostingClassifier': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5]
    },
    'XGBClassifier': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5],
        'subsample': [0.8, 1.0]
    },
    'LGBMClassifier': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'num_leaves': [15, 31],
        'max_depth': [-1, 5]
    }
}

# Tune each model with grid search and keep only its best estimator
fitted_models = {}

for index, row in models.iterrows():
    name = row['model_name']
    base_model = row['model_instance']
    print(f"\n🔍 Tuning and fitting {name}...")

    param_grid = param_grids.get(name, {})

    search = GridSearchCV(base_model, param_grid, scoring='accuracy', cv=5, n_jobs=-1)
    search.fit(X_train, y_train)

    best_model = search.best_estimator_
    fitted_models[name] = best_model  # Store it for later use

    print(f'✅ Best Parameters for {name}: {search.best_params_}')
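
A hedged sketch of a more repeatable search, on synthetic data and a much smaller grid than the one above: seeding both the estimator and the fold splitter, then reading `cv_results_` to see how close the top candidates are. The grid values here are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

SEED = 42

# Synthetic stand-in data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=SEED)

# Seed the fold splitter and the estimator so repeated searches see the same splits and trees
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid={'n_estimators': [25, 50], 'max_depth': [3, 5]},
    scoring='accuracy',
    cv=cv,
)
search.fit(X_demo, y_demo)

# Compare candidates by mean score and fold-to-fold spread, not just the single winner
results = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

If the gap between the top candidates is smaller than their `std_test_score`, the "best" parameters may be partly down to chance.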

In [0]:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

# Prepare lists to hold scores
accuracies = []
aucs = []
f1s = []

# Loop through models in the DataFrame
for index, row in models.iterrows():
    name = row['model_name']
    model = fitted_models.get(name)

    if model is not None:
        # Get predicted labels and probabilities
        y_proba = model.predict_proba(X_test)[:, 1]
        y_pred = model.predict(X_test)  # uses 0.5 threshold

        # Calculate metrics
        acc = accuracy_score(y_test, y_pred) * 100
        # Named auc_pct so the sklearn.metrics.auc function is not shadowed
        auc_pct = roc_auc_score(y_test, y_proba) * 100
        f1 = f1_score(y_test, y_pred) * 100

        # Append to results
        accuracies.append(acc)
        aucs.append(auc_pct)
        f1s.append(f1)
    else:
        accuracies.append(None)
        aucs.append(None)
        f1s.append(None)

# Add metrics to the models DataFrame
models['Accuracy'] = accuracies
models['AUC'] = aucs
models['F1'] = f1s

In [0]:
models