# Building a Classifier - Titanic Dataset (module 3)
**Author:** Jarred Gastreich 
**Date:** November 14, 2025
**Objective:** Predicting a continuous target with regression using the seaborn Titanic dataset


In [29]:
# Imports

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Section 1: Import and Inspect the Data

In [30]:
# Load Titanic dataset from seaborn and verify
titanic = sns.load_dataset("titanic")
titanic.head()

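| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |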


# Section 2: Data Exploration and Preparation

In [31]:
# Fill missing ages with the median; assign the result back instead of
# calling fillna(inplace=True) on a column selection, which pandas warns about
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Drop rows with a missing fare (the regression target)
titanic = titanic.dropna(subset=['fare'])

# Engineered feature: siblings/spouses + parents/children + the passenger
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1


# Section 3: Feature Selection and Justification

In [32]:
# Case 1. age
X1 = titanic[['age']]
y1 = titanic['fare']

# Case 2. family_size
X2 = titanic[['family_size']]
y2 = titanic['fare']

# Case 3. age, family_size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

# Case 4. pclass
X4 = titanic[['pclass']]
y4 = titanic['fare']

## Section 3 Reflection

Why might these features affect a passenger’s fare: Family size may affect the fare if group or family tickets were discounted. Age may affect the fare if children under a certain age paid a reduced rate. Class may affect the fare because higher-class cabins are likely more expensive.

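List all available features: survived, pclass, sex, age, sibsp, parch, embarked, class, who, adult_male, deck, embark_town, alive, alone, and the engineered family_size (fare is the target).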

Which other features could improve predictions and why: Embarked town and pclass are likely to affect fare. Embarked town could matter because the ports are different distances from the destination, so the cost of the voyage (fuel, for example) may have differed.

How many variables are in your Case 4: one

Which variable(s) did you choose for Case 4 and why do you feel those could make good inputs: pclass, which takes the values 1, 2, and 3. It is a good input because it is already numeric and is likely to affect the fare.
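One way to check these guesses against the data is to compare the mean fare across the values of a candidate feature. The sketch below does this for only the five rows printed by `titanic.head()` in Section 1, so its numbers describe that sample and not the full dataset. `mean_fare_by` is a hypothetical helper name, not part of pandas or seaborn.

```python
import pandas as pd

# The five rows printed by titanic.head() in Section 1 (not the full dataset).
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    "Southampton", "Southampton"],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})


def mean_fare_by(df, column):
    """Return the mean fare for each distinct value of `column` (hypothetical helper)."""
    return df.groupby(column)["fare"].mean()


fare_by_class = mean_fare_by(sample, "pclass")
fare_by_town = mean_fare_by(sample, "embark_town")
print(fare_by_class)
print(fare_by_town)
```

Calling `mean_fare_by(titanic, "pclass")` or `mean_fare_by(titanic, "embark_town")` on the full DataFrame would give the same comparison for every passenger.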

# Section 4: Train a Regression Model (linear regression)

In [33]:
#split data

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)

In [34]:
# train and evaluate linear regression

lr_model1 = LinearRegression().fit(X1_train, y1_train)
lr_model2 = LinearRegression().fit(X2_train, y2_train)
lr_model3 = LinearRegression().fit(X3_train, y3_train)
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions for Case 1
y1_pred_train = lr_model1.predict(X1_train)
y1_pred_test = lr_model1.predict(X1_test)

# Predictions for Case 2
y2_pred_train = lr_model2.predict(X2_train)
y2_pred_test = lr_model2.predict(X2_test)

# Predictions for Case 3
y3_pred_train = lr_model3.predict(X3_train)
y3_pred_test = lr_model3.predict(X3_test)

# Predictions for Case 4
y4_pred_train = lr_model4.predict(X4_train)
y4_pred_test = lr_model4.predict(X4_test)


In [None]:
#Report performance
print("=== Case 1: age ===")
print("Training R²:", r2_score(y1_train, y1_pred_train))
print("Test R²:", r2_score(y1_test, y1_pred_test))
print("Test RMSE:", mean_squared_error(y1_test, y1_pred_test) ** 0.5)
print("Test MAE:", mean_absolute_error(y1_test, y1_pred_test))
print()

print("=== Case 2: family_size ===")
print("Training R²:", r2_score(y2_train, y2_pred_train))
print("Test R²:", r2_score(y2_test, y2_pred_test))
print("Test RMSE:", mean_squared_error(y2_test, y2_pred_test) ** 0.5)
print("Test MAE:", mean_absolute_error(y2_test, y2_pred_test))
print()

print("=== Case 3: age + family_size ===")
print("Training R²:", r2_score(y3_train, y3_pred_train))
print("Test R²:", r2_score(y3_test, y3_pred_test))
print("Test RMSE:", mean_squared_error(y3_test, y3_pred_test) ** 0.5)
print("Test MAE:", mean_absolute_error(y3_test, y3_pred_test))
print()

print("=== Case 4: pclass ===")
print("Training R²:", r2_score(y4_train, y4_pred_train))
print("Test R²:", r2_score(y4_test, y4_pred_test))
print("Test RMSE:", mean_squared_error(y4_test, y4_pred_test) ** 0.5)
print("Test MAE:", mean_absolute_error(y4_test, y4_pred_test))

=== Case 1: age ===
Training R²: 0.009950688019452314
Test R²: 0.0034163395508415295
Test RMSE: 37.97164180172938
Test MAE: 25.28637293162364

=== Case 2: family_size ===
Training R²: 0.049915792364760736
Test R²: 0.022231186110131973
Test RMSE: 37.6114940041967
Test MAE: 25.02534815941641

=== Case 3: age + family_size ===
Training R²: 0.07347466201590014
Test R²: 0.049784832763073106
Test RMSE: 37.0777586646559
Test MAE: 24.284935030470688

=== Case 4: pclass ===
Training R²: 0.3005588037487471
Test R²: 0.3016017234169923
Test RMSE: 31.7873316928033
Test MAE: 20.653703671484056


## Section 4 Reflection

Compare the train vs test results for each:
The first three cases performed poorly, with test R^2 values below 0.05. Case 4 performed much better, with a test R^2 of about 0.30.
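
For reference, these are the standard definitions of the three metrics reported in the output, where $y_i$ is the actual fare, $\hat{y}_i$ the predicted fare, $\bar{y}$ the mean actual fare, and $n$ the number of passengers in the split being evaluated:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

An R^2 near 0 means the model does little better than always predicting the mean fare. RMSE and MAE are both in the same units as fare, and RMSE penalizes large errors more heavily than MAE does.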

Did Case 1 overfit or underfit? Explain: Underfit. Train and test R^2 are both very low (0.0099 and 0.0034), which tells us that age alone explains almost none of the variation in fare.

Did Case 2 overfit or underfit? Explain: Underfit. Both train (0.0499) and test (0.0222) R^2 are small, and the test score is lower than the train score, so family_size alone is not very useful for predicting fare.

Did Case 3 overfit or underfit? Explain: Underfit. Combining age and family_size gives a small improvement, but R^2 remains very low (0.0735 train, 0.0498 test). The model is too simple to capture the pattern in fare.

Did Case 4 overfit or underfit? Explain: Neither; this is the best fit of the four cases. The model generalizes well: test R^2 (0.3016) is slightly higher than training R^2 (0.3006), and RMSE and MAE are clearly better than in the other cases. This tells us that pclass has a meaningful effect on fare, although about 70% of the variance in fare is still unexplained.
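
The reasoning in the four answers above can be sketched as a rough rule of thumb applied to the R^2 values from the Section 4 output. The `describe_fit` helper and its `low` and `gap` thresholds are assumptions chosen for illustration, not standard cutoffs.

```python
# (train R², test R²) copied from the Section 4 output
results = {
    "Case 1: age": (0.009950688019452314, 0.0034163395508415295),
    "Case 2: family_size": (0.049915792364760736, 0.022231186110131973),
    "Case 3: age + family_size": (0.07347466201590014, 0.049784832763073106),
    "Case 4: pclass": (0.3005588037487471, 0.3016017234169923),
}


def describe_fit(train_r2, test_r2, low=0.1, gap=0.05):
    """Label a model from its train/test R² (hypothetical heuristic).

    `low` and `gap` are arbitrary illustrative thresholds.
    """
    if train_r2 - test_r2 > gap:
        return "possible overfit"  # much better on train than on test
    if train_r2 < low:
        return "underfit"  # poor even on the data it was trained on
    return "no clear over- or underfit"


labels = {name: describe_fit(train, test) for name, (train, test) in results.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

With these thresholds, Cases 1 to 3 are labeled as underfit and Case 4 as neither, which matches the answers above. Different thresholds would move the boundaries.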

Adding Age

Did adding age improve the model: Only slightly. Adding age to family_size (Case 2 to Case 3) raised test R^2 from 0.0222 to 0.0498, and age on its own (Case 1) explains almost nothing.

Propose a possible explanation (consider how age might affect ticket price, and whether the data supports that): The small improvement in R^2 suggests that a weak linear relationship exists between age and fare. However, since R^2 remains very low even after adding age, it is likely that either the true relationship is non-linear (for example, reduced fares for children only) and cannot be captured by the simple linear model used in these cases, or age simply has little influence on ticket price.
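
One way to test the non-linearity explanation is to fit a polynomial model on the same split and compare test R^2. The sketch below is a minimal illustration under stated assumptions: `linear_vs_poly_r2` is a hypothetical helper, and the demo data is synthetic (a deliberately U-shaped relationship), not Titanic values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def linear_vs_poly_r2(X, y, degree=2, random_state=123):
    """Return (linear test R², polynomial test R²) using one shared split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    linear = LinearRegression().fit(X_train, y_train)
    poly = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly.fit(X_train, y_train)
    return linear.score(X_test, y_test), poly.score(X_test, y_test)


# Synthetic stand-in data (NOT Titanic values): U-shaped signal plus noise.
rng = np.random.default_rng(0)
age_demo = rng.uniform(0, 80, size=400).reshape(-1, 1)
fare_demo = 0.02 * (age_demo.ravel() - 40) ** 2 + rng.normal(0, 2, size=400)

linear_r2, poly_r2 = linear_vs_poly_r2(age_demo, fare_demo)
print(f"linear R²: {linear_r2:.3f}, degree-2 R²: {poly_r2:.3f}")
```

On the notebook's data the call would be `linear_vs_poly_r2(titanic[['age']], titanic['fare'])`. If the polynomial score is not clearly higher there, the "age has little influence" explanation becomes more plausible than the non-linear one.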


Worst

Which case performed the worst: Case 1 (age)

How do you know: It has the lowest test R^2 (0.0034).

Do you think adding more training data would improve it (and why/why not): No. The problem is not a lack of data: age by itself carries little information about fare, and the model is too simple to make use of it.
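
A learning curve is one way to check whether more training data would help: if the validation score stays flat as the training set grows, more rows are unlikely to change much. The sketch below uses scikit-learn's `learning_curve` on synthetic stand-in data (a weak signal buried in noise), so its numbers say nothing about the Titanic data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (NOT Titanic values): weak linear signal, heavy noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 80, size=(500, 1))
y_demo = 0.05 * X_demo.ravel() + rng.normal(0, 30, size=500)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% to 100% of each training fold
    cv=5,
    scoring="r2",
)

# Average the per-fold scores for each training-set size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"n={int(n):3d}  train R²={tr:.3f}  validation R²={va:.3f}")
```

To apply this to Case 1, pass `X1` and `y1` in place of the synthetic arrays.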


Best

Which case performed the best: Case 4 (pclass)

How do you know: It has the highest test R^2 (0.3016) and the lowest RMSE (31.79) and MAE (20.65).

Do you think adding more training data would improve it (and why/why not): Probably only slightly. The model is already generalizing well, and pclass is a simple numeric feature with only three values, so additional rows would give a linear model little more to learn. Adding another feature, such as a cabin-related feature, may help more.
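
As a sketch of the cabin-feature idea, the `deck` column could be one-hot encoded and placed next to pclass. The example uses only the five rows printed by `titanic.head()` in Section 1; `pclass_with_deck` is a hypothetical helper, and treating a missing deck as its own "Unknown" category is an assumption, not the only reasonable choice.

```python
import pandas as pd

# The five rows printed by titanic.head() in Section 1 (not the full dataset).
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "deck": [None, "C", None, "C", None],
})


def pclass_with_deck(df):
    """Return pclass plus one 0/1 indicator column per deck value (hypothetical helper)."""
    # Convert to object first so a new "Unknown" label can be added even if
    # deck is stored as a pandas categorical column.
    deck = df["deck"].astype("object").fillna("Unknown")
    deck_dummies = pd.get_dummies(deck, prefix="deck", dtype=int)
    return pd.concat([df[["pclass"]], deck_dummies], axis=1)


features = pclass_with_deck(sample)
print(features)
```

The resulting frame could be used as a new feature set, for example `X5 = pclass_with_deck(titanic)`, and evaluated with the same split and metrics as the other cases.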

# Section 5: Compare Alternative Models (Ridge, Elastic Net, Polynomial Regression)

# Section 6: Final Thoughts & Insight