# Project 4 – Regression Modeling with Titanic Dataset  
**Author: Moses Koroma** 

In this project, I build multiple regression models to predict a *continuous* target from the Titanic dataset. Instead of predicting survival (classification), we switch to a numeric prediction task.  

### **Target Variable**
- **Fare** (continuous)

### **Feature Cases**
- **Case 1:** Predict Fare using Passenger Age  
- **Case 2:** Predict Fare using Age + Passenger Class (`Pclass`)  
- **Case 3:** Polynomial Regression using Age  
- **Case 4:** Regularized Models (Ridge & ElasticNet)

### Models Used
- Linear Regression  
- Polynomial Regression (degree 3 & 8)  
- Ridge Regression  
- ElasticNet Regression  

We evaluate each model using:
- **MAE**  
- **RMSE**  
- **MSE**  
- **R² score**  
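
As a reference for how the four metrics relate, here is a minimal NumPy sketch that computes them by hand. The function name `regression_metrics` and the toy values are illustrative assumptions, not part of the project code; the notebook itself uses the `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))   # average absolute error
    mse = np.mean(residuals ** 2)      # average squared error
    rmse = np.sqrt(mse)                # same units as the target

    # R² compares the model's squared error with that of always
    # predicting the mean of y_true.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up fares, for illustration only
toy_metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(toy_metrics)
```

Because R² is measured against the mean-only baseline, it becomes negative whenever a model's squared error on the evaluated data is larger than that baseline's.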

---


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


## Section 1. Load & Inspect the Data


In [6]:
titanic = pd.read_csv("../../data/titanic/train.csv")
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
# Check the dataset structure and columns
print("Dataset shape:", titanic.shape)
print("\nColumn names:")
print(titanic.columns.tolist())
print("\nDataset info:")
titanic.info()

Dataset shape: (891, 12)

Column names:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## Section 2. Data Preparation & Feature Engineering

We will clean missing values and create helpful numerical features.


In [9]:
# Fill missing values with the column median (Fare has no missing values here; the fill is a safeguard)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].median())

# Encode sex
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})

# Create family size
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  FamilySize   891 non-null    int64  
dtypes: float64(2), int64(7), object(4)
memory usage: 90.6+ KB
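
The median fill above uses every row, including rows that later land in the test split. A stricter variant, sketched below under the assumption that this matters for the analysis, learns the median from the training rows only. The toy `Age` values are made up, and `SimpleImputer` is a substitute for the `fillna` call used in this notebook, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up ages with gaps, standing in for the real Age column
toy = pd.DataFrame(
    {"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0]}
)
train, test = train_test_split(toy, test_size=0.2, random_state=123)

# Learn the median on the training rows, then reuse it for the test rows
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print("Median learned from training rows:", imputer.statistics_[0])
```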


## Section 3. Model 1 — Linear Regression (Age → Fare)


In [11]:
X1 = titanic[['Age']]
y = titanic['Fare']

X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=123)

lr1 = LinearRegression()
lr1.fit(X1_train, y_train)

y1_train_pred = lr1.predict(X1_train)
y1_test_pred = lr1.predict(X1_test)

print("Linear Regression (Age -> Fare) Test R²:", r2_score(y_test, y1_test_pred))

Linear Regression (Age -> Fare) Test R²: 0.0034163395508415295


## Section 4. Model 2 — Linear Regression (Age + Pclass → Fare)


In [13]:
X2 = titanic[['Age', 'Pclass']]
y = titanic['Fare']

X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=123)

lr2 = LinearRegression()
lr2.fit(X2_train, y_train)

y2_train_pred = lr2.predict(X2_train)
y2_test_pred = lr2.predict(X2_test)

print("Linear Regression (Age + Pclass -> Fare) Test R²:", r2_score(y_test, y2_test_pred))

Linear Regression (Age + Pclass -> Fare) Test R²: 0.31661691734309905


## Section 5. Polynomial Regression (Degree 3 & 8)
We test whether adding curvature improves predictions.
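
To make the feature expansion concrete, this small sketch shows what `PolynomialFeatures(degree=3)` produces for a single input column. The two toy ages are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

ages = np.array([[2.0], [3.0]])  # toy ages, one column

# Each row becomes [1, age, age^2, age^3]; the leading 1 is the bias column
expanded = PolynomialFeatures(degree=3).fit_transform(ages)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```

With degree 8 the highest column holds age to the eighth power, so an age of 80 maps to roughly 1.7e15. Columns on such different scales can make the fit numerically delicate.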


In [14]:
poly3 = PolynomialFeatures(degree=3)
X3_poly = poly3.fit_transform(X1_train)
X3_poly_test = poly3.transform(X1_test)

poly3_model = LinearRegression()
poly3_model.fit(X3_poly, y_train)

y3_train_pred = poly3_model.predict(X3_poly)
y3_test_pred = poly3_model.predict(X3_poly_test)

print("Poly d=3 Test R²:", r2_score(y_test, y3_test_pred))

Poly d=3 Test R²: -0.0033041302146110674


In [15]:
poly8 = PolynomialFeatures(degree=8)
X8_poly = poly8.fit_transform(X1_train)
X8_poly_test = poly8.transform(X1_test)

poly8_model = LinearRegression()
poly8_model.fit(X8_poly, y_train)

y8_train_pred = poly8_model.predict(X8_poly)
y8_test_pred = poly8_model.predict(X8_poly_test)

print("Poly d=8 Test R²:", r2_score(y_test, y8_test_pred))


Poly d=8 Test R²: -0.007251042293693111


## Section 6. Regularized Models (Ridge + ElasticNet)
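Regularization penalizes large coefficients, which can help control overfitting in high-degree polynomial models. Both models below are fit on the degree-8 features from Section 5.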
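
Ridge and ElasticNet penalties depend on the scale of each feature, and raw powers of Age span many orders of magnitude. The sketch below shows one common way to handle that: standardize the polynomial columns inside a pipeline before the penalized fit. This is an assumption about a possible improvement, not what the cells in this section do, and the synthetic `age` and `fare` arrays are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (not the Titanic values)
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=(200, 1))
fare = 30 + 0.1 * age[:, 0] + rng.normal(0, 5, size=200)

scaled_ridge = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),  # age, age^2, ..., age^8
    StandardScaler(),                                  # put columns on one scale
    Ridge(alpha=1.0),                                  # L2 penalty on scaled columns
)
scaled_ridge.fit(age, fare)

coefs = scaled_ridge.named_steps["ridge"].coef_
print("Number of coefficients:", coefs.shape[0])
```

Keeping the scaler inside the pipeline means it is fit on the training data only whenever the pipeline is used with a train/test split.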


In [16]:
ridge = Ridge(alpha=1.0)
ridge.fit(X8_poly, y_train)

yr_train_pred = ridge.predict(X8_poly)
yr_test_pred = ridge.predict(X8_poly_test)

print("Ridge Regression d=8 Test R²:", r2_score(y_test, yr_test_pred))


Ridge Regression d=8 Test R²: -0.02214775563448601




In [17]:
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X8_poly, y_train)

ye_train_pred = elastic.predict(X8_poly)
ye_test_pred = elastic.predict(X8_poly_test)

print("ElasticNet d=8 Test R²:", r2_score(y_test, ye_test_pred))


ElasticNet d=8 Test R²: -0.004022969167496893




## Section 7. Summary Table

In [18]:
# Calculate all metrics for each model
def calculate_metrics(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    return {
        'Model': model_name,
        'MAE': round(mae, 3),
        'MSE': round(mse, 3),
        'RMSE': round(rmse, 3),
        'R²': round(r2, 3)
    }

# Collect results for all models
results = []
results.append(calculate_metrics(y_test, y1_test_pred, 'Linear Regression (Age)'))
results.append(calculate_metrics(y_test, y2_test_pred, 'Linear Regression (Age + Pclass)'))
results.append(calculate_metrics(y_test, y3_test_pred, 'Polynomial Regression (Degree 3)'))
results.append(calculate_metrics(y_test, y8_test_pred, 'Polynomial Regression (Degree 8)'))
results.append(calculate_metrics(y_test, yr_test_pred, 'Ridge Regression (Degree 8)'))
results.append(calculate_metrics(y_test, ye_test_pred, 'ElasticNet Regression (Degree 8)'))

# Create DataFrame and display
results_df = pd.DataFrame(results)
print("=" * 80)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 80)
print(results_df.to_string(index=False))
print("=" * 80)

# Find best model for each metric
print("\nBEST PERFORMING MODELS:")
print("-" * 40)
best_mae = results_df.loc[results_df['MAE'].idxmin()]
best_r2 = results_df.loc[results_df['R²'].idxmax()]
lowest_rmse = results_df.loc[results_df['RMSE'].idxmin()]

print(f"Lowest MAE: {best_mae['Model']} (MAE = {best_mae['MAE']})")
print(f"Highest R²: {best_r2['Model']} (R² = {best_r2['R²']})")
print(f"Lowest RMSE: {lowest_rmse['Model']} (RMSE = {lowest_rmse['RMSE']})")

MODEL PERFORMANCE COMPARISON
                           Model    MAE      MSE   RMSE     R²
         Linear Regression (Age) 25.286 1441.846 37.972  0.003
Linear Regression (Age + Pclass) 20.704  988.711 31.444  0.317
Polynomial Regression (Degree 3) 25.304 1451.569 38.099 -0.003
Polynomial Regression (Degree 8) 25.249 1457.279 38.174 -0.007
     Ridge Regression (Degree 8) 25.539 1478.831 38.456 -0.022
ElasticNet Regression (Degree 8) 25.305 1452.609 38.113 -0.004

BEST PERFORMING MODELS:
----------------------------------------
Lowest MAE: Linear Regression (Age + Pclass) (MAE = 20.704)
Highest R²: Linear Regression (Age + Pclass) (R² = 0.317)
Lowest RMSE: Linear Regression (Age + Pclass) (RMSE = 31.444)


## Section 8. Final Reflection

###  Which features performed best?

**Answer:** The combination of Age + Passenger Class (Pclass) performed far better than Age alone. This suggests that:
- **Passenger Class is a strong predictor** of fare, which makes intuitive sense because first-class tickets were much more expensive
- **Feature combinations** often outperform single features by capturing more of the underlying relationships
- The **socioeconomic indicator** (class) provides information that age alone cannot capture

**Evidence:** In the summary table, test R² rises from 0.003 for Model 1 (Age only) to 0.317 for Model 2 (Age + Pclass), and MAE falls from 25.286 to 20.704.
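
One quick way to check the link between class and fare directly is a grouped summary. The sketch below uses a toy frame with made-up fares; on the real data the same `groupby` call would be applied to the `titanic` DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 14.0, 8.0, 7.0],  # made-up values
})

# Mean, median and row count of Fare within each passenger class
fare_by_class = toy.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])
print(fare_by_class)
```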

---

###  Which model performed best?

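**Answer:** Linear Regression on Age + Pclass had the highest test R² and the lowest error on every metric in the summary table.

**Analysis:**
- **Best R² Score:** Linear Regression (Age + Pclass) with R² = 0.317
- **Lowest MAE:** Linear Regression (Age + Pclass) with MAE = 20.704  
- **Lowest RMSE:** Linear Regression (Age + Pclass) with RMSE = 31.444
- **Best Overall Balance:** Linear Regression (Age + Pclass); the five Age-only models all have test R² between -0.022 and 0.003

**Why this model succeeded:**
- It is the only model that uses Pclass; every Age-only model, whatever its degree or regularization, explains essentially none of the variance in Fare
- The gain comes from the added feature rather than from added model complexity
- Even so, R² = 0.317 leaves most of the variance in Fare unexplained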

---

###  Did polynomial models overfit?

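**Answer:** Not conclusively. Test performance did not improve with higher degree, but this notebook reports test metrics only, so the train/test gap was not measured.

**Degree 3 Polynomial:**
- Performance compared to linear: Similar, marginally worse (test R² = -0.003 vs 0.003; MAE = 25.304 vs 25.286)
- Signs of overfitting: No clear sign in the test metrics

**Degree 8 Polynomial:**
- Performance compared to simpler models: Similar (test R² = -0.007, slightly below both the linear and degree-3 models; MAE = 25.249, marginally lower than both)
- Overfitting indicators:
  - Large gap between training and test performance? Not measured; training predictions such as `y8_train_pred` are computed but never scored
  - Worse generalization than simpler models? Marginally by RMSE (38.174 vs 37.972 for linear), not by MAE
  - Unstable predictions? Not assessed in this notebook

**Conclusion:** Higher-degree polynomials did not show clear signs of overfitting in the test metrics because the differences between the Age-only models appear only in the third decimal place of R². The results suggest the more basic limitation is that Age alone carries almost no information about Fare.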
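
The train/test gap asked about above can be measured with a small helper like the one sketched here. The function name `train_test_r2` and the synthetic data are illustrative assumptions; in the notebook the same comparison could be made by scoring the existing `*_train_pred` arrays against `y_train`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def train_test_r2(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (train R², test R²)."""
    model.fit(X_train, y_train)
    return (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )

# Synthetic stand-in data (not the Titanic values)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

train_r2, test_r2 = train_test_r2(LinearRegression(), X_tr, X_te, y_tr, y_te)
print(f"train R² = {train_r2:.3f}, test R² = {test_r2:.3f}, gap = {train_r2 - test_r2:.3f}")
```

A training score well above the test score is the usual sign of overfitting; two similarly low scores point to underfitting instead.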

---

###  Did regularization help?

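**Answer:** No. Regularization was not beneficial for the degree-8 polynomial model at the settings tested.

**Ridge Regression Analysis:**
- Performance vs unregularized Degree 8 polynomial: Worse
- Evidence: R² = -0.022, MAE = 25.539 (unregularized: R² = -0.007, MAE = 25.249)

**ElasticNet Analysis:**
- Performance vs unregularized Degree 8 polynomial: Similar  
- Evidence: R² = -0.004, MAE = 25.305 (unregularized: R² = -0.007, MAE = 25.249)

**Key Insights:**
- Regularization did not help here; the unregularized degree-8 model already scored about the same as plain linear regression on the test set, so there was little visible overfitting to reduce
- The optimal regularization strength cannot be determined from these results, because only one setting was tried per model (Ridge `alpha=1.0`; ElasticNet `alpha=0.1`, `l1_ratio=0.5`)
- The polynomial features were not standardized before fitting, so the penalty acts unevenly across the raw powers of Age; the comparison should be read with that in mind
- Neither Ridge nor ElasticNet was effective for this dataset: ElasticNet roughly matched the unregularized model and Ridge was slightly worse

**Recommendation:** For future polynomial regression tasks with this dataset, avoid relying on regularization of Age-only polynomials because every Age-only model here has a test R² near zero. Adding an informative feature such as Pclass had a far larger effect.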
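
Judging the regularization strength requires trying more than one value. A minimal sketch of such a search, using `RidgeCV` on standardized degree-8 features, is shown below. The alpha grid, the scaling step and the synthetic data are all assumptions made for illustration and were not part of the experiments reported above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (not the Titanic values)
rng = np.random.default_rng(1)
age = rng.uniform(1, 80, size=(200, 1))
fare = 30 + 0.1 * age[:, 0] + rng.normal(0, 5, size=200)

alphas = np.logspace(-3, 3, 13)  # hypothetical grid: 0.001 up to 1000

search = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),  # picks alpha by cross-validation on the fit data
)
search.fit(age, fare)

best_alpha = search.named_steps["ridgecv"].alpha_
print("Selected alpha:", best_alpha)
```

On the real data the pipeline would be fit on the training split only, with the held-out test split kept for the final comparison.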