

# Project 4 - Predicting a Continuous Target with Regression (Titanic)
**Author:** AARON 
**Date:** November 14, 2025 
**Objective:** 



## Introduction
- 


## Section 1. Import and Inspect the Data
 

### 1.1 Include Imports

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


### 1.2 Load the dataset and display basic information

In [2]:

# Load Titanic dataset from seaborn and verify
titanic = sns.load_dataset("titanic")
titanic.head()

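|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |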


### 1.3 Check for missing values and display summary statistics

In [3]:
# Check for missing values using the isnull() method and then the sum() method. 
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
# Display summary statistics using the describe() method

print(titanic.describe())

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200


In [5]:
# Check for correlations using the corr() method and tell it to use only the numeric features. 

print(titanic.corr(numeric_only=True))

            survived    pclass       age     sibsp     parch      fare  \
survived    1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307   
pclass     -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500   
age        -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067   
sibsp      -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651   
parch       0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225   
fare        0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000   
adult_male -0.557080  0.094035  0.280328 -0.253586 -0.349943 -0.182024   
alone      -0.203367  0.135207  0.198270 -0.584471 -0.583398 -0.271832   

            adult_male     alone  
survived     -0.557080 -0.203367  
pclass        0.094035  0.135207  
age           0.280328  0.198270  
sibsp        -0.253586 -0.584471  
parch        -0.349943 -0.583398  
fare         -0.182024 -0.271832  
adult_male    1.000000  0.404744  
alone         0.404744  1.000000  
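Picking the strongest pair out of a correlation matrix by eye is error-prone, so here is a minimal sketch of doing it in code. It assumes a four-feature subset of the matrix above (values copied from the output); the variable names `corr`, `pairs`, and `strongest_pair` are illustrative only.

```python
import numpy as np
import pandas as pd

# Subset of the correlation matrix printed above (values copied from the output).
cols = ["pclass", "fare", "sibsp", "alone"]
corr = pd.DataFrame(
    [
        [1.0, -0.549500, 0.083081, 0.135207],
        [-0.549500, 1.0, 0.159651, -0.271832],
        [0.083081, 0.159651, 1.0, -0.584471],
        [0.135207, -0.271832, -0.584471, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Keep only the cells above the diagonal so each pair appears once
# and the trivial self-correlations (1.0) are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value: a strong negative correlation counts as strong.
strongest_pair = pairs.abs().idxmax()
strongest_value = pairs.loc[strongest_pair]
print(strongest_pair, strongest_value)
```

On the full dataset the same steps could be applied to `titanic.corr(numeric_only=True)` directly.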


### Reflection 1:
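- How many data instances are there? 891
- How many features are there? 15
- What are the names? survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone
- Are there any missing values? Yes: age (177), deck (688), embarked (2), and embark_town (2).
- Are there any non-numeric features? Yes. sex, embarked, class, who, deck, embark_town, and alive hold text or category values; adult_male and alone are booleans.
- Are the data instances sorted on any of the attributes? The first five rows show no sort on any attribute.
- What two different features have the highest correlation? sibsp and alone have the strongest correlation (-0.584), followed closely by parch and alone (-0.583). The strongest positive correlation is between sibsp and parch (0.415).
- Are there any categorical features that might be useful for prediction? class is the most promising: its numeric form, pclass, has the strongest correlation with fare (-0.55). embarked may also be worth testing.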
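The inspection questions above can be answered with a few pandas calls. The sketch below assumes a small stand-in frame built from the five `head()` rows (a subset of columns only), so the counts it prints describe that sample, not the full 891-row dataset.

```python
import pandas as pd

# Stand-in data: a few columns from the five head() rows shown earlier.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "deck": [None, "C", None, "C", None],
})

n_rows, n_cols = sample.shape          # instances and features
missing = sample.isnull().sum()        # missing values per column
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
is_sorted_by_age = sample["age"].is_monotonic_increasing

print(n_rows, n_cols)
print(missing[missing > 0])
print(non_numeric, is_sorted_by_age)
```

Running the same calls on `titanic` itself would give the figures for the whole dataset.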

## Section 2. Data Exploration and Preparation
### 2.1 Explore Data Patterns and Distributions

In [None]:
# Impute missing values for age using the median.
# Assign the result back to the column: calling fillna(inplace=True) on
# titanic['age'] is chained assignment and raises a pandas FutureWarning.
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Drop rows with missing fare (fare has no missing values here, so this is a safeguard)
titanic = titanic.dropna(subset=['fare'])
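As a small illustration of median imputation without chained `inplace` calls, the sketch below works on a made-up five-row frame (the `df` name and the placement of the missing values are assumptions for the example, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ages.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# median() skips NaN by default, so it is computed from the known ages only.
age_median = df["age"].median()

# Assigning the result back modifies the original frame reliably.
df["age"] = df["age"].fillna(age_median)

# dropna(subset=...) removes only rows where the listed columns are missing.
df = df.dropna(subset=["fare"])
print(df)
```

The median is less sensitive than the mean to a few very old or very young passengers, which is one reason it is a common default for skewed columns.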


### 2.2 Feature Engineering

In [60]:
# Create a numeric variable: family_size = sibsp + parch + 1 (the passenger)
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Create a numeric variable: class_survive = survived - pclass (ranges from -3 to 0)

titanic['class_survive'] = titanic['survived'] - titanic['pclass']
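To make the two engineered features concrete, the sketch below recomputes them on the five `head()` rows and then lists which (survived, pclass) combinations produce each `class_survive` value. The `combos` dictionary is a helper introduced only for this illustration.

```python
import pandas as pd

# Relevant columns from the five head() rows shown earlier.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0],
})
sample["family_size"] = sample["sibsp"] + sample["parch"] + 1
sample["class_survive"] = sample["survived"] - sample["pclass"]
print(sample)

# Enumerate every (survived, pclass) pair behind each class_survive value.
combos = {}
for survived in (0, 1):
    for pclass in (1, 2, 3):
        combos.setdefault(survived - pclass, []).append((survived, pclass))
print(combos)
```

One thing this makes visible: the values -1 and -2 each come from two different combinations (for example, -2 is either a second-class passenger who died or a third-class passenger who survived), so the feature does not separate class cleanly.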


## Section 3. Feature Selection and Justification
### 3.1 Choose features and target



Case 1:
input features: 'Age'
target: fare

Case 2:
input features: 'Family Size'
target: fare

Case 3:
input features: 'Age' and 'Family Size'
target: fare

Case 4:
input features: 'Class Survive' and 'Family Size'
target: fare


### 3.2 Define X and y

Assign the input features to X, a pandas DataFrame with one or more columns.
Assign the target variable to y, a pandas Series holding the single target feature.

In [61]:
# Case 1: Features = Age
X1 = titanic[['age']]
y1 = titanic['fare']
 

# Case 2: Features = Family Size
X2 = titanic[['family_size']]
y2 = titanic['fare']
 

# Case 3: Features = Age + Family Size
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

# Case 4: Features = Class Survive + Family Size
X4 = titanic[['class_survive', 'family_size']]
y4 = titanic['fare']

### Reflection 2 and 3:
- Why might these features affect a passenger’s fare:   I'm not sure age helps determine fare much, except for younger children.  Family size should have some bearing, since families could buy at a group rate.
- List all available features:  survived, pclass, sex, age, sibsp, parch, embarked, class, who, adult_male, deck, embark_town, alive, alone 
- Which other features could improve predictions and why:  I think class is the main determining factor for fare.
- How many variables are in your Case 4:  Two inputs built from three columns.  I combined class and survived into one number between -3 and 0, and I also use family size.
- Which variable(s) did you choose for Case 4 and why do you feel those could make good inputs:  I don't know for sure, but fare may have had an impact on who survived.  I am combining survived with class to fold two variables into one.  It also weights the equation nicely.  I'm really excited to see how this predicts the result.

## Section 4. Train a Regression Model (Linear Regression)
 

Split the data into training and test sets.

### 4.1 Split the Data

In [62]:

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)
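Because every split above uses the same `random_state` and the same number of rows, all four cases should end up with the same passengers in the test set, which keeps the comparison fair. The sketch below checks that property on made-up data; the frame `df` and the `Xa_`/`Xb_` names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the prepared Titanic frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(1, 80, 50),
    "family_size": rng.integers(1, 8, 50),
    "fare": rng.uniform(5, 100, 50),
})

# Two different feature sets, same target, same random_state.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    df[["age"]], df["fare"], test_size=0.2, random_state=123
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    df[["family_size"]], df["fare"], test_size=0.2, random_state=123
)

# The row labels of the two test sets match exactly.
same_rows = Xa_test.index.equals(Xb_test.index)
print(same_rows, len(Xa_test))
```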

### 4.2 Train and Evaluate Linear Regression Models (all 4 cases)
Create and train all 4 cases.

In [63]:
lr_model1 = LinearRegression().fit(X1_train, y1_train)
lr_model2 = LinearRegression().fit(X2_train, y2_train)
lr_model3 = LinearRegression().fit(X3_train, y3_train)
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions

y1_pred_train = lr_model1.predict(X1_train)
y1_pred_test = lr_model1.predict(X1_test)

y2_pred_train = lr_model2.predict(X2_train)
y2_pred_test = lr_model2.predict(X2_test)

y3_pred_train = lr_model3.predict(X3_train)
y3_pred_test = lr_model3.predict(X3_test)

y4_pred_train = lr_model4.predict(X4_train)
y4_pred_test = lr_model4.predict(X4_test)
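The fit-and-predict steps repeat the same pattern for each case, so they could also be written as a loop over a dictionary of feature lists. This is only a sketch on synthetic data: the `cases` and `results` names and the way `fare` is generated here are assumptions, not the notebook's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: fare depends on family_size plus noise, not on age.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "family_size": rng.integers(1, 8, n),
})
df["fare"] = 10 + 5 * df["family_size"] + rng.normal(0, 5, n)

cases = {
    "Case 1": ["age"],
    "Case 2": ["family_size"],
    "Case 3": ["age", "family_size"],
}

results = {}
for name, cols in cases.items():
    X_train, X_test, y_train, y_test = train_test_split(
        df[cols], df["fare"], test_size=0.2, random_state=123
    )
    model = LinearRegression().fit(X_train, y_train)
    results[name] = {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(name, scores)
```

Keeping the cases in one dictionary also makes it harder for a label and a feature list to drift apart.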

### 4.3 Evaluate Model Performance

In [44]:
# mean_squared_error returns MSE (squared fare units); take its square root for RMSE
print("Case 1: Training R²:", r2_score(y1_train, y1_pred_train))
print("Case 1: Test R²:", r2_score(y1_test, y1_pred_test))
print("Case 1: Test MSE:", mean_squared_error(y1_test, y1_pred_test))
print("Case 1: Test MAE:", mean_absolute_error(y1_test, y1_pred_test))



Case 1: Training R²: 0.009950688019452314
Case 1: Test R²: 0.0034163395508415295
Case 1: Test MSE: 1441.8455811188421
Case 1: Test MAE: 25.28637293162364


In [53]:
print("Case 2: Training R²:", r2_score(y2_train, y2_pred_train))
print("Case 2: Test R²:", r2_score(y2_test, y2_pred_test))
print("Case 2: Test MSE:", mean_squared_error(y2_test, y2_pred_test))
print("Case 2: Test MAE:", mean_absolute_error(y2_test, y2_pred_test))

Case 2: Training R²: 0.049915792364760736
Case 2: Test R²: 0.022231186110131973
Case 2: Test MSE: 1414.6244812277246
Case 2: Test MAE: 25.02534815941641


In [54]:
print("Case 3: Training R²:", r2_score(y3_train, y3_pred_train))
print("Case 3: Test R²:", r2_score(y3_test, y3_pred_test))
print("Case 3: Test MSE:", mean_squared_error(y3_test, y3_pred_test))
print("Case 3: Test MAE:", mean_absolute_error(y3_test, y3_pred_test))



Case 3: Training R²: 0.07347466201590014
Case 3: Test R²: 0.049784832763073106
Case 3: Test MSE: 1374.7601875944658
Case 3: Test MAE: 24.284935030470688


In [64]:
print("Case 4: Training R²:", r2_score(y4_train, y4_pred_train))
print("Case 4: Test R²:", r2_score(y4_test, y4_pred_test))
print("Case 4: Test MSE:", mean_squared_error(y4_test, y4_pred_test))
print("Case 4: Test MAE:", mean_absolute_error(y4_test, y4_pred_test))

Case 4: Training R²: 0.32930372730847557
Case 4: Test R²: 0.412862387766064
Case 4: Test MSE: 849.4638285827654
Case 4: Test MAE: 19.767219512272867
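The values that `mean_squared_error` returns in the cells above are mean squared errors, in squared fare units. Taking the square root converts them to RMSE, which is in the same units as fare and easier to compare with MAE. The sketch below applies that conversion to the four reported values; with scikit-learn, `np.sqrt(mean_squared_error(y_true, y_pred))` would do the same thing directly.

```python
import math

# Test-set mean squared errors reported for the four linear regression cases.
reported_mse = {
    "Case 1": 1441.8455811188421,
    "Case 2": 1414.6244812277246,
    "Case 3": 1374.7601875944658,
    "Case 4": 849.4638285827654,
}

# RMSE is the square root of MSE.
rmse = {case: math.sqrt(mse) for case, mse in reported_mse.items()}

for case, value in rmse.items():
    print(f"{case}: Test RMSE = {value:.2f}")
```

RMSE is always at least as large as MAE on the same predictions, and a wide gap between the two, as here, suggests a few very large errors, which would be consistent with the handful of very high fares in the data.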


Linear Regression Results

### Reflection 4:
Compare the train vs test results for each.


| Model Type        | Case   | Features Used              | Training R² | Test R² | Test MSE | Test MAE | Notes |
|-------------------|--------|----------------------------|-------------|---------|----------|----------|-------|
| Linear Regression | Case 1 | Age                        | 0.0100      | 0.0034  | 1441.85  | 25.29    | -     |
|                   | Case 2 | Family Size                | 0.0499      | 0.0222  | 1414.62  | 25.03    | -     |
|                   | Case 3 | Age, Family Size           | 0.0735      | 0.0498  | 1374.76  | 24.28    | -     |
|                   | Case 4 | Class Survive, Family Size | 0.3293      | 0.4129  | 849.46   | 19.77    | -     |

- Did Case 1 overfit or underfit? Explain:  Case 1 underfits.  Both R² values are close to zero (0.010 train, 0.003 test).
- Did Case 2 overfit or underfit? Explain:  Case 2 underfits.  There is a small improvement, but both R² values are still very low.
- Did Case 3 overfit or underfit? Explain:  Case 3 underfits.  There is another small improvement, but both R² values are still very low.
- Did Case 4 overfit or underfit? Explain:  Case 4 does not overfit, and it fits far better than Cases 1 to 3.  It is odd that the test set scored higher than the training set (0.41 vs. 0.33), but a higher test score is not a sign of overfitting.

### Adding Age

- Did adding age improve the model: Slightly. Test R² rose from 0.022 in Case 2 to 0.050 in Case 3.
- Propose a possible explanation (consider how age might affect ticket price, and whether the data supports that):  Age probably helps most when combined with family size.  A child traveling with family would be covered by the family's fare, so age adds information that family size alone lacks.  The data only weakly supports this, since the gain is small.

### Worst

- Which case performed the worst: Case 1 - Age
- How do you know: Its R² was almost 0 on both the training and test sets.
- Do you think adding more training data would improve it (and why/why not):  I don't think adding more training data would help.  Age does not have a meaningful impact on fare.

### Best

- Which case performed the best: Case 4 - Class Survive, Family Size
- How do you know: Its test R² was 0.41, roughly eight times the next-best case (0.05), and its squared-error and MAE values were noticeably lower.
- Do you think adding more training data would improve it (and why/why not):  I'm not sure.  In my case the test set performed better than the training set, which makes me think I either had a favorable split or had already gotten what I could out of the training data.
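One way to check whether a test score above the training score comes from a lucky split is k-fold cross-validation, which scores the model on several different held-out folds. The sketch below uses synthetic data (the feature matrix `X`, target `y`, and their generating formula are assumptions for the example) with scikit-learn's `KFold` and `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for two input features and a noisy continuous target.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.integers(-3, 1, n), rng.integers(1, 8, n)])
y = 60 + 15 * X[:, 0] + 4 * X[:, 1] + rng.normal(0, 20, n)

# Five shuffled folds; each fold serves once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print("mean:", scores.mean(), "std:", scores.std())
```

If the per-fold scores vary widely, a single 80/20 split landing above or below the training score would not be surprising.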

## Section 5. Compare to Neural Network Model



### 5.1 Train and Evaluate Model (Neural Network on Case 1)

In [446]:
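# Train NN for Case 1 (age)
# fare is a continuous target, so use MLPRegressor rather than MLPClassifier
from sklearn.neural_network import MLPRegressor

nn_model1 = MLPRegressor(
    hidden_layer_sizes=(50, 25, 10),
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

nn_model1.fit(X1_train, y1_train)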

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


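| Parameter          | Value      |
|--------------------|------------|
| hidden_layer_sizes | (50, ...)  |
| activation         | 'relu'     |
| solver             | 'lbfgs'    |
| alpha              | 0.0001     |
| batch_size         | 'auto'     |
| learning_rate      | 'constant' |
| learning_rate_init | 0.001      |
| power_t            | 0.5        |
| max_iter           | 1000       |
| shuffle            | True       |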

## Section 6. Final Thoughts & Insights

- Age and family size alone explain very little of the variation in fare.  Test R² stayed below 0.05 for Cases 1 through 3.
- The engineered class_survive feature made the biggest difference.  Case 4 reached a test R² of about 0.41 and the lowest MAE (about 19.77), which fits the correlation table, where pclass and fare correlate at -0.55.
- Case 4 scored higher on the test set than on the training set.  With only one random split, I think that is more likely a favorable split than evidence of a better model.
- The neural network comparison in Section 5 is incomplete.  The model hit the max_iter=1000 limit with a convergence warning, and I did not record regression metrics for it, so I cannot yet compare it with the linear models.