# Titanic Dataset Review
**Author:** Gabriel Richards (gjrich)

**Date:** 1 Apr 2025

**Objective:** This notebook reviews the Titanic dataset to look at the passengers of the famous ship that struck an iceberg and sank.


## Section 1. Imports and Basic Review
In the code cell below, we import the necessary Python libraries for this notebook.  

In [None]:
# all imports get moved to the top

import seaborn as sns
import pandas as pd

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.model_selection import StratifiedShuffleSplit


In [None]:
# Load Titanic dataset
titanic = sns.load_dataset('titanic')

Display basic information about the dataset using the info() method.

In [None]:
titanic.info()

Display the first 10 rows. A notebook only renders the value of the last expression in a cell, so wrap the call in print() if it appears earlier in the cell.



In [None]:
print(titanic.head(10))


Check for missing values using the isnull() method and then the sum() method. 

In [None]:
titanic.isnull().sum()

Display summary statistics using the describe() method.

In [None]:
print(titanic.describe())


Check for correlations using the corr() method and tell it to use only the numeric features. 



In [None]:
print(titanic.corr(numeric_only=True))


### Reflection 1
1) How many data instances are there?   There are 891 data instances (rows)
2) How many features are there?   There are 15 features (columns)
3) What are the names?   survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, and alone
4) Are there any missing values?   Yes, there are missing values:

    age: 177 missing values; 
    deck: 688 missing values; 
    embarked: 2 missing values;  
    embark_town: 2 missing values

5) Are there any non-numeric features?   Yes, there are non-numeric features. Image 2 shows these data types:
    object (5): sex, embarked, who, embark_town, alive;  
    category (2): class, deck;  
    bool (2): adult_male, alone

6) Are the data instances sorted on any of the attributes?   No, the data instances don't appear to be sorted on any particular attribute based on the first 10 rows
7) What are the units of age? Years
8) What are the minimum, median and max age? 
    Minimum age: 0.42 years;  
    Median age: 28.0 years;  
    Maximum age: 80.0 years

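9)  What two different features have the highest correlation?   The strongest correlation by absolute value is between "alone" and "sibsp" at -0.584471, followed closely by "alone" and "parch" at -0.583398. Both are negative: passengers with more relatives aboard are less likely to be flagged as alone.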

10) Are there any categorical features that might be useful for prediction?   Several!

    adult_male (correlation with survived: -0.557080);  
    pclass (-0.338481);  
    class (categorical version of pclass);  
    embark_town (potential socioeconomic indicator)
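
The answer to question 9 can also be checked programmatically instead of by reading the printed matrix. The sketch below is one possible approach, not part of the original notebook: `strongest_pair` is a hypothetical helper, and the toy frame holds made-up values only to show the call. In this notebook it could be called as `strongest_pair(titanic)`.

```python
import numpy as np
import pandas as pd


def strongest_pair(df: pd.DataFrame) -> tuple[str, str, float]:
    """Return the two numeric columns with the largest absolute correlation."""
    corr = df.corr(numeric_only=True)
    strength = corr.abs().to_numpy().copy()
    # A column always correlates perfectly with itself, so ignore the diagonal.
    np.fill_diagonal(strength, np.nan)
    i, j = np.unravel_index(np.nanargmax(strength), strength.shape)
    # Report the signed value so the direction of the relationship is kept.
    return corr.index[i], corr.columns[j], float(corr.iloc[i, j])


# Toy frame with made-up values, used only to demonstrate the helper.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 8, 6, 4, 2],  # decreases as x increases
    "z": [3, 1, 4, 1, 5],
})
col_a, col_b, value = strongest_pair(toy)
print(col_a, col_b, round(value, 3))
```

Ranking by absolute value matters here because the strongest relationships in this dataset are negative.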

## Section 2. Data Exploration and Preparation

### 2.1 Explore Data Patterns and Distributions
Since the Titanic dataset contains both numeric and categorical variables, the scatter matrix below uses only numeric columns.

Create a scatter matrix of age, fare, and pclass, followed by a scatter plot of age vs fare, colored by gender:

In [None]:
# First visualization - Scatter matrix
attributes = ['age', 'fare', 'pclass']
scatter_matrix(titanic[attributes], figsize=(10, 10))
plt.tight_layout()
plt.show()



plt.figure(figsize=(10, 6))

# Create scatter plots for each gender separately
males = titanic[titanic['sex'] == 'male']
females = titanic[titanic['sex'] == 'female']

# Plot males in red, females in blue
plt.scatter(males['age'], males['fare'], color='red', alpha=0.7, label='Male')
plt.scatter(females['age'], females['fare'], color='blue', alpha=0.7, label='Female')

plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare by Gender')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

### Histogram of age

In [None]:
sns.histplot(titanic['age'], kde=True)
plt.title('Age Distribution')
plt.show()

### Count plot for class and survival

In [None]:
sns.countplot(x='class', hue='survived', data=titanic)
plt.title('Class Distribution by Survival')
plt.show()

### 2.2 Reflections

#### What patterns or anomalies do you notice?
- Age is bimodally distributed with peaks around 25-30 years and at very young ages (<5 years)
- Fare distribution is heavily right-skewed with most passengers paying <50 but some extreme outliers over 500
- Clear differences in survival rates across passenger classes
- Gender strongly correlates with survival
- There is a high concentration of passengers in third class


#### Do any features stand out as potential predictors?
- Passenger class (pclass) shows dramatic survival differences: first class had the highest survival rate while third class had the lowest
- Gender appears significant, with females having higher survival rates across all classes
- Fare correlates with survival (likely as a proxy for class)
- The combination of gender and class appears particularly predictive



#### Are there any visible class imbalances?
- Survival status is imbalanced (~38% survived, 62% perished)
- Passenger classes are heavily imbalanced with most passengers in third class
- Age has 177 missing values (19.9% of dataset)
- Deck information is missing for 688 passengers (77.2%)
- Significant gender imbalance with more males than females
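
The percentages above can be reproduced with a short helper. This is a minimal sketch: `missing_report` is a hypothetical name, and the toy frame contains made-up values. In this notebook the equivalent calls would be `missing_report(titanic)` and `titanic['survived'].value_counts(normalize=True)`.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values, used only to demonstrate the helper.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, None],
    "deck": [None, "C", None, None],
    "survived": [0, 1, 1, 0],
})
report = missing_report(toy)
print(report)

# Share of each target class, useful for spotting class imbalance.
print(toy["survived"].value_counts(normalize=True))
```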

### 2.3 Handle Missing Values and Clean Data

The age column has missing values. We can impute them using the median:

In [None]:
titanic['age'] = titanic['age'].fillna(titanic['age'].median())


The embark_town column has two missing values. We can drop those rows or, as here, fill them with the mode:

In [None]:
# mode() returns a Series, so take its first element to get a scalar fill value
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])
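
One pitfall is worth checking when filling with the mode: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument by index. The sketch below uses a toy column with made-up values to show the difference between passing the Series and passing its first element.

```python
import pandas as pd

# Toy column with made-up values; rows 1 and 3 are missing.
towns = pd.Series(["Southampton", None, "Southampton", None, "Cherbourg"])

# The mode Series only has index 0, so rows 1 and 3 find no matching
# label and stay missing.
filled_with_series = towns.fillna(towns.mode())

# Taking the first element yields a scalar that fills every missing row.
filled_with_scalar = towns.fillna(towns.mode()[0])

remaining_series = int(filled_with_series.isnull().sum())
remaining_scalar = int(filled_with_scalar.isnull().sum())
print(remaining_series, remaining_scalar)
```

Re-running `isnull().sum()` after an imputation step is a cheap way to confirm that the fill actually took effect.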


### 2.4 Feature Engineering

Create a new feature: Family size

In [None]:
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1



Convert categorical data to numeric:



In [None]:
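# Encode sex as 0/1 so it can be used as a numeric model input
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
# Encode port of embarkation; the 2 missing embarked values stay NaN after mapping
titanic['embarked'] = titanic['embarked'].map({'C': 0, 'Q': 1, 'S': 2})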

Convert the existing boolean 'alone' column to a 0/1 integer feature:



In [None]:
titanic['alone'] = titanic['alone'].astype(int)


### 2.5 Further Reflections

#### Why might family size be a useful feature for predicting survival?
- Family dynamics affected survival decisions (staying together vs. individual escape)
- Different-sized groups had different mobility during evacuation (harder to move as a larger group)
- Large families might have been predominantly in certain classes/deck locations
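
These hypotheses can be checked against the data by grouping on family size. The sketch below is illustrative only: `survival_by_family_size` is a hypothetical helper, and the toy frame holds made-up values rather than real passenger records. It assumes the `sibsp`, `parch`, and `survived` column names used in this notebook.

```python
import pandas as pd


def survival_by_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Survival rate and passenger count for each family size."""
    # Same definition as the engineered feature: siblings/spouses + parents/children + self.
    size = (df["sibsp"] + df["parch"] + 1).rename("family_size")
    return df["survived"].groupby(size).agg(
        survival_rate="mean",
        passengers="count",
    )


# Toy frame with made-up values, used only to demonstrate the helper.
toy = pd.DataFrame({
    "sibsp": [0, 1, 1, 0, 2],
    "parch": [0, 0, 0, 0, 1],
    "survived": [0, 1, 0, 1, 1],
})
rates = survival_by_family_size(toy)
print(rates)
```

Reporting the count next to the rate helps avoid over-reading rates from very small groups.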


#### Why convert categorical data to numeric?
- Most machine learning algorithms, including scikit-learn estimators, expect numeric input
- Mathematical operations can't be performed on text
- Numeric encoding enables pattern detection by algorithms
- It standardizes features for consistent processing in models
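
Integer codes such as C=0, Q=1, S=2 imply an ordering that a nominal feature like the port of embarkation does not have. One common alternative, shown here as a sketch rather than as what this notebook does, is one-hot encoding with `pd.get_dummies`. The toy frame contains made-up values.

```python
import pandas as pd

# Toy frame with made-up values, used only to demonstrate the encoding.
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# Each category becomes its own 0/1 column, so no artificial order is introduced.
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)
print(encoded)
```

For a binary feature such as sex, a single 0/1 column carries the same information, so the mapping used above is sufficient.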

## Section 3. Feature Selection and Justification

### 3.1 Choose features and target

- Select two or more input features (numerical for regression, numerical and/or categorical for classification).
- Select a target variable (as applicable). For classification, this is a categorical target variable (e.g., gender, species).
- Justify your selection with reasoning.
 

For classification, we’ll use survived as the target variable.

Input features: age, fare, pclass, sex, family_size
Target: survived

### 3.2 Define X and y

Assign input features to X
Assign target variable to y (as applicable)

In [None]:
X = titanic[['age', 'fare', 'pclass', 'sex', 'family_size']]
y = titanic['survived'] 

### 3.3 Reflection

#### Why are these features selected?
- Sex: A "women and children first" policy was followed during the evacuation
- Pclass: Represents socioeconomic status and physical location on ship (lower decks had less access to lifeboats)
- Age: Children were prioritized for rescue
- Fare: Proxy for wealth/status that might influence treatment
- Family_size: Feature could capture group dynamics during evacuation

#### Are there any features that are likely to be highly predictive of survival?
- Sex: Strongly associated with survival (the related adult_male flag correlates with survived at -0.557), with females having higher survival rates
- Pclass: Second strongest predictor (-0.338), with first-class passengers surviving at higher rates
- Family_size: Related to "alone" status (alone correlates with survived at -0.203), suggesting traveling with family affected survival chances
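
The correlation figures quoted above come from the Section 1 matrix, computed before encoding. A quick way to rank the selected inputs against the target directly is `DataFrame.corrwith`. The sketch below uses a hypothetical helper, `rank_features`, and made-up toy values. In this notebook it could be called as `rank_features(X, y)`.

```python
import pandas as pd


def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Correlation of each input column with the target, strongest first."""
    # Sort by absolute value so strong negative correlations rank highly too.
    return X.corrwith(y).sort_values(key=abs, ascending=False)


# Toy data with made-up values, used only to demonstrate the helper.
X_toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 0, 1, 0],
})
y_toy = pd.Series([0, 0, 1, 1], name="survived")

ranked = rank_features(X_toy, y_toy)
print(ranked)
```

Linear correlation is only a first screen: it can miss non-linear effects, such as very young and very old passengers differing from those in between.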

## Section 4. Splitting
Split the data into training and test sets using train_test_split first and StratifiedShuffleSplit second. Compare.



### 4.1 Basic Train/Test split 


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print('Train size:', len(X_train))
print('Test size:', len(X_test))

In [None]:
print("Original Class Distribution:\n", X['pclass'].value_counts(normalize=True),"\n")
print("Train Set Class Distribution:\n", X_train['pclass'].value_counts(normalize=True),"\n")
print("Test Set Class Distribution:\n", X_test['pclass'].value_counts(normalize=True),"\n")

### 4.2 Stratified Train/Test split


In [None]:
# Create stratified split
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=456)

# Initialize variables
X_train_strat = None
X_test_strat = None
y_train_strat = None
y_test_strat = None

# Split the data, preserving target distribution
for train_indices, test_indices in splitter.split(X, y):
    X_train_strat = X.iloc[train_indices]
    X_test_strat = X.iloc[test_indices]
    y_train_strat = y.iloc[train_indices]
    y_test_strat = y.iloc[test_indices]

print('Train size:', len(X_train_strat))
print('Test size:', len(X_test_strat))

In [None]:
print("Original Class Distribution:\n", X['pclass'].value_counts(normalize=True),"\n")
print("Train Set Class Distribution:\n", X_train_strat['pclass'].value_counts(normalize=True),"\n") 
print("Test Set Class Distribution:\n", X_test_strat['pclass'].value_counts(normalize=True),"\n")
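
The splitter stratifies on y (survived), so it is the target distribution, not the pclass distribution, that is guaranteed to be preserved. The sketch below compares target proportions across a stratified split. It uses made-up toy data and the `stratify` argument of `train_test_split`, which is a shorter alternative to the explicit `StratifiedShuffleSplit` loop.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 30 positives out of 100 rows.
y_toy = pd.Series([1] * 30 + [0] * 70, name="survived")
X_toy = pd.DataFrame({"feature": range(100)})

# stratify=y_toy keeps the class proportions of the target in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Side-by-side proportions of each target class.
comparison = pd.DataFrame({
    "original": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(comparison)
```

In this notebook, the same comparison could be made by calling `value_counts(normalize=True)` on `y`, `y_train_strat`, and `y_test_strat`.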

### 4.3 Reflection

Why might stratification improve model performance?

How close are the training and test distributions to the original dataset?

Which split method produced better class balance?