# Lab 3: Titanic Project - Predicting a Categorical Target and Evaluating Performance

### Name: Mindy Cruz
### Date: 4/5/2025
### Intro: 

In this lab, we use the Titanic dataset to build and evaluate three classifiers: Decision Tree, Support Vector Machine, and Neural Network. We compare model performance across three different feature sets and reflect on their effectiveness for predicting passenger survival.

Three common classification models are listed below. 

Decision Tree Classifier (DT) - A Decision Tree splits data into smaller groups based on decision rules (like "is height greater than 150 cm?"). It’s like a flowchart, where each decision point leads to another question until a final classification is reached. Decision Trees are easy to interpret and fast to train, but they can overfit if the tree becomes too complex.

Support Vector Machine (SVM) - A Support Vector Machine tries to find the "best boundary" (a hyperplane) that separates data into classes. It works well with complex data and small datasets. SVMs are effective when there is a clear margin of separation between classes, but they can be computationally expensive for large datasets.

Neural Network (NN) - A Neural Network is inspired by how human brains process information. It consists of layers of interconnected "neurons" that process input data and adjust based on feedback. It can handle complex patterns and non-linear relationships, but needs more data and tuning to avoid overfitting.

When classifying data, comparing three (or more) models can help because each has different strengths:

- Decision Trees illustrate how individual features contribute to classification.
- SVMs are good at finding complex boundaries.
- Neural Networks are good at learning patterns from complex data.
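
As a minimal sketch of how these three model types are created in scikit-learn, the block below fits each one on a small synthetic dataset from `make_classification`. The synthetic data and the hyperparameters are assumptions for illustration only; they are not the Titanic data or the settings used later in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, random_state=42
)

# Illustrative settings only, not tuned values
demo_models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "SVM": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=42
    ),
}

demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)
    # Accuracy on the same data the model was trained on
    demo_scores[name] = model.score(X_demo, y_demo)
    print(f"{name}: training accuracy = {demo_scores[name]:.2f}")
```

Training accuracy alone says little about generalization, which is why the models should be evaluated on held-out test data.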

# Section 1: Import and Inspect the Data

In [23]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [24]:
# Load Titanic dataset
titanic = sns.load_dataset('titanic')
# Display a few records to verify
titanic.head(5)

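| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |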
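
Before cleaning, it helps to see which columns contain missing values. The helper below is a minimal sketch; `missing_summary` is a hypothetical name, and the small frame is made-up illustration data rather than real passenger records.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Hypothetical toy rows, for illustration only
toy_missing = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

summary = missing_summary(toy_missing)
print(summary)
```

In the notebook, the same helper could be called as `missing_summary(titanic)` right after loading the data.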


# Section 2: Data Exploration and Preparation

2.1 Handle Missing Values and Clean Data

In [25]:
# Impute missing values for age using the median
median_age = titanic['age'].median()
titanic['age'] = titanic['age'].fillna(median_age)

# Fill in the missing values for embark_town using the mode
mode_embark = titanic['embark_town'].mode()[0]
titanic['embark_town'] = titanic['embark_town'].fillna(mode_embark)


2.2 Feature Engineering


In [26]:
# Create new feature
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Map categories to numeric values
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
titanic['embarked'] = titanic['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
titanic['alone'] = titanic['alone'].astype(int)
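
One thing to keep in mind with `Series.map` and a dictionary: any value without a matching key becomes NaN, and values that were already missing stay missing. Section 2.1 fills `embark_town` but not `embarked`, so it is worth checking the mapped column for NaN before using it as a feature. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes, including one missing value
ports = pd.Series(["S", "C", np.nan, "Q"])

# Same dictionary as the mapping used for 'embarked'
encoded = ports.map({"C": 0, "Q": 1, "S": 2})

print(encoded.tolist())
print("missing after mapping:", int(encoded.isna().sum()))
```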

# Section 3: Feature Selection and Justification


3.1 Choose features and target

Case 1: 

input features: alone
target: survived

Case 2:

input features: age
target: survived

Case 3:

input features: age and family_size
target: survived

3.2 Define X (features) and y (target)

In [27]:
# Case 1: Feature = alone

# Select the feature 'alone' as input
X1 = titanic[['alone']]

# Select 'survived' as the target for the same rows
y1 = titanic['survived']

In [28]:
# Case 2: Feature = age (drop if na or not available)

# Select the feature 'age'; age was already imputed in Section 2.1,
# so dropna() here is only a safeguard against remaining missing values
X2 = titanic[['age']].dropna()

# Select the matching 'survived' values using the same index
y2 = titanic.loc[X2.index, 'survived']

In [29]:
# Case 3: Features = Age + Family Size (drop if na or not available)

# Select both 'age' and 'family_size', and drop rows where missing (na)
X3 = titanic[['age', 'family_size']].dropna()

# Select the corresponding 'survived' values for those rows
y3 = titanic.loc[X3.index, 'survived']
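
The `dropna()` plus `.loc[X.index, ...]` pattern used in Cases 2 and 3 keeps the features and the target on the same rows. The sketch below shows the idea on a made-up frame (hypothetical values, not real passenger records):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age
toy_xy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "family_size": [2, 1, 1, 2],
    "survived": [0, 1, 1, 1],
})

# Dropping rows with missing features removes the row at index 1
X_toy = toy_xy[["age", "family_size"]].dropna()

# Reusing X_toy.index selects the target values for the surviving rows only
y_toy = toy_xy.loc[X_toy.index, "survived"]

print(X_toy.index.tolist())
print(y_toy.tolist())
```

Selecting the target by index, rather than by position, avoids a silent mismatch between rows of X and y when some rows are dropped.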

Reflection 3:

Why are these features selected?

Age and family size are good indicators of whether a passenger survived. From history, we know that women and children were the first off the boat.

Are there features that are likely to be highly predictive of survival?

I think that sex and pclass would also play a role. Women were more likely to have gotten off first, and people who paid more were also more likely to have gotten off first.
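
One way to check a hunch like this is to compare survival rates across groups. The helper below is a minimal sketch; `survival_rate_by` is a hypothetical name, and the rows are made-up illustration data, so the printed rates say nothing about the real dataset.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the 0/1 'survived' column within each group of `column`."""
    return df.groupby(column)["survived"].mean()


# Hypothetical toy rows, not real passenger records
toy_pass = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 3, 1],
    "survived": [0, 1, 1, 0],
})

rates = survival_rate_by(toy_pass, "sex")
print(rates)
```

In the notebook, `survival_rate_by(titanic, 'sex')` and `survival_rate_by(titanic, 'pclass')` would give the actual group rates to compare.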

# Section 4: Train a Classification Model (Decision Tree)