<a href="https://colab.research.google.com/github/mdaugherity/MachineLearning2024/blob/main/class/HW3_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW3 - Trees on the Titanic

Name: **YOUR NAME HERE**  
Class: PHYS453 - Machine Learning  
Date: Spring 2024  






In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Input
Load and process the input data.  

*Use the code below, without modifications, to load the data.*

In [None]:
# Load Titanic
data = fetch_openml(name="titanic",version=1, as_frame=True, parser='auto')

df_raw = data.frame # the raw data
df_raw.head()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [None]:
# Factorize any non-numeric columns we want to use
codes, genders = pd.factorize(df_raw.sex)
df_raw['gender'] = codes
print('Gender codes:', genders.categories)

Gender codes: Index(['female', 'male'], dtype='object')
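`pd.factorize` hands out integer codes in order of first appearance and returns the unique labels alongside them. A minimal sketch on a toy Series (the values are invented for illustration and are not taken from the Titanic data):

```python
import pandas as pd

# Toy column, made up for illustration only
toy = pd.Series(['female', 'male', 'male', 'female', 'male'])

codes, uniques = pd.factorize(toy)
print(codes)    # one integer code per row
print(uniques)  # uniques[k] is the label that received code k

# Indexing the uniques by the codes recovers the original labels
recovered = uniques[codes]
print(list(recovered))
```

Looking a code up as `uniques[k]` is probably the most direct way to check which label it stands for, since it does not depend on how the categories of a categorical column happen to be ordered.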


In [None]:
print('Original data:')
print(df_raw.sex.value_counts())

print('\nFactorized result:')
df_raw.gender.value_counts()

Original data:
male      843
female    466
Name: sex, dtype: int64

Factorized result:


1    843
0    466
Name: gender, dtype: int64

In [None]:
# Survived is a category, so the values are strings instead of numbers!
df_raw.survived.unique()

['1', '0']
Categories (2, object): ['0', '1']

In [None]:
df_raw.survived = df_raw.survived.astype('int64')

In [None]:
df_raw.survived.unique()  # much better!

array([1, 0])

In [None]:
# Choose columns to keep
df_raw.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'gender'],
      dtype='object')

In [None]:
features = ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare'] # choose columns for features
target = ['survived']
cols = target + features # combination of target and features
df = df_raw[cols].copy()
df.head()

,survived,pclass,gender,age,sibsp,parch,fare
0,1,1,0,29.0,0,0,211.3375
1,1,1,1,0.9167,1,2,151.55
2,0,1,0,2.0,1,2,151.55
3,0,1,1,30.0,1,2,151.55
4,0,1,0,25.0,1,2,151.55


In [None]:
df.describe()

,survived,pclass,gender,age,sibsp,parch,fare
count,1309.0,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,0.381971,2.294882,0.644003,29.881135,0.498854,0.385027,33.295479
std,0.486055,0.837836,0.478997,14.4135,1.041658,0.86556,51.758668
min,0.0,1.0,0.0,0.1667,0.0,0.0,0.0
25%,0.0,2.0,0.0,21.0,0.0,0.0,7.8958
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,1.0,39.0,1.0,0.0,31.275
max,1.0,3.0,1.0,80.0,8.0,9.0,512.3292


WARNING!  The counts differ between columns: `age` has 1046 entries and `fare` has 1308, while every other column has 1309.  We must have some missing data!
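One way to see exactly where the gaps are is to count the missing entries per column with `isna().sum()`. A small sketch on an invented frame (the column names mirror the ones above, but the values are made up):

```python
import numpy as np
import pandas as pd

# Invented data with two deliberate gaps
toy = pd.DataFrame({
    'age':    [29.0, np.nan, 2.0, 30.0],
    'fare':   [211.3, 151.6, np.nan, 151.6],
    'pclass': [1, 1, 1, 1],
})

missing = toy.isna().sum()   # number of NaN entries in each column
print(missing)

cleaned = toy.dropna()       # keep only rows with no NaN in any column
print(len(toy), '->', len(cleaned))
```

Note that `dropna()` removes a row if any of its columns is missing, so the number of rows lost can exceed the gap count of any single column.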

In [None]:
# Clean up dataframe by dropping rows with missing values
print('Row count:\t', len(df))
df.dropna(inplace=True)  # delete every row that has a NaN in any column
print('After dropna:\t', len(df))

Row count:	 1309
After dropna:	 1045


In [None]:
# Save variables
X = df[features].values
y = df[target].values.ravel()  # flatten the (n, 1) column into a 1-D array of labels

# Testing



## Problem 1

* Choose any 2 features and make a new variable called ```X2``` with only these 2 columns
* Train a tree using all rows of ```X2``` and ```y``` with max_depth=3
* Print a nice diagram of the tree.  Make sure it is filled and all features and classes are labeled!
* Make a plot of the decision boundaries and verify it matches the diagram
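
As a rough sketch of the last step, one common way to draw decision boundaries is to evaluate the trained tree on a dense grid and shade the predicted class. The data here is a synthetic stand-in generated with `make_classification`, and `X_demo`, `y_demo`, and the axis labels are placeholder names, not part of the assignment:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X2 and y: two numeric features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Build a grid that covers the data range with a small margin
pad = 0.5
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min() - pad, X_demo[:, 0].max() + pad, 300),
    np.linspace(X_demo[:, 1].min() - pad, X_demo[:, 1].max() + pad, 300),
)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Predict a class for every grid point and reshape back to the grid
zz = clf.predict(grid).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3, cmap='coolwarm')            # shaded regions
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, cmap='coolwarm',
           edgecolor='k', s=20)                                 # training points
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
ax.set_title('Decision regions, max_depth=3')
```

Because each split tests a single feature against a threshold, the regions come out as axis-aligned rectangles, and each edge should line up with a threshold shown in the tree diagram.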

**Note**: the default colors of ```plot_tree``` are blue and orange, and they are annoyingly difficult to change.  Feel free to keep the default colors, or if you are feeling adventurous see: https://stackoverflow.com/questions/70437840/how-to-change-colors-for-decision-tree-plot-using-sklearn-plot-tree
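
A minimal sketch of a labeled, filled diagram with the default colors, again on synthetic stand-in data. The names `feat_a`, `feat_b`, `class 0`, and `class 1` are placeholders; for the Titanic data you would substitute your two chosen column names and labels for the two outcomes:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text

# Synthetic stand-in for X2 and y
X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
feature_names = ['feat_a', 'feat_b']    # placeholder feature labels
class_names = ['class 0', 'class 1']    # listed in the same order as clf.classes_

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Filled boxes are colored by majority class; labels appear in every node
fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(clf, feature_names=feature_names,
                        class_names=class_names, filled=True, ax=ax)

# A plain-text version of the same tree, handy for checking thresholds
text = export_text(clf, feature_names=feature_names)
print(text)
```

The entries of `class_names` are matched to the sorted class values, so with a 0/1 target the first name describes class 0.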

# Training

## Problem 2
Using all 6 features, find the optimal value of `max_depth`, i.e. the depth that performs best on data held out from training.
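
One possible approach, sketched here under assumptions, is to score each depth on repeated random train/test splits and pick the depth with the highest average test accuracy. The data is a synthetic stand-in, and the depth range, the number of trials, and the 25% test size are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y with 6 features
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)

depths = np.arange(1, 16)   # candidate max_depth values (arbitrary range)
n_trials = 20               # number of random splits to average over
test_acc = np.zeros((len(depths), n_trials))

for t in range(n_trials):
    # A different random split for each trial
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=t)
    for i, d in enumerate(depths):
        clf = DecisionTreeClassifier(max_depth=int(d), random_state=0)
        clf.fit(X_train, y_train)
        test_acc[i, t] = clf.score(X_test, y_test)   # accuracy on held-out rows

mean_acc = test_acc.mean(axis=1)                     # average over trials
best_depth = int(depths[np.argmax(mean_acc)])
print('Best max_depth:', best_depth)
```

Averaging over several splits smooths out the noise of any single split; plotting `mean_acc` against `depths` usually makes the choice easier to justify than the `argmax` alone.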

# Predicting


## Load passenger data

In [None]:
df_pass = pd.read_csv('https://raw.githubusercontent.com/mdaugherity/MachineLearning2024/main/data/titanic_passengers.csv')
df_pass

,Name,pclass,gender,age,sibsp,parch,fare
0,MinJun,2,1,0.166,1,2,200.0
1,Griffin,3,1,21.0,3,0,10.0
2,Simon,3,1,21.0,0,0,20.0
3,Grant,3,1,21.0,0,0,30.0
4,Nolan,2,1,22.0,1,0,276.45
5,Taylor,3,1,12.0,6,0,8.0
6,Sean,2,1,19.0,3,2,351.6889
7,Skylar,1,1,65.0,1,4,230.0
8,Erik,3,1,21.0,0,0,40.0
9,Aubrey,3,0,30.0,0,0,98.0


In [None]:
X_pred = df_pass[features] # make sure we get the same feature columns in the same order
X_pred

,pclass,gender,age,sibsp,parch,fare
0,2,1,0.166,1,2,200.0
1,3,1,21.0,3,0,10.0
2,3,1,21.0,0,0,20.0
3,3,1,21.0,0,0,30.0
4,2,1,22.0,1,0,276.45
5,3,1,12.0,6,0,8.0
6,2,1,19.0,3,2,351.6889
7,1,1,65.0,1,4,230.0
8,3,1,21.0,0,0,40.0
9,3,0,30.0,0,0,98.0


## Problem 3

Use the **best** tree from Problem 2 to classify the extra passengers loaded above.  
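
A sketch of the prediction step on a tiny invented data set. The frames `train` and `new_rows`, the three-column `feature_cols`, and the value of `best_depth` are all placeholders for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_cols = ['pclass', 'gender', 'age']   # placeholder subset of features

# Invented training rows, not real passengers
train = pd.DataFrame({
    'pclass':   [1, 3, 3, 2, 1, 3, 2, 3],
    'gender':   [0, 1, 1, 0, 1, 0, 1, 1],
    'age':      [29.0, 21.0, 35.0, 40.0, 50.0, 18.0, 30.0, 25.0],
    'survived': [1, 0, 0, 1, 0, 1, 0, 0],
})

# Invented rows to classify; the columns are deliberately in a different order
new_rows = pd.DataFrame({
    'Name':   ['A', 'B'],
    'age':    [22.0, 30.0],
    'gender': [1, 0],
    'pclass': [3, 1],
})

best_depth = 2   # placeholder; use the depth you found in Problem 2
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
clf.fit(train[feature_cols].values, train['survived'].values)

# Selecting by feature_cols puts the columns in the training order
X_new = new_rows[feature_cols].values
pred = clf.predict(X_new)

results = new_rows[['Name']].assign(predicted=pred)
print(results)
```

Recent scikit-learn versions warn about feature names when a model is fit on a plain array and then asked to predict on a DataFrame, so passing `.values` on both sides, as above, keeps the two calls consistent.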

# Discussion

**DISCUSSION GOES HERE**