<a href="https://colab.research.google.com/github/mdaugherity/MachineLearning2023/blob/main/HW3_Trees_on_the_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PHYS 453, Spring 2023

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# HW3 - Trees on the Titanic 

Classify the people in the data set and tell me **WHO LIVES AND WHO DIES**.

## Assignment:
1.  Train a tree using all the data with max_depth=3 and print a nice diagram of the tree.  Make sure it is filled and all features and classes are labeled!
1.  Set up a test/train split. Train decision trees on the Titanic data to make a plot of test score vs max_depth for depths from 1 to 25.
1.  Use the **best** tree (the depth with the highest test score) to classify the extra passengers.


Use the code below, without modification, to load the data.

In [None]:
# Load Titanic
data = fetch_openml(name="titanic",version=1, as_frame=True)

df_raw = data.frame # the raw data
df_raw.head()



Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


This dataset is fascinating.  Let's look at a few columns:
* pclass = Passenger Class 1, 2, or 3
* survived = 1 for people who survived. Survivors should have a lifeboat number in boat and a blank body column. Not all bodies were recovered, so body can be blank for victims too.
* sibsp = number of siblings (for kids) or spouses (for adults) aboard
* parch = number of parents (for kids) or children (for adults) aboard
* fare is in old British money (pounds / shillings / pence) which gives weird fractions
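
The odd fractions in fare come from expressing pounds, shillings, and pence as decimal pounds. Here is a minimal sketch of that arithmetic, assuming the pre-decimal rates of 20 shillings to the pound and 12 pence to the shilling. The helper name `lsd_to_pounds` is made up for illustration, and the £211 6s 9d breakdown is worked backwards from the fare in row 0, not read from the dataset.

```python
def lsd_to_pounds(pounds, shillings=0, pence=0):
    """Convert pounds/shillings/pence to decimal pounds (hypothetical helper)."""
    # 1 pound = 20 shillings and 1 shilling = 12 pence, so 1 pound = 240 pence
    return pounds + shillings / 20 + pence / 240

print(round(lsd_to_pounds(211, 6, 9), 4))   # 211.3375, the fare in row 0
print(round(lsd_to_pounds(7, 17, 11), 4))   # 7.8958
```

The decimal part is therefore always a multiple of 1/240 of a pound.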

Notice the tragedies we see in the first 5 rows alone.  Miss Allen (row 0) is a first-class passenger traveling alone who survives.  Next we get the Allison family of four (rows 1-4).  The parents each have 1 spouse (sibsp) and 2 children (parch) aboard; the kids each have 1 sibling and 2 parents.  Only the 11-month-old infant is put on a lifeboat and survives.

To use this for machine learning, we need to clean this up significantly:
* Throw away some columns 
* Make all columns numeric
* Remove rows with missing data



In [None]:
# Factorize any non-numeric columns we want to use
codes, genders = pd.factorize(df_raw.sex)
df_raw['gender'] = codes
print('Gender codes:', genders.categories)

Gender codes: Index(['female', 'male'], dtype='object')
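
It helps to know how `pd.factorize` assigns its integers: codes are handed out in order of first appearance. The sketch below uses a toy column (not the Titanic data) to show this. It also shows that for a categorical column, `.categories` lists labels in the dtype's own order, which is not necessarily the code order.

```python
import pandas as pd

# Toy column: 'male' appears first, so it receives code 0
sex = pd.Series(['male', 'female', 'male', 'female'])
codes, uniques = pd.factorize(sex)
print(codes)           # integer code for each row
print(list(uniques))   # position i holds the label for code i

# Same data stored as a categorical column
sex_cat = sex.astype('category')
cat_codes, cat_uniques = pd.factorize(sex_cat)
print(list(cat_uniques))             # labels in code order
print(list(cat_uniques.categories))  # labels in the dtype's (sorted) order
```

In the Titanic frame the first row is female, so both orderings agree and 0 = female, 1 = male. Printing `list(genders)` instead of `genders.categories` would be one way to stay correct if the rows were ever reordered.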


In [None]:
# Choose columns to keep
df_raw.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'gender'],
      dtype='object')

In [None]:
features = ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare'] # choose columns for features
target = ['survived']
cols = target + features # combination of target and features
df = df_raw[cols].copy()
df.head()

Unnamed: 0,survived,pclass,gender,age,sibsp,parch,fare
0,1,1.0,0,29.0,0.0,0.0,211.3375
1,1,1.0,1,0.9167,1.0,2.0,151.55
2,0,1.0,0,2.0,1.0,2.0,151.55
3,0,1.0,1,30.0,1.0,2.0,151.55
4,0,1.0,0,25.0,1.0,2.0,151.55


In [None]:
df.describe()

Unnamed: 0,pclass,gender,age,sibsp,parch,fare
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,2.294882,0.644003,29.881135,0.498854,0.385027,33.295479
std,0.837836,0.478997,14.4135,1.041658,0.86556,51.758668
min,1.0,0.0,0.1667,0.0,0.0,0.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958
50%,3.0,1.0,28.0,0.0,0.0,14.4542
75%,3.0,1.0,39.0,1.0,0.0,31.275
max,3.0,1.0,80.0,8.0,9.0,512.3292


WARNING!  The count is different for different columns, so we must have some missing data!  Here age has only 1046 values and fare has 1308, out of 1309 rows.
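
The count row of `describe()` skips NaN entries, which is why it exposes the gaps. A more direct check is to count the missing values per column. This is a small sketch on a made-up toy frame, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (values made up) with gaps in two columns
toy = pd.DataFrame({
    'age':    [29.0, np.nan, 2.0, 30.0],
    'fare':   [211.34, 151.55, np.nan, 151.55],
    'pclass': [1, 1, 1, 1],
})

missing = toy.isna().sum()                 # number of NaN entries per column
rows_with_gap = toy.isna().any(axis=1).sum()  # rows that dropna() would remove
print(missing)
print('Rows with any gap:', rows_with_gap)
```

Running `df.isna().sum()` on the Titanic frame should give the same information as comparing the counts by eye.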

In [None]:
# Clean up dataframe by dropping missing rows
print('Row count:\t', len(df))
df.dropna(inplace=True)  # delete rows with missing or bad values
print('After dropna:\t', len(df))

Row count:	 1309
After dropna:	 1045


In [None]:
# Save variables
X = df[features].values
y = df[target].values

# Load passenger data

In [None]:
df_pass = pd.read_csv('https://raw.githubusercontent.com/mdaugherity/MachineLearning2022/main/titanic_passengers.csv')
df_pass

Unnamed: 0,Name,pclass,gender,age,sibsp,parch,fare
0,Isla,2,0,21,3,2,512
1,Beth,1,0,22,1,2,215
2,Noah,1,1,56,27,1,279
3,Josh,3,1,6,0,2,3412
4,Ethan,2,1,19,7,2,0
5,Mike,3,1,99,1,0,5


In [None]:
X_pred = df_pass[features] # make sure we get the same feature columns in the same order
X_pred

Unnamed: 0,pclass,gender,age,sibsp,parch,fare
0,2,0,21,3,2,512
1,1,0,22,1,2,215
2,1,1,56,27,1,279
3,3,1,6,0,2,3412
4,2,1,19,7,2,0
5,3,1,99,1,0,5


# Problem 1
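
One possible way to draw a filled, labeled tree is `sklearn.tree.plot_tree`. The sketch below runs on synthetic stand-in data, not the Titanic set, and the feature and class names are placeholders. For the real data, the names would come from `features` and from whatever the two `survived` classes mean.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data; the names below are placeholders
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']
class_names = ['class 0', 'class 1']   # must follow the order of clf.classes_

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_demo, y_demo)

# filled=True colors each node by its majority class
fig, ax = plt.subplots(figsize=(14, 6))
artists = plot_tree(clf, feature_names=feature_names,
                    class_names=class_names, filled=True, ax=ax)
# In a notebook the figure renders at the end of the cell
```

`class_names` is matched to the sorted class labels in `clf.classes_`, so it is worth printing that attribute before choosing the label order.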

*DISCUSS YOUR RESULTS HERE*

# Problem 2
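
A depth scan can be sketched as a loop that refits the tree at each `max_depth` and records the test score. The example uses synthetic data, and the 25% test fraction and fixed `random_state` are arbitrary assumptions, not values required by the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

depths = list(range(1, 26))
test_scores = []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=0)
    clf.fit(X_train, y_train)                       # train on the training split only
    test_scores.append(clf.score(X_test, y_test))   # accuracy on held-out data

# np.argmax returns the first maximum, so ties go to the shallowest tree
best_depth = depths[int(np.argmax(test_scores))]
print('Best max_depth:', best_depth)

plt.plot(depths, test_scores, 'o-')
plt.xlabel('max_depth')
plt.ylabel('test score')
```

The curve will shift with the random split, so a single scan gives a rough guide to the best depth, not a definitive answer.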

*DISCUSS YOUR RESULTS HERE*

# Problem 3
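
When predicting on new rows, the columns must match the training features in both order and type. The sketch below uses synthetic data with placeholder column names, and `best_depth = 3` is a stand-in for whatever the depth scan selects. It converts the new rows to a NumPy array, which matches a model trained on `.values`. Recent scikit-learn versions warn when a model fitted on a bare array is handed a DataFrame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

cols = ['a', 'b', 'c']   # placeholder feature names, in training order
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

best_depth = 3   # placeholder; in practice taken from the depth scan
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
clf.fit(X_demo, y_demo)   # fitted on a plain NumPy array

# New rows arrive as a DataFrame whose columns are in a different order
new_rows = pd.DataFrame({'c': [0.5, -1.0], 'a': [1.2, 0.3], 'b': [-0.7, 2.0]})
X_new = new_rows[cols].to_numpy()   # reorder columns, then match the training type

pred = clf.predict(X_new)
new_rows['prediction'] = pred
print(new_rows)
```

Selecting with `new_rows[cols]` is what restores the training order. Without it, the tree would silently apply its thresholds to the wrong columns.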

*DISCUSS YOUR RESULTS HERE*