<a href="https://colab.research.google.com/github/mdaugherity/MachineLearning2024/blob/main/class/HW3_Trees_on_the_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW3 - Trees on the Titanic

Classify the people in the data set and tell me **WHO LIVES AND WHO DIES**.

I have given you lots of example code in the past several tutorials. In this assignment you will have to pull the pieces together and demonstrate that you know what you are doing.






# Format
The key skill in problem solving is the ability to break a complex problem into smaller steps.  In data science this often looks like making test cases that start as simply as possible and gradually add complexity to approach the original problem.  You should be able to confirm that each step works, or to put it another way:
**ALWAYS TEST YOUR CODE IN CASES WHERE YOU KNOW THE RIGHT ANSWER**
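As one illustration of testing against a known answer, the sketch below builds synthetic data whose labels follow a rule we chose ourselves, then checks that a tree recovers it. The names `X_toy`, `y_toy`, and `toy_tree`, the feature names `a` and `b`, and the 0.5 cutoff are all made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with a known rule: class 1 exactly when feature "a" exceeds 0.5.
rng = np.random.default_rng(0)
X_toy = rng.random((200, 2))              # two features, uniform in [0, 1)
y_toy = (X_toy[:, 0] > 0.5).astype(int)   # the "right answer" we control

# A single split is enough to represent this rule.
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(X_toy, y_toy)

# The root split should use feature 0 with a threshold close to 0.5.
root_feature = toy_tree.tree_.feature[0]
root_threshold = toy_tree.tree_.threshold[0]
toy_accuracy = toy_tree.score(X_toy, y_toy)

print(export_text(toy_tree, feature_names=["a", "b"]))
print("root feature:", root_feature, "threshold:", round(root_threshold, 3))
print("training accuracy:", toy_accuracy)
```

If the printed split is not on `a` near 0.5, the bug is in the code and not in the data, which is exactly what a test case like this is meant to reveal.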

Fill in the missing steps in the analysis. As always, use the [HW3 template](class/HW3_Template.ipynb) for your submission.  


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Input
Load and process the input data.  

*Use the code below to load data without modifications.*

In [None]:
# Load Titanic
data = fetch_openml(name="titanic",version=1, as_frame=True, parser='auto')

df_raw = data.frame # the raw data
df_raw.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |


Reminder from our **Titanic Pandas Tutorial** about some of these columns:

* pclass = Passenger Class 1, 2, or 3
* survived = 1 for people who survived. They will have a lifeboat number in boat, and the body column should be blank. Not all bodies were recovered, so body can also be blank for people who died.
* sibsp = number of siblings (for kids) or spouses (for adults) aboard
* parch = number of parents (for kids) or children (for adults) aboard
* fare is in old British money (pounds / shillings / pence) which gives weird fractions
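To see where the odd fractions in fare come from, here is a small sketch that converts a decimal pound amount back to pounds, shillings, and pence. It uses the pre-decimal rates (1 pound = 20 shillings, 1 shilling = 12 pence). The function name `to_pounds_shillings_pence` is hypothetical, and rounding to the nearest penny is an assumption made for this example.

```python
def to_pounds_shillings_pence(decimal_pounds):
    """Convert a decimal pound amount to a (pounds, shillings, pence) tuple."""
    # Pre-decimal British currency: 1 pound = 20 shillings and
    # 1 shilling = 12 pence, so 1 pound = 240 pence.
    total_pence = round(decimal_pounds * 240)  # assumption: nearest whole penny
    pounds, remainder = divmod(total_pence, 240)
    shillings, pence = divmod(remainder, 12)
    return pounds, shillings, pence

# Fares taken from the tables in this notebook
for fare in [211.3375, 151.55, 7.8958]:
    print(fare, "->", to_pounds_shillings_pence(fare))
```

For example, the fare 211.3375 in the first row works out to 211 pounds, 6 shillings, and 9 pence.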




To use this for machine learning, we need to clean this up significantly:
* Throw away some columns
* Make all columns numeric
* Remove rows with missing data


In [None]:
# Factorize any non-numeric columns we want to use
codes, genders = pd.factorize(df_raw.sex)
df_raw['gender'] = codes
print('Gender codes:', genders.categories)

Gender codes: Index(['female', 'male'], dtype='object')


In [None]:
print('Original data:')
print(df_raw.sex.value_counts())

print('\nFactorized result:')
df_raw.gender.value_counts()

Original data:
male      843
female    466
Name: sex, dtype: int64

Factorized result:


1    843
0    466
Name: gender, dtype: int64

In [None]:
# Choose columns to keep
df_raw.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'gender'],
      dtype='object')

In [None]:
features = ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare'] # choose columns for features
target = ['survived']
cols = target + features # combination of target and features
df = df_raw[cols].copy()
df.head()

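| | survived | pclass | gender | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 29.0 | 0 | 0 | 211.3375 |
| 1 | 1 | 1 | 1 | 0.9167 | 1 | 2 | 151.55 |
| 2 | 0 | 1 | 0 | 2.0 | 1 | 2 | 151.55 |
| 3 | 0 | 1 | 1 | 30.0 | 1 | 2 | 151.55 |
| 4 | 0 | 1 | 0 | 25.0 | 1 | 2 | 151.55 |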


In [None]:
df.describe()

| | pclass | gender | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 |
| mean | 2.294882 | 0.644003 | 29.881135 | 0.498854 | 0.385027 | 33.295479 |
| std | 0.837836 | 0.478997 | 14.4135 | 1.041658 | 0.86556 | 51.758668 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 3.0 | 1.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.275 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 |


WARNING!  The count is different for different columns.  We must have some missing data!
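The counts above imply 1309 - 1046 = 263 missing ages and 1309 - 1308 = 1 missing fare. A direct way to count missing values is sketched below on a small made-up frame; `toy` and its values are hypothetical stand-ins for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries, standing in for df
toy = pd.DataFrame({
    "age":    [29.0, np.nan, 2.0, np.nan],
    "fare":   [211.3, 151.6, np.nan, 7.9],
    "pclass": [1, 1, 3, 3],
})

missing_per_column = toy.isna().sum()              # NaN count in each column
rows_with_missing = toy.isna().any(axis=1).sum()   # rows that dropna() would remove

print(missing_per_column)
print("rows with at least one missing value:", rows_with_missing)
```

Running the same two lines on `df` should show which columns are responsible for the rows that get dropped.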

In [None]:
# Clean up dataframe by dropping missing rows
print('Row count:\t', len(df))
df.dropna(inplace=True)  # delete rows with missing or bad values
print('After dropna:\t', len(df))

Row count:	 1309
After dropna:	 1045


In [None]:
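# Save variables
X = df[features].values  # feature matrix: one row per passenger, one column per feature
y = df[target].values    # survival labels; 2D column of shape (n, 1) because target is a list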
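Because `target` is a list, `y` comes out as a column of shape (n, 1) and not as a flat array of shape (n,). The sketch below shows the difference on a made-up frame; the names `toy`, `y_2d`, `y_1d`, and `y_flat` are hypothetical. Some scikit-learn estimators warn when given a column vector for `y`, so flattening with `ravel()` is one option if that comes up.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({"survived": [1, 0, 1], "age": [29.0, 2.0, 30.0]})

y_2d = toy[["survived"]].values   # list of column names -> shape (3, 1)
y_1d = toy["survived"].values     # single column name   -> shape (3,)
y_flat = y_2d.ravel()             # flatten the column vector to shape (3,)

print(y_2d.shape, y_1d.shape, y_flat.shape)
```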

# Load passenger data

In [None]:
df_pass = pd.read_csv('https://raw.githubusercontent.com/mdaugherity/MachineLearning2024/main/data/titanic_passengers.csv')
df_pass

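| | Name | pclass | gender | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|
| 0 | Dawson | 3 | 1 | 49 | 3 | 3 | 58.5 |
| 1 | Ziyu | 2 | 1 | 83 | 4 | 1 | 60.5 |
| 2 | Colton | 3 | 1 | 21 | 0 | 0 | 1.0 |
| 3 | Kathryn | 2 | 0 | 20 | 1 | 3 | 200.0 |
| 4 | Mirimo | 1 | 1 | 36 | 3 | 0 | 300.0 |
| 5 | Sidney | 3 | 1 | 26 | 0 | 0 | 0.0 |
| 6 | Justin | 1 | 1 | 76 | 4 | 1 | 77.0 |
| 7 | Mike | 1 | 1 | 99 | 1 | 17 | 250.0 |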


In [None]:
X_pred = df_pass[features] # make sure we get the same feature columns in the same order
X_pred

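| | pclass | gender | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 49 | 3 | 3 | 58.5 |
| 1 | 2 | 1 | 83 | 4 | 1 | 60.5 |
| 2 | 3 | 1 | 21 | 0 | 0 | 1.0 |
| 3 | 2 | 0 | 20 | 1 | 3 | 200.0 |
| 4 | 1 | 1 | 36 | 3 | 0 | 300.0 |
| 5 | 3 | 1 | 26 | 0 | 0 | 0.0 |
| 6 | 1 | 1 | 76 | 4 | 1 | 77.0 |
| 7 | 1 | 1 | 99 | 1 | 17 | 250.0 |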


# Problems:
**1. Test Cases:** Validate your code with these tests:
 * Choose any 2 features and make a new variable called ```X2``` with only these 2 columns
 * Train a tree using all rows of ```X2``` and ```y``` with max_depth=3
 * Print a nice diagram of the tree.  Make sure it is filled and all features and classes are labeled!
 * Make a plot of the decision boundaries and verify it matches the diagram

**2. Training:**  Using all 6 features, find the optimal value of max_depth.   
**3. Predictions:** Use the **best** tree to classify the extra passengers.  
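For the depth search in problem 2, one possible approach is to hold out a test set and compare test accuracy across depths. The sketch below does this on synthetic data from `make_classification`, not on the Titanic data. The names `X_demo`, `y_demo`, `test_scores`, and `best_depth`, the 25% test split, and the depth range 1 to 10 are all assumptions for illustration, and the assignment may expect a different validation scheme.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: 6 numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Train one tree per depth and record accuracy on the held-out rows
depths = range(1, 11)
test_scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    test_scores.append(tree.score(X_test, y_test))

# Pick the depth with the highest test accuracy (first one wins on ties)
best_depth = depths[int(np.argmax(test_scores))]
print("best max_depth:", best_depth, "test accuracy:", max(test_scores))
```

A single split gives a noisy estimate, so the best depth can shift with a different `random_state`. Plotting `test_scores` against `depths` makes it easier to see whether the winner is a clear peak or just noise.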


As always, use the [HW3 template](class/HW3_Template.ipynb) for your submission.  

