
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# HW4 - Trees on the Titanic

Classify the people in the data set and tell me **WHO LIVES AND WHO DIES**.

I have given you lots of example code in the past several tutorials. In this assignment you will have to pull the pieces together and demonstrate that you know what you are doing.  

The key skill in problem solving is the ability to break a complex problem into smaller steps.  In data science this often looks like making test cases that start as simply as possible and gradually add complexity to approach the original problem.  You should be able to confirm that each step works, or to put it another way:
**ALWAYS TEST YOUR CODE IN CASES WHERE YOU KNOW THE RIGHT ANSWER**
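
As an illustration of testing against a known answer, here is a minimal sketch on made-up data (not the Titanic set). The names `X_toy`, `y_toy`, and `tree_toy` are hypothetical. The label is defined to be 1 exactly when the first feature exceeds 0.5, so a correctly working depth-1 tree should split on feature 0 near 0.5 and score perfectly on its own training data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is 1 exactly when feature 0 exceeds 0.5
rng = np.random.default_rng(0)
X_toy = rng.random((200, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

tree_toy = DecisionTreeClassifier(max_depth=1, random_state=0)
tree_toy.fit(X_toy, y_toy)

# A single split can represent this rule, so training accuracy should be perfect
toy_score = tree_toy.score(X_toy, y_toy)

# The root node should split on feature 0 at a threshold close to 0.5
root_feature = tree_toy.tree_.feature[0]
root_threshold = tree_toy.tree_.threshold[0]
print(toy_score, root_feature, root_threshold)
```

If any of these three numbers looked wrong, the problem would be in the code rather than in the data, which is the point of starting from a case with a known answer.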







## Input
Load and process the input data.  

*Use the code below to load data without modifications.*

In [20]:
# Load Titanic
data = fetch_openml(name="titanic",version=1, as_frame=True, parser='auto')

df_raw = data.frame # the raw data
df_raw.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


Reminder from our **Titanic Pandas Tutorial** about some of these columns:

* pclass = Passenger Class 1, 2, or 3
* survived = 1 for people who survived. Survivors should have a lifeboat number in boat and a blank body column. Not all bodies were recovered.
* sibsp = number of siblings (for kids) or spouses (for adults) aboard
* parch = number of parents (for kids) or children (for adults) aboard
* fare = ticket price in old British money (pounds / shillings / pence), which gives odd-looking fractions




In [21]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(3), int64(3), object(5)
memory usage: 116.8+ KB


To use this data for machine learning, we need to clean it up significantly:
* Throw away some columns
* Make all columns numeric (everything we use should be *int64* or *float64*)
* Remove rows with missing data


In [23]:
# Factorize any non-numeric columns we want to use
codes, genders = pd.factorize(df_raw.sex)
df_raw['gender'] = codes
print('Gender codes:', genders.categories)

Gender codes: Index(['female', 'male'], dtype='object')
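
To make the code-to-label mapping explicit, here is a small sketch on a hypothetical plain-string Series (`sex_toy` is made up; the notebook's `sex` column is categorical, which is why the cell above reads `.categories`). `pd.factorize` assigns integer codes in order of first appearance and returns the unique labels alongside them.

```python
import pandas as pd

# Hypothetical toy column of plain strings
sex_toy = pd.Series(['female', 'male', 'male', 'female'])

# codes_toy holds one integer per row; labels_toy holds the unique values
codes_toy, labels_toy = pd.factorize(sex_toy)

# Build a lookup from integer code back to the original label
mapping = dict(enumerate(labels_toy))
print(codes_toy, mapping)
```

Keeping a lookup like `mapping` around makes it easier to label tree diagrams and predictions later.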


In [24]:
print('Original data:')
print(df_raw.sex.value_counts())

print('\nFactorized result:')
df_raw.gender.value_counts()

Original data:
sex
male      843
female    466
Name: count, dtype: int64

Factorized result:


gender,count
1,843
0,466


In [25]:
# Survived is a category, so the values are strings instead of numbers!
df_raw.survived.unique()

['1', '0']
Categories (2, object): ['0', '1']

In [26]:
df_raw.survived = df_raw.survived.astype('int64')

In [27]:
df_raw.survived.unique()  # much better!

array([1, 0])

In [28]:
# Choose columns to keep
df_raw.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest', 'gender'],
      dtype='object')

In [29]:
features = ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare'] # choose columns for features
target = ['survived']
cols = target + features # combination of target and features
df = df_raw[cols].copy()
df.head()

Unnamed: 0,survived,pclass,gender,age,sibsp,parch,fare
0,1,1,0,29.0,0,0,211.3375
1,1,1,1,0.9167,1,2,151.55
2,0,1,0,2.0,1,2,151.55
3,0,1,1,30.0,1,2,151.55
4,0,1,0,25.0,1,2,151.55


In [30]:
df.describe()

Unnamed: 0,survived,pclass,gender,age,sibsp,parch,fare
count,1309.0,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,0.381971,2.294882,0.644003,29.881135,0.498854,0.385027,33.295479
std,0.486055,0.837836,0.478997,14.4135,1.041658,0.86556,51.758668
min,0.0,1.0,0.0,0.1667,0.0,0.0,0.0
25%,0.0,2.0,0.0,21.0,0.0,0.0,7.8958
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,1.0,39.0,1.0,0.0,31.275
max,1.0,3.0,1.0,80.0,8.0,9.0,512.3292


WARNING!  The count differs between columns: age has 1046 values and fare has 1308, but there are 1309 rows.  We must have some missing data!
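
One way to see exactly where the gaps are is to count missing values per column before dropping anything. The sketch below uses a hypothetical toy frame (`df_toy`) rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
df_toy = pd.DataFrame({
    'age':    [29.0, np.nan, 2.0, np.nan],
    'fare':   [211.3, 151.55, np.nan, 7.9],
    'pclass': [1, 1, 3, 3],
})

# Number of missing values in each column
missing_per_column = df_toy.isna().sum()

# Number of rows that have at least one missing value
rows_with_missing = int(df_toy.isna().any(axis=1).sum())

# dropna() removes exactly those rows
df_clean = df_toy.dropna()
print(missing_per_column, rows_with_missing, len(df_clean))
```

Comparing `rows_with_missing` with the change in row count after `dropna` is a quick check that the cleaning step did what was expected.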

In [31]:
# Clean up dataframe by dropping missing rows
print('Row count:\t', len(df))
df.dropna(inplace=True)  # delete rows with any missing (NaN) values
print('After dropna:\t', len(df))

Row count:	 1309
After dropna:	 1045


In [32]:
# Save variables
X = df[features].values
y = df[target].values

In [33]:
print('Feature shape: ', X.shape)
print('Feature names: ', features)

Feature shape:  (1045, 6)
Feature names:  ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare']
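
A note on shapes: because `target` is a list, `df[target].values` is a 2-D column of shape `(n, 1)` rather than a flat vector. The sketch below shows the difference on a hypothetical toy frame (`df_demo`). Some scikit-learn estimators warn about column-vector targets, and `.ravel()` is one way to flatten them if that comes up.

```python
import pandas as pd

# Hypothetical toy frame
df_demo = pd.DataFrame({'survived': [1, 0, 1], 'age': [29.0, 2.0, 30.0]})

y_col = df_demo[['survived']].values   # list of columns gives shape (3, 1)
y_flat = df_demo['survived'].values    # single column gives shape (3,)
y_raveled = y_col.ravel()              # flatten the column vector to shape (3,)
print(y_col.shape, y_flat.shape, y_raveled.shape)
```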


## Load passenger data

In [34]:
df_pass = pd.read_csv('https://raw.githubusercontent.com/mdaugherity/MachineLearning2024/main/data/titanic_passengers.csv')
df_pass

Unnamed: 0,Name,pclass,gender,age,sibsp,parch,fare
0,MinJun,2,1,0.166,1,2,200.0
1,Griffin,3,1,21.0,3,0,10.0
2,Simon,3,1,21.0,0,0,20.0
3,Grant,3,1,21.0,0,0,30.0
4,Nolan,2,1,22.0,1,0,276.45
5,Taylor,3,1,12.0,6,0,8.0
6,Sean,2,1,19.0,3,2,351.6889
7,Skylar,1,1,65.0,1,4,230.0
8,Erik,3,1,21.0,0,0,40.0
9,Aubrey,3,0,30.0,0,0,98.0


In [35]:
X_pred = df_pass[features] # make sure we get the same feature columns in the same order
X_pred

Unnamed: 0,pclass,gender,age,sibsp,parch,fare
0,2,1,0.166,1,2,200.0
1,3,1,21.0,3,0,10.0
2,3,1,21.0,0,0,20.0
3,3,1,21.0,0,0,30.0
4,2,1,22.0,1,0,276.45
5,3,1,12.0,6,0,8.0
6,2,1,19.0,3,2,351.6889
7,1,1,65.0,1,4,230.0
8,3,1,21.0,0,0,40.0
9,3,0,30.0,0,0,98.0


# Problems:
**1. Test Cases:** Validate your code with these tests:
 * Choose any 2 features and make a new variable called ```X2``` with only these 2 columns
 * Train a tree using all rows of ```X2``` and ```y``` with max_depth=3
 * Print a nice diagram of the tree.  Make sure it is filled and all features and classes are labeled!
 * Make a plot of the decision boundaries and verify it matches the diagram

**2. Training:**  Using all 6 features, find the optimal value of max_depth.   
**3. Predictions:** Use the **best** tree to classify the extra passengers.  


As always, use the [HW3 template](HW3_Template.ipynb) for your submission.  



# Problem 1 - Test Cases
Validate your code with these tests:
 * Choose any 2 features and make a new variable called ```X2``` with only these 2 columns.  (Note: continuous features such as age and fare give much cleaner decision-boundary plots than integer-valued features.)
 * Train a tree using all rows of ```X2``` and ```y``` with max_depth=3
 * Print a nice diagram of the tree.  Make sure it is filled and all features and classes are labeled!
 * Make a plot of the decision boundaries and verify it matches the diagram
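
The steps above can be sketched on synthetic data as follows. This is an illustration under assumptions, not a solution: `X2_demo`, `y_demo`, the placeholder feature names, and the class labels are all made up, and the grid-plus-`contourf` approach is only one of several ways to draw decision boundaries.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical stand-in for X2 and y: two continuous features and a 0/1 label
rng = np.random.default_rng(1)
X2_demo = rng.uniform(0, 80, size=(300, 2))
y_demo = ((X2_demo[:, 0] < 15) | (X2_demo[:, 1] > 50)).astype(int)
names_demo = ['feature_a', 'feature_b']  # placeholder feature names

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0)
clf_demo.fit(X2_demo, y_demo)
demo_score = clf_demo.score(X2_demo, y_demo)

fig, (ax_tree, ax_bound) = plt.subplots(1, 2, figsize=(14, 5))

# Tree diagram: filled boxes, labeled features and classes
plot_tree(clf_demo, filled=True, feature_names=names_demo,
          class_names=['died', 'survived'], ax=ax_tree)

# Decision boundaries: predict on a dense grid and color each point by class
xx, yy = np.meshgrid(np.linspace(0, 80, 200), np.linspace(0, 80, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = clf_demo.predict(grid).reshape(xx.shape)
ax_bound.contourf(xx, yy, zz, alpha=0.3, cmap='coolwarm')
ax_bound.scatter(X2_demo[:, 0], X2_demo[:, 1], c=y_demo, cmap='coolwarm', s=10)
ax_bound.set_xlabel(names_demo[0])
ax_bound.set_ylabel(names_demo[1])

# Thresholds of the internal (non-leaf) nodes; leaves are marked with feature < 0
split_thresholds = clf_demo.tree_.threshold[clf_demo.tree_.feature >= 0]
print(demo_score, split_thresholds)
```

Each value in `split_thresholds` should show up as a straight vertical or horizontal edge in the boundary plot, which is the check that the plot matches the diagram.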


# Problem 2 - Training
Using all 6 features, find the optimal value of max_depth and report the corresponding test score.
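
One common way to approach this is to scan a range of depths and compare training and test scores. The sketch below assumes a single train/test split and uses synthetic data from `make_classification` in place of the real `X` and `y`. The depth range, split size, and all variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for X and y: 6 features, binary target
X_syn, y_syn = make_classification(n_samples=600, n_features=6,
                                   n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0)

depths = list(range(1, 16))
train_scores, test_scores = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=0)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

# Pick the depth with the highest test score (first one wins on ties)
best_depth = depths[int(np.argmax(test_scores))]
best_test_score = max(test_scores)
print(best_depth, best_test_score)
```

Training score keeps rising with depth while test score typically levels off or drops, so the gap between the two curves is a useful overfitting signal. A single split is noisy, and the chosen depth can change with a different `random_state`.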

# Problem 3 - Predictions
Use the best tree to classify the extra passengers.
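
A minimal sketch of the prediction step, using random stand-in data. `best_tree`, `df_new`, the passenger names, and the 'lives'/'dies' labels are all hypothetical. The feature columns are selected by name so they arrive in the same order used for training, then converted to an array because recent scikit-learn versions warn when a model fitted on an array is given a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

feature_names = ['pclass', 'gender', 'age', 'sibsp', 'parch', 'fare']

# Hypothetical training data (random numbers, for illustration only)
rng = np.random.default_rng(2)
X_fake = rng.random((100, 6))
y_fake = (X_fake[:, 1] < 0.5).astype(int)
best_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_fake, y_fake)

# Hypothetical new passengers with the same feature columns plus a name
df_new = pd.DataFrame(rng.random((3, 6)), columns=feature_names)
df_new.insert(0, 'Name', ['A', 'B', 'C'])

# Select features by name to guarantee column order, then predict
pred = best_tree.predict(df_new[feature_names].values)

# Attach a readable label next to each passenger
df_new['prediction'] = np.where(pred == 1, 'lives', 'dies')
print(df_new[['Name', 'prediction']])
```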