# Generating Overfit Tree Models

## Original Prompt

By default, your scikit-learn tree models will grow until every node is pure. To explore this, you are to build different models using the max_depth parameter and determine when the tree begins to overfit the data. For depths from max_depth = 1 until the tree is completed, keep track of the accuracy on training vs. test data and generate a plot with depths as the horizontal axis and accuracy as the vertical axis for train and test data.

Repeat this process with different splits of the data to determine at what depth the tree begins to overfit. Share your results with your peers and discuss your approach to generating the visualization. What are the consequences of this overfitting for your approach to building Decision Trees? We provide a small dataset with health data where your goal is to predict whether or not the individuals survive.

## Translated to Questions and Steps

### Key Questions
- At what depth does the tree begin to overfit data?
- What was your approach to generating the data visualizations?
- What are the consequences of this overfitting for your approach to building trees?

### Steps

#### Data Gathering
- Determine N
    - N is the depth the tree reaches when max_depth is left unbounded
    - To determine it, fit a tree on the entire data set with max_depth=None
- For a number of splits, do the following
    - Make a random train / test split
    - For depths 1 to N, do the following
        - Build a tree
        - Compute accuracy on training and test

Notes
- Number of splits is TBD
    - Start with 20
    - Look at variance vs. number of splits: does it drop off?
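
One way to approach the note on variance vs. number of splits is sketched below. It is a rough illustration, not the notebook's method: the data is a synthetic stand-in from `make_classification`, `split_test_accuracies` is a hypothetical helper, and the depth of 3 is an arbitrary choice. It computes the standard error of the mean test accuracy using only the first k splits, so you can see whether adding more splits still tightens the estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the notebook uses the Whickham data).
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)


def split_test_accuracies(X, y, max_depth, n_splits):
    """Test accuracy of a depth-limited tree on n_splits random stratified splits."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, random_state=seed, stratify=y
        )
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X_tr, y_tr)
        scores.append(tree.score(X_te, y_te))
    return np.array(scores)


scores = split_test_accuracies(X_demo, y_demo, max_depth=3, n_splits=20)

# Standard error of the mean accuracy using only the first k splits, k = 2..20.
sem_by_k = {
    k: scores[:k].std(ddof=1) / np.sqrt(k) for k in range(2, len(scores) + 1)
}

for k, sem in sem_by_k.items():
    print(f"k = {k:2d}  standard error = {sem:.4f}")
```

If the standard error has flattened out well before k = 20, the starting value of 20 splits is probably enough; if it is still falling, more splits may be worth the extra compute.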

#### Data Processing
- Compute mean value over splits to get the average accuracy vs. depth
- Plot accuracy vs. depth for both training and test
- Determine break point in depth demarcating overfit
- Identify takeaways that generalize beyond this problem, in terms of understanding depth & overfit
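
The aggregation steps above can be sketched with a pandas `groupby`. The records below are invented purely for illustration (they are not results from this data set), and taking the depth with the highest mean test accuracy as the break point is only one possible heuristic, assumed here for simplicity.

```python
import pandas as pd

# Hypothetical per-split records, made up for illustration only.
records = [
    {"max_depth": 1, "training_accuracy": 0.80, "test_accuracy": 0.78},
    {"max_depth": 1, "training_accuracy": 0.82, "test_accuracy": 0.80},
    {"max_depth": 2, "training_accuracy": 0.86, "test_accuracy": 0.84},
    {"max_depth": 2, "training_accuracy": 0.88, "test_accuracy": 0.82},
    {"max_depth": 3, "training_accuracy": 0.94, "test_accuracy": 0.80},
    {"max_depth": 3, "training_accuracy": 0.96, "test_accuracy": 0.78},
]
results = pd.DataFrame(records)

# Mean accuracy over splits for each depth.
mean_by_depth = results.groupby("max_depth")[
    ["training_accuracy", "test_accuracy"]
].mean()

# Assumed heuristic: the break point is the depth with the best mean test accuracy.
best_depth = int(mean_by_depth["test_accuracy"].idxmax())

# A widening train/test gap beyond that depth is the signature of overfitting.
gap = mean_by_depth["training_accuracy"] - mean_by_depth["test_accuracy"]

print(mean_by_depth)
print("best depth:", best_depth)
```

Calling `mean_by_depth.plot(marker="o")` on a frame like this should draw both curves against depth, which is the plot the prompt asks for.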

# Imports

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import warnings

In [2]:
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

# Data Load

In [3]:
data = pd.read_csv("./data/Whickham.txt")

In [4]:
data.head()

|   | outcome | smoker | age |
|---|---------|--------|-----|
| 0 | Alive   | Yes    | 23  |
| 1 | Alive   | Yes    | 18  |
| 2 | Dead    | Yes    | 71  |
| 3 | Alive   | No     | 67  |
| 4 | Alive   | No     | 64  |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1314 entries, 0 to 1313
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   outcome  1314 non-null   object
 1   smoker   1314 non-null   object
 2   age      1314 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 30.9+ KB


# Feature Naming

In [6]:
X = data[["smoker", "age"]].copy()  # copy so the encoding step works on its own frame, not a slice of data
y = data["outcome"]

In [7]:
X["smoker"] = X["smoker"].map({"No": 0, "Yes": 1})

# Determine Max Tree Depth

In [8]:
unbound_tree = DecisionTreeClassifier(max_depth=None).fit(X, y)
max_depth_limit = unbound_tree.tree_.max_depth
max_depth_limit

14

# Data Processing

In [10]:
max_depth_list = list(range(1, max_depth_limit + 1))
max_depth_list

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

In [11]:
num_train_test_splits = 20

- For a number of splits, do the following
    - Make a random train / test split
    - For depths 1 to N, do the following
        - Build a tree
        - Compute accuracy on training and test

In [32]:
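def inner_loop_processing(X, y, random_state, max_depth):
    """Fit one depth-limited tree on one stratified split; return its accuracies."""
    # random_state fixes the split; test_size is left at the default (25%).
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        random_state=random_state,
        stratify=y,
    )
    dt = DecisionTreeClassifier(max_depth=max_depth).fit(X_train, y_train)
    # The fitted tree may stop short of max_depth, but must never exceed it.
    assert dt.tree_.max_depth <= max_depth
    metrics = {
        "max_depth": max_depth,  # requested depth limit
        "max_depth_": dt.tree_.max_depth,  # depth actually reached
        "training_accuracy": accuracy_score(y_train, dt.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, dt.predict(X_test)),
    }
    return metrics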
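
Because a helper like this returns one dict per fit, the rows can be accumulated in a list and turned into a DataFrame for later averaging, rather than only printed. The sketch below is a self-contained illustration under stated assumptions: `fit_and_score` is a hypothetical stand-in for the helper above, the data is synthetic (`make_classification`), and the depth and split ranges are kept small so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Whickham data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)


def fit_and_score(X, y, random_state, max_depth):
    """Hypothetical stand-in: one split, one depth-limited tree, one row of metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    dt = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    dt.fit(X_train, y_train)
    return {
        "ksplit": random_state,
        "max_depth": max_depth,
        "training_accuracy": dt.score(X_train, y_train),
        "test_accuracy": dt.score(X_test, y_test),
    }


# One row per (depth, split) pair, collected instead of printed.
rows = [
    fit_and_score(X_demo, y_demo, ksplit, depth)
    for depth in range(1, 6)
    for ksplit in range(5)
]
results = pd.DataFrame(rows)
print(results.head())
```

A frame in this long format, with one row per depth and split, can then be grouped by `max_depth` to get the mean training and test accuracy.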

In [33]:
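# The train/test splits are fixed by random_state=ksplit. This global seed covers
# the tree's own randomness, since DecisionTreeClassifier is given no random_state.
np.random.seed(1234)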

for max_depth in max_depth_list:
    for ksplit in range(num_train_test_splits):
        metrics = inner_loop_processing(X, y, ksplit, max_depth)
        print(
            "ksplit = %2d max_depth = %2d\n\tmetrics = %s"
            % (ksplit, max_depth, str(metrics))
        )

ksplit =  0 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.849746192893401, 'test_accuracy': 0.8601823708206687}
ksplit =  1 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.8517766497461929, 'test_accuracy': 0.8541033434650456}
ksplit =  2 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.8446700507614213, 'test_accuracy': 0.8753799392097265}
ksplit =  3 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.8467005076142132, 'test_accuracy': 0.8693009118541033}
ksplit =  4 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.8517766497461929, 'test_accuracy': 0.8541033434650456}
ksplit =  5 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.8598984771573605, 'test_accuracy': 0.8297872340425532}
ksplit =  6 max_depth =  1
	metrics = {'max_depth': 1, 'max_depth_': 1, 'training_accuracy': 0.860913