# **Regression Tree Exercise**

_John Andrew Dixon_

--- 

**Setup**

In [50]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [51]:
# Remote URL to the data
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSQc1CsJ25nPMJcuJD04csFCysrzuInd_IQ_drLza49m_3R4MllPcuhduu4GozMJun3MgUJkGl0cw-d/pub?output=csv"
# Load and verify the data
df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


---

## **Tasks**

### **Run a regression tree model with default parameters (unlimited depth).**

In [52]:
# Create feature matrix and target vector
X = df.drop(columns="PRICE")
y = df["PRICE"]

In [53]:
# Make a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [54]:
# Instantiate a decision tree and fit it
decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)

In [55]:
# Function for evaluating regression R^2 scores
def regression_r2(regression, tts_tuple, verbose=False):
    """Return (training R^2, testing R^2) for a fitted regressor.

    tts_tuple must be ordered as (X_train, X_test, y_train, y_test).
    """
    training_r2 = regression.score(tts_tuple[0], tts_tuple[2])
    testing_r2 = regression.score(tts_tuple[1], tts_tuple[3])
    if verbose:
        print("Training R-squared:", training_r2)
        print("Testing R-squared:", testing_r2)

    return training_r2, testing_r2


In [56]:
train_test_split_tuple = (X_train, X_test, y_train, y_test)
regression_r2(decision_tree, train_test_split_tuple)

(1.0, 0.6193230918136841)
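For a scikit-learn regressor, `score` returns the coefficient of determination, R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. The sketch below computes it by hand in plain Python so the formula is visible; `r_squared` and the toy values are hypothetical and are not taken from the housing data. A training score of exactly 1.0 means the tree reproduces every training target, while the much lower test score suggests the unrestricted tree is overfitting.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy targets, chosen only to illustrate the two extremes
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r_squared(y_true, y_true)        # predictions match every target
baseline = r_squared(y_true, [25.0] * 4)   # always predict the mean (25.0)
```

Here `perfect` is the score of a model that memorizes its targets, and `baseline` is the score of a model that ignores the features entirely.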

### **Determine the depth of the default tree.**

In [57]:
# Get depth of decision tree
decision_tree.get_depth()

20

### **Try different values for max_depth and determine the optimal value based on the best (highest) R<sup>2</sup> value. What is the optimal max_depth based on your trials?**

In [60]:
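# Empty frame that collects one row of scores per max_depth value
depths = pd.DataFrame({
    "Depth": [],
    "Train R-Squared": [],
    "Test R-Squared": []
})

# Try max_depth values 1 through 50; values above 20 (the depth of the
# unrestricted tree) cannot grow the tree any further
for number in range(1, 51):
    temp_decision_tree = DecisionTreeRegressor(max_depth=number, random_state=42)
    temp_decision_tree.fit(X_train, y_train)
    train_r2, test_r2 = regression_r2(temp_decision_tree, train_test_split_tuple)
    # Add the new row at index -1, then shift every index up by one,
    # so the final index runs in reverse order of depth
    depths.loc[-1] = [number, train_r2, test_r2]
    depths.index = depths.index + 1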
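The loop above adds each row at index -1 and then shifts the index, which leaves the index in reverse order of depth and stores `Depth` as a float. One alternative, sketched here under assumptions, is to collect the rows in a list and build the DataFrame once. The data comes from `make_regression` as a synthetic stand-in for the housing CSV, and the names `X_demo`, `y_demo`, `depth_scores`, and `best_row` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the notebook uses the housing CSV)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Collect one dict of scores per depth, then build the frame in one step
rows = []
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    rows.append({
        "Depth": depth,
        "Train R-Squared": tree.score(Xd_train, yd_train),
        "Test R-Squared": tree.score(Xd_test, yd_test),
    })

# Depth stays an integer column and the index runs 0, 1, 2, ...
depth_scores = pd.DataFrame(rows)

# Row with the highest test R-squared
best_row = depth_scores.loc[depth_scores["Test R-Squared"].idxmax()]
print(int(best_row["Depth"]), best_row["Test R-Squared"])
```

Building the frame once avoids re-indexing on every iteration, and `idxmax` picks the best row without sorting the whole table.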

In [67]:
depths.sort_values("Test R-Squared", ascending=False).head(3)

|    | Depth | Train R-Squared | Test R-Squared |
|----|-------|-----------------|----------------|
| 43 | 7.0   | 0.958517        | 0.846377       |
| 40 | 10.0  | 0.986796        | 0.84601        |
| 39 | 11.0  | 0.9911          | 0.829736       |


### **What is the R<sup>2</sup> of your final model on the training set and on the test set?**

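The final model is the tree with `max_depth=7`, the depth with the highest test R<sup>2</sup> among the values tried above.

**Train R<sup>2</sup>**: 0.958517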

**Test R<sup>2</sup>**: 0.846377