#**Constructing Regression Trees in Python**  

`DecisionTreeRegressor()` from `sklearn.tree` initializes a regression tree. For example, a tree named `regTree` can be given a maximum depth of 3 (`max_depth=3`), a minimum of 5 points to split a node (`min_samples_split=5`), and a minimum of 2 points in each leaf (`min_samples_leaf=2`). Additional parameters and values are listed in the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). Fit the tree using `regTree.fit(X, y)`, with the input features in `X` and the outcome in `y`.
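
As a minimal sketch of the settings described above, the cell below initializes and fits `regTree` on synthetic data. The data and the random seed are assumptions made for illustration and are not part of the penguins example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): 50 instances, 2 features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Max depth of 3, at least 5 points to split a node, at least 2 points per leaf
regTree = DecisionTreeRegressor(max_depth=3, min_samples_split=5, min_samples_leaf=2)
regTree.fit(X, y)

# Inspect the fitted tree's depth and number of leaves
print(regTree.get_depth(), regTree.get_n_leaves())
```

The fitted tree never exceeds the requested depth, and every leaf holds at least `min_samples_leaf` training points.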

The Python code below fits a regression tree for predicting body mass based on sex, flipper length, and bill length for Gentoo penguins in the Palmer penguins dataset.

In [None]:
!pip install palmerpenguins

In [None]:
# Import packages and functions
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn import tree, metrics
from palmerpenguins import load_penguins

import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

In [None]:
# Load the penguins data from palmerpenguins package
penguins = load_penguins()

# Drop penguins with missing values
penguins = penguins.dropna()

# Create a new data frame with only Gentoo penguins
gentoo = penguins[penguins['species'] == 'Gentoo'].copy()

# Calculate summary statistics using .describe()
gentoo.describe(include='all')

In [None]:
# Create a matrix of input features with sex, flipper length, and bill length
X = gentoo[['sex', 'flipper_length_mm', 'bill_length_mm']]
X

`DecisionTreeRegressor` accepts only numerical features, so non-numeric ones like `sex` and `island` need encoding as dummy variables using `get_dummies` from `pandas`.

In [None]:
# Use pd.get_dummies to convert sex to a binary (0/1) dummy variable
X_dummies = pd.get_dummies(X, drop_first=True)
X_dummies

Setting `drop_first=True` generates a single dummy variable, sufficient to represent sex in the dataset.

- `sex_male=0`: female
- `sex_male=1`: male
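
The effect of `drop_first=True` can be seen on a small example. This is a sketch on a hypothetical three-row data frame, not the penguins data; `dtype=int` is passed so the dummy columns show as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data frame (not the penguins data)
toy = pd.DataFrame(
    {'sex': ['male', 'female', 'female'], 'flipper_length_mm': [210, 215, 220]}
)

# Without drop_first, each category of sex gets its own dummy column
all_dummies = pd.get_dummies(toy, dtype=int)

# With drop_first=True, the first category (female) is dropped
one_dummy = pd.get_dummies(toy, drop_first=True, dtype=int)

print(all_dummies.columns.to_list())
print(one_dummy.columns.to_list())
```

Because `sex_female` is always `1 - sex_male` in this toy frame, the dropped column carries no additional information.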

In [None]:
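# Save the output feature as y
y = gentoo['body_mass_g']

# Initialize a regression tree with max depth 2 and at least 2 points per leaf
regtreeModel = DecisionTreeRegressor(max_depth=2, min_samples_leaf=2)

# Fit the tree on the dummy-encoded features
regtreeModel.fit(X_dummies, y)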

In [None]:
# The print() statement outputs a text version of the regression tree
# Feature names must match the columns the model was fit on (X_dummies, not X)
print(export_text(regtreeModel, feature_names=X_dummies.columns.to_list()))

In [None]:
# Using tree.plot_tree() makes a cleaner figure

# Resize the plotting window
plt.figure(figsize=[12, 8])

# Feature names must match the columns the model was fit on (X_dummies, not X)
p = tree.plot_tree(
    regtreeModel,
    feature_names=X_dummies.columns.to_list(),
    filled=False,
    fontsize=10,
)

In [None]:
# Add the predictions to the original data set
gentoo['pred'] = regtreeModel.predict(X_dummies)
gentoo

In [None]:
# Plot observed vs. predictions
p = sns.scatterplot(data=gentoo, x='body_mass_g', y='pred', hue='sex')
p.set_xlabel('Observed body mass', fontsize=14)
p.set_ylabel('Predicted body mass', fontsize=14)

In [None]:
# Calculate MSE
metrics.mean_squared_error(y, gentoo['pred'])

#**Constructing Classification Trees in Python**  

`DecisionTreeClassifier()` from `sklearn.tree` initializes a classification tree. For example, a tree named `classTree` can be given a maximum depth of 3 (`max_depth=3`), a minimum of 5 points to split a node (`min_samples_split=5`), and a minimum of 1 point in each leaf (`min_samples_leaf=1`). More parameters and values are listed in the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Fit the tree with `classTree.fit(X, y)`, where `X` holds the input features and `y` is the outcome.
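
A minimal sketch of those settings on synthetic data follows. The data, the class labels `'a'` and `'b'`, and the random seed are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the class depends on feature 0
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = np.where(X[:, 0] > 5, 'a', 'b')

# Max depth of 3, at least 5 points to split a node, at least 1 point per leaf
classTree = DecisionTreeClassifier(max_depth=3, min_samples_split=5, min_samples_leaf=1)
classTree.fit(X, y)

# Classes are stored in sorted order
print(classTree.classes_)
```

The order of `classTree.classes_` is also the column order of `classTree.predict_proba(X)`.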

The Python code below fits a classification tree for predicting species based on flipper length and bill length for the Palmer penguins dataset.

In [None]:
# Import packages and functions
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn import metrics, tree

from palmerpenguins import load_penguins

In [None]:
# Load the penguins data from palmerpenguins module
penguins = load_penguins()

# Drop penguins with missing values
penguins = penguins.dropna()

# Calculate summary statistics using .describe()
penguins.describe(include='all')

In [None]:
# Save the output feature as y (a Series, so y is one-dimensional)
y = penguins['species']

# Save input features as X
X = penguins[['flipper_length_mm', 'bill_length_mm']]

# Initialize the model
classtreeModel = DecisionTreeClassifier(max_depth=2)

# Fit the model
classtreeModel = classtreeModel.fit(X, y)

In [None]:
# Print tree as text
print(export_text(classtreeModel, feature_names=X.columns.to_list()))

In [None]:
# Resize the plotting window
plt.figure(figsize=[12, 8])

# Values in brackets represent classes in alphabetical order
# [Adelie, Chinstrap, Gentoo]
p = tree.plot_tree(classtreeModel, feature_names=X.columns.to_list(), filled=False, fontsize=10)

In [None]:
# Calculate cross-entropy and error rate

print("Cross-entropy: ", metrics.log_loss(y, classtreeModel.predict_proba(X)))
print("Error rate: ", 1 - metrics.accuracy_score(y, classtreeModel.predict(X)))

# Calculate the confusion matrix
metrics.confusion_matrix(y, classtreeModel.predict(X))

# Plot the confusion matrix
metrics.ConfusionMatrixDisplay.from_predictions(y, classtreeModel.predict(X))

In [None]:
# Calculate the Gini index
probs = pd.DataFrame(data=classtreeModel.predict_proba(X))

print("Gini index: ", (probs * (1 - probs)).mean().sum())
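
The expression `(probs * (1 - probs)).mean().sum()` averages `p * (1 - p)` over instances for each class and then sums over classes, which gives the same value as averaging each instance's Gini index. The sketch below works through that arithmetic by hand for two hypothetical probability vectors; the helper name `gini_index` and the probabilities are assumptions made for illustration.

```python
def gini_index(probs):
    """Gini index of one vector of class probabilities: sum of p * (1 - p)."""
    return sum(p * (1 - p) for p in probs)


# A pure leaf: every instance belongs to the first class
pure_leaf = gini_index([1.0, 0.0, 0.0])

# A mixed leaf: 0.5*0.5 + 0.25*0.75 + 0.25*0.75
mixed_leaf = gini_index([0.5, 0.25, 0.25])

# Mean Gini index over one instance from each leaf
mean_gini = (pure_leaf + mixed_leaf) / 2

print(pure_leaf, mixed_leaf, mean_gini)
```

A Gini index of 0 indicates a pure leaf, and larger values indicate more class mixing.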

#**Constructing Classification Random Forests in Python**  

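`RandomForestClassifier()` from `sklearn.ensemble` initializes a classification random forest. For example, a model named `rfc` can use 100 trees (`n_estimators=100`), consider the square root of the number of features at each node (`max_features='sqrt'`), measure split quality with the Gini index (`criterion='gini'`), and fit each tree to a bootstrap sample (`bootstrap=True`). More details are in the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Fit the model with `rfc.fit(X, y)`, using `X` for the input features and `y` for the outcome.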
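
The sketch below initializes `rfc` with those settings written out explicitly and fits it on synthetic data. The data from `make_classification` and the `random_state` values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 200 instances, 6 features
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

# 100 trees, sqrt(features) considered per node, Gini criterion, bootstrapping
rfc = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    criterion='gini',
    bootstrap=True,
    random_state=0,
)
rfc.fit(X, y)

# Each fitted tree is stored in estimators_
print(len(rfc.estimators_))
```

Setting `random_state` makes the bootstrap samples and feature subsets reproducible between runs.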


The Python code below fits a classification random forest for predicting species of penguins from the Palmer Penguins dataset.

In [None]:
# Import packages and functions
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from palmerpenguins import load_penguins

In [None]:
# Load the penguins data from palmerpenguins module
penguins = load_penguins()

# Drop penguins with missing values
penguins = penguins.dropna()

# Calculate summary statistics using .describe()
penguins.describe(include='all')

In [None]:
# y = output features
y = penguins['species']

# X = input features
X = penguins.drop('species', axis=1)

# Convert categorical inputs like island and sex into dummy variables
X = pd.get_dummies(X, drop_first=True)

X

In [None]:
# Create a training/testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=8675309
)

# Initialize the random forest model
rfModel = RandomForestClassifier(max_depth=2, max_features='sqrt', random_state=99)

# Fit the random forest model on the training data
rfModel.fit(X_train, y_train)

In [None]:
# Rank the input features by importance, from most to least important
pd.DataFrame(
    data={
        'feature': rfModel.feature_names_in_,
        'importance': rfModel.feature_importances_,
    }
).sort_values('importance', ascending=False)

In [None]:
# Predict species on the testing data
y_pred = rfModel.predict(X_test)

In [None]:
# Calculate a confusion matrix
metrics.confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

In [None]:
# Calculate the Gini index
probs = pd.DataFrame(data=rfModel.predict_proba(X_test))
print("Gini index: ", (probs * (1 - probs)).mean().sum())

In [None]:
# Save the first random forest tree as singleTree
singleTree = rfModel.estimators_[0]

# Set image size
plt.figure(figsize=[15, 8])

# Plot a single classification tree
tree.plot_tree(singleTree, feature_names=X.columns.to_list(), filled=False, fontsize=10)

#**Constructing Regression Random Forests in Python**  

`RandomForestRegressor()` from `sklearn.ensemble` initializes a regression random forest. For example, a model named `rfr` can use 100 trees (`n_estimators=100`), the squared error criterion (`criterion='squared_error'`), and the square root of the number of features at each node (`max_features='sqrt'`). More parameters and values are listed in the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). Fit the model with `rfr.fit(X, y)`, using `X` for the input features and `y` for the outcome.

The Python code below fits a regression random forest for predicting body mass of penguins from the Palmer Penguins dataset.



In [None]:
# Import packages and functions

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics, tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from palmerpenguins import load_penguins

In [None]:
# Load the penguins data from palmerpenguins module
penguins = load_penguins()

# Drop penguins with missing values
penguins = penguins.dropna()

# Calculate summary statistics using .describe()
penguins.describe(include='all')

In [None]:
# Random forest models require all numerical inputs
# Convert categorical inputs like species and island into binary indicators

penguinDummies = pd.get_dummies(penguins, drop_first=True)

# Ex: species_Chinstrap = {1 if Chinstrap, 0 else}
penguinDummies

In [None]:
# Save output features as y
y = penguinDummies["body_mass_g"]

# Save input features as X
X = penguinDummies.drop("body_mass_g", axis=1)

# Create a training/testing split
# 30% of instances held out for testing
# 70% of instances used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=8675309
)

# Define a regression random forest model
rfModel = RandomForestRegressor(max_depth=2, max_features='sqrt', random_state=99)

# Fit the model
rfModel.fit(X_train, y_train)

In [None]:
# Rank the input features by importance, from most to least important
pd.DataFrame(
    data={
        'feature': rfModel.feature_names_in_,
        'importance': rfModel.feature_importances_,
    }
).sort_values('importance', ascending=False)

In [None]:
# Predict body mass on the testing data
y_pred = rfModel.predict(X_test)

In [None]:
# Compare testing predictions to actual values
p = sns.scatterplot(x=y_test, y=y_pred)
p.set_xlabel("Actual values", fontsize=14)
p.set_ylabel("Predicted values", fontsize=14)

# Add a diagonal line
# If the testing predictions are close to the actual values,
# points should fall along this line
plt.axline((3000, 3000), (6000, 6000), color='r', ls='--')

In [None]:
# Print mean squared error (MSE)
print("MSE: ", metrics.mean_squared_error(y_test, y_pred))

In [None]:
# Save the first random forest tree as singleTree
singleTree = rfModel.estimators_[0]

# Set image size
plt.figure(figsize=[18, 6])

# Plot a single regression tree
tree.plot_tree(singleTree, feature_names=X.columns.to_list(), filled=False, fontsize=10)

In [None]:
# Calculate predictions from the single tree
y_pred_single = singleTree.predict(X_test)

# Which has lower error: the single tree or the random forest?
print("MSE single tree: ", metrics.mean_squared_error(y_test, y_pred_single))