# Lab: Trees and Model Stability

Trees are notorious for being **unstable**: Small changes in the data can lead to noticeable or large changes in the tree. We're going to explore this phenomenon, and a common rebuttal.

In the folder for this lab, there are three datasets that we used in class: Divorce, heart failure, and the AirBnB price dataset.

1. Pick one of the datasets and appropriately clean it.
2. Perform a train-test split for a specific seed (save the seed for reproducibility). Fit a classification/regression tree and a linear model on the training data and evaluate their performance on the test data. Set aside the predictions these models make.
3. Repeat step 2 for three to five different seeds (save the seeds for reproducibility). How different are the trees that you get? Your linear model coefficients?. Set aside the predictions these models make.

Typically, you would see the trees changing what appears to be a non-trivial amount, while the linear model coefficients don't vary nearly as much. Often, the changes appear substantial. 

But are they?

4. Instead of focusing on the tree or model coefficients, do three things:
    1. Make scatterplots of the predicted values on the test set from question 2 against the predicted values for the alternative models from part 3, separately for your trees and linear models. Do they appear reasonably similar?
    2. Compute the correlation between your model in part 2 and your alternative models in part 3, separately for your trees and linear models. Are they highly correlated or not?
    3. Run a simple linear regression of the predicted values on the test set from the alternative models on the predicted values from question 2, separately for your trees and linear models. Is the intercept close to zero? Is the slope close to 1? Is the $R^2$ close to 1?

5. Do linear models appear to have similar coefficients and predictions across train/test splits? Do trees?
6. True or false, and explain: "Even if the models end up having a substantially different appearance, the predictions they generate are often very similar."

In [6]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load and clean data

df = pd.read_csv("./data/airbnb_hw.csv") # Chose Airbnb dataset

# Standardize column names
df.columns = df.columns.str.strip().str.replace(" ", "_")

# Convert price column (string) into numeric
df["Price_num"] = (
    df["Price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

predictors = [
    "Neighbourhood",
    "Property_Type",
    "Room_Type",
    "Review_Scores_Rating_(bin)",
    "Zipcode",
    "Beds",
    "Number_of_Records",
    "Number_Of_Reviews",
    "Review_Scores_Rating"
]

df_clean = df.dropna(subset=predictors + ["Price_num"]).copy()

# One-hot encode categoricals
X = pd.get_dummies(df_clean[predictors], drop_first=True)
y = df_clean["Price_num"]

def fit_models(seed):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # Tree
    tree = DecisionTreeRegressor(random_state=seed)
    tree.fit(X_train, y_train)
    tree_preds = tree.predict(X_test)

    # Linear regression
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    lm_preds = lm.predict(X_test)

    return {
        "seed": seed,
        "tree": tree,
        "lm": lm,
        "tree_preds": tree_preds,
        "lm_preds": lm_preds,
        "X_test": X_test,
        "y_test": y_test,
        "tree_rmse": mean_squared_error(y_test, tree_preds),
        "tree_r2": r2_score(y_test, tree_preds),
        "lm_rmse": mean_squared_error(y_test, lm_preds),
        "lm_r2": r2_score(y_test, lm_preds),
    }


# 2

main_seed = 0
results_main = fit_models(main_seed)

print("Question 2: MAIN SEED (0) RESULTS")
print(f"Tree RMSE: {results_main['tree_rmse']:.3f}")
print(f"Tree R^2: {results_main['tree_r2']:.3f}")
print(f"Linear RMSE: {results_main['lm_rmse']:.3f}")
print(f"Linear R^2: {results_main['lm_r2']:.3f}")


# 3
other_seeds = [7, 21, 42]
results_other = [fit_models(s) for s in other_seeds]

print("\nQuestion 3: OTHER SEED RESULTS")
for r in results_other:
    print(f"\nSeed {r['seed']}")
    print(f"Tree RMSE: {r['tree_rmse']:.3f}, Tree R^2: {r['tree_r2']:.3f}")
    print(f"Linear RMSE: {r['lm_rmse']:.3f}, Linear R^2: {r['lm_r2']:.3f}")

# predictions on each seed's OWN test set
tree_predictions = {r["seed"]: r["tree_preds"] for r in [results_main] + results_other}
lm_predictions   = {r["seed"]: r["lm_preds"]   for r in [results_main] + results_other}

X_test_main = results_main["X_test"]
y_test_main = results_main["y_test"]

tree_preds_common = {}
lm_preds_common = {}

for r in [results_main] + results_other:
    seed = r["seed"]
    tree_preds_common[seed] = r["tree"].predict(X_test_main)
    lm_preds_common[seed]   = r["lm"].predict(X_test_main)


Question 2: MAIN SEED (0) RESULTS
Tree RMSE: 14169.154
Tree R^2: 0.060
Linear RMSE: 9326.344
Linear R^2: 0.381

Question 3: OTHER SEED RESULTS

Seed 7
Tree RMSE: 36741.003, Tree R^2: -0.418
Linear RMSE: 19463.556, Linear R^2: 0.249

Seed 21
Tree RMSE: 14987.259, Tree R^2: -0.177
Linear RMSE: 7332.116, Linear R^2: 0.424

Seed 42
Tree RMSE: 18188.914, Tree R^2: -0.029
Linear RMSE: 11918.809, Linear R^2: 0.325


1. We chose the airbnb_hw.csv dataset. To clean this dataset, we standardized the column names, replacing spaces with underscores. We also converted the Price column to a numeric variable Price_num by removing any $ or , characters and casting to float, dropping the null values and encoding the categorical columns.

2. Code above. For the seed 0, the Tree RMSE and R^2 was 14169.154 and 0.060 respectively, while the Linear RMSE and Linear R^2 was 9326.344 and 0.381.

3. Across the different seeds, the decision trees changed a lot. Their test RMSE values jumped from about 14K (seed 0) to 36K (seed 7), and the R² values even became negative for every alternative seed. This shows that the tree structure and its predictive behavior are highly unstable: small changes in the data lead to very different splits and very different performance. The linear models were noticeably more stable. Their RMSE values moved, but not nearly as dramatically, and the R² values stayed within a narrower range (about 0.25–0.42). Even when performance changed, the linear model’s coefficients remained much more consistent across seeds. Overall, the trees varied substantially, while the linear models changed only moderately.