[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/drbob-richardson/stat220/blob/main/Assignments/Stat_220_HW7.ipynb)

**Problem 1**: Consider the data set on bike share counts in Seoul, Korea. You can read in the data using



bikes = pd.read_csv("https://richardson.byu.edu/220/bikes.csv")

Count is the number of bicycles rented during the lunch hour each day. The continuous predictors are Temperature, Humidity, Wind_speed, Visibility, and Rainfall. Seasons is a categorical variable with multiple levels and Holiday is a categorical variable with two levels.

Part a. Split the data into a training and test set.

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

bikes = pd.read_csv("https://richardson.byu.edu/220/bikes.csv")

X = bikes.drop('Count', axis=1)
y = bikes['Count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Part b.  Build a linear regression model for the training data using all the predictors in the model with Count as the target variable.  Remove all predictors from the model with a P-Value greater than 0.05. What predictors are left?

In [34]:

X_train_dum = pd.get_dummies(X_train, drop_first=True, dtype=float)
X_test_dum = pd.get_dummies(X_test, drop_first=True, dtype=float)
X_test_dum = X_test_dum.reindex(columns=X_train_dum.columns, fill_value=0)

X_with_const = sm.add_constant(X_train_dum)
full_model = sm.OLS(y_train, X_with_const).fit()

print("Full model:")
print(full_model.summary())

keep = full_model.pvalues[1:][full_model.pvalues[1:] <= 0.05]

print("\nKeeping:", list(keep.index))

X_reduced = X_with_const[['const'] + list(keep.index)]
reduced_model = sm.OLS(y_train, X_reduced).fit()

print("\nReduced model:")
print(reduced_model.summary())

Full model:
                            OLS Regression Results                            
Dep. Variable:                  Count   R-squared:                       0.510
Model:                            OLS   Adj. R-squared:                  0.494
Method:                 Least Squares   F-statistic:                     32.61
Date:                Thu, 13 Nov 2025   Prob (F-statistic):           6.17e-39
Time:                        23:44:51   Log-Likelihood:                -2084.7
No. Observations:                 292   AIC:                             4189.
Df Residuals:                     282   BIC:                             4226.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const               1147

Part c. Instead of removing all predictors with a P-Value greater than 0.05 at once, remove the predictor with the largest P-Value and refit, then repeat that process until all the predictors that remain are significant (have P-Values less than or equal to 0.05). What predictors are left in the model?

In [35]:


print("Backward elimination:")

current_features = list(X_train_dum.columns)

while True:
    X_curr = sm.add_constant(X_train_dum[current_features])
    mod = sm.OLS(y_train, X_curr).fit()

    p_vals = mod.pvalues[1:]

    max_p = p_vals.max()
    worst = p_vals.idxmax()

    print(f"  Worst: {worst} (p={max_p:.4f})")

    if max_p <= 0.05:
        print("  Done! All p-values <= 0.05")
        break

    current_features.remove(worst)

X_train_final = sm.add_constant(X_train_dum[current_features])
model_1c = sm.OLS(y_train, X_train_final).fit()

print("\nFinal predictors:", current_features)
print("\n", model_1c.summary())

Backward elimination:
  Worst: Seasons_Spring (p=0.9399)
  Worst: Wind_speed (p=0.7413)
  Worst: Holiday_No Holiday (p=0.2449)
  Worst: Visibility (p=0.1887)
  Worst: Rainfall (p=0.0610)
  Worst: Seasons_Summer (p=0.0284)
  Done! All p-values <= 0.05

Final predictors: ['Temperature', 'Humidity', 'Seasons_Summer', 'Seasons_Winter']

                             OLS Regression Results                            
Dep. Variable:                  Count   R-squared:                       0.498
Model:                            OLS   Adj. R-squared:                  0.491
Method:                 Least Squares   F-statistic:                     71.26
Date:                Thu, 13 Nov 2025   Prob (F-statistic):           7.52e-42
Time:                        23:44:51   Log-Likelihood:                -2088.1
No. Observations:                 292   AIC:                             4186.
Df Residuals:                     287   BIC:                             4205.
Df Model:                       

Part d. Regardless of whether or not you got the same set of predictors in parts b and c, the two approaches can potentially give different results. Explain why.

Each time you remove a feature, the coefficients and p-values of the remaining features can change, because the predictors are related to one another. A great example we talked about in class was age and years since graduation. These variables carry almost the same information, because the older someone is, the longer it has usually been since they graduated. When both are in the model, each can have a large p-value even though together they matter, so removing everything above 0.05 at once would throw out both. If you remove only one of them and refit, the p-value of the other may drop below 0.05. This is why it is generally safer to remove one feature at a time and refit until every remaining predictor is significant.
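The age and years-since-graduation idea can be sketched on synthetic data. Everything below is an assumption made for illustration (the variable names, sample size, noise levels, and the helper `ols_t_stats` are hypothetical and not part of the assignment); it only shows how a near-duplicate predictor inflates standard errors, which is what makes p-values move when one predictor is dropped.

```python
import numpy as np

def ols_t_stats(X, y):
    """Fit OLS with an intercept; return (coefficients, standard errors, t statistics)."""
    X1 = np.column_stack([np.ones(len(y)), X])        # prepend the intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # least-squares coefficients
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]                        # residual degrees of freedom
    sigma2 = resid @ resid / dof                      # estimated error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    return beta, se, beta / se

rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
years_since_grad = age - 22 + rng.normal(0, 0.1, n)   # almost a copy of age (assumed)
outcome = 2.0 * age + rng.normal(0, 10, n)            # only age drives the outcome

# Model with both correlated predictors vs. model with age alone
_, se_both, t_both = ols_t_stats(np.column_stack([age, years_since_grad]), outcome)
_, se_age, t_age = ols_t_stats(age.reshape(-1, 1), outcome)

print("both predictors: t =", np.round(t_both[1:], 2))
print("age alone:       t =", np.round(t_age[1:], 2))
print("standard error of age, both vs alone:", se_both[1], se_age[1])
```

With both predictors in the model, the standard error on age is far larger, so its t statistic is typically small; once the near-duplicate is removed, the same predictor is clearly significant.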

Part e. Find the out of sample MSE for three models: the model with all predictors, the model with all variables with p-values above 0.05 removed at once, and the model with variables removed one at a time. Which model is best?

In [36]:

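# Add the intercept column to the aligned test dummies from Part b.
# has_constant='add' forces the column even if a test column happens to be constant.
X_test_with_const_full = sm.add_constant(X_test_dum, has_constant='add')

# Full model from Part b
y_pred_full = full_model.predict(X_test_with_const_full)
mse_full = mean_squared_error(y_test, y_pred_full)

# Part b model: every predictor with p > 0.05 removed at once
significant_features = list(keep.index)
X_test_1b = X_test_with_const_full[['const'] + significant_features]
y_pred_1b = reduced_model.predict(X_test_1b)
mse_1b = mean_squared_error(y_test, y_pred_1b)

# Part c model: backward elimination, one predictor at a time
X_test_1c = X_test_with_const_full[['const'] + current_features]
y_pred_1c = model_1c.predict(X_test_1c)
mse_1c = mean_squared_error(y_test, y_pred_1c)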


print("OUT-OF-SAMPLE MSE COMPARISON")
print()
print(f"\n1. Full Model (all predictors):")
print(f"   MSE = {mse_full:.2f}")

print(f"\n2. Part b Model (remove all p>0.05 at once):")
print(f"   Predictors: {significant_features}")
print(f"   MSE = {mse_1b:.2f}")

print(f"\n3. Part c Model (backward elimination):")
print(f"   Predictors: {current_features}")
print(f"   MSE = {mse_1c:.2f}")

best_model = min([(mse_full, "Full Model"), (mse_1b, "Part b Model"), (mse_1c, "Part c Model")], key=lambda x: x[0])
print(f"\n** BEST MODEL: {best_model[1]} with MSE = {best_model[0]:.2f} **")


OUT-OF-SAMPLE MSE COMPARISON


1. Full Model (all predictors):
   MSE = 77830.43

2. Part b Model (remove all p>0.05 at once):
   Predictors: ['Temperature', 'Humidity', 'Rainfall', 'Seasons_Winter']
   MSE = 72715.22

3. Part c Model (backward elimination):
   Predictors: ['Temperature', 'Humidity', 'Seasons_Summer', 'Seasons_Winter']
   MSE = 76984.62

** BEST MODEL: Part b Model with MSE = 72715.22 **


**Problem 2** Use the same data as above and the same train-test split.

Part a. Build a regression tree with a maximum depth of 2. Find the out of sample MSE.

In [37]:

tree_depth2 = DecisionTreeRegressor(max_depth=2, random_state=42)
# Fit on the dummy-encoded training predictors from Problem 1
tree_depth2.fit(X_train_dum, y_train)

y_pred_tree2 = tree_depth2.predict(X_test_dum)

mse_tree2 = mean_squared_error(y_test, y_pred_tree2)

print("Problem 2a: Regression Tree with max_depth=2")
print(f"Out-of-sample MSE: {mse_tree2:.2f}")

Problem 2a: Regression Tree with max_depth=2
Out-of-sample MSE: 63280.79


Part b. Increase the depth to 3, 4, 5, and 6. Check the out of sample MSE for each and report them.

In [38]:

depths = [3, 4, 5, 6]
mse_results = {}

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    # Fit on the dummy-encoded training predictors from Problem 1
    tree.fit(X_train_dum, y_train)

    y_pred = tree.predict(X_test_dum)

    mse = mean_squared_error(y_test, y_pred)
    mse_results[depth] = mse

    print(f"Depth = {depth}: Out-of-sample MSE = {mse:.2f}")

all_mse = {2: mse_tree2}
all_mse.update(mse_results)

print("\nAll depths tested:")
for depth in sorted(all_mse.keys()):
    print(f"  Depth {depth}: MSE = {all_mse[depth]:.2f}")

best_depth = min(all_mse, key=all_mse.get)
best_mse = all_mse[best_depth]

print()
print(f"** BEST DEPTH: {best_depth} with MSE = {best_mse:.2f} **")


Depth = 3: Out-of-sample MSE = 71484.28
Depth = 4: Out-of-sample MSE = 64447.18
Depth = 5: Out-of-sample MSE = 66266.62
Depth = 6: Out-of-sample MSE = 86835.98

All depths tested:
  Depth 2: MSE = 63280.79
  Depth 3: MSE = 71484.28
  Depth 4: MSE = 64447.18
  Depth 5: MSE = 66266.62
  Depth 6: MSE = 86835.98

** BEST DEPTH: 2 with MSE = 63280.79 **


Part c. Based on the out of sample MSE, which depth is best?

Based on the out-of-sample MSE, the best depth for this model is 2. Its MSE (63,280.79) is about 1,170 lower than the closest competitor, depth 4 (64,447.18), and the MSE is clearly worse by depth 6 (86,835.98), which suggests the deeper trees are overfitting the training data. I love that there are hyperparameters we can play with to understand the data and build better models. I think it is important to understand these metrics and how they help us choose models that can help people in the real world!

**Problem 3** Explain why using out of sample metrics is important for finding the best model as opposed to using in sample metrics. Out of all the models, both regression tree and linear regression models, which is the best model using out-of-sample MSE?

Out-of-sample metrics are more trustworthy than in-sample metrics for choosing a model. In-sample metrics are computed on the same data the model was fit to, so they reward memorizing noise and tend to look better as the model gets more complex. Out-of-sample metrics test the model on data it has never seen before, which is a much better reflection of how it will perform in real life on new data.

Out of all of the models, the best model using out-of-sample MSE is the regression tree with a depth of 2. With an MSE of 63,281, it beat the three linear regression models, which had MSEs of 77,830 (full model), 72,715 (Part b), and 76,985 (Part c), as well as the deeper trees. In this case the simplest tree structure gave the most accurate predictions on the test set.
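The contrast between in-sample and out-of-sample error can be sketched on synthetic data (not the bike data). The data-generating function, sample sizes, and polynomial degrees below are assumptions chosen only for illustration, and polynomial fits stand in for "more complex models" in place of deeper trees.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Synthetic data: a smooth signal plus noise (assumed, for illustration only)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

def mean_sq_err(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_fit, y_fit = make_data(30)      # small "training" sample
x_new, y_new = make_data(200)     # fresh "test" sample the fit never sees

results = {}
for degree in [1, 3, 6, 9]:
    coefs = np.polyfit(x_fit, y_fit, degree)          # least-squares polynomial fit
    results[degree] = (
        mean_sq_err(y_fit, np.polyval(coefs, x_fit)),  # in-sample MSE
        mean_sq_err(y_new, np.polyval(coefs, x_new)),  # out-of-sample MSE
    )
    print(f"degree {degree}: in-sample MSE = {results[degree][0]:.4f}, "
          f"out-of-sample MSE = {results[degree][1]:.4f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the in-sample MSE can only go down as the degree increases, so it cannot flag overfitting. The out-of-sample MSE has no such guarantee, which is why it is the number to compare across models.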

**Problem 4**: A store with an online presence collects revenue data by month. This data can be found at [richardson.byu.edu/220/revenue_data.csv](https://richardson.byu.edu/220/revenue_data.csv). The variable MonthlyRevenue is the target variable. Money spent on ads (AdSpend), site traffic (AvgTraffic), and discount rates (DiscountRate) are the predictors.

Part a. Split this data into a training set and a test set.

In [39]:

revenue = pd.read_csv("https://richardson.byu.edu/220/revenue_data.csv")

X_rev = revenue.drop('MonthlyRevenue', axis=1)
y_rev = revenue['MonthlyRevenue']

X_rev_train, X_rev_test, y_rev_train, y_rev_test = train_test_split(X_rev, y_rev, test_size=0.2, random_state=42)

Part b. Fit a linear regression model on the training set. Report the p-values for each variable.

In [40]:

X_rev_train_const = sm.add_constant(X_rev_train)

revenue_model = sm.OLS(y_rev_train, X_rev_train_const).fit()

print(revenue_model.summary())

print("\nP-values for each variable:")
for var, pval in revenue_model.pvalues[1:].items():
    print(f"  {var}: p-value = {pval:.6f}")

                            OLS Regression Results                            
Dep. Variable:         MonthlyRevenue   R-squared:                       0.707
Model:                            OLS   Adj. R-squared:                  0.702
Method:                 Least Squares   F-statistic:                     125.7
Date:                Thu, 13 Nov 2025   Prob (F-statistic):           2.01e-41
Time:                        23:44:52   Log-Likelihood:                -599.10
No. Observations:                 160   AIC:                             1206.
Df Residuals:                     156   BIC:                             1218.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           50.7875      6.977      7.279   

Part c. Interpret the p-value for AdSpend in the context of the problem. What does the value of that p-value imply for the relationship between these variables?

The p-value for AdSpend is effectively 0 (it rounds to 0 in the output), so AdSpend is highly statistically significant. If AdSpend had no linear relationship with MonthlyRevenue once AvgTraffic and DiscountRate are accounted for, an association this strong would almost never show up by random chance. The p-value measures the strength of the evidence, not the size of the effect, and because this is observational data it does not by itself prove that spending more on ads causes higher revenue. Still, in a business context it is good evidence that advertising is associated with higher monthly revenue, which supports continuing to invest in it.

Part d. Interpret the p-value for DiscountRate in the context of the problem. What does the value of that p-value imply for the relationship between these variables?

The p-value for DiscountRate is 0.175, which is greater than 0.05. This means the relationship between DiscountRate and MonthlyRevenue is not statistically significant once the other predictors are in the model. Basically, we don't have enough evidence to say that discount rates actually affect revenue; the effect we see could just be random chance. That is not the same as proving that discounts have no effect.
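For intuition about where these p-values come from, the sketch below recomputes a t statistic from a coefficient and its standard error and converts it to a two-sided p-value. It uses the `const` row printed in the summary above (coef 50.7875, std err 6.977). The helper `two_sided_p_normal` is hypothetical, and it uses the standard normal as an approximation; statsmodels itself uses the t distribution with the residual degrees of freedom (156 here), so the values will differ slightly.

```python
import math

def two_sided_p_normal(t_stat):
    """Approximate two-sided p-value, treating the statistic as standard normal.

    This is an approximation to the t distribution that is reasonable
    when the residual degrees of freedom are large.
    """
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Values copied from the const row of the regression summary
coef, std_err = 50.7875, 6.977

t_stat = coef / std_err                 # t = coefficient / standard error
p_approx = two_sided_p_normal(t_stat)

print(f"t = {t_stat:.3f}, approximate two-sided p = {p_approx:.2e}")
```

A large |t| means the estimate is many standard errors away from zero, which gives a tiny p-value; a |t| well under 2 gives a p-value above 0.05, as with DiscountRate.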

**Problem 5** Using the same data as problem 4.

Part a. Build three regression tree models on the training data set with max depths of 2, 3, and 5.

In [41]:

tree_models = {}
depths = [2, 3, 5]

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_rev_train, y_rev_train)
    tree_models[depth] = tree

Part b. Find the in sample and out of sample R^2 for all three models. (You should have 6 R^2 values in total.)

In [42]:

for depth in [2, 3, 5]:
    tree = tree_models[depth]

    y_train_pred = tree.predict(X_rev_train)
    r2_train = r2_score(y_rev_train, y_train_pred)

    y_test_pred = tree.predict(X_rev_test)
    r2_test = r2_score(y_rev_test, y_test_pred)

    print(f"\nDepth = {depth}:")
    print(f"  In-sample R² (training):  {r2_train:.4f}")
    print(f"  Out-of-sample R² (test):  {r2_test:.4f}")


Depth = 2:
  In-sample R² (training):  0.6327
  Out-of-sample R² (test):  0.6226

Depth = 3:
  In-sample R² (training):  0.7236
  Out-of-sample R² (test):  0.6404

Depth = 5:
  In-sample R² (training):  0.8661
  Out-of-sample R² (test):  0.5100


Part c. Use these R^2 values to detect any underfitting or overfitting in the models.


Looking at the R² values, depth 2 has an in-sample R² of 0.63 and an out-of-sample R² of 0.62, depth 3 has 0.72 and 0.64, and depth 5 has 0.87 and 0.51. Depth 2 has the smallest gap between training and test performance (about 0.01), so it is not overfitting, though it may be slightly underfit since depth 3 does a little better on the test set. Depth 3 has the best out-of-sample score (0.64), but its gap of about 0.08 is a sign of mild overfitting. Depth 5 overfits badly: it fits the training data very well (0.87) but does much worse on the test data (0.51).
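The gap between in-sample and out-of-sample R² can be tabulated directly from the numbers printed in Part b. This is only a small sketch; the dictionary name `r2_by_depth` and the layout are hypothetical, and the values are copied from the output above.

```python
# (in-sample R², out-of-sample R²) for each max_depth, copied from the Part b output
r2_by_depth = {
    2: (0.6327, 0.6226),
    3: (0.7236, 0.6404),
    5: (0.8661, 0.5100),
}

# A large positive gap (train much better than test) is the usual sign of overfitting
gaps = {depth: train - test for depth, (train, test) in r2_by_depth.items()}

for depth, (train, test) in r2_by_depth.items():
    print(f"depth {depth}: train R² = {train:.4f}, test R² = {test:.4f}, "
          f"gap = {gaps[depth]:.4f}")

best_test_depth = max(r2_by_depth, key=lambda d: r2_by_depth[d][1])
smallest_gap_depth = min(gaps, key=gaps.get)
print("highest out-of-sample R²:", best_test_depth)
print("smallest train-test gap: ", smallest_gap_depth)
```

Looking at both columns together matters: the depth with the smallest gap and the depth with the highest test R² do not have to be the same model.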