# Stat 220 Final Examination – Take Home


You will work with three separate datasets (your instructor will provide actual files or simulation code):



```python
# Standard imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.metrics import mean_squared_error, accuracy_score

import statsmodels.api as sm
import statsmodels.formula.api as smf

import bambi as bmb
import arviz as az
```


## Part 1. Home Energy Use (Regression, Transformations, Trees)

You are given the dataset `energy.csv`, which contains information about houses and their monthly electricity usage.

The target variable is **`monthly_energy_kwh`**, and other variables may include:

- `house_size` (square feet, continuous)  
- `num_occupants` (integer)  
- `avg_outdoor_temp` (°F, continuous)  
- `num_appliances` (integer)  
- `insulation_quality` (categorical: `"low"`, `"high"`)  
- `heating_type` (categorical: e.g., `"electric"`, `"gas"`, `"heat_pump"`)  

The dataset can be found at "https://github.com/drbob-richardson/stat220/raw/main/data/energy.csv"



### Problem 1. 

- Read in `energy.csv`.
- Plot a histogram of `monthly_energy_kwh`.
- Based on the plot, comment briefly on whether energy looks skewed enough that a log transformation might be useful.
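
A minimal sketch of the kind of check this problem asks for. It uses a synthetic, right-skewed stand-in for `monthly_energy_kwh` (an assumption, since the real file is not loaded here) and compares the histogram and sample skewness before and after a log transformation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for energy.csv (assumption: the values are illustrative only)
rng = np.random.default_rng(0)
energy = pd.DataFrame(
    {"monthly_energy_kwh": rng.lognormal(mean=6.5, sigma=0.6, size=500)}
)

# Side-by-side histograms of the raw and log-transformed target
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(energy["monthly_energy_kwh"], bins=30)
axes[0].set_title("monthly_energy_kwh")
axes[1].hist(np.log(energy["monthly_energy_kwh"]), bins=30)
axes[1].set_title("log(monthly_energy_kwh)")
fig.tight_layout()

# Sample skewness as a numeric companion to the plots
raw_skew = energy["monthly_energy_kwh"].skew()
log_skew = np.log(energy["monthly_energy_kwh"]).skew()
print(f"skewness raw: {raw_skew:.2f}, log: {log_skew:.2f}")
```

With the real data, replace the synthetic frame with `pd.read_csv(...)` on the URL given above; the skewness values there may look quite different.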


### Problem 2.

- Create dummy variables for categorical predictors (e.g., `insulation_quality`, `heating_type`), remembering to drop one category for each as a baseline.
- Regardless of what you decided in Problem 1, do not log-transform the target variable.
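
One possible way to build the dummy variables is `pd.get_dummies` with `drop_first=True`. The sketch below uses a small hand-made frame (the rows are made up for illustration) with the column names from the dataset description.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the problem statement
toy = pd.DataFrame(
    {
        "house_size": [1500, 2200, 1800, 2600],
        "insulation_quality": ["low", "high", "high", "low"],
        "heating_type": ["electric", "gas", "heat_pump", "gas"],
    }
)

# drop_first=True drops the first level (in sorted order) of each categorical
encoded = pd.get_dummies(
    toy,
    columns=["insulation_quality", "heating_type"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
# ['house_size', 'insulation_quality_low', 'heating_type_gas', 'heating_type_heat_pump']
```

Because levels are sorted alphabetically, `"high"` and `"electric"` become the baselines here. If a different baseline is easier to interpret, the categories can be reordered before encoding.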


### Problem 3.

- Split the data into a training and test set.
- Fit a linear regression model to the training set.
- Report the MSE for both the training set and the test set.

### Problem 4

- Plot the residuals against the fitted values.
- Based on this plot, does the target variable require a log transformation?



### Problem 5

- Interpret the coefficient for average outdoor temperature (`avg_outdoor_temp`) in context, and report its confidence interval.


### Problem 6

- Interpret the coefficient(s) for heating type (`heating_type`) in context, and report the corresponding confidence interval(s).
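
For Problems 5 and 6, coefficients and their confidence intervals can be read from a fitted statsmodels result. The sketch below uses the formula interface on synthetic data; the 95% level (`alpha=0.05`) is an assumption, since the problems do not state a level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (illustrative effect sizes)
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame(
    {
        "avg_outdoor_temp": rng.uniform(20, 95, n),
        "heating_type": rng.choice(["electric", "gas", "heat_pump"], n),
    }
)
shift = df["heating_type"].map({"electric": 0.0, "gas": -120.0, "heat_pump": -60.0})
df["monthly_energy_kwh"] = (
    900 - 3.0 * df["avg_outdoor_temp"] + shift + rng.normal(0, 40, n)
)

# C(...) dummy-codes heating_type, with "electric" as the baseline level
fit = smf.ols("monthly_energy_kwh ~ avg_outdoor_temp + C(heating_type)", data=df).fit()

# One row per coefficient: estimate plus 95% confidence bounds
table = fit.conf_int(alpha=0.05).rename(columns={0: "ci_lower", 1: "ci_upper"})
table.insert(0, "coef", fit.params)
print(table)
```

Each heating-type row is a difference from the baseline level, holding the other predictors fixed, which is the framing an in-context interpretation would need.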

### Problem 7

- Look at the p-values and identify any predictors with p-values above 0.05.
- Iteratively remove one predictor at a time, starting with the largest p-value. After each removal, refit the model and report the test MSE.


### Problem 8. 

- Use the same data set with all the variables and fit a regression tree. 
- Fit the regression tree with depth 2, depth 3, depth 4, and depth 5. 


### Problem 9

- Report the training and test set R^2 for all 4 models in Problem 8.
- Based on these metrics, which model is best?
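
A minimal sketch covering Problems 8 and 9 together: fit one regression tree per depth and tabulate R^2 on both splits. The data are synthetic and the split settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with a mildly nonlinear signal
rng = np.random.default_rng(5)
n = 500
X = pd.DataFrame(
    {
        "house_size": rng.uniform(800, 3500, n),
        "avg_outdoor_temp": rng.uniform(20, 95, n),
    }
)
y = (0.3 * X["house_size"] + 0.2 * (X["avg_outdoor_temp"] - 60) ** 2
     + rng.normal(0, 50, n))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for depth in [2, 3, 4, 5]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    # .score returns R^2 for regressors
    scores[depth] = {
        "train_r2": tree.score(X_train, y_train),
        "test_r2": tree.score(X_test, y_test),
    }

r2_table = pd.DataFrame(scores).T  # one row per depth
print(r2_table)
```

Training R^2 can only go up as the tree gets deeper, so the test column is usually the more informative one for choosing a depth.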

### Problem 10

- Add 4 terms (two quadratic terms and two interactions) to the final linear regression model from Problem 7:
  - `avg_outdoor_temp^2`
  - `house_size^2`
  - `num_occupants * house_size` (interaction)
  - `avg_outdoor_temp * insulation_quality` (interaction with a dummy)
- Refit the model and keep any interactions that are significant. 
- Remove the ones that are not significant and refit. 
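
One way to express the four extra terms is the statsmodels formula interface, where `I(x**2)` adds a squared term and `a:b` adds an interaction. The sketch below uses synthetic data; which main effects survive from Problem 7 is an assumption here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a genuine quadratic temperature effect
rng = np.random.default_rng(6)
n = 500
df = pd.DataFrame(
    {
        "house_size": rng.uniform(800, 3500, n),
        "num_occupants": rng.integers(1, 7, n),
        "avg_outdoor_temp": rng.uniform(20, 95, n),
        "insulation_quality": rng.choice(["low", "high"], n),
    }
)
df["monthly_energy_kwh"] = (
    0.3 * df["house_size"]
    + 40 * df["num_occupants"]
    + 0.2 * (df["avg_outdoor_temp"] - 60) ** 2
    + rng.normal(0, 40, n)
)

formula = (
    "monthly_energy_kwh ~ house_size + num_occupants + avg_outdoor_temp"
    " + C(insulation_quality)"
    " + I(avg_outdoor_temp**2) + I(house_size**2)"
    " + num_occupants:house_size + avg_outdoor_temp:C(insulation_quality)"
)
fit = smf.ols(formula, data=df).fit()

# Pick out the added terms by name and inspect their p-values
extra = [name for name in fit.params.index if "**" in name or ":" in name]
print(fit.pvalues[extra])
```

Terms whose p-values exceed the chosen threshold would then be deleted from the formula string before refitting.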

### Problem 11

You have 3 final models:

- The final regression model from Problem 7
- The best regression tree model from Problem 9
- The final model from Problem 10

Use a justifiable method to determine which of the three is best.

### Problem 12. Prediction for a new house.

Suppose you are given a new dataset `energy_new.csv` containing one new house.

- Read in that file and apply the same preprocessing you applied to your training data.
- Using your final linear regression model from Problem 10, compute:
  - The predicted **mean** `monthly_energy_kwh` for this house.
  - A 95% confidence interval for the mean prediction.
  - A 95% prediction interval for a single new observation.

The data set can be found at "https://github.com/drbob-richardson/stat220/raw/main/data/energy_new.csv"
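
In statsmodels, all three quantities can be obtained from `get_prediction(...).summary_frame(...)`. The sketch below uses a small synthetic model and a made-up new house rather than the real files.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data (illustrative effect sizes)
rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame(
    {
        "house_size": rng.uniform(800, 3500, n),
        "insulation_quality": rng.choice(["low", "high"], n),
    }
)
shift = df["insulation_quality"].map({"low": 80.0, "high": 0.0})
df["monthly_energy_kwh"] = 200 + 0.3 * df["house_size"] + shift + rng.normal(0, 50, n)

fit = smf.ols("monthly_energy_kwh ~ house_size + C(insulation_quality)", data=df).fit()

# Hypothetical new house; the formula interface handles the dummy coding
new_house = pd.DataFrame({"house_size": [2100], "insulation_quality": ["high"]})

pred = fit.get_prediction(new_house).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```

The `mean_ci_*` columns give the confidence interval for the mean response, and the `obs_ci_*` columns give the wider prediction interval for a single new observation.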

### Problem 13

Use the best regression tree model from Problem 9. Find a 95% bootstrap confidence interval for the predicted energy usage of the new home in `energy_new.csv`.
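
A minimal sketch of a percentile bootstrap for a tree prediction: resample the training rows with replacement, refit the tree, predict for the new home, and take the 2.5th and 97.5th percentiles. The data, the depth of 3, the 500 resamples, and the choice of the percentile method are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data
rng = np.random.default_rng(8)
n = 300
X = pd.DataFrame(
    {
        "house_size": rng.uniform(800, 3500, n),
        "avg_outdoor_temp": rng.uniform(20, 95, n),
    }
)
y = 0.3 * X["house_size"] - 2.0 * X["avg_outdoor_temp"] + rng.normal(0, 50, n)

# Hypothetical new home with the same columns as X
new_house = pd.DataFrame({"house_size": [2100], "avg_outdoor_temp": [55]})

n_boot = 500
preds = np.empty(n_boot)
for b in range(n_boot):
    # Draw n row positions with replacement, then refit on the resample
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X.iloc[idx], y.iloc[idx])
    preds[b] = tree.predict(new_house)[0]

ci_lower, ci_upper = np.percentile(preds, [2.5, 97.5])
print(f"95% bootstrap CI: ({ci_lower:.1f}, {ci_upper:.1f})")
```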

### Problem 14. 

- Start with the original data frame. 
- Use only the variables for house size and insulation quality as predictors. 
- Fit a Bayesian linear regression model to predict energy usage using only those two variables. 

### Problem 15

Using the model fit in Problem 14, find the posterior probability that the coefficient associated with higher-quality insulation is positive.
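
Given posterior draws of a coefficient, this probability is simply the fraction of draws above zero. The sketch below uses simulated draws as a stand-in (an assumption). With a fitted bambi model, the draws would typically be pulled from the `posterior` group of the returned ArviZ `InferenceData` object, under whatever name the model assigns to the insulation term.

```python
import numpy as np

# Simulated stand-in for posterior draws, shaped (chains, draws).
# These numbers are illustrative only, not results from the exam data.
rng = np.random.default_rng(9)
draws = rng.normal(loc=-80.0, scale=30.0, size=(4, 1000))

# Posterior probability that the coefficient is positive:
# the share of draws, pooled over all chains, that exceed zero
prob_positive = float((draws > 0).mean())
print(f"P(coefficient > 0 | data) is approximately {prob_positive:.3f}")
```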