In [1]:
%pylab inline
%config InlineBackend.figure_format = 'retina'

from ipywidgets import interact

import pandas as pd
import scipy.stats as stats
import pymc as pm
import seaborn as sns
import arviz as az

Populating the interactive namespace from numpy and matplotlib


# Question 1

During your internship at EPCOR, you are given a dataset containing 10,000 observations of monthly utility bills (in Canadian dollars) for Edmonton houses over the last couple of years. Along with the monthly bill, you are also given:

- `avg_temperature`: the average temperature during the billing month (in Celsius).
- `household_size`: the number of people living in the house during the billing month.
- `house_taxes`: the yearly property taxes according to the last tax notice (in Canadian dollars).

With the consent of residents, the City of Edmonton kindly provided the household and tax information for the dataset. These records were matched to the utility bills, but the matching is not perfect.

Your goal is to determine whether the utility bill amount can be predicted using the other three variables.

## A (15 points)

Load the data from the file `EPCOR1.csv` (link provided below) into a Pandas DataFrame. Clean the data by removing any corrupted values.

Do you think using mean imputation is a better approach to deal with corrupted observations in this case? Explain.

In [10]:
path_to_data = 'https://raw.githubusercontent.com/ccontrer/MATH509-Winter2025-JupyterNotebooks/main/Data/Epcor1.csv'
data = pd.read_csv(path_to_data, index_col=0)
display(data.describe())


| | household_size | house_taxes | bill |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 2.0985 | 4107.637598 | 310.312478 |
| std | 1.175388 | 6314.119179 | 13.754624 |
| min | 0.0 | 682.06 | 287.571982 |
| 25% | 1.0 | 2718.06 | 300.194427 |
| 50% | 2.0 | 3613.9 | 307.332263 |
| 75% | 3.0 | 4519.92 | 317.376281 |
| max | 8.0 | 99999.0 | 446.63979 |


In [23]:
# 99999 marks a corrupted house_taxes entry: replace it with the mean of the valid entries
corrupted = data['house_taxes'] == 99999
data.loc[corrupted, 'house_taxes'] = data.loc[~corrupted, 'house_taxes'].mean()
# Rows with household_size == 8, for comparison with the imputed value
data2 = data.loc[data['household_size'] == 8]
display(data.describe())
display(data2.describe())

# Looking at the sample where the corrupted value occurred (household_size == 8), mean imputation does not look like a great choice.
# Looking at the existing samples, we unfortunately only have 3 points. However, all 3 values are much lower than the mean of about 3712.
# Intuitively, the issue with using the mean is that as the household size increases, there are more tax breaks, and hence the tax would
# be lower. It may have been better to use the 25th percentile, since these households are more likely to be on the lower side of tax.

| | household_size | house_taxes | bill |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 2.0985 | 3712.864442 | 310.312478 |
| std | 1.175388 | 1417.384687 | 13.754624 |
| min | 0.0 | 682.06 | 287.571982 |
| 25% | 1.0 | 2718.06 | 300.194427 |
| 50% | 2.0 | 3613.9 | 307.332263 |
| 75% | 3.0 | 4499.56 | 317.376281 |
| max | 8.0 | 13641.2 | 446.63979 |


| | household_size | house_taxes | bill |
|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 |
| mean | 8.0 | 2799.5 | 347.278488 |
| std | 0.0 | 111.98 | 10.637596 |
| min | 8.0 | 2687.52 | 337.246874 |
| 25% | 8.0 | 2743.51 | 341.701241 |
| 50% | 8.0 | 2799.5 | 346.155609 |
| 75% | 8.0 | 2855.49 | 352.294296 |
| max | 8.0 | 2911.48 | 358.432982 |
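
Part A asks for corrupted values to be removed, while the cell above imputes the mean instead. For comparison, here is a minimal sketch of the removal alternative. It assumes, as the cell above does, that `99999` is the marker for a corrupted `house_taxes` entry. The `toy` frame and the `SENTINEL` name are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Toy stand-in for the EPCOR data; the rows are illustrative only.
toy = pd.DataFrame({
    "household_size": [1, 2, 8, 3],
    "house_taxes": [2718.06, 99999.0, 2799.50, 4519.92],
    "bill": [300.19, 307.33, 346.16, 317.38],
})

SENTINEL = 99999  # assumed marker for a corrupted house_taxes entry

# Keep only the rows whose house_taxes is not the sentinel.
cleaned = toy.loc[toy["house_taxes"] != SENTINEL].reset_index(drop=True)

print(len(toy), "->", len(cleaned))
```

Dropping rows loses very little when only a handful of 10,000 observations are affected, and it avoids inventing a tax value for a household whose true value is unknown.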


## B (10 points)

Create scatter plots for the utility bill versus the average temperature and the property taxes in different subplots within the same figure.
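
One possible layout for the two subplots is sketched below. The data here is randomly generated and only stands in for the cleaned DataFrame; the value ranges are assumptions, so the shapes in the plot carry no meaning.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic placeholder data (assumed ranges, not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "avg_temperature": rng.uniform(-25, 25, n),
    "house_taxes": rng.uniform(700, 13000, n),
    "bill": rng.normal(310, 14, n),
})

# One row, two panels, sharing the bill axis so the panels are comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, column in zip(axes, ["avg_temperature", "house_taxes"]):
    ax.scatter(toy[column], toy["bill"], s=8, alpha=0.5)
    ax.set_xlabel(column)
axes[0].set_ylabel("bill")
fig.tight_layout()
```

With 10,000 points, a small marker size and some transparency (`s` and `alpha`) help keep dense regions readable.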

## C (10 points)

Preprocess the data by standardizing all predictor variables.
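
Standardizing here means subtracting each predictor's mean and dividing by its standard deviation. A minimal sketch with pandas follows; the `toy` values are made up, and the response `bill` is deliberately left on its original scale.

```python
import pandas as pd

# Toy stand-in for the cleaned data; the values are illustrative only.
toy = pd.DataFrame({
    "avg_temperature": [-20.0, -5.0, 10.0, 25.0],
    "household_size": [1, 2, 3, 8],
    "house_taxes": [2718.06, 3613.90, 4519.92, 2799.50],
    "bill": [340.0, 310.0, 300.0, 320.0],
})
predictors = ["avg_temperature", "household_size", "house_taxes"]

# z-score each predictor: (x - mean) / std, column by column.
standardized = toy.copy()
standardized[predictors] = (
    toy[predictors] - toy[predictors].mean()
) / toy[predictors].std()

print(standardized[predictors].mean().round(6))
print(standardized[predictors].std().round(6))
```

After this step every predictor has mean 0 and standard deviation 1, so each slope can be read as the change in `bill` per one standard deviation of that predictor.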

## D (25 points)

Propose prior distributions for a simple linear model for `bill` using `avg_temperature` and `household_size` as predictors. Create prior predictive plots. Use MCMC sampling (make 4 chains with 1000 samples and 1000 warm-up steps) to estimate the posterior distribution (it should take less than 20 seconds). Display a summary and plot all posterior distributions.

Justify your choice of prior distributions.
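
To illustrate what a prior predictive check looks like, the sketch below simulates bills directly from a set of priors using NumPy alone. The priors are hypothetical choices, not the expected answer: the intercept is centered near the observed mean bill of about 310, the slopes are centered at zero, and the noise scale is exponential with mean 10. Predictors are assumed to be standardized.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 500

# Hypothetical priors (assumptions), for standardized predictors.
alpha = rng.normal(310.0, 20.0, n_draws)    # intercept, in dollars
beta_temp = rng.normal(0.0, 10.0, n_draws)  # dollars per SD of temperature
beta_size = rng.normal(0.0, 10.0, n_draws)  # dollars per SD of household size
sigma = rng.exponential(10.0, n_draws)      # noise scale, prior mean 10

# Grid of standardized temperatures; household size held at its mean (0).
temp_std = np.linspace(-2, 2, 50)
size_std = 0.0

# One regression line per prior draw, then one simulated bill per grid point.
mu = alpha[:, None] + beta_temp[:, None] * temp_std[None, :] + beta_size[:, None] * size_std
prior_bills = rng.normal(mu, sigma[:, None])

print(prior_bills.shape)
print(np.percentile(prior_bills, [2.5, 97.5]))
```

If the simulated range includes many implausible bills (negative amounts, or thousands of dollars per month), the priors are too wide and should be tightened before sampling. In PyMC, the same priors would be declared with `pm.Normal` and `pm.Exponential` inside a model context.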

## E (10 points)

Based on the scatter plot obtained in part B, do you think this is a reasonable model? Interpret the mean of the posterior for each of the parameters. Comment on the relationship between the monthly average temperature and the utility bill amount. How would you approach modeling the utility bill as a function of the average monthly temperature?

## F (20 points)

Extend the linear model by adding `house_taxes` as a predictor. Use MCMC sampling (again, make 4 chains with 1000 samples and 1000 warm-up steps) to estimate the posterior distribution (it should take less than 25 seconds). Display a summary and plot all posterior distributions.

Justify your choice of prior distribution.

## G (10 points)

Interpret the mean of the posterior for the new parameter. Compare the other mean posterior values with those obtained in part E. Based on the scatter plot you obtained in part B, do you think this is a reasonable model? Comment on the relationship between the yearly property taxes and the utility bill amount. What do you think is happening here?