In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw11_sp24.ipynb")

# Homework 11: Regression Inference

**Helpful Resource:**

- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Reading**: 

* [Using Confidence Intervals](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html)
* [The Regression Line](https://inferentialthinking.com/chapters/15/2/Regression_Line.html#the-regression-line-in-standard-units)
* [Inference for Regression](https://www.inferentialthinking.com/chapters/16/Inference_for_Regression.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that you must write explanations and sentences for, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests on it. That means even though the tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *
import d8error

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore')
from datetime import datetime

## An Introduction to Regression Inference

Previously in this class, we've used confidence intervals to quantify uncertainty about estimates. We can also run hypothesis tests using a confidence interval under the following procedure:

1. Define a null and alternative hypothesis (they must be of the form "The parameter is X" and "The parameter is not X").
2. Choose a p-value cutoff, and call it $q\%$.
3. Construct a $(100-q)\%$ confidence interval using bootstrap sampling (for example, if your p-value cutoff is 0.01, or 1%, then $q = 1$ and you construct a 99% confidence interval).
4. Using the confidence interval, determine if your data are more consistent with your null or alternative hypothesis:
   * If the null hypothesis parameter X is in your confidence interval, the data are more consistent with the null hypothesis.
   * If the null hypothesis parameter X is *not* in your confidence interval, the data are more consistent with the alternative hypothesis.
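
The four steps above can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it uses plain NumPy rather than the `datascience` library, the sample is synthetic, and the names `sample`, `null_value`, and `cutoff_percent` are made up for this sketch. It tests a claim about a population mean, not a regression slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample; in practice this would be your observed data.
sample = rng.normal(loc=0.5, scale=1.0, size=200)

null_value = 0        # Step 1: the null says "the population mean is 0"
cutoff_percent = 5    # Step 2: p-value cutoff q, written as a percentage

# Step 3: bootstrap the sample mean, then keep the middle (100 - q)%.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
lower = np.percentile(boot_means, cutoff_percent / 2)
upper = np.percentile(boot_means, 100 - cutoff_percent / 2)

# Step 4: check whether the null value falls inside the interval.
null_in_interval = lower <= null_value <= upper
print(f"{100 - cutoff_percent}% CI: [{lower:.3f}, {upper:.3f}]")
print("more consistent with the null" if null_in_interval
      else "more consistent with the alternative")
```

Note that `np.percentile` interpolates between values by default, so its results can differ slightly from the percentile function used elsewhere in this course.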

More recently, we've discussed the use of linear regression to make predictions based on correlated variables. For example, we can predict the height of children based on the heights of their parents.

We can combine these two topics to make powerful statements about our population by using the following techniques:

- Bootstrapped interval for the true slope
- Bootstrapped prediction interval for y (given a particular value of x)

This homework explores these two methods.

## The Data
[American muscle cars](https://en.wikipedia.org/wiki/Muscle_car) are iconic vehicles celebrated for their powerful engines and bold styling. Known for their presence on roads and in pop culture, these cars represent a passion for speed and power but also come with challenges such as fuel inefficiency and increased emissions.

<img src="Muscle cars.png" alt="Muscle cars">

The data for this assignment pertains to various models of American muscle cars and is modeled on automotive test data and manufacturers' specifications. The aim is to analyze how engine size relates to horsepower and fuel efficiency. Greater horsepower makes for a more powerful car, but often at the cost of reduced fuel efficiency.

<img src="Blue Muscle Car.png" alt="Blue muscle car">

Muscle Car Specifications

Each row in the dataset corresponds to one muscle car model:

- `Engine Size (liters)` and `Horsepower (hp)` are measures of engine capacity and power output, respectively.
- `Fuel Efficiency (mpg)` and `Vehicle Weight (lbs)` are also included. For comparison, a typical small car may weigh around 2,500 lbs.

In [None]:
muscle_cars = Table.read_table('Muscle_Car_Data.csv')
muscle_cars

In this analysis, we will use `Engine Size (liters)` to predict `Horsepower (hp)`. This relationship is crucial in understanding how the size of an engine influences the horsepower output of muscle cars. Running the cell below will generate a scatter plot of `Engine Size (liters)` and `Horsepower (hp)`, along with their line of best fit.

In [None]:
# Just run this cell and examine the scatter plot.
muscle_cars.scatter('Engine Size (liters)', "Horsepower (hp)", fit_line=True)

## 1. Investigating the Linear Relationship between Engine Size and Horsepower
Upon examining the scatter plot of our muscle car dataset, we observe a potential linear relationship between engine size and horsepower. However, it's essential to verify if this apparent relationship holds true in the broader population of muscle cars.

We aim to determine if there is indeed a linear relationship between engine size and horsepower for muscle cars. If there is no such relationship, we would expect a correlation coefficient of 0. Consequently, the slope of the regression line would also be 0.

<!-- BEGIN QUESTION -->

**Question 1.1.**  We're conducting a hypothesis test using confidence intervals to explore the potential linear relationship between engine size and horsepower in muscle cars. Define the null and alternative hypotheses required for this test. **(8 points)**

Please write your answer **in the cell below** using the following format:

- **Null Hypothesis:**
- **Alternative Hypothesis:**




_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.2.** Define the following two functions tailored for analyzing the relationship between engine size and horsepower in muscle cars:

1. `std_units`: This function takes in an array of values, such as engine sizes, and returns an array of those values converted to standard units.
2. `correlation`: This function takes in a table and two column names (one for *x* and one for *y*) and returns the correlation between these columns. 

Please write your answer in the cell below, providing the code for each function. **(8 points)**
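
As a reminder, and not as a substitute for your own implementation, the textbook definitions behind these two functions can be written as follows. The notation is introduced here for illustration only: $\bar{x}$ and $\mathrm{SD}_x$ denote the mean and standard deviation of the $x$ values, $z_{x,i}$ is the $i$-th $x$ value in standard units, and $n$ is the number of rows.

$$z_{x,i} = \frac{x_i - \bar{x}}{\mathrm{SD}_x}, \qquad r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i}$$

In words, a value in standard units measures how many standard deviations it lies above or below the mean, and the correlation $r$ is the average of the products of the two variables in standard units.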

In [None]:
def std_units(arr):
    ...

def correlation(tbl, x_col, y_col):
    ...

In [None]:
grader.check("q1_2")

**Question 1.3.** Using the functions you implemented earlier, create a function called `fit_line` tailored for analyzing the relationship between engine size and horsepower in muscle cars. The function should take a table like `muscle_cars` and the column names associated with *x* and *y* as its arguments. It should return an *array* containing the slope and intercept of the regression line (in that order) that predicts the horsepower using the engine size.

Please write your answer in the cell below, providing the code for the `fit_line` function. **(8 points)**
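
For reference, the regression line in original units can be expressed through the correlation $r$, as covered in the Regression Line reading. The symbols $\bar{x}$, $\bar{y}$, $\mathrm{SD}_x$, and $\mathrm{SD}_y$ are notation assumed here for the means and standard deviations of the two columns:

$$\text{slope} = r \cdot \frac{\mathrm{SD}_y}{\mathrm{SD}_x}, \qquad \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$

The intercept formula simply says that the regression line passes through the point of averages $(\bar{x}, \bar{y})$.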


In [None]:
def fit_line(tbl, x_col, y_col):
    ...

fit_line(muscle_cars, "Engine Size (liters)", "Horsepower (hp)")

In [None]:
grader.check("q1_3")

**Run** this cell to plot the line produced by calling `fit_line` on the `muscle_cars` table.  

**Note:** You are not responsible for the code in the cell below, but make sure that your `fit_line` function generated a reasonable line for the data.

In [None]:
# Ensure your fit_line function fits a reasonable line 
# to the data in muscle cars, using the plot below.

# Just run this cell
slope, intercept = fit_line(muscle_cars, "Engine Size (liters)", "Horsepower (hp)")
muscle_cars.scatter("Engine Size (liters)", "Horsepower (hp)")
plt.plot([min(muscle_cars.column("Engine Size (liters)")), max(muscle_cars.column("Engine Size (liters)"))], 
         [slope*min(muscle_cars.column("Engine Size (liters)"))+intercept, slope*max(muscle_cars.column("Engine Size (liters)"))+intercept])
plt.show()

Now equipped with the essential tools, we can construct a confidence interval to quantify our uncertainty about the true association between engine size and horsepower in muscle cars.

<!-- BEGIN QUESTION -->

**Question 1.4.** Generate an array named `bootstrapped_slopes` containing the slope of the best fit line for 1000 bootstrap resamples of the `muscle_cars` dataset. Visualize the distribution of these slopes using a histogram. **(8 points)**








In [None]:
bootstrapped_slopes = ...

for i in np.arange(1000): 
    muscle_cars_bootstrap = ...
    bootstrap_line = ...
    one_bootstrap_slope = ...
    bootstrapped_slopes = ...
    
# DO NOT CHANGE THIS LINE
Table().with_column("Slope estimate", bootstrapped_slopes).hist()

In [None]:
grader.check("q1_4")

<!-- END QUESTION -->

**Question 1.5.** Use your `bootstrapped_slopes` to construct a 90% confidence interval for the true value of the slope. **(8 points)**


In [None]:
lower_end = ...
upper_end = ...
print("90% confidence interval for slope: [{:g}, {:g}]".format(lower_end, upper_end))

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

**Question 1.6.** Based on your confidence interval, would you reject or fail to reject the null hypothesis that the true slope is 0?  Why?  What p-value cutoff are you using? **(8 points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.7.** What do you think the true slope is? You do not need an exact number. How confident are you of this estimate? **(8 points)**

*Hint:* Can you provide an interval that you think the true slope falls in?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Finding the Bootstrap Prediction Interval

Suppose we're exploring a muscle car showroom and come across various engine sizes; we're interested in predicting their horsepower based on these engine sizes. In other words, we aim to utilize our regression line to make predictions about a car's horsepower based on the engine size.

However, just as we're uncertain about the slope of the true regression line, we're also uncertain about the predictions made based on the true regression line.

**Question 2.1.** 

Define the function `predicted_value` which takes in the following four arguments:

1. `table`: a table similar to the `muscle_cars` dataset. We'll be predicting the values in `y_column` using the values in `x_column`.
2. `x_column`: the name of the x-column within the input `table`.
3. `y_column`: the name of the y-column within the input `table`.
4. `given_x`: a number, the value of the explanatory variable for which we'd like to make a prediction.

The function should utilize the `fit_line` function defined in Question 1.3 to return the line's prediction for the given engine size. **(8 points)**


In [None]:
def predicted_value(table, x_column, y_column, given_x):
    regression_line = ...
    slope = ...
    intercept = ...
    ...

# Here's an example of how predicted_value is used. The code below
# computes the prediction for the horsepower (hp), based on
# an engine size in liters.
engine_size_seven = predicted_value(muscle_cars, "Engine Size (liters)", "Horsepower (hp)", 7)
engine_size_seven

In [None]:
grader.check("q2_1")

**Question 2.2.** Jonathon, the expert on muscle cars at our testing facility, informs us that a Dodge Challenger Scatpack R/T he's been closely monitoring has an engine size of 6.2 liters. Utilizing the `predicted_value` function defined earlier, assign the variable `experts_horsepower` to the predicted horsepower for Jonathon's muscle car.

In [None]:
experts_horsepower = ...
experts_horsepower

In [None]:
grader.check("q2_2")

In [None]:
# Let's look at the number of rows in the muscle car table.
muscle_cars.num_rows

A fellow automotive enthusiast raises the following objection to your prediction:

> "Your prediction depends on your sample of muscle cars. You only used 100 muscle cars. Wouldn't your prediction change if you had a different sample of muscle cars?"

Drawing upon your knowledge from the course materials, you understand the significance of sample variability in regression analysis. Indeed, had the sample of muscle cars been different, the regression line would have varied as well. Consequently, this would lead to different predictions. To accurately assess the reliability of our prediction, we must gauge the variability in our predictions across different samples of muscle cars.

*Hint:* You can find more on this topic in [16.3](https://inferentialthinking.com/chapters/16/3/Prediction_Intervals.html) of the textbook.
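
To make the idea of sample-to-sample variability concrete, here is a minimal sketch on synthetic data. It is an illustration only and rests on several assumptions: the `x` and `y` arrays are made up, the line is fit with NumPy's `np.polyfit` rather than the `fit_line` function from this notebook, and rows are resampled by index rather than with a table method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic (x, y) pairs standing in for a sample; the numbers are made up.
n = 100
x = rng.uniform(4, 8, size=n)
y = 60 * x + 50 + rng.normal(0, 40, size=n)

given_x = 7
predictions = np.empty(1000)
for i in range(predictions.size):
    # Resample rows with replacement, keeping each x paired with its y.
    idx = rng.integers(0, n, size=n)
    # For deg=1, np.polyfit returns the coefficients as [slope, intercept].
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    predictions[i] = slope * given_x + intercept

# The middle 90% of the bootstrapped predictions.
lower, upper = np.percentile(predictions, [5, 95])
print(f"approximate 90% interval for the prediction at x={given_x}: "
      f"[{lower:.1f}, {upper:.1f}]")
```

Each pass through the loop plays the role of "a different sample," so the spread of `predictions` gives a sense of how much the regression prediction at a fixed `x` could move from one sample to another.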

**Question 2.3.**
Define a function called `compute_bootstrapped_line` that takes in a table `table` and two column names `x_column` and `y_column`. It should return an array containing the parameters of the best fit line (slope and intercept) for one bootstrapped resample of the dataset.

In [None]:
def compute_bootstrapped_line(table, x_column, y_column):
    bootstrapped = ...
    bootstrapped_line = ...
    ...

In [None]:
grader.check("q2_3")

**Run** the cell below to define the function `bootstrap_lines`.  It takes in four arguments:
1. `table`: a table like `muscle_cars`
2. `x_column`: the name of our x-column within the input `table`
3. `y_column`: the name of our y-column within the input `table`
4. `number_bootstraps`: an integer, a number of bootstraps to run.

It returns a *table* with one row for each bootstrap resample and the following two columns:
1. `Slope`: the bootstrapped slopes 
2. `Intercept`: the corresponding bootstrapped intercepts 

In [None]:
# Just run this cell
def bootstrap_lines(table, x_column, y_column, number_bootstraps):
    reused_slopes = make_array()
    reused_intercepts = make_array() 
    for i in np.arange(number_bootstraps): 
        reused_line = compute_bootstrapped_line(table, x_column, y_column) 
        reused_slope = reused_line.item(0) 
        reused_intercept = reused_line.item(1) 
        reused_slopes = np.append(reused_slopes,reused_slope)
        reused_intercepts = np.append(reused_intercepts,reused_intercept)
    table_lines = Table().with_columns('Slope', reused_slopes, 'Intercept', reused_intercepts)
    return table_lines

regress_lines = bootstrap_lines(muscle_cars, "Engine Size (liters)", "Horsepower (hp)", 1000)
regress_lines

<!-- BEGIN QUESTION -->

**Question 2.4.** Generate an array named `predictions_for_seven` containing the predicted horsepower based on an engine size of 7 liters for each regression line in `regress_lines`. **(8 points)**

In [None]:
predictions_for_seven = ...

# This will make a histogram of your predictions:
table_of_predictions = Table().with_column('Predictions at Engine Size=7', predictions_for_seven)
table_of_predictions.hist('Predictions at Engine Size=7', bins=20)

In [None]:
grader.check("q2_4")

<!-- END QUESTION -->

**Question 2.5.** Create an approximate 90% confidence interval for these predictions. **(6 points)**


In [None]:
lower_bound = ...
upper_bound = ...

print('90% Confidence interval for predictions for x=7: (', lower_bound,",", upper_bound, ')')

In [None]:
grader.check("q2_5")

**Question 2.6.** Set `car_statements` to an array of integer(s) that correspond to statement(s) that are true. **(6 points)**

1. The 90% confidence interval covers 90% of the predicted horsepower values for muscle cars with an engine size of seven liters.

2. The 90% confidence interval quantifies the uncertainty in our estimation of how engine size influences horsepower in `muscle_cars`.

3. The 90% confidence interval provides insight into the variability of actual horsepower values relative to the predicted values.


In [None]:
car_statements = ...

In [None]:
grader.check("q2_6")

You're done with Homework 11!  

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)