# Lab 3: Inference for Regression

In this lab, we explore a dataset that contains information about mothers and their babies' birth weights. We are interested in how these variables are related, and whether or not we can trust our regression lines.

In [4]:
# Run this cell, but please don't change it.

# These lines import the numpy and pandas modules
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

First, load the data and look at a few rows.

In [6]:
baby = pd.read_csv('baby.csv')
baby.head()

Unnamed: 0,Birth Weight,Gestational Days,Maternal Age,Maternal Height,Maternal Pregnancy Weight,Maternal Smoker
0,120,284,27,62,100,False
1,113,282,33,64,135,False
2,128,279,28,64,115,True
3,108,282,23,67,125,True
4,136,286,25,62,93,False


### 1. Could the True Slope Be 0? ###

Suppose we believe that our data follow the regression model, and we fit the regression line to estimate the true line. If the regression line isn’t perfectly flat, as is almost invariably the case, we will be observing some linear association in the scatter plot.

But what if that observation is spurious? In other words, what if the true line was flat – that is, there was no linear relation between the two variables – and the association that we observed was just due to randomness in generating the points that form our sample?

Here is a simulation that illustrates why this question arises. We will call the function ``draw_and_compare``, which takes in a true slope, intercept, and sample size, and generates data based on a normally distributed error. That is, given a true slope $m$ and intercept $b$, we generate random samples $\epsilon \sim N(0, \sigma)$ and generate data of the form
$$y = mx + b + \epsilon$$

Remember that the arguments to the function draw_and_compare are the slope and the intercept of the true line, and the number of points to be generated.
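As a rough, standalone illustration of the model above (separate from the lab's `draw_and_compare`), the sketch below generates a few samples from a flat true line and fits a least-squares line to each with `np.polyfit`. The helper name `simulate_fitted_slope`, the seed, the intercept of 10, and the sample size of 25 are assumptions chosen only for illustration.

```python
import numpy as np

def simulate_fitted_slope(true_slope, true_int, sample_size, rng):
    """Generate y = true_slope*x + true_int + eps and return the fitted slope."""
    x = rng.normal(50, 5, sample_size)
    eps = rng.normal(0, 6, sample_size)   # noise with mean 0 and SD 6
    y = true_slope * x + true_int + eps
    fitted_slope, fitted_int = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    return fitted_slope

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
fitted_slopes = [simulate_fitted_slope(0, 10, 25, rng) for _ in range(5)]
print(fitted_slopes)  # the true slope is 0, yet the fitted slopes are not exactly 0
```

Each sample comes from the same flat line, so any slope in the fitted lines is due to the random noise alone.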

In [None]:
# just run this cell
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(t, x, y):
    return np.mean(standard_units(t[x])*standard_units(t[y]))

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table[y])/np.std(table[x])

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table[y]) - a * np.mean(table[x])

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table[x] + b

def scatter_fit(table, x, y):
    plt.scatter(table[x], table[y], s=20)
    plt.plot(table[x], fit(table, x, y), lw=2, color='gold')
    plt.xlabel(x)
    plt.ylabel(y)
    
def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope*x + true_int) + eps
    tyche = pd.DataFrame(
        data = np.vstack([ x,y]).T, 
        columns = ["x", "y"]
    )
    
    plt.figure(figsize=(6, 16))
    plt.subplot(4, 1, 1)
    plt.scatter(tyche['x'], tyche['y'], s=20)
    plt.plot(xlims, true_slope*xlims + true_int, lw=2, color='green')
    plt.title('True Line, and Points Created')

    plt.subplot(4, 1, 2)
    plt.scatter(tyche['x'],tyche['y'], s=20)
    plt.title('What We Get to See')

    plt.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plt.xlabel("")
    plt.ylabel("")
    plt.title('Regression Line: Estimate of True Line')

    plt.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    plt.ylabel("")
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plt.plot(xlims, true_slope*xlims + true_int, lw=2, color='green')
    plt.title("Regression Line and True Line")

#### Question 1

Call this function a few different times, setting the true_slope to be zero. What do you observe?

In [None]:
...

You will notice that while the slope of the true line is 0, the slope of the regression line is typically not 0. The regression line sometimes slopes upwards, and sometimes downwards, each time giving us a false impression that the two variables are correlated.

To decide whether or not the slope that we are seeing is real, we would like to test the following hypotheses:

**Null Hypothesis.** The slope of the true line is 0. 

**Alternative Hypothesis.** The slope of the true line is not 0. 

To do this, we can construct a 95% confidence interval for the true slope, and then all we have to do is see whether the interval contains 0. 

If it doesn't, then we can reject the null hypothesis (with the 5% cutoff for the P-value). 

If the confidence interval for the true slope does contain 0, then we don't have enough evidence to reject the null hypothesis. Perhaps the slope that we are seeing is spurious.
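The decision rule just described can be written as a small helper. This is only a sketch: the function name `slope_test_decision` and the interval endpoints passed to it are hypothetical, not results from the baby data.

```python
def slope_test_decision(left, right):
    """Return the test conclusion from a 95% confidence interval for the slope."""
    if left <= 0 <= right:
        # 0 is a plausible value for the true slope
        return "fail to reject the null hypothesis"
    # 0 lies outside the interval
    return "reject the null hypothesis"

print(slope_test_decision(-0.1, 0.3))   # hypothetical interval that contains 0
print(slope_test_decision(0.38, 0.56))  # hypothetical interval entirely above 0
```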

Let's use this method in an example. Suppose we try to estimate the birth weight of the baby based on the mother's age. 

#### Question 2

Make a scatterplot of the birth weight vs. the mother's age and plot the least squares line on top. We've given you the code to find the slope and intercept of the least squares fit below.

In [3]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)  

def correlation(df, x, y):
    return np.mean(standard_units(df[x])*standard_units(df[y]))

def slope(df, x, y):
    r = correlation(df, x, y)
    return r * np.std(df[y])/np.std(df[x])

def intercept(df, x, y):
    a = slope(df, x, y)
    return np.mean(df[y]) - a * np.mean(df[x])


# Plot here
...


<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
baby_slope = slope(baby, 'Maternal Age', 'Birth Weight')
baby_intercept = intercept(baby, 'Maternal Age', 'Birth Weight')
xlims = np.array([min(baby['Maternal Age']), max(baby['Maternal Age'])])
baby.plot.scatter('Maternal Age', 'Birth Weight')
plt.plot(xlims, xlims*baby_slope + baby_intercept, color='gold')
</pre>
</details>

In [None]:
intercept(baby, 'Maternal Age', 'Birth Weight')

We notice that the regression line appears to have a positive slope. However, does this reflect the fact that the true line has a positive slope? To answer this question, let us see if we can estimate the true slope. We certainly have one estimate of it: the slope of our regression line. But had the scatter plot come out differently, the regression line would have been different and might have had a different slope. How do we figure out how different the slope might have been? 

We need another sample of points, so that we can draw the regression line through the new scatter plot and find its slope. But where will we get another sample?

You have guessed it – we will *bootstrap our original sample*. That will give us a bootstrapped scatter plot, through which we can draw a regression line.

### Bootstrapping the Scatter Plot ###
We can simulate new samples by random sampling with replacement from the original sample, as many times as the original sample size. Each of these new samples will give us a scatter plot. We will call that a *bootstrapped scatter plot*, and for short, we will call the entire process *bootstrapping the scatter plot*.
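A minimal sketch of one bootstrap resample, using a made-up DataFrame `toy` rather than the lab's `baby` table; the column names, the seed, and the distributions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Made-up data standing in for an original sample of 200 points.
toy = pd.DataFrame({"x": rng.normal(50, 5, 200), "y": rng.normal(120, 18, 200)})

# Draw as many rows as the original has, with replacement.
resample = toy.sample(n=len(toy), replace=True, random_state=1)

print(len(resample))             # same number of rows as the original
print(resample.index.nunique())  # fewer distinct rows: some repeat, some are left out
```

Because rows are drawn with replacement, the resample has the original size but contains repeated rows, which is what makes each bootstrapped scatter plot slightly different from the original.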


#### Question 3
Plot 4 different scatterplots of bootstrapped samples of the original dataset. These samples should be the same size as the original dataset, and you should sample with replacement. The pandas function `df.sample()` will be helpful here. What do you notice about these plots compared with the original?

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
for i in range(4):
    rep = baby.sample(baby.shape[0], replace=True)
    rep.plot.scatter('Maternal Age', 'Birth Weight')
    
The scatterplots look a bit sparser than the original. Because we sample with replacement, some points are drawn more than once and others are left out, so each plot shows fewer distinct points.
</pre>
</details>


#### Question 4

To create a confidence interval for the true slope, we can bootstrap the scatter plot a large number of times, and draw a regression line through each bootstrapped plot. Each of those lines has a slope. We can simply collect all the slopes and draw their empirical histogram.

Please create a histogram of 5000 bootstrapped slopes.

In [None]:
slopes = []
for i in range(5000):
    bootstrap_sample = ...
    ...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
slopes = []
for i in range(5000):
    bootstrap_sample = baby.sample(baby.shape[0], replace=True)
    bslope = slope(bootstrap_sample, 'Maternal Age', 'Birth Weight')
    slopes.append(bslope)
plt.hist(slopes)
</pre>
</details>




#### Question 5

Construct an approximate 95% confidence interval for the slope of the true line, using the bootstrap percentile method. The confidence interval should extend from the 2.5th percentile to the 97.5th percentile of the 5000 bootstrapped slopes. What does this interval tell us about the true slope?

In [None]:
left = ...
right = ...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
left = np.percentile(slopes, 2.5)
right = np.percentile(slopes, 97.5)

The 95% confidence interval contains 0, so we do not have enough evidence to conclude that the true slope is nonzero.
</pre>
</details>




Because the interval contains 0, we cannot reject the null hypothesis that the slope of the true linear relation between maternal age and baby's birth weight is 0. Based on this analysis, it would be unwise to predict birth weight based on the regression model with maternal age as the predictor.
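The percentile step can be sketched on made-up numbers. The array `fake_slopes` below is a stand-in for a list of bootstrapped slopes; its values are invented and are not the slopes from the baby data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for 5000 bootstrapped slopes; these values are made up.
fake_slopes = rng.normal(0.05, 0.05, 5000)

# The 2.5th and 97.5th percentiles bound the middle 95% of the values.
left, right = np.percentile(fake_slopes, [2.5, 97.5])
contains_zero = left <= 0 <= right
print(left, right, contains_zero)
```

Note that `np.percentile` takes the data first and the percentile second.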


### 2. Prediction Intervals ###
One of the primary uses of regression is to make predictions for a new individual who was not part of our original sample but is similar to the sampled individuals. In the language of the model, we want to estimate $y$ for a new value of $x$.

Our estimate is the height of the true line at $x$. Of course, we don't know the true line. What we have as a substitute is the regression line through our sample of points.

The **fitted value** at a given value of $x$ is the regression estimate of $y$ based on that value of $x$. In other words, the fitted value at a given value of $x$ is the height of the regression line at that $x$.

Suppose we try to predict a baby's birth weight based on the number of gestational days. The function ``fitted_value()`` (below) will compute a fitted value for `given_x` given the data. 

#### Question 1
Make a scatterplot of the baby's birth weight vs the number of gestational days, add the fitted line (as you did in the first question), and add a vertical line from the axis to the fitted value of `given_x = 300`.

In [None]:
def fitted_value(df, x, y, given_x):
    a = slope(df, x, y)
    b = intercept(df, x, y)
    return a * given_x  + b

# plot here
... # reuse code from before to plot the scatterplot with the new predictor
fit_300 = ... # call the fitted_value function to get the prediction for x = 300
plt.plot([300,300], [0, fit_300], color='red', lw=2)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
baby_slope = slope(baby, 'Gestational Days', 'Birth Weight')
baby_intercept = intercept(baby, 'Gestational Days', 'Birth Weight')
xlims = np.array([min(baby['Gestational Days']), max(baby['Gestational Days'])])
baby.plot.scatter('Gestational Days', 'Birth Weight')
plt.plot(xlims, xlims*baby_slope + baby_intercept, color='gold')
fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
plt.plot([300,300], [0, fit_300], color='red', lw=2)
</pre>
</details>

The fitted value at 300 gestational days is about 129.2 ounces. In other words, for a pregnancy that has a duration of 300 gestational days, our estimate for the baby's weight is about 129.2 ounces.

### The Variability of the Prediction ###

We have developed a method for making one prediction of a new baby's birth weight based on the number of gestational days, using the data in our sample. But as data scientists, we know that the sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction. To see how good our prediction is, we must get a sense of how variable the prediction can be.

To do this, we must generate new samples. We can do that by bootstrapping the scatter plot as in the previous section. We will then fit the regression line to the scatter plot in each replication, and make a prediction based on each line. 

#### Question 2

Complete the code to bootstrap the sample 10 different times and use the provided code to plot the 10 different fitted lines and their prediction for birth weight at 300 days.

In [None]:
x = 300

lines = []
for i in range(10):
    bootstrap_sample = ...
    a = ... #slope
    b = ... #intercept
    lines.append([a, b])

    
# this code will plot (no need to change)
lines = pd.DataFrame(lines, columns=['slope', 'intercept'])
lines['prediction at x='+str(x)] = lines['slope']*x + lines['intercept']

xlims = np.array([291, 309])
left = xlims[0]*lines['slope'] + lines['intercept']
right = xlims[1]*lines['slope'] + lines['intercept']
fit_x = x*lines['slope'] + lines['intercept']

for i in range(10):
    plt.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plt.scatter(x, fit_x[i], s=30)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
x = 300

lines = []
for i in range(10):
    rep = baby.sample(baby.shape[0], replace=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])
</pre>
</details>



The predictions vary from one line to the next. How do we determine how confident we are in the prediction?

### Bootstrap Prediction Interval ###

If we increase the number of repetitions of the resampling process, we can generate an empirical histogram of the predictions. This will allow us to create an interval of predictions, using the same percentile method that we used to create a bootstrap confidence interval for the slope.

Let us define a function called ``bootstrap_prediction`` to do this. The function takes five arguments:
- the name of the table
- the column labels of the predictor and response variables, in that order
- the value of $x$ at which to make the prediction
- the desired number of bootstrap repetitions

In each repetition, the function bootstraps the original scatter plot and finds the predicted value of $y$ based on the specified value of $x$. Specifically, it calls the function `fitted_value` that we defined earlier in this section to find the fitted value at the specified $x$.

Finally, it draws the empirical histogram of all the predicted values, and prints the interval consisting of the "middle 95%" of the predicted values. It also prints the predicted value based on the regression line through the original scatter plot.

#### Question 3
Complete the code below and run it to create a prediction interval for `new_x = 300` with 5000 repetitions.

In [None]:
def bootstrap_prediction(df, x, y, new_x, repetitions):
    
    # For each repetition:
    # Bootstrap the scatter; 
    # get the regression prediction at new_x; 
    # augment the predictions list
    predictions = []
    n = df.shape[0]
    for i in np.arange(repetitions):
        bootstrap_sample = ...
        bootstrap_prediction = ...
        predictions.append(bootstrap_prediction)
        
    # Find the ends of the approximate 95% prediction interval
    left = ...
    right = ...
    
    # Prediction based on original sample
    original = fitted_value(df, x, y, new_x)
    
    # Display results
    plt.hist(predictions)
    plt.xlabel('predictions at x='+str(new_x))
    plt.plot([left, right], [0,0], color='yellow', lw=8);
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print(left, right)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
def bootstrap_prediction(df, x, y, new_x, repetitions):
    
    # For each repetition:
    # Bootstrap the scatter; 
    # get the regression prediction at new_x; 
    # augment the predictions list
    predictions = []
    n = df.shape[0]
    for i in np.arange(repetitions):
        bootstrap_sample = df.sample(n, replace=True)
        bootstrap_prediction = fitted_value(bootstrap_sample, x, y, new_x)
        predictions.append(bootstrap_prediction)
        
    # Find the ends of the approximate 95% prediction interval
    left = np.percentile(predictions, 2.5)
    right = np.percentile(predictions, 97.5)
    
    # Prediction based on original sample
    original = fitted_value(df, x, y, new_x)
    
    # Display results
    plt.hist(predictions)
    plt.xlabel('predictions at x='+str(new_x))
    plt.plot([left, right], [0,0], color='yellow', lw=8);
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print(left, right)
    
bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)
</pre>
</details>

### The Effect of Changing the Value of the Predictor ###


#### Question 4
Create a prediction interval for `new_x = 285`. How is this interval different? Why is this?

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
The prediction interval is narrower. This is because 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the narrower width of the prediction interval.
</pre>
</details>
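The same pattern appears in a standard result from least-squares theory, stated here without derivation. It assumes that the regression model holds with noise SD $\sigma$ and treats the $n$ sampled values $x_1, \ldots, x_n$ (with mean $\bar{x}$) as fixed. Writing $\hat{y}(x)$ for the fitted value at $x$,

$$\mathrm{SD}\big(\hat{y}(x)\big) = \sigma \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

The term $(x - \bar{x})^2$ is 0 at the mean of the predictor and grows as $x$ moves away from it, so the fitted value is least variable near the center of the predictor's distribution.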

### Words of caution ###

All of the predictions and tests that we have performed in this chapter assume that the regression model holds. Specifically, the methods assume that the scatter plot resembles points generated by starting with points that are on a straight line and then pushing them off the line by adding random normal noise.

If the scatter plot does not look like that, then perhaps the model does not hold for the data. If the model does not hold, then calculations that assume the model to be true are not valid.

Therefore, we must first decide whether the regression model holds for our data, before we start making predictions based on the model or testing hypotheses about parameters of the model. A simple way is to do what we did in this section, which is to draw the scatter diagram of the two variables and see whether it looks roughly linear and evenly spread out around a line. We should also run the diagnostics we developed in the previous section using the residual plot.
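As a sketch of the residual check mentioned above, the code below fits a line to made-up data and computes the residuals. The data, seed, and coefficient values are assumptions for illustration and are not taken from the baby data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(280, 16, 300)               # made-up predictor values
y = 0.5 * x - 10 + rng.normal(0, 15, 300)  # made-up linear relation plus noise

slope_hat, intercept_hat = np.polyfit(x, y, 1)
residuals = y - (slope_hat * x + intercept_hat)  # observed minus fitted

# For a least-squares line, the residuals average to 0 and are uncorrelated with x.
print(np.mean(residuals))
print(np.corrcoef(x, residuals)[0, 1])
```

Plotting `residuals` against `x` (for example with `plt.scatter(x, residuals)`) gives the residual plot. If the model holds, the plot should show no trend and a roughly even spread around 0.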

### 3. Multiple Linear Regression ###

Now, suppose we are interested in predicting the birth weight based on both the age of the mother and the number of gestational days. To do this, we use multiple linear regression. In this exercise we will use a function we write ourselves to estimate the coefficients, and then compare with estimates from scikit-learn.
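With two predictors $x_1$ and $x_2$, the estimate has the form $b_0 + b_1 x_1 + b_2 x_2$, and least squares chooses the coefficients that minimize the mean squared error. The sketch below shows one way to compute them, using `np.linalg.lstsq` on made-up data. This is a different route from both the `minimize`-based function in this exercise and scikit-learn, and the data and true coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(27, 6, n)     # made-up first predictor
x2 = rng.normal(280, 16, n)   # made-up second predictor
y = 5 + 0.2 * x1 + 0.4 * x2 + rng.normal(0, 10, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coefs
print(b0, b1, b2)  # estimates of the intercept and the two coefficients
```

The estimated coefficients should land near the values used to generate the data (0.2 and 0.4 here), though not exactly on them because of the noise.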

#### Question 1
Fill in the code to finish your own function to minimize the least squares objective for multiple linear regression with 2 predictors.

In [None]:
def mlr2(df, X_cols, y_col):
    # df is the input dataframe
    # X_cols is a list of strings that contains the names of the two columns being used as predictors
    
    # define the response
    y = df[y_col]
    def rss(b0,b1,b2):
        estimate = ... #Fill this in!
        return (np.mean((y - estimate) ** 2))

    # the estimated values of the parameters
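    # minimize needs a function of a single parameter array and a starting guess (x0);
    # starting from all zeros is an arbitrary choice
    return minimize(lambda b: rss(*b), x0=np.zeros(3), method="CG")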


<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
estimate = b0 + b1*df[X_cols[0]] + b2*df[X_cols[1]]
</pre>
</details>

#### Question 2

Use this function to find the intercept and both coefficients for the multiple linear regression described above.

In [None]:
...

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
code: mlr2(baby, ['Maternal Age', 'Gestational Days'], 'Birth Weight') 
intercept=-15.78; coefficient for age=.154, coefficient for gestational days=.47
</pre>
</details>

#### Question 3

Now, run the following cell to use the scikit-learn LinearRegression to compute the same values. Are they the same? Discuss.

In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(baby[['Maternal Age', 'Gestational Days']], baby['Birth Weight'])

print('intercept:', lm.intercept_)
print('coefficients:', lm.coef_)