# Linear Regression Practice Problems

CBE 20258. Numerical and Statistical Analysis. Spring 2020.

© University of Notre Dame

In [1]:
# load libraries
import scipy.stats as stats
import numpy as np
import math
import matplotlib.pyplot as plt

## Learning Objectives

After studying this notebook and your lecture notes, you should be able to:
* Interpret the correlation coefficient
* Compute simple linear regression best fits
* Check linear regression error assumptions using residual analysis (plots)
* Compute residual standard error and covariance matrix for fitted parameters
* Assemble confidence intervals for fitted parameters

## Supplemental Exercise 7.5 (Navidi 2015)

A chemist is calibrating a spectrophotometer that will be used to measure the concentration of carbon monoxide (CO) in atmospheric samples. To check the calibration, samples of known concentration are measured. The true concentrations (x) and the measured concentrations (y) are given in the variables below. Because of random error, repeated measurements on the same sample will vary. The machine is considered to be in calibration if its mean response is equal to the true concentration. 

In [2]:
x = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]) ## True concentrations (ppm)
y = np.array([1, 11, 21, 28, 37, 48, 56, 68, 75, 86, 96])  ## Measured concentrations (ppm)

print("The true concentrations:\n", x)
print("The measured concentrations:\n", y)

The true concentrations:
 [  0  10  20  30  40  50  60  70  80  90 100]
The measured concentrations:
 [ 1 11 21 28 37 48 56 68 75 86 96]


To check the calibration, the linear model $y = \beta_0 + \beta_1 x + \epsilon$ is fit. Ideally, the value of $\beta_0$ should be 0 and the value of $\beta_1$ should be 1. 

a. Compute the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.
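
For reference, the closed-form estimates for the simple linear model and the equivalent normal-equation form can be written as follows. Here $X$ is assumed to be the feature matrix with a column of ones and a column of $x$ values, and $n$ is the number of data points.

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$

$$
\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}
$$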

In [3]:
### BEGIN SOLUTION
##There are multiple ways to solve this problem. One is to use the
##analytical equations provided. 
xbar = np.mean(x)
ybar = np.mean(y)

x_diff = x - xbar
y_diff = y - ybar

beta_hat_1 = x_diff.dot(y_diff)/x_diff.dot(x_diff)
beta_hat_0 = ybar - beta_hat_1*xbar

print("analytical beta_hat", beta_hat_0, beta_hat_1)

##The other is to use the generalized form and the normal equations.

#feature matrix

X = np.ones((len(x), 2))
X[:,1] = x

XXinv = np.linalg.inv(X.transpose().dot(X))
beta_hat = XXinv @ X.transpose() @ y
print("matrix beta_hat =",beta_hat)

### END SOLUTION

analytical beta_hat 0.8181818181818201 0.9418181818181818
matrix beta_hat = [0.81818182 0.94181818]
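
As an independent check on the two approaches above, the sketch below fits the same data with `np.polyfit` and computes the sample correlation coefficient with `np.corrcoef`. The names `x_cal`, `y_cal`, `slope`, `intercept`, and `r` are illustrative and not part of the original solution.

```python
import numpy as np

# Calibration data from the exercise (ppm)
x_cal = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
y_cal = np.array([1, 11, 21, 28, 37, 48, 56, 68, 75, 86, 96])

# A degree-1 polynomial fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x_cal, y_cal, 1)

# Sample correlation coefficient between x and y
r = np.corrcoef(x_cal, y_cal)[0, 1]

print("polyfit beta_hat =", intercept, slope)
print("correlation coefficient r =", r)
```

A value of `r` close to 1 indicates a strong positive linear association. It does not say whether the slope equals 1, which is why the coefficients are tested directly in parts b and c.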


b. Can you reject the null hypothesis $H_0: \beta_0 = 0$?

In [4]:
### BEGIN SOLUTION

##First find s. In order to do so, need to find residuals.

e = y - (beta_hat[0] + beta_hat[1]*x)

s = np.sqrt(e.dot(e)/(len(x)-2))

##Now find s for beta_0

variance_beta_hat = s**2*XXinv

s_beta_0 = np.sqrt(variance_beta_hat[0][0])

##Now find t-score

t_score_beta_0 = (beta_hat[0] - 0)/s_beta_0

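##The alternative hypothesis is two-sided (beta_0 != 0), so the P-value is the area in both tails.

#Calculate p-value

#print(len(x))
##abs() keeps the two-tailed area correct for either sign of the t-score
pvalue_beta_0 = 2*stats.t.cdf(-abs(t_score_beta_0), len(x)-2)

##P-value is about 0.23 > 0.05, so we cannot reject H0: beta_0 = 0.
print(pvalue_beta_0)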


print(variance_beta_hat)

### END SOLUTION

0.23469843169056898
[[ 4.12672176e-01 -5.89531680e-03]
 [-5.89531680e-03  1.17906336e-04]]
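
The quantities computed in the cell above correspond to the formulas below, written for $n$ data points and two fitted parameters. The symbol $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ denotes the $i$-th residual, and $s_{\hat{\beta}_0}$ is the square root of the first diagonal entry of the covariance matrix.

$$
s^2 = \frac{1}{n - 2} \sum_{i=1}^{n} e_i^2, \qquad \widehat{\mathrm{Cov}}(\hat{\boldsymbol{\beta}}) = s^2 (X^T X)^{-1}, \qquad t = \frac{\hat{\beta}_0 - 0}{s_{\hat{\beta}_0}}
$$

Under $H_0$, the statistic $t$ follows a Student's t distribution with $n - 2$ degrees of freedom, so the two-sided P-value is $2\,P(T_{n-2} \geq |t|)$.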


c. Can you reject the null hypothesis $H_0: \beta_1 = 1$?

In [5]:
### BEGIN SOLUTION

##Find s for beta_1

s_beta_1 = np.sqrt(variance_beta_hat[1][1])
#print(s_beta_1)

##Find t-score

t_score_beta_1 = (beta_hat[1] - 1)/s_beta_1
#print(t_score_beta_1)

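##Two-sided P-value; abs() keeps this correct for either sign of the t-score
pvalue_beta_1 = 2*stats.t.cdf(-abs(t_score_beta_1), len(x)-2)
##P-value is about 0.0005 < 0.05, so we reject H0: beta_1 = 1.
print(pvalue_beta_1)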

### END SOLUTION

0.00045739577879748416
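
An equivalent way to look at part c is to check whether a 95% confidence interval for $\beta_1$ contains 1. The following is a minimal, self-contained sketch under the same model assumptions; the names `x_cal`, `y_cal`, `cov_beta`, and `ci_beta_1` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Calibration data from the exercise (ppm)
x_cal = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
y_cal = np.array([1, 11, 21, 28, 37, 48, 56, 68, 75, 86, 96])
n = len(x_cal)

# Feature matrix: column of ones (intercept) and column of x values
X = np.column_stack((np.ones(n), x_cal))
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y_cal

# Residual variance with n - 2 degrees of freedom
resid = y_cal - X @ beta
s2 = resid @ resid / (n - 2)
cov_beta = s2 * XtX_inv

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, n - 2)
se_beta_1 = np.sqrt(cov_beta[1, 1])
ci_beta_1 = beta[1] + np.array([-1, 1]) * t_crit * se_beta_1

print("95% confidence interval for beta_1:", ci_beta_1)
print("Interval contains 1?", ci_beta_1[0] <= 1 <= ci_beta_1[1])
```

The interval lies entirely below 1, which agrees with the small P-value found above.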


d. Do the data provide sufficient evidence to conclude that the machine is out of calibration?

In [6]:
### BEGIN SOLUTION

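print("Yes. Part c rejects H0: beta_1 = 1 (P-value about 0.0005), so the mean response is not equal to the true concentration.")
### END SOLUTION

Yes. Part c rejects H0: beta_1 = 1 (P-value about 0.0005), so the mean response is not equal to the true concentration.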


e. Compute a 95% confidence interval for the mean measurement $\hat{y}$ when the true concentration is 20 ppm.

In [7]:
### BEGIN SOLUTION

t_score_95 = stats.t.ppf([0.025, 0.975], len(x)-2)

##Predicted mean measurement

y_20 = beta_hat[0]+ beta_hat[1]*20

#print(y_20)

#Find sy
#print(s_beta_0, s_beta_1)
s_y = np.sqrt(s_beta_0**2 + (20**2)*s_beta_1**2 +2*20*variance_beta_hat[0][1])

print(s_y)
#print(t_score_95)

interval = y_20+t_score_95*s_y
print("The 95% confidence interval for 20 ppm is ", interval)

### END SOLUTION

0.47330966456168216
The 95% confidence interval for 20 ppm is  [18.58384461 20.7252463 ]
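
The standard error `s_y` used above comes from propagating the covariance of the fitted parameters through $\hat{y}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$. Written for a generic point $x_0$ (a symbol introduced here for illustration):

$$
s_{\hat{y}}^2 = s_{\hat{\beta}_0}^2 + x_0^2 \, s_{\hat{\beta}_1}^2 + 2 x_0 \, \widehat{\mathrm{Cov}}(\hat{\beta}_0, \hat{\beta}_1) = s^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)
$$

The 95% confidence interval for the mean response is then $\hat{y}(x_0) \pm t_{n-2,\,0.025} \, s_{\hat{y}}$. Because the expression depends on $x_0$ only through $(x_0 - \bar{x})^2$, two points equally far from $\bar{x} = 50$, such as 20 ppm and 80 ppm, have the same standard error.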


f. Compute a 95% confidence interval for the mean measurement when the true concentration is 80 ppm.

In [8]:
### BEGIN SOLUTION

##Predicted mean measurement

y_80 = beta_hat[0]+ beta_hat[1]*80

#print(y_80)

#Find sy
#print(s_beta_0, s_beta_1)
s_y = np.sqrt(s_beta_0**2 + (80**2)*s_beta_1**2 +2*80*variance_beta_hat[0][1])

#print(s_y)
#print(t_score_95)

interval = y_80+t_score_95*s_y
print("The 95% confidence interval for 80 ppm is ", interval)

### END SOLUTION

The 95% confidence interval for 80 ppm is  [75.09293552 77.23433721]
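
Parts e and f repeat the same calculation at two concentrations, so it can be convenient to wrap it in a function. The sketch below is one possible way to do that; `mean_response_interval` and its arguments are hypothetical names introduced here, not part of the original solution.

```python
import numpy as np
import scipy.stats as stats


def mean_response_interval(x, y, x0, level=0.95):
    """Confidence interval for the mean response of a simple linear fit at x0."""
    n = len(x)
    # Feature matrix: column of ones (intercept) and column of x values
    X = np.column_stack((np.ones(n), x))
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y

    # Residual variance with n - 2 degrees of freedom
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)

    # Standard error of the fitted mean: sqrt(s^2 * x0_vec^T (X^T X)^-1 x0_vec)
    x0_vec = np.array([1.0, x0])
    se = np.sqrt(s2 * (x0_vec @ XtX_inv @ x0_vec))

    # Two-sided critical value and interval around the fitted mean
    t_crit = stats.t.ppf(0.5 + level / 2, n - 2)
    y0 = x0_vec @ beta
    return np.array([y0 - t_crit * se, y0 + t_crit * se])


# Calibration data from the exercise (ppm)
x_cal = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
y_cal = np.array([1, 11, 21, 28, 37, 48, 56, 68, 75, 86, 96])

interval_20 = mean_response_interval(x_cal, y_cal, 20)
interval_80 = mean_response_interval(x_cal, y_cal, 80)
print("95% confidence interval at 20 ppm:", interval_20)
print("95% confidence interval at 80 ppm:", interval_80)
```

The quadratic form `x0_vec @ XtX_inv @ x0_vec` is the matrix version of the variance-plus-covariance sum used in the cells above, so both approaches should give the same intervals.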


g. Someone claims that the machine is in calibration for concentrations near 20 ppm. Do these data provide sufficient evidence for you to conclude that this claim is false? Explain.

In [12]:
### BEGIN SOLUTION

print("No. The 95% confidence interval for the mean measurement at x = 20 from part e, [18.58, 20.73], contains 20.")
### END SOLUTION

No. The 95% confidence interval for the mean measurement at x = 20 from part e, [18.58, 20.73], contains 20.


## Supplemental Exercise 7.8 (Navidi 2015)

The rate of lipase production, y (in $\mu$mol per mL enzyme per minute), and the cell mass, x (in g/L), were measured. The results are given below:

In [13]:
x = np.array([4.5, 4.68, 5.4, 5.45, 4.2, 4.12, 4, 4.41, 3.98, 4.72, 3.41, 4.8, 3.6, 4.95, 3.25, 4.4, 3.65, 4.23, 4.1, 5.03, 
              4.19, 4.4, 3.92, 3.5, 4.15, 4.3, 4.9, 5.23, 5.4, 4.85, 5.1, 4.94]) ##The cell mass in g/L
y = np.array([2.06, 2.1, 3.15, 4.1, 2.2, 3.2, 2.85, 4.5, 2.1, 2.75, 2.8, 4.6, 2.5, 4.1, 2.15, 4.4, 2.2, 2.3, 2.4, 4.75, 3.15,
              3.9, 3.2, 2.1, 3.75, 3.15, 5.1, 5.04, 4.96, 5, 4.92, 4.98]) ##Lipase production in micromol per mL enzyme per minute


print("The cell mass:\n", x)
print("Lipase production:\n", y)


The cell mass:
 [4.5  4.68 5.4  5.45 4.2  4.12 4.   4.41 3.98 4.72 3.41 4.8  3.6  4.95
 3.25 4.4  3.65 4.23 4.1  5.03 4.19 4.4  3.92 3.5  4.15 4.3  4.9  5.23
 5.4  4.85 5.1  4.94]
Lipase production:
 [2.06 2.1  3.15 4.1  2.2  3.2  2.85 4.5  2.1  2.75 2.8  4.6  2.5  4.1
 2.15 4.4  2.2  2.3  2.4  4.75 3.15 3.9  3.2  2.1  3.75 3.15 5.1  5.04
 4.96 5.   4.92 4.98]


a. Compute the least-squares line for predicting lipase production from cell mass.

b. Compute 95% confidence intervals for $\beta_0$ and $\beta_1$.

c. In two experiments, the cell masses differed by 1.5 g/L. By how much do you estimate that their lipase production will differ?

d. Find a 95% confidence interval for the mean lipase production when the cell mass is 5.0 g/L.

e. Can you conclude that the mean lipase production when the cell mass is 5.0 g/L is less than 4.4? Explain. 

In [14]:
### BEGIN SOLUTION


### Part a. Use matrix formulation

X = np.ones((len(x), 2))
X[:,1] = x

XXinv = np.linalg.inv(X.transpose().dot(X))
beta_hat = XXinv @ X.transpose() @ y
print("matrix beta_hat =",beta_hat)

### Part b. 95% confidence interval

t_score_95 = stats.t.ppf([0.025, 0.975], len(x)-2)

## First calculate s_B_0 and s_B_1

#Find s

e = y - (beta_hat[0] + beta_hat[1]*x)

s = np.sqrt(e.dot(e)/(len(x)-2))

##Now find covariance matrix

variance_beta_hat = s**2*XXinv

s_beta_0 = np.sqrt(variance_beta_hat[0][0])
s_beta_1 = np.sqrt(variance_beta_hat[1][1])

beta_0_interval = beta_hat[0]+t_score_95*s_beta_0
beta_1_interval = beta_hat[1]+t_score_95*s_beta_1

print("The intervals for beta0 and beta1 are: ", beta_0_interval, beta_1_interval)


##Part c. 

##Changing x by 1.5, how much does y change? Multiply the slope, beta_1 by 1.5

print("The change is: ", beta_hat[1]*1.5)

##Part d. 

#Find 95% interval for the mean when cell mass is 5 g/L.

y_5 = beta_hat[0] + beta_hat[1]*5

s_y = np.sqrt(s_beta_0**2 + (5**2)*s_beta_1**2 +2*5*variance_beta_hat[0][1])

interval_y_5 = y_5 + t_score_95*s_y

print("The interval for mean lipase production: ", interval_y_5)


##Part e. 

##Null hypothesis: mean lipase production is >= 4.4 when cell mass is 5 g/L.
##Alternative: mean lipase production is < 4.4, so the P-value is the area in the left tail.

#Calculate t-score
t_score = (y_5 - 4.4)/s_y

p_value_lipase = stats.t.cdf(t_score, len(x)-2)

print("The P-value is: ", p_value_lipase, "Therefore, we cannot conclude that the mean is less than 4.4.")

### END SOLUTION

matrix beta_hat = [-2.17086563  1.26924168]
The intervals for beta0 and beta1 are:  [-4.36872445  0.02699319] [0.77750867 1.76097468]
The change is:  1.903862516396653
The interval for mean lipase production:  [3.77059915 4.58008637]
The P-value is:  0.13297328711115267 Therefore, we cannot conclude that the mean is less than 4.4.
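
As a cross-check on parts a and b, the fit can also be obtained from `scipy.stats.linregress`, which returns the slope, the intercept, and the standard error of the slope. This is a sketch rather than the solution's own method; the names `cell_mass`, `lipase`, `fit`, and `ci_slope` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Data from the exercise: cell mass (g/L) and lipase production rate
cell_mass = np.array([4.5, 4.68, 5.4, 5.45, 4.2, 4.12, 4, 4.41, 3.98, 4.72,
                      3.41, 4.8, 3.6, 4.95, 3.25, 4.4, 3.65, 4.23, 4.1, 5.03,
                      4.19, 4.4, 3.92, 3.5, 4.15, 4.3, 4.9, 5.23, 5.4, 4.85,
                      5.1, 4.94])
lipase = np.array([2.06, 2.1, 3.15, 4.1, 2.2, 3.2, 2.85, 4.5, 2.1, 2.75,
                   2.8, 4.6, 2.5, 4.1, 2.15, 4.4, 2.2, 2.3, 2.4, 4.75, 3.15,
                   3.9, 3.2, 2.1, 3.75, 3.15, 5.1, 5.04, 4.96, 5, 4.92, 4.98])

# Simple linear regression of lipase production on cell mass
fit = stats.linregress(cell_mass, lipase)

# 95% confidence interval for the slope using n - 2 degrees of freedom
n = len(cell_mass)
t_crit = stats.t.ppf(0.975, n - 2)
ci_slope = fit.slope + np.array([-1, 1]) * t_crit * fit.stderr

print("intercept, slope:", fit.intercept, fit.slope)
print("95% confidence interval for beta_1:", ci_slope)
```

The printed slope, intercept, and slope interval should match the matrix-based results above to within rounding.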
