# CS251: Data Analysis and Visualization

## Using SciPy's Least Squares Solver to perform Linear Regression

Spring 2021

Oliver W. Layton

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import lstsq

In [2]:
import data

ModuleNotFoundError: No module named 'data'

## Load in Maine COVID-19 case data

CSV filename: `maine_covid.csv`

In [3]:
covidData = data.Data('maine_covid.csv')
print(covidData)

-------------------------------
maine_covid.csv (290x8)
Headers:
  deathIncrease	hospitalizedCurrently	hospitalizedIncrease	inIcuCurrently	onVentilatorCurrently	positiveIncrease	totalTestResultsIncrease	totalTestsViralIncrease
-------------------------------
Showing first 5/290 rows.
3.0	91.0	3.0	25.0	9.0	104.0	37661.0	37964.0
2.0	92.0	3.0	24.0	11.0	91.0	91.0	0.0
0.0	94.0	0.0	25.0	10.0	148.0	148.0	0.0
2.0	101.0	8.0	28.0	10.0	110.0	110.0	0.0
4.0	100.0	7.0	27.0	9.0	160.0	160.0	0.0

-------------------------------


## 1. Predict number of people on ventilators from number of people in the ICU

Given the number of people in the ICU, can we predict the number of people on ventilators?

What is our independent variable?

In [None]:
# x var? = inIcuCurrently 

What is our dependent variable?

In [None]:
# y var? = onVentilatorCurrently

The linear regression equation is: $$A\vec{c} = \vec{y}$$

Let's set up:
- `A`: the data matrix.
- `y`: the dependent variable column vector.

In [4]:
x = covidData.select_data(['inIcuCurrently'])
y = covidData.select_data(['onVentilatorCurrently'])

Let's add an intercept to the data matrix, consistent with the linear regression model:

$$y = c_0 + c_1x_1$$

where $x_1$ is number of people in the ICU.
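
Written out row by row for $N$ data samples, this model corresponds to the matrix equation $A\vec{c} = \vec{y}$ with a column of ones in $A$. The superscript $(i)$ marking sample $i$ is notation introduced only for this illustration:

$$\begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(N)} \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$

Each row multiplies out to $c_0 \cdot 1 + c_1 x_1^{(i)} = y^{(i)}$, which is why the column of ones produces the intercept.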

In [6]:
A = np.hstack([np.ones([x.shape[0], 1]), x])
A.shape  

(290, 2)

Let's use SciPy's least squares solver to determine the unknown intercept $c_0$ and slope $c_1$ for us.

In [7]:
c, _, _, _ = lstsq(A, y)
c  # first entry is the intercept c0, second entry is the slope c1

array([[0.16237239],
       [0.39510366]])
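
For reference, `scipy.linalg.lstsq` returns four values: the solution, the residues, the effective rank of `A`, and the singular values of `A`. The sketch below names all four instead of discarding three of them. It uses small synthetic data made up for illustration (the `_demo` variables are not from the Maine dataset), chosen to lie exactly on a known line so the recovered coefficients are easy to check.

```python
import numpy as np
from scipy.linalg import lstsq

# Synthetic data lying exactly on y = 2 + 0.5x (illustrative only).
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 + 0.5 * x_demo

# Data matrix with a leading column of ones for the intercept.
A_demo = np.hstack([np.ones([x_demo.shape[0], 1]), x_demo])

# Unpack all four return values rather than discarding them.
c_demo, residues, rank, singular_vals = lstsq(A_demo, y_demo)

print(c_demo.ravel())  # approximately [2.0, 0.5]: intercept, then slope
print(rank)            # 2: both columns of A_demo are linearly independent
```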

Let's draw the regression line! We need:
- Evenly spaced sample points between the min and max independent variable values in the dataset ("x")
- **Predicted** dependent variable values according to the fitted regression model (using the coefficients we solved for with SciPy).
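
The two ingredients above can be sketched as follows. This is a minimal illustration on synthetic stand-in data (the `_demo` arrays and the names `line_x` and `line_y` are made up here), not the notebook's actual plotting cell. With the real data, `x`, `y`, and `c` from the cells above would take the place of the `_demo` variables.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import lstsq

# Synthetic stand-in data (illustrative only): a noisy linear trend.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 30, size=(50, 1))
y_demo = 0.4 * x_demo + 0.2 + rng.normal(0, 1, size=(50, 1))

# Fit the intercept and slope with the same setup as the cells above.
A_demo = np.hstack([np.ones([x_demo.shape[0], 1]), x_demo])
c_demo, _, _, _ = lstsq(A_demo, y_demo)

# Evenly spaced sample points between the min and max x values.
line_x = np.linspace(x_demo.min(), x_demo.max(), 100)
# Predicted y at each sample point: c0 + c1*x.
line_y = c_demo[0, 0] + c_demo[1, 0] * line_x

# Scatter the data and overlay the fitted line.
fig, ax = plt.subplots()
ax.scatter(x_demo, y_demo, label='data')
ax.plot(line_x, line_y, color='red', label='regression line')
ax.set_xlabel('x (independent variable)')
ax.set_ylabel('y (dependent variable)')
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a plain script, a call to `plt.show()` would be needed.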

Here are the mean sum-of-squares error (MSSE) and quality-of-fit ($R^2$) values:

MSSE: 128.73

R^2: 0.89
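
One way these two measures might be computed is sketched below on synthetic data. It assumes MSSE means the average of the squared residuals and that $R^2$ is defined as $1 - SS_{res}/SS_{tot}$; the notebook does not spell out its definitions, so treat both as assumptions. All variable names are illustrative.

```python
import numpy as np
from scipy.linalg import lstsq

# Synthetic stand-in data (illustrative only): a noisy linear trend.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 30, size=(50, 1))
y_demo = 0.4 * x_demo + 0.2 + rng.normal(0, 1, size=(50, 1))

# Fit intercept and slope.
A_demo = np.hstack([np.ones([x_demo.shape[0], 1]), x_demo])
c_demo, _, _, _ = lstsq(A_demo, y_demo)

# Residuals: actual minus predicted dependent variable values.
y_pred = A_demo @ c_demo
residuals = y_demo - y_pred

# Assumed definition of MSSE: mean of the squared residuals.
msse = np.mean(residuals ** 2)

# Assumed definition of R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MSSE: {msse:.2f}')
print(f'R^2: {r2:.2f}')
```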

## 2. Predict number of deaths from number of positive cases

Independent and dependent variables?

In [None]:
# ind var: 
# dep var: 

Let's follow the same steps to set this up, but for practice, let's swap the order of the slope and intercept terms in the linear regression model:

$$y = c_0x_0 + c_1$$

What is $x_0$ now?
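
Reordering the model amounts to reordering the columns of the data matrix, and the solver returns the coefficients in the same order as the columns. The sketch below illustrates this on synthetic data that lies exactly on a known line. The variable names are made up for this example.

```python
import numpy as np
from scipy.linalg import lstsq

# Synthetic data lying exactly on y = 3x + 1 (illustrative only).
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 * x_demo + 1.0
ones = np.ones([x_demo.shape[0], 1])

# Usual ordering: ones column first, so the intercept comes first.
A_intercept_first = np.hstack([ones, x_demo])
# Reordered: data column first, so the slope comes first.
A_slope_first = np.hstack([x_demo, ones])

c_a, _, _, _ = lstsq(A_intercept_first, y_demo)
c_b, _, _, _ = lstsq(A_slope_first, y_demo)

print(c_a.ravel())  # approximately [1.0, 3.0]: intercept, slope
print(c_b.ravel())  # approximately [3.0, 1.0]: slope, intercept
```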

Here are the mean sum-of-squares error (MSSE) and quality-of-fit ($R^2$) values:

MSSE: 27.26

R^2: 0.41

## 3. What association between variables are you interested in exploring?

Which variable are you trying to predict from another?

Independent and dependent variables?

In [None]:
# ind var: 
# dep var: 

Let's follow the same steps to set this up, but let's return to our usual ordering of the intercept and slope in the linear regression model:

$$y = c_0 + c_1x_1$$

What is $x_1$ now?

In [None]: