# Introduction to Linear Regression



**Objectives**

- Explain what supervised learning is
- Explain what a regression problem is
- Explain what a model is
- Set up a DataFrame for modeling (without a train-test split)
- Explain what linear regression is doing at a high level
- Instantiate, fit, generate predictions from, and evaluate a linear regression model in `scikit-learn`
- Interpret the coefficients of a linear regression model




## Data
### Let's make a model for car prices. 🚙

#### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Data are from [Carvana](https://www.carvana.com/cars).

We'll start with a DataFrame of prices.

In [None]:
df_cars = pd.DataFrame({
    'price':
    [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990, 
     30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990, 21990, 20590, 22990, 19990],
})

In [None]:
df_cars.head()

In [None]:
# histogram of prices (pass the column, not the whole DataFrame)
plt.hist(df_cars['price'])

--- 
## The Null (Baseline) model ⭐️

If you had to guess the price of a new car, with no other information, what would you pick?
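One common answer is the mean of the prices we've already seen. Here is a minimal sketch, assuming the prices are stored in a NumPy array (the names `prices` and `baseline_guess` are just for illustration):

```python
import numpy as np

# Prices copied from df_cars above
prices = np.array([
    34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
    30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
    21990, 20590, 22990, 19990,
])

# The null (baseline) model predicts the same value for every car: the mean price
baseline_guess = prices.mean()
print(round(baseline_guess, 2))
```

Any model we build later should beat this guess to be worth the extra effort.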

## How could we improve our model?

### Let's add year as a predictor

In [None]:
year=[
    2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019, 
    2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017, 2018, 2017, 2014, 2018
]

In [None]:
df_cars['year']=year

In [None]:
df_cars.head()

In [None]:
#scatterplot 


#### Correlations

Correlation measures the strength of the linear relationship between two variables.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/800px-Correlation_examples2.svg.png)

Mathematically this is expressed as:

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x s_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}} $$
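To make the formula concrete, here is a rough sketch that computes $r$ by hand for year and price and compares it with NumPy's built-in `np.corrcoef`. The array names `year` and `price` and the helper names `dx`, `dy` are assumptions for this example; in the notebook you would more likely call `.corr()` on the DataFrame.

```python
import numpy as np

# Same values as the car data in this notebook
year = np.array([
    2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
    2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
    2018, 2017, 2014, 2018,
])
price = np.array([
    34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
    30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
    21990, 20590, 22990, 19990,
])

# Deviations from the means
dx = year - year.mean()
dy = price - price.mean()

# Pearson's r, using the second form of the formula above
r_by_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(year, price)[0, 1]

print(r_by_hand, r_numpy)
```

The two numbers should agree up to floating-point error.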

In [None]:
#examine the correlation


#### What does the relationship between year and price look like?

The newer the year, the higher the price.

#### The OLS regression line
Seaborn will plot the best fit line for us.

In [None]:
#compare the model to the baseline


#### Does that line fit the data better than our old line, which was just the mean?

Looks like it. We'll discuss evaluation metrics in a bit.

Let's look at how we make a regression line.

## Lines

This was the equation I learned for a line. Look familiar?

$$ \large y = mx + b$$

In data science it's usually written as:

$$ \large  y = \beta_0 + \beta_1 x_1 $$

### Errors

Our model isn't going to be perfect. Whatever our model doesn't capture is called the error, and is denoted by $\epsilon$ (epsilon).


$$ \large y = \beta_0 + \beta_1 x_1 + \epsilon $$

### OLS Regression Modeling

We have _x_ and we have _y_. That's our data. 

## **Our model is trying to figure out the best betas. 😀**

$$ \large y \approx \hat \beta_0 + \hat \beta_1 x_1 $$


$\hat \beta_0$ is the y-intercept that our model learns. It is the value of the line when $x_1 = 0$, that is, the point where the line crosses the y-axis.

$\hat \beta_1$ is the coefficient that we multiply by our $x_1$ variable. It's the slope. For every 1-unit increase in $x_1$, the predicted y changes by the value of $\hat \beta_1$.

$y$ is the ground truth of our target variable. 


$$ \large \hat y = \hat \beta_0 + \hat \beta_1 x_1 $$

When we have a model that has been fit with the data (the betas have been computed) we can plug in a new *x* value and solve for $\hat y$. 

### $\hat y$ is your model's prediction! 


___ 
### Let's fit an OLS regression model in scikit-learn.

### Step 1: Assemble our X and y variables

 We need an X matrix that is n-by-p.
- n = rows
- p = features

A feature just means a predictor column.

In the simple linear regression case, p = 1. We have one feature. Usually you'll have more than 1 feature. 

In [None]:
# X = year
X = df_cars[['year']]
X.shape

### Scikit-learn estimators expect a two-dimensional object.

Usually we have more than one predictor variable. Not here.

y is the outcome variable.

In [None]:
# y = price


#### What's the shape of y?

In [None]:
# shape of X
X.shape

In [None]:
# shape of y
y.shape

#### Why is the target variable a pandas Series or 1D numpy array? 

Scikit-learn supervised learning estimators expect a single output column. In general, an estimator predicts one value for each observation.
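The difference between the 2D `X` and the 1D `y` comes down to double versus single brackets in pandas. A small sketch, rebuilding the DataFrame here so that the block runs on its own:

```python
import pandas as pd

df_cars = pd.DataFrame({
    'price': [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
              30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
              21990, 20590, 22990, 19990],
    'year': [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
             2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
             2018, 2017, 2014, 2018],
})

X = df_cars[['year']]  # double brackets -> DataFrame (2D)
y = df_cars['price']   # single brackets -> Series (1D)

print(X.shape)  # (22, 1)
print(y.shape)  # (22,)
```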

### Step 2: Import our model class

In [None]:
from sklearn.linear_model import LinearRegression

### Step 3: Instantiate the model

#### What is `lr`?

### Step 4: Fit the model

#### What did we just do?

We called the `fit` method on the `lr` object and passed it X and y, in that order.

The `fit` method used linear algebra to find the estimates of $\beta_0$ and $\beta_1$ that minimize the sum of squared errors.
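Steps 3 and 4 can be sketched as follows. This is one possible version, not the only way to write it. The cross-check with `np.linalg.lstsq` is an addition for illustration: it solves the same least-squares problem on a design matrix with a column of ones, which is an assumption about how to reproduce the result by hand rather than a description of scikit-learn's internals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_cars = pd.DataFrame({
    'price': [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
              30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
              21990, 20590, 22990, 19990],
    'year': [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
             2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
             2018, 2017, 2014, 2018],
})
X = df_cars[['year']]
y = df_cars['price']

# Step 3: instantiate. `lr` is a LinearRegression object that has not seen data yet.
lr = LinearRegression()

# Step 4: fit. X comes first, then y.
lr.fit(X, y)

# Cross-check: least squares on a design matrix with columns [1, year]
A = np.column_stack([np.ones(len(X)), X['year'].to_numpy()])
betas, *_ = np.linalg.lstsq(A, y.to_numpy(), rcond=None)

print(lr.intercept_, lr.coef_[0])  # learned by scikit-learn
print(betas)                       # [intercept, slope] from NumPy
```

After fitting, the learned values live in attributes ending with an underscore: `lr.intercept_` and `lr.coef_`.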

## Step 5: Check our model weights
#### Take a peek at the model's intercept coefficient

In [None]:
#intercept


#### What does that mean?

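The intercept is the predicted price of a car from year 0, and the model says that car would be worth a large negative amount. Year 0 is far outside the range of our data, so the intercept isn't meaningful on its own. The model doesn't understand classic cars. 😉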

#### What does that coefficient mean ($\beta_1$)?

In [None]:
#coef



We now have the following model of reality:

$$\hat{y} = -3{,}051{,}883 + 1{,}527\,x$$

### ⭐️ For every one-year increase in year, the predicted price of the car increases by \$1,527 ⭐️

## Step 6: Make predictions

If we had new data points for year, we could pass them to the `predict` method to generate price predictions.

We don't have any new data, so let's just see what predictions our model would have made. This is the same as saying "Find the x value for a prediction on the plot and go up to our line. That value for y is our prediction."

We do this for all the X values.
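A short sketch of the prediction step, refitting the model here so the block is self-contained. The new data point (year 2016) and the name `X_new` are hypothetical, chosen only to show how `predict` is called on unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_cars = pd.DataFrame({
    'price': [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
              30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
              21990, 20590, 22990, 19990],
    'year': [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
             2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
             2018, 2017, 2014, 2018],
})
X = df_cars[['year']]
y = df_cars['price']
lr = LinearRegression().fit(X, y)

# Predictions for the cars we already have: pass X only, never y
y_pred = lr.predict(X)
print(type(y_pred), y_pred.shape)

# A hypothetical new data point (2016 is an illustrative value)
X_new = pd.DataFrame({'year': [2016]})
print(lr.predict(X_new))
```

Note that `predict` expects the same 2D layout and column name that `fit` saw, which is why the new point is wrapped in a one-column DataFrame.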

In [None]:
# predictions


#### Why don't we pass `y`?

We are trying to predict y!

#### What type of object is y_pred?

In [None]:
# check the type of y_pred


### Plot Predictions

In [None]:
#plot predictions


## Step 7: Score the predictions

### Mean squared error is a popular scoring metric. 
Lower is better. That's the case whenever "error" is in the metric name.

$$ MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2 $$

$$ = \frac{1}{n} \sum e_i^2 $$

### MSE by hand:

#### Create residuals (a.k.a. errors). The left-over values.

*y* is our ground truth. The actual values.

In [None]:
# the ground truth


In [None]:
# our model's predictions


In [None]:
# residuals


In [None]:
# examine residual histogram


Square the residuals. Then take the mean.
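The by-hand calculation might look like the sketch below, with scikit-learn's `mean_squared_error` used as a check. The variable names `residuals`, `mse_by_hand`, and `mse_sklearn` are illustrative choices.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df_cars = pd.DataFrame({
    'price': [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
              30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
              21990, 20590, 22990, 19990],
    'year': [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
             2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
             2018, 2017, 2014, 2018],
})
X = df_cars[['year']]
y = df_cars['price']
lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

# Residuals: ground truth minus prediction, one per car
residuals = y - y_pred

# Square each residual, then average
mse_by_hand = (residuals ** 2).mean()

# The same quantity from scikit-learn (ground truth first, predictions second)
mse_sklearn = mean_squared_error(y, y_pred)

print(mse_by_hand, mse_sklearn)
```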

#### Compute the MSE

In [None]:
# square the residuals aka square the errors


Let's check our answer with the result of the scikit-learn function that computes MSE for us.

In [None]:
# mse


In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# use mean_squared_error


### How does our model with year compare to our null model?

Our null model, also known as the baseline model, is just guessing the mean every time.

There are a bunch of ways to make a 1D array that's the same length as y, filled with the mean value.

`np.full_like` is a nice one. Check out the function signature.
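One thing to watch for: `np.full_like` copies the dtype of the array you give it, so integer prices would truncate the mean unless you pass `dtype=float`. A minimal sketch, assuming `y` is a plain NumPy array of the prices (in the notebook it is a pandas Series):

```python
import numpy as np

y = np.array([
    34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
    30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
    21990, 20590, 22990, 19990,
])

# Gotcha: the result inherits y's integer dtype, so the mean loses its decimals
truncated = np.full_like(y, y.mean())

# Passing dtype=float keeps the exact mean
y_null = np.full_like(y, y.mean(), dtype=float)

# MSE of the null model (this equals the population variance of y)
mse_null = ((y - y_null) ** 2).mean()

print(truncated[0], y_null[0], mse_null)
```

This baseline MSE is the number the regression model's MSE needs to beat.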

In [None]:
y.shape

In [None]:
# use np.full_like


In [None]:
# mean squared error of baseline


#### Which model fits the data better?

Our regression model wins!

## You made your first Linear Regression Model🎉

___ 
# Linear Regression Exercise with Electricity

Now it's your turn: build a linear regression model and a null model for electricity demand data. Ignore that there is potentially some time series component to the data.

## The Data
Data source: [here](https://www.rdocumentation.org/packages/fpp2/versions/2.3/topics/elecdemand)

The data consist of electricity demand for Victoria, Australia every half-hour in 2014. We have three columns:

* Total electricity demand (in gigawatts)
* Whether or not it is a workday (0/1)
* Temperature (Celsius)

In [None]:
elec = pd.read_csv('https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/elecdemand.csv')

In [None]:
elec.head()

#### We'll limit our focus to observations where the temperature was at least 15 degrees Celsius (59 °F)

In [None]:
#subset to temp >= 15


#### Plot temperature vs. demand


In [None]:
#scatter


### Step 1: Assemble our X and y variables

 We need an X matrix that is n-by-p (in this case, p = 1)
- n = rows
- p = features

X is the predictor variable. We are looking at temperature. 

y is the outcome variable

In [None]:
#x and y


### Step 2: Import our model class

In [None]:
# from sklearn.linear_model import LinearRegression  # already imported

### Step 3: Instantiate the model

### Step 4: Fit the model

## Step 5: Check our model weights

#### Interpret $\beta_1$


## Step 6: Make predictions

### Plot the predictions

## Step 7: Score the predictions with MSE

### Create the predictions for the "null model"

#### The null MSE

#### Does your OLS regression model fit the data better than a null model? ⚠

You've seen linear regression with a single predictor variable. That's called _simple linear regression_. Next time we will discuss multiple regression with many input variables.

#### The `DummyRegressor`

An easier way to set a baseline.
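As a rough illustration of the pattern, here is `DummyRegressor` applied to the car data from earlier in the notebook (not the electricity data, whose column names are not shown here). With `strategy='mean'` it ignores `X` and predicts the mean of `y` for every row.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

df_cars = pd.DataFrame({
    'price': [34990, 32590, 25990, 32590, 30990, 36990, 44990, 28990, 39990,
              30990, 31990, 28590, 15990, 21990, 35590, 27990, 21990, 24990,
              21990, 20590, 22990, 19990],
    'year': [2019, 2018, 2019, 2015, 2018, 2017, 2020, 2019, 2019,
             2014, 2019, 2019, 2010, 2018, 2018, 2019, 2014, 2017,
             2018, 2017, 2014, 2018],
})
X = df_cars[['year']]
y = df_cars['price']

# Instantiate: the 'mean' strategy always predicts the training mean
dummy = DummyRegressor(strategy='mean')

# Fit and predict use the same interface as LinearRegression
dummy.fit(X, y)
y_null = dummy.predict(X)

# Score the baseline with MSE
mse_null = mean_squared_error(y, y_null)

print(y_null[:3], mse_null)
```

Because it follows the same fit/predict interface, the baseline can be swapped in anywhere a real model is used.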

In [None]:
#import


In [None]:
#instantiate


In [None]:
#fit 


In [None]:
#predict


In [None]:
#score
