# Linear Regression

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt


%matplotlib inline

## Introduction

**Regression Analysis** is a **parametric technique**, meaning that a set of parameters is used to **predict** the value of an unknown target variable (or dependent variable) 𝑦 based on one or more known input features (or independent variables, predictors), often denoted by 𝑥.

The term **linear** implies that the model describes the relationship between 𝑥 and 𝑦 with a straight line, which the data are assumed to follow at least approximately.

**Simple Linear Regression** uses a single feature (one independent variable) to model a linear relationship with a target (the dependent variable) by fitting an optimal model (i.e. the best straight line) to describe this relationship.

**Multiple Linear Regression** uses more than one feature to predict a target variable by fitting the best linear relationship.

## Simple linear regression

A straight line can be written as:
$$ 𝑦=𝛽_0+𝛽_1𝑥 $$

There are **4 key components**:
- A **dependent variable** that needs to be estimated or predicted (here: 𝑦)
- An **independent variable**, the input variable (here: 𝑥)
- The **slope**, which determines the steepness of the line, i.e. the change in 𝑦 for a one-unit increase in 𝑥 (here: $𝛽_1$)
- The **intercept**, which is the constant giving the value of 𝑦 when 𝑥 is 0 (here: $𝛽_0$)
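
To make the roles of the slope and the intercept concrete, here is a minimal sketch that evaluates a line for a few inputs. The function name `line` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def line(x, beta_0, beta_1):
    """Evaluate y = beta_0 + beta_1 * x for a scalar or array-like x."""
    return beta_0 + beta_1 * np.asarray(x, dtype=float)

# Hypothetical parameters chosen only for illustration
beta_0, beta_1 = 2.0, 0.5

x = np.array([0.0, 1.0, 2.0, 4.0])
y = line(x, beta_0, beta_1)

print(y)                        # the value at x = 0 is the intercept
print(np.diff(y) / np.diff(x))  # every ratio equals the slope
```

The first printed value corresponds to 𝑥 = 0 and therefore equals the intercept, while the change in 𝑦 per unit change in 𝑥 is the same everywhere on the line and equals the slope.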

When we fit a regression line to data, we use the following notation:

$$ 𝑦̂ =𝛽̂_0+𝛽̂_1𝑥 $$
 
The "hat" notation indicates that we are working with **estimates**: $𝛽̂_0$ and $𝛽̂_1$ are computed from the data, and 𝑦̂ is the value that the fitted line predicts.

### Steps

Calculate the following:

- The mean of the X values ($ \bar X $)
- The mean of the Y values ($ \bar Y $)
- The standard deviation of the X values ($ 𝑆_𝑋 $)
- The standard deviation of the Y values ($ 𝑆_𝑌 $)
- The Pearson correlation between X and Y (often denoted by the Greek letter "rho", 𝜌)
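
The five quantities listed above can be computed with NumPy as in the following sketch. The arrays `X` and `Y` are toy data invented for illustration, and using the sample standard deviation (`ddof=1`) is an assumption; any choice works for the slope as long as it is the same for both variables, because only the ratio of the two standard deviations is used.

```python
import numpy as np

# Toy data, invented for illustration only
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

x_bar = X.mean()               # mean of the X values
y_bar = Y.mean()               # mean of the Y values
s_x = X.std(ddof=1)            # sample standard deviation of X
s_y = Y.std(ddof=1)            # sample standard deviation of Y
rho = np.corrcoef(X, Y)[0, 1]  # Pearson correlation between X and Y

print(x_bar, y_bar, s_x, s_y, rho)
```

`np.corrcoef` returns the full 2×2 correlation matrix, so the off-diagonal entry `[0, 1]` is the correlation between the two variables.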

**Calculating Slope**

With the above ingredients in hand, we can calculate the slope (shown as $𝛽̂_1$ below) of the best-fit line, using the formula:

$$ 𝛽̂_1 =𝜌\frac{𝑆_Y}{𝑆_𝑋} $$
 
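This is the slope obtained by the **least-squares** method, which chooses the line that minimizes the sum of squared vertical distances between the observed points and the line.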
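
As a sketch of the slope formula above, the snippet below computes the slope from the correlation and the two standard deviations, then cross-checks it against `np.polyfit`, which performs a least-squares fit of a degree-1 polynomial. The data are toy values invented for illustration.

```python
import numpy as np

# Toy data, invented for illustration only
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

rho = np.corrcoef(X, Y)[0, 1]
s_x = X.std(ddof=1)
s_y = Y.std(ddof=1)

# Slope from the formula: correlation times the ratio of standard deviations
beta_1_hat = rho * s_y / s_x

# Cross-check: np.polyfit returns coefficients from highest degree down,
# so the first one is the slope of the least-squares line
slope_check, _ = np.polyfit(X, Y, deg=1)

print(beta_1_hat, slope_check)
```

Both values should agree up to floating-point error, which is a quick sanity check that the formula and a generic least-squares fit describe the same line.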

**Calculating the intercept**

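Now that we have the slope value ($𝛽̂_1$), we can substitute it back into our formula ($ 𝑦̂ =𝛽̂_0+𝛽̂_1𝑥 $) to calculate the intercept. The least-squares line always passes through the point of means ($ \bar X $, $ \bar Y $), so setting 𝑥 to $ \bar X $ and 𝑦̂ to $ \bar Y $ and solving for $𝛽̂_0$ gives: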

$$ 𝛽̂_0 = \bar Y −  𝛽̂_1 \bar X $$ 
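
Putting the slope and intercept formulas together gives a small fitting routine. This is a minimal sketch; the function name `fit_simple_linear_regression` is hypothetical and the data are toy values invented for illustration.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]
    beta_1_hat = rho * y.std(ddof=1) / x.std(ddof=1)  # slope
    beta_0_hat = y.mean() - beta_1_hat * x.mean()     # intercept
    return beta_0_hat, beta_1_hat

# Toy data, invented for illustration only
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

beta_0_hat, beta_1_hat = fit_simple_linear_regression(X, Y)

# The fitted line evaluated at the mean of X gives the mean of Y
print(beta_0_hat, beta_1_hat)
print(beta_0_hat + beta_1_hat * X.mean(), Y.mean())
```

The last line prints two values that should coincide, reflecting the fact that the intercept formula forces the line through the point of means.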

**Predicting from the model**

When you have a regression line with defined parameters for slope and intercept as calculated above, you can easily predict the  𝑦̂   (target) value for a new  𝑥  (feature) value using the estimated parameter values:

$$ 𝑦̂ = 𝛽̂_1 𝑥+ 𝛽̂_0 $$
 
Remember the difference between 𝑦 and 𝑦̂: 𝑦̂ is the value predicted by the fitted model, whereas 𝑦 holds the actual observed values of the variable (the ground-truth values) that were used to calculate the best fit.
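
The distinction between 𝑦 and 𝑦̂ can be seen in a short sketch that fits the line, predicts for the observed and for a new 𝑥, and compares predictions with the actual values. The helper `predict`, the new input `6.0`, and the data are assumptions made for illustration.

```python
import numpy as np

# Toy data, invented for illustration only
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 5], dtype=float)

# Estimate the parameters with the formulas from this section
rho = np.corrcoef(X, Y)[0, 1]
beta_1_hat = rho * Y.std(ddof=1) / X.std(ddof=1)
beta_0_hat = Y.mean() - beta_1_hat * X.mean()

def predict(x_new, beta_0, beta_1):
    """Return the predicted value(s) y_hat for new feature value(s)."""
    return beta_0 + beta_1 * np.asarray(x_new, dtype=float)

y_hat = predict(X, beta_0_hat, beta_1_hat)    # predictions for observed x
residuals = Y - y_hat                         # actual minus predicted
y_new = predict(6.0, beta_0_hat, beta_1_hat)  # prediction for an unseen x

print(y_hat)
print(residuals, residuals.sum())
print(y_new)
```

The differences between the actual and predicted values are the residuals; for a least-squares line with an intercept they sum to zero up to floating-point error.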