# Linear Regression Theory

This all started in the 1800s with a guy named Francis Galton. Galton was studying the relationship between parents and their children. In particular, he investigated the relationship between the heights of fathers and their sons.


What he discovered was that a man's son tended to be roughly as tall as his father. However, Galton's breakthrough was noticing that the son's height tended to be closer than his father's to the overall average height of all people.

Let's take Shaquille O'Neal as an example. Shaq is really tall: 7 ft 1 in (about 2.16 meters). If Shaq has a son, chances are he'll be pretty tall too. However, Shaq is such an anomaly that there is also a very good chance that his son will not be as tall as Shaq.

Turns out this is the case: Shaq's son is pretty tall (6 ft 7 in), but not nearly as tall as his dad. Galton called this phenomenon regression, as in "A father's son's height tends to regress (or drift towards) the mean (average) height." 

Let's take the simplest possible example: calculating a regression with only 2 data points. All we're trying to do when we calculate our regression line is draw a line that's as close to every dot as possible. With only two points, that line simply passes through both of them. For classic linear regression, or the "Least Squares Method", you only measure the closeness in the "up and down" (vertical) direction.
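To make the two-point case concrete, here is a minimal sketch using NumPy's `np.polyfit`. The two (father, son) height pairs are made up purely for illustration and are not Galton's data.

```python
import numpy as np

# Two made-up (father height, son height) pairs in inches; illustrative only.
x = np.array([66.0, 72.0])
y = np.array([67.0, 71.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# With only two points, the fitted line passes through both of them,
# so the vertical distances (residuals) are essentially zero.
residuals = y - (slope * x + intercept)
print(slope, intercept, residuals)
```

Once there are more than two points, the line generally cannot pass through all of them, which is where minimizing the vertical distances comes in.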

Now wouldn't it be great if we could apply this same concept to a graph with more than just two data points? 
 
By doing this, we could take multiple men and their sons' heights and do things like tell a man how tall we expect his son to be... before he even has a son.

Our goal with linear regression is to minimize the vertical distance between all the data points and our line. 

So in determining the best line, we are attempting to minimize the total distance between the data points and our line.

There are lots of different ways to measure this distance (sum of squared errors, sum of absolute errors, etc.), but all of these methods share the general goal of making it as small as possible.
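As a rough illustration of two of these measures, the sketch below scores one candidate line against a handful of points. Both the data and the candidate line `y = 1 + 2x` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up data points and a hypothetical candidate line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 4.5, 7.0])
predicted = 1.0 + 2.0 * x

# Vertical distance from each point to the line.
errors = y - predicted

sse = np.sum(errors ** 2)      # sum of squared errors
sae = np.sum(np.abs(errors))   # sum of absolute errors
print(sse, sae)
```

Squaring penalizes large misses more heavily than small ones, while absolute errors weigh every unit of distance equally.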

# The Linear Regression Equation

A simple linear regression model takes the following form:

**ŷ = β0 + β1(x)**

where:

+ ŷ: The predicted value for the response variable
+ β0: The mean value of the response variable when x = 0
+ β1: The average change in the response variable for a one unit increase in x
+ x: The value for the predictor variable
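For a single predictor, the least-squares values of β0 and β1 have a well-known closed form: β1 is the sum of the products of the x and y deviations from their means divided by the sum of squared x deviations, and β0 is the mean of y minus β1 times the mean of x. The sketch below applies this to a small made-up dataset. The variable names `beta0`, `beta1`, and `y_hat` are just illustrative.

```python
import numpy as np

# Small made-up dataset, used only to illustrate the formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: how x and y vary together, relative to how much x varies.
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: chosen so the line passes through (x_mean, y_mean).
beta0 = y_mean - beta1 * x_mean

# Predicted values from the fitted line.
y_hat = beta0 + beta1 * x
print(beta0, beta1, y_hat)
```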

# Linear Regression in Python

![supervised learning](Images/supervised_learning.png)
Supervised Learning Diagram

## Imports

In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Fetching Our Data

In [None]:
df = pd.read_csv('USA_Housing.csv')
df.head(3)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

## Exploratory Data Analysis

In [None]:
sns.pairplot(df)

In [None]:
sns.displot(df['Price'])

In [None]:
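# numeric_only=True skips any text columns (assumed here: the dataset may include one, such as an address),
# which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True)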

In [None]:
df.columns

## Split our Data

In [None]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

## Creating and Fitting our Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train, y_train) # train or fit model

## Analysing Our Model

In [None]:
print(lm.intercept_)

In [None]:
print(lm.coef_)

In [None]:
X_train.columns

In [None]:
cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])

In [None]:
cdf

**Avg. Area Income**: Holding all other features fixed, a one-unit increase in Avg. Area Income is associated with a $21.53 increase in house price.

**Avg. Area House Age**: Holding all other features fixed, a one-unit increase in Avg. Area House Age is associated with a $164,883.28 increase in house price.
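The "holding all other features fixed" reading follows directly from the form of a linear model. Here is a minimal sketch with a hypothetical intercept and coefficient vector (not the values fitted above): raising one feature by one unit changes the prediction by exactly that feature's coefficient.

```python
import numpy as np

# Hypothetical intercept and coefficients for a two-feature linear model.
intercept = 10.0
coef = np.array([2.0, -3.0])

def predict(features):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + features @ coef

base = np.array([5.0, 1.0])
# Increase the first feature by one unit and leave the second unchanged.
bumped = base + np.array([1.0, 0.0])

change = predict(bumped) - predict(base)
print(change)  # equals coef[0]
```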

## Making Predictions

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.scatter(y_test, predictions)

The points should fall close to a straight line. The more tightly they cluster around that line, the closer our predictions are to the actual prices.

Next, a histogram of the distribution of our residuals:

In [None]:
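# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(y_test - predictions, kde=True)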

The residuals should be roughly normally distributed and centered on zero if a linear model is a good fit for the data.

In [None]:
from sklearn import metrics

In [None]:
metrics.mean_absolute_error(y_test, predictions)

In [None]:
metrics.mean_squared_error(y_test, predictions)

In [None]:
np.sqrt(metrics.mean_squared_error(y_test, predictions))
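For reference, the three numbers above are the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). The sketch below computes them by hand with NumPy on small made-up arrays, as an illustration of the definitions rather than a replacement for `sklearn.metrics`.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average squared error; punishes large misses
rmse = np.sqrt(mse)             # square root of MSE, back in the units of y
print(mae, mse, rmse)
```

Because RMSE is expressed in the same units as the target (dollars, for house prices), it is often the easiest of the three to interpret.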