## Multiple Linear Regression

Now that we have looked at linear regression with just two variables, let's look at regression with more.

**Note: most of the contents of this notebook were created after following along with 
[this video](https://www.youtube.com/watch?v=dQNpSa-bq4M&list=PLIeGtxpvyG-IqjoU8IiF0Yu1WtxNq_4z-)**
 
Let's start by loading the common libraries.

In [1]:
import matplotlib.pyplot as plt

%matplotlib inline

Now let's load our dataset.

In [2]:
import os
import json
import random

from pprint import pprint

data_dir = os.path.join('..', 'raw_data', 'iris')
dataset_filename = os.path.join(data_dir, 'iris_data.json')

with open(dataset_filename, 'r') as iris_file:
    iris_data = json.load(iris_file)
        
#pprint(iris_data[:1])
#[{'petal_length': 1.4,
#  'petal_width': 0.2,
#  'sepal_length': 5.1,
#  'sepal_width': 3.5,
#  'species': 'Iris-setosa'}]

## Data Analysis

In our previous notebook we looked at how the petal length affects the petal width. For this model we are going to add the sepal length to see how that affects the petal width.

* Independent Variables:
  - petal_length
  - sepal_length
* Dependent Variable:
  - petal_width
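
As a rough sketch, the variables listed above could be pulled out of the loaded records as shown below. The `sample_rows` list is a stand-in for `iris_data` so the cell runs on its own: its first record is the one shown in the loading cell, and the other two are illustrative values. The helper name `split_variables` is hypothetical.

```python
# Stand-in for iris_data: a few records in the same shape as the JSON file.
# Only the first record comes from the notebook; the rest are illustrative.
sample_rows = [
    {'petal_length': 1.4, 'petal_width': 0.2,
     'sepal_length': 5.1, 'sepal_width': 3.5, 'species': 'Iris-setosa'},
    {'petal_length': 4.7, 'petal_width': 1.4,
     'sepal_length': 7.0, 'sepal_width': 3.2, 'species': 'Iris-versicolor'},
    {'petal_length': 6.0, 'petal_width': 2.5,
     'sepal_length': 6.3, 'sepal_width': 3.3, 'species': 'Iris-virginica'},
]

independent_keys = ['petal_length', 'sepal_length']
dependent_key = 'petal_width'


def split_variables(rows, x_keys, y_key):
    """Return (X, y): X holds one feature row per record, y the targets."""
    X = [[row[key] for key in x_keys] for row in rows]
    y = [row[y_key] for row in rows]
    return X, y


X, y = split_variables(sample_rows, independent_keys, dependent_key)
```

With the real data, `split_variables(iris_data, independent_keys, dependent_key)` would give one feature row per flower.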

One important point is that adding more independent variables does not by itself make a model better, and it can lead to predictions that are actually worse in the long run.  This is called **OVERFITTING**.

Also, more independent variables create more relationships among them. So not only are the independent variables potentially related to the dependent variable, they are also potentially related to each other.  When this happens it is called **MULTICOLLINEARITY**.

The ideal world is for the independent variables to be related with the dependent variable, but not with each other.  

**This is why using sepal_width as well as sepal_length could be bad: if the two are related to each other, they introduce multicollinearity.**

Part of the "art" of linear regression is picking which independent variables add value in explaining the dependent variable and which ones don't.
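
One simple way to get a feel for multicollinearity is to check how strongly two candidate independent variables are correlated with each other. The sketch below computes the Pearson correlation coefficient by hand. The helper `pearson_r` and the short lists of measurements are illustrative assumptions, not values taken from the notebook's dataset.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)


# Illustrative measurements in cm (assumed values, one entry per flower).
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
sepal_length = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]

# A value near +1 or -1 means the two independent variables move together,
# which is a warning sign for multicollinearity.
r = pearson_r(petal_length, sepal_length)
print('correlation between petal_length and sepal_length:', round(r, 3))
```

A coefficient close to 0 suggests the two variables carry mostly separate information, which is the ideal case described above.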

### Multiple Regression Equations

* Multiple Regression Model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon$$

* Multiple Regression Equation
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p$$

* Estimated Regression Equation
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p$$

When an equation has multiple variables, it is important to understand how each one relates to the result.  Take the example equation below.

$$\hat{y} = 1.6 + 3.2 x_1 + 4.1 x_2$$

given:

- $x_1$ is petal length (cm)
- $x_2$ is sepal length (cm)

We can interpret the relationship each variable has to the petal width ($\hat{y}$) by taking a single variable and stating the change in $\hat{y}$ that takes place when it changes by one unit and all other variables are held constant.

So the predicted petal width increases by 3.2 cm when the petal length ($x_1$) increases by 1 cm and the sepal length ($x_2$) is held constant.
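
The same interpretation can be checked numerically. The sketch below plugs values into the example equation and changes one variable at a time. The function name `predict_petal_width` and the starting measurements are arbitrary choices for illustration, and the coefficients are the example ones from the text, not fitted values.

```python
def predict_petal_width(petal_length, sepal_length):
    """Example equation from the text: y_hat = 1.6 + 3.2*x1 + 4.1*x2."""
    return 1.6 + 3.2 * petal_length + 4.1 * sepal_length


# Arbitrary starting point (cm), chosen only for illustration.
base = predict_petal_width(4.0, 6.0)

# Increase one variable by 1 cm while holding the other constant.
longer_petal = predict_petal_width(5.0, 6.0)
longer_sepal = predict_petal_width(4.0, 7.0)

petal_effect = longer_petal - base  # matches the coefficient on x1
sepal_effect = longer_sepal - base  # matches the coefficient on x2

print('effect of +1 cm petal length:', round(petal_effect, 2))
print('effect of +1 cm sepal length:', round(sepal_effect, 2))
```

Whatever starting point is used, the difference equals the coefficient of the variable that changed, because the other term cancels out.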

## Finding $\beta$ 

In multiple linear regression each variable (feature) is given a coefficient that is used to calculate the
dependent variable.  Our equation looks something like the one below, where $x_0$ is always 1 so that $\beta_0$ acts as the intercept.

$$y = \beta_0x_0 + \beta_1x_1 + ... + \beta_px_p + \epsilon$$

Each coefficient measures how much the dependent variable changes for a one-unit change in its variable, given that the other variables are held constant.
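
One common way to estimate the coefficients is ordinary least squares. The sketch below uses NumPy's `np.linalg.lstsq`, which is an assumption on our part and not necessarily the approach this notebook follows. The data is synthetic: the targets are generated without noise from the example equation $\hat{y} = 1.6 + 3.2 x_1 + 4.1 x_2$ used earlier, so the fit should simply recover those coefficients.

```python
import numpy as np

# Illustrative feature values in cm (assumed, not read from the dataset).
petal_length = np.array([1.4, 1.4, 4.7, 4.5, 6.0, 5.1])
sepal_length = np.array([5.1, 4.9, 7.0, 6.4, 6.3, 5.8])

# Synthetic, noise-free targets built from the example equation in the text.
y = 1.6 + 3.2 * petal_length + 4.1 * sepal_length

# Design matrix: a leading column of ones plays the role of x_0,
# so the first coefficient is the intercept.
X = np.column_stack([np.ones_like(petal_length), petal_length, sepal_length])

# Least-squares solution minimizing the sum of squared residuals.
coefficients, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coefficients

print('b0, b1, b2 =', np.round(coefficients, 3))
```

With real measurements the targets contain noise, so the estimated coefficients would only approximate the underlying relationship.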

Much like single variable linear regression we need to determine the residual 
