# Multiple Linear Regression
*[Stats Source](http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/)*
*[Code Source](http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/)*

Multiple linear regression predicts one response variable from two or more predictor (independent) variables.

Here are some assumptions we are making when we use this technique:
1. Linear relationship between the response variable and the predictor variables
2. Multivariate Normality
    * The residuals are normally distributed (the predictor variables themselves need not be)
3. No Multicollinearity
    * Predictor variables are not strongly correlated with each other
        * There should be little or no correlation between the predictors
4. Homoscedasticity
    * The variance of the residuals is roughly constant across all values of the predictor variables
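
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is only an illustration on synthetic data, not part of the original analysis. The helper name `vif` and the columns `a`, `b`, `c` are made up for this example, and the rule of thumb that a VIF above roughly 5 to 10 signals trouble is a convention rather than a hard threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all the other predictors
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-in predictors (hypothetical, not the auto data)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)   # both values should sit close to 1
print(vif_collinear)     # a and c should be strongly inflated, b should not
```

A VIF near 1 means a predictor is essentially uncorrelated with the rest, while large values mean its coefficient estimate will be unstable.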

We could fit this model with scikit-learn, but I think it does not report enough information about the analysis: it does not print a statistical summary with standard errors, p-values, and confidence intervals. You are definitely welcome to use it, but we will use `statsmodels.api` so that we can see a summary of our results.

In [20]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

auto_data = pd.read_csv("auto_data.csv")
print(auto_data.head())

    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0         130    3504          12.0   
1  15.0          8         350.0         165    3693          11.5   
2  18.0          8         318.0         150    3436          11.0   
3  16.0          8         304.0         150    3433          12.0   
4  17.0          8         302.0         140    3449          10.5   

   model year  origin                         car name  
0          70       1  "chevrolet chevelle malibu"      
1          70       1          "buick skylark 320"      
2          70       1        "plymouth satellite"       
3          70       1              "amc rebel sst"      
4          70       1               "ford torino"       


We are going to predict mpg from two predictor variables: horsepower and weight. 
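
Written out, the model being fit can be sketched as follows. The symbols $\beta_0, \beta_1, \beta_2$ (coefficients) and $\varepsilon_i$ (error term for car $i$) are notation introduced here for illustration; they do not appear in the original notebook.

$$
\text{mpg}_i = \beta_0 + \beta_1 \,\text{horsepower}_i + \beta_2 \,\text{weight}_i + \varepsilon_i
$$

The intercept $\beta_0$ is the reason a constant column is added to the predictors with `sm.add_constant` before fitting.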

In [21]:
# Predictor Variables
X = (auto_data[["horsepower", "weight"]]).astype(float)
X = sm.add_constant(X)

# Response Variable
Y = auto_data["mpg"]

# Fit the model and predict
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

# Print model summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.706
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     467.9
Date:                Sat, 17 Jun 2017   Prob (F-statistic):          3.06e-104
Time:                        15:39:12   Log-Likelihood:                -1121.0
No. Observations:                 392   AIC:                             2248.
Df Residuals:                     389   BIC:                             2260.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         45.6402      0.793     57.540      0.0
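
The summary reports both `R-squared` and `Adj. R-squared`. As a rough sketch of how the two relate, the usual adjustment penalizes R² for the number of predictors. The function name `adjusted_r_squared` is made up for this example; the inputs are the rounded values printed in the summary (R² = 0.706, 392 observations, 2 predictors), so the result only agrees with the reported 0.705 up to rounding.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read off the summary table; n - p - 1 = 389 matches "Df Residuals"
adj = adjusted_r_squared(0.706, 392, 2)
print(adj)
```

Because the penalty grows with the number of predictors, adding a column that explains little can lower the adjusted value even though plain R² never decreases.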

### Exercise:
Fit a model that predicts the mpg of these cars using 80% of the data. Then, use the remaining 20% of the data to test your model. Feel free to use any of the columns as features and to try any transformations of the data. Try to get as high an "Adjusted R squared" value as possible!

Don't forget to randomize your data before splitting!
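
As a hint for the mechanics only, here is one possible way to shuffle and split, sketched with NumPy on made-up stand-in data rather than the auto data. The arrays `X` and `y`, the seed, and the plain least-squares fit are all assumptions for illustration, not the intended solution.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the split is reproducible

# Hypothetical stand-in data: 392 rows, 2 predictors, 1 response
n_rows = 392
X = rng.normal(size=(n_rows, 2))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n_rows)

# Shuffle the row indices, then cut at the 80% mark
shuffled = rng.permutation(n_rows)
n_train = int(0.8 * n_rows)
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

# Fit ordinary least squares on the training rows only
A_train = np.column_stack([np.ones(n_train), X[train_idx]])
beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Score the fitted coefficients on the held-out rows
A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
resid = y[test_idx] - A_test @ beta
test_r2 = 1.0 - resid @ resid / np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
print(len(train_idx), len(test_idx), test_r2)
```

With a pandas DataFrame, the same index arrays can be used as `auto_data.iloc[train_idx]` and `auto_data.iloc[test_idx]`, and the training rows can then be passed to `sm.OLS` as in the cell above.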