## Coach Lab Linear Regression

### Objectives
* Put together a linear regression model
* Understand the steps in modeling
* Evaluate a linear regression model


![caption](images/Model_Process_Part1.png)
![caption](images/Model_Process_Part2.png)

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#### For this lesson we will use the computer hardware dataset from https://www.mldata.io/dataset-details/computer_hardware/

In [None]:
comp = pd.read_csv('computer_hardware_dataset.csv')

### Project Question Formulation


What do we want to evaluate, explore, or answer?

In [None]:
comp.head()

In [None]:
comp.info()

In [None]:
comp.describe().T

### Step 1 - Train/Test Split
#### Target is PRP

In [None]:
# create y and X as the target and the features
y = comp['PRP']
X = comp.drop(columns=['PRP'])

In [None]:
#split data into test and train sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = .25)

In [None]:
#get shape of the training and test sets
X_train.shape, y_train.shape, X_test.shape, y_test.shape

### Step 2 - Exploratory Data Analysis

We will use only the training data for this part

#### As a first step in data cleaning, let's look for missing values
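
One possible way to check for missing values is to count them per column. The sketch below uses a small made-up frame named `toy_train` (a hypothetical stand-in for `X_train`; the column names and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train: column names and values are made up for illustration
toy_train = pd.DataFrame({
    "MYCT": [125, 29, np.nan, 26],
    "MMIN": [256, 8000, 8000, np.nan],
    "CACH": [256, 32, 32, 64],
})

# Count the missing values in each column
missing_counts = toy_train.isna().sum()
print(missing_counts)
```

In the notebook itself, the same call would be `X_train.isna().sum()`.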

#### Next let's look at the distribution of our variables

In [None]:
#create scatterplot matrix

#### Next let's look at the distribution of our target

In [None]:
#histogram of y_train
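
A histogram of the target can be sketched with matplotlib as follows. The values in `toy_target` are randomly generated for illustration and are only a hypothetical stand-in for `y_train`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up positive values standing in for y_train
rng = np.random.default_rng(0)
toy_target = rng.lognormal(mean=4.0, sigma=1.0, size=200)

# Plot the distribution of the target in 20 bins
toy_fig, toy_ax = plt.subplots()
counts, bin_edges, _ = toy_ax.hist(toy_target, bins=20)
toy_ax.set_xlabel("PRP")
toy_ax.set_ylabel("Count")
```

A long tail in a plot like this is one reason to consider transforming the target or looking for outliers.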

#### Do we have any categorical variables we need to encode?
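
If a text column such as a vendor name turns out to be worth keeping, one option is one-hot encoding. The sketch below assumes a hypothetical `vendor` column with made-up values and fits the encoder on the training rows only, so that categories seen for the first time in the test set are encoded as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; the values are made up
toy_train = pd.DataFrame({"vendor": ["ibm", "hp", "ibm", "dec"]})
toy_test = pd.DataFrame({"vendor": ["hp", "amdahl"]})  # "amdahl" is not in the training rows

# Learn the categories from the training data only
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train[["vendor"]])

# Transform both sets with the same fitted encoder
train_encoded = toy_encoder.transform(toy_train[["vendor"]]).toarray()
test_encoded = toy_encoder.transform(toy_test[["vendor"]]).toarray()
print(toy_encoder.categories_[0])
print(test_encoded)
```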

#### Next let's scale our data

Why do we do this? What does it mean for the comparability of our variables? How does it change the interpretation of the coefficients?
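
As a minimal sketch of standardization, the example below fits `StandardScaler` on toy training data and reuses the fitted scaler on a toy test row. The arrays are made up. After scaling, each training column has mean 0 and standard deviation 1, so a coefficient reads as the change in the target per one standard deviation of that feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (values are made up)
toy_train = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
toy_test = np.array([[400.0, 4.0]])

# Learn the mean and standard deviation from the training data only
toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)

# Apply the training statistics to the test data
test_scaled = toy_scaler.transform(toy_test)
print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.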

#### Outlier Removal
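
There are many ways to define an outlier. One common rule of thumb, shown here only as an assumption and not as the method this lab requires, drops rows that fall more than 1.5 interquartile ranges outside the first or third quartile. The frame `toy_train` and its values are made up.

```python
import pandas as pd

# Toy target values with one extreme point (values are made up)
toy_train = pd.DataFrame({"PRP": [20, 25, 30, 35, 40, 45, 1000]})

# Interquartile range of the column
q1 = toy_train["PRP"].quantile(0.25)
q3 = toy_train["PRP"].quantile(0.75)
iqr = q3 - q1

# Keep only rows inside the 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
toy_trimmed = toy_train[toy_train["PRP"].between(lower, upper)]
print(toy_trimmed)
```

Any such filter should be decided using the training data only.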

### Step 4 - Messy Model

#### First let's use statsmodels

In [None]:
#Linear regression using statsmodels


#### Now let's use sklearn

In [None]:
#initialize a linear regression model in sklearn

In [None]:
#fit linear model to training data
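
The two sklearn steps, initializing and fitting, might look like the sketch below. It runs on noise-free synthetic data so the recovered coefficients are easy to check; `toy_X`, `toy_y`, and `toy_linreg` are hypothetical names chosen to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 1 + 4*x0 - 2*x1
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))
toy_y = 1.0 + 4.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1]

# Initialize the model, then fit it to the training data
toy_linreg = LinearRegression()
toy_linreg.fit(toy_X, toy_y)

print(toy_linreg.intercept_, toy_linreg.coef_)
```

In the notebook, the fitted model is expected to be called `linreg`, since later cells call `linreg.predict(X_train)`.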

### Step 5 - Model Evaluation

In [None]:
#get summary statistics from statsmodels


In [None]:
#get r squared value from sklearn
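
For a fitted sklearn regressor, `score` returns the R-squared of the predictions. The sketch below, on synthetic data with hypothetical `toy_` names, also computes the same quantity by hand as one minus the ratio of residual to total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with some noise (values are made up)
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(80, 3))
toy_y = toy_X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=80)

toy_linreg = LinearRegression().fit(toy_X, toy_y)

# R-squared as reported by sklearn
r_squared = toy_linreg.score(toy_X, toy_y)

# The same value computed from its definition
residuals = toy_y - toy_linreg.predict(toy_X)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)
manual_r_squared = 1 - ss_res / ss_tot
print(r_squared, manual_r_squared)
```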


In [None]:
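#create formula for adjusted r squared
def adjusted_r_squared(r_squared, num_samples, num_regressors):
    return 1 - ((1-r_squared)*(num_samples - 1) / (num_samples - num_regressors - 1))

In [None]:
#calculate adjusted r squared on the training data (assumes linreg has been fit above)
adjusted_r_squared(linreg.score(X_train, y_train), X_train.shape[0], X_train.shape[1])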

#### What does the r-squared value mean? What do the r-squared values tell us about the fit of our model?  What about adjusted r-squared?
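
For reference, the adjusted R-squared function defined above corresponds to the following formula, where $n$ is the number of samples and $p$ is the number of regressors. The numbers in the worked line are made up purely to show the arithmetic.

$$
R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
$$

$$
R^2 = 0.90,\; n = 100,\; p = 5
\;\Rightarrow\;
R^2_{\text{adj}} = 1 - \frac{0.10 \times 99}{94} \approx 0.8947
$$

Because the penalty grows with $p$, adding a regressor raises adjusted R-squared only if it improves the fit by more than chance alone would be expected to.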

#### Now let's look at predictions of relative performance to compare to actual relative performance

In [None]:
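# y_train is passed as the index, so reset_index() turns it into the first column
linreg_results_df = pd.DataFrame(linreg.predict(X_train), y_train).reset_index()

In [None]:
# the target is PRP (relative performance), not a price
linreg_results_df.columns = ['Actual_PRP', 'Predicted_PRP']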

In [None]:
linreg_results_df.head()

## Let's use regularization to see if that helps our model fit

In [None]:
#importing Lasso and Ridge models from sklearn
from sklearn.linear_model import Lasso, Ridge

### Lasso Regularizer

In [None]:
#conduct lasso regression on training data

In [None]:
#Evaluation of lasso on training data

In [None]:
#examine coefficients from lasso
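
The three lasso steps above (fit, evaluate, inspect coefficients) could be sketched as follows. The data is synthetic and the penalty `alpha=0.5` is an arbitrary assumption, not a tuned value. Only the first two toy features drive the target, which makes it easy to see lasso setting the other coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5))
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit lasso on the (toy) training data
toy_lasso = Lasso(alpha=0.5)
toy_lasso.fit(toy_X, toy_y)

# Training R-squared and the fitted coefficients
toy_lasso_r_squared = toy_lasso.score(toy_X, toy_y)
print(toy_lasso_r_squared)
print(toy_lasso.coef_)
```

Lasso penalizes the absolute size of the coefficients, so it is usually applied to scaled features.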

### Ridge Regularizer

In [None]:
#conduct ridge regression on training data

In [None]:
#Evaluation of ridge on training data

In [None]:
#examine coefficients from ridge
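
A comparable ridge sketch is shown below, again on synthetic data with an arbitrary `alpha=10.0` chosen only for illustration. It fits plain least squares alongside ridge so the shrinkage of the coefficients is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (values are made up)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 4))
toy_y = toy_X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=60)

# Fit ordinary least squares and ridge on the same data
toy_ols = LinearRegression().fit(toy_X, toy_y)
toy_ridge = Ridge(alpha=10.0).fit(toy_X, toy_y)

# Compare training R-squared and coefficient sizes
print(toy_ols.score(toy_X, toy_y), toy_ridge.score(toy_X, toy_y))
print(toy_ols.coef_)
print(toy_ridge.coef_)
```

Ridge shrinks coefficients toward zero without typically making them exactly zero, and its training R-squared cannot exceed that of ordinary least squares on the same data.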

### Step 6 - Best Model Evaluation

In [None]:
#apply data cleaning process to test set

In [None]:
#run best model on test set
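
The final evaluation might follow the pattern below: every preprocessing step is fit on the training set, then applied unchanged to the test set before scoring. This is a sketch on synthetic data with hypothetical `toy_` names, and it assumes plain linear regression happened to be the best model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (values are made up)
rng = np.random.default_rng(42)
toy_X = rng.normal(loc=50, scale=10, size=(200, 3))
toy_y = toy_X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=1.0, size=200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42, test_size=0.25
)

# Fit the scaler and the model on the training data only
toy_scaler = StandardScaler().fit(toy_X_train)
toy_best_model = LinearRegression().fit(toy_scaler.transform(toy_X_train), toy_y_train)

# Apply the same fitted scaler to the test data, then score
toy_test_r_squared = toy_best_model.score(toy_scaler.transform(toy_X_test), toy_y_test)
print(toy_test_r_squared)
```

A test score far below the training score would suggest that the model has overfit the training data.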

### Step 7 - Model Interpretation

What takeaways do we have? What conclusions can we draw about our initial question? Who would care, and why?