In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("automobile.csv")
df.head()

### Linear Regression and Multiple Linear Regression

#### Linear Regression

Let's start with Simple Linear Regression, a method that helps us understand the relationship between two variables:
- Predictor/independent variable (X)
- Response/dependent variable that we want to predict (Y)

Under this model, a change in the independent variable produces a proportional change in the dependent variable.
<li>
Linear function: Yhat = a + bX </li>
- a is the intercept of the regression line (the value of Yhat when X is 0)
- b is the slope of the regression line (the change in Yhat for a one-unit increase in X)
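
As a minimal sketch of how a and b can be estimated, the least-squares formulas are applied below to hypothetical toy data (the arrays `x_toy` and `y_toy` are made up for illustration and are not from the automobile dataset):

```python
import numpy as np

# Hypothetical toy data lying exactly on the line Y = 1 + 2X
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 1.0 + 2.0 * x_toy

# Least-squares estimates for Yhat = a + bX
x_dev = x_toy - x_toy.mean()
y_dev = y_toy - y_toy.mean()
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)  # slope
a = y_toy.mean() - b * x_toy.mean()             # intercept

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
```

Because the toy points sit exactly on a line, the estimates recover the intercept and slope used to generate them.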

In [None]:
# Load the module for LR
from sklearn.linear_model import LinearRegression

In [None]:
# Linear Regression estimator object
lm = LinearRegression()
lm

### How could highway-mpg help us predict car price?

In [None]:
X = df[['highway-mpg']]
Y = df['price']

In [None]:
# Fit the linear model using highway-mpg
lm.fit(X,Y)

In [None]:
# Output a prediction, the first 5 records
Yhat = lm.predict(X)
Yhat[0:5]

In [None]:
# Value of the intercept (a)?
lm.intercept_

In [None]:
# Value of the slope (b)?
lm.coef_

#### Final estimated linear model?
<li>
Yhat = a + bX</li>
price = 38423.31 - 821.73 x highway-mpg
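
To see how the estimated equation is used, here is a small sketch that plugs a value into it. The helper `predict_price` and the input of 30 mpg are hypothetical, chosen only for illustration; the two coefficients are copied from the equation above.

```python
# Coefficients copied from the fitted equation above
INTERCEPT = 38423.31   # a
SLOPE = -821.73        # b, change in price per additional highway mpg

def predict_price(highway_mpg):
    """Return the price predicted by Yhat = a + bX (hypothetical helper)."""
    return INTERCEPT + SLOPE * highway_mpg

# 30 mpg is an arbitrary illustrative input
print(round(predict_price(30), 2))
```

The negative slope means that each additional mile per gallon lowers the predicted price by about 821.73.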


In [None]:
# Now it's your turn to write the code.
# 1) Create a linear regression object
# 2) Train the model using 'engine-size' as the independent variable and 'price' as the dependent variable
# 3) Find the slope and intercept
# 4) What is the equation of the predicted line?













#### Multiple Linear Regression

What if we want to predict car price using more than one variable, as is usually the case in the real world?


$$
Y: Response \ Variable\\
X_1 :Predictor\ Variable \ 1\\
X_2: Predictor\ Variable \ 2\\
X_3: Predictor\ Variable \ 3\\
X_4: Predictor\ Variable \ 4\\
$$

$$
a: intercept\\
b_1 :coefficients \ of\ Variable \ 1\\
b_2: coefficients \ of\ Variable \ 2\\
b_3: coefficients \ of\ Variable \ 3\\
b_4: coefficients \ of\ Variable \ 4\\
$$

The equation is given by

$$
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$
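
The equation can be read as an intercept plus a dot product between the coefficients and the predictor values. The sketch below evaluates it with hypothetical numbers (the intercept, coefficients and rows are made up, not fitted to the automobile data):

```python
import numpy as np

# Hypothetical intercept and coefficients (not fitted to the automobile data)
a = 10.0
b = np.array([1.0, 2.0, 3.0, 4.0])   # b_1 ... b_4

# Two hypothetical observations, one per row, with columns X_1 ... X_4
X_rows = np.array([[1.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])

# Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4, evaluated for every row
Yhat_rows = a + X_rows @ b
print(Yhat_rows)
```

A fitted `LinearRegression` applies the same arithmetic, using `intercept_` for a and `coef_` for the b values.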

Let's use the following four predictor variables:
- Horsepower(watts)
- Curb-weight(mass)
- Engine-size (L or cc )
- Highway-mpg (miles per gallon, if converted to km: km/L)

In [None]:
Multi = df[['horsepower','curb-weight','engine-size','highway-mpg']]

In [None]:
lm.fit(Multi,df['price'])

In [None]:
lm.intercept_

In [None]:
lm.coef_

In [None]:
### What is the Linear equation??







### Model Evaluation

How do we choose the best model?

In [None]:
# Import seaborn, a visualization package, to analyze the correlation with regression plots

import seaborn as sns
%matplotlib inline



In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
plt.show()

The plot shows that price is slightly negatively correlated with peak-rpm. Pay attention to how the data points scatter around the regression line: this gives a good indication of the variance of the data and of whether a linear model is a good fit.
In this case the data points are widely spread around the fitted line, and it is hard to tell whether the points are decreasing or increasing.

#### How about we check the correlation values now?

In [None]:
# execute correlation of peak-rpm, highway-mpg and price

df[['peak-rpm','highway-mpg','price']].corr()

Highway-mpg has a stronger correlation with price (approximately -0.70) than peak-rpm does.
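
For reference, `corr()` reports the Pearson correlation coefficient by default. The sketch below computes it by hand on hypothetical toy data (the two arrays are made up for illustration) and compares the result with NumPy's built-in function:

```python
import numpy as np

# Hypothetical toy data: y_toy falls as x_toy rises, with some scatter
x_toy = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_toy = np.array([30.0, 24.0, 21.0, 14.0, 11.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = x_toy - x_toy.mean()
dy = y_toy - y_toy.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# np.corrcoef returns the full correlation matrix; [0, 1] is r between the two
r_numpy = np.corrcoef(x_toy, y_toy)[0, 1]
print(r_manual, r_numpy)
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one.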

#### Residual Plot

A residual plot is a good way to visualize the variance of the data.
<li>
    A residual is the difference between the observed value (Y) and the predicted value (Yhat)</li>
    If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data.
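
To make the idea concrete, the sketch below computes residuals for a straight line fitted to hypothetical curved data (y = x^2, chosen only for illustration). The residuals follow a clear pattern instead of scattering randomly, which is the warning sign described above.

```python
import numpy as np

# Hypothetical curved data: y = x^2, which a straight line cannot follow
x_curve = np.linspace(0, 10, 21)
y_curve = x_curve ** 2

# Fit a straight line, then compute residuals = observed - predicted
slope, intercept = np.polyfit(x_curve, y_curve, 1)
residuals = y_curve - (intercept + slope * x_curve)

# Residuals average to zero, yet are positive at both ends and negative in the middle
print(residuals[[0, 10, 20]])
```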

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()

<i> What is this plot telling us ? </i>
<p> The residuals are not randomly spread around the x-axis, which suggests that a non-linear model may be more appropriate for this data.</p>

#### Let's visualize the model of MLR(Multi Linear Regression).
We will look at the simplest view: a distribution plot of actual values vs. fitted values.

In [None]:
Y_pred = lm.predict(Multi)

In [None]:
plt.figure(figsize=(width, height))

# distplot is deprecated in recent seaborn releases; kdeplot draws the equivalent density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual")
sns.kdeplot(x=Y_pred, color="b", label="Fitted", ax=ax1)
plt.legend()

plt.title('Actual vs Fitted values for Price')
plt.xlabel('Price($)')
plt.ylabel('Proportion of cars')

plt.show()
plt.close()

#### Polynomial Regression

Polynomial regression is a special case of the general linear regression model: it captures a non-linear relationship by adding squared or higher-order terms of the predictor variables.

<p> For example: </p>

Quadratic (2nd order):

$$
Yhat = a + b_1 X + b_2 X^2
$$
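
The model is non-linear in X but still linear in the coefficients a, b_1 and b_2, so it can be fitted with ordinary least squares. The sketch below illustrates this on hypothetical data generated from Yhat = 1 + 2X + 3X^2 (the data and variable names are made up for illustration):

```python
import numpy as np

# Hypothetical data generated from Yhat = 1 + 2X + 3X^2
x_quad = np.linspace(-2, 2, 9)
y_quad = 1 + 2 * x_quad + 3 * x_quad ** 2

# Design matrix with columns [1, X, X^2]: the model is linear in a, b_1, b_2
A = np.column_stack([np.ones_like(x_quad), x_quad, x_quad ** 2])
coeffs, *_ = np.linalg.lstsq(A, y_quad, rcond=None)

# np.polyfit solves the same problem but lists the highest power first
print(coeffs, np.polyfit(x_quad, y_quad, 2)[::-1])
```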

In [None]:
x = df['highway-mpg']
y = df['price']

In [None]:
# Let's fit the polynomial of the 3rd order (cubic)

fit_3 = np.polyfit(x, y, 3)
predict_3 = np.poly1d(fit_3)
print(predict_3)


In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid of x values
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    # Observed points as dots, fitted curve as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ highway-mpg')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()


PlotPolly(predict_3, x, y, 'highway-mpg')

#### Let's use scikit-learn's PolynomialFeatures

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Create a polynomial features object
PolyReg = PolynomialFeatures(degree=2)
PolyReg

In [None]:
Multi_PolyReg = PolyReg.fit_transform(Multi)

In [None]:
Multi.shape

In [None]:
Multi_PolyReg.shape
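
The two shapes differ because a degree-2 transform adds a constant term, every squared term and every pairwise product. The sketch below enumerates those terms for the four predictors using only the standard library. It illustrates the count and is not how scikit-learn is implemented.

```python
from itertools import combinations_with_replacement

features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']

# Every product of up to 2 features; the empty product is the constant (bias) term
terms = [combo
         for degree in range(0, 3)
         for combo in combinations_with_replacement(features, degree)]

for combo in terms:
    print(' * '.join(combo) if combo else '1')

# 1 constant + 4 linear + 10 second-order terms
print(len(terms))
```

With the default `include_bias=True`, the transformed array should therefore have 15 columns for 4 input columns.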

#### Pipeline

A Pipeline simplifies data processing in scikit-learn by chaining the steps (here scaling, polynomial transform and model fitting) into a single estimator.
Let's see what this does.
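
As a rough sketch of what a pipeline saves us, the example below runs the three steps by hand and then through a `Pipeline`, and checks that both give the same predictions. The toy data and the names `X_toy`, `y_toy` and `toy_pipe` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical toy data: 20 rows, 2 predictors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
y_toy = 3 + X_toy[:, 0] - 2 * X_toy[:, 1] ** 2

# Manual version: the output of each step feeds the next
X_scaled = StandardScaler().fit_transform(X_toy)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)
manual_pred = LinearRegression().fit(X_poly, y_toy).predict(X_poly)

# Pipeline version: the same three steps behind one fit/predict call
toy_pipe = Pipeline([('scale', StandardScaler()),
                     ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
                     ('model', LinearRegression())])
pipe_pred = toy_pipe.fit(X_toy, y_toy).predict(X_toy)

print(np.allclose(manual_pred, pipe_pred))
```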

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# A pipeline is created from a list of tuples, each holding a step name and the corresponding transformer or estimator object

Pip_Input = [('scale', StandardScaler()),
             ('polynomial', PolynomialFeatures(include_bias=False)),
             ('model', LinearRegression())]


In [None]:
pipe = Pipeline(Pip_Input)
pipe

In [None]:
# Persist the pipeline with joblib's dump and load, a replacement for pickle
from joblib import dump, load

In [None]:
dump(pipe, 'polyreg.joblib')

In [None]:
# Load the saved file the next time you work on the polyreg project
pipe = load('polyreg.joblib')

In [None]:
# Pickle method (alternative to joblib)
#import pickle
#save = pickle.dumps(pipe)
#pipe2 = pickle.loads(save)
#pipe2

In [None]:
pipe.fit(Multi,y)

In [None]:
Ypipehat = pipe.predict(Multi)
Ypipehat[0:4]

#### In-sample Evaluation
Now let us quantify how accurate the model is.
So far we have only assessed the accuracy visually.

Important measures of a model's accuracy are:
- R^2 / R-squared: the proportion of the variation in the response that the model explains, often quoted as a percentage; it indicates how close the data are to the fitted regression line
- Mean Squared Error (MSE): the average of the squared differences between the actual and predicted values
- Root Mean Squared Error (RMSE): the square root of the MSE, expressed in the same units as the actual and predicted values
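
The three measures can be computed by hand, as in the sketch below. The `actual` and `predicted` lists are hypothetical values made up for illustration.

```python
from math import sqrt

# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

n = len(actual)
mean_actual = sum(actual) / n

# Sum of squared residuals (unexplained) and total sum of squares (total variation)
ss_res = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
ss_tot = sum((y - mean_actual) ** 2 for y in actual)

mse = ss_res / n
rmse = sqrt(mse)
r_squared = 1 - ss_res / ss_tot

print(mse, rmse, r_squared)
```

The `mean_squared_error` and `r2_score` functions used in the cells below follow the same definitions.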

#### Model 1: Simple Linear Regression

In [None]:
#Let's calculate the R^2:
lm.fit(X, Y)
print('R-square is: ', lm.score(X, Y))

We can say that ~49.65% of the variation in price is explained by this simple linear model, with highway-mpg as the predictor variable.

In [None]:
Yhat = lm.predict(X)
print('The first four predicted values:', Yhat[0:4])

In [None]:
#MSE
from sklearn.metrics import mean_squared_error
from math import sqrt

In [None]:
mse = mean_squared_error(df['price'], Yhat)
print('MSE of predicted price :',mse)
print('RMSE:',sqrt(mse))

#### Model 2: Multiple Linear Regression

In [None]:
# fit the model and execute R^2

lm.fit(Multi,df['price'])
print('R-squared  :', lm.score(Multi, df['price']))

In [None]:
# MSE and RMSE
Y_predict_Multifit = lm.predict(Multi)

print('MSE :', mean_squared_error(df['price'], Y_predict_Multifit))
print('RMSE :', sqrt(mean_squared_error(df['price'], Y_predict_Multifit)))


#### Model 3 : Polynomial Fit

In [None]:
from sklearn.metrics import r2_score

In [None]:
r_squared = r2_score(y, predict_3(x))
print('R-squared :', r_squared)

In [None]:
#MSE and RMSE
print('MSE:' , mean_squared_error(df['price'],predict_3(x)))
print('RMSE:' , sqrt(mean_squared_error(df['price'],predict_3(x))))

##### Which of the 3 models has the highest R-squared value?
##### Which of the 3 models has the lowest MSE and RMSE?

SLR using highway-mpg as the predictor variable of price:
- R^2 : 0.49
- MSE : 31635042.9

MLR with horsepower, curb-weight, engine-size and highway-mpg as predictor variables of price:
- R^2 : 0.80
- MSE : 11980366.8

Polynomial fit using highway-mpg as the predictor variable of price:
- R^2 : 0.67
- MSE : 20474146.4


Adding more variables often improves the model's predictions, but not always.
Hence we compare the R^2 and MSE values of the candidate models.
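
One caveat worth keeping in mind: for ordinary least squares, the in-sample R^2 cannot decrease when a predictor is added, even an irrelevant one, so a higher in-sample R^2 alone does not prove that the extra variable is useful. The sketch below illustrates this on hypothetical random data; the helper `in_sample_r2` is made up for this example.

```python
import numpy as np

def in_sample_r2(X, y):
    """R^2 of an ordinary least-squares fit with intercept (hypothetical helper)."""
    A = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coeffs
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Hypothetical data: y_sim depends on x1 only; 'noise' is unrelated to y_sim
rng = np.random.default_rng(42)
x1 = rng.normal(size=50)
noise = rng.normal(size=50)
y_sim = 2 * x1 + rng.normal(size=50)

r2_one = in_sample_r2(x1.reshape(-1, 1), y_sim)
r2_two = in_sample_r2(np.column_stack([x1, noise]), y_sim)

# The second value is at least as large as the first, despite the useless predictor
print(r2_one, r2_two)
```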

In [None]:
## Now, conclude which model is best??




