# Linear Regression (OLS)

https://www.statsmodels.org/stable/regression.html
https://www.wikiwand.com/en/Linear_regression
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/440px-Linear_regression.svg.png?raw=true)

Very good free online resources on Python:
* http://interactivepython.org/courselib/static/thinkcspy/index.html
* http://www.diveintopython3.net/index.html
* https://www.python-course.eu/python3_course.php
* https://www.python.org/about/gettingstarted/
* https://wiki.python.org/moin/BeginnersGuide/Programmers

Author @Sergiu Buciumas
email: sbuciuma@students.kennesaw.edu

# 5. How to Implement Linear Regression using statsmodels and sklearn

For this task we use the merged CLEAN1B, 1B, and 1C data; goodbad is not taken into consideration:

In [None]:
%matplotlib inline

from sklearn import linear_model, metrics
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import math, scipy, numpy as np
from scipy import linalg
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number
sns.set(style="darkgrid")


In [None]:
# Check the version of the packages

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)
print('Numpy version ' + np.__version__)
print('Seaborn version ' + sns.__version__)

In [None]:
url_data = "https://raw.githubusercontent.com/sb0709/bootcamp_KSU/master/Data/data_reg.csv"
df = pd.read_csv(url_data, sep=',')  # to use the first column as the index, add index_col=0
data = df.copy()  # create the DataFrame "data" as a copy of "df"

In [None]:
# Check the column names
print(list(df.columns))
print()
print(list(data.columns))

# Let's fit a simple (single-variable) linear regression model

In [None]:
y = df['RBAL']
X = df['CRELIM']

X = X.values
y = y.values

In [None]:
# split the dataset into train and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Plot outputs
plt.scatter(X_test, y_test,  color='green')
plt.title('Test Data')
plt.xlabel('Credit Limit')
plt.ylabel('RBAL')
plt.xticks(())
plt.yticks(())
 
plt.show()
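
The idea behind a 70/30 split can be sketched with NumPy alone: shuffle the row indices reproducibly, then cut at 70%. This is only an illustration on made-up numbers, not how `train_test_split` is implemented, and the exact rows it selects will differ.

```python
import numpy as np

# Ten made-up samples, for illustration only (not from data_reg.csv)
x = np.arange(10, dtype=float)
y = 2.0 * x

# Shuffle the row indices with a fixed seed, then cut at 70%
rng = np.random.default_rng(0)
order = rng.permutation(len(x))
n_train = int(round(0.7 * len(x)))

train_idx, test_idx = order[:n_train], order[n_train:]
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(x_train), len(x_test))
```

The same index arrays are applied to both `x` and `y`, so each feature value stays paired with its target.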

In [None]:
X=X.reshape(len(X),1)
y=y.reshape(len(y),1)
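
scikit-learn estimators expect the feature matrix as a 2-D array of shape `(n_samples, n_features)`, which is why a single feature column has to be reshaped. A minimal sketch on made-up values (illustrative only, not taken from the dataset):

```python
import numpy as np

# Hypothetical credit-limit values, for illustration only
x = np.array([1000.0, 2500.0, 5000.0])

# One column, as many rows as needed: -1 lets NumPy infer the row count
x_2d = x.reshape(-1, 1)

print(x.shape)     # (3,)
print(x_2d.shape)  # (3, 1)
```

`reshape(-1, 1)` is equivalent to `reshape(len(x), 1)` here; it just avoids repeating the length.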

# Create and fit the model sklearn

In [None]:
# adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py


# split the dataset into train and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create linear regression object
regr = linear_model.LinearRegression()
 
# Train the model using the training sets
regr.fit(X_train, y_train)
 
# Plot outputs
#plt.plot(X_test, regr.predict(X_test), color='red',linewidth=3)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(y_test, y_pred))

# Plot outputs
plt.scatter(X_test, y_test,  color='green')
plt.plot(X_test, y_pred, color='red', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
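
To make the two metrics concrete, here is a small sketch that computes the mean squared error and R² by hand with NumPy. The targets and predictions are made up for illustration; they are not outputs of the model above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

# Mean squared error: average of the squared residuals
residuals = y_true - y_hat
mse = np.mean(residuals ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

An R² of 1 means the predictions match the targets exactly; a model that always predicts the mean of the targets scores 0.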

In [None]:
# predict RBAL for a single credit limit of 5000
# (predict expects a 2-D array of shape (n_samples, n_features))
print(regr.predict([[5000]]))

### Using statsmodels

In [None]:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as sm

# create a fitted model in one line
lm = sm.ols(formula='RBAL ~ CRELIM', data=df).fit()

# print the coefficients
lm.params

### Interpreting Model Coefficients
How do we interpret the CRELIM coefficient ($\beta_1$)?

* A one-unit increase in CRELIM is associated with a 1.086316-unit increase in RBAL.

Note that if an increase in CRELIM were associated with a decrease in RBAL, $\beta_1$ would be negative.
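
The interpretation of the slope can be checked on a toy example. The sketch below uses the closed-form OLS estimates for a single predictor on synthetic data generated from $y = 2 + 3x$; the data and variable names are assumptions made for illustration, not part of the credit dataset.

```python
import numpy as np

# Synthetic data generated from y = 2 + 3x (no noise), illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form OLS estimates for a single predictor
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Raising x by one unit changes the prediction by exactly beta_1
pred_at_2 = beta_0 + beta_1 * 2.0
pred_at_3 = beta_0 + beta_1 * 3.0

print(beta_0, beta_1, pred_at_3 - pred_at_2)
```

If the synthetic data had been generated with a negative slope, `beta_1` would come out negative as well.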

In [None]:
# manually calculate the prediction for CRELIM = 2: intercept + slope * CRELIM
y2 = 4995.947242 + 1.086316*2

y2

In [None]:
# Plot the least squares line:

In [None]:
# create a DataFrame with the minimum and maximum values of CRELIM
X_new = pd.DataFrame({'CRELIM': [df.CRELIM.min(), df.CRELIM.max()]})
X_new.head()

In [None]:
# make predictions for those x values and store them
preds = lm.predict(X_new)
preds

In [None]:
# first, plot the observed data
data.plot(kind='scatter', x='CRELIM', y='RBAL')

# then, plot the least squares line
plt.plot(X_new, preds, c='red', linewidth=2)

In [None]:
# print the confidence intervals for the model coefficients  
lm.conf_int()
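
By default `conf_int()` reports 95% intervals. As a rough sketch of where such an interval comes from in the single-predictor case, the code below computes it by hand as estimate ± t-quantile × standard error. The data are made up and the calculation assumes the usual OLS conditions; it is an illustration, not statsmodels' internal code.

```python
import numpy as np
from scipy import stats

# Made-up data lying close to the line y = 2x, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = len(x)

# OLS slope and intercept for a single predictor
s_xx = np.sum((x - x.mean()) ** 2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
beta_0 = y.mean() - beta_1 * x.mean()

# Residual variance with n - 2 degrees of freedom, then the slope's standard error
residuals = y - (beta_0 + beta_1 * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)
se_beta_1 = np.sqrt(sigma2 / s_xx)

# 95% interval: estimate +/- t quantile * standard error
t_crit = stats.t.ppf(0.975, df=n - 2)
lower, upper = beta_1 - t_crit * se_beta_1, beta_1 + t_crit * se_beta_1

print(lower, upper)
```

A narrow interval that excludes zero suggests the slope is estimated precisely and differs from zero.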

In [None]:
# print the p-values for the model coefficients
lm.pvalues

In [None]:
# print a summary of the fitted model
lm.summary()

# Multivariable Linear Regression Model Example



In [None]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 4, sharey=True)
data.plot(kind='scatter', x='TRADES', y='RBAL', ax=axs[0], figsize=(16, 8))
data.plot(kind='scatter', x='BRNEW', y='RBAL', ax=axs[1])
data.plot(kind='scatter', x='BRAGE', y='RBAL', ax=axs[2])
data.plot(kind='scatter', x='CRELIM', y='RBAL', ax=axs[3])

In [None]:
# create a fitted model with all four features
lm = sm.ols(formula='RBAL ~ CRELIM + TRADES + BRNEW + BRAGE ', data=df).fit()

# print the coefficients
lm.params

In [None]:
# print a summary of the fitted model
lm.summary()

In [None]:
# print the p-values for the model coefficients
lm.pvalues

# Use sklearn for Linear Regression:

In [None]:
# p-values and confidence intervals are not directly available in sklearn; they have to be computed separately.

# create X and y
feature_cols = ['CRELIM', 'TRADES', 'BRNEW', 'BRAGE']
X = df[feature_cols]
y = df.RBAL

# follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
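
With several features, the fit is still an ordinary least-squares problem. The sketch below reproduces the idea with `np.linalg.lstsq` on synthetic data with two hypothetical features; it illustrates OLS in general and is not meant to mirror scikit-learn's internals.

```python
import numpy as np

# Synthetic data: target = 1 + 2*a - 3*b exactly (illustration only)
features = np.array([[0.0, 1.0],
                     [1.0, 0.0],
                     [2.0, 1.0],
                     [3.0, 5.0],
                     [4.0, 2.0]])
target = 1.0 + 2.0 * features[:, 0] - 3.0 * features[:, 1]

# Prepend a column of ones so the first coefficient is the intercept
design = np.column_stack([np.ones(len(features)), features])

# Least-squares solution of design @ beta ~= target
beta, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, coefs = beta[0], beta[1:]

print(intercept, coefs)
```

Here `intercept` plays the role of `lm.intercept_` and `coefs` the role of `lm.coef_`.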

In [None]:
# pair the feature names with the coefficients
# the built-in zip returns an iterator, so we wrap it in list() to see its contents
# source: https://docs.python.org/3/library/functions.html

zipped = zip(feature_cols, lm.coef_)

In [None]:
# convert the zip object to a list of (feature, coefficient) pairs

list(zipped)
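
One thing to watch for: a `zip` object can be consumed only once. A small sketch with hypothetical names and values (not the fitted coefficients) shows the behavior, along with a `dict` as an alternative for lookup by name:

```python
# Hypothetical names and values, for illustration only
names = ['a', 'b', 'c']
values = [0.5, -1.2, 3.0]

pairs = zip(names, values)
first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to yield

# A dict gives lookup by feature name
by_name = dict(zip(names, values))

print(first_pass)
print(second_pass)
print(by_name['b'])
```

Re-running the `list(zipped)` cell without re-creating `zipped` would therefore return an empty list.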

# Q&A