# LINEAR REGRESSION

# This notebook has 3 parts:

# * Part 1 - faster way using statsmodels package
# * Part 2 - longer, step by step way to do linear regression
# * Part 3 - regression plot

# PART 1 - FASTER WAY USING STATSMODELS

# STEP 0 

IMPORT THE STATSMODELS PACKAGE, AS WELL AS LINEAR REGRESSION PACKAGES

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

import statsmodels.formula.api as smf

# RUN A REGRESSION

In [None]:
lm = smf.ols(formula='y ~ x1 + x2', data=df).fit()

This command does not produce any output on its own. It only fits the model, so we need to run another command afterwards to see the results.

Breakdown of the command:

* lm - the name we give to our fitted model
* smf.ols - ordinary least squares function from the statsmodels package, which we imported as smf in Step 0
* y - dependent variable
* x1 and x2 - independent variables
* data - dataframe we want to use

# GET THE OUTPUTS

In [None]:
lm.summary()

This command will give us 3 tables with the standard regression outputs we might need, such as the coefficients, their standard errors and p-values, and the R squared.
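
If we only need a few numbers rather than the full summary, the fitted model also exposes them as attributes. Below is a minimal sketch under stated assumptions: the dataframe is synthetic and made up for illustration (true values 1, 2 and -3), and only the column names y, x1 and x2 come from the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1 + 2 * df["x1"] - 3 * df["x2"] + rng.normal(scale=0.1, size=200)

lm = smf.ols(formula="y ~ x1 + x2", data=df).fit()

print(lm.params)    # intercept and slope estimates, indexed by term name
print(lm.pvalues)   # p-value for each estimate
print(lm.rsquared)  # R squared of the fit
```

The estimates should land close to the values used to generate the data, which is a quick way to check that the formula is doing what we expect.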

# PART 2 - LONGER, STEP BY STEP WAY

# STEP 0

IMPORT THE LINEAR REGRESSION PACKAGES. YOU SHOULD ALREADY HAVE IMPORTED PANDAS, NUMPY AND OTHER BASIC PACKAGES.

In [10]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# STEP 1

DEFINE THE LINEAR MODEL

In [11]:
lm = linear_model.LinearRegression()

# STEP 2

DEFINE DEPENDENT AND INDEPENDENT VARIABLE(S)

In [12]:
# Defining dependent variable y

y = df['y']

# Defining independent variables x. Let's assume that we have 2 independent variables, x1 and x2.

x = df[['x1', 'x2']]

# STEP 3

FIT THE MODEL

In [13]:
model = lm.fit(x, y)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
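
The error above is raised when x or y contains missing or infinite values. One common workaround, sketched below, is to drop the affected rows before defining y and x. The small dataframe is made up for illustration, and whether dropping rows is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration) with one NaN and one infinite value
df = pd.DataFrame({
    "y":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "x1": [0.5, np.nan, 1.5, 2.0, 2.5],
    "x2": [1.0, 2.0, np.inf, 4.0, 5.0],
})

# Treat infinities as missing, then drop every row with a missing value
clean = df[["y", "x1", "x2"]].replace([np.inf, -np.inf], np.nan).dropna()

# Define the variables from the cleaned dataframe, as in Step 2
y = clean["y"]
x = clean[["x1", "x2"]]
print(len(df), "rows before,", len(clean), "rows after")
```

With y and x defined from the cleaned dataframe, the fit in Step 3 no longer sees NaN or infinite values.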

# STEP 4

PREDICT y BASED ON x.

In [None]:
predictions = lm.predict(x)
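
Step 0 imports mean_squared_error, and this is a natural place to use it: it compares the actual values of y with the predictions. A minimal sketch, assuming synthetic data made up for illustration in place of df:

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 1 + 2 * x[:, 0] - 3 * x[:, 1] + rng.normal(scale=0.1, size=200)

lm = linear_model.LinearRegression()
model = lm.fit(x, y)
predictions = lm.predict(x)

# Average squared gap between actual and predicted y (lower is better)
mse = mean_squared_error(y, predictions)
print(mse)
```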

# CHECKING THE R SQUARED

In [None]:
model.score(x, y)
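
For a linear regression, score returns the R squared, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes the same quantity by hand on a few made-up numbers; the function name r_squared is ours, not part of sklearn.

```python
def r_squared(actual, predicted):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy numbers (an assumption for illustration)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(actual, predicted))
```

A value close to 1 means the predictions explain most of the variation in y.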

# CHECKING THE COEFFICIENTS

In [2]:
model.coef_

NameError: name 'model' is not defined

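The name "model" in the previous command refers to the fitted model that we defined in Step 3. If Step 3 has not been run successfully in the current session, this command fails with a NameError like the one above.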

# CHECKING THE INTERCEPT

In [None]:
model.intercept_
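
The coefficients and the intercept are all we need to reproduce the predictions from Step 4: predicted y = intercept + coefficient of x1 * x1 + coefficient of x2 * x2. A minimal sketch that checks this, assuming synthetic data made up for illustration:

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): y = 4 + 1.5*x1 - 0.5*x2 + noise
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
y = 4 + 1.5 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=100)

model = linear_model.LinearRegression().fit(x, y)

# predicted y = intercept + coef[0] * x1 + coef[1] * x2
manual = model.intercept_ + x @ model.coef_

# Compare the hand-computed predictions with the ones from predict
print(np.allclose(manual, model.predict(x)))
```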

# PLOT PREDICTED AGAINST ACTUAL VALUES OF THE DEPENDENT VARIABLE

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format ='retina'

In [None]:
fig = plt.figure(figsize=(10,6))
plt.scatter(predictions, y, s=100, c='b', marker='+')
plt.xlabel("predicted value of y based on x1 and x2")
plt.ylabel("actual value of y")
plt.show()

# PART 3 - REGRESSION PLOT

# STEP 0

IMPORT THE PACKAGES

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format ='retina'

# PLOT THE REGRESSION

In [None]:
fig = plt.figure(figsize=(10,6))
sns.regplot(x='x1', y='y', data=df)
plt.show()

# PLOT MULTIPLE REGRESSIONS ON SAME GRAPH 

In [None]:
fig = plt.figure(figsize=(10,6))
sns.regplot(x='x1', y='y', data=df)
sns.regplot(x='x2', y='y', data=df)
sns.regplot(x='x3', y='y', data=df)
plt.show()

This doesn't mean that we are controlling for all variables. It simply means that we run 3 separate regressions, each with only one independent variable, and put them on the same graph. So, the first regression is a regression of y on x1, without controlling for x2 and x3. The second is a regression of y on x2, without controlling for x1 and x3, and so on.
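
To see why this distinction matters, the sketch below compares the slope of x1 with and without controlling for x2. Everything here is an assumption for illustration: the data is synthetic, x2 is deliberately correlated with x1, and the helper ols uses numpy's least squares routine as a stand-in for the regression that regplot draws.

```python
import numpy as np

# Synthetic data (an assumption for illustration): x2 is correlated with x1,
# and the true model is y = 2*x1 + 3*x2 + noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=5000)
y = 2 * x1 + 3 * x2 + rng.normal(scale=0.1, size=5000)

def ols(columns, target):
    """Least-squares coefficients [intercept, slopes...] via numpy."""
    design = np.column_stack([np.ones(len(target))] + columns)
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

simple = ols([x1], y)        # y on x1 only, like a single regplot line
multiple = ols([x1, x2], y)  # y on x1 and x2 together

print("slope of x1, not controlling for x2:", simple[1])
print("slope of x1, controlling for x2:    ", multiple[1])
```

In this setup the single-variable slope picks up part of the effect of x2, so it comes out well above the slope from the regression that includes both variables.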