# Linear Regression Lesson

## Objectives

By the end of this lesson you will be able to:
- Understand when to use linear regression
- Understand how linear regression relates to the CRISP-DM Model
- Apply linear regression to solve a real-world task
- Interpret the results of a linear regression model

<img src="imgs/talking.jpeg" width="60" align='left'>

#### Turn and Talk - Review

Last week we used t-tests and ANOVA to answer a variety of questions about our data. With a partner, discuss the following:

- In t-tests and ANOVA, what are our hypotheses about?
- In t-tests and ANOVA, what types of variables are our independent and dependent variables?

#### Great, we've done some review of t-tests and ANOVA. But what if our independent variable is also a continuous variable? In this case we can use a new type of statistical modeling called linear regression!

<img src="imgs/regression_cat_noclue.jpg" width="300">

## CRoss-Industry Standard Process for Data Mining (CRISP-DM)

Before we dig into regression, let's talk about how it fits into the steps of the CRISP-DM model.

<img src="imgs/new_crisp-dm.png" width="300">

### The basics of Linear Regression


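At the most basic level, linear regression helps us find the "best" linear relationship between an independent variable and a dependent variable. With ordinary least squares (OLS), "best" means the line that minimizes the sum of squared differences between the observed and predicted Y values.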

<img src="imgs/linear-nonlinear-relationships.png" width="600">

The model that we "fit" is written as:

$$ Y = b_{0} + b_{1}X + \epsilon$$
 
- $b_{0}$ is the intercept of the model and represents the average Y value when X is 0
- $b_{1}$ is the coefficient of X and represents the slope between X and Y: the expected change in Y for a one-unit increase in X
- $\epsilon$ is the irreducible error term. Depending on the problem at hand, we might assume that these errors come from measurement mistakes, personal beliefs, recording errors, etc.
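
To make the three terms concrete, here is a minimal sketch that fits the line by hand using the closed-form least-squares formulas for a single X. The data are made up for illustration and are not from the lesson's dataset; the variable names are arbitrary.

```python
# Toy data, made up for illustration (not from the lesson's dataset).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope: how x and y vary together, divided by how much x varies.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx

# Intercept: forces the fitted line through the point (x_mean, y_mean).
b0 = y_mean - b1 * x_mean

# Residuals: the part of each y that the line does not capture.
# They are our estimates of the error term.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 0.05, b1 = 1.99
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals sum to (approximately) zero, which is a property of any least-squares line that includes an intercept.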


<img src="imgs/talking.jpeg" width="60" align='left'>

#### Turn and Talk 

What are two variables (any two that you can dream up) that we might expect to see a linear relationship between?


### Our Problem 

We want to know if an increase in alcohol consumption will increase our happiness.

<img src="imgs/beer_happy.jpg" width="400">



[Happiness and Alcohol Consumption Dataset](https://www.kaggle.com/marcospessotto/happiness-and-alcohol-consumption)

In [18]:
#import necessary packages
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

In [10]:
#read in dataset
happy = pd.read_csv('HappinessAlcoholConsumption.csv')
happy.head()

| | Country | Region | Hemisphere | HappinessScore | HDI | GDP_PerCapita | Beer_PerCapita | Spirit_PerCapita | Wine_PerCapita |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | Western Europe | north | 7.526 | 928 | 53.579 | 224 | 81 | 278 |
| 1 | Switzerland | Western Europe | north | 7.509 | 943 | 79.866 | 185 | 100 | 280 |
| 2 | Iceland | Western Europe | north | 7.501 | 933 | 60.53 | 233 | 61 | 78 |
| 3 | Norway | Western Europe | north | 7.498 | 951 | 70.89 | 169 | 71 | 129 |
| 4 | Finland | Western Europe | north | 7.413 | 918 | 43.433 | 263 | 133 | 97 |


Let's do a little exploration of our data before we begin

In [25]:
#look at info of data
happy.info()

In [25]:
#look at descriptive stats
happy.describe()

In [25]:
#create a pairplot to visually inspect data
sns.pairplot(happy)

In [25]:
#look at correlations (numeric columns only)
happy.select_dtypes('number').corr()

<img src="imgs/talking.jpeg" width="60" align='left'>

#### Turn and Talk 

Based on this bit of data exploration, which features might we expect to have a linear relationship with happiness?




###  Simple Linear Regression

[StatsModels Formula API](https://www.statsmodels.org/stable/api.html#regression)

Upon visual inspection, it looks like HDI is positively linearly related to happiness. So let's look at that first.

In [22]:
# first we set up our model
model = smf.ols('HappinessScore ~ HDI', happy) #HappinessScore is our Y variable and HDI is our X

#now we fit the OLS model to our data
model = model.fit()

#finally, we print out the summary table to view the results
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | HappinessScore | R-squared: | 0.664 |
| Model: | OLS | Adj. R-squared: | 0.662 |
| Method: | Least Squares | F-statistic: | 237.7 |
| Date: | Sun, 31 May 2020 | Prob (F-statistic): | 3.1e-30 |
| Time: | 17:13:25 | Log-Likelihood: | -122.91 |
| No. Observations: | 122 | AIC: | 249.8 |
| Df Residuals: | 120 | BIC: | 255.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.8950 | 0.306 | 2.921 | 0.004 | 0.288 | 1.502 |
| HDI | 0.0062 | 0.000 | 15.416 | 0.000 | 0.005 | 0.007 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.079 | Durbin-Watson: | 1.371 |
| Prob(Omnibus): | 0.214 | Jarque-Bera (JB): | 2.572 |
| Skew: | -0.242 | Prob(JB): | 0.276 |
| Kurtosis: | 2.479 | Cond. No. | 3830.0 |


#### There are a couple of things we want to interpret here:

- __R-squared__ = 0.664 indicates that HDI explains about 66% of the variance in HappinessScore
- __Intercept Coefficient__ = 0.895 indicates that the predicted mean HappinessScore is 0.895 when HDI is 0
- __HDI Coefficient__ = 0.0062 indicates that with a one-unit increase in HDI the HappinessScore is expected to increase by 0.0062
    -  The __p value__ of 0.000 (rounded to three decimals, so p < 0.001) also indicates that HDI is a statistically significant predictor of HappinessScore
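
As a quick check on these interpretations, the sketch below plugs the reported coefficients back into the model equation. The helper name `predict_happiness` is hypothetical, and because the coefficients are copied from the rounded summary table, the results are approximate.

```python
# Coefficients copied from the summary table above (rounded values).
INTERCEPT = 0.8950
HDI_SLOPE = 0.0062


def predict_happiness(hdi):
    """Predicted HappinessScore for a given HDI under the fitted line."""
    return INTERCEPT + HDI_SLOPE * hdi


# Denmark's row from happy.head(): HDI = 928, HappinessScore = 7.526.
denmark_pred = predict_happiness(928)
denmark_resid = 7.526 - denmark_pred  # observed minus predicted

# A 100-unit difference in HDI shifts the prediction by 100 * slope.
shift_100 = predict_happiness(900) - predict_happiness(800)

print(round(denmark_pred, 3))   # 6.649
print(round(denmark_resid, 3))  # 0.877
print(round(shift_100, 2))      # 0.62
```

Denmark scores higher than the line predicts from HDI alone; that gap is its residual, the part of its HappinessScore this model leaves unexplained.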

###  Multiple Linear Regression

But my question was about drinking and its impact on happiness! So let's add one of the drinking-related variables to the model.
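
As a sketch of what adding a predictor means, the model from earlier gains one more term. Here $X_{1}$ and $X_{2}$ are placeholder names; for instance they could stand for HDI and `Beer_PerCapita`, which is only one possible choice among the drinking variables:

$$ Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + \epsilon $$

Each coefficient is now read as the expected change in Y for a one-unit increase in its own X while the other X is held fixed. In the formula string, extra predictors are joined with `+`, for example `'HappinessScore ~ HDI + Beer_PerCapita'`.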

In [26]:
# first we set up our model
model = smf.ols('HappinessScore ~ HDI', happy) #Add some other variables here

#now we fit the OLS model to our data
model = model.fit()

#finally, we print out the summary table to view the results
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | HappinessScore | R-squared: | 0.664 |
| Model: | OLS | Adj. R-squared: | 0.662 |
| Method: | Least Squares | F-statistic: | 237.7 |
| Date: | Sun, 31 May 2020 | Prob (F-statistic): | 3.1e-30 |
| Time: | 17:50:08 | Log-Likelihood: | -122.91 |
| No. Observations: | 122 | AIC: | 249.8 |
| Df Residuals: | 120 | BIC: | 255.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.8950 | 0.306 | 2.921 | 0.004 | 0.288 | 1.502 |
| HDI | 0.0062 | 0.000 | 15.416 | 0.000 | 0.005 | 0.007 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.079 | Durbin-Watson: | 1.371 |
| Prob(Omnibus): | 0.214 | Jarque-Bera (JB): | 2.572 |
| Skew: | -0.242 | Prob(JB): | 0.276 |
| Kurtosis: | 2.479 | Cond. No. | 3830.0 |


<img src="imgs/talking.jpeg" width="60" align='left'>

#### Turn and Talk 

Interpret the following:
- R-squared
- Intercept Coefficient
- HDI Coefficient
- Beer Per Capita Coefficient



### ECOMMERCE SALES

Now it's your turn to practice some linear regression using the [Ecommerce Dataset](https://www.kaggle.com/kolawale/focusing-on-mobile-app-or-website)

In [24]:
# your code here