# Linear Regression Checkpoint

In this checkpoint, you'll use the Advertising data you encountered previously, which records the amounts spent on different advertising platforms and the resulting sales. Each observation is a different product.

We'll import the relevant modules and load and prepare the dataset for you below.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
data = pd.read_csv('data/advertising.csv').drop('Unnamed: 0', axis=1)
data.describe()

,TV,radio,newspaper,sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [3]:
X = data.drop('sales', axis=1)
y = data['sales']

In the linear regression section of the curriculum, you analyzed how `TV`, `radio`, and `newspaper` spending individually affected figures for `sales`. Here, we'll use all three together in a multiple linear regression model!
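As a sketch of what "all three together" means, the model fitted later in this checkpoint can be written as below. The $\beta$ symbols and the error term $\varepsilon$ are notation introduced here for illustration; they do not appear elsewhere in the checkpoint.

$$
\text{sales} = \beta_0 + \beta_1\,\text{TV} + \beta_2\,\text{radio} + \beta_3\,\text{newspaper} + \varepsilon
$$

Each slope is read as the expected change in `sales` for a one-unit increase in that platform's spending, holding the other two platforms fixed.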

## 1) Create a Correlation Matrix for `X`

In [4]:
X.corr()

,TV,radio,newspaper
TV,1.0,0.054809,0.056648
radio,0.054809,1.0,0.354104
newspaper,0.056648,0.354104,1.0


## 2) Based on this correlation matrix only, would you recommend using `TV`, `radio`, and `newspaper` in the same multiple linear regression model?

In [None]:
"""
The highest pairwise correlation is between radio and newspaper, about 0.35.

Multiple acceptable answers here:

a. Be cautious about including both radio and newspaper in the same regression model.
Correlated features introduce multicollinearity, and interpreting each coefficient as the
effect of one feature with the others held constant assumes the features are not strongly
correlated with each other.

b. A common rule of thumb treats 0.7 as the threshold for "high" correlation. At 0.35 we are
well below that, so we can proceed with caution and include all three features in the model.
"""
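One way to put a number on the multicollinearity concern is the variance inflation factor (VIF). The block below is a minimal sketch, not part of the original checkpoint: it assumes NumPy is available and reuses the correlation values printed by `X.corr()` above rather than reloading the data. The names `features`, `corr`, and `vif` are introduced here for illustration.

```python
import numpy as np

# Correlation matrix of the features, copied from the X.corr() output above.
features = ["TV", "radio", "newspaper"]
corr = np.array([
    [1.0,      0.054809, 0.056648],
    [0.054809, 1.0,      0.354104],
    [0.056648, 0.354104, 1.0],
])

# The VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing
# feature j on the other features. It equals the j-th diagonal entry of the
# inverse correlation matrix.
inverse_diagonal = np.diag(np.linalg.inv(corr))
vif = {name: float(value) for name, value in zip(features, inverse_diagonal)}

for name, value in vif.items():
    print(f"{name}: VIF = {value:.3f}")
```

A VIF near 1 means a feature's coefficient variance is barely inflated by the other features, and commonly quoted rules of thumb only flag values above 5 or 10. All three values here come out below 1.2, which lends support to answer (b).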

## 3) Create a multiple linear regression model (using either `ols()` or `sm.OLS()`).  Use `TV`, `radio`, and `newspaper` as independent variables, and `sales` as the dependent variable.

### Produce the model summary table of this multiple linear regression model.

In [5]:

# Using ols: the formula API adds the intercept automatically
formula = 'sales ~ TV + radio + newspaper'
model = ols(formula=formula, data=data).fit()
model.summary()

Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Thu, 27 Feb 2020",Prob (F-statistic):,1.58e-96
Time:,10:41:54,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


In [6]:

# Using OLS: sm.OLS does not add an intercept, so add a constant column first
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
results.summary()

Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Thu, 27 Feb 2020",Prob (F-statistic):,1.58e-96
Time:,10:42:08,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0
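To make the coefficient column concrete, the sketch below plugs the rounded coefficients from the summary table into the fitted linear equation. It is illustrative only: `predict_sales` is a helper defined here (not a statsmodels function), and the spending levels are hypothetical values chosen for the example.

```python
# Coefficients copied from the summary table above (rounded as reported).
coefficients = {
    "Intercept": 2.9389,
    "TV": 0.0458,
    "radio": 0.1885,
    "newspaper": -0.0010,
}


def predict_sales(tv, radio, newspaper, coefs=coefficients):
    """Return predicted sales from the fitted linear equation."""
    return (
        coefs["Intercept"]
        + coefs["TV"] * tv
        + coefs["radio"] * radio
        + coefs["newspaper"] * newspaper
    )


# Hypothetical spending levels, chosen only for illustration.
baseline = predict_sales(tv=100, radio=20, newspaper=10)
more_radio = predict_sales(tv=100, radio=21, newspaper=10)

print(f"predicted sales: {baseline:.4f}")
print(f"change from one more unit of radio: {more_radio - baseline:.4f}")
```

The difference between the two predictions equals the `radio` coefficient, which is exactly what "holding the other features constant" means in a linear model.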


## 4) For each coefficient:

### - Conclude whether it's statistically significant 

### - State how you came to that conclusion

## Interpret how these results relate to your answer for Question 2

In [None]:
"""
The p-values for the intercept, TV, and radio all round to 0.000, well below a standard alpha of 0.05,
so these coefficients are statistically significant.

However, newspaper has a p-value of 0.860, which is greater than 0.05, so it is not statistically significant.

Alt: since the 95% confidence intervals (alpha=.05) for TV and radio do not include 0, they can be considered
statistically significant.

However, since the 95% confidence interval for newspaper does include 0, we can conclude it is
not statistically significant.

Going back to the answer for Question 2, one possible explanation is multicollinearity between newspaper and radio.
If we are interested in the "true" coefficients for newspaper and radio, we should only include one or the other
in our model.
"""
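The two decision rules in the answer above (p-value below alpha, or confidence interval excluding 0) can be sketched in a few lines. This is a minimal illustration that copies the rounded values from the summary table instead of reading them from the fitted model; `summary_rows` and `is_significant` are names introduced here.

```python
ALPHA = 0.05

# (p-value, CI lower bound, CI upper bound) copied from the summary table.
# A reported p-value of 0.000 is rounded, not exactly zero.
summary_rows = {
    "Intercept": (0.000, 2.324, 3.554),
    "TV": (0.000, 0.043, 0.049),
    "radio": (0.000, 0.172, 0.206),
    "newspaper": (0.860, -0.013, 0.011),
}


def is_significant(p_value, ci_low, ci_high, alpha=ALPHA):
    """Return the verdicts of the p-value rule and the interval rule."""
    by_p_value = p_value < alpha
    by_interval = not (ci_low <= 0.0 <= ci_high)
    return by_p_value, by_interval


decisions = {name: is_significant(*row) for name, row in summary_rows.items()}

for name, (by_p, by_ci) in decisions.items():
    print(f"{name}: p-value rule -> {by_p}, interval rule -> {by_ci}")
```

The two rules agree for every coefficient, which is expected: a 95% confidence interval excludes 0 exactly when the two-sided p-value is below 0.05.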