# Linear Regression Checkpoint

In this section, you'll be using the Advertising data to run regression models. In this dataset, each row represents a different product, and we have a sample of 200 products from a larger population of products. We have three features - `TV`, `radio`, and `newspaper` - that describe how many thousands of advertising dollars were spent promoting the product. The target, `sales`, describes how many millions of dollars in sales the product had.

We'll import the relevant modules and load and prepare the dataset for you below.

In [1]:
# Run this cell without changes
import pandas as pd
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
# Run this cell without changes
data = pd.read_csv('data/advertising.csv', index_col=0)
data.describe()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 147.0425 | 23.264 | 30.554 | 14.0225 |
| std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min | 0.7 | 0.0 | 0.3 | 1.6 |
| 25% | 74.375 | 9.975 | 12.75 | 10.375 |
| 50% | 149.75 | 22.9 | 25.75 | 12.9 |
| 75% | 218.825 | 36.525 | 45.1 | 17.4 |
| max | 296.4 | 49.6 | 114.0 | 27.0 |


## Simple Linear Regression

### 1) Use StatsModels' `ols` function (imported above) to run a linear regression using just `TV` to predict `sales`

Produce the model summary table of this simple linear regression model.

In [3]:
# Replace None with appropriate code
model = None
results = None
### BEGIN SOLUTION

# The Test class does not seem to be compatible with a StatsModels
# model right now.  When I save then run_test on this model, it
# throws an assert error.  It's possible that the .summary() call
# mutates the actual model object

# Therefore, for now this is a manually graded answer

formula = 'sales ~ TV'
model = ols(formula=formula, data=data)
results = model.fit()

### END SOLUTION

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.612 |
| Model: | OLS | Adj. R-squared: | 0.61 |
| Method: | Least Squares | F-statistic: | 312.1 |
| Date: | Tue, 22 Sep 2020 | Prob (F-statistic): | 1.47e-42 |
| Time: | 12:33:30 | Log-Likelihood: | -519.05 |
| No. Observations: | 200 | AIC: | 1042.0 |
| Df Residuals: | 198 | BIC: | 1049.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.0326 | 0.458 | 15.360 | 0.000 | 6.130 | 7.935 |
| TV | 0.0475 | 0.003 | 17.668 | 0.000 | 0.042 | 0.053 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.531 | Durbin-Watson: | 1.935 |
| Prob(Omnibus): | 0.767 | Jarque-Bera (JB): | 0.669 |
| Skew: | -0.089 | Prob(JB): | 0.716 |
| Kurtosis: | 2.779 | Cond. No. | 338.0 |
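
The coefficient rows of the summary above define the fitted line: predicted sales is roughly 7.0326 + 0.0475 × TV. As a minimal sketch, the rounded coefficients can be plugged in by hand. The helper name `predict_sales` and the example spend of 100 are illustrative assumptions, not part of the checkpoint.

```python
# Rounded coefficients copied from the simple regression summary.
INTERCEPT = 7.0326  # predicted sales (millions of dollars) when TV spend is 0
TV_COEF = 0.0475    # change in sales (millions) per extra thousand dollars of TV spend


def predict_sales(tv_spend_thousands):
    """Hypothetical helper: evaluate the fitted line at a given TV spend."""
    return INTERCEPT + TV_COEF * tv_spend_thousands


# Example: a product with 100 thousand dollars of TV advertising.
prediction = predict_sales(100)
print(round(prediction, 4))
```

With the fitted `results` object from the formula interface, `results.predict(pd.DataFrame({'TV': [100]}))` should give essentially the same value, up to the rounding of the coefficients shown in the table.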


### 2) Can we infer that products with higher TV advertising spend tend to have greater sales? Explain how you determined this from the model output.

This question is asking you to use your findings from the sample in your dataset to make an inference about the relationship between TV advertising spend and sales in the broader population.

Assign `ans2` to `True` if products with higher TV advertising tend to have greater sales, `False` if not.  Then explain your answer below.

In [4]:
# Replace None with appropriate code
ans2 = None
### BEGIN SOLUTION
ans2 = True
### END SOLUTION

In [5]:
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS

# ans2 should be True or False
assert type(ans2) == bool

### BEGIN HIDDEN TESTS

assert ans2 == True

### END HIDDEN TESTS

=== BEGIN MARK SCHEME ===

Yes. The p-value for the `TV` coefficient is very small (< 0.001), so there is a statistically significant relationship between TV advertising spend and sales.

Since the coefficient (0.0475) is positive, more TV advertising is associated with greater sales. Each additional 1,000 dollars of TV advertising spend is associated with an increase of 0.0475 × 1,000,000 = 47,500 dollars in sales (!)

=== END MARK SCHEME ===
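
The unit conversion in the mark scheme can be checked with a few lines of arithmetic. This is a sketch that assumes the units stated in the introduction (`TV` in thousands of dollars, `sales` in millions of dollars); the variable names are illustrative.

```python
# Values copied from the TV row of the simple regression summary.
tv_coef = 0.0475                # sales units per TV unit
ci_low, ci_high = 0.042, 0.053  # 95% confidence interval for the coefficient

DOLLARS_PER_SALES_UNIT = 1_000_000  # sales is recorded in millions of dollars
DOLLARS_PER_TV_UNIT = 1_000         # TV spend is recorded in thousands of dollars

# Dollars of sales associated with one extra TV unit (1,000 dollars of spend).
sales_per_tv_unit = round(tv_coef * DOLLARS_PER_SALES_UNIT)
# The same association expressed per single advertising dollar.
sales_per_ad_dollar = sales_per_tv_unit / DOLLARS_PER_TV_UNIT

# The confidence interval converts the same way.
ci_dollars = (
    round(ci_low * DOLLARS_PER_SALES_UNIT),
    round(ci_high * DOLLARS_PER_SALES_UNIT),
)

print(sales_per_tv_unit)    # 47500
print(sales_per_ad_dollar)  # 47.5
print(ci_dollars)           # (42000, 53000)
```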

## Multiple Linear Regression

### 3) Create a multiple linear regression model (using either `ols()` or `sm.OLS()`).  Use `TV`, `radio`, and `newspaper` as independent variables, and `sales` as the dependent variable.

Produce the model summary table of this multiple linear regression model.

The cell below separates `X` from `y`; you can use these variables or just use `data` depending on which interface you are using.

In [6]:
# Run this cell without changes

X = data.drop('sales', axis=1)
y = data['sales']

In [7]:
# Replace None with appropriate code
model = None
results = None
### BEGIN SOLUTION

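# Option 1: formula interface (ols), which adds an intercept automatically
formula = 'sales ~ TV + radio + newspaper'
model = ols(formula=formula, data=data)
results = model.fit()

# Option 2: array interface (sm.OLS), which needs an explicit constant column.
# Either option is acceptable; this one overwrites `model` and `results` above
# and adds a `const` column to `X`.
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()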

### END SOLUTION

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Tue, 22 Sep 2020 | Prob (F-statistic): | 1.58e-96 |
| Time: | 12:33:30 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |
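
The array interface (`sm.OLS`) does not add an intercept on its own, which is why the solution calls `sm.add_constant(X)` before fitting; the formula interface includes one by default. The plain-Python sketch below uses a made-up five-point dataset (not the Advertising data) and hand-written closed-form formulas for a single feature to show how the slope estimate changes when the intercept is left out.

```python
# Toy data generated exactly from y = 3 + 2 * x (illustrative only).
x = [1, 2, 3, 4, 5]
y = [3 + 2 * xi for xi in x]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares WITH an intercept: slope = S_xy / S_xx, intercept from the means.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope_with_const = s_xy / s_xx
intercept = y_mean - slope_with_const * x_mean

# Least squares WITHOUT an intercept (the line is forced through the origin).
slope_no_const = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

print(intercept, slope_with_const)  # 3.0 2.0
print(round(slope_no_const, 3))     # 2.818
```

Leaving out the constant forces the fitted line through the origin, so the slope absorbs the missing intercept and overstates the true value of 2 in this toy case.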


### 4) The coefficient produced by this model for one of these features is not statistically significant at an alpha of 0.05.  Which feature is it, and how do you know?

Assign `ans4` to the name of the feature (you can just "hard-code" this, you don't need to extract it from the model), then explain your answer below.

In [8]:
# Replace None with appropriate code
ans4 = None
### BEGIN SOLUTION
ans4 = "newspaper"
### END SOLUTION

In [9]:
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS

# ans4 should be a string representing the name of a feature
assert type(ans4) == str

### BEGIN HIDDEN TESTS

assert ans4 == "newspaper"

### END HIDDEN TESTS

=== BEGIN MARK SCHEME ===

The `newspaper` feature has a p-value of 0.860, which is much larger than 0.05, so its coefficient is not statistically significant.

(The p-values for `TV` and `radio` are very small, so their coefficients are statistically significant.)

Alt: the 95% confidence interval (alpha = 0.05) for `newspaper` includes 0, so we can conclude that its coefficient is not statistically significant.

(The 95% confidence intervals for `TV` and `radio` do not include 0, so their coefficients can be considered statistically significant.)

=== END MARK SCHEME ===
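
Both criteria from the mark scheme (p-value below alpha, or a 95% confidence interval that excludes 0) can be applied mechanically. The sketch below hard-codes the rounded values from the multiple regression summary; the dictionary layout and variable names are illustrative assumptions.

```python
ALPHA = 0.05

# Rounded p-values and 95% confidence intervals copied from the summary table.
# A displayed p-value of 0.000 means "smaller than 0.0005", not exactly zero.
summary_rows = {
    "TV": {"p_value": 0.000, "ci": (0.043, 0.049)},
    "radio": {"p_value": 0.000, "ci": (0.172, 0.206)},
    "newspaper": {"p_value": 0.860, "ci": (-0.013, 0.011)},
}

decisions = {}
for name, row in summary_rows.items():
    low, high = row["ci"]
    decisions[name] = {
        "p_below_alpha": row["p_value"] < ALPHA,
        "ci_excludes_zero": not (low <= 0 <= high),
    }
    print(name, decisions[name])

# Features whose coefficients are not significant at the chosen alpha.
not_significant = [
    name for name, row in summary_rows.items() if row["p_value"] >= ALPHA
]
print(not_significant)  # ['newspaper']
```

On a fitted `results` object, the unrounded values are available as `results.pvalues` and `results.conf_int(alpha=0.05)`, so the same comparison can be made without copying numbers by hand.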

### 5) The following code creates a correlation matrix for `X`

### Based on this correlation matrix only, would you recommend using `TV`, `radio`, and `newspaper` in the same multiple linear regression model?  How does this answer relate to your answer to Question 4?

In [10]:
X.corr()

| | const | TV | radio | newspaper |
|---|---|---|---|---|
| const | NaN | NaN | NaN | NaN |
| TV | NaN | 1.0 | 0.054809 | 0.056648 |
| radio | NaN | 0.054809 | 1.0 | 0.354104 |
| newspaper | NaN | 0.056648 | 0.354104 | 1.0 |


=== BEGIN MARK SCHEME ===

The highest correlation is between radio and newspaper, about 0.35.

Multiple acceptable answers here:

a. It would probably not be a good idea to include both `radio` and `newspaper` in the same regression model, because their correlation introduces multicollinearity, and interpreting the coefficients of a regression model assumes that the features are independent.

b. A different rule of thumb treats 0.7 as the threshold for "high" correlation. A correlation of 0.35 is below that threshold, so we should proceed with caution but go ahead and include all three features in the model.

Going back to the answer for Question 4, it seems like there is some multicollinearity between `newspaper` and `radio`, which may contribute to the `newspaper` coefficient not being statistically significant. If we are interested in the "true" coefficients for `newspaper` and `radio`, we should include only one or the other in our model.

=== END MARK SCHEME ===
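
The rule of thumb in answer (b) can be checked against the correlation matrix directly. This sketch hard-codes the off-diagonal values from the `X.corr()` output; the 0.7 cutoff is the rule of thumb quoted in the mark scheme, not a fixed standard, and the variable names are illustrative.

```python
# Pairwise correlations copied from the X.corr() output above.
correlations = {
    ("TV", "radio"): 0.054809,
    ("TV", "newspaper"): 0.056648,
    ("radio", "newspaper"): 0.354104,
}
HIGH_CORRELATION = 0.7  # rule-of-thumb threshold from answer (b)

# The most strongly correlated pair of features.
strongest_pair = max(correlations, key=lambda pair: abs(correlations[pair]))

# Pairs that would count as "highly" correlated under the rule of thumb.
flagged = [pair for pair, r in correlations.items() if abs(r) >= HIGH_CORRELATION]

print(strongest_pair, correlations[strongest_pair])  # ('radio', 'newspaper') 0.354104
print(flagged)                                       # []
```

No pair crosses the 0.7 threshold, which is the basis for answer (b); answer (a) takes the more cautious view that even the 0.35 correlation between `radio` and `newspaper` is worth acting on.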