# Advertising Data


In [None]:
import warnings
# conventional way to import pandas 
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np

# allow plots to appear directly in the notebook
%matplotlib inline
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

### Load the dataset

In [None]:
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()

What are the **features**?
* TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
* Radio: advertising dollars spent on Radio
* Newspaper: advertising dollars spent on Newspaper

What is the **response**?
* Sales: sales of a single product in a given market (in thousands of widgets)

In [None]:
# print the shape of the dataset
data.shape


There are **200 observations**, and thus 200 markets in the dataset.

In [None]:
# visualize the relationship between the features and the response using scatterplots
# note: seaborn 0.9+ renamed the `size` argument to `height`
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7)

## Questions About the Advertising Data
Let's pretend you work for the company that manufactures and markets this widget. The company might ask you the following: On the basis of this data, how should we spend our advertising money in the future?

This general question might lead you to more specific questions:

1. Is there a relationship between ads and sales?
2. How strong is that relationship?
3. Which ad types contribute to sales?
4. What is the effect of each ad type on sales?
5. Given ad spending in a particular market, can sales be predicted?

We will explore these questions below!

## Simple Linear Regression
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
$y = \beta_0 + \beta_1x$

What does each term represent?
* $y$ is the response
* $x$ is the feature
* $\beta_0$ is the intercept
* $\beta_1$ is the coefficient for x

Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict Sales!

## Estimating ("Learning") Model Coefficients
Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line that minimizes the sum of squared residuals (or "sum of squared errors"):

![Estimating coefficients](images/estimating_coefficients.png)


What elements are present in the diagram?

* The black dots are the observed values of x and y.
* The blue line is our least squares line.
* The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

* $\beta_0$ is the intercept (the value of $y$ when $x = 0$)
* $\beta_1$ is the slope (the change in $y$ divided by change in $x$)


Here is a graphical depiction of those calculations:

![Slope Intercept](images/slope_intercept.png)
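
As a rough illustration of the least squares criterion described above, the slope and intercept of a simple linear regression can be computed in closed form. The sketch below uses a small made-up dataset (not the advertising data), and the helper name `least_squares_line` is hypothetical, chosen only for this example:

```python
import numpy as np

# Hypothetical toy data for illustration only (not the advertising dataset)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.0])

def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum of cross-deviations / sum of squared deviations of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the least squares line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope

beta_0, beta_1 = least_squares_line(x, y)

# residuals are the vertical distances between observations and the fitted line
residuals = y - (beta_0 + beta_1 * x)
sse = np.sum(residuals ** 2)

print(beta_0, beta_1, sse)
```

Libraries such as Statsmodels and scikit-learn solve the same minimization problem, so this sketch is mainly useful for seeing where the two coefficients come from.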

Let's estimate the model coefficients for the advertising data:



In [None]:
### STATSMODELS ###

# create a fitted model
lm1 = smf.ols(formula='Sales ~ TV', data=data).fit()

# print the coefficients
lm1.params

In [None]:
### SCIKIT-LEARN ###

# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales

# instantiate and fit
lm2 = LinearRegression()
lm2.fit(X, y)

# print the coefficients
print(lm2.intercept_)
print(lm2.coef_)


## Interpreting Model Coefficients

How do we interpret the TV coefficient ($\beta_1$)?
* A "unit" increase in TV ad spending is associated with a 0.047537 "unit" increase in Sales.
* Or more clearly: An additional $1,000 spent on TV ads is associated with an increase in sales of 47.537 widgets.

Note that if an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative.
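
The unit conversion behind that interpretation can be spelled out as a short calculation. This is only a sketch of the arithmetic, using the TV coefficient quoted above; the variable names are illustrative:

```python
# TV coefficient quoted above: thousands of widgets per $1,000 of TV spend
beta_1 = 0.047537

# an additional $1,000 of TV spend is one "unit" of the TV feature
extra_tv_units = 1

# Sales is measured in thousands of widgets, so multiply by 1,000 to get widgets
extra_widgets = beta_1 * extra_tv_units * 1000

print(round(extra_widgets, 3))
```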

## Using the Model for Prediction
Let's say that there was a new market where the TV advertising spend was $50,000. What would we predict for the Sales in that market?


In [None]:
# manually calculate the prediction
7.032594 + 0.047537*50

In [None]:
### STATSMODELS ###

# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})

# predict for a new observation
lm1.predict(X_new)

In [None]:
### SCIKIT-LEARN ###

# predict for a new observation (scikit-learn expects 2D input: one row per observation)
lm2.predict(pd.DataFrame({'TV': [50]}))

Thus, we would predict Sales of 9,409 widgets in that market.

## Plotting the Least Squares Line

Let's plot the least squares line for Sales versus each of the features:

In [None]:
# note: seaborn 0.9+ renamed the `size` argument to `height`
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')

## How Well Does the Model Fit the Data?
The most common way to evaluate the overall fit of a linear model is by the R-squared value. **R-squared** is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model. Here's an example of what R-squared "looks like":

![R Squared](images/r_squared.png)

You can see that the **blue line** explains some of the variance in the data (R-squared=0.54), the **green line** explains more of the variance (R-squared=0.64), and the **red line** fits the training data even more closely (R-squared=0.66). (Does the red line look like it's overfitting?)

Let's calculate the R-squared value for our simple linear model:

In [None]:
### STATSMODELS ###

# print the R-squared value for the model
lm1.rsquared

In [None]:
### SCIKIT-LEARN ###

# print the R-squared value for the model
lm2.score(X, y)
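
To make the "reduction in error over the null model" definition concrete, R-squared can also be computed by hand as one minus the ratio of the model's squared error to the null model's squared error. The sketch below uses made-up numbers, and the helper `r_squared` is a hypothetical name introduced only for this example:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of variance explained: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # squared error of the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # squared error of the null model, which always predicts the mean response
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical responses for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]

r2_perfect = r_squared(y_true, y_true)      # predictions match exactly
r2_null = r_squared(y_true, [6.0] * 4)      # predicting the mean (6.0) every time

print(r2_perfect, r2_null)
```

A perfect fit gives 1, while a model that does no better than predicting the mean gives 0, which matches the description of the null model above.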

## Model Evaluation Metrics for Regression

For classification problems, we have only used classification accuracy as our evaluation metric. What metrics can we use for regression problems?

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
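
The three metrics above can be computed directly with NumPy. This is a minimal sketch on made-up true and predicted values, not results from the advertising model:

```python
import numpy as np

# Hypothetical true and predicted response values, for illustration only
y_true = np.array([100.0, 50.0, 30.0, 20.0])
y_pred = np.array([90.0, 50.0, 50.0, 30.0])

errors = y_true - y_pred

# Mean Absolute Error: average magnitude of the errors
mae = np.mean(np.abs(errors))

# Mean Squared Error: average of the squared errors (penalizes large errors more)
mse = np.mean(errors ** 2)

# Root Mean Squared Error: square root of MSE, in the same units as y
rmse = np.sqrt(mse)

print(mae, mse, rmse)
```

scikit-learn's `metrics.mean_absolute_error` and `metrics.mean_squared_error` should give the same MAE and MSE for these inputs, and taking `np.sqrt` of the latter yields RMSE.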