In [1]:
import sys
sys.path.append("/Users/AdamLiu/anaconda/pkgs")
import pandas as pd

In [2]:
# read CSV file directly from a URL and save the results
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# check the shape of the DataFrame (rows, columns)
data.shape

(200, 4)

In [3]:
# display the first 5 rows
data.head()

Unnamed: 0,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [4]:
# display the last 5 rows
data.tail()

Unnamed: 0,TV,Radio,Newspaper,Sales
196,38.2,3.7,13.8,7.6
197,94.2,4.9,8.1,9.7
198,177.0,9.3,6.4,12.8
199,283.6,42.0,66.2,25.5
200,232.1,8.6,8.7,13.4


### The features are:

- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- Radio: advertising dollars spent on radio (in thousands of dollars)
- Newspaper: advertising dollars spent on newspaper (in thousands of dollars)

### The response is:

- Sales: sales of a single product in a given market (in thousands of items)

In [5]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline


# if seaborn fails to import even though it is installed, see this post:
# http://stackoverflow.com/questions/34963703/ipython-notebook-shows-import-error-for-seaborn-even-when-package-is-installed-i

# and also, this is the post that shows how to deal with this dataset

ImportError: No module named 'seaborn'

### Cross-validation: Feature selection

**Dataset**: The advertising dataset (http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv)

**Goal**: Decide whether the Newspaper feature should be included in a linear regression model fit to the advertising dataset

In [7]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# read in the advertising dataset
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

In [17]:
# summary() is an R function; the pandas equivalent is describe()
data.describe()

In [8]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# select the Sales column as the response (y)
y = data.Sales

In [11]:
# 10-fold cross-validation with all three features

lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
 -8.17338214 -2.11409746 -3.04273109 -2.45281793]


In [12]:
# scikit-learn reports negated MSE (so that higher is better); flip the sign to recover MSE

mse_scores = -scores
print(mse_scores)

[ 3.56038438  3.29767522  2.08943356  2.82474283  1.3027754   1.74163618
  8.17338214  2.11409746  3.04273109  2.45281793]


In [13]:
# convert from MSE to RMSE

rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

[ 1.88689808  1.81595022  1.44548731  1.68069713  1.14139187  1.31971064
  2.85891276  1.45399362  1.7443426   1.56614748]


In [14]:
# calculate the average RMSE

print(rmse_scores.mean())

1.69135317081


In [16]:
# 10-fold cross-validation with two features (excluding Newspaper)

feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

1.67967484191
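
The two averages above (about 1.6914 with Newspaper, 1.6797 without) slightly favor the model that excludes Newspaper, since a lower RMSE is better. As a rough sketch, the same comparison can be run over every non-empty subset of features. The block below uses synthetic data so that it runs without network access: the column names are borrowed from the dataset, but the values, the coefficients, and the helper `cv_rmse` are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the advertising data (values are made up;
# Newspaper is deliberately given no effect on Sales).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.uniform(0, 100, size=(200, 3)),
                    columns=['TV', 'Radio', 'Newspaper'])
data['Sales'] = 3.0 + 0.05 * data.TV + 0.2 * data.Radio + rng.normal(0, 1.5, size=200)


def cv_rmse(feature_cols):
    """Average RMSE over 10-fold cross-validation (hypothetical helper)."""
    X = data[list(feature_cols)]
    y = data.Sales
    scores = cross_val_score(LinearRegression(), X, y, cv=10,
                             scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()


all_features = ['TV', 'Radio', 'Newspaper']

# evaluate every non-empty subset of the features
results = {
    subset: cv_rmse(subset)
    for r in range(1, len(all_features) + 1)
    for subset in combinations(all_features, r)
}

# print the subsets from best (lowest RMSE) to worst
for subset, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(subset, round(rmse, 3))
```

With only three features an exhaustive search costs seven model evaluations; for wider datasets the number of subsets grows exponentially and a different selection strategy would be needed.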


# Improvements to cross-validation


### Repeated cross-validation

- Repeat cross-validation multiple times (with different random splits of the data) and average the results
- More reliable estimate of out-of-sample performance by reducing the variance associated with a single trial of cross-validation
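
One way to do this in scikit-learn is `RepeatedKFold`, which reshuffles the data before each repetition. The sketch below is not from the original notebook: it uses synthetic data (an assumption, so that it runs offline) and arbitrarily picks 5 repetitions of 10-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for a 200-row, 3-feature regression problem (made-up values)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)

# 10-fold cross-validation, repeated 5 times with a different random split each time
rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=rkf,
                         scoring='neg_mean_squared_error')

# one RMSE per fold per repetition (10 * 5 = 50 values), then the overall average
rmse_scores = np.sqrt(-scores)
print(len(rmse_scores))
print(rmse_scores.mean())
```

Note that passing `cv=10` for a regressor uses unshuffled folds, so repeating that call would return identical scores every time; the repetitions only help when each one uses a different random split.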


### Creating a hold-out set

- "Hold out" a portion of the data before beginning the model building process
- Locate the best model using cross-validation on the remaining data, and test it using the hold-out set
- More reliable estimate of out-of-sample performance since hold-out set is truly out-of-sample
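
A minimal sketch of this workflow, assuming a 20% hold-out and two candidate feature subsets. The synthetic data, the `candidates` dictionary, and the `cv_rmse` helper are all hypothetical and only stand in for a real model-selection process.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a 200-row, 3-feature regression problem (made-up values)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)

# set the hold-out portion aside before any model building
X_work, X_hold, y_work, y_hold = train_test_split(X, y, test_size=0.2, random_state=1)

# candidate models, described by the column indices they use (hypothetical)
candidates = {'all three': [0, 1, 2], 'first two': [0, 1]}


def cv_rmse(cols):
    """Average 10-fold RMSE on the working data only (hypothetical helper)."""
    scores = cross_val_score(LinearRegression(), X_work[:, cols], y_work, cv=10,
                             scoring='neg_mean_squared_error')
    return np.sqrt(-scores).mean()


# choose the best candidate by cross-validation on the remaining data
cv_results = {name: cv_rmse(cols) for name, cols in candidates.items()}
best_name = min(cv_results, key=cv_results.get)
best_cols = candidates[best_name]

# refit the chosen model on all working data, then score it once on the hold-out set
final_model = LinearRegression().fit(X_work[:, best_cols], y_work)
holdout_rmse = np.sqrt(mean_squared_error(y_hold, final_model.predict(X_hold[:, best_cols])))
print(best_name, holdout_rmse)
```

The hold-out set is touched exactly once, after the model has been chosen; reusing it to compare candidates would turn it into another validation set.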


### Feature engineering and selection within cross-validation iterations

- Normally, feature engineering and selection occur before cross-validation
- Instead, perform all feature engineering and selection within each cross-validation iteration
- More reliable estimate of out-of-sample performance since it better mimics the application of the model to out-of-sample data
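
In scikit-learn this is usually done by wrapping the steps in a `Pipeline`, so that `cross_val_score` refits the selection step on each training fold. The sketch below is an illustration rather than the notebook's method: it uses synthetic data, and `SelectKBest` with `f_regression` and `k=2` is just one possible selection rule chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a 200-row, 3-feature regression problem (made-up values;
# the third feature has no effect on the response)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)

# feature selection and regression bundled into one estimator
pipe = make_pipeline(SelectKBest(score_func=f_regression, k=2), LinearRegression())

# within each of the 10 iterations, SelectKBest is fit on the training folds only,
# so the held-out fold never influences which features are kept
scores = cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores).mean()
print(rmse)

# for inspection: which features are kept when the pipeline is fit on all the data
pipe.fit(X, y)
selected = pipe.named_steps['selectkbest'].get_support()
print(selected)
```

The resulting RMSE estimates the whole procedure (selection plus fitting), which is what will actually be applied to new data.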