# Supervised Learning Regression

## Agenda

- What are the basic supervised learning methods for regression in scikit-learn, and how do I use them?
- How do I choose which model to use for regression?
- How do I choose the best tuning parameters for a regression model?
- How do I estimate the likely performance of my regression model on out-of-sample data?

---

In [None]:
# conventional way to import pandas
import pandas as pd

In [None]:
# read the CSV file (here from a local path) and save the results
data = pd.read_csv('Advertising.csv', index_col=0)
data.head()

## Linear regression

**Pros:** fast, no tuning required, highly interpretable, well-understood

**Cons:** unlikely to produce the best predictive accuracy (presumes a linear relationship between the features and response)

### Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the **model coefficients**. These values are "learned" during the model fitting step using the "least squares" criterion. Then, the fitted model can be used to make predictions!
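
To make the "least squares" criterion concrete, here is a minimal sketch on a small synthetic dataset (invented for illustration, not the advertising data). It finds the coefficients that minimize the sum of squared residuals with `np.linalg.lstsq` and checks that `LinearRegression` arrives at the same values. The names `X_demo`, `y_demo`, and `beta` are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data (an assumption for illustration): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 + 2 * X_demo[:, 0] - 1 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# least squares: choose beta to minimize sum((y - A @ beta)**2),
# where A is X with a leading column of ones for the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# LinearRegression solves the same minimization problem
model = LinearRegression().fit(X_demo, y_demo)

print(beta)                            # [intercept, beta_1, beta_2]
print(model.intercept_, model.coef_)   # should match the line above
```

The learned values land close to the 3, 2, and -1 used to generate the data, which is what "learning" the coefficients means here.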

## Preparing X and y using pandas

In [None]:
# create a Python list of feature names
feature_cols = ['TV', 'radio', 'newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

In [None]:
# select a Series from the DataFrame
y = data['sales']

# equivalent command that works if there are no spaces in the column name
#y = data.sales

## Splitting X and y into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
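
If the default 75/25 split is not what you want, `train_test_split` accepts a `test_size` argument. The sketch below uses toy arrays (hypothetical stand-ins for `X` and `y`) so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy arrays (hypothetical stand-ins for X and y): 200 rows, 3 features
X_toy = np.arange(600).reshape(200, 3)
y_toy = np.arange(200)

# default: 25% of the rows go to the testing set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(X_tr.shape, X_te.shape)    # (150, 3) (50, 3)

# explicit 80/20 split via test_size
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)
print(X_tr2.shape, X_te2.shape)  # (160, 3) (40, 3)
```

Setting `random_state` makes the shuffle reproducible, so the same rows end up in each set on every run.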

## Linear regression in scikit-learn

**Step 1:** Import the class you plan to use

In [None]:
# import model
from sklearn.linear_model import LinearRegression


**Step 2:** "Instantiate" the "estimator"

In [None]:
# instantiate
linreg = LinearRegression()


**Step 3:** Fit the model with data (aka "model training")

In [None]:
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

### Interpreting model coefficients

In [None]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

In [None]:
# pair the feature names with the coefficients
# (in Python 3, zip returns a lazy iterator, so wrap it in list to display the pairs)
list(zip(feature_cols, linreg.coef_))

$$y = 2.88 + 0.0466 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$$

How do we interpret the **TV coefficient** (0.0466)?

- For a given amount of Radio and Newspaper ad spending, **a "unit" increase in TV ad spending** is associated with a **0.0466 "unit" increase in Sales**
- Or more clearly: For a given amount of Radio and Newspaper ad spending, **an additional \$1,000 spent on TV ads** is associated with an **increase in sales of 46.6 items**

Important notes:

- This is a statement of **association**, not **causation**
- If an increase in TV ad spending were associated with a **decrease** in sales, $\beta_1$ would be **negative**
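
The "unit increase" reading of a coefficient can be checked directly. In this sketch, which uses synthetic spending data invented for illustration, two observations differ by exactly one unit in the first feature, and the gap between their predictions equals that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic spending data (invented for illustration, not the advertising data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = (1.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=50))

model = LinearRegression().fit(X_demo, y_demo)

# two observations that differ only by one unit in the first feature
base = np.array([[50.0, 20.0, 10.0]])
bumped = base.copy()
bumped[0, 0] += 1.0

diff = model.predict(bumped)[0] - model.predict(base)[0]
print(diff, model.coef_[0])   # the two values agree
```

This holds for any starting point, because a linear model has the same slope for a feature everywhere.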

**Step 4:** Predict the response for new observations



In [None]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an **evaluation metric** in order to compare our predictions with the actual values!

## Model evaluation metrics for regression

Evaluation metrics for classification problems, such as **accuracy**, are not useful for regression problems, because a continuous prediction will rarely match the actual value exactly.

Instead, we need evaluation metrics designed for comparing continuous values.

Let's create some example numeric predictions, and calculate **three common evaluation metrics** for regression problems:

In [None]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [None]:
# calculate RMSE by hand
import numpy as np
from sklearn import metrics
print(np.sqrt((10**2 + 0**2 + 20**2 + 10**2)/4.))

# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

In [None]:
# calculate MAE by hand
print((10 + 0 + 20 + 10)/4.)

# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
# calculate MSE by hand
print((10**2 + 0**2 + 20**2 + 10**2)/4.)

# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average absolute error
- **MSE** "punishes" larger errors, because each error is squared before averaging
- **RMSE** also "punishes" larger errors, and is interpretable in the "y" units
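
To see what "punishing" larger errors means, compare two hypothetical sets of predictions with the same total absolute error: one spreads it evenly, the other concentrates it in a single large miss. The values below are made up for this illustration.

```python
import numpy as np
from sklearn import metrics

actual = [100, 50, 30, 20]
pred_even = [90, 40, 40, 30]     # four errors of 10
pred_spiky = [100, 50, 30, 60]   # a single error of 40

results = {}
for name, p in [('even', pred_even), ('spiky', pred_spiky)]:
    mae = metrics.mean_absolute_error(actual, p)
    rmse = np.sqrt(metrics.mean_squared_error(actual, p))
    results[name] = (mae, rmse)
    print(name, mae, rmse)
```

Both sets have an MAE of 10, but the RMSE doubles from 10 to 20 for the set with the single large miss.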

### Computing the RMSE for our Sales predictions

In [None]:
import numpy as np
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

## Cross-validation example: feature selection

**Goal**: Select whether the Newspaper feature should be included in the linear regression model on the advertising dataset

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import numpy as np

# 10-fold cross-validation with all three features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

In [None]:
# the scores are negated MSE values: flip the sign, take the square root, then average over the folds
print(np.sqrt(-scores).mean())

In [None]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'radio']
X = data[feature_cols]
print(np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())
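
The `'neg_mean_squared_error'` scoring returns the MSE with its sign flipped, because scikit-learn scorers follow a "higher is better" convention. The sign flip, square root, and averaging can be wrapped in a small helper. `cv_rmse` below is a hypothetical name, not a scikit-learn function, and the data is synthetic, with a third column that is pure noise to mimic a feature that may not help.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_rmse(model, X, y, cv=10):
    """Mean RMSE across folds (hypothetical helper, not part of scikit-learn)."""
    neg_mse = cross_val_score(model, X, y, cv=cv,
                              scoring='neg_mean_squared_error')
    return np.sqrt(-neg_mse).mean()

# synthetic data: the third column is pure noise, unrelated to the response
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200)

rmse_all = cv_rmse(LinearRegression(), X_demo, y_demo)
rmse_two = cv_rmse(LinearRegression(), X_demo[:, :2], y_demo)
print(rmse_all, rmse_two)
```

Since RMSE is an error to minimize, the feature set with the lower cross-validated RMSE is the one to prefer.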

## Exercise

Use linear regression to predict the average price (`AveragePrice`) in the avocado dataset

1. Encode categorical variables
2. Scale your data
3. Fit the model
4. Predict the price on the testing set
5. Compute RMSE


In [None]:
# load the avocado dataset
Avocado_data = pd.read_csv('avocado.csv', index_col=0)

# split into features and target; Date is dropped because it is not numeric
Avocado_data_features = Avocado_data.drop(['AveragePrice', 'Date'], axis=1)
Avocado_data_target = Avocado_data['AveragePrice']

# encode the categorical columns as integer codes
# (note: integer codes impose an arbitrary ordering on the categories)
from sklearn.preprocessing import LabelEncoder

region_encoder = LabelEncoder()
Avocado_data_features['region'] = region_encoder.fit_transform(Avocado_data_features['region'])
type_encoder = LabelEncoder()
Avocado_data_features['type'] = type_encoder.fit_transform(Avocado_data_features['type'])




In [None]:

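# split into training and testing sets before scaling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Avocado_data_features, Avocado_data_target, random_state=1)

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training set only, then apply it to both sets,
# so that no information from the testing set leaks into training
scaler = StandardScaler().fit(X_train)
standardize_Xtrain = scaler.transform(X_train)
standardize_Xtest = scaler.transform(X_test)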

In [None]:
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(standardize_Xtrain, y_train)

In [None]:
# make predictions on the testing set
y_pred = linreg.predict(standardize_Xtest)

In [None]:
import numpy as np
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
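
As an alternative sketch, the encoding, scaling, and fitting steps can be chained in a single `Pipeline`. Note that this swaps the integer label encoding used above for one-hot encoding, which avoids implying an order among categories. The frame and its column names (`volume`, `kind`, `price`) are hypothetical and synthetic, not the avocado data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# small synthetic frame (hypothetical columns, not the avocado data)
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    'volume': rng.uniform(1e3, 1e5, size=n),
    'kind': rng.choice(['conventional', 'organic'], size=n),
})
df['price'] = (1.0 + 0.5 * (df['kind'] == 'organic')
               + rng.normal(scale=0.1, size=n))

X_demo = df[['volume', 'kind']]
y_demo = df['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['volume']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['kind']),
])
pipe = Pipeline([('prep', preprocess), ('linreg', LinearRegression())])

# fitting the pipeline fits the preprocessing on the training data only
pipe.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))
print(rmse)
```

Because the preprocessing lives inside the pipeline, the same object can be passed to `cross_val_score`, and the scaler and encoder are then refit within each fold.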