## Class 5 Agenda:
  * **Brief introduction to machine learning**
  * **Linear Regression**
  * **Model Evaluation using Train/Test split**
  * **Logistic Regression** (next notebook)

### Brief introduction to Machine Learning

Up to this point, we've just been exploring data. Generating statistical summaries of different rows and columns in our datasets is useful and helps us answer certain kinds of questions about historical trends we see, but it gets us nowhere when we want to **predict** something useful using our data in the future. We've been doing description, but now we are going to move on to prediction.

Before we get started with actual machine learning, we need to understand some of the language used to describe the classes of problems that machine learning can be used to solve.

### Major types of machine learning:
|             | **Supervised**     | **Unsupervised**      |
|-------------|----------------|-------------------|
| **Continuous**  | Regression     | Clustering, PCA   |
| **Categorical** | Classification | Association Rules |

In the machine learning universe the most common tasks are:
  * **Supervised learning problems** involve constructing an accurate model that can **predict some kind of outcome when past data has labels for those outcomes** ([supervised learning Wikipedia page](https://en.wikipedia.org/wiki/Supervised_learning)).
  * **Unsupervised learning problems** involve constructing models when labels on historical data are unavailable ([unsupervised learning Wikipedia page](https://en.wikipedia.org/wiki/Unsupervised_learning)).

Today we will be talking about two **supervised learning** approaches.

Within the universe of supervised machine learning problems, there exist two distinct classes of problems.

These classes are based completely on the kind of value or values we are trying to predict:
  * A **classification problem** is a **supervised learning problem** where the objective is to learn to predict a categorical value.
  * A **regression problem** is a **supervised learning problem** where the objective is to learn to predict a continuous value.

We will start with learning about **linear regression**, a machine learning modeling approach that has classically been used for **regression** problems.

### Linear Regression Intro

Linear regression has been used extensively for a whole myriad of distinct regression problems in the past. 

Linear regression is the first model that we will learn because:
  * it is widely used
  * it is very quick and easy to set up
  * a trained linear regression model is very easy to understand
  
Really, it's critical to understand linear regression because it is the foundational machine learning modeling approach on which many other methods are based.

By the end of this notebook you will:

- Have a working conceptual understanding of linear regression and become familiar with some key terminology
- Be able to apply linear regression to a machine learning problem using scikit-learn
- Be able to interpret linear regression model coefficients
- Be able to apply three different evaluation metrics for regression
- Be able to use train/test split to estimate model performance on unseen data 
- Be able to articulate the strengths and weaknesses of linear regression

We will be using one of the most widely used machine learning libraries in **Python**, [scikit-learn](http://scikit-learn.org/stable/), which has all of the functionality we will need to explore linear regression.

Let's get started by importing all of the functionality we will need for this lesson:

In [None]:
# Python 2 and 3 compatibility
from __future__ import print_function

# Data handling/modeling
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn import metrics
import scipy.stats as stats

# Visualization
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

### Linear regression on one variable (simple linear regression)

Linear regression on one variable (simple linear regression) is an approach for predicting a **continuous response** using a **single feature**. We assume our data can be represented with a function that looks like this:

$y_i = \beta_0 + \beta_1x_i + \epsilon_i$

- $y_i$ is the response of a single observation
- $x_i$ is the value of the feature for that observation
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x$
- $\epsilon_i$ is the error for that observation

$\beta_0$ and $\beta_1$ are called the **model coefficients**. They are the values we are going to estimate, or learn, from our data to find the best fit model.

$\beta_0$ is also called the bias (it's an offset), and is equivalent to the y-intercept of the model.

(remember $y = mx + b$ from school? $m=\beta_1$ and $b=\beta_0$ here).

So, **our model must "learn" the values of these coefficients, and once we've learned these coefficients, we can use the model to predict the response (mpg, for the car data used later in this notebook).**

Let's examine how this happens by looking at some simple data:

In [None]:
height = np.array([64, 69, 72, 73, 74, 76])
shoe = np.array([8,  9, 10, 11, 12, 13])

In [None]:
lineX = [7, 13]
lineY = [62, 78]
plt.plot(lineX, lineY, 'r')
plt.plot(shoe, height, 'bo')
plt.xlabel("Shoe Size")
plt.ylabel("Height")
plt.xlim(7.5,13.5)
plt.ylim(62,78)
plt.show()

Clearly this is not a perfect fit, and in practice it never will be. But we need a way to determine which line is best. We can do this by calculating the sum of the squared errors:

In [None]:
upper_error = [0.75,0,0,0,1.35,2]
lower_error = [0,1.5,2,.35,0,0]
asymmetric_error = [lower_error, upper_error]
plt.plot(lineX, lineY, 'r')
plt.errorbar(shoe, height, yerr=asymmetric_error, fmt='bo')
plt.xlabel("Shoe Size")
plt.ylabel("Height")
plt.xlim(9.5,12.5)
plt.ylim(69,76)
plt.show()

### Estimating ("learning") simple linear regression model coefficients

Coefficients are estimated during the model fitting process using the **least squares criterion**.

We will find the line which minimizes the **sum of squared errors**:

$\textrm{Total error} = \sum\limits_{\textrm{all points}} (y_{actual} - y_{prediction})^2 = \sum\limits_{\textrm{i}} (y_{i} - (\beta_0 + \beta_1x_i))^2$
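
As a quick check of this formula, here is a minimal sketch that computes the total error for the hand-drawn red line plotted earlier. It assumes the line runs exactly through the two endpoints used in that plot, (7, 62) and (13, 78); the variable names are only for illustration.

```python
import numpy as np

# Same toy data as in the plots above
shoe = np.array([8, 9, 10, 11, 12, 13], dtype=float)
height = np.array([64, 69, 72, 73, 74, 76], dtype=float)

# The red line was drawn through (7, 62) and (13, 78)
slope = (78.0 - 62.0) / (13.0 - 7.0)
intercept = 62.0 - slope * 7.0

# Residual = actual value minus the value the line predicts
predicted_height = intercept + slope * shoe
residuals = height - predicted_height

sse_hand_drawn = np.sum(residuals ** 2)
print("Sum of squared errors for the hand-drawn line:", sse_hand_drawn)
```

Any other candidate line can be scored the same way; the line with the smallest sum is the least squares fit.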


This is now a calculus problem - finding the values that minimize an equation.
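
As a sketch of that calculus step (a standard textbook result, written here with $\bar{x}$ and $\bar{y}$ for the sample means, which are not defined elsewhere in this notebook): setting both partial derivatives of the total error to zero gives

$$\frac{\partial}{\partial \beta_0} \sum_i (y_i - (\beta_0 + \beta_1x_i))^2 = -2\sum_i (y_i - (\beta_0 + \beta_1x_i)) = 0$$

$$\frac{\partial}{\partial \beta_1} \sum_i (y_i - (\beta_0 + \beta_1x_i))^2 = -2\sum_i x_i\,(y_i - (\beta_0 + \beta_1x_i)) = 0$$

Solving these two equations together yields

$$\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x}$$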

One way to do this is called [Gradient Descent](https://en.wikipedia.org/wiki/Gradient_descent).
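
To make the idea concrete, here is a minimal gradient descent sketch on the shoe size data. The learning rate, the number of iterations, and the choice to center the feature are assumptions picked for this toy example, not settings taken from any library. It minimizes the mean of the squared errors, which has the same minimizer as the sum.

```python
import numpy as np

shoe = np.array([8, 9, 10, 11, 12, 13], dtype=float)
height = np.array([64, 69, 72, 73, 74, 76], dtype=float)

# Centering the feature puts both gradient directions on similar scales,
# so a single learning rate works for both coefficients.
shoe_mean = shoe.mean()
shoe_centered = shoe - shoe_mean

b0, b1 = 0.0, 0.0      # starting guesses
learning_rate = 0.1    # hypothetical value chosen for this toy data
for _ in range(500):
    residuals = height - (b0 + b1 * shoe_centered)
    # Gradients of the mean squared error with respect to b0 and b1
    grad_b0 = -2.0 * residuals.mean()
    grad_b1 = -2.0 * (residuals * shoe_centered).mean()
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

# Undo the centering to express the line in the original shoe size units
slope = b1
intercept = b0 - b1 * shoe_mean
print("gradient descent:", intercept, slope)

# Cross-check against NumPy's built-in least squares line fit
ls_slope, ls_intercept = np.polyfit(shoe, height, 1)
print("np.polyfit:      ", ls_intercept, ls_slope)
```

The two printed lines should agree closely, which suggests the iterative updates have settled on the least squares solution.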


Let's look at some real data.

### Reading in the car dataset

Now that we've got all of the libraries we need, let's get some data to work with.

This data comes from the famous [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/):

In [None]:
# read data into a DataFrame
data = pd.read_csv("../data/auto_mpg_data.csv")
data.head()

In [None]:
data.info()

Each row in this dataset, called an **observation**, represents **one car model** (392 models in the dataset).

Our goal will be to try to build a model that, when given some features describing a car model, can accurately predict the expected mpg of the vehicle.

What are the **features**? (What data can we use to generate our prediction?)

- **cylinders:** The number of cylinders in the model (numeric discrete)
- **displacement:** [engine displacement](https://en.wikipedia.org/wiki/Engine_displacement) (continuous)
- **horsepower:** horsepower of the model (continuous)
- **weight:** total weight of the car (continuous)
- **acceleration:** The vehicle acceleration rate of the model (continuous)

What is the **response**? (What are we trying to predict?)

- **mpg:** approximate miles per gallon of the model (continuous)

### You should immediately have questions about the data

1. Is there a relationship between any of the properties of the car models in our dataset?
2. How strong is that relationship?
3. Do any of the properties of the cars seem to relate to its mpg?
4. What is the effect of each car attribute on mpg?

### Visualize relationships among the features and the outcome

The quickest way to see whether any of the features correlate with your response is to use a **scatter plot** to visualize them:

In [None]:
# `height` replaces the `size` argument used by older seaborn versions
sns.pairplot(data, x_vars=['cylinders','displacement','horsepower'], y_vars='mpg', height=6, aspect=0.8);

Here we just tried to see if there was any relationship between cylinders/displacement/horsepower and mpg for each feature by itself. Looks like all 3 are negatively correlated with mpg.

If we wanted to see what the simple linear regression on each feature by itself looks like (we will get to what that actually is shortly), we can plot a regression line:

In [None]:
# kind='reg' overlays a fitted regression line on each scatter plot
sns.pairplot(data, x_vars=['cylinders','displacement','horsepower'], y_vars='mpg', height=6, aspect=0.8, kind='reg');

You can also use a **correlation matrix** to compute the pairwise correlations between all numeric variables.

Let's first just compute and inspect the correlation matrix:

In [None]:
auto_correlations = data.corr()
auto_correlations

In [None]:
sns.heatmap(auto_correlations);

Ok, enough exploring, let's get to building some models.

We will use `scikit-learn` for the first time here and work through the process of training a scikit-learn model.

Here's a link to the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
# These were already imported above, but I'm repeating them here for clarity
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# create X and y
feature_cols = ['displacement']
X = data[feature_cols]
y = data.mpg

# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

# print the coefficients
print("The y intercept:", linreg.intercept_)
print("The single coefficient:", linreg.coef_)

Ok, so what did we do here?

1. We created a matrix `X` that held our features and a vector `y` that held our response variable across all the observations in our dataset.
2. We then instantiated (created) a `LinearRegression` model. 
3. However, that model was initially untrained (didn't have our data fit to it). In order to do that, we had to call the `fit` method of the `LinearRegression` object `linreg` we had just created using our features `X` and outcome `y` as input parameters.
4. After we called `fit` on our model, the simple linear regression model was fit. Following this, we inspected our two coefficients.
  
Just to hammer all of this home, let's take a look at what this "visually" looks like using `pairplot`:

In [None]:
sns.pairplot(data, x_vars=['displacement'], y_vars='mpg', height=7, aspect=1.2, kind='reg')
# start both axes at zero; None leaves the upper limit automatic
plt.xlim(0, None)
plt.ylim(0, None)

### How to interpret model coefficients

So, now that we've got a fitted linear regression model, how do we interpret the displacement coefficient ($\beta_1$)?

A "unit" increase in displacement is **associated with** a -0.06 change in mpg.

Keep in mind that this is not a statement of **causation**, but of **correlation**.

What would the coefficient look like if an increase in displacement was associated with an **increase** in mpg?

### Using the model for prediction

Let's say that there was a new car model that had a displacement of 250. What would we predict the mpg of that model to be?

$$y = \beta_0 + \beta_1 x$$
$$y = 35.12 - 0.06 \times 250$$

In [None]:
# manually calculate it and confirm with the plot we created above. Does this value make sense?
35.12 - 0.06 * 250

In [None]:
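# predict for a new observation, here where the displacement is 250
# predict expects a 2D input with one row per observation and one column per feature
linreg.predict(pd.DataFrame({'displacement': [250]}))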

So, we would predict mpg of **~20.1** for a model with a displacement of 250.

### Does the scale of the features matter?

Let's say that displacement was measured in cubic millimeters, rather than cubic centimeters. How would that affect the model?

In [None]:
data['displacement_cml'] = data.displacement * 1000
data.head()

In [None]:
# create X and y
feature_cols = ['displacement_cml']
X_2 = data[feature_cols]
y = data.mpg

# instantiate and fit
linreg2 = LinearRegression()
linreg2.fit(X_2, y)

# print the coefficients
print(linreg2.intercept_)
print(linreg2.coef_)

How do we interpret the new displacement_cml coefficient ($\beta_1$)?

- A "unit" increase in displacement (in cubic mm) is **associated with** a 6e-5 decrease in mpg.
- Which is equivalent to what we found in the first model.

In [None]:
# predict for a new observation (same car, displacement expressed in the new units)
linreg2.predict(pd.DataFrame({'displacement_cml': [250 * 1000]}))

So what does this mean?

**The scale of the features is irrelevant for linear regression models, since it will only affect the scale of the coefficients, and we simply change our _interpretation_ of the coefficients**
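
The same point can be checked without the car data. The sketch below uses made-up data and `np.polyfit` as a stand-in for `LinearRegression` (both perform an ordinary least squares fit for a single feature); the data values and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: a noisy downward trend
x = rng.uniform(70, 450, size=50)
y = 35.0 - 0.06 * x + rng.normal(0, 2, size=50)

# Fit once on the original feature and once on the feature multiplied by 1000
slope, intercept = np.polyfit(x, y, 1)
slope_scaled, intercept_scaled = np.polyfit(x * 1000, y, 1)

print("slope ratio (original / rescaled):", slope / slope_scaled)  # approximately 1000
print("intercepts:", intercept, intercept_scaled)

# Predictions for the same car agree once the input uses matching units
pred = intercept + slope * 250
pred_scaled = intercept_scaled + slope_scaled * (250 * 1000)
print("predictions:", pred, pred_scaled)
```

Only the slope changes (by the same factor of 1000, in the opposite direction), while the intercept and the predictions stay the same up to floating point error.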

### How well does the model fit the data?

R-squared is a very common way to evaluate the overall fit of a linear model.
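R-squared is defined as the **proportion of variance explained**, meaning the proportion of variance in the observed data that is explained by the model. For a linear regression with an intercept, evaluated on the data it was fit to, this value is between 0 and 1, and higher values indicate a better fit. (On new data, R-squared can be negative if the model predicts worse than simply using the mean.)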

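Because this is a simple linear regression (one feature), R-squared equals the square of the Pearson correlation coefficient between the feature and the response. So we can draw a fancy jointplot, compute the pearson-r coefficient, and square it: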

In [None]:
# recent seaborn versions require keyword arguments here
sns.jointplot(x='displacement', y='mpg', data=data, kind="reg")
print("R^2:", stats.pearsonr(X.values.flatten(),y.values)[0]**2)

Let's confirm the R-squared value for our simple linear model using `scikit-learn`'s prebuilt R-squared scorer:

In [None]:
y_pred = linreg.predict(X)
metrics.r2_score(y, y_pred)

Two things to keep in mind when using R-squared:
  * The threshold for a **"good" R-squared value** is highly dependent on the particular domain.
  * R-squared is more useful as a tool for **comparing models**.
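
To see where the number comes from, here is a minimal sketch that computes R-squared directly from its definition (one minus the residual sum of squares over the total sum of squares) on the shoe size data from earlier, and compares it with the squared Pearson correlation. `np.polyfit` is used as a stand-in for the fitted model; the variable names are illustrative.

```python
import numpy as np

shoe = np.array([8, 9, 10, 11, 12, 13], dtype=float)
height = np.array([64, 69, 72, 73, 74, 76], dtype=float)

# Least squares line and its predictions
slope, intercept = np.polyfit(shoe, height, 1)
predicted = intercept + slope * shoe

ss_res = np.sum((height - predicted) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((height - height.mean()) ** 2)  # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

pearson_r = np.corrcoef(shoe, height)[0, 1]
print("R^2 from the definition:", r_squared)
print("Pearson r squared:      ", pearson_r ** 2)
```

The two values match for a one-feature model; with several features, only the definition based on sums of squares applies.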

### Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features, which is called **multiple linear regression**:

$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$

Each $x$ represents a different feature, and each feature has its own coefficient:

$y = \beta_0 + \beta_1 \times acceleration + \beta_2 \times displacement + \beta_3 \times horsepower$

In [None]:
# create X and y except now with more columns in X
mult_feature_cols = ['acceleration', 'displacement', 'horsepower']
X_mult = data[mult_feature_cols]
y_mult = data.mpg

# instantiate and fit like last time
multiple_linreg = LinearRegression()
multiple_linreg.fit(X_mult, y_mult)

coeffs = multiple_linreg.coef_
intercept =  multiple_linreg.intercept_
# print the coefficients like last time
print(intercept)
print(coeffs)

In [None]:
# pair the feature names with the coefficients
# in Python 3, zip returns an iterator, so wrap it in list() to display the pairs
list(zip(mult_feature_cols, multiple_linreg.coef_))

With this model we can interpret the coefficients as follows:

  * For a fixed amount of acceleration and engine displacement, an increase of 1 unit in **horsepower** is associated with a **decrease in mpg of ~.09**.
  * For a fixed amount of displacement and horsepower, an increase of 1 unit in **acceleration** is associated with a **decrease in mpg of ~.41**.
  * For a fixed amount of acceleration and horsepower, an increase of 1 unit in **displacement** is associated with a **decrease in mpg of ~.04**.
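
The "holding the other features fixed" reading can be checked numerically. The sketch below fits a three-feature model on made-up data with `np.linalg.lstsq` (a stand-in for `LinearRegression`); the features, coefficients, and the helper `predict_row` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical features and coefficients, made up for illustration
features = rng.normal(size=(n, 3))
true_coefs = np.array([-0.4, -0.04, -0.09])
response = 40.0 + np.dot(features, true_coefs) + rng.normal(0, 0.5, size=n)

# Prepend a column of ones so the first fitted value is the intercept
design = np.column_stack([np.ones(n), features])
fitted = np.linalg.lstsq(design, response, rcond=None)[0]
intercept, coefs = fitted[0], fitted[1:]

def predict_row(row):
    """Prediction of the fitted linear model for one observation."""
    return intercept + np.dot(row, coefs)

base = np.array([1.0, 2.0, 3.0])
bumped = base + np.array([0.0, 0.0, 1.0])  # raise only the third feature by 1

print("third coefficient:   ", coefs[2])
print("change in prediction:", predict_row(bumped) - predict_row(base))
```

The change in the prediction equals the third coefficient, which is exactly what the interpretation above says.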

Does this model have a better R<sup>2</sup> value?

In [None]:
y_mult_pred = multiple_linreg.predict(X_mult)
metrics.r2_score(y_mult, y_mult_pred)

#### Exercise Time
* Create the multiple regression when you use every variable except for mpg to predict mpg.
* What is this new R<sup>2</sup> value?

In [None]:
pass

### Evaluation metrics for regression problems

In order to evaluate how good a given regression model is, we need evaluation metrics designed for comparing **continuous values**. We will cover 3 common evaluation metrics for regression models here.

Let's create some example numeric predictions, and calculate the three most common evaluation metrics for regression problems:

In [None]:
# define true and predicted response values
fake_y_true = [101, 40, 30, 20]
fake_y_pred = [90, 50, 50, 30]

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors/residuals:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

In [None]:
print("MAE for fake data:",metrics.mean_absolute_error(fake_y_true, fake_y_pred))

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

In [None]:
print("MSE for fake data:",metrics.mean_squared_error(fake_y_true,fake_y_pred))

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [None]:
print("RMSE for fake data:",np.sqrt(metrics.mean_squared_error(fake_y_true, fake_y_pred)))

Let's compare these metrics in terms of their usefulness/interpretability:
  * **MAE** is the easiest to understand, because it's the average error.
  * **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
  * **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are what are called **loss functions**, because we want to minimize the **loss** (from getting stuff wrong).
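
For reference, the three formulas above can also be computed by hand with NumPy. This is a minimal sketch on the same fake data, meant only to show what the `metrics` functions are calculating.

```python
import numpy as np

fake_y_true = np.array([101, 40, 30, 20])
fake_y_pred = np.array([90, 50, 50, 30])

errors = fake_y_true - fake_y_pred   # 11, -10, -20, -10

mae = np.mean(np.abs(errors))        # mean of the absolute errors
mse = np.mean(errors ** 2)           # mean of the squared errors
rmse = np.sqrt(mse)                  # back in the units of y

print("MAE:", mae)    # 12.75
print("MSE:", mse)    # 180.25
print("RMSE:", rmse)  # about 13.43
```

Note how the single error of 20 contributes 400 of the 721 total squared error, which is why MSE and RMSE react more strongly to large misses than MAE does.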

#### Exercise Time
  * Calculate the MAE/MSE/RMSE of the simple linear regression model
  * Calculate the MAE/MSE/RMSE of the 3 feature multiple regression model
  * Calculate the MAE/MSE/RMSE of the model using all of the features
  * What do you notice about all of these metrics as you keep adding features?

In [None]:
pass

### Using train/test split for model evaluation

How do we know that our model will perform well on new data?

Sure, we may know that our model has really low RMSE on all of the data we have on hand, but can we be sure that it will be exactly the same when we try to use our model in the real world?

One way we can get an estimate of how the model will perform "in the wild" is by building the model on a portion of our data, and then testing it on the remainder that we have.

So, we **act like we have one set of data for model building, and keep a separate set of data and treat it as if it were new.** We then test our model on this "new" data, and, **as long as the test data was taken in an unbiased way**, we can assume that the **loss** on the test data gives us a pretty good idea of what the error "in the wild" will be.
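
Conceptually, the split is nothing more than shuffling the rows and setting some aside. Here is a minimal sketch of that idea with NumPy on made-up data; the array names and the 30% fraction are assumptions for illustration, and scikit-learn's own helper may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 10 observations, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Shuffle the row indices so the held-out rows are chosen at random
shuffled = rng.permutation(len(y_demo))
n_test = int(round(0.3 * len(y_demo)))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Use the same indices for X and y so features stay aligned with responses
X_demo_train, y_demo_train = X_demo[train_idx], y_demo[train_idx]
X_demo_test, y_demo_test = X_demo[test_idx], y_demo[test_idx]

print("training rows:", X_demo_train.shape[0], "test rows:", X_demo_test.shape[0])
```

The key property is that no row appears in both sets, so the test rows really are unseen by the model.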

So, let's try to use train/test split to estimate the model's accuracy on unseen data.

The basic approach would be to randomly select a fraction of the data (>50% usually) for training, and the remainder (100-training%) for testing. We will use scikit-learn's `train_test_split` function to do this:

In [None]:
X_mult_train, X_mult_test, y_mult_train, y_mult_test = train_test_split(X_mult, y_mult, test_size=0.3, random_state=1)
print("training data size:",X_mult_train.shape)
print("testing data size:",X_mult_test.shape)

Now, we simply train on `X_mult_train` and `y_mult_train` and then generate predictions and evaluation metrics on `X_mult_test` and `y_mult_test`:

In [None]:
#train on training set
mult_linreg2 = LinearRegression()
mult_linreg2.fit(X_mult_train, y_mult_train)

#generate predictions on training set and evaluate
y_mult_pred_train = mult_linreg2.predict(X_mult_train)
print("Training set RMSE:",np.sqrt(metrics.mean_squared_error(y_mult_train, y_mult_pred_train)))

#generate predictions on test set and evaluate
y_mult_pred_test = mult_linreg2.predict(X_mult_test)
print("Test set RMSE:",np.sqrt(metrics.mean_squared_error(y_mult_test, y_mult_pred_test)))

Notice that the test set error is greater than the training set error. This should almost always be the case (why?).

#### Exercise Time
  * Get MAE/MSE/RMSE training and test set predictions on the full linear regression model (using all features) with a test set of 30% of the data
  * Get MAE/MSE/RMSE training and test set predictions on the full linear regression model (using all features) with a test set of 20% of the data
  * Get MAE/MSE/RMSE training and test set predictions on the full linear regression model (using all features) with a test set of 10% of the data
  * Anything you notice about the test set error metrics?

In [None]:
pass

### Overfitting and Underfitting

How do we know when a model will perform well on new data? What is happening when a model doesn't generalize well?

Let's create a new data set with many features.

Here are a few new sklearn features:

[PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) takes feature columns and generates new columns that are products of the features, up to the degree you specify. For 1 feature it generates x, x<sup>2</sup>, x<sup>3</sup>, ...

[Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) is a convenience class that allows you to define a sequence of data transformations followed by a model, and then apply the whole sequence to any data.

The following code is adapted from [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html).

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn import model_selection  # replaces the removed sklearn.cross_validation module

np.random.seed(0)

n_samples = 30
# Try different values for degree: 1, 5, 15
degree = 1

# Here's the data we are trying to fit
true_fun = lambda X: np.cos(1.5 * np.pi * X)
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(10, 6))
polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
linear_regression = LinearRegression()

pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", linear_regression)])
pipeline.fit(X[:, np.newaxis], y)

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X, y, label="Samples")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim((0, 1))
plt.ylim((-2, 2))
plt.legend(loc="best")
plt.title("Polynomial degree {}".format(degree))
plt.show()

### Model Complexity

**This is general, not specific to linear regression!**

If the model is not complex enough -> then it doesn't really capture the behavior of the underlying function we are trying to learn. This is a high **bias** situation and the model is **underfit**.

If the model is too complex -> then it has learned too much. It is "fit to the noise" and not the underlying function. This is called high **variance** or **overfit**.

Bias and variance are two sources of error. A well-fit model balances the two rather than driving one down at the expense of the other. There are several ways to test and combat these types of errors.

1. Underfit: Build more features.
2. Overfit: Reduce number of features, get more data.
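
One way to test for these situations is to compare the error on the training data with the error on held-out data. The sketch below does this for polynomial degrees 1, 5, and 15 on a noisy cosine like the one plotted above. It uses `np.polyfit` instead of the scikit-learn pipeline, and the sample sizes and seed are assumptions; the exact numbers depend on the random draw, and NumPy may warn that the degree 15 fit is poorly conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fun(x):
    return np.cos(1.5 * np.pi * x)

# Separate noisy samples for fitting and for evaluation
x_train = rng.uniform(0, 1, size=30)
y_train = true_fun(x_train) + rng.normal(0, 0.1, size=30)
x_test = rng.uniform(0, 1, size=30)
y_test = true_fun(x_test) + rng.normal(0, 0.1, size=30)

train_mse, test_mse = {}, {}
for degree in (1, 5, 15):
    # Fit a polynomial of this degree to the training data only
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print("degree", degree,
          "train MSE:", train_mse[degree],
          "test MSE:", test_mse[degree])
```

A degree 1 model should show high error on both sets (underfit). For very high degrees, the training error typically keeps shrinking while the test error stops improving or gets worse, which is the signature of overfitting.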

We'll also talk about model specific ways to control model complexity.

![Test](../images/highvariance.png)

### Summary of linear regression and comparison with other models (you will see in the future)

There are some obvious advantages to linear regression models:
  * These kinds of models are very simple to explain
  * They are highly interpretable
  * Model training and prediction is very fast
  * Features do not need to be scaled (we will talk about feature scaling later)
  * They can perform well with a small number of observations

However, linear regression also has some significant disadvantages:
  * It assumes a linear relationship between the features and the outcome. In practice, this is rarely exactly true.
  * Performance is (generally) not competitive with the best supervised learning methods
  * When you have lots of features, this approach can become sensitive to useless features
  * This approach can't automatically learn feature interactions (although you can code them into a linear regression; we will show you how to do that soon!)