# Regression-1

---

**overview of regression for predictive modeling, and intro to classes**

This is the first notebook that moves into the machine learning / predictive analytics realm. I'll be operating on the assumption that you have some understanding of calculus and linear algebra, but even without that the concepts should still make sense. If you skipped the prior notebook then you missed a short rant about a package called **sklearn** that will be essential for nearly everything from here on out. TL;DR: if you get an error or you don't understand something that's going on, read the sklearn documentation or copy-paste the error into Google. You aren't the first person to get that error, and someone wrote the code to throw it, which should be comforting.

On a different tangent, there are two general types of machine learning: supervised and unsupervised. Supervised means "we know what we want to predict," and unsupervised means "eh, we have no target, so let's see what structure is hiding in the data."

Regression is supervised learning because we have a known "target" variable that we're seeking to predict. The target that we would use a regression algorithm to predict is continuous (money, weight, height), not categorical (yes, no, dog, cat). Surprisingly, this is often where people trip up: "which algorithm will minimize my MSE?" is not the first thing you want to be thinking about if your data is "person A bought a pair of Doc Martens," and your target is "are they going to buy a North Face or a leather coat?"

**Contents:**
1. OLS from scratch
2. Class on Metrics
3. In-sample vs Out-of-sample
4. Common Issues

---

### OLS from scratch

If you'd like to read a full explanation, [here's a good source](https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/13/lecture-13.pdf). The math hasn't changed in a long time, and I'll just be giving a brief overview before we code.

OLS (linear regression, drawing a line through points) is used for solving the following question: "I have a bunch of variables that explain my target, what's the optimal combination of these variables to predict that target?"

The optimal estimated coefficient for each variable is written with the Greek letter beta, with a little hat and a subscript indicating which variable it belongs to ($\hat{\beta}_i$). The full set of betas is represented by a vector, $\hat{\beta}$.

Linear algebra allows for a simple solution to OLS:

let $x$ be a matrix of feature variables and $y$ be a vector of target values for some observations

$\hat{\beta} = (x^T x)^{-1} x^T y$
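
For reference, the formula falls out of minimizing the sum of squared residuals. A short sketch of the derivation, assuming $x^T x$ is invertible (an assumption that is not stated above) and writing $S(\beta)$ for the sum of squared residuals (a symbol introduced here only for illustration):

$S(\beta) = (y - x\beta)^T (y - x\beta)$

$\frac{\partial S}{\partial \beta} = -2 x^T (y - x\beta) = 0$

$x^T x \hat{\beta} = x^T y$

Multiplying both sides of the last line by $(x^T x)^{-1}$ gives the expression for $\hat{\beta}$ above.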

It's quick, clean, and trivial to make into our own function.

In [None]:
# standard imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# numpy has linear algebra (linalg) functions built in that we will take advantage of

# small note, X is capitalized and y is lower case to distinguish that X is a matrix and y is a vector
# this stands even if X only has one feature

def ols(X, y):
    '''
    X - feature matrix
    y - target vector
    ---
    returns ols betas for X and y
    '''
    xtx = np.dot(X.T, X)
    inv_xtx = np.linalg.inv(xtx)
    xty = np.dot(X.T, y)
    
    return np.dot(inv_xtx, xty)

In [None]:
# generating some random data about tree growth rates

np.random.seed(42)

trees = np.arange(50) # 50 trees
years = np.arange(1, 11) # over 10 years

data = [(year, np.dot(year, 18) + np.random.normal(6, 6))
        for year in years
        for tree in trees]

X = np.array([d[0] for d in data])
y = np.array([d[1] for d in data])

In [None]:
# plot tree height over time

X_plt = X  # keeping separate names for the plotted values is handy for later plotting
y_plt = y

plt.figure(figsize=(10, 10))

plt.scatter(X_plt, y_plt)

plt.xlabel('Years')
plt.ylabel('Tree Height (inches)')
plt.title('Tree Growth Over Time')

plt.show()

In [None]:
# time to run our regression

X_model = np.c_[np.ones(X.shape[0]), X] # adding column of ones to fit y-intercept, again not messing with original X

betas = ols(X_model, y)
betas

In [None]:
# and plot

plt.figure(figsize=(10, 10))

plt.scatter(X_plt, y_plt) # same as before

b = lambda x: betas[0] + betas[1]*x # slope
x = np.arange(min(X), max(X)+1) # range
plt.plot(x, b(x), 'k--')

plt.xlabel('Years')
plt.ylabel('Tree Height (inches)')
plt.title('Tree Growth Over Time')

plt.show()
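
As a sanity check on the closed-form solution, the sketch below compares it against two alternatives that numpy also provides: `np.linalg.solve` (which avoids forming the inverse explicitly) and `np.linalg.lstsq`. The data and the `_demo` names are made up for this illustration and only mimic the tree example (intercept 6, slope 18, noise with standard deviation 6); they are not the notebook's variables.

```python
import numpy as np

# hypothetical demo data, loosely mimicking the tree example above
rng = np.random.default_rng(0)
x_demo = rng.uniform(1, 10, size=200)
y_demo = 6 + 18 * x_demo + rng.normal(0, 6, size=200)

# add a column of ones so the first beta is the y-intercept
X_demo = np.c_[np.ones(x_demo.shape[0]), x_demo]

# 1) the textbook formula: (X^T X)^-1 X^T y
betas_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# 2) solve the normal equations X^T X b = X^T y without an explicit inverse
betas_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# 3) numpy's least-squares routine, which works on X directly
betas_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(betas_inv)
print(betas_solve)
print(betas_lstsq)
```

On a small, well-conditioned problem like this one all three should agree to floating-point precision. The explicit inverse is kept in this notebook because it mirrors the formula; for features that are nearly collinear, `solve` or `lstsq` is generally the safer choice.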

---

### Class on Metrics

In-sample $R^2$ is a weak metric for predictive work because it never decreases when you add features, even purely random ones, so you can inflate it almost arbitrarily. If you're ever using regression for predictive analytics and your job depends on reporting $R^2$ then - before you quit for a new job - I implore you to report adjusted $R^2$, which penalizes the number of features.
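
The point about $R^2$ can be checked numerically. The sketch below fits sklearn's `LinearRegression` on one real feature plus a growing number of pure-noise columns and prints in-sample $R^2$ next to adjusted $R^2$. The data, the `adjusted_r2` helper, and the other names are assumptions made for this illustration; the adjustment used is the standard $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ with $n$ observations and $p$ features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100

# one genuinely useful feature (hypothetical demo data)
x_real = rng.uniform(1, 10, size=(n, 1))
y_demo = 6 + 18 * x_real[:, 0] + rng.normal(0, 6, size=n)

# 40 columns of pure noise that carry no information about y_demo
noise_features = rng.normal(size=(n, 40))

def adjusted_r2(r2, n_obs, n_features):
    '''hypothetical helper: penalize R^2 for the number of features used'''
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

r2_values = []
adj_values = []
for k in (0, 10, 20, 40):
    X_k = np.hstack([x_real, noise_features[:, :k]])
    model = LinearRegression().fit(X_k, y_demo)
    r2 = model.score(X_k, y_demo)  # .score returns in-sample R^2 here
    adj = adjusted_r2(r2, n, X_k.shape[1])
    r2_values.append(r2)
    adj_values.append(adj)
    print('noise features: {:2d}  R^2: {:.4f}  adjusted R^2: {:.4f}'.format(k, r2, adj))
```

In-sample $R^2$ cannot go down as columns are added, while adjusted $R^2$ is free to fall when the extra columns add nothing.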

Let's build our first Python class around a couple of metrics that we think are more meaningful.

* Mean Squared Error: average squared difference between targets and estimates, $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

* Mean Absolute Error: average absolute difference between targets and estimates, $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
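
To make the two definitions concrete, here is a tiny worked example on three made-up targets and two made-up sets of predictions (all names and numbers are hypothetical). Both sets of predictions have the same MAE, but the one with a single large miss has the higher MSE, because squaring weights big errors more heavily.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])

y_hat_a = np.array([2.0, 5.0, 10.0])  # errors: 1, 0, -3 (one big miss)
y_hat_b = np.array([1.0, 3.0, 7.0])   # errors: 2, 2, 0  (two medium misses)

def mse_demo(y, y_hat):
    # mean of squared differences
    return np.mean((y - y_hat) ** 2)

def mae_demo(y, y_hat):
    # mean of absolute differences
    return np.mean(np.abs(y - y_hat))

print(mse_demo(y_true, y_hat_a), mae_demo(y_true, y_hat_a))  # (1 + 0 + 9) / 3 and (1 + 0 + 3) / 3
print(mse_demo(y_true, y_hat_b), mae_demo(y_true, y_hat_b))  # (4 + 4 + 0) / 3 and (2 + 2 + 0) / 3
```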

In [None]:
# quick tangent:

# now that we just created regression from scratch, let's forget all of that and use sklearn

from sklearn.linear_model import LinearRegression

In [None]:
# distinguishing between class and function, Classes are Capitalized

class Metrics:
    
    '''
    metric functions using X, y, model
    X = independent feature matrix
    y = dependent vector / target
    model = predictive model
    '''
    
    def __init__(self, X, y, model):
        self.data = X
        self.target = y
        self.model = model
        self.predicted = self.model.predict(self.data)
        
    def mse(self):
        squared_errors = (self.target - self.predicted) ** 2
        return np.mean(squared_errors)
    
    def mae(self):
        absolute_errors = abs(self.target - self.predicted)
        return np.mean(absolute_errors)
        
    
# it's a short class, but it's incredibly useful for several reasons...
# it can be stored in a .py script and imported into any other code
# adding more metrics will be relatively trivial
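
As a rough illustration of how little work an extra metric takes, the sketch below adds root mean squared error through a subclass. `ExtendedMetrics` and `ConstantModel` are hypothetical names invented for this example, and the `Metrics` class is repeated in condensed form only so the block runs on its own.

```python
import numpy as np

class Metrics:
    '''condensed copy of the class above, repeated so this block is self-contained'''

    def __init__(self, X, y, model):
        self.data = X
        self.target = y
        self.model = model
        self.predicted = self.model.predict(self.data)

    def mse(self):
        return np.mean((self.target - self.predicted) ** 2)

class ExtendedMetrics(Metrics):
    '''hypothetical subclass: inherits everything and adds one more metric'''

    def rmse(self):
        # root mean squared error, in the same units as the target
        return np.sqrt(self.mse())

class ConstantModel:
    '''hypothetical stand-in for a fitted model: always predicts one value'''

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

X_demo = np.arange(4).reshape(-1, 1)
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

demo = ExtendedMetrics(X_demo, y_demo, ConstantModel(4.0))
print('mse: {}'.format(demo.mse()))    # errors are -3, -1, 1, 3, so mse = 20 / 4 = 5
print('rmse: {}'.format(demo.rmse()))  # square root of 5
```

Anything with a `predict` method works as the `model` argument, which is why the same class can wrap sklearn estimators without changes.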

In [None]:
# time for sklearn heroics

X_reg = X.reshape(-1, 1)  # sklearn expects a 2-D feature matrix, so reshape the 1-D array into a single column
# y is fine as is

lr = LinearRegression() # easy-mode
lr.fit(X_reg, y); # the trailing semi-colon stops the notebook from echoing the fitted model

In [None]:
# let's try out our class now

metrics = Metrics(X_reg, y, lr)

print('mse: {}'.format(metrics.mse()))
print('mae: {}'.format(metrics.mae()))

In [None]:
# sklearn can also abstract away our metrics class

from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
# so essentially we can create our own versions of everything, but they probably exist
# somewhere already. I would only recommend doing everything manually if you need to do
# it for some kind of private/secret entity

mse = mean_squared_error(y, lr.predict(X_reg))
mae = mean_absolute_error(y, lr.predict(X_reg))

print('mse: {}'.format(mse))
print('mae: {}'.format(mae))

---

### In-sample vs Out-of-sample

How do we know if a model is predicting our data meaningfully? Compare its performance on "training data" vs "testing data."

If you haven't already covered this: when we have data and want to build a predictive model, we typically split it randomly into three parts. A holdout set, usually 20-30% of the total data, is used only for testing a model, never for training or tuning. The remaining 70-80% is split again into training and validation data, which we use to train and tune the model (usually via K-fold cross validation).
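
A minimal sketch of that workflow with sklearn, assuming a 25% holdout and 5 folds (both arbitrary choices for illustration) and made-up data that mimics the tree example. `train_test_split` and `cross_val_score` live in `sklearn.model_selection`; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# hypothetical demo data: 50 trees measured once a year for 10 years
rng = np.random.default_rng(42)
X_all = np.repeat(np.arange(1, 11), 50).reshape(-1, 1).astype(float)
y_all = 6 + 18 * X_all[:, 0] + rng.normal(0, 6, size=X_all.shape[0])

# set the holdout data aside first; it is not touched until the very end
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42)

# K-fold cross validation on the remaining data (train on 4 folds, validate on 1, rotate)
# sklearn reports negated MSE so that "higher is better", hence the sign flip
cv_scores = cross_val_score(LinearRegression(), X_train, y_train,
                            cv=5, scoring='neg_mean_squared_error')
cv_mse = -cv_scores

# once tuning is done, fit on all training data and score the holdout set once
final_model = LinearRegression().fit(X_train, y_train)
holdout_mse = mean_squared_error(y_holdout, final_model.predict(X_holdout))

print('cv mse per fold: {}'.format(cv_mse))
print('holdout mse: {}'.format(holdout_mse))
```

If the holdout MSE sits far above the cross-validated MSE, that is a hint the model is overfit or the holdout data differs from the training data.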

In [None]:
# generate some new test data for our model
# one set of good data and one of bad

np.random.seed(42)

trees = np.arange(10) # 10 trees
years = np.arange(1, 11) # over 10 years

new_good = [(year, np.dot(year, 18) + np.random.normal(6, 6))
            for year in years
            for tree in trees]

new_bad = [(year, np.dot(year, 18) + np.random.normal(6+6, 6)) # adding 6 simulates observations off by 4 months
           for year in years
           for tree in trees]

X_good = np.array([d[0] for d in new_good])
y_good = np.array([d[1] for d in new_good])

X_bad = np.array([d[0] for d in new_bad])
y_bad = np.array([d[1] for d in new_bad])

In [None]:
# plot (copy-pasted from above)

plt.figure(figsize=(10, 10))

plt.scatter(X_plt, y_plt) # same as before
plt.scatter(X_good, y_good, c='g')
plt.scatter(X_bad, y_bad, c='r')

b = lambda x: betas[0] + betas[1]*x # slope
x = np.arange(min(X), max(X)+1) # range
plt.plot(x, b(x), 'k--')

plt.xlabel('Years')
plt.ylabel('Tree Height (inches)')
plt.title('Tree Growth Over Time')

plt.show()

In [None]:
# let's measure mse for both out-of-sample datasets

bad_mse = mean_squared_error(y_bad,
                             lr.predict(X_bad.reshape(-1, 1)))
good_mse = mean_squared_error(y_good,
                              lr.predict(X_good.reshape(-1, 1)))

print('bad: {}'.format(bad_mse))
print('good: {}'.format(good_mse))

In [None]:
# bad is clearly higher, but in the graph it doesn't look too far off
# a direct comparison will be more useful

bad_mse / mse

In [None]:
# roughly twice as high, which suggests we should verify that the new data doesn't have any errors
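
One way to see why a ratio near two is plausible: for zero-mean noise with standard deviation $\sigma$, shifting every observation by a constant $b$ moves the expected squared error from $\sigma^2$ to $\sigma^2 + b^2$. With $\sigma = 6$ and $b = 6$ that is 36 versus 72. The sketch below checks this by simulation, under the simplifying assumption that the model has recovered the true line exactly (the fitted model above only approximates it); all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 6.0   # noise standard deviation used when generating the tree data
offset = 6.0  # constant shift applied to the "bad" observations

# residuals left over if the model matched the true line exactly (an assumption)
noise = rng.normal(0, sigma, size=100_000)

mse_unbiased = np.mean(noise ** 2)            # close to sigma**2 = 36
mse_shifted = np.mean((noise + offset) ** 2)  # close to sigma**2 + offset**2 = 72
ratio = mse_shifted / mse_unbiased

print('unbiased: {:.2f}  shifted: {:.2f}  ratio: {:.2f}'.format(
    mse_unbiased, mse_shifted, ratio))
```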

---

### Common Issues

With regression problems, two things pop up commonly: outlier points and high-leverage points.

These phenomena will change prediction functions drastically, so it's usually better to identify and separate them before modeling the rest of a dataset.
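
There are many ways to identify such points; the sketch below shows two common ones, not necessarily the approach this notebook has in mind. Outliers are flagged by a robust z-score on the residuals (median and MAD, with an arbitrary cutoff of 5), and leverage is read off the diagonal of the hat matrix $X (X^T X)^{-1} X^T$. The data, the cutoff, and every name here are assumptions made for illustration.

```python
import numpy as np

# hypothetical demo data: 20 trees per year over 10 years, with 5 planted outliers
rng = np.random.default_rng(0)
x_demo = np.repeat(np.arange(1, 11), 20).astype(float)
y_demo = 6 + 18 * x_demo + rng.normal(0, 6, size=x_demo.size)
y_demo[95:100] += 200  # indices 95-99 sit at year 5

X_demo = np.c_[np.ones(x_demo.size), x_demo]  # column of ones for the intercept
betas_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
residuals = y_demo - X_demo @ betas_demo

# robust z-score: the median and MAD are barely moved by a handful of extreme points
center = np.median(residuals)
mad = np.median(np.abs(residuals - center))
robust_z = (residuals - center) / (1.4826 * mad)  # 1.4826 rescales MAD to a normal std
flagged = np.where(np.abs(robust_z) > 5)[0]       # cutoff of 5 is an arbitrary choice

# leverage: diagonal of the hat matrix, computed row by row as x_i^T (X^T X)^-1 x_i
xtx_inv = np.linalg.inv(X_demo.T @ X_demo)
leverage = np.einsum('ij,jk,ik->i', X_demo, xtx_inv, X_demo)

print('flagged outlier indices: {}'.format(flagged))
print('highest-leverage year: {}'.format(x_demo[np.argmax(leverage)]))
```

Leverage depends only on the feature values, so in a one-feature regression it is largest at the edges of the range; a point does the most damage when it is both far out in X and far off in y.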

In [None]:
# Outliers:

# let's go back to our original data (still called X and y), and change 5 points in y to be outliers

y_outlier = [i+200 if idx in np.arange(240, 245)
             else i
             for idx, i in enumerate(y)]  # enumerate returns (index, object) for every object in a list

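outlier_lr = LinearRegression().fit(X_model, y_outlier)  # fresh model, so the original lr is not refit in place

# X_model still carries the column of ones, so coef_[0] belongs to that column and coef_[1] is the slope
outlier_betas = [outlier_lr.intercept_, outlier_lr.coef_[1]]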
outlier_betas

In [None]:
# plot (copy-pasted from above)

plt.figure(figsize=(10, 10))

plt.scatter(X_plt, y_outlier) # same X, new y

b = lambda x: betas[0] + betas[1]*x # slope regular
b_o = lambda x: outlier_betas[0] + outlier_betas[1]*x # slope outlier
x = np.arange(min(X), max(X)+1) # range

plt.plot(x, b(x), 'k--') # without outliers
plt.plot(x, b_o(x), 'r--') # with

plt.xlabel('Years')
plt.ylabel('Tree Height (inches)')
plt.title('Tree Growth Over Time')

plt.show()

In [None]:
# it doesn't seem too tragic a difference, but we only changed 1% of the points
# let's compare mse with outliers to without and see what % change there is

outlier_mse = mean_squared_error(y_outlier, outlier_lr.predict(X_model))

print('mse ratio: {}%'.format(round(outlier_mse/mse * 100, 2)))

In [None]:
# High Leverage Points:

# essentially, what happens if the outliers are near one of the edges of our range?

y_leverage = [i+200 if idx in np.arange(10, 15)
             else i
             for idx, i in enumerate(y)]  # enumerate returns (index, object) for every object in a list

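leverage_lr = LinearRegression().fit(X_model, y_leverage)  # fresh model again, separate from lr and outlier_lr

# as before, coef_[0] belongs to the column of ones and coef_[1] is the slope
leverage_betas = [leverage_lr.intercept_, leverage_lr.coef_[1]]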
leverage_betas

In [None]:
# plot (copy-pasted from above)

plt.figure(figsize=(10, 10))

plt.scatter(X_plt, y_leverage) # same X, new y

b = lambda x: betas[0] + betas[1]*x # slope regular
b_l = lambda x: leverage_betas[0] + leverage_betas[1]*x # slope leverage
x = np.arange(min(X), max(X)+1) # range

plt.plot(x, b(x), 'k--') # without outliers
plt.plot(x, b_l(x), 'r--') # with

plt.xlabel('Years')
plt.ylabel('Tree Height (inches)')
plt.title('Tree Growth Over Time')

plt.show()

In [None]:
# so this looks really bad. It's way more distinct than before, but check the mse ratio

leverage_mse = mean_squared_error(y_leverage, leverage_lr.predict(X_model))

print('mse ratio: {}%'.format(round(leverage_mse/mse * 100, 2)))

The high leverage points changed the fitted line far more visibly, but by MSE the damage wasn't meaningfully worse; it was actually about 6% less bad than with the mid-range outliers. So the moral of the story is if a picture is worth a thousand words, your metrics are worth a million.