In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import random

import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

Let's use the tools we've learned so far, and the one model type we've learned, to see how the modeling process works. In other words:

# You made a model, now what?

### Lab Two Questions?

Note the difference between the two models:

$$
Y_{mass} = \beta_0 + \beta_1 X_{flipper} + \beta_2 X_{is\_male}
$$

$$
Y_{mass} = \beta_0 + \beta_1 X_{flipper} + \beta_2 X_{is\_male} + \beta_3 X_{flipper} X_{is\_male}
$$

- Why might we want this extra term? What happens in these models when $X_{is\_male} = 0$ vs. $= 1$?
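One way to explore the question is to plug both values of $X_{is\_male}$ into each model. The substitution below is a worked sketch that uses only the symbols defined above.

Without the extra term, the two groups share a slope and differ only in intercept (parallel lines):

$$
X_{is\_male} = 0: \; Y_{mass} = \beta_0 + \beta_1 X_{flipper}, \qquad X_{is\_male} = 1: \; Y_{mass} = (\beta_0 + \beta_2) + \beta_1 X_{flipper}
$$

With the extra term, both the intercept and the slope can differ between the groups:

$$
X_{is\_male} = 0: \; Y_{mass} = \beta_0 + \beta_1 X_{flipper}, \qquad X_{is\_male} = 1: \; Y_{mass} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_{flipper}
$$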

In [None]:
df_penguins = sns.load_dataset('penguins')
df_penguins.head()

## Metrics vs. Loss

- In Machine Learning the **Loss Function** refers to the quantity used to *optimize* or create the model. Usually the loss function is minimized (This is just by convention; minimization is equivalent to maximizing the negative of a real-valued function).
- A **Metric** is the quantity used to *evaluate* the model. Usually this is the quantity we really care about.
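Here is a minimal sketch of the distinction. It assumes made-up data standing in for flipper length and body mass (the numbers are invented, not taken from the penguins dataset). Least squares chooses the coefficients by minimizing RSS (the loss), and we then judge the result by a different quantity, accuracy (the metric).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Made-up stand-in for "flipper length -> body mass"
x = rng.uniform(170, 230, size=200)
y = 50 * x - 5800 + rng.normal(0, 400, size=200)

# Loss: least squares picks the coefficients that minimize RSS
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
rss = np.sum((y - y_hat) ** 2)

# Metric: accuracy of "heavier than average" calls made from the predictions
threshold = y.mean()
accuracy = np.mean((y_hat > threshold) == (y > threshold))

print(f'Loss (RSS) = {rss:.0f}')
print(f'Metric (accuracy) = {accuracy:.2%}')
```

The fit never looks at accuracy. It only tries to make RSS small, and accuracy is computed afterwards.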

### Example
- Predict whether a penguin's body mass is higher than average using Linear Regression.
- (We could also use Logistic Regression, but that's for next time!)

In [None]:
print(f'Mean body mass is {df_penguins["body_mass_g"].mean()}')

In [None]:
plt.hist(df_penguins['body_mass_g'])
plt.axvline(x=df_penguins["body_mass_g"].mean(), color='red')
plt.show()

In [None]:
sns.pairplot(df_penguins)
plt.show()

In [None]:
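# numeric_only=True skips the text columns (species, island, sex),
# which recent pandas versions refuse to correlate
df_penguins.corr(numeric_only=True)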

Looks like many of the features are somewhat linearly correlated. Let's throw them all in a model and see what happens.

In [None]:
model = ols(formula = 'body_mass_g ~ flipper_length_mm + bill_depth_mm + bill_length_mm', data=df_penguins)
res = model.fit()
res.summary()

- What is the Loss Function here?
- What might you use for a Metric if we care about predicting body mass accurately?

### An aside:
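- For OLS, minimizing RSS is the same as maximizing $R^2$. Recall:

$$
R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}.
$$

TSS depends only on the observed $Y$ values, not on the fitted coefficients, so a smaller RSS always means a larger $R^2$.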

Even in this case, I would call $R^2$ a metric because:
- In my final report I would record the $R^2$ value.
- I would compare the $R^2$ value with another model created in another way (maybe something fancy like a decision tree).

### Adjusted R^2

Let $n$ be the number of samples and $k$ be the number of features, then

$$
R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-k-1}
$$
- Think about what happens when $k$ increases. How does $n$ affect this?
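To experiment with the formula, here is a small sketch. The helper name `adjusted_r2` and the example values are made up for illustration. It holds the raw $R^2$ fixed and varies $k$ and $n$.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n samples and k features (requires n > k + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R^2, more features: the adjusted value drops
for k in (1, 3, 10, 30):
    print(f'n=  50, k={k:>2}: {adjusted_r2(0.80, 50, k):.3f}')

# Same raw R^2 and k, more samples: the penalty fades
for n in (50, 500, 5000):
    print(f'n={n:>4}, k=10: {adjusted_r2(0.80, n, 10):.3f}')
```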

Let's move on to predicting whether a penguin "is heavier" or not.

In [None]:
# create the new variable
# do you see any problems below?
df_penguins['is_heavier'] = df_penguins['body_mass_g'] >  df_penguins["body_mass_g"].mean()
df_penguins.head()

In [None]:
# fix the problem here!
df_penguins.dropna(inplace=True)
df_penguins.reset_index(inplace=True, drop=True)

Now what I *care* about is how well I predict my new variable ```is_heavier```. I no longer care about $R^2$ but maybe about **Accuracy**, which is the ratio of correct predictions to the total number of predictions.

In [None]:
# let's see how accurate this model is
y_pred = res.predict(df_penguins)
y_pred.head()

In [None]:
y_pred = y_pred > df_penguins["body_mass_g"].mean()
y_pred.head()

In [None]:
# number of correct predictions
# think about why!
correct = sum(y_pred == df_penguins['is_heavier'])
accuracy = correct / len(y_pred)
print(f'This model has accuracy: {100*accuracy}%')

- Nothing about the model has changed. Notice how the metric we use determines how we feel about the model.
- Other important metrics surrounding classification are **Precision** and **Recall** (more on this next week).
- The Confusion Matrix is very helpful to understand why a binary classification model might be classifying things the way it is.
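A confusion matrix is just a table of actual class against predicted class. The sketch below uses made-up labels (not the penguin predictions) and `pd.crosstab` to build one. With the notebook's variables, the same call would presumably take `df_penguins['is_heavier']` and `y_pred`.

```python
import pandas as pd

# Made-up labels for illustration
actual = pd.Series(
    ['heavy', 'heavy', 'heavy', 'light', 'light', 'light', 'heavy', 'light'],
    name='actual',
)
predicted = pd.Series(
    ['heavy', 'light', 'heavy', 'light', 'heavy', 'light', 'heavy', 'light'],
    name='predicted',
)

# Rows are the true classes, columns are the predicted classes
confusion = pd.crosstab(actual, predicted)
print(confusion)

# With seaborn imported as sns, a heatmap gives the usual picture:
# sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues')
```

The diagonal holds the correct predictions. The off-diagonal cells show which kind of mistake the model makes more often.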

In [None]:
# seaborn confusion matrix

- Well how can we improve this model?

## Feature Engineering
- We can think of what we did in Lab 2 as Feature Engineering.

In [None]:
df_penguins['is_male'] = df_penguins['sex'].apply(lambda x : int(x == 'Male'))
df_penguins.head()

In [None]:
# fit the model, now including the engineered is_male feature
model = ols(formula = 'body_mass_g ~ flipper_length_mm + bill_depth_mm + bill_length_mm + is_male', data=df_penguins)
res = model.fit()

# predict
y_pred = res.predict(df_penguins)
y_pred = y_pred > df_penguins["body_mass_g"].mean()

# evaluate
correct = sum(y_pred == df_penguins['is_heavier'])
accuracy = correct / len(y_pred)
print(f'This model has accuracy: {100*accuracy}%')

Let's go back to the Ads dataset.

In [None]:
df_ads = pd.read_csv('data/Advertising.csv')
df_ads.head()

We want to predict sales so let's look at that bottom row.

In [None]:
sns.pairplot(df_ads)

In [None]:
# TV isn't exactly linear, but looks more like a square root function
df_ads['TV_root'] = df_ads['TV']**(1/2)

plt.scatter(x=df_ads['TV'], y=df_ads['sales'])
plt.title('TV vs. Sales')
plt.show()

plt.scatter(x=df_ads['TV_root'], y=df_ads['sales'])
plt.title('Root of TV vs. Sales')
plt.show()

In [None]:
# we see slight improvement in the correlation matrix
df_ads.corr()

In [None]:
model = ols(formula = 'sales ~ TV', data=df_ads)
res = model.fit()
print(f' The R^2 using TV is {res.rsquared}')

model = ols(formula = 'sales ~ TV_root', data=df_ads)
res = model.fit()
print(f' The R^2 using square root of TV is {res.rsquared}')

Polynomial Regression is just linear regression with new features!

$$
Y = \beta_0 + \beta_1 X + \beta_2 X^2
$$
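As a sketch of why this is still linear regression, the snippet below builds the $X^2$ column by hand and solves an ordinary least squares problem. The data is made up, with true coefficients $(2, -1, 0.5)$ chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up quadratic data: y = 2 - x + 0.5 x^2 + noise
x = np.linspace(-3, 3, 50)
y = 2 - x + 0.5 * x**2 + rng.normal(0, 0.3, size=x.size)

# The "new feature" is just x squared; the model stays linear in the betas
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print('Estimated (beta_0, beta_1, beta_2):', np.round(beta, 2))
```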

In [None]:
# the dataset you will be working with for HW2
df_taxis = sns.load_dataset('taxis')
df_taxis.head()

In [None]:
df_taxis['pickup'].head()

In [None]:
# time datatype!
pd.to_datetime(df_taxis['pickup'])

In [None]:
# time delta
pd.to_datetime(df_taxis['dropoff']) - pd.to_datetime(df_taxis['pickup'])

In [None]:
delta = pd.to_datetime(df_taxis['dropoff']) - pd.to_datetime(df_taxis['pickup'])
delta

In [None]:
# here's a new feature: length of trip!
# converting to minutes
df_taxis['length_of_trip'] = delta / pd.Timedelta('60s')

In [None]:
plt.scatter(x=df_taxis['length_of_trip'], y=df_taxis['total'])
plt.show()

How about a new categorical feature?

In [None]:
# wow there's a lot!
len(df_taxis['pickup_zone'].unique())

In [None]:
# which is the most frequented?
from collections import Counter

sorted_list = Counter(df_taxis['pickup_zone']).most_common()
print(sorted_list[:5])

In [None]:
df_taxis['pickup_zone_Midtown'] = df_taxis['pickup_zone'].apply(lambda x : x == 'Midtown Center')
df_taxis.head()
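With this many zones, one indicator per zone would add a lot of columns. A possible middle ground, sketched below on made-up pickups (only 'Midtown Center' is a zone name from the notebook; 'Zone A' through 'Zone C' are placeholders), is to keep the most frequent zones and lump the rest into a single 'Other' category before creating indicators.

```python
import pandas as pd

# Made-up pickups for illustration
zones = pd.Series(
    ['Midtown Center', 'Zone A', 'Midtown Center', 'Zone B',
     'Zone A', 'Zone C', 'Midtown Center'],
    name='pickup_zone',
)

# Keep the 2 most frequent zones and lump everything else into 'Other'
top_zones = zones.value_counts().index[:2]
zones_lumped = zones.where(zones.isin(top_zones), other='Other')

# One indicator column per remaining category
dummies = pd.get_dummies(zones_lumped, prefix='pickup_zone')
print(dummies.sum())
```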

## Outliers
- Remember to be careful of outliers!
- Below we have vote counts for various counties in Florida from the 2000 Presidential Election.

Context:
- In the 2000 USA Presidential election, Florida was the "tipping point" state for Bush, meaning that, after ranking the states by margin of victory, Florida was the state that gave Bush enough electoral votes to win the election.
- Additionally, holding all other state results constant, if Gore had won Florida then the election would have flipped to Gore.
- Also Bush only won the state by 537 votes out of roughly 6,000,000 cast 😳

In [None]:
df_votes = pd.read_csv('data/2000FL_votes.csv')
df_votes

More context
- Pat Buchanan was another conservative candidate with a platform similar to George W. Bush's.
- Counties with a large number of conservative voters would likely see more votes for both Bush and Buchanan.
- We can expect to see some correlation here between the two vote counts.

In [None]:
plt.scatter(x=df_votes['George W. Bush'], y=df_votes['Pat Buchanan'])
plt.xlabel('Bush Votes')
plt.ylabel('Buchanan Votes')
plt.show()

Controversy!
- But wait what is that?
- The county of Palm Beach used what were called [Butterfly Ballots](https://upload.wikimedia.org/wikipedia/commons/4/4e/Butterfly_Ballot%2C_Florida_2000_%28large%29.jpg).
- The claim is that many Gore voters accidentally voted for Pat Buchanan. (Research paper [here](https://www.gsb.stanford.edu/faculty-research/publications/butterfly-did-it-aberrant-vote-buchanan-palm-beach-county-florida) giving evidence for this claim)

Pay attention to the confidence interval.

In [None]:
sns.lmplot(x='George W. Bush', y='Pat Buchanan', data=df_votes)
plt.show()

In [None]:
# remove palm beach
df_votes_nopb = df_votes[df_votes['county'] != 'PALM BEACH']

sns.lmplot(x='George W. Bush', y='Pat Buchanan', data=df_votes_nopb)

# keep same x,y limits
plt.xlim(min(df_votes['George W. Bush']), max(df_votes['George W. Bush']))
plt.ylim(min(df_votes['Pat Buchanan']), max(df_votes['Pat Buchanan']))

plt.show()

In [None]:
model = ols(formula = 'PB ~ GWB', data=df_votes.rename(columns={'Pat Buchanan': 'PB',
                                                                'George W. Bush': 'GWB'}))
res = model.fit()
res.summary()

In [None]:
model = ols(formula = 'PB ~ GWB', data=df_votes_nopb.rename(columns={'Pat Buchanan': 'PB',
                                                                'George W. Bush': 'GWB'}))
res = model.fit()
res.summary()

### Interpretation

Look how small that coefficient is! Is there a relationship between Bush and Buchanan votes? Why or why not?
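To get a feel for how much a single point can move a fitted line, here is a sketch with made-up numbers (not the Florida data). The response is about 0.5% of the predictor, so the slope is small simply because of the units, and one point far above the trend is enough to shift it noticeably.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up "county" data: y is roughly 0.5% of x
x = rng.uniform(1_000, 150_000, size=60)
y = 0.005 * x + rng.normal(0, 30, size=60)

def slope(x, y):
    # np.polyfit returns coefficients highest degree first: [slope, intercept]
    return np.polyfit(x, y, 1)[0]

# Add one hypothetical point far above the trend
x_out = np.append(x, 150_000)
y_out = np.append(y, 3_400)

print(f'slope without outlier: {slope(x, y):.5f}')
print(f'slope with outlier:    {slope(x_out, y_out):.5f}')
```

A small coefficient does not by itself mean a weak relationship. It has to be read against the scale of the predictor.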

## Prediction
How well does my model perform on unseen data?

In [None]:
# numeric_only=True skips the text columns (species, island, sex)
df_penguins.corr(numeric_only=True)

In [None]:
model = ols(formula = 'body_mass_g ~ flipper_length_mm', data=df_penguins)
res = model.fit()
print(f'This model has an R^2 of {res.rsquared}')

What about penguin data points not used to build the model?

In [None]:
from sklearn.model_selection import train_test_split

# take random 10% of the penguins for testing
train, test = train_test_split(df_penguins, test_size=0.1)

- This is a **Train**-**Test** split. The training set is used to create the model while the test set is used to evaluate the model.
- The test set is "unseen" data for the model. The model never saw the body masses of these penguins during model creation, so none of their residuals entered the RSS that was minimized.
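Roughly speaking, a random split shuffles the row positions and cuts them into two disjoint groups. The sketch below does this by hand with NumPy on 20 pretend rows. It illustrates the idea and is not scikit-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 20            # pretend we have 20 rows
test_size = 0.1

# Shuffle the row positions, then cut off the first 10% as the test set
shuffled = rng.permutation(n)
n_test = int(np.ceil(test_size * n))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print('test rows: ', np.sort(test_idx))
print('train rows:', np.sort(train_idx))
```

Because the split is random, scores will change from run to run. Passing a fixed `random_state` to `train_test_split` makes the split repeatable.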

In [None]:
# build model on training set
model = ols(formula = 'body_mass_g ~ flipper_length_mm', data=train)
res = model.fit()

# predict using the model
y_pred = res.predict(test)

# score the model
y = test['body_mass_g']
y_mean = train['body_mass_g'].mean()  # use the mean of the training set

TSS = sum((y - y_mean)**2)  # total variation of the true test values
RSS = sum((y - y_pred)**2)  # variation the model leaves unexplained
print(f'This model has an R^2 on the test set of {(TSS - RSS) / TSS}')

- Why did it go down?
- Is this bad?

## Under/Overfitting

There are two problems one often faces with a train-test split:
- Overfitting: The model performs well on training data, but the model performs poorly on the test data
- Underfitting: The model performs poorly on the training data

This is related to the *Bias-Variance Tradeoff* which we will discuss in Lecture 6.
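The two definitions can be read straight off a table of train and test scores. The sketch below assumes made-up data (a cubic signal plus noise) and fits polynomials of three degrees with `np.polyfit`. The helper names `make_data` and `r2` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    # Cubic signal plus fresh noise on every call
    x = np.linspace(-2, 2, n)
    y = 3 * (x - 1) * (x + 2) * (x - 1.5) + rng.normal(0, 3, n)
    return x, y

def r2(y, y_hat, ref_mean):
    tss = np.sum((y - ref_mean) ** 2)
    rss = np.sum((y - y_hat) ** 2)
    return (tss - rss) / tss

x_train, y_train = make_data(40)
x_test, y_test = make_data(40)

results = {}
for degree in (1, 3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    r2_train = r2(y_train, np.polyval(coefs, x_train), y_train.mean())
    r2_test = r2(y_test, np.polyval(coefs, x_test), y_train.mean())
    results[degree] = (r2_train, r2_test)
    print(f'degree {degree:>2}: train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}')
```

A low score on both sets points to underfitting. A training score that keeps climbing while the test score stalls or drops points to overfitting.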

In [None]:
# let's make some fake nonlinear data to illustrate this
num_pts = 40
x = np.linspace(-2, 2, num_pts)

# noise
epsilon = np.random.normal(0, 3, num_pts)

# degree 3 polynomial
y = 3*(x-1)*(x+2)*(x-1.5) + epsilon

plt.scatter(x=x, y=y)
plt.show()

In [None]:
model = sm.OLS(y, sm.add_constant(x), hasconst=True)
res = model.fit()
res.summary()

We can see from the $R^2$ and visually that the line is underfitting the data.

In [None]:
b, m = res.params

plt.scatter(x=x, y=y)

plt.axline((0, b), slope=m, color='green')

plt.xlim(min(x), max(x))
plt.ylim(min(y), max(y))
plt.show()

Now let's do some feature engineering. Let's create polynomial features and do a linear regression (polynomial regression).

In [None]:
df = pd.DataFrame({'x':x, 'y':y})

n = 30

for i in range(n-1):
    df[f'x_{i+2}'] = x**(i+2)
df.head()

In [None]:
indep_var = 'x'
for i in range(n-1):
    indep_var = indep_var + f' + x_{i+2}'
print(indep_var)

model = ols(formula = f'y ~ {indep_var}', data=df)
res = model.fit()
res.summary()

Looks like a good $R^2$! But wait one second...

In [None]:
plt.scatter(x=x, y=y)

# plot the polynomial
coefs = list(res.params)
coefs.reverse()
poly = [np.polyval(coefs, i) for i in x]
plt.plot(x, poly)

plt.xlim(min(x), max(x))
plt.ylim(min(y), max(y))
plt.show()

That doesn't seem right... What happens if we create a test set?

In [None]:
x_test = np.linspace(-2,2,num_pts)
epsilon = np.random.normal(0,3,num_pts)

# degree 3 polynomial
y_test = 3*(x_test-1)*(x_test+2)*(x_test-1.5) + epsilon

plt.scatter(x=x_test, y=y_test)
plt.show()

In [None]:
plt.scatter(x=x_test, y=y_test)

plt.plot(x, poly)

plt.xlim(min(x_test), max(x_test))
plt.ylim(min(y_test), max(y_test))
plt.show()

In [None]:
train_mean = y.mean()
y_pred = [np.polyval(coefs, i) for i in x_test]

tss = sum((y_test - train_mean)**2)
rss = sum((y_test - y_pred)**2)

r2 = (tss - rss) / tss
print(f'The R squared for this model is {r2}')

We want to find that sweet spot between underfitting and overfitting!
- **Cross Validation**: Split the training set into disjoint pieces (folds). Use each piece in turn as a **validation** set, train the model on the remaining pieces, and average the metric over all the pieces.
- $k$-fold: Split the training set into $k$ disjoint pieces of (nearly) equal size.
- Leave-One-Out Cross Validation (LOOCV): Each data point is its own piece.
- **Bootstrapping**: Build the pieces by sampling *with replacement* rather than splitting into disjoint pieces.
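The $k$-fold procedure above can be sketched in a few lines of NumPy. The data is made up (a cubic signal plus noise), the helper `kfold_mse` is a name invented here, and mean squared error is used as the metric purely as an example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up cubic data
x = rng.uniform(-2, 2, 60)
y = 3 * (x - 1) * (x + 2) * (x - 1.5) + rng.normal(0, 3, 60)

def kfold_mse(x, y, degree, k=5):
    """Average validation MSE of a polynomial of the given degree over k folds."""
    folds = np.array_split(rng.permutation(len(x)), k)  # disjoint pieces
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        resid = y[val_idx] - np.polyval(coefs, x[val_idx])
        scores.append(np.mean(resid ** 2))
    return np.mean(scores)

for degree in (1, 2, 3, 5, 8):
    print(f'degree {degree}: CV MSE = {kfold_mse(x, y, degree):.2f}')
```

Setting `k=len(x)` makes every fold a single point, which is LOOCV.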

### Data Leakage
- Why train, *validation*, test split?
- **Data Leakage** in Machine Learning is when information about the target variable, or about the test set used to evaluate the model, *leaks* into the modeling process.

Let's say I create a test set from my training data, and I run a linear regression. The $R^2$ isn't where I hoped, so I remove some outliers from the training data. It goes up! Great. Now I engineer 5 features. I see which of these features improves the $R^2$ the most on the test set, and decide to keep that one in my model. My final model is created on the training set with the outliers removed and my best engineered feature. I report the $R^2$ of this model's predictions on the test set.

- Why can't I trust these results?
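One common way to avoid the situation in this story is to make every modeling decision on a separate validation set and touch the test set exactly once at the end. The sketch below assumes made-up data, a 60/20/20 split, and polynomial degree as the only "decision". All of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up cubic data
x = rng.uniform(-2, 2, 150)
y = 3 * (x - 1) * (x + 2) * (x - 1.5) + rng.normal(0, 3, 150)

# 60% train, 20% validation, 20% test
idx = rng.permutation(len(x))
train, val, test = np.split(idx, [90, 120])

def mse(coefs, rows):
    return np.mean((y[rows] - np.polyval(coefs, x[rows])) ** 2)

# Every modeling decision (here: the degree) uses train + validation only
fits = {d: np.polyfit(x[train], y[train], d) for d in range(1, 9)}
best_degree = min(fits, key=lambda d: mse(fits[d], val))

# The test set is used once, after all decisions are final
print(f'chosen degree: {best_degree}')
print(f'test MSE:      {mse(fits[best_degree], test):.2f}')
```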

In [None]:
# which of these variables would we actually know at prediction time?
df_penguins.head()

If I'm creating a model to predict penguin weight in a zoo, then maybe I won't know which ```island``` a penguin is from! If I use ```island``` anyway, invalid information about the target variable has leaked into the model, since I won't have this information when the model is used in reality. My evaluation of the model will be overly optimistic.