# Lec 20 - Polynomial Regression and Step Functions
## CMSE 381 - Fall 2022
## Nov 4, 2022



In this module we are going to implement polynomial regression and step functions as discussed in class.

In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import statsmodels.api as sm


# 0. Loading in the data

We're going to use the `Wage` data used in the book, so note that many of your plots can be checked by looking at figures in the book.

In [None]:
df = pd.read_csv('Wage.csv', index_col =0 )
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Here's the plot we used multiple times in class to look at a single variable:  `age` vs `wage`

In [None]:
plt.scatter(df.age,df.wage)
plt.xlabel('Age')
plt.ylabel('Wage')

&#9989; **<font color=red>Do this:</font>** Modify the plot above so that the people earning above 250 are in a different color and/or symbol set.




In [None]:
# Your code here 

# 1. Polynomial Regression 

Our first step is to build a polynomial regression model using the age data to predict wage.  So, as in class, we are in the $p=1$ world (a single predictor), where we are going to fit the degree-$d$ model
$$
\texttt{wage} = \beta_0 + \beta_1 \texttt{age} + \beta_2 \texttt{age}^2 + \cdots + \beta_d \texttt{age}^d +\varepsilon.
$$

The trick here is to build a matrix $X$ which has a column containing `age`, one with `age^2`, one with `age^3`, etc.  Then we hand this to your favorite regression tool (it doesn't need to know it's getting polynomial matrix inputs, it just sees a matrix of features and does its thing).
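As a minimal sketch of this idea on made-up data (not the `Wage` data), the snippet below stacks `x` and `x**2` as columns and hands them to scikit-learn's `LinearRegression`. The arrays `x`, `y`, `X_toy` and the coefficients 1, 2, 3 are assumptions, chosen so that the answer is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from a known quadratic: y = 1 + 2x + 3x^2 (no noise)
x = np.linspace(-2, 2, 25)
y = 1 + 2 * x + 3 * x**2

# Feature matrix: one column per power of x
X_toy = np.column_stack([x, x**2])

# The regression only sees two feature columns; it does not know they are powers
toy_model = LinearRegression().fit(X_toy, y)
print(toy_model.intercept_, toy_model.coef_)
```

Because the toy data has no noise, the fitted intercept and coefficients should match the values used to generate `y`, up to floating point error.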

So, here's some code to take our $\texttt{age}$ data column and create new columns in our data frame, each of which is simply the $k$th power of the `age` column.

In [None]:
# Here's the column I care about
df.age

In [None]:
# Here's one way to get out the pandas series that squares
# each entry
df.age.apply(lambda x: x**2)

&#9989; **<font color=red>Do this:</font>** Use the code above (or any other tricks you might know) to generate a data frame called `polydf` with 4 columns, where the $k$th column has $\texttt{age}^k$




In [None]:
# Your code here #

Did I need to make you do that? It turns out no. As with many things we've talked about in class, there is already some automated code for us to work with.  In this case, the only difference is that it will hand us back a matrix rather than a data frame. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(4)
X = poly.fit_transform(df.age.values.reshape(-1,1)) #<--- this nastiness is because it wants to be handed a 2D matrix, not a 1D array
X[:10,:]

&#9989; **<font color=red>Q:</font>** What other major difference do you notice between the dataframe you constructed above and the matrix provided here? Why do you think that is happening?

*Your answer here*





In [None]:
# Your code here #

&#9989; **<font color=red>Do this:</font>** Train a linear regression model on these features. What are the coefficients? 


In [None]:
# Your code here #

&#9989; **<font color=red>Q:</font>** What is the equation for the polynomial that you learned? 

*Your equation here*

&#9989; **<font color=red>Do this:</font>** Draw the polynomial that you learned on top of the age vs wage plot. Note that you can either do this using the polynomial you just figured out, or by using the model you just set up to predict the values. In either case, use the vector of ages `t` below.
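For the first option, here is a small sketch of evaluating a polynomial from its coefficients on a grid of inputs. The coefficients in `toy_coefs` and the grid `t_toy` are hypothetical and are not the ones you learned from the `Wage` data.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical coefficients, lowest degree first: 1 + 2t + 3t^2
toy_coefs = [1.0, 2.0, 3.0]

# Grid of inputs at which to evaluate the curve
t_toy = np.linspace(0, 1, 5)

# polyval expects coefficients ordered from the constant term upward
y_toy = P.polyval(t_toy, toy_coefs)
print(y_toy)
```

The pair `(t_toy, y_toy)` can then be handed to `plt.plot` to draw the curve on top of a scatter plot.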

In [None]:
# Your code here #
t = np.linspace(10,100,100)


![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# 2. Step functions

Now let's try to use step functions to learn a model. Like with the polynomial example above, all we're going to do is build a data frame or feature matrix that has the step function values in each column, and then pass that matrix to our favorite linear modeling function. 
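For reference, here is the step function model written out, roughly following the textbook's notation. The symbols $K$ (number of knots), $c_k$ (knot locations), and $C_k$ (indicator functions) are not defined elsewhere in this notebook, so treat them as assumptions of this sketch:

$$
\texttt{wage} = \beta_0 + \beta_1 C_1(\texttt{age}) + \beta_2 C_2(\texttt{age}) + \cdots + \beta_K C_K(\texttt{age}) + \varepsilon,
$$

where $C_k(\texttt{age}) = I(c_k \leq \texttt{age} < c_{k+1})$ for $k = 1, \ldots, K-1$ and $C_K(\texttt{age}) = I(c_K \leq \texttt{age})$. The first bin, $C_0(\texttt{age}) = I(\texttt{age} < c_1)$, is left out of the sum because it is absorbed into the intercept $\beta_0$.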

First off, it's easy to find the locations for the knots, which are the places where we switch step functions. Here's the pandas `cut` command, applied to some toy data, which gives me 3 equal-sized bins. Here, equal-sized means that the intervals all have the same width.

In [None]:
fakeData = np.array([1, 7, 3, 5, 4, 6, 3, 3 , 10,2])
cuts, knots = pd.cut(fakeData, 3,retbins=True, right = False)
print(cuts)
print(knots)

The `retbins=True` argument tells the command to return the breakpoints of the bins, which I saved in my output as `knots`. The `right=False` argument makes it so that we have intervals closed on the bottom (e.g. $[3,5)$, $[5,7)$, etc.), which I am simply using here to make the results match the textbook notation.
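To see what `right=False` changes, here is a small sketch comparing it with the default on a made-up array whose values sit exactly on the bin edges. The names `toy`, `cuts_right`, and `cuts_left` are only for illustration.

```python
import numpy as np
import pandas as pd

toy = np.array([1, 4, 7, 10])  # made-up values sitting right at the bin edges

# Default (right=True): intervals are closed on the right, like (a, b]
cuts_right, knots_right = pd.cut(toy, 3, retbins=True)

# right=False: intervals are closed on the left, like [a, b)
cuts_left, knots_left = pd.cut(toy, 3, retbins=True, right=False)

print(cuts_right.categories)
print(cuts_left.categories)

# The value 4 sits on an edge, so it changes bins between the two conventions
print(cuts_right[1], cuts_left[1])
```

Note that `pd.cut` slightly extends the outermost edge so that the minimum (or maximum) data value still lands inside a bin.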

We can either see the intervals chosen by looking at the `categories` saved to the cuts, or by looking at the knots list. 

In [None]:
print(cuts.categories)
print(knots)

I can find out what bin the $i$th entry is mapped to by just checking the cuts list. 

In [None]:
i = 5
print('Entry is:', fakeData[i])
print('This comes from bin:', cuts[i])

We can also see how many data points ended up in each interval.

In [None]:
cuts.value_counts()

Once we've got this list of bins, we can build the data frame that keeps track of all the true/false values for whether a data point is in a particular interval by using the dummy variable trick. 

In [None]:
X_stepFunction = pd.get_dummies(cuts)
X_stepFunction

In [None]:
# This might be easier to check also if we draw the 
# input X data next to the dummy variables we made
X_stepFunction['X'] = fakeData
X_stepFunction

Then, if I want to figure out which bin is assigned for some other matrix of values that I want to test, I can use the `np.digitize` function as follows.

In [None]:
u = np.array([4, 6, -7, 8, 13, 25, 0, 1, np.pi])
print(u)
np.digitize(u,knots)

&#9989; **<font color=red>Q:</font>** What interval does each entry in the array above correspond to? In particular, what do the entries 0 and 4 mean?

*Your answer here* 

&#9989; **<font color=red>Do this:</font>**
- Use the `cut` tool above to create a feature matrix for the `age` data where each column corresponds to a step function using 4 bins. 
- Drop the first bin. Remember, we don't need all of our dummy variables, so we'll just use the remaining 3 to predict.
- Pass this matrix to a linear regression model. 

What is the equation for your learned model? 

In [None]:
# Your code here #

&#9989; **<font color=red>Do this:</font>** Our goal is to plot the learned equation on top of the scatter plot data. To do this:
- Plot the  original sampled data.
- Using your linear regression model from above, predict the values on `t = np.linspace(10,100,100)` to get a vector `y`.
- Plot `(t,y)` on the figure. 

What range of ages has the highest predicted wage?



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.