# Lec 22 - Polynomial Regression and Step Functions
## CMSE 381 - Fall 2023
## Nov 1, 2023



In this module we are going to implement polynomial regression and step functions as discussed in class.

In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import statsmodels.api as sm


# 0. Loading in the data

We're going to use the `Wage` data used in the book, so note that many of your plots can be checked by looking at figures in the book.

In [None]:
df = pd.read_csv('../../DataSets/Wage.csv', index_col =0 )
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Here's the plot we used multiple times in class to look at a single variable:  `age` vs `wage`

In [None]:
plt.scatter(df.age,df.wage)
plt.xlabel('Age')
plt.ylabel('Wage')

&#9989; **<font color=red>Do this:</font>** Modify the plot above so that the people earning above 250 are in a different color and/or symbol set.




In [None]:
# Your code here 

# 1. Polynomial Regression 

Our first step is to build a polynomial regression model that uses `age` to predict `wage`. As in class, we have a single predictor ($p=1$), and we are going to fit the degree-$d$ polynomial model
$$
\texttt{wage} = \beta_0 + \beta_1 \texttt{age} + \beta_2 \texttt{age}^2 + \cdots + \beta_d \texttt{age}^d +\varepsilon.
$$

The trick here is to build a matrix $X$ which has a column containing `age`, one with `age^2`, one with `age^3`, etc.  Then we hand this to your favorite regression tool (it doesn't need to know it's getting polynomial matrix inputs; it just sees a matrix of features and does its thing).
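
To make the trick concrete, here is a minimal sketch on made-up data (not the `Wage` data). The names `x`, `y`, `X_poly`, and `A` are hypothetical, and `np.linalg.lstsq` stands in for "your favorite regression tool"; any linear regression routine that takes a feature matrix should behave the same way.

```python
import numpy as np

# Hypothetical toy data (not the Wage data): y is an exact quadratic in x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x**2

# Feature matrix with one column per power: x, x^2, x^3
X_poly = np.column_stack([x**k for k in range(1, 4)])

# Prepend a column of ones for the intercept, then solve ordinary least squares.
# The solver only sees a matrix of numbers; it has no idea the columns are powers.
A = np.column_stack([np.ones_like(x), X_poly])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # intercept first, then the coefficients of x, x^2, x^3
```

Because the toy `y` is exactly quadratic, the fitted coefficient on the cubic column should come out (numerically) zero.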

So, here's some code that takes our $\texttt{age}$ data column and computes its powers; each new column in our data frame will simply be the $k$th power of the `age` column.

In [None]:
# Here's the column I care about
df.age

In [None]:
# Here's what the second column should be 
df.age**2

&#9989; **<font color=red>Do this:</font>** Use the `PolynomialFeatures` command to generate a data frame called `polydf` with columns $\texttt{age}$,  $\texttt{age}^2$,  $\texttt{age}^3$,  $\texttt{age}^4$ like we did a few lectures ago. 




In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# Your code here #

&#9989; **<font color=red>Do this:</font>** Train a linear regression model on these features. What are the coefficients? 


In [None]:
# Your code here #

&#9989; **<font color=red>Q:</font>** What is the equation for the polynomial that you learned? 

*Your equation here*

&#9989; **<font color=red>Do this:</font>** Draw the polynomial that you learned on top of the age vs wage plot. Note that you can either do this using the polynomial you just figured out, or by using the model you just set up to predict the values. In either case, use the vector of ages `t` below.
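
If you go the "use the polynomial directly" route, evaluating it on a grid is just a matrix product. The sketch below uses hypothetical coefficients `beta` (not fitted to the `Wage` data) and a small hypothetical grid `t_toy`; swap in your own coefficients and the vector `t`.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical coefficients, intercept first: 1 + 2 t - 0.5 t^2
beta = np.array([1.0, 2.0, -0.5])
t_toy = np.linspace(0, 4, 5)

# Columns t^0, t^1, t^2, matching the order of beta
T = np.column_stack([t_toy**k for k in range(len(beta))])
y_curve = T @ beta

# Cross-check with NumPy's polynomial evaluator (coefficients in increasing degree)
y_check = P.polyval(t_toy, beta)
print(np.allclose(y_curve, y_check))
```

The resulting `y_curve` can be handed to `plt.plot` together with the grid to draw the curve over the scatter plot.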

In [None]:
# Your code here #
t = np.linspace(10,100,100)




![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# 2. Step functions

Now let's try to use step functions to learn a model using `age` to predict `wage`. Like with the polynomial example from last time, all we're going to do is build a data frame or feature matrix that has the step function values in each column, and then pass that matrix to our favorite linear modeling function. 

First, we want to get a dataframe with the cuts. 

In [None]:
df_cut, bins = pd.cut(df.age, 4, retbins = True, right = False)

Note that `df_cut` is a pandas Series in which each data point is now represented by the interval that contains it.
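
To see what `pd.cut` is doing on something small enough to check by hand, here is a sketch with a hypothetical five-entry series `toy_age` (not the `Wage` data). With `right = False` the intervals are closed on the left, so a value sitting exactly on an interior edge lands in the bin that starts there.

```python
import pandas as pd

# Hypothetical toy ages, not the Wage data
toy_age = pd.Series([20, 30, 40, 50, 60])

# Two equal-width bins, each closed on the left: [a, b)
toy_cut, toy_bins = pd.cut(toy_age, 2, retbins=True, right=False)

print(toy_bins)               # three edges for two bins
print(toy_cut.cat.codes)      # integer bin index assigned to each entry
print(40 in toy_cut.iloc[2])  # the edge value 40 belongs to the second bin
```

Note that the last edge is nudged slightly above the maximum so that the largest value still falls inside the final half-open interval.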

In [None]:
df_cut

Here I'm just printing it out in a column next to the `age` information that was used to generate it.

In [None]:
pd.DataFrame({'age': df.age, 'df_cut': df_cut})


The `bins` output gives me the $c_i$'s as follows. 


In [None]:
print(bins)

In [None]:
# This is how it matches with our notation.
print(r'c_1 = ', bins[0])
print(r'c_2 = ', bins[1])
print(r'c_3 = ', bins[2])
print(r'c_4 = ', bins[3])
print(r'c_5 = ', bins[4])

&#9989; **<font color=red>Do this:</font>**
 For each of the functions $C_0(X)$, $C_1(X)$, $C_2(X)$, $C_3(X)$, $C_4(X)$, $C_5(X)$ (following our notation in class), determine the domains where they have value 1. 

*Your answer here*

- $C_0(X)$:
- $C_1(X)$:
- $C_2(X)$: 
- $C_3(X)$: 
- $C_4(X)$: 
- $C_5(X)$: 

Below is my code that generates the data frame storing $C_i(X)$ for all our entries. 

In [None]:
df_steps_dummies = pd.get_dummies(df_cut)
df_steps_dummies.head()

As with our other uses of dummy variables, I don't actually need all of them. I can include just three, since the entry of the fourth can always be figured out from the other three. It doesn't actually matter which one you drop (as long as you interpret properly after the fact), so for the sake of making the notation no worse than it already is, I'm going to drop the last one. (*Note: I could use `drop_first = True` instead, again with some slight changes in interpretation later.*)
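
Here is a small sketch of why one dummy column is redundant. The labels `"a"` through `"d"` and the names `toy_labels`, `kept`, and `recovered_last` are hypothetical stand-ins for the four age intervals.

```python
import pandas as pd

# Hypothetical toy labels standing in for the four age intervals
toy_labels = pd.Series(["a", "b", "c", "d", "b", "a"])
dummies = pd.get_dummies(toy_labels, dtype=int)

# Each row has exactly one 1, so the four columns always sum to 1
row_sums = dummies.sum(axis=1)

# Keep the first three columns and rebuild the dropped one from them
kept = dummies.iloc[:, :3]
recovered_last = 1 - kept.sum(axis=1)

print((recovered_last == dummies["d"]).all())
```

A row of all zeros in `kept` therefore means "this entry is in the dropped category", which is what the intercept of the regression ends up describing.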

In [None]:
df_steps_dummies = df_steps_dummies.iloc[:,:3]
df_steps_dummies.head()


&#9989; **<font color=red>Q:</font>** Which of the functions $C_i(X)$ for $i=0,\cdots, 5$ have columns represented in this matrix? *Note: it's not all of them*


*Your answer here*

&#9989; **<font color=red>Do this:</font>** Pass this matrix to a linear regression model and use it to predict `wage`. What is the equation for your learned model? Be specific in terms of the $C_i$ functions you learned earlier.

In [None]:
# Your code here #

&#9989; **<font color=red>Do this:</font>** Using the function $f(X)$ you just learned, what is the function value on each of the following values of $X$?

| X  | f(X)| 
| ---| --- | 
| 10 |     |
| 20 |     |
| 30 |     | 
| 40 |     |
| 50 |     | 
| 60 |     | 
| 70 |     |
| 80 |     |



Assuming you stored your linear regression model as `linreg`, the following code will plot the learned function. Check that the answers you got in the table above match with what you're seeing in the graph.  

In [None]:
t = np.linspace(10,80,100)

bin_mapping = np.digitize(t, bins)

# print(bin_mapping)
t_dummies = pd.get_dummies(bin_mapping)
t_dummies = t_dummies.drop(columns =[0,4])
t_dummies.head()
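
The cell above relies on `np.digitize` to decide which bin each entry of `t` belongs to. As a quick sketch of its behavior, here it is on hypothetical edges and points (the values in `edges` are made up to resemble `bins`, not copied from it).

```python
import numpy as np

# Hypothetical bin edges and query points
edges = np.array([18.0, 33.5, 49.0, 64.5, 80.1])
pts = np.array([10.0, 18.0, 33.5, 70.0, 90.0])

# Index i means edges[i-1] <= point < edges[i];
# 0 means below the first edge, len(edges) means at or above the last edge
idx = np.digitize(pts, edges)
print(idx)  # [0 1 2 4 5]
```

In the plotting cell, any entry of `t` that gets index 0 loses its only indicator when column 0 is dropped, so its row is all zeros and a linear model with an intercept returns just the intercept there. Keep that in mind when comparing the left edge of the plot to your table.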

In [None]:
stepPredict = linreg.predict(t_dummies) #<---- If you named your linear regression 
                                        #      something else, you can fix this to match.
            
#--------Draw the scatter plot of the data as well (comment out to hide it)-------#
plt.scatter(df.age,df.wage,marker = '+')
plt.xlabel('Age')
plt.ylabel('Wage')

plt.plot(t,stepPredict,color='red')



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.