# Lecture 21 - PLS
## CMSE 381 - Fall 2023
## Oct 30, 2023



In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time

import seaborn as sns

# ML imports we've used previously
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA


# PLS on Hitters Data

# Loading in the data

Ok, here we go, let's play with a baseball data set again. Note this cleanup is all the same as the last lab. 

In [None]:
df = pd.read_csv('../../DataSets/Hitters.csv').dropna().drop('Player', axis = 1)
df.info()
dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])

In [None]:
y = df.Salary

# Drop the column with the dependent variable (Salary), and columns for which we created dummy variables
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis = 1).astype('float64')

# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis = 1)

X.info()

In [None]:
# And here we have the normalized data.
from sklearn.preprocessing import StandardScaler
X_normalized = StandardScaler().fit_transform(X)
X_normalized = pd.DataFrame(X_normalized, columns = X.columns)
X_normalized.head()

# Partial Least Squares (PLS)

The command to do PLS in `scikit-learn` is `PLSRegression`. Below is some quick code that runs PLS on our dataset.

In [None]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

In [None]:
pls = PLSRegression(n_components=3)
pls.fit(X_normalized,y)
yhat = pls.predict(X_normalized)
mean_squared_error(y,yhat)

But like last time, we can also use the `cross_val_score` function to get the CV score easily. 

In [None]:
pls = PLSRegression(n_components=3)
scores = cross_val_score(pls, X_normalized, y, cv=10, scoring='neg_mean_squared_error')
scores.mean()

&#9989; **<font color=red>Do this:</font>**  Like last time, your job is to test a PLS model for an increasing number of components used. I recommend using the `cross_val_score` with `scoring='neg_mean_squared_error'`. What number of components would you use? 

In [None]:
n = len(X_normalized)
mse = []

# Calculate MSE using CV for an increasing number of components, 
# adding one component at a time.
for i in np.arange(1, 20): # i is the number of components to use each time
    # ====
    score = 0 # Your code to figure out the score each time goes in here. 
    # ====

    mse.append(score)
    
# Plot results    
plt.plot(np.arange(1, 20), mse, '-v')  # x values match the number of components
plt.xlabel('Number of components in regression')
plt.ylabel('MSE')
plt.title('Predicting Salary')
plt.xlim(xmin=-1);

## GridSearchCV


Let's make our lives a little easier! We keep doing $k$-fold CV over lots of parameters, so here's a command that does what we did above in fewer lines.



First, I'm going to use `Pipeline` to build up a list of things I want to do to my data. Here, I'm going to set up the PCR system we used last time (in a little bit you're going to update all this to do PLS).

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Create instances of PCA and linear regression
pca = PCA(n_components=2)
linreg = LinearRegression()

# Build the pipeline and give each thing in the pipeline a name
pipe = Pipeline([('pca', pca), ('linreg', linreg)])

# Do the usual fitting with our input data
pipe.fit(X_normalized, y)

# Pull out whatever stuff from the specific step I'm interested in
pipe.named_steps['linreg'].coef_
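
The same `named_steps` lookup works for any step in a pipeline, not just the last one. As a small sketch on made-up data (the step names `'scaler'` and `'linreg'` and the `X_demo`/`y_demo` arrays are chosen just for this illustration), here is a different pipeline where we pull a fitted attribute out of each step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Hitters data)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, size=(50, 3))
y_demo = X_demo[:, 0] - X_demo[:, 2] + rng.normal(scale=0.1, size=50)

demo_pipe = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Fitted attributes live on the individual steps, looked up by name
col_means = demo_pipe.named_steps['scaler'].mean_
coefs = demo_pipe.named_steps['linreg'].coef_
print(col_means, coefs)
```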

&#9989; **<font color=red>Do this:</font>**  How do you get the principal components used in the PCA step? 

*Hint: They're stored in the PCA step as `components_`*

In [None]:
# Your code here

Now what we can do is work with a grid of inputs we want to search. You can be all kinds of fancy and change more than one input, but we're only ever doing one for this class. 

So in my case, what I want to do is mess around with the number of components passed into PCA by setting this from 1 to 19.  Notice that because of my pipeline step, the key for the entry in the dictionary for `param_grid` has `pca` first since that's the part of the pipeline I want, then two underscores, then the name of the input for `pca` that I'm messing with. 

In [None]:
# Here's me creating my parameter grid
param_grid = {'pca__n_components': range(1, 20)}
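
If you're ever unsure what the key should look like, one way to check is to ask the pipeline for its parameter names with `get_params()`. This is just a quick sketch; `demo_pipe` is a throwaway copy of the PCA plus linear regression pipeline, built here so the example runs on its own.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([('pca', PCA(n_components=2)), ('linreg', LinearRegression())])

# Keep only the nested parameters, which are named <step>__<parameter>
nested_keys = sorted(k for k in demo_pipe.get_params() if '__' in k)
print(nested_keys)

# set_params uses the same naming, which is what GridSearchCV relies on
demo_pipe.set_params(pca__n_components=3)
print(demo_pipe.named_steps['pca'].n_components)
```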

Now I get to pass this into the `GridSearchCV` command, which does exactly what you did above. It takes everything in the defined pipeline, runs cross-validation for each value in the parameter grid, and keeps track of the scores.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# 10-fold splitter (unshuffled, so it matches the cv=10 used earlier)
kf_10 = KFold(n_splits=10)

# This actually does the fit
gridPCA = GridSearchCV(pipe, param_grid, cv=kf_10, scoring='neg_mean_squared_error')
gridPCA.fit(X_normalized, y)

Now I want to find out what it figured out. 

Here's how I can find the mean test score for every entry in the parameter grid. The negative sign is there because sklearn scorers treat bigger as better, so internally it stores negative MSE. Note that there is one entry for each number of components in $[1,\cdots,19]$.

In [None]:
-gridPCA.cv_results_['mean_test_score']
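
Besides `cv_results_`, a fitted `GridSearchCV` also stores the winning setting directly in `best_params_` and `best_score_`. Here is a minimal sketch on synthetic data (the `X_demo`/`y_demo` arrays, the 5 folds, and the small grid are assumptions for illustration, not the Hitters results).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the Hitters data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(size=120)

demo_pipe = Pipeline([('pca', PCA()), ('linreg', LinearRegression())])
demo_grid = GridSearchCV(demo_pipe, {'pca__n_components': range(1, 7)},
                         cv=5, scoring='neg_mean_squared_error')
demo_grid.fit(X_demo, y_demo)

# One mean score per grid entry; negate to get the CV MSE
cv_mse = -demo_grid.cv_results_['mean_test_score']

# The grid entry with the best (largest) score, i.e. the smallest MSE
best_n = demo_grid.best_params_['pca__n_components']
print(best_n, -demo_grid.best_score_)
```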

And now I can plot to see what's up. 

In [None]:
n_comp = param_grid['pca__n_components']

plt.plot(n_comp, -gridPCA.cv_results_['mean_test_score'], label='PCR')
plt.legend()
plt.ylabel('Cross-validated MSE')
plt.xlabel('# principal components')
plt.xticks(n_comp[::2])
plt.ylim([100000, 140000]);

&#9989; **<font color=red>Do this:</font>**  Do the same thing but for the PLS pipeline discussed above. 
- I recommend renaming my `gridPCA` to something like `gridPLS`. 
- You actually don't need the `Pipeline` here since you're only doing `PLSRegression` so the code should actually be simpler. 
- Draw the resulting plot with the PCR and PLS drawn on top of each other.


In [None]:
# Your code here 



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.