## CPSC 340 Lecture 15: feature selection

This notebook is for the in-class activities. It assumes you have already watched the [associated video](https://www.youtube.com/watch?v=YIGk_QCgm-A&list=PLWmXHcz_53Q02ZLeAxigki1JZFfCO6M-b&index=14).

<font color='red'>**REMINDER TO START RECORDING**</font>

Also, reminder to enable screen sharing for Participants.

## TODO

discuss non-uniqueness of OLS solution / multicollinearity 

## Pre-class music

1. Crazy Dream by Tom Misch, Loyle Carner
2. Big Iron by Marty Robbins

## Admin

- a3 in progress, due Wednesday
- a3 bug/fix: https://edstem.org/us/courses/3226/discussion/248155
- Countdown to reading week: 1 more class!

## Video chapters

- change of basis notation
- finding the “true” model
- complexity penalties and information criteria
- feature selection intro
- association approach
- regression weight approach
- search and score approach
- L0 norm penalty
- forward selection
- summary

## Extra discussion

Finding the "true" model - what about non-uniqueness of the optimization problem? (This was discussed in lecture 12 but the whiteboard was unreadable in the video.)

$\hat{y}_i = w_0 + w_1x_{i1} + w_2x_{i2} + \ldots + w_dx_{id}$

What if features 1 and 2 are identical, i.e. $x_{i1}=x_{i2}$ for all training examples $i$?

Then in fact we have:

$\hat{y}_i = w_0 + w_1x_{i1} + w_2x_{i1} + \ldots + w_dx_{id}$

or, 

$\hat{y}_i = w_0 + (w_1 + w_2)x_{i1} + \ldots + w_dx_{id}$


- This is a problem because we can change $w_1$ and $w_2$ without changing $w_1+w_2$. 
- Thus, for any solution to the optimization problem (that minimizes the squared error objective), we can construct another solution by replacing $w_1$ with $w_1+a$ and $w_2$ with $w_2-a$, for any $a$.
- So, we have infinitely many solutions to the optimization problem.
- We can try to draw this in 2D:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

In [12]:
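def plot_loss(X, y):
    """Heatmap of the squared error over a grid of (w1, w2) values."""
    plt.figure()
    w1 = np.arange(-15, 15, 0.25)
    w2 = np.arange(-15, 15, 0.25)
    W1, W2 = np.meshgrid(w1, w2)
    # Each column of W is one candidate weight vector (w1, w2)
    W = np.vstack((W1.flatten(), W2.flatten()))

    # X@W holds one column of predictions per candidate; y broadcasts across columns
    loss = np.sum((X@W-y)**2, axis=0)
    LOSS = np.reshape(loss, W1.shape)

    # origin="lower" puts the smallest w2 at the bottom so the image matches the axis labels
    plt.imshow(LOSS, extent=(np.min(w1), np.max(w1), np.min(w2), np.max(w2)), aspect="auto", origin="lower");
    plt.xlabel('w1');
    plt.ylabel('w2');
    plt.colorbar();
    plt.title('loss')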

In [13]:
np.random.seed(1)
X = np.random.rand(50,2)
y = 2*X[:,0][:,None] + X[:,1][:,None] - 1
plot_loss(X, y)

In [14]:
Xcopy = np.hstack((X[:,:1],X[:,:1]))
plot_loss(Xcopy, y)

In [15]:
%matplotlib widget

In [16]:
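def plot_loss_3D(X, y):
    """Surface plot of the squared error over a grid of (w1, w2) values."""
    w1 = np.arange(-10, 10, 0.25)
    w2 = np.arange(-10, 10, 0.25)
    W1, W2 = np.meshgrid(w1, w2)
    # Each column of W is one candidate weight vector (w1, w2)
    W = np.vstack((W1.flatten(), W2.flatten()))

    loss = np.sum((X@W-y)**2, axis=0)
    LOSS = np.reshape(loss, W1.shape)
        
    fig = plt.figure()
    # fig.gca(projection='3d') no longer works in recent matplotlib; add_subplot does
    ax = fig.add_subplot(projection='3d')

    surf = ax.plot_surface(W1, W2, LOSS, cmap=cm.coolwarm, linewidth=0, antialiased=False)

    ax.set_xlabel('w1');
    ax.set_ylabel('w2');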


In [17]:
plot_loss_3D(X, y)

In [8]:
plot_loss_3D(Xcopy, y)

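The flat valley in the duplicated-feature plots can also be checked numerically. The following is a minimal sketch, not part of the original activity: it rebuilds a small duplicated-column matrix (the names `X_dup`, `squared_error`, `w_hat` and `w_shifted` are made up for illustration), fits it with `np.linalg.lstsq`, and then shifts the weights by $+a$ and $-a$ as described above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((50, 1))
X_dup = np.hstack((x, x))   # two identical columns, in the spirit of Xcopy above
y = 2 * x - 1               # illustrative target; any y works for this check

def squared_error(w):
    """Sum of squared residuals for a (2, 1) weight vector w."""
    return float(np.sum((X_dup @ w - y) ** 2))

# For a rank-deficient X, lstsq returns the minimum-norm least-squares solution
w_hat, _, rank, _ = np.linalg.lstsq(X_dup, y, rcond=None)

# Move along the valley: w1 + a, w2 - a leaves w1 + w2 unchanged
a = 3.0
w_shifted = w_hat + np.array([[a], [-a]])

print("rank of X_dup:", rank)
print("loss at w_hat:    ", squared_error(w_hat))
print("loss at w_shifted:", squared_error(w_shifted))
```

The two losses should agree up to floating-point error, and the reported rank is 1 rather than 2, which is another way of seeing that the problem does not pin down $w_1$ and $w_2$ individually.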

## True/False questions

1. Simple association-based feature selection approaches do not take into account the interaction between features.
2. You can carry out feature selection using linear models by pruning the features which have very small weights. 
3. Forward selection is guaranteed to find the best feature set.
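
As a reference point for these questions, here is a rough sketch of greedy forward selection. It is an illustration only: the score used here (half the squared training error plus `lam` per selected feature) is an assumption, and a validation error or an information criterion could be substituted. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_squared_error(X, y, features):
    """Least-squares training error using only the given feature columns."""
    if len(features) == 0:
        return float(np.sum(y ** 2))          # no features: predict 0 everywhere
    Xs = X[:, features]
    w = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return float(np.sum((Xs @ w - y) ** 2))

def forward_selection(X, y, lam):
    """Greedily add the feature that most improves the score; stop when none does."""
    d = X.shape[1]
    selected = []
    best_score = 0.5 * fit_squared_error(X, y, selected)
    while len(selected) < d:
        candidates = [j for j in range(d) if j not in selected]
        # Assumed score: half the squared error plus lam per selected feature
        scores = [0.5 * fit_squared_error(X, y, selected + [j]) + lam * (len(selected) + 1)
                  for j in candidates]
        j_best = int(np.argmin(scores))
        if scores[j_best] >= best_score:
            break                              # no single addition improves the score
        selected.append(candidates[j_best])
        best_score = scores[j_best]
    return selected

# Hypothetical toy data: only features 0 and 3 influence y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 3] + 0.1 * rng.normal(size=100)

selected = forward_selection(X_toy, y_toy, lam=1.0)
print("selected features:", selected)
```

Each round fits at most $d$ models, so the whole search fits on the order of $d^2$ models instead of the $2^d$ needed to score every subset.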

The loss function for least squares with the L0-norm penalty is $f(w) = \frac{1}{2}\lVert{Xw -y}\rVert^2 + \lambda \lVert w\rVert_0$

1. We can minimize this loss function with the normal equations.
2. We can minimize this loss function with gradient descent.
3. Decreasing $\lambda$ encourages the model to select more features.
4. Decreasing $\lVert w\rVert_0$ encourages the model to select more features.

Follow up Q:

Imagine we duplicated every example in the training set, thus doubling the number of rows in $X$ and $y$. We leave everything else the same. This may change the number of features selected when minimizing the above loss. 
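
To experiment with these questions, the L0-penalized loss can be evaluated directly and, for a handful of features, minimized by brute force. This is a sketch under assumptions: the helper names and the toy data are invented for illustration, and exhaustive search over all $2^d$ subsets is only practical for small $d$.

```python
import itertools
import numpy as np

def l0_penalized_loss(X, y, w, lam):
    """f(w) = 0.5 * ||Xw - y||^2 + lam * ||w||_0 (number of non-zero weights)."""
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.count_nonzero(w)

def best_subset(X, y, lam):
    """Try every subset of features and keep the weights with the lowest loss."""
    d = X.shape[1]
    best_w = np.zeros(d)
    best_loss = l0_penalized_loss(X, y, best_w, lam)
    for k in range(1, d + 1):
        for subset in itertools.combinations(range(d), k):
            cols = list(subset)
            w = np.zeros(d)
            # Least squares restricted to the chosen columns; other weights stay 0
            w[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            loss = l0_penalized_loss(X, y, w, lam)
            if loss < best_loss:
                best_w, best_loss = w, loss
    return best_w

# Hypothetical toy data: only features 0 and 1 influence y
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))
y_toy = 2 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + 0.3 * rng.normal(size=40)

lambdas = [0.01, 1.0, 1000.0]
counts = [np.count_nonzero(best_subset(X_toy, y_toy, lam)) for lam in lambdas]
for lam, n_selected in zip(lambdas, counts):
    print(f"lambda={lam}: {n_selected} feature(s) selected")
```

For the follow-up question, the same function can be called on `np.vstack((X_toy, X_toy))` and `np.concatenate((y_toy, y_toy))` to compare the number of selected features before and after duplicating the data.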

## Student questions

https://edstem.org/us/courses/3226/discussion/249407