# Exercise 5

## 1 Bias and variance of ridge regression

* True model: y = Xβ* + ε

* Zero-mean Gaussian noise: ε ~ N(0, σ^2)

* Centered features assumption: (1/N) * Σ Xi = 0

* Regularization parameter: τ ≥ 0

Prove that the expectation E[βτ] and covariance Cov[βτ] of the regularized solution βτ are given by:
        
        E[βτ] = (Sτ)⁻¹S β∗
        Cov[βτ] = (Sτ)⁻¹S (Sτ)⁻¹σ^2
 

**Step 1: Use the SVD of X:**

X = UΣVᵀ (where U and V are orthogonal matrices, Uᵀ = U⁻¹)

**Step 2: Compute S and Sτ:**

S = XᵀX = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀΣVᵀ

Sτ = XᵀX + τI = VΣᵀΣVᵀ + τI
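As a quick numerical sanity check of Step 2, the following sketch verifies the SVD forms of S and Sτ on random data. The sizes, the value of τ, and all variable names are assumptions chosen for illustration. It also uses the fact that V is orthogonal, so τI = V(τI)Vᵀ and therefore Sτ = V(ΣᵀΣ + τI)Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, tau = 20, 5, 0.7            # hypothetical sizes and regularization strength
X = rng.normal(size=(N, D))

# Thin SVD: X = U diag(s) V^T, where s holds the singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

S = X.T @ X
S_tau = S + tau * np.eye(D)

# S = V diag(s^2) V^T and S_tau = V diag(s^2 + tau) V^T
S_svd = V @ np.diag(s**2) @ V.T
S_tau_svd = V @ np.diag(s**2 + tau) @ V.T

# Hence S_tau^{-1} S = V diag(s^2 / (s^2 + tau)) V^T: a shrinkage per singular direction
shrink = V @ np.diag(s**2 / (s**2 + tau)) @ V.T

print(np.allclose(S, S_svd), np.allclose(S_tau, S_tau_svd))
print(np.allclose(np.linalg.solve(S_tau, S), shrink))
```

The last identity shows why ridge regression is biased: each direction of β* is multiplied by a factor s²/(s² + τ) that is below 1 whenever τ > 0.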

**Step 3: Compute E[βτ]:**

βτ = argminβ J(β), with the objective J(β) = (y - Xβ)ᵀ(y - Xβ) + τβᵀβ

J(β) = (yᵀ - βᵀXᵀ)(y - Xβ) + τβᵀβ

= yᵀy - yᵀXβ - βᵀXᵀy + βᵀXᵀXβ + τβᵀβ

To find βτ, we take the derivative of J with respect to β and set it to zero (the common factor 2 is dropped):

0 = -Xᵀy + XᵀXβ + τβ

Xᵀy = XᵀXβ + τβ = (XᵀX + τI)β

βτ = (XᵀX + τI)⁻¹Xᵀy

E[βτ] = E[(XᵀX + τI)⁻¹Xᵀy] = (XᵀX + τI)⁻¹XᵀE[y]

Since the noise ε is zero-mean and X is treated as fixed, E[y] = E[Xβ* + ε] = Xβ*:

E[βτ] = (XᵀX + τI)⁻¹XᵀXβ*

E[βτ] = (S + τI)⁻¹XᵀXβ*

Finally, since S = XᵀX and Sτ = XᵀX + τI, we have:

E[βτ] = (Sτ)⁻¹Sβ*

**Step 4: Compute Cov[βτ]:**

Cov[βτ] = Cov[(XᵀX + τI)⁻¹Xᵀy]

For a fixed matrix A, Cov[Ay] = A Cov[y] Aᵀ. With A = (XᵀX + τI)⁻¹Xᵀ and the symmetry of (XᵀX + τI)⁻¹ this gives:

= (XᵀX + τI)⁻¹XᵀCov[y]X(XᵀX + τI)⁻¹

Since the noise components are independent with variance σ^2, Cov[y] = Cov[ε] = σ^2I:

Cov[βτ] = (XᵀX + τI)⁻¹Xᵀ(σ^2I)X(XᵀX + τI)⁻¹

= σ^2(XᵀX + τI)⁻¹XᵀX(XᵀX + τI)⁻¹

Finally, since S = XᵀX and Sτ = XᵀX + τI, we have:

Cov[βτ] = σ^2(Sτ)⁻¹S(Sτ)⁻¹
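The two formulas stated in the task, E[βτ] = (Sτ)⁻¹Sβ* and Cov[βτ] = σ^2(Sτ)⁻¹S(Sτ)⁻¹, can also be checked empirically. The sketch below is a Monte Carlo simulation under assumed values for N, D, τ, σ and β* (none of them come from the exercise): it draws many noise realisations for a fixed X, computes the ridge solution for each, and compares the sample mean and sample covariance with the closed forms.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, tau, sigma = 50, 3, 2.0, 0.5        # hypothetical problem setup
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                        # centered features
beta_star = np.array([1.0, -2.0, 0.5])     # hypothetical true coefficients

S = X.T @ X
S_tau_inv = np.linalg.inv(S + tau * np.eye(D))

# Closed-form expectation and covariance of the ridge estimator
mean_theory = S_tau_inv @ S @ beta_star
cov_theory = sigma**2 * S_tau_inv @ S @ S_tau_inv

# Simulate many data sets y = X beta* + eps with the same X
n_trials = 20000
noise = rng.normal(scale=sigma, size=(n_trials, N))
Y = X @ beta_star + noise                  # each row is one realisation of y
B = Y @ X @ S_tau_inv                      # each row is one ridge solution (S_tau_inv is symmetric)

mean_emp = B.mean(axis=0)
cov_emp = np.cov(B, rowvar=False)

print("max |mean error|:", np.abs(mean_emp - mean_theory).max())
print("max |cov error| :", np.abs(cov_emp - cov_theory).max())
```

Both deviations should shrink as `n_trials` grows, since they are pure sampling error.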

## 2 LDA-Derivation from Least Squares Error

## 3 Automatic feature selection for LDA as regression

## 3.1 Implement Orthogonal Matching Pursuit

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
from sklearn.datasets import load_digits
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [2]:
# load digits data set
digits = load_digits()
data = digits["data"]
images = digits["images"]
target = digits["target"]
target_names = digits["target_names"]
print(f"data.shape = {data.shape}")
print(f"data.dtype = {data.dtype}")
print(f"images.shape = {images.shape}")
print(f"images.dtype= {images.dtype}")
print(f"target.shape = {target.shape}")
print(f"target.dtype = {target.dtype}")
print(f"target_names.shape = {target_names.shape}")
print(f"target_names.dtype= {target_names.dtype}")
print(f"target[:20] = {target[:20]}")

data.shape = (1797, 64)
data.dtype = float64
images.shape = (1797, 8, 8)
images.dtype= float64
target.shape = (1797,)
target.dtype = int32
target_names.shape = (10,)
target_names.dtype= int32
target[:20] = [0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


In [32]:
def omp_regression(X, y, T):
    """Orthogonal Matching Pursuit.

    X: [N, D] feature matrix
    y: [N] targets
    T > 0: number of iterations (maximum number of selected features)
    Returns theta: [D] coefficients with at most T non-zero entries.
    """
    A = []                          # indices of the active (selected) features
    r = y.astype(float)             # residual, initialised with y
    theta = np.zeros(X.shape[1])

    for t in range(1, T + 1):
        # select the feature most correlated with the current residual
        j = int(np.argmax(np.abs(X.T @ r)))
        if j in A:                  # no new feature left to select
            break
        A.append(j)

        # least-squares fit restricted to the active features
        X_active = X[:, A]
        beta, *_ = np.linalg.lstsq(X_active, y, rcond=None)

        theta[A] = beta
        r = y - X @ theta

        error = np.linalg.norm(r)
        print(t, ": error", error)

    return theta
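To see the greedy selection at work, here is a small self-contained sketch on synthetic data where the true support is known. The function `omp_sketch`, the data sizes, and the chosen support are all assumptions made for this illustration. It re-implements the same steps (pick the column most correlated with the residual, refit by least squares on the active set, update the residual) so that it can run on its own.

```python
import numpy as np

def omp_sketch(X, y, T):
    """Illustrative OMP: returns (theta, active) with at most T selected columns."""
    active = []
    theta = np.zeros(X.shape[1])
    r = y.astype(float)
    for _ in range(T):
        j = int(np.argmax(np.abs(X.T @ r)))      # column most correlated with residual
        if j in active:                           # nothing new to add
            break
        active.append(j)
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        theta[:] = 0.0
        theta[active] = coef                      # refit all active coefficients
        r = y - X @ theta
    return theta, active

# Synthetic problem: only 3 of 30 features carry signal (hypothetical setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
theta_true = np.zeros(30)
theta_true[[3, 11, 27]] = [2.0, -3.0, 1.5]
y = X @ theta_true + 0.01 * rng.normal(size=200)

theta_hat, active = omp_sketch(X, y, T=3)
print("selected features:", sorted(active))
print("max coefficient error:", np.abs(theta_hat - theta_true).max())
```

Using `np.linalg.lstsq` instead of an explicit matrix inverse is a design choice here: it stays stable when the active columns are nearly collinear.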

## 3.2 Classification with sparse LDA

In [4]:
# Filter the digits 3 and 9 from the dataset and randomly split them into a train and a test set.
# Load data
digits = load_digits()
data = digits["data"]
target = digits["target"]
# Data filtering
num_1, num_2 = 3, 9
mask = np.logical_or(target == num_1, target == num_2)
data = data[mask]/data.max()
target = target[mask]
# Relabel targets
target[target == num_1] = -1
target[target == num_2] = 1
# split into train and test data
X_all = data
y_all = target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
 X_all, y_all, test_size=0.4 , random_state=0)
print(f"X_train.shape = {X_train.shape}")
print(f"X_test.shape = {X_test.shape}")
print(f"y_train.shape = {y_train.shape}")
print(f"y_test.shape = {y_test.shape}")
print(f"y_test[:10] = {y_test[:10]}")

X_train.shape = (217, 64)
X_test.shape = (146, 64)
y_train.shape = (217,)
y_test.shape = (146,)
y_test[:10] = [ 1  1  1 -1 -1  1 -1  1 -1  1]


In [33]:
beta=omp_regression(X_train,y_train,50)

1 : error 13.315009890048637
2 : error 9.071445029525995
3 : error 7.680538712586313
4 : error 7.296486079091541
5 : error 6.954311899197579
6 : error 6.537792147169524
7 : error 6.328508719076106
8 : error 6.003608135989588
9 : error 5.746248907097857
10 : error 5.575198926811804
11 : error 5.387370266570813
12 : error 5.16774279081024
13 : error 5.079951648602172
14 : error 5.038192593411319
15 : error 4.9458185372673995
16 : error 4.884330365024108
17 : error 4.803529510116022
18 : error 4.772507273508386
19 : error 4.742549750913815
20 : error 4.597034394498034
21 : error 4.569790570358204
22 : error 4.506768100187714
23 : error 4.46820131060768
24 : error 4.429403356920796
25 : error 4.4164768517712405
26 : error 4.393591323048129
27 : error 4.371254541997926
28 : error 4.343868921475703
29 : error 4.3377456086967054
30 : error 4.323478587774914
31 : error 4.3046725989667936
32 : error 4.290884302994312
33 : error 4.284523610546172
34 : error 4.275759217863128
35 : error 4.2721026

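# How many pixels should be used for acceptable error rates?
At least 15 pixels. The residual norm on the training set drops from 13.3 with one pixel to about 4.95 with 15 pixels and decreases only slowly after that.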
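The error rate itself can be measured with a small helper like the one below. This is a sketch that assumes the prediction rule sign(Xθ) for labels in {-1, +1}. The name `error_rate` and the toy numbers are hypothetical and not part of the exercise.

```python
import numpy as np

def error_rate(X, y, theta):
    """Fraction of samples whose predicted label differs from y (labels in {-1, +1})."""
    scores = X @ theta
    y_pred = np.where(scores >= 0, 1, -1)     # ties at 0 are assigned to class +1
    return float(np.mean(y_pred != y))

# Toy example with two features and four samples
X_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [2.0, -1.0]])
theta_toy = np.array([1.0, -1.0])
y_toy = np.array([1, -1, -1, 1])

print(error_rate(X_toy, y_toy, theta_toy))
```

Evaluating this on the test set for coefficient vectors fitted with increasing T would show where additional pixels stop improving the classification.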
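# Is it necessary/beneficial to standardize the data before training and testing?
Yes. OMP selects features by the size of |Xᵀr|, which depends on the scale of each column, so standardizing makes the pixels comparable.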
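A minimal sketch of such a standardization follows, assuming that the statistics are estimated on the training set only and then reused for the test set. The helper name `standardize` and the toy arrays are hypothetical. Constant columns (for example pixels that are always zero) have zero standard deviation, so they are only centered to avoid a division by zero.

```python
import numpy as np

def standardize(X_train, X_test):
    """Center and scale both sets using the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std_safe = np.where(std > 0, std, 1.0)    # leave constant columns unscaled
    return (X_train - mean) / std_safe, (X_test - mean) / std_safe

# Toy data: the first column is constant, like an always-empty pixel
X_tr = np.array([[0.0, 1.0, 5.0],
                 [0.0, 3.0, 5.0],
                 [0.0, 5.0, 8.0]])
X_te = np.array([[0.0, 3.0, 6.0]])

X_tr_std, X_te_std = standardize(X_tr, X_te)
print(X_tr_std.mean(axis=0))
print(X_tr_std.std(axis=0))
```

After this step every non-constant training column has zero mean and unit variance, and the test set is transformed with exactly the same parameters.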