Consider regularized linear regression (also called ridge regression) for classification:

$\mathbf{w}_{reg} = \operatorname{argmin}_{\mathbf{w}} \left(\frac{\lambda}{N}\|\mathbf{w}\|^2+\frac{1}{N}\|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2\right)$
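
As a cross-check on the library calls used below, here is a minimal NumPy sketch of the closed-form minimizer, assuming the usual squared-norm ridge penalty $\|\mathbf{w}\|^2$, a constant feature $x_0 = 1$ prepended to every example, and labels in $\{-1, +1\}$. Setting the gradient of the objective to zero gives $(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$. The helper names (`ridge_fit`, `ridge_predict`, `zero_one_error`) and the toy data are illustrative and not part of the assignment.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (Z^T Z + lam * I) w = Z^T y, where Z is X with x_0 = 1 prepended."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])  # assumed convention: bias column first
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

def ridge_predict(X, w):
    """Classify by the sign of the linear score (ties at 0 go to +1)."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(Z @ w >= 0, 1.0, -1.0)

def zero_one_error(y_pred, y_true):
    """Fraction of misclassified examples."""
    return float(np.mean(y_pred != y_true))

# Made-up, linearly separable toy data (not the homework data set)
X_demo = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_demo = np.array([-1.0, -1.0, 1.0, 1.0])

w_demo = ridge_fit(X_demo, y_demo, lam=10.0)
err_demo = zero_one_error(ridge_predict(X_demo, w_demo), y_demo)
print(w_demo, err_demo)
```

Note that this sketch penalizes the bias weight along with the others, whereas scikit-learn's `RidgeClassifier` by default fits an unpenalized intercept, so the two need not produce identical numbers.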

Run the algorithm on the following data set as $\mathcal{D}$:

https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_algo/hw4_train.dat

and the following set for evaluating $E_{out}$:

https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_algo/hw4_test.dat

Because the data sets are for classification, please consider only the 0/1 error for all Questions below.
Break the tie by selecting the largest $\lambda$. 
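
The tie-breaking rule can be applied mechanically rather than by reading the printed tables. The sketch below assumes the errors have been collected in a dictionary keyed by $\log_{10} \lambda$; the function name `pick_lambda` and the example values are hypothetical.

```python
def pick_lambda(errors):
    """Return the log10(lambda) with the smallest error.

    `errors` maps log10(lambda) -> error. Ties are broken by taking the
    largest lambda, i.e. the largest key among the minimizers.
    """
    best_err = min(errors.values())
    return max(k for k, e in errors.items() if e == best_err)

# Hypothetical error values, for illustration only
demo_errors = {2: 0.10, 1: 0.035, 0: 0.03, -1: 0.03, -2: 0.04}
best = pick_lambda(demo_errors)
print(best)
```

Exact equality is safe here because each 0/1 error is a misclassification count divided by the same sample size.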

In [1]:
import numpy as np
from sklearn.linear_model import RidgeClassifier
import math

data = np.genfromtxt('hw4_train.dat')
X_train = data[:, :-1]
y_train = data[:, -1]

data = np.genfromtxt('hw4_test.dat')
X_test = data[:, :-1]
y_test = data[:, -1]

$13.$ Let $\lambda = 10$. What are the corresponding $E_{in}$ and $E_{out}$?

In [2]:
# run ridge regression
LR = RidgeClassifier(alpha=10)
LR.fit(X_train, y_train)

# evaluate errors
yin_pred = LR.predict(X_train) 
e_in = np.sum(yin_pred != y_train) / len(y_train)

yout_pred = LR.predict(X_test)
e_out = np.sum(yout_pred != y_test) / len(y_test)

print("E_in: {} \t E_out: {}".format(e_in, e_out))

E_in: 0.035 	 E_out: 0.022


Run the algorithm among $\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}$.

In [3]:
for log10_lam in range(2, -11, -1):
    # run ridge regression
    LR = RidgeClassifier(alpha=math.pow(10, log10_lam))
    LR.fit(X_train, y_train)

    # evaluate errors
    yin_pred = LR.predict(X_train) 
    e_in = np.sum(yin_pred != y_train) / len(y_train)

    yout_pred = LR.predict(X_test)
    e_out = np.sum(yout_pred != y_test) / len(y_test)

    print("log10(lambda): {} \t E_in: {} \t E_out: {}".format(log10_lam, e_in, e_out))

log10(lambda): 2 	 E_in: 0.1 	 E_out: 0.091
log10(lambda): 1 	 E_in: 0.035 	 E_out: 0.022
log10(lambda): 0 	 E_in: 0.035 	 E_out: 0.017
log10(lambda): -1 	 E_in: 0.03 	 E_out: 0.016
log10(lambda): -2 	 E_in: 0.03 	 E_out: 0.016
log10(lambda): -3 	 E_in: 0.03 	 E_out: 0.016
log10(lambda): -4 	 E_in: 0.03 	 E_out: 0.016
log10(lambda): -5 	 E_in: 0.03 	 E_out: 0.016
log10(lambda): -6 	 E_in: 0.035 	 E_out: 0.016
log10(lambda): -7 	 E_in: 0.03 	 E_out: 0.015
log10(lambda): -8 	 E_in: 0.015 	 E_out: 0.02
log10(lambda): -9 	 E_in: 0.015 	 E_out: 0.02
log10(lambda): -10 	 E_in: 0.015 	 E_out: 0.02


$14.$ What is the $\lambda$ with the minimum $E_{in}$?

$\lambda = 10^{-8}$

$15.$ What is the $\lambda$ with the minimum $E_{out}$?

$\lambda = 10^{-7}$

Now split the given training examples in $\mathcal{D}$ into the first 120 examples for $\mathcal{D}_{\text{train}}$ and the remaining 80 for $\mathcal{D}_{\text{val}}$.

Ideally, you should do the 120/80 split randomly. However, because the given examples are already randomly permuted, we use a fixed split for the purpose of this problem.
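
For data that has not been shuffled in advance, a random split could be sketched as follows. The function name `random_split`, the seed, and the synthetic arrays are assumptions made for illustration; the notebook itself uses the fixed split.

```python
import numpy as np

def random_split(X, y, n_train, seed=0):
    """Shuffle the example indices and split them into train/validation parts."""
    rng = np.random.default_rng(seed)   # fixed seed so the split is reproducible
    perm = rng.permutation(len(y))
    tr, va = perm[:n_train], perm[n_train:]
    return X[tr], y[tr], X[va], y[va]

# Synthetic stand-in with 200 examples and 2 features (not the homework data)
X_demo = np.arange(400, dtype=float).reshape(200, 2)
y_demo = np.where(np.arange(200) % 2 == 0, 1.0, -1.0)

X_tr, y_tr, X_va, y_va = random_split(X_demo, y_demo, n_train=120)
print(X_tr.shape, X_va.shape)
```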

In [4]:
X_train, X_val = X_train[:120, :], X_train[120:, :]
y_train, y_val = y_train[:120], y_train[120:]

Run the algorithm on $\mathcal{D}_{\text{train}}$ to get $g^{-}_{\lambda}$, and validate $g^{-}_{\lambda}$ with $\mathcal{D}_{\text{val}}$.

In [5]:
for log10_lam in range(2, -11, -1):
    # run ridge regression
    LR = RidgeClassifier(alpha=math.pow(10, log10_lam))
    LR.fit(X_train, y_train)

    # evaluate errors
    yin_pred = LR.predict(X_train) 
    e_in = np.sum(yin_pred != y_train) / len(y_train)
    
    yval_pred = LR.predict(X_val) 
    e_val = np.sum(yval_pred != y_val) / len(y_val)

    yout_pred = LR.predict(X_test)
    e_out = np.sum(yout_pred != y_test) / len(y_test)

    print("log10(lambda): {} \t E_in: {:.4f} \t E_val:{:.4f} \t E_out: {:.4f}".format(log10_lam, e_in, e_val, e_out))

log10(lambda): 2 	 E_in: 0.2500 	 E_val:0.3250 	 E_out: 0.3030
log10(lambda): 1 	 E_in: 0.0333 	 E_val:0.0625 	 E_out: 0.0430
log10(lambda): 0 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0220
log10(lambda): -1 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -2 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -3 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -4 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -5 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -6 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -7 	 E_in: 0.0333 	 E_val:0.0375 	 E_out: 0.0210
log10(lambda): -8 	 E_in: 0.0000 	 E_val:0.0500 	 E_out: 0.0250
log10(lambda): -9 	 E_in: 0.0000 	 E_val:0.1000 	 E_out: 0.0380
log10(lambda): -10 	 E_in: 0.0083 	 E_val:0.1250 	 E_out: 0.0400


$16.$ What is the $\lambda$ with minimum $E_{train}(g^{-}_{\lambda})$? 

$\lambda = 10^{-8}$

$17.$ What is the $\lambda$ with minimum $E_{val}(g^{-}_{\lambda})$? 

$\lambda = 10^{0}$

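$18.$ With the optimal $\lambda$ from the previous question, what are the values of $E_{in}(g_\lambda)$ and $E_{out}(g_\lambda)$?

Here $g_\lambda$ (without the minus superscript) is the hypothesis trained on the whole $\mathcal{D}$, so the values are those of the $\log_{10} \lambda = 0$ row of the full-data run in In [3]:

$E_{in}(g_\lambda) = 0.035$

$E_{out}(g_\lambda) = 0.017$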

For Questions 19-20, split the given training examples in $\mathcal{D}$ into five folds: the first 40 examples are fold 1, the next 40 are fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.
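
The fold construction can also be expressed with index arrays instead of hand-written slices. This is a minimal sketch assuming 200 examples and contiguous, equally sized folds; the generator name `kfold_indices` is hypothetical.

```python
import numpy as np

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n examples."""
    folds = np.array_split(np.arange(n), k)
    for v in range(k):
        # train on every fold except fold v, validate on fold v
        train_idx = np.concatenate([folds[i] for i in range(k) if i != v])
        yield train_idx, folds[v]

splits = list(kfold_indices(200, 5))
for train_idx, val_idx in splits:
    print(len(train_idx), val_idx[0], val_idx[-1])
```

Indexing with `X[train_idx]` and `X[val_idx]` then gives the same partitions as stacking the per-fold arrays.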



In [6]:
X_1 = X_train[:40, :]
y_1 = y_train[:40]

X_2 = X_train[40:80, :]
y_2 = y_train[40:80]

X_3 = X_train[80:, :]
y_3 = y_train[80:]

X_4 = X_val[:40, :]
y_4 = y_val[:40]

X_5 = X_val[40:, :]
y_5 = y_val[40:]

In [7]:
Xs = [X_1, X_2, X_3, X_4, X_5]
ys = [y_1, y_2, y_3, y_4, y_5]

$19.$ Run the algorithm with cross-validation among $\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}$.
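
For reference, the quantity computed in the next cell is the average validation error over the five folds. The fold index $v$, the fold symbol $\mathcal{D}_v$, and the per-fold hypothesis $g^{-}_{\lambda, v}$ (trained on all folds except fold $v$) are notation assumed for this note:

$E_{cv}(\lambda) = \frac{1}{5} \sum_{v=1}^{5} E_{val}^{(v)}(g^{-}_{\lambda, v}), \quad E_{val}^{(v)}(g) = \frac{1}{40} \sum_{(\mathbf{x}_n, y_n) \in \mathcal{D}_v} [\![ g(\mathbf{x}_n) \neq y_n ]\!]$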

In [8]:
for log10_lam in range(2, -11, -1):
    e_cv = []
    
    for v in range(5):
        # train on every fold except fold v
        X_train = np.vstack(Xs[:v] + Xs[(v+1):])
        y_train = np.hstack(ys[:v] + ys[(v+1):])
        
        X_val = Xs[v]
        y_val = ys[v]
        
        # run ridge regression
        LR = RidgeClassifier(alpha=math.pow(10, log10_lam))
        LR.fit(X_train, y_train)
        
        # evaluate errors
        yval_pred = LR.predict(X_val) 
        e_val = np.sum(yval_pred != y_val) / len(y_val)

        e_cv.append(e_val)
    
    e_cv = sum(e_cv) / len(e_cv)
    print("log10(lambda): {} \t E_cv:{:.4f}".format(log10_lam, e_cv))

log10(lambda): 2 	 E_cv:0.1400
log10(lambda): 1 	 E_cv:0.0400
log10(lambda): 0 	 E_cv:0.0350
log10(lambda): -1 	 E_cv:0.0350
log10(lambda): -2 	 E_cv:0.0350
log10(lambda): -3 	 E_cv:0.0350
log10(lambda): -4 	 E_cv:0.0350
log10(lambda): -5 	 E_cv:0.0350
log10(lambda): -6 	 E_cv:0.0350
log10(lambda): -7 	 E_cv:0.0350
log10(lambda): -8 	 E_cv:0.0300
log10(lambda): -9 	 E_cv:0.0500
log10(lambda): -10 	 E_cv:0.0500


Among $\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}$, what is the $\lambda$ with minimum $E_{cv}$? 

$\lambda = 10^{-8}$

$20.$ With the optimal $\lambda$ from the previous question, what are the values of $E_{in}(g_\lambda)$ and $E_{out}(g_\lambda)$ on the whole dataset $\mathcal{D}$?

In [9]:
# rebuild the whole data set D from the five folds
X_train = np.vstack(Xs)
y_train = np.hstack(ys)
        
# run ridge regression
LR = RidgeClassifier(alpha=math.pow(10, -8))
LR.fit(X_train, y_train)

# evaluate errors
yin_pred = LR.predict(X_train) 
e_in = np.sum(yin_pred != y_train) / len(y_train)

yout_pred = LR.predict(X_test)
e_out = np.sum(yout_pred != y_test) / len(y_test)

print("E_in: {} \t E_out: {}".format(e_in, e_out))

E_in: 0.015 	 E_out: 0.02
