# Homework 6

## Problem 4 - Weight Decay

Same as Problem 3, except that we now use $\lambda = 10^{3}$.
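
For reference, weight decay adds a penalty on the squared weights to the squared in-sample error. The formulas below are the standard regularized least-squares result, written with the symbols used in the code further down ($Z$ for the transformed feature matrix, $y$ for the labels, $N$ for the number of points); they are a summary under those assumptions, not a quotation from the problem statement.

$$
E_{aug}(w) = \frac{1}{N}\,\lVert Zw - y \rVert^{2} + \frac{\lambda}{N}\, w^{T} w
$$

$$
\tilde{w}_{reg} = \left(Z^{T} Z + \lambda I\right)^{-1} Z^{T} y
$$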


## Import libraries and read data

Let's first import libraries and read data into a pandas dataframe.

In [1]:
import pandas as pd        # for reading in the data
import numpy as np

## A. Read in training set

We read in the in-sample data.

In [2]:
# dataframe df
# assign header names for each column
# tell pandas that data is separated by whitespace
# tell pandas that datatype is float64 
df_train = pd.read_csv('in.dta.txt', names=["x1", "x2", "y"], sep=r'\s+', dtype=np.float64)
print('The first 5 rows of the table:\n')
print(df_train.head(5))
print()


# Examine data
rows, col = df_train.shape
print("The table has {0} rows and {1} columns.".format(rows, col))
print("So we have N = {0} data points (x1,x2) with classification y.".format(rows))

The first 5 rows of the table:

         x1        x2    y
0 -0.779470  0.838221  1.0
1  0.155635  0.895377  1.0
2 -0.059908 -0.717780  1.0
3  0.207596  0.758933  1.0
4 -0.195983 -0.375487 -1.0

The table has 35 rows and 3 columns.
So we have N = 35 data points (x1,x2) with classification y.


## Implement Linear Regression with regularization (weight decay)

In [3]:
def problem_4_weight_decay(dataframe, lambda_param):
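    '''
    Takes a pandas dataframe with columns x1, x2, y (the data to fit)
    and the regularization parameter lambda_param.
    
    Returns the classification error on that same data and the weight
    vector w_tilde_reg obtained with weight decay as regularization.
    '''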
    
    # Use data from the pandas dataframe
    x1 = np.array(dataframe['x1'])
    x2 = np.array(dataframe['x2'])
    y = np.array(dataframe['y'])
    N = dataframe.shape[0]
    
    # feature matrix Z with 8 columns
    Z = np.array([np.ones(N), x1, x2,
                  x1**2, x2**2, x1*x2,
                  np.absolute(x1-x2), np.absolute(x1+x2)]).T

    num_columns_Z = Z.shape[1]
    
    # see lecture 3, slide 17
    Z_dagger_reg = np.dot(np.linalg.inv(np.dot(Z.T, Z) + lambda_param * np.identity(num_columns_Z)), Z.T)

    # Use linear regression to get weight vector
    w_tilde_reg = np.dot(Z_dagger_reg, y)

    # compute classification error
    error = sum(y != np.sign(np.dot(Z, w_tilde_reg))) / N
    return (error, w_tilde_reg)
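
The regularized weights can also be obtained without forming an explicit inverse. The sketch below is an alternative to the cell above, not the method used in this notebook: `weight_decay_solve` is a hypothetical helper name, and the data are synthetic random numbers used only to check that both routes agree.

```python
import numpy as np

def weight_decay_solve(Z, y, lambda_param):
    """Hypothetical helper: solve (Z^T Z + lambda I) w = Z^T y for w."""
    A = Z.T @ Z + lambda_param * np.identity(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

# Synthetic data (an assumption for illustration, not the homework data)
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(35, 8))
y_demo = np.sign(rng.normal(size=35))

w_solve = weight_decay_solve(Z_demo, y_demo, 10**3)

# Same formula with an explicit inverse, as in the notebook cell
w_inv = np.linalg.inv(Z_demo.T @ Z_demo + 10**3 * np.identity(8)) @ Z_demo.T @ y_demo

print(np.allclose(w_solve, w_inv))
```

Solving the linear system is usually preferred numerically over computing the inverse, although for an 8 x 8 matrix with a large $\lambda$ on the diagonal either approach should behave well.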

## Compute in-sample classification error $E_{in}$

In [4]:
lambda_param = 10**3
E_in, w_tilde_reg = problem_4_weight_decay(df_train, lambda_param)
print('The in-sample classification error is E_in = {0}'.format(E_in))

The in-sample classification error is E_in = 0.37142857142857144


## B. Read in test set

We read in the out-of-sample data.

In [5]:
# dataframe df
# assign header names for each column
# tell pandas that data is separated by whitespace
# tell pandas that datatype is float64 
df_test = pd.read_csv('out.dta.txt', names=["x1", "x2", "y"], sep=r'\s+', dtype=np.float64)
print('The first 5 rows of the table:\n')
print(df_test.head(5))
print()


# Examine data
rows, col = df_test.shape
print("The table has {0} rows and {1} columns.".format(rows, col))
print("So we have N = {0} data points (x1,x2) with classification y.".format(rows))

The first 5 rows of the table:

         x1        x2    y
0 -0.106006 -0.081467 -1.0
1  0.177930 -0.345951 -1.0
2  0.102162  0.718258  1.0
3  0.694078  0.623397 -1.0
4  0.023541  0.727432  1.0

The table has 250 rows and 3 columns.
So we have N = 250 data points (x1,x2) with classification y.


## Compute out-of-sample classification error $E_{out}$

In [6]:
# Use data from the pandas dataframe
x1 = np.array(df_test['x1'])
x2 = np.array(df_test['x2'])
y = np.array(df_test['y'])
N = df_test.shape[0]

# feature matrix Z
Z = np.array([np.ones(N), x1, x2,
             x1**2, x2**2, x1*x2,
             np.absolute(x1-x2), np.absolute(x1+x2)]).T

# Compute out-of-sample error
E_out = sum(y != np.sign(np.dot(Z, w_tilde_reg))) / N
print('The out-of-sample classification error is E_out = {0}'.format(E_out))

The out-of-sample classification error is E_out = 0.436
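
The feature matrix is built the same way for the training and the test set, so the construction could live in one place. The following is a minimal sketch of that idea; `transform` and `classification_error` are hypothetical helper names, and the two demo points and the demo weight vector are made up for illustration.

```python
import numpy as np

def transform(x1, x2):
    """Hypothetical helper: build the 8-column feature matrix Z."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1**2, x2**2, x1 * x2,
                            np.abs(x1 - x2), np.abs(x1 + x2)])

def classification_error(Z, y, w):
    """Fraction of points whose predicted sign differs from the label."""
    return np.mean(np.sign(Z @ w) != y)

# Two made-up points: (1, 2) and (-1, 0.5), both labelled +1
Z_demo = transform([1.0, -1.0], [2.0, 0.5])
y_demo = np.array([1.0, 1.0])

# Made-up weights that classify by the sign of x1 only
w_demo = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

err_demo = classification_error(Z_demo, y_demo, w_demo)
print(Z_demo.shape, err_demo)
```

With such helpers, the out-of-sample error would be `classification_error(transform(x1, x2), y, w_tilde_reg)` on the test columns.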


## Pick the answer

As the problem statement requires, we use the Euclidean distance to determine which of the possible answers is closest to our computed values.

In [7]:
choices = [(0.2, 0.2), (0.2, 0.3), (0.3, 0.3), (0.3, 0.4), (0.4, 0.4)]

computed_values = (E_in, E_out)

min_distance = float('inf')   # any real distance will be smaller
pick_choice = None

print("Our computed values are (E_in, E_out) = ", computed_values, "\n")

for choice in choices:
    distance = np.linalg.norm(np.array(choice) - np.array(computed_values))
    if distance < min_distance:
        min_distance = distance
        pick_choice = choice
    print("choice=", choice, "\tEuclidean distance:", distance)

    
print("\nWe pick:", pick_choice)

Our computed values are (E_in, E_out) =  (0.37142857142857144, 0.436) 

choice= (0.2, 0.2) 	Euclidean distance: 0.291691198191
choice= (0.2, 0.3) 	Euclidean distance: 0.218823570719
choice= (0.3, 0.3) 	Euclidean distance: 0.153616538225
choice= (0.3, 0.4) 	Euclidean distance: 0.0799877541648
choice= (0.4, 0.4) 	Euclidean distance: 0.0459600536402

We pick: (0.4, 0.4)
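
The same selection can be written without an explicit loop. This is a sketch of an equivalent vectorized form, with the computed errors copied in as literals from the output above so that the snippet runs on its own.

```python
import numpy as np

choices = np.array([(0.2, 0.2), (0.2, 0.3), (0.3, 0.3), (0.3, 0.4), (0.4, 0.4)])

# (E_in, E_out) as printed above, hard-coded here for a standalone example
computed = np.array([0.37142857142857144, 0.436])

# Row-wise Euclidean distance from each choice to the computed point
distances = np.linalg.norm(choices - computed, axis=1)

best_index = int(np.argmin(distances))
best_choice = choices[best_index]
print(best_index, best_choice)
```

`np.argmin` returns the first index of the smallest distance, which matches the loop's behavior of keeping the earliest strictly closer choice.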


## Result

The correct answer is **4[e]** (0.4, 0.4).