# Assignment 4

# This is a mini-project assignment that includes only programming questions. You are asked to implement optimization algorithms for ML classification problems. 

## Marking of this assignment will be based on the correctness of your ML pipeline and efficiency of your code. 

## Upload your code to the Learn dropbox and submit PDFs of the code to Crowdmark.

## -----------------------------------------------------------------------------------------------------------

In [None]:
# !pip install numpy scipy scikit-image matplotlib

import time
import math 
import random 
import itertools

import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.sparse as sp
from scipy.sparse import diags
from scipy.sparse import kron
from scipy.sparse import identity
from scipy.sparse.linalg import eigsh
from scipy.sparse import csr_matrix


## Suggested way of loading data into Python for the assignment. There are alternatives, of course; use your preferred way if you want.

In [None]:
# Download the LIBSVM package from here: https://www.csie.ntu.edu.tw/~cjlin/libsvm/#download 
# If your download is successful you should have a folder named libsvm-3.24.
# We will use this package to load datasets. 

# Enter the downloaded folder libsvm-3.24 through your terminal. 
# Run make command to compile the package.

# Load this auxiliary package.
import sys

# add here your path to the folder libsvm-3.24/python
path = "/home/oymamatt/workplace/opt4ml/optmization4ML/libsvm-3.24/python/"
# Add the path to the Python paths so Python can find the module.
sys.path.append(path)

# Load the LIBSVM module.
from svmutil import *

## Datasets that you will need for this assignment.

In [None]:
# There is an extended selection of classification and regression datasets 
# https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

# Out of all these datasets you will need the following 3 datasets, which are datasets for classification problems.
# 
# a9a dataset: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a 
# This dataset is small; it is recommended to start your experiments with it.
#
# news20.binary dataset: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
#
# covtype.binary dataset: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#covtype.binary
#
# Exploit the sparsity of the problem when you implement optimization methods.

In [None]:
def to_sparse(list_dict_features, list_labels=None):
    samples = len(list_dict_features)
    
    # protect against an all zero feature
    feats = {
        f_id: idx 
        for idx, f_id in 
        enumerate(sorted(list({s for sample in list_dict_features for s in sample.keys()})))
    }
    
    num_of_features = len(feats)
    
    mat = sp.dok_matrix((samples, num_of_features), dtype=np.float64)
    
    for sample_id, sample in enumerate(list_dict_features):
        for feature_id, feature_val in sample.items():
            mat[sample_id, feats[feature_id]] = feature_val

    mat = mat.tocsr()
    l = csr_matrix(list_labels).transpose() if list_labels else None
    return mat, l


def split_train_validate(A, B, valid_ratio=0.1, seed=777):
    samples = B.shape[0]
    idx = np.arange(samples)
    np.random.seed(seed)
    np.random.shuffle(idx)
    valid_size = math.floor(valid_ratio * samples)
    train_size = samples - valid_size
    return A[idx[:train_size], :], B[idx[:train_size]], A[idx[train_size:], :], B[idx[train_size:]]
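
`to_sparse` above fills a `dok_matrix` one entry at a time, which can become slow on larger datasets such as news20.binary. One possible alternative, sketched below, collects (row, column, value) triplets and builds the CSR matrix in a single call. The helper name `dicts_to_csr` and its `num_features` parameter are hypothetical and not part of the assignment; the sketch also assumes LIBSVM feature indices start at 1.

```python
import numpy as np
import scipy.sparse as sp


def dicts_to_csr(list_dict_features, num_features=None):
    """Build a CSR matrix from LIBSVM-style rows (dicts of feature index -> value)."""
    rows, cols, vals = [], [], []
    for row_id, sample in enumerate(list_dict_features):
        for feature_id, value in sample.items():
            rows.append(row_id)
            cols.append(feature_id - 1)  # assumption: feature indices are 1-based
            vals.append(value)
    if num_features is None:
        # Fall back to the largest index seen in this data
        num_features = max(cols) + 1 if cols else 0
    shape = (len(list_dict_features), num_features)
    return sp.coo_matrix((vals, (rows, cols)), shape=shape, dtype=np.float64).tocsr()


# Three samples, four features; the second sample has no non-zero entries.
toy_rows = [{1: 1.0, 3: 2.0}, {}, {2: -1.0}]
toy_matrix = dicts_to_csr(toy_rows, num_features=4)
```

Passing the same `num_features` for training and test data keeps the column layout identical across both matrices. `to_sparse`, by contrast, derives its columns from whichever features appear in the data it is given, so matrices built separately can end up with different layouts.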

In [None]:
# Add here your path to the dataset file
path = "/home/oymamatt/workplace/opt4ml/optmization4ML/a9a.txt"
# path = "/home/oymamatt/workplace/opt4ml/optmization4ML/iris"
# path = "/home/oymamatt/workplace/opt4ml/optmization4ML/news20.binary"
# Use "svm_read_problem" function to load data for your assignment.
# It will store the labels in "B" and the data matrix in "A".
B, A = svm_read_problem(path)

# Note that matrix "A" stores the data in a sparse format. 
# In particular matrix "A" is a list of dictionaries. 
# The length of the list gives you the number of samples.
# Each entry in the list is a dictionary. The keys of the dictionary are the non-zero features.
# The value stored for each key is the corresponding feature value (a float).

## Training, Validation and Testing data

In [None]:
# All datasets above consist of training and testing data. 

# You should separate the training data into training and validation data.
# Follow the instructions from the lectures about how you can use both training and validation data.
# You can use 10% of the training data as validation data and the remaining 90% to train the models.
# This is a suggested percentage, you can do otherwise if you wish.

# Do not use the testing data to influence training in any way. Do not use the testing data at all.
# Only your instructor and TA will use the testing data to measure generalization error. 
# If you do use the testing data to tune parameters or for training of the algorithms we will figure it out :-).

## Optimization problems

### You need to solve the following optimization problems 

Hinge-loss
$$\mbox{minimize}_{x\in\mathbb{R}^d, \beta \in \mathbb{R}} \ \frac{1}{n} \sum_{i=1}^n \max \{0,1-b_i(a_i^Tx + \beta)\},$$
where $a_i\in\mathbb{R}^d$ is the feature vector for sample $i$ and $b_i$ is the label of sample $i$. The sub-gradient of the hinge-loss is given in the lecture slides (note that there is a small difference due to the intercept $\beta$). A smooth approximation of the function $f(z):=\max\{0,1-z\}$ is given by
$$
\psi_\mu(z) = 
\begin{cases}
0 & z\ge 1\\
(1-z)^2 & \mu < z < 1 \\
(1-\mu)^2 + 2(1-\mu)(\mu-z) & z \le \mu.
\end{cases}
$$
You can use the smooth approximation $\psi_\mu(z)$ for methods that work only for smooth functions. For sub-gradient methods you should use the sub-gradient.
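
As a sanity check on the piecewise definition, the sketch below evaluates $\psi_\mu(z)$ and its derivative elementwise and compares the derivative with central finite differences. By the chain rule with $z = b_i(a_i^Tx + \beta)$, the gradient with respect to $x$ is $\psi_\mu'(z)\, b_i a_i$ and the derivative with respect to $\beta$ is $\psi_\mu'(z)\, b_i$. The function names and the test points are illustrative choices, not part of the assignment.

```python
import numpy as np


def smooth_hinge(z, mu):
    """Evaluate psi_mu(z) elementwise for an array of margins z."""
    z = np.asarray(z, dtype=float)
    quadratic = (1.0 - z) ** 2
    linear = (1.0 - mu) ** 2 + 2.0 * (1.0 - mu) * (mu - z)
    return np.where(z >= 1.0, 0.0, np.where(z > mu, quadratic, linear))


def smooth_hinge_derivative(z, mu):
    """Evaluate d psi_mu / dz elementwise."""
    z = np.asarray(z, dtype=float)
    quadratic_slope = -2.0 * (1.0 - z)
    linear_slope = -2.0 * (1.0 - mu)
    return np.where(z >= 1.0, 0.0, np.where(z > mu, quadratic_slope, linear_slope))


# Central finite differences at points away from the breakpoints z = mu and z = 1.
z_check = np.array([-1.0, 0.05, 0.5, 0.9, 1.5])
mu_check = 0.1
h = 1e-6
numeric = (smooth_hinge(z_check + h, mu_check) - smooth_hinge(z_check - h, mu_check)) / (2.0 * h)
analytic = smooth_hinge_derivative(z_check, mu_check)
```

The two pieces agree in value and slope at $z = \mu$, which is what makes the approximation continuously differentiable there.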

L2-regularized logistic regression
$$\mbox{minimize}_{x\in\mathbb{R}^d,\beta\in\mathbb{R}} \ \lambda \|x\|_2^2 + \frac{1}{n} \sum_{i=1}^n \log (1+ \exp(-b_i(a_i^Tx + \beta))).$$
This is a smooth objective function, therefore, you should use gradient methods to solve it. You do not need sub-gradient methods for this problem.
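
For reference, a minimal dense sketch of this objective and its gradient follows. It uses `np.logaddexp` and a `tanh` form of the sigmoid so that large margins do not overflow `exp`. The function name `logistic_loss_and_grad`, the dense array layout, and the toy data are assumptions made for illustration; a sparse implementation would replace the matrix products accordingly.

```python
import numpy as np


def logistic_loss_and_grad(A, b, x, beta, lam):
    """L2-regularized logistic loss and gradient for dense A (n x d), labels b in {-1, +1}."""
    n = A.shape[0]
    margins = b * (A @ x + beta)
    # log(1 + exp(-m)) written as logaddexp(0, -m) to avoid overflow for very negative m
    loss = lam * np.dot(x, x) + np.logaddexp(0.0, -margins).mean()
    # 1 / (1 + exp(m)) expressed through tanh, so no large exponential is formed
    weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
    grad_x = 2.0 * lam * x - A.T @ (weights * b) / n
    grad_beta = -np.sum(weights * b) / n
    return loss, grad_x, grad_beta


# Small random problem used only to exercise the function.
rng = np.random.default_rng(0)
A_toy = rng.normal(size=(5, 3))
b_toy = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
x_toy = rng.normal(size=3)
loss0, gx, gb = logistic_loss_and_grad(A_toy, b_toy, x_toy, 0.2, 0.1)
```

Comparing `gx` and `gb` against finite differences of the loss is a cheap way to catch sign errors before running any optimizer.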

In [None]:
def hinge_margin(b, a, x, beta):
    m = a.dot(x) + beta
    m = b.multiply(m)
    return m.toarray()

def hinge_predict(a, x, beta):
    m = a.dot(x) + beta
    labels = np.sign(m)
    return labels

#=============================================================================

def hinge_loss(b, a, x, beta):
    margin = hinge_margin(b, a, x, beta)
    margin_max = 1.0 - margin
    margin_max[margin_max < 0.0] = 0.0
    loss = np.sum(margin_max) / b.shape[0]
    return loss

def hinge_grad(b, a, x, beta, reduce=True):
    margin = hinge_margin(b, a, x, beta)
    # Squeeze only the column axis so the mask stays 1-D even for a single sample.
    margin_mask = (margin >= 1.0).squeeze(axis=1)
    
    acc_x = -1.0 * b.multiply(a)
    acc_x[margin_mask, :] = acc_x[margin_mask, :].multiply(0.0)
    if reduce:
        g_x = acc_x.sum(axis=0).T / b.shape[0]
    else:
        g_x = acc_x
    
    acc_beta = -1.0 * b
    acc_beta[margin_mask] = 0.0
    
    if reduce:
        g_beta = acc_beta.sum() / b.shape[0]
    else:
        g_beta = acc_beta
    
    return g_x, g_beta

#=============================================================================

def smooth_hinge_loss(b, a, x, beta, mu):
    z = hinge_margin(b, a, x, beta)
    case_1_mask = (z >= 1.0).squeeze(axis=1)
    case_2_mask = ((mu < z) & (z < 1.0)).squeeze(axis=1)
    case_3_mask = (z <= mu).squeeze(axis=1)
    
    f = z.copy()
    f[case_1_mask] = 0.0
    f[case_2_mask] = (1.0 - z[case_2_mask]) ** 2.0
    f[case_3_mask] = (1.0 - mu) ** 2.0 + 2.0 * (1.0 - mu) * (mu - z[case_3_mask])

    loss = np.sum(f) / b.shape[0]
    return loss

def smooth_hinge_grad(b, a, x, beta, mu, reduce=True):
    z = hinge_margin(b, a, x, beta)
    
    case_1_mask = (z >= 1.0).squeeze(axis=1)
    case_2_mask = ((mu < z) & (z < 1.0)).squeeze(axis=1)
    case_3_mask = (z <= mu).squeeze(axis=1)
    
    # Chain rule: d psi/dx = psi'(z) * b * a, where psi'(z) = -2(1 - z) on the
    # quadratic piece and -2(1 - mu) on the linear piece.
    acc_x = b.multiply(a)
    acc_x[case_1_mask, :] *= 0.0
    acc_x[case_2_mask, :] = acc_x[case_2_mask, :].multiply(-2.0 * (1.0 - z[case_2_mask, :]))
    acc_x[case_3_mask, :] *= -2.0 * (1.0 - mu)
    if reduce:
        g_x = acc_x.sum(axis=0).T / b.shape[0]
    else:
        g_x = acc_x
    
    # Same derivative for the intercept: d psi/dbeta = psi'(z) * b.
    acc_beta = 1.0 * b
    acc_beta[case_1_mask, :] *= 0.0
    acc_beta[case_2_mask, :] = acc_beta[case_2_mask, :].multiply(-2.0 * (1.0 - z[case_2_mask, :]))
    acc_beta[case_3_mask, :] *= -2.0 * (1.0 - mu)
    if reduce:
        g_beta = acc_beta.sum() / b.shape[0]
    else:
        g_beta = acc_beta
    
    return g_x, g_beta

#=============================================================================

def reg_logistic_loss(b, a, x, beta, lambda_):
    margin = hinge_margin(b, a, x, beta)
    logistic = np.log(1.0 + np.exp(-1.0 * margin))
    reg = lambda_ * (np.linalg.norm(x) ** 2.0)
    loss = reg + (np.sum(logistic) / b.shape[0])
    return loss

def reg_logistic_grad(b, a, x, beta, lambda_, reduce=True):
    margin = hinge_margin(b, a, x, beta)
    z = -1.0 * margin
    exp_z = np.exp(z)
    log_grad = exp_z / (1.0 + exp_z)
    
    z_x = -1.0 * b.multiply(a)
    acc_x = z_x.multiply(log_grad)
    if reduce:
        g_x = 2.0 * lambda_ * x + acc_x.sum(axis=0).T / b.shape[0]
    else:
        g_x = 2.0 * lambda_ * x.T + acc_x
    
    z_beta = -1.0 * b
    acc_beta = z_beta.multiply(log_grad)
    if reduce:
        g_beta = acc_beta.sum() / b.shape[0]
    else:
        g_beta = acc_beta.todense()
    
    return g_x, g_beta

#=============================================================================

def reg_logistic_predict(a, x, beta):
    m = a.dot(x) + beta
    probs = 1.0 / (1.0 + np.exp(-1.0 * m))
    labels = probs.copy()
    labels[probs > 0.5] = 1.0
    labels[probs <= 0.5] = -1.0
    return labels

In [None]:
class HingeLossModel:
    def loss(self, b, a, x, beta):
        return hinge_loss(b=b, a=a, x=x, beta=beta)
    
    def grad(self, b, a, x, beta, reduce=True):
        return hinge_grad(b=b, a=a, x=x, beta=beta, reduce=reduce)
    
    def predict(self, a, x, beta):
        return hinge_predict(a=a, x=x, beta=beta)


class SmoothHingeLossModel:
    def __init__(self, mu=0.1):
        self.mu = mu
    
    def loss(self, b, a, x, beta):
        return smooth_hinge_loss(b=b, a=a, x=x, beta=beta, mu=self.mu)
    
    def grad(self, b, a, x, beta, reduce=True):
        return smooth_hinge_grad(b=b, a=a, x=x, beta=beta, mu=self.mu, reduce=reduce)
    
    def predict(self, a, x, beta):
        return hinge_predict(a=a, x=x, beta=beta)


class LogisticRegressionModel:
    def __init__(self, lambda_=0.01):
        self._lambda = lambda_
        
    def loss(self, b, a, x, beta):
        return reg_logistic_loss(b=b, a=a, x=x, beta=beta, lambda_=self._lambda)
    
    def grad(self, b, a, x, beta, reduce=True):
        return reg_logistic_grad(b=b, a=a, x=x, beta=beta, lambda_=self._lambda, reduce=reduce)
    
    def predict(self, a, x, beta):
        return reg_logistic_predict(a=a, x=x, beta=beta)

## Optimization algorithms

In [None]:
# For this assignment you will need the following methods


# 1) Stochastic sub-gradient
# 2) Stochastic gradient
# 3) Mini-batch (sub-)gradient (you will have to decide what batching strategy to use, see lecture slides)
class Optimizer123:
    def __init__(self, model, init_lr=0.01, grad_all_every=10, decrease_lr=False, batch_size=1, epsilon=0.000001, max_iterations=200, seed=777):
        self.model = model
        self.batch_size = batch_size
        self.epsilon = epsilon
        self.max_iterations = max_iterations
        self.seed = seed
        self.init_lr = init_lr
        self.grad_all_every = grad_all_every
        self.decrease_lr = decrease_lr
        self.x = None
        self.beta = None
        self.steps = None
        
    def fit(self, a_t, b_t):
        np.random.seed(self.seed)
        self.x = csr_matrix(np.random.rand(a_t.shape[1], 1))
        self.beta = 0.0
        self.sgd(a_t_all=a_t, b_t_all=b_t)

    def predict(self, a):
        predictions = self.model.predict(a=a, x=self.x, beta=self.beta)
        return list(itertools.chain.from_iterable(predictions.tolist()))
    
    def sgd(self, a_t_all, b_t_all):
        x_updated = self.x.copy().toarray()
        beta_updated = self.beta
        f_vals = []
        norm_vals = []
        samples = b_t_all.shape[0]
        t1 = time.time()
        for i in range(1, self.max_iterations+1):
            if i % self.grad_all_every == 0:
                a_t = a_t_all
                b_t = b_t_all
            else:
                batch_idx = np.random.randint(low=0, high=samples, size=self.batch_size)
                a_t = a_t_all[batch_idx, :]
                b_t = b_t_all[batch_idx, :]
            
            current_grad_x, current_grad_beta = self.model.grad(b=b_t, a=a_t, x=x_updated, beta=beta_updated)
            total_grad = np.vstack((current_grad_x, [current_grad_beta]))
            current_grad_norm = np.linalg.norm(total_grad)
            norm_vals.append(current_grad_norm)
            f_vals.append(self.model.loss(b=b_t, a=a_t, x=x_updated, beta=beta_updated))
            
            alpha = self.init_lr / (float(i) if self.decrease_lr else 1.0)
            x_updated = x_updated - alpha * current_grad_x
            beta_updated = beta_updated - alpha * current_grad_beta
            f_diff = (f_vals[-1] - f_vals[-2]) if len(f_vals) > 1 else None
            grad_diff = (norm_vals[-1] - norm_vals[-2]) if len(norm_vals) > 1 else None
            
            step_num = f"--{i}--" if i % self.grad_all_every == 0 else str(i)
            print(f"Step = {step_num}: alpha = {alpha}, Function = {f_vals[-1]}, Function Diff. =  {f_diff}, Grad. Norm = {norm_vals[-1]}, Grad. Diff. = {grad_diff}")
        t2 = time.time()
        print(f"Iterations (Total) time = {t2-t1}")
        self.x = x_updated
        self.beta = beta_updated
        self.steps = np.array(f_vals)  

#=============================================================================

# 4) Stochastic average sub-gradient (SAG)
# 5) Stochastic average gradient (SAG)
class Optimizer45:
    def __init__(self, model, init_lr=0.01, decrease_lr=False, epsilon=0.000001, max_iterations=400, seed=777):
        self.model = model
        self.epsilon = epsilon
        self.max_iterations = max_iterations
        self.seed = seed
        self.init_lr = init_lr
        self.decrease_lr = decrease_lr
        self.x = None
        self.beta = None
        self.steps = None
        
    def fit(self, a_t, b_t):
        np.random.seed(self.seed)
        self.x = csr_matrix(np.random.rand(a_t.shape[1], 1))
        self.beta = 0.0
        self.sag(a_t_all=a_t, b_t_all=b_t)

    def predict(self, a):
        predictions = self.model.predict(a=a, x=self.x, beta=self.beta)
        return list(itertools.chain.from_iterable(predictions.tolist()))
    
    def sag(self, a_t_all, b_t_all):
        x_updated = self.x.copy().toarray()
        beta_updated = self.beta
        f_vals = []
        norm_vals = []
        samples = b_t_all.shape[0]
        t1 = time.time()
        
        all_grad_x, all_grad_beta = self.model.grad(b=b_t_all, a=a_t_all, 
                                                    x=x_updated, beta=beta_updated, 
                                                    reduce=False)
        
        grad_x_sum = all_grad_x.sum(axis=0).T
        grad_beta_sum = float(all_grad_beta.sum(axis=0).squeeze())
        
        for i in range(1, self.max_iterations+1):
            sample_idx = int(np.random.randint(low=0, high=samples, size=1).squeeze())
            
            a_t = a_t_all[sample_idx, :]
            b_t = b_t_all[sample_idx, :]
            
            current_grad_x, current_grad_beta = self.model.grad(b=b_t, a=a_t, x=x_updated, beta=beta_updated)
            old_grad_x, old_grad_beta = all_grad_x[sample_idx, :].T, float(all_grad_beta[sample_idx, :])
            all_grad_x[sample_idx, :] = current_grad_x.T
            all_grad_beta[sample_idx, :] = float(current_grad_beta)
            
            grad_x_sum = current_grad_x - old_grad_x + grad_x_sum
            grad_beta_sum = float(current_grad_beta) - old_grad_beta + grad_beta_sum
            
            total_grad = np.vstack((grad_x_sum, [grad_beta_sum])) / samples
            current_grad_norm = np.linalg.norm(total_grad)
            norm_vals.append(current_grad_norm)
            f_vals.append(self.model.loss(b=b_t, a=a_t, x=x_updated, beta=beta_updated))

            if current_grad_norm <= self.epsilon:
                break
            
            alpha = self.init_lr / (float(i) if self.decrease_lr else 1.0)
            x_updated = x_updated - (alpha / samples) * grad_x_sum
            beta_updated = beta_updated - (alpha / samples) * grad_beta_sum
            f_diff = (f_vals[-1] - f_vals[-2]) if len(f_vals) > 1 else None
            grad_diff = (norm_vals[-1] - norm_vals[-2]) if len(norm_vals) > 1 else None
            print(f"Step = {i}: alpha = {alpha}, Function = {f_vals[-1]}, Function Diff. =  {f_diff}, Grad. Norm = {norm_vals[-1]}, Grad. Diff. = {grad_diff}")
        t2 = time.time()
        print(f"Iterations (Total) time = {t2-t1}")
        self.x = x_updated
        self.beta = beta_updated
        self.steps = np.array(f_vals) 
    
#=============================================================================

# 6) Gradient descent with Armijo line-search
class Optimizer6:
    def __init__(self, model, epsilon=0.000001, max_iterations=400, gamma=0.1, seed=777):
        self.model = model
        self.epsilon = epsilon
        self.max_iterations = max_iterations
        self.gamma = gamma
        self.seed = seed
        self.x = None
        self.beta = None
        self.steps = None
        
    def fit(self, a_t, b_t):
        np.random.seed(self.seed)
        self.x = csr_matrix(np.random.rand(a_t.shape[1], 1))
        self.beta = 0.0
        self.gradient_descent_arm(a_t=a_t, b_t=b_t)

    def predict(self, a):
        predictions = self.model.predict(a=a, x=self.x, beta=self.beta)
        return list(itertools.chain.from_iterable(predictions.tolist()))
    
    def line_search_arm(self, a, b, x, beta, f, grad_x, grad_beta, current_grad_norm):
        decrease = self.gamma * (current_grad_norm ** 2.0)
        alpha = 1.0
        while self.model.loss(b=b, a=a, x=x-alpha*grad_x, beta=beta-alpha*grad_beta) > f - alpha * decrease:
            alpha /= 2.0
        return alpha
    
    def gradient_descent_arm(self, a_t, b_t):
        x_updated = self.x.copy().toarray()
        beta_updated = self.beta
        f_vals = []
        norm_vals = []
        t1 = time.time()
        for i in range(1, self.max_iterations+1):
            current_grad_x, current_grad_beta = self.model.grad(b=b_t, a=a_t, x=x_updated, beta=beta_updated)
            
            total_grad = np.vstack((current_grad_x, [current_grad_beta]))
            current_grad_norm = np.linalg.norm(total_grad)
            norm_vals.append(current_grad_norm)
            f_vals.append(self.model.loss(b=b_t, a=a_t, x=x_updated, beta=beta_updated))
            
            if current_grad_norm <= self.epsilon:
                break
            
            alpha = self.line_search_arm(a=a_t, b=b_t, f=f_vals[-1],
                                         x=x_updated, beta=beta_updated,
                                         grad_x=current_grad_x, 
                                         grad_beta=current_grad_beta, 
                                         current_grad_norm=current_grad_norm)
            
            x_updated = x_updated - alpha * current_grad_x
            beta_updated = beta_updated - alpha * current_grad_beta
            
            f_diff = (f_vals[-1] - f_vals[-2]) if len(f_vals) > 1 else None
            grad_diff = (norm_vals[-1] - norm_vals[-2]) if len(norm_vals) > 1 else None
            print(f"Step = {i}: alpha = {alpha}, Function = {f_vals[-1]}, Function Diff. =  {f_diff}, Grad. Norm = {norm_vals[-1]}, Grad. Diff. = {grad_diff}")
        t2 = time.time()
        print(f"Iterations (Total) time = {t2-t1}")
        self.x = x_updated
        self.beta = beta_updated
        self.steps = np.array(f_vals)

#=============================================================================

# 7) Accelerated gradient with Armijo line-search (the same method as Q5 in Assignment 3)
class Optimizer7:
    def __init__(self, model, epsilon=0.000001, max_iterations=400, gamma=0.1, seed=777):
        self.model = model
        self.epsilon = epsilon
        self.max_iterations = max_iterations
        self.gamma = gamma
        self.seed = seed
        self.x = None
        self.beta = None
        self.steps = None
        
    def fit(self, a_t, b_t):
        np.random.seed(self.seed)
        self.x = csr_matrix(np.random.rand(a_t.shape[1], 1))
        self.beta = 0.0
        self.accelerated_gd_practical(a_t=a_t, b_t=b_t)

    def predict(self, a):
        predictions = self.model.predict(a=a, x=self.x, beta=self.beta)
        return list(itertools.chain.from_iterable(predictions.tolist()))

    def line_search_arm_prac(self, a, b, x, beta, f, grad_x, grad_beta, current_grad_norm):
        decrease = self.gamma * (current_grad_norm ** 2.0)
        alpha = 1.0
        while self.model.loss(b=b, a=a, x=x-alpha*grad_x, beta=beta-alpha*grad_beta) > f - alpha * decrease:
            alpha /= 2.0
        return alpha

    def accelerated_gd_practical(self, a_t, b_t):
        y = self.x.copy().toarray()
        beta_y = self.beta

        t = 1.0

        x_updated = self.x.copy().toarray()
        beta_updated = self.beta

        f_vals = []
        norm_vals = []
        t1 = time.time()
        for i in range(1, self.max_iterations+1):
            current_grad_x, current_grad_beta = self.model.grad(b=b_t, a=a_t, x=y, beta=beta_y)
            
            total_grad = np.vstack((current_grad_x, [current_grad_beta]))
            current_grad_norm = np.linalg.norm(total_grad)
            norm_vals.append(current_grad_norm)
            f_vals.append(self.model.loss(b=b_t, a=a_t, x=x_updated, beta=beta_updated))
            
            if current_grad_norm <= self.epsilon:
                break

            fy = self.model.loss(b=b_t, a=a_t, x=y, beta=beta_y)
            
            alpha = self.line_search_arm_prac(a=a_t, b=b_t, f=fy,
                                              x=y, beta=beta_y,
                                              grad_x=current_grad_x, 
                                              grad_beta=current_grad_beta, 
                                              current_grad_norm=current_grad_norm)
            
            x_updated_new = y - alpha * current_grad_x
            beta_updated_new = beta_y - alpha * current_grad_beta
            t_new = (1.0 + np.sqrt(1 + 4 * t ** 2)) / 2.0
            y = x_updated_new + ((t - 1.0) / t_new) * (x_updated_new - x_updated)
            beta_y = beta_updated_new + ((t - 1.0) / t_new) * (beta_updated_new - beta_updated)

            t = t_new
            x_updated = x_updated_new
            beta_updated = beta_updated_new

            f_diff = (f_vals[-1] - f_vals[-2]) if len(f_vals) > 1 else None
            grad_diff = (norm_vals[-1] - norm_vals[-2]) if len(norm_vals) > 1 else None
            print(f"Step = {i}: alpha = {alpha}, t = {t}, Function = {f_vals[-1]}, Function Diff. =  {f_diff}, Grad. Norm = {norm_vals[-1]}, Grad.Norm. Diff. = {grad_diff}")
        t2 = time.time()
        print(f"Iterations (Total) time = {t2-t1}")
        self.x = x_updated
        self.beta = beta_updated
        self.steps = np.array(f_vals)
        
# Information is provided in the lecture slides about parameter tuning and termination.
# However, the final decision of any parameter tuning and termination criteria is up to the students to make. 
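
The backtracking rule used by `line_search_arm` above halves the step size until $f(x - \alpha g) \le f(x) - \gamma \alpha \|g\|^2$. The sketch below isolates that rule on a toy quadratic. The function name `armijo_step`, the `max_halvings` guard, and the test problem are assumptions added for illustration; the guard simply bounds the number of objective evaluations.

```python
import numpy as np


def armijo_step(f, x, grad, gamma=0.1, alpha0=1.0, max_halvings=50):
    """Halve alpha until f(x - alpha*grad) <= f(x) - gamma*alpha*||grad||^2."""
    fx = f(x)
    decrease = gamma * float(np.dot(grad, grad))
    alpha = alpha0
    for _ in range(max_halvings):
        if f(x - alpha * grad) <= fx - alpha * decrease:
            break
        alpha /= 2.0
    return alpha


# Toy problem: f(x) = 0.5 * x^T Q x with an ill-conditioned diagonal Q.
Q = np.diag([1.0, 10.0])


def f_quad(x):
    return 0.5 * float(x @ Q @ x)


x0 = np.array([1.0, 1.0])
g0 = Q @ x0  # gradient of the quadratic at x0
alpha_found = armijo_step(f_quad, x0, g0)
x1 = x0 - alpha_found * g0
```

Each rejected step costs one full objective evaluation, so on large datasets the initial step `alpha0` has a direct effect on running time.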

## Validation error: measure the validation error by calculating
$$
\frac{1}{t}\sum_{i\in\mbox{validation data}} \left| \ b_i^{\mbox{your model}} - b_i^{\mbox{true}} \ \right|
$$
where $t$ is the number of samples in your validation set. $b_i^{\mbox{true}}$ is the true label of the $i$-th sample. $b_i^{\mbox{your model}}$ is the label of the $i$-th sample of your model.

For hinge loss calculate $$b_i^{\mbox{your model}}:= \mbox{sign}(a_i^Tx + \beta).$$

For logistic regression calculate the predicted label by
$$
b_i^{\mbox{your model}}=
\begin{cases}
1 & \mbox{if } \frac{1}{1+e^{-(a_i^Tx + \beta)}} > 0.5\\
-1 & \mbox{otherwise}
\end{cases}
$$
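
A small sketch of both prediction rules and the error formula follows. Since $\frac{1}{1+e^{-s}} > 0.5$ exactly when $s > 0$, a single thresholding function covers the hinge and logistic cases. Note that `np.sign` returns 0 for a score of exactly zero, which is not a valid label; mapping that case to -1 is an assumption made here. The function names and toy numbers are illustrative.

```python
import numpy as np


def predict_sign(scores):
    """Map raw scores a_i^T x + beta to labels in {-1, +1}; a zero score maps to -1 here."""
    return np.where(np.asarray(scores) > 0.0, 1, -1)


def mean_abs_label_error(predicted, true):
    """Compute (1/t) * sum |b_pred - b_true| over the validation set."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.abs(predicted - true).mean())


scores = np.array([2.3, -0.7, 0.0, 1.1])
true_labels = np.array([1, 1, -1, -1])
predicted_labels = predict_sign(scores)
error = mean_abs_label_error(predicted_labels, true_labels)
misclassification_rate = float(np.mean(predicted_labels != true_labels))
```

With labels in $\{-1, 1\}$ every mistake contributes 2 to the sum, so this validation error equals twice the misclassification rate.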

In [None]:
def validation_error(preds, trues):
    arr_preds = np.array(preds)
    arr_trues = np.array(trues)
    assert arr_preds.shape == arr_trues.shape
    return np.abs(arr_preds - arr_trues).sum() / len(preds)

## Question 1: Use the ML pipeline described on slide 60 of Lecture 11 to train your model for the logistic regression problem (the hinge-loss problem does not have any hyper-parameters). Pick any algorithm from the list suggested above to train the models. Report your ML pipeline. Print your generalization error. We will not measure running time for this pipeline; running time will be measured only in Q2. Marks: 30.

In [None]:
# Data splitting
A_s, b_s = to_sparse(A, B)
a_t, b_t, a_v, b_v = split_train_validate(A_s, b_s)

# Train/eval loop
eval_preds = []
for l in [0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 1.0]:
    for g in [0.1, 0.15, 0.2, 0.3]:
        print(f"lambda={l}, gamma={g}")
        optm = Optimizer7(model=LogisticRegressionModel(lambda_=l), gamma=g, epsilon=0.001, max_iterations=200)
        optm.fit(a_t=a_t, b_t=b_t)
        preds = optm.predict(a_v)
        valid_error = validation_error(preds, b_v.toarray().T.tolist()[0])
        eval_preds.append((l, g, valid_error))
        print("-=" * 40)
        print("-=" * 40)

# best lambda: 0.01, gamma: 0.1
for t in eval_preds:
    print(f"===> lambda={t[0]}, gamma={t[1]}, validation error={t[2]}")

In [None]:
class MyMethod:
    def __init__(self):
        self.optimizer = Optimizer123(model=HingeLossModel(), 
                                      init_lr=0.01, 
                                      decrease_lr=False, 
                                      batch_size=10000, 
                                      epsilon=0.000001, 
                                      max_iterations=5700, 
                                      seed=777)
        self.x = None
        self.beta = None
    
    def fit(self, train_data, train_label):
        a_t, b_t = to_sparse(train_data, train_label)
        self.optimizer.fit(a_t=a_t, b_t=b_t)
        self.x = self.optimizer.x
        self.beta = self.optimizer.beta
    
    def predict(self, test_data):
        a, _ = to_sparse(test_data)
        return self.optimizer.predict(a=a)

## Question 2: Plot the objective function (y-axis) vs running time in seconds (x-axis). Have one plot for each optimization problem. In each plot show the performance of all relevant algorithms. For each plot use the parameter setting that gives you the best validation error in Q1 (this refers to the logistic regression problem). Do not show plots for all parameter settings that you tried in Q1, only for the one that gives you the smallest validation error. Do not include computation of any plot data in the running time of the algorithm, unless the plot data are computed by the algorithm anyway. Make sure that the plots are clean and use appropriate legends. Note that we should be able to re-run the code and obtain the plots. Marks: 70.

### For this question, we will measure the running time of your stochastic sub-gradient method for the sparse dataset news20.binary for the hinge-loss problem. We will not measure the running time of any other combination of algorithm, dataset, problem. You need to implement the stochastic sub-gradient method and encapsulate it in a python class.

To make sure your object can be used by our script, your class should have two methods:

1. <strong>fit(self, train_data, train_label)</strong>. It uses the stochastic sub-gradient method to minimize the hinge loss and stores the optimized coefficients (i.e. $x, \beta$) in the instance. The "train_data" and "train_label" arguments are similar to the output of "svm_read_problem". This function returns nothing.
    * "train_data" is a list of $n$ Python dictionaries (int -> float) that represents a sparse matrix. The keys (int) and values (float) of the dictionary at train_data[i] are the indices and values of the non-zero entries of row $i$. 
    * "train_label" is a list of $n$ integers containing only <strong>-1s and 1s</strong>, where $n$ is the number of samples.


2. <strong>predict(self, test_data)</strong>. It will predict the label of the input "test_data" by using the coefficients stored in the instance. The "test_data" has the same data structure as the "train_data" of the "fit" function. This function returns a list of <strong>-1s and 1s</strong> (i.e. the prediction of your labels).

You can also define other methods to help your programming; we will only call the two methods described above.

To let us import your class, you need to follow these rules:

1. Name your Python file <strong>a4_[your student ID].py</strong>. For example, if your student ID is 12345, then your file name is <strong>a4_12345.py</strong>.
1. Your class name should be <strong>MyMethod</strong> (it is case sensitive).

Any violation of the above requirements will cause an error in our script, and you will get at most 50% of the total score. Your solution will be measured mainly by the running time of the <strong>fit</strong> function and the accuracy of the <strong>predict</strong> function. For example, your method will be called and measured in the following pattern:

    obj = MyMethod()
    st = time.time()
    obj.fit(train_data, train_label) # .fit() optimizes the objective and stores coefficients in obj.
    running_time = time.time() - st
    predict_label = obj.predict(test_data)
    accuracy = get_accuracy(predict_label, test_label) # this is a function we use to measure accuracy.
Your accuracy will then be computed from <strong>predict_label</strong>; you do not have to implement "get_accuracy". When you finish your implementation, upload the .py file to the Learn dropbox.
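
To try the calling pattern locally before submitting, something like the following could be used. Both `get_accuracy` (here the fraction of matching labels) and the placeholder class `MajorityLabelMethod` are hypothetical stand-ins: the actual grading function is not specified, and the placeholder only demonstrates the required `fit`/`predict` interface rather than any optimization method.

```python
import time


def get_accuracy(predicted, true):
    """Hypothetical stand-in for the grading function: fraction of matching labels."""
    matches = sum(1 for p, t in zip(predicted, true) if p == t)
    return matches / len(true)


class MajorityLabelMethod:
    """Placeholder with the required fit/predict interface; it is not an optimizer."""

    def __init__(self):
        self.label = 1

    def fit(self, train_data, train_label):
        # Remember whichever label is more frequent in the training set.
        positives = sum(1 for y in train_label if y == 1)
        self.label = 1 if 2 * positives >= len(train_label) else -1

    def predict(self, test_data):
        return [self.label for _ in test_data]


# Tiny inputs in the same list-of-dicts format as svm_read_problem's output.
train_data = [{1: 0.5}, {2: 1.0}, {1: 0.2, 3: 0.4}]
train_label = [1, -1, 1]
test_data = [{2: 0.3}, {1: 0.9}]
test_label = [-1, 1]

obj = MajorityLabelMethod()
st = time.time()
obj.fit(train_data, train_label)
running_time = time.time() - st
predict_label = obj.predict(test_data)
accuracy = get_accuracy(predict_label, test_label)
```

Swapping `MajorityLabelMethod` for `MyMethod` exercises the same code path the grading script is described as using.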

In [None]:
A_s, b_s = to_sparse(A, B)
a_t, b_t, a_v, b_v = split_train_validate(A_s, b_s)

optimizers = {}
fvals = {}

# Hinge
optimizers['hinge_acc_arm'] = Optimizer7(model=HingeLossModel())
optimizers['hinge_gd_arm'] = Optimizer6(model=HingeLossModel())
optimizers['hinge_sgd'] = Optimizer123(model=HingeLossModel())

# Smooth Hinge
optimizers['s_hinge_acc_arm'] = Optimizer7(model=SmoothHingeLossModel())
optimizers['s_hinge_gd_arm'] = Optimizer6(model=SmoothHingeLossModel())
optimizers['s_hinge_sgd'] = Optimizer123(model=SmoothHingeLossModel()) 

# Logistic Regression
optimizers['log_reg_acc_arm'] = Optimizer7(model=LogisticRegressionModel())
optimizers['log_reg_gd_arm'] = Optimizer6(model=LogisticRegressionModel())
optimizers['log_reg_sgd'] = Optimizer123(model=LogisticRegressionModel()) 

for name, opt in optimizers.items():
    opt.fit(a_t, b_t)
    fvals[name] = opt.steps.tolist()

fig = plt.figure(figsize=(16, 9))
ax = fig.add_subplot(1, 1, 1)

for name, step in fvals.items():
    ax.plot(step, label=(name), linewidth=3.0, linestyle='--')

ax.legend(prop={'size': 15},loc="upper right")
plt.xlabel("#Iterations", fontsize=25)
plt.ylabel("Objective", fontsize=25)
ax.grid(linestyle='dashed')
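
The plot above uses the iteration count on the x-axis, whereas Question 2 asks for running time. One way to collect the needed data, sketched below under stated assumptions, is to accumulate the time spent in each update and stop the clock while the objective is evaluated for plotting. The helper `run_with_timestamps` and its `step`/`objective` callables are hypothetical and do not match the optimizer classes above one-to-one.

```python
import time


def run_with_timestamps(step, objective, state, num_iterations):
    """Run `step` repeatedly, recording cumulative update time and objective values.

    The clock is stopped while `objective` is evaluated, so logging cost is excluded.
    """
    elapsed = 0.0
    times, values = [0.0], [objective(state)]
    for _ in range(num_iterations):
        start = time.perf_counter()
        state = step(state)
        elapsed += time.perf_counter() - start
        times.append(elapsed)
        values.append(objective(state))
    return state, times, values


# Toy usage: gradient descent on f(w) = (w - 3)^2 with step size 0.25.
final_w, times, values = run_with_timestamps(
    step=lambda w: w - 0.25 * 2.0 * (w - 3.0),
    objective=lambda w: (w - 3.0) ** 2,
    state=0.0,
    num_iterations=20,
)
```

Passing `times` and `values` to `ax.plot(times, values, label=name)` with an x-axis label such as "Running time (s)" would then give the axes the question describes.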