<a href="https://colab.research.google.com/github/lingchm/datascience/blob/master/exercises/socially_distanced_robots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing Logistic Regression

*convex optimization*, *logistic regression*, *gradient descent*, *heavy ball*, *nesterov*, *newton's method*

**Problem**

Logistic regression is a simple, but surprisingly powerful, tool for solving a fundamental problem in machine learning: learning a classi er that can distinguish between two classes from a set of training data. 

Suppose that we have a set of training data $(x_1, y_1), \dots, (x_N, y_N)$, where each $x_n \in \mathbb{R}^D$ is a feature vector and $y_n \in \{0,1\}$ is a label of two classes.
\feature vector" and each yn 2 f0; 1g is a \label" that indicates which of two classes. The goal in learning a classifier is to find a function $h(x)$ that predicts the correct label for a feature vector that we have never seen before. We can do this by trying to learn a function $h$ that gives us $h(x_n) = y_n$ on (most of) the training set of $N$ samples. 

Consider the logistic function
$$
g(t) = \frac{1}{1 + e^{-t}} \in [0, 1]
$$

In logistic regression, we set $t = a^Tx + b$ where $a \in \mathbb{R}^D$ and $b \in \mathbb{R}$, so that for any new sample $x$ we can compute $g(a^T x+b)$, returning a number between 0 and 1, interpreted as the probability of belonging to one of the classes. To form a classifier, we then simply compare the predicted probability with a threshold. A simple threshold is 0.5. 

To learn $a$ and $b$, we can setup the following optimization problem
$$
\max_{a,b} \sum_{n: y_n=1} \log( g(a^Tx_n +b)) +  \sum_{n: y_n=0} \log( 1-g(a^Tx_n +b)).
$$
This can be simplified as
$$
max_{a,b} \sum^N_{n=1} y_n(a^Tx_n +b)) - \log( 1 + e^{a^Tx_n +b}).
$$


**Method**

We explore four alternative approaches for solving this problem:
1. Gradient descent
2. Heavy ball
3. Nesterov
4. Newton's method

Results are compared with Scikitlearn's logistic regression function.

We implement each method with three approaches
1. with fixed step size 
2. using backtracking to find step size
3. using bisection to find step size

**References**

Credits to Dr. Justin Romberg for designing this problem.


In [13]:
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn import datasets
from numpy.linalg import inv

In [14]:
# generate data
np.random.seed(2020) 
x, y = datasets.make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=6.0)
x_ = np.concatenate((x, np.ones((x.shape[0], 1))), axis=1)

In [15]:
# function to compute objective function, its gradient, and hessian matrix
def obj_function(x, x_, y):
    t = np.dot(x_, x)
    return np.dot(y, t) - np.sum(np.log(1 + np.exp(t)), axis=0)
    
def gradf(x, x_, y):
    e = np.exp(np.dot(x_, x))
    return np.sum(np.multiply(np.tile(y - e / ( 1 + e), 
                  (x_.shape[1], 1)).T, x_), axis=0)

def hessianf(x, x_):
    t = np.zeros((x_.shape[1], x_.shape[1]))
    for i in range(x_.shape[0]):
        e = np.exp(np.dot(x_[i, :], x))
        t += e / (e + 1)**2 * (x_[i, :].reshape(-1, 1) @ x_[i, :].reshape(1, -1))
    return t    

In [28]:
# two ways to find step size
def find_alpha_bisection(xk, dk, y, x_, epsilon=1e-3):
    al, ah, alpha, alpha_old = epsilon, 0.1, 0.001, 1
    k = 0
    while np.absolute(alpha - alpha_old) > epsilon:
        alpha_old = 1 * alpha
        alpha = (al + ah) / 2
        e = np.exp(np.dot(x_, theta + alpha * dk)) # 100x1 
        a_gradient = np.sum(np.multiply(y, np.dot(dk.T, x_.T))) + \
            np.sum(np.multiply(np.dot(x_, dk), e / (1 + e)))
        if a_gradient > 0:
            ah = alpha * 1
        elif a_gradient < 0:
            al = alpha * 1
        else:
            return alpha
        k += 1
    #print("number of iterations of bisection:", k, "alpha: ", alpha)
    return alpha, k

def find_alpha_backtracking(xk, dk, a0, c1, p, x_, y, tol=1e-6):
    alpha, k = a0, 1
    while obj_function(xk + alpha * dk, x_, y) < (obj_function(xk, x_, y) + \
               c1 * alpha * np.dot(gradf(xk, x_, y), dk)):
        alpha *= p
        k += 1
    return alpha, k

In [29]:
# Solution 1: Gradient descent
def gradient_descent(y, x_, alpha=0.001, c1=0.5, p=0.5, tol=1e-6):
    
    k, k_, max_iter, dk = 0, 0, 10000, 1
    theta = np.zeros(x_.shape[1])
    while np.linalg.norm(dk) > tol and k < max_iter:
        dk = gradf(theta, x_, y) 
        if alpha == 'bisection':
            ak, i = find_alpha_bisection(theta, dk, y, x_)
            k_ += i
        elif alpha == 'backtracking':
            ak, i = find_alpha_backtracking(theta, dk, 0.1, c1, p, x_=x_, y=y)
            k_ += i                
        else:
            ak = alpha
            i = 0
        theta += ak * dk
        k += 1
        
    print("Number of iterations of gradient descent:", k)
    print("Number of iterations of each bisection:", i)
    print("Number of iterations of bisection:", k_)
    print("Number of iterations:", k_ + k)
    print("Estimated theta:", theta)
    
    return theta

In [30]:
# Solution 2: heavy ball
def heavy_ball(y, x_, alpha=0.001, beta=0.95, c1=0.5, p=0.8, tol=1e-6):
    
    k, k_, max_iter, dk = 0, 0, 10000, 1
    theta, theta_old = np.zeros(x_.shape[1]), np.ones(x_.shape[1])
    while np.linalg.norm(dk) > tol and k < max_iter:
        dk = gradf(theta, x_, y) 
        
        if alpha == 'backtracking':
            ak, i = find_alpha_backtracking(theta, dk, 0.1, c1, p, x_, y)
            k_ += i                
        else:
            ak = alpha
            i = 0
        
        dk += beta / ak * (theta - theta_old)
            
        theta_old = theta * 1
        theta += ak * dk
        k += 1
        
    print("Number of iterations of gradient descent:", k)
    print("Number of iterations of each bisection:", i)
    print("Number of iterations of bisection:", k_)
    print("Number of iterations:", k_ + k)
    print("Estimated theta:", theta)
    
    return theta

In [31]:
# Solution 3: Nestrov's method
def nesterov(y, x_, alpha=0.001, beta=0.95, c1=0.5, p=0.5, tol=1e-6):
    
    k, k_, max_iter, dk = 0, 0, 10000, 1
    theta, theta_old = np.zeros(x_.shape[1]), np.ones(x_.shape[1])
    while np.linalg.norm(dk) > tol and k < max_iter:
        dk = gradf(theta + beta * (theta - theta_old), x_, y) 
        beta = (k-1) / (k+2)
        if alpha == 'backtracking':
            ak, i = find_alpha_backtracking(theta, dk, 0.1, c1, p, x_=x_, y=y)
            k_ += i                
        else:
            ak = alpha
            i = 0
        dk += beta / ak * (theta - theta_old)
        theta_old = theta * 1
        theta += ak * dk
        k += 1
        
    print("Number of iterations of gradient descent:", k)
    print("Number of iterations of each bisection:", i)
    print("Number of iterations of bisection:", k_)
    print("Number of iterations:", k_ + k)
    print("Estimated theta:", theta)
    
    return theta

In [32]:
# Solution 4: Newton's method
def newton(y, x_, alpha=1, c1=0.5, p=0.5, tol=1e-6):
    
    k, k_, max_iter, dk = 0, 0, 10000, 1
    theta = np.zeros(x_.shape[1])
    while np.linalg.norm(dk) > tol and k < max_iter:
        dk = inv(hessianf(theta, x_)) @ gradf(theta, x_, y) 
        if alpha == 'backtracking':
            ak, i = find_alpha_backtracking(theta, dk, 1, c1, p, x_=x_, y=y)
            k_ += i                
        else:
            ak = alpha
            i = 0
        theta += ak * dk
        k += 1
        
    print("Number of iterations of gradient descent:", k)
    print("Number of iterations of each bisection:", i)
    print("Number of iterations of bisection:", k_)
    print("Number of iterations:", k_ + k)
    print("Estimated theta:", theta)
    
    return theta

In [33]:
theta = gradient_descent(y, x_, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

theta = gradient_descent(y, x_, alpha='bisection', tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

# question 5 a
theta = gradient_descent(y, x_, alpha='backtracking', c1=0.5, p=0.8, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])


Number of iterations of gradient descent: 3864
Number of iterations of each bisection: 0
Number of iterations of bisection: 0
Number of iterations: 3864
Estimated theta: [-0.2808693  -0.45756691  2.21342365]
Classified 9 wrong out of 100
Number of iterations of gradient descent: 2166
Number of iterations of each bisection: 7
Number of iterations of bisection: 15162
Number of iterations: 17328
Estimated theta: [-0.28086942 -0.45756708  2.21342507]
Classified 9 wrong out of 100
Number of iterations of gradient descent: 289
Number of iterations of each bisection: 9
Number of iterations of bisection: 3790
Number of iterations: 4079
Estimated theta: [-0.28087948 -0.45757267  2.21348641]
Classified 9 wrong out of 100


In [34]:
theta = heavy_ball(y, x_, alpha=0.001, beta=0.95, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

theta = heavy_ball(y, x_, alpha='backtracking', beta=0.95, c1=0.5, p=0.8, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])


Number of iterations of gradient descent: 605
Number of iterations of each bisection: 0
Number of iterations of bisection: 0
Number of iterations: 605
Estimated theta: [-0.28091162 -0.45762542  2.21390607]
Classified 9 wrong out of 100


  after removing the cwd from sys.path.


Number of iterations of gradient descent: 484
Number of iterations of each bisection: 15
Number of iterations of bisection: 8111
Number of iterations: 8595
Estimated theta: [-0.28092082 -0.45764652  2.21403439]
Classified 9 wrong out of 100


In [35]:
theta = nesterov(y, x_, alpha=0.001, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

theta = nesterov(y, x_, alpha='backtracking', beta=0.95, c1=0.8, p=0.9, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

Number of iterations of gradient descent: 676
Number of iterations of each bisection: 0
Number of iterations of bisection: 0
Number of iterations: 676
Estimated theta: [-0.28140113 -0.45829609  2.21957098]
Classified 9 wrong out of 100


  after removing the cwd from sys.path.


Number of iterations of gradient descent: 515
Number of iterations of each bisection: 6
Number of iterations of bisection: 22990
Number of iterations: 23505
Estimated theta: [-0.28077063 -0.45741223  2.21237208]
Classified 9 wrong out of 100


In [37]:
theta = newton(y, x_, alpha=1, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

theta = newton(y, x_, alpha="backtracking", c1=0.5, p=0.8, tol=1e-3)
pred = np.where(np.dot(x_, theta) >= 0, 1, 0)
print("Classified", np.sum(np.absolute(pred - y)), "wrong out of", x.shape[0])

Number of iterations of gradient descent: 7
Number of iterations of each bisection: 0
Number of iterations of bisection: 0
Number of iterations: 7
Estimated theta: [-0.28091254 -0.45762617  2.21392367]
Classified 9 wrong out of 100
Number of iterations of gradient descent: 7
Number of iterations of each bisection: 1
Number of iterations of bisection: 7
Number of iterations: 14
Estimated theta: [-0.28091254 -0.45762617  2.21392367]
Classified 9 wrong out of 100


In [12]:
### compare with scikit learn package
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x, y)
pred = model.predict(x)
error = np.sum(np.absolute(pred - y))
print("Classified", error, "wrong out of", x.shape[0])

Classified 9 wrong out of 100
