# Logistic Regression
Logistic Regression is an example of <b>supervised learning</b> and <b>binary classification problem</b>.
* <b>Supervised Learning:</b> input dataset composed by input and labels $(x, y)$ and want to learn a mapping $x \longrightarrow y$
* <b>Classification:</b> the value $y$ to predict is discrete 
* <b>Binary classification</b> problem $y \in \{0, 1\}$

## Problem Formulation

We want the hypothesis outputs value between 0 and 1 ( $h_\theta(x) \in [0,1] $ ).<br>
Then, for the logistic regression, the hypothesis is:

$$ h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^T x}} $$

## Logistic Function or Sigmoid Function
The function $g(z)$ is called logistic function or sigmoid function and outputs values between 0 and 1. <br>

$$
g(z) = \frac{1}{1+e^{-z}}
$$

The derivates of the sigmoid function is:
<br>
$$
\begin{align}
g'(z) & = \frac{d}{dz} \frac{1}{1+e^{-z}} \\
      & = \frac{d}{dz} (1+e^{-z})^{-1} \\
      & = \frac{1}{(1+e^{-z})^2}(e^{-z}) \\
      & = \frac{1}{(1+e^{-z})} \cdot \bigg(1 - \frac{1}{(1+e^{-z})} \bigg) \\
      & = g(z)(1-g(z))
\end{align}
$$

## Maximum Likelihood Estimation

Assuming that:
$$ P(y=1| x;\theta) = h_\theta(x)$$
$$ P(y=0| x;\theta) = 1 - h_\theta(x)$$

The two equation can be combined in one equation as follows: 

$$ P(y| x;\theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y} $$

* if $y=1 \longrightarrow P(y| x;\theta) = h_\theta(x) $
* if $y=0 \longrightarrow P(y| x;\theta) = 1-h_\theta(x) $


Likelihood:
$$
\begin{align*} 
 \mathcal{L}(\theta) & = P(\vec{y}/x; \theta) \\
                     & = \prod_{i=1}^{m} P(y^{(i)}/x^{(i)}; \theta) \\
                     & = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{(1-y^{(i)})}
\end{align*}
$$

Log Likelihood:
\begin{align*}
                \ell(\theta) & = log \mathcal{L}(\theta) \\
                             & = log \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{(1-y^{(i)})} \\
                             & = \sum_{i=1}^{m} y^{(i)} log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1 - h_\theta(x^{(i)}))
\end{align*}

<b>Maximum Likelihood Estimation</b>: choose $\theta$ to <b>maximize</b> the <b>log likelihood</b> $\ell(\theta)$.

## Batch Gradient Ascent

To <b>maximize</b> the <b>likelihood</b> we can use <b>gradient ascent</b>:

$$ \theta_j := \theta_j + \alpha \frac{\partial}{\partial\theta_j} \ell(\theta) \:\:\: \forall j $$

### Gradient Computation

Firstly, compute the partial derivate for one training examples:
$$
\begin{align*}
    \frac{\partial}{\partial\theta_j} \ell(\theta) & = \frac{\partial}{\partial\theta_j} \bigg(y \ log \ h_\theta(x) + (1-y) \ log \ (1 - h_\theta(x)) \bigg) \\
                                                   & = \frac{\partial}{\partial\theta_j} \bigg(y \ log \ g(\theta^Tx) + (1-y) \ log \ (1 - g(\theta^Tx)) \bigg) \\
                                                   & = y \frac{1}{g(\theta^Tx)} g(\theta^Tx) (1 - g(\theta^Tx)) x_j - (1-y) \frac{1}{(1-g(\theta^Tx))} g(\theta^Tx) (1-g(\theta^Tx)) x_j \\
                                                   & = y(1-g(\theta^Tx))x_j-(1-y)g(\theta^Tx)x_j \\
                                                   & = y(1-h_\theta(x))x_j-(1-y)h_\theta(x)x_j \\
                                                   & = \bigg(y(1-h_\theta(x))-(1-y)h_\theta(x)\bigg)x_j \\
                                                   & = \bigg(y - yh_\theta(x) - h_\theta(x) + yh_\theta(x)\bigg)x_j \\
                                                   & = (y - h_\theta(x))x_j
\end{align*}
$$
Then, for m training examples:
$$ \frac{\partial}{\partial\theta_j} \ell(\theta) = \sum_{i = 1}^{m} (y^{(i)} - h_\theta(x^{(i)}))\ x_j^{(i)} $$
$$ \theta_j := \theta_j + \alpha \sum_{i = 1}^{m} (y^{(i)} - h_\theta(x^{(i)}))\ x_j^{(i)} \:\:\: \forall j $$

## Logistic Regression vs Linear Regression

The main differences are:

1. Update rule: 
   * $ \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta) = \theta_j - \alpha \sum_{i = 1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\ x_j^{(i)} \qquad $ Linear Regression (Gradient Descent)
   * $ \theta_j := \theta_j + \alpha \frac{\partial}{\partial\theta_j} \ell(\theta) = \theta_j + \alpha \sum_{i = 1}^{m} (y^{(i)} - h_\theta(x^{(i)}))\ x_j^{(i)} \qquad $ Logistic Regression (Gradient Ascent) 
<br>
<br>
2. Minimization vs Maximization
   * In Linear Regression Minimize the squared errors
   * In Logistic Regression Maximize the log likelihood 
<br>
<br>
3. Definition of $h_\theta(x)$

They are similar because they belong the same class of algorithms: <b>General Linear Models</b>

In [1]:
import pandas as pd
import numpy as np

import os
import time

In [31]:
class LogisticRegression:
    
    def __init__(self):
        self.INIT_PARAMETERS = {"zero", "random"}
        self.theta = None
        return
    
    def fit(self, X, Y, iterations, learning_rate, batch_size=512, init_parameters="zero", step_per_iterations=10):
        
        # First dimension of X,Y (n_samples) must be the same
        if X.shape[0] != Y.shape[0]:
            raise ValueError("Error: first dimension of X {} and Y {} not matches.".format(X.shape[0], Y.shape[0]))

        m = X.shape[0]
        nx = X.shape[1]
        n = nx + 1
        
        # Add 1 in first component of all Xs (interpect term)
        X = np.insert(X, 0, 1.0, axis=1)
        Y = np.reshape(Y, (Y.shape[0], 1))
        
        # Initialize paramters theta
        self._init_weights(n, init_parameters)
        
        history, execution_time = self.mini_batch_gradient_descent(X, Y, m, n, batch_size,
                                                                   iterations, learning_rate, step_per_iterations)

        print("Training takes {:.2f} seconds.".format(execution_time))
        return history
    
    def mini_batch_gradient_descent(self, X, Y, m, n, batch_size, iterations, learning_rate, step_per_iterations):
        start_time = time.time()

        history = []
        
        # Training loop
        for iteration in range(iterations):
            
            J = 0
            n_batch = m // batch_size
            n_batch_remainder = m % batch_size
            
            for i_batch in range(n_batch):
                X_batch = X[i_batch:i_batch + batch_size]
                Y_batch = Y[i_batch:i_batch + batch_size]

                # Forward Propagation
                J_batch, Diff = self.forward_propagation(X_batch, Y_batch, batch_size)
                
                J += J_batch

                # Backward Propagation
                self.backward_propagation(X_batch, Diff, batch_size, learning_rate)
                
                history.append(J)
            
            if n_batch_remainder != 0:
                X_batch = X[:-n_batch_remainder]
                Y_batch = Y[:-n_batch_remainder]
                
                # Forward Propagation
                J_batch, Diff = self.forward_propagation(X_batch, Y_batch, n_batch_remainder)
                
                J += J_batch

                # Backward Propagation
                self.backward_propagation(X_batch, Diff, n_batch_remainder, learning_rate)
                
                history.append(J)

            if iteration and iteration % step_per_iterations == 0:
                print("Iteration {} - Cost {}".format(iteration, J))

        execution_time = time.time() - start_time
        
        return history, execution_time

    def forward_propagation(self, X, Y, m):
        # Compute the hypothesis for all Xs in the batch
        H = self._compute_hypothesis(X)
        
        # Compute the difference between estimated y and actual label
        Diff = Y - H
        
        # Sum the squared differences
        J = (1 / (2 * m)) * np.dot(Diff.T, Diff)
        return J, Diff
    
    
    def backward_propagation(self, X, Diff, m, learning_rate):
        Grad = (1 / m) * np.dot(X.T, Diff)
        
        # Update paramters theta
        self.theta += learning_rate * Grad
        return 
    
    def predict(self, X):
        # Add 1 in first component of all Xs (interpect term)
        X = _add_intercept(X) 
        return self._compute_hypothesis(X)
    
    def _compute_hypothesis(self, X):
        Z = np.dot(X,self.theta)
        return self.sigmoid(Z)
    
    @staticmethod
    def sigmoid(Z):
        return 1/(1+np.exp(-Z))
    
    @staticmethod
    def _add_intercept(X):
        # Add intercept term (1.0 in the first component of each input)
        return np.insert(X, 0, 1.0, axis=1)
    
    def _init_weights(self, n, init_parameters="zero"):
        """ initiates the paramters as zero or random. """
        if init_parameters not in self.INIT_PARAMETERS:
            raise ValueError("Error: init_parameters must be one of %s." % self.INIT_PARAMETERS)

        if init_parameters == "zero":
            # Initialize paramters with zero values
            self.theta = np.zeros((n, 1), dtype=float)

        if init_parameters == "random":
            # Initialize paramters with random values
            self.theta = np.random.rand(n, 1)
        return

In [32]:
DATASET_FOLDER = os.path.join("..","datasets","logistic-regression-data")

df_x = pd.read_csv(os.path.join(DATASET_FOLDER, "train_inputs.txt"), sep="\ +", names=["x1","x2"], header=None, engine='python')
df_y = pd.read_csv(os.path.join(DATASET_FOLDER, "train_labels.txt"), sep='\ +', names=["y"], header=None, engine='python')
df_y = df_y.astype(int)
df_x.head()

Unnamed: 0,x1,x2
0,1.34325,-1.331148
1,1.820553,-0.634668
2,0.986321,-1.888576
3,1.944373,-1.635452
4,0.976734,-1.353315


In [33]:
df_y.tail()

Unnamed: 0,y
94,1
95,1
96,1
97,1
98,1


In [34]:
df_y["y"] = df_y.apply(lambda row: max(row["y"], 0), axis=1)

In [35]:
X = df_x[["x1", "x2"]].values
Y = df_y["y"].values

In [36]:
X.shape

(99, 2)

In [37]:
Y.shape

(99,)

In [38]:
LR_model = LogisticRegression()

In [44]:
learning_rate = 0.01
iterations = 1000
init_parameters = "random"  # Random or Zero init
batch_size = 16

history = LR_model.fit(X=X,
                       Y=Y,
                       iterations=iterations,
                       learning_rate=learning_rate,
                       init_parameters=init_parameters,
                       batch_size=batch_size,
                       step_per_iterations=100)

Iteration 100 - Cost [[1.66962032]]
Iteration 200 - Cost [[1.58336183]]
Iteration 300 - Cost [[1.56543842]]
Iteration 400 - Cost [[1.56135067]]
Iteration 500 - Cost [[1.56056015]]
Iteration 600 - Cost [[1.56057329]]
Iteration 700 - Cost [[1.56074047]]
Iteration 800 - Cost [[1.56089695]]
Iteration 900 - Cost [[1.56101187]]
Training takes 0.14 seconds.
