# Binomial logistic regression retrospect

In the previous post [Logistic regression (binomial regression) and regularization](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#Modeling) we stated the model for logistic regression directly: $h_\theta(x) = \frac{1}{1+e^{-\theta x}}$. One explanation of why the model takes this form was already given in the post [GLM and exponential family distributions](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm008_GLM_and_exponential_family_distributions/GLM_and_exponential_family_distributions.html#With-above-three-hypotheses,-GLM-$\Rightarrow$-logistic-regression); in this post let's interpret it in another way.

Logistic regression is inspired by linear regression, $h_\theta(x) = \theta x$. For a binary classifier (binomial logistic regression), however, we want the model's output to be a probability: the probability $p$ that the sample point belongs to class $A$ (so the probability for $\bar{A}$ is $1-p$). A probability must lie in $[0,1]$, whereas $\theta x$ can take any real value. To bridge the two we can introduce the [odds](https://en.wikipedia.org/wiki/Odds):

$$
\text{odds } = \frac{p}{1-p}
$$

and the log-**it** (it → odds, log-__odds__), which ranges over the whole real line and can therefore be set equal to $\theta x$:

$$
\begin{align*}
  \ln\frac{p}{1-p} &= \theta x \\
  \Rightarrow p &= \frac{e^{\theta x}}{1 + e^{\theta x}} = \frac{1}{1 + e^{-\theta x}}
\end{align*}
$$

Now $p \in (0, 1)$. That is, applying the transformation $\frac{1}{1+e^{-\theta x}}$ to $\theta x$ yields a valid probability, and $\theta x$ itself is the log-odds of that probability. That is why the model looks the way it does!
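As a quick numerical check of the derivation above, here is a minimal sketch in plain Python. The function names `sigmoid` and `logit` are illustrative choices; it maps a few values of $\theta x$ to probabilities and back again.

```python
import math


def sigmoid(z):
    """Map a real-valued score z = theta * x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p, i.e. the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


scores = [-5.0, -1.0, 0.0, 1.0, 5.0]
probs = [sigmoid(z) for z in scores]

for z, p in zip(scores, probs):
    # logit(sigmoid(z)) recovers the original score up to rounding error.
    print(f"z = {z:+.1f}  p = {p:.4f}  logit(p) = {logit(p):+.4f}")
```

A score of zero corresponds to even odds ($p = 0.5$), positive scores push $p$ towards 1 and negative scores push it towards 0.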

# Extend binomial logistic regression to multinomial logistic regression

For binomial logistic regression we only have two classes, $A$ and $\bar{A}$, so we can use the log-odds $\ln\frac{p}{1-p}$ as the classification indicator: if it is $> 0$ the sample belongs to class $A$, and if it is $< 0$ it belongs to class $\bar{A}$. But how do we deal with more than two classes? How do we extend the log-odds indicator?

Let's say the sample space is partitioned into $k$ classes, where the last class $k$ has a special meaning: any sample point that does not belong to any of the first $k-1$ classes falls into class $k$. To answer the question above, we choose one of the $k$ classes as the baseline class and use the log-odds of each other class's probability against the baseline class's probability as the multinomial classifier's indicator. Usually we pick the last class $k$ as the baseline, that is:

$$
\ln\frac{p_i}{p_k} = \theta_i x, \enspace i = 1, 2, \dots, k-1
$$

In fact, in the post [GLM and exponential family distributions](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm008_GLM_and_exponential_family_distributions/GLM_and_exponential_family_distributions.html#Reference-Point) we already calculated that (with $\eta_k = \ln\frac{p_k}{p_k} = 0$ for the baseline class):

$$
p_k = 1 \Big/ \sum\limits_{i=1}^k \exp(\eta_i) = 1 \Big/ \sum\limits_{i=1}^k \exp(\theta_i^Tx)
$$
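Combining the two formulas gives $p_i = p_k \exp(\theta_i x)$ for $i < k$, so all $k$ probabilities follow from the $k-1$ parameter vectors. The sketch below illustrates this with NumPy. The function name `baseline_softmax` and the numbers are made up for illustration, and it assumes the baseline class has $\eta_k = 0$.

```python
import numpy as np


def baseline_softmax(thetas, x):
    """Class probabilities when the last class k is the baseline.

    thetas: array of shape (k - 1, n), one parameter row per non-baseline class.
    x: feature vector of shape (n,).
    Returns an array of shape (k,) holding p_1, ..., p_k.
    """
    # eta_i = theta_i x for i < k; the baseline class has eta_k = 0 (assumption).
    etas = np.append(thetas @ x, 0.0)
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    weights = np.exp(etas - np.max(etas))
    return weights / np.sum(weights)


# Made-up numbers purely for illustration: k = 3 classes, n = 2 features.
thetas = np.array([[0.5, -1.0],
                   [1.5, 0.2]])
x = np.array([1.0, 2.0])

p = baseline_softmax(thetas, x)
print(p, p.sum())
```

The probabilities sum to one, and taking $\ln(p_i / p_k)$ of the result gives back $\theta_i x$ for each non-baseline class.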

# Hand-written digits recognition with multinomial logistic regression

In [2]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sets the backend of matplotlib to the 'inline' backend.
#
# With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook,
# directly below the code cell that produced it.
# The resulting plots will then also be stored in the notebook document.
#
# More details: https://stackoverflow.com/questions/43027980/purpose-of-matplotlib-inline
%matplotlib inline

from scipy.io import loadmat

data = loadmat(os.getcwd() + '/hand_written_digits.mat')  
data

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 'y': array([[10],
        [10],
        [10],
        ...,
        [ 9],
        [ 9],
        [ 9]], dtype=uint8)}

In [13]:
data['X'].shape, data['y'].shape

((5000, 400), (5000, 1))

In [17]:
X = np.insert(data['X'], 0, values=np.ones(data['X'].shape[0]), axis=1)
y = data['y']
X.shape, y.shape

((5000, 401), (5000, 1))

The data above contains 5000 hand-written digits, each stored as a 20 by 20 pixel grid. That is, each row of $X$ above represents one digit, and each of its components is a floating-point number giving the grayscale intensity of one of the $20 \times 20 = 400$ pixels (the extra first column of ones added above is the intercept term). A partial example of the data:

<img src="./hand_written_digits.png">
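To look at a single digit, a row of 400 intensities can be reshaped back into a 20 by 20 grid. The sketch below uses a synthetic row rather than the real file. Whether `order='F'` is needed depends on how the images were flattened, which is an assumption here and not confirmed by the data shown above.

```python
import numpy as np

# Stand-in for one row of the pixel data: 400 grayscale intensities
# (synthetic values, not taken from hand_written_digits.mat).
rng = np.random.default_rng(0)
row = rng.random(400)

# Reshape the flat vector into the 20 x 20 pixel grid.
# Assumption: if each image was flattened column by column (as MATLAB does),
# order='F' restores the upright digit; otherwise use the default order.
image = row.reshape((20, 20), order='F')

print(image.shape)
```

With the real data, `plt.imshow(image, cmap='gray')` would then display the digit.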

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

### `cost_reg` is copied over exactly from the previous post: [New cost function with regularization item](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#New-cost-function-with-regularization-item)

In [4]:
def cost_reg(theta, X, y, alpha):
    theta = np.reshape(theta, (-1, len(theta)))

    assert X.shape[1] == theta.shape[1], \
      'Improper shape of theta, expected to be: {}, actual: {}'.format((1, X.shape[1]), theta.shape)

    part0 = np.multiply(y, np.log(sigmoid(X @ theta.T)))
    part1 = np.multiply(1 - y, np.log(1 - sigmoid(X @ theta.T)))
    reg = alpha / (2 * len(X)) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))

    return -np.sum(part0 + part1) / len(X) + reg
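A simple sanity check for `cost_reg`: with all parameters at zero the sigmoid outputs 0.5 for every sample and the regularization term vanishes, so the cost should equal $\ln 2 \approx 0.693$ whatever the data. The sketch below repeats the two functions so it runs on its own and uses synthetic toy data (`X_toy` and `y_toy` are hypothetical names, not the digits data set).

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def cost_reg(theta, X, y, alpha):
    # Same computation as the cell above, repeated so this block runs on its own.
    theta = np.reshape(theta, (-1, len(theta)))
    part0 = np.multiply(y, np.log(sigmoid(X @ theta.T)))
    part1 = np.multiply(1 - y, np.log(1 - sigmoid(X @ theta.T)))
    reg = alpha / (2 * len(X)) * np.sum(np.power(theta[:, 1:], 2))
    return -np.sum(part0 + part1) / len(X) + reg


# Synthetic toy data: 6 samples, an intercept column plus 2 features.
rng = np.random.default_rng(0)
X_toy = np.insert(rng.normal(size=(6, 2)), 0, values=1.0, axis=1)
# The cost formula expects labels as a column vector of 0/1 values.
y_toy = np.array([[0], [1], [1], [0], [1], [0]])

theta0 = np.zeros(X_toy.shape[1])
cost_at_zero = cost_reg(theta0, X_toy, y_toy, alpha=1.0)
print(cost_at_zero, np.log(2))
```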

In [None]:
def gradient_reg(theta, X, y, alpha):
    theta = np.reshape(theta, (-1, len(theta)))

    assert X.shape[1] == theta.shape[1], \
      'Improper shape of theta, expected to be: {}, actual: {}'.format((1, X.shape[1]), theta.shape)

    error = sigmoid(X @ theta.T) - y

    # One gradient component per parameter (including the intercept theta_0).
    parameters = theta.shape[1]
    grad = np.zeros(parameters)

    for i in range(parameters):
        term = np.multiply(error, X[:, [i]])

        if i == 0:
            # No penalization to theta_0.
            grad[i] = np.sum(term) / len(X)
        else:
            grad[i] = np.sum(term) / len(X) + alpha / len(X) * theta[0, i]

    return grad
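The per-parameter loop in `gradient_reg` can also be written as a single matrix product. The sketch below is one possible vectorized form, not the version used in this post; it compares the result against a central finite-difference approximation of the cost on synthetic toy data. The names `cost`, `gradient_vectorized` and `numerical_gradient` are hypothetical helpers introduced only for this check.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def cost(theta, X, y, alpha):
    # Regularized cross-entropy cost; theta is 1-D, y is an (m, 1) column of 0/1.
    h = sigmoid(X @ theta.reshape(-1, 1))
    data_term = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / len(X)
    return data_term + alpha / (2 * len(X)) * np.sum(theta[1:] ** 2)


def gradient_vectorized(theta, X, y, alpha):
    # One matrix product replaces the per-parameter loop.
    error = sigmoid(X @ theta.reshape(-1, 1)) - y   # shape (m, 1)
    grad = (X.T @ error).ravel() / len(X)           # shape (n,)
    grad[1:] += alpha / len(X) * theta[1:]          # theta_0 is not penalized
    return grad


def numerical_gradient(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


# Synthetic toy data: 8 samples, an intercept column plus 3 features.
rng = np.random.default_rng(0)
X_toy = np.insert(rng.normal(size=(8, 3)), 0, values=1.0, axis=1)
y_toy = rng.integers(0, 2, size=(8, 1))
theta = rng.normal(size=X_toy.shape[1])

analytic = gradient_vectorized(theta, X_toy, y_toy, alpha=1.0)
numeric = numerical_gradient(lambda t: cost(t, X_toy, y_toy, 1.0), theta)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

If the analytic gradient is correct, the largest difference between the two should be tiny compared with the gradient entries themselves.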