# Lecture 16: Classification

# Problem setting

## Review
In the last few lectures we learned linear regression, where we explored using a linear function (or a higher-degree polynomial) to represent the relation between the features of the samples ($x$ values, or training data `X_train`) and a target value ($y$ values, `y_train`), so that we can predict the target value $y$ (`y_pred`, obtained from the model) from testing data `X_test`.

However, linear regression is not appropriate in the case of a qualitative target value.

## Classification
Today, we will learn how to predict a discrete label, such as:
* predicting whether a grid of pixel intensities represents a "0" digit or a "1" digit;
* predicting whether it will rain tomorrow based on previous days' data;
* predicting whether a wine is good or mediocre based on its chemical components' data.

This is a classification problem. Logistic regression is a simple classification algorithm for learning to make such decisions for a binary label.

Reference: MATLAB tutorial in [Stanford Deep Learning tutorial](http://deeplearning.stanford.edu/tutorial/).

# Logistic Regression

----

## Heuristics
Recall the `winequality-red.csv` we have used in the last few lectures and labs. If the `quality` of a wine is $\geq 6$, relabel it as "favorable"; if the `quality` of a wine is $\leq 5$, relabel it as "mediocre". 

For a certain sample $(\mathbf{x}^{(i)}, y^{(i)})$, where $\mathbf{x}^{(i)}$ is the vector representing its first 11 features, and $y^{(i)}$ is the quality score (label), if we know its score is 7, then
$$
P\big(i\text{-th sample is favorable} \big) = 1, \qquad 
P\big(i\text{-th sample is mediocre} \big) = 0. 
$$
If we relabel the "favorable" and "mediocre" into 1 and 0 as our values for $y^{(i)}$, then
$$
P\big(y^{(i)} = 1\big) = 1, \qquad P\big(y^{(i)} = 0\big) = 0.
$$
If some other sample, say $j$-th sample, has quality score 4, then
$$
P\big(y^{(j)} = 1\big) = 0, \qquad P\big(y^{(j)} = 0\big) = 1.
$$
We can use vector $[1,0]$ to represent the first sample's probability in each class, and vector $[0,1]$ to represent that of the second sample.
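
The relabeling above can be sketched in a few lines of NumPy. The `quality` array here is a hypothetical stand-in for the `quality` column of `winequality-red.csv` (the values are made up for illustration), and `class_probs` is an illustrative name for the per-sample vectors $[P(y=1), P(y=0)]$.

```python
import numpy as np

# Hypothetical quality scores for five wines (not read from the CSV)
quality = np.array([7, 4, 6, 5, 8])

# Relabel: quality >= 6 -> 1 ("favorable"), quality <= 5 -> 0 ("mediocre")
y = (quality >= 6).astype(int)

# Row i is [P(y^(i) = 1), P(y^(i) = 0)] for the i-th sample
class_probs = np.column_stack([y, 1 - y])

print(y)            # [1 0 1 0 1]
print(class_probs)  # first row is [1 0], second row is [0 1]
```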

We want to build a model so that, given the first 11 features $\mathbf{x}$ of a certain sample, it outputs an estimate, say $[0.8, 0.2]$, telling us that 
$$
P\big(y = 1| \mathbf{x}\big) = 0.8, \qquad P\big(y = 0|\mathbf{x}\big) = 0.2,
$$
which is to say, this sample has probability 0.8 of being in Class 1 and probability 0.2 of being in Class 0. The predicted label $\hat{y}$ is then:
$$
\hat{y} = \operatorname{arg}\max_{j} P\big(y = j| \mathbf{x}\big),
$$
i.e., we use the biggest estimated probability's class as this sample's predicted label.
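
As a small illustration of the argmax rule, the sketch below picks the predicted label from a few hypothetical probability vectors. The array `probs` is made up, and its columns are assumed to be ordered as $[P(y=1|\mathbf{x}), P(y=0|\mathbf{x})]$ to match the $[0.8, 0.2]$ example above; the `classes` array records which label each column stands for.

```python
import numpy as np

# Hypothetical model outputs; each row is [P(y=1|x), P(y=0|x)]
probs = np.array([[0.80, 0.20],
                  [0.30, 0.70],
                  [0.55, 0.45]])

# Label attached to each column, matching the ordering assumed above
classes = np.array([1, 0])

# For each row, take the class whose estimated probability is largest
y_hat = classes[np.argmax(probs, axis=1)]
print(y_hat)  # [1 0 1]
```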

# Logistic regression

----


## Model function (hypothesis)

Let $\mathbf{w}$ be the weight vector, which has the same shape as a sample's feature vector $\mathbf{x}$. Then $h(\mathbf{x})$ is our estimate of $P(y=1|\mathbf{x})$, and $1 - h(\mathbf{x})$ is our estimate of $P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})$:

$$
h(\mathbf{x}) = h(\mathbf{x};\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})}
=: \sigma(\mathbf{w}^\top \mathbf{x}) 
$$
or more compactly, because $y = 0$ or $1$:
$$
P(y|\mathbf{x}) \text{ is estimated by } h(\mathbf{x})^y \big(1 - h(\mathbf{x}) \big)^{1-y}.
$$
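
To see why the compact form works, the sketch below evaluates $h(\mathbf{x})^y \big(1 - h(\mathbf{x})\big)^{1-y}$ for both values of $y$. The weights `w` and sample `x` are hypothetical numbers chosen only for illustration, and `prob_of_label` is an illustrative helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and one sample with 3 features
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

h = sigmoid(w @ x)  # estimate of P(y=1|x)

def prob_of_label(y, h):
    # Compact form: equals h when y = 1 and 1 - h when y = 0
    return h**y * (1.0 - h)**(1 - y)

print(prob_of_label(1, h), prob_of_label(0, h))
```

Since exactly one of the two exponents is 1 and the other is 0, the two estimates always sum to one.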

----

## Loss function

$$
L (\mathbf{w}; X, \mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N 
\Bigl\{y^{(i)} \ln\big( h(\mathbf{x}^{(i)}; \mathbf{w}) \big) 
+ (1 - y^{(i)}) \ln\big( 1 - h(\mathbf{x}^{(i)};\mathbf{w}) \big) \Bigr\}.
\tag{$\star$}
$$

----

## Training

The gradient of the loss function with respect to the weights $\mathbf{w}$ is:

$$
\nabla_{\mathbf{w}} \big( L (\mathbf{w}) \big) 
=\frac{1}{N}\sum_{i=1}^N \big( h(\mathbf{x}^{(i)};\mathbf{w})  - y^{(i)} \big) \mathbf{x}^{(i)}  . \tag{$\dagger$}
$$

In [1]:
import numpy as np

In [None]:
# model h(X; w) = sigma(Xw)
# w: weights
# X: training data
# X.shape[0] is no. of samples, and X.shape[1] is the no. of features
def h(X,w):
    z = np.matmul(X,w)
    return 1.0 / (1.0 + np.exp(-z))

# loss function, averaged over N (size of training data); a vectorized implementation without a for loop
def loss(w,X,y):
    loss_components = np.log(h(X,w)) * y + (1.0 - y)* np.log(1 - h(X,w))
    # above is a shape (N,) array (e.g., (12665,) when N = 12665)
    return -np.mean(loss_components) # same with loss_components.sum()/N

def gradient_loss(w,X,y):
    gradient_for_all_training_data = (h(X,w) - y).reshape(-1,1)*X
    # we should return a (n,) array, which is averaging all N training data's gradient
    return np.mean(gradient_for_all_training_data, axis=0)
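
A minimal sketch of how these functions could be used in a plain gradient descent loop is given here. It redefines `h`, `loss`, and `gradient_loss` so that it runs on its own. The synthetic data, the step size `eta`, and the number of steps `num_steps` are assumptions made for illustration, not values from the lecture.

```python
import numpy as np

def h(X, w):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def loss(w, X, y):
    p = h(X, w)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient_loss(w, X, y):
    return np.mean((h(X, w) - y).reshape(-1, 1) * X, axis=0)

# Synthetic data (assumption): labels drawn from a logistic model
rng = np.random.default_rng(0)
N, n = 200, 3
X = rng.normal(size=(N, n))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=N) < h(X, w_true)).astype(float)

# Gradient descent: w <- w - eta * grad L(w)
w = np.zeros(n)
eta = 0.5        # step size (assumed)
num_steps = 200  # number of iterations (assumed)
loss_history = [loss(w, X, y)]
for _ in range(num_steps):
    w = w - eta * gradient_loss(w, X, y)
    loss_history.append(loss(w, X, y))

print(loss_history[0], loss_history[-1])
print(w)
```

Starting from $\mathbf{w} = \mathbf{0}$, every prediction is 0.5, so the initial loss is $\ln 2$; the loss should then decrease as the iterations proceed.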

# Reading 1: Derivation of the logistic regression

For binary-valued labels, $y^{(i)} \in \{0,1\}$, we are trying to predict the probability that a given example belongs to the "1" class versus the probability that it belongs to the "0" class. Specifically, we will use the **logistic regression**, which tries to learn a function of the form:

$$
h(\mathbf{x}) = h(\mathbf{x};\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})}
=: \sigma(\mathbf{w}^\top \mathbf{x}) 
$$
or more compactly, because $y = 0$ or $1$:
$$
P(y|\mathbf{x}) = h(\mathbf{x})^y \big(1 - h(\mathbf{x}) \big)^{1-y}
$$

----

## Sigmoid function
The function $\sigma(z) = 1/\big(1+\exp(-z)\big)$ is often called the "sigmoid" or "logistic" function, or "logistic/sigmoid" activation function in machine learning. It is an S-shaped function that "squashes" the value of $\mathbf{w}^\top \mathbf{x}$ into the range $(0,1)$ so that we may interpret $h(\mathbf{x})$ as a probability. 
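
The squashing behavior can be checked numerically. The sketch below evaluates the sigmoid on a few hypothetical inputs and illustrates three facts that follow from the formula: the outputs lie strictly between 0 and 1, $\sigma(0) = 0.5$, and $\sigma(-z) = 1 - \sigma(z)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A few hypothetical values of w^T x, from very negative to very positive
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(z)

print(s)                                 # increasing values between 0 and 1
print(np.allclose(sigmoid(-z), 1 - s))   # symmetry: sigma(-z) = 1 - sigma(z)
```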

Our goal is to search for a value of the weights $\mathbf{w}$ so that:
> The probability $P(y=1|\mathbf{x})=h(\mathbf{x})$ is large when $\mathbf{x}$ belongs to the "1" class, small when $\mathbf{x}$ belongs to the "0" class (so that $P(y=0|\mathbf{x})=1- h(\mathbf{x})$ is large). 

----

## Maximum likelihood
For a set of training examples with binary labels $\{(\mathbf{x}^{(i)},y^{(i)}):i=1,\dots,N\}$, the likelihood below measures how well a given model $h(\mathbf{x};\mathbf{w})$ separates the two classes. Assuming the labels of the training samples are independent Bernoulli random variables, we want to maximize the following quantity:
$$
{\begin{aligned}
&P(\mathbf{y}\; | \; \mathbf{X};\mathbf{w} )\\
=&\prod _{i=1}^N P\left(y^{(i)}\mid \mathbf{x}^{(i)};\mathbf{w}\right)\\
=&\prod_{i=1}^N h\big(\mathbf{x}^{(i)} \big)^{y^{(i)}} 
\Big(1-h\big(\mathbf{x}^{(i)}\big) \Big)^{\big(1-y^{(i)}\big)}
\end{aligned}}.
$$
This function is a product of $N$ terms and is highly nonlinear in the weights $\mathbf{w}$, so we take its logarithm, average over the samples, and flip the sign. This defines the loss function to be minimized:
$$
L (\mathbf{w}) = L (\mathbf{w}; X,\mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N 
\Bigl\{y^{(i)} \ln\big( h(\mathbf{x}^{(i)}) \big) 
+ (1 - y^{(i)}) \ln\big( 1 - h(\mathbf{x}^{(i)}) \big) \Bigr\}.
\tag{$\star$}
$$
Note that only one of the two terms in the summation is non-zero for each training sample (depending on whether the label $y^{(i)}$ is 0 or 1). When $y^{(i)}=1$, minimizing the loss function means making $h(\mathbf{x}^{(i)})$ large, and when $y^{(i)}= 0$ it means making $1- h(\mathbf{x}^{(i)})$ large, as explained above. 
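
The relation between the likelihood and the loss $(\star)$ can be verified on a tiny example. The predicted probabilities `h_vals` and labels `y` below are hypothetical numbers; the sketch only checks that $(\star)$ equals $-\frac{1}{N}\ln$ of the likelihood product.

```python
import numpy as np

# Hypothetical predicted probabilities h(x^(i)) and binary labels y^(i)
h_vals = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])
N = len(y)

# Likelihood: product over samples of h^y * (1 - h)^(1 - y)
likelihood = np.prod(h_vals**y * (1 - h_vals)**(1 - y))

# Loss (star): negative average log-likelihood
loss_star = -np.mean(y * np.log(h_vals) + (1 - y) * np.log(1 - h_vals))

print(likelihood)                          # 0.9 * 0.8 * 0.7 * 0.6
print(loss_star, -np.log(likelihood) / N)  # the two values agree
```

Because $-\ln$ is decreasing, maximizing the likelihood and minimizing $(\star)$ select the same weights.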

----

## Training and cross-validation
After the loss function $L (\mathbf{w})$ is set up, gradient descent uses the training data to minimize $L (\mathbf{w})$ and find the best choice of weights $\mathbf{w}$. Even though the loss function $(\star)$ looks quite complicated, its gradient turns out to be simple, thanks to the following special property of the sigmoid function:
$$
\frac{d}{dz} \big(\sigma(z)\big)
 = \frac{d}{dz} \left(\frac{1}{1+\exp(-z)}\right) = \sigma(z)\cdot \big(1-\sigma(z)\big).
$$
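Therefore, recalling that $h(\mathbf{x}) =  \sigma(\mathbf{w}^\top \mathbf{x})$ and that $\frac{\partial}{\partial w_k}\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big) = x^{(i)}_k$, the chain rule gives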
$$
\begin{aligned}
\frac{\partial L (\mathbf{w})}{\partial w_k} & = 
- \frac{1}{N}\sum_{i=1}^N 
\Bigg\{y^{(i)}  \frac{1}{h(\mathbf{x}^{(i)})} \frac{\partial}{\partial w_k} \sigma\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}  \big)
+ (1 - y^{(i)}) \frac{1}{1-h(\mathbf{x}^{(i)})} \frac{\partial}{\partial w_k}\Big(1-  \sigma\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big)  \Big) \Bigg\}
\\
& = - \frac{1}{N}\sum_{i=1}^N 
\Bigg\{y^{(i)}  \frac{1}{h(\mathbf{x}^{(i)})} 
\sigma\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big) 
\cdot \big(1-\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})\big) 
\frac{\partial}{\partial w_k} \big(\mathbf{w}^{\top} \mathbf{x}^{(i)}  \big)
\\
& \qquad \qquad - (1 - y^{(i)}) \frac{1}{1-h(\mathbf{x}^{(i)})} 
\sigma\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big) 
\cdot \big(1-\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})\big) 
\frac{\partial}{\partial w_k}\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big)  \Bigg\}
\\
& = - \frac{1}{N}\sum_{i=1}^N 
\Bigg\{y^{(i)} \cdot \big(1-\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})\big) 
\frac{\partial}{\partial w_k} \big(\mathbf{w}^{\top} \mathbf{x}^{(i)}  \big)
- (1 - y^{(i)}) \cdot
\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)}) 
\frac{\partial}{\partial w_k}\big(\mathbf{w}^{\top} \mathbf{x}^{(i)}\big)  \Bigg\}
\\
& =\frac{1}{N}\sum_{i=1}^N  \big(\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})  - y^{(i)} \big) x^{(i)}_k.
\end{aligned}
$$
The final expression is quite simple: the derivative of the logistic loss function with respect to the $k$-th weight $w_k$ is the average of the residuals $\big(\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})  - y^{(i)} \big)$, each multiplied by the $k$-th component of the $i$-th training sample $\mathbf{x}^{(i)}$.

Therefore the gradient with respect to all the weights $\mathbf{w}$ is:
$$
\nabla_{\mathbf{w}} \big( L (\mathbf{w}) \big) = \frac{1}{N}\sum_{i=1}^N  \big(\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)})  - y^{(i)} \big) \mathbf{x}^{(i)} 
=\frac{1}{N}\sum_{i=1}^N \big( h(\mathbf{x}^{(i)})  - y^{(i)} \big) \mathbf{x}^{(i)}  . \tag{$\dagger$}
$$
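
One way to gain confidence in $(\dagger)$ is to compare it against a finite-difference approximation of the loss. The sketch below does this on random data; the data, the seed, the step `eps`, and the helper name `numerical_gradient` are all assumptions made for illustration. It also checks the sigmoid derivative identity at one point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient_loss(w, X, y):
    # Formula (dagger): average of residual times sample
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]

def numerical_gradient(f, w, eps=1e-6):
    # Central differences, one coordinate at a time
    grad = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        grad[k] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4)

analytic = gradient_loss(w, X, y)
numeric = numerical_gradient(lambda v: loss(v, X, y), w)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be tiny

# Sigmoid derivative identity, checked at z0 = 0.3
z0, eps = 0.3, 1e-6
d_numeric = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2.0 * eps)
d_formula = sigmoid(z0) * (1.0 - sigmoid(z0))
print(d_numeric, d_formula)
```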

# Reading 2: Bayesian classification

Logistic regression and softmax regression are two classification methods that are closely related to Bayesian classifiers. Essentially, by introducing a model with weights, we are trying to minimize the misclassification error over a set of observations:

$$
\min_{\mathbf{w}} \Big[\text{Mean of } 1\big\{y^{(i), \text{Pred}} \neq y^{(i), \text{Actual}} \big\} \Big],
$$

Suppose there is no model yet, and let $K$ be the number of classes. With no weights involved, we simply want to classify the samples into $K$ classes, so the minimization problem above amounts to *assigning each sample to the most likely class it belongs to*, given its values (feature vector), i.e., we want to compute

$$
\max_{j\in \{1,\dots ,K\}} P\big(y^{(i)}=j | \mathbf{x}^{(i)} \big)  \tag{$\diamond$}
$$

where $P\big(y^{(i)}=j | \mathbf{x}^{(i)} \big)$ is the conditional probability that the label $y^{(i)}=j$ (the $i$-th sample is in the $j$-th class), given the observed vector $\mathbf{x}^{(i)}$ for the $i$-th sample. This is called the Bayes classifier; the naive Bayes classifier adds a feature-independence assumption that makes these probabilities computable. 

----

### Naive Bayes classifier

Using the definition of the conditional probability: for an arbitrary sample and its label $(\mathbf{x},y)$

$$
P(y=j | \mathbf {x} )={\frac { P( y = j, \mathbf {x})}{P(\mathbf {x} )}} \tag{$\ast$}
$$

Assuming $\mathbf{x} = (x_1, x_2, \dots, x_n)$, i.e., each sample has $n$ features, the numerator above is $P(y=j)\ P(\mathbf {x} | y = j)$, where $P(y=j)$ is the probability that an arbitrary sample is of class $j$ without any observation $\mathbf{x}$, i.e., $P(y=j)$ is the proportion of class $j$ among all samples. Now using the definition of conditional probability again:

$$
\begin{aligned}
P(y=j,x_{1},\dots ,x_{n}) &= P(x_{1},\dots ,x_{n},y=j)
\\
&= P(x_{1} | x_{2},\dots ,x_{n},y=j) P(x_{2},\dots ,x_{n},y=j)
\\
&= P(x_{1} | x_{2},\dots ,x_{n},y=j) P(x_{2} | x_{3},\dots ,x_{n},y=j) P(x_{3},\dots ,x_{n},y=j)
\\&=\dots 
\\&= P(x_{1} |  x_{2},\dots ,x_{n},y=j) P(x_{2} | x_{3},\dots ,x_{n},y=j)
\dots P(x_{n-1} | x_{n},y=j) P(x_{n}| y=j)P(y=j)\\
\end{aligned} \tag{$\ast\ast$}
$$

Assume the features are independent of one another given the class, which means that adding $x_l$ ($l\neq i$) to the observed conditions does not affect the probability of $x_i$:
$$
P(x_{i} | x_{i+1},\dots ,x_{n}, y =j) = P(x_{i}| y=j).
$$

Since $P(\mathbf{x})$ is a fixed value that does not depend on the class $j$ (e.g., $P(\mathbf{x}) = 1/N$ assuming uniformly distributed samples), we have by $(\ast)$ and $(\ast\ast)$

$$
\begin{aligned}
P(y=j | x_{1},\dots ,x_{n}) &\propto P(y=j,x_{1},\dots ,x_{n})
\\
&=P(y=j)\ P(x_{1} | y=j)\ P(x_{2}| y=j)\ P(x_{3} | y=j)\ \cdots 
\\
&=P(y=j)\prod_{i=1}^{n}P(x_{i}| y=j),
\end{aligned}
$$

Now for a training sample $\mathbf{x}^{(i)} = \big(x^{(i)}_1, \dots, x^{(i)}_n\big)$, the problem becomes:

$$
y^{(i), \text{Pred}}={\underset {j\in \{1,\dots ,K\}}{\operatorname {argmax} }}\ P(y = j)\displaystyle \prod _{l=1}^{n} P\big(x^{(i)}_{l} | y=j\big),
$$

where $y^{(i), \text{Pred}}$ is the class for which the probability $(\diamond)$ is maximized.
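
The argmax rule above can be sketched for the simplest case of binary features, where $P(y=j)$ and $P(x_l | y=j)$ are estimated by counting in the training set. Everything in this block is an illustrative assumption: the toy data, the helper names `fit_naive_bayes` and `class_scores`, and the use of labels $0,\dots,K-1$ (instead of $1,\dots,K$) so that they can index arrays directly.

```python
import numpy as np

# Hypothetical toy data: 8 samples, 3 binary features, K = 2 classes
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
K = 2

def fit_naive_bayes(X, y, K):
    # priors[j] estimates P(y = j) as the fraction of samples in class j
    priors = np.array([np.mean(y == j) for j in range(K)])
    # theta[j, l] estimates P(x_l = 1 | y = j) by counting within class j
    theta = np.array([X[y == j].mean(axis=0) for j in range(K)])
    return priors, theta

def class_scores(x, priors, theta):
    # P(x_l | y = j) is theta[j, l] if x_l == 1, else 1 - theta[j, l]
    cond = np.where(x == 1, theta, 1.0 - theta)  # shape (K, n)
    # P(y = j) * prod_l P(x_l | y = j), one score per class
    return priors * np.prod(cond, axis=1)

priors, theta = fit_naive_bayes(X, y, K)
x_new = np.array([1, 0, 1])
scores = class_scores(x_new, priors, theta)
y_pred = int(np.argmax(scores))
print(scores, y_pred)
```

The scores are proportional to the class probabilities, not equal to them, which is enough for taking the argmax.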

----

### Pitfalls of naive Bayes classifier

In reality, there are two main reasons the method above is neither practical nor reasonable.

* $x_i$ and $x_l$ are generally not independent when $i\neq l$ for a sample $\mathbf{x}$. In the handwritten digit classification example, the $x_i$'s are the pixel intensities at the $i$-th location (one of the 28x28 pixels reshaped into a length-784 array); any reasonable ansatz should not assume independence, because the pixel intensities are determined by the strokes.
* For real data, we do not know the true value of $P(y=j)$, i.e., the percentage of the samples in class $j$, because new data may come in. For the same reason $P(x_{i} | y=j)$ is not known either.

Therefore, we introduce a model (an a priori assumption that the data can be described by such a model) with weights $\mathbf{w}$, and the problem changes to (in the softmax case) the following maximization of the log of the likelihood function (or, up to sign, the cross entropy),
$$
\max_{\mathbf{w}}\sum_{i=1}^N \left\{\sum_{j=1}^K
 1_{\{y^{(i)} = j\}} \ln P\big(y^{(i)}=j | \mathbf{x}^{(i)} ; \mathbf{w} \big) \right\},
$$
in Lectures 15, 16, and 17, we use gradient descent to minimize the negative of the expression above.
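
The objective above can be evaluated directly once a model for $P\big(y=j | \mathbf{x}; \mathbf{w}\big)$ is fixed. The sketch below assumes the usual softmax form, with one weight column per class collected in a matrix `W`; this form is not spelled out in this section, so treat it as an assumption. The data are random, the function names are illustrative, and labels are taken in $\{0,\dots,K-1\}$ for array indexing.

```python
import numpy as np

def softmax(Z):
    # Subtract the row-wise max for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(W, X, y):
    # P[i, j] is the modeled P(y^(i) = j | x^(i); W)
    P = softmax(X @ W)
    # The indicator 1{y^(i) = j} keeps only the true-class column of row i
    return np.sum(np.log(P[np.arange(X.shape[0]), y]))

rng = np.random.default_rng(2)
N, n, K = 30, 4, 3
X = rng.normal(size=(N, n))
y = rng.integers(0, K, size=N)

W = np.zeros((n, K))
print(log_likelihood(W, X, y))  # equals N * ln(1/K) when W = 0

# Same quantity written with an explicit one-hot indicator matrix
one_hot = np.eye(K)[y]
W_rand = rng.normal(size=(n, K))
explicit = np.sum(one_hot * np.log(softmax(X @ W_rand)))
print(np.isclose(explicit, log_likelihood(W_rand, X, y)))
```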