# Classification

Objectives of the workshop:

1. Derive a linear classifier (logistic regression)
2. Derive an objective function (negative log-likelihood)
3. Optimize the likelihood to learn the parameters
4. Predict the class with the highest probability under the model
5. Perform diagnostics to evaluate the model

## Motivation

With univariate linear regression, the output variable takes continuous values, that is $y \in \mathbb{R}$. We are now interested in the case where the output variable takes class labels $y \in \{0, 1, \dots, k\}$.

The example shown below is a classical Convolutional Neural Network (CNN). Given a new image, the objective is to label it with the correct class. The softmax layer transforms the network's outputs into class probabilities, each between 0 and 1 and summing to 1.


Classifying numbers is a multiclass classification problem. When the output is binary, however, we arrive at logistic regression, a valuable component of your machine learning arsenal that is simple yet highly flexible.

![cnn](../Images/cnn.png)

## GLM Recap

---

From last session we learned the three components of a generalized linear model (GLM):

1. Random component: the probability distribution of the response variable ($Y$)
2. Systematic component: the linear combination of the covariates ($X$)
3. Link function ($\eta = g(\mathbb{E}(Y_i))$): specifies the link between random and systematic components or in other words, how the expected value of the response relates to the linear predictor of explanatory variables

Recall the linear regression, which has the following form:

$$
Y = X\beta + \epsilon
$$

We found that we could formulate this model as a simple case of the GLM:

1. The random component of the response is normally distributed. That is, the errors $\epsilon \sim N(0, \sigma^2)$.
2. The systematic component is $X\beta$.
3. The link function is $\eta = g(\mathbb{E}(Y_i)) = \mathbb{E}(Y_i) = X\beta$.


## Probability Basics

---

### Bernoulli Distribution

Consider the case where we have a **binary** response

$$
\begin{equation}
  y_i =\begin{cases}
    1, & \text{if it's raining on the way home tonight}.\\
    0, & \text{otherwise}.
  \end{cases}
\end{equation}
$$

In this case we can view $y_i$ as a random variable $Y_i$ that takes the values one and zero with probabilities $\pi_i$ and $1 - \pi_i$, respectively. The distribution $Y_i$ is known as a _Bernoulli distribution_ with parameter $\pi_i$, which can be written as:

$$
\mathbb{P}(Y_i = y_i) = \pi_i^{y_i} (1 - \pi_i)^{1-y_i}, \quad \text{for} \quad y_i = 0,1
$$

**QUESTION 1**: What are the expectation $\mathbb{E}(Y_i)$ and variance $Var(Y_i)$ of the Bernoulli random variable?

### Binomial Distribution

Now, let's consider the following:

$$
y_i = \text{ the number of successes in group } i
$$

where we view $y_i$ as the realization of a random variable $Y_i$ that takes the values $0, 1, \dots, n_i$. If the $n_i$ observations are independent and have the same probability $\pi_i$ of having the attribute of interest, then the distribution of $Y_i$ is _Binomial_ with parameters $\pi_i$ and $n_i$.

$$
Y_i \sim Bin(n_i, \pi_i)
$$

$$
\begin{equation}
  \mathbb{P}(Y_i = y_i) = \binom{n_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}
\end{equation}
$$
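As a quick numerical illustration of the two probability mass functions above, the sketch below evaluates the Binomial PMF directly from its formula and treats the Bernoulli case as $n_i = 1$. The function names and the values `pi = 0.3` and `n = 5` are illustrative assumptions, not part of the workshop material.

```python
from math import comb


def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ Bin(n, pi), straight from the formula above."""
    if not 0 <= y <= n:
        return 0.0
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


def bernoulli_pmf(y: int, pi: float) -> float:
    """P(Y = y) for a Bernoulli variable, i.e. the n = 1 special case."""
    return binomial_pmf(y, 1, pi)


pi = 0.3  # illustrative success probability (assumption)
n = 5     # illustrative group size (assumption)

# Probabilities of observing 0, 1, ..., n successes.
probs = [binomial_pmf(y, n, pi) for y in range(n + 1)]
total = sum(probs)

print(probs)
print(total)  # the probabilities over the whole support should add up to one
```

Note that the support starts at zero successes, so the list `probs` has `n + 1` entries.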

**QUESTION 2**: What is the expectation $\mathbb{E}(Y_i)$ and variance $Var(Y_i)$ of the Binomial random variable?

### Logistic Distribution

Another important distribution to consider is the _logistic distribution_, which has the following probability density function (PDF):

$$
f(x) = \frac{e^{x}}{(1 + e^{x})^2}
$$

Hence, the cumulative distribution function (CDF) or **logistic function** is

$$
F(x) = \int_{-\infty}^{x} \frac{e^{t}}{(1 + e^{t})^2} \, dt = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}.
$$

The logistic function plays an integral role in the construction of the logistic regression.
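The relationship between the density and the logistic function can be checked numerically. The sketch below, which assumes only NumPy, compares a central finite difference of the CDF with the PDF on a grid; the grid and step size are arbitrary choices.

```python
import numpy as np


def logistic_cdf(x):
    """Logistic function F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def logistic_pdf(x):
    """Logistic density f(x) = exp(x) / (1 + exp(x))**2."""
    return np.exp(x) / (1.0 + np.exp(x)) ** 2


x = np.linspace(-6.0, 6.0, 1201)  # arbitrary evaluation grid
h = 1e-5                          # finite-difference step

# Central difference approximation of dF/dx, which should match f(x).
numeric_derivative = (logistic_cdf(x + h) - logistic_cdf(x - h)) / (2.0 * h)
max_gap = np.max(np.abs(numeric_derivative - logistic_pdf(x)))

print(max_gap)            # small: the derivative of F agrees with f
print(logistic_cdf(0.0))  # the distribution is symmetric around zero
```

A useful by-product is the identity $f(x) = F(x)(1 - F(x))$, which follows from the two formulas above and reappears in the Hessian of the log-likelihood.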

## The Logit Transformation

---

The next step in defining a model for our data concerns the systematic
structure. We would like to have the probabilities $\pi_i$ depend on a vector
of observed covariates $\mathbf{x}_i$. The simplest idea would be to let $\pi_i$ be a linear
function of the covariates, say:

$$
\pi_i = \mathbf{x}_i^T\mathbf{\beta},
$$

where $\beta$ is a vector of regression coefficients. The concern here is that the response has to be a probability, thereby requiring values between 0 and 1, which cannot be guaranteed by the linear predictor $\mathbf{x}_i^T\mathbf{\beta}$, which takes on any real value. 

One way to resolve this issue is to apply a continuous transformation to the linear predictor that maps any real value to a probability between 0 and 1. This is where our logistic function comes in.

$$
\pi_i = \mathbb{P}(Y_i = 1 | X, \beta) = F(\mathbf{x}_i^T\mathbf{\beta}) = \text{logistic}(\mathbf{x}_i^T\mathbf{\beta}) = \frac{1}{1 + \exp(-\mathbf{x}_i^T\mathbf{\beta})},
$$

where $F(\cdot)$ is the cumulative distribution function of the logistic distribution, or in other words, the "logistic function". We can also consider the case where we want to explain the log odds given a linear predictor. Because the logistic function is strictly increasing, it is invertible, and its inverse is the logit:

$$
F^{-1}(\pi_i) = \text{logit}(\pi_i) = \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \mathbf{x}_i^T\mathbf{\beta}
$$

The formulation above lets us move between the log-odds and the probability in either direction. We are now in a position to define the logistic regression model, by assuming that the logit of the probability $\pi_i$, rather than the probability itself, follows a linear model.

![loglin](../Images/logistic_vs_linear.jpg)
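To make the round trip between the linear predictor and the probability concrete, here is a minimal sketch assuming NumPy. The values in `eta` are arbitrary stand-ins for $\mathbf{x}_i^T\mathbf{\beta}$, and the function names are illustrative.

```python
import numpy as np


def logistic(a):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def logit(p):
    """Map a probability in (0, 1) to the log-odds."""
    return np.log(p / (1.0 - p))


# Arbitrary values of the linear predictor (assumption for illustration).
eta = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])

p = logistic(eta)    # probabilities, all strictly between 0 and 1
eta_back = logit(p)  # applying the logit recovers the linear predictor

print(p)
print(eta_back)
```

The raw values in `eta` fall outside $[0, 1]$ and could not be used as probabilities directly, whereas every entry of `p` is a valid probability.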

**Question 3**: Show that the inverse of the logit is the logistic function, that is, $\text{logit}^{-1}(a) = \text{logistic}(a) = \frac{1}{1 + e^{-a}}$.

Discussion:

* Can we use other CDFs instead of the logistic function?
* The logistic regression is a simple case of the perceptron: a single layer neural network!
* Interpretation of log-odds

## Logistic Regression

---

In our framework of GLM, the logistic regression has the following components:

1. Random component: The distribution of Y is assumed to be Binomial($n$,$\pi$), where $\pi$ is a probability of "success". For simplicity, we can let $Y \sim Binomial(n=1,\pi)$.
2. Systematic component: X's are explanatory variables (can be continuous, discrete, or both) and are linear in the parameters. Transformations of the X's themselves are allowed, as in linear regression; this holds for any GLM.
3. Link function: Logit link, $\eta=\text{logit}(\pi)=\text{log} \left(\dfrac{\pi}{1-\pi}\right)$

More generally, the logit link models the log odds of the mean, and the mean here is $\pi$. Binary logistic regression models are also known as logit models when the predictors are all categorical.

**Question 4**: Define the probability mass function (PMF).

**Question 5**: Derive the log-likelihood, which should look as follows:

$$
l(\beta) \,=\, \sum_{i=1}^N x_i^T\!\beta y_i \, -\, \sum_{i=1}^N n_i\,\log\left(1 + e^{x_i^T\!\beta}\right).
$$
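Once derived, the expression above is straightforward to evaluate. The sketch below is one possible NumPy implementation on synthetic data; the data, the seed and the parameter values are assumptions made for illustration. For binary data ($n_i = 1$) it is compared with the Bernoulli log-likelihood written in terms of $\pi_i$. For $n_i > 1$ the expression omits the binomial coefficient, which does not depend on $\beta$.

```python
import numpy as np


def log_likelihood(beta, X, y, n=None):
    """l(beta) = sum_i y_i x_i'beta - sum_i n_i log(1 + exp(x_i'beta))."""
    eta = X @ beta
    if n is None:
        n = np.ones_like(eta)  # binary case: one trial per observation
    # logaddexp(0, eta) computes log(1 + exp(eta)) without overflow.
    return np.sum(y * eta) - np.sum(n * np.logaddexp(0.0, eta))


# Synthetic binary data (assumption: not the workshop dataset).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta = np.array([-0.5, 1.0, 2.0])
pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, pi)

ll = log_likelihood(beta, X, y)

# Same quantity written as sum_i [y_i log(pi_i) + (1 - y_i) log(1 - pi_i)].
direct = np.sum(y * np.log(pi) + (1 - y) * np.log(1.0 - pi))

print(ll, direct)
```

Using `np.logaddexp` rather than `np.log(1 + np.exp(eta))` is a design choice that keeps the computation stable when the linear predictor is large.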

**Question 6**: Derive the gradient of the log-likelihood. Your result should look like the following:

$$
\begin{aligned}
\frac{\partial l}{\partial\beta_j} &= \sum_{i=1}^N y_i x_{ij} \, - \, \sum_{i=1}^N n_i \left(\frac{1}{1+e^{x_i^T\!\beta}}\right) e^{x_i^T\!\beta}x_{ij} \\
&= \sum_{i=1}^N (y_i - \mu_i)x_{ij},
\end{aligned}
$$

where $\mu_i = n_i\pi_i$ and $\pi_i = \text{logistic}(x_i^T\beta)$.

**Question 7**: Derive the Hessian of the log-likelihood. Your result should look like the following:

$$
\begin{aligned}
\frac{\partial^2 l}{\partial\beta_j\partial\beta_k} &= -\sum_{i=1}^N n_i x_{ij}\,\frac{\partial}{\partial\beta_k} \left(\frac{e^{x_i^T\!\beta}}{1+e^{x_i^T\!\beta}}\right) \\
&= -\sum_{i=1}^N n_i \pi_i(1-\pi_i)x_{ij}x_{ik}
\end{aligned}
$$
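A handy way to check hand-derived results such as the gradient and Hessian is to compare them with finite differences. The sketch below does this with NumPy on synthetic grouped data; the data, the seed, the helper `central_difference` and the evaluation point `beta` are all assumptions for illustration. It deliberately stops short of the Newton-Raphson iteration.

```python
import numpy as np


def sigmoid(eta):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-eta))


def log_likelihood(beta, X, y, n):
    eta = X @ beta
    return np.sum(y * eta) - np.sum(n * np.logaddexp(0.0, eta))


def gradient(beta, X, y, n):
    """Entries sum_i (y_i - mu_i) x_ij, with mu_i = n_i * pi_i."""
    mu = n * sigmoid(X @ beta)
    return X.T @ (y - mu)


def hessian(beta, X, n):
    """Entries -sum_i n_i pi_i (1 - pi_i) x_ij x_ik."""
    pi = sigmoid(X @ beta)
    w = n * pi * (1.0 - pi)
    return -(X * w[:, None]).T @ X


def central_difference(f, beta, h=1e-6):
    """Numerical derivative of f at beta, one coordinate at a time."""
    rows = []
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        rows.append((f(beta + step) - f(beta - step)) / (2.0 * h))
    return np.array(rows)


# Synthetic grouped data (assumption: not the workshop dataset).
rng = np.random.default_rng(1)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
n = rng.integers(1, 6, size=N)  # group sizes between 1 and 5
y = rng.binomial(n, sigmoid(X @ np.array([-0.5, 1.0, -1.0])))
beta = np.array([0.2, -0.4, 0.7])  # arbitrary evaluation point

num_grad = central_difference(lambda b: log_likelihood(b, X, y, n), beta)
num_hess = central_difference(lambda b: gradient(b, X, y, n), beta)

grad_gap = np.max(np.abs(gradient(beta, X, y, n) - num_grad))
hess_gap = np.max(np.abs(hessian(beta, X, n) - num_hess))
max_eig = np.linalg.eigvalsh(hessian(beta, X, n)).max()

print(grad_gap, hess_gap)  # both should be close to zero
print(max_eig)             # negative: the log-likelihood is concave here
```

The negative eigenvalues of the Hessian reflect the concavity of the log-likelihood, which is what makes second-order methods well behaved for this model.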

**Question 8** (Assignment): Perform Newton-Raphson to find the Maximum Likelihood Estimates (MLE).

Discussion:
 * Assumptions behind logistic regression.

## Interpretation

---

The interpretation of the weights in logistic regression differs from the interpretation of the weights in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The weights no longer influence the probability linearly. The weighted sum is transformed by the logistic function to a probability. Therefore we need to reformulate the equation for the interpretation so that only the linear term is on the right side of the formula. To be explicit, let's use $\mathbb{P}(Y_i = 1) = \pi_i$:

$$
\log\left(\frac{\mathbb{P}(y=1)}{1-\mathbb{P}(y=1)}\right)=\log\left(\frac{\mathbb{P}(y=1)}{\mathbb{P}(y=0)}\right)=\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{p}x_{p}
$$

Exponentiating both sides leads us to the odds:

$$
\frac{\mathbb{P}(y=1)}{1-\mathbb{P}(y=1)}=\text{odds}=\exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{p}x_{p}\right)
$$

Then we compare what happens when we increase one of the feature values by 1. But instead of looking at the difference, we look at the ratio of the two predictions:

$$
\frac{\text{odds}_{x_j+1}}{\text{odds}}=\frac{\exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{j}(x_{j}+1)+\ldots+\beta_{p}x_{p}\right)}{\exp\left(\beta_{0}+\beta_{1}x_{1}+\ldots+\beta_{j}x_{j}+\ldots+\beta_{p}x_{p}\right)}
$$

We apply the following rule:

$$
\frac{\exp(a)}{\exp(b)}=\exp(a-b)
$$

And we remove many terms:

$$
\frac{\text{odds}_{x_j+1}}{\text{odds}}=\exp\left(\beta_{j}(x_{j}+1)-\beta_{j}x_{j}\right)=\exp\left(\beta_j\right)
$$

In the end, we have something as simple as $\exp(\cdot)$ of a feature weight. Increasing a feature by one unit multiplies the odds by a factor of $\exp(\beta_j)$; this factor is the odds ratio. We could also interpret it this way: a one-unit increase in $x_j$ changes the log odds by the value of the corresponding weight $\beta_j$.
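The result can be verified numerically. The following sketch uses made-up coefficients and feature values (all of them assumptions, unrelated to the fitted model in the application below) and compares the ratio of odds before and after a one-unit increase in the first feature with $\exp(\beta_1)$.

```python
import numpy as np


def odds(x, beta0, beta):
    """Odds P(y=1) / P(y=0) implied by the logistic regression model."""
    return np.exp(beta0 + x @ beta)


# Hypothetical intercept, weights and feature vector (assumptions).
beta0 = -1.0
beta = np.array([0.8, -0.3])
x = np.array([1.5, 2.0])

# Increase the first feature by one unit, leave the other unchanged.
x_plus = x.copy()
x_plus[0] += 1.0

odds_ratio = odds(x_plus, beta0, beta) / odds(x, beta0, beta)

# The corresponding probabilities, for comparison.
p = odds(x, beta0, beta) / (1.0 + odds(x, beta0, beta))
p_plus = odds(x_plus, beta0, beta) / (1.0 + odds(x_plus, beta0, beta))

print(odds_ratio, np.exp(beta[0]))  # the two values agree
print(p_plus - p)                   # this change depends on where x started
```

The odds ratio is the same at every starting point `x`, whereas the change in probability is not, which is why the weights are interpreted on the odds scale.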

Discussion:

* How do we interpret the intercept?

## Application

---

In [36]:
#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [37]:
#Import Dataset
dataset = pd.read_csv('data/social_network_data.csv')
dataset.describe()

| | User ID | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 15691540.0 | 37.655 | 69742.5 | 0.3575 |
| std | 71658.32 | 10.482877 | 34096.960282 | 0.479864 |
| min | 15566690.0 | 18.0 | 15000.0 | 0.0 |
| 25% | 15626760.0 | 29.75 | 43000.0 | 0.0 |
| 50% | 15694340.0 | 37.0 | 70000.0 | 0.0 |
| 75% | 15750360.0 | 46.0 | 88000.0 | 1.0 |
| max | 15815240.0 | 60.0 | 150000.0 | 1.0 |


In [39]:
x = dataset.iloc[:,[2,3]].values
y =dataset.iloc[:,4].values

In [40]:
#Split Training Set and Testing Set
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5)

In [32]:
#Feature Scaling
sc_X=StandardScaler()
x_train=sc_X.fit_transform(x_train)
x_test=sc_X.transform(x_test)

In [27]:
#Training the Logistic Model
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [34]:
print('intercept:', classifier.intercept_)
print('coefficient:', classifier.coef_)

intercept: [-0.92688212]
coefficient: [[1.86983445 0.9005177 ]]


**Question 9**: What is the impact on the odds of a purchase with a dollar increase in EstimatedSalary?

**Question 10**: What are the odds of purchase with an age of 38 and estimated salary of $60,000?

## Diagnostics & Prediction

---

Because the response is binary and is modelled through a logit link, we cannot work with the residuals as we would with OLS. One way to analyze classification accuracy and fit is through the _confusion matrix_. An example is shown below:

In [28]:
#Predicting the Test Set Result
y_pred = classifier.predict(x_test)

In [29]:
#Create Confusion Matrix for Evaluation
cm = confusion_matrix(y_test, y_pred)

In [30]:
print(cm)

[[123   6]
 [ 25  46]]


This confusion matrix captures a large amount of information that one can use to examine their model's fit or improve training.

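* Sensitivity - measures a test's ability to identify positive results: true positives / (true positives + false negatives).
* Specificity - measures a test's ability to identify negative results: true negatives / (true negatives + false positives).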

![cm](../Images/confusion.png)
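As a worked example, the two measures can be computed from the confusion matrix printed earlier. The sketch below hard-codes that matrix and assumes scikit-learn's layout for `confusion_matrix`: rows are true labels, columns are predicted labels, and class 0 comes first.

```python
import numpy as np

# Confusion matrix printed above; rows = true class, columns = predicted class.
cm = np.array([[123, 6],
               [25, 46]])

# With labels ordered (0, 1), the flattened matrix reads TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)     # share of actual positives that were found
specificity = tn / (tn + fp)     # share of actual negatives that were found
accuracy = (tp + tn) / cm.sum()  # share of all test observations classified correctly

print(f"sensitivity: {sensitivity:.3f}")  # 0.648
print(f"specificity: {specificity:.3f}")  # 0.953
print(f"accuracy:    {accuracy:.3f}")     # 0.845
```

On this particular split the classifier identifies non-purchasers far more reliably than purchasers, something the overall accuracy alone would hide.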