<h1 style="color:brown">Supervised Learning: Classification</h1>

## Logistic Regression

---

### Introduction

Logistic regression is one of the most basic classification methods in machine learning. Thanks to its simplicity and efficiency, the algorithm is widely used in practice. In this course, we will explore the principle of logistic regression and its implementation, and use scikit-learn to build a logistic regression classification model.

### Key Points

- The relationship between logistic regression and linear regression<br>
- Logistic regression model<br>
- Logarithmic loss function<br>
- Gradient descent

### Environment

- Python 3.6.6<br>
- NumPy 1.18.1<br>
- Matplotlib 3.0.3<br>
- Pandas 0.25.3<br>
- scikit-learn 0.21.3

---

## Basics of Logistic Regression

When you hear the name "logistic regression", the first thing you may notice is the word 'regression', and that may be confusing. Didn't we cover regression last week? Why does logistic regression appear in this week's material, which focuses on classification?

Therefore, at the beginning of this course, it is necessary to emphasize: **Logistic regression is a classification method, not a regression method.** Keep this in mind throughout.

So, **why does logistic regression have the word 'regression' in its name? Does it really have nothing to do with the regression methods covered earlier?**<br>
<br>
Regarding this question, we will answer it at the end of the lesson.

Here is a video about logistic regression. We hope it deepens your understanding of the topic | Source: [Statquest.org](https://youtu.be/yIYKR4sgzI8/)

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('yIYKR4sgzI8', width=800, height=300)

### Linear Separability

First, we need to introduce a concept: linear separability. As shown in the figure below, **in a two-dimensional plane**, if the samples of the two classes can be separated by a single straight line, they are called linearly separable; otherwise they are linearly inseparable:

In [None]:
from IPython.display import Image
import os
Image("../input/week-3-images/Logistic-Regression-Classification-1.jpeg")

Likewise, in a three-dimensional space, samples that can be separated by a plane are also called linearly separable. Since this course does not use that case, we will not discuss it here.

### Classification with Linear Regression

In the previous section, we focused on linear regression. To sum up, linear regression predicts continuous values by fitting a straight line. In fact, besides regression problems, linear regression can also be used for classification **in special cases**. For example:

We have the following dataset, which contains only `1` feature and `1` target. Each student has one value in `scores`, and `passed` records whether that student's result for the course is `PASS`:

In [None]:
"""
Sample data
"""

scores=[[1],[1],[2],[2],[3],[3],[3],[4],[4],[5],[6],[6],[7],[7],[8],[8],[8],[9],[9],[10]]
passed= [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]

**☞ Exercise:**

In [None]:
#Type your code here-


In the above dataset, `passed` takes only the values `0` and `1`, which are numeric. However, if we let `1` and `0` denote `PASS` and `NOT PASS`, respectively, the task becomes a classification problem. More precisely, it is a typical binary classification problem: there are only two categories, so it is also called a `0-1` classification problem.<br>
<br>
How can such a binary classification problem be solved with linear regression?

Here we can define a rule on the output of the fitted linear function $f(x)$: $f(x) > 0.5$ (closer to 1) means `PASS`, and $f(x) \leq 0.5$ (closer to 0) means `NOT PASS`:<br>

$$
f(x) \gt 0.5 \Rightarrow y=1 \\
f(x) \leq 0.5 \Rightarrow y=0
\tag1
$$

In this way, linear regression can be skillfully used to solve binary classification problems.

Below we will start the practice. First draw a scatter plot corresponding to the dataset in the 2D plane:

In [None]:
"""
Draw a scatter plot with sample data
"""

from matplotlib import pyplot as plt
%matplotlib inline
 
plt.scatter(scores, passed, color='r')
plt.xlabel("scores")
plt.ylabel("passed")


Then we use `sklearn` to complete the linear regression fitting. I believe that after studying the first week's contents you are familiar enough with the linear regression fitting process:

In [None]:
"""
Linear regression fitting
"""

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(scores, passed)
model.coef_, model.intercept_


Next draw the fitted line into the scatter plot:

In [None]:
"""
Draw the plot after fitting process
"""

import numpy as np

x = np.linspace(-2,12,100)

plt.plot(x, model.coef_[0] * x + model.intercept_)
plt.scatter(scores, passed, color='r')
plt.xlabel("scores")
plt.ylabel("passed")


Following the definition above, the fitted linear function $f(x)$ predicts `PASS` when $f(x) > 0.5$ and `NOT PASS` when $f(x) \leq 0.5$.

Then, as shown in the figure below, every sample whose `scores` value is larger than the $x$ coordinate of the orange vertical line is judged as `PASS`, which means the two points circled by the brown box are misclassified:

In [None]:
#!ls ../input/week-3-images/
Image("../input/week-3-images/Untitled.jpg", width="800")
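The misclassification described above can also be checked numerically. The following is a minimal sketch that refits the same least-squares line with `np.polyfit` (used here in place of `LinearRegression` so the block only needs NumPy) and applies rule $(1)$. The names `x_scores`, `y_passed`, `threshold` and `misclassified` are introduced only for this illustration.

```python
import numpy as np

# Same sample data as above, stored as flat arrays (names are illustrative)
x_scores = np.array([1, 1, 2, 2, 3, 3, 3, 4, 4, 5,
                     6, 6, 7, 7, 8, 8, 8, 9, 9, 10], dtype=float)
y_passed = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                     1, 0, 0, 1, 1, 1, 1, 1, 1, 1])

# Least-squares straight line f(x) = slope * x + intercept
slope, intercept = np.polyfit(x_scores, y_passed, deg=1)

# Score at which the fitted line crosses 0.5 (the vertical line in the figure)
threshold = (0.5 - intercept) / slope

# Apply rule (1): predict 1 when f(x) > 0.5, otherwise 0
predicted = (slope * x_scores + intercept > 0.5).astype(int)
misclassified = int((predicted != y_passed).sum())

print(f"f(x) = {slope:.4f} * x + {intercept:.4f}")
print(f"decision threshold on scores: {threshold:.2f}")
print(f"misclassified samples: {misclassified}")
```

The threshold falls just below a score of 6, so the two samples with scores 6 and 7 that did not pass end up on the `PASS` side of the line.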

## Principle and Implementation of Logistic Regression

### Sigmoid Distribution Function

As seen above, although linear regression can be used for a binary classification problem, its results are not ideal. In particular, the fitted line can output negative values or values greater than $1$, which cannot be read as probabilities. A better way to solve the binary classification problem is a method called **logistic regression**.

Here we introduce a function called **Sigmoid**, which is defined as follows:<br>

$$
f(z)=\frac{1}{1+e^{-z}}
\tag2
$$

You may feel some confusion: Why do we suddenly introduce such a function? <br>
<br>
Below we draw the curve of this function. Take a look. Perhaps you will understand:

In [None]:
"""
Sigmoid function
"""

z = np.linspace(-12, 12, 100) # Generate equidistant x values for easy drawing
sigmoid = 1 / (1 + np.exp(-z))
plt.plot(z, sigmoid)
plt.xlabel("z")
plt.ylabel("y")


The figure above shows the Sigmoid function, which has an "S" shape (hence the name: sigmoid means S-shaped). Its value always lies strictly between `0` and `1`, and the curve is symmetric about the point $(0, 0.5)$. The larger `z` is, the closer `y` is to `1`; the smaller `z` is, the closer `y` is to `0`. If we use `0.5` as the demarcation point, we can divide the values `> 0.5` and the values `<= 0.5` into two categories, which makes the function a natural fit for binary classification problems.
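These properties are easy to confirm on a few values. The sketch below is only an illustration: `sigmoid_fn`, `z_values`, `s_values` and `labels` are names chosen for this example, and the `0.5` cut-off is the demarcation point described above.

```python
import numpy as np

def sigmoid_fn(z):
    """Sigmoid function from equation (2)."""
    return 1 / (1 + np.exp(-z))

z_values = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s_values = sigmoid_fn(z_values)

# Every output lies strictly between 0 and 1, and sigmoid_fn(0) is exactly 0.5
print(s_values)

# Symmetry about the point (0, 0.5): f(-z) = 1 - f(z)
print(np.allclose(sigmoid_fn(-z_values), 1 - s_values))

# Using 0.5 as the demarcation point turns the outputs into two categories
labels = (s_values > 0.5).astype(int)
print(labels)
```

Note that a value of exactly `0.5` (at `z = 0`) falls into the `<= 0.5` category under this rule.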

Here we have to introduce a mathematical definition: if the cumulative distribution function of a continuous random variable has the form of the Sigmoid function, the variable is said to follow a logistic distribution. **The logistic distribution is a concept from probability theory; it is a _continuous probability distribution_**.

### Logistic Regression Model

In the example of the above section, we solve the classification problem by linear regression. `y` of the fitted linear function was found to be between $\left ( - \infty, + \infty \right )$. For the Sigmoid function, however, mentioned in the above section, its `y` is between $\left ( 0,1 \right )$.

So, consider combining the concepts here, that is, the result of fitting the linear function is compressed to $\left ( 0,1 \right )$ using the Sigmoid function. If `y` of the linear function is larger, it means that the probability is closer to `1`, and vice versa.

So, in logistic regression, we define:

$$
z = {w_0}{x_0} + {w_1}{x_1} + \cdots + {w_n}{x_n} = {w^T}x \\
f(z)=\frac{1}{1+e^{-z}}
\tag3
$$

For equation $(3)$, in general: $w_0=b$, $x_0=1$, which corresponds to the intercept term in the linear function.

In the above equation, we multiply each feature in $x$ by its coefficient in $w$, sum the results to obtain $z$, and then pass $z$ through the Sigmoid function to get the probability $f(z)$. The set of points where $z = 0$, i.e. where $f(z) = 0.5$, forms the classification boundary. Therefore:

$$
h_{w}(x) = f({w^T}x)
\tag4
$$

Below we implement the above equation $(3)$:

In [None]:
"""
Logistic Regression Model
"""

def sigmoid(z):
    sigmoid = 1 / (1 + np.exp(-z))
    return sigmoid
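To see how equations $(3)$ and $(4)$ fit together, here is a small sketch that evaluates $h_w(x)$ for a few samples. The weights and samples are made-up numbers chosen so the arithmetic is easy to follow, and the names `w_example`, `X_example` and `X_aug` are hypothetical.

```python
import numpy as np

def sigmoid_fn(z):
    """Sigmoid function from equation (2)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical weights: w_example[0] is the intercept b, the rest multiply the features
w_example = np.array([-4.0, 1.0, 0.5])

# Three hypothetical samples with two features each
X_example = np.array([[1.0, 2.0],
                      [3.0, 2.0],
                      [6.0, 4.0]])

# Prepend x_0 = 1 to every sample so that w^T x includes the intercept
X_aug = np.hstack([np.ones((X_example.shape[0], 1)), X_example])

z_example = X_aug @ w_example      # linear part of equation (3): values -2, 0 and 4
h_example = sigmoid_fn(z_example)  # equation (4): roughly 0.119, 0.5 and 0.982

print(z_example)
print(h_example)
```

The second sample has $z = 0$, so it sits exactly on the classification boundary and receives a probability of `0.5`.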


### Logarithmic Loss Function

Next, we attempt to solve for the parameter $w$ in equation $(3)$. Before that, we need to define the loss function. Recall that the loss function measures the difference between the predicted value and the ground truth. For example, in linear regression we use the squared loss function. In logistic regression, we usually use a logarithmic loss function:

$$
J(w) = \frac{1}{m} \sum_{i=1}^m \left [ - y_i \log (h_{w}(x_i)) - (1-y_i) \log(1-h_{w}(x_i))  \right ]\tag{5}
$$

You might wonder why we do not simply reuse the squared loss function from linear regression. There is a mathematical reason. The purpose of a loss function is to be minimized with the chosen optimization method, and the parameters at the minimum define the fitted model. For a convex function, any local minimum is also the global minimum, whereas on a non-convex function an optimizer such as gradient descent may get stuck in a local optimum. When the squared loss is combined with the Sigmoid function in logistic regression, the resulting function of $w$ is non-convex, so the global optimum may be missed. The logarithmic loss, in contrast, is convex in $w$, which avoids this problem.

Of course, the above explanation involves a fair amount of mathematics. Optimization theory in particular is usually taught in postgraduate courses, so it may be hard to follow at first. If you can't get it, just remember that the logarithmic loss function is the recommended choice for logistic regression.

There is a video about Logarithmic Loss Function. We hope it can deepen your understanding of Supervised Learning: Logarithmic Loss Function in Classification

[Video] Logarithmic Loss Function | Source: [End-to-end Machine Learning](https://end-to-end-machine-learning.teachable.com/courses/polynomial-regression-optimization/lectures/9559488)

In [None]:
YouTubeVideo('fr7dfyfB7mI', width=800, height=300)

Below, we implement the equation $(5)$:

In [None]:
"""
Logarithmic Loss Function
"""

def loss(h, y):
    loss = (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    return loss
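A few hand-picked predictions show how the logarithmic loss behaves. The sketch below uses a variant named `log_loss_fn` that clips the probabilities away from exactly `0` and `1`; the clipping and its `eps` value are assumptions added for this illustration (without them, `np.log(0)` would produce `inf` or `nan`), not part of the `loss` function above.

```python
import numpy as np

def log_loss_fn(h, y, eps=1e-12):
    """Equation (5), with h clipped to [eps, 1 - eps] (an added safeguard)."""
    h = np.clip(h, eps, 1 - eps)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

y_true = np.array([1, 0, 1, 0])

confident_right = np.array([0.9, 0.1, 0.9, 0.1])  # close to the true labels
unsure = np.array([0.5, 0.5, 0.5, 0.5])           # no information
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])  # close to the wrong labels

print(f"{log_loss_fn(confident_right, y_true):.4f}")  # 0.1054
print(f"{log_loss_fn(unsure, y_true):.4f}")           # 0.6931
print(f"{log_loss_fn(confident_wrong, y_true):.4f}")  # 2.3026

# Predictions of exactly 0 or 1 on the wrong side stay finite thanks to clipping
extreme = log_loss_fn(np.array([0.0, 1.0]), np.array([1, 0]))
print(np.isfinite(extreme))
```

The loss grows quickly as a prediction becomes confident and wrong, which is the behaviour we want from a classification loss.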


### Gradient Descent

In the section above, we have successfully defined and implemented the logarithmic loss function. Up to now, it is only one step away from solving the optimal parameters, that is, to find the minimum value of the loss function.

In order to find the minimum value of the equation $(5)$, a solution method called **gradient descent** is introduced here. Gradient descent is a very common and classical optimization algorithm, through which we can quickly find the minimum value of a function. The principle of the gradient descent method will be explained below. We hope that you can understand it carefully. Many of the following cases will apply the gradient descent method.

To understand gradient descent, let's first answer the question: what is a 'gradient'? A gradient is a vector that points in the direction in which the directional derivative of a function is largest at a given point, i.e., the direction along which the function increases fastest. In short, for a function of one variable the gradient is the derivative at a certain point; for a multivariate function, it is the vector of partial derivatives at a certain point.

Since the function increases fastest along the gradient, it decreases fastest in the opposite direction. The core of the gradient descent method is therefore that we **seek the minimum value of the loss function by repeatedly stepping in the direction of the negative gradient**. The process is shown below:

In [None]:
Image("../input/week-3-images/Logistic Regression Classification 4.jpeg", width="800")

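Therefore, we take the partial derivatives of equation $(5)$ with respect to $w$ and get the gradient, where $X$ is the matrix whose rows are the samples $x_i$ and $y$ is the vector of labels:

$$
\frac{\partial{J}}{\partial{w}} = \frac{1}{m}X^T(h_{w}(X)-y) \tag6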
$$
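For readers who want to see where equation $(6)$ comes from, here is a short derivation sketch. It writes $h_i = h_w(x_i)$ and $x_{ij}$ for the $j$-th feature of sample $i$ (shorthand introduced only for this derivation) and uses the Sigmoid derivative $f'(z) = f(z)(1 - f(z))$:

$$
\frac{\partial h_i}{\partial w_j} = h_i (1 - h_i)\, x_{ij}
$$

$$
\frac{\partial J}{\partial w_j}
= \frac{1}{m} \sum_{i=1}^m \left[ -\frac{y_i}{h_i} + \frac{1-y_i}{1-h_i} \right] h_i (1 - h_i)\, x_{ij}
= \frac{1}{m} \sum_{i=1}^m \left[ -y_i (1 - h_i) + (1 - y_i) h_i \right] x_{ij}
= \frac{1}{m} \sum_{i=1}^m (h_i - y_i)\, x_{ij}
$$

Stacking these partial derivatives for all $j$ gives the matrix form in equation $(6)$.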

Multiplying the gradient by a constant $\alpha$ gives the step taken in each iteration of gradient descent (the length of the arrow above). After enough iterations, we reach a point where the gradient is close to zero, which corresponds to the minimum value of the loss function. The constant $\alpha$ is often referred to as the learning rate. A single weight update is:

$$
w \leftarrow w - \alpha \frac{\partial{J}}{\partial{w}}
$$
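Before applying this update rule to logistic regression, it may help to watch it work on a one-variable toy function. The sketch below minimizes $J(w) = (w - 3)^2$, whose minimum is at $w = 3$; the function, the starting point and the learning rate are all choices made for this illustration.

```python
def grad_J(w):
    """Derivative of the toy function J(w) = (w - 3)^2."""
    return 2 * (w - 3)

w_demo = 0.0      # starting point (arbitrary)
alpha_demo = 0.1  # learning rate (arbitrary)
history = [w_demo]

for _ in range(50):
    # Weight update: move against the gradient, scaled by the learning rate
    w_demo = w_demo - alpha_demo * grad_J(w_demo)
    history.append(w_demo)

print(history[:4])  # the first steps are large
print(w_demo)       # after 50 steps the value is very close to 3
```

Each step shrinks the distance to the minimum by the same factor, so the steps become smaller as the gradient approaches zero.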

There is a video about Gradient Descent. We hope it can deepen your understanding of Supervised Learning: Gradient Descent in Classification

[Video] Gradient Descent | Source: [Statquest.org](https://statquest.org/video-index/)

In [None]:
YouTubeVideo('sDv4f4s2SB8', width=800, height=300) 

Below we implement the equation $(6)$:

In [None]:
"""
Calculate gradient
"""

def gradient(X, h, y):
    gradient = np.dot(X.T, (h - y)) / y.shape[0]
    return gradient
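A common sanity check for a hand-written gradient is to compare it with a finite-difference approximation of the loss. The sketch below does this on random data; the helper names, the random data and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_fn(z):
    return 1 / (1 + np.exp(-z))

def log_loss_fn(h, y):
    """Equation (5)."""
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def analytic_gradient(X, h, y):
    """Equation (6)."""
    return X.T @ (h - y) / y.shape[0]

# Small random problem: 20 samples, intercept column plus 2 features
rng = np.random.default_rng(0)
X_chk = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y_chk = rng.integers(0, 2, size=20)
w_chk = rng.normal(size=3)

analytic = analytic_gradient(X_chk, sigmoid_fn(X_chk @ w_chk), y_chk)

# Central finite differences: perturb one weight at a time
eps = 1e-6
numeric = np.zeros_like(w_chk)
for j in range(w_chk.size):
    step = np.zeros_like(w_chk)
    step[j] = eps
    loss_plus = log_loss_fn(sigmoid_fn(X_chk @ (w_chk + step)), y_chk)
    loss_minus = log_loss_fn(sigmoid_fn(X_chk @ (w_chk - step)), y_chk)
    numeric[j] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(analytic)
print(numeric)
print(max_diff)
```

If the two vectors agree to many decimal places, the gradient formula and the loss function are consistent with each other.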


### Logistic Regression Implementation with Python

So far, we have the basic elements to implement logistic regression. Next, we will use a set of sample data to complete the classification task. First, download and load the sample data. The dataset name is: `Logistic-Regression-Classification-data.csv`.

In [None]:
"""
Load dataset
"""

import pandas as pd

df = pd.read_csv("../input/week-3-dataset/Logistic-Regression-Classification-data.csv", header=0) # Load dataset
df.head() # Preview first 5 rows of data


As you can see, the dataset has two feature variables `X0` and `X1`, and a target value `Y`. Among them, the target value `Y` only contains `0` and `1` which is a typical binary classification problem. We try to plot the dataset into a graph and take a look at the distribution of the data:

In [None]:
"""
Plot the data distribution
"""

plt.figure(figsize=(10, 6))
plt.scatter(df['X0'],df['X1'], c=df['Y'])


Above, dark blue represents `0` and yellow represents `1`. Next, we use logistic regression to complete the classification task, that is, to find the parameters of the linear function in equation $(3)$.

For convenience, the logistic regression model, loss function, and gradient code from above are presented together. Now, let's use Python to implement a complete logistic regression process:

In [None]:
"""
Complete logistic regression process
"""

#Sigmoid distribution function
def sigmoid(z):
    sigmoid = 1 / (1 + np.exp(-z))
    return sigmoid

#Loss function
def loss(h, y):
    loss = (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    return loss

#Calculate the gradient
def gradient(X, h, y):
    gradient = np.dot(X.T, (h - y)) / y.shape[0]
    return gradient

#Logistic regression process
def Logistic_Regression(x, y, lr, num_iter):
    intercept = np.ones((x.shape[0], 1)) # Column of ones for the intercept term (x_0 = 1)
    x = np.concatenate((intercept, x), axis=1)
    w = np.zeros(x.shape[1]) # Initialize parameters as 0
    
    for i in range(num_iter): # Gradient descent iterations
        z = np.dot(x, w) # Linear function
        h = sigmoid(z) # Sigmoid function
        
        g = gradient(x, h, y) # Calculate the gradient
        w -= lr * g # Update the parameters: step against the gradient, scaled by lr
        
        z = np.dot(x, w) # Recompute the linear function with the updated parameters
        h = sigmoid(z) # Get Sigmoid value
        
        l = loss(h, y) # Get loss value
        
    return l, w # Return the final loss value and the parameters after iteration


Then we set the learning rate and the number of iterations to train the data:

In [None]:
"""
Set parameters and train
"""

x = df[['X0','X1']].values
y = df['Y'].values
lr = 0.001 # Learning rate
num_iter = 10000 # Iterations

#Train
L = Logistic_Regression(x, y, lr, num_iter)
L


Based on the weights we calculated, the classification boundary is the set of points where the following linear function equals zero:

$$y = L[1][0] + L[1][1]*x1 + L[1][2]*x2$$

<div style="color: #999;font-size: 12px;font-style: italic;">* $L[*][*]$ denotes the corresponding values selected from the $L$ array.</div>
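The learned weights can be turned into predictions and into an explicit formula for the boundary line. The sketch below uses made-up weights in place of `L[1]`, and the helpers `predict_proba_from_w` and `boundary_x2` are hypothetical names for this illustration. It assumes the last weight is non-zero so the boundary can be solved for the second feature.

```python
import numpy as np

def predict_proba_from_w(w, x):
    """Probability of class 1 for samples x, given weights [intercept, w1, w2]."""
    z = w[0] + x @ w[1:]
    return 1 / (1 + np.exp(-z))

def boundary_x2(w, x1):
    """Second feature on the boundary: solve w0 + w1*x1 + w2*x2 = 0 for x2."""
    return -(w[0] + w[1] * x1) / w[2]

# Made-up weights standing in for L[1] from the training step
w_demo = np.array([-1.0, 2.0, -4.0])

# Points on the boundary line
x1_pts = np.array([0.0, 1.0, 2.0])
x2_pts = boundary_x2(w_demo, x1_pts)
on_boundary = np.column_stack([x1_pts, x2_pts])
print(x2_pts)
print(predict_proba_from_w(w_demo, on_boundary))  # 0.5 for every boundary point

# Class predictions for two other samples, using 0.5 as the cut-off
samples = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = (predict_proba_from_w(w_demo, samples) > 0.5).astype(int)
print(labels)
```

Points on one side of the line get a probability above `0.5` and are assigned class `1`; points on the other side are assigned class `0`.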

With the classification boundary function, we can draw it on the original graph to see how the classification works. The following code uses Matplotlib to draw a contour line; you do not need to study it in detail:

In [None]:
"""
Plot the above results
"""

plt.figure(figsize=(10, 6))
plt.scatter(df['X0'],df['X1'], c=df['Y'])

x1_min, x1_max = df['X0'].min(), df['X0'].max()
x2_min, x2_max = df['X1'].min(), df['X1'].max()

xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]

probs = (np.dot(grid, np.array([L[1][1:3]]).T) + L[1][0]).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, levels=[0], linewidths=1, colors='red');


It can be seen that the red line, which represents the decision boundary obtained here, is a straight line. It follows the separation between the two classes reasonably well.

In addition to drawing the decision boundary, that is, the dividing line, we can also draw the process of updating the loss function to observe the execution of the gradient descent:

In [None]:
"""
Draw the changing process of loss function
"""

def Logistic_Regression(x, y, lr, num_iter):
    intercept = np.ones((x.shape[0], 1)) # Column of ones for the intercept term (x_0 = 1)
    x = np.concatenate((intercept, x), axis=1)
    w = np.zeros(x.shape[1]) # Initialize parameters as 0
    
    l_list = [] # Save loss function value
    for i in range(num_iter): # Gradient descent iterations
        z = np.dot(x, w) # Linear function
        h = sigmoid(z) # Sigmoid function
        
        g = gradient(x, h, y) # Calculate the gradient
        w -= lr * g # Update the parameters: step against the gradient, scaled by lr
        
        z = np.dot(x, w) # Recompute the linear function with the updated parameters
        h = sigmoid(z) # Get Sigmoid value
        
        l = loss(h, y) # Get loss value
        l_list.append(l)
        
    return l_list

lr = 0.01 # Learning rate
num_iter = 30000 # Iterations
l_y = Logistic_Regression(x, y, lr, num_iter) # Train

#Plot
plt.figure(figsize=(10, 6))
plt.plot([i for i in range(len(l_y))], l_y)
plt.xlabel("Number of iterations")
plt.ylabel("Loss function")


You will find that after about 20,000 iterations the loss becomes stable, which means it is close to its minimum value. You can try changing the learning rate and the number of iterations yourself.
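To get a feel for how the learning rate affects convergence, the sketch below runs the same gradient descent loop with three learning rates. It uses a small synthetic dataset (two Gaussian clusters generated with a fixed seed) instead of the course CSV, and `train_losses`, `X_syn` and `y_syn` are names made up for this illustration.

```python
import numpy as np

def sigmoid_fn(z):
    return 1 / (1 + np.exp(-z))

def log_loss_fn(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def train_losses(X, y, lr, num_iter):
    """Gradient descent as in the course code; returns the loss after each update."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    w = np.zeros(X_aug.shape[1])
    losses = []
    for _ in range(num_iter):
        h = sigmoid_fn(X_aug @ w)
        w -= lr * X_aug.T @ (h - y) / y.shape[0]      # update from equation (6)
        losses.append(log_loss_fn(sigmoid_fn(X_aug @ w), y))
    return losses

# Synthetic data: class 0 centred at (-1, -1), class 1 centred at (1, 1)
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(1.0, 1.0, size=(50, 2))])
y_syn = np.concatenate([np.zeros(50), np.ones(50)])

final_losses = {}
for lr_try in (0.001, 0.01, 0.1):
    losses = train_losses(X_syn, y_syn, lr_try, 2000)
    final_losses[lr_try] = losses[-1]
    print(f"lr={lr_try}: loss after 2000 iterations = {losses[-1]:.4f}")
```

With the same number of iterations, a smaller learning rate leaves the loss further from its minimum, so it needs more iterations to reach the same point.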

### Logistic Regression with scikit-learn

Above we have understood the principle of logistic regression and its Python implementation. This process is cumbersome, but it still makes sense. We highly recommend that you at least figure out 80% of the content in the section of principle. Next we introduce the logistic regression method in scikit-learn, which is much simpler.

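In scikit-learn, the class for implementing logistic regression and its parameters are shown below (default values differ between releases, so check the documentation for your installed version):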

```python
LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
```

We introduce some commonly used parameters, for the rest their default values may be used:<br>
<br>
- `penalty`: L2 by default. Used to specify the norm used in penalization.<br>
- `dual`: False by default. Dual or primal formulation. <br>
- `tol`: Tolerance for stopping criteria.<br>
- `fit_intercept`: Specifies if a constant should be added to the decision function.<br>
- `random_state`: The seed of the pseudo random number generator to use when shuffling the data.<br>
- `max_iter`: 100 by default. Maximum number of iterations taken for the solvers to converge.<br>
<br>
In addition, `solver` specifies the optimization algorithm used to minimize the loss function.

So, the code wherein we use scikit-learn to build a logistic regression classifier is as follows:

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(tol=0.001, max_iter=10000) # tol is the stopping tolerance; max_iter caps the number of solver iterations
model.fit(x, y)
model.coef_, model.intercept_
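One reason a hand-written gradient descent and scikit-learn can return different weights is that `LogisticRegression` applies an L2 penalty by default, with its strength controlled by `C` (a smaller `C` means a stronger penalty). The sketch below illustrates this on synthetic data rather than the course CSV. The data, the `C` values and the explicit `solver="lbfgs"` are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: class 0 centred at (-1, -1), class 1 centred at (1, 1)
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(1.0, 1.0, size=(50, 2))])
y_syn = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

coef_norms = {}
for C_try in (0.01, 1.0, 1e6):
    # Smaller C = stronger L2 penalty; a very large C is close to no penalty
    clf = LogisticRegression(C=C_try, solver="lbfgs", max_iter=10000)
    clf.fit(X_syn, y_syn)
    coef_norms[C_try] = np.linalg.norm(clf.coef_)
    print(f"C={C_try}: coef={clf.coef_.ravel()}, intercept={clf.intercept_}")
```

The stronger the penalty, the more the coefficients are shrunk toward zero, so an unpenalized gradient descent should be compared against a fit with a very large `C`.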


You may find that the parameters obtained differ from those of the Python implementation above. This is because the solvers are different and because scikit-learn applies L2 regularization by default (see `penalty` and `C` above). Similarly, we can draw the resulting classification boundary on a graph:

In [None]:
"""
Plot a graph
"""

plt.figure(figsize=(10, 6))
plt.scatter(df['X0'],df['X1'], c=df['Y'])

x1_min, x1_max = df['X0'].min(), df['X0'].max()
x2_min, x2_max = df['X1'].min(), df['X1'].max()

xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max), np.linspace(x2_min, x2_max))
grid = np.c_[xx1.ravel(), xx2.ravel()]

probs = (np.dot(grid, model.coef_.T) + model.intercept_).reshape(xx1.shape)
plt.contour(xx1, xx2, probs, levels=[0], linewidths=1, colors='red');


Finally, we can look at the classification accuracy of the model on the training set:

In [None]:
model.score(x, y)


So, back to the question from the beginning of the course: what is the relationship between logistic regression and linear regression? By now you probably have your own answer.<br>
<br>
A common view is that the word "logistic" refers to the **logistic distribution** and also suggests the logic of _right_ and _wrong_, $0$ and $1$, symbolizing the binary classification problem. The word "regression" comes from **linear regression**: we construct linear classification boundaries through linear functions to obtain the classification results.<br>
<br>
What do you think?

## Summary

In this chapter, we learned a classification method called logistic regression. Logistic regression is a very common and practical binary classification method that is often used in practical problems such as spam detection. Logistic regression can also be extended to multi-class classification problems; but since we are going to learn other methods that are more commonly used for those problems, we will not explain that here. The knowledge points you need to master first are:<br>
<br>
- The relationship between logistic regression and linear regression<br>
- Logistic regression model<br>
- Logarithmic loss function<br>
- Gradient descent

---

<div style="color: #999;font-size: 12px;font-style: italic;">* Congratulations! You've completed the Supervised Learning: Classification.</div>