
# Logistic Regression


This is a tutorial on implementing a logistic regression algorithm from scratch. We'll use the iris dataset since it's a simple dataset to test our algorithm on. We've used elements of a tutorial from Guowei Wei's Machine Learning class at MSU, but we've added a slight mathematical twist to our work rather than just trying to fit dimensions together.

#Importing Data for Usage

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
# Make our data set into a binary classification problem:
# class 0 stays 0, and classes 1 and 2 both become 1.
y = (iris.target != 0).astype(int)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
scaler = StandardScaler()  # create the scaler object
scaler.fit(X_train)   # compute each feature's mean and standard deviation on the training set only
X_train_norm = scaler.transform(X_train)  # standardize X_train
X_test_norm = scaler.transform(X_test)    # standardize X_test with the training statistics

In [None]:
print(X_train_norm.shape)
print(X_test_norm.shape)
print(y_train.shape)
print(y_test.shape)

(100, 2)
(50, 2)
(100,)
(50,)


#Sigmoid Function

In [None]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

#Accuracy

In [None]:
def accuracy(predicted, actual):
    diff = predicted - actual
    return 1.0 - (float(np.count_nonzero(diff)) / len(diff))

#Forward Propagation

The structure of our network is the following:
$$X \xrightarrow{\text{linear map}} f \xrightarrow{\text{sigmoid}} \hat{y}$$
Assume that our dataset has $N$ datapoints and $D$ features. Then our matrix $X$ is $N \times D$, our weight matrix $W$ is $D \times 1$, our bias $b$ is a scalar added to every entry of $f = XW + b$, and our output $\hat{y}$ is $N \times 1$.

In [None]:
def forward(X, W, b):
  f = np.dot(X,W) + b
  y_hat = sigmoid(f)
  return y_hat
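
As a quick sanity check on the shapes described above, here's a small sketch with made-up sizes ($N = 5$, $D = 2$) and random data. The `_demo` names are ours and purely illustrative. One thing to note: the training code later in this tutorial stores $W$ as a 1-D array of length $D$ rather than a $D \times 1$ matrix, so the output comes back with shape `(N,)`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(X, W, b):
    # Linear map followed by the sigmoid, as in the tutorial.
    return sigmoid(np.dot(X, W) + b)

# Hypothetical toy sizes, chosen only for illustration.
N, D = 5, 2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(N, D))
W_demo = np.zeros(D)   # 1-D weights of length D
b_demo = 0.0

y_hat_demo = forward(X_demo, W_demo, b_demo)
print(y_hat_demo.shape)  # (5,)
print(y_hat_demo)        # every entry is 0.5, since sigmoid(0) = 0.5
```

With all-zero weights and bias, every prediction is exactly 0.5, which is also what the model outputs at the very start of training.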

#Backpropagation

Let's go through the backpropagation process now. As usual, we're going to take derivatives of the loss $L$ (the cross entropy loss, written out in the Loss Function section below) with respect to our weights. Let's start with the simple case of finding the partial derivatives with respect to each component of $\hat{y}$. Let $1 \leq m \leq N$. Only the $n = m$ term of the sum depends on $\hat{y}_m$, so
\begin{align*}
\frac{\partial L}{\partial \hat{y}_m} &=  -\frac{1}{N} \sum_{n=1}^N \left[ y_n \frac{\partial}{\partial \hat{y}_m}\log(\hat{y}_n) + (1-y_n)\frac{\partial}{\partial \hat{y}_m}\log(1-\hat{y}_n) \right] \\
&= -\frac{1}{N}\left( \frac{y_m}{\hat{y}_m} - \frac{1-y_m}{1-\hat{y}_m}\right) \\
&= \frac{1}{N}\left(\frac{1-y_m}{1-\hat{y}_m} -\frac{y_m}{\hat{y}_m}\right).
\end{align*}
By the chain rule, we have
$$\frac{\partial L}{\partial f_i} = \sum_{m=1}^N \frac{\partial L}{\partial \hat{y}_m} \frac{\partial \hat{y}_m}{\partial f_i}.$$
Now, notice that
$$
\frac{\partial \hat{y}_m}{\partial f_i} = 
\begin{cases}
\hat{y}_m(1-\hat{y}_m), & m = i, \\
0, & m \neq i.
\end{cases}$$
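In case the first case isn't obvious, here's a quick sketch of where it comes from. Writing $\hat{y}_m = \sigma(f_m)$ with $\sigma(f) = 1/(1+e^{-f})$, we have
\begin{align*}
\frac{d\sigma}{df} &= \frac{e^{-f}}{(1+e^{-f})^2} \\
&= \frac{1}{1+e^{-f}} \cdot \frac{e^{-f}}{1+e^{-f}} \\
&= \sigma(f)\left(1-\sigma(f)\right).
\end{align*}
The second case holds because $\hat{y}_m$ depends only on $f_m$ and not on any other $f_i$.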
When we substitute this in, only the $m = i$ term of the sum survives, and we get
$$\frac{\partial L}{\partial f_i} = \frac{1}{N}\left(\frac{1-y_i}{1-\hat{y}_i} -\frac{y_i}{\hat{y}_i}\right) \hat{y}_i(1-\hat{y}_i) = \frac{1}{N}\left[(1-y_i)\hat{y}_i - y_i(1-\hat{y}_i)\right] = \frac{1}{N}(\hat{y}_i - y_i).$$
Okay, final step now. Let's find
$$\frac{\partial L}{\partial W_i} =  \sum_{m=1}^N \frac{\partial L}{\partial f_m} \frac{\partial f_m}{\partial W_i}.$$
Write
$$f_m = \sum_{n=1}^D x_{m,n} W_n + b.$$
Then, term by term,
$$
\frac{\partial}{\partial W_i}\left(x_{m,n} W_n\right) = 
\begin{cases}
x_{m,i}, & n = i, \\
0, & n \neq i,
\end{cases}$$
so $\frac{\partial f_m}{\partial W_i} = x_{m,i}$. When we substitute this in, we get
$$\Delta W_i = \frac{\partial L}{\partial W_i} = \frac{1}{N}\sum_{m=1}^Nx_{m,i}(\hat{y}_m - y_m).$$
Let's now put these into a vector as $\Delta W = \left(L_{W_1}, L_{W_2}, \ldots, L_{W_D}\right)^{T}$, where $L_{W_i} = \frac{\partial L}{\partial W_i}$. To finish up, we see that we can vectorize this as
$$\Delta W = \frac{1}{N}X^{T}(\hat{y}- y).$$
Performing a similar calculation for the derivative with respect to the bias, and using $\frac{\partial f_m}{\partial b} = 1$, we get
$$\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{m=1}^N(\hat{y}_m - y_m).$$
Remember that our update method is 
  $$W_{new} = W_{old} - \beta\Delta W,$$
  $$b_{new} = b_{old} - \beta\dfrac{\partial{L}}{\partial{b}},$$
where $\beta$ is our learning rate. See the code below for the implementation.


In [None]:
def backpropagation(X, W, b, y_hat, y, beta):
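    # y_hat is already the sigmoid output of forward(), so don't apply sigmoid again.
    d = (y_hat - y) / X.shape[0]   # dL/df = (y_hat - y) / N
    dW = np.dot(X.T, d)            # dL/dW = X^T (y_hat - y) / N
    db = np.sum(d)                 # dL/db = sum(y_hat - y) / N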
    W = W - beta * dW
    b = b - beta * db
    return W, b
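
One way to double-check a derivation like this is a finite-difference gradient check: nudge each parameter a little, see how the loss changes, and compare against the analytic gradients $\frac{1}{N}X^{T}(\hat{y}-y)$ and $\frac{1}{N}\sum_m(\hat{y}_m - y_m)$. The sketch below does this on random toy data. The helper names (`loss_fn`, `analytic_grads`, `numeric_grads`), the `_chk` variables, and the step size `eps` are our own choices for illustration, not part of the original tutorial.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss_fn(X, W, b, y):
    # Cross entropy loss averaged over the N datapoints.
    y_hat = sigmoid(np.dot(X, W) + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grads(X, W, b, y):
    # Gradients from the derivation: X^T (y_hat - y) / N and sum(y_hat - y) / N.
    d = (sigmoid(np.dot(X, W) + b) - y) / X.shape[0]
    return np.dot(X.T, d), np.sum(d)

def numeric_grads(X, W, b, y, eps=1e-6):
    # Central differences, one parameter at a time.
    dW = np.zeros_like(W)
    for i in range(W.size):
        step = np.zeros_like(W)
        step[i] = eps
        dW[i] = (loss_fn(X, W + step, b, y) - loss_fn(X, W - step, b, y)) / (2 * eps)
    db = (loss_fn(X, W, b + eps, y) - loss_fn(X, W, b - eps, y)) / (2 * eps)
    return dW, db

# Hypothetical toy data: 20 points, 2 features, random binary labels.
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 2))
y_chk = (rng.random(20) < 0.5).astype(float)
W_chk = rng.normal(size=2)
b_chk = 0.1

dW_a, db_a = analytic_grads(X_chk, W_chk, b_chk, y_chk)
dW_n, db_n = numeric_grads(X_chk, W_chk, b_chk, y_chk)
print(np.max(np.abs(dW_a - dW_n)), abs(db_a - db_n))
```

Both printed differences should be tiny. If they aren't, something is off in either the derivation or the code.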

#Loss Function

Our loss function will be the same as usual for classification, the cross entropy loss: 
$$L(\hat{y}) = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \log(\hat{y}_n) + (1-y_n)\log(1-\hat{y}_n) \right].$$
Pretty easy to handle for the most part. In this case, we won't add a regularization parameter to prevent overfitting. This can be easily added by modifying two or three lines of code.

In [None]:
def print_loss(y, y_hat):
    # Mean cross entropy loss over all datapoints.
    loss = -1/y.shape[0] * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1-y_hat))
    print(loss)
    return 
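
As a rough sketch of the regularization mentioned above, here's one way it could look with an L2 penalty $\frac{\lambda}{2}\lVert W \rVert^2$. This is an assumption on our part: the tutorial doesn't fix a particular penalty, and the names `regularized_loss`, `regularized_step`, and `lam` are made up for illustration. We've also clipped `y_hat` away from 0 and 1 so that the logarithms stay finite.

```python
import numpy as np

def regularized_loss(y, y_hat, W, lam, eps=1e-12):
    # Clip so that log(0) can never occur.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Assumed L2 penalty on the weights only (the bias is left alone).
    return ce + 0.5 * lam * np.sum(W ** 2)

def regularized_step(X, W, b, y_hat, y, beta, lam):
    # Same gradients as the unregularized case, plus lam * W from the penalty.
    d = (y_hat - y) / X.shape[0]
    dW = np.dot(X.T, d) + lam * W
    db = np.sum(d)
    return W - beta * dW, b - beta * db
```

Setting `lam = 0` recovers the unregularized loss and update, so the change really is confined to a couple of lines.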

#Making A Prediction

In [None]:
def predict(X, W, b):
    y_pred = sigmoid(np.dot(X, W) + b)
    # Threshold the probabilities at 0.5 to get class labels.
    y_pred = np.where(y_pred < 0.5, 0.0, 1.0)
    return y_pred

#Putting Everything Together

In [1]:
W = np.zeros(X_train_norm.shape[1], dtype = 'float')  # one weight per feature
b = 0.0
beta = 0.001  # learning rate
iterations = 10000
for n in range(iterations):
    y_hat = forward(X_train_norm, W, b)
    #print_loss(y_train, y_hat)
    W, b = backpropagation(X_train_norm, W, b, y_hat, y_train, beta)
y_pred = predict(X_test_norm, W, b)
acc = accuracy(y_pred, y_test)
print(y_test)
print(y_pred)
print('The accuracy of this model is: %f percent' %(acc*100))


Okay, not that bad, but not perfect. What about sklearn?

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])
clf.predict_proba(X[:2, :])
clf.score(X, y)

1.0

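sklearn gets a perfect score, but keep in mind that this cell fits and scores on the full, unscaled dataset rather than on our train/test split, so the two numbers aren't directly comparable. Still, it looks like we can improve our model slightly with some parameter tuning, like adding a regularization term.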
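
For a closer comparison, here's a sketch that trains sklearn's `LogisticRegression` on the same kind of split we used for our own model: the first two iris features, binary labels, a 33% test split with `random_state=42`, and standardization fit on the training set only. The `_cmp` names and `test_acc` are ours. Note that sklearn's model applies L2 regularization by default, so this still isn't exactly the same model as ours.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_cmp = iris.data[:, :2]                # first two features only
y_cmp = (iris.target != 0).astype(int)  # class 0 vs. everything else

X_tr, X_te, y_tr, y_te = train_test_split(
    X_cmp, y_cmp, test_size=0.33, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets.
scaler_cmp = StandardScaler().fit(X_tr)
clf_cmp = LogisticRegression(random_state=0).fit(scaler_cmp.transform(X_tr), y_tr)

# Accuracy on the held-out test set.
test_acc = clf_cmp.score(scaler_cmp.transform(X_te), y_te)
print('sklearn test accuracy: %f percent' % (test_acc * 100))
```

This number can be compared against the test accuracy of our from-scratch model, since both are measured on held-out data.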