# Logistic Regression

---

**Logistic regression** is a supervised machine learning algorithm that estimates the probability of an outcome, most commonly a binary one. A logistic regression algorithm consists of three parts: **Estimation, Loss Function Evaluation,** and **Optimization** (also referred to as **Training**).

### Estimation
Similarly to **linear regression**, logistic regression uses a weight vector $w$ and a bias $b$ to evaluate $z$ for an input $x$, which is found through the equation:
$$z=w^Tx+b$$ 

The values of $z$ are then fed into $\sigma$, otherwise known as the **sigmoid function**, in order to obtain a prediction for the value of $y$, also referred to as $\hat y$. The sigmoid function is shown by the following equation:
$$\hat y=\sigma(z)=\frac{1}{1+e^{-z}}=\frac{1}{1+e^{-(w^Tx+b)}}$$

For certain applications, such as the one we will be evaluating in this demonstration, $y$ is a binary value with $y=1$ indicating success and $y=0$ indicating failure. In these situations, $\hat y$ is a value between 0 and 1 that is rounded to a prediction: $\hat y \geq 0.5$ predicts success and $\hat y < 0.5$ predicts failure for the provided data.
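
As a minimal sketch of the estimation step, the Julia snippet below computes $z$, applies the sigmoid, and rounds the result at 0.5. The weights, bias, and input are hypothetical values chosen only for illustration; they are not taken from the dataset used later.

```julia
using LinearAlgebra  # for dot

# Sigmoid function
σ(z) = 1 / (1 + exp(-z))

# Hypothetical weights, bias, and input (illustration only)
w = [0.5, -0.25]
b = 0.1
x = [2.0, 1.0]

z = dot(w, x) + b              # z = wᵀx + b
y_hat = σ(z)                   # estimated probability that y = 1
label = y_hat >= 0.5 ? 1 : 0   # round to a binary prediction

println("z = ", z, ", y_hat = ", y_hat, ", predicted label = ", label)
```

Because $z$ is positive here, the sigmoid returns a value above 0.5 and the predicted label is 1.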


### Loss Function Evaluation
Once we have estimated values, we need to measure how good they are before we can improve them. We do this with a **loss function**, which tells us how close our estimate, $\hat y$, is to the actual value for the data, $y$.

**Loss function:** $L(\hat y, y)$ measures how close $\hat y$ is to $y$. For a binary label, the probability the model assigns to the observed label is
$$P(y|x)=\hat y^{\,y}(1-\hat y)^{1-y}$$
Taking the logarithm gives
$$\log P(y|x)=y\log(\hat y)+(1-y)\log(1 - \hat y)$$
**Cross Entropy Loss Function:** negating the log-probability turns maximizing the probability into minimizing a loss:
$$L_{CE}(\hat y, y)=-[y\log(\hat y)+(1-y)\log(1 - \hat y)]$$
Since $\hat y=\sigma (z)$:
$$L_{CE}(\sigma (z), y)=-[y\log(\sigma (z))+(1-y)\log(1 - \sigma (z))]$$
$$L_{CE}(w, b)=-[y\log(\sigma (w^Tx+b))+(1-y)\log(1 - \sigma (w^Tx+b))]$$
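
To make the behaviour of the cross entropy loss concrete, the sketch below evaluates it for a few hypothetical predictions of a point whose true label is $y=1$. The helper name `cross_entropy` and the probabilities are assumptions made for illustration.

```julia
# Cross entropy loss for a single prediction y_hat and true label y
cross_entropy(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

# Hypothetical predictions for a point whose true label is y = 1
confident_right = cross_entropy(0.9, 1)   # -log(0.9), a small loss
unsure          = cross_entropy(0.5, 1)   # -log(0.5) = log(2)
confident_wrong = cross_entropy(0.1, 1)   # -log(0.1), a large loss

println(confident_right, " < ", unsure, " < ", confident_wrong)
```

The loss grows as the prediction moves away from the true label. The middle case, $\hat y=0.5$, gives $\log 2 \approx 0.693$, which is the cost of a model whose weights and bias are all zero.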


### Optimization or Training
For our optimization method, we will be using **gradient descent** to minimize the value of the cross entropy loss function. We use the batch form of gradient descent, which means each update trains our model using the entire dataset. The first thing we need to do is take the partial derivatives of the loss with respect to each weight $w_j$ and the bias $b$, which gives us the following:

$$\frac{\partial L_{CE}}{\partial w_j} = [\sigma (w^Tx+b)-y]x_j$$
$$\frac{\partial L_{CE}}{\partial b} = \sigma (w^Tx+b)-y$$
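
These expressions follow from the chain rule. A short derivation, writing $z=w^Tx+b$ and using the standard identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$:

$$\frac{\partial L_{CE}}{\partial z}=-\left[\frac{y}{\sigma(z)}-\frac{1-y}{1-\sigma(z)}\right]\sigma(z)(1-\sigma(z))=-[y(1-\sigma(z))-(1-y)\sigma(z)]=\sigma(z)-y$$
$$\frac{\partial L_{CE}}{\partial w_j}=\frac{\partial L_{CE}}{\partial z}\frac{\partial z}{\partial w_j}=[\sigma(z)-y]x_j, \qquad \frac{\partial L_{CE}}{\partial b}=\frac{\partial L_{CE}}{\partial z}\frac{\partial z}{\partial b}=\sigma(z)-y$$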

These partial derivatives then allow us to create our update equations. Because we are minimizing the loss, each step moves against the gradient, which gives us the following equations for a given point $(x^i,y^i)$ and step size $\alpha$:
$$w^{k+1}_j=w^k_j-\alpha \frac{\partial L_{CE}}{\partial w_j}(w^k, b^k) = w^k_j-\alpha [\sigma ((w^k)^Tx^i+b^k)-y^i]x^i_j$$

$$b^{k+1} = b^k-\alpha \frac{\partial L_{CE}}{\partial b}(w^k, b^k) = b^k-\alpha [\sigma ((w^k)^Tx^i+b^k)-y^i]$$

By iterating the above equations we will be able to minimize the values of the cross entropy loss function and train our model to more accurately predict the result based on the given data.
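
A minimal sketch of this iteration on a single point with one scalar feature, stepping against the gradient each time. The function names `loss` and `descend`, the data point, the starting values, and the step size are all hypothetical and chosen only to show the loss shrinking.

```julia
σ(z) = 1 / (1 + exp(-z))

# Cross entropy loss for one point with a scalar feature
loss(x, y, w, b) = -y * log(σ(w * x + b)) - (1 - y) * log(1 - σ(w * x + b))

# Repeatedly step against the gradient for a single point (x, y)
function descend(x, y, w, b, α, iterations)
    for _ in 1:iterations
        residual = σ(w * x + b) - y   # shared factor of both partial derivatives
        w -= α * residual * x
        b -= α * residual
    end
    return w, b
end

# Hypothetical point, starting values, and step size (illustration only)
x, y = 2.0, 1
w0, b0, α = 0.0, 0.0, 0.1

w1, b1 = descend(x, y, w0, b0, α, 10)
println("Loss before: ", loss(x, y, w0, b0))
println("Loss after:  ", loss(x, y, w1, b1))
```

Since the label is 1 and the feature is positive, each step increases both the weight and the bias, so the loss after ten steps is lower than the starting loss.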

---

### Implementation

For this demonstration of logistic regression we will be using sample student data in order to predict whether a student will be admitted into a university. First things first, let's load up and look at our data.

In [2]:
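# Load necessary packages
using CSV, DataFrames

# Read in & look at data structure
# (recent CSV.jl versions require the output type, here DataFrame, as the second argument)
data = CSV.read("candidates_data (1).csv", DataFrame)
data[1:5,:]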

| Row | gmat (Int64) | gpa (Float64) | work_experience (Int64) | admitted (Int64) |
|---|---|---|---|---|
| 1 | 780 | 4.0 | 3 | 1 |
| 2 | 750 | 3.9 | 4 | 1 |
| 3 | 690 | 3.3 | 3 | 0 |
| 4 | 710 | 3.7 | 5 | 1 |
| 5 | 680 | 3.9 | 4 | 0 |


The data consists of four variables: *gmat*, *gpa*, *work_experience*, and *admitted*. We need to separate this data into **feature data**, which will be used to train our model, and **label data**, which is the outcome that we are trying to predict. Since we are trying to predict whether the student will be admitted into the university, our label data is the *admitted* column. This leaves *gmat*, *gpa*, and *work_experience* as our feature data.

---

In [23]:
# Separate data:
# Create array of feature data
feature_data = [[x[1], x[2], x[3]] for x in zip(data.gmat, data.gpa, data.work_experience)]

# Create array of label data
label_data = [x for x in data.admitted];

In [24]:
σ(x) = 1/(1+exp(-x))

# Cross entropy loss function to determine accuracy of predictions
function cross_entropy_loss(x, y, w, b)
    return -y*log(σ(w'x +b)) -(1-y)*log(1 - σ(w'x+b))
end

# Cost function to measure error of weights & biases
function average_cost(features, labels, w, b)
    N = length(features)
    return (1/N)*sum((cross_entropy_loss(features[i], labels[i], w, b) for i = 1:N))
end

average_cost (generic function with 1 method)

---
There are two types of gradient descent, **batch** and **stochastic**. **Batch gradient descent** uses the entire dataset for every update, while **stochastic gradient descent** updates the model using a randomly selected point (or small subset of points) each iteration. Batch gradient descent typically needs fewer iterations to converge, but each of its iterations is more computationally expensive than a stochastic one. Our dataset is small enough that this is not a concern, so we will be using the batch method.
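
For contrast with the batch method used in this demonstration, here is a minimal sketch of what a single stochastic update could look like. The function name `stochastic_gradient_step`, its `rng` keyword, and the toy data are assumptions for illustration; this function is not used anywhere else in this notebook.

```julia
using Random

σ(z) = 1 / (1 + exp(-z))

# One stochastic update: pick a single random point and step against its gradient
function stochastic_gradient_step(features, labels, w, b, α; rng = Random.default_rng())
    i = rand(rng, 1:length(features))             # randomly selected training point
    residual = σ(w' * features[i] + b) - labels[i]
    return w - α * residual * features[i], b - α * residual
end

# Hypothetical one-point dataset (illustration only)
toy_features = [[1.0, 2.0]]
toy_labels = [1]

w_new, b_new = stochastic_gradient_step(toy_features, toy_labels, [0.0, 0.0], 0.0, 0.1)
println("Updated weights: ", w_new, ", updated bias: ", b_new)
```

Each call touches only one point, which is why a stochastic iteration is cheaper than a batch iteration that loops over the whole dataset.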

---

In [25]:
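function batch_gradient_descent(features, labels, w, b, α)
    # Gradient accumulators. The gradient is summed (not averaged) over the
    # dataset, so the 1/N factor is effectively folded into the step size α.
    del_w = zeros(length(w))
    del_b = 0.0

    for i in eachindex(features)
        # σ(wᵀx + b) - y is the shared factor of both partial derivatives
        residual = σ(w'features[i] + b) - labels[i]
        del_w += residual*features[i]
        del_b += residual
    end

    # Step against the gradient
    w = w - α*del_w
    b = b - α*del_b

    return w, b
end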

batch_gradient_descent (generic function with 1 method)

---
Now that our gradient descent function is complete, we can set our starting weights and bias along with a step size to run a couple test iterations to be sure our function is decreasing the cost of our model as it iterates.
        
---

In [44]:
# Set starting weights and biases
w = [0.0, 0.0, 0.0]
b = 0.0
println("Starting cost: ", average_cost(feature_data, label_data, w, b))

# This step size is too big!
w, b = batch_gradient_descent(feature_data, label_data, w, b, 0.00001)
println("\nAfter 1 iteration at α=0.00001: ", average_cost(feature_data, label_data, w, b))

w = [0.0, 0.0, 0.0]
b = 0.0
# This is better, but let's try a little bigger
w, b = batch_gradient_descent(feature_data, label_data, w, b, 0.0000001)
println("\nAfter 1 iteration at α=0.0000001: ", average_cost(feature_data, label_data, w, b))

Starting cost: 0.6931471805599446

After 1 iteration at α=0.00001: 0.7652556019981452

After 1 iteration at α=0.0000001: 0.6931177157407157


---
Now that we've tested some step sizes and found what we want to go with, we can create a function to iterate our gradient descent function as many times as we want. One measure we can implement to verify that our descent algorithm is decreasing the cost throughout all the iterations is to include checkpoints that print out the cost after a set number of iterations. Once we have iterated the descent function the desired number of times, we will output the final weights and bias for our regression function.

---

In [56]:
function batch_train(features, labels, w, b, α, iter)
    # Iterations at which to print the cost as a progress check
    checkpoints = (1, 100, 10000, 100000, 1000000)
    for i = 1:iter
        # Use the function arguments rather than the global data
        w, b = batch_gradient_descent(features, labels, w, b, α)
        if i in checkpoints
            println("Cost at iteration ", i, ": ", average_cost(features, labels, w, b))
        end
    end
    println("\nFinal Weights: ", w)
    println("Final Bias: ", b)
    return w, b
end

# Train our model using 1,000,000 iterations 
w, b = batch_train(feature_data, label_data, w, b, 0.0000004, 1000000)

Cost at iteration 1: 0.6931066511441942
Cost at iteration 100: 0.6926384571579437
Cost at iteration 10000: 0.6510831083097195
Cost at iteration 100000: 0.4899247448010744
Cost at iteration 1000000: 0.3980123186961089

Final Weights: [-0.009233111010187943, 0.8564526529532422, 1.0719001573543308]
Final Bias: -0.35748882940581317


([-0.009233111010187943, 0.8564526529532422, 1.0719001573543308], -0.35748882940581317)

---
Now that our model is trained, we can create a function to test it by having it predict whether a student will be accepted into the university based on their *GPA*, *GMAT score*, and *work experience*. Once we have our predictions, we can compare them to the actual labels and determine the accuracy of our trained model.

---

In [59]:
# Predict acceptance: 1 if the estimated probability is at least 0.5, else 0
function predict(x, w, b)
    return σ(w'x + b) >= 0.5 ? 1 : 0
end

# Compare predictions to the actual acceptance data to find the accuracy
function model_accuracy(x, y, w, b)
    # For 0/1 labels, the squared difference is 1 for a wrong prediction and 0 otherwise
    errors = 0
    for i in 1:length(x)
        errors += (predict(x[i], w, b) - y[i])^2
    end
    println("Model Accuracy: ", (1 - errors/length(x))*100, "%")
end

model_accuracy(feature_data, label_data, w, b)

Model Accuracy: 82.5%


---
After training our regression model via 1,000,000 iterations of gradient descent optimization, we were able to achieve 82.5% accuracy when predicting the acceptance of students at a particular university from their *GPA*, *GMAT scores*, and *work experience*. Note that this accuracy is measured on the same data the model was trained on.
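
As a usage sketch, the final weights and bias printed by the training run can be used to score a new applicant. The applicant below is hypothetical, and the variable names are assumptions made for illustration.

```julia
σ(z) = 1 / (1 + exp(-z))

# Final weights and bias as printed by the training run
w = [-0.009233111010187943, 0.8564526529532422, 1.0719001573543308]
b = -0.35748882940581317

# Hypothetical applicant: [gmat, gpa, work_experience]
applicant = [700, 3.5, 4]

probability = σ(w' * applicant + b)   # estimated probability of admission
admitted = probability >= 0.5 ? 1 : 0

println("Estimated admission probability: ", round(probability, digits = 3))
println("Predicted label: ", admitted)
```

For this applicant the estimated probability is above 0.5, so the model predicts admission.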

---