---

<center><font size="6"><b> Logistic Regression Algorithm </b></font></center>
 
<br>
<center><font size="5"><b> Jade Gee  </b></font></center>

---
---

## Introduction

<font size ="3"> Logistic Regression Model is an supervised machine learning algorithm that uses a logistic function to model a binary dependent variable. The probability of an event---i.e. pass/fail, win/lose, etc.---is what is modeled. 

Logistic Regression is a form of binary regression, where the model has a dependent variable that is categorical (it has only two (2) possible values), `0` or `1`; and the logarithm of the odds (log-odds) for the value labeled `1` is a linear combination of at least one independent variables or predictors. From this, our logistic function takes the log-odds and converts it into probability. The factor that defines the logistic model is that one of the independent variables multiplicatively scales the odds of the given outcome at a *constant* rate. Each independent variable has their own parameter which, for a binary dependent variable, generalizes the odds ratio.

**NOTE:** Logistic regression algorithm is not a classifier as it does not perform classification of the data, but it can be used to make a classifier.

Although this algorithm has various types, such as

+ **Binary Logistic Regression**

+ **Multinomial Logistic Regression**, and

+ **Ordinal Logistic Regression**

for the purpose of this notebook, we are going to focus on the first type of logistic regression: binary.

To begin we will take a look at the data set, `Admission_Predict_Ver1.1`, which can be located [here](https://www.kaggle.com/yameenajani/admission). From this data set, we will take a look at the following information </font>

<font size="4">$$
\begin{align}
(x^1, y^1) \text{, ... , } (x^n, y^n); where& \\
&x^{(i)} = 
\begin{bmatrix}
\text{student; GRE score} \\
\text{student; TOEFL score} \\
\text{student; CGPA}
\end{bmatrix}\\
&y^{(i)} \in \{0, 1 \} \text{  where } 1 \text{ indicates Chance of Admit; } 0 \text{ if not}
\end{align}
$$</font>

### Logistic Regression Formula

![Logistic Regression](LogR.png)

<font size ="3"> This translates to the following formula

<font size="5">$$
\begin{align}
\hat y^{(i)} = \sigma (w^T x^{(i)} + b)
\end{align}
$$</font>

With this formula at hand, we will proceed to gather our data in preparation for the implementation of our binary logistic regression algorithm. </font>

---
---

## Gather the Data

<font size ="3"> In order to begin, we will need to import the following libraries:
    
+ `Random`
    - To create a random subset of our data
    
+ `CSV` and `DataFrames`
    - To import our data set as a data frame </font>

In [10]:
using CSV
using DataFrames
using Random

In [2]:
data = CSV.read("Admission_Predict_Ver1.1.csv", DataFrame);

x_data = [[x[1], x[2], x[3]] for x in zip(data.GRE Score, data.TOEFL Score, data.CGPA)];
y_data = [x for x in data.Chance of Admit];

<font size ="3"> Now, we need to split the data into a training subset and a testing subset. In order to do this we will use `randsubseq` from the `Random` library to randomly select data points from the data set that have a probability of 0.5. These selected data points will be designated as the training data; and, all of the points in the original data, `data`, that are not in `training_data` will become the testing data, `test_data`. </font>

In [172]:
# Split Data into a Training set and a Testing Set
# Randomly select data points from the full data set
# to make a training data set

full_data = [x for x in zip(x_data, y_data)];

# Randomly selects the data points from the original data
training_data = randsubseq(full_data, 0.5);

# Takes all points in the original data that are not in 
# the training data and stores them as the test data
test_data = [x for x in full_data if x ∉ training_data];

# Assigns the first column to the x-values
training_x = [x[1] for x in training_data];
test_x = [x[1] for x in full_data if x ∉ training_data];

# Assigns the second column to the y-values
training_y = [x[2] for x in training_data];
test_y = [x[2] for x in full_data if x ∉ training_data];

---
---

## Implement the Algorithm

<font size ="3"> In order to implement the algorithm, we will need to create several functions to compute the following:</font>

### Cross-Entropy Loss

<font size ="3">$$
\begin{align}
L_{CE}(\hat y^{(i)}, y^{(i)}) = \hat y^{(i)} + (1-y^{(i)}) log(1-y^{(i)})
\end{align}
$$</font>

---

### Average Loss

<font size ="3">$$
\begin{align}
Cost(w,b) = \frac{1}{N} \sum_{i=1}^N \hat y^{(i)} + (1-y^{(i)}) log(1-y^{(i)})
\end{align}
$$</font>

---

### Gradient of Cost Function --- Weights

<font size ="3">$$
\frac {\partial L_{CE}(w,b)}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} \left[\sigma (w^T x^{(i)} + b) - y\right]x^{(i)}_{j}
$$</font>

---

### Bias

<font size ="3">$$
\frac {\partial L_{CE}(w,b)}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left[\sigma (w^T x^{(i)} + b) - y\right]
$$</font>

<font size ="3"> As such, the following functions have been created to calculate the above formulas for the cross-entropy loss, average loss, gradient of the cost function or weights, and bias. </font>

In [149]:
"""
Parameters:
     x:    x values in the data

This function takes in x-values and calculates
the value of σ.

Returns: 
    The value of σ.
"""
σ(x) = 1/(1 + exp(-x))

#######################################################################################

"""
Parameters:
     x:    x values in the data
     y:    y values in the data
     w:    weight
     b:    bias

This function takes in x-values, y-values, weight,
and bias and calculates the cross entropy loss.

Returns: 
    The cross entropy loss.
"""
function cross_entropy_loss(x, y, w, b)
    return -y * log(σ(w'x + b)) - (1-y) * log(1 - σ(w'x + b))
end

#######################################################################################

"""
Parameters:
     x:    x values in the data
     y:    y values in the data
     w:    weight
     b:    bias

This function takes in x-values, y-values, weight,
and bias and calculates the average loss for the data.

Returns: 
    The average loss for the data set.
"""
function avg_loss(x, y, w, b)
    N = length(x)
    return (1/N) * sum([cross_entropy_loss(x[i], y[i], w, b) for i = 1:N])
end

#######################################################################################

"""
Parameters:
     features:  x values in the data
     labels:    y values in the data
     w:         weight
     b:         bias
     α:         step length

Calculates the new weight and bias and updates them.

Returns:
    The new weight and bias generated after the derivative change.
"""
function logit_batch_gradient_descent(features, labels, w, b, α)
    
    del_w = [0.0 for i = 1:length(w)]
    del_b = 0.0
    
    N = length(features)
    
    for i = 1:N
        del_w += (σ(w' * features[i] + b) - labels[i]) * features[i]
        del_b += (σ(w' * features[i] + b) - labels[i])
    end
    
    w = w - α*del_w
    b = b - α*del_b
    
    return w, b
end

#######################################################################################

"""
Parameters:
     x_data:    x values in the data
     y_data:    y values in the data
     w:         weight
     b:         bias
     α:         step length
     iter:      number of iterations to complete

Trains the regression and displays the cost at 10^n iterations.

Returns:
    The new weight and bias generated.
"""
function logit_training(features, labels, w, b, α, iter)
    j = 0
    for i = 1:iter
        w, b = logit_batch_gradient_descent(features, labels, w, b, α)      
        if i == 10^j
            println(i, " iteration with cost ", avg_loss(x_data, y_data, w, b))
            j = j + 1
        end
    end
        return w, b
end

#######################################################################################

"""
Parameters:
     x:    x values in the data
     y:    y values in the data
     w:    weight
     b:    bias

This function takes in x-values, y-values, weight,
and bias and predicts the label of an x-value.

Returns: 
    The prediction of the label associated with the x values.
"""
function logit_predictor(x, y, w, b)
    if σ(w'x + b) >= 0.5
        println("Prediction:\tAccepted")
        y == 1 ? println("Actual:\t\tAccepted") : println("Actual:\t\tNot Accepted")
        return 1
    else
        println("Prediction:\tAccepted")
        y == 1 ? println("Actual:\t\tAccepted") : println("Actual:\t\tNot Accepted")
        return 0
    end
end

#######################################################################################

"""
Parameters:
     x:    x values in the data
     y:    y values in the data
     w:    weight
     b:    bias

This function takes in x-values, y-values, weight,
and bias and calculates the mean square error.

Returns: 
    The mean squared error.
"""
function error_MSE(x, y, w, b)
    mean_error = 0.0

    for i = 1:length(x)
        mean_error = mean_error + (logit_predictor(x[i], y[i], w, b) - y[i])^2
    end
    println("--------------------------------")
    println("\tError: \t", mean_error/length(x_data))
end

error_MSE

---

## Train the algorithm

<font size ="3"> Now, we will train the algorithm using initial weights `w = [0.0, 0.0, 0.0]` and `b = 0.0` </font>

In [150]:
# Initial Pass with the Training Data
w = [0.0, 0.0, 0.0]
b = 0.0

w, b = logit_training(training_x, training_y, w, b, 0.000001, 100000)

1 iteration with cost 0.6975379877487258
10 iteration with cost 1.1952527047242898
100 iteration with cost 1.194655298975357
1000 iteration with cost 1.1884432383569965
10000 iteration with cost 1.1270430520806354
100000 iteration with cost 0.5470366829725104


([-0.007510038757002728, 0.29456618666821893, 0.8549193300115617], -0.09129887445783746)

<font size ="3"> After the first pass of the algorithm, we can see that our algorithm does take steps toward in the direction of the descent; but, it does not reach a minimum. For the purpose of this notebook, we undertake $n$ number of passes to show the descent of the average loss. </font>

In [151]:
# Pass 2 -- Training Data
w, b = logit_training(training_x, training_y, w, b, 0.000001, 100000)

1 iteration with cost 0.5641669430777728
10 iteration with cost 0.5469874687362811
100 iteration with cost 0.5465448555347018
1000 iteration with cost 0.5421499489744581
10000 iteration with cost 0.5013556324124026
100000 iteration with cost 0.4193408916975182


([-0.007292399353612156, 0.4582055956281985, 1.0187689688481552], -0.16975992913114293)

In [152]:
# Pass 3 -- Training Data
w, b = logit_training(training_x, training_y, w, b, 0.000001, 100000)

1 iteration with cost 0.41934078427122057
10 iteration with cost 0.4193398174604494
100 iteration with cost 0.41933015191644385
1000 iteration with cost 0.41923375178762656
10000 iteration with cost 0.41829426253607105
100000 iteration with cost 0.4106263236447304


([-0.00822135217241391, 0.5963459104948603, 1.0894095584713905], -0.24555751740614856)

In [153]:
# Pass 4 -- Training Data
w, b = logit_training(training_x, training_y, w, b, 0.000001, 100000)

1 iteration with cost 0.4106262507833631
10 iteration with cost 0.41062559504015045
100 iteration with cost 0.4106190385078476
1000 iteration with cost 0.4105535629119764
10000 iteration with cost 0.40990753236120003
100000 iteration with cost 0.4041339092170203


([-0.008905484266453564, 0.71913120506564, 1.127757688322977], -0.31972757711370803)

---

## Test the algorithm

<font size ="3"> Now that we have trained our algorithm and seen the data, we can test our algorithm on the test data. We will perform the same number of passes as we did for our training data to see if our average loss is the same or near the training data using initial weights `w = [0.0, 0.0, 0.0]` and `b = 0.0`. </font>

In [161]:
# Initial Pass of the Test Data
w2 = [0.0, 0.0, 0.0]
b2 = 0.0

w2, b2 = logit_training(test_x, test_y, w2, b2, 0.000001, 100000)

1 iteration with cost 0.7355410326666838
10 iteration with cost 0.7248417625064386
100 iteration with cost 0.7245157538820408
1000 iteration with cost 0.7212818993649188
10000 iteration with cost 0.6914229078302015
100000 iteration with cost 0.5400892021140317


([-0.0029937810348607406, 0.08617063453120022, 0.39897801829697793], -0.016070251197478236)

In [162]:
# Pass 2 -- Test Data 
w2, b2 = logit_training(test_x, test_y, w2, b2, 0.000001, 100000)

1 iteration with cost 0.5400883516523918
10 iteration with cost 0.5400806979637801
100 iteration with cost 0.5400042072016865
1000 iteration with cost 0.5392438876520876
10000 iteration with cost 0.5320762655495221
100000 iteration with cost 0.4883690993229899


([-0.004197740938538153, 0.1478548901497723, 0.5884590789303132], -0.029200113080882134)

In [163]:
# Pass 3 -- Test Data
w2, b2 = logit_training(test_x, test_y, w2, b2, 0.000001, 100000)

1 iteration with cost 0.48836879280677564
10 iteration with cost 0.48836603427823094
100 iteration with cost 0.4883384606091755
1000 iteration with cost 0.48806388096947223
10000 iteration with cost 0.4854293635586577
100000 iteration with cost 0.4669884614021719


([-0.004942072125095563, 0.19996052982472592, 0.6916194042963345], -0.041069677943505176)

In [164]:
# Pass 4 -- Test Data
w2, b2 = logit_training(test_x, test_y, w2, b2, 0.000001, 100000)

1 iteration with cost 0.4669883129534147
10 iteration with cost 0.46698697695566244
100 iteration with cost 0.46697362104237494
1000 iteration with cost 0.46684046707831356
10000 iteration with cost 0.46554822497362497
100000 iteration with cost 0.4556119163568081


([-0.0054547281261891085, 0.24738823691224093, 0.7522536951745404], -0.052311653433813894)

---

## Test the Predictions

<font size ="3"> With out operational algorithm, we can now implement our predictor function on the training data. We must iterate through the data set to test the prediction and compare it to the actual data in our `training_y` data or the column 2 data from the `training_data`. Then we will do the same for the test data comparing it to the actual values in `test_y`. </font>

In [169]:
# Predict acceptance test with the training data

for i = 1: length(training_x)
    logit_predictor(training_x[i], training_y[i], w, b)
end

Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted

In [171]:
# Predict acceptance test with the test data

for i = 1: length(test_x)
    logit_predictor(test_x[i], test_y[i], w, b)
end

Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted


<font size ="3"> As we can see from the prediction comparison for both the `training_data` and the `test_data`, the percentage of incorrect predictions appear to be about the same. To verify this, we will calculate the error for both subset of the data for comparison. </font>

---

## Calculate the Error 

In [167]:
# Calculate the Error of the training data
error_MSE(training_x, training_y, w, b)

Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted

In [168]:
# Calculate the Error of the test data
error_MSE(test_x, test_y, w, b)

Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
Prediction:	Accepted
Actual:		Not Accepted
--------------------------------
	Error: 	0.075


---
---

## Conclusion

<font size ="3"> As we can see from above, the errors calculated for both of the subsets of the data are extremely close. This allows us to see that we have successfully implemented the Logistic Regression Algorithm on our data. </font>


### For more information on Logistic Regression, please see:  
<br>

<font size ="3"> 
    
+ [Logistic Regression - Towards Data Science](https://towardsdatascience.com/binary-cross-entropy-and-logistic-regression-bf7098e75559)
<br>
    
+ [Logistic Regression - Statistic Solution](https://www.statisticssolutions.com/what-is-logistic-regression/)
<br>
    
+ [Logistic Regression - Machine Learning Mastery](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)
<br>
    
+ [Logistic Regression - Wikipedia](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression)
</font>

---
---