## Toy dataset - Algorithm Implementation

**This notebook demonstrates the algorithm implementation on a toy dataset.**

As part of this algorithm implementation, we created a toy example that matches the provided dataset and used it to explain the math behind logistic regression and how gradient descent minimizes the loss. The algorithm is applied to a sample of the training dataset and evaluated on a sample of the test set.

### Section 1 - Setup Environment

In [0]:
import pandas as pd
from html import escape
from IPython.display import HTML, display as ipython_display
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType, StringType, BooleanType, DateType, DoubleType
from pyspark.sql import functions as F
import numpy as np
import math

In [0]:
### Set Up Blob Storage

blob_container = "w261-container" # The name of your container created in https://portal.azure.com
storage_account = "w261storageaccount" # The name of your Storage account created in https://portal.azure.com
secret_scope = "w261scope" # The name of the scope created in your local computer using the Databricks CLI
secret_key = "w261key" # The name of the secret key created in your local computer using the Databricks CLI 
blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net"
mount_path = "/mnt/mids-w261"

spark.conf.set(
  f"fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net",
  dbutils.secrets.get(scope = secret_scope, key = secret_key)
)

### Section 2 - Load Data

In [0]:
# Read joined dataset from parquet.

# Take a small sample 0.01% of the Train and Test Datasets to create the Toy Example

toy_train = spark.read.parquet(f"{blob_url}/model_train_data_full_v2/*")
toy_test = spark.read.parquet(f"{blob_url}/model_test_data_full_v2/*")
                      
toy_train = toy_train.select('departure_delay_boolean', 'VectorAssembler_features')
toy_test = toy_test.select('departure_delay_boolean', 'VectorAssembler_features')

toy_train = toy_train.sample(fraction=0.0001, seed=3)
toy_test = toy_test.sample(fraction=0.0001, seed=3)

### Section 3 - Define helper functions

In [0]:
# Helper function
# Create helper function to print evaluation metrics

def print_results(predictions):
  tp = predictions[(predictions.label == 1) & (predictions.prediction == 1)].count()
  tn = predictions[(predictions.label == 0) & (predictions.prediction == 0)].count()
  fp = predictions[(predictions.label == 0) & (predictions.prediction == 1)].count()
  fn = predictions[(predictions.label == 1) & (predictions.prediction == 0)].count()
  total = predictions.count()
  
  recall = float(tp)/(tp+fn)
  precision = float(tp)/(tp+fp)
  f1 = (2*recall*precision)/(precision+recall)
  
  data = {'Actual-delay': [tp, fn], 'Actual-on time': [fp, tn]}
  confusion_matrix = pd.DataFrame.from_dict(data, orient='index', columns=['Predicted-delay', 'Predicted-on time'])
  
  #print("Test Area Under ROC: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))
  #print("Test Area Under Precision-Recall Curve: ", "{:.2f}".format(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})))

  print("Sensitivity: {:.2%}".format(tp/(tp + fn)))
  print("Specificity: {:.2%}".format(tn/(tn + fp)))
  print("False positive rate: {:.2%}".format(fp/(fp + tn)))
  print("False negative rate: {:.2%}".format(fn/(tp + fn)))
  print("Recall: {:.2%}".format(recall))
  print("Precision: {:.2%}".format(precision))
  print("f1: {:.2%}".format(f1))
  
  print("########### Confusion Matrix ###########")
  print(confusion_matrix)


### Section 4 - Toy Example

Logistic regression aggregates the predictor variables in the same way as linear regression. Each input \\(X_j\\) is multiplied by a weight \\(\beta_j\\), and the products \\(X_j \beta_j\\) are summed as shown below:

$$\displaystyle f(X)= \beta_0 + \sum_{j=1}^p X_j \beta_j$$

This can be expressed in vector form as \\(f(X)= \theta^TX\\), where \\(\theta\\) is a vector of weights including \\(\beta_0\\), and \\(X\\) is a vector of inputs (with a constant input of \\(1\\) for \\(\beta_0\\)). Logistic regression passes the output of \\(\theta^TX\\) through a new function \\(g(z)\\), where $$\displaystyle g(z)=\frac{1}{1+e^{-z}}$$

This can be expressed as: $$h_\theta (x) = g(\theta^Tx)$$ Here \\(g(z)\\) is the sigmoid function, which scales all output values to between 0 and 1. Substituting \\(\theta^TX\\) for \\(z\\) gives:

$$\displaystyle h_\theta (x) = \frac{1}{1+e^{-\theta^TX}}$$ 

The value \\(h_\theta(x)\\) is the estimated probability that \\(x\\) is a member of category \\(y=1\\). The probability that \\(y=0\\) is then \\(1 - h_\theta(x)\\). Because of the sigmoid function, \\(h_\theta(x)\\) ranges from 0 to 1, and the two probabilities add to one.
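The mapping from a weighted sum to a probability can be sketched with a few lines of NumPy. The weights and inputs below are made-up values chosen only for illustration; they are not taken from the flight data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: theta[0] plays the role of the bias beta_0.
theta = np.array([-1.0, 2.0, 0.5])
# One hypothetical observation, with a leading 1 so the bias enters theta^T x.
x = np.array([1.0, 0.5, 2.0])

z = theta.dot(x)            # weighted sum: -1*1 + 2*0.5 + 0.5*2 = 1.0
p_delay = sigmoid(z)        # estimate of P(y = 1 | x)
p_on_time = 1.0 - p_delay   # estimate of P(y = 0 | x)
print(z, p_delay, p_on_time)
```

A weighted sum of 0 maps to a probability of exactly 0.5, which is why 0.5 is a natural default decision threshold.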

The cost or loss function computes the error of the model. The weights used in the logistic regression equation can vary from one model to another. The goal of fitting is to find the weights that minimize the cost function. Models can be compared by calculating their error when attempting to predict the label \\(y\\).

For logistic regression, the squared loss function is not convex and can have many local minima, so alternatives such as the hinge loss or the logistic loss are used instead.
For logistic loss, the negative log of the logistic regression output is taken when the actual value of \\(y\\) is 1. When the actual value of \\(y\\) is 0, the negative log of 1 minus the logistic regression output is used.

This can be expressed as:
<br>
<img src ='https://sudhritybucket.s3.amazonaws.com/cf1.png' width="400" height="400">
<br>

When the logistic regression correctly predicts \\(\hat{y}=1\\) with a probability of 1, then \\(-\log(1)=0\\) and the loss is zero: this is a perfect prediction. Similarly, when \\(\hat{y}=0\\) is correctly predicted with a probability of 1, \\(h_\theta(x)=0\\) and the cost is \\(-\log(1-0)=0\\). For an incorrect prediction with \\(P(\hat{y}=0)=.999\\) (and the corresponding probability \\(P(\hat{y}=1)=.001\\)) when \\(y=1\\), the log loss is \\(-\log(.001)\approx 6.9\\) using the natural logarithm (or 3 in base 10), showing a much higher error. Since we can't take the log of 0, values of .999 and .001 are used. As the probability assigned to the correct class approaches 0, the log loss approaches infinity.
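To make the penalty concrete, here is a small sketch that evaluates the per-example logistic loss for a few predicted probabilities when the true label is 1. The helper name `log_loss_single` is hypothetical, and the natural logarithm is assumed (the same base as `np.log`), so a prediction of .001 costs about 6.9 rather than the base-10 value of 3.

```python
import math

def log_loss_single(p, y):
    """Logistic loss for one example, where p is the predicted P(y = 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

confident_right = log_loss_single(0.999, 1)  # about 0.001
unsure = log_loss_single(0.5, 1)             # about 0.693
confident_wrong = log_loss_single(0.001, 1)  # about 6.908

print(confident_right, unsure, confident_wrong)
```

The loss grows slowly near a correct prediction and very quickly as the probability assigned to the true class approaches 0.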

The weights in logistic regression could be selected at random, with the cost function evaluated each time to see whether the new model improves on the last, but this is inefficient. The cost function has a slope of zero at its minimum, so by taking the derivative of the cost function to obtain the slope and then moving the weights in the direction that brings the slope closer to zero, we can find a minimum. However, we have to make sure that we are moving in the right direction, since the slope is also zero at a maximum. Several algorithms, including gradient descent, Newton's method, and quasi-Newton methods, apply some variation of this approach.

In gradient descent, the first-order derivative of the cost function is evaluated, which gives the slope, or gradient. Each step moves the weights in the direction of the negative gradient, the direction of steepest decrease. The learning rate, or step size, is set by the user and held constant across steps. Over multiple iterations, the weights approach the minimum. We will use this method in our toy logistic regression implementation.
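The idea can be illustrated on a one-dimensional problem before applying it to logistic regression. The sketch below minimizes the hypothetical function \\(f(w) = (w-3)^2\\), whose minimum is at \\(w=3\\); the function, starting point, and step count are assumptions chosen for illustration.

```python
def grad_f(w):
    """Derivative of f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent_1d(grad, w0, learning_rate=0.1, n_steps=50):
    """Repeatedly step against the gradient with a constant learning rate."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w

w_final = gradient_descent_1d(grad_f, w0=0.0)
print(w_final)  # close to the minimizer w = 3
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a constant factor of 0.8, so the iterate converges steadily. A learning rate that is too large would overshoot and can diverge.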

In [0]:
# Turn sampled Dataframes into RDDs for processing into feature array, label format

trainRDD = toy_train.rdd.map(lambda x: (x[1:],x[0])).cache()
testRDD = toy_test.rdd.map(lambda x: (x[1:],x[0])).cache()

As discussed above, logistic regression uses the sigmoid function to solve classification problems:

$$h_\theta (x) = \frac{1}{1+e^{-\theta^TX}}$$

The cost for a single example \\((x^i, y^i)\\) is given by:

$$ cost(h_{\theta}(x^i),y^i)=-y^i \log(h_\theta (x^{i})) - (1-y^i) \log(1-h_\theta (x^i))$$

Therefore, the loss function for logistic regression over a dataset of \\(n\\) examples is defined as follows:

$$ J(\theta)=-\frac{1}{n}\sum_{i=1}^{n}\left(y^i\log(h_\theta (x^i))+(1-y^i)\log(1-h_\theta (x^i))\right)$$

This is implemented below as the **loss variable** inside our LogLoss function. The only variation is that we use the RDD's mean instead of summing and dividing by \\(n\\) ourselves, which keeps the RDD calculation efficient.

It is important to point out that we augment our data with a constant 1 at index 0, so that the bias is the weight at index 0.

**Where in the RDD implementation x[0] is equivalent to the feature array or x and x[1] is y in the formulas above.**
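For intuition, the same calculation can be sketched locally with NumPy arrays instead of an RDD. This is a minimal sketch using the standard mean logistic loss, \\(-\frac{1}{n}\sum_i \left(y^i\log h + (1-y^i)\log(1-h)\right)\\); the function name `log_loss_local` and the tiny dataset are hypothetical.

```python
import numpy as np

def log_loss_local(X, y, W):
    """Mean logistic loss; W has the bias at index 0."""
    # Augment the features with a leading column of ones for the bias.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    p = 1.0 / (1.0 + np.exp(-X_aug.dot(W)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: 3 examples with 2 features each.
X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.5, 0.3]])
y = np.array([1, 0, 1])

W = np.zeros(3)  # bias + 2 feature weights
loss_at_zero = log_loss_local(X, y, W)
print(loss_at_zero)  # log(2), about 0.693, since every prediction is 0.5
```

An all-zero weight vector predicts 0.5 for every example, so its loss of \\(\log 2\\) is a useful baseline: a trained model should do better than this.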

In [0]:
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
  
def LogLoss(dataRDD, W): 
    """
    Compute logistic loss error.
    
    Where:
        dataRDD - each record is a tuple of (features_array, y)
        W       - (array) model coefficients with bias at index 0
        
    """
    
    # Augment the data by adding 1 to the front of the predictors array
    
    augmented_data = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))
    
    # Calculate loss: -(y * log(p) + (1 - y) * log(1 - p)), averaged over all records
    
    def record_loss(x):
        p = sigmoid(W.dot(x[0]))
        return -(x[1] * np.log(p) + (1 - x[1]) * np.log(1 - p))
    
    loss = augmented_data.map(record_loss).mean()
   
    return loss
  

In gradient descent, we aim to find the minimum of a differentiable function by iteratively updating the parameter values in the direction that decreases the function:

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial\theta_j}J(\theta)$$

The weight vector (W) has one parameter per feature plus the bias. To minimize our cost function, we need to apply the gradient descent update to each parameter of W.

To use gradient descent, we need the partial derivative of the cost function with respect to each weight:

$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^i)-y^i \right)x^i_j$$

It is important to point out that we are using ridge (L2) regularization to increase the generalizability of our model. Therefore, we need to add the penalty term to the gradient of every weight except the bias (\\(j \geq 1\\)):

$$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{n}\left(\sum_{i=1}^{n}\left(h_\theta(x^i)-y^i\right) x^i_j + \lambda \theta_j \right)$$


We then update each weight using the learning rate \\(\alpha\\), which gives the new model for this iteration:

$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n}\left(\sum_{i=1}^{n}\left(h_\theta(x^i)-y^i\right) x^i_j + \lambda \theta_j \right)$$
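A single regularized update can be checked by hand on a tiny local dataset. The sketch below assumes the gradient is the sum of \\((h_\theta(x^i)-y^i)x^i\\) plus \\(\lambda\\) times the non-bias weights, all divided by \\(n\\); the function name `gd_update_ridge_local` and the data are hypothetical.

```python
import numpy as np

def gd_update_ridge_local(X, y, W, learning_rate=0.1, reg_param=0.1):
    """One gradient descent step with an L2 penalty on the non-bias weights."""
    n = X.shape[0]
    X_aug = np.hstack([np.ones((n, 1)), X])        # bias column at index 0
    p = 1.0 / (1.0 + np.exp(-X_aug.dot(W)))        # predicted probabilities
    grad = X_aug.T.dot(p - y)                      # sum of (h(x) - y) * x
    grad += reg_param * np.append([0.0], W[1:])    # penalty skips the bias
    return W - learning_rate * grad / n

# Hypothetical data: one feature, larger values are labelled 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

W0 = np.zeros(2)
W1 = gd_update_ridge_local(X, y, W0)
print(W1)  # [0.0, 0.05]
```

Starting from zero weights, every prediction is 0.5, the bias gradient cancels out, and the feature weight moves in the positive direction, as expected for data where larger feature values belong to class 1.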

In [0]:
def GDUpdate_wReg(dataRDD, W, learning_rate = 0.1, reg_param = 0.1):
    """
    Gradient descent update with ridge regularization (1 Iteration).
    """
    
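    # Broadcast the current weights so each executor receives one read-only copy
    W_broadcast = sc.broadcast(W)
    
    N = dataRDD.count()
    
    # Augment the data by adding 1 to the front of the predictors array (bias term)
    augmented_data = dataRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))

    # Sum over all records of (h(x) - y) * x
    grad = augmented_data.map(lambda x: ((sigmoid(W_broadcast.value.dot(x[0])) - x[1])*x[0])).sum()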
    
    ### Add regularization penalty
    
    grad += reg_param * np.append([0.0], W[1:])
    
    new_model = W - learning_rate * grad / N

    return new_model

In [0]:
def GradientDescent_wReg(trainRDD, testRDD, wInit, nSteps = 10, learning_rate = 0.1 , reg_param = 0.1):
    """
    Runs nSteps iterations of regularized gradient descent and records the train loss, test loss, and model after each step.
    """
    
    # initialize list to store values
    
    train_history, test_history, model_history = [], [], []
    
    # perform iterations and calculate loss
    
    model = wInit
    for idx in range(nSteps): 
      
        # update the model
        model = GDUpdate_wReg(trainRDD, model, learning_rate, reg_param)
        
        # append results
        train_history.append(LogLoss(trainRDD, model))
        test_history.append(LogLoss(testRDD, model))
        model_history.append(model)
        
    return train_history, test_history, model_history

Initialize a random weight vector W to start the process, then train the model.

In [0]:
import numpy as np

# Random initial weights: 809 entries = bias + one weight per entry of VectorAssembler_features
wInit = np.random.uniform(0,1,809)

ridge_results = GradientDescent_wReg(trainRDD, testRDD, wInit, nSteps = 10, reg_param = 0.1 )

In [0]:
### Take the final weight vector

w = ridge_results[2][-1] # final model

### Apply the model to the test data

augmented_test_data = testRDD.map(lambda x: (np.append([1.0], x[0]), x[1]))
results = augmented_test_data.map(lambda x: (sigmoid(w.dot(x[0])),x[1])).collect()

In [0]:
# Set decision threshold to 0.5
df_toy_predictions = pd.DataFrame(results)
df_toy_predictions['pred'] = df_toy_predictions[0] >= .5

In [0]:
# Create spark dataframe to write into blob
df_toy_predictions = spark.createDataFrame(df_toy_predictions)
df_toy_predictions.write.mode('overwrite').parquet(f"{blob_url}/toy_model_results_v3")

In [0]:
df_toy_predictions = spark.read.parquet(f"{blob_url}/toy_model_results_v3/*")

In [0]:
# Rename columns to the names expected by print_results

df_toy_predictions = df_toy_predictions.withColumnRenamed('1', 'label')
df_toy_predictions = df_toy_predictions.withColumnRenamed('pred', 'prediction')

In [0]:
print_results(df_toy_predictions)

Here we observe the impact of limited data, since we are using only 0.01% of the full dataset. Our F1 score is strongly affected: the false negative rate is 98.56%, meaning that nearly all actual delays are predicted as on time.
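The link between a high false negative rate and a low F1 score can be seen from the metric definitions alone. The counts below are hypothetical values chosen for illustration; they are not the results of this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Return (recall, precision, f1), using 0.0 when a denominator is zero."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

# Hypothetical confusion-matrix counts: 98 of 100 actual delays are missed.
tp, fn, fp, tn = 2, 98, 5, 895

recall, precision, f1 = f1_from_counts(tp, fp, fn)
false_negative_rate = fn / (tp + fn)
print(recall, precision, f1, false_negative_rate)
```

Because recall equals one minus the false negative rate, and F1 is the harmonic mean of precision and recall, a very low recall pulls F1 toward zero even when precision is moderate.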

In [0]:
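# Model evaluation with scikit-learn

# collect() returns Row objects, so extract the values into flat lists in a single pass
rows = df_toy_predictions.select('label', 'prediction').collect()
y_true_lr = [int(row['label']) for row in rows]
y_pred_lr = [int(row['prediction']) for row in rows]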

# Print metrics
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true_lr, y_pred_lr))
print(confusion_matrix(y_true_lr, y_pred_lr))