# Execution Time Analysis for Logistic Regression Training with Replicated Data
This notebook analyzes the execution times recorded in `08_replicated_data_spark_logistic_regression_food_inspections` to compare:

1. A from-scratch Python implementation which uses looping to compute gradients on individual samples
2. A from-scratch Spark implementation which leverages distributed map and reduce operations to compute gradients on individual samples

**Notes:**

1. Logistic Regression Algorithm for Spark RDD from M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of NSDI, pages 15–28, 2012.
2. Raw Python version based on https://github.com/jstremme/l2-regularized-logistic-regression, but without regularization or vectorized matrix operations. Instead, `py_lr_grad_descent` computes the gradient on each sample sequentially, whereas Spark computes the per-sample gradients in parallel.
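
The contrast between the two approaches can be sketched locally without a Spark cluster. The block below is a minimal illustration, not the code from either notebook: `sample_gradient`, `loop_gradient`, and `map_reduce_gradient` are hypothetical names, labels are assumed to be in {-1, +1} as in the RDD paper's logistic regression example, and Python's built-in `map` with `functools.reduce` stands in for the distributed map and reduce steps.

```python
from functools import reduce

import numpy as np


def sample_gradient(w, x, y):
    """Gradient of the logistic loss for one sample, with label y in {-1, +1}."""
    return (1.0 / (1.0 + np.exp(-y * np.dot(w, x))) - 1.0) * y * x


def loop_gradient(w, X, Y):
    """Sequential version: visit each sample in a Python loop and accumulate."""
    grad = np.zeros_like(w, dtype=float)
    for x, y in zip(X, Y):
        grad += sample_gradient(w, x, y)
    return grad


def map_reduce_gradient(w, X, Y):
    """Map/reduce version: same per-sample gradient, summed by a reduce step.

    This runs locally in one process; it only mirrors the shape of the
    computation that Spark would distribute across workers.
    """
    per_sample = map(lambda p: sample_gradient(w, p[0], p[1]), zip(X, Y))
    return reduce(lambda a, b: a + b, per_sample)
```

Both functions return the same summed gradient. The difference measured in this notebook is how the per-sample work is scheduled, not what is computed.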

### Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Record Fit Times by Instance

In [None]:
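# N is a placeholder for each recorded fit time in seconds; substitute the measured values.
# Every column needs one entry per timed run. The 'instances' values below assume
# three runs on 2 instances followed by three runs on 4 instances.
fit_times_df = pd.DataFrame(
    {'instances': [2, 2, 2, 4, 4, 4],
     'raw_python_fit_time_seconds': [N, N, N, N, N, N],
     'spark_fit_time_seconds': [N, N, N, N, N, N]
    })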
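
The source notebook's timing code is not shown here, so the following is only a sketch of one way such fit times could be collected. `time_fit` and its `repeats` parameter are hypothetical names, and `fit_fn` stands for any zero-argument callable that runs one training fit.

```python
import time


def time_fit(fit_fn, repeats=3):
    """Run fit_fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        durations.append(time.perf_counter() - start)
    return durations
```

Each returned list could then fill one algorithm's entries for a given instance count in `fit_times_df`.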

### Create Plotting Functions

In [None]:
def mean_r(x):
    return round(np.mean(x), 3)

In [None]:
def min_r(x):
    return round(min(x), 3)

In [None]:
def max_r(x):
    return round(max(x), 3)

In [None]:
def plot_execution_time(df, algo, title):
    
    plt.figure(figsize=(8,8))
    xpos=np.arange(len(algo))
    plt.barh(xpos-0.2, df['Min'], height=0.2, label='Min', color='yellow')
    plt.barh(xpos, df['Mean'], height=0.2, label='Mean', color='orange')
    plt.barh(xpos+0.2, df['Max'], height=0.2, label='Max', color='red')

    plt.title(title)
    plt.yticks(xpos, algo)
    plt.xlabel('Execution Time (Seconds)')
    plt.ylabel('Algorithm and Number of Instances')
    plt.legend()
    
    plt.xlim([0, 2.5])
    plt.show()

### Split Data by Number of Instances

In [None]:
two_instance_df = fit_times_df[fit_times_df['instances'] == 2]
four_instance_df = fit_times_df[fit_times_df['instances'] == 4]

### Compute Summary Statistics

In [None]:
algos = ['Raw Python 2 Instances', 'Spark 2 Instances', 'Raw Python 4 Instances', 'Spark 4 Instances']

In [None]:
all_times = [two_instance_df['raw_python_fit_time_seconds'].tolist(),
             two_instance_df['spark_fit_time_seconds'].tolist(),
             four_instance_df['raw_python_fit_time_seconds'].tolist(),
             four_instance_df['spark_fit_time_seconds'].tolist()]

In [None]:
mean_times = list(map(mean_r, all_times))
min_times = list(map(min_r, all_times))
max_times = list(map(max_r, all_times))

In [None]:
summary_df = pd.DataFrame(
        {'Mean': mean_times,
         'Min': min_times,
         'Max': max_times,
        })
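
The same mean, min, and max summary could also be computed in one step with a pandas `groupby`. The sketch below assumes the column names used in `fit_times_df`. `summarize_fit_times` is a hypothetical helper, and the values in `toy_fit_times_df` are illustrative only, not measurements from the notebook.

```python
import pandas as pd

# Illustrative values only, NOT measured fit times.
toy_fit_times_df = pd.DataFrame({
    'instances': [2, 2, 4, 4],
    'raw_python_fit_time_seconds': [1.0, 3.0, 2.0, 4.0],
    'spark_fit_time_seconds': [0.5, 1.5, 1.0, 2.0],
})


def summarize_fit_times(df):
    """Return mean/min/max fit time per (instances, algorithm), rounded to 3 places."""
    # Reshape to long form: one row per (instances, algorithm, seconds) measurement.
    long_df = df.melt(id_vars='instances', var_name='algorithm', value_name='seconds')
    summary = long_df.groupby(['instances', 'algorithm'])['seconds'].agg(['mean', 'min', 'max'])
    return summary.round(3)


toy_summary = summarize_fit_times(toy_fit_times_df)
```

This avoids keeping the `algos` labels and the `all_times` lists in matching order by hand, at the cost of producing a multi-indexed frame rather than the flat `summary_df` that `plot_execution_time` expects.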

### Plot Execution Times

In [None]:
plot_execution_time(summary_df, algos,
                    title='Logistic Regression Execution Times for Gradient Descent with Batch Size = 1')