# Spark Logistic Regression with Replicated Data
This notebook reads the train and test splits created in `04_create_data_splits`, replicates them to enlarge the dataset, and fits Logistic Regression models in Python and in Spark, including a from-scratch Spark model built on a homemade implementation of Gradient Descent.

See the `01_food_inspections_data_prep` notebook for information about the Chicago Food Inspections Data, the license, and the various data attributes.  See the `02_census_data_prep` notebook for the US Census API terms of use.

### Analysis and Models in this Notebook

- Simple Logistic Regression model using scikit-learn
- From-scratch Logistic Regression model using homemade implementation of Gradient Descent
- Spark MLlib Logistic Regression

### Set Global Seed

In [1]:
SEED = 666

### Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

from pyspark import SparkContext
from pyspark.sql.types import FloatType, StructType, StructField, LongType
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression

import l2_regularized_logistic_regression as nplr

### Read Train and Test Splits

In [3]:
X_train = pd.read_csv('../data/X_train.gz', compression='gzip')
X_test = pd.read_csv('../data/X_test.gz', compression='gzip')
y_train = pd.read_csv('../data/y_train.gz', compression='gzip').values.flatten()
y_test = pd.read_csv('../data/y_test.gz', compression='gzip').values.flatten()

In [4]:
features = X_train.columns

### Replicate the Data to Test Model Training with More Samples

Add a bit of noise to the labels by flipping the sign of a random 5% of them.

In [5]:
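def add_binary_noise(targets, fraction=0.05, seed=SEED):
    """Flip the sign of a random `fraction` of the labels (assumes -1/+1 encoding)."""
    rng = np.random.default_rng(seed)  # seeded so the flipped indices are reproducible
    targets = targets.copy()  # leave the caller's array untouched
    n_flip = int(fraction * len(targets))
    idx = rng.choice(len(targets), n_flip, replace=False)
    targets[idx] = -targets[idx]
    return targets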
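
As a quick illustration of the noise step, the sketch below flips 5% of a toy label vector. It assumes labels are encoded as -1/+1, which the sign flip in `add_binary_noise` relies on. The `flip_labels` helper, its `seed` argument, and the toy data are hypothetical and exist only for this example.

```python
import numpy as np

def flip_labels(labels, fraction=0.05, seed=0):
    """Return a copy of `labels` (+1/-1) with a random `fraction` sign-flipped."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(fraction * len(noisy))
    idx = rng.choice(len(noisy), size=n_flip, replace=False)
    noisy[idx] = -noisy[idx]
    return noisy

# 100 toy labels alternating between +1 and -1.
toy_labels = np.where(np.arange(100) % 2 == 0, 1, -1)
noisy_labels = flip_labels(toy_labels)

# Count how many labels differ from the originals.
n_changed = int(np.sum(toy_labels != noisy_labels))
print(n_changed)
```

Because the indices are drawn without replacement, exactly `int(fraction * len(labels))` labels change, and the values stay in {-1, +1}.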

In [6]:
replications = 4

In [7]:
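X_train_temp = X_train.copy()
X_test_temp = X_test.copy()
# DataFrame.append was removed in pandas 2.0; stack whole copies with pd.concat instead.
X_train = pd.concat([X_train_temp] * replications, ignore_index=True)
X_test = pd.concat([X_test_temp] * replications, ignore_index=True)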

In [8]:
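# The features are replicated by stacking whole copies, so the labels must be tiled
# (whole vector repeated) rather than repeated element by element to stay aligned.
y_train = add_binary_noise(nplr.transform_target(np.tile(y_train, replications)))
y_test = add_binary_noise(nplr.transform_target(np.tile(y_test, replications)))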
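
When features are replicated by stacking whole copies of the frame, the labels have to be replicated in the same order. The toy check below (NumPy only, with made-up arrays that are not part of the notebook's data) contrasts `np.tile`, which repeats the whole vector, with `np.repeat`, which repeats each element in place.

```python
import numpy as np

X = np.array([[0.1], [0.2], [0.3]])   # toy features, one row per sample
y = np.array([1, -1, 1])              # toy labels aligned with X
replications = 2

X_rep = np.concatenate([X] * replications)   # rows: s0, s1, s2, s0, s1, s2
y_tiled = np.tile(y, replications)           # whole vector repeated
y_repeated = np.repeat(y, replications)      # each element repeated in place

# Labels that match the stacked rows are the original labels, copy after copy.
expected = np.concatenate([y] * replications)
tile_aligned = bool(np.array_equal(y_tiled, expected))
repeat_aligned = bool(np.array_equal(y_repeated, expected))
print(tile_aligned, repeat_aligned)
```

Only the tiled labels line up with the stacked rows; the element-wise repeat pairs most rows with a label from a different sample.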

### Initialize Spark

In [9]:
sc = SparkContext.getOrCreate()
spark = (SparkSession
         .builder
         .appName("lr2")
         .config("spark.rpc.message.maxSize", "1024")  # plain integer, interpreted as MiB
         .getOrCreate())

### Spark MLlib Version

In [10]:
target_column = 'class'

In [11]:
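# Spark ML classifiers expect non-negative integer labels, so map -1/+1 to 0/1.
X_train[target_column] = np.where(y_train == -1, 0, 1)
X_test[target_column] = np.where(y_test == -1, 0, 1)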

In [None]:
train_df = spark.createDataFrame(X_train)
test_df = spark.createDataFrame(X_test)

In [None]:
to_assemble = [item for item in train_df.columns if item != target_column]
assembler = VectorAssembler(inputCols=to_assemble, outputCol='features')
train_vector = assembler.transform(train_df)
test_vector = assembler.transform(test_df)

In [None]:
train_vector.cache().count()

In [None]:
lr = SparkLogisticRegression(labelCol=target_column, featuresCol='features', regParam=0,
                             tol=0.001, standardization=True, fitIntercept=True)

In [None]:
%%time
lr = lr.fit(train_vector)

In [None]:
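# Collect the 0/1 predictions to the driver, then compare them with the
# test labels mapped from -1/+1 to 0/1 to get the accuracy.
y_pred = np.array(lr.transform(test_vector).select('prediction').rdd.map(lambda x: x.prediction).collect())
np.mean(np.where(y_test == -1, 0, 1) == y_pred)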
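
The accuracy calculation above maps the -1/+1 test labels to 0/1 before comparing them with the 0/1 predictions. A minimal NumPy sketch of that comparison follows, using made-up labels and predictions rather than real model output:

```python
import numpy as np

y_true_pm = np.array([-1, 1, 1, -1, 1])          # hypothetical labels in {-1, +1}
y_pred01 = np.array([0.0, 1.0, 0.0, 0.0, 1.0])   # hypothetical 0/1 predictions

# Map -1 -> 0 and +1 -> 1 so both arrays use the same encoding.
y_true01 = np.where(y_true_pm == -1, 0, 1)

# Fraction of positions where label and prediction agree.
accuracy = float(np.mean(y_true01 == y_pred01))
print(accuracy)
```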