# 04 Logistic Regression

## 4.1 Introduction

Logistic regression is a *generalized linear model (GLM)* used for estimating the probability of an event occurring. Formally, let $Y$ be a Bernoulli random variable. That is, $Y = 1$ if event $E$ occurs while $Y = 0$ if event $E$ does not occur. What we wish to do is estimate the probability $P(Y = 1)$ using some predictor variables $X_1, X_2, \ldots, X_m$. Mathematically, this simply means we wish to estimate the conditional probability:

$$
P(Y = 1 | X_1, \ldots, X_m)
$$

Logistic regression attempts to estimate this probability by taking a linear model $\beta X := \beta_0 + \beta_1X_1 + \ldots + \beta_mX_m$ and feeding it through a **logistic function** (aka a "sigmoid" function):

$$
P(Y = 1 | X_1, \ldots, X_m) = \frac{1}{1 + e^{-\beta X}}
$$

The inverse of the logistic function is the **logit function** $\text{logit}(p) = \ln\frac{p}{1-p}$, which we can use to isolate the "linearity" in the expression above:

$$
\text{logit}P(Y = 1 | X_1, \ldots, X_m) = \beta X = \beta_0 + \beta_1X_1 + \ldots + \beta_mX_m
$$

Thus, a logistic regression is a linear model for the log-odds $\text{logit}P(Y = 1 | X_1, \ldots, X_m)$, although its coefficients are fit by maximum likelihood rather than by least squares. For this reason, logistic regression is called a **generalized linear model** with $\text{logit}(p)$ called the **link function**.
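
As a quick numerical check of this relationship, here is a minimal sketch in plain Python. The function names `sigmoid`, `logit`, and `odds`, as well as the coefficient values, are illustrative assumptions and not part of any Spark API. The sketch uses the logistic function $\frac{1}{1 + e^{-z}}$, checks that `logit` inverts it, and checks that increasing a predictor by one unit multiplies the odds by $e^{\beta_1}$.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients and a single predictor value (assumed for illustration)
beta_0, beta_1 = -1.0, 0.5
x = 3.0

z = beta_0 + beta_1 * x                        # linear score beta * X
p = sigmoid(z)                                 # P(Y = 1 | X = x)
p_next = sigmoid(beta_0 + beta_1 * (x + 1.0))  # same model, x increased by 1
odds_ratio = odds(p_next) / odds(p)

print(f"linear score = {z}, probability = {p:.4f}, logit(probability) = {logit(p):.4f}")
print(f"odds ratio = {odds_ratio:.4f}, exp(beta_1) = {math.exp(beta_1):.4f}")
```

The second printed line is why logistic regression coefficients are often read as log odds ratios.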

<br>

---

<br>

## 4.2 The Sample Data

In [34]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

SEED = 42
DATA_PATH = "../course_materials/Spark_for_Machine_Learning/Logistic_Regression/"

spark = SparkSession.builder.appName("Logistic Regression").getOrCreate()

df_sample = spark.read.format("libsvm").load(DATA_PATH + "sample_libsvm_data.txt")

In [11]:
print(f"Num. rows: {df_sample.count()}")
df_sample.show(10)

Num. rows: 100
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
+-----+--------------------+
only showing top 10 rows



In [10]:
df_sample.collect()[0][1]

SparseVector(692, {127: 51.0, 128: 159.0, 129: 253.0, 130: 159.0, 131: 50.0, 154: 48.0, 155: 238.0, 156: 252.0, 157: 252.0, 158: 252.0, 159: 237.0, 181: 54.0, 182: 227.0, 183: 253.0, 184: 252.0, 185: 239.0, 186: 233.0, 187: 252.0, 188: 57.0, 189: 6.0, 207: 10.0, 208: 60.0, 209: 224.0, 210: 252.0, 211: 253.0, 212: 252.0, 213: 202.0, 214: 84.0, 215: 252.0, 216: 253.0, 217: 122.0, 235: 163.0, 236: 252.0, 237: 252.0, 238: 252.0, 239: 253.0, 240: 252.0, 241: 252.0, 242: 96.0, 243: 189.0, 244: 253.0, 245: 167.0, 262: 51.0, 263: 238.0, 264: 253.0, 265: 253.0, 266: 190.0, 267: 114.0, 268: 253.0, 269: 228.0, 270: 47.0, 271: 79.0, 272: 255.0, 273: 168.0, 289: 48.0, 290: 238.0, 291: 252.0, 292: 252.0, 293: 179.0, 294: 12.0, 295: 75.0, 296: 121.0, 297: 21.0, 300: 253.0, 301: 243.0, 302: 50.0, 316: 38.0, 317: 165.0, 318: 253.0, 319: 233.0, 320: 208.0, 321: 84.0, 328: 253.0, 329: 252.0, 330: 165.0, 343: 7.0, 344: 178.0, 345: 252.0, 346: 240.0, 347: 71.0, 348: 19.0, 349: 28.0, 356: 253.0, 357: 252.0,

This dataframe contains only 100 rows, but what is interesting is the `features` column which is a 692-dimensional vector. This vector is a **sparse vector** meaning most of its entries are 0, and that the nonzero entries are stored as a dictionary of the form `{ index : value }`. These kinds of sparse vectors are very common for online customer behavior. For example, each entry could represent the amount a customer spent on a specific product on Amazon. Most customers will only ever purchase a very small subset of all the potential products for sale, so we would typically see a very sparse vector such as this.
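
To make the `{ index : value }` representation concrete, here is a small sketch in plain Python. The helper names `to_dense` and `sparse_dot` and the toy 10-dimensional vector are assumptions for illustration only; Spark's own `SparseVector` class takes care of this internally.

```python
def to_dense(size, nonzeros):
    """Expand a {index: value} mapping into a full list of length `size`."""
    dense = [0.0] * size
    for index, value in nonzeros.items():
        dense[index] = value
    return dense

def sparse_dot(nonzeros, weights):
    """Dot product with a weight list that only visits the nonzero entries."""
    return sum(value * weights[index] for index, value in nonzeros.items())

# Toy example: a 10-dimensional vector with three nonzero entries
size = 10
nonzeros = {2: 51.0, 3: 159.0, 7: 253.0}
weights = [0.1] * size  # hypothetical model coefficients

dense = to_dense(size, nonzeros)
print(dense)
print(sparse_dot(nonzeros, weights))
```

The sparse form stores three pairs instead of ten numbers, and a linear score such as $\beta X$ only needs to touch the nonzero entries.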

In [18]:
df_sample.withColumn("label", df_sample["label"].cast('string')).groupBy("label").agg(F.count("label").alias("count")).show()

+-----+-----+
|label|count|
+-----+-----+
|  1.0|   57|
|  0.0|   43|
+-----+-----+



The other important thing to note is that the `label` column is an indicator variable: 1 = positive class, 0 = negative class.

<br>

---

<br>

## 4.3 The `LogisticRegression` Estimator

We wish to build a logistic regression model to classify the rows of the dataframe as either `label = 1` or `label = 0`, using the `features` column as a predictor. To do this, we can use the `LogisticRegression` estimator found in the `pyspark.ml.classification` module.

In [21]:
from pyspark.ml.classification import LogisticRegression

# instantiate a new LogisticRegression Estimator
logistic_estimator = LogisticRegression(
    featuresCol = "features",
    labelCol = "label",
    predictionCol = "prediction"
)

# use the estimator to fit a model on the data
logistic_model = logistic_estimator.fit(df_sample)

In [33]:
print(f"Model Accuracy: {logistic_model.summary.accuracy}")

Model Accuracy: 1.0


Notice that the model achieved 100% prediction accuracy on the same data it was trained on. This is often a warning sign because it can indicate severe overfitting. Recall that for a predictive model, what we actually care about is accuracy on unseen data.

<br>

---

<br>

## 4.4 Train-Test Split and Evaluators

In order to get an estimate for how the model performs on unseen data, we should use a train-test split.

In [66]:
# train-test split
df_train, df_test = df_sample.randomSplit([0.7, 0.3], seed = SEED)
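
One detail worth keeping in mind is that a weighted random split of this kind does not guarantee exactly 70 and 30 rows; the weights are target proportions. The sketch below mimics the idea in plain Python by assigning each row to a side independently. It illustrates the concept under that assumption and is not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(100))  # stand-in for the 100 rows of the data set
train, test = random_split(rows, train_weight=0.7, seed=42)

# The sizes are close to 70/30 but generally not exact
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is the role `SEED` plays in the cell above.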

In [67]:
# instantiate a new LogisticRegression Estimator
logistic_estimator = LogisticRegression(
    featuresCol = "features",
    labelCol = "label",
    predictionCol = "prediction"
)

# use the estimator to fit a model on the data
logistic_model = logistic_estimator.fit(df_train)

In [68]:
print(f"Train Accuracy: {logistic_model.summary.accuracy}")
print(f"Test Accuracy: {logistic_model.evaluate(df_test).accuracy}")

Train Accuracy: 1.0
Test Accuracy: 0.9714285714285714


Surprisingly, we achieve very high accuracy on the testing data as well. This indicates that the underlying data we are predicting on is highly separable.

Accuracy isn't the only metric that matters when evaluating classification models. Other important metrics include:

* Precision: if the model predicts "positive", how likely is it to be a true positive?
* Recall / Sensitivity: amongst all the actual positives, how likely is the model to detect them?
* Specificity: amongst all the actual negatives, how likely is the model to correctly identify them?
* Area Under the ROC Curve (AUC): this measures how well the model's scores rank positives above negatives, summarizing the trade-off between detecting true positives and raising false positives across all classification thresholds
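
The first three definitions can be made concrete with a small sketch in plain Python. The labels and predictions are made up for illustration, and `confusion_counts` is a hypothetical helper, not a Spark function.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn

# Made-up labels and predictions for ten rows
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
predictions = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(labels, predictions)

accuracy    = (tp + tn) / len(labels)
precision   = tp / (tp + fp)  # of the predicted positives, how many are actual positives
recall      = tp / (tp + fn)  # of the actual positives, how many were detected
specificity = tn / (tn + fp)  # of the actual negatives, how many were identified

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

Here the four numbers all differ, which is a reminder that a single accuracy figure can hide quite different error profiles.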

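These metrics do not all come from the same place in Spark. The `BinaryClassificationEvaluator` class computes the area under the ROC curve (and the area under the precision-recall curve), while in recent Spark versions per-label precision and recall are available from the summary object returned by `logistic_model.evaluate`, for example through its `precisionByLabel` and `recallByLabel` attributes.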

In [70]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

logistic_eval = BinaryClassificationEvaluator(
    labelCol = "label",
    metricName = "areaUnderROC" # the AUC is the default metric
)

logistic_results = logistic_model.evaluate(df_test)

logistic_eval.evaluate(logistic_results.predictions)

1.0

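The model achieves a perfect AUC of 1.0 on the testing data, which means its predicted scores rank every positive test row above every negative one, so some classification threshold separates the two classes perfectly. The test accuracy of about 0.971 reported earlier shows that the default threshold of 0.5 is not quite that threshold. Real world data will almost never achieve such a perfect AUC of 1.0, so it's likely the data is just very easy to separate.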
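
The difference between a perfect AUC and perfect accuracy can be illustrated with a short sketch in plain Python. It computes the AUC as the fraction of (positive, negative) pairs that the scores rank correctly, which is one standard way to interpret it. The scores are made up, and `auc_by_ranking` and `accuracy_at` are hypothetical helpers rather than Spark functions.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

def accuracy_at(labels, scores, threshold):
    """Accuracy when predicting 1 for scores strictly above the threshold."""
    predictions = [1 if s > threshold else 0 for s in scores]
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / len(labels)

# Made-up scores: every positive outranks every negative,
# but one positive has a score below 0.5
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.3, 0.2, 0.1]

auc = auc_by_ranking(labels, scores)
acc_default = accuracy_at(labels, scores, 0.5)
acc_lowered = accuracy_at(labels, scores, 0.4)
print(auc, acc_default, acc_lowered)
```

In this toy case the ranking is perfect, the 0.5 threshold misses one positive, and lowering the threshold to 0.4 recovers perfect accuracy.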

<br>

---

<br>

## 4.5 Regularization

Let us pretend that our logistic regression model was, in fact, overfit. We could attempt to regularize it by adding an L1 and/or L2 penalty term to the loss function to restrict the possible range of fitted coefficient values, thereby restricting the complexity of the model. This can be achieved by using the built-in `regParam` and `elasticNetParam` arguments of the `LogisticRegression` estimator.

In [39]:
# instantiate a new LogisticRegression Estimator
# but this time with regularization
logistic_estimator = LogisticRegression(
    featuresCol = "features",
    labelCol = "label",
    predictionCol = "prediction",
    regParam = 0.1,     # weight of the penalty
    elasticNetParam = 0 # L2 vs L1 penalty slider
                        # 0 = L2 penalty only ; 1 = L1 penalty only
)

# use the estimator to fit a model on the data
logistic_model = logistic_estimator.fit(df_train)

print(f"Train Accuracy: {logistic_model.summary.accuracy}")
print(f"Test Accuracy: {logistic_model.evaluate(df_test).accuracy}")

Train Accuracy: 1.0
Test Accuracy: 0.9428571428571428


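Adding a penalty term actually makes the model perform slightly worse on the testing data. With a test set this small, the drop from about 0.971 to 0.943 amounts to roughly one additional misclassified row, so it is weak evidence on its own, but it is consistent with the unregularized model not being overfit and the underlying data being highly separable.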
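
To make the roles of `regParam` and `elasticNetParam` concrete: following the elastic-net form described in the Spark MLlib documentation, the penalty added to the loss is $\lambda\left(\alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2\right)$ with $\lambda$ = `regParam` and $\alpha$ = `elasticNetParam`. The sketch below evaluates that expression in plain Python. `elastic_net_penalty` is a hypothetical helper and the weights are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix

weights = [3.0, -4.0, 0.0]  # made-up coefficients: L1 norm 7, squared L2 norm 25

ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)  # L2 only
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)  # L1 only
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)  # half and half

print(ridge, lasso, mixed)
```

Larger coefficients make every version of the penalty grow, which is how the penalty discourages overly complex fits.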

<br>

---

<br>

## 4.6 Hyperparameter Tuning with Cross-Validation

In the previous section, we chose the weight of `regParam` somewhat arbitrarily. A better approach would be to systematically test different values for `regParam`, based on how the model performs on held-out data with each specific choice of `regParam` value. This is called **hyperparameter tuning** and can be automated by searching through a set of candidate `regParam` values called the **hyperparameter grid**.
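
When more than one hyperparameter is tuned, the grid is the Cartesian product of the candidate values for each one. The following plain-Python sketch shows the idea. The parameter names mirror the estimator arguments used earlier, but the candidate values for `elasticNetParam` are assumptions chosen for illustration.

```python
from itertools import product

# Candidate values per hyperparameter (elasticNetParam values are illustrative)
grid_spec = {
    "regParam": [0.1, 0.01],
    "elasticNetParam": [0.0, 0.5, 1.0],
}

names = list(grid_spec)
value_lists = [grid_spec[name] for name in names]

# One dictionary per combination of candidate values
param_grid = [dict(zip(names, combo)) for combo in product(*value_lists)]

for params in param_grid:
    print(params)
```

Two candidates for one parameter and three for the other give six combinations, so the number of models to fit grows quickly as parameters are added.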

If we were to just evaluate the different choices of `regParam` on our `df_test` data set, we run the risk of overfitting again because we might just happen to find a `regParam` value that works very well on `df_test` specifically, and not on unseen data. To mitigate this kind of issue, we typically want to test out our `regParam` values on several different held-out subsets of the training data. This process is called **$k$-fold cross-validation** and is done like so:

1) Split the training data into $k$ roughly equal parts (folds).
2) Hold out one fold as validation data, fit the model on the remaining $k - 1$ folds, and evaluate it on the held-out fold.
3) Go back to step 2 with a different fold held out, until every fold has been used for validation once.

We go through this loop $k$-many times for each candidate `regParam` value we want to test out. The average validation metric over the $k$-many loops is computed and the optimal `regParam` value is the one with the best average. With `BinaryClassificationEvaluator()` as the evaluator, that metric is the area under the ROC curve rather than accuracy. This process can be done by using the `ParamGridBuilder` and `CrossValidator` classes in the `pyspark.ml.tuning` module.
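
Before turning to the Spark classes, the fold mechanics can be sketched in plain Python. In standard $k$-fold cross-validation the rows are partitioned into $k$ disjoint folds and each fold serves as validation data exactly once. This is a simplified illustration: `k_fold_indices`, `cross_validate`, and `placeholder_score` are hypothetical names, the folds are built by striding rather than by random assignment, and the scoring function is a stand-in for "fit on the training rows, evaluate on the validation rows".

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(n_rows, k, candidates, score_fn):
    """Average score_fn over the k folds for each candidate; return the best one."""
    averages = {}
    for candidate in candidates:
        scores = [score_fn(candidate, train, validation)
                  for train, validation in k_fold_indices(n_rows, k)]
        averages[candidate] = sum(scores) / k
    best = max(averages, key=averages.get)
    return best, averages

def placeholder_score(reg_param, train, validation):
    """Stand-in for a real fit-and-evaluate step; ignores the data entirely."""
    return 1.0 - abs(reg_param - 0.1)

best, averages = cross_validate(n_rows=9, k=3, candidates=[0.1, 0.01],
                                score_fn=placeholder_score)
print(best, averages)
```

Every row lands in exactly one validation fold, so each candidate value is scored on all of the data without ever being evaluated on rows it was fit on.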

In [54]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# instantiate new LogisticRegression estimator
logistic_estimator = LogisticRegression(
    featuresCol = "features",
    labelCol = "label",
    predictionCol = "prediction"
)

# build a hyperparameter grid for
# the hyperparameter regParam
param_grid = ParamGridBuilder()\
    .addGrid(logistic_estimator.regParam, [0.1, 0.01])\
    .build()

# instantiate CrossValidator
crossval = CrossValidator(
    estimator = logistic_estimator,  # the Estimator to fit the models
    estimatorParamMaps = param_grid, # the set of candidate hyperparam values
    evaluator = BinaryClassificationEvaluator(), # scores each held-out fold (areaUnderROC by default)
    numFolds = 3
)

# run 3-fold cross validation
logistic_cv = crossval.fit(df_train)

In [72]:
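# _java_obj is a private handle to the underlying JVM model;
# on Spark 3.0+ the public logistic_cv.bestModel.getRegParam() should return the same value
optimal_regParam = logistic_cv.bestModel._java_obj.getRegParam()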

print(f"Optimal regParam value = {optimal_regParam}")

Optimal regParam value = 0.1


Note: cross-validation can also be used to estimate generalization error, without the need to tune hyperparameters. If we just want to do CV without hyperparameter tuning, we can pass a grid built with no `addGrid` calls, i.e. `ParamGridBuilder().build()`, which contains a single empty parameter map.

<br>

---

<br>