In [1]:
# HIDDEN
using CSV
using DataFrames
using MLJ
using Statistics
using StatsPlots
Base.displaysize() = (5, 90)

## Fitting a Logistic Model

Previously, we covered batch gradient descent, an algorithm that iteratively updates $\boldsymbol{\theta}$ to find the loss-minimizing parameters $\boldsymbol{\hat\theta}$. We also discussed stochastic gradient descent and mini-batch gradient descent, methods that take advantage of statistical theory and parallelized hardware to reduce the time gradient descent needs to train a model. In this section, we apply these concepts to logistic regression and walk through examples using scikit-learn models accessed through MLJ.

### Batch Gradient Descent

The general update formula for batch gradient descent is given by:

$$
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \cdot \nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}^{(t)}, \textbf{X}, \textbf{y})
$$

In logistic regression, we use the cross entropy loss as our loss function:

$$
L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(-y_i \ln \left(f_{\boldsymbol{\theta}} \left(\textbf{X}_i \right) \right) - \left(1 - y_i \right) \ln \left(1 - f_{\boldsymbol{\theta}} \left(\textbf{X}_i \right) \right) \right)
$$
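To make the loss concrete, here is a minimal sketch that evaluates it in plain Julia. The names `sigmoid` and `cross_entropy_loss` and the toy data are illustrative assumptions, not part of MLJ or the `emails` dataset.

```julia
using Statistics  # for `mean`

# Logistic (sigmoid) function
sigmoid(z) = 1 / (1 + exp(-z))

# Average cross entropy loss; X is n×p, y holds 0/1 labels, θ has length p
function cross_entropy_loss(θ, X, y)
    p = sigmoid.(X * θ)  # f_θ(X_i) for every row i
    return mean(@. -y * log(p) - (1 - y) * log(1 - p))
end

# Tiny made-up dataset (hypothetical values, for illustration only)
X_toy = [1.0 0.5; 1.0 -1.5; 1.0 2.0; 1.0 -0.3]
y_toy = [1, 0, 1, 0]

loss_at_zero = cross_entropy_loss(zeros(2), X_toy, y_toy)
```

At $\boldsymbol{\theta} = \textbf{0}$ every predicted probability is $0.5$, so the loss equals $\ln 2 \approx 0.693$ regardless of the labels.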

Let $\sigma_i = f_\boldsymbol{\theta}(\textbf{X}_i) = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$. The gradient of the cross entropy loss is then $\nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) = -\frac{1}{n}\sum_{i=1}^n(y_i - \sigma_i)\textbf{X}_i$. Plugging this into the update formula gives the gradient descent algorithm specific to logistic regression:

$$
\begin{align}
\boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \cdot \left(- \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma_i\right) \textbf{X}_i \right) \\
&= \boldsymbol{\theta}^{(t)} + \alpha \cdot \left(\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma_i\right) \textbf{X}_i \right)
\end{align}
$$

- $\boldsymbol{\theta}^{(t)}$ is the current estimate of $\boldsymbol{\theta}$ at iteration $t$
- $\alpha$ is the learning rate
- $-\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma_i\right) \textbf{X}_i$ is the gradient of the cross entropy loss
- $\boldsymbol{\theta}^{(t+1)}$ is the next estimate of $\boldsymbol{\theta}$, computed by subtracting the product of $\alpha$ and the gradient of the cross entropy loss evaluated at $\boldsymbol{\theta}^{(t)}$
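The batch update rule can be sketched in a few lines of plain Julia. This is an illustration under stated assumptions rather than how any library implements it: the function names, the learning rate, the iteration count, and the simulated data are all hypothetical choices.

```julia
using LinearAlgebra, Random, Statistics

sigmoid(z) = 1 / (1 + exp(-z))

# Average cross entropy loss over all observations
function cross_entropy(θ, X, y)
    σ = sigmoid.(X * θ)
    return mean(@. -y * log(σ) - (1 - y) * log(1 - σ))
end

# Gradient of the average cross entropy loss: -(1/n) Σ (y_i - σ_i) X_i
function cross_entropy_gradient(θ, X, y)
    σ = sigmoid.(X * θ)
    return -(X' * (y .- σ)) ./ length(y)
end

# Batch gradient descent: every update uses all n observations
function batch_gradient_descent(X, y; α=0.5, iterations=500)
    θ = zeros(size(X, 2))
    for _ in 1:iterations
        θ -= α .* cross_entropy_gradient(θ, X, y)
    end
    return θ
end

# Simulated data (hypothetical; not the emails dataset)
rng = MersenneTwister(42)
n = 200
X_sim = hcat(ones(n), randn(rng, n, 2))  # intercept column plus two features
θ_true = [-0.5, 2.0, -1.0]
y_sim = Float64.(rand(rng, n) .< sigmoid.(X_sim * θ_true))

θ_hat = batch_gradient_descent(X_sim, y_sim)
initial_loss = cross_entropy(zeros(3), X_sim, y_sim)
final_loss = cross_entropy(θ_hat, X_sim, y_sim)
```

Comparing `final_loss` with `initial_loss` shows whether the updates reduced the loss. Each iteration touches every row of `X_sim`, which is the cost the stochastic variants below try to avoid.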


### Stochastic Gradient Descent

Stochastic gradient descent approximates the gradient of the loss function across all observations using the gradient of the loss of a single data point. The general update formula is below, where $\ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)$ is the loss function for a single data point:

$$
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)
$$

Returning to our logistic regression example, we approximate the gradient of the cross entropy loss across all data points using the gradient of the cross entropy loss of one data point. This is shown below, with $\sigma_i = f_{\boldsymbol{\theta}}(\textbf{X}_i) = \sigma(\textbf{X}_i \cdot \boldsymbol{\theta})$.

$$
\begin{align}
\nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) &\approx \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)\\
&= -(y_i - \sigma_i)\textbf{X}_i
\end{align}
$$

When we plug this approximation into the general formula for stochastic gradient descent, we find the stochastic gradient descent update formula for logistic regression.

$$
\begin{align}
\boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \nabla_\boldsymbol{\theta} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i) \\
&= \boldsymbol{\theta}^{(t)} + \alpha \cdot (y_i - \sigma_i)\textbf{X}_i
\end{align}
$$
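A minimal sketch of this single-observation update in plain Julia follows. The names `sgd_step` and `sgd_epoch`, the learning rate, and the example numbers are assumptions for illustration; `SGDClassifier` uses its own learning rate schedule and stopping rules.

```julia
using LinearAlgebra, Random

sigmoid(z) = 1 / (1 + exp(-z))

# One SGD update from a single observation (x_i, y_i): θ + α (y_i - σ_i) x_i
function sgd_step(θ, x_i, y_i; α=0.1)
    σ_i = sigmoid(dot(x_i, θ))
    return θ .+ α .* (y_i - σ_i) .* x_i
end

# One pass over the data in random order, one update per observation
function sgd_epoch(θ, X, y, rng; α=0.1)
    for i in randperm(rng, size(X, 1))
        θ = sgd_step(θ, X[i, :], y[i]; α=α)
    end
    return θ
end

# Worked single step on made-up numbers (hypothetical, for illustration)
θ_new = sgd_step([0.0, 0.0], [1.0, 2.0], 1; α=0.1)
```

Starting from $\boldsymbol{\theta} = \textbf{0}$ we have $\sigma_i = 0.5$, so with $y_i = 1$ and $\alpha = 0.1$ the step adds $0.1 \cdot 0.5 \cdot \textbf{X}_i$, giving `θ_new ≈ [0.05, 0.1]`.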

### Mini-batch Gradient Descent

Similarly, we can approximate the gradient of the cross entropy loss for all observations using a random sample of data points, known as a mini-batch and denoted $\mathcal{B}$.

$$
\nabla_\boldsymbol{\theta} L(\boldsymbol{\theta}, \textbf{X}, \textbf{y}) \approx \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}, \textbf{X}_i, y_i)
$$

We substitute this approximation for the gradient of the cross entropy loss, yielding a mini-batch gradient descent update formula specific to logistic regression:

$$
\begin{align}
\boldsymbol{\theta}^{(t+1)} &= \boldsymbol{\theta}^{(t)} - \alpha \cdot \left(-\frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}(y_i - \sigma_i)\textbf{X}_i \right) \\
&= \boldsymbol{\theta}^{(t)} + \alpha \cdot \frac{1}{|\mathcal{B}|} \sum_{i\in\mathcal{B}}(y_i - \sigma_i)\textbf{X}_i
\end{align}
$$
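The mini-batch update can be sketched the same way. The code below is an illustration under assumptions, not the implementation behind any MLJ model: the function names, batch size, learning rate, number of epochs, and simulated data are all hypothetical.

```julia
using LinearAlgebra, Random

sigmoid(z) = 1 / (1 + exp(-z))

# Average gradient of the cross entropy loss over the rows in `batch`
function minibatch_gradient(θ, X, y, batch)
    X_b, y_b = X[batch, :], y[batch]
    σ_b = sigmoid.(X_b * θ)
    return -(X_b' * (y_b .- σ_b)) ./ length(batch)
end

# Mini-batch gradient descent: shuffle each epoch, then update once per batch
function minibatch_gradient_descent(X, y, rng; α=0.2, batch_size=20, epochs=50)
    θ = zeros(size(X, 2))
    for _ in 1:epochs
        order = randperm(rng, size(X, 1))
        for batch in Iterators.partition(order, batch_size)
            θ -= α .* minibatch_gradient(θ, X, y, batch)
        end
    end
    return θ
end

# Simulated data (hypothetical; not the emails dataset)
rng = MersenneTwister(7)
n = 200
X_mb = hcat(ones(n), randn(rng, n, 2))  # intercept column plus two features
y_mb = Float64.(rand(rng, n) .< sigmoid.(X_mb * [-0.5, 2.0, -1.0]))

θ_mb = minibatch_gradient_descent(X_mb, y_mb, rng)

# Passing every row as the "batch" recovers the full batch gradient
full_gradient_norm = norm(minibatch_gradient(θ_mb, X_mb, y_mb, 1:n))
```

Setting `batch_size` to 1 reduces this to stochastic gradient descent, and setting it to `n` reduces it to batch gradient descent, so the batch size trades gradient accuracy against the cost of each update.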

## Implementation in MLJ

MLJ provides an interface to the `SGDClassifier` model from the `ScikitLearn` package (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) of the original Python class). Since no available model implements batch gradient descent, we will compare `SGDClassifier`'s performance against `LogisticClassifier` on the `emails` dataset. We omit feature extraction for brevity:

In [2]:
# HIDDEN
using StatsBase  # provides `sample`

# Draw a random subset of rows, keeping each row's columns together
function dataframe_sample(df, frac)
    number_samples = round(Int, frac*nrows(df))
    rows = sample(1:nrows(df), number_samples, replace=false)
    return df[rows, :]
end;

In [3]:
# HIDDEN
using StatsBase

emails = dataframe_sample(CSV.read("emails_sgd.csv", DataFrame), 0.01)
X = emails.email
y = emails.spam

83-element Array{Int64,1}:
 ⋮

In [4]:
# HIDDEN
using TextAnalysis

function create_text_matrix(document_array)
    crps = Corpus(StringDocument.(document_array))
    prepare!(crps, strip_punctuation | strip_case | strip_html_tags | strip_whitespace)
    update_lexicon!(crps)
    m = DocumentTermMatrix(crps)
    return dtm(m, :dense)
end

X_dense_matrix = create_text_matrix(X)

83×8112 Array{Int64,2}:
 ⋮      ⋱  

In [19]:
X_prepared = DataFrame(X_dense_matrix, :auto)[:, 1:3000]
y_prepared = coerce(y, Multiclass)

83-element CategoricalArray{Int64,1,UInt8}:
 ⋮

In [15]:
X_prepared

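| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 18 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |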


In [8]:
@load LogisticClassifier pkg=ScikitLearn
@load SGDClassifier pkg=ScikitLearn

log_cl = LogisticClassifier(tol=0.0001, random_state=42)
sgd_cl = SGDClassifier(tol=0.0001, loss="log", random_state=42)

SGDClassifier(loss = "log",
              penalty = "l2",
              alpha = 0.0001,
              l1_ratio = 0.15,
              fit_intercept = true,
              max_iter = 1000,
              tol = 0.0001,
              shuffle = true,
              verbose = 0,
              epsilon = 0.1,
              n_jobs = nothing,
              random_state = 42,
              learning_rate = "optimal",
              eta0 = 0.0,
              power_t = 0.5,
              early_stopping = false,
              validation_fraction = 0.1,
              n_iter_no_change = 5,
              class_weight = nothing,
              warm_start = false,
              average = false,) @ 1…77

In [9]:
train, test = partition(eachindex(y_prepared), 0.75, shuffle=true)
println("Training set size: ", length(train))
println("    Test set size: ", length(test))

Training set size: 62
    Test set size: 21


In [22]:
log_mac = machine(log_cl, X_prepared, y_prepared)
@time fit!(log_mac, rows=train)

│ scitype(X) = ScientificTypes.Table{AbstractArray{Count,1}}
│ input_scitype(model) = ScientificTypes.Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:Continuous). 
└ @ MLJBase /Users/irinabchan/.julia/packages/MLJBase/sMgCp/src/machines.jl:54
┌ Info: Training Machine{LogisticClassifier} @ 1…19.
└ @ MLJBase /Users/irinabchan/.julia/packages/MLJBase/sMgCp/src/machines.jl:179


  0.375945 seconds (16.99 k allocations: 4.096 MiB)


Machine{LogisticClassifier} @ 1…19


## Summary

Stochastic gradient descent is a method that data scientists use to cut down on computational cost and runtime. Its value is clear in logistic regression: at each iteration we only have to compute the gradient of the cross entropy loss for one observation, instead of for every observation as in batch gradient descent. From the example using scikit-learn's `SGDClassifier`, we observe that stochastic gradient descent may achieve slightly worse evaluation metrics, but drastically improves runtime. On larger datasets or for more complex models, the difference in runtime might be much larger and thus more valuable.