# EM algorithm for logistic regression

The logistic regression model is defined as follows:
\begin{equation}
\mathbb{P}\left(y_i=1 \mid X_i ; \beta\right)=\frac{1}{1+\exp \left(-X_i^T \beta\right)}
\end{equation}
with $y_i \in \{0,1\}$ and $X_i \in \mathbb{R}^{d}$.

The log-likelihood $\ell$ of the data is given by:
\begin{align*}
\ell(\beta)
&= \sum_{i=1}^n \left[y_i \log \mathbb{P}\left(y_i=1 \mid X_i ; \beta\right) + (1-y_i) \log \mathbb{P}\left(y_i=0 \mid X_i ; \beta\right)\right] \\
&= \sum_{i=1}^n \left[y_i \log \frac{1}{1+\exp (-X_i^T \beta)} + (1-y_i) \log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)}\right] \\
&= \sum_{i=1}^n \left[\log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)} + y_i \left[ \log \frac{1}{1+\exp (-X_i^T \beta)} - \log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)}\right]\right] \\
&= \sum_{i=1}^n \left[\log \frac{1}{1+\exp (X_i^T \beta)} + y_i \log \frac{1}{\exp (-X_i^T \beta)}\right] \\
&= \sum_{i=1}^n \left[- \log \left[1+\exp (X_i^T \beta)\right] + y_i X_i^T \beta\right]
\end{align*}
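
The simplification can be checked numerically. The following is a minimal sketch that is not part of the original notebook: the function names `log_likelihood_definition` and `log_likelihood_simplified` and the small random data set are illustrative assumptions. `np.logaddexp(0, z)` evaluates $\log(1+\exp(z))$ in a numerically stable way.

```python
import numpy as np


def log_likelihood_definition(beta, X, y):
    """First line of the derivation: Bernoulli log-likelihood with sigmoid probabilities."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def log_likelihood_simplified(beta, X, y):
    """Last line of the derivation: sum_i [-log(1 + exp(X_i^T beta)) + y_i X_i^T beta]."""
    z = X @ beta
    return np.sum(-np.logaddexp(0.0, z) + y * z)


# Small illustrative data set (hypothetical, only used for this check)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
beta_demo = np.array([0.1, 0.5, 0.9])
y_demo = (rng.random(50) < 1.0 / (1.0 + np.exp(-X_demo @ beta_demo))).astype(int)

ll_def = log_likelihood_definition(beta_demo, X_demo, y_demo)
ll_simple = log_likelihood_simplified(beta_demo, X_demo, y_demo)
print(np.isclose(ll_def, ll_simple))
```

Both functions should agree up to floating-point error; the simplified form is the one to prefer in an optimizer because it avoids taking the logarithm of probabilities close to 0 or 1.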

Furthermore, when some responses are missing, the observed log-likelihood $\ell_{obs}$ only involves the rows where $y_i$ is observed, because a missing $y_i$ sums out of the likelihood ($\sum_{y_i \in \{0,1\}} \mathbb{P}\left(y_i \mid X_i ; \beta\right) = 1$). Writing $\mathcal{O}$ for the set of indices $i$ with observed $y_i$, it is given by:
\begin{align*}
\ell_{obs}(\beta)
&= \sum_{i \in \mathcal{O}} \left[- \log \left[1+\exp (X_i^T \beta)\right] + y_i X_i^T \beta\right]
\end{align*}
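
For reference, one EM iteration for this model can be written as follows, assuming that only the responses $y_i$ may be missing and that the $X_i$ are fully observed. The notation $\beta^{(t)}$, $\tilde{y}_i^{(t)}$ and $Q$ is introduced here for illustration. Since the complete-data log-likelihood is linear in $y_i$, the E-step only replaces each missing $y_i$ by its conditional expectation under the current parameter, and the M-step maximizes the resulting expected log-likelihood:
\begin{align*}
\tilde{y}_i^{(t)} &=
\begin{cases}
y_i & \text{if } y_i \text{ is observed,} \\
\dfrac{1}{1+\exp \left(-X_i^T \beta^{(t)}\right)} & \text{if } y_i \text{ is missing,}
\end{cases} \\
Q\left(\beta \mid \beta^{(t)}\right) &= \sum_{i=1}^n \left[- \log \left[1+\exp (X_i^T \beta)\right] + \tilde{y}_i^{(t)} X_i^T \beta\right], \\
\beta^{(t+1)} &= \operatorname*{arg\,max}_{\beta} \; Q\left(\beta \mid \beta^{(t)}\right).
\end{align*}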

In [1]:
import numpy as np
from scipy.optimize import minimize




Apply the EM algorithm to the synthetic data defined below. Compare the following estimators of $\beta$ by computing the MSE:

- (i) without missing values (y, X),
- (ii) with missing values (yna, X), using only the rows which do not contain missing values,
- (iii) with missing values (yna, X), using the EM algorithm.

Note that in (i) and (ii), you just have to use the function glm (in Python, scikit-learn's `LogisticRegression` plays this role). Here, you can assume that the intercept is zero.
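
The EM implementation used later in this notebook is imported from a local module and its code is not shown. As a rough illustration of what estimator (iii) could look like, here is a minimal sketch, assuming that only `y` contains missing values (encoded as `NaN`) and that there is no intercept. The function name `fit_logistic_em`, its parameters and the demo data are hypothetical and are not the author's implementation. The E-step replaces each missing response by its current fitted probability, and the M-step maximizes the expected log-likelihood with `scipy.optimize.minimize`.

```python
import numpy as np
from scipy.optimize import minimize


def fit_logistic_em(X, y, n_iter=50, tol=1e-8):
    """EM sketch for logistic regression (no intercept) when some entries of y are NaN."""
    missing = np.isnan(y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # E-step: expected value of each missing y_i under the current beta.
        y_tilde = np.where(missing, 1.0 / (1.0 + np.exp(-X @ beta)), y)

        # M-step: minimize -Q(beta) = sum_i [log(1 + exp(z_i)) - y_tilde_i * z_i].
        def neg_q(b):
            z = X @ b
            return np.sum(np.logaddexp(0.0, z) - y_tilde * z)

        def neg_q_grad(b):
            return X.T @ (1.0 / (1.0 + np.exp(-X @ b)) - y_tilde)

        beta_new = minimize(neg_q, beta, jac=neg_q_grad, method="BFGS").x
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Hypothetical demo data: 500 rows, 100 responses removed completely at random.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
beta_demo = np.array([0.1, 0.5, 0.9])
y_demo = (rng.random(500) < 1.0 / (1.0 + np.exp(-X_demo @ beta_demo))).astype(float)
y_demo[rng.choice(500, size=100, replace=False)] = np.nan

beta_em = fit_logistic_em(X_demo, y_demo)
print(beta_em)
```

When only the response is missing and `X` is fully observed, the rows with missing $y_i$ carry no information about $\beta$, so this iteration should converge to the same estimate as a fit on the complete cases only.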


In [2]:
import numpy as np
from numpy.random import multivariate_normal, random

# Set seed
np.random.seed(1)

# Parameters
d = 3
beta_true = [0.1, 0.5, 0.9]
n = 1000
mu = np.zeros(d)
Sigma = np.eye(d) + 0.5 * np.ones((d, d))

# Generate multivariate Gaussian variable
X = multivariate_normal(mean=mu, cov=Sigma, size=n)

# Compute success probabilities (inverse logit of the linear predictor)
logit_link = 1 / (1 + np.exp(-np.dot(X, beta_true)))

# Generate binary response variable
y = (random(n) < logit_link).astype(int)

# Introduction of MCAR values in y only
# Note: 0.2 * n * d = 600, so 60% of the n = 1000 responses are set to NaN
nb_missingvalues = int(0.2 * n * d)
missing_idx = np.random.choice(range(n), nb_missingvalues, replace=False)
yna = np.where(np.isin(range(n), missing_idx), np.nan, y)

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from exercise2 import LogisticRegressionEM

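# (i) Without missing values (y, X), using LogisticRegression as the Python equivalent of glm
# Note: scikit-learn applies L2 regularization (C=1.0) by default
# Fit the model
model = LogisticRegression(fit_intercept=False, solver="lbfgs")
model.fit(X, y)

# Compute the MSE between predicted and true labels
# (for 0/1 labels this is the misclassification rate)
y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)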

# (ii) With missing values (yna, X), using only the rows where yna is observed
# Fit the model
observed = ~np.isnan(yna)
model = LogisticRegression(fit_intercept=False, solver="lbfgs")
model.fit(X[observed], yna[observed].astype(int))

# Compute the MSE between predicted labels and the true labels y on all rows
y_pred = model.predict(X)
mse_na = mean_squared_error(y, y_pred)

# (iii) With missing values (yna, X), using a custom EM algorithm
# (LogisticRegressionEM comes from the local module exercise2)
# Fit the model
model = LogisticRegressionEM(X=X, y=yna, missing_mask=np.isnan(yna))
model.fit()

# Compute the MSE between predicted labels and the true labels y on all rows
y_pred = model.predict(X)
mse_em = mean_squared_error(y, y_pred)

# Print results in a table
from tabulate import tabulate

print(
  tabulate(
    [
      ["MSE without missing values", mse],
      ["MSE with missing values", mse_na],
      ["MSE with missing values (EM)", mse_em],
    ],
    headers=["Method", "MSE"],
  )
)


Method                          MSE
----------------------------  -----
MSE without missing values    0.275
MSE with missing values       0.276
MSE with missing values (EM)  0.31
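
The values in this table are mean squared errors between predicted and true labels, which for 0/1 labels is the misclassification rate. The exercise statement asks for the MSE of the estimators of $\beta$ themselves. A minimal sketch of that comparison for estimators (i) and (ii) could look as follows. It regenerates data with the same parameters but with NumPy's `Generator` API, so it does not reproduce the notebook's data or the table, and it assumes that 20% of the responses are missing. The helper name `coef_mse` and the `_demo` variables are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def coef_mse(model, beta_true):
    """Mean squared error between the fitted coefficients and the true beta."""
    return float(np.mean((model.coef_.ravel() - np.asarray(beta_true)) ** 2))


# Synthetic data with the same parameters as the notebook (different random stream)
rng = np.random.default_rng(1)
n_demo, d_demo = 1000, 3
beta_true_demo = np.array([0.1, 0.5, 0.9])
Sigma_demo = np.eye(d_demo) + 0.5 * np.ones((d_demo, d_demo))
X_demo = rng.multivariate_normal(mean=np.zeros(d_demo), cov=Sigma_demo, size=n_demo)
p_demo = 1.0 / (1.0 + np.exp(-X_demo @ beta_true_demo))
y_demo = (rng.random(n_demo) < p_demo).astype(int)

# Assumption: 20% of the responses are missing completely at random.
observed_demo = np.ones(n_demo, dtype=bool)
observed_demo[rng.choice(n_demo, size=int(0.2 * n_demo), replace=False)] = False

# (i) all rows, (ii) complete cases only
model_full = LogisticRegression(fit_intercept=False, solver="lbfgs").fit(X_demo, y_demo)
model_cc = LogisticRegression(fit_intercept=False, solver="lbfgs").fit(
    X_demo[observed_demo], y_demo[observed_demo]
)

mse_beta_full = coef_mse(model_full, beta_true_demo)
mse_beta_cc = coef_mse(model_cc, beta_true_demo)
print(mse_beta_full, mse_beta_cc)
```

Comparing coefficients rather than predicted labels measures how well each estimator recovers $\beta$, which is what the three estimators are meant to be judged on.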


What do we notice for estimators (ii) and (iii)?

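We notice that the MSE is higher for the EM algorithm (0.31) than for the other two estimators (0.275 and 0.276). Here only the response is missing, so the rows with a missing $y_i$ carry no information about $\beta$: the EM algorithm has to impute responses it cannot learn anything from, and it cannot improve on the complete-case fit. Howev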