# EM algorithm for logistic regression

The logistic regression model is defined as follows:
\begin{equation*}
\mathbb{P}\left(y_i=1 \mid X_i ; \beta\right)=\frac{1}{1+\exp \left(-X_i^T \beta\right)}
\end{equation*}
with $y_i \in \{0,1\}$ and $X_i \in \mathbb{R}^{d}$.

The log-likelihood $\ell$ of the data is given by:
\begin{align*}
\ell(\beta)
&= \sum_{i=1}^n \left[y_i \log \mathbb{P}\left(y_i=1 \mid X_i ; \beta\right) + (1-y_i) \log \mathbb{P}\left(y_i=0 \mid X_i ; \beta\right)\right] \\
&= \sum_{i=1}^n \left[y_i \log \frac{1}{1+\exp (-X_i^T \beta)} + (1-y_i) \log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)}\right] \\
&= \sum_{i=1}^n \left[\log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)} + y_i \left[ \log \frac{1}{1+\exp (-X_i^T \beta)} - \log \frac{\exp (-X_i^T \beta)}{1+\exp (-X_i^T \beta)}\right]\right] \\
&= \sum_{i=1}^n \left[\log \frac{1}{1+\exp (X_i^T \beta)} + y_i \log \frac{1}{\exp (-X_i^T \beta)}\right] \\
&= \sum_{i=1}^n \left[- \log \left[1+\exp (X_i^T \beta)\right] + y_i X_i^T \beta\right]
\end{align*}
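The simplification above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `loglik_direct` and `loglik_simplified` and the random test data are illustrative and not part of the exercise.

```python
import numpy as np


def loglik_direct(beta, X, y):
    """First line of the derivation: sum of y*log(p) + (1-y)*log(1-p)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def loglik_simplified(beta, X, y):
    """Last line of the derivation: sum of y * x^T beta - log(1 + exp(x^T beta))."""
    z = X @ beta
    # np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    return np.sum(y * z - np.logaddexp(0.0, z))


# Illustrative data (not the data of the exercise).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta = np.array([0.1, 0.5, 0.9])
y = rng.integers(0, 2, size=50)

print(np.isclose(loglik_direct(beta, X, y), loglik_simplified(beta, X, y)))
```

The simplified form is also the more convenient one in practice, since it avoids taking the logarithm of a probability that may underflow to zero.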

Furthermore, the full log-likelihood, taking into account missing data, $\ell_{full}$ is given by:
\begin{align*}
\ell_{full}(\beta)
&= \log \prod_{i=1}^{n} \left\{ \mathbb{P}(y_i \mid x_i, \beta)^{\boldsymbol{1}_{(y_i \text{ observed})}} \mathbb{P}(y_i \mid x_i, \beta)^{\boldsymbol{1}_{(y_i \text{ missing})}} \right\} \\
&= \sum_{i=1}^{n} \left\{ \boldsymbol{1}_{(y_i \text{ observed})} \log \mathbb{P}(y_i \mid x_i, \beta) + \boldsymbol{1}_{(y_i \text{ missing})} \log \mathbb{P}(y_i \mid x_i, \beta) \right\} \\
&= \sum_{i=1}^{n} \left\{ \boldsymbol{1}_{(y_i \text{ observed})} \left[ y_i \log \left( \frac{1}{1+e^{-x_i^T \beta}} \right) + (1-y_i) \log \left( 1 - \frac{1}{1+e^{-x_i^T \beta}} \right) \right] \right. \\
&\quad + \left. \boldsymbol{1}_{(y_i \text{ missing})} \left[ \mathbb{E}\left[ y_i \mid x_i, \beta, y_{obs} \right] \log \left( \frac{1}{1+e^{-x_i^T \beta}} \right) + \left(1-\mathbb{E}\left[ y_i \mid x_i, \beta, y_{obs} \right]\right) \log \left( 1 - \frac{1}{1+e^{-x_i^T \beta}} \right) \right] \right\}
\end{align*}

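where $y_i \in \{0,1\}$ is the label of the $i$-th sample, $x_i$ is the covariate vector written $X_i$ above, and $y_{obs}$ is the vector of observed labels. In the last line, each missing $y_i$ is replaced by its conditional expectation.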

We now want to show that the E-step can be written as follows:
\begin{equation*}
Q(\beta; \beta^{(r)}) = \sum_{i=1}^n Q_i(\beta; \beta^{(r)})
\end{equation*}
where $\beta^{(r)}$ is the current estimate of $\beta$, and $Q_i(\beta; \beta^{(r)})$ is given by:
\begin{equation*}
Q_i(\beta; \beta^{(r)}) =
\begin{cases}
\sum_{y_i \in \{0, 1\}} \mathbb{P}(y_i \mid x_i; \beta^{(r)}) \log \mathbb{P}(y_i \mid x_i; \beta) & \text{if } y_i \text{ is missing,}\\
\log \mathbb{P}(y_i \mid x_i; \beta) & \text{otherwise.} \\
\end{cases}
\end{equation*}
Here is the proof:
\begin{align*}
Q(\beta; \beta^{(r)})
&= \ldots
\end{align*}
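
The proof above is left elided. One possible way to fill it in is sketched below, under two assumptions that are not stated explicitly in the text: the labels are conditionally independent given the covariates, and the missingness mechanism is MCAR, so it carries no information on $\beta$. The notation $X$ for the full covariate matrix is introduced here for convenience.
\begin{align*}
Q(\beta; \beta^{(r)})
&= \mathbb{E}\left[\ell_{full}(\beta) \mid y_{obs}, X; \beta^{(r)}\right] \\
&= \sum_{i=1}^n \mathbb{E}\left[\log \mathbb{P}(y_i \mid x_i; \beta) \mid y_{obs}, X; \beta^{(r)}\right] \\
&= \sum_{i:\, y_i \text{ observed}} \log \mathbb{P}(y_i \mid x_i; \beta) + \sum_{i:\, y_i \text{ missing}} \; \sum_{y_i \in \{0,1\}} \mathbb{P}(y_i \mid x_i; \beta^{(r)}) \log \mathbb{P}(y_i \mid x_i; \beta)
\end{align*}
The second line uses the linearity of the expectation. In the third line, an observed $y_i$ is fixed by the conditioning on $y_{obs}$, so its term is left unchanged, while a missing $y_i$ is independent of $y_{obs}$ given $x_i$, so its conditional distribution is $\mathbb{P}(y_i \mid x_i; \beta^{(r)})$. Each summand is the corresponding $Q_i(\beta; \beta^{(r)})$.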



Apply the EM algorithm to the synthetic data defined below. Compare the following estimators for $\beta$ by computing the MSE:

- (i) without missing values (y, X),
- (ii) with missing values (yNA , X) by using only the rows which do not contain missing values,
- (iii) with missing values (yNA , X) by using the EM algorithm.

Note that for (i) and (ii), it suffices to use the function glm (here, its Python counterpart). The intercept can be taken to be zero.
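
Estimator (iii) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the `LogisticRegressionEM` class imported from `exercise2`, whose source is not shown here. The names `sigmoid`, `fit_soft_labels` and `em_logistic` are hypothetical, the M-step is solved with Newton's method (an assumed choice), no regularisation is applied, and the synthetic data use 20% missing labels.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, written with tanh to avoid overflow in exp."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def fit_soft_labels(X, y_soft, beta_init, n_newton=50, tol=1e-10):
    """Maximise sum_i [y_soft_i * x_i^T beta - log(1 + exp(x_i^T beta))] by Newton's method."""
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_newton):
        p = sigmoid(X @ beta)
        grad = X.T @ (y_soft - p)                    # gradient of the objective
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # minus its Hessian
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def em_logistic(X, y, n_iter=500, tol=1e-8):
    """EM for logistic regression (no intercept) where missing labels in y are NaN."""
    missing = np.isnan(y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # E-step: a missing y_i is replaced by P(y_i = 1 | x_i; beta^(r)).
        y_soft = np.where(missing, sigmoid(X @ beta), y)
        # M-step: maximise Q(beta; beta^(r)), i.e. a logistic fit on soft labels.
        beta_new = fit_soft_labels(X, y_soft, beta)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Synthetic data in the spirit of the exercise (20% of the labels missing, MCAR).
rng = np.random.default_rng(0)
n, beta_true = 2000, np.array([0.1, 0.5, 0.9])
X = rng.multivariate_normal(np.zeros(3), np.eye(3) + 0.5 * np.ones((3, 3)), size=n)
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)
yna = y.copy()
yna[rng.choice(n, size=int(0.2 * n), replace=False)] = np.nan
obs = ~np.isnan(yna)

beta_full = fit_soft_labels(X, y, np.zeros(3))            # (i) complete data
beta_cc = fit_soft_labels(X[obs], yna[obs], np.zeros(3))  # (ii) complete cases
beta_em = em_logistic(X, yna)                             # (iii) EM

for name, b in [("full", beta_full), ("complete-case", beta_cc), ("EM", beta_em)]:
    print(f"{name:>13}: MSE on beta = {np.mean((b - beta_true) ** 2):.5f}")
```

Here the MSE is computed on the coefficients, against `beta_true`. At a fixed point of this iteration, the soft labels of the missing rows equal their own fitted probabilities, so those rows contribute nothing to the score equations. The EM and complete-case estimates of this sketch should therefore agree up to the convergence tolerance.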


In [7]:
import numpy as np
import pandas as pd
from exercise2 import LogisticRegressionEM
from numpy.random import multivariate_normal, random
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from tabulate import tabulate


In [8]:
# Set seed
np.random.seed(42)

# Parameters
d = 3
beta_true = [0.1, 0.5, 0.9]
n = 1000
mu = np.zeros(d)
sigma = np.eye(d) + 0.5 * np.ones((d, d))

# Generate multivariate Gaussian variable
X = multivariate_normal(mean=mu, cov=sigma, size=n)

# Compute logit link
logit_link = 1 / (1 + np.exp(-np.dot(X, beta_true)))

# Generate binary response variable
y = (random(n) < logit_link).astype(int)

# Introduction of MCAR values in the response
# Note: with d = 3 this draws 0.2 * n * d = 600 indices, i.e. 60% of the labels are set to missing
nb_missingvalues = int(0.2 * n * d)
missing_idx = np.random.choice(range(n), nb_missingvalues, replace=False)
yna = np.where(np.isin(range(n), missing_idx), np.nan, y)


In [9]:
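# Fit on the complete data (y, X) with scikit-learn's LogisticRegression as a stand-in for glm
# Note: LogisticRegression applies an L2 penalty by default (C=1.0), unlike an unpenalised glm fit
model = LogisticRegression(fit_intercept=False, solver="lbfgs", max_iter=1000, tol=1e-6)
model.fit(X, y)

# Compute the MSE of the predicted labels against y (for 0/1 labels this is the misclassification rate)
y_pred = model.predict(X)
mse = mean_squared_error(y, y_pred)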


# Compute the MSE with missing values (yna , X) by using only the rows which do not contain missing values
# Fit the model
model = LogisticRegression(fit_intercept=False, solver="lbfgs", max_iter=1000, tol=1e-6)
model.fit(X[~np.isnan(yna)], yna[~np.isnan(yna)])

# Compute the MSE
y_pred = model.predict(X)
mse_na = mean_squared_error(y, y_pred)


# Compute the MSE with missing values (yna , X) by using a custom EM algorithm
# Fit the model
model = LogisticRegressionEM(X=X, y=yna, missing_mask=np.isnan(yna))
model.fit()

# Compute the MSE
y_pred = model.predict(X)
mse_em = mean_squared_error(y, y_pred)


# Print results in a table (tabulate is already imported above)
print(
  tabulate(
    [
      ["MSE without missing values", mse],
      ["MSE with missing values", mse_na],
      ["MSE with missing values (EM)", mse_em],
    ],
    headers=["Method", "MSE"],
  )
)


Method                          MSE
----------------------------  -----
MSE without missing values    0.271
MSE with missing values       0.273
MSE with missing values (EM)  0.373


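What do we notice for estimators (ii) and (iii)?

In this run, the MSE is higher for estimator (iii), the EM fit (0.373), than for estimator (ii), the complete-case fit (0.273), which is itself close to the value obtained without missing values (0.271). Note that these values are mean squared errors of the predicted labels against `y`, i.e. misclassification rates, rather than errors on the estimate of $\beta$.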


### Prostate cancer data


In [10]:
# Set seed
np.random.seed(1)

# Read data
canpros = pd.read_csv("cancerprostate.csv", sep=";")

# Select quantitative variables
quanti_var = [0, 1, 5, 6]
canpros = canpros.iloc[:, quanti_var]

# Define response variable and predictor matrix
n = canpros.shape[0]
y = canpros["Y"]
X = np.hstack((np.ones((n, 1)), canpros[["age", "acide", "log.acid"]]))
d = X.shape[1]

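# Introduction of MCAR values in the response
# Note: d = 4 here (constant column included), so 0.2 * n * d = 0.8 * n: about 80% of the labels are set to missing
nb_missingvalues = int(0.2 * n * d)
missing_idx = np.random.choice(range(n), nb_missingvalues, replace=False)
yna = y.astype(float)  # float copy so that NaN can be stored
yna.iloc[missing_idx] = np.nan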


In [11]:
# Apply the EM algorithm and compare different estimators for beta
# Fit the model
model = LogisticRegressionEM(X=X, y=yna, missing_mask=np.isnan(yna))
model.fit()
beta_em = model.beta

# Compute MSE
y_pred_em = model.predict(X)
mse_em = mean_squared_error(y, y_pred_em)


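# Compute beta with missing values by using only the rows which do not contain missing values, with intercept
# Note: X already contains a column of ones, so with fit_intercept=True the fitted intercept is stored in
# model.intercept_ and coef_[0] (reported as beta_0 below) is the L2-penalised weight of that constant column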
# Fit the model
model = LogisticRegression(fit_intercept=True, max_iter=1000, tol=1e-6)
model.fit(X[~np.isnan(yna)], yna[~np.isnan(yna)])
beta_na = model.coef_

# Compute MSE
y_pred_na = model.predict(X)
mse_na = mean_squared_error(y, y_pred_na)


# Compute beta without missing values, with intercept
model = LogisticRegression(fit_intercept=True, max_iter=1000, tol=1e-6)
model.fit(X, y)
beta_full = model.coef_

# Compute MSE
y_pred_full = model.predict(X)
mse_full = mean_squared_error(y, y_pred_full)

# Print results in a table
print(
  tabulate(
    [
      ["beta with missing values (EM)", *beta_em.round(4), mse_em],
      ["beta with missing values", *beta_na[0].round(4), mse_na],
      ["beta without missing values", *beta_full[0].round(4), mse_full],
    ],
    headers=["Method", *[f"beta_{i}" for i in range(len(beta_em))], "MSE"],
  )
)


Method                           beta_0    beta_1    beta_2    beta_3       MSE
-----------------------------  --------  --------  --------  --------  --------
beta with missing values (EM)    1.1419   -1.1296   -0.8501    0.9608  0.377358
beta with missing values         0.0006   -0.0782    0.0985    0.1818  0.377358
beta without missing values      0        -0.0507    0.4026    0.9858  0.339623



#### Discuss the link with semi-supervised learning

The EM algorithm can be used as a semi-supervised learning algorithm. Semi-supervised learning fits a model from both labeled and unlabeled observations, which is exactly the setting here: the covariates `X` are available for every row, but the response `y` is missing for some of them, and EM estimates the parameters of the model using all rows. This situation is common in practice when labeling is expensive, so that only a small number of labeled observations are available alongside a large number of unlabeled ones. This may be the case for cancer data, where medical doctors have to label every data point by hand, which is very time consuming.
