# 4.2. Linear Regression of an Indicator Matrix

Here each response category is coded via an indicator variable. For example, with five classes an observation from class 3 is coded as

$$Y = (0, 0, 1, 0, 0).$$

This is also called _one-hot encoding_.

Thus if $\mathcal{G}$ has $K$ classes, there will be $K$ such indicators $Y_k$, $k=1,\cdots,K$, with

\begin{equation}
Y_k = 1 \text{ if } G = k \text{ else } 0.
\end{equation}

These are collected together in a vector $Y=(Y_1,\cdots,Y_K)$, and the $N$ training instances of these form an $N\times K$ *indicator response matrix* $\mathbf{Y}$, which is a matrix of $0$'s and $1$'s, with each row having a single $1$.

For example,

$$
\mathbf{Y} = \begin{bmatrix} 
    1 & 0 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 \\
     & & \vdots & &  \\ 
    0 & 0 & 0 & 0 & 1
 \end{bmatrix}
$$
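
As a minimal sketch of this encoding, assuming the class labels are stored as integers $0,\ldots,K-1$ (the array `labels` below is made up for illustration), the indicator response matrix can be built by indexing the rows of an identity matrix:

```python
import numpy as np

# Hypothetical integer class labels for N = 6 observations and K = 3 classes
labels = np.array([0, 2, 1, 1, 0, 2])
K = 3

# Row k of the K x K identity matrix is the indicator vector of class k,
# so indexing it with the labels yields the N x K indicator response matrix
Y = np.eye(K, dtype=int)[labels]

print(Y)
```

Each row of `Y` contains exactly one `1`, in the column of the observation's class.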

We fit a linear regression model to each of the columns of $\mathbf{Y}$ simultaneously, and the fit is given by

\begin{equation}
\hat{\mathbf{Y}} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{X}\hat{\mathbf{B}}.
\end{equation}

Note that we have a coefficient vector for each response column $\mathbf{y}_k$, and hence a $(p+1)\times K$ coefficient matrix $\hat{\mathbf{B}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{Y}$. Here $\mathbf{X}$ is the model matrix with $p+1$ columns, the first being a column of $1$'s for the intercept.

A new observation with input $x$ is classified as follows:
* Compute the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$-vector.
* Identify the largest component and classify accordingly:  

\begin{equation}
\hat{G}(x) = \arg\max_{k\in\mathcal{G}} \hat{f}_k(x).
\end{equation}
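
The two steps can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `fit_indicator_regression` and `classify` and the toy data are made up here, classes are assumed to be coded as integers $0,\ldots,K-1$, and `np.linalg.lstsq` is used in place of the explicit normal equations.

```python
import numpy as np

def fit_indicator_regression(x, labels, n_classes):
    """Least-squares fit of the N x K indicator matrix on (1, x)."""
    X = np.column_stack([np.ones(len(x)), x])      # model matrix with intercept
    Y = np.eye(n_classes)[labels]                  # indicator response matrix
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # (p+1) x K coefficient matrix
    return B_hat

def classify(x_new, B_hat):
    """Return the index of the largest fitted component for each row of x_new."""
    X_new = np.column_stack([np.ones(len(x_new)), x_new])
    return (X_new @ B_hat).argmax(axis=1)

# Toy data (hypothetical): one feature, two well-separated classes
x_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
g_train = np.array([0, 0, 1, 1])

B_hat = fit_indicator_regression(x_train, g_train, n_classes=2)
pred = classify(np.array([[-1.5], [3.0]]), B_hat)
print(pred)  # [0 1]
```

For this toy data the fitted functions are $\hat{f}_0(x) = 0.5 - 0.3x$ and $\hat{f}_1(x) = 0.5 + 0.3x$, so a point left of the origin goes to class 0 and a point right of it to class 1.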

## Masked class with the regression approach

There is a serious problem with the regression approach when the number of classes $K\ge 3$, especially prevalent when $K$ is large. Because of the rigid nature of the regression model, classes can be *masked* by others. Figure 4.2 illustrates an extreme situation when $K=3$. The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely.

In [1]:
import math
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns 
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_formats = ['svg']

In [58]:
# generate three clusters
size = 300
cluster_means = {
    'class-1': [-4, -4],
    'class-2': [0, 0],
    'class-3': [4, 4]
}
cluster_cov = np.eye(2)
npdata = np.array([])  # sensitive to dtype
nplabel = np.array([])
np.random.seed(789)

for l, v in cluster_means.items():
    const = np.ones((size, 1))  # constant values
    temp = np.random.multivariate_normal(
            v, cluster_cov, size
        )  # feature values, float type 
    label = np.array([l]*size).reshape(-1, 1)  # labels 
    temp = np.hstack((const, temp))  # stack together 
    npdata = np.append(
        npdata, temp
    ).reshape((-1, 3))
    nplabel = np.append(
        nplabel, label
    ).reshape((-1, 1))  # string type 

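# build the frame from the float array first: stacking floats and strings
# with np.hstack would cast every column to string
sdata = pd.DataFrame(npdata, columns=['const', 'x1', 'x2'])
sdata['class'] = nplabel.ravel()  # string labels
sdata.head()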

| | const | x1 | x2 | class |
|---|---|---|---|---|
| 0 | 1.0 | -5.108111402613945 | -4.72571863413936 | class-1 |
| 1 | 1.0 | -3.4771956650459828 | -2.765558103380788 | class-1 |
| 2 | 1.0 | -3.903104149267599 | -4.987922064070484 | class-1 |
| 3 | 1.0 | -3.932767227473897 | -4.592591785578464 | class-1 |
| 4 | 1.0 | -4.931771172550313 | -2.94463723320077 | class-1 |


In [43]:
# create one-hot encoding (integer 0/1 columns, one per class)
y_mat = pd.get_dummies(sdata['class'], dtype=int)
y_mat.head()

| | class-1 | class-2 | class-3 |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |


In [63]:
# fit linear regression
x_mat = npdata[:, :3]
beta = np.linalg.solve(x_mat.T @ x_mat, x_mat.T @ y_mat)
beta

array([[ 0.3350334 ,  0.33335976,  0.33160684],
       [-0.0611274 ,  0.00219793,  0.05892947],
       [-0.05870027, -0.00227578,  0.06097605]])

In [64]:
# compute fitted values for the training data
y_est = x_mat @ beta
y_est.shape

(900, 3)

So far we have carried out the first step of the classification rule: computing the fitted output $\hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}}$, a $K$-vector, for every training point. The remaining step is to identify the largest component and classify accordingly:

\begin{equation}
\hat{G}(x) = \arg\max_{k\in\mathcal{G}} \hat{f}_k(x).
\end{equation}

In [79]:
y_classified = y_est.argmax(axis=1)
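
To check numerically how each class fares under this rule, the following self-contained sketch simulates three clusters like those above and reports, for each true class, the fraction of points assigned to it. It uses its own seed and variable names (`rng`, `features`, `labels`, `recall` are all chosen for this sketch), so it will not reproduce the exact numbers in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch
size = 300
means = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])

# Simulate three unit-covariance Gaussian clusters along the diagonal
features = np.vstack([rng.normal(m, 1.0, size=(size, 2)) for m in means])
labels = np.repeat(np.arange(3), size)  # 0, 1, 2 for the three clusters

# Indicator-matrix regression with an intercept column
X = np.column_stack([np.ones(len(features)), features])
Y = np.eye(3)[labels]
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = (X @ B_hat).argmax(axis=1)

# Fraction of each true class that is assigned its own label
recall = np.array([(pred[labels == k] == k).mean() for k in range(3)])
print(recall)
```

The two outer classes should be recovered almost perfectly, while the middle class should be predicted for only a small fraction of its points, which is the masking effect in numbers.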

One rather formal justification is to view the regression as an estimate of conditional expectation. For the random variable $Y_k$,

\begin{equation}
\begin{aligned}
\text{E}(Y_k|X=x) & = 0 \cdot \text{Pr}(G\neq k|X=x) + 1 \cdot \text{Pr}(G=k|X=x) \\
                  & = \text{Pr}(G=k|X=x).
\end{aligned}
\end{equation}

The real issue is: How good an approximation to conditional expectation is the rather rigid linear regression model? Alternatively, are the $\hat{f}_k(x)$ reasonable estimates of the posterior probabilities $\text{Pr}(G=k|X=x)$, and more importantly, does this matter?
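
One quick way to probe this question is to look at the range of the fitted values. The sketch below uses a made-up data set with one feature and two classes (not the simulated clusters above) and shows that least-squares fitted values can fall below 0 or above 1, so they are not probabilities in the strict sense.

```python
import numpy as np

# Toy data (hypothetical): one feature, two classes
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # intercept plus feature
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Least-squares coefficients and fitted values for the training points
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ B_hat

print(fitted.round(2))
print("min:", fitted.min().round(2), "max:", fitted.max().round(2))
```

Here the fitted functions are $0.5 \mp 0.3x$, so at $x = \pm 2$ the fitted values are $1.1$ and $-0.1$. The argmax rule still classifies all four training points correctly.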

It is straightforward to verify that

\begin{equation}
\sum_{k\in\mathcal{G}}\hat{f}_k(x) = 1
\end{equation}

for any $x$, as long as the model has an intercept (or constant feature).
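
The reason can be written out in one line. The symbols $\mathbf{H}$ (hat matrix), $\mathbf{1}_N$ and $\mathbf{1}_K$ (vectors of ones), and $e_1$ (first unit vector in $\mathbb{R}^{p+1}$) are notation introduced for this derivation. Each row of $\mathbf{Y}$ contains a single $1$, so $\mathbf{Y}\mathbf{1}_K = \mathbf{1}_N$; and since $\mathbf{1}_N$ is the first column of $\mathbf{X}$, the projection $\mathbf{H} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$ leaves it unchanged:

\begin{equation}
\hat{\mathbf{Y}}\mathbf{1}_K = \mathbf{H}\mathbf{Y}\mathbf{1}_K = \mathbf{H}\mathbf{1}_N = \mathbf{1}_N.
\end{equation}

The same argument covers a new input $x$: because $\mathbf{X}e_1 = \mathbf{1}_N$, we have $\hat{\mathbf{B}}\mathbf{1}_K = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{1}_N = e_1$, and therefore $(1, x^T)\hat{\mathbf{B}}\mathbf{1}_K = (1, x^T)e_1 = 1$.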



In [80]:
# check that every row of fitted values sums to 1
assert np.allclose(y_est.sum(axis=1), 1), 'Not all rows sum to 1'

In [None]:
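# figure 4.2 (left panel): the three simulated classes, colored by label
fig, axes = plt.subplots(1, 2, figsize=(8, 5))
for name, group in sdata.groupby('class'):
    axes[0].scatter(
        group['x1'].astype(float), group['x2'].astype(float),
        s=8, label=name
    )
axes[0].legend()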

