## Classification

In many problems the response variable is qualitative (categorical).

Examples of qualitative variables:

\begin{align*}
&\textrm{eye color} \in \{ \textrm{brown}, \textrm{blue}, \textrm{green} \}  \\
&\textrm{email} \in \{\textrm{spam}, \textrm{ham} \} 
\end{align*}

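We now turn to classification, the task of predicting a qualitative response. There are broadly two approaches.

* Find a function $C(X)$ that assigns a class to a given input $X$.
* Estimate the probability that the input $X$ belongs to each class.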

When the qualitative response has only two classes, linear regression also works reasonably well: code one class as 0 and the other as 1.

In this two-class case the approach is equivalent to linear discriminant analysis, which is covered later.
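The two-class case can be sketched with ordinary least squares on a 0/1 response. The synthetic data, the class centres, and the 0.5 cut-off below are all assumptions chosen for illustration; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (assumed for illustration):
# class 0 is centred at -2 and class 1 at +2.
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Ordinary least squares on the 0/1 coding (intercept plus slope).
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify by thresholding the fitted value at 0.5 (assumed cut-off).
y_hat = (A @ coef > 0.5).astype(float)
accuracy = np.mean(y_hat == y)
print(coef, accuracy)
```

Note that the fitted values are not constrained to lie in $[0, 1]$, so they should be read as scores rather than as probabilities.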

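With more than two classes, however, linear regression is hard to apply.

For example, when a patient arrives at the emergency room, the diagnosis could be replaced by the following numbers and then fed to a linear regression.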

$$
    Y = 
\begin{cases}
    1, & \text{if stroke;} \\
    2, & \text{if drug overdose;} \\
    3, & \text{if epileptic seizure.}
\end{cases}
$$

However, this coding imposes an ordering and a distance structure on $Y$, neither of which is a property of a qualitative variable.
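A small sketch of why the numeric coding is arbitrary: every permutation of the labels 1, 2, 3 is an equally valid coding, yet the implied "distance" between two diagnoses changes with the permutation. The dictionary keys simply repeat the three diagnoses from the example; nothing else is assumed.

```python
from itertools import permutations

classes = ["stroke", "drug overdose", "epileptic seizure"]

gaps = set()
for codes in permutations([1, 2, 3]):
    coding = dict(zip(classes, codes))
    # Distance that this particular coding imposes between two diagnoses.
    gap = abs(coding["stroke"] - coding["epileptic seizure"])
    gaps.add(gap)
    print(coding, "-> |stroke - epileptic seizure| =", gap)

# The distance depends on the arbitrary choice of coding.
print(sorted(gaps))
```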

It is therefore better to use methods designed specifically for classification.

### Logistic regression

Consider predicting whether a customer defaults using the input variable balance. The response $Y$ for default is coded as follows.

$$
    Y = 
\begin{cases}
    0, & \text{if No;} \\
    1, & \text{if Yes.} 
\end{cases}
$$

Write the probability of default given $X$ as

$$ p(X) = \mathbb P (Y = 1 | X)$$

Logistic regression assumes the following form.

$$ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} $$

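Under this assumption, $p(X)$ always lies strictly between 0 and 1, whatever the values of $\beta_0$, $\beta_1$ and $X$.

The same model can be rewritten as follows; the left-hand side is called the log odds or logit transformation of $p(X)$.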

$$ \log \left( \frac{p(X)}{1 - p(X)}\right)  = \beta_0 + \beta_1 X $$
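The relation between the logistic form and the log odds can be checked numerically. This is a minimal sketch: the coefficient values and the balances below are hypothetical, chosen only so that the linear predictor spans a useful range.

```python
import numpy as np

def logistic(t):
    """e^t / (1 + e^t), written in the equivalent form 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log odds log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and balances, for illustration only.
beta0, beta1 = -10.0, 0.005
balance = np.array([0.0, 1000.0, 2000.0, 3000.0])

p = logistic(beta0 + beta1 * balance)
print(p)         # every entry lies strictly between 0 and 1
print(logit(p))  # recovers the linear predictor beta0 + beta1 * balance
```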

### Maximum likelihood

Given observations $\{(x_i, y_i)\}$, define the likelihood function in order to estimate the logistic regression parameters $\beta_0, \beta_1$.

$$ \ell (\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i:y_i=0}(1-p(x_i)) $$

Here $p(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$, so the right-hand side is a function of $\beta_0, \beta_1$.

The estimators $\hat \beta_0, \hat \beta_1$ are the values that maximize $\ell (\beta_0, \beta_1)$:

$$ \hat \beta_0, \hat \beta_1 = \arg \max_{\beta_0, \beta_1} \ell (\beta_0, \beta_1) $$

Once the estimates $\hat \beta_0, \hat \beta_1$ are obtained, they can be used to estimate the default probability for a given $X$.

$$ \hat p(X) = \frac{e^{\hat \beta_0 + \hat \beta_1 X}}{1 + e^{\hat \beta_0 + \hat \beta_1 X}} $$
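The maximization can be sketched numerically. The text does not say which optimizer is used, so the example below assumes Newton's method applied to the log-likelihood (maximizing $\log \ell$ gives the same maximizer as $\ell$). The data are synthetic and the "true" coefficients are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data; true_beta is an assumption used only to generate labels.
true_beta = np.array([-1.0, 2.0])
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])   # column of ones carries beta_0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

def log_likelihood(beta):
    eta = X @ beta
    # log of prod p^y (1 - p)^(1 - y) = sum(y * eta - log(1 + e^eta))
    return np.sum(y * eta - np.logaddexp(0.0, eta))

beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)                            # gradient of log-likelihood
    hessian = -(X * (p * (1.0 - p))[:, None]).T @ X  # second derivatives
    beta = beta - np.linalg.solve(hessian, score)    # Newton step

beta_hat = beta
print(beta_hat)

# Plug-in estimate of p(X) at a new input value (x = 0.5 is arbitrary).
x_new = np.array([1.0, 0.5])
print(1.0 / (1.0 + np.exp(-x_new @ beta_hat)))
```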

### Extension to multiple input variables

Logistic regression extends naturally to models with several input variables.

$$ \log \left( \frac{p(X)}{1 - p(X)}\right)  = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p  $$

or, equivalently,

$$ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}} $$


### Gradient vector of logistic regression

The gradient descent method studied earlier also applies to logistic regression.

The cost function is
$$ J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat p_i) + (1 - y_i)\log(1-\hat p_i) \right]$$
where
$$ \hat p_i = \sigma(\theta \cdot x_i) = \frac{1}{1 + \exp( - \theta \cdot x_i)} = \frac{1}{1 + \exp( - \theta_0 - x_{i1} \theta_1 - \cdots -  x_{ip} \theta_p )}. $$

Since the function
$$ \sigma(t) = \frac{1}{1 + \exp(-t)} $$
has derivative
$$ \frac{d}{d t}\sigma(t) = \sigma(t) (1 - \sigma(t))$$
the partial derivatives of the cost function can be computed as follows.

\begin{align*}
\frac{\partial }{\partial \theta_j} J(\theta) &= -\frac{1}{N} \sum_{i=1}^{N} 
\left[ y_i  \frac{ \frac{\partial }{\partial \theta_j} \sigma(\theta \cdot x_i)}{\sigma(\theta \cdot x_i)} + (1 - y_i)  \frac{ \frac{\partial }{\partial \theta_j} \left\{ 1 - \sigma(\theta \cdot x_i) \right\} }{1 - \sigma(\theta \cdot x_i)} \right]\\
&= -\frac{1}{N} \sum_{i=1}^{N} 
\left[ y_i  \frac{\sigma(\theta \cdot x_i)(1 - \sigma(\theta \cdot x_i)) }{\sigma(\theta \cdot x_i)}x_{ij} - (1 - y_i)  \frac{ \sigma(\theta \cdot x_i)(1 - \sigma(\theta \cdot x_i)) } {1 - \sigma(\theta \cdot x_i)}x_{ij} \right] \\
&= - \frac{1}{N}  \sum_{i=1}^{N} \left[ y_i - \sigma(\theta \cdot x_i) y_i - \sigma(\theta \cdot x_i) + \sigma(\theta \cdot x_i) y_i \right] x_{ij} \\
& = \frac{1}{N}  \sum_{i=1}^{N} \left[ \sigma(\theta \cdot x_i) - y_i\right] x_{ij} \\
& = \frac{1}{N}  \mathbf x_j^T ( \sigma(\mathbf X \theta) - \mathbf y).
\end{align*}
Here $\mathbf x_j$ denotes the $j$-th column of the design matrix $\mathbf X$, and $\sigma$ is applied elementwise to $\mathbf X \theta$.

Hence the gradient vector of the cost function is
\begin{equation*}
\nabla_\theta J (\theta) =  \frac{1}{N} \mathbf X^T ( \sigma(\mathbf X \theta) - \mathbf y).
\end{equation*}
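The gradient formula can be checked against central finite differences and then used in plain batch gradient descent. This is a sketch on synthetic data; the generating coefficients, the learning rate `eta`, and the iteration count are assumptions, not values from the text.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(theta, X, y):
    """Cross-entropy cost J(theta)."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    """(1/N) X^T (sigma(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
N, n_features = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, n_features))])
# Labels drawn from a logistic model with made-up coefficients.
y = (rng.random(N) < sigmoid(X @ np.array([0.5, 1.0, -2.0, 0.0]))).astype(float)

# 1) Compare the analytic gradient with central finite differences.
theta0 = 0.1 * rng.normal(size=n_features + 1)
eps = 1e-6
numeric = np.array([
    (cost(theta0 + eps * e, X, y) - cost(theta0 - eps * e, X, y)) / (2 * eps)
    for e in np.eye(n_features + 1)
])
max_error = np.max(np.abs(numeric - gradient(theta0, X, y)))
print(max_error)

# 2) Batch gradient descent starting from zero (eta and 2000 steps are assumed).
theta = np.zeros(n_features + 1)
eta = 0.5
for _ in range(2000):
    theta -= eta * gradient(theta, X, y)
print(theta, cost(theta, X, y))
```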

In [5]:
from sklearn import datasets
raw_cancer = datasets.load_breast_cancer()
X = raw_cancer.data
y = raw_cancer.target

In [21]:
import pandas as pd
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [22]:
pd.DataFrame(y)

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
564,0
565,0
566,0
567,0


In [10]:
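from sklearn.model_selection import train_test_split
# Default split is 75% train / 25% test; no random_state is set, so the split and the scores below vary between runs.
X_tn, X_te, y_tn, y_te = train_test_split(X, y)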

In [14]:
from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to both sets.
std_scale.fit(X_tn)
X_tn_std = std_scale.transform(X_tn)
X_te_std = std_scale.transform(X_te)

In [23]:
pd.DataFrame(X_te_std)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,-0.213959,0.312546,-0.143552,-0.282540,1.028572,0.853958,0.712142,0.840172,1.125336,1.553567,...,0.019060,0.663968,0.172169,-0.073844,1.087024,0.875052,1.217003,1.370438,1.089112,1.539283
1,-0.267507,1.461224,-0.329552,-0.334762,-0.611043,-1.019675,-0.776920,-0.734122,-0.671484,-0.990173,...,-0.402289,1.418399,-0.476612,-0.434972,-0.157330,-0.965829,-0.658579,-0.842661,-0.715774,-0.881060
2,-0.039223,-0.867702,-0.104631,-0.144208,-1.207201,-0.945401,-0.864264,-0.583501,-0.779944,-0.987251,...,-0.287748,-1.044644,-0.322154,-0.339355,-1.270700,-0.996148,-1.044194,-0.505318,-1.202982,-0.924943
3,0.028417,-0.258150,-0.037851,-0.070319,-2.211637,-1.016712,-0.814790,-0.913113,-0.613639,-0.987251,...,-0.019803,-0.062398,-0.048906,-0.116018,-1.661471,-0.238177,-0.570251,-0.609984,-0.412060,-0.381463
4,-0.318237,-0.197438,-0.390596,-0.373929,-0.472301,-1.303930,-0.803697,-0.513607,-1.221015,-0.582532,...,-0.617055,-0.466852,-0.677937,-0.583522,-1.549698,-1.362017,-1.116219,-0.994015,-1.438677,-1.229315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,2.550816,1.878925,2.513711,2.809944,-0.092205,1.274710,1.356067,1.922886,0.376962,0.069110,...,3.005319,1.464622,2.904653,3.511144,0.680972,1.053011,1.577381,2.197393,0.326662,0.181708
139,-1.028454,0.232406,-0.962936,-0.900593,1.144191,0.526047,-0.304623,-0.476210,0.423961,2.221281,...,-1.054767,-0.476757,-1.026940,-0.876352,-0.109303,-0.315951,-0.706529,-0.822674,-0.812266,0.378059
140,-0.512701,-1.690962,-0.540953,-0.527539,-0.457848,-0.801990,-0.753203,-0.584791,-0.418411,-0.662891,...,-0.553648,-1.051247,-0.596581,-0.551080,-0.144232,-0.299473,-0.456182,-0.126322,0.337735,-0.428722
141,-0.177321,-2.013952,-0.173459,-0.275596,2.365412,0.020354,-0.253491,0.423386,2.162937,0.554188,...,-0.457515,-2.170512,-0.474548,-0.481757,0.549987,-0.757552,-0.929166,-0.628751,-0.295003,-0.654329


In [24]:
from sklearn.linear_model import LogisticRegression
clf_logi =  LogisticRegression()
clf_logi.fit(X_tn_std, y_tn)

LogisticRegression()

In [26]:
# Predict class labels for the standardized test set
pred_logistic = clf_logi.predict(X_te_std)
print(pred_logistic)

[0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0
 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1
 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 0]


In [19]:
from sklearn.metrics import precision_score
precision = precision_score(y_te, pred_logistic)
print(precision)

0.9666666666666667


In [20]:
from sklearn.metrics import classification_report
class_report = classification_report(y_te, pred_logistic)
print(class_report)

              precision    recall  f1-score   support

           0       0.94      0.94      0.94        53
           1       0.97      0.97      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143
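The precision, recall, and f1-score columns of such a report follow from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a tiny set of hypothetical labels (not the breast cancer predictions above).

```python
# Hypothetical labels, made up for illustration (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
print(precision, recall, f1, accuracy)
```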



### Models for multiple classes

The model also extends to a response with more than two classes.

$$ \mathbb P (Y = k | X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{\ell=1}^{K}e^{\beta_{0\ell} + \beta_{1\ell} X_1 + \cdots + \beta_{p\ell} X_p}} $$

This produces $K$ equations, but only $K-1$ of them are actually needed, since the $K$ probabilities must sum to 1.

This model is also called multinomial regression or softmax regression.
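The redundancy among the $K$ equations can be seen numerically: subtracting one class's coefficients from every class leaves all probabilities unchanged, so one class can serve as a reference with zero coefficients. The coefficient matrix and input below are random, hypothetical values.

```python
import numpy as np

def softmax_probs(B, x):
    """B: (p + 1, K) coefficient matrix, x: (p + 1,) input with a leading 1."""
    scores = B.T @ x
    scores = scores - scores.max()   # numerical stabilisation; result is unchanged
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 4))          # hypothetical: p = 2 inputs, K = 4 classes
x = np.array([1.0, 0.3, -1.2])

probs = softmax_probs(B, x)

# Make the last class the reference: its coefficients become exactly zero.
B_ref = B - B[:, [-1]]
probs_ref = softmax_probs(B_ref, x)

print(probs, probs.sum())
print(probs_ref)
```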

For multiple classes, another option is discriminant analysis, which is covered next.

The gradient vector can be derived for softmax regression as well.

We have $K$ classes for $y_i$, i.e., $y_i \in \{ 1, \cdots, K \}$.
The softmax function is defined by
$$ \hat p_{i}^{(k)} =  P(y_i = k | x_i, \Theta ) = \frac{\exp(\theta^{(k)} \cdot x_i)}{\sum_{m=1}^{K} \exp(\theta^{(m)} \cdot x_i)} $$
where $$\Theta = \begin{bmatrix} \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \end{bmatrix}.$$
The cross-entropy cost function is
$$ J(\Theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{\ell = 1}^{K} y_i^{(\ell)} \log \hat p_{i}^{(\ell)} $$
where
$$ y_i^{(\ell)} = \mathbf{1}_{\{ y_i=\ell \}}, $$
$$ \hat p_{i}^{(\ell)} =  P(y_i = \ell | x_i, \Theta ).$$
The gradient vector of $J$ with respect to $\theta^{(k)}$ is
$$ \nabla_{\theta^{(k)}} J (\Theta) = \begin{bmatrix} \frac{\partial J(\Theta)}{\partial \theta_0^{(k)}} \\ \frac{\partial J(\Theta)}{\partial \theta_1^{(k)}} \\ \vdots \\ \frac{\partial J(\Theta)}{\partial \theta_p^{(k)}}  \end{bmatrix} = \frac{1}{N} \sum_{i=1}^{N} (\hat p_i^{(k)} - y_i^{(k)} ) x_i .$$
To derive the above formula, apply the chain rule:
\begin{align*}
\frac{\partial J(\Theta)}{\partial \theta_j^{(k)}}  &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{\ell = 1}^{K} \frac{\partial}{\partial \theta_j^{(k)}}  y_i^{(\ell)} \log \hat p_{i}^{(\ell)} \\
&= -\frac{1}{N} \sum_{i=1}^{N} \sum_{\ell = 1}^{K} y_i^{(\ell)} \frac{\partial \log \hat p^{(\ell)}_i }{\partial \hat p^{(\ell)}_i}  \frac{\partial \hat p^{(\ell)}_i}{\partial a_i^{(k)}} \frac{\partial a_i^{(k)}}{\partial \theta_{j}^{(k)}}
\end{align*}
where
$$ a_i^{(k)} = \theta^{(k)} \cdot x_i $$
and
$$ \frac{\partial a_i^{(k)}}{\partial \theta_{j}^{(k)}} = x_{ij}. $$
Thus,
$$\frac{\partial J(\Theta)}{\partial \theta_j^{(k)}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{\ell = 1}^{K} \frac{y^{(\ell)}_i}{\hat p^{(\ell)}_i}  \frac{\partial \hat p^{(\ell)}_i}{\partial a_i^{(k)}}  x_{ij}.$$
In addition, if $\ell = k$,
$$ \frac{\partial \hat p^{(\ell)}_i}{\partial a_i^{(k)}} = \frac{\partial \hat p^{(\ell)}_i}{\partial a_i^{(\ell)}} = \frac{\partial}{\partial a_i^{(\ell)}} \frac{\exp(a_i^{(\ell)})}{\sum_{m=1}^{K} \exp(a_i^{(m)})} = \frac{\exp(a_i^{(\ell)})\sum_{m=1}^{K} \exp(a_i^{(m)}) -  \exp(a_i^{(\ell)}) \exp(a_i^{(\ell)})}{ \left(\sum_{m=1}^{K} \exp(a_i^{(m)}) \right)^2 } = \hat p^{(\ell)}_i \left( 1 - \hat p^{(\ell)}_i \right) $$
and if $\ell \neq k$,
$$ \frac{\partial \hat p^{(\ell)}_i}{\partial a_i^{(k)}} = - \hat p^{(k)}_i \hat p^{(\ell)}_i.$$
Therefore, using $\sum_{\ell=1}^{K} y_i^{(\ell)} = 1$ in the last step,
\begin{align*}
\frac{\partial J(\Theta)}{\partial \theta_j^{(k)}} &= -\frac{1}{N} \sum_{i=1}^{N} \left( y_i^{(k)} (1 - \hat p_i^{(k)} )  - \sum_{\ell \neq k} \hat p_i^{(k)} y_i^{(\ell)} \right) x_{ij} \\
&=  -\frac{1}{N} \sum_{i=1}^{N} \left(y_i^{(k)} - \hat p_i^{(k)} \Big(y_i^{(k)}  + \sum_{\ell \neq k} y_i^{(\ell)}\Big) \right) x_{ij}  \\
& = \frac{1}{N} \sum_{i=1}^N (\hat p_i^{(k)} - y_i^{(k)} ) x_{ij}.
\end{align*}
Stacking these partial derivatives over $j = 0, \ldots, p$ gives the gradient vector $\nabla_{\theta^{(k)}} J(\Theta)$ stated above.
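The softmax gradient can be verified in the same way as in the binary case. In matrix form, stacking the $K$ gradient vectors as columns gives $\frac{1}{N} \mathbf X^T (\hat{\mathbf P} - \mathbf Y)$, where $\mathbf Y$ holds the indicator (one-hot) rows $y_i^{(\ell)}$. The sketch below compares this with central finite differences on random, hypothetical data; classes are indexed from 0 in the code rather than from 1.

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)   # stabilise; row-wise result unchanged
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cost(Theta, X, Y):
    """Cross-entropy J(Theta) with one-hot rows in Y."""
    P = softmax(X @ Theta)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def gradient(Theta, X, Y):
    """Column k is (1/N) sum_i (p_i^(k) - y_i^(k)) x_i."""
    return X.T @ (softmax(X @ Theta) - Y) / X.shape[0]

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4                           # hypothetical sizes
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
labels = rng.integers(0, K, size=N)          # classes 0, ..., K-1
Y = np.eye(K)[labels]                        # one-hot indicators
Theta = 0.1 * rng.normal(size=(p + 1, K))

eps = 1e-6
numeric = np.zeros_like(Theta)
for j in range(p + 1):
    for k in range(K):
        step = np.zeros_like(Theta)
        step[j, k] = eps
        numeric[j, k] = (cost(Theta + step, X, Y) - cost(Theta - step, X, Y)) / (2 * eps)

G = gradient(Theta, X, Y)
max_error = np.max(np.abs(numeric - G))
print(max_error)
```

Because $\sum_k (\hat p_i^{(k)} - y_i^{(k)}) = 0$ for every $i$, the columns of the gradient matrix sum to the zero vector, which is another quick sanity check.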

In [1]:
from sklearn import datasets