In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

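# 1. Discriminative and generative models

Discriminative models: model $p(y|x)$ directly, e.g. logistic regression<br>
Generative models: model $p(x|y)$ and $p(y)$, e.g. Gaussian discriminant analysis and naive Bayes<br>
and then apply Bayes' rule:
$$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$
Because $p(x)$ does not depend on $y$, it follows that:
$$\hat{y}=\underset{y}{\arg\max}\ \frac{p(x|y)p(y)}{p(x)}= \underset{y}{\arg\max}\ {p(x|y)p(y)}$$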
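
As a small numeric illustration of why the denominator can be dropped, the sketch below computes both the joint $p(x|y)p(y)$ and the posterior $p(y|x)$ for one input. The likelihood and prior values are made up for illustration only.

```python
import numpy as np

# Hypothetical numbers for a single input x and two classes y in {0, 1}.
likelihood = np.array([0.02, 0.10])  # p(x | y=0), p(x | y=1)
prior = np.array([0.7, 0.3])         # p(y=0), p(y=1)

joint = likelihood * prior           # p(x | y) p(y)
evidence = joint.sum()               # p(x) = sum_y p(x | y) p(y)
posterior = joint / evidence         # p(y | x)

# Dividing by p(x) rescales every class by the same positive constant,
# so the argmax is unchanged.
y_hat_joint = int(np.argmax(joint))
y_hat_posterior = int(np.argmax(posterior))
```

Both `y_hat_joint` and `y_hat_posterior` select the same class, which is why generative classifiers usually skip computing $p(x)$.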

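# 2. Gaussian discriminant analysis

Multivariate Gaussian distribution

The probability density function of the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with mean $\mu \in \mathbb{R}^{d}$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ (where $\Sigma \succeq 0$ is symmetric and positive semidefinite; the density below additionally requires $\Sigma$ to be invertible) is: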

$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{d/2}\left | \Sigma \right |^{1/2} }\exp\left(-\frac{1}{2}(x-\mu)^{T}{\Sigma}^{-1}(x-\mu)\right)$$
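
The density can be evaluated directly with NumPy. The following is a minimal sketch, assuming $\Sigma$ is invertible; the function name `gaussian_pdf` is introduced here for illustration and is not part of the notebook.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at a single point x (x and mu are 1-D arrays of length d)."""
    d = mu.shape[0]
    diff = x - mu
    # Solve sigma @ z = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm_const

mu = np.zeros(2)
sigma = np.eye(2)
# For the 2-D standard normal, the density at the mean equals 1 / (2 * pi).
p_at_mean = gaussian_pdf(mu, mu, sigma)
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice; for higher dimensions a log-density is usually preferable to avoid underflow.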

Suppose we have a binary classification dataset $\left \{ (x^{(1)},y^{(1)}),...,(x^{(n)},y^{(n)}) \right \}$ in which the features $x$ are continuous-valued, and assume:

$$y\sim Bernoulli(\phi)$$
$$x | y=0 \sim \mathcal{N}(\mu_{0},\Sigma) $$
$$x | y=1 \sim \mathcal{N}(\mu_{1},\Sigma) $$

This gives the log-likelihood:

$$
\begin{equation}
\begin{split}
l(\phi,\mu_{0},\mu_{1},\Sigma) &= \log\prod_{i=1}^{n}p(x^{(i)},y^{(i)};\phi,\mu_{0},\mu_{1},\Sigma) \\
&= \log\prod_{i=1}^{n}p(x^{(i)}|y^{(i)};\mu_{0},\mu_{1},\Sigma)p(y^{(i)};\phi)
\end{split}
\end{equation}
$$

Maximizing the log-likelihood gives the estimates:
$$\phi = \frac{1}{n}\sum_{i=1}^{n}1\left \{ y^{(i)}=1 \right \}$$
$$\mu_{0} = \frac{\sum_{i=1}^{n}1\left \{ y^{(i)}=0 \right \}x^{(i)}  }{\sum_{i=1}^{n}1\left \{ y^{(i)}=0 \right \}} $$
$$\mu_{1} = \frac{\sum_{i=1}^{n}1\left \{ y^{(i)}=1 \right \}x^{(i)}  }{\sum_{i=1}^{n}1\left \{ y^{(i)}=1 \right \}} $$
$$\Sigma=\frac{1}{n}\sum_{i=1}^{n}(x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^{T}$$

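These estimates match intuition: $\phi$ is the fraction of positive examples, $\mu_{0}$ and $\mu_{1}$ are the per-class sample means, and $\Sigma$ is the covariance of the examples around their own class mean.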
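
The closed-form estimates above translate almost line by line into NumPy. The sketch below is one possible implementation, not the notebook's own; the name `fit_gda` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_gda(x, y):
    """Closed-form maximum likelihood estimates for GDA with a shared covariance.

    x: (n, d) float array; y: (n,) array of 0/1 labels.
    Returns (phi, mu0, mu1, sigma).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(int)
    phi = y.mean()                      # fraction of examples with y = 1
    mu0 = x[y == 0].mean(axis=0)        # sample mean of class 0
    mu1 = x[y == 1].mean(axis=0)        # sample mean of class 1
    mu = np.stack([mu0, mu1])
    centered = x - mu[y]                # subtract each example's own class mean
    sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, sigma

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[0.0, 0.0], size=(200, 2))
x1 = rng.normal(loc=[3.0, 3.0], size=(100, 2))
x = np.vstack([x0, x1])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(100, dtype=int)])
phi, mu0, mu1, sigma = fit_gda(x, y)
```

Indexing `mu[y]` picks $\mu_{y^{(i)}}$ for every row at once, which is what the formula for $\Sigma$ requires.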

In [2]:
def guassian_discriminant_analysis(x,y):
    pass

## 2.1 Relationship between Gaussian discriminant analysis and logistic regression
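
A standard result connecting the two, sketched here under the GDA assumptions of section 2 (shared $\Sigma$, $0 < \phi < 1$): the log-odds of the posterior are linear in $x$, because the quadratic term $x^{T}\Sigma^{-1}x$ cancels between the two classes. The symbols $\theta$ and $\theta_{0}$ are notation introduced for this sketch.

$$
\begin{equation}
\begin{split}
\log\frac{p(y=1|x)}{p(y=0|x)} &= \log\frac{p(x|y=1)\,\phi}{p(x|y=0)\,(1-\phi)} \\
&= (\mu_{1}-\mu_{0})^{T}\Sigma^{-1}x - \frac{1}{2}\mu_{1}^{T}\Sigma^{-1}\mu_{1} + \frac{1}{2}\mu_{0}^{T}\Sigma^{-1}\mu_{0} + \log\frac{\phi}{1-\phi} \\
&= \theta^{T}x + \theta_{0}
\end{split}
\end{equation}
$$

Hence $p(y=1|x)=\frac{1}{1+\exp(-(\theta^{T}x+\theta_{0}))}$ with $\theta=\Sigma^{-1}(\mu_{1}-\mu_{0})$, which has the same form as logistic regression. The converse does not hold in general: a logistic posterior does not imply that $x|y$ is Gaussian.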

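# 3. Naive Bayes

In Gaussian discriminant analysis (GDA), $x$ is continuous. When $x$ is discrete, we can use naive Bayes.

Suppose $x$ has $d$ dimensions. By the chain rule of probability: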

$$
\begin{equation}
\begin{split}
p(x|y) &= p(x_{1},...,x_{d}|y) \\
&= p(x_{1}|y)p(x_{2}|y,x_{1})\cdots p(x_{d}|y,x_{1},...,x_{d-1})
\end{split}
\end{equation}
$$

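If we assume the dimensions are conditionally independent given $y$ (this is the "naive" assumption), the expression above simplifies to: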
$$p(x|y)=\prod_{j=1}^{d}p(x_{j}|y)$$

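We can estimate $p(x_{j}|y)$ and $p(y)$ by relative frequencies in the training set (these are the maximum likelihood estimates when each is modeled as a categorical distribution).<br>
We then obtain $\hat{y}$ from: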
$$\hat{y}= \underset{y}{argmax}\ {p(x|y)p(y)}$$

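## 3.1 Laplace smoothing

Estimating $p(x_{j}|y)$ by frequency has a problem:<br>
if a pair $(x_{j},y)$ never appears in the training set, its estimated probability is 0, so every $x$ whose $j$-th component takes that value gets $p(x|y)=0$.

One way to avoid this multiply-by-zero problem is to make sure unseen pairs still receive a small nonzero probability, for example by adding 1 to every count: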

$$p(x_{j}=k|y) = \frac{\sum_{i=1}^{n}1\left \{ y^{(i)}=y, x_{j}^{(i)}=k\right \} + 1}{\sum_{i=1}^{n}1\left \{ y^{(i)}=y \right \} + \sharp x_{j}}$$

where $\sharp x_{j}$ is the number of distinct values $x_{j}$ can take. When $x_{j}$ is binary, as in text classification, $\sharp x_{j}=2$.<br>
In short:

$$p(x_{j}=1|y=1) = \frac{1 + \text{number of spam emails containing word } j}{2 + \text{number of spam emails}}$$
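
The smoothed estimate and the resulting classifier can be sketched as follows for binary features. This is a minimal illustration, not a reference implementation; the function names and the toy data are assumptions. Prediction uses log-probabilities so that the product over $j$ does not underflow.

```python
import numpy as np

def fit_bernoulli_nb(x, y):
    """Bernoulli naive Bayes with add-one (Laplace) smoothing.

    x: (n, d) array of 0/1 features; y: (n,) array of 0/1 labels.
    Returns (phi, theta) where theta[c, j] estimates p(x_j = 1 | y = c).
    """
    x = np.asarray(x)
    y = np.asarray(y).astype(int)
    phi = y.mean()
    theta = np.empty((2, x.shape[1]))
    for c in (0, 1):
        xc = x[y == c]
        # (count of x_j = 1 in class c + 1) / (size of class c + 2)
        theta[c] = (xc.sum(axis=0) + 1) / (xc.shape[0] + 2)
    return phi, theta

def predict_bernoulli_nb(x, phi, theta):
    """Return argmax_y p(x | y) p(y) for each row of x, computed in log space."""
    x = np.asarray(x)
    log_prior = np.log(np.array([1 - phi, phi]))
    # log p(x | y=c) = sum_j [x_j log theta_cj + (1 - x_j) log(1 - theta_cj)]
    log_lik = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T
    return np.argmax(log_lik + log_prior, axis=1)

# Toy data (made up): 3 words; the word at index 2 never appears in class 1.
x_train = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y_train = np.array([1, 1, 0, 0])
phi, theta = fit_bernoulli_nb(x_train, y_train)
pred = predict_bernoulli_nb(np.array([[1, 0, 0], [0, 0, 1]]), phi, theta)
```

Without the `+ 1` and `+ 2` terms, `theta[1, 2]` would be exactly 0 and any email containing that word would get $p(x|y=1)=0$ regardless of its other words.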

## 3.2 Event models for text classification

Bernoulli event model: <br>
first, it is randomly determined whether the sender is a spammer or a non-spammer;<br>
then the sender runs through the dictionary, deciding independently whether to include each word $j$.

Multinomial event model: <br>
first, it is randomly determined whether the sender is a spammer or a non-spammer;<br>
then each word in the email is generated independently from the same multinomial distribution, which depends only on the class.

Given a training set $\left \{ (x^{(1)},y^{(1)}),...,(x^{(n)},y^{(n)}) \right \}$ where $x^{(i)}=(x_{1}^{(i)},...,x_{d_{i}}^{(i)})$ (here $d_{i}$ is the number of words in the $i$-th training example and $x_{j}^{(i)}$ identifies its $j$-th word in the vocabulary),<br>
the maximum likelihood estimates of the parameters, with Laplace smoothing applied, are:

$$\phi = \frac{1}{n}\sum_{i=1}^{n}1\left \{ y^{(i)}=1 \right \}$$
$$\phi_{k|y=1}=\frac{1 + \sum_{i=1}^{n}\sum_{j=1}^{d_{i}}1\left \{x_{j}^{(i)}=k\wedge y^{(i)}=1 \right \}}{\left | V \right | + \sum_{i=1}^{n}1\left \{ y^{(i)}=1 \right \}d_{i}}$$
$$\phi_{k|y=0}=\frac{1 + \sum_{i=1}^{n}\sum_{j=1}^{d_{i}}1\left \{x_{j}^{(i)}=k\wedge y^{(i)}=0 \right \}}{\left | V \right | + \sum_{i=1}^{n}1\left \{ y^{(i)}=0 \right \}d_{i}}$$

where $\left | V \right |$ is the size of the vocabulary.<br>
In short:
$$\phi_{k|y=1}=\frac{1 + \text{number of occurrences of word } k \text{ in spam emails}}{\left | V \right | + \text{total number of words in spam emails}}$$
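
The multinomial estimates can be sketched in the same way. Here each document is assumed to be a list of 0-based word indices (the formulas above do not fix an indexing convention), and `fit_multinomial_nb` is a name introduced for illustration.

```python
import numpy as np

def fit_multinomial_nb(docs, y, vocab_size):
    """Multinomial event model with add-one (Laplace) smoothing.

    docs: list of documents, each a list of word indices in [0, vocab_size).
    y: sequence of 0/1 labels.
    Returns (phi, phi_k) where phi_k[c, k] estimates p(word = k | y = c).
    """
    y = np.asarray(y).astype(int)
    counts = np.zeros((2, vocab_size))
    for doc, label in zip(docs, y):
        # Count every occurrence of every word in this document.
        counts[label] += np.bincount(np.asarray(doc, dtype=int), minlength=vocab_size)
    # Numerator: 1 + occurrences of word k in class c.
    # Denominator: |V| + total number of words in class c.
    phi_k = (counts + 1) / (counts.sum(axis=1, keepdims=True) + vocab_size)
    return y.mean(), phi_k

# Toy corpus (made up): vocabulary of 4 words, two spam and two non-spam emails.
docs = [[0, 0, 1], [0, 2], [3, 3, 3, 1], [3, 2]]
y = [1, 1, 0, 0]
phi, phi_k = fit_multinomial_nb(docs, y, vocab_size=4)
```

Each row of `phi_k` sums to 1, since adding 1 to each of the $|V|$ counts is balanced by adding $|V|$ to the denominator.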