# Naive Bayes

$$p(y_k|x_1, ..., x_m) \propto p(y_k)\,p(x_1,...,x_m|y_k) = p(y_k) \prod_{i=1}^m p(x_i|y_k)$$

$\textbf{Steps}$:

N: number of samples, m: number of features, K: number of labels, $s_j$: number of distinct values of feature $j$
1. Compute prior and conditional probability
$$p(y=c_k) = \frac{1}{N} \sum_{i=1}^N 1(y_i = c_k)$$
$$p(x_j=a_{j, l}| y = c_k) = \frac{\sum_{i=1}^N 1(x_{i,j}=a_{j,l}, y_i = c_k)}{\sum_{i=1}^N 1(y_i = c_k)}$$
$$j=1,2,\ldots,m;\quad l=1,2,\ldots,s_j;\quad k=1,2,\ldots,K$$
2. Given $\overrightarrow{x}$, compute the unnormalized posterior of each class
$$p(y=c_k|x_1, ..., x_m) \propto p(y=c_k) \prod_{j=1}^m p(x_j|y=c_k)$$
3. Predict
$$\hat{y} = \underset{c_k}{\mathrm{argmax}} \left[p(y=c_k) \prod_{j=1}^m p(x_j|y=c_k)\right]$$
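
The three steps above can be sketched in plain Python for categorical features. This is a minimal illustration, not a reference implementation: the function names `fit_naive_bayes` and `predict` and the toy weather data are made up for this example, and no smoothing is applied, so an unseen feature value gives a class a score of zero.

```python
from collections import Counter

def fit_naive_bayes(X, y):
    """Step 1: estimate p(y = c_k) and p(x_j = a | y = c_k) by counting."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: count / n for c, count in class_counts.items()}
    # joint[(j, a, c)] = number of samples with feature j equal to a and label c
    joint = Counter()
    for row, label in zip(X, y):
        for j, a in enumerate(row):
            joint[(j, a, label)] += 1
    cond = {key: count / class_counts[key[2]] for key, count in joint.items()}
    return priors, cond

def predict(x, priors, cond):
    """Steps 2 and 3: score every class, then take the argmax."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, a in enumerate(x):
            # Unseen (feature value, class) pairs get probability 0 (no smoothing).
            score *= cond.get((j, a, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): each sample is (weather, temperature).
X_train = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y_train = ["no", "no", "yes", "yes"]

priors, cond = fit_naive_bayes(X_train, y_train)
label, scores = predict(("rainy", "mild"), priors, cond)
print(label, scores)
```

On this toy data the class "yes" wins, because "rainy" never occurs with the label "no". In practice the product is usually replaced by a sum of log-probabilities, with Laplace smoothing to avoid zero counts.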

$\textbf{Assumptions}$:
- Features are conditionally independent given the label

$\textbf{Advantages}$:
- Fast training and prediction
- Easy to interpret (white box)
- Few tunable hyperparameters
- Low model complexity

# Hidden Markov Models (HMM)

$\textbf{Assumptions}$:
- Observation independence: each observation $o_t$ depends only on the current hidden state $i_t$
  - If this independence does $\underline{not}$ hold: Maximum Entropy Markov Model, Conditional Random Field
- Markov property: the current state $i_t$ depends only on the previous state $i_{t-1}$

$\textbf{Applications}$:

A: transition probability matrix, B: observation probability matrix, $\overrightarrow{\pi}$: initial state distribution, Q: number of hidden states, T: length of the observation sequence
- Given parameter $\lambda = (A, B, \overrightarrow{\pi})$ and observation $O = (o_1, o_2, ..., o_T)$, compute $P(O; \lambda)$
  - Forward-backward algorithm
    - Forward: $\alpha_t(i) = P(o_1, o_2, ..., o_t, i_t = i; \lambda)$
      1. Compute $\alpha_1(i) = \pi_i b_i (o_1), i = 1, 2, ..., Q$
      2. For $t= 1, 2, ..., T-1$
      $$\alpha_{t+1}(i) = \left[\sum_{j=1}^Q \alpha_t (j) a_{j,i}\right] b_i (o_{t+1})$$
      3. Stop: $P(O; \lambda) = \sum_{i=1}^Q \alpha_T (i)$
    - Backward: $\beta_t (i) = P(o_{t+1}, o_{t+2}, ..., o_T | i_t = i; \lambda)$
      1. Initialize $\beta_T (i) = 1, i = 1, 2, ..., Q$
      2. For $t= T-1, T-2, ..., 1$
      $$\beta_t (i) = \sum_{j=1}^Q a_{i,j} b_j (o_{t+1}) \beta_{t+1} (j)$$
      3. Stop: $P(O;\lambda) = \sum_{i=1}^Q \pi_i b_i (o_1) \beta_1 (i)$
      
- Given $O = (o_1, o_2, ..., o_T)$, estimate $\lambda = (A, B, \overrightarrow{\pi})$ to maximize $P(O; \lambda)$
  - Supervised learning (the hidden state sequence is also observed): maximum likelihood estimation (MLE)
  - Unsupervised learning (only $O$ is observed): Baum-Welch algorithm (EM)
  
- Given $\lambda = (A, B, \overrightarrow{\pi})$ and $O = (o_1, o_2, ..., o_T)$, find sequence of hidden states $I = (i_1, i_2, ..., i_T)$ to maximize $P(I|O)$
  - Viterbi algorithm (Dynamic Programming)
    1. Initialize $\delta_1 (i) = \pi_i b_i (o_1), i = 1, 2, ..., Q$
    2. For $t = 2, ..., T$
    $$\delta_t (i) = \underset{1 \leq j \leq Q}{\mathrm{max}} \left[\delta_{t-1}(j) a_{j,i}\right] b_i (o_t)$$
    $$\Psi_t (i) = \underset{1 \leq j \leq Q}{\mathrm{argmax}} \left[\delta_{t-1}(j) a_{j,i}\right]$$
    3. Stop: $P^{*} = \underset{1 \leq i \leq Q}{\mathrm{max}} \delta_T (i), i_T^{*} = \underset{1 \leq i \leq Q}{\mathrm{argmax}} \delta_T (i)$
    4. Backtracking: for $t = T-1, T-2, ..., 1$, $i_t^{*} = \Psi_{t+1} (i_{t+1}^{*})$
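
The forward, backward, and Viterbi recursions listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: states and observation symbols are 0-indexed integers, `A[j][i]` plays the role of $a_{j,i}$, `B[i][o]` the role of $b_i(o)$, and the three-state parameters are toy values chosen for the example. Probabilities are multiplied directly, so long sequences would underflow without scaling or log-space arithmetic.

```python
def forward(A, B, pi, obs):
    """alpha[t][i] = P(o_1..o_t, state at t is i); list index t starts at 0."""
    Q = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(Q)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] for j in range(Q)) * B[i][o]
                      for i in range(Q)])
    return alpha

def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1}..o_T | state at t is i); the last row is all ones."""
    Q = len(A)
    beta = [[1.0] * Q]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(Q))
                        for i in range(Q)])
    return beta

def viterbi(A, B, pi, obs):
    """Return the most likely state path and its joint probability P*."""
    Q = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(Q)]
    psi = []  # psi[t][i] = best predecessor of state i at step t + 1
    for o in obs[1:]:
        best_prev = [max(range(Q), key=lambda j: delta[j] * A[j][i])
                     for i in range(Q)]
        delta = [delta[best_prev[i]] * A[best_prev[i]][i] * B[i][o]
                 for i in range(Q)]
        psi.append(best_prev)
    last = max(range(Q), key=lambda i: delta[i])
    path = [last]
    for back in reversed(psi):  # backtracking
        path.append(back[path[-1]])
    return path[::-1], delta[last]

# Toy parameters (hypothetical): 3 hidden states, 2 observation symbols.
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
obs = [0, 1, 0]

alpha = forward(A, B, pi, obs)
beta = backward(A, B, obs)
p_forward = sum(alpha[-1])
p_backward = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
path, p_star = viterbi(A, B, pi, obs)
print(p_forward, p_backward, path, p_star)
```

The forward and backward passes should agree on $P(O; \lambda)$, which makes a convenient sanity check when implementing either one.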

# Conditional Random Field (CRF)

$\textbf{Input}$:
- Input vectors $X$
- The position $i$ of the data point
- The label of data point $i - 1$ in $X$: $y_{i-1}$
- The label of data point $i$ in $X$: $y_{i}$

$\textbf{Objective}$:
- Model conditional probability
$$\hat{y} = \underset{y}{\mathrm{argmax}}  p(y|X)$$
- Does not require label independence

$\textbf{Feature Function}$:
- Purpose: express characteristics of the data sequence
- Example: Part-of-Speech tagging
- Each feature function is based on the labels of the previous word and the current word
$$f(X, i, L_{i-1}, L_{i})=
    \begin{cases}
      1, & \text{if}\ L_{i-1} = \text{Noun} \text{ and } L_{i} = \text{Verb} \\
      0, & \text{otherwise}
    \end{cases}$$
- Assign a weight $\lambda_j$ to each feature function $f_j$
$$p(y | X, \lambda) = \frac{1}{Z(X)} \exp\left(\sum_{i=1}^n \sum_j \lambda_j f_j (X, i, y_{i-1}, y_i)\right)$$
$$Z(X) = \sum_{y^{'}} \exp\left(\sum_{i=1}^n \sum_j \lambda_j f_j (X, i, y_{i-1}^{'}, y_i^{'})\right)$$
- Take the negative log-likelihood over the $m$ training sequences $(x^k, y^k)$ and compute its partial derivative w.r.t. each $\lambda_j$
- Gradient descent update for CRF (learning rate $\alpha$): observed feature counts minus expected feature counts under the model
$$\lambda_j := \lambda_j + \alpha \left(\sum_{k=1}^m F_j (y^k, x^k) - \sum_{k=1}^m \sum_{y} p(y|x^k, \lambda) F_j (y, x^k) \right)$$
$$F_j (y, x) = \sum_{i=1}^n f_j (x, i, y_{i-1}, y_i)$$
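
To make the conditional probability and the gradient concrete, here is a brute-force sketch for a single tiny sequence. It enumerates every label sequence to compute $Z(X)$, which is only feasible for toy inputs; practical CRF implementations use dynamic programming instead. The two feature functions, the `"START"` marker for the label before the first position, the two-word sentence, and the learning rate are all assumptions made for this illustration.

```python
import itertools
import math

LABELS = ("Noun", "Verb")

def f_noun_verb(X, i, y_prev, y_cur):
    # Fires when a Noun is followed by a Verb.
    return 1.0 if y_prev == "Noun" and y_cur == "Verb" else 0.0

def f_ends_s_verb(X, i, y_prev, y_cur):
    # Hypothetical second feature: word ends in "s" and is tagged Verb.
    return 1.0 if X[i].endswith("s") and y_cur == "Verb" else 0.0

FEATURES = (f_noun_verb, f_ends_s_verb)

def F(j, y, X):
    """F_j(y, X): feature j summed over all positions of the sequence."""
    return sum(FEATURES[j](X, i, y[i - 1] if i > 0 else "START", y[i])
               for i in range(len(X)))

def score(y, X, lam):
    return sum(lam[j] * F(j, y, X) for j in range(len(FEATURES)))

def prob_table(X, lam):
    """p(y | X, lam) for every label sequence y, by explicit enumeration."""
    seqs = list(itertools.product(LABELS, repeat=len(X)))
    Z = sum(math.exp(score(y, X, lam)) for y in seqs)
    return {y: math.exp(score(y, X, lam)) / Z for y in seqs}

def gradient(X, y_true, lam):
    """Observed minus expected feature counts (gradient of the log-likelihood)."""
    p = prob_table(X, lam)
    return [F(j, y_true, X) - sum(p[y] * F(j, y, X) for y in p)
            for j in range(len(FEATURES))]

# One toy training sequence (hypothetical).
X = ("dogs", "bark")
y_true = ("Noun", "Verb")
lam = [0.0, 0.0]
lr = 1.0  # learning rate alpha

p_before = prob_table(X, lam)[y_true]
grad = gradient(X, y_true, lam)
lam = [l + lr * g for l, g in zip(lam, grad)]
p_after = prob_table(X, lam)[y_true]
print(grad, p_before, p_after)
```

After one update the probability of the observed labeling rises above its initial uniform value, which is the behavior the update rule is meant to produce.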

HMM vs CRF: https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541