# Naive Bayes

$$p(y_k|x_1, ..., x_m) \propto p(y_k)\,p(x_1,...,x_m|y_k) = p(y_k) \prod_{i=1}^m p(x_i|y_k)$$

$\textbf{Steps}$:

N: number of samples, m: number of features, K: number of labels, $s_j$: number of distinct values $a_{j,1}, ..., a_{j,s_j}$ of feature $j$
1. Compute prior and conditional probability
$$p(y=c_k) = \frac{1}{N} \sum_{i=1}^N 1(y_i = c_k)$$
$$p(x_j=a_{j, l}| y = c_k) = \frac{\sum_{i=1}^N 1(x_{i,j}=a_{j,l}, y_i = c_k)}{\sum_{i=1}^N 1(y_i = c_k)}$$
$$j=1,2,...,m;\ l=1,2,...,s_j;\ k=1,2,...,K$$
2. Given $\overrightarrow{x}$, compute
$$p(y=c_k|x_1, ..., x_m) \propto p(y=c_k) \prod_{j=1}^m p(x_j|y=c_k)$$
3. Predict
$$\hat{y} = \underset{c_k}{\mathrm{argmax}} \left[p(y=c_k) \prod_{j=1}^m p(x_j|y=c_k)\right]$$
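
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration for categorical features, using plain counting with no smoothing; the function names `fit_naive_bayes` and `predict`, and the toy weather data, are assumptions made for this example and are not part of the notes.

```python
from collections import Counter

def fit_naive_bayes(X, y):
    """Step 1: estimate priors and conditionals by counting (no smoothing)."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    # cond_counts[(j, value, c)] = number of samples with x_j == value and label c
    cond_counts = Counter()
    for xi, yi in zip(X, y):
        for j, value in enumerate(xi):
            cond_counts[(j, value, yi)] += 1
    cond = {key: cnt / class_counts[key[2]] for key, cnt in cond_counts.items()}
    return priors, cond

def predict(x, priors, cond):
    """Steps 2 and 3: score every class, then take the argmax."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            # A (feature value, class) pair never seen in training gets probability 0.
            score *= cond.get((j, value, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: features are (outlook, temperature), label is "play?".
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "hot"), ("sunny", "hot")]
y = ["no", "yes", "yes", "no", "no"]

priors, cond = fit_naive_bayes(X, y)
label, scores = predict(("sunny", "hot"), priors, cond)
print(label, scores)
```

The scores are unnormalized, which is enough for the argmax. For many features, summing log-probabilities instead of multiplying probabilities would likely be safer against underflow.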

$\textbf{Assumptions}$:
- Features are conditionally independent given the class label

$\textbf{Advantages}$:
- Fast training and prediction
- Easy to interpret (white box)
- Few tunable hyperparameters
- Low model complexity

# Hidden Markov Models (HMM)

$\textbf{Assumptions}$:
- Observation independence: each observation $o_t$ depends only on the current hidden state $s_t$
  - What if observations are $\underline{not}$ independent: Maximum Entropy Markov Model (MEMM)
- Markov Property: current state $s_t$ only depends on previous state $s_{t-1}$
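
Assuming that each observation $o_t$ depends only on its own state $s_t$, and that each state depends only on the previous one, the joint probability of a state sequence and an observation sequence factorizes as follows (a sketch of the standard consequence, written with the $s_t$, $o_t$ notation used here):

$$P(o_1, ..., o_T, s_1, ..., s_T) = P(s_1)\, P(o_1|s_1) \prod_{t=2}^T P(s_t|s_{t-1})\, P(o_t|s_t)$$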

$\textbf{Applications}$:

A: transition probability matrix, B: observation probability matrix, $\overrightarrow{\pi}$: initial state distribution
- Given parameter $\lambda = (A, B, \overrightarrow{\pi})$ and observation $O = (o_1, o_2, ..., o_T)$, compute $P(O; \lambda)$
  - Forward-backward algorithm
- Given $O = (o_1, o_2, ..., o_T)$, estimate $\lambda = (A, B, \overrightarrow{\pi})$ to maximize $P(O; \lambda)$
  - Supervised learning (state sequences observed): maximum likelihood estimation (MLE)
  - Unsupervised learning (states hidden): Baum-Welch algorithm (EM)
- Given $\lambda = (A, B, \overrightarrow{\pi})$ and $O = (o_1, o_2, ..., o_T)$, find sequence of hidden states $I = (i_1, i_2, ..., i_T)$ to maximize $P(I|O)$
  - Viterbi algorithm (dynamic programming)
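
The evaluation and decoding problems above can be illustrated with a small sketch. It assumes discrete observations encoded as integer indices, `A[i][j]` as the probability of moving from state `i` to state `j`, and `B[i][o]` as the probability of emitting symbol `o` in state `i`. The function names and the toy parameters are assumptions for illustration only. Only the forward pass is shown for evaluation, since it is enough to obtain $P(O; \lambda)$.

```python
def forward(A, B, pi, obs):
    """Evaluation: return P(O; lambda) using the forward recursion."""
    n = len(pi)
    # alpha[i] = P(o_1..o_t, state_t = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(A, B, pi, obs):
    """Decoding: return the most probable state path and its joint probability."""
    n = len(pi)
    # delta[i] = probability of the best path that ends in state i at time t
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backptrs = []
    for o in obs[1:]:
        prev = delta
        # ptr[j] = best predecessor state for state j
        ptr = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        backptrs.append(ptr)
    # Backtrack from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical toy model: 3 hidden states, 2 observation symbols (0 and 1).
A = [[0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.5, 0.5],
     [0.4, 0.6],
     [0.7, 0.3]]
pi = [0.2, 0.4, 0.4]
obs = [0, 1, 0]

likelihood = forward(A, B, pi, obs)
path, path_prob = viterbi(A, B, pi, obs)
print(likelihood, path, path_prob)
```

Both routines multiply raw probabilities, which is fine for short sequences like this one. For long sequences, working in log space or rescaling at each step would likely be needed to avoid underflow.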