# Bias-Variance Tradeoff

https://towardsdatascience.com/the-bias-variance-tradeoff-8818f41e39e9

$Bias(\hat f) = E[\hat f] - f$  
Bias is how far the average of all your models (each fitted on a different training set) is from the true function. It is analogous to accuracy: a low-bias model is right on average. High-complexity models tend to have low bias; low-complexity models tend to have high bias.

$Var(\hat f) = E[(\hat f - E[\hat f])^2]$  
Variance is how much a model fitted on one training set varies around the average of all models (fitted on different training sets) at a fixed complexity. A low-complexity model changes little between training samples; a high-complexity model changes a lot.

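Assume $y = f + e$, where the noise $e$ has $E[e] = 0$ and $E[e^2] = \sigma^2$ and is independent of $\hat f$. Then:

$MSE$  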
$= E[(y - \hat f)^2]$  
$= E[(f + e - \hat f)^2]$  
$= E[(f - \hat f + e)^2]$  
$= E[(f - \hat f)^2 + e^2 + 2e(f - \hat f)]$  
$= E[(f - \hat f)^2] + E[e^2] + 2E[e]E[(f - \hat f)]$  
$= E[(f - \hat f)^2] + \sigma ^2 + 2(0)E[(f - \hat f)]$  
$= E[(f - \hat f)^2] + \sigma ^2$  
$= E[(f - E[\hat f] + E[\hat f] - \hat f)^2] + \sigma ^2$  
$= E[(f - E[\hat f])^2] - 2E[f - E[\hat f]]E[-E[\hat f] + \hat f] + E[(-E[\hat f] + \hat f)^2] + \sigma ^2$  
$= E[(f - E[\hat f])^2] - 2(E[f] - E[\hat f])(-E[\hat f] + E[\hat f]) + E[(\hat f - E[\hat f])^2] + \sigma ^2$  
$= E[(f - E[\hat f])^2] + E[(\hat f - E[\hat f])^2] + \sigma ^2$  
$= (f - E[\hat f])^2 + E[(\hat f - E[\hat f])^2] + \sigma ^2$  
$= Bias(\hat f)^2 + Var(\hat f) + \sigma ^2$  
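
As a rough numerical check of the decomposition, the sketch below estimates $Bias^2$ and $Var$ of a k-nearest-neighbour regressor at a single point by refitting it on many simulated training sets. Everything in it is an assumption made for illustration and not part of the notes above: the true function `true_f`, the noise level `sigma`, the fixed design grid, and the use of k as the complexity knob (small k = high complexity).

```python
import math
import random
import statistics


def true_f(x):
    # Hypothetical "true model" f, chosen only for illustration.
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance_at(x0, k, n_train=50, sigma=0.3, n_repeats=2000, seed=0):
    """Estimate Bias^2, Var and E[(f - f_hat)^2] at x0 over many training sets."""
    rng = random.Random(seed)
    xs = [i / (n_train - 1) for i in range(n_train)]  # fixed design points
    preds = []
    for _ in range(n_repeats):
        # A fresh training set: same xs, new noise e ~ N(0, sigma^2).
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = statistics.fmean(preds)               # estimate of E[f_hat]
    bias_sq = (mean_pred - true_f(x0)) ** 2           # (E[f_hat] - f)^2
    variance = statistics.pvariance(preds, mu=mean_pred)
    reducible = statistics.fmean([(true_f(x0) - p) ** 2 for p in preds])
    return bias_sq, variance, reducible


for k in (1, 25):
    b, v, r = bias_variance_at(x0=0.25, k=k)
    print(f"k={k:2d}  bias^2={b:.4f}  var={v:.4f}  bias^2+var={b + v:.4f}  E[(f-f_hat)^2]={r:.4f}")
```

With these settings, k = 1 should show almost no bias but a variance close to $\sigma^2$, while k = 25 should show the opposite. Adding $\sigma^2$ to the last column gives the full MSE from the derivation.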

# Bayes Theorem

$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B | A) P(A)}{P(B)}$
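
A quick worked example may help: the sketch below applies the formula to a diagnostic test, with $A$ = "has the condition" and $B$ = "tests positive". The function name `posterior` and the numbers are hypothetical, chosen only for illustration; $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) from the law of total probability."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)  # P(B)
    return sensitivity * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
print(posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05))
```

Even with a 95% sensitive test, the posterior here is only about 0.16, because the prior $P(A)$ is so low.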

# Generative vs Discriminative Model

Discriminative models learn the decision boundary between classes
- learns $P(Y|X)$
- generally low bias
- e.g. NN, logistic regression, SVM, decision trees, random forests (lower variance than a single deep tree), KNN (bias increases with K)

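Generative models learn the distribution of each individual class
- learns $P(Y, X)$ [or, equivalently, $P(Y)$ and $P(X|Y)$]
- e.g. NB (Gaussian NB is a diagonal-covariance special case of QDA, or of LDA when the variances are shared across classes), GMM, HMM, VAE, Linear Discriminant Analysis (LDA), QDA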

# Naive Bayes

- high bias, low variance
- supervised classifier
- generative model


Assumes the features are conditionally independent given the class: $P(x_1 \ldots x_n | y_k) = \prod_i P(x_i | y_k)$

$P(y_k | x_1 \ldots x_n) = \frac{P(x_1 \ldots x_n | y_k) P(y_k)}{P(x_1 \ldots x_n)} \propto P(y_k) \prod_i P(x_i | y_k)$

$\hat y = \arg\max_k P(y_k) \prod_i P(x_i | y_k)$
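
The decision rule can be sketched for categorical features as below. This is a minimal illustration, not a reference implementation: the function names, the toy weather data, and the add-`alpha` (Laplace) smoothing are assumptions that do not appear in the notes above. Scores are summed in log space to avoid underflow from the product.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, alpha=1.0):
    """Estimate log P(y_k) and log P(x_i | y_k) from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = [set() for _ in rows[0]]     # distinct values seen per feature
    for row, k in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(k, i)][v] += 1
            vocab[i].add(v)
    log_prior = {k: math.log(c / len(labels)) for k, c in class_counts.items()}

    def log_likelihood(k, i, v):
        # Add-alpha smoothing (an assumption) keeps unseen values from giving log(0).
        num = value_counts[(k, i)][v] + alpha
        den = class_counts[k] + alpha * len(vocab[i])
        return math.log(num / den)

    return log_prior, log_likelihood


def predict(model, row):
    """argmax_k [ log P(y_k) + sum_i log P(x_i | y_k) ]"""
    log_prior, log_likelihood = model
    scores = {
        k: lp + sum(log_likelihood(k, i, v) for i, v in enumerate(row))
        for k, lp in log_prior.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = fit_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
```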

# Linear/Quadratic Discriminant Analysis

https://en.wikipedia.org/wiki/Linear_discriminant_analysis

https://sebastianraschka.com/faq/docs/lda-vs-pca.html

LDA is equivalent to a Gaussian Naive Bayes model without the independence assumption, but with one covariance matrix shared across classes.

High Level Differences

- GNB: assumes the covariance matrices of X under classes C1 and C2 are different, and that their off-diagonal elements are 0.
- LDA: assumes the covariance matrices of X under classes C1 and C2 are the same, and that the off-diagonal elements may be non-zero.
- QDA: assumes the covariance matrices of X under classes C1 and C2 are different, and that the off-diagonal elements may be non-zero.
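
To make the three assumptions concrete, the sketch below estimates the class covariance matrices under each of them in plain Python. The helper names, the toy points, and the choice of the unbiased (n - 1) sample covariance and a degrees-of-freedom-weighted pooled estimate are all assumptions for illustration; real libraries may normalise differently.

```python
def covariance(points):
    """Sample covariance matrix (list of lists) of a list of feature vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [
        [sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]


def qda_covariances(by_class):
    # QDA: a full covariance matrix per class.
    return {k: covariance(pts) for k, pts in by_class.items()}


def gnb_covariances(by_class):
    # GNB: per-class covariance with the off-diagonal elements forced to 0.
    return {
        k: [[c if a == b else 0.0 for b, c in enumerate(row)]
            for a, row in enumerate(cov)]
        for k, cov in qda_covariances(by_class).items()
    }


def lda_covariances(by_class):
    # LDA: one pooled full covariance matrix shared by every class.
    covs = qda_covariances(by_class)
    d = len(next(iter(covs.values())))
    dof = sum(len(pts) - 1 for pts in by_class.values())
    pooled = [
        [sum((len(by_class[k]) - 1) * covs[k][a][b] for k in covs) / dof
         for b in range(d)]
        for a in range(d)
    ]
    return {k: pooled for k in by_class}


# Hypothetical 2-D points for classes C1 and C2.
data = {
    "C1": [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)],
    "C2": [(6.0, 1.0), (7.0, 0.0), (9.0, -3.0)],
}
for name, fn in [("GNB", gnb_covariances), ("LDA", lda_covariances), ("QDA", qda_covariances)]:
    print(name, fn(data))
```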

https://stackoverflow.com/questions/63023314/gaussian-nb-vs-lda-in-scikit-learn

http://www.columbia.edu/~mh2078/MachineLearningORFE/Classification1_MasterSlides.pdf

# Expectation Maximization

https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

# ROC

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://www.youtube.com/watch?v=4jRBRDbJemM

# Type I vs Type II

Type I (alpha)
- H0 True -> reject
- False Positive 

Type II (beta)
- H0 False (H1 True) -> fail to reject
- False Negative

1 - beta = Power / Recall / Sensitivity / TPR

1 - alpha = TNR / Specificity
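
The relations above can be read off a confusion matrix. The sketch below is a minimal illustration; the function name `error_rates` and the counts are hypothetical, and "positive" is taken to mean "H0 rejected".

```python
def error_rates(tp, fp, tn, fn):
    """Type I / Type II rates from confusion-matrix counts."""
    alpha = fp / (fp + tn)  # Type I: H0 true but rejected (false positive rate)
    beta = fn / (fn + tp)   # Type II: H0 false but not rejected (false negative rate)
    return {
        "alpha": alpha,
        "beta": beta,
        "power": 1 - beta,         # = recall = sensitivity = TPR
        "specificity": 1 - alpha,  # = TNR
    }


# Hypothetical counts.
print(error_rates(tp=80, fp=10, tn=90, fn=20))
```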

# Accuracy vs Performance

Accuracy 
- $Acc = \frac{TP + TN}{P + N}$

Performance
- $BA = \frac{TPR + TNR}{2}$
- $F_1 = \frac{2PPV \cdot TPR}{PPV + TPR}$


Accuracy is a poor measure for class-imbalanced datasets, since always predicting the majority class already scores high.
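
A small sketch of why this matters: on a hypothetical dataset with 990 negatives and 10 positives, a classifier that always predicts "negative" gets high accuracy but poor balanced accuracy and $F_1$. The function name `metrics` and the convention of returning 0 when a ratio is undefined are assumptions for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, balanced accuracy, F1) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall / sensitivity
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity
    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision; 0 if nothing predicted positive
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return acc, (tpr + tnr) / 2, f1


# Always predicting the majority (negative) class: no TP, no FP.
acc, ba, f1 = metrics(tp=0, fp=0, tn=990, fn=10)
print(f"Acc={acc:.2f}  BA={ba:.2f}  F1={f1:.2f}")
```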

# Bagging vs Boosting

https://www.kaggle.com/prashant111/bagging-vs-boosting

https://medium.com/syncedreview/infiniteboosting-a-bagging-boosting-hybrid-algorithm-8b109019e480

# Random Forests vs Bagged Trees

https://stats.stackexchange.com/questions/264129/what-is-the-difference-between-bagging-and-random-forest-if-only-one-explanatory

https://www.youtube.com/watch?v=7VeUPuFGJHk

https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

# MLE vs MAP

https://wiseodd.github.io/techblog/2017/01/01/mle-vs-map/

---

Recall Bayes' Theorem

$\begin{align}
P(\theta \vert X) &= \frac{P(X \vert \theta) P(\theta)}{P(X)} \\[10pt]
                  &\propto P(X \vert \theta) P(\theta)
\end{align}$

where:
  - $P(\theta \vert X)$ is the posterior
  - $P(X \vert \theta)$ is the likelihood
  - $P(\theta)$ is the prior
  - $P(X)$ is the normalizing constant

---

Maximum Likelihood Estimation (MLE) is a frequentist approach, as it ignores the prior. The product over $i$ below assumes the $x_i$ are i.i.d.

$\begin{align}
\theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_i P(x_i \vert \theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log P(x_i \vert \theta)
\end{align}$

---

Maximum A Posteriori (MAP) estimation is a Bayesian approach, as it takes the prior into account

$\begin{align}
\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) P(\theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) + \log P(\theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_i P(x_i \vert \theta) + \log P(\theta) \\[10pt]
             &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log P(x_i \vert \theta) + \log P(\theta)
\end{align}$

---

When $P(\theta)$ is uniform, $\log P(\theta)$ is a constant that drops out of the arg max, so $\theta_{MAP} = \theta_{MLE}$
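
The two estimators can be compared on coin flips. The sketch below is only an illustration under stated assumptions: a Bernoulli likelihood, a Beta(a, b) prior on $\theta$, and a brute-force grid search standing in for the arg max. None of these choices, nor the function names, come from the notes above.

```python
import math

GRID = [i / 1000 for i in range(1, 1000)]  # candidate theta values in (0, 1)


def log_likelihood(theta, flips):
    # sum_i log P(x_i | theta) for i.i.d. Bernoulli flips (1 = heads)
    return sum(math.log(theta if x == 1 else 1 - theta) for x in flips)


def log_beta_prior(theta, a, b):
    # log of the Beta(a, b) density, up to an additive constant
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)


def mle(flips):
    return max(GRID, key=lambda t: log_likelihood(t, flips))


def map_estimate(flips, a, b):
    return max(GRID, key=lambda t: log_likelihood(t, flips) + log_beta_prior(t, a, b))


flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data: 8 heads out of 10
print("MLE:", mle(flips))
print("MAP, Beta(5, 5) prior:", map_estimate(flips, 5, 5))
print("MAP, uniform Beta(1, 1) prior:", map_estimate(flips, 1, 1))
```

The Beta(5, 5) prior pulls the estimate toward 0.5, while the uniform Beta(1, 1) prior contributes nothing and the MAP estimate coincides with the MLE.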