## Side notes
_(code snippets, summaries, resources, etc.)_
- Provides extra explanation: [_Bayesian Learning_ PDF by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/d62d0042-fbcb-4831-83d6-fe311e1bc6cd/) (Evernote)

__definition:__ i.i.d. (independent and identically distributed)
- A collection / sequence of random variables is _independent and identically distributed_ if
    1. each random variable has the _same probability distribution_ as the others, and
    2. all of them are _mutually independent_.

__Notes on math used:__
- Can switch $\arg\max$ to $\arg\min$ by multiplying the expression by $-1$
- Can turn products into sums by taking $\ln$ of the expression, since $\ln(ab) = \ln a + \ln b$ (and back again with $e^x$):
    - Used to simplify the Bayesian Learning equation to find the best hypothesis
    - Turns a $\Pi$ of probabilities into a $\Sigma$ of log-probabilities
    - This works because:
        - $\ln(x)$ is the inverse of $e^x$, so $\ln(e^x) = x$
        - $\ln(x)$ is also _monotonically increasing_ (as is $e^x$), so applying it does not change which $h$ gives the _argmax_
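
A minimal sketch of the log trick, using made-up likelihood values for three hypothetical hypotheses (`h1`, `h2`, `h3` and their numbers are assumptions for illustration only). It checks that the argmax of the product, the argmax of the sum of logs, and the argmin of the negated sum of logs all pick the same hypothesis:

```python
import math

# Hypothetical per-example likelihoods P(d_i | h) for three candidate hypotheses
likelihoods = {
    "h1": [0.9, 0.8, 0.7],
    "h2": [0.6, 0.9, 0.9],
    "h3": [0.5, 0.5, 0.99],
}


def log_sum(ps):
    """ln of a product of probabilities, written as a sum of logs."""
    return sum(math.log(p) for p in ps)


# argmax of the raw product
best_by_product = max(likelihoods, key=lambda h: math.prod(likelihoods[h]))
# argmax of the sum of logs (ln is monotonically increasing)
best_by_log_sum = max(likelihoods, key=lambda h: log_sum(likelihoods[h]))
# argmin after multiplying by -1
best_by_neg_log = min(likelihoods, key=lambda h: -log_sum(likelihoods[h]))

print(best_by_product, best_by_log_sum, best_by_neg_log)
```

All three selections agree because neither taking the log nor flipping the sign (together with swapping max for min) changes the ordering of the candidates.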

# Bayesian Learning

## Summary of topics covered

This section uses Bayesian Learning and Classification (derived from Bayes' Rule) as the underlying theory that explains concepts and methods we have already been using, such as the bias-variance tradeoff.

![summary](bayesian_learning_images/summary.png)

- The maximum a posteriori hypothesis h<sub>MAP</sub> equals the maximum likelihood hypothesis h<sub>ML</sub> when the prior P(h) is uniform
- Bayes' Optimal Classifier lets all hypotheses vote, each weighted by its posterior P(h|D), for the most probable output for a particular data point
    - On average, _we cannot do any better_ than using this method.

## Bayes' Rule

![naive bayes formula](naive_bayes_images/IMG_0316.jpg)</div>

- P(E) (marginal probability of the evidence) acts as the _normalizer_ for the joint probability P(H, E)
    - where P(E) = P(E|H) P(H) + P(E|H') P(H')
    
![simple diagram Bayes' rule in words](naive_bayes_images/5C7CC10B-2C1E-46E0-BABB-A131BCF3950E.png)

### Example: Testing for Cancer

![cancer example question](naive_bayes_images/FA5ADE87-8BFC-434F-98D4-6D7122166298.png)

Another way to phrase question:
- Probability of correct positive = P(Pos|S)
- Probability of correct negative = P(Pos'|S')

![diagram of question](naive_bayes_images/90251E43-6F74-4A8D-AA00-7C27611418EA.png)</div>

![answer calculation](naive_bayes_images/CC52E540-F6ED-4620-BFDA-9B35C7DD557B.png)

__Detailed flow chart of Bayes' Rule for example:__

![detailed flow diagram of Bayes' rule](naive_bayes_images/6E1DD6D8-10E4-49F7-ADE9-B68822ACAD32.png)

- P(Pos, C) is joint probability, combines cancer hypothesis with the test result.
- Normalizer in this case is P(Pos)
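
A minimal sketch of the calculation in the flow chart. Only P(C) = 0.01 comes from these notes; the sensitivity and specificity of 0.9 are assumed, illustrative values (the real ones are in the question image), and the function name `posterior` is hypothetical:

```python
def posterior(prior, sensitivity, specificity):
    """Return P(C | Pos) using Bayes' rule.

    prior       = P(C)
    sensitivity = P(Pos | C)
    specificity = P(Neg | C')
    """
    joint_pos_c = sensitivity * prior                  # P(Pos, C)
    joint_pos_not_c = (1 - specificity) * (1 - prior)  # P(Pos, C')
    p_pos = joint_pos_c + joint_pos_not_c              # normalizer P(Pos)
    return joint_pos_c / p_pos


# Assumed test characteristics: 0.9 sensitivity and 0.9 specificity
p_low_prior = posterior(prior=0.01, sensitivity=0.9, specificity=0.9)
# Same test applied to a hypothetical sub-population with a higher prior
p_high_prior = posterior(prior=0.10, sensitivity=0.9, specificity=0.9)

print(p_low_prior, p_high_prior)
```

Under these assumed numbers the posterior stays small when the prior is 0.01 and rises sharply when the prior is higher, which shows how strongly the result depends on P(C).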

__Note about result:__
- A positive test result on its own is not a good indicator of a patient having cancer
    - to improve this, P(C) can be pushed up from 0.01 by using other symptoms to put the patient in a sub-population with higher P(C)
    - i.e. the test needs to come later in the diagnostic process

### Example: Testing for "spleentitis"

![spleentitis example](naive_bayes_images/speentitis_example.png)

## Bayesian Learning Algorithm

![bayes' rule overview](bayesian_learning_images/bayes_rule_overview.png)

Terms:
- P(D|h) is probability of seeing data with given labels in a universe where _h_ happens to be true.
- P(h|D) is...

- Accepting the features as given, P(D|h) measures the probability of the associated labels
- P(D|h) acts as a kind of accuracy measure for _h_ on the data
- P(D) does not depend on the hypothesis
- P(h) is our _prior_ on a particular hypothesis drawn from the hypothesis space (_prior to seeing the data_)
    - encapsulates our belief that one hypothesis is likely or unlikely compared to other hypotheses
    - in fact, captures _domain knowledge_
    - a hypothesis is more likely if, for example, points grouped together in k-NN get more similar labels than they do under other hypotheses
    - a hypothesis could be:
        - a similarity metric in k-NN, or 
        - which features might be important relating to higher information gain in a decision tree, 
        - structure of a neural network, 
        - etc.

![bayesian rule overview](bayesian_learning_images/bayesian_learning_algorithm.png)

- To determine the hypothesis that maximizes P(h|D) (the hypothesis that is most likely given the data), we do not need P(D), which is often not available: it does not depend on _h_, so it does not change the argmax, leaving P(D|h) P(h)
- P(h) can also be difficult to determine, so we assume all hypotheses are equally likely, i.e. a _uniform prior_
    - which lets us ignore P(h) and find the _maximum likelihood hypothesis_, the one maximizing P(D|h) (the hypothesis that best matches the data), because "we don't have a strong prior"
    - this h<sub>ML</sub>, or _maximum likelihood hypothesis_, is the same as h<sub>MAP</sub>, or _maximum a posteriori hypothesis_, when the prior is uniform
- The hard part that remains is running through all possible hypotheses to determine argmax<sub>h</sub> P(h|D) (the set of possible hypotheses is effectively infinite)
    - the algorithm is not computationally practical, but conceptually useful
    
__This provides us with a "gold standard" with which to compare different hypotheses__
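
As a compact restatement of the bullets above (a sketch written in the notation used in these notes, not copied from the slide):

$$
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h) \\
h_{ML} &= \arg\max_{h \in H} P(D|h) \qquad \text{(when } P(h) \text{ is uniform)}
\end{aligned}
$$

$P(D)$ drops out because it is the same for every $h$, and a uniform $P(h)$ drops out for the same reason.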

### Bayesian learning in Action

#### Some Terminology

- Appended from [Version Spaces - Georgia Tech - Machine Learning](https://classroom.udacity.com/courses/ud675/lessons/383498973/concepts/3972687050923#) (from Udacity course on Supervised Learning):

![version space terminology](bayesian_learning_images/version_space_terminology.png)

Where
- $S$ is the training set, which consists of
    - features / examples (a subset of all possible inputs, $X$)
    - labels / the _true class_ for all of those $x$'s
- _Consistent learner_ is perfectly accurate for training data
- _Version space_ is the set of hypotheses consistent with $S$
- To test understanding, see [quiz](https://classroom.udacity.com/courses/ud675/lessons/383498973/concepts/6833586430923) that follows video

#### Bayesian Learning on data without noise

From [Bayesian Learning in Action](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345404/modules/5c2f3b47-b791-46a7-88eb-34a0753665e6/lessons/5462070314/concepts/4733385550923#) (Udacity video):

![bayesian rule overview](bayesian_learning_images/bayesian_learning_in_practice.png)

_Assumptions, explained:_
1. $\{\langle x_{i}, d_{i} \rangle\}$ is the training data, consisting of inputs and their labels
    - labels can be classification data, etc.
    - noise-free examples drawn from some underlying concept $c$
    - $d_{i} = c(x_{i})$ (for all $x_{i}$)
2. $c \in H$, i.e. the true concept is in the _hypothesis class_ or hypothesis space
3. uniform prior, i.e. no hypothesis in $H$ is more likely than another.
    - also called _uninformed prior_
    
_Generalized Bayes' Rule for Machine Learning, explained:_
- $P(h) = \frac{1}{|H|}$, since the prior is uniform and the probabilities over the whole hypothesis class sum to 1.
- P(D|h) is the probability of seeing $D$ in a universe where $h$ is true, i.e. a measure of how well the hypothesis describes the data. It will be:
    - 1 if _h_ labels every example correctly (noise-free), i.e. $d_i = h(x_i)$ for all $i$
    - 0 if not, i.e. if _h_ mislabels any of the examples.
    - between 0 and 1 if noise is factored into the calculation (see quiz below).
- $P(D) = \sum_i P(D|h_i)\,P(h_i) = \frac{|VS|}{|H|}$, where $VS$ is the version space
    - works because we have the necessary terms
    - works if the hypotheses $h_i$ are mutually exclusive, which we always assume
- P(h|D) is the posterior probability of _h_ given the data, which can now be calculated with Bayes' rule: $\frac{1}{|VS|}$ for every _h_ in the version space.
    - Result means: "Given a bunch of data, the probability of a particular hypothesis being correct is simply uniform over all the hypotheses that are in the version space, that is, are consistent with the data we see."
    - P(h|D) = 0 if _h_ is not consistent with the data
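
A minimal sketch of the noise-free case on a tiny finite hypothesis class. The threshold classifiers, the data points, and all names here are hypothetical, chosen only so the whole class can be enumerated:

```python
from fractions import Fraction

# Hypothetical hypothesis class H: threshold classifiers h_t(x) = (x >= t)
H = {t: (lambda x, t=t: x >= t) for t in range(0, 6)}

# Hypothetical noise-free training data <x_i, d_i>
D = [(1, False), (4, True), (5, True)]


def likelihood(h, data):
    """P(D|h) without noise: 1 if h labels every example correctly, else 0."""
    return 1 if all(h(x) == d for x, d in data) else 0


prior = Fraction(1, len(H))  # uniform prior P(h) = 1/|H|

# P(D) = sum over h of P(D|h) P(h), which equals |VS| / |H|
p_data = sum(likelihood(h, D) * prior for h in H.values())

# Bayes' rule for every hypothesis in the class
posterior = {t: likelihood(h, D) * prior / p_data for t, h in H.items()}

# Version space: hypotheses with non-zero posterior
version_space = [t for t, p in posterior.items() if p > 0]

print(version_space, posterior)
```

Every threshold in the version space ends up with the same posterior, and every inconsistent threshold ends up with zero.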

#### Bayesian Learning on data with noise

- From 15-min course video [Return to Bayesian Learning](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345404/modules/5c2f3b47-b791-46a7-88eb-34a0753665e6/lessons/5462070314/concepts/4733385600923#):

![bayesian learning with noise 1](bayesian_learning_images/bayesian_learning_with_noise_1.png)

Simplification (continued from above):

![bayesian learning with noise 2](bayesian_learning_images/bayesian_learning_with_noise_2.png)

(see definition for _i.i.d._ in Side notes section)

Notes on math used for simplification above:
- When maximizing a function (looking for _argmax_), we can simplify the expression by ignoring factors that do not affect the _argmax_, in this case, over _h_
    - as done with $\frac{1}{\sqrt{2\pi\sigma^{2}}}$ in the simplification
- Can switch $\arg\max$ to $\arg\min$ by multiplying the expression by $-1$
- Can turn products into sums by taking $\ln$ of the expression, since $\ln(ab) = \ln a + \ln b$ (and back again with $e^x$):
    - Used to simplify the Bayesian Learning equation to find the best hypothesis
    - Turns a $\Pi$ of probabilities into a $\Sigma$ of log-probabilities
    - This works because:
        - $\ln(x)$ is the inverse of $e^x$, so it removes the exponential in the Gaussian
        - $\ln(x)$ is also _monotonically increasing_ (as is $e^x$), so applying it does not change which $h$ gives the _argmax_

- _Beautiful_ result shows that minimizing SSE yields the maximum likelihood hypothesis, i.e. the most probable hypothesis given the training data (under a uniform prior)
    - Gives a Bayesian justification for the use of SSE in backpropagation and other methods used on perceptrons
- Smart trick: Modeling noise as a Gaussian distribution
    - Clever use of natural log
- Assumptions:
    - data has been corrupted by Gaussian noise
        - which means the error has zero mean
    - we are modeling a deterministic function $f(x)$
    - we know / can guess the hypothesis class (independent of other assumptions)
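
A small numerical check of the result above, as a sketch under stated assumptions: the data, the noise level `sigma`, and the hypothesis class (lines through the origin, $h_w(x) = wx$) are all made up for illustration. It compares the hypothesis with the highest Gaussian log-likelihood to the one with the lowest SSE:

```python
import math
import random

random.seed(0)
sigma = 0.5  # assumed (known) standard deviation of the Gaussian noise

# Hypothetical data: d_i = f(x_i) + zero-mean Gaussian noise, with f(x) = 2x
xs = [0.5 * i for i in range(20)]
ds = [2.0 * x + random.gauss(0.0, sigma) for x in xs]

# Hypothetical hypothesis class: h_w(x) = w * x for w in 0.0, 0.1, ..., 4.0
candidates = [w / 10 for w in range(0, 41)]


def log_likelihood(w):
    """ln P(D | h_w) for i.i.d. Gaussian noise on each d_i."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (d - w * x) ** 2 / (2 * sigma**2)
        for x, d in zip(xs, ds)
    )


def sse(w):
    """Sum of squared errors of h_w on the data."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))


h_ml = max(candidates, key=log_likelihood)  # maximum likelihood hypothesis
h_sse = min(candidates, key=sse)            # minimum SSE hypothesis

print(h_ml, h_sse)
```

The two selections coincide because the log-likelihood is a constant minus the SSE scaled by $\frac{1}{2\sigma^2}$, so maximizing one is the same as minimizing the other.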

NOTE: minimizing SSE is most likely the wrong thing to do if the noise is not Gaussian or the function is not deterministic.
- Example of $x$ as height and $d$ as weight:
    - the derivation assumes error on $d$ (weight) only; if there is also error on $x$ (height), these assumptions would be violated.
    - Fortunately, linear regressions / lines are well-behaved in this case, if the noise on $d$ is i.i.d. as well.

#### Quiz: Noisy Data

(shown as intro to _Bayesian Learning on data with noise_ section above)
![quiz: noisy data](bayesian_learning_images/quiz_noisy_data.png)

Notes on question:
- $\frac{1}{2^k}$ for $k = 1, 2, 3, \dots$ is a geometric sequence
    - whose sum from $k = 1$ to infinity equals 1,
    - and is therefore a _valid probability distribution_
- "Noisy data" means some stochastic process is going on to distort the data.
- what matters for the error is that its mean is zero (its variance need not be)
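
A minimal sketch of this noise model, assuming the quiz's process is $d_i = k \cdot h(x_i)$ with $k \geq 1$ drawn with probability $\frac{1}{2^k}$ and $h(x) = x$ (the quiz itself is only in the image, so this reading, the data points, and the function names are assumptions):

```python
from fractions import Fraction


def noise_prob(k):
    """P(k) = 1 / 2^k for k = 1, 2, 3, ..."""
    return Fraction(1, 2**k)


# Partial sums of the geometric series approach 1
partial = sum(noise_prob(k) for k in range(1, 21))


def likelihood(data, h=lambda x: x):
    """P(D|h) when each d_i = k * h(x_i) and the examples are independent."""
    p = Fraction(1)
    for x, d in data:
        hx = h(x)
        # d must be a positive integer multiple of h(x), otherwise P(D|h) = 0
        if hx == 0 or d % hx != 0 or d // hx < 1:
            return Fraction(0)
        p *= noise_prob(d // hx)
    return p


# Hypothetical data sets (not the quiz data)
consistent = likelihood([(1, 2), (3, 3), (4, 12)])  # multipliers k = 2, 1, 3
inconsistent = likelihood([(1, 2), (3, 4)])         # 4 is not a multiple of 3

print(partial, consistent, inconsistent)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to compare against a hand calculation.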

Notes on solution:
- If any $d$ value is not a multiple of its $x$ value, then $P(D|h) = 0$


#### Quiz: Best hypothesis

![quiz: best hypothesis](bayesian_learning_images/quiz_best_hypothesis.png)

- Perhaps the MOD function is best only because the other options are so bad
    - Check this by running a linear regression to determine the best hypothesis
- to assess the resulting regression (hypothesis):
    - same as described above
    - calculate the error for each data point to give an SSE value across data points
- In actuality, the data is weird enough that MOD performs best, even better than a regression function

### Minimum Description Length

From course video [Minimum Description Length](https://classroom.udacity.com/nanodegrees/nd009/parts/0091345404/modules/5c2f3b47-b791-46a7-88eb-34a0753665e6/lessons/5462070314/concepts/4733385650923):

![minimum description length 1](bayesian_learning_images/minimum_description_length_1.png)

- Looking at the relationship of maximum a posteriori to information theory
- The central quantity in information theory is usually entropy, $-\sum p \lg p$
- From information theory: an event with probability $p$ has an optimal code length of $-\lg p$ bits
- "We want to somehow minimize two terms that can be described as lengths"
    - the length of the data given the hypothesis, $-\lg P(D|h)$, and the length of the hypothesis, $-\lg P(h)$
- "length of a hypothesis" means _the number of bits needed to represent a hypothesis_
    - which is why we use $\lg$ (log base 2)

Continued from above (after quiz on choosing smaller DT):

![minimum description length 2](bayesian_learning_images/minimum_description_length_2.png)

__length(h):__
- A smaller decision tree (fewer nodes, less depth, etc.) means a smaller hypothesis length or _size_
- Can think about our bias for shorter decision trees as the prior P(h)
- Kind of a Bayesian argument for Occam's Razor, which is often the rationale for pruning DTs

__length(D|h):__
- i.e. the length of the data given a particular hypothesis
- Can be interpreted as:
    - If the hypothesis matches the data perfectly, then no extra bits are needed to describe the data
    - Otherwise, D would be represented by the errors of the hypothesis on the data
    - If the hypothesis does not match the data at all, then D would have to contain all the correct answers.
    - Captures the notion of fit of a particular hypothesis
   
__Therefore length(h) + length(D|h):__
- the maximum a posteriori hypothesis would be the one that:
    - is the simplest hypothesis, $-\lg P(h)$ or length(h)
    - that minimizes the error, $-\lg P(D|h)$ or length(D|h)
- Represents a fundamental tradeoff in ML
- Called _minimum description length_
- Applies to Neural Nets as well
    - Keeping weights small keeps the length / size of the hypothesis small, which helps prevent overfitting
- Another validation of Occam's Razor and tradeoffs in ML from Bayesian Learning
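
The steps behind this result can be sketched as follows (a reconstruction in the notation of these notes, using the information-theory fact that an event with probability $p$ has an optimal code length of $-\lg p$ bits):

$$
\begin{aligned}
h_{MAP} &= \arg\max_{h} P(D|h)\,P(h) \\
&= \arg\max_{h} \left[ \lg P(D|h) + \lg P(h) \right] \\
&= \arg\min_{h} \left[ -\lg P(D|h) - \lg P(h) \right] \\
&= \arg\min_{h} \left[ \text{length}(D|h) + \text{length}(h) \right]
\end{aligned}
$$

The second line uses the monotonicity of $\lg$, and the third multiplies by $-1$ to switch from $\arg\max$ to $\arg\min$.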

### Bayesian Classification

With quiz to introduce Bayesian Classification:

![bayesian classification](bayesian_learning_images/bayesian_classification.png)

- Bayesian Classification _is not the same as_ Bayesian Learning in that
    - Bayesian Classification looks for the most probable _label_ (or $v$ for value) for a particular data point instead of the most probable _hypothesis_ for the training data
- Student can derive / simplify an equation similar to the Bayesian Learning equation with a method similar to the one used above.
- Determining the most probable value is what we care about; assessing the hypotheses' probabilities is just the means to that end
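
A minimal sketch of the difference, using made-up posteriors and predictions for three hypothetical hypotheses (the numbers and names are assumptions for illustration, not the quiz values). Each hypothesis votes for its label with weight P(h|D):

```python
# Hypothetical posteriors P(h|D) and each hypothesis' label for one new data point
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}


def bayes_optimal_label(posteriors, predictions):
    """Return the label with the largest posterior-weighted vote, plus all votes."""
    votes = {}
    for h, p in posteriors.items():
        label = predictions[h]
        votes[label] = votes.get(label, 0.0) + p
    return max(votes, key=votes.get), votes


# Most probable single hypothesis (Bayesian Learning)
h_map = max(posteriors, key=posteriors.get)
# Most probable label (Bayesian Classification)
label, votes = bayes_optimal_label(posteriors, predictions)

print(h_map, predictions[h_map], label, votes)
```

With these numbers the single most probable hypothesis predicts one label while the weighted vote picks the other, which is why the most probable hypothesis does not always give the most probable label.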