notes lec apr 14
zhudotexe committed Apr 14, 2020
1 parent 2d4a20a commit 617ab67
Showing 2 changed files with 124 additions and 2 deletions.
2 changes: 1 addition & 1 deletion index.rst
@@ -7,7 +7,7 @@ Welcome to cse142-notes's documentation!
========================================

.. toctree::
:maxdepth: 2
:maxdepth: 4
:caption: Contents:

intro
124 changes: 123 additions & 1 deletion prob.rst
@@ -1,6 +1,8 @@
Probability Review
==================

Useful notes: http://cs229.stanford.edu/section/cs229-prob.pdf

Let's define some important things.

- **Outcome Space**: :math:`\Omega` - contains all possible atomic outcomes
@@ -44,4 +46,124 @@ Let's define some important things.
E(\text{# heads}) & = \sum_{r=1}^6 P(roll = r) E(\text{# heads} | roll = r) \\
& = \frac{1}{6}(\frac{1+2+3+4+5+6}{2}) \\
& = \frac{21}{12} = 1.75
- Joint distributions factor
- if :math:`\Omega = S \times T \times U`, then :math:`P(S=s,T=t,U=u)` is :math:`P(S=s)P(T=t|S=s)P(U=u|S=s,T=t)` (see the sketch below)
- Conditional distributions are also distributions
- :math:`P(A|B) = \frac{P(A, B)}{P(B)}`, so :math:`P(A|B, C)=\frac{P(A,B|C)}{P(B|C)}`
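
A tiny sketch of this factorization, using a made-up joint distribution over three binary variables; each conditional is computed as a ratio of marginals, and their product recovers the joint:

.. code-block:: python

    # Check P(S=s, T=t, U=u) = P(S=s) P(T=t|S=s) P(U=u|S=s, T=t)
    # on a toy joint distribution (probabilities are arbitrary but sum to 1).
    from itertools import product

    joint = dict(zip(product([0, 1], repeat=3),
                     [.10, .05, .20, .15, .05, .10, .05, .30]))

    def marginal(fixed):
        """Sum the joint over all outcomes consistent with the fixed coordinates."""
        return sum(p for stu, p in joint.items()
                   if all(stu[i] == v for i, v in fixed.items()))

    for s, t, u in product([0, 1], repeat=3):
        p_s = marginal({0: s})                                     # P(S=s)
        p_t_given_s = marginal({0: s, 1: t}) / p_s                 # P(T=t|S=s)
        p_u_given_st = joint[(s, t, u)] / marginal({0: s, 1: t})   # P(U=u|S=s,T=t)
        assert abs(p_s * p_t_given_s * p_u_given_st - joint[(s, t, u)]) < 1e-12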

Bayes Rule for Learning
-----------------------

- Assume a joint distribution :math:`P(\mathbf{X}=\mathbf{x}, Y=y)`
- We want :math:`P(Y=y|\mathbf{X}=\mathbf{x})` for each label :math:`y` on a new instance :math:`\mathbf{x}`
- So, using Bayes' Rule, :math:`P(y|\mathbf{x}) = P(\mathbf{x}|y) \frac{P(y)}{P(\mathbf{x})}`
- :math:`P(\mathbf{x})` is the same for every label, so it suffices that :math:`P(y|\mathbf{x})` is proportional to :math:`P(\mathbf{x}|y) P(y)`
- From the data, we can learn :math:`P(\mathbf{x}|y)` and :math:`P(y)`
- Predict label :math:`y` with largest product
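
A minimal sketch of this prediction rule. The class priors and per-class likelihood functions below are hypothetical placeholders standing in for quantities estimated from data:

.. code-block:: python

    # Predict the label y with the largest P(x|y) P(y); the shared
    # denominator P(x) is ignored. All numbers here are made up.
    priors = {"spam": 0.4, "ham": 0.6}          # P(y)
    likelihoods = {                             # P(x|y), toy stand-ins
        "spam": lambda x: 0.9 if "winner" in x else 0.1,
        "ham":  lambda x: 0.2 if "winner" in x else 0.8,
    }

    def predict(x):
        return max(priors, key=lambda y: likelihoods[y](x) * priors[y])

    print(predict("you are a winner"))   # -> "spam"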

So how do we learn :math:`P(\mathbf{x}|y)`?

.. note::
Take for example a coin flip. You observe the sequence HTH; what is the probability that the next flip is H?

Under maximum likelihood, the answer is 2/3: with :math:`\theta = P(H)`, the likelihood of the observed sequence is
:math:`L(\theta) = P(HTH|\theta) = \theta^2 (1-\theta)`.

Setting the derivative of :math:`L(\theta)` to zero and solving gives :math:`\theta = 2/3`, as shown below.
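
Spelling out that derivative step:

.. math::
    \frac{d}{d\theta}\left[\theta^2 (1-\theta)\right] = 2\theta - 3\theta^2 = \theta(2 - 3\theta) = 0

The roots are :math:`\theta = 0` (likelihood zero) and :math:`\theta = 2/3`, so the maximum is at :math:`\theta = 2/3`.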

.. note::
But what if we have a prior belief :math:`P(\theta)` where :math:`\theta = P(H)`?

Now, the posterior on :math:`\theta` becomes :math:`P(\theta | HTH)`:

.. math::
P(\theta | HTH) = P(HTH | \theta) \frac{P(\theta)}{P(HTH)}

Or in this case:

.. math::
\frac{\theta^2 (1-\theta) P(\theta)}{\text{normalization}}

**Discrete Prior**

Taking :math:`P(\theta=0) = P(\theta=1/2) = P(\theta=1) = 1/3`, :math:`\theta^2 (1-\theta) P(\theta)` is
0, 1/24, and 0 for the 3 cases respectively. Thus, the posterior :math:`P(\theta = 1/2 | HTH) = 1`.
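
A quick numeric check of this computation in plain Python:

.. code-block:: python

    # Posterior over a 3-point prior on theta after observing HTH.
    prior = {0.0: 1/3, 0.5: 1/3, 1.0: 1/3}                       # P(theta)
    unnorm = {t: t**2 * (1 - t) * p for t, p in prior.items()}   # P(HTH|theta) P(theta)
    z = sum(unnorm.values())                                     # P(HTH) = 1/24
    posterior = {t: u / z for t, u in unnorm.items()}
    print(posterior)   # {0.0: 0.0, 0.5: 1.0, 1.0: 0.0}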

**Prior Density**

- :math:`P(\theta) = 1` for :math:`0 \leq \theta \leq 1`
- So :math:`\theta^2 (1-\theta) P(\theta)` is just :math:`\theta^2 (1-\theta)`
- and the posterior is :math:`12 \, \theta^2 (1-\theta)` (the normalizing constant :math:`P(HTH) = 1/12` is worked out below)
- If we plot this, the max is at :math:`\theta = 2/3`
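
The normalizing constant mentioned above and the resulting posterior density:

.. math::
    P(HTH) = \int_0^1 \theta^2 (1-\theta) \, d\theta = \frac{1}{3} - \frac{1}{4} = \frac{1}{12},
    \qquad
    P(\theta | HTH) = \frac{\theta^2 (1-\theta)}{1/12} = 12 \, \theta^2 (1-\theta)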

- Treat the parameter :math:`\theta` as a random variable with the prior distribution :math:`P(\theta)`, and observe training data :math:`Z`
- :math:`\text{posterior} = \frac{\text{prior} \times \text{data likelihood}}{\text{constant}}`
- :math:`P(\theta | Z) = \frac{P(\theta) P(Z | \theta)}{P(Z)}`

Bayes' Estimation
-----------------

Treat the parameter :math:`\theta` as a random variable with prior distribution :math:`P(\theta)`, and use the fixed training data
:math:`Z = (\mathbf{x}, y)` (a realization of the random variable :math:`S`).

Maximum Likelihood
^^^^^^^^^^^^^^^^^^

.. math::
\theta_{ML} = \arg \max_{\theta'} P(S=Z|\theta = \theta')
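
For the running coin example (data :math:`HTH`, likelihood :math:`\theta'^2 (1-\theta')`):

.. math::
    \theta_{ML} = \arg \max_{\theta'} \theta'^2 (1-\theta') = \frac{2}{3}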

Maximum a Posteriori
^^^^^^^^^^^^^^^^^^^^

.. math::
\theta_{MAP} & = \arg \max_{\theta'} P(\theta = \theta' | S=Z) \\
& = \arg \max_{\theta'} P(S=Z | \theta = \theta')\frac{P(\theta = \theta')}{P(S=Z)}
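
With the uniform prior from the coin example, :math:`P(\theta = \theta') = 1`, so the MAP estimate coincides with the ML estimate:

.. math::
    \theta_{MAP} = \arg \max_{\theta'} \theta'^2 (1-\theta') \cdot 1 = \frac{2}{3}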

Predictive Distribution
^^^^^^^^^^^^^^^^^^^^^^^
aka Full Bayes

.. math::
P(Y=y | S=Z) = \int P(Y=y | \theta=\theta') P(\theta=\theta' | S=Z) d\theta'
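
For the coin example, :math:`P(Y=H | \theta = \theta') = \theta'` and the posterior density is :math:`12 \, \theta'^2 (1-\theta')`, so:

.. math::
    P(Y=H | S=Z) = \int_0^1 \theta' \cdot 12 \, \theta'^2 (1-\theta') \, d\theta' = 12 \left(\frac{1}{4} - \frac{1}{5}\right) = \frac{3}{5}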

Mean a'Post
^^^^^^^^^^^

.. math::
\theta_{mean} = E[\theta | S=Z] = \int \theta' P(\theta=\theta' | S=Z) d\theta'
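
For the coin example this is the same integral as the predictive probability of heads, so :math:`\theta_{mean} = 3/5`. A small plain-Python sketch approximating all four estimators on a grid (the grid resolution is an arbitrary choice):

.. code-block:: python

    # ML, MAP, posterior-mean, and predictive estimates for the HTH coin
    # example under a uniform prior, approximated on a grid over theta.
    N = 100_000
    thetas = [i / N for i in range(N + 1)]
    likelihood = [t**2 * (1 - t) for t in thetas]   # P(HTH | theta), uniform prior
    z = sum(likelihood) / N                         # ~ P(HTH) = 1/12
    posterior = [l / z for l in likelihood]         # density, ~ 12 theta^2 (1 - theta)

    theta_ml = max(thetas, key=lambda t: t**2 * (1 - t))                   # ~ 2/3
    theta_map = max(range(len(thetas)), key=lambda i: posterior[i]) / N    # ~ 2/3
    theta_mean = sum(t * q for t, q in zip(thetas, posterior)) / N         # ~ 3/5
    p_next_head = theta_mean   # for a Bernoulli outcome, predictive P(H|Z) = E[theta|Z]
    print(theta_ml, theta_map, theta_mean, p_next_head)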

Use
^^^

- draw enough data so that :math:`P(Y=y | \mathbf{X}=\mathbf{x})` can be estimated for every possible pair
- this takes a lot of data
- another approach: use a class of models
- think of each model :math:`m` as a way of generating the training set Z of :math:`(\mathbf{x}, y)` pairs

Compound Experiment
^^^^^^^^^^^^^^^^^^^

- prior :math:`P(M=m)` on model space
- models give :math:`P(X=x | M=m)` (where :math:`x` is a pair :math:`(\mathbf{x}, y)`)
- The joint probability of the compound experiment (if the data are iid given :math:`m`) is:

.. math::
P(\{(\mathbf{x_i}, y_i)\}, m) = P(m) \prod_i (P(\mathbf{x_i} | m) P(y_i | \mathbf{x_i}, m))
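
A minimal sketch of this joint in log space; the three probability functions are hypothetical callables supplied by the caller:

.. code-block:: python

    import math

    # log P({(x_i, y_i)}, m) = log P(m) + sum_i [log P(x_i|m) + log P(y_i|x_i, m)],
    # assuming the pairs are iid given the model m.
    def log_joint(data, m, log_p_m, log_p_x_given_m, log_p_y_given_xm):
        total = log_p_m(m)
        for x, y in data:
            total += log_p_x_given_m(x, m) + log_p_y_given_xm(y, x, m)
        return total

    # toy usage with made-up constant probabilities
    print(log_joint([(1, 0), (2, 1)], "m0",
                    lambda m: math.log(0.5),
                    lambda x, m: math.log(0.1),
                    lambda y, x, m: math.log(0.7)))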

Generative and Discriminative Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Generative model: :math:`P((\mathbf{x}, y) | m)` (sketched after this list)
- tells how to generate examples (both instance and label)
- learn :math:`P(\mathbf{x} | y, m)` and use Bayes' rule
- common assumptions:
- :math:`P(\mathbf{x} | y, m)` is Gaussian
- :math:`P(y | m)` is Bernoulli
- Discriminative model: :math:`P(y | h, \mathbf{x})` (here :math:`h` is the learned hypothesis)
- tells how to create labels from instances
- often :math:`f(\mathbf{x}) = \arg \max_y f_y(\mathbf{x})`
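
A minimal sketch of the generative recipe above under the listed assumptions (1-D Gaussian class-conditionals, Bernoulli labels); the data and all numbers are made up for illustration:

.. code-block:: python

    import math

    def fit(xs, ys):
        """Estimate (mean, variance, P(y)) per class from labeled 1-D data."""
        params = {}
        for label in (0, 1):
            pts = [x for x, y in zip(xs, ys) if y == label]
            mu = sum(pts) / len(pts)
            var = sum((x - mu) ** 2 for x in pts) / len(pts)
            params[label] = (mu, var, len(pts) / len(xs))
        return params

    def gaussian_pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def predict(x, params):
        # Bayes' rule: argmax_y P(x|y) P(y)
        return max(params, key=lambda y: gaussian_pdf(x, *params[y][:2]) * params[y][2])

    params = fit([1.0, 1.2, 0.8, 3.0, 3.3, 2.9], [0, 0, 0, 1, 1, 1])
    print(predict(2.5, params))   # -> 1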


