Introduction to Bayesian Analysis
=================================

Bayesian analysis is the process of taking what we already know about
the world and applying it to new situations. Formalized mathematically,
it is used for statistical inference and for interpreting probabilities,
and it is central to many machine learning algorithms because "what we
already know about the world" can be updated continually as new
information becomes available.

A Simple Real-Life Example
--------------------------

What we already know about the world can be couched in terms of
probabilities. For instance, we might know what fraction of the
population in the US is registered Republican. Say it's 40%. Let's also
say that everyone else is a Democrat.

If we don't know anything about some random person, we might say they
have a 40% chance of being a Republican. By so doing, we are imposing an
existing belief (called a "prior") on the present challenge of
classifying the random person.

We might know a bit more about Republicans. Let's say we had voting
registration records for people living in gated communities. From these,
suppose we learn that 70% of Republicans live in gated communities.

We might know a bit about housing, as well. For instance, we could look
at real estate transaction data and determine that 30% of the population
lives in gated communities.

Now, let's say we meet someone in a park. During a pleasant
conversation, we learn that they live in a gated community. Not wanting
to be nosy, we don't ask, but wonder "What are the odds that they are
Republican?"

We already have a hunch that the new person is Republican, but can we do
better than that and figure out the probability that they are?

The Bayesian Approach
---------------------

Enter Thomas Bayes (and his posthumous contributor Richard Price). Bayes
worked out the math behind conditional probabilities: "Given some
condition A, what's the probability of another condition B?" Here, we
already know the answer to "Given Republican, what's the probability of
Gated?" and want the reverse: "Given Gated, what's the probability of
Republican?"

|             |     | Condition B |     |
|-------------|-----|-------------|-----|
|             |     | B           | \~B |
| Condition A | A   | w           | x   |
|             | \~A | y           | z   |

Consider the table above. The symbols w, x, y, and z represent
probabilities of combinations of A, not A (\~A), B, and not B (\~B). A
few things might be noted:

> w + x + y + z = 100% The whole world
>
> P(A) = (w + x) / (w + x + y + z) Probability of A overall (row
> sum/everything)
>
> P(B) = (w + y) / (w + x + y + z) Probability of B overall (column
> sum/everything)
>
> Bayes' theorem, stated simply, is that conditional probabilities like
> (A\|B) "A given B" can be related thusly:
>
> P(A\|B) \* P(B) = P(B\|A) \* P(A)
>
> The proposition is easily demonstrated in terms of the table.
>
> P(A\|B) = w / (w + y) Probability of (A and B) / (column sum gets Bs)
>
> P(B\|A) = w / (w + x) Probability of (A and B) / (row sum gets As)

Throwing this whole mess together in terms of w, x, y, and z we get:

$$
\frac{w}{w + y} \cdot \frac{w + y}{w + x + y + z} = \frac{w}{w + x} \cdot \frac{w + x}{w + x + y + z}
$$

By inspection, we can see that once we cancel terms ((w + y) on the
left, (w + x) on the right), both sides reduce to w / (w + x + y + z),
so the equation works.
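For readers who prefer numbers to algebra, the identity can also be checked with a minimal Python sketch. The cell values for w, x, y, and z below are arbitrary illustrative numbers (an assumption, not part of the example in this article), chosen only so that they sum to 1.

```python
import math

# Hypothetical joint probabilities for the 2x2 table; they must sum to 1.
w, x, y, z = 0.20, 0.30, 0.10, 0.40
total = w + x + y + z

p_a = (w + x) / total        # row sum / everything
p_b = (w + y) / total        # column sum / everything
p_a_given_b = w / (w + y)    # (A and B) / column sum
p_b_given_a = w / (w + x)    # (A and B) / row sum

left = p_a_given_b * p_b     # P(A|B) * P(B)
right = p_b_given_a * p_a    # P(B|A) * P(A)

# Both sides should reduce to w / total, the probability of (A and B).
identity_holds = math.isclose(left, right) and math.isclose(left, w / total)
print(left, right, identity_holds)
```

Any other non-negative cell values with positive row and column sums should pass the same check.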

Bayes' theorem is typically expressed in terms of the value we want to
solve for. Since there are only four terms, P(A\|B), P(B\|A), P(A), and
P(B), we can solve for any one of them as long as we know the other
three. So we might state it:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

For our problem, we might take our real world values and substitute
them:

A = Rep(ublican) and B = Gate(d)

Restated, our equation and assumptions become:

$$
P(\text{Rep} \mid \text{Gate}) = \frac{P(\text{Gate} \mid \text{Rep}) \cdot P(\text{Rep})}{P(\text{Gate})}
$$

P(Gate \| Rep) = .70 70% of Republicans live in gated communities

P(Rep) = .40 40% of the population are Republicans

P(Gate) = .30 30% of the population live in gated communities

We can now solve the equation with our assumed population data:

P(Rep \| Gate) = .70 \* .40 / .30 = .93

So our new friend has a 93% chance of being a Republican.
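The same arithmetic can be written as a small Python sketch. The function name `posterior` and its parameter names are illustrative choices, not anything standard; the three input values are the assumed population figures from this example.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


p_gate_given_rep = 0.70  # 70% of Republicans live in gated communities
p_rep = 0.40             # 40% of the population are Republicans
p_gate = 0.30            # 30% of the population live in gated communities

p_rep_given_gate = posterior(p_gate_given_rep, p_rep, p_gate)
print(f"P(Rep | Gate) = {p_rep_given_gate:.2f}")  # prints "P(Rep | Gate) = 0.93"
```

Note that the function does not check whether the three inputs are mutually consistent, so it can return values above 1 for impossible combinations.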

Reality Checking
----------------

Let's do a gut check. We know that 70% of Republicans live in gated
communities, so 70% is a natural starting point for the reverse
question. We'd expect that figure to get shaded higher because
Republicans represent a larger fraction of the general population (40%)
than do residents of gated communities (30%). A bump of 20+ percentage
points seems a bit large, but possible. We can see at least that our
post-analysis ("posterior") expectation, informed by other things we
know about the world, is in the right direction.

Another way to do a gut check is to fool around with our assumptions. We
have to be prepared for the possibility that they're a bit off, anyway.
As we do so, we'll want to be thinking about a reasonable range for them
– and how the assumptions play off each other in the real world.

Here's a chart showing different values for P(Gate), leaving 40% of the
population as Republicans and 70% of the Republicans living in gated
communities. We'll test the range from 10% of everyone living in gated
communities to 100%. As we'll see, some potential values we might test
don't really make sense.



The gray area on top covers a "region of impossibility" – it shows a
greater than 100% chance that our new friend is a Republican. How could
that happen? These values are associated with P(Gate) values smaller
than 28% (.70 \* .40), just under 30%. In other words, there is not
enough gated community housing stock to accommodate all the Republicans
we're claiming live there.

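The gray area on the bottom covers another "region of impossibility". If
30% of Republicans live outside gated communities, then at least 12% of
the population (.30 \* .40) must live elsewhere. As soon as we claim that
more than 88% of everyone lives in a gated community, it violates the
assumption that some Republicans live elsewhere.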
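One way to make both "regions of impossibility" concrete is to rebuild the four cells of the 2x2 table for each candidate P(Gate) and check that none goes negative. The following is a minimal Python sketch under this article's assumptions (40% Republican, 70% of Republicans gated); the helper names `joint_table` and `is_feasible` are hypothetical. Under these assumptions the exact limits work out to 28% and 88%.

```python
P_REP = 0.40             # assumed share of Republicans
P_GATE_GIVEN_REP = 0.70  # assumed share of Republicans in gated communities


def joint_table(p_gate: float) -> dict:
    """Cells of the 2x2 table implied by the assumptions and P(Gate)."""
    w = P_GATE_GIVEN_REP * P_REP  # Republican and gated
    x = P_REP - w                 # Republican and not gated
    y = p_gate - w                # not Republican and gated
    z = (1.0 - P_REP) - y         # not Republican and not gated
    return {"w": w, "x": x, "y": y, "z": z}


def is_feasible(p_gate: float, tol: float = 1e-9) -> bool:
    """A scenario is possible only if no cell is negative."""
    return all(cell >= -tol for cell in joint_table(p_gate).values())


for pct in range(10, 101, 10):
    print(f"P(Gate) = {pct:3d}%  feasible: {is_feasible(pct / 100)}")
```

On this 10%-step grid, only the values from 30% through 80% should come back as feasible.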

Getting rid of the obviously spurious data, we can review our
sensitivity test:

| **P(Gate)**        | **P(friend = Republican)**      |
|--------------------|---------------------------------|
| 30%                | 93%                             |
| 35%                | 80%                             |
| 40%                | 70%                             |
| 45%                | 62%                             |
| 50%                | 56%                             |
| 55%                | 51%                             |
| 60%                | 47%                             |
| 65%                | 43%                             |

This would seem to make sense. We're assuming that 70% of Republicans
live in gated communities. To the extent this housing is rare, housing
choice is a pretty good predictor of party affiliation. But if most
people live in gated communities, the effect is "watered down" – and
this feature loses its discriminatory power.
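The sensitivity table can be reproduced with a few lines of Python. This is a sketch under the same assumed inputs (40% Republican, 70% of Republicans gated); the variable names are illustrative.

```python
P_REP = 0.40             # assumed share of Republicans
P_GATE_GIVEN_REP = 0.70  # assumed share of Republicans in gated communities

rows = []
for pct in range(30, 70, 5):  # P(Gate) from 30% to 65% in 5% steps
    p_gate = pct / 100
    p_rep_given_gate = P_GATE_GIVEN_REP * P_REP / p_gate
    rows.append((pct, round(p_rep_given_gate * 100)))

for pct, post in rows:
    print(f"P(Gate) = {pct}%  ->  P(Rep | Gate) = {post}%")
```

The posterior falls steadily as P(Gate) grows, which is the "watering down" effect described above.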

Advertisers are keenly aware of the fact that many Internet users have
ad-blocking software, and that software uses Bayesian filters to figure
out what's an ad and what is content. As a result, they do what they can
to make their advertising content "blend in" with the herd of legitimate
content – they try not to provide features with a high discriminating
effect. You'll see sidebars called "news from around the Internet",
"chosen for you", lists, quizzes, and other "click bait" much more often
than a flashing banner for herbal Viagra. Quite possibly, Thomas Bayes
had already anticipated "herbal Viagra".