# Probability

`Counting`, as in counting cards, is the most basic kind of probability.  The questions that counting answers look like this:  You have a set S, and you pick elements from that set to build a new set T.  How many different T's could you make?

There are 2 questions you should ask that determine how you solve this problem:

Does `order` matter? Is T = {x, y} the same as T = {y, x}?  If the answer is yes, then order does not matter.  If the answer is no, then order matters.

Is there `replacement`?  Let's say S = {x, y, z} and you want to choose 2 elements for T.  If you choose x, does S = {y, z} now, or is it still S = {x, y, z}?  If x is gone from S, then there is no replacement.  If x is still in S, then there is replacement.  If there is replacement, then you could have T = {x, x}.  If there is no replacement, then it would be impossible for you to get T = {x, x}.

Let's say we have S = {a, b, c, d, e, f, g, h, i, j}, which is 10 items, and we want to find different subsets of 5.

`n` is the number of items we can choose from

`k` is the number of choices we get to make.

`With replacement, order matters`

Example:  You roll a die 5 times.  How many sequences of rolls could you get?

Answer:  6\*6\*6\*6\*6 = 6^5

On each roll, there are 6 possible outcomes.  1 followed by 2 is different from 2 followed by 1, since order matters.

General formula:  n^k
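
One way to sanity-check n^k is to enumerate the sequences for a small case.  This is just a sketch: `count_sequences` is a name made up for illustration, and the brute-force check uses 3 rolls instead of 5 to keep the enumeration small.

```python
from itertools import product

def count_sequences(n, k):
    """Count ordered picks of k items from n options, with replacement, by brute force."""
    return sum(1 for _ in product(range(n), repeat=k))

# 6 faces, 3 rolls: the brute-force count agrees with n^k
assert count_sequences(6, 3) == 6 ** 3

# The 5-roll example from above, using the formula directly
print(6 ** 5)
```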

`Without replacement, order matters`

Example:  How many ways could 20 cards be stacked up from a deck of 100 unique cards?

Answer:  100\*99\*98\*...\*81 = 100! / 80!

Since each card is unique, there is no replacement.  A stack is ordered, so order matters.  We have 100 choices for our first card, 99 choices for our second card, etc.

General formula:  n! / (n-k)!
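
The same kind of check works here.  A minimal sketch, assuming Python 3.8+ for `math.perm`; `count_stacks` is a hypothetical helper, and the brute-force part uses a tiny deck because enumerating 100 cards is hopeless.

```python
from itertools import permutations
from math import factorial, perm

def count_stacks(n, k):
    """Count ordered picks of k distinct items from n by brute force."""
    return sum(1 for _ in permutations(range(n), k))

# Small case: 6 * 5 * 4 ordered stacks of 3 cards from a 6-card deck
assert count_stacks(6, 3) == factorial(6) // factorial(6 - 3)

# math.perm(n, k) computes n! / (n-k)! directly
assert perm(100, 20) == factorial(100) // factorial(80)
```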

`Without replacement, order doesn't matter`

Example:  How many 20 card hands could you make from a deck of 100 unique cards?

Answer:  100! / 80! / 20!

First we do the calculation as if order did matter.  Now consider how many different stacks of cards you could make with a single hand of 20 cards.  The answer to that is 20!.  So for every 20! stacks of cards, there is 1 hand of cards.  So you take 100! / 80! and divide the whole thing by 20! to convert from ordered stacks of cards to unordered hands of cards.

General formula:  n! / ((n-k)!k!)

This formula is so common it has a name:  choose, or `n choose k`, and it is often abbreviated as $n \choose k$.
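
A quick sketch of the 'divide out the orderings' idea, assuming Python 3.8+ for `math.comb`.  `count_hands` is a made-up helper that brute-forces a small deck.

```python
from itertools import combinations
from math import comb, factorial

def count_hands(n, k):
    """Count unordered picks of k distinct items from n by brute force."""
    return sum(1 for _ in combinations(range(n), k))

# Small case: ordered stacks divided by the k! orderings of each hand
ordered = factorial(6) // factorial(6 - 3)
assert count_hands(6, 3) == ordered // factorial(3) == comb(6, 3)

# The 20-card hand example, written the same way as the answer above
assert comb(100, 20) == factorial(100) // factorial(80) // factorial(20)
```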

`With replacement, order doesn't matter` Warning, this one is weird.

Example:  How many ways can you put 5 identical balls into 10 labeled bins?

Answer:  $14 \choose 5$

First, we are choosing bins, not balls.  If that's weird, maybe think of the bin as being 'reused' instead of 'replaced'.  We can represent this problem with a binary string.  0's are balls, and 1's are dividers between bins.  So if there are 2 bins, there's 1 divider.  Since there are 10 bins, there are 9 dividers.  Here's an example ordering:  01101011100111.  There's 1 ball in the first bin, third bin, fourth bin, 2 balls in the seventh bin, and 0 balls in all the other bins.  So there are 10 - 1 + 5 digits, and we are choosing 5 of them to be 0's and the rest to be 1's.  So we end up with $14 \choose 5$.  We could also equivalently say that we are choosing 9 of them to be 1's and the rest to be 0's.

General formula:  ${n+k-1} \choose k$, where n is the number of bins and k is the number of balls.
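
Since this one is weird, here is a small sketch that checks it two ways.  `count_fillings` and `decode` are names invented for this example: the first brute-forces the count by treating each unordered pick of bin labels (repeats allowed) as one way to fill the bins, and the second turns a ball/divider string back into per-bin counts.

```python
from itertools import combinations_with_replacement
from math import comb

def count_fillings(bins, balls):
    """Count ways to put identical balls into labeled bins by brute force."""
    return sum(1 for _ in combinations_with_replacement(range(bins), balls))

def decode(bits):
    """Turn a string of '0' (ball) and '1' (divider) into a list of per-bin counts."""
    return [len(chunk) for chunk in bits.split("1")]

# 5 balls, 10 bins: brute force agrees with (10 - 1 + 5) choose 5
assert count_fillings(10, 5) == comb(10 - 1 + 5, 5)

# The example string from above: 9 dividers give 10 bins holding 5 balls in total
counts = decode("01101011100111")
assert len(counts) == 10 and sum(counts) == 5
print(counts)
```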

<details><summary>Prove it</summary>
    
$$
\binom{n}{k}

= \frac{n!}{(n-k)!k!}
$$

---------------------------------------------------------------------

$$
\binom{n}{n-k}

= \frac{n!}{(n-(n-k))!(n-k)!}

= \frac{n!}{k!(n-k)!}

= \frac{n!}{(n-k)!k!}

$$
    
</details>

https://docs.google.com/presentation/d/1Y7sdF-q27CjbGO58gpYcYS9Fgz3PSkZJcXDyOuJqiTQ/edit#slide=id.g407c66229f_1_274

Compare this with what the 'law of large numbers' says.  Roughly: if you keep drawing independent samples from a distribution, the sample average gets closer and closer to the distribution's true average.  The thing is that you don't actually know the distribution, and your samples probably aren't drawn fairly from it.  Like, pretty much ever.  Because you're a human, and you're bad at collecting data.

Oh, if the link is broken, just look for 'Literary Digest 1936 election poll failing'.  It failed even though the Digest had gotten it right for a number of years beforehand.

**Bayes Nets**

Reviewing these because Sequioua asked for help and it's not very often that you can provide assistance to someone.

Bayes nets are for when we don't have perfect information about the world.  They let us calculate the probabilities of certain events happening.

Events are connected and potentially dependent on each other in some way.  This is where the net comes in.  An arrow means that 2 events _might_ be dependent on each other.

Inference by enumeration: We have a giant table of probabilities.
Say we want P(W=sun|S=winter).
We select the rows where S=winter.
Say the probabilities in those rows sum to a quarter, so P(S=winter)=25%
Now the entries in our table only account for 25% of what season it might be.
But since we _know_ S=winter, the probability is 1, and we have to renormalize.
Renormalizing means making the total probability of this new table 1 again.
To do that, just sum the probabilities of each table entry, then divide each table entry by that sum.
Now you have a new table whose name is P(W|S=winter).
This name means 'we know the season is winter.  Put in a query for the weather W to find the probability of that weather given that the season is winter.'
Then you sum all the P(W=sun) rows to get the total probability of P(W=sun|S=winter).
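
As a worked instance with made-up numbers (assumptions for illustration, not from any real table), suppose the only winter rows in the joint table are P(S=winter, W=sun) = 0.10 and P(S=winter, W=rain) = 0.15.  They sum to 0.25, so renormalizing gives:

$$
P(W=\text{sun} \mid S=\text{winter}) = \frac{0.10}{0.10 + 0.15} = 0.4
$$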

So if you have a table of probabilities, you want to compute P(Q_1, ..., Q_n|e_1, ..., e_m).

Each e_i is 'evidence'.  It's a random variable that we already know the assignment of, like S=winter.

Each Q_i is a query variable that we don't know.  A Q_i might be W, or it might be W=sun.

W means 'give me a table with all the possibilities for W'.

W=sun means 'give me a table where W=sun specifically.'

These are the general steps to compute P:
Step 1:  Select all the rows where every e_i is true and put them in a new table.
Step 2:  Renormalize the new table.
for each row:
    P(row) = P(row) / sum(P(r) for every row r in the new table)
Step 3:
For each Q_i that has been assigned a value (like W=sun), select the rows that match it.  Do not repeat step 2.
Step 4:
Now for the unassigned Q_i's (like W), assign them to whatever it is that you want to query, and sum the probabilities of the rows that match.  Summing gets rid of any variables that are neither query nor evidence.
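
The steps above can be sketched in Python roughly like this.  Everything here is an assumption made for illustration: the `query` function, the dict-of-tuples layout for the joint table, and the numbers in the example table.

```python
def query(joint, names, query_vars, evidence):
    """Inference by enumeration on a joint table.

    joint:      dict mapping a tuple of values (one per variable) to a probability
    names:      tuple of variable names, in the same order as the tuples in `joint`
    query_vars: list of variable names to keep in the answer
    evidence:   dict of variable name -> observed value
    Returns a dict mapping tuples of query values to P(query | evidence).
    """
    idx = {name: i for i, name in enumerate(names)}

    # Step 1: keep only the rows that agree with the evidence
    rows = {row: p for row, p in joint.items()
            if all(row[idx[v]] == val for v, val in evidence.items())}

    # Step 2: renormalize so the surviving rows sum to 1
    total = sum(rows.values())

    # Steps 3 and 4: group rows by their query values and sum within each group,
    # which sums out every variable that is neither query nor evidence
    answer = {}
    for row, p in rows.items():
        key = tuple(row[idx[v]] for v in query_vars)
        answer[key] = answer.get(key, 0.0) + p / total
    return answer

# Made-up joint table over season S and weather W
names = ("S", "W")
joint = {
    ("winter", "sun"): 0.10, ("winter", "rain"): 0.15,
    ("summer", "sun"): 0.60, ("summer", "rain"): 0.15,
}
table = query(joint, names, ["W"], {"S": "winter"})  # the table P(W | S=winter)
print(table[("sun",)])                               # look up W=sun in it
```

Leaving W unassigned returns the whole table P(W|S=winter), and looking up `("sun",)` in it is the W=sun query.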

What's the difference between assigning Q_i vs an e_i?  The difference is just like P(X=x, Y=y|Z=z) vs P(X=x|Y=y, Z=z).  In the first case, Y is a Q_i, in the second, it's an e_i.  The first is saying 'what is the probability that X is x and Y is y given that Z is z'.  The second is saying 'what is the probability that X is x given Y is y and Z is z.'

The problem is that these tables can be really big.  If we have n variables, and each variable can take d values, then that's d^n entries.  Too many.  A Bayes net takes up less space.

So we have the events season, temperature, and weather.

We have 4 seasons: spring, summer, fall, winter.  We have, I don't know, 100 temperatures in Fahrenheit, from 1 to 100.  Obviously we've had 0 or below or above 100, but let's keep this easy.  We also have 2 weathers, sun and rain.  Also snow doesn't exist, because God is punishing us.  If we wanted to calculate the joint probability table, we would need all of the following:

P(Season=spring, temperature=100, weather=sun)
P(spring, 100, rain)
P(spring, 99, sun)
P(spring, 99, rain)
....
P(winter, 2, sun)
P(winter, 2, rain)
P(winter, 1, sun)
P(winter, 1, rain)

Altogether, there will be 4\*100\*2=800 table entries.

Now let's think of the general case.  If we have n variables (columns), and each variable can have d values, then we will have d^n entries (rows), which is a ton, right?  In our specific case, the variables have different domain sizes, so the formula is instead product(d_i) for all i.

A bayes net combines tables with DAGs.  Instead of having 1 table that calculates all possibilities for P(Season, Temperature, Weather), we have a bunch of tables of conditional probabilities.  These conditional tables will be much smaller than the joint probability table, and will decrease our space complexity.

We know that there is some dependence relationship between season, temp, and weather.  We would probably guess that temp and weather are dependent on season, not the other way around.  Maybe weather might also be directly dependent on temp, or temp on weather, but let's keep this simple for now.  So temperature and weather do not have an edge between them.  Season influences weather, and season influences temperature.

So we have this DAG:
season---->temperature
       |
       |-->weather
       

If we know the season, temp and weather are independent from each other.

If we do not know the season, temp and weather are dependent on each other.  Well, truthfully, they _might_ be (and intuitively they should be) dependent on each other, but with Bayes nets you can never guarantee dependence.

Let's say we know temperature, but not season or weather.  Knowing the temperature means we can make a more educated guess about the season.  Having a more educated guess about the season means we can make a more educated guess about the weather.  So knowing the temperature means we know more about the weather.  So temperature and weather are dependent on each other.

Now here's a second scenario.  Let's say we already know the season, but not the temperature or the weather.  Now we learn what the temperature is.  The temperature will let us make a more educated guess about the season .... but we already know exactly what the season is, so using temperature to guess the season is pointless.  Since we already know the season in this case, learning about the temperature won't help us make a better guess about the weather.  So temperature and weather are independent of each other once we know the season.

This relationship between season, temperature, and weather is a 'common cause' relationship.  A similar line of thought can be applied to understand the other relationships: common effect and causal chain. 

So we just need a P(Season) table, a P(Temp|Season) table, and a P(Weather|Season) table.

The P(Season) table has 4 entries.

The P(Temp|Season) table has 400 entries.

The P(Weather|Season) table has 8 entries.

So these 3 tables have 412 entries, rather than 800 entries, not to mention each row has fewer columns than a row in the joint probability table.  Much better in terms of space.
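
The entry counts can be reproduced with a few lines.  A minimal sketch, assuming Python 3.8+ for `math.prod`.  The `domains` and `parents` dicts are just one made-up way to write down the season/temperature/weather net from above.

```python
from math import prod

# Number of values each variable can take
domains = {"Season": 4, "Temperature": 100, "Weather": 2}
# Edges of the DAG: Season -> Temperature, Season -> Weather
parents = {"Season": [], "Temperature": ["Season"], "Weather": ["Season"]}

def joint_size(domains):
    """Entries in the full joint table: the product of all domain sizes."""
    return prod(domains.values())

def net_size(domains, parents):
    """Entries across all conditional tables: each variable's table has
    (its own domain size) * (product of its parents' domain sizes) entries."""
    return sum(domains[v] * prod(domains[p] for p in parents[v]) for v in domains)

print(joint_size(domains), net_size(domains, parents))
```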

With these 3 tables, you can calculate any query.  Here are some examples:

P(T, W|S) = P(T|S)P(W|S), since T and W are independent given S.

P(S|W) = P(W|S)P(S)/P(W)

P(T) = sum(P(T|S=season)P(S=season) for all seasons)
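
Here is a small sketch of those three queries on a shrunken version of the net.  The numbers, the 2-value domains, and the function names are all made up for illustration.  Note that when summing over seasons to get P(T), each term is weighted by P(S).

```python
# Made-up tables: 2 seasons, 2 temperatures, 2 weathers
P_S = {"summer": 0.5, "winter": 0.5}
P_T_given_S = {"summer": {"hot": 0.8, "cold": 0.2},
               "winter": {"hot": 0.1, "cold": 0.9}}
P_W_given_S = {"summer": {"sun": 0.9, "rain": 0.1},
               "winter": {"sun": 0.4, "rain": 0.6}}

def p_tw_given_s(t, w, s):
    """P(T=t, W=w | S=s) = P(T=t|S=s) * P(W=w|S=s), since T and W are independent given S."""
    return P_T_given_S[s][t] * P_W_given_S[s][w]

def p_t(t):
    """P(T=t): sum over seasons of P(T=t|S=s) * P(S=s)."""
    return sum(P_T_given_S[s][t] * P_S[s] for s in P_S)

def p_w(w):
    """P(W=w): sum over seasons of P(W=w|S=s) * P(S=s)."""
    return sum(P_W_given_S[s][w] * P_S[s] for s in P_S)

def p_s_given_w(s, w):
    """Bayes' rule: P(S=s|W=w) = P(W=w|S=s) * P(S=s) / P(W=w)."""
    return P_W_given_S[s][w] * P_S[s] / p_w(w)

print(p_t("hot"), p_s_given_w("winter", "rain"))
```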

Gotcha.  So I'm looking at note 6, under Bayes Net (Inference).

So you have these probability tables:
P(T), P(C|T), P(S|T), P(E|C,S)

You want the table P(T|+e)

Since all you need is T and +e, all other variables are extraneous information.  So C and S are not needed, and neither is anything in E that is not +e.

One way to do this is that you could get P(T|+e) by combining all the information you have into a single table P(T, C, S, E).  Then you would sum over the C column.  Then sum over the S column.  Then drop all rows whose E column value isn't +e, and normalize to get P(T|+e).  However, this is inefficient because the P(T, C, S, E) table is huge.  A better way is to eliminate as soon as possible so that you aren't doing calculations on variables that you don't need to be doing.

Here's the 'variable elimination' way to do it.  The first thing we do is find the tables that contain E, and eliminate all rows that are not +e.  So P(E|C,S) will become P(+e|C,S).  Don't normalize yet: P(+e|C,S) isn't supposed to sum to 1, and the normalizing happens once at the very end.  Now you won't be doing any calculations on rows that involve -e, which are useless.

Then we want to eliminate C or S.  We'll do C first.  We combine all the tables that involve C in any way.  Our unified table of C will be P(C,+e|T,S) = P(C|T)·P(+e|C,S).  Now we can sum out C to get P(+e|T,S).  
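
A rough sketch of the whole procedure in Python, checked against the 'one giant table' way.  Big caveats: all four variables are assumed to be binary, every number is made up, the function names are invented for this example, and the steps after summing out C (eliminating S, multiplying in P(T), normalizing once at the end) are my completion of the procedure rather than something spelled out above.

```python
from itertools import product

VALS = (True, False)

def bern(p, x):
    """P(X=x) for a binary X with P(X=True) = p."""
    return p if x else 1.0 - p

# Made-up conditional tables, written as functions
def p_t(t):
    return bern(0.3, t)

def p_c_given_t(c, t):
    return bern(0.9 if t else 0.2, c)

def p_s_given_t(s, t):
    return bern(0.6 if t else 0.1, s)

def p_e_given_cs(e, c, s):
    p_true = {(True, True): 0.95, (True, False): 0.7,
              (False, True): 0.5, (False, False): 0.05}[(c, s)]
    return bern(p_true, e)

def posterior_by_elimination():
    """P(T | +e), eliminating C first, then S, and normalizing only at the end."""
    # Evidence: keep only the +e rows of P(E|C,S), giving P(+e|C,S)
    f1 = {(c, s): p_e_given_cs(True, c, s) for c, s in product(VALS, VALS)}
    # Eliminate C: multiply P(C|T) by P(+e|C,S), then sum out C, giving P(+e|T,S)
    f2 = {(t, s): sum(p_c_given_t(c, t) * f1[(c, s)] for c in VALS)
          for t, s in product(VALS, VALS)}
    # Eliminate S: multiply P(S|T) by P(+e|T,S), then sum out S, giving P(+e|T)
    f3 = {t: sum(p_s_given_t(s, t) * f2[(t, s)] for s in VALS) for t in VALS}
    # Multiply in P(T) to get P(T,+e), then normalize to get P(T|+e)
    unnorm = {t: p_t(t) * f3[t] for t in VALS}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

def posterior_by_enumeration():
    """Same query computed from the full joint P(T,C,S,E), for comparison."""
    unnorm = {t: sum(p_t(t) * p_c_given_t(c, t) * p_s_given_t(s, t)
                     * p_e_given_cs(True, c, s)
                     for c, s in product(VALS, VALS))
              for t in VALS}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

print(posterior_by_elimination())
```

With 4 binary variables the savings are invisible, but the intermediate tables here never hold more than 2 variables at a time, while the joint holds all 4.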

Aaaand I'm bored.  Hopefully the stuff from before will be enough.


Side note, FPGAs are currently being used for this.
https://www.nextplatform.com/2018/08/27/xilinx-unveils-xdnn-fpga-architecture-for-ai-inference/
Not entirely practical yet, but has a lot of potential.
