# Chapter 06 - Probabilities

Code from "Chapter 6" of the book, _Data Science from Scratch_, 2nd edition, by Joel Grus.

In [None]:
import enum
import fractions
import random

In [None]:
import matplotlib.pyplot as plt

In [None]:
import dsfs as scratch

## Dependence and Independence

Roughly speaking, we say that two events, _E_ and _F_, are **dependent** if knowing something about whether _E_ happens gives us information about whether _F_ happens (and vice versa). Otherwise, we say that _E_ and _F_ are **independent**.

Mathematically, we say that two events, _E_ and _F_, are independent if the probability that both happen is the product of the probabilities that each event happens. In symbols:

$$
    P(E, F) = P(E)P(F)
$$
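
The product rule can be checked exactly on a small sample space. The sketch below is an illustration that does not come from the book's code: it assumes two fair coin flips and uses hypothetical helper names (`prob`, `first_heads`, `second_heads`, `both_tails`).

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, so the four outcomes are equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event) -> Fraction:
    """Exact probability of an event, given as a predicate on outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

def first_heads(outcome) -> bool:
    return outcome[0] == "H"

def second_heads(outcome) -> bool:
    return outcome[1] == "H"

def both_tails(outcome) -> bool:
    return outcome == ("T", "T")

# E = first flip is heads, F = second flip is heads: independent.
p_e = prob(first_heads)
p_f = prob(second_heads)
p_ef = prob(lambda o: first_heads(o) and second_heads(o))
print(f"P(E, F) = {p_ef}, P(E)P(F) = {p_e * p_f}")

# E = first flip is heads, G = both flips are tails: dependent.
p_g = prob(both_tails)
p_eg = prob(lambda o: first_heads(o) and both_tails(o))
print(f"P(E, G) = {p_eg}, P(E)P(G) = {p_e * p_g}")
```

Knowing that the first flip is heads rules out "both tails", which is why the second pair of values differs.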

## Conditional Probability

We use conditional probabilities to answer questions such as, "What is the probability that 'both children are girls' given that 'at least one of the children is a girl'?"

Define two events:

- B - The event, 'both children are girls'
- L - The event, 'at least one of the children is a girl'

Because _B_ implies _L_ (if both children are girls, then at least one of them is a girl), the joint probability $P(B, L)$ equals $P(B)$. The conditional probability is therefore:

$$
    P(B|L) = \frac{P(B, L)}{P(L)} = \frac{P(B)}{P(L)} = \frac{\frac{1}{4}}{\frac{3}{4}} = \frac{1}{3}
$$


We can "check" this calculation by "generating" a lot of families; that is, by performing a simulation.

In [None]:
# An `Enum` is a typed set of enumerated values. We can use them to make our
# code more descriptive and readable.
class Kid(enum.Enum):
    BOY = 0
    GIRL = 1

def random_kid() -> Kid:
    return random.choice([Kid.BOY, Kid.GIRL])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)  # A specific seed makes random values repeatable.

for _ in range(1000):  # Don't care about the iterated values
    younger = random_kid()
    older = random_kid()

    if older == Kid.GIRL:
        older_girl += 1

    if older == Kid.GIRL and younger == Kid.GIRL:
        both_girls += 1

    if older == Kid.GIRL or younger == Kid.GIRL:
        either_girl += 1

In [None]:
print(f'P( both | older ) = {both_girls / older_girl} ~= {fractions.Fraction(1, 2)}')
print(f'P( both | either ) = {both_girls / either_girl} ~= {fractions.Fraction(1, 3)}')
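
As a cross-check on the simulation, the sample space is small enough to enumerate exactly. This is a minimal sketch, assuming that each child is equally likely to be a boy or a girl and that the two children are independent; the names used here are illustrative and separate from the cell above.

```python
from fractions import Fraction
from itertools import product

# Each family is a (younger, older) pair; the four pairs are equally likely.
families = list(product(["boy", "girl"], repeat=2))

both = [f for f in families if f == ("girl", "girl")]
older = [f for f in families if f[1] == "girl"]
either = [f for f in families if "girl" in f]

# "Both girls" implies each conditioning event, so the joint count is len(both).
p_both_given_older = Fraction(len(both), len(older))
p_both_given_either = Fraction(len(both), len(either))

print(f"P( both | older ) = {p_both_given_older}")
print(f"P( both | either ) = {p_both_given_either}")
```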

## Bayes's Theorem

One of the data scientist's best friends is Bayes's Theorem, which is a way of "reversing" conditional probabilities.

Imagine that we are interested in the conditional probability of _E_ occurring given that _F_ has occurred; however, the only information we have is the "reverse" conditional probability: the probability that _F_ occurs given that _E_ has occurred. How can we solve this problem?

We can solve this problem by applying the definition of conditional probability **twice**.

$$
    P(E | F) = \frac{P(E, F)}{P(F)}
$$

But,

$$
    P(E, F) = P(F, E) = P(F | E) P(E)
$$

Substituting for $ P(E, F) $ in the first equation, we get:

$$
    P(E | F) = \frac{P(F | E)P(E)}{P(F)}
$$

We can split the event _F_ into two mutually exclusive events: "_F_ and _E_" and "_F_ and not _E_". In symbols:

$$
    P(F) = P(F, E) + P(F, \neg{E})
$$

Substituting this expression for $P(F)$ gives us:

$$
    P(E | F) = \frac{P(F | E)P(E)}{P(F, E) + P(F, \neg{E})}
$$

Expressing the joint probabilities in the denominator in terms of conditional probabilities then gives:

$$
    P(E | F) = \frac{P(F | E)P(E)}{P(F | E)P(E) + P(F | \neg{E})P(\neg{E})}
$$

This is one of the formulations of Bayes's Theorem.
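
The last formula translates directly into a small function. The sketch below is illustrative: `bayes_posterior` is a hypothetical helper name, not part of the book's code or the `dsfs` module. As a sanity check it recomputes the two-children result from the previous section, taking _E_ as "both children are girls" and _F_ as "at least one child is a girl".

```python
from fractions import Fraction

def bayes_posterior(p_f_given_e, p_e, p_f_given_not_e):
    """Return P(E | F) given P(F | E), P(E), and P(F | not E)."""
    p_not_e = 1 - p_e
    numerator = p_f_given_e * p_e
    # Denominator: P(F) split into "F and E" plus "F and not E".
    p_f = numerator + p_f_given_not_e * p_not_e
    return numerator / p_f

# P(F | E) = 1: both girls guarantees at least one girl.
# P(E) = 1/4: one of four equally likely families.
# P(F | not E) = 2/3: of the three remaining families, two contain a girl.
posterior = bayes_posterior(Fraction(1), Fraction(1, 4), Fraction(2, 3))
print(f"P( both | at least one ) = {posterior}")
```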

As an application of Bayes's Theorem (with tongue firmly in cheek demonstrating why data scientists are smarter than doctors), imagine that a certain disease affects only 1 in 10,000 people in the general population. Further, imagine that a doctor orders a test that gives the correct result ("diseased" if you have the disease and "not diseased" if you do not) 99% of the time. Finally, imagine that the result of your test is "diseased."

What is the actual probability that you have this disease?

We can use Bayes's Theorem to calculate this probability. If _D_ is the event "actually diseased" and _T_ is the event "test reports 'diseased'", then we have the following symbolic equation:

$$
    P(D | T) = \frac{P(T | D)P(D)}{P(T | D)P(D) + P(T | \neg{D})P(\neg{D})}
$$

We know that:

- $P(T | D) = 0.99$
- $P(D) = 0.0001$
- $P(T | \neg{D}) = 0.01$
- $P(\neg{D}) = 0.9999$

Substituting these numbers produces:

$$
    P(D | T) = \frac{0.99 \cdot 0.0001}{(0.99 \cdot 0.0001) + (0.01 \cdot 0.9999)} \approx 0.0098 = 0.98\%
$$
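
One way to make this result more intuitive is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of one million people who are all tested; the population size is an illustrative choice, not a figure from the book.

```python
from fractions import Fraction

population = 1_000_000  # hypothetical population, everyone is tested

diseased = population // 10_000   # 1 in 10,000 people have the disease
healthy = population - diseased

true_positives = diseased * 99 // 100  # the test is right for 99% of the diseased
false_positives = healthy * 1 // 100   # the test is wrong for 1% of the healthy

# Among everyone who tests "diseased", what fraction actually is diseased?
p_diseased_given_positive = Fraction(
    true_positives, true_positives + false_positives
)

print(f"true positives:  {true_positives}")
print(f"false positives: {false_positives}")
print(f"P(D | T) = {p_diseased_given_positive} ~= {float(p_diseased_given_positive):.4f}")
```

The false positives among the many healthy people far outnumber the true positives among the few diseased people, which is why a positive result still leaves the probability of disease below 1%.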

**NOTE**: This analysis is unrealistic. A hidden assumption is that people take the test at random. In reality, since the test is administered mostly to people who have symptoms, the correct conditioning event is something like "test positive and have symptoms". In that scenario, the probability of having the disease is likely much higher than our analysis suggests.


## Random Variables

A _random variable_ is a variable whose possible values have an associated **probability distribution**.

For example, a very simple random variable has the value 1 if a coin flip is heads and 0 if it is tails (a Bernoulli trial with probability of heads $\frac{1}{2}$). A more complicated example is the number of heads observed when flipping a coin 10 times, or when flipping 10 "identical" coins once (a binomial distribution of 10 Bernoulli trials, each with probability of heads $\frac{1}{2}$). Another example is an integer picked from the range [0, 9] where all numbers are equally likely (a "uniform" distribution).

The _expected value_ of a random variable is the average of all the possible values in the distribution, weighted by the probability of seeing each value. For example, the expected value of a Bernoulli trial with probability $\frac{1}{2}$ is:

$$
    \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 0 = \frac{1}{2}
$$
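
The same weighted average can be computed for the other two examples above. This is a minimal sketch, assuming each distribution is represented as a dictionary mapping a value to its probability; `expected_value` is a hypothetical helper, not part of the `dsfs` module.

```python
from fractions import Fraction
from math import comb

def expected_value(distribution):
    """Probability-weighted average of a {value: probability} mapping."""
    return sum(value * p for value, p in distribution.items())

# One fair coin flip: 1 for heads, 0 for tails.
bernoulli = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Number of heads in 10 fair coin flips: P(k heads) = C(10, k) / 2**10.
binomial = {k: Fraction(comb(10, k), 2**10) for k in range(11)}

# An integer from 0 through 9, each equally likely.
uniform_digits = {k: Fraction(1, 10) for k in range(10)}

for name, dist in [("bernoulli", bernoulli),
                   ("binomial", binomial),
                   ("uniform digits", uniform_digits)]:
    # Each distribution's probabilities must sum to exactly 1.
    assert sum(dist.values()) == 1
    print(f"E[{name}] = {expected_value(dist)}")
```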

Random variables can be _conditioned_ on events just as other events can.

For the most part, we will use random variables implicitly in what we do.

## Continuous Distributions

### The Uniform Distribution

A coin flip, multiple coin flips, and selecting integers in [0, 9] with each integer selected with probability $\frac{1}{10}$ are all examples of _discrete_ distributions: distributions that associate positive probability with **discrete** outcomes.

Imagine a distribution that puts equal weight on every value in [0, 1]. Because an **infinite** number of values exist in [0, 1], the probability assigned to any single value must be zero. Instead, we describe such a distribution with a density whose integral over the whole range is 1; that is,

$$
    \int_{0}^{1} \mathrm{uniform}(x)\,dx = 1
$$

where "uniform(x)" is the density of the uniform distribution between 0 and 1.


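The function "uniform(x)" is a _probability density function_ (PDF): the probability of seeing a value in some interval equals the integral of the density over that interval.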

Notice that Python's `random.random()` is a (pseudo)random variable with a uniform density on [0, 1).
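
A quick empirical check of this claim: for a uniform density on [0, 1), the probability of landing in a sub-interval should equal the interval's length. The sketch below draws samples and measures the fraction that falls in [0.2, 0.5); the interval, sample size, and seed are arbitrary illustrative choices.

```python
import random

random.seed(0)  # A specific seed makes the samples repeatable.

n = 100_000
samples = [random.random() for _ in range(n)]

lo, hi = 0.2, 0.5
# Fraction of samples inside [lo, hi); it should be close to hi - lo.
fraction_inside = sum(lo <= x < hi for x in samples) / n

print(f"interval length:   {hi - lo:.3f}")
print(f"observed fraction: {fraction_inside:.3f}")
```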

In [None]:
{x: scratch.probability.uniform_pdf(x) for x in [-0.0001, 0, 0.5, 0.9999, 1]}

We are often more interested in the _cumulative distribution function_ (CDF), which gives the probability that a random variable is less than or equal to a specified value.

In [None]:
{x: scratch.probability.uniform_cdf(x) for x in [-0.0001, 0, 0.5, 0.9999, 1]}
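
The `dsfs` module is not shown in this notebook, so the definitions below are stand-ins written under the assumption that it follows the usual definition of the uniform distribution on [0, 1). The hypothetical `integrate` helper uses a midpoint Riemann sum to check numerically that the CDF at a point matches the integral of the density up to that point.

```python
def uniform_pdf(x: float) -> float:
    """Density of the uniform distribution on [0, 1) (assumed definition)."""
    return 1.0 if 0 <= x < 1 else 0.0

def uniform_cdf(x: float) -> float:
    """Probability that a uniform random variable on [0, 1) is <= x."""
    if x < 0:
        return 0.0
    elif x < 1:
        return x
    else:
        return 1.0

def integrate(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

for x in [0.0, 0.25, 0.5, 1.0, 1.5]:
    # Start well below 0, where the density is zero anyway.
    area = integrate(uniform_pdf, -1.0, x)
    print(f"x = {x}: cdf = {uniform_cdf(x):.4f}, integral of pdf = {area:.4f}")
```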

In [None]:
xs = [x / 10.0 for x in range(-10, 20)]
plt.plot(xs, [scratch.probability.uniform_cdf(x) for x in xs], 'g-', label='cdf')
plt.legend()
plt.title('Uniform CDF')  # The curve plotted above is the CDF, not the PDF.
plt.show()