In [None]:
import numpy, pandas
import matplotlib.pyplot as plot
%matplotlib inline
#%config InlineBackend.figure_format = 'svg'
#plot.rcParams['figure.figsize'] = [4, 4]
histogram_bins = 20

Thanks to Wells Lucas Santo from AI4ALL. Some of the material here is based on his slides.

# Probability

*Probability* is a tool for describing how likely it is that an event will happen.

For example, if you flip a (fair) coin, what is the probability of it landing tails?


The probability of an event is expressed as a number between 0 and 1.

Probability = 0: impossible.

Probability = 1: certain to happen.

For a coin flip, the possible outcomes are "heads" and "tails". The probability is 0.5 for heads, and 0.5 for tails.

In mathematical notation:

> *P*(heads) = 0.5
>
> *P*(tails) = 0.5

(Actually it's probably something like *P*(heads) = 0.4999, *P*(tails) = 0.4999, *P*(edge) = 0.0002, but let's ignore that.)


### Percents

You could also express probabilities as percentages: just multiply by 100.

> P(heads) = 50%

But we'll usually use numbers 0&ndash;1.

### Conditional Probability

We sometimes have to talk about the probability of two related things happening: what is the probability of X happening if we know Y has happened?

We write that as:

> *P*(X | Y)

Pronounce that as "the probability of X given Y" or "the probability of X if Y".


An example: if we select a person at random, what is the probability that the person is attending InventTheFuture?

> *P*(ITF) = 24/7000000000 = 0.0000000034

Now select a person at random again: what is the probability that the person is attending InventTheFuture, if we know **the person is in this room**?

> *P*(ITF | InThisRoom) = 24/28 = 0.86

In [None]:
24./7000000000.

In [None]:
24./28.

### Independent Events

Some events are unrelated to each other, so knowing that one happened doesn't tell us anything about the other. What if we know it's raining, and we select a person at random? Then what is the probability of them attending InventTheFuture?

Same as before: the weather is independent of coming to ITF.

> *P*(ITF | ItsRaining) = *P*(ITF)
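
One way to see independence in action is a quick simulation. This is only a sketch: the chances of rain (0.3) and of attending (0.1) are made-up numbers, and the two events are drawn separately so that they are independent by construction. The estimate of *P*(attending | raining) should come out close to the estimate of *P*(attending).

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

trials = 100_000
rain_count = 0
attend_count = 0
both_count = 0
for _ in range(trials):
    raining = random.random() < 0.3    # made-up chance of rain
    attending = random.random() < 0.1  # made-up chance of attending, drawn independently
    rain_count += raining
    attend_count += attending
    both_count += raining and attending

# P(attending): fraction of all trials where the person attends.
p_attend = attend_count / trials
# P(attending | raining): the same fraction, but only among the rainy trials.
p_attend_given_rain = both_count / rain_count
print(p_attend, p_attend_given_rain)
```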

# Bayes Theorem

Bayes Theorem gives us a way to flip around a conditional probability: 

> *P*(Y | X) = *P*(X | Y) \* *P*(Y) / *P*(X)

If we know something about *P*(X | Y), we can use that to get to *P*(Y | X).

In our example:

> *P*(InThisRoom | ITF) = *P*(ITF | InThisRoom) \* *P*(InThisRoom) / *P*(ITF) = ...

In [None]:
(24./28.) * (28./7000000000.) / (24./7000000000.) 

If you're in ITF, it's certain that you're in this room. It doesn't seem like we had enough information to calculate that, but Bayes theorem told us.
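
The same calculation can be wrapped in a small function so the three inputs are named. The function name `bayes` and the variable names are made up for this sketch; the numbers are the ones used above.

```python
def bayes(p_x_given_y, p_y, p_x):
    """Return P(Y | X) from P(X | Y), P(Y) and P(X)."""
    return p_x_given_y * p_y / p_x

world = 7_000_000_000          # rough world population used above
p_itf = 24 / world             # P(ITF)
p_room = 28 / world            # P(InThisRoom)
p_itf_given_room = 24 / 28     # P(ITF | InThisRoom)

p_room_given_itf = bayes(p_itf_given_room, p_room, p_itf)
print(p_room_given_itf)        # 1, up to floating-point rounding
```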

# Review: Our Goal

Let's remember what we're trying to do here:

*Supervised Learning*: learning from labelled examples.

*Classification*: A type of supervised learning problem where we want to determine what class each example belongs to.

*Features*: the input we are using. Whatever was observed/measured.

*Label*: the thing we're trying to predict, the class.

Here are the steps we want to have for a machine learning classification problem:

1. Start with some labelled data where we know the features **and** correct labels.
1. Split that data into two parts: **training** data and **testing** data.
1. Create a **model** which will be a general strategy for mapping features to labels.
1. Use the training data to **fit** the model to our problem, so it can predict correct labels.
1. Use the testing data to evaluate our model and give it a **score**: how well do the model's predictions match the correct labels on the testing data?
1. Once we have a good model, use it to **predict** labels for data where we don't know the correct result.
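
To make those steps concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the data is randomly generated, and the "model" is a very simple stand-in (remember the mean of each class and predict whichever mean is closest), not the naive Bayes classifier this notebook is building toward.

```python
import random
from statistics import mean

random.seed(1)

# 1. Labelled data: one made-up feature per example, with a known label.
data = [(random.gauss(23, 19), 'blue') for _ in range(200)]
data += [(random.gauss(91, 24), 'red') for _ in range(200)]

# 2. Split into training and testing data.
random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# 3 and 4. "Model" and "fit": remember the mean feature value of each class.
def fit(examples):
    labels = {label for _, label in examples}
    return {label: mean(x for x, l in examples if l == label) for label in labels}

# 6. Predict: choose the class whose mean is closest to the feature value.
def predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

# 5. Score: fraction of test examples where the prediction matches the label.
def score(model, examples):
    return sum(predict(model, x) == label for x, label in examples) / len(examples)

model = fit(train)
accuracy = score(model, test)
print(accuracy, predict(model, 50.0))
```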

# Making Predictions with Probability

The whole goal is to create a classifier that can take a data point (features) and predict a label from the possibilities.

Idea: let's use probability and Bayes Theorem for it.

If we're thinking about probability, we want to start with the data and come up with a probability that we're in each class.

If we can come up with a probability that we're in each class, we can just find the highest probability and predict that class.

In probability notation, we want to know the probability of being in a particular class C, given that we have observed data X.

> *P*(C | X)



Bayes Theorem gives us a way to use *P*(X | C) to figure that out. All we need is some way to figure out the probability of particular observations, given the true class.
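
As a small sketch of that idea, suppose we somehow already know *P*(X | C) for one observation X and two classes. The numbers below are invented purely for illustration. *P*(X) is found by adding up *P*(X | C) \* *P*(C) over all of the classes, and then Bayes Theorem gives *P*(C | X) for each class.

```python
# Hypothetical numbers for one observation X and two classes.
likelihood = {'blue': 0.020, 'red': 0.005}   # P(X | C), made up
prior = {'blue': 0.5, 'red': 0.5}            # P(C), made up

# P(X) is the sum over classes of P(X | C) * P(C).
p_x = sum(likelihood[c] * prior[c] for c in likelihood)

# Bayes Theorem, once per class.
posterior = {c: likelihood[c] * prior[c] / p_x for c in likelihood}

# Predict the class with the highest P(C | X).
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

The posterior values always add up to 1, because the observation has to belong to one of the classes.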

## Normal Distributions

Any time we have something random happening where a number is produced, we can ask how those numbers are distributed: the **probability distribution**.

As an example: suppose we flip a coin 1000 times. How many heads will we see?

Sometimes 500. Sometimes 499. Very rarely, 100. We would like some way to express the way those results are distributed.

Let's try the experiment: we will flip 1000 coins and count the number of heads. We'll repeat the experiment many times and see what the outcomes are.

In [None]:
import random
observations = []
n = 1000
repeat = 2000
for experiment in range(repeat): # repeat the experiment 2000 times
    total_heads = 0
    for coin in range(n):    # flip 1000 coins
        result = random.choice(['H', 'T'])
        if result == 'H':
            total_heads += 1
    observations.append(total_heads)

In [None]:
# Faster version of the experiment if we want larger numbers...
#repeat = 10000
#n = 1000
#observations = numpy.random.randint(0, 2, (repeat, n)).sum(axis=1)

In [None]:
observations = numpy.array(observations)
histogram_data = plot.hist(observations, bins=histogram_bins);

If we repeat that experiment many times (increasing the 2000 times above), that histogram will look more and more like a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution).

This one, actually:

In [None]:
from scipy.stats import norm
x = numpy.linspace(450, 550, 100)
mean = 0.5*n
stddev = (0.25*n) ** 0.5
plot.plot(x, norm.pdf(x, mean, stddev), 'b-');
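
The `mean` and `stddev` values in that cell come from a standard result about counting successes: with `n` flips and probability `p` of heads, the number of heads has mean `n*p` and standard deviation `sqrt(n*p*(1-p))`. Here is a quick check against a simulation, as a sketch using only the standard library (the variable names are made up for this example):

```python
import random
from statistics import mean, pstdev

random.seed(2)
n = 1000   # coins per experiment
p = 0.5    # probability of heads for a fair coin

# Theoretical values for the number of heads in n flips.
expected_mean = n * p                        # 500.0
expected_stddev = (n * p * (1 - p)) ** 0.5   # about 15.8

# Simulated values: count the 1 bits in n random bits, 2000 times over.
counts = [bin(random.getrandbits(n)).count('1') for _ in range(2000)]
print(expected_mean, mean(counts))
print(expected_stddev, pstdev(counts))
```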

Normal distributions happen in many, many places.

# Normally-Distributed Data

The naive Bayes classifier we'll use (the Gaussian version) assumes that our observations (for each class) follow a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution).

In the example above, we knew the distribution because of some mathematics about random distributions in situations like flipping coins. We don't usually know the "true" distribution: we have to make our best guess about what it is.

### Figuring Out the Distribution

There are two things we need to know about a normal distribution: its **mean** and **standard deviation**.

The *mean* is the average of the values.

The *standard deviation* is a way to measure the &ldquo;width&rdquo; of the data: how spread out is it?

We can calculate both of those: they will tell us exactly what normal distribution the data seems to have.

In [None]:
mean = observations.mean()
mean

In [None]:
stddev = observations.std()
stddev
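
If you are curious what `.mean()` and `.std()` are doing, here is a sketch of the same calculations by hand on a few made-up values. By default, NumPy's `.std()` divides by the number of values (the "population" standard deviation), and that is what this sketch does too.

```python
values = [498, 512, 505, 489, 496]   # made-up head counts

# Mean: add the values up and divide by how many there are.
mean_by_hand = sum(values) / len(values)

# Variance: the average squared distance from the mean.
variance = sum((v - mean_by_hand) ** 2 for v in values) / len(values)

# Standard deviation: the square root of the variance.
stddev_by_hand = variance ** 0.5
print(mean_by_hand, stddev_by_hand)
```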

Now we can check how close we are: we can draw the normal distribution that we discovered (the red line) on top of the histogram of the data points we started with:

In [None]:
plot.hist(observations, bins=histogram_bins)
x = numpy.linspace(observations.min(), observations.max(), 100)
normal_curve = norm.pdf(x, mean, stddev)
plot.plot(x, normal_curve / normal_curve.max() * histogram_data[0].max(), 'r-');

# Multiple Classes

We are more interested in the case where there is more than one class of observations, and we want to be able to predict which class future observations are in.

Imagine we have two classes: blue and red. We will make up some fake data and see how it looks...

In [None]:
n = 300
observations_a = numpy.random.normal(23, 19, n)
observations_b = numpy.random.normal(91, 24, n)

histogram_data = plot.hist([observations_a, observations_b],
                           bins=histogram_bins, color=['b','r'])

We can do the same trick as before and guess a normal distribution for each class...

In [None]:
mean_a = observations_a.mean()
mean_b = observations_b.mean()
stddev_a = observations_a.std()
stddev_b = observations_b.std()

In [None]:
plot.hist([observations_a, observations_b], bins=histogram_bins, color=['b','r'])
x_range = numpy.linspace(observations_a.min(), observations_b.max(), 1000)
curve_a = norm.pdf(x_range, mean_a, stddev_a)
curve_b = norm.pdf(x_range, mean_b, stddev_b)
plot.plot(x_range, curve_a / curve_a.max() * histogram_data[0][0].max(), 'b-', lw=2)
plot.plot(x_range, curve_b / curve_a.max() * histogram_data[0][0].max(), 'r-', lw=2);

# Probabilities and Predictions

What we have done so far:
1. Make some observations about individuals in different classes: our labeled data.
1. Know something about the data, and have a pretty good idea that each class' measurements are going to be normally distributed.
1. Calculate the mean and standard deviation.
1. Use those to guess a normal distribution for each class.

The distribution gives us a way to guess the probability of getting each observation, if we know what class we are looking at.

Or in other words, if we measured X for something in class C, the probability distribution tells us
> *P*(X | C)

If we want to start making predictions on unlabeled data, we need to figure out what the most-likely class is for the observations we made, or
> *P*(C | X)

Bayes theorem gives us a way to turn one into the other.
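
Here is a sketch of that calculation for the blue and red classes, using `statistics.NormalDist` from the standard library. The means and standard deviations are assumed to be the ones used to generate the fake data above (the fitted values would be close but not identical), and the equal values for *P*(C) are an assumption based on both classes having the same number of examples. Strictly speaking, a normal curve gives a probability *density* rather than a probability, but it plays the role of *P*(X | C) here.

```python
from statistics import NormalDist

# Assumed distributions for each class; these numbers match the made-up data above.
blue = NormalDist(mu=23, sigma=19)
red = NormalDist(mu=91, sigma=24)
prior = {'blue': 0.5, 'red': 0.5}   # assumed P(C): both classes equally common

def posterior(x):
    """Return P(C | X=x) for each class using Bayes theorem."""
    weighted = {'blue': blue.pdf(x) * prior['blue'],
                'red': red.pdf(x) * prior['red']}
    total = sum(weighted.values())   # plays the role of P(X)
    return {c: w / total for c, w in weighted.items()}

for x in (20, 55, 100):
    print(x, posterior(x))
```

An observation near 20 should come out as almost certainly blue and one near 100 as almost certainly red, while values in between are less clear-cut.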

But we don't have to do the calculations ourselves... there are tools for that.

# Multiple Features

Those examples assumed we were only measuring one thing with each observation. We can measure any number of things. The picture (and math) just gets a little more complicated.

The assumptions we're going to make:
1. The values that we measure are **independent**: if we measure height and age, we assume those don't have anything to do with each other.
2. Each thing we measure has a normal distribution.

We can see what happens if we measure two features with some more sample data...

In [None]:
n = 400
x = numpy.random.normal(10, 3, n)
y = numpy.random.normal(3, 1, n)
plot.plot(x, y, 'b.');

Since we're assuming that the features are independent, we can deal with them separately just like before.
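
In terms of the math, independence means we can multiply: the density for seeing a particular pair (x, y) is the density for x times the density for y. A minimal sketch, assuming the two features follow the distributions used to generate the sample data above (`joint_density` is a name made up for this example):

```python
from statistics import NormalDist

# Assumed per-feature distributions, matching the sample data above.
feature_x = NormalDist(mu=10, sigma=3)
feature_y = NormalDist(mu=3, sigma=1)

def joint_density(x, y):
    """Density of observing (x, y) when the two features are independent."""
    return feature_x.pdf(x) * feature_y.pdf(y)

print(joint_density(10, 3))   # at the centre of the cloud of points
print(joint_density(16, 5))   # far from the centre: much smaller
```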

In [None]:
from helpers.bayes import joint_histograms, pdf_2d
joint_histograms(numpy.stack([x,y], axis=1), numpy.zeros((n,)))

If we have multiple classes, we do the same thing with each one...

In [None]:
from helpers.bayes import cmap, sample_data_1
observations, classes = sample_data_1()
plot.scatter(observations[:, 0], observations[:, 1], c=classes/2.0, cmap=cmap, edgecolor='k');

In [None]:
joint_histograms(observations, classes)

And just like before, we calculate the mean and standard deviation of the data for each class: this time for each feature as well.

Then we get normal distributions for each class and feature. That gives us a probability that each point is in a particular class.

This is the probability of being in class 0 (blue), with brightness representing higher probability:

In [None]:
pdf_2d(observations, classes, 0)

# Making Predictions

Once we have figured out these probability distributions, we can use them to make predictions. The whole process now looks something like this:

1. Make some observations about individuals in different classes: our labeled data.
1. Know something about the data, and have a pretty good idea that each class' measurements are going to be normally distributed.
1. Calculate the mean and standard deviation.
1. Use those to guess a normal distribution for each class, **for each feature separately**. Combine those distributions so we can guess a value for *P*(X | C).
1. Use Bayes theorem (behind the scenes) to calculate *P*(C | X). When we want to make a prediction for the class X is in, look at the *P*(C | X) values for each class.
1. Choose the class with the **highest** *P*(C | X).
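
Those steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation any particular library uses: the function names `fit` and `predict` and the tiny data set are made up, and *P*(C) is estimated as the fraction of training examples in each class. Because *P*(X) is the same for every class, the sketch skips dividing by it and just compares *P*(X | C) \* *P*(C).

```python
from statistics import NormalDist, mean, pstdev

def fit(rows, labels):
    """Estimate a prior and one normal distribution per feature for each class."""
    model = {}
    for c in set(labels):
        members = [row for row, label in zip(rows, labels) if label == c]
        columns = list(zip(*members))   # one tuple of values per feature
        model[c] = {
            'prior': len(members) / len(rows),
            'features': [NormalDist(mean(col), pstdev(col)) for col in columns],
        }
    return model

def predict(model, row):
    """Return the class with the highest P(X | C) * P(C)."""
    def score(c):
        result = model[c]['prior']
        # Independent features: multiply the per-feature densities together.
        for dist, value in zip(model[c]['features'], row):
            result *= dist.pdf(value)
        return result
    return max(model, key=score)

# Tiny made-up data set: two features per row, two classes.
rows = [(1.0, 5.1), (1.4, 4.8), (0.8, 5.5), (1.1, 4.9),
        (3.9, 1.2), (4.2, 0.7), (3.6, 1.1), (4.1, 1.4)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit(rows, labels)
print(predict(model, (1.2, 5.0)), predict(model, (4.0, 1.0)))
```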

The math isn't *that* hard, but we don't actually have to do it ourselves...