# Probability-Based Learning
---


**Main Idea**: introduce probability-based approaches to machine learning, which rely heavily on Bayes' Theorem and probability theory.

The naive Bayes model is the standard approach to probability-based learning. Extensions include smoothing to combat overfitting, modifications to handle continuous features, and Bayesian network models, which give more control than naive Bayes over the assumptions encoded in the model.

The key idea is to use estimates of likelihoods to determine the most likely prediction. More importantly, these predictions should be revised as new data is collected and new evidence becomes available.

**Inductive Bias:** we assume that all features are independent of one another given the target and that each feature contributes equally to the outcome. These assumptions are rarely realistic, but the model often works well in practice.

## Fundamentals
---


**Random Variable**- a variable whose value is unknown and is either the result of a random phenomenon or an underlying function. We can view each feature in a dataset as a random variable

**Sample Space**- set of all possible combinations of assignments of values to features

**Probability Function**- P() returns the probability of an event

**Probability Mass Functions**- probability functions for categorical or discrete features

**Probability Density Functions**- probability functions for continuous features

**Joint Probability**- probability of an assignment of specific values to multiple features

**Conditional Probability**- probability of one feature taking a specific value given that we already know the value of a different feature

**Probability Distribution**- data structure that describes the probability of each possible value a feature can take

**Joint Probability Distribution**- probability distribution over more than one feature. It can be written as a multi-dimensional matrix in which each cell lists the probability of a particular combination of feature values being assigned
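
As a rough illustration of these definitions, the sketch below estimates a joint distribution, a single-feature distribution, and a conditional probability from counts. The pairs are the Outlook and Plays Tennis columns of the tennis table used later in these notes; the names `joint`, `outlook_dist`, and `conditional` are made up for this example.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (Outlook, Plays Tennis) pairs copied from the tennis table in these notes.
rows = [
    ("Sunny", "No"), ("Rainy", "Yes"), ("Overcast", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Overcast", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Sunny", "Yes"),
]
n = len(rows)

# Joint probability distribution: one entry per (outlook, plays) combination.
joint = {pair: Fraction(count, n) for pair, count in Counter(rows).items()}

# Probability distribution of Outlook alone, obtained by summing the joint.
outlook_dist = defaultdict(Fraction)
for (outlook, _), p in joint.items():
    outlook_dist[outlook] += p

def conditional(plays, outlook):
    """P(Plays Tennis = plays | Outlook = outlook) = joint / marginal."""
    return joint.get((outlook, plays), Fraction(0)) / outlook_dist[outlook]

print(joint[("Sunny", "No")])      # 3/10
print(outlook_dist["Sunny"])       # 2/5
print(conditional("No", "Sunny"))  # 3/4
```

Exact fractions are used here only so the numbers can be compared with hand calculations; floats would work just as well.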

### Bayes' Theorem

**Bayes' Theorem**- the probability that an event has happened given a set of evidence for it is equal to the probability of the evidence being caused by the event, multiplied by the probability of the event itself, and divided by the probability of the evidence

P(event given evidence) = P(evidence given event) x P(event) / P(evidence)

$$ P(X|Y) = \frac{P(Y|X)P(X)} {P(Y)} $$


To derive P(Y), we can use the Law of Total Probability.

$$ P(Y) = \sum_i P(Y|X_i)P(X_i) $$

Or we could calculate P(Y) directly from a dataset:

$$ P(Y) = \frac{|\{\text{rows where Y is the case}\}|}{|\{\text{rows in the dataset}\}|} $$

We can now calculate the posterior probability distribution over the possible assignment of values to the features in event X conditioned on the event Y.

We let $$\eta = \frac{1} {\sum_i P(Y|X_i)P(X_i)}$$

Then $$ P(X_i|Y) = \eta \times P(Y|X_i)P(X_i) $$
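
The normalization above can be sketched in a few lines of Python. The function name `posterior` and the rain/clouds numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(X_i | Y) for every X_i, given P(X_i) and P(Y | X_i)."""
    # Unnormalized terms P(Y | X_i) * P(X_i).
    scores = {x: likelihoods[x] * priors[x] for x in priors}
    # Law of Total Probability: P(Y) is the sum of those terms.
    evidence = sum(scores.values())
    eta = 1 / evidence
    return {x: eta * s for x, s in scores.items()}

# Hypothetical example: X is the weather, Y is "the sky is cloudy".
priors = {"rain": Fraction(1, 5), "dry": Fraction(4, 5)}        # P(X_i)
likelihoods = {"rain": Fraction(9, 10), "dry": Fraction(3, 10)}  # P(Y | X_i)

result = posterior(priors, likelihoods)
for x, p in result.items():
    print(x, p)
# rain 3/7
# dry 4/7
```

Because every term is multiplied by the same η, the resulting values always sum to 1.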


## Bayesian Prediction
---

To make predictions, we generate the probability of the event that a target feature, t, takes a specific level, l, given the assignment of values to a set of descriptive features, q, from a query instance. 

The Generalized Bayes' Theorem:

$$ P(t=l\ |\ q[1]...q[m]) = \frac{P(q[1]...q[m]\ |\ t=l) P(t=l)} {P(q[1]...q[m])} $$

To calculate a probability using the Generalized Bayes' Theorem, we need:
1. P(t=l), the prior probability of the target feature t taking the level l
2. P(q[1]...q[m]), the joint probability of the descriptive features of a query instance taking a specific set of values
3. P(q[1]...q[m] | t=l), the conditional probability of a query instance taking a specific set of values given that the target feature takes the level l

The prediction model will formally look like:

$$ M(q) = \operatorname*{argmax}_{l\in levels(t)}\ P(t=l\ |\ q[1]...q[m]) $$

where M(q) is the prediction returned for a query q = q[1]...q[m], and levels(t) is the set of levels the target feature t can take.
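
A minimal sketch of the argmax step, assuming the per-level scores have already been computed. The function name `predict` and the numbers are hypothetical. It also shows that dividing every score by the same P(q[1]...q[m]) does not change which level wins.

```python
def predict(scores):
    """Return the level with the highest score (the argmax over levels)."""
    return max(scores, key=scores.get)

# Hypothetical numerators P(q | t=l) * P(t=l) for a two-level target.
scores = {"Yes": 0.02, "No": 0.06}

# P(q) by the Law of Total Probability, then the normalized posteriors.
evidence = sum(scores.values())
posteriors = {level: s / evidence for level, s in scores.items()}

print(predict(scores))      # No
print(predict(posteriors))  # No
```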


### Example: Game of Tennis
---
**Data:** the day's forecast

**Goal:** predict if I will play tennis today

-   | Outlook  | Temperature | Humidity | Windy | Plays Tennis
--- | ---      | ---         | ---      | ---   | ---
0   | Sunny    | Hot         | High     | False | No
1   | Rainy    | Cool        | Normal   | False | Yes
2   | Overcast | Cool        | High     | True  | Yes
3   | Rainy    | Mild        | Normal   | False | Yes
4   | Rainy    | Hot         | High     | True  | Yes
5   | Overcast | Cool        | Normal   | False | No
6   | Sunny    | Mild        | High     | True  | No
7   | Overcast | Cool        | High     | False | Yes
8   | Sunny    | Mild        | Normal   | False | No
9   | Sunny    | Hot         | Normal   | True  | Yes

**Query:** Let's say we want to see if I will play tennis today. Today = (Sunny, Mild, Normal, False).

So we want to compute two probabilities and predict the outcome with the larger one:
* **P(Yes | Today)** = P(Yes | Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'Normal', Windy = 'False')
* **P(No | Today)** = P(No | Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'Normal', Windy = 'False')

To compute the final probabilities, we first need:
1. **P(t=l)** = P(Yes) or P(No) 
    1. P(Yes) is the fraction of rows in which 'Yes' appears as the outcome: 6/10
    2. P(No) is the fraction of rows in which 'No' appears as the outcome: 4/10
2. **P(q[1]...q[m])** = P(Today) 
    1. P(Outlook = 'Sunny') = 4/10
    2. P(Temperature = 'Mild') = 3/10
    3. P(Humidity = 'Normal') = 5/10
    4. P(Windy = 'False') = 6/10
    5. Assuming the features are independent, we can multiply these together to obtain P(q[1]...q[m])
        1. P(q[1]...q[m]) = (4/10)(3/10)(5/10)(6/10) = 0.036
3. **P(q[1]...q[m] | t=l)** = P(Today | Yes) or P(Today | No). This is a little trickier, but the per-feature tables that follow should clear things up
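
Steps 1 and 2 can be checked with a short Python sketch. The rows are the tennis table from these notes; the variable names (`data`, `today`, `prior`, `marginals`, `evidence`) are chosen only for this example.

```python
from fractions import Fraction
from math import prod

# Rows of the tennis table: (Outlook, Temperature, Humidity, Windy, Plays Tennis).
data = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Overcast", "Cool", "High", True, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Hot", "High", True, "Yes"),
    ("Overcast", "Cool", "Normal", False, "No"),
    ("Sunny", "Mild", "High", True, "No"),
    ("Overcast", "Cool", "High", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "No"),
    ("Sunny", "Hot", "Normal", True, "Yes"),
]
today = ("Sunny", "Mild", "Normal", False)
n = len(data)

# Priors P(t = l): the fraction of rows with each outcome.
prior = {level: Fraction(sum(row[-1] == level for row in data), n)
         for level in ("Yes", "No")}

# Marginals P(q[i]) for each feature value in today's query.
marginals = [Fraction(sum(row[i] == value for row in data), n)
             for i, value in enumerate(today)]

# P(Today), assuming the four features are independent.
evidence = prod(marginals)

print(prior["Yes"], prior["No"])  # 3/5 2/5
print(float(evidence))            # 0.036
```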

<h4><center>Outlook Conditional Probability: P(Outlook | Yes) and P(Outlook | No)</center></h4>

-        | Yes | No | P(value given Yes) | P(value given No)
---      | --- | ---| ---   | ---
Sunny    | 1   | 3  | 1/6   | 3/4
Rainy    | 3   | 0  | 3/6   | 0/4
Overcast | 2   | 1  | 2/6   | 1/4
Total    | 6   | 4  | 100%  | 100%

<br>

<h4><center>Temperature Conditional Probability: P(Temperature | Yes) and P(Temperature | No)</center></h4>

-    | Yes | No | P(value given Yes) | P(value given No)
---  | --- | ---| ---   | ---
Hot  | 2   | 1  | 2/6   | 1/4
Cool | 3   | 1  | 3/6   | 1/4
Mild | 1   | 2  | 1/6   | 2/4
Total| 6   | 4  | 100%  | 100%

<br>

<h4><center>Humidity Conditional Probability: P(Humidity | Yes) and P(Humidity | No)</center></h4>

-     | Yes | No | P(value given Yes) | P(value given No)
---   | --- | ---| ---   | ---
Normal| 3   | 2  | 3/6   | 2/4
High  | 3   | 2  | 3/6   | 2/4
Total | 6   | 4  | 100%  | 100%

<br>

<h4><center>Windy Conditional Probability: P(Windy | Yes) and P(Windy | No)</center></h4>

-     | Yes | No | P(value given Yes) | P(value given No)
---   | --- | ---| ---   | ---
True  | 3   | 1  | 3/6   | 1/4
False | 3   | 3  | 3/6   | 3/4
Total | 6   | 4  | 100%  | 100%
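
The four tables above can be rebuilt from the raw rows by counting. This is only a sketch; `counts`, `target_counts`, and `cond_prob` are names invented for the example, and fractions print in reduced form (3/6 appears as 1/2).

```python
from collections import Counter, defaultdict
from fractions import Fraction

features = ("Outlook", "Temperature", "Humidity", "Windy")
# Rows of the tennis table; the last entry is Plays Tennis.
data = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Overcast", "Cool", "High", True, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Hot", "High", True, "Yes"),
    ("Overcast", "Cool", "Normal", False, "No"),
    ("Sunny", "Mild", "High", True, "No"),
    ("Overcast", "Cool", "High", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "No"),
    ("Sunny", "Hot", "Normal", True, "Yes"),
]

# Number of rows per outcome: the "Total" row of each table.
target_counts = Counter(row[-1] for row in data)

# counts[feature][(value, level)] = rows with that feature value and outcome.
counts = defaultdict(Counter)
for row in data:
    for name, value in zip(features, row):
        counts[name][(value, row[-1])] += 1

def cond_prob(name, value, level):
    """P(name = value | Plays Tennis = level), estimated from counts."""
    return Fraction(counts[name][(value, level)], target_counts[level])

print(cond_prob("Outlook", "Sunny", "No"))  # 3/4
print(cond_prob("Outlook", "Rainy", "No"))  # 0
print(cond_prob("Windy", False, "Yes"))     # 1/2
```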

Now we can calculate P(Today | Yes) and P(Today | No). Because the features are assumed to be independent given the outcome, each joint conditional probability factors into a product of the per-feature values from the tables above.

**P(Today | Yes)** = 

    P(Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'Normal', Windy = 'False' | Yes)

        = P(Sunny | Yes) P(Mild | Yes) P(Normal | Yes) P(False | Yes)

        = (1/6)(1/6)(3/6)(3/6)

        = 9/1296

**P(Today | No)** = 

    P(Outlook = 'Sunny', Temperature = 'Mild', Humidity = 'Normal', Windy = 'False' | No)

        = P(Sunny | No) P(Mild | No) P(Normal | No) P(False | No)

        = (3/4)(2/4)(2/4)(3/4)

        = 36/256
    
Combining everything together (we can ignore P(Today) for now since both equations are divided by it), we get a score proportional to each posterior:

1. P(Yes | Today) ∝ P(Today | Yes) P(Yes) = (9/1296)(6/10) = 0.0042
2. P(No | Today) ∝ P(Today | No) P(No) = (36/256)(4/10) = 0.0563

Since P(Yes | Today) + P(No | Today) = 1, we can normalize the two values above by dividing each by their sum:

1. P(Yes | Today) = $\frac{P(Today | Yes) P(Yes)} {P(Today | Yes) P(Yes) + P(Today | No) P(No)}$ = $\frac{0.0042} {0.0042 + 0.0563}$ = 0.07

2. P(No | Today) = $\frac{P(Today | No) P(No)} {P(Today | Yes) P(Yes) + P(Today | No) P(No)}$ = $\frac{0.0563} {0.0042 + 0.0563}$ = 0.93

Since 0.93 > 0.07, the argmax over the two outcomes is No, and the model predicts that I will not be playing tennis today.
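
The final arithmetic can be double-checked with a short Python sketch. The conditional probabilities are read off the tables above for Today = (Sunny, Mild, Normal, False); the variable names are chosen only for this example.

```python
from fractions import Fraction
from math import prod

# Per-feature conditional probabilities for Sunny, Mild, Normal, False.
likelihood_terms = {
    "Yes": [Fraction(1, 6), Fraction(1, 6), Fraction(3, 6), Fraction(3, 6)],
    "No": [Fraction(3, 4), Fraction(2, 4), Fraction(2, 4), Fraction(3, 4)],
}
prior = {"Yes": Fraction(6, 10), "No": Fraction(4, 10)}

# Unnormalized score for each level: P(Today | level) * P(level).
score = {level: prod(terms) * prior[level]
         for level, terms in likelihood_terms.items()}

# Normalize so that the two posteriors sum to 1.
total = sum(score.values())
posterior = {level: s / total for level, s in score.items()}

prediction = max(posterior, key=posterior.get)
print({level: round(float(p), 2) for level, p in posterior.items()})
# {'Yes': 0.07, 'No': 0.93}
print(prediction)  # No
```

The exact posteriors are 2/29 and 27/29, which round to the 0.07 and 0.93 obtained by hand.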

## References 
---

Kelleher, John D., et al. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press, 2015.

“Naive Bayes Classifiers.” GeeksforGeeks, 10 Feb. 2018, www.geeksforgeeks.org/naive-bayes-classifiers/.