# DS106 Machine Learning : Lesson Nine Companion Notebook

### Table of Contents <a class="anchor" id="DS106L9_toc"></a>

* [Table of Contents](#DS106L9_toc)
    * [Page 1 - Introduction](#DS106L9_page_1)
    * [Page 2 - What are Bayesian Statistics?](#DS106L9_page_2)
    * [Page 3 - Bayes Theorem](#DS106L9_page_3)
    * [Page 4 - Parts of Bayes Theorem](#DS106L9_page_4)
    * [Page 5 - A/B Testing](#DS106L9_page_5)
    * [Page 6 - Bayesian Network Basics](#DS106L9_page_6)
    * [Page 7 - Key Terms](#DS106L9_page_7)
    * [Page 8 - Lesson 4 Practice Hands-On](#DS106L9_page_8)
    * [Page 9 - Lesson 4 Practice Hands-On Solution](#DS106L9_page_9)
    
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L9_page_1"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Bayesian Networks
VimeoVideo('388131444', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO106-ML-L04overview.zip)**.

# Introduction

Bayesian Networks are a way for you to apply probability knowledge in a machine learning algorithm. By the end of this lesson, you should be able to:

* Explain what a Bayesian Network is
* Perform Bayesian networks in Python

This lesson will culminate in a hands-on in which you use Bayesian networks to predict the chance of shark attack. 



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What are Bayesian Statistics?<a class="anchor" id="DS106L9_page_2"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



# What are Bayesian Statistics?

*Bayesian statistics* is a branch of statistics that uses probability to test your beliefs against data. In practice, the simplest Bayesian statistics are similar in concept to the basics you learned - probability, the normal distribution, etc.  But they go out of their way to change the names of everything! As if statistics weren't complicated enough! In this lesson, you'll be learning about Bayesian statistics using the terms you're already familiar with, but don't get into a fistfight with anyone if they use slightly different lingo!

---

## Bayesian Reasoning

Here's an example to ease you into the Bayesian mindset. You start off with an observation of data. For instance, say you hear a very loud, rushing noise outside. You might come up with a couple different ideas of what is going on, or hypotheses, and those are based on your previous experience. You might have a couple different options: it's a plane making the noise, or it's a tornado. Which is more likely? Well, you know that based on your past experience, tornados make this much noise only once or twice a year when they are very severe.  So you're thinking that the plane is more likely. 

Now add in additional data - you live on an Air Force base. Suddenly, the likelihood that the noise is a fighter jet taking off is much, much higher, and your belief that there's a tornado is almost non-existent. The brilliant thing about Bayesian statistics is that you can continually update your hypotheses based on updated data. You can even compare hypotheses to see which one fits your data better. 

An important thing to note is that your data should help change your beliefs about the world, but you should not search for data to back up your beliefs! 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Bayes Theorem<a class="anchor" id="DS106L9_page_3"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Bayes Theorem

Remember back to multiple event probability? Where you either used `or` or `and` to combine the probability of multiple things happening? Well, those were great, but they implied independence - that the probability of one of those things did not impact the probability of the other whatsoever.  But what happens when your two events are related in any way? For instance, color blindness is much more prevalent in males than in females. So the probability that any one random person is color blind very much depends on their gender.  So you can't possibly assume that there is no relation between those two variables! 

How would you calculate probability in that instance? Enter *Bayes theorem*! Bayes theorem is a special probability formula that allows you to calculate the likelihood of an event given the likelihood of another event.

---

## Bayes Formula

Here is the mathematical formula for Bayes theorem.  Don't panic! It's not as bad as it looks! You can even see it in neon lights if that makes it more fun! 

![Bayes Theorem Formula written in neon on a sign.](Media/NeonBayes.jpg)

Too hard to read? Well, you can have less fun but also squint less with this bad boy:

![Bayes Theorem Formula.](Media/BayesTheorem.png)

In plain English, this is what this reads like: 

> The probability of event A given event B is equal to the probability of event A times the probability of event B given A, divided by the probability of B. 

Quite a mouthful! You can break it down even further. A and B are just two events that are not independent.  It's assumed A is the first and B is the second, but it doesn't really matter, as long as you stay consistent with your variable assignment throughout the use of the equation.

Then you have `P`, which is just shorthand for probability.

And lastly, you have the pipe symbol, `|`. This means "given." All in all, this equation is telling you that if you know the probability of A by itself, the probability of B by itself, and the probability of B given A, then you can figure out the probability of A given B.
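To make the formula concrete, here is a minimal Python sketch applied to the color blindness example mentioned earlier. The rates used (half of people male, 8% of males and 0.5% of females color blind) are round illustrative assumptions, not figures from this lesson, and the function name `bayes_posterior` is made up for the example.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, prob_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / prob_b

# Illustrative assumptions only (not from the lesson).
p_male = 0.5
p_cb_given_male = 0.08
p_cb_given_female = 0.005

# P(B): the overall chance of being color blind, summed over both groups.
p_cb = p_cb_given_male * p_male + p_cb_given_female * (1 - p_male)

# P(A|B): the chance a person is male, given that they are color blind.
p_male_given_cb = bayes_posterior(p_male, p_cb_given_male, p_cb)

print(round(p_cb, 4))             # 0.0425
print(round(p_male_given_cb, 3))  # 0.941
```

Because `P(B)` is built from the same pieces as the top of the formula, the answer is guaranteed to land between 0 and 1.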

---

## Bayesian Reasoning with the Bayes Formula

If you want to walk this into the wonderful world of Bayes reasoning that you've just hit upon, you can think of this in terms of observations and beliefs.  Substitute for `A` beliefs, and for `B`, observations. Now the question becomes, what is the probability of my beliefs being true, given my observations?

The pretty cool thing about this is that with Bayes theorem, you can figure out exactly how much your beliefs change because of evidence. 
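As a rough sketch of that belief-updating idea, the Python below revisits the plane-versus-tornado story. Every number in it is hypothetical and chosen only for illustration, and the helper `update` is not part of any library.

```python
def update(priors, likelihoods):
    """Apply Bayes theorem to each hypothesis, then renormalize.

    priors:      dict of hypothesis -> P(hypothesis)
    likelihoods: dict of hypothesis -> P(evidence | hypothesis)
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())  # P(evidence) across all hypotheses
    return {h: value / evidence for h, value in unnormalized.items()}

# Hypothetical starting beliefs about the loud, rushing noise.
beliefs = {"plane": 0.7, "tornado": 0.3}

# New evidence: you live on an Air Force base. Assume (hypothetically) that
# this evidence fits the plane explanation far better than the tornado one.
beliefs = update(beliefs, {"plane": 0.9, "tornado": 0.05})

print(beliefs)
```

The updated beliefs still add up to 1, and the posterior from one update can be fed back in as the prior for the next piece of evidence.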

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Parts of Bayes Theorem<a class="anchor" id="DS106L9_page_4"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Parts of Bayes Theorem

There are three components to Bayes theorem:

* Posterior probability
* Likelihood
* Prior probability

You will learn about these components in more detail below!

---

## Posterior Probability

The *posterior probability* is the part of the theorem that lets you quantify how strongly you hold beliefs about the data you've observed. The posterior probability is the end result of Bayes' theorem and what you're trying to find out. This is often shortened to just "posterior." No butt jokes, guys!

---

## Likelihood

The *likelihood* is the probability of the data given your current beliefs. In other words, how likely is it that you would see this data if your belief were true? This represents the top left portion of Bayes' theorem - the P(B|A) part.

---

## Prior Probability

The *prior probability* is all about the strength of your belief before you see the data. Hence, prior, meaning before! This represents the top right portion of Bayes' theorem - the P(A) part.

---

## The Bottom?

You may be wondering about the bottom of the equation.  Doesn't that get its own special name too? It does: `P(B)` is usually called the *evidence* or the normalizing constant, though you're encouraged to give it a more exciting one.  Stormageddon, anyone? The bottom portion of Bayes' theorem rescales the top of the equation by how common your observation is overall, whether or not your belief is true, so that different hypotheses can be compared fairly. 

---

## An Example

You will now calculate the probability that your instructor's a dork, given that she cuddles her statistics book at night. Call "your instructor's a dork" `A` and "cuddling statistics books" `B`. 

---

### Find the Likelihood

You can think of the likelihood in this scenario as the probability that your instructor cuddles a statistics book, assuming she really is a dork. If you are pretty darn certain that dorks do this, then you could make it a probability like 8/10. If you change your mind at any point, well, guess what? That is totally fine! This means `P(B|A) = 8/10`. 

---

### Find the Prior

Next, you need to find the prior. Remember, this is just the probability of A. Believe it or not, a survey found that 60% of Americans consider themselves nerds. So you'll use a probability of 6/10 for that. That means: `P(A) = 6/10`. 

---

### Calculate the Normalizing Factor P(B)

What is `P(B)`? That's on the bottom! Well that is the probability that someone cuddles their statistics book at night, regardless of whether or not they are a dork. How many people is that? Well, you could take an educated guess based upon the fact that only 11% of people take statistics in a secondary school, and of those, 55% sell their books back after the course, which leaves 45% who keep them. That means that only 4.95% (11% * 45%) still own a statistics book after a semester is up. So it's not even very likely that people have statistics books, let alone cuddle them. Maybe 1 in 100 owners will cuddle a statistics book, and with only about 1 in 20 owning one at all...that makes it `4.95% * 1%` or `.000495`. 

That is one way to go. But if you don't want to estimate it, or it is difficult to estimate it, then you can choose from a standard `P(B)` setup. Your choices are:

* .05
* .01
* .005
* .001

It is important to note that the smaller the `P(B)`, the larger your posterior probability, or end result, is. 

---

### Calculate the Posterior

Next, you will calculate the posterior. Remember that this is your overall goal! You are ready to solve this bad boy! 

This is just plug 'n play at this point:

```text
P(A|B) = (P(B|A) * P(A)) / P(B)
P(A|B) = (.8 * .6) / .000495
P(A|B) = .48 / .000495
P(A|B) = 969.70
```

That's great! You have a number! But what does it mean? Notice that it is far greater than 1, so it can't be read as a true probability. That happens because `P(B)` was guessed on its own rather than worked out from the same assumptions as the likelihood and the prior. Treat the result as a relative score instead: it's really hard to interpret without a comparison to an alternative hypothesis. It's sort of like comparing machine learning models with AIC - the number itself doesn't matter, just whether it is larger or smaller than other models.

Can you guess what you're going to do next?

---

### Create and Test Alternative Hypotheses Using Bayes

Ok, so one explanation for why your instructor may cuddle her statistics textbook at night is because she doesn't have a pillow. That becomes your new `A`. A quick internet search shows no relevant results. You can then assume that 99% of people own a pillow, which means that 1% don't. 

Your new `P(A)` is now 1/100. And that's probably a high estimate of those who don't own a pillow. How does that change your results?

Do some more plug 'n chug!

```text
P(A|B) = (P(B|A) * P(A)) / P(B)
P(A|B) = (.8 * .01) / .000495
P(A|B) = .008 / .000495
P(A|B) = 16.16
```

So this means that it is much more likely that your instructor's a dork and not that she doesn't own a pillow. Tada! Relative probability at its finest.
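Since both hypotheses get divided by the same `P(B)`, the comparison between them does not actually depend on which `P(B)` you pick. The Python sketch below illustrates this by trying each of the standard `P(B)` choices listed earlier; the function name `posterior_score` is made up for the example.

```python
def posterior_score(likelihood, prior, evidence):
    """Bayes formula as used in the worked example: P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

likelihood = 0.8        # P(B|A), used for both hypotheses in the example
prior_dork = 0.6        # P(A) for "your instructor's a dork"
prior_no_pillow = 0.01  # P(A) for "your instructor doesn't own a pillow"

ratios = []
for evidence in (0.05, 0.01, 0.005, 0.001):
    dork = posterior_score(likelihood, prior_dork, evidence)
    no_pillow = posterior_score(likelihood, prior_no_pillow, evidence)
    ratios.append(dork / no_pillow)
    print(evidence, round(dork / no_pillow, 2))
```

Every row comes out to the same ratio, 60, which is just `0.6 / 0.01`: with equal likelihoods, the dork hypothesis is favored by exactly the ratio of the two priors.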

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - A/B Testing<a class="anchor" id="DS106L9_page_5"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# A/B Testing

Remember all those fun research designs you learned about in Basic Statistics? Well, there are more than you learned about there. There are nearly infinite variations having to do with comparisons, varying timepoints, and messing with individuals over and over again. Ethically, of course. And only if they're willing!

A/B testing is yet another type of research design, in which you are directly comparing two methods. In practice, if you have means, you'd compare group A and group B with a *t*-test.  But what if you don't have means? What if you have probabilities or percentages? Enter the Bayesian A/B test!

---

## Create the Prior

Say you are testing whether your existing recipe for cream cheese frosting (A) is better than your new recipe for cream cheese frosting (B), and that you are testing it at your local bake sale. Your starting belief will be that these frostings are totally equal. No one will like the new one any better than the old.  So, assume that 80% of all bake sale buyers will finish eating your cupcakes, whichever type of frosting is on top. 

---

## Collect Data

Now that you have a hypothesis to test, it's time to collect data! You hold that bake sale, and for the old cream cheese frosting, A, 81% of people finish eating their cupcake. And for the new cream cheese frosting recipe, only 61% finish eating their cupcake. Want it in table form? Take a peek.

<table class="table table-striped">
    <tr>
        <th>Frosting Type</th>
        <th>Ate it All</th>
        <th>Did Not Eat it All</th>
        <th>Ratio</th>
    </tr>
    <tr>
        <td>Old</td>
        <td>95</td>
        <td>22</td>
        <td>.81</td>
    </tr>
    <tr>
        <td>New</td>
        <td>73</td>
        <td>46</td>
        <td>.61</td>
    </tr>
</table>

Right off the bat, you should be thinking to yourself that perhaps frosting recipe B isn't as good. But, it's always a good idea to science things and know for sure! That's what statistics is all about!

---

## Work the Problem in R using Monte Carlo Simulation

That's right, folks, you're out of calculator land again and into programming! Lucky for you, you can finish the A/B testing in R.

*Monte Carlo simulation* is a way to estimate a result by drawing a large number of random samples from a probability distribution. You start from a little bit of data, but to get the best results, you may want a lot more rows than you have. So use Monte Carlo simulation to expand things.  Kind of like those toy dinosaurs that grow when you pour water over them. 

The function to do this is `rbeta()`, which draws random samples from a beta distribution. The beta distribution describes how plausible each possible success rate is for a binomial process. Remember that a binomial process is one in which you only have two outcomes. For instance, a) did eat the whole cupcake or b) did not eat the whole cupcake.

There are two components of the beta distribution that you'll need to define as variables, in addition to the number of trials you intend to use:  
* alpha: How many times an event happens that you care about 
* beta: How many times an event happens that you don't care about

First, assign some variables in R.  You'll need a variable to hold onto the priors and the number of trials you want to extend this to. Although you can choose any number of trials you want, here, you'll use `10,000`. 

```{r}
trials <- 10000
```

`alpha` and `beta` are based on the priors you created. Since you thought that about 80% of people would finish eating a cupcake, `8` becomes your `alpha`. `beta`, the event you don't care about, or not finishing a cupcake, would be `2`. This is because of the "not" rule of probability. You've only got two potential options - people either will finish eating their cupcake or they won't - so the probability of not eating is one minus the probability of eating. Since you are doing this out of 10, that means 10-8 = 2, and 2 becomes your `beta`. 

```{r}
alpha <- 8
beta <- 2
```
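One way to sanity-check a choice of `alpha` and `beta` is to look at the mean and spread of the beta distribution they define. The Python sketch below uses the standard formulas for the beta mean and variance; the helper names are made up for the example, and the second prior (80 and 20) is a hypothetical comparison, not something used in this lesson.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    variance = (alpha * beta) / (total ** 2 * (total + 1))
    return variance ** 0.5

# The lesson's prior: 8 "finished" and 2 "did not finish" out of 10.
weak_mean, weak_sd = beta_mean(8, 2), beta_sd(8, 2)

# Hypothetical stronger prior with the same 80% guess, out of 100.
strong_mean, strong_sd = beta_mean(80, 20), beta_sd(80, 20)

print(weak_mean, round(weak_sd, 3))
print(strong_mean, round(strong_sd, 3))
```

Both priors are centered on 0.8, but the one built out of 10 is much more spread out, so it lets the collected data do most of the talking.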

Now, you are all set up to use `rbeta()` at last! You'll use it for both frosting types. Remember that A was your old, tried-and-true cream cheese frosting recipe, and B was the new one. The variable `samplesA` holds simulated values of the finish rate for frosting A, given both your prior and the data you collected. The first argument it uses is the number of trials you want to simulate this over, and the second is the number of people who ate all of the cupcake with frosting A plus the prior of alpha. The third argument is the number of people who did not eat all of the cupcake with frosting A plus the prior of beta. You are basically combining your guess with reality here. 

You will follow the same flow for `samplesB`.

```{r}
samplesA <- rbeta(trials, 95 + alpha, 22 + beta)
samplesB <- rbeta(trials, 73 + alpha, 46 + beta)
```

Lastly, you can figure out if B is better by seeing the percentage of the trials in which B came back greater than A. You are basically just adding up with the `sum()` function every time that `samplesB` was greater than `samplesA` out of the total number of `trials`.

```{r}
Bsuperior <- sum(samplesB > samplesA) / trials
```

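The end result is `0`, or something extremely close to it (the draws are random, so your exact value may differ slightly from run to run). Wow! Your initial suspicions were right! There is definitely a clear case to stick with your original frosting, because in virtually none of the 10,000 simulated trials did frosting B, your new recipe, come out with a higher finish rate than frosting A!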
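If you would like to cross-check the R result in this notebook's own language, the same simulation can be sketched in Python. This version assumes the standard library's `random.betavariate()` as a stand-in for R's `rbeta()`; the seed value is arbitrary and only there to make the run repeatable.

```python
import random

random.seed(106)  # arbitrary seed so the sketch is repeatable

trials = 10_000
alpha, beta = 8, 2  # prior from the lesson: about 80% finish their cupcake

# Each draw is one plausible "true" finish rate for that frosting.
samples_a = [random.betavariate(95 + alpha, 22 + beta) for _ in range(trials)]
samples_b = [random.betavariate(73 + alpha, 46 + beta) for _ in range(trials)]

# Proportion of paired draws in which B's finish rate beats A's.
b_superior = sum(b > a for a, b in zip(samples_a, samples_b)) / trials
print(b_superior)
```

Expect a value at or very near zero, in line with the R walkthrough; the exact figure depends on the random draws.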

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Bayesian Network Basics<a class="anchor" id="DS106L9_page_6"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Bayesian Network Basics

*Bayesian Statistics* is based around conditional probability.  *Bayesian Networks* are a specific type of Bayesian Statistics that map out the conditional relationships of multiple variables. Use Bayesian Networks when you want to find the probability of an outcome when it is impacted by several previous conditional variables.

The image below is an example of a simple Bayesian Network. The results of condition A impact condition B and condition C, and both condition B and condition C impact the probability of condition D.  This means that condition D is the final thing you are trying to predict.

![Four circles, one on top labeled A, one on left labeled B, one on right labeled C, one on bottom labeled D. There is an arrow from A to B and A to C. There is an arrow from B to D and C to D.](Media/BayesianNetwork.png)

---

## Example

How about an example to clear things up?  Ask yourself if you will have fun at the beach today. In this case, you want to know the probability of having fun at the beach today.  Sounds simple, right? But maybe not.  First, ask yourself if it is sunny today.  This directly impacts the temperature of the beach and how crowded it is.  If it is sunny, it is more likely to be hot and it is more likely to be crowded.  Whether or not it is sunny does not directly impact if you will have fun, but if the beach is hot or if the beach is crowded will both impact your probability of having fun. If the beach is warm and not crowded, you are more likely to have fun than if the beach is blazing hot and so busy you are packed in like sardines. 

![Four circles, one on top labeled Sunny? One on left labeled Hot? One on right labeled Busy? One on bottom labeled Fun? There is an arrow from Sunny to Hot and from Sunny to Busy. There is an arrow from Hot to Fun and Busy to Fun.](Media/BeachNetwork.png)
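To see how a network like this turns into a number, here is a small Python sketch that computes the probability of having fun by adding up every combination of Sunny, Hot, and Busy. All of the probabilities in the tables are hypothetical, invented purely for illustration, and the approach shown (brute-force enumeration) is just one simple way to query a network this small.

```python
from itertools import product

# Hypothetical conditional probability tables, invented for illustration.
p_sunny = 0.7
p_hot_given_sunny = {True: 0.8, False: 0.2}    # P(Hot | Sunny)
p_busy_given_sunny = {True: 0.6, False: 0.1}   # P(Busy | Sunny)
p_fun_given_hot_busy = {                       # P(Fun | Hot, Busy)
    (True, True): 0.3,
    (True, False): 0.7,
    (False, True): 0.5,
    (False, False): 0.9,
}

def prob(value, p_true):
    """Probability that a yes/no variable takes `value`, given P(True)."""
    return p_true if value else 1 - p_true

def p_fun(sunny=None):
    """P(Fun), optionally given a known value for Sunny, by enumeration."""
    total = 0.0
    normalizer = 0.0
    for s, h, b in product([True, False], repeat=3):
        if sunny is not None and s != sunny:
            continue  # skip worlds that contradict what you observed
        weight = (prob(s, p_sunny)
                  * prob(h, p_hot_given_sunny[s])
                  * prob(b, p_busy_given_sunny[s]))
        normalizer += weight
        total += weight * p_fun_given_hot_busy[(h, b)]
    return total / normalizer

print(round(p_fun(), 3))
print(round(p_fun(sunny=True), 3))
print(round(p_fun(sunny=False), 3))
```

Notice that Sunny never appears in the Fun table. It only changes the answer by shifting the chances of Hot and Busy, which is exactly what the arrows in the diagram say.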

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS106L9_page_7"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Bayesian Statistics</td>
        <td>Statistics using conditional probability.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Bayesian Networks</td>
        <td>Machine learning using the conditional relationships of multiple variables.</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 4 Practice Hands-On<a class="anchor" id="DS106L9_page_8"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


This Hands-On will **not** be graded, but you are encouraged to complete it, because the best way to become a data scientist is to practice.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Bayesian Statistics Hands-On

For this hands-on, you will be determining which type of mold-removal solution works better: just bleaching objects, or bleaching them and scrubbing them down thoroughly. Run your simulation over 10,000 trials. Based on the priors you created, the mold-removal solutions have a 90% chance of working.

You're trying to determine whether the mold will grow back or not, using the following table:

<table class="table table-striped">
    <tr>
        <th>Mold Removal Type</th>
        <th>Mold Returned</th>
        <th>Did Not Return</th>
        <th>Ratio</th>
    </tr>
    <tr>
        <td>Bleach</td>
        <td>27</td>
        <td>39</td>
        <td>.41</td>
    </tr>
    <tr>
        <td>Bleach and Scrubbing</td>
        <td>10</td>
        <td>45</td>
        <td>.18</td>
    </tr>
</table>

Complete A/B testing and Monte Carlo simulation using R. Please attach your R script file with your code documented and information in comments about your findings.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 4 Practice Hands-On Solution<a class="anchor" id="DS106L9_page_9"></a>

[Back to Top](#DS106L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 4 Practice Hands-On Solution

```{r}
# Number of simulated draws for each treatment.
trials <- 10000

# Prior: the solutions have a 90% chance of working (mold does not return),
# so out of 10, alpha (worked) is 9 and beta (did not work) is 1.
alpha <- 9
beta <- 1

# Model the chance that each treatment works. "Did Not Return" is the event
# you care about, so it pairs with alpha; "Mold Returned" pairs with beta.
samplesA <- rbeta(trials, 39 + alpha, 27 + beta)  # bleach only
samplesB <- rbeta(trials, 45 + alpha, 10 + beta)  # bleach and scrubbing

# Proportion of trials in which bleach and scrubbing (B) had a higher
# success rate than bleach alone (A).
Bsuperior <- sum(samplesB > samplesA) / trials

# Print the result. The exact value changes from run to run because the draws
# are random, but it should be close to 1: bleach and scrubbing works better.
Bsuperior
```