#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Probability

## Overview

### Learning Objectives

* Know how to calculate the probability of an event from a probability distribution
* Understand the definition of and how to calculate expected value
* Be comfortable doing simple probability calculations and coding simple probability distributions

### Prerequisites

* Track 02-06: Probability - Part 1


### Estimated Duration

60 minutes

# Independent and Dependent Variables

In Machine Learning, we regularly apply concepts from probability to evaluate the accuracy and meaning of our results. Much of the inner workings of Machine Learning also relies on concepts from probability, as we will see when we dive deeper into the subject.

In fact, much of Machine Learning works because of **dependence**: relationships exist between the data and what we want to infer from it.

What, then, is dependence?

In probability, there can be **Independent Variables** and **Dependent Variables**. 

Independent variables are events whose probability does not change based on knowledge of another event. For example, rolling a 2 on one six-sided die does not change the probability of rolling a 4 on another.

Dependent variables are influenced by other events. For example, if you roll two dice and take the sum, rolling a 2 on the first die changes the probability of the overall sum being 8 from $\frac{5}{36}$ to $\frac{1}{6}$, since the second die must now show a 6.
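The distinction can be checked by brute force. The sketch below enumerates all 36 equally likely outcomes of two fair dice; the helper `prob` and its `given` argument are names chosen here for illustration. It checks the "roll a 4 on the other die" case and, for the dependent case, a sum of 8, where the shift from 5/36 to 1/6 is easy to see.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first, second) outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda o: True):
    """Fraction of the outcomes satisfying `given` that also satisfy `event`."""
    pool = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

first_is_2 = lambda o: o[0] == 2
second_is_4 = lambda o: o[1] == 4
sum_is_8 = lambda o: o[0] + o[1] == 8

# Independent: knowing the first roll does not change the second roll.
p_second_4 = prob(second_is_4)
p_second_4_given_first_2 = prob(second_is_4, given=first_is_2)

# Dependent: knowing the first roll changes the chance that the sum is 8.
p_sum_8 = prob(sum_is_8)
p_sum_8_given_first_2 = prob(sum_is_8, given=first_is_2)

print(p_second_4, p_second_4_given_first_2)  # 1/6 1/6
print(p_sum_8, p_sum_8_given_first_2)        # 5/36 1/6
```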

# Types of Probability

## Joint Probability

Joint probability is the probability that two separate events will both happen. If the events are **independent**, then it can be calculated simply by multiplying the probabilities of the two events:

$P(X \text{ and } Y) = P(X)\cdot P(Y)$

For example, if we roll a die and flip a coin, the probability that we roll a 3 and flip a heads is

$P(\text{Roll 3 and Flip Heads}) = P(\text{Roll 3})\cdot P(\text{Flip Heads}) = \frac{1}{6}\cdot\frac{1}{2} = \frac{1}{12}$.
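As a quick check of the product rule, the following sketch counts outcomes in the combined die-and-coin sample space and compares the result to the product of the two individual probabilities. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (die face, coin side) pair, all equally likely.
sample_space = list(product(range(1, 7), ["heads", "tails"]))

p_roll_3 = Fraction(1, 6)
p_heads = Fraction(1, 2)

# Joint probability by direct counting.
favorable = [o for o in sample_space if o == (3, "heads")]
p_joint_counted = Fraction(len(favorable), len(sample_space))

# Joint probability by the product rule for independent events.
p_joint_product = p_roll_3 * p_heads

print(p_joint_counted, p_joint_product)  # 1/12 1/12
```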


***Consider***

1.   Do you think $P(X \text{ and } Y) = P(X)\cdot P(Y)$ when $X$ and $Y$ are not independent? Why or why not?
2.   Consider the case where you roll a die and $X$ is the event that the die roll was even while $Y$ is the event that the die roll was 2. What is $P(X \text{ and } Y)$?
3.   Intuitively, why does $P(X \text{ and } Y) \neq P(X)\cdot P(Y)$ in this example?


When $X$ is the event that a die roll was even and $Y$ is the event that the die roll was 2, then $P(X) = P(\text{roll a 2, 4, or 6})= \frac{3}{6}$ and $P(Y) = \frac{1}{6}$.

From here, we see that $P(X \text{ and } Y) = P(\text{roll an even number and roll 2}) = P(\text{roll 2})$, since if we fulfill the condition of having rolled 2 we also fulfill the condition of rolling an even number!

Thus, in this example $P(X \text{ and } Y) = P(Y) = \frac{1}{6} \neq \frac{3}{36} = P(X) \cdot P(Y)$.
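The same comparison can be made by counting faces of a single fair die. This is a minimal sketch with illustrative variable names.

```python
from fractions import Fraction

faces = range(1, 7)  # one fair six-sided die

# P(X): the roll is even.  P(Y): the roll is 2.
p_even = Fraction(sum(1 for f in faces if f % 2 == 0), 6)
p_two = Fraction(sum(1 for f in faces if f == 2), 6)

# P(X and Y): the roll is both even and equal to 2.
p_even_and_two = Fraction(sum(1 for f in faces if f % 2 == 0 and f == 2), 6)

print(p_even_and_two)   # 1/6
print(p_even * p_two)   # 1/12, the same value as 3/36
```

The two printed values differ, so the product rule fails for these dependent events.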

## Conditional Probability

Conditional probability is perhaps the most common example of using dependent variables. Conditional probabilities are calculated when we want to "update" the probability of one event occurring once we have new information about another event that occurred. For example, say we want to calculate the probability of the sum of two dice rolls being 7, given that the first roll was a 2. This would be called a **conditional probability**. The probability of event X *given* event Y is calculated as follows:

$P(X|Y) = \frac{P(X \text{ and } Y)}{P(Y)}$

Thus, the probability of rolling a sum of 7, given that the first roll was a 2, can be found:

$P(7|2) = \frac{P(7 \text{ and } 2)}{P(2)} = \frac{1/36}{1/6} = 1/6$.

If $X$ and $Y$ are independent, then the conditional probability of $X$ given $Y$ is simply the probability of $X$.
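The definition $P(X|Y) = \frac{P(X \text{ and } Y)}{P(Y)}$ can be coded almost verbatim. The sketch below applies it to the two-dice example; `prob` and `conditional` are hypothetical helper names, and events are written as functions of a `(first, second)` outcome.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), computed by counting outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def conditional(x, y):
    """P(X | Y) = P(X and Y) / P(Y)."""
    return prob(lambda o: x(o) and y(o)) / prob(y)

sum_is_7 = lambda o: o[0] + o[1] == 7
first_is_2 = lambda o: o[0] == 2

p_7_and_2 = prob(lambda o: sum_is_7(o) and first_is_2(o))
p_7_given_2 = conditional(sum_is_7, first_is_2)

print(p_7_and_2)    # 1/36
print(p_7_given_2)  # 1/6
```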

## Summary

**Vocabulary:**

1.   $P(X|Y)$ is a conditional probability, calculating the probability of one event ($X$) given that we know another event ($Y$) has occurred.

2.   $P(X \text{ and } Y)$ is a joint probability, the probability that two separate events have both occurred.

3.  $P(X)$ is a marginal probability, the probability that a single event has occurred.

**Independence:**

When $X$ and $Y$ are independent events,

1.   $P(X \text{ and } Y) = P(X)\cdot P(Y)$
2.   $P(X | Y) = P(X)$



# Bayes' Theorem

Suppose we know one conditional probability in a situation, $P(X|Y)$. If we want to know the opposite, $P(Y|X)$, do we have to start over?

Not at all! Bayes' Theorem gives us a simpler way of calculating this new probability:

$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$
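Bayes' Theorem translates into a one-line function. In this sketch, `bayes` and its parameter names are chosen for illustration: it returns $P(Y|X)$ from $P(X|Y)$, $P(Y)$, and $P(X)$, multiplying by the probability of the event being solved for and dividing by the probability of the event that was observed. The example reuses the single-die events from earlier, with $X$ = "the roll is even" and $Y$ = "the roll is 2".

```python
from fractions import Fraction

def bayes(p_x_given_y, p_y, p_x):
    """Return P(Y | X) given P(X | Y), P(Y), and P(X)."""
    return p_x_given_y * p_y / p_x

# X = "the roll is even", Y = "the roll is 2" on one fair die.
p_even_given_two = Fraction(1)   # a roll of 2 is always even
p_two = Fraction(1, 6)
p_even = Fraction(1, 2)

p_two_given_even = bayes(p_even_given_two, p_two, p_even)
print(p_two_given_even)  # 1/3
```

The result matches intuition: of the three even faces, exactly one is a 2.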

## Example: Plagiarism Detection

We saw in lecture that Bayes' Theorem can be useful to calculate a conditional probability that we are not given. This can be especially useful for determining how much trust we should place in a machine learning algorithm's results.

### **Problem Statement**

Consider a binary classifier that examines student essays and identifies them as positive or negative for being plagiarized. The algorithm's developers reported that the algorithm correctly identifies $99\%$ of all plagiarized essays and correctly identifies $98\%$ of all non-plagiarized essays.

***Consider:*** How accurate/useful does this classifier sound? What is one concern you might have about its accuracy?

A professor who is worried about plagiarism in her class might be interested in deploying this tool. However, since she knows that plagiarism is rare (at her university, $0.5\%$ of essays are plagiarized, so $P(\text{plagiarized}) = 0.005$), she wants to know what it means if the classifier positively identifies an essay as being plagiarized -- how likely is it that the classifier is correct? Should she punish the student?

#### **Identifying Goals**

Before we can solve this problem, let's consider what information we are going to need.

First, we recall that Bayes' Theorem gives us $P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}$. 

Here, we wish to calculate $$P(\text{plagiarized}|\text{+}) = \frac{P(\text{+}|\text{plagiarized})P(\text{plagiarized})}{P(\text{+})}.$$

Recalling that the algorithm's developers reported that the algorithm correctly identifies $99\%$ of all plagiarized essays and correctly identifies $98\%$ of all non-plagiarized essays, we can determine that $P(\text{+}|\text{plagiarized}) = 0.99$ and $P(\text{-}|\text{not plagiarized}) = 0.98$. We also recall that the professor knows that $P(\text{plagiarized}) = 0.005$.


Thus, all we need to calculate is $P(\text{+})$.

### **Calculations**

#### **Finding $P(\text{+})$**

We do so using the law of total probability, $P(X) = \sum_Y P(X|Y)P(Y)$.

In this case, this means that $P(\text{+}) = P(\text{+}|\text{plagiarized})P(\text{plagiarized}) + P(\text{+}|\text{not plagiarized})P(\text{not plagiarized})$.

We already know $P(\text{+}|\text{plagiarized})$, and since the algorithm always gives us either a positive result or a negative result, $P(\text{+}|\text{not plagiarized}) = 1 - P(\text{-}|\text{not plagiarized}) = 0.02$.

Likewise, we can calculate that $P(\text{not plagiarized}) = 1 - P(\text{plagiarized}) = 0.995$.

Thus, 
\begin{align*}
P(\text{+}) &= P(\text{+}|\text{plagiarized})P(\text{plagiarized}) + P(\text{+}|\text{not plagiarized})P(\text{not plagiarized})\\
&= 0.99\cdot0.005 + 0.02\cdot0.995\\
&= 0.02485
\end{align*}
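The arithmetic above can be reproduced in a few lines. This is a sketch using the figures from the problem statement; the variable names are illustrative.

```python
# Figures from the problem statement.
p_plagiarized = 0.005
p_pos_given_plagiarized = 0.99
p_neg_given_not_plagiarized = 0.98

# Complements: the essay is either plagiarized or not,
# and the classifier answers either + or -.
p_not_plagiarized = 1 - p_plagiarized
p_pos_given_not_plagiarized = 1 - p_neg_given_not_plagiarized

# Law of total probability over the two cases.
p_pos = (p_pos_given_plagiarized * p_plagiarized
         + p_pos_given_not_plagiarized * p_not_plagiarized)

print(round(p_pos, 5))  # 0.02485
```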

#### **Finding $P(\text{plagiarized}|\text{+})$**

From here, we have all the pieces we need to calculate 

\begin{align*}
P(\text{plagiarized}|\text{+}) &= \frac{P(\text{+}|\text{plagiarized})P(\text{plagiarized})}{P(\text{+})}\\
&= \frac{0.99\cdot0.005}{0.02485}\\
&\approx 0.199
\end{align*}

This tells us that, even if the algorithm positively identifies an essay as being plagiarized, the probability that the essay is actually plagiarized is less than $20\%$!
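The whole calculation fits in one small function. In the sketch below, the function name and the parameter names `prior`, `sensitivity` (the rate of correctly flagged plagiarized essays), and `specificity` (the rate of correctly cleared non-plagiarized essays) are labels introduced here for illustration, not terms used by the classifier's developers.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(plagiarized | +) via Bayes' Theorem."""
    # P(+) by the law of total probability.
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

p_plagiarized_given_pos = posterior_given_positive(
    prior=0.005, sensitivity=0.99, specificity=0.98)

print(round(p_plagiarized_given_pos, 3))  # 0.199
```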

### **Concluding Questions**


***Consider:***  
1.   Now how accurate/useful do you think this classifier is?
2.   Should the professor punish students whose essays it classifies as being plagiarized?
3.   If the algorithm correctly classifies $98\%$ of non-plagiarized essays as not plagiarized, how can over $80\%$ of the essays it classifies as plagiarized actually be non-plagiarized? 


#### **Solutions** 




1.   We just saw that even though the algorithm correctly identifies $99\%$ of plagiarized essays and $98\%$ of non-plagiarized essays, a positive classification on its own is not very meaningful: roughly four out of five essays flagged as plagiarized are actually non-plagiarized essays that were misidentified.

2.   The professor would punish significantly more innocent students than guilty ones, so no.

3.   This might seem impossible, but it actually isn't since there are significantly fewer plagiarized essays than non-plagiarized ones: 

     We know the algorithm correctly classifies $99\%$ of the $0.5\%$ of the essays that are plagiarized, which is only $0.495\%$ of all essays.

     In contrast, we know that the algorithm only correctly classifies $98\%$ of non-plagiarized essays, so it must misidentify $2\%$ of the $99.5\%$ of the essays that are not plagiarized. This is actually $1.99\%$ of all essays.

     Thus, if we look at these percentages, we see that out of all essays the algorithm classifies as plagiarized (which is $1.99\% + 0.495\% = 2.485\%$), most of them are actually misidentified non-plagiarized essays. In fact, we see that $\frac{1.99\%}{2.485\%} \approx 0.801$. That is, we indeed see that over $80\%$ of the essays it classifies as plagiarized are actually non-plagiarized.
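The percentages in the third solution are easier to see as counts. The sketch below assumes a hypothetical batch of 200,000 essays (a number picked only so that every count is a whole number) and tallies how many essays land in each group.

```python
# Hypothetical batch size, chosen so all counts are whole numbers.
essays = 200_000

plagiarized = essays * 5 // 1000          # 0.5% of essays
not_plagiarized = essays - plagiarized

true_positives = plagiarized * 99 // 100        # 99% correctly flagged
false_positives = not_plagiarized * 2 // 100    # 2% wrongly flagged

flagged = true_positives + false_positives
share_false = false_positives / flagged

print(true_positives, false_positives, flagged)  # 990 3980 4970
print(round(share_false, 3))                     # 0.801
```

Even though only 2% of honest essays are flagged, there are so many honest essays that they outnumber the correctly flagged ones about four to one.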



# Exercises: Halloween Chocolate


## Recall: No. 15 Crescent Way


Recall at No. 15 Crescent Way, the probability of any chocolate bar is:
$$P(\text{chocolate bar of type } X) = \begin{cases}
0.2 & X =\text{Kit Kat}\\
0.3 & X =\text{Milky Way}\\
0.15 & X =\text{Snickers}\\
0.05 & X =\text{Toblerone}\\
0.3 & X =\text{Twix}\\
0 & \text{otherwise}
\end{cases}$$
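A discrete distribution like this one maps naturally onto a dictionary. The sketch below stores the No. 15 Crescent Way probabilities as exact fractions, confirms they sum to 1, and computes the probability of an event by adding up the outcomes it contains. The names `no_15`, `p_bar`, and `p_event` are illustrative.

```python
from fractions import Fraction

# Probability of each bar type at No. 15 Crescent Way.
no_15 = {
    "Kit Kat": Fraction("0.2"),
    "Milky Way": Fraction("0.3"),
    "Snickers": Fraction("0.15"),
    "Toblerone": Fraction("0.05"),
    "Twix": Fraction("0.3"),
}

def p_bar(bar):
    """P(bar); any type not listed has probability 0."""
    return no_15.get(bar, Fraction(0))

def p_event(bars):
    """P(the bar is one of `bars`), summing over distinct outcomes."""
    return sum(p_bar(b) for b in bars)

# A valid distribution must sum to 1.
assert sum(no_15.values()) == 1

print(p_event({"Kit Kat", "Twix"}))  # 1/2
print(p_bar("Hershey"))              # 0
```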

## No. 13 Crescent Way


Next door at No. 13 Crescent Way the residents have two bags of chocolate bars. The first bag only holds Twix bars. The second bag has four types of chocolate bars in equal quantities: Butterfinger bars, Hershey bars, Kit Kat bars, and Milky Way bars. 

Half the time the residents draw from the first bag, and half the time they give out chocolate bars from the second bag.


## Exercise 1

What is the probability of getting a Butterfinger bar from No. 13 Crescent Way?

>***Hint:*** Use $P(X) = \sum_Y P(X|Y)P(Y)$ where $X$ is getting a Butterfinger bar and $Y$ is which bag the residents got the bar from.

### Student Solution

Your answer goes here

### Answer Key

**Solution**

In [0]:
answer = 0.125

\begin{align*}P(\text{Butterfingers}) &= P(\text{First Bag})\cdot P(\text{Butterfingers}|\text{First Bag}) + P(\text{Second Bag})\cdot P(\text{Butterfingers}|\text{Second Bag})\\ &= \frac{1}{2}\cdot0 + \frac{1}{2}\cdot\frac{1}{4}\\ &= \frac{1}{8}\\ &= 0.125\end{align*}

**Validation**

In [0]:
assert answer == 0.125

## Exercise 2

What is the probability of getting at least one Kit Kat bar if you get one chocolate bar from each house?

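>***Hint:*** It is easier to find the probability of getting no Kit Kat bars and subtract it from 1. Simply adding the probability from each house double-counts the case where both bars are Kit Kats.

### Student Solution

Your answer goes here

### Answer Key

**Solution**

In [0]:
answer = 0.3

The bars from the two houses are drawn independently. The probability of a Kit Kat from No. 13 Crescent Way is $\frac{1}{2}\cdot\frac{1}{4} = 0.125$, and from No. 15 Crescent Way it is $0.2$.

\begin{align*}
P(\text{# Kit Kats} \geq 1) &= 1 - P(\text{no Kit Kat from No. 13 Crescent Way})\cdot P(\text{no Kit Kat from No. 15 Crescent Way})\\
&= 1 - (1 - 0.125)\cdot(1 - 0.2)\\
&= 1 - 0.875\cdot0.8\\
&= 0.3
\end{align*}


**Validation**

In [0]:
assert abs(answer - 0.3) < 1e-9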

## Exercise 3

What is the probability that a random bar with equal probability of being from No. 13 or No. 15 Crescent Way is a Twix bar?

>***Hint:*** Think about conditioning on which house the bar is from.

### Student Solution

Your answer goes here

### Answer Key

**Solution**

In [0]:
answer = 0.4

\begin{align*}
P(\text{Twix}) &= P(\text{Twix}|\text{No. 13 Crescent Way})P(\text{No. 13 Crescent Way}) + P(\text{Twix}|\text{No. 15 Crescent Way})P(\text{No. 15 Crescent Way})\\
&= \frac{1}{2}\cdot\frac{1}{2} + 0.3\cdot\frac{1}{2}\\
&= 0.4
\end{align*}


**Validation**

In [0]:
assert answer == 0.4

## Exercise 4

Your friend visited one of the houses and got a Twix bar, but they can't remember which of the two houses they went to. What is the probability that the house they visited was No. 13 Crescent Way?

Before doing the computation, consider which house you think it is more likely for the Twix bar to come from. 

Do your calculations support this guess?

>***Hint:*** Use Bayes' Theorem.

Assume equal probability of visiting either house.



### Student Solution

Your answer goes here

### Answer Key

**Solution**

In [0]:
answer = 0.625

We use Bayes' Theorem to find \begin{align*}
P(\text{No. 13 Crescent Way}|\text{Twix}) &= \frac{P(\text{Twix}|\text{No. 13 Crescent Way})P(\text{No. 13 Crescent Way})}{P(\text{Twix})}\\
&= \frac{0.5\cdot0.5}{0.4}\\
&= 0.625
\end{align*}



**Validation**

In [0]:
assert answer == 0.625

## Exercise 5

You visited one of the houses and got a Milky Way bar. Your friend wants to guess which house you got it from. What is the probability that the house you visited was No. 15 Crescent Way?

Again, assume equal probability of visiting either house.



### Student Solution

Your answer goes here

### Answer Key

**Solution**

In [0]:
answer = 0.706

We know we want to begin by using Bayes' Theorem to find \begin{align*}
P(\text{No. 15 Crescent Way}|\text{Milky Way}) &= \frac{P(\text{Milky Way}|\text{No. 15 Crescent Way})P(\text{No. 15 Crescent Way})}{P(\text{Milky Way})}\\
&= \frac{0.3\cdot0.5}{P(\text{Milky Way})}
\end{align*}

Next, we must find
\begin{align*}
P(\text{Milky Way}) &= P(\text{Milky Way}|\text{No. 13 Crescent Way})P(\text{No. 13 Crescent Way}) + P(\text{Milky Way}|\text{No. 15 Crescent Way})P(\text{No. 15 Crescent Way})\\
&= P(\text{Milky Way}|\text{No. 13 Crescent Way})\cdot0.5 + 0.3\cdot0.5
\end{align*}

Thus we must calculate:
\begin{align*}
P(\text{Milky Way}|\text{No. 13 Crescent Way}) &= P(\text{Milky Way}|\text{First Bag})P(\text{First Bag}) + P(\text{Milky Way}|\text{Second Bag})P(\text{Second Bag})\\
&= 0\cdot0.5 + 0.25\cdot0.5\\ &= 0.125
\end{align*}

Substituting back in, we get that
\begin{align*}
P(\text{Milky Way})
&= P(\text{Milky Way}|\text{No. 13 Crescent Way})\cdot0.5 + 0.3\cdot0.5\\
&= 0.125\cdot0.5 + 0.3\cdot0.5\\
&= 0.2125
\end{align*}

We can finally substitute this into the original equation to calculate
\begin{align*}
P(\text{No. 15 Crescent Way}|\text{Milky Way}) &= \frac{0.3\cdot0.5}{P(\text{Milky Way})}\\
&= \frac{0.3\cdot0.5}{0.2125}\\
&= \frac{12}{17}\\
&\approx 0.706
\end{align*}
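The chain of substitutions above can be double-checked with exact fractions. This sketch rebuilds both houses' distributions from the problem descriptions and applies Bayes' Theorem, assuming (as the exercise states) an equal chance of visiting either house. The names `bag_one`, `bag_two`, `no_15`, `p_no_13`, and `p_house_15_given` are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)

# No. 13: bag one is all Twix; bag two holds four types in equal quantities.
bag_one = {"Twix": Fraction(1)}
bag_two = {bar: Fraction(1, 4)
           for bar in ["Butterfinger", "Hershey", "Kit Kat", "Milky Way"]}

# No. 15: the distribution recalled at the start of the exercises.
no_15 = {
    "Kit Kat": Fraction("0.2"),
    "Milky Way": Fraction("0.3"),
    "Snickers": Fraction("0.15"),
    "Toblerone": Fraction("0.05"),
    "Twix": Fraction("0.3"),
}

def p_no_13(bar):
    """P(bar | No. 13), with each bag used half the time."""
    return half * bag_one.get(bar, 0) + half * bag_two.get(bar, 0)

def p_house_15_given(bar):
    """P(No. 15 | bar) by Bayes' Theorem, with equal priors on the houses."""
    p_bar = half * p_no_13(bar) + half * no_15.get(bar, 0)
    return half * no_15.get(bar, 0) / p_bar

print(p_no_13("Milky Way"))           # 1/8
print(p_house_15_given("Milky Way"))  # 12/17
```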

**Validation**

In [0]:
assert 0.7 <= answer <= 0.71