In [13]:
using StatsBase
using Tables

# Section 1 Overview

Section 1 introduces you to Discrete Probability. Section 1 is divided into three parts:

- Introduction to Discrete Probability
- Combinations and Permutations
- Addition Rule and Monty Hall

After completing Section 1, you will be able to:

- apply basic probability theory to categorical data.
- perform a Monte Carlo simulation to approximate the results of repeating an experiment over and over, including simulating the outcomes in the Monty Hall problem.
- distinguish between: sampling with and without replacement, events that are and are not independent, and combinations and permutations.
- apply the multiplication and addition rules, as appropriate, to calculate the probability of multiple events occurring.
- use sapply() instead of a for loop to perform element-wise operations on a function.

There are 3 assignments that use the DataCamp platform for you to practice your coding skills. There are also some quick probability calculations for you to perform directly on the edX platform, and there is a longer set of problems at the end of Section 1.

This section corresponds to the following section of the course textbook.

We encourage you to use R to interactively test out your answers and further your learning.



# Discrete Probability


RAFAEL IRIZARRY: We start by covering some basic principles related to categorical data.

This subset of probability is referred to as discrete probability.

It will help us understand the probability theory we will later introduce for numeric and continuous data, which is more common in data science applications.
Discrete probability is more useful in card games and we use these as examples.
The word probability is used in everyday language.
For example, Google's autocomplete of "what are the chances of" gives us "getting pregnant," "having twins," and "rain tomorrow."
Answering questions about probability is often hard, if not impossible.
Here, we discuss a mathematical definition of probability that does permit us to give precise answers to certain questions.
For example, if I have two red beads and three blue beads inside an urn and I pick one at random, what is the probability of picking a red one?
Our intuition tells us that the answer is 2/5, or 40%.
A precise definition can be given by noting that there are five possible outcomes, of which two satisfy the condition necessary for the event "pick a red bead."
Because each of the five outcomes has the same chance of occurring, we conclude that the probability is 0.4 for red and 0.6 for blue.
A more tangible way to think about the probability of an event is as a proportion of times the event occurs when we repeat the experiment over and over independently and under the same conditions.
Before we continue, let's introduce some notation. We use the notation probability of A to denote the probability of an event A happening.
We use the very general term event to refer to things that can happen when something happens by chance.
For example, in our previous example, the event was picking a red bead.
In a political poll, in which we call 100 likely voters at random, an example of an event is calling 48 Democrats and 52 Republicans.
In data science applications, we will often deal with continuous variables.
In these cases, events will often be things like, is this person taller than 6 feet?
In this case, we write events in a more mathematical form.
For example, x greater than 6.
We'll see more of these examples later.
Here, we focus on categorical data and discrete probability.


The probability of an event is the proportion of times the event occurs when we repeat the experiment independently under the same conditions.

$$P(A) = \text{probability of event A}$$

An event is defined as an outcome that can occur when something happens by chance.
We can determine probabilities related to discrete variables (picking a red bead, choosing 48 Democrats and 52 Republicans from 100 likely voters) and continuous variables (height over 6 feet).
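
As a minimal sketch of the counting definition used in the urn example, the Julia snippet below divides the number of outcomes that satisfy an event by the number of equally likely outcomes. The names `p_red` and `p_blue` are illustrative and do not come from the course.

```julia
# Urn from the example: 2 red beads and 3 blue beads.
beads = ["red", "red", "blue", "blue", "blue"]

# Probability of an event = outcomes satisfying the event / all equally likely outcomes.
p_red = count(==("red"), beads) / length(beads)
p_blue = count(==("blue"), beads) / length(beads)

println("P(red)  = ", p_red)
println("P(blue) = ", p_blue)
```

This only works because every bead is assumed to be equally likely to be picked; with unequal chances, counting outcomes would not be enough.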



RAFAEL IRIZARRY: Computers provide a way to actually perform the simple random experiments, such as the one we did before.
Pick a bead at random from a bag or an urn with 3 blue beads and 2 red ones.
Random number generators permit us to mimic the process of picking at random.
An example in R is the sample function.
We demonstrate its use showing you some code.
First, use the rep function to generate the urn.
We create an urn with 2 red and 3 blues.
You can see when we type beads we see this.
Now, we can use a sample function to pick one at random.
If we type sample beads comma 1, in this case, we get a blue.
This line of code produces one random outcome.
Now, we want to repeat this experiment over and over.
However, it is, of course, impossible to repeat forever.
Instead, we repeat the experiment a large enough number of times to make the results practically equivalent to doing it over and over forever.
This is an example of a Monte Carlo simulation.
Note that much of what mathematical and theoretical statisticians study, something we do not cover in this course, relates to providing rigorous definitions of practically equivalent, as well as studying how close a large number of experiments gets us to what happens in the limit, the limit meaning if we did it forever.
Later in this module, we provide a practical approach to deciding what is large enough.
To perform our first Monte Carlo simulation, we use the replicate function.
This permits us to repeat the same task any number of times we want.
Here, we repeat the random event 10,000 times.
We set B to be 10,000, then we use the replicate function to sample from the beads 10,000 times.
We can now see if, in fact, our definition is in agreement with this Monte Carlo simulation approximation.
We can use table, for example, to see the distribution.
And then we can use prop.table to give us the proportions.
And we see that, in fact, the Monte Carlo simulation gives a very good approximation with 0.5962 for blue and 0.4038 for red.


Monte Carlo simulations model the probability of different outcomes by repeating a random process a large enough number of times that the results are similar to what would be observed if the process were repeated forever.

The sample() function draws random outcomes from a set of options.

The replicate() function repeats lines of code a set number of times. It is used with sample() and similar functions to run Monte Carlo simulations.
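
The two functions named above are R functions. One possible Julia analogue, sketched here with the standard library only, draws a single element with `rand` and repeats the draw with a comprehension. The seed and the names `rng`, `one_draw`, and `draws` are assumptions made for illustration. `rand` on a collection stands in for `sample(beads, 1)`.

```julia
using Random  # standard library; provides seedable random number generators

beads = ["red", "red", "blue", "blue", "blue"]

# Arbitrary seed, chosen only so that reruns give the same draws.
rng = MersenneTwister(2024)

# One draw: rand on a collection picks one element uniformly at random.
one_draw = rand(rng, beads)

# Rough analogue of replicate(B, sample(beads, 1)): repeat the draw B times.
B = 10_000
draws = [rand(rng, beads) for _ in 1:B]

println("single draw: ", one_draw)
println("number of simulated draws: ", length(draws))
```

Each draw is taken from the full urn, so this corresponds to sampling with replacement.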

In [2]:
# create an urn with 2 red, 3 blue balls
beads = ["red", "red", "blue", "blue", "blue"]


5-element Vector{String}:
 "red"
 "red"
 "blue"
 "blue"
 "blue"

In [8]:
# sample 1 bead from urn at random
sample(beads, 1)


1-element Vector{String}:
 "blue"

This line of code produces one random outcome. We want to repeat this experiment an infinite number of times, but it is impossible to repeat forever. Instead, we repeat the experiment a large enough number of times to make the results practically equivalent to repeating forever. This is an example of a Monte Carlo simulation.

Much of what mathematical and theoretical statisticians study, which we do not cover in this book, relates to providing rigorous definitions of “practically equivalent” as well as studying how close a large number of experiments gets us to what happens in the limit. Later in this section, we provide a practical approach to deciding what is “large enough”.

To perform our first Monte Carlo simulation, we use the replicate function, which permits us to repeat the same task any number of times. Here, we repeat the random event $B = 10{,}000$ times.

In [11]:
# create monte carlo simulation to sample 1 bead 10000 times
# the bead will be sampled with replacement
# the result will be a vector of 10000 beads

# create empty vector to store results
results = []

# draw one bead per iteration and store it
for _ in 1:10_000
    push!(results, sample(beads, 1))
end

In [18]:
countmap(results) # count the number of times each color was sampled

Dict{Any, Int64} with 2 entries:
  ["blue"] => 5994
  ["red"]  => 4006

In [19]:
proportionmap(results) # calculate the proportion of each color sampled

Dict{Any, Float64} with 2 entries:
  ["blue"] => 0.5994
  ["red"]  => 0.4006
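
The keys above print as `["blue"]` and `["red"]` because `sample(beads, 1)` returns a one-element vector. The sketch below draws single strings with `rand` from the standard library instead, and shows how the estimate of the probability of blue behaves as the number of repetitions grows. The function name `estimate_blue`, the seed, and the chosen values of `B` are assumptions for illustration. This is not the practical approach to choosing a "large enough" `B` that the course introduces later.

```julia
using Random, Statistics  # both are standard libraries

beads = ["red", "red", "blue", "blue", "blue"]

# Estimate P(blue) as the proportion of blue beads among B draws with replacement.
estimate_blue(rng, urn, B) = mean(rand(rng, urn) == "blue" for _ in 1:B)

rng = MersenneTwister(1)  # arbitrary seed, for reproducibility only

for B in (10, 100, 1_000, 10_000, 100_000)
    println("B = ", B, "   estimated P(blue) = ", estimate_blue(rng, beads, B))
end
```

Small values of `B` can land far from 0.6, while larger values tend to settle close to it.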

# An important application of the mean() function
In R, applying the mean() function to a logical vector returns the proportion of elements that are TRUE. It is very common to use the mean() function in this way to calculate probabilities and we will do so throughout the course.

Suppose you have the vector beads from a previous video:

beads <- rep(c("red", "blue"), times = c(2,3))
beads
[1] "red" "red" "blue" "blue" "blue"
To find the probability of drawing a blue bead at random, you can run:

mean(beads == "blue")
[1] 0.6
This code is broken down into steps inside R. First, R evaluates the logical statement beads == "blue", which generates the vector:

FALSE FALSE TRUE TRUE TRUE
When the mean function is applied, R coerces the logical values to numeric values, changing TRUE to 1 and FALSE to 0:

0 0 1 1 1
The mean of the zeros and ones thus gives the proportion of TRUE values. As we have learned and will continue to see, probabilities are directly related to the proportion of events that satisfy a requirement.

In [20]:
beads = ["red", "red", "blue", "blue", "blue"]


5-element Vector{String}:
 "red"
 "red"
 "blue"
 "blue"
 "blue"

In [24]:
mean(beads .== "blue")

0.6

In [25]:
beads .== "blue"

5-element BitVector:
 0
 0
 1
 1
 1

Key points
- The probability distribution for a variable describes the probability of observing each possible outcome.
- For discrete categorical variables, the probability distribution is defined by the proportions for each group.
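
To make the second point concrete, here is a small sketch that computes the proportion of each group in a categorical vector using only base Julia. The helper name `category_proportions` is hypothetical. It plays a role similar to `proportionmap` from StatsBase used earlier.

```julia
# Build the probability distribution of a categorical variable as the
# proportion of observations that fall in each group.
function category_proportions(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1  # tally one observation for group x
    end
    n = length(xs)
    return Dict(k => v / n for (k, v) in counts)
end

beads = ["red", "red", "blue", "blue", "blue"]
dist = category_proportions(beads)

for (color, p) in dist
    println(color, " => ", p)
end
```

The proportions across all groups add up to 1, as any probability distribution must.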

# Independence

Key points
Conditional probabilities compute the probability that an event occurs given information about dependent events. For example, the probability of drawing a second king given that the first draw is a king is:

$$Pr(\text{Card 2 is a King} \text{ | } \text{Card 1 is a King}) = \frac{3}{51}$$

- If two events A and B are independent, then:
$$Pr(A \text{ | } B) = Pr(A) $$
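
The king example can be checked by brute-force enumeration. The sketch below assumes a standard 52-card deck, represented hypothetically as `(rank, suit)` tuples. It lists every ordered pair of distinct cards and compares the conditional probability with the unconditional one. All names in the block are illustrative.

```julia
# Hypothetical deck representation: each card is a (rank, suit) tuple.
ranks = vcat("Ace", string.(2:10), "Jack", "Queen", "King")
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = [(r, s) for r in ranks for s in suits]

is_king(card) = card[1] == "King"

# Every ordered draw of two distinct cards (sampling without replacement).
two_card_draws = [(c1, c2) for c1 in deck for c2 in deck if c1 != c2]

first_king  = count(d -> is_king(d[1]), two_card_draws)
both_kings  = count(d -> is_king(d[1]) && is_king(d[2]), two_card_draws)
second_king = count(d -> is_king(d[2]), two_card_draws)

# Exact rational arithmetic avoids floating-point rounding.
p_second_given_first = both_kings // first_king
p_second = second_king // length(two_card_draws)

println("Pr(Card 2 is a King | Card 1 is a King) = ", p_second_given_first)
println("Pr(Card 2 is a King) = ", p_second)
```

The two values differ, so the two draws are not independent when cards are dealt without replacement.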

To determine the probability of multiple events occurring, we use the multiplication rule.
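
For reference, the standard statement of the multiplication rule for two events, written in the notation used above, is:

$$Pr(A \text{ and } B) = Pr(A) \, Pr(B \text{ | } A)$$

As a worked example under the same standard-deck assumption as the king example, the probability that the first two cards dealt are both kings is:

$$Pr(\text{Card 1 is a King and Card 2 is a King}) = \frac{4}{52} \times \frac{3}{51} = \frac{12}{2652} = \frac{1}{221}$$

When A and B are independent, the conditional term reduces to $$Pr(B)$$, so the rule becomes:

$$Pr(A \text{ and } B) = Pr(A) \, Pr(B)$$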