# Lab 5: Probability, Randomization and Simulation

Welcome to our 2nd module on statistics and Lab 5! This week, we will go over conditionals and iteration, and introduce the concept of randomness. Randomness and probability are central to statistics. Most of this material is covered in [Chapter 8](https://inferentialthinking.com/chapters/09/Randomness.html) of the textbook. Near the end of the lab there are questions on simulation and p-values using the groundhog weather prognostication data. The textbook's [Chapter 12](https://inferentialthinking.com/chapters/12/1/AB_Testing.html) covers the concepts of simulation and p-values.
<br>**<center>Learning Goals**
|Area|Concept|
|---|---|
|Booleans|comparison operators |
|Conditional|combine booleans with `if` statements to conditionally execute code|
|Randomization|use `np.random.choice()` to simulate outcomes|
|Iteration|stepping through elements of a list, array, or string sequentially|
|Probability|determine the probability of a given outcome, e.g., rolling two 6s on a pair of dice|
|p-value|probability of getting a result at least as extreme as the one observed if chance alone were at work|


First, set up the tests and imports by running the cell below.

In [None]:
name = ...

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import glob  # needed for the glob.glob call below
import os
user = os.getenv('JUPYTERHUB_USER')
from gofer.ok import check

from EDS_mod.EDS_mod import *
notebooks = glob.glob('*.ipynb')
notebook = max(notebooks, key=os.path.getmtime)

## 1. Conditionals

In Python, Boolean values can either be `True` or `False`. We get Boolean values when using comparison operators, among which are `<` (less than), `>` (greater than), and `==` (equal to). For a complete list, refer to [Booleans and Comparison](https://inferentialthinking.com/chapters/09/Randomness.html) at the start of Chapter 8.

<br>**<center>Boolean Operators**
|Operator|Description|Example|
|---|---|---|
|`==`|test equality| `3 == 3`|
|`>`|greater than| `if x > 3:`|
|`<`|less than| ` 3 < 5 `|
|`>=`| greater than or equal to | `if x >= 5:`|
|`<=`| less than or equal to | `if x <= 5:`|
|`!=`| not equal to | `if x !=  y:`|



Run the cells below to see an example of a comparison operator in action.

In [None]:
3 == 3

In [None]:
3 == 1

In [None]:
3 > 1

In [None]:
1 > 3

In [None]:
bool_expression = 3 > 1 + 1
bool_expression

Arrays are compatible with comparison operators. The output is an array of boolean values.

In [None]:
np.array([1, 5, 7, 8, 3, -1]) > 3

In Python, `True` is also equivalent to the integer 1, and `False` is equivalent to the integer 0.

In [None]:
bool_expression == 1
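
Because `True` behaves like 1 and `False` like 0, the `True` values in a Boolean array can be counted directly. The sketch below reuses the numbers from the array example above; the names `values` and `above_three` are chosen here for illustration only.

```python
import numpy as np

values = np.array([1, 5, 7, 8, 3, -1])   # same numbers as the array example above
above_three = values > 3                  # Boolean array: True where the value exceeds 3

print(above_three)
print(np.count_nonzero(above_three))      # counts the True entries
print(above_three.sum())                  # same count, because each True adds 1
```

Either approach gives the number of entries that satisfy the comparison without looking at the array by hand.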

### Radiation Dose from Airline Travel

When you fly on a commercial aircraft, you are exposed to cosmic radiation from space that is more intense at higher altitudes and at higher latitudes (less atmospheric shielding). The average dose rate is about 0.004 millisieverts (mSv) per hour of flight at cruising altitude (around 10,000 meters or 33,000 feet).

Using the function call `np.random.choice(array_name)`, let's simulate a random flight from Philadelphia to a variety of destinations, both near and far. Start by running the cell below several times, and observe how the destinations change.

In [None]:
destination = make_array('Boston', 'Los Angeles', 'London', 'Beijing', 'Sydney')
np.random.choice(destination)

In [None]:
np.random.choice(destination, 8)
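
With a second argument, `np.random.choice` draws that many items, each one equally likely and independent of the others. As a rough illustration, assuming a fair six-sided die instead of the destination array, the fraction of sixes in many rolls should land near 1/6. The seed is an assumption added only so the sketch gives the same result on every run.

```python
import numpy as np

np.random.seed(0)                       # assumed seed, only to make the sketch repeatable

die = np.arange(1, 7)                   # faces 1 through 6
rolls = np.random.choice(die, 6000)     # 6000 independent rolls, each face equally likely

fraction_sixes = np.count_nonzero(rolls == 6) / len(rolls)
print(fraction_sixes)                   # should be close to 1/6
print(1 / 6)
```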

### <font color=blue> **Question 1.** </font> 
Assume we took eight trips at random and stored the results in an array called `eight_destinations`. Find the number of trips to London (do not hardcode the answer).  

*Hint:* Our solution involves a comparison operator and the `np.count_nonzero` function. This works because `True` counts as 1 (non-zero) and `False` counts as 0, so counting the non-zero entries of a Boolean array counts its `True` values.

In [None]:
eight_destinations = make_array('Los Angeles', 'Boston', 'Sydney', 'Beijing', 'Sydney', 'London', 'Boston', 'London' )
number_London = ...
number_London

In [None]:
# NOTE: There is no check here because np.random.choice will be different every time, but you should be able to tell by inspection if number_London is correct.

**Conditional Statements**

A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on whether some condition is true.

Here is a basic example.
```python
def sign(x):
    if x > 0:
        return 'Positive'
```
The function works as follows: if the input `x` is greater than `0`, we get the string `'Positive'` back. Otherwise, the function returns nothing (`None`).

If we want to test multiple conditions at once, we use the following general format.

```python
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>
```

Only one of the bodies will ever be executed. Each `if` and `elif` expression is evaluated and considered in order, starting at the top. As soon as a true value is found, the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the `if` or `elif` expressions are true, then the `else body` is executed. For more examples and explanation, refer to [Section 8.1](https://www.inferentialthinking.com/chapters/08/1/conditional-statements.html).
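
As a small illustration of the general format, the `sign` function from above can be extended with an `elif` and an `else` branch. The `'Negative'` and `'Zero'` return values are choices made for this sketch.

```python
def sign(x):
    # Conditions are checked from top to bottom; only the first true one runs.
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        # Reached only when neither condition above is true.
        return 'Zero'

print(sign(3))
print(sign(-2))
print(sign(0))
```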

### <font color=blue> **Question 2.** </font> 
We want to write a function that returns the radiation dose depending on the destination.  The function takes in destination as a string and returns a dose: 
- 0.004 mSv for Boston (1 hr)
- 0.024 mSv for Los Angeles (6 hrs)
- 0.028 mSv for London (7 hrs)
- 0.052 mSv for Beijing (13 hrs)
- 0.08 mSv for Sydney (20 hrs) 

In [None]:
def calculate_dose(dest):
    if ...
        return 0.004
    # next condition should return 0.024
    ...
    # next condition should return 0.028
    ...
    # next condition should return 0.052
    ...
    # next condition should return 0.08
    ...

London_dose = calculate_dose('London')
London_dose

In [None]:
check('tests/q2.py')

## 2. Iteration

Using a `for` statement, we can perform a task multiple times. This is known as iteration. Here, we'll simulate drawing different suits from a deck of cards. 

In [None]:
suits = make_array("♤", "♡", "♢", "♧")
draws = make_array()
repetitions = 6
for i in np.arange(repetitions):
    draws = np.append(draws, np.random.choice(suits))
draws

Another use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow.

In [None]:
rainbow = make_array("red", "orange", "yellow", "green", "blue", "indigo", "violet")
for color in rainbow:
    print(color)

We can see that the indented part of the `for` loop, known as the body, is executed once for each item in `rainbow`. Note that the name `color` is arbitrary; we could easily have named it something else.
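
A common use of a `for` loop is to build up a running total: set a name to zero before the loop, then add to it inside the body. The sketch below counts the letters in all the rainbow colors; `total_letters` is a name introduced here for illustration, and the array is built with `np.array` so the example does not depend on the `datascience` library.

```python
import numpy as np

rainbow = np.array(["red", "orange", "yellow", "green", "blue", "indigo", "violet"])

total_letters = 0                     # initialize the total *before* the loop
for color in rainbow:
    total_letters = total_letters + len(color)   # add this color's length to the total

print(total_letters)
```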

### <font color=blue> **Question 3.** </font>  
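You're a pilot for Random Air, and you fly between Philadelphia and one of these destinations 60 times in a year. Use a loop to simulate the 60 trips, choosing each destination at random, and add up the total radiation dose you receive.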

In [None]:
destination = make_array('Boston', 'Los Angeles', 'London', 'Beijing', 'Sydney')
num_trips = 60

total_dose = ... # Initialize the value of total *before* you start the loop

for ... in ...: # Create a loop over the number of trips
    dest = ... # Select a destination at random from your array
    dose = ...  # Use your dose function to determine the radiation dose received on the flight
    ...         # Add the dose to your running total

total_dose

In [None]:
check('tests/q3.py')

**Note:** A normal background radiation dose for the average person is 2.4 mSv/year from natural sources. Pilots and flight crews typically receive an additional 2-5 mSv/yr, or about twice background. So, a pilot’s excess annual dose is above the general public’s recommended limit, but well below occupational limits for radiation workers. There’s no evidence of significant health risk at these levels, but it’s a unique occupational exposure. This is one of the few jobs where you can get more radiation than a nuclear power plant worker. Few of the rest of us fly anywhere close to this number of hours annually.

### <font color=blue> **Question 4.** </font> 
Charles Darwin was a famous naturalist and biologist of the 1800s. While Darwin is known for several different theories, one of his best-known ideas grew out of his study of the finches on the Galapagos Islands, which helped form his theory of natural selection and speciation. In this question, we are going to loop through Darwin's book *On the Origin of Species* and count the number of times he refers to bird or birds in the text.

*Hint:* We want to count all instances of bird, birds, Bird and Birds.
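
One way to count several spellings of a word is to combine comparisons with `or` inside a loop. The sketch below uses a made-up sentence and the word "cat" rather than the Darwin text; `sentence`, `words`, and `cat_count` are hypothetical names.

```python
import numpy as np

sentence = "The cat sat near another Cat while two cats slept"   # made-up example text
words = np.array(sentence.split())    # split on whitespace into individual words

cat_count = 0
for word in words:
    # A word is counted if it matches any one of the listed spellings.
    if word == 'cat' or word == 'Cat' or word == 'cats' or word == 'Cats':
        cat_count = cat_count + 1

print(cat_count)
```

Note that splitting on whitespace leaves punctuation attached to words, so a word such as `cats,` would not match any of these comparisons.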

In [None]:
darwin_string = open('darwin_origin_species.txt', encoding='utf-8').read()
darwin_words = np.array(darwin_string.split())

birds = ...
...


birds 

In [None]:
check('tests/q4.py')

## 3. Probability

Astronomers have their telescope pointed towards a particular section of sky. During any one-hour period, the astronomers have a 0.7 chance of seeing a meteoroid in the sky.

### <font color=blue> **Question 5.** </font> 
What is the probability that the astronomers will not see a meteoroid in this hour?

In [None]:
no_meteoroid = ...
no_meteoroid

In [None]:
check('tests/q5.py')

### <font color=blue> **Question 6.** </font> 
What is the probability of seeing a meteoroid in the first hour and then seeing a meteoroid in the second hour?

In [None]:
two_meteoroids = ...
two_meteoroids

In [None]:
check('tests/q6.py')

### <font color=blue> **Question 6 Discussion.** </font>
We are assuming here that the probabilities are "unconditional," that is, the probability of seeing a meteoroid in the second hour does not depend on whether or not you saw one in the first hour (the two events are independent). Not all probabilities are unconditional. **Create a markdown cell below this one and describe an instance of conditional probability, and explain why the probability changes depending on the previous outcome.**
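
When two events do not influence each other, the chance that both happen is the product of their individual chances. This can be checked by simulation. The sketch below assumes a pair of fair dice (the example from the Learning Goals table) rather than the meteoroid setting, and the seed is an assumption added for repeatability.

```python
import numpy as np

np.random.seed(1)                          # assumed seed, only to make the sketch repeatable

die = np.arange(1, 7)
num_rolls = 36000
first_die = np.random.choice(die, num_rolls)
second_die = np.random.choice(die, num_rolls)   # drawn separately, so it ignores first_die

# `&` combines two Boolean arrays element by element (True only where both are True).
both_six = (first_die == 6) & (second_die == 6)
simulated = np.count_nonzero(both_six) / num_rolls

exact = (1 / 6) * (1 / 6)                  # multiplication rule for independent events
print(simulated, exact)
```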

ANSWER HERE:


In [None]:
# Be sure to save your notebook before running this check.
check('tests/q6_open_ended.py')

### <font color=blue> **Question 7.** </font> 
A club on campus is holding a contest.  There is a bag with two red marbles, two green marbles, and two blue marbles. You have to draw three marbles separately. In order to win, all three of these marbles must be of different colors.  What is the probability of you winning the contest?

In [None]:
winning_prob = ...
winning_prob

In [None]:
check('tests/q7.py')

## 4. Application:  Groundhog's Day

Researchers seek to understand whether groundhogs are able to predict the onset of spring any better than chance. In this particular study, researchers looked at 33 groundhogs across North America (USA and Canada) and gathered data for several years. The data include whether the groundhog saw its shadow and whether the onset of spring was late or early, which are the two columns we are going to focus on. In the cell below, we read the dataset into a table called `groundhogdata`.

In [None]:
groundhogdata = Table.read_table('GroundHogData/summarizedGroundhogData_20210326.csv')
groundhogdata

Each groundhog either sees its shadow or does not. If the groundhog sees its shadow, it is predicting that spring will come late. If the groundhog doesn't see its shadow, it is predicting that spring will come early. We are trying to see whether the groundhogs have special perception and can predict the onset of spring better than a coin toss (50/50).

### <font color=blue> **Question 8.** </font>
Create a table that contains all the rows where the Groundhogs correctly predicted the onset of Spring. Since there are two ways the groundhogs could be correct, you may want to make multiple tables and then combine them. 

*Hint:* You can combine tables with the same columns using the append() method. [See the documentation.](https://www.data8.org/datascience/_autosummary/datascience.tables.Table.append.html#datascience.tables.Table.append)

In [None]:
yes_late = groundhogdata.where('shadowPres',  ...).where(..., ...)
no_early = ...

correct_tbl = ...
correct_tbl

In [None]:
check('tests/q8.py')

### <font color=blue> **Question 9.** </font>
Calculate the percentage of the time that the groundhogs correctly identified the onset of spring. *Hint:* It should be the number of correct predictions divided by the total number of predictions. Try the table attribute `num_rows`.

In [None]:
percent_correct = .../...*100

In [None]:
check('tests/q9.py')

Now that we know how often groundhogs across North America correctly identify the onset of spring, we want to simulate groundhog data to see if the groundhogs do better than random.  

### <font color=blue> **Question 10.** </font>
Let's set up the needed options and iterations.  Set the possible options for whether the groundhog sees shadows and the spring onset.  Set num_observations to the total number of groundhog observations from the dataset.  

In [None]:
shadow_options = np.array(['yes', 'no'])
spring_options = ...
num_observations = ...

In [None]:
check('tests/q10.py')

### <font color=blue> **Question 11.** </font>
Now let's set up the simulation. For each of the observations, randomly choose whether the groundhog sees their shadow and then randomly choose when spring starts.  Depending on the random choices, we can use conditional statements to either record the choice as right or wrong by adding one to `right` or `wrong` variables (increment).  Use concepts in Questions 2 and 3.

In [None]:
right = 0 # Initialize at zero 
wrong = ...
# Iteration
for obs in np.arange(...):
    shadow = np.random.choice(shadow_options)
    spring = ...
    # Now decide if the simulated groundhog is right or wrong
    if shadow == 'yes' and spring == 'late':
        ...
    elif  ...          and   ...           :
        ...
    else:
        ...
    
# Calculate the fraction of simulated correct answers
simulated_frac_correct = ...
simulated_frac_correct

In [None]:
check('tests/q11.py')

### <font color=blue> **Question 12.** </font> 
In the markdown cell below, compare the results of your simulation in question 11 and the actual number of correctly identified spring onsets by the groundhogs in the study. <font color='green'>Does the Groundhog have true insights into the upcoming weather and timing of Spring or would a coin toss at the 50 yard line of the field at the Linc do just as well at predicting Spring's onset?

ANSWER HERE

In [None]:
# Be sure to save your notebook before running this check.
check('tests/q12_open_ended.py')

### <font color=blue> **Question 13.** </font> 
Now we can test whether the groundhogs' prediction accuracy is statistically significant by calculating a "p-value," where "p" stands for probability. The p-value is the probability of getting a result at least as good as the one observed if the groundhogs were simply guessing at random.

**Null Hypothesis:** Groundhog predictions are no better than random guessing.<br>
**Alternative Hypothesis:** Groundhog predictions are better than random.

We can only reject the null hypothesis and put faith in groundhog predictions if the p-value is small. How small? This depends on how confident you need to be to consider a result "statistically significant." Typically, if there is a less than 5% chance of getting the result randomly (p-value < 0.05), we say the result merits further study, and if the chance is less than 1% (p-value < 0.01), we may be on to something real.

To calculate the chance of getting the result randomly, we have to repeat the simulation many times. A single simulation could happen to generate a surprisingly good set of observations, but many repeats give us a sense of how often random guessing does that well. To make the function more general, we replace `num_observations` with `observations` so we can look at subsets of groundhogs and compare their accuracies given fewer observations.

In [None]:
def sim_ground(repeats,observations):
    correct_obs = []
    for i in np.arange(repeats):
        right = 0 
        wrong = ...
        for obs in np.arange(observations):
            shadow = np.random.choice(shadow_options)
            spring = ...
            if shadow == 'yes' and spring == 'late':
                ...
            elif ...           and   ...           :
                ...
            else:
                ...
        simulated_frac_correct = right / observations
        correct_obs.append(simulated_frac_correct)
    return correct_obs        

A single run of random guesses gives one simulated fraction correct, shown in the figure below, and this value will vary each time we run the simulation.
Compare two simulations by running the cells below.

<font color='blue'>Simulation #1

In [None]:
plt.hist(sim_ground(1,num_observations),bins=np.arange(0.4,0.6,.01),color = "skyblue", ec="red")
plt.title('Simulated Groundhog outcomes')
plt.xlabel('simulated_frac_correct')
plt.savefig('sim_ground.png')
plt.show()

<font color='blue'>Simulation #2

In [None]:
plt.hist(sim_ground(1,num_observations),bins=np.arange(0.4,0.6,.01),color = "skyblue", ec="red")
plt.title('Simulated Groundhog outcomes')
plt.xlabel('simulated_frac_correct')
plt.savefig('sim_ground.png')
plt.show()

 <font color='blue'>Use this cell to compare simulation #1 with #2.</font>

##### Now we will simulate 1000 repeats and compare them to our Groundhogs' observation to ultimately get the p-value

In [None]:
num_simulations = 1000
plt.hist(sim_ground(num_simulations, num_observations),bins=np.arange(0.4,0.6,.01),color = "skyblue", ec="red")
plt.title('Simulated Groundhog outcomes')
plt.xlabel('simulated_frac_correct')
plt.savefig('sim_ground.png')
plt.show()

In [None]:
hogsimdata = Table().with_columns("Correct",sim_ground(num_simulations, num_observations))
hogsimdata

In [None]:
percent_correct = ... # answer from Question 9.

In [None]:
hogsimdata.hist("Correct", bins=np.arange(0.4,0.6,.01))
plt.scatter(percent_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()

### p value
The p-value is the proportion of the randomly generated results that are at least as good as the observation we are examining. In this case, the observation we are testing is the collected groundhog prediction accuracy, `percent_correct`.
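
The calculation can be sketched on its own with made-up numbers. Everything below is hypothetical: the observed accuracy of 0.56, the 100 guesses per repeat, and the seed are assumptions for illustration and are not taken from the groundhog data. The sketch also uses NumPy's `default_rng` generator instead of the `sim_ground` function, so that it runs without any of the lab's earlier cells.

```python
import numpy as np

rng = np.random.default_rng(42)    # assumed seed, only to make the sketch repeatable

observed_fraction = 0.56           # hypothetical observed accuracy
guesses_per_repeat = 100           # hypothetical number of predictions per repeat
repeats = 1000

# Each simulated guess is right (1) or wrong (0) with equal chance.
outcomes = rng.integers(0, 2, size=(repeats, guesses_per_repeat))
simulated_fractions = outcomes.mean(axis=1)    # fraction correct in each repeat

# Empirical p-value: share of repeats at least as good as the observation.
p_value = np.count_nonzero(simulated_fractions >= observed_fraction) / repeats
print(p_value)
```

A value above 0.05 here would mean that random guessing matches or beats the hypothetical observation too often to rule out chance.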

In [None]:
p_All = np.count_nonzero(hogsimdata.column('Correct') >= percent_correct/100) / num_simulations
p_All

Random guessing does at least as well as the groundhogs' collective predictions `p_All`*100% of the time, which is larger than 5%, so we cannot reject the null hypothesis that the groundhogs' predictions are no better than chance.

#### Essex Ed (EE)
Essex Ed seems to be a far better prognosticator than our group of Groundhogs. Let's see if Essex Ed's predictions are better than chance.

In [None]:
EE_data = groundhogdata.where('hogID', "...")
EE_data

In [None]:
EE_correct = correct_tbl.where('hogID', "...")
EE_correct

### Need a new simulation Table because Essex Ed only has 9 observations

In [None]:
EE_observations = groundhogdata.where('hogID', "EE").num_rows
EE_observations

In [None]:
hogsimdata = Table().with_columns("Correct", sim_ground(num_simulations, EE_observations))
hogsimdata

In [None]:
# Refer to the method used in Question 9
# You need to use the number of rows in EE_correct and EE_data
percent_correct = ... 
percent_correct

In [None]:
hogsimdata.hist("Correct", bins=20)
plt.scatter(percent_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()

### Now compute Essex Ed's p-value, see `p_All` above

In [None]:
p_EE = ...
p_EE

### <font color=blue>Is Essex Ed a better predictor than chance? Why or why not given the above p-value? Answer in the below markdown cell.

ANSWER HERE

In [None]:
# Be sure to save your notebook before running this check.
check('tests/q13a_open_ended.py')

#### Punxsutawney Phil (PYPL)
Let's see how our local hero, Punxsutawney Phil, does at predicting relative to chance. 

In [None]:
groundhogdata.where('hogID',"...")

In [None]:
correct_tbl.where('hogID',"...")

In [None]:
percent_correct = ... # Refer to the method used in Question 9 with .num_rows of each Table directly above
percent_correct

### Need a new simulation Table because Punxsutawney Phil only has 107 observations

In [None]:
PYPL_observations = ...
PYPL_observations

In [None]:
hogsimdata = ...
hogsimdata

In [None]:
hogsimdata.hist("Correct",bins=20)
plt.scatter(percent_correct/100, 0, color='red', s=200);
plt.title('Simulated Groundhog outcomes')
plt.savefig('sim_ground_correct.png')
plt.show()

### Now compute Punxsutawney Phil's p-value, see `p_EE` above

In [None]:
p_PYPL = ...
p_PYPL

### <font color=blue>Is Punxsutawney Phil a better predictor than chance? Why or why not given the above p-value? Answer in the below markdown cell.

ANSWER HERE

In [None]:
# Be sure to save your notebook before running this check.
check('tests/q13b_open_ended.py')

### <font color=blue> **Question 14.** </font>

At the end of each lab, please include a reflection. 
* How did this lab go? 
* What aspects of this introduction to probability do you find confusing?
* Were there questions you found especially challenging you would like your instructor to review in class? 
* How long did the lab take you to complete?

Share your feedback so we can continue to improve this class!

**Insert a markdown cell below this one and write your reflection on this lab.**

In [None]:
# BE SURE TO SAVE YOUR NOTEBOOK BEFORE RUNNING THIS TEST
check("tests/q14_open_ended.py")

In [None]:
import glob
from gofer.ok import check
correct = 0
questions = ["2","3", "4", "5", "6","7","8","9","10","11",
             "12_open_ended", "13a_open_ended","13b_open_ended","14_open_ended"]
for x in questions:
    print("Testing question {}: ".format(x))
    score = check("tests/q{}.py".format(x))  # run each test once and reuse the result
    display(score)
    if score.grade == 1.0:
        correct += 1

In [None]:
perc_correct = correct/len(questions)*100
if perc_correct < 80:
    msg = 'look over your work again, seek help, some errors!!!'
else:
    msg = 'nice work!'
print(f"----\n{name} {msg}\n----\nusername: {user}")
import time
localtime = time.asctime(time.localtime(time.time()))
print("Submitted @ ", localtime)
print(f'Score: {perc_correct:.1f}%')