In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw06.ipynb")

# Homework 6: Probability, Simulation, Estimation, and Assessing Models

**Reading**: 
* [Randomness](https://inferentialthinking.com/chapters/09/Randomness.html) 
* [Sampling and Empirical Distributions](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
* [Testing Hypotheses](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

For all problems that require you to write out explanations and sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to re-assign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)



## 1. Monkeys Typing Shakespeare
##### (...or at least the string "CS118")

A monkey is banging repeatedly on the keys of a typewriter. Each time, the monkey is equally likely to hit any of the 26 lowercase letters of the English alphabet, the 26 uppercase letters, or the 10 digits 0 through 9, regardless of what it has hit before. There are no other keys on the keyboard, so there are 62 keys in total.

This question is inspired by a mathematical theorem called the Infinite monkey theorem (<https://en.wikipedia.org/wiki/Infinite_monkey_theorem>), which states that if you put a monkey in the situation described above for an infinite time, it will almost surely type out all of Shakespeare’s works eventually.

**Question 1.** Suppose the monkey hits the keyboard 5 times.  Compute the chance that the monkey types the sequence `CS118`.  (Call this `data_chance`.) Use algebra and type in an arithmetic expression that Python can evaluate.
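
As a hedged illustration of the multiplication rule for independent key strikes, here is a sketch for a hypothetical three-character target (not the sequence in the question). The names `num_keys`, `target`, and `sequence_chance` are made up for this example.

```python
# Hypothetical example: chance of typing one specific 3-character sequence.
num_keys = 26 + 26 + 10          # lowercase + uppercase + digits = 62 keys
target = "abc"                   # illustrative target, not the homework's

# Each strike is independent and matches its target character with chance
# 1/62, so the chance of matching every position is the product of those
# per-strike chances.
sequence_chance = (1 / num_keys) ** len(target)
sequence_chance
```

The same reasoning applies to a target of any length: one factor of `1 / num_keys` per character.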

<!--
BEGIN QUESTION
name: q2_1
manual: false
-->

In [None]:
data_chance = ...
data_chance

In [None]:
grader.check("q1_1")

**Question 2.** Write a function called `simulate_key_strike`.  It should take **no arguments**, and it should return a random one-character string that is equally likely to be any of the 26 lower-case English letters, 26 upper-case English letters, or 10 digits (0 through 9).
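
To show what "equally likely" selection from a list looks like, here is a minimal sketch using a hypothetical six-sided die and the standard library's `random.choice`. The names `faces` and `roll_die` are assumptions for illustration; in this notebook, `np.random.choice` plays the same role for arrays and lists.

```python
import random

# Hypothetical example: pick one face of a six-sided die uniformly at random.
faces = ["1", "2", "3", "4", "5", "6"]

def roll_die():
    """Return one face as a string; every face is equally likely."""
    return random.choice(faces)

roll = roll_die()
```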

<!--
BEGIN QUESTION
name: q2_2
manual: false
-->

In [None]:
# We have provided the code below to compute a list called keys,
# containing all the lower-case English letters, upper-case English
# letters, and the digits 0-9 (inclusive).  Print it if you want to
# verify what it contains.
import string
keys = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)

def simulate_key_strike():
    """Simulates one random key strike."""
    ...

# An example call to your function:
simulate_key_strike()

In [None]:
grader.check("q1_2")

**Question 3.** Write a function called `simulate_several_key_strikes`.  It should take one argument: an integer specifying the number of key strikes to simulate. It should return a string containing that many characters, each one obtained from simulating a key strike by the monkey.

*Hint:* If you make a list or array of the simulated key strikes called `key_strikes_array`, you can convert that to a string by calling `"".join(key_strikes_array)`
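
A small sketch of the `join` idiom from the hint, using a hypothetical list called `letters`:

```python
# Hypothetical list of one-character strings.
letters = ["d", "a", "t", "a"]

# The string before .join is the separator placed between consecutive items;
# an empty separator simply concatenates them.
word = "".join(letters)
```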

<!--
BEGIN QUESTION
name: q2_3
manual: false
-->

In [None]:
def simulate_several_key_strikes(num_strikes):
    ...

# An example call to your function:
simulate_several_key_strikes(11)

In [None]:
grader.check("q1_3")

**Question 4.** Call `simulate_several_key_strikes` 5000 times, each time simulating the monkey striking 5 keys.  Compute the proportion of times the monkey types `"CS118"`, calling that proportion `data_proportion`.
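
The general pattern (repeat an experiment many times, count how often an event occurs, then divide by the number of repetitions) can be sketched with a different, hypothetical experiment: rolling two dice and counting double sixes. The seed, the variable names, and the dice themselves are assumptions for illustration only.

```python
import random

rng = random.Random(0)           # fixed seed so the sketch is repeatable
num_trials = 5000
num_double_six = 0               # counter initialized before the loop

for _ in range(num_trials):
    first, second = rng.randint(1, 6), rng.randint(1, 6)
    # Only count the trial when the event of interest happened.
    if first == 6 and second == 6:
        num_double_six += 1

# Empirical proportion of trials in which the event occurred.
double_six_proportion = num_double_six / num_trials
```

Because a double six has chance 1/36, 5000 trials are enough to see it many times; events that are far rarer may never appear in a simulation of this size.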

<!--
BEGIN QUESTION
name: q2_4
manual: false
-->

In [None]:
# HINT: Use a for loop with a nested if statement to simulate

# Set the number of times to call the function
num_simulations = ...
# Initialize a count variable before the for loop
num_cs118 = ...

for i in np.arange(num_simulations):
    ...

data_proportion = ...

data_proportion


In [None]:
grader.check("q1_4")

**Question 5.** Check the value your simulation computed for `data_proportion`. Is your simulation a good way to estimate the chance that the monkey types `"CS118"` in 5 strikes (the answer to question 1)? Why or why not?  Set the variable `monkey_estimation` equal to the correct integer choice. Choose from the following:

1. No.  A simulation with 5000 iterations would not be sufficient to estimate a probability as small as roughly 1 in a billion.
2. Yes. The probability is so low, it is essentially 0, so a simulation is a fair estimate for the true probability.



In [None]:
monkey_estimation = ...

In [None]:
grader.check("q1_5")

**Question 6.** Compute the chance that the monkey types the letter `"t"` at least once in the 5 strikes.  Call it `t_chance`. Use algebra and type in an arithmetic expression that Python can evaluate.
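
"At least once" problems are usually easiest through the complement: the chance of at least one success equals 1 minus the chance of no successes. A hedged sketch with a hypothetical die example (at least one six in four rolls), using made-up variable names:

```python
# Hypothetical example: chance of at least one six in 4 rolls of a fair die.
rolls = 4

# Chance that a single roll is NOT a six is 5/6; rolls are independent,
# so the chance that none of the rolls is a six is (5/6) ** rolls.
chance_no_six = (5 / 6) ** rolls

# Complement rule: P(at least one) = 1 - P(none).
chance_at_least_one_six = 1 - chance_no_six
```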

<!--
BEGIN QUESTION
name: q2_6
manual: false
-->

In [None]:
t_chance = ...
t_chance

In [None]:
grader.check("q1_6")

**Question 7.** Do you think a computer simulation would estimate `t_chance` more effectively than it estimated `data_chance`? Why or why not? Set the variable `at_least_once` to the correct integer choice. Choose from the following:

1. No.  A simulation would not be sufficient to estimate this probability.  It is better to compute this probability as we did in the last question.
2. Yes. Since `t_chance` is close to 1/13, the event will show up in our simulation about as often as its theoretical probability predicts.


In [None]:
at_least_once = ...

In [None]:
grader.check("q1_7")

## 2. Sampling Basketball Players


This exercise uses salary data and game statistics for basketball players from the 2019-2020 NBA season. The data was collected from [Basketball-Reference](http://www.basketball-reference.com).

Run the next cell to load the two datasets.

In [None]:
player_data = Table.read_table('player_data.csv')
salary_data = Table.read_table('salary_data.csv')
player_data.show(3)
salary_data.show(3)

**Question 1.** We would like to relate players' game statistics to their salaries.  Compute a table called `full_data` that includes one row for each player who is listed in both `player_data` and `salary_data`.  It should include all the columns from `player_data` and `salary_data`, except the `"Name"` column.
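
Conceptually, joining two tables on a shared label keeps only the rows whose label appears in both. The sketch below illustrates that idea in plain Python with hypothetical miniature "tables" (lists of dictionaries); it is not the `datascience` implementation, and all names and numbers in it are invented.

```python
# Hypothetical miniature tables, stored as lists of dictionaries.
players = [
    {"Player": "A. Example", "PTS": 10.0},
    {"Player": "B. Sample", "PTS": 7.5},
    {"Player": "C. Missing", "PTS": 3.0},   # has no salary row
]
salaries = [
    {"Name": "A. Example", "Salary": 2_000_000},
    {"Name": "B. Sample", "Salary": 900_000},
    {"Name": "D. Other", "Salary": 500_000},  # has no player row
]

# Look up each salary by name, then keep only players found in both tables.
salary_by_name = {row["Name"]: row["Salary"] for row in salaries}
joined = [
    {**row, "Salary": salary_by_name[row["Player"]]}
    for row in players
    if row["Player"] in salary_by_name
]
```

Here `joined` has one row per player present in both lists and carries no separate `"Name"` key, mirroring the shape described in the question.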

<!--
BEGIN QUESTION
name: q3_1
manual: false
-->

In [None]:
full_data = ...
full_data


In [None]:
grader.check("q2_1")

Basketball team managers would like to hire players who perform well but don't command high salaries.  From this perspective, a very crude measure of a player's *value* to their team is the number of points from 3 pointers and free throws the player scored in a season for every **\$100000 of salary** (*Note*: the `Salary` column is in dollars, not hundreds of thousands of dollars). For example, Al Horford scored an average of 5.2 points from 3 pointers and free throws combined, and has a salary of **\$28 million.** This is equivalent to 280 hundreds of thousands of dollars, so his value is $\frac{5.2}{280}$. The formula is:

$$\frac{\text{"PTS"} - 2 * \text{"2P"}}{\text{"Salary"}\ / \ 100000}$$
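
The formula can be checked on a single player with ordinary arithmetic. In this sketch, `player_value` is a hypothetical helper, and the `pts` and `two_pointers` inputs are invented so that `PTS - 2 * 2P` equals the 5.2 quoted above; only the 5.2 and the \$28 million salary come from the text.

```python
def player_value(pts, two_pointers, salary):
    """Points from 3-pointers and free throws per $100,000 of salary."""
    other_points = pts - 2 * two_pointers          # remove 2-point baskets
    salary_in_hundred_thousands = salary / 100000  # dollars -> units of $100,000
    return other_points / salary_in_hundred_thousands

# Hypothetical stats chosen so that PTS - 2 * 2P = 5.2, with a $28 million salary.
example_value = player_value(pts=15.2, two_pointers=5.0, salary=28_000_000)
```

With a table, the same arithmetic would be applied to whole columns at once rather than to one player at a time.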

**Question 2.** Create a table called `full_data_with_value` that's a copy of `full_data`, with an extra column called `"Value"` containing each player's value (according to our crude measure).  Then make a histogram of players' values.  **Specify bins that make the histogram informative and don't forget your units!** Remember that `hist()` takes in an optional third argument that allows you to specify the units! Refer to the Python reference for `tbl.hist(...)` if necessary.

*Just so you know:* Informative histograms contain a majority of the data and **exclude outliers**.

<!--
BEGIN QUESTION
name: q3_2
manual: true
-->
<!-- EXPORT TO PDF -->

<!-- BEGIN QUESTION -->



In [None]:
bins = np.arange(0, 0.7, .1) # Use these provided bins when you make your histogram
full_data_with_value = ...
...

<!-- END QUESTION -->

Now suppose we weren't able to find out every player's salary (perhaps it was too costly to interview each player).  Instead, we have gathered a *simple random sample* of 50 players' salaries.  The cell below loads those data.

In [None]:
sample_salary_data = Table.read_table("sample_salary_data.csv")
sample_salary_data.show(3)

**Question 3.** Make a histogram of the values of the players in `sample_salary_data`, using the same method for measuring value we used in question 2. Make sure to specify the units again in the histogram as stated in the previous problem. **Use the same bins, too.**  

*Hint:* This will take several steps.

<!--
BEGIN QUESTION
name: q3_3
manual: true
-->
<!-- EXPORT TO PDF -->

<!-- BEGIN QUESTION -->



In [None]:
sample_data = player_data.join('Player', sample_salary_data, 'Name')
sample_data_with_value = ...
...


<!-- END QUESTION -->

Now let us summarize what we have seen.  To guide you, we have written most of the summary already.

**Question 4.** Complete the statements below by setting each relevant variable name to the value that correctly fills the blank.

* The plot in question 2 displayed an empirical distribution of the population of [`player_count_1`] players.  The areas of the bars in the plot sum to [`area_total_1`].

* The plot in question 3 displayed an empirical distribution of the sample of [`player_count_2`] players.  The areas of the bars in the plot sum to [`area_total_2`].
 

The variables `player_count_1`, `area_total_1`, `player_count_2`, and `area_total_2` should all be set to integers.

Remember that areas are represented in terms of percentages.

*Hint 1:* For a refresher on distribution types, check out [Section 10.1](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html)

*Hint 2:* The `hist()` table method ignores data points outside the range of its bins, but you may ignore this fact and calculate the areas of the bars using what you know about histograms from lecture.

*Hint 3:* Review the area principle.
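
As a refresher on the area principle, the sketch below builds a density-scale histogram by hand for a small, hypothetical list of values. Each bar's height is "percent per unit", so its area (height times width) is the percent of values in that bin. All names and numbers are invented, and every value is deliberately placed inside the bin range.

```python
# Hypothetical values, all inside the range covered by the bins.
values = [0.05, 0.12, 0.18, 0.22, 0.27, 0.31, 0.44, 0.58]
bin_edges = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

areas = []
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    count = sum(1 for v in values if left <= v < right)
    percent = 100 * count / len(values)   # percent of values in this bin
    width = right - left
    height = percent / width              # percent per unit
    areas.append(height * width)          # area = percent in this bin

# Sum of the bar areas, in percent.
total_area = sum(areas)
```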

<!--
BEGIN QUESTION
name: q3_4
-->

In [None]:
player_count_1 = ...
area_total_1 = ...

player_count_2 = ...
area_total_2 = ...


In [None]:
grader.check("q2_4")

**Question 5.** For which range of values does the plot in question 3 better depict the distribution of the **population's player values**: 0 to 0.3, or above 0.3? Explain your answer. 

<!--
BEGIN QUESTION
name: q3_5
manual: true
-->
<!-- EXPORT TO PDF -->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 3. Earthquakes


The next cell loads a table containing information about **every earthquake with a magnitude above 5** in 2019 (smaller earthquakes are generally not felt, only recorded by very sensitive equipment), compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [None]:
earthquakes = Table.read_table('earthquakes_2019.csv').select(['time', 'mag', 'place'])
earthquakes

If we were studying all human-detectable 2019 earthquakes and had access to the above data, we’d be in good shape - however, if the USGS didn’t publish the full data, we could still learn something about earthquakes from just a smaller subsample. If we gathered our sample correctly, we could use that subsample to get an idea about the distribution of magnitudes (above 5, of course) throughout the year!

In the following lines of code, we take two different samples from the earthquake table, and calculate the mean of the magnitudes of these earthquakes.

In [None]:
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))
sample1_magnitude_mean = np.mean(sample1.column('mag'))
sample2 = earthquakes.take(np.arange(100))
sample2_magnitude_mean = np.mean(sample2.column('mag'))
[sample1_magnitude_mean, sample2_magnitude_mean]

**Question 1.**  Are these samples representative of the population of earthquakes in the original table (that is, should we expect each sample mean to be close to the population mean)?

*Hint:* Consider the ordering of the `earthquakes` table. Is it a random sample? 

<!--
BEGIN QUESTION
name: q4_1
manual: true
-->
<!-- EXPORT TO PDF -->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.** Write code to produce a sample of size 200 that is representative of the population. Then, take the mean of the magnitudes of the earthquakes in this sample. Assign these to `representative_sample` and `representative_mean` respectively. 

*Hint:* In class, we learned what kind of samples should be used to properly represent the population.
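
For intuition, here is a hedged sketch of drawing a random sample from a hypothetical population stored in a plain Python list. It uses the standard library's `random.sample` (sampling without replacement) rather than the table methods in this notebook, and the population, seed, and sample size are all invented.

```python
import random

rng = random.Random(1)           # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 magnitudes between 5.0 and 8.0.
population = [round(rng.uniform(5.0, 8.0), 1) for _ in range(1000)]

# Every item has the same chance of being chosen, regardless of its
# position in the list, which is what makes the sample representative.
sample = rng.sample(population, 50)
sample_mean = sum(sample) / len(sample)
```

Contrast this with taking the first 50 items or the 50 largest items: those selections depend on how the list happens to be ordered.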


<!--
BEGIN QUESTION
name: q4_2
manual: false
-->

<!-- BEGIN QUESTION -->



In [None]:
representative_sample = ...
representative_mean = ...
representative_mean

In [None]:
grader.check("q3_2")

<!-- END QUESTION -->

**Question 3.** Suppose we want to figure out what the biggest magnitude earthquake was in 2019, but we only have our representative sample of 200. Let’s see if trying to find the biggest magnitude in the population from a random sample of 200 is a reasonable idea!

Write code that takes many random samples from the `earthquakes` table and finds the maximum of each sample. You should take a random sample of size 200 and do this 5000 times. Assign the array of maximum magnitudes you find to `maximums`.
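
The repeat-and-collect pattern can be sketched on a hypothetical population held in a plain list. The population, the sample size of 20, the 500 repetitions, and every variable name below are assumptions made for illustration.

```python
import random

rng = random.Random(2)           # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 magnitudes.
population = [round(rng.uniform(5.0, 8.0), 2) for _ in range(1000)]
population_max = max(population)

sample_maxes = []                # one maximum per simulated sample
for _ in range(500):
    sample = rng.sample(population, 20)
    sample_maxes.append(max(sample))

# Fraction of samples whose maximum equals the population maximum.
share_hitting_true_max = (
    sum(m == population_max for m in sample_maxes) / len(sample_maxes)
)
```

A sample maximum can never exceed the population maximum, so every entry of `sample_maxes` is at most `population_max`.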

<!--
BEGIN QUESTION
name: q4_3
manual: false
-->

In [None]:
maximums = ...
for i in np.arange(5000):
    ...


In [None]:
grader.check("q3_3")

In [None]:
#Histogram of your maximums
Table().with_column('Largest magnitude in sample', maximums).hist('Largest magnitude in sample') 

**Question 4.** Now find the magnitude of the actual strongest earthquake in 2019 (not the maximum of a sample). This will help us determine whether a random sample of size 200 is likely to help you determine the largest magnitude earthquake in the population.

<!--
BEGIN QUESTION
name: q4_4
manual: false
-->

In [None]:
strongest_earthquake_magnitude = ...
strongest_earthquake_magnitude


In [None]:
grader.check("q3_4")

**Question 5.** Look at the histogram of maximums from 5000 random samples of 200 earthquakes.  In that histogram, notice where the true population maximum you found in question 4 falls relative to the sample maximums. Based on that, can we accurately use a sample of size 200 to determine the true maximum? Set the variable `max_estimator` to the correct answer choice from the list below.

1. Yes. Using the area principle, we can see that about 35% of the 5000 samples of size 200 had a max magnitude between 7.5 and 8.0, which is close to the true max.
2. No.  As an estimator, the maximum of each sample depends on which observations are in that sample, so if the true maximum isn’t in the sample, the estimate will be smaller than the true maximum. This implies that this estimator will always be less than or equal to the true parameter.

In [None]:
max_estimator = ...

In [None]:
grader.check("q3_5")

## 4. Assessing Jade's Models
#### Games with Jade

Our friend Jade comes over and asks us to play a game with her. The game works like this:
> We will draw randomly with replacement from a simplified 13 card deck with 4 face cards (A, J, Q, K), and 9 numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10). If we draw cards with replacement 13 times, and if the number of face cards is greater than or equal to 4, we lose and Jade wins.  *If the number of face cards is less than 4, we win and Jade loses.*

We play the game once and we lose, observing 8 total face cards, which seems like far too many if the deck is really fair and contains only 4 face cards.  We are angry and accuse Jade of cheating: we believe the deck is rigged, with more face cards than numbered cards. Jade is adamant, however, that the deck is fair. She says there is an equal chance of getting any of the cards because there is just one of each card in the deck (4 face cards and 9 numbered cards), and that we just drew an unlucky hand.  Who is correct?


**Question 1.**
Assign `deck_model_probabilities` to a two-item array containing the chance of drawing a face card as the first element, and the chance of drawing a numbered card as the second element under the assumption that there is just 1 of each card in the deck (Jade's model). Since we're working with probabilities, make sure your values are between 0 and 1. 

<!--
BEGIN QUESTION
name: q5_1
manual: false
-->

In [None]:
deck_model_probabilities = ...
deck_model_probabilities

In [None]:
grader.check("q4_1")

**Question 2.**

We think Jade is cheating. In particular, we believe there to be a larger chance of getting a face card. Which of the following statistics can we use during our simulation to test between the model and our alternative? Assign `statistic_choice` to the correct answer. 

1. The actual number of face cards we get in 13 draws
2. The distance (absolute value) between the actual number of face cards in 13 draws and the expected number of face cards in 13 draws (4)
3. The expected number of face cards in 13 draws (4)




In [None]:
statistic_choice = ...
statistic_choice

In [None]:
grader.check("q4_2")

**Question 3.**

Define the function `deck_simulation_and_statistic`, which, given a sample size and an array of model proportions (like the one you created in Question 1), returns the number of face cards in one simulation of drawing `sample_size` cards under the model specified in `model_proportions`.

*Hint:* Think about how you can use the function `sample_proportions`. 
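
`sample_proportions` takes a sample size and an array of category probabilities, and returns the proportion of draws that landed in each category. To make that behavior concrete, here is a standard-library stand-in applied to a hypothetical coin model. The function `sample_proportions_sketch` is an illustration written for this note, not the `datascience` implementation.

```python
import random

def sample_proportions_sketch(sample_size, probabilities, rng=random):
    """Draw `sample_size` categories at random and return the proportion
    of draws that fell in each category (stdlib stand-in, for illustration)."""
    categories = range(len(probabilities))
    draws = rng.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(c) / sample_size for c in categories]

rng = random.Random(3)           # fixed seed so the sketch is repeatable

# Hypothetical model: a fair coin (heads, tails) flipped 20 times.
proportions = sample_proportions_sketch(20, [0.5, 0.5], rng)

# Multiplying a proportion by the sample size recovers a count.
num_heads = round(proportions[0] * 20)
```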

<!--
BEGIN QUESTION
name: q5_3
manual: false
-->

In [None]:
def deck_simulation_and_statistic(sample_size, model_proportions):
    ...

deck_simulation_and_statistic(13, deck_model_probabilities)

In [None]:
grader.check("q4_3")

**Question 4.**

Use your function from above to simulate the drawing of 13 cards 5000 times under the proportions that you specified in Question 1. Keep track of all of your statistics in `deck_statistics`. 

<!--
BEGIN QUESTION
name: q5_4
manual: false
-->

In [None]:
repetitions = 5000 
...

deck_statistics

In [None]:
grader.check("q4_4")

Let’s take a look at the distribution of simulated statistics.

In [None]:
#Draw a distribution of statistics 
Table().with_column('Deck Statistics', deck_statistics).hist(bins = np.arange(0,10, 1))

**Question 5.** The simulation above shows the distribution of the number of face cards we should expect to draw when we play this game if the deck is fair and Jade is not cheating.  Given your observed value of 8 face cards, do you believe that Jade is cheating or not? If the observed value is inconsistent with the simulated distribution, the data point toward the alternative (Jade is cheating); if not, the data are consistent with the null (the model being simulated). Explain your answer using the distribution drawn in the previous problem and the area principle.
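
One way to put a number on "how unusual is the observed value under the model" is to compute the share of simulated statistics that are at least as large as the observation, which corresponds to the area of the histogram's right tail. The sketch below does this for a hypothetical fair-coin model; the model, the observed value of 16, and all names are invented for illustration.

```python
import random

rng = random.Random(4)           # fixed seed so the sketch is repeatable

# Hypothetical null model: 20 flips of a fair coin, repeated 5,000 times.
# Each entry is the number of heads in one simulated set of 20 flips.
simulated_heads = [
    sum(rng.random() < 0.5 for _ in range(20)) for _ in range(5000)
]

observed_heads = 16              # hypothetical observation

# Share of simulations at least as extreme as the observation
# (the right-tail area of the empirical distribution).
tail_share = (
    sum(s >= observed_heads for s in simulated_heads) / len(simulated_heads)
)
```

A small tail share means the observation would be rare if the model were true; a large one means the observation is consistent with the model.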

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Congratulations! You're done with Homework 6!

Be sure to run the tests and verify that they all pass, then choose Download as PDF from the File menu and submit the .pdf file on Canvas.