In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 05: Probability, Simulation, Estimation, and Assessing Models

**Reading**: 
* [Randomness](https://inferentialthinking.com/chapters/09/Randomness.html) 
* [Sampling and Empirical Distributions](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
* [Testing Hypotheses](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure not to re-assign variables! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that your answer is formatted correctly and/or that it *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Probability


We will be testing some probability concepts that were introduced in class. For all of the following problems, we will introduce a problem statement and give you a proposed answer. You must assign the provided variable to one of the following three integers, depending on whether the proposed answer is too large, too small, or correct. 

1. Assign the variable to 1 if you believe our proposed answer is too large.
2. Assign the variable to 2 if you believe our proposed answer is too small.
3. Assign the variable to 3 if you believe our proposed answer is correct.


You are welcome to create more cells throughout this notebook to test out calculations.
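As an illustration of the kind of scratch calculation you might run in an extra cell, the sketch below checks two probabilities for a hypothetical experiment of three fair coin flips (not one of the graded questions). It uses Python's standard `fractions` module so the results stay exact; the variable names are made up for this example.

```python
from fractions import Fraction

# Hypothetical experiment: flip a fair coin 3 times.
p_heads = Fraction(1, 2)

# Independent events multiply: chance that all 3 flips are heads.
all_heads = p_heads ** 3

# Complement rule: chance that at least one flip is tails.
at_least_one_tail = 1 - all_heads

print(all_heads)          # 1/8
print(at_least_one_tail)  # 7/8
```

Comparing an exact value like this against a proposed answer is one way to decide whether the proposal is too large, too small, or correct.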

### Question 1.1.

You roll a 6-sided die 10 times. What is the chance of getting 10 sixes?

Our proposed answer: $$\left(\frac{1}{6}\right)^{10}$$

Assign `ten_sixes` to either 1, 2, or 3 depending on if you think our answer is too large, too small, or correct. 

In [None]:
ten_sixes = ...
ten_sixes

In [None]:
grader.check("q1_1")

### Question 1.2.

Take the same problem set-up as before, rolling a fair die 10 times. What is the chance that every roll is less than or equal to 5?

Our proposed answer: $$1 - \left(\frac{1}{6}\right)^{10}$$

Assign `five_or_less` to either 1, 2, or 3. 

In [None]:
five_or_less = ...
five_or_less

In [None]:
grader.check("q1_2")

### Question 1.3.

Assume we are picking a lottery ticket. We must choose three distinct numbers from 1 to 1000 and write them on a ticket. Next, someone draws three numbers one by one from a bowl containing the numbers 1 to 1000, without putting any drawn number back in. We win if our numbers are all called in order. 

If we decide to play the game and pick our numbers as 12, 140, and 890, what is the chance that we win? 

Our proposed answer: $$\left(\frac{3}{1000}\right)^3$$

Assign `lottery` to either 1, 2, or 3. 

In [None]:
lottery = ...

In [None]:
grader.check("q1_3")

### Question 1.4.

Assume we have two lists, list A and list B. List A contains the numbers [20,10,30], while list B contains the numbers [10,30,20,40,30]. We choose one number from list A randomly and one number from list B randomly. What is the chance that the number we drew from list A is larger than or equal to the number we drew from list B?

Our proposed answer: $$1/5$$

Assign `list_chances` to either 1, 2, or 3. 

*Hint: Consider the different possible ways that the items in List A can be greater than or equal to items in List B. Try working out your thoughts with pencil and paper. What do you think the correct answer will be close to?*
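The pencil-and-paper approach in the hint amounts to listing every equally likely pair and counting the favorable ones. The sketch below does that for two hypothetical lists, `list_x` and `list_y`, which are deliberately different from the lists in the question; it assumes each draw is uniform and the two draws are independent.

```python
from fractions import Fraction
from itertools import product

# Hypothetical lists, deliberately different from the ones in the question.
list_x = [1, 2]
list_y = [1, 3, 2]

# Every (x, y) pair is equally likely when each draw is uniform and independent.
pairs = list(product(list_x, list_y))
favorable = [(x, y) for x, y in pairs if x >= y]

chance = Fraction(len(favorable), len(pairs))
print(len(pairs), len(favorable), chance)  # 6 3 1/2
```

Note that repeated values in a list count once per occurrence, since each position in the list is its own equally likely outcome.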

In [None]:
list_chances = ...

In [None]:
grader.check("q1_4")

## 2. Monkeys Typing Shakespeare

Suppose a monkey is banging repeatedly on the keys of a simple typewriter. Each time, the monkey is equally likely to hit any of the 26 lowercase letters of the English alphabet, 26 uppercase letters of the English alphabet, and any digit 0-9, regardless of what it has hit before. There are no other keys on the typewriter.

This question is inspired by a mathematical theorem called the Infinite monkey theorem (<https://en.wikipedia.org/wiki/Infinite_monkey_theorem>), which states that if you put a monkey in the situation described above for an infinite time, it will almost surely type out all of Shakespeare’s works eventually.

### Question 2.1.

Suppose the monkey hits the keyboard 6 times. Determine the *theoretical* probability (meaning, don't use simulation to estimate) that the monkey types the sequence of characters `ma4110`. Call this `ma4110_chance`. Assign this probability as a *proportion* between 0 and 1, formatted as either an exact fraction or an equivalent arithmetic expression. For example, your assignment statement could be formatted as either `ma4110_chance = 1/10000` or `ma4110_chance = (1/100) ** 2` and still be graded correctly.

In [None]:
ma4110_chance = ...
ma4110_chance

In [None]:
grader.check("q2_1")

### Question 2.2.

Write a function called `simulate_key_strike`.  It should take **no arguments**, and it should return a random one-character string that is equally likely to be any of the 26 lower-case English letters, 26 upper-case English letters, or any digit 0-9 (inclusive). The provided code below will create a list called `keys` that contains all the lower-case English letters, upper-case English letters, and the digits 0-9 (inclusive). Lists and arrays are similar, and can often (but not always) be used interchangeably with many functions.

In [None]:
# Provided code, do not change
import string
keys = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)

def simulate_key_strike():
    """Simulates one random key strike."""
    # Your code goes below this line
    ...

# An example call to your function:
simulate_key_strike()

In [None]:
grader.check("q2_2")

### Question 2.3.

Write a function called `simulate_several_key_strikes`.  It should take one argument: an integer specifying the number of key strikes to simulate. It should return a string containing that many characters, with each character obtained from a single key strike simulation of the monkey.

**Hint:** If you make an array (or list) of the simulated key strikes called `key_strikes_array`, you can convert that to a string by calling 

```python
"".join(key_strikes_array)
```
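As a quick illustration of the hint, the sketch below joins a hypothetical list of one-character strings. The empty string `""` is the separator placed between the pieces, so the characters end up directly next to each other.

```python
# Hypothetical list of single-character strings.
key_strikes_array = ["d", "a", "t", "a", "8"]

typed = "".join(key_strikes_array)
print(typed)       # data8
print(len(typed))  # 5
```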

In [None]:
def simulate_several_key_strikes(num_strikes):
    ...

# An example call to your function:
simulate_several_key_strikes(11)

In [None]:
grader.check("q2_3")

### Question 2.4.

Write code that simulates 5000 trials in which a monkey hits 6 keys, and compute the *proportion* of trials in which the monkey types `"ma4110"`. Call that proportion `ma4110_proportion`.

**Hint:** Keep in mind that you've already calculated the theoretical probability for this event to occur earlier in this assignment. Use the theoretical value to determine if your empirical estimate from simulation is reasonable.
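To show what comparing an empirical estimate against a theoretical value can look like, here is a minimal sketch for a different, hypothetical event: rolling a total of 12 with two fair dice. It uses only Python's standard `random` module, and `roll_two_dice` is a helper invented for this example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def roll_two_dice():
    """Return the sum of two fair six-sided dice."""
    return random.randint(1, 6) + random.randint(1, 6)

trials = 5000
hits = sum(1 for _ in range(trials) if roll_two_dice() == 12)

empirical = hits / trials      # proportion of trials where the event happened
theoretical = (1 / 6) ** 2     # both dice must show a six

print(empirical, theoretical)
```

If the two numbers are wildly different, that usually points to a bug in the simulation or in the theoretical calculation.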

In [None]:
...

ma4110_proportion

In [None]:
grader.check("q2_4")

### Question 2.5.

Run your simulation to compute `ma4110_proportion` a few times. Does your simulation ever exactly match the theoretical value you computed earlier in the assignment? Why might this be the case? Think about the theoretical probability of this event occurring, the number of simulations you ran, and the possible values the simulation could produce when writing your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2.6.

Determine the theoretical chance that a monkey types the letter `"m"` at least once in 6 key strikes.  Assign the value to the variable `m_chance`. Provide your answer as an expression Python can evaluate to calculate the final probability. (For example, you should put `(1/6)**3` instead of `0.00462962962963`.) 

In [None]:
m_chance = ...
m_chance

In [None]:
grader.check("q2_6")

### Question 2.7.

Without actually running a computer simulation, do you think that a computer simulation with 5000 trials would be a more or less effective way to estimate `m_chance` than to estimate `ma4110_chance`? Why or why not? Be sure to include the specific criteria you considered when making your decision.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 3. Sampling Basketball Players


This exercise uses salary data and game statistics for basketball players from the 2019-2020 NBA season. The data was collected from [Basketball-Reference](http://www.basketball-reference.com).

Run the next cell to load the player dataset:

In [None]:
player_data = Table.read_table('player_data.csv')
player_data

This `player_data` Table contains the following measurements for each player:

* `3P`: The average number of 3-point shots made per game (not points)
* `2P`: The average number of 2-point shots made per game (not points)
* `PTS`: The average total number of points scored per game, made up of 3-point shots, 2-point shots, and free throws (1 point each).

Run the next cell to load the salary data set.

In [None]:
salary_data = Table.read_table('salary_data.csv').set_format('Salary', CurrencyFormatter)
salary_data

The `salary_data` Table contains the salary for this season for each player, measured in dollars.

### Question 3.1.

We would like to relate players' game statistics to their salaries.  Compute a table called `full_data` that includes one row for each player who is listed in both `player_data` and `salary_data`.  It should include all the columns from `player_data` and `salary_data`, except the `"Name"` column.

**Hint:** A `.join` operation would be helpful here to combine your tables!

In [None]:
full_data = ...
full_data

In [None]:
grader.check("q3_1")

### Hiring criteria

Basketball team managers would like to hire players who perform well but don't command high salaries. 

Suppose that the manager of a team decides they need a player who is good at scoring points from 3-point shots and free throws. They want to find a player who makes these types of shots well relative to the amount of money they are paid. From this perspective, a very crude measure of a player's *value* to their team is the average number of points per game the player scores from 3-point shots and free throws for every **\$100,000 of salary** (*Remember*: the `Salary` column is measured in dollars, not hundreds of thousands of dollars). 

For example, look at the player Al Horford:

In [None]:
full_data.where('Player', are.equal_to('Al Horford'))

Al Horford scored an average of 12 total points per game and made an average of 3.4 2-point shots per game. That implies that Al scores $12 - 2(3.4) = 5.2$ points per game from other types of shots: 3-point shots and free throws. Since Al has a salary of **\$28 million** (which is equivalent to 280 hundred-thousands of dollars), his value is $\frac{5.2}{280} \approx 0.01857$. 

In general, the formula used to make the value calculation for this particular manager is:

$$\frac{\text{PTS} - 2 \times \text{2P}}{\text{Salary}\ / \ 100000}$$
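The formula can be checked against the Al Horford example with a few lines of plain Python. This is only a sketch for a single player using the figures quoted above; `player_value` is a hypothetical helper name, not part of the `datascience` library.

```python
def player_value(pts, two_p, salary):
    """Points per game not from 2-point shots, per $100,000 of salary."""
    return (pts - 2 * two_p) / (salary / 100000)

# Al Horford's figures as quoted above: 12 PTS, 3.4 2P, $28 million salary.
horford_value = player_value(12, 3.4, 28_000_000)
print(round(horford_value, 5))  # 0.01857
```

The same arithmetic works element-wise when the inputs are whole columns of a table rather than single numbers.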

### Question 3.2.

Create a table called `full_data_with_value` that's a copy of `full_data`, with an extra column called `Value` containing each player's value (according to this manager's crude measure).  Then make a histogram of players' values.  Use the specified bins, as they've been pre-selected to make the histogram informative. Don't forget to include your units in the histogram! Remember that `hist()` takes in an optional third argument, `unit`, that allows you to specify the units of your data. Refer to the Python reference sheet to look at `tbl.hist(...)` if necessary.

*Just so you know:* Informative histograms contain a majority of the data and **exclude outliers**. The provided bins will intentionally exclude some data points that are considered outliers.

<!-- BEGIN QUESTION -->



In [None]:
my_bins = np.arange(0, 0.7, .1) # Use these provided bins when you make your histogram
full_data_with_value = ...
...

<!-- END QUESTION -->

### Incomplete data

Now suppose we **weren't** able to find out every player's salary (perhaps it was too costly to interview each player).  Instead, we have gathered a *simple random sample* of 50 players' salaries.  The cell below will load a pre-made sample of 50 players to the table `sample_salary_data`.

In [None]:
sample_salary_data = Table.read_table("sample_salary_data.csv")
sample_salary_data

### Question 3.3.

Make a histogram of the values of the players in `sample_salary_data`, using the same method for measuring value we used in the previous question. Make sure to specify the same bins and units that were used previously.

**Hint:** This will take several steps and perhaps several intermediate tables. Don't feel like you need to do this in a single line of code.

<!-- BEGIN QUESTION -->



In [None]:
sample_data = player_data.join('Player', sample_salary_data, 'Name')
sample_data_with_value = ...
...

<!-- END QUESTION -->

Now let us summarize what we have seen. 

### Question 3.4.

How well does the distribution of the simple random sample of 50 players compare to the distribution of the population of all the players? Are there any ranges of values that the simple random sample seems to do better at representing the population? Why do you think these ranges do a better job representing the population than others? Cite specifics from the sample, the population, and the histograms of values to support your explanation.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 4. Earthquakes


The next cell loads a table containing information about **every earthquake with a magnitude above 5** in 2019 (smaller earthquakes are generally not felt, only recorded by very sensitive equipment), compiled by the US Geological Survey. (source: https://earthquake.usgs.gov/earthquakes/search/)

In [None]:
earthquakes = Table.read_table('earthquakes_2019.csv').select(['time', 'mag', 'place'])
earthquakes

The `earthquakes` Table contains the following measurements for each human-detectable earthquake:
* `time`: A date and timestamp for the earthquake. It contains the year, month, day, hour, minute, second, and fraction of a second that the earthquake occurred.
* `mag`: The magnitude of the earthquake. Remember, this data set only contains the earthquakes with a magnitude of 5.0 or higher.
* `place`: The approximate location where the earthquake took place.

**Notice** that this table is sorted by the `time` column such that the most recent earthquakes are at the top of the table.

### Subsampling

If we were studying all human-detectable 2019 earthquakes and had access to the above data, we’d be in good shape. However, if the data had been collected by a private business instead of a government agency like the USGS, we might not be able to obtain the full data; perhaps it would be too costly. We could still learn *something* about earthquakes from a smaller subsample. For example, if we gathered our sample correctly, we could use that subsample to get an idea about the distribution of magnitudes of earthquakes (above 5, of course) throughout the year! 

We'll create two samples from the `earthquakes` table using different sampling methods. Analyze the code for each sample to examine how the methods are similar and how they differ. You'll be asked to compare the results of each method, and understanding the differences between the sampling methods will help you explain any differences in the average value of `mag` that is computed.

### Sample #1

In [None]:
# First sample method
sample1 = earthquakes.sort('mag', descending = True).take(np.arange(100))

# Calculate the mean of the first sample
sample1_magnitude_mean = np.mean(sample1.column('mag'))

sample1_magnitude_mean

### Sample #2

In [None]:
# Second sample method
sample2 = earthquakes.take(np.arange(100))

# Calculate the mean of the second sample
sample2_magnitude_mean = np.mean(sample2.column('mag'))

sample2_magnitude_mean

### Question 4.1.

Neither of these samples accurately represents the population from which it was drawn. Explain why each of the two samples would create a biased average of `mag`. Make sure you address both samples in your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 4.2.

Write code to produce a sample of size 200 that **is** representative of the population and assign it to a Table named `representative_sample`. Then, take the mean of the magnitudes of the earthquakes in this sample and assign it to `representative_mean`. 

**Hint:** In class, you've learned what type of sample should be used to best represent a population.
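For intuition about what a simple random sample does, the sketch below draws one from a hypothetical, made-up population using only Python's standard `random` module. It is a stand-in for the idea, not the `datascience` Table code the question asks for, and every name and value in it is invented for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up measurements, stored in sorted order.
population = sorted(random.uniform(5.0, 8.0) for _ in range(1000))

# A simple random sample draws without replacement, and every subset of the
# chosen size is equally likely, regardless of how the population is ordered.
srs = random.sample(population, 200)

population_mean = sum(population) / len(population)
srs_mean = sum(srs) / len(srs)
print(population_mean, srs_mean)
```

Because the draw ignores the ordering of the population, the sample mean tends to land near the population mean.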

In [None]:
representative_sample = ...
representative_mean = ...
representative_mean

In [None]:
grader.check("q4_2")

### Question 4.3.

Suppose we want to figure out what the biggest magnitude earthquake was in 2019, but we only have our representative sample of 200. Let’s see if estimating the biggest magnitude in the population using a random sample of size 200 is a reasonable idea.

In the cell below write code that uses simulation to create 5,000 random samples of size 200 from the `earthquakes` Table. For each sample determine the maximum value of `mag` in the sample. The provided code will create an empty array named `maximums` and start the simulation loop for you. Complete the loop code so that the maximum values from each of the 5,000 samples are stored to `maximums`.

In [None]:
maximums = ...

for i in np.arange(5000): 
    ...

In [None]:
grader.check("q4_3")

### Visualize your distribution

Run the cell below to create a histogram of the 5,000 maximums you simulated to view their distribution. You'll need this to help answer Question 4.4.

In [None]:
# Histogram of your maximums
Table().with_column('Largest magnitude in sample', maximums).hist('Largest magnitude in sample', bins=np.arange(6, 8.5, 0.25)) 

### Question 4.4.

The actual maximum magnitude observed in the year the data was collected was 8.0.

Using the results of your simulation, explain how often a sample of size 200 would correctly estimate the maximum. Use specific values in the histogram above to help write a detailed answer. When the sample fails to correctly estimate the maximum magnitude, are there any patterns in the incorrect estimates?

## 5. Assessing Jade's Models
#### Games with Jade

Our friend Jade comes over and asks us to play a game with her. The game works like this: 

> We will draw randomly with replacement from a simplified 13 card deck with 4 face cards (A, J, Q, K), and 9 numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10). If we draw cards with replacement 13 times, and if the number of face cards is greater than or equal to 4, we lose.
> 
> Otherwise, we win.

We play the game once and we lose, observing 8 total face cards. While we know that 8 face cards is theoretically possible, it seems so unlikely that we instead think Jade is cheating! Jade is adamant, however, that the deck is fair and that we just got unlucky.

#### Jade's model of the game
Jade claims that there is an equal chance of getting any of the cards (A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K).

#### Our alternative model of the game
We believe that the deck is rigged, with face cards (A, J, Q, K) being more likely than the numbered cards (2, 3, 4, 5, 6, 7, 8, 9, 10).

### Question 5.1.

We will simulate the game assuming Jade's model is correct, and that all cards have an equal chance of being drawn. Assign `deck_model_probabilities` to a two-item array containing the probability of drawing a face card as the first element, and the chance of drawing a numbered card as the second element under these assumptions.

Since we're working with probabilities, make sure your values are between 0 and 1. Probabilities should be exact representations of the values (1/3 not 0.333).

In [None]:
deck_model_probabilities = ...
deck_model_probabilities

In [None]:
grader.check("q5_1")

### Question 5.2.

We believe Jade's model (every card is equally likely to be drawn) is incorrect. In particular, we believe the deck is rigged such that there is a larger chance of getting a face card.  

List a few reasonable choices for a statistic that could be used to test our hypothesis. There are at least six correct answers; see how many you can come up with! You should come up with at least two statistics in addition to the statistic that will be used in the following question: number of face cards dealt in a single game.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 5.3.

Define the function `deck_simulation_and_statistic`, which given as inputs:

* `sample_size`: an integer that represents the number of cards to be dealt (with replacement)
* `model_proportions`: an array that contains the model proportions (like the one you created earlier in this question),

will return **the number of face cards** (our statistic for testing our hypothesis) in one simulation of drawing cards under the model specified in `model_proportions`. 

The included final line of code in the cell below will call your function to simulate drawing 13 cards from the deck using the assumptions of Jade's model which you assigned to the array `deck_model_probabilities` earlier. It will return the number of face cards those simulated 13 drawn cards contained.

**Hint:** Think about how you can use the function `sample_proportions` contained in the `datascience` library. 
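To build intuition for what a call like `sample_proportions(sample_size, model_proportions)` hands back, here is a rough stand-in written with only Python's standard library. It is not the `datascience` implementation; `sample_proportions_sketch` and the two-category model are hypothetical names and values chosen for illustration.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

def sample_proportions_sketch(sample_size, probabilities):
    """Draw sample_size category labels at random according to probabilities and
    return the proportion of draws that landed in each category."""
    categories = range(len(probabilities))
    draws = random.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(c) / sample_size for c in categories]

# Hypothetical model: one category with chance 1/4 and another with chance 3/4.
proportions = sample_proportions_sketch(20, [1/4, 3/4])
print(proportions)

# Multiplying a proportion by the sample size recovers a count.
first_category_count = round(proportions[0] * 20)
print(first_category_count)
```

The key point is that the result is a set of proportions that sum to 1, so turning one of them into a count requires multiplying by the sample size.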

In [None]:
def deck_simulation_and_statistic(sample_size, model_proportions):
    ...

deck_simulation_and_statistic(13, deck_model_probabilities)

In [None]:
grader.check("q5_3")

### Question 5.4.

Use the `deck_simulation_and_statistic` function from the previous question to run 5,000 simulations in which 13 cards are drawn under the proportions that you specified in `deck_model_probabilities`. Store all of your statistics in an array named `deck_statistics`. 

In [None]:
repetitions = 5000 
...

deck_statistics

In [None]:
grader.check("q5_4")

### Visualize the distribution

Let’s take a look at the distribution of the simulated statistics. The code cell below will create a histogram of your results. Each bin has a width of 1 unit.

In [None]:
# Draw a distribution of statistics 
Table().with_column('Deck Statistics', deck_statistics).hist(bins=np.arange(-0.5, 14.5, 1))  # bins centered on 0 through 13

### Question 5.5.

Do you believe that Jade's model is reasonable given that we observed 8 face cards drawn out of the 13 cards? Explain your answer using specific information from the distribution shown above. In particular, consider how likely such a result would be using the information in the histogram.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw05_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw05_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)