In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

# Lab 05: Simulations

Welcome to Lab 05. 

We will go over [iteration](https://www.inferentialthinking.com/chapters/09/2/Iteration.html) and [simulations](https://www.inferentialthinking.com/chapters/09/3/Simulation.html), as well as introduce the concept of [randomness](https://www.inferentialthinking.com/chapters/09/Randomness.html).

First, set up the tests and imports by running the cell below.

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# This line suppresses scientific notation
np.set_printoptions(suppress = True)

## 1. Nachos and Conditionals

One day, when you come home after a long week, you see a hot bowl of nachos waiting on the dining table! Let's say that whenever you take a nacho from the bowl, it will either have only **cheese**, only **salsa**, **both** cheese and salsa, or **neither** cheese nor salsa (a sad tortilla chip indeed). 

Let's use Python to simulate taking nachos from the bowl at random and determining your reaction. You'll need to learn how to generate random outcomes from a list of possible outcomes, inspect each outcome to determine the correct reaction, and then return that reaction.

### `np.random.choice`

`np.random.choice` picks one item at random from the given array. It is equally likely to pick any of the items. Run the cell below several times, and observe how the results change.

In [None]:
nachos = make_array('cheese', 'salsa', 'both', 'neither')
np.random.choice(nachos)

To repeat this process multiple times, pass an integer `n` as the second argument; the function then makes `n` random choices. By default, `np.random.choice` samples **with replacement** and returns an *array* of items.

Run the next cell to see an example of sampling with replacement 10 times from the `nachos` array.

In [None]:
np.random.choice(nachos, 10)
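
Because the draws are made with replacement, the same item can come up more than once. Here is a minimal sketch of that behavior. It uses `np.array` in place of `make_array` so that it runs without the `datascience` import, and the name `draws` is made up for illustration.

```python
import numpy as np

nachos = np.array(['cheese', 'salsa', 'both', 'neither'])

# Ten draws from only four items: at least one item must repeat.
draws = np.random.choice(nachos, 10)
print(draws)

# The number of distinct values can never exceed the four available items.
print(len(np.unique(draws)))
```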

### Comparisons

In Python, the Boolean data type contains only two unique values: `True` and `False`. Expressions containing comparison operators such as `<` (less than), `>` (greater than), and `==` (equal to) evaluate to Boolean values. A list of common comparison operators can be found below!

| Comparison | Operator | True Example | False Example |
| --- | --- | --- | ---|
| Less than | `<` | `2 < 3` | `2 < 2` |
| Greater than | `>` | `3 > 2` | `3 > 3` |
| Less than or equal | `<=` | `2 <= 2` | `3 <= 2` |
| Greater than or equal | `>=` | `3 >= 3` | `2 >= 3` |
| Equal | `==` | `3 == 3` | `3 == 2` |
| Not equal | `!=` | `3 != 2` | `2 != 2` |

Run the cell below to see an example of a comparison operator in action.

In [None]:
3 > 1 + 1

You can even assign the result of a comparison operation to a variable, just like you can assign the result of any other Python expression.

In [None]:
result = 10 / 2 == 5
result
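
In both cells above, Python evaluates the arithmetic before the comparison. The sketch below spells that out; the name `is_bigger` is made up for illustration.

```python
# Arithmetic happens first, then the comparison.
is_bigger = 3 > 1 + 1          # same as 3 > (1 + 1)
print(is_bigger)               # True

# The stored value is an ordinary Boolean and can be reused later.
print(type(is_bigger))         # <class 'bool'>

# Division produces a float, but 5.0 and 5 still compare as equal.
result = 10 / 2 == 5
print(10 / 2, result)          # 5.0 True
```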

Arrays are also compatible with comparison operators. When comparing a value to an array, the output is an array of Boolean values, one for the comparison with each corresponding element of the array.

Comparisons can be made on numerical values.

In [None]:
make_array(1, 5, 7, 8, 3, -1) > 3

And they can also be made on other data types, such as strings.

In [None]:
nachos == 'salsa'

To count the number of times a certain type of nacho is randomly chosen, we can use `np.count_nonzero`.

### `np.count_nonzero`

`np.count_nonzero` counts the number of non-zero values that appear in an array. When an array of Boolean values is passed to the function, it counts the number of `True` values in the array.

**Remember:** in Python, `True` is equivalent in value to 1 and `False` is equivalent to 0.
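
Here is a small sketch of why counting non-zero values and counting `True` values come to the same thing. It uses `np.array` rather than `make_array` so that it runs on its own, and the name `flags` is made up for illustration.

```python
import numpy as np

flags = np.array([True, False, False, True, True])

# Counting the True values...
print(np.count_nonzero(flags))   # 3

# ...gives the same answer as adding them up, since True counts as 1.
print(np.sum(flags))             # 3
print(True + True)               # 2
```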

Run the next cell to see an example that uses `np.count_nonzero`.

In [None]:
np.count_nonzero(make_array(True, False, False, True, True))

### Question 1.1.

Assume we took ten nachos at random, with the results assigned to the array called `ten_nachos` in the code cell below. **Compute the number of nachos with only cheese on them** using a comparison operator (e.g. `==`, `<`, ...) and the `np.count_nonzero` function demonstrated above.

In [None]:
ten_nachos = make_array('neither', 'cheese', 'both', 'both', 'cheese', 'salsa', 'both', 'neither', 'cheese', 'cheese')
number_cheese = ...
number_cheese

In [None]:
grader.check("q1_1")

### Conditional Statements

A conditional statement is a multi-line statement that allows Python to choose from different code blocks to run based on the truth value of an expression.

Here is a function to serve as a basic example.

```python
def sign(x):
    if x > 0:
        return 'Positive'
    else:
        return 'Negative'
```

If the input `x` is greater than `0`, the `sign` function returns the string `'Positive'`. Otherwise, it returns the string `'Negative'`.

If you wanted to test multiple conditions at once, use the following general format.

```python
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>
```


Each `if` and `elif` expression is evaluated and considered in order, starting at the top. As soon as a true value is found, the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the `if` or `elif` expressions are true, then the `else` body is executed. 
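
The `sign` function above reports `'Negative'` for `0`. As a sketch of how an `elif` clause could handle that case separately, consider the function below. The name `sign_with_zero` is made up for illustration and is not part of the lab.

```python
def sign_with_zero(x):
    """Classify a number as positive, negative, or zero."""
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        # Reached only when neither test above was true, i.e. x == 0.
        return 'Zero'

print(sign_with_zero(4))    # Positive
print(sign_with_zero(-2))   # Negative
print(sign_with_zero(0))    # Zero
```

The tests are tried from top to bottom, so at most one of the three bodies runs for any given input.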

For more examples and explanation, refer to the section on conditional statements [here](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html).

<!-- BEGIN QUESTION -->

### Question 1.2.

Suppose you want to make sure that you have enough nachos with cheese on them.

Complete the following conditional statement so that `say_please` is assigned a value based on the number of nachos with cheese in `ten_nachos`. If the number of nachos with cheese is less than `5`, assign the string `'More please'` to `say_please`; otherwise, assign the string `'Perfect!'`.

**Hint:** You should use `number_cheese` from earlier in this assignment, or recompute the number of nachos with cheese in the array `ten_nachos`. Do not simply assign the correct string to `say_please` directly.


In [None]:
if ...:
    say_please = 'More please'
...
    ...
say_please

<!-- END QUESTION -->

### Question 1.3.

Write a function called `nacho_reaction` that can be provided a nacho type (as a string) and returns a reaction (as a string) based on the type of nacho passed in as an argument. Use the table below to match the nacho type to the appropriate reaction.

| Nacho Type | Reaction |
|------------|----------|
| cheese     | Cheesy!  |
| salsa      | Spicy!   |
| both       | Wow!     |
| neither    | Meh.     |


For example:

```python
nacho_reaction('cheese')
```

Would return the string `'Cheesy!'`.

**Hint:** If you're failing the test, double check the spelling of your reactions.


In [None]:
def nacho_reaction(nacho):
    if nacho == "cheese":
        return ...
    ... :
        ...
    ... :
        ...
    ... :
        ...

spicy_nacho = nacho_reaction('salsa')
spicy_nacho

In [None]:
grader.check("q1_3")

### Question 1.4.

Create a Table named `ten_nachos_reactions` that consists of the nachos in the array `ten_nachos` in a column named `Nachos` as well as the reactions for each of those nachos in a column named `Reactions`.

To get you started, the code below will create a Table named `ten_nachos_tbl` that contains the array `ten_nachos` in a column labeled `Nachos`. You can take as many steps as you need to create the final table, but as an added challenge you can attempt to do it in a single line of code.

**Hint:** Use the `apply` method on the `ten_nachos_tbl` Table to create the array of reactions. Then, construct the final Table. Here's a link to the textbook [Chapter 8, Section 1: Applying a Function to a Column](https://www.inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html) if you need a refresher.


In [None]:
ten_nachos_tbl = Table().with_column('Nachos', ten_nachos)
...
ten_nachos_reactions

In [None]:
grader.check("q1_4")

### Question 1.5.

Compute the number of 'Wow!' reactions for the nachos in the `ten_nachos_reactions` Table. You should not manually count the number of reactions, so your code must reference the Table `ten_nachos_reactions` at least once.

**Hint:** There are a few ways to do this! Try seeing if you can use a comparison statement on an array to help out.

In [None]:
number_wow_reactions = ...
number_wow_reactions

In [None]:
grader.check("q1_5")

## 2. Simulations and For Loops
In Python, a `for` statement can perform a similar task multiple times. This repetitive process is known as **iteration**. In this course, iteration is most often seen as a `for` loop used to run simulations that involve randomness. However, that's not the only use for iteration in programming. This section will show you a few different ways to use iteration.

### Iterate over a collection
In general, a `for` loop is used to perform a block of code once for each value in a collection. For instance, the code below will print out all of the colors of the rainbow.

In [None]:
rainbow = make_array("red", "orange", "yellow", "green", "blue", "indigo", "violet")

for color in rainbow:
    print(color)

We can see that the indented part of the `for` loop, known as the body, is executed once for each item in the array `rainbow`. The name `color` is assigned to the next value in `rainbow` at the start of each iteration. Note that the name `color` is arbitrary; we could easily have named it something else in the `for` loop definition. The important thing is that you stay consistent throughout the `for` loop.

In [None]:
for another_name in rainbow:
    print(another_name)

However, in general it is a best practice for variable names to be somewhat descriptive about the values they hold.
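
The body of a `for` loop can also update a name that was defined before the loop, which is how a running total is kept. Here is a minimal sketch, using a plain Python list instead of `make_array` so that it runs on its own; the name `total_letters` is made up for illustration.

```python
rainbow = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

total_letters = 0                 # start the running total at zero
for color in rainbow:
    total_letters = total_letters + len(color)   # add this color's length

print(total_letters)              # 36
```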

### Question 2.1.

In the following cell, we've loaded the text of the novel _Pride and Prejudice_ by Jane Austen, split it into individual words, and stored these words in an array `p_and_p_words`. Use a `for` loop to assign each word in `p_and_p_words` to the iterating variable (the one that you choose the name for) one at a time. In the body of the loop, check whether the current word is longer than 5 letters, and keep a running total of how many such words there are. When the loop completes, the variable `longer_than_five`, which is initially assigned the value `0`, should contain an integer that represents the number of words in _Pride and Prejudice_ that are more than 5 letters long.

**Hint:** You can find the number of letters in a word/string with the `len` function. You can reference the code cell below for an example:

In [None]:
my_word = 'hello'
len(my_word)

In [None]:
austen_string = open('Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
p_and_p_words = np.array(austen_string.split())
longer_than_five = 0

# a for loop would be useful here
longer_than_five

In [None]:
grader.check("q2_1")

You should notice it didn't take very long to go through every word in the entire novel! This is a great example of how working with computers can speed up very routine simple tasks, like counting.

### Question 2.2.

Do you think the words in _Pride and Prejudice_ tend to be the same length, or different lengths? You could try to estimate how likely words are to have different lengths through choosing two words at random from the book, and seeing if they have the same length. Doing this just one time might not be insightful, but if you were to do this many times it could be helpful.

Complete a simulation with 10,000 trials by writing a `for` loop that picks two words uniformly at random (with replacement) from _Pride and Prejudice_ and determines whether the words have different lengths. When the simulation is completed, the variable `num_different` should be equal to the number of trials in which the two selected words were different in length. Since there is randomness involved, your answer will likely be different from a classmate's, and will change each time you run the cell.

**Hint 1:** Recall the function used earlier to sample at random with replacement from an array.

**Hint 2:** Remember that `!=` checks for non-equality between two items.

In [None]:
trials = 10000
num_different = ...

for ... in ...:
    ...
num_different

In [None]:
grader.check("q2_2")

Before moving on to the next question, inspect the value you computed for `num_different`. How would you describe what this value means in the context of the simulation you wrote? Discuss your interpretation with a classmate or teacher.

### Question 2.3.

Allie is playing a simplified version of darts. Her dartboard contains ten equal-sized zones with point values from 1 to 10. Write a simulation to determine her total score after 1000 dart tosses, assuming that each dart is equally likely to hit each region on the board. 

Note that while you _could_ write a `for` loop to complete this simulation, you don't need to. Remember that `np.random.choice(...)` can select multiple values from the same array if you provide a second argument that is an integer. Attempt to use `np.random.choice` in part of your solution after defining an array that contains all the possible outcomes of throwing a single dart. 

Since there is randomness involved, your answer will likely be different from a classmate's, and will change each time you run the cell.

In [None]:
possible_point_values = ...
num_tosses = 1000
simulated_tosses = ...
total_score = ...
total_score

In [None]:
grader.check("q2_3")

## 3. Sampling Basketball Data

We will now introduce the topic of sampling, which we’ll be discussing in more depth in upcoming classes. We’ll guide you through this code, but if you wish to read more about different kinds of samples before attempting this question, you can check out [Chapter 10 of the textbook](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html).

The data used in this section will contain salary data and other statistics for basketball players from the 2022-2023 NBA season. This data was collected from the following sports analytic sites: [Basketball Reference](http://www.basketball-reference.com) and [HoopsHype](https://hoopshype.com/salaries/players/2022-2023/).

Run the cell below to load player and salary data that we will use for our sampling. 

In [None]:
player_data = Table().read_table("player_data_2022.csv")
salary_data = Table().read_table("salary_data_2022.csv").set_format('Salary', CurrencyFormatter)

### `player_data`

The `player_data` Table contains statistics for each player for the 2022-2023 season. Run the cell below to inspect a few rows of the Table.

In [None]:
player_data

### `salary_data`

The `salary_data` Table contains salary information for each player for the 2022-2023 season. The column labeled `Salary` is an integer, but is formatted as a currency to make it a bit easier to read. 

Run the cell below to inspect a few rows of the Table.

In [None]:
salary_data

### Question 3.1.

Use the `.join` method to create a Table that contains the combined information from both Tables. Assign this combined Table to `full_data`.

In [None]:
full_data = ...
full_data

In [None]:
grader.check("q3_1")

### Analysis from a sample

For 536 players, it's not so unreasonable to expect to use all of the data to conduct an analysis, but it is usually very difficult and/or expensive to have a set of data with each person from the population represented in it. Instead, information on only a subset of the population can be collected. This subset is called a sample.

If we want to make estimates about a certain numerical property (like the mean, median, standard deviation, etc) of the entire population, we may have to come up with these estimates based only on a smaller sample. Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.

In this section, you'll create 3 different samples and investigate how the chosen method impacts the results:
* Convenience sample
* Small simple random sample
* Large simple random sample
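
To make the idea concrete before working with the NBA data, here is a rough sketch that compares the three kinds of samples on made-up numbers. It uses Python's standard `random` and `statistics` modules rather than the Table methods introduced later, and every name in it (`population`, `convenience`, `small_srs`, `large_srs`) is hypothetical.

```python
import random
import statistics

# Made-up "population" of 200 values; not the NBA data.
random.seed(0)  # fixed seed so the sketch is repeatable
population = [random.randint(1, 40) for _ in range(200)]

# Convenience-style sample: only the 20 smallest values are "easy to reach".
convenience = sorted(population)[:20]

# Simple random samples of two different sizes, drawn without replacement.
small_srs = random.sample(population, 20)
large_srs = random.sample(population, 100)

print(statistics.mean(population))
print(statistics.mean(convenience))   # biased low by construction
print(statistics.mean(small_srs))
print(statistics.mean(large_srs))
```

Changing or removing the seed gives different random samples, which is a quick way to see how much each sample mean moves around.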

### Helper Functions
To save typing and increase the clarity of your code, we will package the analysis code into a few functions that you can use. Some functions will be provided, and some you will need to write. These functions will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from sample data.

### `histograms`
The `histograms` function is defined below. It takes a Table that must contain columns named `Age` and `Salary` and draws a histogram for each one. It uses bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

**Note:** This function ONLY takes in a table with these labels. It will not work if the Table provided as an input does not have these labels.
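
One detail of the bin construction in `histograms` is that `np.arange` excludes its stop value, which is why the age bins stop at `max(ages) + 2`. Here is a minimal sketch with made-up ages. It uses `np.histogram` as a stand-in for the Table `hist` method, an assumption made only so that the sketch runs without `datascience`; `hist` may treat bin edges differently.

```python
import numpy as np

ages = np.array([19, 22, 22, 30])   # made-up ages

# np.arange stops before its second argument, so "+ 2" makes the
# final edge max(ages) + 1, leaving room for a bin that starts at max(ages).
age_bins = np.arange(min(ages), max(ages) + 2, 1)
print(age_bins[0], age_bins[-1])    # 19 31

# Every age lands in some bin, so the counts add up to len(ages).
counts, edges = np.histogram(ages, bins=age_bins)
print(counts.sum())                 # 4
```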

In [None]:
def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')/1000000
    t1 = t.drop('Salary').with_column('Salary', salaries)
    age_bins = np.arange(min(ages), max(ages) + 2, 1) 
    salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
    t1.hist('Age', bins=age_bins, unit='year')
    plt.title('Age distribution')
    t1.hist('Salary', bins=salary_bins, unit='million dollars')
    plt.title('Salary distribution') 
    
histograms(full_data)

### Question 3.2.

Define a function called `compute_statistics` that takes a table containing ages and salaries and returns a two-element `array` containing the average age and average salary of the players in the provided Table (in that order).

Then, call the function using the full set of data contained in the Table `full_data` to determine the mean age and salary from the whole population of NBA players.

**Note:** We wrote the solution using 3 lines in the body of the function, but you may be able to do it using more or fewer lines of code.

In [None]:
def compute_statistics(age_and_salary_data):
    age = ...
    salary = ...
    ...

full_stats = compute_statistics(full_data)
full_stats

In [None]:
grader.check("q3_2")

### Convenience sample
One sampling methodology, **which is generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from a team that plays in your city, since it's easier to survey them in person.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you were only able to survey *relatively new* players with ages less than 22 because the more experienced players didn't bother to answer your surveys about their salaries. Think about how that might impact the statistics about the sample compared to the full population of players.

### Question 3.3.

Assign `convenience_sample` to a subset of `full_data` that contains only the rows for players under the age of 22. This will mimic only being able to acquire a convenience sample based on player age.

In [None]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check("q3_3")

### Question 3.4.

Assign `convenience_stats` to an array of the average age and average salary of your convenience sample, using the `compute_statistics` function.  Since these averages are computed from a sample from a larger population, they are called *sample averages*. 

In [None]:
convenience_stats = ...
convenience_stats

In [None]:
grader.check("q3_4")

### Comparing Samples

Next, we'll compare the distribution of salaries in the convenience sample with the distribution of salaries in the full data in a single histogram. An overlaid histogram makes it easier to compare two distributions to see how they are similar and different. The function `compare_salaries` below will create this overlaid histogram for you. You don't need to understand how it works, but you're welcome to see if you can figure it out if you're up for a challenge!

In [None]:
def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    first_salary_in_millions = first.column('Salary')/1000000
    second_salary_in_millions = second.column('Salary')/1000000
    first_tbl_millions = first.drop('Salary').with_column('Salary', first_salary_in_millions)
    second_tbl_millions = second.drop('Salary').with_column('Salary', second_salary_in_millions)
    max_salary = max(np.append(first_tbl_millions.column('Salary'), second_tbl_millions.column('Salary')))
    bins = np.arange(0, max_salary+1, 1)
    first_binned = first_tbl_millions.bin('Salary', bins=bins).relabeled(1, first_title)
    second_binned = second_tbl_millions.bin('Salary', bins=bins).relabeled(1, second_title)
    first_binned.join('bin', second_binned).hist(bin_column='bin', unit='million dollars')
    plt.title('Salaries for all players and convenience sample')

compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')

<!-- BEGIN QUESTION -->

### Question 3.5.

Does the convenience sample give us an accurate picture of the salary of the full population? Explain why or why not, using evidence from the histograms and summary statistics generated for the convenience sample and the full population to back up your claims. Also offer an explanation, in the context of the data, as to why the sample does or does not represent the larger population. In other words, why would it make sense for younger players to have the same salaries as the overall population of players, or why would it make sense that they do not?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Simple random sampling
A more justifiable approach is to sample uniformly at random from the players.  In a **simple random sample (SRS) without replacement**, we ensure that each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box.  Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
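
The card-drawing picture can be sketched directly in Python. This is only an illustration using the standard `random` module and made-up player names, not the method used later in the lab.

```python
import random

# Hypothetical player names standing in for the cards in the box.
cards = ['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F']

sample_size = 3
random.shuffle(cards)            # shuffle the box in place
drawn = cards[:sample_size]      # pull cards off the top, one per player

print(drawn)
print(len(set(drawn)) == sample_size)   # True: no player is drawn twice
```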

### Producing simple random samples
Sometimes, it’s useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.

### `sample`

The table method `sample` produces a random sample from the table. By default, it draws at random **with replacement** from the rows of a table. It takes in the sample size as its argument and returns a **table** with only the rows that were selected. 

Run the cell below to see an example call to `sample()` with a sample size of 5, with replacement. Because the sample is drawn at random with replacement, the same player could appear more than once. You can try running the cell several times, but with such a large data set (over 500 players) you are unlikely to see the same player twice when selecting only 5.

In [None]:
salary_data.sample(5)

The optional argument `with_replacement=False` can be passed as an input to the `sample()` method to specify that the sample should be drawn without replacement. This will guarantee that the same player does not appear more than once in your sample.

Run the cell below to see an example call to `sample()` with a sample size of 5, without replacement.

In [None]:
salary_data.sample(5, with_replacement = False)
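
The guarantee of distinct rows can be seen more easily on a tiny example. The sketch below draws row numbers with NumPy, where the `replace=False` argument of `np.random.choice` plays a role analogous to `with_replacement=False`. It is an analogy on made-up row numbers, not a description of how `sample` is implemented.

```python
import numpy as np

row_indices = np.arange(6)   # stand-in for the rows of a tiny 6-row table

# Without replacement, drawing all 6 rows yields each row exactly once.
without_repl = np.random.choice(row_indices, 6, replace=False)
print(np.sort(without_repl))   # [0 1 2 3 4 5]

# Asking for more distinct rows than exist is an error.
try:
    np.random.choice(row_indices, 7, replace=False)
except ValueError as err:
    print('Cannot draw 7 distinct rows from 6:', err)
```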

<!-- BEGIN QUESTION -->

### Question 3.6.a

Produce a simple random sample of size 44 from `full_data` and save the sample to the Table named `my_small_srswor_data`. Create a histogram of the age and salary distributions using the `histograms` function, and compute the average age and salary using the `compute_statistics` function.

Due to the randomness of creating a sample, the averages and distributions will change each time the cell is run. Run the cell several times to see how the distributions and averages change from sample to sample.

In [None]:
my_small_srswor_data = ...
...
my_small_stats = ...
my_small_stats

<!-- END QUESTION -->

Run the cell below to see how the distribution of salaries in your most recent small simple random sample compares to that of the total population of players.

In [None]:
compare_salaries(full_data, my_small_srswor_data, 'All Players', 'Small Sample')

<!-- BEGIN QUESTION -->

### Question 3.6.b

How well does a small simple random sample tend to approximate the total population? Write an explanation of your observations and use information from the histograms or summary statistics to back up your claims. You should consider how the average age and salary changed as you drew several different samples and describe which seemed to change more from sample to sample: the average age or the average salary. Why do you think that is, based on the ages and salaries in the dataset?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3.7.

Using the same method as the previous question, analyze several simple random samples but this time of size 100 from `full_data`. 

Due to the randomness of creating a sample, the averages and distributions will change each time the cell is run. Run the cell several times to see how the distributions and averages change from sample to sample.

In [None]:
my_large_srswor_data = ...
...
my_large_stats = ...
my_large_stats

<!-- END QUESTION -->

Run the cell below to see how the distribution of salaries in your most recent large simple random sample compares to that of the total population of players.

In [None]:
compare_salaries(full_data, my_large_srswor_data, 'All Players', 'Large Sample')

<!-- BEGIN QUESTION -->

Does using samples of size 100 seem to do a better job at approximating the total population than samples of size 44? Your explanation should compare and contrast the histograms and summary statistics from the small and large sampling methods, and explain how one of them does a better job at accurately representing the total population.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `lab05_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `lab05_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)