In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab08.ipynb")

# Lab 8: Simulations

Welcome to Lab 8! 

You will review topics from before break: 

* [Defining a Function](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html)
* [Conditional Statements](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html)
* [Iteration](https://inferentialthinking.com/chapters/09/2/Iteration.html)

We will go over topics from this week including: 

* [Randomness](https://www.inferentialthinking.com/chapters/09/Randomness.html)
* [Simulations](https://www.inferentialthinking.com/chapters/09/3/Simulation.html)
* [Sampling](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)

Some of the data in this lab contain salaries and other statistics for basketball players from the 2014-2015 NBA season. These data were collected from the following sports analytics sites: [Basketball Reference](http://www.basketball-reference.com) and [Spotrac](http://www.spotrac.com).

**Submission**: Once you’re finished, run all cells besides the last one, select File > Save Notebook, and then execute the final cell. Then submit the downloaded zip file, which includes your notebook, according to your instructor's directions.

First, set up the notebook by running the cell below.

In [None]:
# Run this cell, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import d8error

## 1. Death and Taxes

The United States, like many countries, uses a progressive tax bracket system. This means that as your earnings increase, the percentage of your earnings you owe in tax also increases. In addition, the US tax system uses marginal tax brackets: US taxpayers pay different tax percentages on different "chunks" of their earnings.

Let's suppose the tax brackets for the 2023 tax year are defined by the table below (single filer). This is pretty close to the actual brackets, but for simplicity's sake we'll use 4 brackets instead of 7.

| Tax Rate | Taxable Income |
| --- | --- |
| 10% | \\$0 to \\$11,000 |
| 12% | \\$11,001 to \\$44,725 |
| 22% | \\$44,726 to \\$95,375 |
| 24% | \\$95,376 or more |

**You will need to use these numbers throughout this question.**

A few notes:
- We will assume all incomes are integers.
- "Taxable Income" refers to the part of one's income that is taxable. In the US there is what's known as a "standard deduction", which reduces the amount of income that is taxed. In this assignment, we won't worry about deductions.

If someone has a taxable income of \\$60,000, we say they are in the 22% tax bracket. However, such an individual doesn't owe 22% of \\$60,000 in taxes. Instead, they owe:
- 10% of \\$11,000, **plus**
- 12% of \\$33,725 (\\$44,725 - \\$11,000), **plus**
- 22% of \\$15,275 (\\$60,000 - \\$44,725)

for a total of \\$8,507.50 ($0.1 \cdot \$11,000 + 0.12 \cdot \$33,725 + 0.22 \cdot \$15,275 = \$8,507.50$). This makes their effective tax rate $\frac{8507.50}{60000} \approx 0.1418$, or about 14.18%.
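
As a quick check of the arithmetic above, here is a minimal sketch that computes the tax for this one \\$60,000 example, one "chunk" at a time. The variable names are made up for illustration, and the code handles only this single income; it is not a general solution.

```python
# Marginal tax on a taxable income of $60,000, one bracket "chunk" at a time.
# All variable names here are hypothetical, chosen for illustration only.
income = 60000

chunk_10 = 0.10 * 11000              # first $11,000 taxed at 10%
chunk_12 = 0.12 * (44725 - 11000)    # next $33,725 taxed at 12%
chunk_22 = 0.22 * (income - 44725)   # remaining $15,275 taxed at 22%

total_tax = chunk_10 + chunk_12 + chunk_22
effective_rate = total_tax / income

print(round(total_tax, 2))       # 8507.5
print(round(effective_rate, 4))  # 0.1418
```

Notice that only the last \\$15,275 is taxed at 22%, which is why the effective rate is well below 22%.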

If you want to read more about the US federal income tax system, click [here](https://www.taxpolicycenter.org/briefing-book/how-do-federal-income-tax-rates-work).

**Question 1.1**  
Complete the implementation of the function `tax_bracket`, which takes in a taxable income as a number (`income`) and returns the tax bracket (as a decimal) it is in. For instance, `tax_bracket(60000)` should evaluate to `0.22` and `tax_bracket(402150)` should evaluate to `0.24`.

_Hint_: Use what you know about `if-elif-else` blocks to your advantage!
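
If you'd like a reminder of the `if-elif-else` pattern, here is a small sketch on an unrelated problem. The function `letter_grade` and its cutoffs are hypothetical and have nothing to do with the tax brackets; the point is only that Python checks the conditions from top to bottom and runs the first branch whose condition is true.

```python
def letter_grade(score):
    """Return a letter grade for a numeric score (hypothetical cutoffs)."""
    if score >= 90:
        grade = 'A'
    elif score >= 80:    # only checked if score < 90
        grade = 'B'
    elif score >= 70:    # only checked if score < 80
        grade = 'C'
    else:                # everything below 70
        grade = 'F'
    return grade

print(letter_grade(85))  # B
```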

In [None]:
def tax_bracket(income):
    if ...:
        ...
    elif ...:
        ...
    elif ...:
        ...
    else: 
        ...
    ...

In [None]:
grader.check("q11")

**Question 1.2**  
Complete the implementation of `tax_owed`, which takes in a taxable income (`income`) and returns the amount of tax owed by an individual with that taxable income. For instance, `tax_owed(60000)` should evaluate to `8507.5`.

**Note**: The code you write for this question might get a little long – but that's okay! Take it one step at a time.

For your convenience, here's the tax bracket table again:

| Tax Rate | Taxable Income |
| --- | --- |
| 10% | \\$0 to \\$11,000 |
| 12% | \\$11,001 to \\$44,725 |
| 22% | \\$44,726 to \\$95,375 |
| 24% | \\$95,376 or more |



In [None]:
def tax_owed(income):
    if ...:
        ...
    elif ...:
        ...
    elif ...:
        ...
    else: 
        ...
    ...

In [None]:
grader.check("q12")

**Question 1.3**  
Finally, complete the implementation of `effective_tax_rate`, which takes in a taxable income (`income`) and returns the effective tax rate for an individual with that taxable income, as a decimal. For instance, `effective_tax_rate(60000)` should evaluate to approximately `0.141792`.

*Note*: If `income` is 0, your `effective_tax_rate` function should also return 0. Make sure you handle this case in your function!

_Hint_: You should use your `tax_owed` function. Our entire solution is only three lines, but you may use more than that if necessary.

In [None]:
def effective_tax_rate(income):
    ...
    

In [None]:
grader.check("q13")

## 2. Billboard Charts 📈

Run the cell below to load in data from the Billboard charts in the 2010s. If you're unfamiliar with the *Billboard Hot 100*, you can read about the chart [here](https://www.billboard.com/charts/).

In [None]:
# Run this cell!
billboard = Table.read_table('billboard-2010.csv')
billboard.show(5)

Artists and fans alike like to keep track of the most consecutive weeks a song has been ranked #1 on the *Billboard Hot 100*. For example, run the cell below to look at data regarding Drake's hit "One Dance" from 2016:

In [None]:
# Run this cell!
(billboard
     .where('Name', 'One Dance')
     .select('Artists', 'Name', 'Week', 'Weekly.rank')
     .sort('Week').show(15)
)

According to the above table, it seems like One Dance was ranked #1 for 9 consecutive weeks at one point – pretty impressive!

---



**Question 2.1**  
Below, complete the implementation of the function `one_streak`, which takes in an array `charts` representing the position of a song on the *Billboard Hot 100* over several consecutive weeks and returns the most consecutive weeks that song was ranked **#1.** Example behavior is shown below.

```py
>>> one_streak(make_array(13, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 2, 5, 7, 8))
9

>>> one_streak(make_array(4, 1, 1, 1, 2, 3, 11, 1, 1, 1, 1, 1))
5

>>> one_streak(make_array(5, 4, 1, 1, 1, 3, 2, 1, 2, 3))
3
```

In [None]:
def one_streak(charts):
    longest = ...
    current = ...
    for num in charts:
        if num == 1:
            current = ...
        else:
            longest = ...
            current = ...

    # Ask yourself – why are we returning max(longest, current) instead of just current?
    return max(longest, current)

In [None]:
grader.check("q21")

**Question 2.2**

Now that you've successfully defined the `one_streak` function, it's time to put it to use! In the following cell, assign `feeling_streak` to the most consecutive weeks that **I Gotta Feeling by the Black Eyed Peas** was ranked #1 on the *Billboard Hot 100*.

*Hint*: You'll need to use the `billboard` table.

In [None]:
feeling_array = ...
feeling_streak = ...
feeling_streak

In [None]:
grader.check("q22")

## 3. Sampling Basketball Data

We will now introduce the topic of sampling, which we’ve discussed in more depth in this week’s lectures. We’ll guide you through this code, but if you wish to read more about different kinds of samples before attempting this question, you can check out [Chapter 10 of the textbook](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html).

Run the cell below to load player and salary data that we will use for our sampling. 

In [None]:
player_data = Table.read_table("player_data.csv")
salary_data = Table.read_table("salary_data.csv")
full_data = salary_data.join("PlayerName", player_data, "Name")

# The show method immediately displays the contents of a table. 
# This way, we can display the top of several tables using a single cell.
player_data.show(3)
salary_data.show(3)
full_data.show(3)

Rather than getting data on every player (as in the tables loaded above), imagine that we had gotten data on only a smaller subset of the players. For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky. 

If we want to make estimates about a certain numerical property of the population, we may have to come up with these estimates based only on a smaller sample. The numerical property of the population is known as a parameter, and the estimate is known as a statistic (e.g., the mean or median). Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.
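
To make the parameter/statistic distinction concrete, here is a minimal sketch using only Python's built-in `random` and `statistics` modules. The "population" is a list of made-up ages generated for illustration; it is not the NBA data, and all variable names are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# A made-up "population" of 492 ages; these are NOT the real NBA ages.
population_ages = [random.randint(19, 40) for _ in range(492)]

# Parameter: a number computed from the whole population.
parameter = statistics.mean(population_ages)

# Statistic: the same quantity computed from a random sample only.
sample_ages = random.sample(population_ages, 44)
statistic = statistics.mean(sample_ages)

print('Population mean (parameter):', parameter)
print('Sample mean (statistic):    ', statistic)
```

The parameter is a fixed number, while the statistic would change if a different sample were drawn.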

To save typing and increase the clarity of your code, we will package the analysis code into a few functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

We've defined the `histograms` function below, which takes a table with columns `Age` and `Salary` and draws a histogram for each one. It uses bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

In [None]:
def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')/1000000
    t1 = t.drop('Salary').with_column('Salary', salaries)
    age_bins = np.arange(min(ages), max(ages) + 2, 1) 
    salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
    t1.hist('Age', bins=age_bins, unit='year')
    plt.title('Age distribution')
    t1.hist('Salary', bins=salary_bins, unit='million dollars')
    plt.title('Salary distribution') 
    
histograms(full_data)
print('Two histograms should be displayed below')

**Question 3.1**. Create a function called `compute_statistics` that takes a table containing an "Age" column and a "Salary" column and:
- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary (in that order)

You can call the `histograms` function to draw the histograms! 

*Note:* More charts may be displayed when running the test cell. Please feel free to ignore the charts.


In [None]:
def compute_statistics(age_and_salary_data):
    ...
    age = ...
    salary = ...
    ...
    

full_stats = compute_statistics(full_data)
full_stats

In [None]:
grader.check("q31")

### Simple random sampling
A justifiable way to gather a sample is to sample uniformly at random from the players. In a **simple random sample (SRS) without replacement**, we ensure that each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box. Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
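
The card-and-box procedure can be sketched directly in plain Python. This is only an illustration of the idea with placeholder names; it uses the built-in `random` module rather than any `datascience` table method.

```python
import random

# Hypothetical "cards", one per player; the names are placeholders.
cards = ['Player A', 'Player B', 'Player C', 'Player D', 'Player E', 'Player F']
sample_size = 3

box = list(cards)           # copy the cards so the original list is unchanged
random.shuffle(box)         # shuffle the box
drawn = box[:sample_size]   # set aside cards until the sample size is reached

print(drawn)
```

Because each card is in the box only once, no player can appear in `drawn` more than once.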

### Producing simple random samples
Sometimes, it’s useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.

### `sample`

The table method `sample` produces a random sample from the table. By default, it draws at random **with replacement** from the rows of a table. Sampling with replacement means for any row selected randomly, there is a chance it can be selected again if we sample multiple times. `sample` takes in the sample size as its argument and returns a **table** with only the rows that were selected. 
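
Here is a rough sketch of what "with replacement" means, using Python's built-in `random` module instead of the table method `sample`. The list `rows` is a hypothetical stand-in for the rows of a table.

```python
import random

rows = ['row 0', 'row 1', 'row 2', 'row 3', 'row 4']  # stand-ins for table rows

# With replacement: every draw is made from all 5 rows, so drawing 10 times
# is allowed, and at least one row must show up more than once.
with_repl = random.choices(rows, k=10)
print(with_repl)
print('distinct rows drawn:', len(set(with_repl)))

# Without replacement: a row cannot be drawn twice, so Python's random.sample
# refuses to draw more rows than exist.
try:
    random.sample(rows, 10)
except ValueError as err:
    print('cannot draw 10 distinct rows from 5:', err)
```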

Run the cell below to see an example call to `sample()` with a sample size of 5, with replacement.

In [None]:
# Just run this cell

salary_data.sample(5)

The optional argument `with_replacement=False` can be passed to `sample()` to specify that the sample should be drawn without replacement.

Run the cell below to see an example call to `sample()` with a sample size of 5, without replacement.

In [None]:
# Just run this cell

salary_data.sample(5, with_replacement=False)

<!-- BEGIN QUESTION -->

**Question 3.2** Produce a simple random sample **without** replacement of size **44** from `full_data`. Then, run your analysis on it again by using the `compute_statistics` function.  Run the cell a few times to see how the histograms and statistics change across different samples.

Briefly answer the following questions: 

- How much does the average age change across samples? 
- What about average salary?



(FYI: srs = simple random sample, wor = without replacement)

_Type your answer here, replacing this text._

In [None]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

<!-- END QUESTION -->

## 4. More Random Sampling Practice

More practice for random sampling using `np.random.choice`.

### Simulations and For Loops (cont.)

**Question 4.1** We can use `np.random.choice` to simulate multiple trials.

After finishing the DATA 1201 project, Stephanie decides to spend the rest of her night rolling a standard six-sided die. She wants to know what her total score would be if she rolled the die 1000 times. Write code that simulates her total score after 1000 rolls. 

*Hint:* First decide the possible values you can take in the experiment (point values in this case). Then use `np.random.choice` to simulate Stephanie’s rolls. Finally, sum up the rolls to get Stephanie's total score.
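
As a reminder of how `np.random.choice` works, here is a small sketch of a different experiment: flipping a coin 10 times and counting the heads. The variable names are made up for this illustration.

```python
import numpy as np

coin = np.array(['Heads', 'Tails'])   # possible outcomes of one flip
num_flips = 10

# The second argument is how many draws to make; draws are made
# with replacement by default.
flips = np.random.choice(coin, num_flips)
num_heads = np.count_nonzero(flips == 'Heads')

print(flips)
print(num_heads)
```

The same three steps (list the possible outcomes, draw at random, summarize the draws) apply to many simulations.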


In [None]:
possible_point_values = ...
num_tosses = 1000
simulated_tosses = ...
total_score = ...
total_score

In [None]:
grader.check("q41")

### Simple random sampling (cont.)

**Question 4.2** As in the previous question, analyze several simple random samples of size 100 from `full_data` by using the `compute_statistics` function.  

Answer the questions:
- Do the histogram shapes seem to change more or less across samples of size 100 than across samples of size 44?  
- Are the sample averages and histograms closer to their true values/shape for age or for salary?  What did you expect to see?

_Type your answer here, replacing this text._

<!-- BEGIN QUESTION -->



In [None]:
my_large_srswor_data = ...
my_large_stats = ...
my_large_stats

<!-- END QUESTION -->

## 5. Submission

Congratulations on finishing Lab 8!



**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)