# Lab 5

Welcome to Lab 5! In this lab, we will learn about sampling strategies.

In [None]:
# Run this cell, but please don't change it.

# These lines import the NumPy and pandas modules.
import numpy as np
import pandas as pd

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')


## 1. Dungeons and Dragons and Sampling
In the game Dungeons & Dragons, each player plays the role of a fantasy character.

A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success.  The modifier depends on her character's competence in performing the action.

For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door.  She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and succeeds if the total is greater than 15.

**Question 1.1** Write code that simulates that procedure.  Compute three values: the result of Alice's roll (`roll_result`), the result of her roll plus Roga's modifier (`modified_result`), and a boolean value indicating whether the action succeeded (`action_succeeded`).  **Do not fill in any of the results manually**; the entire simulation should happen in code.

*Hint:* A roll of a 20-sided die is a number chosen uniformly from the array `[1, 2, 3, 4, ..., 20]`.  So a roll of a 20-sided die *plus 11* is a number chosen uniformly from that array, plus 11.

In [None]:
possible_rolls = ...
roll_result = ...
modified_result = ...
action_succeeded = ...

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print("On a modified roll of {:d}, Alice's action {}.".format(modified_result, "succeeded" if action_succeeded else "failed"))

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
possible_rolls = np.arange(1,21)
roll_result = np.random.choice(possible_rolls)
modified_result = roll_result+11
action_succeeded = modified_result > 15
</pre>
</details>
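
Because a single roll is random, running the simulation once says little about how often Roga succeeds. The sketch below repeats it many times and compares the observed success rate with the exact value: the action succeeds when the roll is at least 5, which covers 16 of the 20 faces, so the probability is 16/20 = 0.8. The names `num_trials` and `rng` and the fixed seed are assumptions made for this illustration, not part of the lab.

```python
import numpy as np

# Hypothetical settings for this illustration (not part of the lab).
num_trials = 100_000
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

possible_rolls = np.arange(1, 21)            # faces 1 through 20
rolls = rng.choice(possible_rolls, size=num_trials)
modified_results = rolls + 11                # Roga's modifier is 11
successes = modified_results > 15            # success means a total above 15

simulated_rate = successes.mean()
exact_rate = np.mean(possible_rolls + 11 > 15)   # fraction of faces that succeed

print("Simulated success rate:", simulated_rate)
print("Exact success rate:    ", exact_rate)
```

With this many trials, the simulated rate should land close to the exact one, though it will rarely match it exactly.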


Suppose we don't know that Roga has a modifier of 11 for this action.  Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors.  We would like to estimate her modifier from these 7 numbers.

**Question 1.2** Write a Python function called `simulate_observations`.  It should take no arguments, and it should return an array of 7 numbers.  Each of the numbers should be the modified roll from one simulation.  **Then**, call your function once to compute an array of 7 simulated modified rolls.  Name that array `observations`.

In [None]:
modifier = 11
num_observations = 7

def simulate_observations():
    """Produces an array of 7 simulated modified die rolls"""
    ...

observations = ...
observations

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
return np.random.choice(possible_rolls, num_observations) + modifier
observations = simulate_observations()
</pre>
</details>


**Question 1.3** Draw a histogram to display the *probability distribution* of the modified rolls we might see. Note that each possible roll is equally likely to occur.

In [None]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+20, 1)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
The histogram depends on how you interpret the modified rolls. One possible answer is:
plt.hist(np.arange(min(possible_rolls)+modifier,max(possible_rolls)+modifier+1), bins=roll_bins)
</pre>
</details>


Now let's imagine we don't know the modifier and try to estimate it from `observations`.

One straightforward (but clearly suboptimal) way to do that is to find the *smallest* total roll and subtract 1, since the smallest roll on a 20-sided die is 1.

**Question 1.4** Using that method, estimate `modifier` from `observations`.  Name your estimate `min_estimate`.

In [None]:
min_estimate = ...
min_estimate

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
min_estimate = min(observations) - min(possible_rolls)
</pre>
</details>


Another way to estimate the modifier involves the mean of `observations`.

**Question 1.5** Figure out a good estimate based on that quantity.  

**Then**, write a function named `mean_based_estimator` that computes your estimate.  It should take an array of modified rolls (like the array `observations`) as its argument and return an estimate of `modifier` based on those numbers.

In [None]:
def mean_based_estimator(nums):
    """Estimate the roll modifier based on observed modified rolls in the array nums."""
    ...

# Here is an example call to your function.  It computes an estimate
# of the modifier from our 7 observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
return np.mean(nums) - np.mean(possible_rolls)
</pre>
</details>
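
To see why the minimum-based estimate is a weaker choice than the mean-based one, it can help to repeat the whole experiment many times. The sketch below simulates many sets of 7 modified rolls and computes both estimates for each set. The minimum-based estimate can never fall below the true modifier (the smallest roll is at least 1), so it tends to overshoot, while the mean-based estimate is centered on the true value. The names `num_repetitions` and `rng` and the fixed seed are assumptions for this illustration.

```python
import numpy as np

modifier = 11
num_observations = 7
possible_rolls = np.arange(1, 21)

# Hypothetical settings for this illustration (not part of the lab).
num_repetitions = 10_000
rng = np.random.default_rng(0)

# Each row is one set of 7 observed modified rolls.
samples = rng.choice(possible_rolls, size=(num_repetitions, num_observations)) + modifier

# Minimum-based estimate: smallest observation minus the smallest possible roll.
min_estimates = samples.min(axis=1) - possible_rolls.min()
# Mean-based estimate: sample mean minus the mean of a fair 20-sided die (10.5).
mean_estimates = samples.mean(axis=1) - possible_rolls.mean()

print("True modifier:                 ", modifier)
print("Average minimum-based estimate:", min_estimates.mean())
print("Average mean-based estimate:   ", mean_estimates.mean())
```

Any single mean-based estimate can still miss by a few points; it is only the average over many repetitions that sits near 11.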


## 2. Sampling

The data used in this part of the lab contains salary data and statistics for basketball players from the 2014-2015 NBA season. This data was collected from [basketball-reference](http://www.basketball-reference.com) and [spotrac](http://www.spotrac.com).

Run the cell below to load the player and salary data.

In [None]:
player_data = pd.read_csv("player_data.csv")
salary_data = pd.read_csv("salary_data.csv")
full_data = pd.merge(salary_data, player_data, left_on='PlayerName', right_on='Name')

full_data.head(3)
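
`pd.merge` with `left_on` and `right_on` matches rows whose values agree in two differently named columns. By default it performs an inner join, so players that appear in only one table are dropped. The toy tables below are made up for illustration (the names and numbers are not from the lab's CSV files); they sketch the kind of join used to build `full_data`.

```python
import pandas as pd

# Made-up toy tables; these values are not from the lab's data files.
toy_salaries = pd.DataFrame({
    "PlayerName": ["Ann", "Bo", "Cy"],
    "Salary": [1_000_000, 2_500_000, 4_000_000],
})
toy_players = pd.DataFrame({
    "Name": ["Bo", "Cy", "Di"],
    "Age": [24, 31, 27],
})

# Inner join (the default): keep only names present in both tables.
toy_full = pd.merge(toy_salaries, toy_players, left_on="PlayerName", right_on="Name")
print(toy_full)
```

Both key columns (`PlayerName` and `Name`) are kept in the result because they have different names.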

Rather than getting data on every player, imagine that we had gotten data on only a smaller subset of the players.  For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

A statistical inference is a statement about some statistic of the underlying population, such as "the average salary of NBA players in 2014 was $3".  You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences, unlike, say, logical inferences, can be wrong.

A general strategy for inference using samples is to estimate statistics of the population by computing the same statistics on a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors, and we'll touch lightly on a few of those today.

One very important factor in the utility of samples is how they were gathered.  We have prepared some example sample datasets to simulate inference from different kinds of samples for the NBA player dataset.  Later we'll ask you to create your own samples to see how they behave.

To save typing and increase the clarity of your code, we will package the loading and analysis code into two functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

**Question 2.1**. Complete the `histograms` function, which takes a DataFrame with columns `Age` and `Salary` and draws a histogram for each one. Use the min and max functions to pick the bin boundaries so that all data appears for any DataFrame passed to your function. Use bin widths of 1 year for `Age` and $1,000,000 for `Salary`.

In [None]:
def histograms(t):
    ages = t['Age']
    salaries = t['Salary']
    age_bins = ...
    salary_bins = ...
    t.hist('Age', bins=age_bins)
    t.hist('Salary', bins=salary_bins)
    return age_bins # Keep this statement so that your work can be checked
    
histograms(full_data)
print('Two histograms should be displayed below')

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
age_bins = np.arange(min(ages),max(ages)+2,1)
salary_bins = np.arange(min(salaries),max(salaries)+2000000,1000000)
</pre>
</details>
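
The `+2` in the age bins can look odd. `np.arange` leaves out its stop value, so `np.arange(lo, hi + 1)` would end at `hi`, and the oldest age would share the final bin with the age just below it. Going one step further gives the maximum its own bin. The sketch below checks this on a made-up array of ages (`toy_ages` is hypothetical) using `np.histogram`, which follows the same bin-edge convention as `plt.hist`.

```python
import numpy as np

# Made-up ages for illustration; not from the lab's data.
toy_ages = np.array([19, 22, 22, 30, 38])

# Edges run from the minimum to the maximum + 1, in steps of 1 year.
age_bins = np.arange(toy_ages.min(), toy_ages.max() + 2, 1)

counts, edges = np.histogram(toy_ages, bins=age_bins)
print("First and last edge:", edges[0], edges[-1])
print("Count in the last bin:", counts[-1])
print("Total count:", counts.sum())
```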


**Question 2.2**. Create a function called `compute_statistics` that takes a DataFrame containing ages and salaries and:
- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary

You can call your `histograms` function to draw the histograms!

In [None]:
def compute_statistics(age_and_salary_data):
    ...
    age = ...
    salary = ...
    ...
    

full_stats = compute_statistics(full_data)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
histograms(age_and_salary_data)
age = age_and_salary_data["Age"]
salary = age_and_salary_data["Salary"]
return np.array([np.mean(age),np.mean(salary)])
</pre>
</details>


### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose players who are somehow convenient to sample.  For example, you might choose players from one team that's near your house, since it's easier to survey them.  This is called, somewhat pejoratively, *convenience sampling*.

Suppose you survey only *relatively new* players with ages less than 22.  (The more experienced players didn't bother to answer your surveys about their salaries.)

**Question 2.3**  Assign `convenience_sample` to a subset of `full_data` that contains only the rows for players under the age of 22.

In [None]:
convenience_sample = full_data.loc[full_data.Age<22]
convenience_sample

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
convenience_sample = full_data.loc[full_data.Age<22]
</pre>
</details>


**Question 2.4** Assign `convenience_stats` to an array of the average age and average salary of your convenience sample, using the `compute_statistics` function.  Since they're computed on a sample, these are called *sample averages*. 

In [None]:
convenience_stats = ...
convenience_stats

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
convenience_stats = compute_statistics(convenience_sample)
</pre>
</details>


Next, we'll compare the convenience sample salaries with the full data salaries in a single histogram. Just run the following cell and feel free to zoom in on either distribution:

In [None]:
max_salary = max(np.append(full_data['Salary'], convenience_sample['Salary']))
bins = np.arange(0, max_salary+1e6+1, 1e6)
plt.hist(full_data.Salary, bins=bins)
plt.hist(convenience_sample.Salary, bins=bins)

**Question 2.5** Does the convenience sample give us an accurate picture of the age and salary of the full population of NBA players in 2014-2015?  Would you expect it to, in general?  Before you move on, write a short answer in English below.  You can refer to the statistics calculated above or perform your own analysis.

*Write your answer here, replacing this text.*

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
Sample Answer: No, the convenience sample does not give us an accurate picture of the age and salary of the full population of NBA players in 2014-2015. We would not expect it to, because it is biased towards players younger than 22.
</pre>
</details>
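
The bias can be seen without the real data. The sketch below builds a synthetic population in which salary tends to rise with age (the ages, salaries, and that relationship are invented for illustration, not NBA figures), then applies the same under-22 filter. Every player in the filtered sample is younger than 22, so its average age must fall below 22 regardless of the population average.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population: invented values, not real NBA data.
num_players = 500
ages = rng.integers(19, 40, size=num_players)     # ages 19 through 39
# Made-up assumption: pay grows with age, plus skewed random noise.
salaries = 500_000 + 300_000 * (ages - 19) + rng.exponential(2_000_000, size=num_players)
population = pd.DataFrame({"Age": ages, "Salary": salaries})

# Convenience sample: only players under 22, as in the lab.
convenience = population.loc[population.Age < 22]

print("Population mean age / salary: ", population.Age.mean(), population.Salary.mean())
print("Convenience mean age / salary:", convenience.Age.mean(), convenience.Salary.mean())
```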


### Simple random sampling
A more principled approach is to sample uniformly at random from the players.  If we ensure that each player is selected at most once, this is a *simple random sample without replacement*, sometimes abbreviated to "simple random sample" or "SRSWOR".  Imagine writing down each player's name on a card, putting the cards in an urn, and shuffling the urn.  Then, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
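
The urn procedure maps directly onto code: shuffle the names, then take the first few. The sketch below uses NumPy's `Generator.permutation` on a made-up list of names (`toy_names`, `sample_size`, and the seed are assumptions for illustration). Because each card is taken from the shuffled deck only once, no name can repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up player names standing in for the cards in the urn.
toy_names = np.array(["Ann", "Bo", "Cy", "Di", "Ed", "Flo", "Gus", "Hal"])
sample_size = 3

# Shuffle the "cards", then set aside the first sample_size of them.
shuffled = rng.permutation(toy_names)
srswor = shuffled[:sample_size]

print(srswor)
```

Calling `rng.choice(toy_names, size=sample_size, replace=False)` draws a sample of the same kind in one step.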

We've produced two samples of the `salary_data` table in this way: `small_srswor_salary.csv` and `large_srswor_salary.csv` contain, respectively, a sample of size 44 (the same as the convenience sample) and a larger sample of size 100.  

The `load_data` function below loads a salary table and joins it with `player_data`.

In [None]:
def load_data(salary_file):
    return pd.merge(player_data, pd.read_csv(salary_file), left_on='Name', right_on='PlayerName')

**Question 2.6** Run the same analyses on the small and large samples that you previously ran on the full dataset and on the convenience sample.  Compare the accuracy of the estimates of the population statistics that we get from the convenience sample, the small simple random sample, and the large simple random sample.

In [None]:
small_srswor_data = ...
small_stats = ...
large_srswor_data = ...
large_stats = ...
print('Full data stats:                 ', full_stats)
print('Small simple random sample stats:', small_stats)
print('Large simple random sample stats:', large_stats)

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
small_srswor_data = load_data("small_srswor_salary.csv")
small_stats = compute_statistics(small_srswor_data)
large_srswor_data = load_data("large_srswor_salary.csv")
large_stats = compute_statistics(large_srswor_data)
</pre>
</details>


### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  The randomized response technique was one example we saw in lecture.  Another is to help us understand how inaccurate other samples are.

DataFrames provide the method `sample()` for producing random samples.  Note that its default is to sample without replacement; pass `replace=True` to sample with replacement. To see how to call `sample()`, search the documentation on the [resources page](http://data8.org/su17/resources.html) of the course website, or enter `full_data.sample?` into a code cell and press Shift + Enter.
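
For a pandas DataFrame, passing `replace` explicitly makes the intent clear to anyone reading the code. The sketch below shows both modes on a made-up table (`toy_data` is hypothetical); `random_state` is used here only to make the runs repeatable.

```python
import pandas as pd

# Made-up table for illustration; not from the lab's data.
toy_data = pd.DataFrame({"Name": ["Ann", "Bo", "Cy", "Di", "Ed"],
                         "Age": [21, 24, 27, 30, 33]})

# Without replacement: each row can appear at most once.
without_replacement = toy_data.sample(3, replace=False, random_state=0)

# With replacement: rows may repeat, so the sample can be larger than the table.
with_replacement = toy_data.sample(8, replace=True, random_state=0)

print(without_replacement)
print(with_replacement)
```

Drawing 8 rows from a 5-row table with replacement is guaranteed to repeat at least one row, which is why the second sample can never be a simple random sample without replacement.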

**Question 2.7** Produce a simple random sample of size 44 from `full_data`.  (You don't need to bother with a join this time -- just use `full_data.sample(...)` directly.  That will have the same result as sampling from `salary_data` and joining with `player_data`.)  Run your analysis on it again.  
- Are your results roughly similar to those in the small sample we provided you?  Run your code several times to get new samples.  
- How much does the average age change across samples? 
- What about average salary?

In [None]:
my_small_srswor_data = ...
my_small_stats = ...
my_small_stats

*Write your answer here, replacing this text.*

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
my_small_srswor_data = full_data.sample(44, replace=False)
my_small_stats = compute_statistics(my_small_srswor_data)
Sample Answer: Yes, the results are similar to, but not the same as, those from the sample we were given. The average age tends to stay around the same value because NBA players' ages span a limited range, but the average salary changes by a sizeable amount because salaries vary much more.
</pre>
</details>


**Question 2.8** As in the previous question, analyze several simple random samples of size 100 from `full_data`.  
- Do the averages and histograms seem to change more or less across samples of size 100 than across samples of size 44?  
- Are the sample averages and histograms closer to their true values for age or for salary?  What did you expect to see?

In [None]:
my_large_srswor_data = ...
...

*Write your answer here, replacing this text.*

<details><summary><button>Click here to reveal the answer!</button></summary>
<pre>
my_large_srswor_data = full_data.sample(100, replace=False)
my_large_stats = compute_statistics(my_large_srswor_data)
Sample Answer: The average and histogram statistics seem to change less across samples of this size. They are closer to their true values, which is what we'd expect to see because we are sampling a larger subset of the population.
</pre>
</details>
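
One way to check the effect of sample size is to measure how much the sample average varies from one sample to the next. The sketch below uses a synthetic, right-skewed population of 492 salaries (randomly generated, not the lab's data) and draws many simple random samples of size 44 and of size 100. The spread of the sample averages is expected to be smaller for the larger samples. The names `synthetic_salaries`, `num_repetitions`, and `sample_means` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "salaries"; invented values, not real NBA data.
synthetic_salaries = rng.exponential(4_000_000, size=492)
num_repetitions = 2_000

def sample_means(sample_size):
    """Average of each of many simple random samples drawn without replacement."""
    return np.array([
        rng.choice(synthetic_salaries, size=sample_size, replace=False).mean()
        for _ in range(num_repetitions)
    ])

spread_44 = sample_means(44).std()
spread_100 = sample_means(100).std()

print("Population mean:              ", synthetic_salaries.mean())
print("Spread of averages, size 44:  ", spread_44)
print("Spread of averages, size 100: ", spread_100)
```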


Great job! :D You're finished with lab 5!

**Acknowledgement**: The materials for this lab and the course textbook are based on the [data8](http://data8.org/) course at UC Berkeley.