# Lesson 10: Sampling

Welcome to Lesson 10! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to understand the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

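In today's lesson, you'll learn about:

- deterministic, systematic, and simple random samples;
- empirical distributions and how they change with sample size;
- the difference between a population parameter and a sample statistic;
- simulating jury panels for Swain vs. Alabama.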

Let's get started!

## Words of Caution

Remember to run the cell below. It's for setting up the environment so you can have access to what's needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Sampling from a Table

In [None]:
coin = Table().with_column('coin', make_array('H','T'))
coin

In [None]:
coin.sample(5)

In [None]:
united = Table.read_table('data/united.csv')
united = united.with_column('Row', np.arange(united.num_rows)).move_to_start('Row')
united

## Non-Random (Deterministic) Sampling

In [None]:
united.where('Destination', 'RDU') 

In [None]:
united.take(np.arange(0, united.num_rows, 1000))

In [None]:
united.take(make_array(34, 6321, 10040))

## Random Sample

In [None]:
np.random.choice(np.arange(1000))

## Systematic Sample

In [None]:
start = np.random.choice(np.arange(1000))
systematic_sample = united.take(np.arange(start, united.num_rows, 1000))
systematic_sample.show()
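
In a systematic sample, only the starting row is random; every later row follows from it at a fixed step. The sketch below shows the same idea with the standard library alone, so it runs without `data/united.csv`. The row count `num_rows` is a hypothetical stand-in, not the size of the real table.

```python
import random

# Hypothetical stand-in for the flight table: only the number of rows matters here.
num_rows = 13825
step = 1000

rng = random.Random(0)          # seeded so the sketch is reproducible
start = rng.randrange(step)     # random starting row in [0, step)

# Every row after the first is determined by the start: start, start + step, ...
systematic_rows = list(range(start, num_rows, step))

print(start, len(systematic_rows))
```

Each row has the same 1-in-1000 chance of being selected, but two rows can appear together only if they are a multiple of 1000 apart, so not every group of rows is equally likely.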

## Simple Random Sample

In [None]:
# By default, sample draws rows with replacement, so a row can appear more than once.
united.sample(20)

In [None]:
united.sample(20, with_replacement=False)
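
The two cells above differ only in whether a row can be drawn more than once. As a minimal sketch of that difference, assuming a small list of hypothetical row labels in place of the `united` table, the standard library offers one function for each behavior:

```python
import random

rng = random.Random(1)          # seeded so the sketch is reproducible
rows = list(range(50))          # hypothetical row labels for a small table

# With replacement: each draw is independent, so the same row can repeat.
with_replacement = rng.choices(rows, k=20)

# Without replacement: each row appears at most once (a simple random sample).
without_replacement = rng.sample(rows, k=20)

print(len(set(with_replacement)), len(set(without_replacement)))
```

The second number printed is always 20, because sampling without replacement cannot repeat a row; the first can be smaller.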

## Distributions

In [None]:
die = Table().with_column('Face', np.arange(1, 7))
die

In [None]:
die.sample(10)

In [None]:
roll_bins = np.arange(0.5, 6.6, 1)

In [None]:
Table.interactive_plots()
die.hist(bins=roll_bins)

In [None]:
Table.interactive_plots()
die.sample(10).hist(bins=roll_bins)

In [None]:
Table.interactive_plots()
die.sample(1000).hist(bins=roll_bins)

In [None]:
Table.interactive_plots()
die.sample(100000).hist(bins=roll_bins)

In [None]:
Table.interactive_plots()
die.sample(1000000).hist(bins=roll_bins)

In [None]:
Table.interactive_plots()
die.sample(10000).hist(bins=roll_bins)
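
The histograms above suggest that the empirical distribution of the rolls gets closer to the probability distribution (each face with chance 1/6) as the number of rolls grows. One way to check this numerically, sketched here with the standard library rather than the `datascience` table, is to measure the largest gap between an observed proportion and 1/6. The helper name `empirical_proportions` is an assumption made for this illustration.

```python
import random
from collections import Counter

rng = random.Random(2)          # seeded so the sketch is reproducible
faces = [1, 2, 3, 4, 5, 6]

def empirical_proportions(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of each face."""
    counts = Counter(rng.choices(faces, k=num_rolls))
    return {face: counts[face] / num_rolls for face in faces}

for n in (10, 1000, 100000):
    props = empirical_proportions(n)
    # Largest distance between an observed proportion and the probability 1/6.
    largest_gap = max(abs(p - 1 / 6) for p in props.values())
    print(n, round(largest_gap, 4))
```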

## Large Random Samples

In [None]:
Table.interactive_plots()

united_bins = np.arange(-20, 201, 5)
united.hist('Delay', bins = united_bins)

In [None]:
min(united.column('Delay'))

In [None]:
max(united.column('Delay'))

In [None]:
np.mean(united.column('Delay'))

In [None]:
Table.interactive_plots()
united.sample(10).hist('Delay', bins=united_bins)

In [None]:
Table.interactive_plots()
united.sample(1000).hist('Delay', bins = united_bins)

In [None]:
Table.interactive_plots()
united.sample(10000).hist('Delay', bins = united_bins)

## Statistics

### (Population) Parameter

In [None]:
np.median(united.column('Delay'))

### (Sample) Statistic

In [None]:
np.median(united.sample(10).column('Delay'))

In [None]:
np.median(united.sample(1000).column('Delay'))

## Probability and Empirical Distributions of a Statistic

In [None]:
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))

In [None]:
sample_median(100)

In [None]:
num_simulations = 2000

Make an empty array to store the simulated statistics.

In [None]:
sample_medians = make_array()

Run a loop to simulate.

In [None]:
for i in np.arange(num_simulations):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

Visualize the empirical distribution of the simulated sample medians.

In [None]:
Table.interactive_plots()
Table().with_column('Sample medians (size=10)', sample_medians).hist(bins=np.arange(-10, 35, 2))

In [None]:
sample_medians = make_array()

for i in np.arange(num_simulations):
    new_median = sample_median(100)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
Table.interactive_plots()
Table().with_column('Sample medians (size=100)', sample_medians).hist(bins=np.arange(-1, 11, 1))

In [None]:
sample_medians = make_array()

for i in np.arange(num_simulations):
    new_median = sample_median(1000)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
Table.interactive_plots()
Table().with_column('Sample medians (size=1K)', sample_medians).hist(bins=np.arange(0, 7, 1))

## Empirical Distributions Overlaid

In [None]:
sample_medians_10 = make_array()
sample_medians_100 = make_array()
sample_medians_1000 = make_array()

num_simulations = 2000

for i in np.arange(num_simulations):
    new_median_10 = sample_median(10)
    sample_medians_10 = np.append(sample_medians_10, new_median_10)
    new_median_100 = sample_median(100)
    sample_medians_100 = np.append(sample_medians_100, new_median_100)
    new_median_1000 = sample_median(1000)
    sample_medians_1000 = np.append(sample_medians_1000, new_median_1000)

In [None]:
sample_medians = Table().with_columns('Size 10', sample_medians_10, 
                                      'Size 100', sample_medians_100,
                                      'Size 1000', sample_medians_1000)

In [None]:
Table.interactive_plots()
sample_medians.hist(bins=np.arange(-5, 30))
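
The overlaid histograms can also be summarized with a single number per sample size: the standard deviation of the simulated medians. The sketch below is an illustration only. It uses a hypothetical skewed population generated on the spot, not the `united.csv` delays, and the helper `spread_of_medians` is an assumed name.

```python
import random
import statistics

rng = random.Random(3)          # seeded so the sketch is reproducible

# Hypothetical right-skewed "delay" population; NOT the united.csv data.
population = [rng.expovariate(1 / 15) - 5 for _ in range(10000)]

def sample_median(size):
    """Median of a random sample drawn with replacement from the population."""
    return statistics.median(rng.choices(population, k=size))

def spread_of_medians(size, num_simulations=500):
    """Standard deviation of many simulated sample medians."""
    medians = [sample_median(size) for _ in range(num_simulations)]
    return statistics.stdev(medians)

spreads = {size: spread_of_medians(size) for size in (10, 100, 1000)}
print(spreads)
```

Under these assumptions the spread shrinks as the sample size grows, which is the same narrowing the three histograms show.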

## Swain vs. Alabama

Swain, a black man, was indicted and convicted of rape in the Circuit Court of Talladega County, Alabama, and sentenced to death by an all-white jury. The case was appealed to the Supreme Court, in part, on the ground that there were no black jurors. Of eligible jurors in the county, 26% were black, but panels since 1953 averaged 10% to 15% black jurors and no black juror had actually served on a petit jury since 1950 ([Wikipedia](https://en.wikipedia.org/wiki/Swain_v._Alabama)).

In [None]:
population_proportions = make_array(.26, .74)
population_proportions

In [None]:
sample_proportions(100, population_proportions)

In [None]:
def panel_proportion():
    return sample_proportions(100, population_proportions).item(0)

In [None]:
panel_proportion()

In [None]:
panels = make_array()

for i in np.arange(10000):
    new_panel = panel_proportion() * 100
    panels = np.append(panels, new_panel)

In [None]:
Table.interactive_plots()
Table().with_column('Number of Black Men on Panel of 100', panels).hist(bins=np.arange(5.5,40.))
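
To put a number on the histogram, one option is to ask how often a randomly selected panel of 100 would contain 15 or fewer black men, 15% being the top of the range quoted above. The sketch below repeats the simulation with the standard library only; the function name `black_men_on_panel` and the cutoff of 15 are choices made for this illustration.

```python
import random

rng = random.Random(4)          # seeded so the sketch is reproducible

def black_men_on_panel(panel_size=100, eligible_proportion=0.26):
    """Count of black panelists when each seat is filled at random
    from a population that is 26% black."""
    return sum(rng.random() < eligible_proportion for _ in range(panel_size))

num_panels = 10000
counts = [black_men_on_panel() for _ in range(num_panels)]

# Fraction of simulated panels at or below 15 black men out of 100.
fraction_at_most_15 = sum(count <= 15 for count in counts) / num_panels
print(fraction_at_most_15)
```

If random selection were a good model of how panels were chosen, panels this low should be rare in the simulation.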