### Randomness
Elements of Data Science Week 7

# Simulation Learning Goals
Simulate a task that depends on probability, such as a die roll, then repeat it to build up the distribution of outcomes and its characteristics (mean, ...)
- Probability
    - np.random.choice()
- Simulation: Sample the distribution
    - Repeat and collect outcomes
    - Iteration: 
        `for i in np.arange(samples)`
- Examine resulting distribution of outcomes
    - Probability distribution
    - Uncertainty

#### Random distributions play a large role in statistical inference

In [None]:
import numpy as np
from datascience import *

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

## Coin toss

In [None]:
# Create the choices
toss = np.array(['Heads', 'Tails'])

# Simulate ten coin tosses
tosses = np.random.choice(toss, 10)

# Results
tosses

In [None]:
tosses != 'Tails'

In [None]:
# Fraction of tosses that were heads
np.count_nonzero(tosses == 'Heads')/len(tosses)

In [None]:
# Why does this also work?
sum(tosses == 'Heads')/len(tosses)
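
One way to see why both expressions agree: comparing an array to a string yields a Boolean array, and `True` counts as 1 and `False` as 0 when added. The sketch below uses a fixed array, `example_tosses`, a hypothetical name introduced here so the counts are predictable; it is not the random `tosses` array above.

```python
import numpy as np

# Hypothetical fixed tosses, chosen for illustration so the counts are known
example_tosses = np.array(['Heads', 'Tails', 'Heads', 'Heads', 'Tails'])

# Comparison gives a Boolean array with one entry per toss
is_heads = example_tosses == 'Heads'

# True counts as 1 and False as 0, so summing counts the heads
count_with_sum = sum(is_heads)
count_with_numpy = np.count_nonzero(is_heads)
print(count_with_sum, count_with_numpy)   # 3 3

# The mean of a Boolean array is the fraction of True entries
fraction_heads = np.mean(is_heads)
print(fraction_heads)                     # 0.6
```

For large arrays, `np.count_nonzero` or `np.sum` is usually faster than Python's built-in `sum`, which steps through the array one element at a time.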

DIGRESSION: Notice that these two do the same thing. "make_array" is just syntactic sugar for np.array().

In [None]:
make_array('Heads', 'Tails')

In [None]:
np.array(['Heads', 'Tails'])

### Simulate
- Simulate a set of 100 coin tosses. How many heads?
- Repeat simulation 20,000 times

In [None]:
def simulate_100_tosses():
    outcomes = np.random.choice(toss, 100)
    return np.count_nonzero(outcomes == 'Heads')

In [None]:
simulate_100_tosses()

Now we run 20,000 simulations of tossing a coin 100 times. (You can try this experiment at home with a real coin, if you have a free month or two.)

In [None]:
num_repetitions = 20000   # number of repetitions

heads = make_array() # empty collection array

for i in np.arange(num_repetitions):   # repeat the process num_repetitions times
    new_value = simulate_100_tosses()  # simulate one value using the function defined
    heads = np.append(heads, new_value) # augment the collection array with the simulated value

# That's it! The simulation is done.

In [None]:
simulation_results = Table().with_columns(
    'Repetition', np.arange(1, num_repetitions + 1),
    'Number of Heads', heads
)

In [None]:
# Each row in the table is the result of tossing a coin 100 times.
simulation_results.show(3)

In [None]:
simulation_results.hist('Number of Heads', bins = np.arange(30.5, 69.6, 1))
plt.title('Simulation of 100 Coin Tosses') # Notice the matplotlib call to add a title
plt.savefig('Simcoin.png') # This matplotlib call saves the figure into the current folder
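
The shape of the histogram can also be summarized with numbers. The sketch below is a vectorized alternative to the loop, not the method used in this notebook: it draws every toss at once as a 2-D array and counts heads in each row. The names `all_tosses` and `heads_per_run` are introduced here for illustration. For a fair coin the mean should land near 50 and the standard deviation near 5, though the exact values change on every run.

```python
import numpy as np

toss = np.array(['Heads', 'Tails'])
num_repetitions = 20000

# One row per repetition, 100 tosses per row
all_tosses = np.random.choice(toss, size=(num_repetitions, 100))

# Count the heads within each row (axis=1)
heads_per_run = np.count_nonzero(all_tosses == 'Heads', axis=1)

print("Mean number of heads:", heads_per_run.mean())
print("Standard deviation:  ", heads_per_run.std())
```

Besides giving the summary statistics, this version avoids calling `np.append` inside a loop, which copies the whole array on every repetition.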

## Thought problem
A couple getting a divorce has split up the big-ticket items. They are now looking for a fair way to divide up the smaller assets that are of roughly equal value that they don't want to sell. They decide to toss a coin for all of these items.

Explain why this is a bad idea. Can you think of a more equitable way to divide their stuff?

## Die roll betting simulation
Bet a dollar on a single die roll.

Outcomes:

    - 1 or 2: lose a dollar (-$1)
    - 3 or 4: no change ($0)
    - 5 or 6: gain a dollar (+$1)

In [None]:
# Make sure you understand how this if-else structure works in this case
def bet_on_one_roll():
    """Returns my net gain on one bet"""
    x = np.random.choice(np.arange(1, 7))  # roll a die once and record the number of spots
    if x <= 2:
        return -1
    elif x <= 4:
        return 0
    elif x <= 6:
        return 1
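
To check the branching without any randomness, the same if-elif chain can be applied to each face in turn. In this sketch, `net_gain` is a hypothetical helper introduced for illustration; it takes the face as an argument instead of rolling the die.

```python
def net_gain(x):
    """Hypothetical helper: same branching as bet_on_one_roll, for a given face x."""
    if x <= 2:        # faces 1 and 2
        return -1
    elif x <= 4:      # reached only when x > 2, so faces 3 and 4
        return 0
    elif x <= 6:      # reached only when x > 4, so faces 5 and 6
        return 1

# Apply the rule to every face of the die
gains = [net_gain(face) for face in range(1, 7)]
print(gains)                      # [-1, -1, 0, 0, 1, 1]

# Each face is equally likely, so the average is the expected net gain per bet
print(sum(gains) / len(gains))    # 0.0
```

Each `elif` is tested only when every earlier condition failed, which is why `x <= 4` does not need to repeat `x > 2`.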

In [None]:
outcomes = np.array([])

for i in np.arange(600):
    outcome_of_bet = bet_on_one_roll()
    outcomes = np.append(outcomes, outcome_of_bet)
    
print(outcomes[0:10])
len(outcomes)

In [None]:
# Notice the use of group() to return the counts of each outcome.
outcome_table = Table().with_column('Outcome', outcomes)
outcome_table.group('Outcome').barh(0)

**Run the previous two cells a couple of times.** 

Do you always get the same result? Why or why not?
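
When a repeatable result is needed, for example to compare notes with a classmate, one common approach is to seed the random number generator. This is a minimal sketch and is not used elsewhere in this notebook; the seed value 7 and the names `first_run` and `second_run` are arbitrary choices made here for illustration.

```python
import numpy as np

faces = np.arange(1, 7)

np.random.seed(7)                         # arbitrary seed, chosen for illustration
first_run = np.random.choice(faces, 10)

np.random.seed(7)                         # resetting the same seed replays the same sequence
second_run = np.random.choice(faces, 10)

print(np.array_equal(first_run, second_run))   # True
```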

# Sampling and Simulation -- The marble problem from Lab 5
If we know the probability distribution, we can simulate drawing a random sample of any size from that distribution.

*A club on campus is holding a contest. There is a bag with two red marbles, two green marbles, and two blue marbles. You have to draw three marbles separately. In order to win, all three of these marbles must be of different colors. What is the probability of you winning the contest?*

In [None]:
# Create the bag of marbles, two of each color.
marbles = ["red", "red", "blue", "blue", "green", "green"]
marbles

In [None]:
# Test out random sampling without replacement
# (np.random.choice samples WITH replacement unless replace=False is given)
np.random.choice(marbles, size=3, replace=False)

In [None]:
# Use np.unique() in combination with len() to count the number of unique colors in a draw
draw = np.random.choice(marbles, size=3, replace=False)
print(draw)
len(np.unique(draw))

In [None]:
# Test inside a loop
for i in range(5):
    sample = np.random.choice(marbles, size=3, replace=False)
    print(f"The selected marbles are: {sample}, yielding {len(np.unique(sample))} colors.")

In [None]:
# Simulate drawing one hundred thousand times
num_draws = 100000
num_wins = 0
for i in np.arange(num_draws):
    sample = np.random.choice(marbles, size=3, replace=False)
    if len(np.unique(sample)) == 3:
        num_wins += 1
print("The probability of drawing three different-colored marbles is")
print(num_wins / num_draws)
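
The simulated value can be checked against an exact count. The sketch below lists every possible set of three marbles and counts the winning ones; it assumes the three marbles are drawn without replacement, so every set of three positions in the bag is equally likely. The names `all_draws` and `winning_draws` are introduced here for illustration.

```python
from itertools import combinations
from fractions import Fraction

marbles = ["red", "red", "blue", "blue", "green", "green"]

# Every unordered set of 3 positions in the bag is equally likely
all_draws = list(combinations(range(len(marbles)), 3))

# A draw wins when its three marbles show three distinct colors
winning_draws = [d for d in all_draws if len({marbles[i] for i in d}) == 3]

exact_probability = Fraction(len(winning_draws), len(all_draws))
print(len(winning_draws), "of", len(all_draws))      # 8 of 20
print(exact_probability, float(exact_probability))   # 2/5 0.4
```

The same answer follows from multiplying step by step: the first marble can be anything, the second must differ in color (4 of the remaining 5), and the third must be the last color (2 of the remaining 4), giving 4/5 × 2/4 = 2/5.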

## Sampling and Simulation with Tables

In [None]:
# Generate a population of purple and green marbles
marbles = np.random.choice(['purple', 'green'], 100)
population = Table().with_columns('Color', marbles)
population

In [None]:
# Find the true fraction of the population of marbles that is purple.
population.where('Color','purple').num_rows/population.num_rows

In [None]:
# Take a sample of ten marbles
sample = population.sample(10)
sample

In [None]:
# Is the fraction of purple marbles in the sample the same as in the population?
sample.where('Color','purple').num_rows/sample.num_rows

In [None]:
# Simulate taking a sample of 10 marbles 1000 times
outcomes = np.array([])

for i in np.arange(1000):
    outcome = population.sample(10).where('Color','purple').num_rows/10
    outcomes = np.append(outcomes, outcome)
    
print(outcomes[0:10])
len(outcomes)

In [None]:
# Is the mean of the simulated sample fractions equal to the true population fraction?
# Do you think it would converge on the true value with more simulations?
outcomes.mean()
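
One way to explore the convergence question is to repeat the experiment with different numbers of simulations and compare each mean with the true fraction. This is a NumPy-only sketch, not the Table-based code above: it builds its own hypothetical population, `population_colors`, and assumes each sample of 10 is drawn with replacement. The helper `mean_sample_fraction` is introduced here for illustration.

```python
import numpy as np

# Hypothetical population: 100 marbles, each color picked at random
population_colors = np.random.choice(['purple', 'green'], 100)
true_fraction = np.mean(population_colors == 'purple')

def mean_sample_fraction(num_simulations, sample_size=10):
    """Average purple fraction over many samples drawn with replacement."""
    fractions = np.array([
        np.mean(np.random.choice(population_colors, sample_size) == 'purple')
        for _ in np.arange(num_simulations)
    ])
    return fractions.mean()

print("True fraction:", true_fraction)
for n in [10, 100, 1000, 10000]:
    print(n, "simulations:", mean_sample_fraction(n))
```

The printed values differ from run to run, but the means for larger numbers of simulations tend to sit closer to the true fraction.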

In [None]:
outcome_table = Table().with_column('Outcome', outcomes)
outcome_table

In [None]:
outcome_table.hist('Outcome')