# Randomness and Code Review
Elements of Data Science

## Simulation Learning Goals
Simulate a task that depends on probability, such as a die roll, then repeat it to obtain the distribution of outcomes and its characteristics (mean, ...)
- Probability
    - np.random.choice()
- Simulation: Sample the distribution
    - Repeat and collect outcomes
    - Iteration: 
        `for i in np.arange(samples)`
- Examine resulting distribution of outcomes
    - Probability distribution
    - Uncertainty
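
The goals above can be sketched in a few lines. This is only a minimal illustration, assuming a fair six-sided die; the names `die_faces`, `samples`, and `outcomes` are placeholders chosen for this sketch, and the sample count is arbitrary.

```python
import numpy as np

die_faces = np.arange(1, 7)   # possible outcomes of one roll: 1..6
samples = 10000               # number of simulated rolls (arbitrary choice)

outcomes = np.zeros(samples)  # preallocated array to collect the results
for i in np.arange(samples):
    outcomes[i] = np.random.choice(die_faces)  # sample the distribution once

# Characteristics of the simulated distribution
print("mean:", np.mean(outcomes))  # should land near 3.5 for a fair die
print("std: ", np.std(outcomes))
```

Because the rolls are random, the printed values change slightly from run to run; that run-to-run variation is the uncertainty listed in the goals.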

In [None]:
import numpy as np
from datascience import *

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

## Coding review

#### Data types
Important also because different types have different methods
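
As a small illustration of that point, the sketch below compares a string with a NumPy array. The variable names `word` and `rolls` are made up for this example.

```python
import numpy as np

word = "Groundhog"        # a Python string
rolls = np.arange(1, 7)   # a NumPy array holding 1..6

print(type(word), word.upper())   # strings have text methods such as .upper()
print(type(rolls), rolls.mean())  # arrays have numeric methods such as .mean()

# A method that exists on one type is usually missing on the other
print(hasattr(word, "mean"))    # False: strings have no .mean()
print(hasattr(rolls, "upper"))  # False: arrays have no .upper()
```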

In [None]:
x = "Groundhog"
type(x)

In [None]:
x = np.random.choice(np.arange(1,7))
x

In [None]:
type(x)

In [None]:
x = Table().with_columns('data', np.random.choice(np.arange(1,7)))
print(type(x))
x

#### Iteration
Loop through lines of code to repeat a process.
Note: the "for" statement works with anything iterable. For a string, it iterates over the letters. Technically, an iterable is any object with an `.__iter__()` method that returns an iterator (sequence-style objects that define `.__getitem__()` also work).
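
To make the idea concrete, here is a rough sketch of what a `for` loop does behind the scenes, written out by hand with the built-in `iter()` and `next()` functions. It illustrates the protocol and is not how you would normally write a loop; the names `iterator` and `letters` are placeholders.

```python
fruit = "apple"

# What `for letter in fruit:` does, written out step by step
letters = []
iterator = iter(fruit)            # same as fruit.__iter__()
while True:
    try:
        letter = next(iterator)   # same as iterator.__next__()
    except StopIteration:         # raised when nothing is left
        break
    letters.append(letter)

print(letters)
```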

In [None]:
fruit = "apple"
fruit_iter = fruit.__iter__()
print(fruit_iter.__next__())
print(fruit_iter.__next__())

In [None]:
type(fruit_iter)

In [None]:
names = []
count = 0
for letter in 'Dogs':
    count += 1
    names.append(letter)
    print(letter)
print(count, names)

In [None]:
my_favs = ["apples", "grapes", "mangos"]
my_favs_iter = my_favs.__iter__()
print(my_favs_iter.__next__())
print(my_favs_iter.__next__())

In [None]:
names = []
count = 0
for fruit in my_favs:
    count += 1
    names.append(fruit)
    print(fruit)
print(count, names)

#### Conditional

In [None]:
best_animal = 'Groundhog'
for animal in make_array('Horse', 'Dog', 'Groundhog'):
    print(f"---{animal}---")
    if animal == best_animal:
        print("Best animal found!")
    elif animal == 'Horse':
        print("I like horses.")
    else:
        print("Not my favorite animal...")

## A different Bernoulli Trial

A Bernoulli trial is a random experiment with only two possible outcomes. All our examples so far have involved flipping a coin. Let's try a different one: after you shuffle a deck of cards, what is the chance the top card is an ace?

Chance of success = 4/52 = 1/13

Chance of failure = 48/52 = 12/13

So, if we shuffled the deck 100 times, we'd expect 100/13 ≈ 7.69 aces on average, consistent with a probability of 1/13 ≈ 0.077.
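
The arithmetic can be checked directly. The sketch below also computes the typical spread around that average using the standard binomial formula sqrt(n * p * (1 - p)); that formula is an addition for illustration and is not derived in these notes.

```python
import numpy as np

p_ace = 4 / 52     # chance the top card is an ace
n_shuffles = 100   # number of shuffles

expected_aces = n_shuffles * p_ace                   # n * p
sd_aces = np.sqrt(n_shuffles * p_ace * (1 - p_ace))  # sqrt(n * p * (1 - p))

print(round(expected_aces, 2))  # 7.69
print(round(sd_aces, 2))        # 2.66
```

So a single run of 100 shuffles will often give a count a few aces above or below 7.69.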

In [None]:
card = np.array(['Ace', 'King', 'Queen', 'Jack', 'Ten', 'Nine', 'Eight', 'Seven', 'Six', 'Five', 'Four', 'Three', 'Two'])

def simulate_shuffles(n_shuffles):
    outcomes = np.random.choice(card, n_shuffles)
    return np.count_nonzero(outcomes == 'Ace')

In [None]:
n = 100
simulate_shuffles(n)

In [None]:
num_repetitions = 20000   # number of repetitions

draws = make_array() # empty collection array

n = 100
for i in np.arange(num_repetitions):   # repeat the process num_repetitions times
    new_value = simulate_shuffles(n)  # simulate one value using the function defined
    draws = np.append(draws, new_value) # augment the collection array with the simulated value

# That's it! The simulation is done.

In [None]:
simulation_results = Table().with_columns(
    'Repetition', np.arange(1, num_repetitions + 1),
    'Fraction Aces', draws/n
)

simulation_results.hist('Fraction Aces', bins=20)
plt.title(f'Simulation of {n} Shuffles') # Notice the matplotlib call to add a title
ax = plt.gca()
ax.set_xlim((0.0, 0.17))

In [None]:
num_repetitions = 20000   # number of repetitions
draws = make_array() # empty collection array

n = 1000
for i in np.arange(num_repetitions):   # repeat the process num_repetitions times
    new_value = simulate_shuffles(n)  # simulate one value using the function defined
    draws = np.append(draws, new_value) # augment the collection array with the simulated value

simulation_results = Table().with_columns(
    'Repetition', np.arange(1, num_repetitions + 1),
    'Fraction Aces', draws/n
)

simulation_results.hist('Fraction Aces', bins=20)
plt.title(f'Simulation of {n} Shuffles') # Notice the matplotlib call to add a title
ax = plt.gca()
ax.set_xlim((0.0, 0.17))

### Key Point
As the sample size increases, the standard deviation of the sample fraction decreases, so the histogram becomes narrower. The sample mean becomes a better estimate of the population mean.
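
A quick numerical check of this point is sketched below. As a shortcut, and an assumption of this sketch rather than the method used in the cells above, the ace counts are drawn directly from a binomial distribution with a seeded NumPy generator. The theoretical spread sqrt(p * (1 - p) / n) is printed for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
p_ace = 1 / 13
num_repetitions = 20000

spread = {}
for n in (100, 1000):
    # Shortcut: draw the number of aces in n shuffles in one call
    ace_counts = rng.binomial(n, p_ace, size=num_repetitions)
    spread[n] = np.std(ace_counts / n)         # spread of the fraction of aces
    theory = np.sqrt(p_ace * (1 - p_ace) / n)  # binomial standard error
    print(n, round(spread[n], 4), round(theory, 4))
```

Multiplying the sample size by 10 shrinks the spread by a factor of about sqrt(10) ≈ 3.2, not by 10.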

## Random numbers and simulation - a fun digression!

Suppose we start at (x,y) = (0,0). We flip a coin. If heads, we add 1 to y; if tails, we subtract 1 from y. Either way, we add 1 to x. What is our position after 1,000 flips? You might think that the y-value would always fluctuate around zero, but in fact it will drift quite a bit over time. This is the so-called "random walk" or "drunkard's walk", as the drunk staggers left or right with each step. (Figure from: https://people.duke.edu/~rnau/411rand.htm)
![Random Walk](https://people.duke.edu/~rnau/411rand_files/image008.jpg)

In [None]:
toss = make_array('Heads', 'Tails')

nflips = 1000
x = np.arange(nflips)
y = make_array(0)
y_current = 0
for i in x[1:]:
    flip = np.random.choice(toss)
    if flip == 'Heads':
        y_current += 1
    else:
        y_current -= 1
    y = np.append(y, y_current)
    
random_walk = Table().with_columns('x', x, 'y', y)
random_walk.plot('x', 'y')
fig = plt.gcf()
fig.set_figwidth(15)

Try running the cell above a number of times to see the different random pathways.

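Interestingly, although you cannot predict where a random walk will go, you can provide a confidence interval for its future position. The standard error of a k-step-ahead forecast is:
$$ \sigma \sqrt{k} $$
where $\sigma$ is the standard deviation of a single step. For the ±1 coin-flip steps used here, $\sigma = 1$, and a 50% confidence interval is roughly $\pm \tfrac{2}{3}\sqrt{k}$, which is what the next cell computes.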
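
As a rough check of the square-root growth, the sketch below simulates many independent ±1 coin-flip walks and looks at where they end up after k steps. It assumes each step has standard deviation 1, so the spread should be close to sqrt(k), and about half of the walks should finish within 0.674 * sqrt(k) of the start. The names `n_walks`, `final_y`, and `inside_50` are placeholders for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the sketch is repeatable
k = 500                         # number of steps ahead
n_walks = 10000                 # number of independent walks (arbitrary)

# Each step is +1 or -1 with equal chance, like the coin flip above
steps = rng.integers(0, 2, size=(n_walks, k), dtype=np.int8) * 2 - 1
final_y = steps.sum(axis=1)     # position after k steps for every walk

empirical_se = np.std(final_y)  # spread of the final positions
inside_50 = np.mean(np.abs(final_y) <= 0.674 * np.sqrt(k))

print(round(empirical_se, 2), round(np.sqrt(k), 2))  # the two should be close
print(round(inside_50, 2))                           # should be near 0.5
```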

In [None]:
n_confidence_steps = 500 # Project ahead this many steps
x_confidence = np.arange(n_confidence_steps) + nflips

# 50% confidence interval
y_confidence_pos = y[-1] +  2/3 * np.sqrt(np.arange(n_confidence_steps))
y_confidence_neg = y[-1] -  2/3 * np.sqrt(np.arange(n_confidence_steps))

In [None]:
random_walk.plot('x', 'y')
fig = plt.gcf()
fig.set_figwidth(15)
plt.plot(x_confidence, y_confidence_pos, color='red')
plt.plot(x_confidence, y_confidence_neg, color='red')
plt.title("Red lines show the 50% confidence interval");