# 6.0002 Lecture 4: Stochastic Thinking

**Speaker:** Prof. John Guttag

## The World is hard to understand
- uncertainty is uncomfortable
- but certainty is usually unjustified

## Newtonian Mechanics
- every effect has a cause
    - e.g., an apple falls from a tree because of gravity
- the world can be understood causally

## Copenhagen Doctrine
- Copenhagen Doctrine (Bohr and Heisenberg) of **causal nondeterminism**
    - at its most fundamental level, the behavior of the physical world cannot be predicted
    - fine to make statements of the form "x is highly likely to occur," but NOT of the form "x is certain to occur."
- Einstein and Schrödinger objected
    - "God does not play dice." -- Albert Einstein

## Does it really matter?
- Take 2 coins, and flip them
- did the flips yield
    - 2 heads?
    - 2 tails?
    - 1 head and 1 tail?

## The Moral
- the world may or may not be inherently unpredictable
- but our lack of knowledge does not allow us to make accurate predictions
- therefore we might as well treat the world as inherently unpredictable
- **predictive nondeterminism**

## Stochastic Processes
- an ongoing process where the next state might depend on both the previous states and **some random element**
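
A minimal sketch of this definition, assuming a hypothetical process whose state is a running total of die rolls (the names `next_state` and `run_process` are illustrative and not from the lecture). Each new state depends on the previous state plus a random element:

```python
import random

def next_state(state, rng):
    """Next state = previous state + a random die roll (the random element)."""
    return state + rng.randint(1, 6)

def run_process(num_steps, seed=None):
    """Return the list of states visited, starting from 0."""
    rng = random.Random(seed)  # private generator, so passing a seed makes runs repeatable
    states = [0]
    for _ in range(num_steps):
        states.append(next_state(states[-1], rng))
    return states

print(run_process(5))
```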

In [1]:
def rollDie():
    """returns an int between 1 and 6"""

In [2]:
# specify the requirement of a stochastic implementation
def rollDie():
    """returns a randomly chosen int between 1 and 6"""

## Implementing a random process

In [5]:
import random

def rollDie():
    """returns a random int between 1 and 6"""
    return random.choice([1, 2, 3, 4, 5, 6]) # uniform distribution

def testRoll(n=10):
    result=''
    for i in range(n):
        result = result + str(rollDie())
    print(result)

In [6]:
testRoll()

1551353626


## Probability of Various Results
- consider testRoll(5)
- how probable is the output 11111?

## Probability is about counting
- count the number of possible events
- count the number of events that have the property of interest
- divide latter by former
- probability of 11111?
    - 11111, 11112, 11113, ..., 11121, 11122, ..., 66666
    - $\frac{1}{6^5} \approx 0.0001286$
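
The counting recipe above can be checked by brute force. This is a sketch that enumerates every possible sequence of five rolls with `itertools.product`; the variable names are illustrative:

```python
from itertools import product

faces = '123456'
# every possible sequence of 5 rolls, as strings like '11111', '11112', ...
outcomes = [''.join(rolls) for rolls in product(faces, repeat=5)]
# the events that have the property of interest
matching = [o for o in outcomes if o == '11111']

print(len(outcomes))                  # number of possible events
print(len(matching))                  # number of events with the property
print(len(matching) / len(outcomes))  # divide the latter by the former
```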

## Three basic facts about probability
- probabilities are always in the range **0 to 1**.
    - 0 if impossible
    - 1 if guaranteed
- if the probability of an event occurring is $p$, the probability of it NOT occurring must be $1-p$
- when events are **independent** of each other, the probability of all of the events occurring is equal to a **product** of the probabilities of each of the events occurring
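
A short sketch of the three facts applied to fair die rolls, using exact fractions so no rounding is involved. The variable names are illustrative, and the rolls are assumed to be independent:

```python
from fractions import Fraction

p_one = Fraction(1, 6)            # probability that a single fair roll shows 1

# Fact 2: the complement rule
p_not_one = 1 - p_one             # probability that a single roll is not 1

# Fact 3: probabilities of independent events multiply
p_five_ones = p_one ** 5          # probability of 11111
p_no_ones_in_five = p_not_one ** 5
p_at_least_one_one = 1 - p_no_ones_in_five

# Fact 1: every probability lies between 0 and 1
for p in (p_one, p_not_one, p_five_ones, p_at_least_one_one):
    assert 0 <= p <= 1

print(p_five_ones, p_at_least_one_one)
```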

## Independence
- two events are **independent** if the outcome of one event has no influence on the outcome of the other
- independence should not be taken for granted

## Will one of the Patriots and Broncos lose?
- Patriots have winning percentage of 7/8, Broncos of 6/8
- probability of both winning next Sunday is 7/8 * 6/8 = 42/64
- probability of at least one losing is 1 - 42/64 = 22/64
- what about Sunday, December 18?
    - then the Patriots are playing the Broncos
    - outcomes are not independent
    - probability of one of them losing is much closer to 1 than to 22/64
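
The arithmetic for the independent case can be reproduced with exact fractions and cross-checked by simulation. This is a sketch only: `simulate_at_least_one_loses` is a hypothetical helper, and it deliberately models the two games as independent, so it says nothing about the week the teams play each other.

```python
from fractions import Fraction
import random

p_patriots = Fraction(7, 8)
p_broncos = Fraction(6, 8)

# Valid only when the two games are independent (the teams play different opponents)
p_both_win = p_patriots * p_broncos
p_at_least_one_loses = 1 - p_both_win
print(p_both_win, p_at_least_one_loses)  # Fraction reduces 42/64 and 22/64 to lowest terms

def simulate_at_least_one_loses(num_trials, rng):
    """Monte Carlo estimate for the independent case."""
    hits = 0
    for _ in range(num_trials):
        patriots_win = rng.random() < 7/8
        broncos_win = rng.random() < 6/8
        if not (patriots_win and broncos_win):
            hits += 1
    return hits / num_trials

print(simulate_at_least_one_loses(100000, random.Random(0)))
```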

## A simulation of die rolling

In [25]:
def runSim(goal, numTrials, txt):
    """Estimate by simulation the probability of rolling the sequence goal"""
    total = 0
    for i in range(numTrials):
        # build one trial: a string of len(goal) die rolls
        result = ''
        for j in range(len(goal)):
            result += str(rollDie())
        if result == goal:
            total += 1
    print('Actual probability of', txt, '=', round(1/(6**len(goal)), 8))
    estProbability = round(total/numTrials, 8)
    print('Estimated probability of', txt, '=', estProbability)

runSim('11111', 1000, '11111')

Actual probability of 11111 = 0.0001286
Estimated probability of 11111 = 0.0


## Output of Simulation
- actual probability: 0.0001286
- estimated probability: 0.0
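- how did we **know** that this is what would get printed?
    - truth: nothing that can be done on a computer is *actually* random
    - instead, computers generate numbers called **pseudorandom**
        - an algorithm produces the sequence, starting from a seed usually obtained by reading the computer's clock
        - setting the seed, e.g. random.seed(0), makes the sequence (and so the output) reproducible
- why did the simulation give the **wrong** answer?
    - why is the estimated probability 0.0?
    - the event is rare: with probability $\frac{1}{6^5}$ per trial, 1,000 trials will most likely contain no occurrence at all
- let's try 1,000,000 trials
    - more trials make the sample probability more accurate, which matters because the event is quite rare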
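
The effect of setting the seed can be seen directly. In this sketch, `roll_sequence` is a hypothetical helper that reseeds the generator and then rolls a die `n` times; the same seed yields the same "random" sequence every time:

```python
import random

def roll_sequence(n, seed):
    """Return n die rolls as a string, after seeding the generator."""
    random.seed(seed)
    return ''.join(str(random.choice([1, 2, 3, 4, 5, 6])) for _ in range(n))

first = roll_sequence(10, 0)
second = roll_sequence(10, 0)
print(first == second)  # prints True: same seed, same sequence
```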

In [27]:
random.seed(0)

runSim('11111', 10**6, '11111')

Actual probability of 11111 = 0.0001286
Estimated probability of 11111 = 0.000128


## Morals
- Moral 1: It takes a lot of trials to get a good estimate of the frequency of occurrence of a rare event. We'll talk lots more in later lectures about how to know when we have enough trials
- Moral 2: One should not confuse the **sample probability** with the actual probability
- Moral 3: There was really no need to do this by simulation, since there is a perfectly good closed form answer. We will see many examples where this is not true
- But simulations are often useful
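
Moral 1 can be made concrete with the product rule for independent events. In this sketch, `prob_no_hits` is a hypothetical helper. With 1,000 trials the expected number of occurrences of 11111 is about 0.13, and the chance of seeing none at all is roughly 0.88, so an estimate of 0.0 is the most likely result.

```python
p = 1 / 6**5  # actual probability of 11111

def prob_no_hits(num_trials):
    """Probability that an event of probability p never occurs
    in num_trials independent trials."""
    return (1 - p) ** num_trials

for n in (1000, 10**6):
    print(n, 'trials: expected hits =', round(n * p, 2),
          ', P(zero hits) =', round(prob_no_hits(n), 4))
```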

## The Birthday Problem
- what's the probability of at least two people in a group having the same birthday
- if there are 367 people in the group?
    - 1, by pigeonhole principle (more people than days in a year)
- what about smaller numbers?
- if we assume that each birthdate is equally likely,
    - probability of at least 2 people sharing a birthday: $1-\frac{366!}{366^N(366-N)!}$
- without this assumption, VERY complicated
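
The closed form follows from counting, assuming $N \le 366$ people and 366 equally likely birthdays. The first person can have any birthday, the second must avoid one date, the third must avoid two, and so on, so the probability that all $N$ birthdays are distinct is

$$
P(\text{all distinct}) = \frac{366}{366}\cdot\frac{365}{366}\cdots\frac{366-N+1}{366} = \frac{366!}{366^N\,(366-N)!}
$$

and the probability that at least two people share a birthday is its complement:

$$
P(\text{at least two share}) = 1 - \frac{366!}{366^N\,(366-N)!}
$$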

## Approximating using a simulation

In [31]:
# function to return True if at least numSame people share some birthday
def sameDate(numPeople, numSame):
    possibleDates = range(366)
    birthdays = [0]*366
    for p in range(numPeople):
        birthDate = random.choice(possibleDates)
        birthdays[birthDate] += 1
    return max(birthdays) >= numSame

# function to estimate the probability that at least numSame people in a group share a birthday
def birthdayProb(numPeople, numSame, numTrials):
    numHits = 0
    for t in range(numTrials):
        if sameDate(numPeople, numSame):
            numHits += 1
    return numHits/numTrials

In [33]:
import math

# compare estimated and actual probability of a shared birthday for different group sizes
for numPeople in [10, 20, 40, 100]:
    print('For', numPeople, 'people, est. prob of a shared birthday is', 
         birthdayProb(numPeople, 2, 10000))
    numerator = math.factorial(366)
    denom = (366**numPeople)*math.factorial(366 - numPeople)
    print('Actual prob. for', numPeople, 'people is', 1 - numerator/denom)

For 10 people, est. prob of a shared birthday is 0.1125
Actual prob. for 10 people is 0.1166454118039999
For 20 people, est. prob of a shared birthday is 0.4121
Actual prob. for 20 people is 0.4105696370550831
For 40 people, est. prob of a shared birthday is 0.8916
Actual prob. for 40 people is 0.89054476188945
For 100 people, est. prob of a shared birthday is 1.0
Actual prob. for 100 people is 0.9999996784357714


- suppose we want the probability of 3 people sharing...

## Why 3 is much harder mathematically
- for 2 people, the complementary problem is "all birthdays distinct"
- for 3 people, the complementary problem is a complicated disjunction:
    - all birthdays distinct, or
    - one pair and rest distinct, or
    - two pairs and rest distinct, or
    - ...
- but changing the simulation is dead easy!

In [34]:
# change simulation for case where 3 people share a birthday (numSame = 3)
# the mathematical calculation below is still the formula for 2 people, so it is incorrect here
for numPeople in [10, 20, 40, 100]:
    print('For', numPeople, 'people, est. prob of a shared birthday is', 
         birthdayProb(numPeople, 3, 10000))
    numerator = math.factorial(366)
    denom = (366**numPeople)*math.factorial(366 - numPeople)
    print('Actual prob. for', numPeople, 'people is', 1 - numerator/denom) # (incorrect)

For 10 people, est. prob of a shared birthday is 0.0009
Actual prob. for 10 people is 0.1166454118039999
For 20 people, est. prob of a shared birthday is 0.0065
Actual prob. for 20 people is 0.4105696370550831
For 40 people, est. prob of a shared birthday is 0.0649
Actual prob. for 40 people is 0.89054476188945
For 100 people, est. prob of a shared birthday is 0.6504
Actual prob. for 100 people is 0.9999996784357714


## But not all dates are equally likely
- in reality, this assumption does not quite match the data
- check out this heat map: 
    - https://www.vizwiz.com/2012/05/how-common-is-your-birthday-find-out.html

## Another win for simulation
- adjusting analytic model a pain
- adjusting simulation model easy

In [37]:
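# new function with some possible dates repeated to make them more likely:
# most days appear 4 times, day 58 appears once (a rare date),
# and days 180-269 appear 4 extra times (8 in total)
# note: as written, day 57 never appears in possibleDates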
def sameDate(numPeople, numSame):
    possibleDates = 4*list(range(0,57)) + [58]\
                    + 4*list(range(59, 366))\
                    + 4*list(range(180, 270))
    birthdays = [0]*366
    for p in range(numPeople):
        birthDate = random.choice(possibleDates)
        birthdays[birthDate] += 1
    return max(birthdays) >= numSame

In [38]:
for numPeople in [10, 20, 40, 100]:
    print('For', numPeople, 'people, est. prob of a shared birthday is', 
         birthdayProb(numPeople, 2, 10000))
    numerator = math.factorial(366)
    denom = (366**numPeople)*math.factorial(366 - numPeople)
    print('Actual prob. for', numPeople, 'people is', 1 - numerator/denom) # (now incorrect)

For 10 people, est. prob of a shared birthday is 0.1318
Actual prob. for 10 people is 0.1166454118039999
For 20 people, est. prob of a shared birthday is 0.448
Actual prob. for 20 people is 0.4105696370550831
For 40 people, est. prob of a shared birthday is 0.9165
Actual prob. for 40 people is 0.89054476188945
For 100 people, est. prob of a shared birthday is 1.0
Actual prob. for 100 people is 0.9999996784357714
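
Unequal date likelihoods can also be expressed with explicit weights instead of a list with repeated entries. The sketch below uses `random.choices` with a `weights` argument. The weights are assumptions chosen to resemble the repeated-list idea (a baseline of 4 per day, 1 for one rare day, and double weight for days 180-269), not an exact reproduction of it, and the function names are hypothetical.

```python
import random

# Assumed weights: baseline 4 per day, one rare day (index 58) at 1,
# and days 180-269 at twice the baseline.
weights = [4] * 366
weights[58] = 1
for day in range(180, 270):
    weights[day] += 4

def same_date_weighted(num_people, num_same, rng=random):
    """Return True if at least num_same of num_people share a birthday."""
    dates = rng.choices(range(366), weights=weights, k=num_people)
    counts = [0] * 366
    for d in dates:
        counts[d] += 1
    return max(counts) >= num_same

def birthday_prob_weighted(num_people, num_same, num_trials, rng=random):
    """Estimate the probability of a shared birthday under the assumed weights."""
    hits = sum(same_date_weighted(num_people, num_same, rng)
               for _ in range(num_trials))
    return hits / num_trials

print(birthday_prob_weighted(20, 2, 10000))
```

Changing the assumed distribution then only means editing the `weights` list; the rest of the simulation stays the same.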


## Simulation Models
- a description of computations that provide useful information about the possible behaviors of the system being modeled
- descriptive, not prescriptive
    - describe possible outcomes; don't tell you how to achieve possible outcomes 
    - whereas optimization model is prescriptive
- only an approximation to reality
- "All models are wrong, but some are useful." - George Box

## Simulations are used a lot
- to model systems that are mathematically intractable
- to extract useful intermediate results
- lend themselves to development by successive refinement and "what if" questions
- start by simulating random walks
    - for next lecture