## Inferential statistics

### Inferential Statistics
 - Population : a set of examples
 - Sample : a proper subset of a population
 - Goal : Estimate some statistic about the population based on statistics about the sample
 - Key fact : If the sample is random, it tends to exhibit the same properties as the population from which it is drawn
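
The key fact above can be checked with a small experiment. The sketch below builds a synthetic population (uniform values in [0, 100), an assumption made only for illustration), draws a random sample from it, and compares the two means. The function name `sample_mean_demo` and its parameters are hypothetical and not part of the lecture code.

```python
import random

def sample_mean_demo(pop_size=100_000, sample_size=1_000, seed=0):
    """Return (population mean, sample mean) for a synthetic population."""
    rng = random.Random(seed)
    # Hypothetical population: values drawn uniformly from [0, 100)
    population = [rng.uniform(0, 100) for _ in range(pop_size)]
    # Random sample without replacement, so it is a proper subset
    sample = rng.sample(population, sample_size)
    pop_mean = sum(population) / len(population)
    samp_mean = sum(sample) / len(sample)
    return pop_mean, samp_mean

pop_mean, samp_mean = sample_mean_demo()
print(f"population mean = {pop_mean:.2f}, sample mean = {samp_mean:.2f}")
```

With a random sample of this size the two means should land close together, even though the sample holds only 1% of the population.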

### Why the Difference in Confidence?
 - Confidence in our estimate depends upon two things
     - Size of sample
     - Variance of sample
 - As the variance grows, we need larger samples to have the same degree of confidence
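
A rough way to see both effects is to draw many samples and measure how much their means vary. The sketch below assumes normally distributed data with mean 0, chosen only for convenience; `spread_of_sample_means` is a hypothetical helper, not something from the lecture modules.

```python
import random
import statistics

def spread_of_sample_means(sigma, sample_size, num_samples=200, seed=0):
    """Std. dev. of the means of many samples drawn from N(0, sigma)."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(0, sigma) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return statistics.pstdev(means)

for sigma in (1, 10):
    for n in (10, 1000):
        spread = spread_of_sample_means(sigma, n)
        print(f"sigma={sigma:>2}, n={n:>4}: spread of sample means = {spread:.3f}")
```

Larger samples shrink the spread of the sample means, while a larger `sigma` widens it, so a high-variance population needs more data to reach the same spread.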

In [2]:
from lecture7_segment1 import *

In [3]:
numSpins = 1
game = FairRoulette()
playRoulette(game, numSpins)

1 spins of Fair Roulette
Expected return betting red = -100.0%
Expected return betting black = 100.0%
Expected return betting 2 = -100.0%



(-1.0, 1.0, -1.0)

In [4]:
numSpins = 1
game = FairRoulette()
playRoulette(game, numSpins)

1 spins of Fair Roulette
Expected return betting red = 100.0%
Expected return betting black = -100.0%
Expected return betting 2 = -100.0%



(1.0, -1.0, -1.0)

In [5]:
numSpins = 100
game = FairRoulette()
playRoulette(game, numSpins)

100 spins of Fair Roulette
Expected return betting red = 12.0%
Expected return betting black = -12.0%
Expected return betting 2 = -28.0%



(0.12, -0.12, -0.28)

In [6]:
numSpins = 100
game = FairRoulette()
playRoulette(game, numSpins)

100 spins of Fair Roulette
Expected return betting red = 4.0%
Expected return betting black = -4.0%
Expected return betting 2 = -64.0%



(0.04, -0.04, -0.64)

In [7]:
numSpins = 10000
game = FairRoulette()
playRoulette(game, numSpins)

10000 spins of Fair Roulette
Expected return betting red = -0.7%
Expected return betting black = 0.7%
Expected return betting 2 = 2.24%



(-0.007, 0.007, 0.0224)

In [8]:
numSpins = 1000000
game = FairRoulette()
playRoulette(game, numSpins)

1000000 spins of Fair Roulette
Expected return betting red = -0.0002%
Expected return betting black = 0.0002%
Expected return betting 2 = 0.1376%



(-2e-06, 2e-06, 0.001376)

In [9]:
numSpins = 1000000
game = FairRoulette()
playRoulette(game, numSpins)

1000000 spins of Fair Roulette
Expected return betting red = -0.0192%
Expected return betting black = 0.0192%
Expected return betting 2 = 0.4616%



(-0.000192, 0.000192, 0.004616)

In [10]:
numSpins = 1000000
game = FairRoulette()
playRoulette(game, numSpins)

1000000 spins of Fair Roulette
Expected return betting red = 0.167%
Expected return betting black = -0.167%
Expected return betting 2 = -0.6292%



(0.00167, -0.00167, -0.006292)

### Law of Large Numbers (LLN)
- In repeated independent tests with the same actual probability *p* of a particular outcome in each test, the chance that the fraction of times that outcome occurs differs from *p* by more than any fixed margin converges to zero as the number of trials goes to infinity.
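
A minimal sketch of the law, independent of the roulette classes used in the cells above: repeat a test that succeeds with probability `p` and watch the observed fraction approach `p`. The helper `fraction_of_successes` is a hypothetical name introduced here.

```python
import random

def fraction_of_successes(p, num_trials, rng):
    """Fraction of num_trials independent tests that succeed with probability p."""
    successes = sum(1 for _ in range(num_trials) if rng.random() < p)
    return successes / num_trials

rng = random.Random(0)
p = 0.5  # e.g. red on a wheel with equally many red and black pockets
for n in (10, 1_000, 100_000):
    frac = fraction_of_successes(p, n, rng)
    print(f"n={n:>7}: fraction={frac:.4f}, |fraction - p|={abs(frac - p):.4f}")
```

The gap is not guaranteed to shrink at every step, but large gaps become increasingly unlikely as `n` grows.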

### Gambler's Fallacy
- The mistaken belief that, if deviations from expected behavior occur, these deviations are likely to be evened out by opposite deviations in the future

- Probability of 15 consecutive reds : 1/32768
- Probability of 25 consecutive reds : 1/33554432
- Probability of 26 consecutive reds : 1/67108864
- Probability of 26 consecutive reds when the previous 25 spins were red : **1/2**
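
These probabilities follow from independence: each spin is red with probability 1/2 on a wheel with equally many red and black pockets, so a run of *n* reds has probability (1/2)^n. The sketch below works them out with exact fractions; `prob_consecutive_reds` is a hypothetical helper name.

```python
from fractions import Fraction

P_RED = Fraction(1, 2)  # assumes only red and black pockets, equally many

def prob_consecutive_reds(n):
    """Probability that n independent spins all come up red."""
    return P_RED ** n

for n in (15, 25, 26):
    print(f"{n} consecutive reds: {prob_consecutive_reds(n)}")

# Conditional probability P(26 reds | first 25 were red):
# the earlier spins cancel out, leaving the probability of one red spin.
conditional = prob_consecutive_reds(26) / prob_consecutive_reds(25)
print("26th red given 25 reds:", conditional)
```

The wheel has no memory, so the long run of reds already observed does not change the odds of the next spin.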

### Regression to the Mean
- Following an extreme random event, the next random event is likely to be less extreme
- If you spin a fair roulette wheel 10 times and get 100% reds, that is an extreme event (probability = 1/1024)
- It is likely that in the next 10 spins, you will get fewer than 10 reds
- So, if you look at the average of the 20 spins, it will be closer to the expected mean of 50% reds than to the 100% you saw in the first 10 spins
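
The effect can be simulated. Because 10 reds out of 10 is rare, the sketch below relaxes "extreme" to at least 8 reds in the first 10 spins (an assumption made so that enough cases show up), then looks at the following 10 spins. All names here are hypothetical.

```python
import random

def count_reds(rng, num_spins=10):
    """Number of reds in num_spins spins of a wheel where P(red) = 1/2."""
    return sum(rng.random() < 0.5 for _ in range(num_spins))

def regression_demo(num_trials=50_000, threshold=8, seed=0):
    """Average reds in the first and the next 10 spins, given an extreme start."""
    rng = random.Random(seed)
    first_counts, next_counts = [], []
    for _ in range(num_trials):
        first, following = count_reds(rng), count_reds(rng)
        if first >= threshold:  # keep only runs with an extreme start
            first_counts.append(first)
            next_counts.append(following)
    avg_first = sum(first_counts) / len(first_counts)
    avg_next = sum(next_counts) / len(next_counts)
    return avg_first, avg_next

avg_first, avg_next = regression_demo()
print(f"first 10 spins (extreme): {avg_first:.2f} reds on average")
print(f"next 10 spins:            {avg_next:.2f} reds on average")
print(f"all 20 spins:             {(avg_first + avg_next) / 20:.1%} reds")
```

The next 10 spins average about 5 reds, not fewer. They do not compensate for the extreme start; they are simply ordinary, which pulls the 20-spin average back toward 50%.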

### Variation in Data

In [12]:
from lecture7_segment2 import *

In [14]:
random.seed(0)
numTrials = 20
resultDict = {}
games = (FairRoulette, EuRoulette, AmRoulette)
for G in games:
    resultDict[G().__str__()] = []
for numSpins in (100, 1000, 10000, 100000):
    print('\nSimulate betting a pocket for', numTrials,
          'trials of',
          numSpins, 'spins each')
    for G in games:
        pocketReturns = findPocketReturn(G(), numTrials,
                                         numSpins, False)
        print('Exp. return for', G(), '=',
             str(100*sum(pocketReturns)/float(len(pocketReturns))) + '%')


Simulate betting a pocket for 20 trials of 100 spins each
Exp. return for Fair Roulette = 6.199999999999998%
Exp. return for European Roulette = -8.200000000000001%
Exp. return for American Roulette = 2.599999999999998%

Simulate betting a pocket for 20 trials of 1000 spins each
Exp. return for Fair Roulette = 4.760000000000002%
Exp. return for European Roulette = -2.4399999999999995%
Exp. return for American Roulette = -9.46%

Simulate betting a pocket for 20 trials of 10000 spins each
Exp. return for Fair Roulette = -1.3060000000000003%
Exp. return for European Roulette = -4.095999999999999%
Exp. return for American Roulette = -5.698000000000001%

Simulate betting a pocket for 20 trials of 100000 spins each
Exp. return for Fair Roulette = 0.7982%
Exp. return for European Roulette = -2.5876000000000006%
Exp. return for American Roulette = -5.134600000000001%


### Sampling Space of Possible Outcomes
- It is never possible to guarantee perfect accuracy through sampling
- This does not mean that a particular estimate is not precisely correct
- How many samples do we need to look at before we can have justified confidence in our answer?
    - Depends upon variability in the underlying distribution

### Quantifying Variation in Data

$$ Var(X) = \frac{\sum_{x \in X}(x - \mu)^2}{|X|} $$

$$ \sigma(X) = \sqrt{\frac{1}{|X|}\sum_{x \in X}(x - \mu)^2} $$

- Standard deviation is simply the square root of the variance
- Outliers can have a big effect
- Standard deviation should always be considered relative to mean

### Code
```python
def getMeanAndStd(X):
    mean = sum(X) / float(len(X))
    tot = 0.0
    for x in X:
        tot += (x - mean) ** 2
    std = (tot / len(X)) ** 0.5
    return mean, std
```
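
As a cross-check, Python's standard `statistics` module provides `pstdev`, which uses the same population form (dividing by the number of values). The sketch below uses it on a small made-up data set to illustrate the two cautions above: the effect of an outlier, and reading the standard deviation relative to the mean. The data values are invented for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
mean = statistics.fmean(data)
std = statistics.pstdev(data)  # population form: divides by len(data)
print(f"mean = {mean}, std = {std}, std/mean = {std / mean:.2f}")

# A single outlier shifts the mean and inflates the standard deviation.
with_outlier = data + [50.0]
mean_out = statistics.fmean(with_outlier)
std_out = statistics.pstdev(with_outlier)
print(f"with outlier: mean = {mean_out:.2f}, std = {std_out:.2f}")
```

The ratio `std / mean` (the coefficient of variation) is one common way to express the standard deviation relative to the mean. It is only meaningful when the mean is not close to zero.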

### Confidence Levels and Intervals
- Instead of estimating an unknown parameter by a single value (e.g., the mean of a set of trials), a confidence interval provides a range that is likely to contain the unknown value and a confidence that the unknown value lies within that range

- "The return on betting on 2 twenty times in European roulette is -3.3%. The margin of error is $\pm$ 1% with a 95% level of confidence"
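
A statement of this form can be produced from repeated trials. The sketch below is self-contained and does not use the lecture's roulette classes: `pocket_return` is a hypothetical simulation that assumes 37 pockets and a 35-to-1 payout, and `mean_and_margin` reports the mean of the trial results with a margin of 1.96 standard deviations. The factor 1.96 assumes the trial results are roughly normally distributed.

```python
import random

def pocket_return(num_spins, rng, num_pockets=37, payout=35):
    """Average return per unit bet when always betting on a single pocket.

    Assumed rules (not taken from the lecture code): a win pays `payout`
    to 1, and a loss forfeits the 1-unit stake.
    """
    total = 0
    for _ in range(num_spins):
        if rng.randrange(num_pockets) == 0:  # pocket 0 stands in for our pocket
            total += payout
        else:
            total -= 1
    return total / num_spins

def mean_and_margin(values, z=1.96):
    """Mean of values and z times their population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, z * std

rng = random.Random(0)
returns = [pocket_return(10_000, rng) for _ in range(20)]  # 20 trials
mean, margin = mean_and_margin(returns)
print(f"Estimated return = {100 * mean:.2f}% ± {100 * margin:.2f}% "
      "with 95% confidence")
```

More spins per trial narrow the margin, because the per-trial returns cluster more tightly around the true expected return.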

### Empirical Rule
- Under some assumptions discussed later
    - ~ 68% of data within $\sigma$ of mean
    - ~ 95% of data within 2 $\sigma$ of mean
    - ~ 99.7% of data within 3 $\sigma$ of mean
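
The percentages can be checked empirically for normally distributed data, one setting in which the rule holds. The sketch below draws values with `random.gauss` and counts how many fall within *k* standard deviations of the mean; `fraction_within` is a hypothetical helper.

```python
import random

def fraction_within(k, num_samples=100_000, mu=0.0, sigma=1.0, seed=0):
    """Fraction of normally distributed draws within k*sigma of mu."""
    rng = random.Random(seed)
    hits = sum(abs(rng.gauss(mu, sigma) - mu) <= k * sigma
               for _ in range(num_samples))
    return hits / num_samples

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.1%}")
```

For data that is far from normal, for example heavily skewed data, these fractions can differ noticeably from 68%, 95%, and 99.7%.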

In [16]:
random.seed(0)
numTrials = 20
resultDict = {}
games = (FairRoulette, EuRoulette, AmRoulette)
for G in games:
    resultDict[G().__str__()] = []
for numSpins in (100, 1000, 10000):
    print('\nSimulate betting a pocket for', numTrials,
          'trials of', numSpins, 'spins each')
    for G in games:
        # Use numTrials rather than a hard-coded 20 so the header stays accurate
        pocketReturns = findPocketReturn(G(), numTrials, numSpins, False)
        mean, std = getMeanAndStd(pocketReturns)
        resultDict[G().__str__()].append((numSpins,
                                          100*mean, 100*std))
        print('Exp. return for', G(), '=', str(round(100*mean, 3))
              + '%,', '±' + str(round(100*1.96*std, 3))
              + '% with 95% confidence')


Simulate betting a pocket for 20 trials of 100 spins each
Exp. return for Fair Roulette = 6.2%, ±152.114% with 95% confidence
Exp. return for European Roulette = -8.2%, ±90.567% with 95% confidence
Exp. return for American Roulette = 2.6%, ±92.74% with 95% confidence

Simulate betting a pocket for 20 trials of 1000 spins each
Exp. return for Fair Roulette = 4.76%, ±39.658% with 95% confidence
Exp. return for European Roulette = -2.44%, ±31.31% with 95% confidence
Exp. return for American Roulette = -9.46%, ±36.273% with 95% confidence

Simulate betting a pocket for 20 trials of 10000 spins each
Exp. return for Fair Roulette = -1.306%, ±9.295% with 95% confidence
Exp. return for European Roulette = -4.096%, ±10.902% with 95% confidence
Exp. return for American Roulette = -5.698%, ±11.077% with 95% confidence
