# STATS INTRO NOTES

## DIFFERENT TESTING:
- Chi Square Test = χ2 test
- t-test
- correlation test

## VOCABULARY:

### Measures of Central Tendency
- mean: average value
- median: center most value
- mode: most frequently occuring value
    - bi-modal: two values tie as mode
- expected value: similar to mean but weighted

### Measures of Spread
- min: lowest value
- max: highest value
- range: difference between max and min
- percentile: cut into 100 equal parts

- quantile: an set of cut points that divides values equally

- quartile: slice a data set into four pieces
    - 25% of observations between min and Q1
    - 50% of observations between min and Q2
    - 75% of observations between min and Q3
    
- IQR: Q3-Q1
    
- Variance

- Standard Deviation: square root Variance 

- Skew: Symmetric - hump in middle and tails even on each side
    - left-skewed: mean is less than medial, left tail is longer
    - right-skewed: mean is greater then median, right tail is longer
    

# SIMULATIONS:
- Madeline file copy

In [8]:
%matplotlib inline
import numpy as np
import pandas as pd

np.random.seed(1349)

### How will we utilize Python to obtain probabilities?

We will utilize Monte Carlo simulations.

A Monte Carlo simulation is a means to recreate potential events and empirically take the results of simiulated trials to obtain a reasonably precise estimate of a desired probability.

What does this mean for us here?

In [7]:
# Let's take a hypothetical base probability. 
# What is the probability of rolling a one (1) on a single, standard, fair six-sided die?

In [9]:
# Potential outcomes of a die roll:
possible_outcomes = [1,2,3,4,5,6]

In [None]:
# options that equal 1: just 1, literally one

In [10]:
ideal_roll = 1

In [None]:
#theoretical probability: 1/6

In [11]:
1/6

0.16666666666666666

In [None]:
# Now how would we do this with a simulation?

In [None]:
# We will do it utilizing a large number of trials, that we calculate.

In [None]:
# Allow us to examine the same problem: Probability of rolling a 1 on a fair six-sided die.

In [12]:
# First, we will set a value for the number of trials that we want to conduct.
# We have the power of computation at our finger tips, so let's shoot for something like one million.

num_trails= 10 ** 5

In [13]:
# We have one die roll for each trial, which is our event, that we call a single simulation
n_dice= 1

In [None]:
# We will do a single simulation one million times, with each simulation being a die roll.

In [15]:
rolls = np.random.choice(possible_outcomes, num_trails*n_dice).reshape(num_trails, n_dice)

In [16]:
type(rolls)

numpy.ndarray

In [17]:
rolls

array([[3],
       [2],
       [5],
       ...,
       [4],
       [5],
       [5]])

In [18]:
rolls.shape

(100000, 1)

In [19]:
rolls == 1
#^indicates whether roll is 1 or not in True or False

array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])

In [20]:
(rolls == 1).mean()
#^this gives the amount of times that roll is 1

0.16894

## Generating Random Numbers with Numpy

The `numpy.random` module provides a number of functions for generating random numbers.

- `np.random.choice`: selects random options from a list
- `np.random.uniform`: generates numbers between a given lower and upper bound
- `np.random.random`: generates numbers between 0 and 1
- `np.random.randn`: generates numbers from the standard normal distribution
- `np.random.normal`: generates numbers from a normal distribution with a specified mean and standard deviation

## Example Problems

### Carnival Dice Rolls

> You are at a carnival and come across a person in a booth offering you a game
> of "chance" (as people in booths at carnivals tend to do).

> You pay 5 dollars and roll 3 dice. If the sum of the dice rolls is greater
> than 12, you get 15 dollars. If it's less than or equal to 12, you get
> nothing.

> Assuming the dice are fair, should you play this game? How would this change
> if the winning condition was a sum greater than *or equal to* 12?

In [24]:
n_trials = nrows= 10_000 #number of times we're going to roll
n_dice = ncols= 3 #number of dice

rolls = np.random.choice([1,2,3,4,5,6], n_trials * n_dice)
rolls

array([3, 1, 3, ..., 5, 6, 5])

In [26]:
rolls = np.random.choice([1,2,3,4,5,6], n_trials * n_dice).reshape(nrows,ncols)
rolls #reshape for 10,000 columns and 3 rows


array([[4, 3, 4],
       [4, 6, 5],
       [2, 3, 4],
       ...,
       [5, 5, 3],
       [1, 5, 2],
       [5, 2, 1]])

In [27]:
#AGAIN, we want an outcome that is OVER 12
sums_by_trial = rolls.sum()
sums_by_trial
#this adds everything up row by row

104735

In [32]:
#to do it CORRECTLY
sums_by_trial = rolls.sum(axis=1)
sums_by_trial

array([11, 15,  9, ..., 13,  8,  8])

In [29]:
# We can now convert each value in our array to a boolean value indicating whether or not we won:

wins = sums_by_trial >12
wins
#this will give us whether the 3 rolls added up to over 12 in True and False

In [37]:
win_rate = wins.mean()
win_rate

0.2525

### with win rate, we can calculate profit:
- $15 if your 3 rolls add up to over 12

- $5 to play the game


In [40]:
expected_winnings = win_rate * 15
cost = 5
expected_profit = expected_winnings - cost
expected_profit
#losing 1.2125 on average

-1.2125

In [41]:
expected_winnings

3.7875

In [42]:
# change the standards to 12 or greater! instead of just greater then 12

In [45]:
wins= sums_by_trial >=12
win_rate = wins.mean()
expected_winnings= win_rate * 15 #prize
cost= 5
expected_profit = expected_winnings -cost
expected_profit
# just by changing for 12 or greater... your probability changes

0.4764999999999997

## Winnings = 3.7875 (greater than 12) VS .47649 (12 or greater)

# PRACTICE:

## #1: what is the probabilty of rolling "snake eyes" on a roll of two dice?

In [56]:
#ANSWER via probability
(1/6) * (1/6)
#probability of getting a 1 on TWO dice

0.027777777777777776

In [50]:
#ANSWER via simulation
n_trials = nrows= 10_000 #number of times we're going to roll
n_dice = ncols= 2 #number of dice

In [57]:
#to get your rolls
rolls = np.random.choice([1,2,3,4,5,6], n_trials * n_dice).reshape(nrows,ncols)

In [59]:
rolls.sum(axis=1) == 2
#dice total has to equal 2

array([False, False, False, ..., False, False, False])

In [60]:
(rolls.sum(axis=1) == 2).mean()

0.0275

## #2 There's a 30% chance my son takes a nap on any given weekend day. What is the chance that he takes a nap at least one day this weekend? What is the probability that he doesn't nap at all?

In [62]:
p_nap = .3 #30% probability of nap

ndays = n_cols = 2 #saturday and sunday (#of days)

n_simulated_weekends = n_rows = 10**6 #number of attempts

To determine whether or not a nap is taken on a given day, we'll generate a random number between 0 and 1, and say that it is a nap if it is less than our probability of taking a nap.

In [66]:
trails = np.random.random((n_rows, n_cols))
#gives randoms between 0 and 1

In [67]:
trails[:10]

array([[0.25899888, 0.62842359],
       [0.12435768, 0.87888073],
       [0.76481329, 0.90518264],
       [0.04701055, 0.0030957 ],
       [0.28765568, 0.35455063],
       [0.09130286, 0.82251575],
       [0.62664399, 0.16970214],
       [0.17970919, 0.474695  ],
       [0.77112604, 0.99291645],
       [0.73150172, 0.55338451]])

In [69]:
naps= trails < p_nap #probability that trails is less than 30%
naps

array([[ True, False],
       [ True, False],
       [False, False],
       ...,
       [ True, False],
       [False, False],
       [False, False]])

Now that we have each day as either true or false, we can take the sum of each row to find the total number of naps for the weekend. When we sum an array of boolean values, numpy will treat `True` as 1 and `False` as 0.

In [70]:
naps.sum(axis=1)
#0 = no nap
#1 = nap on 1 day
#2 = nap on both days

array([1, 1, 0, ..., 1, 0, 0])

In [74]:
# We can use this to answer our original questions, what is the probability that AT LEAST one nap is taken?
(naps.sum(axis=1) >= 1).mean()

0.509758

In [75]:
# What is the probability no naps are taken?
(naps.sum(axis=1) == 0).mean()

0.490242

In [76]:
# What is the probability naps being taken on both days?
(naps.sum(axis=1) == 2).mean()

0.090121

## #3  There are ten options in a blind-box style collectable, but you are only likely to get the one you want the most at a probability of one out of every twenty boxes because its a little rarer.

What is the probability of getting your desired collectable if you buy three blindbox toys?

In [78]:
p_collect = .05 #1/20 probability of desired collectable

n_boxes = ncols= 3 #number of boxes

n_trails = n_rows = 10**6 #number of attempts

In [84]:
total_purchases = np.random.random((n_rows, n_cols))
total_purchases

array([[0.68056255, 0.78103963],
       [0.72462227, 0.24469949],
       [0.57703801, 0.78767832],
       ...,
       [0.61142449, 0.03689136],
       [0.89539292, 0.91427327],
       [0.21325399, 0.65909719]])

In [85]:
total_purchases < p_collect

array([[False, False],
       [False, False],
       [False, False],
       ...,
       [False,  True],
       [False, False],
       [False, False]])

In [87]:
((total_purchases < p_collect).sum(axis=1) >=1).mean()

0.097729

## #4 One With Dataframes

Let's take a look at one more problem:

> What is the probability of getting at least one 3 in 3 dice rolls?

To simulate this, we'll use a similar strategy to how we modeled the dice rolls in the previous example, but this time, we'll store the results in a pandas dataframe so that we can apply a lambda function that will check to see if one of the rolls was a 3.

In [94]:
n_trials = nrows = 10 **6 #number of attemps
n_dice_rolled = ncols = 3 #number of dice 

rolls = np.random.choice([1,2,3,4,5,6], n_trials * n_dice_rolled)

In [97]:
pd.DataFrame(rolls).apply(lambda row: 3 in row.values, axis=1)

0          False
1          False
2          False
3          False
4          False
           ...  
2999995    False
2999996    False
2999997    False
2999998    False
2999999    False
Length: 3000000, dtype: bool

In [98]:
pd.DataFrame(rolls).apply(lambda row: 3 in row.values, axis=1).mean()

0.166653

Let's break down what's going on here:

1. First we assign values for the number of rows and columns we are going to use
1. Next we create the `rolls` variable that holds a 3 x 10,000 matrix where each element is a randomly chosen number from 1 to 6
1. Lastly we create a dataframe from the rolls
    1. `pd.DataFrame(rolls)` converts our 2d numpy matrix to a pandas DataFrame
    1. `.apply(...` applies a function to each **row** in our dataframe, because we specified `axis=1`, the function will be called with each row as it's argument. The body of the function checks to see if the value `3` is in the values of the row, and will return either `True` or `False`
    1. `.mean()` takes our resulting series of boolean values, and treats `True` as 1 and `False` as 0, to give us the average rate of `True`s, in this case, the simulated probability of getting a 3 in 3 dice rolls.

## #5 Recreate the blindbox problem utilizing the above strategy!

In [100]:
p_collect = .05 #1/20 probability of desired collectable

n_boxes = ncols= 3 #number of boxes

n_trails = n_rows = 10**6 #number of attempts

In [101]:
total_purchases = np.random.random((n_rows, n_cols))
total_purchases

array([[0.90559479, 0.53126914],
       [0.02318502, 0.89400176],
       [0.41830072, 0.92737271],
       ...,
       [0.80327722, 0.30264089],
       [0.70007345, 0.48303601],
       [0.72744118, 0.38709401]])

In [102]:
pd.DataFrame(total_purchases).apply(lambda row: 3 in row.values, axis=1)

0         False
1         False
2         False
3         False
4         False
          ...  
999995    False
999996    False
999997    False
999998    False
999999    False
Length: 1000000, dtype: bool

In [104]:
pd.DataFrame(rolls).apply(lambda row: 3 in row.values, axis=1).mean()

0.166653