In [4]:
from scipy.stats import norm, t, tstd

**Statistics Types**
- descriptive statistics: used to describe and summarize data
- inferential statistics: tries to uncover attributes of a larger population, usually based on a sample

**Population**
- particular group of interest that should be studied

**Sample**
- subset of the population that is ideally random and unbiased
- is used to infer attributes about the population
- the population is usually too big to study directly, so samples are used
- it is hard to get the bias out of the data, though
- e.g. how do I get a random sample of all universities in the US to represent all students without injecting any selection bias?
- *confirmation bias*: selecting only data that supports your belief
- *self-selection bias*: the sample consists of subjects who chose to take part, so it is already pre-selected
- *survival bias*: selects only the survivors, while the lost ones are not included in the sample
- computers don't recognize bias in data; that has to be done by the practitioner

## Descriptive Statistics

**Mean**
- average
- sample mean: x-bar
- population mean: mu

In [1]:
sample = [0,1,3,4,2,7,8,9]

mean = sum(sample) / len(sample)
mean

4.25

**Weighted Mean**
- weight each data point before summing
- then divide by sum of weights, not the count

In [2]:
sample = [9,23,65,23,78]
weights = [3,7,45,3,2]

weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)
weighted_mean

55.63333333333333

**Median**
- middle-most value in a set of ordered values
- if n is even, average the two center-most values
- the median is a helpful alternative to the mean if the data is skewed by outliers
- when the median is very different from the mean, the dataset is likely skewed, often by outliers
- the median is the 50% quantile (the quartiles are the 25%, 50% and 75% quantiles)
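
A minimal sketch of the median calculation described above. The sample values are made up for illustration and include one deliberate outlier, so the gap between median and mean is visible.

```python
from statistics import mean, median

# hypothetical sample with one large outlier (90)
sample = [0, 1, 3, 4, 2, 7, 8, 90]

ordered = sorted(sample)
n = len(ordered)
mid = n // 2

# for even n, average the two center-most values; for odd n, take the middle one
manual_median = (ordered[mid - 1] + ordered[mid]) / 2 if n % 2 == 0 else ordered[mid]

print(manual_median)   # 3.5
print(median(sample))  # 3.5, same result from the standard library
print(mean(sample))    # 14.375, pulled upwards by the outlier
```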

**Mode**
- most frequent value (or set of values) in a data set
- useful when the data is repetitive and you want to know which value is most frequent
- otherwise not used often
- bimodal: 2 values share the highest frequency
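
As a small illustration of the mode, the sketch below uses `statistics.multimode` from the standard library, which returns all values that share the highest frequency. The sample values are hypothetical.

```python
from statistics import multimode

# hypothetical sample: 2 and 3 both appear twice, so the data is bimodal
sample = [1, 3, 2, 5, 7, 0, 2, 3]

modes = sorted(multimode(sample))
print(modes)  # [2, 3]
```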

**Variance**
- measure of how spread out the data is
- take the difference between each data point and the mean, square it, and sum the squares
- then divide by the number of data points to get the average
- intuition: average squared distance of a data point to the mean

**Standard Deviation**
- square root of the variance; this brings it back to the original scale (units) of the data

**Standard Deviation and Variance in Samples**
- for a sample, divide the sum of squared differences by sample size - 1 instead of the sample size
- we do this to decrease the bias of the estimate
- this increases the estimate of how spread out the population data is
- since the population is most likely more spread out than the sample
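
The difference between the population and sample versions can be sketched as follows. The data reuses the values from the mean example; treating them once as a full population and once as a sample is only for illustration.

```python
from math import sqrt

sample = [0, 1, 3, 4, 2, 7, 8, 9]
n = len(sample)
mean = sum(sample) / n

# sum of squared differences between each data point and the mean
squared_diffs = sum((x - mean) ** 2 for x in sample)

# population version: divide by n
population_variance = squared_diffs / n
population_std = sqrt(population_variance)

# sample version: divide by n - 1, which gives a slightly larger estimate
sample_variance = squared_diffs / (n - 1)
sample_std = sqrt(sample_variance)

print(population_variance, population_std)
print(sample_variance, sample_std)
```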

**The Normal Distribution**
- most important distribution
- most mass around its mean
- is seen a lot in nature and sciences
- *I think in all datasets where we can use normal distribution, this gives us a set of methods to infer certain characteristics about the dataset*
- *properties*:
- symmetrical: Both sides are identically mirrored at the mean
- most mass is at center around mean
- it has a spread (being narrow or wide) that is specified by the standard deviation
- tails are the least likely outcomes and approach, but never touch, zero

**The Probability Density Function (PDF)**
- is the function that describes the shape of the normal distribution
- returns the likelihood (density) at a given value, which is not yet a probability
- it's continuous, so we need to integrate it over a range to get an area, i.e. a probability
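
To make the "integrate to get an area" step concrete, here is a small sketch using `norm.pdf` together with `scipy.integrate.quad` for numerical integration. The mean, standard deviation and the range 62 to 66 are illustrative assumptions.

```python
from scipy.integrate import quad
from scipy.stats import norm

# illustrative parameters of a normal distribution
mean = 64.43
std_dev = 2.99

# density at a single point: a likelihood, not a probability
density_at_mean = norm.pdf(mean, mean, std_dev)

# integrating the PDF over a range gives the probability of landing in that range
area, _ = quad(norm.pdf, 62, 66, args=(mean, std_dev))

print(density_at_mean)
print(area)
```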

**The cumulative distribution function**
- gives the area under the PDF (normal distribution) up to a given x
- like with the beta distribution
- i.e. what's the probability of a value being below x?

In [5]:
mean = 64.43
std_dev = 2.99
x = 65

P = norm.cdf(x, mean, std_dev)

print(P)

0.5755943933826899


In [9]:
# probability of a value falling between x and y

mean = 70
std_dev = 6.34
x = 65
y = 72

P = norm.cdf(y, mean, std_dev) - norm.cdf(x, mean, std_dev)
P

0.4086326197841312

**Inverse CDF**
- opposite of CDF
- under what value do x% of data points fall?

In [10]:
# under what value do 95% of data points fall?
# loc = mean, scale = std_dev

x = norm.ppf(.95, loc = 64.43, scale = 2.99)
x

69.3481123445849

**Z scores**
- standard normal distribution: mean = 0, std_dev = 1
- normal distributions are often rescaled to the standard normal distribution to make them comparable
- the standard normal distribution expresses all x values in terms of std_devs
- the number of std_devs a value lies away from the mean is called the Z-score
- Z = (x - mean)/std_dev
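
A short sketch of the Z-score formula. The mean, standard deviation and x value are illustrative; the check at the end shows that the probability below x on the original distribution equals the probability below z on the standard normal distribution.

```python
from scipy.stats import norm

# illustrative values
mean = 64.43
std_dev = 2.99
x = 65

# number of standard deviations x lies above the mean
z = (x - mean) / std_dev

# converting back from a z-score to the original scale
x_from_z = z * std_dev + mean

# same probability on the original and on the standard normal distribution
p_original = norm.cdf(x, mean, std_dev)
p_standard = norm.cdf(z)

print(z, x_from_z)
print(p_original, p_standard)
```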

**Coefficient of variation**
- metric to describe the spread of a distribution relative to its mean
- cv = std_dev/mean
- basically adjusts the std_dev for the mean
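
The coefficient of variation is mostly useful for comparing spreads across data sets with different scales. A minimal sketch, with two hypothetical samples and a helper function `coefficient_of_variation` that is not part of any library:

```python
from statistics import mean, stdev

# two hypothetical samples measured on different scales
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 81, 92]


def coefficient_of_variation(values):
    # sample standard deviation divided by the mean
    return stdev(values) / mean(values)


cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

# the weights vary more relative to their mean than the heights do
print(cv_heights, cv_weights)
```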

## Inferential Statistics
- Infer stuff about population by using sample

**Central Limit Theorem**
- uniform distribution: any value is equally likely, so the distribution is flat
- but when values drawn from it are grouped into samples and averaged, the averages follow a normal distribution
- *central limit theorem*: Interesting things happen when we take large enough samples of a population, calculate the mean of each sample, and plot the means as a distribution:
1. the mean of the sample means approaches the population mean
2. if the population is normal, then the sample means will be normal
3. if the population is not normal, but the sample size is greater than 30, the sample means are still roughly normally distributed
4. std_dev of the sample means = population std_dev / sqrt(sample size)
- rule of thumb: you need a sample size of at least 31 to rely on the central limit theorem and see a normal distribution
- *then you can infer useful things about the population using the normal distribution, even if the population is not normal*
- with fewer than 31 items in the sample, you need to use the T-distribution
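
The claims above can be checked with a small simulation. This is only a sketch under assumptions: the population is uniform on [0, 1], and the sample size, number of samples and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 31      # items per sample
num_samples = 5000    # how many samples we draw

# a uniform(0, 1) population has mean 0.5 and std_dev sqrt(1/12)
population_mean = 0.5
population_std = sqrt(1 / 12)

# draw many samples and keep the mean of each one
sample_means = [
    sum(random.uniform(0, 1) for _ in range(sample_size)) / sample_size
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
std_of_means = pstdev(sample_means)
expected_std = population_std / sqrt(sample_size)

print(mean_of_means, population_mean)  # should be close to each other
print(std_of_means, expected_std)      # should be close to each other
```

Plotting `sample_means` as a histogram should show the familiar bell shape even though the underlying population is flat.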

**probability distribution and sample size**
- if the population is very asymmetric, i.e. skewed, you need a much bigger sample size for the sample means to look like a normal distribution
- unimodal data: one peak; works well with the central limit theorem
- multimodal data: several peaks; needs a bigger sample size

**Confidence Intervals**

- if we can get a sample of at least 31 data points, we can apply the central limit theorem to calculate a confidence interval, a range in which we are confident to a certain level that a population value falls
- Confidence Interval: the range in which the population value (e.g. the population mean) is expected to fall
- LOC (level of confidence): how confident we are that the confidence interval contains the population value
- e.g. 95% confidence (LOC) that the population mean lies in a confidence interval (e.g. between 10 and 12) built around the sample mean
- first choose a LOC (level of confidence), e.g. 95%
- this is the desired confidence that the interval contains the value (e.g. the population mean)
- we can use the central limit theorem to infer what this range for the population mean is

*Calculating Confidence Interval*
- first I need the *critical z value*
- the critical z value leaves 95% probability in the center, meaning the remaining 5% is split into 2.5% on either side
- *How is this symmetrical range containing 95% calculated?*
- we can use the inverse CDF of the standard normal distribution to get the x values for .025 and .975; the space between them is the symmetrical 95% area around the center
- on the standard normal distribution these x values are already the z values, i.e. the lower and upper bounds of the 95% center area
- since the standard normal distribution is centered around mean 0, the upper and lower z values are the same with opposite signs, so +z and -z
- then calculate the *margin of error*, which is the range around the sample mean that contains the population mean at that level of confidence (95%)
- sample mean +- margin of error = confidence interval

**Confidence Intervals my explanation**

- You need: LOC, standard deviation, sample size
- Confidence Interval: interval in which I expect a value of the population to lie, based on the corresponding value of the sample
- LOC: level of confidence I want to have that the population value falls in the confidence interval. Often 95%
- to calculate the confidence interval, I have to calculate its upper and lower bounds, which enclose the 95% confidence that the population value falls into this interval
- we use the inverse CDF to calculate two x values: X1, below which the bottom 2.5% of probability lies, and X2, above which the top 2.5% of probability lies
- this means the middle holds 95% of the probability
- on the standard normal distribution, those x values are the z values
- we can then use the margin of error formula to calculate a +- value
- we apply this to the sample value to get the range (confidence interval) in which the population value lies

In [1]:
from math import sqrt
from scipy.stats import norm


def critical_z_value(p):
    norm_dist = norm(loc=0.0, scale=1.0)
    left_tail_area = (1.0 - p) / 2.0
    upper_area = 1.0 - ((1.0 - p) / 2.0)
    return norm_dist.ppf(left_tail_area), norm_dist.ppf(upper_area)


def confidence_interval(p, sample_mean, sample_std, n):
    # Sample size must be greater than 30

    lower, upper = critical_z_value(p)
    lower_ci = lower * (sample_std / sqrt(n))
    upper_ci = upper * (sample_std / sqrt(n))

    return sample_mean + lower_ci, sample_mean + upper_ci

print(confidence_interval(p=.95, sample_mean=64.408, sample_std=2.05, n=31))

(63.68635915701992, 65.12964084298008)


**Confidence Intervals another explanation**
- confidence interval: the range in which a value lies with a certain confidence
- LOC: the confidence with which the population value lies in the confidence interval; alpha = 1 - LOC is the probability left over in the tails
- critical_z_value: the +-z bounds in a standard normal distribution that enclose LOC probability around the center
- Margin of error: range around the sample mean that contains the population mean at LOC confidence
- Margin of error formula: +-Zc * (std_dev/sqrt(n))
- Process: first calculate the critical z value, then calculate the margin of error. Then apply the margin of error to the sample mean to calculate the confidence interval
- to get the critical z: determine the probability that should be in the left and right tail. If LOC = 95%, then 2.5% needs to be in each tail, so the 95% are centered in the middle
- then use the inverse CDF to get the x values (in this case z values) that are the upper and lower bound of the LOC area
- then use the critical z value to calculate the margin of error
- then use the margin of error to calculate the confidence interval

In [5]:
from math import sqrt
from scipy.stats import norm

# create standard normal distribution
std_norm = norm(loc = 0.0, scale = 1.0)

# set LOC to 95%, std_dev = 2.05, n = 31, mean = 64.40
p = .95
std_dev = 2.05
n = 31
mean = 64.40

# calculate probabilities for which we need to get the z values
# example: LOC = 95%
left_tail_prob = (1.0-p)/2.0 # 0.025
right_tail_prob = 1 - left_tail_prob # 0.975

# calculate z values
z_lower = std_norm.ppf(left_tail_prob) # -1.959...
z_upper = std_norm.ppf(right_tail_prob) # 1.959...

# margin of error. calculate lower margin and upper margin of confidence interval (ci)
error_margin_lower = z_lower * (std_dev / sqrt(n)) # -0.721....
error_margin_upper = z_upper * (std_dev / sqrt(n)) # 0.721....

# calculate confidence interval
ci_lower = mean + error_margin_lower # 63.678....
ci_upper = mean + error_margin_upper # 65.121....

print(ci_lower, ci_upper)

# interpretation:
# I am 95% confident that the population mean will be between 63.678 and 65.121

63.678359157019926 65.12164084298009


**P values**
- p value = probability of getting a result at least as extreme as the observed one purely by chance, i.e. assuming the null hypothesis is true
- Null Hypothesis (H0): the variable in question had no impact on the result, and any positive result is random luck
- Alternative Hypothesis (H1): the variable in question (the controlled variable) is causing the positive result

**One Tailed Test**
- H0: x >= certain value
- H1: x < certain value
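- one tailed because the hypothesis is only tested on one side of a value (in this case < certain value)
- significance level (alpha): if the p value is lower than alpha, reject H0; if it is higher, the result is not statistically significant
- to find the x value (critical value) that corresponds to alpha, use the inverse CDF as shown below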

In [4]:
# cold has 18 days mean recovery, 1.5 std_dev
mean = 18
std_dev = 1.5

# what x value has 5% of probability under it?
# using inverse cdf
x = norm.ppf(0.05, mean, std_dev)
print(x)

15.53271955957279


- the example above means:
- if another sample has a mean below 15.53, then the likelihood that it is due to chance is below 5%
- so if a cold treatment yields a mean recovery time below 15.53, the likelihood of this being due to chance and not the medicine is below 5%
- so the effect of the medicine is statistically significant
- below: calculate the p-value of a sample that yielded a mean of 16 to see if it is higher than alpha (5%)

In [3]:
# example where a drug yields a mean of 16
# calculating p value: how likely is this result due to chance?

# cold has 18 days mean recovery, 1.5 std_dev
mean = 18
std_dev = 1.5

# p value of experiment that yields mean of 16
p_value = norm.cdf(16, mean, std_dev)
print(p_value)

0.09121121972586788


**Two tailed test**
- looks on both sides of a value
- using equal and not equal
- H0: mean = 18
- H1: mean != 18
- now we are testing for statistical significance on each side
- so if we set alpha to 5%, we split it to 2.5% on either side
- if experiment mean is outside those areas, we reject H0
- example below: if the sample mean lands within those bounds, it is not statistically significant, because the likelihood of it being due to chance is higher than 5%

In [6]:
# calculating range for a 2 sided statistical significance of 5% (2.5% on each side)

# cold has 18 days mean recovery and 1.5 std_dev
mean = 18
std_dev = 1.5

# which x value has 2.5% of probability below it?
x1 = norm.ppf(0.025, mean, std_dev)

# which x value has 97.5% of probability below it (so 2.5% probability above it)?
x2 = norm.ppf(0.975, mean, std_dev)
print(x1)
print(x2)

15.060054023189918
20.93994597681008


- if calculating p-value for 2 tailed test, add the p-values for each side
- example: population_mean = 18, sample_mean = 16
- calculate p for below 16 and above 20
- because we also take p for symmetrical other side

In [8]:
# cold has 18 days mean recovery, and std_dev 1.5
mean = 18
std_dev = 1.5

# probability of 16 days or less
p1 = norm.cdf(16, mean, std_dev)

# probability of 20 days or more
p2 = 1.0 - norm.cdf(20, mean, std_dev)

# total probability of x being outside of significance threshold (alpha)
p = p2 + p1
print(p)

0.18242243945173575


- the example above means a result this extreme has a likelihood of about 18% of being down to chance
- so it is not statistically significant
- two tailed tests are usually preferable
- they make it more difficult to reject H0
- and are not biased to only testing one side

**P values in hypothesis testing my own explanation**
- P value: probability of a result at least as extreme as the observed x value, assuming H0 is true
- example: mean 10, sample mean 12
- one tailed test: p value = probability of the sample mean being 12 or greater. If higher than alpha: insignificant, probably just due to chance
- two tailed test: p value = probability of the sample mean being 12 or greater, or 8 or less. The rest is the same. Can calculate p1 and double it due to symmetry
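
The example with mean 10 and sample mean 12 can be sketched in code. The notes give no standard deviation for it, so `std_dev = 1.1` is an assumption chosen purely for illustration; with this value the one tailed test is significant at alpha = 5% while the two tailed test is not.

```python
from scipy.stats import norm

mean = 10
sample_mean = 12
std_dev = 1.1   # assumed value, not given in the notes
alpha = 0.05

# one tailed: probability of a sample mean of 12 or greater
p_one_tailed = 1.0 - norm.cdf(sample_mean, mean, std_dev)

# two tailed: add the probability of the mirrored value (8) or less
mirrored = mean - (sample_mean - mean)
p_two_tailed = p_one_tailed + norm.cdf(mirrored, mean, std_dev)

print(p_one_tailed, p_one_tailed < alpha)
print(p_two_tailed, p_two_tailed < alpha)
```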

**T-Distribution**
- used when dealing with samples of 30 items or fewer
- the T-distribution is like the normal distribution but with fatter tails
- so more probability in the tails compared to the normal distribution
- reflects more variance and uncertainty
- as the sample size approaches 31 items, the T-distribution closely resembles the normal distribution
- for confidence intervals you calculate a critical t value, analogous to the critical z value for the normal distribution

In [11]:
# get critical value range for 95% confidence interval
# with sample size = 25
# means: I am 95% confident the population mean falls within this value range of sample mean (if mean is taken as metric)

n = 25
lower = t.ppf(.025, df = n-1)
upper = t.ppf(.975, df = n-1)

print(lower,upper)

-2.063898561628021 2.063898561628021


- df = degrees of freedom and should be sample size - 1
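
Putting the critical t value to use, a confidence interval for a small sample could be sketched like this. The function name `t_confidence_interval` is hypothetical, and the sample mean and standard deviation are reused from the earlier z-based example with an assumed sample size of 25.

```python
from math import sqrt
from scipy.stats import norm, t


def t_confidence_interval(p, sample_mean, sample_std, n):
    # critical t value for the upper bound; the lower bound is its negative
    critical_t = t.ppf(1.0 - (1.0 - p) / 2.0, df=n - 1)
    margin_of_error = critical_t * sample_std / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error


# assumed inputs: 95% LOC, sample of 25 items
lower, upper = t_confidence_interval(p=.95, sample_mean=64.408, sample_std=2.05, n=25)
print(lower, upper)

# for comparison: the z-based margin of error is narrower for the same inputs
z_margin = norm.ppf(.975) * 2.05 / sqrt(25)
print((upper - lower) / 2, z_margin)
```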

**Beyond the Mean**
- in the examples above we used confidence intervals and hypothesis testing with the mean
- we can also use other values like variance, std_dev and proportions, but then we need other distributions

**Big Data Considerations**
- law of truly large numbers: with enough data, rare events are likely to be found, we just don't know which ones
- so we can't really draw conclusions from finding a rare event at random
- can we find it again with new data? Did we hypothesize that it exists before finding it? Is there a reasonable explanation for it?
- if not: the rare event is probably completely random and no information should be inferred from it
- to avoid this trap: use structured hypothesis testing

**Exercises**

**1)**
- target diameter: 1.75 mm
- diameter samples: 1.78, 1.75, 1.72, 1.74, 1.77
- calculate mean and std_dev

In [5]:
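# named diameters so the built-in list is not shadowed
diameters = [1.78, 1.75, 1.72, 1.74, 1.77]

# mean = sum / n
mean = sum(diameters) / len(diameters)

# sample std_dev = sqrt(sum of squared differences to the mean / (n - 1))
# tstd (imported from scipy.stats at the top) divides by n - 1 by default
std_dev = tstd(diameters)

print(mean, std_dev)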

1.752 0.023874672772626667


**2)**
- mean consumer life: 42 months
- std_dev 8 months
- normal distribution
- what is the probability that a random iPhone will last between 20 and 30 months?
- my approach: use the CDF to find the probability for each x value, then subtract

In [6]:
mean = 42
std_dev = 8

# P life between 20 and 30 months
p = norm.cdf(30, mean, std_dev) - norm.cdf(20, mean, std_dev)
print(p)

0.06382743803380352


**3)**
- pop_mean = 1.75 (to be confirmed)
- n_sample = 34
- sample_mean = 1.715588
- sample_std_dev = 0.029252
- what is 99% confidence interval of population mean?
- solution process: for p = .99 get critical z values, then margin of error, then confidence interval

In [6]:
from math import sqrt
from scipy.stats import norm

# create standard normal distribution
std_norm = norm(loc = 0.0, scale = 1.0)

# set variables
p = .99
std_dev = 0.029252
n = 34
mean = 1.715588

# calculate probabilities for which we need to get the z values
# example: LOC = 99%
left_tail_prob = (1.0-p)/2.0
right_tail_prob = 1 - left_tail_prob

# calculate z values
z_lower = std_norm.ppf(left_tail_prob)
z_upper = std_norm.ppf(right_tail_prob)

# margin of error. calculate lower margin and upper margin of confidence interval (ci)
error_margin_lower = z_lower * (std_dev / sqrt(n))
error_margin_upper = z_upper * (std_dev / sqrt(n))

# calculate confidence interval
ci_lower = mean + error_margin_lower
ci_upper = mean + error_margin_upper

print(ci_lower, ci_upper)

1.7026658973748656 1.7285101026251342


**4)**
- mean pop = 10345
- std_dev pop = 552
- n sample = 45
- mean sample = 11641
- did the campaign affect sales?

- my approach
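- is the sample mean due to randomness? Or due to the campaign effect?
- I want to be 95% sure it is due to the campaign effect
- H0: true mean with campaign = 10345
- H1: true mean with campaign != 10345
- alpha = .05 (1 - LOC)
- can either compare against critical values or use the p value for this. Here I will use the p value
- p value process: assuming H0 is true, calculate the probability of a result at least as extreme as the sample mean. Two tailed test: on both sides. This probability is the p value. Compare it to alpha. If it is higher, the result is statistically insignificant
- note: the code below treats the sample mean like a single observation and uses the population std_dev directly; using the standard error std_dev / sqrt(n) would make the p value even smaller, so the conclusion stays the same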

In [10]:
mean = 10345
std_dev = 552
sample_mean = 11641
sample_mean_mirror = mean - (sample_mean - mean)

# probability of 11641 dollars or more
p1 = 1 - norm.cdf(sample_mean, mean, std_dev)

# probability of the mirrored value (9049 dollars) or less
p2 = norm.cdf(sample_mean_mirror, mean, std_dev)

print(p1 + p2)

0.018883335964961383


In [11]:
# concerning the example above
# due to symmetry, I could have simply doubled p1 instead of calculating p2 (p1 and p2 are the same)
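
A quick sketch of that shortcut, reusing the numbers from exercise 4 and doubling the upper-tail probability instead of computing the mirrored lower tail:

```python
from scipy.stats import norm

mean = 10345
std_dev = 552
sample_mean = 11641

# upper-tail probability of the sample mean or more
p1 = 1 - norm.cdf(sample_mean, mean, std_dev)

# the normal distribution is symmetric, so the lower tail is identical
p_two_tailed = 2 * p1
print(p_two_tailed)
```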