In [1]:
from scipy.stats import norm

**Statistics Types**
- descriptive statistics: used to describe and summarize data
- inferential statistics: tries to uncover attributes of a larger population, usually based on a sample

**Population**
- particular group of interest that should be studied

**Sample**
- subset of the population that is ideally random and unbiased
- is used to infer attributes about the population
- the population is usually too big to study in full, so samples are used
- it is hard to get the bias out of the data, though
- e.g. how do I get a random sample of all universities in the US to represent all students without injecting any selection bias?
- *confirmation bias*: Selecting only data that supports your belief
- *self-selection bias*: The sample consists of subjects who chose to take part, so it is already pre-selected
- *survivorship bias*: Selects only the survivors, while the lost ones are not included in the sample
- Computers don't recognize bias in data; that has to be done by the practitioner

## Descriptive Statistics

**Mean**
- average
- sample mean: x-bar
- population mean: mu

In [1]:
sample = [0,1,3,4,2,7,8,9]

mean = sum(sample) / len(sample)
mean

4.25

**Weighted Mean**
- weight each data point before summing
- then divide by sum of weights, not the count

In [2]:
sample = [9,23,65,23,78]
weights = [3,7,45,3,2]

weighted_mean = sum(s * w for s,w in zip(sample, weights)) / sum(weights)
weighted_mean

55.63333333333333

**Median**
- middle most value in a set of ordered values
- if n is even, you average the two center most values
- the median is a helpful alternative to the mean when the data is skewed by outliers
- when the median is very different from the mean, the dataset is skewed, often by outliers
- the median is the 50% quantile (of the quartiles 25%, 50%, 75%)
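A small sketch of how an outlier pulls the mean but barely moves the median. The sample values are made up for illustration, and the standard library `statistics` module is used here instead of the hand-rolled arithmetic in the cells above.

```python
from statistics import mean, median

# Hypothetical sample with one extreme outlier (values are illustrative)
sample = [0, 1, 5, 7, 9, 10, 14, 200]

sample_mean = mean(sample)      # dragged upward by the outlier
sample_median = median(sample)  # n is even, so the two middle values (7 and 9) are averaged

print(sample_mean)    # 30.75
print(sample_median)  # 8.0
```

The large gap between the two numbers is the hint that the data is skewed.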

**Mode**
- the most frequent value (or set of values) in a data set
- useful when the data is repetitive and you want to know which value occurs most often
- otherwise not used often
- bimodal: two values share the highest frequency
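A minimal sketch of finding the mode with the standard library, assuming Python 3.8 or newer for `multimode`. The data is invented for illustration.

```python
from statistics import mode, multimode

# Hypothetical data: 1 is the single most frequent value
unimodal = [1, 1, 2, 3]
single_mode = mode(unimodal)

# Hypothetical data: 2 and 3 both appear twice, so the data is bimodal
bimodal = [1, 2, 2, 3, 3, 4]
all_modes = multimode(bimodal)  # returns every value tied for the highest frequency

print(single_mode)  # 1
print(all_modes)    # [2, 3]
```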

**Variance**
- measure of how spread out the data is
- take the difference between each data point and the mean, and square it
- then average the squared differences
- intuition: average squared distance of a data point from the mean

**Standard Deviation**
- square root of the variance; brings the measure back to the original scale (units) of the data

**Standard Deviation and Variance in Samples**
- for samples, we divide the sum of squared differences by sample size - 1 instead of the sample size (Bessel's correction)
- we do this to reduce the bias of the estimate
- this increases the estimate of how spread out the population is
- since the population is most likely more spread out than the sample
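The difference between the population and sample formulas can be sketched by computing both by hand. The data reuses the sample from the mean cell above; the variable names are chosen here for illustration.

```python
from math import sqrt

data = [0, 1, 3, 4, 2, 7, 8, 9]
n = len(data)
data_mean = sum(data) / n

# Sum of squared differences between each data point and the mean
squared_diffs = sum((x - data_mean) ** 2 for x in data)

# Population formulas: divide by n
population_variance = squared_diffs / n
population_std_dev = sqrt(population_variance)

# Sample formulas: divide by n - 1, which gives a slightly larger estimate
sample_variance = squared_diffs / (n - 1)
sample_std_dev = sqrt(sample_variance)

print(population_variance, population_std_dev)
print(sample_variance, sample_std_dev)
```

The standard library offers the same calculations as `statistics.pvariance` / `statistics.pstdev` (divide by n) and `statistics.variance` / `statistics.stdev` (divide by n - 1).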

**The Normal Distribution**
- most important distribution
- most mass around its mean
- is seen a lot in nature and sciences
- *properties*:
- symmetrical: Both sides are identically mirrored at the mean
- most mass is at center around mean
- it has a spread (being narrow or wide) that is specified by the standard deviation
- tails are the least likely outcomes and approach, but never touch, zero

**The Probability Density Function (PDF)**
- is the function that defines the shape of the normal distribution
- can be used to look up likelihoods at given values
- it's continuous, so the value at a single point is a density, not a probability; we need to integrate over a range to get an area (a probability)
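A rough sketch of the "density versus area" point, using `norm.pdf` and numerical integration with `scipy.integrate.quad`. The mean and standard deviation are the ones used in the CDF cell of these notes; the bounds 62 and 66 are arbitrary illustrative choices.

```python
from scipy.integrate import quad
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

# Height of the density curve at the mean: a density, not a probability
density_at_mean = norm.pdf(mean, mean, std_dev)

# Integrating the PDF over a range gives the probability of landing in that range
lower, upper = 62, 66  # illustrative bounds
area, _abs_error = quad(norm.pdf, lower, upper, args=(mean, std_dev))

print(density_at_mean)
print(area)
```

The integral should agree with `norm.cdf(upper, mean, std_dev) - norm.cdf(lower, mean, std_dev)`, which is why the CDF is normally used instead of integrating by hand.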

**The cumulative distribution function**
- finds the area under the PDF (here the normal distribution) up to a given value x
- same idea as with the beta distribution
- e.g. what's the probability of a value being below x?

In [5]:
mean = 64.43
std_dev = 2.99
x = 65

P = norm.cdf(x, mean, std_dev)

print(P)

0.5755943933826899


In [9]:
# probability of a value falling between x and y

mean = 70
std_dev = 6.34
x = 65
y = 72

P = norm.cdf(y, mean, std_dev) - norm.cdf(x, mean, std_dev)
P

0.4086326197841312

**Inverse CDF**
- opposite of CDF
- under what value do x% of data points fall?

In [10]:
# under what value do 95% of data points fall?
# loc = mean, scale = std_dev

x = norm.ppf(.95, loc = 64.43, scale = 2.99)
x

69.3481123445849

**Z scores**
- standard normal distribution: mean = 0, std_dev = 1
- normal distributions are often rescaled to the standard normal distribution to make them comparable
- the standard normal distribution expresses all x values in terms of std_devs from the mean
- the number of std_devs a value lies from the mean is called the Z-score
- Z = (x - mean)/std_dev
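A short sketch of the Z-score formula, reusing the mean, standard deviation and x from the CDF cell above. It also checks that the standard normal CDF at z gives the same probability as the original distribution's CDF at x; the variable names are chosen here for illustration.

```python
from scipy.stats import norm

mean = 64.43
std_dev = 2.99
x = 65

# How many standard deviations x lies above the mean
z = (x - mean) / std_dev

# Same probability, expressed on the standard normal (mean 0, std_dev 1)
p_standard = norm.cdf(z)
p_original = norm.cdf(x, mean, std_dev)

# Converting a Z-score back to the original scale
x_back = z * std_dev + mean

print(z)
print(p_standard, p_original)
print(x_back)
```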

**Coefficient of variation**
- metric to describe the spread of a distribution
- cv = std_dev/mean
- expresses the std_dev relative to the mean, so the spreads of distributions on different scales can be compared
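A minimal sketch of comparing spread across two differently scaled datasets. Both datasets and the helper name `coefficient_of_variation` are hypothetical, and the sample standard deviation (`statistics.stdev`) is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical measurements on different scales
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 70, 81, 95]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(cv_heights)
print(cv_weights)
```

Here the weights have the larger coefficient of variation, so they are more spread out relative to their mean than the heights are.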

## Inferential Statistics
- infer attributes of a population by using a sample

**Central Limit Theorem**
- uniform distribution: every value is equally likely, so the distribution is flat
- but when values drawn from it are grouped into samples and averaged, the averages follow a normal distribution
- *central limit theorem*: Interesting things happen when we take large enough samples from a population, calculate the mean of each sample, and plot the means as a distribution:
1. the mean of the sample means approaches the population mean
2. if the population is normal, then the sample means will be normal
3. if the population is not normal, but the sample size is greater than 30, the sample means are still roughly normally distributed
4. std_dev of the sample means = population std_dev / sqrt(sample size)
- as a rule of thumb, each sample needs at least 31 items for the central limit theorem to apply and the normal distribution to appear
- *then you can infer useful things about the population using the normal distribution, even if the population is not normal*
- with fewer than 31 items per sample, you need to use the T-distribution
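The central limit theorem can be sketched with a small simulation: draw many samples of size 31 from a uniform distribution on [0, 1], average each one, and compare the mean and spread of those averages with theory. The seed, the number of samples and the variable names are arbitrary choices for illustration. For a uniform distribution on [0, 1] the population mean is 0.5 and the population std_dev is sqrt(1/12).

```python
import random
from math import sqrt
from statistics import stdev

random.seed(0)  # fixed seed so the run is repeatable

sample_size = 31   # items per sample
n_samples = 5000   # how many samples we draw

# Mean of each sample drawn from a flat (uniform) population
sample_means = []
for _ in range(n_samples):
    sample = [random.uniform(0, 1) for _ in range(sample_size)]
    sample_means.append(sum(sample) / sample_size)

mean_of_means = sum(sample_means) / n_samples
std_of_means = stdev(sample_means)

# Theory: population std_dev / sqrt(sample size)
population_std_dev = sqrt(1 / 12)
expected_std_of_means = population_std_dev / sqrt(sample_size)

print(mean_of_means)                        # close to 0.5
print(std_of_means, expected_std_of_means)  # close to each other
```

Plotting a histogram of `sample_means` would show the bell shape, even though the underlying population is flat.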

**probability distribution and sample size**
- if the population is very asymmetric, i.e. skewed, you need a much bigger sample size before the sample means look like a normal distribution
- unimodal data: One peak. is good for central limit theorem
- multimodal data: more peaks. need bigger sample size

**Confidence Intervals**
- gives a confidence (e.g. 95% confidence) that a specific value of the population (e.g. the population mean) lies between two values (e.g. 10 and 12)
- first choose a LOC (level of confidence), e.g. 95%
- this is the desired confidence that the value (e.g. the population mean) lies within the interval