## Chapter 2: Data and Sampling Distributions

### An unknown distribution (population) vs a random sample

Traditional statistics: theory based on assumptions about the population.

Modern statistics: using sampling directly, where such assumptions are not needed.

Random sampling:
* with replacement: observations are put back and can be selected multiple times
* without replacement: once selected, an observation is unavailable for future draws

Quality of sampling matters more than quantity:
* The Literary Digest poll of 1936 surveyed 10 million people and predicted a victory of Alf Landon over Franklin Roosevelt.
* George Gallup, founder of the Gallup Poll, conducted biweekly polls of just 2,000 people and accurately predicted a Roosevelt victory.

The Literary Digest's error was the result of sample bias; that is, the sample was different in some meaningful and nonrandom way from the larger population it was meant to represent.

In [None]:
# load libraries (install any that are missing first)
list.of.packages <- c("boot", "ggplot2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(boot)
library(ggplot2)

In [None]:
# load data
PSDS_PATH <- file.path('~/Desktop', 'statistics-for-data-scientists')

loans_income <- read.csv(file.path(PSDS_PATH, 'data', 'loans_income.csv'))[,1]
sp500_px <- read.csv(file.path(PSDS_PATH, 'data', 'sp500_px.csv'))


In [None]:
x <- seq(from=-3, to=3, length=300)
gauss <- dnorm(x)

# png(filename=file.path(PSDS_PATH, 'figures', 'normal_density.png'),  width = 4, height=5, units='in', res=300)
par(mar=c(3, 3, 0, 0)+.1)
plot(x, gauss, type="l", col='blue', xlab='', ylab='', axes=FALSE)
polygon(x, gauss, col='blue')
# dev.off()

# png(filename=file.path(PSDS_PATH, 'figures', 'samp_hist.png'), width = 200, height = 250)
norm_samp <- rnorm(100)
par(mar=c(3, 3, 0, 0)+.1)
hist(norm_samp, axes=FALSE, col='red', main='')
# dev.off()

### Random error and bias error

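Random error: even an unbiased process, such as a gun aimed true at a target, produces shots scattered around the center (Figure 2-2).

Bias error: a systematic shift, e.g. shots that tend to land in the upper right corner of the target.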


![](02.png)
![](03.png)

### Sample mean versus population mean

Sample mean: $\bar{x}$, computed from the observed sample.

Population mean: $\mu$, usually not observed directly but inferred from samples.

### Selection bias and data snooping

Selection bias: Bias resulting from the way in which observations are selected.

Data snooping: Extensive hunting through data in search of something interesting.

    “If you torture the data long enough, sooner or later it will confess.”

### The vast search effect

One person tosses a coin 10 times and gets 10 heads -> it is tempting to credit a special talent.

Everyone in a stadium tosses a coin 10 times -> the chance that someone gets 10 heads is more than 99%.

To guard against this: holdout sets or target shuffling (a permutation test).
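The stadium figure can be checked by direct calculation. A minimal sketch, assuming fair coins and a hypothetical crowd of 20,000 people (the crowd size is an assumption for illustration, not a figure from these notes):

```r
# Probability that one person gets 10 heads in 10 fair tosses
p_single <- 0.5^10

# Hypothetical crowd size (assumption for illustration)
n_people <- 20000

# P(at least one person gets 10 heads) = 1 - P(nobody does)
p_any <- 1 - (1 - p_single)^n_people
p_any
```

The same logic applies to searching through many models or many questions: with enough attempts, something "interesting" turns up by chance alone.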

### Regression to the mean

Extreme observations tend to be followed by more central ones.

e.g. the children of very tall parents tend to be less tall than their parents.

![](04.png)

### Distribution of the sample vs distribution of mean values of sampling

The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself.

Example: histograms of a sample of 1,000 values, a sample of 1,000 means of 5 values, and a sample of 1,000 means of 20 values (Code snippet 2.1 below).

### Central limit theorem

The theorem says that the means drawn from multiple samples will resemble the familiar bell-shaped normal curve (see “Normal distribution” below), even if the source population is not normally distributed, provided that the sample size is large enough and the departure of the data from normality is not too great.

The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference, that is, confidence intervals and hypothesis tests.

In practice, the bootstrap is usually available, and it does not rely on the CLT.

In [None]:
## Code snippet 2.1

# take a simple random sample
samp_data <- data.frame(income=sample(loans_income, 1000), 
                        type='data_dist')
# take a sample of means of 5 values
samp_mean_05 <- data.frame(
  income = tapply(sample(loans_income, 1000*5), 
                  rep(1:1000, rep(5, 1000)), FUN=mean),
  type = 'mean_of_5')
# take a sample of means of 20 values
samp_mean_20 <- data.frame(
  income = tapply(sample(loans_income, 1000*20), 
                  rep(1:1000, rep(20, 1000)), FUN=mean),
  type = 'mean_of_20')
# bind the data.frames and convert type to a factor
income <- rbind(samp_data, samp_mean_05, samp_mean_20)
income$type <- factor(income$type, 
                      levels=c('data_dist', 'mean_of_5', 'mean_of_20'),
                      labels=c('Data', 'Mean of 5', 'Mean of 20'))
# plot the histograms
ggplot(income, aes(x=income)) +
  geom_histogram(bins=40) +
  facet_grid(type ~ .)

### Standard Error

Standard error = SE = $\frac{s}{\sqrt{n}}$

(Note: $s$ is the standard deviation of the sample and $n$ is the sample size.)

To reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.

The validity of the standard error formula arises from the central limit theorem.

By definition, measuring the standard error directly means drawing many new samples from the population, computing the statistic (e.g. the mean) for each, and taking the standard deviation of those statistics.

In practice, the bootstrap is used to estimate the standard error, and it doesn't rely on the central limit theorem.
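To see how the formula relates to the "by definition" procedure, the two can be compared on simulated data. This is a sketch under stated assumptions: `population` is a hypothetical skewed population generated with `rexp`, not the loans data used elsewhere in these notes.

```r
set.seed(1)

# Hypothetical skewed population (assumption; stands in for real data)
population <- rexp(10000, rate = 1 / 50000)
n <- 100

# Formula: standard error from a single sample
samp <- sample(population, n)
se_formula <- sd(samp) / sqrt(n)

# By definition: draw many new samples and take the SD of their means
sample_means <- replicate(2000, mean(sample(population, n)))
se_direct <- sd(sample_means)

c(formula = se_formula, direct = se_direct)
```

The two numbers should be of similar size; the formula needs only one sample, whereas the direct approach needs repeated access to the population.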

### The Bootstrap

Draw additional samples, with replacement, from the sample itself.

(Conceptually, this makes a synthetic "population" out of many replicates of the sample instead of drawing from the original population.)

![](05.png)

In practice, no replicates are created: drawing with replacement from the sample has the same effect, and the sample / "population" remains unchanged.


#### Algorithm

1. Draw a sample value, record it, and then replace it.
2. Repeat n times.
3. Record the mean of the n resampled values.
4. Repeat steps 1–3 R times.
5. Use the R results to:
    * a. Calculate their standard deviation (this estimates the standard error of the sample mean).
    * b. Produce a histogram or boxplot.
    * c. Find a confidence interval.
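The steps above can be written out by hand in a few lines. A minimal sketch, assuming a hypothetical sample `x` generated with `rexp` (in this notebook, `loans_income` would take its place):

```r
set.seed(1)

# Hypothetical sample (assumption; substitute real data such as loans_income)
x <- rexp(500, rate = 1 / 60000)
n <- length(x)
R <- 1000

# Steps 1-4: resample n values with replacement and record the mean, R times
boot_means <- replicate(R, mean(sample(x, size = n, replace = TRUE)))

# Step 5a: the SD of the R means estimates the standard error of the mean
se_boot <- sd(boot_means)
se_boot

# Step 5b (optional): hist(boot_means) shows the bootstrap distribution
```

For the mean, `se_boot` should land close to `sd(x) / sqrt(n)`; the advantage of the bootstrap is that the same loop works for statistics with no simple formula, such as the median.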

In R, these steps are combined in the `boot` function from the `boot` package.

The function `stat_fun` computes the median for a given sample specified by the index `idx`.

The original estimate of the median is \$62,000. The bootstrap distribution indicates that the estimate has a bias of about –\$70 and a standard error of \$209.




In [None]:
stat_fun <- function(x, idx) median(x[idx])
boot_obj <- boot(loans_income, R = 1000, statistic=stat_fun)
boot_obj # print result as object

#### Concept of "Bagging"

Short for “bootstrap aggregating”; see “Bagging and the Random Forest” on page 259.

Multivariate data -> sample the rows as units (bootstrap) -> fit a prediction tree on each bootstrap sample -> average their predictions -> generally better performance than a single tree


![](06.png)

#### Understanding the bootstrap

The bootstrap does not compensate for a small sample size; it does
not create new data, nor does it fill in holes in an existing data set.
It merely informs us about how lots of additional samples would
behave when drawn from a population like our original sample.

#### Bootstrapping vs resampling

The two terms are synonymous in most cases.

Resampling is the broader term: it also includes permutation procedures, where multiple samples are combined and the sampling may be done without replacement.

### Confidence Intervals

90% confidence interval: the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic.

Algorithm:

1. Draw a random sample of size n with replacement from the data (a resample).
2. Record the statistic of interest for the resample.
3. Repeat steps 1–2 many (R) times.
4. For an x% confidence interval, trim [(100-x) / 2]% of the R resample results from either end of the distribution.
5. The trim points are the endpoints of an x% bootstrap confidence interval.
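The algorithm above translates almost line for line into R. In this sketch, `boot_ci` is a hypothetical helper name and `x` is made-up data; both are assumptions for illustration.

```r
# Percentile bootstrap confidence interval (hypothetical helper)
boot_ci <- function(x, stat_fun = mean, R = 1000, level = 0.90) {
  # Steps 1-3: R resamples of size n with replacement, one statistic each
  stats <- replicate(R, stat_fun(sample(x, length(x), replace = TRUE)))
  # Steps 4-5: trim (1 - level) / 2 from each end; the trim points are the CI
  alpha <- (1 - level) / 2
  quantile(stats, probs = c(alpha, 1 - alpha), names = FALSE)
}

set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)  # hypothetical data
ci <- boot_ci(x)
ci
```

For an object returned by `boot`, `boot.ci(boot_obj, conf = 0.90, type = "perc")` from the same package computes a comparable percentile interval.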

* Confidence intervals are the typical way to present estimates as an interval range.
* The more data you have, the less variable a sample estimate will be.

![](07.png)

### Normal distribution

Raw data are often not normally distributed, but many sample statistics are approximately normal in their sampling distribution -> when empirical probability distributions, or bootstrap distributions, are not available -> fall back on a normal distribution.

It's also termed a Gaussian distribution, after Carl Friedrich Gauss.

#### Normalization / standardization

Subtract the mean and then divide by the standard deviation; this is called normalization or standardization, and the result is a z-score.

#### QQ-Plot

Used to determine how close a sample is to a specified distribution, here the normal distribution.

The QQ-Plot orders the z-scores from low to high and plots each value’s z-score on the y-axis; the x-axis is the corresponding quantile of a normal distribution for that value’s rank.

To convert data to z-scores, subtract the mean of the data and divide by the standard deviation; the data can then be compared to a normal distribution.

(To derive the plot, you need the quantile of each point both in the sample and in a normal distribution.)

QQ-Plot for 100 values generated from a normal distribution -> the points closely follow the line (indicating a normal distribution).

(`abline(a=0, b=1)` adds a straight line with intercept 0 and slope 1 to the existing plot.)
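The coordinates that `qqnorm` plots can be reproduced by hand, which makes the construction concrete. A minimal sketch on simulated data; `ppoints` supplies the probability for each rank, and the variable names are assumptions for illustration.

```r
set.seed(1)
samp <- rnorm(100)

# y-axis: z-scores of the sample, ordered from low to high
z <- sort((samp - mean(samp)) / sd(samp))

# x-axis: normal quantile corresponding to each value's rank
theo <- qnorm(ppoints(length(z)))

# plot(theo, z); abline(a = 0, b = 1, col = 'grey') would draw the QQ-Plot
cor(theo, z)
```

A correlation near 1 between `theo` and `z` corresponds to points lying close to the diagonal line.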

In [None]:
## Code for Figure 11
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0211.png'),  width = 4, height=4, units='in', res=300)
norm_samp <- rnorm(100)
par(mar=c(3, 3, 0, 0)+.1)
qqnorm(norm_samp, main='', xlab='', ylab='')
abline(a=0, b=1, col='grey')
# dev.off()

### The long-tailed distribution

Tail: the long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency.

Skewed: the data distribution is asymmetric, with one tail longer than the other.

Discrete: e.g. the binomial distribution.

The black swan theory by Nassim Taleb: anomalous events, such as a stock market crash, are much more likely to occur than would be predicted by the normal distribution.

Example: QQ-Plot for the daily stock returns for Netflix (NFLX).

At both ends the points lie farther from zero than the corresponding normal quantiles (below the line for low values, above it for high values), indicating a long-tailed distribution.

The points are close to the line for the data within one standard deviation of the mean; this is referred to as the data being "normal in the middle" but having long tails.

(This illustrates that, in practice, we first need a sense of the sample's distribution before assuming normality.)

In [None]:
## Code for Figure 12
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0212.png'),  width = 4, height=4, units='in', res=300)
par(mar=c(3, 3, 0, 0)+.1)
nflx <- sp500_px[,'NFLX']
nflx <- diff(log(nflx[nflx>0]))
qqnorm(nflx, main='', xlab='', ylab='')
abline(a=0, b=1, col='grey')
# dev.off()

### Student’s t-Distribution

* a normally shaped distribution, except that it is a bit thicker and longer on the tails.
* used extensively in depicting distributions of sample statistics.
* there is a family of t-distributions that differ depending on how large the sample is.

Gosset's question: “What is the sampling distribution of the mean of a sample, drawn from a larger population?”
* ran a resampling experiment, drawing random samples of 4 from a data set of 3,000 measurements of criminals’ height and left-middle-finger length.
* plotted the standardized results (z-scores) on the x-axis and the frequency on the y-axis.
* derived a function, now known as Student's t, to fit the resulting distribution.

![](08.png)

Usage of the t-distribution: estimate confidence intervals to show sampling variation
* compute the mean of the sample
* estimate a 90% confidence interval by:

(This is the classical method, from before computing power made resampling practical.)
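For reference, the standard t-based 90% interval for a mean is $\bar{x} \pm t_{n-1}(0.05) \cdot \frac{s}{\sqrt{n}}$, where $t_{n-1}(0.05)$ is the value of the t-distribution with $n - 1$ degrees of freedom that leaves 5% in the upper tail. A minimal sketch on a hypothetical sample (the data are made up for illustration):

```r
set.seed(1)
x <- rnorm(25, mean = 50, sd = 10)  # hypothetical sample
n <- length(x)

# Critical value: leaves 5% in each tail for a 90% interval
t_crit <- qt(0.95, df = n - 1)

# Interval: mean plus/minus critical value times standard error
ci_90 <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)

# Cross-check against the built-in t.test
ci_builtin <- t.test(x, conf.level = 0.90)$conf.int
ci_90
```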

![](09.png)

### Binomial Distribution

Binomial: two possible outcomes, coded 0 and 1 (no or yes).

1 usually denotes the "success" outcome.

Binomial distribution: distribution of the number of successes in n trials, where each trial has probability of success p.

Example: R functions
* `dbinom` - the probability of observing exactly x = 2 successes in size = 5 trials, where the probability of success for each trial is 0.1
* `pbinom` - the probability of observing two or fewer successes in five trials, where the probability of success for each trial is 0.1


#### Mean, variance

* mean - $np$
* variance - $np(1-p)$
* approximation - with a large enough number of trials (particularly when p is close to 0.5), a normal distribution with the same mean and variance.
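The mean and variance formulas can be verified from the probabilities themselves. A small sketch using the same n = 5 and p = 0.1 as the example above; the variable names are assumptions for illustration.

```r
n <- 5
p <- 0.1
k <- 0:n

# Probability of each possible number of successes
probs <- dbinom(k, size = n, prob = p)

# Mean and variance computed directly from the distribution
mean_exact <- sum(k * probs)
var_exact <- sum((k - mean_exact)^2 * probs)

c(mean = mean_exact, n_p = n * p, variance = var_exact, n_p_q = n * p * (1 - p))
```

`pbinom` is the running total of `dbinom`: `pbinom(2, 5, 0.1)` equals `sum(dbinom(0:2, 5, 0.1))`.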


In [None]:
# P(exactly 2 successes in 5 trials, p = 0.1)
dbinom(x=2, size=5, prob=0.1)
# P(2 or fewer successes in 5 trials, p = 0.1)
pbinom(2, 5, 0.1)

### Chi-square distribution

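Measures how well a set of observed counts matches an expected distribution.

Used for discrete values (counts).

### F-distribution

The counterpart for continuous measured values: measures whether the variation in group means is greater than expected under "normal" random variation.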

### Poisson Distribution

Distribution of the number of events per unit of time or space.

e.g. How much capacity do we need to be 95% sure of fully processing the internet traffic that arrives on a server in any five-second period?

$\lambda$ - the mean number of events that occur in a specified interval of time or space
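A capacity question like the one above can be answered with the Poisson quantile function. A sketch, assuming a hypothetical $\lambda$ of 2 events per interval (the value is an assumption for illustration):

```r
lambda <- 2  # hypothetical mean number of events per interval

# Smallest capacity that covers at least 95% of intervals
capacity <- qpois(0.95, lambda = lambda)

# Probability that an interval has no more events than the capacity
coverage <- ppois(capacity, lambda = lambda)
c(capacity = capacity, coverage = coverage)
```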

Example: generate 100 random numbers from a Poisson distribution with λ = 2

For example, if incoming customer service calls average two per minute, this code will simulate 100 minutes, returning the number of calls in each of those 100 minutes.

In [None]:
rpois(100, lambda=2)

### Exponential distribution

Uses the same rate parameter $\lambda$ as the Poisson distribution, but models the distribution of the time between events: e.g. the time between visits to a website or between cars arriving at a toll plaza.
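The link between the two distributions can be checked by simulation: exponential gaps with a given rate should produce event counts per unit of time whose mean is that rate. A sketch with a hypothetical rate of 0.2; the variable names are assumptions for illustration.

```r
set.seed(1)
rate <- 0.2  # hypothetical: 0.2 events per unit of time

# Simulated times between consecutive events
gaps <- rexp(10000, rate = rate)
mean_gap <- mean(gaps)  # should be near 1 / rate

# Turn gaps into arrival times, then count events in each unit interval
arrival_times <- cumsum(gaps)
counts <- tabulate(floor(arrival_times) + 1)
mean_count <- mean(counts)  # should be near rate

c(mean_gap = mean_gap, mean_count = mean_count)
```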

Example: n = 100 random intervals with rate = 0.2 events per time period (so the mean interval is 1 / 0.2 = 5 periods).

In [None]:
rexp(n=100, rate=0.2)

### Weibull Distribution

Used when the event rate changes over time.

If the typical interval between events is much shorter than the period over which the rate changes -> no problem -> break the time period into intervals small enough that the rate is roughly constant and apply the Poisson distribution in each.

If the rate changes within the typical interval itself -> Weibull distribution (e.g. mechanical failure, where the risk of failure increases with time).

Shape parameter, β: if β > 1, the probability of an event increases over time; if β < 1, the probability decreases.

Characteristic life (scale parameter), $\eta$.
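The effect of the shape parameter shows up in the hazard rate, the instantaneous event rate among units that have survived so far. A minimal sketch; `weibull_hazard` is a hypothetical helper, and the time points are chosen only for illustration.

```r
# Hazard = density / survival probability (hypothetical helper)
weibull_hazard <- function(t, shape, scale) {
  dweibull(t, shape = shape, scale = scale) /
    pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)
}

t <- c(1000, 3000, 5000)

# shape > 1: hazard increases with time (e.g. wear-out failures)
h_inc <- weibull_hazard(t, shape = 1.5, scale = 5000)
# shape < 1: hazard decreases with time
h_dec <- weibull_hazard(t, shape = 0.5, scale = 5000)

rbind(h_inc, h_dec)
```

Whatever the shape, the probability of an event by the characteristic life is the same: `pweibull(5000, shape = 1.5, scale = 5000)` equals `1 - exp(-1)`, about 63%.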

example: 100 random numbers (lifetimes) from a Weibull distribution with shape of 1.5 and characteristic life of 5,000

In [None]:
rweibull(100, shape=1.5, scale=5000)

### Estimating the Failure Rate

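With rare events there may be little data, or none at all, on which to base an estimate of the rate: e.g. aircraft engine failure.

Idea: if no events have been seen after 20 hours, you can be pretty sure that the rate is not 1 per hour.

-> by simulation or direct calculation, assess hypothetical rates and estimate a threshold that the true rate is very unlikely to exceed

-> with some (but not much) data, apply a goodness-of-fit test (Chi-square test) to candidate rates to see how well each fits the observed data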
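The 20-hour example can be made quantitative by direct calculation. This sketch assumes events follow a Poisson process with a constant rate; the 5% cutoff and the helper name `p_none` are assumptions for illustration.

```r
hours <- 20

# P(no events in 20 hours) for a given hourly rate; equals exp(-rate * hours)
p_none <- function(rate) dpois(0, lambda = rate * hours)

# A rate of 1 per hour makes 20 event-free hours wildly improbable
p_none(1)

# Largest rate for which 20 event-free hours still has probability >= 5%
rate_upper <- -log(0.05) / hours
rate_upper
```

Rates above `rate_upper` would make the observed event-free stretch less than 5% likely, so they can be treated as implausible.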

### Further reading

* Michael Harris’s article “Fooled by Randomness Through Selection Bias” provides
an interesting review of selection bias considerations in stock market trading
schemes, from the perspective of traders.

* The Black Swan, 2nd ed., by Nassim Nicholas Taleb (Random House, 2010)

* Handbook of Statistical Distributions with Applications, 2nd ed., by K. Krishnamoorthy
(Chapman & Hall/CRC Press, 2016)