## Chapter 3: Statistical Experiments and Significance Testing

Design of experiments

Some common experiments and their meanings.

In [None]:
# load libraries
list.of.packages <- c("dplyr", "ggplot2", "lmPerm")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(ggplot2)
library(dplyr)
library(lmPerm)

In [None]:
# read data

PSDS_PATH <- file.path('~/Desktop', 'statistics-for-data-scientists')

session_times <- read.csv(file.path(PSDS_PATH, 'data', 'web_page_data.csv'))
session_times[,2] <- session_times[,2] * 100
four_sessions  <- read.csv(file.path(PSDS_PATH, 'data', 'four_sessions.csv'))
click_rate <-  read.csv(file.path(PSDS_PATH, 'data', 'click_rates.csv'))
imanishi <-  read.csv(file.path(PSDS_PATH, 'data', 'imanishi_data.csv'))

### A/B testing

This process starts with a hypothesis (“drug A is better than the existing standard drug,” or “price A is more profitable than the existing price B”)

-> an experiment (A/B test)

-> collect data

-> apply results


#### Results of A/B test

A and B: treatment vs. no treatment ->
* discrete data: results can be summarized in a 2×2 table

![](10.png)

* continuous data: mean and SD of the resulting samples

    Revenue/page view with price A: mean = 3.87, SD = 51.10
    
    Revenue/page view with price B: mean = 4.11, SD = 62.98

### Hypothesis tests

Reasons to run a hypothesis test rather than just compare the results of test groups A and B (people tend to underestimate random variation):

* failure to anticipate extreme events, or so-called “black swans”
* the tendency to misinterpret random events as having patterns of some significance.

#### an example: 

Ask several friends to invent a series of 50 coin flips: have them write down a
series of random Hs and Ts. Then ask them to actually flip a coin 50 times and write
down the results.

real one: will have longer runs of Hs or Ts.

written one: we tend to think we need to make up a T after several Hs in order to look random

On the other hand, when we do see the real-world equivalent of six Hs in a row (e.g., when one headline outperforms another by 10%), we are inclined to attribute it to something real, not just to chance.
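The coin-flip claim can be checked by simulation. The following is a minimal sketch, assuming a fair coin; the helper name `longest_run`, the seed, and the number of simulated sequences are illustrative choices, not part of the original notes.

```r
# Longest run of identical outcomes in a sequence of flips
longest_run <- function(flips) max(rle(flips)$lengths)

set.seed(1)  # arbitrary seed so the simulation is repeatable
# Simulate 1,000 real sequences of 50 fair flips and record the longest run in each
runs <- replicate(1000, longest_run(sample(c("H", "T"), 50, replace = TRUE)))

table(runs)       # distribution of the longest run across the simulated sequences
mean(runs >= 5)   # share of sequences containing a run of 5 or more
```

Comparing `table(runs)` with the longest run in a hand-written sequence shows how much longer real runs tend to be.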

#### Result of hypothesis test

An observed difference between A and B is due to one of two things:

* Random chance in assignment of subjects

* A true difference between A and B

#### The null hypothesis: any difference between the groups is due to chance

our task: prove the null hypothesis wrong.

general idea: resampling permutation procedure 

shuffle together A and B -> bootstrap / permutation -> resample -> see if we get a difference as extreme as the observed difference

#### Alternative hypothesis: true difference

one-way (e.g. you don't want to be fooled by chance into choosing option B): B > A (with an X% chance of being fooled)

two-way (e.g. you don't want to be fooled by chance in either direction): A and B are different (with an X% chance of being fooled)
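The difference between the two alternatives shows up in how extreme results are counted. This sketch uses a small made-up vector of permutation differences (`perm_example`) and a made-up observed difference (`obs_example`), both hypothetical values chosen only for illustration.

```r
# Hypothetical permutation differences and observed difference (illustration only)
perm_example <- c(-3, -2, -1, 0, 1, 2, 3, 4)
obs_example <- 3

# One-way: how often is a chance result at least as large as the observed one?
p_one_way <- mean(perm_example >= obs_example)
# Two-way: how often is a chance result at least as extreme in either direction?
p_two_way <- mean(abs(perm_example) >= abs(obs_example))

c(one_way = p_one_way, two_way = p_two_way)   # 0.25 and 0.375
```

The two-way count is never smaller than the one-way count, because it also includes extreme results in the opposite direction.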

### Permutation test


#### Procedures

1. Combine the results from the different groups into a single data set.
2. Shuffle the combined data and then randomly draw (without replacement) a resample of the same size as group A (clearly it will contain some data from the other groups).
3. From the remaining data, randomly draw (without replacement) a resample of the same size as group B.
4. Do the same for groups C, D, and so on. You have now collected one set of resamples that mirror the sizes of the original samples.
5. Whatever statistic or estimate was calculated for the original samples (e.g., difference in group proportions), calculate it now for the resamples, and record; this constitutes one permutation iteration.
6. Repeat the previous steps R times to yield a permutation distribution of the test statistic. 
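The steps above can be sketched for any number of groups. This is a minimal sketch, not the book's code: the helper `perm_iteration`, the toy data in `groups`, and the choice of statistic (`range_of_means`, the spread between the largest and smallest group mean) are all assumptions made for illustration.

```r
# One permutation iteration: shuffle the pooled values, split them into
# groups of the original sizes, and recompute the statistic
perm_iteration <- function(values, sizes, statistic) {
  shuffled <- sample(values)                      # step 2: shuffle (without replacement)
  labels <- rep(seq_along(sizes), times = sizes)  # steps 2-4: keep the original group sizes
  statistic(split(shuffled, labels))              # step 5: statistic on the resamples
}

# Hypothetical data: three groups of unequal size
groups <- list(a = c(10, 12, 9, 11), b = c(14, 13, 15), c = c(8, 9, 10, 7, 9))
range_of_means <- function(g) diff(range(sapply(g, mean)))

set.seed(42)                                      # arbitrary seed
pooled <- unlist(groups)                          # step 1: combine the groups
perm_stats <- replicate(1000,                     # step 6: repeat R = 1000 times
  perm_iteration(pooled, lengths(groups), range_of_means))

observed <- range_of_means(groups)
mean(perm_stats >= observed)   # share of permutations at least as extreme as observed
```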

#### Results

observed difference within the set of permuted differences -> nothing is proven (the difference could be due to chance)

observed difference outside most of the permuted differences -> supports our hypothesis (chance is not responsible)

example: a proxy variable for sale of a high value product - session time on page

In [None]:
## Code for Figure 3
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0303.png'),  width = 4, height=4, units='in', res=300)
ggplot(session_times, aes(x=Page, y=Time)) + 
  geom_boxplot() +
  labs(y='Time (in seconds)') + 
  theme_bw()
# dev.off()

In [None]:
mean_a <- mean(session_times[session_times['Page']=='Page A', 'Time'])
mean_b <- mean(session_times[session_times['Page']=='Page B', 'Time'])
mean_b - mean_a

#### Question to ask: whether this difference is within the range of what random chance might produce, i.e., is statistically significant

apply a permutation test

* combine all the session times together
* repeatedly shuffle and divide them into groups of 21 and 15

(note on the difference from the bootstrap: the bootstrap samples with replacement)

R - "sample"

sample takes a sample of the specified size from the elements of x, either with or without replacement
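A short sketch of the three ways `sample` is typically called; the vector `x`, the seed, and the variable names are arbitrary choices for illustration.

```r
x <- 1:10
set.seed(7)                          # arbitrary seed

perm <- sample(x)                    # permutation: every value appears exactly once
boot <- sample(x, replace = TRUE)    # bootstrap: values may repeat or be left out
draw <- sample(x, 4)                 # 4 values without replacement (the default)

perm
boot
draw
```

The permutation test relies on the first form; the bootstrap relies on the second.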

results - frequency: how often a shuffled difference is at least as large as the observed one

In [None]:
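## Permutation test example with stickiness
# Shuffle x into two groups of sizes n1 and n2 (without replacement) and
# return mean(group 1) - mean(group 2)
perm_fun <- function(x, n1, n2)
{
  n <- n1 + n2
  idx_1 <- sample(1:n, n1)
  idx_2 <- setdiff(1:n, idx_1)
  mean_diff <- mean(x[idx_1]) - mean(x[idx_2])
  return(mean_diff)
}

## Code for Figure 4
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0304.png'),  width = 4, height=4, units='in', res=300)
perm_diffs <- rep(0, 1000)
# Assuming page A has 21 sessions and page B has 15: group 1 stands in for
# page B, so the permuted differences match the sign of mean_b - mean_a
for(i in 1:1000)
  perm_diffs[i] <- perm_fun(session_times[,'Time'], 15, 21)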
par(mar=c(4,4,1,0)+.1)
hist(perm_diffs, xlab='Session time differences (in seconds)', main='')
abline(v = mean_b - mean_a)
# dev.off()

results: around 12.5%
-> this is the share of random permutations whose difference exceeds the observed difference in session times
-> the observed difference is within the range of chance variation: not statistically significant

In [None]:
mean(perm_diffs > (mean_b - mean_a))

### Statistical Significance and p-Values

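meaning: statistical significance measures whether an experiment (or even a study of existing data) yields a result more extreme than what chance might produce; the p-value is how often the chance model produces a result as extreme as the observed one.
-> a higher p-value means the chance model is likely to produce results as extreme as the observed one, so chance alone can explain it.
-> a lower p-value means the chance model rarely produces such a result, so the result is statistically significant.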

Example: results of the previous A/B test: the data set is large (over 45,000 observations) but the number of conversions is small ($\sim 400$) -> use a significance test

Procedure of the test:

1. Put cards labeled 1 and 0 in a box: this represents the supposed shared conversion
rate of 382 ones and 45,945 zeros = 0.008246 = 0.8246%.
2. Shuffle and draw out a resample of size 23,739 (same n as price A), and record
how many 1s.
3. Record the number of 1s in the remaining 22,588 (same n as price B).
4. Record the difference in proportion of 1s.
5. Repeat steps 2–4.
6. How often was the difference >= 0.0368%?
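The two numbers the procedure depends on, the shared conversion rate and the observed difference, can be recomputed directly from the counts used elsewhere in this notebook (200 conversions out of 23,739 for price A, 182 out of 22,588 for price B). The variable names below are illustrative.

```r
conv_a <- 200; n_a <- 23739   # price A: conversions, sample size
conv_b <- 182; n_b <- 22588   # price B: conversions, sample size

# Shared conversion rate under the null hypothesis (all cards in one box)
pooled_rate <- (conv_a + conv_b) / (n_a + n_b)
# Observed difference in conversion rates, expressed in percent
obs_diff_pct <- 100 * (conv_a / n_a - conv_b / n_b)

round(c(pooled_rate = pooled_rate, obs_diff_pct = obs_diff_pct), 6)
```

The pooled rate comes out at about 0.008246 and the observed difference at about 0.0368 percent.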

Results:

observed difference of 0.0368% is well within the range of chance variation

In other words: given a chance model, the p-value is the probability that results as extreme as the observed results could occur

For the discussion:

It seems that a high p-value means the result is not proven, but a low p-value doesn't mean the result is definitely meaningful; it only means that the result is hard to explain with a specific statistical model of chance.

"As a decision tool in an experiment, a p-value should not be considered controlling, but merely another point of information bearing on a decision. 
For example, p-values are sometimes used as intermediate inputs in some statistical or machine learning models—a feature might be included in or excluded from a model depending on its p-value."


In [None]:
## Code for Figure 5
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0305.png'),  width = 4, height=4, units='in', res=300)

obs_pct_diff <- 100*(200/23739 - 182/22588)
conversion <- c(rep(0, 45945), rep(1, 382))
perm_diffs <- rep(0, 1000)
for(i in 1:1000)
  perm_diffs[i] = 100*perm_fun(conversion, 23739, 22588 )
hist(perm_diffs, xlab='Conversion rate (percent)', main='')
abline(v = obs_pct_diff, lty=2, lwd=1.5)
text("  Observed\n  difference", x=obs_pct_diff,  y=par()$usr[4]-20, adj=0)

# dev.off()

#### p-value: The chance model produces a result more extreme than the observed result

approach 1: compare the differences between the two groups in the permutation test ("perm_diffs") to the observed difference

approach 2: prop.test, since the conversion counts follow a binomial distribution
-> gives the p-value as well as a confidence interval
-> the one-sided 95% confidence interval for the difference has a lower bound of only about -0.1%, so it includes zero: price A may well be better, but the data do not prove it.

In [None]:
mean(perm_diffs > obs_pct_diff)

prop.test(x=c(200,182), n=c(23739,22588), alternative="greater")

#### alpha value

“Given a chance model, what is the probability of a result this extreme?”

alpha value - the threshold that determines whether a result is too "unusual" to be attributed to chance

5% alpha value - “more extreme than 5% of the chance (null hypothesis) results”

### t-Tests

The general idea behind a significance test: measure the effect you are interested in and determine whether that observed effect lies within the range of normal chance variation (if so, the result is not proven).

The t-test is a type of significance test, named after Student's t-distribution.

It doesn't perform permutations, so it needs far less computing power.

(So the p-value is derived not from a permutation test but from the t-distribution model.)

It standardizes the test statistic so that it can be compared to the standard t-distribution.

R: built-in t.test
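To make the standardization concrete, the sketch below computes the Welch t-statistic by hand and compares it with `t.test`. The two vectors `a` and `b` are made-up session times used only for illustration, not the notebook's data.

```r
# Hypothetical session times for two pages (illustration only)
a <- c(21, 35, 67, 85, 118, 147, 173)
b <- c(45, 75, 43, 150, 177, 212)

# Welch t-statistic: difference in means divided by its standard error
se <- sqrt(var(a) / length(a) + var(b) / length(b))
t_manual <- (mean(a) - mean(b)) / se

# t.test uses the Welch version by default (var.equal = FALSE)
res <- t.test(a, b, alternative = "less")
c(manual = t_manual, t.test = unname(res$statistic))

# One-sided p-value read off the t-distribution with the Welch degrees of freedom
p_manual <- pt(t_manual, df = unname(res$parameter))
c(manual = p_manual, t.test = res$p.value)
```

Both pairs of numbers should agree, which shows that the p-value comes from the t-distribution rather than from resampling.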


In [None]:
## Histogram of resample
## t-test
t.test(Time ~ Page, data=session_times, alternative='less' )

### Multiple testing

“Torture the data long enough, and it will confess.”

For example, 20 predictor variables -> each tested at the 0.05 significance level
-> probability that at least one tests significant by chance -> 1 - 0.95^20 = 0.64
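The arithmetic can be reproduced in one line. This sketch assumes the 20 tests are independent and that none of the predictors has a real effect; the variable names are illustrative.

```r
alpha <- 0.05
n_tests <- 20

# Probability that at least one of n independent tests is significant by chance
fwer <- 1 - (1 - alpha)^n_tests
round(fwer, 2)   # 0.64
```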

"fitting the model to the noise"

(There is a high probability that at least one of many predictors would pass the test against the "chance" model)

The more comparison being done, the more likely to be fooled by data (e.g. compare results from different test group, compare therapy at multiple stages)

-> adjustments (e.g. divide the alpha value by the number of tests)

-> a smaller alpha and thus a stricter bar for statistical significance.
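Dividing alpha by the number of tests is the Bonferroni adjustment. The sketch below applies it in two equivalent ways, using made-up p-values (`p_values`) chosen only for illustration; `p.adjust` is part of base R's stats package.

```r
# Hypothetical p-values from five tests (illustration only)
p_values <- c(0.003, 0.012, 0.030, 0.048, 0.200)
alpha <- 0.05

# Option 1: compare each p-value to alpha divided by the number of tests
sig_divided <- p_values < alpha / length(p_values)
# Option 2 (equivalent): inflate the p-values and compare to the original alpha
sig_adjusted <- p.adjust(p_values, method = "bonferroni") < alpha

rbind(sig_divided, sig_adjusted)
```

Four of the five p-values are below 0.05 on their own, but only the smallest one survives the adjustment.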

#### The Degrees of freedom

e.g. if you know the mean of a sample of size n -> dof is n-1

(once you know n-1 of the sample values, the nth can be calculated and is not free to vary)

big data: doesn't matter much, because n is already large

linear regression: this matters, because the predictor variables must not be redundant as a set (e.g. a factor with n levels is encoded with at most n-1 indicator variables).
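One place this shows up in R, assuming the regression note refers to how factor variables are encoded: `model.matrix` turns a factor with n levels into n-1 indicator columns plus an intercept. The factor `page` below is a hypothetical example.

```r
# Hypothetical factor with 4 levels
page <- factor(c("Page 1", "Page 2", "Page 3", "Page 4", "Page 1", "Page 3"))

X <- model.matrix(~ page)
colnames(X)   # intercept plus 3 indicators; "Page 1" is the reference level
```

A fourth indicator would be fully determined by the intercept and the other three, so it is not free to vary.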

### ANOVA

A comparison of multiple groups (e.g. A/B/C/D).

analysis of variance (ANOVA).

#### The stickiness of 4 webpages

![](11.png)

6 different comparisons in total:

* Page 1 compared to page 2
* Page 1 compared to page 3
* Page 1 compared to page 4
* Page 2 compared to page 3
* Page 2 compared to page 4
* Page 3 compared to page 4
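The count of pairwise comparisons grows quickly with the number of groups. A short sketch using base R's `choose` and `combn`; the vector `pages` is an illustrative stand-in for the four page labels.

```r
pages <- paste("Page", 1:4)

n_pairs <- choose(length(pages), 2)   # number of pairwise comparisons
page_pairs <- t(combn(pages, 2))      # one row per comparison

n_pairs      # 6
page_pairs
```

With five pages the count would already be `choose(5, 2)`, i.e. 10, which is why a single overall test is preferred to many pairwise ones.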


“Could all the pages have the same underlying stickiness, and the differences among them be due to the random way in which a common set of session times got allocated among the four pages?”

In [None]:
four_sessions

In [None]:
## session times

## four groups ANOVA

## Code for Figure 6
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0306.png'),  width = 4, height=4, units='in', res=300)

ggplot(four_sessions, aes(x=Page, y=Time)) + 
  geom_boxplot() +
  labs(y='Time (in seconds)') +
  theme_bw()

# dev.off()

In [None]:
summary(aovp(Time ~ Page, data=four_sessions))
summary(aov(Time ~ Page, data=four_sessions))

In [None]:
## Chi square test
clicks <- matrix(click_rate$Rate, nrow=3, ncol=2, byrow=TRUE)
dimnames(clicks) <- list(unique(click_rate$Headline), unique(click_rate$Click))

chisq.test(clicks, simulate.p.value=TRUE)

chisq.test(clicks, simulate.p.value=FALSE)

In [None]:
## Code for Figure 7
x <- seq(1, 30, length=100)
# factor labels must match the degrees of freedom passed to dchisq() below
chi <- data.frame(df = factor(rep(c(1, 2, 5, 20), rep(100, 4))),
                  x = rep(x, 4),
                  p = c(dchisq(x, 1), dchisq(x, 2), dchisq(x, 5), dchisq(x, 20)))

# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0307.png'),  width = 5, height=3, units='in', res=300)

ggplot(chi, aes(x=x, y=p)) +
  geom_line(aes(linetype=df)) +
  theme_bw() +
  labs(x='', y='')

# dev.off()

In [None]:
## Fisher's exact test
fisher.test(clicks)

In [None]:
## Tufts example
## Code for Figure 8
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0308.png'),  width = 4, height=4, units='in', res=300)
imanishi$Digit <- factor(imanishi$Digit)

ggplot(imanishi, aes(x=Digit, y=Frequency)) +
  geom_bar(stat="identity") +
  theme_bw()
  
# dev.off()

### Further Reading

* The Drunkard’s Walk by Leonard Mlodinow (Pantheon, 2008) is a readable survey of the ways in which “randomness rules our lives.”
* Randomization Tests, 4th ed., by Eugene Edgington and Patrick Onghena (Chapman & Hall/CRC Press, 2007), but don’t get too drawn into the thicket of nonrandom sampling.
* Introductory Statistics and Analytics: A Resampling Perspective by Peter Bruce (Wiley, 2014)