## Chapter 3: Statistical Experiments and Significance Testing

Design of experiments

Some common experiments and their meanings.

In [None]:
# load libraries
list.of.packages <- c("dplyr", "ggplot2", "lmPerm")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

library(ggplot2)
library(dplyr)
library(lmPerm)

In [None]:
# read data

PSDS_PATH <- file.path('~/Desktop', 'statistics-for-data-scientists')

session_times <- read.csv(file.path(PSDS_PATH, 'data', 'web_page_data.csv'))
session_times[,2] <- session_times[,2] * 100
four_sessions  <- read.csv(file.path(PSDS_PATH, 'data', 'four_sessions.csv'))
click_rate <-  read.csv(file.path(PSDS_PATH, 'data', 'click_rates.csv'))
imanishi <-  read.csv(file.path(PSDS_PATH, 'data', 'imanishi_data.csv'))

### A/B testing

This process starts with a hypothesis (“drug A is better than the existing standard drug,” or “price A is more profitable than the existing price B”)

-> an experiment (A/B test)

-> collect data

-> apply results


#### Results of A/B test

A and B: treatment vs. no treatment. The results can be summarized as:
* discrete data -> a 2×2 table

![](10.png)

* continuous data -> mean and SD of each resulting sample

    Revenue/page view with price A: mean = 3.87, SD = 51.10
    
    Revenue/page view with price B: mean = 4.11, SD = 62.98
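
As a minimal sketch of the discrete case, the 2×2 table can be built from counts. The counts below are the ones used in the conversion example later in this notebook (200 of 23,739 and 182 of 22,588); the labels "Price A" and "Price B" are assumed here for illustration.

```r
# Counts from the conversion example later in this notebook.
# The row labels "Price A"/"Price B" are assumptions for illustration.
conversions <- c(200, 182)
views <- c(23739, 22588)

# matrix() fills column by column: conversions first, then non-conversions
ab_table <- matrix(
  c(conversions, views - conversions),
  nrow = 2,
  dimnames = list(Price = c("Price A", "Price B"),
                  Outcome = c("Conversion", "No conversion"))
)
ab_table

# Conversion rate per group, in percent
conv_rate_pct <- 100 * conversions / views
conv_rate_pct
```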

### Hypothesis tests

Reasons to use a hypothesis test rather than just comparing the results of test groups A and B:

* failure to anticipate extreme events, or so-called “black swans”
* the tendency to misinterpret random events as having patterns of some significance.

#### An example

Ask several friends to invent a series of 50 coin flips: have them write down a
series of random Hs and Ts. Then ask them to actually flip a coin 50 times and write
down the results.

Real one: will have longer runs of Hs or Ts.

Written one: we tend to think we need to insert a T after several Hs in order to look random.

On the other hand, when we do see the real-world equivalent of six Hs in a row (e.g., when one headline outperforms another by 10%), we are inclined to attribute it to something real, not just to chance.
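
The real half of this exercise can be simulated. The sketch below flips a fair coin 50 times with `sample` and measures the longest run with `rle`; the seed and the 1,000 repetitions are arbitrary choices, not part of the original example.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulate 50 fair coin flips
flips <- sample(c("H", "T"), size = 50, replace = TRUE)

# rle() collapses consecutive repeats into runs
runs <- rle(flips)
longest_run <- max(runs$lengths)
longest_run

# Distribution of the longest run over many simulated sequences
longest_runs <- replicate(1000, {
  f <- sample(c("H", "T"), size = 50, replace = TRUE)
  max(rle(f)$lengths)
})
table(longest_runs)
```

Comparing `table(longest_runs)` with the longest run in a hand-written sequence gives a rough sense of how much people underestimate streaks.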

#### Result of hypothesis test

An observed difference between groups A and B can be due to either:

* Random chance in assignment of subjects

* A true difference between A and B

#### The null hypothesis: any difference between the groups is due to chance

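Our task: show that the null hypothesis is implausible, i.e., that the observed difference is more extreme than what chance alone might produce.

General idea: a resampling (permutation) procedure.

Shuffle together the results from A and B -> resample (permutation or bootstrap) into groups of the original sizes -> see how often we get a difference as extreme as the observed difference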

#### Alternative hypothesis: true difference

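One-way (one-tailed), e.g., you only want protection against being fooled by chance in favor of option B: the alternative is B > A, and only extreme chance results in that direction count toward the X% chance level.

Two-way (two-tailed), e.g., you want protection against being fooled by chance in either direction: the alternative is that A and B are different, and extreme chance results in either direction count toward the X% chance level.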
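
A minimal sketch of how the two alternatives change the calculation, assuming a hypothetical permutation distribution (simulated with `rnorm`, not taken from this chapter's data) and a hypothetical observed difference:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical permutation differences (B minus A), for illustration only
perm_diffs_demo <- rnorm(1000, mean = 0, sd = 1)
obs_diff_demo <- 1.5  # hypothetical observed difference

# One-way: only results in the direction "B greater than A" count as extreme
p_one_way <- mean(perm_diffs_demo >= obs_diff_demo)

# Two-way: a difference in either direction counts as extreme
p_two_way <- mean(abs(perm_diffs_demo) >= abs(obs_diff_demo))

c(one_way = p_one_way, two_way = p_two_way)
```

The two-way proportion can never be smaller than the one-way proportion, because it counts extreme results in both tails.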

### Permutation test


#### Procedures

1. Combine the results from the different groups into a single data set.
2. Shuffle the combined data and then randomly draw (without replacement) a resample of the same size as group A (clearly it will contain some data from the other groups).
3. From the remaining data, randomly draw (without replacement) a resample of the same size as group B.
4. Do the same for groups C, D, and so on. You have now collected one set of resamples that mirror the sizes of the original samples.
5. Whatever statistic or estimate was calculated for the original samples (e.g., difference in group proportions), calculate it now for the resamples, and record; this constitutes one permutation iteration.
6. Repeat the previous steps R times to yield a permutation distribution of the test statistic. 
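
The steps above can be sketched for any number of groups. The data, the group sizes, and the choice of statistic (variance of the group means) are hypothetical and only illustrate the mechanics; shuffling all values while keeping the original labels is equivalent to drawing resamples of the original sizes without replacement.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical data: three groups of unequal size
values <- c(10, 12, 9, 14, 11, 15, 13, 16, 12, 18, 17, 20)
groups <- factor(rep(c("A", "B", "C"), times = c(5, 4, 3)))

# Statistic: variance of the group means (any statistic could be used)
group_stat <- function(v, g) var(tapply(v, g, mean))

observed <- group_stat(values, groups)

# Steps 1-5: shuffle the combined values and reuse the original labels,
# which yields resamples with the original group sizes
one_permutation <- function(v, g) group_stat(sample(v), g)

# Step 6: repeat R times to get the permutation distribution
R <- 1000
perm_stats <- replicate(R, one_permutation(values, groups))

# Share of permutations at least as extreme as the observed statistic
mean(perm_stats >= observed)
```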

#### Results

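If the observed difference lies well within the set of permuted differences, we have not proven anything: the observed difference is within the range of what chance might produce.

If the observed difference lies outside most of the permutation distribution, we conclude that chance is not responsible, i.e., the difference is statistically significant.

Example: session time on a page, used as a proxy variable for sales of a high-value product.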

In [None]:
## Code for Figure 3
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0303.png'),  width = 4, height=4, units='in', res=300)
ggplot(session_times, aes(x=Page, y=Time)) + 
  geom_boxplot() +
  labs(y='Time (in seconds)') + 
  theme_bw()
# dev.off()

In [None]:
mean_a <- mean(session_times[session_times['Page']=='Page A', 'Time'])
mean_b <- mean(session_times[session_times['Page']=='Page B', 'Time'])
mean_b - mean_a

#### Question to ask: whether this difference is within the range of what random chance might produce, i.e., is statistically significant

apply a permutation test

* combine all the session times together
* repeatedly shuffle and divide them into groups of 21 and 15

(Note on the difference from the bootstrap: the bootstrap samples with replacement, whereas a permutation test samples without replacement.)

In R, use `sample`:

`sample` takes a sample of the specified size from the elements of `x`, either with or without replacement.
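
A small sketch of the two modes of `sample` on a toy vector (the vector and the seed are arbitrary):

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x <- 1:10

# Without replacement (the default): a permutation of x
perm <- sample(x)

# With replacement: a bootstrap resample, so values may repeat
boot <- sample(x, replace = TRUE)

perm
boot
```

The permutation contains every element of `x` exactly once, which is what the permutation test relies on; the bootstrap resample generally does not.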

Result: a frequency distribution (histogram) of the permuted differences, which shows where the observed result falls relative to chance variation.

In [None]:
## Permutation test example with stickiness
# Randomly split x into groups of size n1 and n2 (without replacement)
# and return mean(group of size n1) - mean(group of size n2).
perm_fun <- function(x, n1, n2)
{
  n <- n1 + n2
  idx_1 <- sample(1:n, n1)
  idx_2 <- setdiff(1:n, idx_1)
  mean_diff <- mean(x[idx_1]) - mean(x[idx_2])
  return(mean_diff)
}

## Code for Figure 4
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0304.png'),  width = 4, height=4, units='in', res=300)
perm_diffs <- rep(0, 1000)
# Page B has 15 sessions and Page A has 21, so the Page B size goes first
# to match the observed statistic mean_b - mean_a.
for(i in 1:1000)
  perm_diffs[i] <- perm_fun(session_times[,'Time'], 15, 21)
par(mar=c(4,4,1,0)+.1)
hist(perm_diffs, xlab='Session time differences (in seconds)', main='')
abline(v = mean_b - mean_a)
# dev.off()

Result: in about 12.5% of the random permutations, the difference exceeds the observed difference in session times. A difference this large arises fairly often by chance alone, so the observed difference is not statistically significant.

In [None]:
mean(perm_diffs > (mean_b - mean_a))

In [None]:
## Code for Figure 5
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0305.png'),  width = 4, height=4, units='in', res=300)

obs_pct_diff <- 100*(200/23739 - 182/22588)
conversion <- c(rep(0, 45945), rep(1, 382))
perm_diffs <- rep(0, 1000)
for(i in 1:1000)
  perm_diffs[i] = 100*perm_fun(conversion, 23739, 22588 )
hist(perm_diffs, xlab='Conversion rate (percent)', main='')
abline(v = obs_pct_diff, lty=2, lwd=1.5)
text("  Observed\n  difference", x=obs_pct_diff,  y=par()$usr[4]-20, adj=0)

# dev.off()

In [None]:
mean(perm_diffs > obs_pct_diff)

prop.test(x=c(200,182), n=c(23739,22588), alternative="greater")

In [None]:
## t-test
## One-sided test: is the mean session time for Page A less than for Page B?
t.test(Time ~ Page, data=session_times, alternative='less' )

In [None]:
## session times

## four groups ANOVA

## Code for Figure 6
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0306.png'),  width = 4, height=4, units='in', res=300)

ggplot(four_sessions, aes(x=Page, y=Time)) + 
  geom_boxplot() +
  labs(y='Time (in seconds)') +
  theme_bw()

# dev.off()

In [None]:
summary(aovp(Time ~ Page, data=four_sessions))
summary(aov(Time ~ Page, data=four_sessions))

In [None]:
## Chi square test
clicks <- matrix(click_rate$Rate, nrow=3, ncol=2, byrow=TRUE)
dimnames(clicks) <- list(unique(click_rate$Headline), unique(click_rate$Click))

chisq.test(clicks, simulate.p.value=TRUE)

chisq.test(clicks, simulate.p.value=FALSE)

In [None]:
## Code for Figure 7
x <- seq(1, 30, length=100)
chi <- data.frame(df = factor(rep(c(1, 2, 5, 20), rep(100, 4))),
                  x = rep(x, 4),
                  p = c(dchisq(x, 1), dchisq(x, 2), dchisq(x, 5), dchisq(x, 20)))

# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0307.png'),  width = 5, height=3, units='in', res=300)

ggplot(chi, aes(x=x, y=p)) +
  geom_line(aes(linetype=df)) +
  theme_bw() +
  labs(x='', y='')

# dev.off()

In [None]:
## Fisher's exact test
fisher.test(clicks)

In [None]:
## Tufts example
## Code for Figure 8
# png(filename=file.path(PSDS_PATH, 'figures', 'psds_0308.png'),  width = 4, height=4, units='in', res=300)
imanishi$Digit <- factor(imanishi$Digit)

ggplot(imanishi, aes(x=Digit, y=Frequency)) +
  geom_bar(stat="identity") +
  theme_bw()
  
# dev.off()

### Further Reading

* The Drunkard’s Walk by Leonard Mlodinow (Pantheon, 2008) is a readable survey of the ways in which “randomness rules our lives.”
* Randomization Tests, 4th ed., by Eugene Edgington and Patrick Onghena (Chapman & Hall/CRC Press, 2007), but don’t get too drawn into the thicket of nonrandom sampling.
* Introductory Statistics and Analytics: A Resampling Perspective by Peter Bruce (Wiley, 2014).