# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population and have only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appears in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write code to compute **the difference of the search_count means between Interface A and Interface B.**

In [62]:
import numpy as np
import pandas as pd
import sys
import scipy.stats as stats
df = pd.read_json('searchlog.json', orient='records', lines=True)
df_a = df[df['search_ui'] == 'A']
df_b = df[df['search_ui'] == 'B']
print(df_b['search_count'].mean()-df_a['search_count'].mean())

0.13500569535052287


Suppose we find that the mean value increased by 0.135. Then, we wonder whether this result is just caused by random variation. 

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., <0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [65]:
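counter = 0.
numSamples = 10000
reference_stat = 0.135  # observed mean(B) - mean(A) from the previous cell
# Shuffle a copy so the original search_count column in df stays intact.
df_perm = df[['search_ui', 'search_count']].copy()
for i in range(numSamples):
    # Under the null hypothesis the UI labels are exchangeable, so permute the counts.
    df_perm['search_count'] = np.random.permutation(df_perm['search_count'])
    group_means = df_perm.groupby('search_ui')['search_count'].mean()
    # One-sided test: count permutations where mean(B) - mean(A) exceeds the observed value.
    if group_means['B'] - group_means['A'] > reference_stat:
        counter += 1.
print(counter/numSamples)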

0.1382
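
For comparison, the same idea can be sketched without pandas. The block below is a minimal, standard-library illustration and not the assignment's reference solution: the function name `permutation_p_value`, the fixed `seed`, and the two small lists are all made up for the example. It uses a two-sided criterion (absolute difference), whereas the cell above is one-sided.

```python
import random

def permutation_p_value(group_a, group_b, num_samples=10000, seed=0):
    """Two-sided permutation test on the difference of means (B minus A).

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    rng = random.Random(seed)
    observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_samples):
        # Shuffling the pooled values reassigns them to the two groups at random.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = sum(perm_b) / len(perm_b) - sum(perm_a) / len(perm_a)
        # Count permutations at least as extreme as the observed difference.
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / num_samples

# Made-up search counts, not taken from searchlog.json.
a = [0, 1, 0, 2, 1, 0, 3, 1]
b = [1, 2, 0, 3, 1, 2, 4, 1]
p = permutation_p_value(a, b)
print(p)
```

Pooling the values and re-splitting them is equivalent to shuffling the `search_count` column against fixed `search_ui` labels.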


Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset to do this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** Yes, we are p-hacking, because we keep analyzing the same data until we find something significant. We should collect a new dataset, for example by collecting UI data only from instructors.
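
To see why repeated analyses inflate the risk, the following rough sketch computes the chance of at least one false positive across several tests. It assumes the tests are independent, which is not strictly true when they reuse the same data, and both function names are hypothetical. The Bonferroni threshold shown is one common correction, not something this assignment prescribes.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests

for k in (1, 2, 5, 20):
    print(k, round(family_wise_error_rate(k), 4), bonferroni_threshold(k))
```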


## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

The searchlog dataset has two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works on your own and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [71]:
instructor_a = len(df_a[df_a['is_instructor'] == True])
instructor_b = len(df_b[df_b['is_instructor'] == True])
notinstructor_a = len(df_a[df_a['is_instructor'] == False])
notinstructor_b = len(df_b[df_b['is_instructor'] == False])
obs = [[instructor_a, instructor_b],
               [notinstructor_a, notinstructor_b]]
print('obs:')
print(obs)

# Row totals (instructor / non-instructor) and column totals (UI A / UI B).
obs_instructor, obs_notinstructor = np.sum(obs, axis=1)
obs_a, obs_b = np.sum(obs, axis=0)
total = np.sum(obs)
# Expected count under independence: row total * column total / grand total.
exp_instructor_a = (obs_instructor*obs_a)/total
exp_instructor_b = (obs_instructor*obs_b)/total
exp_notinstructor_a = (obs_notinstructor*obs_a)/total
exp_notinstructor_b = (obs_notinstructor*obs_b)/total
exp=[[exp_instructor_a,exp_instructor_b],[exp_notinstructor_a,exp_notinstructor_b]]
print('exp:')
print(exp)
chi_squared = (instructor_a-exp_instructor_a)**2/exp_instructor_a + \
    (instructor_b-exp_instructor_b)**2/exp_instructor_b + \
    (notinstructor_a-exp_notinstructor_a)**2/exp_notinstructor_a + \
    (notinstructor_b-exp_notinstructor_b)**2/exp_notinstructor_b
print(chi_squared)
#compare to the API
stats.chi2_contingency(obs,correction=False)

obs:
[[115, 120], [233, 213]]
exp:
[[120.08810572687224, 114.91189427312776], [227.91189427312776, 218.08810572687224]]
0.6731740891275046


(0.6731740891275046,
 0.41194715912043356,
 1,
 array([[120.08810573, 114.91189427],
        [227.91189427, 218.08810573]]))
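
The cell above writes out the four terms by hand. As a minimal sketch, the same statistic (the sum over cells of (observed - expected)² / expected) can be computed for a table of any size with plain Python. The function name `chi_squared_stat` is made up for this example. The observed counts are the ones printed above.

```python
def chi_squared_stat(observed):
    """Pearson Chi-squared statistic and degrees of freedom for a contingency table.

    Hypothetical helper for illustration; expects a list of equal-length rows.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs_ij in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            exp_ij = row_totals[i] * col_totals[j] / total
            stat += (obs_ij - exp_ij) ** 2 / exp_ij
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_squared_stat([[115, 120], [233, 213]])
print(stat, dof)
```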

Please explain how to use Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated. 

**A.** After computing the Chi-squared statistic (0.67), we look it up in a Chi-squared table with 1 degree of freedom, (rows - 1) × (columns - 1) = 1, which gives a p-value of about 0.41. This is much greater than 0.05, so we cannot reject the null hypothesis that the two columns are independent. There is no evidence that `is_instructor` and `search_ui` are correlated.
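
Instead of a printed table, the p-value for 1 degree of freedom can be obtained in closed form. A Chi-squared variable with 1 degree of freedom is the square of a standard normal, so its upper-tail probability is `erfc(sqrt(x / 2))`. The sketch below applies that identity to the statistic reported above. The function name is hypothetical, and the shortcut only holds for 1 degree of freedom.

```python
import math

def chi2_p_value_1dof(stat):
    """Upper-tail p-value of a Chi-squared statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

p_value = chi2_p_value_1dof(0.6731740891275046)
print(p_value)
```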

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.