# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or occurred in the sample merely by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to use the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference between the search_count means of Interface A and Interface B.**

In [2]:
#<-- Write Your Code -->
import pandas as pd
import numpy as np

df = pd.read_json('searchlog.json', lines=True)

# Mean search_count per interface, indexed by the search_ui label
means = df.groupby('search_ui')['search_count'].mean()
A_mean = means['A']
B_mean = means['B']

# Observed test statistic: absolute difference of the two means
difference_counts = abs(A_mean - B_mean)
difference_counts

0.13500569535052287

Suppose we find that the two means differ by about 0.135. We then ask whether this difference could be caused by random variation alone.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

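We compute the p-value of the observed result, that is, the probability of seeing a difference at least this large if the null hypothesis were true. If the p-value is low (e.g., < 0.01), we reject the null hypothesis and adopt the alternative explanation.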

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [41]:
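#<-- Write Your Code -->
numSamples = 10000

# Observed test statistic, computed from the data rather than hard-coded
group_means = df.groupby('search_ui')['search_count'].mean()
observed_diff = abs(group_means['A'] - group_means['B'])

is_A = (df['search_ui'] == 'A').to_numpy()
is_B = (df['search_ui'] == 'B').to_numpy()
counts = df['search_count'].to_numpy()

count = 0
for i in range(numSamples):
    # Shuffle search_count across users, breaking any link with search_ui
    shuffled = np.random.permutation(counts)
    diff = abs(shuffled[is_A].mean() - shuffled[is_B].mean())
    # Count permutations at least as extreme as the observed difference
    if diff >= observed_diff:
        count += 1
p_value = count / numSamples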

In [42]:
p_value

0.2512
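
For reuse, the loop above can be wrapped in a function. The following is a minimal sketch, not a reference solution: the name `permutation_p_value`, its parameters, and the synthetic data are assumptions made for illustration. It uses only the standard library `random` module with fixed seeds so that runs are reproducible.

```python
import random

def permutation_p_value(a, b, num_samples=10000, seed=0):
    """Two-sided permutation test on the absolute difference of means.

    Hypothetical helper: `a` and `b` are sequences of per-user values
    (for example, search counts) for the two groups.
    """
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    count = 0
    for _ in range(num_samples):
        # Reassign values to groups at random
        rng.shuffle(pooled)
        sum_a = sum(pooled[:n_a])
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Count shuffles at least as extreme as the observed difference
        if diff >= observed:
            count += 1
    return count / num_samples

# Synthetic data (an assumption, not the searchlog data):
# group_b_shifted is drawn from a range shifted up by one.
data_rng = random.Random(42)
group_a = [data_rng.randint(0, 4) for _ in range(200)]
group_b_shifted = [data_rng.randint(1, 5) for _ in range(200)]

p_shifted = permutation_p_value(group_a, group_b_shifted, num_samples=2000)
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3], num_samples=100)
print(p_shifted, p_identical)
```

When the two groups really differ, almost no shuffle reproduces the observed gap, so the p-value is close to 0. When the groups are identical, every shuffle is at least as extreme as an observed difference of 0, so the p-value is 1.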

Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

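**A.** Yes, this is p-hacking if we simply rerun the test and report whichever result looks significant. Because we are running two hypothesis tests on the same data, we should lower the significance level for each test, for example by dividing it by the number of tests (a Bonferroni correction).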
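
One common way to lower the significance level across several tests is the Bonferroni correction. The sketch below is illustrative only: `bonferroni_decisions` is a hypothetical helper, the overall level of 0.01 is the example threshold mentioned earlier, and the second p-value is a made-up placeholder because the instructor-only test has not been run here.

```python
def bonferroni_decisions(p_values, alpha=0.01):
    """Return the per-test threshold and, per test, whether H0 is rejected."""
    # Split the overall significance level evenly across all tests
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# 0.2512 is the p-value reported above for all users.
# 0.03 is a placeholder for the instructor-only test, not a computed result.
threshold, rejected = bonferroni_decisions([0.2512, 0.03], alpha=0.01)
print(threshold, rejected)
```

With two tests, each p-value must fall below half of the original threshold before its null hypothesis is rejected.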

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 
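
For reference, the Pearson chi-squared statistic for a contingency table can be written as follows. The symbols are notation introduced here, not taken from the assignment: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $E_{ij}$ is the count expected if the two variables were independent.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

For a table with $r$ rows and $c$ columns, the statistic has $(r-1)(c-1)$ degrees of freedom.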

In [7]:
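# Contingency table of observed counts, with totals in the margins
observed = pd.crosstab(df['is_instructor'], df['search_ui'], margins=True)
observed.columns = ["A", "B", "row_totals"]
observed.index = ["False", "True", "column_totals"]

# Grand total, read from the table rather than hard-coded
n = observed.loc["column_totals", "row_totals"]

# Expected counts under independence: (row total * column total) / n
expected = np.outer(observed["row_totals"].iloc[0:2],
                    observed.loc["column_totals"].iloc[0:2]) / n
expected = pd.DataFrame(expected, index=["False", "True"], columns=["A", "B"])

# Chi-squared statistic over the four inner cells (margins excluded)
inner = observed.loc[["False", "True"], ["A", "B"]]
chi_squared_value = (((inner - expected) ** 2) / expected).sum().sum()
print(chi_squared_value)
observed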

0.6731740891275046


| | A | B | row_totals |
|---|---|---|---|
| False | 233 | 213 | 446 |
| True | 115 | 120 | 235 |
| column_totals | 348 | 333 | 681 |
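
The same computation can be sketched in plain Python for a table of any size. This is an illustrative helper under stated assumptions, not part of the assignment: the name `chi_squared_statistic` and the list-of-rows input format are made up here, and the counts are copied from the table above.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows, each a list of observed counts
    (margins excluded). Hypothetical helper for illustration.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were independent
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Observed counts from the table above: rows False/True, columns A/B
stat, dof = chi_squared_statistic([[233, 213], [115, 120]])
print(stat, dof)
```

For these counts the statistic comes out at about 0.673 with 1 degree of freedom, in line with the value printed by the notebook cell.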


Please explain how to use a Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** After calculating the chi-squared statistic, we look it up in a chi-squared table with 1 degree of freedom, (2 − 1) × (2 − 1), to get the p-value. The statistic here (about 0.673) is well below the 0.05 critical value of 3.841, so the p-value is large and we fail to reject the null hypothesis: the data give no evidence that `is_instructor` and `search_ui` are correlated.
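
Instead of a printed table, the p-value for 1 degree of freedom can also be computed in closed form. The sketch below relies on the identity P(X > x) = erfc(sqrt(x / 2)), which holds only for a chi-squared variable with 1 degree of freedom; the function name `chi2_p_value_one_dof` is hypothetical, and tables with more degrees of freedom would need a different method.

```python
import math

def chi2_p_value_one_dof(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    A chi-squared variable with 1 dof is the square of a standard normal,
    so its upper tail equals erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

# Statistic reported above for is_instructor vs. search_ui
p_value_chi2 = chi2_p_value_one_dof(0.6731740891275046)
print(round(p_value_chi2, 3))
```

The resulting p-value is roughly 0.41, far above common significance levels such as 0.05 or 0.01.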

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.