# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample only by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How do you implement it?
3. What is a p-value? How can p-hacking be avoided?
4. What is a chi-squared test? How do you implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I am interested in: does the number of searches per user differ between the two interfaces?

To answer this question, we first need to pick a **test statistic** to quantify how good an interface is. Here, we choose the mean of `search_count`.

Please write code to compute **the difference between the `search_count` means of Interface A and Interface B.**

In [23]:
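#<-- Write Your Code -->
import pandas as pd

# lines=True: searchlog.json is read as one JSON record per line
data = pd.read_json('searchlog.json', lines=True)

# Split users by the interface they were shown
data_a = data[data['search_ui'] == 'A']
data_b = data[data['search_ui'] == 'B']

# Test statistic: absolute difference of the search_count means
mean_ui = abs(data_a['search_count'].mean() - data_b['search_count'].mean())
mean_ui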


0.13500569535052287

Suppose we find that the two means differ by about 0.135. We then want to know whether this result is just caused by random variation.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
Then the next job is to check whether we can reject the null hypothesis or not. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [24]:
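#<-- Write Your Code -->
numSamples = 10000
new_means = []

for i in range(numSamples):
    # Shuffle all rows, then split at the original size of group A
    new_data = data.sample(frac=1).reset_index(drop=True)
    new_data_1 = new_data[:len(data_a)]
    new_data_2 = new_data[len(data_a):]

    new_means.append(new_data_1['search_count'].mean() - new_data_2['search_count'].mean())

# Keep only the permuted differences at least as extreme as the observed one
new_means = list(filter(lambda new_mean: abs(new_mean) >= abs(mean_ui), new_means))

# p-value: fraction of permutations at least as extreme as the observation
p_value = len(new_means) / numSamples

p_value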

0.2594
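
The same procedure can also be written as a small reusable function. The following is a minimal sketch in plain Python, not the required solution: the names `mean_difference` and `permutation_p_value`, their parameters, and the toy input lists are assumptions made for illustration and do not come from searchlog.json.

```python
import random


def mean_difference(pooled, n_a):
    """Mean of the first n_a values minus the mean of the remaining values."""
    first, rest = pooled[:n_a], pooled[n_a:]
    return sum(first) / len(first) - sum(rest) / len(rest)


def permutation_p_value(group_a, group_b, num_samples=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean_difference(pooled, n_a))

    extreme = 0
    for _ in range(num_samples):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split at the original size.
        rng.shuffle(pooled)
        if abs(mean_difference(pooled, n_a)) >= observed:
            extreme += 1
    return extreme / num_samples


# Hypothetical search counts, for illustration only
toy_a = [0, 1, 0, 2, 1, 0]
toy_b = [1, 3, 2, 0, 1, 2]
print(permutation_p_value(toy_a, toy_b, num_samples=2000))
```

Shuffling a plain list instead of a whole DataFrame avoids re-indexing on every iteration, which matters when `num_samples` is large.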

Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.**
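Yes. Re-analyzing the same dataset on a subgroup chosen after seeing the first result is a form of p-hacking: every additional test run on the same data increases the chance of obtaining a low p-value purely by chance. To guard against this, we can tighten the significance threshold to account for the number of tests we run, or we can test the new hypothesis on freshly collected data.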
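
One common safeguard when several tests share a dataset is the Bonferroni correction, which divides the significance level by the number of tests. A minimal sketch follows; the function name `bonferroni_reject` is an assumption, and the second p-value is hypothetical (only 0.2594 comes from the analysis above).

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each p-value, whether its null hypothesis is rejected
    after dividing the significance level by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# First value: the all-users test above. Second value: hypothetical,
# standing in for an instructors-only test.
decisions = bonferroni_reject([0.2594, 0.03], alpha=0.05)
print(decisions)
```

With two tests, each p-value is compared against 0.025 instead of 0.05, so a result that would have looked significant on its own may no longer pass.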

## Task 2. Chi-squared Test 

There are dozens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared statistic. Note that you are **not** allowed to call an existing function (e.g., `stats.chi2`, `chi2_contingency`).

In [25]:
#<-- Write Your Code -->
cross_tab = pd.crosstab(data['is_instructor'], data['search_ui'], margins=True, margins_name='total')

In [26]:
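chi_square = 0

# Sum (O - E)^2 / E over the four cells of the 2x2 table
for i in ['A', 'B']:
    for j in [True, False]:
        O = cross_tab[i][j]  # observed count
        # Expected count under independence: column total * row total / grand total
        E = cross_tab[i]['total'] * cross_tab['total'][j] / cross_tab['total']['total']
        chi_square += (O-E)**2/E

chi_square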

0.6731740891275046

Please explain how to use a Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** 
1. First, we state the null hypothesis and the alternative hypothesis:
   * Null hypothesis: there is no relationship between `is_instructor` and `search_ui`.
   * Alternative hypothesis: there is a relationship between `is_instructor` and `search_ui`.

2. Then we build a contingency table of the observed frequencies of `is_instructor` and `search_ui`.
   The table has 4 cells, one for each combination of the two variables ((yes, A), (yes, B), (no, A), (no, B)). Each cell is filled with the number of observations in that combination. We also add the row totals, the column totals, and the grand total.

3. Then we calculate the expected frequency of each cell under the null hypothesis:
   Expected Frequency = (row total × column total) / grand total

4. Then we calculate the chi-square statistic by summing over all cells of the table:
   $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed frequency and $E$ is the expected frequency of a cell.

5. Then we calculate the degrees of freedom: (number of rows - 1) × (number of columns - 1). For this 2 × 2 table, that is 1.

6. Then we use a chi-square distribution table to find the critical value for these degrees of freedom at the desired significance level (e.g., 0.05, which corresponds to 95% confidence).

7. Then we compare the calculated chi-square value to the critical value. If the calculated value is less than the critical value, we fail to reject the null hypothesis: the data do not provide evidence of a relationship between `is_instructor` and `search_ui`. If the calculated value is greater than or equal to the critical value, we reject the null hypothesis in favor of the alternative, i.e., we conclude that there is a relationship between `is_instructor` and `search_ui`. Here, the statistic of about 0.673 is well below 3.841 (the critical value for 1 degree of freedom at the 0.05 level), so we fail to reject the null hypothesis.
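
The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: the function name `chi_squared_statistic` and the observed counts are hypothetical (they are not the searchlog counts), 3.841 is the standard table value for 1 degree of freedom at the 0.05 level, and the closed-form p-value via `math.erfc` holds only for 1 degree of freedom.

```python
import math


def chi_squared_statistic(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 counts: rows = is_instructor (yes, no), columns = search_ui (A, B)
observed_counts = [[30, 20], [20, 30]]

stat = chi_squared_statistic(observed_counts)
degrees_of_freedom = (len(observed_counts) - 1) * (len(observed_counts[0]) - 1)

CRITICAL_VALUE = 3.841  # table value for 1 degree of freedom, alpha = 0.05
reject_null = stat >= CRITICAL_VALUE

# With 1 degree of freedom the statistic is a squared standard normal,
# so the upper-tail probability has a closed form.
p_value = math.erfc(math.sqrt(stat / 2))

print(stat, degrees_of_freedom, reject_null, p_value)
```

Comparing the statistic with the critical value and comparing the p-value with the significance level are two views of the same decision; they always agree.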

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.