# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the search_count means between interface A and interface B.**

In [1]:
#<-- Write Your Code -->
import pandas as pd
filename = 'searchlog.json'
logs = pd.read_json(filename, lines=True)
logs_A = logs[logs["search_ui"] == 'A']
mean_A = logs_A["search_count"].mean()
logs_B = logs[logs["search_ui"] == 'B']
mean_B = logs_B["search_count"].mean()
difference = abs(mean_B - mean_A)
print("Difference is",difference)

Difference is 0.13500569535052287


Suppose we find that the mean `search_count` differs by 0.135 between the two interfaces. We then wonder whether this result is just caused by random variation.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., <0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation from an existing library. You have to implement it yourself.

In [3]:
#<-- Write Your Code -->
import numpy as np
logp = logs.copy()
# Differences in means observed under random relabeling
pD = []
# Number of permutations
p = 10000
# Permutation loop
for i in range(0,p):
    # Shuffle the search counts while keeping the interface labels fixed
    logp['search_count'] = np.random.permutation(logp['search_count'].values)
    # Calculate the mean here as well
    logp_A = logp[logp["search_ui"] == 'A']
    mean_A = logp_A["search_count"].mean()
    logp_B = logp[logp["search_ui"] == 'B']
    mean_B = logp_B["search_count"].mean()
    # Calculate the difference
    difference_l = abs(mean_B - mean_A)
    pD.append(difference_l)
# Calculating p-value: share of permuted differences at least as large as the observed one
p_val = np.mean(np.array(pD) >= difference)
print("P-value is",p_val)
if p_val < 0.01:
    print("Null hypothesis is rejected")
else:
    print("Failed to reject the null hypothesis")

P-value is 0.2602
Failed to reject the null hypothesis


Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.
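
Repeating the analysis on a subset is easier when the permutation test is wrapped in a function. The following is only a minimal sketch under stated assumptions: the function name `permutation_p_value`, its parameters, and the synthetic data are hypothetical and are not part of the assignment or the searchlog dataset.

```python
import numpy as np

def permutation_p_value(values, is_group_b, num_samples=10000, seed=0):
    """Permutation test on the absolute difference of two group means."""
    values = np.asarray(values, dtype=float)
    is_group_b = np.asarray(is_group_b, dtype=bool)
    rng = np.random.default_rng(seed)

    def abs_mean_diff(v):
        # Absolute difference between the mean of group B and the mean of group A
        return abs(v[is_group_b].mean() - v[~is_group_b].mean())

    observed = abs_mean_diff(values)
    # Shuffle the values while keeping the group labels fixed, and count how
    # often the shuffled difference is at least as large as the observed one.
    count = sum(abs_mean_diff(rng.permutation(values)) >= observed
                for _ in range(num_samples))
    return count / num_samples

# Hypothetical usage on synthetic data (NOT the searchlog dataset)
demo_rng = np.random.default_rng(1)
fake_counts = demo_rng.poisson(2.0, size=200)
fake_is_b = np.arange(200) % 2 == 1
p_value = permutation_p_value(fake_counts, fake_is_b, num_samples=2000)
print("P-value is", p_value)
```

For the instructor-only analysis, one would pass the `search_count` values of the instructor rows and a mask built from `search_ui == 'B'`; this assumes `is_instructor` can be used as a boolean filter, which the assignment text does not state.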

**Q. If using the same dataset to do this analysis, do you feel like we're p-hacking? If so, what can we do with it?**

**A.** Yes. Re-analyzing the same dataset until something significant turns up is p-hacking. Every additional test is another chance for a false positive, so reporting only the significant result while ignoring the other tests overstates the evidence. To guard against this, we should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. One way to control for multiple tests is the Bonferroni correction: lower the significance level to $\alpha/N$, where $N$ is the number of hypothesis tests conducted. Another is to control the false discovery rate ($FP/(FP+TP)$) at a level $\beta$ (the Benjamini-Hochberg procedure): sort the $m$ p-values in ascending order, find the largest index $k$ such that the $k$-th smallest p-value is $\le (k/m)\beta$, and treat every test with index $i \le k$ as statistically significant.
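
The two corrections mentioned in the answer (commonly called the Bonferroni correction and the Benjamini-Hochberg procedure) can be sketched as follows. This is an illustration only: the function names and the list of p-values are made up for the example and do not come from the searchlog analysis.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject a test when its p-value is at most alpha / N."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg_reject(p_values, beta=0.05):
    """Control the false discovery rate at level beta."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices that sort p ascending
    thresholds = beta * np.arange(1, m + 1) / m  # (i / m) * beta for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest qualifying rank (zero-based)
        reject[order[:k + 1]] = True             # reject every test up to that rank
    return reject

# Hypothetical p-values from five tests run on the same data
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni:", bonf)
print("Benjamini-Hochberg:", bh)
```

With these made-up values Bonferroni keeps only the smallest p-value, while the false-discovery-rate procedure keeps four of the five, which shows how much less conservative the second approach can be.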

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [5]:
#<-- Write Your Code -->
# Creating a contingency table (row/column totals are in the "All" margins)
cont_tab = pd.crosstab(logs['is_instructor'], logs['search_ui'],margins = True)

# Observations: the 2x2 block of counts, flattened row by row
obs = cont_tab.iloc[0:2,0:2].values.flatten()

# Expectations: row total * column total / grand total, in the same row-by-row order
tuple_sum = cont_tab.iloc[0:2,2].values
attr_sum = cont_tab.iloc[2,0:2].values
total = cont_tab["All"].iloc[2]
exp = []
for j in range(0,len(cont_tab)-1):
    for l in attr_sum:
        exp.append(l*tuple_sum[j]/total)

# Chi Square calculation
chi_sqr = sum(((obs - exp)**2/exp))
print("Chi Square stat is",chi_sqr)

# Degree of freedom
df = (len(tuple_sum)-1)*(len(attr_sum)-1)
print("Degree of freedom is",df)


Chi Square stat is 0.6731740891275046
Degree of freedom is 1


Please explain how to use a chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** To identify whether 'is_instructor' and 'search_ui' are correlated, we need to define a null hypothesis and its alternative.

Null Hypothesis: 
There is no association between 'is_instructor' and 'search_ui', i.e., they are independent.

Alternative Hypothesis:
There is an association between 'is_instructor' and 'search_ui', i.e., they are dependent.

In order to reject the null hypothesis, the p-value should be less than the significance level. Let's set the significance level to 0.05 in this case.

To perform the chi-squared test, we need to calculate the chi-squared value using the formula below, where the sum runs over the $n$ cells of the contingency table:

$$\chi^2 = \sum_{i=1}^{n}{\frac{(observed_i - expected_i)^2}{expected_i}}$$

To calculate this, we created a contingency table (cross table) holding the frequency count of each cell. The values in this contingency table are the observed values. The expected values are the counts we would expect if the null hypothesis were true. The expected value for a single cell can be obtained using the formula below:

$$Expected\space Value = {\frac{(Row\space Sum * Column\space Sum)}{Grand\space Total}}$$

Now we plug the expected and observed values into the chi-squared formula to obtain the statistic. In this case, our chi-squared statistic was 0.6731. After obtaining this value, we need to calculate the degree of freedom using the formula below:

$$Degree\space of\space Freedom = (number\space of\space rows - 1)(number\space of\space columns - 1)$$

In this example, the degree of freedom is 1. We then look at the first row of the chi-squared distribution table (the row for one degree of freedom) to find the p-value that corresponds to the chi-squared statistic. A statistic of 0.6731 falls between the table entries 0.455 (p = 0.50) and 1.323 (p = 0.25), so the p-value is roughly 0.41.
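
As an alternative to the table lookup, the p-value for one degree of freedom can be computed in closed form with the standard library, since a chi-squared variable with one degree of freedom is the square of a standard normal variable. The helper name `chi2_p_value_df1` is made up for this sketch, and the formula applies only when the degree of freedom is 1.

```python
import math

def chi2_p_value_df1(chi_sqr):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    # With df = 1 the statistic is the square of a standard normal variable Z,
    # so P(X^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_sqr / 2.0))

# Statistic printed by the notebook cell above
p_value = chi2_p_value_df1(0.6731740891275046)
print("P-value is", round(p_value, 3))
```

A quick sanity check is that the familiar critical value 3.841 maps to a p-value of about 0.05 under this formula.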

Since the p-value is greater than the significance level, we fail to reject the null hypothesis and conclude that there is no evidence that 'is_instructor' and 'search_ui' are dependent.





## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.