# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it arose in the sample by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to use the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference in the search_count means between Interface A and Interface B.**

In [1]:
#<-- Write Your Code -->
import pandas as pd
import numpy as np

df = pd.read_json('searchlog.json', lines=True)
group_a = df[df['search_ui']=='A'].search_count
group_b = df[df['search_ui']=='B'].search_count

diff = group_b.mean() - group_a.mean() 
diff

0.13500569535052287

Suppose we find that the mean value increased by 0.135. Then, we wonder whether this result is just caused by random variation. 

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., <0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation from an existing library. You have to implement it yourself.

In [2]:
#<-- Write Your Code -->
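def permutation_test(group1, group2, num_samples=10000):
    """One-sided permutation test on the difference of means (group2 - group1)."""
    observed_diff = np.mean(group2) - np.mean(group1)
    # Pool both groups; under the null hypothesis the group labels are exchangeable.
    combined_data = np.concatenate([group1, group2])

    count_extreme_values = 0
    for _ in range(num_samples):
        # Shuffle in place, then split back into two groups of the original sizes.
        np.random.shuffle(combined_data)
        perm_group1 = combined_data[:len(group1)]
        perm_group2 = combined_data[len(group1):]
        perm_diff = np.mean(perm_group2) - np.mean(perm_group1)
        # Count permutations whose difference is at least as large as the observed one.
        if perm_diff >= observed_diff:
            count_extreme_values += 1

    # Fraction of permutations at least as extreme as the observed difference.
    p_value = count_extreme_values / num_samples
    return p_value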


p_value = permutation_test(group_a, group_b)
print(f"P-value: {p_value}")

P-value: 0.1352
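The test above is one-sided: it only counts permutations where B exceeds A by at least the observed amount. Since the original question asks whether the number of searches is *different*, a two-sided variant may also be of interest. The following is a minimal sketch under stated assumptions, not the assignment's reference solution: the function name `two_sided_permutation_test`, the `seed` parameter, and the synthetic arrays are all hypothetical and are not taken from `searchlog.json`.

```python
import numpy as np

def two_sided_permutation_test(group1, group2, num_samples=10000, seed=0):
    """Two-sided permutation test on the absolute difference of means.

    `seed` is an illustrative parameter added here for reproducibility.
    """
    rng = np.random.default_rng(seed)
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    observed = abs(group2.mean() - group1.mean())
    combined = np.concatenate([group1, group2])
    n1 = len(group1)

    extreme = 0
    for _ in range(num_samples):
        # Draw a fresh random relabelling of the pooled observations.
        shuffled = rng.permutation(combined)
        diff = shuffled[n1:].mean() - shuffled[:n1].mean()
        # Count differences at least as large in magnitude, in either direction.
        if abs(diff) >= observed:
            extreme += 1
    return extreme / num_samples

# Synthetic data for illustration only (NOT the searchlog data).
data_rng = np.random.default_rng(42)
same_a = data_rng.poisson(1.0, size=200)
same_b = data_rng.poisson(1.0, size=200)
p_same = two_sided_permutation_test(same_a, same_b, num_samples=2000)

# Two clearly separated groups: the p-value should be very small.
far_a = np.array([0, 1] * 10)
far_b = np.array([10, 11] * 10)
p_far = two_sided_permutation_test(far_a, far_b, num_samples=2000)

print(f"same distribution: p = {p_same}")
print(f"separated groups:  p = {p_far}")
```

Whether to use the one-sided or the two-sided version depends on the hypothesis fixed before looking at the data; choosing after seeing the result is itself a form of p-hacking.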


Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If using the same dataset to do this analysis, do you feel like we're p-hacking? If so, what can we do with it?**

**A.** [Repeating the analysis on the instructor-only subset of the same dataset could be considered p-hacking, especially if this subgroup analysis was not planned before looking at the data. The decision to focus on instructors is then influenced by the data itself rather than by a hypothesis set a priori.

If we must analyze the subset, we should lower the significance level to account for the increased risk of Type I errors. If possible, the better option is to repeat the analysis on a new dataset, collected from a different time period or a different set of users, to see whether the effect holds.]
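One common way to lower the significance level when several tests are run on the same data is the Bonferroni correction, which divides the overall level by the number of tests. The sketch below is illustrative only: the helper name `bonferroni_threshold`, the choice of `alpha = 0.05`, and the count of two tests (all users, then instructors only) are assumptions made here, not part of the assignment.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

alpha = 0.05      # assumed overall significance level
num_tests = 2     # assumed: all users, then instructors only
threshold = bonferroni_threshold(alpha, num_tests)

# Only the p-value reported earlier in this notebook is used here;
# the instructor-only p-value is not known and is deliberately left out.
p_all_users = 0.1352
reject_all_users = p_all_users <= threshold

print(f"per-test threshold: {threshold}")
print(f"reject H0 for all users: {reject_all_users}")
```

Bonferroni is conservative; it is shown here only because it is the simplest adjustment to state.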

## Task 2. Chi-squared Test 

There are dozens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn on your own how a Chi-squared test works and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared statistic. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency).

In [3]:
#<-- Write Your Code -->
is_instructor = df['is_instructor']
search_ui = df['search_ui']

# Step 1: Create a contingency table (nested dict: is_instructor -> search_ui -> count)
contingency_table = {}
for instructor, ui in zip(is_instructor, search_ui):
    # Pair values positionally so the loop does not depend on the DataFrame index labels.
    ui_counts = contingency_table.setdefault(instructor, {})
    ui_counts[ui] = ui_counts.get(ui, 0) + 1

# Convert to numpy array for easier manipulation
categories = sorted(set(search_ui))
contingency_matrix = np.array([[contingency_table[True].get(category, 0) for category in categories],
                               [contingency_table[False].get(category, 0) for category in categories]])

# Step 2: Calculate expected frequencies
row_totals = contingency_matrix.sum(axis=1)
col_totals = contingency_matrix.sum(axis=0)
total = contingency_matrix.sum()

expected_frequencies = np.outer(row_totals, col_totals) / total

# Step 3: Compute the Chi-Squared statistic
chi_squared_stat = ((contingency_matrix - expected_frequencies) ** 2 / expected_frequencies).sum()
# chi_squared_stat_yates = ((abs(contingency_matrix - expected_frequencies) - 0.5) ** 2 / expected_frequencies).sum()

print(f"Chi-Squared Statistic: {chi_squared_stat}")


Chi-Squared Statistic: 0.6731740891275046
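In symbols, the computation above can be summarized as follows. The notation is introduced here for illustration and is not part of the assignment: $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $E_{ij}$ is the count expected if the two variables were independent.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i} \sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}
$$

This matches the code: `np.outer(row_totals, col_totals) / total` gives the $E_{ij}$, and the final sum runs over every cell of the table.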


Please explain how to use a Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** [Here are the steps to use a Chi-squared test:

1. Formulate Hypotheses
    - Null Hypothesis (H0): There is no association between is_instructor status and the search_ui version. They are independent.

    - Alternative Hypothesis (H1): There is an association between is_instructor status and the search_ui version. They are not independent.


2. Compute Chi-squared Statistic: As shown above, calculate the chi-squared statistic from the contingency table.

3. Determine the P-value: The p-value is the probability of observing a chi-squared statistic at least as extreme as the one calculated, under the null hypothesis. It is read from the chi-squared distribution with (rows − 1) × (columns − 1) degrees of freedom, which is 1 for this 2×2 table. Compare the p-value against a significance level (e.g., 0.05) to decide whether to reject the null hypothesis.

4. Make a Decision:
    - If the p-value is less than or equal to the significance level, reject the null hypothesis and conclude that there is a statistically significant association between is_instructor and search_ui.

    - If the p-value is greater than the significance level, fail to reject the null hypothesis and conclude that there is not enough evidence to claim an association.]
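Step 3 can be carried out without a statistics library in the special case of one degree of freedom, which applies to a 2×2 table. A chi-squared variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability equals `erfc(sqrt(x / 2))`. The following is a minimal sketch under that assumption; the function name `chi2_sf_df1` is hypothetical, and the formula does not apply to larger tables.

```python
import math

def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Valid only for df = 1: P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if stat < 0:
        raise ValueError("a chi-squared statistic cannot be negative")
    return math.erfc(math.sqrt(stat / 2.0))

# Statistic printed by the cell above.
chi_squared_stat = 0.6731740891275046
p_value = chi2_sf_df1(chi_squared_stat)

alpha = 0.05  # assumed significance level, as in the example in step 3
print(f"p-value (df=1): {p_value:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

For the statistic computed in this notebook the p-value comes out at roughly 0.41, well above 0.05, so under these assumptions there is not enough evidence to claim an association between `is_instructor` and `search_ui`.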

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.