<a href="https://colab.research.google.com/github/kavyajeetbora/thinkstats/blob/master/MISC/01_statistical_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from scipy.stats import ttest_ind

## A/B Testing

In [None]:
# Sample data (replace with your actual data)
np.random.seed(42)  # for reproducibility
n_users = 1000
control_conversions = np.random.binomial(1, 0.10, n_users)  # 10% conversion rate
variant_conversions = np.random.binomial(1, 0.16, n_users)  # 16% conversion rate

data = pd.DataFrame({
    'group': ['control'] * n_users + ['variant'] * n_users,
    'converted': np.concatenate((control_conversions, variant_conversions))
})


# A/B Testing
control_group = data[data['group'] == 'control']['converted']
variant_group = data[data['group'] == 'variant']['converted']

# Perform a two-sample t-test on the 0/1 outcomes (equal variances assumed by default)
t_statistic, p_value = ttest_ind(control_group, variant_group)

# Print results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Significance level (alpha)
alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis. No statistically significant difference between the groups was detected.")

# Calculate conversion rates
control_conversion_rate = control_group.mean()
variant_conversion_rate = variant_group.mean()
print(f"\nControl Group Conversion Rate: {control_conversion_rate:.4f}")
print(f"Variant Group Conversion Rate: {variant_conversion_rate:.4f}")
print(f"Difference in conversion rates: {variant_conversion_rate - control_conversion_rate:.4f}")


# Example scenario 1: Website redesign
#
# Imagine you redesigned your website homepage.
# You want to test if the new design leads to more clicks on a key call to action button.


# Example scenario 2: Email subject line A/B test
#
# You have two different email subject lines. You want to see which one has a higher open rate.


# Example scenario 3: Ad copy testing
#
# You have different versions of ad copy. You want to test which ad gets the most clicks or conversions.


T-statistic: -4.184613495048888
P-value: 2.9803061192511093e-05
Reject the null hypothesis. There is a statistically significant difference between the groups.

Control Group Conversion Rate: 0.1000
Variant Group Conversion Rate: 0.1630
Difference in conversion rates: 0.0630
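
Because each outcome is 0 or 1, the comparison above is really a comparison of two proportions. As a cross-check, here is a minimal sketch of a two-proportion z-test with a pooled standard error. The helper name `two_proportion_ztest` is hypothetical (it is not part of the notebook or of SciPy), and the counts 100 and 163 are inferred from the conversion rates printed above for 1000 users per group.

```python
import numpy as np
from scipy.stats import norm


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled conversion rate under H0: both groups share a single rate
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided tail probability
    return z, p_value


# Counts implied by the printed rates: 10.0% and 16.3% of 1000 users each
z_stat, z_p_value = two_proportion_ztest(100, 1000, 163, 1000)
print(f"z-statistic: {z_stat:.3f}")
print(f"p-value: {z_p_value:.2e}")
```

With samples this large, the z-test and the t-test should lead to the same conclusion; the z-test simply states the binomial assumption explicitly.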


## ANOVA Test



The ANOVA F-statistic is a ratio that compares the variability between groups to the variability within groups.

$F = \frac{MS_{between}}{MS_{within}}$

- $MS_{between}$: the variance between the group means. It is the sum of squares between groups (SSB) divided by the between-group degrees of freedom, $df_{between} = k - 1$ for $k$ groups.
- $MS_{within}$: the variance within the groups. It is the sum of squares within groups (SSW) divided by the within-group degrees of freedom, $df_{within} = N - k$ for $N$ total observations.

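In essence, a larger F-statistic indicates greater differences among the group means relative to the variation within groups. Whether that ratio is statistically significant is judged from its p-value under the F distribution with $df_{between}$ and $df_{within}$ degrees of freedom.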
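
To make the ratio concrete, the sketch below computes SSB, SSW, the two mean squares and the resulting F-statistic by hand, then compares the result with `scipy.stats.f_oneway`. The function name `manual_f_statistic` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from scipy import stats


def manual_f_statistic(*groups):
    """One-way ANOVA F-statistic and p-value computed from sums of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Between-group sum of squares: group means vs. the grand mean
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations vs. their own group mean
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ssb / df_between) / (ssw / df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return f_stat, p_value


# Toy data (hypothetical): SSB = 26, SSW = 6, so F = (26 / 2) / (6 / 6) = 13
toy_groups = ([1, 2, 3], [2, 3, 4], [5, 6, 7])
f_manual, p_manual = manual_f_statistic(*toy_groups)
f_scipy, p_scipy = stats.f_oneway(*toy_groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

The two lines should agree up to floating-point rounding, which confirms that `f_oneway` is computing the same ratio.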


**Real-world example:**
Imagine a company runs three different ad campaigns (A, B, and C) on social media to promote a new product. They want to determine if there's a significant difference in the number of clicks each campaign generates.

- The null hypothesis (H0) for ANOVA is that all group means are equal.

- The alternative hypothesis (H1) is that at least one group mean differs from the others.

In this example, the ANOVA test helps the company determine if the ad campaigns are performing significantly differently from each other, rather than just relying on visual observation or simple comparison.

In [None]:
# Generate sample data for three different ad campaigns
np.random.seed(42)
n_samples = 100

# Campaign A
campaign_a_clicks = np.random.normal(loc=150, scale=20, size=n_samples) # Average 150 clicks
# Campaign B
campaign_b_clicks = np.random.normal(loc=170, scale=25, size=n_samples) # Average 170 clicks
# Campaign C
campaign_c_clicks = np.random.normal(loc=160, scale=30, size=n_samples) # Average 160 clicks

# Create a Pandas DataFrame
data = pd.DataFrame({
    'campaign': ['A'] * n_samples + ['B'] * n_samples + ['C'] * n_samples,
    'clicks': np.concatenate((campaign_a_clicks, campaign_b_clicks, campaign_c_clicks))
})

data.sample(10)

    campaign      clicks
203        C  191.614062
246        C  145.751641
67         A  170.070658
206        C  175.451058
274        C  130.554740
215        C  182.769077
87         A  156.575022
86         A  168.308042
21         A  145.484474
138        B  190.337930


In [None]:
# Perform ANOVA test
f_statistic, p_value = f_oneway(data[data['campaign'] == 'A']['clicks'],
                               data[data['campaign'] == 'B']['clicks'],
                               data[data['campaign'] == 'C']['clicks'])

# Print results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

F-statistic: 20.014297382116737
P-value: 7.011652885427468e-09


In [None]:
# Interpret the results
alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis. There is a statistically significant difference in the mean clicks among the ad campaigns.")
else:
    print("Fail to reject the null hypothesis. No statistically significant difference in the mean clicks among the ad campaigns was detected.")


# Calculate the mean clicks for each campaign
campaign_means = data.groupby('campaign')['clicks'].mean()
print("\nMean clicks per campaign:")
campaign_means

Reject the null hypothesis. There is a statistically significant difference in the mean clicks among the ad campaigns.

Mean clicks per campaign:


campaign      clicks
A         147.923070
B         170.557615
C         161.946888
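
Rejecting H0 only says that at least one campaign mean differs; it does not say which one. One possible follow-up, sketched here under stated assumptions, is to run pairwise t-tests and multiply each p-value by the number of comparisons (a Bonferroni adjustment). The helper `pairwise_bonferroni` and the toy arrays are hypothetical and chosen only for illustration.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind


def pairwise_bonferroni(groups):
    """Pairwise t-tests with Bonferroni-adjusted p-values.

    `groups` maps a label to an array of observations.
    """
    pairs = list(combinations(groups, 2))
    adjusted = {}
    for name_1, name_2 in pairs:
        _, p = ttest_ind(groups[name_1], groups[name_2])
        # Multiply by the number of comparisons and cap at 1
        adjusted[(name_1, name_2)] = min(1.0, p * len(pairs))
    return adjusted


# Toy data (hypothetical): B is clearly shifted, A and C nearly overlap
toy = {
    "A": np.array([10, 11, 12, 13, 14]),
    "B": np.array([20, 21, 22, 23, 24]),
    "C": np.array([10.5, 11.5, 12.5, 13.5, 14.5]),
}
for pair, p_adj in pairwise_bonferroni(toy).items():
    print(pair, f"{p_adj:.4f}")
```

For the campaign data, the same helper could be called with `{name: grp['clicks'].values for name, grp in data.groupby('campaign')}`.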


ANOVA summary:

<img src="https://cdn1.qualitygurus.com/wp-content/uploads/2022/12/ANOVA-Degrees-of-Freedom-Calculation.png?lossy=1&w=1326&ssl=1" height=300/>