In [1]:
import pandas as pd
import numpy as np 


In [2]:
file = 'website_ab_test.csv'
df = pd.read_csv(file)
print('Features: ') 
for feat in df.columns.to_list():
    print('- ',feat) 

Features: 
-  Theme
-  Click Through Rate
-  Conversion Rate
-  Bounce Rate
-  Scroll_Depth
-  Age
-  Location
-  Session_Duration
-  Purchases
-  Added_to_Cart


# <font color=red>A/B Testing & Hypothesis Testing: Statistical Methods in Action </font> 

## **Introduction**
In data-driven decision-making, **A/B testing** plays a crucial role in evaluating whether changes in a product, website, or feature lead to statistically significant improvements. This notebook explores various **hypothesis testing techniques** commonly used in A/B testing scenarios.

### **Objective**
The goal of this analysis is to determine whether **website theme (Light vs. Dark)** has an impact on **user behavior**, particularly **purchases** and other **engagement metrics** such as click-through rate.

### **Statistical Tests Covered**
1. **Chi-Square Test** (✅ Implemented)  
   - Used to determine if there is a **significant association** between **categorical variables** (e.g., Theme & Purchase behavior).
   - Since **both variables are categorical**, the **Chi-Square test** is an appropriate choice.

2. **T-Test** (✅ Implemented)  
   - Suitable if we compare **continuous variables** (e.g., session duration, purchase amount) between the two themes.
   - Will test whether the means of these metrics **significantly differ** between Light and Dark themes.

3. **Z-Test** (❌ Not Needed)
    - For a 2×2 scenario (Theme vs. Purchases), the two-proportion Z-test and the Chi-Square test of independence (without continuity correction) are statistically equivalent, so running a separate Z-test is redundant.
    - For continuous metrics (Session Duration, etc.), we do not know the population standard deviation, making a t-test the standard approach.
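
The equivalence claimed for the 2×2 case can be checked numerically. The sketch below is a minimal illustration, assuming the purchase counts from the contingency table built later in this notebook and a pooled two-proportion Z-test without continuity correction; the variable names are chosen here for illustration only.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Purchase counts taken from the contingency table built later in this notebook
yes = np.array([259, 258])   # purchases: Dark Theme, Light Theme
n = np.array([514, 486])     # users per theme

# Two-proportion z-test with a pooled proportion under H0
p_hat = yes / n
p_pool = yes.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (p_hat[0] - p_hat[1]) / se
p_z = 2 * norm.sf(abs(z))

# Chi-square test of independence on the same counts (no Yates correction)
observed = np.column_stack([n - yes, yes])
chi2_stat, p_chi2, _, _ = chi2_contingency(observed, correction=False)

print(f"z^2 = {z**2:.6f}, chi-square = {chi2_stat:.6f}")
print(f"z-test p-value = {p_z:.6f}, chi-square p-value = {p_chi2:.6f}")
```

The squared z statistic and the Chi-Square statistic should agree up to floating-point rounding, which is why only one of the two tests is run here.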

### **Why This Notebook?**
- This notebook provides a structured, hands-on exploration of **statistical hypothesis testing** in A/B experiments. It serves as a reference for selecting the right test based on data types, verifying statistical assumptions, and drawing meaningful business insights.
- Link: https://statso.io/light-theme-and-dark-theme-case-study/

### **Next Steps**
- Extend the **t-test** to other **continuous variables** (e.g., `Session_Duration`, `Scroll_Depth`).
- Interpret statistical results in a business context.
- Expand the notebook with additional A/B testing case studies.


In [4]:
df.head() 

Unnamed: 0,Theme,Click Through Rate,Conversion Rate,Bounce Rate,Scroll_Depth,Age,Location,Session_Duration,Purchases,Added_to_Cart
0,Light Theme,0.05492,0.282367,0.405085,72.489458,25,Chennai,1535,No,Yes
1,Light Theme,0.113932,0.032973,0.732759,61.858568,19,Pune,303,No,Yes
2,Dark Theme,0.323352,0.178763,0.296543,45.737376,47,Chennai,563,Yes,Yes
3,Light Theme,0.485836,0.325225,0.245001,76.305298,58,Pune,385,Yes,No
4,Light Theme,0.034783,0.196766,0.7651,48.927407,25,New Delhi,1437,No,No


# <font color=red>Chi-Square Test</font> 

## Feature Selection and Application of the Chi-Square Test

### Feature Selection
In this analysis, the primary goal is to evaluate whether the **theme of the website (Light vs. Dark)** has a significant impact on the **purchase behavior** of users. The following features were selected:

- **Theme**: Categorical variable representing the visual theme of the website (Light Theme or Dark Theme).
- **Purchases**: Categorical outcome variable with two possible states: "Yes" (purchase made) and "No" (no purchase made).

These variables were chosen because the relationship between a website's theme and user behavior (specifically purchase decisions) is critical for assessing the effectiveness of UI design strategies.

### Why the Chi-Square Test?
The **Chi-Square Test of Independence** is an appropriate statistical test for this scenario because:
1. Both variables (**Theme** and **Purchases**) are **categorical**.
2. The test evaluates whether there is a **significant association** between the two variables.
3. It determines whether the observed differences in purchase behavior across themes are likely due to chance or represent a true relationship.

The test was applied to a **2×2 contingency table** where:
- Rows represent the two themes (Light and Dark).
- Columns represent the two outcomes (Yes and No).

### Key Assumptions
- **Independence**: Each user interaction is independent of others.
- **Expected Frequency**: The expected frequency in each cell of the contingency table is greater than 5, ensuring the validity of the Chi-Square approximation.

By running the Chi-Square test, we aim to understand whether the choice of theme significantly affects purchase behavior, providing actionable insights into UI/UX optimization.
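
The expected-frequency assumption listed above can be verified before running the test. This is a small sketch that assumes the observed counts shown in the contingency table of this notebook and uses `scipy.stats.contingency.expected_freq` to compute the expected counts under independence.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts from this notebook: rows = Dark/Light theme, columns = No/Yes
observed = np.array([[255, 259],
                     [228, 258]])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)
print(expected)
print("Smallest expected count:", expected.min())
print("All expected counts >= 5:", bool((expected >= 5).all()))
```

If any expected count fell below 5, Fisher's exact test would usually be a safer choice than the Chi-Square approximation.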


In [7]:
purchases = df.groupby('Theme')['Purchases'].value_counts().to_frame().reset_index().pivot(index='Theme', columns='Purchases', values ='count').reset_index() 
purchases

Purchases,Theme,No,Yes
0,Dark Theme,255,259
1,Light Theme,228,258


In [8]:
# Sum the 'No' and 'Yes' counts in each row to get the row totals
row_total = []
for row in purchases.index.to_list():
    row_sum = purchases.iloc[row]['No Yes'.split()].sum()
    print(f"{purchases.iloc[row]['Theme']} row total: {row_sum}")
    row_total.append(row_sum)
purchases['Row Total'] = row_total
purchases

Dark Theme row total: 514
Light Theme row total: 486


Purchases,Theme,No,Yes,Row Total
0,Dark Theme,255,259,514
1,Light Theme,228,258,486


In [9]:
# Column totals for 'No' and 'Yes', appended as a final summary row
col_total = purchases[["No", "Yes"]].sum()
new_index = purchases.index.max() + 1 
purchases.loc[new_index] = ['Colum Total', col_total["No"], col_total["Yes"], sum(col_total)]
purchases 

Purchases,Theme,No,Yes,Row Total
0,Dark Theme,255,259,514
1,Light Theme,228,258,486
2,Colum Total,483,517,1000
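
As an aside, the same table with totals can be produced in one call with `pd.crosstab`. The sketch below is only an illustration: it rebuilds a stand-in frame (`demo`, a hypothetical name) from the counts shown above instead of reading the CSV, and the margin label `Total` is an arbitrary choice.

```python
import pandas as pd

# Rebuild a 1,000-row frame from the counts shown above (illustrative stand-in for `df`)
counts = {("Dark Theme", "No"): 255, ("Dark Theme", "Yes"): 259,
          ("Light Theme", "No"): 228, ("Light Theme", "Yes"): 258}
rows = [(theme, bought) for (theme, bought), k in counts.items() for _ in range(k)]
demo = pd.DataFrame(rows, columns=["Theme", "Purchases"])

# margins=True appends the row and column totals in one call
table = pd.crosstab(demo["Theme"], demo["Purchases"],
                    margins=True, margins_name="Total")
print(table)
```

The step-by-step version in the cells above remains useful because the expected frequencies are computed from the same row and column totals.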


In [10]:
# Expected count for a cell = row total * column total / grand total.
# Only the two theme rows (labels 0 and 1) get an expected count; the totals row
# is padded with None so each list lines up with the rows of `purchases`.
theme_rows = [0, 1]
grand_total = purchases['Row Total'].iloc[-1]

exp_no = []
for i in theme_rows:
    exp = purchases['Row Total'][i] * purchases['No'].iloc[-1] / grand_total
    print(purchases['Theme'][i],'expected No: ', exp)
    exp_no.append(exp)
exp_no.append(None)
print('\n') 
exp_yes = [] 
for i in theme_rows:
    exp = purchases['Row Total'][i] * purchases['Yes'].iloc[-1] / grand_total
    print(purchases['Theme'][i],'expected Yes: ', exp)
    exp_yes.append(exp)
exp_yes.append(None)
print('\n') 
print(f'Expected No: {exp_no}')
print(f'Expected Yes: {exp_yes}')

Dark Theme expected No:  248.262
Light Theme expected No:  234.738


Dark Theme expected Yes:  265.738
Light Theme expected Yes:  251.262


Expected No: [248.262, 234.738, None]
Expected Yes: [265.738, 251.262, None]


In [11]:
purchases['Expected No'] = exp_no
purchases['Expected Yes'] = exp_yes
purchases

Purchases,Theme,No,Yes,Row Total,Expected No,Expected Yes
0,Dark Theme,255,259,514,248.262,265.738
1,Light Theme,228,258,486,234.738,251.262
2,Colum Total,483,517,1000,,


# $$\chi^2 = \sum_{i=1}^{k}{\frac{(F_{i}^{Obs} - F_{i}^{Exp})^2}{F_{i}^{Exp}}}$$

### Formula Explanation
The Chi-Square statistic ($\chi^2$) is calculated to test whether there is a significant difference between **observed frequencies** and **expected frequencies** in categorical data. Here’s what each component of the formula represents:

- **$\chi^2$**: The Chi-Square statistic, which quantifies the difference between observed and expected frequencies.
- **$k$**: The number of cells in the contingency table (here, $2 \times 2 = 4$).
- **$F_{i}^{Obs}$**: The **observed frequency** in cell $i$.
- **$F_{i}^{Exp}$**: The **expected frequency** in cell $i$ under the null hypothesis of independence.
- **$\sum_{i=1}^{k}$**: The summation over all $k$ cells.

### How It Works
1. For each cell, compute the difference between the observed frequency ($F_{i}^{Obs}$) and the expected frequency ($F_{i}^{Exp}$).
2. Square this difference to ensure positive contributions.
3. Divide the squared difference by the expected frequency ($F_{i}^{Exp}$).
4. Sum these values across all cells to get the final $\chi^2$ statistic.

### Key Insights
- A **larger $\chi^2$** value indicates a greater difference between observed and expected frequencies, which might suggest a significant relationship between variables.
- This statistic is then compared to a critical value (or converted into a p-value) based on the degrees of freedom to draw conclusions about statistical significance.
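
To make the four steps concrete, the sketch below prints the contribution of each cell separately before summing. It assumes the observed counts from this notebook's contingency table and recomputes the expected counts from the margins; the array names are illustrative.

```python
import numpy as np

# Observed counts: rows = Dark/Light theme, columns = No/Yes
observed = np.array([[255, 259],
                     [228, 258]])

# Expected counts from the margins: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Contribution of each cell: (observed - expected)^2 / expected
contributions = (observed - expected) ** 2 / expected
print(contributions.round(4))
print("Chi-square:", contributions.sum())
```

Looking at the per-cell contributions shows which cells deviate most from independence; here all four are small and of similar size.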


In [13]:
observed = purchases.iloc[0:2]['No Yes'.split()].values
expected = purchases.iloc[0:2]['Expected No,Expected Yes'.split(',')].values
n_rows = observed.shape[0]
n_cols = observed.shape[1] 
print(f"- Observed frequencies: {observed}") 
print(f"- Expected frequencies: {expected}")
print(f"- Number of categories/rows: {n_rows}") 
print(f"- Number of events/columns: {n_cols}")
degrees_of_freedom = (n_rows - 1) * (n_cols - 1) 
print(f'- Degrees of Freedom: {degrees_of_freedom}') 

- Observed frequencies: [[255 259]
 [228 258]]
- Expected frequencies: [[248.262 265.738]
 [234.738 251.262]]
- Number of categories/rows: 2
- Number of events/columns: 2
- Degrees of Freedom: 1


- Instead of using a loop to calculate the Chi-Square statistic for each cell in the contingency table, we leverage NumPy's ability to perform **vectorized operations**. This approach allows us to compute the differences, square them, divide by the expected frequencies, and sum the results across all elements in the table in a single line of code. This not only simplifies the implementation but also makes it more efficient and scalable for larger datasets.


In [15]:
chi_2 = np.sum((observed - expected) ** 2 / expected, axis=None)
print('Chi-Squared:', chi_2)

Chi-Squared: 0.7278216183118809


Now we calculate the **critical value** for the Chi-Square test at a 95% confidence level using `chi2.ppf()`. This value represents the threshold for rejecting the null hypothesis. By comparing the computed Chi-Square statistic (`chi_2`) with the critical value, we determine whether the observed differences are statistically significant. If `chi_2` exceeds the critical value, we reject the null hypothesis, indicating a significant relationship between the variables.


In [17]:
from scipy.stats import chi2, chi2_contingency

alpha = 0.05  # significance level (i.e., 95% confidence level)
critical_value = chi2.ppf(1 - alpha, degrees_of_freedom)

print(f'Critical Value: {critical_value}')
if chi_2 > critical_value:
    print("Reject the null hypothesis: There is a significant relationship.")
else:
    print("Fail to reject the null hypothesis: No significant relationship.")


Critical Value: 3.8414588206941205
Fail to reject the null hypothesis: No significant relationship.


The p-value represents the probability of obtaining a Chi-Square statistic as extreme as `chi_2`, or more so, under the assumption that the null hypothesis is true. In this cell, the p-value is calculated using `chi2.sf()`, which computes the upper tail probability for the Chi-Square distribution. If the p-value is less than the significance level `alpha = 0.05`, we reject the null hypothesis, indicating a statistically significant relationship between the variables. Otherwise, we fail to reject the null hypothesis.

In [19]:
alpha = 0.05
p_value = chi2.sf(chi_2, degrees_of_freedom) 
print(f'P Value: {p_value}') 
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship.")
else:
    print("Fail to reject the null hypothesis: No significant relationship.")

P Value: 0.3935901993412787
Fail to reject the null hypothesis: No significant relationship.


All the calculations above were performed manually, but `scipy.stats.chi2_contingency()` can run the Chi-Square test of independence directly on the contingency table (`observed`). This function automates the calculation of the Chi-Square statistic, p-value, degrees of freedom, and expected frequencies.

We set `correction=False` to disable Yates' continuity correction, which is applied by default for 2×2 tables. While Yates' correction helps prevent overestimation of significance in small samples, it can lead to overly conservative results. Disabling it ensures the calculation matches the manually computed Chi-Square statistic and p-value.

The decision rule is based on the p-value: if it is less than the significance level $\alpha = 0.05$, we reject the null hypothesis, indicating a significant relationship between the variables. Otherwise, we fail to reject the null hypothesis, suggesting no strong evidence of association.
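
The effect of Yates' correction can be seen by running the function both ways. This is a minimal sketch on the observed counts from this notebook; no specific output values are asserted here beyond the direction of the change.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Dark/Light theme, columns = No/Yes
observed = np.array([[255, 259],
                     [228, 258]])

# correction=True is SciPy's default for 2x2 tables (Yates' continuity correction)
stat_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(f"With Yates' correction:    chi2 = {stat_yates:.4f}, p = {p_yates:.4f}")
print(f"Without Yates' correction: chi2 = {stat_plain:.4f}, p = {p_plain:.4f}")
```

The corrected statistic is smaller and its p-value larger, which is the conservative behavior described above; with roughly 1,000 observations the conclusion is the same either way.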


In [21]:
chi2_stat, p_value, dof, expected_freq = chi2_contingency(observed, correction=False) 

print(f'Chi-Square Statistic: {chi2_stat}')
print(f'p-value: {p_value}')
print(f'Degrees of freedom: {dof}') 
print(f'Expected frequency: {expected_freq}') 
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant relationship.")
else:
    print("Fail to reject the null hypothesis: No significant relationship.")

Chi-Square Statistic: 0.7278216183118809
p-value: 0.3935901993412787
Degrees of freedom: 1
Expected frequency: [[248.262 265.738]
 [234.738 251.262]]
Fail to reject the null hypothesis: No significant relationship.


# <font color=red>T-Test</font> 

## Feature Selection and Application of the T-Test for Click-Through Rate

### Feature Selection
In this analysis, we aim to determine whether the **theme of the website (Light vs. Dark)** has a significant impact on the **Click-Through Rate (CTR)**. The following features were selected:

- **Theme**: Categorical variable representing the visual theme of the website (Light Theme or Dark Theme).
- **Click-Through Rate (CTR)**: A continuous variable representing the proportion of users who clicked on an element after viewing it, stored as a fraction (e.g., 0.05 for 5%).

These variables were chosen because **CTR is a key engagement metric**, and understanding whether different themes affect user interaction can help optimize website design.

### Why the T-Test?
The **Independent Samples T-Test** is the appropriate statistical test for this scenario because:
1. **The independent variable (Theme) is categorical** with two groups: Light Theme and Dark Theme.
2. **The dependent variable (CTR) is continuous**, making it suitable for a mean comparison.
3. The test evaluates whether the **mean CTR differs significantly** between the two themes.

The **null hypothesis $(H_0)$** assumes that there is **no difference in the mean CTR** between the Light and Dark themes. The **alternative hypothesis $(H_A)$** assumes that the themes lead to significantly different click-through behaviors.

### Key Assumptions
- **Independence**: The observations in each theme group are independent of one another.
- **Normality**: CTR data should be approximately normally distributed within each group.
- **Equal Variance**: The variance of CTR across both groups should be approximately equal (can be tested using Levene’s test).

By applying the T-test, we aim to evaluate whether **website theme impacts user engagement**, providing actionable insights for design optimization.


## Checking Sample Sizes Before the T-Test

Before performing a **T-test** to compare the mean Click-Through Rate (CTR) between the Light and Dark themes, we check the sample sizes for each group.

For the **Chi-Square test**, this step was **not necessary** because:
- The **Chi-Square test** works on cell counts, so unequal group sizes are already reflected in the expected frequencies.
- The test validity was ensured by meeting the **expected frequency condition** (i.e., no expected values below 5).

However, for a **T-test**, unequal sample sizes make the pooled (equal-variance) test more sensitive to differences in group variances, potentially affecting test accuracy. This is why we assess both **sample sizes and variances** before proceeding.


In [25]:
ctr_grouped = df.groupby('Theme')['Click Through Rate'].apply(list)
ctr_pivot = pd.DataFrame(ctr_grouped.tolist(), index=ctr_grouped.index).T
ctr_pivot

Theme,Dark Theme,Light Theme
0,0.323352,0.054920
1,0.110551,0.113932
2,0.302031,0.485836
3,0.492174,0.034783
4,0.493888,0.173419
...,...,...
509,0.265413,
510,0.212645,
511,0.282792,
512,0.299917,


In [26]:
df['Theme'].value_counts()

Theme
Dark Theme     514
Light Theme    486
Name: count, dtype: int64

## Calculating Variance for Each Theme

To determine if the **variances of Click-Through Rate (CTR) differ between themes**, we compute the variance for each theme separately.


In [28]:
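# Variance of CTR per theme, in the order returned by df['Theme'].unique().
# The list is named `variances` to avoid shadowing Python's built-in `vars`.
variances = []
for theme in df['Theme'].unique():
    var = df['Click Through Rate'][df['Theme'] == theme].var()
    variances.append(var)

print(variances)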

[0.018811911457100493, 0.019836431683112173]


## Assessing Variance Differences Before the T-Test

At first sight, the difference between the variances of Click-Through Rate (CTR) for the Light and Dark themes appears to be **small** (about 0.0188 vs. 0.0198). However, to ensure statistical rigor, we will formally test for **variance equality** using **Levene’s Test for Variance Equality**.

This step is critical because:
- The **standard (pooled) T-test assumes equal variances** between groups.
- If the assumption is violated, using a standard T-test may lead to incorrect conclusions.
- **Levene’s Test** provides an objective way to determine whether we should proceed with a **standard T-test** or switch to **Welch’s T-test** (which does not assume equal variances).

By running this test, we ensure that our analysis meets the **necessary statistical rigor** and that our results are **valid and reliable**.


## Levene’s Test for Variance Equality

### **Why Are We Running This Test?**
Before performing a **T-test** to compare the mean Click-Through Rate (CTR) between the Light and Dark themes, we need to check a key assumption: **homogeneity of variances** (i.e., both groups should have similar variances).

The **Independent Samples T-test** assumes that the two groups have equal variance. If this assumption does not hold, using a standard T-test may lead to incorrect results. **Levene's Test** helps us determine whether we should assume equal variances or adjust our test accordingly.

### **What is Levene’s Test?**
Levene’s test checks whether the variance in **Click-Through Rate (CTR)** is significantly different between the two groups.

- **Null Hypothesis $(H_0)$**: The variances of the two groups are equal.
- **Alternative Hypothesis $(H_A)$**: The variances of the two groups are different.

### **Decision Rule**
- If **p-value > 0.05** → **Fail to reject $(H_0)$** → Assume equal variances → Use **standard T-test** (`equal_var=True`).
- If **p-value ≤ 0.05** → **Reject $(H_0)$** → Variances are different → Use **Welch’s T-test** (`equal_var=False`).

### **Why This Matters for Our T-Test**
- If Levene’s Test finds **no significant difference** between the variances, we can proceed with the **standard Independent Samples T-test**.
- If variances are **not equal**, we use **Welch’s T-test**, which adjusts for unequal variances.

By running Levene’s Test, we ensure that we choose the **correct version of the T-test**, making our statistical conclusions more reliable.
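
The decision rule above can be wrapped in a small helper. The sketch below is illustrative only: `compare_means` is a hypothetical function name, and the data are synthetic (not from this notebook's dataset), generated with deliberately different spreads so that the Welch branch is exercised.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

def compare_means(a, b, alpha=0.05):
    """Hypothetical helper: pick pooled or Welch's t-test from Levene's p-value."""
    _, p_levene = levene(a, b)
    equal_var = bool(p_levene > alpha)  # True -> pooled t-test, False -> Welch
    t_stat, p_value = ttest_ind(a, b, equal_var=equal_var)
    return equal_var, t_stat, p_value

# Synthetic data (not from the notebook's dataset) with clearly different spreads
rng = np.random.default_rng(0)
narrow = rng.normal(loc=0.25, scale=0.05, size=500)
wide = rng.normal(loc=0.25, scale=0.50, size=500)

equal_var, t_stat, p_value = compare_means(narrow, wide)
print("Assume equal variances:", equal_var)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```

Some analysts skip the pre-test and always use Welch's t-test, since it behaves almost identically to the pooled test when the variances are in fact equal; the two-step procedure is shown here because it mirrors the workflow of this notebook.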


In [31]:
from scipy.stats import levene


dark_theme_ctr = df[df['Theme'] == 'Dark Theme']['Click Through Rate']
light_theme_ctr = df[df['Theme'] == 'Light Theme']['Click Through Rate']


stat, p = levene(dark_theme_ctr, light_theme_ctr)

print(f"Levene's Test Statistic: {stat}")
print(f"p-value: {p}")

# Decision based on p-value
if p > 0.05:
    print("Fail to reject null hypothesis: Variances are equal.")
else:
    print("Reject null hypothesis: Variances are not equal.")


Levene's Test Statistic: 1.0173917110137343
p-value: 0.31338298665162745
Fail to reject null hypothesis: Variances are equal.


## Extracting Summary Statistics: Mean, Standard Deviation, and Sample Size

### **Why Are We Computing These Statistics?**
Before conducting hypothesis tests (such as the t-test), it is important to summarize key descriptive statistics for each group. This allows us to understand the central tendency and variability in the Click-Through Rate (CTR) data for both the Light and Dark themes.

The three key statistics we extract are:
- **Mean (Average CTR):** This tells us the typical click-through rate for each theme.
- **Standard Deviation (Dispersion):** This measures how spread out the CTR values are within each theme.
- **Sample Size:** This indicates how many observations (users) were exposed to each theme.

By computing these values first, we ensure that we have a clear understanding of the data before performing hypothesis testing.


In [33]:
means = []
stdevs = []
sample_sizes = [] 

for theme in df['Theme'].unique():
    n = df['Click Through Rate'][df['Theme']==theme].shape[0]
    sample_sizes.append(n) 
    mean = df['Click Through Rate'][df['Theme']==theme].mean()
    means.append(mean)
    stdev = df['Click Through Rate'][df['Theme']==theme].std()
    stdevs.append(stdev) 

print('Means: ', means) 
print('Standard deviations: ', stdevs)
print('Sample Sizes: ', sample_sizes) 

Means:  [0.24710871082680833, 0.2645008624648531]
Standard deviations:  [0.1371565217446859, 0.1408418676499008]
Sample Sizes:  [486, 514]


# **Independent Samples T-Test: Manual Calculation**

## **Why Are We Running This Test?**
The goal of this test is to determine whether the **Click-Through Rate (CTR)** differs significantly between the **Light Theme** and **Dark Theme** groups. Since we are comparing the means of two independent groups, we use an **Independent Samples T-Test**.

However, before running the t-test, we performed **Levene’s Test for Equality of Variances**, which found no significant difference between the variances of the two groups (p-value ≈ 0.31 > 0.05). This means we can reasonably assume equal variances and proceed with the **pooled variance t-test**, rather than Welch’s t-test, which would be preferred if the variances were unequal.

---

## **Understanding the Statistical Concepts**
We will compute the following:

### **1. Difference Between Means**
$$
\text{Difference} = \bar{X}_1 - \bar{X}_2
$$
This measures how much the average Click-Through Rate (CTR) differs between the Light and Dark themes.

### **2. Degrees of Freedom (df)**
Since we assume equal variances, the degrees of freedom for an independent t-test is calculated as:

$$
df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2
$$

This accounts for the total number of observations while adjusting for the two sample groups.

### **3. Pooled Variance (Equal Variance Assumption)**
Since Levene’s test gave no evidence of unequal variances, we use **pooled variance**, which combines the variance estimates from both groups:

$$
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{df}
$$

This formula weights each sample’s variance by its degrees of freedom, ensuring a more stable estimate.

### **4. Standard Error (SE) of the Difference in Means**
$$
SE = \sqrt{s_p^2 \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
$$
The standard error accounts for the variability in sample means and helps determine whether the difference between the two groups is meaningful.

### **5. T-Statistic**
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{SE}
$$
The t-statistic measures how many standard errors the observed difference is from 0.

### **6. P-Value**
$$
p = 2 \times P(T > |t|)
$$
The p-value tells us the probability of obtaining a difference as extreme as the one observed if the null hypothesis (no difference) is true. Since this is a **two-tailed** test, we multiply by 2.

---

## **Why Are We Using Pooled Variance Instead of Welch’s T-Test?**
- **Levene’s Test Outcome:** Since Levene’s test failed to reject the null hypothesis (p > 0.05), we assume that the variances in both groups are equal.
- **Consequence:** Under equal variance assumption, using pooled variance provides more statistical power than Welch’s test, which is used when variances are unequal.
- **If Variances Were Unequal?** We would use Welch’s t-test (`equal_var=False` in `scipy.stats.ttest_ind`), which adjusts for variance differences.
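
To see how much the choice between the pooled test and Welch's test matters for this data, both can be computed directly from the summary statistics printed earlier in this notebook. This is a sketch using `scipy.stats.ttest_ind_from_stats`; the dictionary names are illustrative, and the means, standard deviations, and sample sizes are copied from the summary-statistics cell.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics printed earlier in this notebook (Light first, then Dark)
light = dict(mean1=0.24710871082680833, std1=0.1371565217446859, nobs1=486)
dark = dict(mean2=0.2645008624648531, std2=0.1408418676499008, nobs2=514)

# Pooled-variance (Student) t-test vs. Welch's t-test on the same summaries
t_pooled, p_pooled = ttest_ind_from_stats(**light, **dark, equal_var=True)
t_welch, p_welch = ttest_ind_from_stats(**light, **dark, equal_var=False)

print(f"Pooled: t = {t_pooled:.4f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.4f}, p = {p_welch:.4f}")
```

Because the two groups have similar sizes and similar variances, the two versions of the test should give nearly the same statistic, so the choice has little practical effect here.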


In [35]:
from scipy.stats import t as t_student

In [36]:
difference_means = means[0] - means[1]  # Difference between group means
deg_freedom = sum([(n-1) for n in sample_sizes])  
print('Degrees of freedom: ', deg_freedom)
pooled_variance = 0
for i in range(len(sample_sizes)):
    n_minus_one = sample_sizes[i] - 1 
    pooled_variance += n_minus_one * (stdevs[i]**2)
pooled_variance = pooled_variance/deg_freedom
print(f"Pooled variance: {pooled_variance}")
stnd_error=np.sqrt(pooled_variance * sum([1/n for n in sample_sizes])) 


print(f"Difference between means: {difference_means}")
print(f"Standard Error: {stnd_error}") 

t_manual = abs(difference_means / stnd_error)  
print(f"T statistic: {t_manual}")
p_manual = 2 * t_student.sf(t_manual, deg_freedom)
print(f"Manually Calculated P-Value: {p_manual}")

Degrees of freedom:  998
Pooled variance: 0.019338543597324936
Difference between means: -0.017392151638044778
Standard Error: 0.008798571909437
T statistic: 1.9767016530706132
Manually Calculated P-Value: 0.048350311405825


# **Automating the T-Test with `scipy.stats.ttest_ind`**

## **Everything We Did Manually Can Be Done with a Single Function**

Previously, we manually computed the **t-statistic** and **p-value** step by step. However, Python provides a built-in function, `scipy.stats.ttest_ind`, that automates this entire process in a single line of code.

Below, we break down each part of the function so that you understand exactly what it does.

---

## **1. Importing the Necessary Function**
```python
from scipy.stats import ttest_ind
```
After extracting the CTR values for each theme (the same filtering used for Levene’s test), we have:

- `dark_theme_ctr`: A pandas Series of CTR values for users who saw the Dark Theme.
- `light_theme_ctr`: A pandas Series of CTR values for users who saw the Light Theme.


`t_scipy, p_value = ttest_ind(dark_theme_ctr, light_theme_ctr, equal_var=True)`

- **`t_scipy`** → The computed **t-statistic**.
- **`p_value`** → The **p-value**, which helps determine statistical significance.
- **`equal_var=True`**  
  - This tells `ttest_ind` that the two groups **have equal variances**.
  - We set this to `True` **because Levene’s Test did not reject equal variances**.
  - If **Levene’s test had rejected the null hypothesis** (indicating unequal variances), we would set `equal_var=False`, which would make `ttest_ind` run **Welch’s T-test** instead.


In [38]:
from scipy.stats import ttest_ind

# Extract Click-Through Rate values for both themes
dark_theme_ctr = df[df['Theme'] == 'Dark Theme']['Click Through Rate']
light_theme_ctr = df[df['Theme'] == 'Light Theme']['Click Through Rate']

# Run independent T-test
t_scipy, p_value = ttest_ind(dark_theme_ctr, light_theme_ctr, equal_var=True)  # Use equal_var=False (Welch) if Levene's test rejects equal variances

# Print results
print(f"Scipy T-statistic: {abs(t_scipy)}")
print(f"Scipy p-value: {p_value}")


Scipy T-statistic: 1.9767016530706143
Scipy p-value: 0.04835031140582486


## **Conclusion: Manual Calculation vs. SciPy's `ttest_ind`**

The manually computed **t-statistic** and **p-value** closely match the results obtained from SciPy's `ttest_ind()` function:

| **Metric**         | **Manual Calculation**  | **SciPy's `ttest_ind`** |
|--------------------|------------------------|-------------------------|
| **T-Statistic**    | 1.9767016530706132     | 1.9767016530706143      |
| **P-Value**       | 0.048350311405825      | 0.04835031140582486     |

### **Key Observations**
1. **Negligible Difference in T-Statistic**  
   - The difference between the manually computed t-statistic (**1.9767016530706132**) and SciPy's result (**1.9767016530706143**) is **extremely small** (on the order of $10^{-15}$).  
   - This is due to floating-point rounding in the intermediate arithmetic and is **not practically significant**.

2. **Identical P-Values Up to Precision Limits**  
   - The manually calculated p-value (**0.048350311405825**) and SciPy's output (**0.04835031140582486**) differ **only at the 15th decimal place**.  
   - This negligible discrepancy is purely computational and does **not** impact the hypothesis test's outcome.

### **Final Interpretation**
- The **manual computation reproduces the t-test**, confirming that it matches the output of SciPy's `ttest_ind()` to within floating-point precision.
- **Both methods lead to the same conclusion:**  
  - Since the **p-value (≈ 0.04835) is slightly below the conventional threshold of 0.05**, we **reject the null hypothesis** at a 5% significance level.  
  - However, the result is **marginally significant**, meaning the evidence for a difference in click-through rates is not overwhelmingly strong.

### **Practical Recommendation**
- Since SciPy's `ttest_ind()` automates the t-test and ensures accuracy, it should be the **preferred method for future analyses**.
- The **manual calculation remains valuable for learning and verification**, helping ensure an understanding of statistical concepts.


In [40]:
alpha = 0.05
if p_value > alpha:
    print('Fail to reject null hypothesis. No significant difference between the means') 
else:
    print('Reject null hypothesis. There is a significant difference between the means') 

Reject null hypothesis. There is a significant difference between the means


## <font color=red>**Conclusion**</font>

The goal of this analysis is **not** to decide whether to implement a new feature but rather to evaluate whether there is a **true performance difference** between the two layouts (Light Theme vs. Dark Theme), as explained in the dataset description at the link above:
> An online bookstore is looking to optimize its website design to improve user engagement and ultimately increase book purchases. The website currently offers two themes for its users: “Light Theme” and “Dark Theme.” The bookstore’s data science team wants to conduct an A/B testing experiment to determine which theme leads to better user engagement and higher conversion rates for book purchases.

Based on **click-through rate (CTR) alone**, the evidence is **marginal**, and it does not provide an unequivocal answer to whether one layout definitively outperforms the other.

### **Key Points to Emphasize**
1. **Marginal Significance**  
   - Although the p-value is just below 0.05, the difference in mean CTR is small (about 0.265 for the Dark Theme vs. 0.247 for the Light Theme), indicating that **if** there is a difference, it may be minimal in practical terms.

2. **Limits of a Single Metric**  
   - CTR alone **cannot conclusively prove or disprove** a true performance difference between the themes.
   - Other metrics (e.g., time on site, conversion rates, return visits) may offer additional insight into user behavior and engagement.

3. **Need for Broader Analysis**  
   - To **confidently** determine whether one theme genuinely outperforms the other, consider:
     - **Multiple Performance Indicators**: Session duration, pages per session, conversion rates, etc.  
     - **User Segmentation**: Performance may vary across different user demographics or usage contexts.
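
The size of the CTR difference can be quantified with a standardized effect size and a confidence interval. The sketch below is an addition to the original analysis, not part of it: it assumes Cohen's d with the pooled standard deviation as the effect-size measure and reuses the values printed in the manual t-test cell.

```python
import numpy as np
from scipy.stats import t as t_student

# Values printed in the summary-statistics and manual t-test cells of this notebook
mean_light, mean_dark = 0.24710871082680833, 0.2645008624648531
pooled_variance = 0.019338543597324936
standard_error = 0.008798571909437
deg_freedom = 998

# Cohen's d: difference in means divided by the pooled standard deviation
diff = mean_dark - mean_light
cohens_d = diff / np.sqrt(pooled_variance)

# 95% confidence interval for the difference in means
t_crit = t_student.ppf(0.975, deg_freedom)
ci_low, ci_high = diff - t_crit * standard_error, diff + t_crit * standard_error

print(f"Difference (Dark - Light): {diff:.4f}")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

By Cohen's conventional benchmarks, values of d around 0.2 are labeled small, so a d below that supports the reading that the difference is modest; the lower bound of the interval sitting just above zero matches the borderline p-value.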

### **Bottom Line**
Given the **borderline statistical significance** and the **narrow focus on CTR**, we cannot **definitively conclude** that one theme is superior overall. For a more **robust** conclusion, analyze a **broader set of user engagement and conversion metrics** to see if the Dark Theme truly offers meaningful advantages, or if the observed difference is simply **not strong enough** to warrant action or a definitive claim of better performance.
