# Project Description: A/B Test Statistical Analysis
This project executes a comprehensive statistical analysis of an A/B test comparing the conversion performance of two different billing pages: /billing (Control) and /billing-2 (Test).

The analysis follows three main steps: data extraction, preparation of the contingency table, and hypothesis testing using the Chi-Squared method.

In [1]:
# Install dependencies once if needed:
# pip install pandas sqlalchemy mysql-connector-python scipy notebook
import pandas as pd
from sqlalchemy import create_engine, text
from scipy.stats import chi2_contingency

## Data Extraction

The initial phase connects to a MySQL database using SQLAlchemy to extract aggregated metrics for the two billing page variants over a defined time period.

The SQL query performs the following aggregations:

* total_sessions: Counts the distinct website sessions that viewed each page.

* total_orders: Counts the distinct orders (conversions) placed in those sessions for each page.

* total_revenue: Sums the order revenue (price_usd) attributed to each page, with 0 when a page has no orders.
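
To make the three aggregations concrete, here is a minimal pandas sketch of the same logic on toy data. The two small DataFrames and their values are hypothetical and only mirror the column names used by the query; they are not taken from the database.

```python
import pandas as pd

# Hypothetical toy data mirroring the columns the query relies on.
pageviews = pd.DataFrame({
    "website_session_id": [1, 2, 3, 4, 5],
    "pageview_url": ["/billing", "/billing", "/billing-2", "/billing-2", "/billing-2"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "website_session_id": [1, 3, 4],
    "price_usd": [49.99, 49.99, 59.99],
})

# LEFT JOIN: sessions without an order are kept, with NaN in the order columns.
joined = pageviews.merge(orders, on="website_session_id", how="left")

toy_summary = joined.groupby("pageview_url").agg(
    total_sessions=("website_session_id", "nunique"),
    total_orders=("order_id", "nunique"),   # nunique skips NaN, like COUNT(DISTINCT ...)
    total_revenue=("price_usd", "sum"),     # sum skips NaN, so sessions without orders add 0
).reset_index()

print(toy_summary)
```

Sessions without an order still count toward total_sessions, which is why the query uses a LEFT JOIN rather than an inner join.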

In [2]:
# Database configuration
from getpass import getpass

DB_USER = input("database user: ")
DB_PASSWORD = getpass("database password: ")  # hidden prompt keeps the password out of the notebook
DB_PORT = input("database port: ")
DB_HOST = 'localhost'
DB_NAME = 'toy_store_ecommerce'

# Create the SQLAlchemy engine; echo=True logs every statement the engine runs
connection_string = f"mysql+mysqlconnector://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(connection_string, echo=True)
conn = engine.connect()

2025-12-06 11:57:39,666 INFO sqlalchemy.engine.Engine SELECT DATABASE()
2025-12-06 11:57:39,680 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-06 11:57:39,680 INFO sqlalchemy.engine.Engine SELECT @@sql_mode
2025-12-06 11:57:39,680 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-06 11:57:39,680 INFO sqlalchemy.engine.Engine SELECT @@lower_case_table_names
2025-12-06 11:57:39,680 INFO sqlalchemy.engine.Engine [raw sql] {}


In [3]:
sql_query = text("""
    WITH billing_sessions_unique AS (
        SELECT 
            website_session_id,
            pageview_url
        FROM website_pageviews
        WHERE pageview_url IN ('/billing', '/billing-2')
        AND created_at BETWEEN '2012-09-01' AND '2013-01-31'
    ),
    ab_test_billing_results AS (
        SELECT 
            b.pageview_url,
            COUNT(DISTINCT b.website_session_id) AS total_sessions,
            COUNT(DISTINCT o.order_id) AS total_orders,
            COALESCE(SUM(o.price_usd), 0) AS total_revenue
        FROM billing_sessions_unique b
        LEFT JOIN orders o 
            ON b.website_session_id = o.website_session_id
        GROUP BY b.pageview_url
    )
    SELECT
        pageview_url,
        total_sessions,
        total_orders,
        total_revenue
    FROM ab_test_billing_results
""")

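# Run the query and load the rows into a DataFrame, naming the columns from the result metadata
output = conn.execute(sql_query)
result = pd.DataFrame(output.fetchall(), columns=list(output.keys()))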


In [4]:
result.head(5)

  pageview_url  total_sessions  total_orders  total_revenue
0     /billing            1797           816       40791.84
1   /billing-2            2154          1345       67686.55
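
As a quick read of the table above, the per-page rates can be derived with simple arithmetic. This is a minimal sketch: the `summary` DataFrame re-enters the four numbers by hand so the block runs on its own, and the derived column names are illustrative rather than part of the original analysis.

```python
import pandas as pd

# Values copied from the query output shown above.
summary = pd.DataFrame({
    "pageview_url": ["/billing", "/billing-2"],
    "total_sessions": [1797, 2154],
    "total_orders": [816, 1345],
    "total_revenue": [40791.84, 67686.55],
})

# Derived per-page metrics (column names are illustrative).
metrics = summary.assign(
    conversion_rate=summary["total_orders"] / summary["total_sessions"],
    revenue_per_session=summary["total_revenue"] / summary["total_sessions"],
    average_order_value=summary["total_revenue"] / summary["total_orders"],
)

print(metrics.round(4))
```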


## Data Preparation and Contingency Table Creation

The raw results, loaded into a Pandas DataFrame named result, are prepared for statistical testing.

* Conversion Metrics: The core data for the test are the successes (total_orders) and failures (no_orders).

* Calculating Failures: A new column, no_orders, is calculated as: $$\text{no\_orders} = \text{total\_sessions} - \text{total\_orders}$$

* Contingency Table: The data is structured into a $2\times 2$ table, which is the required format for the Chi-Squared Test, comparing the outcome (Orders vs. No Orders) across the two pages (/billing vs. /billing-2).

In [5]:
result['no_orders'] = result['total_sessions'] - result['total_orders']
table_2x2 = result[['total_orders', 'no_orders']]

table_2x2.index = result['pageview_url']
contingency_table = table_2x2.T

print(contingency_table)

pageview_url  /billing  /billing-2
total_orders       816        1345
no_orders          981         809
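
To show what the Chi-Squared test does with this table, the sketch below computes the expected counts under independence (row total times column total, divided by the grand total) and the Pearson statistic by hand, then compares both with scipy. The `observed` array is typed in from the printed table so the block is self-contained.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table printed above.
# Rows: total_orders, no_orders. Columns: /billing, /billing-2.
observed = np.array([[816, 1345],
                     [981,  809]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected count per cell if conversion were independent of the page.
expected = row_totals * col_totals / n

# Pearson Chi-Squared statistic without continuity correction.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, _, dof, expected_scipy = chi2_contingency(observed, correction=False)
print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2_scipy:.4f}, dof = {dof}")
```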


## Statistical Analysis

The final step uses a Chi-Squared test of independence to determine whether the difference in conversion rates is statistically significant. The Chi-Squared test itself is non-directional, so the one-sided result is obtained by halving its p-value, which is valid for a $2\times 2$ table when the observed difference points in the hypothesized direction.

Hypotheses:

* $H_0$ (Null Hypothesis): The conversion rate does not differ between the pages.

* $H_1$ (Alternative Hypothesis): The conversion rate for /billing-2 is higher than for /billing.

Test Execution: 
The scipy.stats.chi2_contingency function is used without continuity correction (correction=False). The two-sided $p$-value it returns is then halved to test the directional hypothesis ($H_1$).

Conclusion: 
With a one-sided p-value close to 0 and a conversion rate for /billing-2 ($\approx 62.44\%$) about 17 percentage points above that of /billing ($\approx 45.41\%$), the null hypothesis is rejected. The test concludes that /billing-2 provides a statistically significant improvement in conversion rate.
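
One way to see why halving the p-value is reasonable here: for a $2\times 2$ table, the uncorrected Chi-Squared statistic equals the square of the pooled two-proportion z statistic, and the z-test has a natural one-sided form. The sketch below illustrates that equivalence using the counts reported in this notebook; it is a cross-check under those assumptions, not the method used in the analysis cell.

```python
import math
from scipy.stats import norm, chi2

# Counts reported in this notebook.
orders_a, sessions_a = 816, 1797     # /billing (Control)
orders_b, sessions_b = 1345, 2154    # /billing-2 (Test)

p_a = orders_a / sessions_a
p_b = orders_b / sessions_b
p_pooled = (orders_a + orders_b) / (sessions_a + sessions_b)

# Standard error of the difference under H0 (common conversion rate).
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / sessions_a + 1 / sessions_b))
z = (p_b - p_a) / se

p_one_sided_z = norm.sf(z)                 # upper tail: P(Z >= z)
p_two_sided_chi2 = chi2.sf(z ** 2, df=1)   # z**2 is the Pearson statistic for a 2x2 table

print(f"z = {z:.4f}, z^2 = {z ** 2:.4f}")
print(f"one-sided p (z-test)   = {p_one_sided_z:.3e}")
print(f"two-sided p (chi2) / 2 = {p_two_sided_chi2 / 2:.3e}")
```

The last two printed values should agree up to floating-point error, because z is positive (the observed difference is in the direction of $H_1$).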

In [7]:
# H0: The conversion rate does not differ between pages.
# H1: The conversion rate for /billing-2 is higher than for /billing.

alpha = 0.05

# chi2_contingency returns the two-sided p-value of the (non-directional) test
chi2_stat, p_two_sided, dof, expected_counts = chi2_contingency(contingency_table, correction=False)

# Calculate Conversion Rates (CR) for reporting, selecting cells by label rather than position
cr_billing = contingency_table.loc['total_orders', '/billing'] / contingency_table['/billing'].sum()
cr_billing_2 = contingency_table.loc['total_orders', '/billing-2'] / contingency_table['/billing-2'].sum()

# Halve the p-value only when the observed difference points in the direction of H1;
# otherwise the one-sided p-value is the complementary tail.
if cr_billing_2 > cr_billing:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

# Outputs and Conclusion
print(f"Conversion rate /billing: {cr_billing * 100:.2f}%")
print(f"Conversion rate /billing-2: {cr_billing_2 * 100:.2f}%")
print("-" * 50)
print("--- Chi-Squared Test Results (One-Sided) ---")
print(f"Significance Level (alpha): {alpha}")
print(f"Chi-Squared Statistic: {chi2_stat:.4f}")
print(f"One-sided p-value: {p_one_sided:.5f}")
print("-" * 50)

# Statistical Conclusion
if p_one_sided < alpha:
    print("The one-sided p-value is less than alpha (0.05). We reject the null hypothesis (H0).")
    print("Conclusion: There is statistically significant evidence that the conversion rate for /billing-2 is higher.")
else:
    print("The one-sided p-value is greater than or equal to alpha (0.05). We fail to reject the null hypothesis (H0).")
    print("Conclusion: No statistically significant difference was found in the hypothesized direction.")

Conversion rate /billing: 45.41%
Conversion rate /billing-2: 62.44%
--------------------------------------------------
--- Chi-Squared Test Results (One-Sided) ---
Significance Level (alpha): 0.05
Chi-Squared Statistic: 114.7025
One-sided p-value: 0.00000
--------------------------------------------------
The one-sided p-value is less than alpha (0.05). We reject the null hypothesis (H0).
Conclusion: There is statistically significant evidence that the conversion rate for /billing-2 is higher.
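
A significance test says the difference is unlikely to be zero but not how large it is. As an optional complement, the sketch below estimates a 95% Wald confidence interval for the difference in conversion rates from the counts above. The Wald interval is one common approximation chosen here for illustration; it is an assumption of this sketch and not part of the original analysis.

```python
import math
from scipy.stats import norm

# Counts reported in this notebook.
orders_a, sessions_a = 816, 1797     # /billing (Control)
orders_b, sessions_b = 1345, 2154    # /billing-2 (Test)

p_a = orders_a / sessions_a
p_b = orders_b / sessions_b
diff = p_b - p_a

# Unpooled standard error, since the interval does not assume H0 is true.
se_unpooled = math.sqrt(p_a * (1 - p_a) / sessions_a + p_b * (1 - p_b) / sessions_b)

z_crit = norm.ppf(0.975)             # two-sided 95% critical value
ci_low = diff - z_crit * se_unpooled
ci_high = diff + z_crit * se_unpooled
relative_lift = diff / p_a

print(f"Absolute difference: {diff * 100:.2f} percentage points")
print(f"95% CI: [{ci_low * 100:.2f}, {ci_high * 100:.2f}] percentage points")
print(f"Relative lift over /billing: {relative_lift * 100:.1f}%")
```

An interval that lies entirely above zero is consistent with the rejection of $H_0$ reported above.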
