# Task 3 — A/B Hypothesis Testing
**Goal:** Test the following null hypotheses and reject or fail to reject each one at the chosen significance level:
- H0: There are no risk differences across provinces
- H0: There are no risk differences between zip codes
- H0: There is no significant margin (profit) difference between zip codes
- H0: There is no significant risk difference between Women and Men

Run each cell in order. Outputs and plots will be saved under `outputs/task3/`.


In [1]:
# Cell: Setup - imports and folders
import os
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Create output directories
OUT_DIR = Path("outputs/task3")
OUT_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "figures").mkdir(exist_ok=True)
(OUT_DIR / "tables").mkdir(exist_ok=True)

# Display options
pd.set_option("display.max_columns", 200)
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias for it


### 1. Load cleaned data


In [3]:
# Cell: Load data
DATA_PATH = Path("../data/processed/cleaned_data.csv")
assert DATA_PATH.exists(), f"File not found: {DATA_PATH}. Put cleaned_data.csv in data/processed/"

df = pd.read_csv(DATA_PATH, parse_dates=["TransactionMonth"], low_memory=False)
print("Loaded:", DATA_PATH)
print("Shape:", df.shape)
df.head(3)


Loaded: ..\data\processed\cleaned_data.csv
Shape: (1000098, 56)


Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,MaritalStatus,Gender,Country,Province,PostalCode,MainCrestaZone,SubCrestaZone,ItemType,mmcode,VehicleType,RegistrationYear,make,Model,Cylinders,cubiccapacity,kilowatts,bodytype,NumberOfDoors,VehicleIntroDate,CustomValueEstimate,AlarmImmobiliser,TrackingDevice,CapitalOutstanding,NewVehicle,WrittenOff,Rebuilt,Converted,CrossBorder,NumberOfVehiclesInFleet,SumInsured,TermFrequency,CalculatedPremiumPerTerm,ExcessSelected,CoverCategory,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims,LossRatio,VehicleAge,Province_Encoded,VehicleType_Encoded
0,145249,12827,2015-03-01,True,,Close Corporation,Mr,English,First National Bank,Current account,Not specified,Not specified,South Africa,Gauteng,1459,Rand East,Rand East,Mobility - Motor,44069150.0,Passenger Vehicle,2004,MERCEDES-BENZ,E 240,6.0,2597.0,130.0,S/D,4.0,6/2002,119300.0,Yes,No,119300,More than 6 months,No,No,No,No,,-0.400557,Monthly,25.0,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,-0.173593,0.001403,0.0,21,0,0
1,145249,12827,2015-05-01,True,,Close Corporation,Mr,English,First National Bank,Current account,Not specified,Not specified,South Africa,Gauteng,1459,Rand East,Rand East,Mobility - Motor,44069150.0,Passenger Vehicle,2004,MERCEDES-BENZ,E 240,6.0,2597.0,130.0,S/D,4.0,6/2002,119300.0,Yes,No,119300,More than 6 months,No,No,No,No,,-0.400557,Monthly,25.0,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,-0.173593,0.001403,0.0,21,0,0
2,145249,12827,2015-07-01,True,,Close Corporation,Mr,English,First National Bank,Current account,Not specified,Not specified,South Africa,Gauteng,1459,Rand East,Rand East,Mobility - Motor,44069150.0,Passenger Vehicle,2004,MERCEDES-BENZ,E 240,6.0,2597.0,130.0,S/D,4.0,6/2002,119300.0,Yes,No,119300,More than 6 months,No,No,No,No,,-0.400557,Monthly,25.0,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,-0.268822,0.001403,0.0,21,0,0
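The later cells read `Province`, `PostalCode`, `Gender`, `TotalPremium` and `TotalClaims` directly, so a missing column would only surface several cells further down. A minimal sketch of an early check follows; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, not part of the notebook, and the toy frame merely stands in for `df`.

```python
import pandas as pd

# Columns that the later cells in this notebook read directly.
REQUIRED_COLUMNS = ["Province", "PostalCode", "Gender", "TotalPremium", "TotalClaims"]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the real df; it deliberately lacks "Gender".
toy = pd.DataFrame({
    "Province": ["Gauteng"],
    "PostalCode": [1459],
    "TotalPremium": [0.0],
    "TotalClaims": [0.0],
})
print(missing_columns(toy))
```

On the real data the call would be `missing_columns(df)`, and an empty list means the notebook can proceed.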


### 2. KPI creation: ClaimFrequency, ClaimSeverity, Margin


In [4]:
# Cell: KPI creation
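# ClaimFrequency: 1 if TotalClaims > 0, else 0
# NOTE: this threshold assumes TotalClaims is in raw currency units. If the column
# was standardized upstream, "> 0" no longer separates claims from non-claims.
df["ClaimFrequency"] = (df["TotalClaims"] > 0).astype(int)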

# ClaimSeverity: amount given a claim occurred. NaN for non-claim rows
df["ClaimSeverity"] = df["TotalClaims"].where(df["TotalClaims"] > 0, np.nan)

# Margin: TotalPremium - TotalClaims
df["Margin"] = df["TotalPremium"] - df["TotalClaims"]

# Quick checks
kpi_table = df[["TotalPremium", "TotalClaims", "ClaimFrequency", "ClaimSeverity", "Margin"]].describe().T
kpi_table.to_csv(OUT_DIR / "tables/kpi_summary.csv")
kpi_table


KPI,count,mean,std,min,25%,50%,75%,max
TotalPremium,1000098.0,-1.78755e-17,1.0,-3.667128,-0.268822,-0.259363,-0.173593,283.218049
TotalClaims,1000098.0,-9.458773000000001e-17,1.0,-902.412051,0.001403,0.001403,0.001403,0.001403
ClaimFrequency,1000098.0,0.999995,0.002235954,0.0,1.0,1.0,1.0,1.0
ClaimSeverity,1000093.0,0.001402605,2.168405e-19,0.001403,0.001403,0.001403,0.001403,0.001403
Margin,1000098.0,6.44257e-17,1.416474,-3.66853,-0.270225,-0.260765,-0.174995,904.930374
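The summary above shows `TotalPremium` and `TotalClaims` with a mean near 0 and a standard deviation of 1, which suggests both columns were standardized before this notebook ran. In that case `TotalClaims > 0` no longer means "a claim occurred", which would explain a `ClaimFrequency` mean of 0.999995. One possible workaround, sketched below on toy data, treats the most common value as the no-claim baseline. That the modal value corresponds to zero claims is an assumption to verify against the raw data, and `claim_indicator_from_scaled` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def claim_indicator_from_scaled(claims: pd.Series) -> pd.Series:
    """Flag rows whose value differs from the modal value.

    Assumption: the most frequent value is the scaled image of "no claim".
    """
    baseline = claims.mode().iloc[0]
    differs = ~np.isclose(claims.to_numpy(dtype=float), baseline)
    return pd.Series(differs.astype(int), index=claims.index)

# Toy scaled column: four rows at the baseline, two rows that differ from it.
toy_claims = pd.Series([0.001403, 0.001403, 0.001403, -2.5, 0.001403, -0.75])
indicator = claim_indicator_from_scaled(toy_claims)
print(indicator.tolist())
print(indicator.mean())
```

Recomputing the KPIs from the unscaled premium and claim amounts, if they are available, would avoid the assumption altogether.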


### 3. Helper functions: group selection, tests, and reporting
- Claim frequency (a binary outcome) is compared across groups with a chi-square test of independence; numeric KPIs (severity, margin) are compared with Welch's two-sample t-test.
- When exactly two groups are compared on claim frequency, a pooled two-proportion z-test is implemented directly.
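For exactly two groups, the two frequency tests listed above agree: the square of the pooled two-proportion z statistic equals the Pearson chi-square statistic of the corresponding 2x2 table, provided no continuity correction is applied. A small check on made-up counts (the numbers are illustrative only):

```python
import numpy as np
from scipy import stats

# Made-up counts: claims and policies in two groups.
claims_a, n_a = 30, 1000
claims_b, n_b = 45, 1200

# Pooled two-proportion z statistic.
p_pool = (claims_a + claims_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (claims_a / n_a - claims_b / n_b) / se
p_z = 2 * stats.norm.sf(abs(z))

# Pearson chi-square on the equivalent 2x2 table, without Yates' correction.
table = np.array([
    [claims_a, n_a - claims_a],
    [claims_b, n_b - claims_b],
])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.6f}, chi2 = {chi2:.6f}")
print(f"p (z-test) = {p_z:.6f}, p (chi2) = {p_chi2:.6f}")
```

Note that `stats.chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so a default call on a 2x2 table gives a slightly different statistic.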


In [5]:
# Cell: helper functions

def select_top_two_provinces(df, min_count=100):
    """Return the two most frequent provinces.

    Provinces with at least min_count records are preferred; if fewer than
    two qualify, fall back to the two most frequent provinces overall.
    """
    counts = df["Province"].value_counts()
    enough = counts[counts >= min_count]
    if len(enough) >= 2:
        return list(enough.index[:2])
    else:
        return list(counts.index[:2])

def select_top_two_zipcodes(df, min_count=100):
    """Return the two most frequent postal codes, with the same min_count
    preference and fallback as select_top_two_provinces."""
    counts = df["PostalCode"].value_counts()
    enough = counts[counts >= min_count]
    if len(enough) >= 2:
        return list(enough.index[:2])
    else:
        # fallback: pick two with highest counts
        return list(counts.index[:2])

def two_sample_ttest(a, b, nan_policy="omit"):
    """Welch's t-test for two independent samples."""
    a_clean = a.dropna()
    b_clean = b.dropna()
    if len(a_clean) < 3 or len(b_clean) < 3:
        return {"stat": np.nan, "pvalue": np.nan, "n_a": len(a_clean), "n_b": len(b_clean)}
    stat, p = stats.ttest_ind(a_clean, b_clean, equal_var=False, nan_policy=nan_policy)
    return {"stat": stat, "pvalue": p, "n_a": len(a_clean), "n_b": len(b_clean)}

def chi2_test_frequency(df, group_col):
    """Run chi-square test for ClaimFrequency across categories of group_col."""
    table = pd.crosstab(df[group_col], df["ClaimFrequency"])
    chi2, p, dof, ex = stats.chi2_contingency(table)
    return {"chi2": chi2, "pvalue": p, "dof": dof, "table": table}

def two_proportion_ztest(count1, n1, count2, n2):
    """Manual two-proportion z-test (two-sided)."""
    p1 = count1 / n1
    p2 = count2 / n2
    p_pool = (count1 + count2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
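    if se == 0:
        # Pooled rate is exactly 0 or 1, so the z statistic is undefined.
        return {"z": np.nan, "pvalue": np.nan, "p1": p1, "p2": p2}
    z = (p1 - p2) / se
    pvalue = 2 * stats.norm.sf(abs(z))  # sf keeps precision for large |z|, unlike 1 - cdf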
    return {"z": z, "pvalue": pvalue, "p1": p1, "p2": p2}


### 4. Visual checks and tables (save outputs)
The plots and tables below support the tests. Figures are saved to `outputs/task3/figures` and tables to `outputs/task3/tables`.


In [6]:
# Cell: plot claim frequency by province
freq_by_prov = df.groupby("Province")["ClaimFrequency"].agg(["mean", "count"]).sort_values("mean", ascending=False)
fig, ax = plt.subplots(figsize=(10, 5))
freq_by_prov["mean"].plot(kind="bar", ax=ax)
ax.set_ylabel("Claim Frequency (proportion)")
ax.set_title("Claim Frequency by Province")
plt.tight_layout()
fig.savefig(OUT_DIR / "figures/claim_frequency_by_province.png", dpi=150)
plt.close(fig)

# Save table
freq_by_prov.to_csv(OUT_DIR / "tables/claim_frequency_by_province.csv")
freq_by_prov.head()


Province,mean,count
Eastern Cape,1.0,30336
Free State,1.0,8099
KwaZulu-Natal,1.0,169781
Limpopo,1.0,24836
North West,1.0,143287


In [13]:
# Cell: boxplot of ClaimSeverity by Province (only for claims)
claims = df[df["ClaimFrequency"] == 1]
plt.figure(figsize=(12,6))
order = claims["Province"].value_counts().index[:10]
sns.boxplot(data=claims[claims["Province"].isin(order)], x="Province", y="ClaimSeverity")
plt.xticks(rotation=45)
plt.title("Claim Severity by Province (top 10 provinces by count)")
plt.tight_layout()
plt.savefig(OUT_DIR / "figures/claim_severity_by_province_top10.png", dpi=150)
plt.close()


### 5. Hypothesis tests
We run tests and produce a summary table with p-values and interpretation.


In [14]:
# Cell: Hypothesis A — Provinces: claim frequency difference (Chi-square)
prov_result = chi2_test_frequency(df, "Province")
print("Chi-square test for ClaimFrequency across provinces:")
print(f"chi2 = {prov_result['chi2']:.3f}, p = {prov_result['pvalue']:.6f}, dof = {prov_result['dof']}")
prov_result["table"].to_csv(OUT_DIR / "tables/province_claimfreq_contingency.csv")


Chi-square test for ClaimFrequency across provinces:
chi2 = 13.379, p = 0.099455, dof = 8
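The chi-square approximation is usually considered reliable only when the expected cell counts are not too small; a common rule of thumb is at least 5 per cell. The KPI summary implies that `ClaimFrequency` is 0 in only 5 of 1,000,098 rows, so the expected counts in that column are likely well below that threshold and the p-value above should be read with caution. A sketch of the check on a toy table follows; `min_expected_count` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def min_expected_count(table: pd.DataFrame) -> float:
    """Smallest expected cell count under independence (hypothetical helper)."""
    observed = table.to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, cols)
    expected = row_totals @ col_totals / observed.sum()
    return float(expected.min())

# Toy contingency table: rows are groups, columns are ClaimFrequency 0 and 1.
toy_table = pd.DataFrame(
    {0: [1, 0, 2], 1: [999, 500, 1498]},
    index=["A", "B", "C"],
)
print(min_expected_count(toy_table))
```

In the notebook the same check could be run on `prov_result["table"]`; a result below 5 would argue for an exact or simulation-based test instead.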


In [8]:
# Cell: Pairwise test example (Top 2 provinces)
p1, p2 = select_top_two_provinces(df, min_count=50)
print("Comparing provinces:", p1, "vs", p2)

gA = df[df["Province"] == p1]
gB = df[df["Province"] == p2]

# Two-proportion z-test for frequencies
countA = gA["ClaimFrequency"].sum()
nA = len(gA)
countB = gB["ClaimFrequency"].sum()
nB = len(gB)
prop_test = two_proportion_ztest(countA, nA, countB, nB)

print(f"{p1}: freq={countA}/{nA}={prop_test['p1']:.4f}")
print(f"{p2}: freq={countB}/{nB}={prop_test['p2']:.4f}")
print(f"two-prop z-test: z={prop_test['z']:.3f}, p={prop_test['pvalue']:.6f}")

# Save pairwise table
pairwise = pd.DataFrame({
    "group":[p1,p2],"count":[nA,nB],"claims":[countA,countB],"claim_rate":[prop_test['p1'], prop_test['p2']]
})
pairwise.to_csv(OUT_DIR / "tables/pairwise_province_freq.csv", index=False)


Comparing provinces: Gauteng vs Western Cape
Gauteng: freq=393863/393865=1.0000
Western Cape: freq=170795/170796=1.0000
two-prop z-test: z=0.116, p=0.907367
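A p-value alone does not show how large the difference could plausibly be. As a rough companion, the sketch below computes an unpooled (Wald) 95% interval for the difference in claim rates from the counts printed above. The Wald interval is known to behave poorly when proportions sit this close to 1, so treat it as an illustration of the calculation rather than a recommended method.

```python
import math

# Counts copied from the output above.
claims_g, n_g = 393863, 393865      # Gauteng
claims_w, n_w = 170795, 170796      # Western Cape

rate_g, rate_w = claims_g / n_g, claims_w / n_w
diff = rate_g - rate_w

# Unpooled (Wald) standard error and a 95% interval for the difference.
se = math.sqrt(rate_g * (1 - rate_g) / n_g + rate_w * (1 - rate_w) / n_w)
z_crit = 1.959963984540054          # 97.5th percentile of the standard normal
lower, upper = diff - z_crit * se, diff + z_crit * se
print(f"difference = {diff:.3e}, 95% CI = [{lower:.3e}, {upper:.3e}]")
```

The interval straddles zero, which is consistent with the z-test's failure to reject H0.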


In [9]:
# Cell: Hypothesis B — Zipcodes (PostalCode) risk difference (choose two zip codes)
zip_a, zip_b = select_top_two_zipcodes(df, min_count=50)
print("Comparing zip codes:", zip_a, "vs", zip_b)

gA = df[df["PostalCode"] == zip_a]
gB = df[df["PostalCode"] == zip_b]

countA = gA["ClaimFrequency"].sum()
nA = len(gA)
countB = gB["ClaimFrequency"].sum()
nB = len(gB)
zip_prop_test = two_proportion_ztest(countA, nA, countB, nB)

print(f"{zip_a}: freq={countA}/{nA}={zip_prop_test['p1']:.4f}")
print(f"{zip_b}: freq={countB}/{nB}={zip_prop_test['p2']:.4f}")
print(f"two-prop z-test: z={zip_prop_test['z']:.3f}, p={zip_prop_test['pvalue']:.6f}")

# Save summary
pd.DataFrame([{
    "zip_a": zip_a, "n_a": nA, "claims_a": countA, "rate_a": zip_prop_test['p1'],
    "zip_b": zip_b, "n_b": nB, "claims_b": countB, "rate_b": zip_prop_test['p2'],
    "z": zip_prop_test['z'], "pvalue": zip_prop_test['pvalue']
}]).to_csv(OUT_DIR / "tables/zip_comparison_freq.csv", index=False)


Comparing zip codes: 2000 vs 122
2000: freq=133498/133498=1.0000
122: freq=49171/49171=1.0000
two-prop z-test: z=nan, p=nan
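The `nan` result here comes from the helper's zero-variance guard rather than from a lack of data: both groups have a claim rate of exactly 1, so the pooled variance term p(1 - p) is zero and the z statistic is undefined. Reproducing the arithmetic with the counts printed above:

```python
import math

# Counts copied from the output above.
count_a, n_a = 133498, 133498   # zip code 2000
count_b, n_b = 49171, 49171     # zip code 122

p_pool = (count_a + count_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
print(p_pool, se)   # a standard error of zero sends the helper down its NaN branch
```

With no variation in the outcome within either group, this comparison cannot say anything about zip-code risk under the current `ClaimFrequency` definition.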


In [10]:
# Cell: Hypothesis C — Margin difference between ZIP codes (t-test)
# gA and gB are the two zip-code groups defined in the Hypothesis B cell above
margin_A = gA["Margin"]
margin_B = gB["Margin"]
margin_test = two_sample_ttest(margin_A, margin_B)
print("Margin two-sample t-test (Welch):")
print(f"n_a={margin_test['n_a']}, n_b={margin_test['n_b']}, t={margin_test['stat']}, p={margin_test['pvalue']}")

# Save numbers
pd.DataFrame([{"zip_a": zip_a, "n_a": margin_test['n_a'], "zip_b": zip_b, "n_b": margin_test['n_b'],
               "t_stat": margin_test['stat'], "pvalue": margin_test['pvalue']}]).to_csv(OUT_DIR / "tables/zip_margin_ttest.csv", index=False)


Margin two-sample t-test (Welch):
n_a=133498, n_b=49171, t=9.384438505987141, p=6.467145608365507e-21
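With more than 180,000 observations, even a very small mean difference can produce a tiny p-value, so an effect size is a useful companion to the t statistic. The sketch below computes Cohen's d with a pooled standard deviation on synthetic data; `cohens_d` is a hypothetical helper and the simulated margins are not the notebook's data.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]     # drop missing values
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Synthetic margins whose true means differ by 0.05 standard deviations.
rng = np.random.default_rng(0)
margin_a = rng.normal(loc=0.05, scale=1.0, size=100_000)
margin_b = rng.normal(loc=0.00, scale=1.0, size=100_000)
d = cohens_d(margin_a, margin_b)
print(f"Cohen's d = {d:.3f}")
```

In the notebook the call would be `cohens_d(margin_A, margin_B)`. By the usual convention, values around 0.2 are considered small, so a d far below that would indicate a statistically detectable but practically minor margin gap.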


In [11]:
# Cell: Hypothesis D — Gender differences
# Frequency: chi-square or two-prop
gender_table = pd.crosstab(df["Gender"], df["ClaimFrequency"])
chi2_g, p_g, dof_g, _ = stats.chi2_contingency(gender_table)
print("Chi-square for ClaimFrequency by Gender:")
print(f"chi2={chi2_g:.3f}, p={p_g:.6f}, dof={dof_g}")
gender_table.to_csv(OUT_DIR / "tables/gender_claimfreq_contingency.csv")

# Severity: t-test for those with claims
male_sev = df[(df["Gender"] == "Male") & (df["ClaimFrequency"] == 1)]["ClaimSeverity"]
female_sev = df[(df["Gender"] == "Female") & (df["ClaimFrequency"] == 1)]["ClaimSeverity"]
sev_test = two_sample_ttest(male_sev, female_sev)
print("ClaimSeverity t-test (Male vs Female):")
print(f"n_male={sev_test['n_a']}, n_female={sev_test['n_b']}, t={sev_test['stat']}, p={sev_test['pvalue']}")

# Save summary
pd.DataFrame([{"chi2": chi2_g, "p_value": p_g, "severity_t": sev_test['stat'], "severity_p": sev_test['pvalue']}]).to_csv(OUT_DIR / "tables/gender_tests_summary.csv", index=False)


Chi-square for ClaimFrequency by Gender:
chi2=0.261, p=0.877761, dof=2
ClaimSeverity t-test (Male vs Female):
n_male=42817, n_female=6755, t=-82.18272324521742, p=0.0
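The KPI summary reports a standard deviation of about 2e-19 for `ClaimSeverity`, meaning the values are effectively constant. A t statistic computed from such data may be dominated by floating-point noise, so the result above should not be read as evidence of a gender effect on severity. One way to guard against this before testing is sketched below; `is_effectively_constant` and its `rel_tol` default are hypothetical choices.

```python
import numpy as np

def is_effectively_constant(values, rel_tol: float = 1e-9) -> bool:
    """True when the spread of values is negligible relative to their magnitude."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]
    if arr.size == 0:
        return True                      # nothing to test
    spread = arr.max() - arr.min()
    scale = max(abs(arr.max()), abs(arr.min()), 1e-300)
    return bool(spread <= rel_tol * scale)

constant_like = np.full(1000, 0.001403)  # mimics the ClaimSeverity column
varied = np.array([0.5, 1.5, 2.5])
print(is_effectively_constant(constant_like), is_effectively_constant(varied))
```

A call such as `is_effectively_constant(male_sev)` before `two_sample_ttest` would let the notebook report "no variation" instead of a misleading statistic.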




### 6. Interpretation and Business-friendly summary
The code below prints a plain-language interpretation of each test at the conventional significance level of 0.05.


In [15]:
# Cell: Interpretation helper
def interpret_test(name, pvalue, alpha=0.05):
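    if pd.isna(pvalue):
        # NaN also arises when the statistic is undefined (e.g. zero pooled variance),
        # not only when a group is too small.
        return f"{name}: Insufficient data to test."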
    if pvalue < alpha:
        return f"{name}: **Reject H0** (p={pvalue:.4f}) — there is a statistically significant difference."
    else:
        return f"{name}: **Fail to reject H0** (p={pvalue:.4f}) — no evidence of a difference."

# Interpret province chi2
print(interpret_test("Provinces (Claim Frequency - chi2)", prov_result["pvalue"]))

# Interpret pairwise province z-test
print(interpret_test(f"Province pair {p1} vs {p2} (Claim Frequency - two-prop z)", prop_test['pvalue']))

# Interpret zip frequency
print(interpret_test(f"Zip {zip_a} vs {zip_b} (Claim Frequency - two-prop z)", zip_prop_test['pvalue']))

# Interpret zip margin
print(interpret_test(f"Zip {zip_a} vs {zip_b} (Margin - t-test)", margin_test['pvalue']))

# Interpret gender tests
print(interpret_test("Gender (Claim Frequency - chi2)", p_g))
print(interpret_test("Gender (Claim Severity - t-test)", sev_test['pvalue']))

# Save a short report file
report_lines = [
    interpret_test("Provinces (Claim Frequency - chi2)", prov_result["pvalue"]),
    interpret_test(f"Province pair {p1} vs {p2} (Claim Frequency - two-prop z)", prop_test['pvalue']),
    interpret_test(f"Zip {zip_a} vs {zip_b} (Claim Frequency - two-prop z)", zip_prop_test['pvalue']),
    interpret_test(f"Zip {zip_a} vs {zip_b} (Margin - t-test)", margin_test['pvalue']),
    interpret_test("Gender (Claim Frequency - chi2)", p_g),
    interpret_test("Gender (Claim Severity - t-test)", sev_test['pvalue']),
]
with open(OUT_DIR / "tables/summary_interpretation.txt", "w") as f:
    f.write("\n".join(report_lines))
print("\nSummary written to:", OUT_DIR / "tables/summary_interpretation.txt")


Provinces (Claim Frequency - chi2): **Fail to reject H0** (p=0.0995) — no evidence of a difference.
Province pair Gauteng vs Western Cape (Claim Frequency - two-prop z): **Fail to reject H0** (p=0.9074) — no evidence of a difference.
Zip 2000 vs 122 (Claim Frequency - two-prop z): Insufficient data to test.
Zip 2000 vs 122 (Margin - t-test): **Reject H0** (p=0.0000) — there is a statistically significant difference.
Gender (Claim Frequency - chi2): **Fail to reject H0** (p=0.8778) — no evidence of a difference.
Gender (Claim Severity - t-test): **Reject H0** (p=0.0000) — there is a statistically significant difference.

Summary written to: outputs\task3\tables\summary_interpretation.txt


### 7. Recommended next steps (to include in your Task 3 report)
- If province-level differences are significant, consider estimating effect sizes (difference in claim rates) and adding province as a feature in segmentation or pricing rules.
- If ZIP-level differences are significant, investigate local factors (socioeconomic, traffic) and consider zip-level premium adjustments.
- For any tested group with significant margin differences, compute expected revenue impact of a small premium change.
- For gender or other sensitive attributes: if the test shows no difference, **do not** use the attribute for pricing (ethical/legal risk).
- Apply a multiple-testing correction (for example Bonferroni) when running many pairwise tests, since the chance of at least one false positive grows with the number of tests.
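The multiple-testing point can be illustrated with the p-values reported in this notebook. The zip-code frequency test is left out because it returned NaN. This is a minimal Bonferroni sketch in plain Python; the dictionary keys are labels chosen here for illustration.

```python
# p-values copied from the outputs above; the NaN zip-frequency test is omitted.
p_values = {
    "province_chi2": 0.099455,
    "province_pair_z": 0.907367,
    "zip_margin_t": 6.467145608365507e-21,
    "gender_chi2": 0.877761,
    "gender_severity_t": 0.0,
}

alpha = 0.05
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {name: min(p * m, 1.0) for name, p in p_values.items()}
rejected = [name for name, p_adj in adjusted.items() if p_adj < alpha]

for name, p_adj in adjusted.items():
    print(f"{name}: adjusted p = {p_adj:.4g}")
print("Rejected after correction:", rejected)
```

Bonferroni is conservative; with only five tests the conclusions here do not change, but the gap between raw and adjusted p-values widens quickly as more pairwise comparisons are added.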


In [None]:
# Cell (shell): Git workflow - run in a terminal or as a notebook shell cell
# Create a branch for this work, add notebook, push, and create a PR on GitHub
!git checkout -b task-3
!git add notebooks/task_3.ipynb
!git commit -m "task-3: add hypothesis testing notebook and results"
!git push -u origin task-3
# (Then create a Pull Request on GitHub)


### End of notebook
You can now collect the figures and tables from `outputs/task3/figures` and `outputs/task3/tables` to paste into your Task 3 report.
