# UIDAI Biometric Update Analysis - Statistical Hypothesis Testing

**UIDAI Data Hackathon 2026**  
**Project**: Age-Group-Wise Biometric Update Patterns

---

## Notebook Purpose

This notebook performs **Step 3: Statistical Hypothesis Testing and Anomaly Detection** to:
1. Validate findings from exploratory data analysis
2. Test statistical significance of age-quality relationships
3. Detect anomalous biometric quality patterns
4. Provide evidence-based conclusions for governance recommendations

---

## Statistical Tests Performed

### 1. **Chi-Square Test**
- **Question**: Are quality categories independent of age groups?
- **Purpose**: Test if quality distribution differs across ages

### 2. **One-Way ANOVA**
- **Question**: Do mean quality scores differ across age groups?
- **Purpose**: Test if average quality varies by age

### 3. **Kruskal-Wallis Test**
- **Question**: Do quality distributions differ (non-parametric)?
- **Purpose**: Robust alternative to ANOVA

### 4. **Isolation Forest**
- **Question**: Which records have unusual age-quality combinations?
- **Purpose**: ML-based anomaly detection

### 5. **Statistical Anomaly Detection**
- **Question**: Which records deviate from their age group's pattern?
- **Purpose**: Identify exceptional cases

---

## Significance Level

**α = 0.05** (5% significance level)
- If p-value < 0.05: Reject H₀; result is statistically significant
- If p-value ≥ 0.05: Fail to reject H₀; result is not statistically significant (this does not prove H₀)

## 1. Setup and Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Import custom modules
import sys
sys.path.append('..')

from scripts.statistical_tests import (
    chi_square_test_quality_by_age,
    anova_test_quality_by_age,
    kruskal_wallis_test,
    isolation_forest_anomaly_detection,
    detect_age_quality_anomalies,
    generate_statistical_report
)

# Configure settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

print("✓ All modules imported successfully")

In [None]:
# Load cleaned datasets
df_enrolment = pd.read_csv('../data/processed/enrolment_cleaned.csv')
df_updates = pd.read_csv('../data/processed/updates_cleaned.csv')

print(f"✓ Loaded enrolment data: {len(df_enrolment):,} records")
print(f"✓ Loaded update data: {len(df_updates):,} records")

# Quick preview
print("\nData columns:")
print(df_enrolment.columns.tolist())

## 2. Chi-Square Test: Quality Categories × Age Groups

### Test Logic:

**Null Hypothesis (H₀)**: Quality categories are independent of age groups (no relationship)  
**Alternative Hypothesis (H₁)**: Quality categories depend on age groups (relationship exists)

### How it works:
1. Creates a contingency table (age groups × quality categories)
2. Calculates expected frequencies if variables were independent
3. Compares observed vs expected frequencies using Chi-square statistic
4. If differences are large → p-value is small → reject H₀
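
A minimal sketch of the four steps above, using `scipy.stats.chi2_contingency` on made-up counts. The internals of `chi_square_test_quality_by_age` are not shown in this notebook, so this illustrates the test itself, not that function; the age-group labels and counts are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = age groups, columns = quality categories
observed = np.array([
    [30, 50, 20],   # e.g. children
    [20, 50, 30],   # e.g. adults
    [40, 40, 20],   # e.g. elderly
])

# SciPy computes the statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)

# Step 2 by hand: expected count = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Step 3 by hand: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print(f"manual chi2 = {chi2_manual:.3f}")
```

The manual statistic matches SciPy's because the Yates continuity correction only applies to 2×2 tables (one degree of freedom).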

### Interpretation:
- **p < 0.05**: Quality distribution differs significantly across age groups (an association, not proof of causation)
- **p ≥ 0.05**: No significant relationship found

### Governance Implication:
If significant → Age-specific enrollment protocols are justified

In [None]:
# Perform Chi-Square Test
chi2_results = chi_square_test_quality_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_category_column='Quality_Category',
    alpha=0.05
)

### Understanding the Results:

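**Chi-Square Statistic**: Measures how much observed frequencies differ from expected  
- Larger value = stronger evidence against independence (the value also grows with sample size, so it does not measure strength by itself)

**P-value**: Probability of observing a statistic at least this extreme if H₀ were true  
- Small p-value = strong evidence against H₀

**Cramér's V (Effect Size)**: Strength of association (0 to 1)  
- 0.1 = small effect
- 0.3 = medium effect
- 0.5+ = large effect
- These are rule-of-thumb benchmarks for tables whose smaller dimension has 2 levels; the conventional thresholds are lower for larger tables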
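
As a sketch of how this effect size is commonly computed, the helper below uses the standard formula V = sqrt(χ² / (n · (min(rows, cols) − 1))). The function name `cramers_v` and the example tables are hypothetical; whether `chi2_results['effect_size']` is computed the same way depends on `scripts.statistical_tests`, which is not shown here.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table of counts (hypothetical helper)."""
    table = np.asarray(table, dtype=float)
    # Uncorrected chi-square statistic (no Yates correction)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

small_sample = np.array([[30, 50, 20], [20, 50, 30], [40, 40, 20]])
large_sample = small_sample * 10   # same proportions, ten times the records

print(f"V (small sample) = {cramers_v(small_sample):.3f}")
print(f"V (large sample) = {cramers_v(large_sample):.3f}")
```

Scaling every count by ten multiplies the chi-square statistic by ten but leaves V unchanged, which is why the effect size is reported alongside the p-value.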

## 3. One-Way ANOVA: Mean Quality Scores Across Age Groups

### Test Logic:

**Null Hypothesis (H₀)**: All age groups have the same mean quality score  
**Alternative Hypothesis (H₁)**: At least one age group has different mean quality

### How it works:
1. Calculates variance **between** age groups (how different are group means?)
2. Calculates variance **within** age groups (how spread out is data within each group?)
3. Computes F-statistic = Between-group variance / Within-group variance
4. If F is large → group means differ significantly
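
The steps above can be sketched by computing the F-statistic by hand and comparing it with `scipy.stats.f_oneway`. The three groups of scores are invented for illustration; this is not the code behind `anova_test_quality_by_age`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical quality scores for three age groups
groups = [
    np.array([62.0, 65.0, 70.0, 68.0]),
    np.array([75.0, 78.0, 80.0, 74.0]),
    np.array([55.0, 60.0, 58.0, 52.0]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k = len(groups)        # number of groups
n = len(all_scores)    # total number of observations

# Step 1: between-group sum of squares (how far group means sit from the grand mean)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Step 2: within-group sum of squares (spread of scores around their own group mean)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Step 3: F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_value = f_oneway(*groups)
print(f"manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}, p = {p_value:.5f}")
```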

### Assumptions:
- **Normality**: Data in each group is normally distributed
- **Homogeneity of variance**: Equal variances across groups
- **Independence**: Observations are independent
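
One possible way to check the first two assumptions is sketched below with `scipy.stats.shapiro` (normality per group) and `scipy.stats.levene` (equal variances). The data is simulated and the group labels are hypothetical; the notebook's own module may or may not run such checks.

```python
import numpy as np
from scipy.stats import levene, shapiro

# Simulated quality scores per age group (illustrative only)
rng = np.random.default_rng(0)
groups = {
    "0-17": rng.normal(70, 8, 200),
    "18-59": rng.normal(75, 8, 200),
    "60+": rng.normal(62, 8, 200),
}

# Shapiro-Wilk per group: a small p-value suggests the group is not normal
normality_p = {}
for name, scores in groups.items():
    _, p = shapiro(scores)
    normality_p[name] = p
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Levene's test: a small p-value suggests the group variances are not equal
levene_stat, levene_p = levene(*groups.values())
print(f"Levene p = {levene_p:.3f}")
```

If either check fails, the Kruskal-Wallis test in the next section is the usual fallback.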

### Interpretation:
- **p < 0.05**: Age groups have significantly different mean quality
- **p ≥ 0.05**: No significant difference in means

### Governance Implication:
If significant → At least one age group differs in mean quality; a post-hoc comparison (e.g. Tukey HSD) is needed to identify which groups need quality improvement
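
A significant ANOVA only says that some mean differs, so a pairwise follow-up is a common next step. The sketch below uses Welch t-tests with a Bonferroni correction on simulated data; it is one of several possible post-hoc approaches and is not part of `scripts.statistical_tests`.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Simulated quality scores per age group (illustrative only)
rng = np.random.default_rng(1)
groups = {
    "0-17": rng.normal(72, 8, 100),
    "18-59": rng.normal(75, 8, 100),
    "60+": rng.normal(60, 8, 100),
}

pairs = list(combinations(groups, 2))
adjusted_p = {}
for a, b in pairs:
    # Welch's t-test does not assume equal variances
    _, p = ttest_ind(groups[a], groups[b], equal_var=False)
    # Bonferroni: multiply by the number of comparisons, cap at 1
    adjusted_p[(a, b)] = min(p * len(pairs), 1.0)

for (a, b), p_adj in adjusted_p.items():
    verdict = "differs" if p_adj < 0.05 else "no significant difference"
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f} ({verdict})")
```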

In [None]:
# Perform One-Way ANOVA
anova_results = anova_test_quality_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score',
    alpha=0.05
)

### Understanding the Results:

**F-Statistic**: Ratio of between-group to within-group variance  
- Larger value = stronger evidence of group differences

**P-value**: Probability of observing an F-statistic at least this large if all means were equal  
- Small p-value = strong evidence that means differ

**Eta-squared (η²)**: Proportion of variance explained by age groups  
- 0.01 = small effect (1% of variance)
- 0.06 = medium effect (6% of variance)
- 0.14+ = large effect (14%+ of variance)
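
Eta-squared is the between-group sum of squares divided by the total sum of squares. A minimal sketch with pandas follows; the helper name `eta_squared` and the toy data are hypothetical, and the notebook's `anova_results['eta_squared']` may be computed differently.

```python
import pandas as pd

def eta_squared(df, group_col, value_col):
    """Share of total variance explained by group membership (hypothetical helper)."""
    grand_mean = df[value_col].mean()
    ss_total = ((df[value_col] - grand_mean) ** 2).sum()
    stats = df.groupby(group_col)[value_col].agg(["mean", "count"])
    ss_between = (stats["count"] * (stats["mean"] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

toy = pd.DataFrame({
    "Age_Group": ["0-17"] * 4 + ["18-59"] * 4 + ["60+"] * 4,
    "Biometric_Quality_Score": [62, 65, 70, 68, 75, 78, 80, 74, 55, 60, 58, 52],
})
print(f"eta squared = {eta_squared(toy, 'Age_Group', 'Biometric_Quality_Score'):.3f}")
```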

## 4. Kruskal-Wallis Test: Non-Parametric Alternative

### Test Logic:

**Null Hypothesis (H₀)**: All age groups have the same distribution  
**Alternative Hypothesis (H₁)**: At least one group has different distribution

### How it works:
1. Ranks all quality scores from lowest to highest (ignoring age groups)
2. Calculates average rank for each age group
3. If groups have similar distributions → average ranks should be similar
4. Computes H-statistic based on rank differences
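
The ranking procedure above can be sketched as follows, with the H-statistic computed from rank sums and checked against `scipy.stats.kruskal`. The scores are invented and contain no ties, so no tie correction is needed; this is an illustration, not the body of `kruskal_wallis_test`.

```python
import numpy as np
from scipy.stats import kruskal, rankdata

# Hypothetical quality scores for three age groups (no tied values)
groups = [
    np.array([62.0, 65.0, 70.0, 68.0]),
    np.array([75.0, 78.0, 80.0, 74.0]),
    np.array([55.0, 60.0, 58.0, 52.0]),
]

# Step 1: rank all scores together, ignoring group membership
all_scores = np.concatenate(groups)
ranks = rankdata(all_scores)
n = len(all_scores)

# Step 2: split the ranks back into their groups
boundaries = np.cumsum([len(g) for g in groups])[:-1]
rank_groups = np.split(ranks, boundaries)
mean_ranks = [r.mean() for r in rank_groups]

# Step 4: H = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1)
h_manual = 12 / (n * (n + 1)) * sum(r.sum() ** 2 / len(r) for r in rank_groups) - 3 * (n + 1)

h_scipy, p_value = kruskal(*groups)
print(f"mean ranks = {mean_ranks}")
print(f"manual H = {h_manual:.3f}, scipy H = {h_scipy:.3f}, p = {p_value:.4f}")
```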

### Advantages over ANOVA:
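- **No normality assumption**: Does not require normally distributed scores (reading the result as a difference in medians still assumes similarly shaped distributions)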
- **Robust to outliers**: Uses ranks instead of raw values
- **Works with ordinal data**: Doesn't require interval scale

### When to use:
- When ANOVA assumptions are violated
- When data has outliers
- When you want a result that does not rest on ANOVA's distributional assumptions

### Interpretation:
- **p < 0.05**: Distributions differ significantly
- **p ≥ 0.05**: No significant difference

### Governance Implication:
If both ANOVA and Kruskal-Wallis are significant → The finding does not depend on ANOVA's normality assumption; check the effect size before acting on it

In [None]:
# Perform Kruskal-Wallis Test
kruskal_results = kruskal_wallis_test(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score',
    alpha=0.05
)

### Understanding the Results:

**H-Statistic**: Measures differences in rank distributions  
- Larger value = stronger evidence of differences

**P-value**: Probability of observing an H-statistic at least this large if distributions were equal  

**Comparison with ANOVA**:
- If both significant → Strong evidence (robust to assumptions)
- If only ANOVA significant → Check for outliers/non-normality
- If only Kruskal significant → May have distribution differences not captured by means

## 5. Isolation Forest: ML-Based Anomaly Detection

### Algorithm Logic:

**How it works**:
1. Builds an ensemble of random trees, each splitting on a randomly chosen feature at a random threshold
2. **Anomalies** are easier to isolate (require fewer splits)
3. **Normal points** require more splits to isolate
4. Assigns anomaly score based on average path length

### Intuition:
- Imagine trying to find a specific person in a crowd
- Someone standing alone (anomaly) is easy to find
- Someone in the middle of the crowd (normal) takes more effort

### Parameters:
- **Contamination**: Expected proportion of anomalies (set to 0.05, i.e. 5%, in this notebook); this fixes the share of records that get flagged
- **Features**: Age, Quality Score, etc.
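
A minimal sketch of the algorithm with scikit-learn's `IsolationForest`, run on simulated data with a few planted outliers. The column names mirror the ones used in this notebook, but the data and the extra `Anomaly_Score` column are assumptions, and `isolation_forest_anomaly_detection` may wrap the model differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Simulated enrolment records (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Age": rng.integers(5, 90, size=500),
    "Biometric_Quality_Score": rng.normal(70, 10, size=500),
})
# Plant a few implausible scores so there is something to find
df.loc[:4, "Biometric_Quality_Score"] = [5.0, 3.0, 150.0, 160.0, 2.0]

features = ["Age", "Biometric_Quality_Score"]
model = IsolationForest(contamination=0.05, random_state=42)

# fit_predict labels inliers as 1 and anomalies as -1
df["Anomaly"] = model.fit_predict(df[features])
# decision_function: lower (negative) values mean easier to isolate
df["Anomaly_Score"] = model.decision_function(df[features])

n_anomalies = int((df["Anomaly"] == -1).sum())
print(f"Flagged {n_anomalies} of {len(df)} records ({n_anomalies / len(df):.1%})")
```

Because `contamination` sets the score threshold, about 5% of records are flagged regardless of how many true anomalies exist.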

### What it detects:
- Unusually low quality for young age
- Unusually high quality for elderly
- Data entry errors
- Exceptional enrollment centers

### Governance Implication:
- Investigate anomalies for root causes
- Learn from positive anomalies (best practices)
- Fix negative anomalies (quality issues)

In [None]:
# Perform Isolation Forest Anomaly Detection
df_with_anomalies, iso_forest_stats = isolation_forest_anomaly_detection(
    df_enrolment,
    features=['Age', 'Biometric_Quality_Score'],
    contamination=0.05,
    random_state=42
)

In [None]:
# Visualize anomalies
if 'Anomaly' in df_with_anomalies.columns:
    plt.figure(figsize=(12, 6))
    
    # Split records into normal points (label 1) and anomalies (label -1)
    normal = df_with_anomalies[df_with_anomalies['Anomaly'] == 1]
    anomalies = df_with_anomalies[df_with_anomalies['Anomaly'] == -1]
    
    plt.scatter(normal['Age'], normal['Biometric_Quality_Score'], 
                c='blue', alpha=0.3, s=20, label='Normal')
    plt.scatter(anomalies['Age'], anomalies['Biometric_Quality_Score'], 
                c='red', alpha=0.8, s=50, marker='x', label='Anomaly')
    
    plt.title('Isolation Forest: Anomaly Detection in Age-Quality Space', 
              fontsize=14, fontweight='bold')
    plt.xlabel('Age', fontsize=12, fontweight='bold')
    plt.ylabel('Biometric Quality Score', fontsize=12, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    
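    # Save figure (create the output directory if it does not exist yet)
    Path('../outputs/figures').mkdir(parents=True, exist_ok=True)
    plt.savefig('../outputs/figures/09_isolation_forest_anomalies.png', dpi=300, bbox_inches='tight')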
    plt.show()
    
    print("✓ Saved: outputs/figures/09_isolation_forest_anomalies.png")

## 6. Statistical Anomaly Detection: Z-Score Method

### Algorithm Logic:

**How it works**:
1. For each age group, calculate mean and standard deviation of quality scores
2. For each record, calculate z-score: z = (score - mean) / std
3. Flag records with |z| > 2 (more than 2 std dev from mean)
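
The three steps above can be sketched with a pandas `groupby` and `transform`. The toy data is invented, the sample standard deviation (pandas default, `ddof=1`) is an assumption, and the column names `z_score`, `is_anomaly` and `anomaly_type` follow the ones this notebook reads later; the actual logic of `detect_age_quality_anomalies` is not shown here.

```python
import numpy as np
import pandas as pd

# Toy data: each age group has one clearly unusual score (40 and 95)
df = pd.DataFrame({
    "Age_Group": ["18-59"] * 10 + ["60+"] * 10,
    "Biometric_Quality_Score": [80, 82, 79, 81, 80, 78, 83, 80, 81, 40,
                                60, 62, 59, 61, 60, 58, 63, 60, 61, 95],
})

# Step 1: mean and standard deviation per age group, aligned to each row
grouped = df.groupby("Age_Group")["Biometric_Quality_Score"]
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")

# Step 2: z-score of each record relative to its own age group
df["z_score"] = (df["Biometric_Quality_Score"] - group_mean) / group_std

# Step 3: flag records more than 2 standard deviations from the group mean
threshold = 2.0
df["is_anomaly"] = df["z_score"].abs() > threshold
df["anomaly_type"] = np.select(
    [df["z_score"] > threshold, df["z_score"] < -threshold],
    ["Unusually High Quality", "Unusually Low Quality"],
    default="Normal",
)

print(df[df["is_anomaly"]])
```

Note that an extreme score inflates its group's standard deviation, so in very small groups a single outlier can fail to reach |z| > 2.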

### Intuition:
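- In a normal distribution, ~95% of data falls within ±2 std dev, so about 5% of records exceed the threshold even when nothing is wrong
- Records outside this range are unusual for their age group and are candidates for review, not confirmed errors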

### What it detects:
- **Positive anomalies** (z > 2): Better quality than expected for age
  - Example: 70-year-old with excellent biometric quality
  - Action: Investigate what made this enrollment successful

- **Negative anomalies** (z < -2): Worse quality than expected for age
  - Example: 25-year-old with poor biometric quality
  - Action: Investigate enrollment center or operator issues

### Governance Implication:
- Positive anomalies → Learn best practices
- Negative anomalies → Fix quality issues

In [None]:
# Perform Statistical Anomaly Detection
df_with_statistical_anomalies, stat_anomaly_stats = detect_age_quality_anomalies(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score',
    threshold_std=2.0
)

In [None]:
# Analyze anomalies by type
if 'anomaly_type' in df_with_statistical_anomalies.columns:
    print("\nAnomaly Type Distribution:")
    print(df_with_statistical_anomalies['anomaly_type'].value_counts())
    
    # Show examples of each type
    print("\n" + "="*80)
    print("POSITIVE ANOMALIES (Unusually High Quality)")
    print("="*80)
    high_anomalies = df_with_statistical_anomalies[
        df_with_statistical_anomalies['anomaly_type'] == 'Unusually High Quality'
    ]
    if len(high_anomalies) > 0:
        display_cols = ['Age', 'Age_Group', 'Biometric_Quality_Score', 'z_score']
        print(high_anomalies[display_cols].head(10))
        print("\n→ These cases exceeded expectations. Investigate for best practices.")
    
    print("\n" + "="*80)
    print("NEGATIVE ANOMALIES (Unusually Low Quality)")
    print("="*80)
    low_anomalies = df_with_statistical_anomalies[
        df_with_statistical_anomalies['anomaly_type'] == 'Unusually Low Quality'
    ]
    if len(low_anomalies) > 0:
        display_cols = ['Age', 'Age_Group', 'Biometric_Quality_Score', 'z_score']
        print(low_anomalies[display_cols].head(10))
        print("\n→ These cases underperformed. Investigate for quality issues.")

In [None]:
# Visualize statistical anomalies
if 'is_anomaly' in df_with_statistical_anomalies.columns:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Plot 1: Scatter plot with anomalies highlighted
    ax1 = axes[0]
    normal = df_with_statistical_anomalies[~df_with_statistical_anomalies['is_anomaly']]
    anomalies = df_with_statistical_anomalies[df_with_statistical_anomalies['is_anomaly']]
    
    ax1.scatter(normal['Age'], normal['Biometric_Quality_Score'], 
                c='blue', alpha=0.3, s=20, label='Normal')
    ax1.scatter(anomalies['Age'], anomalies['Biometric_Quality_Score'], 
                c='red', alpha=0.8, s=50, marker='x', label='Anomaly')
    ax1.set_title('Statistical Anomaly Detection (Z-Score Method)', fontweight='bold')
    ax1.set_xlabel('Age', fontweight='bold')
    ax1.set_ylabel('Biometric Quality Score', fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Anomaly distribution by age group
    ax2 = axes[1]
    anomaly_by_age = df_with_statistical_anomalies[df_with_statistical_anomalies['is_anomaly']].groupby('Age_Group').size()
    anomaly_by_age.plot(kind='bar', ax=ax2, color='coral', edgecolor='black')
    ax2.set_title('Anomaly Count by Age Group', fontweight='bold')
    ax2.set_xlabel('Age Group', fontweight='bold')
    ax2.set_ylabel('Number of Anomalies', fontweight='bold')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
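    # Create the output directory if it does not exist yet, then save
    Path('../outputs/figures').mkdir(parents=True, exist_ok=True)
    plt.savefig('../outputs/figures/10_statistical_anomalies.png', dpi=300, bbox_inches='tight')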
    plt.show()
    
    print("✓ Saved: outputs/figures/10_statistical_anomalies.png")

## 7. Comprehensive Statistical Report

This section consolidates all test results into a single evidence-based report.

In [None]:
# Generate comprehensive statistical report
statistical_report = generate_statistical_report(
    chi2_results,
    anova_results,
    kruskal_results,
    stat_anomaly_stats
)

print(statistical_report)

## 8. Save Statistical Results

In [None]:
# Save test results to CSV
Path('../outputs/tables').mkdir(parents=True, exist_ok=True)

# Save test summary
test_summary = pd.DataFrame([
    {
        'Test': 'Chi-Square',
        'Statistic': chi2_results['chi2_statistic'],
        'P-value': chi2_results['p_value'],
        'Significant': chi2_results['is_significant'],
        'Effect_Size': chi2_results['effect_size']
    },
    {
        'Test': 'ANOVA',
        'Statistic': anova_results['f_statistic'],
        'P-value': anova_results['p_value'],
        'Significant': anova_results['is_significant'],
        'Effect_Size': anova_results['eta_squared']
    },
    {
        'Test': 'Kruskal-Wallis',
        'Statistic': kruskal_results['h_statistic'],
        'P-value': kruskal_results['p_value'],
        'Significant': kruskal_results['is_significant'],
        'Effect_Size': None
    }
])

test_summary.to_csv('../outputs/tables/statistical_test_summary.csv', index=False)
print("✓ Saved: outputs/tables/statistical_test_summary.csv")

# Save anomaly statistics
anomaly_summary = pd.DataFrame([
    {
        'Method': 'Isolation Forest',
        'Total_Anomalies': iso_forest_stats.get('n_anomalies', 0),
        'Anomaly_Rate': iso_forest_stats.get('anomaly_rate', 0)
    },
    {
        'Method': 'Statistical (Z-Score)',
        'Total_Anomalies': stat_anomaly_stats.get('n_anomalies', 0),
        'Anomaly_Rate': stat_anomaly_stats.get('anomaly_rate', 0)
    }
])

anomaly_summary.to_csv('../outputs/tables/anomaly_detection_summary.csv', index=False)
print("✓ Saved: outputs/tables/anomaly_detection_summary.csv")

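# Save statistical report as text file (create the directory if it does not exist yet)
report_dir = Path('../outputs/report')
report_dir.mkdir(parents=True, exist_ok=True)
with open(report_dir / 'statistical_report.txt', 'w', encoding='utf-8') as f:
    f.write(statistical_report)
print("✓ Saved: outputs/report/statistical_report.txt")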

# Save anomalies for further investigation
if 'is_anomaly' in df_with_statistical_anomalies.columns:
    anomalies_only = df_with_statistical_anomalies[df_with_statistical_anomalies['is_anomaly']]
    anomalies_only.to_csv('../outputs/tables/detected_anomalies.csv', index=False)
    print(f"✓ Saved: outputs/tables/detected_anomalies.csv ({len(anomalies_only):,} anomalies)")

## 9. Key Findings and Conclusions

In [None]:
print("="*80)
print("STATISTICAL ANALYSIS - KEY FINDINGS")
print("="*80)

print("\n1. HYPOTHESIS TESTING RESULTS")
print("-" * 80)
print(f"   Chi-Square Test: {'SIGNIFICANT' if chi2_results['is_significant'] else 'NOT SIGNIFICANT'}")
print(f"   → Quality distribution {'DOES' if chi2_results['is_significant'] else 'DOES NOT'} vary by age")

print(f"\n   ANOVA Test: {'SIGNIFICANT' if anova_results['is_significant'] else 'NOT SIGNIFICANT'}")
print(f"   → Mean quality scores {'DO' if anova_results['is_significant'] else 'DO NOT'} differ by age")

print(f"\n   Kruskal-Wallis Test: {'SIGNIFICANT' if kruskal_results['is_significant'] else 'NOT SIGNIFICANT'}")
print(f"   → Result {'CONFIRMS' if kruskal_results['is_significant'] else 'DOES NOT CONFIRM'} age-quality relationship (robust)")

print("\n2. ANOMALY DETECTION RESULTS")
print("-" * 80)
print(f"   Isolation Forest: {iso_forest_stats.get('n_anomalies', 0):,} anomalies ({iso_forest_stats.get('anomaly_rate', 0):.2f}%)")
print(f"   Statistical Method: {stat_anomaly_stats.get('n_anomalies', 0):,} anomalies ({stat_anomaly_stats.get('anomaly_rate', 0):.2f}%)")
print(f"   → Unusually High Quality: {stat_anomaly_stats.get('n_unusually_high', 0):,}")
print(f"   → Unusually Low Quality: {stat_anomaly_stats.get('n_unusually_low', 0):,}")

print("\n3. GOVERNANCE RECOMMENDATIONS")
print("-" * 80)

# Collect the three test outcomes so the all / some / none cases can be told apart
significance_flags = [chi2_results['is_significant'],
                      anova_results['is_significant'],
                      kruskal_results['is_significant']]
all_significant = all(significance_flags)
any_significant = any(significance_flags)

if all_significant:
    print("   ✓ STRONG STATISTICAL EVIDENCE: Biometric quality differs significantly across age groups")
    print("\n   Recommended Actions:")
    print("   1. Implement age-specific enrollment protocols")
    print("   2. Deploy specialized biometric devices for elderly populations")
    print("   3. Provide assisted enrollment for vulnerable age groups")
    print("   4. Investigate positive anomalies to identify best practices")
    print("   5. Address negative anomalies to improve enrollment quality")
    print("   6. Prioritize re-enrollment campaigns for low-quality age groups")
elif any_significant:
    print("   ⚠ MIXED EVIDENCE: Further investigation recommended")
    print("   → Some tests show significance, others don't")
    print("   → Consider additional data collection or analysis")
else:
    print("   ✗ NO SIGNIFICANT EVIDENCE: None of the tests found an age-quality relationship")
    print("   → Age-specific protocols are not supported by this data")
    print("   → Consider additional data collection or analysis")

print("\n" + "="*80)
print("STATISTICAL ANALYSIS COMPLETE")
print("="*80)
print("\n✓ All test results saved to outputs/tables/")
print("✓ All visualizations saved to outputs/figures/")
print("✓ Statistical report saved to outputs/report/")
print("\n→ Ready for Step 4: Advanced Visualization")
print("→ Ready for Step 5: Insight Extraction and Final Report")

---

## Summary of Statistical Evidence

This notebook provided:

### ✅ Hypothesis Testing
- **Chi-Square Test**: Tests independence of quality categories and age groups
- **ANOVA**: Tests if mean quality differs across age groups
- **Kruskal-Wallis**: Non-parametric confirmation of age-quality relationship

### ✅ Anomaly Detection
- **Isolation Forest**: ML-based detection of unusual age-quality combinations
- **Z-Score Method**: Statistical detection of outliers within age groups

### ✅ Evidence-Based Conclusions
- Quantitative evidence on age-quality relationships
- Identification of exceptional cases for investigation
- Statistical justification for governance recommendations

---

**UIDAI Data Hackathon 2026** | Backend Analytics Project