# UIDAI Biometric Update Analysis - Exploratory Data Analysis

**UIDAI Data Hackathon 2026**  
**Project**: Age-Group-Wise Biometric Update Patterns

---

## Notebook Purpose

This notebook performs **Step 2: Exploratory Data Analysis (EDA)** focusing on:
1. Age group distribution analysis
2. Biometric quality patterns by age
3. Update frequency and types by age group
4. Statistical relationships and correlations
5. Identification of vulnerable demographics

---

## Research Question

**Which age groups face the greatest biometric quality challenges, and what does this reveal about service stress and re-enrollment needs?**

---

## Expected Insights
- Elderly populations may have lower biometric quality (manual labor, aging)
- Young children may have quality issues (small fingers, growth)
- Certain age groups may require more frequent updates
- Update types may vary by age (biometric vs demographic)

## 1. Setup and Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Import custom modules
import sys
sys.path.append('..')

from scripts.analyzer import (
    analyze_age_distribution,
    analyze_biometric_quality_by_age,
    analyze_quality_categories_by_age,
    analyze_update_patterns_by_age,
    analyze_update_types_by_age,
    calculate_correlation_matrix,
    generate_summary_statistics,
    identify_outliers
)

from scripts.visualizer import (
    plot_age_distribution,
    plot_quality_by_age,
    plot_quality_categories_by_age,
    plot_mean_quality_by_age,
    plot_update_rates_by_age,
    plot_update_types_heatmap,
    plot_correlation_heatmap,
    create_multi_panel_summary
)

# Configure settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All modules imported successfully")

In [None]:
# Load cleaned datasets from Step 1
df_enrolment = pd.read_csv('../data/processed/enrolment_cleaned.csv')
df_updates = pd.read_csv('../data/processed/updates_cleaned.csv')

# Convert date columns
df_enrolment['Enrolment_Date'] = pd.to_datetime(df_enrolment['Enrolment_Date'])
df_updates['Update_Date'] = pd.to_datetime(df_updates['Update_Date'])

print(f"✓ Loaded enrolment data: {len(df_enrolment):,} records")
print(f"✓ Loaded update data: {len(df_updates):,} records")

# Quick preview
print("\nEnrolment Data Preview:")
display(df_enrolment.head())

print("\nUpdate Data Preview:")
display(df_updates.head())
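
The later cells assume that certain columns exist in the cleaned enrolment file. A minimal sketch of a pre-flight check follows. `missing_columns` is a hypothetical helper (it is not part of `scripts.analyzer`), the column list is collected from the calls made in this notebook, and the demo frame is synthetic.

```python
import pandas as pd

# Columns referenced by later cells in this notebook (assumed to be the full set).
REQUIRED_ENROLMENT_COLUMNS = [
    "Enrolment_Date",
    "Age",
    "Age_Group",
    "Biometric_Quality_Score",
    "Quality_Category",
    "Enrolment_Year",
]


def missing_columns(df: pd.DataFrame, required: list) -> list:
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Synthetic two-row frame for illustration only; not real UIDAI data.
demo = pd.DataFrame({"Age": [4, 35], "Age_Group": ["0-5", "19-40"]})
missing = missing_columns(demo, REQUIRED_ENROLMENT_COLUMNS)
print("Missing columns:", missing)
```

Running the same check on `df_enrolment` right after loading would surface a schema problem before any plot is produced.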

## 2. Age Group Distribution Analysis

**What we're analyzing**: How enrolments are distributed across age groups  
**Why it matters**: Identifies over/under-represented demographics in the Aadhaar system

In [None]:
# Analyze age distribution
age_dist_stats = analyze_age_distribution(df_enrolment, age_group_column='Age_Group')

In [None]:
# Visualize age distribution
plot_age_distribution(
    df_enrolment, 
    age_group_column='Age_Group',
    save_path='../outputs/figures/01_age_distribution.png'
)

### 📊 Interpretation

**What the chart shows**:
- The number of Aadhaar enrolments in each age category
- Largest group represents the demographic with highest enrollment
- Smallest group may indicate underserved population or smaller demographic size

**Governance implications**:
- A large young adult (19-40) group would suggest that working-age residents enroll mainly for employment and banking access
- A small elderly (60+) group may indicate accessibility challenges, or may simply reflect the smaller share of elderly people in the population
- Child enrollment rates reflect birth registration integration
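
The distribution described above boils down to a count and a percentage share per age group. The sketch below shows one way to compute it with pandas. It is not the actual `analyze_age_distribution` implementation (which is not shown in this notebook); `age_group_shares` is a hypothetical name and the data is synthetic.

```python
import pandas as pd


def age_group_shares(df: pd.DataFrame, age_group_column: str = "Age_Group") -> pd.DataFrame:
    """Count records per age group and express each count as a percentage."""
    counts = df[age_group_column].value_counts()  # sorted largest first
    return pd.DataFrame({
        age_group_column: counts.index,
        "Count": counts.values,
        "Percent": (counts.values / counts.sum() * 100).round(1),
    })


# Synthetic example: 10 enrolments spread over three age groups.
demo = pd.DataFrame({"Age_Group": ["19-40"] * 5 + ["0-5"] * 3 + ["60+"] * 2})
shares = age_group_shares(demo)
print(shares.to_string(index=False))
```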

## 3. Biometric Quality Analysis by Age Group

**What we're analyzing**: How biometric quality scores vary across age groups  
**Why it matters**: Identifies which demographics face biometric capture challenges

In [None]:
# Analyze biometric quality by age
quality_by_age_stats = analyze_biometric_quality_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score'
)

In [None]:
# Visualize quality distribution by age (box plot)
plot_quality_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score',
    save_path='../outputs/figures/02_quality_boxplot_by_age.png'
)

In [None]:
# Visualize mean quality trend
plot_mean_quality_by_age(
    quality_by_age_stats,
    age_group_column='Age_Group',
    save_path='../outputs/figures/03_mean_quality_trend.png'
)

### 📊 Interpretation

**What the charts show**:
- **Box plot**: Distribution of quality scores (median, quartiles, outliers) for each age group
- **Line plot**: Average quality score trend across age groups
- Red dashed line (60): Threshold between Fair and Good quality

**Expected patterns**:
- **Children (0-5)**: May have lower quality due to small finger size
- **Young adults (19-40)**: Typically highest quality (healthy, clear biometrics)
- **Elderly (60+)**: Lower quality due to:
  - Manual labor (worn fingerprints)
  - Age-related skin changes
  - Health conditions (diabetes affecting fingerprints)

**Governance implications**:
- Age groups with mean quality < 60 need special attention
- High variability (large box) indicates inconsistent capture quality
- May require age-specific enrollment protocols
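
The two checks named above (mean quality below 60, and a wide box) can be expressed as a per-group summary. This is a sketch under assumptions: `quality_summary` is a hypothetical helper, the interquartile range stands in for "box size", and the scores are synthetic.

```python
import pandas as pd

GOOD_THRESHOLD = 60  # Fair/Good boundary used in the charts above


def quality_summary(
    df: pd.DataFrame,
    age_group_column: str = "Age_Group",
    quality_column: str = "Biometric_Quality_Score",
) -> pd.DataFrame:
    """Mean, median and IQR of quality per age group, plus a below-threshold flag."""
    grouped = df.groupby(age_group_column)[quality_column]
    summary = grouped.agg(Mean="mean", Median="median").reset_index()
    # IQR is the height of the box in a box plot; groups are in the same sorted order.
    summary["IQR"] = (grouped.quantile(0.75) - grouped.quantile(0.25)).values
    summary["Below_Threshold"] = summary["Mean"] < GOOD_THRESHOLD
    return summary


# Synthetic scores for illustration only.
demo = pd.DataFrame({
    "Age_Group": ["0-5"] * 3 + ["19-40"] * 3 + ["60+"] * 3,
    "Biometric_Quality_Score": [50, 55, 60, 70, 80, 90, 40, 50, 60],
})
summary = quality_summary(demo)
print(summary.to_string(index=False))
```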

## 4. Quality Category Distribution by Age

**What we're analyzing**: Percentage of Poor/Fair/Good/Excellent quality within each age group  
**Why it matters**: Identifies which age groups need re-enrollment most urgently

In [None]:
# Analyze quality categories by age
quality_categories_by_age = analyze_quality_categories_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_category_column='Quality_Category'
)

In [None]:
# Visualize quality categories (stacked bar chart)
plot_quality_categories_by_age(
    df_enrolment,
    age_group_column='Age_Group',
    quality_category_column='Quality_Category',
    save_path='../outputs/figures/04_quality_categories_stacked.png'
)

### 📊 Interpretation

**What the chart shows**:
- Each bar = 100% of an age group
- Colors show percentage in each quality category
- Red/orange = Poor/Fair (likely need re-enrollment)
- Yellow/green = Good/Excellent (acceptable quality)

**Key metrics to watch**:
- **Poor quality %**: Direct re-enrollment need
- **Fair quality %**: May need re-enrollment if authentication fails
- **Combined Poor+Fair %**: Total at-risk population

**Governance implications**:
- Age groups with >30% Poor+Fair quality need targeted re-enrollment campaigns
- Elderly with high Poor% may need assisted enrollment or alternative authentication
- Children with quality issues may need age-specific biometric devices
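
The combined Poor+Fair share and the 30% cutoff mentioned above can be computed from a row-normalised cross-tabulation. The sketch below assumes the category labels are exactly `Poor`, `Fair`, `Good` and `Excellent`; `at_risk_share` is a hypothetical helper and the data is synthetic.

```python
import pandas as pd

AT_RISK_CUTOFF = 30.0  # percent, taken from the governance note above


def at_risk_share(
    df: pd.DataFrame,
    age_group_column: str = "Age_Group",
    quality_category_column: str = "Quality_Category",
) -> pd.DataFrame:
    """Percentage of Poor+Fair records per age group, flagged against the cutoff."""
    pct = pd.crosstab(
        df[age_group_column], df[quality_category_column], normalize="index"
    ) * 100
    # reindex guards against a category that never occurs in the data.
    at_risk = pct.reindex(columns=["Poor", "Fair"], fill_value=0.0).sum(axis=1)
    return pd.DataFrame({
        "Poor_Fair_Pct": at_risk.round(1),
        "Needs_Campaign": at_risk > AT_RISK_CUTOFF,
    })


# Synthetic example: five records per age group.
demo = pd.DataFrame({
    "Age_Group": ["19-40"] * 5 + ["60+"] * 5,
    "Quality_Category": ["Good", "Good", "Excellent", "Good", "Fair",
                         "Poor", "Poor", "Fair", "Good", "Good"],
})
risk = at_risk_share(demo)
print(risk)
```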

## 5. Update Pattern Analysis by Age Group

**What we're analyzing**: How frequently each age group updates their Aadhaar  
**Why it matters**: High update rates may reflect either frequent life changes or enrolment quality problems

In [None]:
# Analyze update patterns by age
update_patterns_by_age = analyze_update_patterns_by_age(
    df_updates,
    df_enrolment,
    age_group_column='Age_Group'
)

In [None]:
# Visualize update rates
plot_update_rates_by_age(
    update_patterns_by_age,
    save_path='../outputs/figures/05_update_rates_by_age.png'
)

### 📊 Interpretation

**What the chart shows**:
- Updates per 1,000 enrolments for each age group
- Higher bars = more frequent updates

**Expected patterns**:
- **Young adults (19-40)**: High update rate due to:
  - Marriage (name/address changes)
  - Migration for employment
  - Mobile number changes
- **Elderly (60+)**: May have high biometric update rate due to quality degradation
- **Children (0-5)**: Low demographic update rate (stable family environment), although biometric updates may rise with growth (see Section 6)

**Governance implications**:
- High update rates indicate:
  - Life event transitions (normal)
  - Enrollment quality issues (problematic)
  - Vulnerable populations (frequent movers)
- Need to distinguish between normal life events vs system quality issues
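
The metric in the chart, updates per 1,000 enrolments, is a simple ratio of two counts per age group. The sketch below assumes that the update table carries its own `Age_Group` column; if it does not, the updates would first have to be joined to enrolments on an ID column that this notebook does not name. `updates_per_thousand` is a hypothetical helper and the data is synthetic.

```python
import pandas as pd


def updates_per_thousand(
    df_updates: pd.DataFrame,
    df_enrolment: pd.DataFrame,
    age_group_column: str = "Age_Group",
) -> pd.Series:
    """Number of updates per 1,000 enrolments for each age group."""
    enrolments = df_enrolment[age_group_column].value_counts()
    updates = df_updates[age_group_column].value_counts()
    # Age groups with no updates get a count of 0 instead of NaN.
    updates = updates.reindex(enrolments.index, fill_value=0)
    rate = (updates / enrolments * 1000).round(1)
    return rate.rename("Updates_Per_1000").sort_index()


# Synthetic example: 8 enrolments, 4 updates.
demo_enrolment = pd.DataFrame({"Age_Group": ["19-40"] * 4 + ["60+"] * 2 + ["0-5"] * 2})
demo_updates = pd.DataFrame({"Age_Group": ["19-40"] * 3 + ["60+"] * 1})
rate = updates_per_thousand(demo_updates, demo_enrolment)
print(rate)
```

A rate computed this way cannot separate life-event updates from quality-driven ones; that distinction needs the update type breakdown in the next section.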

## 6. Update Type Analysis by Age Group

**What we're analyzing**: Which types of updates each age group makes most  
**Why it matters**: Reveals age-specific needs and challenges

In [None]:
# Analyze update types by age
update_types_by_age = analyze_update_types_by_age(
    df_updates,
    df_enrolment,
    age_group_column='Age_Group',
    update_type_column='Update_Type'
)

In [None]:
# Visualize update types heatmap
plot_update_types_heatmap(
    update_types_by_age,
    save_path='../outputs/figures/06_update_types_heatmap.png'
)

### 📊 Interpretation

**What the heatmap shows**:
- Rows = Age groups
- Columns = Update types
- Color intensity = Percentage of that update type within age group
- Darker red = More common

**Expected patterns**:
- **Biometric updates**: Higher in elderly (quality degradation) and children (growth)
- **Demographic updates**: Higher in young adults (marriage name changes)
- **Address updates**: Higher in working-age adults (migration)
- **Mobile updates**: Distributed across all ages (phone number churn)

**Governance implications**:
- High biometric update % confirms quality challenges in specific age groups
- Address update patterns reveal internal migration flows
- Update type distribution helps resource planning for update centers
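
The heatmap values are the percentage of each update type within an age group, and the most common type per group can be read off directly. The sketch below assumes `df_updates` has both `Age_Group` and `Update_Type` columns; `update_type_shares` is a hypothetical helper, and the type labels and rows are synthetic.

```python
import pandas as pd


def update_type_shares(
    df_updates: pd.DataFrame,
    age_group_column: str = "Age_Group",
    update_type_column: str = "Update_Type",
) -> pd.DataFrame:
    """Rows = age groups, columns = update types, values = percent within the row."""
    return pd.crosstab(
        df_updates[age_group_column],
        df_updates[update_type_column],
        normalize="index",
    ) * 100


# Synthetic example for illustration only.
demo_updates = pd.DataFrame({
    "Age_Group": ["19-40"] * 4 + ["60+"] * 3,
    "Update_Type": ["Address", "Address", "Address", "Mobile",
                    "Biometric", "Biometric", "Mobile"],
})
shares = update_type_shares(demo_updates)
dominant = shares.idxmax(axis=1)  # most common update type per age group
print(shares.round(1))
print(dominant)
```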

## 7. Statistical Correlation Analysis

**What we're analyzing**: Relationships between numerical variables  
**Why it matters**: Quantifies linear associations between variables (e.g., age ↔ quality)

In [None]:
# Calculate correlation matrix
corr_matrix = calculate_correlation_matrix(
    df_enrolment,
    columns=['Age', 'Biometric_Quality_Score', 'Enrolment_Year']
)

In [None]:
# Visualize correlation heatmap
if not corr_matrix.empty:
    plot_correlation_heatmap(
        corr_matrix,
        save_path='../outputs/figures/07_correlation_heatmap.png'
    )

### 📊 Interpretation

**What the heatmap shows**:
- Correlation coefficient ranges from -1 to +1
- **Positive (red)**: Variables increase together
- **Negative (blue)**: One increases, other decreases
- **Near zero (white)**: No linear relationship

**Key relationships to watch**:
- **Age ↔ Quality**: A negative correlation would indicate that quality tends to decrease with age
- |r| > 0.7: Strong linear relationship
- 0.3 < |r| < 0.7: Moderate linear relationship
- |r| < 0.3: Weak or no linear relationship

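**Statistical caveats**:
- Correlation quantifies the patterns observed in the visualizations; it does not establish causation
- Formal significance testing is deferred to Step 3, so these coefficients are supporting evidence for governance recommendations, not proof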
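
The strength bands above can be applied mechanically to a Pearson coefficient. The sketch below is illustrative only: `describe_strength` is a hypothetical helper, assigning the boundary values 0.3 and 0.7 to the higher band is a choice made here, and the six data points are synthetic.

```python
import pandas as pd


def describe_strength(r: float) -> str:
    """Label a correlation coefficient by the magnitude bands used in this section."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"


# Synthetic ages and quality scores for illustration only.
demo = pd.DataFrame({
    "Age": [5, 20, 35, 50, 65, 80],
    "Biometric_Quality_Score": [60, 85, 80, 70, 55, 45],
})
# Series.corr computes the Pearson coefficient by default.
r = demo["Age"].corr(demo["Biometric_Quality_Score"])
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.2f}: {describe_strength(r)} {direction} linear relationship")
```

In this synthetic sample quality peaks in early adulthood rather than falling steadily, which is the kind of non-linear shape a single Pearson coefficient understates.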

## 8. Outlier Detection

**What we're analyzing**: Unusual biometric quality scores  
**Why it matters**: Identifies data quality issues or exceptional cases

In [None]:
# Identify outliers in biometric quality scores
outliers, outlier_stats = identify_outliers(
    df_enrolment,
    column='Biometric_Quality_Score',
    method='iqr'
)

if len(outliers) > 0:
    print("\nOutlier Examples (first 10):")
    display(outliers[['Age', 'Age_Group', 'Biometric_Quality_Score', 'Quality_Category']].head(10))
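
For reference, the conventional IQR rule (Tukey's fences) flags values more than 1.5 × IQR below the first quartile or above the third. Whether `identify_outliers` uses exactly this multiplier is an assumption; `iqr_outliers` below is a hypothetical stand-in and the scores are synthetic.

```python
import pandas as pd


def iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5):
    """Return rows outside [Q1 - k*IQR, Q3 + k*IQR] and the fence values used."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (df[column] < lower) | (df[column] > upper)
    stats = {"lower": lower, "upper": upper, "count": int(mask.sum())}
    return df[mask], stats


# Synthetic scores with one unusually low value.
demo = pd.DataFrame({"Biometric_Quality_Score": [60, 62, 65, 67, 70, 72, 75, 5]})
demo_outliers, demo_stats = iqr_outliers(demo, "Biometric_Quality_Score")
print(demo_stats)
print(demo_outliers)
```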

## 9. Comprehensive Summary Dashboard

**What we're creating**: Single-page visual summary of all key findings  
**Why it matters**: Provides at-a-glance overview for report and presentation

In [None]:
# Create 4-panel summary dashboard
create_multi_panel_summary(
    df_enrolment,
    age_group_column='Age_Group',
    quality_column='Biometric_Quality_Score',
    quality_category_column='Quality_Category',
    save_path='../outputs/figures/08_summary_dashboard.png'
)

## 10. Key Findings Summary

Let's consolidate all the insights from this EDA:

In [None]:
print("="*80)
print("KEY FINDINGS FROM EXPLORATORY DATA ANALYSIS")
print("="*80)

print("\n1. AGE GROUP DISTRIBUTION")
print("-" * 80)
print(age_dist_stats.to_string(index=False))

print("\n2. BIOMETRIC QUALITY BY AGE GROUP")
print("-" * 80)
print(quality_by_age_stats.to_string(index=False))

print("\n3. UPDATE PATTERNS BY AGE GROUP")
print("-" * 80)
print(update_patterns_by_age.to_string(index=False))

print("\n4. QUALITY CATEGORIES DISTRIBUTION (%)")
print("-" * 80)
print(quality_categories_by_age.round(1))

print("\n" + "="*80)
print("ANALYSIS COMPLETE - READY FOR STATISTICAL TESTING (STEP 3)")
print("="*80)

## 11. Save Analysis Results

Export all statistical tables for use in the final report:

In [None]:
# Create output directory for tables
Path('../outputs/tables').mkdir(parents=True, exist_ok=True)

# Save all analysis results
age_dist_stats.to_csv('../outputs/tables/age_distribution.csv', index=False)
quality_by_age_stats.to_csv('../outputs/tables/quality_by_age.csv', index=False)
update_patterns_by_age.to_csv('../outputs/tables/update_patterns_by_age.csv', index=False)
quality_categories_by_age.to_csv('../outputs/tables/quality_categories_by_age.csv')
update_types_by_age.to_csv('../outputs/tables/update_types_by_age.csv')

if not corr_matrix.empty:
    corr_matrix.to_csv('../outputs/tables/correlation_matrix.csv')

print("✓ All analysis results saved to outputs/tables/")
print("✓ All visualizations saved to outputs/figures/")
print("\n✓ EDA COMPLETE - Proceed to Step 3: Statistical Analysis")

---

## Next Steps

1. **Statistical Analysis (Notebook 03)**: 
   - Hypothesis testing (Chi-square, ANOVA)
   - Significance testing for age-quality relationships
   - Predictive indicators for re-enrollment needs

2. **Advanced Visualization (Notebook 04)**:
   - Publication-quality charts for final report
   - Geographic analysis (if state data available)
   - Time-series trends

3. **Insight Extraction (Notebook 05)**:
   - Translate findings into governance recommendations
   - Identify actionable interventions for UIDAI
   - Prepare final PDF report content

---

**UIDAI Data Hackathon 2026** | Backend Analytics Project