# Exploratory Data Analysis (EDA)
## UIDAI Datathon 2026 - Project ALI

This notebook performs comprehensive exploratory data analysis on Aadhaar demographic and biometric datasets.

**Objective:** Identify coverage gaps, service strain patterns, and operational inefficiencies in the Aadhaar ecosystem.

## 1. Setup and Data Loading

Import necessary libraries and load the merged dataset using our custom data loader module.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sys
import os

# Add the project root (parent directory) to the path so the `src` package can be imported
sys.path.insert(0, os.path.abspath('..'))

from src.data_loader import load_and_merge_data
from src.analytics import calculate_ssi, calculate_gap_by_district

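# Set visual style
sns.set_theme(style="whitegrid")
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['figure.dpi'] = 100  # Lower DPI for notebook display

# Create the figure output directory so later plt.savefig calls do not fail
os.makedirs('../reports/figures', exist_ok=True)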

print("Libraries imported successfully!")

In [None]:
# Load and merge datasets
df = load_and_merge_data(base_path="../data/raw")

# Display basic information
print(f"\nDataset Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()
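
Before moving on, it can help to confirm that the merged frame contains the columns the later cells rely on. The following is a minimal sketch, not part of the project's loader: `check_columns`, `REQUIRED_COLUMNS`, and the `sample` frame are hypothetical names and illustrative values, and the column list is taken from the plotting code in this notebook.

```python
import pandas as pd

# Columns referenced by the later cells of this notebook.
REQUIRED_COLUMNS = ["district", "pincode", "demo_age_5_17", "bio_age_5_17"]


def check_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Hypothetical stand-in for the merged dataset; values are illustrative only.
sample = pd.DataFrame({
    "district": ["A", "A", "B"],
    "pincode": [110001, 110002, 560001],
    "demo_age_5_17": [120, 80, None],
})

missing = check_columns(sample)      # columns absent from the frame
null_counts = sample.isna().sum()    # missing values per column

print("Missing columns:", missing)
print(null_counts)
```

In the notebook itself, the same two checks could be run on `df` in place of `sample`.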

## 2. Univariate Analysis: Demographic Volatility

This section examines the distribution of the population aged 5-17 across pincodes to understand demographic spread.

In [None]:
if 'demo_age_5_17' in df.columns:
    plt.figure(figsize=(12, 6))
    sns.histplot(df['demo_age_5_17'], kde=True, color='#2c3e50', bins=30)
    plt.title('Figure 1: Distribution of Adolescent Population (Age 5-17) per Pincode', 
              fontsize=14, pad=20)
    plt.xlabel('Resident Count (Ages 5-17)', fontsize=12)
    plt.ylabel('Frequency (Number of Pincodes)', fontsize=12)
    plt.axvline(df['demo_age_5_17'].mean(), color='red', linestyle='--',
                label=f"Mean per Pincode: {df['demo_age_5_17'].mean():.0f}")
    plt.legend()
    plt.tight_layout()
    plt.savefig('../reports/figures/univariate_demographics.png', dpi=300)
    plt.show()
    
    # Statistical summary
    print("\nStatistical Summary:")
    print(df['demo_age_5_17'].describe())
else:
    print("Column 'demo_age_5_17' not found in dataset")

**Key Insight:** The histogram shows how residents aged 5-17 are concentrated across pincodes. If the distribution is right-skewed, most pincodes have relatively small counts, while a few pincodes (likely urban centers) have much higher counts.
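
Skew can also be checked numerically rather than read off the histogram. The sketch below uses made-up counts (not project data) to show two common signals: a positive value from pandas' `Series.skew()` and a mean that sits well above the median.

```python
import pandas as pd

# Illustrative counts only: several small pincodes and one very large one.
counts = pd.Series([10, 12, 15, 18, 20, 22, 25, 400], name="demo_age_5_17")

skewness = counts.skew()          # positive values indicate a longer right tail
mean_count = counts.mean()        # pulled upward by the large pincode
median_count = counts.median()    # robust to the large pincode

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_count:.1f}, Median: {median_count:.1f}")
```

Applied to the real data, the same calls on `df['demo_age_5_17']` would indicate whether the right-skew reading holds.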

## 3. Bivariate Analysis: Supply vs Demand

This section examines the relationship between demographic demand (population aged 5-17) and biometric supply (updates processed).

In [None]:
if 'demo_age_5_17' in df.columns and 'bio_age_5_17' in df.columns:
    plt.figure(figsize=(12, 8))
    
    # Scatter plot with regression line
    sns.regplot(x='demo_age_5_17', y='bio_age_5_17', data=df, 
                scatter_kws={'alpha':0.4, 'color':'#3498db'}, 
                line_kws={'color':'red', 'label':'Supply Trend Line'})
    
    plt.title('Figure 2: Biometric Update Velocity vs. Demographic Demand', fontsize=14)
    plt.xlabel('Potential Demand (Population 5-17)', fontsize=12)
    plt.ylabel('Actual Biometric Updates (Processed)', fontsize=12)
    
    # Equilibrium line (where Supply = Demand)
    lims = [0, max(df['demo_age_5_17'].max(), df['bio_age_5_17'].max())]
    plt.plot(lims, lims, '--k', alpha=0.5, label='Equilibrium Line (Supply=Demand)')
    
    plt.legend()
    max_x = df['demo_age_5_17'].max()
    plt.annotate('Under-served Zones\n(High Demand, Low Supply)', 
                 xy=(max_x*0.7, 100), color='red', fontweight='bold')
    plt.tight_layout()
    plt.savefig('../reports/figures/bivariate_supply_demand.png', dpi=300)
    plt.show()
    
    # Correlation analysis
    correlation = df[['demo_age_5_17', 'bio_age_5_17']].corr().iloc[0, 1]
    print(f"\nCorrelation between Demand and Supply: {correlation:.3f}")
else:
    print("Required columns not found in dataset")

**Key Insight:** Points below the equilibrium line indicate regions where biometric updates lag behind demographic demand, signaling potential service gaps.
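
The share of points below the equilibrium line can be computed directly, which gives a single number to accompany the scatter plot. This is a minimal sketch on a hypothetical `sample` frame with illustrative values; it assumes the same column names used in the plot.

```python
import pandas as pd

# Hypothetical rows; values are illustrative only.
sample = pd.DataFrame({
    "pincode": [110001, 110002, 560001, 560002],
    "demo_age_5_17": [100, 200, 150, 50],   # demand
    "bio_age_5_17": [120, 90, 150, 10],     # supply
})

# A point lies below the equilibrium line when supply < demand.
below_line = sample["bio_age_5_17"] < sample["demo_age_5_17"]

share_below = below_line.mean()                          # fraction of rows below the line
under_served = sample.loc[below_line, "pincode"].tolist()

print(f"Share below equilibrium: {share_below:.0%}")
print("Under-served pincodes:", under_served)
```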

## 4. Trivariate Analysis: Gap Analysis by District

This section identifies the 10 districts with the largest service gap (demand minus supply).

In [None]:
# Calculate gap by district
top_strained = calculate_gap_by_district(df, top_n=10)

# Plotting
fig, ax = plt.subplots(figsize=(14, 7))
top_strained.plot(x='district', y=['demo_age_5_17', 'bio_age_5_17'], 
                  kind='bar', ax=ax, color=['#e74c3c', '#2ecc71'])

plt.title('Figure 3: Top 10 High-Priority Districts (Update Backlog)', fontsize=14)
plt.ylabel('Count of Residents', fontsize=12)
plt.xlabel('District', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(['Required Updates (Demand)', 'Completed Updates (Supply)'])
plt.tight_layout()
plt.savefig('../reports/figures/gap_bar_chart.png', dpi=300)
plt.show()

# Display district gap data
print("\nTop 10 Strained Districts:")
print(top_strained[['district', 'gap']].to_string(index=False))

**Key Insight:** These districts are the strongest candidates for immediate intervention, such as mobile update camps or additional infrastructure.
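
The implementation of `calculate_gap_by_district` lives in `src.analytics` and is not shown in this notebook. One plausible way to produce such a ranking, assuming the gap is the district-level sum of demand minus the sum of supply, is sketched below. `gap_by_district_sketch` and the `sample` data are hypothetical and may differ from the project's actual function.

```python
import pandas as pd


def gap_by_district_sketch(frame: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Hypothetical stand-in: rank districts by total demand minus total supply."""
    totals = (
        frame.groupby("district", as_index=False)[["demo_age_5_17", "bio_age_5_17"]]
        .sum()
    )
    totals["gap"] = totals["demo_age_5_17"] - totals["bio_age_5_17"]
    # Keep the top_n districts with the largest gap, largest first.
    return totals.nlargest(top_n, "gap").reset_index(drop=True)


# Illustrative pincode-level rows for three made-up districts.
sample = pd.DataFrame({
    "district": ["A", "A", "B", "C", "C"],
    "demo_age_5_17": [100, 200, 150, 50, 60],
    "bio_age_5_17": [120, 90, 150, 10, 20],
})

result = gap_by_district_sketch(sample, top_n=2)
print(result[["district", "gap"]].to_string(index=False))
```

Summing before subtracting means over-served pincodes offset under-served ones within the same district. Whether that is desirable depends on how the backlog is defined.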

## 5. Service Strain Index (SSI) Heatmap

This section visualizes the SSI for the 20 highest-strain pincodes, grouped by district, to identify priority zones.
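
The formula behind `calculate_ssi` is defined in `src.analytics` and is not shown here. As a rough sketch, assuming SSI is the ratio of demand to supply (an assumption that matches the interpretation printed with the heatmap, where values below 1.0 mean supply exceeds demand), it could be computed as follows. `ssi_sketch` and the `sample` values are hypothetical.

```python
import pandas as pd


def ssi_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical SSI: demand divided by supply (assumed, not the confirmed formula)."""
    out = frame.copy()
    # Mask zero supply as NaN so the ratio is undefined rather than infinite.
    supply = out["bio_age_5_17"].where(out["bio_age_5_17"] != 0)
    out["SSI"] = out["demo_age_5_17"] / supply
    return out


# Illustrative rows only.
sample = pd.DataFrame({
    "pincode": [110001, 110002, 560001, 560002],
    "demo_age_5_17": [100, 200, 150, 50],
    "bio_age_5_17": [100, 100, 300, 0],
})

scored = ssi_sketch(sample)
print(scored[["pincode", "SSI"]])
```

Rows with zero recorded supply come out as NaN in this sketch. Depending on the use case, such rows might instead be flagged as maximum strain.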

In [None]:
# Calculate SSI
df_ssi = calculate_ssi(df)

# Select top 20 high-strain pincodes
heat_data = df_ssi.sort_values(by='SSI', ascending=False).head(20)

if not heat_data.empty and 'district' in heat_data.columns and 'pincode' in heat_data.columns:
    # Create pivot table
    heat_pivot = heat_data.pivot_table(index="district", columns="pincode", 
                                       values="SSI", aggfunc='mean')
    
    plt.figure(figsize=(14, 10))
    sns.heatmap(heat_pivot, annot=True, fmt='.2f', cmap='RdYlGn_r', center=1.5, 
                cbar_kws={'label': 'SSI Score'})
    plt.title('Figure 4: Service Strain Index (SSI) Matrix - Priority Identification', 
              fontsize=14)
    plt.xlabel('Pincode', fontsize=12)
    plt.ylabel('District', fontsize=12)
    plt.tight_layout()
    plt.savefig('../reports/figures/ssi_heatmap.png', dpi=300)
    plt.show()
    
    print("\nSSI Interpretation:")
    print("SSI < 1.0: Supply exceeds demand (well-served)")
    print("SSI = 1.0: Equilibrium")
    print("1.0 < SSI <= 1.5: Demand exceeds supply (moderate strain)")
    print("SSI > 1.5: High strain (under-served, priority intervention needed)")
    print(f"\nAverage SSI in dataset: {df_ssi['SSI'].mean():.2f}")
else:
    print("Insufficient data for heatmap generation")

## 6. Summary and Conclusions

### Key Findings:

1. **Demographic Concentration**: A small number of pincodes, likely urban, account for much higher resident counts (ages 5-17) than the rest
2. **Service Gap**: The weak correlation between demand and supply suggests that update capacity is not allocated in proportion to demand, which points to systemic inefficiencies
3. **Priority Districts**: The 10 districts with the largest gap (demand minus supply) are identified for immediate resource allocation
4. **SSI Hotspots**: High-SSI zones require targeted mobile update camps

### Recommendations:

- Deploy mobile biometric update units to high-SSI districts
- Investigate systemic issues in under-performing regions
- Implement predictive scheduling based on demographic trends
- Enhance infrastructure in consistently strained zones

---
**Project ALI (Aadhaar Lifecycle Intelligence)**  
UIDAI Datathon 2026  
Prepared by: Priyanshu