# UIDAI Biometric Update Analysis - Data Loading and Cleaning

**UIDAI Data Hackathon 2026**  
**Project**: Backend Data Analytics - Biometric Update Patterns

---

## Notebook Purpose

This notebook performs **Step 1** of the analysis pipeline:
1. Load Aadhaar enrolment and update datasets
2. Perform initial data exploration
3. Clean and preprocess the data
4. Create age groups for demographic analysis
5. Save cleaned datasets for further analysis

---

## Expected Outputs
- Cleaned enrolment dataset with age groups, biometric quality categories, and time-based features
- Cleaned update dataset with time-based features
- Data quality report
- Saved processed files in `data/processed/`

## 1. Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Import custom modules
import sys
sys.path.append('..')  # Add parent directory to path

from scripts.data_loader import (
    load_enrolment_data,
    load_update_data,
    get_data_info,
    convert_date_columns,
    save_processed_data
)

from scripts.data_cleaner import (
    create_age_groups,
    handle_missing_values,
    standardize_categorical_columns,
    remove_duplicates,
    create_biometric_quality_categories,
    get_cleaning_summary
)

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries and modules imported successfully")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

## 2. Load Datasets

**Note**: Replace the file paths below with your actual UIDAI dataset files.  
If either file is missing, the notebook creates sample data in the next section.

In [None]:
# Define file paths
# TODO: Replace these with your actual dataset file paths
ENROLMENT_FILE = '../data/raw/enrolment_data.csv'
UPDATE_FILE = '../data/raw/update_data.csv'

# Check if files exist
enrolment_path = Path(ENROLMENT_FILE)
update_path = Path(UPDATE_FILE)

if enrolment_path.exists() and update_path.exists():
    print("✓ Dataset files found")
    USE_SAMPLE_DATA = False
else:
    print("⚠ Dataset files not found. Will create sample data for demonstration.")
    USE_SAMPLE_DATA = True

### 2.1 Create Sample Data (if needed)

This section creates synthetic sample Aadhaar data for testing the pipeline.  
**It runs only when the actual UIDAI dataset files are not found.**

In [None]:
if USE_SAMPLE_DATA:
    print("Creating sample datasets...")
    
    # Set random seed for reproducibility
    np.random.seed(42)
    
    # Sample size
    n_enrolments = 50000
    n_updates = 15000
    
    # Create sample enrolment data
    states = ['Maharashtra', 'Uttar Pradesh', 'Bihar', 'West Bengal', 'Madhya Pradesh', 
              'Tamil Nadu', 'Rajasthan', 'Karnataka', 'Gujarat', 'Andhra Pradesh']
    genders = ['Male', 'Female', 'Transgender']
    center_types = ['Permanent', 'Temporary', 'Mobile', 'Bank', 'Post Office']
    
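    # Relative weight per age (0-99), by bracket: 0-5, 6-18, 19-40, 41-60, 61+.
    # np.random.choice requires p to sum to 1 (the raw weights sum to 1.477),
    # so normalise them before sampling.
    age_weights = np.array([
        0.02 if i < 6 else
        0.015 if i < 19 else
        0.025 if i < 41 else
        0.015 if i < 61 else
        0.008 for i in range(100)
    ])
    age_weights = age_weights / age_weights.sum()

    df_enrolment = pd.DataFrame({
        'Enrolment_ID': range(1, n_enrolments + 1),
        'Enrolment_Date': pd.date_range('2015-01-01', periods=n_enrolments, freq='h'),
        'State': np.random.choice(states, n_enrolments),
        'Age': np.random.choice(range(0, 100), n_enrolments, p=age_weights),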
        'Gender': np.random.choice(genders, n_enrolments, p=[0.51, 0.485, 0.005]),
        'Center_Type': np.random.choice(center_types, n_enrolments, p=[0.4, 0.25, 0.15, 0.12, 0.08]),
        'Biometric_Quality_Score': np.random.randint(30, 100, n_enrolments)
    })
    
    # Create sample update data
    update_types = ['Biometric', 'Demographic', 'Address', 'Mobile', 'Email']
    update_reasons = ['Error Correction', 'Life Event', 'Biometric Degradation', 'Document Change']
    
    df_updates = pd.DataFrame({
        'Update_ID': range(1, n_updates + 1),
        'Enrolment_ID': np.random.choice(df_enrolment['Enrolment_ID'], n_updates),
        'Update_Date': pd.date_range('2018-01-01', periods=n_updates, freq='2h'),
        'Update_Type': np.random.choice(update_types, n_updates, p=[0.35, 0.25, 0.20, 0.15, 0.05]),
        'Update_Reason': np.random.choice(update_reasons, n_updates, p=[0.25, 0.30, 0.30, 0.15]),
        'Previous_State': np.random.choice(states, n_updates),
        'New_State': np.random.choice(states, n_updates)
    })
    
    # Save sample data to the paths defined in Section 2
    enrolment_path.parent.mkdir(parents=True, exist_ok=True)
    update_path.parent.mkdir(parents=True, exist_ok=True)
    df_enrolment.to_csv(ENROLMENT_FILE, index=False)
    df_updates.to_csv(UPDATE_FILE, index=False)
    
    print(f"✓ Created sample enrolment data: {len(df_enrolment):,} records")
    print(f"✓ Created sample update data: {len(df_updates):,} records")
    print("✓ Saved to data/raw/ directory")

### 2.2 Load Data Using Custom Functions

In [None]:
# Load enrolment data from the path defined in Section 2
df_enrolment = load_enrolment_data(ENROLMENT_FILE)

# Get data information
enrolment_info = get_data_info(df_enrolment, "Enrolment Dataset")

In [None]:
# Load update data from the path defined in Section 2
df_updates = load_update_data(UPDATE_FILE)

# Get data information
update_info = get_data_info(df_updates, "Update Dataset")

### 2.3 Preview the Data

In [None]:
# Display first few rows of enrolment data
print("Enrolment Data - First 5 Rows:")
display(df_enrolment.head())

print("\nBasic Statistics:")
display(df_enrolment.describe())

In [None]:
# Display first few rows of update data
print("Update Data - First 5 Rows:")
display(df_updates.head())

print("\nBasic Statistics:")
display(df_updates.describe())

## 3. Data Cleaning

Now we'll clean the datasets by:
1. Converting date columns to proper datetime format
2. Handling missing values
3. Removing duplicates
4. Standardizing categorical columns

### 3.1 Convert Date Columns

In [None]:
# Convert enrolment date column
df_enrolment = convert_date_columns(df_enrolment, ['Enrolment_Date'])

# Convert update date column
df_updates = convert_date_columns(df_updates, ['Update_Date'])

print("\n✓ Date columns converted successfully")
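
`convert_date_columns` comes from the `scripts.data_loader` module and is not shown in this notebook. The following is a rough sketch of what such a helper could look like, assuming it wraps `pd.to_datetime`. The name `to_datetime_columns` and the choice of `errors='coerce'` are illustrative assumptions, not the project's actual code.

```python
import pandas as pd

def to_datetime_columns(df, columns):
    """Hypothetical stand-in for convert_date_columns (real implementation not shown)."""
    out = df.copy()
    for col in columns:
        # errors='coerce' turns unparseable values into NaT instead of raising
        out[col] = pd.to_datetime(out[col], errors='coerce')
    return out

demo = pd.DataFrame({
    'Enrolment_Date': ['2015-01-01 00:00:00', '2015-01-01 01:00:00', 'not a date']
})
converted = to_datetime_columns(demo, ['Enrolment_Date'])

print(converted['Enrolment_Date'].dtype)
print("Unparseable dates:", converted['Enrolment_Date'].isna().sum())
```

With coercion, bad dates surface as missing values in the next step rather than stopping the notebook.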

### 3.2 Handle Missing Values

In [None]:
# Report missing values in enrolment data (nothing is dropped or filled at this step)
df_enrolment_clean = handle_missing_values(df_enrolment, strategy='report')

# If you want to actually handle missing values, use one of these strategies:
# df_enrolment_clean = handle_missing_values(df_enrolment, strategy='drop_cols', threshold=0.3)
# df_enrolment_clean = handle_missing_values(df_enrolment, strategy='fill_median')

In [None]:
# Report missing values in update data (nothing is dropped or filled at this step)
df_updates_clean = handle_missing_values(df_updates, strategy='report')
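
The `'report'` strategy is implemented in `scripts.data_cleaner`, which is not shown here. As a minimal sketch of the kind of per-column report it might produce, the hypothetical helper `missing_value_report` below counts missing values and expresses them as a percentage of rows.

```python
import numpy as np
import pandas as pd

def missing_value_report(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        'missing_count': missing,
        'missing_pct': (missing / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report['missing_count'] > 0].sort_values('missing_count', ascending=False)

demo = pd.DataFrame({
    'Age': [25, np.nan, 40, np.nan],
    'State': ['Bihar', 'Gujarat', None, 'Karnataka'],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
})
report = missing_value_report(demo)
print(report)
```

A percentage column makes it easier to judge a setting such as `threshold=0.3` in the `'drop_cols'` strategy above.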

### 3.3 Remove Duplicates

In [None]:
# Remove duplicate enrolment records
df_enrolment_clean = remove_duplicates(df_enrolment_clean, subset=['Enrolment_ID'], keep='first')

In [None]:
# Remove duplicate update records
df_updates_clean = remove_duplicates(df_updates_clean, subset=['Update_ID'], keep='first')
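
For reference, the `subset` and `keep` arguments mirror those of pandas' own `drop_duplicates`. The snippet below is a small illustration on made-up rows, assuming `remove_duplicates` behaves like a thin wrapper around that method (its source is not shown here).

```python
import pandas as pd

demo = pd.DataFrame({
    'Update_ID': [1, 2, 2, 3],
    'Update_Type': ['Biometric', 'Address', 'Mobile', 'Email'],
})

# Count rows whose Update_ID repeats an earlier row
n_duplicates = demo.duplicated(subset=['Update_ID'], keep='first').sum()

# keep='first' retains the earliest occurrence of each Update_ID
deduped = demo.drop_duplicates(subset=['Update_ID'], keep='first').reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduped)} remain")
```

Because only the ID column is compared, the second `Update_ID == 2` row is dropped even though its `Update_Type` differs, so it is worth inspecting duplicates before discarding them.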

### 3.4 Standardize Categorical Columns

In [None]:
# Standardize categorical columns in enrolment data
categorical_cols_enrolment = ['State', 'Gender', 'Center_Type']
df_enrolment_clean = standardize_categorical_columns(
    df_enrolment_clean, 
    categorical_cols_enrolment, 
    case='title'
)

In [None]:
# Standardize categorical columns in update data
categorical_cols_updates = ['Update_Type', 'Update_Reason', 'Previous_State', 'New_State']
df_updates_clean = standardize_categorical_columns(
    df_updates_clean, 
    categorical_cols_updates, 
    case='title'
)
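
Standardizing matters because values such as `'tamil nadu'` and `'TAMIL NADU '` would otherwise be counted as different states. A minimal sketch of title-case standardization follows; the helper name `title_case_columns` and the whitespace handling are assumptions, and the real `standardize_categorical_columns` may differ.

```python
import pandas as pd

def title_case_columns(df, columns):
    """Hypothetical stand-in for standardize_categorical_columns(case='title')."""
    out = df.copy()
    for col in columns:
        # Trim surrounding whitespace, collapse internal runs of spaces, then title-case
        out[col] = (
            out[col]
            .astype('string')
            .str.strip()
            .str.replace(r'\s+', ' ', regex=True)
            .str.title()
        )
    return out

demo = pd.DataFrame({'State': ['  tamil nadu', 'TAMIL NADU ', 'Tamil  Nadu']})
cleaned = title_case_columns(demo, ['State'])

print("Distinct states before:", demo['State'].nunique())
print("Distinct states after:", cleaned['State'].nunique())
```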

## 4. Feature Engineering

Create new columns for analysis:
1. Age groups from age data
2. Biometric quality categories
3. Time-based features

### 4.1 Create Age Groups

In [None]:
# Create age groups in enrolment data
df_enrolment_clean = create_age_groups(
    df_enrolment_clean,
    age_column='Age',
    new_column_name='Age_Group'
)

# Visualize age group distribution
plt.figure(figsize=(10, 6))
age_dist = df_enrolment_clean['Age_Group'].value_counts().sort_index()
age_dist.plot(kind='bar', color='steelblue')
plt.title('Distribution of Enrolments by Age Group', fontsize=14, fontweight='bold')
plt.xlabel('Age Group', fontsize=12)
plt.ylabel('Number of Enrolments', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("\n✓ Age groups created successfully")
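
The brackets used by `create_age_groups` are defined in `scripts.data_cleaner` and are not visible in this notebook. As an illustration only, the sketch below bins ages with `pd.cut`, assuming the same breakpoints that the sample-data weights use (0-5, 6-18, 19-40, 41-60, 61+). The function name `add_age_groups`, the labels, and the upper bound of 200 are hypothetical.

```python
import pandas as pd

# Assumed brackets, mirroring the age breakpoints used for the sample-data weights
AGE_BINS = [0, 6, 19, 41, 61, 200]
AGE_LABELS = ['0-5', '6-18', '19-40', '41-60', '61+']

def add_age_groups(df, age_column='Age', new_column_name='Age_Group'):
    """Hypothetical stand-in for create_age_groups (actual brackets may differ)."""
    out = df.copy()
    # right=False makes each bin [lower, upper), so age 6 falls in '6-18', not '0-5'
    out[new_column_name] = pd.cut(
        out[age_column], bins=AGE_BINS, labels=AGE_LABELS, right=False
    )
    return out

demo = add_age_groups(pd.DataFrame({'Age': [0, 5, 6, 18, 19, 40, 41, 60, 61, 99]}))
print(demo['Age_Group'].tolist())
```

`pd.cut` returns an ordered categorical, so `value_counts().sort_index()` lists the groups from youngest to oldest rather than alphabetically, which keeps the bar chart in a logical order.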

### 4.2 Create Biometric Quality Categories

In [None]:
# Create biometric quality categories
df_enrolment_clean = create_biometric_quality_categories(
    df_enrolment_clean,
    quality_column='Biometric_Quality_Score',
    new_column_name='Quality_Category'
)

# Visualize quality distribution
plt.figure(figsize=(10, 6))
quality_dist = df_enrolment_clean['Quality_Category'].value_counts().sort_index()
quality_dist.plot(kind='bar', color='coral')
plt.title('Distribution of Enrolments by Biometric Quality Category', fontsize=14, fontweight='bold')
plt.xlabel('Quality Category', fontsize=12)
plt.ylabel('Number of Enrolments', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 4.3 Create Time-Based Features

In [None]:
# Extract time-based features from enrolment date
df_enrolment_clean['Enrolment_Year'] = df_enrolment_clean['Enrolment_Date'].dt.year
df_enrolment_clean['Enrolment_Month'] = df_enrolment_clean['Enrolment_Date'].dt.month
df_enrolment_clean['Enrolment_Quarter'] = df_enrolment_clean['Enrolment_Date'].dt.quarter

# Extract time-based features from update date
df_updates_clean['Update_Year'] = df_updates_clean['Update_Date'].dt.year
df_updates_clean['Update_Month'] = df_updates_clean['Update_Date'].dt.month
df_updates_clean['Update_Quarter'] = df_updates_clean['Update_Date'].dt.quarter

print("✓ Time-based features created successfully")

## 5. Data Quality Summary

In [None]:
# Generate cleaning summary for enrolment data
enrolment_summary = get_cleaning_summary(df_enrolment, df_enrolment_clean)

In [None]:
# Generate cleaning summary for update data
update_summary = get_cleaning_summary(df_updates, df_updates_clean)
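
The contents of the summary depend on `get_cleaning_summary`, whose source is not shown here. As a hedged sketch of what a before/after comparison could report, the hypothetical `summarize_cleaning` below records row counts, columns added, and remaining missing cells.

```python
import pandas as pd

def summarize_cleaning(df_before, df_after):
    """Hypothetical stand-in for get_cleaning_summary."""
    return {
        'rows_before': len(df_before),
        'rows_after': len(df_after),
        'rows_removed': len(df_before) - len(df_after),
        # Columns present after cleaning but not before (engineered features)
        'columns_added': sorted(set(df_after.columns) - set(df_before.columns)),
        'missing_cells_after': int(df_after.isna().sum().sum()),
    }

before = pd.DataFrame({'Enrolment_ID': [1, 2, 2], 'Age': [4, 30, 30]})
after = (
    before.drop_duplicates(subset=['Enrolment_ID'])
    .assign(Age_Group=['0-5', '19-40'])
)
summary = summarize_cleaning(before, after)
print(summary)
```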

## 6. Save Cleaned Datasets

In [None]:
# Save cleaned enrolment data
save_processed_data(
    df_enrolment_clean,
    file_name='enrolment_cleaned.csv',
    output_dir='../data/processed'
)

In [None]:
# Save cleaned update data
save_processed_data(
    df_updates_clean,
    file_name='updates_cleaned.csv',
    output_dir='../data/processed'
)
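
One caveat when later notebooks reload these files: CSV stores no dtype information, so datetime columns come back as plain text unless they are parsed on load. The snippet below demonstrates this with an in-memory buffer instead of the real files; it is an illustration, not part of the project's loader.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    'Enrolment_ID': [1, 2],
    'Enrolment_Date': pd.to_datetime(['2015-01-01 00:00', '2015-01-01 01:00']),
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Without parse_dates the date column is read back as strings
buffer.seek(0)
reloaded_plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
reloaded_dates = pd.read_csv(buffer, parse_dates=['Enrolment_Date'])

print(reloaded_plain['Enrolment_Date'].dtype, reloaded_dates['Enrolment_Date'].dtype)
```

The same applies to categorical columns such as `Age_Group`: the category order is not stored in the CSV and has to be re-applied after loading.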

## 7. Final Data Overview

In [None]:
print("="*70)
print("FINAL CLEANED DATASETS - READY FOR ANALYSIS")
print("="*70)

print("\nEnrolment Dataset:")
print(f"  Records: {len(df_enrolment_clean):,}")
print(f"  Columns: {len(df_enrolment_clean.columns)}")
print(f"  Date Range: {df_enrolment_clean['Enrolment_Date'].min()} to {df_enrolment_clean['Enrolment_Date'].max()}")
print(f"  Age Groups: {df_enrolment_clean['Age_Group'].nunique()}")

print("\nUpdate Dataset:")
print(f"  Records: {len(df_updates_clean):,}")
print(f"  Columns: {len(df_updates_clean.columns)}")
print(f"  Date Range: {df_updates_clean['Update_Date'].min()} to {df_updates_clean['Update_Date'].max()}")
print(f"  Update Types: {df_updates_clean['Update_Type'].nunique()}")

print("\n✓ Data loading and cleaning complete!")
print("✓ Ready for exploratory data analysis and statistical modeling")

---

## Next Steps

1. **Exploratory Data Analysis** (Notebook 02): Analyze patterns in biometric updates
2. **Statistical Analysis** (Notebook 03): Test hypotheses about age-quality relationships
3. **Visualization** (Notebook 04): Create charts for the final report
4. **Insights Extraction** (Notebook 05): Derive actionable recommendations for UIDAI

---

**UIDAI Data Hackathon 2026** | Backend Analytics Project