# Data Consolidation & Standardisation

## Objective
This notebook consolidates raw UIDAI datasets related to **Enrolments**, **Demographic Updates**, and **Biometric Updates** into clean, standardized, and analysis-ready formats.

Each dataset is originally split across multiple files. For each dataset, we:
- Load and concatenate all raw files
- Standardize column names and formats
- Parse and normalize date fields
- Aggregate activity counts to the **district level**
- Export a single cleaned CSV for downstream analysis

## Design Choices
- Analysis is conducted at the **district level** to reduce noise from highly granular PIN-code level fluctuations.
- Activity counts represent **system events**, not unique individuals.

## Step 1: Importing Packages and Global Settings

We import the libraries needed for data manipulation and file handling, and set global pandas display options.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.2f}".format)

## Step 2: Helper Functions

The following helper functions are used across all datasets to ensure consistency in loading, cleaning, and aggregation logic.

### a. load_and_concat: Loads and concatenates all files with a given extension from a folder.

In [2]:
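def load_and_concat(folder_path, file_extension=".csv"):
    """
    Loads and concatenates all files with a given extension from a folder.
    """
    # Sort so the concatenation order is deterministic across runs and platforms
    files = sorted(Path(folder_path).glob(f"*{file_extension}"))
    if not files:
        # pd.concat raises an opaque "No objects to concatenate" on an empty list
        raise FileNotFoundError(f"No '{file_extension}' files found in {folder_path}")
    df_list = [pd.read_csv(f) for f in files]
    return pd.concat(df_list, ignore_index=True)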

### b. standardize_columns: Standardizes column names to lowercase with underscores.

In [3]:
def standardize_columns(df):
    """
    Standardizes column names to lowercase with underscores.
    """
    df.columns = (
        df.columns
          .str.strip()
          .str.lower()
          .str.replace(" ", "_")
    )
    return df
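
As a small illustration of what this chain does and does not change, the sketch below applies the same three string operations to a few made-up headers. The header names are hypothetical, and the stricter regex variant is an assumption shown only for comparison; it is not used anywhere in this notebook.

```python
import pandas as pd

# Hypothetical raw headers, chosen only to illustrate the transformation
raw_cols = pd.Index([" Date", "State ", "Age 0_5", "Age-18 Greater"])

# Same chain as standardize_columns: trim, lowercase, spaces -> underscores
basic = raw_cols.str.strip().str.lower().str.replace(" ", "_")

# Stricter variant (an assumption, not used in this notebook):
# collapse every run of non-alphanumeric characters into one underscore
strict = (
    raw_cols.str.strip()
            .str.lower()
            .str.replace(r"[^0-9a-z]+", "_", regex=True)
            .str.strip("_")
)

print(list(basic))   # ['date', 'state', 'age_0_5', 'age-18_greater']
print(list(strict))  # ['date', 'state', 'age_0_5', 'age_18_greater']
```

Only whitespace and case are handled by the helper, so any other punctuation in a raw header survives and has to be renamed explicitly.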

### c. parse_date: Parses date column into pandas datetime.

In [4]:
def parse_date(df, date_col="date"):
    """
    Parses date column into pandas datetime.
    """
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    return df
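
Because `errors="coerce"` turns unparseable values into `NaT` without warning, it can help to pass an explicit format and count the failures. The sketch below assumes day-first `DD-MM-YYYY` strings; the raw date format is not stated in this notebook, so treat both the format and the sample values as hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real raw date format is not stated in this notebook
sample = pd.DataFrame({"date": ["03-09-2025", "31-12-2025", "not a date"]})

# Assumption: dates are day-first (DD-MM-YYYY). An explicit format avoids
# silently reading "03-09-2025" as 9 March instead of 3 September.
parsed = pd.to_datetime(sample["date"], format="%d-%m-%Y", errors="coerce")

# Count values that could not be parsed and were coerced to NaT
n_failed = int(parsed.isna().sum())
print(parsed)
print("Unparseable dates:", n_failed)
```

If the count is non-zero, inspecting the offending rows before aggregation is safer than letting them disappear downstream.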

### d. aggregate_to_district: Aggregates activity counts to district level by date.

In [5]:
def aggregate_to_district(df, value_cols):
    """
    Aggregates activity counts to district level by date.
    """
    group_cols = ["date", "state", "district"]
    return (
        df.groupby(group_cols, as_index=False)[value_cols]
          .sum()
    )
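
A toy example may make the aggregation concrete. The rows below are invented for illustration; the grouping logic mirrors `aggregate_to_district`. Note that `groupby` excludes rows with a missing grouping key by default (`dropna=True`), so a row whose date was coerced to `NaT` does not contribute to any district total.

```python
import pandas as pd

# Hypothetical PIN-code level rows (values invented for illustration)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-03", "2025-01-03", "2025-01-03", None]),
    "state": ["Karnataka", "Karnataka", "Bihar", "Bihar"],
    "district": ["Bengaluru Urban", "Bengaluru Urban", "Bhojpur", "Bhojpur"],
    "pincode": [560043, 560016, 802158, 802158],
    "age_0_5": [14, 14, 5, 7],
})

# Same grouping as aggregate_to_district: one row per date, state, district
group_cols = ["date", "state", "district"]
by_district = toy.groupby(group_cols, as_index=False)[["age_0_5"]].sum()
print(by_district)

# Rows whose date is NaT are excluded by groupby's default dropna=True
dropped_total = toy["age_0_5"].sum() - by_district["age_0_5"].sum()
print("Count lost to missing keys:", dropped_total)
```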

## Step 3: Dataset-wise Consolidation and Aggregation

With common helper functions in place, we now process each UIDAI dataset **independently**.  
For every dataset (Enrolments, Demographic Updates, Biometric Updates), the same sequence of operations is applied:

1. Load and concatenate all raw files  
2. Standardize column names  
3. Parse date fields  
4. Aggregate activity counts to the district level  
5. Perform basic sanity checks  
6. Export a cleaned, analysis-ready dataset  
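
The sanity checks in step 5 could be sketched as follows. `check_aggregation` is a hypothetical helper, not one of this notebook's functions, and the sample rows are invented; it verifies that grouping keys are unique, that totals survive aggregation, and that no count is negative.

```python
import pandas as pd

def check_aggregation(raw_df, clean_df, value_cols,
                      group_cols=("date", "state", "district")):
    """Hypothetical helper: basic consistency checks after aggregation."""
    keys = list(group_cols)
    # 1. Each (date, state, district) combination appears exactly once
    assert not clean_df.duplicated(subset=keys).any(), "duplicate keys"
    # 2. Totals are preserved for rows whose grouping keys are all present
    valid = raw_df.dropna(subset=keys)
    for col in value_cols:
        assert valid[col].sum() == clean_df[col].sum(), f"total mismatch: {col}"
    # 3. Counts are never negative
    assert (clean_df[value_cols] >= 0).all().all(), "negative counts"
    return True

# Invented sample rows at PIN-code level
raw = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-03", "2025-01-03", "2025-01-04"]),
    "state": ["Bihar", "Bihar", "Bihar"],
    "district": ["Bhojpur", "Bhojpur", "Madhepura"],
    "age_0_5": [5, 7, 3],
})
clean = raw.groupby(["date", "state", "district"], as_index=False)[["age_0_5"]].sum()

ok = check_aggregation(raw, clean, ["age_0_5"])
print("Checks passed:", ok)
```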

## Dataset 1: Enrolment Data

### Description of Raw Dataset

In [22]:
enrolment_raw_path = "../data/raw/enrolment/"

enrol_df = load_and_concat(enrolment_raw_path)
enrol_df = standardize_columns(enrol_df)
enrol_df = parse_date(enrol_df, "date")

print("Initial shape: ", enrol_df.shape)
print("\nInitial columns: ", list(enrol_df.columns))
print("\nFirst few rows: ")
display(enrol_df.head())

Initial shape:  (1006029, 7)

Initial columns:  ['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater']

First few rows: 


| | date | state | district | pincode | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|---|
| 0 | 2025-02-03 | Meghalaya | East Khasi Hills | 793121 | 11 | 61 | 37 |
| 1 | 2025-09-03 | Karnataka | Bengaluru Urban | 560043 | 14 | 33 | 39 |
| 2 | 2025-09-03 | Uttar Pradesh | Kanpur Nagar | 208001 | 29 | 82 | 12 |
| 3 | 2025-09-03 | Uttar Pradesh | Aligarh | 202133 | 62 | 29 | 15 |
| 4 | 2025-09-03 | Karnataka | Bengaluru Urban | 560016 | 14 | 16 | 21 |


### Specifying Value Columns

In [29]:
enrol_value_cols = [
    "age_0_5",
    "age_5_17",
    "age_18_greater"
]

### Aggregating to District Level

In [30]:
enrol_clean = aggregate_to_district(enrol_df, enrol_value_cols)

### Description of Cleaned Dataset

In [28]:
print("Shape of cleaned dataset: ", enrol_clean.shape)
print("\nFinal columns: ", list(enrol_clean.columns))
print("\nFirst few rows: ")
display(enrol_clean.head())
print("\nDescription of cleaned dataset: ")
display(enrol_clean.describe())

Shape of cleaned dataset:  (21504, 6)

Final columns:  ['date', 'state', 'district', 'age_0_5', 'age_5_17', 'age_18_greater']

First few rows: 


| | date | state | district | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|
| 0 | 2025-01-04 | Assam | Baksa | 408 | 483 | 187 |
| 1 | 2025-01-04 | Assam | Barpeta | 138 | 54 | 23 |
| 2 | 2025-01-04 | Assam | Biswanath | 104 | 114 | 32 |
| 3 | 2025-01-04 | Assam | Bongaigaon | 221 | 87 | 61 |
| 4 | 2025-01-04 | Assam | Cachar | 988 | 461 | 299 |



Description of cleaned dataset: 


| | date | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|
| count | 21504 | 21504.0 | 21504.0 | 21504.0 |
| mean | 2025-06-19 10:23:06.160714240 | 73.6 | 43.73 | 5.35 |
| min | 2025-01-04 00:00:00 | 0.0 | 0.0 | 0.0 |
| 25% | 2025-03-09 00:00:00 | 7.0 | 2.0 | 0.0 |
| 50% | 2025-06-11 00:00:00 | 28.0 | 7.0 | 0.0 |
| 75% | 2025-09-11 00:00:00 | 71.0 | 27.0 | 1.0 |
| max | 2025-12-11 00:00:00 | 6740.0 | 6314.0 | 2404.0 |
| std | | 219.91 | 200.48 | 46.92 |


### Saving the Cleaned Dataset

In [31]:
output_path = "../data/processed/enrolment_clean.csv"
enrol_clean.to_csv(output_path, index=False)

## Dataset 2: Demographic Update Data

### Description of Raw Dataset

In [38]:
demo_raw_path = "../data/raw/demographic_updates/"

demo_df = load_and_concat(demo_raw_path)
demo_df = standardize_columns(demo_df)
demo_df = parse_date(demo_df, "date")
demo_df = demo_df.rename(columns={"demo_age_17_": "demo_age_17_plus"})

print("Initial shape: ", demo_df.shape)
print("\nInitial columns: ", list(demo_df.columns))
print("\nFirst few rows: ")
display(demo_df.head())

Initial shape:  (2071700, 6)

Initial columns:  ['date', 'state', 'district', 'pincode', 'demo_age_5_17', 'demo_age_17_plus']

First few rows: 


| | date | state | district | pincode | demo_age_5_17 | demo_age_17_plus |
|---|---|---|---|---|---|---|
| 0 | 2025-01-03 | Uttar Pradesh | Gorakhpur | 273213 | 49 | 529 |
| 1 | 2025-01-03 | Andhra Pradesh | Chittoor | 517132 | 22 | 375 |
| 2 | 2025-01-03 | Gujarat | Rajkot | 360006 | 65 | 765 |
| 3 | 2025-01-03 | Andhra Pradesh | Srikakulam | 532484 | 24 | 314 |
| 4 | 2025-01-03 | Rajasthan | Udaipur | 313801 | 45 | 785 |


### Specifying Value Columns

In [39]:
demo_value_cols = [
    "demo_age_5_17",
    "demo_age_17_plus"
]

### Aggregating to District Level

In [40]:
demo_clean = aggregate_to_district(demo_df, demo_value_cols)

### Description of Cleaned Dataset

In [41]:
print("Shape of cleaned dataset: ", demo_clean.shape)
print("\nFinal columns: ", list(demo_clean.columns))
print("\nFirst few rows: ")
display(demo_clean.head())
print("\nDescription of cleaned dataset: ")
display(demo_clean.describe())

Shape of cleaned dataset:  (35946, 5)

Final columns:  ['date', 'state', 'district', 'demo_age_5_17', 'demo_age_17_plus']

First few rows: 


| | date | state | district | demo_age_5_17 | demo_age_17_plus |
|---|---|---|---|---|---|
| 0 | 2025-01-03 | Andaman and Nicobar Islands | Nicobar | 32 | 360 |
| 1 | 2025-01-03 | Andaman and Nicobar Islands | North And Middle Andaman | 20 | 402 |
| 2 | 2025-01-03 | Andaman and Nicobar Islands | South Andaman | 74 | 450 |
| 3 | 2025-01-03 | Andhra Pradesh | Adilabad | 390 | 3950 |
| 4 | 2025-01-03 | Andhra Pradesh | Alluri Sitharama Raju | 507 | 4448 |



Description of cleaned dataset: 


| | date | demo_age_5_17 | demo_age_17_plus |
|---|---|---|---|
| count | 35946 | 35946.0 | 35946.0 |
| mean | 2025-06-17 03:36:45.908863232 | 91.31 | 812.04 |
| min | 2025-01-03 00:00:00 | 0.0 | 0.0 |
| 25% | 2025-03-11 00:00:00 | 5.0 | 46.0 |
| 50% | 2025-06-11 00:00:00 | 24.0 | 197.0 |
| 75% | 2025-09-12 00:00:00 | 64.0 | 536.0 |
| max | 2025-12-12 00:00:00 | 9362.0 | 74631.0 |
| std | | 337.0 | 3133.83 |


### Saving the Cleaned Dataset

In [42]:
output_path = "../data/processed/demographic_updates_clean.csv"
demo_clean.to_csv(output_path, index=False)

## Dataset 3: Biometric Update Data

### Description of Raw Dataset

In [43]:
bio_raw_path = "../data/raw/biometric_updates/"

bio_df = load_and_concat(bio_raw_path)
bio_df = standardize_columns(bio_df)
bio_df = parse_date(bio_df, "date")
bio_df = bio_df.rename(columns={"bio_age_17_": "bio_age_17_plus"})

print("Initial shape: ", bio_df.shape)
print("\nInitial columns: ", list(bio_df.columns))
print("\nFirst few rows: ")
display(bio_df.head())

Initial shape:  (1861108, 6)

Initial columns:  ['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_plus']

First few rows: 


| | date | state | district | pincode | bio_age_5_17 | bio_age_17_plus |
|---|---|---|---|---|---|---|
| 0 | 2025-01-03 | Haryana | Mahendragarh | 123029 | 280 | 577 |
| 1 | 2025-01-03 | Bihar | Madhepura | 852121 | 144 | 369 |
| 2 | 2025-01-03 | Jammu and Kashmir | Punch | 185101 | 643 | 1091 |
| 3 | 2025-01-03 | Bihar | Bhojpur | 802158 | 256 | 980 |
| 4 | 2025-01-03 | Tamil Nadu | Madurai | 625514 | 271 | 815 |


### Specifying Value Columns

In [44]:
bio_value_cols = [
    "bio_age_5_17",
    "bio_age_17_plus"
]

### Aggregating to District Level

In [45]:
bio_clean = aggregate_to_district(bio_df, bio_value_cols)

### Description of Cleaned Dataset

In [46]:
print("Shape of cleaned dataset: ", bio_clean.shape)
print("\nFinal columns: ", list(bio_clean.columns))
print("\nFirst few rows: ")
display(bio_clean.head())
print("\nDescription of cleaned dataset: ")
display(bio_clean.describe())

Shape of cleaned dataset:  (38059, 5)

Final columns:  ['date', 'state', 'district', 'bio_age_5_17', 'bio_age_17_plus']

First few rows: 


| | date | state | district | bio_age_5_17 | bio_age_17_plus |
|---|---|---|---|---|---|
| 0 | 2025-01-03 | Andaman & Nicobar Islands | Andamans | 16 | 193 |
| 1 | 2025-01-03 | Andaman and Nicobar Islands | Nicobar | 178 | 101 |
| 2 | 2025-01-03 | Andaman and Nicobar Islands | North And Middle Andaman | 470 | 347 |
| 3 | 2025-01-03 | Andaman and Nicobar Islands | South Andaman | 948 | 450 |
| 4 | 2025-01-03 | Andhra Pradesh | Adilabad | 897 | 4366 |



Description of cleaned dataset: 


| | date | bio_age_5_17 | bio_age_17_plus |
|---|---|---|---|
| count | 38059 | 38059.0 | 38059.0 |
| mean | 2025-06-06 06:00:30.079612928 | 703.88 | 755.6 |
| min | 2025-01-03 00:00:00 | 0.0 | 0.0 |
| 25% | 2025-02-12 00:00:00 | 19.0 | 24.0 |
| 50% | 2025-06-09 00:00:00 | 127.0 | 129.0 |
| 75% | 2025-09-11 00:00:00 | 349.0 | 354.0 |
| max | 2025-12-12 00:00:00 | 56618.0 | 51939.0 |
| std | | 2294.15 | 2506.7 |
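
The first rows of the cleaned biometric data contain two spellings of the same state, `Andaman & Nicobar Islands` and `Andaman and Nicobar Islands`, which the grouping treats as different keys. One possible way to reconcile such variants before aggregating is sketched below. `normalise_name` is a hypothetical helper and is not part of this notebook's pipeline; it only handles ampersands and whitespace, so other variants (for example differing district names) would still need an explicit mapping.

```python
import pandas as pd

# Two spellings taken from the output above, plus one with stray whitespace
states = pd.Series([
    "Andaman & Nicobar Islands",
    "Andaman and Nicobar Islands",
    "Andaman and  Nicobar Islands ",
])

def normalise_name(s):
    """Hypothetical helper: unify '&'/'and' and collapse extra whitespace."""
    return (
        s.str.strip()
         .str.replace("&", "and", regex=False)
         .str.replace(r"\s+", " ", regex=True)
    )

normalised = normalise_name(states)
print(normalised.unique())  # ['Andaman and Nicobar Islands']
```

Applying a step like this to `state` and `district` before `aggregate_to_district` would let rows that differ only in spelling fall into the same group.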


### Saving the Cleaned Dataset

In [47]:
output_path = "../data/processed/biometric_updates_clean.csv"
bio_clean.to_csv(output_path, index=False)

## Conclusion

This notebook consolidates all three UIDAI datasets (**Enrolments**, **Demographic Updates**, and **Biometric Updates**) into clean, standardized, district-level datasets ready for analysis. It produces the following files:

- `enrolment_clean.csv`
- `demographic_updates_clean.csv`
- `biometric_updates_clean.csv`

These datasets form the foundation for all subsequent exploratory, stress, lifecycle, and prioritization analyses.
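
When the cleaned files are read back in later notebooks, the `date` column arrives as plain text unless it is parsed again, because CSV does not store dtypes. A minimal round-trip sketch follows; it writes to an in-memory buffer rather than `../data/processed/` so that it runs on its own, and it reuses two rows from the enrolment output above.

```python
import io
import pandas as pd

# Two rows copied from the cleaned enrolment output shown earlier
clean = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-04", "2025-01-04"]),
    "state": ["Assam", "Assam"],
    "district": ["Baksa", "Barpeta"],
    "age_0_5": [408, 138],
})

# In-memory stand-in for the CSV file, so the sketch is self-contained
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# CSV stores dates as text, so ask pandas to parse them again on load
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```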