## Step A: Biometric Data Cleaning and Aggregation

The Aadhaar biometric update dataset is provided in multiple CSV files compressed within a ZIP archive. We first extracted and combined all files to form a unified dataset. Since fraud and identity misuse patterns emerge over time and geography rather than in isolated records, the data was cleaned and aggregated at the **pincode–month** level.

The dataset already provides biometric update counts separated by age groups. We specifically focus on biometric updates for individuals aged **17 and above**, as adult biometric traits are expected to remain stable. Frequent updates in this category may indicate biometric correction loops, identity reuse, or potential misuse.

Dates were normalized to a monthly format to enable temporal trend analysis. Missing values in biometric counts were treated as zero updates, and records without valid dates or pincodes were excluded to preserve analytical integrity.


In [57]:
import pandas as pd
import numpy as np
import zipfile
import os


In [58]:
# Archive location and extraction target (os and zipfile are imported above)
zip_path = "data/api_data_aadhar_biometric.zip"
extract_path = "data/biometric_updates"

os.makedirs(extract_path, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

os.listdir(extract_path)


['api_data_aadhar_biometric']

In [59]:
inner_path = os.path.join(extract_path, "api_data_aadhar_biometric")

csv_files = [
    os.path.join(inner_path, f)
    for f in os.listdir(inner_path)
    if f.endswith(".csv")
]

csv_files


['data/biometric_updates\\api_data_aadhar_biometric\\api_data_aadhar_biometric_0_500000.csv',
 'data/biometric_updates\\api_data_aadhar_biometric\\api_data_aadhar_biometric_1000000_1500000.csv',
 'data/biometric_updates\\api_data_aadhar_biometric\\api_data_aadhar_biometric_1500000_1861108.csv',
 'data/biometric_updates\\api_data_aadhar_biometric\\api_data_aadhar_biometric_500000_1000000.csv']

In [60]:
# Read every CSV chunk and stack the chunks into a single DataFrame
df = pd.concat(
    (pd.read_csv(file) for file in csv_files),
    ignore_index=True
)

df.shape



(1861108, 6)

In [61]:
df.head()
df.columns

Index(['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_'], dtype='object')

In [62]:
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

df.columns


Index(['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_'], dtype='object')

In [63]:
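df.isnull().sum()

# Drop rows that lack a pincode or date
df = df.dropna(subset=['pincode', 'date']).copy()

# Treat missing biometric counts as zero updates; leave text columns untouched
count_cols = ['bio_age_5_17', 'bio_age_17_']
df[count_cols] = df[count_cols].fillna(0)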


In [64]:
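# Parse dates day-first; values that cannot be parsed become NaT
df['date'] = pd.to_datetime(df['date'], dayfirst=True, errors='coerce')
# Collapse each date to its calendar month (YYYY-MM)
df['month'] = df['date'].dt.to_period('M').astype(str)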

df['date'].isna().sum()


np.int64(0)
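The cleaning rules described in this step can be illustrated on a toy frame. The following is a minimal sketch rather than the notebook's exact code: the `clean_biometric` helper and the sample rows are hypothetical, and it assumes day-first date strings as in the cell above.

```python
import pandas as pd

def clean_biometric(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the Step A cleaning rules to a raw frame."""
    out = raw.dropna(subset=["pincode", "date"]).copy()
    # Parse day-first dates; unparseable strings become NaT and are dropped
    out["date"] = pd.to_datetime(out["date"], dayfirst=True, errors="coerce")
    out = out.dropna(subset=["date"])
    # Only the count columns are zero-filled, so text columns stay untouched
    count_cols = ["bio_age_5_17", "bio_age_17_"]
    out[count_cols] = out[count_cols].fillna(0)
    out["month"] = out["date"].dt.to_period("M").astype(str)
    return out.reset_index(drop=True)

# Made-up sample rows, for illustration only
toy = pd.DataFrame({
    "date": ["01-03-2025", "15-03-2025", "not a date", None],
    "pincode": [110001, 110001, 110002, 110003],
    "bio_age_5_17": [5, None, 2, 1],
    "bio_age_17_": [10, 7, None, 4],
})
cleaned = clean_biometric(toy)
print(cleaned[["pincode", "month", "bio_age_5_17", "bio_age_17_"]])
```

Dropping rows again after parsing matters because `errors="coerce"` silently turns bad strings into `NaT`, which would otherwise end up as a spurious month label in the aggregation.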

In [65]:
df_17plus = df[['pincode', 'month', 'bio_age_17_']]

agg_df = (
    df_17plus
    .groupby(['pincode', 'month'], as_index=False)
    .agg({'bio_age_17_': 'sum'})
)

agg_df.rename(columns={'bio_age_17_': 'bio_updates_17plus'}, inplace=True)
agg_df.head()

| | pincode | month | bio_updates_17plus |
|---|---|---|---|
| 0 | 110001 | 2025-03 | 247 |
| 1 | 110001 | 2025-04 | 163 |
| 2 | 110001 | 2025-05 | 163 |
| 3 | 110001 | 2025-06 | 164 |
| 4 | 110001 | 2025-07 | 201 |


In [66]:
agg_df.to_csv("data/biometric_pin_month.csv", index=False)


## Step B: Biometric Instability Feature Engineering

To quantify identity instability, we engineered features that capture both the **intensity** and **volatility** of biometric updates over time. Real adult identities typically exhibit stable biometric patterns, whereas manipulated or fraudulent identities often display sudden or abnormal changes.

We computed month-to-month biometric volatility for each pincode to capture abrupt spikes or drops in adult biometric updates. In addition, we calculated a normalized instability score using within-pincode standardization to identify months that significantly deviated from typical behavior.

These features were combined into a composite biometric risk score: the equally weighted average of the standardized update level and the standardized month-to-month change. Because both components are standardized within each pincode, correction bursts and irregular update patterns are highlighted as higher risk even if absolute update counts are not extreme.
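The features computed in the cells that follow can be written compactly. The notation here is an assumption introduced for illustration: $u_{p,t}$ is the adult update count for pincode $p$ in its $t$-th observed month, $\mu_p$ and $\sigma_p$ are the mean and sample standard deviation of $u_{p,\cdot}$, $\bar{v}_p$ and $s_p$ are the same statistics for the volatility series, and $\epsilon = 10^{-6}$ guards against division by zero.

$$
v_{p,t} = \lvert u_{p,t} - u_{p,t-1} \rvert, \qquad v_{p,1} = 0
$$

$$
z^{u}_{p,t} = \frac{u_{p,t} - \mu_p}{\sigma_p + \epsilon}, \qquad
z^{v}_{p,t} = \frac{v_{p,t} - \bar{v}_p}{s_p + \epsilon}
$$

$$
r_{p,t} = 0.5\, z^{u}_{p,t} + 0.5\, z^{v}_{p,t}
$$

Here $v$ corresponds to `bio_volatility`, $z^{u}$ to `bio_instability_score`, $z^{v}$ to `bio_volatility_norm`, and $r$ to `biometric_risk`.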


In [67]:
agg_df = agg_df.sort_values(by=['pincode', 'month']).reset_index(drop=True)


In [68]:
agg_df['bio_volatility'] = (
    agg_df
    .groupby('pincode')['bio_updates_17plus']
    .diff()
    .abs()
)
agg_df['bio_volatility'] = agg_df['bio_volatility'].fillna(0)


In [69]:
agg_df['bio_instability_score'] = (
    agg_df
    .groupby('pincode')['bio_updates_17plus']
    .transform(lambda x: (x - x.mean()) / (x.std() + 1e-6))
)


In [70]:
agg_df['bio_volatility_norm'] = (
    agg_df
    .groupby('pincode')['bio_volatility']
    .transform(lambda x: (x - x.mean()) / (x.std() + 1e-6))
)

agg_df['biometric_risk'] = (
    0.5 * agg_df['bio_instability_score'] +
    0.5 * agg_df['bio_volatility_norm']
)


In [71]:
agg_df[['pincode', 'month', 'bio_updates_17plus',
        'bio_volatility', 'bio_instability_score','bio_volatility_norm',
        'biometric_risk']].head(10)


| | pincode | month | bio_updates_17plus | bio_volatility | bio_instability_score | bio_volatility_norm | biometric_risk |
|---|---|---|---|---|---|---|---|
| 0 | 110001 | 2025-03 | 247 | 0.0 | 1.910289 | -1.045066 | 0.432612 |
| 1 | 110001 | 2025-04 | 163 | 84.0 | 0.120725 | 1.68874 | 0.904732 |
| 2 | 110001 | 2025-05 | 163 | 0.0 | 0.120725 | -1.045066 | -0.462171 |
| 3 | 110001 | 2025-06 | 164 | 1.0 | 0.142029 | -1.012521 | -0.435246 |
| 4 | 110001 | 2025-07 | 201 | 37.0 | 0.93029 | 0.15911 | 0.5447 |
| 5 | 110001 | 2025-09 | 130 | 71.0 | -0.582319 | 1.265651 | 0.341666 |
| 6 | 110001 | 2025-10 | 91 | 39.0 | -1.413188 | 0.224201 | -0.594493 |
| 7 | 110001 | 2025-11 | 109 | 18.0 | -1.02971 | -0.459251 | -0.74448 |
| 8 | 110001 | 2025-12 | 148 | 39.0 | -0.198841 | 0.224201 | 0.01268 |
| 9 | 110002 | 2025-03 | 427 | 0.0 | -0.029506 | -1.068546 | -0.549026 |
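In the output above, pincode 110001 has no row for 2025-08, so `diff()` compares 2025-09 directly with 2025-07. One possible refinement, sketched below on toy data, is to reindex each pincode onto a gap-free monthly range before differencing. The `fill_month_gaps` helper is hypothetical, and filling absent months with zero is an assumption that follows this notebook's convention of treating missing counts as zero updates; if an absent month instead means missing data, leaving it as `NaN` would be the safer choice.

```python
import pandas as pd

# Toy data mirroring the 110001 rows above, where 2025-08 is absent
toy = pd.DataFrame({
    "pincode": [110001, 110001, 110001],
    "month": ["2025-06", "2025-07", "2025-09"],
    "bio_updates_17plus": [164, 201, 130],
})

def fill_month_gaps(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: reindex one pincode onto a gap-free monthly range."""
    s = group.set_index(pd.PeriodIndex(group["month"], freq="M"))["bio_updates_17plus"]
    full = pd.period_range(s.index.min(), s.index.max(), freq="M")
    # Assumption: an absent month means zero recorded updates
    out = s.reindex(full, fill_value=0).rename_axis("month").reset_index()
    out["month"] = out["month"].astype(str)
    out.insert(0, "pincode", group["pincode"].iloc[0])
    return out

filled = pd.concat(
    [fill_month_gaps(g) for _, g in toy.groupby("pincode")],
    ignore_index=True,
)
# Same volatility definition as in the notebook, now on consecutive months
filled["bio_volatility"] = (
    filled.groupby("pincode")["bio_updates_17plus"].diff().abs().fillna(0)
)
print(filled)
```

With zero-filling, a month without any updates shows up as two large swings (into and out of the gap), which changes the volatility profile considerably, so the choice of fill value should be made deliberately.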


## Step C: Anomaly Detection Using Unsupervised Learning

Since fraudulent behavior does not follow fixed thresholds, we applied an unsupervised anomaly detection approach to identify abnormal biometric activity patterns. Isolation Forest builds an ensemble of randomly partitioned trees and scores each observation by how few splits are needed to isolate it; pincode–month records that are isolated quickly are treated as anomalous. No labeled fraud examples are required.

The model flags pincode–month combinations that exhibit unusual combinations of biometric instability and volatility. These anomalies represent periods of heightened identity instability that may warrant further investigation or audit.

By relying on learned behavioral deviations rather than manual rules, this approach adapts to regional differences and evolving patterns in biometric update behavior.


In [72]:
features = agg_df[
    ['bio_instability_score', 'bio_volatility']
].copy()


We apply Isolation Forest to biometric instability features to identify anomalous pincode–month combinations. These anomalies represent periods of abnormal biometric correction activity that may indicate identity misuse or operational irregularities.

In [73]:
from sklearn.ensemble import IsolationForest

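iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.05,  # assumption: roughly 5% of pincode-months are anomalous
    random_state=42
)

# fit_predict returns -1 for anomalies and 1 for normal points
agg_df['anomaly_flag'] = iso_forest.fit_predict(features)
# decision_function: lower (negative) values are more anomalous
agg_df['anomaly_score'] = iso_forest.decision_function(features)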
agg_df[agg_df['anomaly_flag'] == -1][['pincode', 'month', 'biometric_risk',
        'anomaly_flag', 'anomaly_score']].head(10)

| | pincode | month | biometric_risk | anomaly_flag | anomaly_score |
|---|---|---|---|---|---|
| 17 | 110002 | 2025-12 | 1.75049 | -1 | -0.059287 |
| 32 | 110005 | 2025-04 | 1.057812 | -1 | -0.039346 |
| 33 | 110005 | 2025-05 | 0.657969 | -1 | -0.005016 |
| 36 | 110005 | 2025-09 | 0.571777 | -1 | -0.073123 |
| 37 | 110005 | 2025-10 | -0.527915 | -1 | -0.013882 |
| 43 | 110006 | 2025-06 | 0.190344 | -1 | -0.05095 |
| 45 | 110006 | 2025-09 | 0.006998 | -1 | -0.095634 |
| 46 | 110006 | 2025-10 | -0.631254 | -1 | -0.113097 |
| 48 | 110006 | 2025-12 | 1.323831 | -1 | -0.166947 |
| 50 | 110007 | 2025-04 | 1.189942 | -1 | -0.103184 |


In [74]:
agg_df.groupby('anomaly_flag').size()


anomaly_flag
-1      8450
 1    160546
dtype: int64
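The counts above (8,450 of 168,996 rows, about 5.0%) follow directly from `contamination=0.05`: the parameter sets the score threshold so that roughly that share of the training rows is flagged, whether or not they are truly unusual. The sketch below illustrates this on synthetic data; the data and variable names are made up for illustration and are not taken from the Aadhaar dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic two-feature data: a dense cluster plus a few far-away points
normal = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
outliers = rng.uniform(low=6.0, high=10.0, size=(50, 2))
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
flags = model.fit_predict(X)         # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous

# The flagged share tracks the contamination setting
flagged_share = (flags == -1).mean()
print(f"flagged share: {flagged_share:.3f}")
# Flagged rows are exactly those with a negative decision score
print("all flagged scores negative:", bool((scores[flags == -1] < 0).all()))
```

Because the flagged share is fixed in advance, the flag is best read as a ranking cut-off rather than evidence that 5% of pincode-months are problematic; `anomaly_score` carries the finer-grained ordering.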

In [75]:
agg_df.to_csv("data/biometric_anomalies.csv", index=False)


## Step D: Reinforcement Learning State Integration

We frame Aadhaar fraud mitigation as a **sequential decision-making problem**, where audit resources are limited and must be deployed strategically. Instead of reacting to individual anomalies, we design a system that learns which geographic regions should be audited to minimize long-term identity instability.

The reinforcement learning agent observes biometric instability signals, anomaly scores, and volatility metrics as its state. Actions correspond to selecting high-risk pincodes for audit, and rewards are defined as reductions in biometric risk in subsequent time periods.

This formulation enables proactive governance by optimizing audit decisions over time rather than relying on static or reactive inspection strategies.


In [76]:
state_cols = [
    'biometric_risk',        # composite risk score from Step B
    'anomaly_score',         # Isolation Forest score (lower = more anomalous)
    'bio_volatility',        # absolute month-to-month change in updates
    'bio_instability_score'  # within-pincode z-score of update counts
]

state_df = agg_df[state_cols].copy()


In [77]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
state_scaled = scaler.fit_transform(state_df)

state_df = pd.DataFrame(state_scaled, columns=state_cols)


In [78]:
state_df['pincode'] = agg_df['pincode'].values
state_df['month'] = agg_df['month'].values


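**Action space:** At each time step, the agent selects the top K pincodes for audit, ranked by predicted risk. For `anomaly_score`, lower values are more anomalous, so ranking on that column means taking the lowest scores.

**Reward:**

reward = biometric_risk_current_month - biometric_risk_next_month

A positive reward means that biometric risk fell in the month after the audit decision.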
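The reward definition can be computed per pincode with a grouped shift. This is a minimal sketch on made-up values, not part of the original pipeline; it assumes rows are sorted by month within each pincode and that the `YYYY-MM` strings sort chronologically.

```python
import pandas as pd

# Made-up risk values for two pincodes, for illustration only
toy = pd.DataFrame({
    "pincode": [110001, 110001, 110001, 110002, 110002],
    "month": ["2025-03", "2025-04", "2025-05", "2025-03", "2025-04"],
    "biometric_risk": [0.50, 0.60, 0.30, 0.40, 0.70],
})

toy = toy.sort_values(["pincode", "month"]).reset_index(drop=True)
# Next month's risk within the same pincode; NaN for each pincode's last month
next_risk = toy.groupby("pincode")["biometric_risk"].shift(-1)
# Positive reward = risk decreased in the following month
toy["reward"] = toy["biometric_risk"] - next_risk
print(toy)
```

The last observed month of every pincode has no successor, so its reward is undefined (`NaN`) and would need to be excluded or treated as a terminal state in any training loop.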


In [79]:
state_df.to_csv("rl_state_pincode_month.csv", index=False)
state_df.head()

| | biometric_risk | anomaly_score | bio_volatility | bio_instability_score | pincode | month |
|---|---|---|---|---|---|---|
| 0 | 0.501561 | 0.675289 | 0.0 | 0.854282 | 110001 | 2025-03 |
| 1 | 0.607124 | 0.864507 | 0.014797 | 0.508129 | 110001 | 2025-04 |
| 2 | 0.301494 | 0.954365 | 0.0 | 0.508129 | 110001 | 2025-05 |
| 3 | 0.307514 | 0.938287 | 0.000176 | 0.51225 | 110001 | 2025-06 |
| 4 | 0.526623 | 0.867814 | 0.006518 | 0.664722 | 110001 | 2025-07 |


## Conclusion

This pipeline demonstrates how aggregated Aadhaar biometric update data can be transformed into actionable intelligence for fraud risk assessment. By combining temporal aggregation, instability modeling, unsupervised anomaly detection, and reinforcement learning concepts, the system moves beyond descriptive analytics toward decision-oriented governance.

The approach does not label individual Aadhaar records as fraudulent. Instead, it identifies statistically abnormal identity behavior at regional and temporal levels, supporting targeted audits and system improvements while preserving privacy.
