## 1. Setup & Load

In [27]:
import pandas as pd
import numpy as np
from pathlib import Path

# Paths
PROCESSED_DIR = Path("../data/processed")
INPUT_FILE = PROCESSED_DIR / "district_month_panel_duckdb.csv"

# Load panel
panel = pd.read_csv(INPUT_FILE)
print(f"Loaded panel from: {INPUT_FILE}")
print(f"panel.shape: {panel.shape}")

Loaded panel from: ..\data\processed\district_month_panel_duckdb.csv
panel.shape: (4355, 10)


In [3]:
import sys
print(sys.executable)

c:\MyProjects\uidai-asris\.venv\Scripts\python.exe


In [28]:
panel.head()

Unnamed: 0,state,district,year_month,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_
0,Kerala,Ernakulam,2025-12,679.0,243.0,87.0,1282.0,16957.0,8645.0,8794.0
1,Madhya Pradesh,Chhindwara,2025-12,1813.0,100.0,2.0,1446.0,8951.0,11688.0,9179.0
2,Madhya Pradesh,Satna,2025-12,1845.0,281.0,58.0,1640.0,10514.0,11263.0,9590.0
3,Madhya Pradesh,Tikamgarh,2025-12,1507.0,263.0,8.0,1324.0,7964.0,6769.0,6616.0
4,Maharashtra,Beed,2025-12,881.0,93.0,18.0,1924.0,21465.0,16845.0,12408.0


In [5]:
panel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4355 entries, 0 to 4354
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   state           4355 non-null   object 
 1   district        4355 non-null   object 
 2   year_month      4355 non-null   object 
 3   age_0_5         4355 non-null   float64
 4   age_5_17        4355 non-null   float64
 5   age_18_greater  4355 non-null   float64
 6   demo_age_5_17   4355 non-null   float64
 7   demo_age_17_    4355 non-null   float64
 8   bio_age_5_17    4355 non-null   float64
 9   bio_age_17_     4355 non-null   float64
dtypes: float64(7), object(3)
memory usage: 340.4+ KB


In [29]:
# Create (or overwrite) the aggregate total columns
panel["total_enrolment"] = panel["age_0_5"] + panel["age_5_17"] + panel["age_18_greater"]
panel["total_demo_updates"] = panel["demo_age_5_17"] + panel["demo_age_17_"]
panel["total_bio_updates"] = panel["bio_age_5_17"] + panel["bio_age_17_"]

print("Aggregate columns created/updated:")
print("  - total_enrolment")
print("  - total_demo_updates")
print("  - total_bio_updates")
print(f"\npanel.shape: {panel.shape}")

Aggregate columns created/updated:
  - total_enrolment
  - total_demo_updates
  - total_bio_updates

panel.shape: (4355, 13)


---
## 2. Time Handling & Ordering

Convert `year_month` (string "YYYY-MM") to a proper datetime and sort by district + time.

In [7]:
# Convert year_month string to datetime (first day of month)
panel["month_date"] = pd.to_datetime(panel["year_month"] + "-01", format="%Y-%m-%d")

print("month_date column created")
print(f"Date range: {panel['month_date'].min()} to {panel['month_date'].max()}")

month_date column created
Date range: 2025-03-01 00:00:00 to 2025-12-01 00:00:00


In [8]:
# Sort by district and month_date to establish correct time order per district
panel = panel.sort_values(["district", "month_date"]).reset_index(drop=True)

print("Panel sorted by [district, month_date]")
print(f"panel.shape: {panel.shape}")

Panel sorted by [district, month_date]
panel.shape: (4355, 14)


In [9]:
# Sanity check: verify time ordering for first 3 unique districts
sample_districts = panel["district"].unique()[:3]

print("Sanity check: time ordering for sample districts\n")
for dist in sample_districts:
    subset = panel[panel["district"] == dist][["state", "district", "year_month", "month_date", "total_enrolment"]]
    print(f"District: {dist}")
    print(subset.to_string(index=False))
    print()

Sanity check: time ordering for sample districts

District: ANUGUL
 state district year_month month_date  total_enrolment
Odisha   ANUGUL    2025-09 2025-09-01              3.0
Odisha   ANUGUL    2025-10 2025-10-01              3.0
Odisha   ANUGUL    2025-11 2025-11-01              3.0
Odisha   ANUGUL    2025-12 2025-12-01              4.0

District: Adilabad
         state district year_month month_date  total_enrolment
Andhra Pradesh Adilabad    2025-09 2025-09-01            654.0
     Telangana Adilabad    2025-09 2025-09-01           1211.0
     Telangana Adilabad    2025-10 2025-10-01            675.0
Andhra Pradesh Adilabad    2025-10 2025-10-01            308.0
     Telangana Adilabad    2025-11 2025-11-01            700.0
Andhra Pradesh Adilabad    2025-11 2025-11-01            349.0
Andhra Pradesh Adilabad    2025-12 2025-12-01            108.0
     Telangana Adilabad    2025-12 2025-12-01            401.0

District: Agar Malwa
         state   district year_month month_date  
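The Adilabad rows in the output above appear under both Andhra Pradesh and Telangana, so `district` on its own does not identify a single time series. A minimal sketch of how this could be checked, using a toy frame whose values are copied from that output; the names `toy`, `states_per_district`, and `ambiguous` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy rows copied from the sanity-check output above (not the full panel).
toy = pd.DataFrame({
    "state": ["Odisha", "Andhra Pradesh", "Telangana", "Telangana", "Andhra Pradesh"],
    "district": ["ANUGUL", "Adilabad", "Adilabad", "Adilabad", "Adilabad"],
    "year_month": ["2025-09", "2025-09", "2025-09", "2025-10", "2025-10"],
    "total_enrolment": [3.0, 654.0, 1211.0, 675.0, 308.0],
})

# How many distinct states does each district name appear under?
states_per_district = toy.groupby("district")["state"].nunique()
ambiguous = states_per_district[states_per_district > 1].index.tolist()
print("District names shared across states:", ambiguous)

# Keyed on (district, year_month) the toy panel has repeated rows;
# adding state to the key removes them.
dup_without_state = int(toy.duplicated(["district", "year_month"]).sum())
dup_with_state = int(toy.duplicated(["state", "district", "year_month"]).sum())
print("Duplicate keys without state:", dup_without_state)
print("Duplicate keys with state:", dup_with_state)
```

If the same check on the full panel reports shared names, treating `(state, district)` as the series key in the later `groupby` steps is one way to keep each series separate.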

---
## 3. Target Construction: Next-Month Total Enrolment

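For each district, shift `total_enrolment` by -1 to get the value from the next month. This relies on the panel already being sorted by `month_date` within each district, because `shift` works on row order, not on calendar dates.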

**Logic:**
- `target_enrolment_next_month` = the enrolment value from month $t+1$
- We use `shift(-1)` within each district group
- The last month per district has no next month, so its target is NaN; we drop these rows
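The logic above can be sketched on a few Adilabad rows taken from the Section 2 sanity check, where the same district name occurs in two states. The second column groups by `(state, district)`; that composite key is an assumption about what one series should be, not what the notebook cell below does.

```python
import pandas as pd

# Four Adilabad rows from the sanity-check output, in the same row order.
toy = pd.DataFrame({
    "state": ["Andhra Pradesh", "Telangana", "Telangana", "Andhra Pradesh"],
    "district": ["Adilabad"] * 4,
    "year_month": ["2025-09", "2025-09", "2025-10", "2025-10"],
    "total_enrolment": [654.0, 1211.0, 675.0, 308.0],
})

# Grouping by district alone: the "next" row may belong to the other state.
toy["next_by_district"] = toy.groupby("district")["total_enrolment"].shift(-1)

# Grouping by (state, district): the next row is the same series, one step later.
toy["next_by_state_district"] = (
    toy.groupby(["state", "district"])["total_enrolment"].shift(-1)
)

print(toy.to_string(index=False))
```

In the first column the Andhra Pradesh row for 2025-09 is paired with a Telangana value from the same month, while the second column pairs it with Andhra Pradesh's own 2025-10 value.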

In [10]:
# Create target: next-month total_enrolment (shift -1 within each district)
panel["target_enrolment_next_month"] = (
    panel
    .groupby("district")["total_enrolment"]
    .shift(-1)  # negative shift = future value
)

print("Target column created: target_enrolment_next_month")
print(f"Rows with NaN target (last month per district): {panel['target_enrolment_next_month'].isna().sum()}")

Target column created: target_enrolment_next_month
Rows with NaN target (last month per district): 939


In [11]:
# Verify target alignment: show sample districts with current and next-month values
verify_cols = ["district", "year_month", "total_enrolment", "target_enrolment_next_month"]

print("Target alignment verification (sample districts):\n")
for dist in sample_districts[:2]:
    subset = panel[panel["district"] == dist][verify_cols]
    print(f"District: {dist}")
    print(subset.to_string(index=False))
    print()

Target alignment verification (sample districts):

District: ANUGUL
district year_month  total_enrolment  target_enrolment_next_month
  ANUGUL    2025-09              3.0                          3.0
  ANUGUL    2025-10              3.0                          3.0
  ANUGUL    2025-11              3.0                          4.0
  ANUGUL    2025-12              4.0                          NaN

District: Adilabad
district year_month  total_enrolment  target_enrolment_next_month
Adilabad    2025-09            654.0                       1211.0
Adilabad    2025-09           1211.0                        675.0
Adilabad    2025-10            675.0                        308.0
Adilabad    2025-10            308.0                        700.0
Adilabad    2025-11            700.0                        349.0
Adilabad    2025-11            349.0                        108.0
Adilabad    2025-12            108.0                        401.0
Adilabad    2025-12            401.0                          NaN

In [12]:
# Drop rows where target is NaN (last month per district has no next-month label)
rows_before = len(panel)
panel = panel.dropna(subset=["target_enrolment_next_month"]).reset_index(drop=True)
rows_after = len(panel)

print(f"Dropped {rows_before - rows_after} rows with NaN target")
print(f"Remaining rows: {rows_after}")

Dropped 939 rows with NaN target
Remaining rows: 3416


---
## 4. Feature Engineering

Build the feature set:
- **Current month raw counts:** all 7 original count columns + 3 totals
- **Lag features:** previous month values for the 3 totals (`groupby` + `shift(1)`)
- **Trend features:** diff between current and previous month
- **Time features:** year, month extracted from month_date
- **Categorical:** keep `state` as-is for later encoding
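One point worth checking for the lag and trend features: `shift(1)` returns the previous row, which is the previous calendar month only when a series has no missing months. A minimal sketch of such a check on invented toy data; the column names `month_idx`, `months_since_prev`, and `prev_1_checked` are hypothetical, as is the `(state, district)` key.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["S1"] * 3 + ["S2"] * 3,
    "district": ["A"] * 3 + ["B"] * 3,
    "month_date": pd.to_datetime([
        "2025-09-01", "2025-10-01", "2025-11-01",  # A: consecutive months
        "2025-09-01", "2025-11-01", "2025-12-01",  # B: October is missing
    ]),
    "total_enrolment": [10.0, 12.0, 11.0, 5.0, 7.0, 6.0],
})
keys = ["state", "district"]
toy = toy.sort_values(keys + ["month_date"]).reset_index(drop=True)

# Integer month counter, so consecutive months differ by exactly 1.
toy["month_idx"] = toy["month_date"].dt.year * 12 + toy["month_date"].dt.month
toy["months_since_prev"] = toy.groupby(keys)["month_idx"].diff()

# Plain lag, and a lag that is kept only when the previous row
# really is the previous calendar month.
toy["prev_1"] = toy.groupby(keys)["total_enrolment"].shift(1)
toy["prev_1_checked"] = toy["prev_1"].where(toy["months_since_prev"] == 1)

print(toy[keys + ["month_date", "months_since_prev", "prev_1", "prev_1_checked"]])
```

For district B the November row has `months_since_prev` equal to 2, so its checked lag is NaN instead of silently reusing the September value.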

In [13]:
# Lag features: previous month values (shift +1 = look back)
# Note: We use shift(1) which means "value from 1 period ago"

panel["total_enrolment_prev_1"] = (
    panel.groupby("district")["total_enrolment"].shift(1)
)
panel["total_demo_updates_prev_1"] = (
    panel.groupby("district")["total_demo_updates"].shift(1)
)
panel["total_bio_updates_prev_1"] = (
    panel.groupby("district")["total_bio_updates"].shift(1)
)

print("Lag features created:")
print("  - total_enrolment_prev_1")
print("  - total_demo_updates_prev_1")
print("  - total_bio_updates_prev_1")

Lag features created:
  - total_enrolment_prev_1
  - total_demo_updates_prev_1
  - total_bio_updates_prev_1


In [14]:
# Trend features: difference between current and previous month
panel["enrolment_diff_1"] = panel["total_enrolment"] - panel["total_enrolment_prev_1"]
panel["demo_updates_diff_1"] = panel["total_demo_updates"] - panel["total_demo_updates_prev_1"]
panel["bio_updates_diff_1"] = panel["total_bio_updates"] - panel["total_bio_updates_prev_1"]

print("Trend (diff) features created:")
print("  - enrolment_diff_1")
print("  - demo_updates_diff_1")
print("  - bio_updates_diff_1")

Trend (diff) features created:
  - enrolment_diff_1
  - demo_updates_diff_1
  - bio_updates_diff_1


In [15]:
# Time-derived features
panel["year"] = panel["month_date"].dt.year
panel["month"] = panel["month_date"].dt.month

print("Time features created:")
print("  - year")
print("  - month")

Time features created:
  - year
  - month


In [16]:
# Drop rows where lag features are NaN (first month per district has no previous month)
lag_cols = ["total_enrolment_prev_1", "total_demo_updates_prev_1", "total_bio_updates_prev_1"]

rows_before = len(panel)
panel = panel.dropna(subset=lag_cols).reset_index(drop=True)
rows_after = len(panel)

print(f"Dropped {rows_before - rows_after} rows with NaN lag features (first month per district)")
print(f"Remaining rows: {rows_after}")

Dropped 920 rows with NaN lag features (first month per district)
Remaining rows: 2496


In [17]:
# Verify feature engineering: show sample rows
feature_sample_cols = [
    "district", "year_month",
    "total_enrolment", "total_enrolment_prev_1", "enrolment_diff_1",
    "target_enrolment_next_month"
]

print("Feature engineering verification (sample):")
panel[feature_sample_cols].head(10)

Feature engineering verification (sample):


Unnamed: 0,district,year_month,total_enrolment,total_enrolment_prev_1,enrolment_diff_1,target_enrolment_next_month
0,ANUGUL,2025-10,3.0,3.0,0.0,3.0
1,ANUGUL,2025-11,3.0,3.0,0.0,4.0
2,Adilabad,2025-09,1211.0,654.0,557.0,675.0
3,Adilabad,2025-10,675.0,1211.0,-536.0,308.0
4,Adilabad,2025-10,308.0,675.0,-367.0,700.0
5,Adilabad,2025-11,700.0,308.0,392.0,349.0
6,Adilabad,2025-11,349.0,700.0,-351.0,108.0
7,Adilabad,2025-12,108.0,349.0,-241.0,401.0
8,Agar Malwa,2025-10,340.0,583.0,-243.0,955.0
9,Agar Malwa,2025-11,955.0,340.0,615.0,659.0


---
## 5. Data Quality & Leakage Checks

Verify:
1. No NaNs in target or feature columns
2. No label leakage: all features come from time $t$ or earlier; target is from $t+1$

In [18]:
# Define feature columns for modeling
feature_cols = [
    # Categorical
    "state",
    # Current month raw counts
    "age_0_5", "age_5_17", "age_18_greater",
    "demo_age_5_17", "demo_age_17_",
    "bio_age_5_17", "bio_age_17_",
    # Current month totals
    "total_enrolment", "total_demo_updates", "total_bio_updates",
    # Lag features (from previous month)
    "total_enrolment_prev_1", "total_demo_updates_prev_1", "total_bio_updates_prev_1",
    # Trend features
    "enrolment_diff_1", "demo_updates_diff_1", "bio_updates_diff_1",
    # Time features
    "year", "month"
]

target_col = "target_enrolment_next_month"

print(f"Feature columns ({len(feature_cols)}): {feature_cols}")
print(f"Target column: {target_col}")

Feature columns (19): ['state', 'age_0_5', 'age_5_17', 'age_18_greater', 'demo_age_5_17', 'demo_age_17_', 'bio_age_5_17', 'bio_age_17_', 'total_enrolment', 'total_demo_updates', 'total_bio_updates', 'total_enrolment_prev_1', 'total_demo_updates_prev_1', 'total_bio_updates_prev_1', 'enrolment_diff_1', 'demo_updates_diff_1', 'bio_updates_diff_1', 'year', 'month']
Target column: target_enrolment_next_month


In [19]:
# Check for NaNs in features and target
all_model_cols = feature_cols + [target_col]
nan_counts = panel[all_model_cols].isna().sum()

print("NaN counts in modeling columns:")
print(nan_counts[nan_counts > 0] if nan_counts.sum() > 0 else "No NaNs found ✓")

NaN counts in modeling columns:
No NaNs found ✓


In [20]:
# Confirm target has no NaNs
assert panel[target_col].isna().sum() == 0, "ERROR: Target column has NaN values!"
print(f"✓ Target column '{target_col}' has no NaN values")

# Confirm numeric features have no NaNs
numeric_features = [c for c in feature_cols if c != "state"]
nan_in_features = panel[numeric_features].isna().sum().sum()
assert nan_in_features == 0, f"ERROR: {nan_in_features} NaN values in feature columns!"
print("✓ All numeric feature columns have no NaN values")

✓ Target column 'target_enrolment_next_month' has no NaN values
✓ All numeric feature columns have no NaN values


### Leakage Verification

**No label leakage exists because:**

| Feature Type | Time Reference | Safe? |
|--------------|----------------|-------|
| Current month counts (`age_0_5`, `total_enrolment`, etc.) | Time $t$ | ✓ |
| Lag features (`*_prev_1`) | Time $t-1$ | ✓ |
| Diff features (`*_diff_1`) | Computed from $t$ and $t-1$ | ✓ |
| Time features (`year`, `month`) | Time $t$ metadata | ✓ |
| **Target** (`target_enrolment_next_month`) | **Time $t+1$** | Target |

All features are derived from information available at or before time $t$. The target is strictly from time $t+1$.
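The table argues from how the columns were built. The same claim can also be checked mechanically by rebuilding the next-month value with a date-based self-join and comparing it to the shifted target. A minimal sketch on invented toy data, assuming `(state, district)` identifies a series and `month_date` is the first day of the month; `lookup`, `true_next`, and `mismatches` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["S1"] * 3,
    "district": ["A"] * 3,
    "month_date": pd.to_datetime(["2025-09-01", "2025-10-01", "2025-11-01"]),
    "total_enrolment": [10.0, 12.0, 11.0],
})
keys = ["state", "district"]

# Target built the same way as in Section 3 (row-based shift).
toy["target_enrolment_next_month"] = (
    toy.groupby(keys)["total_enrolment"].shift(-1)
)

# Date-based lookup: re-key each observation to the month before it,
# so a merge on month_date attaches the value from one calendar month later.
lookup = toy[keys + ["month_date", "total_enrolment"]].copy()
lookup["month_date"] = lookup["month_date"] - pd.DateOffset(months=1)
lookup = lookup.rename(columns={"total_enrolment": "true_next"})

checked = toy.merge(lookup, on=keys + ["month_date"], how="left")

# Compare only rows that have a label; the last month has none.
labelled = checked.dropna(subset=["target_enrolment_next_month"])
mismatches = int(
    (labelled["target_enrolment_next_month"] != labelled["true_next"]).sum()
)
print("Rows where shifted target != calendar next month:", mismatches)
```

A non-zero count on the real panel would point to duplicated keys or missing months rather than to leakage from the features themselves.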

---
## 6. Final Feature/Target Selection

In [21]:
# Create X (features) and y (target)
X = panel[feature_cols].copy()
y = panel[target_col].copy()

print(f"X.shape: {X.shape}")
print(f"y.shape: {y.shape}")
print(f"\nNumber of rows in modeling dataset: {len(X)}")
print(f"Number of feature columns: {len(feature_cols)}")

X.shape: (2496, 19)
y.shape: (2496,)

Number of rows in modeling dataset: 2496
Number of feature columns: 19


In [22]:
# Preview X
print("X.head():")
X.head()

X.head():


Unnamed: 0,state,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,total_enrolment,total_demo_updates,total_bio_updates,total_enrolment_prev_1,total_demo_updates_prev_1,total_bio_updates_prev_1,enrolment_diff_1,demo_updates_diff_1,bio_updates_diff_1,year,month
0,Odisha,3.0,0.0,0.0,2.0,13.0,6.0,14.0,3.0,15.0,20.0,3.0,48.0,14.0,0.0,-33.0,6.0,2025,10
1,Odisha,3.0,0.0,0.0,0.0,34.0,0.0,11.0,3.0,34.0,11.0,3.0,15.0,20.0,0.0,19.0,-9.0,2025,11
2,Telangana,1022.0,189.0,0.0,684.0,6817.0,3835.0,4208.0,1211.0,7501.0,8043.0,654.0,5589.0,7619.0,557.0,1912.0,424.0,2025,9
3,Telangana,574.0,101.0,0.0,966.0,5494.0,6829.0,4438.0,675.0,6460.0,11267.0,1211.0,7501.0,8043.0,-536.0,-1041.0,3224.0,2025,10
4,Andhra Pradesh,236.0,71.0,1.0,459.0,4477.0,2245.0,5259.0,308.0,4936.0,7504.0,675.0,6460.0,11267.0,-367.0,-1524.0,-3763.0,2025,10


In [23]:
# Preview y
print("y.head():")
print(y.head())
print("\ny statistics:")
print(y.describe())

y.head():
0      3.0
1      4.0
2    675.0
3    308.0
4    700.0
Name: target_enrolment_next_month, dtype: float64

y statistics:
count    2496.000000
mean     1141.884615
std      1306.793938
min         1.000000
25%       166.000000
50%       735.500000
75%      1638.500000
max      9131.000000
Name: target_enrolment_next_month, dtype: float64


---
## 7. Export Modeling-Ready Dataset

Save a single CSV with:
- All feature columns
- Target column
- Key columns for traceability (state, district, year_month, month_date)
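CSV has no datetime type, so `month_date` comes back as text unless it is parsed on load. A minimal sketch of the round trip, using an in-memory buffer in place of the real output file; the single row is copied from the earlier previews.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "state": ["Odisha"],
    "district": ["ANUGUL"],
    "year_month": ["2025-10"],
    "month_date": pd.to_datetime(["2025-10-01"]),
    "target_enrolment_next_month": [3.0],
})

# Write to an in-memory buffer (stand-in for the CSV on disk).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read it back twice: once as-is, once asking pandas to parse the date column.
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["month_date"])

print("Without parse_dates:", plain["month_date"].dtype)
print("With parse_dates:   ", parsed["month_date"].dtype)
```

A downstream notebook that filters or splits on `month_date` would likely want the `parse_dates` form.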

In [24]:
# Define columns to include in the modeling dataset
key_cols = ["state", "district", "year_month", "month_date"]
export_cols = key_cols + [c for c in feature_cols if c not in key_cols] + [target_col]

# Create model_df
model_df = panel[export_cols].copy()

print(f"model_df.shape: {model_df.shape}")
print(f"Columns: {model_df.columns.tolist()}")

model_df.shape: (2496, 23)
Columns: ['state', 'district', 'year_month', 'month_date', 'age_0_5', 'age_5_17', 'age_18_greater', 'demo_age_5_17', 'demo_age_17_', 'bio_age_5_17', 'bio_age_17_', 'total_enrolment', 'total_demo_updates', 'total_bio_updates', 'total_enrolment_prev_1', 'total_demo_updates_prev_1', 'total_bio_updates_prev_1', 'enrolment_diff_1', 'demo_updates_diff_1', 'bio_updates_diff_1', 'year', 'month', 'target_enrolment_next_month']


In [25]:
# Save to CSV
output_path = PROCESSED_DIR / "district_month_modeling.csv"
model_df.to_csv(output_path, index=False)

print(f"\n✓ Modeling dataset saved to: {output_path.resolve()}")
print(f"  Rows: {model_df.shape[0]}")
print(f"  Columns: {model_df.shape[1]}")


✓ Modeling dataset saved to: C:\MyProjects\uidai-asris\data\processed\district_month_modeling.csv
  Rows: 2496
  Columns: 23


In [26]:
# Final preview
model_df.head(10)

Unnamed: 0,state,district,year_month,month_date,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,...,total_bio_updates,total_enrolment_prev_1,total_demo_updates_prev_1,total_bio_updates_prev_1,enrolment_diff_1,demo_updates_diff_1,bio_updates_diff_1,year,month,target_enrolment_next_month
0,Odisha,ANUGUL,2025-10,2025-10-01,3.0,0.0,0.0,2.0,13.0,6.0,...,20.0,3.0,48.0,14.0,0.0,-33.0,6.0,2025,10,3.0
1,Odisha,ANUGUL,2025-11,2025-11-01,3.0,0.0,0.0,0.0,34.0,0.0,...,11.0,3.0,15.0,20.0,0.0,19.0,-9.0,2025,11,4.0
2,Telangana,Adilabad,2025-09,2025-09-01,1022.0,189.0,0.0,684.0,6817.0,3835.0,...,8043.0,654.0,5589.0,7619.0,557.0,1912.0,424.0,2025,9,675.0
3,Telangana,Adilabad,2025-10,2025-10-01,574.0,101.0,0.0,966.0,5494.0,6829.0,...,11267.0,1211.0,7501.0,8043.0,-536.0,-1041.0,3224.0,2025,10,308.0
4,Andhra Pradesh,Adilabad,2025-10,2025-10-01,236.0,71.0,1.0,459.0,4477.0,2245.0,...,7504.0,675.0,6460.0,11267.0,-367.0,-1524.0,-3763.0,2025,10,700.0
5,Telangana,Adilabad,2025-11,2025-11-01,577.0,121.0,2.0,1331.0,6819.0,7183.0,...,11366.0,308.0,4936.0,7504.0,392.0,3214.0,3862.0,2025,11,349.0
6,Andhra Pradesh,Adilabad,2025-11,2025-11-01,286.0,63.0,0.0,764.0,6371.0,1420.0,...,5560.0,700.0,8150.0,11366.0,-351.0,-1015.0,-5806.0,2025,11,108.0
7,Andhra Pradesh,Adilabad,2025-12,2025-12-01,92.0,16.0,0.0,635.0,6656.0,1641.0,...,6390.0,349.0,7135.0,5560.0,-241.0,156.0,830.0,2025,12,401.0
8,Madhya Pradesh,Agar Malwa,2025-10,2025-10-01,275.0,65.0,0.0,48.0,446.0,185.0,...,584.0,583.0,991.0,1424.0,-243.0,-497.0,-840.0,2025,10,955.0
9,Madhya Pradesh,Agar Malwa,2025-11,2025-11-01,868.0,86.0,1.0,102.0,1343.0,339.0,...,1371.0,340.0,494.0,584.0,615.0,951.0,787.0,2025,11,659.0


---
## Phase 3.2 Summary

**What was done in this notebook:**

1. **Target defined:** `target_enrolment_next_month` = total enrolment for month $t+1$ per district
2. **Features engineered:**
   - 7 raw count columns from current month
   - 3 total columns (`total_enrolment`, `total_demo_updates`, `total_bio_updates`)
   - 3 lag-1 features (previous month totals)
   - 3 diff-1 features (month-over-month change)
   - 2 time features (`year`, `month`)
   - 1 categorical (`state`)
3. **Quality verified:** No NaNs in features or target after dropping boundary rows
4. **Leakage verified:** All features from time $t$ or earlier; target strictly from $t+1$
5. **Output:** `district_month_modeling.csv` with features + target + keys

---

## Next Phase: 05_model_baseline.ipynb

The next notebook will:
- Load `district_month_modeling.csv`
- Create train/validation/test splits (time-aware to avoid leakage)
- Train baseline models:
  - Linear regression
  - Decision tree / Random Forest
- Evaluate metrics: MAE, RMSE, R²
- Establish baseline performance before advanced modeling
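As a rough illustration of the time-aware split and the three metrics listed above, here is a minimal sketch on invented toy data. The cutoff date is hypothetical, and the persistence prediction (next month equals current month) is only a stand-in so the metrics have something to score; it is not one of the baseline models named in the list.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "month_date": pd.to_datetime([
        "2025-09-01", "2025-10-01", "2025-11-01",
        "2025-09-01", "2025-10-01", "2025-11-01",
    ]),
    "total_enrolment": [10.0, 12.0, 11.0, 100.0, 90.0, 95.0],
    "target_enrolment_next_month": [12.0, 11.0, 13.0, 90.0, 95.0, 97.0],
})

# Time-aware split: rows before the (hypothetical) cutoff train,
# rows at or after it are held out. No shuffling across time.
cutoff = pd.Timestamp("2025-11-01")
train = toy[toy["month_date"] < cutoff]
test = toy[toy["month_date"] >= cutoff]

# Stand-in prediction: persistence (next month = current month).
y_true = test["target_enrolment_next_month"].to_numpy()
y_pred = test["total_enrolment"].to_numpy()

# MAE, RMSE and R² written out with NumPy.
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"train rows: {len(train)}, test rows: {len(test)}")
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.4f}")
```

Splitting on a date keeps every training row strictly earlier than every held-out row, which is the property a random row-level split would not guarantee.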