Process:
1. Clean: Standardize Base_Ministry names (trim and collapse whitespace).
2. Reshape: Melt fiscal-year (FY) columns into long format (Fiscal_Year, Budget_Amount).
3. Map: Join with ministry_to_sector12.csv to assign each ministry a Sector_12.
4. Filter: Drop rows with no sector mapping (unmapped ministries).
5. Aggregate: Group by (Sector_12, Fiscal_Year) and sum the budget values.

Output:
data/sector_budget_timeseries.csv
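
The five steps can be sketched on a toy table before running them on the real files. This is a minimal illustration, not the notebook's actual data: the ministry names, the mapping rows, and the amounts are made up, and `long_df`, `mapped`, and `sector_ts` are just illustrative variable names.

```python
import pandas as pd

# Hypothetical toy inputs; names and amounts are invented for illustration.
budgets = pd.DataFrame({
    "Base_Ministry": ["Ministry of  Agriculture ", "Ministry of Defence", "Ministry of Fisheries"],
    "05-06": [100.0, 300.0, 20.0],
    "06-07": [110.0, 320.0, None],
})
mapping = pd.DataFrame({
    "Base_Ministry": ["Ministry of Agriculture", "Ministry of Fisheries"],
    "Sector_12": ["agriculture forestry fishing", "agriculture forestry fishing"],
})

# Clean: collapse repeated whitespace and trim the ends.
budgets["Base_Ministry"] = (
    budgets["Base_Ministry"].str.replace(r"\s+", " ", regex=True).str.strip()
)

# Reshape: one row per (ministry, fiscal year); drop empty cells.
long_df = budgets.melt(
    id_vars="Base_Ministry", var_name="Fiscal_Year", value_name="Budget_Amount"
).dropna(subset=["Budget_Amount"])

# Map + filter: a left join leaves Sector_12 missing for unmapped ministries.
merged = long_df.merge(mapping, on="Base_Ministry", how="left")
mapped = merged.dropna(subset=["Sector_12"])

# Aggregate: total budget per sector and fiscal year.
sector_ts = mapped.groupby(["Sector_12", "Fiscal_Year"], as_index=False)["Budget_Amount"].sum()
print(sector_ts)
```

In this toy case the unmapped "Ministry of Defence" rows are dropped, and the two agriculture-related ministries are summed into a single sector row per fiscal year.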

In [48]:
# Step 1: Build sector budget time series → data/sector_budget_timeseries.csv

import re
import numpy as np
import pandas as pd
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
IN_BUDGETS = DATA / "standardized_budget_time_series.csv"
IN_MAP     = DATA / "ministry_to_sector12.csv"
OUT_SECTOR = DATA / "sector_budget_timeseries.csv"

def tidy_ministry(s: str) -> str:
    return re.sub(r"\s+", " ", str(s)).strip()

# Load budgets
dfb = pd.read_csv(IN_BUDGETS, dtype={"Base_Ministry": "string"})
dfb["Base_Ministry"] = dfb["Base_Ministry"].map(tidy_ministry)

# Identify FY columns like 05-06, 23-24, etc.
fy_cols = [c for c in dfb.columns if re.fullmatch(r"\d{2}-\d{2}", str(c))]
if not fy_cols:
    raise RuntimeError("No fiscal-year columns found in standardized_budget_time_series.csv")

# Sort FY columns by end year (…05-06, 06-07, …, 24-25)
def fy_end_year(col: str) -> int:
    end = int(str(col).split("-")[1])
    return 2000 + end
fy_cols = sorted(fy_cols, key=fy_end_year)

# Coerce numbers (remove commas/spaces), sum duplicates per ministry if any
dfb[fy_cols] = (dfb[fy_cols]
                .replace(r"[,\s]", "", regex=True)
                .replace("", np.nan)
                .apply(pd.to_numeric, errors="coerce"))
dfb = dfb.groupby("Base_Ministry", as_index=False)[fy_cols].sum(min_count=1)

# Melt to long
long = dfb.melt(
    id_vars="Base_Ministry",
    value_vars=fy_cols,
    var_name="Fiscal_Year",
    value_name="Budget_Amount"
).dropna(subset=["Budget_Amount"])

# Load mapping and clean keys
map_df = pd.read_csv(IN_MAP, dtype={"Base_Ministry": "string", "Sector_12": "string"})
map_df["Base_Ministry"] = map_df["Base_Ministry"].map(tidy_ministry)
map_df["Sector_12"] = map_df["Sector_12"].str.strip()

# Join mapping
merged = long.merge(map_df, on="Base_Ministry", how="left")

# Report unmapped ministries
unmapped = merged[merged["Sector_12"].isna()]["Base_Ministry"].unique().tolist()
if unmapped:
    print(f"Warning: {len(unmapped)} ministries have no Sector_12 mapping. Dropping these rows.")
    print("Examples:", unmapped[:10])

# Drop unmapped and aggregate to Sector_12
merged = merged.dropna(subset=["Sector_12"])
sector_ts = (merged.groupby(["Sector_12", "Fiscal_Year"], as_index=False)
                  .agg(Budget_Amount=("Budget_Amount", "sum")))

# Sort and save
sector_ts = sector_ts.sort_values(["Sector_12", "Fiscal_Year"], key=lambda s: s if s.name!="Fiscal_Year" else s.str.split("-").str[1].astype(int))
sector_ts.to_csv(OUT_SECTOR, index=False)

# Summary
years = sorted(sector_ts["Fiscal_Year"].unique(), key=lambda s: int(s.split("-")[1]))
print(f"Saved: {OUT_SECTOR}")
print(f"Sectors: {sector_ts['Sector_12'].nunique()}, Years: {len(years)}, Rows: {len(sector_ts)}")
print("Year span:", years[0], "→", years[-1])
print(sector_ts.head(10))

Saved: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_budget_timeseries.csv
Sectors: 11, Years: 17, Rows: 187
Year span: 05-06 → 23-24
                      Sector_12 Fiscal_Year  Budget_Amount
0  agriculture forestry fishing       05-06       27237.86
1  agriculture forestry fishing       06-07       38240.82
2  agriculture forestry fishing       07-08       44713.38
3  agriculture forestry fishing       09-10       86589.42
4  agriculture forestry fishing       10-11       93610.87
5  agriculture forestry fishing       11-12      104371.66
6  agriculture forestry fishing       12-13      109450.59
7  agriculture forestry fishing       13-14      115835.73
8  agriculture forestry fishing       15-16       21637.71
9  agriculture forestry fishing       16-17       41329.78


2. Integrate macro indicators
* Input: data/macro_indicators_wb.csv
* Keep key columns: GDP_Growth_Rate, Inflation_CPI, Exchange_Rate_USD, Fiscal_Deficit_GDP, Global_GDP_Growth, Election_Year, High_Inflation, GDP_Growth_Lag1, Inflation_Lag1.
* Left‑join on Fiscal_Year.
* Output: data/sector_budget_macro.csv
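
A left join on Fiscal_Year silently produces NaN macro values when a year is missing from the macro file. One way to make that visible, sketched here on made-up rows, is to pass `validate` and `indicator` to `DataFrame.merge`. The toy frames and the `unmatched_years` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frames; values are placeholders, not real indicators.
sector = pd.DataFrame({
    "Sector_12": ["defense security", "defense security", "food distribution"],
    "Fiscal_Year": ["05-06", "06-07", "05-06"],
    "Budget_Amount": [300.0, 320.0, 50.0],
})
macro = pd.DataFrame({
    "Fiscal_Year": ["05-06"],
    "GDP_Growth_Rate": [8.0],
})

# validate="many_to_one" raises if the macro table has duplicate fiscal years;
# indicator=True adds a _merge column showing which rows found a match.
joined = sector.merge(
    macro, on="Fiscal_Year", how="left", validate="many_to_one", indicator=True
)
unmatched_years = sorted(
    joined.loc[joined["_merge"] == "left_only", "Fiscal_Year"].unique()
)
joined = joined.drop(columns="_merge")

print("Fiscal years with no macro row:", unmatched_years)
```

The row count of `joined` matches `sector`, which is the property a left join on a unique key should preserve.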

In [49]:
# Step 2: Integrate macro indicators → data/sector_budget_macro.csv

import pandas as pd
import numpy as np
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
IN_SECTOR = DATA / "sector_budget_timeseries.csv"
IN_MACRO  = DATA / "macro_indicators_wb.csv"
OUT_MERGE = DATA / "sector_budget_macro.csv"

# Load sector time series
sector = pd.read_csv(IN_SECTOR, dtype={"Sector_12": "string", "Fiscal_Year": "string"})
sector["Budget_Amount"] = pd.to_numeric(sector["Budget_Amount"], errors="coerce")

# Load macros and keep required columns
macro_cols_pref = [
    "GDP_Growth_Rate","Inflation_CPI","Exchange_Rate_USD","Fiscal_Deficit_GDP",
    "Global_GDP_Growth","Election_Year","High_Inflation","GDP_Growth_Lag1","Inflation_Lag1"
]
mac = pd.read_csv(IN_MACRO, dtype={"Fiscal_Year": "string"})
keep_cols = ["Fiscal_Year"] + [c for c in macro_cols_pref if c in mac.columns]
mac = mac[keep_cols].copy()

# Coerce numerics
for c in keep_cols:
    if c != "Fiscal_Year":
        mac[c] = pd.to_numeric(mac[c], errors="coerce")

# If any duplicate FY in macro, average numerics
mac = mac.groupby("Fiscal_Year", as_index=False).mean(numeric_only=True)

# Merge (left)
merged = sector.merge(mac, on="Fiscal_Year", how="left")

# Column order
ordered = ["Sector_12","Fiscal_Year","Budget_Amount"] + [c for c in macro_cols_pref if c in merged.columns]
merged = merged[ordered]

# Save
merged.to_csv(OUT_MERGE, index=False)
print(f"Saved: {OUT_MERGE}")
print("Preview:")
print(merged.head(10))

# Simple QC
missing = merged[ordered[3:]].isna().sum()
if missing.any():
    print("\nMissing macro values by column:")
    print(missing[missing > 0].sort_values(ascending=False))

Saved: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_budget_macro.csv
Preview:
                      Sector_12 Fiscal_Year  Budget_Amount  GDP_Growth_Rate  \
0  agriculture forestry fishing       05-06       27237.86         8.060733   
1  agriculture forestry fishing       06-07       38240.82         7.660815   
2  agriculture forestry fishing       07-08       44713.38         3.086698   
3  agriculture forestry fishing       09-10       86589.42         8.497585   
4  agriculture forestry fishing       10-11       93610.87         5.241316   
5  agriculture forestry fishing       11-12      104371.66         5.456388   
6  agriculture forestry fishing       12-13      109450.59         6.386106   
7  agriculture forestry fishing       13-14      115835.73         7.410228   
8  agriculture forestry fishing       15-16       21637.71         8.256306   
9  agriculture forestry fishing       16-17       41329.78         6.795383   

   Inflation_CPI  Exchange_Rate_USD  Fisca

3. Integrate sector growth (shares of GDP)
* Input: data/sector_shares_gdp.csv
* Melt to long (sector_key, Fiscal_Year, Sector_Share_GDP), map sector_key → Sector_12, drop unmapped.
* Join on (Sector_12, Fiscal_Year).
* Output: data/sector_budget_macro_panel.csv

In [50]:
# Step 3: Integrate sector growth (shares of GDP) → data/sector_budget_macro_panel.csv

import re
import pandas as pd
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
IN_PANEL = DATA / "sector_budget_macro.csv"      # from step 2
IN_SHARES = DATA / "sector_shares_gdp.csv"
OUT_PANEL = DATA / "sector_budget_macro_panel.csv"

# Load current panel
panel = pd.read_csv(IN_PANEL, dtype={"Sector_12": "string", "Fiscal_Year": "string"})

# Load shares wide table
shares = pd.read_csv(IN_SHARES, dtype={"sector": "string"})

# Find FY columns like 2005-06 .. 2024-25
fy_full_cols = [c for c in shares.columns if re.fullmatch(r"20\d{2}-\d{2}", str(c))]
if not fy_full_cols:
    raise RuntimeError("No FY columns like 2005-06 found in sector_shares_gdp.csv")

# Melt to long
shares_long = shares.melt(
    id_vars="sector",
    value_vars=fy_full_cols,
    var_name="FY_Full",
    value_name="Sector_Share_GDP"
)

# Map underscored keys → Sector_12 vocabulary
SHARE_TO_SECTOR12 = {
    "agriculture_forestry_fishing": "agriculture forestry fishing",
    "trade_hotels_transport_communication_broadcasting": "communication broadcasting culture and toursim",
    "defense_security": "defense security",
    "economic_services": "economic services",
    "energy_natural_resources": "energy and natural resources",
    "food_distribution": "food distribution",
    "governance_administration": "governance and administration",
    "infrastructure_transport": "infrastructure and transport",
    "regional_development": "regional and development",
    "science_innovation": "science and innovation",
    "social_services": "social and services",
}

shares_long["Sector_12"] = shares_long["sector"].map(lambda s: SHARE_TO_SECTOR12.get(str(s).strip()))
skipped = shares_long["Sector_12"].isna().sum()
if skipped:
    missing_keys = sorted(shares_long.loc[shares_long["Sector_12"].isna(), "sector"].dropna().unique().tolist())
    print("Info: skipping unmapped share categories:", missing_keys)

shares_long = shares_long.dropna(subset=["Sector_12"]).copy()

# Convert FY_Full (2005-06) → short '05-06' to match panel
def to_short_fy(fy_full: str) -> str:
    a, b = str(fy_full).split("-")
    return f"{a[-2:]}-{b}"

shares_long["Fiscal_Year"] = shares_long["FY_Full"].map(to_short_fy)
shares_long = shares_long.drop(columns=["sector", "FY_Full"])

# Coerce numeric
shares_long["Sector_Share_GDP"] = pd.to_numeric(shares_long["Sector_Share_GDP"], errors="coerce")

# Merge into panel
merged = panel.merge(shares_long, on=["Sector_12", "Fiscal_Year"], how="left")

# Save
merged.to_csv(OUT_PANEL, index=False)
print(f"Saved: {OUT_PANEL}")
print("Preview:")
print(merged.head(10))

# QC: count missing shares
miss = merged["Sector_Share_GDP"].isna().sum()
print(f"Rows with missing Sector_Share_GDP: {miss} / {len(merged)}")

Info: skipping unmapped share categories: ['agriculture forestry fishing', 'communication broadcasting culture and toursim', 'defense security', 'economic services', 'energy and natural resources', 'food distribution', 'governance and administration', 'infrastructure and transport', 'regional and development', 'science and innovation', 'social and services']
Saved: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_budget_macro_panel.csv
Preview:
                      Sector_12 Fiscal_Year  Budget_Amount  GDP_Growth_Rate  \
0  agriculture forestry fishing       05-06       27237.86         8.060733   
1  agriculture forestry fishing       06-07       38240.82         7.660815   
2  agriculture forestry fishing       07-08       44713.38         3.086698   
3  agriculture forestry fishing       09-10       86589.42         8.497585   
4  agriculture forestry fishing       10-11       93610.87         5.241316   
5  agriculture forestry fishing       11-12      104371.66         5.45

4. Feature engineering
* Add Year_End; per‑sector: Budget_Lag1/Lag2, Budget_Growth_Lag1; shares: Sector_Share_Lag1, Sector_Share_Growth; optional Trend, Inflation×Election.
* Coerce numerics; handle inf/NaN.
* Output: overwrite data/sector_budget_macro_panel.csv
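
Per-sector lags are a `groupby(...).shift(k)` on a panel sorted by sector and year. The sketch below uses made-up amounts and takes Budget_Growth_Lag1 to mean growth between the two previous years, which is an assumption about the intended definition; it then converts the infinities produced by a zero denominator to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two sectors, three consecutive years each.
panel = pd.DataFrame({
    "Sector_12": ["a", "a", "a", "b", "b", "b"],
    "Year_End": [2006, 2007, 2008, 2006, 2007, 2008],
    "Budget_Amount": [100.0, 110.0, 121.0, 0.0, 50.0, 75.0],
}).sort_values(["Sector_12", "Year_End"])

grp = panel.groupby("Sector_12")["Budget_Amount"]
panel["Budget_Lag1"] = grp.shift(1)   # previous row within the sector
panel["Budget_Lag2"] = grp.shift(2)   # two rows back within the sector

# Growth between the two previous years: uses only values known before year t.
panel["Budget_Growth_Lag1"] = panel["Budget_Lag1"] / panel["Budget_Lag2"] - 1

# A zero denominator yields inf; treat it as missing.
panel = panel.replace([np.inf, -np.inf], np.nan)
print(panel)
```

Note that `shift(1)` means "previous row", not "previous year": if a sector has a gap in its fiscal years, the lag spans that gap.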

In [51]:
import re
import difflib
import pandas as pd
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
PANEL  = DATA / "sector_budget_macro_panel.csv"
SHARES = DATA / "sector_shares_gdp.csv"
OUT    = PANEL  # overwrite

# Load current panel and shares
panel = pd.read_csv(PANEL, dtype={"Sector_12":"string","Fiscal_Year":"string"})
shares = pd.read_csv(SHARES)

print("Shares columns:", list(shares.columns))
print("\nUnique 'sector' values (first 20):")
if "sector" in shares.columns:
    print(sorted(shares["sector"].dropna().astype(str).unique())[:20])
else:
    print("No 'sector' column found. Check the file headers.")

# Canonical Sector_12 labels from your panel (keep exact spelling used in panel)
canonical = sorted(panel["Sector_12"].dropna().astype(str).unique().tolist())

def norm(s: str) -> str:
    s = str(s).lower().strip()
    s = re.sub(r"[\s/_&,+]+", " ", s)           # collapse separators
    s = re.sub(r"[^a-z0-9 ]+", "", s)           # keep alnum and space
    s = re.sub(r"\s+", " ", s).strip()
    return s

canon_norm = {norm(c): c for c in canonical}

# Optional manual overrides → map normalized shares name → exact Sector_12 (as in panel)
overrides = {
    # examples, edit to your data if needed:
    # "agriculture forestry and fishing": "agriculture forestry fishing",
    # "communications broadcasting culture and tourism": "communication broadcasting culture and toursim",  # keep panel spelling
    # "food and public distribution": "food distribution",
    # "regional development": "regional and development",
    # "science and technology": "science and innovation",
    # "defence security": "defense security",
}

# Build shares_long
fy_full_cols = [c for c in shares.columns if re.fullmatch(r"20\d{2}-\d{2}", str(c))]
if not fy_full_cols:
    raise RuntimeError("No FY columns like 2005-06 found in sector_shares_gdp.csv")

shares_long = shares.melt(
    id_vars=[c for c in shares.columns if c not in fy_full_cols],
    value_vars=fy_full_cols,
    var_name="FY_Full",
    value_name="Sector_Share_GDP"
)

# Decide Sector_12 for shares:
if "Sector_12" in shares_long.columns:
    # If your shares file already has Sector_12, normalize to match panel
    shares_long["Sector_12"] = shares_long["Sector_12"].astype(str)
    shares_long["Sector_12"] = shares_long["Sector_12"].map(lambda s: canon_norm.get(norm(s), s))
else:
    # Map from 'sector' column using normalization + fuzzy matching
    if "sector" not in shares_long.columns:
        raise RuntimeError("sector_shares_gdp.csv has no 'sector' column and no 'Sector_12' column to map from.")
    shares_long["sector"] = shares_long["sector"].astype(str)
    unique_shares = sorted(shares_long["sector"].dropna().unique().tolist())

    map_rows = []
    derived_map = {}
    for s in unique_shares:
        ns = norm(s)
        if ns in overrides:
            target = overrides[ns]
        else:
            # fuzzy match to canonical normalized names
            match_norms = list(canon_norm.keys())
            best = difflib.get_close_matches(ns, match_norms, n=1, cutoff=0.6)
            target = canon_norm[best[0]] if best else None
        derived_map[s] = target
        map_rows.append({"shares_sector": s, "mapped_Sector_12": target})

    map_df = pd.DataFrame(map_rows).sort_values("shares_sector")
    print("\nProposed mapping (first 20):")
    print(map_df.head(20))

    # Apply mapping
    shares_long["Sector_12"] = shares_long["sector"].map(derived_map)

# Convert FY_Full → short 'yy-yy'
def to_short_fy(fy_full: str) -> str:
    a, b = str(fy_full).split("-")
    return f"{a[-2:]}-{b}"

shares_long["Fiscal_Year"] = shares_long["FY_Full"].astype(str).map(to_short_fy)
shares_long["Sector_Share_GDP"] = pd.to_numeric(shares_long["Sector_Share_GDP"], errors="coerce")

# Keep only needed cols
shares_long = shares_long[["Sector_12","Fiscal_Year","Sector_Share_GDP"]].dropna(subset=["Sector_12"])

# Normalize join keys and merge
for col in ["Sector_12","Fiscal_Year"]:
    panel[col] = panel[col].astype(str).str.strip().str.lower()
    shares_long[col] = shares_long[col].astype(str).str.strip().str.lower()

merged = (panel.drop(columns=["Sector_Share_GDP","Sector_Share_Lag1","Sector_Share_Growth"], errors="ignore")
               .merge(shares_long, on=["Sector_12","Fiscal_Year"], how="left"))

missing_share = merged["Sector_Share_GDP"].isna().sum()
print(f"\nAfter mapping, rows missing Sector_Share_GDP: {missing_share} / {len(merged)}")

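# Recompute share lags/growth (add Year_End first so rows sort chronologically within each sector)
if "Year_End" not in merged.columns:
    merged["Year_End"] = merged["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1]))
merged = merged.sort_values(["Sector_12","Year_End"])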

merged["Sector_Share_Lag1"] = merged.groupby("Sector_12")["Sector_Share_GDP"].shift(1)
merged["Sector_Share_Growth"] = (merged["Sector_Share_GDP"] / merged["Sector_Share_Lag1"] - 1)

# Add budget lags and growth (used by the DL model and the baselines)
merged["Budget_Lag1"] = merged.groupby("Sector_12")["Budget_Amount"].shift(1)
merged["Budget_Lag2"] = merged.groupby("Sector_12")["Budget_Amount"].shift(2)
# Note: this ratio uses the current year's Budget_Amount, so it is same-year growth, not a lagged value
merged["Budget_Growth_Lag1"] = merged["Budget_Amount"] / merged["Budget_Lag1"] - 1

# Optional trend and interaction features
merged["Trend"] = merged.groupby("Sector_12").cumcount() + 1

if {"Inflation_CPI", "Election_Year"}.issubset(merged.columns):
    merged["Inflation_x_Election"] = merged["Inflation_CPI"] * merged["Election_Year"]

if {"GDP_Growth_Rate", "Election_Year"}.issubset(merged.columns):
    merged["GDPGrowth_x_Election"] = merged["GDP_Growth_Rate"] * merged["Election_Year"]

# Clean infinities
import numpy as np
merged.replace([np.inf, -np.inf], np.nan, inplace=True)

# Save
merged.to_csv(OUT, index=False)
print("Saved fixed panel:", OUT)
print(merged.head(8))

Shares columns: ['sector', '2005-06', '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16', '2016-17', '2017-18', '2018-19', '2019-20', '2020-21', '2021-22', '2022-23', '2023-24', '2024-25']

Unique 'sector' values (first 20):
['agriculture forestry fishing', 'communication broadcasting culture and toursim', 'defense security', 'economic services', 'energy and natural resources', 'food distribution', 'governance and administration', 'infrastructure and transport', 'regional and development', 'science and innovation', 'social and services']

Proposed mapping (first 20):
                                     shares_sector  \
0                     agriculture forestry fishing   
1   communication broadcasting culture and toursim   
2                                 defense security   
3                                economic services   
4                     energy and natural resources   
5                                food distri

5. Define splits and scaling
* Hold out FY 23-24 (Year_End=2024) for the final test; train on Year_End <= 2023.
* Fit a median imputer and StandardScaler on train rows only; apply the fitted transform to both train and test rows.
* Persist list of feature columns for modeling.
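
Fitting the scaler on train rows only means the test year never influences the mean or standard deviation. The sketch below shows the arithmetic with plain pandas on placeholder numbers; it is an illustration of the idea, not a replacement for the scikit-learn pipeline used in the cell. The population standard deviation (`ddof=0`) matches what StandardScaler computes.

```python
import pandas as pd

# Hypothetical toy data; the growth values are placeholders.
df = pd.DataFrame({
    "Year_End": [2021, 2022, 2023, 2024],
    "GDP_Growth_Rate": [2.0, 4.0, 6.0, 10.0],
})
feature_cols = ["GDP_Growth_Rate"]
train_mask = df["Year_End"] <= 2023

# Statistics come from train rows only.
mu = df.loc[train_mask, feature_cols].mean()
sigma = df.loc[train_mask, feature_cols].std(ddof=0)
sigma = sigma.replace(0, 1.0)  # guard against division by zero for constant columns

# Apply the train statistics to every row, including the held-out year.
z = (df[feature_cols] - mu) / sigma
z.columns = [f"z_{c}" for c in feature_cols]
print(z)
```

The train rows end up with mean 0 and unit variance, while the held-out 2024 row can land well outside that range, which is expected.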

In [52]:
# Step 5: Define splits and scaling

import json
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
IN_PANEL = DATA / "sector_budget_macro_panel.csv"

OUT_TRAIN = DATA / "sector_budget_features_train.csv"
OUT_TEST  = DATA / "sector_budget_features_test_2024.csv"
OUT_FEATS = DATA / "feature_columns.json"
OUT_SCALER = DATA / "feature_imputer_scaler.joblib"

# Load panel
df = pd.read_csv(IN_PANEL, dtype={"Sector_12":"string","Fiscal_Year":"string"})
# Ensure Year_End is integer-like
if "Year_End" not in df.columns:
    df["Year_End"] = df["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1])).astype("Int64")

# Identify columns
id_cols = ["Sector_12","Fiscal_Year","Year_End"]
target_col = "Budget_Amount"

# Candidate features = all numeric columns except id + target
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [c for c in num_cols if c not in ([target_col] + ["Year_End"])]  # keep Year_End as id/time key, not a feature

# Train/test masks
train_mask = df["Year_End"] <= 2023
test_mask  = df["Year_End"] == 2024  # FY 23-24

X_train = df.loc[train_mask, feature_cols].copy()
X_test  = df.loc[test_mask, feature_cols].copy()

# Drop feature columns that are entirely NaN in train (e.g., Sector_Share_* if merge failed)
non_allnan_feats = [c for c in feature_cols if not X_train[c].isna().all()]
dropped = sorted(set(feature_cols) - set(non_allnan_feats))
feature_cols = non_allnan_feats
X_train = X_train[feature_cols]
X_test  = X_test[feature_cols]

if dropped:
    print("Dropping all-NaN train features:", dropped)

# Build pipeline: impute (median) + scale
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Fit on train only, transform both
X_train_z = pipe.fit_transform(X_train)
X_test_z  = pipe.transform(X_test)

z_cols = [f"z_{c}" for c in feature_cols]
train_out = pd.concat(
    [df.loc[train_mask, id_cols + [target_col]].reset_index(drop=True),
     pd.DataFrame(X_train_z, columns=z_cols)],
    axis=1
)
test_out = pd.concat(
    [df.loc[test_mask, id_cols + [target_col]].reset_index(drop=True),
     pd.DataFrame(X_test_z, columns=z_cols)],
    axis=1
)

# Save artifacts
train_out.to_csv(OUT_TRAIN, index=False)
test_out.to_csv(OUT_TEST, index=False)
joblib.dump({"pipeline": pipe, "feature_cols": feature_cols, "z_cols": z_cols}, OUT_SCALER)
with OUT_FEATS.open("w") as f:
    json.dump({"feature_cols": feature_cols, "z_cols": z_cols}, f, indent=2)

print(f"Saved train features: {OUT_TRAIN}  rows={len(train_out)}  feats={len(z_cols)}")
print(f"Saved 2024 test features: {OUT_TEST}  rows={len(test_out)}")
print(f"Saved scaler pipeline: {OUT_SCALER}")
print(f"Saved feature list: {OUT_FEATS}")

# Quick QC
print("\nTrain NA counts (should be 0 in z_ cols):")
print(train_out[z_cols].isna().sum().sum(), "total NaNs")
print("Test NA counts (should be 0 in z_ cols):")
print(test_out[z_cols].isna().sum().sum(), "total NaNs")

Saved train features: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_budget_features_train.csv  rows=176  feats=18
Saved 2024 test features: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_budget_features_test_2024.csv  rows=11
Saved scaler pipeline: /Users/vvmohith/Desktop/PROJECT/final_data/data/feature_imputer_scaler.joblib
Saved feature list: /Users/vvmohith/Desktop/PROJECT/final_data/data/feature_columns.json

Train NA counts (should be 0 in z_ cols):
0 total NaNs
Test NA counts (should be 0 in z_ cols):
0 total NaNs


6. Baseline models (for context)
* Naive: predict Budget_Lag1.
* Tabular: Linear/Ridge/GBM on non‑sequence features.
* Save metrics to compare vs DL.
* Output: data/metrics/sector_baseline_metrics_23_24.csv
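
For reference, the comparison metrics can be written out directly in NumPy. This is a sketch on made-up numbers, with `regression_metrics` as a hypothetical helper name; MAPE skips rows whose true value is zero, and the naive baseline is simply "predict last year's budget".

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2, and MAPE (%) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    nz = y_true != 0  # MAPE is undefined where the true value is zero
    mape = float(np.mean(np.abs(err[nz] / y_true[nz])) * 100) if nz.any() else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE_%": mape}

# Naive baseline on invented numbers: the prediction is the previous year's budget.
budget_prev = [100.0, 200.0, 400.0]
budget_now = [110.0, 190.0, 400.0]
naive = regression_metrics(budget_now, budget_prev)
print(naive)
```

Because budgets differ widely in scale across sectors, R2 can look high even for the naive baseline, so MAE and MAPE are worth reading alongside it.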

In [53]:
# Step 6: Baseline models → data/metrics/sector_baseline_metrics_23_24.csv

from pathlib import Path
import json
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"

IN_TRAIN = DATA / "sector_budget_features_train.csv"
IN_TEST  = DATA / "sector_budget_features_test_2024.csv"
IN_META  = DATA / "feature_columns.json"
IN_PANEL = DATA / "sector_budget_macro_panel.csv"

OUT_DIR = DATA / "metrics"
OUT_DIR.mkdir(parents=True, exist_ok=True)
OUT_METRICS = OUT_DIR / "sector_baseline_metrics_23_24.csv"

# Load data
df_train = pd.read_csv(IN_TRAIN)
df_test  = pd.read_csv(IN_TEST)

# Resolve standardized feature columns (z_)
try:
    with IN_META.open() as f:
        meta = json.load(f)
    z_cols = meta.get("z_cols") or [c for c in df_train.columns if c.startswith("z_")]
except Exception:
    z_cols = [c for c in df_train.columns if c.startswith("z_")]

# Keep only columns present in both train and test
z_cols = [c for c in z_cols if c in df_train.columns and c in df_test.columns]
if not z_cols:
    raise RuntimeError("No standardized z_ feature columns found in both train and test.")

# Build two feature sets: with and without Sector_Share* (ablation)
z_cols_with = list(z_cols)
z_cols_without = [c for c in z_cols if "Sector_Share" not in c]

def metrics(y_true, y_pred):
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    mask = y_true != 0
    mape = float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100) if mask.any() else np.nan
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": rmse,
        "R2": float(r2_score(y_true, y_pred)),
        "MAPE_%": mape,
        "n_eval": int(len(y_true))
    }

def run_baselines(zset, tag):
    rows = []
    if not zset:
        return pd.DataFrame(rows, columns=["model","MAE","RMSE","R2","MAPE_%","n_eval"])
    X_train, X_test = df_train[zset].copy(), df_test[zset].copy()
    y_train, y_test = df_train["Budget_Amount"].copy(), df_test["Budget_Amount"].copy()

    # 1) Naive baseline (Budget_Lag1) via panel join
    try:
        panel = pd.read_csv(IN_PANEL, dtype={"Sector_12":"string","Fiscal_Year":"string"})
        if "Year_End" not in panel.columns:
            panel["Year_End"] = panel["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1])).astype("Int64")
        last_year = panel[["Sector_12","Year_End","Budget_Amount"]].rename(
            columns={"Year_End":"Year_End_Lag1","Budget_Amount":"Budget_Lag1"}
        )
        test_ids = df_test[["Sector_12","Year_End"]].copy()
        test_ids["Year_End_Lag1"] = test_ids["Year_End"] - 1
        naive_merge = test_ids.merge(last_year, on=["Sector_12","Year_End_Lag1"], how="left")
        y_pred_naive = naive_merge["Budget_Lag1"]
        m = y_pred_naive.notna().values
        rows.append({
            "model": f"Naive_Budget_Lag1[{tag}]",
            **(metrics(y_test[m].values, y_pred_naive[m].values) if m.any()
               else {"MAE": np.nan, "RMSE": np.nan, "R2": np.nan, "MAPE_%": np.nan, "n_eval": 0})
        })
    except Exception:
        rows.append({"model": f"Naive_Budget_Lag1[{tag}]", "MAE": np.nan, "RMSE": np.nan, "R2": np.nan, "MAPE_%": np.nan, "n_eval": 0})

    # 2) Linear Regression
    lr = LinearRegression().fit(X_train, y_train)
    rows.append({"model": f"LinearRegression[{tag}]", **metrics(y_test.values, lr.predict(X_test))})

    # 3) Ridge Regression
    ridge = Ridge(alpha=1.0).fit(X_train, y_train)
    rows.append({"model": f"Ridge(alpha=1.0)[{tag}]", **metrics(y_test.values, ridge.predict(X_test))})

    # 4) Gradient Boosting (GBM)
    gbm = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    rows.append({"model": f"GradientBoostingRegressor[{tag}]", **metrics(y_test.values, gbm.predict(X_test))})

    return pd.DataFrame(rows, columns=["model","MAE","RMSE","R2","MAPE_%","n_eval"])

# Run with/without shares; drop empty ablation if needed
dfs = [run_baselines(z_cols_with, "with_shares")]
if len(z_cols_without) != len(z_cols_with):  # only add if something was actually removed
    dfs.append(run_baselines(z_cols_without, "no_shares"))

metrics_df = pd.concat(dfs, ignore_index=True)
metrics_df.to_csv(OUT_METRICS, index=False)
print(f"Saved metrics: {OUT_METRICS}")
print(metrics_df)

Saved metrics: /Users/vvmohith/Desktop/PROJECT/final_data/data/metrics/sector_baseline_metrics_23_24.csv
                                    model           MAE          RMSE  \
0          Naive_Budget_Lag1[with_shares]  22696.873636  33404.003003   
1           LinearRegression[with_shares]  21697.806564  26191.146275   
2           Ridge(alpha=1.0)[with_shares]  22526.791115  27747.155241   
3  GradientBoostingRegressor[with_shares]  12908.067483  16707.312038   
4            Naive_Budget_Lag1[no_shares]  22696.873636  33404.003003   
5             LinearRegression[no_shares]  21325.663380  26352.124242   
6             Ridge(alpha=1.0)[no_shares]  22244.754538  28049.966621   
7    GradientBoostingRegressor[no_shares]  11584.406048  15540.915970   

         R2     MAPE_%  n_eval  
0  0.908105  12.774786      11  
1  0.943506  22.037856      11  
2  0.936594  21.465432      11  
3  0.977012   8.202311      11  
4  0.908105  12.774786      11  
5  0.942809  21.876509      11  
6  0.9

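7. Deep learning sequence model (train/evaluate on FY 23-24)
* Build fixed-length sequences per sector on z_ features (the config below uses LOOKBACK = 3 years).
* Train a GRU/LSTM with a dense head ending in Dense(1) on log1p(target); the config below uses UNITS = 32, DROPOUT = 0.30, and L2 = 1e-4.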
* Validate with EarlyStopping/ReduceLROnPlateau; evaluate on FY 23-24.
* Outputs: data/sector_dl_predictions_23_24.csv, data/sector_dl_metrics_23_24.csv.
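
The sequence construction and the log transform can be illustrated without TensorFlow. The sketch below, using a made-up feature matrix for one sector, builds windows of the previous `lookback` rows to predict the target at row t, then applies `log1p` and its inverse `expm1`. `make_windows` is a hypothetical helper name.

```python
import numpy as np

def make_windows(features, target, lookback):
    """Return (X, y): X[i] holds rows t-lookback..t-1, y[i] is the target at row t."""
    features = np.asarray(features, dtype=np.float32)
    target = np.asarray(target, dtype=np.float32)
    X, y = [], []
    for t in range(lookback, len(target)):
        X.append(features[t - lookback:t])
        y.append(target[t])
    n_features = features.shape[1]
    X = np.asarray(X, dtype=np.float32).reshape(-1, lookback, n_features)
    return X, np.asarray(y, dtype=np.float32)

# Hypothetical single-sector series: 6 years, 2 features per year.
feats = np.arange(12, dtype=np.float32).reshape(6, 2)
budget = np.array([10, 20, 30, 40, 50, 60], dtype=np.float32)

X, y = make_windows(feats, budget, lookback=3)

# Train on log1p(target); map predictions back with expm1.
y_log = np.log1p(y)
y_back = np.expm1(y_log)
print(X.shape, y, y_back)
```

With 6 rows and a lookback of 3, only 3 sequences come out, so each extra year of lookback costs one training example per sector.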

In [55]:
# Step 7: Deep learning sequence model → data/sector_dl_predictions_23_24.csv, data/sector_dl_metrics_23_24.csv

from pathlib import Path
import json
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers, regularizers

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"

IN_TRAIN = DATA / "sector_budget_features_train.csv"
IN_TEST  = DATA / "sector_budget_features_test_2024.csv"
IN_META  = DATA / "feature_columns.json"

OUT_PREDS   = DATA / "sector_dl_predictions_23_24.csv"
OUT_METRICS = DATA / "sector_dl_metrics_23_24.csv"

# Config (smaller models, stronger regularization)
LOOKBACK = 3
MODEL_TYPE = "GRU"  # "GRU", "LSTM", or "TCN"
UNITS = 32
DROPOUT = 0.30
L2 = 1e-4
BATCH_SIZE = 32
EPOCHS = 350
SEED = 42

tf.random.set_seed(SEED)
np.random.seed(SEED)

# Load data
df_tr = pd.read_csv(IN_TRAIN)
df_te = pd.read_csv(IN_TEST)

# Resolve standardized feature columns
try:
    with IN_META.open() as f:
        meta = json.load(f)
    z_cols = meta.get("z_cols") or [c for c in df_tr.columns if c.startswith("z_")]
except Exception:
    z_cols = [c for c in df_tr.columns if c.startswith("z_")]

# Keep only columns present in both train and test
z_cols = [c for c in z_cols if c in df_tr.columns and c in df_te.columns]
if not z_cols:
    raise RuntimeError("No standardized z_ feature columns found in both train and test.")

# Combine for sequence building
use_cols = ["Sector_12","Fiscal_Year","Year_End","Budget_Amount"] + z_cols
df = pd.concat([df_tr[use_cols], df_te[use_cols]], axis=0, ignore_index=True)
df = df.sort_values(["Sector_12","Year_End"]).reset_index(drop=True)

# Build sequences per sector (window t-L..t-1 -> target at t)
def build_sequences(df_group, lookback, feature_cols):
    X_list, y_list, ids = [], [], []
    vals = df_group[feature_cols].values
    tgt = df_group["Budget_Amount"].values
    years = df_group["Year_End"].values
    fys = df_group["Fiscal_Year"].values
    n = len(df_group)
    for i in range(lookback, n):
        X_list.append(vals[i-lookback:i, :])
        y_list.append(tgt[i])
        ids.append((df_group["Sector_12"].iloc[i], int(years[i]), fys[i]))
    return X_list, y_list, ids

X_all, y_all, meta_all = [], [], []
for sec, g in df.groupby("Sector_12", sort=False):
    g = g.sort_values("Year_End")
    Xs, ys, ids = build_sequences(g, LOOKBACK, z_cols)
    X_all += Xs; y_all += ys; meta_all += ids

if not X_all:
    raise RuntimeError("No sequences built. Increase data coverage or reduce LOOKBACK.")

X_all = np.asarray(X_all, dtype=np.float32)
y_all = np.asarray(y_all, dtype=np.float32)
meta_df = pd.DataFrame(meta_all, columns=["Sector_12","Year_End","Fiscal_Year"])

# Masks: train (<=2022), validation (=2023, time-based), test (=2024)
m_train_all = meta_df["Year_End"] <= 2023
m_val = (meta_df["Year_End"] == 2023) & m_train_all
m_train = m_train_all & (~m_val)
m_test  = meta_df["Year_End"] == 2024

X_tr, y_tr = X_all[m_train], y_all[m_train]
X_val, y_val = X_all[m_val], y_all[m_val]
X_te,  y_te  = X_all[m_test], y_all[m_test]
meta_test = meta_df[m_test].reset_index(drop=True)

if len(X_te) == 0:
    raise RuntimeError("No test sequences for Year_End=2024 with the chosen LOOKBACK.")
if len(X_tr) == 0:
    raise RuntimeError("No training sequences (<=2022). Check data coverage or reduce LOOKBACK.")
if len(X_val) == 0:
    n = max(1, int(0.2 * len(X_tr)))
    X_val, y_val = X_tr[-n:], y_tr[-n:]
    X_tr, y_tr = X_tr[:-n], y_tr[:-n]

# Log-transform target for stability
y_tr_log = np.log1p(y_tr)
y_val_log = np.log1p(y_val)

# Build compact, regularized model
n_features = X_tr.shape[2]
inp = layers.Input(shape=(LOOKBACK, n_features))

if MODEL_TYPE.upper() == "LSTM":
    x = layers.LSTM(UNITS, return_sequences=False,
                    kernel_regularizer=regularizers.l2(L2),
                    recurrent_dropout=0.15)(inp)
elif MODEL_TYPE.upper() == "TCN":
    x = inp
    for d in [1, 2, 4]:
        y = layers.Conv1D(32, 3, padding="causal", dilation_rate=d,
                          activation="relu",
                          kernel_regularizer=regularizers.l2(L2))(x)
        y = layers.BatchNormalization()(y)
        y = layers.Dropout(DROPOUT)(y)
        if x.shape[-1] != y.shape[-1]:
            x = layers.Conv1D(32, 1, padding="same")(x)
        x = layers.Add()([x, y])
    x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
else:
    x = layers.GRU(UNITS, return_sequences=False,
                   kernel_regularizer=regularizers.l2(L2),
                   recurrent_dropout=0.15)(inp)

x = layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(L2))(x)
x = layers.Dropout(DROPOUT)(x)
out = layers.Dense(1, activation="linear")(x)
model = models.Model(inputs=inp, outputs=out)

model.compile(optimizer=optimizers.Adam(learning_rate=5e-4),
              loss=tf.keras.losses.Huber(delta=0.5),
              metrics=["mae"])

cb = [
    callbacks.EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=8, min_lr=1e-5)
]

_ = model.fit(
    X_tr, y_tr_log,
    validation_data=(X_val, y_val_log),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    verbose=0,
    callbacks=cb
)

# Predict 2024 (invert log)
y_pred_log = model.predict(X_te, batch_size=BATCH_SIZE, verbose=0).ravel()
y_pred = np.expm1(y_pred_log)

# Metrics
def seq_metrics(y_true, y_pred):
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    mask = y_true != 0
    mape = float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100) if mask.any() else np.nan
    mse = float(np.mean((y_true - y_pred) ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    r2 = float(1 - ss_res / ss_tot) if ss_tot > 0 else np.nan
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE_%": mape, "n_eval": int(len(y_true))}

m = seq_metrics(y_te, y_pred)

# Save predictions
preds = pd.DataFrame({
    "Sector_12": meta_test["Sector_12"],
    "Fiscal_Year": meta_test["Fiscal_Year"],
    "Year_End": meta_test["Year_End"],
    "Actual": y_te,
    "Predicted": y_pred
}).sort_values(["Sector_12"]).reset_index(drop=True)
preds.to_csv(OUT_PREDS, index=False)

# Save metrics
model_name = f"{MODEL_TYPE.upper()}_seq(L={LOOKBACK},U={UNITS})"
pd.DataFrame([{"model": model_name, **m}]).to_csv(OUT_METRICS, index=False)

print(f"Saved predictions: {OUT_PREDS} rows={len(preds)}")
print(f"Saved metrics: {OUT_METRICS}")
print(m)

Saved predictions: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_dl_predictions_23_24.csv rows=11
Saved metrics: /Users/vvmohith/Desktop/PROJECT/final_data/data/sector_dl_metrics_23_24.csv
{'MAE': 167469.65625, 'RMSE': 200467.6261145425, 'R2': -2.309661352553539, 'MAPE_%': 99.99530792236328, 'n_eval': 11}
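
A MAPE of about 100% together with a negative R2 is what one would expect if the predictions were close to zero on the level scale. One plausible contributor (an assumption, not verified against the training history) is that the network has to output log1p values around 12 while starting near zero, with a small learning rate and a Huber loss whose gradient is bounded. A common remedy is to standardize the log target with training-set statistics and invert the transform after prediction. The sketch below shows only that transform; `fit_log_scaler`, `to_scaled`, `from_scaled`, and the values in `y_demo` are hypothetical and not part of the notebook.

```python
import numpy as np

def fit_log_scaler(y_train):
    """Mean and std of log1p(y) computed on the training targets only."""
    logs = np.log1p(np.asarray(y_train, dtype=np.float64))
    mu = float(logs.mean())
    sigma = float(logs.std())
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant targets

def to_scaled(y, mu, sigma):
    """Level -> standardized log scale (what the network would be trained on)."""
    return (np.log1p(np.asarray(y, dtype=np.float64)) - mu) / sigma

def from_scaled(z, mu, sigma):
    """Standardized log scale -> level (applied to network outputs)."""
    return np.expm1(np.asarray(z, dtype=np.float64) * sigma + mu)

# Hypothetical budget levels, for illustration only
y_demo = np.array([30000.0, 50000.0, 250000.0, 350000.0])
mu, sigma = fit_log_scaler(y_demo)
z_demo = to_scaled(y_demo, mu, sigma)
print(z_demo.round(3), from_scaled(0.0, mu, sigma))
```

With this scaling, an output of zero maps back to a typical budget level rather than to a value near zero, so an untrained network starts from a sensible baseline.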


8.Forecast next FY and compare (optional FY 24-25)
* If Year_End=2025 rows exist, use the sequences aligned to 2025 targets; otherwise take the last lookback window per sector (ending 2024).
* Forecast FY 24-25; if actuals/projections exist, join and compute metrics.
* Outputs: results/fy2425/sector_dl_forecast_2425.csv, results/fy2425/metrics/metrics_2425.json, and comparison plots (preds vs actual, residuals) when actuals exist.
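
The fallback path (no rows for the forecast year) comes down to two small operations: take the most recent `lookback` rows of each sector as the input window, and label the following year. A minimal sketch in plain Python, where `fy_label`, `last_window`, and the sample rows are hypothetical stand-ins for the notebook's DataFrame logic:

```python
def fy_label(year_end):
    """Format a fiscal-year label such as '24-25' from the ending calendar year."""
    return f"{(year_end - 1) % 100:02d}-{year_end % 100:02d}"

def last_window(rows, lookback):
    """rows: list of (year_end, feature_vector) for one sector.

    Returns (window, next_year_end), or None when history is too short.
    """
    rows = sorted(rows, key=lambda r: r[0])  # oldest first
    if len(rows) < lookback:
        return None
    window = [features for _, features in rows[-lookback:]]
    return window, rows[-1][0] + 1

# Hypothetical sector history, deliberately unsorted
sample_rows = [(2022, [0.1]), (2024, [0.3]), (2023, [0.2])]
window, next_year = last_window(sample_rows, lookback=2)
print(window, next_year, fy_label(next_year))
```

Note that this window ends at the last observed year, whereas training windows end one year before their target, so both cases feed the model the same kind of input.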

In [56]:
# Step 8: Forecast next FY (24-25) and compare → results/fy2425/*
from pathlib import Path
import json
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
RESULTS = BASE / "results" / "fy2425"
PLOTS = RESULTS / "plots"
METRICS_DIR = RESULTS / "metrics"
RESULTS.mkdir(parents=True, exist_ok=True)
PLOTS.mkdir(parents=True, exist_ok=True)
METRICS_DIR.mkdir(parents=True, exist_ok=True)

IN_PANEL  = DATA / "sector_budget_macro_panel.csv"
IN_SCALER = DATA / "feature_imputer_scaler.joblib"  # from Step 5

OUT_FC25        = RESULTS / "sector_dl_forecast_2425.csv"
OUT_METRICS_JSON= METRICS_DIR / "metrics_2425.json"
OUT_METRICS_CSV = METRICS_DIR / "metrics_2425.csv"
OUT_FIG1        = PLOTS / "preds_vs_actual_2425.png"
OUT_FIG2        = PLOTS / "residuals_2425.png"

# Config
LOOKBACK = 5            # note: differs from Step 7, which uses LOOKBACK = 3
SEED = 42
EPOCHS = 250
BATCH = 32
VAL_SPLIT = 0.2

tf.random.set_seed(SEED)
np.random.seed(SEED)

def fy_from_year_end(y_end: int) -> str:
    a = (y_end - 1) % 100
    b = y_end % 100
    return f"{a:02d}-{b:02d}"

# Load panel
panel = pd.read_csv(IN_PANEL, dtype={"Sector_12":"string","Fiscal_Year":"string"})
if "Year_End" not in panel.columns:
    panel["Year_End"] = panel["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1])).astype("Int64")

# Load scaler artifacts (fit on train only)
art = joblib.load(IN_SCALER)
pipe = art["pipeline"]
feat_cols = art["feature_cols"]
z_cols = art["z_cols"]

# Transform all rows to z_ features using saved pipeline
X_all = panel[feat_cols].copy()
X_all_z = pipe.transform(X_all)
df_z = pd.concat(
    [
        panel[["Sector_12","Fiscal_Year","Year_End","Budget_Amount"]].reset_index(drop=True),
        pd.DataFrame(X_all_z, columns=z_cols)
    ],
    axis=1
).sort_values(["Sector_12","Year_End"]).reset_index(drop=True)

# Build sequences per sector
def build_sequences(df_group, lookback, feature_cols):
    X_list, y_list, ids = [], [], []
    vals = df_group[feature_cols].values
    tgt = df_group["Budget_Amount"].values
    years = df_group["Year_End"].values
    fys = df_group["Fiscal_Year"].values
    n = len(df_group)
    for i in range(lookback, n):
        X_list.append(vals[i-lookback:i, :])  # t-L .. t-1
        y_list.append(tgt[i])                  # target at t
        ids.append((df_group["Sector_12"].iloc[i], int(years[i]), fys[i]))
    return X_list, y_list, ids

X_seq, y_seq, meta_rows = [], [], []
for sec, g in df_z.groupby("Sector_12", sort=False):
    g = g.sort_values("Year_End")
    Xs, ys, ids = build_sequences(g, LOOKBACK, z_cols)
    X_seq += Xs; y_seq += ys; meta_rows += ids

if not X_seq:
    raise RuntimeError("No sequences built. Increase data coverage or reduce LOOKBACK.")

X_seq = np.array(X_seq, dtype=np.float32)
y_seq = np.array(y_seq, dtype=np.float32)
meta = pd.DataFrame(meta_rows, columns=["Sector_12","Year_End","Fiscal_Year"])

# Train on targets <= 2024
m_train = meta["Year_End"] <= 2024
X_train, y_train = X_seq[m_train], y_seq[m_train]
if len(X_train) == 0:
    raise RuntimeError("No training sequences (<=2024). Check data coverage or reduce LOOKBACK.")

# Build inference windows:
has_2025 = (meta["Year_End"] == 2025).any()
if has_2025:
    # Use sequences already aligned to 2025 targets
    m_inf = meta["Year_End"] == 2025
    X_inf = X_seq[m_inf]
    meta_inf = meta[m_inf].reset_index(drop=True)
else:
    # Fallback: last LOOKBACK window ending at last available year (should be 2024) → predict Year_End+1 (2025)
    print("Info: No Year_End=2025 rows; using last lookback window (ending 2024) to predict 2025.")
    X_inf_list, meta_inf_rows = [], []
    for sec, g in df_z.groupby("Sector_12", sort=False):
        g = g.sort_values("Year_End")
        if len(g) >= LOOKBACK:
            last_year = int(g["Year_End"].iloc[-1])
            X_inf_list.append(g[z_cols].tail(LOOKBACK).values.astype(np.float32))
            meta_inf_rows.append([sec, last_year + 1, fy_from_year_end(last_year + 1)])
    if not X_inf_list:
        raise RuntimeError("No sectors with enough history to form inference windows.")
    X_inf = np.stack(X_inf_list, axis=0)
    meta_inf = pd.DataFrame(meta_inf_rows, columns=["Sector_12","Year_End","Fiscal_Year"])

# Build GRU model
n_features = X_train.shape[2]
inp = layers.Input(shape=(LOOKBACK, n_features))
x = layers.GRU(64, return_sequences=False)(inp)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(1, activation="linear")(x)
model = models.Model(inputs=inp, outputs=out)
model.compile(optimizer=optimizers.Adam(1e-3), loss="mse", metrics=["mae"])

cb = [
    callbacks.EarlyStopping(monitor="val_loss", patience=15, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=7, min_lr=1e-5)
]

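# Train on log1p target for stability.
# Note: validation_split holds out the last VAL_SPLIT fraction of samples before shuffling;
# because samples are ordered by sector here, this is not a time-based validation split.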
y_train_log = np.log1p(y_train)
_ = model.fit(
    X_train, y_train_log,
    validation_split=VAL_SPLIT,
    epochs=EPOCHS,
    batch_size=BATCH,
    verbose=0,
    callbacks=cb
)

# Forecast and invert log
y_inf_log = model.predict(X_inf, batch_size=64, verbose=0).ravel()
y_inf = np.expm1(y_inf_log)

# Assemble forecast table
fc = pd.DataFrame({
    "Sector_12": meta_inf["Sector_12"],
    "Fiscal_Year": meta_inf["Fiscal_Year"],
    "Year_End": meta_inf["Year_End"],
    "Forecast": y_inf
})

# Attach actuals if available
actual_next = panel.loc[panel["Year_End"].isin(meta_inf["Year_End"]), ["Sector_12","Year_End","Budget_Amount"]] \
                   .rename(columns={"Budget_Amount":"Actual"})
fc = fc.merge(actual_next, on=["Sector_12","Year_End"], how="left") \
       .sort_values(["Sector_12"]).reset_index(drop=True)

# Save forecast CSV
fc.to_csv(OUT_FC25, index=False)
print(f"Saved forecast: {OUT_FC25} rows={len(fc)}")

# Metrics and plots only if Actual present
metrics_out = {"info": "No actuals available for forecast year; metrics and plots skipped.", "n_eval": 0}
if not fc["Actual"].isna().all():
    sub = fc.dropna(subset=["Actual"]).copy()
    y_true = sub["Actual"].values
    y_pred = sub["Forecast"].values

    # Metrics
    mask = y_true != 0
    mape = float(np.mean(np.abs((y_true[mask]-y_pred[mask]) / y_true[mask])) * 100) if mask.any() else np.nan
    mse  = float(np.mean((y_true - y_pred)**2))
    rmse = float(np.sqrt(mse))
    mae  = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = float(np.sum((y_true - y_pred)**2))
    ss_tot = float(np.sum((y_true - np.mean(y_true))**2))
    r2   = float(1 - ss_res/ss_tot) if ss_tot > 0 else np.nan
    metrics_out = {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE_%": mape, "n_eval": int(len(y_true))}
    pd.DataFrame([metrics_out]).to_csv(OUT_METRICS_CSV, index=False)

    # Plots
    sns.set_style("whitegrid")

    # Preds vs Actual
    plt.figure(figsize=(6,6))
    plt.scatter(y_true, y_pred, alpha=0.8)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    plt.plot(lims, lims, "k--", linewidth=1)
    plt.xlabel("Actual (next FY)")
    plt.ylabel("Forecast (next FY)")
    plt.title("Predicted vs Actual")
    plt.tight_layout()
    plt.savefig(OUT_FIG1, dpi=150)
    plt.close()

    # Residuals
    residuals = y_pred - y_true
    plt.figure(figsize=(7,4))
    sns.histplot(residuals, bins=15, kde=True)
    plt.axvline(0, color="k", linestyle="--", linewidth=1)
    plt.xlabel("Residual (Forecast - Actual)")
    plt.title("Residuals")
    plt.tight_layout()
    plt.savefig(OUT_FIG2, dpi=150)
    plt.close()

    print(f"Saved plots: {OUT_FIG1}, {OUT_FIG2}")
else:
    print(metrics_out["info"])

# Save metrics JSON
with open(OUT_METRICS_JSON, "w") as f:
    json.dump(metrics_out, f, indent=2)
print(f"Saved metrics JSON: {OUT_METRICS_JSON}")

Info: No Year_End=2025 rows; using last lookback window (ending 2024) to predict 2025.
Saved forecast: /Users/vvmohith/Desktop/PROJECT/final_data/results/fy2425/sector_dl_forecast_2425.csv rows=11
No actuals available for forecast year; metrics and plots skipped.
Saved metrics JSON: /Users/vvmohith/Desktop/PROJECT/final_data/results/fy2425/metrics/metrics_2425.json


In [57]:
# Build context around 2025 forecasts → results/fy2425/sector_dl_forecast_2425_with_context.csv
import pandas as pd
import numpy as np
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
RESULTS = BASE / "results" / "fy2425"

fc = pd.read_csv(RESULTS / "sector_dl_forecast_2425.csv")

panel = pd.read_csv(DATA / "sector_budget_macro_panel.csv", dtype={"Fiscal_Year":"string"})
if "Year_End" not in panel.columns:
    panel["Year_End"] = panel["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1]))
panel["Year_End"] = panel["Year_End"].astype(int)

last_actual = (panel[panel["Year_End"] == 2024][["Sector_12","Budget_Amount"]]
               .rename(columns={"Budget_Amount":"Actual_2024"}))

# Growth_vs_2024_pct is stored as a fraction (e.g. -0.16 means -16%), not multiplied by 100
out = (fc.merge(last_actual, on="Sector_12", how="left")
         .assign(Growth_vs_2024_pct=lambda d: (d["Forecast"]/d["Actual_2024"] - 1))
         .sort_values("Sector_12"))

# Add ±MAE band from FY23-24 DL metrics if available
try:
    mae = float(pd.read_csv(DATA / "sector_dl_metrics_23_24.csv").iloc[0]["MAE"])
    out["Forecast_Lower_mae"] = np.maximum(0, out["Forecast"] - mae)
    out["Forecast_Upper_mae"] = out["Forecast"] + mae
except Exception:
    pass

out.to_csv(RESULTS / "sector_dl_forecast_2425_with_context.csv", index=False)
print("Saved:", RESULTS / "sector_dl_forecast_2425_with_context.csv")
display(out.head(10))

Saved: /Users/vvmohith/Desktop/PROJECT/final_data/results/fy2425/sector_dl_forecast_2425_with_context.csv


Unnamed: 0,Sector_12,Fiscal_Year,Year_End,Forecast,Actual,Actual_2024,Growth_vs_2024_pct,Forecast_Lower_mae,Forecast_Upper_mae
0,agriculture forestry fishing,24-25,2025,212294.34,,253185.88,-0.161508,44824.68375,379763.99625
1,communication broadcasting culture and toursim,24-25,2025,216282.56,,262565.88,-0.176273,48812.90375,383752.21625
2,defense security,24-25,2025,30580.012,,51126.31,-0.401873,0.0,198049.66825
3,economic services,24-25,2025,45148.19,,61298.69,-0.263472,0.0,212617.84625
4,energy and natural resources,24-25,2025,101737.29,,246136.28,-0.586663,0.0,269206.94625
5,food distribution,24-25,2025,144731.95,,209052.25,-0.307676,0.0,312201.60625
6,governance and administration,24-25,2025,54695.16,,94545.62,-0.421495,0.0,222164.81625
7,infrastructure and transport,24-25,2025,142611.78,,352198.41,-0.595081,0.0,310081.43625
8,regional and development,24-25,2025,27023.373,,31008.08,-0.128505,0.0,194493.02925
9,science and innovation,24-25,2025,19535.377,,28905.33,-0.32416,0.0,187005.03325
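
The derived columns in this table can be reproduced by hand. The sketch below recomputes the growth fraction and the ±MAE band for two rows, using the forecasts and FY 23-24 actuals shown above and the MAE reported by the Step 7 metrics; the variable names are illustrative only.

```python
import numpy as np

# Values copied from the table: agriculture forestry fishing, defense security
forecast = np.array([212294.34, 30580.012])
actual_2024 = np.array([253185.88, 51126.31])
mae = 167469.65625  # MAE from sector_dl_metrics_23_24.csv (Step 7 output)

growth = forecast / actual_2024 - 1       # fraction, not percent
lower = np.maximum(0.0, forecast - mae)   # band is clipped at zero
upper = forecast + mae

for g, lo, hi in zip(growth, lower, upper):
    print(f"growth={g:.6f} lower={lo:.5f} upper={hi:.5f}")
```

Because the MAE exceeds most of the forecasts, the lower bound is clipped to 0.0 for eight of the ten rows shown, so the band carries little information for the smaller sectors.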


In [58]:
import pandas as pd
from pathlib import Path

BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
RESULTS = BASE / "results" / "fy2425"

fc = pd.read_csv(RESULTS / "sector_dl_forecast_2425.csv")
ctx = pd.read_csv(RESULTS / "sector_dl_forecast_2425_with_context.csv")

print("Forecast rows:", len(fc))
display(fc.head(10))

print("\nWith-context rows:", len(ctx))
display(ctx.head(10))

# Also show the metrics JSON message
mjson = RESULTS / "metrics" / "metrics_2425.json"
if mjson.exists():
    print("\nmetrics_2425.json:")
    print(mjson.read_text())
else:
    print("\nNo metrics JSON found.")

Forecast rows: 11


Unnamed: 0,Sector_12,Fiscal_Year,Year_End,Forecast,Actual
0,agriculture forestry fishing,24-25,2025,212294.34,
1,communication broadcasting culture and toursim,24-25,2025,216282.56,
2,defense security,24-25,2025,30580.012,
3,economic services,24-25,2025,45148.19,
4,energy and natural resources,24-25,2025,101737.29,
5,food distribution,24-25,2025,144731.95,
6,governance and administration,24-25,2025,54695.16,
7,infrastructure and transport,24-25,2025,142611.78,
8,regional and development,24-25,2025,27023.373,
9,science and innovation,24-25,2025,19535.377,



With-context rows: 11


Unnamed: 0,Sector_12,Fiscal_Year,Year_End,Forecast,Actual,Actual_2024,Growth_vs_2024_pct,Forecast_Lower_mae,Forecast_Upper_mae
0,agriculture forestry fishing,24-25,2025,212294.34,,253185.88,-0.161508,44824.68375,379763.99625
1,communication broadcasting culture and toursim,24-25,2025,216282.56,,262565.88,-0.176273,48812.90375,383752.21625
2,defense security,24-25,2025,30580.012,,51126.31,-0.401873,0.0,198049.66825
3,economic services,24-25,2025,45148.19,,61298.69,-0.263472,0.0,212617.84625
4,energy and natural resources,24-25,2025,101737.29,,246136.28,-0.586663,0.0,269206.94625
5,food distribution,24-25,2025,144731.95,,209052.25,-0.307676,0.0,312201.60625
6,governance and administration,24-25,2025,54695.16,,94545.62,-0.421495,0.0,222164.81625
7,infrastructure and transport,24-25,2025,142611.78,,352198.41,-0.595081,0.0,310081.43625
8,regional and development,24-25,2025,27023.373,,31008.08,-0.128505,0.0,194493.02925
9,science and innovation,24-25,2025,19535.377,,28905.33,-0.32416,0.0,187005.03325



metrics_2425.json:
{
  "info": "No actuals available for forecast year; metrics and plots skipped.",
  "n_eval": 0
}


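9.Residual-learning model sweep and FY 24-25 ensemble forecast
* Train compact GRU, LSTM, BiGRU and TCN models with sector embeddings on the residual of log growth versus last year's log growth (LOOKBACK = 3).
* Evaluate each model and their average (ensemble) on FY 23-24; forecast FY 24-25 per sector from the last lookback window and join projected/official 24-25 budgets if available.
* Outputs: results/fy2425/predictions/<model>/dl_predictions_23_24.csv, results/fy2425/forecasts/<model>/dl_forecast_24_25.csv, results/fy2425/sector_dl_forecast_2425_dl_ensemble.csv, results/fy2425/metrics/dl/summary_23_24.csv, and comparison plots in results/fy2425/plots.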
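
The Step 9 cell trains on a residual target rather than on budget levels: the model learns how much this year's log growth deviates from last year's log growth, and the level is rebuilt afterwards. Writing $p_2$ and $p_1$ for the budgets two years and one year before the target $y$, the target is $r = [\log(1+y) - \log(1+p_1)] - [\log(1+p_1) - \log(1+p_2)]$. A minimal sketch of the target and its inverse, where `residual_target`, `reconstruct_level`, and the sample numbers are hypothetical:

```python
import numpy as np

def residual_target(prev2, prev1, current):
    """Log growth into the current year minus the log growth of the previous year."""
    g_true = np.log1p(current) - np.log1p(prev1)
    g_naive = np.log1p(prev1) - np.log1p(prev2)
    return g_true - g_naive

def reconstruct_level(prev2, prev1, resid_hat):
    """Invert the residual target: add back naive growth, then leave log space."""
    g_naive = np.log1p(prev1) - np.log1p(prev2)
    return np.expm1(np.log1p(prev1) + g_naive + resid_hat)

# Hypothetical budgets for one sector over three consecutive years
p2, p1, y = 100.0, 110.0, 130.0
r = residual_target(p2, p1, y)
print(r, reconstruct_level(p2, p1, r), reconstruct_level(p2, p1, 0.0))
```

A predicted residual of zero reproduces the naive forecast (last year's growth repeated), so the network only has to learn deviations from that baseline.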

In [59]:
# Step 9 (residual learning + sector embeddings + compact models)

import json
from pathlib import Path
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, optimizers, losses, regularizers

# Config (shorter lookback + compact models)
BASE = Path("/Users/vvmohith/Desktop/PROJECT/final_data")
DATA = BASE / "data"
RESULTS = BASE / "results" / "fy2425"
METRICS_DIR = RESULTS / "metrics"
DL_METRICS_DIR = METRICS_DIR / "dl"
PLOTS = RESULTS / "plots"
PRED_DIR = RESULTS / "predictions"
FC_DIR = RESULTS / "forecasts"

LOOKBACK = 3
BATCH = 32
EPOCHS = 350
VAL_SPLIT = 0.2
DROPOUT = 0.30
L2 = 1e-4
SEED = 42

for d in [RESULTS, METRICS_DIR, DL_METRICS_DIR, PLOTS, PRED_DIR, FC_DIR]:
    d.mkdir(parents=True, exist_ok=True)

tf.random.set_seed(SEED)
np.random.seed(SEED)

def fy_from_year_end(y_end: int) -> str:
    a = (y_end - 1) % 100
    b = y_end % 100
    return f"{a:02d}-{b:02d}"

def seq_metrics(y_true, y_pred):
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    mask = y_true != 0
    mape = float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100) if mask.any() else np.nan
    mse = float(np.mean((y_true - y_pred) ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    r2 = float(1 - ss_res / ss_tot) if ss_tot > 0 else np.nan
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE_%": mape, "n_eval": int(len(y_true))}

# Build sequences with residual target (log growth residual vs last-year growth)
def build_resid_sequences(df_group, lookback, feature_cols):
    X_list, gresid_list, prev1_list, prev2_list, ids = [], [], [], [], []
    vals = df_group[feature_cols].values.astype(np.float32)
    tgt = df_group["Budget_Amount"].values.astype(np.float32)
    years = df_group["Year_End"].values
    fys = df_group["Fiscal_Year"].values
    n = len(df_group)
    for i in range(lookback, n):
        prev1 = tgt[i-1]
        prev2 = tgt[i-2] if i-2 >= 0 else np.nan
        if np.isnan(prev2):
            continue
        g_true = np.log1p(tgt[i]) - np.log1p(prev1)
        g_naive_last = np.log1p(prev1) - np.log1p(prev2)
        g_resid = g_true - g_naive_last
        X_list.append(vals[i-lookback:i, :])
        gresid_list.append(g_resid)
        prev1_list.append(prev1)
        prev2_list.append(prev2)
        ids.append((df_group["Sector_12"].iloc[i], int(years[i]), fys[i]))
    return X_list, gresid_list, prev1_list, prev2_list, ids

# Compact model zoo with sector embeddings
def make_model(kind: str, lookback: int, n_features: int, n_sectors: int, emb_dim: int = 4) -> tf.keras.Model:
    inp_seq = layers.Input(shape=(lookback, n_features), name="seq")
    inp_sid = layers.Input(shape=(), dtype="int32", name="sector_id")

    if kind == "GRU_L3_U32":
        x = layers.GRU(32, kernel_regularizer=regularizers.l2(L2), recurrent_dropout=0.15)(inp_seq)
    elif kind == "LSTM_L3_U32":
        x = layers.LSTM(32, kernel_regularizer=regularizers.l2(L2), recurrent_dropout=0.15)(inp_seq)
    elif kind == "BiGRU_L3_U32":
        x = layers.Bidirectional(layers.GRU(32, kernel_regularizer=regularizers.l2(L2), recurrent_dropout=0.15))(inp_seq)
    elif kind == "TCN_L3_F32_K3":
        x = inp_seq
        for d in [1, 2, 4]:
            y = layers.Conv1D(32, 3, padding="causal", dilation_rate=d,
                              activation="relu", kernel_regularizer=regularizers.l2(L2))(x)
            y = layers.BatchNormalization()(y)
            y = layers.Dropout(DROPOUT)(y)
            if x.shape[-1] != y.shape[-1]:
                x = layers.Conv1D(32, 1, padding="same")(x)
            x = layers.Add()([x, y])
        x = layers.Activation("relu")(x)
        x = layers.GlobalAveragePooling1D()(x)
    else:
        raise ValueError(f"Unknown model kind: {kind}")

    emb = layers.Embedding(n_sectors, emb_dim, name="sector_emb")(inp_sid)
    emb = layers.Flatten()(emb)

    h = layers.Concatenate()([x, emb])
    h = layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(L2))(h)
    h = layers.Dropout(DROPOUT)(h)
    out = layers.Dense(1)(h)  # predicts residual growth

    model = models.Model(inputs=[inp_seq, inp_sid], outputs=out)
    model.compile(optimizer=optimizers.Adam(5e-4),
                  loss=losses.Huber(delta=0.5),
                  metrics=["mae"])
    return model

# Load and standardize features across all rows using saved pipeline from Step 5
panel = pd.read_csv(DATA / "sector_budget_macro_panel.csv", dtype={"Sector_12":"string","Fiscal_Year":"string"})
if "Year_End" not in panel.columns:
    panel["Year_End"] = panel["Fiscal_Year"].map(lambda s: 2000 + int(str(s).split("-")[1])).astype("Int64")

art = joblib.load(DATA / "feature_imputer_scaler.joblib")
pipe = art["pipeline"]; feat_cols = art["feature_cols"]; z_cols = art["z_cols"]

X_all = panel[feat_cols].copy()
X_all_z = pipe.transform(X_all)
df_z = pd.concat(
    [panel[["Sector_12","Fiscal_Year","Year_End","Budget_Amount"]].reset_index(drop=True),
     pd.DataFrame(X_all_z, columns=z_cols)],
    axis=1
).sort_values(["Sector_12","Year_End"]).reset_index(drop=True)

# Sector index for embeddings
sectors = sorted(df_z["Sector_12"].unique().tolist())
sec2idx = {s:i for i,s in enumerate(sectors)}

# Build dataset (residual targets)
X_seq, gresid_seq, prev1_seq, prev2_seq, meta_rows = [], [], [], [], []
for sec, g in df_z.groupby("Sector_12", sort=False):
    g = g.sort_values("Year_End")
    Xs, gs, p1, p2, ids = build_resid_sequences(g, LOOKBACK, z_cols)
    X_seq += Xs; gresid_seq += gs; prev1_seq += p1; prev2_seq += p2; meta_rows += ids

if not X_seq:
    raise RuntimeError("No sequences built. Increase data coverage or reduce LOOKBACK.")

X_seq = np.asarray(X_seq, dtype=np.float32)
y_resid = np.asarray(gresid_seq, dtype=np.float32)
prev1_seq = np.asarray(prev1_seq, dtype=np.float32)
prev2_seq = np.asarray(prev2_seq, dtype=np.float32)
meta = pd.DataFrame(meta_rows, columns=["Sector_12","Year_End","Fiscal_Year"])
sid = meta["Sector_12"].map(sec2idx).astype("int32").values

# Masks
m_test_2024 = meta["Year_End"] == 2024
m_train_le_2023 = meta["Year_End"] <= 2023
m_val = (meta["Year_End"] == 2023) & m_train_le_2023
m_train = m_train_le_2023 & (~m_val)

# Inference windows for FY24-25 (need prev1 and prev2)
def build_inf_windows_with_prev(df_z, lookback, z_cols):
    X_list, p1_list, p2_list, ids = [], [], [], []
    for sec, g in df_z.groupby("Sector_12", sort=False):
        g = g.sort_values("Year_End")
        if len(g) < lookback or len(g) < 2:
            continue
        vals = g[z_cols].tail(lookback).values.astype(np.float32)
        budgets = g["Budget_Amount"].values.astype(np.float32)
        p1 = budgets[-1]
        p2 = budgets[-2]
        last_year = int(g["Year_End"].iloc[-1])
        X_list.append(vals); p1_list.append(p1); p2_list.append(p2)
        ids.append([sec, last_year + 1, fy_from_year_end(last_year + 1)])
    if not X_list:
        return None, None, None, None
    return (np.stack(X_list, axis=0),
            np.asarray(p1_list, dtype=np.float32),
            np.asarray(p2_list, dtype=np.float32),
            pd.DataFrame(ids, columns=["Sector_12","Year_End","Fiscal_Year"]))

has_2025_sequences = (meta["Year_End"] == 2025).any()
if has_2025_sequences:
    X_inf_all = X_seq[meta["Year_End"] == 2025]
    prev1_inf_all = prev1_seq[meta["Year_End"] == 2025]
    prev2_inf_all = prev2_seq[meta["Year_End"] == 2025]
    meta_inf_all = meta[meta["Year_End"] == 2025].reset_index(drop=True)
    sid_inf_all = meta_inf_all["Sector_12"].map(sec2idx).astype("int32").values
else:
    X_inf_all, prev1_inf_all, prev2_inf_all, meta_inf_all = build_inf_windows_with_prev(df_z, LOOKBACK, z_cols)
    if X_inf_all is None:
        raise RuntimeError("No sectors with enough history to form inference windows.")
    sid_inf_all = meta_inf_all["Sector_12"].map(sec2idx).astype("int32").values

# Model sweep (compact)
MODEL_KINDS = ["GRU_L3_U32","LSTM_L3_U32","BiGRU_L3_U32","TCN_L3_F32_K3"]

agg_2324, agg_2425 = [], []
preds_2024_levels = []
fcs_2425_levels = []
best_model_by_2324, best_mae_2324 = None, float("inf")

for kind in MODEL_KINDS:
    print(f"\n=== Training {kind} (resid+emb) ===")
    MODEL_DIR = DL_METRICS_DIR / kind
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    KIND_PRED_DIR = PRED_DIR / kind
    KIND_FC_DIR = FC_DIR / kind
    KIND_PRED_DIR.mkdir(parents=True, exist_ok=True)
    KIND_FC_DIR.mkdir(parents=True, exist_ok=True)

    X_tr, y_tr = X_seq[m_train], y_resid[m_train]
    X_val, y_val = X_seq[m_val], y_resid[m_val]
    X_te, y_resid_te = X_seq[m_test_2024], y_resid[m_test_2024]
    prev1_te = prev1_seq[m_test_2024]
    prev2_te = prev2_seq[m_test_2024]
    meta_te = meta[m_test_2024].reset_index(drop=True)
    sid_tr = sid[m_train]; sid_val = sid[m_val]; sid_te = sid[m_test_2024]

    if len(X_te) == 0: raise RuntimeError("No test sequences for 2024.")
    if len(X_tr) == 0: raise RuntimeError("No training sequences (<=2022).")
    if len(X_val) == 0:
        n = max(1, int(0.2 * len(X_tr)))
        X_val, y_val, sid_val = X_tr[-n:], y_tr[-n:], sid_tr[-n:]
        X_tr, y_tr, sid_tr = X_tr[:-n], y_tr[:-n], sid_tr[:-n]

    model = make_model(kind, LOOKBACK, X_tr.shape[2], n_sectors=len(sectors))
    cb = [
        callbacks.EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True),
        callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=8, min_lr=1e-5),
    ]
    _ = model.fit([X_tr, sid_tr], y_tr, validation_data=([X_val, sid_val], y_val),
                  epochs=EPOCHS, batch_size=BATCH, verbose=0, callbacks=cb)

    # 2024 evaluation: add back naive growth and reconstruct levels
    g_naive_te = np.log1p(prev1_te) - np.log1p(prev2_te)
    g_hat_resid_te = model.predict([X_te, sid_te], batch_size=64, verbose=0).ravel()
    g_hat_te = g_naive_te + g_hat_resid_te
    y_pred_2024 = np.expm1(np.log1p(prev1_te) + g_hat_te)

    # True 2024 from prevs and true residual (consistency)
    y_true_2024 = np.expm1(np.log1p(prev1_te) + (g_naive_te + y_resid_te))

    m2324 = seq_metrics(y_true_2024, y_pred_2024)
    agg_2324.append({"model": f"{kind} (resid+emb)", **m2324})
    if m2324["MAE"] < best_mae_2324:
        best_mae_2324 = m2324["MAE"]; best_model_by_2324 = kind

    preds_2024_levels.append(pd.Series(y_pred_2024, index=meta_te.index))
    pd.DataFrame({
        "Sector_12": meta_te["Sector_12"], "Fiscal_Year": meta_te["Fiscal_Year"], "Year_End": meta_te["Year_End"],
        "Prev_Year_Amount": prev1_te, "Actual": y_true_2024, "Predicted": y_pred_2024
    }).sort_values(["Sector_12"]).to_csv(KIND_PRED_DIR / "dl_predictions_23_24.csv", index=False)

    # Forecast 24-25
    g_naive_inf = np.log1p(prev1_inf_all) - np.log1p(prev2_inf_all)
    g_hat_resid_inf = model.predict([X_inf_all, sid_inf_all], batch_size=64, verbose=0).ravel()
    y_inf = np.expm1(np.log1p(prev1_inf_all) + (g_naive_inf + g_hat_resid_inf))

    fcs_2425_levels.append(pd.Series(y_inf))

    fc = pd.DataFrame({
        "Sector_12": meta_inf_all["Sector_12"],
        "Fiscal_Year": meta_inf_all["Fiscal_Year"],
        "Year_End": meta_inf_all["Year_End"],
        "Prev_Year_Amount": prev1_inf_all,
        "Forecast": y_inf
    })
    actual_next = panel.loc[panel["Year_End"].isin(fc["Year_End"]), ["Sector_12","Year_End","Budget_Amount"]].rename(columns={"Budget_Amount":"Actual"})
    fc = fc.merge(actual_next, on=["Sector_12","Year_End"], how="left").sort_values(["Sector_12"]).reset_index(drop=True)
    fc.to_csv(FC_DIR / kind / "dl_forecast_24_25.csv", index=False)

# Aggregate comparisons — FY23-24
agg23 = pd.DataFrame(agg_2324).sort_values(["MAE","RMSE"]).reset_index(drop=True)
agg23.to_csv(DL_METRICS_DIR / "summary_23_24.csv", index=False)
print("\n=== Summary (FY23-24, residual+emb, compact) ===")
try:
    from IPython.display import display
    display(agg23)
except Exception:
    print(agg23.head(10))

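# DL ensemble: unweighted mean of the level predictions from all MODEL_KINDS
# (the "Ensemble_DL3" label below is kept as-is, although four models are averaged)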
if len(preds_2024_levels):
    ens_pred_2024 = pd.concat(preds_2024_levels, axis=1).mean(axis=1).values
    meta_te = meta[m_test_2024].reset_index(drop=True)
    prev1_te = prev1_seq[m_test_2024]; prev2_te = prev2_seq[m_test_2024]
    g_naive_te = np.log1p(prev1_te) - np.log1p(prev2_te)
    y_true_2024 = np.expm1(np.log1p(prev1_te) + (g_naive_te + y_resid[m_test_2024]))
    m_ens = seq_metrics(y_true_2024, ens_pred_2024)
    pd.DataFrame([{"model":"Ensemble_DL3 (resid+emb)", **m_ens}]).to_csv(DL_METRICS_DIR / "metrics_23_24_ensemble.csv", index=False)
    agg23 = pd.concat([agg23, pd.DataFrame([{"model":"Ensemble_DL3 (resid+emb)", **m_ens}])], ignore_index=True)
    agg23.sort_values(["MAE","RMSE"], inplace=True)
    agg23.to_csv(DL_METRICS_DIR / "summary_23_24.csv", index=False)

    # Ensemble forecast
    ens_fc_vals = pd.concat(fcs_2425_levels, axis=1).mean(axis=1).values
    fc_ens = meta_inf_all.copy()
    fc_ens["Forecast"] = ens_fc_vals
    actual_next = panel.loc[panel["Year_End"].isin(fc_ens["Year_End"]), ["Sector_12","Year_End","Budget_Amount"]].rename(columns={"Budget_Amount":"Actual"})
    fc_ens = fc_ens.merge(actual_next, on=["Sector_12","Year_End"], how="left").sort_values(["Sector_12"]).reset_index(drop=True)
    fc_ens.to_csv(RESULTS / "sector_dl_forecast_2425_dl_ensemble.csv", index=False)
    print("Saved DL ensemble forecast:", RESULTS / "sector_dl_forecast_2425_dl_ensemble.csv")

# Plots (23-24)
sns.set_style("whitegrid")
for metric in ["MAE", "RMSE", "R2", "MAPE_%"]:
    plt.figure(figsize=(8,4))
    order = agg23.sort_values(metric, ascending=(metric!="R2"))["model"]
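    # Assign x to hue and disable the legend, as the seaborn deprecation warning advises
    sns.barplot(data=agg23, x="model", y=metric, hue="model", order=order, palette="viridis", legend=False)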
    plt.xticks(rotation=25, ha="right")
    plt.title(f"DL Model Comparison (FY23-24, resid+emb, compact) — {metric}")
    plt.tight_layout()
    plt.savefig(PLOTS / f"dl_compare_23_24_{metric}.png", dpi=150)
    plt.close()

# Overlay baselines if present
bl_path = DATA / "metrics" / "sector_baseline_metrics_23_24.csv"
if bl_path.exists():
    baselines = pd.read_csv(bl_path)
    baselines["model"] = baselines["model"].astype(str)
    comb = pd.concat([agg23.assign(kind="DL"), baselines.assign(kind="Baseline")], ignore_index=True, sort=False)
    for metric in ["MAE", "RMSE", "R2", "MAPE_%"]:
        plt.figure(figsize=(9,5))
        sns.barplot(data=comb, x="model", y=metric, hue="kind", palette="Set2")
        plt.xticks(rotation=25, ha="right")
        plt.title(f"DL vs Baselines (FY23-24, resid+emb, compact) — {metric}")
        plt.tight_layout()
        plt.savefig(PLOTS / f"dl_vs_baselines_23_24_{metric}.png", dpi=150)
        plt.close()


=== Training GRU_L3_U32 (resid+emb) ===

=== Training LSTM_L3_U32 (resid+emb) ===

=== Training BiGRU_L3_U32 (resid+emb) ===

=== Training TCN_L3_F32_K3 (resid+emb) ===

=== Summary (FY23-24, residual+emb, compact) ===


    model                      MAE           RMSE          R2        MAPE_%     n_eval
0   GRU_L3_U32 (resid+emb)     28036.300781  41585.615975  0.857577  16.407     11
1   LSTM_L3_U32 (resid+emb)    35296.4375    52290.722083  0.774812  17.24522   11
2   BiGRU_L3_U32 (resid+emb)   35733.082031  51145.118086  0.784571  19.335096  11
3   TCN_L3_F32_K3 (resid+emb)  53908.875     79807.368081  0.475457  26.171171  11
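
Note on the metric columns: the helper `seq_metrics` that produces MAE, RMSE, R2 and MAPE_% is defined outside this part of the notebook, so its exact formulas are not visible here. The sketch below assumes the standard textbook definitions and uses a hypothetical function name (`sketch_metrics`) and made-up numbers; the notebook's helper may differ, for example in how it treats zero actuals.

```python
import math

def sketch_metrics(y_true, y_pred):
    """Assumed standard definitions; not necessarily what seq_metrics does."""
    n = len(y_true)
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / n
    ss_res = sum(e * e for e in errs)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when all actuals are identical
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    # MAPE skips zero actuals to avoid division by zero (an assumption)
    pct = [abs(e) / abs(t) for e, t in zip(errs, y_true) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE_%": mape, "n_eval": n}

# Hypothetical values, for illustration only
m = sketch_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(m)
```

MAE and RMSE are in the units of Budget_Amount, so they are dominated by the largest sectors, while MAPE_% weights every sector equally. That is one reason the rankings by different metrics in the table above need not agree.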


Saved DL ensemble forecast: /Users/vvmohith/Desktop/PROJECT/final_data/results/fy2425/sector_dl_forecast_2425_dl_ensemble.csv



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(data=agg23, x="model", y=metric, order=order, palette="viridis")
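
Note on the residual target: the cell above converts each model's output back to budget levels with `np.expm1(np.log1p(prev1) + (g_naive + g_hat_resid))`, where `g_naive` is the previous year's log growth, and then averages the level forecasts across models for the ensemble. The sketch below restates that arithmetic with the standard library only. The function names and numbers are hypothetical and for illustration; `math.log1p` and `math.expm1` stand in for the NumPy calls used in the notebook.

```python
import math

def reconstruct_level(prev1, prev2, resid):
    """Map a predicted log-growth residual back to a budget level."""
    # Naive growth: assume last year's log growth repeats
    g_naive = math.log1p(prev1) - math.log1p(prev2)
    # Add the model's residual correction, then invert log1p
    return math.expm1(math.log1p(prev1) + g_naive + resid)

def ensemble_mean(per_model_levels):
    """Average level forecasts across models, one value per sector."""
    n_models = len(per_model_levels)
    return [sum(vals) / n_models for vals in zip(*per_model_levels)]

# Hypothetical inputs: budget two years back and one year back
prev2, prev1 = 100.0, 110.0

# A zero residual reproduces the naive forecast (1 + prev1)**2 / (1 + prev2) - 1
naive_level = reconstruct_level(prev1, prev2, 0.0)

# A positive residual pushes the forecast above the naive one
higher_level = reconstruct_level(prev1, prev2, 0.05)

# Two hypothetical models, two sectors each
avg = ensemble_mean([[1.0, 2.0], [3.0, 4.0]])
print(naive_level, higher_level, avg)
```

Because the models only learn the deviation from naive growth, a model that outputs zero everywhere falls back to the naive extrapolation rather than to a zero budget.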


In [39]:
import json, pathlib
p = pathlib.Path("/Users/vvmohith/Desktop/PROJECT/final_data/data/feature_columns.json")
z_cols = json.loads(p.read_text())["z_cols"]
[c for c in z_cols if "Sector_Share" in c]

['z_Sector_Share_GDP', 'z_Sector_Share_Lag1', 'z_Sector_Share_Growth']