## Forecasting Urban Energy Demand to Support Net-Zero Cities

**Problem statement:** Urban areas are major energy consumers and face challenges in balancing renewable integration with demand growth. This project develops a predictive framework that uses historical energy consumption and weather data to forecast short-term electricity demand. The insights can guide utilities and policymakers in optimizing renewable usage, reducing fossil fuel dependence, and moving closer to net-zero energy goals.

In [1]:
# Import required libraries
# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score


## Data Collection and Understanding

The intended inputs for this project are publicly available urban energy datasets that contain:
- **Historical electricity consumption** (hourly/daily basis).
- **Weather parameters** (temperature, humidity, wind speed, etc.), since weather strongly influences demand.
- **Date-time features** (day, month, hour) to capture time-based demand patterns.

The dataset will help us analyze:
1. How demand varies by time of day and weather.
2. Seasonal/weekly consumption patterns.
3. Baseline energy usage levels.
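
As a minimal sketch of the date-time features listed above, the snippet below derives hour, day of week, and month from a timestamp index and averages demand by hour. The hourly series and the column name `demand_kwh` are synthetic and hypothetical; they are not part of the dataset loaded in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly demand series (synthetic values, for illustration only)
idx = pd.date_range("2023-01-01", periods=48, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
ts = pd.DataFrame({"demand_kwh": rng.uniform(1.0, 5.0, size=len(idx))}, index=idx)

# Calendar features derived from the timestamp index
ts["hour"] = ts.index.hour
ts["day_of_week"] = ts.index.dayofweek  # Monday = 0, Sunday = 6
ts["month"] = ts.index.month
ts["is_weekend"] = ts["day_of_week"] >= 5

# Average demand by hour of day: a simple view of the daily load shape
hourly_profile = ts.groupby("hour")["demand_kwh"].mean()
print(hourly_profile.head())
```

The same columns can be passed to a regression model as inputs once a real time series is available.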




In [3]:
import pandas as pd

df = pd.read_csv("data/CEEW - IRES Data.csv", low_memory=False)  
df.head()


Unnamed: 0,hhid,enumerator_id,s_name,state_abbv,s_code,d_name,d_code,village_ward_name,village_ward_census_code,interview_date,...,q610_e_bill_year,q613_picture_bill_yn,interview_length,replacement_details,sw_dist,sw_state,asset_index_1,asset_decile_1,asset_index_2,asset_decile_2
0,343937810,OR_4,Odisha,OR,21,PURI,387,SAINSASASAN,408946,12-02-2019,...,,,31,,3915.0,16129.0,-2.964007,1,-2.784152,1
1,3444379112,RJ_8,Rajasthan,RJ,8,BARMER,115,PATON KA BARA,87562,1/18/2020,...,,,46,,5237.0,11732.0,-2.964007,1,-2.784152,1
2,344869428,JH_4,Jharkhand,JH,20,LOHARDAGA,356,MASMANOTHAKURGAON,363164,12/16/2019,...,,,38,,969.0,16097.0,-2.964007,1,-2.784152,1
3,344393847,HAR_2,Haryana,HR,6,PANCHKULA,69,PANCHKULA (M CL) WARD NO.-0001,800363-1,12-11-2019,...,,,38,,1109.0,10525.0,-2.352909,1,-2.784152,1
4,352717410,UP_1,Uttar Pradesh,UP,9,SITAPUR,154,AKAI CHANDUPUR TAPPA,137895,1/31/2020,...,,,34,,8933.0,24252.0,-2.374308,1,-2.182223,1


In [6]:
import numpy as np
import re


na_like = [
    "NA","N/A","na","n/a"," ", "", "  ",
    "DK","D K","Don't know","Dont know","don’t know",
    "Refused","refused","Missing","Nil","-","--","nan","NULL","null"
]
for col in df.columns:
    if df[col].dtype == "object":
        # astype(str) turns real NaN into the string "nan"; na_like maps it back to NaN below
        df[col] = df[col].astype(str).str.strip()
df.replace(na_like, np.nan, inplace=True)
# --- helpers for locating other date/time columns by name (not used further in this cell) ---
def parse_dt(s):
    return pd.to_datetime(s, errors="coerce")

date_candidates = [c for c in df.columns if re.search(r"date$", c, re.I) or re.search(r"_date_", c, re.I) or c.lower()=="interview_date"]
time_candidates = [c for c in df.columns if re.search(r"time$", c, re.I) or c.lower() in ["interview_start_time","interview_end_time"]]

# --- parse date/time columns if they exist ---
# adjust format strings once you know what your data looks like

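# interview_date mixes separators in the raw file (e.g. "12-02-2019" and "1/18/2020").
# The slash-separated values are clearly month-first, so normalize the separator
# and parse everything as month-day-year; change the format if your data differs.
if "interview_date" in df.columns:
    df["interview_date"] = pd.to_datetime(
        df["interview_date"].str.replace("/", "-", regex=False),
        format="%m-%d-%Y",
        errors="coerce",
    )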

# for start/end times (example: "14:35")
for col in ["interview_start_time", "interview_end_time"]:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], 
                                 format="%H:%M",  # change if needed
                                 errors="coerce").dt.time

# --- drop free-text "other" columns and fully-empty columns ---
drop_other = [c for c in df.columns if c.endswith("_other_s")]
drop_unnamed = [c for c in df.columns if c.lower().startswith("unnamed:")]
df.drop(columns=drop_other + drop_unnamed, inplace=True, errors="ignore")
empty_cols = [c for c in df.columns if df[c].isna().all()]
df.drop(columns=empty_cols, inplace=True)

# --- safely convert mostly-numeric text columns to numeric ---
obj_cols = df.select_dtypes(include="object").columns.tolist()
for c in obj_cols:
    s_num = pd.to_numeric(df[c], errors="coerce")
    non_null = df[c].notna().sum()
    if non_null and (s_num.notna().sum() / non_null) >= 0.9:
        df[c] = s_num

# --- save cleaned file ---
df.to_csv("data/CEEW_IRES_Data_cleaned.csv", index=False)

print("✅ done! saved -> data/CEEW_IRES_Data_cleaned.csv")
print("shape:", df.shape)


✅ done! saved -> data/CEEW_IRES_Data_cleaned.csv
shape: (14851, 492)
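
Because `errors="coerce"` silently turns unparseable values into `NaT`, it can help to count how many values a parsing step loses. The sketch below is one way to do that; the helper name `coerce_loss` is hypothetical, and the sample values are copied from the `interview_date` preview above, parsed here with a single hyphen-only format to show the effect.

```python
import pandas as pd

def coerce_loss(raw: pd.Series, parsed: pd.Series) -> int:
    """Number of values that were present before parsing but are missing after."""
    return int((raw.notna() & parsed.isna()).sum())

# Sample values in the two separator styles seen in the raw file
raw = pd.Series(["12-02-2019", "1/18/2020", "12/16/2019", "12-11-2019", None])

# A hyphen-only format cannot match the slash-separated values
parsed = pd.to_datetime(raw, format="%m-%d-%Y", errors="coerce")

lost = coerce_loss(raw, parsed)
print(f"{lost} of {raw.notna().sum()} dates failed to parse")
```

A non-zero count is a signal to inspect the raw strings before any missing-value imputation is applied.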


In [8]:
import pandas as pd


df = pd.read_csv("data/CEEW_IRES_Data_cleaned.csv")

# quick look at first 5 rows
print("Preview:")
display(df.head())

# summary info about columns and datatypes
print("\nInfo:")
print(df.info())

# see how many missing values each column has (top 15 only)
print("\nMissing values:")
print(df.isna().sum().sort_values(ascending=False).head(15))


Preview:


Unnamed: 0,hhid,enumerator_id,s_name,state_abbv,s_code,d_name,d_code,village_ward_name,village_ward_census_code,interview_date,...,q610_e_bill_year,q613_picture_bill_yn,interview_length,replacement_details,sw_dist,sw_state,asset_index_1,asset_decile_1,asset_index_2,asset_decile_2
0,343937810,OR_4,Odisha,OR,21,PURI,387.0,SAINSASASAN,408946,2019-12-02,...,,,31,,3915.0,16129.0,-2.964007,1,-2.784152,1
1,3444379112,RJ_8,Rajasthan,RJ,8,BARMER,115.0,PATON KA BARA,87562,,...,,,46,,5237.0,11732.0,-2.964007,1,-2.784152,1
2,344869428,JH_4,Jharkhand,JH,20,LOHARDAGA,356.0,MASMANOTHAKURGAON,363164,,...,,,38,,969.0,16097.0,-2.964007,1,-2.784152,1
3,344393847,HAR_2,Haryana,HR,6,PANCHKULA,69.0,PANCHKULA (M CL) WARD NO.-0001,800363-1,2019-12-11,...,,,38,,1109.0,10525.0,-2.352909,1,-2.784152,1
4,352717410,UP_1,Uttar Pradesh,UP,9,SITAPUR,154.0,AKAI CHANDUPUR TAPPA,137895,,...,,,34,,8933.0,24252.0,-2.374308,1,-2.182223,1



Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14851 entries, 0 to 14850
Columns: 492 entries, hhid to asset_decile_2
dtypes: float64(419), int64(61), object(12)
memory usage: 55.7+ MB
None

Missing values:
q457_geyser_3_min            14850
q442_1_ac_4_bee_rating       14850
q442_1_ac_4_months_no        14850
q442_1_ac_4_hrs              14850
q442_1_ac_4_cap              14850
q457_geyser_3_hrs            14850
q458_a_imm_rod_3_min         14849
q458_a_imm_rod_3_hrs         14849
q424_air_coolers_4_hrs       14847
q412_fan_8_hrs               14846
q442_1_ac_3_bee_rating_      14842
q442_1_ac_3_months_no        14842
q442_1_ac_3_hrs              14842
q442_1_ac_3_cap              14842
q321_emerg_light_spending    14839
dtype: int64
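
The listing above shows columns with only one or two observed values out of 14,851 rows. One option, sketched below, is to drop columns whose share of missing values exceeds a threshold before any imputation. The function name `drop_sparse_columns`, the 0.95 threshold, and the toy frame are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.95) -> pd.DataFrame:
    """Return a copy without columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    return frame.loc[:, keep].copy()

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "hhid": range(100),
    "fan_1_hrs": [8.0] * 60 + [np.nan] * 40,  # 40% missing: kept
    "ac_4_hrs": [6.0] + [np.nan] * 99,        # 99% missing: dropped
})

reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

The right threshold depends on the analysis; a column that is sparse overall may still matter for the subset of households that own the appliance.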


In [10]:


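# Keep the prediction target out of imputation so that rows without an observed
# bill can be dropped later instead of being assigned a made-up value
TARGET = "q610_e_bill_month"

num_cols = df.select_dtypes(include=['int64', 'float64']).columns.drop(TARGET, errors="ignore")
cat_cols = df.select_dtypes(include=['object']).columns.drop(TARGET, errors="ignore")

# Fill numeric with median
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

# Fill categorical with mode
for col in cat_cols:
    if df[col].isna().sum() > 0:  # only if missing exists
        df[col] = df[col].fillna(df[col].mode()[0])

# Verify cleanup (feature columns only; the target is left untouched)
print("\nAfter cleaning, missing values left:")
print(df.drop(columns=[TARGET], errors="ignore").isna().sum().sum())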



After cleaning, missing values left:
0
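
Median filling removes the information about which values were actually observed. For appliance-usage columns, a missing value may simply mean the household does not own that appliance; this is an assumption and has not been checked against the survey codebook. The sketch below keeps that information by adding an indicator column before filling. The helper name `fill_with_indicator` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_indicator(frame: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Median-fill the given columns and add a <col>_was_missing flag for each."""
    out = frame.copy()
    for col in cols:
        out[f"{col}_was_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy column with two missing entries
toy = pd.DataFrame({"fan_hrs": [4.0, np.nan, 10.0, 6.0, np.nan]})

filled = fill_with_indicator(toy, ["fan_hrs"])
print(filled)
```

A model can then learn a separate effect for the filled rows instead of treating the median as a real measurement.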


“Urban areas are major energy consumers and face challenges in balancing renewable integration with demand growth. This project develops a predictive framework that uses historical energy consumption and related drivers to forecast long-term energy demand.”

**In the absence of large-scale city-level time-series data, this prototype uses the CEEW–IRES household survey dataset as a proxy. The dataset provides household-level information on electricity bills, appliance ownership, and socio-economic indicators across multiple states.**

**Electricity bills (`q610_e_bill_month`, the target column used in the modeling code) act as a surrogate for household energy demand.**

Appliances, demographics, and asset indices represent demand drivers, similar to how weather or income influence load in real systems.

By training predictive models on this data, we demonstrate the feasibility of demand forecasting at the household/urban level.

This serves as a proof-of-concept prototype. While it does not yet include time-series weather variables, the methodology (data cleaning → feature engineering → demand prediction) mirrors what would be done on larger datasets. Thus, this prototype lays the foundation for scaling to actual city-wide forecasting with integrated weather and renewable energy data.     
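
To make the feature-engineering step more concrete, the sketch below turns per-unit appliance columns into two household-level features: the number of units reported and their total reported hours. The toy column names are modeled on columns such as `q412_fan_8_hrs`, the helper `appliance_features` is hypothetical, and treating a missing value as "no such unit" is an assumption. It would need to run on the data before missing values are filled.

```python
import numpy as np
import pandas as pd

def appliance_features(frame: pd.DataFrame, pattern: str, name: str) -> pd.DataFrame:
    """Count reported units and sum reported hours over columns matching pattern."""
    hours = frame.filter(regex=pattern)
    return pd.DataFrame({
        f"n_{name}": hours.notna().sum(axis=1),
        f"{name}_total_hrs": hours.sum(axis=1),  # NaN is skipped, so no units gives 0
    })

# Toy frame: three households with up to two fans each
toy = pd.DataFrame({
    "q412_fan_1_hrs": [8.0, 5.0, np.nan],
    "q412_fan_2_hrs": [4.0, np.nan, np.nan],
    "asset_index_1": [-2.9, -2.3, -2.4],
})

fan_features = appliance_features(toy, r"^q412_fan_\d+_hrs$", "fans")
print(fan_features)
```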

In [9]:
from sklearn.model_selection import train_test_split

# Example: pick a few input features (X) and one target (y)
X = df[["asset_index_1", "asset_index_2", "sw_dist"]]   # you can adjust features later
y = df["q610_e_bill_month"]   # monthly electricity bill (target for now)

# drop rows with a missing feature or target value
data = pd.concat([X, y], axis=1).dropna()
X = data.drop(columns=["q610_e_bill_month"])
y = data["q610_e_bill_month"]

# split into training/testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (3532, 3)
Test shape: (884, 3)
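
The imports at the top of the notebook include `RandomForestRegressor`, `mean_squared_error`, and `r2_score`. The sketch below shows how a split like the one above could feed a model and be scored against a naive median baseline. It runs on synthetic data with made-up coefficients, so the numbers say nothing about the survey; the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for three features and a bill-like target
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 3))
y_syn = 200 + 80 * X_syn[:, 0] + 30 * X_syn[:, 1] ** 2 + rng.normal(scale=10, size=500)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fit the model on the training split only
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(Xs_train, ys_train)
pred = model.predict(Xs_test)

# Score against a baseline that always predicts the training median
rmse = np.sqrt(mean_squared_error(ys_test, pred))
baseline_pred = np.full_like(ys_test, np.median(ys_train))
baseline_rmse = np.sqrt(mean_squared_error(ys_test, baseline_pred))
r2 = r2_score(ys_test, pred)

print(f"Model RMSE: {rmse:.1f}  Baseline RMSE: {baseline_rmse:.1f}  R2: {r2:.2f}")
```

Comparing against a baseline gives the error metric a reference point; a model that cannot beat the median is not yet using its features.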
