# Machine Learning Energy Demand Forecasting

This project focuses on forecasting electricity demand for the **Western Regional Hub of PJM**, a regional transmission organization (RTO) that coordinates the wholesale electricity grid across parts of the eastern United States. The hub represents a defined geographic area where electricity prices are aggregated for commercial energy trading and financial futures contracts.

To build the forecast, I trained an **XGBoost regression model** on 12 years of historical hourly demand data (measured in megawatts). The model generates predictions of future electricity demand, which could support use cases such as validating infrastructure projects or informing energy trading strategies.  

The primary aim of this project, however, is to demonstrate my capability in applying **machine learning techniques for demand forecasting**. The same approach can be generalized to forecast demand for virtually any product or service, provided there is consistently recorded historical data available over time.

## Methodology  

### Data Exploration / Cleaning / Preparation
Begin with a thorough exploration of the dataset to understand its structure, quality, and key patterns. This step involves examining time-series trends, identifying seasonal or cyclical effects, and checking for anomalies or missing data.  

### Feature Engineering  
Derive informative features that capture the underlying drivers of demand. These may include calendar variables (hour, day, month, season), lagged values of the target variable, rolling statistics to capture recent trends, and domain-specific indicators such as peak hours, business hours, or special events. Where relevant, external variables like weather or pricing data can also be integrated.  

### Model Development  
Select and train an appropriate machine learning model, ensuring that the temporal nature of the data is respected. This typically involves time-based train/test splits or rolling cross-validation. Hyperparameters are tuned to balance accuracy and generalization, while evaluation metrics such as MAE, RMSE, or R² are used to benchmark performance.  
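
As a minimal sketch of the rolling cross-validation idea, the helper below yields expanding-window splits in which every test fold lies strictly after its training data. The function name `expanding_window_splits` and its parameters are hypothetical and not taken from the project code; scikit-learn's `TimeSeriesSplit` offers comparable behavior.

```python
from typing import Iterator, Tuple


def expanding_window_splits(
    n_samples: int, n_folds: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_positions, test_positions) pairs in chronological order."""
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested folds.")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        # Train on everything before the fold, test on the next block of rows.
        yield range(0, test_start), range(test_start, test_start + test_size)


# Example: 100 hourly rows, 3 folds of 10 rows each.
splits = list(expanding_window_splits(n_samples=100, n_folds=3, test_size=10))
for train_pos, test_pos in splits:
    print(f"train 0..{train_pos[-1]}  test {test_pos[0]}..{test_pos[-1]}")
```

Because the training window only ever grows forward in time, no fold is evaluated on data that precedes what the model was fitted on.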

### Results and Challenges  
Evaluate the model’s predictive accuracy and assess how well it captures both baseline demand and extreme values. Discuss challenges encountered, such as handling missing data, underestimation of peaks, or the absence of external drivers like weather variables.  

### Conclusion  
Summarize the effectiveness of the approach, highlight the main findings, and reflect on limitations. Finally, outline opportunities for improvement, such as incorporating additional data sources, experimenting with alternative algorithms, or applying the framework to other domains.

## Data Exploration / Cleaning / Preparation

- No null values were found.
- Abnormal readings were identified as values below 3,250 MW.
- The data was split 80%/20% into training and test sets.
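
The cleaning and splitting steps could look roughly like the sketch below. It runs on synthetic data, and several details are assumptions rather than facts from the project: the column name `demand_mw` is hypothetical, abnormal readings are dropped (the original notebook may have clipped or interpolated them instead), and the 80%/20% split is taken chronologically rather than at random.

```python
import numpy as np
import pandas as pd

# Synthetic hourly demand in MW, standing in for the PJM West data.
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=1000, freq=pd.Timedelta(hours=1))
demand = pd.DataFrame(
    {"demand_mw": 5500 + 800 * rng.standard_normal(1000)}, index=index
)
demand.iloc[10, 0] = 1200.0  # inject one clearly abnormal reading

MIN_VALID_MW = 3250  # threshold reported in the list above

# Assumption: abnormal readings are removed rather than repaired.
clean = demand[demand["demand_mw"] >= MIN_VALID_MW]

# Assumption: chronological split, first 80% for training, last 20% for testing.
split_at = int(len(clean) * 0.8)
train, test = clean.iloc[:split_at], clean.iloc[split_at:]

print(f"rows kept: {len(clean)}, train: {len(train)}, test: {len(test)}")
```

Splitting by position on a time-sorted index keeps every test observation later than every training observation, which avoids leaking future information into training.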

![image.png](./Graphs/energy_usage.png)

![image.png](./Graphs/clean_metrics.png)

![image.png](./Graphs/train_vs_test_split_1.png)

![image.png](./Graphs/train_vs_test_split_2.png)


## Feature Engineering

To capture the temporal dynamics of electricity demand, I engineered a set of features based on **time, lags, and rolling statistics**. Calendar features such as hour of day, day of week, month, quarter, and season were included to model recurring daily, weekly, and annual patterns. Additional binary flags were created to represent **business hours**, **peak hours**, and **weekends**, while sine–cosine transformations were applied to cyclical variables like hour of day and day of week to better capture periodic effects.  
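
The sketch below shows one way such calendar and cyclical features might be built with pandas. The function name, the column names, and in particular the definitions of business hours (09:00 to 17:00 on weekdays) and peak hours (17:00 to 21:00) are assumptions made for illustration; the project's actual definitions are not stated here.

```python
import numpy as np
import pandas as pd


def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, flag, and sine/cosine features from a DatetimeIndex."""
    out = df.copy()
    hour = out.index.hour.to_numpy()
    dow = out.index.dayofweek.to_numpy()  # Monday=0 ... Sunday=6
    month = out.index.month.to_numpy()

    out["hour"] = hour
    out["dayofweek"] = dow
    out["month"] = month
    out["quarter"] = out.index.quarter.to_numpy()
    # 0=winter (Dec-Feb), 1=spring, 2=summer, 3=autumn
    out["season"] = (month % 12) // 3

    # Binary flags; the hour ranges are illustrative assumptions.
    out["is_weekend"] = (dow >= 5).astype(int)
    out["is_business_hours"] = ((hour >= 9) & (hour < 17) & (dow < 5)).astype(int)
    out["is_peak_hours"] = ((hour >= 17) & (hour < 21)).astype(int)

    # Cyclical encoding so that hour 23 sits next to hour 0, Sunday next to Monday.
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    return out


# Example on one weekend (Saturday 2024-01-06 and Sunday 2024-01-07).
index = pd.date_range("2024-01-06", periods=48, freq=pd.Timedelta(hours=1))
features = add_calendar_features(pd.DataFrame({"demand_mw": 5000.0}, index=index))
print(features[["hour", "is_weekend", "hour_sin", "hour_cos"]].head())
```

The sine and cosine pair places each hour on a unit circle, so the model sees 23:00 and 00:00 as neighbors instead of as the two most distant values.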

To incorporate historical dependencies, I created **lag features** (1 hour, 24 hours, and 168 hours) to capture short-term autocorrelation as well as daily and weekly seasonality. I also added **rolling statistics** (mean and standard deviation over 24-hour and 168-hour windows) to smooth fluctuations and represent recent demand trends. This combination of features allows the model to leverage both short-term variations and long-term seasonal cycles, improving its ability to generalize to future demand patterns.
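
A possible pandas implementation of the lag and rolling features is sketched below. The function and column names are hypothetical. The rolling windows are computed on the series shifted by one hour so that the current target never appears in its own features; that shift is a design choice of this sketch and may differ from the original code. Row-based shifts also assume a gap-free hourly index.

```python
import numpy as np
import pandas as pd


def add_lag_and_rolling_features(
    df: pd.DataFrame, target: str = "demand_mw"
) -> pd.DataFrame:
    """Add 1h/24h/168h lags and 24h/168h rolling mean and std of the target."""
    out = df.copy()
    for lag in (1, 24, 168):
        out[f"lag_{lag}h"] = out[target].shift(lag)

    # Use only past values for rolling statistics to avoid target leakage.
    past = out[target].shift(1)
    for window in (24, 168):
        out[f"roll_mean_{window}h"] = past.rolling(window).mean()
        out[f"roll_std_{window}h"] = past.rolling(window).std()

    # Rows without a full week of history have NaN features and are dropped.
    return out.dropna()


# Example: demand equal to the row number makes the lags easy to verify.
index = pd.date_range("2024-01-01", periods=400, freq=pd.Timedelta(hours=1))
raw = pd.DataFrame({"demand_mw": np.arange(400, dtype=float)}, index=index)
lagged = add_lag_and_rolling_features(raw)
print(lagged.iloc[0])
```

The first 168 rows are lost because the weekly lag and the weekly rolling window both need a full week of history.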

![image.png](./Graphs/example_of_feature.png)



## Usage by Hour

Electricity demand shows a noticeable increase during the afternoon and into the evening hours. This trend likely reflects consumer behavior, with demand rising as households turn on **air conditioning** during the warmer parts of the day and then peaking further when people return home from work to use appliances such as **televisions and lighting**.
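
Average usage profiles like this one can be obtained by grouping the hourly series on a calendar attribute. The sketch below uses synthetic data with a hypothetical `demand_mw` series whose shape is invented for illustration; the same `groupby` pattern would apply to month or day of week.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly demand: lowest near 06:00, highest near 18:00.
index = pd.date_range("2024-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hours = index.hour.to_numpy()
demand = pd.Series(
    5500 - 700 * np.sin(2 * np.pi * hours / 24), index=index, name="demand_mw"
)

# Mean demand for each hour of the day across all days.
hourly_profile = demand.groupby(demand.index.hour).mean()

# The same pattern gives a day-of-week profile (Monday=0 ... Sunday=6).
weekday_profile = demand.groupby(demand.index.dayofweek).mean()

print("peak hour:", hourly_profile.idxmax(), "lowest hour:", hourly_profile.idxmin())
```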

![image.png](./Graphs/hour_usage.png)


## Usage by Month  

Electricity demand is highest during the **summer** and **winter** months, largely driven by seasonal climate conditions. In summer, demand increases due to widespread use of **air conditioning**, while in winter, **heating requirements** significantly raise consumption. These seasonal peaks highlight how weather-related factors strongly influence energy usage patterns.

![image.png](./Graphs/monthly_usage.png)

## Usage by Week

In addition to daily variability, a clear weekly pattern emerges: **electricity demand is lower on weekends compared to weekdays**. This likely reflects differences in consumer behavior, as many people spend more time away from home on weekends, leading to reduced residential energy use.

![image.png](./Graphs/weekly_usage_an.png)

## Model Construction

![image.png](./Graphs/model_det.png)

![image.png](./Graphs/mae_model.png)

![image.png](./Graphs/feature_importance.png)

## Prediction vs. Actual

![image.png](./Graphs/raw_data_vs_prediction.png)

![image.png](./Graphs/week_prediction.png)

![image.png](./Graphs/accuracy_score.png)

## Results and Conclusion  

The XGBoost regression model achieved a **root mean squared error (RMSE)** of approximately *[insert your `mse_score` value here]* on the test set, alongside an **R² score** of *[insert your `R_squared_value` here]*, indicating a solid fit between predicted and actual demand. During training, the model steadily reduced the **mean absolute error (MAE)**, with validation MAE improving from around **569 MW** in early iterations to about **504 MW** by the final rounds.  
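
For reference, the three metrics mentioned here can be computed as in the NumPy sketch below. The function name and the sample values are made up for illustration and are not project results. If the stored `mse_score` is a mean squared error, its square root gives the RMSE.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))  # square root of the MSE
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Illustrative values in MW, not taken from the model's output.
actual = [5000, 5200, 5400, 5600]
predicted = [5100, 5100, 5500, 5500]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in megawatts, so they can be compared directly with typical demand levels, while R² is unitless.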

These results suggest that the model successfully captures the overall demand patterns within the PJM West region, though it tends to underestimate extreme peaks, a common challenge in electricity load forecasting.

### Limitations  
A key limitation of this model is the absence of **temperature and weather-related features**, which are strong drivers of demand volatility. Incorporating such variables was not feasible here due to the **large geographic area covered by the PJM West Hub**, where a single representative temperature measure would fail to capture local variation. As a result, the model relies solely on time-based and lagged features, which provide good baseline accuracy but lack the ability to explain weather-driven extremes.  

### Next Steps  
Future improvements could include:  
- Integrating **regionalized weather data** (e.g., weighted averages across key cities in the PJM West footprint).  
- Exploring **quantile regression objectives** in XGBoost to better capture peak demand events.  
- Comparing performance with **neural network architectures** (e.g., LSTMs or temporal CNNs) that may capture complex sequential dependencies more effectively.  
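
To illustrate why a quantile objective might help with peaks, the sketch below implements the pinball (quantile) loss in NumPy. It is not XGBoost code; the function name and the numbers are hypothetical. Recent XGBoost releases (2.0 and later) expose a comparable objective as `reg:quantileerror`.

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile: float) -> float:
    """Mean pinball loss; under-prediction is weighted by `quantile`,
    over-prediction by `1 - quantile`."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1) * diff)))


# A hypothetical 9,000 MW peak, missed by 500 MW in each direction.
peak = [9000.0]
under = pinball_loss(peak, [8500.0], quantile=0.9)  # forecast too low
over = pinball_loss(peak, [9500.0], quantile=0.9)   # forecast too high
print(f"under-prediction loss: {under:.0f}, over-prediction loss: {over:.0f}")
```

At the 0.9 quantile, missing the peak from below costs nine times as much as overshooting by the same amount, which pushes a model trained on this loss toward higher forecasts.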

Despite these limitations, this project demonstrates the practicality of machine learning methods, particularly gradient boosting, for large-scale demand forecasting, and provides a transferable framework applicable to other industries and regions.