Matthew Manberg, Sarah Cadet, Vedant Kharwal


<div align="center"><h1>CS506 Project Proposal: Energy Load Forecasting</h1></div>


**Description of the project**

NYISO (the New York Independent System Operator) was created in the wake of a catastrophic power outage that cost the American public millions of dollars and resulted in deaths. It produces its own load forecasts, which it releases publicly and which use a methodology similar to that of the private utility companies. It oversees all of NY's jurisdictions but has an imperfect picture of when new load is introduced or removed (our guess is that this stems from the poor data sharing common among utilities), in addition to other noise. An accurate forecast is important for preventing future catastrophes.

<div align="center">
  <img src="NY_Zones.png" alt="NY Zones" style="max-width:100%; height:auto;"/>
</div>


*NYISO Zones, Source: https://www.nyiso.com/real-time-dashboard.*


Utility companies' profit is already negotiated between the state and the utilities in the rate case. They legally cannot charge more than what they pay for the energy they buy; utility bills can only charge for distribution and pass through the purchase cost. A better forecast would result in less spot buying and could save the ratepayer (the person who pays the utility bill) millions of dollars a day, in addition to further informing NYISO’s important oversight.

Previous forecasts are rooted in a deterministic methodology even though the system behaves as a non-linear, chaotic environment. An empirical, dynamic, and inductive data-driven approach such as deep learning may outperform current forecasts. Business events such as outages, industrial load spikes, and residential load spikes cause sudden, seemingly stochastic changes in load. A decision tree may limit further error by switching models when a criterion is met (for example, 10% error over a given time period).
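One possible reading of the model-switching idea is sketched below: track the mean absolute percentage error over a recent window and fall back to a second model once it passes 10%. The function names, the 12-point window, and the `"primary"`/`"fallback"` labels are illustrative assumptions, not part of the proposal.

```python
from typing import Sequence


def rolling_mape(actual: Sequence[float], forecast: Sequence[float], window: int) -> float:
    """Mean absolute percentage error over the last `window` points.

    Assumes every actual load value is non-zero, which holds for grid load.
    """
    pairs = list(zip(actual, forecast))[-window:]
    if not pairs:
        raise ValueError("need at least one (actual, forecast) pair")
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def choose_model(
    actual: Sequence[float],
    forecast: Sequence[float],
    window: int = 12,        # hypothetical: 12 five-minute points = 1 hour
    threshold: float = 0.10,  # the 10% error criterion mentioned above
) -> str:
    """Return which model to use next, based on the primary model's recent error."""
    if rolling_mape(actual, forecast, window) > threshold:
        return "fallback"
    return "primary"
```

A learned decision tree could later replace the fixed threshold; the fixed rule only gives a baseline to compare against.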


<div align="center">
  <img src="Load_With_Losses.png" alt="Load_With_Losses" style="max-width:100%; height:auto;"/>
</div>


*Load With Losses, Source: https://www.nyiso.com/real-time-dashboard*


**Clear goal(s) (e.g. Successfully predict the number of students attending lecture based on the weather report).**

There are two main goals of this project:

1. Explore data behavior of NY’s Energy Load (ACF, business events, etc…).
2. **Attempt to outcompete NYISO’s time-series forecasting of Energy Load** on an hourly or more granular scale on an aggregate or zone basis.


**What data needs to be collected and how you will collect it (e.g. scraping xyz website or polling students).**


_Univariate Analyses:_  
- NYISO Energy Load, which can be pulled from https://mis.nyiso.com/public/P-58Blist.htm (https://www.nyiso.com/custom-reports) via a web scraper we will build.
  - Reported every 5 minutes, with periods of N/A.

_Multivariate Analyses:_
- NYS MESONET
  - Temperature and precipitation, which can be requested via email (https://www.nysmesonet.org/weather/requestdata).
  - Reported every 5 minutes, with periods of N/A.
- NOAA Data
  - Weather Station API with 50+ variables, but inconsistent in both the variables and the data available (https://www.weather.gov/documentation/services-web-api).
  - NYISO uses the following weather stations: https://www.nyiso.com/documents/20142/38389687/Weather-Station-Names.pdf/6c105682-4f95-585d-3203-c24abfd93247
  - Reported every hour, with periods of N/A.

_General Analyses:_
- NYISO’s Energy Load Forecast, which can be pulled from https://www.nyiso.com/custom-reports.

The data are not validated and will need to go through calibration/processing. 2021 may be of particular interest, as the COVID-19 pandemic likely caused outlier behavior in the energy load data.
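Because the load and weather feeds arrive every 5 minutes with N/A gaps while NOAA data is hourly, one likely processing step is aggregating to hourly means while skipping missing readings. The sketch below assumes a hypothetical row layout of `(timestamp, load)` strings with timestamps formatted as `YYYY-MM-DD HH:MM`. The real NYISO and Mesonet file layouts, the set of missing-value markers, and the `min_points` cutoff are all assumptions to be revisited once the data is in hand.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical markers for missing readings; adjust to the real files.
MISSING = {"", "N/A", "NA", "nan"}


def hourly_mean_load(
    rows: Iterable[Tuple[str, str]], min_points: int = 6
) -> Dict[datetime, Optional[float]]:
    """Average 5-minute readings into hourly values.

    Hours with fewer than `min_points` valid readings are reported as None
    so that they can be imputed or excluded later instead of silently biased.
    """
    buckets = defaultdict(list)
    for ts, value in rows:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(minute=0)
        readings = buckets[hour]  # registers the hour even if the value is missing
        if value.strip() not in MISSING:
            readings.append(float(value))
    return {
        hour: (sum(vals) / len(vals) if len(vals) >= min_points else None)
        for hour, vals in sorted(buckets.items())
    }
```

Returning `None` for sparse hours keeps the gap visible in later plots and error metrics.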

*(Time-pending, secondary)*: Employ a decision-tree, state-space machine, or another model for business events. 


**How you plan on modeling the data (e.g. clustering, fitting a linear model, decision trees, XGBoost, some sort of deep learning method, etc.).**

- ACF, PACF testing  
- Linear Regression  
- Generalized baseline, smoothed shape of aggregate data   
- Google’s TimesFM (Time Series Foundation Model), recommended by Professor Jeff Considine (BU CS/CDS)  
- Time-pending, decision tree business event model  
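As a rough illustration of the ACF step, the sample autocorrelation can be computed directly from a load series. This is a minimal pure-Python sketch under the standard biased estimator; in practice a library routine would likely be used, and the function name here is an illustrative choice.

```python
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]
```

For hourly load, a strong peak near lag 24 would suggest a daily cycle, which helps decide how far out a forecast is plausible.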

**How do you plan on visualizing the data? (e.g. interactive t-SNE plot, scatter plot of feature x vs. feature y).**

- Line Graphs of Variables through Time  
- Line Graphs with Confidence Intervals  
- Models and their Visual Representation (ex: recurrent neural networks explanatory image)  
- MAE Bar Chart comparison  
- Time-pending, PowerBI Dashboard


**What is your test plan? (e.g. withhold 20% of data for testing, train on data collected in October and test on data collected in November, etc.).**

We will use the years 2001 - 2021 as training data, 2022 as validation data, and 2023, 2024, and 2025 as testing data. We chose this split to stay as close as we could to a 70%/10%/20% training/validation/testing split while also taking into account the effects that quarantining in 2020 and 2021 could have had on the data; by year count, it works out to roughly 84%/4%/12%. Furthermore, since this is time-series data, we are using the latest data for validation and testing to properly test how the model would perform on live data.
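The year-based split can be expressed as a small helper. This sketch assumes records are `(timestamp, load)` pairs; the function and constant names are illustrative.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

TRAIN_YEARS = range(2001, 2022)  # 2001 through 2021 inclusive
VAL_YEARS = range(2022, 2023)    # 2022 only
TEST_YEARS = range(2023, 2026)   # 2023 through 2025 inclusive

Record = Tuple[datetime, float]


def split_by_year(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Chronological split: no shuffling, so later years never leak into training."""
    splits: Dict[str, List[Record]] = {"train": [], "val": [], "test": []}
    for ts, load in records:
        if ts.year in TRAIN_YEARS:
            splits["train"].append((ts, load))
        elif ts.year in VAL_YEARS:
            splits["val"].append((ts, load))
        elif ts.year in TEST_YEARS:
            splits["test"].append((ts, load))
        # Records outside 2001-2025 are dropped.
    return splits
```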

- Explore general data behavior for hypothesis testing and system behavior. Identify how granular business events may impact larger aggregate data. 
- Employ ACF, PACF, time-series behavior analysis.
- Employ Deep Learning and other forecasting techniques. 
- Employ Bootstrapping, Cross-validation, other empirical uncertainty analyses. 
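For the bootstrapping item, a percentile bootstrap interval around a model's MAE is one simple option. The sketch below resamples errors independently, which ignores the autocorrelation present in load data; a block bootstrap may be more appropriate and is left as an open choice. The names and defaults are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mae_interval(
    actual: Sequence[float],
    forecast: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean absolute error."""
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    if not errors:
        raise ValueError("need at least one (actual, forecast) pair")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(errors)
    stats = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```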


Given the results of the ACF/PACF analysis, we will determine how far out a forecast can be made. We can compare each model's MAE over time (ex: 2015 - 2024) across the predicted future points. For univariate analyses, the training and validation variable will be load. For multivariate analyses, it will be load plus the weather variables that can be wrangled (ex: temperature, solar irradiance, precipitation).
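The MAE comparison can be sketched as a ranking of forecasts against the same actual load. The model names are placeholders and the numbers in the usage note are toy values, not real NYISO data.

```python
from typing import Dict, List, Sequence, Tuple


def mean_absolute_error(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Average absolute difference between actual and forecast load."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rank_models(
    actual: Sequence[float], forecasts: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (model name, MAE) pairs sorted from lowest to highest error."""
    scores = {name: mean_absolute_error(actual, f) for name, f in forecasts.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

For example, `rank_models([100, 110, 120], {"nyiso": [102, 108, 123], "candidate": [101, 110, 119]})` places `"candidate"` first on these toy numbers. The same scores could feed the MAE bar chart.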


**Working Works Cited**

https://www.nature.com/articles/s41598-025-04210-1

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4893923
