## DSCI 521-001: Final Project Analysis

### Module submission group
- Group member 1
    - Name: Luke Chesley
    - Email: lc3368@drexel.edu
- Group member 2
    - Name: Lauren Miller
    - Email: lem324@drexel.edu
- Group member 3
    - Name: Caleb Miller
    - Email: cm3962@drexel.edu
- Group member 4
    - Name: Hashim Afzal
    - Email: ha695@drexel.edu

### Team Member Background Report 
#### Luke Chesley
Luke earned his Bachelor’s degree from University of the Arts in Instrumental Performance - Bass Trombone. He continues his career as a professional musician, performing with groups such as the Reading Symphony Orchestra and The Philly Pops, while attending Drexel University to earn a Master’s degree in Data Science. His self-identified skills are creativity, determination, and critical thinking. Luke has completed personal projects that have contributed to his knowledge of Python and of accessing and analyzing data. In the early stages of this project, his contributions include ideation of the project and writing code to access the API and view portions of the data we will further analyze.

#### Lauren Miller
Lauren holds a degree in Food Science from Drexel University and has a background in biotechnology. After completing her undergraduate studies, she gained two years of experience working at both a food science company and a biomass fermentation startup. Subsequently, Lauren returned to Drexel University to pursue a Master's degree in Data Science. She identifies her key skills as analytical thinking, problem-solving, and determination. In the initial stages of the current project, her contributions have involved project ideation and the creation of the project scoping document.

#### Caleb Miller
Caleb’s undergraduate background in Business Analytics gives him a unique perspective on Data Science. He has general knowledge of how a business operates and the technical knowledge of an analyst/data scientist, and he understands the divide between these departments and the executive suite. He aims to be a bridge between them. He has worked with a lot of sports data, as he hopes to enter the world of sports analytics/science. His self-identified skills include:

- Coding Languages: Python, R, Java, C++, SQL, VBA
- Data Analysis and Visualization: Pandas, NumPy, Dash, Matplotlib, Tableau, PowerBI
- Natural Language Processing: NLTK, spaCy
- Machine Learning: Model Development and Evaluation, Scikit-Learn
- Web Scraping: Requests, Selenium, BeautifulSoup4
- Monte Carlo Simulation

In the early stages of this project, his individual contributions include ideation of the project and contributions to the project scoping document.

#### Hashim Afzal
Hashim has an undergraduate degree in Biology from Arizona State University and has completed research in the field of cellular biology at the Biodesign Institute studying infectious diseases. Now Hashim is at Drexel University studying Data Science. His self-identified skills are in statistical analysis and study design utilizing the scientific method. In the early stages of this project, his individual contributions include ideation of the project and contributions to the project scoping document. 

******

### Discussion of Project
In this project we use data from the U.S. Energy Information Administration (EIA). The EIA makes free and open data available through an Application Programming Interface (API) and its open data tools. The API is multifaceted and contains time-series data sets organized by the main energy categories. We are analyzing hourly energy consumption data from 2018 to present. This data is also categorized by region/state. In this analysis, we aim to examine patterns within the data, including variations across locations/regions, trends over different time frames/date categories, and shifts in types of energy consumption.

https://www.eia.gov/opendata/

### Who Might be Interested
Parties who may be interested in this data include, but are not limited to, power producers, power consumers, and government entities. Power producers stand to benefit from insights into energy sources that could enhance profitability, as well as energy sources that may pose challenges to sustained profit growth. Similarly, power consumers, including businesses and individuals, would find value in this data by obtaining information on the most cost-effective resources. Individual consumers can better understand their personal energy usage by examining regional data and the average metrics within their regions, and can make more informed decisions about sustainable energy consumption. Finally, government entities may find the power metrics useful for policy formulation, resource planning, and environmental impact assessment.

### Application of Investigation
The data we are accessing is hourly power consumption data spanning from 2018 to present. It consists of over three hundred thousand rows, which can be integrated into various applications or investigative efforts. The substantial volume of data supports predictive modeling and the extraction of valuable insights. The depth of this dataset lends itself to diverse applications, enhancing the potential for informed decision making and strategic planning.
Furthermore, performing time series analysis on the hourly power consumption data provides an opportunity to delve into trends in fuel consumption over various time intervals. Analysis could include daily trends, weekly trends, yearly trends, visualization, statistical analysis, forecasting, and anomaly detection.

************

## Power Output Data Analysis


### Acquisition of Data: API

**File: database.ipynb**

The API access code is in the file database.ipynb. The data was accessed through the U.S. Energy Information Administration API.
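
As a rough illustration of what a paged request to the EIA API might look like, the sketch below only builds the request URL and does not contact the network. The route, the parameter names, and the helper `build_eia_url` are assumptions based on the public EIA API v2 conventions and are not taken from database.ipynb; check them against the API browser on the EIA open data page before use.

```python
from urllib.parse import urlencode

# Assumed EIA API v2 route for hourly generation by fuel type (not confirmed by the notebook).
EIA_ROUTE = "https://api.eia.gov/v2/electricity/rto/fuel-type-data/data/"


def build_eia_url(api_key, start, end, offset=0, length=5000):
    """Build one paged request URL (hypothetical helper, for illustration only)."""
    params = [
        ("api_key", api_key),
        ("frequency", "hourly"),
        ("data[0]", "value"),
        ("start", start),            # e.g. "2018-07-01T00"
        ("end", end),
        ("sort[0][column]", "period"),
        ("sort[0][direction]", "asc"),
        ("offset", offset),          # advance by `length` to fetch the next page
        ("length", length),
    ]
    return EIA_ROUTE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_eia_url("YOUR_API_KEY", "2018-07-01T00", "2018-07-02T00"))
```

Because each response is limited in size, a full download would loop over increasing `offset` values until no rows are returned.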

### Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

The Temporal Fusion Transformer (TFT) is a novel attention-based architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. To learn temporal relationships at different scales, TFT uses recurrent layers for local processing and interpretable self-attention layers for long-term dependencies. TFT uses specialized components to select relevant features and a series of gating layers to suppress unnecessary components, enabling high performance in a wide range of scenarios.

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
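
The recurrence described above can be sketched in a few lines. This is a toy scalar example, not the recurrent layer used in our model: the `tanh` update rule and the weights `w_h` and `w_x` are illustrative assumptions.

```python
import math


def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Return h_1..h_T where h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    states = []
    h = h0
    for x in inputs:
        # Each hidden state depends on the previous state and the current input.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


if __name__ == "__main__":
    print(rnn_hidden_states([1.0, 0.0, -1.0]))
```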

### Model Architecture

**File: featurecreation.py**

**File: makingpredictions.ipynb**

**File: model.py**

**File: train.py**

Observed inputs are measured at each time step and are unknown beforehand (for example, the target value). They are available for past time steps, including during training, but not for the future time steps being predicted. Known inputs are predetermined and generally do not depend on the observed inputs. In this model, all known inputs are derived from time: day-of-year, hour-of-day, and so on.

Quantile regression estimates the conditional median or other quantiles of the response variable, offering a more complete view of the possible outcomes. It provides an output distribution rather than a single point. The 0.5 quantile is the median target value. Quantiles are predicted for each group (fuel type), for each time step of the prediction length (168 hours), and for each quantile level (0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), using a fixed look-back length (`config['pred_len']`).
A quantile level can be interpreted as the probability of the true value being below the estimated value at that quantile.
Quantile loss minimizes the error across all groups, time steps, and quantiles.
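
To make the loss concrete, here is a minimal sketch of the standard quantile (pinball) loss averaged over time steps and quantile levels. The function names and the list-of-lists input layout are assumptions for illustration; the loss in train.py may be organized differently.

```python
QUANTILES = (0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98)


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one observation at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_quantile_loss(y_true, y_pred_by_quantile, quantiles=QUANTILES):
    """Average pinball loss over all time steps and quantile levels.

    y_true: one true value per time step.
    y_pred_by_quantile: per time step, one prediction per quantile level.
    """
    total = 0.0
    count = 0
    for y, preds in zip(y_true, y_pred_by_quantile):
        for q, p in zip(quantiles, preds):
            total += pinball_loss(y, p, q)
            count += 1
    return total / count


if __name__ == "__main__":
    # At q = 0.9, predicting too low costs more than predicting too high.
    print(pinball_loss(10.0, 8.0, 0.9), pinball_loss(10.0, 12.0, 0.9))
```

The asymmetry is what pushes each output toward its quantile: a high quantile level penalizes under-prediction heavily, so the fitted value rises until only a small share of true values lies above it.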

Below is the architecture of this model. In this project, we used the power data to build a model that predicts future output of each power source. 

Included in our submission are files for feature creation, modeling, training, and making predictions.

tft/data/power_consumption_by_fuel_type.csv

![Screen Shot 2024-03-06 at 10.40.03 AM.png](attachment:9c8bd92d-4d58-41b1-94fc-bee78949b900.png)

Model Architecture

![tft_architecture (1).png](attachment:f1a1cec7-7b31-4f36-a652-71eb42fea55c.png)

### Feature Creation
Feature creation refers to the creation of new features from existing data. The derived features expose patterns, such as time-of-day and seasonal cycles, in a form the model can learn from more easily.

Within the feature creation file, the datetime column in the original CSV is split into hour, day of the week, day of year, week of year, month, and quarter. The date features are then normalized. Cyclic seasonality features are created through trigonometric sine and cosine transformations, as shown below. The data is grouped by fuel type, and the rolling mean and standard deviation are computed for each time stamp and fuel type. There is one time stamp per hour for each fuel type.

![Screen Shot 2024-03-05 at 3.56.15 PM.png](attachment:3f36594d-41c8-40fd-95eb-0b400c2dc1e8.png)
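
For readers without access to the screenshot, the two ideas can be sketched with the standard library alone. This is not the code in featurecreation.py: the function names, the 24-hour window, and the use of the population standard deviation are assumptions, and the rolling statistics would be applied separately to each fuel type's series.

```python
import math
from datetime import datetime
from statistics import mean, pstdev


def cyclic_time_features(ts):
    """Encode hour-of-day and day-of-year as sine/cosine pairs."""
    hour_angle = 2 * math.pi * ts.hour / 24
    doy_angle = 2 * math.pi * (ts.timetuple().tm_yday - 1) / 365.25
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "doy_sin": math.sin(doy_angle),
        "doy_cos": math.cos(doy_angle),
    }


def rolling_stats(values, window=24):
    """Trailing rolling mean and population standard deviation per time step."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append((mean(chunk), pstdev(chunk)))
    return out


if __name__ == "__main__":
    print(cyclic_time_features(datetime(2023, 1, 1, 6)))
    print(rolling_stats([1.0, 2.0, 3.0, 4.0], window=2))
```

The sine/cosine pair places hour 23 next to hour 0 on a circle, which a plain integer hour column cannot express.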

### Variable Selection Network (VSN)
Variable selection networks select relevant input variables at each time step. While multiple variables may be available, their relevance and specific contribution to the output are typically unknown. TFT is designed to provide instance-wise variable selection through the use of variable selection networks applied to both static covariates and time-dependent covariates. Beyond providing insights into which variables are most significant for the prediction problem, variable selection also allows TFT to remove unnecessary noisy inputs that could negatively impact performance.
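
The core of variable selection is a set of softmax weights, one per input variable, used to combine the variables' embeddings. The sketch below is a simplification under stated assumptions: in TFT the scores and the per-variable embeddings come from learned gated residual networks, whereas here they are passed in directly, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_variables(embeddings, scores):
    """Weight each variable's embedding by its selection weight and sum them."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    combined = [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]
    return weights, combined


if __name__ == "__main__":
    # Two variables with 2-dimensional embeddings; the first gets a higher score.
    print(select_variables([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0]))
```

The weights are what make the selection interpretable: averaging them over many samples indicates which inputs the model relies on most.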

### Long short-term memory (LSTM)
Temporal processing with an LSTM learns short-term temporal relationships from both observed and known time-varying inputs. The LSTM model introduces an intermediate type of storage via the memory cell. A memory cell is a composite unit, built from simpler nodes in a specific connectivity pattern, with the novel inclusion of multiplicative nodes.
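
A single step of a memory cell can be written out to show where the multiplicative nodes appear. This is a scalar toy version of the standard LSTM equations, not the layer in model.py; the `params` layout and gate names are assumptions made for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much old memory to keep
    o = sigmoid(pre("output"))       # how much memory to expose
    g = math.tanh(pre("candidate"))  # candidate memory content
    c = f * c_prev + i * g           # multiplicative gates act on the memory cell
    h = o * math.tanh(c)
    return h, c


if __name__ == "__main__":
    zero = (0.0, 0.0, 0.0)
    params = {"input": zero, "forget": zero, "output": zero, "candidate": zero}
    print(lstm_step(1.0, 0.0, 2.0, params))
```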

### Attention Mechanism
The attention mechanism weighs the importance of values based on the relationships between keys and queries. This is by analogy with information retrieval, which evaluates a search query (query) against document embeddings (keys) to retrieve the most relevant documents (values). This allows the model to learn the relevance of each time step with respect to the rest of the input sequence, and therefore to capture long-range temporal dependencies. TFT adjusts this definition to ensure interpretability: instead of having head-specific weights for values, the value weights are shared across all attention heads. This makes it easy to trace back the most relevant values. The outputs of all heads are then additively aggregated.
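
The sharing of values across heads can be illustrated with plain lists. This sketch assumes the per-head queries and keys are already projected and omits the learned projection matrices and the masking of future time steps; the function names are illustrative and not taken from model.py.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one row per query, each summing to 1."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        rows.append(softmax(scores))
    return rows


def interpretable_attention(head_queries, head_keys, shared_values):
    """Average the per-head attention weights, then apply them to shared values."""
    n_heads = len(head_queries)
    per_head = [attention_weights(q, k) for q, k in zip(head_queries, head_keys)]
    n_q, n_k = len(per_head[0]), len(per_head[0][0])
    avg = [
        [sum(h[i][j] for h in per_head) / n_heads for j in range(n_k)]
        for i in range(n_q)
    ]
    dim = len(shared_values[0])
    out = [
        [sum(avg[i][j] * shared_values[j][d] for j in range(n_k)) for d in range(dim)]
        for i in range(n_q)
    ]
    return avg, out


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    avg, out = interpretable_attention(
        head_queries=[[[1.0, 0.0]], [[0.0, 1.0]]],
        head_keys=[keys, keys],
        shared_values=[[1.0], [3.0]],
    )
    print(avg, out)
```

Because every head uses the same values, the averaged weight matrix `avg` can be read directly as the importance of each past time step.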

### Training and Output
Instead of just a single value, TFT predicts quantiles of the distribution of the target ŷ using a quantile loss function. The quantile forecasts provide prediction intervals that indicate the range of likely target values at each prediction horizon. The training data is tft/data/power_consumption_by_fuel_type.csv, covering 2018-07-01 to 2023-12-31.

![predict_vs_true_power_output.png](attachment:1b31d09e-dd32-4519-be47-44c8f166dbf8.png)

*******

## Visualizations

***File: visualizations.ipynb***

The visualizations are found in the file named visualizations.ipynb.

- Coal (COL)
- Natural Gas (NG)
- Nuclear (NUC)
- Oil (OIL)
- Other (OTH)
- Sun (SUN)
- Water (WAT)
- Wind (WIND)

All data for the visualizations are divided into these 8 fuel types.


The data is configured into a DataFrame containing the fuel types and quantiles. A quantile level is the probability of the true value being below the estimated value at that quantile. Quantile loss minimizes the error across all groups, time steps, and quantiles.

### 25th, 50th (Median), 75th Quantile

The first set of visuals uses the raw output of the prediction model. Each graph shows the 25th-quantile (lower quartile), median, and 75th-quantile (upper quartile) power output predictions.

![quantiles_power_output.png](attachment:826e697b-2c75-4839-845d-8ea0de80325e.png)

### True Value Compared to the 25th and 75th Quantile

The second set of visuals compares the true value with the 25th-quantile and 75th-quantile power output predictions.

![truevalue_quantiles_power_output.png](attachment:e46583c2-c932-4956-92be-a449238be281.png)
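
One way to summarize this comparison numerically is the share of true values that fall between the two quantile forecasts. For well-calibrated forecasts, roughly half of the true values should lie between the 25th and 75th quantiles. The helper below is a sketch with made-up numbers and is not taken from visualizations.ipynb.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper] at each time step."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


if __name__ == "__main__":
    # Illustrative values only, not project results.
    true_values = [1.0, 5.0, 10.0, 3.0]
    q25 = [0.0, 6.0, 8.0, 2.0]
    q75 = [2.0, 7.0, 12.0, 4.0]
    print(interval_coverage(true_values, q25, q75))
```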

### Actual Value (True Value) Compared to Predicted Value

The third set of visuals is the most impactful, showing where the model over- and underpredicts the power output for each power source.

![under_over_predict_power_output.png](attachment:757056c5-7622-4a1d-bef3-0645281778b3.png)
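
Over- and underprediction can also be summarized as simple shares. The sketch below assumes the predicted value is the median (0.5 quantile) forecast and uses made-up numbers; the function name is illustrative, and the calculation would be run separately for each fuel type.

```python
def prediction_bias(y_true, y_pred):
    """Share of over- and underpredicted time steps and the mean signed error."""
    errors = [p - y for y, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "over_share": sum(e > 0 for e in errors) / n,
        "under_share": sum(e < 0 for e in errors) / n,
        "mean_error": sum(errors) / n,
    }


if __name__ == "__main__":
    # Illustrative values only, not project results.
    print(prediction_bias([10.0, 10.0, 10.0, 10.0], [12.0, 9.0, 10.0, 11.0]))
```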

**********

### Limitations
Although this dataset is expansive, there are some limitations. 

First, a large limitation is the lack of information on the cost of the fuel source in the data. This factor is crucial for understanding the economic implications of consumption trends. Integration of price data would allow for a more comprehensive analysis of cost effectiveness of different fuel sources. 

Second, external factors such as natural disasters, weather conditions, or power outages can significantly impact fuel consumption but are not accounted for in the analysis. Incorporating external data sources such as weather patterns or records of outages would enhance the analysis by providing a more holistic view of the influencing factors. 

Third, the category ‘Other’ may introduce ambiguity, limiting the precision of the analysis, as it does not specify the type of fuel. Further categorization or detailed information on the composition of the ‘Other’ category could improve the accuracy and specificity of the analysis. In this dataset, ‘Other’ makes up about 2% of the data, so it is unlikely to significantly impact our analysis.

Fourth, holidays often influence energy consumption patterns, but the analysis does not explicitly account for them. Incorporating a holiday calendar and examining fuel consumption during holiday periods would provide further insight into how holidays or similar events impact energy usage.

Lastly, the analysis does not differentiate between residential and commercial fuel consumption, so sector-specific insights are not possible. Separating the data into residential and commercial categories would enable a more targeted analysis.



### Continued Analysis 
Some of the limitations addressed above may also provide opportunities for continued analysis of the data. Continuing the analysis with additional datasets can enhance the depth and accuracy of insights.

First, combining price data with consumption data would allow for a comprehensive economic analysis. Understanding how fuel prices correlate with consumption patterns provides insights into the cost-effectiveness of various energy sources. It also helps identify periods of high consumption that coincide with price fluctuations, which may interest both consumers and producers.

Second, weather conditions have a substantial impact on energy consumption. For example, cold weather may lead to increased heating requirements, affecting fuel usage. Incorporating weather data such as temperature and precipitation could reveal correlations between weather patterns and energy consumption trends. In regions prone to more extreme weather events, this could be crucial for forecasting and planning power usage.

Lastly, incorporating holiday calendars allows for the identification of holiday-specific trends and anomalies. Understanding how holidays influence energy usage provides valuable insights for both short-term and long-term planning.

### Dissemination of Analysis
We will prepare a detailed presentation that covers the methodology, findings, and limitations of the analysis. This presentation will include visuals such as charts and graphs for better understanding. These findings will first be presented to the class and professor through platforms such as Microsoft PowerPoint and Jupyter Notebooks.

Additionally, the analysis can be documented in a Jupyter Notebook or similar format with clear explanations and visuals. This analysis can be shared on GitHub and LinkedIn to reach a professional audience. An interactive dashboard built with tools like Tableau or Dash would give the audience an engaging way to interact with the analysis and draw their own conclusions. The target audience is fuel/power producers and consumers, which includes most people and businesses in the world.