# This notebook loads, cleans and consolidates all of the datasets used in this project. It imports the scripts written along the way and calls them when needed (to avoid clutter). All scripts are contained in this folder.

## First, we load all of our scripts. These scripts rely on external Python packages, which are listed in the project's README.

In [1]:
import helpers
import preprocess
import weather
import train_test_validate_split
from datetime import datetime
import pandas as pd

### Now, we generate a dataframe containing all potentially relevant weather data from our location of interest (here, NYC)

In [2]:
# The data will be saved into the raw_data folder
weather_data = weather.generate_aggregated_weather_data(
    'New York City',
    25000,  # search radius around the city (25 km, see the description below)
    datetime(2020, 10, 26),
    datetime(2023, 10, 1),
    savefile='raw_data/NYC_weather_data',
)

#### This DataFrame contains aggregated (averaged) hourly weather data from weather stations within 25 km of New York City. We calculated precipitation, temperature, wind and pressure anomalies (essentially scaled variables grouped by week or month). We also included categorical variables such as 'snowing' or 'raining' to see whether they had any effect. Our chosen date range was 2020-10-26 to 2023-10-01.

In [8]:
weather_data.sample(3)

Unnamed: 0,temp,dwpt,rhum,prcp,wdir,wspd,pres,coco,dtime,week,...,weekly_Prec_anom,monthly_Prec_anom,weekly_Wind_anom,monthly_Wind_anom,weekly_Pressure_anom,monthly_Pressure_anom,snowing,raining,hail,cloudy
11860,-5.7,-15.6,45.5,0.0,322.7,12.7,1033.2,1.0,2022-03-04 04:00:00-05:00,9,...,-0.194052,-0.246471,-0.356613,-0.3255,2.327779,2.030431,0,0,0,0
13605,18.9,15.4,80.5,0.0,71.7,6.8,1010.0,4.0,2022-05-15 21:00:00-05:00,19,...,-0.174209,-0.254405,-0.839338,-0.947531,-1.803713,-1.330262,0,0,0,1
9467,4.5,-7.4,42.3,0.0,334.5,14.0,1027.6,2.0,2021-11-24 11:00:00-05:00,47,...,-0.204042,-0.236633,-0.031023,0.205798,1.166241,1.057568,0,0,0,0
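
The anomaly columns described above can be illustrated with a minimal sketch. It assumes that an "anomaly" is a z-score computed within each week (value minus the group mean, divided by the group standard deviation). The exact scaling used inside `weather.py` is not shown in this notebook, so the function `add_anomaly`, the toy data and the output column name below are hypothetical.

```python
import numpy as np
import pandas as pd


def add_anomaly(df, value_col, group_col, out_col):
    """Return a copy of df with a per-group z-score of value_col (assumed definition)."""
    grouped = df.groupby(group_col)[value_col]
    out = df.copy()
    # transform() broadcasts each group's statistic back onto its rows
    out[out_col] = (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
    return out


# Toy hourly data covering two ISO weeks (hypothetical values, not project data)
times = pd.Timestamp("2021-01-04") + pd.to_timedelta(np.arange(24 * 14), unit="h")
rng = np.random.default_rng(0)
toy = pd.DataFrame({"dtime": times, "temp": rng.normal(5.0, 3.0, len(times))})
toy["week"] = toy["dtime"].dt.isocalendar().week.astype(int)

toy = add_anomaly(toy, "temp", "week", "weekly_temp_anom_sketch")
print(toy.groupby("week")["weekly_temp_anom_sketch"].agg(["mean", "std"]))
```

Grouping by a month column instead of `week` would give the monthly counterpart under the same assumption.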


### We now merge this weather dataframe with energy price (day-ahead & real-time) and network load data taken from the NRG website (https://www.nrg.com/resources/energy-tools/tracking-the-market.html)

In [2]:
merged = preprocess.preprocess_data('raw_data/','raw_data/dat_set_3/',savefile='raw_data/merged_energy_weather_data')

Create new features of day ahead price and load features at specified hours...
Missing 190 times.
Expand time featuer to new columns of day, week, date, hour ...
Processed the various data files.
<class 'pandas.core.frame.DataFrame'>
Index: 25661 entries, 168 to 25828
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Unnamed: 0                   25639 non-null  float64       
 1   temp                         25639 non-null  float64       
 2   dwpt                         25639 non-null  float64       
 3   time                         25661 non-null  datetime64[ns]
 4   DA_price                     25639 non-null  float64       
 5   RT_price                     25639 non-null  float64       
 6   load                         25378 non-null  float64       
 7   nat_gas_spot_price           25639 non-null  float64       
 8   monthly_avg_NY_natgas_price  24913 non-null

### We do one final cleanup and add columns to our dataframe that tie each hourly instance to the hourly instances 1, 2, 3, 4, 5, 6 and 7 days in the past for day-ahead and real-time prices, as well as for load.

In [6]:
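# Add columns holding the values from previous days
cleaned_final = helpers.get_prev_day_cols(merged)
# Add DA_price columns for horizons of 1 to 24 hours ahead
h24final = helpers.get_future_h_cols(cleaned_final, list(range(1, 25)), cols=['DA_price'])
# Add DA_price columns for lags of 1 to 48 hours back
evofin = helpers.get_prev_hour_cols(h24final, list(range(1, 49)), ['DA_price'])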
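
To make the idea of lagged and future columns concrete, here is a minimal sketch. It is not the code in `helpers.py`: the function `add_lagged_cols`, the new column names and the toy data are hypothetical. The sketch looks values up by timestamp rather than by row position, on the assumption that the hourly series may contain gaps (the preprocessing output reports missing times), so a lag that falls on a missing hour yields NaN instead of a misaligned value.

```python
import numpy as np
import pandas as pd


def add_lagged_cols(df, col, lags, time_col="time"):
    """For each (name, Timedelta) in lags, add the value of col observed
    that long before each row. A negative Timedelta looks into the future."""
    out = df.copy()
    series = out.set_index(time_col)[col]  # timestamps must be unique
    for name, delta in lags.items():
        # Look up the value at (row time - delta); missing timestamps give NaN
        out[name] = series.reindex(out[time_col] - delta).to_numpy()
    return out


# Toy data: 10 days of hourly prices equal to the hour counter (hypothetical)
n = 24 * 10
toy = pd.DataFrame({
    "time": pd.Timestamp("2021-01-01") + pd.to_timedelta(np.arange(n), unit="h"),
    "DA_price": np.arange(n, dtype=float),
})

lagged = add_lagged_cols(toy, "DA_price", {
    "DA_price_prev_1d": pd.Timedelta(days=1),
    "DA_price_prev_1h": pd.Timedelta(hours=1),
    "DA_price_next_1h": pd.Timedelta(hours=-1),
})
print(lagged.iloc[[0, 24, 25]])
```

Rows at the start of the series have no earlier observation, so their lag columns are NaN. Any such rows would need to be dropped or imputed before modelling.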

In [10]:
evofin.to_csv('raw_data/final_complete_dataset_hourly_evolution.csv')

### The last step builds common training, validation and testing datasets so that all of our models can be compared fairly and without potential biases generated by individual splits

In [10]:
train_test_validate_split.split_train_test_val_bydate('2022-11-30','2023-08-30','raw_data/final_complete_dataset_hourly_evolution.csv',['time','date'],'date','final_data/')
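
A date-based split of this kind can be sketched as follows. This is an illustration only: the function `split_by_date` and the toy data are hypothetical, and the assignment of date ranges to partitions (training before the first date, validation between the two dates, testing from the second date onward) is an assumption, since the behaviour of `split_train_test_val_bydate` is not shown in this notebook.

```python
import pandas as pd


def split_by_date(df, first_date, second_date, date_col="date"):
    """Split df chronologically at two boundary dates (assumed semantics)."""
    dates = pd.to_datetime(df[date_col])
    first_date, second_date = pd.Timestamp(first_date), pd.Timestamp(second_date)
    train = df[dates < first_date]
    val = df[(dates >= first_date) & (dates < second_date)]
    test = df[dates >= second_date]
    return train, val, test


# Toy daily data (hypothetical), split at the two dates used in the cell above
toy = pd.DataFrame({"date": pd.date_range("2022-11-01", "2023-09-30", freq="D")})
toy["DA_price"] = range(len(toy))

train, val, test = split_by_date(toy, "2022-11-30", "2023-08-30")
print(len(train), len(val), len(test))
```

Splitting by date rather than at random keeps every model's test period strictly later than its training period, which avoids leaking future prices into training through the lagged columns.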

# All final datasets are saved in the final_data folder inside the data_processing folder. These CSV files can be read in as dataframes and manipulated according to the needs of specific models.