# EBA Data Ingestion

We need to collect data from two main sources for this project (with a possible third):
 - First, load the EBA data into the SQL DB.
 - Second, get observed weather data from NOAA weather stations.
   This is done via FTP using `code/utils/get_weather_data.py` for the relevant time periods.
 - (A third would be accessing NOAA's forecast DB.)

In all cases we load the data into a Postgres database for easier querying later.


# Library Sketch and Table Sketch 

- 560 MB of weather station data
- 2.8 GB of energy data
- 5 GB of forecast data (could try to only extract station data)

Energy data x 100 ISOs
- Demand
- Demand Forecast
- Net Generation
    (by source)
- Transfers

Weather x 600 stations
- Temp
- Cloud cover
- Precipitation

Forecast
- Temp (gridded 24 hour forecast) of CONUS.  Probably don't want in DB.
- include file ref.
- Try to find nearest forecast pixel for all airports.

Given we want to think about a whole system forecast, we can live with having a few big tables separated by variable.
Use UTC time variables to allow a common index and forecast.

Demand Table
    id, 
    datetime
    iso1,
    iso2,
    ...
    index on datetime

Forecast Table
    (same columns as Demand Table)

Net Generation (*)
    (same columns as Demand Table; same again for sub-sources)

Transfers  (*)
   id, 
   datetime
   iso1,
   iso2,
   amount
   index on datetime, 

AirMeta
   id
   station_name
   lat
   long
   region
   city
   state

   
Temperature
   id, 
   datetime
   st1,
   st2,
   ...
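
The table sketch above could be written as DDL roughly as follows. This is a minimal sketch assuming the wide layout with one column per ISO or station. The column names (`pge`, `bpat`, `pacw`, `kpdx`, `ksea`), the table names, and the use of in-memory SQLite as a stand-in for Postgres are all assumptions for illustration.

```python
import sqlite3

# Hypothetical column lists; the real tables would have one column per ISO / station.
ISO_COLUMNS = ["pge", "bpat", "pacw"]
STATION_COLUMNS = ["kpdx", "ksea"]


def wide_table_ddl(table: str, value_columns: list) -> list:
    """Build CREATE TABLE / CREATE INDEX statements for a wide, datetime-indexed table."""
    cols = ",\n    ".join(f"{c} REAL" for c in value_columns)
    return [
        f"CREATE TABLE {table} (\n    id INTEGER PRIMARY KEY,\n"
        f"    datetime TIMESTAMP NOT NULL,\n    {cols}\n)",
        f"CREATE INDEX idx_{table}_datetime ON {table} (datetime)",
    ]


def transfers_ddl() -> list:
    """Transfers are stored long-form: one row per (datetime, iso1, iso2) pair."""
    return [
        "CREATE TABLE transfers (\n    id INTEGER PRIMARY KEY,\n"
        "    datetime TIMESTAMP NOT NULL,\n    iso1 TEXT NOT NULL,\n"
        "    iso2 TEXT NOT NULL,\n    amount REAL\n)",
        "CREATE INDEX idx_transfers_datetime ON transfers (datetime)",
    ]


statements = (
    wide_table_ddl("demand", ISO_COLUMNS)
    + wide_table_ddl("demand_forecast", ISO_COLUMNS)
    + wide_table_ddl("temperature", STATION_COLUMNS)
    + transfers_ddl()
)

# SQLite is used here only to check that the statements parse and run.
conn = sqlite3.connect(":memory:")
for stmt in statements:
    conn.execute(stmt)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Generating the column list from a Python list keeps the roughly 100 ISO columns out of hand-written SQL; the same statements would need their types reviewed before being run against Postgres.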
   
   

In [1]:
import os
import sys

In [2]:
if '/tf' not in sys.path:
    sys.path.append('/tf/')

%load_ext autoreload
%autoreload 2

In [15]:
from us_elec.SQL.sqldriver import SQLDriver

In [16]:
sqld = SQLDriver()

In [18]:
#sqld.get_data('SELECT * FROM information_schema.tables;')

# Bulk EBA data import

The EBA data can be downloaded from `https://www.eia.gov/opendata/bulk/EBA.zip`.
As of Mar 6, 2023 it's around 2.8 GB, with around 2800 child series, stored in one JSON Lines file.

That's downloaded to `data/EBA/EBA20230302`.
For initial quick exploration you can grep out the 'California' and 'Portland' series to get a smaller subset to work with.
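
The grep step can also be done in Python, which restricts the match to the `name` field instead of the whole line. A minimal sketch, assuming each line is a JSON object that usually carries a `name` key, as in the EBA file; the function name `filter_jsonl` is hypothetical.

```python
import json


def filter_jsonl(src: str, dst: str, needle: str) -> int:
    """Copy lines whose 'name' field contains `needle` (case-insensitive) from src to dst.

    Streams line by line, so memory use stays at roughly one record.
    Returns the number of lines written.
    """
    needle = needle.lower()
    written = 0
    with open(src, "r", encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Treat a missing or null name as an empty string.
            name = record.get("name") or ""
            if needle in name.lower():
                fout.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written
```

For example, `filter_jsonl('EBA.txt', 'EBA_PDX.txt', 'Portland')` would write the Portland subset.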


In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
!pwd

/tf/us_elec/notebooks


In [22]:
!wc -l /tf/us_elec/data/EBA/EBA20230302/EBA.txt

2847 /tf/us_elec/data/EBA/EBA20230302/EBA.txt


In [4]:
!ls -l /tf/us_elec/data/EBA/EBA20230302

total 2893992
-rw-rw-rw- 1 root root 2760336845 Mar  3 13:13 EBA.txt
-rw-rw-r-- 1 root root  170876970 Mar  8 01:35 EBA_CA.txt
-rw-rw-r-- 1 root root   32219742 Mar  8 01:35 EBA_PDX.txt


In [5]:
eba_path = '/tf/us_elec/data/EBA/EBA20230302'
fname = f'{eba_path}/EBA_PDX.txt'

- Grepped out all Portland and California series into smaller files (`EBA_PDX.txt`, `EBA_CA.txt`) to play with while cleaning up the ETL work.

In [12]:
import json
import jsonlines
import re
from typing import Optional
from tqdm import tqdm

def read_eba_txt(fn: str, N: Optional[int] = None, name_lookup: Optional[str] = None):
    """Read JSON objects from a JSON Lines file.

    Args:
    fn - path to the JSON Lines file
    N - maximum number of records to return
    name_lookup - optional case-insensitive substring to match against each record's name
    Return:
    List of dicts
    """
    data = []

    with jsonlines.open(fn, 'r') as fh:
        for obj in tqdm(fh):
            if name_lookup:
                # Keep only records whose name contains the search string.
                # Default to '' so a record without a name does not raise.
                if name_lookup.lower() in obj.get('name', '').lower():
                    print(f"HIT! {obj['name']}")
                    data.append(obj)
            else:
                data.append(obj)
            # Stop once N records have been collected.
            if N and len(data) >= N:
                break
    return data


- This eats a LOT of RAM on its own when run on the full file.
- Probably best to ETL one series at a time. Even in dict form it's eating around 20 GB of RAM.
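
One way to keep memory bounded is to yield records one at a time rather than collecting them in a list. A minimal sketch using only the standard library `json` module; `iter_eba_records` and `etl_one_at_a_time` are hypothetical names, and `load_series` stands in for whatever function writes a single series to the database.

```python
import json
from typing import Callable, Iterator


def iter_eba_records(fn: str) -> Iterator[dict]:
    """Yield one parsed JSON object per line, holding only one record in memory at a time."""
    with open(fn, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def etl_one_at_a_time(fn: str, load_series: Callable[[dict], None]) -> int:
    """Pass each series record to `load_series` and return how many were processed.

    Category records (those without a 'series_id') are skipped.
    """
    n_loaded = 0
    for record in iter_eba_records(fn):
        if "series_id" not in record:
            continue
        load_series(record)
        n_loaded += 1
    return n_loaded
```

Because each record goes out of scope after `load_series` returns, peak memory is set by the largest single series rather than by the whole file.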

In [9]:
all_data = read_eba_txt(fname)

32it [00:01, 27.72it/s]


In [10]:
len(all_data)

32

In [25]:
for dat in all_data:
    if 'series_id' in dat.keys():
        print(dat['series_id'], dat['name'])
        print(len(dat['data']), dat['data'][0:2], dat['data'][-1])
        print()
    else:
        print(dat['category_id'], dat['name'], dat['childseries'])

EBA.PGE-ALL.D.H Demand for Portland General Electric Company (PGE), hourly - UTC time
66520 [['20230302T22Z', 2957], ['20230302T21Z', 3050]] ['20150722T08Z', 1936]

EBA.PGE-ALL.D.HL Demand for Portland General Electric Company (PGE), hourly - local time
66520 [['20230302T14-08', 2957], ['20230302T13-08', 3050]] ['20150722T01-07', 1936]

EBA.PGE-PACW.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to PacifiCorp West (PACW), hourly - UTC time
66250 [['20230301T08Z', 84], ['20230301T07Z', 102]] ['20150721T08Z', -92]

EBA.PACW-PGE.ID.H Actual Net Interchange for PacifiCorp West (PACW) to Portland General Electric Company (PGE), hourly - UTC time
66959 [['20230301T08Z', -84], ['20230301T07Z', -102]] ['20150701T08Z', 101]

EBA.PGE-BPAT.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to Bonneville Power Administration (BPAT), hourly - UTC time
66265 [['20230301T08Z', -1638], ['20230301T07Z', -1738]] ['20150721T08Z', -1268]

EBA.PGE-BPAT.ID.HL Ac

In [None]:
- Note that the transfers do not appear to be fully aligned for the most recent data. I suspect some sort of reconciliation procedure clears that up, but that would need looking into. Useful for considering trades.

So we have 4 big categories of data in this file (demand, demand forecast, net generation, interchange), listed below with their breakdowns. All series are provided in both local-time and UTC variants.

- Demand
- Demand Forecast
- Net Generation
- Net Generation (by source) - Much less data
- Total Interchange
- Interchange with other ISOs

- Around 8 years of data for demand/net generation.
- Around 5 years for generation by source data.
- Hourly resolution 
- Around 100 ISOs  (2850 series, 30 series per ISO, but variable interchanges).
- 60k data points per series at hourly resolution.
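
The UTC series use timestamps such as `20230302T22Z`, and each data point is a `[timestamp, value]` pair with the newest entry first. A minimal sketch of converting one series into time-ordered `(datetime, value)` tuples; the function names are hypothetical, and the format string is inferred from the sample output above rather than from EIA documentation.

```python
from datetime import datetime, timezone


def parse_eba_utc(stamp: str) -> datetime:
    """Parse an hourly UTC stamp like '20230302T22Z' into a timezone-aware datetime."""
    return datetime.strptime(stamp, "%Y%m%dT%HZ").replace(tzinfo=timezone.utc)


def series_to_rows(data: list) -> list:
    """Convert [[stamp, value], ...] into (datetime, value) tuples sorted oldest first."""
    rows = [(parse_eba_utc(stamp), value) for stamp, value in data]
    rows.sort(key=lambda r: r[0])
    return rows


sample = [["20230302T22Z", 2957], ["20230302T21Z", 3050]]
rows = series_to_rows(sample)
print(rows[0][0].isoformat(), rows[0][1])
```

Only the `Z`-suffixed UTC stamps are handled here; the local-time variants (for example `20230302T14-08`) carry an offset and would need a separate parser.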

## Proposed SQL Table Structure - EBA

- Our initial project is focused on the demand forecasting piece. Let's just focus on the bulk attributes for now, and return later if need be for breakdowns by generation type.

### Options:
1) 1 table per series (hard to look up) - Reject.

2) 1 table per type (100 ISOs as columns).
    - Demand (Time, PDX, BPA, CAISO, ...)
    - Forecast (Time, PDX, BPA, CAISO, ...)
    - Net Generation (Time, PDX, BPA, CAISO,...)
    - Interchange(Time, P1, P2, Amount)

3) 1 major table per ISO (around 30 sub-series)
   -  PGE (Time, Demand, Forecast, Net Generation, COL, HYD, ..., PGE-BPA, PGE-PACW)
   -  BPA (Time, Demand, Forecast, Net Generation, COL, HYD, ..., BPA-PGE, BPA-PACW)

Leaning toward approach 3. It better encapsulates the system process and allows both local time and UTC time.
Also leaning towards only including the UTC time variants.

- Need all series names (types of data).
- Need all ISOs and transfers.
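
Series IDs such as `EBA.PGE-ALL.D.H` and `EBA.PGE-PACW.ID.H` already encode the ISO, the counterpart, the data type and the time basis, so the list of series types and ISO pairs can be derived from them. A sketch, assuming every ID follows the `EBA.<from>-<to>.<type>.<H|HL>` pattern seen in the samples; `SeriesKey` and `parse_series_id` are hypothetical names, and IDs with a different shape raise an error rather than being guessed at.

```python
import re
from typing import NamedTuple

# Pattern inferred from the sample IDs in this notebook, not from an official spec.
# The type part may contain a dot (assumed here for by-source generation series).
SERIES_RE = re.compile(r"^EBA\.([A-Z0-9]+)-([A-Z0-9]+)\.([A-Z.]+)\.(HL?)$")


class SeriesKey(NamedTuple):
    source: str  # reporting ISO, e.g. 'PGE'
    target: str  # 'ALL' for totals, otherwise the counterpart ISO
    kind: str    # e.g. 'D' (demand), 'DF' (demand forecast), 'ID' (interchange)
    utc: bool    # True for '.H' (UTC), False for '.HL' (local time)


def parse_series_id(series_id: str) -> SeriesKey:
    """Split an EBA series id into its components."""
    m = SERIES_RE.match(series_id)
    if m is None:
        raise ValueError(f"Unrecognized series id: {series_id!r}")
    source, target, kind, flag = m.groups()
    return SeriesKey(source, target, kind, flag == "H")
```

Collecting `{key.kind for key in keys}` and `{(key.source, key.target) for key in keys if key.target != 'ALL'}` over all IDs would then give the series types and the transfer pairs.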

In [29]:
len(raw_lines)

10

In [30]:
raw_lines

['series_id',
 'name',
 'units',
 'f',
 'description',
 'start',
 'end',
 'last_updated',
 'geoset_id',
 'data']

## Getting Metadata

(Increasingly getting feeling that Mongo is the way to really handle this data)

## EBA

Want:
- list of ISOs, names


## Airports

I think the `merge_air_df` is probably already close to what we want: mapping from id to name/region.


In [23]:
! head /tf/data/EBA/EBA20230302/metaseries.txt

{"category_id":"2122627","parent_category_id":"2123635","name":"Day-ahead demand forecast","notes":"","childseries":[]}
{"category_id":"3389848","parent_category_id":"2122627","name":"U.S.","notes":"","childseries":[]}
{"category_id":"3389851","parent_category_id":"3389848","name":"United States Lower 48 (US48)","notes":"","childseries":["EBA.US48-ALL.DF.H","EBA.US48-ALL.DF.HL"]}
{"category_id":"3389849","parent_category_id":"2122627","name":"Regions","notes":"","childseries":[]}
{"category_id":"3389852","parent_category_id":"3389849","name":"California (CAL)","notes":"","childseries":["EBA.CAL-ALL.DF.H","EBA.CAL-ALL.DF.HL"]}
{"category_id":"3389853","parent_category_id":"3389849","name":"Carolinas (CAR)","notes":"","childseries":["EBA.CAR-ALL.DF.H","EBA.CAR-ALL.DF.HL"]}
{"category_id":"3389854","parent_category_id":"3389849","name":"Central (CENT)","notes":"","childseries":["EBA.CENT-ALL.DF.H","EBA.CENT-ALL.DF.HL"]}
{"category_id":"3389855","parent_category_id":"3389849"

In [24]:
import pandas as pd

fn = '/tf/data/EBA/EBA20230302/metaseries.txt'
meta_df = pd.read_json(fn, lines=True)

In [34]:
meta_df.loc[19]

category_id                                         4670479
parent_category_id                                  3389849
name                                            Texas (TEX)
notes                                                      
childseries           [EBA.TEX-ALL.DF.H, EBA.TEX-ALL.DF.HL]
Name: 20, dtype: object

In [38]:
import re

In [61]:
#reg = re.compile(r'([\w+])\s+\(([\w+])\)')
reg = re.compile(r'(?:\(|\w+\s+)')
reg2 = re.compile(r'\(([A-Za-z]+)\)')

In [65]:
st = 'Coll Name Brobert (CNB)'
reg.findall(st), reg2.search(st), re.findall(r'(\w+)', st)

(['Coll ', 'Name ', 'Brobert ', '('],
 <re.Match object; span=(18, 23), match='(CNB)'>,
 ['Coll', 'Name', 'Brobert', 'CNB'])

In [82]:

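def parse_metadata(df):
    """Map each abbreviation to its name, e.g. 'Texas (TEX)' gives {'TEX': 'Texas'}."""
    iso_map = {}
    for _, row in df.iterrows():
        if '(' in row['name']:
            # Split into word tokens; the last token is the abbreviation in parentheses.
            # Note: tokenizing on \w+ drops punctuation from the name.
            tokens = re.findall(r'(\w+)', row['name'])
            name = ' '.join(tokens[:-1])
            abbrv = tokens[-1]
            iso_map[abbrv] = name
        # TODO: also collect row.childseries
    return iso_map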

In [85]:
%pdb off
iso_map = parse_metadata(meta_df)

Automatic pdb calling has been turned OFF


In [86]:
len(iso_map)

83

In [67]:
len(meta_df)

537

In [68]:
meta_df.loc[400:420]

Unnamed: 0,category_id,parent_category_id,name,notes,childseries
400,2122583,3390272,"NaturEner Wind Watch, LLC (WWA)",,"[EBA.WWA-NWMT.ID.H, EBA.WWA-NWMT.ID.HL]"
401,2122584,3390272,Nevada Power Company (NEVP),,"[EBA.NEVP-BPAT.ID.H, EBA.NEVP-BPAT.ID.HL, EBA...."
402,2122586,3390272,ISO New England (ISNE),,"[EBA.ISNE-HQT.ID.H, EBA.ISNE-HQT.ID.HL, EBA.IS..."
403,2122587,3390272,"New Harquahala Generating Company, LLC (HGMA)",,"[EBA.HGMA-SRP.ID.H, EBA.HGMA-SRP.ID.HL]"
404,2122588,3390272,Utilities Commission of New Smyrna Beach (NSB),,"[EBA.NSB-FPC.ID.H, EBA.NSB-FPC.ID.HL, EBA.NSB-..."
405,2122589,3390272,New York Independent System Operator (NYIS),,"[EBA.NYIS-HQT.ID.H, EBA.NYIS-HQT.ID.HL, EBA.NY..."
406,2122590,3390272,NorthWestern Corporation (NWMT),,"[EBA.NWMT-AESO.ID.H, EBA.NWMT-AESO.ID.HL, EBA...."
407,2122592,3390272,Ohio Valley Electric Corporation (OVEC),,"[EBA.OVEC-LGEE.ID.H, EBA.OVEC-LGEE.ID.HL, EBA...."
408,2122595,3390272,"PJM Interconnection, LLC (PJM)",,"[EBA.PJM-CPLE.ID.H, EBA.PJM-CPLE.ID.HL, EBA.PJ..."
409,2122596,3390272,PUD No. 1 of Douglas County (DOPD),,"[EBA.DOPD-BPAT.ID.H, EBA.DOPD-BPAT.ID.HL, EBA...."
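
Names like `PUD No. 1 of Douglas County (DOPD)` and `PJM Interconnection, LLC (PJM)` contain punctuation that a `\w+` tokenization drops. A sketch of an alternative that splits on the trailing parenthesized abbreviation instead; `split_name_abbrev` is a hypothetical helper, and it assumes the abbreviation is always the last parenthesized group in the name.

```python
import re
from typing import Optional, Tuple

# Everything before the final "(ABBR)" is the name; ABBR is letters/digits only.
NAME_ABBREV_RE = re.compile(r"^(?P<name>.*\S)\s*\((?P<abbrev>[A-Za-z0-9]+)\)\s*$")


def split_name_abbrev(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, abbreviation), or None if the text has no trailing '(ABBR)'."""
    m = NAME_ABBREV_RE.match(text)
    if m is None:
        return None
    return m.group("name"), m.group("abbrev")


for raw in ["PUD No. 1 of Douglas County (DOPD)", "PJM Interconnection, LLC (PJM)", "Regions"]:
    print(raw, "->", split_name_abbrev(raw))
```

Rows without a trailing abbreviation (such as the `Regions` category) return `None`, which makes it easy to separate grouping categories from actual ISOs.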
