# EBA Data Ingestion

This project collects data from two main sources:
 - The EBA data, which is loaded into the SQL DB.
 - Observed weather data from NOAA weather stations.
   This is fetched via FTP using `code/utils/get_weather_data.py` for the relevant time periods.
 - (A third would be NOAA's forecast DB.)

In all cases the data is loaded into a Postgres database for easier querying later.


# Library Sketch and Table Sketch 

- 560 MB of weather station data
- 2.8GB of Energy data
- 5.0GB of forecast data  (could try to only extract station data)

Energy data x 100 ISOs
- Demand
- Demand Forecast
- Net Generation
    (by source)
- Transfers

Weather x 600 stations
- Temp
- Cloud cover
- Precipitation

Forecast
- Temp (gridded 24 hour forecast) of CONUS.  Probably don't want in DB.
- include file ref.
- Try to find nearest forecast pixel for all airports.
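
The nearest-pixel lookup noted above could look roughly like this. It is a minimal sketch that assumes the forecast grid is available as a flat list of (lat, lon) cell centres and uses a brute-force haversine search. The function names and sample coordinates are illustrative and not taken from the project code.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_pixel(station, grid):
    """Return the index of the grid point closest to station (lat, lon)."""
    lat, lon = station
    return min(
        range(len(grid)),
        key=lambda i: haversine_km(lat, lon, grid[i][0], grid[i][1]),
    )

# Hypothetical 3-point grid and a station near Portland, OR.
grid = [(45.0, -123.0), (45.5, -122.5), (46.0, -122.0)]
print(nearest_pixel((45.59, -122.60), grid))  # 1
```

For 600 stations against a full CONUS grid, a brute-force search is slow but only has to run once; the resulting index per station can be stored alongside the file reference.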

Since we want to think about a whole-system forecast, we can live with a few big tables, one per variable.
Use UTC timestamps throughout so that every table shares a common index for joining and forecasting.

Demand Table
    id,
    datetime,
    iso1,
    iso2,
    ...
    index on datetime

Forecast Table
    (same columns as Demand Table)

Net Generation (*)
    (same columns as Demand Table; same again for each sub-source)

Transfers (*)
    id,
    datetime,
    iso1,
    iso2,
    amount,
    index on datetime

AirMeta
    id,
    station_name,
    lat,
    long,
    region,
    city,
    state

Temperature
    id,
    datetime,
    st1,
    st2,
    ...
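
The wide tables sketched above (one row per UTC timestamp, one column per ISO or station) could be generated rather than written by hand. The following is a rough sketch only: `wide_table_ddl` is a hypothetical helper, and the column types (`TIMESTAMPTZ`, `DOUBLE PRECISION`) are assumptions about what would suit Postgres here.

```python
def wide_table_ddl(table, columns):
    """Build CREATE TABLE / CREATE INDEX statements for a wide table:
    one row per UTC timestamp, one numeric column per ISO or station."""
    for c in columns:
        # ISO codes and callsigns are plain alphanumerics; reject anything else.
        if not c.isalnum():
            raise ValueError(f"unexpected column name: {c!r}")
    cols = ",\n".join(f'    "{c.lower()}" DOUBLE PRECISION' for c in columns)
    create = (
        f'CREATE TABLE IF NOT EXISTS "{table}" (\n'
        "    id SERIAL PRIMARY KEY,\n"
        "    datetime TIMESTAMPTZ NOT NULL,\n"
        f"{cols}\n"
        ");"
    )
    index = (
        f'CREATE INDEX IF NOT EXISTS "{table}_datetime_idx" '
        f'ON "{table}" (datetime);'
    )
    return create, index

create_sql, index_sql = wide_table_ddl("demand", ["PGE", "BPAT", "CISO"])
print(create_sql)
print(index_sql)
```

Postgres caps a table at 1600 columns, so 100 ISOs or 600 stations per table fit comfortably.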
   
   

In [1]:
import os
import sys

In [2]:
from us_elec.SQL.sqldriver import SQLDriver

sqld = SQLDriver()

In [6]:
#sqld.get_data('SELECT * FROM information_schema.tables;')

# Bulk EBA data import

The EBA data can be downloaded from `https://www.eia.gov/opendata/bulk/EBA.zip`.
As of Mar 6, 2023 it's around 2.8 GB, with around 2800 child series, stored in a single JSON Lines file.

That's downloaded to data/EBA/20230302.
For quick initial exploration, grep out the 'Portland' and 'California' series to get a smaller subset of data to play with while cleaning up the ETL work:

  `grep -r "Portland" EBA.txt > EBA_PDX.txt`
  `grep -r "California" EBA.txt > EBA_CA.txt`
  

In [15]:
import json
import re
from typing import Optional

import jsonlines
from tqdm import tqdm

def read_eba_txt(fn: str, N: Optional[int] = None, name_lookup: Optional[str] = None):
    """Read JSON records from a JSON Lines file.

    Args:
        fn - path to the JSON Lines file
        N - maximum number of records to return (None for no limit)
        name_lookup - optional case-insensitive substring to match against each record's 'name'
    Returns:
        List of dicts
    """
    data = []
    needle = name_lookup.lower() if name_lookup else None
    with jsonlines.open(fn, 'r') as fh:
        for obj in tqdm(fh):
            if needle:
                # Guard against records with a missing or null 'name'.
                if needle in (obj.get('name') or '').lower():
                    print(f"HIT! {obj['name']}")
                    data.append(obj)
            else:
                data.append(obj)
            # Stop once N records have been collected.
            if N and len(data) >= N:
                break
    return data


- This eats a LOT of RAM on its own when the whole file is loaded at once.
- Probably best to ETL one series at a time. Even in dict form the full file eats around 20GB of RAM.
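
Processing one series at a time could be done with a generator instead of building the full list. This is a sketch under the assumption that each line of the file is one standalone JSON object (which is what JSON Lines means). `iter_eba_series` is a hypothetical name, and it uses only the standard library `json` module rather than `jsonlines`.

```python
import json

def iter_eba_series(fn, name_lookup=None):
    """Yield one parsed record at a time so only a single series is in memory."""
    needle = name_lookup.lower() if name_lookup else None
    with open(fn, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)
            # Optional case-insensitive filter on the record name.
            if needle and needle not in (obj.get("name") or "").lower():
                continue
            yield obj

# Usage (the load step is a placeholder for whatever writes to Postgres):
# for series in iter_eba_series(fname):
#     load_one_series(series)
```

Peak memory then scales with the largest single series (around 60k points) rather than with the whole 2.8 GB file.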

In [17]:
eba_path = '/tf/data/EBA/EBA20230302'
fname = f'{eba_path}/EBA_PDX.txt'

In [18]:
all_data = read_eba_txt(fname)

32it [00:01, 24.78it/s]

In [19]:
len(all_data)

32

In [20]:
for dat in all_data:
    if 'series_id' in dat.keys():
        print(dat['series_id'], dat['name'])
        print(len(dat['data']), dat['data'][0:2], dat['data'][-1])
        print()
    else:
        print(dat['category_id'], dat['name'], dat['childseries'])

EBA.PGE-ALL.D.H Demand for Portland General Electric Company (PGE), hourly - UTC time
66520 [['20230302T22Z', 2957], ['20230302T21Z', 3050]] ['20150722T08Z', 1936]

EBA.PGE-ALL.D.HL Demand for Portland General Electric Company (PGE), hourly - local time
66520 [['20230302T14-08', 2957], ['20230302T13-08', 3050]] ['20150722T01-07', 1936]

EBA.PGE-PACW.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to PacifiCorp West (PACW), hourly - UTC time
66250 [['20230301T08Z', 84], ['20230301T07Z', 102]] ['20150721T08Z', -92]

EBA.PACW-PGE.ID.H Actual Net Interchange for PacifiCorp West (PACW) to Portland General Electric Company (PGE), hourly - UTC time
66959 [['20230301T08Z', -84], ['20230301T07Z', -102]] ['20150701T08Z', 101]

EBA.PGE-BPAT.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to Bonneville Power Administration (BPAT), hourly - UTC time
66265 [['20230301T08Z', -1638], ['20230301T07Z', -1738]] ['20150721T08Z', -1268]

EBA.PGE-BPAT.ID.HL Ac

In [None]:
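- Note that the transfers are not fully aligned for the most recent data: in the output above the interchange series stop at 20230301T08Z while demand runs through 20230302T22Z, and the PGE-PACW and PACW-PGE series differ in length (66250 vs 66959 points). I suspect some sort of reconciliation procedure clears that up, but that needs checking. Worth understanding when considering trades.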
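
One way to check how well the two directions of a transfer agree is to pair them up by timestamp and look for hours where they do not cancel. This is a sketch: the function names are made up for illustration, and it assumes that the A-to-B and B-to-A values should sum to zero, as they do in the printed samples.

```python
from datetime import datetime, timezone

def parse_utc(stamp):
    """Parse an EBA UTC timestamp such as '20230301T08Z'."""
    return datetime.strptime(stamp, "%Y%m%dT%HZ").replace(tzinfo=timezone.utc)

def interchange_mismatches(a_to_b, b_to_a):
    """Compare two opposite-direction interchange series.

    Returns (mismatched, unpaired): hours where the two values do not
    cancel, and hours present in only one of the two series.
    """
    fwd = {parse_utc(t): v for t, v in a_to_b}
    rev = {parse_utc(t): v for t, v in b_to_a}
    shared = fwd.keys() & rev.keys()
    mismatched = sorted(
        t for t in shared
        if fwd[t] is not None and rev[t] is not None and fwd[t] + rev[t] != 0
    )
    unpaired = sorted(fwd.keys() ^ rev.keys())
    return mismatched, unpaired

# The two most recent points of each series, copied from the output above.
pge_pacw = [["20230301T08Z", 84], ["20230301T07Z", 102]]
pacw_pge = [["20230301T08Z", -84], ["20230301T07Z", -102]]
print(interchange_mismatches(pge_pacw, pacw_pge))  # ([], [])
```

Run over the full series, the `unpaired` list would show where one direction has hours the other lacks.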

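So we have 4 big categories of data in this file (demand, demand forecast, net generation, interchange), which break down into the six kinds of series below. All series are provided in both local-time and UTC variants.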

- Demand
- Demand Forecast
- Net Generation
- Net Generation (by source) - Much less data
- Total Interchange
- Interchange with other ISOs

- Around 8 years of data for demand/net generation.
- Around 5 years for generation by source data.
- Hourly resolution 
- Around 100 ISOs (2850 series, so roughly 30 series per ISO, varying with the number of interchange partners).
- 60k data points per series at hourly resolution.
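
The series IDs printed earlier (`EBA.PGE-ALL.D.H`, `EBA.PGE-PACW.ID.HL`, `EBA.US48-ALL.DF.H`) appear to follow an `EBA.<source>-<target>.<type>.<H|HL>` pattern. Below is a small parser sketch based only on that observed pattern; `SeriesKey` and `parse_series_id` are hypothetical names, and type codes other than `D`, `DF` and `ID` are untested here.

```python
from typing import NamedTuple

class SeriesKey(NamedTuple):
    source: str  # reporting ISO, e.g. 'PGE'
    target: str  # 'ALL', or the counterparty ISO for interchange
    kind: str    # type code, e.g. 'D', 'DF', 'ID'
    utc: bool    # True for '.H' (UTC), False for '.HL' (local time)

def parse_series_id(series_id):
    """Split an EBA series_id into its components."""
    parts = series_id.split(".")
    if len(parts) < 4 or parts[0] != "EBA" or parts[-1] not in ("H", "HL"):
        raise ValueError(f"unexpected series_id: {series_id!r}")
    source, _, target = parts[1].partition("-")
    # Everything between the ISO pair and the frequency suffix is the type code.
    kind = ".".join(parts[2:-1])
    return SeriesKey(source, target, kind, parts[-1] == "H")

print(parse_series_id("EBA.PGE-PACW.ID.H"))
```

Filtering on `utc` gives a cheap way to drop the local-time duplicates before loading.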

## Proposed SQL Table Structure - EBA

- Our initial project is focused on the demand forecasting piece. Let's stick to the bulk attributes for now, and return later if need be for breakdowns by generation type.

### Options:
1) 1 table per series (hard to look up) - Reject.

2) 1 table per type (100 ISOs as columns).
    - Demand (Time, PDX, BPA, CAISO, ...)
    - Forecast (Time, PDX, BPA, CAISO, ...)
    - Net Generation (Time, PDX, BPA, CAISO,...)
    - Interchange(Time, P1, P2, Amount)

3) 1 major table per ISO (around 30 sub-series)
   -  PGE (Time, Demand, Forecast, Net Generation, COL, HYD, ..., PGE-BPA, PGE-PACW)
   -  BPA (Time, Demand, Forecast, Net Generation, COL, HYD, ..., BPA-PGE, BPA-PACW)

Leaning toward approach 3: it better encapsulates each system's process, and allows both local time and UTC time.
Also leaning towards only including the UTC time variants.

- Need all series names (types of data).
- Need all ISOs and transfers.
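
To make approach 3 concrete, here is a rough sketch of collecting the UTC series for one ISO into rows keyed by timestamp, with one column per series. `iso_rows` is a hypothetical helper, the column naming (type code for bulk series, `SRC-DST` for interchange) is an assumption, and the sample values are toy data.

```python
from collections import defaultdict

def iso_rows(series_list, iso):
    """Collect UTC series for one ISO into {timestamp: {column: value}}."""
    rows = defaultdict(dict)
    for s in series_list:
        parts = s["series_id"].split(".")
        if parts[-1] != "H":  # keep UTC variants only
            continue
        source, _, target = parts[1].partition("-")
        if source != iso:
            continue
        kind = ".".join(parts[2:-1])
        # Interchange columns are named after the pair, others after the type code.
        column = f"{source}-{target}" if target != "ALL" else kind
        for stamp, value in s["data"]:
            rows[stamp][column] = value
    return dict(rows)

# Toy input in the same shape as the records printed above.
sample = [
    {"series_id": "EBA.PGE-ALL.D.H", "data": [["20230301T08Z", 2500]]},
    {"series_id": "EBA.PGE-ALL.D.HL", "data": [["20230301T00-08", 2500]]},
    {"series_id": "EBA.PGE-PACW.ID.H", "data": [["20230301T08Z", 84]]},
    {"series_id": "EBA.PACW-PGE.ID.H", "data": [["20230301T08Z", -84]]},
]
print(iso_rows(sample, "PGE"))
# {'20230301T08Z': {'D': 2500, 'PGE-PACW': 84}}
```

Each inner dict maps directly onto one row of the per-ISO table, with missing columns left as NULL.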

## Getting Metadata

(Increasingly getting feeling that Mongo is the way to really handle this data)

## EBA

Want:
- list of ISOs, names

## Airports

I think the `merge_air_df` is probably already close to what we want: mapping from id to name/region.


In [22]:
! head /tf/data/EBA/EBA20230302/metaseries.txt

{"category_id":"2122627","parent_category_id":"2123635","name":"Day-ahead demand forecast","notes":"","childseries":[]}
{"category_id":"3389848","parent_category_id":"2122627","name":"U.S.","notes":"","childseries":[]}
{"category_id":"3389851","parent_category_id":"3389848","name":"United States Lower 48 (US48)","notes":"","childseries":["EBA.US48-ALL.DF.H","EBA.US48-ALL.DF.HL"]}
{"category_id":"3389849","parent_category_id":"2122627","name":"Regions","notes":"","childseries":[]}
{"category_id":"3389852","parent_category_id":"3389849","name":"California (CAL)","notes":"","childseries":["EBA.CAL-ALL.DF.H","EBA.CAL-ALL.DF.HL"]}
{"category_id":"3389853","parent_category_id":"3389849","name":"Carolinas (CAR)","notes":"","childseries":["EBA.CAR-ALL.DF.H","EBA.CAR-ALL.DF.HL"]}
{"category_id":"3389854","parent_category_id":"3389849","name":"Central (CENT)","notes":"","childseries":["EBA.CENT-ALL.DF.H","EBA.CENT-ALL.DF.HL"]}
{"category_id":"3389855","parent_category_id":"3389849"

In [4]:
import pandas as pd

# grep -r category_id EBA.txt > metaseries.txt
fn = '/tf/data/EBA/EBA20230302/metaseries.txt'
meta_df = pd.read_json(fn, lines=True)

In [10]:
meta_df.loc[19]

category_id                                         4670478
parent_category_id                                  3389849
name                                        Tennessee (TEN)
notes                                                      
childseries           [EBA.TEN-ALL.DF.H, EBA.TEN-ALL.DF.HL]
Name: 19, dtype: object

In [6]:
import re

In [7]:

def parse_metadata(df):
    """Grab names and abbreviations from category names like 'Tennessee (TEN)'."""
    iso_map = {}
    for _, row in df.iterrows():
        if '(' in row['name']:
            # Tokenizing on \w+ drops punctuation, so the last token is the abbreviation.
            tokens = re.findall(r'\w+', row['name'])
            name = ' '.join(tokens[:-1])
            abbrv = tokens[-1]
            # Only keep all-caps codes such as 'TEN' or 'US48'.
            if abbrv == abbrv.upper():
                iso_map[abbrv] = name
    return iso_map

In [8]:
%pdb off
iso_map = parse_metadata(meta_df)

Automatic pdb calling has been turned OFF


In [9]:
len(iso_map)

82

In [28]:
save_iso_names(iso_map)

'/tf/data/EBA/EBA20230302_iso_map.json'

In [29]:
i2 = load_iso_names()
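
`save_iso_names` and `load_iso_names` are not defined in this notebook. Assuming they simply round-trip the dict through a JSON file, they could look something like the sketch below. The explicit `path` argument is an addition for illustration; the calls above take no path and presumably use a default location.

```python
import json

def save_iso_names(iso_map, path):
    """Write the abbreviation -> name mapping to a JSON file and return its path."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(iso_map, fh, indent=2, sort_keys=True)
    return path

def load_iso_names(path):
    """Read the abbreviation -> name mapping back from a JSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```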

In [14]:
from us_elec.SQL.sqldriver import EBAMeta

In [15]:
ebm = EBAMeta()

In [38]:
ebm.extract_meta_data()

In [48]:
ebm.save_iso_dict_json()


'/tf/data/EBA/EBA20230302/iso_name_file.json'

In [50]:
ebm.load_iso_dict_json()

{'US48': 'United States Lower 48',
 'CAL': 'California',
 'CAR': 'Carolinas',
 'CENT': 'Central',
 'FLA': 'Florida',
 'MIDA': 'Mid Atlantic',
 'MIDW': 'Midwest',
 'NW': 'Northwest',
 'SE': 'Southeast',
 'SW': 'Southwest',
 'ERCO': 'Electric Reliability Council of Texas Inc',
 'ISNE': 'ISO New England',
 'NYIS': 'New York Independent System Operator',
 'TVA': 'Tennessee Valley Authority',
 'NE': 'New England',
 'NY': 'New York',
 'TEN': 'Tennessee',
 'TEX': 'Texas',
 'AZPS': 'Arizona Public Service Company',
 'AECI': 'Associated Electric Cooperative Inc',
 'AVA': 'Avista Corporation',
 'BANC': 'Balancing Authority of Northern California',
 'BPAT': 'Bonneville Power Administration',
 'CISO': 'California Independent System Operator',
 'HST': 'City of Homestead',
 'TPWR': 'City of Tacoma Department of Public Utilities Light Division',
 'TAL': 'City of Tallahassee',
 'DUK': 'Duke Energy Carolinas',
 'FPC': 'Duke Energy Florida Inc',
 'CPLE': 'Duke Energy Progress East',
 'CPLW': 'Duke Energ

In [None]:
# TODO:
1. get metadata
2. create weather tables
3. populate tables

### Saving Airport Metadata

From `airport_play.ipynb`, which downloaded all that data, we have the merge_df, which merges city and location information
with callsign info.


In [51]:
from us_elec.SQL.sqldriver import AirMeta

In [54]:
am = AirMeta()

In [64]:
df = am.get_air_meta_df()

In [65]:
df.sort_values(['ST', 'CALL'])

Unnamed: 0,name,City,CALL,USAF,WBAN,LAT,LON,ST
749,"Warren ""Bud"" Woods Palmer Municipal Airport",Palmer,PAAQ,702740,25331,61.596,-149.092,AK
750,Barter Island LRRS Airport,Barter Island Lrrs,PABA,700860,27401,70.134,-143.577,AK
751,Bethel Airport,Bethel,PABE,702190,26615,60.785,-161.829,AK
752,Allen Army Airfield,Delta Junction Ft Greely,PABI,702670,26415,63.994,-145.721,AK
753,Buckland Airport,Buckland,PABL,700632,26645,65.983,-161.133,AK
...,...,...,...,...,...,...,...,...
597,Riverton Regional Airport,Riverton,KRIW,726720,24061,43.064,-108.459,WY
598,Southwest Wyoming Regional Airport,Rock Springs,KRKS,725744,24027,41.595,-109.053,WY
614,Rawlins Municipal Airport/Harvey Field,Rawlins,KRWL,726690,24057,41.800,-107.200,WY
642,Sheridan County Airport,Sheridan,KSHR,726660,24029,44.769,-106.969,WY


In [66]:
am.save_callsigns()

In [68]:
am.load_callsigns()

In [4]:
from us_elec.SQL.sqldriver import SQLDriver

In [5]:
sqldriver = SQLDriver()

In [None]:
st = sqldriver._get_