# EBA Data Ingestion

We need to collect data from two main sources for this project:
 - First, load the EBA data into the SQL DB.
 - Second, get the observed weather data from NOAA weather stations.   
   This is done via FTP using `code/utils/get_weather_data.py` for the relevant time periods.
 - (A third would be accessing NOAA's forecast DB.)

In all cases we load the data into a Postgres database for easier querying later.  


In [1]:
import os
import sys

In [2]:
if '/tf' not in sys.path:
    sys.path.append('/tf/')

%load_ext autoreload
%autoreload 2

In [3]:
from us_elec.SQL.sqldriver import SQLDriver

In [8]:
sqld = SQLDriver()

OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
	Is the server running locally and accepting connections on that socket?
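
The error above shows the client trying the local Unix socket, which is what libpq does when no host is given. How `SQLDriver` takes its settings is not shown here, so the following is only a sketch of one way to collect TCP connection parameters, assuming the standard `PG*` environment variables are set. The helper name `pg_conn_params`, the default database name `us_elec` and the host `db` are hypothetical.

```python
import os

def pg_conn_params(env=None):
    """Build Postgres connection parameters from the standard PG* variables."""
    env = os.environ if env is None else env
    return {
        # Any hostname (even "localhost") makes the client use TCP, not the Unix socket
        "host": env.get("PGHOST", "localhost"),
        "port": int(env.get("PGPORT", "5432")),
        "dbname": env.get("PGDATABASE", "us_elec"),  # hypothetical database name
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", ""),
    }

# "db" is a made-up hostname, e.g. a database container on the same network
params = pg_conn_params({"PGHOST": "db", "PGPORT": "5433"})
print(params["host"], params["port"])
```

A dict like this could be handed to a driver call such as `psycopg2.connect(**params)` once the server is reachable.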


# Library Sketch and Table Sketch 

- 560 MB of weather station data
- 2.8 GB of energy data
- 5 GB of forecast data (could try to extract only the station data)

Energy data x 100 ISOs
- Demand
- Demand Forecast
- Net Generation
    (by source)
- Transfers

Weather x 600 stations
- Temp
- Cloud cover
- Precipitation

Forecast
- Temp (gridded 24-hour forecast over CONUS).  Probably don't want this in the DB.
- Include a file reference.
- Try to find the nearest forecast pixel for each airport.

Since we want to think about a whole-system forecast, we can live with a few big tables, one per variable.
Use UTC timestamps throughout.

Demand Table
    id, 
    datetime
    iso1,
    iso2,
    ...
    index on datetime

Forecast Table
    (same columns as Demand Table)

Net Generation (*)
    (same columns as Demand Table; same again for each sub-source)

Transfers  (*)
   id, 
   datetime
   iso1,
   iso2,
   amount
   index on datetime, 
   
   
Temperature
   id, 
   datetime
   st1,
   st2,
   ...
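
The wide tables sketched above could be generated from a list of column names. This is a minimal sketch, not the project's actual schema: the Postgres types (`BIGSERIAL`, `TIMESTAMPTZ`, `DOUBLE PRECISION`), the helper name `wide_table_ddl` and the lowercase column names are all assumptions.

```python
def wide_table_ddl(table, columns):
    """Return CREATE TABLE and CREATE INDEX statements for a wide time-series table.

    Names are interpolated directly, so only pass trusted identifiers.
    """
    col_defs = ",\n".join(f"    {c} DOUBLE PRECISION" for c in columns)
    create = (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    id BIGSERIAL PRIMARY KEY,\n"
        "    datetime TIMESTAMPTZ NOT NULL,\n"
        f"{col_defs}\n"
        ");"
    )
    index = f"CREATE INDEX IF NOT EXISTS {table}_datetime_idx ON {table} (datetime);"
    return create, index

# One column per ISO (demand) or per station (temperature)
create_sql, index_sql = wide_table_ddl("demand", ["pge", "bpat", "pacw"])
print(create_sql)
print(index_sql)
```

The Transfers table is the exception: it is long rather than wide (`iso1`, `iso2`, `amount`), so it would need its own statement.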
   
   

# Bulk EBA data import

The EBA data can be downloaded from `https://www.eia.gov/opendata/bulk/EBA.zip`.
As of Mar 6, 2023 it's around 2.8 GB, with around 2800 child series, stored in one JSON Lines file.

That's downloaded to `data/EBA/EBA20230302`.  
For initial quick exploration, you can grep out the 'California' and 'Portland' series into smaller files.
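
The grep step can also be done in Python. This is a sketch only: `grep_lines` is a hypothetical helper, and the demo file below is made up to stand in for `EBA.txt`. Because each line holds one JSON record, a plain substring test on the raw line is enough to pull out a subset.

```python
import os
import tempfile

def grep_lines(src, dst, needle):
    """Copy the lines of `src` that contain `needle` to `dst`; return the count."""
    hits = 0
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:  # one line at a time, so memory use stays small
            if needle in line:
                fout.write(line)
                hits += 1
    return hits

# Tiny made-up demo file standing in for EBA.txt
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "EBA_demo.txt")
dst = os.path.join(demo_dir, "EBA_PDX_demo.txt")
with open(src, "w", encoding="utf-8") as fh:
    fh.write('{"name": "Demand for Portland General Electric Company (PGE)"}\n')
    fh.write('{"name": "Demand for Some Other Utility"}\n')

n_hits = grep_lines(src, dst, "Portland")
print(n_hits)
```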


In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
!pwd

/tf/us_elec/notebooks


In [22]:
!wc -l /tf/us_elec/data/EBA/EBA20230302/EBA.txt

2847 /tf/us_elec/data/EBA/EBA20230302/EBA.txt


In [4]:
!ls -l /tf/us_elec/data/EBA/EBA20230302

total 2893992
-rw-rw-rw- 1 root root 2760336845 Mar  3 13:13 EBA.txt
-rw-rw-r-- 1 root root  170876970 Mar  8 01:35 EBA_CA.txt
-rw-rw-r-- 1 root root   32219742 Mar  8 01:35 EBA_PDX.txt


In [5]:
eba_path = '/tf/us_elec/data/EBA/EBA20230302'
fname = f'{eba_path}/EBA_PDX.txt'

- Grepped out all Portland and California series into smaller files, giving a subset of data to play with while cleaning up the ETL work.

In [12]:
import json
import jsonlines
import re
from typing import Optional
from tqdm import tqdm

def read_eba_txt(fn: str, N: Optional[int] = None, name_lookup: Optional[str] = None):
    """Read records from a JSON Lines file.

    Args:
        fn: path to the JSON Lines file.
        N: maximum number of records to return (None for no limit).
        name_lookup: optional case-insensitive substring to match against
            each record's 'name'; only matching records are kept.

    Returns:
        List of dicts.
    """
    data = []
    needle = name_lookup.lower() if name_lookup else None

    with jsonlines.open(fn, 'r') as fh:
        for obj in tqdm(fh):
            if needle is not None:
                # Records without a 'name' are treated as non-matching.
                if needle not in (obj.get('name') or '').lower():
                    continue
                print(f"HIT! {obj['name']}")
            data.append(obj)
            # Stop once N records have been collected.
            if N and len(data) >= N:
                break
    return data


- This eats a LOT of RAM on its own when reading the full file.  
- Probably best to ETL one series at a time.  Even in dict form it's eating around 20GB of RAM.
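
One way to ETL a record at a time is a generator that yields each parsed line instead of building a list. This is a sketch using the standard-library `json` module; `iter_eba_records` is a hypothetical name and the two-record demo file is made up for illustration.

```python
import json
import os
import tempfile

def iter_eba_records(path):
    """Yield one parsed record at a time from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Made-up two-record file for demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write('{"series_id": "EBA.PGE-ALL.D.H", "data": [["20230302T22Z", 2957]]}\n')
    fh.write('{"series_id": "EBA.PGE-ALL.D.HL", "data": [["20230302T14-08", 2957]]}\n')

n_points = {}
for rec in iter_eba_records(demo_path):
    # A real ETL step would write rec["data"] to the DB here, then drop the record
    n_points[rec["series_id"]] = len(rec["data"])
print(n_points)
```

Only one record is held in memory at any time, so peak usage is bounded by the largest single series rather than the whole file.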

In [9]:
all_data = read_eba_txt(fname)

32it [00:01, 27.72it/s]




In [10]:
len(all_data)

32

In [25]:
for dat in all_data:
    if 'series_id' in dat.keys():
        print(dat['series_id'], dat['name'])
        print(len(dat['data']), dat['data'][0:2], dat['data'][-1])
        print()
    else:
        print(dat['category_id'], dat['name'], dat['childseries'])

EBA.PGE-ALL.D.H Demand for Portland General Electric Company (PGE), hourly - UTC time
66520 [['20230302T22Z', 2957], ['20230302T21Z', 3050]] ['20150722T08Z', 1936]

EBA.PGE-ALL.D.HL Demand for Portland General Electric Company (PGE), hourly - local time
66520 [['20230302T14-08', 2957], ['20230302T13-08', 3050]] ['20150722T01-07', 1936]

EBA.PGE-PACW.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to PacifiCorp West (PACW), hourly - UTC time
66250 [['20230301T08Z', 84], ['20230301T07Z', 102]] ['20150721T08Z', -92]

EBA.PACW-PGE.ID.H Actual Net Interchange for PacifiCorp West (PACW) to Portland General Electric Company (PGE), hourly - UTC time
66959 [['20230301T08Z', -84], ['20230301T07Z', -102]] ['20150701T08Z', 101]

EBA.PGE-BPAT.ID.H Actual Net Interchange for Portland General Electric Company (PGE) to Bonneville Power Administration (BPAT), hourly - UTC time
66265 [['20230301T08Z', -1638], ['20230301T07Z', -1738]] ['20150721T08Z', -1268]

EBA.PGE-BPAT.ID.HL Ac

So we have six kinds of series in this file.  Each series is provided in both a local-time and a UTC variant.

- Demand
- Demand Forecast
- Net Generation
- Net Generation (by source) - Much less data
- Total Interchange
- Interchange with other ISOs
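
The series IDs printed above (for example `EBA.PGE-ALL.D.H` and `EBA.PGE-PACW.ID.H`) encode the ISO pair, the type code and the time variant. A sketch of splitting them follows. `SeriesKey` and `parse_series_id` are hypothetical names, and the handling of type codes with more than one dotted part (as by-source generation series might use) is an assumption, not something shown in the output above.

```python
from typing import NamedTuple, Optional

class SeriesKey(NamedTuple):
    source: str                 # e.g. "PGE"
    counterpart: Optional[str]  # e.g. "PACW"; None when the ID says "ALL"
    kind: str                   # type code, e.g. "D" (demand) or "ID" (interchange)
    utc: bool                   # True for ".H" (UTC), False for ".HL" (local time)

def parse_series_id(series_id):
    """Split an ID such as 'EBA.PGE-PACW.ID.H' into its parts."""
    prefix, pair, *kind_parts, freq = series_id.split(".")
    if prefix != "EBA" or freq not in ("H", "HL") or not kind_parts:
        raise ValueError(f"unexpected series_id: {series_id!r}")
    source, _, counterpart = pair.partition("-")
    return SeriesKey(
        source,
        None if counterpart == "ALL" else counterpart,
        ".".join(kind_parts),  # assumption: multi-part codes are kept joined
        freq == "H",
    )

print(parse_series_id("EBA.PGE-ALL.D.H"))
print(parse_series_id("EBA.PGE-BPAT.ID.HL"))
```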

- Around 8 years of data for demand/net generation.
- Around 5 years for generation by source data.
- Hourly resolution 
- Around 100 ISOs (about 2850 series, roughly 30 per ISO, though the number of interchange series varies).
- Around 60k data points per series at hourly resolution (e.g. 66,520 for PGE demand).
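
Timestamps in the output above come in two shapes: `20230302T22Z` for UTC and `20230302T14-08` for local time. A sketch of parsing both into timezone-aware datetimes follows, assuming local offsets are always whole hours as in the samples shown (`parse_eba_time` is a hypothetical helper).

```python
from datetime import datetime, timezone

def parse_eba_time(stamp):
    """Parse '20230302T22Z' (UTC) or '20230302T14-08' (local with hour offset)."""
    if stamp.endswith("Z"):
        naive = datetime.strptime(stamp, "%Y%m%dT%HZ")
        return naive.replace(tzinfo=timezone.utc)
    # Local stamps carry a bare hour offset; pad it to the +/-HHMM form %z expects
    return datetime.strptime(stamp + "00", "%Y%m%dT%H%z")

utc_time = parse_eba_time("20230302T22Z")
local_time = parse_eba_time("20230302T14-08")
print(utc_time.isoformat())
print(local_time.astimezone(timezone.utc) == utc_time)
```

The two sample stamps are the first points of the PGE demand series in UTC and local form, so they should resolve to the same instant.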

## Proposed SQL Table Structure - EBA

- Our initial project is focused on the demand forecasting piece.  Let's focus on the bulk attributes for now, and return later if need be for breakdowns by generation type.

### Options:
1) 1 table per series (hard to look up) - Reject.

2) 1 table per type (100 ISOs as columns).
    - Demand (Time, PDX, BPA, CAISO, ...)
    - Forecast (Time, PDX, BPA, CAISO, ...)
    - Net Generation (Time, PDX, BPA, CAISO,...)
    - Interchange(Time, P1, P2, Amount)

3) 1 major table per ISO (around 30 sub-series)
   -  PGE (Time, Demand, Forecast, Net Generation, COL, HYD, ..., PGE-BPA, PGE-PACW)
   -  BPA (Time, Demand, Forecast, Net Generation, COL, HYD, ..., BPA-PGE, BPA-PACW)

Leaning toward approach 3, which better encapsulates the system process and could hold both local-time and UTC series.
Also leaning towards only including the UTC time variants.
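
For approach 3, the separate series for one ISO have to be merged into rows keyed by timestamp. A minimal sketch of that pivot follows, using values from the PGE output above; `pivot_series` and the column names are hypothetical.

```python
from collections import defaultdict

def pivot_series(records):
    """Merge several series into rows keyed by timestamp.

    `records` maps a column name to a list of [timestamp, value] pairs,
    the same shape as the 'data' field of each EBA record.
    """
    rows = defaultdict(dict)
    for column, points in records.items():
        for stamp, value in points:
            rows[stamp][column] = value
    # The fixed-width UTC stamps sort chronologically as plain strings
    return [{"time": stamp, **cols} for stamp, cols in sorted(rows.items())]

pge_rows = pivot_series({
    "demand": [["20230302T22Z", 2957], ["20230302T21Z", 3050]],
    "pge_pacw": [["20230301T08Z", 84], ["20230301T07Z", 102]],
})
print(pge_rows[0])
```

Columns missing from a row (the series cover different date ranges) are simply absent from the dict and would become NULLs on insert.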

- Need all series names (types of data).
- Need all ISOs and transfers.
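
Both lists can be gathered in one pass over the series IDs. The sketch below assumes every ID follows the `EBA.<ISO>-<ISO or ALL>.<type>.<H or HL>` pattern seen in the output earlier; `inventory` is a hypothetical helper, and category records (which have no `series_id`) would need to be skipped before calling it.

```python
def inventory(series_ids):
    """Collect the ISO codes, type codes and interchange pairs that appear."""
    isos, kinds, pairs = set(), set(), set()
    for sid in series_ids:
        parts = sid.split(".")
        source, _, counterpart = parts[1].partition("-")
        isos.add(source)
        kinds.add(".".join(parts[2:-1]))  # everything between the pair and H/HL
        if counterpart != "ALL":
            isos.add(counterpart)
            pairs.add((source, counterpart))
    return isos, kinds, pairs

isos, kinds, pairs = inventory([
    "EBA.PGE-ALL.D.H",
    "EBA.PGE-PACW.ID.H",
    "EBA.PACW-PGE.ID.H",
    "EBA.PGE-BPAT.ID.H",
])
print(sorted(isos), sorted(kinds), sorted(pairs))
```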

In [29]:
len(raw_lines)

10

In [30]:
raw_lines

['series_id',
 'name',
 'units',
 'f',
 'description',
 'start',
 'end',
 'last_updated',
 'geoset_id',
 'data']