# Converting Txt to SQL

This notebook is primarily used for creating and loading the EBA data into SQL tables.

We need to load the data into a format that is easier to query than one huge JSON dump.

In particular, we need ways to extract particular series by type, location, and time window.

This can then inform the simple models that try to predict energy usage.

Still to do:
- Load in newer data.
- Write a function to collate weather and prediction data.
- Write functions to find ISO regions / locations.
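
As a rough illustration of extracting a series by type, location, and time window from a SQL table, here is a minimal sketch using an in-memory SQLite database. The table name `eba_hourly`, its columns, the helper `select_window`, and the numeric values are all assumptions made up for this example; they are not taken from the EBA file.

```python
import sqlite3

# Hypothetical schema: one row per (series, hour). Names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE eba_hourly (
        series_id TEXT NOT NULL,   -- e.g. 'EBA.PGE-ALL.D.H'
        region    TEXT NOT NULL,   -- balancing authority code, e.g. 'PGE'
        kind      TEXT NOT NULL,   -- series type code, e.g. 'D' for demand
        ts_utc    TEXT NOT NULL,   -- ISO 8601 hour in UTC, sortable as text
        value     REAL,
        PRIMARY KEY (series_id, ts_utc)
    )
    """
)

# Made-up values, purely to exercise the query.
rows = [
    ("EBA.PGE-ALL.D.H", "PGE", "D", "2017-01-01T00:00Z", 2500.0),
    ("EBA.PGE-ALL.D.H", "PGE", "D", "2017-01-01T01:00Z", 2450.0),
    ("EBA.PGE-ALL.D.H", "PGE", "D", "2017-01-01T02:00Z", 2400.0),
    ("EBA.FPL-ALL.D.H", "FPL", "D", "2017-01-01T00:00Z", 12000.0),
]
conn.executemany("INSERT INTO eba_hourly VALUES (?, ?, ?, ?, ?)", rows)


def select_window(conn, region, kind, start, end):
    """Return (ts_utc, value) rows for one region and type with start <= ts_utc < end."""
    cur = conn.execute(
        "SELECT ts_utc, value FROM eba_hourly "
        "WHERE region = ? AND kind = ? AND ts_utc >= ? AND ts_utc < ? "
        "ORDER BY ts_utc",
        (region, kind, start, end),
    )
    return cur.fetchall()


window = select_window(conn, "PGE", "D", "2017-01-01T00:00Z", "2017-01-01T02:00Z")
```

Storing the timestamp as fixed-width ISO 8601 text keeps plain string comparison consistent with chronological order, which is what the window filter relies on.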


In [11]:
import sys

CODE_PATHS = ['/tf/us_elec', '/tf']

for cp in CODE_PATHS:
    if cp not in sys.path:
        sys.path.append(cp)


In [8]:
%load_ext autoreload
%autoreload 2

In [12]:
sys.path

['/tf/notebooks',
 '/usr/lib/python38.zip',
 '/usr/lib/python3.8',
 '/usr/lib/python3.8/lib-dynload',
 '',
 '/usr/local/lib/python3.8/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/tf/us_elec',
 '/tf']

In [13]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#stuff for ARIMA modelling
from statsmodels.tsa.stattools import adfuller, acf, pacf, arma_order_select_ic
from statsmodels.tsa.arima_model import ARIMA

from us_elec.util.get_weather_data import convert_isd_to_df, convert_state_isd
from us_elec.util.EBA_util import avg_extremes, remove_na


In [14]:

%matplotlib inline

pi = np.pi
#save higher-res figures.
# dpi=120
# mpl.rc("savefig",dpi=dpi)
# mpl.rcParams['figure.dpi']=dpi
COLOR = 'white'
mpl.rcParams['text.color'] = COLOR
mpl.rcParams['axes.labelcolor'] = COLOR
mpl.rcParams['xtick.color'] = COLOR
mpl.rcParams['ytick.color'] = COLOR


In [15]:
air_df = pd.read_csv('/tf/data/air_merge_df.csv.gz', index_col=0)

In [16]:
air_df.head()

Unnamed: 0,name,City,CALL,USAF,WBAN,LAT,LON,ST
0,South Alabama Regional At Bill Benton Field Ai...,Andalusia/Opp,K79J,722275,53843,31.309,-86.394,AL
1,Lehigh Valley International Airport,Allentown,KABE,725170,14737,40.65,-75.448,PA
2,Abilene Regional Airport,Abilene,KABI,722660,13962,32.411,-99.682,TX
3,Albuquerque International Sunport,Albuquerque,KABQ,723650,23050,35.042,-106.616,NM
4,Aberdeen Regional Airport,Aberdeen,KABR,726590,14929,45.443,-98.413,SD


In [15]:
%pdb off
#reads in list of airport codes.

#Just get the weather station data for cities in Oregon.
df_weather = convert_state_isd(air_df, 'OR')

#Read all of the weather data in.
#df_weather=pd.read_csv('data/airport_weather.gz',index_col=0,parse_dates=True)

Automatic pdb calling has been turned OFF

done with Astoria Regional Airport
done with Baker City Municipal Airport
done with Burns Municipal Airport
done with Corvallis Municipal Airport
done with Mahlon Sweet Field
done with Portland Hillsboro Airport
done with Crater Lake-Klamath Regional Airport
done with Rogue Valley International Medford Airport
done with Mc Minnville Municipal Airport
done with Newport Municipal Airport
done with Southwest Oregon Regional Airport
done with Eastern Oregon Regional At Pendleton Airport
done with Portland International Airport
done with Roberts Field
done with Salem Municipal Airport/McNary Field
done with Portland Troutdale Airport
done with Aurora State Airport

In [16]:
df_weather

Unnamed: 0,Temp,DewTemp,Pressure,WindDir,WindSpeed,CloudCover,Precip-1hr,Precip-6hr,city,state,"city, state",region
2015-07-01 00:00:00,,,,,,,,,,,,
2015-07-01 01:00:00,,,,,,,,,,,,
2015-07-01 02:00:00,,,,,,,,,,,,
2015-07-01 03:00:00,,,,,,,,,,,,
2015-07-01 04:00:00,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
2017-12-31 19:00:00,44.0,17.0,10212.0,160.0,15.0,0.0,0.0,,Aurora,OR,"Aurora, OR",Northwest
2017-12-31 20:00:00,67.0,28.0,10208.0,160.0,21.0,0.0,0.0,,Aurora,OR,"Aurora, OR",Northwest
2017-12-31 21:00:00,72.0,28.0,10203.0,120.0,15.0,4.0,0.0,,Aurora,OR,"Aurora, OR",Northwest
2017-12-31 22:00:00,83.0,33.0,10199.0,0.0,0.0,,0.0,,Aurora,OR,"Aurora, OR",Northwest


In [17]:
import json
from tqdm import tqdm

# Loading Data into a MongoDB database

The data is already JSON, so we can exploit that and load it directly into a Mongo database, which is easy to query.
That makes it a lot easier to sub-select data.

Note: the MongoDB server (mongod) must be running before connecting.
Assuming a local installation managed by systemd, start it with "sudo systemctl start mongod" and stop it with "sudo systemctl stop mongod".

In [7]:
from pymongo import MongoClient

In [9]:
local_mongo = 'localhost:27017'
client = MongoClient(local_mongo)
db = client.admin
eba_db = client.eba

In [20]:
# Make the eba database


In [28]:
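def load_data_into_mongo(path='/tf/data/EBA/EBA.txt'):
    """Load the EBA bulk file (one JSON object per line) into eba_db.datasets."""
    with open(path, 'r') as fp:
        # Parse line by line, skipping blank lines.
        docs = [json.loads(line) for line in tqdm(fp) if line.strip()]
    if docs:
        # One bulk insert instead of one round trip per document.
        eba_db.datasets.insert_many(docs)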
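
If the bulk file grows, it may be preferable not to hold every parsed document in memory at once. The sketch below shows one way to stream a JSON-lines file in fixed-size batches. The helper names `iter_json_lines` and `batched` are hypothetical, and the sample input is made up; with a live collection, each batch could be handed to `insert_many`.

```python
import io
import json
from itertools import islice


def iter_json_lines(fp):
    """Yield one parsed JSON object per non-blank line of a file-like object."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Made-up stand-in for the bulk file: three documents and one blank line.
sample = io.StringIO('{"series_id": "A"}\n\n{"series_id": "B"}\n{"series_id": "C"}\n')
batches = list(batched(iter_json_lines(sample), size=2))

# With a running MongoDB, each batch could be inserted like this (not executed here):
#     for batch in batched(iter_json_lines(fp), size=500):
#         eba_db.datasets.insert_many(batch)
```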

## Querying the DB


In [10]:
# get counts of documents
eba_db.datasets.count_documents({})

2789

In [22]:
# get a series
rv0 = eba_db.datasets.find_one({}, {"series_id": 1, "description": 1, "start":1, "end":1, "_id": 0, 'data':1})

In [None]:
# TODO: write helper functions to select relevant series, e.g. by combining regexes.
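
One possible shape for such a helper is sketched below: it builds a `$regex` filter on `series_id` from optional region, type, and frequency codes. The function name `series_id_filter` and its parameters are hypothetical, and the pattern assumes the four-part layout seen in IDs such as 'EBA.PGE-ALL.D.H'; IDs with a different layout would not match.

```python
import re


def _part(value, default):
    """Escape a literal code, or fall back to a wildcard pattern."""
    return re.escape(value) if value else default


def series_id_filter(region=None, kind=None, freq=None):
    """Build a MongoDB filter dict matching IDs like 'EBA.<region>-<other>.<kind>.<freq>'."""
    region_part = _part(region, r"[^-.]+")
    kind_part = _part(kind, r"[^.]+")
    freq_part = _part(freq, r"[^.]+")
    pattern = rf"^EBA\.{region_part}-[^.]+\.{kind_part}\.{freq_part}$"
    return {"series_id": {"$regex": pattern}}


flt = series_id_filter(region="PGE", kind="D", freq="H")
pattern = flt["series_id"]["$regex"]

# The filter could then be passed to the collection (not executed here):
#     eba_db.datasets.find(flt)
```

Anchoring the pattern with `^` and `$` keeps a request for 'H' from also matching 'HL', which an unanchored substring regex would do.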

In [24]:
rv0['series_id']

'EBA.FPL-ALL.D.H'
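
The series IDs in this notebook ('EBA.FPL-ALL.D.H', 'EBA.PGE-ALL.D.HL', 'EBA.PGE-BPAT.ID.H') appear to follow a dotted layout of source-counterpart, type code, and frequency code. A minimal sketch for splitting an ID into those parts follows. The names `SeriesKey` and `parse_series_id` are hypothetical, and the four-part layout is an assumption inferred from these examples only.

```python
import re
from typing import NamedTuple, Optional


class SeriesKey(NamedTuple):
    source: str        # e.g. 'FPL'
    counterpart: str   # e.g. 'ALL' or another authority such as 'BPAT'
    kind: str          # type code, e.g. 'D' or 'ID'
    freq: str          # frequency code, e.g. 'H' or 'HL'


# Assumed layout: EBA.<source>-<counterpart>.<kind>.<freq>
_SERIES_RE = re.compile(
    r"^EBA\.(?P<source>[^-.]+)-(?P<counterpart>[^.]+)\.(?P<kind>[^.]+)\.(?P<freq>[^.]+)$"
)


def parse_series_id(series_id: str) -> Optional[SeriesKey]:
    """Split a series ID into its parts, or return None if the layout differs."""
    m = _SERIES_RE.match(series_id)
    return SeriesKey(**m.groupdict()) if m else None


key = parse_series_id("EBA.FPL-ALL.D.H")
```

Returning `None` for unrecognized layouts, rather than raising, makes it easy to filter a mixed list of IDs.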

In [41]:
# build up query against DB to find relevant series
rv = eba_db.datasets.find_one(
    filter={"series_id": {"$regex": "FPL"}},
    projection=['series_id', 'name', 'description', 'start', 'end'],
)

In [42]:
rv

{'_id': ObjectId('609b7109392e2e93643d3976'),
 'series_id': 'EBA.FPL-ALL.D.H',
 'name': 'Demand for Florida Power & Light Co. (FPL), hourly - UTC time',
 'description': 'Timestamps follow the ISO8601 standard (https://en.wikipedia.org/wiki/ISO_8601). Hourly representations are provided in Universal Time.',
 'start': '20150701T05Z',
 'end': '20200219T03Z'}
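
The 'start' and 'end' fields for UTC series use a compact form such as '20150701T05Z'. The sketch below parses that form with the standard library and filters points to a time window. The helper names are hypothetical, and the layout of `points` as [timestamp, value] pairs is an assumption about the 'data' field, with made-up values. Local-time stamps such as '20150722T01-07' are not handled here.

```python
from datetime import datetime, timezone

# Compact UTC hour format, as in '20150701T05Z'.
EBA_UTC_FORMAT = "%Y%m%dT%HZ"


def parse_eba_utc(stamp):
    """Parse a compact UTC hour stamp into a timezone-aware datetime."""
    return datetime.strptime(stamp, EBA_UTC_FORMAT).replace(tzinfo=timezone.utc)


def in_window(points, start, end):
    """Keep (stamp, value) pairs whose time satisfies start <= time < end."""
    return [(s, v) for s, v in points if start <= parse_eba_utc(s) < end]


start = parse_eba_utc("20150701T05Z")
end = parse_eba_utc("20150701T07Z")

# Made-up points in the assumed [timestamp, value] layout.
points = [["20150701T05Z", 1.0], ["20150701T06Z", 2.0], ["20150701T07Z", 3.0]]
subset = in_window(points, start, end)
```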

In [47]:
def get_descriptions(name_contains):
    """Return metadata for every series whose series_id matches the regex name_contains."""
    qs_filter = {"series_id": {"$regex": str(name_contains)}}
    count = eba_db.datasets.count_documents(qs_filter)
    print(f'{count} documents with {name_contains}')
    # Project only the metadata fields; the 'data' array is left out.
    qr = eba_db.datasets.find(qs_filter, projection=['series_id', 'name', 'description', 'start', 'end'])
    return list(qr)

In [48]:
get_descriptions('PGE')

26 documents with PGE


[{'_id': ObjectId('609b710d392e2e93643d39aa'),
  'series_id': 'EBA.PGE-ALL.D.H',
  'name': 'Demand for Portland General Electric Company (PGE), hourly - UTC time',
  'description': 'Timestamps follow the ISO8601 standard (https://en.wikipedia.org/wiki/ISO_8601). Hourly representations are provided in Universal Time.',
  'start': '20150722T08Z',
  'end': '20200219T03Z'},
 {'_id': ObjectId('609b710f392e2e93643d39d5'),
  'series_id': 'EBA.PGE-ALL.D.HL',
  'name': 'Demand for Portland General Electric Company (PGE), hourly - local time',
  'description': 'Timestamps follow the ISO8601 standard (https://en.wikipedia.org/wiki/ISO_8601). Hourly representations are provided in local time for the balancing authority or region.',
  'start': '20150722T01-07',
  'end': '20200218T19-08'},
 {'_id': ObjectId('609b711c392e2e93643d3a9f'),
  'series_id': 'EBA.PGE-BPAT.ID.H',
  'name': 'Actual Net Interchange for Portland General Electric Company (PGE) to Bonneville Power Administration (BPAT), hourly - 

In [45]:
#sub_list = [o for o in out_list if o.get('data')]
#df = pd.DataFrame([o['data'] for o in sub_list], columns=[o['name'] for o in sub_list])