<a href="https://colab.research.google.com/github/gschivley/FERC_714/blob/master/FERC714_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FERC 714 hourly demand data

This notebook extracts a couple years of hourly demand data from FERC 714 and starts exploring ways to match the FERC respondents to EIA utility/balancing authority entities in EIA-861. My goal is to match the FERC respondents with IPM regions. It's a working document that I'm sharing with the hope that other people will be able to check and improve on what I've done. If you have questions or want to suggest changes/additional data you can leave a comment in [the gist](https://gist.github.com/gschivley/09257d239a88fcbd8981ca5e0589321e), find on on twitter ([@gschivley](https://twitter.com/gschivley)), or email me at *greg at carbonimpact dot co*.

In [0]:
!wget https://raw.githubusercontent.com/gschivley/EIA_Cleaned_Hourly_Electricity_Demand_Code/master/anomaly_screening.py

--2020-03-25 19:19:34--  https://raw.githubusercontent.com/gschivley/EIA_Cleaned_Hourly_Electricity_Demand_Code/master/anomaly_screening.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19011 (19K) [text/plain]
Saving to: ‘anomaly_screening.py’


2020-03-25 19:19:34 (1.52 MB/s) - ‘anomaly_screening.py’ saved [19011/19011]



In [0]:
from itertools import combinations, chain
import pandas as pd
import numpy as np
from pathlib import Path
import zipfile
import urllib
from joblib import Parallel, delayed

from anomaly_screening import screen_timeseries, make_anomaly_summary

cwd = Path.cwd()
pd.set_option("display.max_columns", 100)

In [0]:
# Download the FERC 714 data to a temp folder that google is nice enough to host.
url = 'https://www.ferc.gov/docs-filing/forms/form-714/data/form714-database.zip'
save_folder = cwd / "FERC"
save_folder.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, save_folder / 'form714-database.zip')
### Unzip it
data_path = save_folder / "form714-database"
with zipfile.ZipFile(save_folder / 'form714-database.zip', 'r') as zfile:
    zfile.extractall(data_path)

In [0]:
df = pd.read_csv(
    data_path / "Part 3 Schedule 2 - Planning Area Hourly Demand.csv",
    parse_dates=["plan_date"], infer_datetime_format=True
)
df.head()

Unnamed: 0,respondent_id,report_yr,report_prd,spplmnt_num,row_num,plan_date,timezone,hour01,hour02,hour03,hour04,hour05,hour06,hour07,hour08,hour09,hour10,hour11,hour12,hour13,hour14,hour15,hour16,hour17,hour18,hour19,hour20,hour21,hour22,hour23,hour24,hour25,timezone_f,hour01_f,hour02_f,hour03_f,hour04_f,hour05_f,hour06_f,hour07_f,hour08_f,hour09_f,hour10_f,hour11_f,hour12_f,hour13_f,hour14_f,hour15_f,hour16_f,hour17_f,hour18_f,hour19_f,hour20_f,hour21_f,hour22_f,hour23_f,hour24_f,hour25_f
0,2,2006,12,0,100,2006-01-01,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,2006,12,0,200,2006-01-02,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,2006,12,0,300,2006-01-03,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2,2006,12,0,400,2006-01-04,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2,2006,12,0,500,2006-01-05,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
respondents = df.loc[df.report_yr >=2011,"respondent_id"].unique()

respondents

array([101, 102, 110, 115, 116, 118, 119, 121, 122, 124, 125, 128, 133,
       135, 138, 139, 140, 141, 142, 143, 150, 151, 156, 157, 159, 160,
       161, 162, 163, 164, 165, 166, 169, 171, 172, 173, 174, 177, 178,
       180, 182, 183, 185, 186, 187, 190, 191, 193, 194, 195, 197, 199,
       200, 201, 206, 209, 210, 211, 212, 213, 214, 217, 218, 219, 220,
       221, 223, 225, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
       240, 241, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 257,
       259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 271, 272, 273,
       274, 275, 277, 280, 282, 283, 284, 285, 287, 289, 292, 296, 297,
       298, 299, 307, 308, 311, 321, 328, 329])

In [0]:
def hour_25_looks_real(df):
    """
    The FERC demand data has a column hour25 for daylight savings. Determine if
    the value there looks like it could be a continuation of the series.
    Sometimes it seems to be a sum of all values for the day or something else
    weird.
    """
    df["hour_25_valid"] = False
    df["hour24_25_ratio"] = df["hour24"] / df["hour25"]
    df.loc[
        (df["hour24_25_ratio"] > 0.6)
        & (df["hour24_25_ratio"] < 1.5),
        "hour_25_valid"
    ] = True
    
    df.loc[df["hour_25_valid"] == False, "hour25"] = np.nan
    
    return df

In [0]:
# Only keeping 2011/2012 for now. Explore more of the data if you want!
corrected_df = hour_25_looks_real(
    df.loc[(df.report_yr.isin([2011, 2012])), :].copy()
)

In [0]:
# Check to see how many valid hour25 values there are for each respondent.
# Looks like 311 uses hour25 all the time...
hour_25_count = {}
for r in corrected_df.dropna()["respondent_id"].unique():
    hour_25_count[r] = len(corrected_df.query("respondent_id==@r").dropna())

hour_25_count

{121: 2,
 139: 2,
 141: 1,
 163: 1,
 166: 1,
 169: 1,
 195: 2,
 212: 2,
 225: 2,
 235: 2,
 237: 1,
 241: 2,
 247: 2,
 259: 2,
 275: 1,
 308: 1,
 311: 365}

In [0]:
# hour25 values do appear to be on dayslight savings change
corrected_df.query("hour25 > 0 & respondent_id != 311").sort_values("report_yr")

Unnamed: 0,respondent_id,report_yr,report_prd,spplmnt_num,row_num,plan_date,timezone,hour01,hour02,hour03,hour04,hour05,hour06,hour07,hour08,hour09,hour10,hour11,hour12,hour13,hour14,hour15,hour16,hour17,hour18,hour19,hour20,hour21,hour22,hour23,hour24,hour25,timezone_f,hour01_f,hour02_f,hour03_f,hour04_f,hour05_f,hour06_f,hour07_f,hour08_f,hour09_f,hour10_f,hour11_f,hour12_f,hour13_f,hour14_f,hour15_f,hour16_f,hour17_f,hour18_f,hour19_f,hour20_f,hour21_f,hour22_f,hour23_f,hour24_f,hour25_f,hour_25_valid,hour24_25_ratio
44501,121,2011,12,0,31000,2011-11-06,mst,338.0,340.0,332.0,328.0,336.0,334.0,345.0,358.0,368.0,386.0,382.0,388.0,375.0,375.0,388.0,367.0,380.0,395.0,432.0,424.0,427.0,418.0,392.0,374.0,364.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.027473
446989,259,2011,12,0,31000,2011-11-06,CDT,2866.0,2822.0,2811.0,2795.0,2797.0,2844.0,2879.0,2962.0,3042.0,3104.0,3102.0,3085.0,3058.0,3056.0,3047.0,3043.0,3023.0,3048.0,3085.0,3220.0,3218.0,3178.0,3120.0,3024.0,2920.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.035616
410830,247,2011,12,0,31000,2011-11-06,PST,1060.0,1028.0,945.0,965.0,1019.0,1076.0,1149.0,1221.0,1299.0,1323.0,1305.0,1283.0,1242.0,1217.0,1216.0,1225.0,1303.0,1410.0,1404.0,1383.0,1342.0,1267.0,1158.0,1072.0,981.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.092762
387820,241,2011,12,0,31000,2011-11-06,EST,685.0,658.0,633.0,630.0,640.0,662.0,693.0,720.0,749.0,768.0,772.0,770.0,772.0,760.0,762.0,762.0,797.0,890.0,893.0,872.0,838.0,791.0,730.0,683.0,645.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.058915
359332,235,2011,12,0,31000,2011-11-06,MST,4045.0,3915.0,3856.0,3869.0,3881.0,3920.0,4061.0,4200.0,4300.0,4490.0,4544.0,4523.0,4491.0,4504.0,4471.0,4471.0,4512.0,4724.0,5269.0,5337.0,5249.0,5070.0,4781.0,4415.0,4110.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.074209
505792,275,2011,12,0,31000,2011-11-06,CPT,91.0,86.0,84.0,84.0,83.0,86.0,84.0,91.0,98.0,99.0,97.0,97.0,89.0,94.0,96.0,95.0,96.0,94.0,98.0,110.0,108.0,105.0,105.0,102.0,99.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.030303
284094,212,2011,12,0,31000,2011-11-06,EST,1528.0,1487.0,1421.0,1420.0,1448.0,1506.0,1599.0,1684.0,1750.0,1760.0,1735.0,1704.0,1669.0,1642.0,1613.0,1632.0,1735.0,1956.0,1965.0,1919.0,1850.0,1716.0,1586.0,1502.0,1446.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.038728
238805,195,2011,12,0,31000,2011-11-06,CST,74.8,70.2,68.2,67.0,65.9,69.4,72.4,72.4,76.2,81.7,83.8,88.0,90.3,92.1,95.6,95.5,97.1,96.6,103.0,109.0,106.9,101.7,93.6,83.7,76.4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.09555
327192,225,2011,12,0,31000,2011-11-06,EST,446.0,405.0,399.0,395.0,389.0,391.0,408.0,429.0,441.0,479.0,517.0,540.0,544.0,552.0,546.0,552.0,553.0,554.0,566.0,599.0,592.0,571.0,535.0,503.0,468.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.074786
146765,163,2011,12,0,31000,2011-11-06,CPT,417.0,400.0,384.0,388.0,385.0,387.0,399.0,419.0,443.0,484.0,489.0,479.0,471.0,469.0,474.0,460.0,463.0,486.0,513.0,527.0,520.0,504.0,474.0,438.0,408.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,1.073529


In [0]:
# Same for a 0 value in hour02, which happens sometimes instead of skipping the hour
corrected_df.query("hour02 == 0 & hour01 != 0").sort_values("report_yr")

Unnamed: 0,respondent_id,report_yr,report_prd,spplmnt_num,row_num,plan_date,timezone,hour01,hour02,hour03,hour04,hour05,hour06,hour07,hour08,hour09,hour10,hour11,hour12,hour13,hour14,hour15,hour16,hour17,hour18,hour19,hour20,hour21,hour22,hour23,hour24,hour25,timezone_f,hour01_f,hour02_f,hour03_f,hour04_f,hour05_f,hour06_f,hour07_f,hour08_f,hour09_f,hour10_f,hour11_f,hour12_f,hour13_f,hour14_f,hour15_f,hour16_f,hour17_f,hour18_f,hour19_f,hour20_f,hour21_f,hour22_f,hour23_f,hour24_f,hour25_f,hour_25_valid,hour24_25_ratio
63985,128,2011,12,0,7200,2011-03-13,EDT,1300.0,0.0,1275.0,1283.0,1317.0,1383.0,1488.0,1641.0,1793.0,1750.0,1590.0,1472.0,1431.0,1403.0,1371.0,1363.0,1380.0,1416.0,1444.0,1551.0,1694.0,1612.0,1449.0,1272.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
545729,292,2011,12,0,7200,2011-03-13,,31.0,0.0,27.0,27.0,28.0,29.0,28.0,28.0,28.0,33.0,31.0,31.0,30.0,31.0,32.0,31.0,31.0,32.0,31.0,37.0,39.0,35.0,34.0,32.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
543538,289,2011,12,0,7200,2011-03-13,,82.0,0.0,82.0,75.0,80.0,76.0,81.0,80.0,84.0,87.0,91.0,91.0,93.0,97.0,96.0,93.0,95.0,98.0,99.0,108.0,105.0,105.0,98.0,89.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
525643,283,2011,12,0,7200,2011-03-13,CDT,132.0,0.0,127.0,125.0,124.0,126.0,129.0,131.0,136.0,142.0,147.0,146.0,149.0,148.0,145.0,144.0,146.0,148.0,150.0,162.0,169.0,162.0,152.0,141.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
505554,275,2011,12,0,7200,2011-03-13,CPT,99.0,0.0,97.0,94.0,94.0,93.0,106.0,109.0,105.0,104.0,111.0,106.0,107.0,99.0,94.0,91.0,93.0,92.0,101.0,101.0,105.0,112.0,106.0,95.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
485101,268,2011,12,0,7200,2011-03-13,EST,526.0,0.0,496.0,478.0,471.0,475.0,491.0,499.0,530.0,569.0,594.0,604.0,607.0,604.0,596.0,593.0,596.0,608.0,639.0,690.0,688.0,660.0,612.0,572.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
449879,260,2011,12,0,27800,2011-10-05,CST,51.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,
449761,260,2011,12,0,16000,2011-06-09,CST,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,
446751,259,2011,12,0,7200,2011-03-13,CDT,3096.0,0.0,3079.0,3065.0,3063.0,3104.0,3152.0,3234.0,3258.0,3284.0,3290.0,3263.0,3250.0,3221.0,3227.0,3200.0,3193.0,3191.0,3192.0,3194.0,3313.0,3323.0,3286.0,3177.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf
392330,243,2011,12,0,7200,2011-03-13,PST,962.0,0.0,915.0,899.0,891.0,913.0,945.0,988.0,1027.0,1102.0,1151.0,1167.0,1180.0,1177.0,1154.0,1154.0,1169.0,1231.0,1280.0,1342.0,1346.0,1282.0,1176.0,1070.0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False,inf


In [0]:
def timezone_to_tz(timezone):
    return 'Etc/GMT{:+}'.format(-timezone)

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

In [0]:
hourcols = ['hour{:02.0f}'.format(i) for i in range(1,26)]

# These are my best guesses for all of the timezone values that BAs listed
tz_offset = {
    '1  ': -5,
    "   ": -7,
    "AKS": -9,
    "CST": -6,
    "CPT": -6,
    "MST": -7,
    "PST": -8,
    "PDT": -8,
    "mst": -7,
    "EST": -5,
    "EDT": -5,
    " CS": -6,
    "HST": -10,
    "Est": -5,
    "PPT": -8,
    "MPT": -7,
    "EPT": -5,  
    "MPP": -7,
}
tz_ba = {key: timezone_to_tz(offset) for key, offset in tz_offset.items()}
error_dfs = {}
good_dfs = {}
years = [2011, 2012]
year_hours = {
    2011: 8760,
    2012: 8784
}

for r in respondents:
    # Not all respondents have data for all years
    r_all_years = corrected_df.loc[
            (corrected_df.respondent_id == r) & (corrected_df.report_yr.isin(years)),
            :
        ]
    # Only proceed if there is positive demand over all years (skip if all 0)
    if r_all_years[hourcols].sum().sum() > 0:
        valid_years = sorted(r_all_years["report_yr"].unique().tolist())
        tz = r_all_years["timezone"].values[0]
        dt = pd.date_range(
            f"{valid_years[0]}-01-01", 
            f"{valid_years[-1] + 1}-01-01",
            freq="H",
            closed="left",
            tz=tz_ba[tz]
        )

        df_list = []
        for year in valid_years:
            r_single_year = r_all_years.loc[r_all_years.report_yr == year, :]

            # Try to drop March DST changeover values if there are more valid hours
            # (not nan) than hours in the year.
            # I found that hour02, hour03, and hour24 all had 0 values for at
            # least one respondent.
            # Set the 0 values to np.nan so they can be dropped after melting
            if r_single_year[hourcols].count().sum() > year_hours[year]:
                r_single_year.loc[
                    (r_single_year["hour02"] == 0)
                    & (r_single_year["hour01"] != 0)
                    & (r_single_year["plan_date"].dt.month == 3),
                    "hour02"
                ] = np.nan
                r_single_year.loc[
                    (r_single_year["hour03"] == 0)
                    & (r_single_year["hour01"] != 0)
                    & (r_single_year["plan_date"].dt.month == 3),
                    "hour03"
                ] = np.nan
                r_single_year.loc[
                    (r_single_year["hour24"] == 0)
                    & (r_single_year["hour01"] != 0)
                    & (r_single_year["plan_date"].dt.month == 3),
                    "hour24"
                ] = np.nan
            
            tidy_df = pd.melt(r_single_year, id_vars='plan_date', value_vars=hourcols, 
                 var_name='hour', value_name='demand_MW')
            tidy_df = tidy_df.sort_values(["plan_date", "hour"])

            tidy_df = tidy_df.dropna()
            tidy_df["hour"] = tidy_df["hour"].str[-2:].astype(int)
            tidy_df["respondent_id"] = r

            df_list.append(tidy_df)

        # Concat the years together
        r_df = pd.concat(df_list)
        r_df = r_df.reset_index(drop=True)

        # If the length of one year, all years, or some combination of years
        if len(r_df) in [sum(x) for x in powerset(year_hours.values())]:

            r_df["date_time"] = dt
            columns = ["date_time", "demand_MW", "respondent_id"]
            r_df = r_df.loc[:, columns]
            good_dfs[r] = r_df

        else:
            print(r, len(r_df))
            error_dfs[r] = r_df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


225 17545
311 9149


From the code above it looks like ~120 BAs have data clean enough that I'm able to extract the correct number of hours. 

\#225 has 2 days with demand in hour25 (daylight savings) but only one of the years has an hour in March with 0 demand. I suppose I could just remove an hour from that day?

\# 311 has demand in hour25 every day (or just about). Interestingly, hour01 and hour02 seem to always have the same values. Maybe just drop hour01 and shift everything over?

## Check the demand data for anomalies

The anomaly checking functions and parameter values below are all from [a notebook](https://github.com/truggles/EIA_Cleaned_Hourly_Electricity_Demand_Code) by Tyler Ruggles, which he developed to screen [hourly demand data from EIA-931](https://www.eia.gov/realtime_grid/#/status?end=20200325T07). I've modified some functions to speed them up. EIA's hourly data only goes back to mid-2015, which is why I'm using FERC 714. Fortunately it looks like the FERC data — once properly extracted — doesn't have many of the anomalies that Tyler found in the EIA data. Tyler also [developed a method](https://github.com/truggles/EIA_Cleaned_Hourly_Electricity_Demand_Code/blob/master/MICE_step.Rmd) to impute missing or anomalous data but at first glance it looks like the FERC demand data can largely be used as-is without too many issues. Let me know if you disagree!

In [0]:
short_hour_window = 24 # 48 hour moving median (M_{t,48hr})
iqr_hours = 24*5 # width in hours of IQR values of relative deviations from diurnal cycle template (IQR_{dem,t})
nDays = 10 # Used for normalized hourly demand template (h_{t,diurnal}) and 480 hour moving median (M_{t,480hr})
global_dem_cut = 10 # threshold selection for global demand filter
local_dem_cut_up = 3.5 # upwards threshold for local demand filter
local_dem_cut_down = 2.5 # downwards threshold for local demand filter
delta_multiplier = 2 # selection threshold for double-sided delta filter
delta_single_multiplier = 5 # selection threshold for single-sided delta filter
rel_multiplier = 15 # other selection threshold for single-sided delta filter
anomalous_regions_width = 24 # width in hours of anomalous region filter
anomalous_pct = .85 # required pct of good data in anomalous region filter

In [0]:
name_df_list = Parallel(n_jobs=-1, verbose=10)(delayed(screen_timeseries)(
    name=name, 
    df=df, 
    short_hour_window=short_hour_window,
    iqr_hours=iqr_hours,
    nDays=nDays,
    global_dem_cut=global_dem_cut,
    local_dem_cut_up=local_dem_cut_up,
    local_dem_cut_down=local_dem_cut_down,
    delta_multiplier=delta_multiplier,
    delta_single_multiplier=delta_single_multiplier,
    rel_multiplier=rel_multiplier,
    anomalous_regions_width=rel_multiplier,
    anomalous_pct=anomalous_pct
    ) for name, df in good_dfs.items()
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   31.7s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   49.5s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  57 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  81 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done  94 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 109 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  6.5min finished


In [0]:
anomaly_dict = {name: df for name, df in name_df_list}
summary_df = make_anomaly_summary(anomaly_dict)
summary_df

Unnamed: 0,name,median_demand,OKAY,MISSING,NEG_OR_ZERO,GLOBAL_DEM,GLOBAL_DEM_PLUS_MINUS,LOCAL_DEM_UP,LOCAL_DEM_DOWN,DELTA,SINGLE_DELTA,IDENTICAL_RUN,ANOMALOUS_REGION,pct_OKAY
0,101,879.07,17538,0,1,0,0,0,1,4,0,0,0,0.999658
1,102,6753.00,17543,0,0,0,0,0,0,1,0,0,0,0.999943
2,110,4747.00,17544,0,0,0,0,0,0,0,0,0,0,1.000000
3,115,78.00,14613,0,0,0,0,0,2,0,0,897,2032,0.832934
4,116,3286.00,17544,0,0,0,0,0,0,0,0,0,0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,298,1261.00,17543,0,0,0,0,0,0,1,0,0,0,0.999943
116,299,348.00,17518,0,0,0,0,0,0,0,0,26,0,0.998518
117,307,6728.60,17544,0,0,0,0,0,0,0,0,0,0,1.000000
118,308,62.00,17371,0,1,0,0,0,0,1,0,171,0,0.990139


In [0]:
summary_df.describe()

Unnamed: 0,name,median_demand,OKAY,MISSING,NEG_OR_ZERO,GLOBAL_DEM,GLOBAL_DEM_PLUS_MINUS,LOCAL_DEM_UP,LOCAL_DEM_DOWN,DELTA,SINGLE_DELTA,IDENTICAL_RUN,ANOMALOUS_REGION,pct_OKAY
count,120.0,118.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0
mean,210.433333,4074.016356,16993.791667,0.0,50.033333,0.05,0.0,4.191667,16.291667,4.891667,2.25,303.15,96.15,0.972809
std,55.879633,10646.382812,2758.183088,0.0,445.498568,0.547723,0.0,24.807933,163.768867,21.980815,14.088705,1965.903862,855.626338,0.151178
min,101.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,164.75,350.25,17510.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.998176
50%,213.5,1136.5,17537.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.999629
75%,257.5,2765.0,17542.25,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,26.0,0.0,0.9999
max,321.0,85017.5,17544.0,0.0,4750.0,6.0,0.0,225.0,1794.0,224.0,142.0,17288.0,9169.0,1.0


## Matching FERC respondents to EIA-816 utilities

The SI of [Auffhammer et al](https://www.pnas.org/content/pnas/suppl/2017/02/01/1613193114.DCSupplemental/pnas.1613193114.sapp.pdf) describes how they derived geographic coverage of FERC respondents using EIA-861 and the crosswalk between FERC respondents and EIA utilities/BAs. I'm including a little code to start exploring that here.

This is very exploratory! I'm trying to match codes between FERC and 861 but haven't figured out yet if they should be matched to EIA BAs or Utilities. At a minimum it looks like there are 84 FERC respondents that don't have a match in either category of the 2012 861.

A few extra resources that might be helpful:
- SPP [historical hourly load](https://marketplace.spp.org/pages/hourly-load#) back through 2011. The company acronyms (at least for 2011/2012 data) are described in [this document](https://www.nerc.com/pa/rrm/Resources/Monitoring_and_Situational_Awareness_Conference2/2%20Testing%20Your%20Sensitivity%20To%20Loss%20of%20Data_T%20Miller.pdf).
- MISO has [archivled historical hourly load](https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/market-report-archives/#nt=%2FMarketReportType%3ASummary%2FMarketReportName%3AArchived%20Historical%20Regional%20Forecast%20and%20Actual%20Load%20%20(zip)&t=10&p=0&s=MarketReportPublished&sd=desc) for 2007-2014 and [historical hourly load](https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/#nt=%2FMarketReportType%3ASummary%2FMarketReportName%3AHistorical%20Regional%20Forecast%20and%20Actual%20Load%20(xls)&t=10&p=0&s=MarketReportPublished&sd=desc) back through 2013 for Central, East, and West regions.
- ERCOT has [historical hourly load](http://www.ercot.com/gridinfo/load/load_hist/) in each of the 8 weather zones. (NOTE: the IPM region ERC_PHDL has no demand, so all weather zones are split into ERC_WEST and ERC_REST)
- PJM requires a login and API key to access their [historical metered data](https://dataminer2.pjm.com/feed/hrl_load_metered/definition) through dataminer (back to 1993).
- NYISO has hourly load by zone (which look like they group well into IPM regions) back through 2001. I need to figure out the difference between [real time actual load](http://mis.nyiso.com/public/P-58Blist.htm) and [integrated real time actual load](http://mis.nyiso.com/public/P-58Clist.htm).
- ISONE has [hourly load by zone](https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info) back through 2011. Unfortunately, from a quick glance at 2011 it looks like while there are only 8760 hourly demand values the 2am on March 13 (DST) has demand of 0. Oh, but the 2am value on Nov. 6 is twice the surrounding hours, so they appear to add two hours together on one line.... (Need to remember this as something other places might do)

In [0]:
# Download the 2012 EIA-861 data to a temp folder
url = 'https://www.eia.gov/electricity/data/eia861/archive/zip/f8612012.zip'
save_folder = cwd / "EIA861"
save_folder.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, save_folder / 'f8612012.zip')
### Unzip it
data_path = save_folder / "f8612012"
with zipfile.ZipFile(save_folder / 'f8612012.zip', 'r') as zfile:
    zfile.extractall(data_path)

In [0]:
ferc_eia_map = pd.read_csv(cwd / "FERC" / "form714-database" / "Respondent IDs.csv", index_col=0)

# Only keep the respondents that are in the years of data we're looking at
ferc_eia_map = ferc_eia_map.loc[respondents, :]
ferc_eia_map.head()

Unnamed: 0_level_0,respondent_name,eia_code
respondent_id,Unnamed: 1_level_1,Unnamed: 2_level_1
101,PowerSouth Energy Cooperative (Alabama Electri...,189
102,Alabama Power Company ...,195
110,"American Electric Power Company, Inc. ...",829
115,"Arizona Electric Power Cooperative, Inc. ...",796
116,Arizona Public Service Company ...,803


In [0]:
# 307 is PacifiCorp - Part II Sch 2 (East & West combined), which looks to be
# EIA utility number 14354
summary_df["eia_code"] = summary_df["name"].map(ferc_eia_map["eia_code"])
summary_df.query("~(eia_code > 0)")

In [0]:
eia8612012_territory = pd.read_excel(cwd / "EIA861" / "f8612012" / "service_territory_2012.xls")
eia8612012_territory.head()

Unnamed: 0,Data Year,Utility Number,Utility Name,State,County
0,2012,34,City of Abbeville - (SC),SC,Abbeville
1,2012,55,City of Aberdeen - (MS),MS,Monroe
2,2012,59,City of Abbeville - (LA),LA,Vermilion
3,2012,84,A & N Electric Coop,MD,Somerset
4,2012,84,A & N Electric Coop,VA,Accomack


In [0]:
# Document which ferc respondents can match with a utility. These utilities have
# a list of all the counties they are active in.
ferc_eia_map["utility_match"] = False
ferc_eia_map.loc[
    ferc_eia_map.eia_code.isin(eia8612012_territory['Utility Number']),
    "utility_match"
] = True

In [0]:
eia8612012_bas = pd.read_excel(cwd / "EIA861" / "f8612012" / "balancing_authority_2012.xls")
eia8612012_bas.head()

Unnamed: 0,Data Year,Utility Number,Utility Name,BA Code,Balancing Authority Name
0,2012,34,City of Abbeville - (SC),5416,Duke Energy Carolinas
1,2012,55,City of Aberdeen - (MS),18642,Tennessee Valley Authority
2,2012,59,City of Abbeville - (LA),3265,"Cleco Corporation, Inc."
3,2012,84,A & N Electric Coop,14725,PJM Interconnection
4,2012,87,City of Ada,56669,Midwest Independent System Operator


In [0]:
ferc_eia_map["ba_match"] = False
ferc_eia_map.loc[
    ferc_eia_map.eia_code.isin(eia8612012_bas['BA Code']),
    "ba_match"
] = True

## Examine which entities we can match to BAs and utilities

It looks like some of the respondents are only BAs, some are only utilities, and some are both. Will need to figure out if any of the utilities that are not a BA are within a BA territory - are there cases where demand from one respondent is a subset of another respondent?

In [0]:
ferc_eia_map.query("ba_match==True")

Unnamed: 0_level_0,respondent_name,eia_code,utility_match,ba_match
respondent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
101,PowerSouth Energy Cooperative (Alabama Electri...,189,False,True
116,Arizona Public Service Company ...,803,True,True
118,"Associated Electric Cooperative, Inc. ...",924,False,True
119,Avista Corporation ...,20169,True,True
122,"Bonneville Power Administration, USDOE ...",1738,False,True
...,...,...,...,...
273,Western Area Power Administration - Colorado-M...,28503,False,True
274,Western Area Power Administration - Lower Colo...,19610,False,True
275,Western Area Power Administration - Upper Miss...,25471,False,True
277,Western Farmers Electric Cooperative ...,20447,False,True


In [0]:
ferc_eia_map.query("utility_match==False & ba_match==False")

Unnamed: 0_level_0,respondent_name,eia_code,utility_match,ba_match
respondent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
110,"American Electric Power Company, Inc. ...",829,False,False
115,"Arizona Electric Power Cooperative, Inc. ...",796,False,False
124,"Buckeye Power, Inc. ...",7004,False,False
128,"Central Electric Power Cooperative, Inc. ...",40218,False,False
169,Florida Municipal Power Agency ...,6567,False,False
173,"Golden Spread Electric Cooperative, Inc. ...",7349,False,False
183,Indiana Municipal Power Agency ...,9234,False,False
190,Westar Energy (KPL) ...,10015,False,False
199,Massachusetts Municipal Wholesale ...,11806,False,False
201,Metropolitan Water District of Southern Califo...,12397,False,False


In [0]:
# Some entities are both a BA and a utility
ferc_eia_map.query("utility_match==True & ba_match==True")

Unnamed: 0_level_0,respondent_name,eia_code,utility_match,ba_match
respondent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
116,Arizona Public Service Company ...,803,True,True
119,Avista Corporation ...,20169,True,True
133,"Chugach Electric Association, Inc. ...",3522,True,True
138,City of Lafayette Utilities System ...,9096,True,True
139,"City of Tacoma, Dept. of Public Utilities ...",18429,True,True
140,City of Tallahassee ...,18445,True,True
142,Cleco Corporation ...,3265,True,True
156,City of North Little Rock ...,13718,True,True
157,"Duke Energy Carolinas, LLC ...",5416,True,True
159,East Kentucky Power Cooperative ...,5580,True,True


In [0]:
ferc_eia_map.query("utility_match==True & ba_match==False")

Unnamed: 0_level_0,respondent_name,eia_code,utility_match,ba_match
respondent_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
102,Alabama Power Company ...,195,True,False
121,Black Hills Corporation ...,19545,True,False
125,California Independent System Operator ...,229,True,False
135,City of Burbank ...,2507,True,False
141,"City Utilities of Springfield, MO ...",17833,True,False
143,Colorado Springs Utilities ...,3989,True,False
150,"Dayton Power & Light Company, The ...",4922,True,False
151,Decatur Utilities ...,4958,True,False
162,Electric Power Board of Chattanooga ...,3408,True,False
166,Eugene Water & Electric Board ...,6022,True,False
