# About this notebook

This notebook was used to prototype the data pipeline logic interactively. It is preserved for reference, but should not be expected to run as written, because the package implementation has evolved since the notebook was created. For a tutorial on how to run the pipeline or use the package methods, see {TODO}. For the data quality expectations development notebook, see `expectations.ipynb`.

In [1]:
import os
import re

import great_expectations as ge
import numpy as np
import pandas as pd

from typing import Tuple

import rad_pipeline.rad_pipeline as rp
import rad_pipeline.zipcodes as zc

In [2]:
import importlib
importlib.reload(rp)
importlib.reload(zc)

<module 'rad_pipeline.zipcodes' from '/Users/alexhasha/repos/massenergize/rad_pipeline/rad_pipeline/zipcodes.py'>

### Output data structure

The datasets provide different fields, so some will support richer metrics than others, and we may want to present data aggregated at multiple levels. I therefore propose the following output data structure:

- locale: str (e.g. "02186" or "Milton" or "Norfolk County")
- zipcodes: List[str], list of zipcodes contained in the locale
- technology: str (e.g. "ASHP" or "Solar Panels")
- sector: str (e.g. "Residential", "Commercial", "Municipal", "Industrial", etc.)
- metric_name: str, The name of the metric (e.g. "Total Cost" or "Percent income support")
- value: decimal (metrics of non-numeric type are conceivable, but can be dealt with separately)
- value_unit: str (e.g., Dollars, kWh, Count, BTU/h, etc)
- start_date: datetime, beginning of time period of aggregation
- end_date: datetime, end of time period of aggregation
- (TODO) data_update_date: datetime, date of most recent update of data source
- (TODO) data_source_id: int (identifier of raw data source metric was calculated from)
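
As a rough illustration of the proposed structure, one row could be modeled as a dataclass. This is only a sketch: the class name `MetricRecord` is hypothetical, the two TODO fields are omitted, and the example values are made up rather than taken from any dataset.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import List


@dataclass
class MetricRecord:
    """One row of the proposed output structure (hypothetical class name)."""
    locale: str            # e.g. "02186", "Milton" or "Norfolk County"
    zipcodes: List[str]    # zipcodes contained in the locale
    technology: str        # e.g. "ASHP" or "Solar Panels"
    sector: str            # e.g. "Residential"
    metric_name: str       # e.g. "Total Cost"
    value: Decimal
    value_unit: str        # e.g. "Dollars", "kWh", "Count"
    start_date: datetime   # beginning of the aggregation period
    end_date: datetime     # end of the aggregation period


# Made-up values, for illustration only
example = MetricRecord(
    locale="Milton",
    zipcodes=["02186"],
    technology="Solar Panels",
    sector="Residential",
    metric_name="Total Cost",
    value=Decimal("1000.00"),
    value_unit="Dollars",
    start_date=datetime(2020, 1, 1),
    end_date=datetime(2020, 12, 31),
)
```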


**Aggregate quantities of interest at the town and zipcode level**

* Quantity
* Total Rebates
* Average Rebate
* Total Cost
* Average Cost
* Installed Capacity (kW, Solar only)
* Quantity Income-Eligible (TODO)
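
Most of these quantities reduce to a count, sum or mean per locale. A minimal sketch on made-up data, using pandas named aggregation (the toy frame and the output column names are assumptions, not the pipeline's actual fields):

```python
import pandas as pd

# Toy records (made up); the real data comes from rp.clean_data_load
toy = pd.DataFrame({
    "town": ["A", "A", "B"],
    "rebate": [100.0, 300.0, 500.0],
    "cost": [1000.0, 3000.0, 5000.0],
})

# One row per town, one column per aggregate quantity
summary = toy.groupby("town").agg(
    quantity=("rebate", "count"),
    total_rebates=("rebate", "sum"),
    average_rebate=("rebate", "mean"),
    total_cost=("cost", "sum"),
    average_cost=("cost", "mean"),
)
```

This produces a wide table; the cells below instead build the long format described above, with one row per locale and metric.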

## Systematic Agg Dataset Build

In [66]:
def locale_aggregation(df_cleaned: pd.DataFrame, locale_field: str, source: str) -> Tuple[pd.core.groupby.generic.DataFrameGroupBy, pd.DataFrame]:
    """Group cleaned records by locale; return the groups and a per-locale base frame."""
    field_map = rp.FIELDS[source]
    groups = df_cleaned.groupby(locale_field)
    # Unique zipcodes observed in each locale
    zipcodes = (
        groups['zip_cleaned']
        .apply(lambda x: list(np.unique(x)))
        .rename_axis("locale")
    )
    # Earliest and latest record dates bound the aggregation period
    start_date = (
        groups[field_map['date']]
        .min()
        .rename_axis("locale")
        .rename("start_date")
    )
    end_date = (
        groups[field_map['date']]
        .max()
        .rename_axis("locale")
        .rename("end_date")
    )

    result = pd.DataFrame(data={
        "zipcodes": zipcodes,
        "start_date": start_date,
        "end_date": end_date,
    })

    return groups, result
    

    

In [67]:
# Locale-level aggregation
metric_groups = []

SECTOR_LOOKUP = {
    "Air-source Heat Pumps": "Residential",
    "Ground-source Heat Pumps": "Residential and Small Scale",
    "Solar Panels": "All",
    "EVs": "Consumer",
}

for source in ["Air-source Heat Pumps", "Ground-source Heat Pumps", "EVs", "Solar Panels"]:
    try:
        df_cleaned = rp.clean_data_load(source)
        print(f"Loaded {source}")
    except FileNotFoundError:
        print(f"Skipping {source}")
        continue
    
    for locale_field in ["town", "zip_cleaned"]:
        
        
        groups, locale_base = locale_aggregation(df_cleaned, locale_field, source)
        
        base_df = locale_base.copy()
        base_df["technology"] = source
        base_df["sector"] = SECTOR_LOOKUP[source]
             
        if "rebate" in df_cleaned.columns:
            # Quantity of Rebates
            metric_group = base_df.copy()
            metric_group["value_unit"] = "count"
            metric_group["metric_name"] = "Number of Rebates"
            metric_group["value"] = groups['rebate'].count()
            metric_groups.append(metric_group)

            # Dollar Total of Rebates
            metric_group = base_df.copy()
            metric_group["value_unit"] = "$USD"
            metric_group["metric_name"] = "Total Rebate Value"
            metric_group["value"] = groups['rebate'].sum()
            metric_groups.append(metric_group)

            # Dollar Average of Rebates
            metric_group = base_df.copy()
            metric_group["value_unit"] = "$USD"
            metric_group["metric_name"] = "Average Rebate Value"
            metric_group["value"] = groups['rebate'].mean()
            metric_groups.append(metric_group)
        
        if "cost" in df_cleaned.columns:
            # Dollar Total of Costs
            metric_group = base_df.copy()
            metric_group["value_unit"] = "$USD"
            metric_group["metric_name"] = "Total Cost"
            metric_group["value"] = groups['cost'].sum()
            metric_groups.append(metric_group)

            # Dollar Average of Costs
            metric_group = base_df.copy()
            metric_group["value_unit"] = "$USD"
            metric_group["metric_name"] = "Average Cost"
            metric_group["value"] = groups['cost'].mean()
            metric_groups.append(metric_group)
        
        if "capacity" in df_cleaned.columns: # Solar only
            
            # Quantity of Solar Panel facilities
            metric_group = base_df.copy()
            metric_group["value_unit"] = "count"
            metric_group["metric_name"] = "Number of generation facilities"
            metric_group["value"] = groups['capacity'].count()
            metric_groups.append(metric_group)
            
            # Total Panel Power Capacity
            metric_group = base_df.copy()
            metric_group["value_unit"] = "kW"
            metric_group["metric_name"] = "Total Generation Capacity"
            metric_group["value"] = groups['capacity'].sum()
            metric_groups.append(metric_group)
            
            # Average Power Capacity
            metric_group = base_df.copy()
            metric_group["value_unit"] = "kW"
            metric_group["metric_name"] = "Average Generation Capacity"
            metric_group["value"] = groups['capacity'].mean()
            metric_groups.append(metric_group)


Loaded Air-source Heat Pumps
Loaded Ground-source Heat Pumps
Loaded EVs
Loaded Solar Panels
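
The metric blocks in the cell above all repeat the same four steps: copy the base frame, set the unit, set the name, and assign an aggregated value. A possible way to factor that out is sketched below. The helper `build_metric_groups` and the `METRIC_SPECS` table are hypothetical and not part of `rad_pipeline`; the toy data is made up.

```python
import pandas as pd

# (source column, aggregation, metric_name, value_unit), mirroring the cell above
METRIC_SPECS = [
    ("rebate", "count", "Number of Rebates", "count"),
    ("rebate", "sum", "Total Rebate Value", "$USD"),
    ("rebate", "mean", "Average Rebate Value", "$USD"),
    ("cost", "sum", "Total Cost", "$USD"),
    ("cost", "mean", "Average Cost", "$USD"),
    ("capacity", "count", "Number of generation facilities", "count"),
    ("capacity", "sum", "Total Generation Capacity", "kW"),
    ("capacity", "mean", "Average Generation Capacity", "kW"),
]


def build_metric_groups(groups, base_df, available_columns, specs=METRIC_SPECS):
    """Hypothetical helper: one long-format frame per applicable metric spec."""
    frames = []
    for column, agg, metric_name, value_unit in specs:
        if column not in available_columns:
            continue  # this source does not provide the field
        frame = base_df.copy()
        frame["value_unit"] = value_unit
        frame["metric_name"] = metric_name
        frame["value"] = groups[column].agg(agg)  # aligned on the group index
        frames.append(frame)
    return frames


# Toy data (made up) standing in for rp.clean_data_load(source)
toy = pd.DataFrame({"town": ["A", "A", "B"], "rebate": [100.0, 300.0, 500.0]})
toy_groups = toy.groupby("town")
toy_base = pd.DataFrame(index=toy_groups.size().index)
toy_base["technology"] = "Toy"
toy_frames = build_metric_groups(toy_groups, toy_base, toy.columns)
```

Because the toy data has only a `rebate` column, the cost and capacity specs are skipped, which is the same behavior as the `if ... in df_cleaned.columns` checks above.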


### Sector-level aggregation for Solar Panels

In [114]:
def locale_sector_aggregation(df_cleaned: pd.DataFrame, locale_field: str, source: str) -> Tuple[pd.core.groupby.generic.DataFrameGroupBy, pd.DataFrame]:
    """Group cleaned records by (locale, sector); return the groups and a base frame."""
    field_map = rp.FIELDS[source]
    sector_field = field_map["sector"]
    groups = df_cleaned.groupby([locale_field, sector_field])
    # Unique zipcodes observed in each (locale, sector) pair
    zipcodes = (
        groups['zip_cleaned']
        .apply(lambda x: list(np.unique(x)))
        .rename_axis(["locale", "sector"])
    )
    # Earliest and latest record dates bound the aggregation period
    start_date = (
        groups[field_map['date']]
        .min()
        .rename_axis(["locale", "sector"])
        .rename("start_date")
    )
    end_date = (
        groups[field_map['date']]
        .max()
        .rename_axis(["locale", "sector"])
        .rename("end_date")
    )

    result = pd.DataFrame(data={
        "zipcodes": zipcodes,
        "start_date": start_date,
        "end_date": end_date,
    })

    return groups, result

In [115]:
metric_groups = []
df_cleaned = rp.clean_data_load("Solar Panels")
for locale_field in ["town", "zip_cleaned"]:
    groups, locale_base = locale_sector_aggregation(df_cleaned, locale_field, "Solar Panels")

    base_df = locale_base.copy()
    base_df["technology"] = "Solar Panels"

    if "capacity" in df_cleaned.columns: # Solar only
            
        # Quantity of Solar Panel facilities
        metric_group = base_df.copy()
        metric_group["value_unit"] = "count"
        metric_group["metric_name"] = "Number of generation facilities"
        metric_group["value"] = groups['capacity'].count()
        metric_groups.append(metric_group)

        # Total Panel Power Capacity
        metric_group = base_df.copy()
        metric_group["value_unit"] = "kW"
        metric_group["metric_name"] = "Total Generation Capacity"
        metric_group["value"] = groups['capacity'].sum()
        metric_groups.append(metric_group)

        # Average Power Capacity
        metric_group = base_df.copy()
        metric_group["value_unit"] = "kW"
        metric_group["metric_name"] = "Average Generation Capacity"
        metric_group["value"] = groups['capacity'].mean()
        metric_groups.append(metric_group)

In [116]:
RAD_df = pd.concat(metric_groups, axis=0)

In [117]:
RAD_df.shape

(12048, 7)

In [118]:
RAD_df.metric_name.unique()

array(['Number of generation facilities', 'Total Generation Capacity',
       'Average Generation Capacity'], dtype=object)

In [119]:
RAD_df.technology.unique()

array(['Solar Panels'], dtype=object)

In [120]:
RAD_df.loc["Milton"]

sector,zipcodes,start_date,end_date,technology,value_unit,metric_name,value
Commercial / Office,[02186],2012-07-25,2016-06-10,Solar Panels,count,Number of generation facilities,3.0
Industrial,[02186],2013-01-24,2017-02-17,Solar Panels,count,Number of generation facilities,2.0
Municipal - K-12 School,[02186],2010-12-21,2012-01-31,Solar Panels,count,Number of generation facilities,5.0
Municipal / Government / Public,[02186],2010-11-02,2011-05-18,Solar Panels,count,Number of generation facilities,2.0
Residential (3 or fewer dwelling units per building),[02186],2004-02-24,2018-11-23,Solar Panels,count,Number of generation facilities,274.0
School (K-12),[02186],2012-08-10,2012-08-10,Solar Panels,count,Number of generation facilities,1.0
Commercial / Office,[02186],2012-07-25,2016-06-10,Solar Panels,kW,Total Generation Capacity,50.4
Industrial,[02186],2013-01-24,2017-02-17,Solar Panels,kW,Total Generation Capacity,128.25
Municipal - K-12 School,[02186],2010-12-21,2012-01-31,Solar Panels,kW,Total Generation Capacity,785.22
Municipal / Government / Public,[02186],2010-11-02,2011-05-18,Solar Panels,kW,Total Generation Capacity,76.36


In [40]:
df_cleaned.columns

Index(['zip_cleaned', 'zip4_cleaned', 'zip_valid', 'Capacity \n(DC, kW)',
       'Date In Service', 'Total Cost with Design Fees', 'Total Grant', 'City',
       'Zip', 'County', 'Program Name', 'Facility Type', 'Installer',
       'Module Manufacturer', 'Inverter Manufacturer', 'Meter Manufacturer',
       'Utility', '3rd Party Owner', 'SREC Eligible',
       'Estimated Annual Production (kWhr)', 'town', 'zip_exists',
       'town_valid', 'cost', 'capacity'],
      dtype='object')

In [42]:
df_cleaned[rp.FIELDS["Solar Panels"]["sector"]]

0                                               Industrial
1                                      Commercial / Office
2                                      Commercial / Office
3                                          Community Solar
4                                               Industrial
                               ...                        
90136                                        School (K-12)
90137    Residential (3 or fewer dwelling units per bui...
90138    Residential (3 or fewer dwelling units per bui...
90139    Residential (3 or fewer dwelling units per bui...
90140    Residential (3 or fewer dwelling units per bui...
Name: Facility Type, Length: 83351, dtype: object

In [76]:
groups = df_cleaned.groupby(["town", rp.FIELDS["Solar Panels"]["sector"]])

In [56]:
groups.index

AttributeError: 'DataFrameGroupBy' object has no attribute 'index'
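
A `DataFrameGroupBy` has no `index` attribute, as the error above shows. A few ways to inspect the group keys instead, sketched on made-up data (the toy frame is an assumption; only the column name `Facility Type` is borrowed from the dataset):

```python
import pandas as pd

# Toy data (made up) with the same two grouping levels as above
toy = pd.DataFrame({
    "town": ["A", "A", "B"],
    "Facility Type": ["Industrial", "Retail", "Industrial"],
    "capacity": [1.0, 2.0, 3.0],
})
toy_groups = toy.groupby(["town", "Facility Type"])

n_groups = toy_groups.ngroups                 # number of distinct key pairs
group_keys = list(toy_groups.groups.keys())   # key tuples such as ("A", "Industrial")
key_index = toy_groups.size().index           # the same keys as a MultiIndex
```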

In [63]:
res = groups['zip_cleaned'].apply(lambda x: list(np.unique(x)))
res

town      Facility Type                                                   
Abington  Commercial / Office                                                 [02351]
          Industrial                                                          [02351]
          Multi-family residential (4 or more dwelling units per building)    [02351]
          Municipal - K-12 School                                             [02351]
          Other                                                               [02351]
                                                                               ...   
Wrentham  Industrial                                                          [02093]
          Municipal / Government / Public                                     [02093]
          Religious                                                           [02093]
          Residential (3 or fewer dwelling units per building)                [02093]
          Retail                                                 

In [65]:
res.reset_index("Facility Type")

town,Facility Type,zip_cleaned
Abington,Commercial / Office,[02351]
Abington,Industrial,[02351]
Abington,Multi-family residential (4 or more dwelling u...,[02351]
Abington,Municipal - K-12 School,[02351]
Abington,Other,[02351]
...,...,...
Wrentham,Industrial,[02093]
Wrentham,Municipal / Government / Public,[02093]
Wrentham,Religious,[02093]
Wrentham,Residential (3 or fewer dwelling units per bui...,[02093]


In [94]:
field_map = rp.FIELDS['Solar Panels']
sector_field = field_map["sector"]

# start_date = groups[field_map['date']].\
#                 min().\
#                 reset_index(sector_field).\
#                 rename_axis("locale").rename("start_date")

start_date = groups[field_map['date']].\
                    min().\
                    rename_axis(["locale", "sector"]).rename("start_date").\
                    reset_index("sector")

start_date

locale,sector,start_date
Abington,Commercial / Office,2013-01-08
Abington,Industrial,2015-12-15
Abington,Multi-family residential (4 or more dwelling u...,2014-12-11
Abington,Municipal - K-12 School,2018-06-12
Abington,Other,2016-12-30
...,...,...
Wrentham,Industrial,2016-02-22
Wrentham,Municipal / Government / Public,2013-03-05
Wrentham,Religious,2013-11-18
Wrentham,Residential (3 or fewer dwelling units per bui...,2003-09-04


In [77]:
res = groups[field_map['date']].min()
res

town      Facility Type                                                   
Abington  Commercial / Office                                                2013-01-08
          Industrial                                                         2015-12-15
          Multi-family residential (4 or more dwelling units per building)   2014-12-11
          Municipal - K-12 School                                            2018-06-12
          Other                                                              2016-12-30
                                                                                ...    
Wrentham  Industrial                                                         2016-02-22
          Municipal / Government / Public                                    2013-03-05
          Religious                                                          2013-11-18
          Residential (3 or fewer dwelling units per building)               2003-09-04
          Retail                             

In [88]:
res0 = res.rename_axis(["locale", "sector"]).rename("start_date")
res0

locale    sector                                                          
Abington  Commercial / Office                                                2013-01-08
          Industrial                                                         2015-12-15
          Multi-family residential (4 or more dwelling units per building)   2014-12-11
          Municipal - K-12 School                                            2018-06-12
          Other                                                              2016-12-30
                                                                                ...    
Wrentham  Industrial                                                         2016-02-22
          Municipal / Government / Public                                    2013-03-05
          Religious                                                          2013-11-18
          Residential (3 or fewer dwelling units per building)               2003-09-04
          Retail                             

In [89]:
res1 = res0.reset_index("sector")
res1

locale,sector,date
Abington,Commercial / Office,2013-01-08
Abington,Industrial,2015-12-15
Abington,Multi-family residential (4 or more dwelling u...,2014-12-11
Abington,Municipal - K-12 School,2018-06-12
Abington,Other,2016-12-30
...,...,...
Wrentham,Industrial,2016-02-22
Wrentham,Municipal / Government / Public,2013-03-05
Wrentham,Religious,2013-11-18
Wrentham,Residential (3 or fewer dwelling units per bui...,2003-09-04


In [82]:
res2 = res1.rename_axis("locale")
res2

locale,Facility Type,Date In Service
Abington,Commercial / Office,2013-01-08
Abington,Industrial,2015-12-15
Abington,Multi-family residential (4 or more dwelling u...,2014-12-11
Abington,Municipal - K-12 School,2018-06-12
Abington,Other,2016-12-30
...,...,...
Wrentham,Industrial,2016-02-22
Wrentham,Municipal / Government / Public,2013-03-05
Wrentham,Religious,2013-11-18
Wrentham,Residential (3 or fewer dwelling units per bui...,2003-09-04


In [83]:
res2.rename("start_date")

TypeError: Index(...) must be called with a collection of some kind, 'start_date' was passed
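
The error above most likely arises because `res2` is a DataFrame: passing a bare string to `rename` sets the name of a Series, whereas a DataFrame expects a mapping for its columns. A small sketch of both cases on made-up data (the variable names and dates are illustrative):

```python
import pandas as pd

# Toy Series (made up) shaped like the grouped date minimum
dates = pd.Series(
    pd.to_datetime(["2020-01-01", "2021-06-15"]),
    index=pd.Index(["A", "B"], name="locale"),
    name="Date In Service",
)

# On a Series, a scalar argument sets the Series name
renamed_series = dates.rename("start_date")

# On a DataFrame, rename a column through an explicit mapping
frame = dates.to_frame()
renamed_frame = frame.rename(columns={"Date In Service": "start_date"})
```

This is consistent with the working cell above, which calls `rename("start_date")` on the Series before `reset_index("sector")` turns it into a DataFrame.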