<a href="https://colab.research.google.com/github/massenergize/rad/blob/hasha_refactor/merge-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understand the python environment

The Google Colab environment comes with many packages preinstalled.  It does not use conda environments.

In [1]:
from platform import python_version

print(python_version())

3.6.9


Python 3.6 reaches end-of-life at the end of 2021, and lacks features added in Python 3.7 and 3.8 that help with writing high-quality code (e.g. dataclasses and newer additions to the `typing` module).  It's fine to work with for now, but as we harden this we will want to migrate out of Colab notebooks and update the code to Python 3.8 syntax.

In [2]:
! conda list

/bin/bash: conda: command not found


In [3]:
! pip list

Package                       Version        
----------------------------- ---------------
absl-py                       0.10.0         
alabaster                     0.7.12         
albumentations                0.1.12         
altair                        4.1.0          
appdirs                       1.4.4          
argon2-cffi                   20.1.0         
asgiref                       3.3.1          
astor                         0.8.1          
astropy                       4.1            
astunparse                    1.6.3          
async-generator               1.10           
atari-py                      0.2.6          
atomicwrites                  1.4.0          
attrs                         20.3.0         
audioread                     2.1.9          
autograd                      1.3            
Babel                         2.9.0          
backcall                      0.2.0          
beautifulsoup4                4.6.3          
bleach                        3.2.

# RENEWABLES ACTION DATASET ANNOTATIONS 


## Electric Vehicles
* Source: Center for Sustainable Energy (2020). Massachusetts Department of  Energy Resources Massachusetts Offers Rebates for Electric Vehicles, Rebate Statistics.  
* Retrieved 09/08/2020 from: https://mor-ev.org/program-statistics
* Data last updated 08/21/2020. Data date range includes 06/19/2014 - 08/15/2020.
* Sectors: Residential.


## Residential Air-source Heat Pumps (ASHP)

* Source: Massachusetts Clean Energy Center (2020). Air Source Heat Pump Program - Residential Projects.
* Retrieved 09/08/2020 from: http://files-cdn.masscec.com/ResidentialASHPProjectDatabase%2011.4.2019.xlsx
* Data last updated 11/04/2019. Data date range includes 12/26/2014 - 10/23/2019.
* Sectors: Residential.


## Ground-source Heat Pumps (GSHP)

* Source: Massachusetts Clean Energy Center (2020). Ground Source Heat Pump Program - Residential & Small-Scale Projects Database.
* Retrieved 09/08/2020 from: http://files-cdn.masscec.com/get-clean-energy/govt-np/clean-heating-cooling/ResidentialandSmallScaleGSHPProjectDatabase.xlsx 
* Data last updated June 2020. Data date range includes 01/02/2015 - 06/09/2020.
* Sectors: Residential, Small Commercial.

## Production Tracking System for Solar Photovoltaic Report (PV in PTS)

* Source: Massachusetts Clean Energy Center
* According to a [September 2017 Department of Public Utilities Report](https://fileservice.eea.comacloud.net/FileService.Api/file/FileRoom/9174030), "On a monthly basis, DOER and MassCEC compile data from the production tracking system to produce the MA PV Report, which is a publicly available document.  The MA PV Report is available electronically at http://files.masscec.com/uploads/attachments/PVinPTSwebsite.xlsx."  However, the file at that URL does not appear to have been updated since November 2019.  We analyze it here, but still need to find a source that is updated on an ongoing basis.
* Sectors: Residential, Commercial, Institutional



# Mount input data files from Google Drive

Instructions from [this tutorial](https://colab.research.google.com/notebooks/io.ipynb)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
!ls /content/drive/Shareddrives/MEinternal-DataWG/RAD\ Renewable\ Actions\ Data

'05-04-2020 Data Group Agenda.gdoc'
 Analyses
 Code
'Data Downloads'
'Data Fields Overview.xlsx'
'Data Not Yet Included'
 MassData
 merge-data-AH.ipynb
 merge-data-v3.ipynb
'[old] merge-data.ipynb'
 PVinPTSwebsite.xlsx
'Renewable Actions Dataset Instructions.gdoc'
'Renewables Action Dataset Annotations.gdoc'
'ResidentialASHPProjectDatabase 11.4.2019.xlsx'
 SAMPLE-all-actions-data.xlsx
'Untitled document.gdoc'
'Zip Code Community.xlsx'
'Zip Code Muni Name.drawio'


# Dataset filename mappings

In [6]:
import os
import re

import numpy as np
import pandas as pd

%load_ext google.colab.data_table

In [7]:
data_dir = "/content/drive/Shareddrives/MEinternal-DataWG/RAD Renewable Actions Data"
data_files = {
    "zip_code_community": os.path.join(data_dir, "Zip Code Community.xlsx"),
    "residental_ashp_project_database": os.path.join(data_dir, "ResidentialASHPProjectDatabase 11.4.2019.xlsx"),
    "pv_in_pts_website": os.path.join(data_dir, "PVinPTSwebsite.xlsx")
}
data_files

{'pv_in_pts_website': '/content/drive/Shareddrives/MEinternal-DataWG/RAD Renewable Actions Data/PVinPTSwebsite.xlsx',
 'residental_ashp_project_database': '/content/drive/Shareddrives/MEinternal-DataWG/RAD Renewable Actions Data/ResidentialASHPProjectDatabase 11.4.2019.xlsx',
 'zip_code_community': '/content/drive/Shareddrives/MEinternal-DataWG/RAD Renewable Actions Data/Zip Code Community.xlsx'}
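
Because these paths depend on the Drive mount and on file names that contain spaces, it can help to confirm that each file exists before loading anything. The sketch below is one way to do that; `missing_files` is a hypothetical helper that is not part of this notebook, and the example mapping is a stand-in for `data_files`.

```python
import os


def missing_files(files):
    """Hypothetical helper: return the keys whose paths do not exist on disk."""
    return [name for name, path in files.items() if not os.path.exists(path)]


# Illustrative mapping: the current directory exists, the second path should not.
example_files = {
    "working_dir": os.getcwd(),
    "not_there": os.path.join(os.getcwd(), "no-such-file-for-this-example.xlsx"),
}
print(missing_files(example_files))
```

In the notebook, calling `missing_files(data_files)` would be expected to return an empty list when the Drive is mounted correctly.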

# Work out robust logic for cleaning zipcodes

In [8]:
# Load the zip code to municipality lookup tables.
# pd.read_excel arguments: file path, sheet name
zipcodes = pd.read_excel(data_files["zip_code_community"], sheet_name='Villages to Muni with Zip')
municipalities = pd.read_excel(data_files["zip_code_community"], sheet_name='351 Mass Munis')

In [9]:
zipcodes.head()

Unnamed: 0,Zip Code,City,Municipality,County,Unnamed: 4,Unnamed: 5
0,1001.0,Agawam,Agawam,Hampden,,
1,1002.0,Amherst,Amherst,Hampshire,,
2,1003.0,Amherst,Amherst,Hampshire,,
3,1004.0,Amherst,Amherst,Hampshire,,
4,1005.0,Barre,Barre,Worcester,,


In [10]:
zipcodes.dtypes

Zip Code        float64
City             object
Municipality     object
County           object
Unnamed: 4       object
Unnamed: 5       object
dtype: object

Zipcode fields arrive as a mix of floats, ints, and strings.  We need a zipcode standardization function that handles all of them robustly.  Here's a test case with some pathological examples.

In [11]:
import numpy as np
pathalogical_test_case = pd.Series(data = [2186.0, "2186", 2186, "01545-", "   1590 ", "01545-2790", "12345-123", "02128y", 176.2])
correct_output = pd.DataFrame(data={
    "zip_cleaned": ["02186", "02186", "02186", "01545", "01590", "01545", "12345-123", "02128y", 176.2],
    "zip4_cleaned": ["", "", "", "", "", "2790", "", "", ""],
    "zip_valid": [True, True, True, True, True, True, False, False, False]
})

In [12]:
# Testing out a regular expression for valid zip+4 strings, handling floats as long as they represent integers.
valid_zipcode_regex = r"^([0-9]{3,5})(?:[.]0)?(?:-([0-9]{4})|-)?$"
res = pathalogical_test_case.astype(str).str.strip().str.extract(valid_zipcode_regex)
res[0].str.zfill(5)

0    02186
1    02186
2    02186
3    01545
4    01590
5    01545
6      NaN
7      NaN
8      NaN
Name: 0, dtype: object
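
To see what each capture group holds, the same pattern can be tried on single values with the standard `re` module. This is a minimal sketch; `split_zip` is a hypothetical helper introduced only for illustration and is not used elsewhere in the notebook.

```python
import re

# Same pattern as in the cell above.
VALID_ZIPCODE_REGEX = r"^([0-9]{3,5})(?:[.]0)?(?:-([0-9]{4})|-)?$"


def split_zip(raw):
    """Hypothetical helper: return (zip5, zip4), or None if the input is invalid."""
    match = re.match(VALID_ZIPCODE_REGEX, str(raw).strip())
    if match is None:
        return None
    zip5, zip4 = match.groups()
    # Left-pad with zeros to restore leading zeros lost to numeric parsing.
    return zip5.zfill(5), zip4 or ""


for raw in [2186.0, "01545-2790", "01545-", "12345-123", 176.2]:
    print(repr(raw), "->", split_zip(raw))
```

Group 1 captures the 3 to 5 digit zip5 part, the optional `.0` absorbs integer-valued floats, and group 2 captures the zip4 part only when exactly four digits follow the dash.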

In [13]:
cleaned_zipcode_series = \
res[0].str.zfill(5)\
  .str.cat(res[1], sep="-", na_rep="")\
  .str.rstrip("-")
cleaned_zipcode_series

0         02186
1         02186
2         02186
3         01545
4         01590
5    01545-2790
6              
7              
8              
Name: 0, dtype: object

In [14]:
valid_zipcode_series = cleaned_zipcode_series.str.match(valid_zipcode_regex)
valid_zipcode_series

0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
8    False
Name: 0, dtype: bool

In [15]:
def clean_zipcodes(zip_series):
  """Standardize zipcodes in a pandas series
  
  Pandas will likely load zipcodes from an Excel file as a numeric series or
  as an object series mixing string and numeric types.  This function casts
  all entries to strings, strips whitespace, extracts the zip5 and optional
  zip4 parts with a regular expression, and left pads the zip5 part with zeros
  to 5 characters.  Entries that do not match the pattern are flagged as
  invalid and returned unchanged in zip_cleaned.

  Return: pandas.DataFrame
    - Pandas series of cleaned zipcodes: str
    - Pandas series of cleaned zip4, or empty string if missing: str
    - Pandas series of valid zipcode indicators: boolean
  """
  # Valid zipcodes are 3-5 numeric digits, followed by an optional dash and 4 more digits, or an optional ".0" if the data has been cast as floats.
  valid_zipcode_regex = r"^([0-9]{3,5})(?:[.]0)?(?:-([0-9]{4})|-)?$"
  #The extract function will match this pattern, and extract the zip5 group into column 0 and the zip4 group into column 1.  NaN if group is not present or pattern isn't matched.
  res = zip_series.astype(str).str.strip().str.extract(valid_zipcode_regex)

  cleaned_zipcode_series = res[0].str.zfill(5).fillna('')
  cleaned_zip4_series = res[1].fillna('')

  valid_zipcode_series = cleaned_zipcode_series.str.match(valid_zipcode_regex)
  
  #Replace invalid value rows with original inputs
  cleaned_zipcode_series.loc[~valid_zipcode_series] = zip_series[~valid_zipcode_series]
  return pd.DataFrame(data={"zip_cleaned": cleaned_zipcode_series, "zip4_cleaned": cleaned_zip4_series, "zip_valid": valid_zipcode_series})



In [16]:
z_df = clean_zipcodes(pathalogical_test_case)

In [17]:
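# assert_frame_equal raises an AssertionError on any mismatch and returns None otherwise,
# so reaching the print statement means the test passed.
pd.testing.assert_frame_equal(z_df, correct_output)
print("Success!")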

Success!


# Load Air Source Heat Pump (ASHP) program data and Solar program data

Load from Excel into pandas dataframes and peek at the first rows as a sanity check.

In [18]:
# The PTS solar tracking data, loaded below, is the more important of the two datasets.

# Note: the zip code columns in these datasets could instead be read as strings, which would preserve the leading '0' of each zip code.
f_ashp = pd.read_excel(data_files["residental_ashp_project_database"], 'Sheet1', skiprows=3)
df_ashp = f_ashp.drop([0]) #remove first null row for formatting purposes

#Read PVinPTS Data
df_pv = pd.read_excel(data_files["pv_in_pts_website"], 'PV in PTS', skiprows=7)
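
The note about reading zip codes as strings can be illustrated with a small in-memory example. This sketch uses `pd.read_csv` on made-up text as a stand-in for the spreadsheets; `pd.read_excel` accepts a `dtype` argument in the same way, but that only helps when the spreadsheet stores the zip codes as text. Cells stored as numbers have already lost their leading zero, so `clean_zipcodes` is still needed.

```python
import io

import pandas as pd

# Toy stand-in for the spreadsheet; the rows here are for illustration only.
csv_text = "Site City/Town,Site Zip Code\nAgawam,01001\nBoston,02128\n"

# Default parsing infers integers and drops the leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the text exactly as stored.
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"Site Zip Code": str})

print(as_numbers["Site Zip Code"].tolist())
print(as_strings["Site Zip Code"].tolist())
```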

In [19]:
df_ashp.head()

Unnamed: 0,Date Rebate Payment Approved by MassCEC,Site City/Town,Site Zip Code,Installer Company Name,Heating Fuel Being Replaced,Cooling Type Being Replaced,# of Outdoor Units,# of Indoor Units,Capacity of Heat Pumps at 5°F,Single- Head Heat Pump #1,Single- Head Heat Pump #2,Single- Head Heat Pump #3,Multi-Head Heat Pump #1,Multi-Head Heat Pump #2,Multi-Head Heat Pump #3,Total System Costs,Receiving an Income-Based Adder?,Rebate Amount
1,2019-10-23,CENTERVILLE,2632,"Seaside Gas Service, Inc",Natural Gas,,1,3,25000.761761,,,,,,,11575,0,1302.08
2,2019-10-23,NORTHFIELD,1360,Arctic Refrigeration,Pellet Stove,2 window units,1,1,20300.0,,,,,,,4050,0,625.0
3,2019-10-16,West Tisbury,2575,Nelson Mechanical Design Inc,Propane,Window Unit(s),2,-,57200.4,,,,Mitsubishi MXZ-3C30NAHZ2,Mitsubishi MXZ-3C30NAHZ2,,25560,No - Not Applicable,2000.0
4,2019-10-16,FITCHBURG,1420,Royal Steam Heater Co.,Natural Gas,Window fan,4,15,92997.920266,,,,,,,44352,0,6200.0
5,2019-10-09,Haverhill,1835,Climate Zone,Oil,Window Unit(s),1,,28600.0,0.0,0.0,0.0,Mitsubishi-MXZ-3C30NAHZ2,0,0.0,13700,No - Not Applicable,1191.67


Each row appears to describe an individual installation project.

In [20]:
df_pv.head()

Unnamed: 0,"Capacity \n(DC, kW)",Date In Service,Total Cost with Design Fees,Total Grant,City,Zip,County,Program Name,Facility Type,Installer,Module Manufacturer,Inverter Manufacturer,Meter Manufacturer,Utility,3rd Party Owner,SREC Eligible,Estimated Annual Production (kWhr)
0,1077.48,2019-08-08,3462750.0,0.0,Boston,2128,Suffolk,Non-RET Funded Grants,Industrial,ECA Solar,Jinko Solar,Solectria;SolarEdge Technologies,Elkor Technologies,NSTAR (DBA EverSource),N,Y,1345600.0
1,218.28,2019-07-30,600000.0,0.0,Pittsfield,1201,Berkshire,Non-RET Funded Grants,Commercial / Office,"BVD, LLC",Seraphim Solar System,Solectria;Solectria;Solectria,Elkor Technologies,WMECO (DBA EverSource),N,Y,288557.0
2,296.0,2019-07-09,760806.0,0.0,Great Barrington,1230,Berkshire,Non-RET Funded Grants,Commercial / Office,Solect Energy Development LLC,LG Electronics,HiQ Solar,eGauge,National Grid,Y,Y,309800.0
3,1408.96,2019-06-27,4064405.0,0.0,Millbury,1527,Worcester,Non-RET Funded Grants,Community Solar,M&amp;W Energy,Hansol Technics,Sungrow Power,Accuenergy,National Grid,N,Y,1760000.0
4,657.0,2019-06-21,738468.0,0.0,Walpole,2081,Norfolk,Non-RET Funded Grants,Industrial,ECA Solar,LG Electronics,Solectria;Solectria;Solectria,Elkor Technologies,NSTAR (DBA EverSource),N,Y,804168.0


# Clean zipcode data

In [21]:
zipcodes = pd.concat([clean_zipcodes(zipcodes['Zip Code']), zipcodes], axis=1)
zipcodes.head()

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid,Zip Code,City,Municipality,County,Unnamed: 4,Unnamed: 5
0,1001,,True,1001.0,Agawam,Agawam,Hampden,,
1,1002,,True,1002.0,Amherst,Amherst,Hampshire,,
2,1003,,True,1003.0,Amherst,Amherst,Hampshire,,
3,1004,,True,1004.0,Amherst,Amherst,Hampshire,,
4,1005,,True,1005.0,Barre,Barre,Worcester,,


In [22]:
# There are a bunch of entries in the zipcode mapping table with null zipcode
zipcodes[~zipcodes.zip_valid]

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid,Zip Code,City,Municipality,County,Unnamed: 4,Unnamed: 5
703,,,False,,,Alford,,,
704,,,False,,,Aquinnah,,,
705,,,False,,,Clarksburg,,,
706,,,False,,,Hancock,,,
707,,,False,,,Hawley,,,
708,,,False,,,Leyden,,,
709,,,False,,,Montgomery,,,
710,,,False,,,Mount Washington,,,
711,,,False,,,New Ashford,,,
712,,,False,,,Pelham,,,


In [23]:
#For now, let's drop zipcode mappings with null zipcode
zipcodes = zipcodes[zipcodes.zip_valid]

In [24]:
#Are there any zipcodes mapping to multiple towns?  Looks like there aren't.
zipcodes.groupby("zip_cleaned")["Municipality"].count().max()

1

In [25]:
# Conversely, what is the largest number of zipcode entries mapped to a single municipality?
zipcodes.groupby("Municipality")["zip_cleaned"].count().max()

55

In [26]:
x = clean_zipcodes(df_ashp['Site Zip Code'])
x[~x.zip_valid]

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid
1093,20,,False
1590,019081047,,False
1606,0212y,,False


In [27]:
df_ashp = pd.concat([x, df_ashp], axis=1)
df_ashp.head()

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid,Date Rebate Payment Approved by MassCEC,Site City/Town,Site Zip Code,Installer Company Name,Heating Fuel Being Replaced,Cooling Type Being Replaced,# of Outdoor Units,# of Indoor Units,Capacity of Heat Pumps at 5°F,Single- Head Heat Pump #1,Single- Head Heat Pump #2,Single- Head Heat Pump #3,Multi-Head Heat Pump #1,Multi-Head Heat Pump #2,Multi-Head Heat Pump #3,Total System Costs,Receiving an Income-Based Adder?,Rebate Amount
1,2632,,True,2019-10-23,CENTERVILLE,2632,"Seaside Gas Service, Inc",Natural Gas,,1,3,25000.761761,,,,,,,11575,0,1302.08
2,1360,,True,2019-10-23,NORTHFIELD,1360,Arctic Refrigeration,Pellet Stove,2 window units,1,1,20300.0,,,,,,,4050,0,625.0
3,2575,,True,2019-10-16,West Tisbury,2575,Nelson Mechanical Design Inc,Propane,Window Unit(s),2,-,57200.4,,,,Mitsubishi MXZ-3C30NAHZ2,Mitsubishi MXZ-3C30NAHZ2,,25560,No - Not Applicable,2000.0
4,1420,,True,2019-10-16,FITCHBURG,1420,Royal Steam Heater Co.,Natural Gas,Window fan,4,15,92997.920266,,,,,,,44352,0,6200.0
5,1835,,True,2019-10-09,Haverhill,1835,Climate Zone,Oil,Window Unit(s),1,,28600.0,0.0,0.0,0.0,Mitsubishi-MXZ-3C30NAHZ2,0,0.0,13700,No - Not Applicable,1191.67


In [28]:
y = clean_zipcodes(df_pv['Zip'])
y[~y.zip_valid]

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid


In [29]:
df_pv = pd.concat([y, df_pv], axis=1)
df_pv.head()

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid,"Capacity \n(DC, kW)",Date In Service,Total Cost with Design Fees,Total Grant,City,Zip,County,Program Name,Facility Type,Installer,Module Manufacturer,Inverter Manufacturer,Meter Manufacturer,Utility,3rd Party Owner,SREC Eligible,Estimated Annual Production (kWhr)
0,2128,,True,1077.48,2019-08-08,3462750.0,0.0,Boston,2128,Suffolk,Non-RET Funded Grants,Industrial,ECA Solar,Jinko Solar,Solectria;SolarEdge Technologies,Elkor Technologies,NSTAR (DBA EverSource),N,Y,1345600.0
1,1201,,True,218.28,2019-07-30,600000.0,0.0,Pittsfield,1201,Berkshire,Non-RET Funded Grants,Commercial / Office,"BVD, LLC",Seraphim Solar System,Solectria;Solectria;Solectria,Elkor Technologies,WMECO (DBA EverSource),N,Y,288557.0
2,1230,,True,296.0,2019-07-09,760806.0,0.0,Great Barrington,1230,Berkshire,Non-RET Funded Grants,Commercial / Office,Solect Energy Development LLC,LG Electronics,HiQ Solar,eGauge,National Grid,Y,Y,309800.0
3,1527,,True,1408.96,2019-06-27,4064405.0,0.0,Millbury,1527,Worcester,Non-RET Funded Grants,Community Solar,M&amp;W Energy,Hansol Technics,Sungrow Power,Accuenergy,National Grid,N,Y,1760000.0
4,2081,,True,657.0,2019-06-21,738468.0,0.0,Walpole,2081,Norfolk,Non-RET Funded Grants,Industrial,ECA Solar,LG Electronics,Solectria;Solectria;Solectria,Elkor Technologies,NSTAR (DBA EverSource),N,Y,804168.0


## Standardize municipality names

Join the municipality name from the lookup table via zipcode, and check for discrepancies with the raw municipality name in the source data.
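
As a rough illustration of this join-and-compare step, the sketch below uses two toy dataframes rather than the real data. It is an alternative formulation under stated assumptions, not the notebook's own method: it uses a left join so that only project records appear in the result, and relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy lookup table and project records; values are illustrative only.
lookup = pd.DataFrame({
    "zip_cleaned": ["01001", "01002"],
    "Municipality": ["Agawam", "Amherst"],
})
projects = pd.DataFrame({
    "zip_cleaned": ["01001", "01002", "01002", "08180"],
    "Site City/Town": ["FEEDING HILLS", "Amherst", "Pelham", "Andover"],
})

# A left join keeps every project record. validate="many_to_one" raises
# MergeError if a zipcode appears more than once in the lookup table, and
# indicator=True adds a "_merge" column showing which rows found a match.
merged = projects.merge(
    lookup, on="zip_cleaned", how="left", validate="many_to_one", indicator=True
)

# Records whose zipcode is missing from the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]

# Records whose raw town name differs from the looked-up municipality.
raw_name = merged["Site City/Town"].str.strip().str.lower()
muni_name = merged["Municipality"].str.strip().str.lower()
discrepant = merged[(merged["_merge"] == "both") & (raw_name != muni_name)]

print(unmatched[["zip_cleaned", "Site City/Town"]])
print(discrepant[["zip_cleaned", "Site City/Town", "Municipality"]])
```

Separating the two cases keeps unmatched zipcodes apart from genuine name discrepancies such as a village name recorded in place of its municipality.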

In [30]:
zipcodes.head()

Unnamed: 0,zip_cleaned,zip4_cleaned,zip_valid,Zip Code,City,Municipality,County,Unnamed: 4,Unnamed: 5
0,1001,,True,1001.0,Agawam,Agawam,Hampden,,
1,1002,,True,1002.0,Amherst,Amherst,Hampshire,,
2,1003,,True,1003.0,Amherst,Amherst,Hampshire,,
3,1004,,True,1004.0,Amherst,Amherst,Hampshire,,
4,1005,,True,1005.0,Barre,Barre,Worcester,,


In [31]:
ashp_cleaned = pd.merge(zipcodes[['zip_cleaned', "Municipality", "City", "County"]], df_ashp, on="zip_cleaned", how='outer')
ashp_cleaned.head()

Unnamed: 0,zip_cleaned,Municipality,City,County,zip4_cleaned,zip_valid,Date Rebate Payment Approved by MassCEC,Site City/Town,Site Zip Code,Installer Company Name,Heating Fuel Being Replaced,Cooling Type Being Replaced,# of Outdoor Units,# of Indoor Units,Capacity of Heat Pumps at 5°F,Single- Head Heat Pump #1,Single- Head Heat Pump #2,Single- Head Heat Pump #3,Multi-Head Heat Pump #1,Multi-Head Heat Pump #2,Multi-Head Heat Pump #3,Total System Costs,Receiving an Income-Based Adder?,Rebate Amount
0,1001,Agawam,Agawam,Hampden,,True,2019-05-01,Agawam,1001,A Plus HVAC Inc.,Natural Gas,Centralized System,1,-,10896.0,Mitsubishi -MUZ-FH12NAH,,,,,,5640.0,No - Not Applicable,500.0
1,1001,Agawam,Agawam,Hampden,,True,2019-04-03,AGAWAM,1001,Mark Couto Plumbing & Heating Inc. ...,,,1,1,18000.0,MUZ-FH15NA,,,,,,4000.0,No - Not Applicable,925.0
2,1001,Agawam,Agawam,Hampden,,True,2019-02-20,Agawam,1001,Swift River HVAC Inc,Natural Gas,Window Unit(s),1,-,25500.0,,,,Fujitsu AOU24RLXFZH,,,7209.58,No - Not Applicable,1062.5
3,1001,Agawam,Agawam,Hampden,,True,2019-01-30,Agawam,1001,American Installations,Natural Gas,Window Unit(s),1,-,13596.0,Mitsubishi -MUZ-FH18NAH2,,,,,,5780.0,No - Not Applicable,500.0
4,1001,Agawam,Agawam,Hampden,,True,2019-01-30,Agawam,1001,McNeill Heating & A/C,Pellet Stove,Window Unit(s),1,-,20496.0,Fujitsu AOU15RLS3H-ASU15RLS3Y Indoor Unit,,,,,,3950.0,No - Not Applicable,500.0


In [32]:
ashp_cleaned.shape

(20166, 24)

In [33]:
ashp_cleaned[ashp_cleaned.Municipality.isna() & ashp_cleaned.zip_valid]

Unnamed: 0,zip_cleaned,Municipality,City,County,zip4_cleaned,zip_valid,Date Rebate Payment Approved by MassCEC,Site City/Town,Site Zip Code,Installer Company Name,Heating Fuel Being Replaced,Cooling Type Being Replaced,# of Outdoor Units,# of Indoor Units,Capacity of Heat Pumps at 5°F,Single- Head Heat Pump #1,Single- Head Heat Pump #2,Single- Head Heat Pump #3,Multi-Head Heat Pump #1,Multi-Head Heat Pump #2,Multi-Head Heat Pump #3,Total System Costs,Receiving an Income-Based Adder?,Rebate Amount
20160,8180,,,,,True,2019-05-08,Andover,8180,"Royal Air Systems, Inc.",Oil,Window Unit(s),1,-,18000.0,Mitsubishi MUZ-FH18NA2,,,,,,7422,No - Not Applicable,500.0
20161,2174,,,,,True,2019-04-24,ARLINGTON,2174,RER Fuel Inc,Natural Gas,Window Unit(s),2,-,47499.6,,,,Fujitsu AOU18RLXFZH,Fujitsu AOU24RLXFZH,,23750,No - Not Applicable,1979.17
20165,673,,,,,True,2018-02-13,WEST YARMOUTH,673,"Cape Cod Mechanical Systems, Inc.",Natural Gas,,1,3,25500.0,,,,FUJITSU AOU24RLXFZH,,,7900,Not Applicable,1328.12


In [34]:
ashp_cleaned[ashp_cleaned.City.str.lower().str.strip() != ashp_cleaned['Site City/Town'].str.lower().str.strip()]

Unnamed: 0,zip_cleaned,Municipality,City,County,zip4_cleaned,zip_valid,Date Rebate Payment Approved by MassCEC,Site City/Town,Site Zip Code,Installer Company Name,Heating Fuel Being Replaced,Cooling Type Being Replaced,# of Outdoor Units,# of Indoor Units,Capacity of Heat Pumps at 5°F,Single- Head Heat Pump #1,Single- Head Heat Pump #2,Single- Head Heat Pump #3,Multi-Head Heat Pump #1,Multi-Head Heat Pump #2,Multi-Head Heat Pump #3,Total System Costs,Receiving an Income-Based Adder?,Rebate Amount
13,01001,Agawam,Agawam,Hampden,,True,2018-10-11,FEEDING HILLS,1001,Air Experts Inc. ...,Natural Gas,1 window unit,1,1,21600.0,Fujitsu AOU18RLXFW1,,,,,,10900,No - Not Applicable,625
45,01002,Amherst,Amherst,Hampshire,,True,2019-06-26,Leverett,1002,Rock Valley HVAC Inc,,,3,-,84828.0,Mitsubishi MUZ-FH06NA,Mitsubishi MUZ-FH06NA,,Mitsubishi MXZ-4C36NAHZ Non-Ducted Indoor Units,,,16900,No - Not Applicable,2000
51,01002,Amherst,Amherst,Hampshire,,True,2019-05-29,Pelham,1002,"Orange Oil Company, Inc.",Oil,Window Unit(s),2,-,33900.0,Mitsubishi MUZ-FH15NA,Mitsubishi MUZ-FH09NA,,,,,9517.19,No - Not Applicable,1000
56,01002,Amherst,Amherst,Hampshire,,True,2019-05-15,Pelham,1002,Pioneer Heating & Cooling Inc.,Oil,,3,-,58608.0,Mitsubishi MUZ-FH18NA2,Mitsubishi MUZ-FH09NA,Mitsubishi MUZ-FH09NA,,,,16850,No - Not Applicable,1500
57,01002,Amherst,Amherst,Hampshire,,True,2019-05-15,Pelham,1002,American Installations,,,1,-,48000.0,,,,Mitsubishi MXZ-5C42NAHZ,,,16800,No - Not Applicable,2000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20161,02174,,,,,True,2019-04-24,ARLINGTON,2174,RER Fuel Inc,Natural Gas,Window Unit(s),2,-,47499.6,,,,Fujitsu AOU18RLXFZH,Fujitsu AOU24RLXFZH,,23750,No - Not Applicable,1979.17
20162,20,,,,,False,2019-04-17,Walpole,20,"All Temp Systems Mechanical, Inc.",Oil,Window Unit(s),1,-,8700.0,Mitsubishi MUZ-FH09NA,,,,,,4300,No - Not Applicable,500
20163,019081047,,,,,False,2019-03-27,Nahant,019081047,The Jim Walsh Company,Natural Gas,Window Unit(s),1,-,36406.8,,,,Fujitsu AOU36RLXFZH,,,7650,No - Not Applicable,1516.96
20164,0212y,,,,,False,2019-03-27,Boston,0212y,Mass Mini-Splits,Natural Gas,,1,-,45000.0,,,,LG LMU300HHV,,,21000,No - Not Applicable,1875


In [35]:
# For now, drop ASHP program records whose zipcode is valid but has no match in the municipality lookup table.
# Records with an invalid zipcode are not removed by this filter.
ashp_cleaned = ashp_cleaned[~(ashp_cleaned.Municipality.isna() & ashp_cleaned.zip_valid)]

In [36]:
# Deliberately stop "Run all" here: the cells below are unfinished and should not be run.
raise RuntimeError("Fail, don't run the stuff below")

In [None]:
#Checks that the data's town name is present in the 351 mass munis list or villages to muni with zip list. 
def matchMuniName(siteCityName, siteZipCode):
    # First look for the name in the 351 Mass Munis list
    x = municipalities.loc[municipalities['Municipality'] == siteCityName]
    if not x.empty:
        # len() counts rows; .size would count rows * columns
        if len(x) > 1:
            print("There are multiple municipalities for", siteCityName)
        return x.iloc[0]['Municipality']
    # Otherwise treat the name as a village and look up its municipality
    y = zipcodes.loc[zipcodes['City'] == siteCityName]
    if y.empty:
        return "NF"  # not found in either list
    if y['Municipality'].nunique() > 1:
        print("There are multiple municipalities for", siteCityName)
    # TODO: also use siteZipCode to disambiguate
    return y.iloc[0]['Municipality']

matchMuniName("Florence", '02138')

In [None]:
#Checks that the data's town name is present in the 351 mass munis list or villages to muni with zip list. 
def matchMuniName(siteCityName, siteZipCode):
    x = municipalities.loc[municipalities['Municipality'] == siteCityName]
    if not x.empty:
        # print(x.values)
        return x.values.astype('str')
    else:
        y = zipcodes.loc[zipcodes['City'] == siteCityName]
        print(y)
        return y
        #also use zip codes
    return "NF"

matchMuniName("Boston", '02138')

In [None]:
#Implement and use the predefined matchMuniName function. 
def processZipCode(df, zipCodeColumnName, townColumnName, newMuniName):
    # Work in progress: zipCodeColumnName and newMuniName are not used yet.
    for index, row in df.iterrows():
        town = row[townColumnName]
        # Skip missing town names (NaN is truthy, so test the type explicitly)
        if isinstance(town, str):
            if not df.loc[df[townColumnName].str.lower() == town.lower()].empty:
                print(town)
        #first use the 351 Mass Munis (the municipalities dataframe)
        #then use the Villages to Muni with Zip (the zipcodes data frame) to match if value is not present in mass munis. 
            
            #else: if the Site City/Town is not in the municipality list, then choose a different 

        #if there is a zip code 
            #if not a zip code 
                #check multi zip code
                    #find address
                #use matchMuniName()

processZipCode(df_ashp, "Site Zip Code", "Site City/Town", "Municipality")

In [None]:
#Format Technology

df_ashp = df_ashp.assign(Technology = 'Air Source Heat Pumps')
#Do the Same for other data sets

print(df_ashp)

In [None]:
#Date and Time Formatting

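def processDates(df, dateColumn, yearName, monthName):
    # Assigning to `row` inside iterrows() only changes a copy, so the
    # year and month are selected with the vectorized .dt accessor instead.
    # Entries that cannot be parsed as dates become NaT.
    dates = pd.to_datetime(df[dateColumn], errors='coerce')
    df[yearName] = dates.dt.year
    df[monthName] = dates.dt.month
    return df

df_ashp = processDates(df_ashp, 'Date Rebate Payment Approved by MassCEC', 'Year', 'Month')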

In [None]:
#Quantity: Use Pandas df['Application Number'].unique() to find all unique values

print(df_ashp['Site City/Town'].unique())

In [None]:
#Rebates

#Assign rebates directly from the rebates in the ASHP data -- read all as integer or dollar values

In [None]:
#"Snap to Grid"
# Match Column names and copy over data


In [None]:
#Data Fields for Final Dataset 
dataFields = ['Municipality', 'Zip Code', 'Tech', 'Year', 'Month', 'Quantity', 'Average Cost',
    'Total Cost', 'Total Rebates', 'Average Rebate', 'Count Income-Eligible']

df_final = pd.DataFrame(columns=dataFields)

print(df_final)

#to_csv
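
One possible way to fill most of these fields is a grouped aggregation over the project-level records. The sketch below runs on a toy dataframe with made-up values; the `Date` column name is an assumption, and `Count Income-Eligible` is left out because the encoding of the income-based adder column in the source data is mixed and would need cleaning first.

```python
import pandas as pd

# Toy project-level records; the values are for illustration only.
projects = pd.DataFrame({
    "Municipality": ["Agawam", "Agawam", "Amherst"],
    "zip_cleaned": ["01001", "01001", "01002"],
    "Date": pd.to_datetime(["2019-05-01", "2019-05-20", "2019-06-26"]),
    "Total System Costs": [5640.0, 4000.0, 16900.0],
    "Rebate Amount": [500.0, 925.0, 2000.0],
})

summary = (
    projects.assign(
        Tech="Air Source Heat Pumps",
        Year=projects["Date"].dt.year,
        Month=projects["Date"].dt.month,
    )
    .rename(columns={"zip_cleaned": "Zip Code"})
    .groupby(["Municipality", "Zip Code", "Tech", "Year", "Month"], as_index=False)
    .agg(
        # Named aggregation: output column = (input column, function).
        Quantity=("Rebate Amount", "size"),
        **{
            "Average Cost": ("Total System Costs", "mean"),
            "Total Cost": ("Total System Costs", "sum"),
            "Total Rebates": ("Rebate Amount", "sum"),
            "Average Rebate": ("Rebate Amount", "mean"),
        },
    )
)
print(summary)
```

The result has one row per municipality, zip code, technology, year, and month, and could then be written out with `summary.to_csv(...)`.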