## Objectives

The objective of this notebook is to become familiar with the electricity consumption data and to clean and transform it in preparation for the statistical analyses and anomaly detection. More specifically, we need to:
- resolve the data quality issues and clean the dataset, such that there is only one row per account per billing period (an account is determined by the combination of building and meter number)
- identify duplication and overlapping of billing periods
- prorate the KWH Consumption values using the calendar month, instead of relying on the original revenue month
- identify the billing gaps per calendar month
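
To make the proration objective concrete, here is a minimal sketch that splits one bill across calendar months in proportion to the number of billed days in each month. The function name `prorate_by_calendar_month` and the 3300 kWh figure are hypothetical, and treating the billing window as half-open (start date included, end date excluded) is an assumption rather than a rule stated by the data provider.

```python
from datetime import date

def prorate_by_calendar_month(start, end, kwh):
    """Split kwh over calendar months by the share of days in [start, end)."""
    total_days = (end - start).days
    if total_days <= 0:
        raise ValueError("end must be after start")
    shares = {}
    day = start
    while day < end:
        # First day of the month following `day`
        next_month = date(day.year + (day.month == 12), day.month % 12 + 1, 1)
        chunk_end = min(next_month, end)
        days_in_chunk = (chunk_end - day).days
        shares[(day.year, day.month)] = kwh * days_in_chunk / total_days
        day = chunk_end
    return shares

# Hypothetical bill: 3300 kWh billed for 2012-11-21 to 2012-12-24 (33 days)
shares = prorate_by_calendar_month(date(2012, 11, 21), date(2012, 12, 24), 3300.0)
print(shares)
```

In this sketch, 10 of the 33 days fall in November and 23 in December, so the consumption is split in that ratio.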


## Known data quality issues according to the domain knowledge expert
1. multiple account names for the same building_id
2. meter number switches over the years for the same account
3. overlapping/duplication of billing windows for the same account

## Steps
1. read in data, create a dataframe to log the rows with issues
2. general cleaning - remove null rows, convert data types, set up a group of metrics to indicate data cleanliness
3. resolve the known issues & other issues discovered along the way
5. identify duplication and overlapping of billing periods
6. prorate the billings using the calendar month, instead of relying on the original revenue month
7. identify the billing gaps per calendar month

## Step 1 - read in data, create a dataframe to log the rows with issues

Below is a list of Python packages required for data processing and analysis:

In [2]:
from __future__ import division
import pandas as pd
import numpy as np
import pandasql as pdsql
import math
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import relativedelta

import matplotlib.pyplot as plt
# Setup matplotlib to display in notebook:
%matplotlib inline

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
# initiate notebook for offline plot
init_notebook_mode(connected=True)         

### 1.1  Read in the data 

In [857]:
df = pd.read_csv("../data/NYC Open Data - Electric_Consumption_And_Cost__2010_-__June_2018_.csv", low_memory=False)

In [858]:
df.columns

Index(['Development Name', 'Borough', 'Account Name', 'Location', 'Meter AMR',
       'Meter Scope', 'TDS #', 'EDP', 'RC Code', 'Funding Source', 'AMP #',
       'Vendor Name', 'UMIS BILL ID', 'Revenue Month', 'Service Start Date',
       'Service End Date', '# days', 'Meter Number', 'Estimated',
       'Current Charges', 'Rate Class', 'Bill Analyzed', 'Consumption (KWH)',
       'KWH Charges', 'Consumption (KW)', 'KW Charges', 'Other charges'],
      dtype='object')

Change column names for easy reference.

In [859]:
df.columns = ['Development_Name', 'Borough', 'Account_Name', 'Location', 'Meter_AMR',
       'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source', 'AMP #',
       'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month', 'Service_Start_Date',
       'Service_End_Date', '# days', 'Meter_Number', 'Estimated',
       'Current_Charges', 'Rate_Class', 'Bill_Analyzed', 'Consumption_KWH',
       'KWH_Charges', 'Consumption_KW', 'KW_Charges', 'Other_Charges']

Check the number of empty values in each column.

In [860]:
df.isnull().sum()

Development_Name         146
Borough                  146
Account_Name             146
Location                9041
Meter_AMR                187
Meter_Scope           296588
TDS #                   1717
EDP                      146
RC_Code                  146
Funding_Source           146
AMP #                   1657
Vendor_Name              146
UMIS_BILL_ID             146
Revenue_Month            146
Service_Start_Date       146
Service_End_Date         146
# days                   146
Meter_Number             146
Estimated                146
Current_Charges          146
Rate_Class               146
Bill_Analyzed            146
Consumption_KWH          146
KWH_Charges              146
Consumption_KW           146
KW_Charges               146
Other_Charges            146
dtype: int64

### 1.2  Save a copy of the dataframe before data cleaning to flag the rows with problems

Save df as df_orig before data cleaning and use the index as a unique row identifier to connect the two dataframes.

In [861]:
df = df.reset_index()

In [862]:
df.columns

Index(['index', 'Development_Name', 'Borough', 'Account_Name', 'Location',
       'Meter_AMR', 'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source',
       'AMP #', 'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Meter_Number',
       'Estimated', 'Current_Charges', 'Rate_Class', 'Bill_Analyzed',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges'],
      dtype='object')

In [863]:
df.columns= ['row', 'Development_Name', 'Borough', 'Account_Name', 'Location',
       'Meter_AMR', 'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source',
       'AMP #', 'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Meter_Number',
       'Estimated', 'Current_Charges', 'Rate_Class', 'Bill_Analyzed',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges']

In [864]:
# df = df_orig

In [865]:
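# Take an independent copy; plain assignment would only create a second name for the same object
df_orig = df.copy()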

Create a data frame to log the rows with problems.

In [866]:
df_flags = pd.DataFrame(columns = ['row', 'flag'])
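
The logging pattern used throughout this notebook (concatenate a small frame of row ids and a flag label onto `df_flags`) can be wrapped in a helper. The following is a minimal sketch; the name `log_flags` is hypothetical and does not appear in the original notebook.

```python
import pandas as pd

def log_flags(df_flags, rows, flag):
    """Return df_flags with one new (row, flag) entry per id in `rows`."""
    new_entries = pd.DataFrame({'row': list(rows), 'flag': flag})
    return pd.concat([df_flags, new_entries], ignore_index=True)

flags = pd.DataFrame(columns=['row', 'flag'])
flags = log_flags(flags, [3, 5], 'NULL Values')
flags = log_flags(flags, [7], 'Duplicated rows')
print(flags)
```

With such a helper, each cleaning step would reduce to one call, for example `df_flags = log_flags(df_flags, df[mask].row.values, 'NULL Values')`.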

## Step 2 - general cleaning

### 2.1 Remove empty rows

In [867]:
mask = df['Account_Name'].isna()

Remove the problematic rows from the working dataframe df and log them in the df_flags dataframe.

In [868]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'NULL Values'})])
df = df[~mask]

All the rows with only null values have been removed.

In [869]:
df.isnull().sum()

row                        0
Development_Name           0
Borough                    0
Account_Name               0
Location                8895
Meter_AMR                 41
Meter_Scope           296442
TDS #                   1571
EDP                        0
RC_Code                    0
Funding_Source             0
AMP #                   1511
Vendor_Name                0
UMIS_BILL_ID               0
Revenue_Month              0
Service_Start_Date         0
Service_End_Date           0
# days                     0
Meter_Number               0
Estimated                  0
Current_Charges            0
Rate_Class                 0
Bill_Analyzed              0
Consumption_KWH            0
KWH_Charges                0
Consumption_KW             0
KW_Charges                 0
Other_Charges              0
dtype: int64

### 2.2 Remove rows for which electricity charges were estimated

In [870]:
df['Estimated'].value_counts()

N             260863
Y              51749
NA               389
Name: Estimated, dtype: int64

In [871]:
df['Estimated'].value_counts().index.values

array(['N         ', 'Y         ', 'NA        '], dtype=object)

Identify & log the problematic rows, delete them from the working dataframe df.

In [872]:
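# Rows whose charges were not estimated; the raw values are padded with trailing spaces
mask = (df['Estimated'].str.strip() == 'N')

# Log the rows being dropped (estimated or 'NA'), not the rows being kept
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[~mask].row.values, 'flag':'Inaccurate Values of Charges'})])

df = df[mask]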

### 2.3 Data Type Conversion

Check data types of fields. All timestamp fields and some numerical fields are stored as objects and thus need to be converted.

In [873]:
df.dtypes

row                     int64
Development_Name       object
Borough                object
Account_Name           object
Location               object
Meter_AMR              object
Meter_Scope            object
TDS #                 float64
EDP                   float64
RC_Code                object
Funding_Source         object
AMP #                  object
Vendor_Name            object
UMIS_BILL_ID          float64
Revenue_Month          object
Service_Start_Date     object
Service_End_Date       object
# days                float64
Meter_Number           object
Estimated              object
Current_Charges        object
Rate_Class             object
Bill_Analyzed          object
Consumption_KWH       float64
KWH_Charges            object
Consumption_KW         object
KW_Charges             object
Other_Charges          object
dtype: object

2.3.1. Change the following fields from object to numerical:
    - "Consumption_KW", "Current_Charges", "KWH_Charges", "KW_Charges", "Other_Charges"

In [874]:
df["Consumption_KW"] = df["Consumption_KW"].apply(lambda x: x.replace(",","") if type(x) == str else str(x))
df["Consumption_KW"] = df["Consumption_KW"].astype(float)

In [875]:
df["Current_Charges"] = df["Current_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["Current_Charges"] = df["Current_Charges"].astype(float)

In [876]:
df["KWH_Charges"] = df["KWH_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["KWH_Charges"] = df["KWH_Charges"].astype(float)

In [877]:
df["KW_Charges"] = df["KW_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["KW_Charges"] = df["KW_Charges"].astype(float)

In [878]:
df["Other_Charges"] = df["Other_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["Other_Charges"] = df["Other_Charges"].astype(float)
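
The same string clean-up is repeated for each charge column above. As a minimal sketch, it could be factored into one function; the name `parse_money` is hypothetical, and the sample strings are illustrative rather than taken from the dataset. Like the lambdas above, it treats a parenthesised amount as negative.

```python
def parse_money(value):
    """Convert strings such as '$1,234.56' or '($1,109.09)' to float."""
    if isinstance(value, str):
        cleaned = value.replace("$", "").replace(",", "").strip()
        # Accounting notation: a parenthesised amount is negative
        if cleaned.startswith("(") and cleaned.endswith(")"):
            cleaned = "-" + cleaned[1:-1]
        return float(cleaned)
    return float(value)

print(parse_money("$1,234.56"))
print(parse_money("($1,109.09)"))
```

It could then be applied column by column, for example `df[col] = df[col].map(parse_money)` for each of the charge columns.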

2.3.2. Convert Revenue_Month and service date fields to datetime type.

In [879]:
df["Revenue_Month"] = df["Revenue_Month"].map(lambda x: datetime.strptime(x.split(" ")[0], '%m/%d/%Y'))
df['Service_Start_Date'] = df['Service_Start_Date'].map(lambda x: datetime.strptime(x, '%m/%d/%Y'))
df['Service_End_Date'] = df['Service_End_Date'].map(lambda x: datetime.strptime(x, '%m/%d/%Y'))

### 2.4 Clean up the Meter_Number field

Remove leading zeros and surrounding white space:

In [880]:
df['Meter_Number'] = df['Meter_Number'].apply(lambda x: x.lstrip("0").strip(" "))

Record the length of each meter number:

In [881]:
df['Meter_Length'] = df['Meter_Number'].apply(lambda x: len(x))

Check the distribution of meter number lengths to spot unusual formats:

In [882]:
df['Meter_Length'].value_counts()

7     257568
8       1847
12       456
5        412
6        292
18       287
10         1
Name: Meter_Length, dtype: int64

Certain meter numbers are recorded in different formats and need to be standardized.

In [883]:
df[df['Meter_Length'] == 12]['Meter_Number'].value_counts()

1860113_7500    68
7860113_7500    68
7860113_1600    66
1860113_1600    66
1096662-58.5    35
8096662-41.5    35
8096662-58.5    35
1096662-41.5    35
8096662 58-5    12
1096662 58-5    12
1096662 41-5    12
8096662 41-5    12
Name: Meter_Number, dtype: int64

In [884]:
mask = df['Meter_Number'] == '1096662 41-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '1096662-41.5'

In [885]:
mask = df['Meter_Number'] == '1096662 58-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '1096662-58.5'

In [886]:
mask = df['Meter_Number'] == '8096662 41-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '8096662-41.5'

In [887]:
mask = df['Meter_Number'] == '8096662 58-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '8096662-58.5'
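
The four manual replacements above all follow one pattern: `NNNNNNN DD-D` becomes `NNNNNNN-DD.D`. A minimal sketch of the same rewrite with a regular expression follows; the function name `normalize_meter_number` is hypothetical, and the pattern is an assumption inferred only from the twelve values listed above.

```python
import re

# Seven digits, a space, two digits, a hyphen, one digit
_SPACE_HYPHEN = re.compile(r'^(\d{7}) (\d{2})-(\d)$')

def normalize_meter_number(meter):
    """Rewrite '1096662 41-5' as '1096662-41.5'; leave other formats untouched."""
    return _SPACE_HYPHEN.sub(r'\1-\2.\3', meter)

print(normalize_meter_number('1096662 41-5'))
print(normalize_meter_number('1860113_7500'))
```

A single rule avoids the copy-and-paste slips that hand-written replacements invite, although it does not log which rows were changed.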

### 2.5 Correct values of Revenue_Month field

In some cases the service start and end dates fall in the same year, but the Revenue_Month falls in a different year.

In [888]:
df['start_date_year'] = df['Service_Start_Date'].apply(lambda x: datetime(x.year, 1, 1))

df['end_date_year'] = df['Service_End_Date'].apply(lambda x: datetime(x.year, 1, 1))

df['revenue_month_year'] = df['Revenue_Month'].apply(lambda x: datetime(x.year, 1, 1))

mask = ((df['end_date_year'] == df['start_date_year']) & (df['revenue_month_year'] != df['end_date_year']))

In [889]:
df[mask][['Revenue_Month', 'Service_Start_Date', 'Service_End_Date', 'Meter_Number']].sort_values(['Revenue_Month', 'Service_Start_Date', 'Meter_Number'])

Unnamed: 0,Revenue_Month,Service_Start_Date,Service_End_Date,Meter_Number
44361,2011-10-01,2010-09-22,2010-10-22,5934193
44362,2011-10-01,2010-09-22,2010-10-22,6439093
44363,2011-10-01,2010-09-22,2010-10-22,6443262
44364,2011-10-01,2010-09-22,2010-10-22,6443337
44365,2011-10-01,2010-09-22,2010-10-22,6443449
44366,2011-10-01,2010-09-22,2010-10-22,6443450
44367,2011-10-01,2010-09-22,2010-10-22,6443473
44368,2011-10-01,2010-09-22,2010-10-22,6443512
44369,2011-10-01,2010-09-22,2010-10-22,6443519
44370,2011-10-01,2010-09-22,2010-10-22,6443527


In [890]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Revenue_Month in wrong year'})])

Correct the cases where Revenue_Month is in the wrong year.

In [891]:
df.loc[mask, "Revenue_Month"] = datetime.strptime('10/01/2010', '%m/%d/%Y')
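
The correction above hard-codes October 2010, which matches the rows displayed. A more general sketch, assuming that only the year of Revenue_Month is wrong and the month is right, would take the year from the service end date instead. The function name `fix_revenue_month` is hypothetical.

```python
from datetime import datetime

def fix_revenue_month(revenue_month, service_start, service_end):
    """Move revenue_month into the service year when start and end share a year."""
    same_year = service_start.year == service_end.year
    if same_year and revenue_month.year != service_end.year:
        # Revenue months are first-of-month dates, so changing the year is always valid
        return revenue_month.replace(year=service_end.year)
    return revenue_month

fixed = fix_revenue_month(datetime(2011, 10, 1), datetime(2010, 9, 22), datetime(2010, 10, 22))
print(fixed)
```

Billing windows that straddle a year boundary are left unchanged, mirroring the mask used above.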

Remove the calculated fields.

In [892]:
df.drop(['start_date_year', 'end_date_year', 'revenue_month_year'], axis = 1, inplace = True)

### 2.6 Create a unique identifier for each account and remove unnecessary fields

According to the domain knowledge expert, each account is uniquely determined by the combination of building and meter number. The combination of TDS # and Location uniquely determines a building, and we can use EDP or RC_Code when TDS # is not available.

In [893]:
df['Building_ID'] = df['TDS #'].combine_first(df['EDP']).map(str).combine_first(df['RC_Code']) \
                    + " - " + df['Location'].map(lambda x: 'NA' if pd.isna(x) else x)

In [894]:
# Define a list of columns of interest
cols = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
        'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days', 
       'Current_Charges','Consumption_KWH', 'KWH_Charges',
       'Consumption_KW', 'KW_Charges', 'Other_Charges']
df = df[cols]

The combination of Building_ID and Revenue_Month is not a primary key of the data.

In [895]:
df.groupby(['Building_ID', 'Revenue_Month']).count().shape[0]/df.shape[0]

0.6327382572461407

The combination of Building_ID, meter number and revenue month is still not a primary key.

In [896]:
df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month']).count().shape[0]/df.shape[0]

0.9988039698999092

The combination of Building_ID, meter number and revenue month is almost a primary key. Adding the service start and end dates brings it even closer.

In [897]:
df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month', 'Service_Start_Date', 'Service_End_Date']).count().shape[0]/df.shape[0]

0.999528488133618

Adding Revenue_Month does not actually increase granularity; it is uniquely determined by the service start and end dates, and dropping it gives the same ratio.

In [898]:
df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count().shape[0]/df.shape[0]

0.999528488133618
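
The ratio checks above all compute the number of distinct key combinations divided by the number of rows, where 1.0 means the key is a primary key. A minimal sketch of that check as a reusable function follows; the name `key_uniqueness` and the toy data are hypothetical. It uses `drop_duplicates` rather than `groupby`, which by default drops groups with a null key, so results can differ slightly when key columns contain nulls.

```python
import pandas as pd

def key_uniqueness(frame, keys):
    """Share of rows carrying a distinct combination of `keys` (1.0 = primary key)."""
    return len(frame.drop_duplicates(subset=keys)) / len(frame)

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    'Building_ID': ['A', 'A', 'A', 'B'],
    'Meter_Number': ['1', '1', '2', '1'],
    'Service_Start_Date': pd.to_datetime(['2013-10-23'] * 4),
})
ratio_building = key_uniqueness(toy, ['Building_ID'])
ratio_all = key_uniqueness(toy, ['Building_ID', 'Meter_Number', 'Service_Start_Date'])
print(ratio_building, ratio_all)
```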

### 2.7 Ensure the 4 fields (Building_ID, Meter_Number, Service_Start_Date, Service_End_Date) uniquely determine each row

In [899]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

Many of these rows have zero values in all the numerical charge and consumption fields.

In [900]:
temp

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,75177,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2012-12-01,2012-11-21,2012-12-24,33.0,0.00,0.0,0.00,0.00,0.00,0.00
1,75178,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2012-12-01,2012-11-21,2012-12-24,33.0,0.00,0.0,0.00,54.43,1109.09,-1109.09
124,111642,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-01-01,2012-12-24,2013-01-24,31.0,0.00,0.0,0.00,0.00,0.00,0.00
125,111643,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-01-01,2012-12-24,2013-01-24,31.0,0.00,0.0,0.00,52.08,1105.73,-1105.73
180,111676,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-02-01,2013-01-24,2013-02-25,32.0,0.00,0.0,0.00,0.00,0.00,0.00
181,111677,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-02-01,2013-01-24,2013-02-25,32.0,0.00,0.0,0.00,52.94,1166.15,-1166.15
178,111710,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-03-01,2013-02-25,2013-03-26,29.0,0.00,0.0,0.00,0.00,0.00,0.00
179,111711,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-03-01,2013-02-25,2013-03-26,29.0,0.00,0.0,0.00,50.93,1169.81,-1169.81
176,111744,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-04-01,2013-03-26,2013-04-24,29.0,0.00,0.0,0.00,0.00,0.00,0.00
177,111745,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-04-01,2013-03-26,2013-04-24,29.0,0.00,0.0,0.00,51.46,1146.50,-1146.50


Remove those rows from the dataset.

In [901]:
mask = ((df['Current_Charges'] == 0) & (df['KWH_Charges'] == 0) & (df['KW_Charges'] == 0) \
  & (df['Other_Charges'] == 0) & (df['Consumption_KWH'] == 0) & (df['Consumption_KW'] == 0))

In [902]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'All charges being zero'})])

df = df[~mask]

In [903]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

Only 200 rows are left, most of which seem to be due to rebilling and duplicated data entries.

In [904]:
temp.shape

(200, 15)

In [905]:
temp

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,133599,WILLIAMSBURG,BLD 06,2.0 - BLD 06,5648698,2013-11-01,2013-10-23,2013-11-21,29.0,2877.11,20160.0,1039.85,48.00,856.14,981.12
1,133646,WILLIAMSBURG,BLD 06,2.0 - BLD 06,5648698,2013-11-01,2013-10-23,2013-11-21,29.0,2877.11,20160.0,1039.85,48.00,856.14,981.12
26,133623,WILLIAMSBURG,BLD 07,2.0 - BLD 07,6994237,2013-11-01,2013-10-23,2013-11-21,29.0,3128.25,21920.0,1130.63,34.40,613.56,1384.06
27,133635,WILLIAMSBURG,BLD 07,2.0 - BLD 07,6994237,2013-11-01,2013-10-23,2013-11-21,29.0,3128.25,21920.0,1130.63,34.40,613.56,1384.06
146,133628,WILLIAMSBURG,BLD 07,2.0 - BLD 07,7861523,2013-11-01,2013-10-23,2013-11-21,29.0,2854.29,20000.0,1031.60,56.00,998.83,823.86
147,133640,WILLIAMSBURG,BLD 07,2.0 - BLD 07,7861523,2013-11-01,2013-10-23,2013-11-21,29.0,2854.29,20000.0,1031.60,56.00,998.83,823.86
144,133597,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5536455,2013-11-01,2013-10-23,2013-11-21,29.0,2945.59,20640.0,1064.61,52.80,941.75,939.23
145,133644,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5536455,2013-11-01,2013-10-23,2013-11-21,29.0,2945.59,20640.0,1064.61,52.80,941.75,939.23
142,133602,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5652433,2013-11-01,2013-10-23,2013-11-21,29.0,2346.23,16440.0,847.98,38.40,684.91,813.34
143,133649,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5652433,2013-11-01,2013-10-23,2013-11-21,29.0,2346.23,16440.0,847.98,38.40,684.91,813.34


Identify the index of rows and delete them from the working dataframe df.

In [906]:
mask = df['row'].isin(temp['row'].values)

df = df[~mask]

For each group of duplicated rows, add the row with the smallest index back to the working dataframe df.

In [907]:
tempB = temp.groupby(list(temp.columns[1:])).agg({'row': 'min'}).reset_index()
cols = tempB.columns
cols = cols[0:-1].insert(0, cols[-1])
tempB = tempB[cols]

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, tempB])

Add flags to the df_flags dataframe.

In [908]:
merged = temp.merge(tempB, indicator=True, how='outer')
df_flags = pd.concat([df_flags, pd.DataFrame({'row':merged.loc[merged['_merge'] == 'left_only'].row.values, \
                                              'flag':'Duplicated rows'})])
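
The remove-then-add-back steps above keep, for each group of identical records, the one with the smallest `row` id. A minimal sketch of the same idea with `DataFrame.duplicated` is shown below on a reduced set of columns; the toy frame borrows a few row ids and charges from the tables above, and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'row': [133646, 133599, 133623, 133635, 56871],
    'Meter_Number': ['5648698', '5648698', '6994237', '6994237', '8125318'],
    'Current_Charges': [2877.11, 2877.11, 3128.25, 3128.25, 1306.02],
})

# Every column except the row id defines what counts as "the same record"
value_cols = [c for c in toy.columns if c != 'row']

# Sort by row id so that keep='first' retains the smallest id in each group
toy = toy.sort_values('row')
is_repeat = toy.duplicated(subset=value_cols, keep='first')

dropped_rows = toy.loc[is_repeat, 'row'].tolist()  # ids to log as 'Duplicated rows'
deduped = toy[~is_repeat]
print(dropped_rows)
```

Unlike a `groupby` on the value columns, `duplicated` does not discard records whose key columns contain nulls, which may matter here because Location has missing values.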

Check again which combinations of the 4 fields (Building_ID, Meter_Number, Service_Start_Date, Service_End_Date) have multiple rows.

In [909]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

temp

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,56871,THROGGS NECK,BLD 11,63.0 - BLD 11,8125318,2011-10-01,2011-09-22,2011-10-24,32.0,1306.02,12880.0,858.84,0.0,0.0,447.18
1,56872,THROGGS NECK,BLD 11,63.0 - BLD 11,8125318,2011-10-01,2011-09-22,2011-10-24,32.0,2693.18,26560.0,1771.02,0.0,0.0,922.16


Only 2 rows left, caused by rebilling (same account, same billing window).

In [910]:
mask = df['row'].isin(temp['row'].values)
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'rebill - same billing period'})])
df = df[~mask]

Add a column for Revenue_Year.

In [911]:
df.loc[:, 'Revenue_Year'] = df['Revenue_Month'].dt.year

Remove unnecessary columns and reorder the remaining.

In [912]:
cols = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
       'Revenue_Year', 'Revenue_Month', 'Service_Start_Date', 'Service_End_Date',
       '# days', 'Consumption_KWH', 'KWH_Charges', 'Consumption_KW','KW_Charges', 
        'Other_Charges', 'Current_Charges']

df = df[cols]

More than 25% of the values for all numerical variables except "Current_Charges" are 0. We need to investigate the cause of this.

We can set up a group of metrics based on these numerical variables to indicate the cleanliness of the dataset.

In [913]:
df[["Consumption_KWH",  "Consumption_KW", "Current_Charges", "KWH_Charges", "KW_Charges", "Other_Charges"]].describe()

Unnamed: 0,Consumption_KWH,Consumption_KW,Current_Charges,KWH_Charges,KW_Charges,Other_Charges
count,259313.0,259313.0,259313.0,259313.0,259313.0,259313.0
mean,32800.33,68.732254,4543.306841,1686.00063,1092.581876,1684.240735
std,53195.3,122.580708,6643.774008,2928.751687,1812.127834,3637.524167
min,0.0,0.0,-243.15,0.0,0.0,-59396.43
25%,0.0,0.0,412.46,0.0,0.0,0.0
50%,12000.0,32.8,2574.23,586.82,468.72,916.18
75%,48480.0,99.2,6092.67,2376.26,1610.76,2654.43
max,1779600.0,16135.46,329800.37,195575.86,78782.96,134224.51


After exploring the dataset, we noticed a few issues:
1. lots of rows have zero values in all numerical fields of charges and consumptions
2. lots of accounts only have non-zero values in kw_charges or kwh_charges
3. lots of rows where current_charges != sum of kwh_charges, kw_charges and other_charges
4. charges and energy usage values are not consistent (e.g. kw_charges == 0 whereas kw > 0)

We will create a set of metrics to indicate the cleanliness of the dataset and start to solve these data quality issues in the next step.

### Calculate Data Cleanliness Metrics regarding zero-values and meter types - 1st time

In [914]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select Building_ID, Meter_Number \
        , sum(case when KWH_Charges == 0 and KW_Charges > 0 then 1 else 0 end) as count_kw_only \
        , sum(case when KW_Charges == 0 and KWH_Charges > 0 then 1 else 0 end) as count_kwh_only \
        , sum(Current_Charges) as total_current_charges \
        , count(*) as count \
        from df \
        group by df.Building_ID, df.Meter_Number"
df_meter_type = pysql(str1)


df_meter_type['kwh_only'] = ((df_meter_type['count_kwh_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kw_only'] == 0)
df_meter_type['kw_only'] = ((df_meter_type['count_kw_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kwh_only'] == 0)

#### check the meters

print("perc of kw_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kw_only'] == 1) & (df_meter_type['kwh_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 1) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_and_kw accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 0) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))


#### check the building_ids

a = df_meter_type[df_meter_type['kwh_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
b =  df_meter_type[df_meter_type['kw_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
a.columns = ['Building_ID', 'Count']
b.columns = ['Building_ID', 'Count']

print("perc of buildings with both kw_only and kwh_only accounts:", \
     "{:.2%}".format(pd.merge(a, b, on = 'Building_ID', how = 'inner').shape[0] \
/ df_meter_type.groupby(['Building_ID']).agg('count').reset_index().shape[0]))

print("\n")

#### Check the statistics of zero-value rows:

print("perc of rows - current charges of zero:", "{:.2%}".format(df[df['Current_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kw charges of zero:", "{:.2%}".format(df[df['KW_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kwh charges of zero:", "{:.2%}".format(df[df['KWH_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - usage/charge inconsistency:", \
      "{:.2%}".format(df[((df['KWH_Charges'] == 0) ^ (df['Consumption_KWH'] == 0)) \
   | ((df['KW_Charges'] == 0) ^ (df['Consumption_KW'] == 0)) ].shape[0]\
    /df.shape[0]))

print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of kw_only accounts: 28.46%
perc of kwh_only accounts: 36.40%
perc of kwh_and_kw accounts: 35.13%
perc of buildings with both kw_only and kwh_only accounts: 52.79%


perc of rows - current charges of zero: 16.61%
perc of rows - kw charges of zero: 41.13%
perc of rows - kwh charges of zero: 33.03%
perc of rows - usage/charge inconsistency: 4.45%
perc of rows - sum of charges inconsistency: 29.34%
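
The "sum of charges inconsistency" metric above compares floating-point sums with exact `==`, which can be sensitive to rounding in the last binary digit. As a hedged alternative, the comparison could allow a tolerance of half a cent. The sketch below uses `numpy.isclose`; the first two rows reuse charges from the tables above, the third row is made up, and the 0.005 tolerance is an assumption. Whether this changes the 29.34% figure materially has not been checked here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Current_Charges': [2877.11, 3128.25, 100.00],
    'KWH_Charges':     [1039.85, 1130.63, 50.00],
    'KW_Charges':      [856.14, 613.56, 30.00],
    'Other_Charges':   [981.12, 1384.06, 10.00],   # last row is deliberately inconsistent
})

parts = toy['KWH_Charges'] + toy['KW_Charges'] + toy['Other_Charges']

# Treat totals within half a cent of the component sum as consistent
consistent = np.isclose(toy['Current_Charges'], parts, rtol=0, atol=0.005)
share_inconsistent = 1 - consistent.mean()
print("{:.2%}".format(share_inconsistent))
```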


## Step 3 - resolve the data quality issues

The high percentage of rows with zero kwh or kw charges might be caused by the kw_only and kwh_only accounts. We will check them later. First, let's look at the rows where current_charge is zero.

### 3.1 Issues regarding the current_charges field.

When current_charges == 0, all kwh_charges == 0 (as indicated by the NaN correlation coefficients with all other variables) and kw_charges seems to be negatively correlated with other_charges.

In [915]:
df[df['Current_Charges'] == 0][['KWH_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,KWH_Charges,KW_Charges,KWH_Charges.1,Other_Charges
KWH_Charges,,,,
KW_Charges,,1.0,,-0.694394
KWH_Charges,,,,
Other_Charges,,-0.694394,,1.0


When current_charges == 0, kw_charges == -other_charges 82.3% of the time, and kw_charges == other_charges in the remaining cases.

In [916]:
mask = (df['Other_Charges'] + df['KW_Charges'] == 0) & (df['Current_Charges'] == 0) & (df['KWH_Charges'] == 0)

In [917]:
print("{:.2%}".format(df[mask].shape[0]/df[df['Current_Charges'] == 0].shape[0]))

82.30%


In [918]:
df[(df['Current_Charges'] == 0) & ((df['Other_Charges'] == df['KW_Charges']) \
        | (df['Other_Charges'] + df['KW_Charges'] == 0))].shape[0] / \
df[df['Current_Charges'] == 0].shape[0]

1.0

Correct the rows where Other_Charges == KW_Charges with Other_Charges = -KW_Charges

In [919]:
mask = (df['Current_Charges'] == 0) & ((df['Other_Charges'] == df['KW_Charges']) & (df['KW_Charges'] != 0))

In [920]:
df.loc[mask, 'Other_Charges'] = df.loc[mask, 'KW_Charges'] * (-1)

Now when current_charges is zero, kwh_charges is zero, and kw_charges and other_charges either sum to zero or are both zero.

In [921]:
df[df['Current_Charges'] == 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
Current_Charges,,,,
KW_Charges,,1.0,,-1.0
KWH_Charges,,,,
Other_Charges,,-1.0,,1.0


Update the flag in df_flags:

In [922]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Sign of Other_Charges is incorrect'})])

This lowers the percentage of sum of charges inconsistency a little bit.

In [923]:
print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of rows - sum of charges inconsistency: 26.40%


When Current_Charges is negative, both KWH and KW charges are non-negative, so the negative total is mostly caused by negative Other_Charges values.

In [924]:
df[df['Current_Charges'] < 0].shape

(103, 16)

In [925]:
df[df['Current_Charges'] < 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
Current_Charges,1.0,0.231555,-0.664964,-0.078367
KW_Charges,0.231555,1.0,-0.399655,0.000617
KWH_Charges,-0.664964,-0.399655,1.0,-0.006223
Other_Charges,-0.078367,0.000617,-0.006223,1.0


In [926]:
df[df['Current_Charges'] < 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].describe()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
count,103.0,103.0,103.0,103.0
mean,-17.578932,1670.446408,8.59767,-339.93165
std,34.371892,892.339819,38.354575,1877.520472
min,-243.15,0.0,0.0,-4354.64
25%,-14.14,1097.61,0.0,-2042.915
50%,-9.95,1883.28,0.0,-626.16
75%,-5.56,2363.565,0.0,1722.43
max,-0.18,4340.94,187.81,3234.02


### How to check the other reasons for inconsistency?

#### Check Domain knowledge expert inputs
1. multiple account names for the same building_id?
2. meter number switch for the same account
3. overlapping/duplication of billing windows for the same account

### 3.2 Multiple names for the same building_id

In [927]:
df_id_per_name = df[['Building_ID', 'Account_Name']].groupby('Building_ID')['Account_Name'].nunique()
df_id_per_name.value_counts()

1    2036
2       4
7       1
3       1
Name: Account_Name, dtype: int64

Only 6 Building_IDs have multiple account names, so it is safe to remove them all from the working dataframe.

In [928]:
mask = df['Building_ID'].isin(df_id_per_name[df_id_per_name > 1].index.values)

In [929]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'multiple account_name for same building_id'})])
df = df[~mask]

### 3.3 Meter number merging

Some Building_IDs have meter numbers that changed over the years. We need to find the mapping and consolidate the meter numbers.

In [930]:
a = df.groupby(['Building_ID']).agg({'Meter_Number': 'nunique'}).reset_index()

a = a[a["Meter_Number"]>1]

a.columns = ['Building_ID', 'Counts']

a = pd.merge(a, df, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number', "Revenue_Month"]]\
.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month': ['max','min']}).reset_index()

a.columns = a.columns.get_level_values(0)

a.columns = ['Building_ID', 'Meter_Number', 'Max_Month', 'Min_Month']

a['Max_Month_Next'] = a['Max_Month'].map(lambda x: x + relativedelta(months=+1))
a['Min_Month_Prior'] = a['Min_Month'].map(lambda x: x - relativedelta(months=+1))
df_switch_meter = a

del(a)

In [931]:
df_switch_meter.head()

Unnamed: 0,Building_ID,Meter_Number,Max_Month,Min_Month,Max_Month_Next,Min_Month_Prior
0,1.0 - BLD 01,7836716,2018-06-01,2010-01-01,2018-07-01,2009-12-01
1,1.0 - BLD 01,7838586,2018-06-01,2010-01-01,2018-07-01,2009-12-01
2,1.0 - BLD 04,6255947,2014-04-01,2010-01-01,2014-05-01,2009-12-01
3,1.0 - BLD 04,7381828,2018-06-01,2014-05-01,2018-07-01,2014-04-01
4,1.0 - BLD 04,8638820,2017-06-01,2016-07-01,2017-07-01,2016-06-01


In [932]:
str1 = "select l.Building_ID, l.Meter_Number as Meter_Number_E, r.Meter_Number as Meter_Number_L \
        , l.Min_Month as Min_E, l.Max_Month as Max_E, r.Min_Month as Min_L, r.Max_Month as Max_L\
        from df_switch_meter l join df_switch_meter r on l.Building_ID = r.Building_ID and l.Meter_Number != r.Meter_Number \
        where l.Max_Month == r.Min_Month_Prior"
a = pysql(str1)

It seems for the same building, not only do the meter numbers change in later months, but also there can be multiple meter numbers for the same months.

In [933]:
a.head(10)

Unnamed: 0,Building_ID,Meter_Number_E,Meter_Number_L,Min_E,Max_E,Min_L,Max_L
0,1.0 - BLD 04,6255947,7381828,2010-01-01 00:00:00.000000,2014-04-01 00:00:00.000000,2014-05-01 00:00:00.000000,2018-06-01 00:00:00.000000
1,10.0 - BLD 05,1010026,1163877,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-05-01 00:00:00.000000
2,10.0 - BLD 05,8010026,8163877,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
3,10.0 - BLD 06,7864559,8163892,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
4,10.0 - BLD 09,1009984,1125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
5,10.0 - BLD 09,1009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
6,10.0 - BLD 09,8009984,1125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
7,10.0 - BLD 09,8009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
8,10.0 - BLD 11,7424010,7864545,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-03-01 00:00:00.000000
9,10.0 - BLD 13,1864535,1301063,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
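
The hand-over rule used in the SQL query (one meter's last revenue month is immediately followed by another meter's first revenue month) can also be sketched in pandas alone. The three meter spans below are copied from the `1.0 - BLD 04` rows of df_switch_meter shown earlier; the names `spans`, `earlier`, `later` and `Handover_Month` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# First and last revenue month per meter for one building (values from the table above).
spans = pd.DataFrame({
    'Building_ID': ['1.0 - BLD 04'] * 3,
    'Meter_Number': ['6255947', '7381828', '8638820'],
    'Min_Month': pd.to_datetime(['2010-01-01', '2014-05-01', '2016-07-01']),
    'Max_Month': pd.to_datetime(['2014-04-01', '2018-06-01', '2017-06-01']),
})

# Earlier side: the month right after the meter's last bill.
earlier = pd.DataFrame({
    'Building_ID': spans['Building_ID'],
    'Meter_Number_E': spans['Meter_Number'],
    'Handover_Month': spans['Max_Month'] + pd.DateOffset(months=1),
})

# Later side: the month of the meter's first bill.
later = pd.DataFrame({
    'Building_ID': spans['Building_ID'],
    'Meter_Number_L': spans['Meter_Number'],
    'Handover_Month': spans['Min_Month'],
})

# A hand-over exists where one meter starts exactly when another one stops.
handover = earlier.merge(later, on=['Building_ID', 'Handover_Month'])
```

For this building the only match is 6255947 followed by 7381828 in May 2014. Meter 8638820 overlaps in time with 7381828 and is not treated as a replacement.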


By looking at one example account, it seems there are separate meter_numbers for KWH and KW charges for the same account. This may explain the existence of accounts with only non-zero values in KW or KWH charges. We'll merge these meter_numbers first.

In [934]:
mask = (df['Building_ID'] == '10.0 - BLD 09') & (df['Meter_Number'].isin(['8125376', '1125376']))
df[mask]

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Year,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges,Current_Charges
42678,42678,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011,2011-10-01,2011-09-22,2011-10-24,32.0,0.0,0.00,67.01,581.65,316.95,898.60
42703,42703,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011,2011-10-01,2011-09-22,2011-10-24,32.0,33520.0,2235.11,0.00,0.00,1218.02,3453.13
42712,42712,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011,2011-11-01,2011-10-24,2011-11-22,29.0,0.0,0.00,51.55,447.45,245.48,692.93
42737,42737,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011,2011-11-01,2011-10-24,2011-11-22,29.0,27840.0,1856.37,0.00,0.00,1018.46,2874.83
42746,42746,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011,2011-12-01,2011-11-22,2011-12-23,31.0,0.0,0.00,50.69,439.99,196.83,636.82
42771,42771,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011,2011-12-01,2011-11-22,2011-12-23,31.0,28720.0,1915.05,0.00,0.00,856.75,2771.80
74790,74790,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2012,2012-01-01,2011-12-23,2012-01-25,33.0,0.0,0.00,49.20,1012.56,-1012.56,0.00
74815,74815,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2012,2012-01-01,2011-12-23,2012-01-25,33.0,31120.0,1707.24,0.00,0.00,2562.84,4270.08
74824,74824,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2012,2012-02-01,2012-01-25,2012-02-24,30.0,0.0,0.00,49.58,1080.87,-1080.87,0.00
74849,74849,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2012,2012-02-01,2012-01-25,2012-02-24,30.0,28320.0,1553.64,0.00,0.00,2249.54,3803.18


### 3.3.1 Identify accounts that have separate meters for KW and KWH charges and combine the meters

There are many cases where, under the same Building_ID, two meter numbers differ only in the first digit and share the same service date ranges. Usually the larger meter number has zero values in all KW_Charges and the smaller one has zero values in all KWH_Charges. It seems reasonable to combine them.
- (Exceptions do exist - some larger meter numbers have values in both KW and KWH)

In [935]:
temp = df.groupby(['Building_ID', 'Meter_Number']).agg('count').reset_index()[['Building_ID', 'Meter_Number']]

In [936]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select distinct l.Building_ID, l.Meter_Number, r.Meter_Number\
        from temp l join temp r on l.Building_ID = r.Building_ID and l.Meter_Number > r.Meter_Number \
        where substr(l.Meter_Number, 2, length(l.Meter_number)) == substr(r.Meter_Number, 2, length(r.Meter_number))"
df_meter_mapping = pysql(str1)

df_meter_mapping.columns = ['Building_ID', 'Meter_Number_L', 'Meter_Number_S']
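
The pairing rule (same building, same meter number apart from the first character) can also be sketched without SQL. The frame below is toy data: the first pair reuses numbers from the examples in this notebook, while the building labels "B1"/"B2" and the frame name `meters` are hypothetical.

```python
import pandas as pd

# Toy meter list: one pair in building B1 differs only in the first digit.
meters = pd.DataFrame({
    'Building_ID': ['B1', 'B1', 'B1', 'B2'],
    'Meter_Number': ['7864550', '1864550', '6255947', '8010023'],
})

# Key each meter by everything after its first character.
meters['suffix'] = meters['Meter_Number'].str[1:]

# Self-join on building and suffix, then keep each pair once,
# with the larger number on the left (string comparison, as in the SQL).
pairs = meters.merge(meters, on=['Building_ID', 'suffix'], suffixes=('_L', '_S'))
pairs = pairs[pairs['Meter_Number_L'] > pairs['Meter_Number_S']]
pairs = pairs[['Building_ID', 'Meter_Number_L', 'Meter_Number_S']].reset_index(drop=True)
```

Like the SQL version, this compares meter numbers as strings, so it relies on Meter_Number being stored as text.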

26.8% of the meter numbers can be mapped to another one.

In [937]:
str1 = "select count (distinct Meter_Number_S) as count_redudant_meters\
        from df_meter_mapping"
str2 = "select count (distinct Meter_Number) as count_meters\
        from temp"
pysql(str1)['count_redudant_meters'][0]/pysql(str2)['count_meters'][0]

0.267853488847964

In [938]:
del(temp)

In [939]:
df_meter_mapping.head()

Unnamed: 0,Building_ID,Meter_Number_L,Meter_Number_S
0,10.0 - BLD 01,7864550,1864550
1,10.0 - BLD 02,7864551,1864551
2,10.0 - BLD 03,8010023,1010023
3,10.0 - BLD 04,7864536,1864536
4,10.0 - BLD 05,8010026,1010026


Check if the two meters correspond to KWH_Charges and KW_Charges respectively, by comparing to the df_meter_type table obtained above

In [940]:
temp = pd.merge(df_meter_mapping, df_meter_type, left_on = ['Building_ID', 'Meter_Number_S']\
         , right_on = ['Building_ID', 'Meter_Number'], how = 'left')\
        [['Building_ID', 'Meter_Number_S', 'count_kwh_only', 'count_kw_only', 'count', 'kwh_only', 'kw_only', 'Meter_Number_L']]

temp.columns = ['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s',
       'Meter_Number_L']

temp = pd.merge(temp, df_meter_type, left_on = ['Building_ID', 'Meter_Number_L']\
         , right_on = ['Building_ID', 'Meter_Number'], how = 'left')\
        [['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s', 'Meter_Number_L', 'count_kwh_only', 'count_kw_only', 'count', 'kwh_only', 'kw_only']]

temp.columns = ['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s',
       'Meter_Number_L', 'count_kwh_only_l', 'count_kw_only_l', 'count_l', 'kwh_only_l', 'kw_only_l']

In [941]:
temp.head()

Unnamed: 0,Building_ID,Meter_Number_S,count_kwh_only_s,count_kw_only_s,count_s,kwh_only_s,kw_only_s,Meter_Number_L,count_kwh_only_l,count_kw_only_l,count_l,kwh_only_l,kw_only_l
0,10.0 - BLD 01,1864550,0,98,99,False,True,7864550,97,0,97,True,False
1,10.0 - BLD 02,1864551,0,98,99,False,True,7864551,95,0,95,True,False
2,10.0 - BLD 03,1010023,0,98,99,False,True,8010023,97,0,97,True,False
3,10.0 - BLD 04,1864536,0,98,99,False,True,7864536,97,0,97,True,False
4,10.0 - BLD 05,1010026,0,0,1,False,False,8010026,21,0,21,True,False


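Nearly all the "small" meter_numbers are kw_only meters (they only have non-zero values in KW charges), so it seems okay to map them to the corresponding "large" meter_numbers.

Column naming: the _s suffix refers to the smaller meter_number of a pair and the _l suffix to the larger one. For example, kwh_only_l is True when the larger meter_number only has non-zero values in KWH charges, and kw_only_s is True when the smaller meter_number only has non-zero values in KW charges.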

In [942]:
temp[(temp['kwh_only_l'] == False) & (temp['kw_only_l'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.060351413292589765

In [943]:
temp[(temp['kwh_only_s'] == False) & (temp['kw_only_s'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.0015278838808250573

In [944]:
temp[(temp['kwh_only_s'] == True) & (temp['kw_only_s'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.0

In [945]:
temp[(temp['kwh_only_s'] == False) & (temp['kw_only_s'] == True)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.998472116119175

#### Merge the meter numbers 

In [946]:
temp = pd.merge(df, df_meter_mapping, left_on = ['Building_ID', 'Meter_Number'], right_on = ['Building_ID','Meter_Number_S'], how = 'left')
# Meter_Number_New is the original Meter_Number if it's not mapped to another Meter_Number, and the corresponding Meter_Number_L otherwise
temp['Meter_Number_New'] = temp['Meter_Number_L'].combine_first(temp['Meter_Number']) 

df = temp

del(temp)

In [947]:
mask = df.Meter_Number.isin(df_meter_mapping.Meter_Number_S.values)
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'meter_number mapped to another one with similar pattern'})])

26.8 % of meter_numbers can be mapped to another one.

In [948]:
df_meter_mapping.Meter_Number_S.nunique()/df.Meter_Number.nunique()

0.267853488847964

In [949]:
df.drop(['Meter_Number', 'Meter_Number_L', 'Meter_Number_S'], axis = 1, inplace = True)

df.columns = ['row', 'Account_Name', 'Location', 'Building_ID', 'Revenue_Year',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges', 
       'Other_Charges', 'Current_Charges', 'Meter_Number']

col_ordered = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number', 'Revenue_Year',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges', 
       'Other_Charges', 'Current_Charges']

df = df[col_ordered]

### 3.3.2 Meter number switch over the year for the same account

In [950]:
a = df.groupby(['Building_ID']).agg({'Meter_Number': 'nunique'}).reset_index()

a = a[a["Meter_Number"]>1]

a.columns = ['Building_ID', 'Counts']

a = pd.merge(a, df, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number', "Revenue_Month"]]\
.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month': ['max','min']}).reset_index()

a.columns = a.columns.get_level_values(0)

a.columns = ['Building_ID', 'Meter_Number', 'Max_Month', 'Min_Month']

a['Max_Month_Next'] = a['Max_Month'].map(lambda x: x + relativedelta(months=+1))
a['Min_Month_Prior'] = a['Min_Month'].map(lambda x: x - relativedelta(months=+1))
df_switch_meter = a

del(a)

In [951]:
str1 = "select l.Building_ID, l.Meter_Number as Meter_Number_E, r.Meter_Number as Meter_Number_L \
        , l.Min_Month as Min_E, l.Max_Month as Max_E, r.Min_Month as Min_L, r.Max_Month as Max_L\
        from df_switch_meter l join df_switch_meter r on l.Building_ID = r.Building_ID and l.Meter_Number != r.Meter_Number \
        where l.Max_Month == r.Min_Month_Prior"
a = pysql(str1)

This time, most buildings show only one change of meter_number over the years (the remaining cases are handled separately below).

In [952]:
a.head(10)

Unnamed: 0,Building_ID,Meter_Number_E,Meter_Number_L,Min_E,Max_E,Min_L,Max_L
0,1.0 - BLD 04,6255947,7381828,2010-01-01 00:00:00.000000,2014-04-01 00:00:00.000000,2014-05-01 00:00:00.000000,2018-06-01 00:00:00.000000
1,10.0 - BLD 09,8009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
2,10.0 - BLD 13,7864535,8301063,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
3,10.0 - BLD 14,7864555,8301067,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
4,10.0 - BLD 16,8074933,8163879,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-03-01 00:00:00.000000
5,103.0 - BLD 04,6938100,8743232,2010-01-01 00:00:00.000000,2014-11-01 00:00:00.000000,2014-12-01 00:00:00.000000,2018-06-01 00:00:00.000000
6,11.0 - CLASON POINT GARDENS BLD 03,8322351,7402447,2013-02-01 00:00:00.000000,2016-02-01 00:00:00.000000,2016-03-01 00:00:00.000000,2018-06-01 00:00:00.000000
7,11.0 - CLASON POINT GARDENS BLD 13,8322342,8096758,2013-02-01 00:00:00.000000,2016-05-01 00:00:00.000000,2016-06-01 00:00:00.000000,2018-06-01 00:00:00.000000
8,11.0 - CLASON POINT GARDENS BLD 44,6689019,8362632,2010-08-01 00:00:00.000000,2013-02-01 00:00:00.000000,2013-03-01 00:00:00.000000,2018-06-01 00:00:00.000000
9,111.0 - BLD 01,7250619,8809141,2010-04-01 00:00:00.000000,2016-09-01 00:00:00.000000,2016-10-01 00:00:00.000000,2018-06-01 00:00:00.000000


To quantify the percentage of these buildings:

In [953]:
df_meter_switch = pd.DataFrame(a['Building_ID'].value_counts() > 1).reset_index()
df_meter_switch.columns = ['Building_ID', 'Multipe_Switch']

df_single_meter_switch = df_meter_switch[df_meter_switch['Multipe_Switch'] == False]
df_multiple_meter_switch = df_meter_switch[df_meter_switch['Multipe_Switch'] == True]

In [954]:
df_meter_switch.shape

(560, 2)

In [955]:
df_multiple_meter_switch.shape

(51, 2)

In [956]:
df_meter_switch = pd.merge(a, df_single_meter_switch, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number_E', 'Meter_Number_L']]

In [957]:
del(a)

#### 14.2% of the meters can be mapped to another meter

In [958]:
df_meter_switch['Meter_Number_E'].count() / df['Meter_Number'].nunique()

0.14221849678681195

#### Merge the meter numbers 

In [959]:
temp = pd.merge(df, df_meter_switch, left_on = ['Building_ID', 'Meter_Number'], right_on = ['Building_ID', 'Meter_Number_E'], how = 'left')
temp['Meter_Number_New'] = temp['Meter_Number_L'].combine_first(temp['Meter_Number'])
df = temp

df.drop(['Meter_Number', 'Meter_Number_L', 'Meter_Number_E'], axis = 1, inplace = True)

In [960]:
df.head()

Unnamed: 0,row,Account_Name,Location,Building_ID,Revenue_Year,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges,Current_Charges,Meter_Number_New
0,0,ADAMS,BLD 05,118.0 - BLD 05,2010,2010-01-01,2009-12-24,2010-01-26,33.0,128800.0,7387.97,216.0,2808.0,5200.85,15396.82,7223256
1,1,ADAMS,BLD 05,118.0 - BLD 05,2010,2010-02-01,2010-01-26,2010-02-25,30.0,115200.0,6607.87,224.0,2912.0,5036.47,14556.34,7223256
2,2,ADAMS,BLD 05,118.0 - BLD 05,2010,2010-03-01,2010-02-25,2010-03-26,29.0,103200.0,5919.55,216.0,2808.0,5177.43,13904.98,7223256
3,3,ADAMS,BLD 05,118.0 - BLD 05,2010,2010-04-01,2010-03-26,2010-04-26,31.0,105600.0,6057.22,208.0,2704.0,6002.82,14764.04,7223256
4,4,ADAMS,BLD 05,118.0 - BLD 05,2010,2010-05-01,2010-04-26,2010-05-24,28.0,97600.0,5598.34,216.0,2808.0,5323.2,13729.54,7223256


Rename the meter_number column and reorder the columns.

In [961]:
df.columns = ['row', 'Account_Name', 'Location', 'Building_ID', 'Revenue_Year',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges', 'Current_Charges', 'Meter_Number']
col_ordered = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number', 'Revenue_Year',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges', 'Current_Charges']
df = df[col_ordered]

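Log the corresponding rows in df_flags.

In [962]:
# Meter_Number already holds the consolidated (later) number here, so match on Meter_Number_L within the building; this flags every row of a switched account
mask = (df['Building_ID'] + '_' + df['Meter_Number']).isin(df_meter_switch['Building_ID'] + '_' + df_meter_switch['Meter_Number_L'])
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'meter_number switched to another for the same building_id'})])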

Log the rows whose Building_ID has a many-to-many meter_number switch over the years.

In [963]:
mask = df['Building_ID'].isin(df_multiple_meter_switch.Building_ID.values)

df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'many_to_many meter_number switch over years for the building_id'})])

### 3.4 Resolve other issues discovered

### 3.5 Consolidate data to Building-Meter-Service_Date_Range level
After combining the meter numbers in the 2 steps above, there are cases where 2 rows exist for the same meter and service date range (1 row for KW charges, 1 row for KWH charges).

In [964]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month', 'Service_Start_Date', 'Service_End_Date']).agg(['count'])['Account_Name'].reset_index()

In [965]:
idx['count'].value_counts()

1    109987
2     73572
Name: count, dtype: int64

See the example below, read starting from the 3rd row.

In [966]:
mask = (df['Building_ID'] == '70.0 - BLD 01') & (df['Revenue_Year'] == 2013) & ( (df['Meter_Number'] == '8095177') | (df['Meter_Number'] == '8095173'))
df[mask].sort_values(['Service_Start_Date', 'Meter_Number']).head(10)

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Year,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges,Current_Charges
77783,103362,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095173,2013,2013-04-01,2013-03-26,2013-04-24,29.0,45360.0,2339.67,0.0,0.0,4569.3,6908.97
77787,103366,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095177,2013,2013-04-01,2013-03-26,2013-04-24,29.0,42720.0,2203.5,0.0,0.0,4303.35,6506.85
77797,103376,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095173,2013,2013-05-01,2013-04-24,2013-05-23,29.0,0.0,0.0,90.53,2155.75,-2155.75,0.0
77811,103390,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095173,2013,2013-05-01,2013-04-24,2013-05-23,29.0,65040.0,3354.76,0.0,0.0,5421.44,8776.2
77801,103380,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095177,2013,2013-05-01,2013-04-24,2013-05-23,29.0,0.0,0.0,97.06,2311.25,-2311.25,0.0
77815,103394,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095177,2013,2013-05-01,2013-04-24,2013-05-23,29.0,75840.0,3911.83,0.0,0.0,6321.71,10233.54
77825,103404,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095173,2013,2013-06-01,2013-05-23,2013-06-24,32.0,0.0,0.0,116.16,2163.26,-2163.26,0.0
77839,103418,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095173,2013,2013-06-01,2013-05-23,2013-06-24,32.0,90480.0,5100.36,0.0,0.0,6561.4,11661.76
77829,103408,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095177,2013,2013-06-01,2013-05-23,2013-06-24,32.0,0.0,0.0,130.94,2438.51,-2438.51,0.0
77843,103422,CYPRESS HILLS,BLD 01,70.0 - BLD 01,8095177,2013,2013-06-01,2013-05-23,2013-06-24,32.0,105360.0,5939.14,0.0,0.0,7640.52,13579.66


Remove the multiple rows by aggregating at building, meter, revenue month, service_date_range level.

Since Location field has many null values, we can't aggregate by it.

In [967]:
df.isnull().sum()

row                      0
Account_Name             0
Location              5565
Building_ID              0
Meter_Number             0
Revenue_Year             0
Revenue_Month            0
Service_Start_Date       0
Service_End_Date         0
# days                   0
Consumption_KWH          0
KWH_Charges              0
Consumption_KW           0
KW_Charges               0
Other_Charges            0
Current_Charges          0
dtype: int64

In [968]:
df = df.groupby(['Account_Name', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Revenue_Year', 'Service_Start_Date',
       'Service_End_Date', '# days']).\
    agg({'Consumption_KW': 'sum', 'KW_Charges': 'sum', 'Consumption_KWH': 'sum', 'KWH_Charges': 'sum', 'Other_Charges': 'sum', 'Current_Charges': 'sum', 'row':'min'}).reset_index()
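
As a sanity check of this aggregation, here is a minimal sketch on the two May 2013 rows of meter 8095173 from the example above (rows 103376 and 103390). The frame name `pair` and the reduced column set are assumptions made to keep the example short.

```python
import pandas as pd

# One KW-only row and one KWH-only row for the same meter and service date range.
pair = pd.DataFrame({
    'Building_ID': ['70.0 - BLD 01'] * 2,
    'Meter_Number': ['8095173'] * 2,
    'Service_Start_Date': pd.to_datetime(['2013-04-24'] * 2),
    'Service_End_Date': pd.to_datetime(['2013-05-23'] * 2),
    'Consumption_KWH': [0.0, 65040.0],
    'KWH_Charges': [0.0, 3354.76],
    'Consumption_KW': [90.53, 0.0],
    'KW_Charges': [2155.75, 0.0],
    'Other_Charges': [-2155.75, 5421.44],
    'Current_Charges': [0.0, 8776.20],
    'row': [103376, 103390],
})

keys = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']
value_cols = ['Consumption_KWH', 'KWH_Charges', 'Consumption_KW',
              'KW_Charges', 'Other_Charges', 'Current_Charges']

# Sum the charges and consumptions; keep the smallest row id as the identifier.
agg_spec = {col: 'sum' for col in value_cols}
agg_spec['row'] = 'min'
merged_bill = pair.groupby(keys).agg(agg_spec).reset_index()
```

The two rows collapse into one bill that carries both the KW and the KWH figures. The negative Other_Charges on the KW-only row offsets its KW_Charges, so the summed components still add up to the summed Current_Charges.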

In [969]:
df.loc[:, 'Building_Meter'] = df['Building_ID'] + '_' + df['Meter_Number']

In [970]:
df.columns

Index(['Account_Name', 'Building_ID', 'Meter_Number', 'Revenue_Month',
       'Revenue_Year', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KW', 'KW_Charges', 'Consumption_KWH', 'KWH_Charges',
       'Other_Charges', 'Current_Charges', 'row', 'Building_Meter'],
      dtype='object')

Reorder the columns.

In [971]:
cols = ['row', 'Account_Name', 'Building_ID', 'Meter_Number', 'Building_Meter', 
       'Revenue_Month', 'Revenue_Year', 'Service_Start_Date',
       'Service_End_Date', '# days', 'Consumption_KW', 'KW_Charges',
       'Consumption_KWH', 'KWH_Charges', 'Other_Charges', 'Current_Charges'
       ]
df = df[cols]

Back up the dataframe before Step 4; remember to delete this copy at the end.

In [972]:
df_bak_prior_4 = df.copy()

## Step 4 - identify overlapping of billing periods

In [973]:
str1 = "select l.Building_ID, l.row as row_L, r.row as row_R \
        from df l join df r on l.Building_ID = r.Building_ID and l.Meter_Number = r.Meter_Number \
        and l.row != r.row and l.Service_Start_Date <= r.Service_Start_Date \
        and l.Service_End_Date > r.Service_Start_Date \
        "
a = pysql(str1)
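
A possible alternative to the SQL self-join, sketched under assumptions: sort the bills of each account by start date and compare each start date with the latest end date seen so far. The first two rows below are the Red Hook East bills (rows 50783 and 50749) that appear in this notebook; the other two rows and the account label 'OTHER_0000000' are made up. Unlike the SQL query, which returns both rows of an overlapping pair, this sketch flags only the later bill of each overlap.

```python
import pandas as pd

bills = pd.DataFrame({
    'row': [50783, 50749, 1, 2],
    'Building_Meter': ['4.0 - RED HOOK EAST BLD 05_6505127'] * 3 + ['OTHER_0000000'],
    'Service_Start_Date': pd.to_datetime(
        ['2010-12-23', '2010-12-23', '2011-03-25', '2011-01-01']),
    'Service_End_Date': pd.to_datetime(
        ['2011-01-25', '2011-03-25', '2011-04-25', '2011-02-01']),
})

bills = bills.sort_values(['Building_Meter', 'Service_Start_Date', 'Service_End_Date'])

# Latest end date among the earlier bills of the same account (NaT for the first bill).
prev_max_end = (bills.groupby('Building_Meter')['Service_End_Date']
                     .transform(lambda s: s.cummax().shift()))

# A bill overlaps an earlier one if it starts before that running maximum.
# Comparisons against NaT are False, so the first bill of an account is never flagged.
bills['overlaps_earlier'] = bills['Service_Start_Date'] < prev_max_end
overlapping_rows = bills.loc[bills['overlaps_earlier'], 'row'].tolist()
```

A bill that starts exactly on the previous end date is not counted as overlapping, which matches the strict `>` in the SQL condition.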

In [974]:
temp = df[(df['row'].isin(a.row_L)) | (df['row'].isin(a.row_R))]

In [975]:
temp.Account_Name.value_counts()

RED HOOK EAST/RED HOOK WEST          54
OCEAN BAY APARTMENTS (OCEANSIDE)     12
MORRISANIA AIR RIGHTS                 4
CASSIDY-LAFAYETTE                     2
TWIN PARKS WEST (SITES 1 & 2)         2
FHA REPOSSESSED HOUSES (GROUP IX)     2
LEHMAN VILLAGE                        2
Name: Account_Name, dtype: int64

In [977]:
temp[temp['Building_Meter'] == '4.0 - RED HOOK EAST BLD 05_6505127']

Unnamed: 0,row,Account_Name,Building_ID,Meter_Number,Building_Meter,Revenue_Month,Revenue_Year,Service_Start_Date,Service_End_Date,# days,Consumption_KW,KW_Charges,Consumption_KWH,KWH_Charges,Other_Charges,Current_Charges
124490,50783,RED HOOK EAST/RED HOOK WEST,4.0 - RED HOOK EAST BLD 05,6505127,4.0 - RED HOOK EAST BLD 05_6505127,2011-01-01,2011,2010-12-23,2011-01-25,33.0,86.4,1123.2,47520.0,2725.75,2511.59,6360.54
124491,50749,RED HOOK EAST/RED HOOK WEST,4.0 - RED HOOK EAST BLD 05,6505127,4.0 - RED HOOK EAST BLD 05_6505127,2011-03-01,2011,2010-12-23,2011-03-25,92.0,81.6,1072.22,39360.0,2282.49,2279.74,5634.45


In [989]:
# mask = (df['Building_Meter'] == '4.0 - RED HOOK EAST BLD 05_6505127')
# df.loc[mask, ['Building_Meter', 'Service_Start_Date', 'Service_End_Date']]

In [978]:
temp[['Building_ID', 'Meter_Number']].drop_duplicates().shape[0]/df[['Building_ID', 'Meter_Number']].drop_duplicates().shape[0]

0.012674683132921676

Only 7 account names (1.27% of the Building_ID / Meter_Number accounts) have overlapping billing periods. We'll just exclude the affected rows from the working dataset for now.

In [983]:
mask = df.row.isin(temp.row.values)

df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'overlapping billing periods'})])

df = df[~mask]

In [984]:
df.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_cleaned")

In [171]:
# df = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_cleaned")

### Calculate Metrics regarding zero-values and meter types - 2nd time

In [1030]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select Building_ID, Meter_Number \
        , sum(case when KWH_Charges == 0 and KW_Charges > 0 then 1 else 0 end) as count_kw_only \
        , sum(case when KW_Charges == 0 and KWH_Charges > 0 then 1 else 0 end) as count_kwh_only \
        , sum(Current_Charges) as total_current_charges \
        , count(*) as count \
        from df \
        group by df.Building_ID, df.Meter_Number"
df_meter_type = pysql(str1)


df_meter_type['kwh_only'] = ((df_meter_type['count_kwh_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kw_only'] == 0)
df_meter_type['kw_only'] = ((df_meter_type['count_kw_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kwh_only'] == 0)

#### check the meters

print("perc of kw_only meters:", "{:.2%}".format(df_meter_type[(df_meter_type['kw_only'] == 1) & (df_meter_type['kwh_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_only meters:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 1) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_and_kw meters:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 0) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))


#### check the building_ids

a = df_meter_type[df_meter_type['kwh_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
b =  df_meter_type[df_meter_type['kw_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
a.columns = ['Building_ID', 'Count']
b.columns = ['Building_ID', 'Count']

print("perc of buildings with both kw_only and kwh_only meters:", \
     "{:.2%}".format(pd.merge(a, b, on = 'Building_ID', how = 'inner').shape[0] \
/ df_meter_type.groupby(['Building_ID']).agg('count').reset_index().shape[0]))


#### Check the statistics of zero-value rows:

print("perc of rows - current charges of zero:", "{:.2%}".format(df[df['Current_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kw charges of zero:", "{:.2%}".format(df[df['KW_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kwh charges of zero:", "{:.2%}".format(df[df['KWH_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - consumption/charge inconsistency:", \
      "{:.2%}".format(df[((df['KWH_Charges'] == 0) ^ (df['Consumption_KWH'] == 0)) \
   | ((df['KW_Charges'] == 0) ^ (df['Consumption_KW'] == 0)) ].shape[0]\
    /df.shape[0]))

print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of kw_only meters: 2.50%
perc of kwh_only meters: 16.44%
perc of kwh_and_kw meters: 81.05%
perc of buildings with both kw_only and kwh_only meters: 0.34%
perc of rows - current charges of zero: 2.42%
perc of rows - kw charges of zero: 17.66%
perc of rows - kwh charges of zero: 5.98%
perc of rows - consumption/charge inconsistency: 6.28%
perc of rows - sum of charges inconsistency: 34.14%
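
One caveat on the "sum of charges inconsistency" metric: it compares floats with `==`, so rows that differ only by floating-point rounding count as inconsistent. Whether this explains any part of the 34.14% has not been checked here; the sketch below only shows how a tolerance-based comparison behaves on made-up values, with the half-cent tolerance being an assumption.

```python
import numpy as np
import pandas as pd

# Toy charges: row 0 differs from its sum only by floating-point rounding,
# row 1 matches to the cent, row 2 is off by 15.00.
charges = pd.DataFrame({
    'KWH_Charges': [0.1, 3354.76, 100.0],
    'KW_Charges': [0.2, 2155.75, 50.0],
    'Other_Charges': [0.0, 3265.69, 10.0],
    'Current_Charges': [0.3, 8776.20, 175.0],
})

parts = charges['KWH_Charges'] + charges['KW_Charges'] + charges['Other_Charges']

# Exact comparison: 0.1 + 0.2 is not exactly 0.3 in binary floating point.
exact_match = charges['Current_Charges'] == parts

# Comparison with an absolute tolerance of half a cent (assumed threshold).
close_match = np.isclose(charges['Current_Charges'], parts, rtol=0, atol=0.005)
```

With the tolerance, only the row that is really off by 15.00 remains inconsistent.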


In [1031]:
print("perc of rows - KWH consumption without KWH charges:", \
      "{:.2%}".format(df[((df['KWH_Charges'] == 0) & (df['Consumption_KWH'] != 0))].shape[0]\
    /df.shape[0]))

print("perc of rows - KW consumption without KW charges:", \
      "{:.2%}".format(df[((df['KW_Charges'] == 0) & (df['Consumption_KW'] != 0))].shape[0]\
    /df.shape[0]))

print("perc of rows - KWH Charges negative:", \
     "{:.2%}".format(df[df['KWH_Charges'] < 0].shape[0]\
    /df.shape[0]))

print("perc of rows - KW Charges negative:", \
     "{:.2%}".format(df[df['KW_Charges'] < 0].shape[0]\
    /df.shape[0]))

perc of rows - KWH consumption without KWH charges: 0.35%
perc of rows - KW consumption without KW charges: 5.78%
perc of rows - KWH Charges negative: 0.00%
perc of rows - KW Charges negative: 0.00%


## Step 5 - prorate the bills with the calendar month, instead of the revenue_month field originally in the dataset

#### We first need to find the corresponding calendar months for each bill. Like the revenue_month field, each calendar month is represented by its first day.

For each bill, add the corresponding calendar month for both the service start & end dates.

In [172]:
df['Start_Date_Month'] = df['Service_Start_Date'].apply(\
  lambda x: pd.to_datetime('-'.join([str(x.year), str(x.month)])))


df['End_Date_Month'] = df['Service_End_Date'].apply(\
  lambda x: pd.to_datetime('-'.join([str(x.year), str(x.month)])))


Create a smaller dataframe to work on the mapping.

In [173]:
cols = ['row', 'Start_Date_Month', 'End_Date_Month']
temp = df[cols].copy()

Create a new dataframe to store the mapping between row numbers and calendar months.

In [176]:
df_month_row_mapping = temp.copy()

There are cases where the billing window is longer than one calendar month.

For each bill, check whether the billing window spans more than one month.
If so, add the next calendar month to df_month_row_mapping and repeat from that month.

Re-run this chunk of code until the temp dataframe has zero rows.

In [181]:
temp.loc[:, 'Start_Date_Month_Next'] = \
temp.apply(lambda x: x['Start_Date_Month'] + relativedelta(months=+1), axis = 1)


temp.loc[:, 'Ind'] = \
temp.apply(lambda x: 1 if x['Start_Date_Month_Next'] < x['End_Date_Month'] else 0, axis = 1)


mask = temp['Ind'] == 1
temp = temp.loc[mask,['row', 'Start_Date_Month_Next', 'End_Date_Month']]
temp.columns = cols

df_month_row_mapping = pd.concat([df_month_row_mapping, temp])

print(temp.shape[0])

0
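
The cell above has to be re-run by hand until it prints 0. A possible alternative, sketched here under assumptions, is to build the list of calendar months for each bill in one pass with `pd.date_range` and then expand it with `DataFrame.explode` (which needs pandas 0.25 or later). The two bills and the names `toy` and `month_map` are hypothetical.

```python
import pandas as pd

# Two toy bills: row 10 spans four calendar months, row 11 stays within one.
toy = pd.DataFrame({
    'row': [10, 11],
    'Start_Date_Month': pd.to_datetime(['2010-12-01', '2011-05-01']),
    'End_Date_Month': pd.to_datetime(['2011-03-01', '2011-05-01']),
})

# One list of month starts per bill, from the start month through the end month inclusive.
toy['Month'] = [
    list(pd.date_range(start, end, freq='MS'))
    for start, end in zip(toy['Start_Date_Month'], toy['End_Date_Month'])
]

# One row per (bill, calendar month).
month_map = toy[['row', 'Month']].explode('Month').reset_index(drop=True)
month_map['Month'] = pd.to_datetime(month_map['Month'])
```

This yields the (row, Month) pairs directly, so the melt and the duplicate removal in the following cell would not be needed in this variant.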


Collapse into one column that contains all corresponding calendar months of a given bill.

In [182]:
temp = pd.melt(df_month_row_mapping, id_vars = df_month_row_mapping.columns[0:-2].values, value_vars = df_month_row_mapping[cols].columns[-2:])

temp.drop('variable', axis = 1, inplace = True)

# Handle cases where the service start and end dates fall in the same bill_month
temp = temp.drop_duplicates()

temp.columns  = ['row', 'Month']

Calculate the number of days in each bill_month; this is used to adjust the cases where only part of a bill_month is covered by the billing period.

In [1002]:
temp.loc[:, 'Month_#_Days'] = \
temp['Month'].map(lambda x: (x + relativedelta(months = 1) - x).days)

Calculate how many days of the bill_month are contained in the billing period. Here we assume the Service_Start_Date is included in the billing period whereas the Service_End_Date is not.

In [1004]:
temp = pd.merge(temp, df, on = 'row', how = 'left')

In [1005]:
temp.loc[:, 'Prorated_Days'] = \
temp.apply(lambda x: \
       (min(x['Month'] + relativedelta(months = 1), x['Service_End_Date']) \
        - max(x['Service_Start_Date'], x['Month']))\
       .days, axis = 1) 

In [1006]:
mask = temp.Prorated_Days == 0
temp = temp[~mask]

Calculate the prorated kwh consumption values based on the prorated days.

In [1007]:
temp.loc[:, 'Prorated_KWH'] = \
temp.apply(lambda x: x['Consumption_KWH'] * x['Prorated_Days'] / x['# days'], axis = 1)
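
A worked example of the proration, written with the standard library only. It uses the bill from 2011-09-22 to 2011-10-24 (32 days, 33,520 kWh) that appears earlier in this notebook and treats the billing window as including the start date and excluding the end date. The function names are hypothetical helpers, not part of the notebook.

```python
from datetime import date

def next_month_start(month_start):
    """First day of the month after `month_start` (itself a first-of-month date)."""
    if month_start.month == 12:
        return date(month_start.year + 1, 1, 1)
    return date(month_start.year, month_start.month + 1, 1)

def prorated_days(month_start, service_start, service_end):
    """Days of the window [service_start, service_end) that fall in the given month."""
    window_end = min(next_month_start(month_start), service_end)
    window_start = max(month_start, service_start)
    return (window_end - window_start).days

service_start, service_end = date(2011, 9, 22), date(2011, 10, 24)
total_days = (service_end - service_start).days                        # 32
days_sep = prorated_days(date(2011, 9, 1), service_start, service_end)   # Sep 22 to Sep 30
days_oct = prorated_days(date(2011, 10, 1), service_start, service_end)  # Oct 1 to Oct 23

# Split the consumption in proportion to the days in each month.
consumption_kwh = 33520.0
kwh_sep = consumption_kwh * days_sep / total_days
kwh_oct = consumption_kwh * days_oct / total_days
```

The bill is split into 9 days in September and 23 days in October, i.e. 9,427.5 kWh and 24,092.5 kWh, which together give back the original 33,520 kWh.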

#### Create a dataframe aggregated at the building_meter / billing_period level, with the corresponding bill_month attached.

#### This dataframe will be useful for calculating the gap days per bill month.

In [1008]:
df_with_calendar_month = temp.copy()

In [1009]:
cols = ['row', 'Account_Name', 'Building_ID',
       'Meter_Number', 'Building_Meter', 'Revenue_Month', 'Revenue_Year',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Consumption_KW',
       'KW_Charges', 'Consumption_KWH', 'KWH_Charges', 'Other_Charges',
       'Current_Charges', 'Month', 'Month_#_Days', 'Prorated_Days', 'Prorated_KWH']

df_with_calendar_month = df_with_calendar_month[cols]

In [1010]:
df_with_calendar_month.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_with_calendar_month")

In [7]:
# df_with_calendar_month = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_with_calendar_month")

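Sum up the prorated KWH consumption values per account (Building_Meter) and bill_month.

Then scale each sum to a full calendar month: Imputed_KWH = Prorated_KWH * Month_#_Days / Prorated_Days, so a month that is only partly covered by bills is extrapolated to its full length.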

In [1011]:
df_prorated = \
df_with_calendar_month.groupby(['Building_Meter','Month', 'Month_#_Days']).agg({'Prorated_KWH':'sum', 'Prorated_Days':'sum'}).reset_index()

df_prorated.loc[:, 'Imputed_KWH'] = \
df_prorated.apply(lambda x: x['Prorated_KWH']*x['Month_#_Days']/x['Prorated_Days'], axis = 1)
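
To illustrate the scaling, here is a minimal sketch. It assumes that the 24,092.5 kWh billed for 23 days of October 2011 (the October share of a 33,520 kWh, 32-day bill) is the only consumption recorded for that month. The function `imputed_kwh` is a hypothetical helper, and its zero-day guard is an addition that the notebook's lambda does not have.

```python
def imputed_kwh(prorated_kwh, prorated_days, month_days):
    """Scale the consumption observed on `prorated_days` days to a full month."""
    if prorated_days == 0:
        # No billed days in this month: nothing to extrapolate from.
        return 0.0
    return prorated_kwh * month_days / prorated_days

# October has 31 days but only 23 of them are covered by a bill.
partial_october = imputed_kwh(24092.5, 23, 31)

# A fully covered month is left unchanged.
full_october = imputed_kwh(32472.5, 31, 31)
```

The partially covered month is scaled from 24,092.5 kWh to 32,472.5 kWh, which assumes the daily consumption on the uncovered days equals the average of the covered days.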

Create a dataframe that maps the account (Building_Meter) with all the calendar months that it should have bills.

In [113]:
# list of unique meters
meters = df_with_calendar_month.Building_Meter.value_counts().index.values
df_month_meter = pd.DataFrame()

end = df_with_calendar_month['Month'].max()
start = df_with_calendar_month['Month'].min()
diff = (end.year - start.year) * 12 + end.month - start.month
# list of unique months
months = [start + relativedelta(months=x) for x in range(0, diff + 1)]

Note: the loop below takes a long time to run.

In [114]:
for j in range(len(meters)):
    mask = (df_with_calendar_month['Building_Meter'] == meters[j])
    start = df_with_calendar_month[mask]['Month'].min()
    end = df_with_calendar_month[mask]['Month'].max()
    start_index = months.index(start)
    end_index = months.index(end)
    
    temp_df = pd.DataFrame({'Building_Meter':meters[j], 'Month':months[start_index:end_index + 1]})
    temp_df.loc[:, 'Month_Type'] = 'Months_In_The_Middle'
    temp_df.loc[0, 'Month_Type'] = 'First_Month'
    temp_df.loc[temp_df.shape[0]-1, 'Month_Type'] = 'Last_Month'
    df_month_meter = pd.concat([df_month_meter, temp_df])
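
If the loop above is too slow, one possible alternative is to compute each meter's first and last month with a single `groupby` and concatenate only once at the end. This is a sketch on a made-up frame (`toy` stands in for `df_with_calendar_month`), and it assumes every `Month` value is a month-start timestamp.

```python
import pandas as pd

# Hypothetical stand-in for df_with_calendar_month: two meters with different spans
toy = pd.DataFrame({
    'Building_Meter': ['A_1', 'A_1', 'B_2', 'B_2'],
    'Month': pd.to_datetime(['2015-01-01', '2015-04-01', '2015-02-01', '2015-03-01']),
})

# first and last billed month per meter, computed in one pass
span = (toy.groupby('Building_Meter')['Month']
        .agg(first_month='min', last_month='max')
        .reset_index())

frames = []
for row in span.itertuples(index=False):
    # every calendar month between the first and last month (freq='MS' = month start)
    months_in_span = pd.date_range(row.first_month, row.last_month, freq='MS')
    frame = pd.DataFrame({'Building_Meter': row.Building_Meter, 'Month': months_in_span})
    frame['Month_Type'] = 'Months_In_The_Middle'
    frame.loc[0, 'Month_Type'] = 'First_Month'
    frame.loc[len(frame) - 1, 'Month_Type'] = 'Last_Month'
    frames.append(frame)

# a single concat instead of one concat per meter
month_meter = pd.concat(frames, ignore_index=True)
print(month_meter)
```

Unlike the loop above, this version resets the index of the result; the labeling of first, middle, and last months follows the same rules.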

In [149]:
df_prorated = pd.merge(df_month_meter, df_prorated, on = ['Building_Meter', 'Month'], how = 'left')

In [155]:
mask = df_prorated['Month_#_Days'].isnull()

df_prorated.loc[mask, 'Prorated_KWH'] = 0

df_prorated.loc[mask, 'Prorated_Days'] = 0

df_prorated.loc[mask, 'Imputed_KWH'] = 0

In [161]:
df_prorated.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_prorated")

In [12]:
# df_prorated = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_prorated")

In [160]:
# Create a trace
temp = df[df['Building_Meter'] == '341.0 - BLD 04_7835072']
trace1 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.Consumption_KWH,
    name = 'KWH Consumption by Revenue Month',
    yaxis = 'y'
)

temp = df_prorated[df_prorated['Building_Meter'] == '341.0 - BLD 04_7835072']
trace2 = go.Scatter(
    x = temp.Month,
    y = temp.Imputed_KWH,
    name = 'Prorated KWH Consumption by Bill Month',
    yaxis = 'y'
)

data = [trace1, trace2]

layout = go.Layout(
    title='KWH Consumptions over time',
    yaxis=dict(
        title='KWH Consumption',
        tickformat=",",
    ),
    legend=dict(x = -0.05, y=1.5)
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

## Step 6 - identify the billing gaps per calendar month

In [102]:
df_with_calendar_month.loc[:, 'Beginning_Of_Month_Bill'] = \
df_with_calendar_month.apply(lambda x: \
1 if (x['Service_End_Date'] <= x['Month'] + relativedelta(months = 1)) \
and (x['Service_Start_Date'] < x['Month'])\
else 0, axis = 1)

df_with_calendar_month.loc[:, 'End_Of_Month_Bill'] = \
df_with_calendar_month.apply(lambda x: \
1 if (x['Service_End_Date'] > x['Month'] + relativedelta(months = 1)) \
and (x['Service_Start_Date'] >= x['Month'])\
else 0, axis = 1)

In [117]:
df_month_gaps = \
df_with_calendar_month.groupby(['Building_Meter','Month', 'Month_#_Days'])\
.agg({'Prorated_Days':'sum', 'Beginning_Of_Month_Bill':'sum', 'End_Of_Month_Bill':'sum', 'row':'count'}).reset_index()

df_month_gaps.columns = ['Building_Meter', 'Month', 'Month_#_Days', 'Prorated_Days',
       'Beginning_Of_Month_Bill', 'End_Of_Month_Bill', 'Bill_Count']

In [120]:
df_month_gaps = pd.merge(df_month_meter, df_month_gaps, on = ['Building_Meter', 'Month'], how = 'left')

In [121]:
mask = df_month_gaps['Month_#_Days'].isnull()

In [124]:
df_month_gaps.loc[mask, 'Prorated_Days'] = 0
df_month_gaps.loc[mask, 'Bill_Count'] = 0
df_month_gaps.loc[mask, 'Beginning_Of_Month_Bill'] = 0
df_month_gaps.loc[mask, 'End_Of_Month_Bill'] = 0
df_month_gaps.loc[mask, 'Month_#_Days'] = df_month_gaps['Month'].map(lambda x: (x + relativedelta(months = 1) - x).days)

In [128]:
df_month_gaps.loc[:,'Gaps'] = df_month_gaps.apply(lambda x: x['Month_#_Days'] - x['Prorated_Days'], axis = 1)

In [132]:
mask = ((df_month_gaps['Bill_Count'] == 1) & (df_month_gaps['End_Of_Month_Bill'] == 1) & (df_month_gaps['Month_Type'] == 'First_Month')) \
| ((df_month_gaps['Bill_Count'] == 1) & (df_month_gaps['Beginning_Of_Month_Bill'] == 1) & (df_month_gaps['Month_Type'] == 'Last_Month'))


In [134]:
df_month_gaps.loc[mask, 'Gaps'] = np.nan
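
To make the gap rules above concrete, here is a small sketch on made-up rows. It applies the same logic: the gap is the number of days in the month not covered by any bill, except that a partially covered first or last month is not counted as a gap. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Month_Type': ['First_Month', 'Months_In_The_Middle', 'Months_In_The_Middle', 'Last_Month'],
    'Month_#_Days': [31, 28, 31, 30],
    'Prorated_Days': [8, 28, 0, 12],
    'Bill_Count': [1, 2, 0, 1],
    'Beginning_Of_Month_Bill': [0, 1, 0, 1],
    'End_Of_Month_Bill': [1, 1, 0, 0],
})

# days in the month that no bill covers (float so that NaN can be stored later)
toy['Gaps'] = (toy['Month_#_Days'] - toy['Prorated_Days']).astype(float)

# the account's first bill starts mid-month, or its last bill ends mid-month:
# the uncovered days are outside the account's lifetime, so they are not gaps
edge = ((toy['Bill_Count'] == 1) & (toy['End_Of_Month_Bill'] == 1)
        & (toy['Month_Type'] == 'First_Month')) \
     | ((toy['Bill_Count'] == 1) & (toy['Beginning_Of_Month_Bill'] == 1)
        & (toy['Month_Type'] == 'Last_Month'))
toy.loc[edge, 'Gaps'] = np.nan
print(toy[['Month_Type', 'Prorated_Days', 'Gaps']])
```

The month with no bill at all keeps its full length as the gap, while the first and last months end up with `NaN`.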

In [162]:
df_month_gaps.shape

(212336, 9)

In [163]:
df_prorated.shape

(212336, 7)

In [166]:
df_month_gaps.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_month_gaps")

In [5]:
# df_month_gaps = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_month_gaps")

# Continue to work here:

#### Trendline of % of accounts with missing data by calendar month

In [39]:
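# number of meters that should have a bill in each calendar month
# (df_month_meter itself has no meters_count column, so it is derived here)
meters_count = df_month_meter.groupby(['Month']).agg({'Building_Meter':'nunique'}).reset_index()
meters_count.columns = ['Month', 'meters_count']

# number of meters that actually have at least one bill in the month
temp = df_month_gaps[df_month_gaps['Bill_Count'] > 0].groupby(['Month']).agg({'Building_Meter':'nunique'}).reset_index()
temp.columns = ['Month', 'meters_with_data_count']
temp = pd.merge(temp, meters_count, on = 'Month', how = 'inner')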

temp['meter_with_data_perc'] = round(temp['meters_with_data_count'] / temp['meters_count'], 4)
temp = temp.sort_values('Month')
temp['meters_missing_data_count'] = temp['meters_count']  - temp['meters_with_data_count']

df_data_completeness_by_month = temp

In [40]:
# Create a trace
trace1 = go.Bar(
    x = df_data_completeness_by_month.Month,
    y = df_data_completeness_by_month.meters_count,
    name = '# of Meters that should have data in the month', 
    marker=dict(
        color='rgba(204,204,204,1)'
    ),
    yaxis= 'y'
)

trace2 = go.Scatter(
    x = df_data_completeness_by_month.Month,
    y = 1 - df_data_completeness_by_month.meter_with_data_perc,
    name = '% of Meters with missing data',
    yaxis = 'y2'
)

data = [trace1, trace2]

layout = go.Layout(
    title='Trend Line of Data Completeness',
    yaxis=dict(
        title='# of Meters that should have data in the month',
        tickformat=",",
    ),
    yaxis2=dict(
        title='% of Meters with missing data',
        tickformat=".1%",
        side='right',
        overlaying='y',
    ), 
    legend=dict(x = -0.05, y=1.5)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

#### Trendline of % of accounts with billing gaps (no data or 3+ days of gap) by revenue month

In [169]:
df_month_gaps.head()

| | Building_Meter | Month | Month_Type | Month_#_Days | Prorated_Days | Beginning_Of_Month_Bill | End_Of_Month_Bill | Bill_Count | Gaps |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 128.0 - BLD 05_7754298 | 2009-12-01 | First_Month | 31.0 | 8.0 | 0.0 | 1.0 | 1.0 | NaN |
| 1 | 128.0 - BLD 05_7754298 | 2010-01-01 | Months_In_The_Middle | 31.0 | 31.0 | 1.0 | 1.0 | 2.0 | 0.0 |
| 2 | 128.0 - BLD 05_7754298 | 2010-02-01 | Months_In_The_Middle | 28.0 | 28.0 | 1.0 | 1.0 | 2.0 | 0.0 |
| 3 | 128.0 - BLD 05_7754298 | 2010-03-01 | Months_In_The_Middle | 31.0 | 31.0 | 1.0 | 1.0 | 2.0 | 0.0 |
| 4 | 128.0 - BLD 05_7754298 | 2010-04-01 | Months_In_The_Middle | 30.0 | 30.0 | 1.0 | 1.0 | 2.0 | 0.0 |


In [44]:
df_data_completeness_by_month.head()

| | Month | meters_with_data_count | meters_count | meter_with_data_perc | meters_missing_data_count |
|---|---|---|---|---|---|
| 0 | 2009-12-01 | 1578 | 1578.0 | 1.0 | 0.0 |
| 1 | 2010-01-01 | 1753 | 1753.0 | 1.0 | 0.0 |
| 2 | 2010-02-01 | 1717 | 1849.0 | 0.9286 | 132.0 |
| 3 | 2010-03-01 | 1737 | 1881.0 | 0.9234 | 144.0 |
| 4 | 2010-04-01 | 1685 | 1869.0 | 0.9016 | 184.0 |


In [None]:
meters_missing_3_days = [df[(df['gaps'] > 3) & (df['Revenue_Month'] ==  month)].Building_Meter.nunique() for month in months]

df_gap_3days_by_month = pd.DataFrame({'Revenue_Month':months, 'meters_3days_count':meters_missing_3_days})

df_data_completeness_by_month = pd.merge(df_data_completeness_by_month, df_gap_3days_by_month)

df_data_completeness_by_month['meter_gaps_days_perc'] = (df_data_completeness_by_month['meters_3days_count'] \
                                                        + df_data_completeness_by_month['meters_missing_data_count']) \
                                                        /df_data_completeness_by_month['meters_count']

In [None]:
# Create a trace
trace1 = go.Bar(
    x = df_data_completeness_by_month.Revenue_Month,
    y = df_data_completeness_by_month.meters_count,
    name = '# of Accounts that should have data in the month', 
    marker=dict(
        color='rgba(204,204,204,1)'
    ),
    yaxis= 'y'
)

trace2 = go.Scatter(
    x = df_data_completeness_by_month.Revenue_Month,
    y = 1 - df_data_completeness_by_month.meter_with_data_perc,
    name = '% of Accounts with no data',
    yaxis = 'y2'
)

trace3 = go.Scatter(
    x = df_data_completeness_by_month.Revenue_Month,
    y = df_data_completeness_by_month.meter_gaps_days_perc,
    name = '% of Accounts with no data or 3+ days of gap', 
    yaxis= 'y2'
)

data = [trace1, trace2, trace3]

layout = go.Layout(
    title='Trend Line of Data Incompleteness',
    yaxis=dict(
        title='# of Accounts that should have data in the month',
        tickformat=",",
    ),
    yaxis2=dict(
        title='% of Accounts missing data',
        tickformat=".1%",
        side='right',
        overlaying='y',
    ), 
    legend=dict(x = -0.05, y= -0.4)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

##### average % of accounts that have no data or 3+ days of gap

In [None]:
np.mean(df_data_completeness_by_month.meter_gaps_days_perc)

In [None]:
df_month_gaps

In [None]:
# temp = df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month']).agg('count').reset_index().iloc[:, 0:4]
# temp.columns = ['Building_ID', 'Meter_Number', 'Revenue_Month', 'Row_Counts']

In [None]:
# df_multiple = pd.merge(df, temp[temp['Row_Counts']  > 1], on = ['Building_ID', 'Meter_Number', 'Revenue_Month'], how = 'inner').iloc[:, 0:15]
# df_single = pd.merge(df, temp[temp['Row_Counts']  == 1], on = ['Building_ID', 'Meter_Number', 'Revenue_Month'], how = 'inner').iloc[:, 0:15]

In [None]:
# # sort by building_id, revenue month, meter number
# df_multiple = df_multiple.sort_values(by = ['Meter_Number', 'Revenue_Month', 'Service_Start_Date'], ascending=[True, True, True])

# def merge_dates(grp):
#     # Find contiguous date groups, and get the first/last start/end date for each group.
#     dt_groups = (grp['Service_Start_Date'] != grp['Service_End_Date'].shift()).cumsum()
#     return grp.groupby(dt_groups).agg({'Service_Start_Date': 'first', 'Service_End_Date': 'last',
#         '# days':'sum', 'Consumption_KW':'sum', 'KW_Charges':'sum',
#        'Consumption_KWH':'sum', 'KWH_Charges':'sum', 'Other_Charges':'sum', 'Current_Charges':'sum'})

# # Perform a groupby and apply the merge_dates function, followed by formatting.
# df_multiple_concatenate = df_multiple.groupby(['Account_Name', 'Location', 'Building_ID', 'Meter_Number', 'Revenue_Month', 'Revenue_Year']).apply(merge_dates)
# df_multiple_concatenate = df_multiple_concatenate.reset_index().drop('level_6', axis = 1)
# df_multiple_concatenate = df_multiple_concatenate.reset_index().iloc[:, 1:16]

### 16. Find the gaps between service date ranges

We'd like to know how many accounts have gaps (> 3 days) in their billing windows.

#### Concatenate service date ranges for each building_id and meter_number, across all years

In [None]:
# sort by building_id, meter number
df = df.sort_values(by = ['Building_ID', 'Meter_Number', 'Service_Start_Date'], ascending=[True, True, True])

def merge_dates(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['Service_Start_Date'] != grp['Service_End_Date'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'Service_Start_Date': 'first', 'Service_End_Date': 'last'})

# Perform a groupby and apply the merge_dates function, followed by formatting.
df_gap = df.groupby(['Building_ID', 'Meter_Number']).apply(merge_dates)
df_gap = df_gap.reset_index().drop('level_2', axis = 1)
df_gap = df_gap.reset_index()
df_gap.columns = ['rowNum', 'Building_ID', 'Meter_Number', 
       'Service_Start_Date', 'Service_End_Date']

df_gap['nextRowNum'] = df_gap['rowNum'].map(lambda x: x+1)

# Join the dataframe with itself to find the gap between service ranges
df_gap = pd.merge(df_gap, df_gap[['Building_ID', 'Meter_Number', 'nextRowNum', 'Service_End_Date']],\
        left_on = ['Building_ID', 'Meter_Number', 'rowNum'], right_on = ['Building_ID', 'Meter_Number', 'nextRowNum'], how = 'left')

# consecutive days of billing for the same meter number
df_gap['consecutive_days'] = \
df_gap[['Service_End_Date_x', 'Service_Start_Date']].apply(lambda x: (x[0] - x[1]).days, axis = 1)

# number of days elapsed since the previous service range
df_gap['gap_days'] = \
df_gap[['Service_Start_Date', 'Service_End_Date_y']].apply(lambda x: (x[0] - x[1]).days, axis = 1)


# Rename and reorder the columns
df_gap = df_gap[['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date_x', 'consecutive_days', 'gap_days']]
df_gap.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'consecutive_days', 'gap_days']

df_gap['Building_Meter'] = df_gap['Building_ID'] + df_gap['Meter_Number']
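
The self-join above looks up the previous service range for each row. A possible shortcut, shown here as a sketch on made-up data, is to shift the end date within each meter with `groupby(...).shift()`. Unlike the cell above, this sketch does not first merge contiguous ranges, so it reports the gap before every bill rather than before every merged range.

```python
import pandas as pd

# Hypothetical bills for two meters
toy = pd.DataFrame({
    'Building_Meter': ['A_1', 'A_1', 'A_1', 'B_2'],
    'Service_Start_Date': pd.to_datetime(['2015-01-01', '2015-02-05', '2015-02-25', '2015-01-10']),
    'Service_End_Date': pd.to_datetime(['2015-02-01', '2015-03-01', '2015-03-20', '2015-02-10']),
})
toy = toy.sort_values(['Building_Meter', 'Service_Start_Date'])

# end date of the previous bill of the same meter (NaT for a meter's first bill)
prev_end = toy.groupby('Building_Meter')['Service_End_Date'].shift()

# positive = days without coverage, negative = overlapping bills
toy['gap_days'] = (toy['Service_Start_Date'] - prev_end).dt.days
print(toy)
```

In this toy data the second bill of meter `A_1` starts 4 days after the first one ends, and the third bill overlaps the second by 4 days.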

#### How often does a meter have gaps of at least 3 days across all years? ~83.2%

In [None]:
df_gap[df_gap['gap_days'] >= 3]['Building_Meter'].nunique() / df_gap['Building_Meter'].nunique()

#### Overlapping service date ranges - 0.71% of the meter accounts

In [None]:
mask = df_gap['gap_days'] < 0
df_gap[mask]

In [None]:
print("Perc of meters with overlapping service date ranges:", "{:.2%}".format(df_gap[mask]['Building_Meter'].agg('nunique')/df_gap['Building_Meter'].agg('nunique')))

In [None]:
df_gap[mask].gap_days.value_counts()

#### Examples

In [None]:
mask = (df['Building_ID'] == '79.0 - RED HOOK WEST BLD 03') \
& ((df['Meter_Number'] == '6477455')|(df['Meter_Number'] == '6477455') ) \
& (df['Revenue_Year'] == 2011)

df[mask].sort_values(['Revenue_Month', 'Service_Start_Date', 'Meter_Number'])

### Summarize gaps by days

In [None]:
df_gap_summary = df_gap[df_gap['gap_days'] > 0].groupby('Building_Meter').agg({'consecutive_days':'sum', 'gap_days':'sum'}).reset_index()

df_gap_summary['perc_gap'] = df_gap_summary['gap_days']/(df_gap_summary['consecutive_days'] + df_gap_summary['gap_days'])

#### Only 29.3% of the meters have fewer than 10% of their days missing

In [None]:
df_gap_summary[df_gap_summary['perc_gap'] < 0.1].shape[0]/ df_gap_summary.shape[0]

#### For the meters that have no gaps of 5 days or longer, most have only one revenue_month reported

In [None]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select a.m1 as Building_Meter from \
        (select distinct Building_Meter as m1\
        from df_gap) a \
        left join \
        (select distinct Building_Meter as m2, 1 as ind\
        from df_gap where gap_days >= 5) b \
        on a.m1 == b.m2 where b.ind is null \
        "
a = pysql(str1)
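
The SQL above is an anti-join: it keeps the meters that never appear in the set of meters with a gap of 5 days or more. The same result can be sketched in plain pandas with `isin`; the frame `toy_gap` below is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_gap
toy_gap = pd.DataFrame({
    'Building_Meter': ['A_1', 'A_1', 'B_2', 'C_3'],
    'gap_days': [0.0, 7.0, 2.0, float('nan')],
})

# meters that have at least one gap of 5 days or more
meters_with_long_gap = toy_gap.loc[toy_gap['gap_days'] >= 5, 'Building_Meter'].unique()

# all distinct meters, minus the ones found above
all_meters = toy_gap['Building_Meter'].drop_duplicates()
no_long_gap = (all_meters[~all_meters.isin(meters_with_long_gap)]
               .to_frame()
               .reset_index(drop=True))
print(no_long_gap)
```

Rows with a missing `gap_days` never satisfy the `>= 5` condition, so a meter with only missing values is kept, matching the behavior of the left join with `b.ind is null`.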

#### Only two meters have almost no gaps in all 8 years

In [None]:
pd.merge(a, df_gap, on = 'Building_Meter', how = 'inner').gap_days.value_counts()

#### Save the data for later use

In [None]:
df_gap.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_service_range_gaps")
df_gap_summary.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_service_range_gaps_summary")

### 17. Summarize gaps by revenue month (since we found that, in most cases, a service date range either misses the entire month or covers the whole month)

In [None]:
# sort by building_id, meter number and revenue month
df = df.sort_values(by = ['Building_ID', 'Meter_Number', 'Revenue_Month'], ascending=[True, True, True])
# take a copy so that adding a column below does not trigger SettingWithCopyWarning
a = df[['Building_ID', 'Meter_Number', 'Revenue_Month']].copy()
a.loc[:, 'Next_Revenue_Month'] = a['Revenue_Month'].map(lambda x: x + relativedelta(months=+1))

def merge_months(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['Revenue_Month'] != grp['Next_Revenue_Month'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'Revenue_Month': 'first', 'Next_Revenue_Month': 'last'})

# Perform a groupby and apply the merge_months function, followed by formatting.
df_gap_month = a.groupby(['Building_ID', 'Meter_Number']).apply(merge_months)
df_gap_month = df_gap_month.reset_index().drop('level_2', axis =1)

df_gap_month.columns = ['Building_ID', 'Meter_Number', 
       'Revenue_Month_Start', 'Revenue_Month_End']
    
df_gap_month.loc[:, 'Consecutive_Months'] = \
(df_gap_month['Revenue_Month_End'].dt.year - df_gap_month['Revenue_Month_Start'].dt.year) * 12 + \
(df_gap_month['Revenue_Month_End'].dt.month - df_gap_month['Revenue_Month_Start'].dt.month)

df_gap_month['Building_Meter'] = df_gap_month['Building_ID'] + df_gap_month['Meter_Number']
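
The run detection used in `merge_months` relies on a shift-and-cumsum pattern: a new run starts whenever a month does not equal the previous row's next month. A minimal sketch for a single hypothetical meter (the frame `toy` is made up, and `pd.DateOffset` is used here in place of `relativedelta` for a vectorized shift):

```python
import pandas as pd

# Hypothetical revenue months for one meter: Jan to Mar, then a gap, then Jun to Jul
toy = pd.DataFrame({'Revenue_Month': pd.to_datetime(
    ['2015-01-01', '2015-02-01', '2015-03-01', '2015-06-01', '2015-07-01'])})
toy['Next_Revenue_Month'] = toy['Revenue_Month'] + pd.DateOffset(months=1)

# True where a row does not continue the previous row; cumsum turns that into run ids
run_id = (toy['Revenue_Month'] != toy['Next_Revenue_Month'].shift()).cumsum()

# first month of each run and the month right after its last month
runs = toy.groupby(run_id).agg({'Revenue_Month': 'first', 'Next_Revenue_Month': 'last'})
runs['Consecutive_Months'] = \
    (runs['Next_Revenue_Month'].dt.year - runs['Revenue_Month'].dt.year) * 12 + \
    (runs['Next_Revenue_Month'].dt.month - runs['Revenue_Month'].dt.month)
print(runs)
```

The toy meter yields two runs, one of three months and one of two, so the three missing months between them show up as the difference between the span and the sum of `Consecutive_Months`.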

In [None]:
a = pd.merge(df.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month':'max'}).reset_index() \
, df.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month':'min'}).reset_index() \
, on = ['Building_ID', 'Meter_Number'], how = 'inner' \
)


a.columns = ['Building_ID', 'Meter_Number', 'Revenue_Month_max', 'Revenue_Month_min']

a.loc[:, 'Span_Months'] = \
(a['Revenue_Month_max'].dt.year - a['Revenue_Month_min'].dt.year) * 12 + \
(a['Revenue_Month_max'].dt.month - a['Revenue_Month_min'].dt.month) + 1

df_gap_month_summary = \
pd.merge(df_gap_month.groupby(['Building_ID', 'Meter_Number']).agg({'Consecutive_Months':'sum'}).reset_index()\
, a, on = ['Building_ID', 'Meter_Number'], how = 'inner')

del(a)

In [None]:
cols = ['Building_ID', 'Meter_Number', 'Consecutive_Months', 'Span_Months']
df_gap_month_summary = df_gap_month_summary[cols]

df_gap_month_summary.loc[:, 'Consecutive_Months_Perc'] = \
df_gap_month_summary['Consecutive_Months'] / df_gap_month_summary['Span_Months']

#### Save the data for later use

In [None]:
df_gap_month.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_revenue_month_gaps")
df_gap_month_summary.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_revenue_month_gaps_summary")

### 17. Combine rows to the Building-Meter-Month level and Building-Month level; add new aggregation metrics

We need to analyze anomalous values of charges and consumption at the Building-Meter-Month level and the Building-Month level.

In [None]:
df_combined_meter.shape

In [None]:
df.shape

In [None]:
df_combined_meter = df

df_combined_building = pd.pivot_table(df, values = ['Current_Charges','Consumption_KWH', 'KWH_Charges',\
       'Consumption_KW', 'KW_Charges', 'Other_Charges'], index=['Account_Name', 'Location', 'Building_ID',
       'Revenue_Month'], aggfunc = np.sum).reset_index()

In [None]:
df_combined_meter['Total_Charges'] = df_combined_meter['KW_Charges'] + df_combined_meter['KWH_Charges']
df_combined_meter['Total_Energy_Rate'] = df_combined_meter['Total_Charges']/df_combined_meter['Consumption_KWH']

df_combined_meter['Building_Meter'] = df_combined_meter['Building_ID'] + df_combined_meter['Meter_Number']

In [None]:
df_combined_building['Total_Charges'] = df_combined_building['KW_Charges'] + df_combined_building['KWH_Charges']
df_combined_building['Total_Energy_Rate'] = df_combined_building['Total_Charges']/df_combined_building['Consumption_KWH']

In [None]:
df_combined_building.shape

In [None]:
df_combined_building.head()

In [None]:
df_combined_meter.shape

In [None]:
df_combined_meter.head()

### 18. Save the cleaned data to the output folder

In [None]:
# data at Building_ID, Meter_Number, Revenue_Month level
df.to_pickle("../output/NYCHA_Electricity_2010_to_2018_CleanedDF")

In [None]:
# data at Building_ID, Meter_Number, Revenue_Month level
df_combined_meter.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_combined_meter")

In [None]:
# data at Building_ID, Revenue_Month level
df_combined_building.to_pickle("../output/NYCHA_Electricity_2010_to_2018_df_combined_building")

## To continue the work:

In [None]:
from __future__ import division
import pandas as pd
import numpy as np
import pandasql as pdsql
from datetime import datetime
from datetime import timedelta
from dateutil.relativedelta import *

# import matplotlib as mpl
import matplotlib.pyplot as plt
# Setup matplotlib to display in notebook:
%matplotlib inline

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)         # initiate notebook for offline plot


In [None]:
df_orig = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_original_dataset")

df = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_CleanedDF")

df_combined_meter = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_combined_meter")
df_combined_building = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_combined_building")

df_gap = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_service_range_gaps")
df_gap_summary = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_service_range_gaps_summary")

df_gap_month = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_revenue_month_gaps")
df_gap_month_summary = pd.read_pickle("../output/NYCHA_Electricity_2010_to_2018_df_revenue_month_gaps_summary")

#### Use SQL to explore the data

In [None]:
# pysql = lambda q: pdsql.sqldf(q, globals())
# str1 = "select count(*) \
#         from df \
#         "
# temp = pysql(str1)

#### How many meters per building?

In [None]:
df.groupby('Building_ID').agg({'Meter_Number':'nunique'}).reset_index()['Meter_Number'].value_counts()

#### Summary Statistics 

In [None]:
df[["Consumption_KWH",  "Consumption_KW", "Current_Charges", "KWH_Charges", "KW_Charges", "Other_Charges"]].describe()

#### Perc of accounts with no missing data for all months

In [None]:
a = df_gap.groupby('Building_Meter').agg({'gap_days':'sum'}).reset_index()

In [None]:
a[a['gap_days'] == 0].shape[0]/a.shape[0]

In [None]:
del(a)

#### Trend Line of Average Energy Charges

In [None]:
temp = df_combined_meter.groupby(['Revenue_Month']).\
agg({'Total_Charges':'mean', 'Total_Energy_Rate':'mean', 'KWH_Charges':'mean', 'KW_Charges':'mean'}).reset_index()

In [None]:
# Create traces
trace1 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.Total_Charges,
    name = 'Avg. Total Charge'
)
trace2 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.Total_Energy_Rate,
    name = 'Avg. Total Charge Rate', 
    yaxis='y2'
)

data = [trace1, trace2]

layout = go.Layout(
    title='Trend Line of Average Energy Charges',
    yaxis=dict(
        title='Avg. Total Charges($)',
        tickformat=","
    ),
    yaxis2=dict(
        title='Avg. Total Charge Rates($/KWH)',
        titlefont=dict(
            color='rgb(148, 103, 189)'
        ),
        tickfont=dict(
            color='rgb(148, 103, 189)'
        ),
#         tickformat=".2%",
        overlaying='y',
        side='right'
    ),
    legend=dict(x=-.1, y=1.2)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

#### Trend Line of Average KW and KWH Charges

In [None]:
# Create traces
trace1 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.KWH_Charges,
    name = 'Avg. KWH Charges'
)
trace2 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.KW_Charges,
    name = 'Avg. KW Charges', 
    yaxis='y2'
)

data = [trace1, trace2]

layout = go.Layout(
    title='Trend Line of Average KW and KWH Charges',
    yaxis=dict(
        title='Avg. KWH Charges($)',
        tickformat=","
    ),
    yaxis2=dict(
        title='Avg. KW Charges($)',
        titlefont=dict(
            color='rgb(148, 103, 189)'
        ),
        tickfont=dict(
            color='rgb(148, 103, 189)'
        ),
        tickformat=",",
        overlaying='y',
        side='right'
    ),
    legend=dict(x=-.1, y=1.2)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

## Q&A with Linnea:

1. Why would "Consumption_KW" be zero?
    - KW and KWH should both be positive, unless some related bills already cover it
    - Maybe one account was separated into multiple meters?
2. What are the "Other Charges"?
    - negative values to adjust for the payments from the previous month
    - taxes, meter-reading fees, small fees charged by utilities and states (e.g. system benefit charge), credits (state got a better deal after charging the clients)

## To Do:

2. Summarize all types of entries that don't make sense; flag and ignore them
   - Cases where other == kw and kwh == 0, why?
   - Cases where other == current and (kw != 0 or kwh != 0)
   - Negative values in KWH, KW
   - Inconsistency between consumption & charges
   - KW charge is offset by negative "other charge" (16.7%)
   - Meter accounts that only have non-zero values in either KW (0.8%) or KWH (16.9%) charges
   - Overlapping or duplication of service_date_ranges between rows (this also affects the prorated values)
3. Calendarize the bills (calculate avg. daily cost and consumption and multiply by # of days). All analyses on missing data and gaps should be based on calendarized bills.
4. Starting from 2015, does data quality get better? Are fewer meters missing data? (The government has required companies to submit utility data since that time.)
5. January is more likely to have missing data. Why? Check if that's true.
6. Check the distribution of % of accounts with gap days == 3, May 2010 and May 2010 have really high %...
7. Check the relationship between Building_ID and Account_Name. Is it a one-to-one mapping?
8. Check anomalies in the following order
    - KWH (consumption): only compare where there are months of data (ignore the gap months), or use usage per day and exclude the days with no consumption (instead of using the prorated value)
    - KWH_Charges
    - KW (capacity) consumption and charges (difference in daytime vs. nighttime, summer vs. winter, whole summer is at capacity, we will have really high charges for summer capacity use) (later metrics defined below)

####  Metrics to consider later

1) total capacity (kW) for all the meters for the month (building level aggregate)

2) Max kW value for the month (both building level and account level)

3) Max kW for each meter for the previous 12 months

4) Sum of the Max kW for each individual meter

5) The variance of Total Charge (sum of KWH_charge and KW_charge) at both account level and building level
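
A rough sketch of how metrics 1, 3 and 4 might be computed with pandas is given below. The data is made up, "previous 12 months" is interpreted as a trailing window of 12 rows that includes the current month (which only equals 12 calendar months when a meter has no gaps), and all variable names are hypothetical.

```python
import pandas as pd

# Hypothetical meter-month data for one building with two meters
toy = pd.DataFrame({
    'Building_ID': ['B1'] * 6,
    'Meter_Number': ['m1', 'm1', 'm1', 'm2', 'm2', 'm2'],
    'Revenue_Month': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01'] * 2),
    'Consumption_KW': [10.0, 14.0, 12.0, 5.0, 4.0, 6.0],
})
toy = toy.sort_values(['Building_ID', 'Meter_Number', 'Revenue_Month'])

# (1) total kW across all meters of a building in each month
building_total_kw = toy.groupby(['Building_ID', 'Revenue_Month'])['Consumption_KW'].sum()

# (3) max kW per meter over a trailing window of up to 12 rows
toy['Max_KW_12m'] = (toy.groupby(['Building_ID', 'Meter_Number'])['Consumption_KW']
                     .transform(lambda s: s.rolling(12, min_periods=1).max()))

# (4) sum over a building's meters of each meter's own maximum kW
sum_of_meter_max = (toy.groupby(['Building_ID', 'Meter_Number'])['Consumption_KW'].max()
                    .groupby(level='Building_ID').sum())

print(building_total_kw)
print(toy)
print(sum_of_meter_max)
```

Metric 4 can exceed the largest monthly building total from metric 1 because the meters need not peak in the same month.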

#### Edge case examples

##### 1. Check where df_combined_meter['Total_Charges'] < 0 or df_combined_meter['Consumption_KWH'] == 0

In [None]:
mask = (df_combined_meter['Consumption_KWH'] > 0) & (df_combined_meter['Total_Charges'] > 0)

mask = df_combined_meter['Consumption_KWH'] > 0
temp = df_combined_meter[mask].groupby(['Revenue_Month']).agg({\
        'Total_Charges':'mean', 'Total_Energy_Rate': 'mean', 'KWH_Charges':'mean', 'KW_Charges':'mean'}).reset_index()

temp.columns = ['Revenue_Month', 'Total_Charges', 'Total_Energy_Rate', 'KWH_Charges', 'KW_Charges']

temp = temp.sort_values('Revenue_Month')

##### 2. Check where other_charges is not zero, but all other metrics are zero

In [None]:
df[(df['Other_Charges'] != 0) & (df['Current_Charges'] == df['Other_Charges']) & (~((df['KWH_Charges'] == 0) & (df['KW_Charges'] == 0) \
  & (df['Consumption_KWH'] == 0) & (df['Consumption_KW'] == 0)))]