## Objective

The objective of this notebook is to become familiar with the electricity usage data and to clean and transform it in preparation for further statistical analysis and anomaly detection. More specifically, we need to:
- Resolve the data quality issues and clean the dataset so that there is only one row per account per billing period (an account is uniquely determined by the combination of building ID and meter number)
- Prorate the KWH consumption values by calendar month instead of the revenue month used in the original dataset
- Identify the billing gaps per account per calendar month
- Impute the KWH consumption per account per calendar month
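
To make the proration goal concrete, here is a minimal sketch of splitting one bill across calendar months. It is not the notebook's implementation: the function name `prorate_to_calendar_months` is hypothetical, and it assumes that consumption is uniform over the billing period and that the end date is exclusive (which matches the `# days` values in the data, e.g. 33 days from 2012-11-21 to 2012-12-24).

```python
from datetime import date

def prorate_to_calendar_months(start, end, kwh):
    """Split one bill's kWh across calendar months in proportion to days.

    Assumptions (not confirmed by the source data): usage is uniform over
    the billing period, and the period covers [start, end).
    """
    total_days = (end - start).days
    if total_days <= 0:
        raise ValueError("end must be after start")
    shares = {}
    day = start
    while day < end:
        # First day of the following month, capped at the bill's end date.
        next_month = date(day.year + day.month // 12, day.month % 12 + 1, 1)
        segment_end = min(next_month, end)
        days = (segment_end - day).days
        shares[(day.year, day.month)] = kwh * days / total_days
        day = segment_end
    return shares

# A 33-day bill: 10 days fall in November 2012 and 23 days in December 2012.
example = prorate_to_calendar_months(date(2012, 11, 21), date(2012, 12, 24), 3300.0)
```

Keying the result by `(year, month)` makes it straightforward to sum the shares of several bills that touch the same calendar month.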

## Known data quality issues according to the domain knowledge expert
1. Multiple account names might exist for the same building_id
2. Meter number may switch over the years for the same account
3. Overlapping/duplication of bills might exist

## Data Cleaning Steps
1. Read in data, create a dataframe to log the rows with issues
2. Conduct general data cleaning - remove null rows, convert data types, set up a group of metrics to indicate data cleanliness
3. Resolve the known issues & other issues discovered along the way
4. Identify duplication and overlapping of bills
5. Prorate the bills to calendar months
6. Identify the billing gaps per calendar month
7. Impute the KWH Consumption values per calendar month

## Step 1 - Read in data, create a dataframe to log the rows with issues

Below is a list of Python packages required for data processing and analysis:

In [1]:
import pandas as pd
import numpy as np
import pandasql as pdsql
import math

from datetime import timedelta, datetime
from dateutil.relativedelta import *

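# Note: plotly.plotly was moved to the separate chart_studio package in plotly 4;
# on newer versions use `import chart_studio.plotly as py` instead.
import plotly.plotly as py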
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
# initiate notebook for offline plot
init_notebook_mode(connected=True)         

### 1.1  Read in the data 

In [2]:
df = pd.read_csv("../data/NYC Open Data - Electric_Consumption_And_Cost__2010_-__June_2018_.csv", low_memory=False)

In [3]:
df.shape

(313147, 27)

Check the column names.

In [4]:
df.columns

Index(['Development Name', 'Borough', 'Account Name', 'Location', 'Meter AMR',
       'Meter Scope', 'TDS #', 'EDP', 'RC Code', 'Funding Source', 'AMP #',
       'Vendor Name', 'UMIS BILL ID', 'Revenue Month', 'Service Start Date',
       'Service End Date', '# days', 'Meter Number', 'Estimated',
       'Current Charges', 'Rate Class', 'Bill Analyzed', 'Consumption (KWH)',
       'KWH Charges', 'Consumption (KW)', 'KW Charges', 'Other charges'],
      dtype='object')

Change column names for easy reference.

In [5]:
df.columns = ['Development_Name', 'Borough', 'Account_Name', 'Location', 'Meter_AMR',
       'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source', 'AMP #',
       'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month', 'Service_Start_Date',
       'Service_End_Date', '# days', 'Meter_Number', 'Estimated',
       'Current_Charges', 'Rate_Class', 'Bill_Analyzed', 'Consumption_KWH',
       'KWH_Charges', 'Consumption_KW', 'KW_Charges', 'Other_Charges']

Check the number of empty values in each column. It seems there are 146 rows that contain null values only.

In [6]:
df.isnull().sum()

Development_Name         146
Borough                  146
Account_Name             146
Location                9041
Meter_AMR                187
Meter_Scope           296588
TDS #                   1717
EDP                      146
RC_Code                  146
Funding_Source           146
AMP #                   1657
Vendor_Name              146
UMIS_BILL_ID             146
Revenue_Month            146
Service_Start_Date       146
Service_End_Date         146
# days                   146
Meter_Number             146
Estimated                146
Current_Charges          146
Rate_Class               146
Bill_Analyzed            146
Consumption_KWH          146
KWH_Charges              146
Consumption_KW           146
KW_Charges               146
Other_Charges            146
dtype: int64

### 1.2  Save a copy of the dataframe before data cleaning to flag the rows with problems

Save df as df_orig before data cleaning and use the index as a unique row identifier to connect the two dataframes.

In [7]:
df = df.reset_index()

In [8]:
df.columns

Index(['index', 'Development_Name', 'Borough', 'Account_Name', 'Location',
       'Meter_AMR', 'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source',
       'AMP #', 'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Meter_Number',
       'Estimated', 'Current_Charges', 'Rate_Class', 'Bill_Analyzed',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges'],
      dtype='object')

In [9]:
df.columns = ['row', 'Development_Name', 'Borough', 'Account_Name', 'Location',
       'Meter_AMR', 'Meter_Scope', 'TDS #', 'EDP', 'RC_Code', 'Funding_Source',
       'AMP #', 'Vendor_Name', 'UMIS_BILL_ID', 'Revenue_Month',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Meter_Number',
       'Estimated', 'Current_Charges', 'Rate_Class', 'Bill_Analyzed',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges']

In [10]:
df_orig = df.copy()

Create a data frame to log the rows with data quality issues.

In [11]:
df_flags = pd.DataFrame(columns = ['row', 'flag'])
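
The same logging pattern (append a batch of row ids with a flag label) recurs throughout the notebook. A small helper could keep it in one place; this is only a sketch, and the name `log_flags` is hypothetical rather than part of the original notebook.

```python
import pandas as pd

def log_flags(df_flags, rows, flag):
    """Return df_flags with one new (row, flag) record per row id."""
    new = pd.DataFrame({'row': list(rows), 'flag': flag})
    return pd.concat([df_flags, new], ignore_index=True)

# Toy usage with made-up row ids.
flags_demo = pd.DataFrame(columns=['row', 'flag'])
flags_demo = log_flags(flags_demo, [3, 7], 'NULL Values in all columns')
```

Returning a new dataframe instead of mutating the input mirrors the `df_flags = pd.concat([...])` style used in the cells that follow.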

## Step 2 - General Data Cleaning

### 2.1 Remove empty rows

Find the rows that contain null values only. Account_Name is null in exactly these 146 rows, so it serves as the indicator.

In [12]:
mask = (pd.isna(df['Account_Name']) == True)

In [13]:
mask.sum()

146

Remove the problematic rows from the working dataframe df and log them in the df_flags dataframe.

In [14]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'NULL Values in all columns'})])
df = df[~mask]

All the rows with only null values have been removed. None of the columns of interest contains null values any more.

In [15]:
df.isnull().sum()

row                        0
Development_Name           0
Borough                    0
Account_Name               0
Location                8895
Meter_AMR                 41
Meter_Scope           296442
TDS #                   1571
EDP                        0
RC_Code                    0
Funding_Source             0
AMP #                   1511
Vendor_Name                0
UMIS_BILL_ID               0
Revenue_Month              0
Service_Start_Date         0
Service_End_Date           0
# days                     0
Meter_Number               0
Estimated                  0
Current_Charges            0
Rate_Class                 0
Bill_Analyzed              0
Consumption_KWH            0
KWH_Charges                0
Consumption_KW             0
KW_Charges                 0
Other_Charges              0
dtype: int64

### 2.2 Remove rows for which electricity charges were estimated

In [16]:
df['Estimated'].value_counts()

N             260863
Y              51749
NA               389
Name: Estimated, dtype: int64

In [17]:
df['Estimated'].value_counts().index.values

array(['N         ', 'Y         ', 'NA        '], dtype=object)

Identify & log the problematic rows, delete them from the working dataframe df.

In [18]:
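# Keep only the bills that were not estimated ('N'); the values are padded with trailing spaces.
mask = (df['Estimated'] ==  'N         ')

# Log the rows being dropped (estimated or 'NA'), not the rows being kept.
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[~mask].row.values, 'flag':'Inaccurate Values of Charges'})])
df = df[mask]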
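
Comparing against a space-padded literal is fragile if the padding ever changes. As a sketch (using a small made-up series rather than the real column), stripping the whitespace first gives the same selection without depending on the exact number of trailing spaces:

```python
import pandas as pd

# Toy stand-in for the padded values seen in df['Estimated'].
estimated = pd.Series(['N         ', 'Y         ', 'NA        ', 'N         '])

# Strip the padding before comparing, so 'N' matches regardless of width.
is_actual = estimated.str.strip() == 'N'
```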

### 2.3 Data Type Conversion

Check data types of fields. All timestamp fields and some numerical fields are stored as objects and thus need to be converted.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260863 entries, 0 to 313146
Data columns (total 28 columns):
row                   260863 non-null int64
Development_Name      260863 non-null object
Borough               260863 non-null object
Account_Name          260863 non-null object
Location              254029 non-null object
Meter_AMR             260840 non-null object
Meter_Scope           15032 non-null object
TDS #                 259446 non-null float64
EDP                   260863 non-null float64
RC_Code               260863 non-null object
Funding_Source        260863 non-null object
AMP #                 259500 non-null object
Vendor_Name           260863 non-null object
UMIS_BILL_ID          260863 non-null float64
Revenue_Month         260863 non-null object
Service_Start_Date    260863 non-null object
Service_End_Date      260863 non-null object
# days                260863 non-null float64
Meter_Number          260863 non-null object
Estimated             260863 non

2.3.1. Change these fields from object to numerical:
"Consumption_KW", "Current_Charges", "KWH_Charges", "KW_Charges", "Other_Charges"

In [20]:
df["Consumption_KW"] = df["Consumption_KW"].apply(lambda x: x.replace(",","") if type(x) == str else str(x))
df["Consumption_KW"] = df["Consumption_KW"].astype(float)

In [21]:
df["Current_Charges"] = df["Current_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["Current_Charges"] = df["Current_Charges"].astype(float)

In [22]:
df["KWH_Charges"] = df["KWH_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["KWH_Charges"] = df["KWH_Charges"].astype(float)

In [23]:
df["KW_Charges"] = df["KW_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["KW_Charges"] = df["KW_Charges"].astype(float)

In [24]:
df["Other_Charges"] = df["Other_Charges"].apply(lambda x: x.replace("$","").replace(",","").replace("(","-").replace(")","") if type(x) == str else str(x))
df["Other_Charges"] = df["Other_Charges"].astype(float)
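
The four charge columns are cleaned with the same chain of replacements. A minimal sketch of a shared helper is shown below; the name `parse_currency` is hypothetical, and it assumes, as the cells above do, that negative amounts are written in parentheses.

```python
def parse_currency(value):
    """Convert strings such as '$1,234.56' or '($12.00)' to float.

    Assumption: parentheses denote a negative amount (accounting style).
    Non-string inputs are passed straight to float().
    """
    if not isinstance(value, str):
        return float(value)
    cleaned = value.strip().replace('$', '').replace(',', '')
    if cleaned.startswith('(') and cleaned.endswith(')'):
        cleaned = '-' + cleaned[1:-1]
    return float(cleaned)

samples = [parse_currency('$1,234.56'), parse_currency('($12.00)'), parse_currency(5)]
```

With such a helper each column could be converted in one step, for example `df[col].map(parse_currency)` for every charge column.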

2.3.2. Convert Revenue_Month and service date fields to datetime type.

In [25]:
df["Revenue_Month"] = df["Revenue_Month"].map(lambda x: datetime.strptime(x.split(" ")[0], '%m/%d/%Y'))
df['Service_Start_Date'] = df['Service_Start_Date'].map(lambda x: datetime.strptime(x, '%m/%d/%Y'))
df['Service_End_Date'] = df['Service_End_Date'].map(lambda x: datetime.strptime(x, '%m/%d/%Y'))
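
As an alternative sketch, `pd.to_datetime` with an explicit format performs the same conversion in a vectorized way. The sample strings below are made up to match the `%m/%d/%Y` pattern used above.

```python
import pandas as pd

raw = pd.Series(['09/22/2010', '10/22/2010'])

# An explicit format avoids day/month guessing; strings that do not match
# raise an error instead of being silently misread.
parsed = pd.to_datetime(raw, format='%m/%d/%Y')
```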

### 2.4 Clean up the Meter_Number field

Remove leading zeros and surrounding white space:

In [26]:
df['Meter_Number'] = df['Meter_Number'].apply(lambda x: x.lstrip("0").strip(" "))

Compute the length of each meter number:

In [27]:
df['Meter_Length'] = df['Meter_Number'].apply(lambda x: len(x))

Check the distribution of meter number lengths:

In [28]:
df['Meter_Length'].value_counts()

7     257568
8       1847
12       456
5        412
6        292
18       287
10         1
Name: Meter_Length, dtype: int64

Certain meter numbers are recorded in different formats that need to be standardized.

In [29]:
df[df['Meter_Length'] == 12]['Meter_Number'].value_counts()

1860113_7500    68
7860113_7500    68
1860113_1600    66
7860113_1600    66
8096662-41.5    35
1096662-58.5    35
8096662-58.5    35
1096662-41.5    35
1096662 41-5    12
8096662 58-5    12
1096662 58-5    12
8096662 41-5    12
Name: Meter_Number, dtype: int64

In [30]:
mask = df['Meter_Number'] == '1096662 41-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '1096662-41.5'

In [31]:
mask = df['Meter_Number'] == '1096662 58-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '1096662-58.5'

In [32]:
mask = df['Meter_Number'] == '8096662 41-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '8096662-41.5'

In [33]:
mask = df['Meter_Number'] == '8096662 58-5'
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Inconsistent Format of Meter_Number'})])
df.loc[mask, 'Meter_Number'] = '8096662-58.5'
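
The four replacements above follow one pattern: `NNNNNNN DD-D` becomes `NNNNNNN-DD.D`. A regular expression can express that rule once. This is a sketch only; the pattern and the function name `normalize_meter_number` are assumptions derived from the twelve-character values listed above.

```python
import re

# Hypothetical pattern: 7 digits, a space, 2 digits, a hyphen, 1 digit.
_VARIANT = re.compile(r'^(\d{7}) (\d{2})-(\d)$')

def normalize_meter_number(meter):
    """Rewrite '1096662 41-5' style values as '1096662-41.5'; leave others unchanged."""
    return _VARIANT.sub(r'\1-\2.\3', meter)

normalized = [normalize_meter_number(m)
              for m in ['1096662 41-5', '8096662-58.5', '1860113_7500']]
```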

### 2.5 Correct values of the Revenue_Month field

In some cases the Service_Start_Date and Service_End_Date fall in the same year but the Revenue_Month is in a different year. We correct these rows manually.

In [34]:
df['start_date_year'] = df['Service_Start_Date'].apply(lambda x: datetime(x.year, 1, 1))
df['end_date_year'] = df['Service_End_Date'].apply(lambda x: datetime(x.year, 1, 1))
df['revenue_month_year'] = df['Revenue_Month'].apply(lambda x: datetime(x.year, 1, 1))

mask = ((df['end_date_year'] == df['start_date_year']) & (df['revenue_month_year'] != df['end_date_year']))

In [35]:
df[mask][['Revenue_Month', 'Service_Start_Date', 'Service_End_Date', 'Meter_Number']].sort_values(['Revenue_Month', 'Service_Start_Date', 'Meter_Number'])

Unnamed: 0,Revenue_Month,Service_Start_Date,Service_End_Date,Meter_Number
44361,2011-10-01,2010-09-22,2010-10-22,5934193
44362,2011-10-01,2010-09-22,2010-10-22,6439093
44363,2011-10-01,2010-09-22,2010-10-22,6443262
44364,2011-10-01,2010-09-22,2010-10-22,6443337
44365,2011-10-01,2010-09-22,2010-10-22,6443449
44366,2011-10-01,2010-09-22,2010-10-22,6443450
44367,2011-10-01,2010-09-22,2010-10-22,6443473
44368,2011-10-01,2010-09-22,2010-10-22,6443512
44369,2011-10-01,2010-09-22,2010-10-22,6443519
44370,2011-10-01,2010-09-22,2010-10-22,6443527


Log these rows in the df_flags dataframe.

In [36]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Revenue_Month in wrong year'})])

Correct the cases where Revenue_Month is in the wrong year.

In [37]:
df.loc[mask, "Revenue_Month"] = datetime.strptime('10/01/2010', '%m/%d/%Y')

Remove the calculated fields.

In [38]:
df.drop(['start_date_year', 'end_date_year', 'revenue_month_year'], axis = 1, inplace = True)
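
The correction above assigns one fixed date, which fits the rows displayed (all dated 2011-10-01 with service dates in 2010). If other months were affected, a more general rule would be needed. The sketch below is one possible rule and an assumption rather than the notebook's method: keep the revenue month but take the year from the service dates. The function name `fix_revenue_year` is hypothetical.

```python
from datetime import datetime

def fix_revenue_year(revenue_month, service_start, service_end):
    """Move revenue_month into the service year when both service dates
    share a year and revenue_month lies in a different one."""
    same_year = service_start.year == service_end.year
    if same_year and revenue_month.year != service_end.year:
        return revenue_month.replace(year=service_end.year)
    return revenue_month

fixed = fix_revenue_year(datetime(2011, 10, 1), datetime(2010, 9, 22), datetime(2010, 10, 22))
# A bill that crosses the year boundary is left untouched.
untouched = fix_revenue_year(datetime(2013, 1, 1), datetime(2012, 12, 24), datetime(2013, 1, 24))
```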

### 2.6 Create a unique identifier for each account and remove unnecessary columns

According to the domain knowledge expert, each account is uniquely determined by the combination of building ID and meter number. The combination of TDS # and Location uniquely determines a building, and we can use EDP or RC_Code when TDS # is not available.

In [39]:
df['Building_ID'] = df['TDS #'].combine_first(df['EDP']).map(str).combine_first(df['RC_Code']) \
                    + " - " + df['Location'].map(lambda x: 'NA' if pd.isna(x) else x)

Define a list of columns of interest.

In [40]:
cols = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
        'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days', 
       'Current_Charges','Consumption_KWH', 'KWH_Charges',
       'Consumption_KW', 'KW_Charges', 'Other_Charges']
df = df[cols]

Building_ID and Revenue_Month together do not form a primary key of the data.

In [41]:
df.groupby(['Building_ID', 'Revenue_Month']).count().shape[0]/df.shape[0]

0.6327382572461407

The combination of Building_ID, meter number and revenue month is almost a primary key.

In [42]:
df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month']).count().shape[0]/df.shape[0]

0.9988039698999092

Adding the billing range (specified by the Service_Start_Date and Service_End_Date fields) helps only slightly.

In [43]:
df.groupby(['Building_ID', 'Meter_Number', 'Revenue_Month', 'Service_Start_Date', 'Service_End_Date']).count().shape[0]/df.shape[0]

0.999528488133618

Revenue_Month does not add granularity: dropping it leaves the ratio unchanged, so within each account it is fully determined by Service_Start_Date and Service_End_Date.

In [44]:
df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count().shape[0]/df.shape[0]

0.999528488133618
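
The ratio computed in the cells above (number of distinct key combinations divided by number of rows) can be wrapped in a small function. The sketch below uses `drop_duplicates` instead of `groupby(...).count()` and runs on made-up data; the name `key_coverage` is hypothetical.

```python
import pandas as pd

def key_coverage(frame, keys):
    """Share of rows that remain after de-duplicating on `keys`.

    A value of 1.0 means the key columns identify every row uniquely.
    """
    return len(frame.drop_duplicates(subset=keys)) / len(frame)

# Toy data: the first two rows share the same key values.
bills = pd.DataFrame({
    'Building_ID': ['B1', 'B1', 'B1', 'B2'],
    'Meter_Number': ['111', '111', '222', '111'],
    'Service_Start_Date': ['2013-10-23', '2013-10-23', '2013-10-23', '2013-10-23'],
})
ratio = key_coverage(bills, ['Building_ID', 'Meter_Number', 'Service_Start_Date'])
```

Here `ratio` is 0.75 because three of the four rows carry distinct key values.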

### 2.7 Ensure the 4 fields (Building_ID, Meter_Number, Service_Start_Date, Service_End_Date) uniquely determine each row

In [45]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

In [46]:
temp.shape

(246, 15)

Many of these rows have zero values in all the numerical charge and consumption fields.

In [47]:
temp.head(10)

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,75177,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2012-12-01,2012-11-21,2012-12-24,33.0,0.0,0.0,0.0,0.0,0.0,0.0
1,75178,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2012-12-01,2012-11-21,2012-12-24,33.0,0.0,0.0,0.0,54.43,1109.09,-1109.09
124,111642,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-01-01,2012-12-24,2013-01-24,31.0,0.0,0.0,0.0,0.0,0.0,0.0
125,111643,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-01-01,2012-12-24,2013-01-24,31.0,0.0,0.0,0.0,52.08,1105.73,-1105.73
180,111676,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-02-01,2013-01-24,2013-02-25,32.0,0.0,0.0,0.0,0.0,0.0,0.0
181,111677,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-02-01,2013-01-24,2013-02-25,32.0,0.0,0.0,0.0,52.94,1166.15,-1166.15
178,111710,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-03-01,2013-02-25,2013-03-26,29.0,0.0,0.0,0.0,0.0,0.0,0.0
179,111711,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-03-01,2013-02-25,2013-03-26,29.0,0.0,0.0,0.0,50.93,1169.81,-1169.81
176,111744,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-04-01,2013-03-26,2013-04-24,29.0,0.0,0.0,0.0,0.0,0.0,0.0
177,111745,KINGSBOROUGH,BLD 06,10.0 - BLD 06,1864559,2013-04-01,2013-03-26,2013-04-24,29.0,0.0,0.0,0.0,51.46,1146.5,-1146.5


Remove those rows from the dataset.

In [48]:
mask = ((df['Current_Charges'] == 0) & (df['KWH_Charges'] == 0) & (df['KW_Charges'] == 0) \
  & (df['Other_Charges'] == 0) & (df['Consumption_KWH'] == 0) & (df['Consumption_KW'] == 0))

In [49]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'All charges being zero'})])

df = df[~mask]

In [50]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

There are still 200 rows left, most of which seem to be caused by rebilling and duplicated data entries.

In [51]:
temp.shape

(200, 15)

In [52]:
temp.head(20)

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,133599,WILLIAMSBURG,BLD 06,2.0 - BLD 06,5648698,2013-11-01,2013-10-23,2013-11-21,29.0,2877.11,20160.0,1039.85,48.0,856.14,981.12
1,133646,WILLIAMSBURG,BLD 06,2.0 - BLD 06,5648698,2013-11-01,2013-10-23,2013-11-21,29.0,2877.11,20160.0,1039.85,48.0,856.14,981.12
26,133623,WILLIAMSBURG,BLD 07,2.0 - BLD 07,6994237,2013-11-01,2013-10-23,2013-11-21,29.0,3128.25,21920.0,1130.63,34.4,613.56,1384.06
27,133635,WILLIAMSBURG,BLD 07,2.0 - BLD 07,6994237,2013-11-01,2013-10-23,2013-11-21,29.0,3128.25,21920.0,1130.63,34.4,613.56,1384.06
146,133628,WILLIAMSBURG,BLD 07,2.0 - BLD 07,7861523,2013-11-01,2013-10-23,2013-11-21,29.0,2854.29,20000.0,1031.6,56.0,998.83,823.86
147,133640,WILLIAMSBURG,BLD 07,2.0 - BLD 07,7861523,2013-11-01,2013-10-23,2013-11-21,29.0,2854.29,20000.0,1031.6,56.0,998.83,823.86
144,133597,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5536455,2013-11-01,2013-10-23,2013-11-21,29.0,2945.59,20640.0,1064.61,52.8,941.75,939.23
145,133644,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5536455,2013-11-01,2013-10-23,2013-11-21,29.0,2945.59,20640.0,1064.61,52.8,941.75,939.23
142,133602,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5652433,2013-11-01,2013-10-23,2013-11-21,29.0,2346.23,16440.0,847.98,38.4,684.91,813.34
143,133649,WILLIAMSBURG,BLD 08,2.0 - BLD 08,5652433,2013-11-01,2013-10-23,2013-11-21,29.0,2346.23,16440.0,847.98,38.4,684.91,813.34


Identify the index of rows and delete them from the working dataframe df.

In [53]:
mask = df['row'].isin(temp['row'].values)

df = df[~mask]

For each group of duplicated rows, add the row with the smallest index back to the working dataframe df.

In [54]:
tempB = temp.groupby(list(temp.columns[1:])).agg({'row': 'min'}).reset_index()
cols = tempB.columns
cols = cols[0:-1].insert(0, cols[-1])
tempB = tempB[cols]

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
df = pd.concat([df, tempB])
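
The "keep the row with the smallest index per group of identical rows" step can also be sketched with `sort_values` plus `drop_duplicates`. The toy frame below reuses a few values from the tables above but is otherwise made up. One difference worth knowing: `drop_duplicates` treats NaN values as equal to each other, whereas `groupby` drops groups with NaN keys by default.

```python
import pandas as pd

demo = pd.DataFrame({
    'row': [133646, 133599, 56871, 56872],
    'Meter_Number': ['5648698', '5648698', '8125318', '8125318'],
    'Consumption_KWH': [20160.0, 20160.0, 12880.0, 26560.0],
})

# Compare every column except the row identifier.
value_cols = [c for c in demo.columns if c != 'row']

# After sorting by row, keep='first' retains the smallest row id per group.
deduped = demo.sort_values('row').drop_duplicates(subset=value_cols, keep='first')
```

Rows 56871 and 56872 both survive because their consumption values differ, which corresponds to the rebilling case handled separately further down.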

Add flags to the df_flags dataframe.

In [55]:
merged = temp.merge(tempB, indicator=True, how='outer')
df_flags = pd.concat([df_flags, pd.DataFrame({'row':merged.loc[merged['_merge'] == 'left_only'].row.values, \
                                              'flag':'Duplicated rows'})])

Check again which combinations of the 4 fields (Building_ID, Meter_Number, Service_Start_Date, Service_End_Date) have multiple rows.

In [56]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).count()['Account_Name'].reset_index()
idx.columns = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date', 'Counts']
idx = idx[idx['Counts'] > 1]

dupRows = idx.sort_values('Counts', ascending = False)

temp = pd.merge(dupRows, df, on = \
         ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'], how = 'inner')[cols]\
        .sort_values(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date'])

temp

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Current_Charges,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges
0,56871,THROGGS NECK,BLD 11,63.0 - BLD 11,8125318,2011-10-01,2011-09-22,2011-10-24,32.0,1306.02,12880.0,858.84,0.0,0.0,447.18
1,56872,THROGGS NECK,BLD 11,63.0 - BLD 11,8125318,2011-10-01,2011-09-22,2011-10-24,32.0,2693.18,26560.0,1771.02,0.0,0.0,922.16


Only 2 rows left, caused by rebilling (same account, same billing window).

In [57]:
mask = df['row'].isin(temp['row'].values)
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'rebill - same billing period'})])
df = df[~mask]

Remove unnecessary columns and reorder the remaining.

In [58]:
cols = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date',
       '# days', 'Consumption_KWH', 'KWH_Charges', 'Consumption_KW','KW_Charges', 
        'Other_Charges', 'Current_Charges']

df = df[cols]

At least 25% of the values of every numerical variable except Current_Charges are 0 (as indicated by the 25th percentile values of these variables). We need to investigate the cause of this.

In [59]:
df[["Consumption_KWH",  "Consumption_KW", "Current_Charges", "KWH_Charges", "KW_Charges", "Other_Charges"]].describe()

Unnamed: 0,Consumption_KWH,Consumption_KW,Current_Charges,KWH_Charges,KW_Charges,Other_Charges
count,259313.0,259313.0,259313.0,259313.0,259313.0,259313.0
mean,32800.33,68.732254,4543.306841,1686.00063,1092.581876,1684.240735
std,53195.3,122.580708,6643.774008,2928.751687,1812.127834,3637.524167
min,0.0,0.0,-243.15,0.0,0.0,-59396.43
25%,0.0,0.0,412.46,0.0,0.0,0.0
50%,12000.0,32.8,2574.23,586.82,468.72,916.18
75%,48480.0,99.2,6092.67,2376.26,1610.76,2654.43
max,1779600.0,16135.46,329800.37,195575.86,78782.96,134224.51


After exploring the dataset, we noticed a few issues:
1. Many rows have zero values in all numerical charge and consumption fields
2. Many accounts have non-zero values in only one of kw_charges or kwh_charges (this suggests that for some accounts KWH_Charges and KW_Charges are billed separately, effectively making the account KWH-only or KW-only)
3. In many rows current_charges != sum of kwh_charges, kw_charges and other_charges
4. Charges and energy usage values are not always consistent (e.g. kw_charges == 0 whereas kw > 0)

We will create a set of metrics to indicate the dataset's degree of cleanliness and start to solve these data quality issues in the next step.

### 2.8 Calculate Data Cleanliness Metrics regarding zero-values and meter types - Initial Check

In [60]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select Building_ID, Meter_Number \
        , sum(case when KWH_Charges == 0 and KW_Charges > 0 then 1 else 0 end) as count_kw_only \
        , sum(case when KW_Charges == 0 and KWH_Charges > 0 then 1 else 0 end) as count_kwh_only \
        , sum(Current_Charges) as total_current_charges \
        , count(*) as count \
        from df \
        group by df.Building_ID, df.Meter_Number"
df_meter_type = pysql(str1)


df_meter_type['kwh_only'] = ((df_meter_type['count_kwh_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kw_only'] == 0)
df_meter_type['kw_only'] = ((df_meter_type['count_kw_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kwh_only'] == 0)

#### check the meters

print("perc of kw_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kw_only'] == 1) & (df_meter_type['kwh_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 1) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_and_kw accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 0) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))


#### check the building_ids

a = df_meter_type[df_meter_type['kwh_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
b =  df_meter_type[df_meter_type['kw_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
a.columns = ['Building_ID', 'Count']
b.columns = ['Building_ID', 'Count']

print("perc of buildings with both kw_only and kwh_only accounts:", \
     "{:.2%}".format(pd.merge(a, b, on = 'Building_ID', how = 'inner').shape[0] \
/ df_meter_type.groupby(['Building_ID']).agg('count').reset_index().shape[0]))

print("\n")

#### Check the statistics of zero-value rows:

print("perc of rows - current charges of zero:", "{:.2%}".format(df[df['Current_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kw charges of zero:", "{:.2%}".format(df[df['KW_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kwh charges of zero:", "{:.2%}".format(df[df['KWH_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - usage/charge inconsistency:", \
      "{:.2%}".format(df[((df['KWH_Charges'] == 0) ^ (df['Consumption_KWH'] == 0)) \
   | ((df['KW_Charges'] == 0) ^ (df['Consumption_KW'] == 0)) ].shape[0]\
    /df.shape[0]))

print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of kw_only accounts: 28.46%
perc of kwh_only accounts: 36.40%
perc of kwh_and_kw accounts: 35.13%
perc of buildings with both kw_only and kwh_only accounts: 52.79%


perc of rows - current charges of zero: 16.61%
perc of rows - kw charges of zero: 41.13%
perc of rows - kwh charges of zero: 33.03%
perc of rows - usage/charge inconsistency: 4.45%
perc of rows - sum of charges inconsistency: 29.34%
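
The sum-of-charges check above compares floats with exact equality, so a row can be counted as inconsistent merely because of floating-point rounding. Whether this affects the 29.34% figure is not known from the output. The sketch below shows a tolerance-based comparison; the function name `charges_consistent` and the half-cent tolerance are assumptions.

```python
import math

def charges_consistent(current, kwh, kw, other, tol=0.005):
    """True when current equals kwh + kw + other within half a cent."""
    return math.isclose(current, kwh + kw + other, abs_tol=tol)

# Exact equality fails on a classic rounding case, the tolerant check does not.
exact_equal = (0.3 == 0.1 + 0.2 + 0.0)
tolerant_equal = charges_consistent(0.3, 0.1, 0.2, 0.0)

# Values taken from one of the bills shown earlier in the notebook.
bill_ok = charges_consistent(2877.11, 1039.85, 856.14, 981.12)
```

On a whole dataframe the same idea could be applied with `numpy.isclose` on the column arrays.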


## Step 3 - Resolve the data quality issues

The high percentage of rows with zero kwh or kw charges might be caused by the kw_only and kwh_only accounts. Handling them is the main task of this section. But first let's look at a few smaller issues.

### 3.1 Handle issues regarding the current_charges field

When current_charges == 0, kwh_charges is always 0 (a constant column has no variance, hence the NaN correlation coefficients with all other variables), and kw_charges appears to be negatively correlated with other_charges.

In [61]:
df[df['Current_Charges'] == 0][['KWH_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,KWH_Charges,KW_Charges,KWH_Charges.1,Other_Charges
KWH_Charges,,,,
KW_Charges,,1.0,,-0.694394
KWH_Charges,,,,
Other_Charges,,-0.694394,,1.0


When current_charges is 0, kw_charges == -other_charges in 82.3% of the rows and kw_charges == other_charges in the remaining rows.

In [62]:
mask = (df['Other_Charges'] + df['KW_Charges'] == 0) & (df['Current_Charges'] == 0) & (df['KWH_Charges'] == 0)

In [63]:
print("{:.2%}".format(df[mask].shape[0]/df[df['Current_Charges'] == 0].shape[0]))

82.30%


In [64]:
df[(df['Current_Charges'] == 0) & ((df['Other_Charges'] == df['KW_Charges']) \
        | (df['Other_Charges'] + df['KW_Charges'] == 0))].shape[0] / \
df[df['Current_Charges'] == 0].shape[0]

1.0

For the rows where Other_Charges == KW_Charges, flip the sign so that Other_Charges = -KW_Charges.

In [65]:
mask = (df['Current_Charges'] == 0) & ((df['Other_Charges'] == df['KW_Charges']) & (df['KW_Charges'] != 0))

In [66]:
df.loc[mask, 'Other_Charges'] = df.loc[mask, 'KW_Charges'] * (-1)

Now, when current_charges is zero, kwh_charges is zero, and kw_charges and other_charges are either both zero or sum to zero.

In [67]:
df[df['Current_Charges'] == 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
Current_Charges,,,,
KW_Charges,,1.0,,-1.0
KWH_Charges,,,,
Other_Charges,,-1.0,,1.0


Update the flag in df_flags:

In [68]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'Sign of Other_Charges is incorrect'})])

This lowers the percentage of sum of charges inconsistency a little bit.

In [69]:
print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of rows - sum of charges inconsistency: 26.40%


When Current_Charges is negative, both KWH and KW charges are non-negative, so the negative totals are mostly caused by negative Other_Charges values.

In [70]:
df[df['Current_Charges'] < 0].shape

(103, 15)

In [71]:
df[df['Current_Charges'] < 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].corr()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
Current_Charges,1.0,0.231555,-0.664964,-0.078367
KW_Charges,0.231555,1.0,-0.399655,0.000617
KWH_Charges,-0.664964,-0.399655,1.0,-0.006223
Other_Charges,-0.078367,0.000617,-0.006223,1.0


In [72]:
df[df['Current_Charges'] < 0][['Current_Charges', 'KW_Charges', 'KWH_Charges', 'Other_Charges']].describe()

Unnamed: 0,Current_Charges,KW_Charges,KWH_Charges,Other_Charges
count,103.0,103.0,103.0,103.0
mean,-17.578932,1670.446408,8.59767,-339.93165
std,34.371892,892.339819,38.354575,1877.520472
min,-243.15,0.0,0.0,-4354.64
25%,-14.14,1097.61,0.0,-2042.915
50%,-9.95,1883.28,0.0,-626.16
75%,-5.56,2363.565,0.0,1722.43
max,-0.18,4340.94,187.81,3234.02


### 3.2 Other data quality checks on the charge fields

Per discussion with the domain knowledge expert, we decided to convert the bills from revenue month to calendar month and to focus on prorating the KWH consumption values only (other numerical fields are harder to prorate). We will therefore address the data quality issues in the charge fields in future iterations. Example issues to investigate include:
   - Cases where other == kw and kwh == 0
   - Cases where other == current and (kw!=0 or kwh != 0)
   - Negative values in KWH, KW charges
   - Inconsistency between consumption & charges
   - KW charge is offset by negative "other charge"
   - Meter accounts that only have non-zero values in either KW  or KWH  charges

#### Other background info from the domain knowledge expert

1. Why would "Consumption_KW" be zero?
    - KW and KWH should both be positive, unless some related bills already cover them
    - Maybe the bills of one account were split across multiple meters
2. What's the "Other Charges" field?
    - Negative values that adjust for payments from previous months
    - Taxes, meter-reading fees, small fees charged by utilities and states (e.g. system benefit charge), credits (state got a better deal after charging the clients)

### 3.3 Clean other fields based on inputs by the domain knowledge expert 
- multiple account names for the same building_id
- meter number switch for the same account over the years

#### 3.3.1 Multiple names for the same building_id

In [73]:
df_id_per_name = df[['Building_ID', 'Account_Name']].groupby('Building_ID')['Account_Name'].nunique()
df_id_per_name.value_counts()

1    2036
2       4
7       1
3       1
Name: Account_Name, dtype: int64

Only 6 Building_ID's have multiple account names. It's safe to remove them all from the working dataframe.

In [74]:
mask = df['Building_ID'].isin(df_id_per_name[df_id_per_name > 1].index.values)

In [75]:
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'multiple account_name for same building_id'})])
df = df[~mask]

#### 3.3.2 Meter number merging

There are Building_ID's whose meter number changed over the years, so we need to find the mapping and merge the meter numbers.

First, we connect two meters in the same building if their span of revenue months are adjacent to each other.

In [76]:
a = df.groupby(['Building_ID']).agg({'Meter_Number': 'nunique'}).reset_index()

a = a[a["Meter_Number"]>1]

a.columns = ['Building_ID', 'Counts']

a = pd.merge(a, df, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number', "Revenue_Month"]]\
.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month': ['max','min']}).reset_index()

a.columns = a.columns.get_level_values(0)

a.columns = ['Building_ID', 'Meter_Number', 'Max_Month', 'Min_Month']

a['Max_Month_Next'] = a['Max_Month'].map(lambda x: x + relativedelta(months=+1))
a['Min_Month_Prior'] = a['Min_Month'].map(lambda x: x - relativedelta(months=+1))

In [77]:
a.head()

Unnamed: 0,Building_ID,Meter_Number,Max_Month,Min_Month,Max_Month_Next,Min_Month_Prior
0,1.0 - BLD 01,7836716,2018-06-01,2010-01-01,2018-07-01,2009-12-01
1,1.0 - BLD 01,7838586,2018-06-01,2010-01-01,2018-07-01,2009-12-01
2,1.0 - BLD 04,6255947,2014-04-01,2010-01-01,2014-05-01,2009-12-01
3,1.0 - BLD 04,7381828,2018-06-01,2014-05-01,2018-07-01,2014-04-01
4,1.0 - BLD 04,8638820,2017-06-01,2016-07-01,2017-07-01,2016-06-01
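
The adjacency rule can be illustrated in plain Python on the three `1.0 - BLD 04` rows shown above. This is a minimal sketch, not the notebook's implementation (which uses a SQL self-join); the helper `next_month` is a hypothetical stand-in for `relativedelta(months=+1)`.

```python
from datetime import date

def next_month(d):
    """First day of the month after d (d is assumed to be the first day of a month)."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)

# (Meter_Number, Min_Month, Max_Month) for Building_ID '1.0 - BLD 04', copied from the table above
spans = [
    ('6255947', date(2010, 1, 1), date(2014, 4, 1)),
    ('7381828', date(2014, 5, 1), date(2018, 6, 1)),
    ('8638820', date(2016, 7, 1), date(2017, 6, 1)),
]

# Two meters are connected when the earlier one's last month is
# immediately followed by the later one's first month.
adjacent_pairs = [
    (earlier, later)
    for earlier, _, max_e in spans
    for later, min_l, _ in spans
    if earlier != later and next_month(max_e) == min_l
]
print(adjacent_pairs)  # [('6255947', '7381828')]
```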


In [78]:
str1 = "select l.Building_ID, l.Meter_Number as Meter_Number_E, r.Meter_Number as Meter_Number_L \
        , l.Min_Month as Min_E, l.Max_Month as Max_E, r.Min_Month as Min_L, r.Max_Month as Max_L\
        from a l join a r on l.Building_ID = r.Building_ID and l.Meter_Number != r.Meter_Number \
        where l.Max_Month == r.Min_Month_Prior"
a = pysql(str1)

It seems that, for the same building, not only do the meter numbers change in later months, but there can also be multiple meter numbers for the same month. For instance, in Building '10.0 - BLD 05', both meter '1010026' and meter '8010026' have data in 2010-01, and they differ only in the first digit.

In [79]:
a.head(10)

Unnamed: 0,Building_ID,Meter_Number_E,Meter_Number_L,Min_E,Max_E,Min_L,Max_L
0,1.0 - BLD 04,6255947,7381828,2010-01-01 00:00:00.000000,2014-04-01 00:00:00.000000,2014-05-01 00:00:00.000000,2018-06-01 00:00:00.000000
1,10.0 - BLD 05,1010026,1163877,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-05-01 00:00:00.000000
2,10.0 - BLD 05,8010026,8163877,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
3,10.0 - BLD 06,7864559,8163892,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
4,10.0 - BLD 09,1009984,1125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
5,10.0 - BLD 09,1009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
6,10.0 - BLD 09,8009984,1125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
7,10.0 - BLD 09,8009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-03-01 00:00:00.000000
8,10.0 - BLD 11,7424010,7864545,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-03-01 00:00:00.000000
9,10.0 - BLD 13,1864535,1301063,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000


Looking at one example account, it seems there are separate meter_numbers for KWH and KW charges for the same account. This may explain the existence of accounts with non-zero values in only the KW or only the KWH charges. We'll merge these meter_numbers first, since doing so will most likely eliminate the many-to-many mapping between meter_numbers in different years.

In [80]:
mask = (df['Building_ID'] == '10.0 - BLD 09') & (df['Meter_Number'].isin(['8125376', '1125376']))
df[mask].head(30)

Unnamed: 0,row,Account_Name,Location,Building_ID,Meter_Number,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges,Current_Charges
42678,42678,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011-10-01,2011-09-22,2011-10-24,32.0,0.0,0.0,67.01,581.65,316.95,898.6
42703,42703,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011-10-01,2011-09-22,2011-10-24,32.0,33520.0,2235.11,0.0,0.0,1218.02,3453.13
42712,42712,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011-11-01,2011-10-24,2011-11-22,29.0,0.0,0.0,51.55,447.45,245.48,692.93
42737,42737,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011-11-01,2011-10-24,2011-11-22,29.0,27840.0,1856.37,0.0,0.0,1018.46,2874.83
42746,42746,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2011-12-01,2011-11-22,2011-12-23,31.0,0.0,0.0,50.69,439.99,196.83,636.82
42771,42771,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2011-12-01,2011-11-22,2011-12-23,31.0,28720.0,1915.05,0.0,0.0,856.75,2771.8
74790,74790,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2012-01-01,2011-12-23,2012-01-25,33.0,0.0,0.0,49.2,1012.56,-1012.56,0.0
74815,74815,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2012-01-01,2011-12-23,2012-01-25,33.0,31120.0,1707.24,0.0,0.0,2562.84,4270.08
74824,74824,KINGSBOROUGH,BLD 09,10.0 - BLD 09,1125376,2012-02-01,2012-01-25,2012-02-24,30.0,0.0,0.0,49.58,1080.87,-1080.87,0.0
74849,74849,KINGSBOROUGH,BLD 09,10.0 - BLD 09,8125376,2012-02-01,2012-01-25,2012-02-24,30.0,28320.0,1553.64,0.0,0.0,2249.54,3803.18


#### 3.3.3 Identify accounts that have separated meters for KW and KWH charges and combine the meters

There are many cases where under the same Building_ID, two meter numbers differ only in the first digit and share the same service date ranges. Usually the larger meter number has zero values in all KW_Charges and the smaller one has zero values in all KWH_Charges. It seems reasonable to merge them.

In [81]:
temp = df.groupby(['Building_ID', 'Meter_Number']).agg('count').reset_index()[['Building_ID', 'Meter_Number']]

In [82]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select distinct l.Building_ID, l.Meter_Number, r.Meter_Number\
        from temp l join temp r on l.Building_ID = r.Building_ID and l.Meter_Number > r.Meter_Number \
        where substr(l.Meter_Number, 2, length(l.Meter_number)) == substr(r.Meter_Number, 2, length(r.Meter_number))"
df_meter_mapping = pysql(str1)

df_meter_mapping.columns = ['Building_ID', 'Meter_Number_L', 'Meter_Number_S']
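
The pairing rule (same digits after the first one, within the same building) can also be sketched without SQL. The four meter numbers below are the ones this notebook shows for Building '10.0 - BLD 05'; the variable names are illustrative assumptions.

```python
from itertools import combinations

# Meter numbers observed for one building ('10.0 - BLD 05')
meters = ['1010026', '8010026', '1163877', '8163877']

# Pair two meters when they are identical after dropping the first digit.
# The larger number plays the role of Meter_Number_L, the smaller of Meter_Number_S.
pairs = [
    (max(m1, m2), min(m1, m2))
    for m1, m2 in combinations(meters, 2)
    if m1[1:] == m2[1:]
]
print(pairs)  # [('8010026', '1010026'), ('8163877', '1163877')]
```

Because the meter numbers here have the same length, comparing them as strings gives the same ordering as comparing them as integers.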

26.8% of the meter numbers can be mapped to another.

In [83]:
str1 = "select count (distinct Meter_Number_S) as count_redundant_meters\
        from df_meter_mapping"
str2 = "select count (distinct Meter_Number) as count_meters\
        from temp"
pysql(str1)['count_redundant_meters'][0]/pysql(str2)['count_meters'][0]

0.267853488847964

In [84]:
df_meter_mapping.head()

Unnamed: 0,Building_ID,Meter_Number_L,Meter_Number_S
0,10.0 - BLD 01,7864550,1864550
1,10.0 - BLD 02,7864551,1864551
2,10.0 - BLD 03,8010023,1010023
3,10.0 - BLD 04,7864536,1864536
4,10.0 - BLD 05,8010026,1010026


Check if the two meters correspond to KWH_Charges and KW_Charges respectively, by joining with the df_meter_type table we just obtained above.

In [85]:
temp = pd.merge(df_meter_mapping, df_meter_type, left_on = ['Building_ID', 'Meter_Number_S']\
         , right_on = ['Building_ID', 'Meter_Number'], how = 'left')\
        [['Building_ID', 'Meter_Number_S', 'count_kwh_only', 'count_kw_only', 'count', 'kwh_only', 'kw_only', 'Meter_Number_L']]

temp.columns = ['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s',
       'Meter_Number_L']

temp = pd.merge(temp, df_meter_type, left_on = ['Building_ID', 'Meter_Number_L']\
         , right_on = ['Building_ID', 'Meter_Number'], how = 'left')\
        [['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s', 'Meter_Number_L', 'count_kwh_only', 'count_kw_only', 'count', 'kwh_only', 'kw_only']]

temp.columns = ['Building_ID', 'Meter_Number_S', 'count_kwh_only_s', 'count_kw_only_s', 'count_s', 'kwh_only_s', 'kw_only_s',
       'Meter_Number_L', 'count_kwh_only_l', 'count_kw_only_l', 'count_l', 'kwh_only_l', 'kw_only_l']

In [86]:
temp.head(20)

Unnamed: 0,Building_ID,Meter_Number_S,count_kwh_only_s,count_kw_only_s,count_s,kwh_only_s,kw_only_s,Meter_Number_L,count_kwh_only_l,count_kw_only_l,count_l,kwh_only_l,kw_only_l
0,10.0 - BLD 01,1864550,0,98,99,False,True,7864550,97,0,97,True,False
1,10.0 - BLD 02,1864551,0,98,99,False,True,7864551,95,0,95,True,False
2,10.0 - BLD 03,1010023,0,98,99,False,True,8010023,97,0,97,True,False
3,10.0 - BLD 04,1864536,0,98,99,False,True,7864536,97,0,97,True,False
4,10.0 - BLD 05,1010026,0,0,1,False,False,8010026,21,0,21,True,False
5,10.0 - BLD 05,1163877,0,98,98,False,True,8163877,74,0,74,True,False
6,10.0 - BLD 06,1864559,0,65,66,False,True,7864559,21,0,21,True,False
7,10.0 - BLD 06,1163892,0,19,19,False,True,8163892,77,0,77,True,False
8,10.0 - BLD 07,1010032,0,98,99,False,True,8010032,97,0,97,True,False
9,10.0 - BLD 08,1864549,0,98,99,False,True,7864549,96,0,96,True,False


Nearly all of the "smaller" meter_numbers (99.8%) are kw_only meters (they only have non-zero values in KW charges), and about 94% of the "larger" meter_numbers are kwh_only meters (they only have non-zero values in KWH charges); the remaining 6% of the "larger" ones are neither (see the ratios computed below). Therefore it looks reasonable to map the "smaller" meter_numbers to their corresponding "larger" meter_numbers.

The indicator field "kwh_only_l" means the "larger" meter_number only has non-zero values in KWH charges.

In [87]:
temp[(temp['kwh_only_l'] == False) & (temp['kw_only_l'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.060351413292589765

In [88]:
temp[(temp['kwh_only_l'] == True) & (temp['kw_only_l'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.9396485867074102

In [89]:
temp[(temp['kwh_only_l'] == False) & (temp['kw_only_l'] == True)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.0

In [90]:
temp[(temp['kwh_only_s'] == False) & (temp['kw_only_s'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.0015278838808250573

In [91]:
temp[(temp['kwh_only_s'] == True) & (temp['kw_only_s'] == False)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.0

In [92]:
temp[(temp['kwh_only_s'] == False) & (temp['kw_only_s'] == True)].Meter_Number_S.nunique() / temp.Meter_Number_S.nunique()

0.998472116119175

Merge the meter numbers.

In [93]:
temp = pd.merge(df, df_meter_mapping, left_on = ['Building_ID', 'Meter_Number'], right_on = ['Building_ID','Meter_Number_S'], how = 'left')
# Meter_Number_New is the original Meter_Number if it is not mapped to another Meter_Number, and the corresponding Meter_Number_L otherwise
temp['Meter_Number_New'] = temp['Meter_Number_L'].combine_first(temp['Meter_Number']) 

df = temp
del(temp)
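
To make the left-merge plus `combine_first` pattern concrete, here is a small self-contained sketch. The first two meter numbers come from the '10.0 - BLD 09' example above; the third is invented to show an unmapped meter.

```python
import pandas as pd

bills = pd.DataFrame({
    'Building_ID': ['10.0 - BLD 09'] * 3,
    'Meter_Number': ['1125376', '8125376', '5555555'],  # last value is invented
})
mapping = pd.DataFrame({
    'Building_ID': ['10.0 - BLD 09'],
    'Meter_Number_L': ['8125376'],
    'Meter_Number_S': ['1125376'],
})

merged = pd.merge(bills, mapping,
                  left_on=['Building_ID', 'Meter_Number'],
                  right_on=['Building_ID', 'Meter_Number_S'], how='left')

# Rows without a mapping have NaN in Meter_Number_L,
# so combine_first falls back to the original Meter_Number.
merged['Meter_Number_New'] = merged['Meter_Number_L'].combine_first(merged['Meter_Number'])
new_numbers = merged['Meter_Number_New'].tolist()
print(new_numbers)  # ['8125376', '8125376', '5555555']
```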

In [94]:
mask = df.Meter_Number.isin(df_meter_mapping.Meter_Number_S.values)
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'meter_number mapped to another one with similar pattern'})])

26.8% of meter_numbers can be mapped to another one.

In [95]:
df_meter_mapping.Meter_Number_S.nunique()/df.Meter_Number.nunique()

0.267853488847964

Reorder the columns of the working dataframe.

In [96]:
df.columns

Index(['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges', 'Current_Charges', 'Meter_Number_L', 'Meter_Number_S',
       'Meter_Number_New'],
      dtype='object')

In [97]:
df.drop(['Meter_Number', 'Meter_Number_L', 'Meter_Number_S'], axis = 1, inplace = True)

df.columns = ['row', 'Account_Name', 'Location', 'Building_ID',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges', 
       'Other_Charges', 'Current_Charges', 'Meter_Number']

col_ordered = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges', 
       'Other_Charges', 'Current_Charges']

df = df[col_ordered]

#### 3.3.4 Identify meter number switches over time for the same account

In [98]:
a = df.groupby(['Building_ID']).agg({'Meter_Number': 'nunique'}).reset_index()

a = a[a["Meter_Number"]>1]

a.columns = ['Building_ID', 'Counts']

a = pd.merge(a, df, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number', "Revenue_Month"]]\
.groupby(['Building_ID', 'Meter_Number']).agg({'Revenue_Month': ['max','min']}).reset_index()

a.columns = a.columns.get_level_values(0)

a.columns = ['Building_ID', 'Meter_Number', 'Max_Month', 'Min_Month']

a['Max_Month_Next'] = a['Max_Month'].map(lambda x: x + relativedelta(months=+1))
a['Min_Month_Prior'] = a['Min_Month'].map(lambda x: x - relativedelta(months=+1))

In [99]:
str1 = "select l.Building_ID, l.Meter_Number as Meter_Number_E, r.Meter_Number as Meter_Number_L \
        , l.Min_Month as Min_E, l.Max_Month as Max_E, r.Min_Month as Min_L, r.Max_Month as Max_L\
        from a l join a r on l.Building_ID = r.Building_ID and l.Meter_Number != r.Meter_Number \
        where l.Max_Month == r.Min_Month_Prior"
a = pysql(str1)

This time, the first rows show only one change of meter_number per building over the years; buildings with more than one switch are quantified below.

In [100]:
a.head(10)

Unnamed: 0,Building_ID,Meter_Number_E,Meter_Number_L,Min_E,Max_E,Min_L,Max_L
0,1.0 - BLD 04,6255947,7381828,2010-01-01 00:00:00.000000,2014-04-01 00:00:00.000000,2014-05-01 00:00:00.000000,2018-06-01 00:00:00.000000
1,10.0 - BLD 09,8009984,8125376,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
2,10.0 - BLD 13,7864535,8301063,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
3,10.0 - BLD 14,7864555,8301067,2010-01-01 00:00:00.000000,2011-09-01 00:00:00.000000,2011-10-01 00:00:00.000000,2018-05-01 00:00:00.000000
4,10.0 - BLD 16,8074933,8163879,2010-01-01 00:00:00.000000,2010-01-01 00:00:00.000000,2010-02-01 00:00:00.000000,2018-03-01 00:00:00.000000
5,103.0 - BLD 04,6938100,8743232,2010-01-01 00:00:00.000000,2014-11-01 00:00:00.000000,2014-12-01 00:00:00.000000,2018-06-01 00:00:00.000000
6,11.0 - CLASON POINT GARDENS BLD 03,8322351,7402447,2013-02-01 00:00:00.000000,2016-02-01 00:00:00.000000,2016-03-01 00:00:00.000000,2018-06-01 00:00:00.000000
7,11.0 - CLASON POINT GARDENS BLD 13,8322342,8096758,2013-02-01 00:00:00.000000,2016-05-01 00:00:00.000000,2016-06-01 00:00:00.000000,2018-06-01 00:00:00.000000
8,11.0 - CLASON POINT GARDENS BLD 44,6689019,8362632,2010-08-01 00:00:00.000000,2013-02-01 00:00:00.000000,2013-03-01 00:00:00.000000,2018-06-01 00:00:00.000000
9,111.0 - BLD 01,7250619,8809141,2010-04-01 00:00:00.000000,2016-09-01 00:00:00.000000,2016-10-01 00:00:00.000000,2018-06-01 00:00:00.000000


Quantify the percentage of these buildings.

In [101]:
df_meter_switch = pd.DataFrame(a['Building_ID'].value_counts() > 1).reset_index()
df_meter_switch.columns = ['Building_ID', 'Multiple_Switch']

df_single_meter_switch = df_meter_switch[df_meter_switch['Multiple_Switch'] == False]
df_multiple_meter_switch = df_meter_switch[df_meter_switch['Multiple_Switch'] == True]

In [102]:
df_meter_switch.shape

(560, 2)

In [103]:
df_multiple_meter_switch.shape

(51, 2)

In [104]:
df_meter_switch = pd.merge(a, df_single_meter_switch, on = 'Building_ID', how = 'inner')[['Building_ID', 'Meter_Number_E', 'Meter_Number_L']]

In [105]:
del(a)

14.2% of the meters can be mapped to another meter in this way.

In [106]:
df_meter_switch['Meter_Number_E'].count() / df['Meter_Number'].nunique()

0.14221849678681195

Merge the meter numbers.

In [107]:
temp = pd.merge(df, df_meter_switch, left_on = ['Building_ID', 'Meter_Number'], right_on = ['Building_ID', 'Meter_Number_E'], how = 'left')
temp['Meter_Number_New'] = temp['Meter_Number_L'].combine_first(temp['Meter_Number'])
df = temp

df.drop(['Meter_Number', 'Meter_Number_L', 'Meter_Number_E'], axis = 1, inplace = True)

In [108]:
df.head()

Unnamed: 0,row,Account_Name,Location,Building_ID,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KWH,KWH_Charges,Consumption_KW,KW_Charges,Other_Charges,Current_Charges,Meter_Number_New
0,0,ADAMS,BLD 05,118.0 - BLD 05,2010-01-01,2009-12-24,2010-01-26,33.0,128800.0,7387.97,216.0,2808.0,5200.85,15396.82,7223256
1,1,ADAMS,BLD 05,118.0 - BLD 05,2010-02-01,2010-01-26,2010-02-25,30.0,115200.0,6607.87,224.0,2912.0,5036.47,14556.34,7223256
2,2,ADAMS,BLD 05,118.0 - BLD 05,2010-03-01,2010-02-25,2010-03-26,29.0,103200.0,5919.55,216.0,2808.0,5177.43,13904.98,7223256
3,3,ADAMS,BLD 05,118.0 - BLD 05,2010-04-01,2010-03-26,2010-04-26,31.0,105600.0,6057.22,208.0,2704.0,6002.82,14764.04,7223256
4,4,ADAMS,BLD 05,118.0 - BLD 05,2010-05-01,2010-04-26,2010-05-24,28.0,97600.0,5598.34,216.0,2808.0,5323.2,13729.54,7223256


Rename the meter_number column and reorder the columns.

In [109]:
df.columns = ['row', 'Account_Name', 'Location', 'Building_ID',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges', 'Current_Charges', 'Meter_Number']
col_ordered = ['row', 'Account_Name', 'Location', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days',
       'Consumption_KWH', 'KWH_Charges', 'Consumption_KW', 'KW_Charges',
       'Other_Charges', 'Current_Charges']
df = df[col_ordered]

Log the corresponding rows in df_flags.

In [110]:
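# Caveat: df['Meter_Number'] already holds the merged (later) meter number at this point,
# so this mask matches only rows whose merged number also appears as an earlier
# (Meter_Number_E) number, and may miss the rows that were actually remapped.
mask = df.Meter_Number.isin(df_meter_switch.Meter_Number_E.values)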
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'meter_number switched to another for the same building_id'})])

Log the rows whose Building_ID has a many-to-many meter_number switch over the years.

In [111]:
mask = df['Building_ID'].isin(df_multiple_meter_switch.Building_ID.values)

df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'many_to_many meter_number switch over years for the building_id'})])

### 3.4 Consolidate data back to Building-Meter-Service_Date_Range level
After combining the meter numbers in the two steps above, there are cases where 2 rows exist for the same combination of account and billing window (1 row for the KW charge, 1 row for the KWH charge).

In [112]:
idx = df.groupby(['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']).agg(['count'])['Account_Name'].reset_index()

In [113]:
idx['count'].value_counts()

1    109987
2     73572
Name: count, dtype: int64

Remove the duplicate rows by aggregating at the building, meter, revenue month and service date range level. Since the "Location" field has many null values, we can't group by it (pandas `groupby` drops rows with null keys by default).

In [114]:
df.isnull().sum()

row                      0
Account_Name             0
Location              5565
Building_ID              0
Meter_Number             0
Revenue_Month            0
Service_Start_Date       0
Service_End_Date         0
# days                   0
Consumption_KWH          0
KWH_Charges              0
Consumption_KW           0
KW_Charges               0
Other_Charges            0
Current_Charges          0
dtype: int64

In [115]:
df = df.groupby(['Account_Name', 'Building_ID', 'Meter_Number',
       'Revenue_Month', 'Service_Start_Date',
       'Service_End_Date', '# days']).\
    agg({'Consumption_KW': 'sum', 'KW_Charges': 'sum', 'Consumption_KWH': 'sum', 'KWH_Charges': 'sum', 'Other_Charges': 'sum', 'Current_Charges': 'sum', 'row':'min'}).reset_index()
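
As an illustration of what this aggregation does, the sketch below collapses the KW row and the KWH row of one billing window into a single bill. The values are taken from the '10.0 - BLD 09' example shown earlier (rows 42678 and 42703), with both rows already carrying the merged meter number; only a subset of the columns is used.

```python
import pandas as pd

rows = pd.DataFrame({
    'row': [42678, 42703],
    'Building_ID': ['10.0 - BLD 09'] * 2,
    'Meter_Number': ['8125376'] * 2,  # after the meter merge
    'Service_Start_Date': pd.to_datetime(['2011-09-22'] * 2),
    'Service_End_Date': pd.to_datetime(['2011-10-24'] * 2),
    'Consumption_KWH': [0.0, 33520.0],
    'Consumption_KW': [67.01, 0.0],
    'Current_Charges': [898.6, 3453.13],
})

keys = ['Building_ID', 'Meter_Number', 'Service_Start_Date', 'Service_End_Date']
combined = rows.groupby(keys).agg({
    'Consumption_KWH': 'sum',
    'Consumption_KW': 'sum',
    'Current_Charges': 'sum',
    'row': 'min',  # keep the smallest original row id as the bill identifier
}).reset_index()
print(combined)
```

The two input rows become one row with both the KW and the KWH consumption filled in.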

In [116]:
df.loc[:, 'Building_Meter'] = df['Building_ID'] + '_' + df['Meter_Number']

Reorder the columns.

In [117]:
cols = ['row', 'Account_Name', 'Building_ID', 'Meter_Number', 'Building_Meter', 
       'Revenue_Month', 'Service_Start_Date', 'Service_End_Date', '# days', 
        'Consumption_KW', 'KW_Charges','Consumption_KWH', 'KWH_Charges', 
        'Other_Charges', 'Current_Charges'
       ]
df = df[cols]

## Step 4 - Identify overlapping bills

- Input: df
- Output: 
    1. df (cleaned version of the dataset, each row is a bill uniquely determined by the Building_ID, Meter_Number, Service_Start_Date and Service_End_Date and no overlapping/duplication of billing windows)
    2. df_flags (a dataframe that logs all the rows of data quality issues)
    3. updated values of the data cleanliness metrics

In [118]:
str1 = "select l.Building_ID, l.row as row_L, r.row as row_R \
        from df l join df r on l.Building_ID = r.Building_ID and l.Meter_Number = r.Meter_Number \
        and l.row != r.row and l.Service_Start_Date <= r.Service_Start_Date \
        and l.Service_End_Date > r.Service_Start_Date \
        "
a = pysql(str1)
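
The overlap condition in the query can be checked by hand on a few billing windows. In this sketch the first two windows are the RED HOOK example shown later in this step, and the third is invented to show that back-to-back bills are not flagged because the end date is compared with a strict `>`.

```python
from datetime import date

# (row, Service_Start_Date, Service_End_Date) for one Building_ID / Meter_Number
bills = [
    (50783, date(2010, 12, 23), date(2011, 1, 25)),
    (50749, date(2010, 12, 23), date(2011, 3, 25)),
    (99999, date(2011, 3, 25), date(2011, 4, 26)),  # invented, starts where 50749 ends
]

# Same condition as the SQL self-join: l starts no later than r, and l ends after r starts
overlaps = [
    (l_row, r_row)
    for l_row, l_start, l_end in bills
    for r_row, r_start, _ in bills
    if l_row != r_row and l_start <= r_start and l_end > r_start
]
flagged_rows = sorted({row for pair in overlaps for row in pair})
print(flagged_rows)  # [50749, 50783]
```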

In [119]:
temp = df[(df['row'].isin(a.row_L)) | (df['row'].isin(a.row_R))]

In [120]:
temp.Account_Name.value_counts()

RED HOOK EAST/RED HOOK WEST          54
OCEAN BAY APARTMENTS (OCEANSIDE)     12
MORRISANIA AIR RIGHTS                 4
TWIN PARKS WEST (SITES 1 & 2)         2
FHA REPOSSESSED HOUSES (GROUP IX)     2
CASSIDY-LAFAYETTE                     2
LEHMAN VILLAGE                        2
Name: Account_Name, dtype: int64

Look at one example.

In [121]:
temp[temp['Building_Meter'] == '4.0 - RED HOOK EAST BLD 05_6505127']

Unnamed: 0,row,Account_Name,Building_ID,Meter_Number,Building_Meter,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KW,KW_Charges,Consumption_KWH,KWH_Charges,Other_Charges,Current_Charges
124490,50783,RED HOOK EAST/RED HOOK WEST,4.0 - RED HOOK EAST BLD 05,6505127,4.0 - RED HOOK EAST BLD 05_6505127,2011-01-01,2010-12-23,2011-01-25,33.0,86.4,1123.2,47520.0,2725.75,2511.59,6360.54
124491,50749,RED HOOK EAST/RED HOOK WEST,4.0 - RED HOOK EAST BLD 05,6505127,4.0 - RED HOOK EAST BLD 05_6505127,2011-03-01,2010-12-23,2011-03-25,92.0,81.6,1072.22,39360.0,2282.49,2279.74,5634.45


In [122]:
temp[['Building_ID', 'Meter_Number']].drop_duplicates().shape[0]/df[['Building_ID', 'Meter_Number']].drop_duplicates().shape[0]

0.012674683132921676

Only 7 account names (1.27% of building-meter accounts) have overlapping billing periods. We'll exclude the affected bills from the working dataset for now.

In [123]:
mask = df.row.isin(temp.row.values)
df_flags = pd.concat([df_flags, pd.DataFrame({'row':df[mask].row.values, 'flag':'overlapping billing periods'})])
df = df[~mask]

### Calculate Metrics regarding zero-values and meter types - 2nd time

In [124]:
pysql = lambda q: pdsql.sqldf(q, globals())
str1 = "select Building_ID, Meter_Number \
        , sum(case when KWH_Charges == 0 and KW_Charges > 0 then 1 else 0 end) as count_kw_only \
        , sum(case when KW_Charges == 0 and KWH_Charges > 0 then 1 else 0 end) as count_kwh_only \
        , sum(Current_Charges) as total_current_charges \
        , count(*) as count \
        from df \
        group by df.Building_ID, df.Meter_Number"
df_meter_type = pysql(str1)


df_meter_type['kwh_only'] = ((df_meter_type['count_kwh_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kw_only'] == 0)
df_meter_type['kw_only'] = ((df_meter_type['count_kw_only']/df_meter_type['count']) > 0.9) & (df_meter_type['count_kwh_only'] == 0)

#### check the meters

print("perc of kw_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kw_only'] == 1) & (df_meter_type['kwh_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_only accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 1) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))

print("perc of kwh_and_kw accounts:", "{:.2%}".format(df_meter_type[(df_meter_type['kwh_only'] == 0) & (df_meter_type['kw_only'] == 0)].shape[0] / df_meter_type.shape[0]))


#### check the building_ids

a = df_meter_type[df_meter_type['kwh_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
b =  df_meter_type[df_meter_type['kw_only'] == 1].groupby(['Building_ID']).agg('count').reset_index().iloc[:, 0:2]
a.columns = ['Building_ID', 'Count']
b.columns = ['Building_ID', 'Count']

print("perc of buildings with both kw_only and kwh_only accounts:", \
     "{:.2%}".format(pd.merge(a, b, on = 'Building_ID', how = 'inner').shape[0] \
/ df_meter_type.groupby(['Building_ID']).agg('count').reset_index().shape[0]))

print("\n")

#### Check the statistics of zero-value rows:

print("perc of rows - current charges of zero:", "{:.2%}".format(df[df['Current_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kw charges of zero:", "{:.2%}".format(df[df['KW_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - kwh charges of zero:", "{:.2%}".format(df[df['KWH_Charges'] == 0].shape[0] / df.shape[0]))

print("perc of rows - usage/charge inconsistency:", \
      "{:.2%}".format(df[((df['KWH_Charges'] == 0) ^ (df['Consumption_KWH'] == 0)) \
   | ((df['KW_Charges'] == 0) ^ (df['Consumption_KW'] == 0)) ].shape[0]\
    /df.shape[0]))

print("perc of rows - sum of charges inconsistency:", \
     "{:.2%}".format(1 - df[df['Current_Charges'] == df['KWH_Charges'] + df['KW_Charges'] + df['Other_Charges']].shape[0]\
    /df.shape[0]))

perc of kw_only accounts: 2.50%
perc of kwh_only accounts: 16.44%
perc of kwh_and_kw accounts: 81.05%
perc of buildings with both kw_only and kwh_only accounts: 0.34%


perc of rows - current charges of zero: 2.42%
perc of rows - kw charges of zero: 17.66%
perc of rows - kwh charges of zero: 5.98%
perc of rows - usage/charge inconsistency: 6.28%
perc of rows - sum of charges inconsistency: 34.14%


Compared with the initial check, all metrics improved significantly except for "usage/charge inconsistency" (4.45% to 6.28%) and "sum of charges inconsistency" (29.34% to 34.14%), which got worse. However, the charges fields are not our current focus.

Metrics from the initial check:

In [None]:
# perc of kw_only accounts: 28.46%
# perc of kwh_only accounts: 36.40%
# perc of kwh_and_kw accounts: 35.13%
# perc of buildings with both kw_only and kwh_only accounts: 52.79%


# perc of rows - current charges of zero: 16.61%
# perc of rows - kw charges of zero: 41.13%
# perc of rows - kwh charges of zero: 33.03%
# perc of rows - usage/charge inconsistency: 4.45%
# perc of rows - sum of charges inconsistency: 29.34%
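
The comparison can be made explicit by computing the change in percentage points between the two checks. The numbers are copied from the two printouts in this notebook. Treating "lower is better" for every metric listed is an assumption of this sketch, which is why the kwh_and_kw share is left out.

```python
# Percentages copied from the initial check and the second check
initial = {
    'kw_only accounts': 28.46,
    'kwh_only accounts': 36.40,
    'buildings with both kw_only and kwh_only accounts': 52.79,
    'current charges of zero': 16.61,
    'kw charges of zero': 41.13,
    'kwh charges of zero': 33.03,
    'usage/charge inconsistency': 4.45,
    'sum of charges inconsistency': 29.34,
}
second = {
    'kw_only accounts': 2.50,
    'kwh_only accounts': 16.44,
    'buildings with both kw_only and kwh_only accounts': 0.34,
    'current charges of zero': 2.42,
    'kw charges of zero': 17.66,
    'kwh charges of zero': 5.98,
    'usage/charge inconsistency': 6.28,
    'sum of charges inconsistency': 34.14,
}

# Change in percentage points; a positive value means the metric got worse
change = {name: round(second[name] - initial[name], 2) for name in initial}
worsened = [name for name, delta in change.items() if delta > 0]
print(worsened)
```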

## Step 5 - Prorate the bills to calendar months

- Input: df
- Outputs: 
    1. df_prorated
    2. df_with_calendar_month

Since we want to prorate the Consumption_KWH value, let's remove the accounts that have only zero values in the Consumption_KWH field.

In [125]:
a = df.groupby("Building_Meter").agg({'Consumption_KWH':'sum'}).reset_index()
KW_Only_Accounts = a[a['Consumption_KWH'] == 0].Building_Meter.values

mask = df['Building_Meter'].isin(KW_Only_Accounts)
df = df[~mask]

We need to find the corresponding calendar months for each bill. Like the revenue_month field, each calendar month is represented by its 1st day.

For each bill, add the corresponding calendar month for both the start and end date of the billing window, both of which should be mapped to the bill. Here we assume the Service_Start_Date is included in the billing window whereas Service_End_Date is not.

In [126]:
df['Start_Date_Month'] = df['Service_Start_Date'].apply(\
  lambda x: pd.to_datetime('-'.join([str(x.year), str(x.month)])))

df['End_Date_Month'] = df['Service_End_Date'].apply(\
  lambda x: pd.to_datetime('-'.join([str((x + relativedelta(days=-1)).year), str((x + relativedelta(days=-1)).month)])))
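
The effect of treating Service_End_Date as exclusive is easiest to see on a bill that ends on the first day of a month. The sketch below uses only the standard library; the dates are invented for illustration and `month_start` is a hypothetical helper.

```python
from datetime import date, timedelta

def month_start(d):
    """Represent a calendar month by its first day."""
    return d.replace(day=1)

service_start = date(2010, 12, 23)
service_end = date(2011, 1, 1)  # invented: the bill ends on the 1st of a month

start_month = month_start(service_start)
# Step back one day before taking the month, because the end date is not part of the window
end_month = month_start(service_end - timedelta(days=1))

print(start_month, end_month)  # 2010-12-01 2010-12-01
```

Without the one-day shift, this bill would also be mapped to January 2011 even though it covers no day in that month.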

Create a dataframe of the relevant columns to work on the mapping between row number to the calendar month.

In [127]:
cols = ['row', 'Start_Date_Month', 'End_Date_Month']
temp = df[cols]

Create a new data frame to store the mapping. The dataframe will have 3 columns: 'row' (identifier of the bill), 'Start_Date_Month' and 'End_Date_Month'. We'll collapse the last 2 columns into 1 in order to get the associated calendar month for each bill.

In [128]:
df_month_row_mapping = temp.copy()

There are cases where the billing window is longer than one calendar month. 

- So for each bill, check if the billing window is longer than one month; if so, save the Start_Date_Month in df_month_row_mapping and then replace it with its subsequent month until the billing window is less than one month.

In [129]:
while (temp.shape[0] > 0):
    temp.loc[:, 'Start_Date_Month_Next'] = \
    temp['Start_Date_Month'].map(lambda x: x + relativedelta(months=+1))

    temp.loc[:, 'Ind'] = \
    temp.apply(lambda x: 1 if x['Start_Date_Month_Next'] < x['End_Date_Month'] else 0, axis = 1)


    mask = temp['Ind'] == 1
    temp = temp.loc[mask,['row', 'Start_Date_Month_Next', 'End_Date_Month']].copy()
    temp.columns = cols

    df_month_row_mapping = pd.concat([df_month_row_mapping, temp])



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
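
For a single bill, the set of calendar months it touches can also be listed with a direct loop. This is a plain-Python sketch of the idea, not the vectorized loop-and-melt approach used in the cells here; `months_covered` is a hypothetical helper, and the dates are the 92-day RED HOOK bill shown in Step 4.

```python
from datetime import date, timedelta

def months_covered(service_start, service_end):
    """Calendar months (as first-of-month dates) touched by [service_start, service_end)."""
    current = service_start.replace(day=1)
    last = (service_end - timedelta(days=1)).replace(day=1)
    months = []
    while current <= last:
        months.append(current)
        # Advance to the first day of the next month
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return months

covered = months_covered(date(2010, 12, 23), date(2011, 3, 25))
print(covered)
```

For this bill the result contains four months: December 2010 through March 2011.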






Collapse the  'Start_Date_Month' and 'End_Date_Month' columns into one column that contains all corresponding calendar months of a given bill.

In [130]:
temp = pd.melt(df_month_row_mapping, id_vars = df_month_row_mapping.columns[0:-2].values, value_vars = df_month_row_mapping[cols].columns[-2:])
temp.drop('variable', axis = 1, inplace = True)

Handle cases where both the service start and end dates correspond to the same bill_month.

In [131]:
temp = temp.drop_duplicates()
temp.columns  = ['row', 'Month']

Calculate how many days of the bill_month are contained in the billing period, still assuming that the Service_Start_Date is included in the billing window whereas the Service_End_Date is not.

In [132]:
temp = pd.merge(temp, df, on = 'row', how = 'left')

In [133]:
temp.loc[:, 'Prorated_Days'] = \
temp.apply(lambda x: \
       (min(x['Month'] + relativedelta(months = 1), x['Service_End_Date']) \
        - max(x['Service_Start_Date'], x['Month']))\
       .days, axis = 1) 
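
As a worked example of this day count, take the 33-day bill from 2010-12-23 to 2011-01-25 that appears earlier in this notebook. The sketch mirrors the min/max logic above with the standard library only; `next_month` and `prorated_days` are hypothetical helper names.

```python
from datetime import date

def next_month(d):
    """First day of the month after d (d is the first day of a month)."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)

def prorated_days(month, service_start, service_end):
    """Days of `month` that fall inside [service_start, service_end)."""
    window_start = max(service_start, month)
    window_end = min(next_month(month), service_end)
    return (window_end - window_start).days

start, end = date(2010, 12, 23), date(2011, 1, 25)
dec_days = prorated_days(date(2010, 12, 1), start, end)
jan_days = prorated_days(date(2011, 1, 1), start, end)
print(dec_days, jan_days)  # 9 24
```

The two parts add up to the 33 days of the bill, which is a useful sanity check on the inclusive-start, exclusive-end assumption.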

A bill would have zero Prorated_Days if its start and end dates were the same, probably due to a data entry error. The check below returns no such rows at this stage, but we keep the filter as a safeguard.

In [134]:
mask = temp['Prorated_Days'] == 0
temp[mask]

Unnamed: 0,row,Month,Account_Name,Building_ID,Meter_Number,Building_Meter,Revenue_Month,Service_Start_Date,Service_End_Date,# days,Consumption_KW,KW_Charges,Consumption_KWH,KWH_Charges,Other_Charges,Current_Charges,Start_Date_Month,End_Date_Month,Prorated_Days


In [135]:
temp = temp[~mask]

Calculate the prorated kwh consumption values based on the prorated days.

In [136]:
temp.loc[:, 'Prorated_KWH'] = \
temp.apply(lambda x: (x['Consumption_KWH'] / x['# days'] )* x['Prorated_Days'], axis = 1)
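
The same proration can also be written as a column-wise expression, which avoids a row-by-row `apply`. This is only a sketch on a hypothetical 29-day bill of 2,900 KWH split across two months; the frame `toy` and its values are illustrative.

```python
import pandas as pd

# Hypothetical bill: 2,900 KWH over 29 days, with 12 days in one calendar
# month and 17 days in the next.
toy = pd.DataFrame({
    "Consumption_KWH": [2900.0, 2900.0],
    "# days": [29, 29],
    "Prorated_Days": [12, 17],
})

# Daily average consumption times the days that fall in each month.
toy["Prorated_KWH"] = toy["Consumption_KWH"] / toy["# days"] * toy["Prorated_Days"]
print(toy)
```

The two prorated values (1,200 and 1,700 KWH) sum back to the bill total, which is a useful sanity check when the prorated days cover the whole bill.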

Save a dataframe that contains the full dataset together with the corresponding calendar month(s) for each bill. This dataframe is used later to calculate the gap days per account per month. Note that it has more rows than the original dataset, since one bill may correspond to multiple calendar months.

In [137]:
df_with_calendar_month = temp.copy()

In [138]:
cols = ['row', 'Account_Name', 'Building_ID',
       'Meter_Number', 'Building_Meter', 'Revenue_Month',
       'Service_Start_Date', 'Service_End_Date', '# days', 'Consumption_KW',
       'KW_Charges', 'Consumption_KWH', 'KWH_Charges', 'Other_Charges',
       'Current_Charges', 'Month', 'Prorated_Days', 'Prorated_KWH']

df_with_calendar_month = df_with_calendar_month[cols]

Aggregate the data to the Account-Month level by summing the prorated KWH consumption values per bill_month.

In [139]:
df_prorated = \
df_with_calendar_month.groupby(['Building_Meter','Month']).agg({'Prorated_KWH':'sum', 'Prorated_Days':'sum'}).reset_index()

So far, for each account, we have only worked with the calendar months in which the account has billing records. We also need to map each account to the calendar months in which it should have data but none was logged or reported.

Create a dataframe that maps the account (Building_Meter) with all the calendar months that it should have bills.

Find all unique accounts (Building_Meter) and months in the dataset.

In [140]:
meters = df_with_calendar_month.Building_Meter.value_counts().index.values
df_meter_month = pd.DataFrame()

end = df_with_calendar_month['Month'].max()
start = df_with_calendar_month['Month'].min()
diff = (end.year - start.year) * 12 + end.month - start.month
# list of unique months
months = [start + relativedelta(months=x) for x in range(0, diff + 1)]

Create a reference table with all the calendar months and the corresponding # of days in the month. 

In [141]:
month_days = [(x + relativedelta(months = 1) - x).days for x in months]
df_month_days = pd.DataFrame({'Month':months,  'Month_#_Days':month_days})
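
An alternative way to build the same kind of reference table, sketched here with hypothetical start and end months, is to let pandas generate the month starts and read the month lengths from `DatetimeIndex.days_in_month`.

```python
import pandas as pd

# Hypothetical range of calendar months; "MS" means month-start frequency.
toy_months = pd.date_range("2010-01-01", "2010-04-01", freq="MS")

# days_in_month gives the number of days of each month in the index.
toy_month_days = pd.DataFrame({
    "Month": toy_months,
    "Month_#_Days": toy_months.days_in_month,
})
print(toy_month_days)
```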

Now we can map each account (Building_Meter) to all the calendar months in which it should have billing data. Here we assume that an account should have data in every month between the first and the last calendar month for which it has billing data.

In [142]:
df_meter_month = pd.DataFrame()

for j in range(len(meters)):
    mask = (df_with_calendar_month['Building_Meter'] == meters[j])
    start = df_with_calendar_month[mask]['Month'].min()
    end = df_with_calendar_month[mask]['Month'].max()
    start_index = months.index(start)
    end_index = months.index(end)
    
    temp_df = pd.DataFrame({'Building_Meter':meters[j], 'Month':months[start_index:end_index + 1]})
    temp_df.loc[:, 'Month_Type'] = 'Month_In_The_Middle'
    temp_df.loc[0, 'Month_Type'] = 'First_Month'
    temp_df.loc[temp_df.shape[0]-1, 'Month_Type'] = 'Last_Month'
    df_meter_month = pd.concat([df_meter_month, temp_df])
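
A compact variant of the loop above, shown on hypothetical data, collects the per-account frames in a list and concatenates once at the end. The names `toy_bills`, `frames` and `scaffold` are illustrative; the labels follow the cell above.

```python
import pandas as pd

# Hypothetical account-month records: account A has data in Jan and Apr 2010,
# account B only in Feb 2010.
toy_bills = pd.DataFrame({
    "Building_Meter": ["A", "A", "B"],
    "Month": pd.to_datetime(["2010-01-01", "2010-04-01", "2010-02-01"]),
})

frames = []
for meter, grp in toy_bills.groupby("Building_Meter"):
    # Every month start between the account's first and last month.
    span = pd.date_range(grp["Month"].min(), grp["Month"].max(), freq="MS")
    frame = pd.DataFrame({"Building_Meter": meter, "Month": span})
    frame["Month_Type"] = "Month_In_The_Middle"
    frame.loc[frame.index[0], "Month_Type"] = "First_Month"
    frame.loc[frame.index[-1], "Month_Type"] = "Last_Month"
    frames.append(frame)

scaffold = pd.concat(frames, ignore_index=True)
print(scaffold)
```

Account A gets four rows, including the two months without bills. An account with a single month, such as B, ends up labelled 'Last_Month' because that assignment runs last, which matches the order of assignments in the cell above.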

In [143]:
df_meter_month = pd.merge(df_meter_month, df_month_days, on = ['Month'], how = 'left')

Left join account_meter mapping table to get all months for each account.

In [144]:
df_prorated = pd.merge(df_meter_month, df_prorated, on = ['Building_Meter', 'Month'], how = 'left')

For months in which the account has no data, fill in zeros.

In [145]:
mask = df_prorated['Prorated_Days'].isnull()
df_prorated.loc[mask, 'Prorated_KWH'] = 0
df_prorated.loc[mask, 'Prorated_Days'] = 0

## Step 6 - Identify the billing gaps per calendar month

- Input: df_with_calendar_month
- Output: df_gaps

#### Goals:
1. For each account, identify the months with no data
2. For each account, identify months with >3 days of billing gap

Create a dataframe to sum all prorated days from different bills per account per month.

In [146]:
df_gaps = df_with_calendar_month.groupby(['Building_Meter','Month'])\
            .agg({'Prorated_Days':'sum'}).reset_index()

df_gaps.columns = ['Building_Meter', 'Month', 'Prorated_Days']

Left join account_meter mapping table to get all months for each account.

In [147]:
df_gaps = pd.merge(df_meter_month, df_gaps, on = ['Building_Meter', 'Month'], how = 'left')

Calculate the gap days per month for each account.

In [148]:
mask = df_gaps['Prorated_Days'].isnull()
df_gaps.loc[mask, 'Prorated_Days'] = 0

df_gaps.loc[:,'Gap_Days'] = df_gaps.apply(lambda x: x['Month_#_Days'] - x['Prorated_Days'], axis = 1)
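
The gap calculation can be illustrated on one hypothetical account. In the sketch below, `toy_scaffold` and `toy_covered` stand in for the account-month mapping and the summed prorated days; all values are made up.

```python
import pandas as pd

# Months in which the hypothetical account A should have data.
toy_scaffold = pd.DataFrame({
    "Building_Meter": ["A", "A", "A"],
    "Month": pd.to_datetime(["2010-01-01", "2010-02-01", "2010-03-01"]),
    "Month_#_Days": [31, 28, 31],
})
# Days actually covered by bills; February has no bill at all.
toy_covered = pd.DataFrame({
    "Building_Meter": ["A", "A"],
    "Month": pd.to_datetime(["2010-01-01", "2010-03-01"]),
    "Prorated_Days": [31, 25],
})

toy_gaps = toy_scaffold.merge(toy_covered, on=["Building_Meter", "Month"], how="left")
toy_gaps["Prorated_Days"] = toy_gaps["Prorated_Days"].fillna(0)
# Column-wise subtraction gives the uncovered days per month.
toy_gaps["Gap_Days"] = toy_gaps["Month_#_Days"] - toy_gaps["Prorated_Days"]
print(toy_gaps)
```

January is fully covered, February is a full-month gap of 28 days, and March is missing 6 days.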

#### Edge case consideration
- If the month is the first month of an account, it is legitimate to have a few days with no data at the beginning of the month. Similarly, the last month may have a few days with no data at its end. These days are excluded from the gap count below.

In [149]:
df_meter_min_max_service_date = df_with_calendar_month.groupby(['Building_Meter']).agg({'Service_Start_Date':'min', 'Service_End_Date':'max'}).reset_index()
df_meter_min_max_service_date.columns = ['Building_Meter', 'First_Service_Date', 'Last_Service_Date']

In [150]:
temp = pd.merge(df_meter_min_max_service_date[['Building_Meter', 'First_Service_Date']], \
        df_meter_month[df_meter_month['Month_Type'] == 'First_Month'], on = 'Building_Meter', how = 'inner')

temp['Gap_Days_To_Exclude'] = temp.apply(lambda x: (x['First_Service_Date'] - x['Month']).days, axis = 1)

df_Meter_First_Month_Gaps_To_Exclude = temp[['Building_Meter', 'Month', 'Gap_Days_To_Exclude']]

In [151]:
temp = pd.merge(df_meter_min_max_service_date[['Building_Meter', 'Last_Service_Date']], \
        df_meter_month[df_meter_month['Month_Type'] == 'Last_Month'], on = 'Building_Meter', how = 'inner')

temp['Gap_Days_To_Exclude'] = temp.apply(lambda x: (x['Month'] + relativedelta(months=+1) - x['Last_Service_Date']).days, axis = 1)
df_Meter_Last_Month_Gaps_To_Exclude = temp[['Building_Meter', 'Month', 'Gap_Days_To_Exclude']]

In [152]:
df_gaps_first_month = \
pd.merge(df_gaps[df_gaps['Month_Type'] == 'First_Month']\
         , df_Meter_First_Month_Gaps_To_Exclude, on = ['Building_Meter', 'Month'], how = 'inner')

In [153]:
df_gaps_last_month = \
pd.merge(df_gaps[df_gaps['Month_Type'] == 'Last_Month']\
         , df_Meter_Last_Month_Gaps_To_Exclude, on = ['Building_Meter', 'Month'], how = 'inner')

In [154]:
df_gaps_first_month['Gap_Days_New'] = \
df_gaps_first_month.apply(lambda x: x['Gap_Days'] - x['Gap_Days_To_Exclude'], axis = 1)

df_gaps_last_month['Gap_Days_New'] = \
df_gaps_last_month.apply(lambda x: x['Gap_Days'] - x['Gap_Days_To_Exclude'], axis = 1)

Update the # of gap days of the first and last month for each account.

In [155]:
df_gaps = \
pd.merge(df_gaps, df_gaps_first_month[['Building_Meter', 'Month', 'Gap_Days_New']], \
         on = ['Building_Meter', 'Month'], how = 'left')

mask = df_gaps['Gap_Days_New'].isnull()
df_gaps.loc[~mask, 'Gap_Days'] = df_gaps.loc[~mask, 'Gap_Days_New']

df_gaps.drop('Gap_Days_New', axis = 1, inplace = True)

In [156]:
df_gaps = \
pd.merge(df_gaps, df_gaps_last_month[['Building_Meter', 'Month', 'Gap_Days_New']], \
         on = ['Building_Meter', 'Month'], how = 'left')

mask = df_gaps['Gap_Days_New'].isnull()
df_gaps.loc[~mask, 'Gap_Days'] = df_gaps.loc[~mask, 'Gap_Days_New']

df_gaps.drop('Gap_Days_New', axis = 1, inplace = True)

#### Flag the accounts that have more than 3 days of gap in a month.

In [157]:
df_gaps['Gap_Type'] = 'No Gap'

mask = (df_gaps['Gap_Days'] > 3)
df_gaps.loc[mask, 'Gap_Type'] = 'Gap more than 3 days'

mask = (df_gaps['Gap_Days'] == df_gaps['Month_#_Days'])
df_gaps.loc[mask, 'Gap_Type'] = 'Full Month Gap'
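
The same three-way labelling can be expressed in one step with `np.select`, which picks the first matching condition. This is a sketch on hypothetical values, not a replacement for the cell above; the full-month condition is listed first so that it takes precedence over the 3-day rule.

```python
import numpy as np
import pandas as pd

# Hypothetical gap days and month lengths.
toy_flags = pd.DataFrame({
    "Gap_Days": [0, 3, 6, 28],
    "Month_#_Days": [31, 30, 31, 28],
})

conditions = [
    toy_flags["Gap_Days"] == toy_flags["Month_#_Days"],  # no data at all
    toy_flags["Gap_Days"] > 3,                            # more than 3 days missing
]
labels = ["Full Month Gap", "Gap more than 3 days"]
toy_flags["Gap_Type"] = np.select(conditions, labels, default="No Gap")
print(toy_flags)
```

A gap of exactly 3 days stays 'No Gap', since the threshold is strictly greater than 3.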

## Step 7 - Impute the KWH Consumption values

- Inputs: df_gaps, df_prorated
- Output: df_prorated_kwh_imputed

#### Impute the KWH Consumption values per account per month. 

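For each account and calendar month, the prorated KWH of all bills that overlap the month is summed and then scaled from the covered days up to the full length of the month: `Imputed_KWH = Prorated_KWH * Month_#_Days / Prorated_Days`. Months with no covered days are set to 0.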

In [158]:
df_prorated_kwh_imputed = pd.merge(df_gaps[['Building_Meter', 'Month', 'Month_Type', 'Month_#_Days', \
       'Prorated_Days', 'Gap_Days', 'Gap_Type']], df_prorated[['Building_Meter', 'Month', 'Prorated_KWH']], \
        on = ['Building_Meter', 'Month'], how = 'inner')

In [161]:
mask = df_prorated_kwh_imputed['Prorated_Days'] > 0

df_prorated_kwh_imputed.loc[mask, 'Imputed_KWH'] = \
df_prorated_kwh_imputed.loc[mask].apply(lambda x: x['Prorated_KWH']*x['Month_#_Days']/x['Prorated_Days'], axis = 1)

df_prorated_kwh_imputed.loc[~mask, 'Imputed_KWH'] = 0
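
As a worked check of the scaling, the sketch below applies it to the December 2009 row of the sample output in this section (9 covered days out of 31, 19694.117647 prorated KWH). The helper name `impute_month_kwh` is hypothetical.

```python
def impute_month_kwh(prorated_kwh, prorated_days, month_days):
    """Scale the consumption observed on the covered days up to a full month.
    Months with no covered days get 0, as in the cell above."""
    if prorated_days <= 0:
        return 0.0
    return prorated_kwh * month_days / prorated_days

# Values copied from the first row of the sample output.
first_month = impute_month_kwh(19694.117647, 9, 31)
no_data_month = impute_month_kwh(0.0, 0, 30)
print(round(first_month, 3), no_data_month)
```

The result agrees with the Imputed_KWH value reported for that row (67835.294118) up to rounding of the inputs.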

#### Show some samples of the final dataset.

In [162]:
df_prorated_kwh_imputed.head()

| | Building_Meter | Month | Month_Type | Month_#_Days | Prorated_Days | Gap_Days | Gap_Type | Prorated_KWH | Imputed_KWH |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 165.0 - BLD 04_99273488 | 2009-12-01 | First_Month | 31 | 9.0 | 0.0 | No Gap | 19694.117647 | 67835.294118 |
| 1 | 165.0 - BLD 04_99273488 | 2010-01-01 | Month_In_The_Middle | 31 | 31.0 | 0.0 | No Gap | 68283.02521 | 68283.02521 |
| 2 | 165.0 - BLD 04_99273488 | 2010-02-01 | Month_In_The_Middle | 28 | 28.0 | 0.0 | No Gap | 61071.133005 | 61071.133005 |
| 3 | 165.0 - BLD 04_99273488 | 2010-03-01 | Month_In_The_Middle | 31 | 31.0 | 0.0 | No Gap | 58011.118077 | 58011.118077 |
| 4 | 165.0 - BLD 04_99273488 | 2010-04-01 | Month_In_The_Middle | 30 | 30.0 | 0.0 | No Gap | 55164.054336 | 55164.054336 |


#### Example plot of the difference between KWH consumptions per revenue month and imputed KWH consumptions per calendar month. 

Here the imputed KWH consumption series is somewhat smoother than the original series by revenue month, because each calendar-month value combines the prorated shares of the bills that overlap that month. The imputed version also exposes the month in which the account has no billing data at all (Jun 2012 in this case).

In [163]:
temp = df[df['Building_Meter'] == '341.0 - BLD 04_7835072']
trace1 = go.Scatter(
    x = temp.Revenue_Month,
    y = temp.Consumption_KWH,
    name = 'KWH Consumption by Revenue Month',
    yaxis = 'y'
)

temp = df_prorated_kwh_imputed[df_prorated_kwh_imputed['Building_Meter'] == '341.0 - BLD 04_7835072']
trace2 = go.Scatter(
    x = temp.Month,
    y = temp.Imputed_KWH,
    name = 'Prorated & Imputed KWH Consumption by Calendar Month',
    yaxis = 'y'
)

data = [trace1, trace2]

layout = go.Layout(
    title='KWH Consumptions over time',
    margin=go.layout.Margin(
        l=80,
        r=50,
        b=100,
        t=200,
        pad=4
    ),
    yaxis=dict(
#         title='KWH Consumption',
        tickformat=",",
    ),
    legend=dict(x = -0.05, y=1.4)
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

#### Trendline of % of accounts with billing gaps (entire month or 3+ days of gap) by calendar month

In [164]:
temp = df_gaps.groupby(['Month']).agg({'Building_Meter':'nunique'}).reset_index()


mask = df_gaps['Gap_Type'] == 'Gap more than 3 days'
a = df_gaps[mask].groupby(['Month']).agg({'Building_Meter':'nunique'}).reset_index()

mask = df_gaps['Gap_Type'] == 'Full Month Gap'
b = df_gaps[mask].groupby(['Month']).agg({'Building_Meter':'nunique'}).reset_index()

temp = pd.merge(temp, a, on = 'Month', how = 'left')
temp = pd.merge(temp, b, on = 'Month', how = 'left')

temp = temp.fillna(0)

temp.columns = ['Month', 'meters_count', 'meters_3DayGap_count', 'meters_1MonthGap_count']

temp['meters_no_data_perc'] = \
temp.apply(lambda x: round(x['meters_1MonthGap_count']/x['meters_count'], 4), axis = 1)

temp['meters_no_data_or_long_gap_perc'] = \
temp.apply(lambda x: round((x['meters_1MonthGap_count'] + x['meters_3DayGap_count'])/x['meters_count'], 4), axis = 1)
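
The per-month shares can also be obtained with `pd.crosstab`. The sketch below uses hypothetical data and assumes there is exactly one row per account per month, so that row counts equal account counts.

```python
import pandas as pd

# Hypothetical gap table: four accounts in January, two in February.
toy_gap_types = pd.DataFrame({
    "Month": pd.to_datetime(["2010-01-01"] * 4 + ["2010-02-01"] * 2),
    "Building_Meter": ["A", "B", "C", "D", "A", "B"],
    "Gap_Type": ["No Gap", "Full Month Gap", "Gap more than 3 days",
                 "No Gap", "No Gap", "No Gap"],
})

# Share of accounts per gap type within each month (each row sums to 1).
share = pd.crosstab(
    toy_gap_types["Month"], toy_gap_types["Gap_Type"], normalize="index"
).reindex(columns=["Full Month Gap", "Gap more than 3 days"], fill_value=0)

# Accounts with either no data or a long gap.
share["Any_Gap"] = share["Full Month Gap"] + share["Gap more than 3 days"]
print(share)
```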

In [167]:
df_data_completeness_by_month = temp.copy()

In [168]:
# Create a trace
trace1 = go.Bar(
    x = df_data_completeness_by_month.Month,
    y = df_data_completeness_by_month.meters_count,
    name = '# of Accounts that should have data in the month', 
    marker=dict(
        color='rgba(204,204,204,1)'
    ),
    yaxis= 'y'
)

trace2 = go.Scatter(
    x = df_data_completeness_by_month.Month,
    y = df_data_completeness_by_month.meters_no_data_perc,
    name = '% of Accounts with no data',
    yaxis = 'y2'
)

trace3 = go.Scatter(
    x = df_data_completeness_by_month.Month,
    y = df_data_completeness_by_month.meters_no_data_or_long_gap_perc,
    name = '% of Accounts with no data or 3+ days of gap', 
    yaxis= 'y2'
)

data = [trace1, trace2, trace3]

layout = go.Layout(
    title='Trend Line of Data Incompleteness',
    yaxis=dict(
        title='# of Accounts that should have data in the month',
        tickformat=",",
    ),
    yaxis2=dict(
        title='% of Accounts missing data',
        tickformat=".1%",
        side='right',
        overlaying='y',
    ), 
    legend=dict(x = -0.05, y= -0.4)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

#### Hypotheses on data incompleteness over time:

1. Does data quality improve from 2015 onward, with fewer meters missing data? (The government has required companies to submit utility data since then.)
    - No.
2. Are accounts more likely to be missing data in January? If so, why?
    - No.

### Save the output files

In [207]:
mask = df_prorated_kwh_imputed['Imputed_KWH'] == 0

df_prorated_kwh_imputed.loc[mask, 'Imputed_KWH'] = np.nan

df_prorated_kwh_imputed.to_pickle("../output/NYCHA_Prorated_KWH")

df_prorated_kwh_imputed.to_csv('../output/NYCHA_Prorated_KWH.csv')

### Save the data of accounts with at least 50 valid data points for anomaly detection analysis

In [209]:
# flag the months whose imputed value is missing
df_prorated_kwh_imputed['NA_KWH'] = df_prorated_kwh_imputed['Imputed_KWH'].isnull()

tmp1 = df_prorated_kwh_imputed.groupby('Building_Meter').agg('count').reset_index().iloc[:, 0:2]

tmp1.columns = ['Building_Meter', 'month_count']

tmp2 = df_prorated_kwh_imputed.groupby('Building_Meter').agg({'NA_KWH':'sum'}).reset_index().iloc[:, 0:2]

tmp2.columns = ['Building_Meter', 'month_na_count']

tmp = tmp1.merge(tmp2, on = 'Building_Meter', how = 'inner')

mask = tmp['month_count'] - tmp['month_na_count'] >= 50

df_prorated_kwh_imputed_valid_50plus = df_prorated_kwh_imputed.merge(tmp[mask][['Building_Meter']], on = 'Building_Meter', how = 'inner')[['Building_Meter', 'Month', 'Imputed_KWH']]
df_prorated_kwh_imputed_valid_50plus.columns = ['Account', 'Month', 'Value']
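
The filter on valid data points can be sketched more directly with a grouped count, since `count` ignores NaN values. The data and the threshold `min_valid` below are hypothetical (the notebook uses 50).

```python
import numpy as np
import pandas as pd

# Hypothetical imputed series: account A has 2 valid months, account B has 1.
toy_series = pd.DataFrame({
    "Building_Meter": ["A"] * 3 + ["B"] * 3,
    "Imputed_KWH": [10.0, np.nan, 12.0, np.nan, np.nan, 5.0],
})

min_valid = 2  # illustrative threshold; the notebook keeps accounts with >= 50

# Number of non-missing values per account, repeated on every row.
valid_counts = toy_series.groupby("Building_Meter")["Imputed_KWH"].transform("count")
toy_kept = toy_series[valid_counts >= min_valid]
print(toy_kept)
```

Only account A is kept, with all three of its rows, including the month whose value is missing.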

In [217]:
df_prorated_kwh_imputed_valid_50plus.to_csv('../output/NYCHA_TS.csv')

## Next Steps:

Check data anomalies in the following order:
- KWH (consumption): compare only the months that have data (ignore the gap months); alternatively, use usage per day and exclude the days with no consumption, instead of using the prorated value
- KWH_Charges
- KW (capacity) consumption and charges (differences between daytime and nighttime and between summer and winter; if the whole summer is at capacity, charges for summer capacity use will be very high) (see the later metrics defined below)

Other metrics to consider later:

1. Total capacity (kW) for all the meters for the month (building level aggregate)
2. Max kW value for the month (both building level and account level)
3. Max kW for each meter for the previous 12 months
4. Sum of the Max kW for each individual meter
5. The variance of Total Charge (sum of KWH_charge and KW_charge) at both account level and building level
6. Average total charges (sum of kwh_charges and kw_charges per account per month)