## Import Statements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats

pd.options.display.max_columns = 25

## Data FY 2012

In [2]:
data_2012 = pd.read_excel('houston-houston-electricity-bills/coh-fy2012-ee-bills-july2011-june2012.xls')
orig_shape_2012 = data_2012.shape[0]

data_2012.shape

(57430, 24)

In [3]:
data_2012.head(5)

Unnamed: 0,Reliant Contract No,Service Address,Meter No,ESID,Business Area,Cost Center,Fund,Bill Type,Bill Date,Read Date,Due Date,Meter Read,Base Cost ($),T&D Discretionary ($),T&D Charges ($),Current Due ($),Adjustment ($),Total Due ($),Franchise Fee ($),Voucher Date,Billed Demand,kWh Usage,Nodal Cu Charge ($),Reliability Unit Charge ($)
0,2059605,10518 BELLAIRE,303261,1008901000140050014100,2000,2000040005,8300,T,2012-06-26,2012-06-21,2012-07-26,47940.0,61070.65,1638.01,10440.86,73232.11,,73232.11,-1047.28,2012-06-27,1507.291667,905421,82.59,0.0
1,2059605,10518 BELLAIRE,303261,1008901000140050014100,2000,2000040005,8300,T,2012-05-25,2012-05-21,2012-06-24,47186.0,56319.47,1631.0,10364.63,68463.46,,68463.46,-1045.21,2012-05-30,1496.907217,824107,148.36,0.0
2,2059605,10518 BELLAIRE,303261,1008901000140050014100,2000,2000040005,8300,T,2012-04-27,2012-04-23,2012-05-27,46499.0,68461.63,1674.67,10676.79,80847.87,,80847.87,-1081.11,2012-04-30,1562.5,977744,34.78,0.0
3,2059605,10518 BELLAIRE,303261,1008901000140050014100,2000,2000040005,8300,T,2012-03-27,2012-03-21,2012-04-26,45684.0,62036.29,1696.66,10681.48,74373.93,,74373.93,-1087.32,2012-03-28,1567.708333,876838,-40.5,0.0
4,2059605,10518 BELLAIRE,303261,1008901000140050014100,2000,2000040005,8300,T,2012-02-27,2012-02-21,2012-03-28,44954.0,61670.24,1703.8,10707.94,74080.27,,74080.27,-1090.08,2012-02-28,1577.083333,872898,-1.71,0.0


### Checking Nulls

In [4]:
data_2012.isna().sum()

Reliant Contract No                0
Service Address                    0
Meter No                        7809
ESID                               0
Business Area                      0
Cost Center                        0
Fund                               0
Bill Type                          0
Bill Date                          0
Read Date                          0
Due Date                           0
Meter Read                         2
Base Cost ($)                      0
T&D Discretionary ($)              0
T&D Charges ($)                    0
Current Due ($)                    0
Adjustment ($)                 56259
Total Due ($)                      0
Franchise Fee ($)                  0
Voucher Date                       0
Billed Demand                      3
kWh Usage                          0
Nodal Cu Charge ($)                1
Reliability Unit Charge ($)        4
dtype: int64

### Checking Adjustment ($) column

In [5]:
data_2012['Adjustment ($)'].value_counts(dropna=False)

NaN       56259
0.0        1170
9425.9        1
Name: Adjustment ($), dtype: int64

Apart from a single non-zero entry (9425.9), every value in this column is either NaN or 0, so it carries almost no information. Electing to drop the column.

In [6]:
data_2012.drop(columns=['Adjustment ($)'], inplace=True)
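
The decision to drop a sparse column can be backed by a quick measure of how much of it is null or zero. The following is a minimal sketch on a toy frame; the data and the helper name `sparsity_report` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the billing data; values are illustrative only.
toy = pd.DataFrame({
    "Adjustment ($)": [np.nan, np.nan, 0.0, np.nan, 9425.9, np.nan],
    "kWh Usage": [905421, 824107, 977744, 876838, 872898, 100],
})

def sparsity_report(df):
    """Return, per column, the share of values that are null or exactly zero."""
    null_share = df.isna().mean()
    zero_share = df.eq(0).mean()  # NaN == 0 is False, so nulls are not double counted
    return pd.DataFrame({
        "null_share": null_share,
        "zero_share": zero_share,
        "uninformative_share": null_share + zero_share,
    })

report = sparsity_report(toy)
print(report)
```

A column whose `uninformative_share` is close to 1 is a reasonable candidate for dropping, though the threshold is a judgment call.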

### Checking Unique Number of Customers

Several columns in the dataset appear to identify a unique person, house, or business. Checking the number of unique values in each of these columns.

In [7]:
check_unique_columns = ['Reliant Contract No', 'Service Address ', 'Meter No', 
                        'ESID', 'Business Area', 'Cost Center',]

for col in check_unique_columns:
    print(f'Number of Unique Values in {col}: {data_2012[col].nunique()}')

Number of Unique Values in Reliant Contract No: 5241
Number of Unique Values in Service Address : 5183
Number of Unique Values in Meter No: 4021
Number of Unique Values in ESID: 5241
Number of Unique Values in Business Area: 9
Number of Unique Values in Cost Center: 38


Based on the above reported values and further research online:

ESID (Electric Service Identifier) is a unique ID assigned to each point of electric service. It would be best to use the ESID and Service Address columns going forward, as these give the number of unique service points and the areas (streets) where higher electricity usage occurs.
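
Since the counts above show more unique ESIDs (5241) than unique service addresses (5183), some addresses must carry more than one ESID. A hedged sketch of how that relationship could be inspected, using a small made-up frame (the IDs and addresses below are illustrative, not real records):

```python
import pandas as pd

# Hypothetical records; the column names mirror the raw file,
# including the trailing space in 'Service Address '.
toy = pd.DataFrame({
    "ESID": ["A1", "A1", "B2", "C3", "C3"],
    "Service Address ": [
        "10518 BELLAIRE", "10518 BELLAIRE", "10518 BELLAIRE",
        "1 MAIN", "1 MAIN ST",
    ],
})

# Number of distinct addresses recorded for each ESID.
addresses_per_esid = toy.groupby("ESID")["Service Address "].nunique()
# Number of distinct ESIDs recorded at each address.
esids_per_address = toy.groupby("Service Address ")["ESID"].nunique()

# ESIDs spelled with more than one address, and addresses shared by several ESIDs.
multi_address_esids = addresses_per_esid[addresses_per_esid > 1]
shared_addresses = esids_per_address[esids_per_address > 1]

print(multi_address_esids)
print(shared_addresses)
```

An ESID that maps to several addresses often points to inconsistent address spelling rather than a genuinely different location.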

Business Area signifies a grouping of a number of buildings that covers a certain area. This would be useful for examining usage patterns grouped by zones in the city.
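
Grouping usage by Business Area could look like the sketch below. The rows are a toy sample, and the output column names `total_kwh` and `bills` are chosen here for illustration.

```python
import pandas as pd

# Toy sample; the business area codes and usage values are illustrative.
toy = pd.DataFrame({
    "Business Area": [2000, 2000, 1000, 1000, 1000],
    "kWh Usage": [905421, 824107, 100, 298, 2240],
})

# Total usage and number of bills per business area, largest consumers first.
usage_by_area = (
    toy.groupby("Business Area")["kWh Usage"]
       .agg(total_kwh="sum", bills="count")
       .sort_values("total_kwh", ascending=False)
)

print(usage_by_area)
```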

### Checking Bill Type

In [8]:
data_2012['Bill Type'].value_counts(dropna=False)

T    56859
P      552
C       19
Name: Bill Type, dtype: int64

Bill Type could signify the type of connection. Since commercial, residential, and government spaces would have different pricing and needs, this column could be capturing that information.

In [9]:
data_2012['Service Address '].nunique(), data_2012['Meter No'].nunique(), data_2012['ESID'].nunique()

(5183, 4021, 5241)

The next three columns are Bill Date, Read Date, and Due Date. Of these, it would be best to use Bill Date across all the data files to keep the data consistent.
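
One reason a single, consistent date column matters is that it becomes the key for any time-based aggregation. A minimal sketch, assuming Bill Date is already parsed as a datetime (the rows and the `bill_month` column are hypothetical):

```python
import pandas as pd

# Toy bills; dates and usage values are illustrative only.
toy = pd.DataFrame({
    "Bill Date": pd.to_datetime(
        ["2012-06-26", "2012-05-25", "2012-05-02", "2012-04-27"]
    ),
    "kWh Usage": [905421, 824107, 1000, 977744],
})

# Bucket each bill into its calendar month and total the usage per month.
toy["bill_month"] = toy["Bill Date"].dt.to_period("M")
monthly_usage = toy.groupby("bill_month")["kWh Usage"].sum()

print(monthly_usage)
```

Using Read Date or Due Date instead would shift some bills into a neighbouring month, so mixing the columns across files would make the monthly totals inconsistent.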

### Electricity Usage Statistics

In [10]:
data_2012[['Meter Read', 'Billed Demand ', 'kWh Usage']].describe()

Unnamed: 0,Meter Read,Billed Demand,kWh Usage
count,57428.0,57427.0,57430.0
mean,10008.024135,52.581303,22497.32
std,19208.052944,432.027165,221634.9
min,0.0,0.0,0.0
25%,118.75,0.0,100.0
50%,2583.0,0.0,298.0
75%,7879.0,11.0,2240.0
max,342348.0,18495.555556,10693440.0


There are 3 columns that denote the amount of electricity: Meter Read, Billed Demand, kWh Usage.

Using kWh Usage as a standard unit of measurement.

In [11]:
data_2012[[
    'Base Cost ($)', 'T&D Discretionary ($)', 'T&D Charges ($)', 
    'Current Due ($)', 'Total Due ($)', 'Franchise Fee ($)', 
    'Nodal Cu Charge ($)', 'Reliability Unit Charge ($)'
     ]].describe()

Unnamed: 0,Base Cost ($),T&D Discretionary ($),T&D Charges ($),Current Due ($),Total Due ($),Franchise Fee ($),Nodal Cu Charge ($),Reliability Unit Charge ($)
count,57430.0,57430.0,57430.0,57430.0,57430.0,57430.0,57429.0,57426.0
mean,1557.590034,404.377159,322.32478,2292.520167,2326.005266,-36.249975,8.067123,0.0
std,15332.140262,12617.605024,2103.325682,23457.157709,23484.415824,255.356787,136.268511,0.0
min,0.0,-44.99,-680.34,-64.21,0.0,-9352.01,-367.21,0.0
25%,6.87,3.24,7.38,18.65,18.43,-5.74,0.0,0.0
50%,20.59,3.91,12.44,38.24,38.49,-0.5,0.01,0.0
75%,155.2525,17.07,98.8475,312.61,317.2125,0.0,0.28,0.0
max,740473.96,754326.01,64282.33,907483.66,907483.66,0.0,18019.45,0.0


Reliability Unit Charge is 0 in every non-null row (mean, std, min, and max are all 0), so it does not contain any useful information. Electing to drop that column.

Based on the above statistics, the columns Current Due and Total Due represent nearly the same value: their quartiles and maxima almost coincide, although the means and minima differ slightly. The remaining cost columns appear to be components that add up to these two totals. Going forward, choosing the column Total Due ($).
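
Summary statistics only suggest that the two columns agree; a row-level comparison would confirm it. A hedged sketch on made-up rows (the last row mimics a credit that appears as a negative Current Due but a zero Total Due, which is an assumption about how such cases look):

```python
import numpy as np
import pandas as pd

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    "Current Due ($)": [73232.11, 68463.46, 18.65, -64.21],
    "Total Due ($)": [73232.11, 68463.46, 18.65, 0.00],
})

# Compare with a tolerance, since these are floating-point dollar amounts.
same = np.isclose(toy["Current Due ($)"], toy["Total Due ($)"])
share_equal = same.mean()
mismatches = toy.loc[~same]

print(f"Share of rows where the two columns agree: {share_equal:.2%}")
print(mismatches)
```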

### Selecting and Filtering Columns

In [12]:
data_2012.columns

Index(['Reliant Contract No', 'Service Address ', 'Meter No', 'ESID',
       'Business Area', 'Cost Center', 'Fund', 'Bill Type', 'Bill Date',
       'Read Date', 'Due Date', 'Meter Read', 'Base Cost ($)',
       'T&D Discretionary ($)', 'T&D Charges ($)', 'Current Due ($)',
       'Total Due ($)', 'Franchise Fee ($)', 'Voucher Date', 'Billed Demand ',
       'kWh Usage', 'Nodal Cu Charge ($)', 'Reliability Unit Charge ($)'],
      dtype='object')

Based on the above analysis of the dataset choosing the following columns:

1. ESID
2. Business Area
3. Service Address
4. Bill Type
5. Bill Date
6. Total Due ($)
7. kWh Usage

In [13]:
data_2012 = data_2012[[
    'ESID', 'Business Area', 'Service Address ', 'Bill Type',
    'Bill Date', 'Total Due ($)', 'kWh Usage'
]]

In [14]:
rename_cols = {
    'ESID': 'esid',
    'Business Area': 'business_area',
    'Service Address ': 'service_address',
    'Bill Type': 'bill_type',
    'Bill Date': 'bill_date',
    'Total Due ($)': 'total_due',
    'kWh Usage': 'kwh_usage'
}

data_2012_main = data_2012.rename(columns=rename_cols)

Checking for nulls and dtypes again.

In [15]:
data_2012_main.isna().sum()

esid               0
business_area      0
service_address    0
bill_type          0
bill_date          0
total_due          0
kwh_usage          0
dtype: int64

In [16]:
data_2012_main.dtypes

esid                       object
business_area               int64
service_address            object
bill_type                  object
bill_date          datetime64[ns]
total_due                 float64
kwh_usage                   int64
dtype: object

In [17]:
data_2012_main.shape

(57430, 7)

In [18]:
zscore_2012 = stats.zscore(data_2012_main[['total_due', 'kwh_usage']])

zscore_2012

Unnamed: 0,total_due,kwh_usage
0,3.019310,3.983720
1,2.816252,3.616835
2,3.343602,4.310039
3,3.067930,3.854755
4,3.055426,3.836978
...,...,...
57425,-0.070053,-0.090029
57426,-0.070059,-0.090029
57427,-0.070064,-0.090029
57428,-0.070255,-0.090029


Each z-score signifies how many standard deviations an individual value is from the mean. This makes it a good indicator for finding outliers in the dataframe.

A z-score of 3 is commonly used as the cut-off value. Any value with a z-score greater than +3 or less than -3 is then considered an outlier, which is essentially the same as the standard deviation method.
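
The z-score rule can be illustrated on a small synthetic frame. This is a sketch under stated assumptions: the 20 rows and the flat rate of 0.1 are invented, and `np.asarray` is used so the code does not depend on whether `stats.zscore` returns a DataFrame or an array.

```python
import numpy as np
import pandas as pd
from scipy import stats

# 19 ordinary readings plus one extreme value (synthetic data).
kwh = [100 + i for i in range(19)] + [50000]
toy = pd.DataFrame({"kwh_usage": kwh})
toy["total_due"] = toy["kwh_usage"] * 0.1  # hypothetical flat rate

cols = ["total_due", "kwh_usage"]

# stats.zscore standardises each column using the population std (ddof=0).
z = np.asarray(stats.zscore(toy[cols]))

# The same quantity written out by hand: (value - mean) / std.
manual = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)
assert np.allclose(z, manual.to_numpy())

# Keep a row only if every selected column lies within 3 standard deviations.
keep = (np.abs(z) < 3).all(axis=1)
filtered = toy[keep]

print(len(toy), "rows before,", len(filtered), "rows after")
```

Note that with very few rows no value can reach a z-score of 3, so the rule only becomes meaningful on reasonably large samples such as the full dataset.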

In [19]:
# data_2012_main = data_2012_main[(np.abs(zscore_2012) < 3).all(axis=1)]

data_2012_main.shape

(57430, 7)

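With the filter applied, the number of rows decreases from 57,430 to 57,025, so 405 rows are outliers based on the data. The filter is commented out in the cell above, so all 57,430 rows are retained here.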

In [20]:
data_2012_main.head(5)

Unnamed: 0,esid,business_area,service_address,bill_type,bill_date,total_due,kwh_usage
0,1008901000140050014100,2000,10518 BELLAIRE,T,2012-06-26,73232.11,905421
1,1008901000140050014100,2000,10518 BELLAIRE,T,2012-05-25,68463.46,824107
2,1008901000140050014100,2000,10518 BELLAIRE,T,2012-04-27,80847.87,977744
3,1008901000140050014100,2000,10518 BELLAIRE,T,2012-03-27,74373.93,876838
4,1008901000140050014100,2000,10518 BELLAIRE,T,2012-02-27,74080.27,872898


In [21]:
orig_shape_2012 - data_2012_main.shape[0]

0

In [22]:
data_2012_main.to_csv('electricity_usage_data_2012.csv', index=False)

The trend graphs of cost and energy usage have the same shape, because cost is approximately energy usage multiplied by the cost per unit.
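
The relationship between the two trends can be checked numerically. The sketch below uses invented monthly figures and hypothetical rates: with a constant rate the two curves are identical after rescaling, while a rate that varies by month makes them diverge somewhat.

```python
import numpy as np
import pandas as pd

# Synthetic monthly usage in kWh (illustrative values only).
months = pd.period_range("2011-07", periods=6, freq="M")
usage = pd.Series([1200, 1500, 1400, 900, 700, 800], index=months, dtype=float)

# Case 1: constant price per kWh (hypothetical rate).
rate = 0.08
cost = usage * rate

# Rescale both series to a maximum of 1 so their shapes can be compared.
usage_scaled = usage / usage.max()
cost_scaled = cost / cost.max()
same_shape = np.allclose(usage_scaled, cost_scaled)

# Case 2: the price changes from month to month (hypothetical rates).
varying_rate = pd.Series([0.08, 0.08, 0.09, 0.10, 0.10, 0.07], index=months)
cost_varying = usage * varying_rate
corr = usage.corr(cost_varying)

print("Identical shape with a constant rate:", same_shape)
print(f"Correlation with a varying rate: {corr:.3f}")
```

In the real bills, fees and adjustments are added on top of the energy charge, so the match between the two trends is expected to be close rather than exact.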