## Import Statements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats

# Show up to 25 columns when displaying a DataFrame
pd.options.display.max_columns = 25

## Reading and Merging Data

In [2]:
data_2012_main = pd.read_csv('electricity_usage_data_2012.csv')
data_2013_main = pd.read_csv('electricity_usage_data_2013.csv')
data_2013_2_main = pd.read_csv('electricity_usage_data_2013_2.csv')
data_2014_main = pd.read_csv('electricity_usage_data_2014.csv')

In [3]:
df_list = [data_2012_main, data_2013_main, data_2013_2_main, data_2014_main]

data = pd.concat(df_list)
data.shape

(257848, 14)

Since each DataFrame was cleaned separately, there should not be any NaNs, but we check for them nonetheless.

In [4]:
data.isna().sum()

ESID                200418
Business Area       200418
Service Address     200418
Bill Type           200418
Bill Date           200418
Total Due ($)       200418
kWh Usage           200418
esid                 57430
business_area        57430
service_address      57430
bill_type            57430
bill_date            57430
total_due            57430
kwh_usage            57430
dtype: int64
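The output lists two sets of column labels (for example `ESID` and `esid`) whose missing counts add up to the total row count, which suggests that the files do not share one header convention: `pd.concat` aligns columns by name, so differently named columns end up side by side and are padded with NaN. A minimal sketch of one way to harmonize the headers before concatenating is given here. The mapping `COLUMN_MAP` and the helper `harmonize_columns` are hypothetical names introduced for illustration, and the toy frames stand in for the real CSV files.

```python
import pandas as pd

# Hypothetical mapping from the title-case headers to the snake_case ones,
# based only on the column names visible in the isna() output.
COLUMN_MAP = {
    "ESID": "esid",
    "Business Area": "business_area",
    "Service Address": "service_address",
    "Bill Type": "bill_type",
    "Bill Date": "bill_date",
    "Total Due ($)": "total_due",
    "kWh Usage": "kwh_usage",
}


def harmonize_columns(df):
    """Return a copy of df with known header variants renamed to snake_case."""
    return df.rename(columns=COLUMN_MAP)


# Toy frames standing in for two yearly files with different headers.
toy_a = pd.DataFrame({"ESID": [1], "Bill Date": ["2012-01-05"], "kWh Usage": [100]})
toy_b = pd.DataFrame({"esid": [2], "bill_date": ["2013-01-05"], "kwh_usage": [120]})

# After renaming, concat aligns the columns and no padding NaNs appear.
toy = pd.concat([harmonize_columns(f) for f in (toy_a, toy_b)], ignore_index=True)
print(toy.isna().sum())
```

Applied to the real data, the same idea would mean renaming each frame in `df_list` before the `pd.concat` call, assuming the two header variants really describe the same fields.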

### Checking for Duplicate Rows

Since the CSV files overlap in time, it is important to make sure that no rows are repeated.

We can check for this in the following way: a particular ESID should be billed only once per bill date. Therefore, by checking for duplicates on the subset of ESID, business area, service address and bill date, we can tell whether a particular customer's billing record has been repeated in the DataFrame.

In [5]:
# Boolean mask: True for every row that repeats an earlier row on the key columns
dup_rows_index = data.duplicated(subset=['esid', 'business_area', 'service_address', 'bill_date'])
dup_rows_index.sum()

113345
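To make the behavior of `duplicated` concrete, here is a small sketch on invented data (the values and the reduced key are assumptions, not taken from the real files). It marks every repeat after the first occurrence, and it also treats rows whose key columns are all NaN as equal to one another, which is worth checking for before trusting the count.

```python
import numpy as np
import pandas as pd

# Invented rows: one true repeat (rows 0 and 1) and two rows with a missing key.
toy = pd.DataFrame({
    "esid": [1, 1, 2, np.nan, np.nan],
    "bill_date": ["2013-01-05", "2013-01-05", "2013-01-05", np.nan, np.nan],
    "kwh_usage": [100, 100, 80, 50, 60],
})
key = ["esid", "bill_date"]

# True for each row that repeats an earlier row on the key columns.
dup_mask = toy.duplicated(subset=key)

# Rows where the whole key is missing; these match each other on the key.
n_missing_key = toy[key].isna().all(axis=1).sum()

print(dup_mask.tolist())
print(n_missing_key)
```

If `n_missing_key` is nonzero on the real data, part of the duplicate count comes from missing keys rather than from repeated bills.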

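The nonzero count confirms the suspicion that the overlap between the FY 2012 and FY 2013 CSV files has produced duplicate rows in the data. Note, however, that the 57430 rows whose lowercase key columns are all NaN (see the `isna()` output) also compare equal to one another on this subset, so they contribute to the count as well.
We need to remove these rows.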

In [6]:
# Keep only the first occurrence of each key combination
data_main = data[~dup_rows_index]

data_main.shape

(144503, 14)
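The mask-based filtering can also be written with `DataFrame.drop_duplicates`, which keeps the first occurrence by default. The sketch below uses invented data and a reduced key purely to illustrate that the two forms select the same rows.

```python
import pandas as pd

# Invented rows: rows 0 and 1 share the same key.
toy = pd.DataFrame({
    "esid": [1, 1, 2],
    "bill_date": ["2013-01-05", "2013-01-05", "2013-01-05"],
    "kwh_usage": [100, 100, 80],
})
key = ["esid", "bill_date"]

# Form used in this notebook: invert the duplicate mask.
via_mask = toy[~toy.duplicated(subset=key)]

# Equivalent one-step form.
via_method = toy.drop_duplicates(subset=key, keep="first")

print(via_mask.equals(via_method))
```

Keeping the explicit mask, as done here, has the advantage that the number of flagged rows can be inspected before anything is dropped.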

In [7]:
data_main.to_csv('Electricity_Usage_Data.csv', index=False)