# Preprocessing the Smart Grid Smart City (SGSC) dataset

The data comes from the Smart Grid Smart City (SGSC) project (2010-2014) and can be downloaded from the Australian government website: https://data.gov.au/dataset/ds-dga-4e21dea3-9b87-4610-94c7-15a8a77907ef/details. We need three files: the zipped readings data (*CD_INTERVAL_READING_ALL_NO_QUOTES.csv*), a csv describing each customer (*sgsc-customers.csv*), and an xlsx data dictionary explaining the fields in the csv files (*sgsc-data-dictionary.xlsx*). From the list of all customers we keep only those who are in the control group and have no automatically controlled load or solar generation, because we do not want the effects of different tariffs or other additional factors in the data.

In [1]:
import pandas as pd

In [2]:
path_to_data = r'path_to_your_folder_with_data'

In [3]:
# A forward slash avoids the invalid '\s' escape sequence and also works on Windows.
customers = pd.read_csv(path_to_data + '/sgsc-customers.csv')
customers = customers[customers['CONTROL_GROUP_FLAG']=='Y']
customers = customers[customers['CONTROLLED_LOAD_CNT']==0]
customers = customers[customers['NET_SOLAR_CNT']==0]
customers = customers[customers['GROSS_SOLAR_CNT']==0]
customers = set(customers['CUSTOMER_KEY'])

print(len(customers))

1198


This leaves 1198 customers, whose readings we now extract. The step takes a few minutes because the readings file is large and the filtered rows of every chunk are concatenated into a single dataframe. Aggregating or saving the data chunk by chunk would be faster, but we keep everything in one file for convenience.

In [None]:
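chunksize = 10**7
filtered_chunks = []

# A forward slash avoids the invalid '\C' escape sequence and also works on Windows.
with pd.read_csv(path_to_data + '/CD_INTERVAL_READING_ALL_NO_QUOTES.csv', chunksize=chunksize, parse_dates=['READING_DATETIME']) as reader:
    for chunk in reader:
        # The leading space in ' GENERAL_SUPPLY_KWH' is deliberate; it mirrors the csv header.
        chunk = chunk[['CUSTOMER_ID', 'READING_DATETIME', ' GENERAL_SUPPLY_KWH']]
        chunk = chunk.rename(columns={' GENERAL_SUPPLY_KWH': 'consumption', 'READING_DATETIME': 'datetime', 'CUSTOMER_ID': 'id'})
        # Keep only the readings of the selected customers.
        filtered_chunks.append(chunk[chunk['id'].isin(customers)])

# Concatenate once at the end; growing a dataframe inside the loop copies it on every iteration.
df = pd.concat(filtered_chunks, ignore_index=True)

# Aggregate the readings to hourly consumption per customer.
df = df.groupby(['id', pd.Grouper(key='datetime', freq='h')]).sum().reset_index()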
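
The chunk-wise alternative mentioned above could look roughly like the sketch below. Instead of keeping every filtered reading in memory, it sums each chunk to hourly values straight away and combines the partial sums at the end, since one customer-hour can be split across two chunks. The function name `hourly_sums_by_chunk` and the toy csv are made up for illustration; only the column names are taken from the cell above.

```python
import io

import pandas as pd


def hourly_sums_by_chunk(csv_source, customer_ids, chunksize=10**6):
    """Sum consumption per customer and hour, one chunk at a time (illustrative helper)."""
    partial_sums = []
    with pd.read_csv(csv_source, chunksize=chunksize, parse_dates=['READING_DATETIME']) as reader:
        for chunk in reader:
            chunk = chunk.rename(columns={' GENERAL_SUPPLY_KWH': 'consumption',
                                          'READING_DATETIME': 'datetime',
                                          'CUSTOMER_ID': 'id'})
            chunk = chunk.loc[chunk['id'].isin(customer_ids), ['id', 'datetime', 'consumption']].copy()
            if chunk.empty:
                continue  # no selected customer in this chunk
            # Round each timestamp down to the start of its hour, then sum within the chunk.
            chunk['datetime'] = chunk['datetime'].dt.floor('h')
            partial_sums.append(chunk.groupby(['id', 'datetime'], as_index=False)['consumption'].sum())
    # An hour may appear in several chunks, so the partial sums are summed once more.
    combined = pd.concat(partial_sums, ignore_index=True)
    return combined.groupby(['id', 'datetime'], as_index=False)['consumption'].sum()


# Toy data with the same header as the readings file (values are invented).
toy_csv = io.StringIO(
    "CUSTOMER_ID,READING_DATETIME, GENERAL_SUPPLY_KWH\n"
    "1,2013-01-01 00:00:00,0.25\n"
    "1,2013-01-01 00:30:00,0.5\n"
    "2,2013-01-01 00:00:00,1.0\n"
    "1,2013-01-01 01:00:00,0.25\n"
    "3,2013-01-01 00:00:00,9.0\n"
)
result = hourly_sums_by_chunk(toy_csv, customer_ids={1, 2}, chunksize=1)
print(result)
```

With `chunksize=1` every reading arrives in its own chunk, so the two half-hour readings of customer 1 are only merged by the final `groupby`. Whether this variant is actually faster on the full file has not been measured here; its main benefit is the lower memory footprint.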

In [7]:
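def count_missing_datetimes(group):
    """Count the hours between a customer's first and last reading that have no row."""
    min_datetime = group['datetime'].min()
    max_datetime = group['datetime'].max()
    # 'h' replaces the alias 'H', which is deprecated in recent pandas versions.
    all_datetimes = pd.date_range(min_datetime, max_datetime, freq='1h')
    missing = all_datetimes[~all_datetimes.isin(group['datetime'])]
    return len(missing)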
    
missing_counts = df.groupby('id').apply(count_missing_datetimes)
print(missing_counts.sum())

49494


Close to 50k hourly timestamps are missing, so we add them (their consumption stays empty) and remove the duplicate hours caused by the daylight saving time change.

In [8]:
df = df.drop_duplicates(subset=['id', 'datetime'])

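def generate_missing_datetimes(group):
    """Build the complete hourly range between a customer's first and last reading."""
    min_datetime = group['datetime'].min()
    max_datetime = group['datetime'].max()
    # 'h' replaces the alias 'H', which is deprecated in recent pandas versions.
    all_datetimes = pd.date_range(min_datetime, max_datetime, freq='1h')
    return pd.DataFrame({'id': group['id'].iloc[0], 'datetime': all_datetimes})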

df_all_datetimes = df.groupby('id').apply(generate_missing_datetimes).reset_index(drop=True)  
df_all_datetimes = df_all_datetimes.merge(df, how='left', on=['id', 'datetime'])     
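
To see what this gap-filling step produces, here is a small sketch on invented data. It uses the same idea as the cell above (a full hourly range per customer, then a left merge), but loops over the groups instead of calling `groupby().apply()`. The helper name `complete_hourly_index` and the toy values are assumptions made for the example only.

```python
import pandas as pd

# Invented readings: customer 1 has a two-hour gap, customer 2 has none.
toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'datetime': pd.to_datetime(['2013-01-01 00:00', '2013-01-01 03:00',
                                '2013-01-01 10:00', '2013-01-01 11:00']),
    'consumption': [0.5, 0.7, 1.0, 1.2],
})


def complete_hourly_index(frame):
    """Add one row for every missing hour between each customer's first and last reading."""
    pieces = []
    for customer_id, group in frame.groupby('id'):
        hours = pd.date_range(group['datetime'].min(), group['datetime'].max(), freq='h')
        pieces.append(pd.DataFrame({'id': customer_id, 'datetime': hours}))
    full_index = pd.concat(pieces, ignore_index=True)
    # The left merge keeps every hour; hours without a reading get NaN consumption.
    return full_index.merge(frame, how='left', on=['id', 'datetime'])


completed = complete_hourly_index(toy)
print(completed)
```

The added rows carry `NaN` in `consumption`, so the gaps stay visible in the saved file. How to fill them (interpolation, dropping customers with many gaps, and so on) is left to whatever analysis follows.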

In [10]:
df_all_datetimes.to_csv('preprocessed_sgsc.csv', index=False)