# Preprocessing the Low Carbon London (LCL) dataset

This dataset can be downloaded from https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households. It contains the energy consumption of 5566 households in London from November 2011 to February 2014, recorded at half-hourly intervals. After unzipping the data you can read the CSV file; however, since the file is over 8 GB, it is easier to read it in chunks.


In [1]:
import pandas as pd
import numpy as np

In [None]:
path_to_data = r"path_to_your_folder_with_data"

In [2]:
# raw string so that the backslash in the Windows-style path is not treated as an escape
pd.read_csv(path_to_data + r'\CC_LCL-FullData.csv', nrows=100)

Unnamed: 0,LCLid,stdorToU,DateTime,KWH/hh (per half hour)
0,MAC000002,Std,2012-10-12 00:30:00.0000000,0.000
1,MAC000002,Std,2012-10-12 01:00:00.0000000,0.000
2,MAC000002,Std,2012-10-12 01:30:00.0000000,0.000
3,MAC000002,Std,2012-10-12 02:00:00.0000000,0.000
4,MAC000002,Std,2012-10-12 02:30:00.0000000,0.000
...,...,...,...,...
95,MAC000002,Std,2012-10-14 00:30:00.0000000,0.166
96,MAC000002,Std,2012-10-14 01:00:00.0000000,0.226
97,MAC000002,Std,2012-10-14 01:30:00.0000000,0.088
98,MAC000002,Std,2012-10-14 02:00:00.0000000,0.126


We want to keep only the households with the 'Std' tariff and remove all the 'ToU' users. Some missing values in the data are marked as 'Null'; we parse them as NaN. Each chunk is also aggregated from half-hourly to hourly consumption. This process takes a few minutes because all the data is concatenated into one dataframe. It could be sped up by handling the processed chunks separately, but we keep the data in one file for convenience.
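
One detail worth checking before running this on the full file is how the 'Null' markers interact with an hourly sum. The sketch below uses a made-up four-row CSV in the same column layout (the values are illustrative, not taken from the dataset) and compares the default `sum()` with `sum(min_count=1)`. It is only a small illustration of pandas behaviour, not part of the preprocessing itself.

```python
import io
import pandas as pd

# Toy CSV in the LCL column layout; the readings are made up for illustration.
kwh = "KWH/hh (per half hour) "  # the column name ends with a space
csv_text = (
    "LCLid,stdorToU,DateTime," + kwh + "\n"
    "MAC000002,Std,2012-10-12 00:00:00,Null\n"
    "MAC000002,Std,2012-10-12 00:30:00,Null\n"
    "MAC000002,Std,2012-10-12 01:00:00,0.2\n"
    "MAC000002,Std,2012-10-12 01:30:00,Null\n"
)

# na_values=['Null'] turns the text marker into NaN, so the column parses as float.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateTime"], na_values=["Null"])

# 'h' is the hourly frequency alias (the notebook cells use the older spelling 'H').
hourly = toy.groupby(["LCLid", pd.Grouper(key="DateTime", freq="h")])[kwh]

default_sum = hourly.sum()            # NaN is skipped; an all-NaN hour becomes 0.0
strict_sum = hourly.sum(min_count=1)  # an all-NaN hour stays NaN

print(default_sum)
print(strict_sum)
```

With the default `sum()`, an hour in which both readings are missing is reported as zero consumption, which is indistinguishable from a genuine zero. If that distinction matters downstream, `min_count=1` is one way to preserve it.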

In [None]:
chunksize = 10**7
kwh_col = 'KWH/hh (per half hour) '  # the column name in the CSV ends with a space
chunks = []
with pd.read_csv(path_to_data + r'\CC_LCL-FullData.csv',
                 chunksize=chunksize, dtype={kwh_col: 'float64', 'LCLid': 'str'},
                 parse_dates=['DateTime'], na_values=['Null']) as reader:
    for chunk in reader:
        # keep only the standard-tariff households
        chunk = chunk[chunk['stdorToU'] == 'Std']
        chunk = chunk[[kwh_col, 'LCLid', 'DateTime']]
        chunk = chunk.rename(columns={kwh_col: 'consumption', 'LCLid': 'id', 'DateTime': 'datetime'})
        # aggregate half-hourly readings to hourly totals (sum() skips NaN values)
        chunk = chunk.groupby(['id', pd.Grouper(key='datetime', freq='H')]).sum().reset_index()
        chunks.append(chunk)

# concatenate once at the end instead of growing the dataframe inside the loop
df = pd.concat(chunks, ignore_index=True)
# an hour can be split across two chunks, so merge the partial sums
df = df.groupby(['id', 'datetime'], as_index=False).sum()
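
Aggregating to hourly values inside each chunk has one edge case: when the two half-hourly readings of an hour fall into different chunks, each chunk produces its own partial row for that hour. The following sketch reproduces this on a made-up four-row CSV with a hypothetical household `A` and `chunksize=3`, then merges the partial rows with a second `groupby`. It is a minimal illustration under these assumptions, not a statement about where the chunk boundaries fall in the real file.

```python
import io
import pandas as pd

# Made-up half-hourly readings for one hypothetical household.
csv_text = (
    "id,datetime,consumption\n"
    "A,2012-10-12 00:00:00,0.1\n"
    "A,2012-10-12 00:30:00,0.2\n"
    "A,2012-10-12 01:00:00,0.3\n"
    "A,2012-10-12 01:30:00,0.4\n"
)

parts = []
# chunksize=3 puts the 01:00 reading in the first chunk and 01:30 in the second.
with pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"], chunksize=3) as reader:
    for chunk in reader:
        hourly = chunk.groupby(["id", pd.Grouper(key="datetime", freq="h")]).sum().reset_index()
        parts.append(hourly)

stacked = pd.concat(parts, ignore_index=True)
# The 01:00 hour now appears twice, once per chunk.
n_split = int(stacked.duplicated(["id", "datetime"]).sum())

# A second aggregation over the stacked result merges the partial sums.
combined = stacked.groupby(["id", "datetime"], as_index=False)["consumption"].sum()

print(n_split)
print(combined)
```

Without the second aggregation, the duplicated `(id, datetime)` pairs would also produce duplicated rows in any later merge on those two columns.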

In [4]:
def count_missing_datetimes(group):
    """Count the hourly timestamps missing between a household's first and last reading."""
    min_datetime = group['datetime'].min()
    max_datetime = group['datetime'].max()
    all_datetimes = pd.date_range(min_datetime, max_datetime, freq='1H')
    missing = all_datetimes[~all_datetimes.isin(group['datetime'])]
    return len(missing)
    
missing_counts = df.groupby('id').apply(count_missing_datetimes)
print(missing_counts.sum())
print(len(missing_counts))

190967
4443


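After filtering, the dataset contains 4443 households and 190,967 missing hourly timestamps in total (close to 200k). We add these timestamps back as rows; their consumption is left empty (NaN).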

In [6]:
def generate_missing_datetimes(group):
    """Build the complete hourly range between a household's first and last reading."""
    min_datetime = group['datetime'].min()
    max_datetime = group['datetime'].max()
    all_datetimes = pd.date_range(min_datetime, max_datetime, freq='1H')
    return pd.DataFrame({'id': group['id'].iloc[0], 'datetime': all_datetimes})

# one row per household and hour, including the hours that have no reading
df_all_datetimes = df.groupby('id').apply(generate_missing_datetimes).reset_index(drop=True)

In [7]:
df['datetime'] = pd.to_datetime(df['datetime'])
df_all_datetimes = df_all_datetimes.merge(df, how='left', on=['id', 'datetime'])
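
To see what the left merge does on a small scale, the sketch below applies the same idea (a full hourly grid per household, then a left merge) to a made-up frame with one hypothetical household `A` that has a single missing hour. The names `toy`, `grid`, and `filled` are illustrative only.

```python
import pandas as pd

# Four consecutive hours; the toy household has no reading for the third one.
hours = pd.date_range("2012-10-12 00:00", periods=4, freq="h")
toy = pd.DataFrame({
    "id": ["A", "A", "A"],
    "datetime": hours[[0, 1, 3]],
    "consumption": [0.3, 0.7, 0.5],
})

# Full hourly grid between each household's first and last reading.
grid = pd.concat(
    [
        pd.DataFrame({
            "id": household,
            "datetime": pd.date_range(g["datetime"].min(), g["datetime"].max(), freq="h"),
        })
        for household, g in toy.groupby("id")
    ],
    ignore_index=True,
)

# The left merge keeps every grid row; hours without a reading get NaN consumption.
filled = grid.merge(toy, how="left", on=["id", "datetime"])
n_added = int(filled["consumption"].isna().sum())

print(filled)
print(n_added)
```

The same check, counting NaN values in `consumption` after the merge, can be run on `df_all_datetimes` to confirm how many rows were added, provided the consumption column had no NaN values before the merge.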

In [None]:
df_all_datetimes.to_csv('preprocessed_lcl.csv', index=False)