### Challenge Description

Energy in buildings and industry is often wasted through user behaviour, human error, and poorly performing equipment. Identifying abnormal power consumption can therefore play an important part in reducing peak energy consumption and changing undesirable user behaviour. With the widespread rollout of smart meters, normal operating consumption can be learned over time and used to flag abnormal consumption. Such information can tell users when their equipment is not operating normally, help change user behaviour, and even indicate which appliances are causing the problem so that lasting changes can be made.

This challenge invites data scientists to apply their skills to an anomaly detection problem using smart meter data. Ideally, the algorithm should begin to operate after as little as 3 months of data and should improve over time. A platform to visualise the anomalies would also be useful. Participants may use any type of machine learning algorithm to detect anomalies in the data.
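
One way to read the "start after 3 months, improve over time" requirement is a detector whose statistics use only past readings and that stays silent during a warm-up period. The sketch below is a minimal illustration under stated assumptions, not the challenge's reference method: the function name `flag_after_warmup`, the z-score rule, the threshold of 4.0, and the synthetic data are all hypothetical choices.

```python
import numpy as np
import pandas as pd

def flag_after_warmup(readings, warmup_days=90, z_threshold=4.0):
    """Flag readings far from the expanding mean of all earlier readings.

    readings: pd.Series of meter values indexed by timestamp.
    Returns a boolean Series (True = flagged as anomalous).
    """
    # shift(1) excludes the current reading, so each z-score uses past data only
    past = readings.shift(1).expanding(min_periods=2)
    z = (readings - past.mean()) / past.std()
    # raise no flags until the warm-up period has elapsed
    ready = readings.index >= readings.index[0] + pd.Timedelta(days=warmup_days)
    return (z.abs() > z_threshold) & ready

# Synthetic data (NOT the Kaggle sample): 120 days of 15-minute readings
idx = pd.date_range("2016-01-01 00:15", periods=120 * 96, freq="15min")
rng = np.random.default_rng(0)
demo = pd.Series(3.0 + 0.1 * rng.standard_normal(len(idx)), index=idx)
demo.iloc[10 * 96] = 10.0   # spike during the warm-up: must not be flagged
demo.iloc[100 * 96] = 10.0  # spike after the warm-up: should be flagged

flags = flag_after_warmup(demo)
print(flags.iloc[10 * 96], flags.iloc[100 * 96])
```

Because the baseline is an expanding window, its estimates tighten as more data arrives. A practical detector would probably condition the baseline on time of day and day of week rather than use one global mean.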

### Data
A sample of smart meter data can be found on [Kaggle](https://www.kaggle.com/portiamurray/anomaly-detection-smart-meter-data-sample). Participants are encouraged to find other smart meter data to test their algorithms on.

### Imports

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import random
from datetime import datetime, timedelta
import collections as cl

### Load file to 'data' dataframe

In [2]:
data = pd.read_excel('SmartMeterSample.xlsx')
data.columns = ['timestamp', 'reading']
data = data.set_index('timestamp')
print(data)  # note: `data.head` without parentheses only returns the bound method

                     reading
timestamp                   
2016-01-01 00:15:00     2.85
2016-01-01 00:30:00     2.85
2016-01-01 00:45:00     3.00
2016-01-01 01:00:00     2.94
2016-01-01 01:15:00     2.79
...                      ...
2017-05-10 23:00:00     3.60
2017-05-10 23:15:00     3.51
2017-05-10 23:30:00     3.60
2017-05-10 23:45:00     3.51
2017-05-11 00:00:00     3.57

[47581 rows x 1 columns]

In [3]:
data = data[~data.index.duplicated(keep='first')] # there are 4 entries with the same timestamp
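
Before dropping rows it can help to count how many timestamps are repeated. The sketch below shows the same `index.duplicated` call on a toy frame; the frame `toy` and its values are hypothetical stand-ins, not the Kaggle sample.

```python
import pandas as pd

# Toy frame (hypothetical values) in which 00:30 appears twice
toy = pd.DataFrame(
    {"reading": [2.85, 2.85, 3.00, 2.94]},
    index=pd.to_datetime([
        "2016-01-01 00:15", "2016-01-01 00:30",
        "2016-01-01 00:30", "2016-01-01 00:45",
    ]),
)

# True for the second and later occurrences of each timestamp
dup_mask = toy.index.duplicated(keep="first")
n_dups = int(dup_mask.sum())
deduped = toy[~dup_mask]  # keeps the first reading for every timestamp

print(n_dups, len(deduped), deduped.index.is_unique)
```

With `keep='first'` the earliest row for each timestamp survives. Whether the first or the last reading is the right one to keep depends on how the meter reports corrections, which the sample does not document.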

In [4]:
data.iloc[-1]

reading    3.57
Name: 2017-05-11 00:00:00, dtype: float64

### Working with timestamps

In [5]:
# how to get a timestamp from string
datetime.fromisoformat('2016-01-01 00:15:00')
# how to add 90 days to a specific time
datetime.fromisoformat('2016-01-01 00:15:00') + timedelta(days=90)

datetime.datetime(2016, 3, 31, 0, 15)

In [6]:
from datetime import datetime, timedelta

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta
# generate timestamps for the whole period: 2016-01-01 00:15 through 2017-05-10 23:45
# (the end bound is exclusive, so 23:45 is the last timestamp produced)
dt = list(datetime_range(datetime(2016, 1, 1, 0, 15),
                         datetime(2017, 5, 10, 23, 55),
                         timedelta(minutes=15)))

print(len(dt))

47615
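
As an alternative to a hand-written generator, `pd.date_range` can build the same 15-minute grid in one call. This is a sketch of an equivalent approach, not a change to the cell above; the variable name `grid` is an illustrative choice. Unlike the generator, `date_range` includes its `end` point, so the end is given as 23:45.

```python
import pandas as pd

# 15-minute grid over the same period; both endpoints are included
grid = pd.date_range(start="2016-01-01 00:15", end="2017-05-10 23:45", freq="15min")

print(len(grid))          # 47615, the same count as the generator
print(grid[0], grid[-1])
```

The result is a `DatetimeIndex`, which can be passed directly as a DataFrame index.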


In [7]:
df = pd.DataFrame(dt)
df.columns = ['timestamp']
df['reading'] = np.nan
df = df.set_index('timestamp')
df.head()

                     reading
timestamp                   
2016-01-01 00:15:00      NaN
2016-01-01 00:30:00      NaN
2016-01-01 00:45:00      NaN
2016-01-01 01:00:00      NaN
2016-01-01 01:15:00      NaN


In [8]:
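for index in df.index:
    if index in data.index:
        # assign through a single .loc call; chained indexing
        # (df.loc[index]['reading'] = ...) may write to a temporary copy
        df.loc[index, 'reading'] = data.loc[index, 'reading']
    else:
        print('Missing value at ', index)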

Missing value at  2016-03-27 02:00:00
Missing value at  2016-03-27 02:15:00
Missing value at  2016-03-27 02:30:00
Missing value at  2016-03-27 02:45:00
Missing value at  2016-11-10 01:45:00
Missing value at  2017-02-24 00:00:00
Missing value at  2017-02-24 00:15:00
Missing value at  2017-02-24 00:30:00
Missing value at  2017-02-24 00:45:00
Missing value at  2017-02-24 01:00:00
Missing value at  2017-02-24 01:15:00
Missing value at  2017-02-24 01:30:00
Missing value at  2017-02-24 01:45:00
Missing value at  2017-02-24 02:00:00
Missing value at  2017-02-24 02:15:00
Missing value at  2017-02-24 02:30:00
Missing value at  2017-02-24 02:45:00
Missing value at  2017-02-24 03:00:00
Missing value at  2017-02-24 03:15:00
Missing value at  2017-02-24 03:30:00
Missing value at  2017-02-24 03:45:00
Missing value at  2017-02-24 04:00:00
Missing value at  2017-02-24 04:15:00
Missing value at  2017-02-24 04:30:00
Missing value at  2017-02-24 04:45:00
Missing value at  2017-02-24 05:00:00
Missing valu
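
The row-by-row alignment can also be done without a loop: reindexing onto the full grid leaves `NaN` wherever a timestamp is absent. The sketch below uses a toy frame with hypothetical values rather than the Kaggle sample.

```python
import pandas as pd

# Toy stand-in for the meter data (hypothetical values); the 00:45 reading is absent
toy = pd.DataFrame(
    {"reading": [2.85, 2.85, 2.94]},
    index=pd.to_datetime(["2016-01-01 00:15", "2016-01-01 00:30", "2016-01-01 01:00"]),
)

# Full 15-minute grid spanning the observed period
grid = pd.date_range(toy.index.min(), toy.index.max(), freq="15min")

aligned = toy.reindex(grid)                          # absent timestamps become NaN rows
missing = aligned.index[aligned["reading"].isna()]   # timestamps with no reading

print(list(missing))
```

`reindex` requires a unique index, so duplicated timestamps have to be removed first. The four gaps reported on 2016-03-27 between 02:00 and 02:45 coincide with the start of European summer time, which would explain them if the meter records local time; that is an assumption about the data, not something the sample states.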

In [9]:
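for index, row in df.iterrows():
    val = row['reading']
    # NaN never compares equal to anything (not even itself), so
    # `val == np.nan` is always False; use pd.isna instead
    if pd.isna(val):
        print('Missing value at ', index)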
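
Missing readings have to be detected with `pd.isna` (or `np.isnan`) rather than `==`, because IEEE 754 defines NaN as unequal to every value, itself included. A short demonstration:

```python
import numpy as np
import pandas as pd

val = np.nan

eq_result = (val == np.nan)    # False: an equality test can never find a NaN
isna_result = pd.isna(val)     # True
isnan_result = np.isnan(val)   # True

print(eq_result, isna_result, isnan_result)
```

`pd.isna` also accepts `None` and `NaT`, which makes it the safer choice for values read from a DataFrame.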

In [10]:
df['freq'] = 0.0
appr = cl.defaultdict(float)  # occurrences of each reading value seen so far
count = 0                     # number of non-missing readings seen so far
alert_threshold = 0.0002

for index, row in df.iterrows():
    val = row['reading']
    if pd.isna(val):
        print('Missing value at ', index)
    else:
        count += 1
        appr[val] += 1  # defaultdict starts unseen values at 0
        # running relative frequency of this value; write it back with df.loc,
        # because `row` is a copy and assigning to it does not change df
        freq = appr[val] / count
        df.loc[index, 'freq'] = freq
        if freq < alert_threshold:
            print('Anomalous value at ', index)

Anomalous value at  2016-03-14 14:15:00
Missing value at  2016-03-27 02:00:00
Missing value at  2016-03-27 02:15:00
Missing value at  2016-03-27 02:30:00
Missing value at  2016-03-27 02:45:00
Anomalous value at  2016-09-25 13:45:00
Anomalous value at  2016-11-01 10:00:00
Anomalous value at  2016-11-09 09:00:00
Anomalous value at  2016-11-09 09:15:00
Anomalous value at  2016-11-09 09:45:00
Anomalous value at  2016-11-09 10:00:00
Missing value at  2016-11-10 01:45:00
Anomalous value at  2016-11-23 09:30:00
Anomalous value at  2016-12-05 10:45:00
Anomalous value at  2016-12-05 11:00:00
Anomalous value at  2016-12-05 11:15:00
Anomalous value at  2016-12-06 09:45:00
Anomalous value at  2016-12-06 10:00:00
Anomalous value at  2016-12-07 10:00:00
Anomalous value at  2016-12-08 10:45:00
Anomalous value at  2016-12-13 15:00:00
Anomalous value at  2016-12-14 09:30:00
Anomalous value at  2016-12-14 09:45:00
Anomalous value at  2016-12-14 10:00:00
Anomalous value at  2016-12-14 10:15:00
Anomalous 
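
The running relative frequency used above (occurrences of a value so far, divided by the number of non-missing readings so far) can also be computed without a Python loop. The sketch below is one possible vectorised formulation; the helper name `running_frequency` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def running_frequency(readings):
    """Share of readings so far that equal the current reading (NaN where missing)."""
    valid = readings.dropna()
    # 1-based count of how often each value has occurred up to and including this row
    seen = valid.groupby(valid).cumcount() + 1
    # 1-based count of all non-missing readings up to and including this row
    total = np.arange(1, len(valid) + 1)
    return (seen / total).reindex(readings.index)

# Toy series (hypothetical values): one gap and one rare value
toy = pd.Series([3.0, 3.0, np.nan, 3.0, 9.0, 3.0])
freq = running_frequency(toy)
print(freq.tolist())  # 1/1, 2/2, NaN, 3/3, 1/4, 4/5
```

A value seen for the first time has frequency `1 / count`, so with `alert_threshold = 0.0002` nothing can be flagged until more than 5,000 readings have accumulated, roughly 52 days of 15-minute data. The threshold therefore acts as an implicit warm-up period.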