### Objective and Input

This notebook takes in a csv file (e.g. NYCHA_TS.csv) as input with **Building_Meter**, **Month**, **Value** as necessary column names and output a dataframe with a column (**Anomaly**) identifying whether the individual records is an anomaly point or not, a column (**Reconstructed_Value**) showing the reconstructed value from cluster centroilds ,and finally, a column (**Reconstruction_Error**) denoting the reconstruction error calculated by the absolute different between Original Value and Reconstructed Value.

### Required Packages

We require pandas and numpy for dataframe, row, column, and cell manipulation; and we require KMeans from sklearn cluster package to calculated centroids for each chunk of waveforms.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

### User Defined Functions

There are two main user defined functions, one for slicing the account level time series data into waveforms of 8 data point, and another one for calculating clusters of waveforms and the centroilds for each cluster and stitching all centroids together to reconstruct the time series trend without anomalies.

In [2]:
# Step 1: slice account level trend into 8 data point waveforms with sliding window of 1 step
def sliding_chunker(data, window_len, slide_len):
    """
    Split an account level trend data into waveforms,
    each waveform is window_len long,
    sliding along by slide_len each time.
    If the list doesn't have enough elements for the final sub-list 
    to be window_len long, the remaining data will be dropped.
    e.g. sliding_chunker(range(6), window_len=3, slide_len=2)
    gives [ [0, 1, 2], [2, 3, 4] ]
    """
    chunks = []
    for pos in range(0, len(data), slide_len):
        chunk = np.copy(data[pos:pos+window_len])
        if len(chunk) != window_len:
            continue
        chunks.append(chunk)

    return chunks

In [3]:
# Step 2: cluster the chucks into 12 clusters using KMeans clsutering and reconstruct by taking mean of centroids
def clustering_reconstruction(df_one_building, segment_len = 8, slide_len = 1):
    """
    This functions consists of two main parts: Clustering and Reconstruction.
    The Clustering part clusters segments from slicer into 12 clusters.
    The Reconstruction part stitches all centroids mean by original data index.
    """
    segments = []
    for start_pos in range(0, len(df_one_building['Value']), slide_len):
        end_pos = start_pos + segment_len
        # make a copy so changes to 'segments' doesn't modify the original data
        segment = np.copy(df_one_building['Value'][start_pos:end_pos])
        # if we're at the end and we've got a truncated segment, drop it
        if len(segment) != segment_len:
            continue
        segments.append(segment)
        
    # use KMeans function from sklearn to cluster segments into 12 clusters representing each month
    clusterer = KMeans(n_clusters=12)
    clusterer.fit(segments)
    
    # define data for reconstruction 
    data = df_one_building['Value']
    reconstruction = np.zeros(len(data))

    # define test segments for calculating clusters
    test_segments = sliding_chunker(
        df_one_building['Value'],
        window_len=segment_len,
        slide_len=slide_len
    )

    # loop through each test segments to find the nearest centroids
    for segment_n, segment in enumerate(test_segments):
        segment = np.copy(segment)
        nearest_centroid_idx = clusterer.predict(segment.reshape(1,-1))[0]
        centroids = clusterer.cluster_centers_
        nearest_centroid = np.copy(centroids[nearest_centroid_idx])

        # overlay our reconstructed segments with an overlap of half a segment
        pos = int(segment_n * slide_len)
        reconstruction[pos:pos+segment_len] += nearest_centroid/(segment_len/slide_len)

    # fix first segment_len and last segment_len data points since they are not modeled segment_len/slide_len times
    for i in np.linspace(0,segment_len-1,segment_len).astype(int):
        reconstruction[i] = reconstruction[i]/(i+1)*(segment_len/slide_len)
        reconstruction[-i -1 ] = reconstruction[-i - 1]/(i+1)*(segment_len/slide_len)

    # calculate the reconstruction errors by taking the absolute difference between reconstruct data and original data
    error = reconstruction[0:len(data)] - data[0:len(data)]
    error_99th_percentile = np.percentile(error, 99)
    
    # assign three new columns for output
    df_one_building['Anomaly'] = np.where(np.abs(error[0:len(data)])>error_99th_percentile, 'True', 'False')
    df_one_building['Reconstruction_Error'] = error
    df_one_building['Reconstructed_Value'] = reconstruction
    
    return df_one_building

### Pipeline for All Accounts

In [4]:
%%time
# Step 3: loop over all accounts

# Input and change all NA values to 0 for processing
all_valid_account_data = pd.read_csv("../output/NYCHA_TS.csv")
all_valid_account_data = all_valid_account_data[['Account','Month','Value']]
all_valid_account_data = all_valid_account_data.fillna(0)

# define an empty dataframe result to store data on individual account level
result = []
for account in pd.unique(all_valid_account_data['Account']): 

    df_one_building = all_valid_account_data[all_valid_account_data['Account']==account]

    df_one_building_result = clustering_reconstruction(df_one_building)
    
    result.append(df_one_building_result)
    
result = pd.concat(result, axis=0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


CPU times: user 5min 30s, sys: 682 ms, total: 5min 31s
Wall time: 5min 32s


## Output

Users can now save the result dataframe to desired directory in desired format