## Objectives

This notebook contains code to detect outliers in a dataset of energy consumption values using time series decomposition. It loops through all accounts and labels the value for each month as an anomaly or not.
- Input: a CSV file with the three required columns: Account (account ID), Month (calendar month), and Value (the numeric column to check for outliers)
- Output: an R tibble (data frame) containing the original dataset plus an outlier indicator column and an outlier rank column
- Note: most of the code in this notebook also appears in the Decomposition_Demo.ipynb notebook. The purpose of this notebook is to present all the code needed for automatic anomaly detection in one place.

## System Setup

This notebook uses R code and requires the following R packages:
- "tidyverse" package for data manipulation
- "anomalize" package for STL time series decomposition and IQR-based anomaly detection
- "ggQC" package for XmR chart control limits

## Load packages

In [130]:
library(tidyverse)
library(anomalize)
library(ggQC)

## Read in dataset

In [131]:
# read in the csv file which contains the prorated consumption (and/or charge) values for the energy accounts
# tb_all = read_csv("../output/client2/natural_gas/natural_gas_prorated_ts.csv")
tb_all = read_csv("../output/client1/electricity_prorated_ts.csv")


# filter out unnecessary columns
tb_all = select(tb_all, c('Account', 'Month', 'Prorated_Consumption', 'Prorated_Charge'))

Warning message:
“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_double(),
  Account = col_character(),
  Month = col_date(format = ""),
  Prorated_Consumption = col_double(),
  Prorated_Charge = col_double()
)


In [132]:
head(tb_all)

Account,Month,Prorated_Consumption,Prorated_Charge
Building_Code-B15-Account_ID-1038,2014-12-01,13410.57,116.4843
Building_Code-B15-Account_ID-1038,2015-01-01,208371.43,1808.8974
Building_Code-B15-Account_ID-1038,2015-02-01,193331.29,1676.3237
Building_Code-B15-Account_ID-1038,2015-03-01,240831.13,2116.4588
Building_Code-B15-Account_ID-1038,2015-04-01,253343.09,2239.0405
Building_Code-B15-Account_ID-1038,2015-05-01,325959.7,2898.0553


### Specify metric type (consumption or charge)

In [133]:
tb_all = select(tb_all, c('Account', 'Month', 'Prorated_Charge'))
tb_all = rename(tb_all, Value = Prorated_Charge)

### Get the list of account_id's in the input file

In [134]:
accounts = tb_all %>% group_by(Account) 
accounts <- accounts %>% summarise(counts = n(), na_counts = sum(is.na(Value)))

# calculate the fraction of months with a missing value
accounts <- mutate(accounts, na_perc = na_counts/counts)

### Fill the missing values in the "Value" column with 0

In [135]:
tb_all <- mutate(tb_all, Value = ifelse(is.na(Value), 0, Value))
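One consequence of filling with 0 is worth illustrating: once the gaps are filled, a month that was missing can no longer be told apart from a month whose value was genuinely zero. The sketch below uses made-up values and hypothetical variable names (`value`, `missing_flag`, `value_filled`) to show the difference between flagging before the fill and testing for zero afterwards.

```r
# hypothetical monthly values: one missing month and one genuine zero
value <- c(120.5, NA, 0, 98.2)

# flag recorded before filling: marks only the truly missing month
missing_flag <- is.na(value)

# same fill rule as the cell above
value_filled <- ifelse(is.na(value), 0, value)

# testing for zero after the fill marks the missing month and the genuine zero
zero_flag <- value_filled == 0

print(missing_flag)
print(zero_flag)
```

If genuine zeros can occur in the data, recording the `is.na()` flag before the fill may be the safer choice.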

## Loop through all accounts to apply STL decomposition and outlier detection methods

### Define a function to calculate how far a residual falls outside the limit values. It is used to rank the detected outliers by their deviation from the boundary values

In [136]:
cal_dev <- function(residual, upper, lower) {
  if (residual < lower) {
    return(lower - residual)
  } else if (residual > upper) {
    return(residual - upper)
  } else {
    return (NA)
  }
}
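As an illustration of how `cal_dev` feeds the ranking step, the sketch below applies it to a few made-up residuals with hypothetical limits of -10 and 10, then ranks the deviations so that the largest one gets rank 1 and in-range points keep `NA`. The values and the use of `rank()` are assumptions for this example; the loop in this notebook instead sorts by deviation and numbers the rows.

```r
# cal_dev as defined above, repeated so the example is self-contained
cal_dev <- function(residual, upper, lower) {
  if (residual < lower) {
    return(lower - residual)
  } else if (residual > upper) {
    return(residual - upper)
  } else {
    return(NA)
  }
}

# hypothetical residuals and limits, for illustration only
residuals <- c(-25, 3, 14, -11, 40)
lower <- -10
upper <- 10

# mapply calls cal_dev once per residual; in-range residuals give NA
dev <- mapply(cal_dev, residuals, upper, lower)
print(dev)       # 15 NA 4 1 30

# largest deviation gets rank 1; NA deviations keep an NA rank
dev_rank <- rank(-dev, na.last = "keep", ties.method = "first")
print(dev_rank)  # 2 NA 3 4 1
```

Because `cal_dev` uses a plain `if`, it handles one residual at a time, which is why it is called through `mapply` rather than on whole columns.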

### Loop through all the accounts
- apply STL decomposition to the account's time series values
- detect outliers by applying three methods (IQR_3X, IQR_6X, and XmR) to the residual component of the STL decomposition
- assign an anomaly label and an anomaly rank to each detected outlier, based on a "voted" result from the step above
    - If the IQR_6X method flags a point as an anomaly, label it as an outlier
    - If the IQR_6X method flags a point as normal but both of the other two methods flag it as an outlier, label it as an outlier
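The voting rule can be checked in isolation with a small base-R sketch. The helper names `vote_anomaly` and `weighted_rank` are hypothetical and do not appear in the notebook; the first reduces the two bullet points to a single logical expression, and the second mirrors the rank expression used at the end of the loop (the IQR_6X rank when there is one, otherwise the average of the other two ranks).

```r
# hypothetical helper: combine the three per-method flags into one label
vote_anomaly <- function(anomaly_6x, anomaly_3x, anomaly_xmr) {
  anomaly_6x | (anomaly_3x & anomaly_xmr)
}

# enumerate all eight combinations of the three flags
flags <- expand.grid(
  anomaly_6x  = c(FALSE, TRUE),
  anomaly_3x  = c(FALSE, TRUE),
  anomaly_xmr = c(FALSE, TRUE)
)
flags$voted <- vote_anomaly(flags$anomaly_6x, flags$anomaly_3x, flags$anomaly_xmr)
print(flags)

# hypothetical helper: prefer the IQR_6X rank, else average the other two ranks
weighted_rank <- function(rank_6x, rank_3x, rank_xmr) {
  ifelse(is.na(rank_6x), (rank_3x + rank_xmr) / 2, rank_6x)
}

# made-up ranks: flagged by 6X, flagged only by 3X and XmR, flagged only by XmR
print(weighted_rank(c(1, NA, NA), c(2, 3, NA), c(1, 4, 2)))
```

Five of the eight combinations are labelled as outliers: the four where IQR_6X fires, plus the one where only the other two methods agree. A point flagged by just one of IQR_3X or XmR is not labelled an outlier, and its weighted rank is `NA`.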

In [137]:
results_stl <- vector("list", length(accounts$Account)) 

start.time <- Sys.time()

for (i in seq_along(accounts$Account)) {
    # select the data for the input account
    ts = select(filter(tb_all, Account == accounts$Account[[i]]), 'Month', 'Value')
    
    # apply STL decomposition
    ts_anomalized <- ts %>%
        time_decompose(Value, merge = TRUE, method = 'stl', message = FALSE)
    
    # rename and reorder the columns of the resulting dataframe
    ts_anomalized$Account = accounts$Account[[i]]
    ts_anomalized$Missing_Value = ts_anomalized$Value == 0
    ts_anomalized <- rename(ts_anomalized, Calendar_Month = Month, Total = Value, Trend = trend, Seasonal = season
       , Residual = remainder)
    ts_anomalized <- ts_anomalized[, c('Account', 'Calendar_Month', 'Missing_Value', 'Total', 'Trend', 'Seasonal', 'Residual')]


    # Detect anomalies in the residuals with three methods: IQR 3X, IQR 6X, and XmR control limits
    iqr_3X <- ts_anomalized %>%
            anomalize(Residual, method = 'iqr', alpha = 0.05)

    iqr_3X <- iqr_3X[, c('Account', 'Calendar_Month', 'Residual', 'Residual_l1', 'Residual_l2', 'anomaly')]
    iqr_3X <- rename(iqr_3X, Lower = Residual_l1, Upper = Residual_l2, Anomaly = anomaly)
    iqr_3X <- mutate(iqr_3X, Anomaly = ifelse(Anomaly == "Yes", TRUE, FALSE))

    iqr_6X <- ts_anomalized %>%
            anomalize(Residual, method = 'iqr', alpha = 0.025)

    iqr_6X <- iqr_6X[, c('Account', 'Calendar_Month', 'Residual', 'Residual_l1', 'Residual_l2', 'anomaly')]
    iqr_6X <- rename(iqr_6X, Lower = Residual_l1, Upper = Residual_l2, Anomaly = anomaly)
    iqr_6X <- mutate(iqr_6X, Anomaly = ifelse(Anomaly == "Yes", TRUE, FALSE))

    ctrl_limits <- QC_Lines(data = ts_anomalized$Residual, method = "XmR")  
    ctrl_limits <- ctrl_limits[, c('xBar_one_LCL', 'xBar_one_UCL')]
    ctrl_limits <- rename(ctrl_limits, Lower= xBar_one_LCL, Upper = xBar_one_UCL)
    xmr_mean <- cbind(ts_anomalized[, c('Account', 'Calendar_Month', 'Residual')], ctrl_limits)
    
    
    # Add deviation from limit, rank of outlier and outlier indicator
    iqr_3X$Dev <- mapply(cal_dev, iqr_3X$Residual, iqr_3X$Upper, iqr_3X$Lower)
    iqr_3X <- arrange(iqr_3X, desc(Dev))
    iqr_3X$Rank = seq_len(nrow(iqr_3X))
    if (sum(is.na(iqr_3X$Dev)) > 0) {
       iqr_3X[is.na(iqr_3X$Dev), ]$Rank <- NA
    }

    iqr_6X$Dev <- mapply(cal_dev, iqr_6X$Residual, iqr_6X$Upper, iqr_6X$Lower)
    iqr_6X <- arrange(iqr_6X, desc(Dev))
    iqr_6X$Rank = seq_len(nrow(iqr_6X))
    if (sum(is.na(iqr_6X$Dev)) > 0) {
        iqr_6X[is.na(iqr_6X$Dev), ]$Rank <- NA
    }

    xmr_mean$Dev <- mapply(cal_dev, xmr_mean$Residual, xmr_mean$Upper, xmr_mean$Lower)
    xmr_mean <- arrange(xmr_mean, desc(Dev))
    xmr_mean$Rank = seq_len(nrow(xmr_mean))
    if (sum(is.na(xmr_mean$Dev)) > 0) {
        xmr_mean[is.na(xmr_mean$Dev), ]$Rank <- NA
    }
    xmr_mean <- mutate(xmr_mean, Anomaly = ifelse(is.na(Dev), FALSE, TRUE))
    
    
    # rename the columns
    iqr_3X <- rename(iqr_3X, Lower_3X = Lower, Upper_3X = Upper, Anomaly_3X = Anomaly, Dev_3X = Dev, Rank_3X = Rank)
    iqr_6X <- rename(iqr_6X, Lower_6X = Lower, Upper_6X = Upper, Anomaly_6X = Anomaly, Dev_6X = Dev, Rank_6X = Rank)
    xmr_mean <- rename(xmr_mean, Lower_xmr = Lower, Upper_xmr = Upper, Anomaly_xmr = Anomaly, Dev_xmr = Dev, Rank_xmr = Rank)

    
    # Combine the results of 3 methods
    result <- 
    ts_anomalized[c('Account', 'Calendar_Month', 'Missing_Value', 'Total', 'Trend', 'Seasonal', 'Residual')] %>% 
        inner_join(iqr_3X[, -3], by = c('Account', 'Calendar_Month')) %>%
        inner_join(iqr_6X[, -3], by = c('Account', 'Calendar_Month')) %>%
        inner_join(xmr_mean[, -3], by = c('Account', 'Calendar_Month')) 

    # considered an outlier if
    #     1) IQR_6X flags it as an outlier, OR
    #     2) IQR_6X flags it as normal but both of the other two methods flag it as an outlier
    result <- mutate(result
        , Anomaly_Voted = ifelse(((Anomaly_6X == TRUE) | ((Anomaly_6X == FALSE) & (Anomaly_xmr == TRUE & Anomaly_3X == TRUE))), TRUE, FALSE)
    )

    # weighted rank of the outlier: the IQR_6X rank when available, otherwise the average of the IQR_3X and XmR ranks
    results_stl[[i]] <- 
        mutate(result, Rank_Weighted = ifelse(is.na(Rank_6X), (Rank_3X + Rank_xmr)/2, Rank_6X))

}

end.time <- Sys.time()
time.taken.stl <- end.time - start.time
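For reference, the limits behind the three methods can be approximated in base R. This is a simplified sketch, not the packages' code. It assumes that `anomalize` with `method = 'iqr'` widens the 25th/75th percentile band by an IQR factor of 0.15 / alpha (about 3 for alpha = 0.05 and 6 for alpha = 0.025, hence the 3X and 6X names), and that the XmR limits follow the conventional rule of the mean plus or minus 2.66 times the average moving range. The helper names `iqr_limits` and `xmr_limits` are hypothetical, the residuals are made up, and the sketch ignores the `max_anoms` cap in `anomalize`, so results may differ from the package output.

```r
# hypothetical helper: IQR-based limits with the factor 0.15 / alpha
iqr_limits <- function(x, alpha = 0.05) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_factor <- 0.15 / alpha
  spread <- (q[2] - q[1]) * iqr_factor
  c(lower = q[1] - spread, upper = q[2] + spread)
}

# hypothetical helper: conventional XmR limits for individual values
xmr_limits <- function(x) {
  centre <- mean(x)
  mr_bar <- mean(abs(diff(x)))  # average moving range between neighbours
  c(lower = centre - 2.66 * mr_bar, upper = centre + 2.66 * mr_bar)
}

# made-up residuals with one large spike
residuals <- c(-4, 2, -1, 3, 0, 25, -2, 1, -3)

print(iqr_limits(residuals, alpha = 0.05))   # 3X band
print(iqr_limits(residuals, alpha = 0.025))  # 6X band, wider
print(xmr_limits(residuals))
```

The 6X band is the widest of the IQR bands, so anything it flags is an extreme point. That is consistent with the voting rule, which accepts an IQR_6X flag on its own but requires IQR_3X and XmR to agree.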

## Calculate average runtime per account

In [138]:
time.taken.stl/length(accounts$Account)

Time difference of 0.09235124 secs

## Get a combined data frame of the original data with anomaly indicator and anomaly rank columns

In [139]:
result = tibble()

for (i in seq_along(accounts$Account)) {
    tmp = select(results_stl[[i]] , c('Account', 'Calendar_Month', 'Total', 'Anomaly_Voted', 'Rank_Weighted'))
    tmp = rename(tmp, Value = Total, Month = Calendar_Month, Anomaly = Anomaly_Voted, Anomaly_Rank = Rank_Weighted)
    result = rbind(result, tmp)
    }
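Growing a data frame with `rbind` inside a loop copies the accumulated rows on every iteration, which can become slow when there are many accounts. A minimal base-R sketch of binding a whole list in one call follows; the toy list `pieces` stands in for `results_stl`, and its values are made up. With the tidyverse loaded, `dplyr::bind_rows()` applied to the list is a common alternative.

```r
# toy stand-in for the per-account results list (hypothetical values)
pieces <- list(
  data.frame(Account = "A", Value = c(1.5, 2.5)),
  data.frame(Account = "B", Value = c(3.0, 4.0, 5.0))
)

# bind every element of the list in a single call instead of one at a time
combined <- do.call(rbind, pieces)

print(nrow(combined))
print(combined)
```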

## Save the output

The result can then be written to any desired directory.

In [140]:
# write_csv() writes comma-separated values without row names, so the content matches the .csv extension
write_csv(result, "../output/client1/anomaly_detection_decomposition_client1_electricity_charge.csv")