## Code to execute Prophet and generate an output file highlighting anomalies

** Importing all the required libraries **

In [1]:
import pandas as pd
import numpy as np
from fbprophet import Prophet
import warnings
warnings.filterwarnings('ignore')

** This section reads the csv file that serves as input to the anomaly detection function.
We start with the cleaned NYCHA dataset, delete all rows where the consumption value is missing or 0, and keep only those accounts which have more than 50 rows of data. The unwanted 'Unnamed: 0' column is also dropped, leaving us with just 3 columns: Account (building id/account no), Month and Value (consumption). **

In [2]:
nycha = pd.read_csv("../output/nycha/NYCHA_TS.csv", parse_dates=['Month'])
nycha = nycha.fillna(0)
nycha_f = nycha[nycha['Value'] != 0]
nycha_f = nycha_f.drop('Unnamed: 0', axis=1)
nycha_f = nycha_f.groupby('Account').filter(lambda x: len(x) > 50)
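
The cleaning steps above can be tried on a small frame before running them on the full dataset. The sketch below is only illustrative: the account names, values and the `min_rows` threshold are made up (the notebook uses 50), and only the column names are taken from the cell above.

```python
import pandas as pd

# Toy stand-in for the NYCHA table; account names and values are made up.
toy = pd.DataFrame({
    'Account': ['A'] * 4 + ['B'] * 2,
    'Month': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                             '2015-04-01', '2015-01-01', '2015-02-01']),
    'Value': [10.0, None, 12.0, 11.0, 5.0, 0.0],
})

min_rows = 2  # hypothetical threshold for the toy data; the notebook uses 50

# Treat missing consumption as 0, then drop every zero row.
toy = toy.fillna(0)
toy_f = toy[toy['Value'] != 0]

# Keep only accounts with more than `min_rows` remaining rows.
toy_f = toy_f.groupby('Account').filter(lambda x: len(x) > min_rows)

kept_accounts = sorted(toy_f['Account'].unique())
```

Account `A` keeps three non-zero rows and survives the filter, while account `B` is left with a single row and is removed.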

** The function below automates the generation of a dataframe indicating anomalies using Prophet. <br>
The input to this function is a single building account id; it returns that account's predictions alongside the original values and an anomaly flag. **


In [3]:
def automate_prophet(b_id):
    # filter the data for this account
    prop_df = nycha_f[nycha_f['Account'] == b_id]
    # drop and rename the columns as required by Prophet
    prop_df = prop_df.drop('Account', axis = 1)
    prop_df = prop_df.rename(columns={'Month':'ds', 'Value':'y'})
    # create a copy of the original dataframe
    prop_df_o = prop_df.copy()
    # set the month as index
    prop_df_o = prop_df_o.set_index('ds')
    # run the prophet model with yearly seasonality only;
    # interval_width=0.95 sets a 95% uncertainty interval (wider than Prophet's default of 80%), which serves as the anomaly threshold;
    # mcmc_samples=50 performs full Bayesian sampling so that the interval also includes uncertainty in seasonality
    model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False, interval_width=0.95, mcmc_samples=50)
    # fit the model using original dataset
    model.fit(prop_df)
    # get predicted values for the entire dataset
    predicted = model.predict()
    # the resulting dataframe has many columns, here we filter the important ones from it.
    # take a copy so that the column assignments below do not act on a slice of `predicted`
    actvpred = predicted[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
    # get the original values for consumption; Prophet sorts its history by date,
    # so sort the originals the same way before aligning them by position
    actvpred['y_orig'] = prop_df_o['y'].sort_index().values
    # flag a point as an anomaly when the original value falls outside the interval [yhat_lower, yhat_upper]
    actvpred['Anomaly'] = np.where((actvpred['y_orig'] >= actvpred['yhat_lower']) & (actvpred['y_orig'] <= actvpred['yhat_upper']), 'No', 'Yes')
    # add a column to indicate the building id in the resulting dataframe
    actvpred['b_id'] = b_id
    return actvpred
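
The anomaly rule used inside the function can be checked in isolation, without fitting a model. The following is a minimal sketch on made-up numbers; only the column names `yhat_lower`, `yhat_upper` and `y_orig` come from the function above.

```python
import numpy as np
import pandas as pd

# Made-up bounds and observations; in the notebook these come from Prophet.
toy = pd.DataFrame({
    'yhat_lower': [8.0, 8.0, 8.0, 8.0],
    'yhat_upper': [12.0, 12.0, 12.0, 12.0],
    'y_orig': [10.0, 12.0, 12.5, 3.0],
})

# A point is normal when lower <= y <= upper (bounds inclusive), otherwise anomalous.
inside = (toy['y_orig'] >= toy['yhat_lower']) & (toy['y_orig'] <= toy['yhat_upper'])
toy['Anomaly'] = np.where(inside, 'No', 'Yes')

flags = toy['Anomaly'].tolist()
```

Because both comparisons are inclusive, a value sitting exactly on a bound (12.0 here) is not flagged; only values strictly outside the interval are.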

** This section of code calls the function created above for every account: it first collects all the unique account numbers in a list and then passes them to the function one at a time.
The dataframes returned from the function are combined and exported to a csv file. **

In [17]:
# get all the unique account ids from the data
b_m = np.unique(nycha_f['Account'])
# convert this to a list and then Series
m_list = b_m.tolist()
build_meter = pd.Series(m_list)
# name the columns to be used in the final dataframe, in output order
col = ['Build_id','Month','Predicted','Original','Upper_b','Lower_b','Anomaly']
# collect one result dataframe per account
results = []
# run a loop for all the accounts
for i in build_meter:
    # call the function
    res = automate_prophet(i)
    # rename the columns in the dataframe returned to match the column names above
    res = res.rename(columns={"ds":"Month","yhat":"Predicted","yhat_lower":"Lower_b","yhat_upper":"Upper_b","y_orig":"Original","b_id":"Build_id"})
    # reorder the columns as specified
    res = res[col]
    # keep this dataframe for the final result
    results.append(res)
# combine all the results with a single concat instead of growing a dataframe inside the loop
df_new = pd.concat(results, ignore_index=True) if results else pd.DataFrame(columns = col)
# write the output to a csv file
df_new.to_csv('../output/nycha/final_output_prophet.csv', header=True, index= False)
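
One possible sanity check on the exported data is to count the flagged months per building. The sketch below assumes the column names written by the export step and uses made-up rows in place of the real file.

```python
import pandas as pd

# Made-up rows using the column names of the exported file.
toy_out = pd.DataFrame({
    'Build_id': ['A', 'A', 'A', 'B', 'B'],
    'Anomaly': ['No', 'Yes', 'Yes', 'No', 'No'],
})

# Count flagged months per building; buildings with no flags show 0.
anomaly_counts = (
    toy_out['Anomaly'].eq('Yes')
    .groupby(toy_out['Build_id'])
    .sum()
    .sort_values(ascending=False)
)
```

On the real output, `toy_out` would be replaced by the frame read back with `pd.read_csv('../output/nycha/final_output_prophet.csv')`.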