This notebook performs time series anomaly detection in Python using the Facebook (Meta) Prophet model (https://facebook.github.io/prophet/). Anomalies are also called outliers, and we use the two terms interchangeably.

# Step 0: Algorithm for Time Series Anomaly Detection

In step 0, let's talk about the algorithm for time series anomaly detection. At a high level, the outliers are detected based on the prediction interval of the time series. The implementation includes the following steps:

1. Build a time series forecasting model.
2. Make predictions on historical data using the time series forecasting model.
3. Compare the actual values with the prediction intervals. Outliers are defined as the data points with actual values outside of the prediction intervals.
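The comparison in the last step can be sketched without any forecasting library. The following is a minimal illustration with made-up numbers; the helper `flag_outliers` and the arrays `actual`, `lower`, and `upper` are hypothetical names standing in for the observed values and the interval bounds that a fitted model would produce.

```python
import numpy as np

def flag_outliers(actual, lower, upper):
    """Return a boolean array that is True where actual lies outside [lower, upper]."""
    actual = np.asarray(actual, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return (actual < lower) | (actual > upper)

# Made-up values for illustration only
actual = [100.0, 120.0, 300.0, 90.0]
lower = [80.0, 100.0, 110.0, 95.0]
upper = [130.0, 150.0, 160.0, 140.0]

outlier_mask = flag_outliers(actual, lower, upper)
print(outlier_mask)
```

Here the third value lies above its upper bound and the fourth lies below its lower bound, so only those two are flagged.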

# Step 1: Install and Import Libraries

In step 1, we will install and import libraries.

`prophet` is the package for the time series model. It needs to be installed first (for example with `pip install prophet`), and afterwards `prophet` is imported into the notebook.

We also import `pandas` and `numpy` for data processing, `seaborn` and `matplotlib` for visualization, and `mean_absolute_error` and `mean_absolute_percentage_error` for the model performance evaluation.

In [None]:
# Prophet model for time series forecast
from prophet import Prophet

# Data processing
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Model performance evaluation
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Step 2: Get and Process the Data

In [None]:
data_path = "../../../data/raw/Time_Series_Merchants_Transactions_Anonymized.csv"
df_merchant_transactions = pd.read_csv(data_path)

In [None]:
df_merchant_transactions = df_merchant_transactions.drop(columns='Merchant Name')

In [None]:
start_date = '2020-08'
end_date = '2022-10'
# Replace the column names with month-end dates in a standard date format
stddates = pd.date_range(start=start_date, end=end_date, freq="M")
df_merchant_transactions.columns = stddates
df_merchant_transactions.head()
#stddates
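The `date_range` call above produces month-end timestamps, and the `end` bound is only included if it falls on a month end. A minimal sketch of that behavior follows; the helper `month_end_range` is a hypothetical name, and the fallback between the `"ME"` and `"M"` aliases is an assumption made so that the sketch runs on both newer and older pandas releases. For the bounds used here it should yield 26 dates, from 2020-08-31 to 2022-09-30.

```python
import pandas as pd

def month_end_range(start, end):
    """Month-end dates between start and end, across pandas versions."""
    try:
        # Newer pandas releases use the "ME" alias for month-end frequency
        return pd.date_range(start=start, end=end, freq="ME")
    except ValueError:
        # Older pandas releases only know the "M" alias
        return pd.date_range(start=start, end=end, freq="M")

example_dates = month_end_range("2020-08", "2022-10")
print(len(example_dates), example_dates[0].date(), example_dates[-1].date())
```

Because `'2022-10'` is parsed as 1 October 2022, the last generated date is the end of September 2022, which is why the number of dates has to match the number of columns in the transactions table.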

The goal of the time series model is to predict the six group merchant transactions.

Prophet requires at least two columns as inputs: a `ds` column and a `y` column.
* The `ds` column has the time information. The dates are currently the column labels of `df_merchant_transactions`, so we use `stddates` as the `ds` column.
* The `y` column has the time series transaction values. In this example, because we are predicting six group merchant transactions, the transaction values (taken from the first row of `df_merchant_transactions`) go into the column named `y`.

In [None]:
# Build the two-column dataframe that Prophet expects
data = pd.DataFrame({
    'ds': stddates,
    'y': df_merchant_transactions.iloc[0, :].values,
})
data.head()
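The same reshaping can be wrapped in a small function so that any row of the wide table can be modeled. This is only a sketch on a made-up table: `row_to_prophet_frame` and `wide_example` are hypothetical names and are not part of the notebook's data or of Prophet.

```python
import pandas as pd

def row_to_prophet_frame(wide_df, row_position):
    """Turn one row of a wide table (dates as column labels) into a ds/y frame."""
    row = wide_df.iloc[row_position]
    return pd.DataFrame({
        "ds": pd.to_datetime(row.index),   # column labels become the dates
        "y": row.to_numpy(dtype=float),    # row values become the series
    })

# Made-up wide table: two rows, three month-end columns
wide_example = pd.DataFrame(
    [[10, 12, 11], [200, 210, 190]],
    columns=pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
)

example_frame = row_to_prophet_frame(wide_example, 1)
print(example_frame)
```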

Using `.info()`, we can see that the dataset has 26 records and there are no missing values.

In [None]:
# Information on the dataframe
data.info()

Next, let's visualize the merchant transactions using `seaborn`, and add the legend to the plot using `matplotlib`. We see that the transactions fluctuated over the given period of time.

In [None]:
# Visualize data using seaborn
sns.set(rc={'figure.figsize':(12,8)})
sns.lineplot(x=data['ds'], y=data['y'])
plt.legend(['merchant transactions'])

# Step 3: Build Time Series Model Using Prophet in Python

In step 3, we will build a time series model using Prophet in Python. 

Notice that we did not do a train-test split for the modeling dataset. The first goal is to fit a model that predicts well on the past transactions. Therefore, we will use the whole dataset for both training and forecasting.

* When initializing the Prophet model, `seasonality_mode='multiplicative'` is set explicitly. The model is then fit on the training data.
* The `interval_width` is set to 0.99, which means that the model produces a 99% uncertainty interval.
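To get a feel for what a 99% interval implies for a series of this length, the arithmetic below estimates how many points would fall outside the interval by chance alone. This is a rough sketch under idealized assumptions that the notebook does not verify: the interval is taken to be perfectly calibrated and the points are treated as independent. The variable names are hypothetical.

```python
# Idealized assumption: each point independently falls inside the
# interval with probability equal to interval_width.
interval_width = 0.99
n_points = 26  # number of monthly records in this notebook

# Expected number of points outside the interval by chance
expected_outside = n_points * (1 - interval_width)

# Probability that no point at all falls outside the interval
prob_none_outside = interval_width ** n_points

print(f"Expected points outside: {expected_outside:.2f}")
print(f"Probability of zero points outside: {prob_none_outside:.2f}")
```

Under these assumptions, fewer than one point is expected outside the interval, so a wide interval on a short series flags only pronounced deviations.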

We keep the model simple in this example to focus on the process of anomaly detection. 

In [None]:
# Initialize the model with a 99% uncertainty interval and multiplicative seasonality
model = Prophet(interval_width=0.99, seasonality_mode='multiplicative')

# Fit the model on the training dataset
model.fit(data)

# Step 4: Make Predictions Using Prophet in Python

After building the model, in step 4, we use the model to make predictions on the dataset. The forecast plot shows that the predictions are in general aligned with the actual values.

In [None]:
# Make prediction
forecast = model.predict(data)

# Visualize the forecast
model.plot(forecast); # Add semi-colon to remove the duplicated chart

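We can also check the components plot for the trend and the yearly seasonality. Because the data is monthly, Prophet's automatic seasonality settings are not expected to fit a weekly component here.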

In [None]:
# Visualize the forecast components
model.plot_components(forecast);

# Step 5: Check Time Series Model Performance

In step 5, we will check the time series model performance. The forecast dataframe does not include the actual values, so we need to merge the forecast dataframe with the actual dataframe to compare the actual values with the predicted values. Two performance metrics are included:

* MAE (Mean Absolute Error) is the sum of the absolute differences between the actual and predicted values, divided by the number of predictions.
* MAPE (Mean Absolute Percentage Error) is the sum of the absolute percentage differences between the actual and predicted values, divided by the number of predictions. MAPE is independent of the magnitude of the data, so it can be used to compare different forecasts. But it’s undefined when the actual value is zero.
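The two definitions can be written out directly in `numpy`. This is a minimal sketch on made-up numbers; the functions `mae` and `mape` are hypothetical helpers written for illustration and are not the `sklearn` implementations used in this notebook. Like `sklearn`, this `mape` returns a fraction rather than a percentage.

```python
import numpy as np

def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

def mape(actual, predicted):
    """Mean of the absolute differences relative to the actual values.

    Returned as a fraction (0.07 means 7%). Undefined if any actual value is zero.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual))

# Made-up values for illustration only
example_actual = [100.0, 200.0, 400.0]
example_predicted = [110.0, 180.0, 400.0]

example_mae = mae(example_actual, example_predicted)
example_mape = mape(example_actual, example_predicted)
print(example_mae, example_mape)
```

In this example the absolute errors are 10, 20, and 0, so the MAE is 10, while the relative errors are 10%, 10%, and 0%, so the MAPE is about 0.067.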

In [None]:
# Merge actual and predicted values
performance = pd.merge(data, forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']], on='ds')

# Check MAE value
performance_MAE = mean_absolute_error(performance['y'], performance['yhat'])
print(f'The MAE for the model is {performance_MAE}')

# Check MAPE value
performance_MAPE = mean_absolute_percentage_error(performance['y'], performance['yhat'])
print(f'The MAPE for the model is {performance_MAPE}')

The mean absolute error (MAE) for the model is 373, meaning that on average, the forecast is off by 373 transactions. Given that the transactions are in the thousands, the prediction is not bad, and it is better than the Auto ARIMA and ETS results.

The mean absolute percentage error (MAPE) for the baseline model is 7%, meaning that on average, the forecast is off by 7% of the transactions. Note that `mean_absolute_percentage_error` returns a fraction, so 7% is printed as roughly 0.07.

# Step 6: Identify Anomalies

In step 6, we will identify the time series anomalies by checking whether the actual value is outside of the uncertainty interval. If the actual value is smaller than the lower bound or larger than the upper bound of the uncertainty interval, the anomaly indicator is set to 1; otherwise, it is set to 0.

Using `value_counts()`, we can see that there are no outliers out of 26 data points.

In [None]:
# Create an anomaly indicator: 1 if the actual value is outside the interval, else 0
outside_interval = (performance['y'] < performance['yhat_lower']) | (performance['y'] > performance['yhat_upper'])
performance['anomaly'] = outside_interval.astype(int)

# Check the number of anomalies
performance['anomaly'].value_counts()

In [None]:
# Take a look at the anomalies
anomalies = performance[performance['anomaly']==1].sort_values(by='ds')
anomalies

In the visualization, all the dots are actual values and the black line represents the predicted values. Outliers would be shown as orange dots. We see that there are no outliers for this merchant's transactions.

In [None]:
# Visualize the anomalies
sns.scatterplot(x='ds', y='y', data=performance, hue='anomaly')
sns.lineplot(x='ds', y='yhat', data=performance, color='black')

# Summary

We performed time series anomaly detection using Prophet in Python. The results look good for the one chosen merchant.

# References

[1] [Prophet Documentation](https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html)