<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />


# Worksheet 7.1: Anomaly Detection - Answers

This worksheet covers concepts relating to Anomaly Detection.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

There are many ways to accomplish the tasks you are presented with; however, by using the techniques covered in class, you should find the exercises relatively simple.

## Import the Libraries
For this exercise, we will be using:
* Pandas (https://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (https://matplotlib.org/stable/)
* Prophet (https://github.com/facebook/prophet)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

style.use("ggplot")
DATA_HOME = '../data'

# Finding Anomalies in CPU Usage Data
In the first part of this lab, you will examine CPU usage data to find anomalies.

## Step One:  Get the Data
For this example, we will be looking at CPU Utilization Data to see if we can identify periods of unusual activity.  The data can be found in several files:

* `cpu-full-a.csv`:  A full set of CPU usage data without anomalies
* `cpu-train-a.csv`:  The training set from data set A
* `cpu-test-a.csv`:  The test set from data set A
* `cpu-full-b.csv`:  A full set of CPU usage data with an anomaly
* `cpu-train-b.csv`:  The training set from data set B
* `cpu-test-b.csv`:  The test set from data set B


This dataset is from examples in *Machine Learning & Security*  by Clarence Chio and David Freeman.  https://github.com/oreilly-mlsec/book-resources/tree/master/chapter3/datasets/cpu-utilization.

First, let's take a look at data set A.  For the first part of this lab, load the training dataset into a DataFrame.  `pd.read_csv` has a `parse_dates` option which converts the listed columns to datetimes while loading. Setting this will save time and steps in data preparation. (Older pandas versions also accepted `infer_datetime_format=True`; that argument is deprecated as of pandas 2.0.)

Once the data is loaded, call the usual series of exploratory functions and most importantly, plot the data.
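One possible way to do this is sketched below. The helper name `load_cpu_csv` and the three sample rows are invented for illustration so the cell runs without the data files; the column names `datetime` and `cpu` are assumed to match those in the real CSVs.

```python
import io

import pandas as pd


def load_cpu_csv(source):
    """Read a CPU-usage CSV, parsing the first column as datetimes."""
    return pd.read_csv(source, parse_dates=[0])


# ASSUMPTION: these rows are invented stand-ins for the contents of
# cpu-train-a.csv so that the sketch runs without the data files.
sample_csv = io.StringIO(
    "datetime,cpu\n"
    "2017-01-27 18:42:00,1.14\n"
    "2017-01-27 18:43:00,0.98\n"
    "2017-01-27 18:44:00,1.27\n"
)
df_train_a = load_cpu_csv(sample_csv)

# The usual exploratory calls
print(df_train_a.head())
print(df_train_a.dtypes)
print(df_train_a.describe())

# Plot CPU usage over time
ax = df_train_a.plot(x='datetime', y='cpu', figsize=(12, 4), legend=False)
ax.set_ylabel('cpu')
```

With the real data, the call would be `load_cpu_csv(f'{DATA_HOME}/cpu-train-a.csv')`, and the rest of the cell stays the same.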

# Prophet
A really useful library for time series analysis is called `Prophet` and is published by Meta.  The documentation is available here: [https://github.com/facebook/prophet](https://github.com/facebook/prophet).  

Prophet is based on a decomposable additive model in which a non-linear trend is combined with seasonality and holiday effects.  This lets Prophet take seasons and holidays into account in its predictions.  Additionally, the algorithm seems to be fairly computationally efficient.

The Prophet model can be summarized as:
``` text
forecast = trend + seasonality + holidays + error term
```

- The trend models non-periodic changes in the time series data.

- Seasonality captures periodic changes, such as daily, weekly, or yearly cycles.

- Holiday effects occur on irregular schedules over a day or a period of days.

- The error term is whatever is not explained by the model.
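To make the additive structure concrete, here is a toy series built from the four components with NumPy. This is only an illustration of the formula, not Prophet's actual fitting procedure, and every number in it (slope, amplitude, the "holiday" on day 7) is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)  # two weeks of hourly samples

# Each component is made up purely to illustrate the additive model
trend = 20 + 0.01 * hours                          # slow non-periodic growth
seasonality = 5 * np.sin(2 * np.pi * hours / 24)   # daily cycle
holidays = np.where((hours >= 24 * 6) & (hours < 24 * 7), -8.0, 0.0)  # one quiet day
noise = rng.normal(0, 1, size=hours.size)          # the unexplained error term

# forecast = trend + seasonality + holidays + error term
y = trend + seasonality + holidays + noise

# Subtracting the modelled components leaves only the error term
residual = y - (trend + seasonality + holidays)
```

Fitting a model like Prophet works in the opposite direction: given only `y`, it estimates the components on the right-hand side.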


## Using Prophet
Prophet's usage is pretty straightforward; however, it requires a DataFrame with two columns: a timestamp and a data column.  These must be named `ds` and `y` respectively. The `y` column must be numeric and represents the measurement we wish to forecast.
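As a sketch of that requirement, the hypothetical helper below (`to_prophet_frame` is not part of Prophet) builds a two-column frame with the expected names and types without modifying the original DataFrame. The sample rows are invented.

```python
import pandas as pd


def to_prophet_frame(df, time_col, value_col):
    """Return a new frame with only the `ds` and `y` columns Prophet expects."""
    out = df[[time_col, value_col]].rename(columns={time_col: 'ds', value_col: 'y'})
    out['ds'] = pd.to_datetime(out['ds'])
    out['y'] = pd.to_numeric(out['y'])  # raises if a value is not numeric
    return out


# Invented rows, for illustration only
raw = pd.DataFrame({
    'datetime': ['2017-01-27 18:42:00', '2017-01-27 18:43:00'],
    'cpu': ['1.14', '0.98'],
})
prophet_df = to_prophet_frame(raw, time_col='datetime', value_col='cpu')
print(prophet_df.dtypes)
```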

Load the CSV that contains an anomaly.

In [None]:
df_b = pd.read_csv(f'{DATA_HOME}/cpu-full-b.csv', parse_dates=[0])

In [None]:
df_b.head()

We can see that there's a `datetime` column and a `cpu` column.

In [None]:
df_b.dtypes

In [None]:
# Prophet expects the timestamp column to be named 'ds' and the value column 'y'
df_b.rename(columns={'datetime': 'ds', 'cpu': 'y'}, inplace=True)
df_b.head()


First up, we create an instance of the Prophet class and then call its fit and predict methods.

In [None]:
# set the uncertainty interval to 90% (the Prophet default is 80%)
cpu_model = Prophet(interval_width=0.90)

In [None]:
cpu_model.fit(df_b)

In order to obtain forecasts of our time series, we must provide Prophet with a new DataFrame containing a ds column that holds the dates for which we want predictions.

In [None]:
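# Build the frame of timestamps to predict: the historical dates plus 500 more hourly steps.
# Other frequency aliases include 'S' (second) and 'min' (minute).
predicted_cpu = cpu_model.make_future_dataframe(periods=500, freq='H')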
predicted_cpu.head()

In [None]:
forecast = cpu_model.predict(predicted_cpu)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()

- ds: the datestamp of the forecasted value
- yhat: the forecasted value of our metric 
- yhat_lower: the lower bound of our forecasts
- yhat_upper: the upper bound of our forecasts

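By default, Prophet fits the model with MAP estimation and estimates the uncertainty interval by simulating possible future trend changes. That simulation is stochastic, so `yhat_lower` and `yhat_upper` will be slightly different each time. Full Markov chain Monte Carlo (MCMC) sampling is used only when the `mcmc_samples` argument is set to a value greater than 0.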

In [None]:
cpu_model.plot(forecast, uncertainty=True)

In [None]:
cpu_model.plot_components(forecast)
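
One simple way to turn the forecast into anomaly flags is to mark every observation that falls outside the uncertainty interval. The sketch below assumes that rule; `flag_anomalies` is a hypothetical helper, not part of Prophet, and the toy frames use invented numbers in place of `df_b` and `forecast`.

```python
import pandas as pd


def flag_anomalies(actual, forecast):
    """Mark observations that fall outside the forecast's uncertainty interval."""
    merged = actual.merge(
        forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']], on='ds', how='inner'
    )
    merged['anomaly'] = (
        (merged['y'] < merged['yhat_lower']) | (merged['y'] > merged['yhat_upper'])
    )
    return merged


# Invented numbers standing in for df_b and Prophet's forecast output
ds = pd.date_range('2017-01-01', periods=4, freq='D')
actual = pd.DataFrame({'ds': ds, 'y': [1.0, 1.2, 5.0, 0.9]})
toy_forecast = pd.DataFrame({
    'ds': ds,
    'yhat': [1.0, 1.1, 1.0, 1.0],
    'yhat_lower': [0.5, 0.6, 0.5, 0.5],
    'yhat_upper': [1.5, 1.6, 1.5, 1.5],
})
flagged = flag_anomalies(actual, toy_forecast)
print(flagged[flagged['anomaly']])
```

In the notebook, the equivalent call would be `flag_anomalies(df_b, forecast)`. Because the model here was fit on data that already contains the anomaly, the interval may be wider than it would be for a model trained on clean data.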
