<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />


# Worksheet 2.0 Anomaly Detection

This worksheet covers concepts relating to Anomaly Detection.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

There are many ways to accomplish the tasks that you are presented with; however, you will find that the techniques covered in class make the exercises relatively simple.

## Import the Libraries
For this exercise, we will be using:
* Pandas (http://pandas.pydata.org/pandas-docs/stable/)
* Numpy (https://docs.scipy.org/doc/numpy/reference/)
* Matplotlib (http://matplotlib.org/api/pyplot_api.html)
* StatsModels (https://www.statsmodels.org/stable/index.html)
* Pmdarima (http://alkaline-ml.com/pmdarima/)


In [None]:
import pandas as pd
import numpy as np
from pmdarima.arima import auto_arima
from pmdarima.arima import ADFTest
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib import style
from pandas.plotting import autocorrelation_plot
style.use("ggplot")
%matplotlib inline

# Part One:  Finding Anomalies in CPU Usage Data
In the first part of this lab, you will examine CPU usage data to find anomalies.

## Step One:  Get the Data
For this example, we will be looking at CPU Utilization Data to see if we can identify periods of unusual activity.  The data can be found in several files:

* `cpu-full-a.csv`:  A full set of CPU usage data without anomalies
* `cpu-train-a.csv`:  The training set from data set A
* `cpu-test-a.csv`:  The test set from data set A
* `cpu-full-b.csv`:  A full set of CPU usage data with an anomaly
* `cpu-train-b.csv`:  The training set from data set B
* `cpu-test-b.csv`:  The test set from data set B


This dataset is from examples in *Machine Learning & Security*  by Clarence Chio and David Freeman.  https://github.com/oreilly-mlsec/book-resources/tree/master/chapter3/datasets/cpu-utilization.

First let's take a look at data set A.  For the first part of this lab, load the training dataset into a DataFrame.  The `pd.read_csv` function has an option, `infer_datetime_format`, which, when set to `True` together with `parse_dates`, will automatically infer the date format. Setting this will save time and steps in data preparation.

Once the data is loaded, call the usual series of exploratory functions and most importantly, plot the data.
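
As a rough sketch of the loading step, the example below parses a few inline rows instead of the real file. The column names `datetime` and `cpu`, the timestamps, and the values are all assumptions made for illustration. It leaves out `infer_datetime_format`, which pandas 2.0 and later mark as deprecated.

```python
import io
import pandas as pd

# Hypothetical stand-in for a few rows of the training file; the column
# names, timestamp format and values are assumptions, not the real data.
csv_text = """datetime,cpu
2017-01-27 18:42:00,1.14
2017-01-27 18:43:00,1.10
2017-01-27 18:44:00,1.09
2017-01-27 18:45:00,1.08
"""

# parse_dates=[0] asks pandas to parse the first column as timestamps.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])

print(sample.dtypes)      # column types after parsing
print(sample.describe())  # summary statistics
```

With the real file, the path to the CSV replaces the `io.StringIO` object.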

In [None]:
df = # Your code here...

## Step Two:  Is the Data Stationary?

Now we are going to check the stationarity of our data set.  A time series is stationary when its statistical properties, such as its mean and variance, do not change over time; data with a trend or seasonality is not stationary.[1]

First compute the rolling mean and standard deviation for the CPU column in the data set.  This can be accomplished with the `rolling` function.[2]  Try different window sizes. 

[1]: https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322
[2]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

Once you have computed the rolling mean and std, plot them on a graph with the original data.  If the lines are generally flat, that suggests the data is stationary.
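
A minimal sketch of the rolling computation, using a synthetic series rather than the CPU data. The window size of 12 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CPU series: a flat level plus small noise.
rng = np.random.default_rng(0)
cpu = pd.Series(1.0 + 0.1 * rng.standard_normal(300))

window = 12  # hypothetical window size; try several values
rolling_mean_demo = cpu.rolling(window=window).mean()
rolling_std_demo = cpu.rolling(window=window).std()

# The first window - 1 entries are NaN because the window is not yet full.
print(rolling_mean_demo.dropna().describe())
print(rolling_std_demo.dropna().describe())
```

Larger windows smooth out more of the noise but react more slowly to real changes in level.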


In [None]:
rolling_mean = # Your code here...
rolling_std = # Your code here...

In [None]:
# Your code here...


Next we are going to run the Augmented Dickey-Fuller test [1] on this data to test whether the data is stationary.  Use the `adfuller` function in statsmodels to perform this computation. (https://www.statsmodels.org/devel/generated/statsmodels.tsa.stattools.adfuller.html)  For this example, use `AIC` as the `autolag` parameter, which means that the number of lags is chosen to minimize the Akaike information criterion.

[1]: https://en.wikipedia.org/wiki/Dickey–Fuller_test

In [None]:
adft = # Your code here...

The code below will present the results in a more understandable manner.

In [None]:
metrics = ["Test Statistics", "p-value", "No. of lags used", "Number of observations used",
           "critical value (1%)", "critical value (5%)", "critical value (10%)"]
values = [adft[0], adft[1], adft[2], adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']]
output_df = pd.DataFrame({"Values": values, "Metric": metrics})
output_df

### So is the Data Stationary? 
If the `p-value` is greater than 0.05 and the test statistic is greater than the critical values, we cannot reject the hypothesis that the data is non-stationary.  What do you think?
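
One way to write that decision rule down, assuming the conventional 0.05 significance level. The helper name `looks_stationary` and the numbers passed to it are made up for illustration; they are not results from the CPU data.

```python
def looks_stationary(test_statistic, p_value, critical_values, level="5%"):
    """Return True when the ADF null hypothesis (a unit root) can be rejected."""
    alpha = float(level.rstrip("%")) / 100
    return p_value < alpha and test_statistic < critical_values[level]


# Illustrative critical values only; use the dict returned by adfuller in practice.
crit = {"1%": -3.44, "5%": -2.87, "10%": -2.57}

print(looks_stationary(-5.2, 0.00001, crit))  # strongly negative statistic, tiny p-value
print(looks_stationary(-1.1, 0.71, crit))     # statistic above the critical value
```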


## Step Three:  Check for Autocorrelation
Next, we want to determine how correlated the time series is with its past values. This will help us tune our model and also decide whether the data can be used at all.

For this exercise, we will use the pandas `autocorr` and `autocorrelation_plot` methods.  

#### References
https://pandas.pydata.org/docs/reference/api/pandas.Series.autocorr.html
https://pandas.pydata.org/docs/reference/api/pandas.plotting.autocorrelation_plot.html

First, calculate the autocorrelation at various lag intervals. This is calculating the Pearson correlation, so 1 indicates perfect correlation.

At what point does the correlation go below 75%?  50%?
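
A small sketch of scanning lags with `Series.autocorr`, using a synthetic noisy sine wave in place of the CPU column. The period of 50 samples and the range of lags are arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: a noisy sine wave that repeats every 50 samples.
rng = np.random.default_rng(2)
t = np.arange(500)
series = pd.Series(np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(500))

# Pearson correlation between the series and a copy of itself shifted by each lag.
lags = range(1, 61)
acf = pd.Series([series.autocorr(lag=k) for k in lags], index=lags)

# First lag at which the correlation falls below each threshold.
first_below_75 = acf[acf < 0.75].index[0]
first_below_50 = acf[acf < 0.50].index[0]
print(first_below_75, first_below_50)
```

For periodic data the correlation climbs back up near multiples of the period, which is why the plot in the next step is useful.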

In [None]:
# Your code here... 

Now create an autocorrelation plot using the pandas autocorrelation plot method. This plot will help us visualize whether the data is correlated with itself and what the lag periods are.

(https://pandas.pydata.org/docs/reference/api/pandas.plotting.autocorrelation_plot.html) 

The method is: `pd.plotting.autocorrelation_plot(<data>)`.

The horizontal lines in the plot correspond to the 95% and 99% confidence bands.  The dashed line is the 99% confidence band.

In [None]:
# Your code here...

## Step Four:  Seasonal Decomposition
The last analytic technique we're going to use here is seasonal decomposition. Use the statsmodels `seasonal_decompose` function to create a decomposition plot, then take a look at the data.

Use `additive` as the model type and try different values for the period. 

https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.seasonal_decompose.html

In [None]:
# Your code here...

#### Automation

While it is good to understand how this works, the module `pmdarima` has a built-in test that can do this for you. Try running the code below to determine whether the data is stationary or not.

```python
adf_test = ADFTest(alpha=0.05)
adf_test.should_diff(df['cpu'])
```

In [None]:
# Your code here...

## Step Five:  Fit an ARIMA Model
Since we are dealing with time series data, let's train an ARIMA model and see how well this technique fits the actual data. 

ARIMA has three parameters:

* `p`:  The number of lag observations included in the model
* `d`: The number of times the raw observations are differenced
* `q`:  The size of the moving average window, that is, the number of lagged forecast errors included in the model

We are going to use the auto_arima method in pmdarima to do our forecasting.  Let's see how it works.  First build and fit an ARIMA model setting seasonal to `True`.  


Docs:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html

In [None]:
# Your code here...

Next, run the `summary()` method to view some summary statistics for this model.  

In [None]:
# Your code here...

Using the `predict_in_sample()` method, create a plot of the original data and the predictions to see how well the model did at forecasting with known data.

In [None]:
# Your code here...

## Step Six:  Find Anomalies in the CPU Data
Using data set `B` train a new model. Once you have a trained model, the next step is to call the `.predict()` method to generate 60 predictions.  

Next, compare the predictions with the actual values in the test set, similar to how we assess the accuracy of a classifier.  We will call the difference between the actual and predicted value the anomaly score.  Calculate the anomaly score for the test data.  Finally, plot the anomaly scores, and see if you can find the time intervals with the highest anomaly score. 

In [None]:
df2_train = pd.read_csv('../data/cpu-train-b.csv', parse_dates=[0], infer_datetime_format=True)
df2_test = pd.read_csv('../data/cpu-test-b.csv', parse_dates=[0], infer_datetime_format=True)

In [None]:
# Your code here...

# Train a new model using the training dataset


# Call predict and output 60 predictions and create a series with that


# Create a series with the delta of the predictions and the test values.


# Plot the results


## Conclusion:
If all went well, you should see anomalous behavior at 10 seconds into the test data.  Because a forecast becomes less certain the further ahead it predicts, the first anomaly should be enough to throw an alert for investigation.