This notebook demonstrates in detail how anomaly detection in request durations works.  
What is the process?  
 1) Data in RQA generally follow a normal, F, or triangular distribution, so we generate some mock data and explain why.  
    The mock data stand in for the raw reference dataset, which must be cleaned before it can serve as the detection arbiter.
 2) Outliers are detected using a combination of the DBSCAN algorithm and the Modified Z-Score method.
 3) Anomalies are searched for among the detected outliers using HDBSCAN or DBSCAN together with Modified Z-Score.
 4) Valid data are kept as reference data for further detections.  
    (Step 3 can be skipped in practice because it has no effect on cleaning the data; here it is performed for demonstration purposes.)
 5) We generate the same kind of data again, in a smaller amount, and use the reference data to perform the anomaly detection.

In [None]:
#r "nuget:YSoft.Rqa.AnomalyDetection.Application"

In [None]:
using YSoft.Rqa.AnomalyDetection.Application;
using YSoft.Rqa.AnomalyDetection.Application.Model;
using YSoft.Rqa.AnomalyDetection.Application.Services;
using YSoft.Rqa.AnomalyDetection.Data.Model.Csv;
using YSoft.Rqa.AnomalyDetection.Data.Model.Graylog;
using MoreLinq;

In [None]:
var generator = new TrafficGenerator();
var detector = new DurationAnomalyDetector(new Clusterer());
var plotter = new Plotter();

Let's choose a distribution for the rest of the demo (uncomment the desired one).

In [None]:
//var distributionType = "normal";
var distributionType = "F";
//var distributionType = "triangular";

Let's see what typical RQA data look like.

In [None]:
var count = 5000;
var timeSpan = new DateTimeInterval(DateTime.Now.AddDays(-7), DateTime.Now);
var timestamps = generator.GenerateRequestTimestamps(count, timeSpan);
List<double> durations;
if (distributionType == "normal")
    durations = ProbabilityDistribution.Normal(count, mean: 260, sigma: 30).Data;
else if(distributionType == "F")
    durations = ProbabilityDistribution.FDistribution(count, center: 150, dfNum: 30, dfDen: 60).Data;
else
    durations = ProbabilityDistribution.Triangular(count, start: 80, end: 420, peak: 150).Data;

var requests = Enumerable.Range(0, count).Select(i => new RequestDataPoint { Timestamp = timestamps[i], Duration = durations[i]});
var rg = new RequestGroup("MockService", "MockRequestType", requests.ToList());

In [None]:
display(plotter.Histogram(rg.ValidData, title: "Basic distribution"))

Let's add some random outliers to the data. An amount equal to 2% of the valid data, spread over a wider range around the distribution, seems about right.

In [None]:
var interval = distributionType == "normal" ? new Interval(100, 420) : distributionType == "F" ? new Interval(0, 500) : new Interval(0, 600);
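// Compute the outlier count once so that durations and timestamps grow by the same amount.
var outlierCount = (int)(durations.Count * 0.02);
durations.AddRange(ProbabilityDistribution.Uniform(outlierCount, interval));
timestamps.AddRange(generator.GenerateRequestTimestamps(outlierCount, timeSpan));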

requests = Enumerable.Range(0, durations.Count).Select(i => new RequestDataPoint { Timestamp = timestamps[i], Duration = durations[i]});
rg = new RequestGroup("MockService", "MockRequestType", requests.ToList());

In [None]:
display(plotter.Histogram(rg.ValidData, title: "Mock data with random outliers"))

Finally, add an anomaly. An anomaly is a repeated occurrence of certain outliers (= a cluster of outliers) outside the valid data.

In [None]:
interval = distributionType == "normal" ? new Interval(380, 400) : distributionType == "F" ? new Interval(350, 370) : new Interval(450, 470);
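// Compute the anomaly size once so that durations and timestamps grow by the same amount.
var anomalyCount = (int)(durations.Count * 0.06);
durations.AddRange(ProbabilityDistribution.Uniform(anomalyCount, interval));
timestamps.AddRange(generator.GenerateRequestTimestamps(anomalyCount, timeSpan));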

requests = Enumerable.Range(0, durations.Count).Select(i => new RequestDataPoint { Timestamp = timestamps[i], Duration = durations[i]});
rg = new RequestGroup("MockService", "MockRequestType", requests.ToList());

In [None]:
display(plotter.Histogram(rg.ValidData, title: "Raw reference data with an anomaly"))

Outlier detection - DBSCAN + Modified Z-Score  
1) DBSCAN finds the center of the dataset (= the biggest cluster), from which we obtain the parameters for Modified Z-Score.  
2) Modified Z-Score is used for the detection itself.  

Why this way?  
 * DBSCAN alone might:
   * Not find anything in an adverse dataset.
   * Not cover the whole cluster (instead of one big cluster it might find several smaller ones, or just the very dense center of the cluster, which is too strict on outliers).
   * Cover more than it should (include some noise data points around the cluster).
 * Modified Z-Score alone might:
   * Have its median and MAD skewed by one big anomaly or by several anomalies. Simply put, the valid area would shift towards the anomaly: part of the anomaly would be declared valid, and some valid data at the lower end would become outliers.  
   
The reason for combining the two is simple. The detection is general-purpose. The input data are unknown (each request of every service looks a bit different), so the detection cannot be optimized or trained on a specific dataset. That's why DBSCAN returns only a "rough" result, which is then corrected with the Modified Z-Score method.
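
For reference, the Modified Z-Score of a duration $x_i$ is commonly defined as shown below (the textbook formulation by Iglewicz and Hoaglin). Taking the median $\tilde{x}$ and the MAD over the biggest DBSCAN cluster $C$ follows from step 1 above. The cutoff the detector actually uses is not stated in this notebook, so the commonly recommended value of 3.5 is only an assumption here.

$$
\mathrm{MAD} = \operatorname{median}_{x_j \in C} \lvert x_j - \tilde{x} \rvert, \qquad
M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}, \qquad
x_i \text{ is an outlier if } \lvert M_i \rvert > 3.5
$$

Because $\tilde{x}$ and the MAD are computed from the main cluster only, a large anomaly elsewhere in the data does not pull the valid area towards itself.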

In [None]:
detector.FindOutliers(rg);

In [None]:
display(plotter.Scatter(rg.OutlierDetectionClusters, "Clusters found by DBSCAN"));
display(plotter.DetectionScatter(rg, "Final outlier detection after applying Modified Z-Score"));
display(plotter.DetectionHistogram(rg));

Anomaly detection - (H)DBSCAN + Modified Z-Score  
1) Find clusters among the outliers using HDBSCAN for large datasets and DBSCAN for small ones.
2) Merge clusters that lie close together, because the general-purpose clustering is only "rough", as explained above.
3) Smooth the clusters with Modified Z-Score.
4) If a cluster is big enough (by default >= 5% of the whole dataset), it is an anomaly.
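
The size rule in step 4 can be sketched as follows. This is a minimal illustration, not the library's implementation: `IsAnomaly` and `minShare` are hypothetical names, and it assumes that "the whole dataset" means all data points, outliers included.

```csharp
using System;

// Hypothetical helper, not part of the library: a cluster counts as an anomaly
// when it holds at least `minShare` of all data points in the dataset.
bool IsAnomaly(int clusterSize, int datasetSize, double minShare = 0.05) =>
    datasetSize > 0 && clusterSize >= minShare * datasetSize;

// Sizes taken from the mock data above: 5000 valid points, 100 random outliers
// and 306 injected "anomaly" points, 5406 in total.
Console.WriteLine(IsAnomaly(306, 5406)); // True  (306 >= 270.3)
Console.WriteLine(IsAnomaly(100, 5406)); // False (100 <  270.3)
```

In the real run the clusters come out of (H)DBSCAN, merging and smoothing, so their sizes will not match the injected counts exactly.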

In [None]:
detector.FindAnomalies(rg);

In [None]:
display(plotter.Scatter(rg.AnomalyDetectionClusters, "Clusters found by (H)DBSCAN"));
display(plotter.Scatter(rg.AnomalyDetectionMergedClusters, "Clusters after merging the close ones"));
display(plotter.DetectionScatter(rg, "Final anomaly detection"));
display(plotter.DetectionHistogram(rg));

Now that we have identified the valid data, let's use them for further detection.

In [None]:
var referenceData = rg.ValidData.Clone();

Let's generate new data. Assuming a 1-week reference window and detection performed twice a day, each detection run covers approximately 7.15% of the reference dataset (one of 14 runs per week, i.e. roughly 1/14).

In [None]:
var newDataCount = (int)(count * 0.0715);
List<double> newData;
Interval outlierInterval;
Interval anomalyInterval;
// valid data
if (distributionType == "normal"){
    newData = ProbabilityDistribution.Normal(newDataCount, mean: 260, sigma: 30).Data;
    outlierInterval = new Interval(100, 420);
    anomalyInterval = new Interval(380, 400);
}
else if(distributionType == "F"){
    newData = ProbabilityDistribution.FDistribution(newDataCount, center: 150, dfNum: 75, dfDen: 40).Data;
    outlierInterval = new Interval(0, 500);
    anomalyInterval = new Interval(350, 370);
}
else{
    newData = ProbabilityDistribution.Triangular(newDataCount, start: 80, end: 420, peak: 150).Data;
    outlierInterval = new Interval(0, 600);
    anomalyInterval = new Interval(460, 480);
}
// outliers and an anomaly
newData.AddRange(ProbabilityDistribution.Uniform((int)(Math.Max(5, newData.Count*0.02)), outlierInterval));
newData.AddRange(ProbabilityDistribution.Uniform((int)(Math.Max(5, newData.Count*0.06)), anomalyInterval));
newData = newData.Shuffle().ToList();

var newDataTimestamps = generator.GenerateRequestTimestamps((int)(newData.Count), new DateTimeInterval(DateTime.Now.AddDays(-0.5), DateTime.Now));
requests = Enumerable.Range(0, newData.Count).Select(i => new RequestDataPoint { Timestamp = newDataTimestamps[i], Duration = newData[i]});
rg = new RequestGroup("MockService", "MockRequestType", requests.ToList());

In [None]:
display(plotter.Scatter(rg.ValidData, title: "New data in which to find an anomaly"));

Thanks to the reference data, DBSCAN is no longer needed: we already know what the data should look like.  
The Modified Z-Score parameters (median and MAD) are obtained from the reference dataset and applied to the new data.
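
The idea can be sketched in plain C#. This is a minimal illustration under stated assumptions, not the code behind `FindOutliers`: the helpers `Median`, `Mad` and `OutliersAgainstReference` are hypothetical, and the constant 0.6745 and the cutoff 3.5 are the commonly used textbook values rather than values confirmed for this library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: median of a non-empty sequence.
double Median(IEnumerable<double> values)
{
    var sorted = values.OrderBy(v => v).ToArray();
    if (sorted.Length == 0) throw new ArgumentException("Sequence is empty.", nameof(values));
    int mid = sorted.Length / 2;
    return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
}

// Hypothetical helper: median absolute deviation around a given median.
double Mad(IEnumerable<double> values, double median) =>
    Median(values.Select(v => Math.Abs(v - median)));

// Scores the NEW data with parameters taken from the REFERENCE data only.
List<double> OutliersAgainstReference(
    IReadOnlyList<double> reference, IEnumerable<double> newData, double threshold = 3.5)
{
    double median = Median(reference);
    double mad = Mad(reference, median);
    if (mad == 0) throw new InvalidOperationException("MAD is zero; the score is undefined.");
    return newData.Where(x => Math.Abs(0.6745 * (x - median) / mad) > threshold).ToList();
}

var reference = new List<double> { 40, 45, 50, 55, 60 }; // median 50, MAD 5
var incoming = new List<double> { 48, 52, 95 };
Console.WriteLine(string.Join(", ", OutliersAgainstReference(reference, incoming))); // 95
```

Since the new batch never contributes to the median or the MAD, even a batch that consists mostly of anomalous durations cannot shift the valid area.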

In [None]:
detector.FindOutliers(rg, referenceData);

In [None]:
display(plotter.DetectionScatter(rg, "Outlier detection with reference dataset"));
display(plotter.DetectionHistogram(rg));

Anomaly detection (= clustering of outliers) isn't directly influenced by the reference dataset. However, it still refers to the valid data when determining neighbors or extreme distance values, so these computations become more precise.

In [None]:
detector.FindAnomalies(rg, referenceData);

In [None]:
display(plotter.Scatter(rg.AnomalyDetectionClusters, "Clusters found by (H)DBSCAN"));
display(plotter.Scatter(rg.AnomalyDetectionMergedClusters, "Clusters after merging the close ones"));

display(plotter.DetectionScatter(rg, "Final anomaly detection"));
display(plotter.DetectionHistogram(rg));

Sometimes, what looks like an anomaly should not be treated as one. For instance, the valid data lie in the range 40-60 and the "anomalous" data in the range 90-100. There is a clear gap and a sufficient number of outliers for an anomaly. Still, one can argue that 30 ms is next to nothing. For this purpose, the FindAnomalies method provides a "tolerance" parameter specifying an "anomaly-free zone" around the valid data in which nothing is an anomaly.  
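
One way to picture the anomaly-free zone is the sketch below. It is only an assumption about how such a parameter could behave: the helper `IsInsideToleranceZone` is hypothetical, and the notebook does not state whether the library measures the tolerance from the edges of the valid data or in which units.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true when every point of the cluster lies within
// `tolerance` of the valid range, i.e. inside the assumed anomaly-free zone.
bool IsInsideToleranceZone(
    IReadOnlyCollection<double> validData, IReadOnlyCollection<double> cluster, double tolerance)
{
    double lower = validData.Min() - tolerance;
    double upper = validData.Max() + tolerance;
    return cluster.All(d => d >= lower && d <= upper);
}

var valid = new List<double> { 40, 45, 50, 55, 60 };
var cluster = new List<double> { 90, 93, 97, 100 };

// Zone [-10, 110] swallows the cluster, so it would not be reported.
Console.WriteLine(IsInsideToleranceZone(valid, cluster, tolerance: 50)); // True
// Zone [20, 80] leaves the cluster outside, so it would still be an anomaly.
Console.WriteLine(IsInsideToleranceZone(valid, cluster, tolerance: 20)); // False
```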

Let's demonstrate it on our example. With a sufficient tolerance there should be no anomalies detected.

In [None]:
rg = new RequestGroup("MockService", "MockRequestType", requests.ToList());
detector.FindOutliers(rg, referenceData);
detector.FindAnomalies(rg, referenceData, tolerance: 150);
rg.Anomalies.Length()