Anomaly-Detection-System

Overview

This is an anomaly detection and recommender system implemented in MATLAB.

The anomaly detection project fits a multivariate Gaussian distribution to the training data in order to detect anomalous behavior in computer servers. We collected m = 307 examples of server behavior, which form the unlabeled dataset {x^1, ..., x^m}. The features measure the throughput (mb/s) and response latency (ms) of each server. We suspect that the vast majority of these examples are "normal" (non-anomalous) examples of servers operating normally, but some of the examples may come from servers acting anomalously.

The Dataset

The graph below is the first dataset:

Gaussian Distribution

To perform anomaly detection, we first need to fit a model to the data's distribution.

Given a training set {x^1, ..., x^m}, where x^i ∈ R^n, we want to estimate a Gaussian distribution for each of the features x_i. For each feature i = 1, ..., n, we need to find the parameters μ_i and σ_i² that fit the data in the i-th dimension, {x_i^1, ..., x_i^m}.

The Gaussian distribution is given by:

Here μ is the mean and σ² is the variance.
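Written out in LaTeX, the standard univariate Gaussian density with mean μ and variance σ² is:

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```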

Project Schema 1: Estimating the parameters for a Gaussian distribution

We estimate the parameters (μ_i, σ_i²) of the i-th feature using the following equations.

To estimate the mean, μ_i, we use:

and for the variance, σ_i², we use:
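Assuming the usual maximum-likelihood estimates, with the variance normalized by m rather than m − 1 (an assumption, since the formulas themselves are not reproduced here), the two estimates can be written in LaTeX as:

```latex
\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)},
\qquad
\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left( x_i^{(j)} - \mu_i \right)^2
```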

The file estimateGaussian.m contains a function that takes the data matrix X as input and outputs an n-dimensional vector mu that holds the means of all n features and another n-dimensional vector sigma2 that holds the variances of all the features.
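A minimal sketch of how such estimates could be computed is shown below. It is not the repository's estimateGaussian.m: the toy matrix X is made up for illustration, and the variance is assumed to be normalized by m rather than m − 1.

```matlab
% Hypothetical toy data: 4 examples (rows) with 2 features (columns).
X = [1 2; 3 4; 5 6; 7 8];

m = size(X, 1);                    % number of examples

% Mean of each feature, stored as an n-by-1 column vector.
mu = mean(X, 1)';

% Subtract the per-feature mean from every row.
Xc = bsxfun(@minus, X, mu');

% Variance of each feature, normalized by m (assumption: not m - 1).
sigma2 = (sum(Xc .^ 2, 1) / m)';

disp(mu);
disp(sigma2);
```

For this toy input, each feature has deviations of ±1 and ±3 around its mean, so both variances come out to (9 + 1 + 1 + 9) / 4 = 5.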

We also visualize the contours of the fitted Gaussian distribution; the plot should look similar to the one below:

From the plot, we can see that most of the non-anomalous examples lie in the region of highest probability, while the anomalous training examples lie in regions of lower probability.

Project Schema 2: Selecting the threshold, ε

After estimating the Gaussian distribution parameters, we can investigate which training examples have a very high probability under this distribution and which have a very low probability.
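To illustrate how the fitted parameters turn into a probability for each example, here is a small sketch. It assumes the features are treated as independent (a diagonal covariance matrix built from sigma2), which may differ from what multivariateGaussian.m in this repository does. The values of X, mu and sigma2 are toy values chosen for illustration.

```matlab
% Hypothetical toy inputs: 3 examples with 2 features each.
X      = [0 0; 1 1; 10 10];
mu     = [0; 0];                   % per-feature means
sigma2 = [1; 1];                   % per-feature variances

n = numel(mu);

% Center the data on the estimated means.
Xc = bsxfun(@minus, X, mu');

% Density under a Gaussian with diagonal covariance diag(sigma2)
% (assumption: features are independent).
expo = -0.5 * sum(bsxfun(@rdivide, Xc .^ 2, sigma2'), 2);
p = (2 * pi) ^ (-n / 2) * prod(sigma2) ^ (-0.5) * exp(expo);

disp(p);
```

The example at the mean gets the highest density, and the example far from the mean gets a density close to zero, which is what makes a probability threshold usable for flagging anomalies.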

The low-probability examples are more likely to be the anomalies in our training set. One way to decide which examples are anomalies is to select a threshold based on a cross validation set.

In this part of the project we implement an algorithm to select the threshold ε using the F₁ score on the cross validation set. For this we use a cross validation set

{(x_cv⁽¹⁾, y_cv⁽¹⁾), ..., (x_cv⁽ᵐ⁾, y_cv⁽ᵐ⁾)}, where the label y = 1 corresponds to an anomalous example and y = 0 corresponds to a normal example. For each cross validation example, we compute p(x_cv⁽ⁱ⁾). The vector of all these probabilities p(x_cv⁽¹⁾), ..., p(x_cv⁽ᵐ⁾) is passed to the function in selectThreshold.m as the vector pval, and the corresponding labels y_cv⁽¹⁾, ..., y_cv⁽ᵐ⁾ are passed in as well.

selectThreshold.m then returns (1) the selected threshold ε and (2) the F₁ score. An example x with a low probability, p(x) < ε, is considered to be an anomaly. The F₁ score tells us how well we are doing at finding the ground truth anomalies for a given threshold.

For each candidate value of ε, we compute the F₁ score by counting how many examples the current threshold classifies correctly and incorrectly.

F₁ is computed from precision (prec) and recall (rec):
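The F₁ score is the harmonic mean of precision and recall, which in LaTeX reads:

```latex
F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}
```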

Precision and recall are computed as:

prec = a / (a + b);

rec = a / (a + c);

where,

a is the number of true positives: the ground truth label says it's an anomaly and our algorithm correctly classified it as an anomaly.

b is the number of false positives: the ground truth label says it's not an anomaly, but our algorithm incorrectly classified it as an anomaly.

c is the number of false negatives: the ground truth label says it's an anomaly, but our algorithm incorrectly classified it as not being anomalous.

selectThreshold.m has a loop that tries different values of ε and selects the best ε based on the F₁ score. To compute a, b and c, we use vectorized operations rather than looping over the examples. In MATLAB, an equality test between a vector and a single number is applied element by element. For example, given an n-dimensional binary vector v, we can find out how many values in this vector are 0 by using sum(v == 0). We can also apply the logical AND operator (&) to binary vectors.
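The vectorized counting described above can be sketched as follows. The vectors pval and yval and the threshold epsilon are toy values invented for illustration; only the names a, b, c, prec and rec follow the text.

```matlab
% Hypothetical toy inputs: labels (1 = anomaly) and probabilities.
yval    = [0; 0; 1; 1; 0; 1];
pval    = [0.9; 0.02; 0.01; 0.5; 0.8; 0.03];
epsilon = 0.05;                    % example threshold (assumption)

% Predict "anomaly" wherever the probability is below the threshold.
pred = pval < epsilon;

% Element-wise comparisons combined with logical AND, then summed.
a = sum((pred == 1) & (yval == 1));   % true positives
b = sum((pred == 1) & (yval == 0));   % false positives
c = sum((pred == 0) & (yval == 1));   % false negatives

prec = a / (a + b);
rec  = a / (a + c);
F1   = 2 * prec * rec / (prec + rec);

fprintf('a=%d b=%d c=%d F1=%.4f\n', a, b, c, F1);
```

With these toy values the threshold catches two of the three anomalies and raises one false alarm, so precision and recall are both 2/3.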

When we run this part, the selected value of epsilon should be approximately 8.99e-05.
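A threshold search of this kind might look like the sketch below. It is not the repository's selectThreshold.m. The toy vectors pval and yval, the variable names bestEpsilon and bestF1, and the choice of 1000 evenly spaced candidate thresholds between min(pval) and max(pval) are all assumptions made for illustration.

```matlab
% Hypothetical toy inputs: labels (1 = anomaly) and probabilities.
yval = [0; 0; 1; 1; 0; 1];
pval = [0.9; 0.7; 0.01; 0.02; 0.8; 0.03];

bestEpsilon = 0;
bestF1 = 0;

% Assumption: scan 1000 evenly spaced steps across the range of pval.
stepsize = (max(pval) - min(pval)) / 1000;

for epsilon = min(pval):stepsize:max(pval)
    pred = pval < epsilon;

    tp = sum((pred == 1) & (yval == 1));   % a in the text
    fp = sum((pred == 1) & (yval == 0));   % b in the text
    fn = sum((pred == 0) & (yval == 1));   % c in the text

    % Skip thresholds with no true positives to avoid dividing by zero.
    if tp == 0
        continue;
    end

    prec = tp / (tp + fp);
    rec  = tp / (tp + fn);
    F1   = 2 * prec * rec / (prec + rec);

    % Keep the first threshold that strictly improves the score.
    if F1 > bestF1
        bestF1 = F1;
        bestEpsilon = epsilon;
    end
end

fprintf('best epsilon = %g, best F1 = %g\n', bestEpsilon, bestF1);
```

Because the comparison is strict, ties are resolved in favor of the smallest threshold that reaches the best score. On this toy data that is the first candidate just above 0.03, which separates the three anomalies from the normal examples.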

Project Schema 3: High dimensional dataset

The last part of the script Anomaly.m runs the anomaly detection algorithm on a higher-dimensional dataset. This dataset has 11 features that capture more properties of the computer servers. The script uses the same code to estimate the Gaussian parameters (μ_i and σ_i²) and to evaluate the probabilities for both the training data X, from which we estimated the Gaussian parameters, and the cross validation set Xval. Finally, it uses selectThreshold to find the best threshold ε. We should see a value of epsilon of about 1.38e-18 and 117 anomalies detected.
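As a rough illustration of how the anomaly count is obtained once ε has been selected, the sketch below flags every example whose probability falls under the threshold. The vector p holds made-up probabilities, not values from the repository's data, and the names isAnomaly, anomalyIdx and numAnomalies are hypothetical.

```matlab
% Hypothetical probabilities for 5 training examples (toy values).
p = [0.12; 3e-20; 0.08; 5e-19; 0.2];
epsilon = 1.38e-18;                % threshold of the magnitude quoted above

% An example is flagged when its probability is below the threshold.
isAnomaly    = p < epsilon;
anomalyIdx   = find(isAnomaly);    % row indices of the flagged examples
numAnomalies = sum(isAnomaly);     % how many examples were flagged

fprintf('# anomalies found: %d\n', numAnomalies);
```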


ntalib/Machine-Learning-Anomaly-Detection-System
