# Outlier Detection

This notebook will walk you through a workflow for outlier detection.

## Case Study I

In this example we will use synthetic data. The data was generated using:
```
sample/outlier/workload.sh
```

The script compiles the Linux kernel a few times. For certain compiles
it adds a background workload, so the compile takes longer. The slower jobs are marked
with an 'outlier' suffix. However, we will pretend we don't know which jobs are
the outliers and figure it out.

Along the way, we will also learn how to create a trained model, and use it 
for future outlier detection.


### Requirements

You will need to import the data in `sample/outlier/*.tgz`:
```
$ ./epmt -v submit sample/outlier/*.tgz
```

In [1]:
import epmt_query as eq
import epmt_outliers as eod

INFO:epmt_job:Binding to DB: {'provider': 'postgres', 'host': 'localhost', 'password': 'example', 'dbname': 'EPMT', 'user': 'postgres'}
INFO:epmt_job:Generating mapping from schema...


{'provider': 'postgres', 'host': 'localhost', 'password': 'example', 'dbname': 'EPMT', 'user': 'postgres'}


In [2]:
jobs = eq.get_jobs(tag='exp_name:linux_kernel', fmt='terse')
jobs

['kern-6656-20190614-185359',
 'kern-6656-20190614-190245',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-192044-outlier',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-200819',
 'kern-6656-20190614-201744-outlier']

In [3]:
# As a first pass let's see whether the outliers can be auto-detected
(df, fdict) = eod.detect_outlier_jobs(jobs)
df

Unnamed: 0,jobid,duration,cpu_time,num_procs
0,kern-6656-20190614-185359,0,1,0
1,kern-6656-20190614-190245,0,0,0
2,kern-6656-20190614-191138,0,0,0
3,kern-6656-20190614-192044-outlier,1,1,0
4,kern-6656-20190614-194024,0,0,0
5,kern-6656-20190614-194953,0,0,0
6,kern-6656-20190614-195909,0,0,0
7,kern-6656-20190614-200819,0,0,0
8,kern-6656-20190614-201744-outlier,1,1,0


As you can see, while we did catch both outliers, there
is also one false positive: the non-outlier job `kern-6656-20190614-185359` is flagged on `cpu_time`.
The two outlier jobs are marked with a 1 for `duration` and `cpu_time` but not for `num_procs`
because the background compute process increased the job duration
but not the number of sub-processes of our workload.
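
Because this synthetic data set carries its ground truth in the job names, the false positive can also be found programmatically. The following is a minimal sketch in plain Python, not part of EPMT: the `flags` dictionary is copied by hand from the table above, and `is_labeled_outlier` is a hypothetical helper that relies only on the 'outlier' suffix convention.

```python
# Per-job flags copied from the detection table (1 = outlier for that feature).
flags = {
    'kern-6656-20190614-185359':         {'duration': 0, 'cpu_time': 1, 'num_procs': 0},
    'kern-6656-20190614-190245':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-191138':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-192044-outlier': {'duration': 1, 'cpu_time': 1, 'num_procs': 0},
    'kern-6656-20190614-194024':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-194953':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-195909':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-200819':         {'duration': 0, 'cpu_time': 0, 'num_procs': 0},
    'kern-6656-20190614-201744-outlier': {'duration': 1, 'cpu_time': 1, 'num_procs': 0},
}

def is_labeled_outlier(jobid):
    """Hypothetical helper: ground truth comes from the job-name suffix."""
    return jobid.endswith('-outlier')

# Jobs flagged on at least one feature versus jobs labeled as outliers.
flagged = {job for job, feats in flags.items() if any(feats.values())}
labeled = {job for job in flags if is_labeled_outlier(job)}

false_positives = sorted(flagged - labeled)  # flagged but not labeled
missed = sorted(labeled - flagged)           # labeled but not flagged

print(false_positives)
print(missed)
```

In real workloads no such suffix exists, so a check like this is only useful for validating the detection on synthetic data.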

In [4]:
fdict

{'cpu_time': ({'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-185359',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-201744-outlier'}),
 'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'}),
 'num_procs': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819',
   'kern-6656-20190614-201744-outlier'},
  set())}

`fdict`, the other return value, is a dictionary keyed by feature. Each value is a tuple of two partitions of the jobs for that feature: the first is the reference set and the second is the outlier set.
This partitioning can be obtained more simply as follows:

In [5]:
parts = eod.partition_jobs(jobs, features=['duration'])
parts

{'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'})}

Above, we just got the partitioning of the jobs on a single `feature` -- `duration`.
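
Partitions for several features can also be combined with ordinary set operations. The sketch below assumes only the dictionary shape seen in the outputs so far (feature mapped to a tuple of reference set and outlier set). The job IDs and outlier sets are copied from the `fdict` output rather than fetched from EPMT, and the names `clean` and `flagged_any` are introduced here for illustration.

```python
jobs = [
    'kern-6656-20190614-185359',
    'kern-6656-20190614-190245',
    'kern-6656-20190614-191138',
    'kern-6656-20190614-192044-outlier',
    'kern-6656-20190614-194024',
    'kern-6656-20190614-194953',
    'kern-6656-20190614-195909',
    'kern-6656-20190614-200819',
    'kern-6656-20190614-201744-outlier',
]

# Outlier sets per feature, as reported in the fdict output.
outliers = {
    'cpu_time': {'kern-6656-20190614-185359',
                 'kern-6656-20190614-192044-outlier',
                 'kern-6656-20190614-201744-outlier'},
    'duration': {'kern-6656-20190614-192044-outlier',
                 'kern-6656-20190614-201744-outlier'},
    'num_procs': set(),
}

# Rebuild the (reference, outlier) tuples in the same shape as fdict.
fdict = {feat: (set(jobs) - out, out) for feat, out in outliers.items()}

# Jobs in the reference partition for every feature.
clean = set.intersection(*(ref for ref, _ in fdict.values()))

# Jobs in the outlier partition for at least one feature.
flagged_any = set.union(*(out for _, out in fdict.values()))

print(sorted(clean))
print(sorted(flagged_any))
```

Intersecting the reference partitions gives a stricter reference set than a single feature does; whether that is preferable depends on which features matter for the workload.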

Now would be a good time to create a trained model based on the
set of jobs in the reference partition:

In [6]:
ref_jobs = parts['duration'][0]
ref_jobs

{'kern-6656-20190614-185359',
 'kern-6656-20190614-190245',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-200819'}

In [8]:
r = eq.create_refmodel(ref_jobs, tag='exp_name:linux_kernel;type:ref')

In [10]:
r['id'], r['tags']

(2, {'exp_name': 'linux_kernel', 'type': 'ref'})

We added a tag to help search for this trained/ref model later.
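
The tag was passed as the string `'exp_name:linux_kernel;type:ref'` and came back as a dictionary. The sketch below illustrates that correspondence and how a tag-based match could work. `parse_tag` and `matches` are hypothetical helpers written for this illustration, not EPMT functions, and EPMT's own parsing may differ.

```python
def parse_tag(tag):
    """Hypothetical helper: turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'}."""
    pairs = (item.split(':', 1) for item in tag.split(';') if ':' in item)
    return {key.strip(): value.strip() for key, value in pairs}

def matches(model_tags, wanted):
    """Return True if every wanted key/value pair appears in model_tags."""
    return all(model_tags.get(key) == value for key, value in wanted.items())

model_tags = parse_tag('exp_name:linux_kernel;type:ref')
print(model_tags)

# A search on a subset of the tag still matches the model.
print(matches(model_tags, parse_tag('type:ref')))
print(matches(model_tags, parse_tag('exp_name:other')))
```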

In [11]:
# using the trained model is as simple as:
(df, _) = eod.detect_outlier_jobs(jobs, trained_model = r['id'])
df

Unnamed: 0,jobid,duration,cpu_time,num_procs
0,kern-6656-20190614-185359,0,0,0
1,kern-6656-20190614-190245,0,0,0
2,kern-6656-20190614-191138,0,0,0
3,kern-6656-20190614-192044-outlier,1,1,0
4,kern-6656-20190614-194024,0,0,0
5,kern-6656-20190614-194953,0,0,0
6,kern-6656-20190614-195909,0,0,0
7,kern-6656-20190614-200819,0,0,0
8,kern-6656-20190614-201744-outlier,1,1,0


As expected, none of the jobs that were used to create the reference model are
classified as outliers for any feature.
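
One way to see why reference jobs cannot be flagged is to imagine a model that stores, per feature, the largest score observed among its own reference jobs and uses it as the threshold. The sketch below makes that assumption explicit with a z-score. This notebook does not show how EPMT actually scores jobs, so the `train` and `is_outlier` functions and the duration values are purely illustrative.

```python
from statistics import mean, pstdev

def train(values):
    """Assumed model: mean, std and the largest z-score seen in the reference set."""
    mu, sigma = mean(values), pstdev(values)
    max_z = max(abs(v - mu) / sigma for v in values)
    return {'mean': mu, 'std': sigma, 'max_z': max_z}

def is_outlier(model, value):
    """A value is an outlier only if it scores above every reference job."""
    z = abs(value - model['mean']) / model['std']
    return z > model['max_z']

# Made-up durations in seconds for seven reference jobs (not real EPMT data).
ref_durations = [520.0, 531.0, 525.0, 518.0, 529.0, 523.0, 527.0]
model = train(ref_durations)

# By construction, no reference job exceeds the stored maximum score.
ref_flags = [is_outlier(model, v) for v in ref_durations]
print(any(ref_flags))

# A much slower job (also made up) scores above the threshold.
slow_flag = is_outlier(model, 1100.0)
print(slow_flag)
```

Under this assumption, the threshold is only as good as the reference set: a reference job that is itself unusually slow raises the threshold and makes later outliers harder to catch.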

This marks the end of this case study. In a following study we will explore how
to detect outliers in individual operations and create a trained model for ops.