# Outlier Detection

This notebook will walk you through a workflow for outlier detection.

## TOC

 * [Detecting outlier jobs](#detect-outlier-jobs-study-1)
  * [partition jobs](#partition-jobs)
  * [create a trained model](#create-ref-model)
 * [Root Cause Analysis (RCA)](#rca-job)

## <a name="detect-outlier-jobs-study-1">Case Study 1 - Detecting outlier jobs</a>

In this example we will use synthetic data. The data was generated using:
```
sample/outlier/workload.sh
```

The script compiles the Linux kernel several times. For certain compiles
it adds a background workload, so the compile takes longer. The slower jobs are marked
with an 'outlier' suffix. However, we will pretend we don't know which jobs are the
outliers and work it out.

Along the way, we will also learn how to create a trained model, and use it 
for future outlier detection.


### Requirements

You will need to import the data in `sample/outlier/*.tgz`:
```
$ ./epmt -v submit sample/outlier/*.tgz
```


In [1]:
import epmt_query as eq
import epmt_outliers as eod

INFO:epmt_job:Binding to DB: {'host': 'localhost', 'user': 'postgres', 'password': 'example', 'dbname': 'EPMT', 'provider': 'postgres'}
INFO:epmt_job:Generating mapping from schema...


{'host': 'localhost', 'user': 'postgres', 'password': 'example', 'dbname': 'EPMT', 'provider': 'postgres'}


In [2]:
jobs = eq.get_jobs(tag='exp_name:linux_kernel', fmt='terse')
jobs

['kern-6656-20190614-185359',
 'kern-6656-20190614-190245',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-192044-outlier',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-200819',
 'kern-6656-20190614-201744-outlier']

In [3]:
# As a first pass let's see whether the outliers can be auto-detected
(df, fdict) = eod.detect_outlier_jobs(jobs)
df

| | jobid | duration | cpu_time | num_procs |
|---|---|---|---|---|
| 0 | kern-6656-20190614-185359 | 0 | 1 | 0 |
| 1 | kern-6656-20190614-190245 | 0 | 0 | 0 |
| 2 | kern-6656-20190614-191138 | 0 | 0 | 0 |
| 3 | kern-6656-20190614-192044-outlier | 1 | 1 | 0 |
| 4 | kern-6656-20190614-194024 | 0 | 0 | 0 |
| 5 | kern-6656-20190614-194953 | 0 | 0 | 0 |
| 6 | kern-6656-20190614-195909 | 0 | 0 | 0 |
| 7 | kern-6656-20190614-200819 | 0 | 0 | 0 |
| 8 | kern-6656-20190614-201744-outlier | 1 | 1 | 0 |


As you can see, we caught both outliers, but there is also a false positive:
the non-outlier job `kern-6656-20190614-185359` is flagged on `cpu_time`.
The outliers are marked with a 1 for `duration` and `cpu_time` but not for `num_procs`
because the background compute process increased the job duration
but not the number of sub-processes of our workload.

In [4]:
fdict

{'cpu_time': ({'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-185359',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-201744-outlier'}),
 'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'}),
 'num_procs': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819',
   'kern-6656-20190614-201744-outlier'},
  set())}

`fdict`, the other return value, is a dictionary keyed by `feature`. Each value is a tuple of two partitions of the jobs for that `feature`: the first is the reference set and the second is the outlier set.
This partitioning <a name="partition-jobs"></a> can be obtained more simply as follows:

In [5]:
parts = eod.partition_jobs(jobs, features=['duration'])
parts

{'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'})}

Above, we just got the partitioning of the jobs on a single `feature` -- `duration`.
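
The `(reference, outlier)` tuple layout makes it straightforward to combine partitions across features. The following is a minimal plain-Python sketch, not part of EPMT: `make_partition`, `flagged_by_any` and `features_flagging` are hypothetical helpers, and the `fdict` literal rebuilds the `In [4]` output shown earlier so the example runs without a database.

```python
# Plain-Python sketch; the helpers below are hypothetical, not EPMT functions.
PREFIX = 'kern-6656-20190614-'
ALL_JOBS = {PREFIX + s for s in (
    '185359', '190245', '191138', '192044-outlier', '194024',
    '194953', '195909', '200819', '201744-outlier')}

def make_partition(outlier_suffixes):
    """Build a (reference, outlier) tuple shaped like the values of fdict."""
    outliers = {PREFIX + s for s in outlier_suffixes}
    return (ALL_JOBS - outliers, outliers)

# Same content as the fdict output shown earlier
fdict = {
    'cpu_time': make_partition(['185359', '192044-outlier', '201744-outlier']),
    'duration': make_partition(['192044-outlier', '201744-outlier']),
    'num_procs': make_partition([]),
}

def flagged_by_any(partitions):
    """Union of the outlier sets over all features."""
    flagged = set()
    for _ref, outliers in partitions.values():
        flagged |= outliers
    return flagged

def features_flagging(partitions, jobid):
    """Sorted list of features whose outlier set contains jobid."""
    return sorted(f for f, (_ref, outl) in partitions.items() if jobid in outl)

print(sorted(flagged_by_any(fdict)))
print(features_flagging(fdict, PREFIX + '185359'))
```

The second call returns only `cpu_time` for job `kern-6656-20190614-185359`, which is the false positive seen in the first pass.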

Now would be a good time to <a name="create-ref-model"></a>create a trained model based on the
set of jobs in the reference partition:

In [6]:
ref_jobs = parts['duration'][0]
ref_jobs

{'kern-6656-20190614-185359',
 'kern-6656-20190614-190245',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-200819'}

In [7]:
r = eq.create_refmodel(ref_jobs, tag='exp_name:linux_kernel;type:ref')

In [8]:
r['id'], r['tags']

(4, {'exp_name': 'linux_kernel', 'type': 'ref'})

We added a tag to help search for this trained/ref model later.
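
Note that the tag was passed as a `key:value;key:value` string and came back as a dictionary. Below is a minimal sketch of that mapping, assuming `;` separates pairs and the first `:` separates a key from its value. `parse_tag_string` is a hypothetical helper written for illustration; it is not EPMT's own parser, which may handle edge cases differently.

```python
def parse_tag_string(tag):
    """Turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'} (illustrative only)."""
    tags = {}
    for pair in tag.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, _, value = pair.partition(':')
        tags[key.strip()] = value.strip()
    return tags

print(parse_tag_string('exp_name:linux_kernel;type:ref'))
```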

In [9]:
# using the trained model is as simple as:
(df, _) = eod.detect_outlier_jobs(jobs, trained_model = r['id'])
df

| | jobid | duration | cpu_time | num_procs |
|---|---|---|---|---|
| 0 | kern-6656-20190614-185359 | 0 | 0 | 0 |
| 1 | kern-6656-20190614-190245 | 0 | 0 | 0 |
| 2 | kern-6656-20190614-191138 | 0 | 0 | 0 |
| 3 | kern-6656-20190614-192044-outlier | 1 | 1 | 0 |
| 4 | kern-6656-20190614-194024 | 0 | 0 | 0 |
| 5 | kern-6656-20190614-194953 | 0 | 0 | 0 |
| 6 | kern-6656-20190614-195909 | 0 | 0 | 0 |
| 7 | kern-6656-20190614-200819 | 0 | 0 | 0 |
| 8 | kern-6656-20190614-201744-outlier | 1 | 1 | 0 |


As expected, the jobs that were used to create the reference model are not
classified as outliers for any feature.

This marks the end of this case study. In a following study we will explore how
to detect outliers in individual operations and create a trained model for ops.

## <a name="rca-job"></a> Case Study 2 - Root Cause Analysis

In this study we will do an RCA with real data generated from GFDL PP runs.

```
$ ./epmt -v submit $(cat <<EOT
> sample/ppr-batch/1854/625151.tgz
> sample/ppr-batch/1859/627907.tgz
> sample/ppr-batch/1869/633114.tgz
> sample/ppr-batch/1864/629322.tgz
> sample/ppr-batch/1884/685001.tgz
> sample/ppr-batch/1874/675992.tgz
> sample/ppr-batch/1879/680163.tgz
> sample/ppr-batch/1889/691209.tgz
> sample/ppr-batch/1894/693129.tgz
> EOT
> )
```

All these jobs share the following tags: `{u'ocn_res': u'0.5l75', u'atm_res': u'c96l49', u'exp_component': u'ocean_annual_z_1x1deg', u'exp_name': u'ESM4_historical_D151'}`. They differ only in their values for `('exp_time', 'script_name')`.

If you are curious how we found these comparable jobs, here is the query:
```
>>> x = Job.select(lambda j: j.tags['exp_component'] == 'ocean_annual_z_1x1deg').filter(lambda j: j.tags['exp_name'] == 'ESM4_historical_D151')
>>> eq.get_jobs(x, fmt="terse")
```

In [10]:
jobs = eq.get_jobs(tag="exp_name:ESM4_historical_D151;exp_component:ocean_annual_z_1x1deg", fmt='terse')
jobs

['625151',
 '627907',
 '633114',
 '629322',
 '685001',
 '675992',
 '680163',
 '802938',
 '691209',
 '693129',
 '696110',
 '804266']

Now, let's partition the jobs by `cpu_time`: 

In [11]:
parts = eod.partition_jobs(jobs, features=['cpu_time'])
parts

{'cpu_time': ({'627907',
   '629322',
   '633114',
   '675992',
   '680163',
   '685001',
   '691209',
   '693129',
   '696110',
   '802938',
   '804266'},
  {'625151'})}

As you can see, the first partition contains 11 conformant jobs, and the outlier
partition contains a single job, `625151`. Let's see how the `cpu_time`
of this job differs from the rest:

In [12]:
jobs_df = eq.get_jobs(jobs, fmt='pandas', order='desc(j.cpu_time)')
display(jobs_df.columns.values)
jobs_df[['jobid', 'cpu_time', 'duration', 'num_procs']]

array(['PERF_COUNT_SW_CPU_CLOCK', 'account', 'all_proc_tags',
       'cancelled_write_bytes', 'cpu_time', 'created_at',
       'delayacct_blkio_time', 'duration', 'end', 'env_changes_dict',
       'env_dict', 'exitcode', 'guest_time', 'inblock', 'info_dict',
       'invol_ctxsw', 'jobid', 'jobname', 'jobscriptname', 'majflt',
       'minflt', 'num_procs', 'num_threads', 'outblock', 'processor',
       'queue', 'rchar', 'rdtsc_duration', 'read_bytes', 'rssmax',
       'sessionid', 'start', 'submit', 'syscr', 'syscw', 'systemtime',
       'tags', 'time_oncpu', 'time_waiting', 'timeslices', 'updated_at',
       'user', 'user+system', 'usertime', 'vol_ctxsw', 'wchar',
       'write_bytes'], dtype=object)

| | jobid | cpu_time | duration | num_procs |
|---|---|---|---|---|
| 0 | 625151 | 1224444629 | 10425623185 | 13530 |
| 1 | 627907 | 694594906 | 6589174875 | 4411 |
| 2 | 629322 | 622137956 | 7286331754 | 4411 |
| 3 | 802938 | 558475582 | 3986871458 | 4411 |
| 4 | 691209 | 536679993 | 860163243 | 4411 |
| 5 | 804266 | 535237219 | 3285839213 | 4411 |
| 6 | 693129 | 533686454 | 3619324767 | 4411 |
| 7 | 675992 | 520837207 | 9114150525 | 4411 |
| 8 | 685001 | 480310481 | 6815710476 | 4428 |
| 9 | 680163 | 462936089 | 6156192011 | 4411 |


Interestingly, the outlier has roughly three times as many processes
as the others. Let's see if we can uncover more.
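
The size of the gap can be checked with a little arithmetic on the rows displayed above. This is a plain-Python sketch using values copied from that table; comparing against the median of the other displayed jobs is a choice made here for illustration, not something EPMT prescribes.

```python
from statistics import median

# (jobid, cpu_time, num_procs) copied from the table above
rows = [
    ('625151', 1224444629, 13530),
    ('627907', 694594906, 4411),
    ('629322', 622137956, 4411),
    ('802938', 558475582, 4411),
    ('691209', 536679993, 4411),
    ('804266', 535237219, 4411),
    ('693129', 533686454, 4411),
    ('675992', 520837207, 4411),
    ('685001', 480310481, 4428),
    ('680163', 462936089, 4411),
]

outlier, rest = rows[0], rows[1:]

# Ratio of the outlier's value to the median of the remaining jobs
cpu_ratio = outlier[1] / median([r[1] for r in rest])
procs_ratio = outlier[2] / median([r[2] for r in rest])

print(f"cpu_time ratio:  {cpu_ratio:.2f}")
print(f"num_procs ratio: {procs_ratio:.2f}")
```

On these rows the outlier uses about 2.3 times the median `cpu_time` and about 3.1 times the median `num_procs`.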

In [13]:
(refs, outl) = parts['cpu_time']
(refs, outl)

({'627907',
  '629322',
  '633114',
  '675992',
  '680163',
  '685001',
  '691209',
  '693129',
  '696110',
  '802938',
  '804266'},
 {'625151'})

In [14]:
(_, df, flist) = eod.detect_rootcause(refs, '625151')
df

| | num_procs | cpu_time | duration |
|---|---|---|---|
| count | 11.0 | 11.0 | 11.0 |
| mean | 4415.636 | 525258000.0 | 5205965000.0 |
| std | 10.99339 | 85072300.0 | 2348081000.0 |
| min | 4411.0 | 375760800.0 | 860163200.0 |
| 25% | 4411.0 | 471623300.0 | 3567232000.0 |
| 50% | 4411.0 | 533686500.0 | 6036720000.0 |
| 75% | 4411.0 | 547577800.0 | 6702443000.0 |
| max | 4445.0 | 694594900.0 | 9114151000.0 |
| input | 13530.0 | 1224445000.0 | 10425620000.0 |
| ref_max_modified_z_score | inf | 2.0334 | 1.7033 |


The features are ordered from left to right by decreasing score (relative to the
reference jobs), so the most important feature comes first.

In [15]:
flist

[('num_procs', nan),
 ('cpu_time', 4.2928100718009246),
 ('duration', 0.8478835202254447)]
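
The `inf` and `nan` entries for `num_procs` have a plausible explanation. The sketch below assumes the scores follow the standard modified z-score, `0.6745 * |x - median| / MAD`, and that the value in `flist` is the input's score divided by `ref_max_modified_z_score`. Neither assumption is confirmed by this notebook. The `num_procs` values of the reference jobs (nine at 4411, one at 4428, one at 4445) are reconstructed from the count, mean and max rows above, and the handling of a zero MAD is a convention chosen for this example.

```python
import math
from statistics import median

def modified_z_scores(ref, values):
    """Modified z-scores of `values` against the reference sample `ref`.

    Assumed convention when MAD == 0: score 0 for values equal to the
    median, infinity otherwise.
    """
    med = median(ref)
    mad = median([abs(x - med) for x in ref])
    scores = []
    for v in values:
        dev = abs(v - med)
        if mad == 0:
            scores.append(0.0 if dev == 0 else math.inf)
        else:
            scores.append(0.6745 * dev / mad)
    return scores

def score_ratio(ref, value):
    """Return (largest reference score, input score / largest reference score).

    Raises ZeroDivisionError if every reference value is identical.
    """
    ref_max = max(modified_z_scores(ref, ref))
    (input_score,) = modified_z_scores(ref, [value])
    return ref_max, input_score / ref_max

# num_procs of the reference jobs, reconstructed from the summary rows (assumption)
num_procs_ref = [4411] * 9 + [4428, 4445]
print(score_ratio(num_procs_ref, 13530))

# Hypothetical sample with a nonzero MAD, for comparison
print(score_ratio([10, 12, 11, 13, 12], 30))
```

Because more than half of the reference jobs have exactly 4411 processes, the MAD is zero, so any job that deviates at all gets an infinite score. The reference maximum is then `inf`, and dividing the input's infinite score by it gives `nan`. Under these assumptions, a `nan` in `flist` next to an `inf` reference score signals a feature that is almost constant across the reference jobs, not a missing measurement.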

If you would like to expand the RCA to other `features`, do as below:

In [17]:
(_, df, flist) = eod.detect_rootcause(refs, '625151', features=[])
df

| | num_procs | wchar | syscw | rchar | rssmax | syscr | vol_ctxsw | num_threads | timeslices | minflt | ... | invol_ctxsw | read_bytes | inblock | cancelled_write_bytes | majflt | rdtsc_duration | delayacct_blkio_time | guest_time | exitcode | processor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | ... | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11 | 11 | 11 | 11 |
| mean | 4415.636 | 72728630000.0 | 4052543.727273 | 95688640000.0 | 69662600.0 | 12201015.636364 | 821240.090909 | 4680.909 | 846691.636364 | 7501751.636364 | ... | 20757.454545 | 9201505000.0 | 17971684.363636 | 3074751000.0 | 458.909091 | -7355076000000000.0 | 0 | 0 | 0 | 0 |
| std | 10.99339 | 84399.2 | 274.828343 | 2366415.0 | 238978.4 | 4829.967956 | 17789.60775 | 11.64006 | 25269.44119 | 190515.54583 | ... | 18702.991634 | 12464160000.0 | 24344060.897759 | 3940853000.0 | 1106.925332 | 1.739965e+16 | 0 | 0 | 0 | 0 |
| min | 4411.0 | 72728510000.0 | 4052375.0 | 95687200000.0 | 69274940.0 | 12198338.0 | 802024.0 | 4676.0 | 820315.0 | 7328295.0 | ... | 12130.0 | 3448832.0 | 6736.0 | 364437500.0 | 2.0 | -5.328561e+16 | 0 | 0 | 0 | 0 |
| 25% | 4411.0 | 72728560000.0 | 4052427.0 | 95687360000.0 | 69570080.0 | 12198804.5 | 812348.5 | 4676.0 | 829796.0 | 7367003.5 | ... | 13272.5 | 98834430.0 | 193036.0 | 2222774000.0 | 10.5 | 19640030000000.0 | 0 | 0 | 0 | 0 |
| 50% | 4411.0 | 72728630000.0 | 4052464.0 | 95687720000.0 | 69630370.0 | 12198986.0 | 813963.0 | 4676.0 | 836282.0 | 7514178.0 | ... | 14236.0 | 3769131000.0 | 7361584.0 | 2236223000.0 | 62.0 | 31099370000000.0 | 0 | 0 | 0 | 0 |
| 75% | 4411.0 | 72728710000.0 | 4052499.0 | 95688200000.0 | 69674250.0 | 12200396.5 | 822491.0 | 4676.0 | 856153.0 | 7545145.5 | ... | 18125.5 | 12680420000.0 | 24766436.0 | 2236244000.0 | 302.0 | 50864330000000.0 | 0 | 0 | 0 | 0 |
| max | 4445.0 | 72728730000.0 | 4053353.0 | 95694910000.0 | 70088920.0 | 12214610.0 | 862453.0 | 4712.0 | 895097.0 | 8003186.0 | ... | 76421.0 | 40003150000.0 | 78131160.0 | 14785980000.0 | 3750.0 | 71810590000000.0 | 0 | 0 | 0 | 0 |
| input | 13530.0 | 138800700000.0 | 4436801.0 | 98540650000.0 | 133660800.0 | 13084676.0 | 3150991.0 | 14551.0 | 3198404.0 | 20550722.0 | ... | 32844.0 | 12811070000.0 | 25021616.0 | 2331615000.0 | 80.0 | 100956100000000.0 | 0 | 0 | 0 | 0 |
| ref_max_modified_z_score | inf | 0.8902 | 16.2062 | 12.9519 | 3.7394 | 28.0276 | 9.516 | inf | 5.3036 | 2.3773 | ... | 32.3142 | 6.4902 | 6.4902 | 344434.0 | 46.0659 | 1897.052 | 0 | 0 | 0 | 0 |


As you can see, the `modified_z_score_ratio` for `rssmax` is 139. This shows that the memory
footprint of the outlier is also much larger than that of the reference jobs: its `rssmax`
is about 1.34e8, against a reference maximum of about 7.0e7.

In our next study we will attempt to determine which `operations` from within the jobs were outliers.