# Outlier Detection

This notebook will walk you through a workflow for outlier detection.

## TOC

 * [Detecting outlier jobs](#detect-outlier-jobs-study-1)
    * [Partition jobs](#partition-jobs)
    * [Create a trained model](#create-ref-model)
 * [Root Cause Analysis (RCA)](#rca-job)
 * [Detecting outlier operations](#detect-outlier-ops)
    * [RCA for operations](#rca-ops)
 * [Notes](#notes)

## <a name="detect-outlier-jobs-study-1">Detecting outlier jobs</a>

In this example we will use synthetic data. The data was generated using:
```
sample/outlier/workload.sh
```

The script compiles the Linux kernel several times. For some of the compiles
it adds a background workload, so the compile takes longer. Each slower job is marked
with an 'outlier' suffix. However, we will pretend we don't know which jobs are
outliers and try to find them.

Along the way, we will also learn how to create a trained model, and use it 
for future outlier detection.


### Requirements

You will need to import the data in `sample/outlier/*.tgz`:
```
$ ./epmt -v submit sample/outlier/*.tgz
```

In [1]:
import epmt_query as eq
import epmt_outliers as eod

INFO:epmt_job:Binding to DB: {'create_db': True, 'filename': 'database.sqlite', 'provider': 'sqlite'}
INFO:epmt_job:Generating mapping from schema...


In [2]:
jobs = eq.get_jobs(tag='exp_name:linux_kernel', fmt='terse')
jobs

['kern-6656-20190614-201744-outlier',
 'kern-6656-20190614-200819',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-185359',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-192044-outlier',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-190245']

In [3]:
# As a first pass let's see whether the outliers can be auto-detected
(df, fdict) = eod.detect_outlier_jobs(jobs)
df

Unnamed: 0,jobid,duration,cpu_time,num_procs
0,kern-6656-20190614-185359,0,1,0
1,kern-6656-20190614-190245,0,0,0
2,kern-6656-20190614-191138,0,0,0
3,kern-6656-20190614-192044-outlier,1,1,0
4,kern-6656-20190614-194024,0,0,0
5,kern-6656-20190614-194953,0,0,0
6,kern-6656-20190614-195909,0,0,0
7,kern-6656-20190614-200819,0,0,0
8,kern-6656-20190614-201744-outlier,1,1,0


As you can see, we caught both outliers, but there is also a false positive:
the non-outlier job `kern-6656-20190614-185359` is flagged on `cpu_time`.
The outliers are marked with a 1 for `duration` and `cpu_time` but not for `num_procs`
because the background compute process increased the job duration
but not the number of sub-processes of our workload.

### <a name="partition-jobs">Partitioning the jobs set into two sets</a>

`fdict`, obtained above, is a dictionary that maps each feature
(`cpu_time`, `duration` or `num_procs`) to a pair of sets: the first is the set of
"reference" jobs and the second is the set of "outlier" jobs as detected by
that particular feature.

In [4]:
fdict

{'cpu_time': ({'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-185359',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-201744-outlier'}),
 'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'}),
 'num_procs': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-192044-outlier',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819',
   'kern-6656-20190614-201744-outlier'},
  set())}

The same partitioning can be obtained more directly with the following call:

In [5]:
parts = eod.partition_jobs(jobs, features=['duration'])
parts

{'duration': ({'kern-6656-20190614-185359',
   'kern-6656-20190614-190245',
   'kern-6656-20190614-191138',
   'kern-6656-20190614-194024',
   'kern-6656-20190614-194953',
   'kern-6656-20190614-195909',
   'kern-6656-20190614-200819'},
  {'kern-6656-20190614-192044-outlier', 'kern-6656-20190614-201744-outlier'})}

Above, we partitioned the jobs on a single feature, `duration`.
You could also specify multiple features, such as `['duration', 'cpu_time']`.
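
Because each value in such a dictionary is a `(reference, outlier)` pair of plain Python sets, the per-feature results can be combined with ordinary set operations. The sketch below hard-codes the outlier sets from the `fdict` output shown earlier, so it runs without EPMT. The variable names are illustrative only.

```python
# Outlier sets copied from the fdict output above; in a live session you
# would read them from fdict[feature][1] instead of using these literals.
outliers_by_feature = {
    "cpu_time": {
        "kern-6656-20190614-185359",
        "kern-6656-20190614-192044-outlier",
        "kern-6656-20190614-201744-outlier",
    },
    "duration": {
        "kern-6656-20190614-192044-outlier",
        "kern-6656-20190614-201744-outlier",
    },
    "num_procs": set(),
}

# Jobs flagged by at least one feature.
flagged_by_any = set().union(*outliers_by_feature.values())

# Jobs flagged by both duration and cpu_time: a stricter criterion that
# drops the job flagged on cpu_time alone.
flagged_by_both = outliers_by_feature["duration"] & outliers_by_feature["cpu_time"]

print(sorted(flagged_by_any))
print(sorted(flagged_by_both))
```

Requiring agreement between `duration` and `cpu_time` leaves only the two jobs carrying the 'outlier' suffix.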

### <a name="create-ref-model">Creating a trained (reference) model</a>

If we have a set of reference jobs, we can create a "trained" model. Subsequently we can do
outlier detection using the trained model. Let's use the reference set we obtained above:

In [6]:
ref_jobs = parts['duration'][0]
ref_jobs

{'kern-6656-20190614-185359',
 'kern-6656-20190614-190245',
 'kern-6656-20190614-191138',
 'kern-6656-20190614-194024',
 'kern-6656-20190614-194953',
 'kern-6656-20190614-195909',
 'kern-6656-20190614-200819'}

When creating the trained model, we can specify a `tag` of our choosing. This `tag` can
be used later to retrieve the trained model from the database.

In [7]:
r = eq.create_refmodel(ref_jobs, tag='exp_name:linux_kernel;type:ref')

In [8]:
r['id'], r['tags']

(4, {'exp_name': 'linux_kernel', 'type': 'ref'})
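
The tag was passed as the string `'exp_name:linux_kernel;type:ref'` and came back as a dictionary. A minimal sketch of that mapping is shown below, assuming `;` separates pairs and the first `:` separates a key from its value. `parse_tag` and `tags_match` are hypothetical helpers written for this illustration, not EPMT functions, and the subset-matching rule is an assumption about how tag queries select jobs.

```python
def parse_tag(tag):
    """Turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    result = {}
    for pair in tag.split(";"):
        if not pair:
            continue  # tolerate an empty segment, e.g. a trailing ';'
        key, _, value = pair.partition(":")  # split on the first ':' only
        result[key.strip()] = value.strip()
    return result


def tags_match(job_tags, wanted):
    """True if every wanted key/value pair is present in job_tags (assumed rule)."""
    return all(job_tags.get(k) == v for k, v in wanted.items())


wanted = parse_tag("exp_name:linux_kernel;type:ref")
print(wanted)  # {'exp_name': 'linux_kernel', 'type': 'ref'}
print(tags_match({"exp_name": "linux_kernel", "type": "ref", "extra": "x"}, wanted))
```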

In [9]:
# using the trained model is as simple as:
(df, _) = eod.detect_outlier_jobs(jobs, trained_model = r['id'])
df

Unnamed: 0,jobid,duration,cpu_time,num_procs
0,kern-6656-20190614-185359,0,0,0
1,kern-6656-20190614-190245,0,0,0
2,kern-6656-20190614-191138,0,0,0
3,kern-6656-20190614-192044-outlier,1,1,0
4,kern-6656-20190614-194024,0,0,0
5,kern-6656-20190614-194953,0,0,0
6,kern-6656-20190614-195909,0,0,0
7,kern-6656-20190614-200819,0,0,0
8,kern-6656-20190614-201744-outlier,1,1,0


Naturally, the jobs that were used to create the reference model will not be
classified as outliers for any feature.
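
One way to see why reference jobs can never be flagged is to picture a trained model that stores, per feature, the center and spread of the reference values plus the largest score observed among them, and flags only values that score higher. The class below is a toy stand-in under that assumption. `RefModel` is not EPMT's implementation, and the durations are made-up numbers for illustration.

```python
from statistics import median


class RefModel:
    """Toy trained model for a single feature (illustrative, not EPMT's class)."""

    def __init__(self, ref_values):
        self.center = median(ref_values)
        abs_dev = [abs(v - self.center) for v in ref_values]
        # Spread: median absolute deviation, or the mean absolute deviation if that is 0.
        self.scale = median(abs_dev) or sum(abs_dev) / len(abs_dev)
        # Threshold: the largest score seen among the reference values themselves.
        self.threshold = max(self.score(v) for v in ref_values)

    def score(self, value):
        if self.scale == 0:  # all reference values are identical
            return 0.0 if value == self.center else float("inf")
        return 0.6745 * abs(value - self.center) / self.scale

    def is_outlier(self, value):
        return self.score(value) > self.threshold


# Hypothetical job durations in seconds (not taken from the sample data).
ref_durations = [512.0, 498.0, 505.0, 520.0, 501.0, 509.0, 515.0]
model = RefModel(ref_durations)

print([model.is_outlier(d) for d in ref_durations])  # reference jobs
print(model.is_outlier(840.0))                       # a much slower job
```

Since the threshold is the maximum reference score, no reference value can exceed it, which matches the behaviour described above.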

## <a name="rca-job">Root Cause Analysis (RCA)</a>

In this study we will do an RCA with real data generated from GFDL PP runs.


### Requirements
Please import the following data:

```
$ ./epmt -v submit $(cat <<EOT
sample/ppr-batch/1854/625151.tgz
sample/ppr-batch/1859/627907.tgz
sample/ppr-batch/1869/633114.tgz
sample/ppr-batch/1864/629322.tgz
sample/ppr-batch/1884/685001.tgz
sample/ppr-batch/1874/675992.tgz
sample/ppr-batch/1879/680163.tgz
sample/ppr-batch/1889/691209.tgz
sample/ppr-batch/1894/693129.tgz
EOT
)
```

All these jobs share the following tags: `{u'ocn_res': u'0.5l75', u'atm_res': u'c96l49', u'exp_component': u'ocean_annual_z_1x1deg', u'exp_name': u'ESM4_historical_D151'}`. They differ only in their values for `exp_time` and `script_name`.

<!--
<details>
  <summary>Advanced query</summary>
  If you are curious how we found these comparable jobs, here is the query (ADVANCED TOPIC):
  ```
>>> x = eq.Job.select(lambda j: j.tags['exp_component'] == 'ocean_annual_z_1x1deg').filter(lambda j: j.tags['exp_name'] == 'ESM4_historical_D151')
>>> eq.get_jobs(x, fmt="terse")
  ```
</details>
-->

In [10]:
jobs = eq.get_jobs(tag="exp_name:ESM4_historical_D151;exp_component:ocean_annual_z_1x1deg", fmt='terse')
jobs

['693129',
 '691209',
 '680163',
 '675992',
 '685001',
 '629322',
 '633114',
 '627907',
 '625151']

Now, let's partition the jobs by `cpu_time`: 

In [11]:
parts = eod.partition_jobs(jobs, features=['cpu_time'])
parts

{'cpu_time': ({'627907',
   '629322',
   '633114',
   '675992',
   '680163',
   '685001',
   '691209',
   '693129'},
  {'625151'})}

As you can see, the first partition contains the conformant jobs, and the outlier
partition contains a single job, `625151`. Let's see how the `cpu_time`
of this job differs from the rest:

In [12]:
jobs_df = eq.get_jobs(jobs, fmt='pandas', order='desc(j.cpu_time)')
display(jobs_df.columns.values)
jobs_df[['jobid', 'cpu_time', 'duration', 'num_procs']]

array(['PERF_COUNT_SW_CPU_CLOCK', 'account', 'all_proc_tags',
       'cancelled_write_bytes', 'cpu_time', 'created_at',
       'delayacct_blkio_time', 'duration', 'end', 'env_changes_dict',
       'env_dict', 'exitcode', 'guest_time', 'inblock', 'info_dict',
       'invol_ctxsw', 'jobid', 'jobname', 'jobscriptname', 'majflt',
       'minflt', 'num_procs', 'num_threads', 'outblock', 'processor',
       'queue', 'rchar', 'rdtsc_duration', 'read_bytes', 'rssmax',
       'sessionid', 'start', 'submit', 'syscr', 'syscw', 'systemtime',
       'tags', 'time_oncpu', 'time_waiting', 'timeslices', 'updated_at',
       'user', 'user+system', 'usertime', 'vol_ctxsw', 'wchar',
       'write_bytes'], dtype=object)

Unnamed: 0,jobid,cpu_time,duration,num_procs
0,625151,1224444629,10425623185,13530
1,627907,694594906,6589174875,4411
2,629322,622137956,7286331754,4411
3,691209,536679993,860163243,4411
4,693129,533686454,3619324767,4411
5,675992,520837207,9114150525,4411
6,685001,480310481,6815710476,4428
7,680163,462936089,6156192011,4411
8,633114,457180929,6036720046,4445


Interestingly, the outlier has roughly three times as many processes
as the others. Let's see if we can uncover more.
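
To put numbers on that comparison, the sketch below divides each of the outlier's values by the median of the other eight jobs. The figures are copied from the table above; the choice of the median as the baseline is ours, made only for illustration.

```python
from statistics import median

# (cpu_time, duration, num_procs) per job, copied from the table above.
jobs = {
    "625151": (1224444629, 10425623185, 13530),
    "627907": (694594906, 6589174875, 4411),
    "629322": (622137956, 7286331754, 4411),
    "691209": (536679993, 860163243, 4411),
    "693129": (533686454, 3619324767, 4411),
    "675992": (520837207, 9114150525, 4411),
    "685001": (480310481, 6815710476, 4428),
    "680163": (462936089, 6156192011, 4411),
    "633114": (457180929, 6036720046, 4445),
}
outlier = "625151"
columns = ("cpu_time", "duration", "num_procs")

ratios = {}
for i, col in enumerate(columns):
    # Median of this column over every job except the outlier.
    ref_median = median(v[i] for jid, v in jobs.items() if jid != outlier)
    ratios[col] = jobs[outlier][i] / ref_median

for col, ratio in ratios.items():
    print(f"{col}: {ratio:.2f}x the median of the other jobs")
```

The process count stands out most (about 3.1x), followed by `cpu_time` (about 2.3x) and `duration` (about 1.6x).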

In [13]:
(refs, outl) = parts['cpu_time']
(refs, outl)

({'627907',
  '629322',
  '633114',
  '675992',
  '680163',
  '685001',
  '691209',
  '693129'},
 {'625151'})

In [14]:
(_, df, flist) = eod.detect_rootcause(refs, '625151')
df

Unnamed: 0,num_procs,cpu_time,duration
count,8.0,8.0,8.0
mean,4417.375,538545500.0,5809721000.0
std,12.648405,82295710.0,2512447000.0
min,4411.0,457180900.0,860163200.0
25%,4411.0,475966900.0,5432371000.0
50%,4411.0,527261800.0,6372683000.0
75%,4415.25,558044500.0,6933366000.0
max,4445.0,694594900.0,9114151000.0
input,13530.0,1224445000.0,10425620000.0
ref_max_modified_z_score,3.5973,2.0286,5.4813


The features (columns) are ordered from left to right by decreasing
`modified_z_score_ratio`, that is, in decreasing order of importance.

In [15]:
flist

[('num_procs', 268.2083785061018),
 ('cpu_time', 4.166370896184561),
 ('duration', 0.735227044679182)]
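
The scores in the two outputs above are numerically consistent with a modified z-score, 0.6745 · |x − median| / MAD, computed against the reference jobs, with each feature's importance being the outlier's score divided by `ref_max_modified_z_score`. The sketch below reproduces the ranking from the job-level values printed earlier. The formula, the fallback to the mean absolute deviation when the MAD is zero, and the helper name `modified_z` are inferred from the printed numbers rather than taken from the `epmt_outliers` source, so treat this as an illustration only.

```python
from statistics import median


def modified_z(ref, x):
    """Return (score of x, max score among ref), relative to ref's median and spread."""
    med = median(ref)
    abs_dev = [abs(v - med) for v in ref]
    scale = median(abs_dev)                  # median absolute deviation (MAD)
    if scale == 0:                           # e.g. num_procs, where most values are 4411
        scale = sum(abs_dev) / len(abs_dev)  # assumed fallback: mean absolute deviation
    if scale == 0:                           # all reference values identical
        return 0.0, 0.0
    k = 0.6745
    return k * abs(x - med) / scale, k * max(abs_dev) / scale


# Reference jobs (all except 625151), values copied from the jobs table above.
refs = {
    "num_procs": [4411, 4411, 4411, 4411, 4411, 4428, 4411, 4445],
    "cpu_time": [694594906, 622137956, 536679993, 533686454,
                 520837207, 480310481, 462936089, 457180929],
    "duration": [6589174875, 7286331754, 860163243, 3619324767,
                 9114150525, 6815710476, 6156192011, 6036720046],
}
outlier = {"num_procs": 13530, "cpu_time": 1224444629, "duration": 10425623185}

ranking = []
for feature, values in refs.items():
    score, ref_max = modified_z(values, outlier[feature])
    ranking.append((feature, score / ref_max, ref_max))
ranking.sort(key=lambda item: item[1], reverse=True)

for feature, ratio, ref_max in ranking:
    print(f"{feature}: ratio={ratio:.4f} ref_max_modified_z_score={ref_max:.4f}")
```

The ordering and the `ref_max_modified_z_score` values agree with the outputs above; the ratios differ from `flist` only in the later decimal places, presumably because of intermediate rounding.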

To expand the RCA to the full list of available features, pass `features=[]`:

In [16]:
(_, df, flist) = eod.detect_rootcause(refs, '625151', features=[])
df

Unnamed: 0,wchar,syscw,rchar,num_threads,num_procs,rssmax,syscr,vol_ctxsw,timeslices,minflt,...,invol_ctxsw,read_bytes,inblock,cancelled_write_bytes,majflt,rdtsc_duration,exitcode,delayacct_blkio_time,guest_time,processor
count,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,...,8.0,8.0,8.0,8.0,8.0,8.0,8,8,8,8
mean,72728610000.0,4052585.5,95688900000.0,4682.75,4417.375,69628990.0,12201808.0,823488.875,850609.75,7527056.375,...,22424.625,9678984000.0,18904257.0,2056820000.0,558.375,-1.012455e+16,0,0,0,0
std,83074.45,317.118725,2773749.0,13.392429,12.648405,227732.4,5538.602738,20708.215233,28892.369866,212811.545768,...,21975.570631,14398320000.0,28121705.822046,504245700.0,1298.907226,2.00089e+16,0,0,0,0
min,72728510000.0,4052375.0,95687200000.0,4676.0,4411.0,69274940.0,12198338.0,802024.0,820315.0,7328295.0,...,12130.0,3448832.0,6736.0,808894500.0,2.0,-5.328561e+16,0,0,0,0
25%,72728540000.0,4052453.0,95687350000.0,4676.0,4411.0,69570190.0,12198807.75,811829.75,829806.25,7372435.0,...,13439.75,144743400.0,282702.0,2234230000.0,11.75,-6987199000000000.0,0,0,0,0
50%,72728600000.0,4052482.5,95687510000.0,4676.0,4411.0,69629960.0,12199156.0,813721.0,838391.5,7527901.5,...,14215.5,1980019000.0,3867224.0,2236223000.0,45.0,39846340000000.0,0,0,0,0
75%,72728670000.0,4052528.75,95688910000.0,4680.5,4415.25,69654830.0,12202047.25,832198.25,871352.0,7543403.25,...,16518.5,14610610000.0,28536334.0,2236244000.0,220.5,54265780000000.0,0,0,0,0
max,72728730000.0,4053353.0,95694910000.0,4712.0,4445.0,70079200.0,12214610.0,862453.0,895097.0,8003186.0,...,76421.0,40003150000.0,78131160.0,2236248000.0,3750.0,71810590000000.0,0,0,0,0
input,138800700000.0,4436801.0,98540650000.0,14551.0,13530.0,133660800.0,13084676.0,3150991.0,3198404.0,20550722.0,...,32844.0,12811070000.0,25021616.0,2331615000.0,80.0,100956100000000.0,0,0,0,0
ref_max_modified_z_score,1.223,13.9798,21.2727,3.5973,3.5973,5.644,17.7274,4.2322,2.765,3.7019,...,38.7778,12.987,12.987,47008.47,62.4756,1580.507,0,0,0,0


Note that the metrics are aggregated over all the threads and processes of a job.

In our next study we will attempt to determine which `operations` from within the jobs were outliers.

## <a name="detect-outlier-ops">Detecting outlier operations</a>

This study shows how we can find outlier operations across a set of jobs.

This study has the same requirements as the [previous study](#rca-job), so please review it first.

In [17]:
jobs = eq.get_jobs(tag="exp_name:ESM4_historical_D151;exp_component:ocean_annual_z_1x1deg", fmt='terse')

In [18]:
len(jobs)

9

In [19]:
# widen the column display width to show the full tag
import pandas as pd
pd.set_option('display.max_colwidth', 200)
(df, parts, scores_df, sorted_tags, sorted_features) = eod.detect_outlier_ops(jobs)
df.head(20)

Unnamed: 0,jobid,tags,duration,cpu_time,num_procs
0,625151,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,0,0
1,627907,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,0,0
2,629322,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",1,0,0
3,633114,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",1,0,0
4,675992,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",1,0,0
5,680163,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,0,0
6,685001,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,1,0
7,691209,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,0,0
8,693129,"{'op': 'hsmget', 'op_sequence': '21', 'op_instance': '6'}",0,0,0
9,625151,"{'op': 'hsmget', 'op_sequence': '19', 'op_instance': '7'}",0,0,0


In [20]:
len(sorted_tags)

416

In [21]:
sorted_tags[:5]

[{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '21'},
 {'op': 'hsmget', 'op_instance': '7', 'op_sequence': '19'},
 {'op': 'hsmget', 'op_instance': '6', 'op_sequence': '18'},
 {'op': 'hsmget', 'op_instance': '7', 'op_sequence': '13'},
 {'op': 'mv', 'op_instance': '10', 'op_sequence': '60'}]

As you can see, there are 416 unique operations. Since we want to focus our attention
on the operations that have the highest deviation across jobs, `detect_outlier_ops`
helps us by ordering the dataframe `df` by decreasing operation (tag)
importance. The importance of a tag is the maximum of its scores
across all jobs and all features. `scores_df` and `sorted_tags`
are similarly ordered by decreasing tag importance.

In [22]:
scores_df.head()

Unnamed: 0,tags,duration,cpu_time,num_procs
0,"{""op"": ""hsmget"", ""op_instance"": ""6"", ""op_sequence"": ""21""}",646.428,0.026,0
1,"{""op"": ""hsmget"", ""op_instance"": ""7"", ""op_sequence"": ""19""}",463.38,3.166,0
2,"{""op"": ""hsmget"", ""op_instance"": ""6"", ""op_sequence"": ""18""}",362.595,0.0,0
3,"{""op"": ""hsmget"", ""op_instance"": ""7"", ""op_sequence"": ""13""}",229.944,0.0,0
4,"{""op"": ""mv"", ""op_instance"": ""10"", ""op_sequence"": ""60""}",160.865,0.0,0


`detect_outlier_ops` thus already helps with RCA by ordering the output
by decreasing tag importance. It goes further by returning an ordered
`features` list, in decreasing order of feature importance. Here the
importance of a `feature` is determined by summing the scores of the
feature across all tags.

In [23]:
sorted_features

['duration', 'cpu_time', 'num_procs']
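
The two ordering rules described above (maximum score per tag, summed score per feature) can be sketched with plain Python. The rows below are the five shown by `scores_df.head()`, deliberately shuffled; the real call works on all 416 tags, so the feature totals here are partial and serve only to illustrate the rule.

```python
# (tag, scores) rows copied from scores_df.head(), in shuffled order.
rows = [
    ({"op": "mv", "op_instance": "10", "op_sequence": "60"},
     {"duration": 160.865, "cpu_time": 0.0, "num_procs": 0}),
    ({"op": "hsmget", "op_instance": "7", "op_sequence": "19"},
     {"duration": 463.38, "cpu_time": 3.166, "num_procs": 0}),
    ({"op": "hsmget", "op_instance": "6", "op_sequence": "21"},
     {"duration": 646.428, "cpu_time": 0.026, "num_procs": 0}),
    ({"op": "hsmget", "op_instance": "7", "op_sequence": "13"},
     {"duration": 229.944, "cpu_time": 0.0, "num_procs": 0}),
    ({"op": "hsmget", "op_instance": "6", "op_sequence": "18"},
     {"duration": 362.595, "cpu_time": 0.0, "num_procs": 0}),
]

# Tag importance: the largest score over all features for that tag.
by_tag = sorted(rows, key=lambda row: max(row[1].values()), reverse=True)
sorted_tags = [tag for tag, _ in by_tag]

# Feature importance: the sum of that feature's scores over all tags.
features = ["num_procs", "cpu_time", "duration"]
totals = {f: sum(scores[f] for _, scores in rows) for f in features}
sorted_features = sorted(features, key=totals.get, reverse=True)

print(sorted_tags[0])
print(sorted_features)
```

On these five rows the sketch recovers the same tag order as `sorted_tags[:5]` and the same feature order as `sorted_features`.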

<a name="rca-ops"></a>Once you have the ordered list of tags (operations), you may want to do
further RCA for a specific operation. For RCA at the operation level,
we specify a *reference set* of jobs and an outlier job (as we did for RCA at the job level).
In addition, we specify the operation of interest. The goal of the op-RCA is to rank the
features in order of importance.

Let's suppose we care about the operation `{"op": "ncatted", "op_instance": "3", "op_sequence": "32"}`.
In `df`, job `629322` was one of the outliers for this operation on both the `duration` and
`cpu_time` features.

In [24]:
# first derive the list of jobs other than the outlier
refjobs = eq.get_jobs(tag="exp_name:ESM4_historical_D151;exp_component:ocean_annual_z_1x1deg", fltr='j.jobid != "629322"', fmt='terse')
refjobs

['693129',
 '691209',
 '680163',
 '675992',
 '685001',
 '633114',
 '627907',
 '625151']

In [25]:
(ret, df_rca, feature_scores) = eod.detect_rootcause_op(refjobs, '629322', {"op": "ncatted", "op_instance": "3", "op_sequence": "32"})
ret

True

In [26]:
df_rca

Unnamed: 0,duration,cpu_time,num_procs
count,7.0,7.0,7
mean,1188.428571,24566.571429,1
std,114.402589,4197.045503,0
min,1109.0,21995.0,1
25%,1120.0,22995.0,1
50%,1149.0,22995.0,1
75%,1196.5,23496.0,1
max,1428.0,33994.0,1
input,1735.0,29994.0,1
ref_max_modified_z_score,4.9522,7418.8255,0


In [27]:
# to get the full feature list, just pass features=[] to the same function
(ret, df_rca, feature_scores) = eod.detect_rootcause_op(refjobs, '629322', {"op": "ncatted", "op_instance": "3", "op_sequence": "32"}, features=[])
ret

True

In [28]:
df_rca

Unnamed: 0,read_bytes,majflt,inblock,timeslices,vol_ctxsw,rdtsc_duration,minflt,duration,systemtime,cpu_time,...,syscr,delayacct_blkio_time,write_bytes,syscw,numtids,outblock,guest_time,cancelled_write_bytes,processor,rssmax
count,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,...,7.0,7,7,7,7,7,7,7,7,7.0
mean,0.0,0.0,0.0,22.142857,17.714286,4033804.142857,2524.857143,1188.428571,7141.142857,24566.571429,...,281.571429,0,4096,3,1,8,0,0,0,7984.571429
std,0.0,0.0,0.0,1.345185,1.889822,273511.549812,27.431299,114.402589,1463.476048,4197.045503,...,1.511858,0,0,0,0,0,0,0,0,36.836835
min,0.0,0.0,0.0,21.0,17.0,3813848.0,2508.0,1109.0,4999.0,21995.0,...,281.0,0,4096,3,1,8,0,0,0,7968.0
25%,0.0,0.0,0.0,21.5,17.0,3854039.0,2509.0,1120.0,6498.5,22995.0,...,281.0,0,4096,3,1,8,0,0,0,7970.0
50%,0.0,0.0,0.0,22.0,17.0,3938599.0,2509.0,1149.0,6998.0,22995.0,...,281.0,0,4096,3,1,8,0,0,0,7972.0
75%,0.0,0.0,0.0,22.0,17.0,4113005.0,2533.5,1196.5,7998.0,23496.0,...,281.0,0,4096,3,1,8,0,0,0,7972.0
max,0.0,0.0,0.0,25.0,22.0,4550094.0,2572.0,1428.0,8998.0,33994.0,...,285.0,0,4096,3,1,8,0,0,0,8068.0
input,49152.0,1.0,96.0,140.0,136.0,5978859.0,2644.0,1735.0,9998.0,29994.0,...,281.0,0,4096,3,1,8,0,0,0,7972.0
ref_max_modified_z_score,0.0,0.0,0.0,2.8329,4.7215,3.5837,42.4935,4.9522,1.3504,7418.8255,...,4.7215,0,0,0,0,0,0,0,0,4.3583


## <a name="notes">Notes and Suggestions</a>

 * When we run `detect_outlier_ops` we might also want to know the top jobs (ranked) by tag.
   This would help with subsequent RCA.
 * The columns of `df` and `scores_df` should be ordered in the same order as `sorted_features`.
