# EPMT Query API

This workbook illustrates the usage of the EPMT Query API.


## Requirements

This workbook relies on importing data as follows:

```
./epmt -v submit $(cat <<EOF
sample/ppr-batch/1864/629337.tgz
sample/ppr-batch/1854/625172.tgz
sample/ppr-batch/1879/680181.tgz
sample/ppr-batch/1859/627922.tgz
sample/ppr-batch/1899/696127.tgz
sample/ppr-batch/1869/633144.tgz
sample/ppr-batch/1874/676007.tgz
sample/ppr-batch/1884/685016.tgz
sample/ppr-batch/1889/692544.tgz
sample/ppr-batch/1904/802954.tgz
sample/ppr-batch/1894/693147.tgz
sample/ppr-batch/1909/804285.tgz
EOF
)
```

## Table of Contents

 * [Job Query](#job-query)
   * [output formatting](#output-formats)
   * [working with ORM objects](#orm-objects)
   * [job tags](#job-tags)
   * [failed jobs](#failed-jobs)
   * [process sums](#proc-sums-field)
 * [Process Query](#process-query)
   * [process tags](#process-tags)
   * [filter and ordering](#filter-processes)
   * [thread metrics aggregation](#thread-metrics-aggregation)
 * [Operations](#ops)
   * [select op processes](#select-op-procs)
   * [aggregate cpu time for operation](#cumulative-cpu-time-for-op)
   * [select top executables by cpu time](#op-top-exe)
   * [op_metrics to aggregate operation counts](#op-metrics)
   * [data-movement v. useful work](#dm-ops)
   * [op_metrics grouped by tag](#group-by-tag)
   * [cpu-time v. duration](#cpu-time-v-duration)
 * [Thread Query](#thread-query)
 * [Logging and SQL Debug](#logging-debug)
 * [Useful Queries](#useful-queries)
   * [process tree walk](#process-tree-walk)
   * [failed processes](#failed-procs)
   * [all process tags for job](#job-proc-tags)
   * [root process](#root-process)
   * [timeline](#timeline)
 * [Useful Attributes of Job/Process/Threads](#useful-attributes)
 * [Case Study - Linux Kernel](#linux-kernel)


In [1]:
# import the query api module
import epmt_query as eq

INFO:epmt_job:Binding to DB: {'host': 'localhost', 'provider': 'postgres', 'password': 'example', 'user': 'postgres', 'dbname': 'EPMT'}
INFO:epmt_job:Generating mapping from schema...


The API has a few queries -- `get_jobs`, `get_procs` and `get_thread_metrics` -- that you will be using frequently.

Each of these operates at a distinct level: job, process and thread.

### <a name="job-query">Job Query</a>

The job query usually takes a `tag` and returns a collection of jobs in the format specified by `fmt`.
The returned list can be pruned and/or ordered using `fltr`, `limit` and `order`.

You can also pass in one or more jobs as a `jobs` parameter, most often for format conversion.

Let's get started!

In [2]:
# let's get jobs, we use the job tag to select the jobs
jobs = eq.get_jobs(tag='exp_name:ESM4_historical_D151;exp_component:ocean_month_rho2_1x1deg',fmt='terse')
jobs

['804285',
 '802954',
 '696127',
 '693147',
 '692544',
 '685016',
 '680181',
 '676007',
 '633144',
 '629337',
 '627922',
 '625172']

<a name="output-formats"></a>`fmt` can take one of the following values:
 * `terse` -- returns a list of job ids
 * `pandas` -- returns a pandas dataframe
 * `dict` -- returns a list of python dictionaries
 * `orm` -- returns an ORM object, for maximum flexibility and the fastest queries

In [3]:
# above we got a list of job ids. sometimes we want to see more details
# than just the job id. We can use `conv_jobs` to convert between formats
jobs_df = eq.conv_jobs(jobs, fmt='pandas')
display(jobs_df.columns.values)
jobs_df

array(['PERF_COUNT_SW_CPU_CLOCK', 'account', 'all_proc_tags',
       'cancelled_write_bytes', 'cpu_time', 'created_at',
       'delayacct_blkio_time', 'duration', 'end', 'env_changes_dict',
       'env_dict', 'exitcode', 'guest_time', 'inblock', 'info_dict',
       'invol_ctxsw', 'jobid', 'jobname', 'jobscriptname', 'majflt',
       'minflt', 'num_procs', 'num_threads', 'outblock', 'processor',
       'queue', 'rchar', 'rdtsc_duration', 'read_bytes', 'rssmax',
       'sessionid', 'start', 'submit', 'syscr', 'syscw', 'systemtime',
       'tags', 'time_oncpu', 'time_waiting', 'timeslices', 'updated_at',
       'user', 'user+system', 'usertime', 'vol_ctxsw', 'wchar',
       'write_bytes'], dtype=object)

Unnamed: 0,PERF_COUNT_SW_CPU_CLOCK,account,all_proc_tags,cancelled_write_bytes,cpu_time,created_at,delayacct_blkio_time,duration,end,env_changes_dict,...,time_oncpu,time_waiting,timeslices,updated_at,user,user+system,usertime,vol_ctxsw,wchar,write_bytes
0,866184046169,,"[{'op': 'untar', 'op_instance': '3', 'op_seque...",2448543744,959968389,2019-06-26 16:01:00.691579,0,12630660818,2019-06-09 22:23:53.234877,{},...,967130737639,48105200287,3175850,2019-06-26 16:01:00.691582,Jeffrey.Durachta,959968389,769390290,3132299,138638409056,134643470336
1,649725614194,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",3044032512,679701225,2019-06-26 16:36:44.851065,0,6532173945,2019-06-10 08:12:06.562689,{},...,681730630413,13802065001,824685,2019-06-26 16:36:44.851070,Jeffrey.Durachta,679701225,428874880,809138,74236894004,72541306880
2,576733745205,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",1998835712,623964730,2019-06-26 17:11:27.526872,0,6696039124,2019-06-10 11:50:58.082917,{},...,625961524504,19476877777,805888,2019-06-26 17:11:27.526876,Jeffrey.Durachta,623964730,478902294,792701,74236893737,70252703744
3,582075035986,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",1385582592,621768978,2019-06-26 17:47:26.478671,0,6625637678,2019-06-10 18:39:32.439890,{},...,623723072793,24198886180,826722,2019-06-26 17:47:26.478675,Jeffrey.Durachta,621768978,467152088,793749,74236867345,70780125184
4,605349533801,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",1465577472,640906121,2019-06-26 18:22:14.726809,0,10080732883,2019-06-14 11:18:38.154111,{},...,642678934515,23712986693,849400,2019-06-26 18:22:14.726813,Jeffrey.Durachta,640906121,450158690,832964,74236894547,73864601600
5,522980958427,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",1998835712,571561896,2019-06-26 18:53:58.558702,0,6009933600,2019-06-14 18:14:24.986076,{},...,573565404518,32443024755,818718,2019-06-26 18:53:58.558707,Jeffrey.Durachta,571561896,434850361,793333,74236809886,70259195904
6,403054040862,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",3392450560,427082965,2019-06-27 04:39:55.070758,0,7005618511,2019-06-15 09:49:24.210549,{},...,429280375226,11304090572,812300,2019-06-27 04:39:55.070762,Jeffrey.Durachta,427082965,332821891,799028,74236867987,70246367232
7,557685977561,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",3392417792,593701277,2019-06-27 05:12:35.879256,0,709300857,2019-06-16 14:06:18.129747,{},...,595771270622,18759054582,797027,2019-06-27 05:12:35.879259,Jeffrey.Durachta,593701277,457078582,783079,74236883941,70606073856
8,553117186630,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",1998848000,594222175,2019-06-27 05:46:19.237698,0,3340305357,2019-06-16 17:16:11.907347,{},...,596382718874,21984544439,801396,2019-06-27 05:46:19.237701,Jeffrey.Durachta,594222175,452663282,783373,74236938446,70251761664
9,574686010894,,"[{'op': 'untar', 'op_instance': '5', 'op_seque...",2347225088,607235263,2019-06-27 06:18:25.742892,0,3676905115,2019-06-17 07:22:16.747572,{},...,609161751217,17253128153,813225,2019-06-27 06:18:25.742895,Jeffrey.Durachta,607235263,468679853,797766,74236837177,70476115968


In [4]:
# if you prefer dealing with python lists and dictionaries,
# you can set fmt='dict'. Here we get a list of dictionaries
eq.get_jobs(jobs = jobs, fmt='dict')

[{'PERF_COUNT_SW_CPU_CLOCK': 866184046169,
  'account': None,
  'all_proc_tags': [{'op': 'untar', 'op_instance': '3', 'op_sequence': '139'},
   {'op': 'hsmput', 'op_instance': '1', 'op_sequence': '118'},
   {'op': 'untar', 'op_instance': '2', 'op_sequence': '130'},
   {'op': 'dmput', 'op_instance': '2', 'op_sequence': '190'},
   {'op': 'ncatted', 'op_instance': '1', 'op_sequence': '116'},
   {'op': 'untar', 'op_instance': '6', 'op_sequence': '166'},
   {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '109'},
   {'op': 'fregrid', 'op_instance': '4', 'op_sequence': '150'},
   {'op': 'ncrcat', 'op_instance': '4', 'op_sequence': '136'},
   {'op': 'mv', 'op_instance': '13', 'op_sequence': '170'},
   {'op': 'cp', 'op_instance': '9', 'op_sequence': '158'},
   {'op': 'hsmget', 'op_instance': '7', 'op_sequence': '25'},
   {'op': 'ncatted', 'op_instance': '12', 'op_sequence': '174'},
   {'op': 'hsmget', 'op_instance': '10', 'op_sequence': '30'},
   {'op': 'hsmget', 'op_instance': '13', 'op_s

<a name="orm-objects"></a>
The `orm` format is particularly useful: it optimizes queries
and lets you get the underlying Job (or Process) object directly.

In [5]:
jobs_orm = eq.get_jobs(jobs, fmt='orm')
jobs_orm.count(), type(jobs_orm)

(12, pony.orm.core.Query)

The ORM format is powerful because it minimizes the number of SQL queries.
Let's see this in action. To do so, we need to enable SQL debug, which
first requires setting logging to INFO level or higher. <a name="logging-debug"></a>

Now we will run a query, first using a format other than ORM (say `terse`)
and then using the `orm` format. You will notice that in ORM format SQL queries are
lazily evaluated, leading to fewer queries. Functions such as `sum`, `count` and `avg`
can only be used on an ORM result. For other results, such
as a list or a pandas dataframe, you would use functions like `len`.
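
To make the idea of lazy evaluation concrete, here is a toy sketch in plain Python. It is not Pony's implementation and does not touch the database; the `LazyQuery` class and its data are hypothetical. It only shows why a count can be answered without materializing any rows.

```python
# Toy illustration of lazy evaluation; this is NOT how Pony ORM is implemented.
class LazyQuery:
    """Holds rows and a predicate, but does no work until a result is requested."""

    def __init__(self, rows, predicate):
        self._rows = rows
        self._predicate = predicate
        self.rows_materialized = 0  # number of row objects handed back so far

    def count(self):
        # Count matches without building a result list (analogous to SELECT COUNT(*)).
        return sum(1 for row in self._rows if self._predicate(row))

    def fetch(self):
        # Build the full result list (analogous to selecting every column).
        result = [row for row in self._rows if self._predicate(row)]
        self.rows_materialized += len(result)
        return result


# Hypothetical job records, for illustration only.
toy_jobs = [{"jobid": str(i), "exitcode": i % 3} for i in range(12)]
query = LazyQuery(toy_jobs, lambda job: job["exitcode"] == 0)

n_matches = query.count()                      # no rows materialized yet
materialized_after_count = query.rows_materialized
first_ids = [job["jobid"] for job in query.fetch()]
```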

In [6]:
eq.set_logging(1)
eq.set_sql_debug(True)

In [7]:
jobs = eq.get_jobs(tag='exp_component:ocean_month_rho2_1x1deg',fmt='terse')

INFO:pony.orm.sql:SELECT "j"."start", "j"."end", "j"."duration", "j"."proc_sums", "j"."created_at", "j"."updated_at", "j"."info_dict", "j"."env_dict", "j"."env_changes_dict", "j"."submit", "j"."jobid", "j"."jobname", "j"."jobscriptname", "j"."sessionid", "j"."exitcode", "j"."user", "j"."tags", "j"."account", "j"."queue", "j"."cpu_time"
FROM "job" "j"
WHERE ("j"."tags" #>> %(p1)s) = %(p2)s
ORDER BY "j"."created_at" DESC
LIMIT 20
INFO:pony.orm:RELEASE CONNECTION


In [8]:
jobs_orm =  eq.get_jobs(tag='exp_component:ocean_month_rho2_1x1deg',fmt='orm')

In [9]:
# Notice for the ORM, the query hasn't been run yet. Now, let's do a count
# of the jobs. You will see that rather than loading the jobs from the DB,
# only a COUNT sql query is run
jobs_orm.count()

INFO:pony.orm:GET CONNECTION FROM THE LOCAL POOL
INFO:pony.orm:SWITCH TO AUTOCOMMIT MODE
INFO:pony.orm.sql:SELECT COUNT(*)
FROM "job" "j"
WHERE ("j"."tags" #>> %(p1)s) = %(p2)s


12

In [10]:
# now let's remove the logging and sql debug to avoid cluttering the output
eq.set_sql_debug(False)
eq.set_logging(0)

<a name="job-tags"></a>
Each job has a `tags` field that is set at import time. The job tag is
stored as a dictionary of key/value pairs. Let's see the job tags for our list of jobs.
This is an advanced query, and we are doing it to show some of the power of the
ORM query syntax.

In [11]:
# now let's see the job tags for each of the jobs in the ORM `Query` object
eq.select((j.jobid, j.tags) for j in jobs_orm)[:]

[('804285', {'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_19090101', 'exp_name': 'ESM4_historical_D151', 'ocn_res': '0.5l75', 'exp_component': 'ocean_month_rho2_1x1deg', 'exp_time': '19090101', 'atm_res': 'c96l49'}), ('802954', {'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_19040101', 'exp_name': 'ESM4_historical_D151', 'ocn_res': '0.5l75', 'exp_component': 'ocean_month_rho2_1x1deg', 'exp_time': '19040101', 'atm_res': 'c96l49'}), ('696127', {'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18990101', 'exp_name': 'ESM4_historical_D151', 'ocn_res': '0.5l75', 'exp_component': 'ocean_month_rho2_1x1deg', 'exp_time': '18990101', 'atm_res': 'c96l49'}), ('693147', {'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18940101', 'exp_name': 'ESM4_historical_D151', 'ocn_res': '0.5l75', 'exp_component': 'ocean_month_rho2_1x1deg', 'exp_time': '18940101', 'atm_res': 'c96l49'}), ('692544', {'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1de

In [12]:
# some other useful queries might be for instance to order the jobs
# by duration
eq.get_jobs(jobs_orm, order='desc(j.duration)',fmt="pandas")[['jobid', 'tags', 'duration', 'exitcode']]

Unnamed: 0,jobid,tags,duration,exitcode
0,625172,{'script_name': 'ESM4_historical_D151_ocean_mo...,12630660818,0
1,676007,{'script_name': 'ESM4_historical_D151_ocean_mo...,10080732883,0
2,685016,{'script_name': 'ESM4_historical_D151_ocean_mo...,7005618511,0
3,629337,{'script_name': 'ESM4_historical_D151_ocean_mo...,6696039124,0
4,633144,{'script_name': 'ESM4_historical_D151_ocean_mo...,6625637678,0
5,627922,{'script_name': 'ESM4_historical_D151_ocean_mo...,6532173945,0
6,680181,{'script_name': 'ESM4_historical_D151_ocean_mo...,6009933600,0
7,802954,{'script_name': 'ESM4_historical_D151_ocean_mo...,3879024457,0
8,696127,{'script_name': 'ESM4_historical_D151_ocean_mo...,3676905115,0
9,693147,{'script_name': 'ESM4_historical_D151_ocean_mo...,3340305357,0


<a name="failed-jobs"></a>Let's figure out which jobs, if any, failed.

In [13]:
eq.get_jobs(jobs_orm, fltr='j.exitcode != 0', fmt='terse')

[]
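
The `fltr`, `order` and `limit` arguments used in the cells above behave much like a filter, a sort and a slice. The sketch below is a rough plain-Python analogy on toy records. The `query` helper and its parameter names are hypothetical, and the real API evaluates these arguments in the database rather than in Python.

```python
# Toy records; the values are illustrative, not taken from the database.
toy_jobs = [
    {"jobid": "A", "duration": 300, "exitcode": 0},
    {"jobid": "B", "duration": 900, "exitcode": 1},
    {"jobid": "C", "duration": 600, "exitcode": 0},
    {"jobid": "D", "duration": 100, "exitcode": 2},
]


def query(rows, fltr=None, order_key=None, descending=False, limit=None):
    """Hypothetical helper: filter first, then sort, then truncate."""
    rows = [r for r in rows if fltr is None or fltr(r)]
    if order_key is not None:
        rows = sorted(rows, key=lambda r: r[order_key], reverse=descending)
    if limit is not None:
        rows = rows[:limit]
    return rows


# Analogous to fltr='j.exitcode != 0'
failed = [j["jobid"] for j in query(toy_jobs, fltr=lambda j: j["exitcode"] != 0)]

# Analogous to order='desc(j.duration)', limit=2
longest_two = [
    j["jobid"]
    for j in query(toy_jobs, order_key="duration", descending=True, limit=2)
]
```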

#### <a name="proc-sums-field">Aggregation across job processes</a>
Each job object has a `proc_sums` field that aggregates data across the
processes of the job. The field is a dictionary of key/value pairs and
is an attribute of the Job object. When converting from `orm`
to the other formats, the key/value pairs of the dictionary are made available
as top-level fields of the `dict` or `pandas` dataframe:

In [14]:
j = jobs_orm.first()
j.proc_sums.keys()

dict_keys(['all_proc_tags', 'rdtsc_duration', 'minflt', 'wchar', 'majflt', 'guest_time', 'cancelled_write_bytes', 'inblock', 'time_oncpu', 'user+system', 'systemtime', 'write_bytes', 'num_procs', 'time_waiting', 'rchar', 'invol_ctxsw', 'PERF_COUNT_SW_CPU_CLOCK', 'read_bytes', 'usertime', 'delayacct_blkio_time', 'outblock', 'syscw', 'num_threads', 'timeslices', 'vol_ctxsw', 'processor', 'rssmax', 'syscr'])

Now, the fields shown above become available in other formats (`dict` and `pandas`) as top-level fields, while the `proc_sums`
field itself is masked.
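
The masking described above amounts to flattening one nested dictionary into its parent. A minimal sketch, assuming a job record is a plain dictionary; `flatten_proc_sums` is a hypothetical helper and the values are made up:

```python
def flatten_proc_sums(job):
    """Hypothetical helper: lift the keys of 'proc_sums' to the top level
    and drop the 'proc_sums' field itself."""
    flat = {key: value for key, value in job.items() if key != "proc_sums"}
    flat.update(job.get("proc_sums", {}))
    return flat


# Toy job record; the values are illustrative only.
toy_job = {
    "jobid": "J1",
    "exitcode": 0,
    "proc_sums": {"num_procs": 3, "rssmax": 5516, "usertime": 1200},
}
flat_job = flatten_proc_sums(toy_job)
```

After the call, `flat_job` has `num_procs`, `rssmax` and `usertime` next to `jobid`, and no `proc_sums` key.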

In [15]:
j_df = eq.get_jobs(j, fmt='pandas')
j_df.columns.values

array(['PERF_COUNT_SW_CPU_CLOCK', 'account', 'all_proc_tags',
       'cancelled_write_bytes', 'cpu_time', 'created_at',
       'delayacct_blkio_time', 'duration', 'end', 'env_changes_dict',
       'env_dict', 'exitcode', 'guest_time', 'inblock', 'info_dict',
       'invol_ctxsw', 'jobid', 'jobname', 'jobscriptname', 'majflt',
       'minflt', 'num_procs', 'num_threads', 'outblock', 'processor',
       'queue', 'rchar', 'rdtsc_duration', 'read_bytes', 'rssmax',
       'sessionid', 'start', 'submit', 'syscr', 'syscw', 'systemtime',
       'tags', 'time_oncpu', 'time_waiting', 'timeslices', 'updated_at',
       'user', 'user+system', 'usertime', 'vol_ctxsw', 'wchar',
       'write_bytes'], dtype=object)

### <a name="process-query">Process Query</a>

A process query returns a collection of one or more processes. Usually the query is
passed a `jobs` parameter to restrict the process set to those contained under the
specified `jobs`. Like the job query, the process query can take `tag`, `fmt`, 
`fltr`, `order` and `limit` to filter and format the output.

In [16]:
# Get the processes belonging to a job.
# Each row in the pandas dataframe contains one process of the job.
# You can use the 'terse' fmt option to get just the list of database ids of the processes.
eq.get_procs(['629337'], fmt='pandas')

Unnamed: 0,PERF_COUNT_SW_CPU_CLOCK,args,cancelled_write_bytes,created_at,delayacct_blkio_time,duration,end,exclusive_cpu_time,exename,exitcode,...,time_oncpu,time_waiting,timeslices,updated_at,user,user+system,usertime,vol_ctxsw,wchar,write_bytes
0,288875,^fre/.+,0,2019-06-26 17:11:35.152791,0,5014,2019-06-10 15:50:57.953311,4998,grep,0,...,5971828,4517378,6,2019-06-26 17:11:35.152794,Jeffrey.Durachta,4998,2999,5,0,0
1,77395,: n,0,2019-06-26 17:11:35.150577,0,82,2019-06-10 15:50:57.952353,3998,tr,0,...,4687213,8550701,6,2019-06-26 17:11:35.150580,Jeffrey.Durachta,3998,1999,4,0,0
2,105049,-c echo torque/6.0.2:moab/9.0.2:slurm/18.08:gl...,0,2019-06-26 17:11:35.148391,0,109,2019-06-10 15:50:57.943236,999,bash,0,...,1110693,4971699,1,2019-06-26 17:11:35.148394,Jeffrey.Durachta,999,999,0,203,0
3,1270150,-c echo torque/6.0.2:moab/9.0.2:slurm/18.08:gl...,0,2019-06-26 17:11:35.155110,0,18487,2019-06-10 15:50:57.954098,6998,bash,0,...,7958715,56548,10,2019-06-26 17:11:35.155112,Jeffrey.Durachta,6998,3999,8,0,0
4,102192,fredb,0,2019-06-26 17:11:35.146150,0,321,2019-06-10 15:50:57.923063,6998,which,0,...,7385108,88913,9,2019-06-26 17:11:35.146153,Jeffrey.Durachta,6998,3999,6,0,0
5,299989,-Gn,0,2019-06-26 17:11:35.143958,0,489,2019-06-10 15:50:57.903606,6998,id,0,...,7249342,101952,9,2019-06-26 17:11:35.143961,Jeffrey.Durachta,6998,3999,6,0,0
6,393361,-Gn,0,2019-06-26 17:11:35.141777,0,698,2019-06-10 15:50:57.892236,6998,id,0,...,7739709,152639,9,2019-06-26 17:11:35.141779,Jeffrey.Durachta,6998,2999,6,0,0
7,319789,-Gn,0,2019-06-26 17:11:35.139584,0,618,2019-06-10 15:50:57.868106,6998,id,0,...,7273913,96956,9,2019-06-26 17:11:35.139587,Jeffrey.Durachta,6998,2999,6,0,0
8,357973,-Gn,0,2019-06-26 17:11:35.137405,0,546,2019-06-10 15:50:57.856721,6998,id,0,...,7666008,165383,9,2019-06-26 17:11:35.137408,Jeffrey.Durachta,6998,3999,6,0,0
9,321774,-Gn,0,2019-06-26 17:11:35.135218,0,581,2019-06-10 15:50:57.832829,6998,id,0,...,7165056,107410,9,2019-06-26 17:11:35.135221,Jeffrey.Durachta,6998,3999,6,0,0


You can also pass in more than one job, in which case the returned processes
are the union of the processes of all the listed jobs. Here we use the `orm` format
to speed up the query, since we just want a `count` of processes.

In [17]:
procs = eq.get_procs(['629337', '625172'], fmt='orm')
procs.count()

15943

#### <a name="process-tags">Process Tags</a>

Each process saves a dictionary of key/value pairs, such as:
`{'op': "ncatted", 'op_instance': 12, 'op_sequence': 159}`

For a job we can determine the <a name="job-proc-tags">unique
set of process tags</a> across all its processes using the
`job_proc_tags` API call.

In [18]:
# suppose you want to figure out the unique set of operations
# across all the jobs of interest. We would pass in our list of
# jobs
eq.job_proc_tags(jobs_orm)

[{'op': 'hsmget', 'op_instance': '10', 'op_sequence': '32'},
 {'op': 'untar', 'op_instance': '6', 'op_sequence': '166'},
 {'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'},
 {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '9'},
 {'op': 'fregrid', 'op_instance': '6', 'op_sequence': '67'},
 {'op': 'cp', 'op_instance': '9', 'op_sequence': '57'},
 {'op': 'mv', 'op_instance': '4', 'op_sequence': '143'},
 {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '93'},
 {'op': 'mv', 'op_instance': '18', 'op_sequence': '187'},
 {'op': 'rm', 'op_instance': '13', 'op_sequence': '164'},
 {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '95'},
 {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '75'},
 {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '83'},
 {'op': 'rm', 'op_instance': '18', 'op_sequence': '178'},
 {'op': 'hsmget', 'op_instance': '13', 'op_sequence': '74'},
 {'op': 'splitvars', 'op_instance': '1', 'op_sequence': '124'},
 {'op': 'dmput', 'op_instance': '2', 'op_se

#### <a name="filter-processes">Filtering and Ordering Processes</a>

In [19]:
# now let's say we care about a particular operation. 
# Let's find the processes in the operation, and
# sort them by the cpu_time, and then see the top offenders
ncatted_procs = eq.get_procs(jobs_orm, \
                             tag = {'op': 'ncatted'}, \
                             order='desc(p.exclusive_cpu_time)', \
                             limit=10, \
                             fmt='pandas')
ncatted_procs[['jobid', 'tags', 'exename', 'duration', 'exclusive_cpu_time']]

Unnamed: 0,jobid,tags,exename,duration,exclusive_cpu_time
0,680181,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1256,58990
1,680181,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncdump,1112,53991
2,629337,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1143,48992
3,693147,"{'op': 'ncatted', 'op_instance': '5', 'op_sequ...",ncatted,1118,48992
4,629337,"{'op': 'ncatted', 'op_instance': '3', 'op_sequ...",ncatted,1119,48991
5,627922,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1037,47992
6,696127,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1082,47992
7,633144,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1085,47991
8,692544,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1053,47991
9,693147,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1042,46991


We could have used a more precise tag, such as `{'op': 'ncatted', 'op_sequence': 85}`,
for more granular selection, and perhaps also filtered on an exename, such as `ncatted`.
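
Tags appear in this workbook both as dictionaries and as `key:value;key:value` strings. The sketch below shows how such a string maps to a dictionary. `tag_from_string` is a hypothetical helper, the separators are inferred from the strings used in this workbook, and EPMT's own parsing may differ.

```python
def tag_from_string(tag, pair_sep=";", kv_sep=":"):
    """Hypothetical helper: turn 'op:ncatted;op_sequence:85' into a dict.
    All values stay strings, as in the tag dictionaries shown in the outputs."""
    result = {}
    for pair in tag.split(pair_sep):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing separator
        key, _, value = pair.partition(kv_sep)
        result[key.strip()] = value.strip()
    return result


parsed = tag_from_string("op:ncatted;op_sequence:85")
```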

In [20]:
procs = eq.get_procs(jobs_orm, tag='op:ncatted;op_sequence:85', \
                     fltr='p.exename == "ncatted"', \
                     order='desc(p.duration)', \
                     fmt='pandas')
procs[['jobid', 'tags', 'exename', 'duration', 'exclusive_cpu_time', 'exitcode']]

Unnamed: 0,jobid,tags,exename,duration,exclusive_cpu_time,exitcode
0,680181,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1256,58990,0
1,629337,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1143,48992,0
2,633144,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1085,47991,0
3,696127,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1082,47992,0
4,692544,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1053,47991,0
5,693147,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1042,46991,0
6,627922,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,1037,47992,0
7,804285,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,588,22995,0
8,676007,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,569,23995,0
9,802954,"{'op': 'ncatted', 'op_instance': '15', 'op_seq...",ncatted,536,21996,0


#### <a name="thread-metrics-aggregation">Process contains aggregated thread metrics</a>

In addition to process-level data in each row, the `pandas` and `dict` formats have fields that represent metrics aggregated across the underlying threads of the process, such as
`rssmax`, `exclusive_cpu_time`, and `rchar`. The ORM `Process` object instead has a `threads_sums` attribute,
which is a dictionary containing the above fields.

In [21]:
procs.columns.values

array(['PERF_COUNT_SW_CPU_CLOCK', 'args', 'cancelled_write_bytes',
       'created_at', 'delayacct_blkio_time', 'duration', 'end',
       'exclusive_cpu_time', 'exename', 'exitcode', 'gen', 'group',
       'guest_time', 'host', 'id', 'inblock', 'inclusive_cpu_time',
       'invol_ctxsw', 'job', 'jobid', 'majflt', 'minflt', 'numtids',
       'outblock', 'parent', 'path', 'pgid', 'pid', 'ppid', 'processor',
       'rchar', 'rdtsc_duration', 'read_bytes', 'rssmax', 'sid', 'start',
       'syscr', 'syscw', 'systemtime', 'tags', 'time_oncpu',
       'time_waiting', 'timeslices', 'updated_at', 'user', 'user+system',
       'usertime', 'vol_ctxsw', 'wchar', 'write_bytes'], dtype=object)
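
As a rough picture of what aggregating across threads means, the sketch below sums two per-thread metrics for a toy process. The thread records and the `sum_thread_metrics` helper are hypothetical, and only `usertime` and `systemtime` are covered; it makes no claim about how EPMT combines the other metrics.

```python
def sum_thread_metrics(threads):
    """Hypothetical helper: sum usertime and systemtime across threads
    and derive the combined user+system value."""
    totals = {
        "usertime": sum(t["usertime"] for t in threads),
        "systemtime": sum(t["systemtime"] for t in threads),
    }
    totals["user+system"] = totals["usertime"] + totals["systemtime"]
    return totals


# Toy threads of a single process; the values are illustrative only.
toy_threads = [
    {"tid": 101, "usertime": 4000, "systemtime": 1000},
    {"tid": 102, "usertime": 2500, "systemtime": 500},
]
toy_sums = sum_thread_metrics(toy_threads)
```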

## <a name="ops">Operations</a>

An operation is simply a collection of processes that share a tag.
The collection of processes forms a graph, or a tree if the operation
has a unique root process.

### <a name="select-op-procs">Selecting processes in an operation</a>

We select the processes in an operation by passing a tag to `get_procs`.
You may limit the selection to a single job or multiple jobs using the
`jobs` parameter to `get_procs`.
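
Selecting by tag can be pictured as a subset match: a process belongs to the operation when every key/value pair of the requested tag also appears in the process tags. This is an assumption based on the `{'op': 'ncatted'}` query earlier in this workbook, which returned processes carrying additional `op_instance` and `op_sequence` keys. The helper `in_operation` and the records are hypothetical.

```python
def in_operation(proc_tags, op_tag):
    """True when every key/value pair of op_tag also appears in proc_tags."""
    return all(proc_tags.get(key) == value for key, value in op_tag.items())


# Toy process records; the values are illustrative only.
toy_procs = [
    {"exename": "ncatted",
     "tags": {"op": "ncatted", "op_instance": "15", "op_sequence": "85"}},
    {"exename": "ncdump",
     "tags": {"op": "ncatted", "op_instance": "15", "op_sequence": "85"}},
    {"exename": "globus-url-copy",
     "tags": {"op": "hsmget", "op_instance": "6", "op_sequence": "12"}},
]

ncatted_op = [
    p["exename"] for p in toy_procs if in_operation(p["tags"], {"op": "ncatted"})
]
```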

In [22]:
hsmget_op_procs = eq.get_procs(jobs, tag='op:hsmget', fmt='orm')
hsmget_op_procs.count()

27720

<a name="cumulative-cpu-time-for-op"></a>
Let's get the cumulative cpu time across all the processes for the operation.

In [23]:
eq.select(p.exclusive_cpu_time for p in hsmget_op_procs).sum()

2540237921.0

<a name="op-top-exe"></a>
Now let us see the top executables by cpu time for the operation. This requires doing a `select` with
a group-by executable name.

In [24]:
eq.select((p.exename, sum(p.exclusive_cpu_time)) for p in hsmget_op_procs).order_by('eq.desc(sum(p.exclusive_cpu_time))')[:][:10]

[('globus-url-copy', 2005629923.0),
 ('perl', 252404075.0),
 ('tcsh', 78575804.0),
 ('cut', 27399825.0),
 ('host', 18195405.0),
 ('grid-proxy-info', 15352101.0),
 ('make', 15155680.0),
 ('which', 13790410.0),
 ('ncdump', 12461684.0),
 ('uberftp', 11301956.0)]

As you can see, `globus-url-copy` takes about `2005` seconds of cpu time, and the next program, `perl`, takes roughly an order of magnitude less.
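
For readers less familiar with group-by queries, the same aggregation can be sketched in plain Python on toy records: sum the cpu time per executable, then sort in descending order. The records are made up, and the real query runs in the database.

```python
from collections import defaultdict

# Toy process records; the values are illustrative only.
toy_procs = [
    {"exename": "globus-url-copy", "exclusive_cpu_time": 500},
    {"exename": "perl", "exclusive_cpu_time": 40},
    {"exename": "globus-url-copy", "exclusive_cpu_time": 300},
    {"exename": "tcsh", "exclusive_cpu_time": 25},
    {"exename": "perl", "exclusive_cpu_time": 10},
]

# Group by executable name and sum the cpu time within each group.
cpu_by_exe = defaultdict(int)
for proc in toy_procs:
    cpu_by_exe[proc["exename"]] += proc["exclusive_cpu_time"]

# Order by the summed cpu time, largest first, and keep the top two.
top_exes = sorted(cpu_by_exe.items(), key=lambda item: item[1], reverse=True)[:2]
```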

Writing ORM queries to aggregate operations is cumbersome, so we have an API call, <a name="op-metrics">`op_metrics`</a>,
that aggregates fields across the processes in a given operation. In its simplest invocation
we pass it a list of one or more jobs:

In [25]:
# widen width of column display width to show full tag
import pandas as pd
pd.set_option('display.max_colwidth', 200)

ops_df = eq.op_metrics(jobs, fmt='pandas')
ops_df[['jobid', 'tags', 'duration', 'cpu_time']]

Unnamed: 0,jobid,tags,duration,cpu_time
0,625172,"{'op': 'hsmget', 'op_instance': '10', 'op_sequence': '32'}",74926114,802817
1,625172,"{'op': 'untar', 'op_instance': '6', 'op_sequence': '166'}",341940,315940
2,625172,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",56171297,381911
3,627922,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",2120153,380909
4,629337,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",22305539,1229774
5,633144,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",3224796,808841
6,676007,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",3201669,781847
7,680181,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",3351135,821832
8,685016,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",2005635,367907
9,692544,"{'op': 'hsmget', 'op_instance': '6', 'op_sequence': '12'}",2216228,451897


#### <a name="dm-ops">Data movement operations</a>
That's a lot of operations. The call can take an optional list of tags
if you know which operations you care about. Let's figure out the time spent
in data-movement operations v. useful work.
In the call to `op_metrics` below, we pass in the *list of tags* that
represent the data-movement operations. As it's a list of tags, the tags
are combined as in an OR operation.

In [26]:
dm_tags = ['op:hsmget', 'op:cp', 'op:dmget', 'op:gcp', 'op:mv', 'op:untar', 'op:tar', 'op:rm']
dm_ops_df = eq.op_metrics(jobs, tags = dm_tags)
dm_ops_df[['jobid', 'tags', 'cpu_time', 'duration', 'num_procs']]

Unnamed: 0,jobid,tags,cpu_time,duration,num_procs
0,625172,{'op': 'hsmget'},525588783,16145107597,8860
1,627922,{'op': 'hsmget'},151229296,7160025791,1713
2,629337,{'op': 'hsmget'},221437661,7716458753,1713
3,633144,{'op': 'hsmget'},207422750,7509555121,1713
4,676007,{'op': 'hsmget'},187083822,14508170608,1713
5,680181,{'op': 'hsmget'},230162385,7836626559,1713
6,685016,{'op': 'hsmget'},123305670,7585973173,1713
7,692544,{'op': 'hsmget'},190831238,1526509158,1713
8,693147,{'op': 'hsmget'},199442956,5148360707,1730
9,696127,{'op': 'hsmget'},201617665,4595265912,1713


While the query above helps, we would like it to aggregate across jobs by tag. This
is easily accomplished by passing the <a name="group-by-tag">`group_by_tag`</a> 
argument to `op_metrics`:

In [27]:
dm_ops_df_grouped = eq.op_metrics(jobs, tags = dm_tags, group_by_tag = True)
dm_ops_df_grouped[['tags', 'cpu_time', 'duration', 'num_procs']]

Unnamed: 0,tags,cpu_time,duration,num_procs
0,{'op': 'cp'},125672164,553131029,12827
1,{'op': 'mv'},142701408,992812444,900
2,{'op': 'hsmget'},2540237921,87996944703,27720
3,{'op': 'rm'},26662950,47021607,2940
4,{'op': 'untar'},45750643,99932219,2513
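
Grouping by tag collapses the per-job rows into one row per tag by summing the metrics. A minimal sketch on toy rows, assuming simple sums of `cpu_time`, `duration` and `num_procs`; the `group_by_tag` function here is a hypothetical stand-in, not the library's code:

```python
# Toy per-job rows; the values are illustrative only.
toy_rows = [
    {"jobid": "J1", "tags": {"op": "hsmget"}, "cpu_time": 500, "duration": 9000, "num_procs": 20},
    {"jobid": "J2", "tags": {"op": "hsmget"}, "cpu_time": 300, "duration": 7000, "num_procs": 10},
    {"jobid": "J1", "tags": {"op": "cp"}, "cpu_time": 50, "duration": 400, "num_procs": 5},
    {"jobid": "J2", "tags": {"op": "cp"}, "cpu_time": 70, "duration": 600, "num_procs": 5},
]


def group_by_tag(rows, metrics=("cpu_time", "duration", "num_procs")):
    """Hypothetical stand-in: sum the metrics of all rows sharing a tag."""
    grouped = {}
    for row in rows:
        # Dicts are unhashable, so use a sorted tuple of items as the key.
        key = tuple(sorted(row["tags"].items()))
        totals = grouped.setdefault(
            key, {"tags": dict(key), **{metric: 0 for metric in metrics}}
        )
        for metric in metrics:
            totals[metric] += row[metric]
    return list(grouped.values())


grouped_rows = group_by_tag(toy_rows)
```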


So, the total cpu time (in seconds) spent in all data-movement operations is:

In [47]:
dm_ops_df_grouped['cpu_time'].sum()/1e6

2881.025086

Contrast this with the cpu time (in seconds) spent in the jobs as a whole:

In [48]:
eq.select(j.cpu_time for j in jobs_orm).sum()/1e6

7351.686315

#### <a name="cpu-time-v-duration">cpu time v. duration</a>
So, the data-movement operations take about `39%` of the total cpu time across our jobs.
We deliberately used `exclusive_cpu_time` (a.k.a. `cpu_time`) rather than `duration`
for this calculation, because `duration` can be counted multiple
times when a process forks and waits: the wall-clock time is then
counted for both the parent process and the child process. `cpu_time`, on the other hand,
is the actual time spent on the cpu, and cannot be counted twice in such a scenario.
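
The arithmetic behind the percentage, together with a toy fork-and-wait case, can be checked as follows. The two cpu-time figures are the sums computed above; the parent and child records are made up to illustrate the double counting.

```python
dm_cpu_seconds = 2881.025086     # data-movement cpu time, from the sum above
all_cpu_seconds = 7351.686315    # cpu time of all jobs, from the sum above
dm_share_pct = 100 * dm_cpu_seconds / all_cpu_seconds

# Toy fork-and-wait: the parent mostly sleeps while the child works for 10 s.
parent = {"duration": 10.0, "cpu_time": 0.1}
child = {"duration": 10.0, "cpu_time": 9.0}

# Summing durations counts the same 10 s of wall-clock time twice.
summed_duration = parent["duration"] + child["duration"]
# Summing cpu times counts each cpu second exactly once.
summed_cpu_time = parent["cpu_time"] + child["cpu_time"]
```

Here `summed_duration` is 20 seconds even though only 10 seconds elapsed, while `summed_cpu_time` stays below the elapsed time.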

## <a name="thread-query">Thread Query</a>

The thread query requires passing one or more *process primary keys* or `Process`
objects to `get_thread_metrics`. Let's illustrate this with an example, where
we first obtain the <a name="root-process">root process</a> of a job:

In [28]:
# let's find the root process for a particular job
root = eq.root('629337', fmt='terse')
root

2148207

In [29]:
root_threads_df = eq.get_thread_metrics(root)
display(root_threads_df.columns.values)
root_threads_df[['process_pk', 'tid', 'usertime', 'systemtime', 'rssmax']]

array(['tid', 'start', 'end', 'usertime', 'systemtime', 'rssmax',
       'minflt', 'majflt', 'inblock', 'outblock', 'vol_ctxsw',
       'invol_ctxsw', 'num_threads', 'starttime', 'processor',
       'delayacct_blkio_time', 'guest_time', 'rchar', 'wchar', 'syscr',
       'syscw', 'read_bytes', 'write_bytes', 'cancelled_write_bytes',
       'time_oncpu', 'time_waiting', 'timeslices', 'rdtsc_duration',
       'PERF_COUNT_SW_CPU_CLOCK', 'process_pk'], dtype=object)

Unnamed: 0,process_pk,tid,usertime,systemtime,rssmax
0,2148207,16269,454930,352946,5516


## <a name="useful-attributes">Useful attributes in Job, Process and Thread objects</a>

The following are some useful attributes of the job, process and thread objects.
They are accessible when using the `orm` format. They are also available in the
`pandas` and `dict` formats, with two exceptions:

The `proc_sums` field of the Job object is masked in the `pandas` and `dict` formats,
and the underlying keys of the dictionary are exposed at the top level.

The `threads_sums` field of the Process object is masked in the `pandas` and `dict` formats,
and the underlying keys of the dictionary are exposed at the top level.

### Job Attributes
 - duration: this is the wallclock time in microseconds
 - cpu_time: user+system time aggregated across all processes of the job
 - start:    start time in microseconds since epoch
 - end:      end time in microseconds since epoch
 - jobid:    database id for job (unique)
 - exitcode: return code from job
 - tags:     dict of key/value pairs
 - processes: list of processes belonging to job
 - proc_sums: aggregates across processes of a job
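
Because `duration` is stored in microseconds, the raw values are hard to read. A small sketch of converting one to a more familiar unit with the standard library, using the duration of job 625172 shown earlier in this workbook:

```python
from datetime import timedelta

duration_us = 12630660818  # duration of job 625172, in microseconds

duration = timedelta(microseconds=duration_us)
duration_seconds = duration.total_seconds()
duration_hours = duration_seconds / 3600
readable = str(duration)   # hours:minutes:seconds.microseconds
```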
 

### Process Attributes
 - duration: this is the wallclock time in microseconds
 - exclusive_cpu_time: user+system time for process (aggregated across its threads)
 - inclusive_cpu_time: user+system time for the process and *all its descendants*
 - start:    start time in microseconds since epoch
 - end:      end time in microseconds since epoch
 - tags:     dict of key/value pairs
 - threads_df: json serialized dataframe of process threads (ADVANCED)
 - threads_sums: key/value pairs consisting of sums of thread metrics (ADVANCED)
 - numtids:  number of threads
 - exename
 - args
 - pid
 - ppid
 - id:       database ID for process
 - exitcode
 - parent
 - children
 - ancestors
 - descendants
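
The relationship between `exclusive_cpu_time`, `inclusive_cpu_time` and `descendants` can be sketched on a toy process tree. The tree, its values and both helper functions are hypothetical; the sketch only illustrates the definitions given in the list above.

```python
# Toy process tree keyed by executable name; the values are illustrative only.
toy_tree = {
    "tcsh":    {"exclusive_cpu_time": 10, "children": ["fregrid", "ncra"]},
    "fregrid": {"exclusive_cpu_time": 70, "children": []},
    "ncra":    {"exclusive_cpu_time": 50, "children": ["ncdump"]},
    "ncdump":  {"exclusive_cpu_time": 5,  "children": []},
}


def inclusive_cpu_time(tree, name):
    """Exclusive time of the process plus the inclusive time of each child."""
    node = tree[name]
    return node["exclusive_cpu_time"] + sum(
        inclusive_cpu_time(tree, child) for child in node["children"]
    )


def descendants(tree, name):
    """All processes below the given one, in depth-first order."""
    found = []
    for child in tree[name]["children"]:
        found.append(child)
        found.extend(descendants(tree, child))
    return found


root_inclusive = inclusive_cpu_time(toy_tree, "tcsh")
root_descendants = descendants(toy_tree, "tcsh")
```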
 
 
### Thread Attributes
 - usertime
 - systemtime
 - user+system
 - rssmax
 - majflt
 - read_bytes
 - write_bytes

### <a name="useful-queries">Useful queries</a>

Below are some more queries to give you a flavor of how to use the API.

In [30]:
# ordinarily we would first find the job and then probe downwards
# You can use tags or fltr arguments to find the job
# As we did not include job tags in this script, let's just find the job using
# its job id
job = eq.get_jobs('676007', fmt='dict')[0]
job

{'PERF_COUNT_SW_CPU_CLOCK': 605349533801,
 'account': None,
 'all_proc_tags': [{'op': 'untar', 'op_instance': '5', 'op_sequence': '56'},
  {'op': 'cp', 'op_instance': '3', 'op_sequence': '30'},
  {'op': 'ncatted', 'op_instance': '12', 'op_sequence': '73'},
  {'op': 'rm', 'op_instance': '2', 'op_sequence': '34'},
  {'op': 'mv', 'op_instance': '18', 'op_sequence': '83'},
  {'op': 'cp', 'op_instance': '15', 'op_sequence': '79'},
  {'op': 'cp', 'op_instance': '11', 'op_sequence': '66'},
  {'op': 'timavg', 'op_instance': '11', 'op_sequence': '72'},
  {'op': 'hsmget', 'op_instance': '4', 'op_sequence': '20'},
  {'op': 'untar', 'op_instance': '7', 'op_sequence': '78'},
  {'op': 'fregrid', 'op_instance': '7', 'op_sequence': '80'},
  {'op': 'untar', 'op_instance': '6', 'op_sequence': '65'},
  {'op': 'timavg', 'op_instance': '9', 'op_sequence': '64'},
  {'op': 'fregrid', 'op_instance': '4', 'op_sequence': '49'},
  {'op': 'fregrid', 'op_instance': '2', 'op_sequence': '31'},
  {'op': 'ncrcat', 'op

In [31]:
# Now get the processes that are part of this job, sorted by inclusive CPU time.
# We pass in the job id to restrict the query to a particular job.
# inclusive_cpu_time sums the CPU times of the process and all its descendants.
# In this case, after the top-level 'tcsh', the 'fregrid' and 'ncra'
# processes show up.
procs = eq.get_procs('676007', order = 'desc(p.inclusive_cpu_time)', fmt='pandas', limit=10)
procs[['exename', 'duration', 'inclusive_cpu_time', 'exitcode']]

|   | exename | duration | inclusive_cpu_time | exitcode |
|---|---|---|---|---|
| 0 | tcsh | 10080580982 | 607624279 | 0 |
| 1 | fregrid | 72611586 | 68253623 | 0 |
| 2 | ncra | 88403258 | 55002636 | 0 |
| 3 | tcsh | 40762418 | 38443149 | 0 |
| 4 | TAVG.exe | 40354062 | 38386164 | 0 |
| 5 | tcsh | 34855334 | 34631728 | 0 |
| 6 | TAVG.exe | 34593673 | 34583741 | 0 |
| 7 | perl | 38512955 | 32827920 | 0 |
| 8 | perl | 37960575 | 32017044 | 0 |
| 9 | make | 33665029 | 31420174 | 0 |


<a name="process-tree-walk"></a>Let's do a walk through the process tree.

In [32]:
# now let's walk through the process tree. To make this easy, we use the 'orm' format
# let's sort the processes by exclusive cpu time
# You will get a sorted list of ORM objects, let's see the top 10
procs = eq.get_procs('676007', order = 'desc(p.exclusive_cpu_time)', fmt='orm')[:10]
procs

[Process[3290949], Process[3290754], Process[3289464], Process[3289961], Process[3290210], Process[3287818], Process[3290459], Process[3288579], Process[3288394], Process[3289713]]

In [33]:
# let's pick the first one
p = procs[0]
p

Process[3290949]

In [34]:
p.exename

'fregrid'

In [35]:
p.exename, p.args, p.duration, len(p.children), p.numtids

('fregrid',
 '--standard_dimension --input_mosaic ocean_mosaic.nc --input_file all --interp_method conserve_order1 --remap_file .fregrid_remap_file_360_by_180.nc --nlon 360 --nlat 180 --scalar_field volcello,thkcello,vo,vmo,vhGM,vhml --output_file out.nc',
 72611586.0,
 0,
 1)

In [36]:
parent = p.parent
parent

Process[3287663]

In [37]:
parent.exename, parent.args, parent.pid, len(parent.children), len(parent.descendants)

('tcsh',
 '-f /home/Jeffrey.Durachta/ESM4/DECK/ESM4_historical_D151/gfdl.ncrc4-intel16-prod-openmp/scripts/postProcess/ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101.tags',
 27339,
 729,
 3309)

In [38]:
# let's see p's thread sums
p.threads_sums

{'PERF_COUNT_SW_CPU_CLOCK': 68221670024,
 'cancelled_write_bytes': 0,
 'delayacct_blkio_time': 0,
 'guest_time': 0,
 'inblock': 12968064,
 'invol_ctxsw': 180,
 'majflt': 3,
 'minflt': 9946,
 'outblock': 2141984,
 'processor': 0,
 'rchar': 15140041925,
 'rdtsc_duration': 251077780608,
 'read_bytes': 6639648768,
 'rssmax': 58112,
 'syscr': 1741380,
 'syscw': 33346,
 'systemtime': 5995088,
 'time_oncpu': 68265620909,
 'time_waiting': 15819937,
 'timeslices': 536,
 'user+system': 68253623,
 'usertime': 62258535,
 'vol_ctxsw': 355,
 'wchar': 2185067697,
 'write_bytes': 1096695808}

In [39]:
# let's get the thread dataframes for p
eq.get_thread_metrics(p)

Unnamed: 0,tid,start,end,usertime,systemtime,rssmax,minflt,majflt,inblock,outblock,...,syscw,read_bytes,write_bytes,cancelled_write_bytes,time_oncpu,time_waiting,timeslices,rdtsc_duration,PERF_COUNT_SW_CPU_CLOCK,process_pk
0,5488,1560525416253004,1560525488864590,62258535,5995088,58112,9946,3,12968064,2141984,...,33346,6639648768,1096695808,0,68265620909,15819937,536,251077780608,68221670024,3290949
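
The numbers in the thread row above can be cross-checked with a few lines of plain Python. The sketch below copies the values from that row and relies only on the units stated in this workbook (microseconds since the epoch for `start`/`end`, microseconds for the CPU times). Because this process has a single thread, the thread's `end - start` matches the process duration reported earlier, and CPU time comes to about 94% of wallclock time.

```python
from datetime import datetime, timezone

# Values copied from the thread row above (all times in microseconds)
start_us = 1560525416253004
end_us = 1560525488864590
usertime = 62258535
systemtime = 5995088

duration_us = end_us - start_us   # wallclock time of the thread
cpu_us = usertime + systemtime    # the 'user+system' metric

# Convert the start timestamp to a timezone-aware datetime (UTC)
start_dt = datetime.fromtimestamp(start_us / 1_000_000, tz=timezone.utc)

print(duration_us, cpu_us)
print(start_dt.isoformat())
print(f"CPU utilisation: {cpu_us / duration_us:.1%}")
```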


In [40]:
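# Let's explore a particular operation in a job and see which processes took the
# most *inclusive* CPU time.
# We limit the output to the top 5 results and ask for a pandas dataframe.
# NOTE: `j` must refer to a job whose processes carry the tag below. The KeyError
# shown in the output means none of the requested columns exist, which most likely
# indicates that the query matched no processes and returned an empty dataframe.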
procs = eq.get_procs(j, tag = 'op_sequence:159', order='desc(p.inclusive_cpu_time)', limit=5, fmt='pandas')
procs[['exename', 'args', 'exclusive_cpu_time', 'inclusive_cpu_time', 'duration']]

KeyError: "['exename' 'args' 'exclusive_cpu_time' 'inclusive_cpu_time' 'duration'] not in index"

<a name="failed-procs"></a>Let's see if there are any failed processes in our job selection.

In [None]:
# Let's find the failed processes across our jobs subset
failed_procs = eq.get_procs(jobs_orm, fltr='p.exitcode > 0', fmt='pandas')
failed_procs[['jobid', 'exename', 'args', 'exitcode', 'tags']]

Let's focus on one particular operation to prune the list a bit.

In [None]:
failed_procs = eq.get_procs(jobs, tag='op_sequence:79', fltr='p.exitcode > 0', fmt='pandas')
failed_procs[['jobid', 'id', 'exename', 'args', 'exitcode']]

In [None]:
# let's explore one of the failed processes
p = eq.Process[1581673]
p.exename, p.exitcode, p.args

## <a name="linux-kernel">Case Study - Linux Kernel Compile</a>

Start by importing the data for this experiment (import takes around half an hour on my laptop):
```
$ ./epmt -v submit sample/outlier/*.tgz
```

Let's review the script:
```
$ cat sample/kernel/build-linux-kernel.sh 
#!/bin/bash -e

# you will need the following deps installed:
#  sudo apt-get install build-essential libncurses-dev bison flex libssl-dev libelf-dev coreutils

# EPMT_JOB_TAGS='model:linux-kernel;compiler:gcc' ./epmt -a -j kernel-build-$(date +%Y%m%d-%H%M%S) run sample/kernel/build-linux-kernel.sh
#

build_dir=$(tempfile -p epmt -s build)
echo "creating build directory: $build_dir"
rm -rf $build_dir; mkdir -p $build_dir && cd $build_dir

# download
PAPIEX_TAGS="operation:download;operation_count:1;instance:1" wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.1.7.tar.xz
PAPIEX_TAGS="operation:extract;operation_count:2;instance:1" tar -xf linux-5.1.7.tar.xz
cd linux-5.1.7

# configure
cp -v /boot/config-$(uname -r) .config
PAPIEX_TAGS="operation:configure;operation_count:3;instance:1" make olddefconfig

# build
PAPIEX_TAGS="operation:build;operation_count:4;instance:1" make -j $(nproc)
```

The job has a tag set: `model:linux-kernel;compiler:gcc`

Each of the download, extract, configure and build operations is marked using `PAPIEX_TAGS`.
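
The tag strings in the script follow a `key:value;key:value` layout. As a rough illustration of how such a string maps onto the dict form of a tag, here is a minimal parser. The helper `parse_tag_string` is hypothetical and not part of the EPMT API; it keeps all values as strings, matching the process tags shown earlier in this workbook.

```python
def parse_tag_string(tag_str, delim=";", sep=":"):
    """Turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'}.

    Hypothetical helper for illustration; it is not part of the EPMT API.
    """
    tags = {}
    for item in tag_str.split(delim):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing delimiter
        key, _, value = item.partition(sep)
        tags[key] = value
    return tags

build_tags = parse_tag_string("operation:build;operation_count:4;instance:1")
print(build_tags)
# {'operation': 'build', 'operation_count': '4', 'instance': '1'}
```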

In [None]:
# start by locating the job in the database using tags
# you can specify tags as a dict or a string
# use fmt='terse' as we just want to know the job id
kern_jobs = j = eq.get_jobs(tag = 'exp_name:linux_kernel', fmt='terse')
j

In [None]:
# let's get all the tags associated with the processes of the jobs
# This is a *very slow query* as all the processes for the job are loaded
# and the tags are filtered to get the unique tags
# If you already know the tags of the operations you care about,
# then this step is not needed
eq.job_proc_tags('kern-6656-20190614-185359')

In [None]:
# let's see the processes in the download phase
download = eq.get_procs('kern-6656-20190614-185359', tag = 'op:download', fmt='pandas')
download[['exename', 'args', 'exitcode', 'duration']]

In [None]:
# So, there was only the single program wget, and the duration shows the wallclock time.
# Now let's study the configure phase. We expect it to have many processes. Whenever the
# number of processes is large, it is a good idea to use order and limit, particularly
# if the format is dict or pandas. The 'orm' and 'terse' formats are usually fast already.
configure_procs = eq.get_procs('kern-6656-20190614-185359', tag = 'op:configure', fmt='orm')
configure_procs.count()

In [None]:
# As you can see, that is a lot of processes. Let's use order and limit to get a better understanding.
# We re-run the query, but this time we sort by inclusive_cpu_time and get the top 10 processes.
configure = eq.get_procs('kern-6656-20190614-185359', tag = 'op:configure', order = 'desc(p.inclusive_cpu_time)', limit = 10, fmt='pandas')
configure[['exename', 'args', 'pid', 'duration', 'inclusive_cpu_time', 'exclusive_cpu_time']]

In [None]:
# ADVANCED TOPIC:
# The idea below is to get the user familiar with the power of ORM
# operations, so we can get feedback and ideas for new API calls.
#

# If you just want to know the total time of an operation, you can
# use database queries on the ORM directly.
# The big advantage of the ORM is query speed, thanks to
# lazy loading and optimized queries that use db primitives.
c = eq.get_procs('kern-6656-20190614-185359', tag = 'op:configure', fmt='orm')
eq.select(p.exclusive_cpu_time for p in c).sum()

In [None]:
# another trick that works to get the max time for an operation is
# to find the process with the max value for duration. This works if
# you have a top-level process that spawned the rest
# Notice we use order and limit
root_build_proc = eq.get_procs('kern-6656-20190614-185359', tag = 'op:build', order='desc(p.duration)', limit=5, fmt='pandas')
root_build_proc[['exename', 'args', 'duration', 'inclusive_cpu_time', 'exitcode']]

In [None]:
# Above, you notice the build operation's root process 'make' took
# a long time 

# Now let's see if any process failed in the build phase
# If you use 'orm' you get access to 'count', which is very fast because it
# uses SQL to do the count directly rather than loading all the fields of the matching processes
eq.get_procs('kern-6656-20190614-185359', tag = 'op:build', fltr='p.exitcode != 0', fmt='orm').count()

#### <a name="timeline">Timeline</a>
Sometimes you want to get a timeline of the processes in the order they were spawned.

In [None]:
procs = eq.timeline('kern-6656-20190614-185359', fmt='pandas', limit=25)
procs[['exename', 'tags', 'start', 'duration']]
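
Conceptually, a timeline is the list of processes ordered by start time and optionally truncated. The sketch below shows that idea on plain dictionaries. It assumes that "order spawned" means ascending `start`; the function name `timeline_sketch` and the sample records are made up and do not reflect how `eq.timeline` is implemented.

```python
def timeline_sketch(procs, limit=None):
    """Return process records ordered by start time (earliest first)."""
    ordered = sorted(procs, key=lambda p: p["start"])
    return ordered if limit is None else ordered[:limit]

# Made-up records; only the relative order of 'start' matters here
procs_sample = [
    {"exename": "make", "start": 30, "duration": 500},
    {"exename": "wget", "start": 10, "duration": 120},
    {"exename": "tar",  "start": 20, "duration": 80},
]

for p in timeline_sketch(procs_sample, limit=2):
    print(p["exename"], p["start"])
```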

In [None]:
# Advanced topic:
# The orm also gives an easy way to navigate the process hierarchy
# Let's use the ORM directly to walk through the job
j = eq.get_jobs('kern-6656-20190614-185359', fmt='orm').first()
j

In [None]:
# Notice we have a Job object. The processes in the job
# are available as j.processes
j.duration, j.cpu_time, j.exitcode, j.tags

In [None]:
# let's see the process that took the max cpu time
max_cpu_proc = j.processes.order_by('desc(p.exclusive_cpu_time)').limit(1)[0]
max_cpu_proc.exename, max_cpu_proc.pid, max_cpu_proc.exclusive_cpu_time, max_cpu_proc.duration

In [None]:
# let's get details on the build operation
b = eq.get_procs(j, tag = 'op:build', order='desc(p.inclusive_cpu_time)', fmt='orm')
b

In [None]:
# Above we get a Query object, we can iterate over it, convert
# it to a list or get a slice of it
b_limit = b.order_by('eq.desc(p.inclusive_cpu_time)').limit(5)
b_limit

In [None]:
# observe that we don't actually do any queries until we start using
# the result
top_cpu = b_limit[0]
top_cpu

In [None]:
top_cpu.exename, top_cpu.args, top_cpu.duration, top_cpu.exclusive_cpu_time

In [None]:
# now we get access to the parent/children/ancestors/descendants of this process
max(top_cpu.descendants.exitcode)

In [None]:
# so one or more descendant processes failed, let's find which ones
failed = top_cpu.descendants.filter('p.exitcode != 0')
failed.count()

In [None]:
# Advanced topic: 
# we can convert a Query object to a pandas dataframe anytime
# Note: in the future you will be able to pass a Query object to
# eq.get_procs and achieve format conversion. Below is a quick-and-dirty
# workaround.
import pandas as pd
df = pd.DataFrame([p.to_dict() for p in failed])
df[['exename', 'args', 'start', 'end', 'pid', 'ppid', 'exitcode']]

In [None]:
# first we ask for the aggregate metrics for a single job
# Here, we don't specify any tags. For single jobs, when
# we don't specify the operation/tags, they are queried from the job
op_sums = eq.op_metrics(jobs='kern-6656-20190614-185359', fmt='pandas')
display(op_sums.columns.values)
op_sums[['jobid', 'tags', 'duration', 'cpu_time']]

In [None]:
# Now let's run the same query against all the kernel build jobs. In this case, we need
# to provide a list of tags (or a single tag) for the operation
eq.op_metrics(kern_jobs, tags=['op:build', 'op:configure'])[['job','tags', 'cpu_time','num_procs', 'rssmax']]
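
To give a feel for what an aggregation by operation involves, the sketch below groups made-up process records by job and `op` tag, summing `cpu_time` and counting processes. It is a simplified illustration: `op_metrics_sketch` and the sample data are hypothetical, and the real `eq.op_metrics` aggregates many more metrics and may match tags differently.

```python
from collections import defaultdict

def op_metrics_sketch(procs, ops):
    """Sum cpu_time and count processes per (job, op) pair.

    Hypothetical helper for illustration; not the EPMT implementation.
    """
    totals = defaultdict(lambda: {"cpu_time": 0, "num_procs": 0})
    for p in procs:
        op = p["tags"].get("op")
        if op in ops:
            entry = totals[(p["job"], op)]
            entry["cpu_time"] += p["cpu_time"]
            entry["num_procs"] += 1
    return dict(totals)

# Made-up process records for two jobs
sample = [
    {"job": "kern-a", "tags": {"op": "build"}, "cpu_time": 700},
    {"job": "kern-a", "tags": {"op": "build"}, "cpu_time": 300},
    {"job": "kern-a", "tags": {"op": "configure"}, "cpu_time": 50},
    {"job": "kern-b", "tags": {"op": "build"}, "cpu_time": 900},
    {"job": "kern-b", "tags": {"op": "download"}, "cpu_time": 5},
]

result = op_metrics_sketch(sample, ops={"build", "configure"})
for key, agg in sorted(result.items()):
    print(key, agg)
```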

In [None]:
# let's look at a particular job and see the processes with the most major page faults
# (summed across all threads), for the build operation only
df = eq.get_procs('kern-6656-20190614-185359', tag='op:build', order='desc(p.threads_sums["majflt"])', limit=5, fmt='pandas')
df[['exename', 'args', 'majflt', 'exclusive_cpu_time']]

## Misc examples

Below is a collection of some queries that we found useful.

In [None]:
# As you may know, for outlier detection we can only compare jobs with the
# same exp_name and exp_component. Let's count the number of jobs
# for each exp_component.
# For this we will use advanced ORM methods.
q = eq.select((eq.count(j), j.tags['exp_component']) for j in eq.Job)
list(q[:])
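
The ORM query above is a grouped count. For readers less familiar with ORM generator syntax, the same idea on plain Python data looks like the sketch below. The job records and component names are made up for illustration, and this version counts in memory rather than in the database.

```python
from collections import Counter

# Made-up job records carrying the tags used for outlier grouping
jobs_sample = [
    {"jobid": "1", "tags": {"exp_name": "A", "exp_component": "ocean_month"}},
    {"jobid": "2", "tags": {"exp_name": "A", "exp_component": "ocean_month"}},
    {"jobid": "3", "tags": {"exp_name": "A", "exp_component": "atmos"}},
    {"jobid": "4", "tags": {}},  # a job without the tag is counted under None
]

# Count jobs per exp_component
counts = Counter(j["tags"].get("exp_component") for j in jobs_sample)
print(counts.most_common())
```

The ORM version is preferable on a real database, since the grouping and counting are pushed down to SQL instead of loading every job.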

In [None]:
# below we filter those processes of the job that exceed a certain
# wallclock time, and then sort them by the exclusive cpu time (user+system)
# fltr can be a lambda function or a string
# limit can be useful to restrict the number of elements in the output
eq.get_procs(jobs_orm, fltr = lambda p: p.duration > 100000, order = 'desc(p.exclusive_cpu_time)', limit=5, fmt='pandas')
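
The combination of `fltr`, `order` and `limit` used above amounts to filter, sort descending, then truncate. A plain-Python sketch of that pipeline follows; `top_procs` and the sample records are hypothetical and only mirror the semantics described in the comments, not the database-side implementation.

```python
def top_procs(procs, fltr, order_key, limit):
    """Filter, sort descending by `order_key`, then truncate to `limit` rows."""
    matching = [p for p in procs if fltr(p)]
    matching.sort(key=lambda p: p[order_key], reverse=True)
    return matching[:limit]

# Made-up process records (times in microseconds)
sample_procs = [
    {"exename": "make", "duration": 900000, "exclusive_cpu_time": 40},
    {"exename": "gcc",  "duration": 250000, "exclusive_cpu_time": 200},
    {"exename": "sh",   "duration": 5000,   "exclusive_cpu_time": 1},
    {"exename": "ld",   "duration": 150000, "exclusive_cpu_time": 120},
]

top = top_procs(
    sample_procs,
    fltr=lambda p: p["duration"] > 100000,
    order_key="exclusive_cpu_time",
    limit=2,
)
print([p["exename"] for p in top])  # ['gcc', 'ld']
```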