# EPMT Query API

This workbook illustrates how to use the EPMT Query API. It assumes that `EPMT` is
installed.


## Table of Contents

 * [Import data for the study](#import-data)
 * [Import module](#import-module)
 * [Getting documentation](#getting-docs)
 * [Job Query](#job-query)
   * [output format and converting between formats](#output-formats)
   * [working with ORM objects (ADVANCED TOPIC)](#orm-objects)
   * [job tags](#job-tags)
   * [ordering and filtering jobs](#jobs-order-filter)
   * [failed jobs](#failed-jobs)
   * [process sums (ADVANCED TOPIC)](#proc-sums-field)
 * [Process Query](#process-query)
   * [process tags](#process-tags)
     * [unique process tags in job (ADVANCED TOPIC)](#job-proc-tags)
   * [filter and ordering](#filter-processes)
   * [thread metrics aggregation (ADVANCED TOPIC)](#thread-metrics-aggregation)
 * [Operations](#ops)
   * [select op processes](#select-op-procs)
   * [the Operation primitive](#operation-primitive)
   * [aggregating operation metrics](#op-metrics)
   * [data-movement v. useful work](#dm-ops)
   * [op_metrics grouped by tag](#group-by-tag)
   * [cpu-time v. duration](#cpu-time-v-duration)
 * [Thread Query](#thread-query)
 * [Useful Queries](#useful-queries)
   * [process tree walk](#process-tree-walk)
   * [failed processes](#failed-procs)
   * [all process tags for job](#job-proc-tags)
   * [root process](#root-process)
   * [timeline](#timeline)
 * [Useful Attributes of Job/Process/Threads](#useful-attributes)


## <a name="import-data">Import the data for this study</a>

This workbook relies on importing the following data. We use an SQLite database
in this study, but you can use another database such as `postgresql`.
See the `preset_settings` folder to pick a template of your choice and edit it
if needed. Save the template in the `epmt` folder as `settings.py`.

While not required, we recommend that you start with a fresh database
so as not to affect your existing data. The SQLite database path is set
in `settings.py`, and is typically a file in the user's `HOME` directory.

```
# pick the database settings file of your choice
$ cp ../preset_settings/settings_sqlite_localfile.py settings.py

# backup your existing database
# The path might vary depending on your settings.py
# mv ~/EPMT_DB.sqlite ~/EPMT_DB.sqlite.backup

# now import the data
$ ./epmt -v submit test/data/query_notebook/*.tgz

# check the list of imported jobs
$ ./epmt list
['625172', '627922', '629337', '633144', '676007', '680181', '685000', '685016', '692544', '693147', '696127', '802954', '804285']
```

<a name="import-module"></a>

In [1]:
# import the query api module
import epmt_query as eq

The API has a few queries -- `get_jobs`, `get_procs` and `get_thread_metrics` -- that you will be using frequently.

Each of these operates at a distinct level: job, process, and thread.

### <a name="getting-docs">Getting to the docs</a>

The module functions have embedded documentation in the form of docstrings. You can access it
as you would for any Python module or function.

To get help for all functions in the module, use `help(<module-name>)`:
```
help(eq)
```

To get documentation for a specific function, do something like:
```
help(eq.get_jobs)
```

### <a name="job-query">Job Query</a>

The job query usually takes a `tags` argument and returns a collection of jobs in the format specified by `fmt`.
The returned list can be pruned and/or ordered using `fltr`, `limit` and `order`.

You can also pass in one or more jobs as a `jobs` parameter, most often for format conversion.

Let's get started!

In [2]:
# let's get jobs, we use the job tag to select the jobs
jobs = eq.get_jobs(tags='exp_name:ESM4_historical_D151;exp_component:ocean_month_rho2_1x1deg',fmt='terse')
jobs

['625172',
 '627922',
 '629337',
 '633144',
 '676007',
 '680181',
 '685016',
 '692544',
 '693147',
 '696127',
 '802954',
 '804285']

<a name="output-formats"></a>`fmt` can take one of the following values:
 * `terse` -- this returns a list of job ids
 * `pandas` -- this returns a pandas dataframe
 * `dict` -- for a list of python dictionaries
 * `orm` -- ORM object for maximum flexibility and speediest queries.

In [3]:
# above we got a list of job ids. sometimes we want to see more details
# than just the job id. We can use `conv_jobs` to convert between formats
jobs_df = eq.conv_jobs(jobs, fmt='pandas')
display(jobs_df.columns.values)
jobs_df

array(['start', 'jobname', 'created_at', 'end', 'exitcode', 'duration',
       'updated_at', 'tags', 'info_dict', 'env_dict', 'cpu_time',
       'annotations', 'env_changes_dict', 'analyses', 'submit', 'jobid',
       'user', 'all_proc_tags', 'num_procs', 'num_threads', 'systemtime',
       'rchar', 'minflt', 'time_oncpu', 'inblock', 'guest_time',
       'read_bytes', 'syscw', 'timeslices', 'PERF_COUNT_SW_CPU_CLOCK',
       'user+system', 'rdtsc_duration', 'cancelled_write_bytes',
       'invol_ctxsw', 'syscr', 'delayacct_blkio_time', 'vol_ctxsw',
       'usertime', 'majflt', 'outblock', 'wchar', 'write_bytes', 'rssmax',
       'processor', 'time_waiting'], dtype=object)

Unnamed: 0,start,jobname,created_at,end,exitcode,duration,updated_at,tags,info_dict,env_dict,...,delayacct_blkio_time,vol_ctxsw,usertime,majflt,outblock,wchar,write_bytes,rssmax,processor,time_waiting
0,2019-06-09 18:53:22.574059,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:46:42.022221,2019-06-09 22:23:53.234877,0,12630660000.0,2019-12-02 15:47:10.660314,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job625172', '...",...,0,3132299,769390290,27,262975528,138638409056,134643470336,98645272,0,48105200287
1,2019-06-10 06:23:14.388744,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:12.908891,2019-06-10 08:12:06.562689,0,6532174000.0,2019-12-02 15:47:20.581507,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job627922', '...",...,0,809138,428874880,300,141682240,74236894004,72541306880,33288800,0,13802065001
2,2019-06-10 09:59:22.043793,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:21.210285,2019-06-10 11:50:58.082917,0,6696039000.0,2019-12-02 15:47:28.920924,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job629337', '...",...,0,792701,478902294,26,137212312,74236893737,70252703744,33310584,0,19476877777
3,2019-06-10 16:49:06.802212,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:31.650940,2019-06-10 18:39:32.439890,0,6625638000.0,2019-12-02 15:47:39.285445,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job633144', '...",...,0,793749,467152088,200,138242432,74236867345,70780125184,33303320,0,24198886180
4,2019-06-14 08:30:37.421228,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:39.925640,2019-06-14 11:18:38.154111,0,10080730000.0,2019-12-02 15:47:47.674135,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job676007', '...",...,0,832964,450158690,788,144266800,74236894547,73864601600,33071160,0,23712986693
5,2019-06-14 16:34:15.052476,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:48.246445,2019-06-14 18:14:24.986076,0,6009934000.0,2019-12-02 15:47:56.014609,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job680181', '...",...,0,793333,434850361,8,137224992,74236809886,70259195904,33538160,0,32443024755
6,2019-06-15 07:52:38.592038,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:47:56.611887,2019-06-15 09:49:24.210549,0,7005619000.0,2019-12-02 15:48:04.382412,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job685016', '...",...,0,799028,332821891,8,137199936,74236867987,70246367232,33653664,0,11304090572
7,2019-06-16 13:54:28.828890,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:48:05.024637,2019-06-16 14:06:18.129747,0,709300900.0,2019-12-02 15:48:12.756760,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job692544', '...",...,0,783079,457078582,424,137902488,74236883941,70606073856,33029440,0,18759054582
8,2019-06-16 16:20:31.601990,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:48:13.339553,2019-06-16 17:16:11.907347,0,3340305000.0,2019-12-02 15:48:21.178988,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job693147', '...",...,0,783373,452663282,11,137210472,74236938446,70251761664,33379836,0,21984544439
9,2019-06-17 06:20:59.842457,ESM4_historical_D151_ocean_month_rho2_1x1deg_1...,2019-12-02 15:48:21.769608,2019-06-17 07:22:16.747572,0,3676905000.0,2019-12-02 15:48:29.529477,"{'exp_name': 'ESM4_historical_D151', 'exp_comp...","{'tz': 'US/Eastern', 'status': {'exit_code': 0...","{'TMP': '/vftmp/Jeffrey.Durachta/job696127', '...",...,0,797766,468679853,289,137648664,74236837177,70476115968,33264792,0,17253128153


In [4]:
# if you prefer dealing with python lists and dictionaries,
# you can set fmt='dict'. Here we get a list of dictionaries
eq.get_jobs(jobs = jobs, fmt='dict')

[{'start': datetime.datetime(2019, 6, 9, 18, 53, 22, 574059),
  'jobname': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18540101',
  'created_at': datetime.datetime(2019, 12, 2, 15, 46, 42, 22221),
  'end': datetime.datetime(2019, 6, 9, 22, 23, 53, 234877),
  'exitcode': 0,
  'duration': 12630660818.0,
  'updated_at': datetime.datetime(2019, 12, 2, 15, 47, 10, 660314),
  'tags': {'exp_name': 'ESM4_historical_D151',
   'exp_component': 'ocean_month_rho2_1x1deg',
   'exp_time': '18540101',
   'atm_res': 'c96l49',
   'ocn_res': '0.5l75',
   'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18540101'},
  'info_dict': {'tz': 'US/Eastern',
   'status': {'exit_code': 0,
    'exit_reason': 'none',
    'script_path': '/home/Jeffrey.Durachta/ESM4/DECK/ESM4_historical_D151/gfdl.ncrc4-intel16-prod-openmp/scripts/postProcess/ESM4_historical_D151_ocean_month_rho2_1x1deg_18540101.tags',
    'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18540101'}},
  'env_dict': {'TMP': '/v

<a name="orm-objects"></a>
A very useful format is `orm`. It optimizes queries and lets you work with
the underlying Job (or Process) objects directly.

In [5]:
jobs_orm = eq.get_jobs(jobs, fmt='orm')
jobs_orm.count(), type(jobs_orm)

(12, sqlalchemy.orm.query.Query)

`jobs_orm` above is a `Query` object. The `Query` object can be iterated
over (like a Python list). You can convert it to a list by using the slice
operator -- `[:]`.

The ORM format is powerful as it minimizes the number of SQL queries and
lazy-evaluates queries where possible.

#### <a name="job-tags">Job Tags</a>

Each job has a `tags` field that is set at import time. The job tag is stored
as a dictionary of key/value pairs. The most common use of the job tag is for selecting
jobs. You can specify the tag either as a dictionary or as a string, with each key/value
pair separated by semicolons. All the key/value pairs must match for a job to be considered
a match.
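
The matching rule above can be sketched in plain Python. This is only an illustration of the semantics described here, not EPMT's implementation: `parse_tag` and `tag_matches` are hypothetical helper names, and the sketch assumes all tag values are compared as strings.

```python
def parse_tag(tag):
    """Turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'}; copy dicts unchanged."""
    if isinstance(tag, dict):
        return dict(tag)
    pairs = (item.split(':', 1) for item in tag.split(';') if item)
    return {key.strip(): value.strip() for key, value in pairs}


def tag_matches(job_tags, wanted):
    """True when every key/value pair in `wanted` is present in `job_tags`."""
    wanted = parse_tag(wanted)
    return all(job_tags.get(key) == value for key, value in wanted.items())


# Hypothetical job tag, shaped like the tags used in this workbook
job_tags = {'exp_name': 'ESM4_historical_D151',
            'exp_component': 'ocean_month_rho2_1x1deg',
            'exp_time': '19090101'}

# String form: both pairs are present, so the job matches
print(tag_matches(job_tags, 'exp_name:ESM4_historical_D151;exp_time:19090101'))  # True
# Dictionary form: the value differs, so the job does not match
print(tag_matches(job_tags, {'exp_time': '18540101'}))  # False
```

Note that the job may carry extra keys (here `exp_component`) and still match; only the requested pairs are checked.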

In [6]:
jobs_19090101 = eq.get_jobs(tags='exp_name:ESM4_historical_D151;exp_component:ocean_month_rho2_1x1deg;exp_time:19090101', fmt='orm')

In [7]:
for j in jobs_19090101:
    print(j.jobid, j.tags)

804285 {'exp_name': 'ESM4_historical_D151', 'exp_component': 'ocean_month_rho2_1x1deg', 'exp_time': '19090101', 'atm_res': 'c96l49', 'ocn_res': '0.5l75', 'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_19090101'}


#### <a name="jobs-order-filter">Ordering and Filtering Jobs</a>

You can use the `order`, `limit`, and `fltr` options with `get_jobs` to sort and filter the job list.
It is advisable to use `limit` when possible, as it adds a `LIMIT` clause to the SQL query
and reduces the time spent loading rows from the database.

In [8]:
# another useful query orders the jobs by duration, longest first;
# pass `limit=5` as well to keep only the top 5
df = eq.get_jobs(jobs, order=eq.desc(eq.Job.duration), fmt="pandas")
df[['jobid', 'tags', 'duration', 'exitcode']]

| | jobid | tags | duration | exitcode |
|---|---|---|---|---|
| 0 | 625172 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 12630660000.0 | 0 |
| 1 | 676007 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 10080730000.0 | 0 |
| 2 | 685016 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 7005619000.0 | 0 |
| 3 | 629337 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 6696039000.0 | 0 |
| 4 | 633144 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 6625638000.0 | 0 |
| 5 | 627922 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 6532174000.0 | 0 |
| 6 | 680181 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 6009934000.0 | 0 |
| 7 | 802954 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 3879024000.0 | 0 |
| 8 | 696127 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 3676905000.0 | 0 |
| 9 | 693147 | {'exp_name': 'ESM4_historical_D151', 'exp_comp... | 3340305000.0 | 0 |
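
To see why `limit` helps, here is a rough sketch of the kind of SQL that ordering with a limit turns into, using Python's built-in `sqlite3` module. The `jobs` table and its two columns are a hypothetical stand-in, not EPMT's actual schema; the durations are copied from the output above.

```python
import sqlite3

# In-memory database with a hypothetical, simplified jobs table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE jobs (jobid TEXT PRIMARY KEY, duration REAL)')
conn.executemany('INSERT INTO jobs VALUES (?, ?)', [
    ('693147', 3340305000.0),
    ('625172', 12630660000.0),
    ('629337', 6696039000.0),
    ('676007', 10080730000.0),
    ('685016', 7005619000.0),
])

# ORDER BY ... DESC with LIMIT makes the database return only the rows we need,
# instead of loading every row and trimming the list in Python
top3 = conn.execute(
    'SELECT jobid FROM jobs ORDER BY duration DESC LIMIT 3').fetchall()
print([row[0] for row in top3])  # ['625172', '676007', '685016']
conn.close()
```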


<a name="failed-jobs"></a>Let's figure out which jobs, if any, failed.

In [9]:
eq.get_jobs(jobs_orm, fltr=(eq.Job.exitcode != 0), fmt='terse')

[]

#### <a name="proc-sums-field">Aggregation across job processes (ADVANCED TOPIC)</a>

Each job object has a `proc_sums` field, a dictionary of key/value pairs that
aggregates metrics across the processes of the job.
This field is an attribute of the Job object. When converting from `orm`
to the other formats, the key/value pairs of the dictionary become
top-level fields of the `dict` or `pandas` dataframe:

In [10]:
j = jobs_orm.first()
j.proc_sums.keys()

dict_keys(['all_proc_tags', 'num_procs', 'num_threads', 'systemtime', 'rchar', 'minflt', 'time_oncpu', 'inblock', 'guest_time', 'read_bytes', 'syscw', 'timeslices', 'PERF_COUNT_SW_CPU_CLOCK', 'user+system', 'rdtsc_duration', 'cancelled_write_bytes', 'invol_ctxsw', 'syscr', 'delayacct_blkio_time', 'vol_ctxsw', 'usertime', 'majflt', 'outblock', 'wchar', 'write_bytes', 'rssmax', 'processor', 'time_waiting'])

Now, the fields shown above become available in other formats (`dict` and `pandas`) as top-level fields, while the `proc_sums`
field itself is masked.

In [11]:
j_df = eq.get_jobs(j, fmt='pandas')
j_df.columns.values

array(['start', 'jobname', 'created_at', 'end', 'exitcode', 'duration',
       'updated_at', 'tags', 'info_dict', 'env_dict', 'cpu_time',
       'annotations', 'env_changes_dict', 'analyses', 'submit', 'jobid',
       'user', 'all_proc_tags', 'num_procs', 'num_threads', 'systemtime',
       'rchar', 'minflt', 'time_oncpu', 'inblock', 'guest_time',
       'read_bytes', 'syscw', 'timeslices', 'PERF_COUNT_SW_CPU_CLOCK',
       'user+system', 'rdtsc_duration', 'cancelled_write_bytes',
       'invol_ctxsw', 'syscr', 'delayacct_blkio_time', 'vol_ctxsw',
       'usertime', 'majflt', 'outblock', 'wchar', 'write_bytes', 'rssmax',
       'processor', 'time_waiting'], dtype=object)
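
The masking described above amounts to promoting the keys of one nested dictionary to the top level. The sketch below shows the idea on a plain dictionary. `flatten_job` is a hypothetical helper, not an EPMT function, and the record is a cut-down stand-in for a job (the two metric values are those of job 625172 above).

```python
def flatten_job(job_record):
    """Return a copy with the `proc_sums` keys promoted to the top level."""
    flat = {key: value for key, value in job_record.items() if key != 'proc_sums'}
    flat.update(job_record.get('proc_sums', {}))
    return flat


# Hypothetical record shaped like a job with a nested proc_sums dictionary
job_record = {
    'jobid': '625172',
    'exitcode': 0,
    'proc_sums': {'majflt': 27, 'rssmax': 98645272},
}

flat = flatten_job(job_record)
print(sorted(flat))  # ['exitcode', 'jobid', 'majflt', 'rssmax']
```

The `proc_sums` key itself no longer appears in the flattened record, which mirrors how the field is masked in the `dict` and `pandas` formats.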

### <a name="process-query">Process Query</a>

A process query returns a collection of one or more processes. The query is
passed a `jobs` parameter to restrict the process set to those belonging to a
specified set of jobs.

Like the job query, the process query can take `tags`, `fmt`,
`fltr`, `order` and `limit` to filter and format the output. `order` and `limit` become
particularly important in process queries, as each job can have thousands of processes
that take time to load from the database. For the same reason, using `fmt='orm'` is a good
idea in process queries.

In [12]:
# If you want to get the processes belonging to a job
# here each row in the pandas dataframe contains one job process
# again, you can use the 'terse' fmt option to get just the list of database ids of the processes
eq.get_procs(['629337'], fmt='pandas')[:10]

Unnamed: 0,end,cpu_time,sid,created_at,duration,inclusive_cpu_time,gen,updated_at,tags,exename,...,syscw,read_bytes,write_bytes,cancelled_write_bytes,time_oncpu,time_waiting,timeslices,rdtsc_duration,PERF_COUNT_SW_CPU_CLOCK,user+system
0,2019-06-10 09:59:22.064416,0.0,16255,2019-12-02 15:47:27.897601,120.0,0.0,,2019-12-02 15:47:28.951113,{},tcsh,...,0,0,0,0,1966005,24572,1,399175,116082,0
1,2019-06-10 09:59:22.074184,3998.0,16255,2019-12-02 15:47:27.897614,182.0,3998.0,,2019-12-02 15:47:28.951124,{},mkdir,...,0,0,0,0,4907282,42356,5,613580,177971,3998
2,2019-06-10 09:59:22.137614,16996.0,16255,2019-12-02 15:47:27.897617,5215.0,16996.0,,2019-12-02 15:47:28.951128,{},modulecmd,...,1,0,4096,0,17479352,101721,10,17999644,5032465,16996
3,2019-06-10 09:59:22.155449,10997.0,16255,2019-12-02 15:47:27.897621,106.0,10997.0,,2019-12-02 15:47:28.951132,{},test,...,1,0,4096,0,11612639,67333,6,334404,97976,10997
4,2019-06-10 09:59:22.178281,14997.0,16255,2019-12-02 15:47:27.897624,4037.0,14997.0,,2019-12-02 15:47:28.951135,{},modulecmd,...,1,0,4096,0,15719947,97660,12,13927480,3840149,14997
5,2019-06-10 09:59:22.199641,10997.0,16255,2019-12-02 15:47:27.897627,96.0,10997.0,,2019-12-02 15:47:28.951139,{},test,...,1,0,4096,0,11355247,95352,7,299926,88208,10997
6,2019-06-10 09:59:22.298420,59990.0,16255,2019-12-02 15:47:27.897630,48198.0,59990.0,,2019-12-02 15:47:29.033068,{},perl,...,1,0,0,0,60730333,184677,9,166626070,47913625,59990
7,2019-06-10 09:59:22.328123,24995.0,16255,2019-12-02 15:47:27.897633,12747.0,24995.0,,2019-12-02 15:47:29.033080,{},perl,...,1,0,0,0,25407862,119673,7,44043982,12715756,24995
8,2019-06-10 09:59:22.344394,10998.0,16255,2019-12-02 15:47:27.897636,80.0,10998.0,,2019-12-02 15:47:29.033084,{},python,...,1,0,0,0,12275457,243829,7,242876,72110,10998
9,2019-06-10 09:59:22.358313,10997.0,16255,2019-12-02 15:47:27.897639,154.0,10997.0,,2019-12-02 15:47:29.033087,{},cat,...,1,0,0,0,11585273,9207210,12,499088,131647,10997


You can also pass in more than one job, in which case the returned processes
are the union of the processes of all the listed jobs. Here we use the `orm` format
to speed up the query, since we just want a `count` of processes.

In [13]:
procs = eq.get_procs(['629337', '625172'], fmt='orm')
procs.count()

15943

#### <a name="process-tags">Process Tags</a>

Each process saves a dictionary of key/value pairs, such as:
`{'op': "ncatted", 'op_instance': 12, 'op_sequence': 159}`

The process tag is commonly used to select the processes that constitute an **operation**, using the `tags` option.

In [14]:
# below we get the processes in an operation that is identified by 'op_sequence=66'
op = eq.get_procs(jobs, tags='op:cp;op_instance:11;op_sequence:66', fmt='pandas')
len(op)

1914

##### <a name="job-proc-tags">Unique process tags in a job (ADVANCED TOPIC)</a>

For a job we can determine the unique set of process tags across all its processes using the
`job_proc_tags` API call.

In [15]:
# suppose you want to figure out the unique set of operations
# across all the jobs of interest. We would pass in our list of
# jobs
eq.job_proc_tags(jobs_orm)

[{'op': 'cp', 'op_instance': '1', 'op_sequence': '119'},
 {'op': 'cp', 'op_instance': '1', 'op_sequence': '122'},
 {'op': 'cp', 'op_instance': '1', 'op_sequence': '123'},
 {'op': 'cp', 'op_instance': '11', 'op_sequence': '167'},
 {'op': 'cp', 'op_instance': '15', 'op_sequence': '180'},
 {'op': 'cp', 'op_instance': '3', 'op_sequence': '131'},
 {'op': 'cp', 'op_instance': '5', 'op_sequence': '140'},
 {'op': 'cp', 'op_instance': '7', 'op_sequence': '149'},
 {'op': 'cp', 'op_instance': '9', 'op_sequence': '158'},
 {'op': 'dmput', 'op_instance': '1', 'op_sequence': '126'},
 {'op': 'dmput', 'op_instance': '2', 'op_sequence': '190'},
 {'op': 'fregrid', 'op_instance': '1', 'op_sequence': '117'},
 {'op': 'fregrid', 'op_instance': '1', 'op_sequence': '121'},
 {'op': 'fregrid', 'op_instance': '2', 'op_sequence': '132'},
 {'op': 'fregrid', 'op_instance': '3', 'op_sequence': '141'},
 {'op': 'fregrid', 'op_instance': '4', 'op_sequence': '150'},
 {'op': 'fregrid', 'op_instance': '5', 'op_sequence': '
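
Conceptually, finding the unique process tags means de-duplicating tag dictionaries. The sketch below shows one way to do that in plain Python; it is not how `job_proc_tags` is implemented. `unique_tags` and the process records are hypothetical, and the sketch assumes untagged processes carry an empty dictionary.

```python
def unique_tags(processes):
    """Collect the distinct, non-empty tag dictionaries found in `processes`."""
    # Dictionaries are unhashable, so use their sorted items as set members
    seen = {tuple(sorted(p['tags'].items())) for p in processes if p['tags']}
    return [dict(items) for items in sorted(seen)]


# Hypothetical process records; two of them share the same tag
processes = [
    {'exename': 'cp', 'tags': {'op': 'cp', 'op_instance': '1', 'op_sequence': '119'}},
    {'exename': 'tcsh', 'tags': {}},
    {'exename': 'cp', 'tags': {'op': 'cp', 'op_instance': '1', 'op_sequence': '119'}},
    {'exename': 'dmput', 'tags': {'op': 'dmput', 'op_instance': '1', 'op_sequence': '126'}},
]

for tag in unique_tags(processes):
    print(tag)
```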

#### <a name="filter-processes">Filtering and Ordering Processes</a>

`order`, `limit` and `fltr` should be used where possible to reduce query time.

In [16]:
# now let's say we care about a particular operation. 
# Let's find the processes in the operation, and
# sort them by the cpu_time, and then see the top offenders
ncatted_procs = eq.get_procs(jobs_orm,
                             tags={'op': 'ncatted'},
                             order=eq.desc(eq.Process.cpu_time),
                             limit=10,
                             fmt='pandas')
ncatted_procs[['jobid', 'tags', 'exename', 'duration', 'cpu_time']]

| | jobid | tags | exename | duration | cpu_time |
|---|---|---|---|---|---|
| 0 | 680181 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1256.0 | 58990.0 |
| 1 | 680181 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncdump | 1112.0 | 53991.0 |
| 2 | 629337 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1143.0 | 48992.0 |
| 3 | 693147 | {'op': 'ncatted', 'op_instance': '5', 'op_sequ... | ncatted | 1118.0 | 48992.0 |
| 4 | 629337 | {'op': 'ncatted', 'op_instance': '3', 'op_sequ... | ncatted | 1119.0 | 48991.0 |
| 5 | 627922 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1037.0 | 47992.0 |
| 6 | 696127 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1082.0 | 47992.0 |
| 7 | 633144 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1085.0 | 47991.0 |
| 8 | 692544 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1053.0 | 47991.0 |
| 9 | 693147 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1042.0 | 46991.0 |


We could have used a more precise tag, such as `{'op': 'ncatted', 'op_sequence': 85}`,
for more granular selection, and also filtered on an exename, such as `ncatted`.

In [17]:
procs = eq.get_procs(jobs_orm, tags='op:ncatted;op_sequence:85',
                     fltr=(eq.Process.exename == "ncatted"),
                     order=(eq.desc(eq.Process.duration)),
                     fmt='pandas')
procs[['jobid', 'tags', 'exename', 'duration', 'cpu_time', 'exitcode']]

| | jobid | tags | exename | duration | cpu_time | exitcode |
|---|---|---|---|---|---|---|
| 0 | 680181 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1256.0 | 58990.0 | 0 |
| 1 | 629337 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1143.0 | 48992.0 | 0 |
| 2 | 633144 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1085.0 | 47991.0 | 0 |
| 3 | 696127 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1082.0 | 47992.0 | 0 |
| 4 | 692544 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1053.0 | 47991.0 | 0 |
| 5 | 693147 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1042.0 | 46991.0 | 0 |
| 6 | 627922 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 1037.0 | 47992.0 | 0 |
| 7 | 804285 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 588.0 | 22995.0 | 0 |
| 8 | 676007 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 569.0 | 23995.0 | 0 |
| 9 | 802954 | {'op': 'ncatted', 'op_instance': '15', 'op_seq... | ncatted | 536.0 | 21996.0 | 0 |


#### <a name="thread-metrics-aggregation">Process contains aggregated thread metrics (ADVANCED TOPIC)</a>

The `pandas` and `dict` formats, in addition to having process-level data in each row, also have fields that represent metrics aggregated across the underlying threads of the process, such as
`rssmax`, `cpu_time`, and `rchar`. The ORM `Process` object instead has a `threads_sums` attribute,
which is a dictionary containing these fields.

In [18]:
procs.columns.values

array(['end', 'cpu_time', 'sid', 'created_at', 'duration',
       'inclusive_cpu_time', 'gen', 'updated_at', 'tags', 'exename',
       'exitcode', 'info_dict', 'path', 'id', 'args', 'pid', 'jobid',
       'numtids', 'ppid', 'start', 'pgid', 'job', 'host', 'parent',
       'user', 'usertime', 'systemtime', 'rssmax', 'minflt', 'majflt',
       'inblock', 'outblock', 'vol_ctxsw', 'invol_ctxsw', 'processor',
       'delayacct_blkio_time', 'guest_time', 'rchar', 'wchar', 'syscr',
       'syscw', 'read_bytes', 'write_bytes', 'cancelled_write_bytes',
       'time_oncpu', 'time_waiting', 'timeslices', 'rdtsc_duration',
       'PERF_COUNT_SW_CPU_CLOCK', 'user+system'], dtype=object)
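
As a rough illustration of thread-level aggregation, the sketch below adds up a few counters across the threads of one process. It assumes the aggregation is a plain sum, which fits counters such as `usertime` and `systemtime`; a metric like `rssmax` may well be combined differently. `sum_thread_metrics` and the thread records are hypothetical.

```python
def sum_thread_metrics(threads, metrics):
    """Add up each named metric across the threads of one process."""
    return {metric: sum(thread[metric] for thread in threads) for metric in metrics}


# Hypothetical two-thread process; times are in microseconds
threads = [
    {'tid': 101, 'usertime': 30000, 'systemtime': 5000},
    {'tid': 102, 'usertime': 12000, 'systemtime': 3000},
]

threads_sums = sum_thread_metrics(threads, ['usertime', 'systemtime'])
# Process cpu time is user + system time summed over its threads
cpu_time = threads_sums['usertime'] + threads_sums['systemtime']
print(threads_sums, cpu_time)
```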

## <a name="ops">Operations</a>

An operation is simply a collection of processes that share a tag.
The collection of processes forms a **forest**. In the trivial case, where there is only one
root process, the forest is a single tree.
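
To make the forest idea concrete, the sketch below finds the roots of an operation: the processes whose parent is not itself part of the selected set. The `pid`/`ppid` records and the `operation_roots` helper are hypothetical and only illustrate the definition.

```python
def operation_roots(processes):
    """Return the pids whose parent is not part of the selected set."""
    selected = {p['pid'] for p in processes}
    return sorted(p['pid'] for p in processes if p['ppid'] not in selected)


# Hypothetical pid/ppid pairs for processes that share one tag
processes = [
    {'pid': 200, 'ppid': 1},    # root: parent 1 is outside the operation
    {'pid': 201, 'ppid': 200},  # child of 200, so not a root
    {'pid': 300, 'ppid': 1},    # a second root
]

print(operation_roots(processes))  # [200, 300]
```

With two roots, this operation is a forest of two trees; with a single root it would be one tree.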

### <a name="select-op-procs">Selecting processes in an operation</a>

We select the processes in an operation by passing a tag to `get_procs`.
You may limit the selection to a single job or multiple jobs using the
`jobs` parameter to `get_procs`.

In [19]:
# below we use the ORM format as we just want a count on the number of processes in the operation
hsmget_op_procs = eq.get_procs(jobs, tags='op:hsmget', fmt='orm')
hsmget_op_procs.count()

27720

### <a name="operation-primitive">The Operation primitive</a>

Using `get_procs` with a tag to select processes in an operation is somewhat
clumsy. The EPMT Query API defines an **Operation** primitive. The `Operation`
API call is passed one or more jobs, and a `tag`. Internally, it calls `get_procs`.
By using the `Operation` primitive, you get aggregated metrics across the
processes constituting the operation in the `proc_sums` attribute. You can specify a granular
tag such as `{'op': 'timavg', 'op_instance': 100, 'op_sequence': 5 }`, or a more
coarse tag, such as `{'op': 'timavg'}`. The important thing to understand is that
all the processes that constitute the operation share *ALL* the key/value pairs of the tag.

In [20]:
op = eq.Operation(jobs, {'op': 'hsmget'})
(op.tags, op.processes.count(), op.proc_sums)

({'op': 'hsmget'},
 27720,
 {'num_procs': 27720,
  'PERF_COUNT_SW_CPU_CLOCK': 2242055598562,
  'usertime': 2008400580,
  'outblock': 620041288,
  'write_bytes': 317461139456,
  'rssmax': 198999756,
  'processor': 0,
  'inblock': 197032,
  'systemtime': 531837341,
  'majflt': 245,
  'cpu_time': 2540237921.0,
  'minflt': 39641382,
  'syscw': 1674748,
  'time_oncpu': 2556135725081,
  'read_bytes': 100880384,
  'timeslices': 11577749,
  'numtids': 30212,
  'duration': 64869196565.0,
  'rdtsc_duration': -28107424259990747,
  'vol_ctxsw': 11465035,
  'rchar': 6068618970,
  'guest_time': 0,
  'syscr': 2446648,
  'user+system': 2540237921,
  'time_waiting': 172721616270,
  'cancelled_write_bytes': 1290240,
  'delayacct_blkio_time': 0,
  'wchar': 317449816715,
  'invol_ctxsw': 82453})

### <a name="op-metrics">Aggregating operation metrics</a>

The `Operation` primitive provides an easy way to obtain aggregates on metrics across
processes in an operation. Before `Operation`, the way to obtain metrics was to
use the `op_metrics` API call:

In [21]:
# widen the column display width to show the full tag
import pandas as pd
pd.set_option('display.max_colwidth', 200)

# get the operations with the top cpu_time summed across all processes.
# Note: cpu_time is a better measure of the time spent in an operation than
# 'duration', which can be double-counted, as in a
# parent-child process scenario where the parent waits on the child.
ops_df = eq.op_metrics(['629337', '680181'], fmt='pandas').sort_values(by='cpu_time', ascending=False)
ops_df[['jobid', 'tags', 'duration', 'cpu_time']][:10]

| | jobid | tags | duration | cpu_time |
|---|---|---|---|---|
| 24 | 629337 | {'op': 'fregrid', 'op_instance': '7', 'op_sequence': '80'} | 69375220.0 | 68409594.0 |
| 25 | 680181 | {'op': 'fregrid', 'op_instance': '7', 'op_sequence': '80'} | 55814600.0 | 55867500.0 |
| 29 | 680181 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '3'} | 499497900.0 | 53789735.0 |
| 31 | 680181 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '5'} | 440434400.0 | 49622360.0 |
| 116 | 629337 | {'op': 'ncrcat', 'op_instance': '13', 'op_sequence': '76'} | 48194410.0 | 48163676.0 |
| 117 | 680181 | {'op': 'ncrcat', 'op_instance': '13', 'op_sequence': '76'} | 46517210.0 | 44601217.0 |
| 26 | 629337 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'} | 358012700.0 | 39960829.0 |
| 34 | 629337 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '9'} | 110582500.0 | 36986292.0 |
| 28 | 629337 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '3'} | 2586972000.0 | 35084576.0 |
| 32 | 629337 | {'op': 'hsmget', 'op_instance': '1', 'op_sequence': '7'} | 39927360.0 | 34164711.0 |


#### <a name="dm-ops">Data movement operations</a>

The above call was slow to execute and returned a lot of operations. The `op_metrics` call can take a
list of tags if you know which operations you care about. Pruning with the `tags` argument speeds up
the call significantly. Let's figure out the time spent
in data movement operations v. useful work.
In the call to `op_metrics` below, we pass in the *list of tags* that
represent the data-movement operations. Because it is a list of tags, a process is
selected if it matches any one of them, like an OR operation.

In [22]:
dm_tags = ['op:hsmget', 'op:cp', 'op:dmget', 'op:gcp', 'op:mv', 'op:untar', 'op:tar', 'op:rm']
dm_ops_df = eq.op_metrics(jobs, tags = dm_tags)
dm_ops_df[['jobid', 'tags', 'cpu_time', 'duration', 'num_procs']]

| | jobid | tags | cpu_time | duration | num_procs |
|---|---|---|---|---|---|
| 0 | 625172 | {'op': 'hsmget'} | 525588783.0 | 12125210000.0 | 8860 |
| 1 | 627922 | {'op': 'hsmget'} | 151229296.0 | 5960385000.0 | 1713 |
| 2 | 629337 | {'op': 'hsmget'} | 221437661.0 | 6253407000.0 | 1713 |
| 3 | 633144 | {'op': 'hsmget'} | 207422750.0 | 6161460000.0 | 1713 |
| 4 | 676007 | {'op': 'hsmget'} | 187083822.0 | 9474841000.0 | 1713 |
| 5 | 680181 | {'op': 'hsmget'} | 230162385.0 | 5629497000.0 | 1713 |
| 6 | 685016 | {'op': 'hsmget'} | 123305670.0 | 6672575000.0 | 1713 |
| 7 | 692544 | {'op': 'hsmget'} | 190831238.0 | 267920100.0 | 1713 |
| 8 | 693147 | {'op': 'hsmget'} | 199442956.0 | 2911845000.0 | 1730 |
| 9 | 696127 | {'op': 'hsmget'} | 201617665.0 | 3222884000.0 | 1713 |


While the query above helps, we would like it to aggregate across jobs by tag. This
is easily accomplished by passing the <a name="group-by-tag">`group_by_tag`</a> 
argument to `op_metrics`:

In [23]:
dm_ops_df_grouped = eq.op_metrics(jobs, tags = dm_tags, group_by_tag = True)
dm_ops_df_grouped[['tags', 'cpu_time', 'duration', 'num_procs']]

| | tags | cpu_time | duration | num_procs |
|---|---|---|---|---|
| 0 | {'op': 'cp'} | 125672200.0 | 208446700.0 | 12827 |
| 1 | {'op': 'hsmget'} | 2540238000.0 | 64869200000.0 | 27720 |
| 2 | {'op': 'mv'} | 142701400.0 | 274484600.0 | 900 |
| 3 | {'op': 'rm'} | 26662950.0 | 36441140.0 | 2940 |
| 4 | {'op': 'untar'} | 45750640.0 | 66632440.0 | 2513 |
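
Grouping by tag amounts to summing the per-job rows that share a tag. The sketch below does this in plain Python for three of the `hsmget` rows from the ungrouped output above. It illustrates the idea only and is not how `op_metrics` computes its result; because it covers just three jobs, its totals are smaller than those in the grouped table.

```python
from collections import defaultdict

# Per-job rows for three jobs, copied from the ungrouped output above
rows = [
    {'jobid': '627922', 'tags': {'op': 'hsmget'}, 'cpu_time': 151229296.0, 'num_procs': 1713},
    {'jobid': '629337', 'tags': {'op': 'hsmget'}, 'cpu_time': 221437661.0, 'num_procs': 1713},
    {'jobid': '633144', 'tags': {'op': 'hsmget'}, 'cpu_time': 207422750.0, 'num_procs': 1713},
]

grouped = defaultdict(lambda: {'cpu_time': 0.0, 'num_procs': 0})
for row in rows:
    # Dictionaries are unhashable, so key each group on the tag's sorted items
    key = tuple(sorted(row['tags'].items()))
    grouped[key]['cpu_time'] += row['cpu_time']
    grouped[key]['num_procs'] += row['num_procs']

for key, sums in grouped.items():
    print(dict(key), sums)
```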


So, the total time spent in all data-movement operations can be calculated easily.

In [24]:
dm_ops_df_grouped['cpu_time'].sum()/1e6

2881.025086

In [25]:
# total time spent in the jobs
s = 0
for j in jobs_orm: s += j.cpu_time
s/1e6

7351.686315

In [26]:
# data-movement as a percentage of total cpu time
# (in IPython, `_` is the previous output and `__` is the one before it)
round((100*__/_), 2)

39.19
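
The `_` and `__` shortcuts exist only in an interactive IPython session. Outside a notebook, the same percentage can be computed with named variables, as in this sketch. The per-tag values are the rounded figures displayed in the grouped table above, and the sketch assumes `cpu_time` is in microseconds, like `duration`.

```python
# cpu_time per data-movement tag, copied from the grouped table above
dm_cpu_time = {
    'cp': 125672200.0,
    'hsmget': 2540238000.0,
    'mv': 142701400.0,
    'rm': 26662950.0,
    'untar': 45750640.0,
}
total_cpu_seconds = 7351.686315  # summed job cpu_time, from the cell above

# Convert to seconds (assuming microseconds), then take the ratio
dm_cpu_seconds = sum(dm_cpu_time.values()) / 1e6
dm_percent = round(100 * dm_cpu_seconds / total_cpu_seconds, 2)
print(dm_percent)  # 39.19
```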

#### <a name="cpu-time-v-duration">cpu time v. duration</a>

So, the data-movement operations take about `39%` of the total cpu time across our jobs.
We used `cpu_time` (a.k.a. exclusive cpu time) rather than `duration` for this calculation
because `duration` can be counted multiple times if a process forks and waits for a child
to terminate: the `duration` (wall-clock time) is then counted for both the parent process
and the child process. `cpu_time`, on the other hand, is the actual time spent on the cpu,
and cannot be counted twice in such a scenario.
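
A toy example makes the double counting visible. The two process records below are hypothetical: a parent that forks a child and mostly waits for it. Summing durations overshoots the elapsed time, while summing `cpu_time` does not.

```python
# Hypothetical parent that forks a child and waits for it; times in microseconds
processes = [
    {'exename': 'parent', 'start': 0, 'end': 10_000_000, 'cpu_time': 500_000},
    {'exename': 'child', 'start': 1_000_000, 'end': 9_000_000, 'cpu_time': 7_500_000},
]

wall_clock = max(p['end'] for p in processes) - min(p['start'] for p in processes)
summed_duration = sum(p['end'] - p['start'] for p in processes)
summed_cpu_time = sum(p['cpu_time'] for p in processes)

print(wall_clock)       # 10000000: the time that actually elapsed
print(summed_duration)  # 18000000: the child's 8 s also sits inside the parent's 10 s
print(summed_cpu_time)  # 8000000: each process contributes only its own cpu time
```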

## <a name="thread-query">Thread Query</a>

The thread query requires passing one or more *process primary keys* or `Process`
objects to `get_thread_metrics`. Let's illustrate this with an example, where
we first obtain the <a name="root-process">root process</a> of a job:

In [27]:
# let's find the root process for a particular job
root = eq.root('629337', fmt='orm')
root.pid

16269

In [28]:
root_threads_df = eq.get_thread_metrics(root)
display(root_threads_df.columns.values)
root_threads_df[['process_pk', 'tid', 'usertime', 'systemtime', 'rssmax']]

array(['tags', 'hostname', 'exename', 'path', 'args', 'exitcode', 'pid',
       'generation', 'ppid', 'pgid', 'sid', 'numtids', 'tid', 'start',
       'end', 'usertime', 'systemtime', 'rssmax', 'minflt', 'majflt',
       'inblock', 'outblock', 'vol_ctxsw', 'invol_ctxsw', 'num_threads',
       'starttime', 'processor', 'delayacct_blkio_time', 'guest_time',
       'rchar', 'wchar', 'syscr', 'syscw', 'read_bytes', 'write_bytes',
       'cancelled_write_bytes', 'time_oncpu', 'time_waiting',
       'timeslices', 'rdtsc_duration', 'PERF_COUNT_SW_CPU_CLOCK',
       'process_pk'], dtype=object)

Unnamed: 0,process_pk,tid,usertime,systemtime,rssmax
0,19355,16269,454930,352946,5516


## <a name="useful-attributes">Useful attributes in Job, Process and Thread objects</a>

The following are some useful attributes of the job, process and thread objects.
They are accessible when using the `orm` format, and are also available in the
`pandas` and `dict` formats, with two exceptions:

The `proc_sums` field of the Job object is masked in the `pandas` and `dict` formats,
and the underlying keys of the dictionary are exposed at the top level.

The `threads_sums` field of the Process object is masked in the `pandas` and `dict` formats,
and the underlying keys of the dictionary are exposed at the top level.
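
To picture what "exposed at the top level" means, here is a minimal sketch in plain Python. The `flatten_sums` helper and the sample record are hypothetical and only mimic the shape of the data; this is not how EPMT implements the conversion.

```python
def flatten_sums(record, field):
    """Return a copy of record with the dict stored under `field` merged into the top level."""
    flat = {k: v for k, v in record.items() if k != field}
    flat.update(record.get(field, {}))
    return flat

# Hypothetical process record shaped like the ORM view, with a nested threads_sums
proc = {
    "pid": 5488,
    "exename": "fregrid",
    "threads_sums": {"usertime": 62258535, "systemtime": 5995088},
}

flat_proc = flatten_sums(proc, "threads_sums")
print(sorted(flat_proc))  # ['exename', 'pid', 'systemtime', 'usertime']
```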

### Job Attributes
 - duration: this is the wallclock time in microseconds
 - cpu_time: user+system time aggregated across all processes of the job
 - start:    start time in microseconds since epoch
 - end:      end time in microseconds since epoch
 - jobid:    database id for job (unique)
 - exitcode: return code from job
 - tags:     dict of key/value pairs
 - processes: list of processes belonging to job
 - proc_sums: aggregates across processes of a job
 

### Process Attributes
 - duration: this is the wallclock time in microseconds
 - cpu_time: exclusive user+system time for process (aggregated across its threads)
 - inclusive_cpu_time: user+system time for the process and *all its descendants*
 - start:    start time in microseconds since epoch
 - end:      end time in microseconds since epoch
 - tags:     dict of key/value pairs
 - threads_df: json serialized dataframe of process threads (ADVANCED)
 - threads_sums: key/value pairs consisting of sums of thread metrics (ADVANCED)
 - numtids:  number of threads
 - exename
 - args
 - pid
 - ppid
 - id:       database ID for process
 - exitcode
 - parent
 - children
 - ancestors
 - descendants
 
 
### Thread Attributes
 - usertime
 - systemtime
 - user+system
 - rssmax
 - majflt
 - read_bytes
 - write_bytes

## <a name="useful-queries">Misc. queries</a>

Below are some more queries to give you a flavor of how to use the API.

In [29]:
# ordinarily we would first find the job and then probe downwards
# You can use tags or fltr arguments to find the job
# As we did not include job tags in this script, let's just find the job using
# its job id
job = eq.get_jobs('676007', fmt='dict')[0]
job

{'start': datetime.datetime(2019, 6, 14, 8, 30, 37, 421228),
 'jobname': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101',
 'created_at': datetime.datetime(2019, 12, 2, 15, 47, 39, 925640),
 'end': datetime.datetime(2019, 6, 14, 11, 18, 38, 154111),
 'exitcode': 0,
 'duration': 10080732883.0,
 'updated_at': datetime.datetime(2019, 12, 2, 15, 47, 47, 674135),
 'tags': {'exp_name': 'ESM4_historical_D151',
  'exp_component': 'ocean_month_rho2_1x1deg',
  'exp_time': '18740101',
  'atm_res': 'c96l49',
  'ocn_res': '0.5l75',
  'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101'},
 'info_dict': {'tz': 'US/Eastern',
  'status': {'exit_code': 0,
   'exit_reason': 'none',
   'script_path': '/home/Jeffrey.Durachta/ESM4/DECK/ESM4_historical_D151/gfdl.ncrc4-intel16-prod-openmp/scripts/postProcess/ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101.tags',
   'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101'}},
 'env_dict': {'TMP': '/vftmp/Jeffrey.Dura
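
In the (truncated) job dictionary above, `duration` is in microseconds while `start` and `end` are `datetime` objects. As a quick sanity check, using only the values printed above, the duration matches the difference between the two timestamps:

```python
from datetime import datetime, timedelta

# values copied from the job dictionary above
start = datetime(2019, 6, 14, 8, 30, 37, 421228)
end = datetime(2019, 6, 14, 11, 18, 38, 154111)
duration_us = 10080732883.0

elapsed = end - start
print(elapsed)  # 2:48:00.732883
print(elapsed == timedelta(microseconds=int(duration_us)))  # True
```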

In [30]:
# now get the processes that are part of this job, sorted by inclusive cpu time
# we need to pass in the job id to restrict the query to a particular job
# the inclusive_cpu_time sums the cpu times of the process and all its descendants
# in this case you can see that after the top-level 'tcsh', 'fregrid' and 'ncra'
# show up
procs = eq.get_procs('676007', order = (eq.desc(eq.Process.inclusive_cpu_time)), fmt='pandas', limit=10)
procs[['exename', 'duration', 'inclusive_cpu_time', 'exitcode']]

Unnamed: 0,exename,duration,inclusive_cpu_time,exitcode
0,tcsh,10080580000.0,607624279.0,0
1,fregrid,72611590.0,68253623.0,0
2,ncra,88403260.0,55002636.0,0
3,tcsh,40762420.0,38443149.0,0
4,TAVG.exe,40354060.0,38386164.0,0
5,tcsh,34855330.0,34631728.0,0
6,TAVG.exe,34593670.0,34583741.0,0
7,perl,38512960.0,32827920.0,0
8,perl,37960580.0,32017044.0,0
9,make,33665030.0,31420174.0,0
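
The relationship between exclusive and inclusive CPU time can be sketched with a toy process tree. The `Proc` class and the numbers below are hypothetical stand-ins, not EPMT objects; the point is only that a parent's inclusive time adds up its own time and that of every descendant, which is why the top-level `tcsh` leads the table above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Proc:
    exename: str
    cpu_time: float  # exclusive cpu time of this process only
    children: List["Proc"] = field(default_factory=list)

def inclusive_cpu_time(proc):
    """Exclusive time of proc plus the inclusive time of each child, recursively."""
    return proc.cpu_time + sum(inclusive_cpu_time(c) for c in proc.children)

# hypothetical tree: a shell that spawns two workers
shell = Proc("tcsh", 2.0, [Proc("fregrid", 60.0), Proc("ncra", 30.0)])

print(shell.cpu_time)             # 2.0
print(inclusive_cpu_time(shell))  # 92.0
```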


<a name="process-tree-walk"></a>Let's do a walk through the process tree.

In [31]:
# now let's walk through the process tree. To make this easy, we use the 'orm' format
# let's sort the processes by exclusive cpu time
# You will get a sorted list of ORM objects, let's see the top 10
procs = eq.get_procs('676007', order = (eq.desc(eq.Process.cpu_time)), fmt='orm')[:10]
[p.pid for p in procs]

[5488, 5218, 3238, 4196, 4560, 4036, 4837, 13027, 29936, 3809]

In [32]:
# let's pick the first one
p = procs[0]

In [33]:
p.exename

'fregrid'

In [34]:
p.exename, p.args, p.duration, len(p.children), p.numtids

('fregrid',
 '--standard_dimension --input_mosaic ocean_mosaic.nc --input_file all --interp_method conserve_order1 --remap_file .fregrid_remap_file_360_by_180.nc --nlon 360 --nlat 180 --scalar_field volcello,thkcello,vo,vmo,vhGM,vhml --output_file out.nc',
 72611586.0,
 0,
 1)

In [35]:
parent = p.parent

In [36]:
parent.exename, parent.args, parent.pid, len(parent.children), len(parent.descendants)

('tcsh',
 '-f /home/Jeffrey.Durachta/ESM4/DECK/ESM4_historical_D151/gfdl.ncrc4-intel16-prod-openmp/scripts/postProcess/ESM4_historical_D151_ocean_month_rho2_1x1deg_18740101.tags',
 27339,
 729,
 3309)
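
The `parent`, `children`, `ancestors` and `descendants` attributes used above follow the usual tree relationships. The sketch below rebuilds them on a hypothetical `Node` class so the walk can be tried without a database; it illustrates the concept and is not EPMT's implementation.

```python
class Node:
    """Hypothetical stand-in for a process with parent/children links."""
    def __init__(self, exename, parent=None):
        self.exename = exename
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def ancestors(node):
    """Walk upwards via .parent until the root is reached."""
    chain = []
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain

def descendants(node):
    """Collect every node below `node` using an explicit stack."""
    found = []
    stack = list(node.children)
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(current.children)
    return found

root = Node("tcsh")
script = Node("tcsh", parent=root)
worker = Node("fregrid", parent=script)
helper = Node("mv", parent=script)

print([n.exename for n in ancestors(worker)])        # ['tcsh', 'tcsh']
print(len(script.children), len(descendants(root)))  # 2 3
```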

In [37]:
# let's see the aggregate thread metrics for this process
p.threads_sums

{'usertime': 62258535,
 'systemtime': 5995088,
 'rssmax': 58112,
 'minflt': 9946,
 'majflt': 3,
 'inblock': 12968064,
 'outblock': 2141984,
 'vol_ctxsw': 355,
 'invol_ctxsw': 180,
 'processor': 0,
 'delayacct_blkio_time': 0,
 'guest_time': 0,
 'rchar': 15140041925,
 'wchar': 2185067697,
 'syscr': 1741380,
 'syscw': 33346,
 'read_bytes': 6639648768,
 'write_bytes': 1096695808,
 'cancelled_write_bytes': 0,
 'time_oncpu': 68265620909,
 'time_waiting': 15819937,
 'timeslices': 536,
 'rdtsc_duration': 251077780608,
 'PERF_COUNT_SW_CPU_CLOCK': 68221670024,
 'user+system': 68253623}
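
Two quick checks on these sums, using only values printed above: `user+system` is the sum of `usertime` and `systemtime`, and dividing it by the process `duration` (72611586.0 microseconds, from cell 34) gives a rough CPU utilization. The variable names are illustrative.

```python
# subset of the threads_sums output above, plus the duration from cell 34
threads_sums = {"usertime": 62258535, "systemtime": 5995088, "user+system": 68253623}
duration_us = 72611586.0

cpu_us = threads_sums["usertime"] + threads_sums["systemtime"]
utilization = round(cpu_us / duration_us, 2)

print(cpu_us == threads_sums["user+system"])  # True
print(utilization)                            # 0.94
```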

In [38]:
# let's get the thread dataframes for p
eq.get_thread_metrics(p)

Unnamed: 0,tags,hostname,exename,path,args,exitcode,pid,generation,ppid,pgid,...,syscw,read_bytes,write_bytes,cancelled_write_bytes,time_oncpu,time_waiting,timeslices,rdtsc_duration,PERF_COUNT_SW_CPU_CLOCK,process_pk
0,op:fregrid;op_instance:7;op_sequence:80,pp015,fregrid,/home/fms/local/opt/fre-nctools/bronx-14/gfdl/bin/fregrid,"--standard_dimension --input_mosaic ocean_mosaic.nc --input_file all --interp_method conserve_order1 --remap_file .fregrid_remap_file_360_by_180.nc --nlon 360 --nlat 180 --scalar_field volcello,th...",0,5488,0,27339,27303,...,33346,6639648768,1096695808,0,68265620909,15819937,536,251077780608,68221670024,26052


In [39]:
# Let's explore a particular operation in a job, and see which processes took the 
# top *inclusive* cpu time.
# Let's limit the output to the top 5 results
# and let's get a pandas dataframe
procs = eq.get_procs(jobs, tags = 'op_sequence:159', order=eq.desc(eq.Process.inclusive_cpu_time), limit=5, fmt='pandas')
procs[['exename', 'args', 'cpu_time', 'inclusive_cpu_time', 'duration']]

Unnamed: 0,exename,args,cpu_time,inclusive_cpu_time,duration
0,fregrid,--standard_dimension --input_mosaic ocean_mosaic.nc --input_file annual --interp_method conserve_order1 --remap_file .fregrid_remap_file_360_by_180.nc --nlon 360 --nlat 180 --scalar_field volcello...,10237442.0,10237442.0,10219947.0
1,mv,out.nc annual.nc,462929.0,462929.0,456714.0
2,mv,annual.nc ocean_month_rho2_1x1deg.1851.ann.nc,43992.0,43992.0,36877.0


<a name="failed-procs"></a>Let's see if there are any failed processes in our job selection.

In [40]:
# Let's find the failed processes across our jobs subset
failed_procs = eq.get_procs(jobs_orm, fltr=(eq.Process.exitcode > 0), fmt='pandas')
failed_procs[['jobid', 'exename', 'args', 'exitcode', 'tags']]

Unnamed: 0,jobid,exename,args,exitcode,tags
0,625172,ln,-f /ptmp/Jeffrey.Durachta/archive/Jeffrey.Durachta/ESM4/DECK/ESM4_historical_D151/gfdl.ncrc4-intel16-prod-openmp/history/18500101.nc/18500101.ocean_month_rho2.nc /vftmp/Jeffrey.Durachta/job625172/...,1,"{'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'}"
1,625172,which,globus-ftp-client-cksm-test,1,"{'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'}"
2,625172,which,globus-ftp-client-mlst-test,1,"{'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'}"
3,625172,which,globus-ftp-client-ascii-verbose-list-test,1,"{'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'}"
4,625172,which,globus-ftp-client-delete-test,1,"{'op': 'hsmget', 'op_instance': '1', 'op_sequence': '1'}"
...,...,...,...,...,...
1537,804285,which,globus-ftp-client-delete-test,1,"{'op': 'mv', 'op_instance': '18', 'op_sequence': '83'}"
1538,804285,which,globus-ftp-client-cksm-test,1,"{'op': 'mv', 'op_instance': '18', 'op_sequence': '86'}"
1539,804285,which,globus-ftp-client-mlst-test,1,"{'op': 'mv', 'op_instance': '18', 'op_sequence': '86'}"
1540,804285,which,globus-ftp-client-ascii-verbose-list-test,1,"{'op': 'mv', 'op_instance': '18', 'op_sequence': '86'}"


Let's focus only on a particular operation, and prune the list a bit

In [41]:
failed_procs = eq.get_procs(jobs, tags='op_sequence:79', fltr=(eq.Process.exitcode > 0), fmt='pandas')
failed_procs[['jobid', 'id', 'exename', 'args', 'exitcode']]

Unnamed: 0,jobid,id,exename,args,exitcode
0,627922,15654,which,globus-ftp-client-cksm-test,1
1,627922,15655,which,globus-ftp-client-mlst-test,1
2,627922,15656,which,globus-ftp-client-ascii-verbose-list-test,1
3,627922,15657,which,globus-ftp-client-delete-test,1
4,629337,19066,which,globus-ftp-client-cksm-test,1
5,629337,19067,which,globus-ftp-client-mlst-test,1
6,629337,19068,which,globus-ftp-client-ascii-verbose-list-test,1
7,629337,19069,which,globus-ftp-client-delete-test,1
8,633144,22478,which,globus-ftp-client-cksm-test,1
9,633144,22479,which,globus-ftp-client-mlst-test,1
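
When the list of failures is long, a summary by job or by command is often easier to read. A minimal sketch with `collections.Counter`, using the ten rows shown above as plain tuples (with a dataframe you would more likely group by column):

```python
from collections import Counter

# (jobid, exename, args) copied from the table above
failed = [
    (627922, "which", "globus-ftp-client-cksm-test"),
    (627922, "which", "globus-ftp-client-mlst-test"),
    (627922, "which", "globus-ftp-client-ascii-verbose-list-test"),
    (627922, "which", "globus-ftp-client-delete-test"),
    (629337, "which", "globus-ftp-client-cksm-test"),
    (629337, "which", "globus-ftp-client-mlst-test"),
    (629337, "which", "globus-ftp-client-ascii-verbose-list-test"),
    (629337, "which", "globus-ftp-client-delete-test"),
    (633144, "which", "globus-ftp-client-cksm-test"),
    (633144, "which", "globus-ftp-client-mlst-test"),
]

failures_by_job = Counter(row[0] for row in failed)
failures_by_cmd = Counter((row[1], row[2]) for row in failed)

print(failures_by_job)
print(failures_by_cmd.most_common(2))
```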


In [42]:
# let's explore one of the failed processes
p = eq.Process[int(failed_procs.loc[0,'id'])]
p.exename, p.exitcode, p.args, p.duration, p.parent.pid

('which', 1, 'globus-ftp-client-cksm-test', 2963.0, 10291)

### <a name="timeline">Timeline</a>
Sometimes you want to get a timeline of the processes in the order they were spawned.

In [43]:
procs = eq.timeline(jobs, fmt='pandas', limit=25)
procs[['exename', 'tags', 'start', 'duration']]

Unnamed: 0,exename,tags,start,duration
0,tcsh,"{'op': 'dmput', 'op_instance': '2', 'op_sequence': '190'}",2019-06-09 18:53:22.610123,12630590000.0
1,tcsh,{},2019-06-09 18:53:22.614091,113.0
2,mkdir,{},2019-06-09 18:53:22.623899,131.0
3,modulecmd,{},2019-06-09 18:53:22.664680,3656.0
4,test,{},2019-06-09 18:53:22.678745,54.0
5,modulecmd,{},2019-06-09 18:53:22.689498,1551.0
6,test,{},2019-06-09 18:53:22.701312,41.0
7,modulecmd,{},2019-06-09 18:53:22.711901,358694.0
8,perl,{},2019-06-09 18:53:22.745150,15821.0
9,perl,{},2019-06-09 18:53:22.770251,4346.0
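
Conceptually, a timeline is the set of processes ordered by `start`. The sketch below assumes that is all the ordering involves (the internals of `eq.timeline` are not shown here) and sorts three records taken from the table above:

```python
from datetime import datetime

# three rows from the table above, deliberately out of order
procs = [
    {"exename": "mkdir", "start": datetime(2019, 6, 9, 18, 53, 22, 623899)},
    {"exename": "tcsh", "start": datetime(2019, 6, 9, 18, 53, 22, 610123)},
    {"exename": "modulecmd", "start": datetime(2019, 6, 9, 18, 53, 22, 664680)},
]

# order by spawn time
timeline = sorted(procs, key=lambda p: p["start"])
print([p["exename"] for p in timeline])  # ['tcsh', 'mkdir', 'modulecmd']
```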


In [44]:
# The orm also gives an easy way to navigate the process hierarchy
# Let's use the ORM directly to walk through the job
j = eq.get_jobs('629337', fmt='orm').first()
j

Job['629337']

In [45]:
# Notice we have a Job object. The processes in the job
# are available as j.processes
j.duration, j.cpu_time, j.exitcode, j.tags

(6696039124.0,
 623964730.0,
 0,
 {'exp_name': 'ESM4_historical_D151',
  'exp_component': 'ocean_month_rho2_1x1deg',
  'exp_time': '18640101',
  'atm_res': 'c96l49',
  'ocn_res': '0.5l75',
  'script_name': 'ESM4_historical_D151_ocean_month_rho2_1x1deg_18640101'})

In [46]:
# first we ask for the aggregate metrics for a single job
# Here, we don't specify any tags. For single jobs, when
# we don't specify the operation/tags, they are queried from the job
op_sums = eq.op_metrics(jobs='629337', fmt='pandas')
display(op_sums.columns.values)
op_sums[['jobid', 'tags', 'duration', 'cpu_time']]

array(['PERF_COUNT_SW_CPU_CLOCK', 'usertime', 'outblock', 'write_bytes',
       'systemtime', 'syscw', 'time_oncpu', 'read_bytes',
       'rdtsc_duration', 'delayacct_blkio_time', 'invol_ctxsw', 'rssmax',
       'processor', 'majflt', 'minflt', 'vol_ctxsw', 'rchar',
       'guest_time', 'syscr', 'user+system', 'time_waiting',
       'cancelled_write_bytes', 'inblock', 'wchar', 'timeslices', 'job',
       'jobid', 'tags', 'num_procs', 'numtids', 'cpu_time', 'duration'],
      dtype=object)

Unnamed: 0,jobid,tags,duration,cpu_time
0,629337,"{'op': 'cp', 'op_instance': '11', 'op_sequence': '66'}",3151497.0,2078506.0
1,629337,"{'op': 'cp', 'op_instance': '15', 'op_sequence': '79'}",2698541.0,1699557.0
2,629337,"{'op': 'cp', 'op_instance': '3', 'op_sequence': '30'}",3238803.0,2191485.0
3,629337,"{'op': 'cp', 'op_instance': '5', 'op_sequence': '39'}",2973848.0,2229484.0
4,629337,"{'op': 'cp', 'op_instance': '7', 'op_sequence': '48'}",3036848.0,2439456.0
...,...,...,...,...
84,629337,"{'op': 'untar', 'op_instance': '3', 'op_sequence': '38'}",565170.0,623891.0
85,629337,"{'op': 'untar', 'op_instance': '4', 'op_sequence': '47'}",553606.0,620889.0
86,629337,"{'op': 'untar', 'op_instance': '5', 'op_sequence': '56'}",566693.0,629884.0
87,629337,"{'op': 'untar', 'op_instance': '6', 'op_sequence': '65'}",690359.0,629890.0
