## Data Introduction

In [1]:
import pandas as pd

In [2]:
jobs = pd.read_csv("../data/fullsample.csv", nrows = 1000)
jobs.head(5)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0


The fullsample dataset contains job records, with one row per job.

Each job gets a unique ID, contained in the **JOBID** column.

Some jobs are submitted as arrays of similar jobs. These appear with an underscore in the JOBID, where the number after the underscore is the task number. For example, JOBID 31781951 was an array job with 10 tasks.

In [3]:
jobs[jobs['JOBID'].str.contains('31781951')]

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
533,31781951_1,COMPLETED,2021-08-30T12:51:30,2021-09-08T02:17:41,16384Mn,10234.37M,12-00:00:00,8-13:26:11,1,12,production,0:0
534,31781951_2,COMPLETED,2021-08-30T12:51:30,2021-09-07T18:04:48,16384Mn,10247.40M,12-00:00:00,8-05:13:18,1,12,production,0:0
535,31781951_3,COMPLETED,2021-08-31T09:14:29,2021-09-08T16:36:06,16384Mn,10064.47M,12-00:00:00,8-07:21:37,1,12,production,0:0
536,31781951_4,COMPLETED,2021-09-01T01:59:50,2021-09-08T08:48:28,16384Mn,10004.80M,12-00:00:00,7-06:48:38,1,12,production,0:0
537,31781951_5,COMPLETED,2021-09-02T00:09:27,2021-09-08T23:58:57,16384Mn,9858.72M,12-00:00:00,6-23:49:30,1,12,production,0:0
538,31781951_6,COMPLETED,2021-09-02T16:19:55,2021-09-10T11:16:57,16384Mn,10065.06M,12-00:00:00,7-18:57:02,1,12,production,0:0
539,31781951_7,COMPLETED,2021-09-02T22:26:08,2021-09-10T18:48:31,16384Mn,10092.55M,12-00:00:00,7-20:22:23,1,12,production,0:0
540,31781951_8,COMPLETED,2021-09-03T10:54:14,2021-09-11T09:32:28,16384Mn,10146.98M,12-00:00:00,7-22:38:14,1,12,production,0:0
541,31781951_9,COMPLETED,2021-09-04T22:54:03,2021-09-12T16:16:04,16384Mn,10050.81M,12-00:00:00,7-17:22:01,1,12,production,0:0
542,31781951_10,COMPLETED,2021-09-06T06:54:35,2021-09-14T13:02:37,16384Mn,10042.53M,12-00:00:00,8-06:08:02,1,12,production,0:0
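
As a minimal sketch of working with array jobs, the JOBID can be split into a base job ID and a task number. The variable names (`job_ids`, `base_id`, `task_number`) are illustrative and not part of the dataset; the sample values are copied from rows shown in this notebook.

```python
import pandas as pd

# Stand-in for jobs['JOBID']; values copied from rows displayed above.
job_ids = pd.Series(["30616928", "31364111_2", "31781951_1", "31781951_10"], name="JOBID")

# Split on the first underscore: the left part is the base job ID,
# the right part (if present) is the array task number.
parts = job_ids.str.split("_", n=1, expand=True)
base_id = parts[0]

# Non-array jobs have no task number, so use a nullable integer dtype.
task_number = pd.to_numeric(parts[1], errors="coerce").astype("Int64")

is_array_job = task_number.notna()
print(pd.DataFrame({"base_id": base_id, "task_number": task_number, "is_array_job": is_array_job}))
```

Comparing `base_id` for equality is stricter than `str.contains`, which would also match any other JOBID that happens to contain the same digits.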


Jobs can be in one of a few different states, the most common being 'COMPLETED'.

In [4]:
jobs['STATE'].value_counts()

STATE
COMPLETED    997
RUNNING        1
NODE_FAIL      1
CANCELLED      1
Name: count, dtype: int64

The **BEGIN** field indicates when the job was started (initiated on a compute node).

The **END** field indicates when the job ended (completed, failed, or was cancelled while running).
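
Both fields are read in as strings, and END can hold the value "Unknown" for a job that is still running. A possible way to convert them to timestamps is sketched below; the explicit format string is an assumption based on the rows displayed above, and the `times` frame is a small stand-in for the full data.

```python
import pandas as pd

# Stand-in for the BEGIN and END columns; values copied from rows displayed above.
times = pd.DataFrame({
    "BEGIN": ["2021-07-31T22:15:00", "2021-08-06T11:36:09"],
    "END": ["Unknown", "2021-09-05T11:36:32"],
})

# Assumed timestamp layout, inferred from the sample rows.
fmt = "%Y-%m-%dT%H:%M:%S"

begin = pd.to_datetime(times["BEGIN"], format=fmt)
# errors="coerce" turns values that do not match the format ("Unknown") into NaT.
end = pd.to_datetime(times["END"], format=fmt, errors="coerce")

# Wall-clock duration; NaT wherever END is unknown.
elapsed = end - begin
print(elapsed)
```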

The **REQMEM** field is the amount of memory requested in megabytes. It can be per-core/CPU (Mc) or per-node (Mn).



In [5]:
# Jobs where memory was requested per core.
jobs[jobs['REQMEM'].str[-2:] == 'Mc'].head()

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
501,31776583_1,COMPLETED,2021-08-30T10:16:59,2021-09-01T02:04:11,4096Mc,1792.43M,14-00:00:00,1-15:47:12,1,1,production,0:0
502,31776584_12,COMPLETED,2021-08-30T10:17:00,2021-09-01T00:20:15,4096Mc,1792.43M,14-00:00:00,1-14:03:15,1,1,production,0:0
915,31793401_958,COMPLETED,2021-08-31T19:36:46,2021-09-01T00:37:11,4096Mc,2788.05M,05:00:00,05:00:25,1,1,production,0:0
916,31793401_987,COMPLETED,2021-08-31T20:33:46,2021-09-01T00:02:57,4096Mc,2779.27M,05:00:00,03:29:11,1,1,production,0:0


In [6]:
# Jobs where memory was requested per node.
jobs[jobs['REQMEM'].str[-2:] == 'Mn'].head()

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0
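
To compare requests across jobs, it can help to separate the numeric amount from the unit suffix. The sketch below also derives an approximate per-node figure for per-core requests, under the assumption (not stated in the data) that cores are spread evenly across nodes. The new column names are illustrative, and the third sample row is hypothetical.

```python
import pandas as pd

req = pd.DataFrame({
    # The first two rows mirror rows displayed above; the third is hypothetical.
    "REQMEM": ["2048Mn", "4096Mc", "4096Mc"],
    "NODES": [1, 1, 1],
    "CPUS": [1, 1, 4],
})

# Separate the numeric amount from the two-character unit suffix (Mc or Mn).
extracted = req["REQMEM"].str.extract(r"^(?P<amount>\d+(?:\.\d+)?)(?P<unit>M[cn])$")
req["REQMEM_MB"] = extracted["amount"].astype(float)
req["REQMEM_UNIT"] = extracted["unit"]

# Assumption: cores are divided evenly among nodes, so a per-core request
# is scaled by CPUS / NODES to estimate the per-node request.
cores_per_node = req["CPUS"] / req["NODES"]
per_core = req["REQMEM_UNIT"] == "Mc"
req["REQMEM_MB_PER_NODE"] = req["REQMEM_MB"].where(~per_core, req["REQMEM_MB"] * cores_per_node)

print(req)
```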


The **USEDMEM** column is the amount of memory used, in megabytes per node.
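
In the rows displayed above, these values carry a trailing "M", except for a bare "0". One way to turn them into numbers, assuming no other suffixes occur in the full data, is:

```python
import pandas as pd

# Stand-in for jobs['USEDMEM']; values copied from rows displayed above.
used = pd.Series(["0", "20604.62M", "9733.43M"], name="USEDMEM")

# Drop the trailing "M" where present, then convert to float.
# Assumption: "M" is the only suffix that occurs.
used_mb = pd.to_numeric(used.str.rstrip("M"))
print(used_mb)
```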

The requested time (**REQTIME**) and used time (**USEDTIME**) columns are formatted as d-hh:mm:ss, or as hh:mm:ss for jobs lasting less than one day.
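
`pd.to_timedelta` does not accept the d-hh:mm:ss layout directly, so one option is to rewrite the hyphen first. The helper name `parse_slurm_time` is illustrative, not part of the dataset or of pandas.

```python
import pandas as pd

def parse_slurm_time(values: pd.Series) -> pd.Series:
    """Convert 'd-hh:mm:ss' or 'hh:mm:ss' strings to Timedelta values."""
    # "30-00:00:23" becomes "30 days 00:00:23"; "10:04:00" is left unchanged.
    return pd.to_timedelta(values.str.replace("-", " days ", regex=False))

# Values copied from the REQTIME and USEDTIME columns displayed above.
sample = pd.Series(["10:04:00", "30-00:00:23", "8-13:26:11"])
parsed = parse_slurm_time(sample)
print(parsed)
```

With both columns converted, the fraction of requested time that was actually used can be computed by dividing one Timedelta column by the other.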

**NODES** is the number of servers used for the job. Most jobs are single-node. For multi-node jobs, memory usage is the maximum over all nodes.

**CPUS** is the total number of CPU cores allocated to the job, and for multi-node jobs, this includes all nodes.

Most jobs are run in the "production" or "nogpfs" partition. The "debug" and "sam" partitions are for test jobs that are expected to be short, and the "maxwell", "pascal", and "turing" partitions are for GPU resources.

In [7]:
jobs['PARTITION'].value_counts()

PARTITION
production        791
nogpfs            163
pascal             32
cgw-platypus        5
cgw-capra1          4
cgw-dougherty1      3
cgw-tbi01           1
turing              1
Name: count, dtype: int64

The **EXITCODE** gives the [exit code](https://www.agileconnection.com/article/overview-linux-exit-codes) for the job, with "0:0" indicating a successful job. Exit codes have two numbers: a non-zero first number indicates a problem on the server side, and a non-zero second number indicates a problem on the user side.

In [8]:
jobs['EXITCODE'].value_counts()

EXITCODE
0:0     998
1:0       1
0:15      1
Name: count, dtype: int64

In [9]:
jobs[jobs['EXITCODE'] == '1:0']

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
18,31418105,NODE_FAIL,2021-08-19T10:09:50,2021-09-17T08:45:10,92160Mn,0,41-16:00:00,28-22:35:20,1,8,cgw-dougherty1,1:0


In [10]:
jobs[jobs['EXITCODE'] == '0:15']

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
42,31669402,CANCELLED,2021-08-28T10:53:59,2021-09-05T10:53:57,65536Mn,5229.75M,8-00:00:00,7-23:59:58,9,10,production,0:15
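
Since the two parts of the exit code carry different meanings, it may be convenient to split them into separate integer columns. This is a minimal sketch; the column names `server_side` and `user_side` are illustrative and simply follow the description above.

```python
import pandas as pd

# Stand-in for jobs['EXITCODE']; these are the three values seen in the sample.
codes = pd.Series(["0:0", "1:0", "0:15"], name="EXITCODE")

# Split "a:b" into two integer columns.
split = codes.str.split(":", expand=True).astype(int)
split.columns = ["server_side", "user_side"]

# A job counts as failed here if either part is non-zero.
failed = (split["server_side"] != 0) | (split["user_side"] != 0)
print(split.assign(failed=failed))
```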


The slurm_wrapper_ce5.log and slurm_wrapper_ce6.log files contain logs of jobs submitted from the Open Science Grid.

In [11]:
ce5 = pd.read_csv('../data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  nrows=100)

ce5.head()

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."


For this project, we are interested in jobs from user 9204 (the test user) where the command starts with '/usr/bin/squeue', the returncode is non-zero, and the time is greater than 15. These conditions indicate that the scheduler became unresponsive at that point in time.
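
The conditions above could be expressed as a filter along the following lines. This is a sketch only: the `log` frame uses the same six-column layout as `ce5`, but its rows are hypothetical, and the exact prefix strings ("user ", "time ", "returncode ", "command [") are assumptions based on the five rows displayed above.

```python
import pandas as pd

# Hypothetical rows in the same six-column layout as ce5.
log = pd.DataFrame({
    0: ["2020-10-16 08:15:39.278699", "2020-10-16 09:00:00.000000", "2020-10-16 09:05:00.000000"],
    1: ["user 0", "user 9204", "user 9204"],
    2: ["retry 0", "retry 0", "retry 0"],
    3: ["time 0.07347559928894043", "time 20.5", "time 0.1"],
    4: ["returncode 0", "returncode 1", "returncode 0"],
    5: [
        "command ['/usr/bin/sacct', '-u', 'x']",
        "command ['/usr/bin/squeue', '-u', 'x']",
        "command ['/usr/bin/squeue', '-u', 'x']",
    ],
})

# Strip the field labels and convert to usable types.
user = log[1].str.replace("user ", "", regex=False)
elapsed = log[3].str.replace("time ", "", regex=False).astype(float)
returncode = log[4].str.replace("returncode ", "", regex=False).astype(int)
is_squeue = log[5].str.startswith("command ['/usr/bin/squeue'")

# Test user, squeue command, non-zero return code, and time greater than 15.
unresponsive = log[(user == "9204") & is_squeue & (returncode != 0) & (elapsed > 15)]
print(unresponsive)
```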