In [1]:
import pandas as pd

__Introduction & dataset overview__


ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

The goal is to evaluate whether the data support the hypothesis that bursts of job completions contribute to scheduler unresponsiveness.

The project uses three datasets:  

* fullsample.csv: Contains Slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".  
* slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every Slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid). Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:  
    * return code = 1
    * execution time > 15 seconds

**Phase 1: Explore the Data**  
Objectives:  
* Understand the purpose of each dataset.  
* Inspect column types, sizes, and example rows.  

In [2]:
# Read the fullsample dataset into jobs dataframe
jobs = pd.read_csv("../data/fullsample.csv")
jobs.head()

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0


In [3]:
jobs.shape

(7395885, 12)

* jobs dataframe contains 7395885 rows and 12 columns

* The fullsample dataset contains job records, with one row per job.

* Each job gets a unique ID, contained in the **JOBID** column.

* Some jobs are submitted as arrays of similar jobs. These are listed with an underscore in the JOBID, where the number after the underscore indicates the task number. For example, JOBID 31781951 was an array job with 10 parts.
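
The array-job naming convention described above can be unpacked into separate columns. The snippet below is a minimal sketch on a toy series (the JOBID values are copied from the sample rows shown earlier); the names `array_job_id`, `task_id`, and `is_array_job` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy JOBID values taken from the sample rows of the jobs dataframe.
jobids = pd.Series(["30853133", "31364111_2", "31874232_484"])

# Split on the first underscore: the left part is the (array) job ID,
# the right part is the task number, which is missing for non-array jobs.
parts = jobids.str.split("_", n=1, expand=True)
array_job_id = parts[0]

# errors="coerce" leaves NaN for non-array jobs and for any suffix
# that is not a plain integer.
task_id = pd.to_numeric(parts[1], errors="coerce")

# Flag rows that belong to an array job.
is_array_job = jobids.str.contains("_", regex=False)

print(pd.DataFrame({"array_job_id": array_job_id,
                    "task_id": task_id,
                    "is_array_job": is_array_job}))
```

Grouping by `array_job_id` would then give the number of tasks recorded per array job.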

In [4]:
jobs.columns

Index(['JOBID', 'STATE', 'BEGIN', 'END', 'REQMEM', 'USEDMEM', 'REQTIME',
       'USEDTIME', 'NODES', 'CPUS', 'PARTITION', 'EXITCODE'],
      dtype='object')

column | description
-------|---------
JOBID | The identification number of the job or job step. Array jobs are in the form ArrayJobID_ArrayTaskID
STATE | Job state or status (COMPLETED, CANCELLED, FAILED, TIMEOUT, PREEMPTED, etc.)
BEGIN | Beginning time for the job.
END | Ending time for the job.
REQMEM | Requested memory in megabytes. May be per-core (Mc) or per-node (Mn)
USEDMEM | Used memory in megabytes per-node
REQTIME | Requested time in d-hh:mm:ss or hh:mm:ss
USEDTIME | Used time in d-hh:mm:ss or hh:mm:ss
NODES | Number of servers used for this job
CPUS | Total number of CPU-cores allocated to the job
PARTITION | Identifies the partition on which the job ran.
EXITCODE | The exit code returned by the job script or salloc, typically as set by the exit() function. Following the colon is the signal that caused the process to terminate if it was terminated by a signal.
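
Two of the columns described in the table pack several pieces of information into one string: REQMEM carries a per-core or per-node suffix, and EXITCODE carries an exit status and a signal. The sketch below shows one possible way to separate them on a toy dataframe. Scaling a per-core request by CPUS and a per-node request by NODES is an assumption made for illustration, and the column names `reqmem_total_mb`, `exit_status`, and `signal` are hypothetical.

```python
import pandas as pd

# Toy rows in the same format as the jobs dataframe.
sample = pd.DataFrame({
    "REQMEM": ["2048Mn", "4096Mc", "262144Mn"],
    "CPUS": [1, 8, 1],
    "NODES": [1, 1, 1],
    "EXITCODE": ["0:0", "0:15", "1:0"],
})

# REQMEM: a number of megabytes followed by "Mc" (per core) or "Mn" (per node).
mem = sample["REQMEM"].str.extract(r"^(?P<mb>\d+(?:\.\d+)?)M(?P<unit>[cn])$")
mb = mem["mb"].astype(float)
per_core = mem["unit"] == "c"

# Assumption: total request = per-core value * CPUS, or per-node value * NODES.
multiplier = sample["CPUS"].where(per_core, sample["NODES"])
sample["reqmem_total_mb"] = mb * multiplier

# EXITCODE: "<exit status>:<signal>".
codes = sample["EXITCODE"].str.split(":", expand=True).astype(int)
sample["exit_status"] = codes[0]
sample["signal"] = codes[1]

print(sample)
```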

In [5]:
jobs.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   JOBID      7395885 non-null  object
 1   STATE      7395885 non-null  object
 2   BEGIN      7395885 non-null  object
 3   END        7395885 non-null  object
 4   REQMEM     7395885 non-null  object
 5   USEDMEM    7395885 non-null  object
 6   REQTIME    7395885 non-null  object
 7   USEDTIME   7395885 non-null  object
 8   NODES      7395885 non-null  int64 
 9   CPUS       7395885 non-null  int64 
 10  PARTITION  7395885 non-null  object
 11  EXITCODE   7395885 non-null  object
dtypes: int64(2), object(10)
memory usage: 677.1+ MB


* All columns in the jobs dataframe are non-null. Note that a missing END is stored as the string "Unknown", so it still counts as non-null here.

In [6]:
(jobs['JOBID'].value_counts()>1).sum()

np.int64(0)

* No JOBID appears more than once, so each row corresponds to a unique JOBID.

In [7]:
jobs['NODES'].value_counts().sort_values(ascending =False).reset_index()

Unnamed: 0,NODES,count
0,1,7385963
1,2,3525
2,3,2716
3,8,1447
4,4,860
...,...,...
70,133,1
71,28,1
72,26,1
73,65,1


The vast majority of jobs (7,385,963 of 7,395,885) ran on a single node. Jobs on 2, 3, and 8 nodes are the next most common.

In [8]:
jobs['PARTITION'].value_counts().sort_values(ascending =False)

PARTITION
production              7019578
nogpfs                   147229
pascal                   124453
sam                       64967
turing                    21424
maxwell                   11278
cgw-maizie                 4309
debug                      1616
cgw-platypus                379
cgw-dsi-gw                  228
cgw-capra1                  157
cgw-dougherty1              125
cgw-horus                    61
cgw-cqs1                     28
cgw-hanuman                  21
cgw-sideshowbob              14
cgw-vm-qa-flatearth1          9
cgw-tbi01                     8
cgw-rocksteady                1
Name: count, dtype: int64

Most jobs (7,019,578) ran on the "production" partition, followed by "nogpfs" and "pascal".

In [9]:
jobs.groupby(['STATE','EXITCODE'])['JOBID'].count().sort_values(ascending =False)

STATE                EXITCODE
COMPLETED            0:0         7375084
CANCELLED            0:0            5378
FAILED               1:0            2780
CANCELLED            0:15           1886
OUT_OF_MEMORY        0:125          1739
                                  ...   
CANCELLED by 649311  0:0               1
CANCELLED by 649321  0:15              1
CANCELLED by 651701  0:9               1
CANCELLED by 879160  1:0               1
CANCELLED by 161909  0:0               1
Name: JOBID, Length: 227, dtype: int64

Most jobs (7,375,084) are in the "COMPLETED" state with EXITCODE "0:0", followed by CANCELLED and FAILED jobs.

In [10]:
jobs[jobs['STATE']=='RUNNING']['END'].unique()

array(['Unknown'], dtype=object)

For every job in the 'RUNNING' state, 'END' holds the string 'Unknown' rather than a datetime.

In [11]:
# Convert the BEGIN and END columns to datetime format.
# errors='coerce' turns values that do not match the format (such as 'Unknown') into NaT.
jobs['BEGIN'] = pd.to_datetime(jobs['BEGIN'],
                               format="%Y-%m-%dT%H:%M:%S", errors='coerce')

jobs['END'] = pd.to_datetime(jobs['END'],
                             format="%Y-%m-%dT%H:%M:%S", errors='coerce')
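
A side effect of `errors='coerce'` is that the 'Unknown' end times of running jobs become `NaT`, which can then be used as a flag. The toy example below is a small sketch of that behavior; the name `still_running` is illustrative only.

```python
import pandas as pd

# Toy END values: one running job ("Unknown") and one finished job.
end = pd.Series(["Unknown", "2021-09-05T11:36:32"])

# Values that do not match the format are coerced to NaT instead of raising.
parsed = pd.to_datetime(end, format="%Y-%m-%dT%H:%M:%S", errors="coerce")

# NaT marks rows whose end time was not a valid timestamp.
still_running = parsed.isna()

print(parsed)
print(still_running)
```

Comparing `jobs['END'].isna().sum()` with the number of rows whose original value was 'Unknown' would confirm that no other values were silently coerced.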



In [12]:
#convert the USEDTIME and REQTIME columns to timedelta format

jobs['USEDTIME'] = pd.to_timedelta(jobs['USEDTIME'].str.replace("-", " days ", 1))
jobs['REQTIME'] = pd.to_timedelta(jobs['REQTIME'].str.replace("-", " days ", 1))
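
To see why the hyphen is replaced before parsing, the sketch below applies the same transformation to a few toy values in the `d-hh:mm:ss` and `hh:mm:ss` formats (copied from the REQTIME and USEDTIME sample rows). Strings such as "30 days 00:00:23" are a form that `pd.to_timedelta` accepts.

```python
import pandas as pd

# Toy values in the two formats used by REQTIME and USEDTIME.
raw = pd.Series(["10:04:00", "30-00:00:23", "24-09:00:17"])

# Replace only the first hyphen so "30-00:00:23" becomes "30 days 00:00:23".
normalized = raw.str.replace("-", " days ", n=1, regex=False)
parsed = pd.to_timedelta(normalized)

print(parsed)
```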

In [13]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype          
---  ------     -----          
 0   JOBID      object         
 1   STATE      object         
 2   BEGIN      datetime64[ns] 
 3   END        datetime64[ns] 
 4   REQMEM     object         
 5   USEDMEM    object         
 6   REQTIME    timedelta64[ns]
 7   USEDTIME   timedelta64[ns]
 8   NODES      int64          
 9   CPUS       int64          
 10  PARTITION  object         
 11  EXITCODE   object         
dtypes: datetime64[ns](2), int64(2), object(6), timedelta64[ns](2)
memory usage: 677.1+ MB


In [14]:
jobs.sample(5)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
1490091,31874232_484,COMPLETED,2021-08-31 23:48:03,2021-08-31 23:51:01,2048Mn,145.05M,0 days 00:15:00,0 days 00:02:58,1,1,production,0:0
5573816,26869887_2,COMPLETED,2021-01-20 10:46:27,2021-01-20 10:46:53,5120Mn,0,5 days 00:00:00,0 days 00:00:26,1,1,production,0:0
2926770,29621505_5594,COMPLETED,2021-06-15 06:47:02,2021-06-15 09:13:45,16384Mn,4320.47M,0 days 10:00:00,0 days 02:26:43,1,1,production,0:0
4647805,28088347_245,COMPLETED,2021-03-21 05:55:22,2021-03-21 17:31:07,500Mn,9.05M,5 days 00:00:00,0 days 11:35:45,1,1,production,0:0
461511,32364713_9047,COMPLETED,2021-09-13 22:59:29,2021-09-13 23:03:58,2048Mn,176.31M,0 days 00:15:00,0 days 00:04:29,1,1,production,0:0


The __slurm_wrapper_ce5.log__ and __slurm_wrapper_ce6.log__ files log the Slurm commands executed by the CE5 and CE6 servers, the gateways to the Open Science Grid.

In [15]:
# Read slurm_wrapper_ce5.log into the ce5 dataframe

ce5 = pd.read_csv('../data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python')

ce5.head()

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."


In [16]:
ce5.shape

(4770893, 6)

* ce5 dataframe contains 4770893 rows and 6 columns

In [17]:
ce5.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4770893 entries, 0 to 4770892
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       4770893 non-null  object
 1   1       4770893 non-null  object
 2   2       4770893 non-null  object
 3   3       4770893 non-null  object
 4   4       4770893 non-null  object
 5   5       4770893 non-null  object
dtypes: object(6)
memory usage: 218.4+ MB


In [18]:
# Rename the column names of ce5 dataframe

new_column_names = ['date', 'user', 'retry' , 'time', 'return_code', 'command']
ce5.columns = new_column_names
ce5.head()

Unnamed: 0,date,user,retry,time,return_code,command
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."


In [19]:
# Read slurm_wrapper_ce6.log into the ce6 dataframe

ce6 = pd.read_csv('../data/slurm_wrapper_ce6.log',
                  header=None,
                  delimiter=' - ',
                  engine='python')

ce6.head()

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 10:37:44.163454,user 9202,retry 0,time 0.08495402336120605,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
1,2020-10-16 10:37:44.206654,user 9202,retry 0,time 0.08943057060241699,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
2,2020-10-16 10:37:44.218760,user 9202,retry 0,time 0.05928945541381836,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
3,2020-10-16 10:37:44.256403,user 9202,retry 0,time 0.038695573806762695,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
4,2020-10-16 10:37:44.611603,user 9202,retry 0,time 0.03343677520751953,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."


In [20]:
ce6.shape

(4776520, 6)

ce6 dataframe contains 4776520 rows and 6 columns

In [21]:
ce6.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4776520 entries, 0 to 4776519
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       4776520 non-null  object
 1   1       4776520 non-null  object
 2   2       4776520 non-null  object
 3   3       4776520 non-null  object
 4   4       4776520 non-null  object
 5   5       4776520 non-null  object
dtypes: object(6)
memory usage: 218.7+ MB


In [22]:
#Rename the column names

new_column_names = ['date', 'user', 'retry' , 'time', 'return_code', 'command']
ce6.columns = new_column_names
ce6.head()

Unnamed: 0,date,user,retry,time,return_code,command
0,2020-10-16 10:37:44.163454,user 9202,retry 0,time 0.08495402336120605,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
1,2020-10-16 10:37:44.206654,user 9202,retry 0,time 0.08943057060241699,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
2,2020-10-16 10:37:44.218760,user 9202,retry 0,time 0.05928945541381836,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
3,2020-10-16 10:37:44.256403,user 9202,retry 0,time 0.038695573806762695,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
4,2020-10-16 10:37:44.611603,user 9202,retry 0,time 0.03343677520751953,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."


**Phase 2: Clean and Transform the Data**  
Objectives:  
* Extract job completions from fullsample.csv.  
* Parse CE5 and CE6 logs to identify unresponsive events.  
* Create analysis-ready features (time windows, completion counts, unresponsiveness indicators).  
* Optionally include other features (currently running jobs or resource usage, time-of-day).  
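
One possible shape for the analysis-ready features listed above is a table with one row per fixed time window. The sketch below builds such a table from toy timestamps; in the notebook, the inputs would be the END times of completed jobs and the timestamps of unresponsive events. The 5-minute window and the column names (`completions`, `unresponsive_events`, `is_unresponsive`, `hour`) are assumptions for illustration, not choices made in this project.

```python
import pandas as pd

# Toy stand-ins for completion END times and unresponsive event times.
completion_ends = pd.Series(pd.to_datetime([
    "2020-10-18 06:50:10", "2020-10-18 06:52:30",
    "2020-10-18 06:53:24", "2020-10-18 07:20:00",
]))
unresponsive_times = pd.Series(pd.to_datetime([
    "2020-10-18 06:53:44", "2020-10-18 06:54:04",
]))

window = "5min"  # hypothetical window size

# Count events per window by flooring each timestamp to the window start.
completions = completion_ends.dt.floor(window).value_counts().rename("completions")
unresponsive = unresponsive_times.dt.floor(window).value_counts().rename("unresponsive_events")

# Align both counts on the window start time.
features = pd.concat([completions, unresponsive], axis=1).sort_index()

# Add the windows in which nothing happened, and fill the gaps with zero counts.
full_index = pd.date_range(features.index.min(), features.index.max(), freq=window)
features = features.reindex(full_index).fillna(0).astype(int)

# Indicator and time-of-day features.
features["is_unresponsive"] = features["unresponsive_events"] > 0
features["hour"] = features.index.hour

print(features)
```

A table like this makes it straightforward to compare completion counts between windows with and without unresponsive events.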


In [69]:
#Extract job completions from fullsample.csv

completed_jobs = jobs[(jobs['STATE'] == 'COMPLETED') & (jobs['EXITCODE'] == '0:0')]
print(f' There are {completed_jobs.shape[0]} COMPLETED jobs in the dataset')

 There are 7375084 COMPLETED jobs in the dataset


In [60]:
# Parse CE5 and CE6 logs to identify unresponsive events.

# Stack the two dataframes vertically; both have the same columns

concat_logs = pd.concat([ce5,ce6], ignore_index = True)

concat_logs.shape

(9547413, 6)

In [61]:
concat_logs.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9547413 entries, 0 to 9547412
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   date         9547413 non-null  object
 1   user         9547413 non-null  object
 2   retry        9547413 non-null  object
 3   time         9547413 non-null  object
 4   return_code  9547413 non-null  object
 5   command      9547413 non-null  object
dtypes: object(6)
memory usage: 437.0+ MB


In [62]:
concat_logs['date'].nunique()

9546796

In [63]:
# Convert the date column to datetime and truncate the fractional seconds.
# errors='coerce' turns unparseable values into NaT.

concat_logs['date'] = pd.to_datetime(concat_logs['date'],errors='coerce').dt.floor('s')


In [64]:
concat_logs['date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 9547413 entries, 0 to 9547412
Series name: date
Non-Null Count    Dtype         
--------------    -----         
9547398 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 72.8 MB


In [65]:
# Strip the 'time' prefix and convert the execution time (in seconds) to float
concat_logs['time'] = concat_logs['time'].str.replace('time', "", regex=False).astype(float)
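
The other log fields carry the same kind of text prefix ("user ", "retry ", "returncode "). The sketch below shows how they could be converted to integers in the same way, and how the executable name could be pulled out of the command string. It runs on a toy dataframe named `logs`, not on `concat_logs`, so the later cells are unaffected. The second command is copied from the log output shown in this notebook; the first is a made-up placeholder.

```python
import pandas as pd

# Toy rows in the same format as the parsed log columns.
logs = pd.DataFrame({
    "user": ["user 0", "user 9204"],
    "retry": ["retry 0", "retry 1"],
    "return_code": ["returncode 0", "returncode 1"],
    "command": [
        "command ['/usr/bin/squeue']",  # hypothetical placeholder command
        "command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']",
    ],
})

# Strip each literal prefix and convert the remainder to an integer.
for col, prefix in [("user", "user "), ("retry", "retry "), ("return_code", "returncode ")]:
    logs[col] = logs[col].str.replace(prefix, "", regex=False).astype(int)

# Capture the executable name that follows '/usr/bin/' in the command list.
logs["executable"] = logs["command"].str.extract(r"\['/usr/bin/(\w+)'", expand=False)

print(logs[["user", "retry", "return_code", "executable"]])
```

With numeric columns, the unresponsiveness filter could be written as comparisons on integers and on `executable == 'sbatch'` instead of on prefixed strings.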


In [66]:
concat_logs.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9547413 entries, 0 to 9547412
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   date         9547398 non-null  datetime64[ns]
 1   user         9547413 non-null  object        
 2   retry        9547413 non-null  object        
 3   time         9547413 non-null  float64       
 4   return_code  9547413 non-null  object        
 5   command      9547413 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 437.0+ MB


In [67]:
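pd.set_option('display.max_colwidth', None)  # Display full column values (here, the command column) without truncation

# Unresponsive events: failed sbatch commands from user 9204 that took longer than 15 seconds.
# regex=False matches the literal text and avoids the regex match-group warning.
unresponsive_jobs = concat_logs[
    (concat_logs['user'] == 'user 9204')
    & (concat_logs['return_code'] == 'returncode 1')
    & (concat_logs['time'] > 15)
    & (concat_logs['command'].str.contains("'/usr/bin/sbatch'", regex=False))
]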



In [68]:
unresponsive_jobs.head()

Unnamed: 0,date,user,retry,time,return_code,command
49958,2020-10-18 06:53:44,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']"
49972,2020-10-18 06:54:04,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']"
50467,2020-10-18 07:47:25,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1d9e86b90.1559883/bl_fa5Tsv']"
50473,2020-10-18 07:47:45,user 9204,retry 1,20.045221,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1d9e86b90.1559883/bl_fa5Tsv']"
50582,2020-10-18 07:53:33,user 9204,retry 0,20.041486,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dcd1d3c0.1559883/bl_x3mVd1']"


In [70]:
completed_jobs.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
1,30853133,COMPLETED,2021-08-06 11:36:09,2021-09-05 11:36:32,262144Mn,20604.62M,30 days,30 days 00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06 19:04:39,2021-09-05 19:04:53,204800Mn,57553.77M,30 days,30 days 00:00:14,1,32,cgw-tbi01,0:0


In [73]:
# Export unresponsive_jobs to a CSV file
unresponsive_jobs.to_csv('unresponsive_jobs.csv')
unresponsive_jobs.head()

Unnamed: 0,date,user,retry,time,return_code,command
49958,2020-10-18 06:53:44,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']"
49972,2020-10-18 06:54:04,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']"
50467,2020-10-18 07:47:25,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1d9e86b90.1559883/bl_fa5Tsv']"
50473,2020-10-18 07:47:45,user 9204,retry 1,20.045221,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1d9e86b90.1559883/bl_fa5Tsv']"
50582,2020-10-18 07:53:33,user 9204,retry 0,20.041486,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dcd1d3c0.1559883/bl_x3mVd1']"


__Merging unresponsive jobs logs with completed jobs using merge_asof()__

In [96]:
# merge_asof requires both dataframes to be sorted by their merge keys
unresponsive_jobs = unresponsive_jobs.sort_values(by='date')
completed_jobs = completed_jobs.sort_values(by='END')
merge_unresponsive_jobs = pd.merge_asof(left=unresponsive_jobs, right=completed_jobs, left_on='date', right_on='END')
print(f' After merging unresponsive jobs with completed jobs got {merge_unresponsive_jobs.shape[0]} rows and {merge_unresponsive_jobs.shape[1]} columns')

 After merging unresponsive jobs with completed jobs got 3296 rows and 18 columns
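
It may help to spell out what this merge does. With its default `direction='backward'`, `merge_asof` behaves like a left join that attaches to each left row the single most recent right row whose key is at or before the left key. The result therefore has one row per unresponsive event and records only the latest completion before it, not how many completions occurred. The toy sketch below illustrates this; the `tolerance` argument and the `lag` column are additions for illustration and are not used in the merge above.

```python
import pandas as pd

# Toy unresponsive events (left) and completed jobs (right), both sorted by key.
events = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-18 06:53:44", "2020-10-18 09:00:00"]),
})
completions = pd.DataFrame({
    "END": pd.to_datetime(["2020-10-18 06:13:19", "2020-10-18 06:53:24"]),
    "JOBID": ["25041808", "25041598"],
})

# Backward search: the last completion at or before each event,
# but only if it ended within 10 minutes of the event.
matched = pd.merge_asof(
    events, completions,
    left_on="date", right_on="END",
    direction="backward",
    tolerance=pd.Timedelta("10min"),
)

# Time between the matched completion and the event.
matched["lag"] = matched["date"] - matched["END"]

print(matched)
```

In this toy data the second event has no completion within the tolerance, so its JOBID and END are missing. Without a tolerance it would be matched to a completion from hours earlier.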


In [97]:
merge_unresponsive_jobs.head()

Unnamed: 0,date,user,retry,time,return_code,command,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,2020-10-18 06:16:25,user 9204,retry 0,20.037672,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x5572a7c77310.3390891/bl_23341e2dd5ae']",25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,0 days 12:00:00,0 days 00:05:12,1,1,production,0:0
1,2020-10-18 06:38:44,user 9204,retry 0,20.038736,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x5572a841eb00.3390891/bl_9f06196a57ac']",25032600_8,COMPLETED,2020-10-17 16:47:50,2020-10-18 06:38:03,4096Mn,335.27M,0 days 23:00:00,0 days 13:50:13,1,1,production,0:0
2,2020-10-18 06:53:44,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']",25041598,COMPLETED,2020-10-18 06:31:59,2020-10-18 06:53:24,21878Mn,1.63M,2 days 00:00:00,0 days 00:21:25,1,8,nogpfs,0:0
3,2020-10-18 06:54:04,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1dd302460.1559883/bl_B3aCvf']",25041598,COMPLETED,2020-10-18 06:31:59,2020-10-18 06:53:24,21878Mn,1.63M,2 days 00:00:00,0 days 00:21:25,1,8,nogpfs,0:0
4,2020-10-18 07:47:25,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x55c1d9e86b90.1559883/bl_fa5Tsv']",25042059,COMPLETED,2020-10-18 07:25:37,2020-10-18 07:47:02,21875Mn,1.83M,2 days 00:00:00,0 days 00:21:25,1,8,nogpfs,0:0


In [98]:
merge_unresponsive_jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3296 entries, 0 to 3295
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype          
---  ------       --------------  -----          
 0   date         3296 non-null   datetime64[ns] 
 1   user         3296 non-null   object         
 2   retry        3296 non-null   object         
 3   time         3296 non-null   float64        
 4   return_code  3296 non-null   object         
 5   command      3296 non-null   object         
 6   JOBID        3296 non-null   object         
 7   STATE        3296 non-null   object         
 8   BEGIN        3296 non-null   datetime64[ns] 
 9   END          3296 non-null   datetime64[ns] 
 10  REQMEM       3296 non-null   object         
 11  USEDMEM      3296 non-null   object         
 12  REQTIME      3296 non-null   timedelta64[ns]
 13  USEDTIME     3296 non-null   timedelta64[ns]
 14  NODES        3296 non-null   int64          
 15  CPUS         3296 non-null   int64    

__TIME WINDOWS__

In [108]:
# Rolling 10-minute sum of sbatch execution time ('time'), with windows ending at each event's date
merge_unresponsive_jobs['rolling_10min'] = merge_unresponsive_jobs.rolling(window='10min', on='date')['time'].sum()
#ax = merge_unresponsive_jobs.plot(x='date', y='JOBID', label= 'Raw Counts')
#merge_unresponsive_jobs.plot(x='date', y='rolling_10min', label= 'Rolling Average', ax=ax);
merge_unresponsive_jobs['rolling_10min']

0       20.037672
1       20.038736
2       20.038464
3       40.087371
4       20.082628
          ...    
3291    20.041436
3292    20.051321
3293    19.083227
3294    20.043146
3295    19.047097
Name: rolling_10min, Length: 3296, dtype: float64
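
The rolling sum above describes the unresponsive events themselves (total sbatch execution time in the preceding 10 minutes). Since the hypothesis concerns bursts of completions, a complementary feature is the number of job completions in the window before each event. The sketch below computes that count on toy data with `numpy.searchsorted`; the 10-minute window and the names `ends`, `events`, and `counts` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy completion END times (must be sorted) and unresponsive event times.
ends = pd.Series(pd.to_datetime([
    "2020-10-18 06:40:00", "2020-10-18 06:45:00", "2020-10-18 06:50:10",
    "2020-10-18 06:53:24", "2020-10-18 07:30:00",
])).sort_values()
events = pd.to_datetime(["2020-10-18 06:53:44", "2020-10-18 07:47:25"])

window = pd.Timedelta("10min")  # hypothetical window size
end_values = ends.to_numpy()

# For each event, count completions with event - window < END <= event.
upper = np.searchsorted(end_values, events.to_numpy(), side="right")
lower = np.searchsorted(end_values, (events - window).to_numpy(), side="right")
counts = upper - lower

print(pd.DataFrame({"date": events, "completions_prev_10min": counts}))
```

Because both lookups are binary searches on a sorted array, this approach should scale to the millions of completed jobs in the full dataset.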