## The Advanced Computing Center for Research and Education

The Advanced Computing Center for Research and Education (ACCRE) is a computer cluster serving the high-performance computing needs of research for Vanderbilt University. In this data question, you will be analyzing data on jobs run on ACCRE's hardware.

When a job is submitted to ACCRE, it goes through the slurm scheduler, which tracks and manages compute and memory resources. The hypothesis is that the scheduler processes so many job completions in such a short time that it sometimes becomes unresponsive to commands from users trying to schedule new jobs or check job status. This is a particularly serious problem for clients who use automated submission systems, such as members of the Open Science Grid. The goal of this project is to investigate, and potentially confirm, the hypothesis that many job completions in a short time period cause the scheduler to become unresponsive, and to determine the rough threshold at which this becomes an issue.

You have been provided three datasets for this task:
* **fullsample.csv**: This file contains output for jobs run through the slurm scheduler.
* **slurm_wrapper_ce5.log** and **slurm_wrapper_ce6.log**: Logs of every slurm command executed by a pair of servers, ce5 and ce6, including how long each command took and whether it succeeded. These servers connect ACCRE's local cluster to the Open Science Grid and submit jobs to slurm on behalf of the grid.

To get started, answer the following questions using just the fullsample.csv jobs dataset:

1. Calculate some descriptive statistics for how many jobs per hour are being completed. What do completions per hour look like over the time span of the dataset? Are there weekly trends, and have completions been increasing over the last year?

2. Does the job state affect completions per hour? That is, if you look only at jobs in the "COMPLETED" state with exit code 0:0, is the number of completions per hour similar to the number for all jobs, including failed and cancelled ones? This will indicate whether the load on the scheduler is by user design or is a result of users not sufficiently testing their jobs before submitting very large arrays. We also expect most job completions to be in the "production" partition, but is this actually true?

Next, use the two log files to determine the time intervals (hours) when the scheduler was unresponsive. Do this by looking for records of the "sbatch" command from user 9204 that have return code 1 and an execution time of roughly 20 seconds (more than 15). These are commands where the scheduler timed out before responding.
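
The filter described above can be sketched as follows. The log format is not shown here, so this sketch assumes the log lines have already been parsed into a DataFrame. The column names `timestamp`, `user`, `command`, `returncode`, and `duration`, the helper name `flag_timeouts`, and the demo rows are all hypothetical.

```python
import pandas as pd


def flag_timeouts(log: pd.DataFrame, user: int = 9204,
                  min_seconds: float = 15.0) -> pd.DataFrame:
    """Return rows that look like scheduler timeouts.

    Column names are assumptions, not the real log schema.
    """
    mask = (
        (log["user"] == user)
        & log["command"].str.contains("sbatch", regex=False, na=False)
        & (log["returncode"] == 1)
        & (log["duration"] > min_seconds)
    )
    return log[mask]


# Tiny made-up log, for illustration only.
demo_log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:10:00", "2021-01-01 00:40:00",
        "2021-01-01 03:15:00", "2021-01-01 03:20:00",
    ]),
    "user": [9204, 9204, 9204, 1234],
    "command": ["sbatch job.sh", "squeue", "sbatch job.sh", "sbatch job.sh"],
    "returncode": [1, 1, 0, 1],
    "duration": [20.1, 20.0, 0.3, 20.2],
})

timeouts = flag_timeouts(demo_log)

# Hours that contain at least one timed-out sbatch call.
unresponsive_hours = (
    timeouts["timestamp"].dt.floor(pd.offsets.Hour()).drop_duplicates()
)
```

Flooring each timestamp to the hour gives the same hourly grid as the job-completion counts, which makes the two datasets easier to join later.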

3. Calculate some descriptive statistics about how often the scheduler was unresponsive, how long these periods of time were, and create a time series plot of when the scheduler was having difficulties.

4. Finally, combine the time series information from the two datasets together to see how well correlated heavy job-completion load is with the unresponsiveness, and to see if there is some threshold of job completions per hour that generally results in unresponsiveness.
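
One possible way to approach the threshold question is to bin hours by completion count and compute the share of hours in each bin that were flagged as unresponsive. The sketch below assumes two hourly Series already exist (one of completion counts, one of boolean flags). The function name `unresponsive_rate_by_load`, the bin edges, and the demo values are illustrative assumptions, not results from the data.

```python
import pandas as pd


def unresponsive_rate_by_load(completions: pd.Series,
                              unresponsive: pd.Series,
                              bin_edges: list) -> pd.Series:
    """Fraction of hours flagged unresponsive within each completion-count bin."""
    # Align the two hourly series on their shared timestamps.
    combined = pd.concat(
        {"completions": completions, "unresponsive": unresponsive}, axis=1
    ).dropna()
    combined["unresponsive"] = combined["unresponsive"].astype(bool)
    # Left-closed bins: [edge_i, edge_{i+1})
    combined["load_bin"] = pd.cut(
        combined["completions"], bins=bin_edges, right=False
    )
    return combined.groupby("load_bin", observed=True)["unresponsive"].mean()


# Made-up hourly values, for illustration only.
hours = pd.date_range("2021-01-01", periods=6, freq=pd.offsets.Hour())
completions = pd.Series([100, 200, 5000, 6000, 150, 7000], index=hours)
unresponsive = pd.Series([False, False, True, False, False, True], index=hours)

rate = unresponsive_rate_by_load(completions, unresponsive, [0, 1000, 10000])

# A simple correlation between load and the 0/1 unresponsive flag.
correlation = completions.corr(unresponsive.astype(int))
```

A sharp jump in the rate between adjacent bins would suggest a rough threshold. The bin edges need to be tuned to the real distribution of completions per hour.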

In [1]:
import pandas as pd
import numpy as np

In [2]:
scheduler = pd.read_csv('../data/accre-scheduler-data-2021/fullsample.csv')

In [12]:
# Replace the 'Unknown' placeholder with NaN in the BEGIN and END columns,
# then convert both columns from strings to datetime dtype.
for col in ['BEGIN', 'END']:
    scheduler[col] = pd.to_datetime(scheduler[col].replace('Unknown', np.nan))
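
With `END` stored as a datetime, completions per hour can be counted by resampling. The following is a minimal sketch on a tiny made-up frame. The helper name `completions_per_hour` and the demo rows are assumptions, and the real `scheduler` DataFrame would be passed in place of `demo`.

```python
import pandas as pd


def completions_per_hour(jobs: pd.DataFrame, end_col: str = "END") -> pd.Series:
    """Count jobs by the hour in which they ended; hours with no jobs get 0."""
    ends = pd.DatetimeIndex(jobs[end_col].dropna())
    counts = pd.Series(1, index=ends).resample(pd.offsets.Hour()).sum()
    return counts.rename("completions")


# Made-up END times, including one missing value.
demo = pd.DataFrame({
    "END": pd.to_datetime([
        "2021-01-01 00:05:00",
        "2021-01-01 00:50:00",
        "2021-01-01 02:10:00",
        None,
    ])
})

hourly = completions_per_hour(demo)

# Descriptive statistics (count, mean, std, quartiles, max).
summary = hourly.describe()

# Mean completions per hour by day of week (0 = Monday), to look for weekly patterns.
by_weekday = hourly.groupby(hourly.index.dayofweek).mean()
```

Rows with a missing `END` are dropped before counting, so jobs that never finished do not contribute to any hour.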

In [37]:
scheduler['STATE'].value_counts(normalize = True)

COMPLETED              9.971875e-01
CANCELLED              1.224329e-03
FAILED                 5.092021e-04
CANCELLED by 9201      2.401335e-04
OUT_OF_MEMORY          2.351308e-04
                           ...     
CANCELLED by 895563    1.352103e-07
CANCELLED by 494371    1.352103e-07
CANCELLED by 889749    1.352103e-07
CANCELLED by 896711    1.352103e-07
CANCELLED by 790983    1.352103e-07
Name: STATE, Length: 145, dtype: float64
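
The output above lists 145 distinct states, mostly because each `CANCELLED by <id>` value is counted separately. One way to collapse them into a single category before grouping is sketched below. The helper name `normalize_state` is hypothetical, and the sketch assumes the `CANCELLED by <digits>` pattern seen in the output holds for every such value.

```python
import pandas as pd


def normalize_state(state: pd.Series) -> pd.Series:
    """Map every 'CANCELLED by <id>' value to plain 'CANCELLED'."""
    return state.str.replace(r"^CANCELLED by \d+$", "CANCELLED", regex=True)


# Small sample of state strings taken from the kinds of values shown above.
demo_states = pd.Series(
    ["COMPLETED", "CANCELLED by 9201", "CANCELLED", "OUT_OF_MEMORY"]
)
normalized = normalize_state(demo_states)
```

Applied to the full dataset, this would be `normalize_state(scheduler['STATE'])`, after which `value_counts` would report one row per state family.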

In [5]:
scheduler[scheduler['EXITCODE']=='0:0']['STATE'].value_counts()

COMPLETED              7375084
CANCELLED                 5378
CANCELLED by 9201         1538
PENDING                    443
CANCELLED by 169069        294
                        ...   
CANCELLED by 858449          1
CANCELLED by 546080          1
CANCELLED by 777651          1
CANCELLED by 621980          1
CANCELLED by 795186          1
Name: STATE, Length: 104, dtype: int64

In [6]:
scheduler[scheduler['STATE']=='COMPLETED']['EXITCODE'].value_counts()

0:0    7375084
Name: EXITCODE, dtype: int64
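
Since every `COMPLETED` job has exit code `0:0`, comparing "clean" completions to all completions reduces to filtering on those two columns. The sketch below puts both hourly counts side by side and computes each partition's share of jobs. The helper names and demo rows are made up for illustration; the column names are taken from the dataset's header.

```python
import pandas as pd


def hourly_counts(jobs: pd.DataFrame) -> pd.Series:
    """Number of jobs ending in each hour."""
    ends = pd.DatetimeIndex(jobs["END"].dropna())
    return pd.Series(1, index=ends).resample(pd.offsets.Hour()).sum()


def compare_clean_vs_all(jobs: pd.DataFrame) -> pd.DataFrame:
    """Hourly completions for all jobs next to COMPLETED jobs with exit code 0:0."""
    clean = jobs[(jobs["STATE"] == "COMPLETED") & (jobs["EXITCODE"] == "0:0")]
    both = pd.concat(
        {"all_jobs": hourly_counts(jobs), "clean_jobs": hourly_counts(clean)},
        axis=1,
    )
    # Hours present in only one series show up as NaN; treat them as zero.
    return both.fillna(0).astype(int)


# Made-up jobs, for illustration only.
demo_jobs = pd.DataFrame({
    "END": pd.to_datetime([
        "2021-01-01 00:10:00", "2021-01-01 00:20:00",
        "2021-01-01 01:05:00", "2021-01-01 01:30:00",
    ]),
    "STATE": ["COMPLETED", "FAILED", "COMPLETED", "CANCELLED"],
    "EXITCODE": ["0:0", "1:0", "0:0", "0:0"],
    "PARTITION": ["production", "production", "nogpfs", "production"],
})

comparison = compare_clean_vs_all(demo_jobs)

# Share of jobs in each partition.
partition_share = demo_jobs["PARTITION"].value_counts(normalize=True)
```

If the two columns track each other closely on the real data, most of the scheduler load comes from jobs that finished normally rather than from failed or cancelled ones.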

In [36]:
scheduler[(scheduler['STATE']=='CANCELLED')& (scheduler['EXITCODE']=='0:0')].sort_values('END')

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
6707753,24539954,CANCELLED,2020-10-02 20:58:35,2020-10-02 20:58:49,28000Mn,0,2-00:00:00,00:00:14,1,4,production,0:0
6709203,24542352,CANCELLED,2020-10-02 23:17:08,2020-10-02 23:17:08,21875Mn,0,2-00:00:00,00:00:00,1,8,nogpfs,0:0
6711274,24548362,CANCELLED,2020-10-03 04:27:21,2020-10-03 04:27:21,186648Mn,0,2-00:00:00,00:00:00,1,32,nogpfs,0:0
6711273,24548360,CANCELLED,2020-10-03 04:27:21,2020-10-03 04:27:21,21875Mn,0,2-00:00:00,00:00:00,1,8,nogpfs,0:0
6712448,24552300,CANCELLED,2020-10-03 08:34:30,2020-10-03 08:34:30,186648Mn,0,2-00:00:00,00:00:00,1,32,nogpfs,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...
926104,32913381_0,CANCELLED,2021-10-03 17:52:36,2021-10-03 17:54:28,16384Mn,0,00:30:00,00:01:52,1,1,production,0:0
926105,32913381_1,CANCELLED,2021-10-03 17:52:36,2021-10-03 17:56:18,16384Mn,0,00:30:00,00:03:42,1,1,production,0:0
926106,32913381_2,CANCELLED,2021-10-03 17:52:36,2021-10-03 17:58:09,16384Mn,0,00:30:00,00:05:33,1,1,production,0:0
926182,32913795,CANCELLED,2021-10-05 15:09:35,2021-10-05 15:11:23,92160Mn,0,3-18:00:00,00:01:48,2,10,maxwell,0:0
