# Project Overview
The Advanced Computing Center for Research and Education (ACCRE) operates Vanderbilt University's high-performance computing cluster. Jobs submitted to ACCRE are managed by the [Slurm scheduler](https://slurm.schedmd.com/documentation.html), which tracks compute and memory resources.

ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

The goal is to evaluate whether the data support the hypothesis that bursts of job completions contribute to scheduler unresponsiveness.

**Datasets:**
* fullsample.csv: Contains slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".  
* slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid).
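
The completion definition for fullsample.csv can be written as a boolean mask. The snippet below is a minimal sketch on a small made-up frame; `sample_jobs` and `select_completions` are illustrative names, not part of the project.

```python
import pandas as pd

# Tiny stand-in for fullsample.csv (illustrative values, not real records)
sample_jobs = pd.DataFrame({
    "JOBID": ["1", "2", "3", "4"],
    "STATE": ["COMPLETED", "COMPLETED", "FAILED", "RUNNING"],
    "EXITCODE": ["0:0", "1:0", "1:0", "0:0"],
})


def select_completions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows in the COMPLETED state whose exit code is 0:0."""
    mask = (df["STATE"] == "COMPLETED") & (df["EXITCODE"] == "0:0")
    return df[mask]


completions = select_completions(sample_jobs)
print(completions["JOBID"].tolist())
```

Both conditions are required: a job can be COMPLETED with a non-zero exit code, and a RUNNING job can show "0:0".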

Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:  
* return code = 1
* execution time > 15 seconds

## Phase 1: Explore the Data
**Objectives:**
* Understand the purpose of each dataset.  
* Inspect column types, sizes, and example rows.  

**Notebook Sections:**
* Code: Load each dataset, preview rows, summarize columns.  
* Markdown: Notes on data quality and initial observations.  

In [1]:
# IMPORT PYTHON LIBRARIES
import pandas as pd

### Explore Job Data

In [2]:
# READ fullsample.csv
jobs_df = pd.read_csv("../data/fullsample.csv")

In [3]:
# Display dataframe information
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   JOBID      object
 1   STATE      object
 2   BEGIN      object
 3   END        object
 4   REQMEM     object
 5   USEDMEM    object
 6   REQTIME    object
 7   USEDTIME   object
 8   NODES      int64 
 9   CPUS       int64 
 10  PARTITION  object
 11  EXITCODE   object
dtypes: int64(2), object(10)
memory usage: 677.1+ MB


In [4]:
# Display head and tail data
jobs_df

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...
7395880,25493434,COMPLETED,2020-10-31T23:39:00,2020-10-31T23:40:46,2000Mn,0.09M,2-00:00:00,00:01:46,1,1,sam,0:0
7395881,25493435,COMPLETED,2020-10-31T23:39:13,2020-10-31T23:40:38,2000Mn,187.92M,2-00:00:00,00:01:25,1,1,sam,0:0
7395882,25493476,COMPLETED,2020-10-31T23:46:29,2020-10-31T23:49:43,4096Mc,803.97M,12:00:00,00:03:14,1,1,production,0:0
7395883,25493515,COMPLETED,2020-10-31T23:49:44,2020-10-31T23:51:40,2000Mn,0.09M,2-00:00:00,00:01:56,1,1,sam,0:0


#### JOBID
Each row is a job with a unique ID. Jobs submitted as part of an array have an ID containing an underscore, and the number after the underscore is the task number. For example, JOBID 31781951 was an array job with 10 parts.
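
If array tasks need to be grouped under their parent job, the ID can be split at the underscore. This is a sketch on a few illustrative IDs; the column names `BASE_JOBID`, `TASK`, and `IS_ARRAY_TASK` are assumptions introduced here.

```python
import pandas as pd

# Illustrative IDs: one plain job and two tasks of the same array job
ids = pd.Series(["30616928", "31364111_2", "31364111_10"], name="JOBID")

# Split on the first underscore; plain jobs have no second part
parts = ids.str.split("_", n=1, expand=True)

job_parts = pd.DataFrame({
    "BASE_JOBID": parts[0],
    # Non-numeric or missing task parts become <NA> instead of raising
    "TASK": pd.to_numeric(parts[1], errors="coerce").astype("Int64"),
})
job_parts["IS_ARRAY_TASK"] = job_parts["TASK"].notna()
print(job_parts)
```

The nullable `Int64` dtype keeps task numbers as integers while still allowing missing values for non-array jobs.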

In [5]:
# Check the JOBID column for NaN values
jobs_df['JOBID'].isna().sum()

np.int64(0)

#### STATE
Jobs can have a few different states, with the most common one being 'COMPLETED'.

In [6]:
# Display unique STATE values
jobs_df['STATE'].value_counts().head()

STATE
COMPLETED            7375084
CANCELLED               9055
FAILED                  3766
CANCELLED by 9201       1776
OUT_OF_MEMORY           1739
Name: count, dtype: int64
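
The counts above include a state of the form "CANCELLED by 9201", so cancellations are spread over several labels. One possible way to group them, sketched on illustrative values and assuming the suffix is always " by " followed by a numeric ID:

```python
import pandas as pd

states = pd.Series(
    ["COMPLETED", "CANCELLED", "CANCELLED by 9201", "FAILED"], name="STATE"
)

# Drop the " by <id>" suffix so every cancellation shares one label
state_group = states.str.replace(r"^CANCELLED by \d+$", "CANCELLED", regex=True)
print(state_group.value_counts())
```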

In [7]:
# Check the STATE column for NaN values
jobs_df['STATE'].isna().sum()

np.int64(0)

#### BEGIN
Indicates when the job was started (initiated on a compute node).

In [8]:
# Check the BEGIN column for NaN values
jobs_df['BEGIN'].isna().sum()

np.int64(0)

#### END
Indicates when the job ended (completed, failed, or was cancelled while running).

In [9]:
# Check the END column for NaN values
jobs_df['END'].isna().sum()

np.int64(0)
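
A zero NaN count does not mean every job has an end time: the first preview row, a RUNNING job, stores the string "Unknown" in END. A sketch of converting both timestamp columns, assuming the ISO-like format seen in the preview and treating anything else as missing (the `_TS` column names are illustrative):

```python
import pandas as pd

times = pd.DataFrame({
    "BEGIN": ["2021-07-31T22:15:00", "2021-08-06T11:36:09"],
    "END": ["Unknown", "2021-09-05T11:36:32"],
})

fmt = "%Y-%m-%dT%H:%M:%S"
# errors="coerce" turns unparseable strings such as "Unknown" into NaT
times["BEGIN_TS"] = pd.to_datetime(times["BEGIN"], format=fmt, errors="coerce")
times["END_TS"] = pd.to_datetime(times["END"], format=fmt, errors="coerce")
print(times[["BEGIN_TS", "END_TS"]])
```

After conversion, `times["END_TS"].isna().sum()` counts the jobs without a recorded end time.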

#### REQMEM
The amount of memory requested in megabytes. It can be per-core/CPU (Mc) or per-node (Mn).

In [10]:
# Display unique REQMEM values
jobs_df['REQMEM'].value_counts().head()

REQMEM
2048Mn     1180872
4096Mn      906234
2048Mc      545202
16384Mn     429209
1024Mn      365092
Name: count, dtype: int64

In [11]:
# Check the REQMEM column for NaN values
jobs_df['REQMEM'].isna().sum()

np.int64(0)
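
REQMEM strings such as "2048Mn" combine an amount with a per-core or per-node marker, so they need parsing before any arithmetic. The sketch below uses illustrative rows. Scaling per-core requests by CPUS and per-node requests by NODES is an assumption about how to get a job total, not something the dataset documents, and the new column names are hypothetical.

```python
import pandas as pd

mem = pd.DataFrame({
    "REQMEM": ["2048Mn", "4096Mc", "16384Mn"],
    "NODES": [1, 1, 2],
    "CPUS": [1, 4, 8],
})

# Capture the amount and the trailing marker (c = per core, n = per node)
parsed = mem["REQMEM"].str.extract(r"^(?P<MB>\d+(?:\.\d+)?)M(?P<PER>[cn])$")
mem["REQMEM_MB"] = parsed["MB"].astype(float)
mem["REQMEM_PER"] = parsed["PER"].map({"c": "core", "n": "node"})

# ASSUMPTION: job total = amount x CPUS (per core) or amount x NODES (per node)
scale = mem["CPUS"].where(mem["REQMEM_PER"] == "core", mem["NODES"])
mem["REQMEM_TOTAL_MB"] = mem["REQMEM_MB"] * scale
print(mem)
```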

#### USEDMEM
The amount of memory used in MB per node.

In [12]:
# Display unique USEDMEM values
jobs_df['USEDMEM'].value_counts().head()

USEDMEM
0           1099732
0.09M         65651
6.23M         26712
6.24M         19920
1637.41M       8863
Name: count, dtype: int64

In [13]:
# Check the USEDMEM column for NaN values
jobs_df['USEDMEM'].isna().sum()

np.int64(0)

#### REQTIME
The requested time, formatted as d-hh:mm:ss, or as hh:mm:ss for jobs shorter than one day.

In [14]:
# Display unique REQTIME values
jobs_df['REQTIME'].value_counts().head()

REQTIME
2-00:00:00    987509
08:00:00      625325
00:15:00      581773
01:00:00      574436
04:00:00      408647
Name: count, dtype: int64

In [15]:
# Check the REQTIME column for NaN values
jobs_df['REQTIME'].isna().sum()

np.int64(0)

#### USEDTIME
The used time is in d-hh:mm:ss or hh:mm:ss for jobs less than one day in duration.

In [16]:
# Display unique USEDTIME values
jobs_df['USEDTIME'].value_counts().head()

USEDTIME
00:00:07    41436
00:00:08    39442
00:00:10    39327
00:00:06    38977
00:00:09    38476
Name: count, dtype: int64

In [17]:
# Check the USEDTIME column for NaN values
jobs_df['USEDTIME'].isna().sum()

np.int64(0)
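
REQTIME and USEDTIME are strings, so comparing or summing them requires a conversion. `pd.to_timedelta` does not read the d-hh:mm:ss form directly, so this sketch first rewrites the dash as " days ". The helper name `slurm_duration` is illustrative.

```python
import pandas as pd


def slurm_duration(col: pd.Series) -> pd.Series:
    """Convert d-hh:mm:ss or hh:mm:ss strings to Timedelta values."""
    # "30-00:00:23" becomes "30 days 00:00:23"; hh:mm:ss strings are unchanged
    normalized = col.str.replace("-", " days ", n=1, regex=False)
    return pd.to_timedelta(normalized, errors="coerce")


used = pd.Series(["00:01:46", "30-00:00:23", "12:00:00"])
durations = slurm_duration(used)
print(durations.dt.total_seconds().tolist())
```

With both columns converted, a ratio such as used time over requested time becomes a simple division of two Timedelta columns.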

#### NODES
The number of servers used for the job. Most jobs are single node. For multiple node jobs, memory usage is the maximum over all nodes.

In [18]:
# Display unique NODES values
jobs_df['NODES'].value_counts().head()

NODES
1    7385963
2       3525
3       2716
8       1447
4        860
Name: count, dtype: int64

In [19]:
# Check the NODES column for NaN values
jobs_df['NODES'].isna().sum()

np.int64(0)

#### CPUS
The total number of CPU cores allocated to the job. For multi-node jobs, this total includes all nodes.

In [20]:
# Display unique CPUS values
jobs_df['CPUS'].value_counts().head()

CPUS
1    5997522
4     489619
2     432155
8     170996
3      88834
Name: count, dtype: int64

In [21]:
# Check the CPUS column for NaN values
jobs_df['CPUS'].isna().sum()

np.int64(0)

#### PARTITION
Most jobs are run in the "production" or "nogpfs" partition. The "debug" and "sam" partitions are for test jobs that are expected to be short, and the "maxwell", "pascal", and "turing" partitions are for GPU resources.

In [22]:
# Display unique PARTITION values
jobs_df['PARTITION'].value_counts().head()

PARTITION
production    7019578
nogpfs         147229
pascal         124453
sam             64967
turing          21424
Name: count, dtype: int64

In [23]:
# Check the PARTITION column for NaN values
jobs_df['PARTITION'].isna().sum()

np.int64(0)

#### EXITCODE
The [exit code](https://www.agileconnection.com/article/overview-linux-exit-codes) for the job, with "0:0" indicating a successful job. Slurm reports exit codes as two numbers separated by a colon: the first is the exit status returned by the job script, and the second is the number of the signal that terminated the job, if any. For example, "0:9" means the job was killed by signal 9.

In [24]:
# Display unique EXITCODE values
jobs_df['EXITCODE'].value_counts().head()

EXITCODE
0:0      7384480
1:0         4958
0:15        1887
0:125       1739
0:9         1361
Name: count, dtype: int64

In [25]:
# Check the EXITCODE column for NaN values
jobs_df['EXITCODE'].isna().sum()

np.int64(0)
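
To filter on either half of the exit code, the string can be split into two integer columns. A minimal sketch on illustrative values; `EXIT_FIRST`, `EXIT_SECOND`, and `SUCCESS` are names introduced here.

```python
import pandas as pd

codes = pd.Series(["0:0", "1:0", "0:15", "0:125"], name="EXITCODE")

# Split "a:b" into two integer columns
pair = codes.str.split(":", n=1, expand=True).astype(int)
pair.columns = ["EXIT_FIRST", "EXIT_SECOND"]

# A job is treated as successful only when both numbers are zero
pair["SUCCESS"] = (pair["EXIT_FIRST"] == 0) & (pair["EXIT_SECOND"] == 0)
print(pair)
```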

### Explore CE5 logs

In [26]:
# READ slurm_wrapper_ce5.log
ce5_df = pd.read_csv('../data/slurm_wrapper_ce5.log', header=None, delimiter=' - ', engine='python')

In [27]:
# Display dataframe information
ce5_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4770893 entries, 0 to 4770892
Data columns (total 6 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   0       object
 1   1       object
 2   2       object
 3   3       object
 4   4       object
 5   5       object
dtypes: object(6)
memory usage: 218.4+ MB


In [28]:
# Display head and tail data
ce5_df

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
...,...,...,...,...,...,...
4770888,2021-10-07 21:58:06.738329,user 9203,retry 0,time 0.02677178382873535,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770889,2021-10-07 21:58:15.931559,user 9201,retry 0,time 0.04166150093078613,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770890,2021-10-07 21:58:48.900136,user 9221,retry 0,time 0.14348959922790527,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770891,2021-10-07 21:59:11.314056,user 9203,retry 0,time 0.026599407196044922,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."


In [29]:
# Check column 0 for NaN values
ce5_df[0].isna().sum()

np.int64(0)

In [30]:
# Display unique column 1 values
ce5_df[1].value_counts()

1
user 9201      3093747
user 9202       639795
user 9203       386689
user 9221       312727
user 9219       178075
user 9204       159847
user 0               8
user 112870          5
Name: count, dtype: int64

In [31]:
# Check column 1 for NaN values
ce5_df[1].isna().sum()

np.int64(0)

In [32]:
# Display unique column 2 values
ce5_df[2].value_counts()

2
retry 0    4345805
retry 1     369962
retry 2      55126
Name: count, dtype: int64

In [33]:
# Check column 2 for NaN values
ce5_df[2].isna().sum()

np.int64(0)

In [34]:
# Check column 3 for NaN values
ce5_df[3].isna().sum()

np.int64(0)

In [35]:
# Display unique column 4 values
ce5_df[4].value_counts()

4
returncode 0      4053244
returncode 1       697666
returncode 140      13735
returncode 255       6242
returncode 8            6
Name: count, dtype: int64

In [36]:
# Check column 4 for NaN values
ce5_df[4].isna().sum()

np.int64(0)

In [37]:
# Display unique column 5 values
ce5_df[5].value_counts()

5
command ['/usr/bin/scontrol', 'show', 'job']                                                                          551116
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'cmspilot']                                                           59796
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'lscpilot']                                                           56818
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'uscmslocal']                                                         24164
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'cmslocal']                                                           23133
                                                                                                                       ...  
command ['/usr/bin/sacct', '-j', '26836699', '--noconvert', '-P', '--format', 'UserCPU,SystemCPU,MaxRSS,ExitCode']         1
command ['/usr/bin/sacct', '-j', '26836681', '--noconvert', '-P', '--format', 'UserCPU,SystemCPU,MaxRSS,ExitCode']         

In [38]:
# Check column 5 for NaN values
ce5_df[5].isna().sum()

np.int64(0)

### Explore CE6 logs

In [39]:
# READ slurm_wrapper_ce6.log
ce6_df = pd.read_csv('../data/slurm_wrapper_ce6.log', header=None, delimiter=' - ', engine='python')

In [40]:
# Display dataframe information
ce6_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4776520 entries, 0 to 4776519
Data columns (total 6 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   0       object
 1   1       object
 2   2       object
 3   3       object
 4   4       object
 5   5       object
dtypes: object(6)
memory usage: 218.7+ MB


In [41]:
# Display head and tail data
ce6_df

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 10:37:44.163454,user 9202,retry 0,time 0.08495402336120605,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
1,2020-10-16 10:37:44.206654,user 9202,retry 0,time 0.08943057060241699,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
2,2020-10-16 10:37:44.218760,user 9202,retry 0,time 0.05928945541381836,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
3,2020-10-16 10:37:44.256403,user 9202,retry 0,time 0.038695573806762695,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
4,2020-10-16 10:37:44.611603,user 9202,retry 0,time 0.03343677520751953,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
...,...,...,...,...,...,...
4776515,2021-10-07 21:59:35.014602,user 9221,retry 0,time 0.060086965560913086,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4776516,2021-10-07 21:59:35.238970,user 9202,retry 0,time 0.09804415702819824,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4776517,2021-10-07 21:59:57.265189,user 9203,retry 0,time 0.02454972267150879,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4776518,2021-10-07 22:00:04.024360,user 9201,retry 0,time 0.03941917419433594,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."


In [42]:
# Check column 0 for NaN values
ce6_df[0].isna().sum()

np.int64(0)

In [43]:
# Display unique column 1 values
ce6_df[1].value_counts()

1
user 9201    2710665
user 9202     653123
user 9203     448291
user 9219     440225
user 9221     369354
user 9204     154862
Name: count, dtype: int64

In [44]:
# Check column 1 for NaN values
ce6_df[1].isna().sum()

np.int64(0)

In [45]:
# Display unique column 2 values
ce6_df[2].value_counts()

2
retry 0    4299816
retry 1     425174
retry 2      51530
Name: count, dtype: int64

In [46]:
# Check column 2 for NaN values
ce6_df[2].isna().sum()

np.int64(0)

In [47]:
# Check column 3 for NaN values
ce6_df[3].isna().sum()

np.int64(0)

In [48]:
# Display unique column 4 values
ce6_df[4].value_counts()

4
returncode 0      4165185
returncode 1       598974
returncode 140      11252
returncode 255       1105
returncode 8            4
Name: count, dtype: int64

In [49]:
# Check column 4 for NaN values
ce6_df[4].isna().sum()

np.int64(0)

In [50]:
# Display unique column 5 values
ce6_df[5].value_counts()

5
command ['/usr/bin/scontrol', 'show', 'job']                                                   987351
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'cmspilot']                                    57020
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'lscpilot']                                    54705
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'uscmslocal']                                  24166
command ['/usr/bin/squeue', '-o', '%i %T', '-u', 'cmslocal']                                    23972
                                                                                                ...  
command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x561ffcdc33d0.1243375/bl_2bb25adc3f5a']         1
command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x561ffcdc33d0.1243375/bl_1f3e6e26b959']         1
command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x561ffcdc33d0.1243375/bl_0929a758e6c6']         1
command ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x561ffcdc33d0.1243375/bl_8ec

In [51]:
# Check column 5 for NaN values
ce6_df[5].isna().sum()

np.int64(0)

## Phase 2: Clean and Transform the Data
**Objectives:**
* Extract job completions from fullsample.csv.  
* Parse CE5 and CE6 logs to identify unresponsive events.  
* Create analysis-ready features (time windows, completion counts, unresponsiveness indicators).  
* Optionally include other features (currently running jobs or resource usage, time-of-day).  

**Notebook Sections:**
* Code: Filtering and transforming datasets.  
* Markdown: Document preprocessing steps and reasoning.  
* Code: Combine datasets into a single dataset suitable for analysis.
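
The feature objectives above (time windows, completion counts, unresponsiveness indicators) could be combined roughly as follows. This is a sketch on made-up timestamps. The 60-minute window and the column names are assumptions, and the real analysis may need a different window size.

```python
import pandas as pd

# Illustrative inputs: completion end times and unresponsive-event times
completion_end = pd.to_datetime(pd.Series([
    "2021-09-05 11:05:00", "2021-09-05 11:20:00",
    "2021-09-05 11:40:00", "2021-09-05 13:10:00",
]))
event_time = pd.to_datetime(pd.Series(["2021-09-05 11:45:00"]))

WINDOW = "60min"  # ASSUMPTION: one-hour windows

# Count rows per window by flooring each timestamp to the window start
completions = completion_end.dt.floor(WINDOW).value_counts().rename("COMPLETIONS")
events = event_time.dt.floor(WINDOW).value_counts().rename("UNRESPONSIVE_EVENTS")

# Build a gap-free window index so quiet windows appear as zeros
start = min(completions.index.min(), events.index.min())
stop = max(completions.index.max(), events.index.max())
windows = pd.date_range(start, stop, freq=WINDOW)

features = (
    pd.concat([completions, events], axis=1)
    .reindex(windows)
    .fillna(0)
    .astype(int)
)
features["UNRESPONSIVE"] = features["UNRESPONSIVE_EVENTS"] > 0
print(features)
```

Keeping empty windows matters for the hypothesis test: windows with no completions and no unresponsive events are evidence too.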

### Filter unresponsive logs

In [52]:
# Rename log columns
ce5_df = ce5_df.rename(columns={0: "TIMESTAMP", 1: "USER", 2: "RETRY", 3: "RUNTIME", 4: "RETURNCODE", 5: "COMMAND"})
ce6_df = ce6_df.rename(columns={0: "TIMESTAMP", 1: "USER", 2: "RETRY", 3: "RUNTIME", 4: "RETURNCODE", 5: "COMMAND"})

In [60]:
# Clean RUNTIME values
ce5_df['RUNTIME'] = ce5_df['RUNTIME'].str.replace("time ", "").astype(float)
ce6_df['RUNTIME'] = ce6_df['RUNTIME'].str.replace("time ", "").astype(float)
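
The other log columns carry the same kind of text prefix as RUNTIME. The sketch below shows one way to clean them on a small illustrative frame (`log` and `PROGRAM` are names introduced here). It works on a separate frame on purpose, because the filter cell that follows compares USER and RETURNCODE against the original prefixed strings.

```python
import ast

import pandas as pd

# Illustrative rows in the renamed layout (commands shortened for the example)
log = pd.DataFrame({
    "TIMESTAMP": ["2020-10-16 08:15:39.278699", "2021-10-07 21:58:06.738329"],
    "USER": ["user 0", "user 9203"],
    "RETRY": ["retry 0", "retry 1"],
    "RETURNCODE": ["returncode 0", "returncode 1"],
    "COMMAND": [
        "command ['/usr/bin/sacct', '-u', 'appelte1']",
        "command ['/usr/bin/squeue', '-o', '%i %T']",
    ],
})

log["TIMESTAMP"] = pd.to_datetime(log["TIMESTAMP"])

# Strip the text prefix from each column and convert the rest to integers
for col, prefix in [("USER", "user "), ("RETRY", "retry "), ("RETURNCODE", "returncode ")]:
    log[col] = log[col].str.removeprefix(prefix).astype(int)

# The command is a printed Python list; parse it and keep the program name
argv = log["COMMAND"].str.removeprefix("command ").map(ast.literal_eval)
log["PROGRAM"] = argv.str[0].str.rsplit("/", n=1).str[-1]
print(log[["TIMESTAMP", "USER", "RETURNCODE", "PROGRAM"]])
```

`Series.str.removeprefix` requires pandas 1.4 or later. On millions of rows, `ast.literal_eval` may be slow, so a string match on the program path could be a cheaper alternative.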

In [66]:
# Keep the rows that mark an unresponsive scheduler, as defined in the overview:
# "sbatch" commands from user 9204 with return code 1 and execution time > 15 seconds
ce5_unresponsive_df = ce5_df[
    (ce5_df['USER'] == "user 9204") &
    (ce5_df['RUNTIME'] > 15) &
    (ce5_df['RETURNCODE'] == "returncode 1") &
    (ce5_df['COMMAND'].str.contains("/usr/bin/sbatch", regex=False))
]

ce6_unresponsive_df = ce6_df[
    (ce6_df['USER'] == "user 9204") &
    (ce6_df['RUNTIME'] > 15) &
    (ce6_df['RETURNCODE'] == "returncode 1") &
    (ce6_df['COMMAND'].str.contains("/usr/bin/sbatch", regex=False))
]

In [68]:
# Concatenate unresponsive logs
unresponseive_logs_df = pd.concat([ce5_unresponsive_df, ce6_unresponsive_df])

### Filter relevant jobs

In [54]:
# Set the relevant job statuses
relevant_state_list = ["COMPLETED", "CANCELLED", "FAILED", "CANCELLED by 9204", "OUT_OF_MEMORY", "PENDING"]

# Filter the jobs by the relevant statuses
relevant_jobs_df = jobs_df[jobs_df['STATE'].isin(relevant_state_list)]

In [70]:
relevant_jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7390088 entries, 1 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   JOBID      object
 1   STATE      object
 2   BEGIN      object
 3   END        object
 4   REQMEM     object
 5   USEDMEM    object
 6   REQTIME    object
 7   USEDTIME   object
 8   NODES      int64 
 9   CPUS       int64 
 10  PARTITION  object
 11  EXITCODE   object
dtypes: int64(2), object(10)
memory usage: 733.0+ MB
