In [1]:
import pandas as pd
import numpy as np
# not yet used
import datetime

# The Advanced Computing Center for Research and Education


### Project Overview

The Advanced Computing Center for Research and Education (ACCRE) operates Vanderbilt University's high-performance computing cluster. Jobs submitted to ACCRE are managed by the slurm scheduler, which tracks compute and memory resources.

ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

Your goal is to evaluate whether the data supports the hypothesis of bursts of job completions contributing to scheduler unresponsiveness.

You are provided three datasets:

- fullsample.csv: Contains slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".
- slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid).

Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:

- return code = 1
- execution time > 15 seconds
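
The definition of a job completion given above can be sketched as a boolean mask. The frame `toy_jobs` and its rows are made up for illustration; only the column names `STATE` and `EXITCODE` come from fullsample.csv.

```python
import pandas as pd

# Toy rows for illustration only (not taken from fullsample.csv).
toy_jobs = pd.DataFrame({
    "JOBID": ["1", "2", "3", "4"],
    "STATE": ["COMPLETED", "COMPLETED", "FAILED", "RUNNING"],
    "EXITCODE": ["0:0", "1:0", "1:0", "0:0"],
})

# A job completion is a COMPLETED job that exited with code "0:0".
is_completion = (toy_jobs["STATE"] == "COMPLETED") & (toy_jobs["EXITCODE"] == "0:0")
completions = toy_jobs[is_completion]

print(completions["JOBID"].tolist())
```

The same mask should apply unchanged to the full job table once it is loaded.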

### Phase 1: Explore the Data

#### Objectives:

Understand the purpose of each dataset.<br>
Inspect column types, sizes, and example rows. 
<br>
<br>
#### Notebook Sections:

Code: Load each dataset, preview rows, summarize columns.
<br>
Markdown: Notes on data quality and initial observations.

In [5]:
# read the csv file into the notebook
jobs = pd.read_csv("../data/fullsample.csv")  # add nrows=1000 for a quick sample
jobs.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0


In [6]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   JOBID      object
 1   STATE      object
 2   BEGIN      object
 3   END        object
 4   REQMEM     object
 5   USEDMEM    object
 6   REQTIME    object
 7   USEDTIME   object
 8   NODES      int64 
 9   CPUS       int64 
 10  PARTITION  object
 11  EXITCODE   object
dtypes: int64(2), object(10)
memory usage: 677.1+ MB


In [7]:
# collect the rows whose END value is "Unknown" in a new df
jobs_unknown = jobs[jobs['END']=="Unknown"]

In [8]:
# count the rows of the jobs df whose END value is "Unknown"
jobs[jobs['END']=="Unknown"].count()

JOBID        651
STATE        651
BEGIN        651
END          651
REQMEM       651
USEDMEM      651
REQTIME      651
USEDTIME     651
NODES        651
CPUS         651
PARTITION    651
EXITCODE     651
dtype: int64

In [10]:
# replace the unknowns in jobs df with nan values
# assign it to a new df, jobs_replace
jobs_replace = jobs.replace("Unknown", np.nan)
jobs_replace

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...
7395880,25493434,COMPLETED,2020-10-31T23:39:00,2020-10-31T23:40:46,2000Mn,0.09M,2-00:00:00,00:01:46,1,1,sam,0:0
7395881,25493435,COMPLETED,2020-10-31T23:39:13,2020-10-31T23:40:38,2000Mn,187.92M,2-00:00:00,00:01:25,1,1,sam,0:0
7395882,25493476,COMPLETED,2020-10-31T23:46:29,2020-10-31T23:49:43,4096Mc,803.97M,12:00:00,00:03:14,1,1,production,0:0
7395883,25493515,COMPLETED,2020-10-31T23:49:44,2020-10-31T23:51:40,2000Mn,0.09M,2-00:00:00,00:01:56,1,1,sam,0:0


In [11]:
# check the number of unknown in the begin column of new df
jobs_replace[jobs_replace["BEGIN"] == "Unknown"].count()

JOBID        0
STATE        0
BEGIN        0
END          0
REQMEM       0
USEDMEM      0
REQTIME      0
USEDTIME     0
NODES        0
CPUS         0
PARTITION    0
EXITCODE     0
dtype: int64

In [14]:
# check the number of pending in the state column of the jobs df
jobs[jobs['STATE']=="PENDING"].count()

JOBID        443
STATE        443
BEGIN        443
END          443
REQMEM       443
USEDMEM      443
REQTIME      443
USEDTIME     443
NODES        443
CPUS         443
PARTITION    443
EXITCODE     443
dtype: int64

In [16]:
# check to see how many unique values are in the partition column
# and their counts
jobs_replace["PARTITION"].value_counts()

PARTITION
production              7019578
nogpfs                   147229
pascal                   124453
sam                       64967
turing                    21424
maxwell                   11278
cgw-maizie                 4309
debug                      1616
cgw-platypus                379
cgw-dsi-gw                  228
cgw-capra1                  157
cgw-dougherty1              125
cgw-horus                    61
cgw-cqs1                     28
cgw-hanuman                  21
cgw-sideshowbob              14
cgw-vm-qa-flatearth1          9
cgw-tbi01                     8
cgw-rocksteady                1
Name: count, dtype: int64

In [18]:
# convert the BEGIN column to datetime from the format it was originally presented in
# assign it back to that column of the jobs_replace df
jobs_replace['BEGIN'] = pd.to_datetime(jobs_replace['BEGIN'], format = "%Y-%m-%dT%H:%M:%S")

In [20]:
# convert the END column to datetime from the format it was originally presented in
# assign it back to that column of the jobs_replace df
jobs_replace['END'] = pd.to_datetime(jobs_replace['END'], format = "%Y-%m-%dT%H:%M:%S")

In [22]:
jobs_replace.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype         
---  ------     -----         
 0   JOBID      object        
 1   STATE      object        
 2   BEGIN      datetime64[ns]
 3   END        datetime64[ns]
 4   REQMEM     object        
 5   USEDMEM    object        
 6   REQTIME    object        
 7   USEDTIME   object        
 8   NODES      int64         
 9   CPUS       int64         
 10  PARTITION  object        
 11  EXITCODE   object        
dtypes: datetime64[ns](2), int64(2), object(8)
memory usage: 677.1+ MB
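
REQTIME and USEDTIME are still object columns in the summary above. A minimal sketch of one way to convert them, assuming every value follows the `D-HH:MM:SS` or `HH:MM:SS` pattern seen in the preview rows; the helper name `slurm_duration` is hypothetical.

```python
import pandas as pd

# Duration strings copied from the USEDTIME column in the preview rows above.
used = pd.Series(["67-22:14:22", "30-00:00:23", "00:01:46"], name="USEDTIME")

def slurm_duration(col):
    """Convert 'D-HH:MM:SS' or 'HH:MM:SS' strings to Timedelta values."""
    # "30-00:00:23" becomes "30 days 00:00:23", which pandas can parse directly.
    return pd.to_timedelta(col.str.replace("-", " days ", regex=False))

used_td = slurm_duration(used)
print(used_td)
```

Values that do not match either pattern would raise an error here, so the real columns may need a check for other formats first.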


In [24]:
# inspect head(2) of the jobs_replace df
jobs_replace.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31 22:15:00,NaT,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06 11:36:09,2021-09-05 11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0


In [30]:
# read the first of two log files, ce5, into the notebook
ce5 = pd.read_csv('../data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  nrows=100)

ce5.head()

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."


In [32]:
# inspect the types of data in each column on the ce5 df
ce5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       100 non-null    object
 1   1       100 non-null    object
 2   2       100 non-null    object
 3   3       100 non-null    object
 4   4       100 non-null    object
 5   5       100 non-null    object
dtypes: object(6)
memory usage: 4.8+ KB
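
All six log columns load as labelled strings such as `user 0` or `time 0.07...`. The sketch below strips those labels, casts each field, and flags unresponsive events. It reads the two conditions from the project overview as both being required, which is an assumption. The output column names, the helper `tidy_log`, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows mimicking the six-column layout shown above; the values are made up.
toy_log = pd.DataFrame({
    0: ["2020-10-16 08:15:39.278699", "2020-10-16 08:20:00.000000", "2020-10-16 08:21:00.000000"],
    1: ["user 0", "user 9204", "user 9204"],
    2: ["retry 0", "retry 0", "retry 0"],
    3: ["time 0.07", "time 20.5", "time 0.3"],
    4: ["returncode 0", "returncode 1", "returncode 0"],
    5: ["command ['/usr/bin/sacct', '-u', 'someuser']",
        "command ['/usr/bin/sbatch', 'job.sh']",
        "command ['/usr/bin/sbatch', 'job.sh']"],
})

def strip_label(col):
    """Drop the leading field label ('user ', 'time ', ...) from each value."""
    return col.str.split(" ", n=1).str[1]

def tidy_log(raw):
    out = pd.DataFrame({
        "timestamp": pd.to_datetime(raw[0]),
        "user": strip_label(raw[1]).astype(int),
        "retry": strip_label(raw[2]).astype(int),
        "exec_time": strip_label(raw[3]).astype(float),
        "returncode": strip_label(raw[4]).astype(int),
        "command": strip_label(raw[5]),
    })
    # Unresponsive: sbatch from user 9204 with return code 1 and time > 15 s.
    out["unresponsive"] = (
        out["command"].str.contains("sbatch", regex=False)
        & (out["user"] == 9204)
        & (out["returncode"] == 1)
        & (out["exec_time"] > 15)
    )
    return out

log = tidy_log(toy_log)
print(log[["timestamp", "user", "exec_time", "returncode", "unresponsive"]])
```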


In [34]:
# read the second of two log files, ce6, into the notebook
ce6 = pd.read_csv('../data/slurm_wrapper_ce6.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  nrows=100)

ce6

In [36]:
# inspect the types of data in each column on the ce6 df
ce6.info()

In [38]:
jobs[0:3]

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0


In [None]:
jobs[jobs['JOBID'].str.contains('_')]
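
JOBID values with an underscore, such as `31364111_2` in the preview above, match Slurm's job-array naming of `<array job id>_<task index>`. Assuming that is what they are here, the sketch below splits the two parts so array tasks can be grouped. The new column names are hypothetical.

```python
import pandas as pd

# Toy IDs; "31364111_2" matches the pattern seen in the preview above.
ids = pd.Series(["30616928", "31364111_2", "31364111_3"], name="JOBID")

parts = ids.str.split("_", n=1, expand=True)
job_ids = pd.DataFrame({
    "JOBID": ids,
    "base_id": parts[0],
    # NaN for jobs without an index; errors="coerce" also covers odd suffixes.
    "array_index": pd.to_numeric(parts[1], errors="coerce"),
    "is_array": ids.str.contains("_", regex=False),
})
print(job_ids)
```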

### Phase 2: Clean and Transform the Data

#### Objectives:
Extract job completions from fullsample.csv. <br>
Parse CE5 and CE6 logs to identify unresponsive events. <br>
Create analysis-ready features (time windows, completion counts, unresponsiveness indicators). <br>
Optionally include other features (currently running jobs or resource usage, time-of-day).

#### Notebook Sections:
Code: Filtering and transforming datasets <br>
Markdown: Document preprocessing steps and reasoning <br>
Code: Combine datasets into a single dataset suitable for analysis
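
One possible shape for the combined dataset is one row per time window, holding a completion count and an unresponsiveness indicator. The sketch below assumes 60-minute windows and uses made-up timestamps. The frame names, column names, and `WINDOW` are hypothetical choices, not part of the assignment.

```python
import pandas as pd

# Toy inputs; the timestamps and flags are made up.
completions = pd.DataFrame({
    "END": pd.to_datetime([
        "2021-09-05 11:05:00", "2021-09-05 11:40:00",
        "2021-09-05 11:55:00", "2021-09-05 13:10:00",
    ])
})
log = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-09-05 11:45:00", "2021-09-05 12:30:00"]),
    "unresponsive": [True, False],
})

WINDOW = "60min"  # assumed window size; worth varying in the real analysis

# Number of job completions that ended in each window.
n_completions = (
    pd.Series(1, index=completions["END"]).resample(WINDOW).sum().rename("n_completions")
)
# 1 if any unresponsive event was logged in the window, else 0.
unresponsive = (
    log.set_index("timestamp")["unresponsive"].astype(int)
    .resample(WINDOW).max().rename("unresponsive")
)

# Outer-align the two series on the window start; missing windows count as 0.
features = pd.concat([n_completions, unresponsive], axis=1).fillna(0).astype(int)
features["hour_of_day"] = features.index.hour
print(features)
```

Treating windows with no logged commands as responsive is a simplification. Those windows could instead be dropped if the logs have gaps.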

### Phase 3: Analyze and Visualize

#### Objectives:
Explore the relationship between job completions and unresponsiveness. <br>
Create visualizations and basic summary statistics

#### Notebook Sections:
Code: Time-series plots, scatterplots, boxplots, summary statistics. <br>
Markdown: Interpret the visualizations and describe patterns. <br>
Code: Fit a simple logistic regression to test the hypothesis. <br>
Markdown: Summarize the results and draw conclusions from the model. <br>
Optional: Explore additional factors (e.g., day of week).
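
The logistic regression step can be sketched as follows. To stay dependency-free, this version fits the model with Newton-Raphson in NumPy. In the notebook, a library such as statsmodels or scikit-learn would be the more usual choice. The data are made up, and `fit_logistic` is a hypothetical helper.

```python
import numpy as np

# Made-up window features: completions per window and a 0/1 unresponsive flag.
n_completions = np.array([5, 20, 35, 50, 65, 80, 95, 110, 125, 140], dtype=float)
unresponsive = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * z) by Newton-Raphson, z = standardized x."""
    z = (x - x.mean()) / x.std()
    X = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

beta = fit_logistic(n_completions, unresponsive)
print("intercept:", beta[0], "slope per standard deviation:", beta[1])
```

A positive slope means the estimated odds of an unresponsive window rise with the completion count. The toy values say nothing about the real data, and this fit fails to converge if the classes are perfectly separated.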

### Phase 4: Interpret and Conclude

#### Objectives:
Answer the main question: Does the data support the hypothesis that the slurm scheduler is more likely to be unresponsive during bursts of job completions? <br>
Summarize findings and limitations.

#### Notebook Sections:
Markdown: Summarize evidence for or against the hypothesis. <br>
Markdown: Provide clear conclusions.

### Final Deliverable: A single Jupyter notebook that includes:

1. Introduction & dataset overview
2. Data exploration & cleaning
3. Feature engineering
4. Analysis & visualizations
5. Interpretation & Conclusion