## The Advanced Computing Center for Research and Education

**Project Overview**
The Advanced Computing Center for Research and Education (ACCRE) operates Vanderbilt University's high-performance computing cluster. Jobs submitted to ACCRE are managed by the [slurm scheduler](https://slurm.schedmd.com/documentation.html), which tracks compute and memory resources.

ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

Your goal is to evaluate whether the data supports the hypothesis that bursts of job completions contribute to scheduler unresponsiveness.

You are provided three datasets:  
* fullsample.csv: Contains slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".  
* slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid).  
Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:  
    * return code = 1
    * execution time > 15 seconds

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt


**Phase 1: Explore the Data**  
Objectives:  
* Understand the purpose of each dataset.  
* Inspect column types, sizes, and example rows.  

Notebook Sections:  
* Code: Load each dataset, preview rows, summarize columns.  
* Markdown: Notes on data quality and initial observations.  

In [24]:
FS_jobs_DF = pd.read_csv('../Data/fullsample.csv', nrows = 1000)
FS_jobs_DF.head()

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0
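
The preview shows that durations and memory values are stored as strings (`67-22:14:22`, `2048Mn`, `20604.62M`), so they need converting before any arithmetic. The sketch below is one possible way to do that on a few values copied from the preview. The helper names `slurm_time_to_timedelta` and `slurm_mem_to_mb` are hypothetical, and the memory helper assumes every value is in megabytes, as in the rows shown; it simply drops the trailing letters.

```python
import pandas as pd

def slurm_time_to_timedelta(col: pd.Series) -> pd.Series:
    """Convert durations such as '10:04:00' or '30-00:00:23' to Timedelta."""
    # A leading 'D-' is the day count; pandas expects 'D days ' instead.
    normalized = col.str.replace(r"^(\d+)-", r"\1 days ", regex=True)
    return pd.to_timedelta(normalized)

def slurm_mem_to_mb(col: pd.Series) -> pd.Series:
    """Keep the leading number of strings such as '2048Mn' or '20604.62M'.

    Assumption: all values are expressed in megabytes.
    """
    return col.astype(str).str.extract(r"^([\d.]+)", expand=False).astype(float)

# Values copied from the preview above.
sample = pd.DataFrame({
    "USEDTIME": ["67-22:14:22", "30-00:00:23", "10:04:00"],
    "REQMEM": ["2048Mn", "262144Mn", "16384Mn"],
    "USEDMEM": ["0", "20604.62M", "9733.43M"],
})

used_time = slurm_time_to_timedelta(sample["USEDTIME"])
req_mb = slurm_mem_to_mb(sample["REQMEM"])
used_mb = slurm_mem_to_mb(sample["USEDMEM"])
print(pd.DataFrame({"used_time": used_time, "req_mb": req_mb, "used_mb": used_mb}))
```

If the full file contains other units (for example gigabytes), the memory helper would need a unit lookup rather than a plain strip.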


In [27]:
FS_jobs_DF.shape

(1000, 12)

In [28]:
pd.set_option('display.max_rows', 1000)

In [61]:
ce5 = pd.read_csv('../Data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  nrows=1000)
ce5.head(10)

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
5,2020-10-16 08:31:57.896479,user 9201,retry 0,time 0.12703871726989746,returncode 0,"command ['/usr/bin/scancel', '24994284']"
6,2020-10-16 08:31:58.103189,user 9201,retry 0,time 0.11046957969665527,returncode 0,"command ['/usr/bin/scancel', '24994300']"
7,2020-10-16 08:31:58.103525,user 9201,retry 0,time 0.12061500549316406,returncode 0,"command ['/usr/bin/scancel', '24994286']"
8,2020-10-16 08:31:58.114098,user 9201,retry 0,time 0.24277329444885254,returncode 0,"command ['/usr/bin/scancel', '24994333']"
9,2020-10-16 08:31:58.125105,user 9201,retry 0,time 0.1543562412261963,returncode 0,"command ['/usr/bin/scancel', '24994285']"


In [83]:
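# Take the token after the field label; slicing at fixed positions keeps the leading space
ce5['User'] = ce5[1].str.split().str[-1]
ce5['Returncode'] = ce5[4].str.split().str[-1]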
ce5

Unnamed: 0,0,1,2,3,4,5,User,Returncode
0,2020-10-16 08:15:39.278699,user 0,retry 0,time 0.07347559928894043,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '...",0,0
1,2020-10-16 08:18:08.313309,user 0,retry 0,time 0.18363237380981445,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '...",0,0
2,2020-10-16 08:22:48.128689,user 0,retry 0,time 0.07547116279602051,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '...",0,0
3,2020-10-16 08:25:13.257408,user 0,retry 0,time 0.09484362602233887,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '...",0,0
4,2020-10-16 08:31:01.460723,user 0,retry 0,time 0.07498788833618164,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '...",0,0
5,2020-10-16 08:31:57.896479,user 9201,retry 0,time 0.12703871726989746,returncode 0,"command ['/usr/bin/scancel', '24994284']",9201,0
6,2020-10-16 08:31:58.103189,user 9201,retry 0,time 0.11046957969665527,returncode 0,"command ['/usr/bin/scancel', '24994300']",9201,0
7,2020-10-16 08:31:58.103525,user 9201,retry 0,time 0.12061500549316406,returncode 0,"command ['/usr/bin/scancel', '24994286']",9201,0
8,2020-10-16 08:31:58.114098,user 9201,retry 0,time 0.24277329444885254,returncode 0,"command ['/usr/bin/scancel', '24994333']",9201,0
9,2020-10-16 08:31:58.125105,user 9201,retry 0,time 0.1543562412261963,returncode 0,"command ['/usr/bin/scancel', '24994285']",9201,0
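
Each log field is a label followed by a value, so a regular expression can pull out the value and convert it to a numeric type in one step. The following is a minimal sketch on three complete rows copied from the CE5 and CE6 previews; the helper name `field` and the output column names are hypothetical. The command pattern assumes the executable is the first quoted path in the list, as in the rows shown.

```python
import pandas as pd

# Rows copied from the previews, keyed by the integer column labels read_csv produces.
raw = pd.DataFrame({
    1: ["user 9201", "user 9201", "user 9202"],
    2: ["retry 0", "retry 0", "retry 0"],
    3: ["time 0.12703871726989746", "time 0.11046957969665527", "time 7.894402980804443"],
    4: ["returncode 0", "returncode 0", "returncode 0"],
    5: ["command ['/usr/bin/scancel', '24994284']",
        "command ['/usr/bin/scancel', '24994300']",
        "command ['/usr/bin/scontrol', 'show', 'job']"],
})

def field(col: pd.Series, label: str) -> pd.Series:
    """Return the numeric value that follows `label` in each entry (NaN if absent)."""
    return pd.to_numeric(col.str.extract(rf"{label} (-?[\d.]+)", expand=False))

tidy = pd.DataFrame({
    "user": field(raw[1], "user"),
    "retry": field(raw[2], "retry"),
    "time": field(raw[3], "time"),
    "returncode": field(raw[4], "returncode"),
    # Name of the executable: the last path component before the first closing quote.
    "command": raw[5].str.extract(r"/(\w+)'", expand=False),
})
print(tidy.dtypes)
print(tidy)
```

With numeric `time` and `returncode` columns, conditions such as "execution time > 15 seconds" become plain comparisons instead of string operations.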


In [65]:
#ce5[0].apply(pd.Series)

In [59]:
ce6 = pd.read_csv('../Data/slurm_wrapper_ce6.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  nrows=1000)
ce6.head()

Unnamed: 0,0,1,2,3,4,5
0,2020-10-16 10:37:44.163454,user 9202,retry 0,time 0.08495402336120605,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
1,2020-10-16 10:37:44.206654,user 9202,retry 0,time 0.08943057060241699,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
2,2020-10-16 10:37:44.218760,user 9202,retry 0,time 0.05928945541381836,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
3,2020-10-16 10:37:44.256403,user 9202,retry 0,time 0.038695573806762695,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."
4,2020-10-16 10:37:44.611603,user 9202,retry 0,time 0.03343677520751953,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '..."


The attempt below did not work. The timestamp contains only a single space, so splitting on two spaces returns one column, which does not match the three column names on the left-hand side of the assignment.

In [None]:
ce6['User'] = ce6[1].str[4:9]
ce6['Returncode'] = ce6[4].str[10:]
new_cols_3 = ce6[0].str.split('  ', expand=True, n=2)
ce6[['Date', 'Start Time', 'Duration']] = new_cols_3
ce6

It turns out the timestamp can only be split into two parts (date and time), so the assignment must name exactly two columns.

In [82]:
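# Take the token after the field label; slicing at fixed positions keeps the leading space
ce6['User'] = ce6[1].str.split().str[-1]
ce6['Returncode'] = ce6[4].str.split().str[-1]
# The timestamp has a single space, so one split gives exactly two parts: date and time
new_cols = ce6[0].str.split(' ', expand=True, n=1)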
ce6[['Date', 'Begin&Duration']] = new_cols
ce6

Unnamed: 0,0,1,2,3,4,5,User,Returncode,Date,Start_Duration,Begin & Duration,Begin&Duration
0,2020-10-16 10:37:44.163454,user 9202,retry 0,time 0.08495402336120605,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.163454,10:37:44.163454,10:37:44.163454
1,2020-10-16 10:37:44.206654,user 9202,retry 0,time 0.08943057060241699,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.206654,10:37:44.206654,10:37:44.206654
2,2020-10-16 10:37:44.218760,user 9202,retry 0,time 0.05928945541381836,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.218760,10:37:44.218760,10:37:44.218760
3,2020-10-16 10:37:44.256403,user 9202,retry 0,time 0.038695573806762695,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.256403,10:37:44.256403,10:37:44.256403
4,2020-10-16 10:37:44.611603,user 9202,retry 0,time 0.03343677520751953,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.611603,10:37:44.611603,10:37:44.611603
5,2020-10-16 10:37:44.672244,user 9202,retry 0,time 0.0326387882232666,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.672244,10:37:44.672244,10:37:44.672244
6,2020-10-16 10:37:44.835425,user 9202,retry 0,time 0.03149223327636719,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job', '...",9202,0,2020-10-16,10:37:44.835425,10:37:44.835425,10:37:44.835425
7,2020-10-16 10:37:54.188933,user 9202,retry 0,time 7.894402980804443,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job']",9202,0,2020-10-16,10:37:54.188933,10:37:54.188933,10:37:54.188933
8,2020-10-16 10:37:58.688565,user 9219,retry 0,time 8.69608187675476,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job']",9219,0,2020-10-16,10:37:58.688565,10:37:58.688565,10:37:58.688565
9,2020-10-16 10:38:02.935124,user 9203,retry 0,time 9.915137767791748,returncode 0,"command ['/usr/bin/scontrol', 'show', 'job']",9203,0,2020-10-16,10:38:02.935124,10:38:02.935124,10:38:02.935124
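
An alternative to splitting the timestamp string is to parse it as a datetime, which keeps date and time together and makes the date, the hour, and the gaps between commands available as attributes. A minimal sketch on two timestamps copied from the output above (the variable names are illustrative only):

```python
import pandas as pd

# Two timestamps copied from the CE6 preview.
stamps = pd.Series(["2020-10-16 10:37:44.163454", "2020-10-16 10:37:54.188933"])

# Splitting on the single space can only ever give two columns: date and time.
parts = stamps.str.split(" ", expand=True)
print(parts.shape)

# Parsing instead gives one datetime64 column.
ts = pd.to_datetime(stamps)
dates = ts.dt.date   # calendar date of each command
hours = ts.dt.hour   # hour of day, useful for time-of-day features
gaps = ts.diff()     # elapsed time since the previous command
print(pd.DataFrame({"timestamp": ts, "date": dates, "hour": hours, "gap": gaps}))
```

A datetime column can also serve as an index for time-based grouping, which is needed when counting events per time window.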


**Phase 2: Clean and Transform the Data**  
Objectives:  
* Extract job completions from fullsample.csv.  
* Parse CE5 and CE6 logs to identify unresponsive events.  
* Create analysis-ready features (time windows, completion counts, unresponsiveness indicators).  
* Optionally include other features (currently running jobs or resource usage, time-of-day).  

Notebook Sections:  
* Code: Filtering and transforming datasets.  
* Markdown: Document preprocessing steps and reasoning.  
* Code: Combine datasets into a single dataset suitable for analysis.
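
The Phase 2 steps could be sketched as follows. This is a rough outline under several assumptions, not a prescribed solution: the rows are synthetic, the log is assumed to be already parsed into `timestamp`, `user`, `command`, `time`, and `returncode` columns (hypothetical names), both unresponsiveness conditions are assumed to be required together, and the 60-minute window is an arbitrary choice.

```python
import pandas as pd

WINDOW = "60min"  # assumption: the window length is a free analysis choice

# Synthetic job records using the fullsample.csv column names.
jobs = pd.DataFrame({
    "STATE": ["COMPLETED", "COMPLETED", "COMPLETED", "FAILED", "RUNNING"],
    "END": ["2021-09-05T11:05:00", "2021-09-05T11:40:00", "2021-09-05T13:10:00",
            "2021-09-05T11:20:00", "Unknown"],
    "EXITCODE": ["0:0", "0:0", "0:0", "1:0", "0:0"],
})

# Synthetic, already-parsed log rows (column names are hypothetical).
log = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-09-05 11:30:00", "2021-09-05 12:15:00",
                                 "2021-09-05 13:20:00"]),
    "user": [9204, 9204, 9201],
    "command": ["sbatch", "sbatch", "sbatch"],
    "time": [20.5, 0.3, 25.0],
    "returncode": [1, 0, 1],
})

# 1. Job completions: COMPLETED state with exit code 0:0.
done = jobs[(jobs["STATE"] == "COMPLETED") & (jobs["EXITCODE"] == "0:0")].copy()
done["END"] = pd.to_datetime(done["END"], errors="coerce")  # 'Unknown' becomes NaT
done = done.dropna(subset=["END"])
completions = done.set_index("END").resample(WINDOW).size().rename("completions")

# 2. Unresponsive events: sbatch from user 9204, return code 1, over 15 seconds.
bad = log[(log["command"] == "sbatch") & (log["user"] == 9204)
          & (log["returncode"] == 1) & (log["time"] > 15)]
events = bad.set_index("timestamp").resample(WINDOW).size().rename("unresponsive_events")

# 3. Combine on the shared time windows; windows missing from one side count as 0.
features = pd.concat([completions, events], axis=1).fillna(0).astype(int)
features["unresponsive"] = features["unresponsive_events"] > 0
print(features)
```

The window length controls what counts as a "burst", so it is worth repeating the analysis with a few different values to see whether the conclusions change.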

**Phase 3: Analyze and Visualize**  
Objectives:  
* Explore the relationship between job completions and unresponsiveness.  
* Create visualizations and basic summary statistics.  

Notebook Sections:  
* Code: Time-series plots, scatterplots, boxplots, summary statistics.
* Markdown: Interpret the visualizations and describe patterns.  
* Code: Fit a simple logistic regression to test the hypothesis.
* Markdown: Summarize the results and draw conclusions from the model.  
* Optional: Explore additional factors (e.g., day of week).
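
To illustrate the logistic regression step, here is a small sketch that fits P(unresponsive) as a function of completions per window by Newton's method, using only NumPy, and reports a Wald z statistic for the slope. The data is simulated with a known positive effect purely to show the mechanics; it says nothing about the ACCRE data. The function name `fit_logistic` is hypothetical, and in a notebook a library model such as statsmodels `Logit` or scikit-learn `LogisticRegression` would be the more usual choice.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton's method.

    Returns the coefficients [b0, b1] and their standard errors.
    """
    X = np.column_stack([np.ones(len(x)), x])  # intercept column plus predictor
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    # Standard errors from the inverse information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, std_err

# Simulated windows: completion counts and an unresponsiveness flag whose
# probability rises with the count (true slope 0.2 is an arbitrary choice).
rng = np.random.default_rng(0)
completions = rng.poisson(20, size=1000).astype(float)
true_prob = 1.0 / (1.0 + np.exp(-(-6.0 + 0.2 * completions)))
unresponsive = (rng.random(1000) < true_prob).astype(float)

beta, std_err = fit_logistic(completions, unresponsive)
z_slope = beta[1] / std_err[1]
print("slope:", beta[1], "std err:", std_err[1], "z:", z_slope)
```

A positive slope with a large z statistic would be consistent with the hypothesis, while a slope near zero would not. Either way, the result describes an association within the chosen windows and does not by itself establish cause.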

**Phase 4: Interpret and Conclude**  
Objectives:  
* Answer the main question: Does the data support the hypothesis that the slurm scheduler is more likely to be unresponsive during bursts of job completions?  
* Summarize findings and limitations.  

Notebook Sections:    
* Markdown: Summarize evidence for or against the hypothesis.  
* Markdown: Provide a clear conclusion.  

**Final Deliverable:**
A single Jupyter notebook that includes:  
1. Introduction & dataset overview  
2. Data exploration & cleaning  
3. Feature engineering  
4. Analysis & visualizations  
5. Interpretation & conclusion