In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

## The Advanced Computing Center for Research and Education

**Project Overview**
The Advanced Computing Center for Research and Education (ACCRE) operates Vanderbilt University's high-performance computing cluster. Jobs submitted to ACCRE are managed by the [slurm scheduler](https://slurm.schedmd.com/documentation.html), which tracks compute and memory resources.

ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

Your goal is to evaluate whether the data supports the hypothesis of bursts of job completions contributing to scheduler unresponsiveness.

You are provided three datasets:  
* fullsample.csv: Contains slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".  
* slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid).  
Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:  
    * return code = 1
    * execution time > 15 seconds

**Phase 1: Explore the Data**  
Objectives:  
* Understand the purpose of each dataset.  
* Inspect column types, sizes, and example rows.  

Notebook Sections:  
* Code: Load each dataset, preview rows, summarize columns.  
* Markdown: Notes on data quality and initial observations.  

In [3]:
# Read csv file
jobs = pd.read_csv('../data/fullsample.csv')

In [12]:
# Read log file
ce5 = pd.read_csv('../data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python')

In [11]:
# Read log file
ce6 = pd.read_csv('../data/slurm_wrapper_ce6.log',
                  header=None,
                  delimiter=' - ',
                  engine='python')

> ## Data Inspection
>
> Below we inspect the data loaded in the cells above. We examine the column types and build a table describing each column in the `jobs` dataframe. The main issue we are addressing in this project is uncompleted jobs and what is likely causing them to be uncompleted. The first step is to count how many jobs are uncompleted, meaning their `STATE` is anything other than `COMPLETED`.
>
>* **0.28%** of the jobs in our dataset are **not completed.**
>* There are **20,801** incomplete jobs total in our dataframe.

In [36]:
# Number of jobs whose STATE is not COMPLETED, computed from the data
int((jobs['STATE'] != 'COMPLETED').sum())

20801

In [16]:
# Inspect variable types
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7395885 entries, 0 to 7395884
Data columns (total 12 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   JOBID      object
 1   STATE      object
 2   BEGIN      object
 3   END        object
 4   REQMEM     object
 5   USEDMEM    object
 6   REQTIME    object
 7   USEDTIME   object
 8   NODES      int64 
 9   CPUS       int64 
 10  PARTITION  object
 11  EXITCODE   object
dtypes: int64(2), object(10)
memory usage: 677.1+ MB


column | description
-------|---------
JOBID | The identification number of the job or job step. Array jobs are in the form ArrayJobID_ArrayTaskID
STATE | Job state or status (COMPLETED, CANCELLED, FAILED, TIMEOUT, PREEMPTED, etc.)
BEGIN | Beginning time for the job.
END | Ending time for the job.
REQMEM | Requested memory in megabytes. May be per-core (Mc) or per-node (Mn)
USEDMEM | Used memory in megabytes per-node
REQTIME | Requested time in d-hh:mm:ss or hh:mm:ss
USEDTIME | Used time in d-hh:mm:ss or hh:mm:ss
NODES | Number of servers used for this job
CPUS | Total number of CPU-cores allocated to the job
PARTITION | Identifies the partition on which the job ran.
EXITCODE | The exit code returned by the job script or salloc, typically as set by the exit() function. Following the colon is the signal that caused the process to terminate if it was terminated by a signal.
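
The time and memory columns are stored as strings, so they need converting before any arithmetic. The following is a minimal sketch of one way to do that, assuming values look like `1-02:03:04` or `02:03:04` for times and `4096Mn` or `2048Mc` for `REQMEM`. The helper names `parse_slurm_duration` and `parse_reqmem` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def parse_slurm_duration(value):
    """Convert 'd-hh:mm:ss' or 'hh:mm:ss' into a pandas Timedelta."""
    # rpartition leaves days == '' when the string has no day part
    days, _, clock = value.rpartition("-")
    return pd.Timedelta(days=int(days or 0)) + pd.to_timedelta(clock)

def parse_reqmem(value):
    """Split e.g. '4096Mn' into (4096.0, 'node') or '2048Mc' into (2048.0, 'core').

    Assumes the value always ends in 'Mn' or 'Mc', as the table describes.
    """
    scope = {"n": "node", "c": "core"}[value[-1]]
    return float(value[:-2]), scope

print(parse_slurm_duration("1-02:03:04"))
print(parse_reqmem("4096Mn"))
```

Applied to the dataframe, this would look like `jobs['USEDTIME'].map(parse_slurm_duration)`. Real values may include formats this sketch does not handle, so it is worth checking for exceptions on the full column.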

In [26]:
# Summarize STATE to find the number of completed jobs
jobs['STATE'].describe()

count       7395885
unique          145
top       COMPLETED
freq        7375084
Name: STATE, dtype: object

In [35]:
# Calculate the percentage of jobs whose STATE is not COMPLETED
pct_not_completed = (jobs['STATE'] != 'COMPLETED').mean() * 100
print(f'{pct_not_completed:.2f}% of the jobs in our dataset are not completed.')

0.28% of the jobs in our dataset are not completed.


> ## Clean and Inspect Log Files

In [9]:
ce5.head(2)

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2020-10-16 10:37:44.163454 | user 9202 | retry 0 | time 0.08495402336120605 | returncode 0 | command ['/usr/bin/scontrol', 'show', 'job', '... |
| 1 | 2020-10-16 10:37:44.206654 | user 9202 | retry 0 | time 0.08943057060241699 | returncode 0 | command ['/usr/bin/scontrol', 'show', 'job', '... |


In [17]:
ce5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4770893 entries, 0 to 4770892
Data columns (total 6 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   0       object
 1   1       object
 2   2       object
 3   3       object
 4   4       object
 5   5       object
dtypes: object(6)
memory usage: 218.4+ MB


In [18]:
ce6.head(2)

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2020-10-16 10:37:44.163454 | user 9202 | retry 0 | time 0.08495402336120605 | returncode 0 | command ['/usr/bin/scontrol', 'show', 'job', '... |
| 1 | 2020-10-16 10:37:44.206654 | user 9202 | retry 0 | time 0.08943057060241699 | returncode 0 | command ['/usr/bin/scontrol', 'show', 'job', '... |


In [19]:
ce6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4776520 entries, 0 to 4776519
Data columns (total 6 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   0       object
 1   1       object
 2   2       object
 3   3       object
 4   4       object
 5   5       object
dtypes: object(6)
memory usage: 218.7+ MB
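
Every log column is an `object` that still carries its field label (`user 9202`, `time 0.08...`), so the labels have to be stripped before the unresponsiveness rule can be applied. The sketch below shows one possible approach on three made-up rows that mimic the six-field layout seen in `head()`. The function name `tidy_log`, the output column names, and the assumption that submission commands contain the substring `sbatch` are all illustrative.

```python
import pandas as pd

# Three made-up rows in the six-field layout shown above; NOT real log data.
raw = pd.DataFrame([
    ["2020-10-16 10:37:44.163454", "user 9202", "retry 0", "time 0.08",
     "returncode 0", "command ['/usr/bin/scontrol', 'show', 'job', '1']"],
    ["2020-10-16 10:40:00.000000", "user 9204", "retry 0", "time 20.5",
     "returncode 1", "command ['/usr/bin/sbatch', 'job.sh']"],
    ["2020-10-16 10:41:00.000000", "user 9204", "retry 0", "time 0.3",
     "returncode 0", "command ['/usr/bin/sbatch', 'job.sh']"],
])

def tidy_log(df):
    """Drop the leading field label from each column and convert dtypes."""
    def value(col):
        # Keep everything after the first space, e.g. 'user 9204' -> '9204'
        return df[col].str.split(" ", n=1).str[1]
    return pd.DataFrame({
        "timestamp": pd.to_datetime(df[0]),
        "user": value(1).astype(int),
        "retry": value(2).astype(int),
        "time": value(3).astype(float),
        "returncode": value(4).astype(int),
        "command": value(5),
    })

log = tidy_log(raw)

# Unresponsive: sbatch from user 9204 with return code 1 and time > 15 seconds
log["unresponsive"] = (
    (log["user"] == 9204)
    & log["command"].str.contains("sbatch", regex=False)
    & (log["returncode"] == 1)
    & (log["time"] > 15)
)
print(log[["timestamp", "user", "time", "returncode", "unresponsive"]])
```

On the real files the same function would be called as `tidy_log(ce5)` and `tidy_log(ce6)`. Rows that do not follow the six-field layout would raise during the dtype conversion, which is a useful signal that the file needs further cleaning.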


**Phase 2: Clean and Transform the Data**  
Objectives:  
* Extract job completions from fullsample.csv.  
* Parse CE5 and CE6 logs to identify unresponsive events.  
* Create analysis-ready features (time windows, completion counts, unresponsiveness indicators).  
* Optionally include other features (currently running jobs or resource usage, time-of-day).  

Notebook Sections:  
* Code: Filtering and transforming datasets.  
* Markdown: Document preprocessing steps and reasoning.  
* Code: Combine datasets into a single dataset suitable for analysis.
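
As a rough illustration of the feature-engineering step, the sketch below counts job completions per time window and marks whether any unresponsive event fell in the same window. The two small dataframes are made-up stand-ins for the cleaned tables, the `timestamp` and `unresponsive` column names are assumptions, and the one-minute window is an arbitrary choice that should be revisited against the real data.

```python
import pandas as pd

# Made-up stand-ins for the cleaned job and log tables; NOT real data.
jobs_sample = pd.DataFrame({
    "STATE": ["COMPLETED", "COMPLETED", "FAILED", "COMPLETED"],
    "EXITCODE": ["0:0", "0:0", "1:0", "0:0"],
    "END": pd.to_datetime(["2020-10-16 10:00:10", "2020-10-16 10:00:40",
                           "2020-10-16 10:00:50", "2020-10-16 10:02:05"]),
})
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-10-16 10:00:30", "2020-10-16 10:01:15"]),
    "unresponsive": [True, False],
})

# Job completions: COMPLETED state with exit code 0:0
done = jobs_sample[(jobs_sample["STATE"] == "COMPLETED")
                   & (jobs_sample["EXITCODE"] == "0:0")]

# Completions per one-minute window (window size is an assumption)
completions = done.set_index("END").resample("1min").size().rename("completions")

# A window is unresponsive if any event inside it was flagged
unresponsive = (events.set_index("timestamp")["unresponsive"]
                .astype(int).resample("1min").max().rename("unresponsive"))

# Align both series on the union of windows and fill the gaps
windows = pd.concat([completions, unresponsive], axis=1)
windows["completions"] = windows["completions"].fillna(0).astype(int)
windows["unresponsive"] = windows["unresponsive"].fillna(0).astype(bool)
print(windows)
```

Windows with no logged `sbatch` attempt are treated as responsive here, which may understate unresponsiveness. Dropping those windows instead is another defensible choice.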

**Phase 3: Analyze and Visualize**  
Objectives:  
* Explore the relationship between job completions and unresponsiveness.  
* Create visualizations and basic summary statistics.  

Notebook Sections:  
* Code: Time-series plots, scatterplots, boxplots, summary statistics.
* Markdown: Interpret the visualizations and describe patterns.  
* Code: Fit a simple logistic regression to test the hypothesis.
* Markdown: Summarize the results and draw conclusions from the model.  
* Optional: Explore additional factors (e.g., day of week).
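
The logistic regression step could take roughly the following shape, using the `statsmodels` formula API already imported in this notebook. The data here is synthetic and generated only so the example runs. It says nothing about the actual ACCRE results, and the column names `completions` and `unresponsive` are assumptions about what the combined dataset will contain.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic windows for illustration only; these are NOT ACCRE results.
rng = np.random.default_rng(0)
completions = rng.poisson(lam=20, size=2000)
prob = 1 / (1 + np.exp(-(-4 + 0.15 * completions)))
windows = pd.DataFrame({
    "completions": completions,
    "unresponsive": rng.binomial(1, prob),
})

# Model the log-odds of an unresponsive window as a linear function of completions
model = smf.logit("unresponsive ~ completions", data=windows).fit(disp=False)

# exp(coefficient) is the multiplicative change in odds per extra completion
odds_ratio = np.exp(model.params["completions"])
print(model.params)
print(f"Odds ratio per additional completion: {odds_ratio:.3f}")
```

On real data, a positive and statistically significant `completions` coefficient would be consistent with the hypothesis, though it would not by itself establish that completion bursts cause the unresponsiveness.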

**Phase 4: Interpret and Conclude**  
Objectives:  
* Answer the main question: Does the data support the hypothesis that the slurm scheduler is more likely to be unresponsive during bursts of job completions?  
* Summarize findings and limitations.  

Notebook Sections:    
* Markdown: Summarize evidence for or against the hypothesis.  
* Markdown: Provide a clear conclusion.  

**Final Deliverable:**
A single Jupyter notebook that includes:  
1. Introduction & dataset overview  
2. Data exploration & cleaning  
3. Feature engineering  
4. Analysis & visualizations  
5. Interpretation & conclusion