The Advanced Computing Center for Research and Education (ACCRE) is a computer cluster serving the high-performance computing needs of research for Vanderbilt University. In this data question, you will be analyzing data on jobs run on ACCRE's hardware.

When a job is submitted to ACCRE, it goes through the [slurm scheduler](https://slurm.schedmd.com/documentation.html), which tracks and manages compute and memory resources. For this project, your main objective is to investigate a potential cause of the scheduler becoming unresponsive.

It is hypothesized that the slurm scheduler processes so many job completions in such quick succession that it sometimes becomes unresponsive to commands from users trying to schedule new jobs or check job status. This is a particularly bad problem for clients who use automated submission systems, such as members of the Open Science Grid. The goal of this project is to investigate the hypothesis that a large number of job completions in a short time period causes the scheduler to become unresponsive, and to determine the rough threshold at which this becomes an issue.

You have been provided three datasets for this task:
* **fullsample.csv**: This file contains output for jobs run through the slurm scheduler.
* **slurm_wrapper_ce5.log** and **slurm_wrapper_ce6.log**: Logs of every slurm command that a pair of servers, ce5 and ce6, executed, how long each command took, and whether it succeeded. These servers connect ACCRE's local cluster to the Open Science Grid and submit jobs to slurm on behalf of the grid.

Job completions can be identified by looking in fullsample.csv for jobs in the "COMPLETED" state with exit code 0:0.
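As a minimal sketch of this filter, the example below builds a small hypothetical DataFrame with the same `STATE`, `END`, and `EXITCODE` columns as fullsample.csv, keeps only successful completions, and counts them per hour. The sample rows and the one-hour bin width are assumptions chosen for illustration, not values prescribed by the task.

```python
import pandas as pd

# Hypothetical rows mirroring the STATE, END, and EXITCODE columns of fullsample.csv.
jobs = pd.DataFrame({
    "STATE": ["COMPLETED", "COMPLETED", "COMPLETED", "FAILED", "RUNNING"],
    "END": [
        "2020-10-31T23:40:46",
        "2020-10-31T23:40:38",
        "2020-10-31T22:49:43",
        "2020-10-31T23:51:40",
        "Unknown",
    ],
    "EXITCODE": ["0:0", "0:0", "0:0", "1:0", "0:0"],
})

# A job completion is a job in the COMPLETED state with exit code 0:0.
is_completion = (jobs["STATE"] == "COMPLETED") & (jobs["EXITCODE"] == "0:0")
completions = jobs[is_completion].copy()

# Parse END as a datetime; unparseable values such as "Unknown" become NaT.
completions["END"] = pd.to_datetime(completions["END"], errors="coerce")
completions = completions.dropna(subset=["END"])

# Count completions per hour (the bin width is an assumption to tune).
per_hour = completions.groupby(completions["END"].dt.floor("60min")).size()
print(per_hour)
```

A narrower bin, such as one or five minutes, may be more appropriate when looking for short bursts of completions.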

To identify periods of unresponsiveness, use the two log files. Look for records that are the "sbatch" command from user 9204 (the test user) that have return code 1 and an execution time of greater than 15 seconds.
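The criteria above could be applied roughly as follows. This sketch parses a few hypothetical log lines written in the same ` - `-delimited layout as the wrapper logs; the sample lines, the `/tmp/job.sh` path, and the column names are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical log lines in the same " - " delimited layout as the wrapper logs.
sample_log = io.StringIO(
    "2020-10-18 06:53:44.272915 - user 9204 - retry 0 - time 20.038464 - returncode 1 - command ['/usr/bin/sbatch', '/tmp/job.sh']\n"
    "2021-10-02 18:57:09.500701 - user 9204 - retry 0 - time 15.495682 - returncode 0 - command ['/usr/bin/sbatch', '/tmp/job.sh']\n"
    "2021-10-07 21:58:06.738329 - user 9203 - retry 0 - time 0.026772 - returncode 0 - command ['/usr/bin/squeue', '-o', '%i %T']\n"
    "2021-10-07 22:10:00.000000 - user 9204 - retry 0 - time 0.500000 - returncode 1 - command ['/usr/bin/sbatch', '/tmp/job.sh']\n"
)

log = pd.read_csv(
    sample_log,
    header=None,
    names=["timestamp", "user", "retry", "time", "return", "command"],
    sep=" - ",        # a multi-character separator requires the python engine
    engine="python",
)

# Strip the text labels so the values can be compared numerically.
log["timestamp"] = pd.to_datetime(log["timestamp"])
log["time"] = log["time"].str.replace("time ", "", regex=False).astype(float)
log["returncode"] = log["return"].str.replace("returncode ", "", regex=False).astype(int)

# Unresponsive: sbatch from the test user that failed and took longer than 15 seconds.
unresponsive = log[
    (log["user"] == "user 9204")
    & log["command"].str.contains("sbatch", regex=False)
    & (log["returncode"] == 1)
    & (log["time"] > 15)
]
print(unresponsive[["timestamp", "time", "returncode"]])
```

Only the first sample line meets all four conditions; the others are excluded because they succeeded, came from another user, or failed quickly.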

At the end of the project, your group will deliver a 10-15 minute presentation showing your findings and conclusions about this question. Do you find evidence to support the hypothesis that the scheduler is more likely to be unresponsive during periods with a high number of job completions? Your presentation can include any exploratory analysis that you did to work towards answering the main question.
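One possible way to connect the two data sources, sketched here under stated assumptions, is to count how many jobs completed in a fixed window before each test-user `sbatch` attempt and then compare those counts between responsive and unresponsive attempts. The timestamps, the five-minute window, and the column names `unresponsive` and `completions_in_window` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical completion times (stand-in for END of COMPLETED jobs with exit code 0:0).
ends = pd.to_datetime([
    "2020-10-18 10:00:10",
    "2020-10-18 10:01:00",
    "2020-10-18 10:02:30",
    "2020-10-18 10:03:00",
    "2020-10-18 10:04:00",
    "2020-10-18 11:30:00",
]).sort_values()

# Hypothetical test-user sbatch attempts, flagged as unresponsive or not.
attempts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-10-18 10:04:30",
        "2020-10-18 11:31:00",
        "2020-10-18 12:30:00",
    ]),
    "unresponsive": [True, False, False],
})

window = pd.Timedelta(minutes=5)  # assumed window length; worth varying

# Binary search on the sorted completion times gives the number of
# completions in the closed interval [timestamp - window, timestamp].
end_values = ends.to_numpy()
upper = np.searchsorted(end_values, attempts["timestamp"].to_numpy(), side="right")
lower = np.searchsorted(end_values, (attempts["timestamp"] - window).to_numpy(), side="left")
attempts["completions_in_window"] = upper - lower

# Compare the average completion load for the two groups of attempts.
print(attempts.groupby("unresponsive")["completions_in_window"].mean())
```

With counts like these attached to every attempt, a logistic regression of the unresponsive flag on the completion count is one option for estimating a rough threshold.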

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import requests
from bs4 import BeautifulSoup
from IPython.core.display import HTML
import io
import re
import json
import csv
import os
import statsmodels.api as sm
import seaborn as sns

In [3]:
full_sample_df = pd.read_csv('../data/fullsample (1).csv')

In [4]:
full_sample_df

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,Unknown,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0
3,30935078,COMPLETED,2021-08-09T16:52:51,2021-09-07T20:52:55,65536Mn,20577.96M,29-04:00:00,29-04:00:04,1,8,cgw-platypus,0:0
4,31364111_2,COMPLETED,2021-08-17T07:45:07,2021-09-10T16:45:24,16384Mn,9733.43M,24-09:00:00,24-09:00:17,1,1,production,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...
7395880,25493434,COMPLETED,2020-10-31T23:39:00,2020-10-31T23:40:46,2000Mn,0.09M,2-00:00:00,00:01:46,1,1,sam,0:0
7395881,25493435,COMPLETED,2020-10-31T23:39:13,2020-10-31T23:40:38,2000Mn,187.92M,2-00:00:00,00:01:25,1,1,sam,0:0
7395882,25493476,COMPLETED,2020-10-31T23:46:29,2020-10-31T23:49:43,4096Mc,803.97M,12:00:00,00:03:14,1,1,production,0:0
7395883,25493515,COMPLETED,2020-10-31T23:49:44,2020-10-31T23:51:40,2000Mn,0.09M,2-00:00:00,00:01:56,1,1,sam,0:0


In [5]:
ce5 = pd.read_csv('../data/slurm_wrapper_ce5.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                 )

In [6]:
ce5.columns = ['timestamp', 'user', 'retry', 'time', 'return', 'command']


In [7]:
ce5['time'] = ce5['time'].str.replace('time', '').astype(float)

In [8]:
ce5

Unnamed: 0,timestamp,user,retry,time,return,command
0,2020-10-16 08:15:39.278699,user 0,retry 0,0.073476,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
1,2020-10-16 08:18:08.313309,user 0,retry 0,0.183632,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
2,2020-10-16 08:22:48.128689,user 0,retry 0,0.075471,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
3,2020-10-16 08:25:13.257408,user 0,retry 0,0.094844,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
4,2020-10-16 08:31:01.460723,user 0,retry 0,0.074988,returncode 0,"command ['/usr/bin/sacct', '-u', 'appelte1', '..."
...,...,...,...,...,...,...
4770888,2021-10-07 21:58:06.738329,user 9203,retry 0,0.026772,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770889,2021-10-07 21:58:15.931559,user 9201,retry 0,0.041662,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770890,2021-10-07 21:58:48.900136,user 9221,retry 0,0.143490,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."
4770891,2021-10-07 21:59:11.314056,user 9203,retry 0,0.026599,returncode 0,"command ['/usr/bin/squeue', '-o', '%i %T', '-u..."


In [9]:
ce5 = ce5[ce5['user'] == 'user 9204']

In [18]:
ce5 = ce5[ce5['time'] >= 15]

In [20]:
# Copy the filtered rows so later column assignments do not raise SettingWithCopyWarning.
ce5 = ce5[ce5['command'].str.contains('sbatch')].copy()

In [38]:
ce5['server'] = 'ce5'

In [40]:
ce5

Unnamed: 0,timestamp,user,retry,time,return,command,server
49958,2020-10-18 06:53:44.272915,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
49972,2020-10-18 06:54:04.322412,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
50467,2020-10-18 07:47:25.825172,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
50473,2020-10-18 07:47:45.871008,user 9204,retry 1,20.045221,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
50582,2020-10-18 07:53:33.972840,user 9204,retry 0,20.041486,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
...,...,...,...,...,...,...,...
4661384,2021-09-24 19:13:14.894282,user 9204,retry 0,20.051321,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
4726331,2021-10-02 08:14:16.557499,user 9204,retry 0,19.083227,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
4731181,2021-10-02 18:29:08.267199,user 9204,retry 0,20.043146,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
4731399,2021-10-02 18:57:09.500701,user 9204,retry 0,15.495682,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5


In [24]:
ce6 = pd.read_csv('../data/slurm_wrapper_ce6.log',
                  header=None,
                  delimiter=' - ',
                  engine='python',
                  )

In [26]:
ce6.columns = ['timestamp', 'user', 'retry', 'time', 'return', 'command']

In [28]:
ce6['time'] = ce6['time'].str.replace('time', '').astype(float)

In [30]:
ce6 = ce6[ce6['user'] == 'user 9204']

In [32]:
ce6 = ce6[ce6['time'] >= 15]

In [34]:
# Copy the filtered rows so later column assignments do not raise SettingWithCopyWarning.
ce6 = ce6[ce6['command'].str.contains('sbatch')].copy()

In [42]:
ce6['server'] = 'ce6'

In [44]:
ce6

Unnamed: 0,timestamp,user,retry,time,return,command,server
11319,2020-10-16 22:38:52.542223,user 9204,retry 0,19.019137,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
36913,2020-10-18 06:16:25.392946,user 9204,retry 0,20.037672,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
37605,2020-10-18 06:38:44.172473,user 9204,retry 0,20.038736,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
39075,2020-10-18 07:47:32.241050,user 9204,retry 0,20.018348,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
39356,2020-10-18 08:08:49.366063,user 9204,retry 0,20.030497,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
...,...,...,...,...,...,...,...
4662070,2021-09-24 12:56:56.057323,user 9204,retry 0,19.568814,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4662752,2021-09-24 13:29:48.498748,user 9204,retry 0,20.085085,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4667202,2021-09-24 20:59:45.540176,user 9204,retry 0,16.153547,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4737128,2021-10-02 19:03:06.524282,user 9204,retry 0,15.063486,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6


In [46]:
server_merge = pd.concat([ce5, ce6], ignore_index=True)

In [48]:
server_merge

Unnamed: 0,timestamp,user,retry,time,return,command,server
0,2020-10-18 06:53:44.272915,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
1,2020-10-18 06:54:04.322412,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
2,2020-10-18 07:47:25.825172,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
3,2020-10-18 07:47:45.871008,user 9204,retry 1,20.045221,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
4,2020-10-18 07:53:33.972840,user 9204,retry 0,20.041486,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5
...,...,...,...,...,...,...,...
4112,2021-09-24 12:56:56.057323,user 9204,retry 0,19.568814,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4113,2021-09-24 13:29:48.498748,user 9204,retry 0,20.085085,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4114,2021-09-24 20:59:45.540176,user 9204,retry 0,16.153547,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6
4115,2021-10-02 19:03:06.524282,user 9204,retry 0,15.063486,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6


In [74]:
server_merge['timestamp'] = pd.to_datetime(server_merge['timestamp'])


In [72]:
server_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4117 entries, 0 to 4116
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   timestamp   4117 non-null   object        
 1   user        4117 non-null   object        
 2   retry       4117 non-null   object        
 3   time        4117 non-null   float64       
 4   return      4117 non-null   object        
 5   command     4117 non-null   object        
 6   server      4117 non-null   object        
 7   Time_Stamp  4117 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(6)
memory usage: 257.4+ KB


In [76]:
server_merge['Date'] = server_merge['timestamp'].dt.date
server_merge['Time'] = server_merge['timestamp'].dt.time
server_merge

Unnamed: 0,timestamp,user,retry,time,return,command,server,Time_Stamp,Date,Time
0,2020-10-18 06:53:44.272915,user 9204,retry 0,20.038464,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5,2020-10-18 06:53:44.272915,2020-10-18,06:53:44.272915
1,2020-10-18 06:54:04.322412,user 9204,retry 1,20.048906,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5,2020-10-18 06:54:04.322412,2020-10-18,06:54:04.322412
2,2020-10-18 07:47:25.825172,user 9204,retry 0,20.082628,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5,2020-10-18 07:47:25.825172,2020-10-18,07:47:25.825172
3,2020-10-18 07:47:45.871008,user 9204,retry 1,20.045221,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5,2020-10-18 07:47:45.871008,2020-10-18,07:47:45.871008
4,2020-10-18 07:53:33.972840,user 9204,retry 0,20.041486,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce5,2020-10-18 07:53:33.972840,2020-10-18,07:53:33.972840
...,...,...,...,...,...,...,...,...,...,...
4112,2021-09-24 12:56:56.057323,user 9204,retry 0,19.568814,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6,2021-09-24 12:56:56.057323,2021-09-24,12:56:56.057323
4113,2021-09-24 13:29:48.498748,user 9204,retry 0,20.085085,returncode 1,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6,2021-09-24 13:29:48.498748,2021-09-24,13:29:48.498748
4114,2021-09-24 20:59:45.540176,user 9204,retry 0,16.153547,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6,2021-09-24 20:59:45.540176,2021-09-24,20:59:45.540176
4115,2021-10-02 19:03:06.524282,user 9204,retry 0,15.063486,returncode 0,"command ['/usr/bin/sbatch', '/tmp/condor_g_scr...",ce6,2021-10-02 19:03:06.524282,2021-10-02,19:03:06.524282


In [None]:
server_merge.drop(columns = ['timestamp'])

In [None]:
# Keep only successful completions: COMPLETED state with exit code 0:0.
Job_completions_df = full_sample_df[(full_sample_df['STATE'] == 'COMPLETED') & (full_sample_df['EXITCODE'] == '0:0')]

In [None]:
Job_completions_df

In [None]:
unresponsiveness_df1 = slurm_wrapper_ce5_df.loc[slurm_wrapper_ce5_df['user 9202'] == 'user 9204'].loc[slurm_wrapper_ce5_df['returncode 0'] == 'returncode 1'].loc[slurm_wrapper_ce5_df['time 0.07347559928894043'] > 