In [1]:
import pandas as pd
import numpy as np
pd.options.mode.copy_on_write = True
# not yet used
import datetime

# The Advanced Computing Center for Research and Education


## Project Overview

The Advanced Computing Center for Research and Education (ACCRE) operates Vanderbilt University's high-performance computing cluster. Jobs submitted to ACCRE are managed by the slurm scheduler, which tracks compute and memory resources.

ACCRE staff have hypothesized that the scheduler sometimes becomes unresponsive because it is processing large bursts of job completions. This especially affects automated job submitters, such as members of the Open Science Grid.

**Your goal is to evaluate whether the data supports the hypothesis that bursts of job completions contribute to scheduler unresponsiveness.** <br>
**(The big question: how many jobs can ACCRE process in an hour before issues emerge?)** <br>
**(Answer the main question: Does the data support the hypothesis that the slurm scheduler is more likely to be unresponsive during bursts of job completions?)**


You are provided three datasets:

- fullsample.csv: Contains slurm job records. Job completions correspond to jobs in the "COMPLETED" state with exit code "0:0".
- slurm_wrapper_ce5.log, slurm_wrapper_ce6.log: These log files contain every slurm command executed by the CE5 and CE6 servers (gateways to the Open Science Grid).

Unresponsive periods are indicated by "sbatch" commands from user 9204 that have:

- return code = 1
- execution time > 15 seconds
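
The criteria above can be written as a single boolean filter. This is a minimal sketch, assuming the log has already been parsed into a DataFrame with the column names used later in this notebook (`user`, `returncode`, `time`, `command`). The sample rows and the helper name `flag_unresponsive` are made up for illustration.

```python
import pandas as pd

# Hypothetical parsed log rows; the values are illustrative, not real log data.
log = pd.DataFrame({
    "user": [9204, 9204, 9204, 1234],
    "returncode": [1, 0, 1, 1],
    "time": [20.04, 0.08, 3.2, 20.1],
    "command": [
        "/usr/bin/sbatch a",
        "/usr/bin/sbatch b",
        "/usr/bin/sbatch c",
        "/usr/bin/squeue",
    ],
})

def flag_unresponsive(df):
    """Keep sbatch calls from user 9204 that failed (returncode 1) after more than 15 s."""
    mask = (
        df["command"].str.contains("/usr/bin/sbatch", regex=False)
        & (df["user"] == 9204)
        & (df["returncode"] == 1)
        & (df["time"] > 15)
    )
    return df[mask]

unresponsive = flag_unresponsive(log)
```

Only the first sample row satisfies all four conditions, so `unresponsive` contains one row.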

## 1. Introduction & dataset overview  

I have loaded the slurm datasets and saved the unresponsive-period data to CSV files.

### Phase 1: Explore the Data

#### Objectives:

Understand the purpose of each dataset.<br>
Inspect column types, sizes, and example rows. 
<br>
<br>
#### Notebook Sections:

Code: Load each dataset, preview rows, summarize columns.
<br>
Markdown: Notes on data quality and initial observations.

The entire fullsample csv file contains 7,395,885 rows of data, one row per job. 

In [134]:
#def completed_jobs(filehere, chunksize=500_000):
    ## chunksize is an arbitrary choice; it makes read_csv return an iterator of DataFrames
    ## (without it, looping over the result iterates over the column labels, not over rows)
#    kept = []

    ## iterate over each DataFrame chunk
#    for portion in pd.read_csv(f'../data/{filehere}', chunksize=chunksize):
        ## replace the unknowns in the chunk with nan values
#        portion = portion.replace("Unknown", np.nan)

        ## keep only rows in the COMPLETED state with exit code exactly 0:0
#        completed_exit = portion[(portion['STATE'] == 'COMPLETED') & (portion['EXITCODE'] == '0:0')].copy()

        ## convert the BEGIN and END columns to datetime from the format they were originally presented in
#        completed_exit['BEGIN'] = pd.to_datetime(completed_exit['BEGIN'], format = "%Y-%m-%dT%H:%M:%S")
#        completed_exit['END'] = pd.to_datetime(completed_exit['END'], format = "%Y-%m-%dT%H:%M:%S")

        ## collect the filtered chunk
#        kept.append(completed_exit)

    ## concatenate once and write the csv once, after the loop
#    data = pd.concat(kept)
#    data.to_csv('../data/completed_jobs.csv')

#    return data

In [136]:
#filehere = 'fullsample.csv'

In [138]:
##completed_jobs(filehere)#.head(2)

In [140]:
#completed_jobs = pd.read_csv("../data/completed_jobs.csv")

In [142]:
## read the csv file into the notebook
## (this must run: the cells below use the jobs DataFrame)
jobs = pd.read_csv("../data/fullsample.csv")#, nrows = 100000)
#jobs.head(2)

In [144]:
## replace the unknowns in jobs df with nan values
## assign it to a new df, jobs_replace_unknown
jobs_replace_unknown = jobs.replace("Unknown", np.nan)
jobs_replace_unknown.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0


In [145]:
#jobs_replace_unknown.info() 

In [146]:
## isolate the rows whose exit code is exactly 0:0
## (comparing only the last three characters would also match codes such as 10:0)
exit = jobs_replace_unknown[jobs_replace_unknown['EXITCODE'] == '0:0']
exit.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,30616928,RUNNING,2021-07-31T22:15:00,,2048Mn,0,10:04:00,67-22:14:22,1,1,production,0:0
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0


In [147]:
## further isolate rows in df to capture only COMPLETED STATE rows
completed_jobs = exit[exit['STATE'] == 'COMPLETED']
completed_jobs.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
1,30853133,COMPLETED,2021-08-06T11:36:09,2021-09-05T11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06T19:04:39,2021-09-05T19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0


The fullsample dataset contains job records, with one row per job.

Each job gets a unique ID, contained in the **JOBID** column.

Some jobs can be submitted as arrays of similar jobs. These are listed with an underscore in the JOBID, where the number after the underscore indicates the task number. For example, JOBID 31781951 was an array job with 10 parts.
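
A small sketch of how an array-job JOBID could be separated into its parts. The helper name `split_jobid` is hypothetical, and the task part is kept as a string because nothing in this notebook guarantees it is always a plain integer.

```python
def split_jobid(jobid):
    """Split a JOBID into (base_id, task); task is None for non-array jobs."""
    base, sep, task = str(jobid).partition("_")
    return base, (task if sep else None)

# JOBIDs taken from rows shown elsewhere in this notebook
array_task = split_jobid("25032600_8")
plain_job = split_jobid("30853133")
```

Grouping on the base ID would then let all tasks of one array be counted together.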

Jobs can have a few different states, with the most common one being 'COMPLETED'.

The BEGIN field indicates when the job was started (initiated on a compute node).

The END field indicates when the job ended (completed, failed, or was cancelled while running).

The REQMEM field is the amount of memory requested in megabytes. It can be per-core/CPU (Mc) or per-node (Mn).
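
To compare requests across jobs, the two REQMEM units can be brought onto one scale. The sketch below assumes that a per-core request (Mc) scales with CPUS and a per-node request (Mn) scales with NODES. That scaling rule and the helper name `total_reqmem_mb` are assumptions, not something the dataset documentation states.

```python
def total_reqmem_mb(reqmem, cpus, nodes):
    """Return the total requested memory in MB for a REQMEM string such as '4096Mc'."""
    value, unit = float(reqmem[:-2]), reqmem[-2:]
    if unit == "Mc":      # per-core request: assumed to scale with the CPU count
        return value * cpus
    if unit == "Mn":      # per-node request: assumed to scale with the node count
        return value * nodes
    raise ValueError(f"unexpected REQMEM unit in {reqmem!r}")

per_core = total_reqmem_mb("4096Mc", cpus=4, nodes=1)
per_node = total_reqmem_mb("204800Mn", cpus=32, nodes=1)
```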

In [150]:
## convert the BEGIN column to datetime from the format it was originally presented in
## assign it back to that column of the completed_jobs df
completed_jobs['BEGIN'] = pd.to_datetime(completed_jobs['BEGIN'], format = "%Y-%m-%dT%H:%M:%S")

In [151]:
## convert the END column to datetime from the format it was originally presented in
## assign it back to that column of the completed_jobs df
completed_jobs['END'] = pd.to_datetime(completed_jobs['END'], format = "%Y-%m-%dT%H:%M:%S")

In [152]:
completed_jobs.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
1,30853133,COMPLETED,2021-08-06 11:36:09,2021-09-05 11:36:32,262144Mn,20604.62M,30-00:00:00,30-00:00:23,1,1,cgw-platypus,0:0
2,30858137,COMPLETED,2021-08-06 19:04:39,2021-09-05 19:04:53,204800Mn,57553.77M,30-00:00:00,30-00:00:14,1,32,cgw-tbi01,0:0


In [153]:
## check the earliest job start time (BEGIN) in the dataset
completed_jobs["BEGIN"].min()

Timestamp('2020-10-01 00:03:08')

In [154]:
## check the latest job start time (BEGIN) in the dataset
completed_jobs["BEGIN"].max()

Timestamp('2021-10-07 20:39:26')

In [155]:
unresponsive_periods_ce5 = pd.read_csv("../data/ce5_slurm_unresponsive_periods.csv")
unresponsive_periods_ce5.head(2)

Unnamed: 0.1,Unnamed: 0,date_time,user,retry,time,returncode,command,sbatch
0,49958,2020-10-18 06:53:44.272915,9204,0,20.038464,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True
1,49972,2020-10-18 06:54:04.322412,9204,1,20.048906,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True


In [156]:
unresponsive_periods_ce6 = pd.read_csv("../data/ce6_slurm_unresponsive_periods.csv")
unresponsive_periods_ce6.head(2)

Unnamed: 0.1,Unnamed: 0,date_time,user,retry,time,returncode,command,sbatch
0,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True
1,37605,2020-10-18 06:38:44.172473,9204,0,20.038736,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True


In [158]:
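## sort each log by its original row number ("Unnamed: 0")
## (the date_time sort that merge_asof requires is done after the merge)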
unresponsive_periods_ce5 = unresponsive_periods_ce5.sort_values(by="Unnamed: 0")
unresponsive_periods_ce6 = unresponsive_periods_ce6.sort_values(by="Unnamed: 0")

In [159]:
## merge the two log files after distilling them down to unresponsive periods
## rename the slurm id column accordingly
unresponsive_periods_merged = pd.merge(
    unresponsive_periods_ce5, 
    unresponsive_periods_ce6, 
    on=["Unnamed: 0", "date_time", "user", "retry", "time", "returncode", "command", "sbatch"], 
    how='outer'
).rename(
    columns={"Unnamed: 0": "slurm_id"}
)
unresponsive_periods_merged.head(2)

Unnamed: 0,slurm_id,date_time,user,retry,time,returncode,command,sbatch
0,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True
1,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True


In [160]:
unresponsive_periods_merged["date_time"] = pd.to_datetime(unresponsive_periods_merged["date_time"])#.dt.floor('s')

In [161]:
unresponsive_periods_merged = unresponsive_periods_merged.sort_values("date_time")#.head(2)
unresponsive_periods_merged.head(2)

Unnamed: 0,slurm_id,date_time,user,retry,time,returncode,command,sbatch
0,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True
1,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True


In [162]:
unresponsive_periods_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19776 entries, 0 to 19775
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   slurm_id    19776 non-null  int64         
 1   date_time   19776 non-null  datetime64[ns]
 2   user        19776 non-null  int64         
 3   retry       19776 non-null  int64         
 4   time        19776 non-null  float64       
 5   returncode  19776 non-null  int64         
 6   command     19776 non-null  object        
 7   sbatch      19776 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 1.2+ MB


In [163]:
completed_jobs = completed_jobs.sort_values("END")#.head(2)
completed_jobs.head(2)

Unnamed: 0,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
6640577,24460577,COMPLETED,2020-10-01 00:05:55,2020-10-01 00:10:15,4096Mc,868.77M,12:00:00,00:04:20,1,1,production,0:0
6640635,24460647,COMPLETED,2020-10-01 00:10:38,2020-10-01 00:12:58,2000Mn,0.09M,2-00:00:00,00:02:20,1,1,sam,0:0


In [164]:
completed_jobs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7375084 entries, 6640577 to 1491978
Data columns (total 12 columns):
 #   Column     Dtype         
---  ------     -----         
 0   JOBID      object        
 1   STATE      object        
 2   BEGIN      datetime64[ns]
 3   END        datetime64[ns]
 4   REQMEM     object        
 5   USEDMEM    object        
 6   REQTIME    object        
 7   USEDTIME   object        
 8   NODES      int64         
 9   CPUS       int64         
 10  PARTITION  object        
 11  EXITCODE   object        
dtypes: datetime64[ns](2), int64(2), object(8)
memory usage: 731.5+ MB


In [186]:
## merge_asof: for each unresponsive event, attach the most recent job completion
## whose END is at or before the event's date_time (direction="backward")
unresponsive_completed_merged_asof = pd.merge_asof(
    unresponsive_periods_merged, 
    completed_jobs, 
    left_on="date_time", 
    right_on="END",
    direction="backward"
)#.head(50)
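
To make the behavior of `direction="backward"` concrete, here is a toy version of the same merge. The timestamps and JOBIDs are copied from rows shown in this notebook, while the variable names `events`, `jobs_done`, and `matched` are illustrative.

```python
import pandas as pd

# Two unresponsive events and three job completions, both already sorted by time.
events = pd.DataFrame({
    "date_time": pd.to_datetime(["2020-10-18 06:16:25", "2020-10-18 06:54:04"]),
})
jobs_done = pd.DataFrame({
    "END": pd.to_datetime([
        "2020-10-18 06:13:19",
        "2020-10-18 06:38:03",
        "2020-10-18 06:53:24",
    ]),
    "JOBID": ["25041808", "25032600_8", "25041598"],
})

# Each event is paired with the latest completion at or before it.
matched = pd.merge_asof(
    events,
    jobs_done,
    left_on="date_time",
    right_on="END",
    direction="backward",
)
```

Note that this attaches only the single nearest earlier completion to each event, so on its own it does not measure how many jobs completed around that time.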

In [188]:
unresponsive_completed_merged_asof["BEGIN"].value_counts()#.min()#.unique()#


In [190]:
unresponsive_completed_merged_asof.drop_duplicates(subset="date_time", keep="last")

Unnamed: 0,slurm_id,date_time,user,retry,time,returncode,command,sbatch,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
5,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
11,37605,2020-10-18 06:38:44.172473,9204,0,20.038736,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25032600_8,COMPLETED,2020-10-17 16:47:50,2020-10-18 06:38:03,4096Mn,335.27M,23:00:00,13:50:13,1,1,production,0:0
17,49958,2020-10-18 06:53:44.272915,9204,0,20.038464,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041598,COMPLETED,2020-10-18 06:31:59,2020-10-18 06:53:24,21878Mn,1.63M,2-00:00:00,00:21:25,1,8,nogpfs,0:0
23,49972,2020-10-18 06:54:04.322412,9204,1,20.048906,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041598,COMPLETED,2020-10-18 06:31:59,2020-10-18 06:53:24,21878Mn,1.63M,2-00:00:00,00:21:25,1,8,nogpfs,0:0
29,50467,2020-10-18 07:47:25.825172,9204,0,20.082628,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25042059,COMPLETED,2020-10-18 07:25:37,2020-10-18 07:47:02,21875Mn,1.83M,2-00:00:00,00:21:25,1,8,nogpfs,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19751,4661025,2021-09-24 18:14:35.862916,9204,0,20.041436,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32734318_2428,COMPLETED,2021-09-24 17:48:53,2021-09-24 18:14:01,4096Mn,527.00M,01:30:00,00:25:08,1,4,production,0:0
19757,4661384,2021-09-24 19:13:14.894282,9204,0,20.051321,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32734318_2472,COMPLETED,2021-09-24 18:52:49,2021-09-24 19:13:05,4096Mn,359.06M,01:30:00,00:20:16,1,4,production,0:0
19763,4726331,2021-10-02 08:14:16.557499,9204,0,19.083227,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32878685_2618,COMPLETED,2021-10-02 07:46:05,2021-10-02 08:13:02,1024Mn,362.38M,02:00:00,00:26:57,1,1,production,0:0
19769,4731181,2021-10-02 18:29:08.267199,9204,0,20.043146,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32908027,COMPLETED,2021-10-02 18:24:07,2021-10-02 18:28:16,4096Mc,956.50M,12:00:00,00:04:09,1,1,production,0:0


In [117]:
## count identical rows that share this completion time
unresponsive_completed_merged_asof[unresponsive_completed_merged_asof["END"] == "2020-10-18 06:13:19"].value_counts()

slurm_id  date_time                   user  retry  time       returncode  command                                                                              sbatch  JOBID     STATE      BEGIN                END                  REQMEM  USEDMEM  REQTIME   USEDTIME  NODES  CPUS  PARTITION   EXITCODE
36913     2020-10-18 06:16:25.392946  9204  0      20.037672  1           ['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x5572a7c77310.3390891/bl_23341e2dd5ae']  True    25041808  COMPLETED  2020-10-18 06:08:07  2020-10-18 06:13:19  4096Mc  871.60M  12:00:00  00:05:12  1      1     production  0:0         6
Name: count, dtype: int64

In [107]:
unresponsive_completed_merged_asof

Unnamed: 0,slurm_id,date_time,user,retry,time,returncode,command,sbatch,JOBID,STATE,BEGIN,END,REQMEM,USEDMEM,REQTIME,USEDTIME,NODES,CPUS,PARTITION,EXITCODE
0,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
1,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
2,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
3,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
4,36913,2020-10-18 06:16:25.392946,9204,0,20.037672,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,25041808,COMPLETED,2020-10-18 06:08:07,2020-10-18 06:13:19,4096Mc,871.60M,12:00:00,00:05:12,1,1,production,0:0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19771,4766868,2021-10-06 15:39:20.269943,9204,0,19.047097,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32920924,COMPLETED,2021-10-06 15:30:36,2021-10-06 15:33:18,2000Mn,392.43M,2-00:00:00,00:02:42,1,1,sam,0:0
19772,4766868,2021-10-06 15:39:20.269943,9204,0,19.047097,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32920924,COMPLETED,2021-10-06 15:30:36,2021-10-06 15:33:18,2000Mn,392.43M,2-00:00:00,00:02:42,1,1,sam,0:0
19773,4766868,2021-10-06 15:39:20.269943,9204,0,19.047097,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32920924,COMPLETED,2021-10-06 15:30:36,2021-10-06 15:33:18,2000Mn,392.43M,2-00:00:00,00:02:42,1,1,sam,0:0
19774,4766868,2021-10-06 15:39:20.269943,9204,0,19.047097,1,"['/usr/bin/sbatch', '/tmp/condor_g_scratch.0x...",True,32920924,COMPLETED,2021-10-06 15:30:36,2021-10-06 15:33:18,2000Mn,392.43M,2-00:00:00,00:02:42,1,1,sam,0:0


In [None]:
# check that datetime conversion went through
#jobs_replace_unknown.info()

In [None]:
# inspect head(2) of the completed_exit df
#completed_exit#.head(2)

In [None]:
## Jobs where memory was requested per node.
#completed_exit[completed_exit['REQMEM'].str[-2:] == 'Mn'].head(2)

In [None]:
## Jobs where memory was requested per core.
#completed_exit[completed_exit['REQMEM'].str[-2:] == 'Mc'].head(2)

In [None]:
## check how many unique values are in the PARTITION column, and their counts
## (the "debug" and "sam" partitions hold test jobs that are expected to be short)
#completed_exit["PARTITION"].value_counts()

The USEDMEM column is the amount of memory used in MB per node.

The requested time (REQTIME) and used time (USEDTIME) columns are in d-hh:mm:ss or hh:mm:ss for jobs less than one day in duration.
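
Both time layouts can be handled by one small parser. This is a sketch using only the standard library, and the function name `parse_slurm_time` is hypothetical.

```python
from datetime import timedelta

def parse_slurm_time(text):
    """Convert 'd-hh:mm:ss' or 'hh:mm:ss' into a timedelta."""
    # rpartition leaves `days` empty when there is no 'd-' prefix
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(
        days=int(days) if days else 0,
        hours=hours,
        minutes=minutes,
        seconds=seconds,
    )

long_job = parse_slurm_time("30-00:00:23")   # a USEDTIME value shown in this notebook
short_job = parse_slurm_time("00:04:20")
```

Calling `.total_seconds()` on the result gives a numeric duration that can be compared between REQTIME and USEDTIME.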

NODES is the number of servers used for the job. Most jobs are single node. For multiple node jobs, memory usage is the maximum over all nodes.
 
CPUS is the total number of CPU cores allocated to the job, and for multi-node jobs, this includes all nodes.

Most jobs are run in the "production" or "nogpfs" partition. The "debug" and "sam" partitions are test jobs that are expected to be short, and the "maxwell", "pascal", and "turing" partitions are for GPU resources.

The **EXITCODE** gives the [exit code](https://www.agileconnection.com/article/overview-linux-exit-codes) for the job, with "0:0" indicating a successful job. Exit codes have two numbers, where if the first number is non-zero, it indicates a problem on the server side and if the second is nonzero, it indicates a problem on the user side.
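
The two parts of the exit code can be separated before testing for success. In this sketch the helper names are hypothetical. It also shows why an exact comparison is safer than checking only the last three characters, which would accept a code such as "10:0".

```python
def parse_exitcode(exitcode):
    """Split an EXITCODE string such as '0:0' into its two integer parts."""
    first, second = (int(part) for part in exitcode.split(":"))
    return first, second

def is_success(exitcode):
    """A job succeeded only when both parts of the exit code are zero."""
    return parse_exitcode(exitcode) == (0, 0)

ok = is_success("0:0")
suffix_trap = "10:0"[-3:] == "0:0"   # a suffix check accepts this code
exact_check = is_success("10:0")     # the exact check rejects it
```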

### Slurm Wrapper Logs (ce5 and ce6)

The slurm_wrapper_ce5.log and slurm_wrapper_ce6.log files contain logs of jobs submitted from the Open Science Grid.

For this project, we are interested in jobs from user 9204 (the test user) where the command starts with '/usr/bin/sbatch', the returncode is non-zero, and the time is greater than 15 seconds. These conditions indicate that the scheduler became unresponsive at that point in time.
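
A sketch of parsing one log line with the standard library. The line layout below (six fields separated by ' - ', each after the first prefixed by its label) is inferred from the parsing code in this notebook and is an assumption. The sample line is constructed for illustration and shortens the command, and `parse_log_line` is a hypothetical helper.

```python
from datetime import datetime

# Constructed sample line; the real log layout is assumed, not confirmed.
SAMPLE = (
    "2020-10-18 06:53:44.272915 - user 9204 - retry 0 - time 20.038464 - "
    "returncode 1 - command ['/usr/bin/sbatch', '/tmp/example_script']"
)

def parse_log_line(line):
    """Split one wrapper-log line into typed fields."""
    # maxsplit=5 keeps any ' - ' inside the command text intact
    stamp, user, retry, elapsed, returncode, command = line.rstrip("\n").split(" - ", 5)
    return {
        "date_time": datetime.fromisoformat(stamp),
        "user": int(user.split()[1]),
        "retry": int(retry.split()[1]),
        "time": float(elapsed.split()[1]),
        "returncode": int(returncode.split()[1]),
        "command": command.split(" ", 1)[1],
    }

record = parse_log_line(SAMPLE)
is_unresponsive = (
    "/usr/bin/sbatch" in record["command"]
    and record["user"] == 9204
    and record["returncode"] == 1
    and record["time"] > 15
)
```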

In [None]:
#def build_slurm(filename, chunksize=500_000):
    ## chunksize is an arbitrary choice; it makes read_csv return an iterator of DataFrames
    ## (looping over a plain DataFrame iterates over its column labels, so the same
    ##  filtered rows would be appended once per column)
#    reader = pd.read_csv(f'../data/{filename}',
#                  header=None,
#                  delimiter=' - ',
#                  engine='python',
#                  chunksize=chunksize)

    ## derive the output name, e.g. slurm_wrapper_ce6.log -> ce6_slurm_unresponsive_periods.csv
#    outfile = filename.replace('slurm_wrapper_', '').replace('.log', '_slurm_unresponsive_periods.csv')
#    kept = []

    ## iterate over each DataFrame chunk
#    for portion in reader:
        ## rename the columns to words instead of simply numbers
#        named = portion.rename(columns={0: 'date_time', 1: 'user', 2: 'retry', 3: 'time', 4: 'returncode', 5: 'command'})

        ## convert the date_time column to datetime from the format it was originally presented in
#        named['date_time'] = pd.to_datetime(named['date_time'], format = "mixed")

        ## strip the label word from each numeric column and we are left with a number
#        for col in ['user', 'retry', 'time', 'returncode']:
#            named[col] = pd.to_numeric(named[col].str.strip(col))

        ## strip the word command from within the command column
#        named["command"] = named["command"].str.strip("command")

        ## locate rows which contain sbatch
#        named["sbatch"] = named["command"].str.contains("/usr/bin/sbatch")

        ## unresponsive periods: sbatch rows from user 9204 with returncode 1 and time > 15 seconds
#        unresponsive_periods = named[named["sbatch"] & (named["user"] == 9204) & (named["returncode"] == 1) & (named["time"] > 15)]
#        kept.append(unresponsive_periods)

    ## concatenate once and write the csv once, after the loop
#    data = pd.concat(kept)
#    data.to_csv(f'../data/{outfile}')

#    return data

In [None]:
##filename variable for build_slurm function
filename = "slurm_wrapper_ce6.log" #"slurm_wrapper_ce5.log"

In [None]:
#build_slurm(filename)

In [None]:
# collect the rows whose END value is "Unknown" into a new df
jobs_unknown = jobs[jobs["END"]=="Unknown"]

In [None]:
#jobs

In [None]:
#jobs_replace[0:3]

In [None]:
#jobs_replace[jobs_replace['JOBID'].str.contains('_')]

### Phase 2: Clean and Transform the Data

#### Objectives:
Extract job completions from fullsample.csv. <br>
Parse CE5 and CE6 logs to identify unresponsive events. <br>
Create analysis-ready features (time windows, completion counts, unresponsiveness indicators). <br>
Optionally include other features (currently running jobs or resource usage, time-of-day).

#### Notebook Sections:
Code: Filtering and transforming datasets <br>
Markdown: Document preprocessing steps and reasoning <br>
Code: Combine datasets into a single dataset suitable for analysis
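
One possible shape for the analysis-ready table is one row per hour, with the number of completions and an unresponsiveness flag. This is a sketch under stated assumptions: the one-hour window, the helper name `hourly_features`, and the output column names are illustrative choices, and the tiny input frames stand in for `completed_jobs` and `unresponsive_periods_merged`.

```python
import pandas as pd

# Stand-in data; timestamps are illustrative.
completions = pd.DataFrame({"END": pd.to_datetime([
    "2020-10-18 06:05:00",
    "2020-10-18 06:13:19",
    "2020-10-18 06:38:03",
    "2020-10-18 08:10:00",
])})
events = pd.DataFrame({"date_time": pd.to_datetime([
    "2020-10-18 06:16:25",
    "2020-10-18 06:38:44",
])})

def hourly_features(completions, events, window=pd.Timedelta(hours=1)):
    """Count completions and unresponsive events per time window."""
    comp = completions.set_index("END").resample(window).size().rename("completions")
    unresp = events.set_index("date_time").resample(window).size().rename("unresponsive_events")
    # Windows with no events are missing from `unresp`; treat them as zero.
    features = pd.concat([comp, unresp], axis=1).fillna(0).astype(int)
    features["unresponsive"] = features["unresponsive_events"] > 0
    return features

features = hourly_features(completions, events)
```

If the event log contains repeated rows, dropping duplicates before counting keeps `unresponsive_events` from being inflated.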

### 3. Feature engineering

### Phase 3: Analyze and Visualize

#### Objectives:
Explore the relationship between job completions and unresponsiveness. <br>
Create visualizations and basic summary statistics

#### Notebook Sections:
Code: Time-series plots, scatterplots, boxplots, summary statistics. <br>
Markdown: Interpret the visualizations and describe patterns. <br>
Code: Fit a simple logistic regression to test the hypothesis. <br>
Markdown: Summarize the results and draw conclusions from the model. <br>
Optional: Explore additional factors (e.g., day of week).
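
The logistic regression step could look like the sketch below, which fits an intercept and one slope by plain gradient ascent in NumPy. This stands in for a statistics library and is not the notebook's chosen method. The data are synthetic, and `fit_logistic`, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1 * z), where z is x standardized, by gradient ascent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x - x.mean()) / x.std()                 # standardize for stable step sizes
    design = np.column_stack([np.ones_like(z), z])
    beta = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        beta += lr * design.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
    return beta

# Synthetic hourly data: completions per hour and whether the hour had an unresponsive event.
completions_per_hour = [10, 20, 30, 40, 50, 60, 70, 80]
unresponsive_flag = [0, 0, 1, 0, 0, 1, 1, 1]

intercept, slope = fit_logistic(completions_per_hour, unresponsive_flag)
```

A positive `slope` means the fitted probability of an unresponsive hour rises with the completion count. Judging whether that rise is statistically meaningful needs standard errors, which this sketch does not compute.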

### Phase 4: Interpret and Conclude

#### Objectives:
Answer the main question: Does the data support the hypothesis that the slurm scheduler is more likely to be unresponsive during bursts of job completions? <br>
Summarize findings and limitations.

#### Notebook Sections:
Markdown: Summarize evidence for or against the hypothesis. <br>
Markdown: Provide clear conclusions.

### Final Deliverable: A single Jupyter notebook that includes:

1. Introduction & dataset overview
2. Data exploration & cleaning
3. Feature engineering
4. Analysis & visualizations
5. Interpretation & Conclusion