For Question #2, filter to CMS users (cmslocal, cmspilot) and examine the relationship between jobs that terminate early (< 60 min), the production and nogpfs partitions, and specific nodes. Look for nodes that fail frequently.

Actions: 1) pull in code to convert time to seconds; 2) run a query for CMS users with used time less than 3600 seconds and less than 1800 seconds, partition = production or nogpfs; 3) analyze exit codes: left = user errors, right = node errors; 4) run a query for CMS users, under 30 minutes, plus STATE = FAILED; 5) analyze for commonly occurring failed nodes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
%matplotlib inline
from io import StringIO
import re

In [None]:
for_pd = StringIO()
with open('../data/accre-jobs-2020.csv') as accre:
    for line in accre:
        new_line = re.sub(r',', '|', line.rstrip(), count=12)
        print (new_line, file=for_pd)

for_pd.seek(0)

accre = pd.read_csv(for_pd, sep='|')
print (accre)

In [None]:
accre.head()

In [None]:
accre.info()

In [None]:
accre.groupby('STATE').size()

In [None]:
# bring in code to convert time to seconds 
# Create a function to split the hh:mm:ss string and calculate seconds from it
def to_sec(x):
    h,m,s = map(int,x.split(':'))
    return (h*60+m)*60+s

In [None]:
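# Extract the optional day count (the digits before '-') from REQTIME; a raw string avoids an invalid-escape warning for \d
accre['REQTIME_DAY_SEC'] = accre['REQTIME'].str.extract(r'(\d+)-')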
accre['REQTIME_DAY_SEC'] = pd.to_numeric(accre['REQTIME_DAY_SEC'])
accre['REQTIME_DAY_SEC'] = accre['REQTIME_DAY_SEC'].fillna(0)
accre['REQTIME_DAY_SEC'] = accre['REQTIME_DAY_SEC']*24*60*60
# Extract the hh:mm:ss from REQTIME, put in a new column, and then apply the to_sec function 
accre['REQTIME_T'] = accre['REQTIME'].str.extract('(..:..:..)$')
# REQTIME_SEC includes total seconds from REQTIME
accre['REQTIME_SEC'] = accre['REQTIME_T'].apply(to_sec) + accre['REQTIME_DAY_SEC']
# Do the same for USEDTIME
accre['USEDTIME_DAY_SEC'] = accre['USEDTIME'].str.extract(r'(\d+)-')
accre['USEDTIME_DAY_SEC'] = pd.to_numeric(accre['USEDTIME_DAY_SEC'])
accre['USEDTIME_DAY_SEC'] = accre['USEDTIME_DAY_SEC'].fillna(0)
accre['USEDTIME_DAY_SEC'] = accre['USEDTIME_DAY_SEC']*24*60*60
# Extract the hh:mm:ss from USEDTIME and put it in a new column
accre['USEDTIME_T'] = accre['USEDTIME'].str.extract('(..:..:..)$')
# USEDTIME_SEC includes total seconds from USEDTIME
accre['USEDTIME_SEC'] = accre['USEDTIME_T'].apply(to_sec) + accre['USEDTIME_DAY_SEC']
# Check to make sure the data types look okay
accre.info()

In [None]:
accre_cms = accre.drop(['REQMEM', 'USEDMEM', 'REQTIME', 'USEDTIME', 'REQTIME_DAY_SEC', 'REQTIME_T', 'USEDTIME_DAY_SEC', 'USEDTIME_T'], axis = 1)

In [None]:
accre_cms.info()

First, filter the users to cmspilot and cmslocal

In [None]:
accre_cms_user = accre_cms.loc[accre_cms['USER'].isin(['cmslocal', 'cmspilot'])]
accre_cms_user

In [None]:
accre_cms_user.USER.unique()

In [None]:
accre_cms_user.PARTITION.unique()

In [None]:
accre_cms_user.info()

In [None]:
accre_cms_user.ACCOUNT.unique()

Note that even after filtering to cmslocal and cmspilot, there are still 3 accounts.
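
To see which user maps to which account, a cross-tabulation can help. The following is a minimal sketch on toy data; the account names (`acct_a`, `acct_b`, `acct_c`) are hypothetical placeholders, not values from the ACCRE file.

```python
import pandas as pd

# Toy rows standing in for accre_cms_user; account names are made up
jobs = pd.DataFrame({
    'USER': ['cmslocal', 'cmslocal', 'cmspilot', 'cmspilot', 'cmspilot'],
    'ACCOUNT': ['acct_a', 'acct_b', 'acct_a', 'acct_a', 'acct_c'],
})

# Job counts for every USER/ACCOUNT combination
user_by_account = pd.crosstab(jobs['USER'], jobs['ACCOUNT'])

# Number of distinct accounts each user submitted under
accounts_per_user = jobs.groupby('USER')['ACCOUNT'].nunique()

print(user_by_account)
print(accounts_per_user)
```

On the real data, the same two calls on `accre_cms_user` would presumably show which of the 3 accounts belong to each CMS user.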

Now filter the partitions to production and nogpfs

In [None]:
accre_cms_user_part = accre_cms_user.loc[accre_cms_user['PARTITION'].isin(['production', 'nogpfs'])]
accre_cms_user_part

In [None]:
accre_cms_user_part.info()

In [None]:
accre_cms_user_part.PARTITION.unique()

In [None]:
accre_cms_user_part.ACCOUNT.unique()

Note there are now 2 Accounts associated with cmslocal and cmspilot

So we now have a DataFrame with cmslocal/cmspilot as the users and production/nogpfs as the partitions. Next, filter to jobs shorter than 60 minutes, and build a separate df for jobs shorter than 30 minutes. (Note: I originally tried to filter directly on the original USEDTIME column with less than 1:00, but got the error "'<' not supported between instances of 'list' and 'str'".)
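
The comparison fails because USEDTIME is text, so it has to be turned into a number before it can be compared with a threshold. As a compact alternative to the column-by-column conversion used earlier, here is a sketch of a single helper. The name `slurm_time_to_seconds` is hypothetical, and it assumes every value has the form `[d-]hh:mm:ss`, which is what the regular expressions above imply.

```python
def slurm_time_to_seconds(value):
    """Convert a '[d-]hh:mm:ss' string to total seconds."""
    # rpartition returns ('', '', value) when there is no 'd-' prefix
    days, _, clock = value.rpartition('-')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return (int(days or 0) * 24 + hours) * 3600 + minutes * 60 + seconds

# Toy values, not taken from the ACCRE data
samples = ['00:29:59', '01:00:00', '2-03:04:05']
seconds = [slurm_time_to_seconds(s) for s in samples]

# Numeric comparison now works: keep only the entries under one hour
under_one_hour = [s for s, sec in zip(samples, seconds) if sec < 3600]
print(seconds, under_one_hour)
```

With pandas, the same helper could be applied as `accre['USEDTIME'].apply(slurm_time_to_seconds)`, provided the column contains no missing values.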

In [None]:
# df with jobs under 60 minutes. The upper bound is 60 minutes plus 2 seconds in case the last digit
# is not included and there's a delay in cancel execution. Note that between() includes both endpoints.
accre_cms_user_part_1hr = accre_cms_user_part[accre_cms_user_part['USEDTIME_SEC'].between(0,3602)]
accre_cms_user_part_1hr

In [None]:
accre_cms_user_part_1hr.describe()

In [None]:
# df with jobs under 30 minutes. The upper bound is 30 minutes plus 2 seconds in case the last digit
# is not included and there's a delay in cancel execution. Note that between() includes both endpoints.
accre_cms_user_part_30 = accre_cms_user_part[accre_cms_user_part['USEDTIME_SEC'].between(0,1802)]
accre_cms_user_part_30

In [None]:
accre_cms_user_part_30.describe()

474,044 CMS jobs (57%) ran for less than 30 minutes; 509,499 CMS jobs (61%) ran for less than 1 hour.
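
Percentages like these can be computed directly from a boolean mask, since the mean of a True/False column is the fraction of True values. The sketch below uses toy numbers and a hypothetical helper `share_under`. It uses a strict `<` comparison rather than the inclusive `between(0, ...)` with a 2-second allowance used above, so counts at the boundary can differ slightly.

```python
import pandas as pd

# Toy USEDTIME_SEC values, not the ACCRE data
jobs = pd.DataFrame({'USEDTIME_SEC': [120, 900, 1799, 1800, 2500, 3599, 7200, 86400]})

def share_under(df, limit_sec):
    """Fraction of rows whose USEDTIME_SEC is strictly below limit_sec."""
    return (df['USEDTIME_SEC'] < limit_sec).mean()

share_30min = share_under(jobs, 1800)
share_1hr = share_under(jobs, 3600)
print(f'{share_30min:.1%} under 30 min, {share_1hr:.1%} under 1 hr')
```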

In [None]:
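# Sort by used time, longest first. sort_values returns a new DataFrame, so display that result
# directly; ascending must be the boolean False, not the string 'false'.
accre_cms_user_part_30.sort_values('USEDTIME_SEC', ascending=False)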

In [None]:
accre_cms_user_part_30.groupby('STATE')['STATE'].count()

In [None]:
accre_cms_user_part_30_fail = accre_cms_user_part_30[accre_cms_user_part_30['STATE'] == 'FAILED']
accre_cms_user_part_30_fail

In [None]:
accre_cms_user_part_30_fail.info()

In [None]:
accre_cms_user_part_30_fail.NODELIST.value_counts().plot(kind = 'bar')
plt.xticks(rotation=45)
plt.tight_layout()

In [None]:
accre_cms_user_part_30_fail.NODELIST.describe()

There were only 61 failed jobs among the CMS jobs under 30 minutes. They ran on 52 unique NODELIST values, the most frequent being cn1387.
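
Raw failure counts favor nodes that simply run more jobs, so it may also be worth looking at the failure rate per node. This is a sketch on toy data, not something the analysis above has run; the node names are placeholders and each NODELIST value is treated as a single label.

```python
import pandas as pd

# Toy rows; node names and states are illustrative only
jobs = pd.DataFrame({
    'NODELIST': ['cn01', 'cn01', 'cn01', 'cn02', 'cn02', 'cn03'],
    'STATE': ['FAILED', 'COMPLETED', 'FAILED', 'COMPLETED', 'COMPLETED', 'FAILED'],
})

per_node = (
    jobs.assign(failed=jobs['STATE'].eq('FAILED'))   # True where the job failed
        .groupby('NODELIST')['failed']
        .agg(jobs='count', failures='sum', failure_rate='mean')
        .sort_values('failures', ascending=False)
)
print(per_node)
```

A node with a high `failure_rate` but very few jobs (like `cn03` here) is weak evidence on its own, so both columns should be read together.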

In [None]:
def find_most_common_values(df,column):
    return df[column].value_counts(ascending=False).iloc[0:50]

In [None]:
find_most_common_values(accre_cms_user_part_30_fail, 'NODELIST')

Now determine the number of cancelled jobs under 30 minutes

In [None]:
accre_cms_user_part_30_cxl = accre_cms_user_part_30[accre_cms_user_part_30['STATE'] == 'CANCELLED']
accre_cms_user_part_30_cxl

In [None]:
accre_cms_user_part_30_cxl.info()

In [None]:
# Review the NODELIST values for cancelled jobs under 30 minutes. There are 626 unique nodelists
accre_cms_user_part_30_cxl.NODELIST.describe()

In [None]:
accre_cms_user_part_30_cxl.EXITCODE.describe()

In [None]:
find_most_common_values(accre_cms_user_part_30_cxl, 'NODELIST')

In [None]:
accre_cms_user_part_30_cxl['STATE'].describe()

In [None]:
accre_cms_user_part_30.EXITCODE.unique()
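
Action 3 in the plan (left = user errors, right = node errors) can be sketched by splitting EXITCODE into its two halves. This assumes every value has the form `left:right` with integer parts. The toy values and the column names `EXIT_LEFT` and `EXIT_RIGHT` are hypothetical, and the user/node interpretation simply follows the plan stated at the top of this notebook.

```python
import pandas as pd

# Toy exit codes, not taken from the ACCRE data
codes = pd.DataFrame({'EXITCODE': ['0:0', '1:0', '0:9', '0:15', '1:0']})

# Split 'left:right' into two integer columns
parts = codes['EXITCODE'].str.split(':', expand=True).astype(int)
codes['EXIT_LEFT'] = parts[0]
codes['EXIT_RIGHT'] = parts[1]

# Non-zero left half: counted as a user error; non-zero right half: counted as a node error
user_error = codes['EXIT_LEFT'] != 0
node_error = codes['EXIT_RIGHT'] != 0
print(int(user_error.sum()), 'user-side;', int(node_error.sum()), 'node-side')
```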

In [None]:
accre.STATE.describe()

In [None]:
accre_cms.PARTITION.describe()

In [None]:
accre_cms.PARTITION.unique()

In [None]:
# review failed nodes, first identify failed jobs
failed = accre[accre['STATE'] == 'FAILED']
failed

In [None]:
failed.info()

In [None]:
failed.NODELIST.unique()

In [None]:
failed.NODELIST.value_counts()

In [None]:
def find_most_common_values(df,column):
    return df[column].value_counts(ascending=False).iloc[0:30]

In [None]:
find_most_common_values(failed, 'NODELIST')

In [None]:
#accre_cms = (accre_cms["ACCOUNT"] == 'cms') & (accre_["USER"] == 'cmslocal') | (accre["USER"] == 'cmspilot')

In [None]:
#accre_cms_1hr = accre_cms_user_part(['USEDTIME_SEC'] < 3600)

In [None]:
#accre_cms_user_part_30.groupby('EXITCODE')['NODELIST']

In [None]:
#failed_part = accre[(accre["STATE"] == 'FAILED') & (accre['PARTITION'] == 'production')]
#failed.head() 