# Reading in 'fullsample' data and cleaning it up

To get started, answer the following questions using just the fullsample.csv jobs dataset:

Calculate some descriptive statistics for how many jobs per hour are being completed. What do the completions per hour look like over the time span of the dataset? Are there weekly trends, and has the rate been increasing over the last year?

Does the job state affect completions per hour? That is, if I look only at jobs in the "COMPLETED" state with exit code 0:0, is the number of completions per hour similar to the number for all jobs, including failed and cancelled ones? This will indicate whether the load on the scheduler is by user design or a result of users not sufficiently testing their jobs before submitting very large arrays. We also expect most job completions to be in the "production" partition, but is this actually true?
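
As a rough sketch of how these questions could be approached once the data is cleaned, the example below counts completions per hour on a small made-up table, first for all jobs and then for cleanly completed jobs only. The `END` column is from the dataset; `STATE`, `EXITCODE` and `PARTITION` are assumed column names, and the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for fullsample.csv. END appears in the dataset; STATE, EXITCODE
# and PARTITION are assumed column names used only for illustration.
toy = pd.DataFrame({
    "END": pd.to_datetime([
        "2021-03-01 10:05:00", "2021-03-01 10:40:00", "2021-03-01 11:15:00",
        "2021-03-01 13:30:00", "2021-03-02 10:20:00",
    ]),
    "STATE": ["COMPLETED", "FAILED", "COMPLETED", "CANCELLED", "COMPLETED"],
    "EXITCODE": ["0:0", "1:0", "0:0", "0:0", "0:0"],
    "PARTITION": ["production", "production", "debug", "production", "production"],
})


def completions_per_hour(df):
    """Count jobs by the hour in which they ended; empty hours count as 0."""
    return df.set_index("END").resample("h").size()


all_hourly = completions_per_hour(toy)

# Restrict to jobs that finished cleanly before counting
clean = toy[(toy["STATE"] == "COMPLETED") & (toy["EXITCODE"] == "0:0")]
clean_hourly = completions_per_hour(clean)

# Descriptive statistics for both series, side by side
summary = pd.DataFrame({"all": all_hourly.describe(),
                        "clean": clean_hourly.describe()})

# Mean completions per hour by weekday, as a first look at weekly patterns
by_weekday = all_hourly.groupby(all_hourly.index.day_name()).mean()

# Share of completions in each partition
partition_share = toy["PARTITION"].value_counts(normalize=True)
```

Comparing `summary["all"]` with `summary["clean"]` gives a first indication of how much of the hourly load comes from failed or cancelled jobs.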

In [1]:
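import pandas as pd
# The class import below rebinds `datetime`, so a separate `import datetime` would be shadowed
from datetime import datetime
from datetime import timedelta
import time
import re
import numpy as np

# Show up to 1000 rows when displaying a DataFrame
pd.options.display.max_rows = 1000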

In [26]:
# Load the full jobs sample (all job states, not only completed jobs)
jobs = pd.read_csv('fullsample.csv')

In [33]:
# Parse BEGIN and END as datetimes; values that do not match the format become NaT
jobs['BEGIN'] = pd.to_datetime(jobs['BEGIN'],
                               format='%Y-%m-%d %H:%M:%S',
                               errors='coerce')
jobs['END'] = pd.to_datetime(jobs['END'],
                             format='%Y-%m-%d %H:%M:%S',
                             errors='coerce')
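
Because `errors='coerce'` silently turns unparsable values into `NaT`, it can be worth counting how many rows were affected. The sketch below shows the idea on a made-up series; the `"Unknown"` entry is a hypothetical bad value, not something taken from the dataset.

```python
import pandas as pd

# Made-up timestamps; "Unknown" is a hypothetical unparsable entry
raw = pd.Series(["2021-03-01 10:05:00", "Unknown", "2021-03-01 11:15:00"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed and became NaT
n_bad = parsed.isna().sum()
print(f"{n_bad} of {len(raw)} timestamps could not be parsed")
```

The same check on `jobs['BEGIN']` and `jobs['END']` would show whether any jobs need to be dropped before computing completions per hour.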

In [35]:
# Convert REQTIME and USEDTIME from strings to timedeltas.
# A "D-" day prefix is rewritten as "D days " so that pandas can parse it.
jobs['REQTIME'] = jobs['REQTIME'].str.replace("-", " days ", regex=False)
jobs['USEDTIME'] = jobs['USEDTIME'].str.replace("-", " days ", regex=False)
jobs['REQTIME'] = pd.to_timedelta(jobs['REQTIME'])
jobs['USEDTIME'] = pd.to_timedelta(jobs['USEDTIME'])

# Extract the REQMEM unit into a new column, then keep only the number as int
jobs['Unit'] = jobs['REQMEM'].str.extract(r'([a-zA-Z]+)')
jobs['REQMEM'] = jobs['REQMEM'].str.split(r'([a-zA-Z]+)').str[0]
jobs['REQMEM'] = jobs['REQMEM'].astype('int')

# Strip the unit from USEDMEM and keep the number as float
jobs['USEDMEM'] = jobs['USEDMEM'].str.split(r'([a-zA-Z]+)').str[0]
jobs['USEDMEM'] = jobs['USEDMEM'].astype('float')


# Move the Unit column to position 5
jobs.insert(loc=5, column='Unit', value=jobs.pop('Unit'))
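
To make the string handling in this cell concrete, here is a minimal sketch on two made-up rows. The sample values are assumptions about what the raw columns look like, inferred from the replacements above, and `str.extract` with named groups is used here as an alternative to the `str.split` approach, not as the notebook's own method.

```python
import pandas as pd

# Made-up rows; the value formats are assumed from the cleaning steps above
toy = pd.DataFrame({
    "REQTIME": ["1-02:30:00", "00:10:00"],
    "REQMEM": ["4096Mn", "2000Mc"],
})

# "1-02:30:00" becomes "1 days 02:30:00", which pd.to_timedelta understands
toy["REQTIME"] = pd.to_timedelta(
    toy["REQTIME"].str.replace("-", " days ", regex=False)
)

# Split the number and the unit in one pass using named capture groups
parts = toy["REQMEM"].str.extract(r"^(?P<value>\d+)(?P<unit>[a-zA-Z]+)$")
toy["REQMEM"] = parts["value"].astype(int)
toy["Unit"] = parts["unit"]

print(toy)
```

One difference from the split-based version is that rows that do not match the pattern come back as `NaN`, so `astype(int)` would fail loudly on unexpected input.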

In [36]:
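# Convert REQMEM from per-node (Mn) to per-core (Mc) where needed
jobs['REQMEM'] = np.where(jobs['Unit'] == 'Mn',
                          jobs['REQMEM'] * jobs['NODES'] / jobs['CPUS'],
                          jobs['REQMEM'])
# USEDMEM is converted on every row, which assumes it is always reported per node
jobs['USEDMEM'] = jobs['USEDMEM'] * jobs['NODES'] / jobs['CPUS']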

#Drop column since no longer needed
jobs = jobs.drop(columns=['Unit'])

# Rename columns to record the unit
jobs.rename(columns={'USEDMEM': 'USEDMEM (Mc)',
                     'REQMEM': 'REQMEM (Mc)'},
            inplace=True)
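
As a worked example of the conversion in this cell: a request of 8192 Mn on 2 nodes with 16 CPUs in total corresponds to 8192 × 2 / 16 = 1024 Mc, while a request already given in Mc is left unchanged. The sketch below reproduces this on made-up rows, reading Mn as "per node" and Mc as "per core", which is how the formula treats them.

```python
import numpy as np
import pandas as pd

# Made-up rows: one per-node request and one per-core request
toy = pd.DataFrame({
    "REQMEM": [8192, 2000],
    "Unit": ["Mn", "Mc"],
    "NODES": [2, 1],
    "CPUS": [16, 4],
})

# Per-node memory times node count gives the total; dividing by CPUs gives per-core
toy["REQMEM (Mc)"] = np.where(
    toy["Unit"] == "Mn",
    toy["REQMEM"] * toy["NODES"] / toy["CPUS"],
    toy["REQMEM"],
)

print(toy[["REQMEM", "Unit", "REQMEM (Mc)"]])
```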


In [38]:
# Save the cleaned table; the file is named 'jobs' (no extension) and includes the index column
jobs.to_csv('jobs')