## The Advanced Computing Center for Research and Education

The Advanced Computing Center for Research and Education (ACCRE) is a computer cluster serving the high-performance computing needs of research for Vanderbilt University. In this data question, you will be analyzing data on jobs run on ACCRE's hardware.

When a job is submitted to ACCRE, it goes through the slurm scheduler, which tracks and manages compute and memory resources. The hypothesis is that the slurm scheduler processes so many job completions in quick succession that it sometimes becomes unresponsive to commands from users trying to schedule new jobs or check job status. This is a particularly bad problem for clients who use automated submission systems, such as members of the Open Science Grid. The goal of this project is to investigate, and potentially confirm, the hypothesis that many job completions in a short time period cause the scheduler to become unresponsive, and to determine the rough threshold at which this becomes an issue.

You have been provided three datasets for this task:
* **fullsample.csv**: This file contains output for jobs run through the slurm scheduler.
* **slurm_wrapper_ce5.log** and **slurm_wrapper_ce6.log**: Logs of every slurm command that a pair of servers, ce5 and ce6, executed, how long each command took, and whether it succeeded. These servers connect ACCRE's local cluster to the Open Science Grid and submit jobs to slurm on behalf of the grid.

To get started, answer the following questions using just the fullsample.csv jobs dataset:

1. Calculate some descriptive statistics for how many jobs per hour are being completed. What do completions per hour look like over the time span of the dataset? Are there weekly trends, and have completions been increasing over the last year?

2. Does the job state affect completions per hour? That is, if I only look at jobs in the "COMPLETED" state with exit code 0:0, do I see a similar number of completions per hour as when I include all jobs, failed or cancelled? This will indicate whether the load on the scheduler is by user design or is a result of users not sufficiently testing their jobs before submitting very large arrays. We also expect that most job completions will be in the "production" partition, but is this actually true?
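
The partition part of question 2 is not worked through in the cells that follow, so here is a minimal sketch of one way it might be checked. It runs on a toy DataFrame rather than the real jobs table; the `PARTITION` column name and the non-production partition names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the jobs data. PARTITION is an assumed column name,
# and 'debug' / 'other' are hypothetical partition names.
toy_jobs = pd.DataFrame({
    'PARTITION': ['production', 'production', 'production', 'debug', 'other'],
    'STATE': ['COMPLETED', 'COMPLETED', 'FAILED', 'COMPLETED', 'CANCELLED'],
})

# Share of all job records that fall in each partition.
partition_share = toy_jobs['PARTITION'].value_counts(normalize=True)

# Within each partition, the fraction of records that ended in the COMPLETED state.
completed_rate = (toy_jobs['STATE'] == 'COMPLETED').groupby(toy_jobs['PARTITION']).mean()

print(partition_share)
print(completed_rate)
```

If the real table has a partition column, the same two lines could be pointed at it to see whether "production" really dominates.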

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mpl_dates
from sqlalchemy import create_engine, select, Table, MetaData
import pylab
import seaborn as sns

In [None]:
# Reflect table from SQLite database.
engine = create_engine('sqlite:///data/jobs.db')
connection = engine.connect()
metadata = MetaData()
# autoload_with alone triggers reflection; the separate autoload flag was removed in SQLAlchemy 2.0.
jobs = Table('jobs', metadata, autoload_with = engine)

# Read SQL column as DateTime.
end_times = pd.read_sql("SELECT STRFTIME('%Y-%m-%dT%H:%M:%S', END) AS END FROM jobs", 
                        con = connection, 
                        parse_dates = ['END'])

# Create blank end_times count column, then resample based on hour.
end_times['HOURCOUNT'] = ''
end_times = end_times.set_index('END').resample('1H').count().reset_index()

# Print basic statistics regarding hourly completions.
round(end_times.describe(),0)

In [None]:
# Read SQL columns for state, exitcode and end time as DateTime.

# Add the partition piece if time. Not initially included due to overall class decision not to include.
completions = (
    pd.read_sql("SELECT STRFTIME('%Y-%m-%dT%H:%M:%S', END) AS END FROM jobs\
    WHERE STATE = 'COMPLETED' AND EXITCODE = '0:0'", 
                con = connection,
                parse_dates = ['END']))

# Create blank completions count column, then resample based on hour.
completions['HOURCOUNT'] = ''
completions = completions.set_index('END').resample('1H').count().reset_index()

# Print statement regarding overlap of completions with 'completed' state and exitcodes.
# Additional testing that didn't result in anything new was done but not included here.
print(f"{sum(completions['HOURCOUNT'])/sum(end_times['HOURCOUNT'])*100:.2f}% "
      "of the completions per hour have a state of 'Completed' and an exitcode of '0:0'. "
      "Ending times are nearly synonymous with such completions.")

In [None]:
# Plot histogram of completions per hour.
# Check bin count against Sturges' rule.
plt.xkcd()

def thousands_commas_please(ax, *args):
    if 'x' in args:
        ax.get_xaxis().set_major_formatter(mtick.FuncFormatter(lambda x, p: format(int(x), ',')))
    if 'y' in args:
        ax.get_yaxis().set_major_formatter(mtick.FuncFormatter(lambda x, p: format(int(x), ',')))

def plot_histogram(series):
    
    facecolor = 'white'
    fig, ax = plt.subplots(figsize=(14,8), 
                           facecolor = facecolor)
    ax.set_facecolor(facecolor)
    plt.hist(series, 
             bins = 100, 
             color = 'dodgerblue', 
             edgecolor = 'black', 
             linewidth = 1)
    ax.set_axisbelow(True)
    ax.grid(axis = 'y', 
            color = 'black', 
            lw = 1, 
            alpha = 0.8)
    plt.xlim([0,series.quantile(0.99)])
    
    thousands_commas_please(ax, 'x','y')
    
plot_histogram(end_times['HOURCOUNT'])
plt.xlabel('Number of Completions/Hr.')
plt.ylabel('Count of Number of Completions/Hr.')
plt.title('Distribution of Completions/Hr.')
plt.show();
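
The histogram cell mentions checking the bin count against Sturges' rule but does not show the check. A minimal sketch follows, assuming the usual form of the rule, k = ceil(log2(n)) + 1; the helper name `sturges_bins` and the toy series are made up for illustration.

```python
import math
import pandas as pd

def sturges_bins(series):
    """Return the Sturges' rule bin count, ceil(log2(n)) + 1, using non-null values only."""
    n = int(series.count())
    if n == 0:
        return 1
    return math.ceil(math.log2(n)) + 1

# Toy series standing in for the hourly completion counts.
toy_counts = pd.Series(range(1000))
print(sturges_bins(toy_counts))
```

Sturges' rule is derived with roughly bell-shaped data in mind, so for a strongly right-skewed distribution it is probably better read as a lower bound on the bin count than as a target.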

In [None]:
# Plot ecdf for number of completions per hr. for a clearer picture.
facecolor = 'white'
fig, ax = plt.subplots(figsize=(14,8), 
                       facecolor = facecolor)
sns.ecdfplot(end_times['HOURCOUNT'], color = 'black')
plt.xlabel('Number of Completions/Hr.')
plt.ylabel('Cumulative Proportion of Hours')
plt.xlim(0,14000)
ninetieth = end_times['HOURCOUNT'].quantile(0.9)
plt.axhline(y = 0.9, 
            xmin = 0, 
            xmax = ninetieth/ax.get_xlim()[1],
            color = 'red',
            linestyle = '--')
plt.axvline(x = ninetieth, 
            ymin = 0, 
            ymax = 0.9,
            color = 'red',
            linestyle = '--')
plt.text(2000, 
         0.8, 
         f'90th Percentile = {ninetieth:,.0f}', ma = 'right');

In [None]:
# Group sample completions per hour by week and save as a dataframe.
end_times_by_week = (
    end_times.set_index('END').\
    groupby(pd.Grouper(freq = 'W'))['HOURCOUNT'].\
    agg(['sum']).\
    reset_index().\
    rename(columns = {'sum':'Weekly Count of Hourly Completions',
                     'END':'Week Ending'}))
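
Summing by week shows week-to-week change, but it hides any pattern within a week. One possible complement is to average the hourly counts by day of week. The sketch below uses synthetic data with the same `END` and `HOURCOUNT` column names; the values are invented so that the weekend dip is visible and say nothing about the real dataset.

```python
import pandas as pd

# Synthetic hourly counts over four full weeks, starting on Monday 2021-01-04.
hours = pd.date_range('2021-01-04', periods=24 * 28, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'END': hours, 'HOURCOUNT': 100})

# Make weekends quieter (dayofweek 5 and 6 are Saturday and Sunday).
toy.loc[toy['END'].dt.dayofweek >= 5, 'HOURCOUNT'] = 40

# Mean completions per hour for each day of the week.
by_day = toy.groupby(toy['END'].dt.day_name())['HOURCOUNT'].mean()
print(by_day)
```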

In [None]:
# Plot ecdf of the weekly completion totals for a clearer picture.
facecolor = 'white'
fig, ax = plt.subplots(figsize=(14,8), 
                       facecolor = facecolor)
sns.ecdfplot(end_times_by_week['Weekly Count of Hourly Completions'], color = 'black')
plt.xlabel('Number of Completions Per Week')
plt.ylabel('Cumulative Proportion of Weeks')
plt.xlim(0,350000)
median = end_times_by_week['Weekly Count of Hourly Completions'].median()
plt.axhline(y = 0.5, 
            xmin = 0, 
            xmax = median/ax.get_xlim()[1],
            color = 'red',
            linestyle = '--')
plt.axvline(x = median, 
            ymin = 0, 
            ymax = 0.5,
            color = 'red',
            linestyle = '--')
plt.text(60000, 
         0.52, 
         f'Median = {median:,.0f}', 
         ma = 'right');

In [None]:
# Fill between plot to show completions per hour by week.
facecolor = 'white'
fig, ax = plt.subplots(figsize=(14,8), 
                       facecolor = facecolor)

y = end_times_by_week['Weekly Count of Hourly Completions']
x = end_times_by_week['Week Ending'].apply(mpl_dates.date2num)

ax.fill_between(x, 0, y, facecolor = 'blue', alpha = 0.5)

model = np.polyfit(x, y, 1)
predicted = np.poly1d(model)
# Draw the linear trend line on the same axes as the fill-between plot.
ax.plot(x, predicted(x), "k--")

plt.xticks(x, rotation = 75, fontsize = 10)
thousands_commas_please(ax, 'y')
plt.xlabel('Week Ending')
plt.ylabel('Number of Completions Per Week')
plt.title('Number of Completions Per Week Over Time')
# '%m/%d/%y' gives the same output as '%D' but is portable across platforms.
ax.xaxis.set_major_formatter(mpl_dates.DateFormatter('%m/%d/%y'));

In [None]:
# Read SQL column as DateTime.
vague_memory = pd.read_sql("SELECT USEDMEMPERCORE, STRFTIME('%Y-%m-%dT%H:%M:%S', END) AS END FROM jobs",
                           con = connection,
                           parse_dates = ['END'])

# Group total and standard deviation of used memory per core per hour and save as a dataframe.
vague_memory_by_hour = (
    vague_memory.set_index('END').\
    groupby(pd.Grouper(freq = '1H'))['USEDMEMPERCORE'].\
    agg(['sum','std']).\
    reset_index().\
    rename(columns = {'END':'Hour',
                     'sum':'Total Memory Per Core',
                     'std':'Std. Dev. of Total Memory Per Core'}).\
    round())

In [None]:
# Plot histogram of total memory per core.
plot_histogram(vague_memory_by_hour['Total Memory Per Core'])
# Format the histogram's own axes; a bare `ax` here would refer to the previous figure.
thousands_commas_please(plt.gca(), 'x','y')
plt.xlabel('Total Megabytes Used Per Core/Hr.')
plt.ylabel('Count of Total Megabytes Used Per Core/Hr.')
plt.title('Distribution of Total Megabytes Used Per Core/Hr.')
plt.show();

In [None]:
# Plot histogram of the standard deviation of the total memory per core.
plot_histogram(vague_memory_by_hour['Std. Dev. of Total Memory Per Core'])
# Format the histogram's own axes; a bare `ax` here would refer to an earlier figure.
thousands_commas_please(plt.gca(), 'x','y')
plt.xlabel('Std. Dev. of Total Megabytes Used Per Core/Hr.')
plt.ylabel('Count of Std. Dev. of Total Megabytes Used Per Core/Hr.')
plt.title('Distribution of Std. Dev. of Total Megabytes Used Per Core/Hr.')
plt.show();

In [None]:
# Merge completions per hour with the memory usage per hour dataset.
end_times = (
    end_times.merge(vague_memory_by_hour, 
                    left_on = 'END', 
                    right_on = 'Hour')[['END',
                                        'HOURCOUNT', 
                                        'Total Memory Per Core', 
                                        'Std. Dev. of Total Memory Per Core']]
)

In [None]:
# Build scatterplot to show that variability of memory is generally correlated with lower completion counts.
plt.rcdefaults() 
fig, ax = plt.subplots(figsize = (12, 8), dpi = 300)
plt.scatter(x = end_times['Std. Dev. of Total Memory Per Core'], 
            y = end_times['HOURCOUNT'],
            c = end_times['Total Memory Per Core'],
            cmap = 'magma',
            alpha = 0.5)

thousands_commas_please(ax, 'x','y')
plt.xlabel('Std. Dev. of Total Megabytes Used Per Core')
plt.ylabel('Completion Counts Per Hour')
plt.title('Memory Variability vs. Completions plus Memory Usage', fontsize = 10)
plt.show()
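
The scatterplot suggests a relationship by eye; a rank correlation is one way to put a rough number on it. The sketch below computes a Spearman-style coefficient by ranking both columns and taking the Pearson correlation of the ranks. The column names match the merged dataframe above, but the values are invented and perfectly monotone, so the result is not a finding about the real data.

```python
import pandas as pd

# Invented values: completion counts fall steadily as memory variability rises.
toy = pd.DataFrame({
    'Std. Dev. of Total Memory Per Core': [10, 20, 30, 40, 50],
    'HOURCOUNT': [900, 700, 650, 300, 100],
})

# Pearson correlation of the ranks is equivalent to Spearman's rho.
ranks = toy.rank()
rho = ranks['Std. Dev. of Total Memory Per Core'].corr(ranks['HOURCOUNT'])
print(f"Rank correlation: {rho:.2f}")
```

A rank-based measure seems the safer choice here because the hourly counts are heavily skewed, and a few extreme hours could dominate an ordinary Pearson coefficient.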