# Tally Slurm total GPU hours for an account

This assumes that you have run

```bash
sacct --partition pli-c --allusers --json > pli_c_1.json
sacct --partition pli --allusers --json > pli_1.json
```

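on the cluster, which dumps every job in the `pli-c` (core) and `pli` (campus) partitions as JSON.

By default `sacct` only reports jobs since midnight of the current day; to cover a longer time window, add a start time such as `-S 2024-01-01`.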
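The two exports can also be scripted. The sketch below only assembles the command lines, using the long-form `--partition` and `--starttime` flags; `build_sacct_cmd` is a hypothetical helper (not part of `slurm_analyzer`), and it assumes a Slurm release whose `sacct` supports `--json`.

```python
import shlex


def build_sacct_cmd(partition, start=None):
    """Return the sacct argument list for one partition (hypothetical helper)."""
    cmd = ["sacct", "--partition", partition, "--allusers", "--json"]
    if start is not None:
        # --starttime (short form -S) widens the reporting window, e.g. "2024-01-01".
        cmd += ["--starttime", start]
    return cmd


# Print the shell-quoted command for each partition used in this notebook.
for part in ["pli-c", "pli"]:
    print(shlex.join(build_sacct_cmd(part, start="2024-01-01")))
```

The resulting list can be passed to `subprocess.run` on a login node, with stdout redirected to the JSON file the notebook loads.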

In [1]:
%load_ext autoreload
%autoreload 2

import json
from pathlib import Path

import pandas as pd
import tabulate

from slurm_analyzer import SLURMAnalyzer

In [2]:
dpath = Path("..", 'data')
assert dpath.is_dir()


In [3]:
df = pd.concat([
    SLURMAnalyzer().parse(json.loads((dpath / "pli_1.json").read_text())),
    SLURMAnalyzer().parse(json.loads((dpath / "pli_c_1.json").read_text()))
])

Filtered 4017 jobs (14%) with no gpus
Filtered 9371 jobs (36%) with < 10min run time
Filtered 9402 jobs (13%) with no gpus
Filtered 21691 jobs (34%) with < 10min run time
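The "Filtered ..." messages are printed by `SLURMAnalyzer.parse`, whose source is not shown in this notebook. As a rough sketch of what such a filter could look like, assuming hypothetical columns `gpus` and `elapsed_h` (the real column names, thresholds, and percentage base inside `slurm_analyzer` may differ):

```python
import pandas as pd


def filter_jobs(jobs, min_runtime_h=10 / 60):
    """Drop jobs without GPUs, then jobs shorter than min_runtime_h (sketch)."""
    no_gpu = jobs["gpus"] == 0
    print(f"Filtered {no_gpu.sum()} jobs ({no_gpu.mean():.0%}) with no gpus")
    jobs = jobs[~no_gpu]

    # The percentage here is relative to the jobs that survived the GPU filter.
    short = jobs["elapsed_h"] < min_runtime_h
    print(f"Filtered {short.sum()} jobs ({short.mean():.0%}) with < 10min run time")
    return jobs[~short]


# Toy data: one CPU-only job, one very short GPU job, two regular GPU jobs.
toy = pd.DataFrame({"gpus": [0, 1, 4, 8], "elapsed_h": [2.0, 0.05, 1.5, 30.0]})
kept = filter_jobs(toy)
```

Dropping very short jobs keeps crashed or test submissions from dominating the wait-time averages.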


In [17]:
def by_time(df, title=""):
    """Print mean queue wait and job counts for several look-back windows."""
    tab = []

    def add(label, query):
        # One row: mean wait (h), number of jobs that waited > 24h, total jobs.
        sdf = df.query(query)
        tab.append((label, sdf.wait_time_h.mean(), len(sdf.query("wait_time_h > 24")), len(sdf)))

    add("Last 7 days", "age_days <= 7")
    add("Last 30 days", "age_days <= 30")
    add("Last 60 days", "age_days <= 60")
    add("Last 90 days", "age_days <= 90")
    add("Forever", "age_days > 0")  # every job in the export

    if title:
        print(title)
    print(tabulate.tabulate(tab, headers=["Period", "Avg. wait (h)", "jobs with wait > 24h", "Jobs"]))
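
`by_time` relies on two derived columns, `wait_time_h` and `age_days`, that come out of `SLURMAnalyzer.parse`. Their exact definition is not shown in this notebook; one plausible derivation, assuming each job carries `submit` and `start` Unix timestamps in seconds (hypothetical column names), is sketched here:

```python
import pandas as pd


def add_time_columns(jobs, now):
    """Add wait_time_h and age_days from epoch-second timestamps (sketch)."""
    out = jobs.copy()
    # Queue wait: time between submission and the job actually starting.
    out["wait_time_h"] = (out["start"] - out["submit"]) / 3600
    # Age: how long ago the job was submitted, relative to `now`.
    out["age_days"] = (now - out["submit"]) / 86400
    return out


now = 1_700_000_000  # arbitrary reference time for the example
toy = pd.DataFrame({
    "submit": [now - 2 * 86400, now - 40 * 86400],
    "start": [now - 2 * 86400 + 2 * 3600, now - 40 * 86400 + 36 * 3600],
})
toy = add_time_columns(toy, now)

# Same style of query that by_time uses for its "Last 7 days" row.
print(toy.query("age_days <= 7")[["wait_time_h", "age_days"]])
```

Whether age is measured from submission, start, or end time changes which window a long-queued job lands in, so it is worth checking against the actual `slurm_analyzer` code.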


# Core partition

## Large jobs 

The runtime cutoff is 23 h rather than 24 h because many jobs have a 24 h time limit; a job that ends at or just before its limit could otherwise fall on either side of the threshold.
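A toy illustration of that edge effect, with made-up elapsed times rather than real job data: two of the four jobs ran up to a 24 h limit, one stopping just short of it and one just past it.

```python
import pandas as pd

# Made-up elapsed times in hours; 23.97 and 24.003 both ran to a 24 h limit.
toy = pd.DataFrame({"elapsed_h": [23.97, 24.003, 23.5, 2.0]})

n_over_23 = len(toy.query("elapsed_h > 23"))
n_over_24 = len(toy.query("elapsed_h >= 24"))
print(f"> 23h: {n_over_23} jobs, >= 24h: {n_over_24} jobs")
```

With the 24 h threshold, the job that stopped slightly early is lost even though it is just as "long" as its neighbor; the 23 h threshold keeps both.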

In [21]:
for nodes in [1, 2, 4]:
    by_time(df.query(f"partition == 'pli-c' and allocation_nodes >= {nodes} and elapsed_h > 23"), f">= {nodes} nodes, >= 23h runtime")
    print()

>= 1 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           5.04036                       5      64
Last 30 days          7.7441                       18     183
Last 60 days          8.10278                      34     491
Last 90 days          3.54143                      39    1782
Forever               4.76086                      71    2265

>= 2 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days        0.00361111                       0       3
Last 30 days      21.3888                          10      44
Last 60 days      15.5377                          11      72
Last 90 days      15.3882                          12      76
Forever           34.8961                          22     101

>= 4 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait >

## Smaller jobs

In [19]:
for t in [1, 10, 24]:
    by_time(df.query(f"partition == 'pli-c' and gpu_time_h >= {t}"), f">= {t} GPU hours")
    print()

>= 1 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           3.00042                      38     802
Last 30 days          9.0763                      930    7454
Last 60 days          9.86712                    2092   13427
Last 90 days          7.26785                    2126   19973
Forever               5.36976                    2184   29352

>= 10 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           7.54671                      37     256
Last 30 days          4.53623                      69    1014
Last 60 days          4.33863                      95    1973
Last 90 days          3.47459                     129    4599
Forever               3.20377                     187    7928

>= 24 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  -------

# Campus partition

In [20]:
for t in [1, 10, 24]:
    by_time(df.query(f"partition == 'pli' and gpu_time_h >= {t}"), f">= {t} GPU hours")
    print()

>= 1 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days          0.634233                       0     212
Last 30 days         1.17159                        3    1878
Last 60 days         1.75922                       10    3049
Last 90 days         2.476                         88    4360
Forever              2.44454                      175    9928

>= 10 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days          0.539725                       0      95
Last 30 days         1.63462                        3     296
Last 60 days         2.77335                       10     551
Last 90 days         4.32858                       56     947
Forever              4.00865                      122    2471

>= 24 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  -------