# Tally Slurm total GPU hours for an account

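This notebook assumes that you have already exported the accounting data for each partition with `sacct`, and that the resulting JSON files are in the directory assigned to `dpath` below: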

```bash
# Earliest job start date to include
STARTDATE=2024-01-26
# One JSON dump per partition, covering all users
sacct -S "${STARTDATE}" --partition pli-c --allusers --json > sacct_pli_core.json
sacct -S "${STARTDATE}" --partition pli-lc --allusers --json > sacct_pli_large_campus.json
sacct -S "${STARTDATE}" --partition pli --allusers --json > sacct_pli_campus.json
```
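Before parsing, it can help to confirm that each dump actually contains job records. The sketch below assumes that the `sacct --json` output keeps its records in a top-level `jobs` list; the helper names `load_dump` and `count_sacct_jobs` are made up for this illustration and are not part of `slurm_analyzer`.

```python
import json
from pathlib import Path


def load_dump(path: Path) -> dict:
    """Read one `sacct --json` dump from disk and return the parsed payload."""
    return json.loads(Path(path).read_text())


def count_sacct_jobs(payload: dict) -> int:
    """Return the number of job records in a parsed dump.

    Assumption: job records live under a top-level "jobs" list.
    A missing key is treated as zero jobs.
    """
    return len(payload.get("jobs", []))
```

For example, `count_sacct_jobs(load_dump(Path.home() / "tmp" / "sacct_pli_core.json"))` returning 0 would suggest that the export step needs to be rerun.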


In [41]:
%load_ext autoreload
%autoreload 2

import json
from pathlib import Path

import pandas as pd
import tabulate

from slurm_analyzer import SLURMAnalyzer

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [42]:
# Directory that holds the sacct JSON dumps.
# Alternative location inside the repository:
# dpath = Path("..", 'data')
# assert dpath.is_dir()
dpath = Path.home() / "tmp"


In [43]:
# Parse each partition's dump and stack the results into one DataFrame.
dump_names = ["sacct_pli_core.json", "sacct_pli_campus.json", "sacct_pli_large_campus.json"]
df = pd.concat([
    SLURMAnalyzer().parse(json.loads((dpath / name).read_text()))
    for name in dump_names
])

Filtered 29421 jobs (16%) with no gpus
Filtered 53984 jobs (36%) with < 10min run time
Filtered 7580 jobs (14%) with no gpus
Filtered 18639 jobs (41%) with < 10min run time
Filtered 199 jobs (11%) with no gpus
Filtered 620 jobs (39%) with < 10min run time
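The implementation of `SLURMAnalyzer` is not shown in this notebook, so the exact filtering logic behind the messages above is unknown. As a rough sketch only, the two reported filters could look like this in pandas. The function `filter_jobs` and the column name `gpus` are hypothetical, and measuring the second percentage against the jobs that survive the first filter is an assumption.

```python
import pandas as pd


def filter_jobs(jobs: pd.DataFrame, min_elapsed_h: float = 10 / 60) -> pd.DataFrame:
    """Drop jobs without GPUs, then jobs shorter than `min_elapsed_h`.

    Hypothetical stand-in for the filtering done inside SLURMAnalyzer.
    """
    with_gpus = jobs[jobs["gpus"] > 0]
    n_no_gpu = len(jobs) - len(with_gpus)
    print(f"Filtered {n_no_gpu} jobs ({n_no_gpu / max(len(jobs), 1):.0%}) with no gpus")

    long_enough = with_gpus[with_gpus["elapsed_h"] >= min_elapsed_h]
    n_short = len(with_gpus) - len(long_enough)
    print(f"Filtered {n_short} jobs ({n_short / max(len(with_gpus), 1):.0%}) with < 10min run time")
    return long_enough


# Toy data (made up): one CPU-only job and one three-minute GPU job get dropped.
toy = pd.DataFrame({"gpus": [0, 1, 4, 8], "elapsed_h": [5.0, 0.05, 2.0, 30.0]})
kept = filter_jobs(toy)
```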


In [44]:
## Total GPU h

In [45]:
def wait_by_time(df, title=""):
    """Print average wait, number of jobs waiting > 24h, and job count per recent period."""
    tab = []

    def add(label, query):
        # One table row per period: (label, mean wait, jobs with wait > 24h, total jobs)
        sdf = df.query(query)
        tab.append((label, sdf.wait_time_h.mean(), len(sdf.query("wait_time_h > 24")), len(sdf)))
    
    add("Last 7 days", "age_days <= 7")
    add("Last 30 days", "age_days <= 30")
    # add("Last 60 days", "age_days <= 60")
    # add("Last 90 days", "age_days <= 90")
    # add("Forever", "age_days > 0")

    if title:
        print(title)
    print(tabulate.tabulate(tab, headers=["Period", "Avg. wait (h)", "jobs with wait > 24h", "Jobs"]))


In [46]:
def wait_by_partition(df, title=""):
    """Print average wait, number of jobs waiting > 24h, and job count per partition."""
    tab = []
    for partition in ["pli-c", "pli-lc", "pli"]:
        sdf = df.query(f"partition == '{partition}'")
        tab.append((partition, sdf.wait_time_h.mean(), len(sdf.query("wait_time_h > 24")), len(sdf)))
    if title:
        print(title)
    print(tabulate.tabulate(tab, headers=["Partition", "Avg. wait (h)", "jobs with wait > 24h", "Jobs"]))
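The per-partition summary computed by `wait_by_partition` can also be expressed as a single pandas group-by, which may be convenient when the result is needed as a DataFrame rather than as printed text. This is an alternative sketch on made-up data, not the code used to produce the tables in this notebook; the output column names are chosen for illustration.

```python
import pandas as pd

# Toy data (made up) with the two columns the summary needs.
toy = pd.DataFrame({
    "partition": ["pli-c", "pli-c", "pli-c", "pli", "pli"],
    "wait_time_h": [1.0, 3.0, 30.0, 2.0, 26.0],
})

# Named aggregation: one output column per statistic.
summary = toy.groupby("partition")["wait_time_h"].agg(
    avg_wait_h="mean",
    jobs_over_24h=lambda s: int((s > 24).sum()),
    jobs="size",
)
print(summary)
```

Unlike the explicit loop, the group-by only reports partitions that actually occur in the data and orders them alphabetically.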

## Total utilization

In [47]:
for partition in ["pli-c", "pli-lc", "pli"]:
    _util = df.query(f"partition == '{partition}'")["gpu_time_h"].sum()
    print(f"Total utilization for {partition}: {_util/1000:.0f}k hours")


Total utilization for pli-c: 1432k hours
Total utilization for pli-lc: 8k hours
Total utilization for pli: 335k hours


## Wait times: Overall


In [49]:
# Split jobs into small and large at 23 GPU hours.
large_query = "gpu_time_h > 23"
small_query = "gpu_time_h <= 23"

wait_by_partition(df.query(small_query), "Overall wait times (small jobs)")
print()
wait_by_partition(df.query(large_query), "Overall wait times (large jobs)")


Overall wait times (small jobs)
Partition      Avg. wait (h)    jobs with wait > 24h    Jobs
-----------  ---------------  ----------------------  ------
pli-c                3.98162                    4557   86775
pli-lc               2.36385                      17     940
pli                  3.26638                     796   24042

Overall wait times (large jobs)
Partition      Avg. wait (h)    jobs with wait > 24h    Jobs
-----------  ---------------  ----------------------  ------
pli-c                5.72588                     585   10835
pli-lc               5.09143                       4      50
pli                  5.11254                     134    2405


# Wait times: Details 



## Core partition

### Large jobs 

We use 23h rather than 24h as the cutoff because many jobs are set to terminate after 24h. Placing the threshold an hour below that limit avoids splitting those jobs at the boundary.
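A small illustration of why the threshold sits below 24h, using made-up runtimes: jobs that stop right around a 24h limit can land on either side of a 24h threshold, whereas a 23h threshold keeps them together. The exact elapsed times that Slurm records for time-limited jobs are an assumption here.

```python
import pandas as pd

CUTOFF_H = 23  # one hour below the common 24h limit

# Toy data (made up): two jobs end right around a 24h limit.
toy = pd.DataFrame({"elapsed_h": [0.5, 12.0, 23.97, 24.01, 48.0]})

# Same f-string query style as the cells in this notebook.
long_jobs = toy.query(f"elapsed_h > {CUTOFF_H}")
short_jobs = toy.query(f"elapsed_h <= {CUTOFF_H}")

# With a 24h threshold the two near-limit jobs would be split apart.
split_at_24 = toy.query("elapsed_h > 24")

print(len(long_jobs), len(short_jobs), len(split_at_24))
```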

In [50]:
for nodes in [1, 2, 4]:
    # Note: the filter is a strict `> 23`, although the printed title says ">= 23h".
    query = f"partition == 'pli-c' and allocation_nodes >= {nodes} and elapsed_h > 23"
    wait_by_time(df.query(query), f">= {nodes} nodes, >= 23h runtime")
    print()

>= 1 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           4.12803                       0     145
Last 30 days          1.65561                       1     468

>= 2 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           10.2675                       0       1
Last 30 days          20.5398                       1       3

>= 4 nodes, >= 23h runtime
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           10.2675                       0       1
Last 30 days          20.5398                       1       3



### Smaller jobs

In [51]:
for t in [1, 10, 24]:
    wait_by_time(df.query(f"partition == 'pli-c' and gpu_time_h >= {t}"), f">= {t} GPU hours")
    print()

>= 1 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           1.98473                       1    1883
Last 30 days          1.35327                      16    6405

>= 10 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           2.63008                       0     688
Last 30 days          1.84987                      14    2523

>= 24 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           3.26511                       0     431
Last 30 days          2.4314                       13    1381



## Campus partition

In [52]:
for t in [1, 10, 24]:
    wait_by_time(df.query(f"partition == 'pli' and gpu_time_h >= {t}"), f">= {t} GPU hours")
    print()

>= 1 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days          19.6366                       93     367
Last 30 days          8.48028                     162    1611

>= 10 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           26.9257                      86     217
Last 30 days          14.6505                     127     611

>= 24 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           13.078                        4      47
Last 30 days          11.5208                      43     279



## Large campus partition

In [53]:
for t in [1, 10, 24]:
    wait_by_time(df.query(f"partition == 'pli-lc' and gpu_time_h >= {t}"), f">= {t} GPU hours")
    print()

>= 1 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days           3.16946                       2      46
Last 30 days          1.5502                        2     295

>= 10 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days          0.133287                       0      12
Last 30 days         2.44854                        0      70

>= 24 GPU hours
Period          Avg. wait (h)    jobs with wait > 24h    Jobs
------------  ---------------  ----------------------  ------
Last 7 days          0.517593                       0       3
Last 30 days         0.216375                       0      20

