This repository hosts the public releases of Acme traces from the Shanghai AI Lab, encompassing workloads spanning from March 2023 to August 2023. We encourage anyone to use the traces for academic purposes. If you have any questions, feel free to send us an email or file an issue on GitHub.
Furthermore, we have conducted a thorough analysis of the Acme workloads, detailed in our NSDI '24 paper, "Characterization of Large Language Model Development in the Datacenter".
Note that due to space constraints on GitHub, our cluster utilization files are not hosted here. If you're interested in accessing these files, they are available on HuggingFace (~80GB).
The main trace characteristics, dataset structure and schema are:
- Full dataset size: 80GB (on HuggingFace)
- Dataset size: 109MB
- Duration: 6 months
- Number of independent GPU clusters: 2
- Total number of jobs: 880,740
- Total number of GPU jobs: 470,497
📦AcmeTrace
┣ 📂data
┃ ┣ 📂job_trace
┃ ┃ ┣ 📂trace_previous_work (Prior job traces for comparison)
┃ ┃ ┃ ┣ 📜helios_trace.csv
┃ ┃ ┃ ┣ 📜xxx.csv
┃ ┃ ┣ 📜trace_kalos.csv (Job trace file, collected from scheduler)
┃ ┃ ┗ 📜trace_seren.csv
┃ ┣ 📂utilization
┃ ┃ ┣ 📂ipmi (Power of different server models in Seren, collected from IPMI)
┃ ┃ ┃ ┣ 📜CPU_D_Power.csv
┃ ┃ ┃ ┣ 📜GPU_AB_Power.csv
┃ ┃ ┃ ┗ 📜GPU_C_Power.csv
┃ ┃ ┣ 📂kalos (Resource utilization logs, collected from DCGM & Prometheus)
┃ ┃ ┃ ┣ 📜DRAM_ACTIVE.csv
┃ ┃ ┃ ┣ 📜xxx.csv
┃ ┃ ┣ 📂seren
┃ ┃ ┃ ┣ 📜DRAM_ACTIVE.csv
┃ ┃ ┃ ┣ 📜xxx.csv
┃ ┃ ┣ 📂util_pkl (Processed pickle files for plotting)
┃ ┃ ┃ ┣ 📜gpu_power_kalos.pkl
┃ ┃ ┃ ┣ 📜xxx.pkl
┃ ┣ 📜cluster_summary.csv
┃ ┣ 📜generate_utilization_pkl.ipynb (Parse utilization files and generate pickles)
┃ ┗ 📜utils.py
┣ 📂figure (Examples of trace visualization)
┃ ┣ 📜bar_job_state.pdf
┃ ┣ 📜xxx.pdf
┣ 📜LICENSE.txt
┣ 📜README.md
┗ 📜analysis.ipynb (Scripts for plotting)
The job traces provide rich information on all jobs submitted to the scheduler in each cluster.
trace_seren.csv
Example
job_id | user | node_num | gpu_num | cpu_num | type | state | submit_time | start_time | end_time | duration | queue | gpu_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5778432 | u5907 | 1 | 8 | 128 | Other | FAILED | 2023-03-01 00:18:22+08:00 | 2023-03-01 00:18:54+08:00 | 2023-03-01 00:20:51+08:00 | 117 | 32 | 936.0 |
5778469 | u5907 | 1 | 8 | 128 | Other | COMPLETED | 2023-03-01 00:23:58+08:00 | 2023-03-01 00:24:11+08:00 | 2023-03-01 01:09:04+08:00 | 2693 | 13 | 21544.0 |
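
As an illustration of how these columns can be consumed, the following is a minimal sketch that parses the two example rows above using only the Python standard library. It assumes that `trace_seren.csv` is comma-separated and that its header matches the column names shown; `SAMPLE_CSV` and `load_job_trace` are hypothetical names introduced here, not part of the repository. The real file may contain missing or differently typed values that this sketch does not handle.

```python
import csv
import io
from datetime import datetime

# The two example rows from the table above, written as CSV.
# Assumption: the real file is comma-separated with this exact header.
SAMPLE_CSV = """job_id,user,node_num,gpu_num,cpu_num,type,state,submit_time,start_time,end_time,duration,queue,gpu_time
5778432,u5907,1,8,128,Other,FAILED,2023-03-01 00:18:22+08:00,2023-03-01 00:18:54+08:00,2023-03-01 00:20:51+08:00,117,32,936.0
5778469,u5907,1,8,128,Other,COMPLETED,2023-03-01 00:23:58+08:00,2023-03-01 00:24:11+08:00,2023-03-01 01:09:04+08:00,2693,13,21544.0
"""

TIME_FIELDS = ("submit_time", "start_time", "end_time")
INT_FIELDS = ("node_num", "gpu_num", "cpu_num", "duration", "queue")


def load_job_trace(source):
    """Read job records from a file-like object and convert field types."""
    jobs = []
    for row in csv.DictReader(source):
        for name in TIME_FIELDS:
            # Timestamps carry a UTC offset, so the result is timezone-aware.
            row[name] = datetime.fromisoformat(row[name])
        for name in INT_FIELDS:
            row[name] = int(row[name])
        row["gpu_time"] = float(row["gpu_time"])
        jobs.append(row)
    return jobs


jobs = load_job_trace(io.StringIO(SAMPLE_CSV))
for job in jobs:
    print(job["job_id"], job["state"], job["end_time"] - job["start_time"])
```

To read the actual trace, an open file handle (for example `open("data/job_trace/trace_seren.csv", newline="")`, following the directory layout above) can be passed in place of the `io.StringIO` object.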
trace_kalos.csv
Example
job_id | user | node_num | gpu_num | cpu_num | mem_per_pod_GB | shared_mem_per_pod | type | state | submit_time | start_time | end_time | fail_time | stop_time | duration | queue | gpu_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dlctk696s0jbvitv | uf794 | 8 | 64 | 960 | 1000 | 100.0 | Other | FAILED | 2023-05-17 11:00:58+00:00 | 2023-05-17 11:01:08+00:00 | 2023-05-17 11:01:16+00:00 | 2023-05-17 11:01:16+00:00 | | 18 | 10.0 | 1152.0 |
dlc1t2ypl09b8qtp | uf794 | 8 | 64 | 960 | 1000 | 100.0 | Other | CANCELLED | 2023-05-17 11:28:42+00:00 | 2023-05-17 11:28:54+00:00 | 2023-05-17 11:30:04+00:00 | | 2023-05-17 11:30:04+00:00 | 82 | 12.0 | 5248.0 |
Field | Description |
---|---|
job_id | unique id of the job |
user | hashed id of the user, prefixed with 'u' |
node_num | number of nodes in the job |
gpu_num | number of GPUs required for the job |
cpu_num | number of CPUs required for the job |
type | workload type in LLM development |
state | the job's status upon termination (note 1) |
submit_time | the job's submission time |
start_time | the job's start execution time |
end_time | the job's termination time |
duration | total execution time of the job (note 2) |
queue | total queue time of the job (note 3) |
gpu_time | total GPU resources consumed by the job (note 4) |
Only in Kalos:
Field | Description |
---|---|
mem_per_pod_GB | Pod memory resource configuration (in GB) |
shared_mem_per_pod | Pod shared memory resource configuration |
fail_time | the time at which the failure occurs |
stop_time | the time at which the job stops |
Notes:

1. A job can end up with one of five statuses: (1) `COMPLETED`: it finished successfully; (2) `CANCELLED`: it was terminated by the user; (3) `FAILED`: it was terminated due to internal or external errors; (4) `TIMEOUT`: its execution time exceeded the limit; (5) `NODE_FAIL`: it was terminated due to a node crash. `TIMEOUT` and `NODE_FAIL` are very rare in our traces, and are regarded as failed in our analysis.
2. Calculated from the difference between `end_time` and `start_time`. (Unit: seconds)
3. Calculated from the difference between `start_time` and `submit_time`. (Unit: seconds)
4. Calculated from the product of `duration` and `gpu_num`.
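
The derived fields and the status grouping described in the notes can be sketched as follows. This is only an illustration of the stated definitions, checked against the first `trace_seren.csv` example row; `derive_fields`, `normalize_state` and `FAILED_LIKE` are hypothetical names introduced here and are not functions from `utils.py`.

```python
from datetime import datetime

# States that the notes say are regarded as failed in the analysis.
FAILED_LIKE = {"FAILED", "TIMEOUT", "NODE_FAIL"}


def derive_fields(submit_time, start_time, end_time, gpu_num):
    """Recompute duration, queue and gpu_time from the raw timestamps."""
    submit = datetime.fromisoformat(submit_time)
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    duration = (end - start).total_seconds()   # end_time - start_time
    queue = (start - submit).total_seconds()   # start_time - submit_time
    gpu_time = duration * gpu_num              # duration x gpu_num
    return {"duration": duration, "queue": queue, "gpu_time": gpu_time}


def normalize_state(state):
    """Fold the rare TIMEOUT and NODE_FAIL states into FAILED."""
    return "FAILED" if state in FAILED_LIKE else state


# First example row of trace_seren.csv.
row = derive_fields(
    "2023-03-01 00:18:22+08:00",
    "2023-03-01 00:18:54+08:00",
    "2023-03-01 00:20:51+08:00",
    gpu_num=8,
)
print(row)  # {'duration': 117.0, 'queue': 32.0, 'gpu_time': 936.0}
print(normalize_state("NODE_FAIL"))  # FAILED
```

The recomputed values match the `duration`, `queue` and `gpu_time` columns of that example row (117, 32 and 936.0).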
Cluster resource utilization monitoring data, collected from DCGM, IPMI and Prometheus.
NODE_CPU_UTILIZATION.csv
Example
Time | 10.140.1.10 | 10.140.1.54 | 10.140.1.90 | 10.140.1.41 | 10.140.1.98 | 10.140.0.166 | 10.140.1.4 | 10.140.1.40 | 10.140.1.134 | 10.140.0.147 | 10.140.1.119 | 10.140.0.184 | 10.140.0.151 | 10.140.0.254 | 10.140.1.83 | 10.140.0.246 | 10.140.1.78 | 10.140.1.103 | 10.140.1.155 | 10.140.1.87 | 10.140.1.106 | 10.140.1.140 | 10.140.1.150 | 10.140.1.107 | 10.140.1.172 | 10.140.1.95 | 10.140.0.146 | 10.140.1.125 | 10.140.1.50 | 10.140.1.112 | 10.140.0.159 | 10.140.0.144 | 10.140.0.215 | 10.140.1.36 | 10.140.1.143 | 10.140.1.147 | 10.140.1.14 | 10.140.1.85 | 10.140.1.56 | 10.140.0.243 | 10.140.0.242 | 10.140.1.63 | 10.140.0.132 | 10.140.0.255 | 10.140.1.59 | 10.140.1.130 | 10.140.0.218 | 10.140.0.220 | 10.140.1.27 | 10.140.1.67 | 10.140.1.136 | 10.140.1.84 | 10.140.0.190 | 10.140.1.121 | 10.140.1.146 | 10.140.1.38 | 10.140.0.232 | 10.140.1.18 | 10.140.1.66 | 10.140.0.205 | 10.140.1.154 | 10.140.1.170 | 10.140.0.179 | 10.140.0.135 | 10.140.1.102 | 10.140.1.72 | 10.140.0.249 | 10.140.1.138 | 10.140.1.24 | 10.140.1.60 | 10.140.1.82 | 10.140.0.233 | 10.140.1.23 | 10.140.0.241 | 10.140.0.248 | 10.140.1.68 | 10.140.1.1 | 10.140.0.219 | 10.140.1.116 | 10.140.0.157 | 10.140.0.178 | 10.140.1.29 | 10.140.1.57 | 10.140.0.163 | 10.140.1.52 | 10.140.1.177 | 10.140.1.11 | 10.140.1.26 | 10.140.1.34 | 10.140.1.92 | 10.140.0.211 | 10.140.0.161 | 10.140.0.131 | 10.140.1.124 | 10.140.0.238 | 10.140.1.44 | 10.140.0.237 | 10.140.1.79 | 10.140.1.17 | 10.140.0.214 | 10.140.1.153 | 10.140.1.117 | 10.140.1.109 | 10.140.0.167 | 10.140.0.207 | 10.140.0.134 | 10.140.1.99 | 10.140.1.31 | 10.140.1.127 | 10.140.0.250 | 10.140.1.139 | 10.140.1.53 | 10.140.1.123 | 10.140.1.77 | 10.140.0.133 | 10.140.0.251 | 10.140.1.55 | 10.140.1.12 | 10.140.1.19 | 10.140.1.47 | 10.140.1.118 | 10.140.1.61 | 10.140.1.110 | 10.140.1.64 | 10.140.1.129 | 10.140.0.217 | 10.140.1.104 | 10.140.0.244 | 10.140.0.213 | 10.140.1.97 | 10.140.0.136 | 10.140.1.22 | 10.140.1.32 | 10.140.1.171 | 10.140.1.151 | 10.140.1.96 | 10.140.1.46 | 10.140.0.158 | 10.140.1.51 | 10.140.1.86 | 10.140.1.30 | 10.140.0.156 | 10.140.1.43 | 10.140.1.74 | 10.140.1.89 | 10.140.1.169 | 10.140.1.80 | 10.140.1.2 | 10.140.1.108 | 10.140.1.93 | 10.140.1.73 | 10.140.0.180 | 10.140.1.71 | 10.140.1.88 | 10.140.0.209 | 10.140.1.81 | 10.140.0.152 | 10.140.1.28 | 10.140.1.58 | 10.140.0.236 | 10.140.0.138 | 10.140.0.149 | 10.140.0.206 | 10.140.1.15 | 10.140.0.240 | 10.140.0.203 | 10.140.1.5 | 10.140.1.37 | 10.140.0.143 | 10.140.0.160 | 10.140.0.252 | 10.140.1.75 | 10.140.1.115 | 10.140.0.247 | 10.140.1.6 | 10.140.1.16 | 10.140.0.216 | 10.140.0.150 | 10.140.1.25 | 10.140.0.208 | 10.140.1.62 | 10.140.1.173 | 10.140.1.137 | 10.140.1.9 | 10.140.1.65 | 10.140.1.111 | 10.140.1.135 | 10.140.1.114 | 10.140.1.132 | 10.140.0.154 | 10.140.0.204 | 10.140.1.91 | 10.140.1.120 | 10.140.1.105 | 10.140.1.131 | 10.140.0.165 | 10.140.0.210 | 10.140.0.148 | 10.140.1.133 | 10.140.0.239 | 10.140.1.13 | 10.140.1.144 | 10.140.0.137 | 10.140.0.234 | 10.140.1.142 | 10.140.1.168 | 10.140.0.235 | 10.140.0.140 | 10.140.1.39 | 10.140.0.153 | 10.140.0.139 | 10.140.1.3 | 10.140.1.7 | 10.140.1.94 | 10.140.1.145 | 10.140.1.149 | 10.140.1.152 | 10.140.1.35 | 10.140.0.141 | 10.140.1.69 | 10.140.1.100 | 10.140.1.126 | 10.140.0.142 | 10.140.0.185 | 10.140.1.42 | 10.140.0.231 | 10.140.0.253 | 10.140.0.212 | 10.140.1.21 | 10.140.1.148 | 10.140.1.49 | 10.140.1.128 | 10.140.0.164 | 10.140.1.70 | 10.140.1.45 | 10.140.0.162 | 10.140.1.101 | 10.140.0.145 | 10.140.1.20 | 10.140.1.176 | 10.140.1.33 | 10.140.1.113 | 10.140.1.122 | 10.140.1.76 | 10.140.1.141 | 10.140.1.8 | 10.140.0.155 | 10.140.1.48 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2023-07-01 08:00:00+08:00 | 8.101 | 7.809 | 8.034 | 0.437 | 0.672 | 8.988 | 8.395 | 8.205 | 8.763 | 2.037 | 6.661 | 9.177 | 9.017 | 8.096 | 14.423 | 8.04 | 0.354 | 0.34 | 0.843 | 8.66 | 0.657 | 8.104 | 0.902 | 7.006 | 0.107 | 8.298 | 8.546 | 6.413 | 8.1 | 6.633 | 8.167 | 9.246 | 9.055 | 2.963 | 7.995 | 0.707 | 8.119 | 10.531 | 6.654 | 7.707 | 4.626 | 0.848 | 25.274 | 7.95 | 8.014 | 7.908 | 9.313 | 9.184 | 7.877 | 0.484 | 8.451 | 6.137 | 0.124 | 6.163 | 0.316 | 8.343 | 9.024 | 7.922 | 8.427 | 0.455 | 67.47 | 0.395 | 7.487 | 9.142 | 7.898 | 8.071 | 7.717 | 0.755 | 7.869 | 8.193 | 8.368 | 8.911 | 8.108 | 7.934 | 8.269 | 8.161 | 8.349 | 9.252 | 6.933 | 4.823 | 7.527 | 8.42 | 7.243 | 9.166 | 8.04 | 0.092 | 7.921 | 8.28 | 8.027 | 0.365 | 8.71 | 9.302 | 0.88 | 8.055 | 8.817 | 8.07 | 9.316 | 8.064 | 8.061 | 9.319 | 7.101 | 5.221 | 7.086 | 7.701 | 9.259 | 8.857 | 5.079 | 7.944 | 8.02 | 8.244 | 8.038 | 8.269 | 5.108 | 6.971 | 1.787 | 8.095 | 8.055 | 8.275 | 8.396 | 7.787 | 6.898 | 8.224 | 16.323 | 0.671 | 8.071 | 9.125 | 8.004 | 7.888 | 8.785 | 5.412 | 0.621 | 8.004 | 7.91 | 6.727 | 10.327 | 0.413 | 8.499 | 7.735 | 8.255 | 8.087 | 8.001 | 5.908 | 8.239 | 8.279 | 7.272 | 0.14 | 8.186 | 0.526 | 6.771 | 6.386 | 6.763 | 7.308 | 6.741 | 8.047 | 8.883 | 7.059 | 8.79 | 7.864 | 8.065 | 9.474 | 0.481 | 9.179 | 9.579 | 8.157 | 9.063 | 7.339 | 8.295 | 6.81 | 9.029 | 9.037 | 8.042 | 0.717 | 6.675 | 7.838 | 8.192 | 8.038 | 9.004 | 8.621 | 8.117 | 8.177 | 22.467 | 0.198 | 3.4 | 8.086 | 7.86 | 6.891 | 4.376 | 7.144 | 5.331 | 8.924 | 7.668 | 0.332 | 7.961 | 7.958 | 8.164 | 5.741 | 8.938 | 8.969 | 6.372 | 8.816 | 8.361 | 12.62 | 9.149 | 9.151 | 8.374 | 8.831 | 9.332 | 9.181 | 8.142 | 8.653 | 1.449 | 8.268 | 8.481 | 8.568 | 0.468 | 59.942 | 66.076 | 8.191 | 8.96 | 8.223 | 0.478 | 8.023 | 9.129 | 9.6 | 8.164 | 9.518 | 8.172 | 9.551 | 8.012 | 14.544 | 8.154 | 8.069 | 9.344 | 0.357 | 8.09 | 0.463 | 8.082 | 7.657 | 8.139 | 0.164 | 8.143 | 6.56 | 6.632 | 8.018 | 8.065 | 8.288 | 8.667 | 8.078 |
Field | Description |
---|---|
Time | sampling timestamp; the sampling interval is 15 seconds |
10.140.xx.xx | server IP address |
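
Because each utilization file stores one column per server, it is often convenient to reshape it into long `(time, server, value)` records before aggregating. The sketch below shows one way to do this with the Python standard library, using the first three server columns of the example row above. It assumes the file is comma-separated and that missing samples, if any, appear as empty cells; `SAMPLE_CSV`, `to_long` and `mean_per_timestamp` are hypothetical names introduced for illustration.

```python
import csv
import io
from statistics import mean

# First three server columns of the NODE_CPU_UTILIZATION.csv example row.
# Assumption: the real file is comma-separated with a "Time" column first.
SAMPLE_CSV = """Time,10.140.1.10,10.140.1.54,10.140.1.90
2023-07-01 08:00:00+08:00,8.101,7.809,8.034
"""


def to_long(source):
    """Yield (timestamp, server_ip, value) tuples from a wide table."""
    for row in csv.DictReader(source):
        timestamp = row.pop("Time")
        for server_ip, value in row.items():
            if value != "":  # skip empty cells (assumed to mean "no sample")
                yield timestamp, server_ip, float(value)


def mean_per_timestamp(records):
    """Average the values of all servers for each sampling timestamp."""
    grouped = {}
    for timestamp, _server_ip, value in records:
        grouped.setdefault(timestamp, []).append(value)
    return {timestamp: mean(values) for timestamp, values in grouped.items()}


records = list(to_long(io.StringIO(SAMPLE_CSV)))
cluster_mean = mean_per_timestamp(records)
for timestamp, value in cluster_mean.items():
    print(timestamp, round(value, 3))
```

For the full files (tens of gigabytes in total), streaming rows through a generator as above avoids loading a whole table into memory, at the cost of speed compared with a columnar library.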