ClusterData 2011 traces
John Wilkes and Charles Reiss.
clusterdata-2011-2 trace represents 29 day's worth of Borg cell information
from May 2011, on a cluster of about 12.5k machines. (The
-2 refers to the fact that we added some additional data after the initial release, to create trace version 2.1.)
- If you haven't already joined our mailing list, please do so now. Important: please fill out the "reason" field, or your application will be rejected.
clusterdata-2011-2 trace starts at 19:00 EDT on Sunday May 1, 2011, and
the datacenter is in that timezone (US Eastern). This corresponds to a trace
timestamp of 600s; see the data schema documentation for why.
The trace is described in the trace-data v2.1 format + schema document.
Priorities in this trace range from 0 to 11 inclusive; bigger numbers mean "more
important". 0 and 1 are “free” priorities; 9, 10, and 11 are “production”
priorities; and 12 is a “monitoring” priority.
clusterdata-2011-2 trace is identical to the one called
clusterdata-2011-1, except for the addition of a single new column of data in
task_usage tables. This new data is a randomly-picked 1 second sample of
CPU usage from within the associated 5-minute usage-reporting period for that
task. Using this data, it is possible to build up a stochastic model of task
utilization over time for long-running tasks.
The data and trace documentation are made available under the CC-BY license. By downloading it or using them, you agree to the terms of this license.
Downloading the trace
Download instructions for the trace are in the v2.1 format + schema document.
The trace is stored in
Google Storage for Developers in the
clusterdata-2011-2. The total size of the compressed trace is
Most users should use the gsutil command-line tool to download the trace data.
Known anomalies in the trace
Disk-time-fraction data is only included in about the first 14 days, because of a change in our monitoring system.
Some jobs are deliberately omitted because they ran primarily on machines not included in this trace. The portion that ran on included machines amounts to approximately 0.003% of the machines’ task-seconds of usage.
We are aware of only one example of a job that retains its job ID after being stopped, reconfigured, and restarted (job number 6253771429).
Approximately 70 jobs (for example, job number 6377830001) have job event records but no task event records. We believe that this is legitimate in a majority of cases: typically because the job is started but its tasks are disabled for its entire duration.
Approximately 0.013% of task events and 0.0008% of job events in this trace have a non-empty missing info field.
We estimate that less than 0.05% of job and task scheduling event records are missing and less than 1% of resource usage measurements are missing.
Some cycles per instruction (CPI) and memory accesses per instruction (MAI) measurements are clearly inaccurate (for example, they are above or below the range possible on the underlying micro-architectures). We believe these measurements are caused by bugs in the data-capture system used, such as the cycle counter and instruction counter not being read at the same time. To obtain useful data from these measurements, we suggest filtering out measurements representing a very small amount of CPU time and measurements with unreasonable CPI and MAI values.
Please send email to email@example.com.