Changelog

0.9.2

  • Fixed non-deterministic ordering when running jobs. Now jobs always run starting with the lowest JID again.

0.9.1

  • Add the --stagger flag for machine setup. This helps avoid resource bottlenecks during setup tasks for many machines.
  • Add some more columns to job stat for convenience.
  • Add --only_done flag to job stat to only include jobs that are done.
  • Add retry limits to job add and job matrix add.
  • All command line arguments that expect a JID now also accept the value last to indicate the JID of the last job.

0.9.0

  • Renamed job matrix stat to job matrix ls for better consistency with job ls.
  • The following have all been removed in favor of the new job stat command:
    • The --output flags of job matrix ls and job ls
    • The job matrix csv subcommand
    • The job results subcommand
  • The job stat command has gained significant super powers. It is now vastly more useful for post-processing results and outputting data in a number of useful formats. See the help message for more info, but here are some examples:
    • Print a plain text table of the given experiments with only the given columns:
      > j job stat --text --jid --time --machine $EXPERIMENT_JIDS
      JID   TIME    MACHINE
      14923 1h34m   clnode199.clemson.cloudlab.us:22
      14951 1h4m    clnode201.clemson.cloudlab.us:22
      14956
      14963 46m2s   clnode212.clemson.cloudlab.us:22
    • Print a JSON of ID and log path for all running jobs:
      > j job stat --json --jid --log --running
      [{"jid":"15778","log":"/path/to/my.log\n"},{"jid":"15781","log":"/path/to/my.log\n"},{"jid":"15787","log":"/path/to/my.log\n"},{"jid":"15792","log":"/path/to/my.log\n"},{"jid":"15798","log":"/path/to/my.log\n"},{"jid":"15831","log":"/path/to/my.log\n"},{"jid":"15832","log":"/path/to/my.log\n"},{"jid":"15833","log":"/path/to/my.log\n"},{"jid":"15834","log":"/path/to/my.log\n"},{"jid":"15835","log":"/path/to/my.log\n"},{"jid":"15836","log":"/path/to/my.log\n"},{"jid":"15837","log":"/path/to/my.log\n"}]
    • Print a CSV generated by mapping each job's info with the given script, which takes a JSON of all info about a job:
      > j job stat --id 14740 --jid --results --cmd --csv --mapper /nobackup/extract.py
      Data filename,Huge page,Runtime (s),cpu_clk_unhalted.thread_any,cs,dtlb_load_misses.miss_causes_a_walk,dtlb_load_misses.walk_active,dtlb_store_misses.miss_causes_a_walk,dtlb_store_misses.walk_active,faults,inst_retired.any,migrations
      /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-316505881.mmu,TODO,356.387142448,3742290180352,5889,4265220134,116025476938,271985527,8132321876,11494749,5143124399584,6
      /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-414504453.mmu,TODO,360.590801957,3756669277437,3031,4273389066,116331409111,277050659,8249010131,11494750,5142777500448,10
      /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-822510504.mmu,TODO,356.487874315,3742004270438,5900,4274810097,116482238285,273613875,8186410000,11494750,5142880002204,4
      /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-27-43-860570478.mmu,TODO,359.158573514,3752229639845,2997,4276891646,116319096098,276830049,8289222935,11494752,5142941529188,9
      /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721583554560-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-34-03-577381879.mmu,TODO,358.304955228,3751894627436,3188,4267564111,116262044231,274054941,8203832653,11258667,5141843316464,14
    • Print a plain text table of the given experiments, using the given script to map over a particular column of the data as plain text:
      > j job stat --running --cmd --cmd_map  /tmp/replace_a_with_unk.sh  --text
      cmd
      exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
      exp00010  --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
    • Combine with shell to restart all failed jobs among the listed experiments:
      > j job restart $(j job stat --text --jid --status $EXPERIMENT_JIDS | grep Failed | awk '{print $1}')
  • Added the job mvresults subcommand to copy all files associated with a task to a new location.
  • Fixed issue where copying results hangs due to SSH host key verification failure. This was a long-standing and annoying issue. Instead, we now detect this case and print a specific error message encouraging the user to add the given host to their known_hosts file. Additionally, we move the host out of the class so that further experiments won't error out wastefully. The user can move it back when the host has been added to known_hosts.
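
For post-processing scripts that would rather not shell out to grep and awk, the filter in the restart example above can be sketched in Python. This is only an illustration: the function name failed_jids is hypothetical, and the sample table (including the STATUS header and the Done/Running values) is made up for the demo. The only behavior taken from the example is "keep lines containing Failed, print the first column".

```python
def failed_jids(table_text):
    """Return the first column of every line that mentions 'Failed'."""
    jids = []
    for line in table_text.splitlines():
        fields = line.split()
        # Same filter as `grep Failed | awk '{print $1}'`.
        if "Failed" in line and fields:
            jids.append(fields[0])
    return jids


# Hypothetical sample; the real header and status text may differ.
sample = """JID   STATUS
14923 Done
14951 Failed
14956 Running
14963 Failed
"""

print(failed_jids(sample))
```

The returned JIDs can then be passed to j job restart in the same way as the output of the shell pipeline.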

0.8.1

  • Fixes a panic on "narrow" terminals.

0.8.0

  • Change the way results files are identified. The runner should now return a common prefix of all files to be copied, and the jobserver will copy all files with that prefix. In contrast, in the past, you had to return a filepath with a glob.
  • The client now has some better support for manipulating said prefixes.
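
To make the prefix rule concrete, here is a minimal sketch of "all files with that prefix" in Python. It is not the jobserver's implementation; files_with_prefix, demo, and the file names are hypothetical, and the sketch assumes a prefix is matched as a plain string against the start of each path.

```python
import glob
import os
import tempfile


def files_with_prefix(prefix):
    """List every path that starts with `prefix`, sorted for stable output."""
    # glob.escape keeps characters such as '[' or '*' in the prefix literal.
    return sorted(glob.glob(glob.escape(prefix) + "*"))


def demo():
    """Create three hypothetical results files and select two by prefix."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("exp_10-run1.mmu", "exp_10-run1.log", "exp_11-run1.mmu"):
            open(os.path.join(tmp, name), "w").close()
        prefix = os.path.join(tmp, "exp_10-run1")
        return [os.path.basename(p) for p in files_with_prefix(prefix)]


print(demo())
```

With a glob, the runner had to describe the whole set of files as a pattern; with a prefix, it only needs to name the part that all of its output files share.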

0.7.1

  • Added machine mv subcommand.
  • Fix minor bugs.

0.7.0

  • Matrices that have become empty because all of their jobs were forgotten will also be forgotten. This is different from prior behavior, so I'm bumping the major version.
  • Added support for timing out jobs.
  • Added a shortcut for restarting a job.
  • Added -r flag to list all running jobs.
  • Fix some bugs.
  • Bump the optimization level a bit.

0.6.1

  • Minor backwards-compatible changes to client-server protocol and vast refactoring of client-side printing for job ls. These produce a major improvement in the format of job listings for matrices.

0.6

  • Changes to client-side j machine rm arguments to allow removing classes of machines more easily. This allows removing expired reservations more easily.

0.5

  • Add j job results subcommand.
  • Major improvements to handling of failed/cloned jobs in matrices:
    • When a matrix job is cloned, the clone also ends up in the matrix.
    • Matrix jobs automatically repeat on failure.
  • j job matrix add now supports the -x flag.
  • j job ls now prints a summary of the printed jobs.
  • Internal rearchitecting of the thread that copies results back to the host. This may allow future improvements to handling of failed/timed out/hanging copying tasks.

0.4

  • Reimplemented the server state serialization for snapshots. This fixes weird errors where tasks would become corrupted after a server restart for no apparent reason. Unfortunately, this is a breaking change to the format of the server snapshots, so tasks that were already in the snapshot will show as Unknown after restarting the server into version 0.4.

0.3

  • This is the first version I published on crates.io.