jasonwu224/cvmfs-model-file-benchmark

GSoC: Evaluating CVMFS for Machine Learning Model Distribution

Project Goals

The goal of this project is to evaluate the CernVM File System (CVMFS) as a platform to store, organize, and distribute model files. Specifically, we want to understand the latency and file access patterns for inference tasks using models loaded from a CVMFS repo.

Project State

What I did

  • Set up a test client machine and a CVMFS Stratum 0 server (at model-registry-test.cern.ch) on two LXPlus virtual machines.
  • Tested container tools such as Skopeo, Oras, and CVMFS Ducc for unpacking container images and publishing them to model-registry-test.cern.ch.
  • Wrote a GPU-accelerated test inference script in Python using Microsoft's Phi-4-mini LLM in .onnx format, and recorded performance data in CSV files comparing local and CVMFS storage. Wrote a helper Bash script to clear the kernel and CVMFS caches between repeat trials.
  • Found significant loading overhead on CVMFS (>30 sec) compared to local storage (>10 sec) on the first load, but once the model is cached the load time is negligible (>1 sec). Inference performance (token generation) was the same for both.
  • Attempted to test loading overhead with file chunking disabled on the server, and discovered a bug where CVMFS fails to read unchunked files.
  • Began investigating inference-time file access patterns using CVMFS debug logs and wrote scripts to count cvmfs_read() calls at inference time. Initial tests using cat <file> showed that cvmfs_read() is called only once per file per CVMFS process, which was unexpected.
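
The load-time comparison described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark script from this repo: the function names, the CSV columns, and the `read_whole_file` stand-in loader are all assumptions made for the example. In a real run, `load` would be whatever call builds the model session, and `before_trial` would be a hook that clears the kernel and CVMFS caches so that each trial starts cold.

```python
import csv
import time
from pathlib import Path
from typing import Callable, Optional


def time_load(load: Callable[[Path], object], path: Path) -> float:
    """Return the wall-clock seconds taken by one call to load(path)."""
    start = time.perf_counter()
    load(path)
    return time.perf_counter() - start


def run_trials(
    load: Callable[[Path], object],
    paths: dict[str, Path],
    trials: int,
    before_trial: Optional[Callable[[], None]] = None,
) -> list[dict]:
    """Time `load` on each storage location `trials` times.

    `paths` maps a label (for example "local" or "cvmfs") to a model path.
    `before_trial` is a hypothetical hook, e.g. a cache-clearing helper.
    """
    rows = []
    for storage, path in paths.items():
        for trial in range(trials):
            if before_trial is not None:
                before_trial()
            seconds = time_load(load, path)
            rows.append({"storage": storage, "trial": trial, "load_seconds": seconds})
    return rows


def write_csv(rows: list[dict], out_path: Path) -> None:
    """Write the timing rows to a CSV file with a header line."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["storage", "trial", "load_seconds"])
        writer.writeheader()
        writer.writerows(rows)


def read_whole_file(path: Path) -> int:
    """Stand-in loader (assumption): read the file in 1 MiB blocks, return its size."""
    total = 0
    with open(path, "rb") as f:
        while block := f.read(1024 * 1024):
            total += len(block)
    return total
```

Leaving out `before_trial` gives warm-cache timings instead of cold ones, which is the other half of the comparison reported above.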

What remains (high to low priority)

  • Fix CVMFS continuously crashing on the client machine (began after investigating file access patterns)
  • Run the file tracing scripts during inference runs
  • Merge the file tracing scripts with the fuse-monitor-read-write tool (from this GitHub repo) to display file access heat maps
  • Run inference benchmarks on smaller, commonly used CERN models like Particle Transformer, which are only a few MB, compared with several GB for Phi-4-mini
  • Develop the server into a publicly usable one by creating Stratum 1 proxy servers and using Ducc to automatically add and unpack user-requested container images, in the same manner as unpacked.cern.ch
  • Develop the server infrastructure into an OCI-compliant registry that supports authentication and HTTP requests

Additional info

  • fuse-monitor-data contains the access pattern for the locally-stored Phi-4-mini
  • benchmark-data contains some of the inference benchmark results described above
  • trace-cvmfs-reads.sh, inodes.csv, raw-reads.txt, and read_data.csv are part of the prototype CVMFS file tracing pipeline
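
As a rough idea of what the counting step in such a tracing pipeline could look like, here is a short Python sketch. It is not the code in trace-cvmfs-reads.sh. It assumes that each read appears in the CVMFS debug log as a line containing `cvmfs_read inode: <number>`, which should be checked against the real log output and the pattern adjusted if needed. The inode-to-path mapping is passed in as a plain dictionary because the column layout of inodes.csv is not described here.

```python
import re
from collections import Counter
from typing import Iterable

# ASSUMPTION: a read is logged as "... cvmfs_read inode: <number> ...".
# Adjust this pattern to match the actual debug-log text.
READ_PATTERN = re.compile(r"cvmfs_read inode: (\d+)")


def count_reads(lines: Iterable[str]) -> Counter:
    """Count matching read lines per inode number."""
    counts: Counter = Counter()
    for line in lines:
        match = READ_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def label_counts(counts: Counter, inode_to_path: dict[int, str]) -> list[tuple[str, int]]:
    """Replace inode numbers with paths where known, most-read files first."""
    labelled = [
        (inode_to_path.get(inode, f"inode:{inode}"), n)
        for inode, n in counts.items()
    ]
    # Sort by descending count, then by name so the order is deterministic.
    return sorted(labelled, key=lambda item: (-item[1], item[0]))
```

Because `count_reads` accepts any iterable of lines, it can be fed an open log file directly, so a large debug log does not need to be held in memory.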

Challenges and learning

  • The biggest challenges I faced were system-administration problems: packages I installed would be incompatible with each other or with the machine for various reasons, and sometimes they would even cause the machine to crash!
  • I learned a lot about sysadmin tools and tasks: using ssh, configuring clients and servers, monitoring files and processes, and working with containers. One of the coolest moments was when I first got the inference script to work; seeing the client and server communicate seamlessly was amazing! Another was when I used tmux to create two panes and simultaneously view the debug output of a running CVMFS process and print the size of the CVMFS cache. Seeing the cache fill up with data chunks sent over HTTP was so cool.
  • In the beginning, it was challenging to understand all the container-specific terminology and implementations. Now I've gained an appreciation for how standardized container workflows have become (through OCI) and how convenient and secure they can be for both sharing coding environments and files (artifacts) through registries.
  • Debugging crashes can be quite difficult. In the final week I never figured out why CVMFS kept crashing on startup, which halted my progress.
