The goal of this project is to evaluate the CernVM File System (CVMFS) as a platform to store, organize, and distribute model files. Specifically, we want to understand the latency and file access patterns for inference tasks using models loaded from a CVMFS repo.
- Set up a test client machine and a CVMFS Stratum 0 server (at `model-registry-test.cern.ch`) on two LXPlus virtual machines.
- Tested various container tools such as Skopeo, ORAS, and CVMFS DUCC to unpack and then publish container images into `model-registry-test.cern.ch`.
- Wrote a GPU-accelerated test inference script in Python using Microsoft's Phi-4-mini LLM in `.onnx` format, and recorded performance data into CSV files comparing local vs. CVMFS storage. Also wrote a helper bash script to clear the kernel and CVMFS caches between repeat trials.
- Found that there is significant loading overhead on CVMFS (over 30 seconds) compared to local storage (around 10 seconds) on the first load, but once the model is cached the load time is negligible (about 1 second). Inference performance (generating tokens) was the same for both.
- Attempted to test loading overhead with file chunking disabled on the server, and discovered a bug where CVMFS fails to read unchunked files.
- Began investigating inference-time file access patterns using CVMFS debug logs, and wrote scripts to count `cvmfs_read()` calls at inference time. Initial tests using `cat <file>` showed that the read only gets called once per file per CVMFS process, which was unexpected.
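The read-counting step can be prototyped with a short script. Note that the log line format below is an assumption for illustration (CVMFS debug-log output varies between versions); the real tracing pipeline would adapt the pattern to the actual `cvmfs_read()` entries in its logs.

```python
import re
from collections import Counter

# Hypothetical debug-log pattern: adjust to the actual CVMFS debug output.
# Here we assume each read entry contains "cvmfs_read" followed by
# "path: <file>" identifying the file being read.
READ_RE = re.compile(r"cvmfs_read.*?path:\s*(\S+)")

def count_reads(log_lines):
    """Count cvmfs_read() entries per file path in a debug log."""
    counts = Counter()
    for line in log_lines:
        match = READ_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage sketch: python count_reads.py <debug-log-file>
    with open(sys.argv[1]) as log:
        for path, n in count_reads(log).most_common():
            print(f"{n},{path}")
```

Aggregating into a `Counter` keyed by path makes it easy to spot whether a file is read once (as the `cat` test suggested) or many times during inference.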
- Fix CVMFS continuously crashing on the client machine (the crashes began after investigating file access patterns)
- Run the file tracing scripts during inference runs
- Merge the file tracing scripts with the `fuse-monitor-read-write` tool from this GitHub repo to display file access heat maps
- Run inference benchmarks on smaller, commonly used CERN models such as Particle Transformer, which are just a few MB compared to the several-GB Phi-4-mini
- Develop the server into a publicly usable one by creating Stratum 1 proxy servers and using DUCC to automatically add and unpack user-requested container images, in the same manner as `unpacked.cern.ch`
- Develop the server infrastructure into an OCI-compliant registry with authentication and HTTP request support
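For the benchmarks on smaller models, the first-load overhead can be approximated by timing a full sequential read of the model file, since cold-load latency on an uncached CVMFS mount is dominated by fetching chunks over HTTP into the local cache. A minimal sketch (the paths in the usage comment are placeholders, not real repository paths):

```python
import time

def time_cold_read(path, chunk_size=1 << 20):
    """Time a full sequential read of a file in 1 MiB chunks.

    On an uncached CVMFS mount this is dominated by HTTP chunk
    fetches from the server; on a warm cache or local disk it
    measures plain filesystem read speed.
    """
    start = time.perf_counter()
    total_bytes = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total_bytes += len(chunk)
    return time.perf_counter() - start, total_bytes

# Hypothetical usage: compare CVMFS vs. local copies of the same model
# (clear the kernel and CVMFS caches between trials for a true cold read).
# cvmfs_time, _ = time_cold_read("/cvmfs/model-registry-test.cern.ch/phi4/model.onnx")
# local_time, _ = time_cold_read("/data/local/phi4/model.onnx")
```

Because this isolates file I/O from model initialization, it can separate CVMFS transfer overhead from the framework's own load-time costs.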
- `fuse-monitor-data` contains the access pattern for the locally stored Phi-4-mini
- `benchmark-data` contains some of the inference benchmarking I referenced previously
- `trace-cvmfs-reads.sh`, `inodes.csv`, `raw-reads.txt`, and `read_data.csv` are part of the prototype CVMFS file tracing pipeline
- The biggest challenges I faced were system-administration problems: the packages I'd install would be incompatible with each other or with the machine for various reasons, and sometimes they would even crash the machine!
- I learned a lot about sysadmin tools: using ssh, configuring clients and servers, monitoring files and processes, and working with containers. One of the coolest moments was when I first got the inference script to work; seeing the client and server communicate seamlessly was amazing! Another was when I used tmux to create two panes and simultaneously view the debug output of a running CVMFS process and the size of the CVMFS cache — watching the cache fill up with data chunks sent over HTTP was so cool.
- In the beginning, it was challenging to understand all the container-specific terminology and implementations. Now I've gained an appreciation for how standardized container workflows have become (through OCI), and how convenient and secure they can be for sharing both coding environments and files (artifacts) through registries.
- Debugging crashes can be quite difficult; in the final week I never figured out why CVMFS kept crashing on startup, which halted my progress.