[Misc] Migrate to use XET-Core Rust based HF Hub in ome-agent replica#282
Conversation
Summary of ChangesHello @beiguo218, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the Hugging Face model download capabilities within the ome-agent replica. By transitioning from a custom Go-based utility to the high-performance Rust-based XET-core library, the system is expected to handle model downloads more efficiently and reliably. The changes encompass core dependency updates, configuration adjustments, and comprehensive testing to validate the new integration. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request successfully migrates the ome-agent replica from the Go-based hfutil to the new Rust-based XET-core library for Hugging Face Hub downloads. The changes are comprehensive, covering dependency updates, configuration changes, and refactoring of related code to use the new xet package. The introduction of a dedicated fx module for the xet client is well-structured and includes new tests. Overall, this is a solid improvement. I have a couple of suggestions to enhance error handling observability and code clarity.
| if err != nil { | ||
| // Check error type for better handling | ||
| var rateLimitErr *hub.RateLimitError | ||
| var httpErr *hub.HTTPError | ||
| if errors.As(err, &rateLimitErr) || | ||
| errors.As(err, &httpErr) && httpErr.StatusCode == 429 || | ||
| strings.Contains(err.Error(), "429") || | ||
| strings.Contains(err.Error(), "rate limit") { | ||
| logger.Warnf("Rate limited while downloading HuggingFace model %s: %v", input.Source.BucketName, err) | ||
| } else { | ||
| logger.Errorf("Failed to download HuggingFace model %s: %v", input.Source.BucketName, err) | ||
| } | ||
| return downloadPath, err | ||
| logger.Errorf("Failed to download snapshot: %v", err) | ||
| } |
There was a problem hiding this comment.
The error handling for downloadSnapHook has been simplified. The previous implementation had specific logic to detect rate-limiting errors from Hugging Face and log them as warnings. This new implementation logs all errors from the download as a generic "Failed to download snapshot" error.
While the new xet-core library is likely more resilient to rate-limiting, as mentioned in the PR description, this change could represent a loss of observability if such errors can still occur.
Could you please check if the xet library provides a way to distinguish different error types (e.g., transient network errors, rate-limiting, fatal errors)? If so, it would be beneficial to restore more nuanced error logging. For example:
if err != nil {
var xetErr *xet.XetError
if errors.As(err, &xetErr) && isRateLimitError(xetErr) {
logger.Warnf("Rate limited while downloading snapshot: %v", err)
} else {
logger.Errorf("Failed to download snapshot: %v", err)
}
}(Assuming isRateLimitError can be implemented based on xetErr.Code or xetErr.Message)
| func WithAppParams(params HubParams) Option { | ||
| return func(c *Config) error { | ||
| return nil | ||
| } | ||
| } |
There was a problem hiding this comment.
The function WithAppParams is currently a no-op. It is called in pkg/xet/module.go but doesn't perform any configuration. This can be confusing for future developers. If this function is a placeholder for future functionality, it would be helpful to add a // TODO comment explaining its purpose. If it's not needed, it should be removed to avoid dead code and its corresponding call in module.go.
| func WithAppParams(params HubParams) Option { | |
| return func(c *Config) error { | |
| return nil | |
| } | |
| } | |
| // WithAppParams applies configuration parameters from Hub params. | |
| // TODO: Implement logic to apply parameters from HubParams to the config if needed. | |
| func WithAppParams(params HubParams) Option { | |
| return func(c *Config) error { | |
| return nil | |
| } | |
| } |
|
Will do fix the build. |
There was a problem hiding this comment.
should we delete this or add this to gitignore
Fixed |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR migrates the ome-agent replica from using the Go-based hfutil package to the new Rust-based XET-core library for Hugging Face Hub downloads. It preserves the existing interface while leveraging the performance benefits of the Rust implementation. Including changes for below:
hfutilwithxetpackage in ome agent replica: Updated all Hugging Face download operations to use the new Rust-based XET-core client;Special notes for your reviewer:
Tested. Triggered a OME agent replica Job with 30 pods in parallel to import llama3.1 8b model from HF, no any 429 error being observed, all pods run successfully at their first try.