-
Couldn't load subscription status.
- Fork 19.6k
StringLookup & IntegerLookup now save vocabulary loaded from file
#21751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @hertschuh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the robustness and portability of Keras models utilizing Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request modifies StringLookup and IntegerLookup to embed vocabularies loaded from files directly into the .keras model archive. This is a great improvement, making saved models self-contained and removing the dependency on external vocabulary files at loading time. The implementation is solid, correctly adjusting get_config to no longer save the file path and leveraging save_assets to store the vocabulary content.
I've also noted the addition of a security check in set_vocabulary to prevent loading arbitrary files when safe_mode is enabled. This is an important enhancement. I've left one comment suggesting the addition of a test case to ensure this security feature is robust and remains functional in the future.
Overall, this is a high-quality contribution that improves model portability and security.
| if serialization_lib.in_safe_mode(): | ||
| raise ValueError( | ||
| "Loading vocabulary files outside of the model archive " | ||
| "being reloaded is not allowed" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great security addition to prevent loading arbitrary files when safe_mode is on. To ensure this behavior is always preserved, it would be beneficial to add a test case for this check.
For example, a test in string_lookup_test.py could look like this:
def test_safe_mode_vocabulary_file_disallowed(self):
temp_dir = self.get_temp_dir()
vocab_path = os.path.join(temp_dir, "vocab.txt")
with open(vocab_path, "w") as file:
file.write("a\nb\nc\n")
layer = layers.StringLookup()
with saving.serialization_lib.SafeModeScope(True):
with self.assertRaisesRegex(
ValueError,
"Loading vocabulary files outside of the model archive "
"being reloaded is not allowed"
):
layer.set_vocabulary(vocab_path)
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #21751 +/- ##
==========================================
+ Coverage 82.59% 82.64% +0.05%
==========================================
Files 572 572
Lines 58535 58558 +23
Branches 9158 9154 -4
==========================================
+ Hits 48345 48395 +50
+ Misses 7853 7834 -19
+ Partials 2337 2329 -8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| if serialization_lib.in_safe_mode(): | ||
| raise ValueError( | ||
| "Loading vocabulary files outside of the model archive " | ||
| "being reloaded is not allowed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add details to the error message, in particular inform users of the workaround (turning off safe mode)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Augmented message with similar verbiage from other safe mode errors.
in the `.keras` archive when they are initialized with a path to a vocabulary file. This makes the `.keras` archive fully self contained. This was already the behavior when using either `set_vocabulary` or `adapt`. Simply, this behavior was extended to the case when `__init__` is called with a vocabulary file. Note that this is technically a breaking change. Previously, upon doing `keras.saving.load_model`, it would be looking up the vocabulary file at the exact same path as when originally constructed. Also disallow loading an arbitrary vocabulary file during model loading with `safe_mode=True` since the vocabulary file should now come from the archive.
99595c1 to
b56889f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
in the
.kerasarchive when they are initialized with a path to a vocabulary file. This makes the.kerasarchive fully self contained.This was already the behavior when using either
set_vocabulary(path)oradapt. Simply, this behavior was extended to the case when__init__is called with a vocabulary file.Note that this is technically a breaking change. Previously, upon doing
keras.saving.load_model, it would be looking up the vocabulary file at the exact same path as when originally constructed.Also disallow loading an arbitrary vocabulary file during model loading with
safe_mode=Truesince the vocabulary file should now come from the archive.