@hertschuh
Collaborator

StringLookup and IntegerLookup now save their vocabulary in the .keras archive when they are initialized with a path to a vocabulary file. This makes the .keras archive fully self-contained.

This was already the behavior when using either set_vocabulary(path) or adapt. This change simply extends that behavior to the case where __init__ is called with a vocabulary file.

Note that this is technically a breaking change. Previously, keras.saving.load_model looked up the vocabulary file at the exact same path that was used when the layer was originally constructed.

Also disallow loading an arbitrary vocabulary file during model loading with safe_mode=True since the vocabulary file should now come from the archive.
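
To make the self-contained idea concrete, here is a minimal standard-library sketch. It is not the Keras implementation: `save_archive`, `load_archive`, `demo`, the `ToyLookup` class name, and the archive layout (`config.json` plus `assets/vocabulary.txt`) are all hypothetical names chosen for illustration. It only shows the general pattern of copying the vocabulary into the archive at save time so that loading no longer needs the original path.

```python
import json
import os
import tempfile
import zipfile


def save_archive(archive_path, vocab_path):
    """Toy saver: copy the vocabulary file's contents into the archive."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_text = f.read()
    with zipfile.ZipFile(archive_path, "w") as zf:
        # The config deliberately does not record the original file path.
        zf.writestr("config.json", json.dumps({"class_name": "ToyLookup"}))
        zf.writestr("assets/vocabulary.txt", vocab_text)


def load_archive(archive_path):
    """Toy loader: read the vocabulary from the archive only."""
    with zipfile.ZipFile(archive_path) as zf:
        config = json.loads(zf.read("config.json"))
        vocab = zf.read("assets/vocabulary.txt").decode("utf-8").splitlines()
    return config, vocab


def demo():
    """Save, delete the original vocabulary file, then load anyway."""
    with tempfile.TemporaryDirectory() as tmp:
        vocab_path = os.path.join(tmp, "vocab.txt")
        with open(vocab_path, "w", encoding="utf-8") as f:
            f.write("a\nb\nc\n")
        archive_path = os.path.join(tmp, "toy_model.zip")
        save_archive(archive_path, vocab_path)
        os.remove(vocab_path)  # the original file no longer exists
        return load_archive(archive_path)
```

In this sketch, `demo()` still recovers the tokens after the original file is deleted. Under the previous behavior described above, the equivalent reload would have needed the file at its original path.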

@gemini-code-assist
Contributor

Summary of Changes

Hello @hertschuh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness and portability of Keras models utilizing StringLookup and IntegerLookup layers. By embedding vocabulary files directly into the .keras archive upon initialization, it eliminates external file dependencies, ensuring that models are fully self-contained. This update also introduces a safeguard in safe_mode to prevent the loading of external vocabulary files, reinforcing the new self-contained paradigm.

Highlights

  • Vocabulary Embedding: The StringLookup and IntegerLookup layers now embed vocabulary files directly into the .keras archive when initialized with a file path, making the archive fully self-contained.
  • Self-Contained Archives: This change ensures that saved .keras models no longer depend on external vocabulary file paths during loading, enhancing portability and reliability.
  • Breaking Change: Previously, keras.saving.load_model would look for vocabulary files at their original paths. Now, the vocabulary is loaded from the embedded archive, which is a behavioral change.
  • Safe Mode Restriction: Loading external vocabulary files is now explicitly disallowed during model reloading when safe_mode=True to enforce the integrity of the self-contained archive.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request modifies StringLookup and IntegerLookup to embed vocabularies loaded from files directly into the .keras model archive. This is a great improvement, making saved models self-contained and removing the dependency on external vocabulary files at loading time. The implementation is solid, correctly adjusting get_config to no longer save the file path and leveraging save_assets to store the vocabulary content.

I've also noted the addition of a security check in set_vocabulary to prevent loading arbitrary files when safe_mode is enabled. This is an important enhancement. I've left one comment suggesting the addition of a test case to ensure this security feature is robust and remains functional in the future.

Overall, this is a high-quality contribution that improves model portability and security.
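
As a rough illustration of the split between configuration and assets described in this review, the sketch below uses a hypothetical `ToyVocabLayer` class. It is not the Keras `IndexLookup` code. The method names `get_config`, `save_assets`, and `load_assets` come from the review text, but their bodies, the `dir_path` parameter, and the `vocabulary.txt` file name are assumptions made for the example.

```python
import os


class ToyVocabLayer:
    """Hypothetical stand-in for a lookup layer; not the Keras class."""

    def __init__(self, vocabulary=None):
        # A string argument is treated as a path to a vocabulary file.
        if isinstance(vocabulary, str):
            with open(vocabulary, encoding="utf-8") as f:
                vocabulary = f.read().splitlines()
        self.vocabulary = list(vocabulary or [])

    def get_config(self):
        # Assumption: the config records no file path, so reloading
        # never depends on where the file used to live.
        return {"vocabulary": None}

    def save_assets(self, dir_path):
        # Write the tokens, one per line, into the asset directory.
        path = os.path.join(dir_path, "vocabulary.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("".join(token + "\n" for token in self.vocabulary))

    def load_assets(self, dir_path):
        # Restore the tokens from the file written by save_assets.
        path = os.path.join(dir_path, "vocabulary.txt")
        with open(path, encoding="utf-8") as f:
            self.vocabulary = f.read().splitlines()
```

A fresh `ToyVocabLayer()` can then be restored with `load_assets` alone, which mirrors the point that the saved content, not the original path, is what the reload relies on.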

Comment on lines 391 to 401
if serialization_lib.in_safe_mode():
    raise ValueError(
        "Loading vocabulary files outside of the model archive "
        "being reloaded is not allowed"
    )

medium

This is a great security addition to prevent loading arbitrary files when safe_mode is on. To ensure this behavior is always preserved, it would be beneficial to add a test case for this check.

For example, a test in string_lookup_test.py could look like this:

def test_safe_mode_vocabulary_file_disallowed(self):
    temp_dir = self.get_temp_dir()
    vocab_path = os.path.join(temp_dir, "vocab.txt")
    with open(vocab_path, "w") as file:
        file.write("a\nb\nc\n")

    layer = layers.StringLookup()
    with saving.serialization_lib.SafeModeScope(True):
        with self.assertRaisesRegex(
            ValueError,
            "Loading vocabulary files outside of the model archive "
            "being reloaded is not allowed"
        ):
            layer.set_vocabulary(vocab_path)
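
For readers without the Keras source at hand, the guard being tested can be approximated with the standard library alone. The sketch below is a toy model: `safe_mode_scope`, `in_safe_mode`, and `ToyLookup` are hypothetical stand-ins for `SafeModeScope`, `serialization_lib.in_safe_mode()`, and the real layer, and the thread-local flag is an assumption about how such a scope might be tracked. The error text is the one quoted in this review.

```python
import contextlib
import threading

_state = threading.local()


def in_safe_mode():
    """Return True while a safe-mode scope is active on this thread."""
    return getattr(_state, "safe_mode", False)


@contextlib.contextmanager
def safe_mode_scope(enabled):
    """Temporarily set the safe-mode flag, restoring it on exit."""
    previous = in_safe_mode()
    _state.safe_mode = enabled
    try:
        yield
    finally:
        _state.safe_mode = previous


class ToyLookup:
    """Hypothetical stand-in for a lookup layer; not the Keras class."""

    def __init__(self):
        self.vocabulary = []

    def set_vocabulary(self, vocabulary):
        if isinstance(vocabulary, str):
            # A string is treated as a path to an external file, which
            # is refused while safe mode is active.
            if in_safe_mode():
                raise ValueError(
                    "Loading vocabulary files outside of the model archive "
                    "being reloaded is not allowed"
                )
            with open(vocabulary, encoding="utf-8") as f:
                vocabulary = f.read().splitlines()
        self.vocabulary = list(vocabulary)
```

In this sketch the check runs before the file is opened, so a path is rejected in safe mode whether or not the file exists, while an in-memory list of tokens is still accepted.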

@codecov-commenter
codecov-commenter commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.64%. Comparing base (3137cb0) to head (b56889f).
⚠️ Report is 8 commits behind head on master.

Files with missing lines                          Patch %   Lines
keras/src/layers/preprocessing/index_lookup.py    50.00%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21751      +/-   ##
==========================================
+ Coverage   82.59%   82.64%   +0.05%     
==========================================
  Files         572      572              
  Lines       58535    58558      +23     
  Branches     9158     9154       -4     
==========================================
+ Hits        48345    48395      +50     
+ Misses       7853     7834      -19     
+ Partials     2337     2329       -8     
Flag               Coverage Δ
keras              82.44% <50.00%> (+0.05%) ⬆️
keras-jax          63.21% <50.00%> (+0.02%) ⬆️
keras-numpy        57.66% <50.00%> (+0.10%) ⬆️
keras-openvino     34.38% <25.00%> (+0.04%) ⬆️
keras-tensorflow   63.97% <50.00%> (+0.02%) ⬆️
keras-torch        63.53% <50.00%> (+0.04%) ⬆️

if serialization_lib.in_safe_mode():
    raise ValueError(
        "Loading vocabulary files outside of the model archive "
        "being reloaded is not allowed"
Collaborator

Please add details to the error message; in particular, inform users of the workaround (turning off safe mode).

Collaborator Author

Augmented the message with wording similar to that of other safe mode errors.

Collaborator

@fchollet fchollet left a comment

LGTM, thanks for the fix!

@google-ml-butler google-ml-butler bot added the kokoro:force-run and ready to pull labels Oct 17, 2025
@fchollet fchollet merged commit 61ac8c1 into keras-team:master Oct 17, 2025
8 checks passed
@google-ml-butler google-ml-butler bot removed the awaiting review, ready to pull, and kokoro:force-run labels Oct 17, 2025
@hertschuh hertschuh deleted the lookup_vocab branch October 18, 2025 04:15