
Add Data class and content-addressed storage upload#65

Merged
JyotinderSingh merged 4 commits into main from data-api-1
Mar 6, 2026
Conversation


@JyotinderSingh JyotinderSingh commented Mar 4, 2026

Add Data class and content-addressed storage upload

Summary

  • Introduces the Data class, a user-facing abstraction for declaring data dependencies (local files/directories or GCS URIs) that should be available on remote pods
  • Adds content-addressed upload to GCS via upload_data(), with SHA-256-based deduplication and .cache_marker sentinel for O(1) cache hit detection
  • Establishes a serialization protocol (_make_data_ref / is_data_ref) for replacing Data objects with serializable dicts in payloads
  • Centralizes project resolution into get_default_project() in infra.py, removing duplicate logic from storage.py and execution.py

The Data class

Data wraps a local file path, local directory path, or GCS URI. It is exported at the package top level:

from keras_remote import Data

Constructor

Data(path: str)
  • Local paths are resolved to absolute paths and validated for existence (raises FileNotFoundError if missing)
  • GCS URIs (gs://...) are stored as-is with no local validation

Properties

Property | Type | Description
path     | str  | Resolved absolute local path, or the GCS URI as given
is_gcs   | bool | True if the path starts with gs://
is_dir   | bool | For local paths: os.path.isdir(); for GCS: True if the URI ends with /
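
Taken together, the constructor and properties above suggest roughly this shape (a minimal sketch for illustration, not the actual keras_remote/data.py implementation):

```python
import os


class Data:
    """Illustrative sketch of the Data class described above."""

    def __init__(self, path: str):
        self._raw_path = path
        if path.startswith("gs://"):
            # GCS URIs are stored as-is, with no local validation.
            self._resolved_path = path
        else:
            # Local paths are resolved to absolute paths and must exist.
            self._resolved_path = os.path.abspath(path)
            if not os.path.exists(self._resolved_path):
                raise FileNotFoundError(f"No such file or directory: {path}")

    @property
    def path(self) -> str:
        """Resolved absolute local path, or the GCS URI as given."""
        return self._resolved_path

    @property
    def is_gcs(self) -> bool:
        return self._raw_path.startswith("gs://")

    @property
    def is_dir(self) -> bool:
        # For GCS, only a trailing slash marks a directory.
        if self.is_gcs:
            return self._raw_path.endswith("/")
        return os.path.isdir(self._resolved_path)
```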

content_hash() -> str

Computes a SHA-256 hex digest of all file contents, used for content-addressed caching on GCS.

  • Files: hashes b"file:" + filename + contents (64KB chunks)
  • Directories: hashes b"dir:" then walks the tree (sorted, no symlink following), hashing relative_path + contents for each file
  • The dir: / file: prefix prevents hash collisions between a single file and a directory containing only that file
  • Raises ValueError for GCS URIs
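
The hashing scheme above can be sketched as a standalone function (illustrative; the real logic is a method on Data and details may differ):

```python
import hashlib
import os

_CHUNK_SIZE = 64 * 1024  # 64KB read chunks, per the description above


def content_hash(path: str) -> str:
    """Sketch of the content-hash scheme for a local file or directory."""
    if path.startswith("gs://"):
        raise ValueError("content_hash() is not defined for GCS URIs")
    h = hashlib.sha256()
    if os.path.isdir(path):
        h.update(b"dir:")  # prefix prevents file/dir hash collisions
        for root, dirs, files in os.walk(path, followlinks=False):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                full = os.path.join(root, name)
                # Hash the relative path, then the file contents.
                h.update(os.path.relpath(full, path).encode())
                with open(full, "rb") as f:
                    for chunk in iter(lambda: f.read(_CHUNK_SIZE), b""):
                        h.update(chunk)
    else:
        h.update(b"file:" + os.path.basename(path).encode())
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(_CHUNK_SIZE), b""):
                h.update(chunk)
    return h.hexdigest()
```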

Examples

# Local directory
Data("./my_dataset/")

# Local file
Data("./config.json")

# GCS URI (no upload needed — passed through as-is)
Data("gs://my-bucket/datasets/imagenet/")

Content-addressed upload: upload_data()

upload_data(
    bucket_name: str,
    data: Data,
    project: str | None = None,
    namespace_prefix: str = "default",
) -> str  # returns GCS URI

Upload flow:

  1. GCS passthrough — if data.is_gcs, returns the original URI immediately
  2. Compute content hash — data.content_hash() produces a SHA-256 digest
  3. Cache check — looks for {namespace_prefix}/data-cache/{hash}/.cache_marker (O(1) blob existence check)
  4. Cache hit — returns gs://{bucket}/{namespace_prefix}/data-cache/{hash} with no upload
  5. Cache miss — uploads files preserving directory structure, writes .cache_marker last as an atomicity signal
  6. Large data warning — logs a warning if local data exceeds 10 GB, suggesting direct GCS URI usage

GCS layout:

gs://{bucket}/{namespace}/data-cache/{sha256}/
    file1.csv
    sub/file2.csv
    .cache_marker
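
The flow above can be sketched as follows (a minimal sketch, not the PR's implementation; the `bucket` handle is injected here so the control flow is visible without a live GCS connection, and the signature is an assumption):

```python
import os


def upload_data(bucket, bucket_name, data, namespace_prefix="default"):
    """Sketch of the content-addressed upload flow. `bucket` is a GCS
    bucket handle, e.g. google.cloud.storage.Client().bucket(bucket_name)."""
    if data.is_gcs:
        return data.path                              # 1. GCS passthrough
    cache_prefix = f"{namespace_prefix}/data-cache/{data.content_hash()}"
    marker = bucket.blob(f"{cache_prefix}/.cache_marker")
    if marker.exists():                               # 3./4. O(1) cache hit
        return f"gs://{bucket_name}/{cache_prefix}"
    if data.is_dir:                                   # 5. cache miss: upload
        for root, _, files in os.walk(data.path):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, data.path).replace(os.sep, "/")
                bucket.blob(f"{cache_prefix}/{rel}").upload_from_filename(full)
    else:
        name = os.path.basename(data.path)
        bucket.blob(f"{cache_prefix}/{name}").upload_from_filename(data.path)
    # Write the marker last: a partially uploaded entry is never a cache hit.
    marker.upload_from_string(b"")
    return f"gs://{bucket_name}/{cache_prefix}"
```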

Serialization protocol

Data objects are replaced with serializable dicts before pickling:

# _make_data_ref() produces:
{
    "__data_ref__": True,
    "gcs_uri": "gs://bucket/prefix",
    "is_dir": True,
    "mount_path": None  # or "/data" for volume-mounted data
}

# is_data_ref(obj) checks for the __data_ref__ sentinel key

@gemini-code-assist (bot) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances data management capabilities within the keras_remote framework by introducing a dedicated Data class for abstracting data locations. It also provides a robust mechanism for efficiently uploading and caching this data in Google Cloud Storage, which will streamline the process of handling datasets for remote function execution and improve performance by avoiding redundant uploads.

Highlights

  • Introduced Data Class: A new Data class has been added to represent data dependencies, supporting both local file/directory paths and GCS URIs. This class provides properties to determine if the path is a GCS URI or a directory, and includes a content_hash method for local data.
  • Content-Addressed GCS Uploads: A new upload_data utility function has been implemented to handle uploading Data objects to Google Cloud Storage. This function leverages content-hash-based caching, ensuring that data is uploaded only once and reused on subsequent cache hits, optimizing storage and transfer.
  • Top-Level Package Export: The newly created Data class is now exported from the keras_remote top-level package, making it easily accessible for users to declare their data dependencies.


Changelog
  • keras_remote/__init__.py
    • Imported and re-exported the new Data class from keras_remote.data.
  • keras_remote/data.py
    • Added a new file defining the Data class for managing local and GCS data paths.
    • Implemented __init__ to resolve local paths and validate existence.
    • Included properties path, is_gcs, and is_dir for data introspection.
    • Developed content_hash method to generate SHA-256 hashes for local files and directories, ensuring deterministic hashing and preventing collisions.
    • Added helper functions _make_data_ref for serializing data references and is_data_ref for checking them.
  • keras_remote/data_test.py
    • Added a new test file for the Data class and its associated helper functions.
    • Included comprehensive tests for Data constructor behavior with local files, directories, and GCS URIs.
    • Verified error handling for non-existent local paths.
    • Tested the content_hash method for determinism, content sensitivity, and handling of directories versus files.
    • Validated the functionality of _make_data_ref and is_data_ref.
  • keras_remote/utils/storage.py
    • Added the upload_data function to facilitate content-addressed uploads of Data objects to GCS.
    • Implemented logic for GCS cache hit detection using a sentinel marker blob.
    • Integrated content_hash from the Data class to determine cache keys.
    • Included a warning for uploading very large local datasets.
    • Added helper function _compute_total_size to calculate the size of files or directories.
    • Added helper function _upload_directory to recursively upload local directories to GCS while preserving their structure.
  • keras_remote/utils/storage_test.py
    • Imported the new Data class and related storage utilities for testing.
    • Added _make_temp_path helper for creating temporary test directories.
    • Introduced TestUploadData class to test the upload_data function, covering GCS data handling, cache hits, cache misses for files and directories, and custom namespaces.
    • Added TestComputeTotalSize class to test the _compute_total_size helper for files and directories.
    • Added TestUploadDirectory class to test the _upload_directory helper, ensuring correct preservation of directory structure during GCS uploads.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust Data class for managing data dependencies, supporting both local paths and GCS URIs. The implementation of content-addressable storage upload to GCS, complete with caching and a sentinel blob for efficient cache hit checks, is a significant improvement for performance and reliability. The warning for large local datasets is a thoughtful addition for user guidance. The accompanying unit tests are comprehensive and cover critical aspects and edge cases, ensuring the correctness of the new features. Overall, this is a well-designed and thoroughly tested addition.

@JyotinderSingh JyotinderSingh added the run-e2e Runs End-to-End tests on Internal Cluster. label Mar 4, 2026
@github-actions github-actions bot removed the run-e2e Runs End-to-End tests on Internal Cluster. label Mar 4, 2026

@divyashreepathihalli divyashreepathihalli left a comment


Thank you for the PR! Left some comments.

@property
def is_dir(self) -> bool:
    if self.is_gcs:
        return self._raw_path.endswith("/")
Collaborator:

Users frequently type Data("gs://my-bucket/dataset")
(without the trailing slash) when pointing to a directory.
maybe normalize the GCS URI on initialization (if it has multiple objects, treat it as a directory),

Author:

I'm not sure if normalizing via a GCS list-objects call in __init__ is the right thing to do here. GCS has no real concept of directories, we'll basically be listing objects under a prefix, which could be very slow for large buckets.

I've instead gone with two lightweight approaches to reduce the chance of errors:

  1. Added an explicit note in the docstring that a trailing slash is required for GCS directories, and updated the examples to show both cases (gs://my-bucket/datasets/imagenet/ vs gs://my-bucket/datasets/weights.h5).

  2. If the GCS path doesn't end with / and the last path segment has no file extension (no .), we emit a warning suggesting the user add a trailing slash. This is just a heuristic, but it catches the common case (Data("gs://my-bucket/dataset")) without any GCS SDK calls.

h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
Collaborator:

os.walk(..., followlinks=False) only prevents the code from walking into symlinked directories; it does not stop os.walk from yielding symlinked files. You should explicitly ignore symlinked files before opening them, or read their symlink target strings instead of their contents.

Author:

Hashing symlink target paths instead of contents would break content-addressed storage. Two datasets with identical actual data would produce different hashes just because one uses symlinks.

Ignoring symlinked files entirely would silently skip data that will be present on the remote side, producing a hash that doesn't represent the full dataset.

The current behavior (reading and hashing the resolved content of symlinked files) is the right thing for a content-addressed store, since the hash should reflect what the user's code will actually see at runtime. The followlinks=False is there to prevent infinite recursion from circular directory symlinks, and it does that correctly.

The docstring was inaccurate, though; I've updated it to be more precise about what's actually happening.
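
The reviewer's point about os.walk can be verified with a small experiment (illustrative; `walk_yields_symlinked_files` is a hypothetical helper, not part of the PR):

```python
import os


def walk_yields_symlinked_files(tmp: str) -> bool:
    """Shows that followlinks=False stops os.walk from descending into
    symlinked directories, yet symlinks to files are still yielded in
    `files` — so the hashing loop will open and hash their contents."""
    target = os.path.join(tmp, "real.txt")
    with open(target, "w") as f:
        f.write("data")
    link = os.path.join(tmp, "link.txt")
    os.symlink(target, link)  # symlink to a file, not a directory
    for _, _, files in os.walk(tmp, followlinks=False):
        if "link.txt" in files:
            return True
    return False
```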

h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
Collaborator:

Because empty directories aren't hashed, if a user changes their dataset structure by adding a required but empty output directory (e.g., ./my_dataset/empty_outputs/), the cache hash will be identical, and the upload_data() logic won't upload the new structure.

Author:

GCS is a flat object store with no concept of empty directories, so even if we re-uploaded, _upload_directory only walks files and the empty directory wouldn't exist on the remote side.

If user code needs an empty output directory at runtime, it should create it with os.makedirs(..., exist_ok=True).

return total


def _upload_directory(
Collaborator:

_upload_directory uploads files completely sequentially. Deep Learning datasets (like ImageNet or audio clips) often contain 100,000+ tiny files. Because each GCS upload has overhead (network latency, SSL handshake, HTTP framing), uploading 100,000 files sequentially will take hours, even if the total size is only a few megabytes.

You need to thread these file uploads to achieve adequate throughput
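
The suggested fix could be sketched with a thread pool (hypothetical; `upload_directory_threaded` and its signature are not from the PR, and `bucket` is an injected GCS bucket handle):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def upload_directory_threaded(bucket, local_dir, gcs_prefix, max_workers=32):
    """Fan file uploads out across a thread pool so per-request latency
    (network round trips, TLS handshakes) overlaps instead of serializing."""
    paths = []
    for root, _, files in os.walk(local_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, local_dir).replace(os.sep, "/")
            paths.append((full, f"{gcs_prefix}/{rel}"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(bucket.blob(dest).upload_from_filename, src)
            for src, dest in paths
        ]
        for fut in futures:
            fut.result()  # re-raise any upload error
    return len(paths)
```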

Author:

Let's tackle this separately in another change, created #68 for tracking.

bucket = client.bucket(bucket_name)

# O(1) cache hit check via sentinel blob
marker_blob = bucket.blob(f"{cache_prefix}/.cache_marker")
Collaborator:

nice!

@@ -0,0 +1,122 @@
"""Data class for declaring data dependencies in remote functions.
Collaborator:

Let's add a new folder for Data?

Author:

There's just one file, not sure if it's worth creating a new directory just yet?

Collaborator:

But having just this one file under the main keras_remote folder is weird.

Introduces the Data class for declaring data dependencies (local paths
or GCS URIs) and upload_data() for content-hash-based caching in GCS.
@JyotinderSingh JyotinderSingh merged commit 7268485 into main Mar 6, 2026
4 checks passed
@JyotinderSingh JyotinderSingh deleted the data-api-1 branch March 6, 2026 05:09