Add `Data` class and content-addressed storage upload (#65)
Conversation
Code Review
This pull request introduces a robust Data class for managing data dependencies, supporting both local paths and GCS URIs. The implementation of content-addressable storage upload to GCS, complete with caching and a sentinel blob for efficient cache hit checks, is a significant improvement for performance and reliability. The warning for large local datasets is a thoughtful addition for user guidance. The accompanying unit tests are comprehensive and cover critical aspects and edge cases, ensuring the correctness of the new features. Overall, this is a well-designed and thoroughly tested addition.
Force-pushed from 4313077 to a05a214.
divyashreepathihalli left a comment:
Thank you for the PR! Left some comments.
```python
@property
def is_dir(self) -> bool:
    if self.is_gcs:
        return self._raw_path.endswith("/")
```
Users frequently type `Data("gs://my-bucket/dataset")` (without the trailing slash) when pointing to a directory. Maybe normalize the GCS URI on initialization (if it has multiple objects, treat it as a directory)?
I'm not sure normalizing via a GCS list-objects call in `__init__` is the right thing to do here. GCS has no real concept of directories; we'd basically be listing objects under a prefix, which could be very slow for large buckets.

I've instead gone ahead with two lightweight approaches to reduce the chance of errors:

1. Added an explicit note in the docstring that a trailing slash is required for GCS directories, and updated the examples to show both cases (`gs://my-bucket/datasets/imagenet/` vs `gs://my-bucket/datasets/weights.h5`).
2. If the GCS path doesn't end with `/` and the last path segment has no file extension (no `.`), we emit a warning suggesting the user add a trailing slash. This is just a heuristic, but it catches the common case (`Data("gs://my-bucket/dataset")`) without any GCS SDK calls.
```python
h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
```
`os.walk(..., followlinks=False)` only prevents the code from walking into symlinked directories; it does not stop `os.walk` from yielding symlinked files. You should either explicitly skip symlinked files before opening them, or hash their symlink target strings instead of their contents.
Hashing symlink target paths instead of contents would break content-addressed storage. Two datasets with identical actual data would produce different hashes just because one uses symlinks.
Ignoring symlinked files entirely would silently skip data that will be present on the remote side, producing a hash that doesn't represent the full dataset.
The current behavior (reading and hashing the resolved content of symlinked files) is the right thing for a content-addressed store, since the hash should reflect what the user's code will actually see at runtime. The followlinks=False is there to prevent infinite recursion from circular directory symlinks, and it does that correctly.
The docstring is inaccurate, though; I've updated it to be more precise about what's actually happening.
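A quick self-contained demonstration of the behavior being discussed (the temp-dir layout is purely illustrative): with `followlinks=False`, symlinked files are still yielded, while symlinked directories are listed but not descended into.

```python
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "real_dir"))
with open(os.path.join(root, "real_dir", "data.txt"), "w") as f:
    f.write("payload")
# A symlink to a file is yielded by os.walk like any regular file.
os.symlink(os.path.join(root, "real_dir", "data.txt"),
           os.path.join(root, "file_link.txt"))
# A symlink to a directory appears in `dirs` but is not entered.
os.symlink(os.path.join(root, "real_dir"), os.path.join(root, "dir_link"))

seen = []
for _, _, files in os.walk(root, followlinks=False):
    seen.extend(files)
print(sorted(seen))  # data.txt appears once; file_link.txt is also yielded
```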
```python
h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
```
Because empty directories aren't hashed, if a user changes their dataset structure by adding a required but empty output directory (e.g., `./my_dataset/empty_outputs/`), the cache hash will be identical, and the `upload_data()` logic won't upload the new structure.
GCS is a flat object store with no concept of empty directories, so even if we re-uploaded, `_upload_directory` only walks files and the empty directory wouldn't exist on the remote side.
If user code needs an empty output directory at runtime, it should create it with `os.makedirs(..., exist_ok=True)`.
```python
    return total


def _upload_directory(
```
`_upload_directory` uploads files completely sequentially. Deep-learning datasets (like ImageNet or audio clips) often contain 100,000+ tiny files. Because each GCS upload carries per-request overhead (network latency, TLS handshake, HTTP framing), uploading 100,000 files sequentially can take hours even if the total size is only a few megabytes.
These file uploads need to run on multiple threads to achieve adequate throughput.
Let's tackle this separately in another change, created #68 for tracking.
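For reference, the shape of the fix being deferred here could look roughly like this (a sketch only; `upload_one` stands in for the per-file GCS upload call, and the function name is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_files_parallel(paths, upload_one, max_workers=16):
    """Overlap per-request latency of I/O-bound uploads across threads.

    `upload_one` is a callable taking a single path; in the real change it
    would wrap blob.upload_from_filename or similar.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces iteration so exceptions from workers propagate.
        list(pool.map(upload_one, paths))
```

Since each upload is network-bound, threads (rather than processes) are enough to hide the round-trip latency.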
```python
bucket = client.bucket(bucket_name)

# O(1) cache hit check via sentinel blob
marker_blob = bucket.blob(f"{cache_prefix}/.cache_marker")
```
```diff
@@ -0,0 +1,122 @@
+"""Data class for declaring data dependencies in remote functions.
```
Let's add a new folder for `Data`?
There's just one file; I'm not sure it's worth creating a new directory just yet?
But having just this file under the main `keras_remote` folder is weird.
Introduces the `Data` class for declaring data dependencies (local paths or GCS URIs) and `upload_data()` for content-hash-based caching in GCS.
Force-pushed from a05a214 to c11bf88.
Add `Data` class and content-addressed storage upload

Summary

- `Data` class, a user-facing abstraction for declaring data dependencies (local files/directories or GCS URIs) that should be available on remote pods
- `upload_data()`, with SHA-256-based deduplication and a `.cache_marker` sentinel for O(1) cache hit detection
- Serialization helpers (`_make_data_ref` / `is_data_ref`) for replacing `Data` objects with serializable dicts in payloads
- `get_default_project()` in `infra.py`, removing duplicate logic from `storage.py` and `execution.py`
The `Data` class

`Data` wraps a local file path, local directory path, or GCS URI. It is exported at the package top level.

Constructor
- Local paths are validated at construction (raises `FileNotFoundError` if missing)
- GCS URIs (`gs://...`) are stored as-is with no local validation

Properties
| Property | Type | Description |
| --- | --- | --- |
| `path` | `str` | |
| `is_gcs` | `bool` | `True` if the path starts with `gs://` |
| `is_dir` | `bool` | Local: `os.path.isdir()`; GCS: `True` if the URI ends with `/` |

`content_hash() -> str`

Computes a SHA-256 hex digest of all file contents, used for content-addressed caching on GCS.
- Files: hashes `b"file:" + filename + contents` (64 KB chunks)
- Directories: hashes `b"dir:"`, then walks the tree (sorted, no symlink following), hashing `relative_path + contents` for each file
- The `dir:`/`file:` prefix prevents hash collisions between a single file and a directory containing only that file
- Raises `ValueError` for GCS URIs

Examples
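Under those rules, the digest computation can be sketched roughly as follows (a simplified reconstruction, not the exact implementation):

```python
import hashlib
import os

def content_hash(path: str) -> str:
    # Sketch of the scheme described above: b"file:"/b"dir:" domain
    # prefixes, 64 KB chunks, sorted deterministic walk, no symlink
    # recursion into directories.
    h = hashlib.sha256()

    def _hash_file(full_path: str) -> None:
        with open(full_path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)

    if os.path.isdir(path):
        h.update(b"dir:")
        for root, dirs, files in os.walk(path, followlinks=False):
            dirs.sort()  # make traversal order deterministic
            for name in sorted(files):
                full = os.path.join(root, name)
                h.update(os.path.relpath(full, path).encode())
                _hash_file(full)
    else:
        h.update(b"file:" + os.path.basename(path).encode())
        _hash_file(path)
    return h.hexdigest()
```

The domain prefixes mean a lone file `x` and a directory containing only `x` can never share a digest, even with identical bytes.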
Content-addressed upload:
`upload_data()`

Upload flow:

1. If `data.is_gcs`, returns the original URI immediately
2. `data.content_hash()` produces a SHA-256 digest
3. Checks `{namespace_prefix}/data-cache/{hash}/.cache_marker` (O(1) blob existence check)
4. On a cache hit, returns `gs://{bucket}/{namespace_prefix}/data-cache/{hash}` with no upload
5. On a miss, uploads the files and writes `.cache_marker` last as an atomicity signal

GCS layout:
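The caching protocol above can be exercised against an in-memory stand-in for the bucket (everything here — the fake bucket class, function and field names — is illustrative, not the real GCS client):

```python
class FakeBucket:
    """In-memory stand-in for a GCS bucket, to demo the protocol only."""
    def __init__(self, name):
        self.name = name
        self.blobs = {}  # object path -> bytes

def upload_with_cache(bucket, namespace_prefix, content_hash, files):
    """files: dict of relative path -> bytes.
    Returns (cache URI, whether this call was a cache hit)."""
    prefix = f"{namespace_prefix}/data-cache/{content_hash}"
    uri = f"gs://{bucket.name}/{prefix}"
    marker = f"{prefix}/.cache_marker"
    if marker in bucket.blobs:           # O(1) sentinel existence check
        return uri, True
    for rel, contents in files.items():  # upload payload first...
        bucket.blobs[f"{prefix}/{rel}"] = contents
    bucket.blobs[marker] = b""           # ...marker last, as atomicity signal
    return uri, False

bucket = FakeBucket("my-bucket")
files = {"train.csv": b"1,2,3"}
uri, hit = upload_with_cache(bucket, "ns", "abc123", files)
uri2, hit2 = upload_with_cache(bucket, "ns", "abc123", files)
print(uri, hit, hit2)  # gs://my-bucket/ns/data-cache/abc123 False True
```

Writing the marker last means a crashed partial upload never registers as a cache hit; the next run simply re-uploads under the same hash.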
Serialization protocol

`Data` objects are replaced with serializable dicts before pickling: