Add `Data` class and content-addressed storage upload (#65)
Conversation
Code Review
This pull request introduces a robust Data class for managing data dependencies, supporting both local paths and GCS URIs. The implementation of content-addressable storage upload to GCS, complete with caching and a sentinel blob for efficient cache hit checks, is a significant improvement for performance and reliability. The warning for large local datasets is a thoughtful addition for user guidance. The accompanying unit tests are comprehensive and cover critical aspects and edge cases, ensuring the correctness of the new features. Overall, this is a well-designed and thoroughly tested addition.
Force-pushed from 4313077 to a05a214.
divyashreepathihalli left a comment:
Thank you for the PR! Left some comments.
```python
@property
def is_dir(self) -> bool:
    if self.is_gcs:
        return self._raw_path.endswith("/")
```
Users frequently type `Data("gs://my-bucket/dataset")` (without the trailing slash) when pointing to a directory. Maybe normalize the GCS URI on initialization (if it has multiple objects, treat it as a directory)?
I'm not sure normalizing via a GCS list-objects call in `__init__` is the right thing to do here. GCS has no real concept of directories; we'd basically be listing objects under a prefix, which could be very slow for large buckets.

I've instead gone ahead with two lightweight approaches to reduce the chance of errors:

1. Added an explicit note in the docstring that a trailing slash is required for GCS directories, and updated the examples to show both cases (`gs://my-bucket/datasets/imagenet/` vs `gs://my-bucket/datasets/weights.h5`).
2. If the GCS path doesn't end with `/` and the last path segment has no file extension (no `.`), we emit a warning suggesting the user add a trailing slash. This is just a heuristic, but it catches the common case (`Data("gs://my-bucket/dataset")`) without any GCS SDK calls.
```python
h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
```
`os.walk(..., followlinks=False)` only prevents the code from walking into symlinked directories; it does not stop `os.walk` from yielding symlinked files. You should either explicitly skip symlinked files before opening them, or hash their symlink target strings instead of their contents.
Hashing symlink target paths instead of contents would break content-addressed storage. Two datasets with identical actual data would produce different hashes just because one uses symlinks.
Ignoring symlinked files entirely would silently skip data that will be present on the remote side, producing a hash that doesn't represent the full dataset.
The current behavior (reading and hashing the resolved content of symlinked files) is the right thing for a content-addressed store, since the hash should reflect what the user's code will actually see at runtime. The followlinks=False is there to prevent infinite recursion from circular directory symlinks, and it does that correctly.
The docstring is inaccurate, though; I've updated it to be more precise about what's actually happening.
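A quick self-contained demonstration of the behavior being discussed (the temp-dir layout is purely illustrative): with `followlinks=False`, symlinked files are still yielded, while symlinked directories are listed but not descended into.

```python
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "real_dir"))
with open(os.path.join(root, "real_dir", "data.txt"), "w") as f:
    f.write("payload")
# A symlink to a file is yielded by os.walk like any regular file.
os.symlink(os.path.join(root, "real_dir", "data.txt"),
           os.path.join(root, "file_link.txt"))
# A symlink to a directory appears in `dirs` but is not entered.
os.symlink(os.path.join(root, "real_dir"), os.path.join(root, "dir_link"))

seen = []
for _, _, files in os.walk(root, followlinks=False):
    seen.extend(files)
print(sorted(seen))  # data.txt appears once; file_link.txt is also yielded
```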
```python
h = hashlib.sha256()
if os.path.isdir(self._resolved_path):
    h.update(b"dir:")
    for root, dirs, files in os.walk(self._resolved_path, followlinks=False):
```
Because empty directories aren't hashed, if a user changes their dataset structure by adding a required but empty output directory (e.g., `./my_dataset/empty_outputs/`), the cache hash will be identical, and the `upload_data()` logic won't upload the new structure.
GCS is a flat object store with no concept of empty directories, so even if we re-uploaded, `_upload_directory` only walks files and the empty directory wouldn't exist on the remote side.
If user code needs an empty output directory at runtime, it should create it with `os.makedirs(..., exist_ok=True)`.
```python
    return total


def _upload_directory(
```
`_upload_directory` uploads files completely sequentially. Deep-learning datasets (like ImageNet or audio clips) often contain 100,000+ tiny files. Because each GCS upload carries per-request overhead (network latency, TLS handshake, HTTP framing), uploading 100,000 files sequentially can take hours even if the total size is only a few megabytes.
These file uploads need to run on multiple threads to achieve adequate throughput.
Let's tackle this separately in another change, created #68 for tracking.
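For reference, the shape of the fix being deferred here could look roughly like this (a sketch only; `upload_one` stands in for the per-file GCS upload call, and the function name is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_files_parallel(paths, upload_one, max_workers=16):
    """Overlap per-request latency of I/O-bound uploads across threads.

    `upload_one` is a callable taking a single path; in the real change it
    would wrap blob.upload_from_filename or similar.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces iteration so exceptions from workers propagate.
        list(pool.map(upload_one, paths))
```

Since each upload is network-bound, threads (rather than processes) are enough to hide the round-trip latency.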
```python
bucket = client.bucket(bucket_name)

# O(1) cache hit check via sentinel blob
marker_blob = bucket.blob(f"{cache_prefix}/.cache_marker")
```
```diff
@@ -0,0 +1,122 @@
+"""Data class for declaring data dependencies in remote functions.
```
Let's add a new folder for `Data`?
There's just one file; I'm not sure it's worth creating a new directory just yet?
But having just this file under the main `keras_remote` folder is weird.
Introduces the `Data` class for declaring data dependencies (local paths or GCS URIs) and `upload_data()` for content-hash-based caching in GCS.
Force-pushed from a05a214 to c11bf88.
Add `Data` class and content-addressed storage upload

Summary

- `Data` class, a user-facing abstraction for declaring data dependencies (local files/directories or GCS URIs) that should be available on remote pods
- `upload_data()`, with SHA-256-based deduplication and a `.cache_marker` sentinel for O(1) cache hit detection
- Serialization helpers (`_make_data_ref` / `is_data_ref`) for replacing `Data` objects with serializable dicts in payloads
- `get_default_project()` in `infra.py`, removing duplicate logic from `storage.py` and `execution.py`
The `Data` class

`Data` wraps a local file path, local directory path, or GCS URI. It is exported at the package top level.

Constructor
- Local paths are validated at construction (raises `FileNotFoundError` if missing)
- GCS URIs (`gs://...`) are stored as-is with no local validation

Properties
| Property | Type | Description |
| --- | --- | --- |
| `path` | `str` | |
| `is_gcs` | `bool` | `True` if the path starts with `gs://` |
| `is_dir` | `bool` | Local: `os.path.isdir()`; GCS: `True` if the URI ends with `/` |

`content_hash() -> str`

Computes a SHA-256 hex digest of all file contents, used for content-addressed caching on GCS.
- Files: hashes `b"file:" + filename + contents` (64 KB chunks)
- Directories: hashes `b"dir:"`, then walks the tree (sorted, no symlink following), hashing `relative_path + contents` for each file
- The `dir:`/`file:` prefix prevents hash collisions between a single file and a directory containing only that file
- Raises `ValueError` for GCS URIs

Examples
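Under those rules, the digest computation can be sketched roughly as follows (a simplified reconstruction, not the exact implementation):

```python
import hashlib
import os

def content_hash(path: str) -> str:
    # Sketch of the scheme described above: b"file:"/b"dir:" domain
    # prefixes, 64 KB chunks, sorted deterministic walk, no symlink
    # recursion into directories.
    h = hashlib.sha256()

    def _hash_file(full_path: str) -> None:
        with open(full_path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)

    if os.path.isdir(path):
        h.update(b"dir:")
        for root, dirs, files in os.walk(path, followlinks=False):
            dirs.sort()  # make traversal order deterministic
            for name in sorted(files):
                full = os.path.join(root, name)
                h.update(os.path.relpath(full, path).encode())
                _hash_file(full)
    else:
        h.update(b"file:" + os.path.basename(path).encode())
        _hash_file(path)
    return h.hexdigest()
```

The domain prefixes mean a lone file `x` and a directory containing only `x` can never share a digest, even with identical bytes.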
Content-addressed upload:
`upload_data()`

Upload flow:

1. If `data.is_gcs`, returns the original URI immediately
2. `data.content_hash()` produces a SHA-256 digest
3. Checks `{namespace_prefix}/data-cache/{hash}/.cache_marker` (O(1) blob existence check)
4. On a cache hit, returns `gs://{bucket}/{namespace_prefix}/data-cache/{hash}` with no upload
5. On a miss, uploads the files and writes `.cache_marker` last as an atomicity signal

GCS layout:
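The caching protocol above can be exercised against an in-memory stand-in for the bucket (everything here — the fake bucket class, function and field names — is illustrative, not the real GCS client):

```python
class FakeBucket:
    """In-memory stand-in for a GCS bucket, to demo the protocol only."""
    def __init__(self, name):
        self.name = name
        self.blobs = {}  # object path -> bytes

def upload_with_cache(bucket, namespace_prefix, content_hash, files):
    """files: dict of relative path -> bytes.
    Returns (cache URI, whether this call was a cache hit)."""
    prefix = f"{namespace_prefix}/data-cache/{content_hash}"
    uri = f"gs://{bucket.name}/{prefix}"
    marker = f"{prefix}/.cache_marker"
    if marker in bucket.blobs:           # O(1) sentinel existence check
        return uri, True
    for rel, contents in files.items():  # upload payload first...
        bucket.blobs[f"{prefix}/{rel}"] = contents
    bucket.blobs[marker] = b""           # ...marker last, as atomicity signal
    return uri, False

bucket = FakeBucket("my-bucket")
files = {"train.csv": b"1,2,3"}
uri, hit = upload_with_cache(bucket, "ns", "abc123", files)
uri2, hit2 = upload_with_cache(bucket, "ns", "abc123", files)
print(uri, hit, hit2)  # gs://my-bucket/ns/data-cache/abc123 False True
```

Writing the marker last means a crashed partial upload never registers as a cache hit; the next run simply re-uploads under the same hash.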
Serialization protocol

`Data` objects are replaced with serializable dicts before pickling: