google cloud storage fixes #782

Merged
gsilvestrin merged 6 commits into main from gsilvestrin/issue_780 on Apr 19, 2023

@gsilvestrin gsilvestrin commented Apr 17, 2023

  • Avoid copying files, to work around a GCS limitation on copying files created by multipart upload
  • Disable caching on objects stored by lance, so that public buckets behave just like private ones
  • Bump object_store

GCloud docs on Caching: https://cloud.google.com/storage/docs/xml-api/reference-headers#cachecontrol

close #780
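
To illustrate the first bullet, here is a minimal sketch. It assumes, based on this PR's original title ("writing manifest twice"), that the workaround uploads the same bytes to both destinations instead of issuing a server-side copy. `InMemoryStore`, `commit_by_writing_twice`, and the path arguments are hypothetical names used only for illustration; they are not lance or object_store APIs.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an object store (not the object_store API).
#[derive(Default)]
struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl InMemoryStore {
    /// Upload `bytes` to `path`, replacing any existing object.
    fn put(&mut self, path: &str, bytes: &[u8]) {
        self.objects.insert(path.to_string(), bytes.to_vec());
    }
}

/// Assumed workaround: instead of uploading once and then copying the object
/// (the step that fails on GCS for multipart-created objects), upload the
/// same bytes to both destinations.
fn commit_by_writing_twice(
    store: &mut InMemoryStore,
    versioned_path: &str,
    latest_path: &str,
    manifest: &[u8],
) {
    store.put(versioned_path, manifest);
    store.put(latest_path, manifest);
}
```

The trade-off in this sketch is one extra upload per commit in exchange for never calling the copy operation.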

gsilvestrin added 2 commits April 17, 2023 16:57
- avoid copying files to work around gcs bug
- Disable cache on objects stored by lance
- bump object_store
@gsilvestrin gsilvestrin changed the title from "WIP writing manifest twice to work around GCloud bug" to "google cloud storage fixes" Apr 18, 2023
@gsilvestrin gsilvestrin linked an issue Apr 18, 2023 that may be closed by this pull request
@gsilvestrin gsilvestrin marked this pull request as ready for review April 18, 2023 18:50
async fn build_gcs_object_store(uri: &str) -> Result<Arc<dyn OSObjectStore>> {
    // GCS enables cache for public buckets, we disable to improve consistency
    let mut headers = HeaderMap::new();
    headers.insert(CACHE_CONTROL, "no-cache".parse().unwrap());
Member

Does this impact reads or writes? Are there any implications you can see of hardcoding the cache behavior on the user's behalf?

Contributor Author

It applies to all requests (read and write), but in practice what it does is set the "Cache-Control" metadata of the uploaded files. object_store does not cache responses, so lance will always retrieve the files when performing a GET, even for objects that could be cached.
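
As a rough model of "applies to all requests", the sketch below merges a set of client-wide default headers into each outgoing request's headers. It uses only the standard library: `Headers`, `default_headers`, and `apply_defaults` are hypothetical names, and the rule that an explicit per-request header overrides a default is an assumption of this sketch, not a statement about object_store internals.

```rust
use std::collections::BTreeMap;

/// Header name -> value, with names stored in lowercase (hypothetical model).
type Headers = BTreeMap<String, String>;

/// Defaults attached to every request, mirroring the `no-cache` value in the diff.
fn default_headers() -> Headers {
    let mut headers = Headers::new();
    headers.insert("cache-control".to_string(), "no-cache".to_string());
    headers
}

/// Merge the defaults into one request's headers.
/// Assumption: a header set explicitly on the request overrides the default.
fn apply_defaults(defaults: &Headers, request: &Headers) -> Headers {
    let mut merged = defaults.clone();
    for (name, value) in request {
        merged.insert(name.to_ascii_lowercase(), value.clone());
    }
    merged
}
```

In this model a PUT carries `cache-control: no-cache`, which is what ends up stored as the object's metadata; on a GET the header is sent too, but nothing on the lance side caches the response either way.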

Member

@eddyxu eddyxu Apr 18, 2023

Does it mark the file as "uncached" even if users want to cache it (outside of the lance library)? For example, if users want to have a cache set up for their data lake, will this overwrite their cache setup, or is it only applicable to our lance library?

Contributor Author

It will apply to all HTTP caches. For example, if Chrome used to cache files downloaded from the bucket, it will no longer do so. I'm not sure how cache setups for data lakes work, but if the bucket is non-public, the files are already set up to not be cached.
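
To make the effect on HTTP caches concrete, here is a deliberately simplified sketch of how a cache might read a `Cache-Control` value. `reusable_without_revalidation` is a hypothetical helper: it only looks at `no-cache`, `no-store`, and `max-age`, and ignores everything else a real cache considers (`Expires`, heuristic freshness, `s-maxage`, and so on). The sample values assume that GCS defaults to `public, max-age=3600` for publicly readable objects and `private, max-age=0` otherwise, as I understand the caching docs linked in the PR description.

```rust
/// Returns true if, under this simplified model, a cache may serve a stored
/// copy without first revalidating it with the origin.
fn reusable_without_revalidation(cache_control: &str) -> bool {
    let mut max_age: Option<u64> = None;
    for directive in cache_control.split(',') {
        let directive = directive.trim().to_ascii_lowercase();
        // `no-store`: do not keep a copy; `no-cache`: revalidate before every reuse.
        if directive == "no-cache" || directive == "no-store" {
            return false;
        }
        if let Some(seconds) = directive.strip_prefix("max-age=") {
            max_age = seconds.trim().parse().ok();
        }
    }
    // Without a positive max-age this model treats the response as immediately stale.
    matches!(max_age, Some(seconds) if seconds > 0)
}
```

Under these assumptions, an object in a public bucket could previously be served from a cache for up to an hour, whereas with `no-cache` it is revalidated on every read, which matches the behavior already described for non-public buckets.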

@gsilvestrin gsilvestrin requested a review from eddyxu April 19, 2023 01:08
@gsilvestrin gsilvestrin merged commit d66f2b3 into main Apr 19, 2023
@gsilvestrin gsilvestrin deleted the gsilvestrin/issue_780 branch April 19, 2023 17:04
Development

Successfully merging this pull request may close these issues.

Writing dataset to GCS fails with OSError