
Add Azure Blob Storage remote #868

Merged: 1 commit into iterative:master on Jul 9, 2018

Conversation

@c-w (Contributor) commented on Jul 9, 2018

This change adds a new cloud remote based on Azure Blob Storage (https://azure.microsoft.com/en-us/services/storage/blobs/) and the azure-storage-blob SDK (https://github.com/Azure/azure-storage-python).

The implementation closely follows the patterns laid out by the existing cloud remotes (AWS and GCP). Notable differences include:

  • No use of a lock in the progress-bar callback and no requirement to look up total transfer size ahead of time: the Azure Storage SDK provides the total transfer bytes and current transfer offset in its progress callback.

  • When exists is called, check the existence of every blob instead of listing the entire container. Container list operations can be expensive, so unless the number of checked files is very large, making separate requests is likely to be more efficient. Please let me know if a list-then-check implementation of the exists function would be preferred.

  • For increased flexibility, access credentials for the Azure Storage backend can be provided either connection-string style via the remote URL or via environment variables. The Azure Storage backend can be emulated via Azurite (https://github.com/Azure/azurite).
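As a sketch of the connection-string option above (illustrative only — this is not dvc's actual parsing code, and the environment-variable names are assumptions for the example, not necessarily the ones dvc reads):

```python
import os

# Illustrative sketch: Azure connection strings are semicolon-separated
# "Key=Value" pairs, e.g.
#   "DefaultEndpointsProtocol=https;AccountName=foo;AccountKey=bar=="

def parse_connection_string(conn_str):
    """Split a connection string into a dict of its Key=Value parts."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue
        key, sep, value = segment.partition("=")
        if sep:  # skip malformed segments that have no "="
            parts[key] = value
    return parts

def get_credentials(conn_str=None):
    """Prefer a URL-provided connection string, fall back to env vars.

    AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_KEY are assumed names for
    this sketch, not guaranteed to match dvc's configuration.
    """
    if conn_str:
        parts = parse_connection_string(conn_str)
        return parts.get("AccountName"), parts.get("AccountKey")
    return (os.environ.get("AZURE_STORAGE_ACCOUNT"),
            os.environ.get("AZURE_STORAGE_KEY"))
```

Using `partition` rather than `split` keeps base64 account keys intact, since they typically end in `=` padding.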

This change was created together with @EricSchles.
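The progress-callback point can be illustrated with a minimal sketch (hypothetical names, not the dvc implementation): because the Azure Storage SDK hands its progress hook both the current offset and the total size, the callback can be a pure function, whereas per-chunk callbacks in the AWS style need a shared counter guarded by a lock.

```python
# Hypothetical sketch (not dvc code): the azure-storage SDK passes both
# the current offset and the total size to its progress callback, so the
# callback needs no shared counter, no lock, and no size lookup before
# the transfer starts.

def format_progress(name, current, total):
    """Render one progress update; both values come from the SDK."""
    percent = 100.0 * current / total if total else 100.0
    return "{}: {}/{} bytes ({:.0f}%)".format(name, current, total, percent)

def progress_callback(current, total):
    # Signature shape expected by the SDK's progress hook.
    print(format_progress("data.csv", current, total))

progress_callback(512, 1024)
progress_callback(1024, 1024)
```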

@efiop (Member) approved these changes on Jul 9, 2018, leaving a comment:

Great patch! Thank you!

Review thread on the exists() implementation:

    } for md5 in md5s]

    def exists(self, path_infos):
        # TODO: why does S3 fetch all blobs first?
@efiop (Member), Jul 9, 2018:

It is simply much faster :) We mostly use the exists() method when filtering a bulk of cache files to decide whether we need to download or upload them, and in S3 list_objects is much, much faster than checking keys one by one. If the blob_service.exists() method is fast enough as is, there is no reason to redo this part of the code.
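The trade-off can be sketched with a toy model (illustrative only — not dvc or SDK code): per-blob existence checks cost one request each, while list-then-check costs one listing pass regardless of how many names are checked.

```python
# Toy model (not dvc code): count requests under the two strategies.

class FakeBlobService:
    def __init__(self, blobs):
        self._blobs = set(blobs)
        self.requests = 0

    def exists(self, name):
        # Analogous to one HEAD-style request per blob.
        self.requests += 1
        return name in self._blobs

    def list_blobs(self):
        # Analogous to a single (possibly paged) listing request.
        self.requests += 1
        return set(self._blobs)

def exists_one_by_one(service, names):
    return [service.exists(n) for n in names]

def exists_list_then_check(service, names):
    listed = service.list_blobs()
    return [n in listed for n in names]
```

With N names to check, the first strategy issues N requests, while the second issues listing requests proportional to the container size; which wins depends on how N compares to the number of blobs in the container.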

@shcheklein (Member), Jul 9, 2018:

Let's put this as a comment instead of this TODO.

@efiop (Member) commented on Jul 9, 2018:

Fixes #442.

@efiop efiop merged commit 1bcd740 into iterative:master Jul 9, 2018
3 checks passed
@c-w mentioned this pull request on Jul 9, 2018
@efiop (Member) commented on Jul 9, 2018:

The current exists() implementation is fine, thank you! I'll set up a test shortly and, if it turns out to be too slow, change it to the list-then-check logic.

@c-w @EricSchles Thank you guys so much for the amazing work! I will be sure to release a new version this week, after I set up Azure for testing.

@c-w (Contributor, Author) commented on Jul 9, 2018:

Thanks for the super quick merge, @efiop.

It would be excellent to know the results of the test for multiple-exists versus list-then-check, so please keep me in the loop :)

Do also let me know if you need any help setting up Blob Storage for testing; e.g. Azurite works well, and I've used it with Docker on Travis in other OSS projects in the past.

@efiop (Member) commented on Jul 9, 2018:

> It would be excellent to know the results of the test for multiple-exists versus list-then-check, so please keep me in the loop :)

Sure :)

> Do also let me know if you need any help setting up Blob Storage for testing; e.g. Azurite works well, and I've used it with Docker on Travis in other OSS projects in the past.

Actually, I'm not very familiar with Azure, and I would really appreciate it if you could add Azurite to our Travis setup so we can test your driver with it. Could you please add it? I will be happy to provide any help needed to assist with that. Once the test is up and running on CI, I will be able to release a new version of dvc with your driver included right away.
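For reference, Azurite emulates a storage account using the well-known public development-storage credentials, so a connection string for local testing can be built without a real Azure account. The devstoreaccount1 name/key and the blob port 10000 below are the standard published emulator defaults; double-check them against the Azurite docs before wiring this into CI.

```python
# Sketch: build a connection string pointing at a local Azurite emulator.
# The account name/key are the well-known development-storage defaults
# documented for the Azure storage emulator (they are public, not secrets).

AZURITE_ACCOUNT = "devstoreaccount1"
AZURITE_KEY = ("Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6"
               "tq/K1SZFPTOtr/KBHBeksoGMGw==")

def azurite_connection_string(host="127.0.0.1", port=10000):
    """Connection string for a local Azurite blob endpoint."""
    return (
        "DefaultEndpointsProtocol=http;"
        "AccountName={account};"
        "AccountKey={key};"
        "BlobEndpoint=http://{host}:{port}/{account};"
    ).format(account=AZURITE_ACCOUNT, key=AZURITE_KEY, host=host, port=port)

print(azurite_connection_string())
```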

@efiop mentioned this pull request on Jul 10, 2018