Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Azure Blob Storage remote #868

Merged
merged 1 commit into from Jul 9, 2018

Conversation

Projects
None yet
3 participants
@c-w
Copy link
Contributor

commented Jul 9, 2018

This change adds a new cloud remote based on Azure Blob Storage and the azure-storage-blob SDK.

The implementation closely follows the patterns laid out by the existing cloud remotes (AWS and GCP). Notable differences include:

  • No use of a lock in the progress-bar callback and no requirement to look up total transfer size ahead of time: the Azure Storage SDK provides the total transfer bytes and current transfer offset in its progress callback.

  • When exists is called, check the existence of every blob instead of listing the entire container. Container list operations can be expensive so unless the number of checked files is very large, making separate requests is likely going to be more efficient. Please let me know if a list-then-check implementation of the exists function would be preferred.

  • For increased flexibility, access credentials for the Azure Storage backend can be provided either connection-string style via the remote URL or via environment variables. The Azure Storage backend can be emulated via Azurite.

This change was created together with @EricSchles.

Add Azure Blob Storage remote
This change adds a new cloud remote based on Azure Blob Storage [1] and
the azure-storage-blob SDK [2].

The implementation closely follows the patterns laid out by the existing
cloud remotes (AWS and GCP). Notable differences include:

- No use of a lock in the progress-bar callback and no requirement to
  look up total transfer size ahead of time: the Azure Storage SDK
  provides the total transfer bytes and current transfer offset in its
  progress callback.

- When `exists` is called, check the existence of every blob instead of
  listing the entire container. Container list operations can be
  expensive so unless the number of checked files is very large, making
  separate requests is likely going to be more efficient.

- For increased flexibility, access credentials for the Azure Storage
  backend can be provided either connection-string style via the remote
  URL or via environment variables. The Azure Storage backend can be
  emulated via Azurite [3].

[1] https://azure.microsoft.com/en-us/services/storage/blobs/
[2] https://github.com/Azure/azure-storage-python
[3] https://github.com/Azure/azurite

This change was created together with @EricSchles.
@efiop

efiop approved these changes Jul 9, 2018

Copy link
Member

left a comment

Great patch! Thank you!

} for md5 in md5s]

def exists(self, path_infos):
# TODO: why does S3 fetch all blobs first?

This comment has been minimized.

Copy link
@efiop

efiop Jul 9, 2018

Member

It is simply much faster :) We mostly use exists() method when filtering a bulk of cache files to decide if we need to download/upload them and in s3 list_objects is much-much faster than trying to check keys one-by-one. If blob_service.exists() method is plenty fast as is, there is no reason to redo this part of code.

This comment has been minimized.

Copy link
@shcheklein

shcheklein Jul 9, 2018

Member

Let's put this as a comment instead of this TODO.

@efiop

This comment has been minimized.

Copy link
Member

commented Jul 9, 2018

Fixes #442 .

@efiop efiop merged commit 1bcd740 into iterative:master Jul 9, 2018

3 checks passed

codeclimate Approved by Ruslan Kuprieiev.
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@c-w c-w referenced this pull request Jul 9, 2018

Merged

Fix CodeClimate inspections #869

@efiop

This comment has been minimized.

Copy link
Member

commented Jul 9, 2018

Current exists() implementation is fine, thank you! I'll setup a test shortly and if it is too slow will change it to list-then-check logic.

@c-w @EricSchles Thank you guys so much for the amazing work! I will be sure to release new version this week, after I'll setup azure for testing.

@c-w

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2018

Thanks for the super quick merge, @efiop.

Would be excellent to know the results of the test for multiple-exists versus list-then-check, so please keep me in the loop :)

Do also let me know if you need any help setting up Blob Storage for testing, e.g. Azurite works well and I've used that with Docker on Travis on other OSS projects in the past.

@efiop

This comment has been minimized.

Copy link
Member

commented Jul 9, 2018

Would be excellent to know the results of the test for multiple-exists versus list-then-check, so please keep me in the loop :)

Sure :)

Do also let me know if you need any help setting up Blob Storage for testing, e.g. Azurite works well and I've used that with Docker on Travis on other OSS projects in the past.

Actually, I'm not very familiar with Azure and I would really appreciate if you could add azurite to our travis setup, so we could test your driver with it. Could you add it please? I will be happy to provide any help needed to assist with that. After test is up and running on CI, I will be able to instantaneously release new version of dvc with your driver included.

@efiop efiop referenced this pull request Jul 10, 2018

Closed

Azure cloud #442

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.