ADLS Gen 2 spout #41

Open
lukemarsden opened this issue Jul 13, 2021 · 3 comments

Comments

@lukemarsden
Contributor

lukemarsden commented Jul 13, 2021

It's increasingly obvious that ADLS Gen 2 is a must-have for Microsoft sales folks. We need a spout that copies ADLS Gen 2 data into Pachyderm as it comes in.

How we could do ADLS Gen 2 spouts into pachyderm-AML: get the Terraform to:

  1. set up an ADLS Gen 2 storage account (i.e. "enable hierarchical namespace when creating a storage account")
  2. create a Service Bus instance and a topic within it
  3. configure an ADLS Gen 2 event subscription to push the appropriate events to the Service Bus topic
  4. have Terraform output appropriate credentials for both of the above that we can pass into Pachyderm (have Terraform write them into a k8s secret? or can we get the Pachyderm spout to inherit an Azure service account somehow?)
  5. create a Pachyderm spout which reads from the queue and uses the notifications to go and download data from ADLS Gen 2 into Pachyderm (with caveats below)
  6. tada! demo dropping some JSON into ADLS Gen 2: the spout runs and a Pachyderm commit appears magically as an immutable AML dataset version that the data scientist can read. This is cool because ADLS Gen 2 doesn't itself support versioning
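
A rough sketch of the decision logic inside the spout from step 5, assuming the notifications arrive on the Service Bus topic in the Event Grid event schema. The event type names and the `url`, `sourceUrl`, and `destinationUrl` fields are assumptions based on that schema, and `Action`, `blob_path`, and `plan_action` are hypothetical names for illustration. Receiving from Service Bus, downloading from ADLS, and writing to Pachyderm are deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Event type names assumed from the Event Grid storage event schema.
CREATED = "Microsoft.Storage.BlobCreated"
DELETED = "Microsoft.Storage.BlobDeleted"
RENAMED = "Microsoft.Storage.BlobRenamed"


@dataclass(frozen=True)
class Action:
    """What the spout should do for one notification (hypothetical type)."""
    kind: str                          # "put", "delete", "rename", or "ignore"
    path: Optional[str] = None         # target path inside the Pachyderm repo
    source_url: Optional[str] = None   # where to download from, if anything
    old_path: Optional[str] = None     # previous path, for renames only


def blob_path(url: str) -> str:
    """Drop scheme, host, and container from a blob URL, keeping the file path."""
    parts = urlparse(url).path.lstrip("/").split("/", 1)
    return parts[1] if len(parts) == 2 else ""


def plan_action(message_body: str) -> Action:
    """Translate one JSON notification into an action for the spout."""
    event = json.loads(message_body)
    event_type = event.get("eventType")
    data = event.get("data", {})
    if event_type == CREATED:
        return Action("put", blob_path(data["url"]), data["url"])
    if event_type == DELETED:
        return Action("delete", blob_path(data["url"]))
    if event_type == RENAMED:
        return Action(
            "rename",
            blob_path(data["destinationUrl"]),
            data["destinationUrl"],
            blob_path(data["sourceUrl"]),
        )
    # Anything else (directory events, unknown types) is skipped for now.
    return Action("ignore")
```

Keeping this as a pure function means it can be unit tested without any Azure or Pachyderm credentials; the spout loop would call it once per message and then perform the download and commit.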

Making it production-grade will be hard. Considerations:

  • what happens when there's data in ADLS before the spout starts listening?
  • what happens when the spout gets disconnected?
  • what happens when there are terabytes of data and billions of files?
  • what happens if there are conflicting changes on both sides?
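
One possible way to cover the first two considerations is a reconciliation pass that runs at startup and after a reconnect, diffing a listing of ADLS against what Pachyderm already holds. This is only a sketch: `reconcile` is a hypothetical helper, and it assumes we can obtain a change marker (such as an ETag) per path on both sides, which would mean recording it somewhere when a file is copied in.

```python
from typing import Dict, List, Tuple


def reconcile(
    adls: Dict[str, str], pachyderm: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (paths to copy into Pachyderm, paths to delete from Pachyderm).

    Both arguments map a file path to a change marker such as an ETag.
    ADLS is treated as the source of truth.
    """
    # New files, plus files whose marker differs from what was last copied.
    to_put = sorted(p for p, tag in adls.items() if pachyderm.get(p) != tag)
    # Files that exist in Pachyderm but are gone from ADLS.
    to_delete = sorted(p for p in pachyderm if p not in adls)
    return to_put, to_delete
```

This does not address the other two points: a full listing gets expensive at billions of files, and treating ADLS as the source of truth sidesteps conflicting changes rather than resolving them.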
@albscui
Contributor

albscui commented Jul 15, 2021

Caveat from Andrei: ADLS Gen 2 events only cover blob creation, rename, and deletion, not modification.

@JoeyZwicker
Member

> Caveat from Andrei: ADLS Gen 2 events only cover blob creation, rename, and deletion, not modification.

Is this a feature that we want ADLS to add? I know the VP of Product for ADLS and can try to get it escalated. Are there any more specifics about the feature I might need, so I can give him the exact info we're looking for?

@albscui
Contributor

albscui commented Sep 8, 2021

> > Caveat from Andrei: ADLS Gen 2 events only cover blob creation, rename, and deletion, not modification.
>
> Is this a feature that we want ADLS to add? I know the VP of Product for ADLS and can try to get it escalated. Are there any more specifics about the feature I might need, so I can give him the exact info we're looking for?

Events for blob/file modification would be useful for versioning specific objects. However, I don't think we need to escalate this until we have a concrete solution for consuming these events. So let's wait until we start building this out?
