Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Azure DataStore] Handle upload strings vs bytes and filepath formation when using adlfs #1159

Merged
merged 11 commits into from Aug 5, 2021

Conversation

hayesgb
Copy link
Contributor

@hayesgb hayesgb commented Jul 28, 2021

  • When uploading strings to as model artifact attributes to abfs using put method and adlfs, operations were failing. Added ability to alter write method based on incoming data
  • get, listdir, stat filepath handling
  • Validated performance with private integration testing against adlfs

…and more robust filepath creation for listdir and get methods if key has leading or trailing delimiters
Copy link
Contributor

@Hedingber Hedingber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hayesgb one minor comment
Also, if it were failing, maybe worth adding a test so that future changes won't mistakenly break it again

mlrun/datastore/azure_blob.py Outdated Show resolved Hide resolved
@hayesgb
Copy link
Contributor Author

hayesgb commented Jul 31, 2021

@Hedingber -- I've been investigating why the tests did not catch this. It turns out there doesn't appear to be an obvious way to close a DataStore once its been created. This causes the initial DataStore created during unit testing to be persisted, and gets used for all of the authentication methods.

I'll make additional updates to this PR to see if I can resolve that issue.

@Hedingber
Copy link
Contributor

@hayesgb
@theSaarco identified this problem as well, and he fixed it in #1149
Sorry for not letting you know, his PR will likely be merged on Monday, I'll keep you updated

…ed to allow separation of tests with different auth methods. Also added separate test containers based on auth method. Refactored azure_blob.py to pass tests
@hayesgb
Copy link
Contributor Author

hayesgb commented Aug 1, 2021

Thanks @Hedingber

From what I can tell, once a data_item is created, the StoreManager creates a DataStore for that container, and caches the credentials for future use. Even if the data_item is deleted, the StoreManager does not remove them. My proposed (short-term) solution is to create a specific Azure Container for each of the individual authentication methods, and delete the environmental variables between tests.

Additionally, I've parametrized the unit test, so each authentication method gets tested separately, rather than having them run together as a single test. This allowed me to find some bugs in the listdir() dataitem method.

While investigating this, I realized there's not a close() method for either a DataItem or a DataStore, which creates the potential for memory leaks. I'll file a separate issue on this.

@Hedingber
Copy link
Contributor

@hayesgb part of @theSaarco 's PR is also parametrizing the tests, I will highly suggest you to wait for his PR to be merged before proceeding here

@theSaarco
Copy link
Member

@hayesgb: as @Hedingber mentioned - I noticed the same issue you're seeing. I have fixed it in #1149 for tests by modifying the basic test fixture to always clean up between tests - in #1149 I'm clearing up all the datastores from the store_manager, which forces it to re-create the datastore when it's needed, then applying any configuration available at the time of creation. However, this means messing with private members of the store_manager and is not what we'd want the user to do.
Question is - do you consider this action to be something needed in "real life" and not just in tests? Usually once you've configured your connection parameters, you won't change them in the same session (and you can always restart your ipython kernel or similar means to reset). We can add a reset() api to the store_manager that will do the same, but it doesn't seem to be a common practice and therefore it may cause more damage than add value. WDYT?

@hayesgb
Copy link
Contributor Author

hayesgb commented Aug 2, 2021

Thanks @theSaarco

Regarding the question of whether it can actually happen, there are two scenarios that come to mind.

  1. If Access Control Lists are being leveraged within a container, within the same K8 pod (or step in a Kubeflow Pipeline)
  2. If read and write operations occur with different credentials, with the same Azure container. (For example, I need one SPN to read source data in my container, but need a different SPN to write them back). This is actually pretty common, and requires workarounds b/c you can't currently explicitly specify storage_options in a mlrun dataitem (They must be read from env_vars).

@hayesgb
Copy link
Contributor Author

hayesgb commented Aug 2, 2021

@Hedingber -- Updated this PR to make use of the revised unit test from #1149.

@hayesgb hayesgb requested a review from Hedingber August 3, 2021 16:48
Copy link
Contributor

@Hedingber Hedingber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
@theSaarco WDYT ?

Copy link
Member

@theSaarco theSaarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hayesgb - Looks good. One small fix that I'd like implemented, but I'm not sure about the actual support in adlfs - please take a look.

mlrun/datastore/azure_blob.py Outdated Show resolved Hide resolved
Co-authored-by: Saar Cohen <66667568+theSaarco@users.noreply.github.com>
Copy link
Member

@theSaarco theSaarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Approved.

@Hedingber Hedingber merged commit e786753 into mlrun:development Aug 5, 2021
@Hedingber
Copy link
Contributor

Hedingber commented Aug 5, 2021

@hayesgb PR was merged, thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants