[Azure DataStore] Handle upload strings vs bytes and filepath formation when using adlfs #1159

hayesgb · 2021-07-28T23:10:02Z

Uploading strings as model artifact attributes to abfs with the put method and adlfs was failing. This PR adds the ability to choose the write method based on the type of the incoming data (string vs. bytes).
It also fixes filepath handling in get, listdir, and stat.
Validated with private integration testing against adlfs.
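
As a rough illustration of the two fixes described above, the sketch below picks a file mode from the payload type and builds a path that tolerates leading or trailing delimiters in the key. The helper names `write_mode` and `remote_path` are hypothetical and are not taken from `mlrun/datastore/azure_blob.py`; this shows the general idea under those assumptions, not the code in this PR.

```python
from pathlib import PurePosixPath
from typing import Union


def write_mode(data: Union[str, bytes]) -> str:
    """Return the file mode matching the payload type (hypothetical helper)."""
    if isinstance(data, bytes):
        return "wb"  # binary payloads need a binary-mode file handle
    if isinstance(data, str):
        return "w"  # text payloads need a text-mode file handle
    raise TypeError(f"unsupported data type for upload: {type(data).__name__}")


def remote_path(container: str, key: str) -> str:
    """Join container and key, ignoring leading/trailing '/' on either part."""
    parts = [p for p in (container.strip("/"), key.strip("/")) if p]
    # PurePosixPath always joins with forward slashes, regardless of the local OS.
    return str(PurePosixPath(*parts)) if parts else ""


print(write_mode("model summary"), write_mode(b"\x00\x01"))
print(remote_path("container", "/artifacts/model.pkl/"))
```

With a filesystem object that exposes an `open(path, mode)` method, the two helpers would typically be combined as `fs.open(remote_path(container, key), write_mode(data))` before calling `write(data)` on the returned handle.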

Hedingber

@hayesgb one minor comment
Also, since this was failing, it may be worth adding a test so that future changes won't accidentally break it again

mlrun/datastore/azure_blob.py

hayesgb · 2021-07-31T11:48:19Z

@Hedingber -- I've been investigating why the tests did not catch this. It turns out there doesn't appear to be an obvious way to close a DataStore once it's been created. As a result, the first DataStore created during unit testing persists and is reused for all of the authentication methods.

I'll make additional updates to this PR to see if I can resolve that issue.

Hedingber · 2021-07-31T22:34:21Z

@hayesgb
@theSaarco identified this problem as well, and he fixed it in #1149
Sorry for not letting you know. His PR will likely be merged on Monday; I'll keep you updated

hayesgb · 2021-08-01T00:56:18Z

Thanks @Hedingber

From what I can tell, once a data_item is created, the StoreManager creates a DataStore for that container and caches the credentials for future use. Even if the data_item is deleted, the StoreManager does not remove them. My proposed (short-term) solution is to create a specific Azure container for each of the individual authentication methods, and delete the environment variables between tests.

Additionally, I've parametrized the unit test, so each authentication method gets tested separately, rather than having them run together as a single test. This allowed me to find some bugs in the listdir() dataitem method.

While investigating this, I realized there's not a close() method for either a DataItem or a DataStore, which creates the potential for memory leaks. I'll file a separate issue on this.

Hedingber · 2021-08-01T16:45:41Z

@hayesgb part of @theSaarco's PR also parametrizes the tests, so I'd strongly suggest waiting for his PR to be merged before proceeding here

theSaarco · 2021-08-02T07:17:35Z

@hayesgb: as @Hedingber mentioned - I noticed the same issue you're seeing. I have fixed it in #1149 for tests by modifying the basic test fixture to always clean up between tests - in #1149 I'm clearing up all the datastores from the store_manager, which forces it to re-create the datastore when it's needed, then applying any configuration available at the time of creation. However, this means messing with private members of the store_manager and is not what we'd want the user to do.
Question is - do you consider this action to be something needed in "real life" and not just in tests? Usually once you've configured your connection parameters, you won't change them in the same session (and you can always restart your ipython kernel or similar means to reset). We can add a reset() api to the store_manager that will do the same, but it doesn't seem to be a common practice and therefore it may cause more damage than add value. WDYT?
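
The caching behaviour under discussion can be pictured with a toy model. The sketch below is not mlrun's StoreManager: the class names, the `TOY_STORAGE_CREDENTIAL` environment variable, and the `reset()` method are all hypothetical, and it only illustrates why credentials read at creation time stay in effect until the cache is cleared.

```python
import os
from typing import Dict, Optional


class ToyDataStore:
    """Stand-in for a datastore; it captures credentials when constructed."""

    def __init__(self, container: str) -> None:
        self.container = container
        # The credential is read once, at creation time, and never refreshed.
        self.credential: Optional[str] = os.environ.get("TOY_STORAGE_CREDENTIAL")


class ToyStoreManager:
    """Stand-in for a manager that caches one datastore per container."""

    def __init__(self) -> None:
        self._stores: Dict[str, ToyDataStore] = {}

    def get_store(self, container: str) -> ToyDataStore:
        # Reuse the cached store if one exists, whatever the env says now.
        if container not in self._stores:
            self._stores[container] = ToyDataStore(container)
        return self._stores[container]

    def reset(self) -> None:
        # Hypothetical public reset: drop every cached store so the next
        # get_store() call rebuilds it from the current configuration.
        self._stores.clear()


manager = ToyStoreManager()

os.environ["TOY_STORAGE_CREDENTIAL"] = "reader-spn"
first = manager.get_store("data")

os.environ["TOY_STORAGE_CREDENTIAL"] = "writer-spn"
stale = manager.get_store("data")  # cached: still built with "reader-spn"

manager.reset()
fresh = manager.get_store("data")  # rebuilt: picks up "writer-spn"
```

In this model, switching credentials mid-session has no effect until `reset()` is called, which mirrors the test-fixture cleanup described above.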

hayesgb · 2021-08-02T10:40:34Z

Thanks @theSaarco

Regarding the question of whether it can actually happen, there are two scenarios that come to mind.

If Access Control Lists are being leveraged within a container, different credentials may be needed within the same K8s pod (or step in a Kubeflow Pipeline).
If read and write operations occur with different credentials against the same Azure container. (For example, I need one SPN to read source data in my container, but need a different SPN to write it back.) This is actually pretty common, and requires workarounds b/c you can't currently specify storage_options explicitly in an mlrun dataitem (they must be read from env_vars).

hayesgb · 2021-08-02T11:02:48Z

@Hedingber -- Updated this PR to make use of the revised unit test from #1149.

Hedingber

LGTM
@theSaarco WDYT ?

theSaarco

@hayesgb - Looks good. One small fix that I'd like implemented, but I'm not sure about the actual support in adlfs - please take a look.

mlrun/datastore/azure_blob.py

theSaarco

Looks good to me. Approved.

Hedingber · 2021-08-05T15:44:50Z

@hayesgb PR was merged, thanks a lot!

hayesgb added 3 commits July 28, 2021 17:24

Fix for put with adlfs and differentiating between string and bytes, and more robust filepath creation for listdir and get methods if key has leading or trailing delimiters

22d0cb7

Use posix_path to create path string for stat operation and adlfs

6173bc8

Linting

3636c22

Hedingber suggested changes Jul 29, 2021

mlrun/datastore/azure_blob.py (outdated, resolved)

hayesgb added 3 commits July 31, 2021 06:49

Moves logic for creating remote_path for adlfs authentication into a private class method

7ef7934

Create self.bsc when class is instantiated

415c600

Converted self.bsc operations to occur within a context manager. This avoids the potential for dangling open connections

c8fb27c

hayesgb added 2 commits July 31, 2021 19:44

Refactored test_azure_blob.py to use pytest fixtures, and parameterized to allow separation of tests with different auth methods. Also added separate test containers based on auth method. Refactored azure_blob.py to pass tests

270c4f1

Lint

1da71e3

hayesgb added 2 commits August 2, 2021 05:59

Merged test_azure_blob.py from development and update to error handling in put method for append=True from mlrun/datastore/azure_blob.py

3a3e14f

Merge branch 'development' into fix_azure_get

fa83b7d

hayesgb requested a review from Hedingber August 3, 2021 16:48

Hedingber approved these changes Aug 4, 2021

Hedingber requested a review from theSaarco August 4, 2021 01:52

theSaarco reviewed Aug 4, 2021

mlrun/datastore/azure_blob.py (outdated, resolved)

Update mlrun/datastore/azure_blob.py

57f2ea8

Co-authored-by: Saar Cohen <66667568+theSaarco@users.noreply.github.com>

theSaarco approved these changes Aug 5, 2021

Hedingber merged commit e786753 into mlrun:development Aug 5, 2021
