
[Datastore] Support writing partitioned parquet data #898

Merged: 41 commits into mlrun:development, May 20, 2021

Conversation

@gtopper (Collaborator) commented Apr 28, 2021

No description provided.

@dinal (Contributor) left a comment:

Did you double-check that it's still working with Spark?

Resolved thread: mlrun/datastore/base.py
@urihoenig (Contributor) left a comment:

In addition to what I wrote, I think all the added parameters should be specific to ParquetTarget. I'd move as much logic as possible to storey to minimize the coupling.

Resolved threads: mlrun/datastore/base.py (2), mlrun/datastore/targets.py (4)
else ""
)

_legal_time_units = ["year", "month", "day", "hour", "minute", "second"]
A contributor commented:

Should be imported from storey, no?

@gtopper (Collaborator, author) replied:

storey doesn't have this list. I'm not sure that putting it in storey just for the sake of importing it is worthwhile: it makes some logical sense, but it creates another unnecessary sync point between the projects.

@gtopper (Collaborator, author) replied:

Added the list in mlrun/storey#219, but mlrun tests fail on the import from storey.
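For context, these time units drive Hive-style partition paths when writing partitioned parquet. A minimal sketch of the idea, using a hypothetical helper (illustrative only, not mlrun's actual implementation in targets.py):

    from datetime import datetime

    _legal_time_units = ["year", "month", "day", "hour", "minute", "second"]

    def partition_path(base_path: str, timestamp: datetime, time_unit: str) -> str:
        # Hypothetical helper: build a Hive-style partition path down to
        # the requested time unit, e.g. .../year=2021/month=05/day=20.
        if time_unit not in _legal_time_units:
            raise ValueError(f"illegal time unit: {time_unit}")
        units = _legal_time_units[: _legal_time_units.index(time_unit) + 1]
        parts = [f"{unit}={getattr(timestamp, unit):02}" for unit in units]
        return "/".join([base_path.rstrip("/")] + parts)

    # partition_path("v3io:///bigdata/table", datetime(2021, 5, 20, 13), "hour")
    # -> "v3io:///bigdata/table/year=2021/month=05/day=20/hour=13"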

Resolved thread: mlrun/datastore/targets.py
@gtopper closed this May 18, 2021
@gtopper reopened this May 18, 2021
Resolved thread (outdated): mlrun/datastore/targets.py
Review thread on lines 142 to 147 (excerpt):

    return reader(fs.open(url), **kwargs)
    return reader(url, **kwargs)
A contributor commented:

Are you sure this will also work if the path is v3io, for example? Even if pandas is fsspec-aware, I'm not sure it will know how to initialize the v3iofs instance (which we're doing above). Also, note that the df_module you're using doesn't have to be pandas; it can be provided from outside (I'm assuming it's used for Spark DataFrames). Lastly, if it does work like this, aren't the get_filesystem call and the if unnecessary?

@gtopper (Collaborator, author) replied:

The get_filesystem check is still needed to retain the existing semantics (the handling differs based on whether the url has a scheme or not). Now that I look at the implementation, though, I do think the call should be self.get_filesystem(silent=False).

@gtopper (Collaborator, author) replied:

The test covers writing to v3io, so it does work. The check is needed because you can't call open on a directory.

@gtopper (Collaborator, author) replied:

I looked into it: the if and the silent=True are actually there so that v3io can fall back on simple HTTP, which I guess would only work for a single file.
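Summarizing the thread, a minimal sketch of the branching being discussed; the directory check is an assumption based on the comments above, not mlrun's exact code:

    def load_df(self, url, reader, **kwargs):
        # Sketch only: `reader` is e.g. pandas.read_parquet (or the
        # equivalent from another df_module).
        fs = self.get_filesystem(silent=True)
        if fs:
            if fs.isdir(url):
                # Partitioned parquet is a directory of files; a directory
                # can't be fs.open()-ed, so pass the URL to the reader and
                # let it resolve the files via fsspec.
                return reader(url, **kwargs)
            return reader(fs.open(url), **kwargs)
        # No filesystem instance: fall back to a plain URL (e.g. simple
        # HTTP), which only works for a single file.
        return reader(url, **kwargs)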

@gtopper (Collaborator, author) replied:

Adding propagation of storage options.

@gtopper (Collaborator, author) replied:

Turns out this is only supported in pandas >= 1.2.
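For reference: pandas added the storage_options parameter to its I/O functions in version 1.2, so propagation has to be gated on the installed version. A hedged sketch (the function name is illustrative):

    import pandas as pd
    from packaging.version import Version

    def read_parquet_with_options(url, storage_options=None, **kwargs):
        # pandas.read_parquet accepts storage_options only from 1.2 onward;
        # forward it only when the installed pandas supports it.
        if storage_options and Version(pd.__version__) >= Version("1.2"):
            kwargs["storage_options"] = storage_options
        return pd.read_parquet(url, **kwargs)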

Resolved threads (outdated): mlrun/datastore/targets.py (3)
@gtopper closed this May 19, 2021
@gtopper reopened this May 19, 2021
@Hedingber changed the title from "Support writing partitioned parquet data." to "[Datastore] Support writing partitioned parquet data" May 20, 2021
@Hedingber merged commit 8d1ea44 into mlrun:development May 20, 2021
@gtopper pushed a commit to gtopper/mlrun that referenced this pull request May 20, 2021:
"This test was added in mlrun#898 and broken by the concurrent mlrun#934."
@gtopper mentioned this pull request May 20, 2021
4 participants