[Datastore] Support writing partitioned parquet data #898
Conversation
_drop_reserved_columns -> drop_reserved_columns
…ed-pq # Conflicts: # mlrun/datastore/targets.py
did you double check it's still working with spark?
In addition to what I wrote, I think that all the added parameters should be specific to ParquetTarget. I'd move as much logic as I can to storey to minimize the coupling.
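A minimal sketch of the suggestion above (class shapes are assumptions for illustration, not mlrun's actual definitions): partitioning parameters live on `ParquetTarget` only, so other target types stay unaware of them.

```python
# Hypothetical sketch: keep partitioning parameters specific to the
# parquet target rather than on the shared base class.
class BaseTarget:
    def __init__(self, path=None):
        self.path = path


class ParquetTarget(BaseTarget):
    def __init__(self, path=None, partition_cols=None,
                 time_partitioning_granularity=None):
        super().__init__(path)
        # parquet-only options; the heavy lifting would live in storey
        self.partition_cols = partition_cols
        self.time_partitioning_granularity = time_partitioning_granularity
```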
```python
    else ""
)

_legal_time_units = ["year", "month", "day", "hour", "minute", "second"]
```
Should be imported from storey, no?
storey doesn't have this list. I'm not sure putting it in storey just for the sake of importing it is so great. I mean, it makes some logical sense, but creates another sync point between the projects unnecessarily.
Added the list in mlrun/storey#219, but mlrun tests fail on the import from storey.
mlrun/datastore/base.py (Outdated)

```python
return reader(fs.open(url), **kwargs)
return reader(url, **kwargs)
```
Are you sure this will also work if the path is v3io, for example? Even if pandas is fsspec-aware, I'm not sure it will know how to initialize the v3iofs instance (which we're doing above). Also, note that the `df_module` you're using doesn't have to be pandas; it can be provided from outside, and I'm assuming it's used for a Spark df. Lastly, if it is working like this, the `get_filesystem` and the `if` are not needed, no?
The `if get_filesystem` part is still needed to retain the existing semantics (there is different handling based on whether the url has a scheme or not). I do think that the call should be `self.get_filesystem(silent=False)` though, now that I look at the implementation.
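A hedged sketch of the branching being discussed (the helper name is mine, not mlrun's): urls carrying a scheme are opened through an fsspec filesystem, while plain local paths go straight to the reader.

```python
from urllib.parse import urlparse


def read_with_optional_fs(reader, url, **kwargs):
    # urls with a scheme (v3io://, s3://, ...) need a filesystem instance;
    # scheme-less local paths can be handed to the reader directly.
    scheme = urlparse(url).scheme
    if scheme:
        import fsspec  # only required when a scheme is present

        fs = fsspec.filesystem(scheme)
        # note: fs.open works for a single file, not for a directory
        return reader(fs.open(url), **kwargs)
    return reader(url, **kwargs)
```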
The test tests writing to v3io, so it does work. It's needed because you can't call `open` on a directory.
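A minimal demonstration of that constraint: a partitioned parquet dataset is a directory tree, and a plain `open()` on a directory fails (`IsADirectoryError` on POSIX, typically `PermissionError` on Windows).

```python
import tempfile

# stands in for the root of a partitioned parquet dataset
dataset_root = tempfile.mkdtemp()
try:
    open(dataset_root)  # a directory cannot be opened like a file
    raised = None
except OSError as e:
    raised = type(e).__name__
print(raised)
```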
So I looked into it, and the `if` and `silent=True` thing is actually there so that v3io can fall back on simple HTTP, which I guess would only work for a single file.
Adding propagation of storage options.
Turns out this is only supported in pandas >= 1.2.
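A sketch of the version gate this implies (helper name is mine): the `storage_options` keyword was only added to the pandas I/O readers in pandas 1.2, so older versions must not receive it.

```python
def supports_storage_options(pandas_version):
    """True if this pandas version accepts storage_options (>= 1.2)."""
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    return (major, minor) >= (1, 2)
```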
No description provided.