
[Datastore] Support writing partitioned parquet data #898

Merged: 41 commits into mlrun:development, May 20, 2021

Conversation

@gtopper (Collaborator) commented Apr 28, 2021

No description provided.

@dinal (Contributor) left a comment:

Did you double-check that it's still working with Spark?

Resolved thread: mlrun/datastore/base.py
@urihoenig (Contributor) left a comment:

In addition to what I wrote, I think all the added parameters should be specific to ParquetTarget. I'd move as much logic as possible to storey to minimize the coupling.

Resolved threads: mlrun/datastore/base.py (2), mlrun/datastore/targets.py (4)
else ""
)

_legal_time_units = ["year", "month", "day", "hour", "minute", "second"]
A contributor commented:

Should be imported from storey, no?

@gtopper (Collaborator, author) replied:

storey doesn't have this list. I'm not sure that putting it in storey just for the sake of importing it is worthwhile: it makes some logical sense, but it creates another unnecessary sync point between the projects.

@gtopper (Collaborator, author) replied:

Added the list in mlrun/storey#219, but mlrun tests fail on the import from storey.
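For context, these time units drive Hive-style partition paths when writing partitioned parquet. A minimal sketch of the idea, using a hypothetical helper (illustrative only, not mlrun's actual implementation in targets.py):

    from datetime import datetime

    _legal_time_units = ["year", "month", "day", "hour", "minute", "second"]

    def partition_path(base_path: str, timestamp: datetime, time_unit: str) -> str:
        # Hypothetical helper: build a Hive-style partition path down to
        # the requested time unit, e.g. .../year=2021/month=05/day=20.
        if time_unit not in _legal_time_units:
            raise ValueError(f"illegal time unit: {time_unit}")
        units = _legal_time_units[: _legal_time_units.index(time_unit) + 1]
        parts = [f"{unit}={getattr(timestamp, unit):02}" for unit in units]
        return "/".join([base_path.rstrip("/")] + parts)

    # partition_path("v3io:///bigdata/table", datetime(2021, 5, 20, 13), "hour")
    # -> "v3io:///bigdata/table/year=2021/month=05/day=20/hour=13"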

Resolved thread: mlrun/datastore/targets.py
@gtopper closed this May 18, 2021
@gtopper reopened this May 18, 2021
Resolved thread (outdated): mlrun/datastore/targets.py
Review thread on lines 142 to 147 (excerpt):

    return reader(fs.open(url), **kwargs)
    return reader(url, **kwargs)
A contributor commented:

Are you sure this will also work if the path is v3io, for example? Even if pandas is fsspec-aware, I'm not sure it will know how to initialize the v3iofs instance (which we're doing above). Also, note that the df_module you're using doesn't have to be pandas; it can be provided from outside (I'm assuming it's used for Spark DataFrames). Lastly, if it does work like this, aren't the get_filesystem call and the if unnecessary?

@gtopper (Collaborator, author) replied:

The get_filesystem check is still needed to retain the existing semantics (the handling differs based on whether the url has a scheme or not). Now that I look at the implementation, though, I do think the call should be self.get_filesystem(silent=False).

@gtopper (Collaborator, author) replied:

The test covers writing to v3io, so it does work. The check is needed because you can't call open on a directory.

@gtopper (Collaborator, author) replied:

I looked into it: the if and the silent=True are actually there so that v3io can fall back on simple HTTP, which I guess would only work for a single file.
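Summarizing the thread, a minimal sketch of the branching being discussed; the directory check is an assumption based on the comments above, not mlrun's exact code:

    def load_df(self, url, reader, **kwargs):
        # Sketch only: `reader` is e.g. pandas.read_parquet (or the
        # equivalent from another df_module).
        fs = self.get_filesystem(silent=True)
        if fs:
            if fs.isdir(url):
                # Partitioned parquet is a directory of files; a directory
                # can't be fs.open()-ed, so pass the URL to the reader and
                # let it resolve the files via fsspec.
                return reader(url, **kwargs)
            return reader(fs.open(url), **kwargs)
        # No filesystem instance: fall back to a plain URL (e.g. simple
        # HTTP), which only works for a single file.
        return reader(url, **kwargs)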

@gtopper (Collaborator, author) replied:

Adding propagation of storage options.

@gtopper (Collaborator, author) replied:

Turns out this is only supported in pandas >= 1.2.
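For reference: pandas added the storage_options parameter to its I/O functions in version 1.2, so propagation has to be gated on the installed version. A hedged sketch (the function name is illustrative):

    import pandas as pd
    from packaging.version import Version

    def read_parquet_with_options(url, storage_options=None, **kwargs):
        # pandas.read_parquet accepts storage_options only from 1.2 onward;
        # forward it only when the installed pandas supports it.
        if storage_options and Version(pd.__version__) >= Version("1.2"):
            kwargs["storage_options"] = storage_options
        return pd.read_parquet(url, **kwargs)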

Resolved threads (outdated): mlrun/datastore/targets.py (3)
@gtopper closed this May 19, 2021
@gtopper reopened this May 19, 2021
@Hedingber changed the title from "Support writing partitioned parquet data." to "[Datastore] Support writing partitioned parquet data" May 20, 2021
@Hedingber merged commit 8d1ea44 into mlrun:development May 20, 2021
@gtopper pushed a commit to gtopper/mlrun that referenced this pull request May 20, 2021:
"This test was added in mlrun#898 and broken by the concurrent mlrun#934."
@gtopper mentioned this pull request May 20, 2021
4 participants