Use Parquet instead of Sequence files for entities #237

darabos · 2022-05-24T12:40:48Z

This is a major change that we made within a private project. It went through extensive code review there, and I'm just upstreaming it now. It's great. It simplifies the code, fixes performance issues on GCS, and makes LynxKite easier to debug and easier to integrate with external graph systems. It's also faster, by 10-20% on some benchmarks.

But it's an incompatible change for existing data, so this goes into LynxKite 5.0. Migration is simple: you just delete the $KITE_DATA/partitioned directory. Everything will be recomputed, so it may be slow, but at least it's easy.
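
The migration step can be scripted. Here is a minimal sketch in Python, assuming $KITE_DATA resolves to a plain local directory. For GCS or HDFS the storage system's own tooling would be needed instead. The helper name `remove_partitioned` is hypothetical and not part of LynxKite.

```python
import shutil
from pathlib import Path


def remove_partitioned(kite_data: str) -> bool:
    """Delete <kite_data>/partitioned and report whether anything was removed.

    Hypothetical helper, not part of LynxKite. Assumes a local filesystem path.
    """
    target = Path(kite_data) / "partitioned"
    if not target.is_dir():
        # Nothing to migrate: the directory is absent (or was already deleted).
        return False
    shutil.rmtree(target)
    return True


# Example call (assumes the KITE_DATA environment variable is set):
#   import os
#   remove_partitioned(os.environ["KITE_DATA"])
```

Only the `partitioned` subdirectory is touched. Other directories under the data root are left alone, and LynxKite recomputes the deleted entities on demand.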

Copying one review comment and my response from the original PR:

I'm really glad to see writeAttributes go! But what change actually made that possible?

I think when we originally wrote that code, it was possible for the DataFrame to be coming from anything. It could be a CSV, or a JDBC source. In that case it was important to do a single pass for performance / consistency. With the "new" (LynxKite 2.0) import procedure the table is always backed by a Parquet file.

The original code was very fancy to avoid reading the input multiple times. This was a concern, for example with large CSVs. But now the input is always a Table, which is backed directly by a Parquet file in LynxKite storage. Parquet is column-oriented, so we can just read the columns one by one.
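
To illustrate the reasoning, here is a toy sketch in plain Python. It is not LynxKite code and does not read real Parquet. With a row-oriented source such as CSV, extracting each attribute separately would mean re-scanning the whole input once per column, which is why a single pass mattered. With a column-oriented layout, each column can be fetched on its own. The names `columns_single_pass` and `ColumnStore` are made up for the example.

```python
import csv
import io


def columns_single_pass(csv_text: str) -> dict[str, list[str]]:
    """Row-oriented input: build every column in one scan of the rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: dict[str, list[str]] = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


class ColumnStore:
    """Toy column-oriented layout: each column is stored and read separately."""

    def __init__(self, columns: dict[str, list[str]]):
        self._columns = columns
        self.reads = 0  # number of individual column reads so far

    def read_column(self, name: str) -> list[str]:
        self.reads += 1
        return self._columns[name]


csv_text = "id,name,age\n1,Adam,20\n2,Eve,18\n"
store = ColumnStore(columns_single_pass(csv_text))
# Attributes are produced one column at a time, touching only that column.
attributes = {name: store.read_column(name) for name in ("name", "age")}
```

In the toy store, reading two attributes costs two column reads and never touches the `id` column. That is the property that makes the simpler column-by-column code acceptable once the input is always Parquet-backed.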

darabos added 14 commits May 24, 2022 14:22

Store entity data in Parquet instead of SequenceFiles.

7efdc60

kiterc_template is generated from this file.

05bdc10

Fix issue with iterator.

d89bbbc

Maybe more efficient row counting.

1af80f9

Do counting on RDD.

6107b8f

Rewrite copyAndRepartition.

82fb0f6

Update test, fix revealed bug.

b425949

Reconstruct original partitioning during load.

64e1b6d

Drop nulls.

1da8862

Partition-preserving Parquet data source.

947e886

Switch EntityIOTest reference data to Parquet.

1e7e444

Do the corruption hack another way. Random directories are tolerated now.

9d1b2d1

Fix ID assignment with missing values.

7100dbc

darabos changed the base branch from main to darabos-upstreaming May 24, 2022 12:40

darabos added 5 commits May 24, 2022 14:43

Update CHANGELOG for Parquet switch.

b1356e6

Remove KITE_SCRIPT_LOGS setting. I've never even heard about it.

50fd43a

Comment about TableToAttributes and randomNumbered.

a0f65ea

Bit simpler null filtering.

a54ed5f

Handle special column names in TableToAttributes.

8c14e11

darabos force-pushed the darabos-parquet branch from d9611d7 to 8c14e11 May 24, 2022 17:53

darabos merged commit bcadfe1 into darabos-upstreaming May 25, 2022

darabos deleted the darabos-parquet branch May 25, 2022 06:02
