Use Parquet instead of Sequence files for entities #237

Merged: 19 commits, May 25, 2022

@darabos commented May 24, 2022

This is a major change that we made within a private project. It had extensive code review there, and I'm just upstreaming it now. It's great. It simplifies the code, it fixes performance issues on GCS, it makes LynxKite easier to debug, and easier to integrate with external graph systems. It's faster too, by 10-20% on some benchmarks.

It is an incompatible change for existing data, though, so it goes into LynxKite 5.0. Migration is simple: you just delete the $KITE_DATA/partitioned directory. Everything will be recomputed, which may be slow, but at least it's easy.
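
For anyone who wants to script the migration, here is a minimal sketch of the deletion step. It assumes `$KITE_DATA` points at a local filesystem path; if your data directory lives on HDFS or GCS you would use that system's own tooling instead. The function name `remove_partitioned_dir` and the `dry_run` flag are made up for this illustration and are not part of LynxKite.

```python
import os
import shutil
from pathlib import Path


def remove_partitioned_dir(kite_data: str, dry_run: bool = True) -> bool:
    """Delete <kite_data>/partitioned. Returns True if the directory existed.

    Hypothetical helper: assumes kite_data is a local filesystem path.
    """
    target = Path(kite_data) / "partitioned"
    if not target.is_dir():
        # Nothing to migrate.
        return False
    if dry_run:
        # Only report what would be removed.
        print(f"Would delete {target}")
    else:
        # Remove the directory tree; LynxKite recomputes the entities later.
        shutil.rmtree(target)
        print(f"Deleted {target}")
    return True
```

With LynxKite stopped, something like `remove_partitioned_dir(os.environ["KITE_DATA"], dry_run=False)` would perform the deletion. The dry run is the default so that a mistyped path does not remove anything by accident.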

Copying one review comment and my response from the original PR:

> I really like to see writeAttributes go! But what change has really made that possible?

I think when we originally wrote that code, the DataFrame could be coming from anything: a CSV file or a JDBC source, for example. In that case it was important to do a single pass, for both performance and consistency. With the "new" (LynxKite 2.0) import procedure, the table is always backed by a Parquet file, so that constraint no longer applies.

@darabos changed the base branch from main to darabos-upstreaming on May 24, 2022, 12:40
@darabos merged commit bcadfe1 into darabos-upstreaming on May 25, 2022
@darabos deleted the darabos-parquet branch on May 25, 2022, 06:02