Use Parquet format instead of CSV for data consumption #17

Merged
prrao87 merged 12 commits into main from parquet on Aug 18, 2023

Conversation

prrao87 (Owner) commented Aug 18, 2023

Closes #4.

This PR generates parquet data files instead of CSVs, so that we can ingest parquet data into the Neo4j and Kùzu graphs. Parquet files are much smaller and carry their schema as part of the data.

  • The upstream Kùzu parquet reader has been fixed, so we can now use parquet for all data reads when building the graph.
  • An added benefit of parquet is that pl.read_parquet is much less verbose than pl.read_csv, because we don't have to specify separators or other schema information.

To do

  • Update the doc sections that mention CSV so that they refer to parquet instead

prrao87 (Owner, Author) commented Aug 18, 2023

@andyfengHKU and @ray6080, I've completed this stage of my benchmark study after switching to reading the data via parquet as per this PR. Here are my findings:

  • For ingestion, Kùzu is consistently faster than Neo4j by a factor of ~18x for a graph of 100k nodes and ~2.4M edges. I expect this speedup factor to be even higher as the dataset grows.
  • For OLAP querying, Kùzu is significantly faster than Neo4j for most types of queries, especially ones that aggregate over many-to-many relationships.

I've left open the question of why certain types of queries are only on par with Neo4j. We can take a look at those as we go along, and I can rerun the numbers then. Thanks!

prrao87 merged commit 44d887e into main on Aug 18, 2023
prrao87 deleted the parquet branch on August 18, 2023 14:44
Successfully merging this pull request may close these issues.

Convert CSV node/edge generation to parquet