diff --git a/README.md b/README.md index 613318e..a95482a 100644 --- a/README.md +++ b/README.md @@ -2,6 +2,11 @@ Code for the benchmark study described in this [blog post](https://thedataquarry.com/posts/embedded-db-2/). +> [!NOTE] +> Neo4j version: `5.18.0` +> KùzuDB version: `0.3.2` + + [Kùzu](https://kuzudb.com/) is an in-process (embedded) graph database management system (GDBMS) written in C++. It is blazing fast 🔥, and is optimized for handling complex join-heavy analytical workloads on very large graphs. Kùzu is being actively developed, and its [goal](https://kuzudb.com/docusaurus/blog/what-every-gdbms-should-do-and-vision) is to do in the graph data science space what DuckDB did in the world of tabular data science -- that is, to provide a fast, lightweight, embeddable graph database for analytics (OLAP) use cases, with minimal infrastructure setup. This study has the following goals: @@ -92,23 +97,20 @@ The run times for both ingestion and queries are compared. * For ingestion, KùzuDB is consistently faster than Neo4j by a factor of **~18x** for a graph size of 100K nodes and ~2.4M edges. * For OLAP queries, KùzuDB is **significantly faster** than Neo4j, especially for ones that involve multi-hop queries via nodes with many-to-many relationships. -### Testing conditions +### Benchmark conditions -* M3 Macbook Pro, 32 GB RAM -* Neo4j version: `5.16.0` -* KùzuDB version: `0.2.0` +The benchmark is run on an M3 Macbook Pro with 36 GB RAM. ### Ingestion performance -In total, ~100K nodes and ~2.5 million edges are ingested **~18x** faster in KùzuDB than in Neo4j. - Case | Neo4j (sec) | Kùzu (sec) | Speedup factor --- | ---: | ---: | ---: -Nodes | 2.3 | 0.1 | 23x -Edges | 30.6 | 2.2 | 14x -Total | 32.9 | 2.3 | 14x +Nodes | 2.4 | 0.2 | 12x +Edges | 30.9 | 0.4 | 77x +Total | 33.3 | 0.6 | 55x -Nodes are ingested significantly faster in Kùzu in this case, and Neo4j's node ingestion remains of the order of seconds despite setting constraints on the ID fields as per their best practices. The speedup factors shown are expected to be even higher as the dataset gets larger and larger, with Kùzu being around two orders of magnitude faster for inserting nodes. +Nodes are ingested significantly faster in Kùzu, and Neo4j's node ingestion remains on the order of seconds despite setting uniqueness constraints on the ID fields, as per their best practices. With this approach, the speedup factor is expected to grow even further as the dataset gets larger; +the only way to significantly speed up Neo4j's data ingestion is to bypass the Python client and use `neo4j-admin import` instead. ### Query performance benchmark @@ -132,15 +134,15 @@ The following table shows the run times for each query (averaged over the number Query | Neo4j (sec) | Kùzu (sec) | Speedup factor --- | ---: | ---: | ---: -1 | 1.5396 | 0.283 | 5.4 -2 | 0.5680 | 0.378 | 1.5 -3 | 0.0338 | 0.011 | 3.1 -4 | 0.0391 | 0.009 | 4.3 -5 | 0.0069 | 0.003 | 2.3 -6 | 0.0159 | 0.034 | 0.5 -7 | 0.1433 | 0.007 | 20.5 -8 | 2.9034 | 0.092 | 31.6 -9 | 3.6319 | 0.103 | 35.2 +1 | 1.7614 | 0.2722 | 6.5x +2 | 0.6149 | 0.3340 | 1.8x +3 | 0.0388 | 0.0112 | 3.5x +4 | 0.0426 | 0.0094 | 4.5x +5 | 0.0080 | 0.0037 | 2.2x +6 | 0.0212 | 0.0335 | 0.6x +7 | 0.1592 | 0.0070 | 22.7x +8 | 3.2919 | 0.0901 | 36.5x +9 | 4.0125 | 0.1016 | 39.5x #### Neo4j vs. Kùzu multi-threaded KùzuDB (by default) supports multi-threaded execution of queries.
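To make the single-threaded vs. multi-threaded comparison concrete, the number of threads Kùzu uses for query execution can be pinned from the Python API via `set_max_threads_for_exec` on the connection. The snippet below is only a minimal sketch, not this repo's benchmark code; the database path and the `Person` table name are illustrative placeholders.

```python
import kuzu

# Illustrative path -- the study's database is built separately by build_graph.py
db = kuzu.Database("./social_network")
conn = kuzu.Connection(db)

# Pin query execution to a single thread for the single-threaded numbers;
# leave the default (or raise the value) to let Kùzu parallelize the query.
conn.set_max_threads_for_exec(1)

response = conn.execute("MATCH (p:Person) RETURN count(p) AS numPersons;")
print(response.get_as_pl())
```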
The following Query | Neo4j (sec) | Kùzu (sec) | Speedup factor --- | ---: | ---: | ---: -1 | 1.5396 | 0.171 | 9.0 -2 | 0.5680 | 0.203 | 2.8 -3 | 0.0338 | 0.013 | 2.6 -4 | 0.0391 | 0.012 | 3.3 -5 | 0.0069 | 0.004 | 1.7 -6 | 0.0159 | 0.033 | 0.5 -7 | 0.1433 | 0.008 | 17.9 -8 | 2.9034 | 0.074 | 39.3 -9 | 3.6319 | 0.087 | 41.8 +1 | 1.7614 | 0.1678 | 10.5x +2 | 0.6149 | 0.2025 | 3.0x +3 | 0.0388 | 0.0145 | 2.7x +4 | 0.0426 | 0.0136 | 3.1x +5 | 0.0080 | 0.0046 | 1.7x +6 | 0.0212 | 0.0346 | 0.6x +7 | 0.1592 | 0.0079 | 20.1x +8 | 3.2919 | 0.0777 | 42.4x +9 | 4.0125 | 0.0664 | 60.4x > 🔥 The second-degree path-finding queries (8 and 9) show the biggest speedup over Neo4j, due to innovations in KùzuDB's query planner and execution engine. @@ -164,7 +166,7 @@ Query | Neo4j (sec) | Kùzu (sec) | Speedup factor #### Scale up the dataset -It's possible to regenerate a fake dataset of ~100M nodes and ~2.5B edges, and see how the performance of KùzuDB and Neo4j compare -- it's likely that Neo4j cannot handle 2-hop path-finding queries at that scale on a single node, so queries 8 and 9 can be disabled for that larger dataset. +It's possible to regenerate an artificial dataset of ~100M nodes and ~2.5B edges, and see how the performance of KùzuDB and Neo4j compare -- it's likely that Neo4j cannot handle 2-hop path-finding queries at that scale on a single node, so queries 8 and 9 can be disabled for that larger dataset. ```sh # Generate data with 100M persons and ~2.5B edges (Might take a while in Python!) diff --git a/data/output/nodes/cities.parquet b/data/output/nodes/cities.parquet index db4452f..8737f66 100644 Binary files a/data/output/nodes/cities.parquet and b/data/output/nodes/cities.parquet differ diff --git a/data/output/nodes/persons.parquet b/data/output/nodes/persons.parquet index 292faa4..7baa8a3 100644 Binary files a/data/output/nodes/persons.parquet and b/data/output/nodes/persons.parquet differ diff --git a/data/output/nodes/states.parquet b/data/output/nodes/states.parquet index 3d13458..e0b8e75 100644 Binary files a/data/output/nodes/states.parquet and b/data/output/nodes/states.parquet differ diff --git a/kuzudb/README.md b/kuzudb/README.md index 2c9548c..453b254 100644 --- a/kuzudb/README.md +++ b/kuzudb/README.md @@ -25,8 +25,8 @@ As expected, the nodes load much faster than the edges, since there are many mor ```bash $ python build_graph.py -Nodes loaded in 0.1509s -Edges loaded in 2.2402s +Nodes loaded in 0.1542s +Edges loaded in 0.3803s Successfully loaded nodes and edges into KùzuDB! ``` @@ -420,67 +420,67 @@ Queries completed in 0.7561s The benchmark is run using `pytest-benchmark` package as follows. 
```sh -$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname -========================================================================================= test session starts ========================================================================================== -platform darwin -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0 +$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname ✘ 130 update-kuzu ✱ +====================================================================================================== test session starts ======================================================================================================= +platform darwin -- Python 3.11.7, pytest-8.1.1, pluggy-1.4.0 benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=5) rootdir: /Users/prrao/code/kuzudb-study/kuzudb plugins: Faker-23.1.0, benchmark-4.0.0 -collected 9 items +collected 9 items -benchmark_query.py ......... [100%] +benchmark_query.py ......... [100%] ---------------------------------------------------------------------------------------- benchmark: 9 tests -------------------------------------------------------------------------------------- -Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -test_benchmark_query1 252.0117 (86.58) 347.7262 (58.99) 283.1725 (84.65) 40.3382 (189.01) 263.0762 (79.45) 55.3614 (233.22) 1;0 3.5314 (0.01) 5 1 -test_benchmark_query2 295.4568 (101.50) 490.8745 (83.28) 378.4995 (113.15) 82.3863 (386.04) 403.3876 (121.83) 127.7752 (538.28) 2;0 2.6420 (0.01) 5 1 -test_benchmark_query3 10.3258 (3.55) 12.6975 (2.15) 10.8966 (3.26) 0.4811 (2.25) 10.7724 (3.25) 0.3823 (1.61) 11;4 91.7722 (0.31) 66 1 -test_benchmark_query4 8.0921 (2.78) 9.1896 (1.56) 8.7203 (2.61) 0.2134 (1.0) 8.7555 (2.64) 0.2837 (1.20) 25;1 114.6747 (0.38) 78 1 -test_benchmark_query5 2.9108 (1.0) 5.8945 (1.0) 3.3450 (1.0) 0.3156 (1.48) 3.3112 (1.0) 0.2374 (1.0) 13;3 298.9503 (1.0) 114 1 -test_benchmark_query6 32.9890 (11.33) 36.0460 (6.12) 34.3424 (10.27) 0.7993 (3.75) 34.3895 (10.39) 1.1800 (4.97) 10;0 29.1185 (0.10) 27 1 -test_benchmark_query7 6.1617 (2.12) 7.5800 (1.29) 6.7920 (2.03) 0.3007 (1.41) 6.7980 (2.05) 0.4178 (1.76) 34;0 147.2325 (0.49) 93 1 -test_benchmark_query8 87.3487 (30.01) 94.6871 (16.06) 92.0254 (27.51) 2.4032 (11.26) 92.1223 (27.82) 3.5121 (14.80) 3;0 10.8666 (0.04) 9 1 -test_benchmark_query9 99.9393 (34.33) 105.5227 (17.90) 103.5184 (30.95) 1.7100 (8.01) 104.0372 (31.42) 1.4556 (6.13) 2;1 9.6601 (0.03) 8 1 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- +-------------------------------------------------------------------------------------- benchmark: 9 tests -------------------------------------------------------------------------------------- +Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations 
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ +test_benchmark_query1 254.0487 (77.28) 331.9569 (65.09) 272.2583 (72.23) 33.5090 (118.10) 258.9573 (70.13) 24.4201 (82.49) 1;1 3.6730 (0.01) 5 1 +test_benchmark_query2 293.4545 (89.27) 388.5836 (76.19) 334.0680 (88.63) 48.9350 (172.46) 301.8217 (81.74) 88.7258 (299.73) 2;0 2.9934 (0.01) 5 1 +test_benchmark_query3 10.4950 (3.19) 12.3280 (2.42) 11.2442 (2.98) 0.3642 (1.28) 11.2188 (3.04) 0.4407 (1.49) 19;2 88.9345 (0.34) 62 1 +test_benchmark_query4 8.6238 (2.62) 11.0205 (2.16) 9.3746 (2.49) 0.4232 (1.49) 9.2816 (2.51) 0.4236 (1.43) 15;6 106.6709 (0.40) 76 1 +test_benchmark_query5 3.2872 (1.0) 5.1003 (1.0) 3.7691 (1.0) 0.3535 (1.25) 3.6925 (1.0) 0.2960 (1.0) 23;9 265.3119 (1.0) 104 1 +test_benchmark_query6 32.8883 (10.00) 35.4205 (6.94) 33.5387 (8.90) 0.5317 (1.87) 33.3696 (9.04) 0.6214 (2.10) 6;1 29.8163 (0.11) 28 1 +test_benchmark_query7 6.2537 (1.90) 7.7147 (1.51) 7.0166 (1.86) 0.2837 (1.0) 7.0423 (1.91) 0.3966 (1.34) 34;0 142.5183 (0.54) 91 1 +test_benchmark_query8 86.9893 (26.46) 91.6528 (17.97) 90.1817 (23.93) 1.5253 (5.38) 90.8585 (24.61) 2.1688 (7.33) 1;0 11.0887 (0.04) 9 1 +test_benchmark_query9 98.5566 (29.98) 105.5151 (20.69) 101.6341 (26.96) 2.2933 (8.08) 101.5073 (27.49) 2.8376 (9.59) 2;0 9.8392 (0.04) 7 1 +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Legend: Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile. OPS: Operations Per Second, computed as 1 / Mean -========================================================================================== 9 passed in 11.55s ========================================================================================== +======================================================================================================= 9 passed in 11.30s ======================================================================================================= ``` #### Query performance (Kùzu multi-threaded) ```sh -$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname -========================================================================================= test session starts ========================================================================================== -platform darwin -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0 +$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname ✘ 130 update-kuzu ✱ +====================================================================================================== test session starts ======================================================================================================= +platform darwin -- Python 3.11.7, pytest-8.1.1, pluggy-1.4.0 benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=5) rootdir: /Users/prrao/code/kuzudb-study/kuzudb plugins: Faker-23.1.0, benchmark-4.0.0 -collected 9 items +collected 9 items -benchmark_query.py ......... [100%] +benchmark_query.py ......... 
[100%] -------------------------------------------------------------------------------------- benchmark: 9 tests -------------------------------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ -test_benchmark_query1 144.0958 (38.16) 268.6453 (45.91) 171.1153 (37.70) 54.5636 (238.27) 147.4435 (32.35) 34.3697 (105.56) 1;1 5.8440 (0.03) 5 1 -test_benchmark_query2 201.5580 (53.38) 206.7633 (35.33) 203.4183 (44.81) 2.1432 (9.36) 203.2485 (44.60) 3.0854 (9.48) 1;0 4.9160 (0.02) 5 1 -test_benchmark_query3 12.5962 (3.34) 13.7345 (2.35) 13.0013 (2.86) 0.2290 (1.0) 12.9419 (2.84) 0.3256 (1.0) 16;1 76.9153 (0.35) 57 1 -test_benchmark_query4 11.6342 (3.08) 13.6797 (2.34) 12.4356 (2.74) 0.5213 (2.28) 12.2805 (2.69) 0.7021 (2.16) 21;0 80.4144 (0.37) 59 1 -test_benchmark_query5 3.7759 (1.0) 5.8518 (1.0) 4.5393 (1.0) 0.3693 (1.61) 4.5571 (1.0) 0.4914 (1.51) 31;1 220.2987 (1.0) 102 1 -test_benchmark_query6 31.2499 (8.28) 49.5773 (8.47) 33.4805 (7.38) 3.3679 (14.71) 32.5581 (7.14) 1.9644 (6.03) 1;1 29.8682 (0.14) 29 1 -test_benchmark_query7 7.1773 (1.90) 10.1346 (1.73) 8.2812 (1.82) 0.6263 (2.74) 8.1582 (1.79) 0.6606 (2.03) 19;6 120.7552 (0.55) 80 1 -test_benchmark_query8 64.1151 (16.98) 83.9229 (14.34) 73.9044 (16.28) 5.2735 (23.03) 73.8122 (16.20) 4.6661 (14.33) 4;2 13.5310 (0.06) 12 1 -test_benchmark_query9 55.9865 (14.83) 147.0649 (25.13) 84.8630 (18.70) 27.7042 (120.98) 74.6289 (16.38) 3.9108 (12.01) 3;4 11.7837 (0.05) 14 1 +test_benchmark_query1 143.7831 (36.84) 252.6395 (38.09) 167.8478 (36.78) 47.4400 (111.50) 147.9915 (33.16) 29.3312 (60.86) 1;1 5.9578 (0.03) 5 1 +test_benchmark_query2 198.2216 (50.79) 205.8762 (31.04) 202.4746 (44.37) 3.0336 (7.13) 203.0530 (45.50) 4.6756 (9.70) 2;0 4.9389 (0.02) 5 1 +test_benchmark_query3 13.5389 (3.47) 15.5465 (2.34) 14.4884 (3.17) 0.4255 (1.0) 14.4661 (3.24) 0.4819 (1.0) 15;1 69.0209 (0.31) 52 1 +test_benchmark_query4 12.5585 (3.22) 14.5405 (2.19) 13.6137 (2.98) 0.4390 (1.03) 13.5607 (3.04) 0.5406 (1.12) 20;1 73.4555 (0.34) 55 1 +test_benchmark_query5 3.9030 (1.0) 6.6330 (1.0) 4.5634 (1.0) 0.4712 (1.11) 4.4623 (1.0) 0.4962 (1.03) 16;5 219.1327 (1.0) 101 1 +test_benchmark_query6 32.6305 (8.36) 42.6955 (6.44) 34.6170 (7.59) 2.0708 (4.87) 34.1572 (7.65) 0.7366 (1.53) 3;3 28.8876 (0.13) 27 1 +test_benchmark_query7 6.9358 (1.78) 9.6718 (1.46) 7.8832 (1.73) 0.4438 (1.04) 7.8641 (1.76) 0.4891 (1.02) 22;2 126.8526 (0.58) 91 1 +test_benchmark_query8 65.6220 (16.81) 125.4942 (18.92) 77.7316 (17.03) 21.5360 (50.61) 66.9292 (15.00) 3.3628 (6.98) 3;3 12.8648 (0.06) 14 1 +test_benchmark_query9 64.6778 (16.57) 68.5543 (10.34) 66.3754 (14.55) 1.0579 (2.49) 66.3023 (14.86) 1.0378 (2.15) 4;1 15.0658 (0.07) 14 1 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Legend: Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile. 
OPS: Operations Per Second, computed as 1 / Mean -========================================================================================== 9 passed in 10.20s ========================================================================================== -``` \ No newline at end of file +======================================================================================================= 9 passed in 10.13s ======================================================================================================= +``` diff --git a/kuzudb/benchmark_query.py b/kuzudb/benchmark_query.py index 24524e1..4795dea 100644 --- a/kuzudb/benchmark_query.py +++ b/kuzudb/benchmark_query.py @@ -49,8 +49,8 @@ def test_benchmark_query3(benchmark, connection): assert result[0]["city"] == "Austin" assert result[1]["city"] == "Kansas City" assert result[2]["city"] == "Miami" - assert result[3]["city"] == "San Antonio" - assert result[4]["city"] == "Houston" + assert result[3]["city"] == "Houston" + assert result[4]["city"] == "San Antonio" def test_benchmark_query4(benchmark, connection): @@ -61,9 +61,9 @@ def test_benchmark_query4(benchmark, connection): assert result[0]["countries"] == "United States" assert result[1]["countries"] == "Canada" assert result[2]["countries"] == "United Kingdom" - assert result[0]["personCounts"] == 30733 - assert result[1]["personCounts"] == 3046 - assert result[2]["personCounts"] == 1816 + assert result[0]["personCounts"] == 30698 + assert result[1]["personCounts"] == 3037 + assert result[2]["personCounts"] == 1819 def test_benchmark_query5(benchmark, connection): @@ -114,7 +114,7 @@ def test_benchmark_query7(benchmark, connection): result = result.to_dicts() assert len(result) == 1 - assert result[0]["numPersons"] == 168 + assert result[0]["numPersons"] == 165 assert result[0]["state"] == "California" assert result[0]["country"] == "United States" @@ -132,4 +132,4 @@ def test_benchmark_query9(benchmark, connection): result = result.to_dicts() assert len(result) == 1 - assert result[0]["numPaths"] == 46220422 + assert result[0]["numPaths"] == 46061065 diff --git a/kuzudb/build_graph.py b/kuzudb/build_graph.py index 6dceae4..8ece48d 100644 --- a/kuzudb/build_graph.py +++ b/kuzudb/build_graph.py @@ -118,7 +118,7 @@ def main(conn: Connection) -> None: conn.execute(f"COPY CityIn FROM '{EDGES_PATH}/city_in.parquet';") conn.execute(f"COPY StateIn FROM '{EDGES_PATH}/state_in.parquet';") - print(f"Successfully loaded nodes and edges into KùzuDB!") + print("Successfully loaded nodes and edges into KùzuDB!") if __name__ == "__main__": diff --git a/kuzudb/query.py b/kuzudb/query.py index b554e7f..e53f3bc 100644 --- a/kuzudb/query.py +++ b/kuzudb/query.py @@ -4,7 +4,6 @@ from typing import Any import kuzu -import polars as pl from codetiming import Timer from kuzu import Connection @@ -18,7 +17,7 @@ def run_query1(conn: Connection) -> None: """ print(f"\nQuery 1:\n {query}") response = conn.execute(query) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print(f"Top 3 most-followed persons:\n{result}") return result @@ -34,7 +33,7 @@ def run_query2(conn: Connection) -> None: """ print(f"\nQuery 2:\n {query}") response = conn.execute(query) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print(f"City in which most-followed person lives:\n{result}") return result @@ -49,7 +48,7 @@ def run_query3(conn: Connection, params: list[tuple[str, Any]]) -> None: """ print(f"\nQuery 3:\n {query}") response = 
conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print(f"Cities with lowest average age in {params['country']}:\n{result}") return result @@ -64,7 +63,7 @@ def run_query4(conn: Connection, params: list[tuple[str, Any]]) -> None: """ print(f"\nQuery 4:\n {query}") response = conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print(f"Persons between ages {params['age_lower']}-{params['age_upper']} in each country:\n{result}") return result @@ -82,7 +81,7 @@ def run_query5(conn: Connection, params: list[tuple[str, Any]]) -> None: """ print(f"\nQuery 5:\n {query}") response = conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print( f"Number of {params['gender']} users in {params['city']}, {params['country']} who have an interest in {params['interest']}:\n{result}" ) @@ -102,7 +101,7 @@ def run_query6(conn: Connection, params: list[tuple[str, Any]]) -> None: """ print(f"\nQuery 6:\n {query}") response = conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print( f"City with the most {params['gender']} users who have an interest in {params['interest']}:\n{result}" ) @@ -122,7 +121,7 @@ def run_query7(conn: Connection, params: list[tuple[str, Any]]) -> None: """ print(f"\nQuery 7:\n {query}") response = conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print( f""" State in {params['country']} with the most users between ages {params['age_lower']}-{params['age_upper']} who have an interest in {params['interest']}:\n{result} @@ -139,7 +138,7 @@ def run_query8(conn: Connection) -> None: """ print(f"\nQuery 8:\n {query}") response = conn.execute(query) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print( f""" Number of second-degree paths:\n{result} @@ -158,7 +157,7 @@ def run_query9(conn: Connection, params: list[tuple[str, Any]]) -> None: print(f"\nQuery 9:\n {query}") response = conn.execute(query, parameters=params) - result = pl.from_arrow(response.get_as_arrow(chunk_size=10_000)) + result = response.get_as_pl() print( f""" Number of paths through persons below {params['age_1']} to persons above {params['age_2']}:\n{result} diff --git a/neo4j/.env.example b/neo4j/.env.example index ec8017b..086b943 100644 --- a/neo4j/.env.example +++ b/neo4j/.env.example @@ -1,3 +1,3 @@ -NEO4J_VERSION = "5.16.0" +NEO4J_VERSION = "5.18.0" NEO4J_USER = "neo4j" NEO4J_PASSWORD = \ No newline at end of file diff --git a/neo4j/README.md b/neo4j/README.md index d53c23b..d6a0169 100644 --- a/neo4j/README.md +++ b/neo4j/README.md @@ -47,8 +47,8 @@ The numbers shown below are for when we ingest 100K person nodes, ~10K location # Set large batch size of 500k $ python build_graph.py -b 500000 -Nodes loaded in 2.3172s -Edges loaded in 30.6305s +Nodes loaded in 2.3581s +Edges loaded in 30.8509s ``` As expected, the nodes load much faster than the edges, since there are many more edges than nodes. In addition, the nodes in Neo4j are indexed (via uniqueness constraints), following which the edges are created based on a match on existing nodes, allowing us to achieve this performance. 
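As a rough illustration of the constrain-then-match pattern described above (a sketch only -- the connection details, labels, and property names are placeholders rather than this repo's exact schema):

```python
from neo4j import GraphDatabase

# Placeholder credentials; the study reads these from neo4j/.env
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A uniqueness constraint also creates a backing index, so node lookups
    # during edge creation are index seeks rather than label scans.
    session.run(
        "CREATE CONSTRAINT person_id IF NOT EXISTS "
        "FOR (p:Person) REQUIRE p.id IS UNIQUE"
    )
    # Edges are then created in batches by matching on the already-indexed IDs.
    edges = [{"source": 1, "target": 2}]  # placeholder batch
    session.run(
        """
        UNWIND $edges AS edge
        MATCH (a:Person {id: edge.source}), (b:Person {id: edge.target})
        MERGE (a)-[:FOLLOWS]->(b)
        """,
        edges=edges,
    )

driver.close()
```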
@@ -249,32 +249,32 @@ The benchmark is run using `pytest-benchmark` package as follows. ```sh $ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname -========================================================== test session starts =========================================================== -platform darwin -- Python 3.11.7, pytest-8.0.0, pluggy-1.4.0 +================================================= test session starts ================================================== +platform darwin -- Python 3.11.7, pytest-8.1.1, pluggy-1.4.0 benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=5) rootdir: /Users/prrao/code/kuzudb-study/neo4j plugins: Faker-23.1.0, benchmark-4.0.0 -collected 9 items +collected 9 items -benchmark_query.py ......... [100%] +benchmark_query.py ......... [100%] --------------------------------------------------------------------------------- benchmark: 9 tests --------------------------------------------------------------------------------- Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -test_benchmark_query1 1.5030 (252.50) 1.5621 (118.21) 1.5396 (222.80) 0.0271 (28.35) 1.5547 (235.80) 0.0459 (55.89) 1;0 0.6495 (0.00) 5 1 -test_benchmark_query2 0.5365 (90.13) 0.6017 (45.53) 0.5680 (82.20) 0.0291 (30.42) 0.5625 (85.30) 0.0534 (64.93) 2;0 1.7605 (0.01) 5 1 -test_benchmark_query3 0.0316 (5.31) 0.0440 (3.33) 0.0338 (4.90) 0.0025 (2.62) 0.0332 (5.03) 0.0018 (2.19) 2;2 29.5547 (0.20) 26 1 -test_benchmark_query4 0.0362 (6.09) 0.0493 (3.73) 0.0391 (5.66) 0.0034 (3.56) 0.0381 (5.78) 0.0020 (2.45) 3;4 25.5809 (0.18) 21 1 -test_benchmark_query5 0.0060 (1.0) 0.0132 (1.0) 0.0069 (1.0) 0.0010 (1.0) 0.0066 (1.0) 0.0008 (1.0) 15;7 144.7100 (1.0) 106 1 -test_benchmark_query6 0.0140 (2.35) 0.0203 (1.53) 0.0159 (2.30) 0.0016 (1.69) 0.0154 (2.33) 0.0015 (1.84) 14;5 62.7980 (0.43) 48 1 -test_benchmark_query7 0.1398 (23.49) 0.1505 (11.39) 0.1433 (20.73) 0.0035 (3.67) 0.1431 (21.70) 0.0028 (3.35) 1;1 6.9803 (0.05) 7 1 -test_benchmark_query8 2.8413 (477.32) 2.9614 (224.10) 2.9034 (420.15) 0.0513 (53.67) 2.9204 (442.93) 0.0875 (106.48) 2;0 0.3444 (0.00) 5 1 -test_benchmark_query9 3.5675 (599.32) 3.7076 (280.56) 3.6319 (525.58) 0.0659 (68.93) 3.6184 (548.78) 0.1257 (152.89) 1;0 0.2753 (0.00) 5 1 +test_benchmark_query1 1.6634 (249.15) 1.8413 (171.27) 1.7614 (219.26) 0.0710 (78.24) 1.7787 (229.50) 0.1107 (185.42) 2;0 0.5677 (0.00) 5 1 +test_benchmark_query2 0.5965 (89.35) 0.6333 (58.91) 0.6149 (76.55) 0.0160 (17.68) 0.6091 (78.59) 0.0276 (46.27) 2;0 1.6262 (0.01) 5 1 +test_benchmark_query3 0.0360 (5.39) 0.0463 (4.30) 0.0388 (4.83) 0.0023 (2.49) 0.0384 (4.96) 0.0020 (3.37) 4;1 25.7565 (0.21) 20 1 +test_benchmark_query4 0.0404 (6.04) 0.0500 (4.65) 0.0426 (5.30) 0.0023 (2.55) 0.0421 (5.43) 0.0017 (2.90) 3;3 23.4888 (0.19) 24 1 +test_benchmark_query5 0.0067 (1.0) 0.0108 (1.0) 0.0080 (1.0) 0.0009 (1.0) 0.0078 (1.0) 0.0013 (2.12) 25;2 124.4822 (1.0) 98 1 +test_benchmark_query6 0.0182 (2.72) 0.0257 (2.39) 0.0212 (2.64) 0.0018 (1.97) 0.0205 (2.65) 0.0029 (4.78) 14;0 47.2173 (0.38) 44 1 +test_benchmark_query7 0.1557 (23.32) 0.1673 (15.56) 0.1592 (19.81) 0.0037 (4.10) 0.1581 (20.40) 0.0006 (1.0) 1;2 6.2826 
(0.05) 7 1 +test_benchmark_query8 3.0889 (462.66) 3.3602 (312.56) 3.2919 (409.78) 0.1153 (126.95) 3.3429 (431.32) 0.1042 (174.43) 1;1 0.3038 (0.00) 5 1 +test_benchmark_query9 3.9647 (593.83) 4.0488 (376.61) 4.0125 (499.48) 0.0316 (34.82) 4.0214 (518.87) 0.0398 (66.58) 2;0 0.2492 (0.00) 5 1 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Legend: Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile. OPS: Operations Per Second, computed as 1 / Mean -====================================================== 9 passed in 66.44s (0:01:06) ====================================================== +=============================================================== 9 passed in 73.77s (0:01:13) ================================================================ ``` \ No newline at end of file diff --git a/neo4j/benchmark_query.py b/neo4j/benchmark_query.py index df678da..940a66a 100644 --- a/neo4j/benchmark_query.py +++ b/neo4j/benchmark_query.py @@ -56,8 +56,8 @@ def test_benchmark_query3(benchmark, session): assert result[0]["city"] == "Austin" assert result[1]["city"] == "Kansas City" assert result[2]["city"] == "Miami" - assert result[3]["city"] == "San Antonio" - assert result[4]["city"] == "Houston" + assert result[3]["city"] == "Houston" + assert result[4]["city"] == "San Antonio" def test_benchmark_query4(benchmark, session): @@ -68,9 +68,9 @@ def test_benchmark_query4(benchmark, session): assert result[0]["countries"] == "United States" assert result[1]["countries"] == "Canada" assert result[2]["countries"] == "United Kingdom" - assert result[0]["personCounts"] == 30733 - assert result[1]["personCounts"] == 3046 - assert result[2]["personCounts"] == 1816 + assert result[0]["personCounts"] == 30698 + assert result[1]["personCounts"] == 3037 + assert result[2]["personCounts"] == 1819 def test_benchmark_query5(benchmark, session): @@ -96,7 +96,7 @@ def test_benchmark_query7(benchmark, session): result = result.to_dicts() assert len(result) == 1 - assert result[0]["numPersons"] == 168 + assert result[0]["numPersons"] == 165 assert result[0]["state"] == "California" assert result[0]["country"] == "United States" @@ -114,4 +114,4 @@ def test_benchmark_query9(benchmark, session): result = result.to_dicts() assert len(result) == 1 - assert result[0]["numPaths"] == 46220422 + assert result[0]["numPaths"] == 46061065 diff --git a/requirements.txt b/requirements.txt index 40fba9f..93c9955 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,8 @@ faker~=23.1.0 polars~=0.20.0 pyarrow~=15.0.0 -kuzu~=0.2.0 -neo4j~=5.17.0 +kuzu~=0.3.0 +neo4j~=5.18.0 python-dotenv>=1.0.0 codetiming>=1.4.0 pytest-benchmark>=4.0.0 \ No newline at end of file