Indexes #28
Hi @maxdemarzi, I added indexes as you suggested, but for some reason, it makes the performance worse, not better, and I'm not sure why. Also, I'll clarify each of your points below:
```
$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname
=========================================================== test session starts ============================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=5)
rootdir: /code/kuzudb-study/neo4j
plugins: Faker-19.2.0, anyio-3.7.1, benchmark-4.0.0
collected 9 items

benchmark_query.py .........                                                                                                         [100%]

--------------------------------------------------------------------------------- benchmark: 9 tests ---------------------------------------------------------------------------------
Name (time in s)       Min                Max               Mean               StdDev          Median             IQR               Outliers  OPS               Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_query1  1.5733 (253.20)    1.6033 (63.31)    1.5928 (215.76)    0.0131 (7.05)   1.6004 (227.62)    0.0201 (41.80)    1;0       0.6278 (0.00)     5       1
test_benchmark_query2  0.5663 (91.13)     0.5889 (23.26)    0.5770 (78.17)     0.0095 (5.12)   0.5746 (81.72)     0.0163 (33.91)    2;0       1.7331 (0.01)     5       1
test_benchmark_query3  0.0362 (5.83)      0.0527 (2.08)     0.0394 (5.34)      0.0043 (2.33)   0.0376 (5.34)      0.0040 (8.33)     2;2       25.3731 (0.19)    19      1
test_benchmark_query4  0.0410 (6.60)      0.0566 (2.24)     0.0435 (5.89)      0.0032 (1.72)   0.0425 (6.04)      0.0016 (3.42)     2;2       23.0038 (0.17)    23      1
test_benchmark_query5  0.0062 (1.0)       0.0267 (1.05)     0.0074 (1.0)       0.0021 (1.15)   0.0070 (1.0)       0.0005 (1.0)      1;5       135.4661 (1.0)    88      1
test_benchmark_query6  0.0177 (2.84)      0.0253 (1.0)      0.0197 (2.67)      0.0019 (1.0)    0.0192 (2.73)      0.0014 (2.81)     7;5       50.6911 (0.37)    45      1
test_benchmark_query7  0.1517 (24.41)     0.1685 (6.66)     0.1556 (21.07)     0.0058 (3.11)   0.1538 (21.87)     0.0007 (1.46)     1;2       6.4286 (0.05)     7       1
test_benchmark_query8  3.1052 (499.72)    3.1835 (125.71)   3.1393 (425.27)    0.0333 (17.89)  3.1493 (447.93)    0.0535 (111.43)   2;0       0.3185 (0.00)     5       1
test_benchmark_query9  7.6747 (>1000.0)   7.7181 (304.78)   7.7004 (>1000.0)   0.0164 (8.82)   7.7041 (>1000.0)   0.0205 (42.60)    2;0       0.1299 (0.00)     5       1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
======================================================= 9 passed in 97.45s (0:01:37) =======================================================
```
Results after adding the three indexes as per #29: Neo4j vs. Kùzu multi-threaded
Query 8, with the And I ran the benchmark multiple times, to obtain the same results. The larger points in my blog posts are still valid, I feel -- with Kùzu, I don't have to "think" about what properties are being queried on beforehand, and the only properties being indexed in Kùzu are the primary key (
When returning IDs in Neo4j, you are returning the id property, not the id(node) primary key. That may affect some query speeds. As for lowercase versus not: try what I suggested and index the pre-lowercased property, so that you are comparing traversal speed, not string modifications. You can run the query with PROFILE in Neo4j to see where it is taking longer. I would also suggest using a real benchmarking tool like Gatling and running each query for at least 60 seconds after warm-up.
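As a rough illustration of the "run each query for a fixed time after warm-up" idea, here is a minimal Python sketch. It is not Gatling and not part of the study's code: `measure_throughput`, its parameters, and the `run_query` callable are hypothetical names, and the loop is single-threaded, so it only reports sequential queries per second.

```python
import time
from typing import Callable


def measure_throughput(
    run_query: Callable[[], object],
    duration_s: float = 60.0,
    warmup_s: float = 5.0,
) -> float:
    """Return completed queries per second over a fixed time window.

    `run_query` is a hypothetical zero-argument callable that executes
    one query (for example, a wrapper around a database driver call).
    """
    # Warm-up phase: run the query repeatedly, but do not count it.
    warmup_deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < warmup_deadline:
        run_query()

    # Measurement phase: count how many queries finish before the deadline.
    start = time.perf_counter()
    deadline = start + duration_s
    completed = 0
    while time.perf_counter() < deadline:
        run_query()
        completed += 1

    elapsed = time.perf_counter() - start
    return completed / elapsed
```

Passing a closure that runs a single benchmark query would give a throughput figure to report alongside the per-call timings; a dedicated load-testing tool would additionally cover concurrency and latency percentiles.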
But the goal of this study isn't to say that one tool is "better" or faster than the other. It's to apply a similar set of data preprocessing and querying techniques to answer questions about the data, without giving special treatment to Neo4j. I've been a long-time user of Neo4j, and the constant dancing around one has to do with indexing, refactoring data models to answer specific types of queries, and so on, is what makes it hard to justify running it on production workloads with large datasets. I've been experimenting with other datasets at work, and the same speed issues apply (with or without indexing), whereas Kùzu just works. My takeaway is that Neo4j performs the same role among graph DBMSs as Postgres does among RDBMSs: they're both OLTP row/record-wise stores, better at handling transactions, and for OLAP workloads, Neo4j will face the same performance issues that Postgres does when compared against its OLAP counterparts (ClickHouse or DuckDB for RDBMS, and Kùzu for GDBMS). I appreciate your input, and am glad I went through this learning exercise. Closing this for now. Thanks!
Hello 👋, as a person who has been asked to look into this, I think this should be left open a little while longer.
@JoshInnis could you open a new issue with your findings when you have them? Thanks!
Can you try to run Neo4j with indexes?
For example Country.country, Person.age, City.city
Create two properties Person.lower_gender and Interest.lower_interest with the property already in lowercase and index both... or just set the data to be the toLower version of it and index/use that directly.
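For concreteness, the suggestion above could be sketched as a small Python helper that builds the Cypher statements. This is only a sketch under assumptions: the index names, the helper functions, and the source property names `gender` and `interest` are hypothetical, and the `CREATE INDEX ... IF NOT EXISTS FOR ... ON ...` form assumes Neo4j 4.x or later.

```python
# Label/property pairs to index, taken from the suggestion above.
INDEXED_PROPERTIES = [
    ("Country", "country"),
    ("Person", "age"),
    ("City", "city"),
    ("Person", "lower_gender"),
    ("Interest", "lower_interest"),
]


def index_statement(label: str, prop: str) -> str:
    """Build a Cypher statement that creates a named index on one property."""
    name = f"{label.lower()}_{prop}_idx"  # hypothetical naming scheme
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def lowercase_copy_statement(label: str, source_prop: str, target_prop: str) -> str:
    """Build a Cypher statement that stores a pre-lowercased copy of a property."""
    return f"MATCH (n:{label}) SET n.{target_prop} = toLower(n.{source_prop})"


def build_statements() -> list[str]:
    """Return the statements in order: lowercase copies first, then indexes."""
    statements = [
        # Source property names `gender` and `interest` are assumptions.
        lowercase_copy_statement("Person", "gender", "lower_gender"),
        lowercase_copy_statement("Interest", "interest", "lower_interest"),
    ]
    statements += [index_statement(label, prop) for label, prop in INDEXED_PROPERTIES]
    return statements


if __name__ == "__main__":
    for statement in build_statements():
        print(statement)
```

Each printed statement could then be run once against the database before benchmarking, so that the queries filter on the indexed, pre-lowercased properties instead of calling toLower at query time.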
Also what does the throughput of these queries look like? Not just the time.
Also comparing queries that take so little time is not the most useful.
0.0086 seconds vs 0.0046 seconds is kinda pointless.