perf: add B-Tree index for equality search #611
matheusvir wants to merge 3 commits into msiemens:master
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>
a188ae2 to 0d60d3b
Thanks for the PR. The benchmarks and write-up are thoughtful, and I appreciate the work here.
However, I think this is out of scope for TinyDB core. Even as an optional feature, adding indexing introduces a fair bit of complexity to the table implementation and raises the maintenance burden around inserts, updates, removes, and query behavior. The performance gains are clear, but this feels more like a step toward a more feature-heavy database engine than something that belongs in the core. My rule of thumb is that at a size where TinyDB's performance becomes an issue, one will gain more performance by using e.g. SQLite (possibly with its JSON extensions) than even the most optimized Python code could offer.
That being said, I’d be more comfortable seeing this explored as an extension instead. If you publish this as an extension, I'd be glad to link to it from the docs extension list.
What was done
This PR introduces an optional in-memory B-Tree index to improve document lookup performance in TinyDB.
By default, TinyDB performs searches using a full linear scan over in-memory documents, resulting in O(n) lookup complexity.
With this change, indexed fields use a B-Tree structure, reducing key lookup complexity to O(log n).
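To illustrate the difference between the two lookup paths, here is a minimal, self-contained sketch. It is not the code in this PR: a sorted key list with binary search stands in for the B-Tree (both give O(log n) key lookup), and the names `linear_search` and `SortedIndex` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def linear_search(docs, field, value):
    """Full scan over all documents: O(n) comparisons."""
    return [doc_id for doc_id, doc in docs.items()
            if field in doc and doc[field] == value]


class SortedIndex:
    """Stand-in for a B-Tree: sorted (key, doc_id) pairs plus binary search.

    Assumes all indexed values for the field are mutually comparable.
    """

    def __init__(self, docs, field):
        # docs maps doc_id -> document; documents lacking the field are skipped.
        self._entries = sorted(
            (doc[field], doc_id) for doc_id, doc in docs.items() if field in doc
        )
        self._keys = [key for key, _ in self._entries]

    def lookup(self, value):
        # Two binary searches locate the run of equal keys in O(log n).
        lo = bisect_left(self._keys, value)
        hi = bisect_right(self._keys, value)
        return [doc_id for _, doc_id in self._entries[lo:hi]]


docs = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "a"}, 4: {"age": 3}}
index = SortedIndex(docs, "name")
matches = index.lookup("a")
```

Both paths return the same document IDs; only the number of comparisons per lookup differs. Building the index is itself O(n log n), so it pays off only when the same field is queried repeatedly.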
The index is created on demand using create_index(field_name) and does not modify TinyDB's default behavior unless explicitly enabled.
Implementation details
- index.py: implements a B-Tree index.
- Table.__init__(): maintains an internal index registry.
- search(): uses the index when available.
- insert(): updates indexes on insertion.
- update(): maintains index consistency.
- remove(): removes entries from indexes.
Complexity impact
Where:
Performance observations
The implementation does not add new test files. The existing suite (204 passed, 1 skipped) exercises all affected code paths (search(), insert(), update(), and remove()) through the parametrized fixtures in tests/test_operations.py and tests/test_tables.py. All tests pass with the index enabled and with the default behavior (no index) unchanged.
Performance
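All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance, in particular from differing library versions. Containers share the host kernel, so CPU scheduling and OS caching effects are reduced rather than eliminated.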
Methodology
- JSONStorage, reflecting real TinyDB usage.
- time.perf_counter_ns() with GC disabled during measurement.
Results
Analysis
The gain is consistent across all scales, ranging from ~64% to ~70%. The baseline cost grows nearly linearly with document count, while the indexed path scales significantly better. The reduction in standard deviation in the optimized runs also indicates more predictable latency.
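For readers who want to reproduce this kind of comparison without the full harness, the measurement approach named under Methodology (time.perf_counter_ns() with GC disabled) could be sketched as follows. This is an assumed reconstruction, not the benchmark script itself: `time_call` and `percent_gain` are hypothetical helpers, and the gain is assumed to mean the relative reduction in mean run time.

```python
import gc
import statistics
import time


def time_call(fn, repeats=30):
    """Return (mean_ns, stdev_ns) of fn() over `repeats` runs, with GC disabled."""
    samples = []
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the samples
    try:
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn()
            samples.append(time.perf_counter_ns() - start)
    finally:
        # Restore the collector to whatever state the caller had.
        if gc_was_enabled:
            gc.enable()
    return statistics.mean(samples), statistics.stdev(samples)


def percent_gain(baseline_ns, optimized_ns):
    """Relative reduction in time, in percent (assumed definition of 'gain')."""
    return 100.0 * (baseline_ns - optimized_ns) / baseline_ns


mean_ns, stdev_ns = time_call(lambda: sum(range(1000)), repeats=5)
```

Comparing the standard deviation of the baseline and indexed runs, as the analysis does, only requires calling `time_call` once per variant.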
The write-time overhead from index maintenance is an expected and acceptable trade-off for workloads that are read-heavy.
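To make the write-side cost concrete, the following sketch shows the bookkeeping every insert, update, and remove has to perform once an index exists. It is an illustration under stated assumptions, not the PR's implementation: `IndexedTable` is a hypothetical class, and a dict of sets stands in for the B-Tree (so indexed values must be hashable).

```python
class IndexedTable:
    """Hypothetical in-memory table that keeps per-field equality indexes in sync."""

    def __init__(self):
        self._docs = {}      # doc_id -> document
        self._next_id = 1
        self._indexes = {}   # field -> {value -> set of doc_ids}

    def create_index(self, field):
        # Build the index from existing documents; O(n) once.
        index = {}
        for doc_id, doc in self._docs.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)
        self._indexes[field] = index

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)
        self._index_add(doc_id, self._docs[doc_id])
        return doc_id

    def update(self, doc_id, fields):
        doc = self._docs[doc_id]
        self._index_discard(doc_id, doc)  # drop entries for the old values
        doc.update(fields)
        self._index_add(doc_id, doc)      # re-add under the new values

    def remove(self, doc_id):
        self._index_discard(doc_id, self._docs.pop(doc_id))

    def search(self, field, value):
        index = self._indexes.get(field)
        if index is not None:
            return sorted(index.get(value, ()))
        # No index on this field: fall back to a full scan.
        return [i for i, d in self._docs.items()
                if field in d and d[field] == value]

    def _index_add(self, doc_id, doc):
        for field, index in self._indexes.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)

    def _index_discard(self, doc_id, doc):
        for field, index in self._indexes.items():
            if field in doc and doc[field] in index:
                ids = index[doc[field]]
                ids.discard(doc_id)
                if not ids:
                    del index[doc[field]]


table = IndexedTable()
first = table.insert({"name": "x"})
second = table.insert({"name": "y"})
table.create_index("name")
third = table.insert({"name": "x"})
```

Each write touches every registered index, which is why the trade-off favors read-heavy workloads with few indexed fields.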
Reproducing the benchmark
The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.
Relevant files:
- setup/tinydb/Dockerfile
- experiments/tinydb/benchmark_btree.py
- experiments/tinydb/run_btree.sh
To run:
    # From the root of eda-oss-performance
    bash experiments/tinydb/run_btree.sh
This builds the Docker image from setup/tinydb/Dockerfile, mounts the repository, and runs the benchmark script inside the container. Results are written to results/tinydb/result_tinydb_btree.json.
Feedback on the index implementation, API design, and edge cases is welcome.
Relates to #480 and #544.