Skip to content

feat(lsm): expose lsm api to python#6259

Merged
jackye1995 merged 6 commits intolance-format:mainfrom
zhangyue19921010:lsm-python
Apr 30, 2026
Merged

feat(lsm): expose lsm api to python#6259
jackye1995 merged 6 commits intolance-format:mainfrom
zhangyue19921010:lsm-python

Conversation

@zhangyue19921010
Copy link
Copy Markdown
Contributor

Expose LSM related API to Python

Writer:

  • initialize_mem_wal
  • mem_wal_writer

Reader:

  • LsmScanner
  • LsmPointLookupPlanner
  • LsmVectorSearchPlanner

Optimize:

  • mark_generations_as_merged

@github-actions github-actions Bot added enhancement New feature or request python labels Mar 23, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: feat(lsm): expose lsm api to python

Overall this is a well-structured PR with good test coverage and clean Python/Rust layering. A few issues to flag:

P0: GIL not released in ExecutionPlanReader::next()

In python/src/mem_wal.rs, the ExecutionPlanReader::next() implementation calls rt().spawn(None, ...), passing None instead of Some(py). This means the GIL is not released while the tokio future executes. Since next() is called from Python (which holds the GIL), this blocks the GIL for the entire duration of each batch fetch. If any downstream async work (or another Python thread) needs the GIL, this will deadlock.

Every other async call site in this PR correctly uses rt().block_on(Some(py), ...) to release the GIL. However, next() does not have access to py since it implements Iterator. Consider restructuring, e.g. collecting batches eagerly in to_reader while holding py, or using pyo3-async-runtimes to properly yield the GIL during iteration.

P1: put() collects all batches into memory

The put() method in PyRegionWriter materializes the entire input stream into a Vec before writing. Per project guidelines, this can OOM on large writes. Consider streaming batches to the writer incrementally instead of collecting them all at once.

Minor

  • Dead code: _unwrap_region_id() in python/python/lance/mem_wal.py is defined but never called. The UUID validation is already done on the Rust side. Remove it or wire it up.
  • Docstring typo: dataset.py line ~158 in mem_wal_writer docstring: initialize_mem_wal is missing its opening backtick.

@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

PR Review: feat(lsm): expose lsm api to python

All fixed.

@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

Hi @jackye1995 Sorry to bother you. Would u mind to take a look for this PR at your convenience? Really Appreciate!

Copy link
Copy Markdown
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the work, I fixed one thing, apart from that it looks good to me!

@jackye1995 jackye1995 merged commit e970e1d into lance-format:main Apr 30, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants