Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR implements performance optimizations for the JSON Lines reader by introducing zero-copy string parsing through slice-based parsing and expanding buffer protocol support to accept memoryview and bytearray inputs.
Key changes:
- Added slice-based parsing (
ParseLineKV,ParseStringSlice,ParseValueSlice) to avoid string allocations during parsing - Replaced explicit memoryview handling with generic buffer protocol support (
PyObject_GetBuffer) - Added benchmarking comparisons with PyArrow in addition to Opteryx
Reviewed Changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_memoryview_input.py | New test file validating buffer protocol support for bytes, memoryview, and bytearray inputs |
| rugo/jsonl_src/jsonl_reader.hpp | Added string_slices vector to JsonlColumn struct for storing string slice references |
| rugo/jsonl_src/jsonl_reader.cpp | Implemented slice-based parsing methods and refactored ReadJsonl to use zero-copy string parsing |
| rugo/jsonl_src/jsonl.pyx | Replaced memoryview-specific code with generic buffer protocol handling using PyObject_GetBuffer |
| rugo/jsonl_src/init.py | Reformatted imports to follow PEP 8 style guidelines |
| rugo/converters/orso.py | Removed trailing blank line |
| pyproject.toml | Version bump from 0.1.10 to 0.1.11 |
| examples/read_jsonl.py | Added path manipulation for running example from repository root |
| benchmarks/compare_opteryx_performance.py | Added PyArrow benchmarking support and improved output formatting |
| benchmark_jsonl.py | Reordered imports and reformatted string literals |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| if (c == 't' && pos_ + 4 <= size_ && | ||
| memcmp(data_ + pos_, "true", 4) == 0) { | ||
| type = JsonType::Boolean; | ||
| out_start = data_ + pos_ - data_; out_len = 4; |
There was a problem hiding this comment.
The calculation data_ + pos_ - data_ is redundant and simplifies to just pos_. This should be out_start = pos_; out_len = 4;
| if (c == 'f' && pos_ + 5 <= size_ && | ||
| memcmp(data_ + pos_, "false", 5) == 0) { | ||
| type = JsonType::Boolean; | ||
| out_start = data_ + pos_ - data_; out_len = 5; |
There was a problem hiding this comment.
The calculation data_ + pos_ - data_ is redundant and simplifies to just pos_. This should be out_start = pos_; out_len = 5;
| out_start = data_ + pos_ - data_; out_len = 5; | |
| out_start = pos_; out_len = 5; |
| if (c == '"') { | ||
| type = JsonType::String; | ||
| size_t s_start = 0, s_len = 0; | ||
| size_t saved_pos = pos_; |
There was a problem hiding this comment.
The variable saved_pos is declared but never used. It should be removed to improve code clarity.
| size_t saved_pos = pos_; |
|
|
||
| # Add rugo to path if running from source | ||
| sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) | ||
| sys.path.insert(1, os.path.join(sys.path[0], "../opteryx")) |
There was a problem hiding this comment.
This path manipulation appears to be hardcoded for a specific directory structure and may not work in different environments. Consider removing this line or adding a comment explaining its purpose.
| sys.path.insert(1, os.path.join(sys.path[0], "../opteryx")) | |
| # Add opteryx to path if running from source and the directory exists. | |
| opteryx_path = os.path.join(sys.path[0], "../opteryx") | |
| if os.path.isdir(opteryx_path): | |
| sys.path.insert(1, opteryx_path) | |
| # Note: This assumes a specific directory structure. If you install opteryx via pip, this is not needed. |
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
No description provided.