Skip to content

Json reader#47

Merged
joocer merged 3 commits intomainfrom
json-reader
Oct 21, 2025
Merged

Json reader#47
joocer merged 3 commits intomainfrom
json-reader

Conversation

@joocer
Copy link
Member

@joocer joocer commented Oct 21, 2025

No description provided.

Copilot AI review requested due to automatic review settings October 21, 2025 12:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements performance optimizations for the JSON Lines reader by introducing zero-copy string parsing through slice-based parsing and expanding buffer protocol support to accept memoryview and bytearray inputs.

Key changes:

  • Added slice-based parsing (ParseLineKV, ParseStringSlice, ParseValueSlice) to avoid string allocations during parsing
  • Replaced explicit memoryview handling with generic buffer protocol support (PyObject_GetBuffer)
  • Added benchmarking comparisons with PyArrow in addition to Opteryx

Reviewed Changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_memoryview_input.py New test file validating buffer protocol support for bytes, memoryview, and bytearray inputs
rugo/jsonl_src/jsonl_reader.hpp Added string_slices vector to JsonlColumn struct for storing string slice references
rugo/jsonl_src/jsonl_reader.cpp Implemented slice-based parsing methods and refactored ReadJsonl to use zero-copy string parsing
rugo/jsonl_src/jsonl.pyx Replaced memoryview-specific code with generic buffer protocol handling using PyObject_GetBuffer
rugo/jsonl_src/init.py Reformatted imports to follow PEP 8 style guidelines
rugo/converters/orso.py Removed trailing blank line
pyproject.toml Version bump from 0.1.10 to 0.1.11
examples/read_jsonl.py Added path manipulation for running example from repository root
benchmarks/compare_opteryx_performance.py Added PyArrow benchmarking support and improved output formatting
benchmark_jsonl.py Reordered imports and reformatted string literals

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

if (c == 't' && pos_ + 4 <= size_ &&
memcmp(data_ + pos_, "true", 4) == 0) {
type = JsonType::Boolean;
out_start = data_ + pos_ - data_; out_len = 4;
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculation data_ + pos_ - data_ is redundant and simplifies to just pos_. This should be out_start = pos_; out_len = 4;

Copilot uses AI. Check for mistakes.
if (c == 'f' && pos_ + 5 <= size_ &&
memcmp(data_ + pos_, "false", 5) == 0) {
type = JsonType::Boolean;
out_start = data_ + pos_ - data_; out_len = 5;
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculation data_ + pos_ - data_ is redundant and simplifies to just pos_. This should be out_start = pos_; out_len = 5;

Suggested change
out_start = data_ + pos_ - data_; out_len = 5;
out_start = pos_; out_len = 5;

Copilot uses AI. Check for mistakes.
if (c == '"') {
type = JsonType::String;
size_t s_start = 0, s_len = 0;
size_t saved_pos = pos_;
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable saved_pos is declared but never used. It should be removed to improve code clarity.

Suggested change
size_t saved_pos = pos_;

Copilot uses AI. Check for mistakes.

# Add rugo to path if running from source
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(1, os.path.join(sys.path[0], "../opteryx"))
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path manipulation appears to be hardcoded for a specific directory structure and may not work in different environments. Consider removing this line or adding a comment explaining its purpose.

Suggested change
sys.path.insert(1, os.path.join(sys.path[0], "../opteryx"))
# Add opteryx to path if running from source and the directory exists.
opteryx_path = os.path.join(sys.path[0], "../opteryx")
if os.path.isdir(opteryx_path):
sys.path.insert(1, opteryx_path)
# Note: This assumes a specific directory structure. If you install opteryx via pip, this is not needed.

Copilot uses AI. Check for mistakes.
@joocer joocer requested a review from Copilot October 21, 2025 12:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated no new comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@joocer joocer merged commit a10d366 into main Oct 21, 2025
5 checks passed
@joocer joocer deleted the json-reader branch October 21, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants