
feat(loaders): stream JSON via ijson; replace peek+replay with seek(0) where possible #125

Merged

garlontas merged 11 commits into main from copilot/refactor-json-xml-loaders on Apr 13, 2026

Conversation

Contributor

Copilot AI commented Apr 13, 2026

  • Fix indentation error in __xml_loader.py (extra space before def on lines 119 and 131)
  • Remove trailing newline in __xml_loader.py
  • Remove pointless module-level string literals in __json_loader.py (W0105)
  • Fix lines too long in __json_loader.py (C0301)
  • Run poetry lock

Summary by Sourcery

Stream JSON and XML loading to avoid reading entire documents into memory and wire up the new json streaming dependency.

New Features:

  • Introduce streaming JSON parsing via ijson for files and strings, yielding items incrementally as namedtuples.
  • Add streaming XML parsing using iterparse so elements are yielded incrementally instead of building a full DOM.
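The incremental XML feature above can be sketched with the standard library alone. This is a minimal illustration of the iterparse pattern (the function name, depth-tracking scheme, and yielded tuples are illustrative, not the loader's actual implementation):

```python
# Minimal sketch of streaming XML children with iterparse (stdlib only).
# Names and the yielded (tag, text) shape are illustrative assumptions.
import io
import xml.etree.ElementTree as ET

def stream_children(source):
    """Yield each direct child of the root as soon as its end tag is seen."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # first start event is the document root
            depth += 1
        else:  # "end"
            depth -= 1
            if depth == 1:
                yield elem.tag, (elem.text or "").strip()
                root.remove(elem)  # drop the processed child to free memory

doc = io.BytesIO(b"<items><a>1</a><b>2</b></items>")
print(list(stream_children(doc)))  # [('a', '1'), ('b', '2')]
```

Because each child is removed from the root after being yielded, peak memory stays proportional to one child rather than the whole document.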

Enhancements:

  • Update JSON namedtuple conversion to handle nested lists and dicts recursively for streamed items.
  • Refine XML loader behaviour to optionally yield each root child as it is closed and free memory by removing processed elements.
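The recursive namedtuple conversion mentioned above can be sketched as follows (the real helper is named `__dict_to_namedtuple`; this signature and the default type name are assumptions):

```python
# Illustrative sketch of recursive dict/list-to-namedtuple conversion.
from collections import namedtuple

def dict_to_namedtuple(obj, name="Item"):
    if isinstance(obj, dict):
        # Convert values first so nested dicts/lists become namedtuples too
        converted = {k: dict_to_namedtuple(v, k) for k, v in obj.items()}
        return namedtuple(name, converted.keys())(**converted)
    if isinstance(obj, list):
        return [dict_to_namedtuple(v, name) for v in obj]
    return obj  # scalars pass through unchanged

item = dict_to_namedtuple({"user": {"name": "Ada"}, "tags": ["a", "b"]})
print(item.user.name)  # Ada
print(item.tags)       # ['a', 'b']
```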

Build:

  • Add ijson as an optional dependency, define a json_loader extra, and include it in the aggregated all extra.
  • Include ijson in tox test dependencies and refresh the Poetry lockfile.
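The dependency wiring described above might look roughly like this in pyproject.toml (the version constraint and the other members of the `all` extra are placeholders, not the merged configuration):

```toml
[tool.poetry.dependencies]
ijson = { version = "*", optional = true }

[tool.poetry.extras]
json_loader = ["ijson"]
all = ["ijson"]  # alongside the project's other loader extras
```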

@sourcery-ai

sourcery-ai bot commented Apr 13, 2026

Reviewer's Guide

Refactors the JSON and XML loaders to stream data incrementally (using ijson for JSON and ElementTree.iterparse for XML), adds the ijson optional dependency and test wiring, and performs minor style/indentation fixes and lockfile refresh.

Sequence diagram for streaming JSON file loading with ijson

sequenceDiagram
    participant Caller
    participant json
    participant lazy_load_json_file
    participant stream_json_items
    participant PeekableBytesReader
    participant TextToBytesWrapper
    participant ijson
    participant jsonfile

    Caller->>json: json(src, read_from_src=False)
    json->>lazy_load_json_file: __lazy_load_json_file(file_path)
    lazy_load_json_file-->>Caller: generator
    loop iteration
        Caller->>lazy_load_json_file: next(generator)
        lazy_load_json_file->>+stream_json_items: __stream_json_items(jsonfile)
        stream_json_items->>jsonfile: read(_PEEK_SIZE)
        stream_json_items->>PeekableBytesReader: __init__(initial_bytes, TextToBytesWrapper(handle))
        stream_json_items->>TextToBytesWrapper: _TextToBytesWrapper(handle, encoding)
        alt root_is_array
            stream_json_items->>+ijson: items(PeekableBytesReader, item)
            ijson-->>stream_json_items: item_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(item_dict)
        else root_is_single_object
            stream_json_items->>+ijson: items(PeekableBytesReader, root_path)
            ijson-->>stream_json_items: obj_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(obj_dict)
        end
        lazy_load_json_file-->>Caller: namedtuple_item
    end

Sequence diagram for streaming XML parsing with iterparse

sequenceDiagram
    participant Caller
    participant xml
    participant lazy_parse_xml_file as _lazy_parse_xml_file
    participant iterparse_xml as _iterparse_xml
    participant ElementTree

    Caller->>xml: xml(src, read_from_src=False, retrieve_children, cast_types)
    xml->>lazy_parse_xml_file: _lazy_parse_xml_file(file_path, encoding, retrieve_children, cast_types)
    lazy_parse_xml_file-->>Caller: generator

    loop iteration
        Caller->>lazy_parse_xml_file: next(generator)
        lazy_parse_xml_file->>+iterparse_xml: _iterparse_xml(xmlfile, retrieve_children, cast_types)
        iterparse_xml->>+ElementTree: iterparse(source, events)
        ElementTree-->>iterparse_xml: (event, elem) stream
        alt retrieve_children == True and depth == 1 on end
            iterparse_xml->>iterparse_xml: __parse_xml(elem, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: child_namedtuple
        else retrieve_children == False and depth == 0 on end
            iterparse_xml->>iterparse_xml: __parse_xml(root, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: root_namedtuple
            iterparse_xml-->>lazy_parse_xml_file: return
        end
        lazy_parse_xml_file-->>Caller: namedtuple_item
    end

Class diagram for new JSON streaming helper classes

classDiagram
    class _TextToBytesWrapper {
      - _handle
      - _encoding
      + _TextToBytesWrapper(handle, encoding)
      + read(size)
    }

    class _PeekableBytesReader {
      - _buf
      - _src
      + _PeekableBytesReader(buffer, source)
      + read(size)
    }

    class __json_loader_module {
      + __stream_json_items(handle)
      + __dict_to_namedtuple(d, name)
    }

    _TextToBytesWrapper --> _PeekableBytesReader : wraps_source
    _PeekableBytesReader --> __json_loader_module : used_by
    __json_loader_module ..> _TextToBytesWrapper : creates
    __json_loader_module ..> _PeekableBytesReader : creates
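The two wrapper classes in the diagram above can be sketched like this. The attribute names follow the diagram, but the exact behaviour in `__json_loader.py` may differ (in particular, `read(size)` on the text wrapper counts characters, not bytes, which this sketch deliberately glosses over):

```python
# Hedged sketch of the two streaming helpers; names from the class diagram,
# behaviour assumed rather than copied from the merged code.
import io

class TextToBytesWrapper:
    """Adapt a text-mode handle to the bytes interface ijson expects."""
    def __init__(self, handle, encoding="utf-8"):
        self._handle = handle
        self._encoding = encoding

    def read(self, size=-1):
        # Simplification: size is interpreted as characters, then encoded.
        return self._handle.read(size).encode(self._encoding)

class PeekableBytesReader:
    """Replay an initial peek buffer, then delegate to the wrapped source."""
    def __init__(self, buffer, source):
        self._buf = io.BytesIO(buffer)
        self._src = source

    def read(self, size=-1):
        data = self._buf.read(size)
        if size < 0 or len(data) < size:
            data += self._src.read(-1 if size < 0 else size - len(data))
        return data

reader = PeekableBytesReader(b"[1, ", TextToBytesWrapper(io.StringIO("2, 3]")))
print(reader.read())  # b'[1, 2, 3]'
```

The point of the pair is that the bytes consumed while sniffing the JSON root are not lost: they are replayed from the buffer before the underlying handle is read again.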

Flow diagram for JSON root detection and streaming behaviour

flowchart TD
    A["Start __stream_json_items"] --> B["Read initial chunk from handle"]
    B --> C["Convert initial data to str and bytes"]
    C --> D["Strip leading whitespace"]
    D --> E{"Any non-whitespace data?"}
    E -->|No| F["Return without yielding items"]
    E -->|Yes| G["Inspect first non-whitespace character"]
    G --> H["Create PeekableBytesReader with initial_bytes and TextToBytesWrapper(handle)"]
    H --> I{"first_char == '['"}
    I -->|Yes| J["Use ijson.items(reader, item) to stream array elements"]
    I -->|No| K["Use ijson.items(reader, root_path) to read single object"]
    J --> L["For each item dict, call __dict_to_namedtuple and yield"]
    K --> M["Get first object dict, call __dict_to_namedtuple and yield if not None"]
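The peek-and-branch step in the flow above can be sketched as a small helper, assuming UTF-8 input (the real code uses a `_PEEK_SIZE` constant and feeds the returned bytes into the replaying reader):

```python
# Sketch of the root-detection step; function name and return shape are
# illustrative assumptions, not the loader's actual API.
import io

def detect_root(handle, peek_size=64):
    """Return ('array' | 'object' | None, peeked_bytes)."""
    initial = handle.read(peek_size)
    stripped = initial.lstrip()
    if not stripped:
        return None, initial  # nothing but whitespace: yield no items
    kind = "array" if stripped[:1] == b"[" else "object"
    return kind, initial      # peeked bytes must be replayed downstream

kind, _ = detect_root(io.BytesIO(b'  [{"a": 1}]'))
print(kind)  # array
```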

File-Level Changes

Change Details Files
Switch JSON loader from loading full content via json.loads to streaming parsing via ijson, including new helper wrappers to feed bytes to ijson and support both files and strings.
  • Replace jsonlib import and whole-file reads with a streaming pipeline based on ijson.items
  • Introduce _TextToBytesWrapper to adapt text-mode file handles to the byte interface expected by ijson
  • Introduce _PeekableBytesReader to replay an initial peek buffer before delegating subsequent reads to the underlying source
  • Add __stream_json_items helper that peeks at the first non-whitespace character to distinguish arrays from single objects and streams items accordingly
  • Make __dict_to_namedtuple recursive over nested dicts and lists to preserve previous behaviour under streaming
pystreamapi/loaders/__json/__json_loader.py
Refactor XML loader to use ElementTree.iterparse for true streaming rather than reading the entire document into memory.
  • Change _lazy_parse_xml_file to pass the open file object into a new _iterparse_xml function instead of reading the whole file into a string
  • Change _lazy_parse_xml_string to wrap the string in io.StringIO and pass it into _iterparse_xml
  • Introduce _iterparse_xml which drives ElementTree.iterparse, tracks element depth, and yields either root children incrementally (removing them to free memory) or the root element once, depending on retrieve_children
  • Remove the old _parse_xml_string_lazy and __flatten helpers that operated on a fully-built ElementTree structure
pystreamapi/loaders/__xml/__xml_loader.py
Declare and wire up the new ijson optional dependency for the JSON loader and tests, and update the lockfile.
  • Add ijson as an optional dependency in pyproject.toml with a json_loader extra and include it in the all extra
  • Ensure tox test environment installs ijson
  • Regenerate poetry.lock to capture the new dependency graph
pyproject.toml
tox.ini
poetry.lock

garlontas and others added 2 commits April 13, 2026 16:11
This PR refactors documentation and file formatting to comply with style guidelines. It wraps long docstring lines, cleans up extraneous blank lines at the ends of files, and adds missing module and function docstrings.

- Doc line too long: The patch rewraps overly long docstrings into multiple lines, breaking sentences at natural boundaries and aligning each line with proper indentation. This ensures that all documentation lines stay within the prescribed maximum line length.
- Multiple blank lines detected at end of the file: Removed extraneous blank lines at the end of files to enforce a single newline termination. This change eliminates unexpected trailing whitespace and maintains consistent file formatting.
- Missing module/function docstring: Added descriptive module-level and function-level docstrings where none existed, including explanations of purpose, parameters, and return values. These additions improve code readability and satisfy documentation coverage requirements.

> This Autofix was generated by AI. Please review the change before merging.
Copilot AI requested a review from garlontas April 13, 2026 14:20
Fix formatting of docstring in read method.
@garlontas garlontas marked this pull request as ready for review April 13, 2026 14:22

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In __stream_json_items, the BOM/leading-non-whitespace handling relies on lstrip() and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
  • In the XML iterparse path (_iterparse_xml), using root.remove(elem) for each child can be O(n²) on large documents; clearing processed elements (elem.clear()) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.
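One possible way to act on the BOM suggestion above (an assumption, not the fix that was merged) is to strip a UTF-8 BOM from the peeked bytes before inspecting the first character:

```python
# Hedged sketch: strip a UTF-8 BOM so root detection still sees '[' or '{'.
import codecs

def strip_bom(initial: bytes) -> bytes:
    if initial.startswith(codecs.BOM_UTF8):
        return initial[len(codecs.BOM_UTF8):]
    return initial

peeked = codecs.BOM_UTF8 + b'[{"a": 1}]'
print(strip_bom(peeked).lstrip()[:1])  # b'['
```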
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `__stream_json_items`, the BOM/leading-non-whitespace handling relies on `lstrip()` and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
- In the XML iterparse path (`_iterparse_xml`), using `root.remove(elem)` for each child can be O(n²) on large documents; clearing processed elements (`elem.clear()`) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.

## Individual Comments

### Comment 1
<location path="pystreamapi/loaders/__xml/__xml_loader.py" line_range="65-74" />
<code_context>
+                root = elem
+        else:  # 'end'
+            depth -= 1
+            if retrieve_children:
+                if depth == 1:
+                    yield __parse_xml(elem, cast_types)
+                    root.remove(elem)
+            else:
+                if depth == 0:
</code_context>
<issue_to_address>
**suggestion (performance):** When streaming XML children, consider explicitly clearing elements after removal to minimise memory usage.

In the `retrieve_children` branch you already `root.remove(elem)` once a child is fully parsed, which keeps the root’s children list small. However, `ElementTree` may still retain internal data for that element. To further reduce memory usage on large documents, also clear the element after removal:

```python
yield __parse_xml(elem, cast_types)
root.remove(elem)
elem.clear()
```
</issue_to_address>


@sonarqubecloud

@garlontas garlontas merged commit 8abcc2c into main Apr 13, 2026
11 checks passed
@garlontas garlontas deleted the copilot/refactor-json-xml-loaders branch April 13, 2026 14:38
