
feat(loaders): stream JSON via ijson; replace peek+replay with seek(0) where possible #125

Merged

garlontas merged 11 commits into main from copilot/refactor-json-xml-loaders on Apr 13, 2026

Conversation

Contributor

Copilot AI commented Apr 13, 2026

  • Fix indentation error in __xml_loader.py (extra space before def on lines 119 and 131)
  • Remove trailing newline in __xml_loader.py
  • Remove pointless module-level string literals in __json_loader.py (W0105)
  • Fix lines too long in __json_loader.py (C0301)
  • Run poetry lock

Summary by Sourcery

Stream JSON and XML loading to avoid reading entire documents into memory and wire up the new json streaming dependency.

New Features:

  • Introduce streaming JSON parsing via ijson for files and strings, yielding items incrementally as namedtuples.
  • Add streaming XML parsing using iterparse so elements are yielded incrementally instead of building a full DOM.
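The incremental XML feature above can be sketched with the standard library alone. This is a minimal illustration of the iterparse pattern (the function name, depth-tracking scheme, and yielded tuples are illustrative, not the loader's actual implementation):

```python
# Minimal sketch of streaming XML children with iterparse (stdlib only).
# Names and the yielded (tag, text) shape are illustrative assumptions.
import io
import xml.etree.ElementTree as ET

def stream_children(source):
    """Yield each direct child of the root as soon as its end tag is seen."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # first start event is the document root
            depth += 1
        else:  # "end"
            depth -= 1
            if depth == 1:
                yield elem.tag, (elem.text or "").strip()
                root.remove(elem)  # drop the processed child to free memory

doc = io.BytesIO(b"<items><a>1</a><b>2</b></items>")
print(list(stream_children(doc)))  # [('a', '1'), ('b', '2')]
```

Because each child is removed from the root after being yielded, peak memory stays proportional to one child rather than the whole document.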

Enhancements:

  • Update JSON namedtuple conversion to handle nested lists and dicts recursively for streamed items.
  • Refine XML loader behaviour to optionally yield each root child as it is closed and free memory by removing processed elements.
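The recursive namedtuple conversion mentioned above can be sketched as follows (the real helper is named `__dict_to_namedtuple`; this signature and the default type name are assumptions):

```python
# Illustrative sketch of recursive dict/list-to-namedtuple conversion.
from collections import namedtuple

def dict_to_namedtuple(obj, name="Item"):
    if isinstance(obj, dict):
        # Convert values first so nested dicts/lists become namedtuples too
        converted = {k: dict_to_namedtuple(v, k) for k, v in obj.items()}
        return namedtuple(name, converted.keys())(**converted)
    if isinstance(obj, list):
        return [dict_to_namedtuple(v, name) for v in obj]
    return obj  # scalars pass through unchanged

item = dict_to_namedtuple({"user": {"name": "Ada"}, "tags": ["a", "b"]})
print(item.user.name)  # Ada
print(item.tags)       # ['a', 'b']
```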

Build:

  • Add ijson as an optional dependency, define a json_loader extra, and include it in the aggregated all extra.
  • Include ijson in tox test dependencies and refresh the Poetry lockfile.
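The dependency wiring described above might look roughly like this in pyproject.toml (the version constraint and the other members of the `all` extra are placeholders, not the merged configuration):

```toml
[tool.poetry.dependencies]
ijson = { version = "*", optional = true }

[tool.poetry.extras]
json_loader = ["ijson"]
all = ["ijson"]  # alongside the project's other loader extras
```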

@sourcery-ai

sourcery-ai bot commented Apr 13, 2026

Reviewer's Guide

Refactors the JSON and XML loaders to stream data incrementally (using ijson for JSON and ElementTree.iterparse for XML), adds the ijson optional dependency and test wiring, and performs minor style/indentation fixes and lockfile refresh.

Sequence diagram for streaming JSON file loading with ijson

sequenceDiagram
    participant Caller
    participant json
    participant lazy_load_json_file
    participant stream_json_items
    participant PeekableBytesReader
    participant TextToBytesWrapper
    participant ijson
    participant jsonfile

    Caller->>json: json(src, read_from_src=False)
    json->>lazy_load_json_file: __lazy_load_json_file(file_path)
    lazy_load_json_file-->>Caller: generator
    loop iteration
        Caller->>lazy_load_json_file: next(generator)
        lazy_load_json_file->>+stream_json_items: __stream_json_items(jsonfile)
        stream_json_items->>jsonfile: read(_PEEK_SIZE)
        stream_json_items->>PeekableBytesReader: __init__(initial_bytes, TextToBytesWrapper(handle))
        stream_json_items->>TextToBytesWrapper: _TextToBytesWrapper(handle, encoding)
        alt root_is_array
            stream_json_items->>+ijson: items(PeekableBytesReader, item)
            ijson-->>stream_json_items: item_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(item_dict)
        else root_is_single_object
            stream_json_items->>+ijson: items(PeekableBytesReader, root_path)
            ijson-->>stream_json_items: obj_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(obj_dict)
        end
        lazy_load_json_file-->>Caller: namedtuple_item
    end

Sequence diagram for streaming XML parsing with iterparse

sequenceDiagram
    participant Caller
    participant xml
    participant lazy_parse_xml_file as _lazy_parse_xml_file
    participant iterparse_xml as _iterparse_xml
    participant ElementTree

    Caller->>xml: xml(src, read_from_src=False, retrieve_children, cast_types)
    xml->>lazy_parse_xml_file: _lazy_parse_xml_file(file_path, encoding, retrieve_children, cast_types)
    lazy_parse_xml_file-->>Caller: generator

    loop iteration
        Caller->>lazy_parse_xml_file: next(generator)
        lazy_parse_xml_file->>+iterparse_xml: _iterparse_xml(xmlfile, retrieve_children, cast_types)
        iterparse_xml->>+ElementTree: iterparse(source, events)
        ElementTree-->>iterparse_xml: (event, elem) stream
        alt retrieve_children == True and depth == 1 on end
            iterparse_xml->>iterparse_xml: __parse_xml(elem, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: child_namedtuple
        else retrieve_children == False and depth == 0 on end
            iterparse_xml->>iterparse_xml: __parse_xml(root, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: root_namedtuple
            iterparse_xml-->>lazy_parse_xml_file: return
        end
        lazy_parse_xml_file-->>Caller: namedtuple_item
    end

Class diagram for new JSON streaming helper classes

classDiagram
    class _TextToBytesWrapper {
      - _handle
      - _encoding
      + _TextToBytesWrapper(handle, encoding)
      + read(size)
    }

    class _PeekableBytesReader {
      - _buf
      - _src
      + _PeekableBytesReader(buffer, source)
      + read(size)
    }

    class __json_loader_module {
      + __stream_json_items(handle)
      + __dict_to_namedtuple(d, name)
    }

    _TextToBytesWrapper --> _PeekableBytesReader : wraps_source
    _PeekableBytesReader --> __json_loader_module : used_by
    __json_loader_module ..> _TextToBytesWrapper : creates
    __json_loader_module ..> _PeekableBytesReader : creates
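The two wrapper classes in the diagram above can be sketched like this. The attribute names follow the diagram, but the exact behaviour in `__json_loader.py` may differ (in particular, `read(size)` on the text wrapper counts characters, not bytes, which this sketch deliberately glosses over):

```python
# Hedged sketch of the two streaming helpers; names from the class diagram,
# behaviour assumed rather than copied from the merged code.
import io

class TextToBytesWrapper:
    """Adapt a text-mode handle to the bytes interface ijson expects."""
    def __init__(self, handle, encoding="utf-8"):
        self._handle = handle
        self._encoding = encoding

    def read(self, size=-1):
        # Simplification: size is interpreted as characters, then encoded.
        return self._handle.read(size).encode(self._encoding)

class PeekableBytesReader:
    """Replay an initial peek buffer, then delegate to the wrapped source."""
    def __init__(self, buffer, source):
        self._buf = io.BytesIO(buffer)
        self._src = source

    def read(self, size=-1):
        data = self._buf.read(size)
        if size < 0 or len(data) < size:
            data += self._src.read(-1 if size < 0 else size - len(data))
        return data

reader = PeekableBytesReader(b"[1, ", TextToBytesWrapper(io.StringIO("2, 3]")))
print(reader.read())  # b'[1, 2, 3]'
```

The point of the pair is that the bytes consumed while sniffing the JSON root are not lost: they are replayed from the buffer before the underlying handle is read again.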

Flow diagram for JSON root detection and streaming behaviour

flowchart TD
    A["Start __stream_json_items"] --> B["Read initial chunk from handle"]
    B --> C["Convert initial data to str and bytes"]
    C --> D["Strip leading whitespace"]
    D --> E{"Any non-whitespace data?"}
    E -->|No| F["Return without yielding items"]
    E -->|Yes| G["Inspect first non-whitespace character"]
    G --> H["Create PeekableBytesReader with initial_bytes and TextToBytesWrapper(handle)"]
    H --> I{"first_char == '['"}
    I -->|Yes| J["Use ijson.items(reader, item) to stream array elements"]
    I -->|No| K["Use ijson.items(reader, root_path) to read single object"]
    J --> L["For each item dict, call __dict_to_namedtuple and yield"]
    K --> M["Get first object dict, call __dict_to_namedtuple and yield if not None"]
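The peek-and-branch step in the flow above can be sketched as a small helper, assuming UTF-8 input (the real code uses a `_PEEK_SIZE` constant and feeds the returned bytes into the replaying reader):

```python
# Sketch of the root-detection step; function name and return shape are
# illustrative assumptions, not the loader's actual API.
import io

def detect_root(handle, peek_size=64):
    """Return ('array' | 'object' | None, peeked_bytes)."""
    initial = handle.read(peek_size)
    stripped = initial.lstrip()
    if not stripped:
        return None, initial  # nothing but whitespace: yield no items
    kind = "array" if stripped[:1] == b"[" else "object"
    return kind, initial      # peeked bytes must be replayed downstream

kind, _ = detect_root(io.BytesIO(b'  [{"a": 1}]'))
print(kind)  # array
```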

File-Level Changes

Change Details Files
Switch JSON loader from loading full content via json.loads to streaming parsing via ijson, including new helper wrappers to feed bytes to ijson and support both files and strings.
  • Replace jsonlib import and whole-file reads with a streaming pipeline based on ijson.items
  • Introduce _TextToBytesWrapper to adapt text-mode file handles to the byte interface expected by ijson
  • Introduce _PeekableBytesReader to replay an initial peek buffer before delegating subsequent reads to the underlying source
  • Add __stream_json_items helper that peeks at the first non-whitespace character to distinguish arrays from single objects and streams items accordingly
  • Make __dict_to_namedtuple recursive over nested dicts and lists to preserve previous behaviour under streaming
pystreamapi/loaders/__json/__json_loader.py
Refactor XML loader to use ElementTree.iterparse for true streaming rather than reading the entire document into memory.
  • Change _lazy_parse_xml_file to pass the open file object into a new _iterparse_xml function instead of reading the whole file into a string
  • Change _lazy_parse_xml_string to wrap the string in io.StringIO and pass it into _iterparse_xml
  • Introduce _iterparse_xml which drives ElementTree.iterparse, tracks element depth, and yields either root children incrementally (removing them to free memory) or the root element once, depending on retrieve_children
  • Remove the old _parse_xml_string_lazy and __flatten helpers that operated on a fully-built ElementTree structure
pystreamapi/loaders/__xml/__xml_loader.py
Declare and wire up the new ijson optional dependency for the JSON loader and tests, and update the lockfile.
  • Add ijson as an optional dependency in pyproject.toml with a json_loader extra and include it in the all extra
  • Ensure tox test environment installs ijson
  • Regenerate poetry.lock to capture the new dependency graph
pyproject.toml
tox.ini
poetry.lock

garlontas and others added 2 commits April 13, 2026 16:11
This PR refactors documentation and file formatting to comply with style guidelines. It wraps long docstring lines, cleans up extraneous blank lines at the ends of files, and adds missing module and function docstrings.

- Doc line too long: The patch rewraps overly long docstrings into multiple lines, breaking sentences at natural boundaries and aligning each line with proper indentation. This ensures that all documentation lines stay within the prescribed maximum line length.
- Multiple blank lines detected at end of the file: Removed extraneous blank lines at the end of files to enforce a single newline termination. This change eliminates unexpected trailing whitespace and maintains consistent file formatting.
- Missing module/function docstring: Added descriptive module-level and function-level docstrings where none existed, including explanations of purpose, parameters, and return values. These additions improve code readability and satisfy documentation coverage requirements.

> This Autofix was generated by AI. Please review the change before merging.
Copilot AI requested a review from garlontas April 13, 2026 14:20
Fix formatting of docstring in read method.
@garlontas garlontas marked this pull request as ready for review April 13, 2026 14:22

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In __stream_json_items, the BOM/leading-non-whitespace handling relies on lstrip() and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
  • In the XML iterparse path (_iterparse_xml), using root.remove(elem) for each child can be O(n²) on large documents; clearing processed elements (elem.clear()) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.
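One possible way to act on the BOM suggestion above (an assumption, not the fix that was merged) is to strip a UTF-8 BOM from the peeked bytes before inspecting the first character:

```python
# Hedged sketch: strip a UTF-8 BOM so root detection still sees '[' or '{'.
import codecs

def strip_bom(initial: bytes) -> bytes:
    if initial.startswith(codecs.BOM_UTF8):
        return initial[len(codecs.BOM_UTF8):]
    return initial

peeked = codecs.BOM_UTF8 + b'[{"a": 1}]'
print(strip_bom(peeked).lstrip()[:1])  # b'['
```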
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `__stream_json_items`, the BOM/leading-non-whitespace handling relies on `lstrip()` and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
- In the XML iterparse path (`_iterparse_xml`), using `root.remove(elem)` for each child can be O(n²) on large documents; clearing processed elements (`elem.clear()`) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.

## Individual Comments

### Comment 1
<location path="pystreamapi/loaders/__xml/__xml_loader.py" line_range="65-74" />
<code_context>
+                root = elem
+        else:  # 'end'
+            depth -= 1
+            if retrieve_children:
+                if depth == 1:
+                    yield __parse_xml(elem, cast_types)
+                    root.remove(elem)
+            else:
+                if depth == 0:
</code_context>
<issue_to_address>
**suggestion (performance):** When streaming XML children, consider explicitly clearing elements after removal to minimise memory usage.

In the `retrieve_children` branch you already `root.remove(elem)` once a child is fully parsed, which keeps the root’s children list small. However, `ElementTree` may still retain internal data for that element. To further reduce memory usage on large documents, also clear the element after removal:

```python
yield __parse_xml(elem, cast_types)
root.remove(elem)
elem.clear()
```
</issue_to_address>


@sonarqubecloud

@garlontas garlontas merged commit 8abcc2c into main Apr 13, 2026
11 checks passed
@garlontas garlontas deleted the copilot/refactor-json-xml-loaders branch April 13, 2026 14:38
