Conversation
Agent-Logs-Url: https://github.com/pickwicksoft/pystreamapi/sessions/82b1cf1d-99d7-44ca-b24f-32dd8310dc66 Co-authored-by: garlontas <70283087+garlontas@users.noreply.github.com>
Copilot created this pull request from a session on behalf of garlontas on April 13, 2026 at 14:08.
Reviewer's Guide

Refactors the JSON and XML loaders to stream data incrementally (using ijson for JSON and ElementTree.iterparse for XML), adds the ijson optional dependency and test wiring, and performs minor style/indentation fixes and a lockfile refresh.

Sequence diagram for streaming JSON file loading with ijson

```mermaid
sequenceDiagram
    participant Caller
    participant json
    participant lazy_load_json_file
    participant stream_json_items
    participant PeekableBytesReader
    participant TextToBytesWrapper
    participant ijson
    participant jsonfile
    Caller->>json: json(src, read_from_src=False)
    json->>lazy_load_json_file: __lazy_load_json_file(file_path)
    lazy_load_json_file-->>Caller: generator
    loop iteration
        Caller->>lazy_load_json_file: next(generator)
        lazy_load_json_file->>stream_json_items: __stream_json_items(jsonfile)
        stream_json_items->>jsonfile: read(_PEEK_SIZE)
        stream_json_items->>PeekableBytesReader: __init__(initial_bytes, TextToBytesWrapper(handle))
        stream_json_items->>TextToBytesWrapper: _TextToBytesWrapper(handle, encoding)
        alt root_is_array
            stream_json_items->>ijson: items(PeekableBytesReader, item)
            ijson-->>stream_json_items: item_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(item_dict)
        else root_is_single_object
            stream_json_items->>ijson: items(PeekableBytesReader, root_path)
            ijson-->>stream_json_items: obj_dict
            stream_json_items-->>lazy_load_json_file: __dict_to_namedtuple(obj_dict)
        end
        lazy_load_json_file-->>Caller: namedtuple_item
    end
```
Sequence diagram for streaming XML parsing with iterparse

```mermaid
sequenceDiagram
    participant Caller
    participant xml
    participant lazy_parse_xml_file as _lazy_parse_xml_file
    participant iterparse_xml as _iterparse_xml
    participant ElementTree
    Caller->>xml: xml(src, read_from_src=False, retrieve_children, cast_types)
    xml->>lazy_parse_xml_file: _lazy_parse_xml_file(file_path, encoding, retrieve_children, cast_types)
    lazy_parse_xml_file-->>Caller: generator
    loop iteration
        Caller->>lazy_parse_xml_file: next(generator)
        lazy_parse_xml_file->>iterparse_xml: _iterparse_xml(xmlfile, retrieve_children, cast_types)
        iterparse_xml->>ElementTree: iterparse(source, events)
        ElementTree-->>iterparse_xml: (event, elem) stream
        alt retrieve_children == True and depth == 1 on end
            iterparse_xml->>iterparse_xml: __parse_xml(elem, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: child_namedtuple
        else retrieve_children == False and depth == 0 on end
            iterparse_xml->>iterparse_xml: __parse_xml(root, cast_types)
            iterparse_xml-->>lazy_parse_xml_file: root_namedtuple
            iterparse_xml-->>lazy_parse_xml_file: return
        end
        lazy_parse_xml_file-->>Caller: namedtuple_item
    end
```
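The iterparse flow above can be sketched as a small generator. This is an illustration based on the diagram, not the PR's actual `_iterparse_xml` (which also casts types and builds namedtuples via `__parse_xml`); here completed children are yielded as plain dicts, and the function name is invented for the sketch:

```python
import io
import xml.etree.ElementTree as ET

def iterparse_children(source, retrieve_children=True):
    """Stream the direct children (or just the root tag) of an XML
    document without building the whole tree in memory."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem
        else:  # 'end'
            depth -= 1
            if retrieve_children and depth == 1:
                # A direct child of the root is complete: yield its data,
                # then detach it from the root so memory stays bounded.
                yield {child.tag: child.text for child in elem}
                root.remove(elem)
            elif not retrieve_children and depth == 0:
                # Whole document parsed; yield once and stop.
                yield elem.tag
                return

demo = io.BytesIO(b"<employees><e><name>Ann</name></e><e><name>Bob</name></e></employees>")
print(list(iterparse_children(demo)))  # [{'name': 'Ann'}, {'name': 'Bob'}]
```

The key point is that each child is handled as soon as its `end` event fires, and removed from the root immediately afterwards, so only one child subtree is alive at a time.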
Class diagram for new JSON streaming helper classes

```mermaid
classDiagram
    class _TextToBytesWrapper {
        -_handle
        -_encoding
        +_TextToBytesWrapper(handle, encoding)
        +read(size)
    }
    class _PeekableBytesReader {
        -_buf
        -_src
        +_PeekableBytesReader(buffer, source)
        +read(size)
    }
    class __json_loader_module {
        +__stream_json_items(handle)
        +__dict_to_namedtuple(d, name)
    }
    _TextToBytesWrapper --> _PeekableBytesReader : wraps_source
    _PeekableBytesReader --> __json_loader_module : used_by
    __json_loader_module ..> _TextToBytesWrapper : creates
    __json_loader_module ..> _PeekableBytesReader : creates
```
Flow diagram for JSON root detection and streaming behaviour

```mermaid
flowchart TD
    A["Start __stream_json_items"] --> B["Read initial chunk from handle"]
    B --> C["Convert initial data to str and bytes"]
    C --> D["Strip leading whitespace"]
    D --> E{"Any non-whitespace data?"}
    E -->|No| F["Return without yielding items"]
    E -->|Yes| G["Inspect first non-whitespace character"]
    G --> H["Create PeekableBytesReader with initial_bytes and TextToBytesWrapper(handle)"]
    H --> I{"first_char == '['"}
    I -->|Yes| J["Use ijson.items(reader, item) to stream array elements"]
    I -->|No| K["Use ijson.items(reader, root_path) to read single object"]
    J --> L["For each item dict, call __dict_to_namedtuple and yield"]
    K --> M["Get first object dict, call __dict_to_namedtuple and yield if not None"]
```
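The root-detection step in the flowchart can be sketched as below. The real loader hands a peekable reader to `ijson.items()`, but to keep this illustration self-contained the stdlib `json` module stands in for the streaming parser, and all names here are illustrative rather than the PR's actual identifiers:

```python
import io
import json

def stream_json_items(handle, peek_size=1024):
    """Peek at the first non-whitespace character to decide whether the
    root is an array or a single object, then yield one dict per item."""
    initial = handle.read(peek_size)
    stripped = initial.lstrip()
    if not stripped:
        return  # empty or whitespace-only document: yield nothing
    first_char = stripped[0]
    # The real implementation wraps `initial` plus the rest of the handle
    # in a peekable reader and streams it through ijson; here we simply
    # reassemble the text and parse it in one go for illustration.
    document = json.loads(initial + handle.read())
    if first_char == "[":
        yield from document  # array root: one item per element
    else:
        yield document       # object root: a single item
```

The array/object branch matters because the two roots need different ijson item paths (`item` for array elements vs the root path for a lone object), even though both ultimately yield dicts to the namedtuple converter.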
This PR refactors documentation and file formatting to comply with style guidelines. It wraps long docstring lines, cleans up extraneous blank lines at the ends of files, and adds missing module and function docstrings.

- Doc line too long: rewraps overly long docstrings across multiple lines, breaking sentences at natural boundaries and aligning each line with proper indentation, so that all documentation lines stay within the prescribed maximum line length.
- Multiple blank lines detected at end of the file: removes extraneous blank lines at the end of files to enforce a single newline termination, eliminating unexpected trailing whitespace and keeping file formatting consistent.
- Missing module/function docstring: adds descriptive module-level and function-level docstrings where none existed, including explanations of purpose, parameters, and return values, improving code readability and satisfying documentation coverage requirements.

> This Autofix was generated by AI. Please review the change before merging.
Agent-Logs-Url: https://github.com/pickwicksoft/pystreamapi/sessions/5d738f28-a5b0-447a-b1aa-dcc973d07552 Co-authored-by: garlontas <70283087+garlontas@users.noreply.github.com>
Fix formatting of docstring in read method.
Hey - I've found 1 issue, and left some high level feedback:
- In `__stream_json_items`, the BOM/leading-non-whitespace handling relies on `lstrip()` and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
- In the XML iterparse path (`_iterparse_xml`), using `root.remove(elem)` for each child can be O(n²) on large documents; clearing processed elements (`elem.clear()`) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.
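For the first point, the BOM handling could look roughly like this. This is a hedged sketch of the suggested fix, not code the PR actually adopted, and the helper name is invented:

```python
import codecs

def first_json_char(initial_bytes):
    """Return the first significant character of a JSON document,
    skipping an optional UTF-8 BOM and leading whitespace, or None
    if the document contains no significant data."""
    # Strip the UTF-8 byte-order mark (b'\xef\xbb\xbf') if present,
    # so BOM-prefixed documents are still detected as array vs object.
    if initial_bytes.startswith(codecs.BOM_UTF8):
        initial_bytes = initial_bytes[len(codecs.BOM_UTF8):]
    stripped = initial_bytes.lstrip()
    return chr(stripped[0]) if stripped else None
```

Without the BOM strip, a document beginning with `b'\xef\xbb\xbf['` would fail the `first_char == '['` check and be misclassified as a single-object root.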
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `__stream_json_items`, the BOM/leading-non-whitespace handling relies on `lstrip()` and inspecting the first char; consider explicitly handling a UTF-8 BOM (and possibly other markers) so that JSON documents starting with a BOM are still correctly detected as array vs object roots.
- In the XML iterparse path (`_iterparse_xml`), using `root.remove(elem)` for each child can be O(n²) on large documents; clearing processed elements (`elem.clear()`) or periodically clearing the root can reduce the cost of frequent removals while still keeping memory usage low.
## Individual Comments
### Comment 1
<location path="pystreamapi/loaders/__xml/__xml_loader.py" line_range="65-74" />
<code_context>
+ root = elem
+ else: # 'end'
+ depth -= 1
+ if retrieve_children:
+ if depth == 1:
+ yield __parse_xml(elem, cast_types)
+ root.remove(elem)
+ else:
+ if depth == 0:
</code_context>
<issue_to_address>
**suggestion (performance):** When streaming XML children, consider explicitly clearing elements after removal to minimise memory usage.
In the `retrieve_children` branch you already `root.remove(elem)` once a child is fully parsed, which keeps the root’s children list small. However, `ElementTree` may still retain internal data for that element. To further reduce memory usage on large documents, also clear the element after removal:
```python
yield __parse_xml(elem, cast_types)
root.remove(elem)
elem.clear()
```
</issue_to_address>
Remove redundant `root.remove(elem)` call in XML parsing.
__xml_loader.py (extra space before def on lines 119 and 131)
__xml_loader.py
__json_loader.py (W0105)
__json_loader.py (C0301)
poetry lock

Summary by Sourcery
Stream JSON and XML loading to avoid reading entire documents into memory, and wire up the new JSON streaming dependency (ijson).