Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add large file support for read_xml #45724

Merged
merged 15 commits into from
Mar 18, 2022

Conversation

ParfaitG
Copy link
Contributor

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these tests seem to use the same data source / structure as the tests in test_xml.py can they be shared.

Also could you reduce the amount of url reading involved in the testing (i.e. save the source as a file and read directly)? I've been trying to reduce the amount of CI testing flakiness due to this.

"for value in iterparse"
)

if (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein, your thoughts on this handling? Since iterparse can potentially read very large XML files (1 GB, 5 GB, 10 GB+), this checks for strings, online docs, or compressed docs. The idea is to avoid get_handle downloading or extracting such large content in memory and raise MemoryError or OSError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, some people have lots of memory and would be fine downloading a 10GB file. Might be easier to have such a warning as part of the doc-string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterparse is memory efficient for large files. Only physical, fully extracted files on hard disk is being allowed in this method. This raises ParserError for online or compressed sources to avoid in-memory downloading or decompression by get_handle. Users may have a 10GB XML file on local disk to be iterparsed here. Do these if conditions exhaust all non-file system possibilities? Can get_handle or a related method check if path is a local file and return the path without reading content?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use pandas.io.common.is_url(...) or pandas.io.common.is_fsspec_url(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are already using this :) and some more conditions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also add a local_only keyword to get_handle but that might make get_handle even more complex.

@ParfaitG
Copy link
Contributor Author

ParfaitG commented Feb 1, 2022

@mroeschke, I combined tests with tests in original and in dtypes. For this new feature, one URL is tested to raise an error so may never reach the endpoint. I removed any new S3 calls. Given the large size potential, this feature requires only uncompressed file system paths.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2022

@ParfaitG if you can merge master

@jreback jreback added the IO XML read_xml, to_xml label Feb 27, 2022
pandas/io/xml.py Show resolved Hide resolved
pandas/io/xml.py Show resolved Hide resolved
pandas/io/xml.py Show resolved Hide resolved


def test_parse_dates_true(parser):
df_result = read_xml(xml_dates, parse_dates=True, parser=parser)
with tm.ensure_clean() as path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may want to create a helper function to tests iterparse vs full read in a more concise way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea and now implemented.

@jreback
Copy link
Contributor

jreback commented Mar 16, 2022

@ParfaitG can you merge master

@jreback jreback added this to the 1.5 milestone Mar 18, 2022
@jreback
Copy link
Contributor

jreback commented Mar 18, 2022

lgtm, @mroeschke if any comments.

@@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet=xsl)
df

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this entire xml section in the user guide linked to the read_xml docstring? If not it would be good to link them under the Notes of a docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small question otherwise LGTM

@mroeschke mroeschke merged commit fa7e31b into pandas-dev:main Mar 18, 2022
@mroeschke
Copy link
Member

Awesome, thanks for sticking with it

@ParfaitG ParfaitG deleted the xml_iterparse branch March 18, 2022 21:59
@bailsman
Copy link

Awesome feature, already started using it!

What is the purpose of this bit at

del elem.getparent()[0]

This started giving me "TypeError: 'NoneType' object does not support item deletion" on my XML files. The equivalent ElementTree parser does not have this (only lxml). After removing it, everything is working well, at least on my xml file.

As to attribute parsing, shouldn't we parse only on row_node? Otherwise, another element entirely that's a child of row_node could have the attribute but it may not be what we're looking for.

Maybe this? (untested)

diff --git a/pandas/io/xml.py b/pandas/io/xml.py
index 181b0fe..8740e8c 100644
--- a/pandas/io/xml.py
+++ b/pandas/io/xml.py
@@ -411,13 +411,14 @@ class _EtreeFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]

             if event == "end":
                 if curr_elem == row_node and row is not None:
@@ -659,22 +660,20 @@ class _LxmlFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]

@ParfaitG
Copy link
Contributor Author

Thank you for your comment, @bailsman. Given this PR is closed, can you raise a potential BUG issue of this new feature with a reproducible example? And possibly ask a QST issue for your attribute parsing question or ask on StackOverflow. Specifically, we need a minimum XML sample and attempted code with desired result.

@bailsman
Copy link

Before I go off and try to debug this so that I can make a reproducible example, can I just ask you what this line does and what the purpose of this line was intended to be?

del elem.getparent()[0]

It's lxml specific (the elementtree equivalent doesn't have it), and, at least, in my case it seems to work fine without it and throws errors if I keep it. On initial investigation, the problem doesn't occur on small files that I've tested, so it would take some additional debugging effort on my part to figure out how to reproduce it, which I'd rather invest with fuller understanding to save time.

@ParfaitG
Copy link
Contributor Author

Correct. That is an lxml-specific method where iterparse docs indicate that line intends to clean up preceding siblings with stated goal:

If you have elements with a long list of children in your XML file and want to save more memory during parsing, you can clean up the preceding siblings of the current element

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* ENH: Add large file support for read_xml

* Combine tests, slightly fix docs

* Adjust pytest decorator on URL test; fix doc strings

* Adjust tests for helper function

* Add iterparse feature to some tests

* Add IO docs link in docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO XML read_xml, to_xml
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants