ENH: Add large file support for read_xml #45724

ParfaitG · 2022-01-30T22:56:02Z

closes BUG: read_xml not support large file #45442
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/v1.5.0.rst file if fixing a bug or adding a new feature.

mroeschke

A lot of these tests seem to use the same data source / structure as the tests in test_xml.py. Can they be shared?

Also could you reduce the amount of url reading involved in the testing (i.e. save the source as a file and read directly)? I've been trying to reduce the amount of CI testing flakiness due to this.

ParfaitG · 2022-01-31T14:13:36Z

pandas/io/xml.py

+                "for value in iterparse"
+            )
+
+        if (


@twoertwein, your thoughts on this handling? Since iterparse can potentially read very large XML files (1 GB, 5 GB, 10 GB+), this checks for strings, online docs, or compressed docs. The idea is to keep get_handle from downloading or extracting such large content into memory, which could raise MemoryError or OSError.

I'm not sure; some people have lots of memory and would be fine downloading a 10 GB file. Might be easier to have such a warning as part of the doc-string?

Iterparse is memory efficient for large files. Only physical, fully extracted files on hard disk are allowed in this method. This raises ParserError for online or compressed sources to avoid in-memory downloading or decompression by get_handle. Users may have a 10 GB XML file on local disk to be iterparsed here. Do these if conditions exhaust all non-file-system possibilities? Can get_handle or a related method check if the path is a local file and return the path without reading content?

Could use pandas.io.common.is_url(...) or pandas.io.common.is_fsspec_url(...)

You are already using this :) and some more conditions

Could also add a local_only keyword to get_handle but that might make get_handle even more complex.
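To make the check discussed in this thread concrete, here is a minimal standard-library sketch of deciding whether an input is a local, uncompressed file path. The helper name `is_local_uncompressed_file` and the suffix list are assumptions for illustration only; this is not the code in pandas/io/xml.py, which relies on pandas' own helpers such as `is_url` and `is_fsspec_url`.

```python
import os
from urllib.parse import urlparse

# Hypothetical suffix list for illustration; not taken from pandas.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".zip", ".xz", ".zst", ".tar")


def is_local_uncompressed_file(path_or_buffer) -> bool:
    """Return True only for an existing, uncompressed file on the local disk."""
    # File objects, bytes, and other buffers are rejected outright.
    if not isinstance(path_or_buffer, (str, os.PathLike)):
        return False
    path = os.fspath(path_or_buffer)
    if not isinstance(path, str):  # a PathLike may return bytes
        return False
    # A literal XML document passed as a string starts with "<".
    if path.lstrip().startswith("<"):
        return False
    # Any multi-letter scheme (http, s3, ftp, file, ...) counts as non-local here.
    # Single-letter schemes are treated as Windows drive letters.
    if len(urlparse(path).scheme) > 1:
        return False
    if path.lower().endswith(COMPRESSED_SUFFIXES):
        return False
    return os.path.isfile(path)
```

Checking the suffix is only a heuristic: a compressed file with an unusual extension would still pass, so a stricter version might inspect the file's leading bytes instead.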

ParfaitG · 2022-02-01T19:04:24Z

@mroeschke, I combined tests with tests in original and in dtypes. For this new feature, one URL is tested to raise an error so may never reach the endpoint. I removed any new S3 calls. Given the large size potential, this feature requires only uncompressed file system paths.

jreback · 2022-02-27T20:32:18Z

@ParfaitG if you can merge master

pandas/io/xml.py

jreback · 2022-02-28T14:14:18Z

pandas/tests/io/xml/test_xml_dtypes.py



 def test_parse_dates_true(parser):
    df_result = read_xml(xml_dates, parse_dates=True, parser=parser)
+    with tm.ensure_clean() as path:


may want to create a helper function to test iterparse vs full read in a more concise way

Good idea and now implemented.

jreback · 2022-03-16T01:09:58Z

@ParfaitG can you merge master

jreback · 2022-03-18T00:55:30Z

lgtm, @mroeschke if any comments.

mroeschke · 2022-03-18T01:09:29Z

doc/source/user_guide/io.rst

@@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
   df = pd.read_xml(xml, stylesheet=xsl)
   df

+For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`


Is this entire xml section in the user guide linked to the read_xml docstring? If not, it would be good to link them under the Notes section of the docstring.

Good idea! Done.

mroeschke

One small question otherwise LGTM

mroeschke · 2022-03-18T20:05:00Z

Awesome, thanks for sticking with it

bailsman · 2022-06-13T15:50:14Z

Awesome feature, already started using it!

What is the purpose of this bit at

pandas/pandas/io/xml.py

Line 677 in fa7e31b

del elem.getparent()[0]

This started giving me "TypeError: 'NoneType' object does not support item deletion" on my XML files. The equivalent ElementTree parser does not have this (only lxml). After removing it, everything is working well, at least on my xml file.

As to attribute parsing, shouldn't we parse only on row_node? Otherwise, another element entirely that's a child of row_node could have the attribute but it may not be what we're looking for.

Maybe this? (untested)

diff --git a/pandas/io/xml.py b/pandas/io/xml.py
index 181b0fe..8740e8c 100644
--- a/pandas/io/xml.py
+++ b/pandas/io/xml.py
@@ -411,13 +411,14 @@ class _EtreeFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]

             if event == "end":
                 if curr_elem == row_node and row is not None:
@@ -659,22 +660,20 @@ class _LxmlFrameParser(_XMLFrameParser):
             if event == "start":
                 if curr_elem == row_node:
                     row = {}
+                    for col in self.iterparse[row_node]:
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]

             if row is not None:
                 for col in self.iterparse[row_node]:
                     if curr_elem == col:
                         row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]
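The idea behind the proposed change, reading attributes only from the row element itself so that a child's attribute of the same name cannot overwrite the row's value, can be sketched with the standard library alone. The function `parse_rows` and the sample XML are hypothetical and are not pandas code. Unlike the diff above, the sketch reads element text on the "end" event, because ElementTree only guarantees attributes (not text) on "start".

```python
import io
import xml.etree.ElementTree as ET

# Sample document: both <row> and its <child> carry an "id" attribute.
XML = b"""<data>
  <row id="1"><name>a</name><child id="99"/></row>
  <row id="2"><name>b</name><child id="98"/></row>
</data>"""


def parse_rows(source, row_node, columns):
    """Stream rows, taking attributes only from the row element itself."""
    rows, row = [], None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == row_node:
                # Attributes are already populated on the "start" event.
                row = {c: elem.attrib[c] for c in columns if c in elem.attrib}
            continue
        # From here on the event is "end", so element text is complete.
        if row is None:
            continue
        if elem.tag == row_node:
            rows.append(row)
            row = None
            elem.clear()  # release the finished row's children
        elif elem.tag in columns:
            row[elem.tag] = elem.text.strip() if elem.text else None
    return rows


rows = parse_rows(io.BytesIO(XML), "row", ["id", "name"])
```

With this sample, each row keeps the `id` from its own `<row>` tag, and the `id` on `<child>` is ignored.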

ParfaitG · 2022-06-13T17:39:28Z

Thank you for your comment, @bailsman. Given this PR is closed, can you open a BUG issue for this new feature with a reproducible example? And possibly open a QST issue for your attribute parsing question, or ask on StackOverflow. Specifically, we need a minimal XML sample and the attempted code with the desired result.

bailsman · 2022-06-14T12:26:20Z

Before I go off and try to debug this so that I can make a reproducible example, can I just ask you what this line does and what the purpose of this line was intended to be?

pandas/pandas/io/xml.py

Line 677 in fa7e31b

del elem.getparent()[0]

It's lxml specific (the elementtree equivalent doesn't have it), and, at least, in my case it seems to work fine without it and throws errors if I keep it. On initial investigation, the problem doesn't occur on small files that I've tested, so it would take some additional debugging effort on my part to figure out how to reproduce it, which I'd rather invest with fuller understanding to save time.

ParfaitG · 2022-06-16T07:13:28Z

Correct. That line is lxml-specific. The lxml iterparse docs indicate that it cleans up preceding siblings, with this stated goal:

If you have elements with a long list of children in your XML file and want to save more memory during parsing, you can clean up the preceding siblings of the current element
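As a hedged illustration of why such cleanup needs a guard: in lxml, `getparent()` returns `None` for an element that has no parent, and `del None[0]` raises exactly the TypeError quoted earlier in this thread. Whether that is the cause of the reported failure is not established here. The sketch below does not use lxml. Because the standard library's ElementTree has no `getparent()`, it tracks open ancestors on a stack (an assumption made purely for illustration) and prunes a finished row only when a parent exists. The name `stream_rows` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_node):
    """Yield each row as a dict of child tag to text, pruning finished rows."""
    parents = []  # stack of currently open elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            parents.append(elem)
            continue
        parents.pop()  # elem has closed; the stack now holds only its ancestors
        if elem.tag != row_node:
            continue
        yield {child.tag: (child.text or "").strip() for child in elem}
        # Guard: the row may itself be the root, in which case there is no
        # parent to prune from. This mirrors checking getparent() for None.
        if parents:
            parents[-1].remove(elem)


nested = io.BytesIO(b"<data><row><name>a</name></row><row><name>b</name></row></data>")
print(list(stream_rows(nested, "row")))
```

The same function also runs without error on a document whose root is the row element, which is the case the guard exists for.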

* ENH: Add large file support for read_xml
* Combine tests, slightly fix docs
* Adjust pytest decorator on URL test; fix doc strings
* Adjust tests for helper function
* Add iterparse feature to some tests
* Add IO docs link in docstring

ENH: Add large file support for read_xml

640c70e

mroeschke requested changes Jan 31, 2022


ParfaitG added 2 commits January 31, 2022 08:03

Combine tests, slightly fix docs

4011b4b

Merge remote-tracking branch 'upstream/main' into xml_iterparse

e9b2c3a

ParfaitG commented Jan 31, 2022


ParfaitG added 2 commits January 31, 2022 19:51

Resolve conflicts in tests

00c4a72

Merge remote-tracking branch 'upstream/main' into xml_iterparse

36ec05a

jreback added the IO XML read_xml, to_xml label Feb 27, 2022

ParfaitG added 3 commits February 27, 2022 18:49

Merge to master and resolve conflicts in tests and docs

65698fa

Adjust pytest decorator on URL test; fix doc strings

5514025

Merge remote-tracking branch 'upstream/main' into xml_iterparse

37a5dc5

jreback requested changes Feb 28, 2022


ParfaitG added 4 commits February 28, 2022 22:52

Adjust tests for helper function

2c4d81f

Merge remote-tracking branch 'upstream/main' into xml_iterparse

e4973ad

Add iterparse feature to some tests

3d065b5

Merge remote-tracking branch 'upstream/main' into xml_iterparse

3476c05

Merge remote-tracking branch 'upstream/main' into xml_iterparse

e236a4d

jreback added this to the 1.5 milestone Mar 18, 2022

jreback approved these changes Mar 18, 2022


mroeschke reviewed Mar 18, 2022


ParfaitG added 2 commits March 18, 2022 12:03

Add IO docs link in docstring

e37c20a

Merge remote-tracking branch 'upstream/main' into xml_iterparse

697fe92

mroeschke approved these changes Mar 18, 2022


mroeschke merged commit fa7e31b into pandas-dev:main Mar 18, 2022

ParfaitG deleted the xml_iterparse branch March 18, 2022 21:59

mroeschke mentioned this pull request Mar 18, 2022

BUG: read_xml not support large file #45442

Closed

3 tasks

bailsman mentioned this pull request Jun 14, 2022

BUG: iterparse on read_xml overwrites with attributes on child elements #47343

Closed

3 tasks

bailsman mentioned this pull request Jun 19, 2022

BUG: read_xml iterparse doesn't handle multiple toplevel elements with lxml parser #47422

Closed

3 tasks

mroeschke mentioned this pull request Jun 21, 2022

BUG: iterparse of read_xml not parsing duplicate element and attribute names #47414

Merged

5 tasks


ParfaitG commented Jan 30, 2022

mroeschke left a comment

ParfaitG Jan 31, 2022

twoertwein Jan 31, 2022

ParfaitG Jan 31, 2022

twoertwein Jan 31, 2022

twoertwein Jan 31, 2022

twoertwein Jan 31, 2022

ParfaitG commented Feb 1, 2022

jreback commented Feb 27, 2022

jreback Feb 28, 2022

ParfaitG Mar 1, 2022

jreback commented Mar 16, 2022

jreback commented Mar 18, 2022

mroeschke Mar 18, 2022

ParfaitG Mar 18, 2022

mroeschke left a comment

mroeschke commented Mar 18, 2022

bailsman commented Jun 13, 2022

ParfaitG commented Jun 13, 2022

bailsman commented Jun 14, 2022

ParfaitG commented Jun 16, 2022
