Skip to content

Commit

Permalink
ENH: Add large file support for read_xml (#45724)
Browse files Browse the repository at this point in the history
* ENH: Add large file support for read_xml

* Combine tests, slightly fix docs

* Adjust pytest decorator on URL test; fix doc strings

* Adjust tests for helper function

* Add iterparse feature to some tests

* Add IO docs link in docstring
  • Loading branch information
ParfaitG committed Mar 18, 2022
1 parent ec3eedd commit fa7e31b
Show file tree
Hide file tree
Showing 5 changed files with 721 additions and 50 deletions.
39 changes: 39 additions & 0 deletions doc/source/user_guide/io.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet=xsl)
df
For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes.
without holding entire tree in memory.

.. versionadded:: 1.5.0

.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
Files should not be compressed or point to online sources but stored on local disk. Also, ``iterparse`` should be
a dictionary where the key is the repeating nodes in document (which become the rows) and the value is a list of
any element or attribute that is a descendant (i.e., child, grandchild) of repeating node. Since XPath is not
used in this method, descendants do not need to share same relationship with one another. Below shows example
of reading in Wikipedia's very large (12 GB+) latest article data dump.

.. code-block:: ipython
In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]}
... )
... df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
[3578765 rows x 3 columns]
.. _io.xml:

Expand Down
37 changes: 37 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,43 @@ apply converter methods, and parse dates (:issue:`43567`).
df
df.dtypes
.. _whatsnew_150.read_xml_iterparse:

read_xml now supports large XML using ``iterparse``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through XML trees and extract specific elements
and attributes without holding entire tree in memory (:issue:`#45442`).

.. code-block:: ipython
In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]})
... )
df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450
[3578765 rows x 3 columns]
.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

.. _whatsnew_150.api_breaking.api_breaking2:

api_breaking_change2
Expand Down
Loading

0 comments on commit fa7e31b

Please sign in to comment.