Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add large file support for read_xml #45724

Merged
merged 15 commits into from
Mar 18, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions doc/source/user_guide/io.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3287,6 +3287,45 @@ output (as shown below for demonstration) for easier parse into ``DataFrame``:
df = pd.read_xml(xml, stylesheet=xsl)
df

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this entire xml section in the user guide linked to the read_xml docstring? If not it would be good to link them under the Notes of a docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done.

supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes.
without holding entire tree in memory.

.. versionadded:: 1.5.0

.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

To use this feature, you must pass a physical XML file path into ``read_xml`` and use the ``iterparse`` argument.
Files should not be compressed or point to online sources but stored on local disk. Also, ``iterparse`` should be
a dictionary where the key is the repeating nodes in document (which become the rows) and the value is a list of
any element or attribute that is a descendant (i.e., child, grandchild) of repeating node. Since XPath is not
used in this method, descendants do not need to share same relationship with one another. Below shows example
of reading in Wikipedia's very large (12 GB+) latest article data dump.

.. code-block:: ipython

In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]}
... )
... df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450

[3578765 rows x 3 columns]

.. _io.xml:

Expand Down
37 changes: 37 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,43 @@ apply converter methods, and parse dates (:issue:`43567`).
df
df.dtypes

.. _whatsnew_150.read_xml_iterparse:

read_xml now supports large XML using ``iterparse``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
which are memory-efficient methods to iterate through XML trees and extract specific elements
and attributes without holding entire tree in memory (:issue:`#45442`).

.. code-block:: ipython

In [1]: df = pd.read_xml(
... "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
... iterparse = {"page": ["title", "ns", "id"]})
... )
df
Out[2]:
title ns id
0 Gettysburg Address 0 21450
1 Main Page 0 42950
2 Declaration by United Nations 0 8435
3 Constitution of the United States of America 0 8435
4 Declaration of Independence (Israel) 0 17858
... ... ... ...
3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649
3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649
3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649
3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291
3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450

[3578765 rows x 3 columns]


.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse

.. _whatsnew_150.api_breaking.api_breaking2:

api_breaking_change2
Expand Down
Loading