ENH: read-html fixes #3616

Merged
merged 1 commit into from May 20, 2013
+1,095 −210
@@ -92,12 +92,11 @@ Optional dependencies
- openpyxl version 1.6.1 or higher, for writing .xlsx files
- xlrd >= 0.9.0
- Needed for Excel I/O
- - `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
- - The differences between lxml and Beautiful Soup 4 are mostly speed (lxml
- is faster), however sometimes Beautiful Soup returns what you might
- intuitively expect. Both backends are implemented, so try them both to
- see which one you like. They should return very similar results.
- - Note that lxml requires Cython to build successfully
+ - Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
+ `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
+ reading HTML tables
+ - These can both easily be installed by ``pip install html5lib`` and ``pip
+ install beautifulsoup4``.
- `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
@@ -30,8 +30,9 @@ pandas 0.11.1
**New features**
- - pd.read_html() can now parse HTML string, files or urls and return dataframes
- courtesy of @cpcloud. (GH3477_)
+ - ``pandas.read_html()`` can now parse HTML strings, files or urls and
+ returns a list of ``DataFrame`` s courtesy of @cpcloud. (GH3477_, GH3605_,
+ GH3606_)
- Support for reading Amazon S3 files. (GH3504_)
- Added module for reading and writing Stata files: pandas.io.stata (GH1512_)
- Added support for writing in ``to_csv`` and reading in ``read_csv``,
@@ -48,7 +49,7 @@ pandas 0.11.1
**Improvements to existing features**
- Fixed various issues with internal pprinting code, the repr() for various objects
- including TimeStamp and *Index now produces valid python code strings and
+ including TimeStamp and Index now produces valid python code strings and
can be used to recreate the object, (GH3038_, GH3379_, GH3251_, GH3460_)
- ``convert_objects`` now accepts a ``copy`` parameter (defaults to ``True``)
- ``HDFStore``
@@ -146,6 +147,9 @@ pandas 0.11.1
- ``sql.write_frame`` failing when writing a single column to sqlite (GH3628_),
thanks to @stonebig
- Fix pivoting with ``nan`` in the index (GH3558_)
+ - Fix running of bs4 tests when it is not installed (GH3605_)
+ - Fix parsing of html table (GH3606_)
+ - ``read_html()`` now only allows a single backend: ``html5lib`` (GH3616_)
.. _GH3164: https://github.com/pydata/pandas/issues/3164
.. _GH2786: https://github.com/pydata/pandas/issues/2786
@@ -209,6 +213,9 @@ pandas 0.11.1
.. _GH3141: https://github.com/pydata/pandas/issues/3141
.. _GH3628: https://github.com/pydata/pandas/issues/3628
.. _GH3638: https://github.com/pydata/pandas/issues/3638
+.. _GH3605: https://github.com/pydata/pandas/issues/3605
+.. _GH3606: https://github.com/pydata/pandas/issues/3606
+.. _GH3616: https://github.com/pydata/pandas/issues/3616
pandas 0.11.0
=============
@@ -99,12 +99,11 @@ Optional Dependencies
* `openpyxl <http://packages.python.org/openpyxl/>`__, `xlrd/xlwt <http://www.python-excel.org/>`__
* openpyxl version 1.6.1 or higher
* Needed for Excel I/O
- * `lxml <http://lxml.de>`__, or `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for reading HTML tables
- * The differences between lxml and Beautiful Soup 4 are mostly speed (lxml
- is faster), however sometimes Beautiful Soup returns what you might
- intuitively expect. Both backends are implemented, so try them both to
- see which one you like. They should return very similar results.
- * Note that lxml requires Cython to build successfully
+ * Both `html5lib <https://github.com/html5lib/html5lib-python>`__ **and**
+ `Beautiful Soup 4 <http://www.crummy.com/software/BeautifulSoup>`__: for
+ reading HTML tables
+ * These can both easily be installed by ``pip install html5lib`` and ``pip
+ install beautifulsoup4``.
.. note::
@@ -918,18 +918,18 @@ which, if set to ``True``, will additionally output the length of the Series.
HTML
----
-Reading HTML format
+Reading HTML Content
~~~~~~~~~~~~~~~~~~~~~~
.. _io.read_html:
.. versionadded:: 0.11.1
-The toplevel :func:`~pandas.io.parsers.read_html` function can accept an HTML string/file/url
-and will parse HTML tables into pandas DataFrames.
+The toplevel :func:`~pandas.io.parsers.read_html` function can accept an HTML
+string/file/url and will parse HTML tables into a list of pandas DataFrames.
-Writing to HTML format
+Writing to HTML files
~~~~~~~~~~~~~~~~~~~~~~
.. _io.html:
@@ -64,9 +64,27 @@ API changes
Enhancements
~~~~~~~~~~~~
-
- - ``pd.read_html()`` can now parse HTML string, files or urls and return dataframes
- courtesy of @cpcloud. (GH3477_)
+ - ``pd.read_html()`` can now parse HTML strings, files or urls and return
+ DataFrames
+ courtesy of @cpcloud. (GH3477_, GH3605_, GH3606_)
+ - ``read_html()`` (GH3616_)
+ - now works with only a *single* parser backend, that is:
+ - BeautifulSoup4 + html5lib
+ - does *not* support, and will never support, the HTML parsing library
included with Python as a parser backend
+ - is a bit smarter about the parent table elements of matched text: if
+ multiple matches are found then only the *unique* parents of the result
+ are returned (uniqueness is determined using ``set``).
+ - no longer tries to guess about what you want to do with empty table cells
+ - argument ``infer_types`` now defaults to ``False``.
+ - now returns DataFrames whose default column index is taken from the
+ ``<thead>`` elements in the HTML soup, if any exist.
+ - considers all ``<th>`` and ``<td>`` elements inside of ``<thead>``
+ elements.
+ - tests are now correctly skipped if the proper libraries are not
+ installed.
+ - tests now include a ground-truth csv file from the FDIC failed bank list
+ data set.
- ``HDFStore``
- will retain index attributes (freq,tz,name) on recreation (GH3499_)
@@ -203,3 +221,6 @@ on GitHub for a complete list.
.. _GH1651: https://github.com/pydata/pandas/issues/1651
.. _GH3141: https://github.com/pydata/pandas/issues/3141
.. _GH3638: https://github.com/pydata/pandas/issues/3638
+.. _GH3616: https://github.com/pydata/pandas/issues/3616
+.. _GH3605: https://github.com/pydata/pandas/issues/3605
+.. _GH3606: https://github.com/pydata/pandas/issues/3606
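The whatsnew entry above says that when several pieces of matched text share a parent table, only the *unique* parents are returned, with uniqueness determined using ``set``. The following is a minimal standalone sketch of that idea and not the code from this pull request. It assumes well-formed markup and uses the standard library's ``xml.etree.ElementTree`` purely for illustration, whereas ``read_html()`` itself relies on BeautifulSoup4 + html5lib. The helper name ``unique_parent_tables`` and the sample markup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the pandas test suite).
SAMPLE = """
<html><body>
  <table id="banks">
    <thead><tr><th>Bank Name</th><th>City</th></tr></thead>
    <tbody>
      <tr><td>First Example Bank</td><td>Springfield</td></tr>
      <tr><td>Second Example Bank</td><td>Shelbyville</td></tr>
    </tbody>
  </table>
  <table id="other">
    <tr><td>Unrelated</td><td>Row</td></tr>
  </table>
</body></html>
"""


def unique_parent_tables(markup, pattern):
    """Return the distinct <table> elements that contain text matching pattern."""
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    regex = re.compile(pattern)

    seen = set()   # Elements hash by identity, so a set detects repeated parents.
    tables = []    # A list preserves document order, which a bare set would not.
    for elem in root.iter():
        if not (elem.text and regex.search(elem.text)):
            continue
        # Walk up from the matching element to its enclosing <table>, if any.
        node = elem
        while node is not None and node.tag != "table":
            node = parent_of.get(node)
        if node is not None and node not in seen:
            seen.add(node)
            tables.append(node)
    return tables


# "Bank" matches three cells, but all of them live in the same table.
tables = unique_parent_tables(SAMPLE, r"Bank")
print([t.get("id") for t in tables])  # prints ['banks']
```

The set is only used for membership checks here; results are collected in a separate list so that tables come back in document order, which is a design choice of this sketch rather than something stated in the pull request.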