read_html ignores paragraphs in table cells #24766

sasan00 · 2019-01-14T16:00:26Z

Code Sample, a copy-pastable example if possible

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            <p>Field 1</p>
            <p>Field 2</p>
        </td>
        <td>
            <p>Value 1</p>
            <p>Value 2</p>
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Problem description

In the current implementation the <p> tags are ignored, so the paragraphs inside a cell are merged into a single string. As a result, it's not possible to infer that Field 1 has Value 1 and Field 2 has Value 2.

Expected Output

tables[0].iat[0, 0] == 'Field 1\nField 2'
tables[0].iat[0, 1] == 'Value 1\nValue 2'

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2019-01-14T16:48:17Z

Can you write the exact expected output?

sasan00 · 2019-01-17T15:46:57Z

I have updated the issue with the requested information. Couldn't find a way to remove the "Needs Info" label.

TomAugspurger · 2019-01-17T16:07:45Z

Thanks. Can you check if the HTML parsing libraries (lxml, bs4) typically convert p tags to newlines? Do they provide options to do that?

sasan00 · 2019-01-17T16:38:02Z

That wouldn't help, as the example below shows:

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1
            
            Field 2
        </td>
        <td>
            Value 1
            
            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Still returns "Field 1 Field 2"

TomAugspurger · 2019-01-17T16:41:05Z

I'm just wondering if our behavior matches the expected behavior of the underlying parsing libraries, and whether they have ways of dealing with it. Presumably they've had requests for similar features around whitespace normalization.

sasan00 · 2019-01-17T16:49:33Z

lxml preserves whitespace:

import pandas as pd
from lxml.etree import fromstring
from lxml.html import HTMLParser

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1
            
            Field 2
        </td>
        <td>
            Value 1
            
            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])
parser = HTMLParser()
root = fromstring(html, parser)
for elem in root.iter('td'):
    print(repr(elem.text))

Result:

Field 1 Field 2
'\n Field 1\n \n Field 2\n '
'\n Value 1\n \n Value 2\n '

TomAugspurger · 2019-01-17T17:06:47Z

Thanks. Can you check if pandas explicitly strips / normalizes whitespace in read_html then? If so, this would be a good parameter to add to read_html.

sasan00 · 2019-01-17T18:03:15Z

Yes. In `_parse_raw_data`, `_remove_whitespace` is called for each column in each row. Its `regex` argument defaults to `_RE_WHITESPACE`, whose value is `re.compile(r'[\r\n]+|\s{2,}')`.

I think the whitespace "cleanup" (i.e., replacing whitespace with a single space character) should be optional.
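
To make the effect of that pattern concrete, here is a minimal sketch using only the standard library. The pattern is the one quoted above. The helper name `collapse_whitespace` and its `enabled` flag are hypothetical, and the strip-then-substitute body is an assumption about what `_remove_whitespace` does, not a copy of the pandas source.

```python
import re

# Pattern quoted in the comment above.
RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def collapse_whitespace(text, enabled=True):
    """Hypothetical stand-in for the cleanup step.

    Assumes the cleanup strips the cell text and then replaces every
    match of the pattern with one space. With enabled=False the text
    is returned untouched, which is the opt-out being asked for.
    """
    if not enabled:
        return text
    return RE_WHITESPACE.sub(" ", text.strip())


raw = "Field 1\nField 2"
print(repr(collapse_whitespace(raw)))                 # 'Field 1 Field 2'
print(repr(collapse_whitespace(raw, enabled=False)))  # 'Field 1\nField 2'
```

With the cleanup on, the newline that separated the two fields is gone, which matches the behavior reported in this issue.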

TomAugspurger · 2019-01-17T20:43:51Z

Thanks for investigating. I think an option to disable that behavior makes sense.

You've given two examples now, one with newlines in the text, and one with <p> tags. Do you expect to normalize the <p> tags to newlines, so that the two would give the same output? Do we have any prior art to copy here?
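
One possible answer to the <p> question, sketched with the standard library's `html.parser` rather than lxml or bs4: treat each <p> boundary as a line break and collapse all other whitespace. The class name `CellTextParser` is hypothetical, and this is not how `read_html` works today; it only illustrates what "normalize <p> tags to newlines" could mean.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td>, joining <p> blocks with newlines."""

    def __init__(self):
        super().__init__()
        self.cells = []      # finished cell strings
        self._parts = None   # paragraphs of the cell being read, or None
        self._buffer = []    # text fragments seen since the last break

    def _flush(self):
        # Collapse whitespace inside one paragraph; skip empty paragraphs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self._parts.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts, self._buffer = [], []
        elif tag == "p" and self._parts is not None:
            self._flush()

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == "p":
            self._flush()
        elif tag == "td":
            self._flush()
            self.cells.append("\n".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._buffer.append(data)


parser = CellTextParser()
parser.feed(
    "<table><tr>"
    "<td><p>Field 1</p><p>Field 2</p></td>"
    "<td><p>Value 1</p><p>Value 2</p></td>"
    "</tr></table>"
)
print(parser.cells)  # ['Field 1\nField 2', 'Value 1\nValue 2']
```

Under this design, plain newlines inside a cell are still collapsed to a space, so the two examples in this thread would not give the same output. Whether they should is exactly the open question.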

sasan00 · 2019-01-18T14:06:13Z

I think the best option is an extra argument: a function that takes the raw text of a cell and returns the "cleaned up" version. Its default value would be _remove_whitespace to ensure backwards compatibility.
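
A rough sketch of that proposal, independent of pandas. The names `parse_row`, `clean_cell` and `default_clean` are hypothetical, and `default_clean` only approximates the current cleanup (strip, then replace each match of the quoted pattern with one space).

```python
import re

_RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def default_clean(text):
    # Assumed current behavior: strip, then collapse whitespace runs.
    return _RE_WHITESPACE.sub(" ", text.strip())


def parse_row(raw_cells, clean_cell=default_clean):
    """Hypothetical row parser that applies clean_cell to every raw cell."""
    return [clean_cell(cell) for cell in raw_cells]


raw_row = ["\nField 1\nField 2\n", "\nValue 1\nValue 2\n"]

# Default keeps today's behavior: inner newlines become spaces.
print(parse_row(raw_row))
# ['Field 1 Field 2', 'Value 1 Value 2']

# A caller who wants the newlines passes their own cleaner.
print(parse_row(raw_row, clean_cell=str.strip))
# ['Field 1\nField 2', 'Value 1\nValue 2']
```

Compared with a boolean switch, a callable lets the caller choose any normalization, at the cost of a slightly larger API surface.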

markmbaum · 2020-09-25T16:13:44Z

Hi, wondering if this issue was ever resolved? In my case, I have a <ul> inside the HTML table and all the elements of each list are squished together after the table is parsed by read_html.

Derekt2 · 2021-02-18T17:32:11Z

still encountering this bug.

jreback · 2021-02-19T03:10:10Z

@Derekt2 this is open
you are welcome to submit a pull request to patch

Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766

Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766 Co-authored-by: Romain Lebbadi-Breteau <romain@lebbadi.fr>

TST: Added a simple test for issue pandas-dev#24766 Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766 Co-authored-by: Romain Lebbadi-Breteau <romain@lebbadi.fr> Co-authored-by: Fredrik Wallner <fredrik@wallner.nu>

SuryaThiru · 2023-07-12T17:25:13Z

Hi. I'm experiencing the same bug with newlines in cells. I see that there are some existing contributions; will they be merged?

iamef · 2023-12-08T07:17:40Z

same

RomainL972 · 2024-08-09T03:22:15Z

take

TomAugspurger added the "IO HTML" label (read_html, to_html, Styler.apply, Styler.applymap) Jan 14, 2019

WillAyd added the "Needs Info" label (Clarification about behavior needed to assess issue) Jan 14, 2019

TomAugspurger removed the "Needs Info" label (Clarification about behavior needed to assess issue) Jan 17, 2019

TomAugspurger added this to the 0.24.0 milestone Jan 17, 2019

jreback modified the milestones: 0.24.0, Contributions Welcome Jan 21, 2019

mroeschke added the Bug label May 7, 2020

Derekt2 pushed a commit to Derekt2/pandas that referenced this issue Feb 20, 2021

Updated read_html to add option

663a713

Adds optional boolean parameter "remove_whitespace" to skip the remove_whitespace functionality. Defaults to true to support backwards compatibility. See pandas-dev#24766

Derekt2 mentioned this issue Feb 20, 2021

Updated read_html to add option #39925

Closed

3 tasks

fredrikw added a commit to fredrikw/pandas that referenced this issue Jun 4, 2021

Added a simple test for issue pandas-dev#24766

d82c506

RomainL972 pushed a commit to RomainL972/pandas that referenced this issue Feb 19, 2022

TST: Added a simple test for issue pandas-dev#24766

4806dbb

RomainL972 mentioned this issue Feb 19, 2022

ENH: Updated read_html to add option #46075

Closed

4 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

RomainL972 linked a pull request Aug 9, 2024 that will close this issue

ENH: Add an option to prevent stripping extra whitespaces in pd.read_html #59455

Open

5 tasks

github-actions bot assigned RomainL972 Aug 9, 2024
