HTML parsing abnormal #469

Alalalalaki · 2021-07-03T20:27:19Z

I recently ran "conda update --all" and then found that HTML parsing in requests-html began to behave abnormally. In particular, the object returned by html.find() still contains the entire content of the page, e.g. if a = html.find("something", first=True), then a.text still shows all the text of the page.

I then created a clean environment with only requests-html, and it works well there. So I guess that a recently updated version of some other package in my main environment conflicts with HTML parsing in requests-html. But I have no idea how this would happen or which package might be the culprit.

Any suggestion will be appreciated.
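
One way to narrow this down is to compare package versions between the clean environment and the broken one. The following is only a minimal sketch, assuming Python 3.8+ (for importlib.metadata); the helper name installed_versions and the list of packages are illustrative guesses, not something requests-html provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed list of packages that may take part in HTML parsing; adjust as needed.
    candidates = ["requests-html", "lxml", "pyquery", "beautifulsoup4"]
    for name, ver in installed_versions(candidates).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in both environments and diffing the output should show which packages changed version.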

DanielPython2021 · 2021-07-16T01:33:05Z

I had the same problem. It is strange, since I saw similar code running on YouTube with the expected results, but that is not my experience.
To help, I am copying my code here so you can reproduce the problem (these are cells from a Jupyter notebook). I also print the results from BeautifulSoup for comparison.

from requests_html import HTMLSession, HTML

doc = '<div class="class1">text1</div><div class="class2">text2</div><div class="class3">text3</div><div  class="class4">text4</div>'
html = HTML(html=doc)

for cl in ['class1', 'class2', 'class3', 'class4']:
    print(html.find('div.' + cl, first=True).html)
    print(html.find('div.' + cl, first=True).text)
    print('-' * 100)

text1

text2

text3

text4

text1 text2 text3 text4
----------------------------------------------------------------------------------------------------------------------

text2

text3

text4

text2 text3 text4
----------------------------------------------------------------------------------------------------------------------

text3

text4

text3 text4
----------------------------------------------------------------------------------------------------------------------

text4

text4
----------------------------------------------------------------------------------------------------------------------

for x in html.lxml:
    print(x.tag, x.attrib, x.text)
    print()

div {'class': 'class1'} text1

div {'class': 'class2'} text2

div {'class': 'class3'} text3

div {'class': 'class4'} text4

from bs4 import BeautifulSoup as bs

soup = bs(doc)
for cl in ['class1', 'class2', 'class3', 'class4']:
    print(soup.find('div', {'class': cl}))
    print(soup.find('div', {'class': cl}).text)
    print('-' * 80)

text1

text1
--------------------------------------------------------------------------------

text2

text2
--------------------------------------------------------------------------------

text3

text3
--------------------------------------------------------------------------------

text4

text4
--------------------------------------------------------------------------------

DanielPython2021 · 2021-07-16T01:37:13Z

The previous results were pasted incorrectly. The last results (from BeautifulSoup) should be as follows:

<div class="class1">text1</div>
text1
--------------------------------------------------------------------------------
<div class="class2">text2</div>
text2
--------------------------------------------------------------------------------
<div class="class3">text3</div>
text3
--------------------------------------------------------------------------------
<div class="class4">text4</div>
text4
--------------------------------------------------------------------------------
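
The comparison above can be turned into a small automated check. This is a rough sketch, not part of requests-html: the helper names leaking_selectors and requests_html_text are hypothetical, and the only requests-html calls used are the ones already shown in this thread (HTML(html=...), find(..., first=True) and .text).

```python
from typing import Callable, Dict, List


def leaking_selectors(
    expected: Dict[str, str], find_text: Callable[[str], str]
) -> List[str]:
    """Return the selectors whose extracted text differs from the expected text."""
    return [sel for sel, want in expected.items() if find_text(sel) != want]


def requests_html_text(doc: str) -> Callable[[str], str]:
    """Build a selector -> text lookup backed by requests-html."""
    # Imported lazily so the checker itself has no third-party dependency.
    from requests_html import HTML

    html = HTML(html=doc)
    return lambda selector: html.find(selector, first=True).text
```

With the doc string from the earlier comment, leaking_selectors({'div.class1': 'text1', 'div.class2': 'text2', 'div.class3': 'text3', 'div.class4': 'text4'}, requests_html_text(doc)) should return an empty list in a healthy environment. Judging by the requests-html output reported earlier, an affected environment would list div.class1 through div.class3.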

TiesdeKok mentioned this issue Aug 30, 2021

LXML bug that breaks .find() on new installs #478

Open

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022

Update requests_html.py

813ad00

to fix a bug of wrong parse result when use find() with lxml==4.9.0(maybe lower). see issues psf#469 and psf#479.

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022

Update requests_html.py

742d208

fix the bug caused by lxml==4.9.0(maybe lower?). see issue psf#469 and psf#479.

AbstractDataType mentioned this issue Jun 6, 2022

Update requests_html.py kennethreitz/requests-html#2

Open

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022

Update requests_html.py

539e29e

fix the bug caused by lxml==4.9.0(maybe lower?). see issue psf#469 and psf#479.

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022

Update requests_html.py

2f1518b

fix the bug caused by lxml==4.9.0(maybe lower?). see issue psf#469 and psf#479 .

AbstractDataType mentioned this issue Jun 6, 2022

Update requests_html.py #510

Open
