adding minify option in Parser, so the parsed Document occupies less memory. #2003

oymy · 2023-09-22T14:34:57Z

When parsing large HTML, the memory usage of the parsed Document is huge.

So maybe we could add a minify option to the Parser,
so that the parsed Document has fewer nodes and occupies less memory.

jhy · 2023-10-18T05:49:14Z

I'm always interested in new ideas on how to use less memory! Can you tell me more about your idea and your use case?

What memory budget per Document do you have / need?
What size are the input documents?
What are you actually doing with the documents (selecting / editing / outputting...?)
What nodes do you not want to retain when parsing? How would you consider controlling that?

One feature I've been thinking about for a while is to implement a streaming-type parser that would just emit tokens / nodes, and perhaps the current stack, but not retain a full DOM. But I see this as being a difficult interface for users of the library to work with. So I'm interested in hearing your use case (and everyone's! Others, please feel free to comment too) and what developer experience would be best.

oymy · 2023-10-20T15:00:34Z

Thanks for the reply.
Sorry, I didn't explain my idea well.

Occasionally I have a Document that may occupy 300 MB.
The size of the input document may exceed 200 MB.
I use jsoup to convert the HTML to XML, then export it to PDF.
What I don't want to retain when parsing is empty nodes such as "\n" in the HTML, because they don't affect the final look of the PDF.

For example, compare the parsed documents of the following two lines. The first line has 5 more empty (whitespace-only) nodes than the second:
"<tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>"
"<tr><td>a</td><td>a</td><td>a</td><td>a</td></tr>"

So minifying the HTML during parsing could save memory.
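
To make the count above concrete, here is a minimal Java sketch that uses only the standard library. It is not a jsoup feature: the class `WhitespaceGaps` and its methods are hypothetical names chosen for illustration. It counts the whitespace-only runs between adjacent tags in the two sample lines and strips them before parsing, on the assumption that each such run would otherwise become one text node, as described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; none of these names come from jsoup.
final class WhitespaceGaps {
    // Whitespace that sits directly between the '>' of one tag and the '<' of the next.
    private static final Pattern GAP = Pattern.compile("(?<=>)\\s+(?=<)");

    // Number of whitespace-only runs between adjacent tags.
    static int countGaps(String html) {
        Matcher m = GAP.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Drops those runs before the markup is handed to a parser.
    static String stripGaps(String html) {
        return GAP.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String spaced = "<tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
        String tight = "<tr><td>a</td><td>a</td><td>a</td><td>a</td></tr>";
        System.out.println(countGaps(spaced));               // prints 5
        System.out.println(countGaps(tight));                // prints 0
        System.out.println(stripGaps(spaced).equals(tight)); // prints true
    }
}
```

This is deliberately naive. It would also remove whitespace that is significant, for example between inline elements such as `<b>a</b> <i>b</i>` or inside `<pre>`, so it only suits input where inter-tag whitespace is known not to matter, as in the table rows above.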

jhy · 2023-11-17T07:40:58Z

Thanks -- it makes sense. Have you measured the memory impact of removing those nodes?

Would you be able to share an example of the document? Please contact me directly (jonathan@hedley.net) if you can. Or, is there an example file I could use as a proxy?

I have been thinking of adding a stream() function to the Parser that would emit a stream of Elements / Nodes as the document is consumed. And one could then node.remove() the unneeded content. Or, just process the content in a streaming mode and delete the processed content.

Would be keen to get real-world examples and beta testers for this functionality.

jhy · 2024-01-05T00:43:22Z

I've built out a new feature -- StreamParser -- that should address this. Take a look at the examples in #2096. Would be great if you can give it a try.

oymy changed the title ~~adding minify option in Parser, so the parsed Document occupies less memory.~~ [discussion] adding minify option in Parser, so the parsed Document occupies less memory. Sep 22, 2023

oymy changed the title ~~[discussion] adding minify option in Parser, so the parsed Document occupies less memory.~~ adding minify option in Parser, so the parsed Document occupies less memory. Sep 22, 2023

jhy added the needs-more-info (More information is needed from the reporter to progress the issue) and discussion (Discussion for a new feature, or other change proposal) labels Oct 18, 2023

jhy removed the needs-more-info (More information is needed from the reporter to progress the issue) label Nov 17, 2023

jhy mentioned this issue Dec 29, 2023

Parsing a part of an html string #2093

Closed

jhy closed this as completed Jan 5, 2024

jhy added the improvement label and removed the discussion (Discussion for a new feature, or other change proposal) label Jan 5, 2024
