adding minify option in Parser, so the parsed Document occupies less memory. #2003

Closed
oymy opened this issue Sep 22, 2023 · 4 comments


oymy commented Sep 22, 2023

When parsing large HTML, the parsed Document uses a huge amount of memory.

So maybe we could add a minify option to the Parser,
so that the parsed Document has fewer nodes and occupies less memory.

@oymy oymy changed the title adding minify option in Parser, so the parsed Document occupies less memory. [discussion] adding minify option in Parser, so the parsed Document occupies less memory. Sep 22, 2023
@oymy oymy changed the title [discussion] adding minify option in Parser, so the parsed Document occupies less memory. adding minify option in Parser, so the parsed Document occupies less memory. Sep 22, 2023
Owner

jhy commented Oct 18, 2023

I'm always interested in new ideas on how to use less memory! Can you tell me more about your idea and your use case?

  • What memory budget per Document do you have / need?
  • What size are the input documents?
  • What are you actually doing with the documents (selecting / editing / outputting...?)
  • What nodes do you not want to retain when parsing? How would you consider controlling that?

One feature I've been thinking about for a while is a streaming-type parser that would just emit tokens / nodes, and perhaps the current stack, but not retain a full DOM. But I see this as being a difficult interface for users of the library to work with. So I am interested in hearing your use case (and everyone's! Others, please feel free to comment too) and what developer experience would be best.

@jhy jhy added the needs-more-info and discussion labels Oct 18, 2023
Author

oymy commented Oct 20, 2023

Thanks for the reply.
Sorry I didn't explain my idea clearly.

Occasionally I have a Document that may occupy 300 MB.
The size of the input document may exceed 200 MB.
I use jsoup to convert the HTML to XML, then export it to PDF.
What I don't want to retain when parsing is empty nodes like "\n" in the HTML, because they don't affect the final look of the PDF.

For example, compare the parsed documents of the following two lines. The first line has 5 more empty nodes than the second:
"<tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>"
"<tr><td>a</td><td>a</td><td>a</td><td>a</td></tr>"

So minifying the HTML during parsing can save memory.
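
The count of 5 can be checked at the text level with a small sketch. It uses only the Java standard library and treats any whitespace-only run between a `>` and the next `<` as one would-be empty text node. This is an approximation for illustration, not jsoup's parser, and the class and method names are made up for this example. Collapsing such gaps blindly is not safe in general (for instance inside `<pre>`, or between inline elements where a space is significant).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceGaps {
    // A whitespace-only run sitting between the end of one tag and the start of the next.
    private static final Pattern BLANK_GAP = Pattern.compile(">\\s+<");

    /** Counts whitespace-only gaps between tags (each would become a blank text node). */
    static int countBlankGaps(String html) {
        Matcher m = BLANK_GAP.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Removes whitespace-only gaps between tags; text content is left untouched. */
    static String collapseBlankGaps(String html) {
        return BLANK_GAP.matcher(html).replaceAll("><");
    }

    public static void main(String[] args) {
        String spaced = "<tr> <td>a</td> <td>a</td> <td>a</td> <td>a</td> </tr>";
        String tight = "<tr><td>a</td><td>a</td><td>a</td><td>a</td></tr>";

        System.out.println("gaps in first line:  " + countBlankGaps(spaced));
        System.out.println("gaps in second line: " + countBlankGaps(tight));
        System.out.println("collapsed equals second line: "
                + collapseBlankGaps(spaced).equals(tight));
    }
}
```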

@jhy jhy removed the needs-more-info label Nov 17, 2023
Owner

jhy commented Nov 17, 2023

Thanks -- it makes sense. Have you measured the memory impact of removing those nodes?
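
As a rough starting point for that measurement, the following sketch counts the whitespace-only text nodes in a parsed Document and then removes them. It assumes jsoup is on the classpath; `BlankNodeCount` and its methods are hypothetical names for this example. Node count is only a proxy for memory, and because the nodes are removed after parsing, this would not lower the peak memory used during the parse itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

public class BlankNodeCount {
    /** Collects every whitespace-only TextNode under the given root. */
    static List<TextNode> blankTextNodes(Node root) {
        List<TextNode> blanks = new ArrayList<>();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode && ((TextNode) node).isBlank()) {
                    blanks.add((TextNode) node);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return blanks;
    }

    /** Removes the blank text nodes and returns how many were removed. */
    static int removeBlankTextNodes(Node root) {
        // Collect first, then remove, so the tree is not modified mid-traversal.
        List<TextNode> blanks = blankTextNodes(root);
        for (TextNode blank : blanks) {
            blank.remove();
        }
        return blanks.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div> <p>a</p> <p>b</p> </div>");
        System.out.println("blank text nodes removed: " + removeBlankTextNodes(doc));
        System.out.println("blank text nodes left:    " + blankTextNodes(doc).size());
    }
}
```

Caution: blank text nodes inside `<pre>` or between inline elements can affect rendering, so a real filter would probably need to skip those.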

Would you be able to share an example of the document? Please contact me directly (jonathan@hedley.net) if you can. Or, is there an example file I could use as a proxy?

I have been thinking of adding a stream() function to the Parser that would emit a stream of Elements / Nodes as the document is consumed. And one could then node.remove() the unneeded content. Or, just process the content in a streaming mode and delete the processed content.

Would be keen to get real-world examples and beta testers for this functionality.

Owner

jhy commented Jan 5, 2024

I've built out a new feature -- StreamParser -- that should address this. Take a look at the examples in #2096. Would be great if you can give it a try.
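
For readers who want a feel for the approach, here is a minimal sketch of streaming a document and discarding content once it has been processed. It assumes a jsoup release that ships `org.jsoup.parser.StreamParser` (it is not in versions that predate this issue); the class name `StreamingCells` and the choice to process `td` elements are only for illustration and are not taken from the examples in #2096.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.ArrayList;
import java.util.List;

public class StreamingCells {
    /** Reads the text of each table cell as it is completed, then drops the cell from the DOM. */
    static List<String> cellTexts(String html) {
        List<String> texts = new ArrayList<>();
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            // Elements are emitted as the input is consumed, children before their parents.
            streamer.stream().forEach((Element el) -> {
                if (el.normalName().equals("td")) {
                    texts.add(el.text());
                    el.remove(); // the processed cell no longer needs to be retained
                }
            });
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tr> <td>a</td> <td>b</td> <td>c</td> </tr></table>";
        System.out.println(cellTexts(html));
    }
}
```

For very large inputs, passing a `Reader` to `parse` instead of a `String` would avoid holding the whole source text in memory as well.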

@jhy jhy closed this as completed Jan 5, 2024
@jhy jhy added the improvement label and removed the discussion label Jan 5, 2024