Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handles external links (without SW) #142

Merged
merged 9 commits into from
Jan 3, 2024
Merged

Handles external links (without SW) #142

merged 9 commits into from
Jan 3, 2024

Conversation

mgautierfr
Copy link
Contributor

@mgautierfr mgautierfr commented Dec 21, 2023

Fixes #137

With this PR we don't rewrite html (a href) link when the url is not in the warc archive.

Other links are always rewritten to avoid zim file silently phoning home.

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good!

@@ -282,17 +283,24 @@ def iter_all_warc_records(self):

yield from iter_warc_records(self.inputs)

def find_main_page_metadata(self):
for record in self.iter_all_warc_records():
def gather_information_from_warc(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name is misleading. It's all about the main page (and its metadata) and given how long this is already, probably good not to add additional stuff to it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that doing the main page searching and the filling of known_url in two different steps means that we loop twice (and one more time for the real processing) the warc entries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not suggesting we change the behavior ; just the name. Maybe find_homepage_and_urls or something

src/warc2zim/converter.py Outdated Show resolved Hide resolved
src/warc2zim/items.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
tests/test_url_rewriting.py Outdated Show resolved Hide resolved
tests/test_html_rewriting.py Outdated Show resolved Hide resolved
tests/test_html_rewriting.py Outdated Show resolved Hide resolved
@mgautierfr mgautierfr changed the title Make the different rewriter directly take a url_rewriter. handle external links Dec 21, 2023
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it's very weird to do so many things in the WARCPayloadItem class __init__ method, especially rewriting item URLs in this class seems odd. This is made especially visible by the fact that we need to pass all known URLs to the class, which means this class knows a lot of things about the surrounding logic. I would have expected this class to be more a kind of adapter to some ZIM specifities, based on pre-processed information.

I'm also concerned by the fact that we keep all know URLs in memory. We've already seen websites with more than 10k pages, and I would expect we will have to deal with 100k pages at some point, meaning 100k URLs in memory. I don't know exactly how bad this could be (in terms of memory footprint), but I think we should be at least aware of this / track this risk/need in a specific issue.

@rgaudin
Copy link
Member

rgaudin commented Dec 21, 2023

I'm also concerned by the fact that we keep all know URLs in memory.

This is not a new concern ; several scrapers need to store huge list of entries/urls. There's no average size for a URL but even in UTF-8, most characters in most URLs are ASCII so I think 500b for an average URL can be seen consevative. With a hundred thousands of them it's just over 50MB. Sure that's a lot but in the context of a scraper turning 100K requests into a ZIM, that's not an issue IMO.

Copy link

codecov bot commented Dec 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (eca6903) 91.86% compared to head (d0dd6d7) 92.08%.

Additional details and impacted files
@@                   Coverage Diff                   @@
##           better_wombat_setup     #142      +/-   ##
=======================================================
+ Coverage                91.86%   92.08%   +0.21%     
=======================================================
  Files                        9        9              
  Lines                      701      720      +19     
  Branches                   116      121       +5     
=======================================================
+ Hits                       644      663      +19     
  Misses                      40       40              
  Partials                    17       17              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kelson42 kelson42 changed the title handle external links Handles external links (without SW) Dec 22, 2023
@kelson42 kelson42 linked an issue Dec 22, 2023 that may be closed by this pull request
@mgautierfr
Copy link
Contributor Author

I find it's very weird to do so many things in the WARCPayloadItem class init method, especially rewriting item URLs in this class seems odd. This is made especially visible by the fact that we need to pass all known URLs to the class, which means this class knows a lot of things about the surrounding logic. I would have expected this class to be more a kind of adapter to some ZIM specifities, based on pre-processed information.

I partly agree. This __init__ method become a bit to big.
We could move the rewrite part in a separated function, but then it would be this function to knows a lot of things about the surrounding logic. And either we pass the information to the __init__ which will call the rewrite function or we could the rewrite function in the converter and pass the rewriten content to the item. In the first case, it doesn't help a lot. In the second case, we could simply drop WARCPayloadItem and simply use StaticItem.

I'm also concerned by the fact that we keep all know URLs in memory. We've already seen websites with more than 10k pages, and I would expect we will have to deal with 100k pages at some point, meaning 100k URLs in memory. I don't know exactly how bad this could be (in terms of memory footprint), but I think we should be at least aware of this / track this risk/need in a specific issue.

Agree. But as has said @rgaudin, it should not be such a problem. And I don't know how to do it without storing the urls somewhere (or loop the whole warc for each url rewrite)

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I agree the problem I mentioned is not such of an issue. A simple solution (I didn't say straightforward) would have been to store all data in a local DB, typically SQLite like is done for Gutenberg. Of course it comes with some drawbacks it terms of CPU / scrape duration, but at least it is scalable.

And you convinced me regarding the __init__ method, let's keep it like this until we find it becomes a real issue.

@mgautierfr
Copy link
Contributor Author

(Put in draft as it needs merging of #141 first)

@mgautierfr mgautierfr marked this pull request as draft December 22, 2023 16:42
Base automatically changed from better_wombat_setup to warc2zim2 December 29, 2023 16:09
@mgautierfr mgautierfr marked this pull request as ready for review December 29, 2023 16:09
@mgautierfr mgautierfr merged commit ef6f425 into warc2zim2 Jan 3, 2024
12 checks passed
@mgautierfr mgautierfr deleted the external_links branch January 3, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle external links in warc2zim without SW
3 participants