Handles external links (without SW) #142

mgautierfr · 2023-12-21T10:23:31Z

Fixes #137

With this PR we no longer rewrite HTML (`a href`) links when the URL is not in the WARC archive.

All other links are still always rewritten, to prevent the ZIM file from silently phoning home.
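
The rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `rewrite_url`, `to_zim_path`, the `is_html_anchor` flag and the "host + path" mapping are all hypothetical names and simplifications.

```python
from urllib.parse import urljoin, urlsplit


def to_zim_path(absolute_url: str) -> str:
    # Hypothetical mapping to an in-ZIM path: drop the scheme, keep host and path.
    parts = urlsplit(absolute_url)
    return f"{parts.netloc}{parts.path}"


def rewrite_url(
    url: str, base_url: str, known_urls: set[str], *, is_html_anchor: bool
) -> str:
    """Decide what to write in place of `url` found in a document at `base_url`."""
    absolute = urljoin(base_url, url)
    if is_html_anchor and absolute not in known_urls:
        # External link in an <a href>: leave it untouched so it still works online.
        return url
    # Everything else is rewritten so that it can never trigger a network request.
    return to_zim_path(absolute)
```

With `known_urls = {"https://example.com/about"}`, an anchor to `/about` is rewritten, an anchor to `https://other.org/x` is kept as is, and an image on `other.org` is rewritten even though it is not in the archive.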

rgaudin

Good!

rgaudin · 2023-12-21T12:05:39Z

src/warc2zim/converter.py

@@ -282,17 +283,24 @@ def iter_all_warc_records(self):
        yield from iter_warc_records(self.inputs)

-    def find_main_page_metadata(self):
-        for record in self.iter_all_warc_records():
+    def gather_information_from_warc(self):

I think the name is misleading. The method is all about the main page (and its metadata) and, given how long it already is, it is probably best not to add more to it.

mgautierfr · 2023-12-22

The problem is that searching for the main page and filling known_url in two separate steps means we loop over the WARC entries twice (plus once more for the actual processing).

rgaudin · 2023-12-22

I am not suggesting we change the behavior, just the name. Maybe find_homepage_and_urls or something.
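
To illustrate the point about avoiding an extra loop, here is a minimal sketch of collecting both pieces of information in a single pass. `Record` is a hypothetical stand-in for a WARC record, and "first HTML record" is an assumed placeholder for the real main-page detection, which this thread does not show.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a WARC record."""

    url: str
    is_html: bool


def find_homepage_and_urls(records: Iterable[Record]) -> tuple[Optional[str], set[str]]:
    main_url: Optional[str] = None
    known_urls: set[str] = set()
    for record in records:
        # Every record contributes to the set of known URLs.
        known_urls.add(record.url)
        # Assumption: the first HTML record is treated as the main page.
        if main_url is None and record.is_html:
            main_url = record.url
    return main_url, known_urls
```

The conversion itself would then be a second pass over the records, using `known_urls` to decide which links to rewrite.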

src/warc2zim/converter.py

src/warc2zim/items.py

src/warc2zim/url_rewriting.py

tests/test_url_rewriting.py

tests/test_html_rewriting.py

benoit74

I find it very weird to do so many things in the WARCPayloadItem class's __init__ method; rewriting item URLs in this class seems especially odd. This is made particularly visible by the fact that we need to pass all known URLs to the class, which means this class knows a lot about the surrounding logic. I would have expected this class to be more of an adapter to some ZIM specificities, based on pre-processed information.

I'm also concerned by the fact that we keep all known URLs in memory. We've already seen websites with more than 10k pages, and I expect we will have to deal with 100k pages at some point, meaning 100k URLs in memory. I don't know exactly how bad this could be (in terms of memory footprint), but I think we should at least be aware of this and track this risk/need in a specific issue.

rgaudin · 2023-12-21T13:23:03Z

> I'm also concerned by the fact that we keep all known URLs in memory.

This is not a new concern; several scrapers need to store huge lists of entries/URLs. There's no average size for a URL, but even in UTF-8 most characters in most URLs are ASCII, so I think 500 bytes for an average URL can be seen as conservative. With a hundred thousand of them, that's about 50 MB. Sure, that's a lot, but in the context of a scraper turning 100K requests into a ZIM, that's not an issue IMO.
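
The back-of-the-envelope estimate can be checked with a few lines. The 500-byte average is the assumption stated in the comment above; `estimate_url_bytes` and `measure_set_bytes` are illustrative helper names, and the second one only reports the shallow sizes CPython gives for a real set of strings.

```python
import sys


def estimate_url_bytes(url_count: int, average_url_bytes: int = 500) -> int:
    """Raw size of the URL text alone, ignoring any container overhead."""
    return url_count * average_url_bytes


def measure_set_bytes(urls: set[str]) -> int:
    """Shallow size of the set plus each string object, as reported by CPython."""
    return sys.getsizeof(urls) + sum(sys.getsizeof(url) for url in urls)


raw = estimate_url_bytes(100_000)
print(f"{raw / 1_000_000:.1f} MB, {raw / 2**20:.1f} MiB")  # prints 50.0 MB, 47.7 MiB
```

Running `measure_set_bytes` on the actual set of URLs from a large WARC would give a more realistic figure, since Python adds per-object overhead on top of the raw text.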

codecov · 2023-12-22T14:36:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (eca6903) 91.86% compared to head (d0dd6d7) 92.08%.

Additional details and impacted files

@@                   Coverage Diff                   @@
##           better_wombat_setup     #142      +/-   ##
=======================================================
+ Coverage                91.86%   92.08%   +0.21%     
=======================================================
  Files                        9        9              
  Lines                      701      720      +19     
  Branches                   116      121       +5     
=======================================================
+ Hits                       644      663      +19     
  Misses                      40       40              
  Partials                    17       17

mgautierfr · 2023-12-22T16:16:03Z

> I find it very weird to do so many things in the WARCPayloadItem class's __init__ method; rewriting item URLs in this class seems especially odd. This is made particularly visible by the fact that we need to pass all known URLs to the class, which means this class knows a lot about the surrounding logic. I would have expected this class to be more of an adapter to some ZIM specificities, based on pre-processed information.

I partly agree. This __init__ method has become a bit too big.
We could move the rewriting part into a separate function, but then that function would be the one that knows a lot about the surrounding logic. Either we pass the information to __init__, which calls the rewrite function, or we call the rewrite function in the converter and pass the rewritten content to the item. The first case doesn't help much. In the second case, we could simply drop WARCPayloadItem and use StaticItem.

> I'm also concerned by the fact that we keep all known URLs in memory. We've already seen websites with more than 10k pages, and I expect we will have to deal with 100k pages at some point, meaning 100k URLs in memory. I don't know exactly how bad this could be (in terms of memory footprint), but I think we should at least be aware of this and track this risk/need in a specific issue.

Agreed. But as @rgaudin said, it should not be much of a problem. And I don't know how to do it without storing the URLs somewhere (or looping over the whole WARC for each URL rewrite).
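
The second option mentioned above (rewriting in the converter and handing finished content to a generic item) could look roughly like this. `PlainItem` and `convert_record` are hypothetical names used for illustration only; `PlainItem` merely stands in for a generic item class and does not reproduce any real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlainItem:
    """Hypothetical generic item: it only carries ready-to-store content."""

    path: str
    content: bytes


def convert_record(
    path: str, payload: bytes, rewrite: Callable[[bytes], bytes]
) -> PlainItem:
    # The converter owns the rewriting step; the item knows nothing about URLs.
    return PlainItem(path=path, content=rewrite(payload))
```

The trade-off is the one discussed in the comment: the item becomes trivial, but the knowledge about known URLs simply moves into whatever builds the `rewrite` callable.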

rgaudin

LGTM

benoit74

LGTM

I agree the problem I mentioned is not much of an issue. A simple solution (I didn't say straightforward) would have been to store all data in a local DB, typically SQLite, as is done for Gutenberg. Of course it comes with some drawbacks in terms of CPU / scrape duration, but at least it is scalable.

And you convinced me regarding the __init__ method; let's keep it like this until we find it becomes a real issue.
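
For reference, the SQLite alternative mentioned above could be sketched with the standard library alone. `KnownUrlStore` and its table layout are assumptions made for this example; they are not taken from warc2zim or from the Gutenberg scraper.

```python
import sqlite3
from typing import Iterable


class KnownUrlStore:
    """Set-like store of URLs backed by SQLite instead of process memory."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to keep the URLs on disk.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS known_urls (url TEXT PRIMARY KEY)"
        )

    def add_many(self, urls: Iterable[str]) -> None:
        # One transaction for the whole batch; duplicates are silently skipped.
        with self.conn:
            self.conn.executemany(
                "INSERT OR IGNORE INTO known_urls (url) VALUES (?)",
                ((url,) for url in urls),
            )

    def __contains__(self, url: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM known_urls WHERE url = ?", (url,)
        ).fetchone()
        return row is not None
```

Because it supports `in`, such a store could be swapped in wherever a plain `set` of URLs is consulted, at the cost of one query per lookup.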

mgautierfr · 2023-12-22T16:42:05Z

(Put in draft as it needs merging of #141 first)

mgautierfr requested review from rgaudin and benoit74 December 21, 2023 10:23

rgaudin requested changes Dec 21, 2023

mgautierfr changed the title ~~Make the different rewriter directly take a url_rewriter.~~ handle external links Dec 21, 2023

benoit74 requested changes Dec 21, 2023

mgautierfr force-pushed the better_wombat_setup branch from b9a0591 to 11d6267 Compare December 22, 2023 14:33

mgautierfr force-pushed the external_links branch from fea067d to 6969e4c Compare December 22, 2023 14:33

mgautierfr force-pushed the better_wombat_setup branch from 11d6267 to aec2c0a Compare December 22, 2023 14:49

mgautierfr force-pushed the external_links branch from 6969e4c to 1f10b11 Compare December 22, 2023 14:49

mgautierfr added 5 commits December 22, 2023 16:11

Make the different rewriter directly take a url_rewriter.

af47202

Make the converter know all the entries before starting the convertion.

508229d

Make the ArticleUrlRewriter potentially not rewrite a url.

94ac8ec

By default it rewrite all urls. But potentially not rewrite url not pointing to a known entry.

Don't rewrite all link in html content.

f024bd0

For html content, we want to keep external link in `a` tag.

Pass our known entry to the Item and to the content rewriter.

02eb854

mgautierfr force-pushed the better_wombat_setup branch from aec2c0a to eca6903 Compare December 22, 2023 15:11

kelson42 changed the title ~~handle external links~~ Handles external links (without SW) Dec 22, 2023

kelson42 linked an issue Dec 22, 2023 that may be closed by this pull request

Handle external links in warc2zim without SW #137

Closed

mgautierfr added 4 commits December 22, 2023 17:06

fixup! Make the converter know all the entries before starting the convertion.

eb77ee6

fixup! Make the ArticleUrlRewriter potentially not rewrite a url.

e6e7df6

fixup! Pass our known entry to the Item and to the content rewriter.

4b86346

fixup! Make the different rewriter directly take a url_rewriter.

d0dd6d7

mgautierfr force-pushed the external_links branch from 1f10b11 to d0dd6d7 Compare December 22, 2023 16:10

mgautierfr requested review from benoit74 and rgaudin December 22, 2023 16:16

rgaudin approved these changes Dec 22, 2023

benoit74 approved these changes Dec 22, 2023

mgautierfr mentioned this pull request Dec 22, 2023

New request: BBC Persian openzim/zim-requests#769

Open

mgautierfr marked this pull request as draft December 22, 2023 16:42

Base automatically changed from better_wombat_setup to warc2zim2 December 29, 2023 16:09

mgautierfr marked this pull request as ready for review December 29, 2023 16:09

mgautierfr merged commit ef6f425 into warc2zim2 Jan 3, 2024
12 checks passed

mgautierfr deleted the external_links branch January 3, 2024 12:24

mgautierfr mentioned this pull request Jan 3, 2024

Handle external links in warc2zim without SW #137

Closed
