Handles external links (without SW) #142
Conversation
Good!
```diff
@@ -282,17 +283,24 @@ def iter_all_warc_records(self):

         yield from iter_warc_records(self.inputs)

-    def find_main_page_metadata(self):
-        for record in self.iter_all_warc_records():
+    def gather_information_from_warc(self):
```
I think the name is misleading. It's all about the main page (and its metadata), and given how long this method already is, it's probably good not to add additional stuff to it.
The problem is that doing the main page search and the filling of known_url in two different steps means we loop over the warc entries twice (and one more time for the real processing).
I am not suggesting we change the behavior; just the name. Maybe `find_homepage_and_urls` or something.
I find it very weird to do so many things in the `WARCPayloadItem` class `__init__` method; rewriting item URLs in this class especially seems odd. This is made especially visible by the fact that we need to pass all known URLs to the class, which means this class knows a lot about the surrounding logic. I would have expected this class to be more of an adapter to some ZIM specificities, based on pre-processed information.
I'm also concerned by the fact that we keep all known URLs in memory. We've already seen websites with more than 10k pages, and I expect we will have to deal with 100k pages at some point, meaning 100k URLs in memory. I don't know exactly how bad this could be (in terms of memory footprint), but I think we should at least be aware of this / track this risk/need in a specific issue.
This is not a new concern; several scrapers need to store huge lists of entries/URLs. There's no average size for a URL, but even in UTF-8, most characters in most URLs are ASCII, so 500 bytes for an average URL can be seen as conservative. With a hundred thousand of them, that's just over 50 MB. Sure, that's a lot, but in the context of a scraper turning 100k requests into a ZIM, that's not an issue IMO.
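The back-of-the-envelope estimate above can be checked with a short Python sketch (the URL shapes below are hypothetical samples, not URLs from any real scrape):

```python
import sys

# Hypothetical sample: 100,000 distinct URLs of roughly 480 characters each,
# mirroring the ~500-byte-per-URL estimate in the comment above.
urls = {f"https://example.com/page/{i:06d}/" + "x" * 450 for i in range(100_000)}

# Raw character payload: about 48 MB before any Python object overhead.
payload = sum(len(u) for u in urls)
print(f"payload: {payload / 1e6:.1f} MB of characters")

# The actual in-memory footprint (string object headers + set buckets)
# is noticeably larger than the raw payload.
footprint = sys.getsizeof(urls) + sum(sys.getsizeof(u) for u in urls)
print(f"approx footprint: {footprint / 1e6:.1f} MB")
```

So the "just over 50 MB" figure holds as an order of magnitude, with Python's per-object overhead adding a few more megabytes on top.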
Force-pushed b9a0591 to 11d6267
Force-pushed fea067d to 6969e4c
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@              Coverage Diff               @@
##    better_wombat_setup     #142     +/-  ##
==============================================
+ Coverage           91.86%   92.08%   +0.21%
==============================================
  Files                   9        9
  Lines                 701      720      +19
  Branches              116      121       +5
==============================================
+ Hits                  644      663      +19
  Misses                 40       40
  Partials               17       17
```

View full report in Codecov by Sentry.
Force-pushed 11d6267 to aec2c0a
Force-pushed 6969e4c to 1f10b11
By default it rewrites all URLs, but it can skip rewriting URLs that don't point to a known entry. For HTML content, we want to keep external links in `a` tags.
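That decision rule can be sketched as follows (the function name, `known_urls` parameter, and the tag check are assumptions for illustration, not the scraper's actual API):

```python
def should_rewrite(url: str, known_urls: set[str], in_html_a_tag: bool) -> bool:
    """Decide whether a URL should be rewritten to an in-ZIM path.

    - URLs pointing to a known WARC entry are always rewritten.
    - Inside an HTML <a> tag, unknown (external) URLs are kept as-is,
      so outgoing links still work.
    - Everywhere else (e.g. <img> or <script> sources), unknown URLs
      are rewritten anyway so the ZIM never silently phones home.
    """
    if url in known_urls:
        return True
    return not in_html_a_tag
```

For example, an external link in an `a` tag is left alone (`should_rewrite("https://ext.example/x", set(), in_html_a_tag=True)` returns `False`), while the same URL in an image source would be rewritten.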
Force-pushed aec2c0a to eca6903
Force-pushed 1f10b11 to d0dd6d7
I partly agree. This
Agree. But as @rgaudin said, it should not be such a problem. And I don't know how to do it without storing the URLs somewhere (or looping over the whole warc for each URL rewrite).
LGTM
LGTM
I agree the problem I mentioned is not such an issue. A simple solution (I didn't say straightforward) would have been to store all data in a local DB, typically SQLite, as is done for Gutenberg. Of course it comes with some drawbacks in terms of CPU / scrape duration, but at least it is scalable.
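A minimal sketch of what that SQLite-backed URL store could look like, using only the stdlib `sqlite3` module (the class name and schema are my own for illustration, not code from this scraper or from Gutenberg):

```python
import sqlite3


class UrlStore:
    """Disk-backed set of known URLs, trading some CPU for memory."""

    def __init__(self, path: str = ":memory:") -> None:
        # A real scraper would pass a file path so URLs live on disk,
        # not in RAM; ":memory:" here keeps the sketch self-contained.
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

    def add(self, url: str) -> None:
        # PRIMARY KEY + OR IGNORE gives set semantics (no duplicates).
        self.db.execute("INSERT OR IGNORE INTO urls VALUES (?)", (url,))

    def __contains__(self, url: str) -> bool:
        # Membership test is an index lookup on the primary key.
        row = self.db.execute(
            "SELECT 1 FROM urls WHERE url = ?", (url,)
        ).fetchone()
        return row is not None


store = UrlStore()
store.add("https://example.com/page")
print("https://example.com/page" in store)
```

Because `__contains__` is implemented, existing `url in known_urls` call sites could switch from an in-memory `set` to such a store with minimal changes.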
And you convinced me regarding the `__init__` method; let's keep it like this until we find it becomes a real issue.
(Put in draft as it needs merging of #141 first)
Fixes #137

With this PR we don't rewrite HTML (`a href`) links when the URL is not in the WARC archive. Other links are always rewritten to avoid the ZIM file silently phoning home.