Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically redirect to articles with same checksum #86

Closed
wants to merge 1 commit into from

Conversation

benoit74
Copy link
Collaborator

Fix #33

This is clearly a first draft / proposal, open to review / fixes.

This change adds a method add_autodedup_filter to specify which paths should be checked for duplicates (checksums are only computed on this path, and checked for duplicates on this path, in order to limit the footprint).

This change modify the method add_item_for to first compute the SHA256 (to be sure to avoid collision) checksum of content or content at fpath; if duplicate is found, an add_redirect is done instead ; if no duplicate, the checksum is added to a dictionary of checksum => item_path.

@codecov
Copy link

codecov bot commented Apr 21, 2022

Codecov Report

Merging #86 (a3ad5c5) into master (fe0b5fa) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #86   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         1159      1186   +27     
=========================================
+ Hits          1159      1186   +27     
Impacted Files Coverage Δ
src/zimscraperlib/zim/creator.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe0b5fa...a3ad5c5. Read the comment docs.

@kelson42 kelson42 marked this pull request as draft April 21, 2022 07:09
@rgaudin rgaudin removed their request for review April 21, 2022 08:32
@benoit74
Copy link
Collaborator Author

@rgaudin @kelson42 I consider this code ready for review now, feel free to comment

@benoit74 benoit74 marked this pull request as ready for review April 21, 2022 12:53
@kelson42 kelson42 requested a review from rgaudin April 21, 2022 15:41
@kelson42
Copy link
Contributor

@rgaudin Could you please review this code?

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @benoit74 , sorry for the delay in responding to this PR.

I'll write my comments here as those are general ones mostly:

  • After giving it some though, I think this belongs in a separate module and not within the Creator.
    • The Creator is a central piece to every scraper and we should be conserative in adding features to it.
    • Creator is inherited from pylibzim which is mostly a C++ wrapper with its own challenges. Keeping it somewhat isolated is crutial for debug/maintenance.
    • By nature, this code is meant to consume a growing amount of memory. Puting it inside the Creator makes debugging memory leaks harder.
    • All the ZIM content passes through the Creator ; using moving parts such as ContentProviders.
    • The Creator is expected to live during the full lifespan of the process and python doesn't releases memory back to the system while running.
    • your approach is probably good for most scrapers but some super large ones may require an out-of-python version as well.
  • I'd suggest we use md5 instead of sha256
    • md5 is faster. This might have an important impact if numerous large files are passes through it.
    • we don't need the security.
    • digest is half the size so half the memory consumption.
    • note that Codefactor will complain so you'll probably need to use # nosec.

Could you move your code to a new cache module ? You may want to offer the feature as an independant tool, a Creator subclass or a Mixin depending on whether you want to keep the single-step addition or not.

Let me know if this makes sense or not. A docstring on those two methods might help as well.

@rgaudin
Copy link
Member

rgaudin commented Apr 27, 2022

@benoit74 if you have reasons to believe that md5 will generate many exceptions (@kelson42 tells me you mentioned it somewhere), you may take a look at http://cyan4973.github.io/xxHash/ (from openzim/libzim#614)

@benoit74
Copy link
Collaborator Author

Hi @rgaudin, thank you for the review.

To be honest, I'm embarrassed about it.

First because I'm quite sure to not have the competencies / time to do the requested modifications, second because I'm not sure to really agree about your arguments / PoV.

I won't start a detailed discussion about it here due to the combination of these both issues.

In addition, since you will be the one(s) to support the long term maintenance of this library, I would be more than embarrassed to "force" you to do something that could be proven to be wrong / costly without supporting these consequences myself.

I will hence close this PR and live with my own non-standard implementation in my scrapper.

No worries anyway, I'm not upset at all, I just have my own agenda / priorities and you have yours, quite normal.

@benoit74 benoit74 closed this Apr 27, 2022
@rgaudin
Copy link
Member

rgaudin commented May 2, 2022

@benoit74, I'm glad you're not upset ; definitely not my intent but I am a bit disappointed myself. Not willing to do a change or nor having time to is 100% OK. You're volunteering time on this project and we're thankful for it.

Yet, not sharing your disagreements is meh (can't find a better word). We're all doing this in the open so we can all benefit from other's point of views and experience.

We may find time to implement this at some point, or another volunteer could or maybe my assumptions are incorrect and the PR should be merged… but we'd need to know why we're wrong or why you think we are.

I do know how time and resource consuming it can be to argument or demonstrate a point. You think it's not worth it and that's probably true from your POV but that's what collaboration is about.

We understand that time is precious which is why we're OK with pointers, approximations or hunches ; it's all over our repositories. Let us know what you think is wrong and we'll have a chance to think about/research it. I'm not saying we'll do it either way but a criticism is a chance to improve. Thanks again for your contribution. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatically redirect to articles with same checksum
3 participants