ZIM Deletion Policy #103

Open
rgaudin opened this issue May 3, 2024 · 9 comments

@rgaudin
Member

rgaudin commented May 3, 2024

As agreed during the Hackathon, here's a request to the Content Team for a ZIM Deletion Policy. We want a Wiki entry that lists all the possible reasons for deleting a ZIM. Requests for deletion will then need to provide said reason.

We've discussed that one of the reasons could be metadata no longer aligned with our Q/A standard. We ultimately want to allow the content team to fix that themselves (@benoit74 to open a ticket on zimfarm). In the meantime, individual delete-requests can exceptionally be fixed by developers using zimrecreate.

@benoit74
Collaborator

benoit74 commented May 6, 2024

Here is a proposed deletion policy. Please provide feedback quickly; I would like to enforce it by the end of May 2024 at the latest. WDYT?

Deletion policy

On an exceptional basis, it is possible to delete a ZIM which has been published on library.kiwix.org; we need to ensure this remains exceptional. As a publisher we somehow promise users to make our best effort so they keep access to the content we've published once. And as a publisher we need to enforce Q/A so that published ZIMs are known to be OK before publication.

Scope details:

  • only ZIMs published on "production" libraries like library.kiwix.org are concerned by this deletion policy
  • ZIMs published on dev.library.kiwix.org are not covered by any deletion policy so far; they may be deleted at any time

The only acceptable reasons to delete a ZIM are:

  • ZIM content is now known to be significantly broken (e.g. content not displaying at all, lots of missing content or broken links)
  • ZIM content is now known to have copyright issues (no need to wait for a complaint from the content owner before deleting)
  • ZIM content is now known to be wrong (wrong information, false theories, very outdated - not old - information, ...)
  • ZIM has very recently been published by mistake, without the Q/A job having been done
  • ZIM uses a technical format which is not supported by a majority of readers
  • ZIM has been superseded by two more recent versions (this deletion is already automated)

Except for the last one, none of these reasons is expected to occur on a regular basis, and some have never occurred in the past, so we expect them to keep leading to a very low level of ZIM deletions.

The following reasons are not acceptable:

  • ZIM metadata is incorrect (e.g. typos, not in line with our Q/A standard)
    • Tooling already exists and must continue to be developed so that the Content team can fix metadata issues without developer involvement
  • Online content is not available anymore
    • While this means we probably cannot update the ZIM anymore, this is not a sufficient reason to stop publishing a ZIM which is in working shape and probably valuable to some users (and potentially even more so now that the online source is gone)
  • Content is old
    • Being old does not mean being outdated; only outdated content might be considered for deletion
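
Since deletion requests will need to state one of the acceptable reasons, the lists above could be checked mechanically. The following is only a minimal sketch under stated assumptions: `DeletionReason`, `DeletionRequest` and `validate` are hypothetical names invented for illustration, not part of Zimfarm or any existing Kiwix tooling.

```python
from dataclasses import dataclass
from enum import Enum


class DeletionReason(Enum):
    """Acceptable reasons from the proposed policy (identifiers are hypothetical)."""
    BROKEN_CONTENT = "content is significantly broken"
    COPYRIGHT_ISSUE = "content has copyright issues"
    WRONG_CONTENT = "content is known to be wrong"
    PUBLISHED_BY_MISTAKE = "published by mistake, without Q/A"
    UNSUPPORTED_FORMAT = "format not supported by a majority of readers"
    SUPERSEDED = "superseded by two more recent versions"


@dataclass
class DeletionRequest:
    zim_name: str
    library: str  # e.g. "library.kiwix.org" or "dev.library.kiwix.org"
    reason: str   # expected to be the name of a DeletionReason member


def validate(request):
    """Return (accepted, message) for a deletion request."""
    # The dev library is out of scope: its ZIMs may be deleted at any time.
    if request.library == "dev.library.kiwix.org":
        return True, "dev library: not covered by the deletion policy"
    try:
        reason = DeletionReason[request.reason]
    except KeyError:
        # Covers the non-acceptable reasons (bad metadata, source offline, old content).
        return False, f"'{request.reason}' is not an acceptable reason"
    return True, f"accepted: {reason.value}"
```

For example, a request citing `COPYRIGHT_ISSUE` on library.kiwix.org would be accepted, while one citing a made-up `OLD_CONTENT` reason would be rejected.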

@benoit74
Collaborator

benoit74 commented May 6, 2024

@RavanJAltaie FYI, suggestions are welcome

@benoit74
Collaborator

benoit74 commented May 6, 2024

Zimfarm issue about metadata update is here: openzim/zimfarm#956

@rgaudin
Member Author

rgaudin commented May 6, 2024

LGTM.

@Popolechien do you remember the wikipedia_en_all_maxi ZIM that had one article defaced with racist content at the time of scrape? 2023-10-07

I think this would match “ZIM content is now known to be wrong” but we'd still have to discuss case-by-case whether it's worth deleting (as we know vandalized articles are most likely included in every ZIM).

@Popolechien
Member

Yeah I don't think this particular case fits in the reasons listed, but then this seems fairly common sense. Maybe add something along the lines of "Zim content deviates significantly from educational mission". There's another zim that has been flagged recently as moving away from prepper content/thematics to simple product placement: I still need to look into it, but to me that would also warrant removal.

Other than that, I would remove this sentence from the intro:

As a publisher we somehow promise users to make our best effort so they keep access to the content we've published once.

Not sure about the somehow (sounds weird to me), but more broadly that makes us an archival project, which I don't really agree with (plus the fact that we don't make any effort to ensure compatibility with older zim files; nor can we afford to, as a matter of fact)

@benoit74
Collaborator

benoit74 commented May 7, 2024

Yeah I don't think this particular case fits in the reasons listed, but then this seems fairly common sense

I don't agree; a policy is meant to avoid relying on common sense, since it is clear that this is too much a matter of interpretation.

I would add a reason like "ZIM contains vandalized / defaced content on important pages". I'm a bit afraid this is still a bit too subjective, but the past showed us that we made the decision to delete the ZIM for one single vandalized page, so it seems this is the path we want to follow.

Maybe add something along the lines of "Zim content deviates significantly from educational mission".

I would make it even broader with "ZIM content does not match acceptable content policy (educational mission, ...)"

Not sure about the somehow (sounds weird to me), but more broadly that makes us an archival project, which I don't really agree with

I don't mind removing the "somehow". But I still don't think this sentence makes us an archival project, and I consider it very important. Most content providers have the same kind of core promise.

For instance, StackExchange gets contributions because they promise users will continue to get access to the published content for "the time being". StackExchange has a strong policy on which questions might get deleted at https://meta.stackexchange.com/help/deleted-questions (and they do delete a lot AFAIK). Without both, I'm quite sure the project would fade out quickly.

If we remove this sentence, then I don't get why we would really need a deletion policy, or what could help us decide what is acceptable or not in this policy. I would consider that we might delete any ZIM which no longer suits any of us, whatever the reason, since that is clearly the path of least effort and our available bandwidth is very limited anyway.

To help me better understand, I would probably benefit from another "core promise" which explains why the deletions I've listed as not acceptable are indeed not acceptable. Otherwise it seems to me this will always be a topic of debate.

That being said, if at least we are all aligned today on the acceptable reasons, I don't mind removing the sentence if a majority is not OK with it (I don't like consensus ^^)

@rgaudin
Member Author

rgaudin commented May 7, 2024

but the past showed us that we made the decision to delete the ZIM for one single vandalized page, so it seems this is the path we want to follow.

Very important clarification: we did not remove that content from the Catalog. We removed one ZIM file because we keep two specifically for such reasons. If the latest one out of the Zimfarm has an issue, we can delete it and continue to serve the content (we only serve one version of a Title at once). Also, that content is being refreshed periodically (but recreating is fragile and takes time).
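
The retention behaviour described here (keep two ZIMs per Title, serve only one, fall back to the older one if the latest is deleted) can be sketched as follows. This is an illustrative assumption, not the actual library or Zimfarm implementation: `prune` and `served_version` are hypothetical helpers, and versions are assumed to be identified by sortable period strings such as "2023-10".

```python
def prune(versions, keep=2):
    """Keep only the `keep` most recent versions of a Title.

    A ZIM superseded by two more recent versions falls out of the result.
    """
    return sorted(versions)[-keep:]


def served_version(versions, deleted=()):
    """Return the single version served for a Title: the most recent non-deleted one.

    Returns None when nothing is left to serve.
    """
    remaining = sorted(v for v in versions if v not in deleted)
    return remaining[-1] if remaining else None


# Hypothetical periods for one Title.
kept = prune(["2023-08", "2023-09", "2023-10"])
# Deleting the latest ZIM still leaves the previous one to serve.
fallback = served_version(kept, deleted={"2023-10"})
```

Under these assumptions, deleting the latest ZIM removes one file without removing the Title from the Catalog.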

I think in my mind the policy was for removing content and not individual ZIMs when there's another one, but it's probably the place to clarify both situations

@benoit74
Collaborator

benoit74 commented May 7, 2024

I think in my mind the policy was for removing content and not individual ZIMs when there's another one, but it's probably the place to clarify both situations

It is named "ZIM deletion policy", so I thought we wanted to deal with individual ZIMs. This is intentional on my side, and the reason why I clearly mentioned these "two more recent versions". It is also probably the right granularity for such a policy, since deletion requests are usually made at the ZIM level anyway (not the content level).

@Popolechien
Member

We will never be able to cover every possible way things can go wrong, unless the policy goes into so much detail that it becomes irrelevant. There will always be some level of arbitrary decision.

For the case referred to of a specific Zim file with problematic content, the informal policy we had with @RavanJAltaie is "Do people complain, which means that people notice?". That allows us to identify high-traffic, high-visibility zim files/pages that need immediate action (whereas low-traffic ones can be automatically handled by the next scraper iteration).

A choice has to be made between "Delete old zim files, with exceptions" and "Do not delete zim files, with exceptions". Finding a wording that intersects both would be ideal.
