Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce addAlias method. #833

Merged
merged 7 commits into from
Nov 28, 2023
Merged

Introduce addAlias method. #833

merged 7 commits into from
Nov 28, 2023

Conversation

mgautierfr
Copy link
Collaborator

This is the first PR for kiwix/overview#95

Fix #824

@codecov
Copy link

codecov bot commented Oct 30, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (ca1bcee) 57.57% compared to head (fcc3eec) 57.60%.

❗ Current head fcc3eec differs from pull request most recent head 9a59af5. Consider uploading reports for the commit 9a59af5 to get more accurate results

Files Patch % Lines
src/writer/creator.cpp 37.50% 4 Missing and 6 partials ⚠️
src/writer/_dirent.h 64.28% 3 Missing and 2 partials ⚠️
src/item.cpp 0.00% 2 Missing ⚠️
src/writer/dirent.cpp 83.33% 0 Missing and 1 partial ⚠️
src/writer/direntPool.h 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
+ Coverage   57.57%   57.60%   +0.02%     
==========================================
  Files          99       99              
  Lines        4542     4590      +48     
  Branches     1909     1922      +13     
==========================================
+ Hits         2615     2644      +29     
- Misses        668      677       +9     
- Partials     1259     1269      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation information about presence of clones in nowhere recorded.

Have conscious decisions been made (and properly implemented) regarding the following questions:

  1. How should clones be accounted for in media type counts?
  2. How should full text indexing on cloned entries work? If FT indexing associates a document only with the entry added via the zim::Creator::addItem() method, then the cloned dirents better be named aliases because of the resulting asymmetry.
  3. Shouldn't it be possible to iterate over items while skipping cloned entries (i.e. having each data item returned only once)?
  4. Shouldn't it be possible to find out all the clone-siblings of a given dirent?

src/writer/_dirent.h Outdated Show resolved Hide resolved
include/zim/writer/creator.h Outdated Show resolved Hide resolved
src/writer/creator.cpp Outdated Show resolved Hide resolved
src/writer/direntPool.h Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Collaborator Author

In the current implementation information about presence of clones in nowhere recorded.

It is somehow intended. In fact nothing in the specification (from the beginning) prevent two dirent to point to the same data. It is just that no implementation allow that.
I don't see why we should inform about their presence. Especially, reader should always be prepared to that.

Have conscious decisions been made (and properly implemented) regarding the following questions:

  • How should clones be accounted for in media type counts?
  • How should full text indexing on cloned entries work? If FT indexing associates a document only with the entry added via the zim::Creator::addItem() method, then the cloned dirents better be named aliases because of the resulting asymmetry.

Indeed. For now, clones are handled "as redirect":

  • They are not counted in the media type counter
  • They are not fulltext indexed. (Only title)

This is mainly a technical limitation as we don't have access to the content (only the cluster/blob index) when we add the clone.

Shouldn't it be possible to iterate over items while skipping cloned entries (i.e. having each data item returned only once)?
Shouldn't it be possible to find out all the clone-siblings of a given dirent?

I'm not sure we need that. The only need I see for now is in zimcheck but I handle that there.


This new feature may evolve and be used for something not foreseen. But for now, the use case is to "reference" binary content (as youtube video) under different names for zimit files.

[...] then the cloned dirents better be named aliases because of the resulting asymmetry.

This a good point. I agree with you. Maybe alias is a better name.

I have named it with a generic name as it is technically a generic feature but the intended use case is pretty limited:

  • We will not clone articles (so no problem about indexation)
  • This is used to resolve dynamically generated url, so the content must be counted for one.

So I don't think we need to do more, at least for now. If other use cases appear, we may extend our API accordingly.

include/zim/writer/creator.h Show resolved Hide resolved
src/writer/direntPool.h Outdated Show resolved Hide resolved
@veloman-yunkan
Copy link
Collaborator

Shouldn't it be possible to iterate over items while skipping cloned entries (i.e. having each data item returned only once)?
Shouldn't it be possible to find out all the clone-siblings of a given dirent?

I'm not sure we need that. The only need I see for now is in zimcheck but I handle that there.

Recovering information about a dirent being an alias requires extra effort. Can't we preserve it easily? For example is Dirent::parameter used for anything? If not, why don't we record the dirent's purpose there?

@mgautierfr
Copy link
Collaborator Author

Dirent::parameter used for anything? If not, why don't we record the dirent's purpose there?

Dirent::parameter is indeed not used for now. But is our only way to extend a dirent without changing the zim format so if we start to use it, we have to define a way to use it for several things. If we use it "blindly" to store "isClone" we lost a way to store other values.

The idea of this change was it was "creator" only and do not change anything on reading part.
Let's me think on this 🙂

@kelson42 kelson42 modified the milestones: 9.0.0, 10.0.0 Nov 12, 2023
@kelson42
Copy link
Contributor

@mgautierfr Status not clear to me, seems you are over your side but new review has not been requested.

@mgautierfr
Copy link
Collaborator Author

Status not clear to me, seems you are over your side but new review has not been requested.

PR updated and ready for another review.

What is left open is the question about extending the format to use Dirent::parameter.
I still not convince about that. But neither have a strong opinion on the opposite side.

I would suggest that we go with this PR and discuss the extension of the format in a issue.
But the issue would have to be fixed/resolved BEFORE we do another release and start creating zim file with alias.

As long as I have not finished kiwix/overview#95, there is no need to merge it. So maybe we can simply continue the review process to agree with @veloman-yunkan.

@kelson42
Copy link
Contributor

kelson42 commented Nov 25, 2023

@veloman-yunkan

What is left open is the question about extending the format to use Dirent::parameter.
I still not convince about that. But neither have a strong opinion on the opposite side.

I'm not convince about the necessity to do that in general, in right now in particular, I would recommend to open a dedicated ticket to keep track of this idea.

But it is pretty urgent to move forward with the code review process!

That said, we should be able to identify clearly a ZIM file which clearly potentialy have such clones (even if this was allowed implicitly in the past, it will be from now explicitly) I would recommend to:

  • increment the ZIM version number by a minor version
  • Update the spec to clearly say it is allowed

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Fixup commits must be eliminated.

This allow user to add a new entry which is a `Clone` of a previously
added one.

The new entry is the "same" that the original one but for the
path and title.
This is use in zim-check to not count clone entries as duplicated entries.
@mgautierfr mgautierfr merged commit abf7c11 into main Nov 28, 2023
28 checks passed
@mgautierfr mgautierfr deleted the clone_entry branch November 28, 2023 15:07
@veloman-yunkan veloman-yunkan changed the title Introduce add_clone method. Introduce addAlias method. Dec 5, 2023
@kelson42 kelson42 modified the milestones: 11.0.0, 9.1.0 Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Libzim should support alias creation.
3 participants