Implement OPDS "search" sorting #702

Open
kelson42 opened this issue Feb 5, 2022 · 6 comments

@kelson42
Collaborator

kelson42 commented Feb 5, 2022

Currently there is no sorting at all. The results should be sorted by descending popularity. This means that if we only filter for ZIM files in French, the feed simply returns all the French content by descending popularity. If we search for French content with the text pattern "Wikipedia", it returns only the matching entries, sorted the way Xapian does, with popularity taken into account as the last criterion (we should get all the ZIM files with "wikipedia" in the title or description, sorted by popularity).

This will become necessary once we have introduced the popularity attribute, see #489.
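
As a rough illustration of the ordering described above, here is a minimal C++ sketch. The `SearchHit` struct and its `relevance` and `popularity` fields are hypothetical: libkiwix has no popularity attribute yet (see #489), and the real relevance ordering would come from Xapian rather than from a stored score.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical search hit; these fields are NOT part of the libkiwix Book class.
struct SearchHit {
  std::string title;
  double relevance;     // assumed text-search score (equal for all hits when only filtering)
  unsigned popularity;  // assumed popularity attribute, see #489
};

// Order hits by relevance first and use popularity as the last criterion,
// both descending.
void sortHits(std::vector<SearchHit>& hits)
{
  std::stable_sort(hits.begin(), hits.end(),
    [](const SearchHit& a, const SearchHit& b) {
      if (a.relevance != b.relevance)
        return a.relevance > b.relevance;
      return a.popularity > b.popularity;
    });
}
```

With a pure language filter every hit has the same relevance, so the result degenerates to a plain popularity ordering, which matches the first case described above.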

But there is another need. I wanted to be informed about the latest ZIM files published via OPDS and noticed that there is no sorting by date (descending)... and there is not even a way to filter by a creation date range.
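
The date-based request could be sketched as follows. This is only an illustration under assumptions: the `Entry` struct and the `latestEntries` helper are hypothetical names, and dates are assumed to be `YYYY-MM-DD` strings, which compare correctly as plain strings.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical catalog entry; `date` is assumed to be a "YYYY-MM-DD" string.
struct Entry {
  std::string name;
  std::string date;
};

// Keep entries with from <= date <= to (an empty bound means "unbounded"),
// then sort them newest first.
std::vector<Entry> latestEntries(std::vector<Entry> entries,
                                 const std::string& from,
                                 const std::string& to)
{
  entries.erase(std::remove_if(entries.begin(), entries.end(),
    [&](const Entry& e) {
      return (!from.empty() && e.date < from) || (!to.empty() && e.date > to);
    }), entries.end());
  std::stable_sort(entries.begin(), entries.end(),
    [](const Entry& a, const Entry& b) { return a.date > b.date; });
  return entries;
}
```

Calling `latestEntries(all, "2021-06-01", "")` would return everything created on or after 1 June 2021, most recent first.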

@kelson42 kelson42 added this to the 10.1.0 milestone Feb 5, 2022
@kelson42 kelson42 added this to To do in Better library Feb 5, 2022
@veloman-yunkan veloman-yunkan self-assigned this Feb 5, 2022
@veloman-yunkan
Collaborator

Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:

OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry.

OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.

Thus

  • atom:updated is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.
  • atom:published is the earliest value that atom:updated had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.
  • dc:issued is the time when the actual publication was issued. It is unambiguous for publications that exist only as a hardcopy or only in digital form. However, if we consider a paper publication that was later digitized, or a book that was first published online and printed on paper later, should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).
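
To make the distinction between the three dates concrete, here is a small C++ sketch. The `EntryDates` struct and the `renderDates` helper are hypothetical and are not libkiwix code; the element names assume Atom is the feed's default namespace and that `dc` is bound to the Dublin Core namespace.

```cpp
#include <string>

// Hypothetical holder for the three dates discussed above.
struct EntryDates {
  std::string updated;    // when the OPDS entry was last updated (required)
  std::string published;  // when the OPDS entry first became accessible (optional)
  std::string issued;     // first publication date of the publication itself (optional)
};

// Render the date elements of one OPDS entry; empty optional values are skipped.
std::string renderDates(const EntryDates& d)
{
  std::string xml = "<updated>" + d.updated + "</updated>\n";
  if (!d.published.empty())
    xml += "<published>" + d.published + "</published>\n";
  if (!d.issued.empty())
    xml += "<dc:issued>" + d.issued + "</dc:issued>\n";
  return xml;
}
```

Under this reading, the ZIM creation date would go into `issued`, while `updated` and `published` would describe the catalog entry rather than the ZIM file.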

While doing this small research I found out that in our OPDS streams we populate the atom:updated field with the book creation date (which is against the spec):

    kainjow::mustache::object getSingleBookData(const Book& book)
    {
      const MustacheData bookUrl = book.getUrl().empty()
                                 ? MustacheData(false)
                                 : MustacheData(book.getUrl());
      return kainjow::mustache::object{
        {"id", book.getId()},
        {"name", book.getName()},
        {"title", book.getTitle()},
        {"description", book.getDescription()},
        {"language", book.getLanguage()},
        {"content_id", urlEncode(book.getHumanReadableIdFromPath(), true)},
        {"updated", book.getDate() + "T00:00:00Z"},
        {"category", book.getCategory()},
        {"flavour", book.getFlavour()},
        {"tags", book.getTags()},
        {"article_count", to_string(book.getArticleCount())},
        {"media_count", to_string(book.getMediaCount())},
        {"author_name", book.getCreator()},
        {"publisher_name", book.getPublisher()},
        {"url", bookUrl},
        {"size", to_string(book.getSize())},
        {"icons", getBookIllustrationInfo(book)},
      };
    }

Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?

@kelson42
Collaborator Author

kelson42 commented Feb 6, 2022

@veloman-yunkan Thank you for this research work, this is enlightening!

> Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:
>
> OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry.
>
> OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.
>
> Thus
>
> * `atom:updated` is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.

If I understand correctly, all these values will be reset whenever kiwix-serve is restarted. If it works like this, I hardly see how it could be useful at all; it would actually be pretty misleading IMO.

The only scenario I can imagine is the same file with some of its metadata changed. This does not happen now, but will happen once the CMS is in production. In such a scenario, libkiwix/kiwix-serve cannot know that something has changed (it has no persistent memory across restarts of kiwix-serve). This should be handled in library.xml.

> * `atom:published` is the earliest value that `atom:updated` had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.

OK, but IMO this value can only be set by the CMS and not automatically handled by libkiwix/kiwix-serve.

> * `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized or a book that was first published online and printed on paper later should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).

I think it should be the time when the ZIM is created, but your question is really pertinent and concrete to me. We should IMO track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

> Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?

Yes, this is wrong in my opinion too. It should be fixed.

@veloman-yunkan
Collaborator

> > Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?
>
> Yes, this is wrong in my opinion too. It should be fixed.

Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

@veloman-yunkan
Collaborator

> Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

In #715 I added <dc:issued> to both legacy (/catalog) and current (/catalog/v2) OPDS feeds.

@kelson42
Collaborator Author

> > * `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized or a book that was first published online and printed on paper later should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).
>
> I think it should be the time when the ZIM is created, but your question is really pertinent and concrete to me. We should IMO track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

I have created a ticket to track this idea at openzim/overview#9

@stale

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

2 participants