Implement OPDS "search" sorting #702

kelson42 · 2022-02-05T17:39:20Z

Currently there is no sorting at all. The results should be sorted by descending popularity. This means that if we only filter for ZIM files in French, it will return all the French content sorted by descending popularity. If we search for French content with the text pattern "Wikipedia", it will return only the matching entries, sorted the way Xapian does, with popularity taken into account as the last criterion (we should get all the ZIM files with "wikipedia" in the title or description, sorted by popularity).

This will become necessary once we have introduced the popularity attribute, see #489.
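To make the intended ordering concrete, here is a minimal sketch of "relevance first, popularity as the last criterion". The `SearchHit` struct, its fields, and `sortHits` are hypothetical names for illustration only; they are not libkiwix or Xapian API, and the popularity attribute does not exist yet.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical search hit; neither the struct nor its fields are libkiwix API.
struct SearchHit {
  std::string title;
  int relevance;   // text-match score, higher is better (equal for all hits when no pattern is given)
  int popularity;  // assumed popularity attribute, higher is more popular
};

// Order hits by relevance, then by popularity, both descending.
inline void sortHits(std::vector<SearchHit>& hits) {
  std::stable_sort(hits.begin(), hits.end(),
                   [](const SearchHit& a, const SearchHit& b) {
                     if (a.relevance != b.relevance) {
                       return a.relevance > b.relevance;
                     }
                     return a.popularity > b.popularity;
                   });
}
```

With a language-only filter every hit would carry the same relevance, so the result degrades to a plain popularity ordering, which is the first case described above.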

But there is another need. I wanted to be informed about the latest ZIM files published via OPDS and I noticed that there is no sorting by date (descending)... and there is not even a way to filter by a creation date range.
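A rough sketch of the date part, assuming creation dates are `YYYY-MM-DD` strings (as the `book.getDate() + "T00:00:00Z"` concatenation quoted later in this thread suggests), so that plain string comparison matches chronological order. `Entry` and `latestInRange` are hypothetical names, not existing libkiwix code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical minimal record; the real libkiwix Book class is richer than this.
struct Entry {
  std::string name;
  std::string date;  // "YYYY-MM-DD": lexicographic order equals chronological order
};

// Keep the entries with from <= date <= to and return them newest first.
inline std::vector<Entry> latestInRange(std::vector<Entry> entries,
                                        const std::string& from,
                                        const std::string& to) {
  entries.erase(std::remove_if(entries.begin(), entries.end(),
                               [&](const Entry& e) {
                                 return e.date < from || e.date > to;
                               }),
                entries.end());
  std::stable_sort(entries.begin(), entries.end(),
                   [](const Entry& a, const Entry& b) { return a.date > b.date; });
  return entries;
}
```

Both bounds are inclusive here; that is a choice of this sketch, not something decided in this ticket.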

veloman-yunkan · 2022-02-06T16:32:01Z

Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:

> OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry.
>
> OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.

Thus

* `atom:updated` is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.
* `atom:published` is the earliest value that `atom:updated` had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.
* `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized, or a book that was first published online and printed on paper later, should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).

While doing this small research I found out that in our OPDS streams we populate the atom:updated field with the book creation date (which is against the spec):

libkiwix/src/opds_dumper.cpp

Lines 73 to 97 in dc4f9a4

    
    kainjow::mustache::object getSingleBookData(const Book& book)
    {
      const MustacheData bookUrl = book.getUrl().empty()
                                 ? MustacheData(false)
                                 : MustacheData(book.getUrl());
      return kainjow::mustache::object{
        {"id", book.getId()},
        {"name", book.getName()},
        {"title", book.getTitle()},
        {"description", book.getDescription()},
        {"language", book.getLanguage()},
        {"content_id", urlEncode(book.getHumanReadableIdFromPath(), true)},
        {"updated", book.getDate() + "T00:00:00Z"},
        {"category", book.getCategory()},
        {"flavour", book.getFlavour()},
        {"tags", book.getTags()},
        {"article_count", to_string(book.getArticleCount())},
        {"media_count", to_string(book.getMediaCount())},
        {"author_name", book.getCreator()},
        {"publisher_name", book.getPublisher()},
        {"url", bookUrl},
        {"size", to_string(book.getSize())},
        {"icons", getBookIllustrationInfo(book)},
      };
    }

Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?
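For illustration only, a spec-conforming split of the two dates could look roughly like the sketch below. `BookDates`, its `addedToLibrary` field, the `entryDates` helper, and the `"issued"` key are all hypothetical; this is not what `opds_dumper.cpp` does, and a plain `std::map` stands in for the mustache object.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the fields of interest; not the libkiwix Book class.
struct BookDates {
  std::string zimCreationDate;  // "YYYY-MM-DD", like the value of book.getDate() above
  std::string addedToLibrary;   // full timestamp recorded at library loading (assumed to exist)
};

// Build the date-related template values for one OPDS entry.
inline std::map<std::string, std::string> entryDates(const BookDates& b) {
  return {
    {"updated", b.addedToLibrary},                 // atom:updated: date of the catalog entry
    {"issued", b.zimCreationDate + "T00:00:00Z"},  // dc:issued: date of the publication (ZIM file)
  };
}
```

The open point in such a split is where the entry timestamp would come from, since the `Book` data shown above only exposes the ZIM creation date.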

kelson42 · 2022-02-06T21:17:51Z

@veloman-yunkan Thank you for this research work, this is enlightening!

> Before implementing this enhancement we need to make sure that it plays well with the OPDS spec. The spec mentions three fields that have to do with dates:
>
> > OPDS Catalog Entries must include an atom:updated element indicating when the OPDS Catalog Entry was last updated. A dc:issued element should be used to indicate the first publication date of the Publication and must not represent any date related to the OPDS Catalog Entry.
> > OPDS Catalog Entries may use atom:published to indicate when the OPDS Catalog Entry was first accessible.
>
> Thus
> * `atom:updated` is the date corresponding to the OPDS Entry (rather than the publication associated with it). In our case this should be the time when the book was added to the library during library loading or when the entry was updated during library reloading.

If I understand correctly, all these values will be reset whenever kiwix-serve is restarted. If it works like this, I hardly see how it could be useful at all; it would actually be pretty misleading IMO.

The only scenario I can imagine is the same file with a few metadata changed. This does not happen now, but it will once the CMS is in production. In such a scenario, libkiwix/kiwix-serve cannot know that something has changed (it has no persistent memory across kiwix-serve restarts). This should be handled in library.xml.

> * `atom:published` is the earliest value that `atom:updated` had for this OPDS feed entry. In our case this should be the time when the book was added to the library during library loading.

OK, but IMO this value can only be set by the CMS and not automatically handled by libkiwix/kiwix-serve.

> * `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized, or a book that was first published online and printed on paper later, should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).

I think it should be the time when the ZIM is created, but your question is really pertinent and concrete to me. We should IMO track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

> While doing this small research I found out that in our OPDS streams we populate the atom:updated field with the book creation date (which is against the spec):
>
> libkiwix/src/opds_dumper.cpp
>
> Lines 73 to 97 in dc4f9a4
>
>     kainjow::mustache::object getSingleBookData(const Book& book)
>     {
>       const MustacheData bookUrl = book.getUrl().empty()
>                                  ? MustacheData(false)
>                                  : MustacheData(book.getUrl());
>       return kainjow::mustache::object{
>         {"id", book.getId()},
>         {"name", book.getName()},
>         {"title", book.getTitle()},
>         {"description", book.getDescription()},
>         {"language", book.getLanguage()},
>         {"content_id", urlEncode(book.getHumanReadableIdFromPath(), true)},
>         {"updated", book.getDate() + "T00:00:00Z"},
>         {"category", book.getCategory()},
>         {"flavour", book.getFlavour()},
>         {"tags", book.getTags()},
>         {"article_count", to_string(book.getArticleCount())},
>         {"media_count", to_string(book.getMediaCount())},
>         {"author_name", book.getCreator()},
>         {"publisher_name", book.getPublisher()},
>         {"url", bookUrl},
>         {"size", to_string(book.getSize())},
>         {"icons", getBookIllustrationInfo(book)},
>       };
>     }
>
> Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?

Yes, this is wrong in my opinion too. It should be fixed.

veloman-yunkan · 2022-02-19T07:13:54Z

> > Now the question is - should we fix the inconsistency with the usage of the atom:updated field and put the ZIM file creation date in a dc:issued node instead?
>
> Yes, this is wrong in my opinion too. It should be fixed.

Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

veloman-yunkan · 2022-02-19T07:47:50Z

> Should we fix it both in /catalog and /catalog/v2 OPDS feeds or only in the latter?

In #715 I added <dc:issued> to both legacy (/catalog) and current (/catalog/v2) OPDS feeds.

kelson42 · 2022-02-19T09:52:59Z

> > `dc:issued` is the time when the actual publication was issued. It is unambiguous for publications that have only one of the hardcopy or digital representations. However, if we consider a paper publication that was then digitized, or a book that was first published online and printed on paper later, should we treat the hardcopy and the digital version as different representations of the same publication or as two different publications? I think we can use it to represent the creation date of ZIM files (though if a ZIM file represents a single real-world publication then we hit the mentioned interpretation problem).
>
> I think it should be the time when the ZIM is created, but your question is really pertinent and concrete to me. We should IMO track it in openzim/libzim or openzim/overview (we would need to update the ZIM specification).

I have created a ticket to track this idea at openzim/overview#9

stale · 2022-07-10T23:35:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 added the enhancement label Feb 5, 2022

kelson42 added this to the 10.1.0 milestone Feb 5, 2022

kelson42 added this to To do in Better library Feb 5, 2022

veloman-yunkan self-assigned this Feb 5, 2022

veloman-yunkan mentioned this issue Feb 19, 2022

Added <dc:issued> field to OPDS entries #715

Merged

kelson42 mentioned this issue Feb 19, 2022

Consider saving content publication date in the ZIM openzim/overview#9

Open

kelson42 mentioned this issue Feb 19, 2022

Add new dates to the publishing stream openzim/cms#58

Open

kelson42 modified the milestones: 10.1.0, 10.2.0 Mar 23, 2022

kelson42 mentioned this issue Apr 13, 2022

Option to sort libraries shown on kiwix-serve #752

Closed

stale bot added the stale label Jul 10, 2022

kelson42 mentioned this issue Aug 20, 2023

Online library filter/search API calls are not optimal kiwix/kiwix-desktop#957

Open
