Option to filter out duplicates in suggestions #276

Closed
kelson42 opened this issue Sep 1, 2019 · 9 comments · Fixed by #515

@kelson42
Contributor

kelson42 commented Sep 1, 2019

We should be able to avoid duplicates in suggestion searches. Currently, most of the time, the search returns the same article many times (partly via multiple redirects). This is not only useless, it also limits the number of alternatives proposed.

I would propose to:

  • Identify duplicates (via redirects)
  • Choose the best candidate (based on the Levenshtein distance between the search pattern and each candidate).
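
To make the two steps above concrete, here is a minimal sketch in plain Python. It is not libzim code: the `(title, target_path)` pairs, the `A/...` paths, and the function names are all hypothetical, and it assumes each candidate's redirect has already been resolved to the article it points to.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_suggestions(pattern, candidates):
    """Group candidate titles by the article they resolve to, then keep
    the title closest to the pattern in each group (hypothetical helper)."""
    by_target = defaultdict(list)
    for title, target in candidates:
        by_target[target].append(title)
    needle = pattern.lower()
    return [min(titles, key=lambda t: levenshtein(needle, t.lower()))
            for titles in by_target.values()]


# Made-up sample data: three titles resolve to the same article.
candidates = [
    ("México", "A/México"),
    ("Mexico", "A/México"),
    ("Mexico (país)", "A/México"),
    ("Mexico City", "A/Ciudad_de_México"),
]
result = best_suggestions("mexico", candidates)
```

With this sample data, `result` holds one title per distinct article instead of four suggestions.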
@holta

holta commented Nov 26, 2019

As an example — attempt a title-search for "mexico" in the search box/textfield in the top-right of...
https://library.kiwix.org/wikipedia_es_all_maxi_2019-09

Presents 2 UX challenges...

@kelson42 kelson42 pinned this issue Nov 26, 2019
@kelson42 kelson42 changed the title Option to filter duplicates Option to filter out duplicates Jan 5, 2020
@kelson42 kelson42 changed the title Option to filter out duplicates Option to filter out duplicates in suggestions May 1, 2020
@kelson42 kelson42 unpinned this issue Jun 30, 2020
@kelson42
Contributor Author

@maneeshpm My thoughts about this problem have evolved since I wrote the ticket. Here is what I believe is a better approach to solve it:

  • At indexing time, if the article is a redirect, write the targeted article URL instead of the redirect URL in the Xapian title index.
  • At search time, keep track of the suggestion URLs, and if a URL appears a second time (going from the most relevant to the least relevant), just skip that suggestion.

That way we are almost sure to avoid duplicate suggestions, and we rely on Xapian's ranking algorithm.
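
The two bullets above could look roughly like this. It is a plain-Python sketch, not the libzim writer or the Xapian API: the `(title, path, redirect_target)` tuples, the function names, and the sample paths are assumptions made for illustration, and the index order stands in for Xapian's ranking.

```python
def build_title_index(entries):
    """entries: (title, path, redirect_target) tuples, where redirect_target
    is None for a real article. A redirect stores its target's path."""
    return [(title, target if target is not None else path)
            for title, path, target in entries]


def unique_suggestions(ranked_rows, limit=10):
    """ranked_rows: (title, stored_path) rows, most relevant first.
    Keeps only the first row seen for each stored path."""
    seen = set()
    kept = []
    for title, path in ranked_rows:
        if path in seen:
            continue  # a better-ranked title already points to this article
        seen.add(path)
        kept.append((title, path))
        if len(kept) == limit:
            break
    return kept


# Made-up entries: "Mexico" is a redirect to the article "México".
entries = [
    ("Mexico", "A/Mexico", "A/México"),
    ("México", "A/México", None),
    ("Mexico City", "A/Mexico_City", None),
]
index_rows = build_title_index(entries)
suggestions = unique_suggestions(index_rows)
```

Because the redirect row stores `A/México`, the second row is recognised as a repeat and skipped, leaving two suggestions.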

@mgautierfr
Collaborator

> At the indexing time, if the article is a redirect, write the targeted article URL instead of the redirect URL in the Xapian title index

I'm not against that, but three points to note:

  • The current implementation of kiwix-serve suggestions is based on the title, so even if we store the targeted article URL in Xapian, we will still have a redirection on the browser side. (But 1. this can be changed; 2. it is not really important for the feature of removing duplicates itself.)
  • On the Xapian side we will lose the information that this is a redirection. To get it back we would need to do a "classic" findByTitle and check the dirent found. (But do we really need this information?)
  • This may not work for chained redirections, as we would store only the "first" targeted URL. (But do we have a lot of chained redirections? Is it an acceptable limitation?)

> At the searching time, keep track of the suggestions URL and if a URL appears a second time (from the most relevant to the least) then just skip the suggestion.

It is probably possible to ask Xapian to group the results by URL, or even just to ensure that each URL is unique.

@kelson42
Contributor Author

kelson42 commented Mar 4, 2021

> > At the indexing time, if the article is a redirect, write the targeted article URL instead of the redirect URL in the Xapian title index
>
> I'm not against that, but three points to note:
>
> * The current implementation of kiwix-serve suggestions is based on the title, so even if we store the targeted article URL in Xapian, we will still have a redirection on the browser side. (But 1. this can be changed; 2. it is not really important for the feature of removing duplicates itself.)

I'm talking about suggestions in general and the system based on Xapian. The special case of kiwix-serve is not of much interest to me at this stage. Its way of working is "wrong" and needs to be fixed; see kiwix/kiwix-tools#205.

> * On the Xapian side we will lose the information that this is a redirection. To get it back we would need to do a "classic" findByTitle and check the dirent found. (But do we really need this information?)

Agreed, I don't believe this information is necessary (though I'm not happy about losing it either).

> * This may not work for chained redirections, as we would store only the "first" targeted URL. (But do we have a lot of chained redirections? Is it an acceptable limitation?)

Yes, this will not work in that case, so we might end up with two similar titles ultimately pointing to the same non-redirect article.
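
The chained-redirect limitation can be shown with a small plain-Python sketch. The `redirects` mapping and the paths are made up, and `final_target` is shown only for contrast; nobody in this thread proposes following the whole chain.

```python
def first_hop(path, redirects):
    """What the proposal stores: the immediate target of a redirect,
    or the path itself when it is not a redirect."""
    return redirects.get(path, path)


def final_target(path, redirects):
    """Follow the chain to its end, guarding against redirect loops."""
    visited = set()
    while path in redirects and path not in visited:
        visited.add(path)
        path = redirects[path]
    return path


# Hypothetical chain: A/Mexico -> A/Méjico -> A/México (the real article).
redirects = {"A/Mexico": "A/Méjico", "A/Méjico": "A/México"}

stored_first = {p: first_hop(p, redirects) for p in redirects}
stored_final = {p: final_target(p, redirects) for p in redirects}
```

With first-hop storage the two redirects carry different URLs (`A/Méjico` and `A/México`), so URL-based deduplication keeps both. Only full resolution maps them to the same article.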

> > At the searching time, keep track of the suggestions URL and if a URL appears a second time (from the most relevant to the least) then just skip the suggestion.
>
> It is probably possible to ask Xapian to group the results by URL, or even just to ensure that each URL is unique.

If Xapian can do that, then I obviously vote for this approach. @maneeshpm this is on you ;)

@maneeshpm
Collaborator

I agree with your approach @mgautierfr. The results are returned as a Xapian::MSet object, which cannot be modified directly like a list, so we will have to use a collapse_key at search time to do this task for us. At indexing time, we could use the redirectPath attribute where appropriate instead of directly using path in the index.

@kelson42 kelson42 added this to the libzim 7.0.0 milestone Mar 5, 2021
@maneeshpm
Collaborator

maneeshpm commented Mar 6, 2021

@mgautierfr @kelson42 It seems that using a keymaker is currently supported only for sorting, not for collapsing an MSet, so doing this directly with Xapian is not possible and we need a custom solution.
The issue is that we store the path as the data of the index, whereas Xapian uses values for tasks such as collapsing. If we used the data as a sort or collapse parameter, we would need to check and compare each entry, which would hurt performance.

@kelson42
Contributor Author

kelson42 commented Mar 6, 2021

@maneeshpm So if we store the path as a value, it will work? If so, we should do that and ensure backward compatibility.

@maneeshpm
Collaborator

If we store the path as a value, it will be as easy as setting the collapse key to the value slot of the path. We should probably do that for future ZIMs. I am working out a way to ensure backward compatibility as well.
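
In Xapian itself this would mean writing the path with `Xapian::Document::add_value()` at indexing time and calling `Xapian::Enquire::set_collapse_key()` with the same slot at search time. The sketch below does not call Xapian. It is a plain-Python model of collapsing on a value slot, and the slot number, the document layout, and the rule that documents without a value are left alone are all assumptions of the model.

```python
PATH_SLOT = 0  # hypothetical slot number chosen for this sketch


def collapse(ranked_docs, slot, collapse_max=1):
    """ranked_docs: dicts with 'title', 'data' and 'values', best first.
    Keeps at most collapse_max docs per distinct value found in `slot`.
    Docs with no value in that slot are kept as-is (model assumption)."""
    counts = {}
    kept = []
    for doc in ranked_docs:
        key = doc["values"].get(slot, "")
        if key:
            if counts.get(key, 0) >= collapse_max:
                continue  # enough docs with this key were already kept
            counts[key] = counts.get(key, 0) + 1
        kept.append(doc)
    return kept


def make_doc(title, path, with_value=True):
    """New-format docs carry the path in a value slot; old ones only in data."""
    return {"title": title, "data": path,
            "values": {PATH_SLOT: path} if with_value else {}}


ranked = [("Mexico", "A/México"), ("México", "A/México"),
          ("Mexico City", "A/Mexico_City")]
new_index = [make_doc(t, p) for t, p in ranked]
old_index = [make_doc(t, p, with_value=False) for t, p in ranked]

new_titles = [d["title"] for d in collapse(new_index, PATH_SLOT)]
old_titles = [d["title"] for d in collapse(old_index, PATH_SLOT)]
```

In this model the old-format index, which keeps the path only in the document data, gets no deduplication at all, which is why existing ZIM files need a separate fallback.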

@kelson42
Contributor Author

kelson42 commented Mar 6, 2021

@maneeshpm I would associate a metadata key "value_store" with the database and act according to its presence/value. Make sure you access this metadata lazily, so you don't have to access the Xapian file each time you need to check the value.
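
One way to read the "lazy mode" suggestion is to look the metadata up once, on first use, and cache the answer. The sketch below models that in plain Python. `FakeDatabase` is a stand-in that only counts reads, the class and strategy names are hypothetical, and the check tests only whether "value_store" is present, since the thread does not say what its value would contain.

```python
from functools import cached_property


class FakeDatabase:
    """Stand-in for the index database; counts metadata reads."""

    def __init__(self, metadata):
        self._metadata = metadata
        self.reads = 0

    def get_metadata(self, key):
        self.reads += 1
        return self._metadata.get(key, "")  # empty string when the key is unset


class SuggestionIndex:
    def __init__(self, db):
        self._db = db

    @cached_property
    def has_path_value(self):
        # Evaluated on first access only; later accesses reuse the cached result.
        return self._db.get_metadata("value_store") != ""

    def dedup_strategy(self):
        """Pick the deduplication path based on how the index was written."""
        return "collapse_on_value" if self.has_path_value else "skip_seen_paths"


new_db = FakeDatabase({"value_store": "path"})  # hypothetical metadata value
new_index = SuggestionIndex(new_db)
strategies = [new_index.dedup_strategy() for _ in range(3)]

old_index = SuggestionIndex(FakeDatabase({}))  # index written before the change
```

Three calls to `dedup_strategy()` trigger a single metadata read, and an index without the key falls back to the search-time skip.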


4 participants