Add solr support for synonyms for numbers/abbreviations #6635

Open
popcar2 opened this issue Jun 7, 2022 · 7 comments · May be fixed by #6922
Assignees
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Theme: Search Issues related to search UI and backend. [managed]

Comments

@popcar2

popcar2 commented Jun 7, 2022

I've been using the website for a long time now and one of my biggest gripes is how searching works. When searching for books in OpenLibrary, you often need to write exactly the correct title. This means that if a book uses words for numbers (One, Two, Three etc), searching the same title with digits (1, 2, 3 etc) would give no result.

Another example is if a book uses "Vol." in the title, searching "volume" would net no result even though they mean the same thing. This makes finding specific books a lot more difficult.

Describe the problem that you'd like solved

The search engine searches exact terms, but it should have tolerance when dealing with numbers or words of equivalent meaning.
Here's an example:

image
image
I would like searching "The Walking Dead Compendium Four" and "The Walking Dead Compendium 4" to find the book.

Proposal & Constraints

The search engine should be tolerant of words with the same meaning.
"Vol." should be the same as writing "Volume"
"Two" should be the same as writing "2" or "II"
"&" and "and" should also be interchangeable.

Additional context

Another example, but with "vol" and "volume"
image
image

@popcar2 popcar2 added Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Jun 7, 2022
@mekarpeles mekarpeles added Theme: Search Issues related to search UI and backend. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead labels Jun 13, 2022
@cdrini cdrini added Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Aug 15, 2022
@cdrini
Collaborator

cdrini commented Aug 15, 2022

I think the solution for this would be to make use of solr's synonyms feature, but some experimenting / investigation is needed. Anyone who has some time to experiment with adding synonyms to solr, please do!

@bicolino34
Collaborator

@cdrini I would like to do it, how can I?

@bicolino34
Collaborator

The search is strict not only with terms, but also with letters. Compare "Безпека життєдіяльності" and "Безпека життєдіяльност": with just one letter missing (і), there are no results.

@mekarpeles mekarpeles added the Needs: Response Issues which require feedback from lead label Aug 22, 2022
@cdrini
Collaborator

cdrini commented Aug 23, 2022

So this is a solr research task; here are some of the places that will need modifications:

The solr schema, which defines the various types of text fields, has synonyms enabled -- but only at query time:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- A text field with better cross-language defaults:
     it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases. At query time only, it
     also applies synonyms.
-->
<fieldType name="text_international" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

This blog post has some info: https://library.brown.edu/create/digitaltechnologies/using-synonyms-in-solr/

In a nutshell we need synonyms inside https://github.com/internetarchive/openlibrary/blob/ccabd95be2a82c4f79d94b1f10e46ea1d3c5c730/conf/solr/conf/synonyms.txt

And then test locally with a full reindex (See https://github.com/internetarchive/openlibrary/wiki/Solr#making-changes-to-solr-config )

But for numbers, they probably need to be in English only for now? I'm not sure how we should handle non-English numbers. Ideally we'd want different synonyms files for different user locales, but I'm not sure if/how to do this in solr.

@cdrini
Collaborator

cdrini commented Aug 23, 2022

But we can definitely add something like vol,vol.,Volume in there and see if it helps with that!
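
As a rough illustration of what such an entry would do at query time, here is a small Python model of comma-separated synonym lines with expand="true". It is a simplified sketch, not Solr's implementation: it only handles single-token synonyms, the `strip(".,")` call is a crude stand-in for the tokenizer, and the number entries are hypothetical examples rather than lines from the actual synonyms.txt.

```python
# Simplified model of Solr's comma-separated synonyms format with expand="true".
# The entries below are hypothetical examples, not the project's synonyms.txt.
SYNONYM_LINES = """
# lines starting with '#' are comments
vol,volume
two,2,ii
four,4,iv
"""


def parse_synonyms(text):
    """Map each term to the full set of terms listed on its line (lowercased)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        group = {term.strip().lower() for term in line.split(",") if term.strip()}
        for term in group:
            table.setdefault(term, set()).update(group)
    return table


def expand_query(query, table):
    """Return, for each query token, the set of terms that would be searched."""
    # Splitting on whitespace and trimming '.'/',' only approximates a real tokenizer.
    tokens = [tok.strip(".,").lower() for tok in query.split()]
    return [table.get(tok, {tok}) for tok in tokens if tok]


table = parse_synonyms(SYNONYM_LINES)
by_word = expand_query("The Walking Dead Compendium Four", table)
by_digit = expand_query("The Walking Dead Compendium 4", table)
print(by_word == by_digit)  # True: both queries expand to the same term sets
```

In this model, "Four" and "4" expand to the same set of terms, so either query would match a title indexed with either form.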

@cdrini cdrini removed the Needs: Response Issues which require feedback from lead label Aug 23, 2022
@cdrini
Collaborator

cdrini commented Aug 23, 2022

Actually it looks like the synonyms file is working! You can see the TV one in action here: https://openlibrary.org/search?q=television+kid&mode=everything .

So adding volume should be easy enough!
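
One way to check a synonym pair from outside might be to compare result counts for the two query variants. The sketch below assumes the public `search.json` endpoint and a `numFound` field in its response; the helper names `build_url` and `result_count` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; adjust if the search API lives elsewhere.
SEARCH_URL = "https://openlibrary.org/search.json"


def build_url(query):
    """Build a search URL for the given free-text query."""
    return SEARCH_URL + "?" + urlencode({"q": query, "limit": 1})


def result_count(query):
    """Fetch the search response and return its (assumed) numFound field."""
    with urlopen(build_url(query), timeout=30) as resp:
        return json.load(resp)["numFound"]


# Example usage (performs network requests, so it is left commented out):
# for q in ("television kid", "tv kid"):
#     print(q, result_count(q))
```

If the two variants are treated as synonyms, both should return hits for the same titles, though the counts may differ for reasons unrelated to synonyms.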

@cdrini
Collaborator

cdrini commented Aug 30, 2022

@bicolino34 Your issue would probably be handled by solr's spell checking features: showing something like "Did you mean?" when a user's query is close to, but not exactly, a correct one. Would you mind creating a separate issue to add support for "Did you mean?"? That'll require a different approach on the solr side, but would help users a ton!

@cdrini cdrini changed the title Add error tolerance while searching with equivalent terms Add solr support for synonyms for numbers/abbreviations Aug 30, 2022
@bicolino34 bicolino34 linked a pull request Aug 31, 2022 that will close this issue