Index normalized author name in solr #178

Closed
anandology opened this issue Mar 26, 2013 · 31 comments · Fixed by #5600
Assignees
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed] Theme: Search Issues related to search UI and backend. [managed] Theme: Unicode Issues and pull requests related to Unicode characters Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@anandology
Collaborator

Imagine the case where the author name has special accented characters like "Ghaṭṭi Añjanēyaśarma". Most of the time, the user won't be able to type the accented characters and autocomplete will fail.

The search engine should index the accent-stripped version of the author name along with the real name to avoid such issues.
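A minimal Python sketch of what "accent-stripped version along with the real name" could mean. This is not Open Library's indexing code; `strip_accents` and `index_forms` are hypothetical helper names, and the approach (Unicode decomposition, then dropping combining marks) is only one way to approximate folding. Letters with no decomposition, such as "ø", would pass through unchanged.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(name: str) -> list[str]:
    """Hypothetical helper: the forms to index for one author name.

    Returns the real name, plus the accent-stripped form when it differs.
    """
    folded = strip_accents(name)
    return [name] if folded == name else [name, folded]


print(index_forms("Ghaṭṭi Añjanēyaśarma"))
```

With both forms indexed, a query typed without diacritics can match the stripped form while the display name stays intact.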

@ghost ghost assigned anandology Mar 26, 2013
@bencomp
Contributor

bencomp commented Mar 26, 2013

👍

@anandology
Collaborator Author

Looks like I already reported this bug 3 years ago, but it has not been fixed yet.

https://bugs.launchpad.net/openlibrary/+bug/540866

Edward had some suggestions about how it can be fixed.

@bencomp
Contributor

bencomp commented Mar 28, 2013

So it's a matter of configuration? Edward's solution was using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory:

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

<filter class="solr.ASCIIFoldingFilterFactory"/>

@anandology
Collaborator Author

I tried that, but it didn't seem to work. This requires more exploration.

On Thursday, March 28, 2013, bencomp wrote:

So it's a matter of configuration? Edward's solution was using
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
:

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are
not in the first 127 ASCII characters (the "Basic Latin" Unicode block)
into their ASCII equivalents, if one exists.



Anand
http://anandology.com/

@anandology
Collaborator Author

Working on moving to solr with a single core and an improved schema. Will fix this once that is done. Targeting this for May.

@tfmorris
Contributor

I would recommend something more sophisticated like the NFKC_Casefold option of:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUNormalizer2FilterFactory

so that we handle Unicode normalization as well. I know I've seen both composed and decomposed forms in OpenLibrary.

This tokenizer probably deserves investigation as well: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Here are some very basic names which aren't found: Antonin Dvořak, Antonin Dvořák, Antonín Dvořák, Antonín Dvorák, Antonín Dvořak

Amongst other problems, not having them show up in search makes them very difficult to merge.
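To illustrate the composed versus decomposed point and the Dvořák variants, here is a rough Python sketch. It only approximates what an ICU-based Solr filter might do; `search_key` is a hypothetical name, and the real filters cover many more cases than this decomposition-based shortcut.

```python
import unicodedata


def search_key(name: str) -> str:
    """Hypothetical search key: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The same visible name in composed and decomposed form.
composed = "Dvo\u0159\u00e1k"        # precomposed r-caron and a-acute
decomposed = "Dvor\u030ca\u0301k"    # base letters followed by combining marks

# Without normalization the two strings are different code point sequences.
print(composed == decomposed)
# After NFC normalization they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)

# The spelling variants listed above.
VARIANTS = [
    "Antonin Dvořak",
    "Antonin Dvořák",
    "Antonín Dvořák",
    "Antonín Dvorák",
    "Antonín Dvořak",
]
print({search_key(v) for v in VARIANTS})
```

If all variants share one key, a search for any of them can surface the others, which is what merging duplicate author records needs.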

@tfmorris
Contributor

tfmorris commented Sep 1, 2013

This is a duplicate of issue #11.

@tfmorris
Contributor

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version?

@george08

Time, I bet ;)

@anandology
Collaborator Author

On Fri, Oct 11, 2013 at 5:04 AM, Tom Morris notifications@github.com wrote:

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many
of the useful diacritic folding capabilities were introduced with Solr 3.1
in 2011. Is there a reason not to move to a more modern version?

In progress. I've already set up a node with solr 3.1 and an improved setup to
handle searching for editions, authors and works. It will go live in a month
or so.

@tfmorris
Contributor

@Gio I know you made some Solr changes recently. Was diacritic folding and/or unicode normalization part of that work or is this still open?

@LeadSongDog

@anandology there still seem to be two very similar records at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A and at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A
Neither of them is found yet by an author search for "Ghatti Anjaneyasarma"

@tfmorris
Contributor

@LeadSongDog Anand (anandology) isn't involved any more. As I understand it, Gio (@gdamdam) is the current dev. Unfortunately when I attempted to ping him for status back in January, I inadvertently used the wrong username.

@gdamdam Any update on Solr diacritic folding?

@bfalling
Collaborator

We've moved the full-text search engine to an Internet-Archive-based Elastic Search cluster. A decision needs to be made about the OL metadata search engine. Keep SOLR? Also move to Elastic Search?

@bfalling bfalling added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Sep 22, 2016
@tfmorris
Contributor

tfmorris commented Oct 5, 2016 via email

@bfalling
Collaborator

bfalling commented Oct 5, 2016

A little context on the move to Elastic Search: The SOLR used for searching inside books was found to be continuously corrupting. Repaired data re-corrupted after a few weeks for no ascertainable reason. We weighed upgrading SOLR against moving to ES, which has much more support within the Archive. We chose the latter.

@tfmorris
Contributor

tfmorris commented Oct 6, 2016 via email

@tfmorris
Contributor

tfmorris commented Apr 5, 2017

@bfalling If a switch to Elasticsearch is a blocker for this task, has any progress been made on advertising the potential change (e.g. to ol-tech or ol-discuss), soliciting feedback, preparing downstream consumers for the change?

This bug represents a significant usability issue and was first reported in 2010. It'd be nice to make some progress on it.

@mekarpeles
Member

mekarpeles commented Oct 18, 2017

Regarding operating an ES instance and a solr instance, I agree that it is somewhat indefensible to have OL and IA on completely different search indices and databases. @tfmorris one thing we've started to do is write back openlibrary_work and openlibrary_edition IDs into their corresponding archive.org items. This allows us to do more querying against Internet Archive Elastic Search. OL still needs solr (or its own ES) in the interim because there are many works and editions for which there are no corresponding archive.org items and IA is reluctant to store metadata in ES for works/editions which are not digitized.

One of the current challenges is that solr takes a while to update and it's becoming increasingly difficult to keep our tiny solr instance sync'd with IA's borrow availability data. We've been switching Open Library to use a special Archive.org availability API to get this info (instead of trying to write back to solr). One downside is we can't easily query Open Library for available works.

In the next year or so I'd like to see tighter integration between IA and OL in terms of moving metadata away from OL's postgres and solr instance into some official shared infrastructure which both services can agree upon. This direction is a very early stage idea, but it's worth bringing up in case there are strong opinions which may help us avoid "gotchas".

@hornc hornc added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Nov 1, 2017
@tfmorris
Contributor

tfmorris commented Nov 8, 2017

As pointed out in #599, the ICU Normalizer, mentioned in my Aug 2013 note, isn't powerful enough and we actually want ICU Folding.
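A small Python sketch of why normalization alone may not be enough. Both functions are hypothetical approximations written for illustration, not the ICU filters themselves: `nfkc_casefold_approx` mimics normalization plus case folding, and `fold_approx` additionally drops combining marks, which is only a subset of what ICU folding is meant to cover.

```python
import unicodedata


def nfkc_casefold_approx(text: str) -> str:
    """Approximate NFKC normalization plus case folding; accents are kept."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Normalize again because case folding can denormalize some strings.
    return unicodedata.normalize("NFKC", folded)


def fold_approx(text: str) -> str:
    """Approximate folding: also remove combining marks after decomposing."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


name = "Dvořák"
plain_query = "dvorak"

# Normalization unifies case and composed/decomposed forms but keeps accents,
# so a query typed without diacritics still does not match.
print(nfkc_casefold_approx(name) == plain_query)
# Folding removes the accents as well, so the plain query matches.
print(fold_approx(name) == plain_query)
```

Under these assumptions, normalization fixes the composed versus decomposed mismatch, while folding is what lets "dvorak" find "Dvořák".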

@tfmorris
Contributor

The fix for this is in tfmorris@c7026ff and is straightforward, but it requires a Solr config change by OL staff, which is unlikely to ever happen, so I'm unassigning myself.

@tfmorris tfmorris removed their assignment Apr 30, 2020
@hornc hornc removed the CH: unicode label Nov 16, 2020
@cclauss cclauss added the Theme: Unicode Issues and pull requests related to Unicode characters label Mar 9, 2021
@cdrini cdrini added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed] and removed Needs: Lead labels Jun 2, 2021
@cdrini cdrini modified the milestones: May 2013, Active Sprint Jun 2, 2021
@cdrini cdrini modified the milestones: Next (proposed), Active Sprint Aug 31, 2021