Index normalized author name in solr #178

anandology · 2013-03-26T10:18:49Z

Imagine the case where the author name has accented characters, like "Ghaṭṭi Añjanēyaśarma". Most of the time, the user won't be able to type the accented characters, so autocomplete will fail.

The search engine should index the accent-stripped version of the author name along with the real name to avoid such issues.
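As a rough illustration of the idea, here is a minimal Python sketch. It is not the Open Library or Solr implementation: `strip_accents` and `index_keys` are hypothetical helper names, and decomposing to NFKD and then dropping combining marks is just one assumed way to produce the accent-stripped form.

```python
import unicodedata


def strip_accents(name):
    """Return `name` with combining marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_keys(name):
    """Return the forms to index: the real name and its accent-stripped form."""
    return {name, strip_accents(name)}


print(strip_accents("Ghaṭṭi Añjanēyaśarma"))
```

Indexing both forms means a query typed without accents can still match, while the original spelling stays searchable.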

bencomp · 2013-03-26T10:21:30Z

👍

anandology · 2013-03-28T07:09:55Z

Looks like I already reported this bug 3 years ago, but it hasn't been fixed yet.

https://bugs.launchpad.net/openlibrary/+bug/540866

Edward had some suggestions about how it can be fixed.

bencomp · 2013-03-28T07:58:28Z

So it's a matter of configuration? Edward's solution was using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory:

solr.ASCIIFoldingFilterFactory

Creates org.apache.lucene.analysis.ASCIIFoldingFilter.

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

<filter class="solr.ASCIIFoldingFilterFactory"/>
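To make the "ASCII equivalents, if one exists" behaviour concrete, here is a small Python approximation. It is only a sketch and not Lucene's filter: `ascii_fold` and `EXTRA_FOLDS` are hypothetical names, and the table is a tiny hand-picked sample for letters that have no Unicode decomposition, not the filter's real mapping.

```python
import unicodedata

# Hand-picked sample of letters that do not decompose into base + accent.
# This is an assumption for illustration, not the real filter's table.
EXTRA_FOLDS = {
    "ß": "ss", "Ø": "O", "ø": "o", "Ł": "L", "ł": "l",
    "Æ": "AE", "æ": "ae", "Đ": "D", "đ": "d",
}


def ascii_fold(text):
    """Fold text toward ASCII; characters with no known equivalent pass through."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        out.append(EXTRA_FOLDS.get(ch, ch))
    return "".join(out)


print(ascii_fold("Dvořák"), ascii_fold("Łódź"), ascii_fold("Straße"))
```

The point of the lookup table is that stripping combining marks alone is not enough: letters such as "Ł" or "ß" need an explicit mapping.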

anandology · 2013-03-28T08:16:57Z

I tried that, but it didn't seem to work. It requires more exploration.

anandology · 2013-05-01T18:03:03Z

Working on moving to Solr with a single core and an improved schema. Will fix this once that is done. Targeting this for May.

tfmorris · 2013-08-30T15:08:18Z

I would recommend something more sophisticated like the NFKC_Casefold option of:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUNormalizer2FilterFactory

so that we handle Unicode normalization as well. I know I've seen both composed and decomposed forms in OpenLibrary.

This tokenizer probably deserves investigation as well: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

Here are some very basic names which aren't found: Antonin Dvořak, Antonin Dvořák, Antonín Dvořák, Antonín Dvorák, Antonín Dvořak

Amongst other problems, not having them show up in search makes them very difficult to merge.
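To show why composed and decomposed forms matter, here is a short Python sketch. It only approximates NFKC_Casefold with the standard library (`unicodedata.normalize` plus `str.casefold`); the ICU filter may differ in details, and `nfkc_casefold` is a hypothetical helper name.

```python
import unicodedata


def nfkc_casefold(text):
    """Approximate NFKC_Casefold: normalize, case-fold, then normalize again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


# The same visible name, stored two different ways.
composed = "Dvo\u0159\u00e1k"        # precomposed r-caron and a-acute
decomposed = "Dvor\u030ca\u0301k"    # base letters followed by combining marks

print(composed == decomposed)                                # raw strings differ
print(nfkc_casefold(composed) == nfkc_casefold(decomposed))  # normalized forms agree
```

Without normalization, two records that look identical on screen index as different strings, which is one reason duplicates are hard to find and merge.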

tfmorris · 2013-09-01T18:31:11Z

This is a duplicate of issue #11.

tfmorris · 2013-10-10T23:34:04Z

OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version?

george08 · 2013-10-10T23:42:12Z

Time, I bet ;)

anandology · 2013-10-11T01:22:09Z

In progress. I've already set up a node with Solr 3.1 and an improved setup to
handle searching for editions, authors and works. It will go live in a month
or so.

tfmorris · 2016-01-30T16:26:02Z

@Gio I know you made some Solr changes recently. Were diacritic folding and/or Unicode normalization part of that work, or is this still open?

LeadSongDog · 2016-07-11T19:14:58Z

@anandology there still seem to be two very similar records at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A and at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A
Neither of them is found yet by an author search for "Ghatti Anjaneyasarma".

tfmorris · 2016-07-11T21:41:07Z

@LeadSongDog Anand (anandology) isn't involved any more. As I understand it, Gio (@gdamdam) is the current dev. Unfortunately when I attempted to ping him for status back in January, I inadvertently used the wrong username.

@gdamdam Any update on Solr diacritic folding?

bfalling · 2016-09-22T17:54:42Z

We've moved the full-text search engine to an Internet-Archive-based Elastic Search cluster. A decision needs to be made about the OL metadata search engine. Keep SOLR? Also move to Elastic Search?

tfmorris · 2016-10-05T06:28:28Z

I don't think it makes sense to have two different search technologies, but then it didn't make sense to move to ES just because that's what IA wanted. We know the last transition broke things which depended on the search query language, so a little more due diligence, public notice, and discussion should be done this time to at least notify users that their apps are about to break, well in advance of any migration.

bfalling · 2016-10-05T06:53:26Z

A little context on the move to Elastic Search: The SOLR used for searching inside books was found to be continuously corrupting. Repaired data re-corrupted after a few weeks for no ascertainable reason. We weighed between upgrading SOLR and moving to ES, which has much more support within the Archive. We chose the latter.

tfmorris · 2016-10-06T05:05:59Z

All true, but the most relevant things for me were the lack of advance notice, public discussion, or any input from the community. It could have been entirely the correct decision, but arrived at in completely the wrong way. I'm suggesting not repeating the mistake.

tfmorris · 2017-04-05T01:41:47Z

@bfalling If a switch to Elasticsearch is a blocker for this task, has any progress been made on advertising the potential change (e.g. to ol-tech or ol-discuss), soliciting feedback, preparing downstream consumers for the change?

This bug represents a significant usability issue and was first reported in 2010. It'd be nice to make some progress on it.

mekarpeles · 2017-10-18T02:24:41Z

Regarding operating an ES instance and a solr instance, I agree that it is somewhat indefensible to have OL and IA on completely different search indices and databases. @tfmorris one thing we've started to do is write back openlibrary_work and openlibrary_edition IDs into their corresponding archive.org items. This allows us to do more querying against Internet Archive Elastic Search. OL still needs solr (or its own ES) in the interim because there are many works and editions for which there are no corresponding archive.org items, and IA is reluctant to store metadata in ES for works/editions which are not digitized.

One of the current challenges is that solr takes a while to update, and it's becoming increasingly difficult to keep our tiny solr instance synced with IA's borrow availability data. We've been switching Open Library to use a special Archive.org availability API to get this info (instead of trying to write back to solr). One downside is that we can't easily query Open Library for available works.

In the next year or so I'd like to see tighter integration between IA and OL in terms of moving metadata away from OL's postgres and solr instance into some official shared infrastructure which both services can agree upon. This direction is a very early stage idea, but it's worth bringing up in case there are strong opinions which may help us avoid "gotchas".

tfmorris · 2017-11-08T23:19:57Z

As pointed out in #599, the ICU Normalizer, mentioned in my Aug 2013 note, isn't powerful enough and we actually want ICU Folding.
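The difference can be sketched in Python using the name variants from the Aug 2013 note. This is an approximation with the standard library, not the ICU filters themselves: `nfkc_casefold` and `fold` are hypothetical helpers, and `fold` (decompose, drop combining marks, case-fold) is only an assumed stand-in for what ICU Folding does with these names.

```python
import unicodedata

VARIANTS = [
    "Antonin Dvořak",
    "Antonin Dvořák",
    "Antonín Dvořák",
    "Antonín Dvorák",
    "Antonín Dvořak",
]


def nfkc_casefold(text):
    """Normalization only: unify encoding and case, but keep diacritics."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def fold(text):
    """Folding stand-in: additionally remove the diacritics themselves."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(len({nfkc_casefold(v) for v in VARIANTS}))  # variants stay distinct
print({fold(v) for v in VARIANTS})                # variants collapse to one key
```

Normalization alone leaves the variants as separate index keys because the diacritics are still there; only folding maps them all to the same key.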

tfmorris · 2020-04-30T21:34:20Z

The fix for this is in tfmorris@c7026ff and is straightforward, but it requires a Solr config change by OL staff, which is unlikely to ever happen, so I'm unassigning myself.

ghost assigned anandology Mar 26, 2013

bencomp mentioned this issue Apr 21, 2013

Accents in search #185

Closed

bfalling mentioned this issue Sep 22, 2016

Search should neither be case-sensitive nor macron/diacritic-sensitive #11

Closed

bfalling added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Sep 22, 2016

bfalling mentioned this issue Sep 22, 2016

Searching for Subject terms with diacritics fails #317

Closed

bfalling unassigned anandology Sep 28, 2016

hornc added the unicode label May 9, 2017

tfmorris mentioned this issue Jun 1, 2017

Normalize Unicode #149

Closed

cdrini mentioned this issue Oct 24, 2017

Make most SOLR fields ignore diacritics #599

Closed

hornc added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Nov 1, 2017

LeadSongDog mentioned this issue Dec 17, 2019

Search finds transliteration variations easily #2752

Open

xayhewalo removed the State: Backlogged label Mar 17, 2020

cdrini mentioned this issue Apr 6, 2020

Update to Solr 8 (Latest) #3317

Closed

tfmorris mentioned this issue Apr 7, 2020

Add LCC and Dewey decimal numbers to solr in April solr reindex #3290

Closed

cdrini added the Needs: Lead label Apr 20, 2020

tfmorris removed their assignment Apr 30, 2020

LeadSongDog mentioned this issue Nov 9, 2020

Fixing unicode urls in python3 #4049

Merged

hornc removed the CH: unicode label Nov 16, 2020

cclauss added the Theme: Unicode Issues and pull requests related to Unicode characters label Mar 9, 2021

cdrini added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Theme: Internationalization Making OpenLibrary work for both foreign-language users and books. [managed] and removed Needs: Lead labels Jun 2, 2021

cdrini modified the milestones: May 2013, Active Sprint Jun 2, 2021

mekarpeles assigned cdrini Jun 5, 2021

cdrini modified the milestones: Sprint 2021-06, Active Sprint, Next (proposed) Jul 6, 2021

cdrini modified the milestones: Next (proposed), Active Sprint Aug 31, 2021

This was referenced Aug 31, 2021

Author names should ignore diacritics in solr #5600

Merged

Search fails for authors with non-Latin1 characters #714

Closed

mekarpeles closed this as completed in #5600 Sep 15, 2021

cdrini mentioned this issue Nov 2, 2021

Do a full solr reindex with 2021-10 dump #5502

Closed

tfmorris mentioned this issue Jan 22, 2022

Search for exact title with different encoding fails #6059

Closed

tfmorris mentioned this issue Oct 6, 2022

Search should be aware of typical "diacritic replacement characters" #7040

Closed
