Index normalized author name in solr #178
Comments
👍 |
Looks like I already reported this bug 3 years ago, but it has not been fixed yet. https://bugs.launchpad.net/openlibrary/+bug/540866 Edward had some suggestions about how it can be fixed. |
So it's a matter of configuration? Edward's solution was using http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory:
|
I tried that, but it didn't seem to work. It requires more exploration. On Thursday, March 28, 2013, bencomp wrote:
Anand |
Working on moving to Solr with a single core and an improved schema. Will fix this once that is done. Targeting this for May. |
I would recommend something more sophisticated like the NFKC_Casefold option of: so that we handle Unicode normalization as well. I know I've seen both composed and decomposed forms in OpenLibrary. This tokenizer probably deserves investigation as well: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory Here are some very basic names which aren't found: Antonin Dvořak, Antonin Dvořák, Antonín Dvořák, Antonín Dvorák, Antonín Dvořak. Among other problems, not having them show up in search makes them very difficult to merge. |
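To illustrate the composed versus decomposed point above, here is a minimal Python sketch. It is not the Solr filter itself: Python's standard library has no NFKC_Casefold mode, so the `nfkc_casefold` helper below is a hypothetical approximation built from `unicodedata.normalize` and `str.casefold`.

```python
import unicodedata


def nfkc_casefold(text: str) -> str:
    """Approximate NFKC_Casefold (assumption: NFKC, case-fold, NFKC again)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


# The same surname written two ways.
composed = "Dvo\u0159\u00e1k"        # "ř" and "á" as single code points
decomposed = "Dvor\u030ca\u0301k"    # "r" + combining caron, "a" + combining acute

# The raw strings differ, so an exact-match index treats them as two names.
raw_equal = composed == decomposed

# After normalization both spellings yield the same key.
key_composed = nfkc_casefold(composed)
key_decomposed = nfkc_casefold(decomposed)
```

Note that the normalized key still contains the diacritics, so this step alone makes the two encodings match each other but does not make a plain "dvorak" query match.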
This is a duplicate of issue #11. |
OpenLibrary is currently stuck on Solr v1.4.0 which is 4+ years old. Many of the useful diacritic folding capabilities were introduced with Solr 3.1 in 2011. Is there a reason not to move to a more modern version? |
Time, I bet ;) |
On Fri, Oct 11, 2013 at 5:04 AM, Tom Morris notifications@github.com wrote:
|
@Gio I know you made some Solr changes recently. Were diacritic folding and/or Unicode normalization part of that work, or is this still open? |
@anandology there still seem to be two very similar records at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A and at https://openlibrary.org/search?q=Gha%E1%B9%AD%E1%B9%ADi&author_key=OL6A |
@LeadSongDog Anand (anandology) isn't involved any more. As I understand it, Gio (@gdamdam) is the current dev. Unfortunately when I attempted to ping him for status back in January, I inadvertently used the wrong username. @gdamdam Any update on Solr diacritic folding? |
We've moved the full-text search engine to an Internet-Archive-based Elastic Search cluster. A decision needs to be made about the OL metadata search engine. Keep SOLR? Also move to Elastic Search? |
I don't think it makes sense to have two different search technologies, but
then it didn't make sense to move to ES just because that's what IA wanted.
We know the last transition broke things which depended on the search query
language, so a little more due diligence, public notice, and discussion
should be done this time to at least notify users that their apps are about
to break, well in advance of any migration.
|
A little context on the move to Elastic Search: the SOLR index used for searching inside books kept getting corrupted. Repaired data became corrupted again after a few weeks for no ascertainable reason. We weighed upgrading SOLR against moving to ES, which has much more support within the Archive. We chose the latter. |
All true, but the most relevant things for me were the lack of advance
notice, public discussion, or any input from the community.
It could have been entirely the correct decision, but arrived at in
completely the wrong way.
I'm suggesting not repeating the mistake.
|
@bfalling If a switch to Elasticsearch is a blocker for this task, has any progress been made on advertising the potential change (e.g. to ol-tech or ol-discuss), soliciting feedback, preparing downstream consumers for the change? This bug represents a significant usability issue and was first reported in 2010. It'd be nice to make some progress on it. |
Regarding operating an ES instance and a Solr instance, I agree that it is somewhat indefensible to have OL and IA on completely different search indices and databases. @tfmorris one thing we've started to do is write back openlibrary_work and openlibrary_edition IDs into their corresponding archive.org items. This allows us to do more querying against Internet Archive Elastic Search. OL still needs Solr (or its own ES) in the interim because there are many works and editions for which there are no corresponding archive.org items, and IA is reluctant to store metadata in ES for works/editions which are not digitized. One of the current challenges is that Solr takes a while to update, and it's becoming increasingly difficult to keep our tiny Solr instance synced with IA's borrow availability data. We've been switching Open Library to use a special Archive.org availability API to get this info (instead of trying to write back to Solr). One downside is that we can't easily query Open Library for available works. In the next year or so I'd like to see tighter integration between IA and OL in terms of moving metadata away from OL's Postgres and Solr instance into some official shared infrastructure which both services can agree upon. This direction is a very early stage idea, but it's worth bringing up in case there are strong opinions which may help us avoid "gotchas". |
As pointed out in #599, the ICU Normalizer, mentioned in my Aug 2013 note, isn't powerful enough and we actually want ICU Folding. |
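The difference between normalizing and folding can be sketched in Python with the name variants listed earlier in this thread. This is only an approximation of the behaviour, not the ICU filters: `normalize_only` and `fold` are hypothetical helpers, and `fold` simply strips combining marks after NFKD decomposition, which covers fewer cases than real ICU folding.

```python
import unicodedata


def normalize_only(text: str) -> str:
    """Normalization step only: NFKC plus case folding (diacritics kept)."""
    return unicodedata.normalize("NFKC", text).casefold()


def fold(text: str) -> str:
    """Rough stand-in for diacritic folding: decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFKC", stripped).casefold()


variants = [
    "Antonin Dvořak",
    "Antonin Dvořák",
    "Antonín Dvořák",
    "Antonín Dvorák",
    "Antonín Dvořak",
]

# Normalization alone keeps the five spellings apart.
normalized_keys = {normalize_only(v) for v in variants}

# Folding collapses them to a single search key.
folded_keys = {fold(v) for v in variants}
```

With normalization only, the five spellings stay distinct; with folding they share one key, which is what search and merging need.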
The fix for this is in tfmorris@c7026ff and is straightforward, but it requires a Solr config change by OL staff, which is unlikely to ever happen, so I'm unassigning myself. |
Imagine the case where the author name has accented characters, like "Ghaṭṭi Añjanēyaśarma". Most of the time, the user won't be able to enter the accented characters and autocomplete will fail.
The search engine should index the accent-stripped version of the author name along with the real name to avoid such issues.
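As a rough sketch of the idea (not how Open Library or Solr implements it), the toy index below stores each author under both the real name and an accent-stripped key, so a plain-ASCII prefix still finds the record. `AuthorIndex` and `fold` are hypothetical names used for illustration only.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Accent-stripped, case-folded key (assumption: strip combining marks)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFKC", stripped).casefold()


class AuthorIndex:
    """Toy in-memory index keyed by both the real and the folded name."""

    def __init__(self) -> None:
        self._by_key = defaultdict(set)

    def add(self, name: str) -> None:
        exact_key = unicodedata.normalize("NFKC", name).casefold()
        self._by_key[exact_key].add(name)   # real name, accents kept
        self._by_key[fold(name)].add(name)  # accent-stripped version

    def autocomplete(self, prefix: str) -> list:
        wanted = {unicodedata.normalize("NFKC", prefix).casefold(), fold(prefix)}
        return sorted({
            name
            for key, names in self._by_key.items()
            if any(key.startswith(w) for w in wanted)
            for name in names
        })


name = "Ghaṭṭi Añjanēyaśarma"
index = AuthorIndex()
index.add(name)

plain_hits = index.autocomplete("ghatti anj")   # typed without accents
accented_hits = index.autocomplete("Ghaṭṭi")    # typed with accents
no_hits = index.autocomplete("Dvorak")          # unrelated query
```

Both the unaccented and the accented prefix return the original, fully accented name, while the unrelated query returns nothing.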