BibIndex (fulltext): Demo fulltext searching on inspire-hep-dev doesn't find 'rattazzon' #22

Open
traviscb opened this Issue · 5 comments

4 participants

Travis Brooks Tibor Simko Samuele Kaplun Lars Holm Nielsen
Travis Brooks
Collaborator

Originally on 2010-04-29

Surprisingly:

astro-ph/0607086

does not find rattazzon even though it is in the fulltext (see snippet for "honor theorist" search in fulltext)

This may be due to its enclosure in '' in the text...

Tibor Simko
Owner

Originally on 2010-04-29

The problem seems to be due to non-ASCII UTF-8 quotes.
If one searches for ‘rattazzon’, one finds it:

[http://inspire-hep-dev.cern.ch/search?p=‘rattazzon’&f=fulltext]

The current word breaking sequencer handles only ASCII quotes.
That is, ` not ‘, and ' not ’.

We should add all the common UTF-8 characters of that kind
to the config.
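
To illustrate the point, here is a minimal Python 3 sketch of a word breaker whose set of quote characters is configurable. The names `ASCII_QUOTES`, `UNICODE_QUOTES` and `tokenize` are hypothetical, as is the sample sentence; this is not the actual BibIndex word-breaking code or its config variables, only a sketch of the behaviour described above.

```python
# Hypothetical punctuation sets; NOT the real BibIndex config names.
ASCII_QUOTES = "`'\""
# Common curly, low-9 and angle quotes (illustrative, not exhaustive).
UNICODE_QUOTES = "\u2018\u2019\u201a\u201b\u201c\u201d\u201e\u201f\u00ab\u00bb"
OTHER_PUNCTUATION = ".,;:!?()"


def tokenize(text, quote_chars):
    """Split on whitespace, strip the given quote characters and basic
    punctuation from both ends of each token, and lowercase the result."""
    tokens = []
    for raw in text.split():
        word = raw.strip(quote_chars + OTHER_PUNCTUATION).lower()
        if word:
            tokens.append(word)
    return tokens


# Made-up sample text, not the actual fulltext of the record.
sample = "an illustrative \u2018rattazzon\u2019 example"

# With ASCII quotes only, the curly quotes stay glued to the word,
# so the indexed term is not the bare word.
ascii_only = tokenize(sample, ASCII_QUOTES)

# With the Unicode quotes added, the bare word is indexed.
extended = tokenize(sample, ASCII_QUOTES + UNICODE_QUOTES)
```

With the ASCII-only set, a search for the bare word misses because the indexed term still carries the curly quotes; with the extended set the bare word becomes a term of its own.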

Samuele Kaplun kaplun added the s_INSPIRE label
Samuele Kaplun
Collaborator

It looks like this is no longer related to INSPIRE since we are using Solr there (and we have different issues with fulltext and UTF-8). Is there still an interest in fixing this, or shall we rather say: let's push towards moving to Elasticsearch?

Samuele Kaplun kaplun added the in_triage label
Samuele Kaplun kaplun was assigned by tiborsimko
Tibor Simko
Owner

I recall this ticket was precisely for a bug with the Solr full-text search already on. The issue may still be alive, even after Patrick's improvements. One still does not find anything with this string; however, astro-ph/0607086 does not seem to be fulltext-indexed at all: just try the query astro-ph/0607086 fulltext:of.
Can you try to fulltext-index it to confirm or refute whether this still applies?

Samuele Kaplun
Collaborator

I believe the current implementation indeed does not allow searching for any special characters, only for the plain ASCII English alphabet...

Tibor Simko
Owner

> I believe the current implementation indeed does not allow searching for any special characters, only for the plain ASCII English alphabet...

That's the goal, actually. The user searched for a word not containing any special characters; however, since the word was delimited by UTF-8 quotes rather than regular ASCII quotes, the tokeniser of that time thought they were part of the word, hence not finding anything... If we treated UTF-8 quotes as regular ASCII quotes, the match would be found.
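
As a rough sketch of that idea in Python 3, one could map the common Unicode quotes to their ASCII counterparts before the text reaches an ASCII-only tokeniser. `QUOTE_MAP`, `normalize_quotes` and `index_terms` are hypothetical names invented for this illustration, and the tokeniser here is a simplified stand-in, not the one used by BibIndex or Solr.

```python
# Hypothetical mapping from common Unicode quotes to ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
})

ASCII_QUOTES = "`'\""


def normalize_quotes(text):
    """Replace Unicode quote characters with their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


def index_terms(text):
    """Simplified stand-in for an ASCII-only tokeniser: normalise quotes,
    split on whitespace, then strip ASCII quotes from each token."""
    terms = []
    for raw in normalize_quotes(text).split():
        word = raw.strip(ASCII_QUOTES)
        if word:
            terms.append(word)
    return terms


# Made-up sample text, not the actual fulltext of the record.
terms = index_terms("see \u2018rattazzon\u2019 here")
```

Normalising first keeps the tokeniser itself unchanged; the alternative is to extend the tokeniser's own punctuation set, which avoids rewriting the text but requires touching the word-breaking configuration.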

Lars Holm Nielsen lnielsen added this to the v1.x milestone
Samuele Kaplun kaplun was unassigned by lnielsen
Lars Holm Nielsen lnielsen removed the in_triage label
Tibor Simko tiborsimko added the master label