BibIndex (fulltext): Demo fulltext searching on inspire-hep-dev doesn't find 'rattazzon' #22

Open
traviscb opened this Issue · 5 comments

4 participants

@traviscb
Collaborator

Originally on 2010-04-29

Surprisingly:

astro-ph/0607086

does not find rattazzon even though it is in the fulltext (see snippet for "honor theorist" search in fulltext)

This may be due to its enclosure in '' in the text...

@tiborsimko
Owner

Originally on 2010-04-29

The problem seems to be due to non-ASCII UTF-8 quotes.
If one searches for ‘rattazzon’, one finds it:

[http://inspire-hep-dev.cern.ch/search?p=‘rattazzon’&f=fulltext]

The current word breaking sequencer handles only ASCII quotes.
That is, ` not ‘, and ' not ’.

We should add all the common UTF-8 characters of that kind
to the config.
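
As a rough illustration of the idea (this is not the actual BibIndex code or configuration; the separator sets, function names and sample sentence below are all made up for the example), a word breaker whose separator set also includes the common Unicode quotation marks would index the bare word:

```python
import re

# Hypothetical separator sets, for illustration only.
ASCII_QUOTES = "`'\""
UNICODE_QUOTES = "\u2018\u2019\u201a\u201b\u201c\u201d\u201e\u201f\u00ab\u00bb"


def make_word_breaker(quote_chars):
    """Return a function that splits text on whitespace and the given quotes."""
    pattern = re.compile("[\\s" + re.escape(quote_chars) + "]+")

    def break_words(text):
        # Drop the empty strings produced by leading/trailing separators.
        return [word for word in pattern.split(text) if word]

    return break_words


ascii_only = make_word_breaker(ASCII_QUOTES)
with_unicode = make_word_breaker(ASCII_QUOTES + UNICODE_QUOTES)

# Made-up sample sentence with the word wrapped in U+2018 / U+2019.
text = "in honor of \u2018rattazzon\u2019 the theorist"

print(ascii_only(text))    # the quotes stay glued to the word
print(with_unicode(text))  # the bare word becomes its own token
```

With the ASCII-only set the indexed term keeps its surrounding quotes, so a plain search for rattazzon has nothing to match against.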

@kaplun kaplun added the s_INSPIRE label
@kaplun
Collaborator

It looks like this is no longer related to INSPIRE since we are using Solr there (and we have different issues with fulltext and UTF-8). Is there still interest in fixing this, or shall we rather say: let's push towards moving to Elasticsearch?

@kaplun kaplun added the in_triage label
@kaplun kaplun was assigned by tiborsimko
@tiborsimko
Owner

I recall this ticket was precisely for a bug with the Solr full-text search already on. The issue may still be alive, even after Patrick's improvements. One still does not find anything with this string; however, astro-ph/0607086 does not seem to be fulltext-indexed at all: just try astro-ph/0607086 fulltext:of.
Can you try to fulltext-index it to confirm or rule out whether this still applies?

@kaplun
Collaborator

I believe the current implementation indeed does not allow searching for any special characters, only the plain ASCII English alphabet...

@tiborsimko
Owner

I believe the current implementation indeed does not allow searching for any special characters, only the plain ASCII English alphabet...

That's the goal, actually. The user's search was for a word not containing any special character; however, since the word was delimited by UTF-8 quotes rather than regular ASCII quotes, the tokeniser of that time thought they were part of the word, hence not finding anything... If we treated UTF-8 quotes as regular ASCII quotes, the match would be found.
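
A minimal sketch of that approach, assuming a tokeniser that already strips ASCII quotes from word boundaries: map the typographic quotes to their ASCII counterparts before tokenising. The names `QUOTE_MAP`, `normalise_quotes` and `tokenise` are hypothetical and do not come from the Invenio or Solr code bases.

```python
# Hypothetical mapping from typographic quotes to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
})

ASCII_QUOTES = "'\"`"


def normalise_quotes(text):
    """Replace typographic quotes with plain ASCII quotes."""
    return text.translate(QUOTE_MAP)


def tokenise(text):
    """Split on whitespace, then strip ASCII quotes from token boundaries."""
    tokens = (tok.strip(ASCII_QUOTES) for tok in normalise_quotes(text).split())
    return [tok for tok in tokens if tok]


print(tokenise("the \u2018rattazzon\u2019 model"))
```

Normalising first keeps the existing ASCII-only boundary handling untouched, at the cost of one extra pass over the text.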

@lnielsen lnielsen added this to the v1.x milestone
@kaplun kaplun was unassigned by lnielsen
@lnielsen lnielsen removed the in_triage label
@tiborsimko tiborsimko added the master label