
Hyphens shouldn't always be assumed to be query operators #32

Closed
nolanlawson opened this issue Oct 13, 2013 · 5 comments

@nolanlawson
Member

Related to #28.

Words like "e-commerce" should be understood to be non-complex queries (whereas something like "e -commerce" is truly complex).

nolanlawson added a commit that referenced this issue Oct 13, 2013
nolanlawson added a commit that referenced this issue Oct 17, 2013
@nolanlawson
Member Author

Unfortunately, it appears that hyphenated synonyms like "e-commerce" can only be used if the synonym file is explicitly purged of all hyphens (replacing them with spaces). Commit 4913b62 demonstrates this: the synonym file contains `e commerce,electronic commerce`, and queries like e-commerce work as expected.

My impression is that this is a weakness of the default configuration we chose for the synonym analyzer. The combination of a KeywordTokenizer at one step and a StandardTokenizer at the other causes hyphenated synonyms to be overlooked.

Unfortunately I can't seem to find a combination that satisfies all the unit tests, so for now I'm just recommending that people manually purge their synonym files of hyphenated synonyms.
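
For context, here is a sketch of the kind of default configuration being described, with the synonym analyzer defined for the query parser plugin in solrconfig.xml. The analyzer name and synonym file name are illustrative, and the exact element layout may differ from your setup:

```xml
<!-- Sketch of the problematic default: the analyzer chain tokenizes with
     StandardTokenizerFactory (which splits "e-commerce" into "e" and
     "commerce"), while SynonymFilterFactory parses the synonym file with
     KeywordTokenizerFactory (which keeps "e-commerce" whole) -- so the
     two sides never line up for hyphenated synonyms. -->
<queryParser name="synonym_edismax"
             class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myAnalyzer">
      <lst name="tokenizer">
        <str name="class">solr.StandardTokenizerFactory</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">example_synonym_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>
```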

@avlukanin
Collaborator

I use WhitespaceTokenizerFactory, not StandardTokenizerFactory, in the configuration; that's why the hyphen in e-commerce is preserved. StandardTokenizerFactory is applied further down, in my schema.xml.
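
As a rough illustration of this split (the field type name is hypothetical): the plugin's synonym analyzer tokenizes on whitespace, while the field analysis in schema.xml still uses the StandardTokenizer:

```xml
<!-- schema.xml (sketch): field analysis keeps StandardTokenizerFactory;
     only the plugin's synonym analyzer switches to whitespace tokenization -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```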

@janhoy
Contributor

janhoy commented Oct 17, 2013

How about inserting a PatternReplaceCharFilterFactory before the tokenizer to remove hyphens?
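
Something like this sketch, placed ahead of the tokenizer (whether the plugin's configuration accepts a charFilter at all is exactly what the next comment questions):

```xml
<!-- Sketch: replace hyphens with spaces before tokenization,
     so "e-commerce" reaches the tokenizer as "e commerce" -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="-" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
```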


@avlukanin
Collaborator

As far as I know, you cannot use CharFilters in edismax_synonyms in solrconfig.xml. It would of course be a workaround in some cases, but there are situations where hyphens are perfectly good characters and you don't want to lose them. For example, if you work with phone numbers like 234-45-56 and rely on WhitespaceTokenizer -> WordDelimiterFilter with catenateNumbers="1" to convert them into 2344556, you will no longer get complete phone numbers, only their parts.
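
To spell out the phone-number case with a plain analyzer chain (a generic Solr sketch, not this project's configuration): with hyphens intact, "234-45-56" survives the WhitespaceTokenizer as a single token, and WordDelimiterFilter both splits it and emits the catenated "2344556"; if a char filter turns the hyphens into spaces first, the tokenizer emits "234", "45", "56" as separate tokens and there is nothing left to catenate.

```xml
<!-- Sketch: "234-45-56" -> parts "234", "45", "56" plus catenated "2344556".
     Pre-stripping the hyphens would yield only the three separate parts. -->
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateNumberParts="1" catenateNumbers="1"/>
</analyzer>
```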

@nolanlawson
Member Author

While investigating #26 and #9, it occurred to me that all of these issues are related. I also think they're really just configuration issues, stemming from the fact that, in our examples and unit tests, we configure the synonym analyzer to use the StandardTokenizer, which tokenizes non-ASCII text and hyphenated synonyms in a way most folks find unintuitive:

血と骨 -> 血 と 骨 (3 tokens)
e-commerce -> e commerce (2 tokens)

My fix was just to replace the StandardTokenizer with the WhitespaceTokenizer. All the old unit tests still pass, and as a bonus we fully fix #32 and #9, so people don't have to manually replace hyphens with spaces anymore.
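
In configuration terms, the change amounts to something like this, against the plugin-style analyzer definition sketched above (illustrative, not the exact committed diff):

```xml
<!-- Sketch of the fix: swap the synonym analyzer's tokenizer -->
<lst name="tokenizer">
  <!-- was: <str name="class">solr.StandardTokenizerFactory</str> -->
  <str name="class">solr.WhitespaceTokenizerFactory</str>
</lst>
```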

Hopefully the WhitespaceTokenizer will work better for most cases. It messes with what gets considered a "shingle" (e.g. 血と骨 becomes one big unigram) once you get to the ShingleFilterFactory, but it seems to fit better with people's expectations of how the synonym expansion should work.

BTW, I also put all this configuration into a single file, so it's easier to modify. The same file that's used for the unit tests is referenced in the README; we can change that later if it becomes awkward.
