
Hyphens shouldn't always be assumed to be query operators #32

Closed
nolanlawson opened this issue Oct 13, 2013 · 5 comments

@nolanlawson
Member

Related to #28.

Words like "e-commerce" should be understood to be non-complex queries (whereas something like "e -commerce" is truly complex).

nolanlawson added a commit that referenced this issue Oct 13, 2013
nolanlawson added a commit that referenced this issue Oct 17, 2013
@nolanlawson
Member Author

Unfortunately, it appears that hyphenated synonyms like "e-commerce" can only be used if the synonym file is explicitly purged of all hyphens (replacing them with spaces). Commit 4913b62 demonstrates this: the synonym file contains `e commerce,electronic commerce`, and queries like e-commerce work as expected.

My impression is that this is a weakness of the default configuration we chose for the synonym analyzer. The combination of a KeywordTokenizer at one step and a StandardTokenizer at the other causes hyphenated synonyms to be overlooked.

Unfortunately I can't seem to find a combination that satisfies all the unit tests, so for now I'm just recommending that people manually purge their synonym files of hyphenated synonyms.
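
For context, here is a sketch of the kind of default configuration being described, with the synonym analyzer defined for the query parser plugin in solrconfig.xml. The analyzer name and synonym file name are illustrative, and the exact element layout may differ from your setup:

```xml
<!-- Sketch of the problematic default: the analyzer chain tokenizes with
     StandardTokenizerFactory (which splits "e-commerce" into "e" and
     "commerce"), while SynonymFilterFactory parses the synonym file with
     KeywordTokenizerFactory (which keeps "e-commerce" whole) -- so the
     two sides never line up for hyphenated synonyms. -->
<queryParser name="synonym_edismax"
             class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myAnalyzer">
      <lst name="tokenizer">
        <str name="class">solr.StandardTokenizerFactory</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">example_synonym_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>
```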

@avlukanin
Collaborator

I use WhitespaceTokenizerFactory, not StandardTokenizerFactory, in the configuration; that's why the hyphen in e-commerce is preserved. StandardTokenizerFactory is applied further down, in my schema.xml.
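
As a rough illustration of this split (the field type name is hypothetical): the plugin's synonym analyzer tokenizes on whitespace, while the field analysis in schema.xml still uses the StandardTokenizer:

```xml
<!-- schema.xml (sketch): field analysis keeps StandardTokenizerFactory;
     only the plugin's synonym analyzer switches to whitespace tokenization -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```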

@janhoy
Contributor

janhoy commented Oct 17, 2013

How about inserting a PatternReplaceCharFilterFactory before the tokenizer to remove hyphens?
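
Something like this sketch, placed ahead of the tokenizer (whether the plugin's configuration accepts a charFilter at all is exactly what the next comment questions):

```xml
<!-- Sketch: replace hyphens with spaces before tokenization,
     so "e-commerce" reaches the tokenizer as "e commerce" -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="-" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
```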


@avlukanin
Collaborator

As far as I know, you cannot use CharFilters in edismax_synonyms in solrconfig.xml. It would of course be a workaround in some cases, but there are situations where hyphens are perfectly good characters and you don't want to lose them. For example, if you work with phone numbers like 234-45-56 and rely on WhitespaceTokenizer -> WordDelimiterFilter with catenateNumbers="1" to convert them into 2344556, you will no longer get complete phone numbers, only their parts.
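
To spell out the phone-number case with a plain analyzer chain (a generic Solr sketch, not this project's configuration): with hyphens intact, "234-45-56" survives the WhitespaceTokenizer as a single token, and WordDelimiterFilter both splits it and emits the catenated "2344556"; if a char filter turns the hyphens into spaces first, the tokenizer emits "234", "45", "56" as separate tokens and there is nothing left to catenate.

```xml
<!-- Sketch: "234-45-56" -> parts "234", "45", "56" plus catenated "2344556".
     Pre-stripping the hyphens would yield only the three separate parts. -->
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateNumberParts="1" catenateNumbers="1"/>
</analyzer>
```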

@nolanlawson
Member Author

While investigating #26 and #9, it occurred to me that all of these issues are related. I also think they're really just configuration issues, stemming from the fact that, in our examples and unit tests, we configure the synonym analyzer to use the StandardTokenizer, which tokenizes non-ASCII text and hyphenated synonyms in a way most folks find unintuitive:

血と骨 -> 血 と 骨 (3 tokens)
e-commerce -> e commerce (2 tokens)

My fix was just to replace the StandardTokenizer with the WhitespaceTokenizer. All the old unit tests still pass, and as a bonus we fully fix #32 and #9, so people don't have to manually replace hyphens with spaces anymore.
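
In configuration terms, the change amounts to something like this, against the plugin-style analyzer definition sketched above (illustrative, not the exact committed diff):

```xml
<!-- Sketch of the fix: swap the synonym analyzer's tokenizer -->
<lst name="tokenizer">
  <!-- was: <str name="class">solr.StandardTokenizerFactory</str> -->
  <str name="class">solr.WhitespaceTokenizerFactory</str>
</lst>
```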

Hopefully the WhitespaceTokenizer will work better for most cases. It messes with what gets considered a "shingle" (e.g. 血と骨 becomes one big unigram) once you get to the ShingleFilterFactory, but it seems to fit better with people's expectations of how the synonym expansion should work.

BTW, I also put all this configuration into a single file, so it's easier to modify. The same file that's used for the unit tests is referenced in the README; we can change that later if it becomes awkward.
