The <stemmer> element in our config file lets the user specify the location of a custom stemmer, which is great, but we don't yet allow the tokenizer to be overridden. It's of course possible to replace the entire tokenize.xsl file with your own for a specific project, but that's ugly, especially since you would need to do something similar for the JavaScript. In some cases, replacing just the tokenRegex would be sufficient; I have a current use case where I'm experimenting with a crude tokenizer for Japanese, and this seems like a good opportunity to parameterize this value.
These are the changes that would need to be made:
1. Make the tokenRegex variable in tokenize.xsl into an <xsl:param>.
2. Add a new <tokenRegex> element to the schema, with plain text content.
3. Decide whether it is our business to validate the content of the <tokenRegex> element. We don't necessarily have to; a bad regex would cause the build to fail with a comprehensible error.
4. Set up the build process to pass the value of <tokenRegex> to the XSLT process.
5. Handle the JavaScript (see below).
Item 5 is the difficult one. The query string parsing in StaticSearch.js is quite complicated because it has to distinguish between quoted phrases and individual tokens; it doesn't simply tokenize the text. However, there is a method, preProcessSearchString(), which is provided specifically for an end user to override, and it can be used for tokenization. In other words, a language which does not tokenize on whitespace can be handled with a custom tokenizer that injects whitespace, which the subsequent processing can then use as token boundaries.
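To illustrate the whitespace-injection idea, here is a rough sketch of what such a pre-tokenizer might look like. It assumes a very crude rule (each run of kanji, hiragana, or katakana is one token), which is only a placeholder for whatever regex ends up in the config. The names cjkRun and injectTokenBoundaries are hypothetical, and wiring the function into preProcessSearchString() is left out because it depends on how that method is overridden in a given project.

```javascript
// Hypothetical crude rule: a run of characters in a single Japanese script
// counts as one token. This is an assumption for illustration, not the
// project's actual tokenRegex.
const cjkRun = /\p{Script=Han}+|\p{Script=Hiragana}+|\p{Script=Katakana}+/gu;

// Surround every matched run with spaces, then normalize the whitespace so
// that later processing can split on single spaces.
function injectTokenBoundaries(searchString) {
  return searchString
    .replace(cjkRun, ' $& ')   // $& is the matched run itself
    .replace(/\s+/g, ' ')      // collapse any doubled spaces
    .trim();
}

console.log(injectTokenBoundaries('東京の天気'));
```

Text that already uses whitespace is left alone, so a mixed query such as 'hello 東京' only gains boundaries around the Japanese runs. This sketch does not treat quoted phrases specially.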
However, if phrasal search is being used, that presents a problem because the resulting phrasal strings will have whitespace in them which is not in the original text. One solution here would be to add another user-overridable postProcessSearchStringItem() function that simply returns its input by default, but which is called right at the beginning of addSearchItem(); the user could override this to reprocess any phrasal content to remove spaces. So in addition to configuring the <tokenRegex> in the config file, the end user would have to:
- Override preProcessSearchString() to pre-tokenize the search string by adding spaces using the same tokenizer, and then
- Override the new postProcessSearchStringItem() function to remove any spaces.
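As a rough sketch of how the second override could behave: the class below is a stand-in, not the real StaticSearch class, and its addSearchItem() is reduced to the one call that matters here. The class names, the items array, and the isPhrasal parameter are all assumptions for illustration; only the idea of a pass-through hook called at the start of addSearchItem() comes from the proposal above.

```javascript
// Stand-in for the relevant part of the search class (hypothetical).
class SearchSketch {
  constructor() {
    this.items = [];
  }

  // Proposed hook: returns its input unchanged by default.
  postProcessSearchStringItem(item) {
    return item;
  }

  // Greatly simplified; the hook runs before anything else happens.
  addSearchItem(item, isPhrasal) {
    const processed = this.postProcessSearchStringItem(item);
    this.items.push({ str: processed, isPhrasal: isPhrasal });
  }
}

// Project-level override for a language that does not use whitespace.
class JapaneseSearchSketch extends SearchSketch {
  postProcessSearchStringItem(item) {
    // Strip the spaces that pre-tokenization injected. Single tokens
    // contain no spaces, so this only changes phrasal items.
    return item.replace(/\s+/g, '');
  }
}

const search = new JapaneseSearchSketch();
search.addSearchItem('東京 の 天気', true);
console.log(search.items[0].str);
```

Stripping every space is only safe if phrases never contain genuine whitespace; a project with mixed-language text would probably want to remove spaces only between CJK characters.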
This is by no means pretty, but it is doable, and with a requirement for this looming, I'd like to go ahead soon. @joeytakeda Do you see any major gotchas or bad ideas in here?