tokenRegex should be a parameter, supplied through the config file #381

@martindholmes

The <stemmer> element in our config file allows the user to specify the location of a custom stemmer if they like, which is great, but we don't yet allow override of the tokenizer. It's of course possible to replace the entire tokenize.xsl file with your own for any specific project, but that's ugly, especially since you would need to do something similar for the JavaScript. In some cases, replacing just the tokenRegex would be sufficient; I have a current use-case where I'm experimenting with a crude tokenizer for Japanese, and this seems like a good opportunity to parameterize this value.

These are the changes that would need to be made:

  1. Make the tokenRegex variable in tokenize.xsl into an <xsl:param>.
  2. Add a new <tokenRegex> element to the schema, with plain text content.
  3. Decide whether it is our business to validate the content of the <tokenRegex> element. We don't necessarily have to; an invalid value would cause the build to fail with a comprehensible error.
  4. Set up the build process to pass the value of <tokenRegex> to the XSLT process.
  5. Handle the JavaScript (see below).

Item 5 is the difficult one. The query string parsing in StaticSearch.js is quite complicated because it has to distinguish between quoted phrases and individual tokens; it doesn't simply tokenize the text. However, there is a method preProcessSearchString(), provided specifically for the end user to override, which can be used for tokenization. In other words, a language that does not tokenize on whitespace can be handled with a custom tokenizer that injects whitespace, which the subsequent processing can then use as token boundaries.
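To make the whitespace-injection idea concrete, here is a minimal sketch of the kind of logic an overridden preProcessSearchString() might contain. Everything in it is an assumption for illustration: the helper name injectTokenBoundaries, the CRUDE_JA_TOKEN_REGEX pattern (a very rough split on runs of the same script), and the premise that the override receives the raw search string and returns a string. The real signature in StaticSearch.js may differ.

```javascript
// Hypothetical crude tokenizer pattern: runs of kanji, hiragana, katakana,
// or ASCII alphanumerics. This is NOT the project's tokenRegex.
const CRUDE_JA_TOKEN_REGEX =
  /\p{Script=Han}+|\p{Script=Hiragana}+|[\p{Script=Katakana}ー]+|[A-Za-z0-9]+/gu;

// Insert a single space wherever two tokens are directly adjacent, leaving
// everything else (quotes, existing spaces, punctuation) untouched.
// tokenRegex must carry the "g" flag, as String.prototype.matchAll requires.
function injectTokenBoundaries(str, tokenRegex) {
  let out = '';
  let lastEnd = 0;
  let prevWasToken = false;
  for (const m of str.matchAll(tokenRegex)) {
    const gap = str.slice(lastEnd, m.index);
    out += gap;
    // No characters between this token and the previous one: add a boundary.
    if (gap === '' && prevWasToken) {
      out += ' ';
    }
    out += m[0];
    lastEnd = m.index + m[0].length;
    prevWasToken = true;
  }
  return out + str.slice(lastEnd);
}

// Assumed shape of the user override: string in, string out.
function preProcessSearchString(strSearch) {
  return injectTokenBoundaries(strSearch, CRUDE_JA_TOKEN_REGEX);
}
```

Because quote characters fall outside the token pattern, they pass through unchanged, so the later phrase detection would still see the quoted span, now with spaces inside it.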

However, if phrasal search is being used, that presents a problem because the resulting phrasal strings will have whitespace in them which is not in the original text. One solution here would be to add another user-overridable postProcessSearchStringItem() function that simply returns its input by default, but which is called right at the beginning of addSearchItem(); the user could override this to reprocess any phrasal content to remove spaces. So in addition to configuring the <tokenRegex> in the config file, the end user would have to:

  1. Override preProcessSearchString() to pre-tokenize the search string by adding spaces using the same tokenizer, and then
  2. Override the new postProcessSearchStringItem() function to remove any spaces.
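The second override could look something like the sketch below. The function postProcessSearchStringItem() is only proposed in this issue, so its signature here (one search item string in, one string out) is an assumption, and removeInjectedSpaces is a hypothetical helper name. The sketch also makes a design choice the issue does not specify: it strips whitespace only between two Japanese-script characters, so spaces in Latin-script phrases survive.

```javascript
// Default behaviour as proposed in the issue: return the input unchanged.
function postProcessSearchStringItem(item) {
  return item;
}

// Hypothetical override body: remove whitespace that sits between two
// Japanese-script characters (kanji, hiragana, katakana, or the long-vowel
// mark), which is where a pre-tokenizer would have injected it.
function removeInjectedSpaces(item) {
  const betweenJapaneseChars =
    /(?<=[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}ー])\s+(?=[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}ー])/gu;
  return item.replace(betweenJapaneseChars, '');
}
```

One caveat with this approach: it cannot tell injected spaces from spaces the user typed between Japanese characters, so both are removed.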

This is by no means pretty, but it is doable, and with a requirement for this looming, I'd like to go ahead soon. @joeytakeda Do you see any major gotchas or bad ideas in here?

Labels: enhancement (New feature or request)
