tokenRegex should be a parameter, supplied through the config file #381

The `<stemmer>` element in our config file allows the user to specify the location of a custom stemmer if they like, which is great, but we don't yet allow override of the tokenizer. It's of course possible to replace the entire tokenize.xsl file with your own for any specific project, but that's ugly, especially since you would need to do something similar for the JavaScript. In some cases, replacing just the `tokenRegex` would be sufficient; I have a current use-case where I'm experimenting with a crude tokenizer for Japanese, and this seems like a good opportunity to parameterize this value.

These are the changes that would need to be made:

1. Make the `tokenRegex` variable in `tokenize.xsl` into an `<xsl:param>`.
2. Add a new `<tokenRegex>` element to the schema, with plain text content.
3. Consider whether it is our business to validate the content of the `<tokenRegex>` element. We don't necessarily have to; an invalid value would cause the build to fail with a comprehensible error.
4. Set up the build process to pass the value of `<tokenRegex>` to the XSLT process.
5. Handle the JavaScript (see below).

Step 5 is the difficult one. The query string parsing in StaticSearch.js is quite complicated because it has to distinguish between quoted phrases and individual tokens; it doesn't simply tokenize the text. However, there is a method, `preProcessSearchString()`, which is provided specifically for an end user to override, and it can be used for tokenization. In other words, a language which does not tokenize on whitespace can be handled with a custom tokenizer that injects whitespace, which the subsequent processing can then use as token boundaries.
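To make the whitespace-injection idea concrete, here is a minimal sketch of what such a pre-tokenizer might look like. Everything in it is an assumption for illustration: the helper name `injectTokenBoundaries` is hypothetical, the regex is a crude script-run pattern rather than a real Japanese tokenizer, and I'm assuming an override of `preProcessSearchString()` would take a string and return a string.

```javascript
// Hypothetical crude tokenizer pattern: one token per run of Han,
// Hiragana, Katakana, or ASCII alphanumerics. The global flag is required
// because matchAll() throws a TypeError without it.
const crudeTokenRegex =
  /[\u4E00-\u9FFF]+|[\u3040-\u309F]+|[\u30A0-\u30FF]+|[A-Za-z0-9]+/g;

// Insert a single space wherever two tokens are directly adjacent.
// Text between tokens (quotes, existing spaces, punctuation) is copied
// through unchanged, so quoted phrases keep their quotation marks.
function injectTokenBoundaries(str, tokenRegex) {
  let out = '';
  let last = 0; // index just past the previous match
  for (const m of str.matchAll(tokenRegex)) {
    const adjacent = last > 0 && m.index === last;
    out += str.slice(last, m.index);
    if (adjacent) {
      out += ' ';
    }
    out += m[0];
    last = m.index + m[0].length;
  }
  return out + str.slice(last);
}

// An override of preProcessSearchString() could then (assumption)
// simply return injectTokenBoundaries(input, crudeTokenRegex).
console.log(injectTokenBoundaries('東京タワーへ行く', crudeTokenRegex));
// prints "東京 タワー へ 行 く"
```

The point of copying inter-token text verbatim is that the existing query parser still sees the user's quotation marks and can identify phrases as before.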

However, phrasal search presents a problem, because the resulting phrasal strings will contain whitespace which is not in the original text. One solution would be to add another user-overridable function, `postProcessSearchStringItem()`, which simply returns its input by default but is called right at the beginning of `addSearchItem()`; the user could override this to reprocess any phrasal content to remove spaces. So in addition to configuring the `<tokenRegex>` in the config file, the end user would have to:

1. Override `preProcessSearchString()` to pre-tokenize the search string by adding spaces using the same tokenizer, and then
2. Override the new `postProcessSearchStringItem()` function to remove any spaces.
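The second override could be sketched roughly as follows. This is not the real StaticSearch code: `SearchSketch` and `JapaneseSearchSketch` are hypothetical stand-in classes, `addSearchItem()` is reduced to the single call that matters here, and the hook's signature (string in, string out) is an assumption. It also assumes the indexed text contains no whitespace inside phrases, so stripping all whitespace is safe.

```javascript
// Stand-in for the search class; NOT the real StaticSearch implementation.
class SearchSketch {
  // Proposed hook: returns its input unchanged by default.
  postProcessSearchStringItem(item) {
    return item;
  }

  // Simplified stand-in for addSearchItem(): the hook is called first,
  // and the real method's further processing is omitted.
  addSearchItem(item) {
    return this.postProcessSearchStringItem(item);
  }
}

// Project-specific override: strip the spaces that the pre-tokenizing
// step injected, so a phrase matches the original unspaced text.
class JapaneseSearchSketch extends SearchSketch {
  postProcessSearchStringItem(item) {
    return item.replace(/\s+/g, '');
  }
}

console.log(new JapaneseSearchSketch().addSearchItem('東京 タワー'));
// prints "東京タワー"
```

For single-token items the override is a no-op, since they contain no whitespace, so calling the hook on every item rather than only on phrases should be harmless under these assumptions.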

This is by no means pretty, but it is doable, and with a requirement for this looming, I'd like to go ahead soon. @joeytakeda Do you see any major gotchas or bad ideas in here?
