tokenizer separator regex #424

StarfallProjects · 2019-10-24T11:12:16Z

Hi! I am adding some custom regex to our tokenizer.separator:

lunr.tokenizer.separator = /[\s\-\.\(\)\[\]+\A-Z]/;

According to regex101.com, [\s\-\.\(\)\[\]+\A-Z] will match all the capital letters in DeleteStreamAsync. I would like it to split on those, so that, for example, searching for "DeleteStream" will return results.

Everything else is matching fine. For example, the characters I added so that DeleteStreamAsync(someParams) would be returned when searching for DeleteStreamAsync are working, so it's splitting on ( at least. It just doesn't seem to like the A-Z.

Any suggestions/info much appreciated.

hoelzro · 2019-10-25T03:03:45Z

Hi @StarfallProjects - just to make sure I understand you correctly, you essentially want to split up DeleteStreamSync(fooBar) into a stream of tokens Delete Stream Sync foo Bar, right?

Tweaking lunr.tokenizer.separator this way will kind of work - you'll end up with a token stream consisting of elete tream ync foo ar, though. Which means searching for "delete bar" won't turn up any search results, and looking for "Delete Bar" will have the same results as "Delete Car".
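To illustrate why the capital letters disappear, here is a minimal sketch that uses plain String.prototype.split rather than lunr itself. The helper name splitOnSeparator is made up for this example, and lunr's own tokenizer may behave differently in other respects; the point is only that a character matched by a separator is consumed, not kept.

```javascript
// The separator from the original post, unchanged. Outside of unicode mode
// JavaScript treats \A as a plain "A", so \A-Z is the range A-Z.
const separator = /[\s\-\.\(\)\[\]+\A-Z]/;

// Hypothetical helper: split on the separator and drop empty pieces.
function splitOnSeparator(text) {
  return text.split(separator).filter((piece) => piece.length > 0);
}

console.log(splitOnSeparator("DeleteStreamSync(fooBar)"));
// -> [ 'elete', 'tream', 'ync', 'foo', 'ar' ]
```

Every capital letter acts as a delimiter, so it never ends up in a token.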

If my understanding is correct and you want to use lunr for this, I'd recommend providing a custom, camel-case-aware tokenizer function to the lunr builder - I think you'd have better luck with that.
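A rough sketch of the splitting logic such a tokenizer could use is below. The function name camelCaseTerms is hypothetical and it returns plain lower-cased strings; hooking it into lunr (wrapping each term in whatever token object the builder expects and registering the function with the builder) is not shown and should be checked against the lunr documentation.

```javascript
// Hypothetical helper: splits on whitespace, punctuation and camel-case
// boundaries while keeping every letter, then lower-cases each term.
function camelCaseTerms(text) {
  return String(text)
    // Insert a space at each lower-to-upper boundary: "fooBar" -> "foo Bar".
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // Split on whitespace and the punctuation from the original separator.
    .split(/[\s\-.()\[\]+]+/)
    .filter((term) => term.length > 0)
    .map((term) => term.toLowerCase());
}

console.log(camelCaseTerms("DeleteStreamSync(fooBar)"));
// -> [ 'delete', 'stream', 'sync', 'foo', 'bar' ]
```

This simple boundary rule does not split runs of capitals (for example HTTPServer stays one term), and it does not emit the whole identifier as an extra term, so both would need handling if they matter for your docs.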

However, I've noticed a few of your issues seem to revolve around searching source code, and I'm not sure lunr is particularly well-suited to that. Is there a specific reason you chose lunr?

StarfallProjects · 2019-10-26T10:47:35Z

Thanks for the advice.
I only joined the project recently - but it uses DocFX, which has lunr.js built in.

hoelzro · 2019-10-30T04:43:33Z

Ah, I see - so it might be unreasonable to just yank lunr.js out and replace it with something entirely different!

With that in mind, I'd recommend trying to write the custom, camel-case-aware tokenizer I suggested above, and see how that works for you!

StarfallProjects · 2019-10-31T13:33:10Z

Thanks!

waylan mentioned this issue Jan 26, 2024

Sanitizing search entry titles mkdocs/mkdocs#3560

