tokenizer separator regex #424

Open
StarfallProjects opened this issue Oct 24, 2019 · 4 comments

@StarfallProjects

Hi! I am adding some custom regex to our tokenizer.separator:

lunr.tokenizer.separator = /[\s\-\.\(\)\[\]+\A-Z]/;

According to regex101.com, [\s\-\.\(\)\[\]+\A-Z] will match all the capital letters in DeleteStreamAsync. I would like it to split on those, so that, for example, searching for "DeleteStream" will return results.

Everything else is matching fine. For example, the characters I added so that DeleteStreamAsync(someParams) is returned when searching for DeleteStreamAsync are working, so it is at least splitting on (. It just doesn't seem to like the A-Z.

Any suggestions/info much appreciated.

@hoelzro
Contributor

hoelzro commented Oct 25, 2019

Hi @StarfallProjects - just to make sure I understand you correctly, you essentially want to split up DeleteStreamSync(fooBar) into a stream of tokens Delete Stream Sync foo Bar, right?

Tweaking lunr.tokenizer.separator this way will kind of work - you'll end up with a token stream consisting of elete tream ync foo ar, though. Which means searching for "delete bar" won't turn up any search results, and looking for "Delete Bar" will have the same results as "Delete Car".
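
To illustrate why the capitals go missing, here is a minimal sketch in plain JavaScript with no lunr involved. It only shows the general behaviour of splitting on a character class: any character the class matches is treated as a separator and discarded. The character class is written with a plain A-Z instead of \A-Z, and the variable names are made up for this example.

```javascript
// Plain JavaScript, no lunr: splitting on a character class
// throws away every character that the class matches.
const separator = /[\s\-.()\[\]+A-Z]/;

const tokens = "DeleteStreamSync(fooBar)"
  .split(separator)
  // adjacent separators and separators at the ends leave empty strings behind
  .filter((part) => part.length > 0);

console.log(tokens.join(" ")); // prints "elete tream ync foo ar"
```

Because the capital letters are consumed as separators, "Bar" and "Car" both end up as the token "ar", which is why the two searches become indistinguishable.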

If my understanding is correct and you want to use lunr for this, I'd recommend providing a custom, camel-case-aware tokenizer function to the lunr builder - I think you'd have better luck with that.
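
As a rough idea of what the splitting part of such a tokenizer could look like, here is a sketch in plain JavaScript. The function name camelCaseTokens is hypothetical and is not part of lunr. It returns plain lowercase strings; wrapping them in whatever token objects your lunr version expects, and registering the function with the builder, is left out and would need checking against that version's documentation.

```javascript
// Hypothetical helper, not part of lunr: split text on punctuation and
// on camel-case boundaries, keeping the capital letters in the tokens.
function camelCaseTokens(text) {
  return String(text)
    // put a space at each lower-to-upper boundary: "fooBar" -> "foo Bar"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // and between an acronym and a following word: "HTTPServer" -> "HTTP Server"
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2")
    // now split on whitespace and the punctuation from the original separator
    .split(/[\s\-.()\[\]+]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.toLowerCase());
}

console.log(camelCaseTokens("DeleteStreamSync(fooBar)").join(" "));
// prints "delete stream sync foo bar"
```

The same function would have to be applied to search input as well, so that a query such as DeleteStream is broken into delete and stream before it is matched against the index.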

However, I've noticed a few of your issues seem to revolve around searching source code, and I'm not sure lunr is particularly well-suited to that. Is there a specific reason you chose lunr?

@StarfallProjects
Author

Thanks for the advice.
I only joined the project recently - but it uses DocFX, which has lunr.js built in.

@hoelzro
Contributor

hoelzro commented Oct 30, 2019

Ah, I see - so it might be unreasonable to just yank lunr.js out and replace it with something entirely different!

With that in mind, I'd recommend trying to write the custom, camel-case-aware tokenizer I suggested above, and see how that works for you!

@StarfallProjects
Author

Thanks!
