Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom word boundary #2896

Closed
wants to merge 8 commits into from
Closed

Conversation

Mikk3lRo
Copy link
Contributor

@Mikk3lRo Mikk3lRo commented Oct 2, 2017

Summary

The default / current word boundary regex ^|\\b|\\s is remarkably bad at actually finding word boundaries in languages that use non-unicode characters. A word boundary is basically detected after any non-ascii character (fx. ü, å, ø and æ to mention just a few - but there are MANY).

I've looked into possibilities, and unfortunately there doesn't seem to be any way to get decent word-boundary detection for anything except ascii in javascripts RegExp implementation... without either using a third-party library or including some 4k+ characters in the string.

Therefore, I don't see any way to reliably detect word boundaries with any pre-set, hardcoded regex.

Turning it into an option means that people can at least set something appropriate for their individual language and / or use case if they care about word boundaries being detected in "weird" places.

Please double-check that:

  • All changes were made in CoffeeScript files, not JavaScript files.
  • You used Grunt to build the JavaScript files and tested them locally.
  • You've updated both the jQuery and Prototype versions.
  • You haven't manually updated the version number in package.json.
  • If necessary, you've updated the documentation.

References

First partial PR from: #2894

@Mikk3lRo
Copy link
Contributor Author

Mikk3lRo commented Oct 2, 2017

I'm fairly certain this will solve at least this open issue / bug:
#2862

@tjschuck
Copy link
Member

tjschuck commented Oct 2, 2017

@Mikk3lRo Can you squash this down to one commit? Instructions here.

Add Test for search_word_boundary

Add search_word_boundary to options page
@@ -27,6 +27,7 @@ class AbstractChosen
@enable_split_word_search = if @options.enable_split_word_search? then @options.enable_split_word_search else true
@group_search = if @options.group_search? then @options.group_search else true
@search_contains = @options.search_contains || false
@search_word_boundary = if @options.search_word_boundary? then @options.search_word_boundary else '^|\\s|\\b'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can’t this be just;

@search_word_boundary = options.search_word_boundary || '^|\\s|\\b'

?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the option should not necessarily include “search” in its name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not logical to prefix search related options with "search"?

And I'm pretty sure the question mark is needed, otherwise the empty string will evaluate to false and the default value will be used. The empty string is a perfectly valid value for this option... if set to an empty string Chosen works like search_contains=true, but allowing enable_split_word_search to decide whether to only match the beginning of the option - see linked issue for a relevant use case.

<td>^|\\b|\\s</td>
<td>By default, Chosen uses JS RegExp's built-in word boundary to detect word beginnings as well as whitespace or the beginning of the entire label. That works great for ascii-only languages, but <strong>will</strong> erroneously detect word boundaries after letters with umlauts among many, many others.<br>You can pass a string (that will be interpreted as part of a <code class="language-javascript">RegExp</code>) refined for your language and use case to correctly detect word boundaries. A (simplified) example could be <code class="language-javascript">'^|[^A-zæøåÆØÅ]'</code> for Danish.</td>
</tr>
<tr>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now in here twice

@@ -247,7 +247,7 @@ describe "Searching", ->

expect(div.find(".active-result").length).toBe(1)
expect(div.find(".active-result")[0].innerHTML).toBe("oh <em>h</em>ello")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit; unrelated change

Copy link
Contributor Author

@Mikk3lRo Mikk3lRo Oct 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... not eactly a git wiz here, and the squahing turned out to not be as easy as the instuctions suggested due to a conflict on the options page. I'm just about to give up and start a new branch instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok the whole thing is a mess now. I have no idea how to recover. Can I start a new branch and pull request?

# This is the 1st commit message:

Add Option search_word_boundary

Add Test for search_word_boundary

Add search_word_boundary to options page

Add new option: search_word_boundary

Add Test for search_word_boundary

Add search_word_boundary to options page

# This is the commit message harvesthq#2:

Cleaned unintentional, unrelated change.
# This is the commit message harvesthq#3:

Add Option search_word_boundary
# This is the commit message harvesthq#4:

Add Test for search_word_boundary
@Mikk3lRo
Copy link
Contributor Author

Mikk3lRo commented Oct 2, 2017

Sorry, but this is too messed up now. Some stupid conflict going around itself in circles.

@Mikk3lRo Mikk3lRo closed this Oct 2, 2017
@Mikk3lRo Mikk3lRo deleted the custom-word-boundary branch October 7, 2017 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants