PLT-2077 Support CJK hashtags #4555
Support CJK hashtags
Handle CJK characters when hashtags are parsed.
Using regex patterns with character ranges below.
Added a new range
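As a rough illustration of the approach described above, here is a minimal sketch of a hashtag pattern built from character ranges. The ranges and the `findHashtags` helper are assumptions chosen for the example; they are not the list used in this PR.

```javascript
// Hypothetical ranges for illustration only; NOT the ranges from this PR.
const CJK_RANGES = [
  '\\u3041-\\u3096', // Hiragana
  '\\u30A1-\\u30FA', // Katakana
  '\\u4E00-\\u9FFF', // CJK Unified Ideographs
  '\\uAC00-\\uD7A3', // Hangul syllables
].join('');

// '#' followed by one or more ASCII word characters or characters in the ranges above.
const hashtagPattern = new RegExp(`#[A-Za-z0-9_${CJK_RANGES}]+`, 'g');

// Hypothetical helper: returns every hashtag found in the text, or an empty array.
function findHashtags(text) {
  return text.match(hashtagPattern) || [];
}

console.log(findHashtags('release #v2 notes #한글 and #日本語 here'));
// → [ '#v2', '#한글', '#日本語' ]
```

Keeping the ranges in a commented list makes it easier to review which scripts are covered when a new range is added.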
CJK hashtags may require a change to the database.
DROP INDEX idx_posts_hashtags_txt ON Posts;
CREATE FULLTEXT INDEX idx_posts_hashtags_txt ON Posts (Hashtags) WITH PARSER ngram;
Please refer to the related issue.
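For background on why an ngram parser is suggested: CJK text is usually written without spaces between words, so a word-based full-text parser has little to split on, whereas an ngram parser indexes overlapping runs of n characters (MySQL's `ngram_token_size` defaults to 2). The sketch below only illustrates that idea. `ngramTokens` is a hypothetical helper and does not reproduce MySQL's implementation.

```javascript
// Illustration only: a simplified n-gram tokenizer, not MySQL's ngram parser.
function ngramTokens(text, n = 2) {
  const tokens = [];
  // Treat whitespace as a boundary, then slide a window of n characters over each run.
  for (const run of text.split(/\s+/).filter(Boolean)) {
    const chars = Array.from(run); // iterate by code point, not UTF-16 unit
    for (let i = 0; i + n <= chars.length; i++) {
      tokens.push(chars.slice(i, i + n).join(''));
    }
  }
  return tokens;
}

console.log(ngramTokens('日本語'));    // → [ '日本', '本語' ]
console.log(ngramTokens('東京 大阪')); // → [ '東京', '大阪' ]
```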
Thanks @cometkim for the pull request!
Could you please complete the Mattermost contribution license agreement?
This is a standard procedure for many open source projects. Your form should be processed within 24 hours and reviewers for your pull request will be able to proceed.
Please let us know if you have any questions.
We are very happy to have you join our growing community! If you're not yet a member, please consider joining our Contributors community channel to meet other contributors and discuss new opportunities with the core team.
Hi @cometkim, thanks for the PR! Looks like you have a client unit test failing:
Let me know if you need help fixing it.
May I ask what this means?
// Known issue, trailing underscore is captured by the client-side regex but not the server-side one
assert.equal(
    TextFormatting.formatText('#test_').trim(),
    "<p><a class='mention-link' href='#' data-hashtag='#test_'>#test_</a></p>"
)
Does this test have to pass? I think that depends on which regex is the right one for hashtags.
I don't remember if we originally intended dots to be allowed in hashtags, but we have used it in the past for version numbers like
I also looked into how Twitter does their hashtags, and while they don't allow
I've tested some Japanese words on the Spinmint test server.
A word containing a full-width space is detected as a single hashtag (see below).
I expected "#鰻" to be detected as the hashtag, but "#鰻 他" was detected instead.
I think a hashtag should end at a space, regardless of whether the space is full-width or half-width.
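One way a match like this can arise is when the character class includes the whole CJK Symbols and Punctuation block, which starts at the full-width space (U+3000). The following sketch assumes that cause and compares two hypothetical classes that differ only in whether U+3000 is included; neither is the regex from this PR.

```javascript
// Both patterns are hypothetical; only the treatment of U+3000 differs.
// Includes the whole CJK Symbols and Punctuation block (U+3000-U+303F).
const withFullWidthSpace = /#[\u3000-\u303F\u3040-\u30FF\u4E00-\u9FFF]+/;
// Same block, but starting at U+3001 so the full-width space ends the hashtag.
const withoutFullWidthSpace = /#[\u3001-\u303F\u3040-\u30FF\u4E00-\u9FFF]+/;

// Hypothetical helper: returns the first hashtag matched by the pattern, or null.
function firstHashtag(text, pattern) {
  const match = text.match(pattern);
  return match ? match[0] : null;
}

const sample = '#鰻\u3000他'; // '#鰻', a full-width space, then '他'

console.log(firstHashtag(sample, withFullWidthSpace) === sample); // → true (space swallowed)
console.log(firstHashtag(sample, withoutFullWidthSpace));         // → '#鰻'
```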
I can only test Japanese, sorry... :(
However, I would prefer to use the range of Japanese-style punctuation, excluding the full-width space (\u3000). (i.e.
@cometkim Have you tried using this regex that I added to the ticket
No worries. I thought the meeting was in the middle of the night for you, but I wanted to offer it in case you worked unusual hours.
We decided to keep it as is for now since we use hashtags including dots (like
Regarding the minimum length, I'm not too familiar with CJK, but would 2 character hashtags be common for them? The regex I posted would support 2 character hashtags if we change it to
Instead of adding the special case for CJK hashtags on Postgres, we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json that defaults to 3. That way, users could lower it to 2 if their database is set up to support it. If you're interested in adding something like that, you could do it as part of this PR. If not, you can just leave the minimum length as 3, and I'll file a separate ticket to add it.
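A minimal sketch of how such a setting might be applied, assuming the length is counted in characters after the `#`. The `filterHashtags` helper and the shape of the settings object are hypothetical; only the `MinimumHashtagLength` name and the default of 3 come from the proposal above.

```javascript
// Default taken from the proposal above; everything else here is hypothetical.
const DEFAULT_MINIMUM_HASHTAG_LENGTH = 3;

function filterHashtags(hashtags, serviceSettings = {}) {
  const configured = serviceSettings.MinimumHashtagLength;
  const minimum = Number.isInteger(configured) ? configured : DEFAULT_MINIMUM_HASHTAG_LENGTH;
  return hashtags.filter((tag) => {
    // Assumption: length excludes the leading '#'. Count code points so that
    // ideographs outside the Basic Multilingual Plane count as one character.
    const length = Array.from(tag.replace(/^#/, '')).length;
    return length >= minimum;
  });
}

const tags = ['#日本', '#日本語', '#ab', '#abc'];
console.log(filterHashtags(tags));                              // → [ '#日本語', '#abc' ]
console.log(filterHashtags(tags, { MinimumHashtagLength: 2 })); // → all four tags
```

Keeping the default at 3 leaves existing installations unchanged, while deployments whose database supports shorter tokens can opt in to 2.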
I've fixed the regex to
I think it should be.
Removed back the