Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PLT-2077 Support CJK hashtags #4555

Merged
merged 7 commits into from Nov 24, 2016
Merged

PLT-2077 Support CJK hashtags #4555

merged 7 commits into from Nov 24, 2016

Conversation

cometkim
Copy link
Collaborator

@cometkim cometkim commented Nov 15, 2016

Support CJK hashtags

Handle CJK characters when hashtags are parsed.

Using regex patterns with character ranges below.

Exist ranges
  • 3000 - 303f: Japanese-style punctuation
  • 3040 - 309f: Hiragana
  • 30a0 - 30ff: Katakana
  • ff00 - ff9f: Full-width Roman characters and half-width Katakana
  • 4e00 - 9faf: CJK unified ideographs - Common and uncommon Kanji
  • 3400 - 4dbf: CJK unified ideographs Extension A - Rare Kanji
Added a new range
  • ac00 - d7a3: Korean completed words (가-힣)

Ticket Link

PLT-2077 Support all unicode letters in hashtags

Checklist

  • Added or updated unit tests (needs help to add more test cases in Chinese and Japanese)

Note:

CJK Hashtag may requires to modify database.

DROP INDEX idx_posts_hashtags_txt ON Posts;
CREATE FULLTEXT INDEX idx_posts_hashtags_txt ON Posts (Hashtags) WITH PARSER ngram;

please refer a related issue.

@mattermod
Copy link
Contributor

Thanks @cometkim for the pull request!

Per the CONTRIBUTING.md file displayed when you created this pull request, we need to add you to the list of approved contributors for the Mattermost project.

Please help complete the Mattermost contribution license agreement?

This is a standard procedure for many open source projects. Your form should be processed within 24 hours and reviewers for your pull request will be able to proceed.

Please let us know if you have any questions.

We are very happy to have you join our growing community! If you're not yet a member, please consider joining our Contributors community channel to meet other contributors and discuss new opportunities with the core team.

@jwilander
Copy link
Member

Hi @cometkim, thanks for the PR! Looks like you a client unit test failing:

1 failing

  1) TextFormatting.Hashtags Hashtags:

      AssertionError: '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test\'>#test</a>123</p>' == '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test123\'>#test123</a></p>'
      + expected - actual

      -<p><a class='mention-link' href='#' data-hashtag='#test'>#test</a>123</p>
      +<p><a class='mention-link' href='#' data-hashtag='#test123'>#test123</a></p>

      at Context.<anonymous> (.tmp/mocha-webpack/c4951ea931302f59c7c872a7ccab381a/c4951ea931302f59c7c872a7ccab381a-output.js:6:765)

Let me know if you need help fixing it

@jwilander jwilander added the 1: PM Review Requires review by a product manager label Nov 15, 2016
@cometkim
Copy link
Collaborator Author

cometkim commented Nov 15, 2016

Hello, @jwilander.
I've fixed, but not pushed yet.

May I ask what these mean of this?

// Known issue, trailing underscore is captured by the client-side regex but not the server-side one
assert.equal(
    TextFormatting.formatText('#test_').trim(),
    "<p><a class='mention-link' href='#' data-hashtag='#test_'>#test_</a></p>"
)

Is the test must be passed? I think it depends on which is the right regex for hashtags.

  • client-side regex is allowed to underscore at last: (^|\W)(#[a-zA-ZäöüÄÖÜß][a-zA-Z0-9äöüÄÖÜß.\-_]*)\b
  • server-side regex is not allowed to underscore at last: ^(#[A-Za-zäöüÄÖÜß]+[A-Za-z0-9äöüÄÖÜß.\-_]*[A-Za-z0-9äöüÄÖÜß])$

@hmhealey
Copy link
Member

Hi @cometkim, you can remove that test if your regex fixes it. I just left that there as a reminder to myself when I started working on that ticket.

`test_` shouldn't be a hashtag
@cometkim
Copy link
Collaborator Author

cometkim commented Nov 15, 2016

Hi @hmhealey, I removed the test, and all tests were passed.

But I think we still need more considering about hashtags allowed.
for example, currently #test.test would be hashtag. but I don't think a hashtag should contains dots.

@hmhealey
Copy link
Member

I don't remember if we originally intended dots to be allowed in hashtags, but we have used it in the past for version numbers like #v2.4. I've wanted to revisit what makes up a hashtag for a while now so I'll schedule something for our UX design meeting tomorrow at 11:30 AM EST (15:30 UTC). I don't think the times work out well for you, but you're welcome to join us if you're available.

I also looked into how Twitter does their hashtags, and while they don't allow #test.test, they do actually allow the #test_ case above. They've also got a lot of support for different character sets so it could make sense to standardize on their regex. Here's the links I was looking at:
Stack Overflow question about their hashtags
Twitter hashtag regex from that post

@lfbrock lfbrock added the Setup Old Test Server Triggers the creation of a test server label Nov 16, 2016
@kaakaa
Copy link
Contributor

kaakaa commented Nov 17, 2016

I've tested some Japanese word in Spinmint test server.

A word including full-width space is detected as a hashtag.(see bellow)

town_square_-_pr4555_mattermost

I expect to detect "#鰻" as hashtag, but detected "#鰻 他".

I think a hashtag should be separated by space regardless of weather full-width or half-width.
How do you think about this?

@cometkim
Copy link
Collaborator Author

cometkim commented Nov 17, 2016

Because I don't know exactly what are ranges of Chinese and Japanese.

I think it should be cutted-off one or more characters from begin/end of that ranges.

Or I can use the regex @hmhealey said. It looks working well for Korean characters.

@kaakaa Could you test the regex with Chinese and Japanese?

@kaakaa
Copy link
Contributor

kaakaa commented Nov 17, 2016

Thanks @cometkim.

I can test Japanese only, sorry... :(
The regex with Japanese works fine for me.

But, I prefer using the range of Japanese-style punctuation except for full-width space (\u3000). (i.e. 3001 - 303f: Japanese-style punctuation)
Because when using the regex I cannot use almost all of the full-width symbol for hashtag.

@hmhealey
Copy link
Member

@cometkim Have you tried using this regex that I added to the ticket #\p{L}[\p{L}\d\-_.]+[\p{L}\d]? You'll need to use XRegExp instead of the built-in RegExp, but it will only all characters that the Unicode standard considers letters and shouldn't match any punctuation.

@cometkim
Copy link
Collaborator Author

@hmhealey I've tested that regex also, but it would restrict minimum characters of hashtags on regex side.

Did you make any policies about hashtags pattern and length at last meeting?
sorry for not participating. it was 3:00 AM to me.

@hmhealey
Copy link
Member

hmhealey commented Nov 17, 2016

No worries. I thought the meeting was in the middle of the night for you, but I wanted to offer it in case you worked unusual hours.

We decided to keep it as is for now since we use hashtags including dots (like #v1.4) and dashes (like #bug-ticket) ourselves. I'm going to make a post on our forums about possibly changing it in the future to gather feedback, but that won't be happening until we revisit our search system since many of the restrictions we have (such as the minimum length) are because of requirements for searching columns in MySQL and Postgres.

Regarding the minimum length, I'm not too familiar with CJK, but would 2 character hashtags be common for them? The regex I posted would support 2 character hashtags if we change it to #\p{L}[\p{L}\d\-_.]*[\p{L}\d], but I think we need to leave it at 3 characters minimum for now, even for CJK hashtags. The problem with 2 character hashtags is that they only work on Postgres and not MySQL unless the user changes some additional settings that we don't have access to.

Instead of adding the special case for CJK hashtags on Postgres, we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json that defaults to 3. That way, users could lower it to 2 if their database is set up to support it. If you're interested in adding something like that, you could do it as part of this PR. If not, you can just leave the minimum length as 3, and I'll file a separate ticket to add it.

@cometkim
Copy link
Collaborator Author

I've fixed the regex to #\p{L}[\p{L}\d\-_.]*[\p{L}\d] you suggested.

we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json

I think it should be.

Because :

  1. end-user may have used only one letter to make hashtag. (twitter and facebook support it.)
  2. There are so many words with two letters in CJK. In Chinese, 2~3 letters are enough to make a word. (Japanese and Korean use Chinese words in common.)

Removed back the MIN_CJK_HASHTAG_LINK_LENGTH from code. Please make a new ticket for it.

@@ -304,7 +304,7 @@ func Etag(parts ...interface{}) string {
return etag
}

var validHashtag = regexp.MustCompile(`^(#[A-Za-zäöüÄÖÜß]+[A-Za-z0-9äöüÄÖÜß_\-]*[A-Za-z0-9äöüÄÖÜß])$`)
var validHashtag = regexp.MustCompile(`^(#\\pL[\\pL\\d\\-_.]*[\\pL\\d])$`)
Copy link
Member

@hmhealey hmhealey Nov 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your unit tests are failing. I think it's because of the extra backslashes causing the regex to be incorrect. Since they're surrounded by backquotes (like `), you don't need to escape them. This should just be

var validHashtag = regexp.MustCompile(`^(#\pL[\pL\d\-_.]*[\pL\d])$`)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed invalid escape characters. It is a mistake during copy paste the regex.

@lfbrock lfbrock added Setup Old Test Server Triggers the creation of a test server and removed Setup Old Test Server Triggers the creation of a test server labels Nov 23, 2016
@mattermod
Copy link
Contributor

Setup Test Server label detected. Spinmint test server created if build succeeds (checks pass and no conflicts with base branch).

@lfbrock lfbrock added the Setup Old Test Server Triggers the creation of a test server label Nov 23, 2016
@mattermod
Copy link
Contributor

Spinmint test server created at: http://i-3f7712ab.spinmint.com:8065/pr4555

Test Account 1: Email: test@test.com | Password: passwd

Test Account 2: Email: test2@test.com | Password: passwd

Instance ID: i-3f7712ab

@lfbrock
Copy link
Contributor

lfbrock commented Nov 23, 2016

Just following up on this, here's the ticket to add the minimum hashtag length setting @cometkim, if you're interested in working on that: https://mattermost.atlassian.net/browse/PLT-4793

I've asked @kaakaa to test if the full width hashtag issue is still happening.

@kaakaa
Copy link
Contributor

kaakaa commented Nov 24, 2016

Now I tested.
It works fine for Japanese hashtag.

Thanks @cometkim!

@lfbrock lfbrock removed the 1: PM Review Requires review by a product manager label Nov 24, 2016
@lfbrock
Copy link
Contributor

lfbrock commented Nov 24, 2016

+1

@lfbrock lfbrock added the 2: Dev Review Requires review by a developer label Nov 24, 2016
@jwilander jwilander added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a developer Setup Old Test Server Triggers the creation of a test server labels Nov 24, 2016
@mattermod
Copy link
Contributor

Spinmint test server destroyed

@jwilander jwilander merged commit 2abcc25 into mattermost:master Nov 24, 2016
@jwilander
Copy link
Member

Awesome work @cometkim !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants