New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PLT-2077 Support CJK hashtags #4555

Merged
merged 7 commits into from Nov 24, 2016

Conversation

Projects
None yet
7 participants
@cometkim
Contributor

cometkim commented Nov 15, 2016

Support CJK hashtags

Handle CJK characters when hashtags are parsed.

Using regex patterns with character ranges below.

Exist ranges
  • 3000 - 303f: Japanese-style punctuation
  • 3040 - 309f: Hiragana
  • 30a0 - 30ff: Katakana
  • ff00 - ff9f: Full-width Roman characters and half-width Katakana
  • 4e00 - 9faf: CJK unified ideographs - Common and uncommon Kanji
  • 3400 - 4dbf: CJK unified ideographs Extension A - Rare Kanji
Added a new range
  • ac00 - d7a3: Korean completed words (가-힣)

Ticket Link

PLT-2077 Support all unicode letters in hashtags

Checklist

  • Added or updated unit tests (needs help to add more test cases in Chinese and Japanese)

Note:

CJK Hashtag may requires to modify database.

DROP INDEX idx_posts_hashtags_txt ON Posts;
CREATE FULLTEXT INDEX idx_posts_hashtags_txt ON Posts (Hashtags) WITH PARSER ngram;

please refer a related issue.

@mattermod

This comment has been minimized.

Show comment
Hide comment
@mattermod

mattermod Nov 15, 2016

Thanks @cometkim for the pull request!

Per the CONTRIBUTING.md file displayed when you created this pull request, we need to add you to the list of approved contributors for the Mattermost project.

Please help complete the Mattermost contribution license agreement?

This is a standard procedure for many open source projects. Your form should be processed within 24 hours and reviewers for your pull request will be able to proceed.

Please let us know if you have any questions.

We are very happy to have you join our growing community! If you're not yet a member, please consider joining our Contributors community channel to meet other contributors and discuss new opportunities with the core team.

mattermod commented Nov 15, 2016

Thanks @cometkim for the pull request!

Per the CONTRIBUTING.md file displayed when you created this pull request, we need to add you to the list of approved contributors for the Mattermost project.

Please help complete the Mattermost contribution license agreement?

This is a standard procedure for many open source projects. Your form should be processed within 24 hours and reviewers for your pull request will be able to proceed.

Please let us know if you have any questions.

We are very happy to have you join our growing community! If you're not yet a member, please consider joining our Contributors community channel to meet other contributors and discuss new opportunities with the core team.

@jwilander

This comment has been minimized.

Show comment
Hide comment
@jwilander

jwilander Nov 15, 2016

Member

Hi @cometkim, thanks for the PR! Looks like you a client unit test failing:

1 failing

  1) TextFormatting.Hashtags Hashtags:

      AssertionError: '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test\'>#test</a>123</p>' == '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test123\'>#test123</a></p>'
      + expected - actual

      -<p><a class='mention-link' href='#' data-hashtag='#test'>#test</a>123</p>
      +<p><a class='mention-link' href='#' data-hashtag='#test123'>#test123</a></p>

      at Context.<anonymous> (.tmp/mocha-webpack/c4951ea931302f59c7c872a7ccab381a/c4951ea931302f59c7c872a7ccab381a-output.js:6:765)

Let me know if you need help fixing it

Member

jwilander commented Nov 15, 2016

Hi @cometkim, thanks for the PR! Looks like you a client unit test failing:

1 failing

  1) TextFormatting.Hashtags Hashtags:

      AssertionError: '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test\'>#test</a>123</p>' == '<p><a class=\'mention-link\' href=\'#\' data-hashtag=\'#test123\'>#test123</a></p>'
      + expected - actual

      -<p><a class='mention-link' href='#' data-hashtag='#test'>#test</a>123</p>
      +<p><a class='mention-link' href='#' data-hashtag='#test123'>#test123</a></p>

      at Context.<anonymous> (.tmp/mocha-webpack/c4951ea931302f59c7c872a7ccab381a/c4951ea931302f59c7c872a7ccab381a-output.js:6:765)

Let me know if you need help fixing it

@cometkim

This comment has been minimized.

Show comment
Hide comment
@cometkim

cometkim Nov 15, 2016

Contributor

Hello, @jwilander.
I've fixed, but not pushed yet.

May I ask what these mean of this?

// Known issue, trailing underscore is captured by the client-side regex but not the server-side one
assert.equal(
    TextFormatting.formatText('#test_').trim(),
    "<p><a class='mention-link' href='#' data-hashtag='#test_'>#test_</a></p>"
)

Is the test must be passed? I think it depends on which is the right regex for hashtags.

  • client-side regex is allowed to underscore at last: (^|\W)(#[a-zA-ZäöüÄÖÜß][a-zA-Z0-9äöüÄÖÜß.\-_]*)\b
  • server-side regex is not allowed to underscore at last: ^(#[A-Za-zäöüÄÖÜß]+[A-Za-z0-9äöüÄÖÜß.\-_]*[A-Za-z0-9äöüÄÖÜß])$
Contributor

cometkim commented Nov 15, 2016

Hello, @jwilander.
I've fixed, but not pushed yet.

May I ask what these mean of this?

// Known issue, trailing underscore is captured by the client-side regex but not the server-side one
assert.equal(
    TextFormatting.formatText('#test_').trim(),
    "<p><a class='mention-link' href='#' data-hashtag='#test_'>#test_</a></p>"
)

Is the test must be passed? I think it depends on which is the right regex for hashtags.

  • client-side regex is allowed to underscore at last: (^|\W)(#[a-zA-ZäöüÄÖÜß][a-zA-Z0-9äöüÄÖÜß.\-_]*)\b
  • server-side regex is not allowed to underscore at last: ^(#[A-Za-zäöüÄÖÜß]+[A-Za-z0-9äöüÄÖÜß.\-_]*[A-Za-z0-9äöüÄÖÜß])$
@hmhealey

This comment has been minimized.

Show comment
Hide comment
@hmhealey

hmhealey Nov 15, 2016

Member

Hi @cometkim, you can remove that test if your regex fixes it. I just left that there as a reminder to myself when I started working on that ticket.

Member

hmhealey commented Nov 15, 2016

Hi @cometkim, you can remove that test if your regex fixes it. I just left that there as a reminder to myself when I started working on that ticket.

Remove a wrong test case
`test_` shouldn't be a hashtag
@cometkim

This comment has been minimized.

Show comment
Hide comment
@cometkim

cometkim Nov 15, 2016

Contributor

Hi @hmhealey, I removed the test, and all tests were passed.

But I think we still need more considering about hashtags allowed.
for example, currently #test.test would be hashtag. but I don't think a hashtag should contains dots.

Contributor

cometkim commented Nov 15, 2016

Hi @hmhealey, I removed the test, and all tests were passed.

But I think we still need more considering about hashtags allowed.
for example, currently #test.test would be hashtag. but I don't think a hashtag should contains dots.

@hmhealey

This comment has been minimized.

Show comment
Hide comment
@hmhealey

hmhealey Nov 15, 2016

Member

I don't remember if we originally intended dots to be allowed in hashtags, but we have used it in the past for version numbers like #v2.4. I've wanted to revisit what makes up a hashtag for a while now so I'll schedule something for our UX design meeting tomorrow at 11:30 AM EST (15:30 UTC). I don't think the times work out well for you, but you're welcome to join us if you're available.

I also looked into how Twitter does their hashtags, and while they don't allow #test.test, they do actually allow the #test_ case above. They've also got a lot of support for different character sets so it could make sense to standardize on their regex. Here's the links I was looking at:
Stack Overflow question about their hashtags
Twitter hashtag regex from that post

Member

hmhealey commented Nov 15, 2016

I don't remember if we originally intended dots to be allowed in hashtags, but we have used it in the past for version numbers like #v2.4. I've wanted to revisit what makes up a hashtag for a while now so I'll schedule something for our UX design meeting tomorrow at 11:30 AM EST (15:30 UTC). I don't think the times work out well for you, but you're welcome to join us if you're available.

I also looked into how Twitter does their hashtags, and while they don't allow #test.test, they do actually allow the #test_ case above. They've also got a lot of support for different character sets so it could make sense to standardize on their regex. Here's the links I was looking at:
Stack Overflow question about their hashtags
Twitter hashtag regex from that post

@kaakaa

This comment has been minimized.

Show comment
Hide comment
@kaakaa

kaakaa Nov 17, 2016

Contributor

I've tested some Japanese word in Spinmint test server.

A word including full-width space is detected as a hashtag.(see bellow)

town_square_-_pr4555_mattermost

I expect to detect "#鰻" as hashtag, but detected "#鰻 他".

I think a hashtag should be separated by space regardless of weather full-width or half-width.
How do you think about this?

Contributor

kaakaa commented Nov 17, 2016

I've tested some Japanese word in Spinmint test server.

A word including full-width space is detected as a hashtag.(see bellow)

town_square_-_pr4555_mattermost

I expect to detect "#鰻" as hashtag, but detected "#鰻 他".

I think a hashtag should be separated by space regardless of weather full-width or half-width.
How do you think about this?

@cometkim

This comment has been minimized.

Show comment
Hide comment
@cometkim

cometkim Nov 17, 2016

Contributor

Because I don't know exactly what are ranges of Chinese and Japanese.

I think it should be cutted-off one or more characters from begin/end of that ranges.

Or I can use the regex @hmhealey said. It looks working well for Korean characters.

@kaakaa Could you test the regex with Chinese and Japanese?

Contributor

cometkim commented Nov 17, 2016

Because I don't know exactly what are ranges of Chinese and Japanese.

I think it should be cutted-off one or more characters from begin/end of that ranges.

Or I can use the regex @hmhealey said. It looks working well for Korean characters.

@kaakaa Could you test the regex with Chinese and Japanese?

@kaakaa

This comment has been minimized.

Show comment
Hide comment
@kaakaa

kaakaa Nov 17, 2016

Contributor

Thanks @cometkim.

I can test Japanese only, sorry... :(
The regex with Japanese works fine for me.

But, I prefer using the range of Japanese-style punctuation except for full-width space (\u3000). (i.e. 3001 - 303f: Japanese-style punctuation)
Because when using the regex I cannot use almost all of the full-width symbol for hashtag.

Contributor

kaakaa commented Nov 17, 2016

Thanks @cometkim.

I can test Japanese only, sorry... :(
The regex with Japanese works fine for me.

But, I prefer using the range of Japanese-style punctuation except for full-width space (\u3000). (i.e. 3001 - 303f: Japanese-style punctuation)
Because when using the regex I cannot use almost all of the full-width symbol for hashtag.

@hmhealey

This comment has been minimized.

Show comment
Hide comment
@hmhealey

hmhealey Nov 17, 2016

Member

@cometkim Have you tried using this regex that I added to the ticket #\p{L}[\p{L}\d\-_.]+[\p{L}\d]? You'll need to use XRegExp instead of the built-in RegExp, but it will only all characters that the Unicode standard considers letters and shouldn't match any punctuation.

Member

hmhealey commented Nov 17, 2016

@cometkim Have you tried using this regex that I added to the ticket #\p{L}[\p{L}\d\-_.]+[\p{L}\d]? You'll need to use XRegExp instead of the built-in RegExp, but it will only all characters that the Unicode standard considers letters and shouldn't match any punctuation.

@cometkim

This comment has been minimized.

Show comment
Hide comment
@cometkim

cometkim Nov 17, 2016

Contributor

@hmhealey I've tested that regex also, but it would restrict minimum characters of hashtags on regex side.

Did you make any policies about hashtags pattern and length at last meeting?
sorry for not participating. it was 3:00 AM to me.

Contributor

cometkim commented Nov 17, 2016

@hmhealey I've tested that regex also, but it would restrict minimum characters of hashtags on regex side.

Did you make any policies about hashtags pattern and length at last meeting?
sorry for not participating. it was 3:00 AM to me.

@hmhealey

This comment has been minimized.

Show comment
Hide comment
@hmhealey

hmhealey Nov 17, 2016

Member

No worries. I thought the meeting was in the middle of the night for you, but I wanted to offer it in case you worked unusual hours.

We decided to keep it as is for now since we use hashtags including dots (like #v1.4) and dashes (like #bug-ticket) ourselves. I'm going to make a post on our forums about possibly changing it in the future to gather feedback, but that won't be happening until we revisit our search system since many of the restrictions we have (such as the minimum length) are because of requirements for searching columns in MySQL and Postgres.

Regarding the minimum length, I'm not too familiar with CJK, but would 2 character hashtags be common for them? The regex I posted would support 2 character hashtags if we change it to #\p{L}[\p{L}\d\-_.]*[\p{L}\d], but I think we need to leave it at 3 characters minimum for now, even for CJK hashtags. The problem with 2 character hashtags is that they only work on Postgres and not MySQL unless the user changes some additional settings that we don't have access to.

Instead of adding the special case for CJK hashtags on Postgres, we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json that defaults to 3. That way, users could lower it to 2 if their database is set up to support it. If you're interested in adding something like that, you could do it as part of this PR. If not, you can just leave the minimum length as 3, and I'll file a separate ticket to add it.

Member

hmhealey commented Nov 17, 2016

No worries. I thought the meeting was in the middle of the night for you, but I wanted to offer it in case you worked unusual hours.

We decided to keep it as is for now since we use hashtags including dots (like #v1.4) and dashes (like #bug-ticket) ourselves. I'm going to make a post on our forums about possibly changing it in the future to gather feedback, but that won't be happening until we revisit our search system since many of the restrictions we have (such as the minimum length) are because of requirements for searching columns in MySQL and Postgres.

Regarding the minimum length, I'm not too familiar with CJK, but would 2 character hashtags be common for them? The regex I posted would support 2 character hashtags if we change it to #\p{L}[\p{L}\d\-_.]*[\p{L}\d], but I think we need to leave it at 3 characters minimum for now, even for CJK hashtags. The problem with 2 character hashtags is that they only work on Postgres and not MySQL unless the user changes some additional settings that we don't have access to.

Instead of adding the special case for CJK hashtags on Postgres, we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json that defaults to 3. That way, users could lower it to 2 if their database is set up to support it. If you're interested in adding something like that, you could do it as part of this PR. If not, you can just leave the minimum length as 3, and I'll file a separate ticket to add it.

@cometkim

This comment has been minimized.

Show comment
Hide comment
@cometkim

cometkim Nov 20, 2016

Contributor

I've fixed the regex to #\p{L}[\p{L}\d\-_.]*[\p{L}\d] you suggested.

we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json

I think it should be.

Because :

  1. end-user may have used only one letter to make hashtag. (twitter and facebook support it.)
  2. There are so many words with two letters in CJK. In Chinese, 2~3 letters are enough to make a word. (Japanese and Korean use Chinese words in common.)

Removed back the MIN_CJK_HASHTAG_LINK_LENGTH from code. Please make a new ticket for it.

Contributor

cometkim commented Nov 20, 2016

I've fixed the regex to #\p{L}[\p{L}\d\-_.]*[\p{L}\d] you suggested.

we could consider adding a MinimumHashtagLength to the ServiceSettings section of config.json

I think it should be.

Because :

  1. end-user may have used only one letter to make hashtag. (twitter and facebook support it.)
  2. There are so many words with two letters in CJK. In Chinese, 2~3 letters are enough to make a word. (Japanese and Korean use Chinese words in common.)

Removed back the MIN_CJK_HASHTAG_LINK_LENGTH from code. Please make a new ticket for it.

Show outdated Hide outdated model/utils.go Outdated
@mattermod

This comment has been minimized.

Show comment
Hide comment
@mattermod

mattermod Nov 23, 2016

Setup Test Server label detected. Spinmint test server created if build succeeds (checks pass and no conflicts with base branch).

mattermod commented Nov 23, 2016

Setup Test Server label detected. Spinmint test server created if build succeeds (checks pass and no conflicts with base branch).

@mattermod

This comment has been minimized.

Show comment
Hide comment
@mattermod

mattermod Nov 23, 2016

Spinmint test server created at: http://i-3f7712ab.spinmint.com:8065/pr4555

Test Account 1: Email: test@test.com | Password: passwd

Test Account 2: Email: test2@test.com | Password: passwd

Instance ID: i-3f7712ab

mattermod commented Nov 23, 2016

Spinmint test server created at: http://i-3f7712ab.spinmint.com:8065/pr4555

Test Account 1: Email: test@test.com | Password: passwd

Test Account 2: Email: test2@test.com | Password: passwd

Instance ID: i-3f7712ab

@lfbrock

This comment has been minimized.

Show comment
Hide comment
@lfbrock

lfbrock Nov 23, 2016

Member

Just following up on this, here's the ticket to add the minimum hashtag length setting @cometkim, if you're interested in working on that: https://mattermost.atlassian.net/browse/PLT-4793

I've asked @kaakaa to test if the full width hashtag issue is still happening.

Member

lfbrock commented Nov 23, 2016

Just following up on this, here's the ticket to add the minimum hashtag length setting @cometkim, if you're interested in working on that: https://mattermost.atlassian.net/browse/PLT-4793

I've asked @kaakaa to test if the full width hashtag issue is still happening.

@kaakaa

This comment has been minimized.

Show comment
Hide comment
@kaakaa

kaakaa Nov 24, 2016

Contributor

Now I tested.
It works fine for Japanese hashtag.

Thanks @cometkim!

Contributor

kaakaa commented Nov 24, 2016

Now I tested.
It works fine for Japanese hashtag.

Thanks @cometkim!

@lfbrock lfbrock removed the 1: PM Review label Nov 24, 2016

@lfbrock

This comment has been minimized.

Show comment
Hide comment
@lfbrock

lfbrock Nov 24, 2016

Member

+1

Member

lfbrock commented Nov 24, 2016

+1

@mattermod

This comment has been minimized.

Show comment
Hide comment
@mattermod

mattermod Nov 24, 2016

Spinmint test server destroyed

mattermod commented Nov 24, 2016

Spinmint test server destroyed

@jwilander jwilander merged commit 2abcc25 into mattermost:master Nov 24, 2016

1 check passed

build.mattermost.com Successful Build
Details
@jwilander

This comment has been minimized.

Show comment
Hide comment
@jwilander

jwilander Nov 24, 2016

Member

Awesome work @cometkim !

Member

jwilander commented Nov 24, 2016

Awesome work @cometkim !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment