Splitting hashtags #1

Closed
Patrisimo opened this issue Jun 14, 2016 · 0 comments
@Patrisimo
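Some hashtags are tokenized incorrectly when they occur at the end of a tweet. The main example is "#FelizDiaDeLaM…", where the '…' is the Unicode ellipsis (U+2026); it gets tokenized as '#', 'FelizDiaDeLaM', '…' instead of staying together as one hashtag token. My guess is that this error will occur on every hashtag that looks like "#[[:alnum:]]+[^[:alnum:]\s]+", i.e. a '#', a run of alphanumerics, then one or more characters that are neither alphanumeric nor whitespace.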
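
For anyone who wants to check other hashtags against that guess, here is a rough sketch in Python. It is not part of the tokenizer; `SUSPECT_HASHTAG` and `is_suspect` are hypothetical names, and the pattern is only an approximate translation of the POSIX-style classes above, since Python's `re` does not support `[[:alnum:]]` directly.

```python
import re

# Approximate Python translation of the pattern from the report:
#   #[[:alnum:]]+[^[:alnum:]\s]+
# [^\W_]          ~ [[:alnum:]]      (word characters minus the underscore)
# (?:[^\w\s]|_)   ~ [^[:alnum:]\s]   (neither alphanumeric nor whitespace)
# This is an assumption about how the POSIX classes map onto Unicode text.
SUSPECT_HASHTAG = re.compile(r"#[^\W_]+(?:[^\w\s]|_)+")


def is_suspect(token: str) -> bool:
    """Return True if the whole token has the shape described in the report."""
    return SUSPECT_HASHTAG.fullmatch(token) is not None


# Illustrative samples only; "\u2026" is the Unicode ellipsis from the report.
samples = ["#FelizDiaDeLaM\u2026", "#FelizDiaDeLaMadre", "#hashtag!", "#tag ok"]
for sample in samples:
    print(repr(sample), is_suspect(sample))
```

If the guess is right, the tokens flagged here (the truncated hashtag and "#hashtag!") are the ones that would be split, while a plain hashtag with no trailing punctuation would not.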

maxthomas added the tift label Jun 15, 2016
maxthomas added a commit that referenced this issue Jun 15, 2016
non-capturing Hashtag regex.

Additionally, remove a lot of object creation and switch these to
arrays.

Closes #1.
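
As a general illustration of why capturing versus non-capturing groups can matter for a hashtag regex, here is a small Python sketch. It is not the change made in the referenced commit, and the patterns are made up for the example; it only shows that capturing groups can cause a match to be reported in pieces, whereas non-capturing groups keep it whole.

```python
import re

# The example tweet text from the report; "\u2026" is the Unicode ellipsis.
tweet = "#FelizDiaDeLaM\u2026"

# With capturing groups, findall returns the groups,
# so the hashtag comes back split into '#' and the word.
capturing = re.findall(r"(#)(\w+)", tweet)

# With non-capturing groups, findall returns the whole match as one string.
non_capturing = re.findall(r"(?:#)(?:\w+)", tweet)

print(capturing)
print(non_capturing)
```

In both cases the trailing ellipsis is left out of the hashtag, since it is not a word character.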
2 participants