jchook/wordseg


Wordseg

Wicked fast word segmenter with a focus on splitting #hashtags.

Example

from wordseg import segment

segment('mannequinchallenge')
# => (['mannequin', 'challenge'], 5.996932418552515e-11)

More Info

Because the "training" data was harvested from social media websites, this word segmenter is especially good as a hashtag splitter. It's also about 10x faster than the wordsegment package.

The speed derives from an implementation of the Viterbi algorithm I found posted on Stack Overflow. The built-in dictionary was pulled from about 6 GB of social media posts (English only). Tools for building your own dictionary are included in the bin folder.
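
For readers unfamiliar with the approach, here is a minimal sketch of Viterbi-style segmentation over unigram probabilities. It is not the code shipped in this package: the `COUNTS` table, the unknown-word penalty, and the names `word_logprob` and `viterbi_segment` are all assumptions made for illustration. It returns a (words, score) pair to mirror the shape of the example output above; treating the score as a probability is also an assumption.

```python
import math

# Toy unigram counts (hypothetical values, not the bundled dictionary).
COUNTS = {"mannequin": 50, "challenge": 200, "man": 500,
          "a": 1000, "in": 900, "chall": 1}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def word_logprob(word):
    """Log-probability of a single candidate word."""
    count = COUNTS.get(word)
    if count is None:
        # Assumed smoothing: unknown strings get a penalty that grows
        # with their length, so long unknown chunks are discouraged.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def viterbi_segment(text):
    """Return (words, probability) for the best-scoring split of text."""
    n = len(text)
    # best[i] is the best log-probability of any split of text[:i].
    best = [0.0] + [-math.inf] * n
    # back[i] is the start index of the last word in that best split.
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, math.exp(best[n])


words, prob = viterbi_segment("mannequinchallenge")
print(words)
```

Each position is visited once and only the last `MAX_WORD_LEN` start points are tried, so the work grows linearly with the input length for a fixed dictionary.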

Roadmap

  • Improve data set by including posts from a broader range of time and with more unique unigrams.
  • Include common bigrams or even trigrams to help segmentation be context-aware.
  • Beef up the very minimal Viterbi implementation.
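
To illustrate why bigrams could help, here is a rough sketch that scores two candidate splits with and without bigram context. Nothing here exists in the package today: the counts are made up, and `unigram_logprob`, `bigram_logprob`, and `score` are hypothetical names. The backoff to unigrams is deliberately crude and not a normalized probability model.

```python
import math

# Hypothetical counts, chosen only to show the effect of context.
UNIGRAMS = {"we": 500, "are": 500, "now": 300, "here": 300, "nowhere": 400}
BIGRAMS = {("we", "are"): 400, ("are", "now"): 100,
           ("now", "here"): 250, ("are", "nowhere"): 5}
TOTAL = sum(UNIGRAMS.values())


def unigram_logprob(word):
    # Unknown words get an assumed pseudo-count of 0.5.
    return math.log(UNIGRAMS.get(word, 0.5) / TOTAL)


def bigram_logprob(prev, word):
    pair_count = BIGRAMS.get((prev, word))
    if pair_count is None or prev not in UNIGRAMS:
        # Crude backoff: ignore the context when the pair was never seen.
        return unigram_logprob(word)
    return math.log(pair_count / UNIGRAMS[prev])


def score(words, use_bigrams):
    """Sum of log-probabilities for one candidate segmentation."""
    total = 0.0
    prev = None
    for word in words:
        if use_bigrams and prev is not None:
            total += bigram_logprob(prev, word)
        else:
            total += unigram_logprob(word)
        prev = word
    return total


a = ["we", "are", "nowhere"]
b = ["we", "are", "now", "here"]
print(score(a, False) > score(b, False))  # unigrams alone prefer split a
print(score(b, True) > score(a, True))    # bigram context prefers split b
```

With these toy counts, the unigram model picks "nowhere" because it is a single frequent word, while the bigram model prefers "now here" because that pair follows "are" far more often.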
