Wicked fast word segmenter with a focus on splitting #hashtags.
from wordseg import segment
segment('mannequinchallenge')
# => (['mannequin', 'challenge'], 5.996932418552515e-11)
Because the "training" data was harvested from social media websites, this word segmenter is especially good as a hashtag splitter. It's also about 10x faster than wordsegment.
The speed derives from an implementation of the Viterbi algorithm I found
posted on Stack Overflow. The built-in dictionary was built from about 6 GB of
English-only social media posts. Tools for building your own dictionary are
included in the bin folder.
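For readers curious how Viterbi-style segmentation works, here is a minimal sketch. It is not the code shipped in this package: the function name viterbi_segment, the TOY_COUNTS dictionary, the max_word_len cap, and the unknown-word penalty are all assumptions made for illustration. It scores each candidate split by the product of unigram probabilities and keeps the best one via dynamic programming.

```python
import math

# Hypothetical toy unigram counts; a real dictionary would be far larger.
TOY_COUNTS = {
    "mannequin": 20, "challenge": 60, "man": 300,
    "in": 900, "all": 500, "change": 100,
}


def viterbi_segment(text, counts, max_word_len=20):
    """Return (words, probability) for the most likely split of text."""
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed penalty for unknown words: drops tenfold per character.
        return math.log(1.0 / (total * 10 ** len(word)))

    # best[i] = (log probability of the best split of text[:i],
    #            start index of the last word in that split)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, math.exp(best[len(text)][0])


words, prob = viterbi_segment("mannequinchallenge", TOY_COUNTS)
print(words, prob)
```

Working in log space avoids underflow on long inputs, and capping the candidate word length keeps the inner loop short, so the running time grows roughly linearly with the length of the input.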
- Improve data set by including posts from a broader range of time and with more unique unigrams.
- Include common bigrams or even trigrams to help segmentation be context-aware.
- Beef up the very minimal Viterbi implementation.