Have you ever tried to tokenize a sentence containing combined words like `tokenizerFail`? That one is easy, because the word uses camel case. But what about `tokenizerfail`? I'm sure you can see the trouble you'd run into tokenizing such words!
Unfortunately, with the advent of social media, these kinds of 'compounded' words have become much more common (especially in hashtags).
This package uses the concept of known consonant blends to attempt to discover word boundaries and hence tokenize/humanize such words. It is not perfect (I'm looking into other methods to improve it), but it gets you much closer to correct tokenization.
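To give a rough idea of the approach, here is a simplified sketch (not the package's actual implementation, and with a deliberately tiny blend list): scan a lowercase word, and wherever two adjacent consonants do not form a known blend, guess a word boundary between them.

```js
// Simplified sketch of consonant-blend splitting -- NOT the real wordize code.
// KNOWN_BLENDS is a tiny illustrative subset; the package keeps full
// per-language lists under ./lang.
const KNOWN_BLENDS = new Set([
  'bl', 'br', 'ch', 'ck', 'cl', 'cr', 'dr', 'fl', 'fr', 'gl', 'gr',
  'ng', 'nk', 'nt', 'ph', 'pl', 'pr', 'sh', 'sk', 'sl', 'sm', 'sn',
  'sp', 'st', 'th', 'tr', 'wh'
]);

const isVowel = (c) => 'aeiou'.includes(c);

// Split a lowercase compound wherever two adjacent consonants are not a known blend.
function splitCompound(word) {
  const parts = [];
  let start = 0;
  for (let i = 1; i < word.length; i++) {
    const pair = word[i - 1] + word[i];
    if (!isVowel(word[i - 1]) && !isVowel(word[i]) && !KNOWN_BLENDS.has(pair)) {
      parts.push(word.slice(start, i)); // boundary guessed between the two consonants
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

console.log(splitCompound('rainmaker'));    // [ 'rain', 'maker' ]
console.log(splitCompound('freakingpope')); // [ 'freaking', 'pope' ]
```

Real compounds need more than this pairwise check (three-letter blends like 'str', vowel-only boundaries, and so on), which is one reason the results are approximate rather than perfect.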
Don't speak English? Go to the `./lang` folder and create the consonant blends for your language (check out `./lang/en.json` for reference).
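The exact schema of a language file is whatever `./lang/en.json` uses; as a purely hypothetical illustration, a new language file might simply list the consonant blends that are valid inside a word in that language, and you would then pass the matching language code as the second argument:

```js
// Hypothetical illustration only -- mirror the real schema in ./lang/en.json.
// Assume ./lang/es.json lists consonant blends valid inside Spanish words:
//   ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr",
//    "ll", "pl", "pr", "rr", "tr"]
const wordize = require('wordize');

// The second argument selects the language's blend list (defaults to 'en').
wordize.words('estoesunaprueba', 'es');
```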
```js
const wordize = require('wordize');

const str = 'there is this bigmanInYellowSUIT who thinks he is the freakingpope & our rainmaker';

// humanize
wordize.humanize(str, 'en'); // There is this big man in yellow suit who thinks he is the freaking pope & our rain maker

// get the words from the sentence
// Note: the second parameter is the appropriate language code. Defaults to 'en'.
wordize.words(str); // [ 'There', 'is', 'this', 'big', 'man', 'in', 'yellow', 'suit', 'who', 'thinks', 'he', 'is', 'the', 'freaking', 'pope', 'our', 'rain', 'maker' ]
```
Got ideas on how we can enhance this module? Please share!