Handling smartquotes #3
Comments
The preprocessing and tokenization in ner_stream is pretty basic. However, we are setting up Python, R, and Java APIs right now that will make it easy for you to use whatever kind of preprocessing and tokenization you want (e.g. NLTK's tokenizer), since in general different applications require different sorts of preprocessing and tokenization. But yeah, it's annoying that the default tokenizer doesn't handle a smartquote. So I just updated it, and if you pull and recompile it should work properly now. Cheers,
Thanks for the update. Will definitely wait for the APIs.
@geovedi the simple As @davisking mentioned, the solution may not be to put it in
Davis,
Any plans to support smartquotes? Modern text editors these days tend to use them instead of regular quotes, and this may have an impact on the result. For example, this is the original text.
and this is after preprocessing (smartquotes replaced).
Silicon Valley is now detected.

Cheers,
Jim
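
For reference, the "smartquotes replaced" preprocessing step described above could look roughly like the sketch below. This is only an illustration, not the fix that went into the tokenizer: the function name `normalize_quotes` and the sample sentence are made up for the example, and the mapping covers just the four most common curly quote characters.

```python
# Map common typographic ("smart") quotes to their ASCII equivalents.
# Assumption: only these four code points matter for the input text.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

# Build the translation table once so repeated calls stay cheap.
_QUOTE_TABLE = str.maketrans(SMART_QUOTES)


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(_QUOTE_TABLE)


if __name__ == "__main__":
    # Hypothetical sample sentence, not the text from the original report.
    sample = "He moved to \u201cSilicon Valley\u201d and hasn\u2019t left."
    print(normalize_quotes(sample))
```

Running the text through something like this before handing it to the tokenizer keeps the quote characters from sticking to neighbouring words.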