Handling smartquotes #3

Closed
geovedi opened this issue Apr 5, 2014 · 3 comments

Comments

geovedi commented Apr 5, 2014

Davis,

Any plan to support smartquotes, since modern text editors these days tend to use them instead of regular quotes? This may have an impact on the result. For example, this is the original text:

Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift] .

And this is after preprocessing (smartquotes replaced with regular quotes). Silicon Valley is now detected:

Mozilla CEO Exit Exposes Silicon Valley's Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes [LOCATION Silicon Valley] 's [MISC Equality-Freedom Rift]
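The replacement step described above could look roughly like the following Python sketch. It is not part of ner_stream; the names `SMART_QUOTES` and `replace_smart_quotes` are hypothetical, and the mapping only covers the four most common curly quote characters.

```python
# Hypothetical preprocessing helper, not part of MITIE or ner_stream.
# Maps common curly ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_QUOTE_TABLE = str.maketrans(SMART_QUOTES)


def replace_smart_quotes(text: str) -> str:
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(_QUOTE_TABLE)


if __name__ == "__main__":
    original = "Mozilla CEO Exit Exposes Silicon Valley\u2019s Equality-Freedom Rift."
    print(replace_smart_quotes(original))
```

Running the text through a step like this before tagging is what produced the second result above.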

Cheers,
Jim

@davisking
Contributor

The preprocessing and tokenization in ner_stream is pretty basic. However, we are setting up Python, R, and Java APIs right now that will make it easy for you to use whatever kind of preprocessing and tokenization you want (e.g. NLTK's tokenizer), since in general different applications require different sorts of preprocessing and tokenization. But yeah, it's annoying that the default tokenizer doesn't handle a smartquote. So I just updated it, and if you pull and recompile it should work properly now.

Cheers,
Davis

Author

geovedi commented Apr 5, 2014

Thanks for the update. Will definitely wait for the APIs.

Contributor

swadey commented Apr 5, 2014

@geovedi the simple ner_stream tokenizer doesn't handle the more general problem of Unicode normalization. We should probably do this. If you can file an issue for posterity and discussion, we can keep track of it.

As @davisking mentioned, the solution may be to put it in the bindings rather than in ner_stream. The issue there is that it affects training.
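As a rough illustration of what Unicode normalization in a binding-level preprocessing step might involve, here is a minimal Python sketch using the standard library's `unicodedata` module. The function name `normalize_text` and the quote table are hypothetical and not part of any MITIE API. Note that NFKC alone leaves curly quotes unchanged, so the sketch maps them explicitly. Because normalization changes what the tagger sees, the same function would presumably need to be applied to training text as well as to input text.

```python
import unicodedata

# Hypothetical helper, not part of MITIE. NFKC folds compatibility
# characters (ligatures, full-width forms, no-break spaces) but does not
# touch curly quotes, so those are mapped explicitly afterwards.
_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then replace curly quotes with ASCII ones."""
    return unicodedata.normalize("NFKC", text).translate(_QUOTES)


if __name__ == "__main__":
    # "\ufb01" is the "fi" ligature; "\u00a0" is a no-break space.
    print(normalize_text("\ufb01le\u00a0name: \u201cSilicon Valley\u2019s\u201d"))
```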
