
Cannot Import PunktWordTokenizer in nltk 3.3 #2122

Closed
ghost opened this issue Sep 13, 2018 · 5 comments

ghost commented Sep 13, 2018

How do I use PunktWordTokenizer in nltk 3.3? Has it been deprecated or renamed?
>>> nltk.__version__
'3.3'
>>> from nltk.tokenize import PunktWordTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PunktWordTokenizer'
Any help/suggestion is highly appreciated.

alvations (Contributor) commented Sep 14, 2018

Punkt is a sentence tokenization algorithm, not a word tokenizer. For word tokenization, you can use the functions in nltk.tokenize. Most commonly, people use the NLTK version of the Treebank word tokenizer:

>>> from nltk import word_tokenize
>>> word_tokenize("This is a sentence, where foo bar is present.")
['This', 'is', 'a', 'sentence', ',', 'where', 'foo', 'bar', 'is', 'present', '.']
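For the sentence splitting that Punkt itself performs, here is a minimal sketch using sent_tokenize, which wraps the pretrained Punkt model (this assumes the model was fetched once with nltk.download('punkt'); the exact splits depend on that model):

>>> import nltk
>>> nltk.download('punkt')  # fetch the pretrained Punkt sentence model, once
>>> from nltk import sent_tokenize
>>> sent_tokenize("Dr. Smith went home. He was tired.")
['Dr. Smith went home.', 'He was tired.']

Note that the pretrained model treats "Dr." as an abbreviation, so it does not split there.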

Also, please do take a look at http://www.nltk.org/book/ch03.html

ghost (Author) commented Sep 14, 2018

Yeah, I usually use word_tokenize(). I might be wrong, but wasn't PunktWordTokenizer present in previous versions? Like the nltk versions from around 2014-15?

alvations (Contributor) commented

Yes, PunktWordTokenizer was exposed previously, but it wasn't a real word tokenizer; it was more of a pre-processing step that ran before Punkt decided where to split sentences. It's no longer exposed to users, to avoid confusion.

If you're interested in improving Punkt, do take a look at #2008
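For reference, Punkt learns its parameters unsupervised from raw text, so a minimal training sketch could look like the following (raw_text is an illustrative name for a plain string holding your corpus):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.train(raw_text)  # learns abbreviations, sentence starters, etc.
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())
>>> sents = tokenizer.tokenize("Mr. Brown arrived. He sat down.")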

iamRVel commented Jan 25, 2019

Nope, it's available in version 3.3, but use the following import: from nltk.tokenize.punkt import PunktSentenceTokenizer
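To illustrate, a minimal sketch of that class in use; instantiated without training text, it falls back on Punkt's default parameters:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()  # untrained; default parameters
>>> tokenizer.tokenize("This is one sentence. Here is another.")
['This is one sentence.', 'Here is another.']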

alvations (Contributor) commented

Yes, it does seem like the PunktSentenceTokenizer has been re-exposed: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py#L1236

Closing this issue as resolved then =)

Please do reopen the issue if it's still relevant/unresolved.
