Load .pickle from AWS S3 URL and use it with word_tokenize() #2947
Hello! I see that you've uploaded the German Punkt model, which works for sentences but not for splitting words, as you've noticed.

```python
import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"
tokenizer = nltk.data.load(resource_url='https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')
sentences = tokenizer.tokenize(text)
tokens = [word_tokenize(sentence, preserve_line=True) for sentence in sentences]
```

Note the implementation of `word_tokenize` in nltk/nltk/tokenize/__init__.py, lines 114 to 132 at d84a582. As you can see there, `word_tokenize` first runs the (English-default) sentence tokenizer unless `preserve_line=True` is passed, which is why the snippet above splits into sentences first and then tokenizes each sentence. I hope that helps!
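To make the two-step flow above concrete without needing the S3 model, here is a minimal, nltk-free sketch; the two regex helpers are simplified stand-ins for the Punkt sentence tokenizer and for `word_tokenize(..., preserve_line=True)`, not the real implementations:

```python
import re

def split_sentences(text):
    # Naive stand-in for the Punkt sentence tokenizer:
    # split after sentence-final punctuation followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

def split_words(sentence):
    # Naive stand-in for word_tokenize(..., preserve_line=True):
    # words and punctuation become separate tokens.
    return re.findall(r'\w+|[^\w\s]', sentence, re.UNICODE)

text = "Ich bin ein Test. Tokenisierung ist toll"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
```

The split into two stages mirrors the real API: sentence segmentation is language- and model-dependent (the Punkt pickle), while the per-sentence word split can then run without re-segmenting.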
Hi @tomaarsen, thanks for your swift assistance! I will implement it that way, but I also noticed one difference. It would be much easier if I could implement it directly like:

Although the path is appended to the locations where NLTK tries to find the files, the code returns:

The URL to the file is:
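For context on the path-appending step mentioned above: NLTK resolves a resource name such as `tokenizers/punkt/german.pickle` by checking each directory on `nltk.data.path` in order, and raises a LookupError if no directory contains it. A plain-Python sketch of that lookup (no nltk required; the helper name is made up for illustration):

```python
import os
import tempfile

def find_resource(resource, search_path):
    # Check each search directory in order, in the spirit of
    # nltk.data.find(); raise LookupError if nothing matches.
    for base in search_path:
        candidate = os.path.join(base, resource)
        if os.path.exists(candidate):
            return candidate
    raise LookupError(f"{resource!r} not found in {search_path!r}")

# Demo: build a fake nltk_data layout in a temp dir and resolve against it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "tokenizers", "punkt"))
with open(os.path.join(root, "tokenizers", "punkt", "german.pickle"), "wb"):
    pass

found = find_resource(os.path.join("tokenizers", "punkt", "german.pickle"), [root])
```

This is why appending a directory to `nltk.data.path` only helps if the files under it follow the expected `tokenizers/punkt/...` layout.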
I uploaded the german.pickle file to AWS S3 and would like to use it with `word_tokenize()`.

First I loaded the .pickle file and used it with `tokenize()`. As a result I get the text tokenized into sentences. But when I use the following code I receive an AttributeError:

How can I use `nltk.tokenize.word_tokenize()` with the downloaded file from S3? When I try the common way, `word_tokenize(text, language='german')`, a LookupError is raised since the data is not available in my local environment. Regards!
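One way to avoid the LookupError (a sketch, not an official NLTK API): fetch the pickle from S3 once into a local `nltk_data`-style directory, then append that directory to `nltk.data.path` so `word_tokenize(text, language='german')` can find it. The `cache_resource` helper below is hypothetical, and the demo uses a `file://` URL standing in for the S3 object so it runs without network access:

```python
import os
import pathlib
import tempfile
import urllib.request

def cache_resource(url, data_dir, subpath):
    # Download the resource once into data_dir/subpath; skip if cached.
    dest = os.path.join(data_dir, subpath)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Demo with a local file:// URL in place of the real S3 URL.
src = pathlib.Path(tempfile.mkdtemp()) / "german.pickle"
src.write_bytes(b"fake model bytes")  # placeholder, not a real Punkt model

data_dir = tempfile.mkdtemp()
dest = cache_resource(src.as_uri(),
                      data_dir,
                      os.path.join("tokenizers", "punkt", "german.pickle"))
# With nltk installed, one would then do: nltk.data.path.append(data_dir)
```

The `tokenizers/punkt/german.pickle` layout under `data_dir` matters: NLTK looks the model up by that relative path inside each directory on `nltk.data.path`.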