Load .pickle from AWS S3 URL and use it with word_tokenize() #2947

Open
davidbbaumann opened this issue Feb 15, 2022 · 2 comments

@davidbbaumann

I uploaded the german.pickle file to AWS S3 and would like to use it with word_tokenize().

First, I loaded the .pickle file and used it with tokenize():

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.tokenize(text))
>> ['Ich bin ein Test.', 'Tokenisierung ist toll']

As a result, I get the text tokenized into sentences.

But when I use the following code, I receive an AttributeError:

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.word_tokenize(text))

>> AttributeError: 'PunktSentenceTokenizer' object has no attribute 'word_tokenize'

How can I use nltk.tokenize.word_tokenize() with the file downloaded from S3? When I try the usual way, word_tokenize(text, language='german'), a LookupError is raised because the data is not available in my local environment.

Regards!

@tomaarsen
Member

Hello!

I see that you've uploaded the German Punkt model, which works for sentences but not for splitting words, as you've noticed.
The cause is that Punkt (PunktSentenceTokenizer) is a sentence tokenizer, and a different model is used for word tokenization: NLTKWordTokenizer.
I'm not 100% certain, and I don't have the time to verify this now, but I think NLTKWordTokenizer does not require any additional data that would cause a LookupError for your local environment. This means that you might be able to use the following:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')
sentences = tokenizer.tokenize(text)

sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences]

Note the preserve_line=True here. The word_tokenize function is as follows:

def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]

As you can see, sent_tokenize gets used whenever preserve_line is False (the default); we want to avoid that and instead use the sentence tokenizer that we already loaded from AWS.
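
Alternatively, if you'd rather not go through word_tokenize at all, you could call NLTKWordTokenizer directly; as far as I can tell it is the same word tokenizer that word_tokenize uses under the hood, so something along these lines (untested) should be equivalent:

import nltk
from nltk.tokenize import NLTKWordTokenizer

text = "Ich bin ein Test. Tokenisierung ist toll"

# Punkt sentence model loaded from S3, as before
tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

# NLTKWordTokenizer ships with NLTK itself, so it should not need any downloaded data
word_tokenizer = NLTKWordTokenizer()

sentences = [word_tokenizer.tokenize(sentence) for sentence in tokenizer.tokenize(text)]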

I hope that helps!

@davidbbaumann
Author

Hi @tomaarsen,

Thanks for your swift assistance! I will implement it that way, but I also noticed one difference: word_tokenize() returns a single list with all the tokens, whereas sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences] returns a list of lists, where each sub-list holds the tokens of one sentence.
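
For my use case that is easy to work around by flattening the nested list, e.g.:

# tokenizer and text as in the snippets above
sentences = tokenizer.tokenize(text)
tokens = [token for sentence in sentences for token in word_tokenize(sentence, preserve_line=True)]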

Still, it would be much easier if I could use it directly, like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.data.path.append('https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/')

text = "Ich bin ein Test. Tokenisierung ist toll"

print(word_tokenize(text, language='german'))

Although this path is appended to the locations where NLTK tries to find its data files, the code returns:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/german.pickle

The URL to the file is: https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle
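
In the meantime, I suspect that entries in nltk.data.path are treated as local directories (or zip files) rather than URLs, so as a workaround I could download the pickle into a local folder that mirrors the expected tokenizers/punkt layout and append that folder instead. Rough, untested sketch:

import os
import tempfile
import urllib.request

import nltk
from nltk.tokenize import word_tokenize

# Mirror the layout NLTK expects (tokenizers/punkt/german.pickle) in a local directory
local_data = tempfile.mkdtemp()
punkt_dir = os.path.join(local_data, 'tokenizers', 'punkt')
os.makedirs(punkt_dir)

urllib.request.urlretrieve(
    'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle',
    os.path.join(punkt_dir, 'german.pickle'))

# Point NLTK at the local copy instead of the URL
nltk.data.path.append(local_data)

text = "Ich bin ein Test. Tokenisierung ist toll"
print(word_tokenize(text, language='german'))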
