Load .pickle from AWS S3 URL and use it with word_tokenize() #2947

Open
davidbbaumann opened this issue Feb 15, 2022 · 2 comments

@davidbbaumann

I uploaded the german.pickle file to AWS S3 and would like to use it with word_tokenize().

First, I loaded the .pickle file and used it with tokenize():

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.tokenize(text))
>> ['Ich bin ein Test.', 'Tokenisierung ist toll']

As a result, I get the text tokenized into sentences.

But when I use the following code, I receive an AttributeError:

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

print(tokenizer.word_tokenize(text))

>> AttributeError: 'PunktSentenceTokenizer' object has no attribute 'word_tokenize'

How can I use nltk.tokenize.word_tokenize() with the file downloaded from S3? When I try the usual way, word_tokenize(text, language='german'), a LookupError is raised because the data is not available in my local environment.

Regards!

@tomaarsen
Member

Hello!

I see that you've uploaded the German Punkt model, which works for sentences but not for splitting words, as you've noticed.
The cause is that Punkt (PunktSentenceTokenizer) is a sentence tokenizer, and a different model is used for word tokenization: NLTKWordTokenizer.
I'm not 100% certain, and I don't have the time to verify this now, but I think NLTKWordTokenizer does not require any additional data that would cause a LookupError for your local environment. This means that you might be able to use the following:

import nltk
from nltk.tokenize import word_tokenize

text = "Ich bin ein Test. Tokenisierung ist toll"

tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')
sentences = tokenizer.tokenize(text)

sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences]

Note the preserve_line=True here. The word_tokenize function is as follows:

def word_tokenize(text, language="english", preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: A flag to decide whether to sentence tokenize the text or not.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]

As you can see, sent_tokenize gets used whenever preserve_line is False (the default); we want to avoid that and instead use the sentence tokenizer that we already loaded from AWS.
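
Alternatively, if you'd rather not go through word_tokenize at all, you could call NLTKWordTokenizer directly; as far as I can tell it is the same word tokenizer that word_tokenize uses under the hood, so something along these lines (untested) should be equivalent:

import nltk
from nltk.tokenize import NLTKWordTokenizer

text = "Ich bin ein Test. Tokenisierung ist toll"

# Punkt sentence model loaded from S3, as before
tokenizer = nltk.data.load(resource_url = 'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle')

# NLTKWordTokenizer ships with NLTK itself, so it should not need any downloaded data
word_tokenizer = NLTKWordTokenizer()

sentences = [word_tokenizer.tokenize(sentence) for sentence in tokenizer.tokenize(text)]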

I hope that helps!

@davidbbaumann
Author

Hi @tomaarsen,

Thanks for your swift assistance! I will implement it that way, but I also noticed one difference: word_tokenize() returns a single list with all the tokens, whereas sentences = [word_tokenize(sentence, preserve_line=True) for sentence in sentences] returns a list of lists, where each sub-list holds the tokens of one sentence.
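
For my use case that is easy to work around by flattening the nested list, e.g.:

# tokenizer and text as in the snippets above
sentences = tokenizer.tokenize(text)
tokens = [token for sentence in sentences for token in word_tokenize(sentence, preserve_line=True)]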

Still, it would be much easier if I could use it directly, like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.data.path.append('https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/')

text = "Ich bin ein Test. Tokenisierung ist toll"

print(word_tokenize(text, language='german'))

Although this path is appended to the locations where NLTK tries to find its data files, the code returns:

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/german.pickle

The URL to the file is: https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle
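
In the meantime, I suspect that entries in nltk.data.path are treated as local directories (or zip files) rather than URLs, so as a workaround I could download the pickle into a local folder that mirrors the expected tokenizers/punkt layout and append that folder instead. Rough, untested sketch:

import os
import tempfile
import urllib.request

import nltk
from nltk.tokenize import word_tokenize

# Mirror the layout NLTK expects (tokenizers/punkt/german.pickle) in a local directory
local_data = tempfile.mkdtemp()
punkt_dir = os.path.join(local_data, 'tokenizers', 'punkt')
os.makedirs(punkt_dir)

urllib.request.urlretrieve(
    'https://XXXXX.s3.XXX.amazonaws.com/nltk/nltk-data/tokenizers/punkt/german.pickle',
    os.path.join(punkt_dir, 'german.pickle'))

# Point NLTK at the local copy instead of the URL
nltk.data.path.append(local_data)

text = "Ich bin ein Test. Tokenisierung ist toll"
print(word_tokenize(text, language='german'))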
