ValueError related to nlp.max_length: wordwise 0.0.4 #4

Closed

gsalfourn opened this issue Aug 24, 2021 · 4 comments
Comments

gsalfourn commented Aug 24, 2021

I was trying out the library and ran into the following error:

ValueError: [E088] Text of length 4290144 exceeds maximum of 1000000. The parser and NER
models require roughly 1GB of temporary memory per 100,000 characters in the input. This
means long texts may cause memory allocation errors. If you're not using the parser or NER,
it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters,
so you can check whether your inputs are too long by checking `len(text)`.

I'm just wondering where in the code to set nlp.max_length.

jaketae (Owner) commented Aug 24, 2021

Hey @gsalfourn, thanks for reporting this issue.

You can access the spaCy model via extractor.nlp. So in this case, you could fix the error via

from wordwise import Extractor

extractor = Extractor()
extractor.nlp.max_length = 5000000  # or some big number, as long as you have enough RAM

Let me know if this fixes the error for you. I'll try to think of a way to remedy this in the next patch release.

gsalfourn commented Aug 25, 2021

Thanks so much, that worked.

Sorry to be a bother, but I have another question: how would one go about truncating and specifying max_length if the text file being used produces more tokens than the transformer tokenizer will accept? I'm asking because I ran into an indexing error related to the length of my token sequence versus the length accepted by the tokenizer.

Token indices sequence length is longer than the specified maximum sequence length for this model (41163 > 512). Running
this sequence through the model will result in indexing errors
...
...
IndexError: index out of range in self

In core.py, line 60 shows

tokens = self.tokenizer(source, padding=True, return_tensors="pt")

That part may raise an error related to token length (an indexing error) if a user's input produces more tokens than the loaded tokenizer's model can accept.

I don't know if it's possible to change line 60 to force truncation and max_length, like so:

tokens = self.tokenizer(source, padding=True, truncation=True, max_length=512, return_tensors="pt")

or to change it to something that adapts to variable-length inputs based on the tokenizer in use, so that lines 56 to 64 become something like:

    @torch.no_grad()
    def get_embedding(self, source):
        if isinstance(source, str):
            source = [source]
        # model_max_length is the longest sequence the tokenizer/model pair accepts;
        # len(self.tokenizer) would return the vocabulary size instead
        max_length = self.tokenizer.model_max_length
        tokens = self.tokenizer(source, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
        outputs = self.model(**tokens, return_dict=True)
        embedding = self.parse_outputs(outputs)
        embedding = embedding.detach().numpy()
        return embedding

with a max_length variable derived from the tokenizer in use (its model_max_length).
This is just my observation from a newbie; I hope it helps others who may be in my situation.
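
For reference, here is a quick way to see where the 512 limit comes from, assuming a standard Hugging Face tokenizer such as bert-base-uncased (used here only as an example; the model wordwise actually loads may differ):

from transformers import AutoTokenizer

# model_max_length reflects the model's positional-embedding limit (512 for BERT-base)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512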

jaketae (Owner) commented Aug 25, 2021

Hey @gsalfourn, thanks for the detailed discussion. Glad that the spaCy issue got sorted out. Here is my take on the discussion on the tokenizer:

  1. Setting truncation to True will most likely solve the issue, but it may not yield good keyword extraction results, since the rest of the text will essentially be dropped. It's a very viable solution nonetheless, since the assumption that the head of the text contains important information isn't totally fallacious, at least in many cases.
  2. I believe the max_length issue is coming from the model, not the tokenizer. Every model has a limit on the number of tokens it can process due to positional embeddings. If you give it too many tokens, the positional embedding layer will raise the IndexError we just saw. Setting truncation=True will automatically take care of this.
  3. A solution that seems reasonable to me is to split the text into multiple chunks and apply the keyword extraction pipeline to each chunk. We would obtain a list of noun keywords, then compute cosine similarity with each chunk to obtain the top K most salient keywords. (A rough sketch of this idea follows below.)
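
Below is a minimal sketch of the chunking idea in item 3, not the library's actual implementation. It assumes the extractor exposes generate(text, top_k=...) as in the wordwise README and the tokenizer attribute from core.py; chunk_by_tokens is a hypothetical helper, and it pools per-chunk keywords by frequency as a stand-in for the cosine-similarity ranking described above:

from collections import Counter

from wordwise import Extractor


def chunk_by_tokens(text, tokenizer, max_tokens=512):
    """Split text into pieces that each stay within max_tokens tokens."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        # Re-tokenizing the running chunk every step is O(n^2); fine for a sketch.
        if len(tokenizer.tokenize(" ".join(current))) >= max_tokens and len(current) > 1:
            current.pop()
            chunks.append(" ".join(current))
            current = [word]
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = open("long_document.txt").read()  # any text that blows past the model limit
extractor = Extractor()

# Extract keywords per chunk, then pool by how often each keyword appears.
counts = Counter()
for chunk in chunk_by_tokens(long_text, extractor.tokenizer, max_tokens=512):
    counts.update(extractor.generate(chunk, top_k=5))

print([keyword for keyword, _ in counts.most_common(5)])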

The last solution seems like it would take time to implement, so I'll issue a fix with truncation=True for the time being. If you have a better idea, please let me know in this thread! Thanks for your input.

jaketae (Owner) commented Oct 22, 2021

Closed via #7 for now.

jaketae closed this as completed Oct 22, 2021