ValueError related to nlp.max_length: wordwise 0.0.4 #4

Closed

gsalfourn opened this issue Aug 24, 2021 · 4 comments
Comments

gsalfourn commented Aug 24, 2021

I was trying out the library and ran into the following error:

ValueError: [E088] Text of length 4290144 exceeds maximum of 1000000. The parser and NER
models require roughly 1GB of temporary memory per 100,000 characters in the input. This
means long texts may cause memory allocation errors. If you're not using the parser or NER,
it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters,
so you can check whether your inputs are too long by checking `len(text)`.

I'm just wondering where in the code to set nlp.max_length.

jaketae (Owner) commented Aug 24, 2021

Hey @gsalfourn, thanks for reporting this issue.

You can access the spaCy model via extractor.nlp. So in this case, you could fix the error via

from wordwise import Extractor

extractor = Extractor()
extractor.nlp.max_length = 5000000  # or some big number, as long as you have enough RAM

Let me know if this fixes the error for you. I'll try to think of a way to remedy this in the next patch release.

gsalfourn commented Aug 25, 2021

Thanks so much, that worked.

Sorry to be a bother, but I have another question: how would one go about truncating and specifying max_length if the text file being used produces more tokens than the transformer tokenizer will accept? I'm asking because I ran into an indexing error related to the length of my token sequence versus the length accepted by the tokenizer.

Token indices sequence length is longer than the specified maximum sequence length for this model (41163 > 512). Running
this sequence through the model will result in indexing errors
...
...
IndexError: index out of range in self

In core.py, line 60 shows

tokens = self.tokenizer(source, padding=True, return_tensors="pt")

That part may raise an error related to token length (an indexing error) if a user's input produces more tokens than the loaded tokenizer's model can accept.

I don't know if it's possible to change line 60 to force truncation and max_length, like so:

tokens = self.tokenizer(source, padding=True, truncation=True, max_length=512, return_tensors="pt")

or to change it to something that adapts to variable-length inputs based on the tokenizer in use, so that lines 56 to 64 become something like:

    @torch.no_grad()
    def get_embedding(self, source):
        if isinstance(source, str):
            source = [source]
        # model_max_length is the longest sequence the tokenizer/model pair accepts;
        # len(self.tokenizer) would return the vocabulary size instead
        max_length = self.tokenizer.model_max_length
        tokens = self.tokenizer(source, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
        outputs = self.model(**tokens, return_dict=True)
        embedding = self.parse_outputs(outputs)
        embedding = embedding.detach().numpy()
        return embedding

with a max_length variable derived from the tokenizer in use (its model_max_length).
This is just my observation from a newbie; I hope it helps others who may be in my situation.
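
For reference, here is a quick way to see where the 512 limit comes from, assuming a standard Hugging Face tokenizer such as bert-base-uncased (used here only as an example; the model wordwise actually loads may differ):

from transformers import AutoTokenizer

# model_max_length reflects the model's positional-embedding limit (512 for BERT-base)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512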

jaketae (Owner) commented Aug 25, 2021

Hey @gsalfourn, thanks for the detailed discussion. Glad that the spaCy issue got sorted out. Here is my take on the discussion on the tokenizer:

  1. Setting truncation to True will most likely solve the issue, but it may not yield good keyword extraction results, since the rest of the text will essentially be dropped. It's a very viable solution nonetheless, since the assumption that the head of the text contains important information isn't totally fallacious, at least in many cases.
  2. I believe the max_length issue is coming from the model, not the tokenizer. Every model has a limit on the number of tokens it can process due to positional embeddings. If you give it too many tokens, the positional embedding layer will raise the IndexError we just saw. Setting truncation=True will automatically take care of this.
  3. A solution that seems reasonable to me is to split the text into multiple chunks and apply the keyword extraction pipeline to each chunk. We would obtain a list of noun keywords, then compute cosine similarity with each chunk to obtain the top K most salient keywords. (A rough sketch of this idea follows below.)
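
Below is a minimal sketch of the chunking idea in item 3, not the library's actual implementation. It assumes the extractor exposes generate(text, top_k=...) as in the wordwise README and the tokenizer attribute from core.py; chunk_by_tokens is a hypothetical helper, and it pools per-chunk keywords by frequency as a stand-in for the cosine-similarity ranking described above:

from collections import Counter

from wordwise import Extractor


def chunk_by_tokens(text, tokenizer, max_tokens=512):
    """Split text into pieces that each stay within max_tokens tokens."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        # Re-tokenizing the running chunk every step is O(n^2); fine for a sketch.
        if len(tokenizer.tokenize(" ".join(current))) >= max_tokens and len(current) > 1:
            current.pop()
            chunks.append(" ".join(current))
            current = [word]
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = open("long_document.txt").read()  # any text that blows past the model limit
extractor = Extractor()

# Extract keywords per chunk, then pool by how often each keyword appears.
counts = Counter()
for chunk in chunk_by_tokens(long_text, extractor.tokenizer, max_tokens=512):
    counts.update(extractor.generate(chunk, top_k=5))

print([keyword for keyword, _ in counts.most_common(5)])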

The last solution seems like it would take time to implement, so I'll issue a fix with truncation=True for the time being. If you have a better idea, please let me know in this thread! Thanks for your input.

jaketae (Owner) commented Oct 22, 2021

Closed via #7 for now.

jaketae closed this as completed Oct 22, 2021