Langchain document loader giving "Resource punkt_tab not found" error #25609

Open
5 tasks done
quasarswastik opened this issue Aug 21, 2024 · 2 comments
Labels
🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
Ɑ: doc loader (Related to document loader module, not documentation)

Comments

@quasarswastik

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

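```python
from langchain_community.document_loaders import AzureBlobStorageFileLoader

# conn_str, container, and blob are defined elsewhere (not shown in this report)
loader = AzureBlobStorageFileLoader(
    conn_str=conn_str,
    container=container,
    blob_name=blob,
)
document = loader.load()
```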

Error Message and Stack Trace (if applicable)

No response

Description

- I am trying to use LangChain to load documents with `AzureBlobStorageFileLoader`.
- When loading the document, I get an NLTK-related error that seems to originate upstream of LangChain.
- I could work around the problem temporarily by pinning a downgraded version of NLTK: `nltk==3.8.1`.
![image](https://github.com/user-attachments/assets/f803ddc6-ecae-4b45-8c62-e35016cccc41)


System Info

langchain==0.2.12
langchain-community==0.2.11
langchain-core==0.2.29
langchain-experimental==0.0.36
langchain-text-splitters==0.2.2

Platform: Ubuntu WSL2 on Windows 10
Containerisation: Docker version 27.0.2, build 912c1dd
Python: Python 3.10.12
@dosubot (bot) added the Ɑ: doc loader and 🤖:bug labels on Aug 21, 2024
@quasarswastik

**********************************************************************
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt_tab')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt_tab/english/

Searched in:
- '/home/myLowPrivilegeUser/nltk_data'
- '/venv/nltk_data'
- '/venv/share/nltk_data'
- '/venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
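
For anyone else hitting this, the message above itself suggests fetching the resource with the NLTK downloader. A minimal sketch of doing that once at startup follows; it assumes the environment has network access and a writable NLTK data directory, which may not hold for a low-privilege container user. The helper name `ensure_punkt_tab` and its `find`/`download` parameters are hypothetical and exist only so the logic can be exercised without NLTK installed.

```python
def ensure_punkt_tab(find=None, download=None):
    """Download the punkt_tab tokenizer data only if NLTK cannot find it.

    Returns True if a download was triggered, False if the data was present.
    `find` and `download` default to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        # Imported lazily so the helper can be tested without NLTK installed.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        # Raises LookupError when the resource is missing from every search path.
        find("tokenizers/punkt_tab")
        return False
    except LookupError:
        download("punkt_tab")
        return True
```

Calling `ensure_punkt_tab()` before `loader.load()` should be enough in an environment that is allowed to download; in a locked-down image, fetching the data at build time is probably the safer choice.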

@dvdalilue

@quasarswastik I was getting the same error. You need to update unstructured, and probably python-pptx as well:

pip install unstructured==0.15.7 python-pptx==1.0.2
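
To confirm which versions are actually installed before and after the upgrade, something like the sketch below may help. It uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages discussed in this issue; adjust the list as needed.
    report = installed_versions(["nltk", "unstructured", "python-pptx"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```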
