Working with LangChain I get nltk errors telling me: Package "tokenizers" not found in index and Package "taggers" not found in index #8419
I encounter a similar issue using Google Colab. I am using the qachain and running code I have run many times before, but now receiving this today: [nltk_data] Error loading tokenizers: Package 'tokenizers' not found
Thanks for sharing. So have you come to a solution, or are you facing the same problem? Appreciated.
Sadly, I am facing the same problem and still trying to find the solution. I was just adding some further context in the hope that it could help us. If I find a solution, I will definitely share it here!
I found a workaround, for my situation anyway. The problem seems to be with the DirectoryLoader. I went from:

to

This also required:

Hopefully that works for you too.
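As a minimal sketch of the kind of workaround being described here (passing TextLoader to DirectoryLoader via loader_cls so DirectoryLoader does not fall back to the unstructured/nltk code path) — the folder name and glob below are placeholders, not the original snippets:

```python
from pathlib import Path

try:
    from langchain.document_loaders import DirectoryLoader, TextLoader

    # Hypothetical workaround: force DirectoryLoader to use TextLoader
    # instead of its default UnstructuredFileLoader (which imports nltk).
    loader = DirectoryLoader(
        "docs",                             # placeholder folder
        glob="**/*.txt",
        loader_cls=TextLoader,              # bypasses unstructured/nltk
        loader_kwargs={"encoding": "utf-8"},
    )
    # Only load if the placeholder folder actually exists.
    docs = loader.load() if Path("docs").exists() else []
except ImportError:                         # langchain not installed here
    docs = []

print(len(docs), "documents loaded")
```

Whether this applies depends on which loaders your code actually uses; it only helps where DirectoryLoader's default loader is the thing pulling in nltk.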
Answer generated by a 🤖

I understand that you're encountering issues with NLTK while working with LangChain. Specifically, you're seeing errors related to the 'tokenizers' and 'taggers' packages not being found.

The issue you're experiencing seems to be related to NLTK, a Python library used by LangChain for text processing, rather than LangChain itself. The error messages suggest that NLTK can't locate the 'tokenizers' and 'taggers' packages. However, these are not standard downloadable NLTK packages; they are the names of folders inside nltk_data. The standard sentence tokenizer in NLTK is 'punkt', which you've correctly downloaded. This problem might be due to a misconfiguration or misunderstanding. Here are a few suggestions:
```python
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
print(word_tokenize("This is a test."))
```

If the 'punkt' tokenizer is correctly installed, this should output `['This', 'is', 'a', 'test', '.']`.
Please note that this is just one potential solution to your problem. If you believe this is a bug that could impact other users and you've found a solution, we encourage you to make a pull request. If this answer contradicts your experience, you might want to consider upgrading to the latest version of LangChain.

Sources

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Upgrading to the latest unstructured fixed it for me.
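Before upgrading, it can help to confirm which version of unstructured is actually installed (so you can tell whether the upgrade changed anything). A stdlib-only sketch:

```python
# Sketch: report the installed version of the 'unstructured' package.
from importlib.metadata import version, PackageNotFoundError

try:
    print("unstructured", version("unstructured"))
except PackageNotFoundError:
    print("unstructured is not installed")
```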
Oh sorry, I didn't notice you're the same person who posted on the nltk issue.
Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered an error message related to the "tokenizers" and "taggers" packages not being found when working with LangChain. Another user, @reyfar, encountered a similar issue and suggested a workaround: modifying the DirectoryLoader and importing TextLoader. In response, I provided potential solutions, including verifying the installation of NLTK and considering a different text processing library such as Spacy. Additionally, @akowalsk suggested upgrading the "unstructured" package as a solution.

Before we close this issue, we wanted to check whether it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and contribution to the LangChain project!
Issue you'd like to raise.
I'm trying to load some documents, PowerPoint files and text files to train my custom LLM using LangChain.

When I run it I get a weird error message telling me I don't have the "tokenizers" and "taggers" packages (folders).

I've read the docs, asked the LangChain chatbot, pip installed nltk, uninstalled it, reinstalled nltk without dependencies, and added the data with nltk.download(), nltk.download("punkt"), nltk.download("all"), ... I also manually set the path with nltk.data.path = ['C:\Users\zaesa\AppData\Roaming\nltk_data'] and added all the folders, including the tokenizers and taggers folders from the github repo: . Everything. I also asked on the nltk GitHub repo. Nothing, no success.
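One thing worth ruling out on Windows: in an ordinary Python string literal, \U starts a unicode escape, so a path written as 'C:\Users\...' is not the path it looks like (in Python 3 it is in fact a SyntaxError). A raw string sidesteps this; a stdlib-only check:

```python
# '\U' begins a unicode escape in ordinary string literals, so
# 'C:\Users\...' fails to parse in Python 3. Raw strings keep
# backslashes literal and are the safe way to spell Windows paths.
escaped = 'C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data'  # doubled backslashes
raw = r'C:\Users\zaesa\AppData\Roaming\nltk_data'          # raw-string form
print(escaped == raw)  # both spell the same path -> True
```

This does not explain the index errors by itself, but a mangled nltk.data.path would make nltk look in the wrong place entirely.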
Here the code of the file I try to run:
```python
from nltk.tokenize import sent_tokenize
from langchain.document_loaders import UnstructuredPowerPointLoader, TextLoader, UnstructuredWordDocumentLoader
from dotenv import load_dotenv, find_dotenv
import os
import openai
import sys
import nltk

# Raw strings so the Windows backslashes are not treated as escapes.
nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']
nltk.download('punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

sys.path.append('../..')
_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

folder_path_docx = r"DB\DB VARIADO\DOCS"
folder_path_txt = r"DB\BLOG-POSTS"
folder_path_pptx_1 = r"DB\PPT DAY JUNIO"
folder_path_pptx_2 = r"DB\DB VARIADO\PPTX"

loaded_content = []

for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

print(loaded_content[0].page_content)
print(nltk.data.path)

installed_packages = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data').packages()
print(installed_packages)

sent_tokenize("Hello. How are you? I'm well.")
```
When running the file I get:
```
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading taggers: Package 'taggers' not found in index
... (the same block of errors repeats for each file that is loaded) ...
['C:\Users\zaesa\AppData\Roaming\nltk_data']
dict_values([...])
```
And here is what my folder structure inside nltk_data looks like:
Suggestion:
I have a fresh install of nltk with no dependencies, and the version is the latest. The support team from NLTK doesn't know what is wrong; everything seems fine on their side. So it has to be a bug, or something coming from LangChain that I'm not able to see. I'd really appreciate any help. I need to make this work! Thank you.
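As one more diagnostic, it may help to check the on-disk layout directly: sent_tokenize needs a tokenizers/punkt folder under one of the directories on nltk's search path. A stdlib sketch (the AppData\Roaming location mirrors the path used above and is an assumption about the machine):

```python
from pathlib import Path

# Sketch: verify the directory layout nltk expects for the 'punkt'
# tokenizer. Adjust the base directory for your own machine.
nltk_data = Path.home() / "AppData" / "Roaming" / "nltk_data"
punkt = nltk_data / "tokenizers" / "punkt"
print("expected location:", punkt)
print("exists:", punkt.exists())
```

If the folder exists but the errors persist, the mismatch is likely in what nltk.data.path contains at the moment the loaders run, not in the download itself.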