
Working with LangChain I get nltk errors telling me: Package "tokenizers" not found in index and Package "taggers" not found in index #8419

Closed
venturaEffect opened this issue Jul 28, 2023 · 8 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@venturaEffect

venturaEffect commented Jul 28, 2023

Issue you'd like to raise.

I'm trying to load some documents, PowerPoints, and text files to train my custom LLM using LangChain.

When I run it I hit a strange error message telling me the "tokenizers" and "taggers" packages (folders) can't be found.

I've read the docs, asked the LangChain chatbot, run pip install nltk, uninstalled and reinstalled nltk without dependencies, downloaded the resources with nltk.download(), nltk.download("punkt"), and nltk.download("all"), and manually set the path with nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data'], adding all the folders. I copied the tokenizers folder and taggers folder from the GitHub repo: . Everything. I also asked on the GitHub repo. Nothing, no success.
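As an aside, hard-coding that Windows path in a plain string literal is itself a trap: `\U` in a non-raw string starts a Unicode escape, so `'C:\Users\...'` is a SyntaxError in Python 3. A minimal stdlib-only sketch (independent of nltk) of the three safe ways to write it:

```python
# Backslashes in Windows paths clash with Python string escapes:
# '\U' begins an 8-digit unicode escape, so 'C:\Users\...' fails to parse.
# Raw strings, doubled backslashes, or forward slashes all avoid this.
raw_path = r'C:\Users\zaesa\AppData\Roaming\nltk_data'
escaped_path = 'C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data'
forward_path = 'C:/Users/zaesa/AppData/Roaming/nltk_data'

print(raw_path == escaped_path)  # → True: both spell the same path
```

On Windows, most APIs (including nltk's data lookup) accept the forward-slash form as well.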

Here is the code of the file I try to run:

```python
from nltk.tokenize import sent_tokenize
from langchain.document_loaders import UnstructuredPowerPointLoader, TextLoader, UnstructuredWordDocumentLoader
from dotenv import load_dotenv, find_dotenv
import os
import openai
import sys
import nltk

# Raw strings so backslashes in Windows paths aren't treated as escapes
nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']
nltk.download('punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

folder_path_docx = r"DB\ DB VARIADO\DOCS"
folder_path_txt = r"DB\BLOG-POSTS"
folder_path_pptx_1 = r"DB\PPT DAY JUNIO"
folder_path_pptx_2 = r"DB\DB VARIADO\PPTX"

loaded_content = []

# Word documents
for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

# Plain-text files
for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

# PowerPoint files, from two folders
for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

print(loaded_content[0].page_content)
print(nltk.data.path)

installed_packages = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data').packages()
print(installed_packages)

sent_tokenize("Hello. How are you? I'm well.")
```
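The four scanning loops all follow one pattern: list a folder, filter by extension, join the path, load. A stdlib-only sketch of that pattern (the LangChain loader call is replaced by collecting paths, so the snippet is self-contained and runnable without langchain):

```python
import os
import tempfile

def collect_files(folder, extension):
    """Return full paths of the files in `folder` whose names end with `extension`."""
    return [os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.endswith(extension)]

# Demo against a temporary directory so the sketch has no external dependencies
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.docx", "b.txt", "c.pptx", "d.pptx"):
        open(os.path.join(tmp, name), "w").close()
    print(sorted(os.path.basename(p) for p in collect_files(tmp, ".pptx")))
    # → ['c.pptx', 'd.pptx']
```

In the real script, each collected path would be handed to the matching loader class (`UnstructuredWordDocumentLoader`, `TextLoader`, `UnstructuredPowerPointLoader`) instead of being printed.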

When running the file I get:

```text
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading taggers: Package 'taggers' not found in index
[... the same two errors repeat many more times ...]

  • HERE SOME TEXT -

['C:\Users\zaesa\AppData\Roaming\nltk_data']
dict_values([, , , , , , , , ...])
```

And here is what my nltk_data folder structure looks like:

(screenshots: nltk_data, taggers, tokenizers, and punkt folders)

Suggestion:

I have a fresh install of nltk with no dependencies, at the latest version. The NLTK support team doesn't know what is wrong; on their side everything seems fine. So it has to be a bug, or something coming from LangChain that I'm not able to see. I'd really appreciate any help. I need to make this work! Thank you.

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jul 28, 2023
@y-f-a

y-f-a commented Jul 28, 2023

I'm encountering a similar issue using Google Colab. I am using the QA chain and running code I have run many times before, but started receiving this today:

```text
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[... the same error repeats several more times ...]
[nltk_data] Error loading taggers: Package 'taggers' not found in index
```

```text
LookupError                               Traceback (most recent call last)
<ipython-input> in <cell line: 10>()
      8
      9 # get the langchain qa chain for the given company
---> 10 qa = get_company_qa_chain(text_dir_path)

17 frames
/usr/local/lib/python3.10/dist-packages/nltk/data.py in find(resource_name, paths)
    581     sep = "*" * 70
    582     resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583     raise LookupError(resource_not_found)
    584
    585

LookupError:
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
```


@venturaEffect
Author

Thanks for sharing.

So, have you come to a solution, or are you facing the same problem?

Appreciate it.

@y-f-a

y-f-a commented Jul 28, 2023

Sadly, I am facing the same problem and still trying to find the solution. I was just adding some further context in the hope that it could help us. If I find a solution, I will definitely share it here!

@y-f-a

y-f-a commented Jul 28, 2023

I found a workaround, for my situation anyway. The problem seems to be with the DirectoryLoader. I went from:

```python
loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True)
```

to

```python
loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True, loader_cls=TextLoader)
```

This also required:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
```

Hopefully that works for you too.

@dosubot

dosubot bot commented Jul 28, 2023

Answer generated by a 🤖

Answer

I understand that you're encountering issues with NLTK while working with LangChain. Specifically, you're seeing errors related to the 'tokenizers' and 'taggers' packages not being found. I should note that as an AI, I can only process text and cannot view images or links.

The issue you're experiencing seems to be related to NLTK, a Python library used by LangChain for text processing, rather than LangChain itself. The error messages suggest that NLTK can't locate the 'tokenizers' and 'taggers' packages. However, these are not standard packages in NLTK. The standard tokenizer in NLTK is 'punkt', which you've correctly downloaded with nltk.download('punkt').

This problem might be due to a misconfiguration or misunderstanding. Here are a few suggestions:

1. Verify that NLTK is correctly installed and up-to-date by running `pip install --upgrade nltk`.

2. Confirm that the 'punkt' tokenizer is correctly installed. You can do this by running the following code:

```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize("This is a test."))
```

If the 'punkt' tokenizer is correctly installed, this should output `['This', 'is', 'a', 'test', '.']`.

3. If you're still encountering issues, you might find it helpful to completely uninstall NLTK, delete your nltk_data directory, and then reinstall NLTK. You can uninstall NLTK by running `pip uninstall nltk` and reinstall it with `pip install nltk`.

4. If none of the above solutions work, you might want to consider using a different text processing library, such as spaCy. LangChain also supports spaCy, and you can use it by changing `NLTKTextSplitter` to `SpacyTextSplitter` in your code.

Please note that this is just one potential solution to your problem. If you believe this is a bug that could impact other users and you've found a solution, we encourage you to make a pull request. If this answer contradicts your experience, you might want to consider upgrading to the latest version of LangChain.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

@akowalsk

Upgrading to the latest unstructured fixed it for me: `pip install -U unstructured`. See Unstructured-IO/unstructured@e017e99

@akowalsk

Oh sorry, I didn't notice you're the same person who posted on the nltk issue.

@dosubot

dosubot bot commented Oct 27, 2023

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered an error message related to the "tokenizers" and "taggers" packages not being found when working with LangChain. Another user, @y-f-a, also encountered a similar issue and suggested a workaround by modifying the DirectoryLoader and importing TextLoader. In response, I provided potential solutions, including verifying the installation of NLTK and considering using a different text processing library like spaCy. Additionally, @akowalsk suggested upgrading the "unstructured" package as a solution.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 27, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 3, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 3, 2023