
Working with LangChain I get nltk errors telling me: Package "tokenizers" not found in index and Package "taggers" not found in index #8419

Closed
venturaEffect opened this issue Jul 28, 2023 · 8 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@venturaEffect

venturaEffect commented Jul 28, 2023

Issue you'd like to raise.

I'm trying to load some documents, PowerPoints, and text files to train my custom LLM using LangChain.

When I run it I hit a strange error message telling me the "tokenizers" and "taggers" packages (folders) can't be found.

I've read the docs, asked the LangChain chatbot, run pip install nltk, uninstalled and reinstalled nltk without dependencies, downloaded the resources with nltk.download(), nltk.download("punkt"), and nltk.download("all"), and manually set the path with nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data'], adding all the folders. I copied the tokenizers folder and taggers folder from the GitHub repo: . Everything. I also asked on the GitHub repo. Nothing, no success.
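As an aside, hard-coding that Windows path in a plain string literal is itself a trap: `\U` in a non-raw string starts a Unicode escape, so `'C:\Users\...'` is a SyntaxError in Python 3. A minimal stdlib-only sketch (independent of nltk) of the three safe ways to write it:

```python
# Backslashes in Windows paths clash with Python string escapes:
# '\U' begins an 8-digit unicode escape, so 'C:\Users\...' fails to parse.
# Raw strings, doubled backslashes, or forward slashes all avoid this.
raw_path = r'C:\Users\zaesa\AppData\Roaming\nltk_data'
escaped_path = 'C:\\Users\\zaesa\\AppData\\Roaming\\nltk_data'
forward_path = 'C:/Users/zaesa/AppData/Roaming/nltk_data'

print(raw_path == escaped_path)  # → True: both spell the same path
```

On Windows, most APIs (including nltk's data lookup) accept the forward-slash form as well.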

Here is the code of the file I try to run:

```python
from nltk.tokenize import sent_tokenize
from langchain.document_loaders import UnstructuredPowerPointLoader, TextLoader, UnstructuredWordDocumentLoader
from dotenv import load_dotenv, find_dotenv
import os
import openai
import sys
import nltk

# Raw strings so backslashes in Windows paths aren't treated as escapes
nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']
nltk.download('punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

folder_path_docx = r"DB\ DB VARIADO\DOCS"
folder_path_txt = r"DB\BLOG-POSTS"
folder_path_pptx_1 = r"DB\PPT DAY JUNIO"
folder_path_pptx_2 = r"DB\DB VARIADO\PPTX"

loaded_content = []

# Word documents
for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

# Plain-text files
for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

# PowerPoint files, from two folders
for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

print(loaded_content[0].page_content)
print(nltk.data.path)

installed_packages = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data').packages()
print(installed_packages)

sent_tokenize("Hello. How are you? I'm well.")
```
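The four scanning loops all follow one pattern: list a folder, filter by extension, join the path, load. A stdlib-only sketch of that pattern (the LangChain loader call is replaced by collecting paths, so the snippet is self-contained and runnable without langchain):

```python
import os
import tempfile

def collect_files(folder, extension):
    """Return full paths of the files in `folder` whose names end with `extension`."""
    return [os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.endswith(extension)]

# Demo against a temporary directory so the sketch has no external dependencies
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.docx", "b.txt", "c.pptx", "d.pptx"):
        open(os.path.join(tmp, name), "w").close()
    print(sorted(os.path.basename(p) for p in collect_files(tmp, ".pptx")))
    # → ['c.pptx', 'd.pptx']
```

In the real script, each collected path would be handed to the matching loader class (`UnstructuredWordDocumentLoader`, `TextLoader`, `UnstructuredPowerPointLoader`) instead of being printed.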

When running the file I get:

```text
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading taggers: Package 'taggers' not found in index
[... the same two errors repeat many more times ...]

  • HERE SOME TEXT -

['C:\Users\zaesa\AppData\Roaming\nltk_data']
dict_values([, , , , , , , , ...])
```

And here is what my nltk_data folder structure looks like:

(screenshots: nltk_data, taggers, tokenizers, and punkt folders)

Suggestion:

I have a fresh install of nltk with no dependencies, at the latest version. The NLTK support team doesn't know what is wrong; on their side everything seems fine. So it has to be a bug, or something coming from LangChain that I'm not able to see. I'd really appreciate any help. I need to make this work! Thank you.

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jul 28, 2023
@y-f-a

y-f-a commented Jul 28, 2023

I'm encountering a similar issue using Google Colab. I am using the QA chain and running code I have run many times before, but started receiving this today:

```text
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[... the same error repeats several more times ...]
[nltk_data] Error loading taggers: Package 'taggers' not found in index
```

```text
LookupError                               Traceback (most recent call last)
<ipython-input> in <cell line: 10>()
      8
      9 # get the langchain qa chain for the given company
---> 10 qa = get_company_qa_chain(text_dir_path)

17 frames
/usr/local/lib/python3.10/dist-packages/nltk/data.py in find(resource_name, paths)
    581     sep = "*" * 70
    582     resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583     raise LookupError(resource_not_found)
    584
    585

LookupError:
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
```


@venturaEffect
Author

Thanks for sharing.

So, have you come to a solution, or are you facing the same problem?

Appreciate it.

@y-f-a

y-f-a commented Jul 28, 2023

Sadly, I am facing the same problem and still trying to find the solution. I was just adding some further context in the hope that it could help us. If I find a solution, I will definitely share it here!

@y-f-a

y-f-a commented Jul 28, 2023

I found a workaround, for my situation anyway. The problem seems to be with the DirectoryLoader. I went from:

```python
loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True)
```

to

```python
loader = DirectoryLoader(text_dir, glob="*.txt", recursive=True, silent_errors=True, loader_cls=TextLoader)
```

This also required:

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
```

Hopefully that works for you too.

@dosubot

dosubot bot commented Jul 28, 2023

Answer generated by a 🤖

Answer

I understand that you're encountering issues with NLTK while working with LangChain. Specifically, you're seeing errors related to the 'tokenizers' and 'taggers' packages not being found. I should note that as an AI, I can only process text and cannot view images or links.

The issue you're experiencing seems to be related to NLTK, a Python library used by LangChain for text processing, rather than LangChain itself. The error messages suggest that NLTK can't locate the 'tokenizers' and 'taggers' packages. However, these are not standard packages in NLTK. The standard tokenizer in NLTK is 'punkt', which you've correctly downloaded with nltk.download('punkt').

This problem might be due to a misconfiguration or misunderstanding. Here are a few suggestions:

1. Verify that NLTK is correctly installed and up-to-date by running `pip install --upgrade nltk`.

2. Confirm that the 'punkt' tokenizer is correctly installed. You can do this by running the following code:

```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize("This is a test."))
```

If the 'punkt' tokenizer is correctly installed, this should output `['This', 'is', 'a', 'test', '.']`.

3. If you're still encountering issues, you might find it helpful to completely uninstall NLTK, delete your nltk_data directory, and then reinstall NLTK. You can uninstall NLTK by running `pip uninstall nltk` and reinstall it with `pip install nltk`.

4. If none of the above solutions work, you might want to consider using a different text processing library, such as spaCy. LangChain also supports spaCy, and you can use it by changing `NLTKTextSplitter` to `SpacyTextSplitter` in your code.

Please note that this is just one potential solution to your problem. If you believe this is a bug that could impact other users and you've found a solution, we encourage you to make a pull request. If this answer contradicts your experience, you might want to consider upgrading to the latest version of LangChain.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

@akowalsk

Upgrading to the latest unstructured fixed it for me: `pip install -U unstructured`. See Unstructured-IO/unstructured@e017e99

@akowalsk

Oh sorry, I didn't notice you're the same person who posted on the nltk issue.

@dosubot

dosubot bot commented Oct 27, 2023

Hi, @venturaEffect! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered an error message related to the "tokenizers" and "taggers" packages not being found when working with LangChain. Another user, @y-f-a, also encountered a similar issue and suggested a workaround by modifying the DirectoryLoader and importing TextLoader. In response, I provided potential solutions, including verifying the installation of NLTK and considering using a different text processing library like spaCy. Additionally, @akowalsk suggested upgrading the "unstructured" package as a solution.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 27, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 3, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 3, 2023