langchain/document_loaders/web_base.py #8505

Closed
2 of 14 tasks
tabee opened this issue Jul 31, 2023 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

tabee commented Jul 31, 2023

System Info

langchain/document_loaders/web_base.py works for me only when I change:

return await response.text()

to:

body = await response.read()
return body.decode('utf-8', errors='ignore')

Otherwise, the code unfortunately produces an error:

/home/codespace/.python/current/bin/python3 /workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py
Fetching pages: 13%|###8 | 33/256 [00:03<00:19, 11.18it/s]
Traceback (most recent call last):
File "/workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py", line 68, in
documents = loader.load()
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/sitemap.py", line 142, in load
results = self.scrape_all([el["loc"].strip() for el in els if "loc" in el])
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 168, in scrape_all
results = asyncio.run(self.fetch_all(urls))
File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 35, in run
return loop.run_until_complete(task)
File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 90, in run_until_complete
return f.result()
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 148, in fetch_all
return await tqdm_asyncio.gather(
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 234, in __step
result = coro.throw(exc)
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 285, in await
yield self # This tells Task to wait for completion.
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
future.result()
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 136, in _fetch_with_rate_limit
return await self._fetch(url)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 120, in _fetch
return await response.text()
File "/home/codespace/.python/current/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1086, in text
return self._body.decode( # type: ignore[no-any-return,union-attr]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
Fetching pages: 15%|####4 | 38/256 [00:04<00:23, 9.25it/s]
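
The last frame of the traceback can be reproduced without any network access. The following is a minimal sketch, assuming the page was served in a single-byte encoding such as ISO-8859-1; the issue does not state the real encoding, and the sample bytes are made up for illustration. In ISO-8859-1, byte 0xb5 is the micro sign.

```python
# Hypothetical sample (not taken from the issue): "10 µm" encoded as
# ISO-8859-1, where "µ" becomes the single byte 0xb5.
raw = "10 µm".encode("latin-1")

# Strict UTF-8 decoding fails because 0xb5 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# errors="ignore" avoids the exception but silently drops the byte.
lossy = raw.decode("utf-8", errors="ignore")

# Decoding with the encoding the bytes were written in keeps the character.
exact = raw.decode("latin-1")

print(error)
print(repr(lossy), repr(exact))
```

The workaround above therefore trades the exception for missing characters, which may or may not be acceptable for text that is later embedded.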

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

I tried to create embeddings from a French-language website.

Expected behavior

We need a solution for this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte

dosubot bot added the "Ɑ: doc loader" and "🤖:bug" labels on Jul 31, 2023
dosubot bot commented Jul 31, 2023

Answer generated by a 🤖

Answer

I understand that you're encountering a UnicodeDecodeError when loading documents from a French-language website with LangChain. The error occurs because response.text() decodes the response body as UTF-8 and hits a byte (0xb5) that is not valid at that position in UTF-8, which usually means the page was served in a different encoding.

The solution you've proposed, reading the response body as bytes with response.read() and then decoding it as UTF-8 with errors='ignore', is a workable approach: it prevents the UnicodeDecodeError from being raised. Note that errors='ignore' silently drops every byte that is not valid UTF-8, so accented characters from a page in another encoding may be missing from the result.

Here's how you can implement this solution in the _fetch method:

async def _fetch(self, url: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if self.raise_for_status:
                response.raise_for_status()
            body = await response.read()
            return body.decode('utf-8', errors='ignore')

This modification should resolve the issue you're facing. However, please note that this is one potential solution and there might be other ways to address this problem.
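
One such alternative is to try several encodings before discarding anything. The following is a rough sketch, not LangChain's implementation: decode_body, its parameters, and the choice of cp1252 as a fallback are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def decode_body(
    body: bytes,
    declared: Optional[str] = None,
    fallbacks: Iterable[str] = ("utf-8", "cp1252"),
) -> str:
    """Decode response bytes, trying several encodings before giving up.

    Hypothetical helper: `declared` is the charset announced by the server
    (if any); `fallbacks` are tried afterwards, in order.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a declared charset Python does not know.
            continue
    # Last resort: keep the text and mark undecodable bytes with U+FFFD
    # instead of dropping them silently.
    return body.decode("utf-8", errors="replace")


print(decode_body("été".encode("cp1252")))
```

Inside _fetch this could be called as decode_body(await response.read(), response.charset), since aiohttp exposes the charset from the Content-Type header on the response. A smaller change with a similar effect would be await response.text(errors="replace"), which keeps aiohttp's own encoding detection and only changes how invalid bytes are handled.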

If you find this solution to be effective and believe that it could benefit other users who might encounter the same issue, I encourage you to make a pull request with this change. This way, the LangChain community can review and potentially incorporate your solution into the framework.

If this solution doesn't work as expected, or if you continue to encounter issues, I recommend checking if you're using the latest version of LangChain. Upgrading to the latest version might resolve your issue if it's been addressed in a more recent update.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

dosubot bot commented Oct 30, 2023

Hi, @tabee! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to a UnicodeDecodeError in the code located in langchain/document_loaders/web_base.py. You found a workaround by replacing return await response.text() with body = await response.read() followed by return body.decode('utf-8', errors='ignore'). I provided a response suggesting that the workaround is a valid approach and even provided code on how to implement it. I also mentioned the possibility of making a pull request if the solution is effective.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot bot added the "stale" label on Oct 30, 2023
dosubot bot closed this as not planned (stale) on Nov 6, 2023
dosubot bot removed the "stale" label on Nov 6, 2023