UnstructuredURLLoader does not gracefully handle failures given a list of URL's #1939

Closed
cragwolfe opened this issue Mar 23, 2023 · 0 comments

cragwolfe commented Mar 23, 2023

As reported by Kranos in Discord, there is no way to robustly iterate through a list of URLs with UnstructuredURLLoader. The workaround for now is to create an UnstructuredURLLoader object per URL and do the following:

Yep, exactly my problem - I had a load of URLs loaded into a pandas dataframe that I was iterating through. I basically added the following at the end of the loop to keep things ticking over and ignore any errors:
# Manage any errors
  except (NameError, ValueError, KeyError, OSError, TypeError):
    # Pass the error
    pass

UnstructuredURLLoader should likely do this by default, or provide a strict option to exit on any failures.
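
For illustration, the per-URL workaround described above might look roughly like the sketch below. To keep it runnable without network access, `fake_load_one` is a hypothetical stand-in for building `UnstructuredURLLoader(urls=[url])` and calling `.load()`; `load_each` and the `failures` mapping are likewise assumptions and not part of langchain. Unlike a bare `pass`, this version records which URLs failed and why.

```python
from typing import Callable, Dict, List, Tuple


def load_each(
    urls: List[str],
    load_one: Callable[[str], List[str]],
) -> Tuple[List[str], Dict[str, Exception]]:
    """Load every URL independently, recording failures instead of stopping."""
    docs: List[str] = []
    failures: Dict[str, Exception] = {}
    for url in urls:
        try:
            docs.extend(load_one(url))
        except Exception as exc:
            # Broad on purpose: one bad URL must not abort the whole loop.
            failures[url] = exc
    return docs, failures


def fake_load_one(url: str) -> List[str]:
    # Hypothetical stand-in for UnstructuredURLLoader(urls=[url]).load()
    if "doesnotexist" in url:
        raise OSError(f"cannot reach {url}")
    return [f"document from {url}"]


docs, failures = load_each(
    [
        "https://example.com/a",
        "https://doesnotexist.invalid",
        "https://example.com/b",
    ],
    fake_load_one,
)
```

In real use, `load_one` could be something like `lambda url: UnstructuredURLLoader(urls=[url]).load()`, and `failures` can be inspected afterwards to retry or report the URLs that did not load.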

hwchase17 pushed a commit that referenced this issue Mar 27, 2023
By default, UnstructuredURLLoader now continues processing the remaining
`urls` if it encounters an error for a particular URL.

If failure of the entire loader is desired, as was previously the case,
use `continue_on_failure=False`.

E.g., this fails splendidly, courtesy of the 2nd url:

```
from langchain.document_loaders import UnstructuredURLLoader
urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://doesnotexistithinkprobablynotverynotlikely.io",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
loader = UnstructuredURLLoader(urls=urls, continue_on_failure=False)
data = loader.load()
```

Issue: #1939
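
The behavior described in this commit message can be sketched as follows. This is only an approximation of the semantics, not the actual langchain implementation: `load_all`, `fake_fetch`, and `demo_urls` are hypothetical names, and the fetch step is faked so the example runs offline.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def load_all(
    urls: List[str],
    fetch: Callable[[str], str],
    continue_on_failure: bool = True,
) -> List[str]:
    """Fetch each URL; skip failures by default, or re-raise in strict mode."""
    results: List[str] = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            if not continue_on_failure:
                # Strict mode: the first failure aborts the whole load.
                raise
            logger.error("Error fetching %s, skipping: %s", url, exc)
    return results


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for partitioning the page at `url`.
    if "doesnotexist" in url:
        raise OSError(f"cannot resolve {url}")
    return f"contents of {url}"


demo_urls = [
    "https://example.com/first",
    "https://doesnotexist.invalid",
    "https://example.com/third",
]

# Default: the bad URL is logged and skipped, the other two are returned.
lenient = load_all(demo_urls, fake_fetch)
```

Calling `load_all(demo_urls, fake_fetch, continue_on_failure=False)` instead propagates the `OSError` from the second URL, which mirrors the strict behavior shown in the example above.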