WebBaseLoader: optionally raise exception in the case of http error (#6823)

- **Description**: this PR adds an option to raise an exception when the
HTTP request returns an error status code (4xx or 5xx). This is
particularly useful when the URL points to a non-existent web page: the
server returns an HTTP status of 404 NOT FOUND, but WebBaseLoader
nonetheless parses and returns the HTTP body of the error page.
  - **Dependencies**: none,
  - **Tag maintainer**: @rlancemartin, @eyurtsev,
  - **Twitter handle**: jtolgyesi
mrtj authored and hinthornw committed Jul 3, 2023
1 parent 9db9adc commit 562b430
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions langchain/document_loaders/web_base.py
@@ -50,6 +50,9 @@ class WebBaseLoader(BaseLoader):
     requests_kwargs: Dict[str, Any] = {}
     """kwargs for requests"""
 
+    raise_for_status: bool = False
+    """Raise an exception if http status code denotes an error."""
+
     bs_get_text_kwargs: Dict[str, Any] = {}
     """kwargs for beatifulsoup4 get_text"""

@@ -189,6 +192,8 @@ def _scrape(self, url: str, parser: Union[str, None] = None) -> Any:
         self._check_parser(parser)
 
         html_doc = self.session.get(url, verify=self.verify, **self.requests_kwargs)
+        if self.raise_for_status:
+            html_doc.raise_for_status()
         html_doc.encoding = html_doc.apparent_encoding
         return BeautifulSoup(html_doc.text, parser)
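
The control flow this change adds to `_scrape` can be illustrated with a minimal, network-free sketch. `FakeResponse`, `HTTPStatusError`, and `scrape` are hypothetical stand-ins invented for this illustration and are not part of the commit. The stand-in `raise_for_status` method assumes the behaviour of the `requests` method of the same name, which raises only for 4xx and 5xx status codes.

```python
from dataclasses import dataclass


class HTTPStatusError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""


@dataclass
class FakeResponse:
    """Hypothetical stand-in for requests.Response (illustration only)."""

    status_code: int
    text: str

    def raise_for_status(self) -> None:
        # Assumed to mirror requests: only 4xx and 5xx codes raise.
        if 400 <= self.status_code < 600:
            raise HTTPStatusError(f"HTTP error {self.status_code}")


def scrape(response: FakeResponse, raise_for_status: bool = False) -> str:
    """Mimic the patched _scrape control flow, without the HTML parsing."""
    if raise_for_status:
        response.raise_for_status()
    return response.text


not_found = FakeResponse(status_code=404, text="<h1>Not Found</h1>")

# Default (flag off): the body of the error page is returned as content.
body = scrape(not_found)

# Opt-in (flag on): the 404 surfaces as an exception instead.
try:
    scrape(not_found, raise_for_status=True)
    raised = False
except HTTPStatusError:
    raised = True
```

Because the diff declares `raise_for_status` as a class-level attribute defaulting to `False`, existing callers keep the old behaviour. With the real loader, the check would presumably be enabled by setting `loader.raise_for_status = True` on an instance before loading.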
