# How to load HTML

The HyperText Markup Language or [HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.

This guide covers how to load `HTML` documents into LangChain [Document](https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via [Unstructured](https://unstructured-io.github.io/unstructured/) and [BeautifulSoup4](https://beautiful-soup-4.readthedocs.io/en/latest/), which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as [Azure AI Document Intelligence](/docs/integrations/document_loaders/azure_document_intelligence) or [FireCrawl](/docs/integrations/document_loaders/firecrawl).

## Loading HTML with Unstructured

In [1]:
%pip install unstructured

In [2]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "../../docs/integrations/document_loaders/example_data/fake-content.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html'}, page_content='My First Heading\n\nMy first paragraph.')]
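
If the example data file is not available locally, a stand-in can be written to disk first. The sketch below is a minimal example, not the original file: the markup in `SAMPLE_HTML` is an assumption inferred from the loader outputs shown in this guide, and the helper name `write_sample_html` is hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed markup, reconstructed from the loader outputs in this guide.
# It is NOT a copy of the original fake-content.html.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


def write_sample_html(directory=None):
    """Write the stand-in HTML file and return its path as a string.

    Hypothetical helper: uses a fresh temporary directory when none is given.
    """
    target_dir = Path(directory) if directory else Path(tempfile.mkdtemp())
    target = target_dir / "fake-content.html"
    target.write_text(SAMPLE_HTML, encoding="utf-8")
    return str(target)


# Point the loaders at the stand-in file instead of the repository path.
file_path = write_sample_html()
```

Passing this `file_path` to the loaders in this guide should produce output similar to what is shown, provided the assumed markup is close to the original file.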


## Loading HTML with BeautifulSoup4

We can also use `BeautifulSoup4` to load HTML documents using the `BSHTMLLoader`. This extracts the text from the HTML into `page_content` and stores the page title under the `title` key in `metadata`.

In [3]:
%pip install bs4

In [4]:
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(metadata={'source': '../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'}, page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n')]
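
To make the split between `page_content` and the `title` metadata concrete, here is a rough sketch of the same idea using only Python's standard-library `html.parser`. This is not how `BSHTMLLoader` is implemented, and its whitespace handling differs from the output above. The names `TitleAndTextParser` and `extract_title_and_text` are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> text and all non-empty text nodes (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text inside <title> is recorded as the title and also kept as text.
        if self._in_title:
            self.title += data
        stripped = data.strip()
        if stripped:
            # Note: this sketch does not filter out <script> or <style> content.
            self.text_parts.append(stripped)


def extract_title_and_text(html):
    """Return a dict with 'title' and 'page_content' keys for an HTML string."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "page_content": "\n".join(parser.text_parts),
    }


result = extract_title_and_text(
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
print(result["title"])
print(result["page_content"])
```

For real documents the loader is the better choice, since BeautifulSoup copes with malformed markup far more gracefully than this sketch.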
