-- Contents --
1. Simple and fast text extraction
2. Advanced parsing
3. Sitemap extraction
4. Indexing parsed webpage data

# 1. Simple and fast text extraction

In [1]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.
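
The warning above appears because no USER_AGENT environment variable is set when the loader module is imported. A minimal sketch of one way to set it, assuming this runs before the langchain_community import; the value string is a hypothetical placeholder, not something from this notebook:

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("USER_AGENT", "web-loading-notebook/0.1")

print(os.environ["USER_AGENT"])
```

Setting the variable in the shell before starting Jupyter has the same effect and keeps it out of the notebook.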


In [None]:
page_url1 = "https://hongkongfp.com/2025/06/29/explainer-how-national-security-permeates-hong-kong-bureaucracy-5-years-after-law-enacted/"
page_url2 = "https://news.rthk.hk/rthk/en/component/k2/1811238-20250630.htm"

loader = WebBaseLoader(web_paths=[page_url1,
                                  page_url2])
docs = []

async for doc in loader.alazy_load():
    docs.append(doc)

## -- ASYNC -- ##
# The async for statement iterates over an asynchronous iterable, awaiting each item as it becomes available.
# Asynchronous programming in Python (asyncio) is concurrent rather than parallel:
# a single thread runs an event loop, and while one task waits on I/O (e.g. an HTTP response),
# the loop switches to another task. This suits I/O-bound work such as fetching several web pages.

assert len(docs) == 2
doc1 = docs[0]
doc2 = docs[1]

## -- ASSERT -- ##
# In Python, the assert statement is used for debugging and testing purposes. 
# It allows you to create assertions, which are essentially sanity checks that verify if a certain condition is true 
# during the execution of your program.
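
The async for and assert notes above can be tried without any network access. This is a minimal sketch in which fake_lazy_load is a hypothetical stand-in for a loader's alazy_load(), and the example.com URLs are placeholders:

```python
import asyncio
from typing import AsyncIterator, List


async def fake_lazy_load(urls: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for alazy_load(): yields one item per URL."""
    for url in urls:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield f"content of {url}"


async def collect(urls: List[str]) -> List[str]:
    items = []
    async for item in fake_lazy_load(urls):
        items.append(item)
    return items


pages = asyncio.run(collect(["https://example.com/a", "https://example.com/b"]))

# Sanity check: one item per URL, otherwise raise AssertionError with this message.
assert len(pages) == 2, "expected one document per URL"
print(pages)
```

Inside a notebook cell an event loop is already running, so `pages = await collect([...])` would replace the asyncio.run call.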


In [38]:
import pprint as p 
print(f"{doc1.metadata}\n")
#p.pprint(doc1.page_content[:500].strip())

{'source': 'https://hongkongfp.com/2025/06/29/explainer-how-national-security-permeates-hong-kong-bureaucracy-5-years-after-law-enacted/', 'title': 'How nat. sec permeates HK bureaucracy, 5 years after law enacted', 'description': "HK gov't departments and statutory bodies for different sectors, including education, labour, social welfare, arts and culture, and the environment, have adopted nat. sec clauses in their guidelines and conditions since Beijing imposed the law in June 2020.", 'language': 'en-GB'}



In [None]:
title_loader = WebBaseLoader(
    web_paths=[page_url1],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(class_="entry-title entry-title--with-subtitle"),
    },
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

title_docs = []
async for doc in title_loader.alazy_load():
    title_docs.append(doc)

assert len(title_docs) == 1
title = title_docs[0]

In [36]:
loader = WebBaseLoader(
    web_paths=[page_url1],
    bs_kwargs={
        # Alternative class filters tried on this page:
        #   class_="entry-title entry-title--with-subtitle"
        #   class_="newspack-post-subtitle"
        "parse_only": bs4.SoupStrainer(class_="main-content"),
    },
    bs_get_text_kwargs={"separator": " | ", "strip": True},
)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

assert len(docs) == 1
doc = docs[0]

In [37]:
print(f"{doc.metadata}\n")
p.pprint(title.page_content)
p.pprint(doc.page_content[:5000])

{'source': 'https://hongkongfp.com/2025/06/29/explainer-how-national-security-permeates-hong-kong-bureaucracy-5-years-after-law-enacted/', 'title': 'x'}

('Explainer: How national security permeates Hong Kong bureaucracy, 5 years '
 'after law enacted')
('Five years since the Beijing-imposed legislation came into effect in Hong '
 'Kong, national security terms have become increasingly common in official '
 'guidelines, permit applications, and licences issued by government '
 'departments and semi-official bodies. | Most recently, the Food and '
 'Environmental Hygiene Department (FEHD) notified businesses of new national '
 'security clauses under the Public Health and Municipal Services Ordinance. | '
 'China’s national flags fill the streets in Hong Kong ahead of July 1, 2025, '
 'the 28th anniversary of Hong Kong’s handover to China. Photo: Kyle Lam/HKFP. '
 '| The new rule is the latest addition to similar provisions in official '
 'guidelines. Government departments and statutor

# 2. Advanced parsing

In [2]:
from langchain_unstructured import UnstructuredLoader

page_url = "https://python.langchain.com/docs/how_to/chatbots_memory/"
loader = UnstructuredLoader(web_url=page_url)

docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

INFO: NumExpr defaulting to 14 threads.


In [3]:
for doc in docs[:10]:
    print(doc.page_content)

Open In Colab
Open on GitHub
How to add memory to chatbots
A key feature of chatbots is their ability to use the content of previous conversational turns as context. This state management can take several forms, including:
Simply stuffing previous messages into a chat model prompt.
The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
More complex modifications like synthesizing summaries for long running conversations.
We'll go into more detail on a few techniques below!
note
This how-to guide previously built a chatbot using RunnableWithMessageHistory. You can access this version of the guide in the v0.2 docs.


In [4]:
for doc in docs[:10]:
    print(f'{doc.metadata["category"]}: {doc.page_content}')

Image: Open In Colab
Image: Open on GitHub
Title: How to add memory to chatbots
NarrativeText: A key feature of chatbots is their ability to use the content of previous conversational turns as context. This state management can take several forms, including:
ListItem: Simply stuffing previous messages into a chat model prompt.
ListItem: The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
ListItem: More complex modifications like synthesizing summaries for long running conversations.
NarrativeText: We'll go into more detail on a few techniques below!
UncategorizedText: note
NarrativeText: This how-to guide previously built a chatbot using RunnableWithMessageHistory. You can access this version of the guide in the v0.2 docs.


In [5]:
from typing import List
from langchain_core.documents import Document

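async def _get_setup_docs_from_url(url: str) -> List[Document]:
    """Return the elements that sit directly under the page's "Setup" title."""
    loader = UnstructuredLoader(web_url=url)

    setup_docs = []
    parent_id = -1  # sentinel: no "Setup" title seen yet

    async for doc in loader.alazy_load():
        # Remember the element_id of the "Setup" heading once it appears.
        if doc.metadata["category"] == "Title" and doc.page_content.startswith("Setup"):
            parent_id = doc.metadata["element_id"]
        # Keep every element whose parent is that heading.
        if doc.metadata.get("parent_id") == parent_id:
            setup_docs.append(doc)

    return setup_docs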


page_urls = [
    "https://python.langchain.com/docs/how_to/chatbots_memory/",
    "https://python.langchain.com/docs/how_to/chatbots_tools/",
]

setup_docs = []
for url in page_urls:
    page_setup_docs = await _get_setup_docs_from_url(url)
    setup_docs.extend(page_setup_docs)

## -- AWAIT -- ##

# Usage within async functions:
# In a regular Python script, await can only appear inside a function defined with async def (a coroutine).
# Jupyter/IPython also allows top-level await in a cell, which is why the loop above works here.
# Pausing execution:
# When await is reached, the current coroutine is suspended and control returns to the event loop
# until the awaited result is ready.
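
The loop above awaits one page at a time. As a possible variation, the sketch below uses asyncio.gather to start all fetches together. fetch_setup_docs is a hypothetical stand-in for _get_setup_docs_from_url that only simulates latency, and the URLs are placeholders:

```python
import asyncio
from typing import List


async def fetch_setup_docs(url: str) -> List[str]:
    """Hypothetical stand-in for _get_setup_docs_from_url: simulates I/O latency."""
    await asyncio.sleep(0.01)  # suspended here; the event loop can run other tasks
    return [f"setup doc from {url}"]


async def main(urls: List[str]) -> List[str]:
    # gather schedules all coroutines at once and returns results in input order.
    per_page = await asyncio.gather(*(fetch_setup_docs(u) for u in urls))
    return [doc for page in per_page for doc in page]


results = asyncio.run(main(["https://example.com/one", "https://example.com/two"]))
print(results)
```

In a notebook, `results = await main([...])` would be used in place of asyncio.run. Whether concurrent requests are appropriate depends on the target site's rate limits.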

In [6]:
setup_docs

[Document(metadata={'languages': ['eng'], 'filetype': 'text/html', 'parent_id': '045f743be3cd8ffd40e1856dcbe29eca', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'NarrativeText', 'element_id': 'a9382c193b18dd10455d6510ff3651dd'}, page_content="You'll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:"),
 Document(metadata={'languages': ['eng'], 'filetype': 'text/html', 'parent_id': '045f743be3cd8ffd40e1856dcbe29eca', 'url': 'https://python.langchain.com/docs/how_to/chatbots_memory/', 'category': 'NarrativeText', 'element_id': '1c452f9bedb9da8c85946b7886292f50'}, page_content='%pip install --upgrade --quiet langchain langchain-openai langgraph\n\nimport getpass\nimport os\n\nif not os.environ.get("OPENAI_API_KEY"):\n    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")'),
 Document(metadata={'languages': ['eng'], 'filetype': 'text/html', 'parent_id': '045f743be3cd8ffd40e1856d
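
The parent_id values in the output above are what link each element to its "Setup" heading. The selection logic can be checked offline with a minimal sketch; Element is a hypothetical stand-in for a parsed Document, and the ids and texts are invented:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a parsed Document's metadata plus text."""
    element_id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def children_of_title(elements: List[Element], title_prefix: str) -> List[Element]:
    parent_id = None  # no matching title seen yet
    selected = []
    for el in elements:
        if el.category == "Title" and el.text.startswith(title_prefix):
            parent_id = el.element_id
        elif parent_id is not None and el.parent_id == parent_id:
            selected.append(el)
    return selected


elements = [
    Element("t1", "Title", "Overview"),
    Element("n1", "NarrativeText", "Intro text.", parent_id="t1"),
    Element("t2", "Title", "Setup"),
    Element("n2", "NarrativeText", "Install the packages.", parent_id="t2"),
    Element("n3", "NarrativeText", "Set the API key.", parent_id="t2"),
    Element("t3", "Title", "Usage"),
    Element("n4", "NarrativeText", "Call the model.", parent_id="t3"),
]

setup = children_of_title(elements, "Setup")
print([el.text for el in setup])
```

Only direct children of the matching title are returned; elements nested deeper would need their own parent chain followed.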

In [7]:
from collections import defaultdict

setup_text = defaultdict(str)

for doc in setup_docs:
    url = doc.metadata["url"]
    setup_text[url] += f"{doc.page_content}\n"

dict(setup_text)

{'https://python.langchain.com/docs/how_to/chatbots_memory/': 'You\'ll need to install a few packages, and have your OpenAI API key set as an environment variable named OPENAI_API_KEY:\n%pip install --upgrade --quiet langchain langchain-openai langgraph\n\nimport getpass\nimport os\n\nif not os.environ.get("OPENAI_API_KEY"):\n    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")\nOpenAI API Key: ········\nLet\'s also set up a chat model that we\'ll use for the below examples.\nfrom langchain_openai import ChatOpenAI\n\nmodel = ChatOpenAI(model="gpt-4o-mini")\nAPI Reference:ChatOpenAI\n',
 'https://python.langchain.com/docs/how_to/chatbots_tools/': 'For this guide, we\'ll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you\'re using Tavily.\nYou\'ll need to sign up for an account on the Tavily website, and install the followi
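
The grouping step relies on defaultdict(str) creating an empty string for each new URL, so += works without a key check. A minimal sketch with hypothetical (url, text) pairs standing in for setup_docs:

```python
from collections import defaultdict

# Hypothetical (url, text) pairs standing in for setup_docs.
chunks = [
    ("https://example.com/a", "Install the packages."),
    ("https://example.com/b", "Sign up for an account."),
    ("https://example.com/a", "Set the API key."),
]

text_by_url = defaultdict(str)  # missing keys start as ""
for url, text in chunks:
    text_by_url[url] += f"{text}\n"

print(dict(text_by_url))
```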

# 3. Sitemap extraction
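
A minimal sketch of sitemap extraction, assuming a standard sitemaps.org XML file and using only the standard library. The sample XML and its URLs are hypothetical. A real workflow would download the site's sitemap and pass the extracted URLs to a loader such as the WebBaseLoader from section 1.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be fetched from the site.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2025-06-30</lastmod></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>
"""


def extract_urls(xml_text: str) -> List[str]:
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = extract_urls(sitemap_xml)
print(urls)
```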

# 4. Indexing parsed webpage data

- Loading a HuggingFace embedding model (Note: this part works neither with API access nor with an HF Hub connection)

In [None]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")  # 22M parameters: about 3 minutes

INFO: Use pytorch device_name: mps
INFO: Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


In [27]:
embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [35]:
# small test:
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.03833858668804169, 0.12346469610929489, -0.02864297293126583]

In [38]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore.from_documents(setup_docs, embeddings)

retrieved_docs = vector_store.similarity_search("chat model", k=5)
for doc in retrieved_docs:
    print(f'Page {doc.metadata["url"]}: {doc.page_content[:300]}\n')

Page https://python.langchain.com/docs/how_to/chatbots_memory/: Let's also set up a chat model that we'll use for the below examples.

Page https://python.langchain.com/docs/how_to/chatbots_memory/: API Reference:ChatOpenAI

Page https://python.langchain.com/docs/how_to/chatbots_memory/: from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")

Page https://python.langchain.com/docs/how_to/chatbots_memory/: OpenAI API Key: ········

Page https://python.langchain.com/docs/how_to/chatbots_tools/: For this guide, we'll be using a tool calling agent with a single tool for searching the web. The default will be powered by Tavily, but you can switch it out for any similar tool. The rest of this section will assume you're using Tavily.
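
The ranking behind similarity_search can be sketched without the embedding model, assuming cosine similarity as the metric. The 3-dimensional vectors and texts below are hypothetical (all-MiniLM-L6-v2 produces 384-dimensional vectors), and top_k is an illustrative helper, not the vector store's API:

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(text, cosine(query, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings" keyed by the text they represent.
store = {
    "set up a chat model": [0.9, 0.1, 0.0],
    "install packages": [0.1, 0.9, 0.0],
    "search the web": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

hits = top_k(query_vec, store, k=2)
print([text for text, _ in hits])
```

A brute-force scan like this is fine for a handful of documents; larger collections usually call for an approximate nearest-neighbour index.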

