# WebBaseLoader

This notebook covers how to use `WebBaseLoader` to load all text from HTML webpages into a document format that can be used downstream. For more customized webpage-loading logic, see child classes such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`.

If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| [WebBaseLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| WebBaseLoader | ✅ | ✅ |

## Setup

### Credentials

`WebBaseLoader` does not require any credentials.

### Installation

To use `WebBaseLoader`, first install the `langchain-community` Python package. The cell below also installs `beautifulsoup4`, which the loader uses to parse HTML.


In [1]:
%pip install -qU langchain_community beautifulsoup4

## Initialization

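Now we can instantiate the loader object and load documents. The example below starts from a single URL, loads its text with `WebBaseLoader`, and tries to follow same-domain links: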

In [8]:
from urllib.parse import urljoin, urlparse

import urllib3
from bs4 import BeautifulSoup
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document

# Optional: silence TLS warnings (do not do this in production)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def scrape_website(url):
    """Load the text content of a web page via WebBaseLoader."""
    try:
        # verify=False skips TLS certificate checks; avoid this in production.
        loader = WebBaseLoader(url, requests_kwargs={"verify": False})
        docs = loader.load()
        if docs:
            return docs[0].page_content
        print(f"Error: no document loaded for {url}")
        return None
    except Exception as e:
        print(f"Error while scraping {url}: {e}")
        return None


def get_all_links_from_html(html, base_url, visited):
    """Extract all internal links from the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for link in soup.find_all("a", href=True):
        full_url = urljoin(base_url, link["href"])
        parsed_url = urlparse(full_url)
        # Keep only internal links that have not been visited yet
        if parsed_url.netloc == urlparse(base_url).netloc and full_url not in visited:
            links.add(full_url)
    return links


def scrape_entire_website(start_url):
    """
    Crawl the site by loading each link, appending its content to the
    combined text, and extracting new links from the loaded content.
    """
    to_visit = {start_url}  # URLs still to visit
    visited = set()  # URLs already visited
    complete_content = ""

    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        print(f"Scraping: {url}")

        # Load the content and append it to the combined text right away
        content = scrape_website(url)
        if content:
            complete_content += f"URL: {url}\n{content}\n\n"
            # Caveat: `content` is the text extracted by the loader, with tags
            # already stripped, so typically no <a> elements remain and no new
            # links are found. Pass raw HTML here to crawl past the start page.
            new_links = get_all_links_from_html(content, start_url, visited)
            if new_links:
                print(f"  New links found on {url}: {len(new_links)}")
            to_visit.update(new_links)

    return complete_content


website_url = "https://www.solutis.fr/"
complete_doc_content = scrape_entire_website(website_url)

# Build a single LangChain document from the combined text
complete_doc = Document(page_content=complete_doc_content, metadata={"source": website_url})

# Print the full content
print(complete_doc.page_content)


Scraping: https://www.solutis.fr/

URL: https://www.solutis.fr/

Courtier rachat de credit, prêts, assurance emprunteur | Solutis ©

Rachat de crédits
Prêt immobilier
Crédit conso
Assurance emprunteur
Crédit pro

Simulez votre projet
Rachat de crédit
Prêt immobilier
Crédit à la consommation
Assurance emprunteur
Prêt professionnel

Je valide

Besoin d'aide ? On vous rappelle.

CREDIT et ASSURANCE
Tous vos projetsau meilleur taux

Baromètre des taux

Nous négocions pour vous les meilleurs taux en vigueur selon votre profil.

Rachat de crédit immo
4.19%*

Rachat de crédit conso
5.08%*

* Taux actuels proposés chez les établi
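
Link discovery needs the raw HTML rather than the extracted text: `page_content` has its tags stripped, so it generally contains no `<a>` elements to follow. The following is a minimal, standard-library sketch of collecting same-domain links from an HTML string. `LinkCollector`, `internal_links`, and the sample markup are hypothetical names introduced here for illustration, not part of LangChain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, base_url):
    """Return the absolute same-host links found in raw HTML."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments to avoid duplicates.
        absolute = urljoin(base_url, href).split("#", 1)[0]
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return links


sample_html = """
<a href="/credit">Credit</a>
<a href="https://example.com/loans#rates">Loans</a>
<a href="https://other.example.org/">External</a>
"""
print(sorted(internal_links(sample_html, "https://example.com/")))
```

With the sample markup, only the two `example.com` links are kept and the external one is dropped. In a crawler, the HTML string would have to come from the HTTP response body rather than from `Document.page_content`.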

In [6]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.solutis.fr/", requests_kwargs={"verify": False})

print(loader.load())



[Document(metadata={'source': 'https://www.solutis.fr/', 'title': 'Courtier rachat de credit, prêts, assurance emprunteur | Solutis ©', 'description': 'Obtenez la meilleure offre pour votre projet grâce à Solutis, votre courtier en rachat de crédit, prêts & assurance de prêt | Simulation Gratuite | Réponse en 24h.', 'language': 'fr'}, page_content="\n\n\n\n\nCourtier rachat de credit, prêts, assurance emprunteur | Solutis ©\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                            Rachat de crédits\n                        \n\n\n\n\n\n\n                            Prêt immobilier\n                        \n\n\n\n\n\n\n                            Crédit conso\n                        \n\n\n\n\n\n\n                            Assurance emprunteur\n                        \n\n\n\n\n\n\n                            Crédit pro\n                        \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\

To bypass SSL verification errors during fetching, you can set the `verify` option to `False`. This turns off certificate checking, so use it only for sites you trust:

`loader.requests_kwargs = {'verify':False}`

### Initialization with multiple pages

You can also pass in a list of pages to load from.

In [None]:
loader_multiple_pages = WebBaseLoader(
    ["https://www.example.com/", "https://google.com"]
)

## Load

In [None]:
loader = WebBaseLoader("https://www.example.com/")
docs = loader.load()

docs[0]

Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')

In [None]:
print(docs[0].metadata)

{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
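
The extracted `page_content` often carries long runs of blank lines left over from the page markup, as the outputs above show. Here is a minimal sketch of tidying that text before passing it downstream, using only the standard library. `normalize_whitespace` is a hypothetical helper, not a LangChain function, and the sample string is made up for the example.

```python
import re


def normalize_whitespace(text):
    """Trim each line and collapse runs of blank lines into a single one."""
    lines = [line.strip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


raw = "\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in examples.\n\n\n"
print(normalize_whitespace(raw))
```

The same function could be applied to `docs[0].page_content` after loading. Whether to keep single blank lines (as here) or remove them entirely depends on how the text is split later.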


### Load multiple URLs concurrently

You can speed up scraping by fetching and parsing multiple URLs concurrently.

Concurrent requests are limited, with a default of 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can raise the limit through the `requests_per_second` parameter. Note that while this speeds up scraping, it may cause the server to block you. Be careful!

In [None]:
%pip install -qU nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

Note: you may need to restart the kernel to use updated packages.


In [None]:
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs

Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00,  8.28it/s]


[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
 Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms  ')]
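
To illustrate what a cap on concurrent requests means in practice, here is a small `asyncio` sketch that limits in-flight tasks with a semaphore. It illustrates the general technique under assumed names (`fetch`, `fetch_all`, and the simulated delay are hypothetical) and should not be read as a description of `WebBaseLoader` internals.

```python
import asyncio


async def fetch(url, semaphore, stats):
    """Simulate one request while holding a slot in the semaphore."""
    async with semaphore:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for network latency
        stats["in_flight"] -= 1
        return f"content of {url}"


async def fetch_all(urls, max_concurrent=2):
    """Fetch every URL, never running more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, semaphore, stats) for u in urls))
    return results, stats["peak"]


urls = [f"https://www.example.com/page{i}" for i in range(6)]
results, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(results), "pages fetched; peak concurrency:", peak)
```

Raising `max_concurrent` shortens the total run time but puts more simultaneous load on the server, which is the trade-off described above. Inside a notebook, `await fetch_all(urls)` can replace the `asyncio.run(...)` call.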

### Loading an XML file, or using a different BeautifulSoup parser

Set `default_parser` to choose the BeautifulSoup parser, for example `"xml"` for XML files. `SitemapLoader`, which loads sitemap files, is an example of a loader that uses this feature.

In [None]:
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs

[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\nÂ§ 431.86\nSection Â§ 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by condu

## Lazy Load

You can use lazy loading to only load one page at a time in order to minimize memory requirements.

In [None]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
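
Because `lazy_load()` yields documents one at a time, it pairs naturally with batch processing. The sketch below groups any iterator into fixed-size batches without materializing the whole input. `batched` and `fake_lazy_load` are hypothetical names; the latter stands in for a real loader so that the example runs offline.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items, pulling from the input lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load():
    """Stand-in for loader.lazy_load(); yields page texts one by one."""
    for i in range(5):
        yield f"page {i}"


for batch in batched(fake_lazy_load(), size=2):
    print(batch)
```

With a real loader, `batched(loader.lazy_load(), size=...)` would hand each batch of documents to the next step (for example, indexing) while keeping only one batch in memory.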


### Async

In [None]:
pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)

Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
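
The top-level `async for` above works in a notebook, where an event loop is already running. In a plain Python script the loop has to be started explicitly. Here is a minimal sketch of that pattern, with `fake_alazy_load` as a hypothetical stand-in for `loader.alazy_load()` so that it runs without network access; `collect` is likewise a name introduced for this example.

```python
import asyncio


async def fake_alazy_load():
    """Stand-in async generator; a real loader would yield Document objects."""
    for i in range(3):
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield f"page {i}"


async def collect(async_iterable):
    """Drain an async iterator into a list."""
    return [item async for item in async_iterable]


pages = asyncio.run(collect(fake_alazy_load()))
print(pages)
```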





## Using proxies

Sometimes you might need to use proxies to get around IP blocks. You can pass in a dictionary of proxies to the loader (and `requests` underneath) to use them.

In [None]:
loader = WebBaseLoader(
    "https://www.walmart.com/search?q=parrots",
    proxies={
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)
docs = loader.load()
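
Credentials that contain characters such as `@` or `:` need to be percent-encoded before they are embedded in a proxy URL. Here is a small sketch of building the `proxies` dictionary. `build_proxies` is a hypothetical helper, and the host, port, and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}/"
    # Assumption: the same HTTP proxy endpoint serves both schemes. Adjust if
    # your provider exposes a separate address for HTTPS traffic.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("alice", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
```

The resulting dictionary can be passed as the `proxies` argument shown above.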

## API reference

For detailed documentation of all `WebBaseLoader` features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html