# [WebBaseLoader](https://python.langchain.com/docs/integrations/document_loaders/web_base/)

In [7]:
from langchain_community.document_loaders import WebBaseLoader

## Loader

In [1]:
loader = WebBaseLoader("https://www.example.com/")

USER_AGENT environment variable not set, consider setting it to identify your requests.
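
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of one way to provide it follows. It assumes the variable must be set before `langchain_community` is imported, and the value `my-notebook/0.1` is a hypothetical placeholder, not something from the original notebook.

```python
import os

# Hypothetical identifier; replace it with a string that describes your own client.
# setdefault keeps any USER_AGENT value that is already configured.
os.environ.setdefault("USER_AGENT", "my-notebook/0.1")

# Import the loader only after the variable is set (assumption: langchain_community
# reads USER_AGENT when it builds its default request headers):
# from langchain_community.document_loaders import WebBaseLoader
```

In a notebook, this would go in the first cell, ahead of the import cell.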


In [3]:
docs = loader.load()

len(docs)

1

In [6]:
docs[0].metadata

{'source': 'https://www.example.com/',
 'title': 'Example Domain',
 'language': 'No language found.'}

In [5]:
print(docs[0].page_content)




Example Domain







Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...






## Multiple Pages

In [29]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_paths=["https://www.example.com/", "https://google.com"])

docs = loader.load()
len(docs)

2

In [32]:
for i, doc in enumerate(docs):
    print(f"Document {i+1}:")
    print(f"   - type: {type(doc)}")
    print(f"   - page_content type: {type(doc.page_content)}")
    print(f"   - metadata type: {type(doc.metadata)}")
    print(f"   - metadata: {doc.metadata}")
    print(f"   - content length: {len(doc.page_content)} characters")
    print()

Document 1:
   - type: <class 'langchain_core.documents.base.Document'>
   - page_content type: <class 'str'>
   - metadata type: <class 'dict'>
   - metadata: {'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
   - content length: 220 characters

Document 2:
   - type: <class 'langchain_core.documents.base.Document'>
   - page_content type: <class 'str'>
   - metadata type: <class 'dict'>
   - metadata: {'source': 'https://google.com', 'title': 'Google', 'language': 'ko'}
   - content length: 144 characters



## Loader with bs4

In [21]:
# Example: extracting specific elements with BS4
from bs4 import BeautifulSoup
import requests

# 1. Basic BS4 example: extract only specific tags
url = "https://www.example.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract only the title
title = soup.find('title').get_text()
print(f"Page title: {title}")

# Extract the text of every <p> tag
paragraphs = soup.find_all('p')
for i, p in enumerate(paragraphs):
    print(f"Paragraph {i+1}: {p.get_text().strip()}")


Page title: Example Domain
Paragraph 1: This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
Paragraph 2: More information...


In [24]:
from langchain_community.document_loaders import WebBaseLoader
from bs4 import BeautifulSoup, SoupStrainer

# Keep only elements with specific tags and classes (e.g., just the article body on a news site)
loader = WebBaseLoader(
    "https://python.langchain.com/docs/introduction/",  # example URL
    bs_kwargs={
        # Parse only <div> and <a> tags that carry one of these classes
        "parse_only": SoupStrainer(
            ['div', 'a'],
            attrs={
                'class': ['menu__link', 'menu__link--active']
            })
    },
    bs_get_text_kwargs={
        "separator": " ",
        "strip": True
    },
    requests_kwargs={
        "timeout": 10,
        "verify": False  # disable certificate verification (insecure; for testing only)
    }
)

docs = loader.load()

len(docs)



1
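
To see what the `parse_only` filter does without a network request, here is a minimal sketch that applies the same kind of `SoupStrainer` to an inline HTML string. The HTML snippet and the names `html` and `only_menu` are made up for illustration. Each matching element carries a single class, to keep the matching behavior easy to reason about.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup: two elements match the filter, two do not.
html = """
<html><head><title>Demo</title></head>
<body>
  <a class="menu__link" href="/a">Tutorials</a>
  <a class="other" href="/b">Skipped link</a>
  <div class="menu__link--active">How-to guides</div>
  <p class="menu__link">Skipped paragraph</p>
</body></html>
"""

# Same filter shape as in the loader: tag name in the list AND class in the list.
only_menu = SoupStrainer(
    ["div", "a"],
    attrs={"class": ["menu__link", "menu__link--active"]},
)

# Elements rejected by the strainer are never added to the parse tree.
soup = BeautifulSoup(html, "html.parser", parse_only=only_menu)

# Mirrors bs_get_text_kwargs={"separator": " ", "strip": True}
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The `<p>` element is dropped even though its class matches, because its tag name is not in the list. The second `<a>` is dropped because its class does not match.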

In [25]:
docs[0]

Document(metadata={'source': 'https://python.langchain.com/docs/introduction/'}, page_content='Tutorials Build a Question Answering application over a Graph Database Tutorials Build a simple LLM application with chat models and prompt templates Build a Chatbot Build a Retrieval Augmented Generation (RAG) App: Part 2 Build an Extraction Chain Build an Agent Tagging Build a Retrieval Augmented Generation (RAG) App: Part 1 Build a semantic search engine Build a Question/Answering system over SQL data Summarize Text How-to guides How-to guides How to use tools in a chain How to use a vectorstore as a retriever How to add memory to chatbots How to use example selectors How to add a semantic layer over graph database How to invoke runnables in parallel How to stream chat model responses How to add default invocation args to a Runnable How to add retrieval to chatbots How to use few shot examples in chat models How to do tool/function calling How to install LangChain packages How to add examp