# Easy Web Scraping with LangChain's UnstructuredURLLoader
### No BeautifulSoup, Just URLs and Magic! ✨

In this notebook, I’ll show you how to scrape websites using LangChain's `UnstructuredURLLoader`. Instead of parsing HTML by hand, you pass in a list of URLs and the loader fetches each page and extracts its text. Perfect for beginners and pros alike!

## Step 1: Install Required Libraries
First, we need to install `langchain-community` and `unstructured` to use the loader. Run the cell below:

In [1]:
!pip install langchain-community unstructured


## Step 2: Write the Scraping Code
Here’s the magic part! We’ll scrape content from a list of URLs using `UnstructuredURLLoader`. I’ve picked a New York Times U.S. news page to demonstrate.

In [4]:
from langchain_community.document_loaders import UnstructuredURLLoader

# List of URLs to scrape
urls = [
    'https://www.nytimes.com/section/us?page=3'
]

# Load the content
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
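
Before printing anything, it can help to check whether every URL actually came back. As far as I can tell, the loader logs an error and skips a URL that fails by default, so `data` may be shorter than `urls`. The helper below is a small sketch of that check; `find_missing_urls` is a name I made up for illustration, and it assumes each document records its URL under `metadata["source"]`.

```python
def find_missing_urls(requested_urls, documents):
    """Return the requested URLs that have no matching document.

    Assumes each document exposes a `metadata` dict with a "source" key
    holding the URL it was loaded from (hypothetical helper, not part of
    LangChain).
    """
    loaded_sources = {doc.metadata.get("source") for doc in documents}
    # Keep the original order of the request list in the result.
    return [url for url in requested_urls if url not in loaded_sources]
```

In this notebook, `find_missing_urls(urls, data)` should return an empty list if the page loaded.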

In [10]:
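# Display the scraped content
for i, doc in enumerate(data):
    # Read the URL from the document itself: if a URL fails to load,
    # `data` is shorter than `urls` and urls[i] would point at the wrong page.
    source = doc.metadata.get("source", "unknown")
    print(f"Content from URL {i+1}: {source}")
    print(doc.page_content[:2000])  # Show first 2000 characters for brevity
    print("\n" + "="*50 + "\n")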

Content from URL 1: https://www.nytimes.com/section/us?page=3
Advertisement

SKIP ADVERTISEMENT

Supported by

SKIP ADVERTISEMENT

U.S. News

Highlights

Buy or Wait? Americans Wrestle With How Tariffs Will Affect Their Shopping.

In the first weekend since President Trump unveiled broad tariffs, many shoppers sought to get ahead of expected price increases, while others showed patience.

By Orlando Mayorquín

A shopper in Marina del Rey, Calif. Many Americans this weekend were out in grocery stores, car dealerships, malls and big discount chains, racing to figure out how to get ahead of the new tariffs plan.

Canada Drops the Gloves in Tariff Spat, Makes Its Case on U.S. Billboards

The tariffs-are-a-tax messages are targeting residents in places like Pittsburgh that count on Canadian trade.

By Billy Witz

A digital billboard outside Pittsburgh paid for by the government of Canada.

After the L.A. Fires, These Schools Face Another Threat: Layoffs

A dozen teachers in Pasadena, Calif.
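
The output above starts with page furniture such as "Advertisement" and "SKIP ADVERTISEMENT" before the real headlines. Here is a minimal sketch of how you might filter those lines out. The `BOILERPLATE_LINES` set and the `strip_boilerplate` function are my own illustrative names, and the set only covers the lines visible in this sample, so expect to extend it for other sites.

```python
# Hypothetical boilerplate list, taken from the sample output in this notebook.
BOILERPLATE_LINES = {"Advertisement", "SKIP ADVERTISEMENT", "Supported by"}


def strip_boilerplate(text, boilerplate=BOILERPLATE_LINES):
    """Drop empty lines and exact-match boilerplate lines from scraped text."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped not in boilerplate:
            kept.append(stripped)
    # Rejoin with blank lines between blocks, like the loader's own output.
    return "\n\n".join(kept)
```

You could then print `strip_boilerplate(doc.page_content)[:2000]` instead of the raw text.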

## Why This is Awesome
- **Simple**: No need to parse HTML manually, as you would with BeautifulSoup.
- **Fast**: A few lines of code take you from URLs to plain text.
- **Scalable**: One call handles a whole list of URLs and returns a document for each page that loads.

This makes quick data collection much easier, whether for research, content analysis, or building datasets!
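
If a dataset is the goal, one simple option is to flatten the loaded documents into plain dictionaries that you can later write to CSV or JSON. This is only a sketch: `documents_to_records` is a hypothetical helper, and it assumes each document has a `page_content` string and a `metadata` dict with a `"source"` key.

```python
def documents_to_records(documents):
    """Turn loaded documents into a list of plain dict records.

    Hypothetical helper: assumes `page_content` (str) and `metadata` (dict)
    attributes on every document.
    """
    records = []
    for doc in documents:
        text = doc.page_content
        records.append(
            {
                "source": doc.metadata.get("source", ""),  # URL the text came from
                "text": text,
                "n_chars": len(text),  # quick size check for each page
            }
        )
    return records
```

Calling `documents_to_records(data)` here would give one record per loaded page.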

## Try It Yourself!
Add your own URLs in the `urls` list and run the code. Let me know how it goes in the comments on GitHub or LinkedIn!
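
When you paste in your own links, a quick cleanup pass can save a failed run. The sketch below trims whitespace, drops anything that is not an `http` or `https` URL, and removes duplicates while keeping the original order. `clean_url_list` is a name I chose for illustration and uses only the standard library.

```python
from urllib.parse import urlparse


def clean_url_list(candidates):
    """Return unique, well-formed http(s) URLs in their original order."""
    seen = set()
    cleaned = []
    for raw in candidates:
        url = raw.strip()
        parts = urlparse(url)
        # Skip entries without a web scheme or a host name.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

You can then pass the result straight to the loader, for example `UnstructuredURLLoader(urls=clean_url_list(urls))`.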