## CrawlAI

Crawl4AI simplifies web crawling and data extraction, making it ready to use for LLMs and AI applications.

Here’s why it’s a game-changer:

🆓 Completely free and open-source
🚀 Blazing fast performance, outperforming many paid services
🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
🌍 Supports crawling multiple URLs simultaneously
🎨 Extracts all media tags (Images, Audio, Video)
🔗 Extracts all external and internal links

But that’s not all:

📚 Extracts metadata from pages
🔄 Custom hooks for auth, headers, and page modifications
🕵️ User-agent customization
🖼️ Takes screenshots of pages
📜 Executes custom JavaScript before crawling

In [8]:
import asyncio 
from crawl4ai import AsyncWebCrawler 

url = "https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide"
# url = "https://www.nbcnews.com/business"

async def main(): 
    async with AsyncWebCrawler(verbose=True) as crawler: 
        result = await crawler.arun(url=url) 
        print (result.markdown) 
        
await main()

[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide successfully!
[LOG] 🚀 Crawling done for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, success: True, time taken: 2.25 seconds
[LOG] 🚀 Content extracted for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, success: True, time taken: 0.02 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, time taken: 0.02 seconds.
Documentation  My IBM  Log in

  1. All products
  2. OpenPages
  3. 9.0.0

Change version

Change version9.0.08.3.08.2.08.1.0

Was this topic helpful?

positive feedback

negative feedback

Focus sentine

In [9]:
import asyncio
from crawl4ai import AsyncWebCrawler

async def recursive_scrape(url, depth=2):
    if depth == 0:
        return

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url)
        print(f"Scraped content from {url}:\n", result.markdown)

        # Extract links from the page
        links = result.extracted_content.get('links', [])
        for link in links:
            # Recursively scrape each link
            await recursive_scrape(link, depth - 1)

await recursive_scrape(url)

[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🚀 Content extracted for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, success: True, time taken: 0.02 seconds
[LOG] 🚀 Extraction done for https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide, time taken: 0.02 seconds.
Scraped content from https://www.ibm.com/docs/en/openpages/9.0.0?topic=user-guide:
 Documentation  My IBM  Log in

  1. All products
  2. OpenPages
  3. 9.0.0

Change version

Change version9.0.08.3.08.2.08.1.0

Was this topic helpful?

positive feedback

negative feedback

Focus sentinel

### Rate this content

Close

Great! Let us know what you found helpful.

Comment 0/750

Note: This feedback goes to your product\'s documentation team and does not
include a response. Issues that require a response should go through IBM
support.

CancelSubmit

Focus sentinel

Focus sentinel

### Rate this content

Close

What can we do to improve the content?

Comment 0/750


AttributeError: 'str' object has no attribute 'get'