<a href="https://colab.research.google.com/github/mankotia412vishal/BMS-GUI/blob/main/crawl4ai_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)

[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)

[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)

</div>

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

### **Quickstart with Crawl4AI**

#### 1. **Installation**
Install Crawl4AI and necessary dependencies:

In [None]:
%%capture
!pip install -U crawl4ai
!pip install nest_asyncio

In [None]:
# Check crawl4ai version
import crawl4ai
print(crawl4ai.__version__)

0.4.247


##### Setup Crawl4ai
The following command installs Playwright and its dependencies and updates a few Crawl4AI configuration settings. Afterwards, run the doctor command to verify that everything works as it should.


In [None]:
%%capture
!crawl4ai-setup

In [None]:
!crawl4ai-doctor

[INIT].... → Running Crawl4AI health check...
[INIT].... → Crawl4AI 0.4.247
[TEST].... ℹ Testing crawling capabilities...
[EXPORT].. ℹ Exporting PDF and taking screenshot took 3.09s
[FETCH]... ↓ https://crawl4ai.com... | Status: True | Time: 5.43s
[SCRAPE].. ◆ Processed https://crawl4ai.com... | Time: 183ms
[COMPLETE] ● https://crawl4ai.com... | Status: True | Total: 5.62s
[COMPLETE] ● ✅ Crawling test passed!


In [None]:
# If you run into an error, try installing the browser manually:
# !playwright install --with-deps chrome # Recommended for Colab/Linux

I suggest you first try the code below to ensure that Playwright is installed and works properly.

In [None]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

In [None]:
import asyncio
from playwright.async_api import async_playwright

async def test_browser():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://example.com')
        print(f'Title: {await page.title()}')
        await browser.close()

asyncio.run(test_browser())

Title: Example Domain


#### 2. **Basic Setup and Simple Crawl**

In [None]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def simple_crawl():
    crawler_run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=crawler_run_config
        )
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))  # Print the first 500 characters

asyncio.run(simple_crawl())

[INIT].... → Crawl4AI 0.4.245
[SCRAPE].. ◆ Some images failed to load within timeout
[FETCH]... ↓ https://www.kidocode.com/degrees/technology... | Status: True | Time: 2.78s
[SCRAPE].. ◆ Processed https://www.kidocode.com/degrees/technology... | Time: 166ms
[COMPLETE] ● https://www.kidocode.com/degrees/technology... | Status: True | Total: 2.96s
[![coding school for kids](https://cdn.prod.website-files.com/61d6943d6b5924685ac825ca/64a6a12136e8f756c9df3baa_k-combomark-white.svg)](/) --  -- [Trial Class](/trial-class) --  -- Degrees --  -- degrees --  -- [All Degrees](/degrees) --  -- [AI Degree](/degrees/artificial-intelligence) --  -- [Technology Degree](/degrees/technology) --  -- [Entrepreneurship Degree](/degrees/entrepreneurship) --  -- About Us --  -- About --  -- [Mission](/about) --  -- [Team](/team) --  -- [Contact](/contact) --  -- Community --  -- [Our community](/community) --  -- [Gallery](/gallery) --  -- Join us --  -- [Ca


#### 3. **Dynamic Content Handling**

In [None]:
async def crawl_dynamic_content():
    # You can use wait_for to wait for a condition to be met before returning the result
    # wait_for = """() => {
    #     return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
    # }"""

    # wait_for can be also just a css selector
    # wait_for = "article.tease-card:nth-child(10)"

    async with AsyncWebCrawler() as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            js_code=js_code,
            # wait_for=wait_for,
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,

        )
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))  # Print first 500 characters

asyncio.run(crawl_dynamic_content())

[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 9.90s
[SCRAPE].. ◆ Processed https://www.nbcnews.com/business... | Time: 3501ms
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 13.45s
IE 11 is not supported. For an optimal experience visit our site on another browser. --  -- Skip to Content --  -- [NBC News Logo](https://www.nbcnews.com) --  -- Sponsored By --  --   * [Politics](https://www.nbcnews.com/politics) --   * [U.S. News](https://www.nbcnews.com/us-news) --   * Local --   * [New York](https://www.nbcnews.com/new-york) --   * [Los Angeles](https://www.nbcnews.com/los-angeles) --   * [Chicago](https://www.nbcnews.com/chicago) --   * [Dallas-Fort Worth](https://www.nbcnews.com/dallas-fort-worth) --   * [Philadelph


#### 4. **Content Cleaning and Fit Markdown**

In [None]:
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def clean_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            excluded_tags=['nav', 'footer', 'aside'],
            remove_overlay_elements=True,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
                options={
                    "ignore_links": True
                }
            ),
        )
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            config=config,
        )
        full_markdown_length = len(result.markdown_v2.raw_markdown)
        fit_markdown_length = len(result.markdown_v2.fit_markdown)
        print(f"Full Markdown Length: {full_markdown_length}")
        print(f"Fit Markdown Length: {fit_markdown_length}")


asyncio.run(clean_content())

[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://en.wikipedia.org/wiki/Apple... | Status: True | Time: 3.10s
[SCRAPE].. ◆ Processed https://en.wikipedia.org/wiki/Apple... | Time: 3000ms
[COMPLETE] ● https://en.wikipedia.org/wiki/Apple... | Status: True | Total: 6.16s
Full Markdown Length: 76196
Fit Markdown Length: 69997


#### 5. **Link Analysis and Smart Filtering**

In [None]:

async def link_analysis():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            exclude_external_links=True,
            exclude_social_media_links=True,
            # exclude_domains=["facebook.com", "twitter.com"]
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")

        for link in result.links['internal'][:5]:
            print(f"Href: {link['href']}\nText: {link['text']}\n")


asyncio.run(link_analysis())

[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 0.04s
[SCRAPE].. ◆ Processed https://www.nbcnews.com/business... | Time: 764ms
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 0.85s
Found 146 internal links
Found 58 external links
Href: https://www.nbcnews.com
Text: NBC News Logo

Href: https://www.nbcnews.com/politics
Text: Politics

Href: https://www.nbcnews.com/us-news
Text: U.S. News

Href: https://www.nbcnews.com/new-york
Text: New York

Href: https://www.nbcnews.com/los-angeles
Text: Los Angeles
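
The split between `internal` and `external` links can be pictured with a small standard-library sketch. It is an illustration only: `classify_links` is a hypothetical helper, and the assumption that a link counts as internal when its host equals the page's host is a simplification. Crawl4AI's own rules may differ, for example in how subdomains are treated.

```python
from typing import Dict, List
from urllib.parse import urljoin, urlparse


def classify_links(base_url: str, hrefs: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket links by comparing hosts with the base URL."""
    base_host = urlparse(base_url).netloc.lower()
    buckets: Dict[str, List[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # Resolve relative links first
        host = urlparse(absolute).netloc.lower()
        key = "internal" if host == base_host else "external"
        buckets[key].append(absolute)
    return buckets


links = classify_links(
    "https://www.nbcnews.com/business",
    ["/politics", "https://www.nbcnews.com/us-news", "https://twitter.com/NBCNews"],
)
print(len(links["internal"]), "internal,", len(links["external"]), "external")
```

Resolving each `href` against the page URL before comparing hosts keeps relative links such as `/politics` from being misclassified.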



#### 6. **Media Handling**

In [None]:
async def media_handling():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            exclude_external_images=False,
            # screenshot=True # Set this to True if you want to take a screenshot
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,
        )
        for img in result.media['images'][:5]:
            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")

asyncio.run(media_handling())

[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 0.05s
[SCRAPE].. ◆ Processed https://www.nbcnews.com/business... | Time: 828ms
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 0.91s
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2024-12/241230-Jeju-Air-Boeing-se-1143a-3fe875.jpg, Alt: A Jeju Air Boeing 737-800 taking off from Osaka Kansai, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2024-12/241230-Leonard-Hamilton-fsu-coach-se-1124a-8e1eb9.jpg, Alt: Leonard Hamilton watches his team play against Virginia Tech, on March 13, 2024., Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-12/241230-Jimmy-Carter-aa-1048-c21744.jpg, Alt: Carter at NEA Meeting, LA, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-

#### 7. **Using Hooks for Custom Workflow**

Hooks in Crawl4AI allow you to run custom logic at specific stages of the crawling process. This can be invaluable for scenarios like setting custom headers, logging activities, or processing content before it is returned. Below is an example of a basic workflow using a hook, followed by a complete list of available hooks and explanations on their usage.

In [None]:
from playwright.async_api import BrowserContext, Page

async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
    """Hook called before navigating to each URL."""
    print(f"[HOOK] before_goto - About to visit: {url}")
    # Example: add custom headers for the request
    await page.set_extra_http_headers({
        "Custom-Header": "my-value"
    })
    return page

async def custom_hook_workflow(verbose=True):
    async with AsyncWebCrawler(config=BrowserConfig(verbose=verbose)) as crawler:
        # Set a 'before_goto' hook to run custom code just before navigation
        crawler.crawler_strategy.set_hook("before_goto", before_goto)

        # Perform the crawl operation
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))

asyncio.run(custom_hook_workflow())

[INIT].... → Crawl4AI 0.4.23
[Hook] Preparing to navigate...
[FETCH]... ↓ https://crawl4ai.com... | Status: True | Time: 3.21s
[SCRAPE].. ◆ Processed https://crawl4ai.com... | Time: 97ms
[COMPLETE] ● https://crawl4ai.com... | Status: True | Total: 3.34s
[Crawl4AI Documentation](https://docs.crawl4ai.com/) --  --   * [ Home ](.) --   * [ Installation ](basic/installation/) --   * [ Docker Deployment ](basic/docker-deploymeny/) --   * [ Quick Start ](basic/quickstart/) --   * [ Search ](#) --  --  --  --   * Home --   * [Installation](basic/installation/) --   * [Docker Deployment](basic/docker-deploymeny/) --   * [Quick Start](basic/quickstart/) --   * Changelog & Blog --     * [Blog Home](blog/) --     * [Latest (0.4.2)](blog/releases/0.4.2/) --     * [Changelog](https://github.com/unclecode/cr


List of available hooks and examples for each stage of the crawling process:

- **on_browser_created**
    ```python
    async def on_browser_created_hook(browser, **kwargs):
        print("[Hook] Browser created")
    ```

- **before_goto**
    ```python
    async def before_goto_hook(page, context=None, **kwargs):
        await page.set_extra_http_headers({"X-Test-Header": "test"})
    ```

- **after_goto**
    ```python
    async def after_goto_hook(page, context=None, **kwargs):
        print(f"[Hook] Navigated to {page.url}")
    ```

- **on_execution_started**
    ```python
    async def on_execution_started_hook(page, context=None, **kwargs):
        print("[Hook] JavaScript execution started")
    ```

- **before_return_html**
    ```python
    async def before_return_html_hook(page, html, context=None, **kwargs):
        print(f"[Hook] HTML length: {len(html)}")
    ```
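
To make the hook mechanism concrete, here is a minimal sketch of how a named-hook registry could work. It is a toy model under stated assumptions, not Crawl4AI's implementation: `HookRegistry`, `fake_crawl`, and the string used in place of a browser page are all hypothetical. It shows why hook functions benefit from accepting `**kwargs`: the caller may pass keyword arguments (such as `url`) that a given hook does not use.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Hook = Callable[..., Awaitable[Any]]


class HookRegistry:
    """Hypothetical stand-in for a crawler strategy that supports named hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, Hook] = {}

    def set_hook(self, name: str, hook: Hook) -> None:
        self._hooks[name] = hook

    async def execute_hook(self, name: str, *args: Any, **kwargs: Any) -> Any:
        hook = self._hooks.get(name)
        if hook is None:
            return None  # No hook registered for this stage: do nothing
        return await hook(*args, **kwargs)


visited: List[str] = []


async def before_goto(page: Any, context: Any = None, url: str = "", **kwargs: Any) -> Any:
    # **kwargs absorbs any extra keyword arguments the caller passes
    visited.append(url)
    return page


async def fake_crawl(registry: HookRegistry, url: str) -> str:
    page = "page-object"  # Placeholder for a real browser page
    await registry.execute_hook("before_goto", page, context=None, url=url)
    await registry.execute_hook("after_goto", page, context=None, url=url)  # Unregistered
    return f"fetched {url}"


registry = HookRegistry()
registry.set_hook("before_goto", before_goto)
outcome = asyncio.run(fake_crawl(registry, "https://example.com"))
print(outcome, visited)
```

Stages without a registered hook are skipped silently in this sketch, so a workflow only needs to define the hooks it actually uses.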

#### 8. **Session-Based Crawling**

When to Use Session-Based Crawling:
Session-based crawling is especially beneficial when navigating through multi-page content where each page load needs to maintain the same session context. For instance, in cases where a “Next Page” button must be clicked to load subsequent data, the new data often replaces the previous content. Here, session-based crawling keeps the browser state intact across each interaction, allowing for sequential actions within the same session. An easy way to think about a session is that it acts like a browser tab; when you pass the same session ID, it uses the same browser tab and does not create a new tab.
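
The "session acts like a browser tab" idea can be sketched with a toy model. Everything here is hypothetical (`SessionPool`, `FakeTab`, and the assumption that requests without a session ID get a throwaway tab); it only illustrates the reuse behaviour described above and is not how Crawl4AI manages sessions internally.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeTab:
    """Hypothetical stand-in for a browser tab that remembers what it loaded."""
    tab_id: int
    history: List[str] = field(default_factory=list)


class SessionPool:
    """Toy model: the same session ID always maps to the same tab."""

    def __init__(self) -> None:
        self._tabs: Dict[str, FakeTab] = {}
        self._ids = itertools.count(1)

    def get_tab(self, session_id: Optional[str] = None) -> FakeTab:
        if session_id is None:
            return FakeTab(next(self._ids))  # No session: throwaway tab
        if session_id not in self._tabs:
            self._tabs[session_id] = FakeTab(next(self._ids))
        return self._tabs[session_id]

    def kill_session(self, session_id: str) -> None:
        self._tabs.pop(session_id, None)  # Forget the tab and its state


pool = SessionPool()
first = pool.get_tab("commits")
first.history.append("page 1")
second = pool.get_tab("commits")  # Same ID, so the same tab and its state
second.history.append("page 2")
anonymous = pool.get_tab()  # A fresh tab with no shared state
print(first is second, second.history, anonymous.tab_id)
```

Because the state lives in the tab, releasing the session once the crawl is finished (as the `kill_session` call in the example below does) frees it.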

Example: Multi-Page Navigation Using JavaScript
In this example, we’ll navigate through multiple pages by clicking a "Next Page" button. After each page load, we extract the new content and repeat the process.

In [None]:
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
)
import json

async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")

    async with AsyncWebCrawler() as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []
        last_commit = ""

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => {
                const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
                return commits.length > 0 ? commits[0].textContent.trim() : null;
            };

            const initialCommit = getCurrentCommit();
            const button = document.querySelector('a[data-testid="pagination-next-button"]');
            if (button) button.click();

            // Poll for changes
            while (true) {
                await new Promise(resolve => setTimeout(resolve, 100)); // Wait 100ms
                const newCommit = getCurrentCommit();
                if (newCommit && newCommit !== initialCommit) {
                    break;
                }
            }
        })();
        """

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(2):  # Crawl 2 pages
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
            )
            result = await crawler.arun(
                url=url,
                config=config
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 2 pages")

asyncio.run(crawl_dynamic_content_pages_method_2())


--- Advanced Multi-Page Crawling with JavaScript Execution ---
[INIT].... → Crawl4AI 0.4.23
[FETCH]... ↓ https://github.com/microsoft/TypeScript/commits/ma... | Status: True | Time: 6.77s
[SCRAPE].. ◆ Processed https://github.com/microsoft/TypeScript/commits/ma... | Time: 627ms
[EXTRACT]. ■ Completed for https://github.com/microsoft/TypeScript/commits/ma... | Time: 0.3385833790000561s
[COMPLETE] ● https://github.com/microsoft/TypeScript/commits/ma... | Status: True | Total: 7.80s
Page 1: Found 35 commits
[FETCH]... ↓ https://github.com/microsoft/TypeScript/commits/ma... | Status: True | Time: 3.67s
[SCRAPE].. ◆ Processed https://github.com/microsoft/TypeScript/commits/ma... | Time: 1412ms
[EXTRACT]. ■ Completed for https://github.com/microsoft/TypeScript/commits/ma... | Time: 0.6771934809999038s
[COMPLETE] ● https://github.com/microsoft/TypeScript/commits/ma... | Status: True | Total: 5.94s
Page 2: Found 35 commits
Successfully crawled 70 commits across 2 pages


#### 9. **Using Extraction Strategies**

##### Executing JavaScript & Extract Structured Data without LLMs

In [None]:
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
)
import json
async def extract():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler() as crawler:

        # Create the JavaScript that handles clicking multiple times
        js_click_tabs = """
        (async () => {
            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");

            for(let tab of tabs) {
                // scroll to the tab
                tab.scrollIntoView();
                tab.click();
                // Wait for content to load and animations to complete
                await new Promise(r => setTimeout(r, 500));
            }
        })();
        """

        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=extraction_strategy,
            js_code=[js_click_tabs],
        )
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=config
        )

        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} courses")
        print(json.dumps(courses[0], indent=2))

await extract()


#####  LLM Extraction

This example demonstrates how to use language model-based extraction to retrieve structured data from a pricing page on OpenAI’s site.

In [None]:
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import os, json
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: dict = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Skip if API token is missing (for providers that require it)
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction=(
                    "Extract all model names along with fees for input and output tokens. "
                    "Example: {model_name: 'GPT-4', input_fee: 'US$10.00 / 1M tokens', "
                    "output_fee: 'US$30.00 / 1M tokens'}."
                ),
                **extra_args
            ),
            cache_mode=CacheMode.ENABLED
        )
        print(json.loads(result.extracted_content)[:5])

# Usage:
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
await extract_structured_data_using_llm("openai/gpt-4o-mini", os.getenv("OPENAI_API_KEY"))





--- Extracting Structured Data with openai/gpt-4o-mini ---
[INIT].... → Crawl4AI 0.3.743
[FETCH]... ↓ https://openai.com/api/pricing/... | Status: True | Time: 1.65s
[SCRAPE].. ◆ Processed https://openai.com/api/pricing/... | Time: 4ms
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0
[LOG] Extracted 1 blocks from URL: https://openai.com/api/pricing/ block index: 0
[EXTRACT]. ■ Completed for https://openai.com/api/pricing/... | Time: 9.59899004899944s
[COMPLETE] ● https://openai.com/api/pricing/... | Status: True | Total: 11.41s
[{'model_name': 'GPT-4', 'input_fee': 'US$10.00 / 1M tokens', 'output_fee': 'US$30.00 / 1M tokens', 'error': False}]


**Cosine Similarity Strategy**

This strategy uses semantic clustering to extract relevant content based on contextual similarity, which is helpful when extracting related sections from a single topic.

IMPORTANT: This strategy uses embedding models from HuggingFace. For reasonable response times, make sure to switch to a GPU runtime.
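
As a rough intuition for similarity-based filtering, the sketch below scores text chunks against a query with cosine similarity and keeps those above a threshold. It deliberately swaps the embedding model for plain word counts so that it runs without any downloads; the real strategy works on sentence embeddings and also clusters the chunks, which this sketch does not attempt. The helper names (`bag_of_words`, `cosine_similarity`, `filter_chunks`) are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is similar to nothing
    return dot / (norm_a * norm_b)


def filter_chunks(chunks: List[str], query: str, threshold: float) -> List[str]:
    """Keep chunks whose similarity to the query reaches the threshold."""
    query_vector = bag_of_words(query)
    return [
        chunk for chunk in chunks
        if cosine_similarity(bag_of_words(chunk), query_vector) >= threshold
    ]


chunks = [
    "McDonald's prices rose for consumers",
    "The weather was sunny in Paris",
]
kept = filter_chunks(chunks, "McDonald's consumer prices", threshold=0.3)
print(kept)
```

With word counts, "consumer" and "consumers" do not match at all, which is exactly the kind of gap that embedding models are meant to close.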

In [None]:
from crawl4ai.extraction_strategy import CosineStrategy

async def cosine_similarity_extraction():
    async with AsyncWebCrawler() as crawler:
        strategy = CosineStrategy(
            word_count_threshold=10,
            max_dist=0.2, # Maximum cophenetic distance used to form clusters
            linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
            top_k=3, # Number of top categories to extract
            sim_threshold=0.3, # Similarity threshold for the semantic filter
            semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
            verbose=True
        )

        result = await crawler.arun(
            url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
            extraction_strategy=strategy,
            cache_mode=CacheMode.ENABLED
        )
        print(json.loads(result.extracted_content)[:5])

asyncio.run(cosine_similarity_extraction())


[LOG] Loading Extraction Model for cpu device.



[LOG] Loading Multilabel Classifier for cpu device.



[LOG] Model loaded sentence-transformers/all-MiniLM-L6-v2, models/reuters, took 38.48604083061218 seconds
[LOG] 🚀 Assign tags using cpu
[LOG] 🚀 Categorization done in 1.96 seconds
[{'index': 1, 'tags': ['business_&_entrepreneurs', 'food_&_dining', 'news_&_social_concern'], 'content': 'McDonald’s has faced a customer revolt over pricey Big Macs, an unsolicited cameo in election-season crossfire, and now an E. coli outbreak — just as the company had been luring customers back with more affordable burgers. Despite a difficult quarter, McDonald’s looks resilient in the face of various pressures, analysts say — something the company shares with U.S. consumers overall. “McDonald’s has also done a good job of embedding the brand in popular culture to enhance its relevance and meaning around fun and family. But it also needed to modify the product line to meet the expectations of a consumer who is on a tight budget,” he said. For many consumers, the fast-food giant’s menus serve as an informal

#### 10. **Conclusion and Next Steps**

You’ve explored core features of Crawl4AI, including dynamic content handling, link analysis, and advanced extraction strategies. Visit our documentation for further details on using Crawl4AI’s extensive features.

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

Happy Crawling with Crawl4AI! 🕷️🤖
