<a href="https://colab.research.google.com/github/gverafei/scraping/blob/main/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Investigating how to scrape the web**
March 2025

## Configure virtual environment

Run the following only the first time. You will then be asked to select the kernel from the upper right corner. Choose the virtual environment we just created.

In [None]:
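!python3 -m venv .venv
# Note: each `!` command runs in its own subshell, so this activation does not persist;
# selecting the .venv kernel is what makes the notebook use the environment.
!source .venv/bin/activate # Linux/Mac
# !.\.venv\Scripts\activate # Windows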
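
To confirm that the notebook kernel is actually running inside the new environment, a quick check like the following can help. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, not part of the original notebook.

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Running inside a virtual environment: {in_virtualenv()}")
print(f"Interpreter: {sys.executable}")
```

If the first line prints `False`, the kernel selected in the upper right corner is probably not the `.venv` one.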

Initialize the repository on GitHub. All of this is done from the terminal.

In [None]:
# !git init
# !git remote add origin https://github.com/gverafei/scraping.git
# !git pull origin main
# git add .
# git commit -m "Initial commit"
# git push --set-upstream origin main

In [None]:
!pip install --upgrade pip --quiet

Install required packages

```bash
pip install html2text
pip install markdownify
```

## Create the initial data

In [111]:
competitor_sites = [
    {
        "name": "Amazon",
        "url": "https://www.amazon.com"
    },
    {
        "name": "UV",
        "url": "https://www.uv.mx"
    },
    {
        "name": "W3C ACT Rules",
        "url": "https://www.w3.org/WAI/standards-guidelines/act/rules/"
    },
]

## Set up cost calculations

The idea is to compare the scrapers side by side.

We can estimate how much each result would cost to process by counting tokens with OpenAI's `tiktoken` library: https://github.com/openai/tiktoken

Prices come from: https://openai.com/api/pricing/
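
The cost formula is the token count divided by one million, multiplied by the price per million tokens. The sketch below walks through that arithmetic, assuming the 2.5 US dollars per million tokens used as the default in this notebook's code (check the pricing page for current values). `cost_for_tokens` is a hypothetical helper, and the token counts are illustrative rather than measured.

```python
def cost_for_tokens(num_tokens: int, price_per_million: float = 2.5) -> float:
    # cost = (tokens / 1,000,000) * price per million tokens
    return (num_tokens / 1_000_000) * price_per_million

# Token counts here are illustrative, not measured from any page.
for tokens in (1_000, 200_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> US${cost_for_tokens(tokens):.6f}")
```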

In [None]:
!pip install tiktoken --quiet

In [None]:
import tiktoken

def count_tokens(input_string: str) -> int:
    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(input_string)
    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 2.5) -> float:
    num_tokens = count_tokens(input_string)
    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens
    return total_cost

# Example usage:
input_string = "Why did the chicken cross the road? Because it wanted to get to the other side."
cost = calculate_cost(input_string)
print(f"The total cost for using gpt-4o is: US${cost:.6f}")

## Table to view the results

To view the results of the comparisons, we install a package for rendering tables on the command line: https://pypi.org/project/prettytable/

We also install a package that shows a nice progress bar in loops: https://pypi.org/project/tqdm/

In [None]:
!pip install prettytable --quiet

In [None]:
!pip install tqdm --quiet

In [103]:
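from typing import Any, Dict, List
from prettytable import PrettyTable
from tqdm import tqdm

# Each scraper entry is {"name": str, "function": callable taking a URL and returning a string}.
# Returns one dict per site: {"provider": site name, "sites": [{"name": scraper name, "content": str}, ...]}.
def view_scraped_content(scrape_url_functions: List[Dict[str, Any]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50) -> List[Dict[str, Any]]: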
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:    
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_snippet = f"{len(content):,} characters retrieved:\n\n" + content_snippet
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = True

    cost_table.max_width = table_max_width
    cost_table.hrules = True

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data
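
The function returns one entry per site, and each entry holds the content produced by every scraper. Note the key names: `provider` stores the site name, while `sites` stores the per-scraper results. A minimal sketch of how that structure could be queried afterwards follows; `get_content` is a hypothetical helper and the sample data is made up for illustration.

```python
from typing import Any, Dict, List, Optional

def get_content(scraped_data: List[Dict[str, Any]], provider: str, scraper_name: str) -> Optional[str]:
    # Walk the nested structure: one dict per site, one entry per scraper.
    for site_data in scraped_data:
        if site_data["provider"] != provider:
            continue
        for entry in site_data["sites"]:
            if entry["name"] == scraper_name:
                return entry["content"]
    return None

# Made-up sample with the same shape as the value returned by view_scraped_content.
sample = [
    {"provider": "UV", "sites": [
        {"name": "Beautiful Soup", "content": "<html>...</html>"},
        {"name": "Playwright", "content": "Error: timeout"},
    ]},
]

print(get_content(sample, "UV", "Beautiful Soup"))
```

Failed scrapes are stored as strings starting with `Error:`, so a caller may want to check for that prefix before passing the content on.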

## Set up all the scrapers

Let's set up all of our scrapers.

## Beautiful Soup

This package is installed from: https://pypi.org/project/beautifulsoup4/

We also install `requests` to make HTTP requests, from: https://pypi.org/project/requests/

In [None]:
!pip install requests beautifulsoup4 --quiet

In [None]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup

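def beautiful_soup_scrape_url(url: str) -> str:
    # A timeout keeps one unresponsive site from stalling the whole comparison.
    response = requests.get(url, timeout=30)
    # Returns the full parsed HTML; no JavaScript is executed.
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)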

## Reader API by Jina AI

This one is also designed specifically for LLMs. Set up Jina AI's scrape method from: https://jina.ai/reader/

In [None]:
import requests

def scrape_jina_ai(url: str) -> str:
    headers = {
        'X-Return-Format': 'html',
        'X-Wait-For-Selector': '.slick-initialized',
        'X-With-Images-Summary': 'all'
    }
    data = {
        'url': url,
        'injectPageScript': [
            'window.scrollTo(0, document.body.scrollHeight)'
        ]
    }

    response = requests.post('https://r.jina.ai/', headers=headers, json=data)
    return response.text

## Playwright

The classic way to scrape. It is not designed specifically for LLMs. From: https://playwright.dev/

In [None]:
!pip install playwright --quiet

In [None]:
!playwright install

In [None]:
!pip install nest_asyncio --quiet

In [None]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from playwright.async_api import async_playwright

async def scrape_playwright(url: str) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        # Wait for the page to load
        await page.wait_for_load_state('domcontentloaded')
        # Run a script to scroll to the bottom of the page
        # await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # The same can be done with the keyboard
        await page.keyboard.press('End')
        # Wait for the scroll to finish
        await page.wait_for_timeout(2000)

        html = await page.content()
        await browser.close()

        return html

def scrape_playwright_sync(url: str) -> str:
    return asyncio.run(scrape_playwright(url))

## Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

This one is designed specifically for LLMs. From: https://docs.crawl4ai.com/

In [None]:
!pip install crawl4ai --quiet
# !pip install git+https://github.com/unclecode/crawl4ai.git@2025-MAR-ALPHA-1 --quiet

In [None]:
!crawl4ai-setup

In [None]:
!pip install nest_asyncio --quiet

In [None]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_conf = BrowserConfig(verbose=False,headless=True)

run_cfg = CrawlerRunConfig(
    wait_until="domcontentloaded",
    scan_full_page=True,
    # wait_for_timeout=2000,
    verbose=False
)

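async def async_scrape_crawl4ai(url: str) -> str:
    # The context manager starts the crawler and closes the browser when done,
    # so repeated calls do not leave browser processes behind.
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url=url,
            config=run_cfg,
        )
        return result.html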

# To run the async function in a synchronous context
# (like this script), you can use asyncio.run() to execute it.
# This is a workaround for running async functions in a sync context.
def scrape_crawl4ai(url: str):
    return asyncio.run(async_scrape_crawl4ai(url))

# print(scrape_crawl4ai("https://www.uv.mx"))
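
The async-to-sync wrapper pattern used for both Playwright and Crawl4AI can be sketched in isolation as follows. `fake_scrape` and `fake_scrape_sync` are hypothetical stand-ins that perform no network access; they only illustrate how `asyncio.run` drives a coroutine from synchronous code.

```python
import asyncio

async def fake_scrape(url: str) -> str:
    # Stand-in for a real async scraper; no network access happens here.
    await asyncio.sleep(0)
    return f"<html><!-- placeholder for {url} --></html>"

def fake_scrape_sync(url: str) -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and then closes the loop. In a notebook, where a loop is already
    # running, this call only works after nest_asyncio.apply().
    return asyncio.run(fake_scrape(url))

print(fake_scrape_sync("https://www.uv.mx"))
```

Because every wrapped scraper then has the same `str -> str` shape, synchronous and asynchronous scrapers can sit in the same list of functions.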

## Main functions to run the comparison

Let's run all the scrapers and display them in our comparison table.

In [112]:
list_of_scraper_functions = [
    {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
    # {"name": "Jina AI", "function": scrape_jina_ai},
    {"name": "Playwright", "function": scrape_playwright_sync},
    {"name": "Crawl4ai", "function": scrape_crawl4ai},
]

all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 700, 40)

Processing site Amazon using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  5.86it/s]
Processing site Amazon using Playwright: 100%|██████████| 1/1 [00:04<00:00,  4.13s/it]
Processing site Amazon using Crawl4ai: 100%|██████████| 1/1 [00:04<00:00,  4.05s/it]
Processing site UV using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  4.13it/s]
Processing site UV using Playwright: 100%|██████████| 1/1 [00:04<00:00,  4.07s/it]
Processing site UV using Crawl4ai: 100%|██████████| 1/1 [00:02<00:00,  2.78s/it]
Processing site W3C ACT Rules using Beautiful Soup: 100%|██████████| 1/1 [00:00<00:00,  9.29it/s]
Processing site W3C ACT Rules using Playwright: 100%|██████████| 1/1 [00:02<00:00,  2.57s/it]
Processing site W3C ACT Rules using Crawl4ai: 100%|██████████| 1/1 [00:02<00:00,  2.95s/it]


Content Table:
+---------------+------------------------------------------+------------------------------------------+------------------------------------------+
|   Site Name   |          Beautiful Soup content          |            Playwright content            |             Crawl4ai content             |
+---------------+------------------------------------------+------------------------------------------+------------------------------------------+
|     Amazon    |       2,559 characters retrieved:        |      802,001 characters retrieved:       |      823,768 characters retrieved:       |
|               |                                          |                                          |                                          |
|               |                   <!--                   |    <!DOCTYPE html><html lang="en-us"     |    <!DOCTYPE html><html lang="en-us"     |
|               |          To discuss automated access to  |     class="a-ws a-js a-audio a-video     |