<a href="https://colab.research.google.com/github/gverafei/scraping/blob/main/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Investigating how to scrape the web**
March 2025

## Configure virtual environment

Run the following only the first time. You will then be asked to select the kernel from the upper right corner; choose the virtual environment we just created.

In [None]:
# !python3 -m venv .venv
# !source .venv/bin/activate # Linux/Mac
# !.\.venv\Scripts\activate # Windows

Initialize the GitHub repository. All of this is done from the terminal.

In [None]:
# !git init
# !git remote add origin https://github.com/gverafei/scraping.git
# !git pull origin main
# git add .
# git commit -m "Initial commit"
# git push --set-upstream origin main

In [None]:
!pip install --upgrade pip --quiet

## Create the initial data

In [1]:
competitor_sites = [
    # {
    #     "name": "Amazon",
    #     "url": "https://www.amazon.com"
    # },
    # {
    #     "name": "UV",
    #     "url": "https://www.uv.mx"
    # },
    # {
    #     "name": "W3C ACT Rules",
    #     "url": "https://www.w3.org/WAI/standards-guidelines/act/rules/"
    # },
    # {
    #     "name": "Chedrahui",
    #     "url": "https://www.chedraui.com.mx"
    # },
    # {
    #     "name": "FEI",
    #     "url": "https://www.uv.mx/fei/"
    # },
    {
        "name": "Sistemas FEI",
        "url": "https://sistemasfei.uv.mx/inicio/"
    }
]

## Set up cost calculations

The idea is to compare the scrapers side by side.

We estimate how much each result would cost as model input by counting tokens with OpenAI's `tiktoken` library: https://github.com/openai/tiktoken

Prices come from: https://openai.com/api/pricing/

In [2]:
!pip install tiktoken --quiet

In [55]:
import tiktoken

def count_tokens(input_string: str) -> int:
    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(input_string)
    return len(tokens)

def calculate_cost(text_or_tokens, cost_per_million_tokens: float = 2.5) -> float:
    # Accepts either raw text (its tokens are counted first) or a token count.
    # Two definitions with the same name would not overload: the second one
    # silently replaces the first, so both cases are handled here.
    if isinstance(text_or_tokens, str):
        num_tokens = count_tokens(text_or_tokens)
    else:
        num_tokens = text_or_tokens
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# Example usage:
# input_string = "Porque la gallina cruzó el camino? Pues porque quería llegar al otro lado."
# cost = calculate_cost(input_string)
# print(f"The total cost for using gpt-4o is: $US {cost:.6f}")

## Table to view the results

To view the comparison results, we install a package that renders tables on the command line: https://pypi.org/project/prettytable/

We also install a package that shows a progress bar in loops: https://pypi.org/project/tqdm/

In [4]:
!pip install prettytable --quiet

In [5]:
!pip install tqdm --quiet

In [59]:
from typing import List, Callable, Dict
from prettytable import PrettyTable
from tqdm import tqdm

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50, to_markdown: bool=False) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:
                    content = scrape_function['function'](site['url'], to_markdown)
                    content_snippet = content[:characters_to_display]
                    content_snippet = f"{len(content):,} characters retrieved:\n\n" + content_snippet
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = True

    cost_table.max_width = table_max_width
    cost_table.hrules = True

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to use this content as input.")
    print(cost_table)

    return scraped_data
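
The list returned by `view_scraped_content` nests each scraper's output under its site, and failures are stored as a content string that starts with `Error:`. A minimal sketch of that shape and of a lookup over it follows; the helper `content_for` and the sample values are hypothetical and not part of the notebook.

```python
from typing import Dict, List, Optional

# Same shape as the list built by view_scraped_content: one entry per site,
# each holding one {"name", "content"} record per scraper.
sample_scraped_data: List[Dict] = [
    {
        "provider": "Sistemas FEI",
        "sites": [
            {"name": "Crawl4ai", "content": "<html>...</html>"},
            {"name": "Playwright", "content": "Error: timeout"},
        ],
    }
]

def content_for(scraped_data: List[Dict], provider: str, scraper_name: str) -> Optional[str]:
    """Return what one scraper produced for one site, or None if absent."""
    for site_entry in scraped_data:
        if site_entry["provider"] != provider:
            continue
        for record in site_entry["sites"]:
            if record["name"] == scraper_name:
                return record["content"]
    return None

print(content_for(sample_scraped_data, "Sistemas FEI", "Crawl4ai"))
```

Looking records up by name rather than by position avoids assuming that every site was processed by the same scrapers in the same order.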

## Set up all the scrapers

Let's set up all of our scrapers.

## Beautiful Soup

This package is installed from: https://pypi.org/project/beautifulsoup4/

We also use `requests` to make the HTTP requests, from: https://pypi.org/project/requests/

In [7]:
!pip install requests beautifulsoup4 --quiet

In [8]:
!pip install markdownify --quiet

In [9]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

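def beautiful_soup_scrape_url(url: str, to_markdown: bool = False) -> str:
    # A timeout keeps one slow site from hanging the whole comparison
    response = requests.get(url, timeout=30)
    # Raise on 4xx/5xx so the failure is reported instead of scraping an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    if to_markdown:
        return md(str(soup)) # Convert to markdown
    return str(soup)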

## Playwright

The classic way to scrape; it is not designed specifically for LLMs. From: https://playwright.dev/

In [10]:
!pip install playwright --quiet

In [11]:
!playwright install

In [12]:
!pip install nest_asyncio --quiet

In [13]:
!pip install markdownify --quiet

In [14]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from markdownify import markdownify as md  # used below when to_markdown=True
from playwright.async_api import async_playwright

async def scrape_playwright(url: str, to_markdown: bool = False) -> str:
    async with async_playwright() as pw:
        browser =  await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        # Wait for the page to load
        await page.wait_for_load_state('domcontentloaded')
        # Run a script to scroll to the bottom of the page
        # await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # The same can be done with the keyboard
        await page.keyboard.press('End')
        # Wait for the scroll to finish
        await page.wait_for_timeout(2000)

        html = await page.content()
        if to_markdown:
            html = md(html) # Convert to markdown
        
        await browser.close()
        return html

def scrape_playwright_sync(url: str, to_markdown: bool = False):
    return asyncio.run(scrape_playwright(url, to_markdown))

# print(scrape_playwright_sync("https://www.amazon.com", to_markdown=True))

## Reader API by Jina AI

This one is also designed specifically for LLMs. Set up Jina AI's scrape method from: https://jina.ai/reader/

In [15]:
import requests

def scrape_jina_ai(url: str, to_markdown: bool = False) -> str:
    headers = {
        'X-Return-Format': 'markdown' if to_markdown else 'html',
        'X-Engine': 'browser',
        'X-Timeout': '30',
        "X-With-Images-Summary": "none" if to_markdown else "all",
    }
    data = {
        'url': url,
        'injectPageScript': [
            'document.addEventListener("mutationIdle", window.simulateScroll);'
        ]
    }
    response = requests.post('https://r.jina.ai/', headers=headers, json=data)
    return response.text

# print(scrape_jina_ai("https://www.uv.mx", to_markdown = True))

## Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

This one is designed specifically to produce LLM-friendly output. From: https://docs.crawl4ai.com/

First we install the prerequisites that Google Colab requires.

In [16]:
!pip install h5py --quiet

In [17]:
!pip install typing-extensions --quiet

In [18]:
!pip install wheel --quiet

Then we can proceed with the installation.

In [19]:
!pip install crawl4ai --quiet

In [20]:
!crawl4ai-setup

[INIT].... → Running post-installation setup...
[INIT].... → Installing Playwright browsers...
[COMPLETE] ● Playwright installation completed successfully.
[INIT].... → Starting database initialization...
[COMPLETE] ● Database initialization completed successfully.
[COMPLETE] ● Post-installation setup completed!

In [21]:
!pip install nest_asyncio --quiet

In [22]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_conf = BrowserConfig(verbose=False,headless=True)

run_cfg = CrawlerRunConfig(
    wait_until="domcontentloaded",
    wait_for_images=True,
    scan_full_page=True,
    verbose=False,
)

async def async_scrape_crawl4ai(url: str, to_markdown: bool = False) -> str:
    # The context manager starts the crawler and closes the browser on exit,
    # so repeated calls do not leave browser processes running.
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url=url,
            config=run_cfg,
        )

    if not to_markdown:
        return result.html

    # Append a list of the images found to the markdown output
    images = result.media.get("images", [])
    images_list = f"\n\nImages found:{len(images)}"
    for i, img in enumerate(images):
        images_list = images_list + f"\n - ![Image {i+1}: {img.get('alt','No description')}]({img.get('src','')})"
        # Example: - ![Image 1: Alt text](https://example.com/image1.jpg)

    return result.markdown + images_list

# To run the async function in a synchronous context
# (like this script), you can use asyncio.run() to execute it.
# This is a workaround for running async functions in a sync context.
def scrape_crawl4ai(url: str, to_markdown: bool = False):
    return asyncio.run(async_scrape_crawl4ai(url, to_markdown))

# print(scrape_crawl4ai("https://www.uv.mx", to_markdown=True))
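
The sync-wrapper pattern used for Playwright and Crawl4AI can be seen in isolation with a stand-in coroutine. This is only a sketch: `fake_scrape` and `fake_scrape_sync` are hypothetical names and no network request is made. In a plain script `asyncio.run` works on its own; the notebook applies `nest_asyncio` first because Jupyter already has an event loop running.

```python
import asyncio

async def fake_scrape(url: str, to_markdown: bool = False) -> str:
    """Hypothetical stand-in for an async scraper; performs no real I/O."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return f"# {url}" if to_markdown else f"<h1>{url}</h1>"

def fake_scrape_sync(url: str, to_markdown: bool = False) -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop, giving callers an ordinary blocking function.
    return asyncio.run(fake_scrape(url, to_markdown))

result = fake_scrape_sync("https://example.com", to_markdown=True)
print(result)
```

Keeping the wrapper signature identical to the other scrapers, `(url, to_markdown)`, is what lets every entry in the comparison list be called the same way.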

## Firecrawl: Turn websites into LLM-ready data

This is another option, but it will not be used because it is a paid service. It is also focused on AI. From: https://www.firecrawl.dev/

## Main functions to run the comparison with HTML

Let's run all the scrapers and display them in our comparison table.

In [23]:
list_of_scraper_functions = [
      # {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
      # {"name": "Jina AI", "function": scrape_jina_ai},
      # {"name": "Playwright", "function": scrape_playwright_sync},
      {"name": "Crawl4ai", "function": scrape_crawl4ai},
      ]

all_content_html = view_scraped_content(list_of_scraper_functions, competitor_sites, 800, 35, to_markdown=False)

Processing site Sistemas FEI using Crawl4ai: 100%|██████████| 1/1 [00:02<00:00,  2.25s/it]


Content Table:
+--------------+-------------------------------------+
|  Site Name   |           Crawl4ai content          |
+--------------+-------------------------------------+
| Sistemas FEI |     10,378 characters retrieved:    |
|              |                                     |
|              |    <html lang="en" style="filter:   |
|              |          invert(0);"><head>         |
|              |            <meta charset="UTF-8">   |
|              |           <meta name="description"  |
|              |     content="Astro description">    |
|              |            <meta name="viewport"    |
|              |    content="width=device-width">    |
|              |               <link rel="icon"      |
|              |         type="image/x-icon"         |
|              |         href="favicon.ico">         |
|              |            <meta name="generator"   |
|              | content="Astro v4.4.1"><link href=" |
|              | https://cdn.jsdelivr.net/npm/boot

## Run the comparison with markdown

Let's run all the scrapers again, this time returning a format that is friendlier for AI.

In [24]:
all_content_md = view_scraped_content(list_of_scraper_functions, competitor_sites, 1200, 35, to_markdown=True)

Processing site Sistemas FEI using Crawl4ai: 100%|██████████| 1/1 [00:01<00:00,  1.86s/it]


Content Table:
+--------------+-------------------------------------+
|  Site Name   |           Crawl4ai content          |
+--------------+-------------------------------------+
| Sistemas FEI |     2,064 characters retrieved:     |
|              |                                     |
|              | [ ![Logo FEI](https://sistemasfei.u |
|              | v.mx/inicio/_astro/logoFEI.CKAzzJnF |
|              |            _ZyfpCW.webp)            |
|              |       ](http://www.uv.mx/fei)       |
|              |       # Portal de sistemas FEI      |
|              | [ ![Logo Universidad Veracruzana](h |
|              | ttps://sistemasfei.uv.mx/inicio/_as |
|              |    tro/logoUV.gPBiyME4_N2yL.webp)   |
|              |         ](http://www.uv.mx)         |
|              | ![foto](https://sistemasfei.uv.mx/i |
|              | nicio/_astro/aulaFEI.Bfi36lLc_Z2aGa |
|              |               TO.webp)              |
|              |  ##### Sistema de Gestion de Clav
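
The gap between the 10,378 characters retrieved as HTML and the 2,064 retrieved as markdown suggests that most of the HTML is markup rather than content. The following is a rough, standard-library-only sketch of the same effect. It is not the conversion that Crawl4AI or markdownify actually perform, and `VisibleTextExtractor`, `visible_text`, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up sample page, not the real site content
sample_html = (
    "<html><head><style>body{margin:0}</style></head>"
    "<body><h1>Portal de sistemas FEI</h1>"
    "<script>console.log('x')</script><p>Bienvenido</p></body></html>"
)
text = visible_text(sample_html)
print(f"{len(sample_html)} characters of HTML -> {len(text)} characters of text")
```

Fewer characters generally means fewer input tokens, which is why the markdown column is expected to be cheaper in the cost table.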

## Connect to OpenAI to send the information

We will send the content as HTML and as markdown, and ask the model to return a section that is accessible according to WCAG 2.2.

In [25]:
!pip install openai --quiet

In [47]:
import getpass
from openai import OpenAI

OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key: ')

client = OpenAI(api_key=OPENAI_API_KEY)

def extract(user_input: str, user_prompt: str) -> tuple[str, int, int]:
    # Adjacent string literals are concatenated without a separator,
    # so each sentence below ends with a space.
    entity_extraction_system_message = {
        "role": "system",
        "content": "You are a helpful assistant, expert in WCAG 2.2 web accessibility, that evaluates and corrects HTML code. "
        "You will be given code and you will analyze it. "
        "Then, you will create a new webpage from that code that is accessible according to WCAG 2.2. "
        "If you include a style.css file, you will print it at the end. "
        "Also, you will provide a list of the changes you made with respect to the original. "
        "You will base the structure on the page https://webaim.org/, which is a good example of an accessible web page."
    }
    # Add the system message to the messages list
    messages = [entity_extraction_system_message]
    # Add the instruction and the scraped content as user messages
    messages.append({"role": "user", "content": user_prompt})
    messages.append({"role": "user", "content": user_input})
    # Call the OpenAI API to get the response
    response = client.chat.completions.create(
        model="gpt-4o",
        stream=False,
        messages=messages,
    )
    # Return the output text together with the token usage:
    # (content, completion_tokens, prompt_tokens)
    return response.choices[0].message.content, response.usage.completion_tokens, response.usage.prompt_tokens
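
When a long system prompt is written as adjacent string literals, Python concatenates them with no separator, so sentences run together unless each one carries its own space. A minimal sketch of one way to keep the spacing explicit follows; `build_messages` and the shortened prompt sentences are hypothetical, and no API call is made.

```python
from typing import Dict, List

# Adjacent literals are joined with no separator ("a." "b." == "a.b."),
# so the sentences are kept in a list and joined with a single space.
SYSTEM_PROMPT_SENTENCES = [
    "You are an assistant expert in WCAG 2.2 web accessibility.",
    "You will be given code and you will analyze it.",
]

def build_messages(system_sentences: List[str], user_prompt: str, user_input: str) -> List[Dict[str, str]]:
    """Build the chat payload: one system message, then instruction and content."""
    return [
        {"role": "system", "content": " ".join(system_sentences)},
        {"role": "user", "content": user_prompt},
        {"role": "user", "content": user_input},
    ]

messages = build_messages(
    SYSTEM_PROMPT_SENTENCES,
    "Make this page accessible.",
    "<p>Hello</p>",
)
print(messages[0]["content"])
```

Building the list in a separate function also makes the payload easy to inspect before any tokens are spent.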

Now we create a function that builds a table comparing the result returned for the HTML input with the result returned for the Markdown input.

In [60]:
from typing import Any

def display_extracted_content(results_html: List[Dict[str, Any]], results_md: List[Dict[str, Any]], function_name: str):
    table = PrettyTable()
    table.field_names = ["Site", "From HTML", "From markdown"]

    # Iterate through each site; results_html and results_md are parallel lists
    for site_html, site_md in tqdm(list(zip(results_html, results_md)), desc="Processing results"):
        sites_html = site_html["sites"]
        sites_md = site_md["sites"]
        provider = site_html["provider"]

        # A separate index name keeps the inner loop from shadowing the outer one
        for j in range(len(sites_html)):
            if sites_html[j]["name"] == function_name:
                # Extract the content for HTML and Markdown
                content_html = sites_html[j]["content"]
                content_md = sites_md[j]["content"]

                # Progress bar for each function
                for _ in tqdm(range(1), desc=f"Extracting content with {function_name} for HTML input"):
                    extracted_content_html,completion_tokens, prompt_tokens = extract(content_html, "Extract the section body of the following HTML code and create a new accessible web page version with header, content and footer.")
                    cost = calculate_cost(completion_tokens + prompt_tokens)
                    extracted_content_html = f"Completion tokens: {completion_tokens:,}\nPrompt tokens: {prompt_tokens:,}\nTotal cost:${cost:.6f}" + "\n\n\n" + extracted_content_html

                # Progress bar for each function
                for _ in tqdm(range(1), desc=f"Extracting content with {function_name} for Markdown input"):
                    extracted_content_md, completion_tokens, prompt_tokens = extract(content_md,"Use the following content to create a new accessible web page version with header, content and footer.")
                    cost = calculate_cost(completion_tokens + prompt_tokens)
                    extracted_content_md =  f"Completion tokens: {completion_tokens:,}\nPrompt tokens: {prompt_tokens:,}\nTotal cost:${cost:.6f}" + "\n\n\n" + extracted_content_md

                table.add_row([provider, extracted_content_html, extracted_content_md])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = True  # Add horizontal rules for better readability

    print("Extracted Content Table:")
    print(table)
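
The function above prices prompt and completion tokens at the same rate. The pricing page lists input and output tokens separately, so a closer estimate would apply one rate to each. This is a sketch under stated assumptions: `split_cost` is a hypothetical helper, and both rates below are placeholders to be replaced with the values from https://openai.com/api/pricing/.

```python
def split_cost(prompt_tokens: int, completion_tokens: int,
               input_price_per_million: float,
               output_price_per_million: float) -> float:
    """Price prompt (input) and completion (output) tokens at separate rates."""
    input_cost = (prompt_tokens / 1_000_000) * input_price_per_million
    output_cost = (completion_tokens / 1_000_000) * output_price_per_million
    return input_cost + output_cost

# Placeholder rates in US dollars per million tokens (assumptions, not quotes)
ASSUMED_INPUT_PRICE = 2.5
ASSUMED_OUTPUT_PRICE = 10.0

# Token counts taken from the HTML run reported in the table below
cost = split_cost(3_049, 1_183, ASSUMED_INPUT_PRICE, ASSUMED_OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.6f}")
```

With separate rates, a run that sends few prompt tokens but receives a long completion can end up costing more than the single-rate figure suggests.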

In [57]:
display_extracted_content( all_content_html, all_content_md, "Crawl4ai")

Processing results: 0it [00:00, ?it/s]
Extracting content with Crawl4ai for HTML input:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Crawl4ai for HTML input: 100%|██████████| 1/1 [00:10<00:00, 10.30s/it]

Extracting content with Crawl4ai for Markdown input:   0%|          | 0/1 [00:00<?, ?it/s]
Extracting content with Crawl4ai for Markdown input: 100%|██████████| 1/1 [00:21<00:00, 21.62s/it]
Processing results: 1it [00:31, 31.93s/it]


Extracted Content Table:
+--------------+----------------------------------------------------+----------------------------------------------------+
|     Site     |                     From HTML                      |                   From markdown                    |
+--------------+----------------------------------------------------+----------------------------------------------------+
| Sistemas FEI |              Completion tokens: 1183               |              Completion tokens: 1778               |
|              |                Prompt tokens: 3049                 |                 Prompt tokens: 881                 |
|              |                  $cost:$0.010580                   |                  $cost:$0.006647                   |
|              |                                                    |                                                    |
|              |                                                    |                                             