<a href="https://colab.research.google.com/github/gverafei/scraping/blob/main/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Investigating how to scrape the web**
April 2025



### Initial method

<center><img src="images/metodo.jpg" style="margin:auto; width:90%"/></center>

### Simple method as control

<center><img src="images/metodo-slim1.jpg" style="margin:auto; width:50%"/></center>

### Proposed method v1

<center><img src="images/metodo-slim2.jpg" style="margin:auto; width:50%"/></center>


### Scraping WCAG

We need to store all of this knowledge somehow: https://www.w3.org/WAI/WCAG22/Understanding/

Or this knowledge: https://www.w3.org/WAI/standards-guidelines/act/rules/

<center><img src="images/metodo-craw.jpg" style="margin:auto; width:50%"/></center>

## Configure virtual environment

Run the following only the first time. You will be asked to select the kernel from the upper right corner. Choose the virtual environment we just created.

In [None]:
# !python3 -m venv .venv
# !source .venv/bin/activate # Linux/Mac
# !.\.venv\Scripts\activate # Windows

Initialize the repository on GitHub. All of this is done from the terminal.

In [None]:
# !git init
# !git remote add origin https://github.com/gverafei/scraping.git
# !git pull origin main
# git add .
# git commit -m "Initial commit"
# git push --set-upstream origin main

In [None]:
!pip install --upgrade pip --quiet

## Create the initial data

In [None]:
test_sites = [
    {
        "name": "Amazon",
        "url": "https://www.amazon.com"
    },
    # {
    #     "name": "UV",
    #     "url": "https://www.uv.mx"
    # },
    {
        "name": "W3C ACT Rules",
        "url": "https://www.w3.org/WAI/standards-guidelines/act/rules/"
    },
    # {
    #     "name": "W3C WCAG 2.2",
    #     "url": "https://www.w3.org/WAI/WCAG22/Understanding/"
    # },
    # {
    #     "name": "Chedraui",
    #     "url": "https://www.chedraui.com.mx"
    # },
    # {
    #     "name": "FEI",
    #     "url": "https://www.uv.mx/fei/"
    # },
    {
        "name": "Sistemas FEI",
        "url": "https://sistemasfei.uv.mx/inicio/"
    }
]

## Set up cost calculations

The idea is to compare the scrapers side by side.

We can calculate how much it'll cost by using OpenAI's `tiktoken` library from: https://github.com/openai/tiktoken

And costs from: https://openai.com/api/pricing/

In [None]:
!pip install tiktoken --quiet

In [None]:
import tiktoken

def count_tokens(input_string: str) -> int:
    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(input_string)
    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 2.5) -> tuple:
    num_tokens = count_tokens(input_string)
    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens
    return total_cost, num_tokens

def calculate_cost_tokens(num_tokens: int, cost_per_million_tokens: float = 2.5) -> float:
    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens
    return total_cost

# Example usage:
# input_string = "Why did the chicken cross the road? Because it wanted to get to the other side."
# cost, num_tokens = calculate_cost(input_string)
# print(f"The total cost for using gpt-4o is: $US {cost:.6f} ({num_tokens:,} tokens)")
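As a quick sanity check of the arithmetic above, the sketch below applies the same formula to fixed token counts, so it runs without `tiktoken`. The helper name `cost_for_tokens` and the sample token counts are illustrative assumptions; the rate of 2.5 US dollars per million tokens is simply the default already used in `calculate_cost`.

```python
def cost_for_tokens(num_tokens: int, cost_per_million_tokens: float = 2.5) -> float:
    """Same formula as calculate_cost_tokens: tokens / 1e6 * price per million."""
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# Hypothetical token counts, chosen only to illustrate the scale of the costs.
for tokens in (1_000, 100_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ${cost_for_tokens(tokens):.6f}")
```

Because the cost is linear in the number of tokens, any reduction in the size of the scraped content translates directly into the same proportional saving.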

## Table to view the results

Now, to view the results of the comparisons, we install a package for displaying tables on the command line: https://pypi.org/project/prettytable/

We also install a package that shows a nice progress bar in loops: https://pypi.org/project/tqdm/

In [None]:
!pip install prettytable --quiet

In [None]:
!pip install tqdm --quiet

In [None]:
from typing import List, Callable, Dict
from prettytable import PrettyTable
from tqdm import tqdm

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50, to_markdown: bool=False) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                content = scrape_function['function'](site['url'], to_markdown)
                content_snippet = content[:characters_to_display]
                content_snippet = f"{len(content):,} characters retrieved:\n\n" + content_snippet
                content_row.append(content_snippet)

                # Use a distinct name so the count_tokens() function is not shadowed
                cost, num_tokens = calculate_cost(content)
                cost_row.append(f"${cost:.6f} (tokens: {num_tokens:,})")

                site_data["sites"].append({"name": function_name, "content": content})

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = True

    cost_table.max_width = table_max_width
    cost_table.hrules = True

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to send this content to gpt-4o as input.")
    print(cost_table)

    return scraped_data
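To make the contract of `view_scraped_content` explicit: each scraper is a dict with a `name` and a `function`, and the function is called as `function(url, to_markdown)` and must return a string. The sketch below reproduces only the data-collection part, without the tables or progress bars, using a stub scraper instead of a real one. `stub_scraper` and `collect` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

def stub_scraper(url: str, to_markdown: bool = False) -> str:
    """Hypothetical scraper that returns canned content instead of fetching."""
    return f"# {url}" if to_markdown else f"<h1>{url}</h1>"

def collect(scrapers: List[Dict], sites: List[Dict[str, str]], to_markdown: bool = False) -> List[Dict]:
    """Build the same nested structure that view_scraped_content returns."""
    scraped = []
    for site in sites:
        entry = {"provider": site["name"], "sites": []}
        for scraper in scrapers:
            content = scraper["function"](site["url"], to_markdown)
            entry["sites"].append({"name": scraper["name"], "content": content})
        scraped.append(entry)
    return scraped

data = collect(
    [{"name": "Stub", "function": stub_scraper}],
    [{"name": "Example", "url": "https://example.com"}],
)
print(data)
```

A stub like this can be handy for checking the table layout before spending time on real network requests.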

## Set up all the scrapers

Let's set up all of our scrapers.

## Beautiful Soup

Install this package from: https://pypi.org/project/beautifulsoup4/

And also `requests`, to make HTTP requests, from: https://pypi.org/project/requests/

In [None]:
!pip install requests beautifulsoup4 --quiet

In [None]:
!pip install markdownify --quiet

In [None]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def beautiful_soup_scrape_url(url: str, to_markdown: bool = False) -> str:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    if to_markdown:
        return md(str(soup)) # Convert to markdown
    return str(soup)

## Playwright

The classic way to scrape. It is not designed specifically for LLMs. From: https://playwright.dev/

In [None]:
!pip install playwright --quiet

In [None]:
!playwright install

In [None]:
!pip install nest_asyncio --quiet

In [None]:
!pip install markdownify --quiet

In [None]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from markdownify import markdownify as md
from playwright.async_api import async_playwright

async def scrape_playwright(url: str, to_markdown: bool = False) -> str:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        # Wait for the page to load
        await page.wait_for_load_state('domcontentloaded')
        # Alternative: run a script to scroll to the bottom of the page
        # await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # Here the same is done with the keyboard
        await page.keyboard.press('End')
        # Wait for the scroll to finish
        await page.wait_for_timeout(2000)

        html = await page.content()
        if to_markdown:
            html = md(html) # Convert to markdown

        await browser.close()
        return html

def scrape_playwright_sync(url: str, to_markdown: bool = False):
    return asyncio.run(scrape_playwright(url, to_markdown))

# print(scrape_playwright_sync("https://www.amazon.com", to_markdown=True))
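The synchronous wrapper above relies on `asyncio.run`, which creates an event loop, runs the coroutine to completion, and closes the loop. The sketch below shows the same pattern with a stand-in coroutine so that it runs without a browser. `fake_async_scrape` and `fake_scrape_sync` are hypothetical names. Note that `asyncio.run` raises a `RuntimeError` if an event loop is already running in the same thread, which is why the notebook applies `nest_asyncio` first.

```python
import asyncio

async def fake_async_scrape(url: str, to_markdown: bool = False) -> str:
    """Hypothetical stand-in for an async scraper such as scrape_playwright."""
    await asyncio.sleep(0)  # Yield to the event loop, as real I/O would
    return f"markdown for {url}" if to_markdown else f"<html>{url}</html>"

def fake_scrape_sync(url: str, to_markdown: bool = False) -> str:
    """Synchronous wrapper: run the coroutine to completion on a fresh event loop."""
    return asyncio.run(fake_async_scrape(url, to_markdown))

if __name__ == "__main__":
    print(fake_scrape_sync("https://example.com", to_markdown=True))
```

Wrapping the coroutine this way lets async scrapers share the same `function(url, to_markdown)` signature as the synchronous ones in the comparison table.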

## Reader API by Jina AI

This one is also designed specifically for LLMs. Set up Jina AI's scrape method from: https://jina.ai/reader/

In [None]:
import requests

def scrape_jina_ai(url: str, to_markdown: bool = False) -> str:
    headers = {
        'X-Return-Format': 'markdown' if to_markdown else 'html',
        'X-Engine': 'browser',
        'X-Timeout': '30',
        "X-With-Images-Summary": "none" if to_markdown else "all",
    }
    data = {
        'url': url,
        'injectPageScript': [
            'document.addEventListener("mutationIdle", window.simulateScroll);'
        ]
    }
    response = requests.post('https://r.jina.ai/', headers=headers, json=data)
    return response.text

# print(scrape_jina_ai("https://www.uv.mx", to_markdown = True))

## Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

This one is designed specifically to produce LLM-friendly output. From: https://docs.crawl4ai.com/

First we install the prerequisites that Google Colab requires.

In [None]:
!pip install h5py --quiet

In [None]:
!pip install typing-extensions --quiet

In [None]:
!pip install wheel --quiet

After that, we can perform the installation.

In [None]:
!pip install crawl4ai --quiet

In [None]:
!crawl4ai-setup

In [None]:
!pip install nest_asyncio --quiet

In [None]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_conf = BrowserConfig(verbose=False,headless=True)

run_cfg = CrawlerRunConfig(
    wait_until="domcontentloaded",
    wait_for_images=True,
    scan_full_page=True,
    verbose=False,
)

async def async_scrape_crawl4ai(url: str, to_markdown: bool = False) -> str:
    # The context manager starts the browser and closes it when the block exits
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url=url,
            config=run_cfg,
        )

    if not to_markdown:
        return result.html

    # Append a Markdown list of all the images found on the page
    images = result.media.get("images", [])
    images_list = f"\n\nImages found:{len(images)}"
    for i, img in enumerate(images):
        images_list = images_list + f"\n - ![Image {i+1}: {img.get('alt','No description')}]({img.get('src','')})"
        # Example: - ![Image 1: Alt text](https://example.com/image1.jpg)

    return result.markdown + images_list

# To run the async function in a synchronous context
# (like this script), you can use asyncio.run() to execute it.
# This is a workaround for running async functions in a sync context.
def scrape_crawl4ai(url: str, to_markdown: bool = False):
    return asyncio.run(async_scrape_crawl4ai(url, to_markdown))

# print(scrape_crawl4ai("https://www.uv.mx", to_markdown=True))
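The image list appended to the Markdown output can be checked in isolation. The sketch below mirrors the formatting loop in `async_scrape_crawl4ai` using plain dicts, so it runs without Crawl4AI. `format_images` and the sample data are hypothetical; the only assumption about the real data is that each image is a dict that may carry `alt` and `src` keys, as the notebook code already expects.

```python
from typing import Dict, List

def format_images(images: List[Dict[str, str]]) -> str:
    """Render image dicts as a Markdown list, mirroring the loop in async_scrape_crawl4ai."""
    lines = [f"\n\nImages found:{len(images)}"]
    for i, img in enumerate(images, start=1):
        alt = img.get("alt", "No description")
        src = img.get("src", "")
        lines.append(f"\n - ![Image {i}: {alt}]({src})")
    return "".join(lines)

# Hypothetical sample in the shape the notebook reads from result.media["images"].
sample = [
    {"alt": "Logo", "src": "https://example.com/logo.png"},
    {"src": "https://example.com/banner.jpg"},
]
print(format_images(sample))
```

Keeping the alternative text next to each image URL gives the model the information it needs to judge whether the original page provided adequate descriptions.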

## Firecrawl: Turn websites into LLM-ready data

This is another option, which will not be used because it is a paid service. It is also focused on AI. From: https://www.firecrawl.dev/

In [None]:
!pip install firecrawl-py --quiet

In [None]:
import getpass
from firecrawl import FirecrawlApp

def scrape_firecrawl(url: str, to_markdown: bool = False) -> str:
    FIRECRAWL_API_KEY = getpass.getpass('Enter your FireCrawl API key: ')
    app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

    # Crawl a website:
    scrape_result = app.scrape_url(
        url, 
        params={
            'formats': ['html' if not to_markdown else 'markdown'],
            'waitFor': 2000,
            'actions': [
                {"type": "executeJavascript", "script": "window.scrollTo(0, document.body.scrollHeight)"},
                {"type": "wait", "milliseconds": 2000},
            ]   
        }
    )

    # Get the content
    if not to_markdown:
        return scrape_result['html']
    else:
        return scrape_result['markdown']
    
# print(scrape_firecrawl("https://www.uv.mx", to_markdown=True))

## Main functions to run the comparison with HTML

Let's run all the scrapers and display them in our comparison table.

In [None]:
list_of_scraper_functions = [
      {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
      {"name": "Jina AI", "function": scrape_jina_ai},
      {"name": "Playwright", "function": scrape_playwright_sync},
      {"name": "Crawl4ai", "function": scrape_crawl4ai},
      ]

all_content_html = view_scraped_content(list_of_scraper_functions, test_sites, 800, 35, to_markdown=False)

## Run the comparison with Markdown

We will run all the scrapers again, but this time they return a format that is friendlier for the AI.

In [None]:
all_content_md = view_scraped_content(list_of_scraper_functions, test_sites, 1200, 35, to_markdown=True)
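One way to summarize the two runs is to compare the size of each scraper's HTML output with its Markdown output. The sketch below computes the percentage of characters saved, assuming both result lists have the structure returned by `view_scraped_content` and list sites and scrapers in the same order. `size_reduction` and the sample data are hypothetical, and the measure is characters rather than tokens, so it is only a rough proxy for cost.

```python
from typing import Dict, List

def size_reduction(results_html: List[Dict], results_md: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Percentage of characters saved by Markdown, per site and per scraper."""
    report: Dict[str, Dict[str, float]] = {}
    for site_html, site_md in zip(results_html, results_md):
        per_scraper = {}
        for entry_html, entry_md in zip(site_html["sites"], site_md["sites"]):
            html_len = len(entry_html["content"])
            md_len = len(entry_md["content"])
            # Guard against empty HTML to avoid dividing by zero
            saved = 0.0 if html_len == 0 else 100.0 * (html_len - md_len) / html_len
            per_scraper[entry_html["name"]] = round(saved, 1)
        report[site_html["provider"]] = per_scraper
    return report

# Hypothetical data in the structure returned by view_scraped_content.
sample_html = [{"provider": "Example", "sites": [{"name": "Stub", "content": "x" * 200}]}]
sample_md = [{"provider": "Example", "sites": [{"name": "Stub", "content": "x" * 50}]}]
print(size_reduction(sample_html, sample_md))
```

With the real results, the call would be `size_reduction(all_content_html, all_content_md)`.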

## Connect to the LLMs to evaluate whether they can create an accessible web page

We will send the content in HTML and in Markdown and ask the model to return a section that is accessible according to WCAG 2.2.

In [None]:
!pip install openai --quiet

In [None]:
import getpass
from openai import OpenAI

def extract(model: str, user_input: str, user_prompt: str, template: str = None) -> tuple:
    # Both providers are reached through the OpenAI-compatible client
    if model == "gpt-4o":
        OPENAI_API_KEY = getpass.getpass('Enter your OpenAI API key for gpt-4o: ')
        client = OpenAI(api_key=OPENAI_API_KEY)
    elif model == "gemini-2.0-flash":
        GOOGLE_API_KEY = getpass.getpass('Enter your Google API key for gemini-2.0-flash: ')
        client = OpenAI(api_key=GOOGLE_API_KEY, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
    else:
        raise ValueError(f"Unsupported model: {model}")

    entity_extraction_system_message = {
        "role": "system",
        "content": """
        1. You are a helpful assistant, expert in web accessibility (WCAG), that evaluates and corrects HTML code.
        2. You will be given code and you will analyze it.
        3. Then, you will create a new webpage from that code but accessible according to WCAG https://www.w3.org/WAI/WCAG22/Understanding/
        4. Check all rules including color contrast, alt text, and semantic HTML.
        5. Use absolute URLs for images that have relative ones.
        6. If you include a style.css file, you will add the rules inline in the head section.
        7. Also, you will provide a list of the procedure you followed with respect to the original, in markdown format.
        8. Do not escape the HTML code, just return it as a string. For example, do not add "\\n" or \\" to the HTML code.
        9. Return the result as a JSON with values: {Procedure: str, HTML: str}
        """
    }
    # Add the system message to the messages list
    messages = [entity_extraction_system_message]
    # Add the content to the messages list
    messages.append({"role": "user", "content": user_prompt})
    messages.append({"role": "user", "content": user_input})
    if template:
        # Send the template once, together with the instruction on how to use it
        messages.append({"role": "user", "content": "The following is a template as a base for the HTML code you will generate with the content. Use bootstrap classes to make it responsive and accessible. " + template})
    # Call the OpenAI API to get the response
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,
        stream=False,
        messages=messages,
        response_format={"type": "json_object"},
    )

    # Return the generated content together with the token usage
    return response.choices[0].message.content, response.usage.completion_tokens, response.usage.prompt_tokens
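The system prompt asks the model for a JSON object with the keys `Procedure` and `HTML`, but a model can still return malformed or incomplete JSON. The sketch below is one possible way to validate the reply before using it. `parse_model_response` and the sample reply are hypothetical and are not part of the notebook.

```python
import json
from typing import Dict

def parse_model_response(raw: str) -> Dict[str, str]:
    """Parse the JSON reply and check that both expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    missing = [key for key in ("Procedure", "HTML") if key not in data]
    if missing:
        raise ValueError(f"Missing keys in model response: {missing}")
    return data

# Hypothetical reply, shaped as the system prompt requests.
reply = '{"Procedure": "- Added alt text", "HTML": "<main><h1>Title</h1></main>"}'
parsed = parse_model_response(reply)
print(parsed["HTML"])
```

Failing early with a clear message is usually easier to debug than a `KeyError` raised later when the interface tries to render the result.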

## Function to compare the results

Now we will create a function that builds a table with the results of comparing the output returned from HTML and from Markdown.

In [None]:
import json
from typing import Any

def display_extracted_content(model: str, results_html: List[Dict[str, Any]], results_md: List[Dict[str, Any]], function_name: str, site_name: str):
    table = PrettyTable()
    table.field_names = ["Site", "From HTML", "From markdown", "From markdown with template"]

    with open('templates/tem001.html', 'r') as file:  # r to open file in READ mode
        html_as_string = file.read()

    # Iterate through each site and its content
    for i, result in tqdm(enumerate(results_html), total=len(results_html), desc="Processing results"):
        sites_html = result["sites"]
        sites_md = results_md[i]["sites"]
        provider = result["provider"]

        # Check if the provider matches the site name
        if provider == site_name:
            # Use j so the outer index i is not shadowed
            for j in range(len(sites_html)):
                # Check if the function name matches
                if sites_html[j]["name"] == function_name:
                    # Extract the content for HTML and Markdown
                    content_html = sites_html[j]["content"]
                    content_md = sites_md[j]["content"]

                    # Progress bar for each function
                    for _ in tqdm(range(1), desc=f"Extracting content with {function_name} for HTML input"):
                        extracted_content_html, completion_tokens, prompt_tokens = extract(model, content_html, "Use the following HTML code and create a new accessible web page version maintaining all the contents. The absolute URL is https://sistemasfei.uv.mx/inicio/")
                        cost = calculate_cost_tokens(completion_tokens + prompt_tokens)
                        cost_label = f"Completion tokens: {completion_tokens:,}\nPrompt tokens: {prompt_tokens:,}\nTotal cost:${cost:.6f}" + "\n\n\n"
                        col_content_html = cost_label + extracted_content_html

                    # Progress bar for each function
                    for _ in tqdm(range(1), desc=f"Extracting content with {function_name} for Markdown input"):
                        extracted_content_md, completion_tokens, prompt_tokens = extract(model, content_md, "Use the following content to create a new accessible web page version. Use the code from https://webaim.org/ as a base to create a new accessible web page version. Observe the structure of the header, where the images are in one row with the title; also the use of cols and rows to make it responsive and accessible.", template=None)
                        cost = calculate_cost_tokens(completion_tokens + prompt_tokens)
                        cost_label = f"Completion tokens: {completion_tokens:,}\nPrompt tokens: {prompt_tokens:,}\nTotal cost:${cost:.6f}" + "\n\n\n"
                        col_content_md = cost_label + extracted_content_md

                    # Progress bar for each function
                    for _ in tqdm(range(1), desc=f"Extracting content with {function_name} for Markdown input and template"):
                        extracted_content_md_template, completion_tokens, prompt_tokens = extract(model, content_md, "Use the following content to create a new accessible web page version.",html_as_string.replace("\n",""))
                        cost = calculate_cost_tokens(completion_tokens + prompt_tokens)
                        cost_label = f"Completion tokens: {completion_tokens:,}\nPrompt tokens: {prompt_tokens:,}\nTotal cost:${cost:.6f}" + "\n\n\n"
                        col_content_md_template = cost_label + extracted_content_md_template

                    table.add_row([provider, col_content_html, col_content_md, col_content_md_template])

    table.max_width = 50  # Set the maximum width for better display
    table.hrules = True  # Add horizontal rules for better readability

    print("Extracted Content Table:")
    print(table)

    return json.loads(extracted_content_html), json.loads(extracted_content_md), json.loads(extracted_content_md_template)
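Note that `calculate_cost_tokens` prices prompt and completion tokens at the same rate. Providers commonly charge a different rate for output tokens, so the totals in the table may understate the real cost. The sketch below prices the two separately. `cost_with_split_rates` is a hypothetical helper, and the output rate of 10.0 US dollars per million tokens is an assumption for illustration that should be checked against https://openai.com/api/pricing/.

```python
def cost_with_split_rates(
    prompt_tokens: int,
    completion_tokens: int,
    input_per_million: float = 2.5,    # Default used by calculate_cost_tokens
    output_per_million: float = 10.0,  # Assumed output rate; verify against the pricing page
) -> float:
    """Price prompt and completion tokens at separate per-million rates."""
    input_cost = (prompt_tokens / 1_000_000) * input_per_million
    output_cost = (completion_tokens / 1_000_000) * output_per_million
    return input_cost + output_cost

# Hypothetical usage figures, for illustration only.
print(f"${cost_with_split_rates(prompt_tokens=20_000, completion_tokens=4_000):.6f}")
```

The rates for `gemini-2.0-flash` differ as well, so passing model-specific rates would make the side-by-side comparison more faithful.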

## OpenAI

Let's review the result this model produces.

In [None]:
extracted_json_html, extracted_json_md, extracted_json_md_template = display_extracted_content("gpt-4o", all_content_html, all_content_md, "Crawl4ai", "Sistemas FEI")

### Visualize the results obtained

We install the dependencies needed to generate the web interface.

In [None]:
!pip install ipywidgets --quiet

In [None]:
!pip install aiofiles --quiet

In [None]:
!pip install gradio --quiet

We create a helper function to build the web interface and view the results visually.

In [None]:
# import json
import gradio as gr

def create_interface():
    # Load the JSON data
    # json_data_html = json.loads(extracted_json_html)
    # json_data_md = json.loads(extracted_json_md)
    # json_data_md_template = json.loads(extracted_json_md_template)

    with gr.Blocks(theme=gr.themes.Default()) as demo:
        # From HTML
        with gr.Tab("Proc HTML"):
            gr.Markdown(extracted_json_html["Procedure"], label="Procedure")
        with gr.Tab("Code HTML"):
            gr.TextArea(extracted_json_html["HTML"], label="HTML code", show_copy_button=True, lines=20)
        with gr.Tab("Res HTML"):
            gr.HTML(extracted_json_html["HTML"])
        # From Markdown
        with gr.Tab("Proc MD"):
            gr.Markdown(extracted_json_md["Procedure"], label="Procedure")
        with gr.Tab("Code MD"):
            gr.TextArea(extracted_json_md["HTML"], label="HTML code", show_copy_button=True, lines=20)
        with gr.Tab("Res MD"):
            gr.HTML(extracted_json_md["HTML"])
        # From Markdown with template
        with gr.Tab("Proc MD with template"):
            gr.Markdown(extracted_json_md_template["Procedure"], label="Procedure")
        with gr.Tab("Code MD with template"):
            gr.TextArea(extracted_json_md_template["HTML"], label="HTML code", show_copy_button=True, lines=20)
        with gr.Tab("Res MD with template"):
            gr.HTML(extracted_json_md_template["HTML"])

    demo.launch()

In [None]:
create_interface()

## Google Gemini

Let's review the result this model produces.

In [None]:
extracted_json_html, extracted_json_md, extracted_json_md_template = display_extracted_content("gemini-2.0-flash", all_content_html, all_content_md, "Crawl4ai", "Sistemas FEI")

### Visualize the results obtained

In [None]:
create_interface()

In [None]:
# import os

# os.makedirs("output", exist_ok=True)

# with open('output/from_html.html', "w+", encoding="utf-8") as f:
#     f.write(extracted_json_html["HTML"])

# with open('output/from_markdown.html', "w+", encoding="utf-8") as f:
#     f.write(extracted_json_md["HTML"])

# with open('output/from_markdown_template.html', "w+", encoding="utf-8") as f:
#     f.write(extracted_json_md_template["HTML"])

In [None]:
!pip install flask --quiet

In [None]:
from flask import Flask
app = Flask(__name__)

# Serve the latest results; the extracted_json_* variables are set by display_extracted_content
@app.route('/html')
def index_html():
    return extracted_json_html["HTML"]

@app.route('/md')
def index_md():
    return extracted_json_md["HTML"]

@app.route('/template')
def index_template():
    return extracted_json_md_template["HTML"]

app.run(port=5000)

From HTML: http://127.0.0.1:5000/html

From Markdown: http://127.0.0.1:5000/md

From Template: http://127.0.0.1:5000/template


### Retrieval-Augmented Generation (RAG)

<center><img src="images/rag.gif" style="margin:auto; width:50%"/></center>

### Agentic Retrieval-Augmented Generation (RAG)

<center><img src="images/arag.gif" style="margin:auto; width:50%"/></center>

### Chunking strategies

<center><img src="images/chunks.gif" style="margin:auto; width:50%"/></center>

## Evaluation of the results

We load the results of the automatic generation from disk so that we do not have to call the models on every test.

In [None]:
# Read the output files
def read_output_files(prefix: str):
    generated_html = {}

    # Read the HTML content from the files
    with open(f'output/{prefix}-from_html.html', 'r', encoding='utf-8') as f:
        generated_html["HTML"] = f.read()

    with open(f'output/{prefix}-from_markdown.html', 'r', encoding='utf-8') as f:
        generated_html["MD"] = f.read()

    with open(f'output/{prefix}-from_markdown_template.html', 'r', encoding='utf-8') as f:
        generated_html["TEMPLATE"] = f.read()

    return generated_html

# Read the original HTML file
with open('output/Crawl4ai-original.html', 'r', encoding='utf-8') as f:
    original_html = f.read()

# Read the output files
generated_html_gpt = read_output_files("gpt-4o")
generated_html_gemini = read_output_files("gemini-2.0-flash")
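
The notebook does not show how the prefixed files were written. The sketch below is one possible counterpart to `read_output_files`, using the same file names so that the two functions round-trip. `write_output_files`, its `out_dir` parameter, and the placeholder pages are hypothetical.

```python
import os
import tempfile
from typing import Dict

def write_output_files(prefix: str, generated: Dict[str, str], out_dir: str = "output") -> None:
    """Write the three generated pages using the names read_output_files expects."""
    os.makedirs(out_dir, exist_ok=True)
    names = {
        "HTML": "from_html",
        "MD": "from_markdown",
        "TEMPLATE": "from_markdown_template",
    }
    for key, suffix in names.items():
        path = os.path.join(out_dir, f"{prefix}-{suffix}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(generated[key])

# Demo in a temporary directory with placeholder pages.
demo_dir = tempfile.mkdtemp()
write_output_files("demo-model", {"HTML": "<p>a</p>", "MD": "<p>b</p>", "TEMPLATE": "<p>c</p>"}, demo_dir)
print(sorted(os.listdir(demo_dir)))
```

Writing with an explicit `utf-8` encoding matters here because the scraped pages contain accented characters.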
