<a href="https://colab.research.google.com/github/gverafei/scraping/blob/main/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Investigating how to scrape the web**
March 2025

## Configure virtual environment

Run the following only the first time. You will be prompted to select the kernel in the upper right corner; choose the virtual environment we just created.

In [1]:
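!python3 -m venv .venv
# Each `!` command runs in its own subshell, so this activation does not persist;
# select the .venv kernel from the notebook UI instead.
!source .venv/bin/activate # Linux/Mac
# !.\.venv\Scripts\activate # Windows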
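
Because the selected kernel, not the shell, decides which interpreter the notebook uses, it can help to confirm from Python that the kernel really runs inside the virtual environment. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical and not part of the original notebook.

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, the notebook is most likely still attached to the system interpreter rather than `.venv`.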

In [2]:
!pip install --upgrade pip



To commit the changes made:

In [None]:
!git init

Initialized empty Git repository in /Users/gvera/Library/CloudStorage/OneDrive-UniversidadVeracruzana/FEI/codigo/doctorado/scraping/.git/


In [5]:
!git add scraping.ipynb
!git commit -m "Initial commit"

[main (root-commit) bfe1d07] Initial commit
 1 file changed, 669 insertions(+)
 create mode 100644 scraping.ipynb


In [15]:
!git fetch origin

In [20]:
!git add .

In [21]:
!git commit -m "Initial commit"

[main f4c0fad] Initial commit
 2 files changed, 201 insertions(+), 2 deletions(-)
 create mode 100644 .gitignore


In [16]:
!git merge origin/main

fatal: refusing to merge unrelated histories


In [None]:
# !git remote add origin https://github.com/gverafei/scraping.git
# !git pull origin main
!git push -u origin main

In [12]:
!git config pull.rebase false

Install required packages

```bash
pip install html2text
pip install markdownify
```

## Create the initial data

In [None]:
competitor_sites = [
    # {
    #     "name": "Google",
    #     "url": "https://www.google.com"
    # },
    {
        "name": "UV",
        "url": "https://www.uv.mx"
    },
    # {
    #     "name": "Amazon",
    #     "url": "https://www.amazon.com"
    # },
]

## Set up cost calculations

The idea is to compare the scrapers side by side, both by the content they return and by what that content would cost to process.

We can estimate the cost by counting tokens with OpenAI's `tiktoken` library: https://github.com/openai/tiktoken

Prices come from: https://openai.com/api/pricing/

In [7]:
!pip install tiktoken --quiet

In [8]:
import tiktoken

def count_tokens(input_string: str) -> int:
    """Return the number of tokens in `input_string` using the gpt-4o encoding."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    tokens = encoder.encode(input_string)
    return len(tokens)

def calculate_cost(input_string: str, cost_per_million_tokens: float = 2.5) -> float:
    """Estimate the cost in USD of sending `input_string` as input tokens."""
    num_tokens = count_tokens(input_string)
    total_cost = (num_tokens / 1_000_000) * cost_per_million_tokens
    return total_cost

# Example usage:
input_string = "Porque la gallina cruzó el camino? Pues porque quería llegar al otro lado."
cost = calculate_cost(input_string)
print(f"The total cost for using gpt-4o is: $US {cost:.6f}")

The total cost for using gpt-4o is: $US 0.000043
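
The cost formula does not depend on the tokenizer: it is the token count divided by one million, times the price per million tokens. As a rough sketch, the helper below applies the same arithmetic to a known token count, using the same default of 2.5 USD per million tokens as `calculate_cost`. The name `cost_from_token_count` and the token counts are hypothetical, chosen only for illustration.

```python
def cost_from_token_count(num_tokens: int, cost_per_million_tokens: float = 2.5) -> float:
    # Same arithmetic as calculate_cost, but starting from a token count
    return (num_tokens / 1_000_000) * cost_per_million_tokens

# Hypothetical page sizes, in tokens
for num_tokens in (1_000, 100_000, 1_000_000):
    print(f"{num_tokens:>9,} tokens -> US$ {cost_from_token_count(num_tokens):.4f}")
```

Separating the arithmetic from the tokenization makes it easy to try other prices without re-encoding any text.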


## Table to view the results

Now, to view the results of the comparisons, we install a package that displays tables on the command line: https://pypi.org/project/prettytable/

We also install a package that shows a nice progress bar in loops: https://pypi.org/project/tqdm/

In [9]:
!pip install prettytable --quiet

In [10]:
!pip install tqdm --quiet

In [11]:
from typing import List, Callable, Dict
from prettytable import PrettyTable
from tqdm import tqdm

def view_scraped_content(scrape_url_functions: List[Dict[str, Callable[[str], str]]], sites_list: List[Dict[str, str]], characters_to_display: int = 500, table_max_width: int = 50) -> List[Dict[str, str]]:
    content_table_headers = ["Site Name"] + [f"{func['name']} content" for func in scrape_url_functions]
    cost_table_headers = ["Site Name"] + [f"{func['name']} cost" for func in scrape_url_functions]

    content_table = PrettyTable()
    content_table.field_names = content_table_headers

    cost_table = PrettyTable()
    cost_table.field_names = cost_table_headers

    scraped_data = []

    for site in sites_list:
        content_row = [site['name']]
        cost_row = [site['name']]
        site_data = {"provider": site['name'], "sites": []}

        for scrape_function in scrape_url_functions:
            function_name = scrape_function['name']
            for _ in tqdm([site], desc=f"Processing site {site['name']} using {function_name}"):
                try:    
                    content = scrape_function['function'](site['url'])
                    content_snippet = content[:characters_to_display]
                    content_row.append(content_snippet)

                    cost = calculate_cost(content)
                    cost_row.append(f"${cost:.6f}")

                    site_data["sites"].append({"name": function_name, "content": content})
                except Exception as e:
                    error_message = f"Error: {str(e)}"
                    content_row.append(error_message)
                    cost_row.append("Error")

                    site_data["sites"].append({"name": function_name, "content": error_message})
                    continue

        content_table.add_row(content_row)
        cost_table.add_row(cost_row)
        scraped_data.append(site_data)

    content_table.max_width = table_max_width
    content_table.hrules = True

    cost_table.max_width = table_max_width
    cost_table.hrules = True

    print("Content Table:")
    print(content_table)

    print("\nCost Table:\nThis is how much it would cost to use gpt-4o to parse this content for extraction.")
    print(cost_table)

    return scraped_data
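
`view_scraped_content` keeps going when a scraper raises: the error message takes the place of the content, so one failing scraper does not abort the whole comparison. That pattern can be sketched in isolation as follows; `safe_scrape`, `ok_scraper`, and `failing_scraper` are hypothetical names used only for this illustration.

```python
from typing import Callable

def safe_scrape(scrape: Callable[[str], str], url: str) -> str:
    # Return the scraped content, or an "Error: ..." string if the scraper raises
    try:
        return scrape(url)
    except Exception as e:
        return f"Error: {e}"

def ok_scraper(url: str) -> str:
    # Stand-in scraper that returns a fixed HTML string (no network access)
    return f"<html><body>Content of {url}</body></html>"

def failing_scraper(url: str) -> str:
    # Stand-in scraper that always fails
    raise RuntimeError("connection refused")

print(safe_scrape(ok_scraper, "https://www.uv.mx"))
print(safe_scrape(failing_scraper, "https://www.uv.mx"))
```

Any function that takes a URL and returns a string can be plugged into the comparison in the same way.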

## Set up all the scrapers

Let's set up all of our scrapers.

## Beautiful Soup

Install this package from: https://pypi.org/project/beautifulsoup4/

And also `requests`, to make HTTP requests, from: https://pypi.org/project/requests/

In [12]:
!pip install requests beautifulsoup4 --quiet

In [13]:
# Beautiful Soup utility functions

import requests
from bs4 import BeautifulSoup

def beautiful_soup_scrape_url(url: str):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return str(soup)

## Reader API by Jina AI

This one is also designed specifically for LLMs. Set up Jina AI's scrape method from: https://jina.ai/reader/

In [14]:
import requests

def scrape_jina_ai(url: str) -> str:
    headers = {
        'X-Return-Format': 'html',
        'X-Wait-For-Selector': '.slick-initialized',
        'X-With-Images-Summary': 'all'
    }
    data = {
        'url': url,
        'injectPageScript': [
            'window.scrollTo(0, document.body.scrollHeight)'
        ]
    }

    response = requests.post('https://r.jina.ai/', headers=headers, json=data)
    return response.text

## Playwright

The classic way to scrape. It is not designed specifically for LLMs. From: https://playwright.dev/

In [15]:
!pip install playwright --quiet

In [16]:
!playwright install

In [17]:
!pip install nest_asyncio --quiet

In [18]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from playwright.async_api import async_playwright

async def scrape_playwright(url: str):
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        # Wait 2 seconds for the page to load
        await page.wait_for_timeout(2000)
        # await page.wait_for_load_state('networkidle')
        # Run a script to scroll to the bottom of the page
        # await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # This can also be done with the keyboard
        # await page.keyboard.press('End')
        # Wait 2 seconds
        # await page.wait_for_timeout(2000)

        html = await page.content()
        await browser.close()
        
        return html
    
def scrape_playwright_sync(url: str):
    return asyncio.run(scrape_playwright(url))

## Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

This one is designed specifically for LLMs. From: https://docs.crawl4ai.com/

In [19]:
!pip install crawl4ai --quiet
# !pip install git+https://github.com/unclecode/crawl4ai.git@2025-MAR-ALPHA-1 --quiet

In [20]:
!crawl4ai-setup

[INIT].... → Running post-installation setup...
[INIT].... → Installing Playwright browsers...
[COMPLETE] ● Playwright installation completed successfully.
[INIT].... → Starting database initialization...
[COMPLETE] ● Database initialization completed successfully.
[COMPLETE] ● Post-installation setup completed!


In [21]:
!pip install nest_asyncio --quiet

In [26]:
import nest_asyncio
nest_asyncio.apply()

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def async_scrape_crawl4ai(url: str):
    browser_config = BrowserConfig(verbose=False, headless=True)

    # async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        result = await crawler.arun(url=url)
        return result.html
    finally:
        # Close the browser even if the crawl fails
        await crawler.close()

# To run the async function in a synchronous context
# (like this script), you can use asyncio.run() to execute it.
# This is a workaround for running async functions in a sync context.
def scrape_crawl4ai(url: str):
    return asyncio.run(async_scrape_crawl4ai(url))

# print(scrape_crawl4ai("https://www.uv.mx"))
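
The sync wrappers above follow the same shape: an `async` function does the work and a plain function drives it with `asyncio.run()`. Below is a minimal, network-free sketch of that shape, where `fake_fetch` and `fake_fetch_sync` are hypothetical stand-ins for a real scraper.

```python
import asyncio

async def fake_fetch(url: str) -> str:
    # Stand-in for real browser work; yields control to the event loop once
    await asyncio.sleep(0)
    return f"<html><!-- fetched {url} --></html>"

def fake_fetch_sync(url: str) -> str:
    # asyncio.run() creates an event loop, runs the coroutine to completion,
    # and closes the loop, so callers do not need to use `await`
    return asyncio.run(fake_fetch(url))

print(fake_fetch_sync("https://www.uv.mx"))
```

Inside Jupyter an event loop is already running, which is why the notebook applies `nest_asyncio` before calling `asyncio.run()`. In a plain Python script that step is not needed.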

## Main functions to run the comparison

Let's run all the scrapers and display them in our comparison table.

In [1]:
list_of_scraper_functions = [
      {"name": "Beautiful Soup", "function": beautiful_soup_scrape_url},
      # {"name": "Jina AI", "function": scrape_jina_ai},
      {"name": "Playwright", "function": scrape_playwright_sync},
      {"name": "Crawl4ai", "function": scrape_crawl4ai},
      ]

all_content = view_scraped_content(list_of_scraper_functions, competitor_sites, 500, 40)

NameError: name 'beautiful_soup_scrape_url' is not defined
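
`view_scraped_content` returns one entry per site, each holding the full content produced by every scraper under the `provider` and `sites` keys. As a sketch of how that result could be inspected after the comparison has run, the snippet below uses a small hand-written `sample_content` list (hypothetical data in the same shape, not real output) and a hypothetical helper `summarize`.

```python
from typing import Dict, List

# Hypothetical data shaped like the return value of view_scraped_content
sample_content = [
    {
        "provider": "UV",
        "sites": [
            {"name": "Beautiful Soup", "content": "<html>short</html>"},
            {"name": "Playwright", "content": "<html>a longer rendered page</html>"},
        ],
    },
]

def summarize(scraped_data: List[dict]) -> Dict[str, Dict[str, int]]:
    # Map each site name to the content length (in characters) per scraper
    return {
        site["provider"]: {s["name"]: len(s["content"]) for s in site["sites"]}
        for site in scraped_data
    }

for provider, lengths in summarize(sample_content).items():
    for scraper_name, length in lengths.items():
        print(f"{provider} / {scraper_name}: {length} characters")
```

With real data, `summarize(all_content)` would give a quick view of how much each scraper returns per site, complementing the truncated snippets in the content table.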