# 🕷️ Web Scraping Data with Python

In this notebook we will learn to extract data from websites using BeautifulSoup and requests.


In [1]:
%pip install -q requests beautifulsoup4 lxml pandas html5lib

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries loaded successfully")
print(f"requests: {requests.__version__}")


Note: you may need to restart the kernel to use updated packages.
✅ Libraries loaded successfully
requests: 2.32.4


In [2]:
# Headers to simulate a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
}

print("Headers configured to simulate a browser")


Headers configured to simulate a browser
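When several pages are requested from the same site, the headers can be attached once to a `requests.Session` instead of being passed on every call. The sketch below is one possible arrangement, not part of the original notebook: the helper names `make_session` and `fetch` and the shortened User-Agent string are assumptions made for illustration.

```python
import requests


def make_session(user_agent: str) -> requests.Session:
    """Build a session that sends the same headers on every request.

    `make_session` is a hypothetical helper; the header values mirror
    the style of the `headers` dict defined above.
    """
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent,
        "Accept-Language": "es-ES,es;q=0.9,en;q=0.8",
    })
    return session


def fetch(session: requests.Session, url: str, timeout: float = 10.0) -> requests.Response:
    """GET a URL with a timeout and raise on HTTP error statuses (4xx/5xx)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response


# Placeholder User-Agent string, chosen only for this sketch
session = make_session("Mozilla/5.0 (tutorial-notebook)")
print(session.headers["User-Agent"])
```

A call such as `fetch(session, "https://books.toscrape.com/")` would then reuse the underlying connection and the shared headers across requests.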


## 2. Scraping an example page

We will use books.toscrape.com, a site designed for practicing scraping.


In [3]:
# Example page for scraping
url = "https://books.toscrape.com/"

# A timeout keeps the request from hanging indefinitely
response = requests.get(url, headers=headers, timeout=10)
print(f"Status: {response.status_code}")

# Parse the HTML
soup = BeautifulSoup(response.content, 'lxml')
print(f"Page title: {soup.title.get_text(strip=True)}")


Status: 200
Page title: All products | Books to Scrape - Sandbox



In [4]:
# Extract book information
books = soup.find_all('article', class_='product_pod')

print(f"Books found: {len(books)}")
print("=" * 50)

# Show the first 5 books
for i, book in enumerate(books[:5], 1):
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    availability = book.find('p', class_='instock availability').text.strip()
    rating = book.p['class'][1]  # The rating is stored in the CSS class, e.g. ['star-rating', 'Three']

    print(f"\n📚 Book {i}:")
    print(f"   Title: {title}")
    print(f"   Price: {price}")
    print(f"   Availability: {availability}")
    print(f"   Rating: {rating}")


Books found: 20
==================================================

📚 Book 1:
   Title: A Light in the Attic
   Price: £51.77
   Availability: In stock
   Rating: Three

📚 Book 2:
   Title: Tipping the Velvet
   Price: £53.74
   Availability: In stock
   Rating: One

📚 Book 3:
   Title: Soumission
   Price: £50.10
   Availability: In stock
   Rating: One

📚 Book 4:
   Title: Sharp Objects
   Price: £47.82
   Availability: In stock
   Rating: Four

📚 Book 5:
   Title: Sapiens: A Brief History of Humankind
   Price: £54.23
   Availability: In stock
   Rating: Five
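The same fields can also be read with CSS selectors through `select()` and `select_one()`, which makes it easier to guard against missing elements. The sketch below runs on a small hand-written HTML fragment that imitates the `product_pod` markup, so it needs no network access. The fragment, the helper name `parse_book`, and the use of the built-in `html.parser` instead of `lxml` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating one book card; not fetched from the site
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the Attic</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
"""


def parse_book(article):
    """Extract title, price and rating from one card, tolerating missing tags."""
    link = article.select_one("h3 a")
    price_tag = article.select_one("p.price_color")
    rating_tag = article.select_one("p.star-rating")

    rating = None
    if rating_tag is not None:
        # The class list looks like ['star-rating', 'Three']; keep the other word
        rating = next(
            (c for c in rating_tag.get("class", []) if c != "star-rating"), None
        )

    return {
        "title": link.get("title") if link else None,
        "price": price_tag.get_text(strip=True) if price_tag else None,
        "rating": rating,
    }


soup_sample = BeautifulSoup(SAMPLE_HTML, "html.parser")
parsed = [parse_book(a) for a in soup_sample.select("article.product_pod")]
print(parsed)
```

Selecting `p.star-rating` explicitly avoids relying on the rating being the first `<p>` inside the card, which is what `book.p['class'][1]` assumes.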


## 3. Complete scraping function


In [5]:
from urllib.parse import urljoin

def scrape_books_page(url):
    """Extract book information from one listing page."""
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return []

    soup = BeautifulSoup(response.content, 'lxml')
    books = soup.find_all('article', class_='product_pod')

    # Map rating words to integers
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

    data = []
    for book in books:
        rating_text = book.p['class'][1]
        data.append({
            'titulo': book.h3.a['title'],
            'precio': book.find('p', class_='price_color').text,
            'disponibilidad': book.find('p', class_='instock availability').text.strip(),
            'rating': rating_map.get(rating_text, 0),
            # Resolve the relative href against the page URL so links also work on inner pages
            'url': urljoin(url, book.h3.a['href'])
        })

    return data

# Test the function
books_data = scrape_books_page("https://books.toscrape.com/")
print(f"✅ Extracted {len(books_data)} books")


✅ Extracted 20 books
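The function above handles a single page. To walk through a paginated catalogue, one approach is to look for a "next" link on each page and resolve it against the current URL. The sketch below shows only that link-resolution step on a hand-written fragment. The `li.next a` selector, the sample markup, and the helper name `next_page_url` are assumptions and should be checked against the real page source.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page.

    Assumes the pager marks its link as <li class="next"><a href="...">.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")
    if link is None or not link.get("href"):
        return None
    # urljoin handles hrefs that are relative to the current page
    return urljoin(current_url, link["href"])


# Hand-written pager fragments used only to exercise the helper
FIRST_PAGE = '<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>'
INNER_PAGE = '<ul class="pager"><li class="next"><a href="page-3.html">next</a></li></ul>'
LAST_PAGE = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

print(next_page_url(FIRST_PAGE, "https://books.toscrape.com/"))
print(next_page_url(INNER_PAGE, "https://books.toscrape.com/catalogue/page-2.html"))
print(next_page_url(LAST_PAGE, "https://books.toscrape.com/catalogue/page-50.html"))
```

In a full crawl, a loop could call `scrape_books_page` on each URL, pause with `time.sleep` between requests, and stop when this helper returns `None`.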


In [6]:
# Convert to a DataFrame
df_books = pd.DataFrame(books_data)

# Clean the price (remove the £ symbol and convert to float)
df_books['precio_num'] = df_books['precio'].str.replace('£', '', regex=False).astype(float)

print("📊 Book dataset:")
df_books.head(10)


📊 Book dataset:


| | titulo | precio | disponibilidad | rating | url | precio_num |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock | 3 | https://books.toscrape.com/catalogue/a-light-i... | 51.77 |
| 1 | Tipping the Velvet | £53.74 | In stock | 1 | https://books.toscrape.com/catalogue/tipping-t... | 53.74 |
| 2 | Soumission | £50.10 | In stock | 1 | https://books.toscrape.com/catalogue/soumissio... | 50.1 |
| 3 | Sharp Objects | £47.82 | In stock | 4 | https://books.toscrape.com/catalogue/sharp-obj... | 47.82 |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock | 5 | https://books.toscrape.com/catalogue/sapiens-a... | 54.23 |
| 5 | The Requiem Red | £22.65 | In stock | 1 | https://books.toscrape.com/catalogue/the-requi... | 22.65 |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | In stock | 4 | https://books.toscrape.com/catalogue/the-dirty... | 33.34 |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | In stock | 3 | https://books.toscrape.com/catalogue/the-comin... | 17.93 |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | In stock | 4 | https://books.toscrape.com/catalogue/the-boys-... | 22.6 |
| 9 | The Black Maria | £52.15 | In stock | 1 | https://books.toscrape.com/catalogue/the-black... | 52.15 |
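Removing a single known symbol works when every price has exactly the expected form. A more tolerant variant extracts the numeric part with a regular expression and lets pandas turn anything unparseable into `NaN`. The sketch below uses a small hand-made Series. The sample values, including the `Â£` prefix that can appear when page bytes are decoded with the wrong encoding, are assumptions chosen to illustrate the failure cases.

```python
import pandas as pd

# Hand-made sample: a clean price, a mis-decoded prefix, and a missing value
raw_prices = pd.Series(["£51.77", "Â£53.74", "£50.10", "N/A"])

# Capture the first number (digits with an optional decimal part);
# rows without a match become NaN instead of raising an error
numeric_part = raw_prices.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
precio_num = pd.to_numeric(numeric_part, errors="coerce")

print(precio_num.tolist())
```

With this approach a malformed row produces a missing value that can be inspected later, whereas `astype(float)` would stop the whole conversion with an exception.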


In [7]:
# Dataset statistics
print("📊 DATASET STATISTICS")
print("=" * 40)
print(f"Total books: {len(df_books)}")
print(f"Average price: £{df_books['precio_num'].mean():.2f}")
print(f"Minimum price: £{df_books['precio_num'].min():.2f}")
print(f"Maximum price: £{df_books['precio_num'].max():.2f}")
print("\nRating distribution:")
print(df_books['rating'].value_counts().sort_index())

# Save to CSV
df_books.to_csv('books_scraped.csv', index=False)
print("\n✅ Data saved to 'books_scraped.csv'")


📊 DATASET STATISTICS
========================================
Total books: 20
Average price: £38.05
Minimum price: £13.99
Maximum price: £57.25

Rating distribution:
rating
1    6
2    3
3    3
4    4
5    4
Name: count, dtype: int64

✅ Data saved to 'books_scraped.csv'


## Summary

In this exercise we learned to:

1. ✅ Make HTTP requests with appropriate headers
2. ✅ Parse HTML with BeautifulSoup
3. ✅ Extract data with `find()` and `find_all()`
4. ✅ Convert data to a DataFrame
5. ✅ Export data to CSV

### ⚠️ Ethical reminder

- Always check the site's terms of service
- Respect the robots.txt file
- Do not overload servers (add delays between requests)
- Prefer APIs when they are available
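The robots.txt and delay points above can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses a hand-written robots.txt so it runs without network access. The rules, the example.com URLs, and the helper name `is_allowed` are assumptions made for illustration, and a real crawler would load the site's actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Hand-written rules for illustration; not the robots.txt of any real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


# Use the declared crawl delay if there is one, otherwise fall back to 1 second
delay = parser.crawl_delay("*") or 1

print(is_allowed("https://example.com/catalogue/page-2.html"))
print(is_allowed("https://example.com/private/report.html"))
print(delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file, and `time.sleep(delay)` between requests would apply the pause.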
