# Exercise 12: Web Scraping

## Objective

The goal of this exercise is to build a web scraper that collects data from a website.

## Part 0: Planning
1. Identify the data you want to collect.
2. Choose the target website.
3. Plan the structure of the corpus.
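
One way to make step 3 concrete is to fix the shape of a corpus record before writing any scraping code. The sketch below is only an illustration: the `RecipeRecord` name and its fields are assumptions chosen to match the data extracted later in this exercise, not a required schema, and the example URL is reconstructed from the redirect parameter that appears in the scraped links.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RecipeRecord:
    """Hypothetical schema for one document in the recipe corpus."""
    url: str
    title: str
    description: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # One JSON object per recipe makes it easy to store the corpus as JSON Lines.
        return json.dumps(asdict(self), ensure_ascii=False)


record = RecipeRecord(
    url="https://www.allrecipes.com/recipe/93168/rotisserie-chicken/",
    title="Rotisserie Chicken",
    ingredients=["1 (3 pound) whole chicken", "1 pinch salt"],
)
print(record.to_json())
```

Deciding on the fields up front makes it clear which HTML elements have to be located in Part 1.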

## Part 1: Understanding the target website

- Analyze the structure of the web page to be scraped.
- Identify the HTML elements that contain the target data.
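
A quick way to approach the two points above is to count which CSS classes occur on a given tag, since repeated classes often mark the containers that hold the data. The following is a minimal sketch under that assumption; `count_classes` and the sample HTML are made up for illustration (only the ingredient class name comes from this notebook).

```python
from collections import Counter
from bs4 import BeautifulSoup


def count_classes(html: str, tag: str) -> Counter:
    """Count how often each CSS class appears on elements named `tag`."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for element in soup.find_all(tag):
        # `class` is a multi-valued attribute, so BeautifulSoup returns it as a list.
        counts.update(element.get("class", []))
    return counts


sample_html = """
<ul>
  <li class="mm-recipes-structured-ingredients__list-item">1 pinch salt</li>
  <li class="mm-recipes-structured-ingredients__list-item">1 tablespoon salt</li>
  <li class="nav-item">Dinner</li>
</ul>
"""
for name, n in count_classes(sample_html, "li").most_common():
    print(n, name)
```

On a real page the same call could be run on the parsed `recetas.html` content to shortlist candidate classes before inspecting them by hand.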

**Morales Jessica - GR1CC**

In [3]:
from bs4 import BeautifulSoup

#file = '../data/12webcrawling/rotisserie-chicken.html'

!wget https://raw.githubusercontent.com/jessicaMorale/RI-busquedaB-MoralesJ/refs/heads/main/receta2.html -O recetas.html

# Load the HTML file
with open('recetas.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

--2025-07-27 23:22:17--  https://raw.githubusercontent.com/jessicaMorale/RI-busquedaB-MoralesJ/refs/heads/main/receta2.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 580640 (567K) [text/plain]
Saving to: ‘recetas.html’


2025-07-27 23:22:18 (11.9 MB/s) - ‘recetas.html’ saved [580640/580640]



In [4]:
# Find all recipe links on the main page
recipe_links = soup.find_all('a', class_='fixed-recipe-card__title-link')

# Print the recipe links
for link in recipe_links:
    # Get the link target of the recipe
    href = link.get('href')
    if href:
        print(href)

In [5]:
# Try to get the title from the og:title meta tag
title = soup.find("meta", {"property": "og:title"})
if title:
    recipe_title = title.get('content')
else:
    # If there is no og:title meta tag, fall back to the <h1> headline
    headline = soup.find('h1', class_='headline')
    recipe_title = headline.text.strip() if headline else "Sin título"

# Print the recipe title
print(f"Nombre de la receta: {recipe_title}")

Nombre de la receta: Rotisserie Chicken


In [7]:
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
for ingredient in ingredients_section:
    print(ingredient.text.strip())

1 (3 pound) whole chicken
1 pinch salt
¼ cup butter, melted
1 tablespoon salt
1 tablespoon ground paprika
¼ tablespoon ground black pepper


## Part 2: Extracting the target data

* Search the HTML content and extract the information.

In [8]:
# Extracting the description
description = soup.find("meta", {"name": "description"})["content"]

# Extracting the ingredients
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")
ingredients = [ingredient.get_text().strip() for ingredient in ingredients_section]

# Extracting the instructions
instructions_section = soup.find_all("p", class_="comp mntl-sc-block mntl-sc-block-html")
instructions = [instruction.get_text().strip() for instruction in instructions_section]

# Extracting the nutrition information
nutrition_section = soup.find_all("span", class_="mm-recipes-nutrition-facts-label__nutrient-name mm-recipes-nutrition-facts-label__nutrient-name--has-postfix")
nutrition_facts = [fact.parent.get_text().strip().replace('\n', ' ') for fact in nutrition_section]

# Print the extracted information
print("Recipe Title:", recipe_title)  # recipe_title is the text; title is the <meta> tag itself
print("Description:", description)
print("Ingredients:")
for ingredient in ingredients:
    print("-", ingredient)
print("Instructions:")
for i, instruction in enumerate(instructions, 1):
    print(f"{i}. {instruction}")
print("Nutrition Facts:")
for fact in nutrition_facts:
    print("-", fact)


Recipe Title: Rotisserie Chicken
Description: Rotisserie chicken that's easy to cook on a gas grill and turns out moist and juicy with crispy skin. This is a simple recipe that our family loves.
Ingredients:
- 1 (3 pound) whole chicken
- 1 pinch salt
- ¼ cup butter, melted
- 1 tablespoon salt
- 1 tablespoon ground paprika
- ¼ tablespoon ground black pepper
Instructions:
1. Intimidated by the idea of making a rotisserie chicken at home? We're here to help. Get your grill and rotisserie attachment ready — you'll want to try this recipe ASAP.
2. Here's what you'll need to make rotisserie chicken at home:
3. · Whole Chicken: This recipe is meant for a whole 3-pound chicken. If your chicken is larger or smaller, you'll have to adjust the cooking time.· Butter: Butter keeps the chicken moist and juicy, while giving the seasonings something to stick to.· Seasonings: The rotisserie chicken is simply seasoned with salt, pepper, and paprika.
4. You'll find t

## Part 3: Collecting related links
* Find links to other recipes to complete the corpus.

In [9]:
# Find all the links to other recipes
recipe_links = soup.find_all("a", href=True)

# Filter and print only the links that are likely to be recipes
recipe_urls = []
for link in recipe_links:
    href = link['href']
    if "recipe" in href:
        recipe_urls.append(href)

# Print the recipe URLs
print("Linked Recipes:")
for url in recipe_urls:
    print(url)

Linked Recipes:
https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
/account/add-recipe
https://www.myrecipes.com/favorites
https://www.allrecipes.com/authentication/logout?relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F
https://www.magazines.com/allrecipes-magazine.html?utm_source=allrecipes.com&utm_medium=owned&utm_campaign=i111arr1w2661
https://www.magazines.com/allrecipes-magazine.html
https://www.allrecipes.com/recipes/17562/dinner/
https://www.allrecipes.com/recipes/17057/everyday-cooking/more-meal-ideas/5-ingredients/main-dishes/
https://www.allrecipes.com/recipes/15436/everyday-cooking/one-pot-meals/
https://www.allrecipes.com/recipes/1947/everyday-cooking/quick-and-easy/
https://www.allrecipes.com/recipes/455/everyday-cooking/more-meal-ideas/30-minute-meals/
https://www.allrecipes.com/recipes/17889/everyday-cooking/family-friendly/family-dinners/
https://www.allrecipes.com/recipes/94/soups-s
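
The output above shows that the substring test `"recipe" in href` also matches login links, category pages, and magazine links. A stricter filter could look at the URL path only. The sketch below assumes that a single recipe page has a path of the form `/recipe/<numeric id>/<slug>/`, which matches the redirect target seen in the login link but should be checked against the live site; `filter_recipe_urls` and the last two sample URLs are illustrative.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumption: individual recipe pages have a path like /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/\d+/[^/]+/?$")


def filter_recipe_urls(hrefs, base_url="https://www.allrecipes.com/"):
    """Return unique absolute URLs whose path looks like a single recipe page."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links such as /account/add-recipe
        parsed = urlparse(absolute)
        if not RECIPE_PATH.match(parsed.path):
            continue  # category pages, login links, magazine links, ...
        clean = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"  # drop query string and fragment
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


sample = [
    "https://www.allrecipes.com/authentication/login?regSource=3675&relativeRedirectUrl=%2Frecipe%2F93168%2Frotisserie-chicken%2F",
    "/account/add-recipe",
    "https://www.allrecipes.com/recipes/17562/dinner/",
    "https://www.allrecipes.com/recipe/93168/rotisserie-chicken/",
    "https://www.allrecipes.com/recipe/93168/rotisserie-chicken/?print",
]
print(filter_recipe_urls(sample))
```

Matching on the parsed path rather than the whole string is what keeps the login link out, because there the word "recipe" only appears inside the query string. The filter does not check the domain, so a link to another site with the same path shape would also pass.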

## Identifying data on the web page
**Collect URLs to build the corpus**
*100 recipes*

In [10]:
output_path = "urls_recetas3.txt"
with open(output_path, "w", encoding="utf-8") as f:
    for url in recipe_urls:
        f.write(url + "\n")

print(f"Se guardaron {len(recipe_urls)} enlaces en '{output_path}'")

Se guardaron 211 enlaces en 'urls_recetas3.txt'


**Download the URLs from the Linux terminal and upload them to a GitHub repository**
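
The terminal commands used for this step are not shown in the notebook. As a rough Python equivalent, the sketch below reads a list of URLs and saves each page as `receta_<n>.html`. Everything in it is an assumption for illustration: the function names, the placeholder User-Agent string, the one-second delay, and the injectable `fetch` parameter (added so the loop can be tried without network access).

```python
import time
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 30.0) -> bytes:
    """Download one page; the User-Agent string is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "recipe-corpus-exercise/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def download_corpus(url_file, out_dir, limit=100, delay=1.0, fetch=fetch_html):
    """Save the first `limit` URLs listed in `url_file` as receta_<n>.html files."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    lines = Path(url_file).read_text(encoding="utf-8").splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    saved = []
    for n, url in enumerate(urls[:limit], start=1):
        target = out_path / f"receta_{n}.html"
        try:
            target.write_bytes(fetch(url))
            saved.append(target)
        except (OSError, ValueError) as error:
            # OSError covers network and HTTP errors; ValueError covers malformed URLs.
            print(f"Skipping {url}: {error}")
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return saved


# Example call (not executed here because it needs network access):
# download_corpus("urls_recetas3.txt", "recetas")
```

A failed download simply produces no file in this sketch, whereas `wget -O` leaves an empty file behind, which is why the next cell checks the file size.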

In [18]:
from bs4 import BeautifulSoup
import os

descargadas = 0
recetas_validas = []  # Will hold the files that have content

# Download the files and check whether they are empty
for i in range(1, 101):
    url = f"https://raw.githubusercontent.com/jessicaMorale/recetas_crawler/refs/heads/main/recetas/receta_{i}.html"
    destino = f"receta_{i}.html"

    print(f" receta {i}...", end=" ")
    os.system(f"wget -q \"{url}\" -O \"{destino}\"")

    # Simple check: the file exists and is not empty
    if os.path.exists(destino) and os.path.getsize(destino) > 0:
        print(" OK")
        descargadas += 1
        recetas_validas.append(destino)
    else:
        print(" Vacía o error")

print(f"\n Total descargadas correctamente: {descargadas}/100")


 receta 1...  Vacía o error
 receta 2...  OK
 receta 3...  OK
 receta 4...  OK
 receta 5...  OK
 receta 6...  OK
 receta 7...  OK
 receta 8...  OK
 receta 9...  OK
 receta 10...  OK
 receta 11...  OK
 receta 12...  OK
 receta 13...  OK
 receta 14...  OK
 receta 15...  OK
 receta 16...  OK
 receta 17...  OK
 receta 18...  OK
 receta 19...  OK
 receta 20...  OK
 receta 21...  OK
 receta 22...  OK
 receta 23...  OK
 receta 24...  OK
 receta 25...  OK
 receta 26...  OK
 receta 27...  OK
 receta 28...  OK
 receta 29...  OK
 receta 30...  OK
 receta 31...  OK
 receta 32...  OK
 receta 33...  OK
 receta 34...  OK
 receta 35...  OK
 receta 36...  OK
 receta 37...  OK
 receta 38...  OK
 receta 39...  OK
 receta 40...  OK
 receta 41...  OK
 receta 42...  OK
 receta 43...  OK
 receta 44...  OK
 receta 45...  OK
 receta 46...  OK
 receta 47...  OK
 receta 48...  OK
 receta 49...  OK
 receta 50...  OK
 receta 51...  OK
 receta 52...  OK
 receta 53...  OK
 receta 54...  OK
 receta 55...  OK
 receta 

**Identify the titles of the 100 recipes obtained**

In [19]:
for archivo in recetas_validas:
    with open(archivo, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file.read(), "html.parser")
        titulo = soup.title.string.strip() if soup.title else "Sin título"
        print(f"{archivo}: {titulo}")

receta_2.html: Kraut Bierocks Recipe
receta_3.html: Pork Dumplings Recipe
receta_4.html: Filipino Avocado Milkshake Recipe
receta_5.html: Homemade Pickled Ginger (Gari) Recipe
receta_6.html: Frito Chicken Casserole Recipe
receta_7.html: ESER's Balsamic Salad Dressing Recipe
receta_8.html: One Pan Orecchiette Pasta Recipe
receta_9.html: Halloumi Cheese Fingers Recipe
receta_10.html: German Schwenkbraten Recipe
receta_11.html: Instant Pot Salsa Chicken Recipe
receta_12.html: Sausage, Potato and Kale Soup Recipe
receta_13.html: Falafel with Canned Chickpeas Recipe
receta_14.html: Chicken Afritada (Filipino Stew) Recipe
receta_15.html: World's Best Honey Garlic Pork Chops Recipe
receta_16.html: General Tao Chicken Recipe
receta_17.html: Traditional Filipino Lumpia Recipe
receta_18.html: Lasagna Flatbread Recipe
receta_19.html: Easy Chicken Curry Recipe
receta_20.html: Easy Cold Pasta Salad Recipe
receta_21.html: Simple Turkey Chili Recipe
receta_22.html: Japanese Pan Noodles Recipe
receta_

**Identify empty files and skip them during processing**

In [20]:
for i in range(1, 101):
    with open(f"receta_{i}.html", "r", encoding="utf-8") as file:
        content = file.read()
        if not content.strip():
            print(f"receta_{i}.html está vacía. Saltando...")
            continue  # Skip to the next file

    soup = BeautifulSoup(content, "html.parser")
    # process the recipe as usual


receta_1.html está vacía. Saltando...
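
The loop above would raise `FileNotFoundError` if one of the hundred files were missing. A slightly more defensive variant, sketched here with a hypothetical helper name, lists whatever files actually exist and keeps only those with content.

```python
from pathlib import Path


def non_empty_html_files(directory=".", pattern="receta_*.html"):
    """Return matching files that exist and contain more than whitespace."""
    valid = []
    # sorted() orders names lexicographically, so receta_10 comes before receta_2.
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if text.strip():
            valid.append(path)
    return valid
```

Calling `non_empty_html_files()` in the notebook's working directory would give a list comparable to `recetas_validas`, without relying on the file numbering being complete.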


**Content of the valid recipes**

In [21]:
# Process a valid recipe: `content` still holds the HTML of the last file read in the previous cell
soup = BeautifulSoup(content, "html.parser")

# Recipe title: try <title> first, then fall back to <h1>
titulo = soup.title.string.strip() if soup.title and soup.title.string else "Sin título"
if titulo == "Sin título":
    h1 = soup.find('h1')
    titulo = h1.get_text(strip=True) if h1 else "Sin título encontrado"
print(f"Título: {titulo}")

# Extract ingredients (try the specific class first, then more generic tags)
ingredients_section = soup.find_all("li", class_="mm-recipes-structured-ingredients__list-item")

if not ingredients_section:
    print("Buscando ingredientes en otras etiquetas...")
    ingredients_section = soup.find_all("li")  # Fall back to every <li> if the class is not found
    if not ingredients_section:
        ingredients_section = soup.find_all("span")  # Or to every <span>

# Print the ingredients, or a message if none were found
if ingredients_section:
    print("Ingredientes:")
    for ingredient in ingredients_section:
        print(f"- {ingredient.text.strip()}")
else:
    print("No se encontraron ingredientes.")

# Print all text on the page; every <div> and <p> matches, so navigation and menus are included too
print("\nContenido de la receta:")
recipe_content = soup.find_all(["div", "p"])
for block in recipe_content:  # `block` avoids overwriting the `content` string used above
    print(block.get_text(strip=True))  # Text only, without HTML tags
print("\n" + "-"*50)  # Visual separator between recipes


Título: Baked Fish Fillets Recipe
Ingredientes:
- 1 tablespoon vegetable oil, or to taste
- 2 pounds mackerel fillets
- 1 teaspoon salt
- ⅛ teaspoon ground black pepper
- ¼ cup butter, melted
- 2 tablespoons lemon juice
- ⅛ teaspoon ground paprika

Contenido de la receta:
AllrecipesSaveRateSearchPlease fill out this field.Search the sitePlease fill out this field.Log InMy AccountAdd a RecipeSaved Recipes & CollectionsAccount SettingsHelpLog OutMagazineSubscribeManage Your SubscriptionGive a Gift SubscriptionGet HelpNewslettersSweepstakes



Allrecipes
Allrecipes
SaveRate
Save
SearchPlease fill out this field.Search the sitePlease fill out this field.Log InMy AccountAdd a RecipeSaved Recipes & CollectionsAccount SettingsHelpLog OutMagazineSubscribeManage Your SubscriptionGive a Gift SubscriptionGet HelpNewslettersSweepstakes
SearchPlease fill out this field.
Search
Please fill out this field.
Please fill out this field.
Search the sitePlease fill out this field.
Search the site
Please f
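
The output above is dominated by navigation and account menus because every `<div>` and `<p>` on the page is printed. A narrower extraction could reuse the selectors that worked in Part 2. The function below is a sketch under the assumption that all downloaded pages share the `og:title` meta tag, the `description` meta tag, and the ingredient class seen earlier; `extract_recipe` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

INGREDIENT_CLASS = "mm-recipes-structured-ingredients__list-item"


def extract_recipe(html: str) -> dict:
    """Extract only the recipe fields, ignoring navigation and other page chrome."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("meta", {"property": "og:title"})
    description_tag = soup.find("meta", {"name": "description"})
    ingredient_items = soup.find_all("li", class_=INGREDIENT_CLASS)

    return {
        "title": title_tag.get("content", "") if title_tag else "",
        "description": description_tag.get("content", "") if description_tag else "",
        # get_text(" ", strip=True) joins the nested pieces with single spaces
        "ingredients": [li.get_text(" ", strip=True) for li in ingredient_items],
    }


sample_html = """
<html><head>
  <meta property="og:title" content="Baked Fish Fillets">
  <meta name="description" content="Quick baked fish.">
</head><body>
  <ul class="nav"><li>Log In</li><li>Magazine</li></ul>
  <ul>
    <li class="mm-recipes-structured-ingredients__list-item"><span>1</span> <span>teaspoon</span> <span>salt</span></li>
  </ul>
</body></html>
"""
print(extract_recipe(sample_html))
```

Missing elements yield empty values instead of raising, so pages with a different layout can be detected and reviewed afterwards.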

## Part 4: RAG over the collected recipes
* Once the corpus has been built, implement and deploy RAG to run searches over the corpus.

In [22]:
!pip install faiss-cpu sentence-transformers python-dotenv openai



In [23]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import os
from dotenv import load_dotenv
from bs4 import BeautifulSoup

# Load the recipes from the downloaded HTML files
recetas_validas = []  # List for the recipes with content (not used below)
recipes = []

# Process the downloaded files
for i in range(1, 101):  # Assumes there are 100 recipes
    archivo = f"receta_{i}.html"
    if os.path.exists(archivo):
        with open(archivo, "r", encoding="utf-8") as file:
            content = file.read()
            if not content.strip():  # Skip empty files
                continue

            # Use BeautifulSoup to extract the recipe information
            soup = BeautifulSoup(content, "html.parser")

            # Extract title, ingredients, instructions, etc.
            title = soup.title.string.strip() if soup.title else "Sin título"
            ingredients = [li.text.strip() for li in soup.find_all("li")]  # Assumes the ingredients are in <li>; this also picks up menu items
            instructions = [p.text.strip() for p in soup.find_all("p")]  # Assumes the instructions are in <p>
            description_tag = soup.find('meta', {'name': 'description'})
            description = description_tag['content'] if description_tag else "No description"

            # Build one dictionary per recipe
            recipes.append({
                'title': title,
                'ingredients': ingredients,
                'instructions': instructions,
                'description': description,
                'full_text': f"{title}. {description}. Ingredientes: {', '.join(ingredients)}. Instrucciones: {' '.join(instructions)}"
            })

# Convert the recipes to a DataFrame
recipes_df = pd.DataFrame(recipes)

# Create the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Generando embeddings...")

# Generate embeddings for the full text of each recipe
embeddings = model.encode(recipes_df['full_text'].tolist(), convert_to_numpy=True)

# Create the FAISS index (L2 distance)
index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance
index.add(embeddings)  # Add the embeddings to the index

# Store the embeddings in the DataFrame so they are available for analysis
recipes_df['embeddings'] = embeddings.tolist()  # Convert the numpy array to a list for the DataFrame


Generando embeddings...
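
The index built above ranks results by L2 distance, not by cosine similarity. For unit-length vectors the two are tied by d² = 2 − 2·cos, so they give the same ranking; for vectors of different lengths they can disagree. Whether the embeddings produced above are unit-length is not shown in the notebook (passing `normalize_embeddings=True` to `model.encode` would enforce it). The sketch below illustrates the relationship with made-up 2-D vectors in pure Python, without FAISS.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [1.0, 0.0]
docs = {"short": [0.9, 0.1], "long": [10.0, 0.5]}  # made-up vectors

for name, vec in docs.items():
    raw = squared_l2(query, vec)
    unit = squared_l2(normalize(query), normalize(vec))
    print(f"{name}: cosine={cosine(query, vec):.4f} raw_l2={raw:.4f} unit_l2={unit:.4f}")
```

In this toy case the raw L2 distance prefers `short` while cosine similarity prefers `long`; after normalization both measures agree.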


In [24]:
!pip install google-cloud-language python-dotenv




In [25]:
!pip install requests python-dotenv




In [35]:
def seeker(query, k=5):
    """
    Search for recipes that match a query.
    Returns the top k recipes ranked by L2 distance between embeddings
    (the index is a faiss.IndexFlatL2, so a smaller distance means a closer match).
    """
    # Get the embedding for the query
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Search the FAISS index
    distances, indices = index.search(query_embedding, k)

    context_recipes = []  # Kept for inspection; not returned
    recipe_summaries = []

    # Collect the most relevant recipes
    for i in range(k):
        recipe = recipes_df.iloc[indices[0][i]]
        context_recipes.append({
            "title": recipe['title'],
            "description": recipe['description'],
            "distance": distances[0][i]
        })

        # Build a summary for the recipe
        summary = f"{recipe['title']} - {recipe['description']}"
        recipe_summaries.append(summary)

    # Return the relevant recipes
    return recipe_summaries

# Usage example
query = "¿Cómo puedo hacer una cena rápida y saludable?"
response = seeker(query)
print("Respuestas relevantes:")
for res in response:
    print(res)


Respuestas relevantes:
Beef Stifado in the Slow Cooker Recipe - This Greek stifado features chunks of beef slow-cooked until tender with onion, shallots, herbs, and spices in a rich and aromatic tomato sauce.
Mongo Guisado (Mung Bean Soup) Recipe - This mongo guisado (mung bean soup) recipe is a hearty soup that uses mung beans simmered in chicken broth with prawns, diced pork, tomatoes, and spinach.
Lasagna Flatbread Recipe - Give lasagna a quick and easy pizza makeover by baking sausage, ricotta, marinara sauce, and mozzarella on flatbreads.
Taco Bell Seasoning Copycat Recipe - Copycat Taco Bell seasoning is fast and simple to make and will make your tacos taste like the popular Mexican chain for an easy taco Tuesday dinner.
Caldo de Pollo Recipe - Caldo de pollo is a simple but richly-flavored chicken soup that's packed with vegetables and seasoned with garlic and lots of fresh cilantro.
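
The cell above covers the retrieval half of RAG. The generation half, which is not shown in the notebook, would pass the retrieved summaries to a language model together with the question. The sketch below only builds such a prompt; `build_prompt` and its wording are assumptions, and the call to an actual model is deliberately left out.

```python
def build_prompt(question: str, summaries: list[str]) -> str:
    """Combine retrieved recipe summaries and the user question into one prompt."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(summaries, start=1))
    return (
        "Answer the question using only the recipes listed below. "
        "If none of them fits, say so.\n\n"
        f"Recipes:\n{context}\n\n"
        f"Question: {question}\n"
    )


retrieved = [
    "Lasagna Flatbread Recipe - Give lasagna a quick and easy pizza makeover by baking sausage, ricotta, marinara sauce, and mozzarella on flatbreads.",
    "Caldo de Pollo Recipe - Caldo de pollo is a simple but richly-flavored chicken soup that's packed with vegetables and seasoned with garlic and lots of fresh cilantro.",
]
print(build_prompt("¿Cómo puedo hacer una cena rápida y saludable?", retrieved))
```

In the notebook, `retrieved` would be the list returned by `seeker(query)`. Numbering the recipes lets the model's answer point back to a specific source.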
