<a href="https://colab.research.google.com/github/lucas-bruzzone/Recipes-Dataset-64k-Dishes/blob/main/Recipe_RAG_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recipe RAG System
**Retrieval-Augmented Generation for Culinary Recipes**

A complete system that:
- Retrieves similar recipes using semantic embeddings
- Suggests complementary ingredients
- Generates complete recipes with an LLM
- Filters by category and dietary restrictions

## Setup and Installation

In [None]:
!pip install -q sentence-transformers transformers torch scikit-learn

In [None]:
import pandas as pd
import numpy as np
import json
import re
from collections import Counter, defaultdict
from itertools import combinations
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM
import torch

print("✓ Imports complete")

✓ Imports complete


## Mounting Drive and Loading the Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
dataset_path = '/content/drive/MyDrive/Colab Notebooks/Recipes Dataset : 64k Dishes/dataset/1_Recipe_csv.csv'
df = pd.read_csv(dataset_path)

print(f"Dataset loaded: {len(df)} recipes")
display(df.head())

Dataset loaded: 62126 recipes


| Unnamed: 0 | recipe_title | category | subcategory | description | ingredients | directions | num_ingredients | num_steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Air Fryer Potato Slices with Dipping Sauce | Air Fryer Recipes | Air Fryer Recipes | These air fryer potato slices, served with a b... | ["3/4 cup ketchup", "1/2 cup beer", "1 tablesp... | ["Combine ketchup, beer, Worcestershire sauce,... | 9 | 5 |
| 1 | Gochujang Pork Belly Bites | Air Fryer Recipes | Air Fryer Recipes | These gochujang pork belly bites are sweet and... | ["1 pound pork belly", "1/4 cup gochujang", "2... | ["Preheat an air fryer to 400 degrees F (200 d... | 5 | 4 |
| 2 | 3-Ingredient Air Fryer Everything Bagel Chicke... | Air Fryer Recipes | Air Fryer Recipes | These 3-ingredient air fryer everything bagel ... | ["1 \u00bc pounds chicken tenders", "1 tablesp... | ["Gather all ingredients. Preheat an air fryer... | 3 | 4 |
| 3 | Air Fryer Everything Bagel Chicken Cutlets | Air Fryer Recipes | Air Fryer Recipes | These air fryer everything bagel chicken cutle... | ["4 chicken cutlets (about 1 pound total)", "s... | ["Preheat an air fryer to 400 degrees F (200 d... | 9 | 9 |
| 4 | Air Fryer Honey Sriracha Salmon Bites | Air Fryer Recipes | Air Fryer Recipes | These air fryer honey Sriracha salmon bites ar... | ["1 tablespoon soy sauce", "1 tablespoon honey... | ["Preheat an air fryer to 400 degrees F (200 d... | 5 | 5 |


## Preprocessing

In [None]:
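def parse_ingredients(ingredient_str):
    """Parse a JSON-encoded ingredient list; return [] if it is missing or malformed."""
    try:
        return json.loads(ingredient_str)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers non-string cells such as NaN
        return []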

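def normalize_ingredient(ingredient):
    """Reduce a raw ingredient line to a lowercase name without quantities or units."""
    # Remove ASCII numbers, simple fractions and ranges (e.g. "3/4", "1-2")
    ingredient = re.sub(r'\d+[\s]*[/-]?[\s]*\d*', '', ingredient)
    # Remove measurement units (whole words only, optional plural "s")
    units = ['cup', 'cups', 'tablespoon', 'tablespoons', 'teaspoon', 'teaspoons',
             'pound', 'pounds', 'ounce', 'ounces', 'gram', 'grams', 'kg', 'ml',
             'liter', 'liters', 'tbsp', 'tsp', 'oz', 'lb', 'g']
    for unit in units:
        ingredient = re.sub(rf'\b{unit}s?\b', '', ingredient, flags=re.IGNORECASE)
    # Remove parenthetical notes, then commas and semicolons
    ingredient = re.sub(r'\([^)]*\)', '', ingredient)
    ingredient = re.sub(r'[,;]', '', ingredient)
    # Collapse repeated whitespace
    ingredient = ' '.join(ingredient.split())
    return ingredient.lower().strip()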

print("Processing ingredients...")

df['ingredients_list'] = df['ingredients'].apply(parse_ingredients)
# Normalize each ingredient once and drop the ones that end up empty
df['ingredients_normalized'] = df['ingredients_list'].apply(
    lambda x: [n for n in map(normalize_ingredient, x) if n]
)

# Count how often each normalized ingredient occurs across all recipes
all_ingredients = []
for ingredients in df['ingredients_normalized']:
    all_ingredients.extend(ingredients)

ingredient_vocab = Counter(all_ingredients)
MIN_FREQUENCY = 5  # Keep ingredients that occur at least this many times
filtered_vocab = {k: v for k, v in ingredient_vocab.items() if v >= MIN_FREQUENCY}

df['ingredients_filtered'] = df['ingredients_normalized'].apply(
    lambda x: [i for i in x if i in filtered_vocab]
)

# Text that gets embedded for each recipe
df['recipe_text'] = df['ingredients_filtered'].apply(lambda x: ', '.join(x))

print(f"✓ Unique ingredients: {len(filtered_vocab)}")
print(f"✓ Recipes processed: {len(df)}")

Processing ingredients...
✓ Unique ingredients: 13040
✓ Recipes processed: 62126
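
The retrieval and suggestion outputs later in this notebook contain leftovers such as `½ white sugar` and `/ sugar`, which suggests that the ASCII-only digit pattern in `normalize_ingredient` misses Unicode fraction characters and can leave a stray slash behind. The following is a minimal sketch of a stricter cleanup, not the notebook's method: `strip_quantities` and `_UNITS` are hypothetical names, and swapping it in would change the vocabulary size reported above.

```python
import re
import unicodedata

# Hypothetical helper (not part of the notebook). Same unit list as
# normalize_ingredient, written as a single alternation.
_UNITS = (r"(?:cups?|tablespoons?|teaspoons?|pounds?|ounces?|grams?|kg|ml|"
          r"liters?|tbsp|tsp|oz|lb|g)")

def strip_quantities(text: str) -> str:
    """Return a lowercase ingredient name without quantities or units."""
    # Drop every character Unicode classifies as a number: ASCII digits
    # (category Nd) as well as vulgar fractions such as "½" (category No).
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("N"))
    # Remove parenthetical notes, then whole-word units.
    text = re.sub(r"\([^)]*\)", " ", text)
    text = re.sub(rf"\b{_UNITS}\b", " ", text, flags=re.IGNORECASE)
    # Commas and semicolons become spaces.
    text = re.sub(r"[,;]", " ", text)
    # Keep only tokens that contain at least one letter, which discards
    # a lone "/" or "-" left over from "1/2" or "1-2".
    tokens = [t for t in text.split() if any(c.isalpha() for c in t)]
    return " ".join(tokens).lower()

print(strip_quantities("½ cup white sugar"))
print(strip_quantities("1/2 cup sugar"))
```

One trade-off of this sketch is that digits inside product names are removed as well, so it may be too aggressive for branded ingredients.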


## Loading Models

In [None]:
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✓ Embedding model: all-MiniLM-L6-v2 (384D)")

Loading embedding model...
✓ Embedding model: all-MiniLM-L6-v2 (384D)


In [None]:
# OPTION A: Flan-T5 Large (770M parameters; recommended for GPUs with 8 GB+ of memory)
model_name = "google/flan-t5-large"

# OPTION C: Mistral 7B (better quality; requires a GPU with 24 GB+ of memory).
# It is a decoder-only model, so it would need AutoModelForCausalLM and a
# "text-generation" pipeline instead of the seq2seq classes used below.
# model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype=torch.float16  # Half precision to save memory
)

text_generator = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer
    )

print(f"✓ Model loaded: {model_name}")


Device set to use cuda:0


✓ Model loaded: google/flan-t5-large


## Creating Embeddings

In [None]:
print("Generating recipe embeddings...")
recipe_embeddings = embedding_model.encode(
    df['recipe_text'].tolist(),
    batch_size=256,
    show_progress_bar=True,
    convert_to_tensor=True
)
recipe_embeddings = recipe_embeddings.cpu().numpy()
print(f"✓ Shape: {recipe_embeddings.shape}")

print("\nGenerating ingredient embeddings...")
unique_ingredients = list(filtered_vocab.keys())
ingredient_embeddings = embedding_model.encode(
    unique_ingredients,
    batch_size=256,
    show_progress_bar=True,
    convert_to_tensor=True
)
ingredient_embeddings = ingredient_embeddings.cpu().numpy()
# Map each ingredient name to its row in ingredient_embeddings
ingredient_to_idx = {ing: idx for idx, ing in enumerate(unique_ingredients)}

print(f"✓ Ingredient embeddings: {ingredient_embeddings.shape}")
print("✓ Index ready")

Generating recipe embeddings...
✓ Shape: (62126, 384)

Generating ingredient embeddings...
✓ Ingredient embeddings: (13040, 384)
✓ Index ready


## Retrieval System

In [None]:
def retrieve_similar_recipes(query_ingredients, top_k=5):
    """Semantic search for recipes similar to the query ingredients"""
    normalized = [normalize_ingredient(ing) for ing in query_ingredients]
    query_text = ', '.join(normalized)

    query_embedding = embedding_model.encode([query_text], convert_to_tensor=True)
    query_embedding = query_embedding.cpu().numpy()

    similarities = cosine_similarity(query_embedding, recipe_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k*3]  # Over-fetch to make up for duplicates

    results = []
    seen_titles = set()  # Titles already returned, used to skip duplicates

    for idx in top_indices:
        recipe = df.iloc[idx]
        title = recipe['recipe_title']

        # Skip duplicate recipes
        if title in seen_titles:
            continue

        seen_titles.add(title)
        results.append({
            'idx': idx,
            'title': title,
            'category': recipe['category'],
            'ingredients': recipe['ingredients_filtered'],
            'directions': json.loads(recipe['directions']) if isinstance(recipe['directions'], str) else recipe['directions'],
            'description': recipe['description'],
            'similarity': similarities[idx]
        })

        if len(results) >= top_k:  # Stop once enough unique recipes are collected
            break

    return results
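
The function above over-fetches `top_k * 3` candidates before deduplicating, which assumes that at most two thirds of the best matches are duplicates, and any filter applied afterwards (such as the category filter in the RAG function later on) can only shrink the list further. A minimal sketch of an alternative, assuming the full similarity array is available: walk the whole ranking and stop as soon as `k` distinct, acceptable rows are found. The names `top_k_unique` and `keep` are hypothetical and not part of the notebook.

```python
from typing import Callable, List, Optional, Sequence

def top_k_unique(
    scores: Sequence[float],
    titles: Sequence[str],
    k: int,
    keep: Optional[Callable[[int], bool]] = None,
) -> List[int]:
    """Return the indices of the k best-scoring rows with distinct titles.

    `keep` is an optional predicate on the row index (for example a
    category check); rows for which it returns False are skipped.
    """
    if k <= 0:
        return []
    # Rank every row by descending score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    seen = set()
    for i in ranked:
        if keep is not None and not keep(i):
            continue
        if titles[i] in seen:
            continue
        seen.add(titles[i])
        chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

# Toy data standing in for the similarity array and the DataFrame columns.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
titles = ["A", "A", "B", "C", "A"]
categories = ["x", "x", "y", "x", "y"]
print(top_k_unique(scores, titles, 2))
print(top_k_unique(scores, titles, 2, keep=lambda i: categories[i] == "y"))
```

Sorting the full ranking costs more than slicing a fixed window, but for an array of this size the difference is likely negligible next to the embedding step.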

### Retrieval Test

In [None]:
test_query = ["chicken", "garlic", "soy sauce"]
print(f"Query: {test_query}\n")

retrieved = retrieve_similar_recipes(test_query, top_k=5)
for i, recipe in enumerate(retrieved, 1):
    print(f"{i}. {recipe['title']} (sim: {recipe['similarity']:.3f})")
    print(f"   Category: {recipe['category']}")
    print(f"   Ingredients: {', '.join(recipe['ingredients'][:5])}...\n")

Query: ['chicken', 'garlic', 'soy sauce']

1. Key West Chicken (sim: 0.852)
   Category: Labor Day
   Ingredients: soy sauce, honey, vegetable oil, lime juice, chopped garlic...

2. Sweet and Spicy Green Beans (sim: 0.851)
   Category: Labor Day
   Ingredients: soy sauce, clove garlic minced, honey, canola oil...

3. Japchae (sim: 0.839)
   Category: Beef Recipes
   Ingredients: soy sauce, ½ white sugar, sesame oil, minced garlic...

4. Spicy Korean Chicken (sim: 0.836)
   Category: Korean
   Ingredients: soy sauce, vegetable oil, sesame oil, cloves garlic minced...

5. Tri-Tip Bulgogi (sim: 0.829)
   Category: Beef Recipes
   Ingredients: onion, soy sauce, cloves garlic...



In [None]:
def retrieve_complementary_ingredients(base_ingredients, top_k=10):
    """Find complementary ingredients, with a diversity filter"""
    suggestions = defaultdict(float)
    base_normalized = [normalize_ingredient(ing) for ing in base_ingredients]

    # Words to filter out (too generic)
    generic_words = {'sauce', 'oil', 'water', 'salt', 'pepper', 'sugar', 'flour'}

    for base_ing in base_normalized:
        if base_ing not in ingredient_to_idx:
            continue

        base_idx = ingredient_to_idx[base_ing]
        base_emb = ingredient_embeddings[base_idx].reshape(1, -1)

        similarities = cosine_similarity(base_emb, ingredient_embeddings)[0]

        for idx, sim in enumerate(similarities):
            ing = unique_ingredients[idx]

            # Skip ingredients that are already present
            if ing in base_normalized:
                continue

            # Filter generic single words (compounds such as "soy sauce" are still accepted)
            if ing in generic_words and ' ' not in ing:
                continue

            suggestions[ing] += sim

    # Sort by accumulated similarity, then drop near-duplicates
    sorted_sugg = sorted(suggestions.items(), key=lambda x: x[1], reverse=True)

    # Filter redundancy (e.g. once "soy sauce" is kept, "sauce" alone adds nothing)
    filtered = []
    seen_words = set()

    for ing, score in sorted_sugg:
        words = set(ing.split())

        # Accept the ingredient if it contributes at least one word not seen in earlier picks
        if len(words & seen_words) < len(words):
            filtered.append((ing, score))
            seen_words.update(words)

        if len(filtered) >= top_k:
            break

    return filtered
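
The function above scores candidates by embedding similarity to the base ingredients, which tends to surface near-synonyms; the test outputs further down, for instance, suggest `tomatoes` and `roma tomato` for `tomato`. A different signal, which is not what this notebook uses, is co-occurrence: count which ingredients appear in the same recipes as the base ingredients. The sketch below shows that idea on toy data; `build_cooccurrence` and `complementary_by_cooccurrence` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(recipes):
    """Count how often each unordered ingredient pair shares a recipe."""
    pair_counts = Counter()
    for ingredients in recipes:
        # sorted(set(...)) gives each pair one canonical (a, b) order.
        for a, b in combinations(sorted(set(ingredients)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def complementary_by_cooccurrence(base, pair_counts, top_k=5):
    """Rank non-base ingredients by how often they co-occur with the base."""
    base_set = set(base)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in base_set and b not in base_set:
            scores[b] += n
        elif b in base_set and a not in base_set:
            scores[a] += n
    return scores.most_common(top_k)

# Toy recipes standing in for df['ingredients_filtered'].
toy_recipes = [
    ["chicken", "garlic", "soy sauce"],
    ["chicken", "garlic", "ginger"],
    ["pasta", "tomato", "basil"],
    ["chicken", "ginger", "honey"],
]
pairs = build_cooccurrence(toy_recipes)
print(complementary_by_cooccurrence(["chicken"], pairs, top_k=2))
```

Raw counts favor ubiquitous ingredients such as salt, so a real system would probably normalize them (for example with pointwise mutual information) or combine them with the embedding score.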

## Complete RAG System

In [None]:
def generate_recipe_with_rag(
    base_ingredients,
    category=None,
    restrictions=None,
    top_k=5,
    verbose=True
):
    """Full RAG pipeline: retrieval, ingredient suggestion, and LLM generation"""

    if verbose:
        print(f"\n{'='*60}")
        print("RAG GENERATION")
        print(f"{'='*60}")
        print(f"\nIngredientes base: {base_ingredients}")
        if category:
            print(f"Categoria: {category}")
        if restrictions:
            print(f"Restrições: {restrictions}")

    # RETRIEVAL (with deduplication)
    if verbose:
        print(f"\n[1/3] Buscando receitas similares...")

    retrieved_recipes = retrieve_similar_recipes(base_ingredients, top_k=top_k)

    if category:
        retrieved_recipes = [r for r in retrieved_recipes
                           if category.lower() in r['category'].lower()][:top_k]

    if verbose:
        for i, r in enumerate(retrieved_recipes[:3], 1):
            print(f"    {i}. {r['title']} (sim: {r['similarity']:.2f})")

    # COMPLEMENTARY INGREDIENTS
    if verbose:
        print(f"\n[2/3] Buscando ingredientes complementares...")

    complementary = retrieve_complementary_ingredients(base_ingredients, top_k=8)

    if restrictions:
        restrictions_norm = [normalize_ingredient(r) for r in restrictions]
        complementary = [(ing, score) for ing, score in complementary
                        if ing not in restrictions_norm][:8]

    suggested_ingredients = [ing for ing, _ in complementary[:5]]
    all_ingredients = base_ingredients + suggested_ingredients

    if verbose:
        print(f"    Sugeridos: {', '.join(suggested_ingredients)}")

    # AUGMENTATION: improved prompt
    if verbose:
        print(f"\n[3/3] Gerando receita com LLM...")

    # Context built from the retrieved recipes
    context_recipes = []
    for r in retrieved_recipes[:3]:
        ing_list = ', '.join(r['ingredients'][:8])
        context_recipes.append(f"- {r['title']}: {ing_list}")

    context = "\n".join(context_recipes)

    # IMPROVED PROMPT
    prompt = f"""You are a professional chef. Create a complete, detailed recipe.

Reference recipes for inspiration:
{context}

Main ingredients to use: {', '.join(base_ingredients)}
Additional suggested ingredients: {', '.join(suggested_ingredients)}
Category: {category if category else 'Any'}
{f"Dietary restrictions (DO NOT use): {', '.join(restrictions)}" if restrictions else ""}

Generate a complete recipe with:

**Recipe Name:** [Creative, appetizing name]

**Ingredients:**
- [ingredient 1 with quantity]
- [ingredient 2 with quantity]
- [etc. - list ALL ingredients with measurements]

**Instructions:**
1. [Detailed first step]
2. [Detailed second step]
3. [Continue with clear, step-by-step instructions]
...

**Cooking Time:** [prep + cook time]

**Servings:** [number of servings]

Create the recipe:"""

    # GENERATION
    # NOTE: the pipeline already sets max_new_tokens=256, which takes precedence
    # over max_length (see the warning in the test outputs), and min_length=300
    # exceeds that cap.
    try:
        generated = text_generator(
            prompt,
            max_length=800,      # Room for a complete recipe
            min_length=300,      # Enforce a minimum level of detail
            num_return_sequences=1,
            temperature=0.85,    # Balance between creativity and coherence
            do_sample=True,
            top_p=0.92,
            top_k=50,
            repetition_penalty=1.2  # Discourage repetition
        )[0]['generated_text']

        # Strip the prompt from the output if the model echoed it
        if "Create the recipe:" in generated:
            generated = generated.split("Create the recipe:")[-1].strip()

    except Exception as e:
        generated = f"[Generation error: {str(e)}]"

    return {
        'base_ingredients': base_ingredients,
        'suggested_ingredients': suggested_ingredients,
        'all_ingredients': all_ingredients,
        'retrieved_recipes': retrieved_recipes,
        'generated_recipe': generated,
        'category': category,
        'restrictions': restrictions
    }

print("✓ RAG system updated and ready")

✓ RAG system updated and ready
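
The restriction check in `generate_recipe_with_rag` compares whole normalized strings, so a restriction such as `milk` does not exclude a compound like `chocolate milk` (the dairy-free test below shows exactly that suggestion). A minimal sketch of a word-level check follows, under the assumption that matching on whole words is acceptable; `violates_restriction` and `filter_restricted` are hypothetical helpers, and the plural handling is deliberately naive.

```python
import re

def _singular(word: str) -> str:
    """Very rough singularization: drop a trailing 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def _words(text: str) -> list:
    """Lowercase alphabetic words of `text`, roughly singularized."""
    return [_singular(w) for w in re.findall(r"[a-z]+", text.lower())]

def violates_restriction(ingredient: str, restrictions) -> bool:
    """True if every word of some restriction occurs in the ingredient."""
    ingredient_words = set(_words(ingredient))
    for restriction in restrictions:
        restriction_words = _words(restriction)
        if restriction_words and all(w in ingredient_words
                                     for w in restriction_words):
            return True
    return False

def filter_restricted(candidates, restrictions):
    """Keep only (ingredient, score) pairs that pass the restriction check."""
    return [(ing, score) for ing, score in candidates
            if not violates_restriction(ing, restrictions)]

candidates = [("chocolate syrup", 1.0), ("chocolate milk", 0.9),
              ("white chocolate", 0.8)]
print(filter_restricted(candidates, ["milk", "butter", "cream"]))
```

This errs on the side of exclusion: `coconut milk` and `peanut butter` would be rejected by a `milk` or `butter` restriction, while fused words such as `buttermilk` would still slip through, so a curated allergen list would be needed for anything safety-relevant.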


## RAG System Tests

### Test 1: Asian Recipe

In [None]:
result1 = generate_recipe_with_rag(
    base_ingredients=["chicken", "soy sauce", "ginger"],
    category="Asian",
    restrictions=None,
    top_k=5,
    verbose=True
)

print("\n" + "="*60)
print("RECEITA GERADA")
print("="*60)
print(result1['generated_recipe'])


RAG GENERATION

Ingredientes base: ['chicken', 'soy sauce', 'ginger']
Categoria: Asian

[1/3] Buscando receitas similares...

[2/3] Buscando ingredientes complementares...
    Sugeridos: chili sauce, pasta sauce, tomato sauce, fish sauce, fresh ginger

[3/3] Gerando receita com LLM...


Both `max_new_tokens` (=256) and `max_length`(=800) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



RECEITA GERADA
**Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:**
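
The warning above states that the pipeline's default `max_new_tokens=256` overrides `max_length=800`. Combined with `min_length=300`, this likely means the model is never allowed to emit its end-of-sequence token before hitting the 256-token cap, which may contribute to the repeated output. A minimal sketch of a consistent set of length arguments follows; `build_generation_kwargs` is a hypothetical helper, while `max_new_tokens` and `min_new_tokens` are standard Transformers generation parameters. The sampling values are copied from the call in `generate_recipe_with_rag`.

```python
def build_generation_kwargs(max_new_tokens=256, min_new_tokens=0, sample=True):
    """Build keyword arguments for a generation call with coherent lengths.

    Only the *_new_tokens family is used, so the values cannot conflict
    with a max_length default set elsewhere.
    """
    if min_new_tokens > max_new_tokens:
        raise ValueError(
            f"min_new_tokens ({min_new_tokens}) must not exceed "
            f"max_new_tokens ({max_new_tokens})"
        )
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": 1,
        "repetition_penalty": 1.2,
    }
    if min_new_tokens > 0:
        kwargs["min_new_tokens"] = min_new_tokens
    if sample:
        # Sampling settings taken from the notebook's original call.
        kwargs.update(do_sample=True, temperature=0.85, top_p=0.92, top_k=50)
    else:
        # Deterministic alternative: greedy decoding.
        kwargs.update(do_sample=False)
    return kwargs

print(build_generation_kwargs(max_new_tokens=256, min_new_tokens=64))
# Assumed usage with the notebook's pipeline object:
# text_generator(prompt, **build_generation_kwargs())[0]["generated_text"]
```

Whether this alone yields a usable recipe is untested here; the prompt length relative to the model's input limit and the half-precision setting are other factors worth checking.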


### Test 2: Dairy-Free Dessert

In [None]:
result2 = generate_recipe_with_rag(
    base_ingredients=["chocolate", "flour", "sugar"],
    category="Desserts",
    restrictions=["milk", "butter", "cream"],
    top_k=5,
    verbose=True
)

print("\n" + "="*60)
print("RECEITA GERADA")
print("="*60)
print(result2['generated_recipe'])


RAG GENERATION

Ingredientes base: ['chocolate', 'flour', 'sugar']
Categoria: Desserts
Restrições: ['milk', 'butter', 'cream']

[1/3] Buscando receitas similares...

[2/3] Buscando ingredientes complementares...


Both `max_new_tokens` (=256) and `max_length`(=800) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


    Sugeridos: chocolate syrup, vanilla sugar, chocolate milk, white chocolate, / sugar

[3/3] Gerando receita com LLM...

RECEITA GERADA
**Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:**


### Test 3: Italian Dish

In [None]:
result3 = generate_recipe_with_rag(
    base_ingredients=["pasta", "tomato", "basil"],
    category="Italian",
    restrictions=None,
    top_k=5,
    verbose=True
)

print("\n" + "="*60)
print("RECEITA GERADA")
print("="*60)
print(result3['generated_recipe'])


RAG GENERATION

Ingredientes base: ['pasta', 'tomato', 'basil']
Categoria: Italian

[1/3] Buscando receitas similares...

[2/3] Buscando ingredientes complementares...


Both `max_new_tokens` (=256) and `max_length`(=800) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


    Sugeridos: pasta sauce, spaghetti, tomatoes, tomato pasta sauce, roma tomato

[3/3] Gerando receita com LLM...

RECEITA GERADA
**Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:** **Ingredients:**
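
All three generated outputs above consist of a single marker repeated over and over, yet the function returns them as if they were recipes. One way to catch this automatically is to measure how many word n-grams in the output are repeats. The sketch below is an assumption-laden illustration: `repetition_ratio` and `is_degenerate` are hypothetical helpers, and the 0.5 threshold is an arbitrary starting point rather than a tuned value.

```python
def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate another n-gram in the text.

    0.0 means every n-gram is unique; values near 1.0 mean the text is
    almost entirely repetition.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def is_degenerate(text: str, threshold: float = 0.5) -> bool:
    """Flag outputs whose repetition ratio exceeds the threshold."""
    return repetition_ratio(text) > threshold

looped = "**Ingredients:** " * 32
print(is_degenerate(looped))
print(is_degenerate("Mix flour and sugar then bake until golden"))
```

A check like this could gate the returned `generated_recipe`, for example by retrying with different generation settings when the output is flagged.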


## Simplified Interface

In [None]:
def quick_recipe_generation(base_ingredients, category=None):
    """Simplified interface for quick generation"""
    result = generate_recipe_with_rag(
        base_ingredients=base_ingredients,
        category=category,
        restrictions=None,
        top_k=5,
        verbose=False
    )

    print(f"\n{'='*60}")
    print("GENERATED RECIPE")
    print(f"{'='*60}")
    print(f"\nBase ingredients: {', '.join(result['base_ingredients'])}")
    print(f"Suggested ingredients: {', '.join(result['suggested_ingredients'])}")
    if result['category']:
        print(f"Category: {result['category']}")
    print("\nReference recipes:")
    for i, r in enumerate(result['retrieved_recipes'][:3], 1):
        print(f"  {i}. {r['title']}")
    print(f"\n{'-'*60}")
    print(result['generated_recipe'])
    print(f"{'-'*60}")

    return result

print("✓ Simplified interface ready")

✓ Simplified interface ready


### Usage Example

In [None]:
# Try it with your own ingredients
result = quick_recipe_generation(
    base_ingredients=['beef', 'onion', 'garlic'],
    category='Mexican'
)

Both `max_new_tokens` (=256) and `max_length`(=800) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


## System Statistics

In [None]:
print(f"{'='*60}")
print("RAG SYSTEM STATISTICS")
print(f"{'='*60}")

print("\n📊 Dataset:")
print(f"  • Total recipes: {len(df):,}")
print(f"  • Recipes with embeddings: {len(recipe_embeddings):,}")
print(f"  • Unique ingredients: {len(ingredient_to_idx):,}")
print(f"  • Categories: {df['category'].nunique()}")

print("\n🔧 Models:")
print("  • Embedding: all-MiniLM-L6-v2 (384D)")
# Report the model that was actually loaded rather than a hard-coded name
print(f"  • LLM: {model_name}")
print(f"  • Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

print("\n✓ System complete and functional")