# 🍽️ Graph-Based Recipe Recommendation Engine  
### Using Retrieval-Augmented Generation (RAG) + Neo4j

---

**👤 Author:** *Natasha Fatima*    
**🧠 Technologies:** LangChain • Neo4j • Python • RAG • GraphDocument  

---

## 📘 Project Overview

This notebook demonstrates how to build a **graph-driven recipe recommendation engine** using:

- 🗂️ **Recipe knowledge graph** stored in **Neo4j**  
- 🔍 **RAG (Retrieval-Augmented Generation)** to fetch relevant recipe data  
- 🧩 **GraphDocument** to convert recipe files into structured graph nodes  
- 🤖 LLM-powered recipe understanding and recommendations  

The system loads recipe files (PDF, HTML, TXT, Python), extracts structured information, converts it into a graph, and enables **intelligent recipe recommendations**.

---



In [1]:
!pip install --upgrade pip
!pip install langchain pymupdf unstructured sentence-transformers scikit-learn faiss-cpu tiktoken nltk matplotlib pillow rank-bm25 python-dotenv




In [32]:
import sys
!{sys.executable} -m pip install --upgrade pip



In [33]:
import sys

packages = [
    "langchain", 
    "pymupdf", 
    "unstructured",
    "sentence-transformers",
    "scikit-learn", 
    "faiss-cpu",
    "tiktoken",
    "nltk",
    "matplotlib",
    "pillow",
    "rank-bm25", 
    "python-dotenv"
]

print("Installing required packages...")

for package in packages:
    print(f"\n Installing {package}...")
    !{sys.executable} -m pip install {package}

print("\n Package installation complete!")


Installing required packages...

 Installing langchain...
 Installing pymupdf...

 Installing unstructured...


 Installing sentence-transformers...

 Installing scikit-learn...
 Installing faiss-cpu...

 Installing tiktoken...


 Installing nltk...

 Installing matplotlib...

 Installing pillow...
 Installing rank-bm25...


 Installing python-dotenv...
 Package installation complete!



In [35]:
# Verify installations
print("🔍 Verifying package installations...")

try:
    import langchain
    import fitz  # pymupdf
    from sentence_transformers import SentenceTransformer
    import sklearn
    import faiss
    import tiktoken
    import nltk
    import matplotlib.pyplot as plt
    from PIL import Image
    from rank_bm25 import BM25Okapi
    from dotenv import load_dotenv
    
    print(" All packages imported successfully!")
    
    # Download required NLTK data
    print(" Downloading NLTK data...")
    nltk.download('punkt', quiet=True)
    print(" NLTK punkt downloaded")
    
except ImportError as e:
    print(f" Import failed: {e}")
    print("Please install missing packages manually")

print("\n Ready for the next step!")


🔍 Verifying package installations...
 All packages imported successfully!
 Downloading NLTK data...
 NLTK punkt downloaded

 Ready for the next step!


# Part 1: Data Collection & Preprocessing

## 1.1 Sample Data Creation

**Objective:**  
Create sample recipe files in multiple formats to test the text loading and splitting pipeline.

---

### Files Being Created

| File Type         | Content                   | Purpose                     |
|------------------|--------------------------|-----------------------------|
| `recipes.pdf`     | Classic Pancakes Recipe   | PDF format testing          |
| `recipes.html`    | Chocolate Chip Cookies    | HTML parsing testing        |
| `recipes.txt`     | Vegetable Stir Fry        | Plain text processing       |
| `recipe_utils.py` | Recipe utility functions  | Code file analysis          |

---

### File Contents

- **PDF:** Complete recipe with ingredients and instructions  
- **HTML:** Structured recipe using semantic HTML tags  
- **TXT:** Simple recipe format for basic text processing  
- **Python:** Utility functions for recipe operations (unit conversion, allergy checks)  

---

**Note:**  
These sample files simulate real-world documents that will be processed in a recipe recommendation system. This setup allows testing **file loading, parsing, and text splitting methods** for multiple formats.
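
To make the Python sample file concrete, here is a small sketch that exercises the two helpers this notebook writes to `recipe_utils.py`. The function bodies are copied from that file; the sample calls and variable names are illustrative assumptions. Two behaviors are worth knowing: an unknown unit pair silently falls back to a factor of 1, and the allergy check lowercases only the ingredients, not the allergy terms.

```python
def convert_units(amount, from_unit, to_unit):
    conversions = {
        ('cups', 'ml'): 240,
        ('tbsp', 'ml'): 15,
        ('tsp', 'ml'): 5
    }
    # Unknown (from_unit, to_unit) pairs fall back to a factor of 1
    return amount * conversions.get((from_unit, to_unit), 1)

def check_allergies(ingredients, allergies):
    # Substring match against the lowercased string form of the ingredient list
    return any(allergy in str(ingredients).lower() for allergy in allergies)

known = convert_units(2, "cups", "ml")        # supported pair
unknown = convert_units(1, "cups", "oz")      # unsupported pair, returned unchanged
match = check_allergies(["2 eggs", "1 cup butter"], ["egg"])
no_match = check_allergies(["2 eggs"], ["Egg"])  # capitalized term never matches

print(known, unknown, match, no_match)
```

If stricter behavior is wanted, raising an error for unknown unit pairs and lowercasing the allergy terms would be natural changes.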


In [36]:

!pip install fpdf

from fpdf import FPDF

print( "Creating sample files for Part 1...")

# --------------------
# Create PDF file 
# --------------------
pdf_content = """Classic Pancakes Recipe

Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1 cup milk
- 1 large egg
- 2 tablespoons melted butter

Instructions:
1. In a large bowl, mix flour, sugar, baking powder, and salt.
2. Make a well in the center and pour in milk, egg, and melted butter.
3. Mix until smooth.
4. Heat a lightly oiled griddle over medium-high heat.
5. Pour batter onto the griddle.
6. Cook until bubbles form and edges are dry.
7. Flip and cook until browned."""

pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)

for line in pdf_content.split("\n"):
    pdf.cell(0, 8, txt=line, ln=True)

pdf.output("recipes.pdf")
print(" PDF created successfully!")

# --------------------
# Create HTML file
# --------------------
html_content = """<html>
<body>
<h1>Chocolate Chip Cookies</h1>
<h2>Ingredients</h2>
<ul>
<li>2 cups flour</li>
<li>1 cup butter</li>
<li>1 cup chocolate chips</li>
<li>2 eggs</li>
</ul>
<h2>Instructions</h2>
<ol>
<li>Preheat oven to 375°F.</li>
<li>Mix ingredients.</li>
<li>Bake for 10-12 minutes.</li>
</ol>
</body>
</html>"""
with open("recipes.html", "w", encoding="utf-8") as f:
    f.write(html_content)
print(" HTML created successfully!")

# --------------------
# Create TXT file
# --------------------
txt_content = """Vegetable Stir Fry

Ingredients:
- 2 cups mixed vegetables
- 1 tbsp oil
- 2 cloves garlic
- 3 tbsp soy sauce

Instructions:
1. Heat oil.
2. Add garlic.
3. Add vegetables.
4. Add sauce.
5. Cook for 5 minutes."""
with open("recipes.txt", "w", encoding="utf-8") as f:
    f.write(txt_content)
print(" TXT created successfully!")

# --------------------
# Create Python file
# --------------------
py_code = """def convert_units(amount, from_unit, to_unit):
    conversions = {
        ('cups', 'ml'): 240,
        ('tbsp', 'ml'): 15,
        ('tsp', 'ml'): 5
    }
    return amount * conversions.get((from_unit, to_unit), 1)

def check_allergies(ingredients, allergies):
    return any(allergy in str(ingredients).lower() for allergy in allergies)"""
with open("recipe_utils.py", "w", encoding="utf-8") as f:
    f.write(py_code)
print(" Python file created successfully!")

# --------------------
# Preview all files
# --------------------
print("\n--- Preview of created files ---")
for file in ["recipes.pdf", "recipes.html", "recipes.txt", "recipe_utils.py"]:
    if file != "recipes.pdf":  # Skip PDF for text preview
        with open(file, "r", encoding="utf-8") as f:
            content = f.read()
            print(f"\n{file} preview:\n{content[:300]}...")

print("\n All sample files are ready!")


Creating sample files for Part 1...
 PDF created successfully!
 HTML created successfully!
 TXT created successfully!
 Python file created successfully!

--- Preview of created files ---

recipes.html preview:
<html>
<body>
<h1>Chocolate Chip Cookies</h1>
<h2>Ingredients</h2>
<ul>
<li>2 cups flour</li>
<li>1 cup butter</li>
<li>1 cup chocolate chips</li>
<li>2 eggs</li>
</ul>
<h2>Instructions</h2>
<ol>
<li>Preheat oven to 375°F.</li>
<li>Mix ingredients.</li>
<li>Bake for 10-12 minutes.</li>
</ol>
</body>...

recipes.txt preview:
Vegetable Stir Fry

Ingredients:
- 2 cups mixed vegetables
- 1 tbsp oil
- 2 cloves garlic
- 3 tbsp soy sauce

Instructions:
1. Heat oil.
2. Add garlic.
3. Add vegetables.
4. Add sauce.
5. Cook for 5 minutes....

recipe_utils.py preview:
def convert_units(amount, from_unit, to_unit):
    conversions = {
        ('cups', 'ml'): 240,
        ('tbsp', 'ml'): 15,
        ('tsp', 'ml'): 5
    }
    return amount * conversions.get((from_unit, to_unit), 1)

def check

In [25]:
!pip install --upgrade langchain




In [24]:
!pip install unstructured pypdf



# Recipe Document Loading

## 📋 Overview
This section loads recipe documents from multiple file formats (PDF, HTML, TXT, and Python) using LangChain's document loaders. The loaded documents will be processed to extract recipe information for building the knowledge graph.

## ⚙️ How It Works
- **PDF Loader**: Extracts text content from PDF files using `PyPDFLoader`
- **HTML Loader**: Parses HTML files using `UnstructuredHTMLLoader`
- **Text Loader**: Reads plain text files with encoding fallback support
- **Python Loader**: Loads Python code files for any recipe-related utilities

## 🎯 Purpose
Prepare recipe documents for parsing into structured graph data by:
- Loading content from multiple file formats
- Handling encoding issues automatically
- Providing uniform document interface for processing
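
The encoding fallback and the uniform document shape described above can be sketched with the standard library alone. This is a minimal illustration, not the LangChain loaders themselves: `read_text_with_fallback` and `load_as_record` are hypothetical helper names, and the `type` and `encoding` metadata keys are assumptions added here for illustration.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("no encodings given")
    raise last_error

def load_as_record(path):
    """Load a text file into a dict shaped like a document: content plus metadata."""
    text, encoding = read_text_with_fallback(path)
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return {
        "page_content": text,
        "metadata": {"source": path, "type": ext or "unknown", "encoding": encoding},
    }

# Demo: a file containing a Latin-1 degree sign, which is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "recipes.txt")
    with open(demo_path, "wb") as f:
        f.write(b"Preheat oven to 375\xb0F.")
    record = load_as_record(demo_path)

print(record["metadata"]["type"], record["metadata"]["encoding"])
```

Recording a `type` at load time is one way to let later steps group chunks by file format.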

In [37]:
# ===============================
# Part 1: Load Recipe Documents 
# ===============================
!pip install pypdf unstructured langchain-community

from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredHTMLLoader
from langchain_community.document_loaders.python import PythonLoader

print(" Loading documents...")

pdf_docs, html_docs, txt_docs, code_docs = [], [], [], []

try:
    # Load PDF
    pdf_loader = PyPDFLoader("recipes.pdf")
    pdf_docs = pdf_loader.load()
    print(f" PDF loaded: {len(pdf_docs)} pages")
except Exception as e:
    print(f" PDF loading failed: {e}")

try:
    # Load HTML
    html_loader = UnstructuredHTMLLoader("recipes.html")
    html_docs = html_loader.load()
    print(f" HTML loaded: {len(html_docs)} documents")
except Exception as e:
    print(f" HTML loading failed: {e}")

try:
    # Load TXT
    txt_loader = TextLoader("recipes.txt", encoding='utf-8')
    txt_docs = txt_loader.load()
    print(f" TXT loaded: {len(txt_docs)} documents")
except Exception as e:
    try:
        # Try different encoding
        txt_loader = TextLoader("recipes.txt", encoding='latin-1')
        txt_docs = txt_loader.load()
        print(f" TXT loaded: {len(txt_docs)} documents")
    except Exception as e2:
        print(f" TXT loading failed: {e2}")

try:
    code_loader = PythonLoader("recipe_utils.py")
    code_docs = code_loader.load()
    print(f" Python code loaded: {len(code_docs)} documents")
except Exception as e:
    print(f" Python loading failed: {e}")

all_docs = pdf_docs + html_docs + txt_docs + code_docs

print(f"\n Total documents loaded: {len(all_docs)}")

if all_docs:
    print("\n Sample content from documents:")
    for i, doc in enumerate(all_docs):
        source = doc.metadata.get('source', 'Unknown')
        print(f"\n--- Document {i+1} ({source}) ---")
        content_preview = doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content
        print(content_preview)
else:
    print("\n No documents loaded.")

 Loading documents...
 PDF loaded: 1 pages
 HTML loaded: 1 documents
 TXT loaded: 1 documents
 Python code loaded: 1 documents

 Total documents loaded: 4

 Sample content from documents:

--- Document 1 (recipes.pdf) ---
Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1 cup milk
- 1 large egg
- 2 tablespoons melted butter
Instructions:
1. In a large bowl, mix flour, sugar, baking powder, and salt.
2. Make a well in the center and ...

--- Document 2 (recipes.html) ---
Chocolate Chip Cookies

Ingredients

2 cups flour

1 cup butter

1 cup chocolate chips

2 eggs

Instructions

Preheat oven to 375°F.

Mix ingredients.

Bake for 10-12 minutes.

--- Document 3 (recipes.txt) ---
Vegetable Stir Fry

Ingredients:
- 2 cups mixed vegetables
- 1 tbsp oil
- 2 cloves garlic
- 3 tbsp soy sauce

Instructions:
1. Heat oil.
2. Add garlic.
3. Add vegetables.
4. Add sauce.
5. Cook for 5 minutes.

--- Document 4 (rec

In [38]:
!pip install langchain langchain-community langchain-experimental tiktoken sentence-transformers



In [19]:
!pip install langchain-text-splitters



# 🔀 Text Splitting Methods

## 📋 Overview
This section implements three different text splitting techniques to break down our recipe documents into manageable chunks.

---

## 🅰️ Recursive Character Splitting

### ⚙️ Parameters
- **Chunk Size**: 200 characters
- **Chunk Overlap**: 50 characters  
- **Separators**: `["\n\n", "\n", ". ", " ", ""]`

### 🎯 How It Works
This method tries separators in order, falling back to the next one only when a piece is still too large:
1. First tries `\n\n` (double line breaks)
2. Then `\n` (single line breaks) 
3. Then `. ` (sentence endings)
4. Then ` ` (spaces)
5. Finally `""` (any character)
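
The separator hierarchy above can be sketched in plain Python. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`: the hypothetical `recursive_split` function ignores chunk overlap and only shows how coarse separators are tried before finer ones.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too large: retry it with the finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "Heat oil.\nAdd garlic.\n\nAdd vegetables and sauce."
pieces = recursive_split(sample, chunk_size=25)
print(pieces)
```

With a 25-character limit, the sample splits at the blank line first, so the two short instruction lines stay together in one chunk.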
   
### 📊 Actual Results
- **Total Chunks Created**: 9 chunks
- **Processing**: Applied to all four loaded documents (PDF, HTML, TXT, Python)
- **Context Preservation**: Up to 50 characters are repeated between consecutive chunks, so content near a boundary appears in both
- **Effectiveness**: Splits fall on paragraph, line, or sentence boundaries whenever a piece fits within the size limit

In [41]:
# ===============================
# a) Recursive Character Splitting
# ===============================

try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_core.documents import Document
except ImportError:
    !pip install langchain-text-splitters langchain-core
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_core.documents import Document

print(" Applying Recursive Character Splitting...")

langchain_docs = []
for doc in all_docs:
    if hasattr(doc, 'page_content'):
        # It's already a Document object
        langchain_docs.append(doc)
    else:
        langchain_docs.append(Document(
            page_content=doc["page_content"], 
            metadata=doc["metadata"]
        ))

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

recursive_chunks = recursive_splitter.split_documents(langchain_docs)

print(f" Recursive splitting produced {len(recursive_chunks)} chunks")

# Count chunks by the "type" metadata key; the loaders above do not set it, so every chunk is reported as "unknown"
print(f"\n Chunks per document type:")
doc_types = {}
for chunk in recursive_chunks:
    doc_type = chunk.metadata.get("type", "unknown")  
    doc_types[doc_type] = doc_types.get(doc_type, 0) + 1

for doc_type, count in doc_types.items():
    print(f"   - {doc_type}: {count} chunks")

# Display first 2 chunks as example
print(f"\n Sample chunks:")
for i, chunk in enumerate(recursive_chunks[:2]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(f"Source: {chunk.metadata.get('source', 'unknown')}")  
    print(chunk.page_content[:150] + "...")

 Applying Recursive Character Splitting...
 Recursive splitting produced 9 chunks

 Chunks per document type:
   - unknown: 9 chunks

 Sample chunks:

--- Chunk 1 (189 chars) ---
Source: recipes.pdf
Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1 cup milk
- 1 ...

--- Chunk 2 (191 chars) ---
Source: recipes.pdf
- 1 large egg
- 2 tablespoons melted butter
Instructions:
1. In a large bowl, mix flour, sugar, baking powder, and salt.
2. Make a well in the center ...


## 🅱️ Token-based Splitting

### ⚙️ Parameters
- **Chunk Size**: 100 tokens
- **Chunk Overlap**: 20 tokens
- **Tokenizer**: tiktoken (OpenAI)

### 🎯 How It Works
This method splits text based on token count instead of characters:
- 1 token ≈ 4 characters for English text
- Chunk sizes map directly onto LLM context-window limits
- Token counts track model input size more closely than character counts
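
The windowing behind token-based splitting can be sketched as a sliding window over a token list. This is a minimal illustration: `split_tokens` is a hypothetical helper, and whitespace-separated words stand in for tiktoken tokens, which is an assumption made to keep the example dependency-free.

```python
def split_tokens(tokens, chunk_size=100, chunk_overlap=20):
    """Return overlapping windows of at most chunk_size tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reaches the end of the input
        start += chunk_size - chunk_overlap  # step forward, keeping the overlap
    return chunks

# Whitespace words stand in for real tokens here
words = "Heat a lightly oiled griddle over medium-high heat".split()
windows = split_tokens(words, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one window is repeated as the first token of the next.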

### 📊 Actual Results
- **Total Chunks Created**: 6 chunks
- **Token Accuracy**: Chunk boundaries are set by tiktoken token counts rather than character counts
- **Context Preservation**: 20 tokens are shared between consecutive chunks
- **Processing**: Each chunk stays within the 100-token limit

In [43]:
# ===============================
# b) Token-based Splitting
# ===============================

print("\n" + "="*50)
print(" TOKEN-BASED SPLITTING")
print("="*50)

try:
    from langchain_text_splitters import TokenTextSplitter
except ImportError:
    !pip install tiktoken
    from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

token_chunks = token_splitter.split_documents(langchain_docs)
print(f" Token-based splitting produced {len(token_chunks)} chunks")

print(f"\n Chunks per document type:")
token_doc_types = {}
for chunk in token_chunks:
    doc_type = chunk.metadata.get("type", "unknown")  
    token_doc_types[doc_type] = token_doc_types.get(doc_type, 0) + 1

for doc_type, count in token_doc_types.items():
    print(f"   - {doc_type}: {count} chunks")

print(f"\n Sample token-based chunks:")
for i, chunk in enumerate(token_chunks[:2]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(f"Source: {chunk.metadata.get('source', 'unknown')}")  
    print(chunk.page_content[:150] + "...")

if token_chunks:
    sample_text = token_chunks[0].page_content
    print(f"\n Token vs Character Info:")
    print(f"Sample chunk: {len(sample_text)} characters")
    print(f"Approximate tokens: {len(sample_text) // 4} tokens")


 TOKEN-BASED SPLITTING
 Token-based splitting produced 6 chunks

 Chunks per document type:
   - unknown: 6 chunks

 Sample token-based chunks:

--- Chunk 1 (359 chars) ---
Source: recipes.pdf
Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1 cup milk
- 1 ...

--- Chunk 2 (236 chars) ---
Source: recipes.pdf
 center and pour in milk, egg, and melted butter.
3. Mix until smooth.
4. Heat a lightly oiled griddle over medium-high heat.
5. Pour batter onto the ...

 Token vs Character Info:
Sample chunk: 359 characters
Approximate tokens: 89 tokens


## 🅲 Semantic Splitting

### ⚙️ Parameters
- **Method**: Content-based semantic boundaries
- **Embedding Model**: sentence-transformers/all-MiniLM-L6-v2
- **Splitting**: Based on semantic similarity

### 🎯 How It Works
This method splits text at natural semantic boundaries:
- Groups thematically related content together
- Creates variable-sized chunks based on content meaning
- Uses embedding similarity to determine split points
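
The idea of splitting where adjacent sentences stop resembling each other can be sketched as follows. This is a simplified illustration under stated assumptions: `semantic_chunks` is a hypothetical helper, the two-dimensional `toy_vectors` stand in for real sentence embeddings, and the fixed `threshold` is chosen for brevity. The real `SemanticChunker` derives its breakpoint from the distribution of distances rather than a hard-coded constant.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences; start a new chunk where the distance jumps."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["2 cups flour.", "1 cup butter.", "Preheat the oven.", "Bake for 10 minutes."]
# Toy embeddings: the first two point one way, the last two another
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
grouped = semantic_chunks(sentences, toy_vectors)
print(grouped)
```

Because chunk boundaries depend only on where the distance jumps, chunk sizes vary with the content rather than following a fixed length.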

### 📊 Actual Results
- **Total Chunks Created**: 7 chunks
- **Chunk Size Range**: 23 to 340 characters (variable by design)
- **Content Grouping**: Sentences with similar embeddings are grouped, so related ingredient or instruction lines tend to stay together
- **Effectiveness**: Chunk boundaries follow shifts in meaning rather than a fixed length

In [48]:
# ===============================
# c) Semantic Splitting
# ===============================

print("\n" + "="*50)
print(" SEMANTIC SPLITTING")
print("="*50)

try:
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_community.embeddings import HuggingFaceEmbeddings
except ImportError:
    !pip install sentence-transformers
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_community.embeddings import HuggingFaceEmbeddings

print(" Loading embedding model for semantic splitting...")

try:
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    semantic_splitter = SemanticChunker(embeddings)
    semantic_chunks = semantic_splitter.split_documents(langchain_docs)
    print(f" Semantic splitting produced {len(semantic_chunks)} chunks")

    print(f"\n Chunks per document type:")
    semantic_doc_types = {}
    for chunk in semantic_chunks:
        doc_type = chunk.metadata.get("type", "unknown")  
        semantic_doc_types[doc_type] = semantic_doc_types.get(doc_type, 0) + 1

    for doc_type, count in semantic_doc_types.items():
        print(f"   - {doc_type}: {count} chunks")

    print(f"\n Sample semantic chunks:")
    for i, chunk in enumerate(semantic_chunks[:2]):
        print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
        print(f"Source: {chunk.metadata.get('source', 'unknown')}")  
        print(chunk.page_content[:150] + "...")
        
    # Show chunk size variation
    if semantic_chunks:
        chunk_sizes = [len(chunk.page_content) for chunk in semantic_chunks]
        print(f"\n Chunk size variation: {min(chunk_sizes)} to {max(chunk_sizes)} characters")
        
except Exception as e:
    print(f" Semantic splitting failed: {e}")
    print(" This might be due to model download issues or memory constraints")
    semantic_chunks = []


 SEMANTIC SPLITTING
 Loading embedding model for semantic splitting...
 Semantic splitting produced 7 chunks

 Chunks per document type:
   - unknown: 7 chunks

 Sample semantic chunks:

--- Chunk 1 (340 chars) ---
Source: recipes.pdf
Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1 cup milk
- 1 ...

--- Chunk 2 (183 chars) ---
Source: recipes.pdf
Mix until smooth. 4. Heat a lightly oiled griddle over medium-high heat. 5. Pour batter onto the griddle. 6. Cook until bubbles form and edges are dry...

 Chunk size variation: 23 to 340 characters


In [49]:
# ===============================
# FINAL COMPARISON TABLE
# ===============================

print(" TEXT SPLITTING METHODS COMPARISON")
print("=" * 50)

methods_data = {
    "Recursive Character": recursive_chunks,
    "Token-based": token_chunks,
    "Semantic": semantic_chunks
}

print(f"{'Method':<25} {'Total Chunks':<15} {'Avg Chunk Size':<15}")
print("-" * 55)

for method_name, chunks in methods_data.items():
    if chunks:
        avg_size = sum(len(chunk.page_content) for chunk in chunks) / len(chunks)
        print(f"{method_name:<25} {len(chunks):<15} {avg_size:.0f} chars")
    else:
        print(f"{method_name:<25} {'N/A':<15} {'N/A':<15}")

print(f"\n KEY DIFFERENCES:")
print("-" * 30)
print("• Recursive: Character cap (up to 200 chars)")
print("• Token-based: Token cap (up to 100 tokens)") 
print("• Semantic: Meaning-based (variable size)")
print("• Semantic chunk count varies with content (here fewer than recursive, more than token-based)")

 TEXT SPLITTING METHODS COMPARISON
Method                    Total Chunks    Avg Chunk Size 
-------------------------------------------------------
Recursive Character       9               145 chars
Token-based               6               229 chars
Semantic                  7               177 chars

 KEY DIFFERENCES:
------------------------------
• Recursive: Character cap (up to 200 chars)
• Token-based: Token cap (up to 100 tokens)
• Semantic: Meaning-based (variable size)
• Semantic chunk count varies with content (here fewer than recursive, more than token-based)


# 🔮 Embedding Generation

## 📋 Overview
This section generates vector embeddings for our recipe text chunks using HuggingFace's SentenceTransformer model.

## ⚙️ Model Details
- **Model**: `all-MiniLM-L6-v2`
- **Vector Dimension**: 384 dimensions
- **Input**: Text chunks from recipe documents
- **Output**: Numerical vector representations

## 🎯 Purpose
Embeddings convert text into numerical vectors that capture semantic meaning, enabling similarity search and retrieval operations.

In [51]:
# ===============================
#   Embedding Approach
# ===============================
print("="*50)
print(" EMBEDDING GENERATION")
print("="*50)

try:
    from sentence_transformers import SentenceTransformer
    import numpy as np
    
    print(" Loading SentenceTransformer model...")
    
    # Load a lightweight model
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Prepare texts
    sample_texts = []
    for chunk in recursive_chunks[:5]:
        sample_texts.append(chunk.page_content)
    
    print(f" Encoding {len(sample_texts)} text chunks...")
    
    # Generate embeddings
    embeddings = model.encode(sample_texts)
    
    print(f" Embeddings generated successfully!")
    print(f" Shape: {embeddings.shape}")
    print(f" Sample embedding norms: {np.linalg.norm(embeddings, axis=1)[:3]}")
    
except ImportError:
    print(" SentenceTransformer not available, installing...")
    !pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

 EMBEDDING GENERATION
 Loading SentenceTransformer model...
 Encoding 5 text chunks...
 Embeddings generated successfully!
 Shape: (5, 384)
 Sample embedding norms: [0.99999994 1.         0.99999994]


#  FAISS Vector Store

## 📘 Overview
This section stores recipe embeddings in a **FAISS vector database** for fast semantic search and retrieval.

## ⚙️ How It Works
- **Storage:** Embeddings are indexed in FAISS for quick similarity lookup  
- **Search:** Uses the **inner product**, which equals cosine similarity when the vectors are unit-normalized  
- **Index Type:** `IndexFlatIP`, an exact (exhaustive) inner-product index  

## 🎯 Purpose
Enables **semantic recipe search**: finding similar items by meaning rather than just matching words.
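
What a flat inner-product index computes can be sketched without FAISS: normalize the vectors, score every stored vector against the query, and keep the top `k`. This is a pure-Python illustration under stated assumptions. `normalize`, `top_k_inner_product`, and `flat_index_size_kb` are hypothetical helpers, and the two-dimensional vectors are toy data.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k_inner_product(query, vectors, k=3):
    """Exhaustive search: score every vector, return the k best (score, index) pairs."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), i)
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def flat_index_size_kb(n_vectors, dimension, bytes_per_float=4):
    """Approximate storage for a flat float32 index, in KB."""
    return n_vectors * dimension * bytes_per_float / 1024

toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]]
results = top_k_inner_product([2.0, 0.0], toy_vectors, k=2)
print(results)
print(flat_index_size_kb(5, 384))
```

On unit-length vectors the inner product is the cosine of the angle between them, so higher scores mean closer matches.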


In [52]:
# ===============================
# FAISS Vector Store
# ===============================

print("="*50)
print(" FAISS VECTOR STORE")
print("="*50)

print(" Creating FAISS vector store...")

# Ensure we have embeddings
if 'embeddings' not in locals():
    print(" No embeddings found. Generating them first...")
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sample_texts = [chunk.page_content for chunk in recursive_chunks[:5]]
    embeddings = model.encode(sample_texts)

# Convert embeddings to numpy array
embeddings_array = np.array(embeddings).astype('float32')
print(f" Embeddings array shape: {embeddings_array.shape}")

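# Create FAISS index. Inner product equals cosine similarity here because
# the embeddings are already unit-length (see the norms printed in the previous cell).
dimension = embeddings_array.shape[1]
index = faiss.IndexFlatIP(dimension)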

# Add embeddings to index
index.add(embeddings_array)

print(" FAISS index created and populated!")
print(f" Index statistics:")
print(f"   - Vectors stored: {index.ntotal}")
print(f"   - Vector dimension: {index.d}")
print(f"   - Index type: {type(index).__name__}")

# Test similarity search
print(f"\n Testing similarity search...")

# Create a test query
test_query = "pancake ingredients flour"
print(f"   Query: '{test_query}'")

# Encode the query
query_embedding = model.encode([test_query])
query_vector = np.array(query_embedding).astype('float32')

k = 3  # Number of similar results to return
distances, indices = index.search(query_vector, k)

print(f"   Top {k} similar documents found:")
for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
    similarity_score = distance  
    # FAISS returns -1 for missing results, so guard against negative indices too
    chunk_content = recursive_chunks[idx].page_content[:80] + "..." if 0 <= idx < len(recursive_chunks) else "N/A"
    print(f"   {i+1}. Score: {similarity_score:.4f}")
    print(f"      Content: {chunk_content}")

# Show index memory usage
print(f"\n INDEX METADATA:")
print(f"   - Total vectors: {index.ntotal}")
print(f"   - Dimensions: {index.d}")
print(f"   - Approx. size: {index.ntotal * index.d * 4 / 1024:.2f} KB")

print(f"\n FAISS VECTOR STORE SETUP COMPLETE!")

 FAISS VECTOR STORE
 Creating FAISS vector store...
 Embeddings array shape: (5, 384)
 FAISS index created and populated!
 Index statistics:
   - Vectors stored: 5
   - Vector dimension: 384
   - Index type: IndexFlatIP

 Testing similarity search...
   Query: 'pancake ingredients flour'
   Top 3 similar documents found:
   1. Score: 0.6704
      Content: Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flour
- 2 tablespoons s...
   2. Score: 0.4237
      Content: Chocolate Chip Cookies

Ingredients

2 cups flour

1 cup butter

1 cup chocolate...
   3. Score: 0.3292
      Content: Vegetable Stir Fry

Ingredients:
- 2 cups mixed vegetables
- 1 tbsp oil
- 2 clov...

 INDEX METADATA:
   - Total vectors: 5
   - Dimensions: 384
   - Approx. size: 7.50 KB

 FAISS VECTOR STORE SETUP COMPLETE!


#  BM25 Sparse Retrieval

## 📋 Overview
This section sets up **sparse retrieval** of recipe text chunks using the **BM25 algorithm**. BM25 ranks documents based on keyword relevance rather than semantic similarity.

## ⚙️ How It Works
- **Tokenizer**: Splits text into words. Uses NLTK's `word_tokenize` if available, otherwise a simple regex-based tokenizer.
- **Corpus**: Uses recipe text chunks prepared earlier.
- **BM25 Index**: `BM25Okapi` ranks documents based on token matches with the query.
- **Query**: Example query `"pancake ingredients flour"` retrieves top matching chunks.

## 🎯 Purpose
Enables **keyword-based retrieval**, allowing users to find relevant recipes quickly using specific terms. BM25 complements semantic search (FAISS) by focusing on exact token matches.
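
To show how BM25 turns token matches into a score, here is a compact sketch of the Okapi formula. It is an illustration under stated assumptions, not the `rank_bm25` implementation: `bm25_scores` is a hypothetical function, the toy corpus is invented, and the IDF uses the common non-negative variant `log(1 + (N - n + 0.5) / (n + 0.5))`, which may differ from the library's exact IDF handling.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in corpus against the query tokens."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents containing each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    ["pancake", "flour", "milk"],
    ["cookie", "flour", "butter"],
    ["garlic", "oil"],
]
scores = bm25_scores(["pancake", "flour"], toy_corpus)
print(scores)
```

Because matching is on exact tokens, a query token such as `pancake` does not match `pancakes` in a document unless stemming is added. That is likely why a chunk can rank lower than expected for a query that looks closely related.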

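## Example
- **Query**: `"pancake ingredients flour"`  
- **Top retrieved documents**: Shows score, type, and a preview of content. The type is read from each chunk's `type` metadata and prints as `unknown` when the loader does not set that key.  
- **Score range**: Reports the lowest and highest BM25 scores across the indexed chunks (higher score = more relevant).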


In [54]:
# ===============================
#  BM25 Sparse Retrieval
# ===============================

print("="*50)
print(" BM25 SPARSE RETRIEVAL")
print("="*50)

import re

print(" Downloading NLTK tokenizer data...")
try:
    nltk.download('punkt_tab', quiet=True)
    nltk.download('punkt', quiet=True)
    print(" NLTK tokenizer ready")
except Exception:
    print("  NLTK download issues, using simple tokenizer")

def simple_tokenize(text):
    """Simple word tokenizer using regex"""
    return re.findall(r'\b\w+\b', text.lower())

bm25_documents = []
document_metadata = []

for i, chunk in enumerate(recursive_chunks[:5]):  
    bm25_documents.append(chunk.page_content)
    document_metadata.append({
        'id': i,
        'source': chunk.metadata.get('source', 'unknown'),  
        'type': chunk.metadata.get('type', 'unknown'),      
        'content_preview': chunk.page_content[:60] + "..."
    })

print(f" Prepared {len(bm25_documents)} documents for BM25 indexing")

try:
    from nltk.tokenize import word_tokenize
    tokenized_corpus = [word_tokenize(doc.lower()) for doc in bm25_documents]
    print(f" Tokenized {len(tokenized_corpus)} documents with NLTK")
except Exception:
    print("  Using simple tokenizer (NLTK failed)")
    tokenized_corpus = [simple_tokenize(doc) for doc in bm25_documents]
    print(f" Tokenized {len(tokenized_corpus)} documents with simple tokenizer")

bm25 = BM25Okapi(tokenized_corpus)
print(" BM25 retriever initialized")
print(f" Avg tokens per document: {np.mean([len(doc) for doc in tokenized_corpus]):.1f}")

# Test BM25 retrieval

test_query = "pancake ingredients flour"

try:
    from nltk.tokenize import word_tokenize
    query_tokens = word_tokenize(test_query.lower())
except Exception:
    query_tokens = simple_tokenize(test_query)

doc_scores = bm25.get_scores(query_tokens)
top_indices = np.argsort(doc_scores)[::-1][:3]

print(f"\n Query: '{test_query}'")
print(f"   Top {len(top_indices)} documents found:")
for i, idx in enumerate(top_indices):
    score = doc_scores[idx]
    doc_info = document_metadata[idx]
    print(f"   {i+1}. Score: {score:.4f}")
    print(f"      Type: {doc_info['type']}")
    print(f"      Preview: {doc_info['content_preview']}")

print(f"\n BM25 SCORING BREAKDOWN:")
print(f"   Query tokens: {query_tokens}")
print(f"   Score range: {doc_scores.min():.4f} to {doc_scores.max():.4f}")

print(f"\n BM25 SPARSE RETRIEVAL SETUP COMPLETE!")

 BM25 SPARSE RETRIEVAL
 Downloading NLTK tokenizer data...
 NLTK tokenizer ready
 Prepared 5 documents for BM25 indexing
 Tokenized 5 documents with NLTK
 BM25 retriever initialized
 Avg tokens per document: 34.8

 Query: 'pancake ingredients flour'
   Top 3 documents found:
   1. Score: 0.5096
      Type: unknown
      Preview: Chocolate Chip Cookies

Ingredients

2 cups flour

1 cup but...
   2. Score: 0.3919
      Type: unknown
      Preview: Classic Pancakes Recipe
Ingredients:
- 1 cup all-purpose flo...
   3. Score: 0.2348
      Type: unknown
      Preview: Vegetable Stir Fry

Ingredients:
- 2 cups mixed vegetables
-...

 BM25 SCORING BREAKDOWN:
   Query tokens: ['pancake', 'ingredients', 'flour']
   Score range: 0.0000 to 0.5096

 BM25 SPARSE RETRIEVAL SETUP COMPLETE!


In [55]:
!pip install --upgrade langchain




In [56]:
!pip install python-dotenv neo4j langchain langchain-community



In [57]:
pip install neo4j==5.28.2


Note: you may need to restart the kernel to use updated packages.


In [58]:
pip install langchain-neo4j





# 🍽️ Recipe Graph Creation in Neo4j

Builds a **graph of recipes** from PDF, HTML, TXT, and Python files using LangChain’s `GraphDocument` and stores it in **Neo4j**.

---

### 🔹 Workflow

1. 📂 **Load files**  
   `PDF` → PyPDF2, `HTML` → BeautifulSoup, `TXT/PY` → text/docstrings
2. 📝 **Extract recipe details**  
   Ingredients, Steps, Cuisine (via regex)
3. 🧩 **Create nodes**  
   Recipe, Ingredient, Step, Cuisine
4. 🔗 **Create relationships**  
   `HAS_INGREDIENT`, `HAS_STEP`, `HAS_CUISINE`
5. 📄 **Build GraphDocument**  
   Combines nodes, relationships, and source text
6. 🚀 **Add to Neo4j**  
   ```python
   graph.add_graph_documents(graph_documents, include_source=True, baseEntityLabel=True)
   ```
7. 📊 **Print schema**
   ```python
   print(graph.get_schema)
   ```


In [8]:
from langchain_community.graphs.graph_document import Node, Relationship, GraphDocument
from langchain_core.documents import Document
from PyPDF2 import PdfReader
from bs4 import BeautifulSoup
import re
import os

# -----------------------------
# Files to parse
# -----------------------------
files = ["recipes.pdf", "recipes.html", "recipes.txt", "recipe_utils.py"]

graph_documents = []
uid = 1

for file_path in files:
    filename = os.path.basename(file_path)
    ext = filename.split(".")[-1].lower()

    text = ""
    if ext == "pdf":
        pdf_reader = PdfReader(file_path)
        # Guard against pages for which no text is extracted
        text = "".join([(page.extract_text() or "") + "\n" for page in pdf_reader.pages])

    elif ext == "html":
        with open(file_path, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        text = soup.get_text(separator="\n")

    elif ext in ["txt", "py"]:
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()

        # For Python files, extract docstrings and comments
        if ext == "py":
            docstrings = re.findall(r'"""(.*?)"""', text, re.DOTALL) + re.findall(r"'''(.*?)'''", text, re.DOTALL)
            comments = re.findall(r"#(.*)", text)
            text = "\n".join(docstrings + comments)

    # -----------------------------
    # Extract recipe information
    # -----------------------------
    ingredients = re.findall(r"(?i)ingredient[s]*[:\-]?\s*(.+)", text, re.MULTILINE)
    steps = re.findall(r"(?i)step[s]*\s*\d*[:\-]?\s*(.+)", text, re.MULTILINE)
    cuisine = re.findall(r"(?i)cuisine[:\-]?\s*(.+)", text, re.MULTILINE)

    # Recipe node
    recipe_node = Node(
        id=f"recipe_{uid}",
        type="Recipe",
        properties={"name": filename}
    )
    uid += 1

    # Ingredient nodes
    ingredient_nodes = []
    for ing in ingredients:
        for i in ing.split(","):
            name = i.strip()
            if not name:
                continue  # skip empty fragments left by trailing or doubled commas
            ingredient_nodes.append(Node(
                id=f"node_{uid}",
                type="Ingredient",
                properties={"name": name}
            ))
            uid += 1

    # Step nodes
    step_nodes = []
    for st in steps:
        step_nodes.append(Node(
            id=f"node_{uid}",
            type="Step",
            properties={"description": st.strip()}
        ))
        uid += 1

    # Cuisine nodes
    cuisine_nodes = []
    for c in cuisine:
        cuisine_nodes.append(Node(
            id=f"node_{uid}",
            type="Cuisine",
            properties={"name": c.strip()}
        ))
        uid += 1

    # Relationships
    rels = []
    for ing in ingredient_nodes:
        rels.append(Relationship(source=recipe_node, target=ing, type="HAS_INGREDIENT"))
    for st in step_nodes:
        rels.append(Relationship(source=recipe_node, target=st, type="HAS_STEP"))
    for c in cuisine_nodes:
        rels.append(Relationship(source=recipe_node, target=c, type="HAS_CUISINE"))

    # Create GraphDocument
    graph_doc = GraphDocument(
        nodes=[recipe_node] + ingredient_nodes + step_nodes + cuisine_nodes,
        relationships=rels,
        source=Document(
            page_content=text,
            metadata={"source": filename}
        )
    )
    graph_documents.append(graph_doc)

print(f"Graph documents created: {len(graph_documents)}\n")

# -----------------------------
# Add to Neo4j (assumes `graph` is an existing Neo4jGraph connection, as created in Part 4)
# -----------------------------
graph.add_graph_documents(graph_documents, include_source=True, baseEntityLabel=True)
print("Documents added to Neo4j.\n")

# -----------------------------
# Print Schema
# -----------------------------
print("Graph Schema:")
print(graph.get_schema) 


Graph documents created: 4

Documents added to Neo4j.

Graph Schema:
Node properties:
Recipe {id: STRING, name: STRING, cuisine: STRING, step_count: INTEGER, ingredient_count: INTEGER}
Ingredient {id: STRING, name: STRING}
Cuisine {id: STRING, name: STRING}
Step {id: STRING, step_number: INTEGER, description: STRING, order: INTEGER}
Document {id: STRING, source: STRING, text: STRING}
Relationship properties:

The relationships:
(:Recipe)-[:BELONGS_TO_CUISINE]->(:Cuisine)
(:Recipe)-[:USES_INGREDIENT]->(:Ingredient)
(:Recipe)-[:HAS_STEP]->(:Step)
(:Recipe)-[:HAS_INGREDIENT]->(:Ingredient)
(:Document)-[:MENTIONS]->(:Recipe)
(:Document)-[:MENTIONS]->(:Cuisine)
(:Document)-[:MENTIONS]->(:Ingredient)
(:Document)-[:MENTIONS]->(:Step)


In [61]:
!pip install --upgrade langchain langchain_community openai


Collecting openai
  Downloading openai-2.8.1-py3-none-any.whl.metadata (29 kB)
Downloading openai-2.8.1-py3-none-any.whl (1.0 MB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 2.8.0
    Uninstalling openai-2.8.0:
      Successfully uninstalled openai-2.8.0
Successfully installed openai-2.8.1


In [62]:
pip install --upgrade langchain-community





In [63]:
# The langchain_community version of this chain is deprecated in favor of langchain-neo4j
from langchain_neo4j import GraphCypherQAChain


In [64]:
pip install langchain langchain-neo4j langchain-openai langchain-community




In [65]:
pip install --upgrade langchain-neo4j


Note: you may need to restart the kernel to use updated packages.


In [66]:
pip install --upgrade langchain-openai





In [68]:
# ChatOpenAI in langchain_community is deprecated; langchain-openai is the maintained package
from langchain_openai import ChatOpenAI


# 🍳 Part 4: Graph-Based Recommendation Engine

This section implements a **graph-based recipe recommendation system** using **Neo4j**, converting natural language queries to **Cypher** via **few-shot prompting**.

---

## 🏗️ Workflow

1. **Connect to Neo4j** – Secure database connection  
2. **Few-Shot Examples** – Map natural language to Cypher queries  
3. **NL → Cypher Conversion** – Select relevant query templates  
4. **Validation** – Check for dangerous operations and syntax  
5. **Execute & Format** – Run queries and display results cleanly

---

## 🎯 Sample Queries

- Recipes containing chocolate  
- Recipes using eggs  
- Simple recipes with few ingredients  
- Vegetarian recipes  
- Recipes without eggs

---

## 📚 Few-Shot Examples

- **Ingredient**: "Find recipes containing chocolate" → Filter by ingredient  
- **Cuisine**: "Show me Asian cuisine recipes" → Case-insensitive cuisine filter  
- **Vegetarian**: "Find vegetarian recipes" → Exclude meat, poultry, and eggs  
- **Exclusion**: "Find recipes without eggs" → Negative ingredient pattern
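
The engine in this section selects one of these templates by keyword matching rather than by calling an LLM. As a sketch of how the same question/Cypher pairs could be rendered into a few-shot prompt if an LLM were used for the conversion, consider the following. The names `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` and the instruction wording are assumptions, not part of the notebook.

```python
# Two of the example pairs listed above, written as plain data
FEW_SHOT_EXAMPLES = [
    {
        "question": "Find recipes containing chocolate",
        "cypher": (
            "MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient) "
            "WHERE toLower(i.name) CONTAINS 'chocolate' "
            "RETURN r.name AS recipe, r.cuisine AS cuisine"
        ),
    },
    {
        "question": "Show me Asian cuisine recipes",
        "cypher": (
            "MATCH (r:Recipe) "
            "WHERE toLower(r.cuisine) CONTAINS 'asian' "
            "RETURN r.name AS recipe, r.cuisine AS cuisine"
        ),
    },
]

def build_few_shot_prompt(user_question, examples=FEW_SHOT_EXAMPLES):
    """Render the examples followed by the new question, leaving the Cypher slot open."""
    lines = ["Translate the question into a read-only Cypher query.", ""]
    for example in examples:
        lines.append(f"Question: {example['question']}")
        lines.append(f"Cypher: {example['cypher']}")
        lines.append("")
    lines.append(f"Question: {user_question}")
    lines.append("Cypher:")
    return "\n".join(lines)

few_shot_prompt = build_few_shot_prompt("Find recipes without eggs")
print(few_shot_prompt)
```

Any query produced this way would still need to pass the validation step before being executed.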

---

## 🛡️ Safety & Output

- Blocks queries containing write operations (`DELETE`, `DROP`, `CREATE`, `MERGE`, `SET`, `REMOVE`, `DETACH`)  
- Requires both `MATCH` and `RETURN` to be present  
- Displays recipes with cuisine and ingredient count  
- Filters out irrelevant files (`.html`, `.txt`, `.pdf`, `.py`)
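
A minimal sketch of the safety check is shown below, using the same keyword list as the engine. The function name `is_read_only` is hypothetical. Stripping quoted literals before scanning is an added assumption, intended to stop ingredient or recipe names from being mistaken for keywords; escaped quotes inside literals are not handled. This is a heuristic only, and a database user with read-only permissions is the more dependable safeguard.

```python
import re

WRITE_KEYWORDS = ("DELETE", "DROP", "CREATE", "MERGE", "SET", "REMOVE", "DETACH")

def is_read_only(cypher):
    """Heuristically accept only MATCH ... RETURN queries without write keywords."""
    # Blank out quoted string literals so that values are not scanned for keywords
    without_literals = re.sub(r"'[^']*'|\"[^\"]*\"", "''", cypher)
    upper = without_literals.upper()
    # \b matches whole words only, so 'dataset' does not trigger 'SET'
    has_write = any(re.search(rf"\b{kw}\b", upper) for kw in WRITE_KEYWORDS)
    has_shape = bool(re.search(r"\bMATCH\b", upper)) and bool(re.search(r"\bRETURN\b", upper))
    return has_shape and not has_write

print(is_read_only("MATCH (r:Recipe) WHERE r.name CONTAINS 'Sunset Salad' RETURN r.name"))
print(is_read_only("MATCH (r:Recipe) SET r.cuisine = 'x' RETURN r"))
```

A plain substring test would reject the first query because `SUNSET` contains `SET`; the word-boundary and literal-stripping steps avoid that false positive.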

---


In [96]:
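import os
from langchain_neo4j import Neo4jGraph

# Neo4j connection. Credentials are read from the environment instead of being
# hard-coded; the variable names below are an assumption, so set them in your
# shell or in a .env file loaded with python-dotenv.
graph = Neo4jGraph(
    url=os.getenv("NEO4J_URI", "neo4j+s://de6c21cd.databases.neo4j.io"),
    username=os.getenv("NEO4J_USERNAME", "neo4j"),
    password=os.environ["NEO4J_PASSWORD"]
)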

class GraphBasedRecommendationEngine:
    def __init__(self, graph):
        self.graph = graph
        self.setup_few_shot_examples()
    
    def setup_few_shot_examples(self):
        """Define few-shot examples for natural language to Cypher conversion"""
        self.few_shot_examples = {
            "ingredient_based": {
                "natural_language": "Find recipes containing chocolate",
                "cypher": """
                MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient)
                WHERE toLower(i.name) CONTAINS 'chocolate'
                RETURN r.name AS recipe, r.cuisine AS cuisine
                """
            },
            "cuisine_based": {
                "natural_language": "Show me Asian cuisine recipes",
                "cypher": """
                MATCH (r:Recipe)
                WHERE toLower(r.cuisine) CONTAINS 'asian'
                RETURN r.name AS recipe, r.cuisine AS cuisine
                """
            },
            "vegetarian_recipes": {
                "natural_language": "Find vegetarian recipes",
                "cypher": """
                MATCH (r:Recipe)
                WHERE NOT EXISTS {
                    MATCH (r)-[:USES_INGREDIENT]->(i:Ingredient)
                    WHERE toLower(i.name) CONTAINS 'egg' 
                       OR toLower(i.name) CONTAINS 'chicken'
                       OR toLower(i.name) CONTAINS 'beef'
                       OR toLower(i.name) CONTAINS 'pork'
                       OR toLower(i.name) CONTAINS 'meat'
                }
                AND r.cuisine IS NOT NULL
                RETURN r.name AS recipe, r.cuisine AS cuisine
                """
            },
            "exclusion_pattern": {
                "natural_language": "Find recipes without eggs",
                "cypher": """
                MATCH (r:Recipe)
                WHERE NOT EXISTS {
                    MATCH (r)-[:USES_INGREDIENT]->(i:Ingredient)
                    WHERE toLower(i.name) CONTAINS 'egg'
                }
                AND r.cuisine IS NOT NULL
                RETURN r.name AS recipe, r.cuisine AS cuisine
                """
            }
        }
    
    def validate_cypher(self, cypher_query):
        """Validate Cypher queries for safety and basic syntax"""
        import re  # local import keeps this method self-contained

        query_upper = cypher_query.upper()

        # Match whole keywords only, so that names such as 'dataset' do not trigger 'SET'
        dangerous_operations = ['DELETE', 'DROP', 'CREATE', 'MERGE', 'SET', 'REMOVE', 'DETACH']
        if any(re.search(rf"\b{op}\b", query_upper) for op in dangerous_operations):
            return False, "Validation failed: Query contains dangerous operations"
        
        required_keywords = ['MATCH', 'RETURN']
        if not all(re.search(rf"\b{keyword}\b", query_upper) for keyword in required_keywords):
            return False, "Validation failed: Query missing required Cypher keywords"
        
        return True, "Validation passed: Cypher query is safe and well-structured"
    
    def natural_language_to_cypher(self, query):
        """Convert natural language to Cypher using few-shot examples"""
        query_lower = query.lower()
        
        if 'chocolate' in query_lower:
            return self.few_shot_examples["ingredient_based"]["cypher"]
        
        elif 'egg' in query_lower:
            # Compare whole words so that e.g. 'noodles' is not read as the negation 'no'
            if any(word in ('without', 'no') for word in query_lower.split()):
                return self.few_shot_examples["exclusion_pattern"]["cypher"]
            else:
                return """
                MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient)
                WHERE toLower(i.name) CONTAINS 'egg'
                AND r.cuisine IS NOT NULL
                RETURN DISTINCT r.name AS recipe, r.cuisine AS cuisine
                """
        
        elif 'simple' in query_lower or 'few ingredients' in query_lower:
            return """
            MATCH (r:Recipe)
            WHERE r.ingredient_count <= 7
            AND r.cuisine IS NOT NULL
            RETURN r.name AS recipe, r.ingredient_count AS count
            ORDER BY r.ingredient_count
            """
        
        elif 'vegetarian' in query_lower:
            return self.few_shot_examples["vegetarian_recipes"]["cypher"]
        
        elif 'american' in query_lower:
            return """
            MATCH (r:Recipe)
            WHERE toLower(r.cuisine) CONTAINS 'american'
            AND r.cuisine IS NOT NULL
            RETURN r.name AS recipe, r.cuisine AS cuisine
            """
        
        else:
            return """
            MATCH (r:Recipe)
            WHERE r.cuisine IS NOT NULL
            RETURN r.name AS recipe, r.cuisine AS cuisine, r.ingredient_count AS ingredients
            """
    
    def execute_query(self, natural_language_query):
        """Main method to process natural language queries with Cypher validation"""
        print(f"Natural Language Query: {natural_language_query}")
        
        cypher_query = self.natural_language_to_cypher(natural_language_query)
        print(f"Generated Cypher: {cypher_query.strip()}")
        
        # Validate Cypher
        is_valid, validation_message = self.validate_cypher(cypher_query)
        print(f"Cypher Validation: {validation_message}")
        
        if not is_valid:
            return f"Query execution blocked: {validation_message}"
        
        try:
            results = self.graph.query(cypher_query)
            return self.format_results(results)
        except Exception as e:
            return f"Query execution error: {e}"
    
    def format_results(self, results):
        """Format the query results for display"""
        if not results:
            return "No matching recipes found."
        
        formatted = []
        for result in results:
           
            if result.get('recipe') and any(file_ext in result['recipe'].lower() for file_ext in ['.html', '.txt', '.pdf', '.py']):
                continue
                
            recipe_info = f"• {result.get('recipe', 'Unknown')}"
            if 'cuisine' in result and result['cuisine']:
                recipe_info += f" ({result['cuisine']})"
            if 'count' in result:
                recipe_info += f" - {result['count']} ingredients"
            formatted.append(recipe_info)
        
        return "\n".join(formatted) if formatted else "No matching recipes found."

engine = GraphBasedRecommendationEngine(graph)

print("=" * 70)
print("PART 4: GRAPH-BASED RECOMMENDATION ENGINE")
print("=" * 70)
print("Using Natural Language to Cypher Conversion with Few-Shot Learning")
print("Features: Few-shot prompting, Cypher validation (validate_cypher=True), Exclusion patterns")
print()

sample_queries = [
    "Find recipes containing chocolate",           
    "Show recipes that use eggs",                  
    "Find simple recipes with few ingredients",    
    "Show vegetarian recipes",                     
    "Find recipes without eggs"                   
]

print("DELIVERABLES: 3-5 SAMPLE QUERIES + RESULTS")
print("=" * 70)

for i, query in enumerate(sample_queries, 1):
    print(f"\n{i}. {query}")
    print("-" * 40)
    result = engine.execute_query(query)
    print(f"Results:\n{result}")

print("\n" + "=" * 70)
print("DATA VERIFICATION")
print("=" * 70)

verification_query = """
MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient)
WHERE r.cuisine IS NOT NULL 
AND NOT (r.name CONTAINS '.html' OR r.name CONTAINS '.txt' OR r.name CONTAINS '.pdf' OR r.name CONTAINS '.py')
RETURN r.name AS recipe, 
       collect(i.name) AS all_ingredients,
       EXISTS((r)-[:USES_INGREDIENT]->(:Ingredient {name: 'Eggs'})) AS has_eggs,
       ANY(ing IN collect(i.name) WHERE toLower(ing) CONTAINS 'egg') AS contains_egg
ORDER BY r.name
"""

print("Actual Recipe Ingredients Verification:")
verification_results = graph.query(verification_query)
for recipe in verification_results:
    egg_status = "✅ Contains eggs" if recipe['contains_egg'] else "❌ No eggs"
    print(f"• {recipe['recipe']}: {egg_status}")
    if recipe['contains_egg']:
        egg_ingredients = [ing for ing in recipe['all_ingredients'] if 'egg' in ing.lower()]
        print(f"  Egg ingredients: {', '.join(egg_ingredients)}")

print("\n" + "=" * 70)
print("FEW-SHOT PROMPTING EXAMPLES")
print("=" * 70)

for key, example in engine.few_shot_examples.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    print(f"Natural Language: '{example['natural_language']}'")
    print(f"Generated Cypher: {example['cypher'].strip()}")

print("\n" + "=" * 70)
print("RECIPE DATABASE SUMMARY")
print("=" * 70)

stats = graph.query("""
MATCH (r:Recipe) 
WHERE r.cuisine IS NOT NULL 
AND NOT (r.name CONTAINS '.html' OR r.name CONTAINS '.txt' OR r.name CONTAINS '.pdf' OR r.name CONTAINS '.py')
RETURN count(r) AS total_recipes,
       collect(DISTINCT r.cuisine) AS cuisines,
       avg(r.ingredient_count) AS avg_ingredients
""")[0]

print(f"• Total Recipes: {stats['total_recipes']}")
print(f"• Available Cuisines: {', '.join(stats['cuisines'])}")
print(f"• Average Ingredients per Recipe: {stats['avg_ingredients']:.1f}")

recipes = graph.query("""
MATCH (r:Recipe)
WHERE r.cuisine IS NOT NULL 
AND NOT (r.name CONTAINS '.html' OR r.name CONTAINS '.txt' OR r.name CONTAINS '.pdf' OR r.name CONTAINS '.py')
RETURN r.name AS name, r.cuisine AS cuisine, r.ingredient_count AS ingredients
ORDER BY r.name
""")

print(f"\nAvailable Cooking Recipes:")
for recipe in recipes:
    print(f"• {recipe['name']} ({recipe['cuisine']}) - {recipe['ingredients']} ingredients")

PART 4: GRAPH-BASED RECOMMENDATION ENGINE
Using Natural Language to Cypher Conversion with Few-Shot Learning
Features: Few-shot prompting, Cypher validation (validate_cypher=True), Exclusion patterns

DELIVERABLES: 3-5 SAMPLE QUERIES + RESULTS

1. Find recipes containing chocolate
----------------------------------------
Natural Language Query: Find recipes containing chocolate
Generated Cypher: MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient)
                WHERE toLower(i.name) CONTAINS 'chocolate'
                RETURN r.name AS recipe, r.cuisine AS cuisine
Cypher Validation: Validation passed: Cypher query is safe and well-structured
Results:
• Chocolate Chip Cookies (American)

2. Show recipes that use eggs
----------------------------------------
Natural Language Query: Show recipes that use eggs
Generated Cypher: MATCH (r:Recipe)-[:USES_INGREDIENT]->(i:Ingredient)
                WHERE toLower(i.name) CONTAINS 'egg'
                AND r.cuisine IS NOT NULL
              

In [13]:
pip install langchain langchain-community langchain-openai chromadb tiktoken sentence-transformers pypdf2




In [24]:
pip install -U langchain-openai langchain-huggingface langchain-community langchain-core sentence-transformers faiss-cpu rank_bm25 python-dotenv



Collecting langchain-core
  Downloading langchain_core-1.1.0-py3-none-any.whl.metadata (3.6 kB)
Downloading langchain_core-1.1.0-py3-none-any.whl (473 kB)
Installing collected packages: langchain-core
  Attempting uninstall: langchain-core
    Found existing installation: langchain-core 1.0.7
    Uninstalling langchain-core-1.0.7:
      Successfully uninstalled langchain-core-1.0.7
Successfully installed langchain-core-1.1.0


In [22]:
pip install -U langchain-huggingface


Collecting langchain-huggingface
  Downloading langchain_huggingface-1.0.1-py3-none-any.whl.metadata (2.1 kB)
Downloading langchain_huggingface-1.0.1-py3-none-any.whl (27 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [32]:
pip install --upgrade langchain langchain-openai langchain-huggingface langchain-community





# 🍽️ Full RAG Pipeline Implementation for Recipe QA

This notebook demonstrates a **Retrieval-Augmented Generation (RAG) pipeline** using a **hybrid retriever** (BM25 + FAISS) and **GPT-4o-mini** for generating context-aware answers to recipe queries.

---

## 🔹 Workflow

### 1. **Load Sample Documents**  
   - Small collection of recipes with `name` and `content` fields
   - Structured data for efficient retrieval

### 2. **BM25 Retriever**  
   - Keyword-based retrieval using BM25, a TF-IDF-style ranking function with term-frequency saturation and document-length normalization
   - Returns the documents with the highest BM25 scores for exact term matches
   - Works well for queries that name specific ingredients

### 3. **FAISS + SentenceTransformer**  
   - Generates 384-dimensional dense embeddings using `all-MiniLM-L6-v2`
   - Uses L2 distance to measure semantic similarity
   - Captures conceptual relationships beyond exact keywords
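
As a small worked check on why L2 distance can stand in for semantic similarity: for unit-length vectors, squared L2 distance and cosine similarity are related by `||a - b||² = 2 - 2·cos(a, b)`, so ranking by smallest distance matches ranking by largest cosine. The sketch below verifies this in pure Python. The vectors are arbitrary illustrative values, and whether a given embedding model returns unit-length vectors is an assumption to verify for your setup. `faiss.IndexFlatL2` reports squared L2 distances.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Arbitrary example vectors, normalized to unit length
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# Difference between the two sides of the identity; should be near zero
identity_gap = abs(squared_l2(a, b) - (2 - 2 * cosine(a, b)))
print(identity_gap)
```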

### 4. **Hybrid Retriever**  
   - Combines results from both BM25 and FAISS retrievers
   - Deduplicates the combined results with `dict.fromkeys`, which preserves first-seen order
   - Ensures comprehensive coverage of relevant documents
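
The combine-and-deduplicate step can be sketched on its own as follows. The helper name `merge_ranked` and the hit lists are hypothetical.

```python
def merge_ranked(*ranked_lists):
    """Concatenate ranked lists and drop later duplicates, keeping first-seen order."""
    merged = []
    for ranked in ranked_lists:
        merged.extend(ranked)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(merged))

# Hypothetical retriever outputs, best hit first
bm25_hits = ["cookies", "banana bread"]
dense_hits = ["banana bread", "roasted vegetables"]
merged_hits = merge_ranked(bm25_hits, dense_hits)
print(merged_hits)
```

Because the BM25 list comes first, its hits always precede the dense-only hits, and the retrieval scores themselves are ignored. If the relative order of the merged results matters, a score- or rank-based fusion would be needed instead.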

### 5. **RAG Answer Function**  
   - Constructs prompt with retrieved context and user query
   - Passes context to GPT-4o-mini with temperature=0 for consistent answers
   - Generates concise, recipe-specific responses
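
The prompt assembly described above can be sketched as a small pure function. The name `build_rag_prompt`, the instruction wording, the passage numbering, and the `max_passages` cap are assumptions and not the notebook's exact prompt. The stricter instruction is one possible way to discourage answers that go beyond the retrieved text.

```python
def build_rag_prompt(query, passages, max_passages=3):
    """Build a prompt that asks the model to answer only from the given passages."""
    # Number the passages so an answer can refer back to them
    context = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages[:max_passages], 1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n"
    )

rag_prompt = build_rag_prompt(
    "How do I make gluten-free banana bread?",
    ["Gluten-Free Banana Bread\nBanana bread recipe with almond flour and bananas."],
)
print(rag_prompt)
```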

### 6. **Query Execution**  
   - Tests pipeline with sample recipe questions
   - Demonstrates hybrid retrieval effectiveness

---

## 🎯 Key Features

- **Dual Retrieval Strategy**: BM25 + FAISS for robust document retrieval
- **Semantic Understanding**: Goes beyond keyword matching
- **Concise Answers**: GPT-4o-mini generates focused responses
- **Modular Design**: Easy to extend with more recipes or query types

In [None]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
import openai
import os

# 2. Sample Documents

recipes = [
    {"id": 1, "name": "Gluten-Free Banana Bread", "content": "Banana bread recipe with almond flour and bananas."},
    {"id": 2, "name": "Roasted Vegetables", "content": "Vegetable roasting instructions with carrots and broccoli."},
    {"id": 3, "name": "Chocolate Chip Cookies", "content": "Cookie recipe with flour, sugar, butter, and chocolate chips."}
]

texts = [f"{r['name']}\n{r['content']}" for r in recipes]
bm25_corpus = [t.lower().split() for t in texts]

# 3. BM25
bm25 = BM25Okapi(bm25_corpus)

# 4. FAISS + SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts, convert_to_numpy=True)

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

# 5. Hybrid Retriever

def hybrid_retriever(query, top_k=3):
    # BM25
    tokens = query.lower().split()
    bm_scores = bm25.get_scores(tokens)
    bm_top_idx = np.argsort(bm_scores)[::-1][:top_k]
    bm_results = [texts[i] for i in bm_top_idx]

    # FAISS
    q_emb = model.encode([query], convert_to_numpy=True)
    D, I = index.search(q_emb, top_k)
    faiss_results = [texts[i] for i in I[0]]

    # Combine & deduplicate
    combined = list(dict.fromkeys(bm_results + faiss_results))
    return combined

# 6. RAG Answer Function (new OpenAI API)
def rag_answer(query):
    # Retrieve documents
    context_list = hybrid_retriever(query)
    context_text = "\n".join(context_list)

    # Prompt for LLM
    prompt = f"""
Use the following context to answer the question concisely:

Context:
{context_text}

Question:
{query}
"""

    # New API syntax
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )

    # Extract answer
    return response.choices[0].message.content

# 7. Sample Queries

queries = [
    "How do I make gluten-free banana bread?",
    "What’s the process for roasting vegetables?",
    "Give me a chocolate chip cookie recipe."
]

for q in queries:
    answer = rag_answer(q)
    print(f"\nQUESTION: {q}\nANSWER: {answer}\n")


QUESTION: How do I make gluten-free banana bread?
ANSWER: To make gluten-free banana bread, use almond flour and ripe bananas as the main ingredients. Mix the almond flour with mashed bananas and any additional ingredients you prefer, then bake until golden brown.


QUESTION: What’s the process for roasting vegetables?
ANSWER: The process for roasting vegetables typically involves cutting the vegetables into uniform pieces, tossing them with oil and seasonings, spreading them on a baking sheet, and roasting in the oven at a high temperature until they are tender and slightly caramelized.


QUESTION: Give me a chocolate chip cookie recipe.
ANSWER: Here's a simple chocolate chip cookie recipe:

**Ingredients:**
- 2 1/4 cups all-purpose flour
- 1 cup sugar
- 1 cup butter, softened
- 2 cups chocolate chips

**Instructions:**
1. Preheat your oven to 350°F (175°C).
2. In a large bowl, cream together the softened butter and sugar until smooth.
3. Gradually add the flour and mix until com

# Part 6: RAGAS Evaluation for Recipe QA

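## 📊 Overview
This notebook implements a **RAGAS-like evaluation framework** to assess the quality of generated recipe answers. The metrics are simple word-overlap and string-similarity approximations computed against reference answers; the RAGAS library itself is not used.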

---

## 🎯 Evaluation Metrics

### Context Precision
- **Purpose**: Measures how much of the generated answer is supported by the reference answer
- **Calculation**: Share of unique words in the generated answer that also appear in the reference answer

### Faithfulness  
- **Purpose**: Measures how much of the reference answer is covered by the generated answer
- **Calculation**: Share of unique words in the reference answer that also appear in the generated answer

### String Similarity
- **Purpose**: Computes character-level similarity
- **Method**: Uses Python's `SequenceMatcher` for detailed comparison
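
Because the two word-overlap metrics split on whitespace, punctuation attached to a word (`butter,` versus `butter`) counts as a mismatch. The sketch below shows the effect on the scrambled-eggs pair from the evaluation dataset. The helper names `tokenize`, `overlap_precision`, and `overlap_recall` and the optional punctuation stripping are assumptions added for illustration; the precision and recall functions correspond to the Context Precision and Faithfulness calculations described above.

```python
import string

def tokenize(text, strip_punctuation=False):
    """Lowercase and split on whitespace, optionally trimming punctuation from each word."""
    words = text.lower().split()
    if strip_punctuation:
        words = [w.strip(string.punctuation) for w in words]
    return {w for w in words if w}

def overlap_precision(generated, reference, strip_punctuation=False):
    """Share of unique generated words that also occur in the reference."""
    gen = tokenize(generated, strip_punctuation)
    ref = tokenize(reference, strip_punctuation)
    return len(gen & ref) / max(len(gen), 1)

def overlap_recall(generated, reference, strip_punctuation=False):
    """Share of unique reference words that also occur in the generated answer."""
    gen = tokenize(generated, strip_punctuation)
    ref = tokenize(reference, strip_punctuation)
    return len(gen & ref) / max(len(ref), 1)

generated = "Beat eggs, cook on low heat with butter, and stir gently until set."
reference = "Beat eggs, cook slowly with butter while stirring until set."

for strip in (False, True):
    precision = overlap_precision(generated, reference, strip)
    recall = overlap_recall(generated, reference, strip)
    print(f"strip_punctuation={strip}: precision={precision:.2f}, recall={recall:.2f}")
```

With punctuation trimmed, `butter` counts as shared, so both scores rise for this pair. Neither variant recognizes `stir` and `stirring` as the same word, which is a limitation of exact word matching in general.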

---

## 📋 Evaluation Dataset

### 5 Recipe QA Pairs:
1. **Gluten-free banana bread** preparation
2. **Vegetable roasting** process  
3. **Chocolate chip cookie** recipe
4. **Scrambled eggs** cooking method
5. **Simple salad** preparation

---


In [93]:

from difflib import SequenceMatcher

# Sample QA pairs 
qa_pairs = [
    {"query": "How do I make gluten-free banana bread?",
     "generated_answer": "Use almond flour and bananas, mix ingredients, and bake.",
     "reference_answer": "Use almond flour and bananas, mix ingredients, and bake."},
    
    {"query": "What’s the process for roasting vegetables?",
     "generated_answer": "Wash, cut, season vegetables, and roast at 200°C for 20-30 minutes.",
     "reference_answer": "Wash, cut, season vegetables, and roast at 200°C for 20-30 minutes."},
    
    {"query": "Give me a chocolate chip cookie recipe.",
     "generated_answer": "Mix butter, sugar, flour, chocolate chips; bake at 175°C.",
     "reference_answer": "Mix butter, sugar, flour, chocolate chips; bake at 175°C."},
    
    {"query": "How do I make scrambled eggs?",
     "generated_answer": "Beat eggs, cook on low heat with butter, and stir gently until set.",
     "reference_answer": "Beat eggs, cook slowly with butter while stirring until set."},
    
    {"query": "How to prepare a simple salad?",
     "generated_answer": "Chop lettuce, tomatoes, cucumber, add olive oil and salt.",
     "reference_answer": "Chop lettuce, tomatoes, cucumber, add olive oil and salt."}
]

def context_precision(generated, reference):
    """Approximate: proportion of words in generated answer that exist in reference."""
    gen_words = set(generated.lower().split())
    ref_words = set(reference.lower().split())
    return len(gen_words & ref_words) / max(len(gen_words), 1)

def faithfulness(generated, reference):
    """Approximate: ratio of matching words to total reference words."""
    ref_words = set(reference.lower().split())
    gen_words = set(generated.lower().split())
    return len(gen_words & ref_words) / max(len(ref_words), 1)

def string_similarity(generated, reference):
    """Use SequenceMatcher to get similarity score (0 to 1)."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

print("=== Recipe QA Evaluation ===\n")
for i, qa in enumerate(qa_pairs, 1):
    gen = qa['generated_answer']
    ref = qa['reference_answer']
    
    cp = context_precision(gen, ref)
    fs = faithfulness(gen, ref)
    ss = string_similarity(gen, ref)
    
    print(f"{i}. Query: {qa['query']}")
    print(f"Generated Answer: {gen}")
    print(f"Reference Answer: {ref}")
    print(f"Context Precision: {cp:.2f}, Faithfulness: {fs:.2f}, String Similarity: {ss:.2f}")
    print("-" * 60)

avg_cp = sum(context_precision(qa['generated_answer'], qa['reference_answer']) for qa in qa_pairs) / len(qa_pairs)
avg_fs = sum(faithfulness(qa['generated_answer'], qa['reference_answer']) for qa in qa_pairs) / len(qa_pairs)
avg_ss = sum(string_similarity(qa['generated_answer'], qa['reference_answer']) for qa in qa_pairs) / len(qa_pairs)

print("\n=== Average Evaluation Scores ===")
print(f"Context Precision: {avg_cp:.2f}")
print(f"Faithfulness: {avg_fs:.2f}")
print(f"String Similarity: {avg_ss:.2f}")


=== Recipe QA Evaluation ===

1. Query: How do I make gluten-free banana bread?
Generated Answer: Use almond flour and bananas, mix ingredients, and bake.
Reference Answer: Use almond flour and bananas, mix ingredients, and bake.
Context Precision: 1.00, Faithfulness: 1.00, String Similarity: 1.00
------------------------------------------------------------
2. Query: What’s the process for roasting vegetables?
Generated Answer: Wash, cut, season vegetables, and roast at 200°C for 20-30 minutes.
Reference Answer: Wash, cut, season vegetables, and roast at 200°C for 20-30 minutes.
Context Precision: 1.00, Faithfulness: 1.00, String Similarity: 1.00
------------------------------------------------------------
3. Query: Give me a chocolate chip cookie recipe.
Generated Answer: Mix butter, sugar, flour, chocolate chips; bake at 175°C.
Reference Answer: Mix butter, sugar, flour, chocolate chips; bake at 175°C.
Context Precision: 1.00, Faithfulness: 1.00, String Similarity: 1.00
-------