# Introduction to RAG (Retrieval Augmented Generation) with LLMs

This notebook is a beginner-friendly introduction to building a simple RAG (Retrieval Augmented Generation) system on top of a Large Language Model (LLM). It relies on a small set of libraries to keep dependency issues to a minimum.

## What is RAG?

**RAG (Retrieval Augmented Generation)** is a technique that enhances Large Language Models (LLMs) by providing them with relevant information retrieved from a knowledge base. This helps the model generate more accurate and contextually relevant responses.

In simple terms:
1. You have a question or query
2. The system finds relevant information from a database or documents
3. This information is sent along with your question to the LLM
4. The LLM uses both your question and the retrieved information to generate a better answer
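
Steps 3 and 4 amount to assembling a single prompt from the retrieved passages and the question. A minimal sketch of that assembly, assuming a hypothetical helper `build_prompt` and two hand-picked passages standing in for what a retriever would return:

```python
def build_prompt(question, passages):
    """Join retrieved passages and the user's question into one prompt string."""
    # Number the passages so the model (and a human reader) can tell them apart
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Context information:\n"
        f"{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above."
    )


# Hand-picked passages standing in for retriever output (illustrative only)
passages = [
    "GlobalBattery Inc. supplies 80% of battery components.",
    "Suntech Displays is the exclusive supplier of OLED and LCD screens.",
]
prompt = build_prompt("Who supplies the batteries?", passages)
print(prompt)
```

The model never sees the knowledge base directly; it only sees whatever text ends up in this prompt, which is why retrieval quality matters so much.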

## What is an LLM?

An **LLM (Large Language Model)** is an AI system trained on vast amounts of text data that can understand and generate human-like text. Examples include Meta's LLaMA and the many open models hosted on Hugging Face.

## What We'll Build

In this notebook, we'll build a simple RAG system that:
1. Takes a document about NovaTech Electronics' supply chain
2. Creates a searchable knowledge base from this document
3. Allows users to ask questions about the document
4. Retrieves relevant passages from the document
5. Uses a free LLM to generate answers based on these passages

# Setup and Installation

First, let's install the necessary libraries. We'll use:
- `sentence-transformers` for creating embeddings
- `scikit-learn` for similarity calculations
- `requests` for making API calls to the LLM

In [1]:
# Install required packages
# Comment out the next line if these packages are already installed
!pip install sentence-transformers scikit-learn numpy requests


In [2]:
# Import necessary libraries
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import requests
import json
import time  # For adding slight delays between API calls
import os  # For environment variables

# Preparing Our Document

Let's start by defining our document. In a real-world scenario, you would typically load documents from files or a database, but for simplicity, we'll define the text directly in the notebook.

In [3]:
# The document we'll use for our RAG system
document_text = """# NovaTech Electronics: Supply Chain Overview 2025

## 1. Company Profile

NovaTech Electronics is a leading manufacturer of consumer electronics based in Austin, Texas. Founded in 2010, the company specializes in producing smartphones, tablets, smart home devices, and wearable technology. With annual revenue of $3.2 billion and 4,500 employees worldwide, NovaTech has established itself as an innovative mid-market player competing with larger corporations through agility and cutting-edge product design.

## 2. Manufacturing Facilities

### 2.1 Primary Production Plants
- Austin, Texas, USA: Headquarters and R&D center with limited production of high-end prototypes and specialty products
- Shenzhen, China: Main production facility handling 65% of total manufacturing volume, specializing in smartphones and tablets
- Penang, Malaysia: Secondary production facility handling 25% of manufacturing volume, focusing on wearables and smart home devices
- Guadalajara, Mexico: Newest facility (opened 2023) handling 10% of production, primarily for North American market products

### 2.2 Production Capacity
Total annual production capacity: 12 million devices
- Smartphones: 5 million units
- Tablets: 2 million units
- Smart home devices: 3 million units
- Wearables: 2 million units

## 3. Key Suppliers

### 3.1 Component Suppliers
- Quantum Microchips (San Jose, USA): Primary supplier of processor chips and memory modules
- Suntech Displays (Seoul, South Korea): Exclusive supplier of OLED and LCD screens
- GlobalBattery Inc. (Osaka, Japan): Supplies 80% of battery components
- PrecisionCircuits Ltd. (Taipei, Taiwan): Provides printed circuit boards and electronic components
- FastConn Technologies (Shenzhen, China): Supplies connectors, cables, and charging components

### 3.2 Raw Materials Suppliers
- MetalTech Industries (Pittsburgh, USA): Aluminum and specialized metal alloys for device casings
- PolymerSolutions (Singapore): High-grade plastics and composite materials
- GlassTech Innovations (Corning, USA): Specialized glass for screens and device protection

### 3.3 Packaging Suppliers
- EcoPack Enterprises (Portland, USA): Sustainable packaging materials for North American market
- AsiaBox Corporation (Bangkok, Thailand): Packaging materials for Asian markets
- EuroWrap Solutions (Munich, Germany): Packaging for European market distribution

## 4. Distribution Network

### 4.1 Distribution Centers
- North America: Dallas (TX), Toronto (Canada), Mexico City (Mexico)
- Europe: Rotterdam (Netherlands), Warsaw (Poland), Manchester (UK)
- Asia: Singapore, Tokyo (Japan), Shanghai (China)
- Oceania: Sydney (Australia)
- South America: São Paulo (Brazil)

### 4.2 Transportation Partners
- FastFreight Logistics: Primary partner for ocean freight (65% of international shipping)
- AeroDelivery Inc.: Air freight partner for high-priority shipments
- RailLink Logistics: Rail transport within North America and Europe
- RegionalTrucking Partners: Last-mile delivery in various regions

## 5. Key Customers

### 5.1 Retail Partners
- TechMart: Largest retail partner with 1,200 stores across North America
- ElectroWorld: Major electronics retailer in Europe with 800 locations
- Digital Life Stores: Premium retail partner in Asia with 650 stores
- Online Marketplace Partners: TechZone.com, GlobalGadgets, and PrimeShop

### 5.2 Telecommunications Partners
- ConnectMobile: Distributes NovaTech smartphones through 2,000+ locations
- EuroConnect: European telecom partner with exclusive deals on certain models
- AsiaLink Wireless: Major distribution partner across Southeast Asia

### 5.3 Corporate Clients
- InnovateCorp: Enterprise client purchasing 50,000 devices annually for employee use
- GlobalFinance Group: Corporate client with standing order for custom secure devices
- HealthTech Systems: Purchases specialized wearables for healthcare applications

## 5. Supply Chain Bottlenecks and Challenges

### Current Bottlenecks
NovaTech currently faces several critical bottlenecks in its supply chain operations:

- **Semiconductor Shortage**: Severe constraints in obtaining specialized microprocessors from Quantum Microchips, delaying production of premium product lines by 3-4 weeks
- **Shanghai Port Congestion**: Ongoing shipping delays at Shanghai port affecting 30% of Asian exports
- **Malaysian Factory Capacity Limitations**: The Penang facility is operating at 94% capacity, creating production constraints
- **Quality Control Bottlenecks**: Inspection processes at the Mexico City facility have created a backlog of 10,000 units awaiting final approval

### Additional Supply Chain Challenges

- **Transportation Delays**: International shipping disruptions affecting timely delivery
- **Inventory Management**: Balancing stock levels across multiple distribution centers
- **Quality Control**: Maintaining consistent quality standards across diverse supplier base
- **Risk Factors**: Geopolitical tensions affecting Asian manufacturing regions

## 6. Sustainability Initiatives

NovaTech has implemented several sustainability measures across its supply chain:

- Reduced packaging waste by 35% since 2022
- Solar power installations at 3 manufacturing facilities
- Supplier sustainability rating system
- Carbon-neutral shipping options for 40% of deliveries
- E-waste recycling program for end-of-life products
- Reduced water usage at manufacturing facilities by 25% since 2022
- Carbon footprint reduction program targeting 30% decrease by 2027
- Transition to 100% recyclable packaging by end of 2025
- Supplier sustainability rating system implemented in 2024"""

print(f"Document length: {len(document_text)} characters")

Document length: 5619 characters


# Step 1: Text Splitting

The first step in building our RAG system is to split our document into smaller chunks. This is necessary because:

1. LLMs have a maximum context length (the amount of text they can process at once)
2. Smaller chunks make it easier to find the most relevant parts of the document
3. It allows us to create more precise vector embeddings for search

We'll create a simple text splitter function:

In [4]:
# Function to split text into chunks with improved handling of headers and sections
def split_text(text, chunk_size=300, overlap=50):
    # Split the text into paragraphs
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""

    # Process paragraphs with special handling for headers
    for paragraph in paragraphs:
        # If this paragraph is a header (starts with #), handle it specially
        if paragraph.strip().startswith('#'):
            # If we have content in the current chunk, save it
            if current_chunk:
                chunks.append(current_chunk.strip())

            # Start a new chunk with this header
            current_chunk = paragraph
            continue

        # Calculate the length if we add this paragraph
        new_length = len(current_chunk) + len(paragraph) + (2 if current_chunk else 0)  # +2 for '\n\n'

        # If adding this paragraph would make the chunk too long, save current chunk and start a new one
        if new_length > chunk_size and current_chunk:
            chunks.append(current_chunk.strip())

            # For overlap, try to include the last sentence or two from previous chunk
            if overlap > 0:
                # Find the last period that would keep us under the overlap limit
                last_period_pos = current_chunk.rfind('.', len(current_chunk) - overlap)
                if last_period_pos > 0:
                    current_chunk = current_chunk[last_period_pos + 1:].strip()
                else:
                    current_chunk = current_chunk[-overlap:].strip()
            else:
                current_chunk = ""

        # Add the paragraph to the current chunk
        if current_chunk:
            current_chunk += "\n\n" + paragraph
        else:
            current_chunk = paragraph

    # Add the last chunk if it has content
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Split the document into chunks
text_chunks = split_text(document_text)

# Display information about our chunks
print(f"Split the document into {len(text_chunks)} chunks")
print(f"Example chunk: \n{text_chunks[0][:200]}...")

Split the document into 24 chunks
Example chunk: 
# NovaTech Electronics: Supply Chain Overview 2025...
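
The role of `overlap` is easier to see in isolation. The sketch below is a simplified fixed-width character window, not the paragraph-aware `split_text` above; `sliding_window_chunks` is a hypothetical helper used only for illustration:

```python
def sliding_window_chunks(text, chunk_size=20, overlap=5):
    """Cut text into fixed-width character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz0123456789"
windows = sliding_window_chunks(sample, chunk_size=20, overlap=5)
for w in windows:
    print(repr(w))
```

Each window repeats the last five characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.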


# Step 2: Creating Embeddings

Now that we have our text chunks, we need to convert them into numerical representations (vectors) that capture their meaning. This process is called **embedding**.

**What are embeddings?**

Embeddings are numerical representations of text that capture semantic meaning. Similar texts will have similar embeddings, which allows us to find related content through vector similarity search.
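
To make "similar embeddings" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions:

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (illustrative values, not model output)
battery_supplier = [0.9, 0.1, 0.0]
battery_vendor = [0.8, 0.2, 0.1]
shipping_port = [0.0, 0.2, 0.9]

print(cosine(battery_supplier, battery_vendor))  # close to 1: similar direction
print(cosine(battery_supplier, shipping_port))   # close to 0: nearly orthogonal
```

Cosine similarity compares direction rather than length, so two texts can score as similar even when their vectors have different magnitudes.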

We'll use a free embedding model from Hugging Face called "all-MiniLM-L6-v2". It is small and fast, and it maps each text to a 384-dimensional vector, which is sufficient for our purposes.

In [5]:
# Create our embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # This is a good balance of speed and quality

# Create embeddings for our chunks
print("Creating embeddings for document chunks...")
chunk_embeddings = embedding_model.encode(text_chunks)
print(f"Created {len(chunk_embeddings)} embeddings of dimension {chunk_embeddings.shape[1]}")

Loading embedding model...


Creating embeddings for document chunks...


Created 24 embeddings of dimension 384


# Step 3: Building a Retrieval System

Now we'll create a retrieval system that can find the most relevant chunks for a given query. This is the "Retrieval" part of RAG.

When a user asks a question, we'll:
1. Convert the question into an embedding vector
2. Find the most similar chunks using cosine similarity
3. Return these chunks as context for the LLM
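
The three steps reduce to scoring every chunk against the query and keeping the best `top_k`. A minimal sketch with made-up scores standing in for cosine similarities; `top_k_indices` is a hypothetical helper, and the chunk labels are placeholders:

```python
def top_k_indices(scores, top_k=2):
    """Return the indices of the top_k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Placeholder chunks with made-up similarity scores, one score per chunk
chunks = ["company profile", "component suppliers", "distribution centers", "packaging suppliers"]
scores = [0.21, 0.83, 0.10, 0.57]

best = top_k_indices(scores, top_k=2)
context = [chunks[i] for i in best]
print(best, context)
```

Choosing `top_k` is a trade-off: more chunks raise the chance that the answer is in the context, but they also add unrelated text the LLM has to ignore.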

In [6]:
# Function to retrieve the most relevant chunks for a query
def get_relevant_chunks(query, top_k=4):
    # Create an embedding for the query
    query_embedding = embedding_model.encode([query])[0]

    # Extract keywords from the query to improve retrieval
    query_lower = query.lower()
    keywords = []

    # Extract all important words from the query
    import re
    # Remove stop words and punctuation
    query_words = re.sub(r'[^\w\s]', '', query_lower).split()
    stop_words = ['a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'but', 'and', 'or',
                  'what', 'which', 'who', 'whom', 'whose', 'where', 'when', 'why', 'how']

    # Add all non-stop words as keywords
    for word in query_words:
        if word not in stop_words and len(word) > 2:
            keywords.append(word)

    # Query Type: Suppliers & Components (Original)
    # Example: "who are the main component suppliers for novatech?"
    if 'who' in query_lower and ('supplier' in query_lower or 'vendor' in query_lower or 'component' in query_lower):
        keywords.extend(['supplier', 'suppliers', 'vendor', 'vendors', 'component', 'components', 'manufacturer', 'manufacturers', 'Quantum Microchips', 'Suntech Displays', 'GlobalBattery Inc.', 'PrecisionCircuits Ltd.', 'FastConn Technologies'])

    # Query Type: Manufacturing & Production
    # Example: "where are the manufacturing facilities located?" or "what is their production capacity?"
    elif 'manufactur' in query_lower or 'produc' in query_lower or 'plant' in query_lower or 'facility' in query_lower or 'facilities' in query_lower:
        keywords.extend(['manufacturing', 'production', 'plant', 'plants', 'facility', 'facilities', 'capacity', 'shenzhen', 'penang', 'guadalajara', 'austin'])

    # Query Type: Customers & Partners
    # Example: "who are novatech's key customers?" or "which retailers sell their products?"
    elif 'customer' in query_lower or 'client' in query_lower or 'partner' in query_lower or 'retailer' in query_lower:
        keywords.extend(['customer', 'customers', 'client', 'clients', 'partner', 'partners', 'retail', 'telecommunications', 'corporate', 'TechMart', 'ElectroWorld', 'Digital Life Stores', 'ConnectMobile'])

    # Query Type: Distribution & Logistics
    # Example: "how does novatech handle distribution?" or "who are their shipping partners?"
    elif 'distribut' in query_lower or 'logistic' in query_lower or 'transport' in query_lower or 'shipping' in query_lower:
        keywords.extend(['distribution', 'logistics', 'transportation', 'shipping', 'freight', 'delivery', 'centers', 'partners', 'FastFreight Logistics', 'AeroDelivery Inc.', 'RailLink Logistics'])

    # Query Type: Supply Chain Problems & Risks
    # Example: "what are the current supply chain bottlenecks?" or "what challenges do they face?"
    elif 'problem' in query_lower or 'bottleneck' in query_lower or 'challenge' in query_lower or 'risk' in query_lower or 'issue' in query_lower or 'delay' in query_lower:
        keywords.extend(['bottleneck', 'bottlenecks', 'challenge', 'challenges', 'risk', 'risks', 'shortage', 'congestion', 'delay', 'quality control', 'constraints'])

    # Query Type: Sustainability & Environmental Initiatives
    # Example: "what are novatech's sustainability initiatives?" or "is their packaging eco-friendly?"
    elif 'sustainab' in query_lower or 'environment' in query_lower or 'green' in query_lower or 'eco-friendly' in query_lower or 'carbon' in query_lower or 'recycl' in query_lower:
        keywords.extend(['sustainability', 'sustainable', 'initiatives', 'recycling', 'carbon', 'neutral', 'packaging', 'waste', 'e-waste', 'solar', 'footprint'])

    # Query Type: Company Location or Headquarters
    # Example: "where is novatech electronics based?"
    elif 'where' in query_lower and ('based' in query_lower or 'located' in query_lower or 'headquarters' in query_lower or 'hq' in query_lower):
        keywords.extend(['location', 'headquarters', 'hq', 'based', 'address', 'austin', 'texas'])


    # Calculate similarity between the query and all chunks
    similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]

    # Boost scores for chunks containing keywords
    boosted_similarities = similarities.copy()
    for i, chunk in enumerate(text_chunks):
        # Count keyword matches
        keyword_matches = sum(1 for keyword in keywords if keyword.lower() in chunk.lower())

        # Boost score based on keyword matches
        if keyword_matches > 0:
            boosted_similarities[i] += keyword_matches * 0.05  # Boost by 0.05 per keyword match

            # Extra boost for supplier sections if relevant
            if any(k in query.lower() for k in ['supplier', 'vendor', 'manufacturer']) and \
               any(k in chunk.lower() for k in ['supplier', 'vendor', 'manufacturer']):
                boosted_similarities[i] += 0.15  # Extra boost for supplier sections

    # Get the indices of the top_k most similar chunks using boosted scores
    top_indices = boosted_similarities.argsort()[-top_k:][::-1]

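    # Calculate combined scores for display (semantic + keyword boosts)
    # Note: this display score adds 0.1 per keyword match and omits the supplier boost,
    # so it differs from the boosted score used for ranking above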
    combined_scores = []
    for i in top_indices:
        keyword_matches = sum(1 for keyword in keywords if keyword.lower() in text_chunks[i].lower())
        # Create a combined score that better reflects total relevance
        combined_score = similarities[i] + (keyword_matches * 0.1)
        combined_scores.append(combined_score)

    # Return the top chunks with combined scores
    results = [(text_chunks[i], float(combined_scores[j])) for j, i in enumerate(top_indices)]

    # Print some debug info about the retrieved chunks
    print("\nRetrieved chunks:")
    for i, (chunk, combined_score) in enumerate(results):
        # Count keyword matches in this chunk
        keyword_matches = sum(1 for keyword in keywords if keyword.lower() in chunk.lower())

        # Get original semantic similarity score for this chunk
        original_idx = text_chunks.index(chunk)
        semantic_score = similarities[original_idx]

        # Check if chunk contains a section header with keywords
        has_header_match = False
        for keyword in keywords:
            lines = chunk.split('\n')
            for line in lines:
                if line.strip().startswith('#') and keyword.lower() in line.lower():
                    has_header_match = True
                    break
            if has_header_match:
                break

        # Print detailed scoring information
        print(f"Chunk {i+1}:")
        print(f"  - Combined Score: {combined_score:.4f} (Semantic: {semantic_score:.4f}, Keywords: {keyword_matches})")
        print(f"  - Header Match: {'Yes' if has_header_match else 'No'}")
        print(f"  - Preview: {chunk[:100]}...")
        print()

    return results

# Test our retriever with a sample question
sample_query = "Who are NovaTech's main component suppliers?"
relevant_chunks = get_relevant_chunks(sample_query)

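# Display the most relevant chunk
# Note: the value printed here is the combined (semantic + keyword) score, not the raw cosine similarity
print("\nMost relevant chunk:")
print(f"Similarity score: {relevant_chunks[0][1]:.4f}")
print(relevant_chunks[0][0])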


Retrieved chunks:
Chunk 1:
  - Combined Score: 1.7011 (Semantic: 0.6011, Keywords: 11)
  - Header Match: Yes
  - Preview: ### 3.1 Component Suppliers
- Quantum Microchips (San Jose, USA): Primary supplier of processor chip...

Chunk 2:
  - Combined Score: 0.7877 (Semantic: 0.6877, Keywords: 1)
  - Header Match: No
  - Preview: Company Profile

NovaTech Electronics is a leading manufacturer of consumer electronics based in Aus...

Chunk 3:
  - Combined Score: 0.8404 (Semantic: 0.5404, Keywords: 3)
  - Header Match: Yes
  - Preview: ### 3.2 Raw Materials Suppliers
- MetalTech Industries (Pittsburgh, USA): Aluminum and specialized m...

Chunk 4:
  - Combined Score: 0.7598 (Semantic: 0.4598, Keywords: 3)
  - Header Match: Yes
  - Preview: ## 3. Key Suppliers...


Most relevant chunk:
Similarity score: 1.7011
### 3.1 Component Suppliers
- Quantum Microchips (San Jose, USA): Primary supplier of processor chips and memory modules
- Suntech Displays (Seoul, South Korea): Exclusive supplier of

# Step 4: Setting Up the LLM with Hugging Face (Free Option)

Now we need to set up our language model. One free option is Hugging Face's inference API, which gives access to open-source models.

An example is "facebook/bart-large-cnn", a model trained for summarization: it condenses the text it is given rather than following instructions, so it suits summarizing retrieved passages better than open-ended question answering.

**Note:** For production use cases, you would want a more capable model. The `generate_response` cell below is only a deprecated placeholder; the working implementations in this notebook are `generate_response_deepseek` and `simple_extractive_answer`, defined in the following sections.

In [7]:
# Function to generate a response using an external API (removed OpenAI-specific implementation)
def generate_response(query, context, api_key=None):
    # This function has been deprecated in favor of model-specific implementations
    # Please use generate_response_deepseek or simple_extractive_answer instead
    return "This function has been deprecated. Please use generate_response_deepseek or simple_extractive_answer instead."

# Using Alternative LLM APIs

This notebook supports using the Deepseek API for generating responses.
You can also use the simple extractive approach that doesn't require any API keys.

# Deepseek API Implementation

The function below sends the question and the retrieved context to Deepseek's chat completions endpoint and returns the model's reply. It requires a Deepseek API key.

In [8]:
# Function to generate a response using the Deepseek API
def generate_response_deepseek(query, context, api_key=None):
    # API endpoint for Deepseek
    API_URL = "https://api.deepseek.com/v1/chat/completions"  # Deepseek API endpoint

    # Combine the context and query into a prompt
    prompt = f"""Context information:
{context}

Question: {query}

Answer the question based only on the context information provided above. If the context doesn't contain the answer, say that you don't have enough information."""

    # Set up the headers and payload
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Create the payload for Deepseek's chat completion API
    payload = {
        "model": "deepseek-chat",  # Using deepseek-chat model
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.3,
        "max_tokens": 500
    }

    # Make the API request (with a timeout so a stalled connection cannot hang the notebook)
    try:
        response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Parse the response
        response_json = response.json()

        # Extract the generated text from the response
        if "choices" in response_json and len(response_json["choices"]) > 0:
            return response_json["choices"][0]["message"]["content"]
        else:
            return "Error: Unexpected response format from Deepseek API"
    except requests.exceptions.RequestException as e:
        return f"Error: {str(e)}"
    except ValueError as e:
        return f"Error parsing JSON response: {str(e)}"
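
The parsing branch can be exercised without a network call or an API key. The sketch below feeds hand-written dictionaries, shaped like the response the function above expects, through a hypothetical helper `extract_answer`; the sample payloads are illustrative and are not real API output:

```python
def extract_answer(response_json):
    """Pull the assistant's text out of a chat-completion style response."""
    choices = response_json.get("choices") or []
    if not choices:
        return "Error: Unexpected response format"
    # Use .get() at each level so a partially filled response does not raise
    return choices[0].get("message", {}).get("content", "")


# Hand-written stand-in for a successful response (not real API output)
fake_ok = {"choices": [{"message": {"role": "assistant", "content": "Quantum Microchips."}}]}
# Hand-written stand-in for a response that carries no choices
fake_empty = {"error": {"message": "invalid key"}}

print(extract_answer(fake_ok))
print(extract_answer(fake_empty))
```

Testing the parsing separately from the HTTP call makes it easier to tell whether a failure comes from the network, the credentials, or the response handling.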

In [9]:
# Load environment variables from .env file if present
try:
    from dotenv import load_dotenv
    load_dotenv()  # Load environment variables from .env file
    print("Loaded environment variables from .env file")
except ImportError:
    print("dotenv package not installed. Using environment variables directly.")

# Set your Deepseek API key here if not using environment variables
DEEPSEEK_API_KEY = None

# Function to get the Deepseek API key from environment or variable
def get_deepseek_api_key():
    # First check if it's set as an environment variable
    api_key = os.environ.get("DEEPSEEK_API_KEY")
    # If not, use the one defined in this notebook
    if not api_key:
        api_key = DEEPSEEK_API_KEY
    return api_key

# Test our Deepseek LLM with a sample question and context if API key is available
deepseek_api_key = get_deepseek_api_key()
if deepseek_api_key:
    sample_context = relevant_chunks[0][0]
    response = generate_response_deepseek(sample_query, sample_context, deepseek_api_key)

    print("\nDeepseek LLM Response:")
    print(response)
else:
    print("\nNo Deepseek API key provided. Please set your API key to test the Deepseek LLM.")

dotenv package not installed. Using environment variables directly.

No Deepseek API key provided. Please set your API key to test the Deepseek LLM.


# Alternative Free LLM Options

## No-Auth Option: Using a Simple Extractive Approach

If you don't want to create an account or manage an API key, you can use a simpler approach that requires no authentication. It is not an LLM: it selects the most relevant existing sentences instead of generating new text, but it can still be useful for basic RAG:

In [10]:
# An improved function that extracts the most relevant sentences from the retrieved chunks
def simple_extractive_answer(question, chunks):
    # Check if we have any chunks
    if not chunks:
        return "I don't have enough information to answer this question."

    # Uses the module-level embedding_model created earlier
    # (no `global` statement is needed just to read it)

    # Extract keywords from the question to improve matching
    question_lower = question.lower()
    keywords = []

    # Extract all important words from the question (not just question type)
    import re
    # Remove stop words and punctuation
    question_words = re.sub(r'[^\w\s]', '', question_lower).split()
    stop_words = ['a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'but', 'and', 'or',
                  'what', 'which', 'who', 'whom', 'whose', 'where', 'when', 'why', 'how']

    # Add all non-stop words as keywords
    for word in question_words:
        if word not in stop_words and len(word) > 2:
            keywords.append(word)

    # Add specific keywords based on question types
    if 'who' in question_lower or 'supplier' in question_lower or 'vendors' in question_lower:
        keywords.extend(['supplier', 'suppliers', 'vendor', 'vendors', 'partner', 'partners', 'manufacturer', 'manufacturers', 'company', 'companies'])
    if 'where' in question_lower or 'facility' in question_lower or 'facilities' in question_lower:
        keywords.extend(['located', 'location', 'facility', 'facilities', 'center', 'centers', 'factory', 'factories'])
    if 'bottleneck' in question_lower or 'issue' in question_lower or 'problem' in question_lower:
        keywords.extend(['bottleneck', 'constraint', 'shortage', 'delay', 'limited', 'issue', 'problem', 'challenge'])

    # Combine all chunks into one text, with weights based on similarity scores
    weighted_chunks = []
    for chunk, score in chunks:
        # Boost score if chunk contains keywords
        keyword_matches = sum(1 for keyword in keywords if keyword.lower() in chunk.lower())
        boosted_score = score + (keyword_matches * 0.05)  # Boost by 0.05 per keyword match
        weighted_chunks.append((chunk, boosted_score))

    # Sort chunks by boosted similarity score
    weighted_chunks.sort(key=lambda x: x[1], reverse=True)

    # Prepare all text from chunks
    all_text = "\n\n".join([chunk for chunk, _ in weighted_chunks])

    # Split into sentences
    sentences = [s.strip() for s in all_text.replace('\n', ' ').split('.')
                if len(s.strip()) > 10]  # Filter out very short sentences

    # Boost sentences that contain keywords
    keyword_boosts = [sum(1 for keyword in keywords if keyword.lower() in s.lower()) for s in sentences]

    # Create embeddings for sentences and question
    sentence_embeddings = embedding_model.encode(sentences)
    question_embedding = embedding_model.encode([question])[0]

    # Calculate similarity
    similarities = cosine_similarity([question_embedding], sentence_embeddings)[0]

    # Combine semantic similarity with keyword matching
    for i in range(len(similarities)):
        similarities[i] += keyword_boosts[i] * 0.1  # Adjust the weight of keyword matching

    # Get top 4 most relevant sentences
    top_indices = similarities.argsort()[-4:][::-1]
    top_sentences = [sentences[i] + '.' for i in top_indices]

    # Combine into an answer
    answer = ' '.join(top_sentences)

    # If the answer is too short or doesn't seem relevant, try to find a section header that matches
    if len(answer) < 50 or max(similarities) < 0.3:
        for chunk, _ in chunks:
            if any(keyword.lower() in chunk.lower() for keyword in keywords):
                lines = chunk.split('\n')
                for line in lines:
                    if line.strip().startswith('#') and any(keyword.lower() in line.lower() for keyword in keywords):
                        # Found a relevant section header, include the content below it
                        section_content = chunk[chunk.find(line):]
                        if section_content and len(section_content) > 20:
                            return section_content[:500]  # Limit to 500 chars

    return answer
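
Stripped of embeddings and boosts, the extractive idea is: split the retrieved text into sentences, score each one against the question, and return the best ones verbatim. A minimal sketch that scores by shared words only; `extractive_answer` and its word-overlap scoring are illustrative simplifications, not the function above:

```python
import re


def extractive_answer(question, text, top_n=1):
    """Return the top_n sentences sharing the most words with the question."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    question_words = set(re.findall(r"\w+", question.lower()))

    def overlap(sentence):
        # Number of distinct words the sentence shares with the question
        return len(question_words & set(re.findall(r"\w+", sentence.lower())))

    # sorted() is stable, so tied sentences keep their original document order
    ranked = sorted(sentences, key=overlap, reverse=True)
    return " ".join(ranked[:top_n])


text = (
    "The Penang facility is operating at 94% capacity. "
    "FastFreight Logistics is the primary partner for ocean freight. "
    "AeroDelivery handles air freight for high-priority shipments."
)
answer = extractive_answer("Which partner handles ocean freight?", text)
print(answer)
```

Because the answer is copied from the source, this approach cannot invent facts, but it also cannot combine or rephrase information the way an LLM can.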

# Step 5: Putting It All Together - Building Our RAG System

Now let's combine all the components we've built to create our complete RAG (Retrieval Augmented Generation) system:

1. Take a user question
2. Retrieve relevant chunks from our document
3. Send the question and relevant context to the LLM
4. Return the generated answer

This is the core of how RAG works!
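
The four steps can be sketched as one function that takes the retriever and the generator as arguments. `rag_answer`, `fake_retriever`, and `fake_generator` are hypothetical stand-ins so the wiring can run without a model or an API key; in this notebook those roles are played by `get_relevant_chunks` and `generate_response_deepseek`:

```python
def rag_answer(question, retriever, generator, top_k=2):
    """Retrieve context for a question, then ask the generator to answer from it."""
    results = retriever(question)[:top_k]  # list of (chunk, score) tuples, best first
    context = "\n\n".join(chunk for chunk, _ in results)
    return generator(question, context)


def fake_retriever(question):
    # Stand-in retriever: returns fixed (chunk, score) pairs with made-up scores
    return [
        ("GlobalBattery Inc. (Osaka, Japan): Supplies 80% of battery components", 0.9),
        ("Suntech Displays (Seoul, South Korea): Exclusive supplier of OLED and LCD screens", 0.4),
        ("Oceania: Sydney (Australia)", 0.1),
    ]


def fake_generator(question, context):
    # Stand-in LLM: reports how many passages it received instead of generating text
    passages = context.split("\n\n")
    return f"Answering '{question}' from {len(passages)} passage(s)."


result = rag_answer("Who supplies batteries?", fake_retriever, fake_generator)
print(result)
```

Passing the retriever and generator in as arguments keeps the pipeline easy to test and lets you swap the LLM backend without touching the retrieval code.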

In [11]:
# Helper: ask the LLM to filter and rank the retrieved chunks
def filter_chunks_with_llm(question, chunks, api_key, llm_type="deepseek"):
    """Use an LLM to filter and rank the most relevant chunks for a question

    Args:
        question (str): The user's question
        chunks (list): List of (chunk, score) tuples
        api_key (str): API key for the LLM
        llm_type (str): Which LLM to use (only "deepseek" is currently supported)

    Returns:
        str: Filtered and reordered context
    """
    # Combine all chunks into a single string with separators and identifiers
    all_chunks = ""
    for i, (chunk, _) in enumerate(chunks):
        all_chunks += f"Chunk {i+1}:\n{chunk}\n\n"

    # Create a prompt for the LLM to filter and rank chunks
    filter_prompt = f"""I have a question and several document chunks that might contain relevant information.

    Question: {question}

    Document chunks:
    {all_chunks}

    Please analyze these chunks and identify which ones contain information relevant to answering the question.
    1. First, identify the chunks that are most relevant to the question.
    2. Then, reorder these chunks from most to least relevant.
    3. Finally, return ONLY the relevant chunks in your preferred order, separated by '---'.
    4. If a chunk is not relevant to the question, do not include it.
    5. Do not add any explanations or commentary - just return the filtered and reordered chunks.
    """

    # Call Deepseek to filter chunks
    if llm_type.lower() == "deepseek":
        filtered_content = generate_response_deepseek(question, filter_prompt, api_key)
    else:
        # If not using Deepseek, return original context
        return "\n\n".join([chunk for chunk, _ in chunks])

    # Return the filtered content without verbose output
    return filtered_content

# Add a lock and debounce mechanism to prevent multiple calls
import threading
import time

# Create a lock to ensure only one thread can execute the function at a time
rag_lock = threading.Lock()

# Store the last execution time and result
last_execution = {
    "time": 0,
    "question": "",
    "model": "",
    "result": ""
}

def rag_answer(question, model="simple", top_k=4, use_llm_filter=True):
    """Complete RAG pipeline: retrieval + generation with debounce protection

    Args:
        question (str): The user's question
        model (str): Which model to use - "deepseek" or "simple" for extractive approach
        top_k (int): Number of chunks to retrieve
        use_llm_filter (bool): Whether to use LLM to filter and reorder chunks (only applies to deepseek)

    Returns:
        str: The generated answer
    """
    global last_execution

    # Check if this is a duplicate call (same question and model within 2 seconds)
    current_time = time.time()
    if (current_time - last_execution["time"] < 2 and
            question == last_execution["question"] and
            model == last_execution["model"]):
        print("Debounced duplicate call detected. Returning cached result.")
        return last_execution["result"]

    # Try to acquire the lock, but don't block if it's already locked
    if not rag_lock.acquire(blocking=False):
        print("Another query is already being processed. Please wait.")
        return "Another query is already being processed. Please try again in a moment."

    try:
        # Step 1: Retrieve relevant chunks - with reduced verbosity
        # Temporarily redirect print output during get_relevant_chunks call
        import sys
        import io
        original_stdout = sys.stdout
        sys.stdout = io.StringIO()  # Redirect stdout to capture and discard verbose output

        try:
            # Get relevant chunks silently
            relevant_chunks = get_relevant_chunks(question, top_k=top_k)
        finally:
            # Restore stdout even if retrieval raises an exception
            sys.stdout = original_stdout

        # Step 2: Generate answer based on selected model
        if model.lower() == "deepseek":
            api_key = get_deepseek_api_key()
            if api_key:
                # Use LLM to filter chunks if requested
                if use_llm_filter:
                    context = filter_chunks_with_llm(question, relevant_chunks, api_key, "deepseek")
                else:
                    context = "\n\n".join([chunk for chunk, _ in relevant_chunks])

                answer = generate_response_deepseek(question, context, api_key)
            else:
                context = "\n\n".join([chunk for chunk, _ in relevant_chunks])
                answer = simple_extractive_answer(question, relevant_chunks)

        else:  # Use simple extractive approach
            # Temporarily redirect print output during simple_extractive_answer call
            sys.stdout = io.StringIO()  # Redirect stdout again

            try:
                context = "\n\n".join([chunk for chunk, _ in relevant_chunks])
                answer = simple_extractive_answer(question, relevant_chunks)
            finally:
                # Restore stdout even if answer extraction raises an exception
                sys.stdout = original_stdout

        # Update the last execution cache
        last_execution = {
            "time": current_time,
            "question": question,
            "model": model,
            "result": answer
        }

        return answer
    finally:
        # Always release the lock
        rag_lock.release()
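
The debounce-and-lock pattern used in `rag_answer` can be tried on its own. The sketch below is a minimal illustration under stated assumptions: `Debouncer` and `slow_square` are hypothetical names, and the clock is injectable only so the behaviour can be checked without actually waiting.

```python
import threading
import time

class Debouncer:
    """Cache the last result and skip repeated calls made within `window` seconds."""

    def __init__(self, func, window=2.0, clock=time.monotonic):
        self.func = func
        self.window = window
        self.clock = clock
        self.lock = threading.Lock()
        self.last_args = None
        self.last_time = None
        self.last_result = None

    def __call__(self, *args):
        now = self.clock()
        if (self.last_time is not None and args == self.last_args
                and now - self.last_time < self.window):
            return self.last_result        # Duplicate call: reuse the cached result
        if not self.lock.acquire(blocking=False):
            return None                    # Another call is still running
        try:
            result = self.func(*args)
            self.last_args, self.last_time, self.last_result = args, now, result
            return result
        finally:
            self.lock.release()            # Released even if func raises

# Usage with a fake clock (assumption) so no real waiting is needed
calls = []

def slow_square(x):
    calls.append(x)
    return x * x

fake_now = [0.0]
debounced = Debouncer(slow_square, window=2.0, clock=lambda: fake_now[0])

first = debounced(3)      # Runs slow_square
fake_now[0] = 1.0
second = debounced(3)     # Same argument within 2 seconds: served from the cache
fake_now[0] = 5.0
third = debounced(3)      # Window expired: runs slow_square again
print(first, second, third, calls)
```

As in `rag_answer`, the cache check happens before the lock is taken, so the pattern guards against accidental double submissions rather than providing strict thread safety.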

# Testing Our RAG System

Let's test our RAG system with a few questions about the document. This will demonstrate how the system retrieves relevant information and uses it to generate accurate answers.

By default, `rag_answer` uses the simple extractive approach (`model="simple"`), which requires no authentication.
If you've added your DeepSeek API key, you can set `model="deepseek"` to use the LLM instead.

# Testing Our RAG System with LLM Filtering

Now we'll test our enhanced RAG system that uses the LLM itself to filter and reorder chunks based on relevance.
This "self-filtering" approach lets the LLM determine which information is most important for answering the question.
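
Because the filter prompt asks the model to return the relevant chunks separated by `---`, the reply can be split back into a list before it is used. The following is a minimal sketch, assuming the model follows that format; `parse_filtered_chunks` is a hypothetical helper that is not part of the notebook, and it falls back to the original chunks when the reply contains nothing usable.

```python
def parse_filtered_chunks(llm_reply, original_chunks):
    """Split an LLM reply on the '---' separator; fall back to the original chunks if it is empty."""
    parts = [part.strip() for part in llm_reply.split("---")]
    parts = [part for part in parts if part]   # Drop empty pieces, e.g. after a trailing separator
    return parts if parts else list(original_chunks)

# Example reply in the format requested by the filter prompt (illustrative text)
reply = (
    "Shenzhen, China: Main production facility\n"
    "---\n"
    "Penang, Malaysia: Secondary production facility\n"
    "---\n"
)
filtered = parse_filtered_chunks(reply, ["fallback chunk"])
empty = parse_filtered_chunks("   ", ["fallback chunk"])
print(filtered)
print(empty)
```

The fallback matters because an LLM may ignore formatting instructions or return an empty reply, in which case the original retrieval results are still better than no context.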

In [12]:
# Test question 1 with Deepseek and LLM filtering enabled
question1 = "What manufacturing facilities does NovaTech have?"
# Requires a Deepseek API key; without one, rag_answer falls back to the simple extractive approach
rag_answer(question1, model="deepseek", use_llm_filter=True)

'Manufacturing Facilities  Company Profile  NovaTech Electronics is a leading manufacturer of consumer electronics based in Austin, Texas. 1 Primary Production Plants - Austin, Texas, USA: Headquarters and R&D center with limited production of high-end prototypes and specialty products - Shenzhen, China: Main production facility handling 65% of total manufacturing volume, specializing in smartphones and tablets - Penang, Malaysia: Secondary production facility handling 25% of manufacturing volume, focusing on wearables and smart home devices - Guadalajara, Mexico: Newest facility (opened 2023) handling 10% of production, primarily for North American market products  ## 2. 2 billion and 4,500 employees worldwide, NovaTech has established itself as an innovative mid-market player competing with larger corporations through agility and cutting-edge product design. # NovaTech Electronics: Supply Chain Overview 2025.'

In [13]:
# Test question 1 with the simple extractive approach (no LLM)
# Run this cell to compare with the result above
rag_answer(question1, model="simple", use_llm_filter=False)  # Retrieval and extraction only, no LLM call

'Manufacturing Facilities  Company Profile  NovaTech Electronics is a leading manufacturer of consumer electronics based in Austin, Texas. 1 Primary Production Plants - Austin, Texas, USA: Headquarters and R&D center with limited production of high-end prototypes and specialty products - Shenzhen, China: Main production facility handling 65% of total manufacturing volume, specializing in smartphones and tablets - Penang, Malaysia: Secondary production facility handling 25% of manufacturing volume, focusing on wearables and smart home devices - Guadalajara, Mexico: Newest facility (opened 2023) handling 10% of production, primarily for North American market products  ## 2. 2 billion and 4,500 employees worldwide, NovaTech has established itself as an innovative mid-market player competing with larger corporations through agility and cutting-edge product design. # NovaTech Electronics: Supply Chain Overview 2025.'

# Interactive Question Answering Interface

Let's create an interactive interface where users can ask questions about NovaTech Electronics' supply chain and get answers from our RAG system.

In [14]:
# Install ipywidgets if not already installed
# Comment out the line below if ipywidgets is already available
!pip install ipywidgets


In [15]:
def create_rag_interface():
    """Create and display an interactive RAG interface"""
    from IPython.display import display, HTML, clear_output
    import ipywidgets as widgets

    # Create a unique ID for this instance of the interface
    import uuid
    interface_id = str(uuid.uuid4())

    # Create the text input widget
    question_input = widgets.Text(
        value='',
        placeholder='Ask a question about NovaTech Electronics supply chain...',
        description='Question:',
        disabled=False,
        layout=widgets.Layout(width='80%')
    )

    # Create the submit button
    submit_button = widgets.Button(
        description='Get Answer',
        disabled=False,
        button_style='primary',
        tooltip='Click to get answer',
        icon='search'
    )

    # Create output area for displaying the answer
    output_area = widgets.Output()

    # Create model selection dropdown
    model_dropdown = widgets.Dropdown(
        options=['deepseek', 'simple'],
        value='deepseek',  # Set default to deepseek since we have the API key
        description='Model:',
        disabled=False,
    )

    # Create LLM filtering checkbox
    llm_filter_checkbox = widgets.Checkbox(
        value=True,  # Enable by default since we're using Deepseek
        description='Use LLM for filtering',
        disabled=False,
        indent=False
    )

    # Track if a query is in progress
    query_in_progress = False

    # Function to handle button click
    def on_button_click(b):
        nonlocal query_in_progress

        # Prevent multiple simultaneous executions
        if query_in_progress:
            return

        # Get the question
        question = question_input.value
        if question.strip() == '':
            with output_area:
                clear_output()
                display(HTML("<div style='color:red'>Please enter a question.</div>"))
            return

        try:
            # Set flag to prevent multiple executions
            query_in_progress = True

            # Disable the button during processing
            submit_button.disabled = True
            submit_button.description = "Processing..."

            # Escape user and model text so it is shown literally instead of being parsed as HTML
            import html
            safe_question = html.escape(question)

            # Clear previous output and show loading message
            with output_area:
                clear_output()
                display(HTML(f"<div><h3>Question:</h3>{safe_question}</div><div>Processing...</div>"))

            # Get selected model
            selected_model = model_dropdown.value
            use_filter = llm_filter_checkbox.value

            # Only enable LLM filtering if using deepseek model
            if selected_model == 'simple':
                use_filter = False

            # Get answer using our RAG system - with all print statements suppressed
            import sys, io
            original_stdout = sys.stdout
            sys.stdout = io.StringIO()  # Redirect all stdout

            try:
                answer = rag_answer(question, model=selected_model, use_llm_filter=use_filter)
            finally:
                sys.stdout = original_stdout  # Restore stdout

            # Display the final answer with clean formatting
            with output_area:
                clear_output()
                display(HTML(f"""
                <div style='margin-bottom:10px;'>
                    <h3>Question:</h3>
                    <p>{safe_question}</p>
                </div>
                <div style='background-color:#f0f0f0; padding:10px; border-radius:5px;'>
                    <h3>Answer:</h3>
                    <p>{html.escape(str(answer))}</p>
                </div>
                """))
        finally:
            # Always reset the flag and re-enable the button
            query_in_progress = False
            submit_button.disabled = False
            submit_button.description = "Get Answer"

    # Connect the button click event to the handler function
    submit_button.on_click(on_button_click)

    # Create the interface
    interface = widgets.VBox([
        widgets.HBox([question_input, submit_button]),
        widgets.HBox([model_dropdown, llm_filter_checkbox]),
        output_area
    ])


    # Display the interface
    display(interface)

    # Display example questions
    display(HTML("<h4>Example questions to try:</h4>"))
    display(HTML("<ul>"
                 "<li>Who are NovaTech's main component suppliers?</li>"
                 "<li>What are the current supply chain bottlenecks?</li>"
                 "<li>Who is the exclusive supplier of screens?</li>"
                 "</ul>"))


# Create and display the interface
# This function will only be called once when this cell is executed
create_rag_interface()

# Advanced: Improving Our RAG System

There are several ways we could improve our simple RAG system:

1. **Better Embeddings**: Use more powerful embedding models like OpenAI's text-embedding-ada-002 or newer Hugging Face models

2. **Improved Retrieval**: Implement techniques like re-ranking or hybrid search to get more relevant chunks

3. **More Powerful LLM**: Use more capable models (though these may require API keys)

4. **Metadata Filtering**: Add metadata to chunks (like source, date, author) and filter based on this information

5. **Evaluation**: Add metrics to evaluate the quality of answers and improve the system
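
As one illustration of the ideas above, metadata filtering (item 4) can be sketched as follows. The `Chunk` dataclass, its field names, and the sample values are assumptions made for this example; the notebook's own chunks are plain `(chunk, score)` tuples without metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    score: float
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required key/value pair, best score first."""
    kept = [c for c in chunks
            if all(c.metadata.get(key) == value for key, value in required.items())]
    return sorted(kept, key=lambda c: c.score, reverse=True)

# Illustrative chunks with made-up scores and metadata (assumption)
chunks = [
    Chunk("Primary production plants ...", 0.71, {"section": "manufacturing", "year": 2025}),
    Chunk("Component suppliers ...", 0.64, {"section": "suppliers", "year": 2025}),
    Chunk("Older facility list ...", 0.80, {"section": "manufacturing", "year": 2023}),
]

current = filter_by_metadata(chunks, section="manufacturing", year=2025)
print([c.text for c in current])
```

Filtering on metadata before or after the similarity search narrows the candidates to the right source or time period, which similarity scores alone cannot guarantee.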

If you want to explore other free LLM options, here are some alternatives:

## 1. Local Models with Ollama

[Ollama](https://ollama.ai/) lets you run open-source LLMs locally on your computer:

```python
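# Install the Ollama Python client; the Ollama app must also be installed and running,
# and the model must be downloaded first (e.g. `ollama pull llama2`)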
!pip install ollama

import ollama

def generate_local_response(query, context):
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
    response = ollama.generate(
        model="llama2",
        prompt=prompt,
        options={"temperature": 0.3}
    )
    return response['response']
```

## 2. Other Free Hugging Face Models

There are many other free models on Hugging Face you can try:

```python
# For more capable models:
API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1"
```

## 3. Using Transformers Library Directly

For complete control, you can use the transformers library:

```python
!pip install transformers torch

from transformers import pipeline

# Create a text generation pipeline
generator = pipeline('text-generation', model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')

def generate_response_with_transformers(query, context):
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
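    # max_new_tokens limits only the generated text; max_length would also count the prompt tokens
    # return_full_text=False returns just the answer instead of prompt + answer
    response = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.3,
                         return_full_text=False)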
    return response[0]['generated_text']
```

# Conclusion

Congratulations! You've built a simple but functional RAG system that can:

1. Process a document into searchable chunks
2. Find relevant information based on questions
3. Generate answers using a free LLM
4. Provide more accurate responses by using retrieved context

This approach can be extended to work with much larger document collections, databases, or even websites. The key advantage of RAG is that it allows LLMs to access specific information they weren't trained on, making them more accurate for specialized tasks.