### The image below shows the difference between an LLM and a RALM (retrieval-augmented language model).

![image](./1722417000990.jpeg)

### Sample RAG usage

```bash
uv add scikit-learn transformers numpy torch
```

In [1]:
import torch

In [2]:
print(torch.__version__)

2.6.0+cu124


In [3]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [6]:
# Step 1: Set up a small document collection
documents = [
    "Retrieval-Augmented Generation (RAG) combines retrieval and generation for improved AI responses.",
    "RAG models retrieve relevant information before generating answers to questions.",
    "The retrieval component helps RAG models access knowledge beyond their training data.",
    "Unlike standard language models, RAG can cite sources for the information it provides.",
    "RAG architecture helps reduce hallucinations by grounding responses in retrieved documents."
]

In [7]:

# Step 2: Create a simple retrieval system using TF-IDF
vectorizer = TfidfVectorizer()
document_vectors = vectorizer.fit_transform(documents)

def retrieve_docs(query, top_k=2):
    # Convert query to vector
    query_vector = vectorizer.transform([query])
    
    # Calculate similarity scores
    similarities = cosine_similarity(query_vector, document_vectors)[0]
    
    # Get indices of top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Return selected documents and their scores
    return [(documents[i], similarities[i]) for i in top_indices]
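
To make the retrieval step less of a black box, here is a minimal pure-Python sketch of what TF-IDF weighting plus cosine similarity roughly computes. It is not scikit-learn's implementation: the helper names (`tokenize`, `tfidf_vectors`, `cosine`) and the toy documents are made up for illustration, and the smoothed IDF formula and two-character token rule are my assumptions about the `TfidfVectorizer` defaults.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to match the sklearn default)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        # Terms outside the fitted vocabulary are ignored
        counts = Counter(t for t in tokens if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so that a dot product equals the cosine similarity
        return {t: w / norm for t, w in weights.items()} if norm else {}

    doc_vectors = [vectorize(tokens) for tokens in tokenized]
    return doc_vectors, lambda text: vectorize(tokenize(text))

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical toy corpus, not the documents used above
toy_docs = ["cats chase mice", "dogs chase cats", "birds eat seeds"]
doc_vecs, embed_query = tfidf_vectors(toy_docs)
query_vec = embed_query("cats and mice")
scores = [cosine(query_vec, v) for v in doc_vecs]
# Rank document indices from most to least similar
ranked = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranked, [round(s, 2) for s in scores])
```

The first toy document shares both "cats" and "mice" with the query, so it ranks first; the third shares no terms and scores zero.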

In [8]:
# Step 3: Set up a simple generative model (pre-trained GPT-2)
# Note: GPT-2 is small and used here only for demonstration; a real implementation would use a more capable LLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")



In [9]:
# First, set the pad token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

def generate_response(context, query):
    # Format the prompt
    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

    # Tokenize the prompt; the tokenizer also returns the attention mask
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,  # No effect for a single prompt; matters when batching
        truncation=True,
        max_length=512  # Truncate prompts longer than 512 tokens
    )

    # Generate a continuation of the prompt
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # Pass the attention mask explicitly
        pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token, so reuse EOS
        max_length=150,  # Total length in tokens: prompt plus generated text
        num_return_sequences=1,
        # temperature=0.7  # Only takes effect together with do_sample=True
    )

    # Decode the full sequence; output[0] holds the prompt tokens followed by
    # the continuation, so the returned text starts with the prompt
    return tokenizer.decode(output[0], skip_special_tokens=True)
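
Since the decoded text contains the prompt and a small model like GPT-2 tends to keep writing further "Question:" / "Answer:" turns, a bit of post-processing can help. Below is a minimal sketch of a hypothetical helper, `extract_answer`, that keeps only the first answer. It assumes the prompt format used above and that the context itself does not contain the text "Answer:". Another option would be to slice the prompt tokens off before decoding, for example `output[0][inputs.input_ids.shape[1]:]`.

```python
def extract_answer(generated_text, answer_marker="Answer:", stop_marker="Question:"):
    """Return the text between the first answer marker and the next question."""
    # Everything before the first "Answer:" is the echoed prompt
    _, found, continuation = generated_text.partition(answer_marker)
    if not found:
        # Marker missing: return the text unchanged rather than guess
        return generated_text.strip()
    # Cut where the model starts repeating a new "Question:" turn
    answer, _, _ = continuation.partition(stop_marker)
    return answer.strip()

# Hypothetical generated text in the same shape as the prompt format above
prompt = "Context: some context\n\nQuestion: What is RAG?\n\nAnswer:"
generated = (
    prompt
    + " RAG combines retrieval and generation."
    + "\n\nQuestion: What is RAG?\n\nAnswer: RAG combines"
)
print(extract_answer(generated))  # prints "RAG combines retrieval and generation."
```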

In [10]:
# Step 4: RAG pipeline that combines retrieval and generation
def rag_pipeline(query):
    # Retrieve relevant documents
    retrieved_docs = retrieve_docs(query)
    
    # Format retrieved documents as context
    context = " ".join([doc for doc, score in retrieved_docs])
    
    # Generate response based on retrieved context
    response = generate_response(context, query)
    
    # Format final output with citations
    final_response = f"{response}\n\nSources:\n"
    for i, (doc, score) in enumerate(retrieved_docs):
        final_response += f"[{i+1}] {doc} (Score: {score:.2f})\n"
    
    return final_response
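
One thing to keep in mind: `retrieve_docs` always returns `top_k` documents, so a query that shares no terms with the collection still gets a context (with similarity 0). A minimal sketch of filtering weak matches before building the context could look like this. The helper names and the `0.1` threshold are my own assumptions, and the sample scores are invented for illustration.

```python
def filter_by_score(retrieved, min_score=0.1):
    """Keep only (document, score) pairs whose score reaches min_score."""
    return [(doc, score) for doc, score in retrieved if score >= min_score]

def build_context(retrieved):
    """Join the remaining documents into a single context string."""
    return " ".join(doc for doc, _ in retrieved)

# Hypothetical retrieval result: one relevant hit and one zero-score hit
sample = [
    ("RAG grounds answers in retrieved documents.", 0.48),
    ("An unrelated sentence about cooking.", 0.0),
]
kept = filter_by_score(sample)
context = build_context(kept)
print(context)
```

If nothing passes the threshold, the context is an empty string, and the pipeline could then answer without context or report that no source was found.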

In [11]:
# Example usage
query = "How does RAG reduce hallucinations?"

In [12]:
result = rag_pipeline(query)



In [13]:
print(result)

Context: RAG architecture helps reduce hallucinations by grounding responses in retrieved documents. RAG models retrieve relevant information before generating answers to questions.

Question: How does RAG reduce hallucinations?

Answer: RAG reduces hallucinations by grounding responses in retrieved documents.

Question: How does RAG reduce hallucinations?

Answer: RAG reduces hallucinations by grounding responses in retrieved documents.

Question: How does RAG reduce hallucinations?

Answer: RAG reduces hallucinations by grounding responses in retrieved documents.

Question: How does RAG reduce hallucinations?

Answer: RAG reduces hallucinations by grounding responses in retrieved documents.

Question: How does RAG reduce hallucinations?

Answer: RAG reduces hallucinations

Sources:
[1] RAG architecture helps reduce hallucinations by grounding responses in retrieved documents. (Score: 0.48)
[2] RAG models retrieve relevant information before generating answers to questions. (Score: 0.