<a href="https://colab.research.google.com/github/natalykur/rag_tutorial/blob/main/Basic_rag_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cell installs the necessary Python libraries used in the notebook:
- `openai`: To interact with OpenAI models.
- `pymupdf` (imported as `fitz`): For **loading** and processing PDF files.
- `faiss-cpu`: A library for fast similarity search and clustering of dense vectors (used to store **embeddings**).
- `scikit-learn`: Provides `TfidfVectorizer`, which turns text into the vectors kept in the **vector store**.


**RAG - general problem**

Language models often give outdated or incorrect answers because they can’t access external or real-time information. RAG solves this by retrieving relevant context from trusted sources, making responses more accurate and reliable.



**RAG - let's understand the problem**

Adding the document directly to the prompt context isn't enough because:

1. **Token Limits** – Language models have a maximum input size. Long documents often exceed this limit, forcing truncation and loss of important information.

2. **Noisy or Irrelevant Content** – Dumping an entire document may include irrelevant text, distracting the model and lowering answer quality.

3. **Lack of Targeted Retrieval** – The model doesn’t "know" what parts are most relevant to the query, so it can't focus its reasoning effectively.

4. **Inefficient Scaling** – As the document base grows, this approach becomes slower, more expensive, and harder to manage.

**RAG** solves this by retrieving only the most relevant snippets per query, optimizing both accuracy and efficiency.


**What we will learn:**
- PDF parsing and text extraction.
- Generating vector representations (embeddings) of text.
- Storing and searching these embeddings using a vector database.
- Integrating these components to build a simple Q&A system.


**What we will use, and the parts of the pipeline:**
- **Libraries:** `openai` (for the Q&A model), `pymupdf` (for PDF loading and processing), `faiss-cpu` (for the vector store), `scikit-learn` (for TF-IDF vectorization), and later `sentence-transformers` (for semantic embeddings).
- **Parts:**
    - **PDF Loader:** Using `pymupdf` to read the content of a PDF file.
    - **Text Processor:** Breaking down the PDF content into smaller chunks suitable for embedding. (This notebook embeds whole documents; chunking is the usual next step.)
    - **Embeddings Generator:** Using `TfidfVectorizer` first, then a `sentence-transformers` model, to turn text into vectors.
    - **Vector Store:** Using `faiss-cpu` to store the vectors and perform fast similarity searches to find relevant text based on a query.
    - **Question Answering Logic:** Built on top of the retrieved text, using an OpenAI model to generate an answer from the context.
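
The Text Processor part is described above but never implemented in this notebook, which embeds each PDF as one document. As a rough sketch of what chunking could look like, the hypothetical helper `chunk_text` below splits a string into overlapping character windows. The `chunk_size` and `overlap` values are illustrative assumptions, not tuned settings.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows.

    chunk_size and overlap are illustrative values, not tuned settings.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, overlap=2)
print(len(pieces), [len(p) for p in pieces])
```

Each chunk could then be vectorized and indexed in place of the whole document, so that retrieval returns a short passage rather than an entire PDF.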

In [4]:
# Install PyMuPDF under its PyPI name `pymupdf`; the unrelated PyPI package
# called `fitz` makes `import fitz` fail with "No module named 'tools'".
!pip install -q openai pymupdf faiss-cpu scikit-learn

In [5]:
from google.colab import files
import fitz  # PyMuPDF

This cell allows you to upload your local PDF files into the Colab environment.
After uploading, they can be processed and embedded for search and retrieval.


Extracting text from PDFs


In [None]:
# Upload PDF files to Google Colab
uploaded = files.upload()

This cell defines a helper function `extract_text_from_pdf` that loads a PDF file and extracts all the text into a single string.

- `fitz.open(pdf_path)`: Opens the PDF file.
- `for page_num in range(len(doc))`: Iterates over all pages.
- `doc.load_page(page_num)`: Loads each page by index.
- `page.get_text()`: Extracts the text content from the page.
- The result is returned as a single concatenated string.


In [None]:
# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text


In [None]:
uploaded.keys()



In [None]:
# Extract text from all uploaded PDF files
pdf_texts = {}
for pdf_file in uploaded.keys():
    if pdf_file.endswith(".pdf"):
        pdf_texts[pdf_file] = extract_text_from_pdf(pdf_file)


This cell processes each uploaded PDF file and stores its text content in a dictionary:

- `uploaded.keys()` contains the names of the uploaded files.
- For each `.pdf` file, it calls `extract_text_from_pdf()` and stores the result in `pdf_texts`, keyed by filename.


In [None]:
# Display extracted text from each PDF file
for pdf_file, text in pdf_texts.items():
    print(f"--- {pdf_file} ---")
    print(text[:500])  # Display the first 500 characters of each document
    print("\n")



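The next cell imports the tools for vectorization and similarity search:

- `TfidfVectorizer` from `sklearn` will be used to convert text into numerical vectors (used here in place of learned embeddings).
- `faiss` is a fast similarity search library used for indexing and querying these vectors.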


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import faiss


`TfidfVectorizer` is a tool from `sklearn.feature_extraction.text` that converts a collection of raw documents (like strings of text) into a matrix of TF-IDF features.

TF-IDF stands for Term Frequency–Inverse Document Frequency:
- **Term Frequency (TF)** measures how frequently a word appears in a document.
- **Inverse Document Frequency (IDF)** reduces the weight of common words and increases the importance of rare words across all documents.

This combination highlights the most relevant terms in each document, allowing for effective comparison and retrieval.

By default, `TfidfVectorizer`:
- Converts all text to lowercase
- Tokenizes on word boundaries, keeping tokens of two or more alphanumeric characters
- Keeps stop words unless `stop_words` is set (for example `stop_words="english"`)
- Outputs a sparse matrix with one row per document and one column per term

In this notebook, it is used to transform the PDF text into numerical vectors that capture the importance of each term for later similarity search.
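
To make the weighting concrete, here is a small pure-Python sketch of the same idea. It uses raw term counts, the smoothed IDF formula documented for scikit-learn (`ln((1 + n) / (1 + df)) + 1`), and L2 normalisation of each row. The helper name `tfidf_vectors` and the two sample sentences are made up for illustration, and the whitespace tokenizer is a simplification of what `TfidfVectorizer` actually does.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using raw term counts and smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])  # L2-normalise each row
    return vocab, vectors


docs = ["refund for cancelled flight", "baggage allowance for flight"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in only one document ("refund", "cancelled") end up with a larger weight than terms shared by both ("for", "flight"), which is what lets the vectors tell the documents apart.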


In [None]:
# Convert text documents to TF-IDF vectors
documents = list(pdf_texts.values())
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents).toarray()


The cell above converts each document into a numerical vector that represents the importance of each term in the document relative to the corpus. The next cell builds the FAISS index:

- `dimension = doc_vectors.shape[1]`: Gets the dimensionality of the document vectors (number of features).
- `faiss.IndexFlatL2(dimension)`: Initializes a flat index that compares a query against every stored vector using L2 (Euclidean) distance. Results are exact, but search time grows linearly with the number of vectors.
- `index.add(doc_vectors)`: Adds all the document vectors to the FAISS index so they can be searched.

After running this cell, the index is ready to perform similarity search with query vectors.
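
As a mental model of what a flat L2 index does at query time, the sketch below runs the same exhaustive comparison in plain Python. The function name `flat_l2_search` and the toy vectors are assumptions for illustration. It reports squared distances, which is also how FAISS reports results for `IndexFlatL2` as far as its documentation describes.

```python
def flat_l2_search(vectors, query, top_k=1):
    """Exhaustively compare query with every stored vector.

    Returns (squared_distance, index) pairs, nearest first.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    return scored[:top_k]


stored = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
print(flat_l2_search(stored, [0.5, 0.9], top_k=2))
```

FAISS performs the same comparison with optimized native code, which is why a flat index stays practical for the handful of documents used here.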

In [None]:
# Create a FAISS index
doc_vectors = doc_vectors.astype("float32")  # FAISS stores vectors as float32
dimension = doc_vectors.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_vectors)

In [None]:
# Print the size of the index: number of stored vectors and their dimensionality

print(f"The number of documents in the database is: {index.ntotal}")
print(f"The dimensionality of the vectors in the database is: {index.d}")


The "database" here is not a traditional relational or NoSQL database. It is a **FAISS index** that stores the **TF-IDF vector representations** of the text content extracted from the uploaded PDF files.

Here's a breakdown of what the "database" (the FAISS index) contains:

1.  **Document Vectors:** It stores the numerical vectors generated by the `TfidfVectorizer` for each of the uploaded PDF documents. Each vector is a high-dimensional representation of the document, where the values represent the TF-IDF scores of the terms in that document.
2.  **Index Structure:** The FAISS index (`faiss.IndexFlatL2` in this case) finds the "nearest neighbors" (most similar documents) to a given query vector by comparing it with every stored vector under the L2 (Euclidean) distance metric.
3.  **No Original Text or Metadata:** The FAISS index itself **does not store the original text content** of the PDFs, nor does it store any metadata about the documents (like filenames). It only stores the numerical vectors. When you find similar vectors in the index, you would need to map back to the original documents (using the `pdf_texts` dictionary and the order in `documents` and `doc_vectors`) to retrieve the actual text or filenames.

In essence, the "database" is a **vector store** specifically designed for fast similarity searching, not for storing and querying structured or unstructured data in the conventional sense.
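
Because the index only returns row positions, the mapping back to documents has to be kept outside FAISS. A minimal sketch, assuming rows were added in the same order as the keys of `pdf_texts` (the helper `attach_filenames` and the sample filenames are hypothetical):

```python
def attach_filenames(pdf_texts, indices, distances):
    """Map FAISS row positions back to (filename, distance) pairs.

    Relies on dicts preserving insertion order (Python 3.7+), so the i-th
    key of pdf_texts matches the i-th row that was added to the index.
    """
    filenames = list(pdf_texts.keys())
    hits = []
    for idx, dist in zip(indices, distances):
        if idx < 0:  # FAISS pads missing results with -1
            continue
        hits.append((filenames[idx], float(dist)))
    return hits


# Stand-in values shaped like indices[0] and distances[0] from index.search
pdf_texts = {"refunds.pdf": "...", "baggage.pdf": "..."}
print(attach_filenames(pdf_texts, indices=[1, 0, -1], distances=[0.2, 0.9, 3.4e38]))
```

Returning filenames alongside the text makes it possible to cite which PDF an answer came from.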

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. We create an index with the `IndexFlatL2` class, which builds a flat (non-hierarchical) index based on L2 (Euclidean) distance, and then add our document vectors to it with the `add` method.

# **Searching the index**

This function performs a similarity search over the vectorized documents using FAISS.

- `query`: A user input string (e.g. a question or topic).
- `vectorizer.transform([query])`: Converts the query string into a TF-IDF vector, using the same vectorizer as the documents.
- `index.search(query_vector, top_k)`: Searches the FAISS index to find the `top_k` most similar document vectors, returning both:
  - `distances`: L2 distances to the query (lower distance = higher similarity).
  - `indices`: Indices of the matching documents in the original list.
- `results`: A list of tuples `(document_text, distance_score)` for each result.

The function returns the top matching documents with their similarity scores, enabling context retrieval based on user input.


In [None]:
def search_documents(query, top_k=1):
    query_vector = vectorizer.transform([query]).toarray().astype("float32")  # FAISS expects float32
    # Ensure top_k does not exceed the number of documents
    actual_top_k = min(top_k, len(documents))
    distances, indices = index.search(query_vector, actual_top_k)
    results = [(documents[i], distances[0][j]) for j, i in enumerate(indices[0])]
    return results

This cell runs a test query through the search system and prints out the top matching document texts.

- `query = "..."`: The question or topic you're searching for.
- `search_documents(query)`: Calls the previously defined function to find the most relevant documents.
- `for result in search_results`: Loops through the returned list and prints each matching document along with its distance score.

This demonstrates the full retrieval flow: from user input → to TF-IDF vector → to FAISS similarity search → to matching text output.


In [None]:

# Example query
query = "can I cancel my flight and get refund in elal"
search_results = search_documents(query)
for result in search_results:
    print(result,end="")

# So far we have only searched the documents, with no LLM involved. Let's start calling OpenAI


✅ You extracted text from a PDF
✅ You converted it to TF-IDF
✅ You loaded it into FAISS
✅ You built a search function




Cell: Configure the OpenAI client for using LLMs
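
The configuration cell itself does not appear in this notebook, yet the cells below rely on two names, `client` and `llm_model`. The following is one possible setup, not the author's original cell: the helper `make_openai_client` is hypothetical, and reading the key from the `OPENAI_API_KEY` environment variable is an assumption about how the key is supplied.

```python
import os


def make_openai_client(api_key=None):
    """Build an OpenAI client from an explicit key or the OPENAI_API_KEY variable."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("Set OPENAI_API_KEY or pass api_key explicitly")
    from openai import OpenAI  # imported lazily so a missing key fails first
    return OpenAI(api_key=key)


llm_model = "gpt-3.5-turbo"  # model name used by later cells in this notebook
```

Running `client = make_openai_client()` once before the cells below would define the `client` object they call.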

In [None]:
# This cell tests that the OpenAI connection and API key work


#response = client.responses.create(
#    model="gpt-3.5-turbo",
#    instructions="You are a coding assistant",
#    input="What is RAG?",
#)

#print(response.output_text)


This function sends a question and relevant context to OpenAI's GPT model and returns a generated answer.

- `query`: The user's question.
- `context`: Text retrieved from documents (e.g., via FAISS).
- `llm_model`: The model to use, such as `"gpt-3.5-turbo"`.
- `your_role`: A system prompt that defines the assistant's behavior (e.g., "You are a helpful customer support bot").

The prompt includes both the context and the question.  
It is sent to the model using `client.chat.completions.create(...)` in a chat format:
- `system` message defines the model's persona or instructions.
- `user` message provides the actual prompt with the question and context.

The response is returned as a string from `response.choices[0].message.content`.

In [None]:

# Use OpenAI's GPT model to generate a response
def generate_response_with_openai(query, context, llm_model, your_role):
    prompt = f"""
    Answer the following question based on the provided context.
    Context: {context}
    Question: {query}
    """
    response = client.chat.completions.create(
        model=llm_model,  # or another suitable model
        messages=[
            {"role": "system", "content": your_role},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content


This example shows how to generate a response from OpenAI's GPT model using only a plain question, without any retrieved context or system prompt.

- `context = ""`: No supporting document context is provided.
- `your_role = ""`: No specific system instruction or role is defined.
- `query`: A user-defined question (e.g., about airline refund policies).
- `generate_response_with_openai(...)`: The function sends the query to the model and prints the result.

This setup allows you to compare how the model performs with vs. without retrieved document context.



In [None]:

# Example 1: call OpenAI with no context and no system prompt
context = ""
your_role = ""
query = "can I cancel my flight and get refund in elal"
answer = generate_response_with_openai(query, context, llm_model, your_role)
print("\n--- Generated Answer ---")
print(answer)
print(len(answer))

In [None]:
# Example 2: change the system prompt to get a better answer
your_role = "you are a travel agent of elal"
answer = generate_response_with_openai(query, context,llm_model,your_role )
print("\n--- Generated Answer ---")
print(answer)
print(len(answer))

**You are a prompt expert!!**

In [None]:
# Let's add context

This example demonstrates prompt engineering by manually injecting document context into the GPT prompt.

- `context = documents[0]`: The full text of the first uploaded PDF is used as the context.
- `generate_response_with_openai(...)`: Sends the question along with this static context to the LLM.
- This approach doesn't include retrieval or vector search — it's a direct prompt composition.

Although this mimics the "context injection" idea behind RAG, it is not true RAG because there’s no dynamic retrieval step based on similarity to the query.

The cell also prints the length of the generated answer in characters, a rough proxy for verbosity.


In [None]:
# Example 3: let's add context. This is not RAG; it is prompt engineering
context = documents[0]
answer = generate_response_with_openai(query, context,llm_model,your_role )
print("\n--- Generated Answer ---")
print(answer)
print(len(answer))

In [2]:
print(context)


# Example 4: Full RAG-lite

This example performs a full RAG-style workflow:
1. `search_documents(query)`: Retrieves the top-k most similar documents using FAISS vector similarity (`top_k` defaults to 1 in the TF-IDF version).
2. `context = "\n".join([...])`: Concatenates the top results into a single text block as context.
3. `generate_response_with_openai(...)`: Sends the query and the retrieved context to OpenAI’s GPT model.
4. The answer is printed, along with lengths of context and output for inspection.

This is a complete, minimal RAG implementation — enabling the model to ground its response in actual documents rather than hallucinate or guess.
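
Step 2 joins every retrieved text in full, which can run into the token limits discussed at the start of the notebook once several long documents are retrieved. One way to guard against that is sketched below. The helper `build_context` and the `max_chars` budget are assumptions for illustration, using the rough 4-characters-per-token estimate that appears later in this notebook.

```python
def build_context(search_results, max_chars=8000):
    """Join retrieved texts, nearest first, without exceeding max_chars.

    search_results: list of (text, distance) tuples as returned by
    search_documents. max_chars is an illustrative budget (about 2000
    tokens at roughly 4 characters per token).
    """
    parts, used = [], 0
    for text, _distance in sorted(search_results, key=lambda r: r[1]):
        remaining = max_chars - used
        if remaining <= 0:
            break
        piece = text[:remaining]  # truncate the last piece to fit the budget
        parts.append(piece)
        used += len(piece) + 1  # +1 for the joining newline
    return "\n".join(parts)


demo = [("b" * 10, 0.9), ("a" * 10, 0.1)]
print(build_context(demo, max_chars=15))
```

Sorting by distance keeps the closest match intact and truncates only the less similar results.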


In [None]:
#get augmented context
search_results = search_documents(query)
context = "\n".join([result[0] for result in search_results])
print(len(context))
answer = generate_response_with_openai(query, context,llm_model,your_role )
print("\n--- Generated Answer ---")
print(answer)
print(len(answer))


# “You just built your first RAG pipeline. From scratch. Like a boss.”

## 🧪 RAG Evaluation Procedure

This notebook implements an evaluation framework for a basic Retrieval-Augmented Generation (RAG) pipeline.  
For each user query, we evaluate both system performance and answer quality using the following metrics:

### 🔍 Quality Metrics
- **Answer Relevance**: Does the generated answer directly address the question?
- **Faithfulness**: Is the answer grounded in the retrieved context (i.e., no hallucination)?
- **Context Recall**: Does the ground-truth answer appear in the retrieved context?
- **Context Precision**: What proportion of the retrieved context actually contains relevant information?
- **Retrieval Accuracy**: Does at least one retrieved chunk contain the ground-truth answer?
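
The metrics above can be approximated without an LLM. The sketch below uses token overlap instead of exact substring matching, so a reworded answer still receives partial credit. The helper names `tokens` and `overlap_recall` are hypothetical, the gold answer is the one used later in this notebook, and the context and answer strings are made-up samples.

```python
import re


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def overlap_recall(reference, candidate):
    """Fraction of distinct reference tokens that also occur in candidate."""
    ref = set(tokens(reference))
    if not ref:
        return 0.0
    return len(ref & set(tokens(candidate))) / len(ref)


gold = "El Al allows flight cancellations under certain conditions."
context = "Under certain conditions, El Al allows passengers to cancel a flight."
answer = "Yes, El Al allows flight cancellations in some cases."

context_recall = overlap_recall(gold, context)    # gold answer vs. retrieved context
answer_relevance = overlap_recall(gold, answer)   # gold answer vs. generated answer
faithfulness = overlap_recall(answer, context)    # generated answer vs. context
print(round(context_recall, 2), round(answer_relevance, 2), round(faithfulness, 2))
```

An exact substring test would score all three of these as 0, since none of the strings contains another verbatim.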


In [None]:
import pandas as pd
import time

# Initialize results DataFrame
results_df = pd.DataFrame()

# Cost assumptions
COST_PER_1K_TOKENS = 0.001  # for embeddings or 0.002 for gpt-3.5-turbo output

def evaluate_and_log(query, gold_answer, results_df, top_k=5, model_cost_per_1k_tokens=0.002):
    start_time = time.time()

    # Step 1: retrieve context (pass top_k through so it takes effect)
    search_results = search_documents(query, top_k=top_k)
    context_chunks = [result[0] for result in search_results]
    context = "\n".join(context_chunks)

    # Step 2: generate answer
    response = generate_response_with_openai(query, context, llm_model, your_role)
    end_time = time.time()

    # Step 3: simple heuristic metrics (can be replaced with GPT-based eval)
    # Despite its name, fuzzy_match is a case-insensitive exact-substring test,
    # so any rewording scores 0.
    def fuzzy_match(a, b):
        return int(a.lower().strip() in b.lower())

    answer_relevance = fuzzy_match(gold_answer, response)
    context_recall = fuzzy_match(gold_answer, context)
    faithfulness = fuzzy_match(response, context)

    # Optional metrics
    retrieval_accuracy = any(fuzzy_match(gold_answer, chunk) for chunk in context_chunks)
    # Share of retrieved chunks containing the gold answer (0 if nothing was retrieved)
    context_precision = (
        sum(fuzzy_match(gold_answer, chunk) for chunk in context_chunks) / len(context_chunks)
        if context_chunks else 0.0
    )

    # Token cost estimates (very rough)
    total_chars = len(context) + len(response)
    token_estimate = total_chars / 4  # approx. 4 chars/token
    estimated_cost = (token_estimate / 1000) * model_cost_per_1k_tokens

    # Logging
    row = {
        "query": query,
        "gold_answer": gold_answer,
        "generated_answer": response,
        "context": context,
        "latency": round(end_time - start_time, 2),
        "context_len": len(context),
        "answer_len": len(response),
        "answer_relevance": answer_relevance,
        "faithfulness": faithfulness,
        "context_recall": context_recall,
        "context_precision": round(context_precision, 2),
        "retrieval_accuracy": retrieval_accuracy,
        "token_cost_estimate": round(estimated_cost, 4)
    }

    results_df = pd.concat([results_df, pd.DataFrame([row])], ignore_index=True)
    return results_df


Let's run our experiments again.

In [None]:
# Define example query and gold answer
query = "can I cancel my flight and get refund in elal"
gold_answer = "El Al allows flight cancellations under certain conditions."


In [None]:
len(documents)

In [None]:

# Example 3 evaluation (prompt engineering, no RAG)
context_example3 = documents[0] # Use the full text of the first document as context
answer_example3 = generate_response_with_openai(query, context_example3, llm_model, your_role)

# Example 4 evaluation (Full RAG-lite)
search_results_example4 = search_documents(query)
context_example4 = "\n".join([result[0] for result in search_results_example4])
answer_example4 = generate_response_with_openai(query, context_example4, llm_model, your_role)

# Print comparison
print("--- Comparison of Example 3 vs Example 4 ---")
print("\nQuery:", query)

print("\n--- Example 3 (Prompt Engineering - Full Document Context) ---")
print("Context Length:", len(context_example3))
print("Generated Answer:", answer_example3)
print("Answer Length:", len(answer_example3))

print("\n--- Example 4 (Full RAG-lite - Retrieved Context) ---")
print("Context Length:", len(context_example4))
print("Generated Answer:", answer_example4)
print("Answer Length:", len(answer_example4))

# Summarize the comparison
print("\n--- Summary of Comparison ---")
print(f"Example 3 (Prompt Engineering): Used the full text of the first document ({len(context_example3)} characters) as context.")
print(f"Example 4 (RAG-lite): Used retrieved context ({len(context_example4)} characters), which is likely a subset of the total document(s) based on the query.")
print(f"Answer Length (Example 3): {len(answer_example3)}")
print(f"Answer Length (Example 4): {len(answer_example4)}")
print("The RAG-lite approach (Example 4) dynamically retrieves context based on the query, which is generally more efficient and effective than providing whole documents (Example 3), especially for large documents or multiple documents. This selective context provision helps the LLM focus on relevant information, potentially leading to more accurate and concise answers while using fewer tokens.")
print("To perform a more robust evaluation, you would use the `evaluate_and_log` function with ground truth answers and compare metrics like answer relevance, faithfulness, and context precision.")

In [None]:
# --------- Example 3: static context from several documents ---------
print("Running Example 3 (static context, no retrieval)")

# Prepare context manually (not using search_documents)
context_3 = documents[0]+documents[1]+documents[3]+documents[5] # Concatenate the full text of four documents (requires at least 6 uploaded PDFs)
your_role = ""  # Or customize if needed

# Measure time
start = time.time()
response_3 = generate_response_with_openai(query, context_3, llm_model, your_role)
end = time.time()

# Compute the same heuristic metrics inline, since context and response were produced manually
row_3 = {
    "query": query,
    "gold_answer": gold_answer,
    "generated_answer": response_3,
    "context": context_3,
    "latency": round(end - start, 2),
    "context_len": len(context_3),
    "answer_len": len(response_3),
    "answer_relevance": int(gold_answer.lower() in response_3.lower()),
    "faithfulness": int(response_3.lower() in context_3.lower()),
    "context_recall": int(gold_answer.lower() in context_3.lower()),
    "context_precision": float(gold_answer in context_3),
    "retrieval_accuracy": gold_answer.lower() in context_3.lower(),
    "token_cost_estimate": round((len(context_3) + len(response_3)) / 4 / 1000 * 0.002, 4)
}

results_df = pd.concat([results_df, pd.DataFrame([row_3])], ignore_index=True)

# --------- Example 4: standard RAG pipeline ---------
print("Running Example 4 (retrieval + generation)")
results_df = evaluate_and_log(
    query=query,
    gold_answer=gold_answer,
    results_df=results_df,
    top_k=5,
    model_cost_per_1k_tokens=0.002
)

# Show results
print("✅ Both examples complete.")
display(results_df.tail(2))


Let's **analyze**

In [None]:
# Basic summary stats
#results_df.describe(include='all')

# Export to CSV
#results_df.to_csv("rag_evaluation_results.csv", index=False)


In [None]:
def evaluate_and_log_llm(query, gold_answer, results_df, top_k=5, model_cost_per_1k_tokens=0.002):
    start_time = time.time()

    # Retrieve context (pass top_k through so it takes effect)
    search_results = search_documents(query, top_k=top_k)
    context_chunks = [result[0] for result in search_results]
    context = "\n".join(context_chunks)

    # Generate answer
    response = generate_response_with_openai(query, context, "gpt-3.5-turbo", "")
    end_time = time.time()

    # LLM-based scoring (instead of fuzzy_match)
    answer_relevance = llm_judge_score(query, context, response, gold_answer, "answer_relevance")
    faithfulness = llm_judge_score(query, context, response, gold_answer, "faithfulness")
    context_recall = llm_judge_score(query, context, response, gold_answer, "context_recall")
    context_precision = llm_judge_score(query, context, response, gold_answer, "context_precision")
    retrieval_accuracy = llm_judge_score(query, context, response, gold_answer, "retrieval_accuracy")

    # Token estimation
    total_chars = len(context) + len(response)
    token_estimate = total_chars / 4
    estimated_cost = (token_estimate / 1000) * model_cost_per_1k_tokens

    # Save to table
    row = {
        "query": query,
        "gold_answer": gold_answer,
        "generated_answer": response,
        "context": context,
        "latency": round(end_time - start_time, 2),
        "context_len": len(context),
        "answer_len": len(response),
        "answer_relevance": round(answer_relevance, 2),
        "faithfulness": round(faithfulness, 2),
        "context_recall": round(context_recall, 2),
        "context_precision": round(context_precision, 2),
        "retrieval_accuracy": round(retrieval_accuracy, 2),
        "token_cost_estimate": round(estimated_cost, 4)
    }

    results_df = pd.concat([results_df, pd.DataFrame([row])], ignore_index=True)
    return results_df


In [None]:

def llm_judge_score(query, context, response, gold_answer, metric_type):
    """
    Uses an LLM (like GPT-3.5-turbo) to score a response based on various metrics.

    Args:
        query (str): The original user query.
        context (str): The retrieved context used to generate the response.
        response (str): The generated answer from the LLM.
        gold_answer (str): The ground truth correct answer.
        metric_type (str): The metric to evaluate ('answer_relevance', 'faithfulness',
                           'context_recall', 'context_precision', 'retrieval_accuracy').

    Returns:
        float: The score (between 0 and 1) for the specified metric as judged by the LLM.
               Returns 0 if the metric type is invalid or an error occurs.
    """
    prompt = ""
    if metric_type == "answer_relevance":
        prompt = f"""
        Evaluate if the following generated answer is relevant to the user query.
        Score 1 if it is highly relevant, 0.5 if partially relevant, 0 if not relevant.
        User Query: {query}
        Generated Answer: {response}
        Score:
        """
    elif metric_type == "faithfulness":
        prompt = f"""
        Evaluate if the following generated answer is supported by the provided context.
        Score 1 if the answer is fully supported by the context, 0.5 if partially supported, 0 if not supported (hallucination).
        Context: {context}
        Generated Answer: {response}
        Score:
        """
    elif metric_type == "context_recall":
        prompt = f"""
        Evaluate if the gold answer is present or implied in the provided context.
        Score 1 if the gold answer is fully present in the context, 0.5 if partially present, 0 if not present.
        Gold Answer: {gold_answer}
        Context: {context}
        Score:
        """
    elif metric_type == "context_precision":
        prompt = f"""
        Evaluate how much of the provided context is relevant to answer the user query.
        Score 1 if most of the context is relevant, 0.5 if some is relevant, 0 if little or none is relevant.
        User Query: {query}
        Context: {context}
        Score:
        """
    elif metric_type == "retrieval_accuracy":
        prompt = f"""
        Evaluate if the provided context contains information necessary to answer the user query.
        Score 1 if the context is sufficient to answer the query, 0 if it is not.
        User Query: {query}
        Context: {context}
        Score:
        """
    else:
        print(f"Warning: Invalid metric_type '{metric_type}' for LLM judging.")
        return 0.0

    try:
        llm_response = client.chat.completions.create(
            model="gpt-3.5-turbo", # Using a suitable LLM for judging
            messages=[
                {"role": "system", "content": "You are an impartial judge evaluating AI responses. Provide a score between 0 and 1."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=5 # Expecting a single number or simple text
        )
        score_text = llm_response.choices[0].message.content.strip()
        # The reply may wrap the number in text (e.g. "Score: 0.5"), so scan
        # the tokens and use the first one that parses as a float.
        for token in score_text.replace(":", " ").split():
            try:
                score = float(token.rstrip(".,"))
            except ValueError:
                continue
            # Clamp the score between 0 and 1 just in case
            return max(0.0, min(1.0, score))
        print(f"Could not parse LLM score '{score_text}' for metric '{metric_type}'. Returning 0.")
        return 0.0

    except Exception as e:
        print(f"Error during LLM judging for metric '{metric_type}': {e}")
        return 0.0

In [None]:
# --------- Example 3: static context from documents[0] ---------
print("Running Example 3 (static context, no retrieval)")

# Prepare context manually (not using search_documents)
context_3 = documents[0] # Use the full text of the first document as context
your_role = ""  # Or customize if needed

# Measure time
start = time.time()
response_3 = generate_response_with_openai(query, context_3, llm_model, your_role)
end = time.time()

# Compute the same heuristic metrics inline, since context and response were produced manually
row_3 = {
    "query": query,
    "gold_answer": gold_answer,
    "generated_answer": response_3,
    "context": context_3,
    "latency": round(end - start, 2),
    "context_len": len(context_3),
    "answer_len": len(response_3),
    "answer_relevance": int(gold_answer.lower() in response_3.lower()),
    "faithfulness": int(response_3.lower() in context_3.lower()),
    "context_recall": int(gold_answer.lower() in context_3.lower()),
    "context_precision": float(gold_answer in context_3),
    "retrieval_accuracy": gold_answer.lower() in context_3.lower(),
    "token_cost_estimate": round((len(context_3) + len(response_3)) / 4 / 1000 * 0.002, 4)
}

results_df = pd.concat([results_df, pd.DataFrame([row_3])], ignore_index=True)

# --------- Example 4: standard RAG pipeline ---------
print("Running Example 4 (retrieval + generation)")
results_df = evaluate_and_log_llm(
    query=query,
    gold_answer=gold_answer,
    results_df=results_df,
    top_k=5,
    model_cost_per_1k_tokens=0.002
)

# Show results
print("✅ Both examples complete.")
display(results_df.tail(2))


# We have a basic flow, but bad metrics. Let's try to improve

In [None]:
!pip install -q sentence-transformers faiss-cpu


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

In [None]:

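# Load a compact sentence-embedding model (all-MiniLM-L6-v2 is trained mainly on English text)
embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Step 1: Embed the documents (text beyond the model's maximum sequence length is truncated)
doc_vectors = embed_model.encode(documents, convert_to_numpy=True)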

# Step 2: Create FAISS index
dimension = doc_vectors.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_vectors)

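# Step 3: Define new search function
def search_documents(query, top_k=5):
    query_vector = embed_model.encode([query], convert_to_numpy=True)
    # Ensure top_k does not exceed the number of documents (FAISS pads missing results with -1)
    actual_top_k = min(top_k, len(documents))
    distances, indices = index.search(query_vector, actual_top_k)
    return [(documents[i], distances[0][j]) for j, i in enumerate(indices[0])]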


In [None]:
# --------- Example 5: SBERT embedding + GPT-based evaluation ---------
print("🔎 Running Example 5 - SBERT retrieval + LLM judgment")

# Query and gold answer
query = "can I cancel my flight and get refund in elal"
gold_answer = "El Al allows flight cancellations under certain conditions."

# Use SBERT retrieval instead of TF-IDF
search_results = search_documents(query, top_k=8)
context_chunks = [result[0] for result in search_results]
context_5 = "\n".join(context_chunks)

# Generate answer with OpenAI
start = time.time()
response_5 = generate_response_with_openai(query, context_5, "gpt-3.5-turbo", "")
end = time.time()

# Evaluate using GPT as judge
answer_relevance = llm_judge_score(query, context_5, response_5, gold_answer, "answer_relevance")
faithfulness = llm_judge_score(query, context_5, response_5, gold_answer, "faithfulness")
context_recall = llm_judge_score(query, context_5, response_5, gold_answer, "context_recall")
context_precision = llm_judge_score(query, context_5, response_5, gold_answer, "context_precision")
retrieval_accuracy = llm_judge_score(query, context_5, response_5, gold_answer, "retrieval_accuracy")

# Estimate cost
total_chars = len(context_5) + len(response_5)
token_estimate = total_chars / 4
estimated_cost = (token_estimate / 1000) * 0.002

# Log results
row_5 = {
    "query": query,
    "gold_answer": gold_answer,
    "generated_answer": response_5,
    "context": context_5,
    "latency": round(end - start, 2),
    "context_len": len(context_5),
    "answer_len": len(response_5),
    "answer_relevance": round(answer_relevance, 2),
    "faithfulness": round(faithfulness, 2),
    "context_recall": round(context_recall, 2),
    "context_precision": round(context_precision, 2),
    "retrieval_accuracy": round(retrieval_accuracy, 2),
    "token_cost_estimate": round(estimated_cost, 4)
}

results_df = pd.concat([results_df, pd.DataFrame([row_5])], ignore_index=True)

# Display last 3 results
print("✅ Example 5 complete.")
display(results_df.tail(3))
