# Email Wizard Assistant - RAG Implementation Notebook
#
# This notebook details the development and testing of the core Retrieval-Augmented Generation (RAG) pipeline.

### 1. Setup and Imports
### Make sure you have activated your virtual environment and installed dependencies from `requirements.txt`.
### ```bash
### pip install -r requirements.txt
### ```
### For local execution involving the Gemini API, ensure the `GOOGLE_API_KEY` environment variable is set *before* starting Jupyter Lab/Notebook:
### ```bash
### export GOOGLE_API_KEY="YOUR_API_KEY" # Linux/macOS
### set "GOOGLE_API_KEY=YOUR_API_KEY"   # Windows CMD (quote the whole assignment so the quotes are not stored in the value)
### $env:GOOGLE_API_KEY="YOUR_API_KEY" # Windows PowerShell
### ```

In [1]:
# Core Libraries
import pandas as pd
import numpy as np
import json
import os
import time
from sklearn.metrics.pairwise import cosine_similarity

# Embedding Model Library
from sentence_transformers import SentenceTransformer

# LLM Library (Google Gemini)
from google import genai



### 2. Configuration and Gemini Client Initialization

In [2]:
# --- Configuration ---
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'
# Use a model compatible with your google-genai setup, e.g., 'gemini-pro'
GEMINI_MODEL_NAME = 'gemini-2.5-flash-preview-04-17'
EMAIL_DATA_PATH = 'data/emails.json'
EMBEDDING_SAVE_PATH = 'data/email_embeddings.npy'

# --- Initialize Gemini Client ---
GEMINI_CLIENT = None
API_KEY = os.environ.get("GOOGLE_API_KEY")

if not API_KEY:
    print("WARNING: GOOGLE_API_KEY environment variable not set.")
    print("Gemini API calls will fail. Please set the environment variable and restart the kernel.")
else:
    try:
        # Using google-genai SDK client initialization
        GEMINI_CLIENT = genai.Client(api_key=API_KEY)
        print(f"Gemini client initialized successfully for model access (using configured key). Target model: {GEMINI_MODEL_NAME}")
    except Exception as e:
        print(f"ERROR: Failed to initialize Gemini client: {e}")
        print("Please ensure your API key is valid and the environment variable is set correctly.")
        GEMINI_CLIENT = None # Ensure client is None if setup failed

Gemini client initialized successfully for model access (using configured key). Target model: gemini-2.5-flash-preview-04-17


### 3. Load and Prepare Email Dataset

In [10]:
emails_df = None
try:
    emails_df = pd.read_json(EMAIL_DATA_PATH)
    # Combine sender, subject, and body into one text field for embedding
    emails_df['full_text'] = "Sender: " + emails_df['sender'] + "\n\n" + emails_df['subject'] + "\n\n" + emails_df['body']
    print(f"Successfully loaded {len(emails_df)} emails from {EMAIL_DATA_PATH}.")
    print(emails_df.head())
except FileNotFoundError:
    print(f"ERROR: Email data file not found at {EMAIL_DATA_PATH}")
except Exception as e:
    print(f"ERROR: Failed to load or process email data: {e}")

Successfully loaded 50 emails from data/emails.json.
   id                                             sender       date  \
0   1                   Chloe <chloe.g@emailfriends.com> 2024-05-10   
1   2               Samantha Miller, Project Coordinator 2023-11-05   
2   3                       tech_enthusiast_88@email.com 2024-01-20   
3   4               Old Friend Mike <mikey.p@mymail.net> 2024-04-18   
4   5  Project Phoenix Lead <phoenix.lead@corporate.com> 2024-02-15   

                                     subject  \
0                       Italy Trip - Ideas?!   
1         Meeting Request: Q4 Project Review   
2     Inquiry about 'Aura Phone X1' Features   
3                          Long time no see!   
4  Project Phoenix - Phase 2 Progress Report   

                                                body  \
0  Hey Liam,\n\nOMG, so excited we're actually do...   
1  Dear Team,\n\nPlease let me know your availabi...   
2  Hello OmniGadget Support,\n\nI'm interested in...   
3  Hey 

### 4. Load Embedding Model and Embed Emails

In [11]:
# --- Load Embedding Model ---
sbert_model = None
try:
    sbert_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    print(f"Sentence transformer model '{EMBEDDING_MODEL_NAME}' loaded successfully.")
except Exception as e:
    print(f"ERROR: Failed to load sentence transformer model: {e}")

Sentence transformer model 'all-MiniLM-L6-v2' loaded successfully.


In [12]:
# --- Embed Emails (Load if exists, otherwise generate and save) ---
email_embeddings = np.array([]) # Initialize as empty

if sbert_model is not None and emails_df is not None and not emails_df.empty:
    if os.path.exists(EMBEDDING_SAVE_PATH):
        try:
            print(f"Loading pre-computed embeddings from {EMBEDDING_SAVE_PATH}...")
            email_embeddings = np.load(EMBEDDING_SAVE_PATH)
            print(f"Loaded embeddings. Shape: {email_embeddings.shape}")
            if email_embeddings.shape[0] != len(emails_df):
                print("WARNING: Number of embeddings does not match number of emails. Re-generating...")
                email_embeddings = np.array([]) # Force regeneration
        except Exception as e:
            print(f"ERROR loading embeddings: {e}. Will attempt to regenerate.")
            email_embeddings = np.array([]) # Force regeneration

    if email_embeddings.size == 0: # If loading failed or file didn't exist
        try:
            print(f"Generating embeddings for {len(emails_df)} email texts...")
            email_contents_to_embed = emails_df['full_text'].tolist()
            email_embeddings = sbert_model.encode(email_contents_to_embed)
            print(f"Embeddings generated. Shape: {email_embeddings.shape}")
            # Save the generated embeddings
            os.makedirs(os.path.dirname(EMBEDDING_SAVE_PATH), exist_ok=True) # Ensure data dir exists
            np.save(EMBEDDING_SAVE_PATH, email_embeddings)
            print(f"Embeddings saved to {EMBEDDING_SAVE_PATH}")
        except Exception as e:
            print(f"ERROR: Failed to generate or save embeddings: {e}")
            email_embeddings = np.array([]) # Ensure it's empty on failure
else:
    if sbert_model is None:
        print("Skipping embedding generation: SBERT model not loaded.")
    if emails_df is None or emails_df.empty:
        print("Skipping embedding generation: Email data not loaded.")

Loading pre-computed embeddings from data/email_embeddings.npy...
Loaded embeddings. Shape: (50, 384)
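
The cache check above compares only the number of rows, so emails that were edited without changing the row count would reuse stale embeddings. One possible guard is to store a content hash next to the `.npy` file and regenerate when it changes. The sketch below is a minimal illustration under that assumption; `texts_fingerprint` is a hypothetical helper, not part of the notebook.

```python
import hashlib

def texts_fingerprint(texts):
    """Return a SHA-256 hex digest over an ordered list of strings (hypothetical helper)."""
    digest = hashlib.sha256()
    for text in texts:
        encoded = text.encode("utf-8")
        # Length-prefix each text so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()

original = ["Subject A\n\nBody A", "Subject B\n\nBody B"]
edited = ["Subject A\n\nBody A", "Subject B\n\nBody B (edited)"]

# Same row count, different content: the fingerprints differ.
print(texts_fingerprint(original) == texts_fingerprint(edited))  # prints False
```

In the notebook, the helper could be called on `emails_df['full_text'].tolist()` and the result compared with a digest saved alongside the embeddings.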


### 5. Implement Similarity Search Function

In [13]:
def find_similar_emails(query, email_embeddings_db, all_emails_df, embedding_pipeline_model, top_n=3):
    """
    Finds the top_n most similar emails to a given query using cosine similarity.
    Assumes email_embeddings_db corresponds row-wise to all_emails_df.
    """
    if embedding_pipeline_model is None:
        print("ERROR: Embedding model not loaded for similarity search.")
        return [], []
    if email_embeddings_db is None or email_embeddings_db.size == 0:
        print("ERROR: Email embeddings not available for search.")
        return [], []
    if all_emails_df is None or all_emails_df.empty:
        print("ERROR: Email DataFrame not available for similarity search.")
        return [], []
    if email_embeddings_db.shape[0] != len(all_emails_df):
        print("ERROR: Mismatch between number of embeddings and emails.")
        return [], []


    try:
        query_embedding = embedding_pipeline_model.encode([query]) 
        similarities = cosine_similarity(query_embedding, email_embeddings_db)[0] 

        num_emails = email_embeddings_db.shape[0]
        actual_top_n = min(top_n, num_emails)
        if actual_top_n <= 0:
            return [], []

        top_indices_sorted = np.argsort(similarities) 
        top_indices = top_indices_sorted[-actual_top_n:][::-1] 

        similar_emails_content = []
        similar_email_details = []

        for index in top_indices:
            if 0 <= index < len(all_emails_df):
                email_row = all_emails_df.iloc[index]
                email_info = f"Email ID: {email_row.get('id', 'N/A')}\nSender: {email_row.get('sender', 'N/A')}\nDate: {email_row.get('date', 'N/A')}\nSubject: {email_row.get('subject', 'N/A')}\nBody:\n{email_row.get('body', 'N/A')}"
                similar_emails_content.append(email_info)
                similar_email_details.append({
                    "id": email_row.get('id', 'N/A'),
                    "similarity_score": float(similarities[index]),
                    "subject": email_row.get('subject', 'N/A')
                })
            else:
                print(f"Warning: Index {index} out of bounds for emails_df.")
        return similar_emails_content, similar_email_details
    except Exception as e:
        print(f"ERROR during similarity search: {e}")
        return [], []
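
The ranking step in `find_similar_emails` can be illustrated on toy vectors. The sketch below uses plain Python rather than NumPy and scikit-learn so that it runs without the notebook's dependencies; `cosine` and `top_n_indices` are hypothetical names, and the 2-D vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_indices(query_vec, doc_vecs, top_n=3):
    """Return (indices of the top_n most similar documents, best first; all scores)."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    n = min(top_n, len(doc_vecs))  # clamp to the corpus size, as the notebook function does
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n], scores

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
indices, scores = top_n_indices(query, docs, top_n=2)
print(indices)  # prints [0, 2]
```

Document 0 points almost exactly in the query's direction, document 2 partly, and document 1 is nearly orthogonal, which is why it is dropped at `top_n=2`.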

### 6. Implement Gemini Generation Function (using Context)

In [14]:
def generate_email_response_gemini(user_query, retrieved_emails_content, stream_response=False):
    """
    Generates a response using the initialized Gemini client based on the user query and retrieved emails.
    """
    global GEMINI_CLIENT, GEMINI_MODEL_NAME # Access the globally configured client/model name

    if GEMINI_CLIENT is None:
        return "ERROR: Gemini client not initialized. Cannot generate response."

    if not retrieved_emails_content:
        return "I couldn't find any relevant past emails to answer your query. Could you please try rephrasing or provide more details?"

    email_context = "\n\n---\n\n".join(retrieved_emails_content)

    prompt = f"""You are an Email Wizard Assistant. Your task is to answer the user's query based *only* on the provided email context below. Be concise and factual based on the emails. If the answer is not found in the emails, state that clearly. Do not make up information.

Retrieved Email(s) Context:
---
{email_context}
---

User Query: "{user_query}"

Assistant's Answer:
"""

    try:
        if stream_response:
            print("Note: Streaming response not fully implemented for simplified API/notebook testing here.")
            response = GEMINI_CLIENT.models.generate_content(
                model=GEMINI_MODEL_NAME,
                contents=prompt
            )
            return response.text if hasattr(response, 'text') else "Streaming response (implementation needed)."
        else:
            response = GEMINI_CLIENT.models.generate_content(
                model=GEMINI_MODEL_NAME,
                contents=prompt
            )

            prompt_feedback = getattr(response, 'prompt_feedback', None) # Safely get prompt_feedback or None
            if prompt_feedback is not None:
                block_reason = getattr(prompt_feedback, 'block_reason', None) # Safely get block_reason or None
                if block_reason is not None: # Check if block_reason has a value (isn't None)
                    reason_msg = getattr(prompt_feedback, 'block_reason_message', str(block_reason))
                    print(f"Warning: Gemini response blocked. Reason: {reason_msg}")
                    return f"Sorry, the response generation was blocked due to safety filters ({reason_msg})."

            if hasattr(response, 'text') and response.text:
                return response.text.strip()
            elif hasattr(response, 'parts') and response.parts:
                # Concatenate text from parts, skipping parts whose text is missing or None
                return "".join(part.text for part in response.parts if getattr(part, 'text', None)).strip()
            else:
                print(f"Warning: Received empty or non-text response object, and prompt was not blocked: {response}")
                return "Sorry, I couldn't generate a valid response from the AI model based on the provided context."


    except Exception as e:
        print(f"ERROR during Gemini API call: {type(e).__name__} - {e}")
        if "429" in str(e) or "RESOURCE_EXHAUSTED" in str(e):
            return "ERROR: The AI service is currently busy (Rate Limit Exceeded). Please wait and try again."
        if "API_KEY_INVALID" in str(e) or "PermissionDenied" in str(type(e).__name__):
            return "ERROR: Authentication Error. Please check your API Key."
        return f"ERROR: An unexpected error occurred while contacting the AI service ({type(e).__name__})."
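
The function above reports a rate limit (429 / `RESOURCE_EXHAUSTED`) as an error string and does not retry. If retries are wanted, one option is a small exponential-backoff wrapper. The sketch below is an assumption-laden illustration: `call_with_backoff`, its parameters, and the string matching on the exception message (mirroring the notebook's own check) are hypothetical and not part of the google-genai SDK.

```python
import time

def call_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on rate-limit errors with exponential backoff (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            message = str(exc)
            is_rate_limit = "429" in message or "RESOURCE_EXHAUSTED" in message
            if not is_rate_limit or attempt == max_attempts:
                raise  # not retryable, or out of attempts
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

# Demo with a fake call that fails twice before succeeding.
calls = {"count": 0}

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429 RESOURCE_EXHAUSTED")
    return "ok"

delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # prints ok [1.0, 2.0]
```

Because `generate_email_response_gemini` catches exceptions internally, a wrapper like this would have to surround the raw `GEMINI_CLIENT.models.generate_content(...)` call rather than the function itself.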

### 7. Implement Full RAG Pipeline Function

In [15]:
def rag_email_assistant(user_query, email_embeddings_db, all_emails_df, embedding_pipeline_model, llm_generation_function, top_n_retrieval=3):
    """
    Orchestrates the RAG pipeline: retrieve similar emails and generate response.
    """
    print(f"\nProcessing query: '{user_query}'")
    start_time = time.time()

    # 1. Retrieve relevant emails
    print(f"Step 1: Finding top {top_n_retrieval} similar emails...")
    retrieved_email_snippets, retrieved_details = find_similar_emails(
        user_query,
        email_embeddings_db,
        all_emails_df,
        embedding_pipeline_model,
        top_n=top_n_retrieval
    )
    retrieval_time = time.time() - start_time

    if not retrieved_email_snippets:
        print(f"Step 1 Result: No relevant emails found. (Took {retrieval_time:.2f}s)")
        return "I couldn't find any relevant past emails to answer your query. Could you please try rephrasing or provide more details?"

    print(f"Step 1 Result: Retrieved {len(retrieved_email_snippets)} email snippets. (Took {retrieval_time:.2f}s)")
    print("Retrieved Email Subjects & Scores:")
    for detail in retrieved_details:
        print(f"  - ID: {detail['id']}, Score: {detail['similarity_score']:.4f}, Subject: {detail['subject']}")


    # 2. Generate response using LLM with retrieved context
    print("\nStep 2: Generating response with Gemini using retrieved context...")
    generation_start_time = time.time()
    generated_response = llm_generation_function(
        user_query,
        retrieved_email_snippets,
        stream_response=False # Keep simple for notebook test
    )
    generation_time = time.time() - generation_start_time
    print(f"Step 2 Result: Response generated. (Took {generation_time:.2f}s)")

    total_time = time.time() - start_time
    print(f"\nTotal processing time: {total_time:.2f}s")
    return generated_response
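
`find_similar_emails` always returns `top_n` results, however low their scores, and all of them are passed to the LLM. A minimal sketch of an optional filter that could sit between retrieval and generation follows. `filter_by_score` and the 0.3 cutoff are hypothetical choices for illustration; the sample scores are the ones reported for Query 2 in the evaluation section, and the snippet strings are placeholders.

```python
def filter_by_score(snippets, details, min_score=0.3):
    """Keep only the (snippet, detail) pairs whose similarity_score is >= min_score."""
    kept = [(s, d) for s, d in zip(snippets, details) if d["similarity_score"] >= min_score]
    kept_snippets = [s for s, _ in kept]
    kept_details = [d for _, d in kept]
    return kept_snippets, kept_details

snippets = ["<email 3 text>", "<email 5 text>"]  # placeholders for the formatted email strings
details = [
    {"id": 3, "similarity_score": 0.6808, "subject": "Inquiry about 'Aura Phone X1' Features"},
    {"id": 5, "similarity_score": 0.2639, "subject": "Project Phoenix - Phase 2 Progress Report"},
]
kept_snippets, kept_details = filter_by_score(snippets, details)
print([d["id"] for d in kept_details])  # prints [3]
```

A suitable cutoff depends on the embedding model and the data, so any fixed value would need to be checked against real queries.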

### 8. Test the RAG Pipeline

In [19]:
all_components_ready = (
    'GEMINI_CLIENT' in globals() and GEMINI_CLIENT is not None and
    'email_embeddings' in globals() and isinstance(email_embeddings, np.ndarray) and email_embeddings.size > 0 and
    'emails_df' in globals() and isinstance(emails_df, pd.DataFrame) and not emails_df.empty and
    'sbert_model' in globals() and sbert_model is not None
)

if all_components_ready:
    print("\n--- Testing RAG Email Assistant ---")

    # Test Query 1 (Based on provided dataset)
    query1 = "What's the status of Project Phoenix?"
    assistant_reply1 = rag_email_assistant(
        user_query=query1,
        email_embeddings_db=email_embeddings,
        all_emails_df=emails_df,
        embedding_pipeline_model=sbert_model,
        llm_generation_function=generate_email_response_gemini
    )
    print("\nEmail Wizard's Assistant Reply 1:")
    print("="*30)
    print(assistant_reply1)
    print("="*30)

    # Test Query 2 (Based on provided dataset)
    query2 = "Inquiries about Aura Phone X1"
    assistant_reply2 = rag_email_assistant(
        user_query=query2,
        email_embeddings_db=email_embeddings,
        all_emails_df=emails_df,
        embedding_pipeline_model=sbert_model,
        llm_generation_function=generate_email_response_gemini,
        top_n_retrieval=2 # Retrieve fewer docs for this one maybe
    )
    print("\nEmail Wizard's Assistant Reply 2:")
    print("="*30)
    print(assistant_reply2)
    print("="*30)

    # Test Query 3 (Based on provided dataset)
    query3 = "Any emails regarding a dog which got lost?"
    assistant_reply3 = rag_email_assistant(
        user_query=query3,
        email_embeddings_db=email_embeddings,
        all_emails_df=emails_df,
        embedding_pipeline_model=sbert_model,
        llm_generation_function=generate_email_response_gemini
    )
    print("\nEmail Wizard's Assistant Reply 3:")
    print("="*30)
    print(assistant_reply3)
    print("="*30)

else:
    print("\n--- RAG Test Skipped ---")
    print("Reason: Not all required components (Gemini client, embeddings, data, models) were initialized successfully.")
    print(f"- Gemini Client Ready: {'Yes' if 'GEMINI_CLIENT' in globals() and GEMINI_CLIENT is not None else 'No'}")
    print(f"- Embeddings Ready: {'Yes' if 'email_embeddings' in globals() and isinstance(email_embeddings, np.ndarray) and email_embeddings.size > 0 else 'No'}")
    print(f"- DataFrame Ready: {'Yes' if 'emails_df' in globals() and isinstance(emails_df, pd.DataFrame) and not emails_df.empty else 'No'}")
    print(f"- SBERT Model Ready: {'Yes' if 'sbert_model' in globals() and sbert_model is not None else 'No'}")


--- Testing RAG Email Assistant ---

Processing query: 'What's the status of Project Phoenix?'
Step 1: Finding top 3 similar emails...
Step 1 Result: Retrieved 3 email snippets. (Took 0.02s)
Retrieved Email Subjects & Scores:
  - ID: 26, Score: 0.4931, Subject: Meeting Minutes: Project Skyward Strategy Session - 2023-09-19
  - ID: 5, Score: 0.4878, Subject: Project Phoenix - Phase 2 Progress Report
  - ID: 2, Score: 0.4553, Subject: Meeting Request: Q4 Project Review

Step 2: Generating response with Gemini using retrieved context...
Step 2 Result: Response generated. (Took 3.00s)

Total processing time: 3.02s

Email Wizard's Assistant Reply 1:
Based on the emails provided:

The initial data migration for Project Phoenix Phase 2 has been successfully completed. The development team is currently implementing new user interface modules. Testing for Module A is scheduled to begin next Monday and is anticipated to take approximately one week. The design team has finalized mockups for Modu

### 9. Evaluation

This section evaluates the performance of the Email Wizard Assistant based on the RAG pipeline implemented above. We will look at search speed, accuracy of similarity, and coherence of responses using the test queries executed.

#### 9.1. Search Speed

The similarity search speed measures the time taken to embed the user's query, compare it against the pre-computed email embeddings, and retrieve the top N most similar emails. This is primarily the execution time of the `find_similar_emails` function.

Based on the test runs:
*   **Query 1 ("What's the status of Project Phoenix?"):** Retrieval (Step 1) took approximately **0.02 seconds**.
*   **Query 2 ("Inquiries about Aura Phone X1"):** Retrieval (Step 1) took approximately **0.01 seconds**.
*   **Query 3 ("Any emails regarding a dog which got lost?"):** Retrieval (Step 1) took approximately **0.01 seconds**.

**Conclusion on Search Speed:**
For the current dataset of 50 emails, using `all-MiniLM-L6-v2` embeddings and scikit-learn's `cosine_similarity` for exact search, retrieval took 10 to 20 milliseconds per query. That is more than adequate for real-time interaction at this dataset size. Because the search is exact, its cost grows with the number of emails: every query is compared against every stored embedding.
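
A single `time.time()` reading per query is noisy at the millisecond scale. One way to get a steadier figure is to repeat the call and report the median using `time.perf_counter`. The sketch below assumes a hypothetical helper `time_call` and uses a stand-in workload in place of `find_similar_emails`.

```python
import statistics
import time

def time_call(func, repeats=20):
    """Run func() `repeats` times and return the median and max wall time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {"median_s": statistics.median(durations), "max_s": max(durations)}

# Stand-in workload; in the notebook this would be a lambda wrapping find_similar_emails(...).
stats = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median: {stats['median_s'] * 1000:.3f} ms, max: {stats['max_s'] * 1000:.3f} ms")
```

The median is less sensitive than a single reading to one-off delays such as a slow first call.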

#### 9.2. Accuracy of Similarity (Qualitative Assessment)

This metric assesses how relevant the emails retrieved by the `find_similar_emails` function are to the user's query. This is a qualitative assessment based on the `Retrieved Email Subjects & Scores` printed during the RAG test calls.

*   **Query 1: "What's the status of Project Phoenix?"**
    *   **Retrieved:**
        *   ID 26, Score: 0.4931, Subject: Meeting Minutes: Project Skyward Strategy Session - 2023-09-19
        *   ID 5, Score: 0.4878, Subject: Project Phoenix - Phase 2 Progress Report
        *   ID 2, Score: 0.4553, Subject: Meeting Request: Q4 Project Review
    *   **Assessment:** Good retrieval. The most relevant email (ID 5: "Project Phoenix - Phase 2 Progress Report") ranked second with a score of 0.4878, narrowly behind ID 26 (0.4931), which concerns a different project. ID 26 and ID 2 overlap thematically with the query (projects and meetings), which is typical of semantic search. Because ID 5 is within the top 3, the necessary context still reaches the generation step.

*   **Query 2: "Inquiries about Aura Phone X1"**
    *   **Retrieved:**
        *   ID 3, Score: 0.6808, Subject: Inquiry about 'Aura Phone X1' Features
        *   ID 5, Score: 0.2639, Subject: Project Phoenix - Phase 2 Progress Report
    *   **Assessment:** Excellent retrieval for the primary document. Email ID 3, which is directly about the 'Aura Phone X1', was retrieved with the highest score of the three tests (0.6808). The second email (ID 5, score 0.2639) is not relevant to the query; it is returned only because `top_n_retrieval=2` always yields two results and no minimum score is enforced. The key success is retrieving ID 3.

*   **Query 3: "Any emails regarding a dog which got lost?"**
    *   **Retrieved:**
        *   ID 6, Score: 0.5776, Subject: LOST DOG: 'Buddy' - Golden Retriever - Maple Street Area
        *   ID 13, Score: 0.3542, Subject: Some fluffy news!
        *   ID 15, Score: 0.2942, Subject: Question about Summer Street Fair
    *   **Assessment:** Very good retrieval. The most relevant email (ID 6: "LOST DOG: 'Buddy'") ranked first with a score of 0.5776. Email ID 13 ("Some fluffy news!"), about adopting a kitten, was also retrieved, likely because of its pet-related content. Email ID 15 (0.2942) is not clearly related to the query; it fills the third slot because the search always returns `top_n` results. The primary success is the retrieval of ID 6.

**Conclusion on Accuracy of Similarity:**
The similarity search retrieved the most relevant email for all three test queries: it ranked first for Queries 2 and 3 and second for Query 1. Lower-ranked results sometimes share only a broad theme with the query, which is a natural characteristic of semantic search without a score threshold. Overall, retrieval supplies the context that the generation step needs.
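
The qualitative assessment can be complemented with a small quantitative summary. The sketch below computes the hit rate at rank 1 and the mean reciprocal rank (MRR) from the retrieved IDs listed above. It assumes a single relevant email per query, namely the one each assessment names as the key result (IDs 5, 3, and 6); that labeling is an assumption, and three queries are too few to generalize from.

```python
def reciprocal_rank(retrieved_ids, relevant_id):
    """Return 1 / rank of relevant_id in retrieved_ids (1-based), or 0.0 if absent."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# (retrieved IDs in ranked order, assumed relevant ID) for the three test queries
runs = [
    ([26, 5, 2], 5),
    ([3, 5], 3),
    ([6, 13, 15], 6),
]

rr = [reciprocal_rank(ids, rel) for ids, rel in runs]
mrr = sum(rr) / len(rr)
hit_at_1 = sum(1 for ids, rel in runs if ids[0] == rel) / len(runs)
print(f"MRR: {mrr:.3f}, hit@1: {hit_at_1:.3f}")  # prints MRR: 0.833, hit@1: 0.667
```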

#### 9.3. Coherence of Responses (Qualitative Assessment)

This metric evaluates the quality of the final answer generated by the Gemini model, considering its clarity, relevance to the query, and factual grounding in the retrieved email context.

*   **Reply 1 (Project Phoenix):**
    *   *Generated Text:* "Based on the emails provided:\n\nThe initial data migration for Project Phoenix Phase 2 has been successfully completed. The development team is currently implementing new user interface modules. Testing for Module A is scheduled to begin next Monday and is anticipated to take approximately one week. The design team has finalized mockups for Module B and shared them with stakeholders for feedback. The project is currently on track with the revised timeline."
    *   **Assessment:** **Excellent.** The response is coherent, directly answers the query, and accurately synthesizes information from the retrieved context (primarily Email ID 5).

*   **Reply 2 (Aura Phone X1):**
    *   *Generated Text:* "Based on the emails, the inquiries about the Aura Phone X1 are regarding its battery life under typical usage conditions and whether the camera supports RAW image capture."
    *   **Assessment:** **Excellent.** The response is coherent and accurately extracts the specific inquiries mentioned in the relevant retrieved email (ID 3).

*   **Reply 3 (Lost Dog):**
    *   *Generated Text:* "Yes, there is one email regarding a lost dog.\n\nEmail ID 6 from Worried Neighbor reports that their Golden Retriever, named Buddy, went missing on December 1st near Maple Street and Oak Avenue. Buddy is friendly, has a red collar with a tag, answers to his name, is about 70 lbs with light golden fur, and was last seen heading towards the park. The contact number provided is 555-123-4567."
    *   **Assessment:** **Excellent.** The response directly answers the question, correctly identifies the relevant email (ID 6), and accurately summarizes all the key details about the lost dog from that email.

**Conclusion on Coherence of Responses:**
For the three test queries, the Gemini model generated coherent, accurate, and contextually appropriate responses. It effectively leverages the retrieved email snippets to answer user queries factually and concisely. The responses are well-grounded in the provided context and demonstrate good summarization and information extraction capabilities.

#### 9.4. Overall Evaluation Summary

The RAG-based Email Wizard Assistant effectively integrates retrieval and generation to answer user queries based on a sample email dataset.
*   **Search Speed:** The similarity search is very fast and suitable for this dataset size.
*   **Retrieval Accuracy:** The system generally retrieves highly relevant documents for user queries, demonstrating good semantic understanding.
*   **Response Coherence:** The generated responses are consistently coherent, factually accurate based on the retrieved context, and directly address the user's questions.

The system performs well on the given tasks. For larger and more complex datasets, further optimization of retrieval (e.g., approximate nearest neighbor (ANN) search) and continued prompt engineering would be beneficial.