# Comparison of Embeddings: Raw vs. Summarized Conversation Content

**Purpose:**
1. Load a specific conversation from the 'parsed_conversations' data.
2. Reconstruct the raw text of the conversation.
3. Generate a summary of the conversation using GPT-4o mini on Azure OpenAI (mimicking the pipeline).
4. Generate embeddings for BOTH the raw text and the summary text using
   Google's Gen AI embeddings (mimicking the extension API).
5. Compare the two embeddings using cosine similarity.

**Troubleshooting:**
- If you encounter errors with the Google embeddings:
  - The client in this notebook runs in Vertex AI mode (`vertexai=True`), so make sure you're authenticated with `gcloud auth application-default login` in your terminal
  - `GOOGLE_API_KEY` is only relevant if you switch the client to the Gemini Developer API instead of Vertex AI
- If you encounter errors with Azure OpenAI, ensure `AZURE_OPENAI_API_KEY` is set in your environment and that the endpoint and deployment name match your Azure resource
- Update `USER_ID_TO_ANALYZE` and `CONVERSATION_ID_TO_ANALYZE` to IDs that exist in your data directory
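
To find user IDs that actually exist, one option is to list the per-user files in the parsed-conversations folder. The sketch below assumes each user's data sits in a `<user_id>.snappy` file directly under that folder, as the loading cell of this notebook expects; `list_available_user_ids` is a hypothetical helper name, not part of the pipeline.

```python
from pathlib import Path


def list_available_user_ids(parsed_conv_dir: Path) -> list[str]:
    """Return the user IDs that have a '<user_id>.snappy' file in the folder."""
    if not parsed_conv_dir.is_dir():
        return []
    # Path.stem drops the '.snappy' suffix, leaving only the user ID
    return sorted(p.stem for p in parsed_conv_dir.glob("*.snappy"))


# Assumed location, relative to the notebook; prints [] if the folder is missing
print(list_available_user_ids(Path("../data/dagster/parsed_conversations")))
```

Once a user file is loaded, the conversation IDs it contains can be read from its `conversation_id` column.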

**Dependencies:**
```
pip install google-genai polars numpy python-dotenv openai scikit-learn json-repair
```

## 1. Setup and Configuration

Import libraries and configure API clients.
**Important:** Ensure your API keys and endpoints are correctly set up in your environment
(e.g., using a `.env` file).
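
A quick pre-flight check can catch a missing key before any API call is made. The sketch below is illustrative only: `missing_env_vars` is a hypothetical helper, and the only variable name taken from this notebook is `AZURE_OPENAI_API_KEY`.

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.getenv(name)]


missing = missing_env_vars(["AZURE_OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Run it after `load_dotenv()` so that values from a `.env` file are taken into account.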

In [None]:
import os
import polars as pl
import numpy as np
from pathlib import Path
import asyncio
import textwrap
import json
from dotenv import load_dotenv
from openai import AsyncAzureOpenAI
from google import genai
from google.genai.types import EmbedContentConfig
from sklearn.metrics.pairwise import cosine_similarity
from json_repair import repair_json # For parsing potentially malformed JSON from LLM

# Load environment variables (ensure you have a .env file or set these system-wide)
load_dotenv()

# --- Configuration ---

# Data Paths
# Adjust this path to point to your data directory structure
DATA_DIR = Path("../data") # Assumes notebook is in 'notebooks' folder sibling to 'data'
PARSED_CONV_DIR = DATA_DIR / "dagster/parsed_conversations"

# Conversation Selection (Choose a user and conversation to analyze)
# Replace with actual IDs from your data
USER_ID_TO_ANALYZE = "cm8d5ubo1000219t5pby4mssx" # Example user ID
CONVERSATION_ID_TO_ANALYZE = "67b4b8b0-c57c-8010-853e-25e324c584b7" # Example conversation ID

# GPT-4o (Summarization) Configuration - Using Azure OpenAI
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = "https://enclaveidai2163546968.openai.azure.com"
AZURE_OPENAI_API_VERSION = "2024-08-01-preview" # Use the version your deployment supports
GPT4O_MINI_DEPLOYMENT_NAME = "gpt-4o-mini" # Your Azure deployment name for GPT-4o Mini

# Gemini (Embedding) Configuration - Using Google Generative AI (genai) library
# Ensure you're authenticated with gcloud auth application-default login
GEMINI_EMBEDDING_MODEL_ID = "text-embedding-large-exp-03-07" # Embedding model used for both texts
# Embedding dimension
EMBEDDING_DIMENSION = 3072  # Output dimension requested from GEMINI_EMBEDDING_MODEL_ID

# --- Client Initialization ---

# Initialize Azure OpenAI Client (if keys are present)
summarization_client = None
if AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT:
    summarization_client = AsyncAzureOpenAI(
        api_key=AZURE_OPENAI_API_KEY,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    print("Azure OpenAI client initialized for summarization.")
else:
    print("Warning: Azure OpenAI credentials not found. Summarization will be skipped.")

embedding_client = genai.Client(vertexai=True, project="enclaveid", location="us-central1")
print("Google Generative AI client initialized for embeddings.")

In [None]:
# Helper function copied from the pipeline for consistency
def get_conversation_summary_prompt_sequence(conversation: str) -> list[dict[str, str]]:
    """Generates the prompt for conversation summarization."""
    prompt_text = textwrap.dedent(
        f"""
          You will be given a conversation between a user and an AI assistant.
          Your job is to provide a summary as follows:
          1. Provide a summary that describes the progression of the conversation and what the user obtains at the end.
          2. Determine if the conversation is highly sensitive, containing topics such as physical and mental health problems, relationship advice, erotic content, private legal matters, etc.
          3. If the conversation is not in English, your summary should be in English.
          4. Keep it under 150 words.
          5. In the summary, every occurrence of "the user" should be replaced with "<USER>"
          Use this output schema:
          {{
              "summary": str,
              "is_sensitive": bool,
          }}
          Here is the conversation:
          {conversation}
        """
    ).strip()
    # Using the standard message format expected by OpenAI/Azure API
    return [{"role": "user", "content": prompt_text}]

# Helper function copied from the pipeline for consistency
def parse_conversation_summaries(completion: str) -> dict | None:
    """Parses the LLM summary response."""
    try:
        # Use json_repair for robustness against minor JSON formatting issues
        res = repair_json(completion, return_objects=True)
        if (
            isinstance(res, dict)
            and "is_sensitive" in res
            and "summary" in res
            and isinstance(res["is_sensitive"], bool)
            and isinstance(res["summary"], str)
        ):
            return res
        else:
            print(f"Warning: Could not parse summary response correctly. Raw: {completion}")
            return None
    except Exception as e:
        print(f"Error parsing summary JSON: {e}. Raw: {completion}")
        return None

# Helper function to generate embeddings using Google Generative AI
async def generate_vertex_embedding(text: str) -> list[float] | None:
    """Generates embedding for a given text using Google Generative AI."""
    try:
        # Use the SDK's async client so the request does not block the event loop
        response = await embedding_client.aio.models.embed_content(
            model=GEMINI_EMBEDDING_MODEL_ID,
            contents=[text],
            config=EmbedContentConfig(
                task_type="RETRIEVAL_DOCUMENT",  # Setting the task type
                output_dimensionality=EMBEDDING_DIMENSION,  # Setting the output dimension
            ),
        )
        
        # Extract the embedding values from the response
        if response and response.embeddings and len(response.embeddings) > 0:
            embedding_values = response.embeddings[0].values
            
            if len(embedding_values) != EMBEDDING_DIMENSION:
                print(f"Warning: Embedding dimension mismatch. Expected {EMBEDDING_DIMENSION}, got {len(embedding_values)}")
                return None
                
            return embedding_values
        else:
            print("Warning: No embedding values returned")
            return None

    except Exception as e:
        print(f"Error generating Google Generative AI embedding: {e}")
        import traceback
        traceback.print_exc()
        return None
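
To make the expected response shape concrete, the following sketch applies the same validation rules as `parse_conversation_summaries` to two hand-written strings. It swaps `repair_json` for the standard library's `json.loads` so it runs without third-party packages, which means it does not repair malformed JSON. `validate_summary_payload` and the sample strings are made up for illustration.

```python
import json
from typing import Optional


def validate_summary_payload(completion: str) -> Optional[dict]:
    """Return the parsed payload if it matches the summary schema, else None."""
    try:
        res = json.loads(completion)
    except json.JSONDecodeError:
        return None
    if (
        isinstance(res, dict)
        and isinstance(res.get("summary"), str)
        and isinstance(res.get("is_sensitive"), bool)
    ):
        return res
    return None


good = '{"summary": "<USER> asks how to sort a list and gets a working snippet.", "is_sensitive": false}'
bad = '{"summary": "Missing the sensitivity flag."}'

print(validate_summary_payload(good))  # schema satisfied: returns the dict
print(validate_summary_payload(bad))   # "is_sensitive" missing: returns None
```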


## 2. Load and Prepare Conversation Data

In [None]:
# Construct the path to the user's Parquet file
user_file_path = PARSED_CONV_DIR / f"{USER_ID_TO_ANALYZE}.snappy"

if not user_file_path.exists():
    raise FileNotFoundError(f"Could not find parsed conversations for user {USER_ID_TO_ANALYZE} at {user_file_path}")

# Load the data
df_all_convos = pl.read_parquet(user_file_path)
print(f"Loaded data for user {USER_ID_TO_ANALYZE}. Shape: {df_all_convos.shape}")

# Filter for the specific conversation
df_single_convo = df_all_convos.filter(pl.col("conversation_id") == CONVERSATION_ID_TO_ANALYZE).sort("date", "time")

if df_single_convo.height == 0:
    raise ValueError(f"Conversation ID {CONVERSATION_ID_TO_ANALYZE} not found for user {USER_ID_TO_ANALYZE}")

print(f"Found {df_single_convo.height} messages for conversation {CONVERSATION_ID_TO_ANALYZE}")

# Reconstruct the raw conversation text
# Format: "QUESTION: <question text>\n\nANSWER: <answer text>\n\nQUESTION: ..."
conversation_parts = []
for row in df_single_convo.iter_rows(named=True):
    conversation_parts.append(f"QUESTION: {row['question']}")
    conversation_parts.append(f"ANSWER: {row['answer']}")

raw_conversation_text = "\n\n".join(conversation_parts)

print("\n--- Raw Conversation Text (First 500 chars) ---")
print(textwrap.shorten(raw_conversation_text, width=500, placeholder="..."))
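
The reconstruction step can be checked in isolation on a couple of hand-written rows. In this sketch plain dictionaries stand in for what `iter_rows(named=True)` yields; `build_conversation_text` is a hypothetical wrapper around the same loop, and the sample rows are invented.

```python
def build_conversation_text(rows: list[dict]) -> str:
    """Join question/answer pairs into the QUESTION/ANSWER transcript format."""
    parts = []
    for row in rows:
        parts.append(f"QUESTION: {row['question']}")
        parts.append(f"ANSWER: {row['answer']}")
    # Every part is separated by a blank line, including a question from its answer
    return "\n\n".join(parts)


sample_rows = [
    {"question": "What is a Parquet file?", "answer": "A columnar storage format."},
    {"question": "Can Polars read it?", "answer": "Yes, with read_parquet."},
]
print(build_conversation_text(sample_rows))
```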

## 3. Generate Conversation Summary

Uses the GPT-4o mini deployment on Azure OpenAI to create a summary of the raw text.

In [None]:
summary_text = None
is_sensitive = None

async def generate_summary():
    global summary_text, is_sensitive
    if not summarization_client:
        print("Skipping summarization as client is not initialized.")
        return

    print(f"\nGenerating summary using Azure OpenAI deployment: {GPT4O_MINI_DEPLOYMENT_NAME}...")
    try:
        # Send the pipeline's prompt; request parameters are left at the API
        # defaults here, since the pipeline's own settings are not shown in this notebook
        response = await summarization_client.chat.completions.create(
            model=GPT4O_MINI_DEPLOYMENT_NAME,
            messages=get_conversation_summary_prompt_sequence(raw_conversation_text),
        )
        completion = response.choices[0].message.content or ""
    except Exception as e:
        print(f"Error calling Azure OpenAI: {e}")
        return

    # Validate the JSON response and keep the fields used by later cells
    parsed = parse_conversation_summaries(completion)
    if parsed:
        summary_text = parsed["summary"]
        is_sensitive = parsed["is_sensitive"]
        print("\n--- Generated Summary ---")
        print(summary_text)
        print(f"Is sensitive: {is_sensitive}")
    else:
        print("Failed to generate summary.")

# Run the async function
await generate_summary()

## 4. Generate Embeddings

Uses the specified Gemini model via Vertex AI to generate embeddings for both the raw text and the summary text.

In [None]:
raw_text_embedding = None
summary_text_embedding = None

async def generate_embeddings():
    global raw_text_embedding, summary_text_embedding
    if not embedding_client:
        print("Skipping embedding generation as client is not initialized.")
        return

    print(f"\nGenerating embeddings using Vertex AI model: {GEMINI_EMBEDDING_MODEL_ID}...")

    # Generate embedding for raw text
    print("Embedding raw text...")
    raw_text_embedding = await generate_vertex_embedding(raw_conversation_text)
    if raw_text_embedding:
        print(f"Raw text embedding generated. Dimension: {len(raw_text_embedding)}")
    else:
        print("Failed to generate raw text embedding.")

    # Generate embedding for summary text (if summary exists)
    if summary_text:
        print("Embedding summary text...")
        summary_text_embedding = await generate_vertex_embedding(summary_text)
        if summary_text_embedding:
            print(f"Summary text embedding generated. Dimension: {len(summary_text_embedding)}")
        else:
            print("Failed to generate summary text embedding.")
    else:
        print("Skipping summary embedding as summary was not generated.")

# Run the async function
await generate_embeddings()

## 5. Compare Embeddings

Calculate the cosine similarity between the raw text embedding and the summary text embedding. A value closer to 1 indicates higher similarity.
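
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction only and ranges from -1 to 1. The pure-Python sketch below spells that out on tiny made-up vectors. It is a standalone illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # same direction
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 4))            # orthogonal
print(round(cosine([1.0, 0.0], [-1.0, 0.0]), 4))           # opposite direction
```

Scaling either vector leaves the score unchanged, since only direction matters.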

In [None]:
similarity_score = None

if raw_text_embedding and summary_text_embedding:
    print("\nCalculating cosine similarity...")
    # Reshape embeddings into 2D arrays for scikit-learn
    embedding1 = np.array(raw_text_embedding).reshape(1, -1)
    embedding2 = np.array(summary_text_embedding).reshape(1, -1)

    # Calculate cosine similarity
    similarity_matrix = cosine_similarity(embedding1, embedding2)
    similarity_score = similarity_matrix[0][0] # Extract the scalar value

    print("\n--- Comparison Results ---")
    print(f"Conversation ID: {CONVERSATION_ID_TO_ANALYZE}")
    print(f"Embedding Dimension: {embedding1.shape[1]}")
    print(f"Cosine Similarity (Raw vs. Summary): {similarity_score:.4f}")

    # Interpretation guide (rule-of-thumb thresholds, not calibrated for this embedding model)
    if similarity_score > 0.85:
        print("Interpretation: Very High Similarity - The summary captures the core semantic meaning of the raw text very well.")
    elif similarity_score > 0.7:
        print("Interpretation: High Similarity - The summary largely reflects the meaning of the raw text.")
    elif similarity_score > 0.5:
        print("Interpretation: Moderate Similarity - The summary captures some aspects, but there are divergences.")
    else:
        print("Interpretation: Low Similarity - The summary and raw text have significantly different semantic representations.")

elif not raw_text_embedding:
    print("\nCannot compare embeddings: Raw text embedding was not generated.")
else:
    print("\nCannot compare embeddings: Summary text embedding was not generated.")

## Conclusion

This notebook loaded a conversation, summarized it with the GPT-4o mini deployment on Azure OpenAI, generated embeddings for both the raw and summarized versions using the Google embedding model on Vertex AI, and calculated the cosine similarity between them.

The similarity score indicates how closely the summary's embedding matches the raw conversation's embedding under the chosen embedding model. A high score suggests the summary preserves most of the meaning that the model encodes for the full conversation.