# Usage Logger

This module provides an embedding-enhanced usage logging system for the RL-based thesis assistant.  
It captures prompt-level interactions, infers user intent, and stores relevant metadata (e.g. thesis stage).  
In addition, it leverages the OpenAI embedding API to represent prompts as high-dimensional vectors, enabling similarity search, memory-based reflection, and semantic RL conditioning.

---

## Purpose

The `UsageLogger` is designed to:

- Log assistant usage events during live interaction or LangGraph rollout
- Generate vector embeddings for each prompt using `text-embedding-3-small`
- Persist embeddings to disk for reuse and inspection
- Support similarity-based search over past usage sessions

This logger is intended to support both:
- **Semantic retrieval** (e.g. reflection on similar past queries)
- **RL preprocessing** (e.g. embedding as state input)

---

## Logged Format

Each log entry includes:

| Field          | Description                               |
|----------------|-------------------------------------------|
| `timestamp`    | Timestamp when interaction occurred       |
| `prompt`       | The raw prompt text (user input)          |
| `intent`       | Inferred intent label or action ID        |
| `thesis_stage` | Stage of the thesis (e.g. planning, writing) |

Each embedding is stored with:
- `embedding`: 1536-dimensional vector (as list or np.array)
- `original_index`: index into the usage log array
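
The two record shapes above can be sketched as plain dictionaries. This is a minimal illustration, not the class's own code: the helper names `make_log_entry` and `make_embedding_record` are hypothetical, and `datetime.now()` stands in for the `pd.Timestamp.now()` call used by the logger.

```python
from datetime import datetime


def make_log_entry(prompt, intent, thesis_stage="unknown"):
    """Build one log record with the four fields from the table (hypothetical helper)."""
    return {
        "timestamp": datetime.now(),  # stand-in for pd.Timestamp.now()
        "prompt": prompt,
        "intent": intent,
        "thesis_stage": thesis_stage,
    }


def make_embedding_record(vector, original_index):
    """Pair an embedding vector with the index of the log entry it belongs to."""
    return {"embedding": list(vector), "original_index": original_index}


usage_logs = [make_log_entry("revise abstract for clarity", "write_2", "writing")]
# A placeholder 1536-dimensional vector; real values would come from the embedding API.
record = make_embedding_record([0.0] * 1536, original_index=len(usage_logs) - 1)
```

The `original_index` field is the only link between an embedding and its log entry, so both lists have to stay in the same order.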

---


## Main Methods

### `log_usage(prompt, intent, thesis_stage="unknown")`

Logs a single prompt event, generates its embedding, and saves the embeddings file.

### `_generate_embedding_for_log(log_entry)`

Private method. Uses the OpenAI API to convert a log entry's prompt into an embedding, stores it, and saves the embeddings file. Called automatically at the end of each `log_usage` call.

### `generate_all_usage_embeddings(batch_size=100)`

Batch-generates embeddings for any past logs that are missing them.
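
The batching step can be sketched on its own, without any API call. This is a simplified illustration under stated assumptions: `missing_indices` and `split_into_batches` are hypothetical helper names, and the real method sends each batch of prompts to the embedding API rather than just returning index lists.

```python
import math


def missing_indices(num_logs, embedded_indices):
    """Indices of log entries that have no embedding record yet."""
    return [i for i in range(num_logs) if i not in embedded_indices]


def split_into_batches(items, batch_size):
    """Split items into consecutive chunks of at most batch_size elements."""
    num_batches = math.ceil(len(items) / batch_size)
    return [items[k * batch_size:(k + 1) * batch_size] for k in range(num_batches)]


# Five logs, of which only index 1 already has an embedding.
todo = missing_indices(5, embedded_indices={1})
batches = split_into_batches(todo, batch_size=2)
print(batches)  # [[0, 2], [3, 4]]
```

Each batch would correspond to one API request, so a larger `batch_size` means fewer requests for the same number of logs.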


### `find_similar_usage(query_prompt, n=3)`

Finds the n most similar usage logs based on prompt embeddings.

For example, `similar = logger.find_similar_usage("summarize discussion section")` returns a list of dictionaries sorted by ascending cosine distance:

```
[
  {
    "distance": 0.14,
    "original_log": {
      "prompt": "...",
      "intent": "...",
      "thesis_stage": "...",
      ...
    }
  },
]

```
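
The ranking behind this result can be sketched in plain Python. The logger itself uses `scipy.spatial.distance.cosine`; the `cosine_distance` and `top_n` functions below are hypothetical stand-ins that compute the same quantity, and the 3-dimensional vectors are toy values chosen for readability.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_n(query, records, n=3):
    """Score every stored embedding against the query and keep the n closest."""
    scored = [
        {"distance": cosine_distance(query, r["embedding"]),
         "original_index": r["original_index"]}
        for r in records
    ]
    return sorted(scored, key=lambda s: s["distance"])[:n]


records = [
    {"embedding": [1.0, 0.1, 0.0], "original_index": 0},
    {"embedding": [0.0, 1.0, 0.0], "original_index": 1},
    {"embedding": [1.0, 1.0, 0.0], "original_index": 2},
]
matches = top_n([1.0, 0.0, 0.0], records, n=2)
print([m["original_index"] for m in matches])  # [0, 2]
```

A distance near 0 means the two prompts point in almost the same direction in embedding space; a distance of 1 means they are orthogonal.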

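## File Persistence

All embeddings are automatically saved to disk as `usage_embeddings.json`, and the logger loads this file (if present) on startup. Only the embeddings are persisted: `usage_logs` starts empty in every new session, so `original_index` values loaded from disk refer to logs from earlier sessions that are no longer in memory.

You can change the path by passing a custom filename: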

```
logger = UsageLogger(openai_client, embeddings_file="my_embeddings.json")
```
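
The save and load round trip can be sketched with the standard library alone. This is an illustration rather than the class's code: `save_embeddings` and `load_embeddings` are hypothetical free functions, the vectors are plain lists, and the file is written to a temporary directory. In the logger, NumPy arrays are converted with `.tolist()` before dumping because JSON cannot serialize them directly.

```python
import json
import os
import tempfile


def save_embeddings(records, path):
    """Write embedding records to a JSON file, storing each vector as a list."""
    serializable = [
        {"embedding": list(r["embedding"]), "original_index": r["original_index"]}
        for r in records
    ]
    with open(path, "w") as f:
        json.dump(serializable, f)


def load_embeddings(path):
    """Read embedding records back, or return an empty list if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "my_embeddings.json")
original = [{"embedding": [0.1, 0.2, 0.3], "original_index": 0}]
save_embeddings(original, path)
restored = load_embeddings(path)
```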

## Example Usage

```
from usage_logger import UsageLogger
from openai import OpenAI

client = OpenAI()
logger = UsageLogger(client)

# Log one interaction
logger.log_usage("revise abstract for clarity", "write_2", "writing")

# Find similar usage
matches = logger.find_similar_usage("abstract revision")
for m in matches:
    print(m["distance"], m["original_log"]["prompt"])
```

## Integration Plan
The UsageLogger is designed to work in parallel with:

- LangGraph rollouts for logging actions, policy trace, and user prompts

- DataPreprocessor to inject embedding slices into the RL state vector

- Reflection/memory modules for finding similar cases from past logs
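
One way the embedding slice could enter the RL state vector is sketched below. Everything here is an assumption made for illustration: the `build_state` function, the `STAGES` list, and the choice to concatenate the first `k` embedding components with a one-hot thesis stage are hypothetical and do not describe the actual DataPreprocessor interface.

```python
# Hypothetical stage vocabulary; the real set of thesis stages may differ.
STAGES = ["planning", "writing", "unknown"]


def build_state(embedding, thesis_stage, k=8):
    """Concatenate the first k embedding components with a one-hot stage encoding."""
    stage = thesis_stage if thesis_stage in STAGES else "unknown"
    one_hot = [1.0 if s == stage else 0.0 for s in STAGES]
    return list(embedding[:k]) + one_hot


# Placeholder embedding; a real one would come from the logger's stored records.
state = build_state([0.5] * 1536, "writing", k=4)
print(len(state))  # 7
```

Keeping `k` small limits the size of the state vector, at the cost of discarding most of the embedding's information.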

## Notes

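- Embedding model: `text-embedding-3-small`
- Requires an authenticated OpenAI client; `OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable unless `api_key` is passed explicitly
- Similarity search uses cosine distance (`scipy.spatial.distance.cosine`), so smaller values mean more similar prompts
- Embeddings are stored as 1536-dimensional vectors (`list` or `np.ndarray` in memory, lists in the JSON file)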

This logger helps bridge semantic understanding and RL state input.
Use it during interaction, simulation, or model training.
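
For simulation or offline experiments, a stand-in client can replace the real API. The sketch below is hypothetical: `FakeOpenAIClient` and `FakeEmbeddings` are not part of this module or of the OpenAI library. They only mimic the one call the logger makes, `client.embeddings.create(model=..., input=...)`, and the response shape it reads, `response.data[i].embedding`. The vectors are derived from a hash, so they are deterministic but carry no semantic meaning.

```python
import hashlib
from types import SimpleNamespace


class FakeEmbeddings:
    """Mimics the embeddings.create call with deterministic, non-semantic vectors."""

    def __init__(self, dim=1536):
        self.dim = dim

    def create(self, model, input):
        # Accept a single string or a list of strings, as the logger passes both.
        prompts = [input] if isinstance(input, str) else list(input)
        data = [SimpleNamespace(embedding=self._vector(p)) for p in prompts]
        return SimpleNamespace(data=data)

    def _vector(self, text):
        # Repeat the 32 digest bytes to fill the vector, scaled into [0, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dim)]


class FakeOpenAIClient:
    """Exposes an `embeddings` attribute like the object the logger expects."""

    def __init__(self, dim=1536):
        self.embeddings = FakeEmbeddings(dim)


client = FakeOpenAIClient()
response = client.embeddings.create(
    model="text-embedding-3-small", input=["revise abstract", "plan chapter 2"]
)
print(len(response.data), len(response.data[0].embedding))  # 2 1536
```

Passing such a client to the logger exercises logging, persistence, and ranking without network access, but the resulting distances should not be read as real similarity scores.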


# Standard library
import json
import math
import os
import time
import unittest
from unittest.mock import MagicMock, patch, call

# Third-party
import numpy as np
import pandas as pd
from openai import OpenAI
from scipy.spatial import distance


class UsageLogger:
    """
    Logs usage data for a thesis assistant, including user prompts, inferred
    intents, and the relevant thesis stage.

    It also integrates with the OpenAI API to generate embeddings for user
    prompts, allowing for similarity searches to find related interactions.
    Includes functionality for batch and continuous embedding generation.
    Supports saving and loading embeddings to/from a file for persistence.

    Attributes:
        usage_logs (list): A list of dictionaries, where each dictionary
                           represents a single usage log entry.
        usage_embeddings (list): A list of dictionaries, where each dictionary
                                 contains an embedding vector ('embedding')
                                 and a reference to the original log entry's
                                 index ('original_index').
        client (OpenAI): An initialized OpenAI client instance used for API calls.
        embeddings_file (str): The file path for storing usage embeddings.
    """
    def __init__(self, openai_client, embeddings_file="usage_embeddings.json"):
        """
        Initializes the UsageLogger with an OpenAI client and empty lists for
        usage logs and embeddings, then loads embeddings from the file if it exists.

        Args:
            openai_client: An initialized OpenAI client object for API interactions.
            embeddings_file (str): File path for storing usage embeddings.
        """
        self.usage_logs = []
        self.usage_embeddings = []
        self.client = openai_client
        self.embeddings_file = embeddings_file
        self._load_embeddings()


    def _load_embeddings(self):
        """Loads usage embeddings from JSON file."""
        if os.path.exists(self.embeddings_file):
            try:
                with open(self.embeddings_file, 'r') as f:
                    loaded_embeddings = json.load(f)
                    for item in loaded_embeddings:
                        if 'embedding' in item and isinstance(item['embedding'], list):
                            item['embedding'] = np.array(item['embedding'])
                    self.usage_embeddings = loaded_embeddings

                # print(f"Loaded {len(self.usage_embeddings)} embeddings from {self.embeddings_file}")
            except Exception as e:
                print(f"Error loading embeddings from {self.embeddings_file}: {e}")
                self.usage_embeddings = []
        else:
            # print(f"Embeddings file not found: {self.embeddings_file}. Starting with empty embeddings.")
            self.usage_embeddings = []


    def _save_embeddings(self):
        """Saves the current usage embeddings to the specified JSON file."""
        try:
            embeddings_to_save = []
            for item in self.usage_embeddings:
                item_to_save = item.copy()
                if 'embedding' in item_to_save and isinstance(item_to_save['embedding'], np.ndarray):
                    item_to_save['embedding'] = item_to_save['embedding'].tolist()
                embeddings_to_save.append(item_to_save)

            with open(self.embeddings_file, 'w') as f:
                json.dump(embeddings_to_save, f)
            # print(f"Saved {len(self.usage_embeddings)} embeddings to {self.embeddings_file}")
        except Exception as e:
            print(f"Error saving embeddings to {self.embeddings_file}: {e}")


    def log_usage(self, prompt, intent, thesis_stage="unknown"):
        """
        Logs new usage entry, generates embedding, and saves embeddings to file.

        Args:
            prompt (str): User's input prompt text.
            intent (str): Inferred intent of the prompt.
            thesis_stage (str): Relevant thesis stage.
        """
        log_entry = {
            'timestamp': pd.Timestamp.now(),
            'prompt': prompt,
            'intent': intent,
            'thesis_stage': thesis_stage
        }
        self.usage_logs.append(log_entry)
        # print(f"Usage logged: Timestamp={log_entry['timestamp']}, Prompt='{prompt}', Intent='{intent}', Thesis Stage='{thesis_stage}'")

        # print("Generating embedding for the new log entry...")
        self._generate_embedding_for_log(log_entry)


    def _generate_embedding_for_log(self, log_entry):
        """
        Generates embedding for a single log entry's prompt and stores it.

        Args:
            log_entry (dict): The log entry dictionary.
        """
        try:
            prompt_text = log_entry['prompt']
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=prompt_text
            )
            embedding = response.data[0].embedding
            # log_usage appends the entry before calling this method, so it is the last log
            self.usage_embeddings.append({'embedding': embedding, 'original_index': len(self.usage_logs) - 1})
            # print(f"Generated embedding for log entry at index {len(self.usage_embeddings) - 1}")

            self._save_embeddings()

        except Exception as e:
            print(f"Error generating embedding for log entry: {e}")


    def generate_all_usage_embeddings(self, batch_size=100):
        """
        Generates embeddings for all logs without embeddings using OpenAI API.

        Args:
            batch_size (int): Number of log entries per API call batch.
        """
        # print(f"Generating embeddings for all usage logs using OpenAI API in batches of {batch_size}...")
        # Collect (index, log_entry) pairs for logs that do not have embeddings yet
        existing_indices = {item['original_index'] for item in self.usage_embeddings}
        logs_without_embeddings = [
            (i, log_entry)
            for i, log_entry in enumerate(self.usage_logs)
            if i not in existing_indices
        ]

        if not logs_without_embeddings:
            # print("All usage logs already have embeddings.")
            return

        num_logs_to_process = len(logs_without_embeddings)
        num_batches = math.ceil(num_logs_to_process / batch_size)

        for i in range(num_batches):
            start_index = i * batch_size
            end_index = min((i + 1) * batch_size, num_logs_to_process)
            batch_data = logs_without_embeddings[start_index:end_index]

            batch_prompts = [log_entry['prompt'] for original_index, log_entry in batch_data]
            batch_original_indices = [original_index for original_index, log_entry in batch_data]


            if not batch_prompts:
                # print(f"Skipping empty batch {i+1}/{num_batches}.")
                continue

            try:
                response = self.client.embeddings.create(
                    model="text-embedding-3-small",
                    input=batch_prompts
                )
                for j, item in enumerate(response.data):
                    self.usage_embeddings.append({
                        'embedding': item.embedding,
                        'original_index': batch_original_indices[j]
                    })
                # print(f"Generated embeddings for batch {i+1}/{num_batches} ({start_index} to {end_index-1} of logs to process).")
            except Exception as e:
                 print(f"Error generating embeddings for batch {i+1}/{num_batches}: {e}")

        # print(f"Generated {len(self.usage_embeddings) - len(existing_indices)} new embeddings. Total embeddings: {len(self.usage_embeddings)}")

        self._save_embeddings()


    def find_similar_usage(self, query_prompt, n=3):
        """
        Finds the n most similar usage logs based on embedding similarity.

        Args:
            query_prompt (str): Query prompt text.
            n (int): Number of similar logs to return.

        Returns:
            list of dict: List of dictionaries for similar logs.
        """
        if not self.usage_embeddings:
            # print("No usage embeddings available to query. Please generate embeddings first.")
            return []

        try:
            query_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=query_prompt
            )
            query_embedding = query_response.data[0].embedding

            distances = []
            for item in self.usage_embeddings:
                dist = distance.cosine(query_embedding, item['embedding'])
                distances.append({
                    "distance": dist,
                    "original_index": item['original_index']
                    })

            distances_sorted = sorted(distances, key=lambda x: x['distance'])

            similar_logs = []
            for item in distances_sorted[0:n]:
                if 0 <= item['original_index'] < len(self.usage_logs):
                    original_log = self.usage_logs[item['original_index']]
                    similar_logs.append({
                        "distance": item['distance'],
                        "original_log": original_log
                    })
                else:
                    print(f"Warning: Original log index {item['original_index']} out of bounds.")

            return similar_logs

        except Exception as e:
            print(f"Error during similar usage query: {e}")
            return []



# Test


# print("\n### UsageLogger Test Cases")

class TestUsageLogger(unittest.TestCase):

    @patch('__main__.UsageLogger._load_embeddings') # Patch _load_embeddings at the class level
    def setUp(self, mock_load_embeddings):
        """Set up mock OpenAI client and UsageLogger instance before each test."""
        # print("\nSetting up for a new test...")
        self.mock_openai_client = MagicMock()
        self.mock_openai_client.embeddings = MagicMock()
        self.test_embeddings_file = "test_usage_embeddings.json"
        if os.path.exists(self.test_embeddings_file):
            os.remove(self.test_embeddings_file)
        # The UsageLogger will be initialized with _load_embeddings mocked
        self.usage_logger = UsageLogger(openai_client=self.mock_openai_client, embeddings_file=self.test_embeddings_file)

        # Ensure usage_embeddings is empty at the start of each test due to the patch
        self.assertEqual(len(self.usage_logger.usage_embeddings), 0, "_load_embeddings should be mocked and leave usage_embeddings empty")
        # print("Setup complete.")


    def tearDown(self):
        """Clean up the test embeddings file after each test."""
        # print("Cleaning up after test...")
        if os.path.exists(self.test_embeddings_file):
            os.remove(self.test_embeddings_file)
        # print("Cleanup complete.")


    @patch('__main__.UsageLogger._save_embeddings') # Patch save_embeddings to prevent file writes during this test
    def test_log_usage(self, mock_save_embeddings):
        """Test that log_usage correctly adds an entry and generates embedding."""
        # print("\nTesting log_usage method...")
        prompt = "How do I write a literature review?"
        intent = "writing_support"
        thesis_stage = "literature review"

        mock_embedding_vector = [0.1] * 1536
        self.mock_openai_client.embeddings.create.return_value = MagicMock(data=[MagicMock(embedding=mock_embedding_vector)])
        # print(f"Mocked OpenAI embeddings.create to return embedding vector: {mock_embedding_vector[:5]}...")

        self.usage_logger.log_usage(prompt, intent, thesis_stage)

        self.assertEqual(len(self.usage_logger.usage_logs), 1)
        self.assertEqual(len(self.usage_logger.usage_embeddings), 1)

        log_entry = self.usage_logger.usage_logs[0]
        self.assertEqual(log_entry['prompt'], prompt)
        self.assertEqual(log_entry['intent'], intent)
        self.assertEqual(log_entry['thesis_stage'], thesis_stage)
        self.assertIsInstance(log_entry['timestamp'], pd.Timestamp)

        embedding_entry = self.usage_logger.usage_embeddings[0]
        self.assertEqual(embedding_entry['original_index'], 0)
        self.assertTrue(np.array_equal(embedding_entry['embedding'], mock_embedding_vector))

        self.mock_openai_client.embeddings.create.assert_called_once_with(
            model="text-embedding-3-small",
            input=prompt
        )
        mock_save_embeddings.assert_called_once()
        # print("log_usage test completed.")


    @patch('builtins.print')
    @patch('__main__.UsageLogger._save_embeddings') # Patch save_embeddings to prevent file writes during this test
    def test_generate_all_usage_embeddings(self, mock_save_embeddings, mock_print):
        """Test that generate_all_usage_embeddings generates and saves embeddings in batches."""
        # print("\nTesting generate_all_usage_embeddings method...")
        mock_embedding_vector = [0.1] * 1536
        # Configure the mock create method to return a response with embeddings for each item in the input list
        def mock_create_embeddings(model, input):
            if isinstance(input, list):
                # Return a list of mock data objects, each with the mock embedding
                return MagicMock(data=[MagicMock(embedding=mock_embedding_vector) for _ in input])
            else:
                # This case should not be hit by generate_all_usage_embeddings with batching
                return MagicMock(data=[MagicMock(embedding=mock_embedding_vector)])

        self.mock_openai_client.embeddings.create.side_effect = mock_create_embeddings
        # print(f"Mocked OpenAI embeddings.create to return embedding vector: {mock_embedding_vector[:5]}...")

        # Add some usage logs directly without calling log_usage
        self.usage_logger.usage_logs.extend([
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log 1', 'intent': 'intent1', 'thesis_stage': 'stage1'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log 2', 'intent': 'intent2', 'thesis_stage': 'stage2'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log 3', 'intent': 'intent3', 'thesis_stage': 'stage3'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log 4', 'intent': 'intent4', 'thesis_stage': 'stage4'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log 5', 'intent': 'intent5', 'thesis_stage': 'stage5'},
        ])


        # Ensure usage_embeddings is empty before calling generate_all_usage_embeddings
        self.usage_logger.usage_embeddings = []

        # print(f"Added {len(self.usage_logger.usage_logs)} usage logs directly. usage_embeddings size: {len(self.usage_logger.usage_embeddings)}")


        # Generate embeddings in batches
        self.usage_logger.generate_all_usage_embeddings(batch_size=2)
        # print("Called generate_all_usage_embeddings with batch_size=2.")

        # Assertions
        self.assertEqual(len(self.usage_logger.usage_embeddings), 5)

        self.assertEqual(self.mock_openai_client.embeddings.create.call_count, 3) # 5 logs, batch_size 2 -> 3 calls

        # Manually check the call arguments
        actual_calls = self.mock_openai_client.embeddings.create.call_args_list
        self.assertEqual(len(actual_calls), 3)

        # Expected inputs based on batch size 2
        expected_inputs = [
            ["Log 1", "Log 2"],
            ["Log 3", "Log 4"],
            ["Log 5"],
        ]

        for i, expected_input in enumerate(expected_inputs):
            # Check the positional and keyword arguments of each call
            # The call object stores args as a tuple and kwargs as a dictionary
            self.assertEqual(actual_calls[i][0], ()) # No positional args expected
            self.assertEqual(actual_calls[i][1]['model'], "text-embedding-3-small")
            self.assertEqual(actual_calls[i][1]['input'], expected_input)


        for i, embedding_entry in enumerate(self.usage_logger.usage_embeddings):
            self.assertIn('embedding', embedding_entry)
            self.assertIn('original_index', embedding_entry)
            # Batches are processed sequentially, so embeddings are appended in log order
            self.assertEqual(embedding_entry['original_index'], i)
            self.assertTrue(np.array_equal(embedding_entry['embedding'], mock_embedding_vector))

        mock_save_embeddings.assert_called_once()
        # print("generate_all_usage_embeddings test completed.")

    @patch('builtins.print')
    @patch('__main__.UsageLogger._save_embeddings') # Patch save_embeddings
    def test_generate_all_usage_embeddings_empty_batch(self, mock_save_embeddings, mock_print):
        """Test generate_all_usage_embeddings with an empty or small batch scenario."""
        # print("\nTesting generate_all_usage_embeddings with an empty or small batch scenario...")
        mock_embedding_vector = [0.1] * 1536
        # Configure the mock create method to return a response with embeddings for each item in the input list
        def mock_create_embeddings(model, input):
            if isinstance(input, list):
                return MagicMock(data=[MagicMock(embedding=mock_embedding_vector) for _ in input])
            else:
                return MagicMock(data=[MagicMock(embedding=mock_embedding_vector)])

        self.mock_openai_client.embeddings.create.side_effect = mock_create_embeddings

        # print(f"Mocked OpenAI embeddings.create to return embedding vector: {mock_embedding_vector[:5]}...")

        # Scenario 1: Empty logs
        # print("Testing with no logs...")
        self.usage_logger.generate_all_usage_embeddings(batch_size=5)
        self.assertEqual(len(self.usage_logger.usage_embeddings), 0)
        self.mock_openai_client.embeddings.create.assert_not_called()
        mock_save_embeddings.assert_not_called() # No embeddings to save
        # print("Assertion passed: No embeddings generated and API not called for empty logs.")

        # Scenario 2: Single log, batch size larger than log count
        # print("Testing with a single log and larger batch size...")
        # Add a single log directly
        self.usage_logger.usage_logs.append({'timestamp': pd.Timestamp.now(), 'prompt': 'Single Log', 'intent': 'single_intent', 'thesis_stage': 'single_stage'})
        self.assertEqual(len(self.usage_logger.usage_logs), 1)

        # Ensure usage_embeddings is empty before calling generate_all_usage_embeddings
        self.usage_logger.usage_embeddings = []


        self.mock_openai_client.embeddings.create.reset_mock()
        mock_save_embeddings.reset_mock()

        self.usage_logger.generate_all_usage_embeddings(batch_size=5)
        self.assertEqual(len(self.usage_logger.usage_embeddings), 1)
        self.mock_openai_client.embeddings.create.assert_called_once_with(model="text-embedding-3-small", input=["Single Log"])
        mock_save_embeddings.assert_called_once()
        # print("Assertion passed: Correctly handled single log with larger batch size.")

        # print("generate_all_usage_embeddings empty batch scenario test completed.")


    @patch('builtins.print')
    @patch('__main__.UsageLogger._save_embeddings') # Patch save_embeddings
    def test_find_similar_usage(self, mock_save_embeddings, mock_print):
        """Test that find_similar_usage correctly finds similar logs based on mock embeddings."""
        # print("\nTesting find_similar_usage method...")

        # Manually set usage logs (these will be used to retrieve original logs)
        self.usage_logger.usage_logs.extend([
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log entry 0 content (related to literature)', 'intent': 'research', 'thesis_stage': 'literature review'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log entry 1 content (related to methodology)', 'intent': 'methodology', 'thesis_stage': 'methodology'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log entry 2 content (more literature review)', 'intent': 'writing', 'thesis_stage': 'literature review'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Log entry 3 content (data analysis)', 'intent': 'analysis', 'thesis_stage': 'results'},
        ])

        # print(f"Manually added {len(self.usage_logger.usage_logs)} usage logs for similarity search test.")

        # Manually set query embedding
        query_embedding_vector = np.array([0.1, 0.2, 0.3] + [0] * 1533) # Vector designed to be close to literature review embeddings
        self.mock_openai_client.embeddings.create.return_value = MagicMock(data=[MagicMock(embedding=query_embedding_vector.tolist())]) # Return as list for mock
        # print(f"Mocked OpenAI embeddings.create for query to return embedding (first few values): {query_embedding_vector[:5]}...")


        # Manually set usage embeddings for controlled testing, ensuring literature review logs are closest
        embedding_log1 = np.array([0.11, 0.21, 0.31] + [0] * 1533) # Very close to query_embedding
        embedding_log2 = np.array([0.9, 0.8, 0.7] + [0] * 1533)    # Significantly different
        embedding_log3 = np.array([0.13, 0.23, 0.33] + [0] * 1533) # Also very close to query_embedding
        embedding_another = np.array([0.85, 0.75, 0.65] + [0] * 1533) # Significantly different

        self.usage_logger.usage_embeddings = [
            {'embedding': embedding_log1, 'original_index': 0}, # Literature review log
            {'embedding': embedding_log2, 'original_index': 1}, # Methodology log
            {'embedding': embedding_log3, 'original_index': 2}, # Another literature review log
            {'embedding': embedding_another, 'original_index': 3}, # Data analysis log
        ]
        # No file write is needed: find_similar_usage reads self.usage_embeddings from memory,
        # and _save_embeddings is patched in this test, so calling it would be a no-op.


        # Find similar usage (requesting top 2)
        query_prompt = "Query about literature review"
        # print(f"Finding similar usage for query: '{query_prompt}' (n=2)")
        similar_logs = self.usage_logger.find_similar_usage(query_prompt, n=2)
        # print(f"Found {len(similar_logs)} similar logs.")


        # Assertions
        self.mock_openai_client.embeddings.create.assert_called_once_with(model="text-embedding-3-small", input=query_prompt)

        self.assertEqual(len(similar_logs), 2)

        # Calculate expected closest logs based on cosine distance
        query_embedding = np.array(query_embedding_vector)
        distances = []
        for item in self.usage_logger.usage_embeddings:
            dist = distance.cosine(query_embedding, item['embedding'])
            distances.append({
                "distance": dist,
                "original_index": item['original_index']
                })

        # Sort by distance and then by original_index for deterministic order in case of tie
        expected_closest = sorted(distances, key=lambda x: (x['distance'], x['original_index']))[:2]
        # print(f"Expected closest logs (distance, index): {expected_closest}")

        # Get the prompts for the expected closest logs
        expected_log_prompts = [self.usage_logger.usage_logs[item['original_index']]['prompt'] for item in expected_closest]
        # Get the prompts for the returned similar logs
        returned_log_prompts = [log['original_log']['prompt'] for log in similar_logs]

        # Assert that the returned prompts match the expected prompts
        self.assertEqual(returned_log_prompts, expected_log_prompts)

        # Also assert that the distances in the returned logs are close to the expected distances
        self.assertTrue(np.isclose(similar_logs[0]['distance'], expected_closest[0]['distance']))
        self.assertTrue(np.isclose(similar_logs[1]['distance'], expected_closest[1]['distance']))


        # print("find_similar_usage test completed.")


    @patch('builtins.print')
    def test_find_similar_usage_no_embeddings(self, mock_print):
        """Test that find_similar_usage returns empty list if no embeddings exist."""
        # print("\nTesting find_similar_usage with no embeddings...")
        self.assertEqual(len(self.usage_logger.usage_embeddings), 0)

        query_prompt = "Some query"
        # print(f"Finding similar usage for query: '{query_prompt}' with no embeddings.")
        similar_logs = self.usage_logger.find_similar_usage(query_prompt)
        # print(f"Found {len(similar_logs)} similar logs.")

        self.assertEqual(len(similar_logs), 0)
        self.mock_openai_client.embeddings.create.assert_not_called()
        # print("find_similar_usage with no embeddings test completed.")


    @patch('builtins.print')
    def test_find_similar_usage_with_invalid_index(self, mock_print):
        """Test find_similar_usage handles cases where original_index might be invalid."""
        # print("\nTest find_similar_usage handles invalid original_index...")
        self.usage_logger.usage_logs.extend([
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Valid Log 1', 'intent': 'test', 'thesis_stage': 'test'},
            {'timestamp': pd.Timestamp.now(), 'prompt': 'Valid Log 2', 'intent': 'test', 'thesis_stage': 'test'}
        ])
        # print(f"Manually added {len(self.usage_logger.usage_logs)} usage logs.")

        mock_embedding_vector = np.array([0.5] * 1536)
        self.usage_logger.usage_embeddings = [
            {'embedding': mock_embedding_vector, 'original_index': 0}, # Valid index
            {'embedding': mock_embedding_vector, 'original_index': 99}, # Invalid index
            {'embedding': mock_embedding_vector, 'original_index': 1}, # Valid index
        ]
        # print(f"Manually set {len(self.usage_logger.usage_embeddings)} usage embeddings, including one with an invalid index.")


        query_embedding = np.array([0.51] * 1536)
        self.mock_openai_client.embeddings.create.return_value = MagicMock(data=[MagicMock(embedding=query_embedding)])
        # print(f"Mocked OpenAI embeddings.create for query to return embedding (first few values): {query_embedding[:5]}...")


        query_prompt = "Query text"
        # print(f"Attempting to find similar usage for query: '{query_prompt}' with invalid index present.")
        similar_logs = self.usage_logger.find_similar_usage(query_prompt, n=3)
        # print(f"Found {len(similar_logs)} similar logs.")

        self.mock_openai_client.embeddings.create.assert_called_once_with(model="text-embedding-3-small", input=query_prompt)

        self.assertEqual(len(similar_logs), 2)

        returned_prompts = [log['original_log']['prompt'] for log in similar_logs]
        expected_valid_prompts = ['Valid Log 1', 'Valid Log 2']
        self.assertEqual(sorted(returned_prompts), sorted(expected_valid_prompts))

        mock_print.assert_any_call("Warning: Original log index 99 out of bounds.")
        # print("find_similar_usage with invalid index test completed.")


    @patch('builtins.print')
    def test_continuous_embedding_generation(self, mock_print):
        """Test that embeddings are generated automatically upon logging and saved."""
        mock_embedding_vector = [0.5] * 1536
        self.mock_openai_client.embeddings.create.return_value = MagicMock(data=[MagicMock(embedding=mock_embedding_vector)])

        self.usage_logger.log_usage("Prompt 1", "intent1", "stage1")
        self.usage_logger.log_usage("Prompt 2", "intent2", "stage2")
        self.usage_logger.log_usage("Prompt 3", "intent3", "stage3")

        # Each log_usage call should trigger exactly one embedding request.
        self.assertEqual(len(self.usage_logger.usage_embeddings), 3)
        self.assertEqual(self.mock_openai_client.embeddings.create.call_count, 3)

        for i in range(3):
            self.assertEqual(self.usage_logger.usage_embeddings[i]['original_index'], i)
            self.assertTrue(np.array_equal(self.usage_logger.usage_embeddings[i]['embedding'], mock_embedding_vector))

        # The embeddings must also have been persisted to disk as JSON.
        self.assertTrue(os.path.exists(self.test_embeddings_file))
        with open(self.test_embeddings_file, 'r') as f:
            saved_embeddings = json.load(f)
        self.assertEqual(len(saved_embeddings), 3)
        for i in range(3):
            self.assertEqual(saved_embeddings[i]['original_index'], i)
            self.assertTrue(np.array_equal(np.array(saved_embeddings[i]['embedding']), mock_embedding_vector))


    @patch('builtins.print')
    def test_load_embeddings_on_init(self, mock_print):
        """Test that embeddings are loaded from file upon initialization."""
        dummy_embeddings = [
            {'embedding': [0.1] * 1536, 'original_index': 0},
            {'embedding': [0.2] * 1536, 'original_index': 1},
        ]
        with open(self.test_embeddings_file, 'w') as f:
            json.dump(dummy_embeddings, f)

        # Initialize a new UsageLogger instance; _load_embeddings is NOT patched here
        new_usage_logger = UsageLogger(openai_client=self.mock_openai_client, embeddings_file=self.test_embeddings_file)

        self.assertEqual(len(new_usage_logger.usage_embeddings), 2)

        self.assertEqual(new_usage_logger.usage_embeddings[0]['original_index'], 0)
        self.assertTrue(np.array_equal(new_usage_logger.usage_embeddings[0]['embedding'], np.array([0.1] * 1536)))
        self.assertEqual(new_usage_logger.usage_embeddings[1]['original_index'], 1)
        self.assertTrue(np.array_equal(new_usage_logger.usage_embeddings[1]['embedding'], np.array([0.2] * 1536)))

        # Remove the dummy embeddings file created for this test.
        if os.path.exists(self.test_embeddings_file):
            os.remove(self.test_embeddings_file)


    @patch('builtins.print')
    def test_generate_all_usage_embeddings_no_logs(self, mock_print):
        """Test generate_all_usage_embeddings when there are no logs."""
        self.assertEqual(len(self.usage_logger.usage_logs), 0)

        self.usage_logger.generate_all_usage_embeddings()

        # With nothing to embed, no API call is made and no file is written.
        self.assertEqual(len(self.usage_logger.usage_embeddings), 0)
        self.mock_openai_client.embeddings.create.assert_not_called()
        self.assertFalse(os.path.exists(self.test_embeddings_file))


# Run the suite through an explicit TextTestRunner so it also works in a notebook.
# In a standard Python script, unittest.main() would be the usual entry point.
if __name__ == '__main__':
    runner = unittest.TextTestRunner(verbosity=2)
    suite = unittest.TestLoader().loadTestsFromTestCase(TestUsageLogger)
    runner.run(suite)

test_continuous_embedding_generation (__main__.TestUsageLogger.test_continuous_embedding_generation)
Test that embeddings are generated automatically upon logging and saved. ... ok
test_find_similar_usage (__main__.TestUsageLogger.test_find_similar_usage)
Test that find_similar_usage correctly finds similar logs based on mock embeddings. ... ok
test_find_similar_usage_no_embeddings (__main__.TestUsageLogger.test_find_similar_usage_no_embeddings)
Test that find_similar_usage returns empty list if no embeddings exist. ... ok
test_find_similar_usage_with_invalid_index (__main__.TestUsageLogger.test_find_similar_usage_with_invalid_index)
Test find_similar_usage handles cases where original_index might be invalid. ... ok
test_generate_all_usage_embeddings (__main__.TestUsageLogger.test_generate_all_usage_embeddings)
Test that generate_all_usage_embeddings generates and saves embeddings in batches. ... ok
test_generate_all_usage_embeddings_empty_batch (__main__.TestUsageLogger.test_generate_

Cleanup complete.
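
The out-of-bounds case exercised by `test_find_similar_usage_with_invalid_index` can be illustrated with a minimal sketch. This is not the `UsageLogger` implementation: `rank_similar`, `cosine_similarity`, and the `similarity` key are hypothetical names chosen for illustration, and only the `original_index` / `original_log` fields and the warning text are taken from the tests above. It assumes similarity is measured by cosine similarity and that entries with a dangling index are skipped with a warning.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(query_embedding, usage_embeddings, usage_logs, n=3):
    """Hypothetical helper: return up to n logs, most similar first."""
    scored = []
    for entry in usage_embeddings:
        index = entry["original_index"]
        # Skip embeddings whose index no longer points at a stored log.
        if not 0 <= index < len(usage_logs):
            print(f"Warning: Original log index {index} out of bounds.")
            continue
        scored.append({
            "original_log": usage_logs[index],
            "similarity": cosine_similarity(query_embedding, entry["embedding"]),
        })
    scored.sort(key=lambda item: item["similarity"], reverse=True)
    return scored[:n]


# Toy 2-dimensional data; real embeddings would have 1536 dimensions.
logs = [{"prompt": "Valid Log 1"}, {"prompt": "Valid Log 2"}]
embeddings = [
    {"embedding": [1.0, 0.0], "original_index": 0},
    {"embedding": [0.0, 1.0], "original_index": 1},
    {"embedding": [1.0, 1.0], "original_index": 99},  # dangling index
]
results = rank_similar([0.9, 0.1], embeddings, logs, n=3)
print([r["original_log"]["prompt"] for r in results])
```

Requesting `n=3` yields only two results here, which mirrors the assertion in the test: the entry with index 99 is reported and dropped rather than raising an `IndexError`.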
