# MBJM

## RAG-based Spoiler Detection and Context-Preserving Redaction


**Names & SRNs of the team:**

Hamsini V & PES1UG22AM062

Kirti S & PES1UG22AM084

Sudarshan Srinivasan & PES1UG22AM166

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install sentence-transformers chromadb -q


# --- Snippet 1: Setup and Configuration (for Index Creation) ---

## Purpose

This snippet initializes the environment for building a ChromaDB vector index from source data. Its primary role is to import necessary libraries and define crucial configuration parameters, such as file paths, model identifiers, and processing settings, that will be used in subsequent steps to load data, generate embeddings, and save the index.

## Key Actions

1.  **Import Libraries:** Imports essential Python modules:
    *   `json`: To load data from JSON files (presumably containing the spoiler text and metadata).
    *   `os`: For filesystem and path handling (e.g., joining paths, extracting basenames, creating directories). It is not used in this snippet itself but is needed by later ones.
    *   `glob`: To find files matching a specific pattern (e.g., finding all `.json` files within the `SOURCE_DATA_PATH`).
    *   `chromadb`: The client library for creating and interacting with the Chroma vector database.
    *   `SentenceTransformer` (from `sentence_transformers`): The class used to load models that convert text into dense vector embeddings.
    *   `time`: Standard library for time-related functions, potentially used for timing the index creation process.
    *   `numpy`: A fundamental package for scientific computing, often used implicitly by embedding models or for handling numerical data (added here for potential later use, though not directly used in *this* snippet).

2.  **Define Configuration Variables:** Sets up key parameters controlling the index creation process:
    *   `SOURCE_DATA_PATH` (str): Specifies the directory path (`"/kaggle/input/thewirechunked2/"`) where the input JSON files containing the text data (likely "The Wire" spoilers, based on context) are located. In Kaggle, this points to a read-only input dataset.
    *   `MODEL_NAME` (str): Defines the identifier (`'all-MiniLM-L6-v2'`) for the pre-trained Sentence Transformer model to be used for generating text embeddings. The `SentenceTransformer` library will typically download this model from Hugging Face Hub if it's not available locally or specified via a local path. `'all-MiniLM-L6-v2'` is a popular choice known for its balance of speed and embedding quality.
        *   *Note:* A commented-out line (`# MODEL_PATH = ...`) indicates an alternative approach where the model could be loaded from a pre-downloaded directory within Kaggle input, potentially saving download time if the model files are provided as a separate dataset. The active configuration uses `MODEL_NAME`, implying direct download/use via the library.
    *   `OUTPUT_DB_PATH` (str): Sets the directory path (`"/kaggle/working/the_wire_s1_chroma_db_3"`) where the newly created ChromaDB persistent index will be stored. In Kaggle, `/kaggle/working/` is the primary writable directory. ChromaDB will create necessary files within this directory.
    *   `COLLECTION_NAME` (str): Assigns a name (`"wire_s1_spoilers_3"`) to the specific collection within the ChromaDB database. This acts like a table name and is essential for storing and later retrieving the data associated with this specific index.
    *   `BATCH_SIZE` (int): Determines how many text documents will be processed (i.e., embedded) in a single batch (`128`). This parameter affects performance and memory usage, especially GPU VRAM if running on a GPU. Larger batches can increase throughput but require more memory.
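
To make the effect of `BATCH_SIZE` concrete, the sketch below splits a list of documents into consecutive batches. The helper name `iter_batches` and the placeholder documents are assumptions made for illustration; how the later cells actually consume `BATCH_SIZE` is not shown in this part of the notebook.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 466 placeholder documents (hypothetical content, same count as the events loaded in Snippet 2).
docs = [f"doc {i}" for i in range(466)]
batch_lengths = [len(batch) for batch in iter_batches(docs, 128)]
print(batch_lengths)
```

With a batch size of 128, only the last batch is smaller than the rest, so peak memory use is set by the full-size batches.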

## Context & Importance

This snippet is the **foundational step** for the **index creation workflow**. It doesn't perform any data loading, embedding, or saving itself, but it meticulously defines *all* the necessary parameters required for those subsequent actions. The choices made here (source data location, embedding model, output path, collection name, batch size) directly dictate how the index will be built and what characteristics it will have. Ensuring these paths and names are correct is critical for the success of the following steps (loading data, creating embeddings, initializing ChromaDB, and adding documents).
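
Since every later step depends on these values, a quick sanity check can catch a wrong path before any heavy work starts. The following is a minimal sketch, not part of the original notebook: `check_config` is a hypothetical helper, and the temporary directory only stands in for the Kaggle paths.

```python
import os
import tempfile


def check_config(source_dir: str, output_dir: str, collection_name: str) -> list:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []
    if not os.path.isdir(source_dir):
        problems.append(f"source directory not found: {source_dir}")
    # The output directory may not exist yet, so check that its parent is writable.
    parent = os.path.dirname(os.path.abspath(output_dir))
    if not os.access(parent, os.W_OK):
        problems.append(f"cannot write under: {parent}")
    if not collection_name.strip():
        problems.append("collection name is empty")
    return problems


# Demonstration with a temporary directory instead of the real Kaggle paths.
with tempfile.TemporaryDirectory() as tmp:
    ok = check_config(tmp, os.path.join(tmp, "db"), "wire_s1_spoilers_3")
    bad = check_config(os.path.join(tmp, "missing"), os.path.join(tmp, "db"), "")

print(ok)
print(bad)
```

In the notebook, the same function could be called with `SOURCE_DATA_PATH`, `OUTPUT_DB_PATH`, and `COLLECTION_NAME` right after they are defined.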

In [None]:
# === Snippet 1: Setup and Configuration ===

import json
import os
import glob
import chromadb
from sentence_transformers import SentenceTransformer
import time
import numpy as np # Added for potential use later

print("--- Running Snippet 1: Setup and Configuration ---")

# --- Configuration ---
# Path to the directory containing your JSON spoiler files
SOURCE_DATA_PATH = "/kaggle/input/thewirechunked2/"
# Path to the pre-loaded sentence transformer model directory
# IMPORTANT: Adjust this path if your Kaggle input dataset for the model has a different name!
#MODEL_PATH = "/kaggle/input/sentencetransformers-allminilml6v2/all-MiniLM-L6-v2"
MODEL_NAME = 'all-MiniLM-L6-v2' # Use this line instead if you want to download the model directly

# Path where the ChromaDB index will be saved (Kaggle's writable working directory)
OUTPUT_DB_PATH = "/kaggle/working/the_wire_s1_chroma_db_3"
# Name for the collection within ChromaDB
COLLECTION_NAME = "wire_s1_spoilers_3"
# Batch size for processing embeddings (adjust based on GPU memory if needed)
BATCH_SIZE = 128





--- Running Snippet 1: Setup and Configuration ---


In [None]:
print(f"Source Data Path: {SOURCE_DATA_PATH}")
print(f"Embedding Model Path: {MODEL_NAME}")
print(f"Output DB Path: {OUTPUT_DB_PATH}")
print(f"Collection Name: {COLLECTION_NAME}")
print("Configuration set.")
print("--- Finished Snippet 1 ---")

Source Data Path: /kaggle/input/thewirechunked2/
Embedding Model Path: all-MiniLM-L6-v2
Output DB Path: /kaggle/working/the_wire_s1_chroma_db_3
Collection Name: wire_s1_spoilers_3
Configuration set.
--- Finished Snippet 1 ---


# --- Snippet 2: Load Spoiler Data ---

## Purpose

This snippet is responsible for locating, reading, and aggregating data from multiple JSON files containing spoiler information. It assumes these files are organized within a specific directory (`SOURCE_DATA_PATH`) and potentially follow a naming convention that allows for logical sorting (e.g., by episode number). The final output is a single Python list (`all_spoiler_data`) containing all the data points loaded from the individual files, ready for further processing (like embedding generation).

## Key Actions

1.  **Import Necessary Libraries:** Imports `json` (for parsing JSON), `os` (for path manipulation like joining paths and extracting basenames), and `glob` (for finding files matching a pattern).
2.  **Configuration Display:** Prints the `SOURCE_DATA_PATH` to confirm the location being searched.
3.  **Define Search Pattern:**
    *   Constructs a file search pattern (`json_pattern`) using `os.path.join` to combine the `SOURCE_DATA_PATH` with a specific filename pattern (`"ragdatas1e*.json"`). This pattern aims to find JSON files specifically related to Season 1 episodes.
    *   Uses `glob.glob(json_pattern)` to find all file paths within the source directory that match this pattern.
4.  **File Discovery Check:**
    *   Checks if the `json_files` list is empty.
    *   If no files are found, it prints an error message showing the pattern used and suggests checking the path and naming convention. The sorting and loading steps are then skipped, and the final validation reports the empty result.
5.  **Sort Files:**
    *   If files are found, it prints the number discovered.
    *   Defines a `sort_key` helper function to extract the episode number from the filename (assuming the format `ragdatas1eXX.json`). It includes basic error handling to place files with unexpected names at the end.
    *   Sorts the list of discovered `json_files` using this `sort_key` so that files are processed in numeric episode order (e.g., `ragdatas1e2.json` before `ragdatas1e10.json`, which a plain alphabetical sort would get wrong). This can be important for preserving logical sequence if the data has temporal dependencies.
    *   Prints the sorted list of filenames that will be processed.
6.  **Iterate and Load Data:**
    *   Loops through each `file_path` in the `sorted_json_files` list.
    *   Prints the name of the file currently being processed.
    *   Uses a `try...except` block for robust file handling:
        *   Opens the file in read mode (`with open(file_path, 'r') as f:`).
        *   Parses the JSON content using `json.load(f)`.
        *   **Data Structure Validation:** Checks if the loaded `episode_data` is a Python `list` (as expected). If it is, it appends all items from this list to the main `all_spoiler_data` list using `extend()`.
        *   If the data structure is not a list, it prints a warning and skips the file.
        *   Catches `json.JSONDecodeError` specifically for issues during parsing.
        *   Catches general `Exception` for any other errors during file processing (e.g., file not found if it was deleted between `glob` and `open`, permission issues).
7.  **Report Summary:**
    *   After the loop, prints the total number of individual spoiler events loaded into `all_spoiler_data` and the number of files found by `glob` (this count includes any files that were skipped because of errors).
8.  **Basic Validation:**
    *   Performs a final check to see if `all_spoiler_data` actually contains any data. Prints an error if it's empty, confirming a potential issue upstream. Prints a success message otherwise.

## Context & Importance

This snippet acts as the data ingestion stage for the index creation pipeline. It bridges the gap between raw data stored in files and a structured Python representation (`all_spoiler_data`) needed for the next steps.
*   **Data Aggregation:** It gathers potentially fragmented data (one file per episode) into a unified list.
*   **Robustness:** Includes error handling for missing files, malformed JSON, and unexpected data structures within files.
*   **Order Preservation:** The sorting step ensures that data is loaded in a predictable, logical sequence (by episode), which might be relevant depending on how the data is structured and used later.
*   **Foundation:** Successfully loading and structuring the data here is **critical** for the subsequent embedding generation and database population steps. If this snippet fails or loads incorrect data, the resulting vector index will be flawed.
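
The reason for sorting on the extracted episode number, rather than on the raw filename, can be seen in a small sketch. It uses a regular expression instead of the notebook's `split`-based parsing; the names `EPISODE_RE` and `episode_number` are illustrative assumptions, and the `999` fallback mirrors the sentinel used in the notebook's `sort_key`.

```python
import re

# Matches the naming convention described above, e.g. "ragdatas1e10.json".
EPISODE_RE = re.compile(r"ragdatas1e(\d+)\.json$")


def episode_number(path: str) -> int:
    """Extract the episode number, or return 999 so unexpected names sort last."""
    match = EPISODE_RE.search(path)
    return int(match.group(1)) if match else 999


names = ["ragdatas1e10.json", "ragdatas1e2.json", "ragdatas1e1.json"]
lexicographic = sorted(names)
numeric = sorted(names, key=episode_number)

print("lexicographic:", lexicographic)
print("numeric:      ", numeric)
```

The alphabetical order places episode 10 before episode 2, while the numeric key restores the broadcast order.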

In [None]:
# === Snippet 2: Load Spoiler Data ===

import json
import os
import glob

# Configuration from Snippet 1 (repeated for clarity, not strictly needed if run in same session)
SOURCE_DATA_PATH = "/kaggle/input/thewirechunked2/"

print("--- Running Snippet 2: Load Spoiler Data ---")
print(f"Loading JSON data from: {SOURCE_DATA_PATH}")

all_spoiler_data = []
# Construct the search pattern based on the user's file names
json_pattern = os.path.join(SOURCE_DATA_PATH, "ragdatas1e*.json")
json_files = glob.glob(json_pattern)

if not json_files:
    print(f"Error: No JSON files found matching pattern '{json_pattern}'.")
    print("Please check the SOURCE_DATA_PATH and file naming convention.")
else:
    print(f"Found {len(json_files)} JSON files matching pattern.")
    # Sort files numerically based on episode number if possible
    def sort_key(filepath):
        basename = os.path.basename(filepath)
        # Extract episode number (assuming format ragdatas1eXX.json)
        try:
            num_part = basename.split('e')[1].split('.')[0]
            return int(num_part)
        except (IndexError, ValueError):
            return 999 # Put files that don't match at the end

    sorted_json_files = sorted(json_files, key=sort_key)
    print("Files will be processed in this order:")
    for f in sorted_json_files:
        print(f"  - {os.path.basename(f)}")

    for file_path in sorted_json_files:
        print(f"  Processing file: {os.path.basename(file_path)}...")
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                episode_data = json.load(f)
                # episode_data should be the list of dictionaries for that episode
                if isinstance(episode_data, list):
                    all_spoiler_data.extend(episode_data) # Add this episode's list to the main list
                else:
                    print(f"    Warning: Expected a list in {file_path}, but got {type(episode_data)}. Skipping.")
        except json.JSONDecodeError:
            print(f"    Error decoding JSON from file: {file_path}")
        except Exception as e:
            print(f"    An error occurred processing file {file_path}: {e}")

print(f"\nLoaded total {len(all_spoiler_data)} spoiler events from {len(json_files)} files.")

# Basic validation
if not all_spoiler_data:
    print("Error: No data was loaded into all_spoiler_data. Check JSON files and paths.")
else:
    print("Data loading appears successful.")

print("--- Finished Snippet 2 ---")

--- Running Snippet 2: Load Spoiler Data ---
Loading JSON data from: /kaggle/input/thewirechunked2/
Found 13 JSON files matching pattern.
Files will be processed in this order:
  - ragdatas1e1.json
  - ragdatas1e2.json
  - ragdatas1e3.json
  - ragdatas1e4.json
  - ragdatas1e5.json
  - ragdatas1e6.json
  - ragdatas1e7.json
  - ragdatas1e8.json
  - ragdatas1e9.json
  - ragdatas1e10.json
  - ragdatas1e11.json
  - ragdatas1e12.json
  - ragdatas1e13.json
  Processing file: ragdatas1e1.json...
  Processing file: ragdatas1e2.json...
  Processing file: ragdatas1e3.json...
  Processing file: ragdatas1e4.json...
  Processing file: ragdatas1e5.json...
  Processing file: ragdatas1e6.json...
  Processing file: ragdatas1e7.json...
  Processing file: ragdatas1e8.json...
  Processing file: ragdatas1e9.json...
  Processing file: ragdatas1e10.json...
  Processing file: ragdatas1e11.json...
  Processing file: ragdatas1e12.json...
  Processing file: ragdatas1e13.json...

Loaded total 466 spoiler events fr

# --- Snippet 3: Load Embedding Model ---

## Purpose

This snippet focuses on loading the pre-trained Sentence Transformer model specified in the configuration. This model is essential for converting the textual spoiler data (loaded in Snippet 2) into numerical vector embeddings, which are required for storage and similarity search within the ChromaDB vector database.

## Key Actions

1.  **Import Necessary Class:** Imports the `SentenceTransformer` class from the `sentence_transformers` library.
2.  **Confirm Model Identifier:** Uses the `MODEL_NAME` variable (`'all-MiniLM-L6-v2'`) defined in Snippet 1 to specify which model to load.
    *   *Note:* It includes commented-out code (`# MODEL_PATH = ...` and the corresponding `SentenceTransformer` call) to show the alternative approach of loading the model from a specific local directory path, which might be used if the model files were provided as a Kaggle input dataset. The active code uses `MODEL_NAME`, implying the library will handle finding or downloading the model.
3.  **Status Logging:** Prints messages indicating the start of the loading process and which model (`MODEL_NAME`) is being loaded. It also includes a notice that this might involve downloading model files if they are not already cached locally (requiring internet access in environments like Kaggle).
4.  **Instantiate Model:**
    *   Uses a `try...except` block for robust loading.
    *   Attempts to create an instance of the model using `embedding_model = SentenceTransformer(MODEL_NAME, device='cuda')`.
        *   `MODEL_NAME`: Tells the library which model configuration and weights to use (e.g., `'all-MiniLM-L6-v2'`). The library typically downloads these from the Hugging Face Hub if needed.
        *   `device='cuda'`: Instructs the library to load the model onto a CUDA-enabled GPU, where embedding large batches of text is typically much faster than on a CPU. Because the device is hard-coded, loading is expected to fail (and be caught by the `except` block) when no GPU is available, so the Kaggle accelerator must be enabled.
5.  **Error Handling:**
    *   If any exception occurs during model loading (e.g., download error due to no internet, invalid model name, insufficient GPU memory, issues with underlying libraries like PyTorch/CUDA), the `except` block catches it.
    *   Prints an informative error message including the exception details (`e`).
    *   Provides hints for troubleshooting (check internet or the model path if that method was used).
    *   Includes a commented-out `exit()` call, suggesting that in a full script, execution might be halted if the model fails to load, as subsequent steps depend on it.
6.  **Success Confirmation:** If the model loads without errors, it prints a success message confirming the model is loaded onto the GPU.

## Context & Importance

This snippet is a critical prerequisite for creating the vector index.
*   It loads the **engine (`embedding_model`)** responsible for converting text into meaningful numerical representations.
*   Loading the model onto the **GPU (`'cuda'`)** is vital for achieving reasonable performance when processing potentially large amounts of text data in the next step.
*   The success of this step directly impacts the ability to proceed with generating embeddings for the spoiler data loaded in Snippet 2. Failure here means the index cannot be built.
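
Because the notebook hard-codes `device='cuda'`, one possible refinement is to pick the device at runtime. The sketch below is an assumption about how that could look, not what the notebook does; `pick_device` is a hypothetical helper built on PyTorch's `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")

# The notebook's call could then become, for example:
# embedding_model = SentenceTransformer(MODEL_NAME, device=device)
```

On a CPU-only session this trades speed for the ability to run at all, which may be acceptable for a few hundred short texts.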

In [None]:
# === Snippet 3: Load Embedding Model ===

from sentence_transformers import SentenceTransformer

# Configuration from Snippet 1 (using MODEL_NAME since MODEL_PATH wasn't a directory)
MODEL_NAME = 'all-MiniLM-L6-v2'
# MODEL_PATH = "/kaggle/input/sentencetransformers-allminilml6v2/all-MiniLM-L6-v2" # Use this if you loaded from Kaggle input instead

print("--- Running Snippet 3: Load Embedding Model ---")
# Make sure internet is enabled in Kaggle settings if using MODEL_NAME
print(f"Loading embedding model '{MODEL_NAME}'...")
print("(This may download the model files if not cached)...")

try:
    # Use MODEL_NAME to download/load
    embedding_model = SentenceTransformer(MODEL_NAME, device='cuda')
    # If you decided to load from Kaggle input path instead, use this line:
    # embedding_model = SentenceTransformer(MODEL_PATH, device='cuda')
    print("Embedding model loaded successfully onto GPU.")
except Exception as e:
    print(f"Error loading embedding model: {e}")
    print("Check internet connection (if downloading) or MODEL_PATH (if loading from input).")
    # Optionally add exit() here if you want the script to stop on error
    # exit()

print("--- Finished Snippet 3 ---")

--- Running Snippet 3: Load Embedding Model ---
Loading embedding model 'all-MiniLM-L6-v2'...
(This may download the model files if not cached)...



Embedding model loaded successfully onto GPU.
--- Finished Snippet 3 ---


# --- Snippet 4: Prepare Data for ChromaDB ---

## Purpose

This snippet takes the raw spoiler data loaded in Snippet 2 (a list of dictionaries, `all_spoiler_data`) and transforms it into the specific list formats required by ChromaDB for adding documents to a collection. These formats are: a list of text documents, a corresponding list of metadata dictionaries, and a corresponding list of unique document IDs.

## Key Actions

1.  **Prerequisite Check:**
    *   Verifies that the `all_spoiler_data` variable exists (was created in Snippet 2) and is not empty.
    *   If the data is missing or empty, it prints an error message and includes a commented-out `exit()` call, indicating that execution should ideally stop as subsequent steps would fail.

2.  **Data Extraction (List Comprehensions):**
    *   Uses list comprehensions for efficient extraction:
        *   `texts = [item['text'] for item in all_spoiler_data]`: Creates a list `texts` containing only the string value associated with the `'text'` key from each dictionary in `all_spoiler_data`. This list holds the content that will be embedded.
        *   `metadatas = [item['metadata'] for item in all_spoiler_data]`: Creates a list `metadatas` containing only the dictionary value associated with the `'metadata'` key from each item. These dictionaries will be stored alongside the vectors and can be used for filtering queries later.

3.  **Unique ID Generation:**
    *   Generates a unique string ID for each item in `all_spoiler_data`. Uniqueness is crucial for ChromaDB.
    *   **Strategy:** Creates IDs by combining the season and episode number (extracted from the item's metadata) with a sequential index *within that specific episode*. This ensures IDs like `s01e01_0`, `s01e01_1`, `s01e02_0`, etc.
    *   **Implementation:**
        *   Initializes an empty list `ids` and a dictionary `event_counters` to store the next available index for each episode key (e.g., `"s01e01"`).
        *   Iterates through `all_spoiler_data`.
        *   Extracts `season` and `episode` from the `metadata`.
        *   Formats an `episode_key` string (e.g., `s01e01`).
        *   Uses the `event_counters` dictionary to get the current index for that episode (initializing to 0 if the key is new).
        *   Constructs the ID string (e.g., `f"{episode_key}_{current_index}"`).
        *   Appends the generated ID to the `ids` list.
        *   Increments the counter for that `episode_key` in `event_counters`.

4.  **Logging and Verification:**
    *   Prints the number of items prepared in each list (`texts`, `metadatas`, `ids`).
    *   Prints the first few examples (ID, metadata, truncated text) for visual verification of the prepared data structure.
    *   **Crucial Validation:** Checks if the lengths of the three generated lists (`texts`, `metadatas`, `ids`) are all equal to each other *and* equal to the length of the original `all_spoiler_data`. This ensures data integrity and prevents errors when adding to ChromaDB. Prints an error if lengths mismatch.

5.  **Error Handling:**
    *   Wraps the core preparation logic in a `try...except` block.
    *   Catches `KeyError` specifically, which would occur if an item in `all_spoiler_data` is missing the expected `'text'` or `'metadata'` keys (or `'season'`, `'episode'` within metadata). It prints an informative error suggesting checking the JSON structure.
    *   Catches general `Exception` for any other unexpected errors during the preparation process.
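
A `KeyError` only reports the first missing key, not which record caused it. As a complementary sketch (not part of the notebook), the records could be scanned up front so that every malformed item is listed with its index. The helper `find_invalid_items` and the sample records are hypothetical; the integer check reflects the `:02d` formatting used when the IDs are built.

```python
def find_invalid_items(items):
    """Return (index, reason) pairs for records that do not match the expected shape."""
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append((index, "item is not a dict"))
            continue
        if not isinstance(item.get("text"), str):
            problems.append((index, "missing or non-string 'text'"))
        metadata = item.get("metadata")
        if not isinstance(metadata, dict):
            problems.append((index, "missing or non-dict 'metadata'"))
            continue
        for key in ("season", "episode"):
            # The ID format uses :02d, which requires integers.
            if not isinstance(metadata.get(key), int):
                problems.append((index, f"metadata[{key!r}] missing or not an int"))
    return problems


# Placeholder records: the first is valid, the other two are deliberately broken.
sample = [
    {"text": "placeholder event", "metadata": {"season": 1, "episode": 1}},
    {"text": "placeholder event", "metadata": {"season": 1}},
    {"metadata": {"season": 1, "episode": 2}},
]
issues = find_invalid_items(sample)
for index, reason in issues:
    print(f"item {index}: {reason}")
```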

## Context & Importance

This snippet is a critical data transformation step in the index creation pipeline.
*   It converts the heterogeneous list of dictionaries (`all_spoiler_data`) into the homogeneous lists (`texts`, `metadatas`, `ids`) that ChromaDB's `add` or `upsert` methods expect.
*   The quality and correctness of these three lists directly determine what gets stored in the vector database.
*   Generating **unique and meaningful IDs** is vital for managing documents within ChromaDB.
*   The **length validation check** is essential to catch errors *before* attempting to interact with ChromaDB, which requires these lists to be perfectly aligned. Failure in this snippet means the data cannot be correctly ingested into the database in the next step.
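
The per-episode counter scheme described above can be condensed with `collections.defaultdict`. This is a minimal sketch of the same idea under the assumption that `season` and `episode` are integers; `make_ids` and the sample records are illustrative names, not code from the notebook.

```python
from collections import defaultdict


def make_ids(items):
    """Build IDs of the form s01e01_0, s01e01_1, ... with one counter per episode."""
    counters = defaultdict(int)  # next free index for each episode key
    ids = []
    for item in items:
        meta = item["metadata"]
        key = f"s{meta['season']:02d}e{meta['episode']:02d}"
        ids.append(f"{key}_{counters[key]}")
        counters[key] += 1
    return ids


# Placeholder records covering two episodes.
sample = [
    {"text": "a", "metadata": {"season": 1, "episode": 1}},
    {"text": "b", "metadata": {"season": 1, "episode": 1}},
    {"text": "c", "metadata": {"season": 1, "episode": 2}},
]
ids = make_ids(sample)
print(ids)

# Uniqueness check: a set drops duplicates, so equal sizes mean no collisions.
all_unique = len(set(ids)) == len(ids)
print("all unique:", all_unique)
```

Because the counter is keyed by episode, the IDs stay stable for an episode even if events from other episodes are added or removed.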

In [None]:
# === Snippet 4: Prepare Data for ChromaDB ===

print("--- Running Snippet 4: Prepare Data for ChromaDB ---")

# Ensure all_spoiler_data exists from Snippet 2
if 'all_spoiler_data' not in globals() or not all_spoiler_data:
    print("Error: 'all_spoiler_data' not found or is empty. Please run Snippet 2 first.")
    # exit() # Optional: stop execution if data is missing
else:
    print(f"Preparing {len(all_spoiler_data)} items for ChromaDB...")

    try:
        # Extract text documents
        texts = [item['text'] for item in all_spoiler_data]

        # Extract metadata dictionaries
        metadatas = [item['metadata'] for item in all_spoiler_data]

        # Create unique IDs (e.g., s01e01_0, s01e01_1, s01e02_0, ...)
        # Using episode and index within episode for uniqueness
        ids = []
        event_counters = {} # Track index within each episode
        for item in all_spoiler_data:
            season = item['metadata']['season']
            episode = item['metadata']['episode']
            episode_key = f"s{season:02d}e{episode:02d}"
            if episode_key not in event_counters:
                event_counters[episode_key] = 0
            current_index = event_counters[episode_key]
            ids.append(f"{episode_key}_{current_index}")
            event_counters[episode_key] += 1

        print(f"Prepared {len(texts)} texts.")
        print(f"Prepared {len(metadatas)} metadata entries.")
        print(f"Generated {len(ids)} unique IDs.")

        # Optional: Print first few examples to verify
        print("\nExample Data:")
        for i in range(min(3, len(texts))):
            print(f"  ID: {ids[i]}, Metadata: {metadatas[i]}, Text: '{texts[i][:50]}...'")

        # Basic validation
        if not (len(texts) == len(metadatas) == len(ids) == len(all_spoiler_data)):
            print("Error: Length mismatch between texts, metadatas, and IDs!")
        else:
            print("\nData preparation successful.")

    except KeyError as e:
        print(f"Error: Missing key {e} in one of the spoiler data dictionaries.")
        print("Please check the structure of your JSON files.")
    except Exception as e:
        print(f"An unexpected error occurred during data preparation: {e}")


print("--- Finished Snippet 4 ---")

--- Running Snippet 4: Prepare Data for ChromaDB ---
Preparing 466 items for ChromaDB...
Prepared 466 texts.
Prepared 466 metadata entries.
Generated 466 unique IDs.

Example Data:
  ID: s01e01_0, Metadata: {'season': 1, 'episode': 1}, Text: 'Homicide Detective Jimmy McNulty investigates the ...'
  ID: s01e01_1, Metadata: {'season': 1, 'episode': 1}, Text: 'McNulty persuades an eyewitness to the Snot Boogie...'
  ID: s01e01_2, Metadata: {'season': 1, 'episode': 1}, Text: 'During D'Angelo Barksdale's murder trial, key witn...'

Data preparation successful.
--- Finished Snippet 4 ---


# --- Snippet 5: Initialize ChromaDB and Create Collection ---

## Purpose

This snippet initializes the connection to the ChromaDB vector database, ensuring it persists on disk, and then creates or retrieves the specific collection where the spoiler embeddings and metadata will be stored. This sets up the database structure ready to receive the data prepared in the previous steps.

## Key Actions

1.  **Import Libraries:** Imports `chromadb` for database interaction and `os` for filesystem operations (specifically, creating directories).
2.  **Use Configuration:** References `OUTPUT_DB_PATH` (where the database files will be stored) and `COLLECTION_NAME` (the name of the "table" within the database) defined in Snippet 1.
3.  **Ensure Output Directory:**
    *   Uses `os.makedirs(OUTPUT_DB_PATH, exist_ok=True)` to create the directory specified by `OUTPUT_DB_PATH` if it doesn't already exist.
    *   `exist_ok=True` prevents an error if the directory already exists. `PersistentClient` usually creates the directory itself, so this call is a precaution rather than a requirement.
4.  **Initialize ChromaDB Client:**
    *   Uses a `try...except` block for robust initialization.
    *   Creates a `PersistentClient` instance: `client = chromadb.PersistentClient(path=OUTPUT_DB_PATH)`.
        *   `PersistentClient`: This type of client tells ChromaDB to store its data (vectors, metadata, index structures) on the local filesystem at the specified `path`. This allows the database to persist across script runs or notebook sessions, unlike an in-memory client.
    *   Logs success or prints an error message if initialization fails (e.g., due to permissions issues). Includes an optional commented-out `exit()` for halting execution on failure.
5.  **Get or Create Collection:**
    *   Uses a `try...except` block for robustness.
    *   Calls `collection = client.get_or_create_collection(...)`:
        *   This is an idempotent operation: If a collection with the specified `name` already exists in the database at `OUTPUT_DB_PATH`, it retrieves and returns that collection object. If it doesn't exist, it creates a new collection with that name and returns it. This prevents errors if the script is run multiple times.
        *   `name=COLLECTION_NAME`: Specifies the desired name for the collection.
        *   `metadata={"hnsw:space": "cosine"}`: Sets the distance metric used for similarity search in the collection's HNSW index (Chroma's default index type). Without this setting, Chroma falls back to squared L2 distance. Cosine distance, defined as 1 minus cosine similarity, is a common choice for sentence embeddings such as those produced by `all-MiniLM-L6-v2`, because it compares the direction of two vectors rather than their magnitude.
    *   Logs success, indicating the collection is ready, and prints the current number of items in the collection (`collection.count()`). This count will be 0 if the collection was just created, or non-zero if the snippet is re-run after data has been added previously.
    *   Catches and logs any errors during collection access/creation. The `exit()` call is again commented out, so a failure here is only reported and `collection` may be left undefined for the next snippet.
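
The directory step in the list above relies on `os.makedirs(..., exist_ok=True)` being safe to repeat. The following is a minimal sketch of that behavior using only the standard library; the directory name `example_chroma_db` is a hypothetical stand-in and a temporary directory replaces `/kaggle/working`.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as base:
    db_path = os.path.join(base, "example_chroma_db")  # hypothetical path

    # First call creates the directory; the second is a no-op
    # because exist_ok=True suppresses FileExistsError.
    os.makedirs(db_path, exist_ok=True)
    os.makedirs(db_path, exist_ok=True)
    created = os.path.isdir(db_path)

    # Without exist_ok=True the repeated call raises instead.
    try:
        os.makedirs(db_path)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

This is why re-running the cell does not fail on the directory step, regardless of whether an earlier run already created the path.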

## Context & Importance

This snippet establishes the actual database environment where the vector data will live.
*   **Persistence:** Using `PersistentClient` ensures that the created index is saved to disk and can be reloaded later or in different sessions without rebuilding it from scratch (as long as the `OUTPUT_DB_PATH` remains accessible).
*   **Collection Setup:** `get_or_create_collection` provides robustness against re-running the script and correctly sets up the collection structure.
*   **Distance Metric:** Specifying `"hnsw:space": "cosine"` makes the index compare embeddings by the angle between them rather than by their magnitude, which is how sentence embeddings are usually compared, so later searches return semantically meaningful neighbors.
*   This step is the **immediate prerequisite** for adding the prepared data (`texts`, `metadatas`, `ids` from Snippet 4) into the database using the `collection.add()` method in the next snippet. Failure here means there is no database collection ready to accept the data.
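
To make the distance metric concrete, here is a minimal pure-Python sketch of cosine distance, defined as 1 minus cosine similarity. The function name `cosine_distance` and the toy two-dimensional vectors are illustrative assumptions; ChromaDB computes this inside its index, and this sketch is not its implementation.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance 0 (magnitude is ignored).
same = cosine_distance([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors: distance 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite directions: distance 2, the maximum.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Smaller values therefore indicate closer matches, with 0 for vectors pointing the same way and 2 for vectors pointing in opposite directions.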

In [None]:
# === Snippet 5: Initialize ChromaDB and Create Collection ===

import chromadb
import os

# Configuration from Snippet 1
OUTPUT_DB_PATH = "/kaggle/working/the_wire_s1_chroma_db_3"
COLLECTION_NAME = "wire_s1_spoilers_3"

print("--- Running Snippet 5: Initialize ChromaDB ---")

# Ensure the output directory exists (PersistentClient usually creates it, but good practice)
print(f"Ensuring output directory exists: {OUTPUT_DB_PATH}")
os.makedirs(OUTPUT_DB_PATH, exist_ok=True)

print(f"Initializing ChromaDB PersistentClient at: {OUTPUT_DB_PATH}")
try:
    client = chromadb.PersistentClient(path=OUTPUT_DB_PATH)
    print("ChromaDB client initialized.")
except Exception as e:
    print(f"Error initializing ChromaDB client: {e}")
    # exit() # Optional

print(f"Getting or creating collection: '{COLLECTION_NAME}'")
try:
    # Use get_or_create_collection for robustness
    # Specify cosine distance as it's good for sentence embeddings
    collection = client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"}
    )
    print(f"Collection '{COLLECTION_NAME}' ready. Current item count: {collection.count()}")
    # Note: Count will be 0 if newly created, or >0 if the collection already exists on disk from an earlier run
except Exception as e:
    print(f"Error getting or creating ChromaDB collection: {e}")
    # exit() # Optional

print("--- Finished Snippet 5 ---")

--- Running Snippet 5: Initialize ChromaDB ---
Ensuring output directory exists: /kaggle/working/the_wire_s1_chroma_db_3
Initializing ChromaDB PersistentClient at: /kaggle/working/the_wire_s1_chroma_db_3
ChromaDB client initialized.
Getting or creating collection: 'wire_s1_spoilers_3'
Collection 'wire_s1_spoilers_3' ready. Current item count: 0
--- Finished Snippet 5 ---


# --- Snippet 6: Embed and Add Data to ChromaDB ---

## Purpose

This snippet performs the core task of the index creation process: generating vector embeddings for the prepared text data using the loaded Sentence Transformer model and then adding the data (embeddings, original text, metadata, and unique IDs) into the designated ChromaDB collection in batches.

## Key Actions

1.  **Prerequisite Checks:**
    *   Verifies that essential variables from previous snippets (`embedding_model`, `collection`, `ids`, `texts`, `metadatas`) exist in the global scope. If any are missing, it prints an error and halts execution (`exit()`), preventing errors in subsequent operations.
2.  **Configuration & Initialization:**
    *   Uses the `BATCH_SIZE` defined in Snippet 1 to control how many items are processed at once.
    *   Prints status messages indicating the start of the process, the total number of items, and the batch size.
    *   Initializes `start_time` using `time.time()` to measure the duration of the embedding and adding process.
    *   Initializes `total_items_added` counter to track progress.
3.  **Batch Processing Loop:**
    *   Iterates through the data (`ids`, `texts`, `metadatas`) in steps of `BATCH_SIZE` using a `for` loop and `range(0, len(texts), BATCH_SIZE)`. Batching is crucial for managing memory (especially GPU VRAM during embedding) and preventing potential timeouts or crashes when dealing with large datasets.
    *   **Data Slicing:** Inside the loop, it extracts slices of the `ids`, `texts`, and `metadatas` lists corresponding to the current batch using Python list slicing (e.g., `ids[i:i+BATCH_SIZE]`).
    *   **Progress Logging:** Prints the current batch number and the range of item indices being processed in that batch.
    *   **Embedding Generation:**
        *   Uses a `try...except` block for robustness.
        *   Calls `embedding_model.encode()` on the `batch_texts`.
            *   `convert_to_numpy=True`: Returns the embeddings for the batch as a single NumPy array (this is also the library's default).
            *   `show_progress_bar=False`: Suppresses the default progress bar from the `sentence-transformers` library to keep the console output cleaner within the loop.
            *   `device='cuda'`: Requests that the computation run on the GPU, matching the model loaded onto CUDA in Snippet 3. Because the value is hard-coded, a session without a GPU raises an error here, which the `except` branch reports before skipping the batch.
        *   Converts the resulting NumPy array (`batch_embeddings_np`) into a list of lists (`batch_embeddings_list`) using `.tolist()`, a plain-Python format that ChromaDB's `add` method accepts.
        *   Logs the number of embeddings generated for the batch.
        *   If an error occurs during embedding, it prints an error message and uses `continue` to skip the rest of the loop for this batch and proceed to the next one.
    *   **Adding Data to ChromaDB:**
        *   Uses a `try...except` block.
        *   Calls `collection.add()` to insert the current batch's data into the ChromaDB collection.
            *   `ids=batch_ids`: Provides the unique identifiers for the documents.
            *   `embeddings=batch_embeddings_list`: Provides the generated vector embeddings.
            *   `metadatas=batch_metadatas`: Provides the associated metadata dictionaries.
            *   `documents=batch_texts`: **Crucially**, also stores the original text content alongside the embedding. This allows retrieving the actual text during queries without needing a separate lookup based on ID.
        *   Increments `total_items_added` by the number of items successfully processed in the batch.
        *   Logs the successful addition of the batch.
        *   If an error occurs while adding data (e.g., database connectivity issue, data validation error within ChromaDB), it prints an error message. (Note: Depending on the error, one might choose to stop the entire process or just log and continue).
4.  **Timing and Summary:**
    *   Calculates the total `elapsed_time` after the loop finishes.
    *   Prints the total time taken and the `total_items_added`.
5.  **Verification:**
    *   Performs a final check by calling `collection.count()` to get the total number of items currently stored in the ChromaDB collection.
    *   Compares this `final_count` to the expected number (`len(texts)`).
    *   Prints a confirmation message if the counts match or a warning if they differ, indicating potential issues during the add process (e.g., skipped batches due to errors).
6.  **Final Instructions:**
    *   Prints the path (`OUTPUT_DB_PATH`) where the populated ChromaDB index data is stored.
    *   Provides guidance on the next logical step in a typical Kaggle workflow: committing the notebook and saving the contents of the output directory as a new Kaggle Dataset so the created index can be easily reused in other notebooks.
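
The batch arithmetic from the loop described above can be sketched on its own, without the model or the database. The helper name `batch_bounds` is a hypothetical illustration; the numbers 466 and 128 are the item count and batch size reported in this notebook's output.

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, stop) slice bounds that cover range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        # The last batch is clipped so it never runs past the data.
        yield start, min(start + batch_size, n_items)

bounds = list(batch_bounds(466, 128))
# Ceiling division, the same formula the snippet uses for total_batches.
total_batches = (466 + 128 - 1) // 128
last_batch_size = bounds[-1][1] - bounds[-1][0]

print(bounds)           # [(0, 128), (128, 256), (256, 384), (384, 466)]
print(total_batches)    # 4
print(last_batch_size)  # 82
```

Python slicing already clips at the end of a list, so `ids[i:i+BATCH_SIZE]` gives the same short final batch without an explicit `min`; the clipping is shown here only to make the bounds visible.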

## Context & Importance

This is the **most computationally intensive** part of the index creation pipeline.
*   It translates the textual data into high-dimensional vectors using the ML model (embedding).
*   It populates the vector database with these embeddings, along with their associated IDs, metadata, and the original text documents.
*   The use of **batching** and **GPU acceleration** (`device='cuda'`) is critical for making this process feasible for non-trivial amounts of data within reasonable time and memory limits.
*   Storing the **original documents** alongside the embeddings via `collection.add(documents=...)` is a key feature, simplifying later retrieval steps.
*   The successful execution of this snippet results in a **fully populated, persistent ChromaDB vector index** stored at `OUTPUT_DB_PATH`, ready to be saved and used for semantic search and RAG applications.
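
Since the snippet passes `device='cuda'` unconditionally, one possible variation is to choose the device at runtime and fall back to the CPU. The helper `pick_device` below is a hypothetical sketch, not part of the original notebook; it assumes PyTorch may or may not be installed and uses only `torch.cuda.is_available()`.

```python
def pick_device(preferred="cuda"):
    """Return 'cuda' when requested and usable, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
print(f"Embedding would run on: {device}")

# Assumed usage with the notebook's model (not executed here):
# embedding_model.encode(batch_texts, convert_to_numpy=True, device=device)
```

On a CPU-only session this trades speed for a run that still completes, instead of every batch being skipped by the error handler.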

In [None]:
# === Snippet 6: Embed and Add Data to ChromaDB ===

import time
import numpy as np # Ensure numpy is imported

# Ensure variables from previous snippets exist
if 'embedding_model' not in globals(): print("Error: embedding_model not found."); exit()
if 'collection' not in globals(): print("Error: collection not found."); exit()
if 'ids' not in globals(): print("Error: ids not found."); exit()
if 'texts' not in globals(): print("Error: texts not found."); exit()
if 'metadatas' not in globals(): print("Error: metadatas not found."); exit()

# Configuration from Snippet 1
BATCH_SIZE = 128

print("--- Running Snippet 6: Embed and Add Data ---")
print(f"Embedding {len(texts)} texts and adding to ChromaDB in batches of {BATCH_SIZE}...")

start_time = time.time()
total_items_added = 0

# Number of batches, computed once with ceiling division
total_batches = (len(texts) + BATCH_SIZE - 1) // BATCH_SIZE

# Process data in batches
for i in range(0, len(texts), BATCH_SIZE):
    # Slice data for the current batch
    batch_ids = ids[i:i+BATCH_SIZE]
    batch_texts = texts[i:i+BATCH_SIZE]
    batch_metadatas = metadatas[i:i+BATCH_SIZE]

    current_batch_num = i // BATCH_SIZE + 1
    print(f"  Processing Batch {current_batch_num}/{total_batches} (Items {i+1} to {min(i+BATCH_SIZE, len(texts))})...")

    # Generate embeddings for the current batch
    try:
        batch_embeddings_np = embedding_model.encode(
            batch_texts,
            convert_to_numpy=True, # Get numpy array
            show_progress_bar=False, # Keep output clean
            device='cuda' # Ensure GPU is used
        )
        # Convert numpy array to list of lists for ChromaDB
        batch_embeddings_list = batch_embeddings_np.tolist()
        print(f"    Generated {len(batch_embeddings_list)} embeddings for batch.")

    except Exception as e:
        print(f"    Error generating embeddings for batch {current_batch_num}: {e}")
        print("    Skipping this batch.")
        continue # Skip to the next batch

    # Add the batch to the ChromaDB collection
    try:
        collection.add(
            ids=batch_ids,
            embeddings=batch_embeddings_list,
            metadatas=batch_metadatas,
            documents=batch_texts # Store original text as well
        )
        total_items_added += len(batch_ids)
        print(f"    Successfully added batch {current_batch_num} to ChromaDB.")
    except Exception as e:
        print(f"    Error adding batch {current_batch_num} to ChromaDB: {e}")
        # Consider whether to stop or continue if a batch fails

end_time = time.time()
elapsed_time = end_time - start_time

print(f"\nFinished embedding and adding data in {elapsed_time:.2f} seconds.")
print(f"Total items added attempt: {total_items_added} (should match total items if no errors).")

# --- Verification ---
print("\nVerifying collection count...")
try:
    final_count = collection.count()
    print(f"Final item count in collection '{COLLECTION_NAME}': {final_count}")
    if final_count != len(texts):
        print(f"Warning: Expected {len(texts)} items, but final collection count is {final_count}.")
    else:
        print("Collection count matches expected number of items.")
except Exception as e:
    print(f"Error getting final collection count: {e}")

print(f"\nChromaDB index data should now be populated in: {OUTPUT_DB_PATH}")
print("Next step: Commit the notebook and save this output directory as a new Kaggle Dataset.")
print("--- Finished Snippet 6 ---")

--- Running Snippet 6: Embed and Add Data ---
Embedding 466 texts and adding to ChromaDB in batches of 128...
  Processing Batch 1/4 (Items 1 to 128)...
    Generated 128 embeddings for batch.
    Successfully added batch 1 to ChromaDB.
  Processing Batch 2/4 (Items 129 to 256)...
    Generated 128 embeddings for batch.
    Successfully added batch 2 to ChromaDB.
  Processing Batch 3/4 (Items 257 to 384)...
    Generated 128 embeddings for batch.
    Successfully added batch 3 to ChromaDB.
  Processing Batch 4/4 (Items 385 to 466)...
    Generated 82 embeddings for batch.
    Successfully added batch 4 to ChromaDB.

Finished embedding and adding data in 1.17 seconds.
Total items added attempt: 466 (should match total items if no errors).

Verifying collection count...
Final item count in collection 'wire_s1_spoilers_3': 466
Collection count matches expected number of items.

ChromaDB index data should now be populated in: /kaggle/working/the_wire_s1_chroma_db_3
Next step: Commit the no