
## Hybrid LLM + Semantic Search Experiment on Manim CodeGen Dataset

**Purpose:**  
For each Manim code snippet, this notebook:
- Generates an LLM-based explanation of the code
- Uses Sentence Transformers to rank the dataset queries most similar to that explanation
- Prompts an LLM to select the best-matching query from the top-N candidates
- Repeats the process over several runs and collects the results for further analysis

**Instructions:**  
1. Set the configuration variables below (API key path, index range, number of runs, etc.)
2. Run all cells in order


### 1. Imports & Configuration

In [1]:
import torch
import time
import sys
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from datasets import load_dataset
from utils import llm_tools, tools_local

import openai


In [2]:
# ========== Configuration Variables ==========
API_KEY_PATH = "open_ai_API.txt"
DATASET_NAME = "generaleoley/manim-codegen"
DATASET_SPLIT = "train"
START_INDEX = 30   # Index to start looping over code snippets
END_INDEX = 50     # Index to end loop (exclusive)
RUNS = 4           # Number of repeat runs for experiment
TOP_N = 5          # Number of top similar queries to consider for LLM selection
LLM_MODEL = "gpt-4o-mini"
LLM_MAX_TOKENS = 700
EXPLAIN_TEMP = 0.01
COMPARE_MAX_TOKENS = 1000
COMPARE_TEMP = 0.01

### 2. Environment & Resource Check


In [3]:

print("Python version:", sys.version)
print("CUDA Available:", torch.cuda.is_available())
if torch.cuda.is_available():
    device_id = torch.cuda.current_device()
    print(f"Current device ID: {device_id}")
    print(f"Current device name: {torch.cuda.get_device_name(device_id)}")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')




Python version: 3.9.21 (main, Dec  5 2024, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)]
CUDA Available: True
Current device ID: 0
Current device name: NVIDIA TITAN Xp


### 3. API Key Handling

In [4]:
def load_api_key(path: str) -> str:
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except Exception as e:
        # Chain the original exception so the underlying cause stays in the traceback.
        raise RuntimeError(f"Error loading OpenAI API key: {e}") from e

openai.api_key = load_api_key(API_KEY_PATH)
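
An alternative to a plain-text key file is to check an environment variable first and fall back to the file only if it is unset. The sketch below is one way to do that, not part of the original notebook: `resolve_api_key` is a hypothetical helper name, and the choice of `OPENAI_API_KEY` as the variable mirrors the name the OpenAI client library itself looks up.

```python
import os
from pathlib import Path


def resolve_api_key(path: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment if set, otherwise from a file.

    Hypothetical helper: the notebook itself only reads the file.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    key_file = Path(path)
    if not key_file.is_file():
        raise RuntimeError(f"{env_var} is not set and no key file exists at {path!r}")

    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"Key file {path!r} is empty")
    return key
```

With this helper, the assignment above could read `openai.api_key = resolve_api_key(API_KEY_PATH)`, which keeps the key out of the working directory on shared machines.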

### 4. Load Dataset and Initialize Utilities


In [5]:

print("Loading dataset...")
data = load_dataset(DATASET_NAME, split=DATASET_SPLIT)
print(f"Dataset loaded: {DATASET_NAME}, split: {DATASET_SPLIT}. Total samples: {len(data)}")

lt = llm_tools(api_key=openai.api_key)
tl = tools_local()

# Load embedding model ONCE (outside the loop)
print("Loading embedding model...")
model1 = SentenceTransformer('all-MiniLM-L6-v2').to(device)


Loading dataset...
Dataset loaded: generaleoley/manim-codegen, split: train. Total samples: 1622
Loading embedding model...


### 5. Main Experiment Loop: Hybrid Retrieval


In [6]:
def safe_openai_chat_completion(messages, model, max_tokens, temperature):
    """Call OpenAI chat completion API with error handling."""
    try:
        response = openai.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"OpenAI API call failed: {e}")
        return None

all_runs_searched_indices = []
all_runs_top_similarities = []

start_time = time.time()

for run in range(RUNS):
    sorted_set = []
    run_top_similarities = []

    for i in range(START_INDEX, END_INDEX):
        reference = i
        code_string = data[reference]['answer']
        formatted_code_string = code_string.replace('\\n', '\n')

        # Step 1: LLM-based Explanation
        messages_explain = [
            {"role": "user", "content": f"Explain the purpose of this code without getting into technical details.\n\nParagraph 1:\n{formatted_code_string}"},
            {"role": "user", "content": f"In detail, explain what is happening with the visuals.\n\nParagraph 1:\n{formatted_code_string}"}
        ]
        open_gen = safe_openai_chat_completion(messages_explain, LLM_MODEL, LLM_MAX_TOKENS, EXPLAIN_TEMP)
        if open_gen is None:
            print(f"Skipping iteration {i} due to LLM error.")
            continue

        # Step 2: Semantic Similarity Ranking (embeddings)
        top_similarities = lt.similarity_ranking(data, model1, open_gen, TOP_N)
        all_scores = [item['score'] for item in top_similarities]
        all_indices = [item['index'] for item in top_similarities]
        all_queries = [item['query'] for item in top_similarities]
        run_top_similarities.append({
            'iteration': i,
            'indices': all_indices,
            'scores': all_scores
        })

        # Step 3: LLM Comparison on Top-N Queries
        messages2 = lt.create_message_for_comparison(code_string, all_queries)
        response_content = safe_openai_chat_completion(messages2, LLM_MODEL, COMPARE_MAX_TOKENS, COMPARE_TEMP)
        if response_content is None:
            print(f"Skipping comparison step for iteration {i} due to LLM error.")
            continue

        # Step 4: Final Search - Find index in full dataset based on LLM output
        searched_index = tl.searching(response_content, reference, data)
        print(f"Run {run}, Iteration {i}, Searched Index: {searched_index}")
        sorted_set.append(searched_index)

    all_runs_searched_indices.append(sorted_set)
    all_runs_top_similarities.append(run_top_similarities)
    print(f"Completed Run {run}")

end_time = time.time()
print(f"\nTotal execution time: {end_time - start_time:.2f} seconds")

Run 0, Iteration 30, Searched Index: [(30, 117)]
Run 0, Iteration 31, Searched Index: [(31, 308)]
Run 0, Iteration 32, Searched Index: [(32, 314)]
Run 0, Iteration 33, Searched Index: [(33, 287)]
Run 0, Iteration 34, Searched Index: [(34, 447)]
Run 0, Iteration 35, Searched Index: [(35, 116)]
Run 0, Iteration 36, Searched Index: [(36, 223)]
Run 0, Iteration 37, Searched Index: [(37, 556)]
Run 0, Iteration 38, Searched Index: [(38, 84)]
Run 0, Iteration 39, Searched Index: [(39, 193)]
Run 0, Iteration 40, Searched Index: [(40, 768)]
Run 0, Iteration 41, Searched Index: [(41, 804)]
Run 0, Iteration 42, Searched Index: [(42, 595)]
Run 0, Iteration 43, Searched Index: [(43, 749)]
Run 0, Iteration 44, Searched Index: [(44, 564)]
Run 0, Iteration 45, Searched Index: [(45, 643)]
Run 0, Iteration 46, Searched Index: []
Run 0, Iteration 47, Searched Index: [(47, 933)]
Run 0, Iteration 48, Searched Index: [(48, 1001)]
Run 0, Iteration 49, Searched Index: [(49, 506)]
Completed Run 0
Run 1, Iterat... [output truncated]
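
Step 2 of the loop relies on `lt.similarity_ranking`, whose implementation lives in the local `utils` module and is not shown here. As a rough illustration of what an embedding-based top-N ranking does, the sketch below scores every corpus vector against a query vector by cosine similarity and keeps the best `top_n`. The function names are hypothetical; only the result keys (`index`, `score`, `query`) are taken from how the loop reads the ranking output. Plain Python lists stand in for the embeddings so that the example runs without a model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_top_n(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    queries: Sequence[str],
    top_n: int,
) -> List[Dict]:
    """Return the top_n corpus entries most similar to query_vec, best first."""
    scored = [
        {"index": idx, "score": cosine(query_vec, vec), "query": queries[idx]}
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_n]
```

In the notebook setting, the vectors would presumably come from `model1.encode(...)` applied to the LLM explanation and to the dataset queries. Encoding the dataset queries once, before the run loop, would avoid recomputing the same embeddings on every iteration.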

### 6. Results Summary


In [7]:

for run_idx, run_top in enumerate(all_runs_top_similarities):
    print(f"\nTop Similarities from Run {run_idx + 1}:")
    for item in run_top:
        print(f"Iteration {item['iteration']} - Indices: {item['indices']}, Scores: {item['scores']}")

for run_idx, run_results in enumerate(all_runs_searched_indices):
    print(f"\nResults from Run {run_idx + 1}: {run_results}")

print("\nExperiment complete. Use the above indices and scores for further analysis or visualization.")




Top Similarities from Run 1:
Iteration 30 - Indices: [117, 185, 113, 783, 114], Scores: [0.6476756930351257, 0.5698220729827881, 0.5655390024185181, 0.5581626296043396, 0.5419474840164185]
Iteration 31 - Indices: [229, 308, 232, 269, 1011], Scores: [0.7176162600517273, 0.627753734588623, 0.6198282837867737, 0.5995882749557495, 0.5909781455993652]
Iteration 32 - Indices: [314, 31, 349, 468, 729], Scores: [0.7405128479003906, 0.6769720315933228, 0.6182374954223633, 0.6159311532974243, 0.6150411367416382]
Iteration 33 - Indices: [287, 209, 291, 233, 212], Scores: [0.7922728061676025, 0.7172515988349915, 0.7120786309242249, 0.6544233560562134, 0.6282938718795776]
Iteration 34 - Indices: [447, 549, 579, 755, 558], Scores: [0.8150026202201843, 0.8014081120491028, 0.7923883199691772, 0.7848175168037415, 0.7741966247558594]
Iteration 35 - Indices: [116, 872, 115, 118, 877], Scores: [0.6955097317695618, 0.5502761602401733, 0.5375751256942749, 0.5203375816345215, 0.5070396065711975]
Iteration 3... [output truncated]
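
The two result lists printed above can be compared directly. The sketch below shows two simple measures, under the assumption that the data has the shapes seen in the output: each entry of `all_runs_top_similarities` is a list of dicts with `iteration`, `indices`, and `scores`, and each entry of `all_runs_searched_indices` is a list of results, where a result is either empty or holds one `(reference, matched)` tuple. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def llm_agrees_with_top1(run_top, run_searched):
    """Fraction of iterations where the LLM pick equals the top-ranked embedding index."""
    top1 = {item["iteration"]: item["indices"][0] for item in run_top if item["indices"]}
    picks = {ref: matched for result in run_searched for ref, matched in result}
    shared = [ref for ref in picks if ref in top1]
    if not shared:
        return 0.0
    return sum(picks[ref] == top1[ref] for ref in shared) / len(shared)


def cross_run_agreement(all_runs_searched):
    """Map each reference index to (most common pick, share of picks that chose it)."""
    picks = defaultdict(list)
    for run_searched in all_runs_searched:
        for result in run_searched:
            for ref, matched in result:
                picks[ref].append(matched)

    summary = {}
    for ref, matched_list in picks.items():
        winner, count = Counter(matched_list).most_common(1)[0]
        summary[ref] = (winner, count / len(matched_list))
    return summary
```

For example, `llm_agrees_with_top1(all_runs_top_similarities[0], all_runs_searched_indices[0])` would report how often the LLM simply confirmed the embedding model's first choice in the first run. In the output shown, iteration 30 agrees (both give 117) while iteration 31 does not (top-ranked 229, LLM pick 308). Iterations with an empty result are left out of both measures.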

### 7. Next Steps & Analysis Instructions
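
- **Interpret Results:** Each "Searched Index" entry is a list holding one tuple `(reference index, matched query index)`; an empty list (as in iteration 46 of run 0) means the final search returned no match. Check how often the correct query appears among the top-N candidates and how often the LLM picks it. Note that the loop log numbers runs from 0, while the summary cell numbers them from 1 (`run_idx + 1`).
- **Visualization:** Plot histograms of the similarity scores or of match counts across runs.
- **Parameter Tuning:** Adjust `TOP_N`, `RUNS`, or the `START_INDEX`/`END_INDEX` range to expand the experiment.
- **Error Handling:** Failed API calls are printed and the affected iteration is skipped; rerun those iterations if needed.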
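
Rather than rerunning skipped iterations by hand, transient API failures can be retried in place. The following is a minimal sketch, assuming that a `None` return value signals failure, as it does for `safe_openai_chat_completion`. The helper name `call_with_retries` and its parameters are hypothetical and not part of the notebook.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], Optional[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn until it returns a non-None value or max_attempts is reached.

    Waits base_delay * 2**attempt seconds between attempts (exponential backoff).
    """
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:  # no wait after the final attempt
            sleep(base_delay * (2 ** attempt))
    return None
```

In the loop, the explanation call could then be wrapped as `call_with_retries(lambda: safe_openai_chat_completion(messages_explain, LLM_MODEL, LLM_MAX_TOKENS, EXPLAIN_TEMP))`. The injectable `sleep` argument exists only so the backoff can be tested without waiting.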
