# 📘 Promptimus Prime: LLM-AutoDiff Reproduction

## 🤖 **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**

Welcome to **Promptimus Prime**! This notebook reproduces the experiments from the paper *"LLM-AutoDiff: Auto-Differentiate Any LLM Workflow"*.

We use **Textual Gradient Descent (TGD)** to automatically optimize system prompts for Large Language Models. Instead of engineering the prompt by hand, we treat its text as a set of trainable parameters.

### 🧮 **The Task: GSM8K (Grade School Math)**
*   **Goal:** Solve multi-step mathematical reasoning problems.
*   **Student Model:** `Qwen2.5-1.5B-Instruct` (Lightweight, efficient).
*   **Teacher Model:** `Qwen2.5-7B-Instruct` (Stronger reasoning capabilities).
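
To make the task concrete: GSM8K reference solutions end with a line of the form `#### <number>`, so checking a prediction comes down to comparing two numbers. The sketch below shows one way to do that. The helper names are hypothetical, and taking the last number in the Student's output as its final answer is an assumption for illustration, not necessarily how this repository parses answers.

```python
import re
from typing import Optional


def extract_gold_answer(solution: str) -> Optional[str]:
    """Return the number that follows the '####' marker in a GSM8K reference solution."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def extract_predicted_answer(output: str) -> Optional[str]:
    """Assumption: treat the last number in the model output as its final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(output: str, solution: str) -> bool:
    """Compare prediction and reference numerically, so '72' and '72.0' agree."""
    predicted = extract_predicted_answer(output)
    gold = extract_gold_answer(solution)
    if predicted is None or gold is None:
        return False
    return float(predicted) == float(gold)


# Illustrative reference solution in the GSM8K style (not taken from the dataset).
reference = "She sold 48 + 24 = 72 clips in total.\n#### 72"
print(is_correct("48 + 24 = 72, so the answer is 72.", reference))
```

Comparing as floats rather than strings avoids false mismatches caused by formatting differences such as thousands separators or a trailing `.0`.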

### 🛠️ **Architecture (Peer Nodes)**
We implement the full **Peer Nodes** architecture described in the paper. Instead of a single text block, the optimizer refines three distinct components simultaneously:
1.  **Instruction Node:** The core task definition.
2.  **Few-Shot Node:** Dynamic examples to guide reasoning.
3.  **Format Node:** Constraints on the output structure.
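
As a rough mental model, the three nodes can be pictured as separate text fields that are joined into one system prompt at inference time. The sketch below is only an illustration of that idea: the `PeerNodes` class, its field names, and the joining rule are hypothetical and do not mirror the repository's actual classes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PeerNodes:
    """Hypothetical container for the three independently optimized prompt parts."""
    instruction: str
    few_shot: str
    output_format: str

    def render(self) -> str:
        # Join the non-empty nodes into a single system prompt.
        sections = [self.instruction, self.few_shot, self.output_format]
        return "\n\n".join(s for s in sections if s.strip())


nodes = PeerNodes(
    instruction="Solve the math problem step by step.",
    few_shot="",
    output_format="End with 'Answer: <number>'.",
)

# Updating one node leaves the other two untouched.
updated = replace(
    nodes,
    few_shot="Q: 2 + 3 * 4 = ?\nA: 3 * 4 = 12, 2 + 12 = 14. Answer: 14",
)
print(updated.render())
```

Keeping the nodes separate means an edit aimed at one concern (for example, adding a demonstration) cannot accidentally rewrite the task definition or the format constraints.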

### 🔄 **The Loop**
1.  **Forward Pass:** Student attempts to solve a math problem.
2.  **Evaluation:** We check if the final answer matches the Ground Truth.
3.  **Backward Pass:** If incorrect, the Teacher analyzes the error and generates a "Textual Gradient".
4.  **Update:** The Optimizer refines specific Peer Nodes (e.g., adding a new example) to fix the error.
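
The four steps above can be sketched as a single function. This is a toy illustration with stand-in callables instead of real models: `tgd_step` and the three `toy_*` functions are hypothetical names, and the real pipeline updates individual Peer Nodes rather than one flat string.

```python
from typing import Callable


def tgd_step(
    prompt: str,
    question: str,
    gold: str,
    student: Callable[[str, str], str],
    teacher: Callable[[str, str, str, str], str],
    optimizer: Callable[[str, str], str],
) -> str:
    """Run one Textual Gradient Descent step and return the (possibly updated) prompt."""
    prediction = student(prompt, question)                  # 1. Forward pass
    if prediction == gold:                                  # 2. Evaluation
        return prompt                                       # Correct: nothing to fix
    gradient = teacher(prompt, question, prediction, gold)  # 3. Backward pass
    return optimizer(prompt, gradient)                      # 4. Update


# Toy stand-ins so the sketch runs without loading any model.
def toy_student(prompt: str, question: str) -> str:
    return "14" if "multiplication before addition" in prompt else "20"


def toy_teacher(prompt: str, question: str, prediction: str, gold: str) -> str:
    return "The model added first. Tell it to do multiplication before addition."


def toy_optimizer(prompt: str, gradient: str) -> str:
    return prompt + " Always do multiplication before addition."


initial_prompt = "Solve the problem."
updated_prompt = tgd_step(
    initial_prompt, "2 + 3 * 4 = ?", "14", toy_student, toy_teacher, toy_optimizer
)
print(updated_prompt)
```

Note that the Teacher is only called on failures, so a step on an example the Student already solves returns the prompt unchanged.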

### 🚀 **Step 1: Setup & Installation**

We start by cloning the **Promptimus Prime** repository (hosted on GitHub as `AutoPrompt-Lite`). Then, we install all necessary dependencies defined in `requirements.txt` to ensure our environment matches the project specifications.

**Note:** Ensure you are connected to a **GPU Runtime** (T4 is sufficient) before running this cell.
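
One quick way to confirm that a GPU is attached before installing anything is to ask the NVIDIA driver. The helper below is a minimal sketch using only the standard library; `gpu_summary` is a hypothetical name, and it simply reports "No GPU detected" when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return the GPU list reported by nvidia-smi, or a notice if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return "No GPU detected"
    # 'nvidia-smi -L' prints one line per visible GPU.
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    output = result.stdout.strip()
    return output if result.returncode == 0 and output else "No GPU detected"


print(gpu_summary())
```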

In [None]:
# 1. Clone the repository
!git clone https://github.com/imlydianna/AutoPrompt-Lite.git

# 2. Enter the project directory
%cd AutoPrompt-Lite

# 3. Install dependencies from requirements.txt
!pip install -q -r requirements.txt

We add the repository to the system path to allow direct imports. We also configure logging to suppress verbose output from libraries, ensuring that progress bars (tqdm) render correctly in Colab.

In [None]:
import sys
import logging
import transformers

# Add the repository to the Python path.
# The clone in the previous cell creates /content/AutoPrompt-Lite on Colab.
repo_path = "/content/AutoPrompt-Lite"
if repo_path not in sys.path:
    sys.path.append(repo_path)

# Configure Global Logging (Silence the noise)
# Force re-configuration to override Colab defaults
logging.basicConfig(level=logging.INFO, force=True)

# Suppress specific library noise
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("adalflow").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.ERROR)
transformers.logging.set_verbosity_error()

print("✅ Environment configured for interactive execution.")

### 🔑 **Step 2: Hugging Face Login (Optional)**

If you plan to use gated models or want to avoid download limits, log in to Hugging Face.

In [None]:
from google.colab import userdata 
from huggingface_hub import login

try:
    # Ensure you have added 'HF_TOKEN' to your Colab Secrets
    token = userdata.get('HF_TOKEN')
    login(token)
    print("✅ Successfully logged in to Hugging Face!")
except Exception:
    # Catch Exception instead of using a bare except, so interrupting the cell still works
    print("⚠️ Could not log in with HF_TOKEN (secret missing or token rejected). Continuing without authentication (some models may not work).")

### 🧠 **Step 3: Run Training (Optimization Loop)**

We will now start the **Textual Gradient Descent** loop.
The optimizer will work on **all three Peer Nodes** simultaneously:
1.  Refining the **Instruction**.
2.  Curating/Editing **Few-Shot Demos**.
3.  Adjusting the **Output Format**.

*   **Train Split:** Used to generate gradients (feedback) from the Teacher.
*   **Validation Split:** Used to verify if the proposed changes actually improve performance.
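
The role of the validation split can be sketched as a simple gate: a proposed prompt replaces the current one only if it scores higher on validation data. The code below is an illustration under that assumption. The function names and the strict "must improve" rule are hypothetical, and the repository's actual acceptance criterion may differ.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy(prompt: str, dataset: List[Example], student: Callable[[str, str], str]) -> float:
    """Fraction of examples the student answers correctly under the given prompt."""
    if not dataset:
        return 0.0
    correct = sum(student(prompt, question) == gold for question, gold in dataset)
    return correct / len(dataset)


def select_prompt(
    current: str,
    candidate: str,
    val_set: List[Example],
    student: Callable[[str, str], str],
) -> Tuple[str, float]:
    """Keep the candidate only if it strictly improves validation accuracy."""
    current_score = accuracy(current, val_set, student)
    candidate_score = accuracy(candidate, val_set, student)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score


# Toy student: answers correctly only when the prompt asks for step-by-step work.
def keyword_student(prompt: str, question: str) -> str:
    answers = {"1 + 1": "2", "2 * 3": "6"}
    return answers[question] if "step by step" in prompt else "0"


val_set = [("1 + 1", "2"), ("2 * 3", "6")]
best_prompt, best_score = select_prompt("Answer.", "Think step by step.", val_set, keyword_student)
print(best_prompt, best_score)
```

Gating on validation data guards against a Teacher suggestion that fixes one training example while hurting the prompt overall.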

We import the training logic from `src.tasks.gsm8k.train` and call it in-process, so that log messages and progress bars appear in the notebook in real time.

In [None]:
# We import the main execution function and run it directly
# This will load the models (4-bit), run the optimization steps, and save the result.
from src.tasks.gsm8k.train import run_training # pyright: ignore[reportMissingImports]

# Execute the training pipeline
run_training()

### 📊 **Step 4: Final Evaluation**

Now that the optimization loop is complete, we measure the performance gain on a held-out **Test Set** that was not used during optimization.

The evaluation script performs a side-by-side comparison of two configurations:
1.  **Baseline State:** The initial Instruction, Demos, and Format (loaded from `src/tasks/gsm8k/prompts/`).
2.  **Optimized State:** The refined Instruction, Demos, and Format (loaded from `outputs/gsm8k/`).

This step calculates the final accuracy for both configurations and saves a detailed CSV report (`comparison_results.csv`) for the next analysis step.
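
For intuition, the comparison boils down to scoring both configurations on the same questions and writing one row per question. The sketch below illustrates that shape. The column names, `summarize`, and `write_report` are hypothetical and are not the schema or functions that `run_evaluation` actually uses.

```python
import csv
from typing import Dict, List

# Hypothetical column layout for a per-question comparison report.
FIELDS = ["question", "gold", "baseline_pred", "optimized_pred"]


def summarize(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Compute accuracy for both configurations and count baseline failures that were fixed."""
    total = len(rows)
    if total == 0:
        return {"baseline_accuracy": 0.0, "optimized_accuracy": 0.0, "fixed_by_optimization": 0}
    baseline_hits = sum(r["baseline_pred"] == r["gold"] for r in rows)
    optimized_hits = sum(r["optimized_pred"] == r["gold"] for r in rows)
    fixed = sum(
        r["baseline_pred"] != r["gold"] and r["optimized_pred"] == r["gold"] for r in rows
    )
    return {
        "baseline_accuracy": baseline_hits / total,
        "optimized_accuracy": optimized_hits / total,
        "fixed_by_optimization": fixed,
    }


def write_report(rows: List[Dict[str, str]], path: str) -> None:
    """Write one CSV row per test question."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Made-up rows for illustration only; these are not real results.
rows = [
    {"question": "q1", "gold": "72", "baseline_pred": "72", "optimized_pred": "72"},
    {"question": "q2", "gold": "10", "baseline_pred": "12", "optimized_pred": "10"},
    {"question": "q3", "gold": "5", "baseline_pred": "5", "optimized_pred": "5"},
    {"question": "q4", "gold": "8", "baseline_pred": "9", "optimized_pred": "7"},
]
summary = summarize(rows)
print(summary)
```

Storing both predictions per question, rather than only the two accuracy figures, is what makes it possible to look up individual cases later.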

In [None]:
# We import the evaluation function and run it directly
from src.tasks.gsm8k.evaluate import run_evaluation # pyright: ignore[reportMissingImports]

# Execute the evaluation
run_evaluation()

### 📝 **Step 5: Inspect the Optimized Peer Nodes**

Let's see what the "Teacher" taught the "Student". Since we used the **Peer Nodes** architecture, the optimizer has refined three distinct components.

Below are the final optimized versions of:
1.  **Instruction:** The core task definition.
2.  **Demos:** The few-shot examples added or modified.
3.  **Format:** The output structure constraints.

In [None]:
import os

output_dir = "outputs/gsm8k"
files_to_inspect = [
    "optimized_instruction.txt",
    "optimized_demos.txt",
    "optimized_format.txt"
]

print(f"📂 Inspecting artifacts in: {output_dir}\n")

for filename in files_to_inspect:
    file_path = os.path.join(output_dir, filename)

    if os.path.exists(file_path):
        print(f"✨ \033[1m{filename.upper()}:\033[0m")
        print("=" * 60)
        # Read as UTF-8 so non-ASCII characters in the prompts display correctly
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read().strip()
        print(content if content else "[Empty File]")
        print("=" * 60 + "\n")
    else:
        print(f"❌ {filename} not found. (Did training complete?)")

### 📈 **Step 6: Visualization & Analysis**

Accuracy scores alone do not show why performance changed. To check that **LLM-AutoDiff** is working as intended, we also inspect the **qualitative changes** in the prompts and their impact on the model's reasoning.

In this final step, we run our visualization module to generate two key insights:

1.  **Word-Level Diffs:** A color-coded comparison showing exactly how the Optimizer refined the **Instruction**, edited the **Few-Shot Demos**, and tweaked the **Output Format**.
    *   <span style="color:green">**Green**</span>: Content added to guide the model.
    *   <span style="color:red">**Red**</span>: Confusing or redundant constraints removed.

2.  **Success Stories:** A side-by-side analysis of specific test cases where the **Baseline failed** but the **Optimized prompt succeeded**. This demonstrates the tangible impact of the optimization on the Student's Chain-of-Thought.
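
A word-level diff of this kind can be approximated with the standard library's `difflib`. The sketch below is an assumption about how such a view could be built, not the code behind `run_visualization`; `word_diff_html` is a hypothetical helper that wraps removed words in red spans and added words in green spans.

```python
import difflib
import html


def word_diff_html(before: str, after: str) -> str:
    """Return an HTML string with removed words in red and added words in green."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(html.escape(" ".join(old_words[i1:i2])))
        if tag in ("delete", "replace"):
            removed = html.escape(" ".join(old_words[i1:i2]))
            pieces.append(f'<span style="color:red">{removed}</span>')
        if tag in ("insert", "replace"):
            added = html.escape(" ".join(new_words[j1:j2]))
            pieces.append(f'<span style="color:green">{added}</span>')
    return " ".join(pieces)


diff_markup = word_diff_html("Solve the problem", "Solve the problem carefully")
print(diff_markup)
```

In a notebook, the returned string can be rendered by passing it to `IPython.display.HTML`. Diffing on whitespace-separated words keeps the output readable for prose prompts, at the cost of ignoring changes in spacing and line breaks.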

In [None]:
# We import the visualization function and run it directly
from src.tasks.gsm8k.visualize import run_visualization # pyright: ignore[reportMissingImports]

# Execute the visualization logic
run_visualization()