# Adversarial Simulation Notebook

This notebook runs an adversarial simulation against a language model (in this case, Claude, accessed through the Anthropic API) using the `isopro` package, then analyzes how the attacks change the model's outputs.

## Setup

First, we'll import the necessary libraries and set up our environment.

In [None]:
import logging
from typing import List
from isopro.adversarial_simulation import AdversarialSimulator, AdversarialEnvironment
from isopro.utils.analyze_adversarial_sim import analyze_adversarial_results, summarize_adversarial_impact
from isopro.agents.ai_agent import AI_Agent
import anthropic
import os
from dotenv import load_dotenv
import json
from datetime import datetime
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns

# Load environment variables
load_dotenv()

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
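
`load_dotenv()` does not raise when no `.env` file is found, so a missing API key typically surfaces only later, when the client is created or first used. A minimal sketch of an early check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of `isopro`, `anthropic`, or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Fail early with a message that names the missing variable
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

Calling `require_env("ANTHROPIC_API_KEY")` right after `load_dotenv()` would stop the notebook before any simulation work starts if the key is missing.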

## Define Helper Classes and Functions

Now, we'll define our ClaudeAgent class and some helper functions.

In [None]:
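class ClaudeAgent(AI_Agent):
    """Agent wrapper that forwards simulation inputs to Claude via the Anthropic API."""

    def __init__(self, name):
        super().__init__(name)
        # Read the API key from the environment (loaded from .env above)
        self.client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

    def run(self, input_data):
        """Send input_data['text'] to Claude as a single user message and return the reply text."""
        response = self.client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=100,  # short cap; longer answers will be truncated
            messages=[{"role": "user", "content": input_data['text']}]
        )
        # The reply is a list of content blocks; return the text of the first one
        return response.content[0].text

    def step(self, sim_state):
        """Run the model on the current state and store the reply under 'output'."""
        sim_state['output'] = self.run(sim_state)
        return sim_state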

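class NumpyEncoder(json.JSONEncoder):
    """JSON encoder that converts NumPy and PyTorch values to plain Python types."""

    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, (np.ndarray, torch.Tensor)):
            return obj.tolist()
        return super().default(obj)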

def save_scores(output_dir: str, run_id: str, analysis_results: dict) -> None:
    """Write the analysis results to <output_dir>/adv-<run_id>.json."""
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, f"adv-{run_id}.json")

    try:
        with open(output_file, 'w') as f:
            json.dump(analysis_results, f, indent=2, cls=NumpyEncoder)

        logger.info(f"Saved analysis results to {output_file}")
    except Exception:
        # Log the full traceback, but don't abort the notebook over a failed save
        logger.exception("Error saving analysis results")

def get_sample_inputs() -> List[str]:
    return [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "Explain the theory of relativity."
    ]
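
The `NumpyEncoder` class above relies on the `default` hook of `json.JSONEncoder`, which `json` calls only for objects it cannot serialize natively. The sketch below shows the same mechanism using the standard library's `Fraction` as a stand-in for a NumPy scalar, so it runs without `numpy` or `torch`; `FractionEncoder` and the `scores` dictionary are hypothetical and exist only for this illustration.

```python
import json
from fractions import Fraction


class FractionEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts Fraction values to float."""

    def default(self, obj):
        if isinstance(obj, Fraction):
            return float(obj)
        # Defer to the base class, which raises TypeError for unsupported types
        return super().default(obj)


# Made-up value, used only to exercise the encoder
scores = {"bleu_change": Fraction(-1, 4)}
encoded = json.dumps(scores, cls=FractionEncoder)
print(encoded)  # {"bleu_change": -0.25}
```

Any type the hook does not recognize still raises `TypeError`, which is why `save_scores` wraps the write in a `try` block.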

## Run the Adversarial Simulation

Now we'll set up and run our adversarial simulation.

In [None]:
def run_simulation():
    run_id = datetime.now().strftime("%Y%m%d-%H%M%S")
    logger.info(f"Starting adversarial simulation run {run_id}")

    claude_agent = ClaudeAgent("Claude Agent")

    # Create the AdversarialEnvironment
    adv_env = AdversarialEnvironment(
        agent_wrapper=claude_agent,
        num_adversarial_agents=2,
        attack_types=["textbugger", "deepwordbug"],
        attack_targets=["input", "output"]
    )

    # Set up the adversarial simulator with the environment
    simulator = AdversarialSimulator(adv_env)

    input_data = get_sample_inputs()

    logger.info("Starting adversarial simulation...")
    simulation_results = simulator.run_simulation(input_data, num_steps=1)

    logger.info("Analyzing simulation results...")
    analysis_results = analyze_adversarial_results(simulation_results)

    summary = summarize_adversarial_impact(analysis_results)

    print("\nAdversarial Simulation Summary:")
    print(summary)

    output_dir = "output"
    save_scores(output_dir, run_id, analysis_results)

    logger.info("Simulation complete.")
    
    return simulation_results, analysis_results

# Run the simulation
simulation_results, analysis_results = run_simulation()
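
The attack types configured above, `textbugger` and `deepwordbug`, are named after published methods that introduce small, typo-like edits into text (TextBugger also includes word substitutions). To give a feel for what such a perturbation looks like, here is a minimal sketch of one character-level edit, an adjacent-character swap. It is an illustration only, not the implementation used by `isopro` or by either attack, and `swap_adjacent_chars` is a hypothetical name.

```python
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap two adjacent interior characters in one randomly chosen word."""
    words = text.split()
    # Only words with at least 4 characters have two interior characters to swap
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    idx = rng.choice(candidates)
    word = words[idx]
    # Interior position: the first and last characters stay in place
    pos = rng.randrange(1, len(word) - 2)
    words[idx] = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return " ".join(words)


original = "What is the capital of France?"
perturbed = swap_adjacent_chars(original, random.Random(0))
print(perturbed)
```

Keeping the first and last characters fixed tends to leave the word readable to a human while still changing the tokens the model sees.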

## Analyze and Visualize Results

Now that we have our results, let's analyze and visualize them.

In [None]:
def plot_metric_changes(analysis_results):
    """Plot the change in each metric as a bar chart."""
    metrics = ['bleu', 'rouge-1', 'rouge-2', 'rouge-l', 'perplexity', 'coherence']
    # Each metric's change is expected under the key '<metric>_change'
    changes = [analysis_results[f'{metric}_change'] for metric in metrics]

    plt.figure(figsize=(12, 6))
    sns.barplot(x=metrics, y=changes)
    plt.title('Changes in Metrics After Adversarial Attacks')
    plt.xlabel('Metrics')
    plt.ylabel('Percentage Change')
    plt.xticks(rotation=45)
    plt.tight_layout()  # keep the rotated tick labels inside the figure
    plt.show()

plot_metric_changes(analysis_results)

# Display original and perturbed inputs and outputs
for i, result in enumerate(simulation_results):
    print(f"\nExample {i+1}:")
    print(f"Original Input: {result['original_input']}")
    print(f"Perturbed Input: {result['perturbed_input']}")
    print(f"Original Output: {result['original_output']}")
    print(f"Perturbed Output: {result['perturbed_output']}")
    print("-" * 50)
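
The plot labels its y-axis "Percentage Change", but how `analyze_adversarial_results` computes the `*_change` values is not shown in this notebook. Assuming each value is the relative change from the original score to the perturbed score, one common definition can be sketched as follows; `percent_change` is a hypothetical helper and the scores are made-up numbers.

```python
def percent_change(original: float, perturbed: float) -> float:
    """Relative change from original to perturbed, in percent."""
    if original == 0:
        raise ValueError("percent change is undefined when the original score is 0")
    return (perturbed - original) / abs(original) * 100.0


# Made-up scores: a metric that drops from 0.5 to 0.25 after the attack
print(percent_change(0.5, 0.25))  # -50.0
```

When reading the chart, keep the direction of each metric in mind: a drop in BLEU or ROUGE indicates degradation, whereas for perplexity an increase is the worse outcome.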

## Conclusion

This notebook ran an adversarial simulation against a language model and analyzed the results. The simulation applies the TextBugger and DeepWordBug attacks to the model's inputs and outputs and measures their impact on several text-quality metrics.

Key observations:
1. The change in each metric (BLEU, ROUGE-1, ROUGE-2, ROUGE-L, perplexity, coherence) shows how strongly the adversarial attacks affect the model's outputs.
2. Comparing the original and perturbed inputs and outputs side by side shows how the attacks modify the text and how the model's responses change as a result.

This information can be used to assess the robustness of the language model against adversarial attacks and identify areas for improvement in the model's defenses.