# Azure OpenAI Em Dash Pipeline - Data Generation

This notebook contains the **data generation pipeline** for the em dash analysis project. It generates stories with several AI models, extracts the sentences that contain em dashes, and creates paraphrases of those sentences without em dashes.

## Pipeline Overview
1. **Story Generation**: Generate stories using GPT-35-Turbo, GPT-4, and GPT-4.1
2. **Em Dash Extraction**: Extract sentences containing em dashes from generated stories
3. **Paraphrase Generation**: Create paraphrases that remove em dashes
4. **Data Export**: Save all generated data to JSON files for analysis

## Output Files
- `story_analyses_YYYYMMDD_HHMMSS.json` - Generated stories with analysis
- `em_dash_sentences_YYYYMMDD_HHMMSS.json` - Extracted em dash sentences
- `paraphrase_results_YYYYMMDD_HHMMSS.json` - Generated paraphrases
- `complete_analysis_YYYYMMDD_HHMMSS.json` - Combined pipeline results
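
The `YYYYMMDD_HHMMSS` suffix in these names is a `strftime` timestamp. A minimal sketch of how such a name can be built; `timestamped_name` is a hypothetical helper used only for illustration, and the fixed date is made up so the result is predictable:

```python
from datetime import datetime

def timestamped_name(prefix, now=None):
    """Build 'prefix_YYYYMMDD_HHMMSS.json' (hypothetical helper for illustration)."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.json"

# A fixed, made-up datetime keeps the example deterministic.
print(timestamped_name("story_analyses", datetime(2025, 1, 31, 9, 5, 7)))
# story_analyses_20250131_090507.json
```

Because the fields run from year down to second with zero padding, names with the same prefix sort chronologically.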

---

**Related Notebooks:**
- `evaluation_analysis.ipynb` - Analysis and visualization of generated data
- `mystery_clean.ipynb` - Original development version with all sections

## 1. Setup and Configuration

In [None]:
# Install required packages
!pip install openai python-dotenv tiktoken

In [None]:
# Import required libraries
import os
import random
import re
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
import tiktoken
from datetime import datetime
from collections import defaultdict, Counter

# Load environment variables
load_dotenv()

In [None]:
# Configure Azure OpenAI client
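# Read credentials from the environment (populated from .env by load_dotenv() above)
# instead of hardcoding them. The variable names are assumptions; match them to your .env file.
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_version = os.getenv("OPENAI_API_VERSION")
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")  # e.g. https://<resource-name>.openai.azure.com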

client = AzureOpenAI(
    api_key=api_key,
    api_version=api_version,
    azure_endpoint=azure_endpoint
)

# Available models for analysis
available_models = ["gpt-35-turbo", "gpt-4", "gpt-4.1"]

print("✅ Azure OpenAI client configured successfully!")
print(f"📊 Available models: {available_models}")
print(f"🌐 Endpoint: {azure_endpoint}")
print(f"📝 API Version: {api_version}")
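
The client needs a key, an API version, and an endpoint, so it can help to fail early when one of them is missing rather than at the first request. A minimal sketch, assuming the settings are supplied through environment variables (for example via the `.env` file that `load_dotenv()` reads); `require_env` and the `DEMO_ENDPOINT` variable are hypothetical names for illustration:

```python
import os

def require_env(name):
    """Return a non-empty environment variable or raise (hypothetical helper)."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value

# Demo with a made-up variable so the snippet runs anywhere.
os.environ["DEMO_ENDPOINT"] = "https://example.invalid"
print(require_env("DEMO_ENDPOINT"))
```

An empty or whitespace-only value is treated the same as a missing one, which catches settings that were declared but never filled in.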

In [None]:
# Test the connection
try:
    response = client.chat.completions.create(
        model="gpt-35-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello! Can you confirm that the Azure OpenAI connection is working?"}
        ],
        max_tokens=100,
        temperature=0.7
    )
    
    print("✅ Connection successful!")
    print(f"Response: {response.choices[0].message.content}")
    
except Exception as e:
    print(f"❌ Connection failed: {str(e)}")

## 2. Core Pipeline Functions

In [None]:
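def get_tokenizer_for_model(model_name):
    """Get the appropriate tokenizer for the model"""
    if model_name == "gpt-4.1":
        # GPT-4.1 uses the o200k_base encoding, not GPT-4's cl100k_base
        return tiktoken.get_encoding("o200k_base")
    if model_name == "gpt-4":
        return tiktoken.encoding_for_model("gpt-4")
    # "gpt-35-turbo" is the Azure deployment name; tiktoken knows the model as "gpt-3.5-turbo".
    # Unknown model names fall back to the same encoding.
    return tiktoken.encoding_for_model("gpt-3.5-turbo")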

def count_tokens(text, model_name):
    """Count tokens in text using the appropriate tokenizer for the model"""
    try:
        tokenizer = get_tokenizer_for_model(model_name)
        tokens = tokenizer.encode(text)
        return {
            'token_count': len(tokens),
            'text_length': len(text),
            'model': model_name
        }
    except Exception as e:
        return {'token_count': 0, 'text_length': len(text), 'model': model_name, 'error': str(e)}

def count_em_dashes(text):
    """Count different types of dashes in text"""
    em_dash_count = text.count('—')  # Em dash (U+2014)
    en_dash_count = text.count('–')  # En dash (U+2013)
    double_hyphen_count = text.count('--')  # Double hyphen
    
    total_em_dashes = em_dash_count + en_dash_count + double_hyphen_count
    
    return {
        'em_dash_count': em_dash_count,
        'en_dash_count': en_dash_count,
        'double_hyphen_count': double_hyphen_count,
        'total_em_dashes': total_em_dashes
    }

def analyze_story_text(story, model_name):
    """Comprehensive analysis of story text"""
    token_info = count_tokens(story, model_name)
    dash_info = count_em_dashes(story)
    
    word_count = len(story.split())
    # Count ! and ? as sentence terminators too, not only periods
    sentence_count = len([s for s in re.split(r'[.!?]+', story) if s.strip()])
    
    return {
        'story_text': story,
        'model': model_name,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'character_count': len(story),
        **token_info,
        **dash_info
    }

def save_to_json(data, filename_prefix, description="data"):
    """Save data to timestamped JSON file"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{filename_prefix}_{timestamp}.json"
    
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        
        file_size = os.path.getsize(filename)
        file_size_kb = file_size / 1024
        
        print(f"✅ {description} saved to: {filename}")
        print(f"📁 File size: {file_size_kb:.1f} KB")
        
        return filename
        
    except Exception as e:
        print(f"❌ Error saving {description}: {str(e)}")
        return None

print("✅ Core pipeline functions loaded!")
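
`count_em_dashes` relies on `str.count`, which counts non-overlapping matches. A run of three hyphens therefore contributes one double hyphen, and a run of four contributes two. A small standalone sketch of that behavior; `dash_counts` is a hypothetical stand-in rather than the notebook's function, and the sample text is made up:

```python
EM_DASH = "\u2014"  # em dash
EN_DASH = "\u2013"  # en dash

def dash_counts(text):
    """Count the three dash variants separately (illustrative stand-in)."""
    return {
        "em": text.count(EM_DASH),
        "en": text.count(EN_DASH),
        "double_hyphen": text.count("--"),  # non-overlapping matches
    }

sample = f"She paused{EM_DASH}then ran. Pages 3{EN_DASH}5 -- gone --- all of it."
print(dash_counts(sample))
# {'em': 1, 'en': 1, 'double_hyphen': 2}
```

Note that the en dash in a page range such as "3-5" typeset with U+2013 is counted too, so `total_em_dashes` is best read as "all dash-like marks" rather than strictly U+2014.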

## 3. Story Generation Functions

In [None]:
# Random story prompt generation
characters = [
    "a mysterious librarian", "an elderly baker", "a young astronaut", "a street musician",
    "a detective", "a time traveler", "a robot", "a witch", "a pirate", "a teacher",
    "a chef", "a scientist", "a ghost", "a photographer", "a taxi driver"
]

settings = [
    "in a haunted mansion", "on a distant planet", "in a busy coffee shop", "during a thunderstorm",
    "in an underwater city", "at a carnival", "in a secret laboratory", "on a moving train",
    "in a magical forest", "at a school reunion", "in a small town", "on a deserted island",
    "in the future", "in ancient times", "in a parallel universe"
]

objects = [
    "a mysterious letter", "an old photograph", "a broken compass", "a glowing stone",
    "a music box", "a diary", "a map", "a key", "a mirror", "a book",
    "a watch that runs backwards", "a painting", "a telescope", "a locked box", "a recipe"
]

emotions_themes = [
    "seeking revenge", "looking for love", "trying to solve a mystery", "facing their biggest fear",
    "discovering a hidden talent", "making an important decision", "reuniting with an old friend",
    "starting over in life", "protecting someone they care about", "learning a valuable lesson"
]

def generate_random_prompt():
    """Generate a random story prompt"""
    character = random.choice(characters)
    setting = random.choice(settings)
    obj = random.choice(objects)
    theme = random.choice(emotions_themes)
    
    prompt = f"Write a short story about {character} {setting} who discovers {obj} while {theme}."
    return prompt

def generate_ai_story(prompt, model_name):
    """Generate a story using the specified AI model"""
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a creative writer. Write engaging short stories with rich descriptions and dialogue. Use natural punctuation including em dashes when appropriate for dramatic pauses or emphasis."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=800,
            temperature=0.8
        )
        
        # Check if response and content exist
        if not response or not response.choices or len(response.choices) == 0:
            return f"Error generating story with {model_name}: Empty response from API"
        
        content = response.choices[0].message.content
        if content is None:
            return f"Error generating story with {model_name}: API returned None content"
        
        story = content.strip()
        if not story:
            return f"Error generating story with {model_name}: API returned empty content"
        
        return story
        
    except Exception as e:
        return f"Error generating story with {model_name}: {str(e)}"

def generate_stories_with_full_analysis():
    """Generate stories with all three models and analyze them"""
    prompt = generate_random_prompt()
    
    print(f"📝 PROMPT: {prompt}")
    print("=" * 80)
    
    analyses = []
    
    for model in available_models:
        print(f"\n🤖 Generating story with {model.upper()}...")
        
        story = generate_ai_story(prompt, model)
        
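        # Match the full error prefix so a real story that begins with "Error" is not discarded
        if story.startswith("Error generating story with"):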
            print(f"❌ {story}")
            continue
            
        analysis = analyze_story_text(story, model)
        analysis['prompt'] = prompt
        analyses.append(analysis)
        
        print(f"📊 Analysis: {analysis['word_count']} words, {analysis['token_count']} tokens, {analysis['total_em_dashes']} em dashes")
        print(f"📖 Story preview: {story[:100]}...")
        
        time.sleep(1)  # Rate limiting
    
    return analyses

def generate_multiple_stories_with_analysis(num_story_rounds=3):
    """Generate multiple rounds of stories for comprehensive analysis"""
    print(f"🎭 GENERATING {num_story_rounds} ROUNDS OF STORIES")
    print(f"📊 Total stories to generate: {num_story_rounds * len(available_models)}")
    print("=" * 80)
    
    all_analyses = []
    
    for round_num in range(1, num_story_rounds + 1):
        print(f"\n🎯 ROUND {round_num}/{num_story_rounds}")
        round_analyses = generate_stories_with_full_analysis()
        all_analyses.extend(round_analyses)
        
        if round_num < num_story_rounds:
            print("\n⏱️ Waiting before next round...")
            time.sleep(2)
    
    print(f"\n✅ COMPLETED! Generated {len(all_analyses)} stories total")
    return all_analyses

print("✅ Story generation functions ready!")
print("🎭 Functions available: generate_ai_story, generate_stories_with_full_analysis")
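
With 15 characters, 15 settings, 15 objects, and 10 themes, `generate_random_prompt` can produce 15 × 15 × 15 × 10 = 33,750 distinct prompts. When a run needs to be repeatable, one option is to draw from a seeded generator. A minimal sketch, assuming shortened stand-in lists; `seeded_prompt` and the list names are hypothetical:

```python
import random

# Shortened stand-ins for the notebook's lists (illustrative only).
demo_characters = ["a detective", "a robot", "a witch"]
demo_settings = ["on a moving train", "in a small town"]
demo_objects = ["a map", "a key"]
demo_themes = ["seeking revenge", "starting over in life"]

def seeded_prompt(seed):
    """Build a prompt from a private, seeded random generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return (
        f"Write a short story about {rng.choice(demo_characters)} "
        f"{rng.choice(demo_settings)} who discovers {rng.choice(demo_objects)} "
        f"while {rng.choice(demo_themes)}."
    )

# The same seed always yields the same prompt.
print(seeded_prompt(42) == seeded_prompt(42))  # True

# Size of the prompt space for the stand-in lists.
print(len(demo_characters) * len(demo_settings) * len(demo_objects) * len(demo_themes))  # 24
```

Storing the seed alongside each generated story would make it possible to regenerate the exact prompt later.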

## 4. Em Dash Extraction Functions

In [None]:
def has_em_dash(sentence):
    """Check if sentence contains any type of em dash"""
    return '—' in sentence or '–' in sentence or '--' in sentence

def extract_sentences_with_em_dashes(text):
    """Extract all sentences that contain em dashes"""
    # Split after sentence-ending punctuation so each sentence keeps its own terminator
    # (simple approach; abbreviations such as "Dr." still cause false splits)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    em_dash_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence and has_em_dash(sentence):
            em_dash_sentences.append(sentence)
    
    return em_dash_sentences

def extract_em_dash_sentences_from_analyses(analyses_list):
    """Extract em dash sentences from a list of story analyses"""
    all_em_dash_sentences = []
    
    for analysis in analyses_list:
        story_text = analysis['story_text']
        model = analysis['model']
        
        sentences = extract_sentences_with_em_dashes(story_text)
        
        for sentence in sentences:
            dash_count = count_em_dashes(sentence)
            sentence_data = {
                'sentence': sentence,
                'model': model,
                'total_dashes': dash_count['total_em_dashes'],
                'em_dash_count': dash_count['em_dash_count'],
                'en_dash_count': dash_count['en_dash_count'],
                'double_hyphen_count': dash_count['double_hyphen_count']
            }
            all_em_dash_sentences.append(sentence_data)
    
    return all_em_dash_sentences

print("✅ Em dash extraction functions ready!")
print("🔍 Functions available: extract_sentences_with_em_dashes, extract_em_dash_sentences_from_analyses")
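
To see what the extraction step keeps, it helps to run a splitter and dash filter on a short passage. The following is a standalone sketch, not the notebook's function: `dash_sentences` is a hypothetical name, the sample text is made up, and the splitter assumes sentences are separated by whitespace after `.`, `!`, or `?`:

```python
import re

DASH_MARKS = ("\u2014", "\u2013", "--")  # em dash, en dash, double hyphen

def dash_sentences(text):
    """Return the sentences of `text` that contain any dash variant."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if any(mark in p for mark in DASH_MARKS)]

sample = "The door creaked. Who was there\u2014a ghost? She ran -- fast! Nothing followed."
for sentence in dash_sentences(sample):
    print(sentence)
# Who was there—a ghost?
# She ran -- fast!
```

Two of the four sentences survive, each with its original closing punctuation.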

## 5. Paraphrase Generation Functions

In [None]:
def calculate_word_difference(text1, text2):
    """
    Calculate the number of tokens that differ between two texts using frequency-based comparison.
    Includes both words and punctuation as tokens.
    """
    def get_tokens(text):
        """Extract all tokens (words + punctuation) from text, normalizing em dashes"""
        # Normalize en dashes and double hyphens to em dashes for consistency
        cleaned = re.sub(r'[—–]|--', '—', text)
        # Split into tokens: runs of word characters (\w+) or single punctuation marks ([^\w\s])
        tokens = re.findall(r'\w+|[^\w\s]', cleaned.lower())
        return tokens
    
    tokens1 = get_tokens(text1)
    tokens2 = get_tokens(text2)
    
    # Calculate differences using frequency counters
    counter1 = Counter(tokens1)
    counter2 = Counter(tokens2)
    
    # Calculate frequency-based differences
    unique_to_text1 = sum((counter1 - counter2).values())
    unique_to_text2 = sum((counter2 - counter1).values())
    total_difference = unique_to_text1 + unique_to_text2
    
    return {
        'total_difference': total_difference,
        'tokens1': tokens1,
        'tokens2': tokens2,
        'unique_to_text1': unique_to_text1,
        'unique_to_text2': unique_to_text2,
        'similarity_ratio': 1 - (total_difference / max(len(tokens1), len(tokens2))) if max(len(tokens1), len(tokens2)) > 0 else 1
    }

def is_paraphrase_similar_enough(original, paraphrase, max_word_difference=4):
    """Check if paraphrase is similar enough to original"""
    word_diff = calculate_word_difference(original, paraphrase)
    return word_diff['total_difference'] <= max_word_difference

def generate_paraphrase(sentence, model_name="gpt-35-turbo"):
    """Generate a paraphrase that removes em dashes"""
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a writing assistant. Rewrite sentences to remove em dashes (—) while preserving the original meaning. Use alternative punctuation like commas, periods, or restructure the sentence. Keep the meaning and tone as close to the original as possible."},
                {"role": "user", "content": f"Rewrite this sentence to remove all em dashes while keeping the same meaning: {sentence}"}
            ],
            max_tokens=150,
            temperature=0.3
        )
        
        if response and response.choices and len(response.choices) > 0:
            content = response.choices[0].message.content
            if content:
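                paraphrase = content.strip()
                # Remove only a pair of quotes wrapping the whole reply; deleting every quote
                # character would also strip apostrophes inside words (e.g. "don't" -> "dont")
                if len(paraphrase) >= 2 and paraphrase[0] == paraphrase[-1] and paraphrase[0] in '"\'':
                    paraphrase = paraphrase[1:-1].strip()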
                # Ensure no em dashes remain
                if '—' not in paraphrase and '–' not in paraphrase and '--' not in paraphrase:
                    return paraphrase
        
        return None
        
    except Exception as e:
        print(f"❌ Error generating paraphrase: {str(e)}")
        return None

def generate_multiple_paraphrases(sentence, model_name="gpt-35-turbo", max_attempts=5, max_paraphrases=2):
    """Generate multiple paraphrases for a sentence"""
    paraphrases = []
    attempts = 0
    
    while len(paraphrases) < max_paraphrases and attempts < max_attempts:
        attempts += 1
        paraphrase = generate_paraphrase(sentence, model_name)
        
        if paraphrase and paraphrase not in paraphrases:
            paraphrases.append(paraphrase)
    
    return paraphrases

def process_em_dash_replacements_filtered(em_dash_sentences, max_word_difference=5, max_paraphrases_per_sentence=2):
    """Process em dash sentences and generate filtered paraphrases"""
    print(f"🎨 GENERATING PARAPHRASES (≤{max_word_difference} token diff, up to {max_paraphrases_per_sentence} per sentence)")
    print("=" * 80)
    
    all_results = []
    total_attempts = 0
    successful_paraphrases = 0
    filtered_out = 0
    
    for i, sentence_data in enumerate(em_dash_sentences):
        original_sentence = sentence_data['sentence']
        model = sentence_data['model']
        
        print(f"\n🔄 Processing {i+1}/{len(em_dash_sentences)}: {model.upper()}")
        print(f"📝 Original: \"{original_sentence[:60]}...\"")
        
        # Generate multiple paraphrases
        paraphrases = generate_multiple_paraphrases(
            original_sentence, 
            model_name="gpt-35-turbo",
            max_paraphrases=max_paraphrases_per_sentence
        )
        
        valid_paraphrases = []
        
        for paraphrase in paraphrases:
            total_attempts += 1
            
            # Check similarity
            if is_paraphrase_similar_enough(original_sentence, paraphrase, max_word_difference):
                # Calculate token changes
                original_tokens = count_tokens(original_sentence, model)
                paraphrase_tokens = count_tokens(paraphrase, model)
                token_difference = paraphrase_tokens['token_count'] - original_tokens['token_count']
                
                # Count dashes removed
                original_dashes = count_em_dashes(original_sentence)['total_em_dashes']
                paraphrase_dashes = count_em_dashes(paraphrase)['total_em_dashes']
                dashes_removed = original_dashes - paraphrase_dashes
                
                valid_paraphrases.append({
                    'paraphrase': paraphrase,
                    'token_difference': token_difference,
                    'dashes_removed': dashes_removed,
                    'original_tokens': original_tokens['token_count'],
                    'paraphrase_tokens': paraphrase_tokens['token_count']
                })
                
                successful_paraphrases += 1
                print(f"✅ Valid: \"{paraphrase[:50]}...\" (tokens: {token_difference:+d})")
            else:
                filtered_out += 1
                print(f"❌ Filtered: Too different from original")
        
        if valid_paraphrases:
            result = {
                'original_sentence': original_sentence,
                'model': model,
                'original_dashes': sentence_data['total_dashes'],
                'paraphrases': valid_paraphrases,
                'paraphrase_count': len(valid_paraphrases)
            }
            all_results.append(result)
        
        time.sleep(0.5)  # Rate limiting
    
    print(f"\n✅ PARAPHRASE GENERATION COMPLETE!")
    print(f"📊 Results: {successful_paraphrases} valid paraphrases from {total_attempts} candidates")
    print(f"🔍 Filtered out: {filtered_out} paraphrases (too different)")
    print(f"📋 Sentences with valid paraphrases: {len(all_results)}")
    
    return all_results

print("✅ Paraphrase generation functions ready!")
print("🎨 Functions available: generate_paraphrase, process_em_dash_replacements_filtered")
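
The similarity filter compares token frequencies, not token positions, so replacing one dash with a comma costs two (one token removed, one added) and reordering words costs nothing. A compact standalone sketch of the same frequency-based idea; `tokens` and `multiset_diff` are hypothetical names and the sentences are made up:

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased words and single punctuation marks, with dash variants unified."""
    normalized = re.sub(r"[\u2014\u2013]|--", "\u2014", text)
    return re.findall(r"\w+|[^\w\s]", normalized.lower())

def multiset_diff(a, b):
    """Number of tokens present in one text but not matched in the other."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    return sum((ca - cb).values()) + sum((cb - ca).values())

original = "She paused\u2014then ran."
print(multiset_diff(original, "She paused, then ran."))      # 2
print(multiset_diff(original, "She paused. Then she ran."))  # 3
print(multiset_diff("then ran", "ran then"))                 # 0
```

With a threshold of 5, both paraphrases of the sample sentence would pass, while a rewrite that swaps in several new words would be filtered out.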

## 6. Pipeline Execution

Execute the complete data generation pipeline in 3 steps:

### Step 1: Generate Stories

In [None]:
# 🎭 STEP 1: GENERATE STORIES AND SAVE TO JSON
print("🎭 GENERATING STORIES AND SAVING TO JSON")
print("=" * 60)

# Configuration
num_story_rounds = 1  # Adjust as needed
print(f"📊 Configuration: {num_story_rounds} rounds × {len(available_models)} models = {num_story_rounds * len(available_models)} stories")
print()

# Generate stories
print("🚀 Starting story generation...")
all_story_analyses = generate_multiple_stories_with_analysis(num_story_rounds)

if all_story_analyses:
    print(f"\n✅ Generated {len(all_story_analyses)} stories successfully!")
    
    # Save to JSON
    stories_filename = save_to_json(
        all_story_analyses, 
        "story_analyses", 
        f"{len(all_story_analyses)} story analyses"
    )
    
    # Display summary
    print(f"\n📊 STORY GENERATION SUMMARY:")
    model_counts = {}
    total_em_dashes = 0
    
    for analysis in all_story_analyses:
        model = analysis['model']
        model_counts[model] = model_counts.get(model, 0) + 1
        total_em_dashes += analysis.get('total_em_dashes', 0)
    
    print(f"   • Total stories: {len(all_story_analyses)}")
    print(f"   • Total em dashes found: {total_em_dashes}")
    
    for model, count in model_counts.items():
        print(f"   • {model.upper()}: {count} stories")
    
    print(f"\n🎯 Ready for Step 2: Em dash extraction")
    print(f"📁 Stories saved in: {stories_filename}")
    
else:
    print("❌ No stories were generated!")
    print("💡 Check API connection and try again")

### Step 2: Extract Em Dashes and Generate Paraphrases

In [None]:
# Helper function to find latest file
def find_latest_file(pattern):
    """Find the most recent file matching the pattern"""
    import glob
    files = glob.glob(pattern)
    if files:
        return max(files, key=os.path.getctime)
    return None

def load_from_json(filename, description="data"):
    """Load data from JSON file"""
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        print(f"✅ {description} loaded from: {filename}")
        if isinstance(data, (list, dict)):
            print(f"📝 Size: {len(data)} items")
        return data
        
    except Exception as e:
        print(f"❌ Error loading {description}: {str(e)}")
        return None
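
`find_latest_file` picks the newest file by `os.path.getctime`. Because the filenames embed a zero-padded `YYYYMMDD_HHMMSS` timestamp, an alternative is to compare the names themselves, which stays stable when files are copied or restored and their filesystem times change. A minimal sketch; `latest_by_name` is a hypothetical helper and assumes all candidates share one prefix:

```python
def latest_by_name(filenames):
    """Return the name with the greatest embedded timestamp, or None if empty."""
    # Lexicographic order equals chronological order for a shared prefix
    # followed by a zero-padded YYYYMMDD_HHMMSS timestamp.
    return max(filenames) if filenames else None

candidates = [
    "story_analyses_20250102_080000.json",
    "story_analyses_20250101_235959.json",
    "story_analyses_20250102_075959.json",
]
print(latest_by_name(candidates))
# story_analyses_20250102_080000.json
```

In the notebook this could be applied to the list returned by `glob.glob(pattern)`.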

In [None]:
# 🔍 STEP 2: LOAD STORIES, EXTRACT EM DASHES, GENERATE PARAPHRASES
print("🔍 LOADING STORIES AND PROCESSING EM DASHES")
print("=" * 60)

# Configuration for paraphrasing
max_word_difference = 5
max_paraphrases_per_sentence = 2

print(f"🎨 Paraphrase config: ≤{max_word_difference} token differences, up to {max_paraphrases_per_sentence} per sentence")
print(f"🚫 Dash filter: Reject any paraphrases containing em dashes (—), en dashes (–), or double hyphens (--)")
print()

# Load from latest file automatically (or use the stories from Step 1 if still in memory)
if 'all_story_analyses' not in locals() or not all_story_analyses:
    latest_stories_file = find_latest_file("story_analyses_*.json")
    if latest_stories_file:
        print(f"📂 Loading from latest file: {latest_stories_file}")
        all_story_analyses = load_from_json(latest_stories_file, "story analyses")
    else:
        print("❌ No story analysis files found!")
        print("💡 Run Step 1 first to generate stories")
        all_story_analyses = None
else:
    print("📂 Using stories from Step 1 (still in memory)")

if all_story_analyses:
    print(f"✅ Loaded {len(all_story_analyses)} story analyses")
    
    # Extract em dash sentences
    print("\n🔍 Extracting em dash sentences...")
    em_dash_sentences = extract_em_dash_sentences_from_analyses(all_story_analyses)
    
    if em_dash_sentences:
        print(f"✅ Found {len(em_dash_sentences)} sentences with em dashes")
        
        # Generate paraphrases
        print(f"\n🎨 Generating paraphrases...")
        paraphrase_results = process_em_dash_replacements_filtered(
            em_dash_sentences,
            max_word_difference=max_word_difference,
            max_paraphrases_per_sentence=max_paraphrases_per_sentence
        )
        
        if paraphrase_results:
            print(f"\n✅ Generated paraphrases for {len(paraphrase_results)} sentences successfully!")
            
            # Save paraphrases to JSON
            paraphrases_filename = save_to_json(
                paraphrase_results,
                "paraphrase_results",
                f"{len(paraphrase_results)} paraphrase results"
            )
            
            # Also save em dash sentences for reference
            em_dash_filename = save_to_json(
                em_dash_sentences,
                "em_dash_sentences", 
                f"{len(em_dash_sentences)} em dash sentences"
            )
            
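            # Each result holds one sentence and a list of paraphrases, so count the paraphrases themselves
            total_paraphrases = sum(r['paraphrase_count'] for r in paraphrase_results)
            print(f"\n📊 PARAPHRASING SUMMARY:")
            print(f"   • Em dash sentences: {len(em_dash_sentences)}")
            print(f"   • Sentences with valid paraphrases: {len(paraphrase_results)}")
            print(f"   • Valid paraphrases: {total_paraphrases}")
            print(f"   • Average paraphrases per sentence: {total_paraphrases/len(em_dash_sentences):.2f}")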
            
            print(f"\n🎯 Ready for Step 3: Final compilation")
            print(f"📁 Paraphrases saved in: {paraphrases_filename}")
            print(f"📁 Em dash sentences saved in: {em_dash_filename}")
            
        else:
            print("❌ No paraphrases generated!")
            print("💡 Try adjusting max_word_difference or generating more stories")
    else:
        print("❌ No em dash sentences found!")
        print("💡 Try generating more stories to find em dashes")
else:
    print("❌ Could not load story analyses")
    print("💡 Make sure Step 1 has been completed")

### Step 3: Compile Complete Analysis

In [None]:
# 📊 STEP 3: COMPILE COMPLETE ANALYSIS
print("📊 COMPILING COMPLETE ANALYSIS")
print("=" * 40)

# Load all data if not in memory
print("📂 Loading all pipeline data...")

# Load stories (from memory or file)
if 'all_story_analyses' not in locals() or not all_story_analyses:
    latest_stories_file = find_latest_file("story_analyses_*.json")
    if latest_stories_file:
        all_story_analyses = load_from_json(latest_stories_file, "story analyses")
    else:
        all_story_analyses = None

# Load paraphrases (from memory or file)
if 'paraphrase_results' not in locals() or not paraphrase_results:
    latest_paraphrases_file = find_latest_file("paraphrase_results_*.json")
    if latest_paraphrases_file:
        paraphrase_results = load_from_json(latest_paraphrases_file, "paraphrase results")
    else:
        paraphrase_results = None

# Load em dash sentences (from memory or file)
if 'em_dash_sentences' not in locals() or not em_dash_sentences:
    latest_em_dash_file = find_latest_file("em_dash_sentences_*.json")
    if latest_em_dash_file:
        em_dash_sentences = load_from_json(latest_em_dash_file, "em dash sentences")
    else:
        em_dash_sentences = None

# Compile complete analysis if all data is available
if all_story_analyses and paraphrase_results and em_dash_sentences:
    print(f"\n✅ All pipeline data loaded successfully!")
    print(f"   • Story analyses: {len(all_story_analyses)}")
    print(f"   • Em dash sentences: {len(em_dash_sentences)}")
    print(f"   • Paraphrase results: {len(paraphrase_results)}")
    
    # Compile comprehensive results
    complete_analysis_results = {
        'story_analyses': all_story_analyses,
        'em_dash_sentences': em_dash_sentences,
        'paraphrase_results': paraphrase_results,
        'pipeline_metadata': {
            'generation_timestamp': datetime.now().isoformat(),
            'total_stories': len(all_story_analyses),
            'total_em_dash_sentences': len(em_dash_sentences),
            'total_paraphrases': sum(len(result.get('paraphrases', [])) for result in paraphrase_results),
            'models_used': available_models,
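            'pipeline_config': {
                # Fall back to 'unknown' when an earlier step was not run in this session
                'max_word_difference': globals().get('max_word_difference', 'unknown'),
                'max_paraphrases_per_sentence': globals().get('max_paraphrases_per_sentence', 'unknown'),
                'num_story_rounds': globals().get('num_story_rounds', 'unknown')
            }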
        }
    }
    
    # Save complete analysis results
    analysis_filename = save_to_json(
        complete_analysis_results,
        "complete_analysis",
        "complete pipeline analysis"
    )
    
    print(f"\n🎉 PIPELINE COMPLETED SUCCESSFULLY!")
    print(f"📊 Final Statistics:")
    print(f"   • Total stories generated: {len(all_story_analyses)}")
    print(f"   • Total em dash sentences extracted: {len(em_dash_sentences)}")
    print(f"   • Total paraphrases generated: {complete_analysis_results['pipeline_metadata']['total_paraphrases']}")
    
    print(f"\n💾 Complete analysis saved to: {analysis_filename}")
    print(f"🎯 Ready for evaluation analysis!")
    print(f"💡 Use the evaluation_analysis.ipynb notebook to analyze the generated data")
    
else:
    print("❌ Could not load all required pipeline data!")
    print("💡 Make sure Steps 1 and 2 have been completed successfully")
    
    # Show what's available
    print(f"\n📁 Data availability:")
    print(f"   • Story analyses: {'✅' if all_story_analyses else '❌'}")
    print(f"   • Em dash sentences: {'✅' if em_dash_sentences else '❌'}")
    print(f"   • Paraphrase results: {'✅' if paraphrase_results else '❌'}")

## 7. Pipeline Summary

✅ **Pipeline Complete!**

After a successful run of all three steps, this notebook has generated:
- **Stories**: AI-generated stories from multiple models
- **Em Dash Sentences**: Extracted sentences containing em dashes
- **Paraphrases**: Alternative versions without em dashes
- **Complete Analysis**: Combined dataset for evaluation

**Output Files** (with timestamps):
- `story_analyses_YYYYMMDD_HHMMSS.json`
- `em_dash_sentences_YYYYMMDD_HHMMSS.json`
- `paraphrase_results_YYYYMMDD_HHMMSS.json`
- `complete_analysis_YYYYMMDD_HHMMSS.json`

**Next Steps:**
1. Use `evaluation_analysis.ipynb` to analyze the generated data
2. Generate visualizations and statistics
3. Conduct token change analysis
4. Create publication-ready results

---

**Notes:**
- Adjust `num_story_rounds` in Step 1 to generate more/fewer stories
- Modify `max_word_difference` and `max_paraphrases_per_sentence` in Step 2 for different filtering
- All intermediate data is saved, so you can re-run individual steps as needed