# Part 1: Generating a Diverse LLM Corpus for Frequency Analysis
## Session 1: From LLM Generation to Word Frequency (45 minutes)

**Learning Objectives:**
- Generate a diverse text corpus using an LLM to mirror the vocabulary found in large-scale datasets like the English Crowdsourcing Project (ECP).
- Understand how prompt engineering across different genres (news, technical, fiction) can create a representative word frequency list.
- Calculate word frequencies from the generated text to create a custom `llm_frequency` predictor.
- Export the frequency data for comparative analysis in the next session.

**Session Structure:**
- **Setup & API Configuration** (10 minutes)
- **Diverse Corpus Generation** (25 minutes)
- **Frequency Calculation & Export** (10 minutes)

---

💡 **Research Context:** Our goal is to create a high-quality `llm_frequency` predictor. The ECP dataset, which we use for validation in Notebook 2, contains a wide vocabulary from many sources (general use, dictionaries, etc.). To create a comparable predictor, we must generate a corpus that is equally diverse. A simple corpus (e.g., only children's stories) is insufficient. This session focuses on generating a varied corpus to capture a broad slice of the English language.
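
To make the diversity argument concrete, here is a minimal sketch that counts distinct word types per genre and for the combined texts. The two snippets are hand-written toy sentences, not LLM output, and the helper name `vocabulary` is an assumption introduced only for this illustration.

```python
import re
from typing import Dict, Set

def vocabulary(text: str) -> Set[str]:
    """Return the set of distinct lowercase word tokens in text."""
    return set(re.findall(r"\b\w+\b", text.lower()))

# Toy, hand-written snippets standing in for generated texts (illustrative only)
toy_texts: Dict[str, str] = {
    "fiction": "The dragon slept under the old stone bridge.",
    "technical": "The algorithm sorts the array in linear time.",
}

# One vocabulary per genre, plus the union across genres
per_genre = {genre: vocabulary(text) for genre, text in toy_texts.items()}
combined = set().union(*per_genre.values())

for genre, vocab in per_genre.items():
    print(f"{genre}: {len(vocab)} word types")
print(f"combined: {len(combined)} word types")
```

The two genres share almost no content words, so the combined vocabulary is close to the sum of the parts. A single-genre corpus would miss most of the other genre's words entirely.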

## 1.1 Setup and API Configuration (10 minutes)

Let's set up our environment and configure the API client to generate our corpus.


In [None]:
# Environment Setup and API Configuration
import os
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import time
from typing import Dict, List, Any

print("🚀 LLM Text Generation Session")
print("=" * 35)
print("Setting up environment for corpus generation...")

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Base environment configured")

In [None]:
# API Configuration for Multiple LLM Services

class LLMManager:
    """
    Unified interface for multiple LLM APIs
    """
    
    def __init__(self):
        self.apis = {
            'deepseek': {
                'endpoint': 'https://api.deepseek.com/v1/chat/completions',
                'model': 'deepseek-chat',
                'status': 'unconfigured'
            },
            'openai': {
                'endpoint': 'https://api.openai.com/v1/chat/completions',
                'model': 'gpt-3.5-turbo',
                'status': 'unconfigured'
            },
            'groq': {
                'endpoint': 'https://api.groq.com/openai/v1/chat/completions',
                'model': 'llama3-8b-8192',
                'status': 'unconfigured'
            }
        }
        
    def configure_api(self, service: str, api_key: str) -> bool:
        """
        Configure API key for a specific service
        """
        if service in self.apis:
            self.apis[service]['api_key'] = api_key
            self.apis[service]['status'] = 'configured'
            print(f"✅ {service.capitalize()} API configured")
            return True
        return False
    
    def test_connection(self, service: str) -> bool:
        """
        Test API connection with a simple request
        """
        if self.apis[service]['status'] != 'configured':
            print(f"❌ {service.capitalize()} not configured")
            return False
            
        try:
            # Simple test prompt
            response = self.generate_text(
                service=service,
                prompt="Say 'API test successful' in one sentence.",
                max_tokens=20
            )
            # Any non-empty reply shows the key and endpoint work; the exact wording may vary by model
            if response:
                self.apis[service]['status'] = 'active'
                print(f"✅ {service.capitalize()} connection verified")
                return True
        except Exception as e:
            print(f"❌ {service.capitalize()} connection failed: {str(e)[:50]}...")
            
        return False
    
    def generate_text(self, service: str, prompt: str, max_tokens: int = 150, 
                     temperature: float = 0.7) -> str:
        """
        Generate text using specified LLM service
        """
        if self.apis[service]['status'] not in ['configured', 'active']:
            raise ValueError(f"{service} API not properly configured")
            
        headers = {
            'Authorization': f'Bearer {self.apis[service]["api_key"]}',
            'Content-Type': 'application/json'
        }
        
        data = {
            'model': self.apis[service]['model'],
            'messages': [
                {'role': 'user', 'content': prompt}
            ],
            'max_tokens': max_tokens,
            'temperature': temperature
        }
        
        response = requests.post(
            self.apis[service]['endpoint'],
            headers=headers,
            json=data,
            timeout=30
        )
        
        if response.status_code == 200:
            result = response.json()
            return result['choices'][0]['message']['content'].strip()
        else:
            raise Exception(f"API request failed: {response.status_code} - {response.text}")
    
    def get_status(self) -> Dict[str, str]:
        """
        Get status of all configured APIs
        """
        return {service: config['status'] for service, config in self.apis.items()}

# Initialize LLM manager
llm_manager = LLMManager()

print("🔧 LLM Manager initialized")
print("Available services: DeepSeek (free), OpenAI (premium), Groq (backup)")
print("\n📝 Next: Configure your API keys below")

In [None]:
# --- Corpus Generation and Frequency Calculation ---

# A diverse set of prompts to generate a representative corpus
genre_prompts = {
    "Technical/Scientific": [
        "Explain the process of photosynthesis in a way a high school student could understand.",
        "Describe the basic principles of machine learning, including supervised and unsupervised learning.",
        "Write a brief summary of the theory of plate tectonics and its importance in geology."
    ],
    "News/Informative": [
        "Write a short news report about the opening of a new public library in a small town.",
        "Summarize the key findings of a recent study on the effects of remote work on employee productivity.",
        "Create a brief article about the cultural significance of the Silk Road."
    ],
    "Fiction/Creative": [
        "Write the opening paragraph of a mystery novel set in a futuristic city.",
        "Create a short story about a character who discovers a hidden world in their own backyard.",
        "Write a descriptive scene of a bustling marketplace in a fantasy kingdom."
    ],
    "General Knowledge/How-To": [
        "Explain how to bake a simple loaf of bread, listing the ingredients and steps.",
        "Write a short guide on the benefits of regular exercise for both physical and mental health.",
        "Describe the rules of a classic card game like 'Hearts' or 'Rummy'."
    ]
}

def generate_text_from_prompt(prompt: str, llm_manager, service: str) -> str:
    """Generates a text of about 200-300 words based on a prompt."""
    full_prompt = f"Please write a clear and well-structured text of about 200-300 words based on the following instruction: {prompt}"
    try:
        # Use the manager to generate text
        return llm_manager.generate_text(
            service=service,
            prompt=full_prompt,
            max_tokens=400,
            temperature=0.7
        )
    except Exception as e:
        print(f"Error generating text for prompt '{prompt[:30]}...': {e}")
        return ""

# 1. Generate the corpus if a service is active
if 'primary_service' in globals() and primary_service:
    print(f"🚀 Generating diverse corpus using '{primary_service}'...")
    corpus_texts = []
    for genre, prompts in genre_prompts.items():
        print(f"   - Generating texts for genre: {genre}")
        for p in prompts:
            corpus_texts.append(generate_text_from_prompt(p, llm_manager, primary_service))
            time.sleep(1) # Be respectful to the API
    
    full_corpus_text = " ".join(corpus_texts)
    print("✅ Corpus generation complete.")

    # 2. Calculate word frequencies
    print("\nCalculating word frequencies...")
    # Single backslashes in a raw string: \b is a word boundary, \w+ a run of word characters
    words = re.findall(r'\b\w+\b', full_corpus_text.lower())
    word_counts = Counter(words)
    total_words = len(words)
    llm_frequency = {word: count / total_words for word, count in word_counts.items()}
    print("Frequency calculation complete.")

    # 3. Create a DataFrame and export
    print("\nCreating DataFrame and exporting to CSV...")
    freq_df = pd.DataFrame(list(llm_frequency.items()), columns=['word', 'llm_frequency'])
    freq_df.to_csv('generated_frequency_predictors.csv', index=False)
    print("✅ Export complete to 'generated_frequency_predictors.csv'")

    # Display the top 10 most frequent words
    print("\n--- Top 10 Most Frequent Words ---")
    print(freq_df.sort_values(by='llm_frequency', ascending=False).head(10))

else:
    print("⚠️ Cannot generate corpus. No active LLM service.")
    print("   Please configure an API key in the 'API Key Configuration' cell, run it, then re-run this cell.")
    # Create a dummy freq_df for demonstration if needed
    freq_df = pd.DataFrame({'word': ['the', 'a', 'is'], 'llm_frequency': [0.1, 0.05, 0.04]})
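
The frequency step can be checked in isolation on a tiny input. The following is a minimal sketch using only the standard library; the function name `relative_frequencies` and the toy sentence are assumptions made for illustration, not part of the notebook.

```python
import re
from collections import Counter
from typing import Dict

def relative_frequencies(text: str) -> Dict[str, float]:
    """Return {word: count / total_tokens} for lowercase word tokens."""
    # \b marks a word boundary and \w+ matches a run of letters, digits or underscores
    tokens = re.findall(r"\b\w+\b", text.lower())
    total = len(tokens)
    if total == 0:
        return {}  # empty input: nothing to count
    counts = Counter(tokens)
    return {word: n / total for word, n in counts.items()}

sample = "The cat sat. The dog sat too."  # toy text, not LLM output
freqs = relative_frequencies(sample)

print(freqs["the"])          # 'the' is 2 of the 7 tokens
print(sum(freqs.values()))   # relative frequencies sum to 1 (up to rounding)
```

Note that `\w` does not match apostrophes, so a contraction such as "don't" is split into `don` and `t`. Whether that matters depends on how the reference word list treats contractions.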


In [None]:
# API Key Configuration
# 🔑 Configure your API keys here

print("🔑 API Key Configuration")
print("-" * 25)

# Option 1: DeepSeek (Free tier available)
# Get your free API key at: https://platform.deepseek.com/api_keys
deepseek_key = ""  # Enter your DeepSeek API key

# Option 2: OpenAI (Premium)
# Get your API key at: https://platform.openai.com/api-keys
openai_key = ""   # Enter your OpenAI API key

# Option 3: Groq (Free tier available)
# Get your API key at: https://console.groq.com/keys
groq_key = ""     # Enter your Groq API key

# Configure available services
configured_services = []

if deepseek_key:
    llm_manager.configure_api('deepseek', deepseek_key)
    configured_services.append('deepseek')

if openai_key:
    llm_manager.configure_api('openai', openai_key)
    configured_services.append('openai')

if groq_key:
    llm_manager.configure_api('groq', groq_key)
    configured_services.append('groq')

if not configured_services:
    print("⚠️ No API keys configured. Please add at least one API key above.")
    print("   Recommended: DeepSeek (free tier available)")
else:
    print(f"✅ Configured services: {', '.join(configured_services)}")
    
    # Test connections
    print("\n🔍 Testing API connections...")
    active_services = []
    
    for service in configured_services:
        if llm_manager.test_connection(service):
            active_services.append(service)
    
    if active_services:
        print(f"\n🎯 Ready for text generation with: {', '.join(active_services)}")
        primary_service = active_services[0]
        print(f"   Primary service: {primary_service}")
    else:
        print("\n❌ No working connections. Please check your API keys.")
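
As an alternative to pasting keys into the notebook, they can be read from environment variables. This is a minimal sketch under stated assumptions: the variable names (`DEEPSEEK_API_KEY`, `OPENAI_API_KEY`, `GROQ_API_KEY`) and the helper `load_api_keys` are hypothetical and are not defined anywhere in this notebook.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; adjust to your own setup
ENV_VAR_NAMES = {
    "deepseek": "DEEPSEEK_API_KEY",
    "openai": "OPENAI_API_KEY",
    "groq": "GROQ_API_KEY",
}

def load_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return {service: key} for every service whose variable is set and non-empty."""
    keys: Dict[str, str] = {}
    for service, var_name in ENV_VAR_NAMES.items():
        value = env.get(var_name, "").strip()
        if value:
            keys[service] = value
    return keys

found = load_api_keys()
print(f"Services with keys in the environment: {sorted(found) or 'none'}")
```

The result could then be passed to the manager with `for service, key in load_api_keys().items(): llm_manager.configure_api(service, key)`, which keeps secrets out of the saved notebook file.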

## 1.2 Guided Text Generation (25 minutes)

Now we'll systematically generate a research corpus, varying the writing style and sampling temperature from sample to sample.

In [None]:
# Research-Oriented Text Generation Framework

class CorpusGenerator:
    """
    Systematic corpus generation for psycholinguistic research
    """
    
    def __init__(self, llm_manager):
        self.llm_manager = llm_manager
        self.corpus_samples = {}
        
        # Research topics for diverse content
        self.research_topics = [
            "environmental protection and sustainability",
            "artificial intelligence and machine learning", 
            "climate change and global warming",
            "renewable energy technologies",
            "biodiversity conservation efforts",
            "scientific research methodology",
            "data analysis and statistics",
            "educational technology development"
        ]
    
    def generate_systematic_corpus(self, service: str, num_samples: int = 4):
        """
        Generate corpus with systematic parameter variation
        """
        
        print(f"📝 Generating Systematic Corpus using {service.capitalize()}")
        print("-" * 55)
        
        generation_configs = [
            {'name': 'Formal Academic', 'temp': 0.3, 'style': 'formal academic language'},
            {'name': 'General Informative', 'temp': 0.7, 'style': 'clear explanatory writing'},
            {'name': 'Accessible Science', 'temp': 0.9, 'style': 'accessible science communication'},
            {'name': 'Technical Detail', 'temp': 0.5, 'style': 'technical but readable style'}
        ]
        
        for i, config in enumerate(generation_configs[:num_samples]):
            topic = self.research_topics[i % len(self.research_topics)]
            
            prompt = f"""Write a 200-word informative text about {topic} using {config['style']}. 
            
            Requirements:
            - Focus on factual, educational content
            - Use varied vocabulary and sentence structures
            - Maintain scientific accuracy
            - Write for educated general audience
            
            Topic: {topic.title()}
            """
            
            try:
                print(f"   Generating {config['name']} text (T={config['temp']})...")
                
                text = self.llm_manager.generate_text(
                    service=service,
                    prompt=prompt,
                    max_tokens=300,
                    temperature=config['temp']
                )
                
                self.corpus_samples[config['name']] = {
                    'text': text,
                    'topic': topic,
                    'temperature': config['temp'],
                    'style': config['style'],
                    'word_count': len(text.split()),
                    'service': service
                }
                
                print(f"   ✅ Generated {len(text.split())} words")
                
                # Brief pause to be respectful to API
                time.sleep(1)
                
            except Exception as e:
                print(f"   ❌ Failed to generate {config['name']}: {str(e)[:50]}...")
        
        print(f"\n✅ Corpus generation complete: {len(self.corpus_samples)} samples")
        return self.corpus_samples
    
    def display_corpus_overview(self):
        """
        Display overview of generated corpus
        """
        
        if not self.corpus_samples:
            print("No corpus samples available.")
            return
        
        print("📊 Corpus Overview")
        print("=" * 20)
        
        total_words = sum(sample['word_count'] for sample in self.corpus_samples.values())
        
        print(f"   Total samples: {len(self.corpus_samples)}")
        print(f"   Total words: {total_words:,}")
        print(f"   Average words per sample: {total_words / len(self.corpus_samples):.1f}")
        
        print("\n📝 Sample Details:")
        for name, sample in self.corpus_samples.items():
            print(f"   {name}: {sample['word_count']} words (T={sample['temperature']})")
            print(f"      Topic: {sample['topic']}")
            print(f"      Preview: {sample['text'][:80]}...")
            print()

# Initialize corpus generator
corpus_gen = CorpusGenerator(llm_manager)

print("🏗️ Corpus Generator initialized")
print("Ready to generate systematic research corpus")
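
Generation calls can fail transiently (timeouts, rate limits), and the generator above simply skips a sample when that happens. One possible refinement is to retry with exponential backoff. The sketch below is an illustration only: `call_with_retries` is a hypothetical helper, and the failing function is simulated rather than a real API call.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")

def call_with_retries(func: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func(); after a failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")

# Simulated flaky call: fails twice, then succeeds
attempts: Dict[str, int] = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"

waits: List[float] = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook, the wrapped callable could be something like `lambda: llm_manager.generate_text(service=service, prompt=prompt)`. Injecting `sleep` keeps the helper testable without real delays.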

In [None]:
# Generate the research corpus

if 'primary_service' in globals():
    print("🚀 Starting Corpus Generation")
    print("=" * 35)
    
    # Generate corpus using primary service
    corpus_samples = corpus_gen.generate_systematic_corpus(primary_service, num_samples=4)
    
    # Display overview
    corpus_gen.display_corpus_overview()
    
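    # Save for use in the next notebook
    # NOTE: 'corpus_samples.json' is an assumed filename; adjust it to whatever Notebook 2 expects
    print("\n💾 Saving corpus for analysis in Notebook 2...")
    with open('corpus_samples.json', 'w', encoding='utf-8') as f:
        json.dump(corpus_samples, f, ensure_ascii=False, indent=2)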
    
else:
    print("⚠️ No active LLM service available.")
    print("   Please configure API keys in the 'API Key Configuration' cell and run again.")
    
    # Create sample data for demonstration
    print("\n🔄 Creating demonstration corpus...")
    corpus_samples = {
        'Formal Academic': {
            'text': "Environmental protection represents a critical challenge requiring systematic approaches to sustainability. Research demonstrates that comprehensive conservation strategies must integrate technological innovation with policy frameworks. Scientific evidence indicates that immediate action is necessary to address climate change impacts. Effective environmental management requires interdisciplinary collaboration among researchers, policymakers, and community stakeholders.",
            'topic': 'environmental protection and sustainability',
            'temperature': 0.3,
            'style': 'formal academic language',
            'word_count': 49,
            'service': 'demo'
        },
        'General Informative': {
            'text': "Artificial intelligence is transforming how we approach complex problems across many fields. Machine learning algorithms can process vast amounts of data to identify patterns and make predictions that would be impossible for humans alone. These technologies are already helping doctors diagnose diseases, scientists discover new materials, and engineers design more efficient systems. As AI continues to develop, it promises to unlock solutions to some of our most pressing challenges.",
            'topic': 'artificial intelligence and machine learning',
            'temperature': 0.7,
            'style': 'clear explanatory writing',
            'word_count': 69,
            'service': 'demo'
        }
    }
    
    corpus_gen.corpus_samples = corpus_samples
    corpus_gen.display_corpus_overview()

## 1.3 Quality Assessment Framework (10 minutes)

Now we'll assess the quality of our generated text and detect potential biases.

In [None]:
# Text Quality Assessment Framework

class QualityAssessment:
    """
    Comprehensive quality assessment for generated text
    """
    
    def __init__(self):
        # Bias detection word lists (simplified for demonstration)
        self.bias_indicators = {
            'gender': ['he', 'she', 'his', 'her', 'him', 'man', 'woman', 'male', 'female'],
            'political': ['liberal', 'conservative', 'democrat', 'republican', 'progressive'],
            'economic': ['rich', 'poor', 'wealthy', 'poverty', 'elite', 'class'],
            'cultural': ['western', 'eastern', 'american', 'european', 'traditional', 'modern']
        }
        
        self.quality_metrics = {}
    
    def assess_linguistic_quality(self, text: str) -> Dict[str, float]:
        """
        Assess basic linguistic quality metrics
        """
        
        # Basic text statistics
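        # Tokenize on word boundaries so trailing punctuation does not create spurious word types
        words = re.findall(r'\b\w+\b', text.lower())
        # Split on ., ! and ? so questions and exclamations count as sentences too
        sentences = re.split(r'[.!?]+', text)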
        
        # Remove empty sentences
        sentences = [s.strip() for s in sentences if s.strip()]
        
        # Calculate metrics
        word_count = len(words)
        sentence_count = len(sentences)
        avg_words_per_sentence = word_count / max(sentence_count, 1)
        
        # Vocabulary richness (Type-Token Ratio)
        unique_words = len(set(words))
        ttr = unique_words / max(word_count, 1)
        
        # Average word length
        avg_word_length = sum(len(word) for word in words) / max(word_count, 1)
        
        return {
            'word_count': word_count,
            'sentence_count': sentence_count,
            'avg_words_per_sentence': avg_words_per_sentence,
            'type_token_ratio': ttr,
            'avg_word_length': avg_word_length,
            'unique_words': unique_words
        }
    
    def detect_bias_indicators(self, text: str) -> Dict[str, Any]:
        """
        Detect potential bias indicators in text
        """
        
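        text_lower = text.lower()
        # Tokenize so indicators match whole words only ('he' must not match inside 'the')
        words = re.findall(r'\b\w+\b', text_lower)
        word_set = set(words)
        
        bias_detection = {}
        
        for bias_type, indicators in self.bias_indicators.items():
            found_indicators = [word for word in indicators if word in word_set]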
            
            bias_detection[bias_type] = {
                'indicators_found': found_indicators,
                'count': len(found_indicators),
                'density': len(found_indicators) / max(len(words), 1) * 100  # per 100 words
            }
        
        return bias_detection
    
    def assess_content_appropriateness(self, text: str) -> Dict[str, Any]:
        """
        Assess content appropriateness for research use
        """
        
        # Check for scientific language indicators
        scientific_terms = ['research', 'study', 'analysis', 'evidence', 'data', 
                          'method', 'results', 'findings', 'investigation', 'experiment']
        
        # Match whole words only, so 'hate' is not found inside 'whatever'.
        # Inflected forms such as 'studies' are not counted by this simple check.
        tokens = set(re.findall(r'\b\w+\b', text.lower()))
        scientific_count = sum(1 for term in scientific_terms if term in tokens)
        
        # Check for problematic content indicators
        problematic_terms = ['violence', 'hate', 'discrimination', 'controversial', 
                           'offensive', 'inappropriate']
        
        problematic_count = sum(1 for term in problematic_terms if term in tokens)
        
        return {
            'scientific_terms_count': scientific_count,
            'scientific_density': scientific_count / max(len(text.split()), 1) * 100,
            'problematic_terms_count': problematic_count,
            'appropriateness_score': max(0, scientific_count - problematic_count * 2)  # Simple scoring
        }
    
    def comprehensive_assessment(self, corpus_samples: Dict[str, Dict]) -> Dict[str, Any]:
        """
        Run comprehensive quality assessment on entire corpus
        """
        
        print("🔍 Comprehensive Quality Assessment")
        print("=" * 40)
        
        assessment_results = {}
        
        for sample_name, sample_data in corpus_samples.items():
            text = sample_data['text']
            
            print(f"\n📝 Assessing: {sample_name}")
            print("-" * (15 + len(sample_name)))
            
            # Run all assessments
            linguistic_quality = self.assess_linguistic_quality(text)
            bias_detection = self.detect_bias_indicators(text)
            content_appropriateness = self.assess_content_appropriateness(text)
            
            # Compile results
            assessment_results[sample_name] = {
                'linguistic_quality': linguistic_quality,
                'bias_detection': bias_detection,
                'content_appropriateness': content_appropriateness
            }
            
            # Display key metrics
            print(f"   📊 Linguistic Quality:")
            print(f"      Words: {linguistic_quality['word_count']}")
            print(f"      Sentences: {linguistic_quality['sentence_count']}")
            print(f"      Avg words/sentence: {linguistic_quality['avg_words_per_sentence']:.1f}")
            print(f"      Vocabulary richness (TTR): {linguistic_quality['type_token_ratio']:.3f}")
            
            print(f"   🎯 Content Assessment:")
            print(f"      Scientific terms: {content_appropriateness['scientific_terms_count']}")
            print(f"      Appropriateness score: {content_appropriateness['appropriateness_score']}")
            
            # Bias indicators summary
            total_bias_indicators = sum(b['count'] for b in bias_detection.values())
            print(f"   ⚖️ Bias Indicators: {total_bias_indicators} total detected")
            
            if total_bias_indicators > 0:
                for bias_type, detection in bias_detection.items():
                    if detection['count'] > 0:
                        print(f"      {bias_type.title()}: {detection['indicators_found']}")
            else:
                print(f"      ✅ No obvious bias indicators detected")
        
        return assessment_results

# Initialize quality assessment
quality_assessor = QualityAssessment()

print("🔧 Quality Assessment Framework initialized")
print("Ready to evaluate generated corpus")
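
Word-list checks like the ones above are sensitive to how matching is done: a plain substring test (`word in text`) also fires when an indicator appears inside a longer word. The following minimal sketch contrasts substring matching with whole-word matching; the helper `find_whole_word_indicators` and the toy sentence are assumptions introduced for illustration.

```python
import re
from typing import List

def find_whole_word_indicators(text: str, indicators: List[str]) -> List[str]:
    """Return the indicators that occur in text as complete word tokens."""
    tokens = set(re.findall(r"\b\w+\b", text.lower()))
    return [word for word in indicators if word in tokens]

sample = "The other method helps whatever the weather."  # toy sentence
indicators = ["he", "her", "hate"]

# Substring test: fires on 'the', 'other' and 'whatever'
substring_hits = [w for w in indicators if w in sample.lower()]
# Whole-word test: none of the indicators is a word of its own here
whole_word_hits = find_whole_word_indicators(sample, indicators)

print("substring matches: ", substring_hits)
print("whole-word matches:", whole_word_hits)
```

Even with whole-word matching, a hit only shows that a word is present. It says nothing about whether the text is actually biased, so these counts are best read as a prompt for manual review.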

In [None]:
# Run quality assessment on generated corpus

if 'corpus_samples' in globals() and corpus_samples:
    # Run comprehensive assessment
    assessment_results = quality_assessor.comprehensive_assessment(corpus_samples)
    
    print("\n✅ Quality assessment complete!")
    
    # Summary recommendations
    print("\n💡 Research Recommendations:")
    print("-" * 30)
    
    total_words = sum(sample['word_count'] for sample in corpus_samples.values())
    
    if total_words > 200:
        print("   ✅ Sufficient corpus size for preliminary analysis")
    else:
        print("   ⚠️ Consider generating more text for robust analysis")
    
    # Check vocabulary diversity
    avg_ttr = np.mean([assessment_results[name]['linguistic_quality']['type_token_ratio'] 
                      for name in assessment_results.keys()])
    
    if avg_ttr > 0.6:
        print("   ✅ Good vocabulary diversity across samples")
    else:
        print("   ⚠️ Consider varying prompts for more lexical diversity")
    
    # Check scientific content
    avg_scientific = np.mean([assessment_results[name]['content_appropriateness']['appropriateness_score'] 
                             for name in assessment_results.keys()])
    
    if avg_scientific > 2:
        print("   ✅ Appropriate scientific content for research use")
    else:
        print("   ⚠️ Consider more research-focused prompts")
    
    print("\n🎯 Ready for Notebook 2: Corpus Analysis and Validation")
    
else:
    print("⚠️ No corpus available for assessment.")
    print("   Please generate corpus in previous cells first.")

# 🎯 Session 1 Summary

## What We've Accomplished

✅ **API Configuration**: Set up multiple LLM services with robust error handling  
✅ **Systematic Generation**: Created research corpus with controlled parameter variation  
✅ **Quality Assessment**: Evaluated linguistic quality and detected potential biases  
✅ **Research Preparation**: Generated corpus ready for psycholinguistic validation  

## Key Learning Outcomes

1. **LLM API Mastery**: Understanding of different services and their characteristics
2. **Prompt Engineering**: Systematic approach to generating research-quality text
3. **Quality Control**: Framework for assessing and validating generated content
4. **Ethical Awareness**: Bias detection and content appropriateness evaluation

## Ready for Notebook 2

Our generated corpus is now ready for:
- **Frequency Analysis**: Comparison with reference corpora (SUBTLEX, Multilex)
- **Behavioral Validation**: Connection to human reading time databases
- **Statistical Modeling**: Cubic splines regression and surprisal analysis
- **Research Application**: Integration with established psycholinguistic methods

---

**Next**: Open **Notebook 2** to analyze our generated corpus and validate it against human behavioral data!