# 🚀 Quick Setup for Colab/Binder

**Running on Google Colab or Binder?** Run this cell first to set up the environment:

In [None]:
# Environment Setup (Run this first on Colab/Binder)
import sys
import os

# Check if we're in Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("🔧 Setting up Google Colab environment...")
    # If this cell is re-run, we are already inside the repo: skip clone and chdir
    if os.path.basename(os.getcwd()) != 'mlschool-text':
        # Clone the repository if not already present
        if not os.path.exists('mlschool-text'):
            !git clone https://github.com/jobschepens/mlschool-text.git
        os.chdir('mlschool-text')

    # Install requirements
    !pip install -r requirements_colab.txt
    print("✅ Colab setup complete!")

elif 'BINDER_LAUNCH_HOST' in os.environ:
    print("🔧 Binder environment detected - dependencies should already be installed")
    print("✅ Binder setup complete!")

else:
    print("💻 Local environment detected")
    print("Make sure you've run: pip install -r requirements.txt")

# Set working directory for consistent paths
if not os.getcwd().endswith('mlschool-text') and os.path.exists('mlschool-text'):
    os.chdir('mlschool-text')

print(f"📁 Working directory: {os.getcwd()}")
print("🎯 Ready to start! You can now run the rest of the notebook.")

# Part 1: Generating a Diverse LLM Corpus for Frequency Analysis
## Session 1: From LLM Generation to Word Frequency (45 minutes)

**Learning Objectives:**
- Generate a diverse text corpus using an LLM to mirror the vocabulary found in large-scale datasets like the English Crowdsourcing Project (ECP).
- Understand how prompt engineering across different genres (news, technical, fiction) can create a representative word frequency list.
- Calculate word frequencies from the generated text to create a custom `llm_frequency` predictor.
- Export the frequency data for comparative analysis in the next session.

**Session Structure:**
- **Setup & API Configuration** (10 minutes)
- **Diverse Corpus Generation** (25 minutes)
- **Frequency Calculation & Export** (10 minutes)

---

💡 **Research Context:** Our goal is to create a high-quality `llm_frequency` predictor. The ECP dataset, which we use for validation in Notebook 2, contains a wide vocabulary from many sources (general use, dictionaries, etc.). To create a comparable predictor, we must generate a corpus that is equally diverse. A simple corpus (e.g., only children's stories) is insufficient. This session focuses on generating a varied corpus to capture a broad slice of the English language.

## 1.1 Setup and API Configuration (10 minutes)

Let's set up our environment and configure the API client to generate our corpus.


In [4]:
# Environment Setup and API Configuration
import os
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import time
from typing import Dict, List, Any

print("🚀 LLM Text Generation Session")
print("=" * 35)
print("Setting up environment for corpus generation...")

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ Base environment configured")

🚀 LLM Text Generation Session
Setting up environment for corpus generation...
✅ Base environment configured


In [5]:
# Simplified API Function for OpenRouter
import os
import requests
import json
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

def call_openrouter(prompt: str, model_name: str = "openai/gpt-3.5-turbo") -> str:
    """
    Calls the OpenRouter API with a specified model and prompt.
    
    Args:
        prompt (str): The user prompt to send to the LLM.
        model_name (str): The model to use (e.g., "openai/gpt-3.5-turbo").
        
    Returns:
        str: The generated text content from the LLM.
    """
    api_key = os.getenv("OPENROUTER_API_KEY")
    if not api_key:
        raise ValueError("OPENROUTER_API_KEY not found in .env file. Please add it.")
        
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json',
        "HTTP-Referer": "http://localhost:3000", # Optional for analytics
        "X-Title": "ML-School-Text-Generation" # Optional for analytics
    }
    
    data = {
        'model': model_name,
        'messages': [{'role': 'user', 'content': prompt}]
    }
    
    try:
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers=headers,
            json=data,
            timeout=60
        )
        response.raise_for_status()
        result = response.json()
        return result['choices'][0]['message']['content'].strip()
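    except requests.exceptions.RequestException as e:
        # Raise instead of returning an error string, so callers never
        # mistake an error message for generated text
        raise RuntimeError(f"OpenRouter API request failed: {e}") from e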

# --- Test the function ---
print("🚀 Testing OpenRouter API function...")
try:
    test_prompt = "Write a single sentence about psycholinguistics."
    test_response = call_openrouter(test_prompt)
    print(f"✅ API Test Successful!")
    print(f"   Prompt: {test_prompt}")
    print(f"   Response: {test_response}")
except ValueError as e:
    print(f"❌ {e}")
except Exception as e:
    print(f"❌ An unexpected error occurred during the test: {e}")

🚀 Testing OpenRouter API function...
✅ API Test Successful!
   Prompt: Write a single sentence about psycholinguistics.
   Response: Psycholinguistics is the scientific study of how language is acquired, processed, and used in the mind.


The cell above gives us a way to communicate with a Large Language Model (LLM). Instead of a complex setup, it defines a single function that calls the **OpenRouter API**. OpenRouter provides access to many different models (GPT, Claude, Llama, and others) through one consistent interface.

**Key Steps:**
1.  **API Key**: The function requires an `OPENROUTER_API_KEY`. Create a `.env` file in the root of this project and add your key there.
2.  **Function Definition**: `call_openrouter` handles the API request.
3.  **Testing**: A quick test at the end of the cell confirms that everything is configured correctly.

*If the test fails, please ensure your `.env` file is set up correctly.*
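
If the test fails, a quick check of the key itself can narrow down the problem. The snippet below is a minimal sketch and not part of the course code: `describe_api_key` is a hypothetical helper that reports whether the variable is set without printing the secret. It assumes the `.env` file contains a single line of the form `OPENROUTER_API_KEY=your-key-here` and that `load_dotenv()` has already been called, so the value is visible in `os.environ`.

```python
import os

def describe_api_key(environ=None, name="OPENROUTER_API_KEY"):
    """Return a short, non-secret status string for the API key (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        return f"{name} is missing or empty"
    # Show only the last four characters so the full key never appears in output
    return f"{name} is set (ends with ...{key[-4:]}, {len(key)} characters)"

print(describe_api_key())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching real credentials.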

## 1.2 Interactive Text Generation (20 minutes)

Now for the hands-on part. We will generate short text samples using different models and prompts to understand how these choices influence the resulting text. This is the core of corpus generation methodology.

**Our goals here are to:**
- **Explore Model Differences**: See how different models (e.g., a creative one vs. a technical one) respond to the same prompt.
- **Understand Prompt Engineering**: Learn how small changes to a prompt can alter the style, vocabulary, and complexity of the output.
- **Observe Corpus Characteristics**: Get a feel for the kind of text each strategy produces.

We'll start by defining a few models available through OpenRouter.

In [None]:
# --- Model and Prompt Experimentation ---
import json

# Define the specific models we want to test in this notebook.
# We will filter the larger list from models.json using these IDs.
selected_model_ids = {
    "qwen/qwen3-30b-a3b-instruct-2507",
    "meta-llama/llama-3.3-8b-instruct:free", # Note: Using the free tier ID from the JSON
    "deepseek/deepseek-chat-v3.1:free"      # Note: Using the free tier ID from the JSON
}

# Load models from the JSON file
try:
    with open('../models.json', 'r') as f:
        models_data = json.load(f)['models']
    
    # Create a full dictionary of available models, preferring free tiers
    available_models = {}
    for model in models_data:
        if model.get('type') != 'image_generation':
            model_id = model.get('free_tier', {}).get('id', model['id'])
            available_models[model['name']] = model_id
            
    # Filter the available models to get only the ones we selected
    models_to_test = {
        name: model_id for name, model_id in available_models.items()
        if model_id in selected_model_ids
    }
    
    print(f"✅ Successfully loaded and selected {len(models_to_test)} models from 'models.json'.")
    if len(models_to_test) < len(selected_model_ids):
        print("   ⚠️ Some selected models were not found in 'models.json'.")

except (FileNotFoundError, json.JSONDecodeError, KeyError) as e:
    print(f"⚠️ Could not load models from 'models.json': {e}")
    print("   Falling back to default models.")
    # Fallback to a default set if the file is missing or invalid
    models_to_test = {
        "OpenAI GPT-3.5": "openai/gpt-3.5-turbo",
        "Llama3 8B": "meta-llama/llama-3-8b-instruct"
    }

# Define a simple prompt
prompt = "Write a short, imaginative story about a librarian who discovers a book that writes itself."

print("\n🧪 Running text generation experiment...")
print("=" * 40)
print(f'Prompt: "{prompt}"')
print("-" * 40)

# Generate text from each model
for name, model_id in models_to_test.items():
    print(f"🤖 Generating with {name} ({model_id})...")
    try:
        generated_text = call_openrouter(prompt, model_name=model_id)
        print("✅ Success!")
        print(f"   Response:\n---\n{generated_text}\n---\n")
    except Exception as e:
        print(f"❌ Failed to generate text with {name}: {e}")

print("✅ Experiment complete!")

✅ Successfully loaded and selected 3 models from 'models.json'.

🧪 Running text generation experiment...
Prompt: "Write a short, imaginative story about a librarian who discovers a book that writes itself."
----------------------------------------
🤖 Generating with Qwen3-30B-A3B-Instruct-2507 (qwen/qwen3-30b-a3b-instruct-2507)...
✅ Success!
   Response:
---
Elara Thorne’s fingers brushed the spine of *The Forgotten Language of Starfall*, a title she’d never seen cataloged. It sat nestled between two crumbling volumes of 18th-century botany, its leather cover warm and subtly glowing, as if breathing.

She’d just closed the Rare Manuscripts wing for the night. Rain lashed the tall windows, turning the city into a blur of neon smudges. The library was hers, quiet save for the rhythmic tick of the grandfather clock in the lobby. She opened the book.

No ink. No paper. Just a smooth, blank page that shimmered like liquid mercury.

Then, a single word bloomed in the center: *Whisper.*

El

### Analysis and Exploration

Look at the outputs from the different models. Consider the following questions:
- **Vocabulary**: Which model used more common words? Which used more unusual or complex words?
- **Style**: Did one model produce a more creative or narrative style? Was another more direct or factual?
- **Length & Structure**: Were there noticeable differences in sentence length or structure?

This hands-on exploration is the first step in understanding how to build a corpus that is well-suited for psycholinguistic analysis. A good corpus needs to be diverse enough to capture the wide range of words people encounter in their daily lives.
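
To put rough numbers on these questions, the outputs can be compared with a few simple statistics. The following is a minimal sketch, not part of the course code: `text_stats` is a hypothetical helper, and its tokenization and sentence splitting are deliberately naive (a regular expression for words, a split on sentence-final punctuation), so treat the values as approximate.

```python
import re

def text_stats(text):
    """Compute rough lexical statistics for one generated text (hypothetical helper)."""
    # Lowercased alphabetic tokens; apostrophes are kept so "she'd" stays one token
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    n_words = len(words)
    return {
        "words": n_words,
        # Share of distinct words: higher values suggest a more varied vocabulary
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "mean_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }

sample = "No ink. No paper. Just a smooth, blank page that shimmered like liquid mercury."
print(text_stats(sample))
```

Note that the type-token ratio depends strongly on text length, so it is only comparable between outputs of similar size.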

## 1.3 Large-Scale Corpus Generation (via Script)

While we can generate small amounts of text interactively in this notebook, creating a multi-million word corpus requires a dedicated script that can run for a long time, save its progress, and handle potential interruptions.

For this purpose, we will use `scripts/script-1-gen.py`.

**How it Works:**
1.  **Configuration File**: The script's behavior is controlled by a `.json` configuration file. We have created `scripts/config_2m_llama.json` specifically for generating our 2-million-word Llama 3.3 corpus.
2.  **Seed vs. No Seed Words**: The script can use general, self-contained prompts across different genres, which yields a diverse corpus that is not biased by a specific word list. Alternatively, it can use seed words to "inspire" the LLM to write in a certain direction.
3.  **State Management**: It continuously saves its progress to a state file (`.json`), so if the script is stopped, it can be resumed later without losing work or money.
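
The state-management idea can be illustrated with a minimal sketch. This is not the actual implementation of `scripts/script-1-gen.py`: the function names, the state keys (`next_prompt`, `words_generated`), and the stand-in generator are all assumptions made for illustration. The point is only that progress is written to a JSON file after every response, so a second run picks up where the first one stopped.

```python
import json
import os
import tempfile

def load_state(path):
    """Return saved progress, or a fresh state if no state file exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"next_prompt": 0, "words_generated": 0}

def save_state(path, state):
    """Write to a temporary file first, then swap it in, to avoid half-written state."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def run_generation(prompts, generate, state_path, corpus_path, target_words):
    """Generate text prompt by prompt, saving progress after every response."""
    state = load_state(state_path)
    while state["next_prompt"] < len(prompts) and state["words_generated"] < target_words:
        text = generate(prompts[state["next_prompt"]])
        with open(corpus_path, "a", encoding="utf-8") as corpus:
            corpus.write(text + "\n")
        state["next_prompt"] += 1
        state["words_generated"] += len(text.split())
        save_state(state_path, state)
    return state

def fake_generate(prompt):
    """Stand-in for a real API call such as call_openrouter (assumption for the demo)."""
    return f"Response to: {prompt}"

workdir = tempfile.mkdtemp()
state_file = os.path.join(workdir, "state.json")
corpus_file = os.path.join(workdir, "corpus.txt")
prompts = [f"Write a {genre} paragraph." for genre in ("news", "technical", "fiction")]

# The first run stops early at a small word target; the second run resumes from the state file
first = run_generation(prompts, fake_generate, state_file, corpus_file, target_words=6)
resumed = run_generation(prompts, fake_generate, state_file, corpus_file, target_words=100)
print(first)
print(resumed)
```

Saving after every response costs a little disk I/O but means that an interruption loses at most one API call.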

**To run the generation process, you would execute the following command in your terminal:**
```bash
python scripts/script-1-gen.py --config scripts/config_2m_llama.json
```

This long-running process writes the `large_corpus_2m_llama.txt` file to the `output/` directory; the next stages of our research depend on this file. To change settings, create a new config file and pass it via `--config`. For this summer school session, a corpus has been pre-generated for you.
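
Once a corpus file exists, the frequency step named in the learning objectives can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names, the tokenization rule, and the CSV column names are hypothetical, and which column ultimately serves as the `llm_frequency` predictor is a choice for the analysis in Notebook 2.

```python
import csv
import math
import os
import re
import tempfile
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens (simple regex tokenization, an assumption)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

def frequency_table(counts):
    """Return rows of (word, count, per-million frequency, log10 of per-million)."""
    total = sum(counts.values())
    rows = []
    for word, count in counts.most_common():
        per_million = count / total * 1_000_000
        rows.append((word, count, per_million, math.log10(per_million)))
    return rows

def export_csv(rows, path):
    """Write the frequency table to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count", "per_million", "log10_per_million"])
        writer.writerows(rows)

# Tiny stand-in text; for the real corpus, read the generated text file instead
sample = "The cat sat on the mat. The mat was warm."
counts = word_frequencies(sample)
rows = frequency_table(counts)
out_path = os.path.join(tempfile.mkdtemp(), "llm_frequency_demo.csv")
export_csv(rows, out_path)
print(rows[:3])
```

Frequencies are commonly compared on a logarithmic scale, which is why the table includes a log10 column alongside the raw counts.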

## Session 1 Summary & Next Steps

**What We've Accomplished:**
- **Simplified API Access**: We set up a clean, reusable function to connect to various LLMs via OpenRouter.
- **Interactive Exploration**: We experimented with different models and prompts to see how they affect text generation.
- **Large-Scale Corpus Generation**: We saw how `scripts/script-1-gen.py` builds a large, diverse corpus from a configuration file, saving its progress so that interrupted runs can be resumed.

We will now move on to **Notebook 2**, where we will:
- Load our custom-generated frequency predictor.
- Compare it against established, human-derived frequency norms (like SUBTLEX).
- Validate its predictive power against real human reading time data.

---

**Next**: Open `notebook2_corpus_analysis.ipynb` to begin the analysis phase.