# Task 0: The Library of Babel

Building a dataset with three classes:
- **Class 1**: Human text (Austen + Gaskell novels)
- **Class 2**: AI text on same topics (neutral style)
- **Class 3**: AI text on same topics (mimicking author styles)

Goal: Make authorship the key differentiator, not topic. All paragraphs are 100-200 words.

## Notebook Overview

This notebook generates the dataset for Task 0. It's organized into:

1. **Setup** - Imports, paths, and configuration
2. **Class 1 (Human)** - Extract and process text from novels
3. **Topic Extraction** - Identify shared themes
4. **Class 2 (AI-Neutral)** - Generate paragraphs without style mimicry
5. **Class 3 (AI-Styled)** - Generate paragraphs mimicking author styles

**Output Files:**
- `class1_human_data_try1.json`
- `class2_ai_data_try1.json`
- `class3_ai_data_try1.json`
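
A quick sanity check on any of these files could look like the sketch below. It assumes each file is a JSON list of records with a `text` field, as written by the save cells in this notebook. `validate_records` is a hypothetical helper, not part of the original code, and the demo runs on a toy file rather than the real outputs.

```python
import json
import os
import tempfile

def validate_records(path, min_words=100, max_words=200):
    """Return indices of records whose `text` falls outside the word range."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return [
        i for i, rec in enumerate(records)
        if not (min_words <= len(rec["text"].split()) <= max_words)
    ]

# Toy file standing in for one of the real output files
toy = [{"text": "word " * 150}, {"text": "too short"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(toy, tmp)

bad_indices = validate_records(tmp.name)
os.remove(tmp.name)
print(bad_indices)  # [1]
```

Running this against each output file would flag any paragraph that slipped outside the 100-200 word target.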

## Step 1: Choose Authors

Using Jane Austen and Elizabeth Gaskell because:
- Both write about similar topics (family, class, morality, domestic life)
- Different styles (Austen: ironic, restrained; Gaskell: emotional, descriptive)
- Same time period and language (19th-century British English)

This keeps topic constant while varying style.

### Why These Specific Authors?

Could have used other author pairs (e.g., Dickens + Thackeray, Brontë sisters). Chose Austen + Gaskell because:
1. **Topic overlap** - Both write about domestic life, marriage, class
2. **Different styles** - Austen is ironic/restrained, Gaskell is emotional/descriptive
3. **Same era** - Both 19th century, similar language baseline
4. **Availability** - Clean Project Gutenberg texts available

## Step 2: Source Material

Three novels per author from Project Gutenberg:

**Jane Austen**
- Pride and Prejudice
- Sense and Sensibility
- Emma

**Elizabeth Gaskell**
- North and South
- Mary Barton
- Cranford

## Step 3: Preprocessing Pipeline

1. Parse HTML and remove Project Gutenberg boilerplate
2. Normalize text (fix drop caps, remove page markers)
3. Clean paragraphs
4. Re-chunk into 100-200 word windows

## Imports

In [1]:
import os
import json
import time
import uuid
import re
import hashlib
import random
import numpy as np
from pathlib import Path
from bs4 import BeautifulSoup
from itertools import product
from google import genai

All required libraries are imported here at the top for clarity. If running this notebook in a fresh environment, install:
```
pip install beautifulsoup4 numpy google-genai nltk
```

## Setup Paths

This will work when cloned from GitHub. It checks if we're in the project root or need to go into PreCog_Task folder.

In [2]:
PROJECT_ROOT = Path.cwd()

# If data folder doesn't exist in current directory, try PreCog_Task subfolder
if not (PROJECT_ROOT / "data").exists():
    if (PROJECT_ROOT / "PreCog_Task" / "data").exists():
        PROJECT_ROOT = PROJECT_ROOT / "PreCog_Task"
    else:
        raise FileNotFoundError("Cannot find data folder. Make sure you're in the right directory.")

print(f"Using project root: {PROJECT_ROOT}")

Using project root: /home/venya-velmurugan/Documents/PreCog_Task


In [3]:
list((PROJECT_ROOT / "data" / "raw" ).iterdir())


[PosixPath('/home/venya-velmurugan/Documents/PreCog_Task/data/raw/austen'),
 PosixPath('/home/venya-velmurugan/Documents/PreCog_Task/data/raw/gaskell')]

In [4]:
BOOKS = {
    "austen": {
        "pride_and_prejudice": PROJECT_ROOT / "data/raw/austen/pride_and_prejudice.html",
        "emma": PROJECT_ROOT / "data/raw/austen/emma.html",
        "sense_and_sensibility": PROJECT_ROOT / "data/raw/austen/sense_and_sensibility.html",
    },
    "gaskell": {
        "cranford": PROJECT_ROOT / "data/raw/gaskell/cranford.html",
        "mary_barton": PROJECT_ROOT / "data/raw/gaskell/mary_barton.html",
        "north_and_south": PROJECT_ROOT / "data/raw/gaskell/north_and_south.html",
    },
}

## Text Cleaning Functions

These utility functions handle common cleanup issues in Project Gutenberg files:
- Drop caps (e.g., "I T is" → "It is")
- Page markers (e.g., {45}, {xxiii})
- Formatting artifacts
- Chapter headings
- License text
- "The end" and similar closing markers
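
To make the first two items concrete, here is a small sketch that applies the same two regular expressions used by the helper functions in this section to made-up strings. The sample strings are illustrative assumptions, not lines from the actual books.

```python
import re

# Drop-cap repair: "I T is" -> "It is" (same pattern as normalize_leading_split_word)
drop_cap_fixed = re.sub(
    r"^([A-Z])\s+([A-Z])\s+",
    lambda m: m.group(1) + m.group(2).lower() + " ",
    "I T is a truth universally acknowledged",
)

# Page-marker removal: {45}, {xxiii} (same pattern as remove_page_markers)
markers_removed = re.sub(
    r"\{\s*[0-9ivxlcdm]+\s*\}",
    "",
    "She paused{45} and went on{xxiii}.",
    flags=re.IGNORECASE,
)

print(drop_cap_fixed)   # It is a truth universally acknowledged
print(markers_removed)  # She paused and went on.
```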

In [5]:
def normalize_leading_split_word(text):
    return re.sub(
        r"^([A-Z])\s+([A-Z])\s+",
        lambda m: m.group(1) + m.group(2).lower() + " ",
        text
    )

def remove_page_markers(text):
    
    return re.sub(
        r"\{\s*[0-9ivxlcdm]+\s*\}",
        "",
        text,
        flags=re.IGNORECASE
    )


In [6]:
def merge_dropcaps(blocks):
    merged = []
    i = 0

    while i < len(blocks):
        curr = blocks[i].strip()

        if len(curr) == 1 and curr.isalpha() and i + 1 < len(blocks):
            nxt = blocks[i + 1].lstrip()
            m = re.match(r"^([A-Z])\s+(.*)", nxt)
            if m:
                merged.append(curr + m.group(1).lower() + " " + m.group(2))
            else:
                merged.append(curr + nxt)
            i += 2
        else:
            merged.append(curr)
            i += 1

    return merged


In [7]:

def extract_blocks(soup):
    blocks = []
    for tag in soup.find_all(["p", "h1", "h2", "h3"]):
        text = tag.get_text(" ", strip=True)
        if text:
            blocks.append(text)
            
    return blocks


def is_true_chapter_heading(text):
    t = text.strip()
    return bool(
        re.match(
            r"^(CHAPTER|Chapter)\s+([IVXLCDM]+|\d+)\b",
            t
        )
    )


In [8]:
def is_gutenberg_license(text):
    t = text.strip().upper()
    return (
        "PROJECT GUTENBERG LICENSE" in t
        or "THE FULL PROJECT GUTENBERG LICENSE" in t
    )


In [9]:
def is_explicit_end_marker(text):
    t = text.strip().upper()
    return (
        t == "THE END."
        or t == "THE END"
        or t.startswith("THE END.")
        or "*** END OF THE PROJECT GUTENBERG EBOOK" in t
        or "WALTER SCOTT PRESS" in t
    )


## Load and Clean Books

In [10]:
BOOK_PATH = PROJECT_ROOT / "data" / "raw" / "austen" / "pride_and_prejudice.html"

with open(BOOK_PATH, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")


In [11]:
raw_paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
len(raw_paragraphs), raw_paragraphs[:5]


(2146,
 ['Title: Pride and Prejudice',
  'Author: Jane Austen',
  'Release date: June 1, 1998 [eBook #1342]\n                Most recently updated: September 22, 2025',
  'Language: English',
  'Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)'])

### Test the Cleaning Pipeline

Load one book and verify the cleaning works correctly.

## Remove Project Gutenberg Boilerplate

Project Gutenberg files have headers, legal notices, and end markers that need to be removed.

In [12]:
def clean_gutenberg_html(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    for tag in soup(["script", "style"]):
        tag.decompose()

    blocks = extract_blocks(soup)
    blocks = merge_dropcaps(blocks)

    start_idx = None
    for i, block in enumerate(blocks):
        if is_true_chapter_heading(block):
            start_idx = i + 1
            break

    if start_idx is None:
        raise ValueError("No chapter heading found - HTML structure unexpected")

    end_idx = len(blocks)
    for i, block in enumerate(blocks):
        if "END OF THE PROJECT GUTENBERG EBOOK" in block.upper():
            end_idx = i
            break

    story = blocks[start_idx:end_idx]

    final_paragraphs = []
    for i, p in enumerate(story):
        if i == 0:
            final_paragraphs.append(p)
        elif len(p.split()) > 10 and not is_true_chapter_heading(p):
            final_paragraphs.append(p)

    final_paragraphs = [
        normalize_leading_split_word(remove_page_markers(p)).strip()
        for p in final_paragraphs
    ]

    while final_paragraphs and is_explicit_end_marker(final_paragraphs[-1]):
        final_paragraphs.pop()

    return final_paragraphs


clean_data = clean_gutenberg_html(BOOK_PATH)

### Main Cleaning Function

`clean_gutenberg_html` (defined in the cell above) does the heavy lifting:
1. Parse HTML with BeautifulSoup
2. Extract text blocks (paragraphs and headings)
3. Find where the actual story starts (first chapter heading)
4. Find where it ends (Project Gutenberg end marker)
5. Filter out chapter headings and fragments of 10 words or fewer (this also drops short lines of dialogue)
6. Apply text normalization
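
A rough way to confirm the steps above worked is to scan the cleaned paragraphs for leftovers. The sketch below is an assumption about what such a check could look like. `find_residue` is a hypothetical helper, and the toy list stands in for real cleaned output.

```python
import re

PAGE_MARKER = re.compile(r"\{\s*[0-9ivxlcdm]+\s*\}", re.IGNORECASE)
CHAPTER = re.compile(r"^(CHAPTER|Chapter)\s+([IVXLCDM]+|\d+)\b")

def find_residue(paragraphs):
    """Return (index, reason) pairs for paragraphs that still look unclean."""
    problems = []
    for i, p in enumerate(paragraphs):
        if PAGE_MARKER.search(p):
            problems.append((i, "page marker"))
        elif CHAPTER.match(p.strip()):
            problems.append((i, "chapter heading"))
        elif "PROJECT GUTENBERG" in p.upper():
            problems.append((i, "gutenberg boilerplate"))
    return problems

toy = [
    "It is a truth universally acknowledged.",
    "CHAPTER II",
    "He left {12} at once.",
]
residue = find_residue(toy)
print(residue)  # [(1, 'chapter heading'), (2, 'page marker')]
```

An empty result on the real data would suggest the boilerplate and marker removal did its job.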

In [13]:
sample = clean_gutenberg_html(BOOKS["austen"]["pride_and_prejudice"])

print("--- FIRST PARAGRAPH ---")
print(sample[0])

print("\n--- LAST PARAGRAPH ---")
print(sample[-1])


--- FIRST PARAGRAPH ---
It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.

--- LAST PARAGRAPH ---
With the Gardiners they were always on the most intimate terms. Darcy,
as well as Elizabeth, really loved them; and they were both ever
sensible of the warmest gratitude towards the persons who, by bringing
her into Derbyshire, had been the means of uniting them.


### Process All Books

This cell loads and cleans all 6 books (3 per author).

In [14]:
all_clean_data = {}

for author, books in BOOKS.items():
    for book, path in books.items():
        paras = clean_gutenberg_html(path)
        all_clean_data[(author, book)] = paras
        print(f"{author}/{book}: {len(paras)} paragraphs")

austen/pride_and_prejudice: 1832 paragraphs
austen/emma: 2115 paragraphs
austen/sense_and_sensibility: 1592 paragraphs
gaskell/cranford: 611 paragraphs
gaskell/mary_barton: 2527 paragraphs
gaskell/north_and_south: 2629 paragraphs


In [15]:
for (author, book), paras in all_clean_data.items():
    print(f"\n{author.upper()} - {book}")
    print("FIRST:", paras[0])
    print("LAST :", paras[-1])


AUSTEN - pride_and_prejudice
FIRST: It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.
LAST : With the Gardiners they were always on the most intimate terms. Darcy,
as well as Elizabeth, really loved them; and they were both ever
sensible of the warmest gratitude towards the persons who, by bringing
her into Derbyshire, had been the means of uniting them.

AUSTEN - emma
FIRST: Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy
disposition, seemed to unite some of the best blessings of existence; and had
lived nearly twenty-one years in the world with very little to distress or vex
her.
LAST : The wedding was very much like other weddings, where the parties have no taste
for finery or parade; and Mrs. Elton, from the particulars detailed by her
husband, thought it all extremely shabby, and very inferior to her
own.—“Very little white satin, very few lace veils; a most pitiful
business!—Selina w

## Verify Consistency Across Books

Before proceeding, we verify that preprocessing didn't introduce artifacts. We compute the average sentence length for each book and then average those values per author. The two authors should come out roughly similar, so that sentence length alone does not become a trivial differentiating factor.

In [16]:
def avg_sentence_length(paragraphs):
    lengths = []
    for p in paragraphs:
        sentences = re.split(r"[.!?]", p)
        for s in sentences:
            if len(s.split()) > 3:
                lengths.append(len(s.split()))
    return np.mean(lengths)

for author in ["austen", "gaskell"]:
    vals = []
    for (a, b), paras in all_clean_data.items():
        if a == author:
            vals.append(avg_sentence_length(paras))
    print(author, "avg sentence length:", np.mean(vals))


austen avg sentence length: 19.674943900032964
gaskell avg sentence length: 22.09862320992688


## Re-chunk Into Fixed-Size Paragraphs

Original paragraphs vary in length. We need consistent 100-200 word chunks for fair comparison.

In [17]:
def chunk_into_windows(paragraphs, min_words=100, max_words=200):
    chunks = []
    current = []

    for p in paragraphs:
        words = p.split()
        if len(current) + len(words) <= max_words:
            current.extend(words)
        else:
            if len(current) >= min_words:
                chunks.append(" ".join(current))
            current = words

    if len(current) >= min_words:
        chunks.append(" ".join(current))

    return chunks


### Chunking Function

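Original paragraphs in novels vary wildly in length (some are 500+ words, others are 20 words).

`chunk_into_windows` merges consecutive paragraphs into windows of up to 200 words and keeps a window only if it reaches 100 words. It never splits inside a paragraph, so sentences stay intact, with two caveats: a single paragraph longer than 200 words becomes one oversized chunk, and a pending window shorter than 100 words is dropped when the next paragraph does not fit. The goal is:
- Fair comparison across classes
- Models can't use paragraph length as a shortcut
- All samples have similar statistical properties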
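
A stricter variant could pack sentences instead of whole paragraphs, so that no chunk exceeds the upper bound even when a source paragraph is very long. The sketch below is a hypothetical alternative (`chunk_by_sentences` is not part of this notebook). It uses a naive regex sentence splitter and simply drops any window outside the range, including a single sentence longer than `max_words`.

```python
import re

def chunk_by_sentences(paragraphs, min_words=100, max_words=200):
    """Pack whole sentences into windows of min_words..max_words words."""
    chunks, current = [], []

    def flush():
        # Keep the window only if it lands inside the target range
        if min_words <= len(current) <= max_words:
            chunks.append(" ".join(current))

    for p in paragraphs:
        # Naive split: whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", p.strip()):
            words = sentence.split()
            if not words:
                continue
            if current and len(current) + len(words) > max_words:
                flush()
                current.clear()
            current.extend(words)
    flush()
    return chunks

# One 350-word "paragraph" made of 35 ten-word sentences
sample = " ".join(["One two three four five six seven eight nine ten."] * 35)
sizes = [len(c.split()) for c in chunk_by_sentences([sample])]
print(sizes)  # [200, 150]
```

The trade-off is that windows now cross paragraph boundaries freely, and the regex splitter will mis-handle abbreviations such as "Mr." or "Mrs.".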

In [18]:
all_chunks = {}

for key, paras in all_clean_data.items():
    all_chunks[key] = chunk_into_windows(paras)
    print(key, len(all_chunks[key]))

print("\n--- SAMPLE CHUNKS ---\n")
for key, chunks in all_chunks.items():
    print(f"{key} - Total Chunks: {len(chunks)}")
    print("First Chunk:", chunks[0])
    print("Last Chunk :", chunks[-1])
    print()

('austen', 'pride_and_prejudice') 621
('austen', 'emma') 806
('austen', 'sense_and_sensibility') 612
('gaskell', 'cranford') 323
('gaskell', 'mary_barton') 822
('gaskell', 'north_and_south') 870

--- SAMPLE CHUNKS ---

('austen', 'pride_and_prejudice') - Total Chunks: 621
First Chunk: It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters. “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last? ” “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” “Do not you want to know who has taken it?” cried his wife, impatiently. “ You want to tell me, and I have no objection to hearing it.”

## Save Class 1 (Human Data)

In [19]:
human_data = []

for (author, book), chunks in all_chunks.items():
    for chunk in chunks:
        human_data.append({
            "text": chunk,
            "label": "human",
            "author": author,
            "source": book
        })

print("Total human paragraphs:", len(human_data))
print("Sample entry:\n", human_data[0])


Total human paragraphs: 4054
Sample entry:
 {'text': 'It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters. “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last? ” “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” “Do not you want to know who has taken it?” cried his wife, impatiently. “ You want to tell me, and I have no objection to hearing it.”', 'label': 'human', 'author': 'austen', 'source': 'pride_and_prejudice'}


### Apply Chunking

The cells above re-chunk all cleaned books into 100-200 word windows and label them; the next cell writes the result to disk.

In [20]:
import json

HUMAN_OUTPUT_FILE = "class1_human_data_try1.json"

with open(HUMAN_OUTPUT_FILE, "w", encoding="utf-8") as f:
    json.dump(human_data, f, indent=2, ensure_ascii=False)

print(f"Human data saved to {HUMAN_OUTPUT_FILE}")


Human data saved to class1_human_data_try1.json


## Topic Extraction

Need to identify 5-10 high-level topics that appear in both authors' work. These will be used to generate AI text on the same themes.

### Approach

Using a combination of personal reading experience and exploratory analysis (online summaries, LLMs, etc.) to identify recurring themes across the six novels.

### Final Topics (9 total)

1. Courtship and Marriage  
2. Domestic Life and Family Obligation  
3. Social Class and Reputation  
4. Moral Judgment and Personal Character  
5. Gender Roles and Social Constraint  
6. Work, Industry, and Economic Struggle  
7. Community, Gossip, and Social Networks  
8. Individual Desire vs Social Expectation  
9. Change, Mobility, and Social Reform

### Design Choice: 9 Topics Overall vs Per-Book

Since both authors write about similar themes, I treat topics as **corpus-level semantic anchors** rather than extracting 5-10 book-specific topics per novel.

Reasoning:
- Each novel naturally contributes content to multiple topics (e.g., both authors write about marriage, class, morality)
- Treating topics at the corpus level prevents artificial inflation of topic categories
- With 500 paragraphs per AI class and 9 topics, we get 55 or 56 paragraphs per topic, which is sufficient for diversity
- This approach keeps topics directly comparable across the human and AI classes
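
Since 500 does not divide evenly by 9, the per-topic quota is worth spelling out. A minimal worked example follows, with variable names chosen here for illustration:

```python
TOTAL = 500
NUM_TOPICS = 9

# 500 = 9 * 55 + 5, so the first 5 topics get one extra paragraph
base, extra = divmod(TOTAL, NUM_TOPICS)
quotas = [base + (1 if i < extra else 0) for i in range(NUM_TOPICS)]

print(quotas)       # [56, 56, 56, 56, 56, 55, 55, 55, 55]
print(sum(quotas))  # 500
```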

## API key setup

In [21]:
# Read the API key from the environment; never hardcode secrets in a notebook
api_key = os.environ.get("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this cell.")

client = genai.Client(api_key=api_key)
MODEL_NAME = "models/gemma-3-27b-it"

In [22]:
from google import genai
import os

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

models = client.models.list()
for m in models:
    print(m.name)


models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5-computer-use-preview-10-2025
models/deep-research-pro-preview-12-2025
models/embedding-001
models/text-embedding-004
models/gemini-embedding-001
models/aqa
models/

### Design Choice: Why Gemma?

Using Google's Gemma model (served through the Gemini API) because:
1. **Free API access** - Found on Reddit that Gemma-3 has a largely unrestricted free tier (gemini-pro has stricter limits)
2. **Quality** - Gemma-3-27b-it produces coherent, diverse text
3. **Controllability** - Good response to temperature/top_p tuning
4. **Rate limits** - Can generate 500+ paragraphs without hitting quotas

Gemini models would also work, but given their stricter free-tier limits, generating at this volume may require paid API access.

In [23]:
topics = [
    "Courtship and Marriage",
    "Domestic Life and Family Obligation",
    "Social Class and Reputation",
    "Moral Judgment and Personal Character",
    "Gender Roles and Social Constraint",
    "Work, Industry, and Economic Struggle",
    "Community, Gossip, and Social Networks",
    "Individual Desire versus Social Expectation",
    "Change, Mobility, and Social Reform"
]

TOTAL_PARAGRAPHS = 500
NUM_TOPICS = len(topics)

base_per_topic = TOTAL_PARAGRAPHS // NUM_TOPICS
remainder = TOTAL_PARAGRAPHS % NUM_TOPICS

print(f"Will generate ~{base_per_topic} paragraphs per topic")

Will generate ~55 paragraphs per topic


The cell above sets up the topic list and calculates how many paragraphs to generate per topic.

## Class 2: AI Text (Neutral Style)

Generate 500 paragraphs on the same topics but with a neutral, contemporary voice. No style mimicry.

### Diversity Strategy

To avoid repetition, each paragraph uses:
- A **topic** (e.g., "Courtship and Marriage")
- A **lens** (e.g., psychological, economic, sociological)
- A **constraint** (e.g., "Begin with a contrast between past and present")

This creates varied prompts even within the same topic.

### Why These Lenses and Constraints?

The lens (perspective) + constraint (structural rule) combination serves two purposes:

1. **Diversity** - Prevents the model from generating repetitive text on the same topic
2. **Semantic variation** - Forces different angles on the same theme (e.g., marriage from psychological vs economic perspective)

Note: with sampling enabled (temperature 0.95), Gemma is not strictly deterministic, but repeating the same prompt can still yield very similar paragraphs. Varying the lens and constraint makes every prompt unique, which pushes the outputs apart.

This strategy is inspired by few-shot prompting but applied systematically across all generations.
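
To see how much prompt variety the combination gives, here is a small sketch using placeholder labels. Only the counts (8 lenses, 10 constraints) are taken from this notebook; the label strings are made up.

```python
from itertools import product

# Placeholder labels; only the list sizes mirror the notebook
lenses = [f"lens_{i}" for i in range(8)]
constraints = [f"constraint_{j}" for j in range(10)]

combinations = list(product(lenses, constraints))

print(len(combinations))  # 80
print(combinations[0])    # ('lens_0', 'constraint_0')
print(combinations[10])   # ('lens_1', 'constraint_0')
```

With 80 distinct pairs and at most 56 paragraphs per topic, no lens/constraint pair has to repeat within a topic unless many attempts are rejected.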

In [24]:
topics = [
    "Courtship and Marriage",
    "Domestic Life and Family Obligation",
    "Social Class and Reputation",
    "Moral Judgment and Personal Character",
    "Gender Roles and Social Constraint",
    "Work, Industry, and Economic Struggle",
    "Community, Gossip, and Social Networks",
    "Individual Desire versus Social Expectation",
    "Change, Mobility, and Social Reform"
]

In [25]:
LENSES = [
    "a sociological perspective",
    "a psychological perspective",
    "an economic perspective",
    "an ethical or moral perspective",
    "an individual lived-experience perspective",
    "a community-level perspective",
    "a historical-comparative perspective",
    "an institutional or systemic perspective"
]

In [26]:
CONSTRAINTS = [
    "Begin by describing a specific everyday situation.",
    "Begin with a contrast between past and present.",
    "Begin by highlighting a tension or trade-off.",
    "Begin by describing an individual dilemma.",
    "Begin by describing a social norm or expectation.",
    "Begin by focusing on consequences rather than causes.",
    "Begin by describing a small, ordinary decision.",
    "Begin by outlining a common misunderstanding.",
    "Begin with a question that the paragraph then answers.",
    "Begin by describing a quiet or unnoticed moment."
]


### Deduplication

Track generated text so that an identical output is never added twice.
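
Exact matching only catches verbatim repeats. If near-duplicates turn out to be a problem, one possible extension is a word-shingle Jaccard check, sketched below. This is a swapped-in technique, not what the notebook does; the function names and the 0.6 threshold are arbitrary assumptions.

```python
def shingles(text, n=3):
    """Set of overlapping n-word tuples, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(text, accepted, threshold=0.6):
    """True if `text` overlaps heavily with any previously accepted text."""
    return any(jaccard(text, prev) >= threshold for prev in accepted)

accepted = ["Marriage was often treated as an economic arrangement between families."]
candidate = "Marriage was often treated as an economic arrangement between two families."
unrelated = "Factory work reshaped daily routines in the northern towns."

print(is_near_duplicate(candidate, accepted))  # True
print(is_near_duplicate(unrelated, accepted))  # False
```

This compares each candidate against every accepted paragraph, which is fine for a few hundred texts but would need something like MinHash at larger scale.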

In [27]:
seen_texts = set()

def is_duplicate(text):
    key = text.strip().lower()
    return key in seen_texts

def mark_seen(text):
    seen_texts.add(text.strip().lower())

In [29]:

assert "GOOGLE_API_KEY" in os.environ, "GOOGLE_API_KEY not set"

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

MODEL_NAME = "models/gemma-3-27b-it"

def call_gemma_api(prompt, max_retries=3):
    """
    Calls Gemma with high diversity settings.
    Retries on transient failures.
    """
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model=MODEL_NAME,
                contents=prompt,
                config={
                    "temperature": 0.95,
                    "top_p": 0.9,
                    "top_k": 50,
                    "max_output_tokens": 450
                }
            )

            text = response.text.strip()

            if len(text.split()) < 80:
                raise ValueError("Output too short, retrying")

            return text

        except Exception as e:
            print(f"API error (attempt {attempt+1}/{max_retries}): {e}")
            time.sleep(2)

    return None

### API Call Function

`call_gemma_api` handles the actual generation with:
- **High temperature (0.95)** - Encourages diverse outputs
- **Retries** - Handles transient API failures (up to 3 attempts, 2 s apart)
- **Length check** - Rejects raw outputs shorter than 80 words; the 100-200 word range is enforced afterwards by `enforce_length`
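
A fixed pause between retries works, but a growing delay is usually gentler on a rate-limited API. The sketch below shows one way to do that with exponential backoff and a little jitter. `retry_with_backoff` is a hypothetical wrapper, the delays are arbitrary, and the flaky function is a toy stand-in for a real API call.

```python
import random
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on exception wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                return None  # give up after the last attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    return None

# Toy stand-in for an API call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

# Pass a no-op sleep so the demo does not actually wait
result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result)  # ok
```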

In [30]:
def enforce_length(text, min_words=100, max_words=200):
    words = text.split()
    if len(words) < min_words:
        return None
    if len(words) > max_words:
        return " ".join(words[:max_words])
    return text


In [31]:
from itertools import product

ai_data = []
SEEN_TEXTS = set()
OUTPUT_FILE = "class2_ai_data_try1.json"

COMBINATIONS = list(product(LENSES, CONSTRAINTS))

def mark_seen(text):
    SEEN_TEXTS.add(text[:500])

def is_duplicate(text):
    return text[:500] in SEEN_TEXTS


### Prompt Builder

Creates a unique prompt for each paragraph by combining:
- Topic (e.g., "Courtship and Marriage")
- Lens (e.g., "psychological perspective")
- Constraint (e.g., "Begin with a contrast")
- Variation ID (for tracking in later analysis)

In [32]:
import uuid

def build_class2_prompt(topic, lens, constraint):
    variation_id = uuid.uuid4().hex[:8]

    prompt = f"""
You are generating original prose for a research dataset.

Topic:
"{topic}"

Perspective (MANDATORY):
Write from {lens}.

Structural constraint (MANDATORY):
{constraint}

Rules:
- Write ONE paragraph of 100–200 words
- Neutral, contemporary narrative voice
- No imitation of any author or literary period
- No archaic or Victorian language
- No dialogue unless unavoidable
- Avoid generic summaries
- Do not reuse phrasing or structure from previous responses

Variation ID: {variation_id}

Return only the paragraph text.
""".strip()

    return prompt, variation_id


### Class 2 Generation Loop

This is the main generation loop. Key features:
- Iterates through all topics
- Uses different lens/constraint combinations for each paragraph
- Checks for duplicates using string prefix matching (first 500 chars)
- Saves after each paragraph (safety against API failures)
- Adds ~0.6s delay between calls (rate limiting)
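
Saving after every paragraph is most useful if an interrupted run can pick up where it left off and a crash mid-write cannot corrupt the file. A minimal sketch of both ideas follows, assuming the output is a JSON list. `load_existing` and `save_atomic` are hypothetical helpers, not part of the notebook.

```python
import json
import os
import tempfile

def load_existing(path):
    """Return previously saved records, or an empty list if the file is absent."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return []

def save_atomic(records, path):
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    os.replace(tmp_path, path)

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "demo.json")
    data = load_existing(out)      # [] on the first run
    data.append({"text": "example", "label": "ai"})
    save_atomic(data, out)
    resumed = load_existing(out)   # what a restarted run would see

print(resumed)  # [{'text': 'example', 'label': 'ai'}]
```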

In [None]:


for topic_idx, topic in enumerate(topics):
    n = base_per_topic + (1 if topic_idx < remainder else 0)
    generated_for_topic = 0
    attempts = 0
    combo_idx = topic_idx % len(COMBINATIONS)

    print(f"\nGenerating for topic: {topic} (target: {n} paragraphs)")

    while generated_for_topic < n:
        if combo_idx >= len(COMBINATIONS):
            print("Warning: Exhausted all lens-constraint combinations")
            break

        lens, constraint = COMBINATIONS[combo_idx]
        combo_idx += 1
        attempts += 1

        prompt, vid = build_class2_prompt(topic, lens, constraint)
        text = call_gemma_api(prompt)

        if not text:
            continue

        text = enforce_length(text)
        if text is None:
            continue

        if is_duplicate(text):
            continue

        mark_seen(text)

        ai_data.append({
            "text": text,
            "label": "ai",
            "topic": topic,
            "lens": lens,
            "constraint": constraint,
            "variation_id": vid
        })

        generated_for_topic += 1

        with open(OUTPUT_FILE, "w") as f:
            json.dump(ai_data, f, indent=2)

        print(f"{topic}: {generated_for_topic}/{n}")
        time.sleep(0.6)


Generating for topic: Courtship and Marriage (target: 56 paragraphs)
Courtship and Marriage: 1/56
Courtship and Marriage: 2/56
Courtship and Marriage: 3/56
Courtship and Marriage: 4/56
Courtship and Marriage: 5/56
Courtship and Marriage: 6/56
Courtship and Marriage: 7/56
Courtship and Marriage: 8/56
Courtship and Marriage: 9/56
Courtship and Marriage: 10/56
Courtship and Marriage: 11/56
Courtship and Marriage: 12/56
Courtship and Marriage: 13/56
Courtship and Marriage: 14/56
Courtship and Marriage: 15/56
Courtship and Marriage: 16/56
Courtship and Marriage: 17/56
Courtship and Marriage: 18/56
Courtship and Marriage: 19/56
Courtship and Marriage: 20/56
Courtship and Marriage: 21/56
Courtship and Marriage: 22/56
Courtship and Marriage: 23/56
Courtship and Marriage: 24/56
Courtship and Marriage: 25/56
Courtship and Marriage: 26/56
Courtship and Marriage: 27/56
Courtship and Marriage: 28/56
Courtship and Marriage: 29/56
Courtship and Marriage: 30/56
Courtship and Marriage: 31/56
Courtship

## Class 3: AI Text (Style-Conditioned)

Generate 500 paragraphs on the same topics, but conditioned to mimic Austen or Gaskell's style.

Distribution:
- 250 paragraphs in Austen's style
- 250 paragraphs in Gaskell's style

The author style guides were again built from a mix of my own observations and descriptions of each author's style found online and in other resources.

In [1]:
TOTAL_PARAGRAPHS = 500
AUTHORS = ["austen", "gaskell"]
PARAS_PER_AUTHOR = TOTAL_PARAGRAPHS // len(AUTHORS)  # 250 each


### Style Guides

These describe the authors' stylistic features at an abstract level:
- Sentence structure (e.g., Austen's syntactically complex but balanced sentences)
- Diction (e.g., Austen's formal, polite register)
- Moral stance (e.g., Austen's subtly embedded judgment vs Gaskell's explicit moral concern)
- Narrative voice (e.g., Austen's mild irony vs Gaskell's emotional expressiveness)

The model uses these to condition generation without copying specific text.
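
One rough way to check whether style conditioning left a measurable trace is to compare simple surface features between generated and human paragraphs. The sketch below is only an illustration. `style_features` is a hypothetical helper, and the two features chosen (mean sentence length, semicolon rate) are assumptions about what might be informative.

```python
import re

def style_features(text):
    """Crude surface features: mean sentence length and semicolons per 100 words."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "semicolons_per_100w": 100 * text.count(";") / max(len(words), 1),
    }

sample = "She was pleased; he was not. They parted on civil terms."
features = style_features(sample)
print(features["avg_sentence_len"])               # 5.5
print(round(features["semicolons_per_100w"], 2))  # 9.09
```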

In [None]:
AUTHOR_STYLE_GUIDES = {
    "austen": """
- Use syntactically complex but balanced sentences
- Employ indirect evaluation and mild irony
- Prefer formal, polite diction
- Embed moral judgment subtly within narration
- Avoid emotional excess or overt sentiment
""",
    "gaskell": """
- Use emotionally expressive but controlled language
- Emphasize social conditions and human impact
- Allow moral concern to be explicit
- Alternate between reflection and description
- Use grounded, socially aware narration
"""
}


### Class 3 Setup and Helper Functions

This cell sets up:
- Deduplication using MD5 hashes of normalized text (more robust than raw string matching)
- Length enforcement (reject outputs under 100 words, truncate outputs over 200 words)

The prompt builder, which combines topic + author style + lens + constraint, is defined in a separate cell further down.

The hash-based deduplication normalizes text (lowercase, collapse whitespace, remove punctuation) before checking for duplicates.
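
The effect of normalizing before hashing is easiest to see on a toy example. The sketch below folds the normalization and hashing steps into one illustrative function (`normalized_md5` is a name chosen here); the sample sentences are made up.

```python
import hashlib
import re

def normalized_md5(text):
    """Lowercase, collapse whitespace, drop punctuation, then hash."""
    text = re.sub(r"\s+", " ", text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return hashlib.md5(text.strip().encode()).hexdigest()

a = "Marriage, in those days, was a serious matter."
b = "marriage in those days   was a serious matter"
c = "Marriage in those days was a trivial matter."

print(normalized_md5(a) == normalized_md5(b))  # True: differs only in case, spacing, punctuation
print(normalized_md5(a) == normalized_md5(c))  # False: one word changed
```

Any change in wording still produces a different hash, so this catches formatting variants but not paraphrases.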

In [None]:
import json
import time
import uuid
import re
import hashlib
from itertools import product

OUTPUT_FILE = "class3_ai_data_try1.json"

AUTHORS = ["austen", "gaskell"]
TOTAL_PARAGRAPHS = 500
PARAS_PER_AUTHOR = TOTAL_PARAGRAPHS // len(AUTHORS)

COMBINATIONS = list(product(LENSES, CONSTRAINTS))

ai_style_data = []
SEEN_HASHES = set()

def normalize_for_hash(text):
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()

def is_duplicate(text):
    h = hashlib.md5(normalize_for_hash(text).encode()).hexdigest()
    return h in SEEN_HASHES

def mark_seen(text):
    h = hashlib.md5(normalize_for_hash(text).encode()).hexdigest()
    SEEN_HASHES.add(h)

def enforce_length(text, min_words=100, max_words=200):
    words = text.split()
    if len(words) < min_words:
        return None
    if len(words) > max_words:
        return " ".join(words[:max_words])
    return text



 Generating for AUSTEN (250)

--- austen | Courtship and Marriage (28) ---
[ austen | Courtship and Marriage ] 1/28
[ austen | Courtship and Marriage ] 2/28
[ austen | Courtship and Marriage ] 3/28
[ austen | Courtship and Marriage ] 4/28
[ austen | Courtship and Marriage ] 5/28
[ austen | Courtship and Marriage ] 6/28
[ austen | Courtship and Marriage ] 7/28
[ austen | Courtship and Marriage ] 8/28
[ austen | Courtship and Marriage ] 9/28
[ austen | Courtship and Marriage ] 10/28
[ austen | Courtship and Marriage ] 11/28
[ austen | Courtship and Marriage ] 12/28
[ austen | Courtship and Marriage ] 13/28
[ austen | Courtship and Marriage ] 14/28
[ austen | Courtship and Marriage ] 15/28
[ austen | Courtship and Marriage ] 16/28
[ austen | Courtship and Marriage ] 17/28
[ austen | Courtship and Marriage ] 18/28
[ austen | Courtship and Marriage ] 19/28
[ austen | Courtship and Marriage ] 20/28
[ austen | Courtship and Marriage ] 21/28
[ austen | Courtship and Marriage ] 22/28
[ austen 

In [1]:
def build_class3_prompt(topic, author, lens, constraint):
    variation_id = uuid.uuid4().hex[:8]

    prompt = f"""
Write a single self-contained paragraph of 100–200 words on the topic:

"{topic}"

Write in a style inspired by {author}, following these stylistic tendencies:

{AUTHOR_STYLE_GUIDES[author]}

Additional constraints:
- {lens}
- {constraint}

Rules:
- Do not reference specific novels, characters, or places
- Do not quote or paraphrase existing text
- Maintain a continuous prose paragraph
- Emphasize sentence rhythm, clause structure, and narrative voice typical of this author
- Preserve topic content while expressing it through stylistic form

Variation ID: {variation_id}

Return only the paragraph text.
""".strip()

    return prompt, variation_id

### Class 3 Generation Loop

This is the main generation loop. Key features:
- Iterates through all topics
- Uses different lens/constraint combinations for each paragraph
- Checks for duplicates using MD5 hashing (see setup cell above)
- Saves after each paragraph (safety against API failures)
- Adds ~0.6s delay between calls (rate limiting)
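
The duplicate check can be illustrated in isolation. The sketch below is a minimal, self-contained version. The `normalize` function is a hypothetical stand-in (lowercase, strip punctuation, collapse whitespace) and may differ from the notebook's own `normalize_for_hash`; `filter_duplicates` is likewise an illustrative name.

```python
import hashlib
import re

def normalize(text):
    """Hypothetical normalization: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def text_hash(text):
    # MD5 serves only as a cheap fingerprint here, not as a security measure.
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()

def filter_duplicates(paragraphs):
    """Keep the first occurrence of each normalized paragraph."""
    seen = set()
    unique = []
    for p in paragraphs:
        h = text_hash(p)
        if h in seen:
            continue
        seen.add(h)
        unique.append(p)
    return unique

samples = [
    "Marriage was, for her, a matter of prudence.",
    "marriage was,  for her, a matter of PRUDENCE",
    "Marriage was, for him, a matter of feeling.",
]
unique = filter_duplicates(samples)
print(len(unique))
```

Exact hashing of this kind only catches paragraphs that are identical after normalization. Paraphrases or paragraphs that differ by a single word get different hashes and pass through.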

In [None]:
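NUM_TOPICS = len(topics)
base_per_topic = PARAS_PER_AUTHOR // NUM_TOPICS
remainder = PARAS_PER_AUTHOR % NUM_TOPICS

# Cap on API calls per (author, topic) so persistent failures cannot loop forever.
# The factor of 5 is an arbitrary safety margin, not a tuned value.
MAX_ATTEMPTS_FACTOR = 5

for author_idx, author in enumerate(AUTHORS):

    author_count = 0
    print(f"\nGenerating for {author.upper()} (target: {PARAS_PER_AUTHOR} paragraphs)")

    for topic_idx, topic in enumerate(topics):

        # The first `remainder` topics get one extra paragraph so the totals match the target.
        n = base_per_topic + (1 if topic_idx < remainder else 0)
        generated_for_topic = 0
        # Built-in hash() of a str is salted per process, so use the author's index
        # to keep the starting combination reproducible across runs.
        combo_idx = (topic_idx + author_idx) % len(COMBINATIONS)
        attempts = 0
        max_attempts = n * MAX_ATTEMPTS_FACTOR

        print(f"\n{author} - {topic} (target: {n})")

        while generated_for_topic < n and attempts < max_attempts:

            attempts += 1
            lens, constraint = COMBINATIONS[combo_idx % len(COMBINATIONS)]
            combo_idx += 1

            prompt, vid = build_class3_prompt(
                topic=topic,
                author=author,
                lens=lens,
                constraint=constraint
            )

            text = call_gemma_api(prompt)
            time.sleep(0.6)  # rate limit every call, including failed or rejected ones
            if not text:
                continue

            text = enforce_length(text)
            if text is None:
                continue

            if is_duplicate(text):
                continue

            mark_seen(text)

            ai_style_data.append({
                "text": text,
                "label": "ai",
                "author_style": author,
                "topic": topic,
                "lens": lens,
                "constraint": constraint,
                "variation_id": vid
            })

            generated_for_topic += 1
            author_count += 1

            # Rewrite the full file after every accepted paragraph so progress survives a crash.
            with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
                json.dump(ai_style_data, f, indent=2)

            print(f"{author} - {topic}: {generated_for_topic}/{n}")

        if generated_for_topic < n:
            print(f"WARNING: {author} - {topic} stopped at {generated_for_topic}/{n} after {attempts} attempts")

    print(f"{author}: {author_count} paragraphs generated")


print("\nClass 3 generation complete.")
print(f"Total paragraphs generated: {len(ai_style_data)}")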

## Done

Generated three datasets:
- `class1_human_data_try1.json`: 500+ human paragraphs
- `class2_ai_data_try1.json`: 500 AI paragraphs (neutral)
- `class3_ai_data_try1.json`: 500 AI paragraphs (styled)

All paragraphs are 100-200 words on the same 9 topics.
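
A quick sanity check of the length and topic claims might look like the sketch below. It assumes each record is a dict with `text` and `topic` keys, as in the Class 3 records; the other two files may use different keys. The `validate` helper and the "Family" topic are illustrative, and the records are a tiny in-memory stand-in for a real file.

```python
from collections import Counter

def validate(records, min_words=100, max_words=200):
    """Return (indices of out-of-range paragraphs, per-topic counts)."""
    bad = [
        i for i, r in enumerate(records)
        if not (min_words <= len(r["text"].split()) <= max_words)
    ]
    topic_counts = Counter(r["topic"] for r in records)
    return bad, topic_counts

# Tiny in-memory stand-in for one of the JSON files.
records = [
    {"text": "word " * 120, "topic": "Courtship and Marriage"},
    {"text": "word " * 50, "topic": "Courtship and Marriage"},   # too short
    {"text": "word " * 200, "topic": "Family"},                  # hypothetical topic
]
bad, topic_counts = validate(records)
print(bad)
print(dict(topic_counts))
```

To run it on real output, load a file with `json.load` and pass the resulting list to `validate`.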

### Dataset Summary

**Class 1 (Human):**
- Source: 6 novels (3 Austen + 3 Gaskell)
- Processing: HTML cleaning, boilerplate removal, re-chunking
- Output: ~4000+ paragraphs, 100-200 words each

**Class 2 (AI-Neutral):**
- Source: Gemini API (gemma-3-27b-it)
- Strategy: Paragraph-level generation with lens/constraint diversity
- Output: 500 paragraphs, 100-200 words each

**Class 3 (AI-Styled):**
- Source: Gemini API (gemma-3-27b-it)
- Strategy: Style-conditioned generation (250 Austen-style, 250 Gaskell-style)
- Output: 500 paragraphs, 100-200 words each

All three classes share the same 9 topics, enabling fair comparison in downstream tasks.
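
Because 250 paragraphs per author style do not divide evenly across 9 topics, the per-topic targets differ by one. The sketch below reproduces the allocation arithmetic used in the generation loop; the function name `allocate` is illustrative.

```python
def allocate(total, num_topics):
    """Split `total` items across `num_topics`, giving the first few topics one extra."""
    base, remainder = divmod(total, num_topics)
    return [base + (1 if i < remainder else 0) for i in range(num_topics)]

per_topic = allocate(250, 9)
print(per_topic)       # [28, 28, 28, 28, 28, 28, 28, 27, 27]
print(sum(per_topic))  # 250
```

The first seven topics receive 28 paragraphs and the last two receive 27, so the classes are balanced by topic only to within one paragraph.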

## Review of data

After running the Task 1 metrics on this dataset, I concluded that it was too weak and would need several rounds of improvement.

I settled on three variants (try1, try2, try3) so that we can compare:
- **Paragraph-level generation** (try1, this notebook) - More structured, explicit constraints
- **Story-level generation** (try2) - More natural flow, organic variation
- **Few-shot with examples** (try3) - Best style mimicry but risk of copying

Tasks 1 & 2 will show whether each generation strategy actually improves the dataset, that is, whether the AI text becomes harder to detect.

### Next: Try 2 - Story-Level Generation

**Limitation of Try 1:**
This notebook generates paragraphs independently with explicit lens/constraint combinations. While this ensures diversity, it may produce text that feels overly structured or artificially varied.

**Try 2 Approach:**
Instead of generating 500 isolated paragraphs, Try 2 will:
1. **Generate longer coherent stories** (1500-3000 words) on each topic
2. **Extract 100-200 word chunks** from these stories (as we did for the human text)
3. **Leverage contextual flow** - paragraphs will have narrative continuity rather than being standalone
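
The chunking step could be sketched as follows. This is one possible approach under stated assumptions, not the Try 2 implementation: it packs whole sentences greedily into 100-200 word chunks and drops any chunk that cannot meet the bounds. The name `chunk_story` and the regex-based sentence splitter are illustrative; the splitter will mis-split abbreviations such as "Mr.".

```python
import re

def chunk_story(story, min_words=100, max_words=200):
    """Greedily pack whole sentences into chunks of min_words..max_words words."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", story.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and count + n > max_words:
            if min_words <= count <= max_words:
                chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    # Keep the final chunk only if it is within bounds.
    if min_words <= count <= max_words:
        chunks.append(" ".join(current))
    return chunks

# Synthetic story: 45 sentences of 10 words each (450 words in total).
sentence = " ".join(["word"] * 9 + ["end."])
story = " ".join([sentence] * 45)
chunks = chunk_story(story)
print([len(c.split()) for c in chunks])
```

With this synthetic input the 50-word tail is discarded because it falls below the minimum, so some generated text is always lost at story boundaries.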

**Why This Might Work Better:**
- More natural variation within and between paragraphs
- Mirrors how human authors actually write (continuous narrative, not isolated thoughts)
- Reduces the "template-like" quality that can emerge from structured prompts
- Potentially harder for classifiers to detect since the generation process more closely resembles human writing

**Trade-off:**
Less explicit control over diversity (no lens/constraint combinations), but potentially more authentic stylistic variation through natural narrative flow.