<a href="https://colab.research.google.com/github/laurencoetzee001/Beads_Co-detect/blob/main/Oct_Stage_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Co-DETECT Bead Exchange Analysis - Google Colab Implementation
# Expert-Validated Codebook v4.0 - August 2025
# ENHANCED VERSION with Robust Restart Capabilities
# Using Claude 3.5 Haiku for Fast & Cost-Efficient Processing
# Run this setup cell first to initialize the Co-DETECT system

import os
import pickle
import pandas as pd
import numpy as np
import json
import random
from datetime import datetime
import time
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import anthropic

# Mount Google Drive (run this once)
from google.colab import drive
drive.mount('/content/drive')

# =============================================================================
# CO-DETECT CONFIGURATION
# =============================================================================

# Configuration
WORKING_DIR = '/content/drive/MyDrive/CoDetectBeadAnalysis'
BACKUP_DIR = '/content/drive/MyDrive/colab_backups/codetect'
DATASET_FILE = 'Munashe_Cleaned.xlsx'
BACKUP_FILENAME = 'Oct_Stage_1_progress'
LINES_PER_BACKUP = 50

# Co-DETECT Parameters
SAMPLE_SIZE = 1500
EDGE_CASE_THRESHOLD = 0.7
MAX_RETRIES = 3
BATCH_SIZE = 50

# Cost estimation (Claude 3.5 Haiku pricing)
COST_PER_1K_TOKENS_INPUT = 0.001  # $1 per million tokens
COST_PER_1K_TOKENS_OUTPUT = 0.005  # $5 per million tokens
ESTIMATED_TOKENS_PER_TEXT = 800
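
# Worked example of the cost arithmetic implied by the constants above.
# A minimal sketch, not part of the original pipeline: `estimate_run_cost` is a
# hypothetical helper, and the output-token count per text is an assumption the
# caller must supply (the notebook defines only a single per-text token estimate).
def estimate_run_cost(n_texts, input_tokens_per_text, output_tokens_per_text,
                      input_cost_per_1k, output_cost_per_1k):
    """Return the estimated total API cost in dollars for n_texts requests."""
    input_cost = n_texts * input_tokens_per_text / 1000 * input_cost_per_1k
    output_cost = n_texts * output_tokens_per_text / 1000 * output_cost_per_1k
    return input_cost + output_cost

# Possible usage with the notebook's constants (300 output tokens is a guess):
# estimate_run_cost(SAMPLE_SIZE, ESTIMATED_TOKENS_PER_TEXT, 300,
#                   COST_PER_1K_TOKENS_INPUT, COST_PER_1K_TOKENS_OUTPUT)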

# Create directories
os.makedirs(WORKING_DIR, exist_ok=True)
os.makedirs(BACKUP_DIR, exist_ok=True)
os.chdir(WORKING_DIR)

# Global counters and tracking (ENHANCED for restart)
if 'progress_counter' not in globals():
    progress_counter = 0
if 'codetect_results' not in globals():
    codetect_results = []
if 'processed_indices' not in globals():
    processed_indices = set()  # Track which row indices have been processed
if 'session_start_time' not in globals():
    session_start_time = None

# =============================================================================
# EXPERT-VALIDATED CODEBOOK v4.0
# =============================================================================

EXPERT_CODEBOOK = """
# Bead Exchange Historical Data Analysis Codebook v4.0 (Expert-Validated + Historian-Designed - August 2025)

## CRITICAL INSTRUCTIONS FOR AI/API PROCESSING

**PRIORITY**: This codebook combines the complete historian-designed structure with expert-validated edge case resolutions from Co-DETECT analysis. Follow ALL decision rules strictly. When in doubt, be CONSERVATIVE and require explicit evidence.

**KEY PRINCIPLE**: Code only what is explicitly stated in the text. Do NOT infer, assume, or interpret beyond what is directly written.

## PREPROCESSING DATA QUALITY RULES (Expert-Validated)

### Rule P1: Text Length and Quality Filtering
- **Entries with <50 readable characters** → `read_entry = "NaN"` (corrupted/fragment)
- **Entries with >80% non-alphabetic characters** → `read_entry = "NaN"` (corrupted OCR)

### Rule P2: Content Relevance Filtering
- **"Context page" entries without bead-related terms** → `read_entry = 0` (not about beads)
- **Bead-related terms**: bead, beads, pearl, coral, glass (ornament context), necklace, bracelet, ornament, jewelry, string

## Introduction

This codebook identifies and classifies historical references to beads and their exchange across cultures and time periods. It provides systematic guidelines for analyzing historical texts to extract information about beads, their characteristics, and their role in economic and cultural exchanges.

The unit of observation is a specific exchange rate at a specific location/ethnic group or a specific point in time. If multiple exchange rates are stated in the same context (e.g., two exchange rates in two markets), each should be treated as a separate entry.

**HISTORIAN EXAMPLE**: "A slave-boy could be purchased for five fundo, or fifty strings of beads: the same article would now fetch three hundred. A fundo of cheap white porcelain-beads would procure a milk cow; and a goat, or ten hens its equivalent, was to be bought for one khete."

This passage contains multiple distinct exchange rates that would require separate entries:
1. Five fundo (fifty strings of beads) for a slave-boy (historical rate)
2. Three hundred strings of beads for a slave-boy (current rate)
3. One fundo of cheap white porcelain beads for a milk cow
4. One khete of beads for a goat or ten hens

## MANDATORY DECISION FLOWCHART (Expert-Enhanced)

For EVERY entry, follow this sequence:

1. **Data Quality Assessment** (apply preprocessing rules)
2. **Read Entry Assessment** (code="read_entry")
3. **Exchange Occurrence Check** (code="4a_exchange") - CRITICAL DECISION POINT
4. **If exchange="no"**: Set exchange variables to "NaN" BUT capture contextual details for historical analysis
5. **If exchange="xo"**: Code all exchange variables based on explicit evidence
6. **Function coding**: Can be done regardless of exchange status
7. **Multiple exchange check**: Duplicate row for each distinct exchange rate

## Question-by-Question Guidance (Complete Historian Design + Expert Enhancements)

### 0. Read Entry (code="read_entry")
**Question**: Is this a readable entry about beads?

**Values**:
- `0`: Entry does not clearly mention beads in meaningful context
- `1`: Entry clearly mentions beads with sufficient context for analysis
- `2`: Entry mentions beads but has typos (e.g., "head" instead of "bead")
- `NaN`: Corrupted/unreadable text only

**ENHANCED RULE**: Code `1` only when beads are central to the passage content. Include ALL types of beads - cultural, trade, equipment, ceremonial.

### 1. Function (code="1_function")
**Question**: What is the function of the beads in society?

#### 1a. Physical Function (code="1a_physical_function")
**Question**: What is the physical function?

**Values**:
- `2`: Aesthetic function (jewelry & fashion, decoration)
- `NaN`: Not mentioned/Not applicable

**EXPERT-ENHANCED EXAMPLES**:
- "Wearing strings of beads around their loins" → `2`
- "The women adorned their hair with colorful glass beads" → `2`
- "The hut was decorated with beaded curtains" → `2`
- **"Set of double beads on horse equipment"** → `2` (decorative equipment) - Expert Decision
- **"beads of his rosary"** → `2` (can have aesthetic component)

#### 1b. Trade Function (code="1b_trade_function")
**Question**: What is the trade function?

**Values**:
- `2`: Money (exchange, tax)
- `NaN`: Not mentioned/Not applicable

**EXAMPLES**:
- "They used beads as currency to purchase supplies" → `2`
- "Beads were collected as tax by the chieftain" → `2`
- **"bought the secret for beads"** → `2` (beads functioning as payment) - Expert Decision

#### 1c. Social Function (code="1c_social_function")
**Question**: What is the social, religious function of the beads?

**Values**:
- `3`: Ceremonial (prayer beads) & social function (gifts, status)
- `NaN`: Not mentioned/Not applicable

**EXPERT-ENHANCED EXAMPLES**:
- "Prayer beads for religious ceremonies" → `3`
- "Gift of beads to demonstrate status" → `3`
- "The chief received tribute in the form of rare beads" → `3`
- "At the wedding ceremony, the bride was given ancestral beads" → `3`
- **"beads of his rosary"** → `3` (ceremonial - prayer beads) - Expert Decision
- **"presents of glass beads"** → `3` (social - gift-giving) - Expert Decision

### 2. Nature of Exchange (code="2_nature_of_exchange")
**Question**: If beads were exchanged, was the observed exchange consensual or conflictual?

**MANDATORY PREREQUISITE**: Only code if 4a_exchange="xo"

**Values**:
- `1`: Consensual - Willing, mutually agreed exchange where both parties enter freely with roughly equal power and information
- `2`: Conflictual - Forced, unequal, or contested exchange with significant power imbalance
- `3`: Competitive/haggling (involves negotiation)
- `4`: Social (gifts/tributes/presents) - Expert Enhanced
- `NaN`: Not mentioned/Not applicable

**HISTORIAN EXAMPLES**:
- "They willingly traded beads for supplies" → `1`
- "The merchants happily exchanged glass beads for local produce" → `1`
- "Both parties seemed satisfied with the exchange of beads for cattle" → `1`
- "Beads were demanded as tribute from the conquered village" → `2`
- "Under threat of violence, they were forced to accept beads of inferior quality" → `2`
- "The colonial officials imposed a tax to be paid in blue beads" → `2`
- "After much bargaining, we agreed on twenty strings of beads for the goat" → `3`
- "The price began at five strings but after negotiation rose to eight" → `3`
- "Three days of intense haggling preceded the exchange of beads for ivory" → `3`
- "A ceremonial gift of beads was presented to the bride's family" → `4`
- "The visiting dignitary received a necklace of rare beads" → `4`
- "Annual tribute of decorative beads was given to honor the ancestors" → `4`

**EXPERT ADDITIONS**:
- **"offered presents of glass beads"** → `4` (social - gift-giving)
- **"bought the secret for beads"** → `1` (consensual commercial)

### 3. Between Groups (code="3_between_groups")
**Question**: Who was involved in the interaction? Was it observed between different ethnic groups or between locals and visiting travelers?

**MANDATORY PREREQUISITE**: Only code if 4a_exchange="xo"

**Values**:
- `1`: Between local and traveler (international travelers)
- `2`: Between local and local (inter-ethnic) [add STRING of both groups], e.g., Zulu trading with Xhosa
- `3`: Between local and local (intra-ethnic) [add STRING of both groups], e.g., Kikuyu trading with Kikuyu
- `4`: Between international travelers only

**Decision rules**:
- Include names of ethnic groups in addition to the code
- Distinguish between inter-ethnic and intra-ethnic exchanges

### 4a. Exchange (code="4a_exchange") - CRITICAL DECISION POINT
**Question**: Were beads actually exchanged?

**Values**:
- `no`: No exchange (none mentioned, exchange did not actually take place, or hearsay only)
- `xo`: Exchanged

**EXPERT-ENHANCED CRITERIA FOR "xo":**
1. **Explicit transaction verbs**: "traded", "sold", "bought", "exchanged", "gave for", "received for", "purchased", "offered presents"
2. **Clear parties**: Identifiable giver AND receiver
3. **Completed action**: Past tense indicating transaction happened
4. **Specific items**: What was given AND what was received
5. **INTANGIBLE GOODS INCLUDED**: Knowledge, secrets, services count (Expert Decision)

**❌ NEVER code "xo" for (Expert-Validated):**
- **Historical generalizations**: "when trading first commenced, natives sold ivory for beads" (Expert Decision)
- **Observational descriptions**: "showed me his beads", "displayed the beads" (Expert Decision)
- **Wearing/using descriptions**: "wore beads", "adorned with beads"
- **Price lists without transactions**: "beads cost £2"

**Decision rules**:
- Use `xo` to indicate an actual exchange took place
- For every distinct exchange rate at the same market/location, duplicate the row and enter relevant exchange rates separately (Historian Rule)

### 4b. Beads Exchanged (code="4b_beads_exchanged")
**Question**: Were there specific types of beads used in the exchange? Describe the unit of exchange of beads.

**Format**: Provide details including quantity, weight, body measurement, and characteristics of bead used in exchange.

**HISTORIAN EXAMPLES**:
- "A fundo of large white beads with blue eyes"
- "100 strings of red beads"
- "Two fathoms of small blue glass beads"
- "A handful of amber-colored tubular beads"
- "Five bracelets of multicolored seed beads"
- "Twenty large coral beads measured from thumb to elbow"
- "Three strings of small white porcelain beads and one string of large blue glass beads"
- "One khete of cheap white glass beads"
- "An arm's length of faceted crystal beads"
- "A necklace of alternating black and white ceramic beads"

**EXPERT ADDITIONS**:
- **"string of wooden beads"** (from intangible exchange case)
- **"glass beads"** (from gift-giving case)

**Decision rules**:
- If more than one type of bead was exchanged, separate by a comma
- Include all relevant characteristics: size, color, shape, type
- Include measurement terms (fathom, khete, string, fundo, etc.) when specified
- **Expert Addition**: For historical generalizations, still capture bead details even when 4a_exchange="no"

### 4c. Exchanged Item (code="4c_exchanged_item")
**Question**: What were beads exchanged for?

**Format**: Include quantity and item. What was "bought" for the amount given in 4b.

**HISTORIAN EXAMPLES**:
- "One cow"
- "Handful of eggs"
- "Few francs"
- "10 goats"
- "2 lb of flour and 2 eggs"
- "Dozen bullets"

**EXPERT ADDITIONS (Intangible Goods)**:
- **"the secret for writing amulets"** (knowledge)
- **"friendship/diplomatic relations"** (social capital)
- **"ivory"** (from historical generalizations)

**Decision rules**:
- If more than one item was exchanged, separate by a comma
- Be as specific as possible with quantities and descriptions
- **Expert Addition**: Include intangible goods with same detail level as physical goods

### 5. Related Items (code="5_related_items")
**Question**: List all items that were traded, exchanged, or given as gifts alongside beads.

#### 5a. Raw Materials (code="5a_related_raw_materials")
**Question**: Was the bead connected to raw materials (also as part of jewellery)?

**Format**: List all raw material items traded, separated by comma or NaN

**HISTORIAN EXAMPLES**: Wire, Iron bars/rods (bronze bars, copper bars, etc.), Precious stones/metals (agates/pearls/corals/amber/red alum/alkali/marble/limestone, etc.), Ebony/ivory, Salt, Rubber/gum, Skins/leather/hides/horns/animal products, Gold/silver/gold dust

**EXPERT ADDITION**: **"copper bracelets"** (from historical trading patterns case)

#### 5b. Jewelry and Fashion (code="5b_related_jewerly_fashion")
**Question**: Was the bead connected to jewelry or fashion items?

**Format**: List all jewelry and fashion items traded, separated by comma or NaN

**HISTORIAN EXAMPLES**: Shells (e.g. cowries, etc.), Ostrich feathers, Wax/seals/stamps, Jewellery/rings/bracelets/necklaces, Cloth/Clothing/textiles (dyes, buttons, silk, European cotton, ornamental threads, muslins, kaftans, wraps, flannel, canvas, brocade, wool, etc.)

#### 5c. Consumables and Utilitarian Items (code="5c_related_consumerables")
**Question**: Was the bead connected to consumables or utilitarian items?

**Format**: List all consumables and utilitarian items traded, separated by comma or NaN

**HISTORIAN EXAMPLES**: Coins (other foreign currencies), Livestock (e.g. chicken, goats, horses, donkeys, sheep, cows, etc.), Medicines/remedies/herbal plants, Spices/essences/fragrances/perfumes, Dried food/fruit (tamarinds/dates/honey/grains/etc.), Hardware/manufactures (pewter, zinc, etc.), Tobacco/snuff, Water

#### 5d. Decoratives (code="5d_related_decoratives")
**Question**: Was the bead connected to decorative items?

**Format**: List all decorative items traded, separated by comma or NaN

**HISTORIAN EXAMPLES**: Scarabs, Antiquities/furniture items/collectables, Indigenous weapons, spears, shields, Prints/artwork/books/paper/scrolls, Guns/gunpowder, Slaves, Glass objects, Musical instruments

### 6. Bead Ethnic Group (code="6_bead_ethnic_group")
**Question**: Was an ethnic group mentioned in connection to the exchange?

**Format**: Add ethnic group name(s) as STRING. If more than one ethnic group, separate with comma.

**HISTORIAN EXAMPLES**:
- "Zulu, Xhosa"
- "Turkana"
- "Wongára women from Bontúku"
- "Hausa merchants from Sokotu"

**EXPERT ADDITION**: **"natives, trading adventurers"** (from historical pattern case)

### 7. Market Town (code="7_market_town")
**Question**: Is the location of exchange described as a market?

**Values**:
- `0`: No
- `1`: Yes

**Decision rules**: Only mark `1` if the location is explicitly described as a market or trading center

### 8. Location Name (code="8_location_name")
**Question**: Is there a mention of any geographic location?

**Format**: Add the place name and description of the location (as detailed as available in text).

**HISTORIAN EXAMPLES**:
- "Between 4° and 5° north, 27° and 28° east"
- "Unyamwezi, Jenne, north of lake Stefanie"
- "Reshiat or Rissiat (northern Gallaland in northern Lake Rudolf)"

**EXPERT ADDITION**: **"White Nile"** (from historical pattern case)

### 9. Place of Manufacture (code="9_place_of_manufacture")
**Question**: Is there a mention of the origin of the bead?

**Format**: Add origin of the bead (geographic term).

**HISTORIAN EXAMPLES**:
- "Venetian glass beads"
- "Beads from x region"

**Decision rules**: Origin refers to where beads were manufactured, not where they are used

### 10. Beads Observed (code="10_beads_observed")
**Question**: List all beads traded at this location (even if not necessarily traded for specific items).

#### 10a. Size (code="10a_size")
**Format**: Add size (large, medium, small, thin, thick)

#### 10b. Color (code="10b_color")
**Format**: Add color

**HISTORIAN EXAMPLES**: Red, blue, white, pink, coral, amber, copper, transparent, green, yellow, black, multicolored

#### 10c. Shape (code="10c_shape")
**Format**: Add shape

**HISTORIAN EXAMPLES**: Round, tubular (like a long sausage), square, oval, oblong (like a squished sausage), punched with a hole, wound, pressed, decorative, faceted, bugle, chevron

#### 10d. Type (code="10d_type")
**Format**: Add material type

**HISTORIAN EXAMPLES**:
- Glass (also seed beads = small glass beads used as currency)
- Clay
- Metal (brass/copper/silver/gold/iron)
- Stone (quartz/agate/carnelian/jasper/amethyst/lapis lazuli/turquoise/malachite)
- Coral, bone, ivory, dried seed, ceramic, wooden, porcelain, shell (seashells), eggshell (ostrich)

**EXPERT ADDITION**: **wooden** (from prayer bead case)

**HISTORIAN COMPLETE ENTRY EXAMPLES**:
- "Small amber glass beads"
- "Mid-sized oval blue beads"
- "Transparent square glass beads"
- "Large round coral beads"
- "Small red seed beads"
- "Medium faceted blue glass beads"
- "Large tubular white porcelain beads"
- "Small round black glass beads, large oval amber beads"
- "Tiny multicolored glass seed beads"
- "Medium wound glass beads with copper inclusions"
- "Large chevron beads with blue, red, and white patterns"
- "Small green tubular glass beads, medium round ivory beads"
- "Large faceted crystal beads with gold flecks"

### 11. Units of Measurement (code="11_units_of_measurement")
**Question**: What were the units of measurement for the bead?

**Values**:
- `1`: String
- `2`: Plaited string/woven string
- `3`: Necklace, anklet, bracelet, waist beads, headwear
- `4`: Other measurement (write in STRING name)

**HISTORIAN EXAMPLES**:
- "Strings of beads" → `1`
- "Woven strand of beads" → `2`
- "Beaded necklaces" → `3`
- "Fathom of beads" → `4, fathom`
- "Fundo of beads" → `4, fundo`
- "Khete of beads" → `4, khete`

### 12. Local Name (code="12_local_name")
**Question**: Does the bead have an ethnic or other language name?

**Format**: If yes, write the STRING of the name of the bead.

**HISTORIAN EXAMPLES**:
- "Aggrey beads, as the natives called them" → `aggrey`
- "Samesame beads" → `samesame`

### 13. Notes (code="13_notes")
**Question**: Any notes and observations from this snippet that would be useful for context, future analysis, and for clarification of the coding?

**Format**: Free text field for additional context and observations.

**HISTORIAN EXAMPLES**:
- "The exchange took place during a period of drought, which may have affected values"
- "Text mentions historical change in bead values over time - current rate vs. past rate"
- "Author notes that the beads were considered especially valuable due to their rarity in this region"
- "The exchange was part of a larger diplomatic mission, suggesting social aspects beyond commercial value"
- "Local taboos influenced which colors of beads were acceptable for trade"
- "Text mentions that exchange rates had been stable for several decades"
- "The author appears unfamiliar with local customs and may have misinterpreted some aspects of the exchange"
- "Exchange took place in context of religious festival, potentially affecting rates"
- "Text indicates these particular beads had special ritual significance beyond their material value"
- "Multiple transliterated local terms used for different bead types, suggesting complex local classification system"

**EXPERT ADDITIONS**:
- **"Historical generalization about trading patterns between natives and trading adventurers"**
- **"Prayer beads observed, no exchange occurred"**
- **"Decorative beads on horse equipment, commercial price list"**
- **"Gift-giving for diplomatic relations"**

## EXPERT-VALIDATED EDGE CASE EXAMPLES

### Historical Generalization Pattern (Expert Decision: 4a_exchange = "no")
**Text**: "When trading adventurers first commenced on the White Nile, the natives sold ivory for beads and copper bracelets"
**Expert Coding**:
- `4a_exchange = "no"` (pattern, not specific transaction)
- `4b_beads_exchanged = "beads"` (capture context)
- `4c_exchanged_item = "ivory"` (capture context)
- `5a_related_raw_materials = "copper bracelets"`
- `13_notes = "Historical generalization about early trading patterns"`

### Intangible Exchange Pattern (Expert Decision: Valid Exchange)
**Text**: "I bought the secret for a string of wooden beads"
**Expert Coding**:
- `4a_exchange = "xo"` (intangible goods count)
- `4c_exchanged_item = "secret for writing amulets"`
- `10d_type = "wooden"`
- `11_units_of_measurement = "1"` (string)

## HISTORIAN CODING PROCESS FLOW

1. Determine if the entry contains relevant information about beads (`read_entry`)
2. Identify the function of beads in the described context (`1_function`)
3. Assess the nature of any exchange described (`2_nature_of_exchange`)
4. Document the groups involved in the exchange (`3_between_groups`)
5. Note whether beads were actually exchanged (`4a_exchange`)
6. Document the specific beads used in exchange (`4b_beads_exchanged`)
7. Record what items were received in exchange for beads (`4c_exchanged_item`)
8. List all items traded or given as gifts at the location (`5_related_items`)
9. Note ethnic groups connected to the exchange (`6_bead_ethnic_group`)
10. Indicate if the exchange location was a market (`7_market_town`)
11. Document the geographic location (`8_location_name`)
12. Note the place of manufacture of the beads (`9_place_of_manufacture`)
13. Describe all beads observed in detail (`10_beads_observed`)
14. Record how beads were measured or counted (`11_units_of_measurement`)
15. Document any local names for beads (`12_local_name`)
16. Add any additional relevant observations (`13_notes`)

## HISTORIAN SPECIAL NOTES

1. **Context is key**: Information about beads might be spread across several sentences or paragraphs.
2. **Implicit references**: Sometimes beads are referenced indirectly or as part of a larger exchange.
3. **Historical knowledge**: Certain terms (like "coral beads") have specific cultural or historical meanings.
4. **Multiple passages**: When analyzing a document with multiple references to beads, consider each distinct reference separately.
5. **Ambiguity**: When information is ambiguous, note this in your reasoning but select the most likely code.
6. **Duplicate entries**: For every distinct exchange rate at the same market/location, duplicate the row and enter relevant exchange rates separately.

## HISTORIAN COMMON PITFALLS TO AVOID

1. **Missing implicit references** to beads in historical texts
2. **Overlooking ethnic groups** mentioned in relation to beads
3. **Confusing material with color** (e.g., coral is both a material and implies a color)
4. **Missing associated goods** that appear in different parts of the text
5. **Ignoring contextual clues** about function or exchange nature
6. **Failing to specify location names** beyond just the numeric code
7. **Missing tributary or gift exchanges** which should be coded as `xo` for exchange

## XML OUTPUT FORMAT (Complete)

<read_entry>1</read_entry>
<1a_physical_function>2</1a_physical_function>
<1b_trade_function>NaN</1b_trade_function>
<1c_social_function>3</1c_social_function>
<2_nature_of_exchange>4</2_nature_of_exchange>
<3_between_groups>1</3_between_groups>
<4a_exchange>xo</4a_exchange>
<4b_beads_exchanged>glass beads</4b_beads_exchanged>
<4c_exchanged_item>diplomatic relations</4c_exchanged_item>
<5a_related_raw_materials>copper, iron</5a_related_raw_materials>
<5b_related_jewerly_fashion>NaN</5b_related_jewerly_fashion>
<5c_related_consumerables>NaN</5c_related_consumerables>
<5d_related_decoratives>NaN</5d_related_decoratives>
<6_bead_ethnic_group>Sheikh Wadelai's people</6_bead_ethnic_group>
<7_market_town>0</7_market_town>
<8_location_name>White Nile region</8_location_name>
<9_place_of_manufacture>NaN</9_place_of_manufacture>
<10a_size>NaN</10a_size>
<10b_color>NaN</10b_color>
<10c_shape>NaN</10c_shape>
<10d_type>glass</10d_type>
<11_units_of_measurement>NaN</11_units_of_measurement>
<12_local_name>NaN</12_local_name>
<13_notes>Gift-giving for diplomatic relations</13_notes>
<confidence>0.95</confidence>

## FINAL INSTRUCTION

Apply preprocessing quality checks first. Code conservatively based on explicit evidence. For every distinct exchange rate at the same location, create separate entries. Capture contextual details even when coding "no exchange" for historical pattern analysis.
"""
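
# The two helpers below are a hedged sketch of how the codebook's rules could be
# applied in code; they are hypothetical and not the notebook's own implementation.
# `passes_quality_rule_p1` approximates Rule P1, assuming "readable characters"
# means alphanumeric characters and that the 80% share is measured over
# non-whitespace characters. `parse_codebook_xml` reads the XML output format with
# a regular expression, because tag names that start with a digit (for example
# <4a_exchange>) are not valid XML names and a strict XML parser would reject them.
import re

_CODEBOOK_TAG = re.compile(r"<([A-Za-z0-9_]+)>(.*?)</\1>", re.DOTALL)

def passes_quality_rule_p1(text, min_readable=50, max_non_alpha_share=0.8):
    """Return False for entries Rule P1 would mark as read_entry = "NaN"."""
    if not isinstance(text, str):
        return False  # e.g. a missing cell read from the spreadsheet
    readable = sum(ch.isalnum() for ch in text)
    if readable < min_readable:
        return False  # too short or too fragmentary
    visible = [ch for ch in text if not ch.isspace()]
    non_alpha = sum(not ch.isalpha() for ch in visible)
    return non_alpha / len(visible) <= max_non_alpha_share

def parse_codebook_xml(response_text):
    """Map each <tag>value</tag> pair in a model response to a dict entry."""
    return {tag: value.strip() for tag, value in _CODEBOOK_TAG.findall(response_text)}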

# =============================================================================
# ENHANCED BACKUP SYSTEM FUNCTIONS WITH RESTART CAPABILITY
# =============================================================================

def save_codetect_backup(data, backup_type='pickle', custom_name=None):
    """Save Co-DETECT data to backup file with timestamp"""
    global progress_counter, processed_indices

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = custom_name or f"{BACKUP_FILENAME}_{timestamp}"

    # Enhanced backup includes tracking information
    backup_data = {
        'results': data,
        'progress_counter': progress_counter,
        'processed_indices': list(processed_indices),
        'timestamp': timestamp,
        'sample_size': SAMPLE_SIZE
    }

    if backup_type == 'pickle':
        filepath = os.path.join(BACKUP_DIR, f"{filename}.pkl")
        with open(filepath, 'wb') as f:
            pickle.dump(backup_data, f)
    elif backup_type == 'json':
        filepath = os.path.join(BACKUP_DIR, f"{filename}.json")
        with open(filepath, 'w') as f:
            json.dump(backup_data, f, indent=2, default=str)
    elif backup_type == 'csv' and isinstance(data, pd.DataFrame):
        filepath = os.path.join(BACKUP_DIR, f"{filename}.csv")
        data.to_csv(filepath, index=False)
        # Also save tracking info separately
        tracking_file = os.path.join(BACKUP_DIR, f"{filename}_tracking.json")
        with open(tracking_file, 'w') as f:
            json.dump({
                'progress_counter': progress_counter,
                'processed_indices': list(processed_indices),
                'timestamp': timestamp
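            }, f)
    else:
        # Without this branch `filepath` would be undefined below (NameError)
        raise ValueError(
            f"Unsupported backup_type {backup_type!r} for data of type {type(data).__name__}"
        )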

    print(f"✅ Backup saved: {os.path.basename(filepath)}")
    print(f"📊 Progress: {progress_counter}/{SAMPLE_SIZE} texts processed")
    return filepath
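
# Illustration of the JSON backup format written above: a round trip through a
# temporary directory. This is a self-contained sketch with placeholder values;
# `roundtrip_backup_demo` is hypothetical and does not touch BACKUP_DIR. It shows
# that `processed_indices` comes back from JSON as a list and has to be turned
# into a set again, as restore_from_backup() does.
import json
import os
import tempfile

def roundtrip_backup_demo():
    """Write a small backup payload to JSON, read it back, and return it."""
    payload = {
        'results': [{'row_index': 0, 'confidence': 0.95}],  # placeholder result
        'progress_counter': 1,
        'processed_indices': [0],  # sets are not JSON-serializable, so a list is stored
        'timestamp': '20250101_000000',  # placeholder timestamp
        'sample_size': 1500,
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'demo_backup.json')
        with open(path, 'w') as f:
            json.dump(payload, f, indent=2, default=str)
        with open(path, 'r') as f:
            restored = json.load(f)
    # Convert back to a set for fast membership checks on resume
    restored['processed_indices'] = set(restored['processed_indices'])
    return restored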

def auto_backup_codetect(data, force=False):
    """Automatically backup Co-DETECT data every LINES_PER_BACKUP lines"""
    global progress_counter

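    # Skip the empty starting state (counter == 0) unless a backup is forced
    if force or (progress_counter > 0 and progress_counter % LINES_PER_BACKUP == 0):
        # Prefix with BACKUP_FILENAME so load_codetect_backup() and
        # list_available_backups() can find these files
        return save_codetect_backup(data, backup_type='json', custom_name=f"{BACKUP_FILENAME}_{progress_counter}")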
    return None

def load_codetect_backup(custom_name=None):
    """Load the most recent Co-DETECT backup with enhanced tracking"""
    pattern = custom_name or BACKUP_FILENAME

    # Find all backup files
    backup_files = [f for f in os.listdir(BACKUP_DIR)
                   if f.startswith(pattern) and (f.endswith('.pkl') or f.endswith('.json'))]

    if not backup_files:
        print("❌ No Co-DETECT backup files found")
        return None

    # Get the most recent file
    latest_file = max(backup_files, key=lambda x: os.path.getmtime(os.path.join(BACKUP_DIR, x)))
    filepath = os.path.join(BACKUP_DIR, latest_file)

    # Load the data
    if latest_file.endswith('.pkl'):
        with open(filepath, 'rb') as f:
            backup_data = pickle.load(f)
    elif latest_file.endswith('.json'):
        with open(filepath, 'r') as f:
            backup_data = json.load(f)

    print(f"✅ Loaded backup: {latest_file}")

    # Handle both old and new backup formats
    if isinstance(backup_data, dict) and 'results' in backup_data:
        # New format with tracking
        return backup_data
    else:
        # Old format - just results
        return {
            'results': backup_data,
            'progress_counter': len(backup_data) if isinstance(backup_data, list) else 0,
            'processed_indices': [],
            'timestamp': 'unknown'
        }

def list_available_backups():
    """List all available backup files with details"""
    print("📁 Available Co-DETECT Backups:")
    print("=" * 80)

    backup_files = [f for f in os.listdir(BACKUP_DIR)
                   if f.startswith(BACKUP_FILENAME) and (f.endswith('.pkl') or f.endswith('.json'))]

    if not backup_files:
        print("❌ No backup files found")
        return []

    # Sort by modification time (newest first)
    backup_files.sort(key=lambda x: os.path.getmtime(os.path.join(BACKUP_DIR, x)), reverse=True)

    backup_info = []
    for i, filename in enumerate(backup_files[:10], 1):  # Show the 10 most recent backups
        filepath = os.path.join(BACKUP_DIR, filename)
        mod_time = datetime.fromtimestamp(os.path.getmtime(filepath))
        file_size = os.path.getsize(filepath) / 1024  # KB

        # Try to read progress info
        try:
            if filename.endswith('.json'):
                with open(filepath, 'r') as f:
                    data = json.load(f)
                    if isinstance(data, dict) and 'progress_counter' in data:
                        progress = data['progress_counter']
                    else:
                        progress = len(data) if isinstance(data, list) else 'Unknown'
            else:
                progress = 'Unknown'
        except Exception:  # unreadable or malformed backup; avoid swallowing KeyboardInterrupt
            progress = 'Unknown'

        print(f"{i}. {filename}")
        print(f"   Modified: {mod_time.strftime('%Y-%m-%d %H:%M:%S')}")
        print(f"   Size: {file_size:.1f} KB | Progress: {progress}")
        print()

        backup_info.append({
            'filename': filename,
            'filepath': filepath,
            'modified': mod_time,
            'progress': progress
        })

    return backup_info

def restore_from_backup(backup_name=None):
    """
    ENHANCED: Restore Co-DETECT session from a specific backup
    This is the main function to use when restarting after interruption
    """
    global codetect_results, progress_counter, processed_indices, session_start_time

    print("🔄 RESTORING CO-DETECT SESSION FROM BACKUP")
    print("=" * 80)

    if backup_name:
        # Load specific backup
        filepath = os.path.join(BACKUP_DIR, backup_name)
        if not os.path.exists(filepath):
            print(f"❌ Backup file not found: {backup_name}")
            return False

        if filepath.endswith('.pkl'):
            with open(filepath, 'rb') as f:
                backup_data = pickle.load(f)
        else:
            with open(filepath, 'r') as f:
                backup_data = json.load(f)
    else:
        # Load most recent backup
        backup_data = load_codetect_backup()
        if not backup_data:
            return False

    # Extract data from backup
    if isinstance(backup_data, dict) and 'results' in backup_data:
        # New format
        codetect_results = backup_data['results']
        progress_counter = backup_data.get('progress_counter', len(codetect_results))
        processed_indices = set(backup_data.get('processed_indices', []))
        backup_timestamp = backup_data.get('timestamp', 'unknown')

        # If processed_indices is empty but we have results, rebuild it
        if not processed_indices and codetect_results:
            processed_indices = set(r.get('row_index', r.get('text_id', i))
                                   for i, r in enumerate(codetect_results))
    else:
        # Old format
        codetect_results = backup_data if isinstance(backup_data, list) else []
        progress_counter = len(codetect_results)
        # Rebuild processed_indices from results
        processed_indices = set(r.get('row_index', r.get('text_id', i))
                               for i, r in enumerate(codetect_results))
        backup_timestamp = 'unknown'

    session_start_time = datetime.now()

    print(f"✅ SESSION RESTORED SUCCESSFULLY")
    print(f"   Backup timestamp: {backup_timestamp}")
    print(f"   Texts already processed: {len(codetect_results)}")
    print(f"   Progress counter: {progress_counter}/{SAMPLE_SIZE}")
    print(f"   Remaining texts: {SAMPLE_SIZE - progress_counter}")
    print(f"   Processed indices tracked: {len(processed_indices)}")
    print()
    print("💡 You can now continue with: resume_codetect_analysis(client, sample_df)")

    return True
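
# Illustrative sketch (not part of the original pipeline): how the set of
# processed indices can be recovered from either backup layout handled in
# restore_from_backup. The helper name and any sample payloads are hypothetical.
def _example_indices_from_backup(backup_data):
    """Return the processed row indices implied by a backup payload."""
    if isinstance(backup_data, dict) and 'results' in backup_data:
        results = backup_data['results']  # new format: dict with metadata
        stored = set(backup_data.get('processed_indices', []))
        if stored:
            return stored
    else:
        # old format: a bare list of result dicts
        results = backup_data if isinstance(backup_data, list) else []
    # Same fallback chain as restore_from_backup: row_index, then text_id, then position
    return set(r.get('row_index', r.get('text_id', i)) for i, r in enumerate(results))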

# =============================================================================
# ANTHROPIC API SETUP
# =============================================================================

def setup_anthropic_client():
    """Set up Anthropic API client"""
    print("🔑 Setting up Anthropic API client...")

    # Get API key from Colab secrets
    try:
        from google.colab import userdata
        api_key = userdata.get('ANTHROPIC_API_KEY')
        print("✅ API key loaded from Colab secrets")
    except Exception:
        # Fallback: enter API key manually (secret missing or not running in Colab)
        import getpass
        api_key = getpass.getpass("Enter your Anthropic API key: ")

    client = anthropic.Anthropic(api_key=api_key)

    # Test the connection
    try:
        test_response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=10,
            messages=[{"role": "user", "content": "Test"}]
        )
        print("✅ Anthropic API connection successful!")
        return client
    except Exception as e:
        print(f"❌ Error connecting to Anthropic API: {e}")
        return None

# =============================================================================
# DATA LOADING AND PREPROCESSING
# =============================================================================

def load_and_prepare_data():
    """Load and preprocess the bead exchange data"""
    print("📊 Loading and preparing bead exchange data...")

    file_path = os.path.join(WORKING_DIR, DATASET_FILE)

    # Check if file exists
    if not os.path.exists(file_path):
        print(f"❌ File not found: {file_path}")
        print(f"Please upload '{DATASET_FILE}' to {WORKING_DIR}")
        return None

    try:
        # Load Excel file
        df = pd.read_excel(file_path)
        print(f"✅ Original dataset: {len(df):,} rows")

        # Basic preprocessing
        valid_df = df.dropna(subset=['text_page_gp']).copy()
        print(f"✅ Valid text entries: {len(valid_df):,}")

        # Apply expert-validated data quality filters
        filtered_df = apply_data_quality_filters(valid_df)

        # Create stratified sample
        sample_df = create_stratified_sample(filtered_df, SAMPLE_SIZE)

        return sample_df

    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return None

def apply_data_quality_filters(df):
    """Apply expert-validated data quality preprocessing rules"""
    print("🔍 Applying expert-validated data quality filters...")

    initial_count = len(df)

    # Rule P1: Text length filtering
    df['text_length'] = df['text_page_gp'].str.len()
    df['readable_chars'] = df['text_page_gp'].str.count(r'[a-zA-Z]')
    df['readable_ratio'] = df['readable_chars'] / df['text_length']

    # Filter very short texts (<50 chars)
    quality_df = df[df['text_length'] >= 50].copy()
    removed_short = len(df) - len(quality_df)
    print(f"📏 After length filter (≥50 chars): {len(quality_df):,} entries ({removed_short} removed)")

    # Filter heavily corrupted OCR (>80% non-alphabetic)
    quality_df = quality_df[quality_df['readable_ratio'] >= 0.2].copy()
    removed_corrupted = len(df) - removed_short - len(quality_df)
    print(f"🔤 After OCR quality filter (≥20% readable): {len(quality_df):,} entries ({removed_corrupted} removed)")

    # Rule P2: Content relevance filtering for context pages
    bead_terms = ['bead', 'beads', 'pearl', 'coral', 'glass', 'necklace', 'bracelet', 'ornament', 'jewelry']

    def has_bead_content(text, page_type):
        if pd.isna(text):
            return False
        text_lower = str(text).lower()
        has_bead_terms = any(term in text_lower for term in bead_terms)

        # If it's a context page, it must have bead terms
        if page_type == 'support' and not has_bead_terms:
            return False
        return True

    quality_df['has_bead_content'] = quality_df.apply(
        lambda row: has_bead_content(row['text_page_gp'], row.get('page_type', '')), axis=1
    )

    quality_df = quality_df[quality_df['has_bead_content']].copy()
    removed_irrelevant = initial_count - removed_short - removed_corrupted - len(quality_df)
    print(f"📋 After content relevance filter: {len(quality_df):,} entries ({removed_irrelevant} removed)")

    total_removed = initial_count - len(quality_df)
    print(f"✅ Total filtered out: {total_removed:,} entries ({(total_removed/initial_count*100):.1f}%)")

    return quality_df
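
# Illustrative sketch (hypothetical helper, not used by the pipeline): the two
# Rule P1 checks applied to a single string, assuming the same thresholds as
# apply_data_quality_filters (at least 50 characters, at least 20% ASCII letters).
def _example_passes_rule_p1(text, min_length=50, min_readable_ratio=0.2):
    import re
    if len(text) < min_length:
        return False
    readable_chars = len(re.findall(r'[a-zA-Z]', text))
    return readable_chars / len(text) >= min_readable_ratio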

def create_stratified_sample(df, sample_size):
    """Create stratified sample for Co-DETECT analysis"""
    print(f"🎯 Creating stratified sample of {sample_size:,} texts...")

    # Stratify by country and explorer to ensure diversity.
    # A fixed random_state keeps the sample identical across sessions; resume mode
    # relies on this because processed_indices refers to positions in this sample.
    sample_df = df.groupby(['countries', 'explorer_surname'], group_keys=False).apply(
        lambda x: x.sample(min(len(x), max(1, int(sample_size * len(x) / len(df)))), random_state=42)
    )

    # If we don't have enough, fill randomly from rows not yet sampled.
    # The original index is kept until the end so this comparison is valid.
    if len(sample_df) < sample_size:
        remaining = df[~df.index.isin(sample_df.index)]
        additional_needed = sample_size - len(sample_df)
        if len(remaining) >= additional_needed:
            additional = remaining.sample(additional_needed, random_state=42)
            sample_df = pd.concat([sample_df, additional])

    # Trim to exact size if over
    if len(sample_df) > sample_size:
        sample_df = sample_df.sample(sample_size, random_state=42)

    sample_df = sample_df.reset_index(drop=True)

    print(f"✅ Prepared dataset: {len(sample_df):,} texts")
    return sample_df
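
# Illustrative sketch (hypothetical helper): the per-stratum allocation rule used
# in create_stratified_sample, shown on plain group sizes. Each group receives a
# share proportional to its size, with at least one row and never more rows than
# the group contains.
def _example_stratum_allocation(group_sizes, sample_size):
    total = sum(group_sizes.values())
    return {name: min(size, max(1, int(sample_size * size / total)))
            for name, size in group_sizes.items()}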

def estimate_costs(sample_size):
    """Estimate API costs for the analysis"""
    input_tokens = sample_size * ESTIMATED_TOKENS_PER_TEXT
    output_tokens = sample_size * 300  # Estimated output tokens per response

    input_cost = (input_tokens / 1000) * COST_PER_1K_TOKENS_INPUT
    output_cost = (output_tokens / 1000) * COST_PER_1K_TOKENS_OUTPUT
    total_cost = input_cost + output_cost

    print(f"💰 Estimated cost for {sample_size:,} texts: ${total_cost:.2f}")
    print(f"   Input tokens: {input_tokens:,} (${input_cost:.2f})")
    print(f"   Output tokens: {output_tokens:,} (${output_cost:.2f})")
    print(f"   Expected analysis time: ~{sample_size//25:.0f} minutes (Claude 3.5 Haiku is fast!)")

    return total_cost
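
# Worked example (hypothetical helper): the arithmetic inside estimate_costs with
# the per-1K-token rates passed in explicitly. The defaults mirror the constants
# configured at the top of this notebook and are assumptions, not verified prices.
# For 1,500 texts: 1,500 x 800 = 1,200,000 input tokens and
# 1,500 x 300 = 450,000 output tokens.
def _example_cost_breakdown(sample_size, tokens_in_per_text=800, tokens_out_per_text=300,
                            rate_in_per_1k=0.001, rate_out_per_1k=0.005):
    input_cost = sample_size * tokens_in_per_text / 1000 * rate_in_per_1k
    output_cost = sample_size * tokens_out_per_text / 1000 * rate_out_per_1k
    return input_cost, output_cost, input_cost + output_cost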

# =============================================================================
# AI ANNOTATION FUNCTIONS
# =============================================================================

def create_annotation_prompt(text):
    """Create annotation prompt with expert-validated codebook"""
    return f"""
You are an expert historical researcher analyzing bead exchange patterns. Use the following expert-validated codebook to analyze this historical text about beads.

EXPERT-VALIDATED CODEBOOK:
{EXPERT_CODEBOOK}

CRITICAL REQUIREMENTS:
1. Apply preprocessing data quality rules first
2. Follow the mandatory decision flowchart
3. Use expert-validated examples for edge cases
4. When in doubt, be CONSERVATIVE and require explicit evidence
5. Provide a confidence score (0.0-1.0) for your analysis

TEXT TO ANALYZE:
"{text}"

Provide your analysis in this XML format:

<analysis>
<read_entry>[0/1/2/NaN]</read_entry>
<4a_exchange>[xo/no/NA/NaN]</4a_exchange>
<2_nature_of_exchange>[1/2/3/4/NA/NaN]</2_nature_of_exchange>
<3_between_groups>[1/2/3/4/NA/NaN]</3_between_groups>
<4b_beads_exchanged>[description or NA/NaN]</4b_beads_exchanged>
<4c_exchanged_item>[description or NA/NaN]</4c_exchanged_item>
<1a_physical_function>[2/NA/NaN]</1a_physical_function>
<1b_trade_function>[2/NA/NaN]</1b_trade_function>
<1c_social_function>[3/NA/NaN]</1c_social_function>
<6_bead_ethnic_group>[groups or NA/NaN]</6_bead_ethnic_group>
<8_location_name>[location or NA/NaN]</8_location_name>
<10d_type>[material or NA/NaN]</10d_type>
<11_units_of_measurement>[1/2/3/4/NA/NaN]</11_units_of_measurement>
<13_notes>[additional context or NA/NaN]</13_notes>
<confidence>[0.0-1.0]</confidence>
<reasoning>[brief explanation of key decisions]</reasoning>
</analysis>
"""

def parse_xml_response(response_text):
    """Parse the tagged <analysis> block from an AI annotation response"""
    import re

    # Field tags such as <4a_exchange> start with a digit, which is not a valid
    # XML element name, so xml.etree would reject the whole block. Extract the
    # <tag>value</tag> pairs with a regular expression instead.
    start = response_text.find('<analysis>')
    end = response_text.find('</analysis>')

    # Check both positions before slicing; find() returns -1 when a tag is missing
    if start != -1 and end != -1:
        body = response_text[start + len('<analysis>'):end]
        result = {tag: value.strip()
                  for tag, value in re.findall(r'<(\w+)>(.*?)</\1>', body, re.DOTALL)}
        if result:
            return result

    # Fallback: try to extract confidence at minimum
    conf_match = re.search(r'<confidence>\s*([\d.]+)\s*</confidence>', response_text)
    if conf_match:
        return {'confidence': conf_match.group(1)}
    return None
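
# Illustrative check (hypothetical helper, not the notebook's parser): element
# names that begin with a digit, such as <4a_exchange> in the prompt template,
# are not valid XML names, so xml.etree.ElementTree rejects them. A regular
# expression can still read such tags from the text between the wrapper tags.
def _example_read_digit_tags(block):
    import re
    import xml.etree.ElementTree as ET
    try:
        ET.fromstring(block)
        strict_xml_ok = True
    except ET.ParseError:
        strict_xml_ok = False
    # Work on the content between <analysis> and </analysis> only
    inner = block[block.find('<analysis>') + len('<analysis>'):block.find('</analysis>')]
    fields = dict(re.findall(r'<(\w+)>(.*?)</\1>', inner, re.DOTALL))
    return strict_xml_ok, fields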

def annotate_single_text(client, text, text_id, row_data):
    """Annotate a single text using AI with expert-validated codebook"""
    global progress_counter, codetect_results, processed_indices

    progress_counter += 1
    print(f"Processing {text_id}/{SAMPLE_SIZE}: ", end="", flush=True)

    prompt = create_annotation_prompt(text)

    for attempt in range(MAX_RETRIES):
        try:
            response = client.messages.create(
                model="claude-3-5-haiku-20241022",
                max_tokens=1500,
                temperature=0.1,  # Low temperature for consistency
                messages=[{"role": "user", "content": prompt}]
            )

            response_text = response.content[0].text

            # Parse XML response
            result = parse_xml_response(response_text)

            if result and 'confidence' in result:
                confidence = float(result['confidence'])
                status = "EDGE CASE" if confidence < EDGE_CASE_THRESHOLD else "OK"
                print(f"{status} (conf: {confidence:.2f})")

                # Add metadata
                result['text_id'] = text_id
                result['original_text'] = text
                result['row_index'] = row_data.name if hasattr(row_data, 'name') else text_id
                result['explorer'] = f"{row_data.get('explorer_first_name', '')} {row_data.get('explorer_surname', '')}"
                result['year'] = row_data.get('year_began', '')
                result['countries'] = row_data.get('countries', '')

                # Add to global results and tracking
                codetect_results.append(result)
                processed_indices.add(result['row_index'])

                # Auto-backup progress
                auto_backup_codetect(codetect_results)

                return result
            else:
                print("Error parsing response: " + str(response_text)[:50])
                if attempt < MAX_RETRIES - 1:
                    print("\nRetrying...", end="")
                    time.sleep(2)

        except Exception as e:
            print(f"Error: {str(e)[:50]}")
            if attempt < MAX_RETRIES - 1:
                print("\nRetrying...", end="")
                time.sleep(5)

    # Fallback for failed cases
    print("FAILED")
    failed_result = {
        'text_id': text_id,
        'confidence': 0.0,
        'read_entry': 'NaN',
        'error': 'Failed to process after retries',
        'original_text': text,
        'row_index': row_data.name if hasattr(row_data, 'name') else text_id
    }
    codetect_results.append(failed_result)
    processed_indices.add(failed_result['row_index'])
    return failed_result

# =============================================================================
# MAIN CO-DETECT FUNCTIONS WITH RESTART CAPABILITY
# =============================================================================

def run_annotation_phase(client, sample_df, resume=False):
    """
    Run the annotation phase with automatic backup and resume capability

    Args:
        client: Anthropic API client
        sample_df: DataFrame with texts to annotate
        resume: If True, skip already processed texts
    """
    global progress_counter, codetect_results, processed_indices

    print(f"🏷️ Step 2: Annotating texts...")
    print(f"💾 Auto-backup enabled every {LINES_PER_BACKUP} texts")

    if resume:
        print(f"♻️ RESUME MODE: Skipping {len(processed_indices)} already processed texts")

    start_time = time.time()

    # Process texts
    texts_to_process = len(sample_df) - len(processed_indices) if resume else len(sample_df)
    print(f"📊 Texts to process: {texts_to_process}")

    for idx, row in sample_df.iterrows():
        # Skip if already processed (when resuming)
        if resume and idx in processed_indices:
            continue

        text = row['text_page_gp']
        result = annotate_single_text(client, text, idx + 1, row)

        # Save major checkpoint every 200 texts
        if progress_counter % 200 == 0:
            save_codetect_backup(codetect_results, backup_type='json',
                               custom_name=f"codetect_checkpoint_{progress_counter}")

    # Force final backup
    save_codetect_backup(codetect_results, backup_type='json',
                        custom_name=f"codetect_final_{progress_counter}")

    elapsed_time = time.time() - start_time
    edge_cases = [r for r in codetect_results if float(r.get('confidence', 0)) < EDGE_CASE_THRESHOLD]

    print(f"\n✅ Annotation complete: {len(edge_cases)} edge cases found ({len(edge_cases)/len(codetect_results)*100:.1f}%)")
    print(f"⏱️ Time elapsed: {elapsed_time/60:.1f} minutes")

    return codetect_results

def resume_codetect_analysis(client, sample_df):
    """
    MAIN FUNCTION TO RESUME INTERRUPTED ANALYSIS

    Use this after calling restore_from_backup() to continue where you left off

    Args:
        client: Anthropic API client (from setup_anthropic_client())
        sample_df: DataFrame with texts (from load_and_prepare_data())
    """
    global codetect_results, progress_counter, processed_indices

    print("🔄 RESUMING CO-DETECT ANALYSIS")
    print("=" * 80)
    print(f"📊 Current status:")
    print(f"   Already processed: {len(codetect_results)} texts")
    print(f"   Progress counter: {progress_counter}")
    print(f"   Remaining: {SAMPLE_SIZE - len(processed_indices)} texts")
    print("=" * 80)

    if len(processed_indices) >= SAMPLE_SIZE:
        print("✅ All texts already processed!")
        print("💡 Run analyze_edge_cases() and create_analysis_visualizations() to continue")
        return codetect_results

    proceed = input("\n▶️  Continue annotation from where we left off? (y/n): ")
    if proceed.lower() != 'y':
        print("Analysis paused.")
        return codetect_results

    # Continue annotation with resume=True
    results = run_annotation_phase(client, sample_df, resume=True)

    print("\n✅ RESUME COMPLETE!")
    print(f"📊 Total texts processed: {len(results)}")
    print("💡 Next steps:")
    print("   1. analyze_edge_cases()")
    print("   2. create_analysis_visualizations()")
    print("   3. generate_analysis_report()")

    return results

def analyze_edge_cases():
    """Analyze edge cases and create clusters"""
    global codetect_results

    edge_cases = [r for r in codetect_results if float(r.get('confidence', 0)) < EDGE_CASE_THRESHOLD]

    if len(edge_cases) == 0:
        print("🎉 No edge cases found! Codebook performing excellently.")
        return [], []

    print(f"🔍 Step 3: Clustering {len(edge_cases)} edge cases...")

    if len(edge_cases) < 3:
        print("⚠️ Too few edge cases for clustering")
        return edge_cases, []

    # Extract texts for clustering
    texts = [case.get('original_text', '') for case in edge_cases]

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2))
    text_vectors = vectorizer.fit_transform(texts)

    # Determine number of clusters
    n_clusters = min(max(2, len(edge_cases) // 5), 5)

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(text_vectors)

    print(f"✅ Created {n_clusters} clusters")

    # Group edge cases by cluster
    clustered_cases = []
    for i in range(n_clusters):
        cluster_cases = [edge_cases[j] for j in range(len(edge_cases)) if cluster_labels[j] == i]
        if cluster_cases:
            clustered_cases.append({
                'cluster_id': i + 1,
                'size': len(cluster_cases),
                'cases': cluster_cases
            })

    return edge_cases, clustered_cases
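
# Illustrative sketch (hypothetical helper): the cluster-count rule used in
# analyze_edge_cases, one cluster per five edge cases, clamped to the range 2-5.
def _example_cluster_count(n_edge_cases):
    return min(max(2, n_edge_cases // 5), 5)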

def create_analysis_visualizations():
    """Create comprehensive visualizations of Co-DETECT results"""
    global codetect_results

    print("📊 Step 5: Creating visualizations...")

    # Convert results to DataFrame
    df_results = pd.DataFrame(codetect_results)
    df_results['confidence'] = pd.to_numeric(df_results['confidence'], errors='coerce')
    df_results['is_edge_case'] = df_results['confidence'] < EDGE_CASE_THRESHOLD

    # Create subplots
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=['Confidence Distribution', 'Edge Case Analysis',
                       'Exchange Patterns', 'Function Patterns'],
        specs=[[{'type': 'histogram'}, {'type': 'bar'}],
               [{'type': 'bar'}, {'type': 'bar'}]]
    )

    # 1. Confidence distribution
    fig.add_trace(
        go.Histogram(x=df_results['confidence'], nbinsx=20, name='Confidence Distribution'),
        row=1, col=1
    )

    # 2. Edge case analysis
    edge_case_counts = df_results['is_edge_case'].value_counts()
    fig.add_trace(
        go.Bar(x=['Normal Cases', 'Edge Cases'],
               y=[edge_case_counts.get(False, 0), edge_case_counts.get(True, 0)],
               name='Case Distribution'),
        row=1, col=2
    )

    # 3. Exchange patterns (the column is absent if no response was parsed in full)
    if '4a_exchange' in df_results.columns:
        exchange_counts = df_results['4a_exchange'].value_counts()
        fig.add_trace(
            go.Bar(x=exchange_counts.index, y=exchange_counts.values, name='Exchange Patterns'),
            row=2, col=1
        )

    # 4. Function patterns
    if '1c_social_function' in df_results.columns:
        function_counts = df_results['1c_social_function'].value_counts()
        fig.add_trace(
            go.Bar(x=function_counts.index, y=function_counts.values, name='Social Functions'),
            row=2, col=2
        )

    fig.update_layout(
        title_text=f"Co-DETECT Analysis Results - Expert-Validated Codebook v4.0 (n={len(codetect_results):,})",
        showlegend=False,
        height=800
    )

    fig.show()

    return fig

def generate_analysis_report():
    """Generate comprehensive Co-DETECT analysis report"""
    global codetect_results

    print("📋 Generating comprehensive analysis report...")

    df_results = pd.DataFrame(codetect_results)
    df_results['confidence'] = pd.to_numeric(df_results['confidence'], errors='coerce')

    total_cases = len(codetect_results)
    edge_cases = [r for r in codetect_results if float(r.get('confidence', 0)) < EDGE_CASE_THRESHOLD]
    edge_count = len(edge_cases)
    edge_percentage = (edge_count / total_cases) * 100

    # Baseline for the improvement estimate: ~30 edge cases expected in a
    # 1,500-text sample at a 2% edge-case rate
    expected_original_edges = 30

    report = f"""
# Co-DETECT Analysis Report - Iteration 2
## Expert-Validated Codebook v4.0

### Executive Summary
- **Total texts analyzed**: {total_cases:,}
- **Edge cases identified**: {edge_count} ({edge_percentage:.1f}%)
- **Target achievement**: {'✅ ACHIEVED' if edge_percentage < 1.0 else '🎯 IN PROGRESS'} (Target: <1.0%)
- **Improvement estimate**: {expected_original_edges - edge_count} fewer edge cases vs. baseline ({((expected_original_edges - edge_count)/expected_original_edges*100):.1f}% reduction)
- **Sample size increase**: 1,500 texts (vs. 800 in previous iterations)

### Confidence Score Analysis
- **Mean confidence**: {df_results['confidence'].mean():.3f}
- **Median confidence**: {df_results['confidence'].median():.3f}
- **High confidence cases** (≥0.9): {len(df_results[df_results['confidence'] >= 0.9])} ({len(df_results[df_results['confidence'] >= 0.9])/total_cases*100:.1f}%)
- **Medium confidence cases** (0.7-0.89): {len(df_results[(df_results['confidence'] >= 0.7) & (df_results['confidence'] < 0.9)])} ({len(df_results[(df_results['confidence'] >= 0.7) & (df_results['confidence'] < 0.9)])/total_cases*100:.1f}%)
- **Edge cases** (<0.7): {edge_count} ({edge_percentage:.1f}%)

### Exchange Pattern Analysis
"""

    # Add exchange analysis
    if '4a_exchange' in df_results.columns:
        exchange_counts = df_results['4a_exchange'].value_counts()
        for exchange_type, count in exchange_counts.items():
            percentage = (count / total_cases) * 100
            report += f"- **{exchange_type}**: {count} cases ({percentage:.1f}%)\n"

    # Add expert validation summary
    report += f"""

### Expert Validation Impact
This iteration tested expert decisions on 1,500 texts:

1. **Historical Generalizations**: Systematic handling of pattern descriptions
2. **Intangible Exchanges**: Recognition of knowledge/service exchanges
3. **Observational Contexts**: Proper classification without false exchanges
4. **Gift-Giving**: Correct identification as social exchanges
5. **Data Quality Filtering**: Automatic elimination of corrupted content

### Data Quality Filtering Results
- **Corrupted OCR filtered**: Entries with <50 chars or >80% symbols
- **Irrelevant content filtered**: Context pages without bead references
- **Processing efficiency**: Focus on valid bead-related content only

### Recommendations for Next Steps
"""

    if edge_percentage >= 1.0:  # Matches the "<1.0%" target test in the summary
        report += f"- **{edge_count} edge cases remain** - analyze for new patterns\n"
        report += "- Consider additional expert review of remaining cases\n"
        report += "- Potential for final codebook refinement iteration\n"
    else:
        report += "- **🎉 TARGET ACHIEVED!** Edge case rate below 1.0%\n"
        report += "- **Ready for full dataset application**\n"
        report += "- Consider final validation with human expert spot-checks\n"

    report += f"\n### Cost and Efficiency Analysis\n"
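    # Report elapsed session time rather than the current clock time
    session_start = globals().get('session_start_time')
    if session_start:
        elapsed_minutes = (datetime.now() - session_start).total_seconds() / 60
        report += f"- **Session duration**: {elapsed_minutes:.1f} minutes\n"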
    report += f"- **Sample size**: {total_cases:,} texts\n"
    report += f"- **Success rate**: {((total_cases - len([r for r in codetect_results if 'error' in r]))/total_cases*100):.1f}%\n"

    return report

# =============================================================================
# MAIN EXECUTION FUNCTIONS
# =============================================================================

def setup_codetect_analysis():
    """Initialize Co-DETECT analysis environment"""
    print("🎯 Co-DETECT Implementation for Bead Exchange Analysis")
    print("=" * 60)
    print("Expert-Validated Codebook v4.0 with Edge Case Optimizations")
    print("✨ ENHANCED with Robust Restart Capabilities")
    print("🚀 Using Claude 3.5 Haiku (Fast & Cost-Efficient)")
    print(f"📊 Configuration:")
    print(f"   Sample size: {SAMPLE_SIZE:,} texts")
    print(f"   Estimated cost: ${estimate_costs(SAMPLE_SIZE):.2f}")
    print("=" * 60)

    # Initialize setup
    global progress_counter, codetect_results, processed_indices, session_start_time
    progress_counter = 0
    codetect_results = []
    processed_indices = set()
    session_start_time = datetime.now()

    return True

def run_codetect_iteration():
    """Run complete Co-DETECT analysis iteration"""

    print("🚀 Co-DETECT Analysis - Iteration 2 (RESTART-ENABLED)")
    print("=" * 60)

    # Step 1: Data preparation
    print("📊 Step 1: Data Preparation")
    sample_df = load_and_prepare_data()
    if sample_df is None:
        return None

    # Confirm to proceed
    proceed = input(f"\n💰 Estimated cost: ${estimate_costs(SAMPLE_SIZE):.2f}\nProceed with analysis? (y/n): ")
    if proceed.lower() != 'y':
        print("Analysis cancelled.")
        return None

    # Step 2: Setup AI client
    client = setup_anthropic_client()
    if client is None:
        return None

    # Step 3: Run annotation with auto-backup
    results = run_annotation_phase(client, sample_df)

    # Step 4: Analyze edge cases
    edge_cases, clustered_cases = analyze_edge_cases()

    # Step 5: Create visualizations
    fig = create_analysis_visualizations()

    # Step 6: Generate report
    report = generate_analysis_report()
    print("\n" + "=" * 60)
    print("📋 FINAL ANALYSIS REPORT")
    print("=" * 60)
    print(report)

    # Step 7: Save final results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    final_results = {
        'results': codetect_results,
        'edge_cases': edge_cases,
        'clustered_cases': clustered_cases,
        'report': report,
        'sample_df': sample_df.to_dict('records'),
        'config': {
            'sample_size': SAMPLE_SIZE,
            'edge_case_threshold': EDGE_CASE_THRESHOLD,
            'timestamp': timestamp
        }
    }

    save_codetect_backup(final_results, backup_type='json', custom_name=f"codetect_final_{timestamp}")

    # Also save as CSV for easy analysis
    results_df = pd.DataFrame(codetect_results)
    save_codetect_backup(results_df, backup_type='csv', custom_name=f"codetect_results_{timestamp}")

    return final_results

# =============================================================================
# RECOVERY AND UTILITY FUNCTIONS
# =============================================================================

def check_progress():
    """Check current progress of Co-DETECT analysis"""
    global codetect_results, progress_counter, processed_indices, session_start_time

    if len(codetect_results) == 0:
        print("📊 No progress yet - run setup_codetect_analysis() to begin")
        return

    edge_cases = [r for r in codetect_results if float(r.get('confidence', 0)) < EDGE_CASE_THRESHOLD]

    print("=" * 80)
    print("📊 CURRENT CO-DETECT PROGRESS")
    print("=" * 80)
    print(f"   Texts processed: {len(codetect_results)}/{SAMPLE_SIZE}")
    print(f"   Progress: {(len(codetect_results)/SAMPLE_SIZE*100):.1f}%")
    print(f"   Edge cases so far: {len(edge_cases)} ({len(edge_cases)/len(codetect_results)*100:.1f}%)")
    print(f"   Average confidence: {np.mean([float(r.get('confidence', 0)) for r in codetect_results]):.3f}")
    print(f"   Unique indices tracked: {len(processed_indices)}")

    if session_start_time:
        elapsed = (datetime.now() - session_start_time).total_seconds() / 60
        print(f"   Session duration: {elapsed:.1f} minutes")

    if len(codetect_results) > 0:
        try:
            latest_backup = max([f for f in os.listdir(BACKUP_DIR) if f.startswith('codetect_progress')],
                               key=lambda x: os.path.getmtime(os.path.join(BACKUP_DIR, x)), default=None)
            if latest_backup:
                backup_time = datetime.fromtimestamp(os.path.getmtime(os.path.join(BACKUP_DIR, latest_backup)))
                print(f"   Latest backup: {latest_backup}")
                print(f"   Backup time: {backup_time.strftime('%Y-%m-%d %H:%M:%S')}")
        except Exception:
            pass

    print("=" * 80)

# =============================================================================
# EXECUTION INSTRUCTIONS
# =============================================================================

print("✅ Co-DETECT Analysis System Initialized!")
print("🔄 RESTART-ENABLED VERSION")
print(f"📁 Working directory: {WORKING_DIR}")
print(f"📄 Dataset file: {DATASET_FILE}")
print(f"💾 Backup directory: {BACKUP_DIR}")
print(f"📊 Sample size: {SAMPLE_SIZE:,} texts")

print("\n🚀 TO START NEW CO-DETECT ANALYSIS:")
print("=" * 80)
print("1. Ensure 'Munashe_Cleaned.xlsx' is in your Google Drive CoDetectBeadAnalysis folder")
print("2. Add ANTHROPIC_API_KEY to Colab secrets")
print("3. Run the following commands in order:")
print()
print("   # Initialize the analysis")
print("   setup_codetect_analysis()")
print()
print("   # Run the complete analysis")
print("   results = run_codetect_iteration()")
print()

print("\n🔄 TO RESTART FROM INTERRUPTION:")
print("=" * 80)
print("1. List available backups:")
print("   list_available_backups()")
print()
print("2. Restore from most recent backup:")
print("   restore_from_backup()")
print()
print("3. Load your data:")
print("   sample_df = load_and_prepare_data()")
print()
print("4. Setup API client:")
print("   client = setup_anthropic_client()")
print()
print("5. Resume analysis:")
print("   results = resume_codetect_analysis(client, sample_df)")
print()

print("\n🔧 UTILITY FUNCTIONS:")
print("=" * 80)
print("   check_progress()                    # Check current progress")
print("   list_available_backups()            # See all saved backups")
print("   restore_from_backup()               # Restore from latest backup")
print("   restore_from_backup('filename.json') # Restore from specific backup")

print(f"\n🎯 Expected Results with Expert-Validated Codebook:")
print(f"- Edge case reduction from ~30 (2.0%) to <15 (1.0%)")
print("- Better handling of historical generalizations")
print("- Proper recognition of intangible exchanges")
print("- Systematic data quality filtering")
print(f"- Estimated cost: ~${estimate_costs(SAMPLE_SIZE):.2f} (3x cheaper with Haiku!)")
print(f"- Estimated time: ~{SAMPLE_SIZE//25:.0f} minutes (3x faster with Haiku!)")

print("\n" + "=" * 80)
print("✨ ENHANCED FEATURES:")
print("- 💾 Automatic backup every 50 texts")
print("- 🔄 Resume from exact stopping point")
print("- 📊 Progress tracking with processed_indices")
print("- 🎯 Skip already-completed texts")
print("- 💡 Clear restore instructions")
print("=" * 80)

# =============================================================================
# Example usage for restart scenario:
# =============================================================================

"""
# ========================================
# SCENARIO 1: Starting Fresh
# ========================================
setup_codetect_analysis()
results = run_codetect_iteration()

# ========================================
# SCENARIO 2: Server Interrupted at text 237
# ========================================

# Step 1: Check what backups are available
list_available_backups()

# Step 2: Restore from the most recent backup
restore_from_backup()

# Step 3: Load your data
sample_df = load_and_prepare_data()

# Step 4: Setup API client
client = setup_anthropic_client()

# Step 5: Resume from where you left off
# This will automatically skip the 237 texts already processed
results = resume_codetect_analysis(client, sample_df)

# ========================================
# SCENARIO 3: Check Progress Anytime
# ========================================
check_progress()

# ========================================
# SCENARIO 4: Restore from Specific Backup
# ========================================
restore_from_backup('codetect_progress_250.json')
"""