# Prompts Module

> Few-shot prompt construction and kNN example selection.

This module handles:
- kNN-based selection of semantically similar training examples
- Few-shot prompt construction with chain-of-thought reasoning
- Prompt templates for data and code classification

**Research Background**: Few-shot prompting lets an LLM classify from a handful of in-context examples (Brown et al., 2020), and selecting those examples by semantic similarity to the input generally yields higher classification accuracy than random sampling (Liu et al., 2022).

In [None]:
#| default_exp prompts

In [None]:
#| export
from __future__ import annotations
import numpy as np
from typing import List, Optional

from openness_classifier.core import OpennessCategory, ClassificationType
from openness_classifier.data import TrainingExample, EmbeddingModel

## kNN Example Selection

Select the k most semantically similar training examples for a given statement.

In [None]:
#| export
def select_knn_examples(
    statement: str,
    training_examples: List[TrainingExample],
    embedding_model: EmbeddingModel,
    k: int = 5,
    diversify: bool = True
) -> List[TrainingExample]:
    """Select k most similar training examples using kNN.
    
    Uses cosine similarity between sentence embeddings to find
    the most relevant examples for few-shot prompting.
    
    Args:
        statement: The statement to classify
        training_examples: Pool of training examples with embeddings
        embedding_model: Model for computing statement embedding
        k: Number of examples to select
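        diversify: If True, include the most similar example of each label
            where possible, then fill remaining slots by similarity
        
    Returns:
        Up to k training examples (fewer if the pool holds fewer than k).
        Examples lacking a cached embedding are encoded and updated in place.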
    """
    if not training_examples:
        return []
    
    # Ensure all examples have embeddings
    for ex in training_examples:
        if ex.embedding is None:
            ex.embedding = embedding_model.encode(ex.statement_text)
    
    # Compute embedding for input statement
    statement_embedding = embedding_model.encode(statement)
    
    # Compute similarities
    similarities = []
    for ex in training_examples:
        sim = _cosine_similarity(statement_embedding, ex.embedding)
        similarities.append((ex, sim))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    if diversify:
        # Select examples ensuring label diversity
        return _select_diverse_examples(similarities, k)
    else:
        # Just take top k
        return [ex for ex, _ in similarities[:k]]


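def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # avoid division by zero for all-zero embeddings
    return float(np.dot(a, b) / denom)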


def _select_diverse_examples(
    sorted_examples: List[tuple],
    k: int
) -> List[TrainingExample]:
    """Select examples ensuring label diversity.
    
    Tries to include at least one example from each category
    while still prioritizing similarity.
    """
    if k <= 0:
        return []

    selected = []
    selected_ids = set()  # track by identity; avoids == on embedding arrays
    seen_labels = set()
    
    # First pass: most similar example for each label (scans the whole pool)
    for ex, _ in sorted_examples:
        if ex.ground_truth not in seen_labels:
            selected.append(ex)
            selected_ids.add(id(ex))
            seen_labels.add(ex.ground_truth)
            if len(selected) >= k:
                return selected
    
    # Second pass: fill remaining slots with the most similar leftovers
    for ex, _ in sorted_examples:
        if id(ex) not in selected_ids:
            selected.append(ex)
            selected_ids.add(id(ex))
            if len(selected) >= k:
                break
    
    return selected
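
The per-example similarity loop above can also be written as a single matrix operation. The following is a minimal sketch, assuming the embeddings have already been stacked into a 2-D NumPy array; `top_k_cosine` and the toy vectors are hypothetical and not part of this module.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, pool: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows of `pool` most similar to `query` (cosine)."""
    # Normalise to unit length; the floor avoids division by zero for zero vectors.
    pool_unit = pool / np.maximum(np.linalg.norm(pool, axis=1, keepdims=True), 1e-12)
    query_unit = query / max(np.linalg.norm(query), 1e-12)
    sims = pool_unit @ query_unit
    # Stable sort on negated similarity keeps pool order among ties.
    return np.argsort(-sims, kind="stable")[:k]

# Toy 2-D "embeddings" (illustrative values only)
pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
print(top_k_cosine(query, pool, k=2))
```

For small training pools the explicit loop is just as serviceable; the vectorised form mainly helps once the pool grows to thousands of examples.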

## Completeness Indicators and Prompt Templates

The refined taxonomy (002-refine-classification-taxonomy) introduces:
1. **Completeness indicators** for distinguishing mostly_open from mostly_closed
2. **5-step chain-of-thought reasoning** for more accurate boundary classification
3. **Hard precedence rule (FR-004)** for substantial access barriers
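
To make the precedence ordering concrete, here is a rough sketch of the decision logic as plain Python. `apply_precedence` and its boolean inputs are hypothetical; the sketch also assumes that request-only access maps to `closed` and every other substantial barrier to `mostly_closed`, which simplifies the "mostly_closed or closed" wording used in the prompts.

```python
def apply_precedence(
    upon_request: bool,
    substantial_barrier: bool,
    fully_public: bool,
    high_completeness: bool,
) -> str:
    """Hypothetical helper: barriers are checked before completeness (FR-004)."""
    if upon_request:
        return "closed"            # request-only access is always closed
    if substantial_barrier:
        return "mostly_closed"     # assumption: other barriers map to mostly_closed
    if fully_public:
        return "open"
    # Only minor barriers remain, so completeness decides the outcome.
    return "mostly_open" if high_completeness else "mostly_closed"

# A data use agreement overrides otherwise complete materials.
print(apply_precedence(
    upon_request=False,
    substantial_barrier=True,
    fully_public=False,
    high_completeness=True,
))
```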

In [None]:
#| export
# Completeness indicators for mostly_open classification (FR-002, FR-003, refined per research.md)
MOSTLY_OPEN_DATA_COMPLETENESS = [
    "All",
    "Raw; Results; Source Data",
    "Raw; Results",
    "Raw",
    "Raw; Source Data",
]

MOSTLY_OPEN_CODE_COMPLETENESS = [
    "All",
    "Model",
    "Models",
    "Download; Process; Analysis; Figures",
    "Processing; Generate Results",
    "Processing; Results",
    "Models; Results",
    "Models; Analysis",
    "Analysis; Figures",
    "Model; Figures",
    "Results; Figures",
    "Generate Results; Figures",
]

# Substantial access barriers that force mostly_closed or closed (FR-004 hard precedence rule)
SUBSTANTIAL_BARRIERS = [
    "data use agreement",
    "confidentiality",
    "proprietary",
    "upon request",
    "contact author",
    "restricted access",
    "DUA",
    "confidential",
]


SYSTEM_PROMPT = """You are an expert research analyst specializing in evaluating data and code availability statements in scholarly publications. Your task is to classify the openness of availability statements using a refined 4-category taxonomy.

Classification Categories (from most open to least open):

1. **open**: Fully accessible with no restrictions
   - Data/code in public repository (Zenodo, Figshare, GitHub public)
   - No registration, login, or approval required
   - Open license (CC-BY, MIT, etc.)

2. **mostly_open**: Largely accessible with minor restrictions
   - HIGH COMPLETENESS: Most or all data/code types available
     * Data: All, Raw, Raw; Results, Raw; Results; Source Data
     * Code: All, Models, Download; Process; Analysis; Figures, or similar comprehensive combinations
   - MINOR BARRIERS ONLY: Free registration, institutional access, citation requirement
   - Can use non-persistent repository (GitHub) IF comprehensive materials provided

3. **mostly_closed**: Largely restricted with limited access
   - LOW/PARTIAL COMPLETENESS: Only some data/code types available (e.g., Results only, Processing scripts only)
   - SUBSTANTIAL BARRIERS: Data use agreements, confidentiality restrictions, proprietary terms
   - Available only through specific collaborations or agreements
   - Note: Substantial barriers ALWAYS force mostly_closed regardless of completeness

4. **closed**: Not accessible
   - "Available upon request" (regardless of how polite)
   - Confidential, proprietary, or restricted
   - No statement provided
   - Contact author for access

CRITICAL RULE (HARD PRECEDENCE - FR-004):
If substantial access barriers exist (data use agreements, proprietary terms, confidentiality restrictions, "available upon request"), the classification MUST be mostly_closed or closed, REGARDLESS of completeness or repository quality. This rule has absolute precedence.

IMPORTANT: "Available upon request" or "contact the authors" is ALWAYS classified as **closed**."""


DATA_CLASSIFICATION_TEMPLATE = """Classify the following DATA availability statement.

{few_shot_examples}

Now classify this statement:

Statement: {statement}

Think step-by-step through this 5-step reasoning process:

1. **Identify data types mentioned**: What types of data are available?
   - Look for: Raw data, Results, Source Data, Processed data, All data
   - HIGH completeness indicators: All, Raw; Results; Source Data, Raw; Results, Raw

2. **Assess completeness**: Does the statement indicate all necessary materials for reproduction or only partial materials?
   - All/most data types = HIGH completeness → favors mostly_open
   - Only some data types (e.g., "Results only") = LOW completeness → favors mostly_closed

3. **Identify access barriers**: What restrictions are mentioned?
   - MINOR barriers (allow mostly_open): Free registration, institutional access, citation required
   - SUBSTANTIAL barriers (force mostly_closed or closed): Data use agreement, confidentiality, proprietary, "upon request"

4. **Determine repository type** (if mentioned):
   - PERSISTENT (Zenodo, Figshare, Dryad, DOI): Adds confidence to classification
   - NON-PERSISTENT (GitHub, personal website): Acceptable IF high completeness

5. **Apply classification rules** (in this order):
   - FIRST CHECK: Substantial barrier present? → mostly_closed or closed (STOP, do not consider other factors)
   - "Upon request" or "contact author"? → closed
   - No barriers + fully public → open
   - No substantial barrier + HIGH completeness → mostly_open
   - No substantial barrier + LOW completeness → mostly_closed

Based on your analysis, provide:
- Classification: [open/mostly_open/mostly_closed/closed]
- Confidence: [0.0-1.0]
- Reasoning: [Your step-by-step analysis mentioning specific data types, barriers, and repository]"""


CODE_CLASSIFICATION_TEMPLATE = """Classify the following CODE availability statement.

{few_shot_examples}

Now classify this statement:

Statement: {statement}

Think step-by-step through this 5-step reasoning process:

1. **Identify code types mentioned**: What types of code are available?
   - Look for: Download scripts, Processing scripts, Analysis code, Figure generation, Models, All code
   - HIGH completeness indicators: All, Models, Download; Process; Analysis; Figures, Processing; Generate Results

2. **Assess completeness**: Does the statement indicate all necessary code for reproduction or only partial code?
   - All/most code types = HIGH completeness → favors mostly_open
   - Only some code types (e.g., "Processing scripts only") = LOW completeness → favors mostly_closed

3. **Identify access barriers**: What restrictions are mentioned?
   - MINOR barriers (allow mostly_open): Free registration, institutional access, citation required
   - SUBSTANTIAL barriers (force mostly_closed or closed): Proprietary code, confidential algorithms, "upon request"

4. **Determine repository type** (if mentioned):
   - PERSISTENT (Zenodo, Figshare with DOI): Adds confidence to classification
   - NON-PERSISTENT (GitHub, GitLab): Acceptable IF comprehensive code provided (all types)

5. **Apply classification rules** (in this order):
   - FIRST CHECK: Substantial barrier present? → mostly_closed or closed (STOP, do not consider other factors)
   - "Upon request" or "contact author"? → closed
   - No barriers + fully public → open
   - No substantial barrier + HIGH completeness → mostly_open
   - No substantial barrier + LOW completeness → mostly_closed

Based on your analysis, provide:
- Classification: [open/mostly_open/mostly_closed/closed]
- Confidence: [0.0-1.0]
- Reasoning: [Your step-by-step analysis mentioning specific code types, barriers, and repository]"""
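
The completeness lists store each combination as a `;`-separated string, so an exact string comparison would depend on order and capitalisation. How the lists are consumed is not shown in this module. If they are compared against annotated labels, one option is order-insensitive matching, sketched below with hypothetical helpers `normalize_types` and `is_high_completeness` and a copy of the data indicators.

```python
def normalize_types(label: str) -> frozenset:
    """Split a ';'-separated completeness label into a lower-cased set of types."""
    return frozenset(part.strip().lower() for part in label.split(";") if part.strip())

# Stand-in copy of the data completeness indicators
indicators = ["All", "Raw; Results; Source Data", "Raw; Results", "Raw", "Raw; Source Data"]
high_sets = {normalize_types(label) for label in indicators}

def is_high_completeness(label: str) -> bool:
    """True if the label names exactly one of the high-completeness combinations."""
    return normalize_types(label) in high_sets

print(is_high_completeness("results; raw"))  # same types, different order and case
print(is_high_completeness("Results"))       # results alone is not in the list
```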

## Prompt Construction

In [None]:
#| export
def build_few_shot_prompt(
    statement: str,
    statement_type: ClassificationType,
    examples: List[TrainingExample],
    include_reasoning: bool = True
) -> str:
    """Build a few-shot classification prompt.
    
    Args:
        statement: The statement to classify
        statement_type: DATA or CODE
        examples: Selected few-shot examples
        include_reasoning: Currently unused; both templates always include
            the CoT reasoning steps
        
    Returns:
        Complete prompt string
    """
    # Format examples
    example_strs = []
    for i, ex in enumerate(examples, 1):
        example_strs.append(
            f"Example {i}:\n"
            f"Statement: {ex.statement_text}\n"
            f"Classification: {ex.ground_truth.value}"
        )
    
    few_shot_block = "\n\n".join(example_strs)
    
    if few_shot_block:
        few_shot_block = f"Here are some examples:\n\n{few_shot_block}\n"
    
    # Select template
    if statement_type == ClassificationType.DATA:
        template = DATA_CLASSIFICATION_TEMPLATE
    else:
        template = CODE_CLASSIFICATION_TEMPLATE
    
    return template.format(
        few_shot_examples=few_shot_block,
        statement=statement
    )
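
A self-contained sketch of the same formatting pattern follows, using a hypothetical `DemoExample` stand-in and a shortened template rather than the module's own. It also illustrates that `str.format` does not re-parse substituted values, so braces inside a statement pass through unchanged.

```python
from dataclasses import dataclass

@dataclass
class DemoExample:  # hypothetical stand-in for TrainingExample
    statement_text: str
    label: str

# Shortened illustrative template, not the module's full prompt
TEMPLATE = "{few_shot_examples}\nStatement: {statement}\nClassification:"

def format_examples(examples) -> str:
    """Render numbered few-shot examples, or an empty string if there are none."""
    blocks = [
        f"Example {i}:\nStatement: {ex.statement_text}\nClassification: {ex.label}"
        for i, ex in enumerate(examples, 1)
    ]
    if not blocks:
        return ""
    return "Here are some examples:\n\n" + "\n\n".join(blocks) + "\n"

prompt = TEMPLATE.format(
    few_shot_examples=format_examples([DemoExample("Data are on Zenodo.", "open")]),
    statement="Config files use {curly} braces.",
)
print(prompt)
```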

In [None]:
#| export
def has_substantial_barrier(text: str) -> bool:
    """Check if text contains any substantial access barrier (FR-004).

    Args:
        text: Statement or reasoning text to check

    Returns:
        True if substantial barrier found, False otherwise
    """
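    import re
    text_lower = text.lower()
    # Match whole words (optionally plural) so short terms such as "DUA"
    # do not fire inside unrelated words like "individual".
    return any(
        re.search(rf"\b{re.escape(barrier.lower())}s?\b", text_lower)
        for barrier in SUBSTANTIAL_BARRIERS
    )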


def extract_completeness_attributes(
    reasoning: str,
    statement_type: ClassificationType
) -> dict:
    """Extract completeness attributes from LLM reasoning (FR-007, SC-004).

    Parses the 5-step reasoning to identify data/code types, barriers,
    and repository type. Used for validation and audit.

    Args:
        reasoning: The LLM's reasoning text
        statement_type: DATA or CODE

    Returns:
        Dictionary with extracted attributes:
        - types_mentioned: List of data/code types identified
        - completeness_level: 'high', 'low', or 'unknown' based on keywords
        - barriers_found: List of barriers mentioned
        - substantial_barrier: Boolean (FR-004)
        - repository_type: 'persistent', 'non_persistent', or 'unknown'
        - reasoning_quality: Score 0-1 based on attribute coverage
    """
    import re
    reasoning_lower = reasoning.lower()

    # Extract data/code types mentioned
    types_mentioned = []
    if statement_type == ClassificationType.DATA:
        type_patterns = ['raw', 'results', 'source data', 'processed', 'all data', 'raw data']
    else:
        type_patterns = ['download', 'processing', 'analysis', 'figures', 'models?', 'all code']

    for pattern in type_patterns:
        if re.search(pattern, reasoning_lower):
            types_mentioned.append(pattern.replace('?', ''))

    # Determine completeness level. Whole-word matching keeps 'most' from
    # firing on 'mostly_closed' and 'complete' on 'low completeness'.
    high_completeness_keywords = ['high completeness', 'comprehensive', 'all', 'complete', 'most']
    low_completeness_keywords = ['low completeness', 'partial', 'only', 'limited', 'some']

    def _mentions(keywords):
        return any(re.search(rf'\b{re.escape(kw)}\b', reasoning_lower) for kw in keywords)

    completeness_level = 'unknown'
    if _mentions(high_completeness_keywords):
        completeness_level = 'high'
    elif _mentions(low_completeness_keywords):
        completeness_level = 'low'

    # Extract barriers found
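    # Whole-word match so "DUA" is not reported for words like "individual"
    barriers_found = [
        b for b in SUBSTANTIAL_BARRIERS
        if re.search(rf"\b{re.escape(b.lower())}s?\b", reasoning_lower)
    ]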

    # Check for substantial barrier
    substantial_barrier = has_substantial_barrier(reasoning)

    # Determine repository type
    persistent_repos = ['zenodo', 'figshare', 'dryad', 'doi:']
    non_persistent_repos = ['github', 'gitlab', 'personal website', 'supplementary']

    repository_type = 'unknown'
    if any(repo in reasoning_lower for repo in persistent_repos):
        repository_type = 'persistent'
    elif any(repo in reasoning_lower for repo in non_persistent_repos):
        repository_type = 'non_persistent'

    # Calculate reasoning quality score (SC-004: 90% should mention completeness)
    quality_checks = [
        len(types_mentioned) > 0,  # Mentions specific types
        completeness_level != 'unknown',  # Assesses completeness
        repository_type != 'unknown' or substantial_barrier,  # Mentions repo or barrier
        'step' in reasoning_lower or bool(re.search(r'^\s*[1-5][.)]', reasoning, re.MULTILINE)),  # Shows numbered reasoning steps
    ]
    reasoning_quality = sum(quality_checks) / len(quality_checks)

    return {
        'types_mentioned': types_mentioned,
        'completeness_level': completeness_level,
        'barriers_found': barriers_found,
        'substantial_barrier': substantial_barrier,
        'repository_type': repository_type,
        'reasoning_quality': reasoning_quality,
    }


def parse_classification_response(response: str) -> tuple:
    """Parse LLM response to extract classification, confidence, and reasoning.

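    Completeness attributes are not extracted here; use
    `extract_completeness_attributes` for that (T007).
    Confidence defaults to 0.8 when the response omits it. If no category
    can be found, the result falls back to CLOSED with confidence 0.3.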

    Args:
        response: Raw LLM response text

    Returns:
        Tuple of (OpennessCategory, confidence_score, reasoning)
    """
    import re

    # Default values
    category = None
    confidence = 0.8
    reasoning = response

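    # Try to extract classification; tolerate markdown emphasis around the
    # label (e.g. "Classification: **mostly_open**")
    class_match = re.search(
        r'Classification:[\s*]*(mostly_open|mostly_closed|open|closed)\b',
        response,
        re.IGNORECASE
    )
    if class_match:
        category = OpennessCategory.from_string(class_match.group(1))
    else:
        # Fallback: first category token anywhere in the response. Whole-word
        # matching keeps 'open' from matching inside 'mostly_open'.
        fallback = re.search(
            r'\b(mostly_open|mostly_closed|open|closed)\b',
            response,
            re.IGNORECASE
        )
        if fallback:
            category = OpennessCategory.from_string(fallback.group(1))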

    # Try to extract confidence
    conf_match = re.search(r'Confidence:\s*([0-9.]+)', response, re.IGNORECASE)
    if conf_match:
        try:
            confidence = float(conf_match.group(1))
            confidence = max(0.0, min(1.0, confidence))  # Clamp to [0, 1]
        except ValueError:
            pass

    # Try to extract reasoning
    reason_match = re.search(r'Reasoning:\s*(.+?)(?=$|Classification:|Confidence:)',
                            response, re.IGNORECASE | re.DOTALL)
    if reason_match:
        reasoning = reason_match.group(1).strip()

    if category is None:
        # Default to closed if we can't parse
        category = OpennessCategory.CLOSED
        confidence = 0.3  # Low confidence for unparseable response
        reasoning = f"Could not parse response: {response[:200]}..."

    return category, confidence, reasoning
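
Keyword checks like these depend on how token boundaries are treated: a plain substring test for a short term such as "DUA" also fires inside "individual". The sketch below contrasts the two approaches on an illustrative subset of barrier terms; `substring_hits` and `word_boundary_hits` are hypothetical helpers, not functions of this module.

```python
import re

BARRIERS = ["data use agreement", "upon request", "DUA"]  # illustrative subset

def substring_hits(text: str) -> list:
    """Barrier terms found by case-insensitive substring search."""
    lowered = text.lower()
    return [b for b in BARRIERS if b.lower() in lowered]

def word_boundary_hits(text: str) -> list:
    """Barrier terms found as whole words or phrases."""
    return [
        b for b in BARRIERS
        if re.search(rf"\b{re.escape(b)}\b", text, re.IGNORECASE)
    ]

text = "Individual-level records are deposited in Zenodo."
print(substring_hits(text))      # reports "DUA" because of "indiviDUAl"
print(word_boundary_hits(text))  # reports nothing
```

Whole-word matching trades some recall (inflected forms need their own patterns) for fewer false positives, which matters when a single hit forces a `mostly_closed` or `closed` label.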

In [None]:
# Test prompt construction with refined taxonomy examples
from openness_classifier.core import OpennessCategory, ClassificationType

# Create mock examples demonstrating the refined taxonomy
class MockExample:
    def __init__(self, text, label):
        self.statement_text = text
        self.ground_truth = OpennessCategory.from_string(label)

# Examples representing boundary cases for the refined taxonomy
examples = [
    MockExample("All raw data and analysis code are available at https://zenodo.org/record/12345", "open"),
    MockExample("Raw; Results; Source Data available at Figshare with free registration", "mostly_open"),
    MockExample("Results available upon completion of data use agreement from ICPSR", "mostly_closed"),
    MockExample("Data available upon reasonable request from the corresponding author", "closed"),
]

# Test with a boundary case statement
prompt = build_few_shot_prompt(
    "Raw and processed data are deposited in Dryad with open access. Code is on GitHub.",
    ClassificationType.DATA,
    examples
)

print("=== Refined Taxonomy Prompt Test ===")
print(prompt[:800])
print("...")
print("\nPrompt construction test passed!")

In [None]:
# Test response parsing with 5-step reasoning
test_response = """Let me analyze this statement using the 5-step reasoning process.

1. **Identify data types**: The statement mentions raw data and processed data, indicating HIGH completeness.

2. **Assess completeness**: Raw and processed data are comprehensive - HIGH completeness.

3. **Identify access barriers**: "Open access" indicates NO substantial barriers.

4. **Repository type**: Dryad is a PERSISTENT repository with DOI.

5. **Apply classification rules**: No substantial barriers + high completeness = mostly_open

Classification: mostly_open
Confidence: 0.90
Reasoning: High completeness (raw and processed data) in persistent repository (Dryad) with open access makes this mostly_open."""

# Parse the response
category, confidence, reasoning = parse_classification_response(test_response)
print(f"Category: {category.value}")
print(f"Confidence: {confidence}")
print(f"Reasoning snippet: {reasoning[:100]}...")

# Test completeness attribute extraction
attributes = extract_completeness_attributes(test_response, ClassificationType.DATA)
print(f"\n=== Extracted Completeness Attributes ===")
print(f"Types mentioned: {attributes['types_mentioned']}")
print(f"Completeness level: {attributes['completeness_level']}")
print(f"Substantial barrier: {attributes['substantial_barrier']}")
print(f"Repository type: {attributes['repository_type']}")
print(f"Reasoning quality: {attributes['reasoning_quality']:.2f}")

# Test substantial barrier detection
assert has_substantial_barrier("Data available via data use agreement")
assert not has_substantial_barrier("Data available with registration")
print("\nAll tests passed!")

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()