# Tutorial: Single Publication Classification

This tutorial demonstrates how to classify individual data and code availability statements using the openness classifier.

## Prerequisites

1. Install dependencies: `pixi install`
2. Configure your LLM provider in `.env`
3. Have the training data (`articles_reviewed.csv`) available; it is only needed for section 6
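
Before running the cells below, it can help to confirm that `.env` defines the settings you expect. The following is a minimal sketch that parses `KEY=value` lines with the standard library only. The key names `LLM_PROVIDER` and `LLM_API_KEY` are hypothetical placeholders, not names confirmed by the classifier; substitute whatever keys your provider configuration actually requires.

```python
def read_env_keys(text: str) -> set[str]:
    """Collect the variable names defined in .env-style text."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_keys(env_text: str, required: set[str]) -> set[str]:
    """Return the required names that the .env text does not define."""
    return required - read_env_keys(env_text)


# Hypothetical file contents and key names, for illustration only
sample = "# provider settings\nLLM_PROVIDER=example\n"
missing = missing_keys(sample, {"LLM_PROVIDER", "LLM_API_KEY"})
print(f"Missing keys: {sorted(missing)}")
```

To check a real file, pass its contents instead of `sample`, for example `open(".env").read()`.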

In [1]:
# Imports
from openness_classifier import classify_statement, classify_publication
from openness_classifier.core import ClassificationType, OpennessCategory
from openness_classifier.config import load_config

## 1. Classify a Data Availability Statement

Let's start with a simple example: a clear data availability statement.

In [None]:
# A clear "open" data statement
statement = "All data are available at https://zenodo.org/record/12345 under CC-BY 4.0 license."

result = classify_statement(statement, statement_type="data")

print(f"Statement: {statement}")
print(f"Category: {result.category.value}")
print(f"Confidence: {result.confidence_score:.2f}")
print(f"Reasoning: {result.reasoning}")

## 2. Examples of Each Category

Let's see how different types of statements are classified.

In [None]:
# Example statements for each category
examples = [
    # Expected: OPEN
    ("Data deposited in Figshare at doi:10.6084/m9.figshare.12345", "data"),
    
    # Expected: MOSTLY_OPEN
    ("Data available through the NEON data portal (free registration required)", "data"),
    
    # Expected: MOSTLY_CLOSED
    ("Data available under data use agreement from the consortium", "data"),
    
    # Expected: CLOSED
    ("Data available upon reasonable request from the corresponding author", "data"),
]

print("Classification Results:\n")
for statement, stmt_type in examples:
    result = classify_statement(statement, statement_type=stmt_type)
    print(f"Statement: {statement[:60]}...")
    print(f"  -> {result.category.value} (confidence: {result.confidence_score:.2f})")
    print()
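
The `# Expected:` comments above can be turned into an explicit check by pairing each statement with its expected label and listing the disagreements. The sketch below assumes labels are compared as lowercase strings such as `"mostly_open"`; the actual strings returned by `result.category.value` may differ. `keyword_stub` is a hypothetical stand-in that lets the example run without an LLM provider and is not how the classifier works.

```python
from typing import Callable, Iterable


def find_mismatches(
    labelled: Iterable[tuple[str, str]],
    classify: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (statement, expected, predicted) for every disagreement."""
    mismatches = []
    for statement, expected in labelled:
        predicted = classify(statement)
        if predicted != expected:
            mismatches.append((statement, expected, predicted))
    return mismatches


def keyword_stub(statement: str) -> str:
    """Hypothetical keyword rule standing in for the real classifier."""
    text = statement.lower()
    if "request" in text:
        return "closed"
    if "agreement" in text:
        return "mostly_closed"
    if "registration" in text:
        return "mostly_open"
    return "open"


# Same statements as the cell above; label strings are assumed
labelled_examples = [
    ("Data deposited in Figshare at doi:10.6084/m9.figshare.12345", "open"),
    ("Data available through the NEON data portal (free registration required)", "mostly_open"),
    ("Data available under data use agreement from the consortium", "mostly_closed"),
    ("Data available upon reasonable request from the corresponding author", "closed"),
]

mismatches = find_mismatches(labelled_examples, keyword_stub)
print(f"{len(mismatches)} mismatches")
```

With the package installed, the stub could be swapped for a wrapper such as `lambda s: classify_statement(s, statement_type="data").category.value`, provided the expected labels match the category values the package defines.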

## 3. Classify Code Availability

The classifier also works for code availability statements.

In [None]:
code_examples = [
    "Code available at https://github.com/author/repo under MIT license",
    "Analysis scripts included in the supplementary materials",
    "Custom code available from the authors upon request",
]

print("Code Classification Results:\n")
for statement in code_examples:
    result = classify_statement(statement, statement_type="code")
    print(f"Statement: {statement}")
    print(f"  -> {result.category.value} (confidence: {result.confidence_score:.2f})")
    print()

## 4. Classify a Complete Publication

`classify_publication()` classifies a publication's data and code statements in one call and returns one result for each.

In [None]:
data_result, code_result = classify_publication(
    data_statement="Raw data set files are available at https://doi.org/10.5281/zenodo.4323531.",
    code_statement="Simulation code is available at https://github.com/author/repo.",
    publication_id="doi:10.1021/example"
)

print("Publication Classification:")
print(f"  Data: {data_result.category.value} (confidence: {data_result.confidence_score:.2f})")
print(f"  Code: {code_result.category.value} (confidence: {code_result.confidence_score:.2f})")

## 5. Access Chain-of-Thought Reasoning

The classifier exposes its reasoning in `result.reasoning`, which makes it easier to audit how an ambiguous statement was categorized.

In [None]:
# A more ambiguous statement
ambiguous = "Data are available from the corresponding author with a material transfer agreement."

result = classify_statement(ambiguous, "data", return_reasoning=True)

print(f"Statement: {ambiguous}")
print(f"\nCategory: {result.category.value}")
print(f"Confidence: {result.confidence_score:.2f}")
print(f"\nReasoning:\n{result.reasoning}")
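
One way to use the confidence score is to route uncertain classifications to a human reviewer. The sketch below is an illustration under stated assumptions: `StubResult` is a hypothetical stand-in exposing only the attributes used in this tutorial, the 0.7 threshold is an arbitrary choice rather than a recommended value, and the scores are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for a classification result."""
    category: str
    confidence_score: float
    reasoning: str = ""


def needs_review(result, threshold: float = 0.7) -> bool:
    """True when confidence falls below the (assumed) review threshold."""
    return result.confidence_score < threshold


# Invented scores, for illustration only
results = [
    StubResult("open", 0.95),
    StubResult("mostly_closed", 0.55, "A material transfer agreement restricts access."),
]

flagged = [r for r in results if needs_review(r)]
for r in flagged:
    print(f"Review: {r.category} ({r.confidence_score:.2f}) - {r.reasoning}")
```

Because `needs_review` only reads `confidence_score`, it should work unchanged on any object that exposes that attribute.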

## 6. Using Real Examples from Training Data

Let's test the classifier against human-coded examples from `articles_reviewed.csv`.

In [None]:
import random
from pathlib import Path

from openness_classifier.data import load_training_data

# Load the human-coded examples
data_path = Path('../../resources/abpoll-open-b71bd12/data/processed/articles_reviewed.csv')
if data_path.exists():
    data_examples, _ = load_training_data(data_path)

    print(f"Loaded {len(data_examples)} training examples")

    # Fixed seed so the same sample is drawn on every run
    random.seed(42)
    test_samples = random.sample(data_examples, min(5, len(data_examples)))
    print(f"\nTesting with {len(test_samples)} random examples:\n")

    correct = 0
    for ex in test_samples:
        result = classify_statement(ex.statement_text, "data")
        is_match = result.category == ex.ground_truth
        if is_match:
            correct += 1
        match = "✓" if is_match else "✗"

        print(f"Statement: {ex.statement_text[:80]}...")
        print(f"  Ground truth: {ex.ground_truth.value}")
        print(f"  Predicted: {result.category.value} ({result.confidence_score:.2f}) {match}")
        print()

    # Guard against an empty sample to avoid dividing by zero
    if test_samples:
        print(f"Accuracy on sample: {correct}/{len(test_samples)} = {100*correct/len(test_samples):.0f}%")
else:
    print(f"Training data not found at {data_path}")
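
A single accuracy figure hides which categories get confused with each other. The sketch below tallies (ground truth, predicted) pairs with `collections.Counter` so the individual confusions stay visible. The label strings and the sample pairs are invented for illustration and are not results from the classifier.

```python
from collections import Counter
from typing import Iterable, Optional


def summarize(pairs: Iterable[tuple[str, str]]) -> tuple[Optional[float], Counter]:
    """Return (accuracy, confusion counts) for (ground_truth, predicted) pairs."""
    pairs = list(pairs)
    confusion = Counter(pairs)
    correct = sum(n for (truth, pred), n in confusion.items() if truth == pred)
    # Accuracy is undefined for an empty sample
    accuracy = correct / len(pairs) if pairs else None
    return accuracy, confusion


# Invented pairs, for illustration only
pairs = [
    ("open", "open"),
    ("closed", "closed"),
    ("closed", "mostly_closed"),
    ("mostly_open", "mostly_open"),
]

accuracy, confusion = summarize(pairs)
print(f"Accuracy: {accuracy:.2f}")
for (truth, pred), n in sorted(confusion.items()):
    if truth != pred:
        print(f"  {truth} predicted as {pred}: {n}")
```

In the loop above, the pairs could be collected as `(ex.ground_truth.value, result.category.value)` for each sampled example.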

## Summary

This tutorial demonstrated:

1. **Basic classification**: Use `classify_statement()` for single statements
2. **Category examples**: Open, mostly open, mostly closed, and closed
3. **Code vs data**: Both statement types are supported
4. **Publication classification**: Classify both at once with `classify_publication()`
5. **Reasoning access**: Chain-of-thought explanations for transparency
6. **Real data testing**: Validate against human-coded examples

Next steps:
- See `02_batch_processing.ipynb` for processing CSV files
- See `03_validation_analysis.ipynb` for model evaluation