# Openness Classifier

> Few-shot LLM-based classification of data and code openness in scholarly publications.

This library provides tools for automatically classifying the openness of data and code availability statements in scholarly publications. It uses few-shot learning with large language models (LLMs) to assign each statement to one of four ordinal categories.

## Research Context

Data and code sharing is essential for reproducible research, but assessing compliance at scale requires automated tools. This classifier enables:

- **Systematic Reviews**: Automatically classify hundreds of publications
- **FAIR Compliance Assessment**: Evaluate data/code openness across a corpus
- **Meta-Research**: Study trends in data sharing practices

The classifier is validated against human-coded examples from published research on data sharing practices.

## Classification Taxonomy

Statements are classified into four categories:

| Category | Description | Example |
|----------|-------------|----------|
| **open** | Fully accessible, no restrictions | "Data deposited in Zenodo under CC-BY" |
| **mostly_open** | Accessible with minor restrictions | "Data in institutional repository (free registration required)" |
| **mostly_closed** | Access limited by substantial restrictions | "Data available under data use agreement" |
| **closed** | Not accessible | "Data available upon reasonable request" |

**Note**: "Available upon request" is always classified as **closed**, regardless of wording.
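
Because the taxonomy is ordinal, categories can be compared by rank, for example to count statements that are at least `mostly_open`. The sketch below is an illustration only: `OpennessLevel` and `is_at_least` are hypothetical names, not the library's own types. Only the four category strings come from the table above.

```python
from enum import Enum


class OpennessLevel(Enum):
    """Hypothetical ordinal enum; values mirror the four category names."""

    CLOSED = "closed"
    MOSTLY_CLOSED = "mostly_closed"
    MOSTLY_OPEN = "mostly_open"
    OPEN = "open"

    @property
    def rank(self) -> int:
        # Rank follows definition order: closed=0 ... open=3.
        return list(OpennessLevel).index(self)


def is_at_least(level: OpennessLevel, threshold: OpennessLevel) -> bool:
    """Return True if `level` is as open as, or more open than, `threshold`."""
    return level.rank >= threshold.rank


print(is_at_least(OpennessLevel("mostly_open"), OpennessLevel.MOSTLY_CLOSED))
```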

## Installation

```bash
# Clone the repository
git clone https://github.com/kcaylor/open_sesame.git
cd open_sesame

# Install with pixi (recommended)
pixi install

# Or with pip
pip install -e .
```

## Configuration

Create a `.env` file with your LLM provider credentials. Set exactly one provider; the alternatives are shown commented out:

```bash
# Copy the example file
cp .env.example .env

# Then edit .env and keep one provider block.
# For Claude:
LLM_PROVIDER=claude
ANTHROPIC_API_KEY=sk-ant-your-key-here

# For OpenAI:
# LLM_PROVIDER=openai
# OPENAI_API_KEY=sk-your-key-here

# For Ollama (local, no API key needed):
# LLM_PROVIDER=ollama
# OLLAMA_BASE_URL=http://localhost:11434
```
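
Before a long run it can help to confirm that the selected provider has the variables it needs. The following is a minimal sketch, not the library's own configuration loader: `check_provider_settings` is a hypothetical helper, and treating `OLLAMA_BASE_URL` as required is an assumption. The variable names are the ones in the example `.env`.

```python
import os

# Variables each provider needs, taken from the example .env.
# Treating OLLAMA_BASE_URL as required is an assumption.
REQUIRED_VARS = {
    "claude": ["ANTHROPIC_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "ollama": ["OLLAMA_BASE_URL"],
}


def check_provider_settings(env=None):
    """Hypothetical helper: return (provider, list of missing variable names)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").lower()
    if provider not in REQUIRED_VARS:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
    missing = [name for name in REQUIRED_VARS[provider] if not env.get(name)]
    return provider, missing


# A plain dict stands in for the process environment in this example.
provider, missing = check_provider_settings(
    {"LLM_PROVIDER": "ollama", "OLLAMA_BASE_URL": "http://localhost:11434"}
)
print(provider, missing)
```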

## Quick Start

```python
# Import the classifier
from openness_classifier import classify_statement, classify_publication

# Classify a single data availability statement
result = classify_statement(
    "All data are available at https://zenodo.org/record/12345",
    statement_type="data"
)

print(f"Category: {result.category.value}")  # 'open'
print(f"Confidence: {result.confidence_score:.2f}")
print(f"Reasoning: {result.reasoning}")
```

```python
# Classify both data and code for a publication
data_result, code_result = classify_publication(
    data_statement="Data deposited in Figshare at doi:10.6084/m9.figshare.12345",
    code_statement="Code available upon request from the corresponding author",
    publication_id="doi:10.1234/example"
)

print(f"Data: {data_result.category.value}")  # 'open'
print(f"Code: {code_result.category.value}")  # 'closed'
```

## Batch Processing

Process multiple publications from a CSV file:

```python
from openness_classifier import classify_csv

# Process a CSV file
job = classify_csv(
    input_path="publications.csv",
    output_path="publications_classified.csv",
    id_column="doi",
    data_statement_column="data_statement",
    code_statement_column="code_statement",
    progress_callback=lambda p, t: print(f"Progress: {p}/{t}")
)

print(job.summary())
```
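
The call above reads an ID column and two statement columns by name. As a sketch of what a matching input file might look like, the snippet below writes a one-row CSV with Python's `csv` module. The column names match the arguments above; the row content and the temporary path are illustrative, and any further requirements on the input file are not covered here.

```python
import csv
import os
import tempfile

FIELDNAMES = ["doi", "data_statement", "code_statement"]

# Illustrative row; replace with your own publications.
rows = [
    {
        "doi": "doi:10.1234/example",
        "data_statement": "Data deposited in Zenodo under CC-BY",
        "code_statement": "Code available upon request from the corresponding author",
    },
]

# Written to a temporary directory here; use your own path in practice.
input_path = os.path.join(tempfile.mkdtemp(), "publications.csv")
with open(input_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Read the header back to confirm the column names.
with open(input_path, newline="", encoding="utf-8") as f:
    header = next(csv.reader(f))
print(header)
```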

## Validation and Metrics

Evaluate model performance against human-coded ground truth:

```python
from openness_classifier import validate_classifications, cross_validate
from openness_classifier.data import load_training_data, train_test_split

# Load and split data
data_examples, code_examples = load_training_data("data/articles_reviewed.csv")
train, test = train_test_split(data_examples, test_size=0.2)

# Run validation
# NOTE: `classifier` must already be defined; constructing it is not shown
# in this README. See the API documentation.
result = validate_classifications(test, classifier)

# Print metrics
print(result.to_markdown())
```
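
Which metrics the validation report contains is not listed here. As a rough illustration of what comparing predictions with human codes involves, the sketch below computes exact-match accuracy and a confusion matrix in plain Python. The helper names and the labels are made up for the example and are not results from this classifier.

```python
from collections import Counter

CATEGORIES = ["closed", "mostly_closed", "mostly_open", "open"]


def confusion_matrix(truth, predicted):
    """Rows are human-coded labels, columns are predicted labels."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in CATEGORIES] for t in CATEGORIES]


def accuracy(truth, predicted):
    """Fraction of statements where the prediction matches the human code."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


# Made-up labels for illustration only.
truth = ["open", "closed", "mostly_open", "closed"]
predicted = ["open", "closed", "open", "mostly_closed"]

print(accuracy(truth, predicted))  # 0.5
for row in confusion_matrix(truth, predicted):
    print(row)
```

Off-diagonal cells next to the diagonal are near misses on the ordinal scale, which is why a confusion matrix is often more informative here than accuracy alone.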

## Visualization

Generate publication-quality figures:

```python
from openness_classifier.visualization import plot_confusion_matrix, plot_validation_results

# Plot confusion matrix
# `result` is the object returned by validate_classifications
# (see "Validation and Metrics").
fig = plot_confusion_matrix(
    result.confusion_matrices['data'],
    title='Data Availability Classification',
    save_path='figures/confusion_matrix.png'
)
```

## API Reference

### Core Functions

- `classify_statement(statement, statement_type, config=None)` - Classify a single statement
- `classify_publication(data_statement, code_statement, publication_id)` - Classify both data and code
- `classify_csv(input_path, output_path, ...)` - Batch process a CSV file

### Validation

- `validate_classifications(test_examples, classifier)` - Evaluate on test set
- `cross_validate(examples, config, n_folds=5)` - k-fold cross-validation
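
For readers unfamiliar with k-fold cross-validation, the idea is to split the examples into `n_folds` groups and hold out each group once as the test set. The sketch below shows only the index bookkeeping. `k_fold_indices` is a hypothetical helper, and whether `cross_validate` shuffles or stratifies is not stated here.

```python
def k_fold_indices(n_examples, n_folds=5):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    if not 2 <= n_folds <= n_examples:
        raise ValueError("n_folds must be between 2 and n_examples")
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(7, n_folds=3):
    print(test)
```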

### Data Management

- `load_training_data(path)` - Load training examples from CSV
- `load_config(config_path=None)` - Load configuration from env/file

See the [API documentation](classifier_api.md) for full details.

## Supported LLM Providers

| Provider | Model Examples | Notes |
|----------|----------------|-------|
| **Claude** | claude-3-5-sonnet-20241022, claude-3-haiku | Recommended for best accuracy |
| **OpenAI** | gpt-4-turbo, gpt-4o, gpt-3.5-turbo | Good accuracy, widely available |
| **Ollama** | llama3:8b, mistral:7b | Local, free, good for development |

## Reproducibility

All classifications are logged with full metadata for reproducibility:

- LLM provider, model, and parameters
- Few-shot examples used
- Timestamp and confidence scores
- Chain-of-thought reasoning

Logs are stored in JSON Lines format in the `logs/` directory.
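
JSON Lines stores one JSON object per line, so logs can be appended to and read back record by record. The sketch below illustrates the format only: the field names, values, and file name are hypothetical and do not describe the library's actual log schema.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical record; field names are illustrative, not the library's schema.
record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "provider": "ollama",
    "model": "llama3:8b",
    "parameters": {"temperature": 0.0},
    "few_shot_example_ids": ["ex-01", "ex-02"],
    "category": "closed",
    "confidence_score": 0.9,
    "reasoning": "Statement offers data only upon request.",
}

# Temporary location for the example; real logs live under logs/.
log_path = os.path.join(tempfile.mkdtemp(), "classifications.jsonl")

# Append one JSON object per line.
with open(log_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

# Read the log back, one record per non-empty line.
with open(log_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), records[0]["category"])
```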

## Citation

If you use this tool in your research, please cite:

```bibtex
@software{openness_classifier,
  author = {Caylor, Kelly},
  title = {Openness Classifier: Few-shot LLM Classification of Data/Code Availability},
  year = {2026},
  url = {https://github.com/kcaylor/open_sesame}
}
```

## License

MIT License - see [LICENSE](LICENSE) for details.