# Sentence-Level Batching with `analyze_text`

This notebook shows how the sentence detection flow uses [pySBD](https://github.com/nipunsadvilkar/pySBD) to split long documents into sentences and batch them for OpenMed named-entity recognition. You will:

1. Load a Hugging Face token-classification model.
2. Run `analyze_text` with sentence detection enabled (the default).
3. Compare the same invocation with sentence detection disabled.
4. Inspect sentence-aware metadata attached to each entity prediction.

⚠️ **Prerequisites**: make sure you are inside the project virtual environment (`source .venv/bin/activate`) and have installed the runtime dependencies:

```bash
# Install the CPU build of torch from the PyTorch index only; the other packages come from PyPI
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cpu
uv pip install transformers==4.57.1 pysbd==0.3.4
```

In [5]:
import time
from pathlib import Path

from openmed import analyze_text, OpenMedConfig, ModelLoader

# Configure a CPU-only loader with a dedicated cache directory
config = OpenMedConfig(cache_dir=".hf_cache", device="cpu")
loader = ModelLoader(config)

MODEL_ID = "OpenMed/OpenMed-NER-OncologyDetect-SuperMedical-355M"

# long_text = Path("../../tests/fixtures/long_clinical_note.txt").read_text().strip()
long_text = Path("../../tests/fixtures/clinical_note.txt").read_text().strip()
# The document is used once here; raise REPEATS to simulate a larger real-world payload
REPEATS = 1
long_text = "\n\n".join([long_text] * REPEATS)
print(f"Characters in input: {len(long_text):,}")

Characters in input: 1,137


## Analyze with Sentence Detection (default)

The helper splits the document into sentences using pySBD, batches them, and preserves offsets in the returned entities.
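The split, batch, and preserve-offsets idea can be sketched without loading a model. The cell below is a minimal illustration, not OpenMed's implementation: a naive regex stands in for pySBD, and `split_with_offsets` and `batched` are hypothetical helper names introduced only for this example.

```python
import re
from typing import Iterator, List, Tuple

Span = Tuple[str, int, int]  # (sentence text, start offset, end offset)


def split_with_offsets(text: str) -> List[Span]:
    """Naive stand-in for pySBD: split after ., ! or ? and keep character offsets."""
    spans: List[Span] = []
    # Each match is a run of non-terminator characters plus any trailing terminators.
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Shift the start past leading whitespace so offsets index into `text`.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        spans.append((stripped, start, start + len(stripped)))
    return spans


def batched(spans: List[Span], batch_size: int) -> Iterator[List[Span]]:
    """Yield consecutive groups of at most `batch_size` sentences."""
    for i in range(0, len(spans), batch_size):
        yield spans[i : i + batch_size]


text = "Chest pain began today. He took aspirin. No fever."
spans = split_with_offsets(text)
for batch in batched(spans, batch_size=2):
    for sentence, start, end in batch:
        # Slicing the original text with the stored offsets recovers the sentence.
        assert text[start:end] == sentence
        print(start, end, sentence)
```

Because every sentence keeps its start offset in the original document, a prediction made inside a sentence can be mapped back to document coordinates by adding that offset.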

In [6]:
start = time.perf_counter()
result_sentence = analyze_text(
    long_text,
    model_name=MODEL_ID,
    loader=loader,
    config=config,
    output_format="dict",
    max_length=512,  # applied per sentence batch
    batch_size=16,
    confidence_threshold=0.8,
    sentence_detection=True,
    group_entities=True,
)
elapsed_sentence = time.perf_counter() - start

print(f"Sentence detection runtime: {elapsed_sentence:.2f}s")
print("Sentences processed:", result_sentence.metadata.get("sentence_count"))
print("Entities returned:", len(result_sentence.entities))

Device set to use cpu


Sentence detection runtime: 2.62s
Sentences processed: 21
Entities returned: 5


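Each `EntityPrediction` now carries sentence metadata: `sentence_index`, `sentence_text`, and the `sentence_start`/`sentence_end` character offsets of the enclosing sentence. The snippet below prints the text, label, sentence index, and sentence span for the first few entries.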

In [7]:
for entity in result_sentence.entities[:5]:
    print(
        entity.text,
        entity.label,
        entity.metadata["sentence_index"],
        f"({entity.metadata['sentence_start']}, {entity.metadata['sentence_end']})",
    )

⸻ Amino_acid 8 (283, 284)
⸻ Amino_acid 11 (340, 341)
aspirin Simple_chemical 18 (931, 1056)
aspirin Simple_chemical 18 (931, 1056)
nitroglycerin Simple_chemical 19 (1057, 1104)


In [8]:
result_sentence.entities

[EntityPrediction(text='⸻', label='Amino_acid', confidence=0.8921184738477071, start=283, end=284, metadata={'sentence_index': 8, 'sentence_text': '⸻', 'sentence_start': 283, 'sentence_end': 284}),
 EntityPrediction(text='⸻', label='Amino_acid', confidence=0.8956050674120585, start=340, end=341, metadata={'sentence_index': 11, 'sentence_text': '⸻', 'sentence_start': 340, 'sentence_end': 341}),
 EntityPrediction(text='aspirin', label='Simple_chemical', confidence=0.9147762656211853, start=986, end=993, metadata={'sentence_index': 18, 'sentence_text': 'He took his usual morning medications today, including aspirin 81 mg, but did not take any additional aspirin before arrival.', 'sentence_start': 931, 'sentence_end': 1056}),
 EntityPrediction(text='aspirin', label='Simple_chemical', confidence=0.8815539479255676, start=1033, end=1040, metadata={'sentence_index': 18, 'sentence_text': 'He took his usual morning medications today, including aspirin 81 mg, but did not take any additional aspi
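
The sentence metadata makes it straightforward to convert document-level offsets into offsets within the enclosing sentence, or to group predictions by sentence. The sketch below assumes only the attributes visible in the output above; `Entity` is a hypothetical stand-in for `EntityPrediction`, the helper names are invented for this example, and the values are copied from the two `aspirin` predictions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Entity:
    """Hypothetical stand-in exposing the attributes used in this notebook."""
    text: str
    label: str
    start: int
    end: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def offset_in_sentence(entity: Entity) -> Tuple[int, int]:
    """Convert document-level offsets to offsets within the enclosing sentence."""
    base = entity.metadata["sentence_start"]
    return entity.start - base, entity.end - base


def group_by_sentence(entities: List[Entity]) -> Dict[int, List[Entity]]:
    """Bucket entities by the index of the sentence they were found in."""
    grouped: Dict[int, List[Entity]] = defaultdict(list)
    for entity in entities:
        grouped[entity.metadata["sentence_index"]].append(entity)
    return dict(grouped)


sentence = (
    "He took his usual morning medications today, including aspirin 81 mg, "
    "but did not take any additional aspirin before arrival."
)
meta = {
    "sentence_index": 18,
    "sentence_text": sentence,
    "sentence_start": 931,
    "sentence_end": 1056,
}
entities = [
    Entity("aspirin", "Simple_chemical", 986, 993, dict(meta)),
    Entity("aspirin", "Simple_chemical", 1033, 1040, dict(meta)),
]

for entity in entities:
    rel_start, rel_end = offset_in_sentence(entity)
    # The relative slice of the sentence text should match the entity text.
    print(rel_start, rel_end, entity.metadata["sentence_text"][rel_start:rel_end])
```

Grouping by `sentence_index` is handy when you want to show each sentence once with all of its entities highlighted.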

## Analyze with Sentence Detection Disabled

For comparison, the cell below passes `sentence_detection=False`, so the input is not split with pySBD before inference. It runs on a single short sentence and stores its result under separate names so the earlier result stays available.

In [None]:
start = time.perf_counter()
result_no_sentence = analyze_text(
    "Sentence 069: Patient denies new neurological symptoms such as weakness or visual changes.",
    model_name=MODEL_ID,
    loader=loader,
    config=config,
    output_format="dict",
    max_length=512,
    batch_size=4,
    confidence_threshold=0.8,
    sentence_detection=False,
    group_entities=True,
)
elapsed_no_sentence = time.perf_counter() - start

print(f"Runtime without sentence detection: {elapsed_no_sentence:.2f}s")
print("Sentences processed:", result_no_sentence.metadata.get("sentence_count"))
print("Entities returned:", len(result_no_sentence.entities))
print("Entities:", result_no_sentence.entities)
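
To compare the two modes beyond runtime, one option is to reduce each run's predictions to `(start, end, label)` tuples and diff the sets. The sketch below is an assumption about how such a comparison could look; `entity_keys` and `compare_runs` are hypothetical helpers, and the sample predictions are made up for illustration rather than taken from model output.

```python
from types import SimpleNamespace


def entity_keys(entities):
    """Reduce predictions to comparable (start, end, label) tuples."""
    return {(e.start, e.end, e.label) for e in entities}


def compare_runs(with_sentences, without_sentences):
    """Report which spans both runs agree on and which appear in only one."""
    a = entity_keys(with_sentences)
    b = entity_keys(without_sentences)
    return {
        "shared": sorted(a & b),
        "only_with_sentences": sorted(a - b),
        "only_without_sentences": sorted(b - a),
    }


# Made-up predictions purely for illustration; they are not model output.
run_a = [
    SimpleNamespace(start=10, end=17, label="Simple_chemical"),
    SimpleNamespace(start=40, end=53, label="Simple_chemical"),
]
run_b = [SimpleNamespace(start=10, end=17, label="Simple_chemical")]

report = compare_runs(run_a, run_b)
for name, keys in report.items():
    print(name, keys)
```

The comparison is only meaningful when both runs analyze the same input text, because the keys are character offsets into that text.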