# Document Features

This notebook demonstrates the main features of the `Document` class:
- Loading documents from CSV
- Inspecting segments and statistics
- Sentencizing documents (splitting multi-sentence rows, merging cross-row sentences)
- Exporting documents to CSV and plain text

In [1]:
from locisimiles.document import Document



## 1. Loading a Document

Load a CSV file with `seg_id` and `text` columns.

In [2]:
doc = Document("./hieronymus_samples.csv", author="Hieronymus")
print(doc)
print(f"Number of segments: {len(doc)}")

Document('hieronymus_samples.csv', segments=11, author='Hieronymus', meta={})
Number of segments: 11
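
For readers without the sample file, the expected input layout can be sketched with the standard library alone. The snippet below parses a small in-memory CSV with a header row and `seg_id` and `text` columns. The comma delimiter, the helper name `read_segments`, and the two sample rows are assumptions for illustration; this is not a description of how `Document` parses files internally.

```python
import csv
import io

# Hypothetical sample in the assumed layout: header row, comma-delimited.
SAMPLE_CSV = (
    "seg_id,text\n"
    'hier. adv. iovin. 1.41,"O decus Italiae, uirgo!"\n'
    "hier. adv. rufin. 1.5,Sic pater ille Deum faciat: sic magnus Iesus.\n"
)


def read_segments(handle):
    """Return (seg_id, text) pairs from a CSV with seg_id and text columns."""
    reader = csv.DictReader(handle)
    return [(row["seg_id"], row["text"]) for row in reader]


segments = read_segments(io.StringIO(SAMPLE_CSV))
for seg_id, text in segments:
    print(f"{seg_id}: {text}")
```

Texts that contain commas, as most of these Latin passages do, need to be quoted in the file, which is why the first sample row wraps its text in double quotes.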


## 2. Inspecting Segments

Iterate over segments and view their IDs and text.

In [3]:
for seg in doc:
    print(f"{seg.id}: {seg.text[:80]}..." if len(seg.text) > 80 else f"{seg.id}: {seg.text}")

hier. adv. iovin. 1.1: Furiosas Apollinis uates legimus; et illud Uirgilianum: Dat sine mente sonum.
hier. adv. iovin. 1.41: O decus Italiae, uirgo!
hier. adv. iovin. 2.36: Uirgilianum consilium est: Coniugium uocat, hoc praetexit nomine culpam.
hier. adv. pelag. 1.23: Hoc totum dico, quod non omnia possumus omnes: rarusque aut nullus est diuitum, ...
hier. adv. pelag. 3.11: Numquam hodie effugies, ueniam quocumque uocaris.
hier. adv. pelag. 3.4: Quod si paululum se remiserit, quomodo qui aduerso flumine lembum trahit, si rem...
hier. adv. rufin. 1.17: Non tu in triuiis, indocte, solebas, stridenti miserum stipula disperdere carmen...
hier. adv. rufin. 1.5: Sic pater ille Deum faciat: sic magnus Iesus.
hier. adv. rufin. 1.6: Obiiciunt mihi sectatores eius, cerealiaque arma expediunt fessi rerum, quare li...
hier. adv. rufin. 3.28: Miror quomodo oblitus sis illos uersiculos ponere: unde tremor terris, qua ui ma...
hier. adv. rufin. 3.29: Quod si in his uersiculis, quae de natura rerum s

## 3. Statistics

Get descriptive statistics about segment lengths and word counts.

In [4]:
stats = doc.statistics()
for key, value in stats.items():
    print(f"{key:>25}: {value}")

             num_segments: 11
              total_chars: 1339
              total_words: 195
    avg_chars_per_segment: 121.73
    avg_words_per_segment: 17.73
        min_segment_chars: 23
        max_segment_chars: 290
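
The averages above agree with the totals divided by the segment count and rounded to two decimals (1339 / 11 ≈ 121.73). As a rough sketch of how such figures can be derived from a list of segment texts, the helper below reproduces the same keys. The function name `segment_statistics` and the whitespace-based word count are assumptions; the library may count words differently.

```python
def segment_statistics(texts):
    """Compute length statistics for a non-empty list of segment strings."""
    char_counts = [len(t) for t in texts]
    # Assumption: a word is any whitespace-delimited token.
    word_counts = [len(t.split()) for t in texts]
    n = len(texts)
    return {
        "num_segments": n,
        "total_chars": sum(char_counts),
        "total_words": sum(word_counts),
        "avg_chars_per_segment": round(sum(char_counts) / n, 2),
        "avg_words_per_segment": round(sum(word_counts) / n, 2),
        "min_segment_chars": min(char_counts),
        "max_segment_chars": max(char_counts),
    }


example = segment_statistics([
    "O decus Italiae, uirgo!",
    "Sic pater ille Deum faciat: sic magnus Iesus.",
])
for key, value in example.items():
    print(f"{key:>25}: {value}")
```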


## 4. Sentencization

Re-segment the document so each segment contains exactly one sentence.
This handles:
- Rows with **multiple sentences** → split into separate segments
- Sentences **spanning multiple rows** → merged into a single segment

In [5]:
print(f"Before sentencize: {len(doc)} segments")
doc.sentencize()
print(f"After sentencize:  {len(doc)} segments")

Before sentencize: 11 segments
After sentencize:  15 segments
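
To make the splitting half of this step concrete, here is a minimal sketch that breaks each row into sentences and appends a running index to the original ID. It is not the library's algorithm: the boundary rule (`.`, `!`, `?` or `;` followed by whitespace) and the helper name `split_rows` are assumptions, and merging sentences that span several rows is left out entirely. The `.1`, `.2` suffixes mirror the IDs printed by the next cell.

```python
import re

# Assumption: a sentence ends at ., !, ? or ; followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?;])\s+")


def split_rows(rows):
    """Split each (seg_id, text) row into sentences with suffixed IDs."""
    result = []
    for seg_id, text in rows:
        sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
        for index, sentence in enumerate(sentences, start=1):
            result.append((f"{seg_id}.{index}", sentence))
    return result


rows = [
    ("hier. adv. iovin. 1.1",
     "Furiosas Apollinis uates legimus; et illud Uirgilianum: Dat sine mente sonum."),
    ("hier. adv. iovin. 1.41", "O decus Italiae, uirgo!"),
]
split_result = split_rows(rows)
for seg_id, sentence in split_result:
    print(f"{seg_id}: {sentence}")
```

A purely punctuation-based rule like this one will mis-split abbreviations and quoted speech, so it is only meant to illustrate the ID scheme and the split step.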


In [6]:
for seg in doc:
    print(f"{seg.id}: {seg.text[:80]}..." if len(seg.text) > 80 else f"{seg.id}: {seg.text}")

hier. adv. iovin. 1.1.1: Furiosas Apollinis uates legimus;
hier. adv. iovin. 1.1.2: et illud Uirgilianum: Dat sine mente sonum.
hier. adv. iovin. 1.41.1: O decus Italiae, uirgo!
hier. adv. iovin. 2.36.1: Uirgilianum consilium est: Coniugium uocat, hoc praetexit nomine culpam.
hier. adv. pelag. 1.23.1: Hoc totum dico, quod non omnia possumus omnes: rarusque aut nullus est diuitum, ...
hier. adv. pelag. 3.11.1: Numquam hodie effugies, ueniam quocumque uocaris.
hier. adv. pelag. 3.4.1: Quod si paululum se remiserit, quomodo qui aduerso flumine lembum trahit, si rem...
hier. adv. pelag. 3.4.2: sic humana conditio, si paululum se remiserit, discit fragilitatem suam, et mult...
hier. adv. rufin. 1.17.1: Non tu in triuiis, indocte, solebas, stridenti miserum stipula disperdere carmen...
hier. adv. rufin. 1.5.1: Sic pater ille Deum faciat: sic magnus Iesus.
hier. adv. rufin. 1.6.1: Obiiciunt mihi sectatores eius, cerealiaque arma expediunt fessi rerum, quare li...
hier. adv. rufin. 3.28.1: Mir

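### Statistics After Sentencization

The segment count rises to 15 while the word total stays at 195. The character total drops from 1339 to 1335, which would be consistent with the whitespace between newly split sentences being discarded.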

In [7]:
stats_after = doc.statistics()
for key, value in stats_after.items():
    print(f"{key:>25}: {value}")

             num_segments: 15
              total_chars: 1335
              total_words: 195
    avg_chars_per_segment: 89.0
    avg_words_per_segment: 13.0
        min_segment_chars: 23
        max_segment_chars: 166


## 5. Exporting

Save the sentencized document to CSV or plain text.

In [8]:
# Export as CSV (preserves seg_id and text columns)
csv_path = doc.save_csv("./hieronymus_sentencized.csv")
print(f"Saved CSV to: {csv_path}")

# Export as plain text (one segment per line)
txt_path = doc.save_plain("./hieronymus_sentencized.txt")
print(f"Saved TXT to: {txt_path}")

Saved CSV to: hieronymus_sentencized.csv
Saved TXT to: hieronymus_sentencized.txt
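
The two output layouts can be illustrated with the standard library. The sketch below writes a list of `(seg_id, text)` pairs once as a two-column CSV and once as plain text with one segment per line. The helper names `write_segments_csv` and `write_segments_plain`, the UTF-8 encoding, and the temporary file names are assumptions and say nothing about how `save_csv` or `save_plain` are implemented.

```python
import csv
import tempfile
from pathlib import Path


def write_segments_csv(segments, path):
    """Write (seg_id, text) pairs as a CSV with a seg_id,text header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["seg_id", "text"])
        writer.writerows(segments)
    return Path(path)


def write_segments_plain(segments, path):
    """Write only the segment texts, one per line (IDs are dropped)."""
    body = "".join(f"{text}\n" for _, text in segments)
    Path(path).write_text(body, encoding="utf-8")
    return Path(path)


sample_segments = [
    ("hier. adv. iovin. 1.41.1", "O decus Italiae, uirgo!"),
    ("hier. adv. rufin. 1.5.1", "Sic pater ille Deum faciat: sic magnus Iesus."),
]

with tempfile.TemporaryDirectory() as tmp:
    csv_file = write_segments_csv(sample_segments, Path(tmp) / "sample.csv")
    txt_file = write_segments_plain(sample_segments, Path(tmp) / "sample.txt")
    csv_lines = csv_file.read_text(encoding="utf-8").splitlines()
    txt_lines = txt_file.read_text(encoding="utf-8").splitlines()

print(csv_lines)
print(txt_lines)
```

In this sketch the plain-text form drops the segment IDs, so only the CSV form can be reloaded with its IDs intact.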


### Verify the Roundtrip

Reload the exported CSV and confirm that its segment count and segment texts match the sentencized document.

In [9]:
reloaded = Document("./hieronymus_sentencized.csv", author="Hieronymus")
print(f"Reloaded segments: {len(reloaded)}")
assert len(reloaded) == len(doc)
assert [s.text for s in reloaded] == [s.text for s in doc]
print("Roundtrip OK - reloaded document matches.")

Reloaded segments: 15
Roundtrip OK - reloaded document matches.
