
refactor(ontology): primitives, LanguageTag, Transcription, derive_more#142

Merged: martsokha merged 7 commits into main from review/ontology-artifacts on May 18, 2026

@martsokha martsokha commented Apr 13, 2026

Summary

  • Rename math module to primitive — better name now that it contains non-math types like LanguageTag
  • Add oxilangtag dependency and typed LanguageTag newtype for BCP-47 language tags, used in Entity::language and Transcription::language
  • Rework Transcription to support diarization: replace the flat text: String with Vec<TranscriptSegment>, where each segment carries time_span, speaker_id, and confidence
  • Replace manual Display, From, Deref, DerefMut impls with derive_more across 9 ontology types
  • Populate artifact types: TextArtifacts (language, char_count), TabularArtifacts (row/col counts, sparse headers), RichArtifacts (add tabular)
  • Wire up OCR results storage in ImageArtifacts::ocr_pages during vision extraction
  • Add as_text/as_tabular accessors on ContentArtifacts
  • Fix 20 context type issues: UUID consistency, serde tag collision, naming, security, missing fields
  • Pattern types: extract RegexPattern and GlobPattern as dedicated types
  • Temporal: add TimeOfDay and DateTime variants
  • Credential security: skip serializing plaintext values, add CredentialKind enum
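
For context on the derive_more bullet above, this is roughly the boilerplate a newtype wrapper needs when written by hand, which derive_more's `Deref`, `DerefMut`, `From`, and `IntoIterator` derives can generate instead. The `Annotations` name comes from this PR; its `Vec<String>` payload is an assumption made purely for illustration.

```rust
use std::ops::{Deref, DerefMut};

/// Hypothetical stand-in for the ontology `Annotations` wrapper.
/// The real inner type is not shown in this PR; `Vec<String>` is assumed.
#[derive(Debug, Default, PartialEq)]
pub struct Annotations(Vec<String>);

// Hand-written impls of the kind the derives replace.
impl Deref for Annotations {
    type Target = Vec<String>;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

impl DerefMut for Annotations {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.0
    }
}

impl From<Vec<String>> for Annotations {
    fn from(inner: Vec<String>) -> Self {
        Self(inner)
    }
}

impl IntoIterator for Annotations {
    type Item = String;
    type IntoIter = std::vec::IntoIter<String>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}
```

With these impls in place, `Vec` methods such as `push` and `len` are callable directly on the wrapper through auto-deref.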

Test plan

  • cargo check --workspace --all-features
  • cargo test --workspace --all-features (469 tests pass)

🤖 Generated with Claude Code

… Transcription, derive_more cleanup

- Rename `math` module to `primitive` across workspace
- Add `oxilangtag` dependency, create typed `LanguageTag` newtype for BCP-47 tags
- Use `LanguageTag` in `Entity::language` and `Transcription::language`
- Rework `Transcription`: remove `text` field, add `TranscriptSegment` with
  `time_span`, `speaker_id`, `confidence` for diarization support
- Add `Transcription::text()` method to join segments
- Replace manual impls with derive_more across ontology types:
  - `Annotations`: Deref, DerefMut, From, IntoIterator
  - `ContentSource`: Display
  - `ContentArtifacts`: From
  - `Contexts`: Deref, DerefMut, From
  - `ContextEntryData`: From
  - `Policies`: Deref, DerefMut
  - `RedactionMap`: Deref, DerefMut
  - `GraphNodeKind`: Display, From

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
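
A minimal sketch of the segment-based `Transcription` described in the commit above. The names `TranscriptSegment`, `time_span`, `speaker_id`, `confidence`, and `text()` come from the commit message; the field types, the per-segment `text` field, the `segments` field name, and the single-space separator are assumptions. The `language` field is omitted to keep the sketch free of external crates.

```rust
/// Hypothetical shape of one diarized segment.
#[derive(Debug, Clone, PartialEq)]
pub struct TranscriptSegment {
    /// Text spoken in this segment (assumed field).
    pub text: String,
    /// Start and end offsets in seconds (assumed representation).
    pub time_span: (f64, f64),
    /// Diarization label; `None` when the speaker is unknown.
    pub speaker_id: Option<String>,
    /// Recognizer confidence, if the backend reports one.
    pub confidence: Option<f32>,
}

/// Hypothetical transcription made of ordered segments.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Transcription {
    pub segments: Vec<TranscriptSegment>,
}

impl Transcription {
    /// Joins all segment texts with a single space, in segment order.
    pub fn text(&self) -> String {
        self.segments
            .iter()
            .map(|segment| segment.text.as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Keeping the joined text as a method rather than a stored field means it cannot drift out of sync with the segments.
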
@martsokha martsokha self-assigned this Apr 13, 2026
@martsokha martsokha added the ontology (nvisy-ontology: entities, policies, context) and refactor (code restructuring without behavior change) labels Apr 13, 2026
martsokha and others added 6 commits April 13, 2026 17:13
- TextArtifacts: add language and char_count fields
- TabularArtifacts: add row_count, column_count, sparse headers (ColumnHeader)
- RichArtifacts: add tabular field alongside text and image
- ContentArtifacts: add as_text/as_text_mut and as_tabular/as_tabular_mut accessors
- Vision extraction: store OCR results in ImageArtifacts::ocr_pages instead of discarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
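
The accessor pattern named in the commit above could look roughly like this. Only the accessor names and the `language`, `char_count`, `row_count`, and `column_count` fields come from the commit message; the two-variant enum layout and the `Option<String>` language type are assumptions, and the sparse headers are left out.

```rust
#[derive(Debug, Default, PartialEq)]
pub struct TextArtifacts {
    /// Assumed to be a plain string here to avoid external crates.
    pub language: Option<String>,
    pub char_count: usize,
}

#[derive(Debug, Default, PartialEq)]
pub struct TabularArtifacts {
    pub row_count: usize,
    pub column_count: usize,
}

/// Hypothetical variant layout; the real enum may have more variants.
#[derive(Debug, PartialEq)]
pub enum ContentArtifacts {
    Text(TextArtifacts),
    Tabular(TabularArtifacts),
}

impl ContentArtifacts {
    /// Returns the text artifacts if this is the `Text` variant.
    pub fn as_text(&self) -> Option<&TextArtifacts> {
        match self {
            Self::Text(text) => Some(text),
            _ => None,
        }
    }

    /// Mutable counterpart of `as_text`.
    pub fn as_text_mut(&mut self) -> Option<&mut TextArtifacts> {
        match self {
            Self::Text(text) => Some(text),
            _ => None,
        }
    }

    /// Returns the tabular artifacts if this is the `Tabular` variant.
    pub fn as_tabular(&self) -> Option<&TabularArtifacts> {
        match self {
            Self::Tabular(tabular) => Some(tabular),
            _ => None,
        }
    }

    /// Mutable counterpart of `as_tabular`.
    pub fn as_tabular_mut(&mut self) -> Option<&mut TabularArtifacts> {
        match self {
            Self::Tabular(tabular) => Some(tabular),
            _ => None,
        }
    }
}
```

Returning `Option` lets callers chain `map` or `if let` without matching on every variant themselves.
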
- Fix UUID version: ContextEntry now uses now_v7 (was new_v4)
- EmbeddingData: Vec<f64> → Vec<f32>, remove redundant dimensions field,
  rename model → algorithm for consistency with FaceData/VoiceData
- PatternExpression: rename serde tag "kind" → "syntax" to avoid
  collision with AnalyticVariant's "kind" tag when flattened
- SignatureData: add missing algorithm field
- AddressData: rename region → state to avoid GeospatialVariant confusion
- GeoShape::Circle: centre → center (American English consistency)
- GeoShape::Polygon: polygon → boundary (avoid redundant naming)
- ReferenceVariant::Object → Image (match wrapped ImageData type)
- CredentialData: skip_serializing on value to prevent plaintext leaks,
  add CredentialKind enum replacing untyped credential_type string
- TextData: add language field (Option<LanguageTag>)
- TemporalVariant: add TimeSpan variant with TimeSpanData

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
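
One way to read the `EmbeddingData` change in the commit above: once the vector is stored, a separate dimensions field is redundant because the length already carries it. In this sketch, `algorithm` and the `Vec<f32>` element type come from the commit message; the `vector` field name and the `dimensions()` helper are hypothetical.

```rust
/// Hypothetical shape of `EmbeddingData` after the change.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingData {
    /// Name of the embedding algorithm (renamed from `model` in the PR).
    pub algorithm: String,
    /// Embedding values; the field name is an assumption.
    pub vector: Vec<f32>,
}

impl EmbeddingData {
    /// Derives the dimension from the vector itself, so it cannot
    /// disagree with the stored data the way a separate field could.
    pub fn dimensions(&self) -> usize {
        self.vector.len()
    }
}
```

Switching from `f64` to `f32` also halves the storage per component.
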
- PatternExpression: extract RegexPattern and GlobPattern as dedicated types
- ImageData: remove untyped format field
- TemporalVariant: add TimeOfDay(TimeOfDayData) and DateTime(DateTimeData)
  variants using jiff::civil::Time and jiff::civil::DateTime

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ontology context cleanup:
- Remove format hints from biometric/document/temporal types (ContentSource
  already carries file format via extension)
- Rename temporal::TimeOfDayData → TimeData (file: time_of_day.rs → time.rs)
- Rename existing time.rs → timespan.rs (contains TimeSpanData)
- TemporalVariant: Time(TimeData), TimeSpan(TimeSpanData) layout
- PatternExpression: extract RegexPattern and GlobPattern as dedicated structs
- CredentialData: add #[serde(default)] on value so round-trip yields ""
- TextData: derive Default, add new() and with_language() constructors;
  TextEntry: add new() constructor

Tests:
- Add 19 serde round-trip tests covering PatternExpression (incl. tag-
  collision regression), CredentialData (secret redaction + roundtrip),
  TextData, TimeData, DateTimeData, TimeSpanData, TabularArtifacts

rig 0.33 → 0.37:
- Workspace dep: rig (umbrella crate, version 0.37) — re-exports rig-core
- Adapt to API changes:
  - Completion::completion now generic over Into<Message>; use typed
    Vec::<Message>::new() for empty chat history
  - StructuredOutputError::PromptError now wraps Box<PromptError>; deref
    before From conversion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
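
The two adaptations listed under the rig upgrade above follow general Rust patterns, sketched here with stand-in types. Nothing below is rig's real API: `Message`, `completion`, `PromptError`, `StructuredOutputError`, and `AppError` are simplified, hypothetical versions defined locally so that the example compiles on its own.

```rust
/// Stand-in for a chat message type; not rig's real `Message`.
#[derive(Debug, Clone, PartialEq)]
pub struct Message(pub String);

impl From<&str> for Message {
    fn from(text: &str) -> Self {
        Message(text.to_owned())
    }
}

/// Stand-in for an API that is generic over the history element type.
pub fn completion<M: Into<Message>>(prompt: &str, history: Vec<M>) -> Vec<Message> {
    let mut all: Vec<Message> = history.into_iter().map(Into::into).collect();
    all.push(Message::from(prompt));
    all
}

/// `completion(prompt, vec![])` would not compile, because `M` cannot be
/// inferred from an empty vector; spelling out the element type fixes it.
pub fn first_turn(prompt: &str) -> Vec<Message> {
    completion(prompt, Vec::<Message>::new())
}

/// Stand-in error types mirroring the boxed-variant layout.
#[derive(Debug, PartialEq)]
pub struct PromptError(pub String);

#[derive(Debug, PartialEq)]
pub enum StructuredOutputError {
    PromptError(Box<PromptError>),
}

/// Hypothetical application error with a `From<PromptError>` impl.
#[derive(Debug, PartialEq)]
pub enum AppError {
    Prompt(String),
}

impl From<PromptError> for AppError {
    fn from(err: PromptError) -> Self {
        AppError::Prompt(err.0)
    }
}

pub fn convert(err: StructuredOutputError) -> AppError {
    match err {
        // `From` exists for `PromptError`, not `Box<PromptError>`, so the
        // box is dereferenced (moving the value out) before converting.
        StructuredOutputError::PromptError(boxed) => AppError::from(*boxed),
    }
}
```
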
- Pin Rust toolchain to 1.95.0 in rust-toolchain.toml and Cargo.toml
- Pin CI toolchain via RUST_TOOLCHAIN env var (defined once per workflow)
  using dtolnay/rust-toolchain@master with explicit toolchain input
- Fix clippy errors:
  - sort_by with reverse → sort_by_key + Reverse (codec text/tabular)
  - needless_borrows on AsRef bounds (content_data sha256 test)
  - unused imports across engine tests after Entity::test_builder refactor
  - unused TabularLocation import in span_size tests
  - unused value parameter in annotation test_entity (prefix with _)
- Add #![allow(dead_code)] to engine tests/fixtures/mod.rs (shared helpers
  appear unused from individual test files)
- Replace 80+ inline std:: paths with top-of-file imports across 47 files:
  fmt::{Display,Debug,Formatter}, cmp::{Ordering,Reverse}, time::Duration,
  fs, mem, path, io, str, env, future, ops::Deref, slice, sync::Arc, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
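
The `sort_by` to `sort_by_key` + `Reverse` fix in the commit above can be illustrated as follows. The `Span` type and both function names are hypothetical; the assumption is that the flagged code sorted in descending order by swapping the comparator's operands, which is the shape clippy's `unnecessary_sort_by` lint reports.

```rust
use std::cmp::Reverse;

/// Hypothetical span type used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    pub fn width(&self) -> usize {
        self.end - self.start
    }
}

/// Older form: descending order via a comparator with swapped operands.
pub fn sort_widest_first_old(spans: &mut [Span]) {
    spans.sort_by(|a, b| b.width().cmp(&a.width()));
}

/// Replacement form: the key is wrapped in `Reverse` to invert the order.
pub fn sort_widest_first(spans: &mut [Span]) {
    spans.sort_by_key(|span| Reverse(span.width()));
}
```

Both functions use a stable sort, so spans of equal width keep their original relative order in either version.
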
Rustdoc fixes (18 files, ~30 broken links resolved):
- Correct stale paths: crate::graph::* → crate::workflow::* (ontology
  workflow files, 13 references)
- Method rename: PatternEngine::scan_text → scan_entities (pattern engine
  helpers + doc example)
- Error variant moves: Error::Request / Error::Validation → ErrorKind::*
  (provider STT, TTS, OCR backend)
- Self-method links: [`acquire_resources`] → Self::acquire_resources,
  [`maybe_compact`] → Self::maybe_compact
- External crate URLs: [`tempfile`], [`lopdf`], [`scraper`] → docs.rs URLs
- Cross-crate unreachables (private items / circular dep targets) rewritten
  as backticks or prose: extraction/detection/deduplication/redaction
  submodules, Pipeline, ExecutionPlan, RawMatch, CompositeKey, Operation,
  KeyProvider
- Header bullet that rustdoc parsed as a ref def: `[`Engine`]: …` → em-dash
- Macro files: add ref defs for [`Handler`], [`AudioHandler`],
  [`ImageHandler`], [`Span`], [`DocumentType`]

All ref defs are at the bottom of their docblocks.

Verified:
- `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --all-features --no-deps`
  passes (only unrelated nvisy_server filename collision warning)
- `cargo clippy --workspace --all-features --all-targets -- -D warnings`
  clean
- `cargo +nightly fmt --all` applied across the workspace

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
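
The self-method link fix in the commit above can be sketched like this: the bare link text stays in the prose, and a reference definition at the bottom of the docblock points it at `Self`. The method names `acquire_resources` and `maybe_compact` come from the commit message; the `Engine` type here, its field, and the method bodies are hypothetical.

```rust
/// Hypothetical type used only to illustrate the link form.
#[derive(Debug, Default)]
pub struct Engine {
    resources: Vec<String>,
}

impl Engine {
    /// Records the named resource as held.
    pub fn acquire_resources(&mut self, name: &str) {
        self.resources.push(name.to_owned());
    }

    /// Removes consecutive duplicate resources and returns how many
    /// entries were dropped.
    ///
    /// Intended to run after [`acquire_resources`]. The bare link does
    /// not resolve by itself from a method docblock, so the reference
    /// definition below maps it to the method on `Self`.
    ///
    /// [`acquire_resources`]: Self::acquire_resources
    pub fn maybe_compact(&mut self) -> usize {
        let before = self.resources.len();
        self.resources.dedup();
        before - self.resources.len()
    }
}
```
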
@martsokha martsokha merged commit 9355f16 into main May 18, 2026
5 checks passed
@martsokha martsokha deleted the review/ontology-artifacts branch May 18, 2026 20:21