feat(search): add Tika object detection captions and labels#12072

Closed
paul43210 wants to merge 17 commits into owncloud:master from paul43210:feature/object-detection-search

Conversation

@paul43210
Contributor

@paul43210 paul43210 commented Mar 1, 2026

Problem

oCIS can extract rich metadata from images via Apache Tika, including object detection labels (Inception V3) and image captions (Show and Tell model). However, this metadata was not being captured, indexed, or exposed through the search API — meaning users had no way to search for photos by their visual content (e.g., finding all photos containing dogs or beach scenes).

This builds on the photo EXIF metadata work in #11912, extending search capabilities from camera/technical metadata to AI-generated visual content descriptions.

Summary

  • Extract OBJECT and CAPTION metadata from the configured content extractor
  • Index object labels (lowercaseKeyword analyzer for exact-match terms like "Labrador retriever") and captions (fulltext analyzer for word-level search on sentences)
  • Add objectLabel: and objectCaption: KQL field prefixes for querying (e.g., objectCaption:dog)
  • Expose results as oc:object-labels and oc:object-captions WebDAV properties in search responses
  • Add object_labels and object_captions fields to the protobuf Entity message
  • Regenerate protobuf rawDesc to include fields 20/21, fixing silent data loss during gRPC serialization
  • Use a low confidence threshold (0.0001) to match the log-probability-scale confidence scores these vision models emit
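
The threshold change above can be sketched as follows. This is a simplified, self-contained illustration, not the actual oCIS extraction code; in particular, the `"label (score)"` value layout is an assumption about how each detection is serialised, and `parseDetection`/`filterDetections` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minConfidence matches the 0.0001 threshold chosen in this PR for the
// log-probability-scale scores the vision models emit.
const minConfidence = 0.0001

// parseDetection splits a value like "labrador retriever (0.0012)" into
// a label and a confidence score. The "label (score)" layout is an
// assumption for this sketch, not Tika's documented output format.
func parseDetection(v string) (label string, score float64, ok bool) {
	open := strings.LastIndex(v, "(")
	if open < 0 || !strings.HasSuffix(v, ")") {
		return "", 0, false
	}
	label = strings.TrimSpace(v[:open])
	score, err := strconv.ParseFloat(strings.TrimSuffix(v[open+1:], ")"), 64)
	if err != nil {
		return "", 0, false
	}
	return label, score, true
}

// filterDetections keeps only labels at or above the threshold; with the
// old 0.10 cutoff every entry in the example below would be dropped.
func filterDetections(values []string) []string {
	var out []string
	for _, v := range values {
		if label, score, ok := parseDetection(v); ok && score >= minConfidence {
			out = append(out, label)
		}
	}
	return out
}

func main() {
	got := filterDetections([]string{
		"labrador retriever (0.0012)",
		"noise (0.00001)", // below 0.0001, dropped
	})
	fmt.Println(got) // [labrador retriever]
}
```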

Extractor Compatibility Note

The ObjectCaptions and ObjectLabels indexed fields are populated by metadata from the configured content extractor via the standard extraction pipeline. These fields are extractor-agnostic by design.

Current status:

  • Apache Tika 3.x supports AI captions and object labels via ObjectRecognitionParser in the tika-parser-advancedmedia-package.
  • Apache Tika 4.x has removed this module (TIKA-4500), as the underlying DL4J/TensorFlow libraries were considered aged.
  • Tika 3.x is supported until June 2026. Tika 4.0.0 has not been released yet.

The oCIS-side indexed fields support the broader ecosystem of content analysis tools. Any extractor that populates CAPTION or object label metadata will automatically have its results indexed and searchable through the ObjectCaptions and ObjectLabels fields. The SEARCH_EXTRACTOR_TYPE configuration makes this pluggable.

Test plan

  • Content extraction tests pass (17/17) — including 3 new tests for low-confidence parsing and threshold filtering
  • Engine tests pass (27/27)
  • Query/bleve compiler tests pass — including new objectLabel/objectCaption KQL mapping tests
  • Deploy and verify objectCaption:dog returns results for photos containing dogs
  • Verify objectLabel:retriever returns results for photos with matching labels
  • Rebuild Bleve index after deploy so existing documents pick up the fulltext analyzer
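
The `objectLabel:`/`objectCaption:` prefix handling exercised above can be pictured as a prefix-to-field lookup. This is a minimal sketch of the idea, not the real oCIS query/bleve compiler; `kqlField` and `compileTerm` are hypothetical names, and only the field names come from the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// kqlField maps the new KQL prefixes onto the indexed field names
// described in this PR.
var kqlField = map[string]string{
	"objectLabel":   "ObjectLabels",
	"objectCaption": "ObjectCaptions",
}

// compileTerm splits a "prefix:value" query term and resolves the
// index field the term should be matched against.
func compileTerm(q string) (field, term string, ok bool) {
	prefix, value, found := strings.Cut(q, ":")
	if !found {
		return "", "", false
	}
	field, ok = kqlField[prefix]
	return field, value, ok
}

func main() {
	field, term, ok := compileTerm("objectCaption:dog")
	fmt.Println(field, term, ok) // ObjectCaptions dog true
}
```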

Note

After deploying a binary with this change, the Bleve search index must be rebuilt for the caption analyzer change to take effect on existing documents. New documents indexed after deployment will use the fulltext analyzer automatically.

🤖 Generated with Claude Code

paul43210 and others added 13 commits January 15, 2026 22:35
This change enables photo gallery applications to access EXIF metadata
through the search API.

Changes:
- Add KQL query support for photo fields:
  - photo.takenDateTime (datetime range queries)
  - photo.cameraMake, photo.cameraModel (keyword queries)
  - photo.fNumber, photo.focalLength, photo.iso (numeric queries)
- Configure Bleve index to store photo fields (Store=true)
- Map photo metadata from Bleve hits to Entity.Photo protobuf
- Add oc:photo-* properties to WebDAV search responses
- Add unit tests for photo field queries

The photo metadata is extracted by Tika during indexing and stored
in the Bleve search index. This enables queries like:
  photo.takenDateTime:[2023-01-01 TO 2023-12-31]
  photo.cameraMake:Canon

Signed-off-by: Paul Faure <paul@faure.ca>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add location field mappings with Store=true in Bleve index
- Rename properties to oc:photo-location-* for consistency
- Add altitude support alongside latitude/longitude
- Requires index rebuild to take effect
- Add exposure numerator/denominator to search results
- Add altitude extraction from Tika GPS metadata
- All photo EXIF fields now fully exposed via WebDAV search
Extract photo and location property building into separate helper
functions (appendPhotoProps, appendLocationProps) to reduce cognitive
complexity from 41 to under 15.

This addresses SonarCloud issue go:S3776.
Extract repetitive nil/zero checks into reusable helper functions:
- appendStringProp: handles *string with nil and empty checks
- appendFloat32Prop: handles *float32 with nil and zero checks
- appendFloat64Prop: handles *float64 with nil check
- appendInt32Prop: handles *int32 with nil and zero checks

This reduces appendPhotoProps complexity from 18 to under 15 and
appendLocationProps complexity as well.

Addresses SonarCloud issue go:S3776.
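
The helper pattern described in this commit looks roughly like the sketch below. It is an illustration of the nil/empty-check extraction under assumed types, not the actual oCIS code; the `prop` struct is a hypothetical stand-in for the real WebDAV property type.

```go
package main

import "fmt"

// prop is a simplified stand-in for a WebDAV property (name/value pair);
// the real type in oCIS is different.
type prop struct{ Name, Value string }

// appendStringProp appends a property only when the pointer is non-nil
// and the value is non-empty, so each call site no longer needs its own
// nil/empty branch -- the complexity reduction this commit describes.
func appendStringProp(props []prop, name string, v *string) []prop {
	if v == nil || *v == "" {
		return props
	}
	return append(props, prop{Name: name, Value: *v})
}

func main() {
	cameraMake := "Canon"
	var cameraModel *string // nil: property is simply omitted
	props := appendStringProp(nil, "oc:photo-camera-make", &cameraMake)
	props = appendStringProp(props, "oc:photo-camera-model", cameraModel)
	fmt.Println(len(props)) // 1
}
```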
The context parameter was not being used in the function.
Also removed from multistatusResponse and removed unused context import.

Addresses SonarCloud issue go:S1172.
- Add comment explaining why appendFloat64Prop doesn't check for zero
  (GPS coordinates 0.0 are valid for equator/prime meridian)
- Fix changelog to list only implemented fields (removed exposureTime,
  exposureBias, flash, meteringMode, whiteBalance; added exposureNumerator,
  exposureDenominator)
Extract OBJECT/CAPTION metadata from Tika's ObjectRecognitionParser,
index them in Bleve, make them queryable via KQL (objectLabel:/objectCaption:),
and expose them as oc:object-labels/oc:object-captions WebDAV properties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…r for captions

Tika's Show and Tell / Inception V3 models produce log-probability
confidence scores in the 0.0001–0.002 range, far below the previous
0.10 threshold which silently discarded all results. Lower threshold
to 0.0001 for both labels and captions.

Switch ObjectCaptions Bleve analyzer from lowercaseKeyword to fulltext
so that word-level search (e.g. objectCaption:dog) works on multi-word
caption strings.

Add test coverage for low-confidence extraction and threshold filtering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paul43210 and others added 3 commits March 1, 2026 23:30
…awDesc

The ObjectLabels and ObjectCaptions fields were added to the Entity Go
struct but the protobuf rawDesc (binary file descriptor) was not
regenerated. protobuf-go v2 uses rawDesc for serialization, so these
fields were silently dropped during gRPC communication between the
search and webdav services — causing search to find matching documents
but return empty metadata in the WebDAV XML response.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…oval (TIKA-4500)

ObjectCaptions and ObjectLabels fields are extractor-agnostic by design.
Tika 3.x ObjectRecognitionParser is supported until June 2026.
Tika 4.x removed advancedmedia module but fields remain available
for any content extractor that provides caption/label metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tika returns ISO as `exif:IsoSpeedRatings` (standard EXIF tag) but the
code only checked `Base ISO` (Nikon-specific MakerNote tag). Check the
standard key first, falling back to `Base ISO` for Nikon cameras.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Loop over candidate metadata keys instead of duplicating the parse/set
block in each branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
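
The candidate-key loop from the two commits above can be sketched like this. It is a simplified illustration (a plain string map instead of Tika's metadata object, and `isoFromMetadata` is a hypothetical name); the two key strings are taken from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// isoFromMetadata returns the first parseable ISO value among candidate
// metadata keys: the standard EXIF tag first, then the Nikon-specific
// MakerNote fallback, replacing the duplicated parse/set branches.
func isoFromMetadata(meta map[string]string) (int32, bool) {
	for _, key := range []string{"exif:IsoSpeedRatings", "Base ISO"} {
		if raw, ok := meta[key]; ok {
			if v, err := strconv.ParseInt(raw, 10, 32); err == nil {
				return int32(v), true
			}
		}
	}
	return 0, false
}

func main() {
	// Nikon-style metadata without the standard key still resolves.
	iso, ok := isoFromMetadata(map[string]string{"Base ISO": "400"})
	fmt.Println(iso, ok) // 400 true
}
```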
@paul43210
Contributor Author

Closing this PR. With Tika 4.x expected this summer (which removes the advancedmedia module including ObjectRecognitionParser per TIKA-4500), the object detection caption/label fields added here would have limited value. The ISO key mapping fix and extractor-agnostic groundwork can be revisited if/when an alternative vision pipeline is integrated.

@paul43210 paul43210 closed this Mar 4, 2026
@sonarqubecloud

sonarqubecloud Bot commented Mar 4, 2026
