[FEATURE RFC] PPL Search Result Highlighting #5156

@RyanL1997

PPL Search Result Highlighting — Design

Author: Jialiang Liang
Date: 2026-02-17


1. Scope

This design covers search result highlighting for PPL queries executed through the
Calcite engine. There are two distinct user stories:

User Story 1: OSD Explore users

When a user switches to PPL mode in Explore and runs search source=logs "error",
matching terms should be highlighted in the results table — the same experience they get
with DSL today. The user does not configure highlighting; OSD handles it automatically
using its existing getHighlightRequest() function.

User Story 2: API / CLI users

API consumers sending PPL queries via POST /_plugins/_ppl may want to highlight
specific fields with custom tags (e.g., <em> for HTML). They need full control over
which fields are highlighted, what tags are used, and the fragment size.

Both stories, one mechanism

Both are served by the same API: an optional highlight object in the PPL request body.
OSD and API users construct the config differently, but the backend treats them
identically — it is a pure pass-through. When no highlight is provided, behavior is
unchanged (backward compatible).

Out of scope

  • Porting the V2 highlight() PPL function to the Calcite engine
  • Customizable highlight settings in OSD UI (e.g., per-field configuration from Explore)

2. Highlighting Behavior

PPL highlighting is fully aligned with the existing DSL highlight feature — the same
OpenSearch highlighting engine, the same rules for what gets highlighted, and the same
rendering in OSD Explore.

When do highlights appear?

Only when the PPL query contains a full-text search — e.g.,
search source=logs "error" or search source=logs "connection timeout". The search
term is translated to a query_string query, and OpenSearch's highlighter identifies
where that term appears in the document.

If the query contains only structured filters (e.g., search source=logs | where status = 200),
no highlights are produced — even if the highlight config is present in the request.

What gets highlighted?

The matching search terms inside text and keyword field values. For example,
searching "Holmes" highlights the term Holmes wherever it appears — in firstname,
lastname, address, or any other string field that contains the match.

What does NOT get highlighted?

  • Non-string fields: Numeric, date, boolean, and other non-text field types never
    produce highlight fragments, even if the document matched because of those fields.
  • Structured filters: PPL commands like where, stats, or conditions using
    comparison operators (>, <, =) do not produce highlights. These are translated
    to range/term/bool filters in the DSL, which OpenSearch's highlighter does not act on.

What about piped commands?

Piped commands narrow the result set but do not affect which terms are highlighted.
For example:

search source=logs "error" | where status > 400

The where status > 400 filters rows, but only the "error" full-text search produces
highlights. This is the same behavior as a DSL bool query with a query_string in
must and a range in filter — the filter narrows results without contributing to
highlights.
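The DSL shape this comparison refers to can be sketched as below. It illustrates the equivalent hand-written DSL and is not necessarily the exact request the Calcite engine generates; `build_dsl` is a hypothetical helper introduced only for this example.

```python
def build_dsl(search_term, range_field, gt_value):
    """Return a DSL body with a full-text clause in must and a range in filter."""
    return {
        "query": {
            "bool": {
                # Full-text clause: the only part the highlighter acts on.
                "must": [{"query_string": {"query": search_term}}],
                # Structured filter: narrows results, contributes no highlights.
                "filter": [{"range": {range_field: {"gt": gt_value}}}],
            }
        }
    }


# Rough equivalent of: search source=logs "error" | where status > 400
dsl = build_dsl("error", "status", 400)
```

Only the `must` clause carries a search term, which is why the `where` condition never shows up in highlight fragments.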

What does it look like?

Matching terms in field values are wrapped in configurable tags:

  • OSD Explore: Tags are OSD internal markers (@opensearch-dashboards-highlighted-field@)
    that OSD renders as bold/colored text in the results table — identical to DSL behavior.
  • API/CLI: Users choose their own tags (e.g., <em>, <mark>, or custom markers)
    and see them in the JSON response.
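As a rough illustration of how a client might turn tagged fragments into markup, the sketch below escapes the field value and then swaps the marker tags for `<mark>` elements. The function name `render_fragment` and the choice of `<mark>` as the output element are assumptions made for this example, not a description of OSD's renderer.

```python
import html

OSD_PRE = "@opensearch-dashboards-highlighted-field@"
OSD_POST = "@/opensearch-dashboards-highlighted-field@"


def render_fragment(fragment, pre=OSD_PRE, post=OSD_POST):
    """Escape the raw field value, then replace the marker tags with <mark>."""
    escaped = html.escape(fragment)
    # The tags are escaped the same way so they still match after escaping.
    escaped = escaped.replace(html.escape(pre), "<mark>")
    return escaped.replace(html.escape(post), "</mark>")


rendered = render_fragment("880 " + OSD_PRE + "Holmes" + OSD_POST + " Lane")
```

Escaping before substitution keeps any markup that happens to be in the field value from being interpreted as HTML.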

3. API Design

Principle: caller-driven, backend pass-through

  • The caller (OSD, API client, CLI) controls highlighting by providing a highlight
    object in the PPL request body
  • The backend forwards the config as-is to OpenSearch and returns highlight data in
    the response
  • When no highlight is provided, no highlighting is applied

This is consistent with how DSL works: OSD injects the highlight clause, not OpenSearch.

Request API

POST /_plugins/_ppl
{
  "query": "search source=logs \"error\"",
  "highlight": {
    "fields": { "*": {} },
    "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
    "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
    "fragment_size": 2147483647
  }
}

The highlight object supports the same structure as OpenSearch's highlighting API:

| Field | Description |
|---|---|
| fields | Map of field names to per-field config. Use "*" as a wildcard. |
| pre_tags | Array of tags inserted before highlighted terms |
| post_tags | Array of tags inserted after highlighted terms |
| fragment_size | Max character length of each fragment. OSD sets 2^31 - 1 so the entire field value is returned rather than OpenSearch's default 100-char truncation. |

API/CLI example — specific field with custom tags:

POST /_plugins/_ppl
{
  "query": "search source=logs \"error\"",
  "highlight": {
    "fields": { "message": {} },
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fragment_size": 200
  }
}
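A minimal sketch of how the two kinds of callers might assemble the request body follows. `build_ppl_request` is a hypothetical helper; the only facts taken from this design are that `highlight` is optional and sits next to `query` in the body.

```python
import json


def build_ppl_request(query, highlight=None):
    """Build a /_plugins/_ppl request body; highlight is attached only if given."""
    body = {"query": query}
    if highlight is not None:
        body["highlight"] = highlight
    return body


# OSD-style: wildcard fields, OSD marker tags, fragment size large enough for whole values.
osd_style = build_ppl_request(
    'search source=logs "error"',
    {
        "fields": {"*": {}},
        "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
        "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
        "fragment_size": 2**31 - 1,
    },
)

# API-style: one field, HTML tags, short fragments.
api_style = build_ppl_request(
    'search source=logs "error"',
    {"fields": {"message": {}}, "pre_tags": ["<em>"], "post_tags": ["</em>"], "fragment_size": 200},
)

# No highlight: the body is the same as a request sent before this feature existed.
plain = build_ppl_request('search source=logs "error"')

print(json.dumps(api_style, indent=2))
```

The two callers differ only in the config they pass, which is what lets the backend stay a pass-through.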

Response format

The response includes a highlights array parallel to datarows:

{
  "schema": [{ "name": "firstname", "type": "string" }, ...],
  "datarows": [["Holmes", ...], ["Blanche", ...], ["Amber", ...]],
  "highlights": [
    { "firstname": ["<tag>Holmes</tag>"], "firstname.keyword": ["<tag>Holmes</tag>"] },
    { "lastname": ["<tag>Holmes</tag>"], "lastname.keyword": ["<tag>Holmes</tag>"] },
    { "address": ["880 <tag>Holmes</tag> Lane"] }
  ],
  "total": 3,
  "size": 3
}
  • Each entry in highlights corresponds to the row at the same index in datarows
  • Entries are null when a row has no highlight data for the requested fields
  • The highlights array is omitted entirely when no highlight config is provided
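One way a client could consume this format is sketched below: rows are paired with their highlight entry by index, and both a `null` entry and a missing `highlights` array are treated as "no highlights". The helper name `rows_with_highlights` and the sample data are made up for the example.

```python
def rows_with_highlights(response):
    """Pair each data row (as a dict) with its highlight map, or {} if none."""
    names = [col["name"] for col in response["schema"]]
    rows = response["datarows"]
    # The array is omitted when no highlight config was sent.
    highlights = response.get("highlights") or [None] * len(rows)
    return [(dict(zip(names, row)), hl or {}) for row, hl in zip(rows, highlights)]


# Illustrative response: only the address field was requested for highlighting.
sample = {
    "schema": [{"name": "firstname", "type": "string"}, {"name": "address", "type": "string"}],
    "datarows": [["Holmes", "1 Main St"], ["Amber", "880 Holmes Lane"]],
    "highlights": [None, {"address": ["880 <em>Holmes</em> Lane"]}],
}

paired = rows_with_highlights(sample)
```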

4. Design Decisions

4.1 No command-level scoping — follow DSL behavior

DSL does not scope highlighting to specific query clauses — getHighlightRequest()
applies the same wildcard config regardless of query structure. To match this behavior,
when a caller provides a highlight config, the backend attaches it to the OpenSearch
request regardless of which PPL commands are in the pipeline.

4.2 Relationship with V2 highlight() function

PPL has an existing per-field highlight() function in the V2 engine (e.g.,
highlight(msg, pre_tags='<em>')). This is not supported in the Calcite engine.
The request-level API covers the same use cases:

| Capability | V2 highlight() function | Request-level API |
|---|---|---|
| Per-field control | `highlight(msg)` in PPL query | `"fields": { "msg": {} }` in request body |
| Custom tags | `highlight(msg, pre_tags='<em>')` | `"pre_tags": ["<em>"]` in request body |
| Wildcard fields | `highlight(*)` in PPL | `"fields": { "*": {} }` in request body |

Porting highlight() to Calcite is out of scope. The backend plumbing supports both
approaches if an in-query function is added later.

4.3 Backend is a pure pass-through — no hardcoded defaults

An alternative considered was a ?highlight=true query parameter that triggers the
backend to inject a default config. This was rejected because it would hardcode
OSD-specific knowledge (tags, fragment size) into the backend.

Instead, the backend simply forwards whatever config the caller provides. One mechanism
for all callers, no OSD-specific knowledge in the backend.


5. Performance Evaluation

Methodology

A/B benchmark comparing PPL query execution with and without highlighting on a
worst-case dataset where every document matches and every text field produces highlights:

  • Dataset: 10,000 documents, search term "error" in 4 text fields per document
  • Query: search source=highlight_perf_test "error" — returns all 10,000 rows
  • Iterations: 20 measured runs after 3 warmup runs
  • Environment: Single-node OpenSearch (1 shard, 0 replicas), local dev machine

Results

| Scenario | Avg Latency | P50 | P95 | Response Size |
|---|---|---|---|---|
| No highlight | 163 ms | 161 ms | 197 ms | 2.84 MB |
| Highlight (wildcard *) | 617 ms | 613 ms | 669 ms | 9.29 MB |
| Highlight (single field) | 299 ms | 298 ms | 325 ms | 4.06 MB |

| Comparison | Latency Overhead | Size Overhead |
|---|---|---|
| Wildcard vs no highlight | +278% (~3.8x) | +227% (~3.3x) |
| Single field vs no highlight | +83% (~1.8x) | +43% (~1.4x) |

Analysis

  1. Worst-case benchmark. Every row matches, every text field highlights. Real-world
    result sets are typically smaller and sparser, so the observed overhead should be lower.

  2. The overhead is in OpenSearch, not the SQL plugin. The backend is a pure
    pass-through. The latency and size cost is OpenSearch's native highlighting — the
    same cost DSL users already pay today.

  3. Single-field highlighting is significantly cheaper. API users who specify only the
    fields they need pay roughly a third of wildcard's latency overhead (+136 ms vs.
    +454 ms) and about a fifth of its size overhead (+1.22 MB vs. +6.45 MB).

  4. Response size is proportional to highlight data. Wildcard adds ~6.4 MB of
    highlight fragments (tags + full field values for 4 fields x 10k rows).
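The overhead figures can be re-derived from the Results table with a few lines of arithmetic. The sketch below uses the rounded averages from that table, so its percentages may differ by up to a point from the published ones, which were presumably computed from unrounded measurements.

```python
# (avg latency in ms, response size in MB), copied from the Results table.
RESULTS = {
    "none": (163, 2.84),
    "wildcard": (617, 9.29),
    "single": (299, 4.06),
}


def overhead(scenario, baseline="none"):
    """Return relative and absolute overhead of a scenario against the baseline."""
    lat, size = RESULTS[scenario]
    base_lat, base_size = RESULTS[baseline]
    return {
        "latency_pct": (lat - base_lat) / base_lat * 100,
        "size_pct": (size - base_size) / base_size * 100,
        "extra_ms": lat - base_lat,
        "extra_mb": size - base_size,
    }


wildcard = overhead("wildcard")
single = overhead("single")

# Share of the wildcard latency overhead that single-field highlighting pays.
latency_share = single["extra_ms"] / wildcard["extra_ms"]
```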


6. Limitations

  1. Same limitations as DSL highlighting. Only full-text queries produce highlights;
    only text/keyword fields; no query-aware scoping. These are inherent OpenSearch
    limitations (see Background: DSL Highlighting in the Appendix).

  2. shouldHighlight is hardcoded to true. The OSD frontend currently always sends
    the highlight config for PPL. It should instead honor the doc_table:highlight UI
    setting, as the DSL path does.

  3. No in-query highlight() function in Calcite. The request-level API covers the
    same capabilities, but interactive PPL users cannot use highlight() in queries.


Appendix

Background: DSL Highlighting

How OSD enables DSL highlighting

OSD's SearchSource.flatten() injects a highlight clause into every DSL request when
the doc_table:highlight UI setting is enabled (default: true). The
getHighlightRequest() function returns a hardcoded config: wildcard fields, OSD custom
tags, and fragment_size: 2^31 - 1.

| Aspect | Behavior |
|---|---|
| Fields | Always "*" (wildcard), hardcoded |
| Tags | Always OSD custom tags, hardcoded |
| Fragment size | Always max int, hardcoded |
| On/off | doc_table:highlight UI setting (default: true) |
| Query-aware scoping | None: receives the query but never inspects it |
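The behavior described above can be mirrored in a few lines. This is a Python paraphrase for illustration only, not OSD's TypeScript source; the function name and the `should_highlight` parameter are assumptions standing in for the `doc_table:highlight` setting.

```python
MAX_INT = 2**31 - 1  # 2147483647, large enough that whole field values are returned


def get_highlight_request(should_highlight):
    """Return the fixed highlight config, or None when highlighting is disabled."""
    if not should_highlight:
        return None
    return {
        "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
        "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
        "fields": {"*": {}},       # always wildcard
        "fragment_size": MAX_INT,  # always max int
    }


enabled = get_highlight_request(True)
disabled = get_highlight_request(False)
```

Nothing in the returned config depends on the query, which is the "no query-aware scoping" row in the table.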

DSL highlighting limitations

These are inherent OpenSearch limitations, not OSD-specific:

  1. Only full-text queries produce highlights. query_string, match,
    match_phrase, etc. Structured filters (range, term, bool filter) do not.

  2. Only text and keyword fields are highlighted. Numeric, boolean, date, and
    other non-string types produce no fragments.

  3. No query-aware scoping. Same wildcard config regardless of query structure.

  4. Keyword subfields are included. "*" matches both firstname and
    firstname.keyword.

These limitations apply equally to PPL highlighting since we use the same mechanism.

Backend Implementation Details

Data flow

PPLQueryRequest         ->  PPLService            ->  AbstractPlan (carries highlightConfig)
   .getHighlight()          .setHighlightOnPlan()      across thread boundary

                        ->  QueryPlan.execute()    ->  CalcitePlanContext ThreadLocal
                            (on worker thread)         .setHighlightConfig()

                        ->  CalciteLogicalIndexScan (3-arg constructor)
                            buildInitialSchema() checks ThreadLocal:
                            if highlight config present, appends _highlight (SqlTypeName.ANY)
                            to the Calcite RowType

                        ->  CalciteEnumerableIndexScan.scan()
                            applyHighlightConfig(): reads ThreadLocal, builds HighlightBuilder,
                            attaches to OpenSearchRequestBuilder

OpenSearch response     ->  OpenSearchResponse.addHighlightsToBuilder()
   highlight fragments      builds _highlight ExprTupleValue per hit

                        ->  OpenSearchIndexEnumerator.current()
                            carries _highlight as opaque ExprValue in Calcite row
                            (SqlTypeName.ANY — Calcite passes it through without conversion)

                            Calcite operators (filter, sort, dedup) naturally preserve
                            _highlight as a regular column — no positional misalignment

                        ->  OpenSearchExecutionEngine.buildResultSet()
                            reads _highlight inline from each ResultSet row,
                            embeds in ExprTupleValue, excludes _highlight from response schema

                        ->  QueryResult.highlights()
                            extracts _highlight from each row tuple

                        ->  JdbcResponseFormatter / SimpleJsonResponseFormatter
                            writes "highlights" array in JSON response
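The reason for carrying `_highlight` as a regular column can be shown with a small model. The sketch below is a plain-Python analogy, not the Calcite or plugin code: rows are dicts, the filter and sort stand in for Calcite operators, and `build_result` is a hypothetical stand-in for the final result-building step.

```python
HIGHLIGHT_FIELD = "_highlight"  # hidden column name used throughout the design

# Hypothetical rows as they might leave the index scan: data columns plus _highlight.
scanned = [
    {"firstname": "Holmes", "balance": 10000,
     HIGHLIGHT_FIELD: {"firstname": ["<em>Holmes</em>"]}},
    {"firstname": "Blanche", "balance": 45000,
     HIGHLIGHT_FIELD: {"lastname": ["<em>Holmes</em>"]}},
    {"firstname": "Amber", "balance": 50000,
     HIGHLIGHT_FIELD: {"address": ["880 <em>Holmes</em> Lane"]}},
]


def build_result(rows):
    """Split the hidden column out of the schema and into a parallel array."""
    columns = [c for c in rows[0] if c != HIGHLIGHT_FIELD] if rows else []
    return {
        "schema": columns,
        "datarows": [[row[c] for c in columns] for row in rows],
        "highlights": [row.get(HIGHLIGHT_FIELD) for row in rows],
    }


# Filtering and sorting move whole rows, so each highlight stays with its row.
filtered = [r for r in scanned if r["balance"] > 40000]
ordered = sorted(filtered, key=lambda r: r["balance"], reverse=True)
result = build_result(ordered)
```

Had the highlights been kept in a separate list indexed by scan position, the filter and sort above would have broken the pairing.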

Key files

| File | Role |
|---|---|
| PPLQueryRequest.getHighlight() | Extracts highlight JSONObject from request body |
| PPLService.setHighlightOnPlan() | Attaches config to AbstractPlan for cross-thread transport |
| AbstractPlan.highlightConfig | Carries config from REST handler thread to worker thread |
| QueryPlan.execute() / ExplainPlan.execute() | Sets CalcitePlanContext ThreadLocal on worker thread |
| CalciteLogicalIndexScan.buildInitialSchema() | Conditionally appends _highlight column to RowType |
| AbstractCalciteIndexScan.applyHighlightConfig() | Converts config map to HighlightBuilder |
| OpenSearchResponse.addHighlightsToBuilder() | Builds _highlight ExprTupleValue from OpenSearch response |
| OpenSearchIndexEnumerator.current() | Carries _highlight as opaque ExprValue in Calcite row |
| OpenSearchExecutionEngine.buildResultSet() | Reads _highlight inline from row, excludes from schema |
| QueryResult.highlights() | Extracts highlight data from row tuples |
| JdbcResponseFormatter / SimpleJsonResponseFormatter | Writes highlights array in JSON response |
| HighlightExpression.HIGHLIGHT_FIELD | Constant "_highlight" used across all files |

Threading model

The PPL endpoint runs on a REST handler thread, but query execution runs on a separate
sql-worker thread pool. ThreadLocals do not cross thread boundaries.

Solution: The highlight config is carried via the AbstractPlan object (a normal
Java reference). execute() runs on the worker thread and sets the ThreadLocal there.

REST handler thread:  PPLService -> setHighlightOnPlan(plan) -> queryManager.submit(plan)
                                                                        |
Worker thread:        plan.execute() -> setHighlightThreadLocal() -> CalcitePlanContext.set()
                                                                        |
                      analyze() -> buildInitialSchema() adds _highlight to RowType
                                -> optimize() -> scan() -> applyHighlightConfig()
                                                        -> enumerator carries _highlight as column
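The same hand-off can be demonstrated outside Java. The sketch below is a Python analogy using `threading.local` and a thread pool; the `Plan` class and `scan` function are simplified stand-ins, not the plugin's classes. It shows a thread-local value set on the submitting thread being invisible to the worker, while a value carried on the plan object arrives intact.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_context = threading.local()  # plays the role of the CalcitePlanContext ThreadLocal


def scan():
    """Read the highlight config the way the scan would: from thread-local state."""
    return getattr(_context, "highlight", None)


class Plan:
    """Simplified plan that carries the config as an ordinary field."""

    def __init__(self, query, highlight_config=None):
        self.query = query
        self.highlight_config = highlight_config

    def execute(self):
        # Runs on the worker thread, so the thread-local is set where it is read.
        _context.highlight = self.highlight_config
        try:
            return scan()
        finally:
            _context.highlight = None  # avoid leaking into the next pooled task


with ThreadPoolExecutor(max_workers=1) as pool:
    # Set only on the submitting thread: the worker never sees this value.
    _context.highlight = {"fields": {"*": {}}}
    lost = pool.submit(scan).result()

    # Carried on the plan object: the worker installs it before scanning.
    plan = Plan('search source=logs "error"', {"fields": {"message": {}}})
    carried = pool.submit(plan.execute).result()
```

Clearing the thread-local in `finally` matters with pooled threads, since the same worker may run an unrelated query next.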

Frontend Implementation Details (OSD)

Changes

| File | Change |
|---|---|
| ppl_search_strategy.ts | Calls getHighlightRequest() and attaches config to request.body.highlight |
| facet.ts | Forwards request.body.highlight into params.body sent to /_plugins/_ppl |

Response handling

| File | Role |
|---|---|
| ppl_search_strategy.ts | Stores rawResponse.data.highlights on dataFrame.meta |
| data_frames/utils.ts | Attaches meta.highlights[i] as hit.highlight per row in convertResult() |

getHighlightRequest() is the same function OSD uses for DSL, so no new highlight
configuration logic is needed on the OSD side.

Sample Queries & Responses

Test data: accounts.json (1000 documents with fields like firstname, lastname,
address, age, etc.)

With highlight (OSD-style — wildcard fields, OSD tags)

curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq

Response (trimmed to highlights):

{
  "highlights": [
    {
      "firstname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "firstname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    },
    {
      "lastname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "lastname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    },
    {
      "address": ["880 @opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@ Lane"]
    }
  ],
  "total": 3,
  "size": 3
}

Without highlight (backward compatible)

curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\""
  }' | jq

Response: No highlights field — backward compatible.

Explain (highlight appears in generated DSL)

curl -s -X POST "localhost:9200/_plugins/_ppl/_explain" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq

Response (relevant portion):

{
  "query": { "query_string": { "query": "Holmes" } },
  "highlight": {
    "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
    "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
    "fields": { "*": { "fragment_size": 2147483647 } }
  }
}

Custom tags + specific field (API user scenario)

curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "address": {} },
      "pre_tags": ["<em>"],
      "post_tags": ["</em>"],
      "fragment_size": 200
    }
  }' | jq

Response:

{
  "highlights": [
    null,
    null,
    { "address": ["880 <em>Holmes</em> Lane"] }
  ],
  "total": 3,
  "size": 3
}

Only address is highlighted. Rows 1 and 2 are null because "Holmes" appears in
firstname/lastname, not address.

Text search + piped filter

curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\" | where balance > 40000",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq

Response:

{
  "highlights": [
    {
      "lastname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "lastname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    }
  ],
  "total": 1,
  "size": 1
}

The where balance > 40000 narrows results but does not produce highlights (it is a
range filter, not a full-text query).
