[FEATURE RFC] PPL Search Result Highlighting #5156

# PPL Search Result Highlighting — Design

**Author**: Jialiang Liang
**Date**: 2026-02-17

---

## 1. Scope

This design covers **search result highlighting for PPL queries** executed through the
Calcite engine. There are two distinct user stories:

### User Story 1: OSD Explore users

When a user switches to PPL mode in Explore and runs `search source=logs "error"`,
matching terms should be highlighted in the results table — the same experience they get
with DSL today. The user does not configure highlighting; OSD handles it automatically
using its existing `getHighlightRequest()` function.

### User Story 2: API / CLI users

API consumers sending PPL queries via `POST /_plugins/_ppl` may want to highlight
specific fields with custom tags (e.g., `<em>` for HTML). They need full control over
which fields are highlighted, what tags are used, and the fragment size.

### Both stories, one mechanism

Both are served by the same API: an optional `highlight` object in the PPL request body.
OSD and API users construct the config differently, but the backend treats them
identically — it is a pure pass-through. When no `highlight` is provided, behavior is
unchanged (backward compatible).
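
As a rough illustration of the shared mechanism, the sketch below builds a PPL request body in which `highlight` is optional. The helper name `build_ppl_request` is hypothetical and not part of any client library; only the `query` and `highlight` keys come from this design.

```python
import json
from typing import Any, Dict, Optional


def build_ppl_request(query: str, highlight: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the JSON body for POST /_plugins/_ppl (hypothetical helper)."""
    body: Dict[str, Any] = {"query": query}
    # The key is added only when the caller supplies a config, so requests
    # without highlighting look exactly as they do today.
    if highlight is not None:
        body["highlight"] = highlight
    return body


# OSD and API callers differ only in the config they pass in.
plain = build_ppl_request('search source=logs "error"')
custom = build_ppl_request(
    'search source=logs "error"',
    highlight={"fields": {"message": {}}, "fragment_size": 200},
)
print(json.dumps(plain))
print(json.dumps(custom))
```

The body is sent as-is; nothing in the helper interprets the config, which mirrors the pass-through role the backend plays in this design.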

### Out of scope

- Porting the V2 `highlight()` PPL function to the Calcite engine
- Customizable highlight settings in OSD UI (e.g., per-field configuration from Explore)

---

## 2. Highlighting Behavior

PPL highlighting is fully aligned with the existing DSL highlight feature — the same
OpenSearch highlighting engine, the same rules for what gets highlighted, and the same
rendering in OSD Explore.

### When do highlights appear?

Only when the PPL query contains a **full-text search** — e.g.,
`search source=logs "error"` or `search source=logs "connection timeout"`. The search
term is translated to a `query_string` query, and OpenSearch's highlighter identifies
where that term appears in the document.

If the query contains only structured filters (e.g., `search source=logs | where status = 200`),
no highlights are produced — even if the highlight config is present in the request.

### What gets highlighted?

The matching search terms inside **`text` and `keyword` field values**. For example,
searching `"Holmes"` highlights the term `Holmes` wherever it appears — in `firstname`,
`lastname`, `address`, or any other string field that contains the match.

### What does NOT get highlighted?

- **Non-string fields**: Numeric, date, boolean, and other non-text field types never
 produce highlight fragments, even if the document matched because of those fields.
- **Structured filters**: PPL commands like `where`, `stats`, or conditions using
 comparison operators (`>`, `<`, `=`) do not produce highlights. These are translated
 to range/term/bool filters in the DSL, which OpenSearch's highlighter does not act on.

### What about piped commands?

Piped commands narrow the result set but do not affect which terms are highlighted.
For example:

```
search source=logs "error" | where status > 400
```

The `where status > 400` filters rows, but only the `"error"` full-text search produces
highlights. This is the same behavior as a DSL `bool` query with a `query_string` in
`must` and a `range` in `filter` — the filter narrows results without contributing to
highlights.
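
The DSL shape referred to above can be sketched as a Python dict. This is a hand-written illustration of a `bool` query with `query_string` in `must` and `range` in `filter`, not the DSL the plugin actually generates; the `gt` bound simply mirrors `status > 400`.

```python
# Hand-written DSL shape for:
#   search source=logs "error" | where status > 400
equivalent_dsl = {
    "query": {
        "bool": {
            # Full-text clause: the part that yields highlight fragments.
            "must": [{"query_string": {"query": "error"}}],
            # Range filter: narrows the hits; per this design it adds no highlights.
            "filter": [{"range": {"status": {"gt": 400}}}],
        }
    }
}

bool_query = equivalent_dsl["query"]["bool"]
must_types = [next(iter(clause)) for clause in bool_query["must"]]
filter_types = [next(iter(clause)) for clause in bool_query["filter"]]
print("must:", must_types)
print("filter:", filter_types)
```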

### What does it look like?

Matching terms in field values are wrapped in configurable tags:

- **OSD Explore**: Tags are OSD internal markers (`@opensearch-dashboards-highlighted-field@`)
 that OSD renders as bold/colored text in the results table — identical to DSL behavior.
- **API/CLI**: Users choose their own tags (e.g., `<em>` or custom markers)
 and see them in the JSON response.
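
To make the marker convention concrete, here is a minimal sketch of how a client might turn OSD-style markers into HTML. OSD's actual rendering code is not shown in this document; the function name `render_fragment` and the choice of `<mark>` are assumptions made for illustration.

```python
import html

PRE_TAG = "@opensearch-dashboards-highlighted-field@"
POST_TAG = "@/opensearch-dashboards-highlighted-field@"


def render_fragment(fragment: str) -> str:
    """Escape the raw field text first, then swap the markers for <mark> tags."""
    # Escaping before substitution keeps any HTML in the field value inert.
    escaped = html.escape(fragment)
    return escaped.replace(PRE_TAG, "<mark>").replace(POST_TAG, "</mark>")


rendered = render_fragment(f"880 {PRE_TAG}Holmes{POST_TAG} Lane")
print(rendered)
```

Using plain-text markers rather than HTML tags lets the renderer escape the field value safely before adding its own markup.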

---

## 3. API Design

### Principle: caller-driven, backend pass-through

- The **caller** (OSD, API client, CLI) controls highlighting by providing a `highlight`
 object in the PPL request body
- The **backend** forwards the config as-is to OpenSearch and returns highlight data in
 the response
- When **no `highlight`** is provided, no highlighting is applied

This is consistent with how DSL works: OSD injects the highlight clause, not OpenSearch.

### Request API

```json
POST /_plugins/_ppl
{
  "query": "search source=logs \"error\"",
  "highlight": {
    "fields": { "*": {} },
    "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
    "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
    "fragment_size": 2147483647
  }
}
```

The `highlight` object supports the same structure as OpenSearch's [highlighting API](https://opensearch.org/docs/latest/search-plugins/searching-data/highlight/):

| Field | Description |
|-------|-------------|
| `fields` | Map of field names to per-field config. `"*"` for wildcard. |
| `pre_tags` | Array of tags inserted before highlighted terms |
| `post_tags` | Array of tags inserted after highlighted terms |
| `fragment_size` | Max character length of each fragment. OSD sets `2^31 - 1` so the entire field value is returned rather than OpenSearch's default 100-char truncation. |
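
The `fragment_size` value in the OSD-style request is the largest signed 32-bit integer. The sketch below checks that arithmetic and assembles the same config as the request above; `osd_style_highlight` is a hypothetical helper name, not OSD's `getHighlightRequest()`.

```python
MAX_INT32 = 2**31 - 1  # 2147483647, the fragment_size OSD sends


def osd_style_highlight() -> dict:
    """Wildcard fields, OSD marker tags, and an effectively unlimited fragment size."""
    return {
        "fields": {"*": {}},
        "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
        "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
        "fragment_size": MAX_INT32,
    }


config = osd_style_highlight()
print(config["fragment_size"])
```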

**API/CLI example** — specific field with custom tags:

```json
POST /_plugins/_ppl
{
  "query": "search source=logs \"error\"",
  "highlight": {
    "fields": { "message": {} },
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fragment_size": 200
  }
}
```

### Response format

The response includes a `highlights` array parallel to `datarows`:

```json
{
  "schema": [{ "name": "firstname", "type": "string" }, ...],
  "datarows": [["Holmes", ...], ["Blanche", ...], ["Amber", ...]],
  "highlights": [
    { "firstname": ["<tag>Holmes</tag>"], "firstname.keyword": ["<tag>Holmes</tag>"] },
    { "lastname": ["<tag>Holmes</tag>"], "lastname.keyword": ["<tag>Holmes</tag>"] },
    { "address": ["880 <tag>Holmes</tag> Lane"] }
  ],
  "total": 3,
  "size": 3
}
```

- Each entry in `highlights` corresponds to the row at the same index in `datarows`
- Entries are `null` when a row has no highlight data for the requested fields
- The `highlights` array is **omitted entirely** when no highlight config is provided
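
A client consuming this format has to handle three cases: a highlight map, a `null` entry, and a missing `highlights` key. The sketch below pairs each row with its highlight data under those rules; `rows_with_highlights` is a hypothetical helper and the sample response is made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Tuple


def rows_with_highlights(response: Dict[str, Any]) -> Iterator[Tuple[Dict[str, Any], Dict[str, List[str]]]]:
    """Yield (row as dict, highlight map) pairs; the map is empty when there is none."""
    names = [col["name"] for col in response["schema"]]
    rows = response["datarows"]
    # `highlights` is omitted when no config was sent: treat it as all-None.
    highlights = response.get("highlights") or [None] * len(rows)
    for row, hl in zip(rows, highlights):
        # Individual entries may be null when a row has no highlight data.
        yield dict(zip(names, row)), hl or {}


sample = {
    "schema": [{"name": "firstname", "type": "string"}, {"name": "address", "type": "string"}],
    "datarows": [["Holmes", "1 Main St"], ["Blanche", "2 Oak Ave"]],
    "highlights": [{"firstname": ["<tag>Holmes</tag>"]}, None],
}
paired = list(rows_with_highlights(sample))
for row, hl in paired:
    print(row, hl)
```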

---

## 4. Design Decisions

### 4.1 No command-level scoping — follow DSL behavior

DSL does not scope highlighting to specific query clauses — `getHighlightRequest()`
applies the same wildcard config regardless of query structure. To match this behavior,
when a caller provides a `highlight` config, the backend attaches it to the OpenSearch
request regardless of which PPL commands are in the pipeline.

### 4.2 Relationship with V2 `highlight()` function

PPL has an existing per-field `highlight()` function in the V2 engine (e.g.,
`highlight(msg, pre_tags='<em>')`). This is **not supported in the Calcite engine**.
The request-level API covers the same use cases:

| Capability | V2 `highlight()` function | Request-level API |
|-----------|--------------------------|-------------------|
| Per-field control | `highlight(msg)` in PPL query | `"fields": { "msg": {} }` in request body |
| Custom tags | `highlight(msg, pre_tags='<em>')` | `"pre_tags": ["<em>"]` in request body |
| Wildcard fields | `highlight(*)` in PPL | `"fields": { "*": {} }` in request body |

Porting `highlight()` to Calcite is out of scope. The backend plumbing supports both
approaches if an in-query function is added later.

### 4.3 Backend is a pure pass-through — no hardcoded defaults

An alternative considered was a `?highlight=true` query parameter that triggers the
backend to inject a default config. This was rejected because it would hardcode
OSD-specific knowledge (tags, fragment size) into the backend.

Instead, the backend simply forwards whatever config the caller provides. One mechanism
for all callers, no OSD-specific knowledge in the backend.

---

## 5. Performance Evaluation

### Methodology

A/B benchmark comparing PPL query execution **with** and **without** highlighting on a
worst-case dataset where every document matches and every text field produces highlights:

- **Dataset**: 10,000 documents, search term `"error"` in **4 text fields** per document
- **Query**: `search source=highlight_perf_test "error"` — returns all 10,000 rows
- **Iterations**: 20 measured runs after 3 warmup runs
- **Environment**: Single-node OpenSearch (1 shard, 0 replicas), local dev machine
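
The measurement loop described above can be sketched as follows. This is not the harness used for the reported numbers: `run_query` is a stand-in for whatever issues the PPL request, and the nearest-rank P95 is an assumption, since the percentile method is not stated.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time `run_query` and report avg / P50 / P95 latency in milliseconds."""
    for _ in range(warmup):
        run_query()  # warmup runs are executed but not recorded
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank P95 with integer arithmetic: index of ceil(0.95 * n) - 1.
    p95_index = (95 * len(samples_ms) + 99) // 100 - 1
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": statistics.median(samples_ms),
        "p95": samples_ms[p95_index],
    }


calls = []
stats = benchmark(lambda: calls.append(1))
print(len(calls), sorted(stats))
```

Running the same function once with and once without the `highlight` object in the request gives the two sides of the A/B comparison.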

### Results

| Scenario | Avg Latency | P50 | P95 | Response Size |
|----------|-------------|-----|-----|---------------|
| **No highlight** | 163 ms | 161 ms | 197 ms | 2.84 MB |
| **Highlight (wildcard `*`)** | 617 ms | 613 ms | 669 ms | 9.29 MB |
| **Highlight (single field)** | 299 ms | 298 ms | 325 ms | 4.06 MB |

| Comparison | Latency Overhead | Size Overhead |
|------------|-----------------|---------------|
| Wildcard vs no highlight | +278% (~3.8x) | +227% (~3.3x) |
| Single field vs no highlight | +83% (~1.8x) | +43% (~1.4x) |

### Analysis

1. **Worst-case benchmark.** Every row matches, every text field highlights. Real-world
 result sets are typically smaller and sparser, so the overhead should be lower.

2. **The overhead is in OpenSearch, not the SQL plugin.** The backend is a pure
 pass-through. The latency and size cost is OpenSearch's native highlighting — the
 same cost DSL users already pay today.

3. **Single-field highlighting is significantly cheaper.** API users who specify only the
 fields they need see roughly half the total latency of wildcard (299 ms vs. 617 ms)
 and about 30% of its added latency (+136 ms vs. +454 ms).

4. **Response size is proportional to highlight data.** Wildcard adds ~6.4 MB of
 highlight fragments (tags + full field values for 4 fields x 10k rows).
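
The overhead figures can be recomputed from the averages in the results table. The sketch below does that arithmetic; small differences from the table are expected because the table values are themselves rounded.

```python
# (avg latency in ms, response size in MB), copied from the results table
RESULTS = {"none": (163.0, 2.84), "wildcard": (617.0, 9.29), "single": (299.0, 4.06)}


def overhead_pct(value: float, baseline: float) -> float:
    """Relative increase over the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


base_ms, base_mb = RESULTS["none"]
for name in ("wildcard", "single"):
    ms, mb = RESULTS[name]
    print(
        f"{name}: latency +{overhead_pct(ms, base_ms):.1f}% ({ms / base_ms:.1f}x), "
        f"size +{overhead_pct(mb, base_mb):.1f}% ({mb / base_mb:.1f}x)"
    )

# Two ways to compare single-field against wildcard highlighting.
added_ms = {name: RESULTS[name][0] - base_ms for name in ("wildcard", "single")}
total_ratio = RESULTS["single"][0] / RESULTS["wildcard"][0]  # ratio of total latency
added_ratio = added_ms["single"] / added_ms["wildcard"]      # ratio of added latency
extra_mb = RESULTS["wildcard"][1] - base_mb                  # highlight payload for wildcard
print(f"total latency ratio {total_ratio:.2f}, added latency ratio {added_ratio:.2f}, extra {extra_mb:.2f} MB")
```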

---

## 6. Limitations

1. **Same limitations as DSL highlighting.** Only full-text queries produce highlights;
 only `text`/`keyword` fields; no query-aware scoping. These are inherent OpenSearch
 limitations (see [Background: DSL Highlighting](#background-dsl-highlighting) in the Appendix).

2. **`shouldHighlight` is hardcoded to `true`.** The OSD frontend currently always sends
 the highlight config. It should read from the `doc_table:highlight` UI setting.

3. **No in-query `highlight()` function in Calcite.** The request-level API covers the
 same capabilities, but interactive PPL users cannot use `highlight()` in queries.

---

## Appendix

<details>
<summary>Background: DSL Highlighting</summary>

### How OSD enables DSL highlighting

OSD's `SearchSource.flatten()` injects a `highlight` clause into every DSL request when
the `doc_table:highlight` UI setting is enabled (default: `true`). The
`getHighlightRequest()` function returns a hardcoded config: wildcard fields, OSD custom
tags, and `fragment_size: 2^31 - 1`.

| Aspect | Behavior |
|--------|----------|
| Fields | Always `"*"` (wildcard) — hardcoded |
| Tags | Always OSD custom tags — hardcoded |
| Fragment size | Always max int — hardcoded |
| On/off | `doc_table:highlight` UI setting (default: `true`) |
| Query-aware scoping | **None** — receives the query but never inspects it |

### DSL highlighting limitations

These are inherent OpenSearch limitations, not OSD-specific:

1. **Only full-text queries produce highlights.** `query_string`, `match`,
 `match_phrase`, etc. Structured filters (`range`, `term`, `bool` filter) do not.

2. **Only `text` and `keyword` fields are highlighted.** Numeric, boolean, date, and
 other non-string types produce no fragments.

3. **No query-aware scoping.** Same wildcard config regardless of query structure.

4. **Keyword subfields are included.** `"*"` matches both `firstname` and
 `firstname.keyword`.

These limitations apply equally to PPL highlighting since we use the same mechanism.
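
Because a wildcard config matches both a field and its `.keyword` subfield, a client may see the same fragment twice. One possible client-side workaround is sketched below; `drop_keyword_duplicates` is a hypothetical helper and is not part of OSD or the plugin.

```python
from typing import Dict, List

SUFFIX = ".keyword"


def drop_keyword_duplicates(highlight: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop `x.keyword` entries whenever the parent field `x` is also highlighted."""
    return {
        field: fragments
        for field, fragments in highlight.items()
        if not (field.endswith(SUFFIX) and field[: -len(SUFFIX)] in highlight)
    }


entry = {
    "firstname": ["<tag>Holmes</tag>"],
    "firstname.keyword": ["<tag>Holmes</tag>"],
}
deduped = drop_keyword_duplicates(entry)
print(deduped)
```

A `.keyword` entry whose parent field is absent is kept, so no highlight data is lost.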

</details>

<details>
<summary>Backend Implementation Details</summary>

### Data flow

```
PPLQueryRequest.getHighlight()
  -> PPLService.setHighlightOnPlan()
  -> AbstractPlan (carries highlightConfig across thread boundary)

  -> QueryPlan.execute() (on worker thread)
       -> CalcitePlanContext ThreadLocal .setHighlightConfig()

  -> CalciteLogicalIndexScan (3-arg constructor)
       buildInitialSchema() checks ThreadLocal:
       if highlight config present, appends _highlight (SqlTypeName.ANY)
       to the Calcite RowType

  -> CalciteEnumerableIndexScan.scan()
       applyHighlightConfig(): reads ThreadLocal, builds HighlightBuilder,
       attaches to OpenSearchRequestBuilder

OpenSearch response (highlight fragments)
  -> OpenSearchResponse.addHighlightsToBuilder()
       builds _highlight ExprTupleValue per hit

  -> OpenSearchIndexEnumerator.current()
       carries _highlight as opaque ExprValue in Calcite row
       (SqlTypeName.ANY: Calcite passes it through without conversion)

     Calcite operators (filter, sort, dedup) naturally preserve
     _highlight as a regular column, so there is no positional misalignment

  -> OpenSearchExecutionEngine.buildResultSet()
       reads _highlight inline from each ResultSet row,
       embeds in ExprTupleValue, excludes _highlight from response schema

  -> QueryResult.highlights()
       extracts _highlight from each row tuple

  -> JdbcResponseFormatter / SimpleJsonResponseFormatter
       writes "highlights" array in JSON response
```
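
The reason for carrying `_highlight` as a regular column can be shown with a small model. The sketch below uses plain Python dicts as stand-ins for Calcite rows and a filter plus sort as stand-ins for Calcite operators; the row values are invented and `split_result` is a hypothetical name.

```python
HIGHLIGHT_FIELD = "_highlight"

rows = [
    {"firstname": "Holmes", "balance": 10000, HIGHLIGHT_FIELD: {"firstname": ["<tag>Holmes</tag>"]}},
    {"firstname": "Blanche", "balance": 50000, HIGHLIGHT_FIELD: {"lastname": ["<tag>Holmes</tag>"]}},
    {"firstname": "Amber", "balance": 45000, HIGHLIGHT_FIELD: {"address": ["880 <tag>Holmes</tag> Lane"]}},
]

# Stand-ins for Calcite operators: drop one row, then reorder the rest.
kept = sorted((r for r in rows if r["balance"] > 40000), key=lambda r: r["balance"])


def split_result(result_rows):
    """Separate the hidden column from the visible ones at the very end."""
    datarows = [[v for k, v in r.items() if k != HIGHLIGHT_FIELD] for r in result_rows]
    highlights = [r.get(HIGHLIGHT_FIELD) for r in result_rows]
    return datarows, highlights


datarows, highlights = split_result(kept)
print(datarows)
print(highlights)
```

Had the highlights been kept in a separate list indexed by original hit position, the filter and sort would have left that list out of step with the surviving rows.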

### Key files

| File | Role |
|------|------|
| `PPLQueryRequest.getHighlight()` | Extracts `highlight` JSONObject from request body |
| `PPLService.setHighlightOnPlan()` | Attaches config to `AbstractPlan` for cross-thread transport |
| `AbstractPlan.highlightConfig` | Carries config from REST handler thread to worker thread |
| `QueryPlan.execute()` / `ExplainPlan.execute()` | Sets `CalcitePlanContext` ThreadLocal on worker thread |
| `CalciteLogicalIndexScan.buildInitialSchema()` | Conditionally appends `_highlight` column to RowType |
| `AbstractCalciteIndexScan.applyHighlightConfig()` | Converts config map to `HighlightBuilder` |
| `OpenSearchResponse.addHighlightsToBuilder()` | Builds `_highlight` ExprTupleValue from OpenSearch response |
| `OpenSearchIndexEnumerator.current()` | Carries `_highlight` as opaque ExprValue in Calcite row |
| `OpenSearchExecutionEngine.buildResultSet()` | Reads `_highlight` inline from row, excludes from schema |
| `QueryResult.highlights()` | Extracts highlight data from row tuples |
| `JdbcResponseFormatter` / `SimpleJsonResponseFormatter` | Writes `highlights` array in JSON response |
| `HighlightExpression.HIGHLIGHT_FIELD` | Constant `"_highlight"` used across all files |

### Threading model

The PPL endpoint runs on a REST handler thread, but query execution runs on a separate
`sql-worker` thread pool. ThreadLocals do not cross thread boundaries.

**Solution**: The highlight config is carried via the `AbstractPlan` object (a normal
Java reference). `execute()` runs on the worker thread and sets the ThreadLocal there.

```
REST handler thread:  PPLService -> setHighlightOnPlan(plan) -> queryManager.submit(plan)
                          |
Worker thread:        plan.execute() -> setHighlightThreadLocal() -> CalcitePlanContext.set()
                          |
                      analyze() -> buildInitialSchema() adds _highlight to RowType
                        -> optimize() -> scan() -> applyHighlightConfig()
                        -> enumerator carries _highlight as column
```
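
The same hand-off pattern can be reproduced in a few lines of Python. This is an analogy rather than the plugin's Java code: `threading.local` stands in for the `CalcitePlanContext` ThreadLocal, and the `Plan` class and `scan` function are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_context = threading.local()  # stand-in for the CalcitePlanContext ThreadLocal


def scan():
    """Read the config the way a scan operator would: from the current thread only."""
    return getattr(_context, "highlight", None)


class Plan:
    def __init__(self, highlight_config):
        # An ordinary object reference, so it travels with the plan across threads.
        self.highlight_config = highlight_config

    def execute(self):
        # Runs on the worker thread: copy the config into that thread's local.
        _context.highlight = self.highlight_config
        return scan()


config = {"fields": {"*": {}}}
_context.highlight = config  # set on the submitting thread only

with ThreadPoolExecutor(max_workers=1) as pool:
    seen_without_plan = pool.submit(scan).result()
    seen_with_plan = pool.submit(Plan(config).execute).result()

print(seen_without_plan, seen_with_plan)
```

The first task sees nothing because the thread-local was set on a different thread; the second sees the config because `execute()` sets it on the worker.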

</details>

<details>
<summary>Frontend Implementation Details (OSD)</summary>

### Changes

| File | Change |
|------|--------|
| `ppl_search_strategy.ts` | Calls `getHighlightRequest()` and attaches config to `request.body.highlight` |
| `facet.ts` | Forwards `request.body.highlight` into `params.body` sent to `/_plugins/_ppl` |

### Response handling

| File | Role |
|------|------|
| `ppl_search_strategy.ts` | Stores `rawResponse.data.highlights` on `dataFrame.meta` |
| `data_frames/utils.ts` | Attaches `meta.highlights[i]` as `hit.highlight` per row in `convertResult()` |

`getHighlightRequest()` is the same function OSD uses for DSL — no new OSD-specific code
needed.

</details>

<details>
<summary>Sample Queries & Responses</summary>

Test data: `accounts.json` (1000 documents with fields like `firstname`, `lastname`,
`address`, `age`, etc.)

### With highlight (OSD-style — wildcard fields, OSD tags)

```bash
curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq
```

**Response** (trimmed to highlights):

```json
{
  "highlights": [
    {
      "firstname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "firstname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    },
    {
      "lastname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "lastname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    },
    {
      "address": ["880 @opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@ Lane"]
    }
  ],
  "total": 3,
  "size": 3
}
```

### Without highlight (backward compatible)

```bash
curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\""
  }' | jq
```

**Response**: No `highlights` field — backward compatible.

### Explain (highlight appears in generated DSL)

```bash
curl -s -X POST "localhost:9200/_plugins/_ppl/_explain" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq
```

**Response** (relevant portion):

```json
{
  "query": { "query_string": { "query": "Holmes" } },
  "highlight": {
    "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
    "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
    "fields": { "*": { "fragment_size": 2147483647 } }
  }
}
```

### Custom tags + specific field (API user scenario)

```bash
curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\"",
    "highlight": {
      "fields": { "address": {} },
      "pre_tags": ["<em>"],
      "post_tags": ["</em>"],
      "fragment_size": 200
    }
  }' | jq
```

**Response**:

```json
{
  "highlights": [
    null,
    null,
    { "address": ["880 <em>Holmes</em> Lane"] }
  ],
  "total": 3,
  "size": 3
}
```

Only `address` is highlighted. Rows 1 and 2 are `null` because "Holmes" appears in
`firstname`/`lastname`, not `address`.

### Text search + piped filter

```bash
curl -s -X POST "localhost:9200/_plugins/_ppl" \
  -H 'Content-Type: application/json' -d '{
    "query": "search source=accounts \"Holmes\" | where balance > 40000",
    "highlight": {
      "fields": { "*": {} },
      "pre_tags": ["@opensearch-dashboards-highlighted-field@"],
      "post_tags": ["@/opensearch-dashboards-highlighted-field@"],
      "fragment_size": 2147483647
    }
  }' | jq
```

**Response**:

```json
{
  "highlights": [
    {
      "lastname": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"],
      "lastname.keyword": ["@opensearch-dashboards-highlighted-field@Holmes@/opensearch-dashboards-highlighted-field@"]
    }
  ],
  "total": 1,
  "size": 1
}
```

The `where balance > 40000` narrows results but does not produce highlights (it is a
range filter, not a full-text query).

</details>
