Skip to content

Fixes #25721: Handle Pinot JSON results with a custom type#27530

Draft
hyspacex wants to merge 1 commit intoopen-metadata:mainfrom
hyspacex:pinot-json-type-fix
Draft

Fixes #25721: Handle Pinot JSON results with a custom type#27530
hyspacex wants to merge 1 commit intoopen-metadata:mainfrom
hyspacex:pinot-json-type-fix

Conversation

@hyspacex
Copy link
Copy Markdown

What changed

  • add a Pinot-specific PinotJSONType(types.JSON) with a custom result_processor
  • map Pinot json columns to that type instead of bare SQLAlchemy JSON
  • preserve OpenMetadata JSON semantics by registering the custom type in the column type parser
  • add Pinot regression tests for the connector-local JSON processing path

Why

Pinot returns JSON values in two different shapes depending on engine mode:

  • single-stage can return already-deserialized Python containers
  • multistage returns raw JSON strings

With SQLAlchemy 1.4.x, the generic types.JSON result pipeline crashes when Pinot returns non-bytes JSON payloads on the sample-data path. SQLAlchemy 2.0 masks the issue on main, but there was no Pinot-specific regression coverage for it.

This change makes Pinot JSON handling explicit in the connector instead of depending on SQLAlchemy version-specific behavior, while keeping Pinot columns exposed as JSON in OM.

Validation

  • reproduced the original failure against the 1.11.8 stack with PinotDB and multistage enabled
  • verified live PinotDB behavior: single-stage returns Python containers, multistage returns raw JSON strings for the same JSON column
  • verified the patched SQLAlchemy path against live PinotDB: both engine modes return parsed Python lists
  • ran the Pinot connector regression tests in a containerized ingestion environment

Notes

  • make generate completed successfully in the containerized toolchain used for verification
  • I also attempted the broader ingestion unit suite via the repo-supported path, but that environment currently hits an unrelated dependency-install failure while building .[all] (pkg_resources/cx_Oracle under build isolation) before test execution
  • open to extracting an isinstance fallback in get_column_type_mapping if maintainers prefer that over the current explicit Pinot type registration in the generic parser

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment on lines +33 to +47
def process(value: Any) -> Any:
if value is None or isinstance(value, (dict, list)):
return value

if isinstance(value, (str, bytes, bytearray)):
try:
return json.loads(value)
except (TypeError, ValueError) as exc:
logger.warning(
"Failed to deserialize Pinot JSON value. Returning raw value instead: %s",
exc,
)
return value

return value
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: result_processor passes through numeric/bool JSON scalars unparsed

In PinotJSONType.result_processor, a raw JSON numeric or boolean scalar (e.g., the string "42" or "true" representing a JSON column value) would fall through to the isinstance(value, str) branch and be parsed correctly via json.loads. However, if Pinot's single-stage engine returns a Python int, float, or bool (which are valid JSON scalars), these fall through to the final return value without being covered by the isinstance(value, (dict, list)) short-circuit. This is functionally correct (they're already deserialized), but the docstring and the test coverage only describe containers. Consider adding a test case for scalar JSON values (int/float/bool) to document this behavior explicitly.

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 20, 2026

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Integrates custom type handling for Pinot JSON results to resolve parsing errors. Ensure result_processor is updated to correctly handle numeric and boolean JSON scalars.

💡 Edge Case: result_processor passes through numeric/bool JSON scalars unparsed

📄 ingestion/src/metadata/ingestion/source/database/pinotdb/custom_types.py:33-47 📄 ingestion/tests/unit/topology/database/test_pinotdb.py:66-80

In PinotJSONType.result_processor, a raw JSON numeric or boolean scalar (e.g., the string "42" or "true" representing a JSON column value) would fall through to the isinstance(value, str) branch and be parsed correctly via json.loads. However, if Pinot's single-stage engine returns a Python int, float, or bool (which are valid JSON scalars), these fall through to the final return value without being covered by the isinstance(value, (dict, list)) short-circuit. This is functionally correct (they're already deserialized), but the docstring and the test coverage only describe containers. Consider adding a test case for scalar JSON values (int/float/bool) to document this behavior explicitly.

🤖 Prompt for agents
Code Review: Integrates custom type handling for Pinot JSON results to resolve parsing errors. Ensure result_processor is updated to correctly handle numeric and boolean JSON scalars.

1. 💡 Edge Case: result_processor passes through numeric/bool JSON scalars unparsed
   Files: ingestion/src/metadata/ingestion/source/database/pinotdb/custom_types.py:33-47, ingestion/tests/unit/topology/database/test_pinotdb.py:66-80

   In `PinotJSONType.result_processor`, a raw JSON numeric or boolean scalar (e.g., the string `"42"` or `"true"` representing a JSON column value) would fall through to the `isinstance(value, str)` branch and be parsed correctly via `json.loads`. However, if Pinot's single-stage engine returns a Python `int`, `float`, or `bool` (which are valid JSON scalars), these fall through to the final `return value` without being covered by the `isinstance(value, (dict, list))` short-circuit. This is functionally correct (they're already deserialized), but the docstring and the test coverage only describe containers. Consider adding a test case for scalar JSON values (int/float/bool) to document this behavior explicitly.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant