Datalake improvements, JSON sampling, Tags ingestion #27401
Conversation
Pull request overview
This PR improves Datalake ingestion behavior by making schema inference more efficient (sampling fewer rows), adding optional cloud-object tag ingestion for Datalake files, and tightening some credential/session plumbing for cloud readers.
Changes:
- Add a schema-inference mode to DataFrame readers so `read_first_chunk()` samples a small number of records instead of default chunk sizes.
- Add opt-in Datalake table tag ingestion by reading provider object tags/metadata (S3/GCS/Azure) and mapping them to OpenMetadata classifications/tags.
- Pass the boto3 session through sampling/reader paths and adjust cloud utilities/tests accordingly.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| ingestion/src/metadata/readers/dataframe/base.py | Forces schema_inference=True for read_first_chunk() to enable sampling behavior in readers. |
| ingestion/src/metadata/readers/dataframe/json.py | Adds schema-inference sampling for JSON/JSONL streaming and threads file_size through. |
| ingestion/src/metadata/readers/dataframe/dsv.py | Adds schema-inference sampling for CSV/TSV reading (pandas chunk size reduction). |
| ingestion/src/metadata/readers/dataframe/parquet.py | Adds schema-inference batch sizing and avoids redundant size lookups when provided. |
| ingestion/src/metadata/readers/dataframe/avro.py | Adds schema-inference batch sizing and changes S3 Avro reading strategy. |
| ingestion/src/metadata/readers/file/adls.py | Returns only populated Azure storage options (enables DefaultAzureCredential fallback). |
| ingestion/src/metadata/utils/datalake/datalake_utils.py | Changes error behavior for first-chunk fetching (now re-raises). |
| ingestion/src/metadata/utils/s3_utils.py | Re-raises exceptions from S3 pagination helper after logging. |
| ingestion/src/metadata/utils/credentials.py | Adds temp credential file creation for GCP external account credentials. |
| ingestion/src/metadata/mixins/pandas/pandas_mixin.py | Extends get_dataframes to accept an optional boto3 session. |
| ingestion/src/metadata/sampler/pandas/sampler.py | Passes session from Datalake client into dataframe reader path. |
| ingestion/src/metadata/ingestion/source/database/datalake/metadata.py | Implements opt-in per-file tag ingestion for Datalake via provider object tags. |
| ingestion/src/metadata/ingestion/source/database/datalake/clients/base.py | Adds optional get_object_tags() API to DL clients. |
| ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py | Implements S3 object tag retrieval via get_object_tagging. |
| ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py | Implements tag retrieval from blob metadata; modifies test list-buckets function behavior. |
| ingestion/src/metadata/ingestion/source/database/datalake/clients/azure_blob.py | Implements Azure blob tag retrieval via get_blob_tags. |
| ingestion/tests/unit/readers/test_json_reader.py | Adds unit tests validating schema inference reads only a small sample. |
| ingestion/tests/unit/readers/test_s3_reader_credentials.py | Adds S3 JSONL sampling test for schema inference. |
| ingestion/tests/unit/readers/test_avro_reader.py | Updates Avro S3 test expectation to a single get_object call. |
| ingestion/tests/unit/readers/test_parquet_reader.py | Switches GCS mocks to MagicMock for filesystem mocking. |
```python
credentials_dict = gcp_credentials.gcpConfig.model_dump()
credentials_dict["type"] = "external_account"
```
For GcpExternalAccount, building the credentials JSON via model_dump() produces field names like externalType, tokenURL, and subjectTokenType that do not match the expected Google external account credential file format (snake_case keys). This also differs from build_google_credentials_dict() (and the existing unit test expectations). Use the same mapping helper for external accounts to generate the correct JSON structure before writing the temp credentials file.
```diff
-credentials_dict = gcp_credentials.gcpConfig.model_dump()
-credentials_dict["type"] = "external_account"
+credentials_dict = build_google_credentials_dict(
+    gcp_credentials.gcpConfig, single_project
+)
```
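The camelCase/snake_case mismatch can be sketched as a small key-mapping helper — a hypothetical illustration, not the actual `build_google_credentials_dict()` implementation (the field names come from the comment above):

```python
# Hypothetical sketch: mapping pydantic-style camelCase fields to the
# snake_case keys a Google external-account credential file expects.
FIELD_MAP = {
    "externalType": "type",
    "subjectTokenType": "subject_token_type",
    "tokenURL": "token_url",
}

def to_external_account_json(dumped: dict) -> dict:
    """Rename known camelCase keys; pass everything else through."""
    return {FIELD_MAP.get(key, key): value for key, value in dumped.items()}

dumped = {
    "externalType": "external_account",
    "tokenURL": "https://sts.googleapis.com/v1/token",
}
print(to_external_account_json(dumped))
```

Writing `model_dump()` output directly would leave `externalType`/`tokenURL` keys that Google's credential loader does not recognize.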
```python
logger.error(
    f"Error fetching first chunk of file [{bucket_name}/{key}] using "
    f"[{config_source.__class__.__name__}] due to: [{err}]"
)
raise err
```
`raise err` here discards the original traceback context and makes debugging harder (it also leads to duplicated logging due to the outer `except`). Use a bare `raise` to preserve the original stack trace when rethrowing inside an `except` block.
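A minimal sketch of the suggested pattern — a bare `raise` keeps the frame where the error originated visible in the formatted traceback:

```python
import traceback

def fetch_first_chunk():
    raise RuntimeError("boom")  # stands in for the failing fetch

def log_and_rethrow():
    try:
        fetch_first_chunk()
    except Exception:
        # log here, then rethrow without naming the exception:
        raise  # bare raise preserves the original traceback

try:
    log_and_rethrow()
except RuntimeError:
    tb = traceback.format_exc()

# the originating frame is still present in the traceback text
assert "fetch_first_chunk" in tb
```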
```diff
 def get_test_list_buckets_fn(self, bucket_name: Optional[str]) -> Callable:
-    if bucket_name:
-        fn = partial(self._client.get_bucket, bucket_name)
-    else:
-        fn = self._client.list_buckets
-
-    os.environ.pop("GOOGLE_CLOUD_PROJECT", "")
-    if GOOGLE_CREDENTIALS in os.environ:
-        os.remove(os.environ[GOOGLE_CREDENTIALS])
-        del os.environ[GOOGLE_CREDENTIALS]
-
-    return fn
+    if bucket_name:
+        return partial(self._client.get_bucket, bucket_name)
+    return self._client.list_buckets
```
This method no longer cleans up GOOGLE_APPLICATION_CREDENTIALS / temp credential files after running the test callable. Since set_google_credentials() writes temp files and close() does not reliably remove them for all credential types/config wrappers, this can leak temp files and leave env vars set for the rest of the process (impacting other tests/connectors). Consider restoring cleanup here or ensuring close() unsets the env var and deletes tracked temp files for all GCP credential modes.
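One way to restore the cleanup — a sketch only, with the env-var names and `GOOGLE_CREDENTIALS` constant assumed from the removed lines above:

```python
import os

GOOGLE_CREDENTIALS = "GOOGLE_APPLICATION_CREDENTIALS"  # assumed constant name

def run_with_credential_cleanup(test_fn):
    """Run the connection-test callable, then drop env vars and temp files."""
    try:
        return test_fn()
    finally:
        os.environ.pop("GOOGLE_CLOUD_PROJECT", None)
        path = os.environ.pop(GOOGLE_CREDENTIALS, None)
        if path and os.path.exists(path):
            os.remove(path)  # delete the temp credentials file

os.environ["GOOGLE_CLOUD_PROJECT"] = "demo-project"
assert run_with_credential_cleanup(lambda: "ok") == "ok"
assert "GOOGLE_CLOUD_PROJECT" not in os.environ
```

Wrapping the callable this way keeps cleanup tied to the test run even when the callable raises.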
| """Stream Avro from S3 without loading entire file into memory.""" | ||
| schema_response = self.client.get_object(Bucket=bucket_name, Key=key) | ||
| import io | ||
|
|
||
| response = self.client.get_object(Bucket=bucket_name, Key=key) | ||
| try: | ||
| columns = self._get_avro_columns(schema_response["Body"]) | ||
| avro_bytes = io.BytesIO(response["Body"].read()) | ||
| finally: | ||
| schema_response["Body"].close() | ||
| response["Body"].close() |
In the S3 Avro reader, response["Body"].read() loads the entire object into memory. This defeats the stated goal of streaming Avro and makes read_first_chunk still download/read the full file, which can cause OOM or large latency for big Avro objects. Consider reverting to streaming from the S3 body (even if that requires a second get_object call), or using a bounded/ranged read just for schema inference.
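The bounded-read option could use an S3 ranged GET (`Range` is a standard `get_object` parameter); the 64 KiB cap here is an assumption, since real Avro headers vary in size:

```python
import io

SCHEMA_SAMPLE_BYTES = 64 * 1024  # assumed cap for schema inference

def read_schema_sample(client, bucket: str, key: str) -> io.BytesIO:
    """Fetch only the first bytes of the object instead of the full body."""
    response = client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{SCHEMA_SAMPLE_BYTES - 1}"
    )
    try:
        return io.BytesIO(response["Body"].read())
    finally:
        response["Body"].close()
```

This keeps `read_first_chunk` memory-bounded regardless of object size, at the cost of a second `get_object` call for full reads.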
```diff
+    batch_size = (
+        SCHEMA_INFERENCE_SAMPLE_SIZE
+        if getattr(self, "_schema_inference", False)
+        else CHUNKSIZE
+    )
-    def chunk_generator():
-        response = self.client.get_object(Bucket=bucket_name, Key=key)
-        try:
-            yield from self._stream_avro_records(response["Body"])
-        finally:
-            response["Body"].close()
+    yield from self._stream_avro_records(avro_bytes, batch_size=batch_size)
```
chunk_generator() streams from a shared BytesIO that is created once and advanced as it is consumed. If wrapper.dataframes() is invoked more than once (a common pattern for generator factories in this codebase), subsequent iterations will return no data unless the buffer is rewound. Ensure each generator invocation starts from position 0 (or constructs a fresh stream) so dataframes remains repeatable.
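A sketch of the repeatability fix — rewinding the shared buffer at the start of every invocation (names are illustrative, not the actual reader API):

```python
import io

def make_dataframes(avro_bytes: io.BytesIO):
    """Return a generator factory that rewinds the buffer on every call."""
    def dataframes():
        avro_bytes.seek(0)  # start from position 0 each invocation
        while chunk := avro_bytes.read(4):
            yield chunk
    return dataframes

buf = io.BytesIO(b"abcdefgh")
dataframes = make_dataframes(buf)
assert list(dataframes()) == [b"abcd", b"efgh"]
assert list(dataframes()) == [b"abcd", b"efgh"]  # second call is not empty
```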
That is a valid comment. The initial implementation ensured that consecutive calls would pull from a fresh generator.
```python
except Exception as exc:
    logger.debug(traceback.format_exc())
    logger.warning(f"Unexpected exception to yield s3 object: {exc}")
    raise
```
⚠️ Bug: Re-raising in list_s3_objects breaks callers that expect silent failure
Adding raise at the end of the except block in list_s3_objects changes a long-standing contract: callers previously relied on this function to log-and-swallow exceptions, yielding nothing on failure. Multiple callers (s3.py:get_table_names, file_client.py:get_pbit_files, s3/metadata.py:_generate_structured_containers_by_depth, s3/metadata.py:_yield_nested_unstructured_containers) do not wrap calls in try/except and will now propagate unhandled exceptions, potentially crashing entire ingestion workflows on transient S3 errors (e.g., access denied on a single prefix).
Suggested fix:
If the intent is to surface errors, add exception handling in each caller so a single prefix failure doesn't abort the whole ingestion. Alternatively, remove the re-raise and keep the original swallow-and-continue behavior, adding the re-raise only in the specific code paths that need it.
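The first option could look like this at a call site — a hedged sketch, not the actual connector code:

```python
import logging

logger = logging.getLogger(__name__)

def iter_prefixes_safely(prefix_iterators):
    """Yield objects per prefix; one failing prefix logs instead of aborting."""
    for prefix, iterator in prefix_iterators:
        try:
            yield from iterator
        except Exception as exc:
            logger.warning(f"Skipping prefix [{prefix}] due to: {exc}")

def denied():
    yield "obj-a"
    raise PermissionError("access denied")  # transient per-prefix failure

objects = list(iter_prefixes_safely([("p1", denied()), ("p2", iter(["obj-b"]))]))
assert objects == ["obj-a", "obj-b"]
```

This keeps the re-raise in `list_s3_objects` while ensuring a single bad prefix does not abort the whole ingestion.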
```python
options = {}
if connection_args.tenantId:
    options["tenant_id"] = connection_args.tenantId
if connection_args.clientId:
    options["client_id"] = connection_args.clientId
if connection_args.clientSecret:
    options["client_secret"] = connection_args.clientSecret.get_secret_value()
return options
```
💡 Edge Case: return_azure_storage_options may return empty dict
The refactored return_azure_storage_options now conditionally adds each field. If all three fields (tenantId, clientId, clientSecret) are None/empty, it returns {}. This is intentional ("allowing DefaultAzureCredential fallback" per the docstring), but downstream code that unpacks these options with **storage_options into adlfs.AzureBlobFileSystem(...) should be verified to work correctly with no auth options. This is likely fine if Azure's DefaultCredential chain is configured, but worth a note in case it causes auth failures in environments where the old explicit credentials were expected.
```python
batch_size = (
    SCHEMA_INFERENCE_SAMPLE_SIZE
    if getattr(self, "_schema_inference", False)
    else CHUNKSIZE
)
```
💡 Quality: Schema inference flag uses instance attr instead of parameter
All readers communicate the schema_inference flag by setting self._schema_inference in _read() and reading it back via getattr(self, '_schema_inference', False) in dispatch methods. This mutable-state-on-self pattern is fragile — if a reader instance is reused across calls (first read_first_chunk then read), the _schema_inference flag from the previous call leaks. It works today because _read always sets it, but the pattern is error-prone. Consider passing schema_inference as a parameter through the dispatch chain instead, or at minimum resetting it in read() as well.
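The parameter-passing alternative can be sketched as follows (constants and method names are illustrative, not the real reader API):

```python
CHUNKSIZE = 10_000
SCHEMA_INFERENCE_SAMPLE_SIZE = 100

class Reader:
    """Threads schema_inference through the call chain; no state on self."""

    def read(self, schema_inference: bool = False) -> int:
        return self._dispatch(schema_inference=schema_inference)

    def read_first_chunk(self) -> int:
        return self.read(schema_inference=True)

    def _dispatch(self, schema_inference: bool) -> int:
        # the flag arrives as an argument, so nothing leaks between calls
        return SCHEMA_INFERENCE_SAMPLE_SIZE if schema_inference else CHUNKSIZE

reader = Reader()
assert reader.read_first_chunk() == 100
assert reader.read() == 10_000  # no leaked flag from the previous call
```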
🔴 Playwright Results — 1 failure, 26 flaky
✅ 3638 passed · ❌ 1 failed · 🟡 26 flaky · ⏭️ 84 skipped
Genuine failures (failed on all attempts): ❌ 1