Datalake improvements, JSON sampling, Tags ingestion #27401

Open

harshach wants to merge 2 commits into main from datalake_improvements

Conversation

@harshach
Collaborator

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

@harshach harshach requested a review from a team as a code owner April 15, 2026 14:31
Copilot AI review requested due to automatic review settings April 15, 2026 14:31
@github-actions github-actions bot added the backend and safe to test labels Apr 15, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR improves Datalake ingestion behavior by making schema inference more efficient (sampling fewer rows), adding optional cloud-object tag ingestion for Datalake files, and tightening some credential/session plumbing for cloud readers.

Changes:

  • Add a schema-inference mode to DataFrame readers so read_first_chunk() samples a small number of records instead of default chunk sizes.
  • Add opt-in Datalake table tag ingestion by reading provider object tags/metadata (S3/GCS/Azure) and mapping them to OpenMetadata classifications/tags.
  • Pass boto3 session through sampling/reader paths and adjust cloud utilities/tests accordingly.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ingestion/src/metadata/readers/dataframe/base.py Forces schema_inference=True for read_first_chunk() to enable sampling behavior in readers.
ingestion/src/metadata/readers/dataframe/json.py Adds schema-inference sampling for JSON/JSONL streaming and threads file_size through.
ingestion/src/metadata/readers/dataframe/dsv.py Adds schema-inference sampling for CSV/TSV reading (pandas chunk size reduction).
ingestion/src/metadata/readers/dataframe/parquet.py Adds schema-inference batch sizing and avoids redundant size lookups when provided.
ingestion/src/metadata/readers/dataframe/avro.py Adds schema-inference batch sizing and changes S3 Avro reading strategy.
ingestion/src/metadata/readers/file/adls.py Returns only populated Azure storage options (enables DefaultAzureCredential fallback).
ingestion/src/metadata/utils/datalake/datalake_utils.py Changes error behavior for first-chunk fetching (now re-raises).
ingestion/src/metadata/utils/s3_utils.py Re-raises exceptions from S3 pagination helper after logging.
ingestion/src/metadata/utils/credentials.py Adds temp credential file creation for GCP external account credentials.
ingestion/src/metadata/mixins/pandas/pandas_mixin.py Extends get_dataframes to accept an optional boto3 session.
ingestion/src/metadata/sampler/pandas/sampler.py Passes session from Datalake client into dataframe reader path.
ingestion/src/metadata/ingestion/source/database/datalake/metadata.py Implements opt-in per-file tag ingestion for Datalake via provider object tags.
ingestion/src/metadata/ingestion/source/database/datalake/clients/base.py Adds optional get_object_tags() API to DL clients.
ingestion/src/metadata/ingestion/source/database/datalake/clients/s3.py Implements S3 object tag retrieval via get_object_tagging.
ingestion/src/metadata/ingestion/source/database/datalake/clients/gcs.py Implements tag retrieval from blob metadata; modifies test list-buckets function behavior.
ingestion/src/metadata/ingestion/source/database/datalake/clients/azure_blob.py Implements Azure blob tag retrieval via get_blob_tags.
ingestion/tests/unit/readers/test_json_reader.py Adds unit tests validating schema inference reads only a small sample.
ingestion/tests/unit/readers/test_s3_reader_credentials.py Adds S3 JSONL sampling test for schema inference.
ingestion/tests/unit/readers/test_avro_reader.py Updates Avro S3 test expectation to a single get_object call.
ingestion/tests/unit/readers/test_parquet_reader.py Switches GCS mocks to MagicMock for filesystem mocking.

Comment on lines +207 to +208
credentials_dict = gcp_credentials.gcpConfig.model_dump()
credentials_dict["type"] = "external_account"

Copilot AI Apr 15, 2026


For GcpExternalAccount, building the credentials JSON via model_dump() produces field names like externalType, tokenURL, and subjectTokenType that do not match the expected Google external account credential file format (snake_case keys). This also differs from build_google_credentials_dict() (and the existing unit test expectations). Use the same mapping helper for external accounts to generate the correct JSON structure before writing the temp credentials file.

Suggested change:

-credentials_dict = gcp_credentials.gcpConfig.model_dump()
-credentials_dict["type"] = "external_account"
+credentials_dict = build_google_credentials_dict(
+    gcp_credentials.gcpConfig, single_project
+)

Comment on lines 138 to +142
logger.error(
f"Error fetching first chunk of file [{bucket_name}/{key}] using "
f"[{config_source.__class__.__name__}] due to: [{err}]"
)
raise err

Copilot AI Apr 15, 2026


raise err here discards the original traceback context and makes debugging harder (it also leads to duplicated logging due to the outer except). Use a bare raise to preserve the original stack trace when rethrowing inside an except block.
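A minimal demonstration of the bare-raise pattern described here:

```python
import traceback

def risky():
    raise ValueError("boom")

def fetch_first_chunk():
    try:
        risky()
    except Exception:
        # log here, then use a bare `raise`: it re-raises the active
        # exception with its original traceback intact
        raise

try:
    fetch_first_chunk()
except ValueError:
    tb = traceback.format_exc()

print("risky" in tb)  # True: the originating frame survives the re-raise
```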

Comment on lines 162 to +166
 def get_test_list_buckets_fn(self, bucket_name: Optional[str]) -> Callable:
-    if bucket_name:
-        fn = partial(self._client.get_bucket, bucket_name)
-    else:
-        fn = self._client.list_buckets
-
-    os.environ.pop("GOOGLE_CLOUD_PROJECT", "")
-    if GOOGLE_CREDENTIALS in os.environ:
-        os.remove(os.environ[GOOGLE_CREDENTIALS])
-        del os.environ[GOOGLE_CREDENTIALS]
-
-    return fn
+    if bucket_name:
+        return partial(self._client.get_bucket, bucket_name)
+    return self._client.list_buckets

Copilot AI Apr 15, 2026


This method no longer cleans up GOOGLE_APPLICATION_CREDENTIALS / temp credential files after running the test callable. Since set_google_credentials() writes temp files and close() does not reliably remove them for all credential types/config wrappers, this can leak temp files and leave env vars set for the rest of the process (impacting other tests/connectors). Consider restoring cleanup here or ensuring close() unsets the env var and deletes tracked temp files for all GCP credential modes.
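The suggested cleanup could take the shape of a context manager. `temp_google_credentials` is a hypothetical helper sketched here, not an existing function in the codebase:

```python
import os
import tempfile
from contextlib import contextmanager

GOOGLE_CREDENTIALS = "GOOGLE_APPLICATION_CREDENTIALS"

@contextmanager
def temp_google_credentials(payload: str):
    # Write the credentials JSON to a temp file, expose it via the env
    # var, and guarantee cleanup of both the file and the variable.
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(payload)
        os.environ[GOOGLE_CREDENTIALS] = path
        yield path
    finally:
        os.environ.pop(GOOGLE_CREDENTIALS, None)
        if os.path.exists(path):
            os.remove(path)

with temp_google_credentials('{"type": "external_account"}') as creds_path:
    assert os.environ[GOOGLE_CREDENTIALS] == creds_path

print(GOOGLE_CREDENTIALS in os.environ)  # False: env var and file are gone
```

Wrapping the test callable this way removes the need for ad-hoc cleanup inside `get_test_list_buckets_fn`.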

Collaborator


@harshach that is a valid comment

Comment on lines 112 to +119
"""Stream Avro from S3 without loading entire file into memory."""
schema_response = self.client.get_object(Bucket=bucket_name, Key=key)
import io

response = self.client.get_object(Bucket=bucket_name, Key=key)
try:
columns = self._get_avro_columns(schema_response["Body"])
avro_bytes = io.BytesIO(response["Body"].read())
finally:
schema_response["Body"].close()
response["Body"].close()

Copilot AI Apr 15, 2026


In the S3 Avro reader, response["Body"].read() loads the entire object into memory. This defeats the stated goal of streaming Avro and makes read_first_chunk still download/read the full file, which can cause OOM or large latency for big Avro objects. Consider reverting to streaming from the S3 body (even if that requires a second get_object call), or using a bounded/ranged read just for schema inference.
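A bounded ranged read for schema inference might look like the sketch below. `read_schema_sample` and the stub client are illustrative; the only thing assumed from the real S3 API is that `get_object` accepts a `Range` header:

```python
def read_schema_sample(client, bucket: str, key: str,
                       max_bytes: int = 64 * 1024) -> bytes:
    # Fetch only the first max_bytes of the object, which is enough for
    # an Avro header/schema, instead of downloading the whole file.
    # `client` is anything with a boto3-style get_object(...) method.
    response = client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}"
    )
    try:
        return response["Body"].read()
    finally:
        response["Body"].close()

# Minimal stub standing in for a boto3 S3 client:
class _Body:
    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

    def close(self):
        pass

class _StubS3Client:
    DATA = b"\x00" * (256 * 1024)  # 256 KiB fake object

    def get_object(self, Bucket, Key, Range):
        start, end = (int(n) for n in Range[len("bytes="):].split("-"))
        return {"Body": _Body(self.DATA[start:end + 1])}

sample = read_schema_sample(_StubS3Client(), "my-bucket", "data.avro")
print(len(sample))  # 65536: only the first 64 KiB was fetched
```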

Collaborator


@harshach valid comment

Comment on lines +124 to 132
+batch_size = (
+    SCHEMA_INFERENCE_SAMPLE_SIZE
+    if getattr(self, "_schema_inference", False)
+    else CHUNKSIZE
+)

 def chunk_generator():
-    response = self.client.get_object(Bucket=bucket_name, Key=key)
-    try:
-        yield from self._stream_avro_records(response["Body"])
-    finally:
-        response["Body"].close()
+    yield from self._stream_avro_records(avro_bytes, batch_size=batch_size)


Copilot AI Apr 15, 2026


chunk_generator() streams from a shared BytesIO that is created once and advanced as it is consumed. If wrapper.dataframes() is invoked more than once (a common pattern for generator factories in this codebase), subsequent iterations will return no data unless the buffer is rewound. Ensure each generator invocation starts from position 0 (or constructs a fresh stream) so dataframes remains repeatable.
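The rewind fix can be sketched like this; `make_repeatable_dataframes` is a hypothetical stand-in for the reader's generator factory:

```python
import io

def make_repeatable_dataframes(buffer: io.BytesIO):
    # Each invocation rewinds the shared buffer before iterating, so a
    # second call to dataframes() sees the same data as the first.
    def dataframes():
        buffer.seek(0)  # start every fresh iteration at position 0
        while chunk := buffer.read(4):
            yield chunk
    return dataframes

buf = io.BytesIO(b"abcdefgh")
dataframes = make_repeatable_dataframes(buf)
first = list(dataframes())
second = list(dataframes())
print(first == second)  # True: the rewind keeps the generator repeatable
```

Without the `seek(0)`, the second iteration would start at the exhausted end of the buffer and yield nothing.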

Collaborator


That is a valid comment. The initial implementation ensured that consecutive calls pull from a fresh generator.

except Exception as exc:
logger.debug(traceback.format_exc())
logger.warning(f"Unexpected exception to yield s3 object: {exc}")
raise


⚠️ Bug: Re-raising in list_s3_objects breaks callers that expect silent failure

Adding raise at the end of the except block in list_s3_objects changes a long-standing contract: callers previously relied on this function to log-and-swallow exceptions, yielding nothing on failure. Multiple callers (s3.py:get_table_names, file_client.py:get_pbit_files, s3/metadata.py:_generate_structured_containers_by_depth, s3/metadata.py:_yield_nested_unstructured_containers) do not wrap calls in try/except and will now propagate unhandled exceptions, potentially crashing entire ingestion workflows on transient S3 errors (e.g., access denied on a single prefix).

Suggested fix:

If the intent is to surface errors, add exception handling in each caller so a single prefix failure doesn't abort the whole ingestion. Alternatively, remove the re-raise and keep the original swallow-and-continue behavior, adding the re-raise only in the specific code paths that need it.
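The caller-side guard suggested above could be sketched as follows. `iter_prefix_safely` and `flaky_listing` are hypothetical names used only for illustration:

```python
import logging

logger = logging.getLogger(__name__)

def iter_prefix_safely(list_fn, prefix):
    # Caller-side guard: a failure on one prefix is logged and skipped
    # instead of aborting the whole ingestion run.
    try:
        yield from list_fn(prefix)
    except Exception as exc:
        logger.warning("Skipping prefix %r due to: %s", prefix, exc)

def flaky_listing(prefix):
    # Simulates an S3 listing that fails for one prefix.
    if prefix == "denied/":
        raise PermissionError("access denied")
    yield from (f"{prefix}file-{i}" for i in range(2))

results = [obj for p in ("ok/", "denied/", "also-ok/")
           for obj in iter_prefix_safely(flaky_listing, p)]
print(results)  # ['ok/file-0', 'ok/file-1', 'also-ok/file-0', 'also-ok/file-1']
```

With this shape, the low-level helper can still re-raise while a single denied prefix no longer crashes the workflow.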


Comment on lines +39 to +46
options = {}
if connection_args.tenantId:
options["tenant_id"] = connection_args.tenantId
if connection_args.clientId:
options["client_id"] = connection_args.clientId
if connection_args.clientSecret:
options["client_secret"] = connection_args.clientSecret.get_secret_value()
return options


💡 Edge Case: return_azure_storage_options may return empty dict

The refactored return_azure_storage_options now conditionally adds each field. If all three fields (tenantId, clientId, clientSecret) are None/empty, it returns {}. This is intentional ("allowing DefaultAzureCredential fallback" per the docstring), but downstream code that unpacks these options with **storage_options into adlfs.AzureBlobFileSystem(...) should be verified to work correctly with no auth options. This is likely fine if Azure's DefaultCredential chain is configured, but worth a note in case it causes auth failures in environments where the old explicit credentials were expected.
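A toy illustration of why the empty dict is safe to unpack; `fake_blob_filesystem` is a stand-in for `adlfs.AzureBlobFileSystem`, not the real class:

```python
def fake_blob_filesystem(account_name, tenant_id=None,
                         client_id=None, client_secret=None):
    # Stand-in for adlfs.AzureBlobFileSystem: when no auth kwargs arrive,
    # the real class would defer to Azure's DefaultAzureCredential chain
    # (environment variables, managed identity, Azure CLI login, ...).
    explicit = any(v is not None for v in (tenant_id, client_id, client_secret))
    return {"account": account_name,
            "auth": "explicit" if explicit else "default-chain"}

storage_options = {}  # what return_azure_storage_options yields with no creds set
fs = fake_blob_filesystem("myaccount", **storage_options)
print(fs["auth"])  # default-chain
```

Unpacking an empty dict simply passes no keyword arguments, so the failure mode to watch for is environmental: deployments that previously relied on explicit credentials but have no working default chain.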



batch_size = (
    SCHEMA_INFERENCE_SAMPLE_SIZE
    if getattr(self, "_schema_inference", False)
    else CHUNKSIZE
)


💡 Quality: Schema inference flag uses instance attr instead of parameter

All readers communicate the schema_inference flag by setting self._schema_inference in _read() and reading it back via getattr(self, '_schema_inference', False) in dispatch methods. This mutable-state-on-self pattern is fragile — if a reader instance is reused across calls (first read_first_chunk then read), the _schema_inference flag from the previous call leaks. It works today because _read always sets it, but the pattern is error-prone. Consider passing schema_inference as a parameter through the dispatch chain instead, or at minimum resetting it in read() as well.
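The parameter-passing alternative could look roughly like this; the constants and method names are illustrative, not the actual reader API:

```python
CHUNKSIZE = 10_000
SCHEMA_INFERENCE_SAMPLE_SIZE = 100  # illustrative values

class Reader:
    def read_first_chunk(self, key: str):
        # Schema inference is an explicit argument, not hidden state on self,
        # so reusing the instance cannot leak the flag between calls.
        return self._read(key, schema_inference=True)

    def read(self, key: str):
        return self._read(key, schema_inference=False)

    def _read(self, key: str, schema_inference: bool = False):
        batch_size = (
            SCHEMA_INFERENCE_SAMPLE_SIZE if schema_inference else CHUNKSIZE
        )
        # Real readers would dispatch into format-specific logic here,
        # threading batch_size (or schema_inference) down the call chain.
        return batch_size

reader = Reader()
print(reader.read_first_chunk("a.json"), reader.read("a.json"))  # 100 10000
```

Threading the flag through the dispatch chain makes each call self-describing and keeps reader instances stateless with respect to sampling mode.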


@gitar-bot

gitar-bot bot commented Apr 15, 2026

Code Review ⚠️ Changes requested 0 resolved / 3 findings

Datalake ingestion now includes JSON sampling and tag support, but re-raising exceptions in list_s3_objects breaks existing caller contracts. Additionally, schema inference relies on instance attributes instead of parameters, and return_azure_storage_options may return an empty dictionary.

⚠️ Bug: Re-raising in list_s3_objects breaks callers that expect silent failure

📄 ingestion/src/metadata/utils/s3_utils.py:35

Adding raise at the end of the except block in list_s3_objects changes a long-standing contract: callers previously relied on this function to log-and-swallow exceptions, yielding nothing on failure. Multiple callers (s3.py:get_table_names, file_client.py:get_pbit_files, s3/metadata.py:_generate_structured_containers_by_depth, s3/metadata.py:_yield_nested_unstructured_containers) do not wrap calls in try/except and will now propagate unhandled exceptions, potentially crashing entire ingestion workflows on transient S3 errors (e.g., access denied on a single prefix).

Suggested fix
If the intent is to surface errors, add exception handling in each caller so a single prefix failure doesn't abort the whole ingestion. Alternatively, remove the re-raise and keep the original swallow-and-continue behavior, adding the re-raise only in the specific code paths that need it.
💡 Edge Case: return_azure_storage_options may return empty dict

📄 ingestion/src/metadata/readers/file/adls.py:39-46

The refactored return_azure_storage_options now conditionally adds each field. If all three fields (tenantId, clientId, clientSecret) are None/empty, it returns {}. This is intentional ("allowing DefaultAzureCredential fallback" per the docstring), but downstream code that unpacks these options with **storage_options into adlfs.AzureBlobFileSystem(...) should be verified to work correctly with no auth options. This is likely fine if Azure's DefaultCredential chain is configured, but worth a note in case it causes auth failures in environments where the old explicit credentials were expected.

💡 Quality: Schema inference flag uses instance attr instead of parameter

📄 ingestion/src/metadata/readers/dataframe/avro.py:126 📄 ingestion/src/metadata/readers/dataframe/avro.py:219 📄 ingestion/src/metadata/readers/dataframe/json.py:175 📄 ingestion/src/metadata/readers/dataframe/json.py:308 📄 ingestion/src/metadata/readers/dataframe/dsv.py:138 📄 ingestion/src/metadata/readers/dataframe/dsv.py:254 📄 ingestion/src/metadata/readers/dataframe/parquet.py:61 📄 ingestion/src/metadata/readers/dataframe/parquet.py:447

All readers communicate the schema_inference flag by setting self._schema_inference in _read() and reading it back via getattr(self, '_schema_inference', False) in dispatch methods. This mutable-state-on-self pattern is fragile — if a reader instance is reused across calls (first read_first_chunk then read), the _schema_inference flag from the previous call leaks. It works today because _read always sets it, but the pattern is error-prone. Consider passing schema_inference as a parameter through the dispatch chain instead, or at minimum resetting it in read() as well.



@github-actions
Contributor

🔴 Playwright Results — 1 failure, 26 flaky

✅ 3638 passed · ❌ 1 failed · 🟡 26 flaky · ⏭️ 84 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 477 0 3 4
🟡 Shard 2 642 0 5 7
🔴 Shard 3 645 1 7 1
🟡 Shard 4 624 0 5 22
✅ Shard 5 616 0 0 42
🟡 Shard 6 634 0 6 8

Genuine Failures (failed on all attempts)

Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3)
Error: expect(locator).toBeVisible() failed

Locator: getByTestId('KnowledgePanel.DataProducts').getByTestId('data-products-list').getByTestId('data-product-"PW%dataProduct.21ec06b9"')
Expected: visible
Timeout: 15000ms
Error: element(s) not found

Call log:
  - Expect "toBeVisible" with timeout 15000ms
  - waiting for getByTestId('KnowledgePanel.DataProducts').getByTestId('data-products-list').getByTestId('data-product-"PW%dataProduct.21ec06b9"')

🟡 26 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Table - customization should work (shard 1, 1 retry)
  • Features/CustomizeDetailPage.spec.ts › API Collection - customization should work (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/AdvancedSearch.spec.ts › Verify Group functionality for field Display Name with AND operator (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ChangeSummaryBadge.spec.ts › Automated badge should appear on entity description with Automated source (shard 2, 1 retry)
  • Features/EntitySummaryPanel.spec.ts › should cancel edit display name modal (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should inherit reviewers from glossary when term is created (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 2 retries)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should not show online status for inactive users (shard 3, 1 retry)
  • Flow/AddRoleAndAssignToUser.spec.ts › Verify assigned role to new user (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for ApiEndpoint (shard 4, 1 retry)
  • Pages/DataContractsSemanticRules.spec.ts › Validate Owner Rule Not_In (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Subdomain rename does not affect parent domain and updates nested children (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Multiple consecutive domain renames preserve all associations (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Port drawers show Entity Type quick filter (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Table (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Worksheet (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

