Skip to content

MINOR: Fix missing sesssion for s3 column ingestion#27393

Open
ulixius9 wants to merge 1 commit intomainfrom
fix_s3_missing_session
Open

MINOR: Fix missing sesssion for s3 column ingestion#27393
ulixius9 wants to merge 1 commit intomainfrom
fix_s3_missing_session

Conversation

@ulixius9
Copy link
Copy Markdown
Member

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Copilot AI review requested due to automatic review settings April 15, 2026 10:44
@ulixius9 ulixius9 requested a review from a team as a code owner April 15, 2026 10:44
@github-actions github-actions bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 15, 2026
if connection.awsConfig.endPointURL
else None
)
s3_kwargs = {"endpoint_url": endpoint_url} if endpoint_url else {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: CloudWatch client missing endpoint_url for custom endpoints

The refactored get_connection in connection.py builds s3_kwargs with the custom endpoint_url and passes it only to the S3 client (line 58), but the CloudWatch client on line 59 never receives it. The previous code used aws_client.get_client(), which applied endpoint_url to all services when configured (see aws_client.py:226-230).

This means users with a custom endPointURL (e.g., LocalStack, MinIO with CloudWatch-compatible metrics) will have a broken CloudWatch client that points to the default AWS endpoint instead of their custom one. This could cause connection test failures or incorrect metric retrieval.

Suggested fix:

kwargs = {"endpoint_url": endpoint_url} if endpoint_url else {}
return S3ObjectStoreClient(
    s3_client=session.client(service_name="s3", **kwargs),
    cloudwatch_client=session.client(service_name="cloudwatch", **kwargs),
    session=session,
)

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 15, 2026

Code Review ⚠️ Changes requested 0 resolved / 1 findings

Integrates missing session handling for S3 column ingestion but fails to include the required endpoint_url in the CloudWatch client for custom configurations.

⚠️ Bug: CloudWatch client missing endpoint_url for custom endpoints

📄 ingestion/src/metadata/ingestion/source/storage/s3/connection.py:56 📄 ingestion/src/metadata/ingestion/source/storage/s3/connection.py:59

The refactored get_connection in connection.py builds s3_kwargs with the custom endpoint_url and passes it only to the S3 client (line 58), but the CloudWatch client on line 59 never receives it. The previous code used aws_client.get_client(), which applied endpoint_url to all services when configured (see aws_client.py:226-230).

This means users with a custom endPointURL (e.g., LocalStack, MinIO with CloudWatch-compatible metrics) will have a broken CloudWatch client that points to the default AWS endpoint instead of their custom one. This could cause connection test failures or incorrect metric retrieval.

Suggested fix
kwargs = {"endpoint_url": endpoint_url} if endpoint_url else {}
return S3ObjectStoreClient(
    s3_client=session.client(service_name="s3", **kwargs),
    cloudwatch_client=session.client(service_name="cloudwatch", **kwargs),
    session=session,
)
🤖 Prompt for agents
Code Review: Integrates missing session handling for S3 column ingestion but fails to include the required endpoint_url in the CloudWatch client for custom configurations.

1. ⚠️ Bug: CloudWatch client missing endpoint_url for custom endpoints
   Files: ingestion/src/metadata/ingestion/source/storage/s3/connection.py:56, ingestion/src/metadata/ingestion/source/storage/s3/connection.py:59

   The refactored `get_connection` in `connection.py` builds `s3_kwargs` with the custom `endpoint_url` and passes it only to the S3 client (line 58), but the CloudWatch client on line 59 never receives it. The previous code used `aws_client.get_client()`, which applied `endpoint_url` to *all* services when configured (see `aws_client.py:226-230`).
   
   This means users with a custom `endPointURL` (e.g., LocalStack, MinIO with CloudWatch-compatible metrics) will have a broken CloudWatch client that points to the default AWS endpoint instead of their custom one. This could cause connection test failures or incorrect metric retrieval.

   Suggested fix:
   kwargs = {"endpoint_url": endpoint_url} if endpoint_url else {}
   return S3ObjectStoreClient(
       s3_client=session.client(service_name="s3", **kwargs),
       cloudwatch_client=session.client(service_name="cloudwatch", **kwargs),
       session=session,
   )

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to fix S3 structured column inference during storage ingestion by ensuring a boto3 Session is available and propagated into the dataframe reader path (fetch_dataframe_first_chunk), which can be required by underlying S3 filesystem/readers.

Changes:

  • Add an optional session parameter to StorageServiceSource.extract_column_definitions / _get_columns and pass it through to fetch_dataframe_first_chunk.
  • Capture and pass a session from the S3 source (S3Source) into _get_columns.
  • Update the S3 connection wrapper to expose the created boto3 session alongside the S3/CloudWatch clients.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
ingestion/src/metadata/ingestion/source/storage/storage_service.py Plumbs an optional session through the generic storage column-extraction path into fetch_dataframe_first_chunk.
ingestion/src/metadata/ingestion/source/storage/s3/metadata.py Stores the connection session on the source and passes it into _get_columns during container detail generation.
ingestion/src/metadata/ingestion/source/storage/s3/connection.py Creates and returns a boto3 session as part of S3ObjectStoreClient, and uses it to build service clients.

metadata_entry: MetadataEntry,
session: Any = None,
) -> List[Column]:
"""Extract Column related metadata from s3"""
Comment on lines 340 to 347
extracted_cols = self.extract_column_definitions(
container_name, sample_key, config_source, client, metadata_entry
container_name,
sample_key,
config_source,
client,
metadata_entry,
session,
)
s3_client=aws_client.get_client(service_name="s3"),
cloudwatch_client=aws_client.get_client(service_name="cloudwatch"),
s3_client=session.client(service_name="s3", **s3_kwargs),
cloudwatch_client=session.client(service_name="cloudwatch"),
@sonarqubecloud
Copy link
Copy Markdown

@github-actions
Copy link
Copy Markdown
Contributor

🔴 Playwright Results — 2 failure(s), 21 flaky

✅ 3633 passed · ❌ 2 failed · 🟡 21 flaky · ⏭️ 84 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 479 0 1 4
🟡 Shard 2 640 0 5 7
🔴 Shard 3 648 2 3 1
🟡 Shard 4 622 0 4 22
🟡 Shard 5 610 0 2 42
🟡 Shard 6 634 0 6 8

Genuine Failures (failed on all attempts)

Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3)
Error: �[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoBe�[2m(�[22m�[32mexpected�[39m�[2m) // Object.is equality�[22m

Expected: �[32m200�[39m
Received: �[31m400�[39m
Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3)
Error: �[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoBe�[2m(�[22m�[32mexpected�[39m�[2m) // Object.is equality�[22m

Expected: �[32m200�[39m
Received: �[31m400�[39m
🟡 21 flaky test(s) (passed on retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ChangeSummaryBadge.spec.ts › Automated badge should appear on entity description with Automated source (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should inherit reviewers from glossary when term is created (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 2 retries)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Topic (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for ApiEndpoint (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Verify Domain entity API calls do not include invalid domains field in glossary term assets (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for table (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 1 retry)
  • Pages/InputOutputPorts.spec.ts › Input port button visible, output port button hidden when no assets (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › verify create lineage for entity - Spreadsheet (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for topic -> container (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants