data_asset_name can't be specified with fluent datasources for use in data docs #8790

Closed
morphatic opened this issue Oct 4, 2023 · 7 comments · Fixed by #9950

@morphatic

Describe the bug
When running a checkpoint and producing data docs from a batch request generated with the build_batch_request() method of a Fluent pandas filesystem datasource, the "Asset Name" column of the resulting data docs is never populated.

[Screenshot: data docs missing asset name]

To Reproduce
This repository contains a minimum reproduction.

Expected behavior
The "Asset Name" column should be filled in the produced data docs.

Possible Causes and Fixes
There are two ways to approach this issue. The problem could lie in either or both of:

  1. the API for the BatchRequest class implemented for Fluent datasources
  2. how the default data docs renderer finds and processes metadata

When I asked about this issue in the GX Slack channel it was suggested that I could specify a value for the batch_spec_passthrough parameter. I tried doing this in several ways:

# I tried 2 ways of manually adding the property;
# Both resulted in: ValueError: "BatchRequest" object has no field "batch_spec_passthrough"
batch_request.batch_spec_passthrough = { "data_asset_name": "iris" }
batch_request["batch_spec_passthrough"] = { "data_asset_name": "people" }

# Tried adding `batch_spec_passthrough` during batch request creation
# Failed with: TypeError: _PandasDataAsset.build_batch_request() got an unexpected keyword "batch_spec_passthrough"
batch_request = asset.build_batch_request(batch_spec_passthrough = { ... })

# Tried adding it to the `validations` entry when creating the checkpoint
# No error produced, but also didn't resolve the issue
checkpoint = context.add_or_update_checkpoint(
  name="demo_iris_expectations",
  run_name_template="%Y-%m-iris-demo",
  validations=[{
    "batch_request": batch_request,
    "expectation_suite_name": "iris_expectations",
    "batch_spec_passthrough": { "data_asset_name": "people" }
  }]
)

Inconsistent API for BatchRequest classes
This led me to realize that the API for the BatchRequest class is NOT consistent across implementations:

It's not immediately clear to people who are relatively new to GX (like me) that these two BatchRequest classes don't provide equivalent functionality.

A "quick" fix in the Renderer
The data_asset_name IS available in the metadata that gets passed to the renderer that creates the data docs. However, the current renderer code does not access it. I WAS able to get the "Asset Name" column in the data docs to populate correctly with the following change to that code:

validation_success = validation.success
batch_kwargs = validation.meta.get("batch_kwargs", {})
batch_spec = validation.meta.get("batch_spec", {})
active_batch = validation.meta.get("active_batch_definition", {}) # <= ADD THIS

self.add_resource_info_to_index_links_dict(
  # ... other props
  asset_name=batch_kwargs.get("data_asset_name")
  or batch_spec.get("data_asset_name")
  or active_batch.get("data_asset_name"), # <= AND ADD THIS
  # ... other props
)

This "fixes" the immediate problem, but it feels like a band-aid. A better solution would be to have a unified API for any class called BatchRequest.
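
For anyone who wants to check the fallback logic outside of GX, here is a minimal standalone sketch of the lookup order used in the change above. It is not the actual renderer code: the function name `resolve_asset_name` and the sample dictionaries are hypothetical, and it assumes each metadata section is a plain dict (or missing).

```python
from typing import Any, Mapping, Optional

# Metadata sections checked in priority order. The third entry is the one
# the change above adds; the first two are what the renderer already reads.
_ASSET_NAME_SOURCES = ("batch_kwargs", "batch_spec", "active_batch_definition")


def resolve_asset_name(meta: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty data_asset_name found in the validation metadata.

    Hypothetical helper for illustration only; `meta` stands in for
    `validation.meta` and is assumed to map section names to dicts.
    """
    for key in _ASSET_NAME_SOURCES:
        section = meta.get(key) or {}  # treat a missing or None section as empty
        name = section.get("data_asset_name")
        if name:
            return name
    return None


# Made-up sample metadata: one shaped like the older batch_kwargs style,
# one where only active_batch_definition carries the asset name.
legacy_meta = {"batch_kwargs": {"data_asset_name": "iris"}}
fluent_meta = {"batch_spec": {}, "active_batch_definition": {"data_asset_name": "iris"}}

print(resolve_asset_name(legacy_meta))
print(resolve_asset_name(fluent_meta))
```

Both sample inputs resolve to the same asset name, which is the behavior the extra `active_batch_definition` fallback is meant to provide.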

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 0.17.19
  • Data Source: Fluent Pandas filesystem datasource
  • Cloud environment: n/a

Additional context
There's a thread about this issue in the GX Slack channel.

@r34ctor
Contributor

r34ctor commented Oct 6, 2023

Hi @morphatic! Thanks for raising this. We've captured this for review.

@HaydarAk

Is there an update on this?
We are also dealing with this problem

@rcalcantara60

Hi, is there any update? Pls...

@aslichampion

Just wanted to add to this,

I was following a similar discussion to @morphatic in the GX forums in this thread.

I settled on the same temporary fix after a bit of investigating, but it would be really nice to see if there is any update on an actual fix. Thanks!

@Carlitos5336

Thanks for the quick fix. I was having the same issue. Hopefully it gets officially solved soon.

@Kilo59 Kilo59 self-assigned this May 6, 2024
@Kilo59
Member

Kilo59 commented May 6, 2024

I'll be looking into this shortly.

@Kilo59
Member

Kilo59 commented May 22, 2024

@Kilo59 Kilo59 closed this as completed May 22, 2024