
Data docs does not contain the results of my expectations #3658

Closed
abekfenn opened this issue Nov 10, 2021 · 18 comments
Labels
community devrel This item is being addressed by the Developer Relations Team

Comments

@abekfenn
Contributor

Describe the bug
This is probably not a bug but user error; I didn't see a more suitable template. I am trying to run expectations via code (not the CLI) as part of my ETL pipeline in order to validate data before it goes to production. I want to save the validation results JSON, upload it to S3, and set up S3-hosted data docs that pull from those results and the expectation suite.

To Reproduce
Steps to reproduce the behavior:

  1. Run the below attached code
  2. Note that when data docs opens, it only contains the expectations, and not the results of those expectations.

Expected behavior
Data docs contains the results of my expectations.

Environment (please complete the following information):

  • Operating System: Linux & MacOS
  • Great Expectations Version: [e.g. 0.14.1]

Additional context
If there is a better way to do this, say one that better leverages existing great_expectations features, please do point me in that direction.
Notably, I couldn't make the CLI-driven configuration of my great_expectations.yml work for me, as I need this to run dynamically in a pipeline, uploading to different locations depending on the client.
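For context, the kind of per-client, in-code store configuration I am ultimately aiming for looks roughly like the sketch below (the bucket name is just a placeholder, and I'm assuming S3StoreBackendDefaults is the right way to point all stores at S3). The full reproduction script follows after it.

from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, S3StoreBackendDefaults

client = "client_a"  # placeholder: resolved at runtime, per pipeline run

# Point the expectations, validations and data docs stores at a client-specific bucket
per_client_config = DataContextConfig(
    store_backend_defaults=S3StoreBackendDefaults(
        default_bucket_name=f"my-ge-bucket-{client}",  # placeholder bucket name
    ),
)
per_client_context = BaseDataContext(project_config=per_client_config)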

import great_expectations as ge
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, FilesystemStoreBackendDefaults
from great_expectations.data_context import BaseDataContext
import numpy as np
import pandas as pd
import json
import os
from datetime import datetime
from great_expectations.data_context.types.resource_identifiers import (
    ExpectationSuiteIdentifier,
    ValidationResultIdentifier,
)

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

abs_path = os.getcwd() + '/great_expectations'

data_context_config = DataContextConfig(
    datasources={
        "my_pandas_datasource": DatasourceConfig(
            class_name="PandasDatasource",
        )
    },
    store_backend_defaults=FilesystemStoreBackendDefaults(root_directory=abs_path),
)
context = BaseDataContext(project_config=data_context_config)

domain_name = 'test'

suite = context.create_expectation_suite(domain_name, overwrite_existing=True)

batch_kwargs = {
    "datasource": 'my_pandas_datasource',
    "dataset": df,
    "data_asset_name": domain_name,
}

batch = context.get_batch(batch_kwargs, "test")

print(batch.head())

batch.expect_table_row_count_to_be_between(max_value=250, min_value=10)

batch.expect_table_column_count_to_equal(value=4)

batch.expect_table_columns_to_match_ordered_list(
    column_list=[
        "A",
        "B",
        "C",
        "D",
    ]
)

batch.expect_column_values_to_not_be_null(column="A",
    result_format='COMPLETE')

batch.expect_column_values_to_be_null(column="A",
    result_format='COMPLETE')

batch.expect_column_values_to_be_in_set(
    column="A",
    value_set=["A", "B", "C", "D", "E", "F"],
    result_format='COMPLETE'
)

results = batch.validate()


# This step is optional, but useful - evaluate the Expectations against the current batch of data
run_id = {
    "run_name": domain_name,
    "run_time": datetime.now(),
}
results = batch.validate(expectation_suite=None,
                                run_id=None,
                                data_context=context,
                                evaluation_parameters=None,
                                catch_exceptions=True,
                                only_return_failures=False,
                                run_name=domain_name,
                                run_time=datetime.now(),)

# save the Expectation Suite (by default to a JSON file in the great_expectations/expectations folder)
# batch.save_expectation_suite(suite, domain_name, discard_failed_expectations=False)
batch.save_expectation_suite(discard_failed_expectations=False)

# Neither details nor meta (I inferred this as expected) seem to contain an expectation_suite_identifier
# expectation_suite_identifier = list(results["details"].keys())[0]
# expectation_suite_identifier = list(results["meta"].keys())[0]
# print('expectation_suite_identifier')
# print(expectation_suite_identifier)

validation_result_identifier = ValidationResultIdentifier(
    expectation_suite_identifier=domain_name,
    # expectation_suite_identifier=expectation_suite_identifier,
    batch_identifier=batch.batch_kwargs.to_id(),
    run_id=run_id
)

# This doesn't work
# context.build_data_docs()

# Neither does this
# context.build_data_docs(domain_name, results)
# context.open_data_docs(domain_name)

# Neither does this
# context.build_data_docs(domain_name, suite_identifier)
# context.open_data_docs(suite_identifier)
# context.open_data_docs(validation_result_identifier)

# Neither does this
suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=domain_name)
context.build_data_docs(domain_name, suite_identifier)
context.open_data_docs()

with open('validation_results.json', 'w') as f:
    f.write(str(results))
@talagluck
Contributor

Hi @abekfenn! Thanks so much for the question.

While you are running a validate step and saving your suite, batch.validate() does not actually save your validation results. In order to do that, you will want to use a Checkpoint (or LegacyCheckpoint if you are using the V2 API). If you run great_expectations suite new on the CLI, you will see an example of this in the last cell. For reference, the code looks something like this:

batch.save_expectation_suite(discard_failed_expectations=False)

results = LegacyCheckpoint(
    name="_temp_checkpoint",
    data_context=context,
    batches=[
        {
          "batch_kwargs": batch_kwargs,
          "expectation_suite_names": [expectation_suite_name]
        }
    ]
).run()
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

You'll find some more information about configuring Checkpoints in our docs, though there will be slight differences between V2 and V3. I believe you'll have more expressiveness with your Checkpoints in V3.

I'm going to close this issue for now, but feel free to chime in if you have additional questions here.

@abekfenn
Contributor Author

Hi @talagluck, thanks so much for your response. This worked for me, but then I encountered a bug which was recently resolved for V3 (#3225). I've tried to switch my code over to V3 but keep encountering an error, and cannot find any documentation on in-code expectations using V3 checkpoints and data docs.

If you could point out the error in my logic below, I would be very grateful.

import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, FilesystemStoreBackendDefaults

from great_expectations.checkpoint import Checkpoint, LegacyCheckpoint
import numpy as np
import pandas as pd
import json
from ruamel import yaml
import os
from datetime import datetime

ge_abs_path = os.getcwd() + '/great_expectations'

abs_path = os.getcwd() + '/great_expectations'
domain_name = 'test'

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

'%Y-%m-%d %H:%M:%S'
context = ge.data_context.DataContext(
    context_root_dir=ge_abs_path,
    # store_backend_defaults=FilesystemStoreBackendDefaults(root_directory=abs_path),
)
yaml_config: str = f"""
        name: {domain_name}
        config_version: 1
        class_name: Checkpoint
        run_name_template: "%Y-%m-%d-%H-%M-%S_{domain_name}"
        validations:
        - batch_request:
            expectation_suite_name: {domain_name}
        """

datasource_config = {
    "name": "test",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "module_name": "great_expectations.datasource.data_connector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
context.add_datasource(**datasource_config)
# context.add_checkpoint(**yaml.load(yaml_config))

batch_request = RuntimeBatchRequest(
    datasource_name="test",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="test",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # df is your dataframe
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

context.create_expectation_suite(
    expectation_suite_name="test", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test"
)

validator.expect_table_row_count_to_be_between(max_value=250, min_value=10)

validator.expect_table_column_count_to_equal(value=4)

validator.expect_table_columns_to_match_ordered_list(
    column_list=[
        "A",
        "B",
        "C",
        "D",
    ]
)

validator.expect_column_values_to_not_be_null(column="A",
        result_format={'result_format': 'SUMMARY'})

validator.expect_column_values_to_be_null(column="A",
        result_format={'result_format': 'COMPLETE'})

validator.expect_column_values_to_be_in_set(
    column="A",
    value_set=["A", "B", "C", "D", "E", "F"],
    result_format={'result_format': 'COMPLETE'}
)

validator.save_expectation_suite(
        filepath=None,
        discard_failed_expectations=False,
        discard_result_format_kwargs=False,
        discard_include_config_kwargs=False,
        discard_catch_exceptions_kwargs=False,
        suppress_warnings=False,)

# results = validator.validate()
# this works but doesn't allow data docs to be built

# add checkpoint config
checkpoint = Checkpoint(
    name=domain_name,
    data_context=context,
    config_version=1,
    run_name_template=f"%Y-%m-%d-%H-%M-%S_{domain_name}",
    expectation_suite_name=domain_name,
    action_list=[
        {
            "name": "store_validation_result",
            "action": {
                "class_name": "StoreValidationResultAction",
            },
        },
        {
            "name": "store_evaluation_params",
            "action": {
                "class_name": "StoreEvaluationParametersAction",
            },
        },
        {
            "name": "update_data_docs",
            "action": {
                "class_name": "UpdateDataDocsAction",
            },
        },
    ],
    validations=[{"batch_request": batch_request}],
)

results = checkpoint.run(
    # checkpoint_name=domain_name,
    # batch_request=batch_request,
    # run_id=None,
    # # run_id=_RUN_ID,
    # data_context=context,
    # evaluation_parameters=None,
    # catch_exceptions=True,
    # only_return_failures=False,
    # run_name=domain_name,
    # run_time=datetime.now(),
    )

validation_result_identifier = results.list_validation_result_identifiers()[0]
print('validation_result_identifier', validation_result_identifier)
context.build_data_docs()


validation_results = results.run_results[validation_result_identifier]["validation_result"]


with open('validation_results.json', 'w') as f:
    f.write(str(results))

@abekfenn
Contributor Author

Error message:

2021-11-23T08:43:52-0800 - ERROR - Error running action with name update_data_docs
Traceback (most recent call last):
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/validation_operators/validation_operators.py", line 455, in _run_actions
    checkpoint_identifier=checkpoint_identifier,
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/checkpoint/actions.py", line 69, in run
    **kwargs,
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/checkpoint/actions.py", line 1008, in _run
    expectation_suite_identifier,
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/core/usage_statistics/usage_statistics.py", line 264, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 2520, in build_data_docs
    build_index=(build_index and not self.ge_cloud_mode),
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/render/renderer/site_builder.py", line 303, in build
    site_section_builder.build(resource_identifiers=resource_identifiers)
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/render/renderer/site_builder.py", line 406, in build
    source_store_keys = self.source_store.list_keys()
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/data_context/store/store.py", line 172, in list_keys
    return [self.tuple_to_key(key) for key in keys_without_store_backend_id]
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/data_context/store/store.py", line 172, in <listcomp>
    return [self.tuple_to_key(key) for key in keys_without_store_backend_id]
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/data_context/store/store.py", line 133, in tuple_to_key
    return self.key_class.from_tuple(tuple_)
  File "/virtual-env-ihL17uoP/lib/python3.7/site-packages/great_expectations/data_context/types/resource_identifiers.py", line 165, in from_tuple
    RunIdentifier.from_tuple((tuple_[-3], tuple_[-2])),
IndexError: tuple index out of range

@talagluck
Contributor

Hi @abekfenn - at first glance, I don't see anything obviously incorrect. What line is triggering the error? Is it the checkpoint.run() call?

Are you able to save the checkpoint, and then call context.run_checkpoint(checkpoint_name)?
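Roughly what I mean is something like the sketch below (untested against your exact config; the checkpoint name is arbitrary, and I'm assuming the batch_request and "test" suite from your script above):

context.add_checkpoint(
    name="my_temp_checkpoint",  # arbitrary name for illustration
    config_version=1,
    class_name="Checkpoint",
    expectation_suite_name="test",
)

# The in-memory RuntimeBatchRequest is supplied at run time rather than persisted
results = context.run_checkpoint(
    checkpoint_name="my_temp_checkpoint",
    validations=[{"batch_request": batch_request}],
)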

@abekfenn
Contributor Author

Correct, results = checkpoint.run() is triggering the error.
However, I noticed that when I leave context pointing to my root directory, I get no error.
When I update
ge_abs_path = os.getcwd() + '/shared/validate/great_expectations'
I get an error. Ideally, I would like to store validations etc. in a different sub-folder than my root directory. But this seems to be causing an issue even when I set this as the directory when I instantiate the context.

Should I be referencing this path elsewhere too?

@talagluck
Contributor

Hi @abekfenn - on second glance, it appears that you haven't configured a Store in the config you're using, as this appears to be commented out, though it's hard to tell exactly.

Have you seen our docs on configuring a DataContext without a .yml file? I suspect this may be what you want. Currently it seems that you are using a hybrid approach, and so I think your Stores are not being configured correctly, and this can be tough to troubleshoot. I would recommend either working entirely with an in-memory DataContext, or entirely with a .yml based config, at least initially.
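For reference, a fully in-code configuration looks roughly like the sketch below (the root directory is a placeholder); everything, including Stores and Data Docs, is then rooted under that directory and no great_expectations.yml is read at all:

import os
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, FilesystemStoreBackendDefaults

project_root = os.path.join(os.getcwd(), "shared", "validate", "great_expectations")  # placeholder path

# All stores (expectations, validations, checkpoints) and data docs default to
# filesystem backends under project_root, so no yml config is needed.
in_memory_config = DataContextConfig(
    store_backend_defaults=FilesystemStoreBackendDefaults(root_directory=project_root),
)
context = BaseDataContext(project_config=in_memory_config)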

@abekfenn
Contributor Author

@talagluck thanks for getting back to me. I was trying different approaches to get it to work, using an in-memory DataContext vs a yml-based config.

Am I not instantiating the context in-code with the following line?

ge_abs_path = os.getcwd() + '/shared/validate/great_expectations'
context = ge.data_context.DataContext(
    context_root_dir=ge_abs_path,
    # store_backend_defaults=FilesystemStoreBackendDefaults(root_directory=abs_path),
)

I've deleted/commented out the following lines and I still get the IndexError processing the tuple.

context.test_yaml_config(yaml.dump(datasource_config))
context.add_datasource(**datasource_config)

In any case, there appears to be an amalgamation of issues that mean I will have to run both batch.validate() and checkpoint.run().

Unfortunately, from what I've discovered:

  1. There is a bug in the main release of GE with unexpected_index_list not being returned for V3 checkpoints (TBD whether your fix in #3225, "result_format argument does not appear to be working properly in V3", resolves this).
  2. There is a bug in the V2 API that results in result_format being overwritten for V2 checkpoints (this was resolved for V3 in #3634, "[BUGFIX] Ensure that result_format from saved expectation suite json file takes effect", but is still broken in V2).
  3. There is a bug whereby unexpected_index_list is not included in results for V3 validator.validate().

This means that in order for me to both use data docs and process the unexpected_index_list from the validation results (to e.g. drop bad records), I have to run BOTH an in-code validation using batch.validate() as well as a legacy checkpoint. This seems horribly inefficient but will have to do for now. Do let me know if I'm missing something but as I understand it, one can only build data docs from a checkpoint result.

P.S. I have yet to create issues for all of the above.

@talagluck
Contributor

Hi @abekfenn - apologies, there's a bunch to work through here, so thanks for bearing with us!

First - yes you are right, it appears that you are correctly instantiating a DataContext in code. Is the directory that you are pointing to an initialized Great Expectations project? Are you able to share your Stores config from the great_expectations.yml for that project?

For your discoveries above:
1: We were recently made aware of this issue, and it has been filed in #3736. We are working through this, though it may take us a few days to get to it.

2: I was not aware of this issue. We are currently moving toward V3 being the primary API, and so are probably not able to prioritize a fix for this, but please do open an issue if this is a problem as well.

3: There are a few possibilities here (and this might be the case for 2 as well):

  • This is the same issue as 1
  • This is an issue of the return value of the method vs. its __repr__ method, and might be solved with a print of validator.validate() instead of just the call.
  • You need to specify result_format in the call to validator.validate(). This is a design flaw which we are in the process of resolving.

I don't fully understand why you need to run things two ways, though I agree that you shouldn't need to do this! Are you saying that you are using both V2 and V3? Could you please unpack what your use case is here? Hopefully some of this will be ironed out by the coming bugfixes, but it would be good to understand the end goal.

I'm also a bit confused by your script above - are you adding the expectations and validating them every time? Do you also have an Expectation Suite that you are persisting and using?

@talagluck
Contributor

Apologies as well - we've had a few different issues pertaining to result_format at the same time, and I think the unexpected_index_list issue might have gotten buried in a thread for another, separate but related, issue. But also, our tests for this are passing, so I'm wondering if this piece might just be a configuration issue, and I'm trying to understand where that is coming from.

@abekfenn
Contributor Author

abekfenn commented Nov 24, 2021

Apologies that this issue/ticket has ballooned in scope but thank you for helping me work through it.

I hope to see #3736 in the upcoming release, and that it will solve these issues with respect to unexpected_index_list being None. Between that and #3634, I think my issue could be solved, but I have yet to be able to test this, as I still have to get GE released internally.

Answering your responses

With respect to 1) and 2) this makes sense. Thank you for your response.

With respect to 3)

  • This is the same issue as 1
  • This is an issue of the return value of the method vs. its __repr__ method, and might be solved with a print of validator.validate() instead of just the call.
  • You need to specify result_format in the call to validator.validate(). This is a design flaw which we are in the process of resolving.
    -- You are correct, running validator.validate(result_format='COMPLETE') gave me the result I wanted
    -- However, running checkpoint.run(result_format='COMPLETE') did result in the validation results from my checkpoint containing unexpected_index_list

My Script

To answer this:
I'm also a bit confused by your script above - are you adding the expectations and validating them every time? Do you also have an Expectation Suite that you are persisting and using?

  • I wrote it this way to provide a straightforward example to easily reproduce and understand my issue. However, this does mirror my actual implementation. I was not aware of evaluation_parameters when I first started developing my expectations and saw the only way to dynamically adjust my expectations was to run all of this "in-code", allowing me to access the results of a previous step in my ETL pipeline. I plan to research evaluation parameters but for now my expectations otherwise work as expected, even if it isn't implemented in the intended form.
  • I am currently only persisting the expectation suite but not truly using it for anything.
  • I am worried that this means I am running a given validation rule twice, is this true? This would mean I am running the same validation rule 3x if I am doing batch.expect_x_to_be_x, validator.validate() and checkpoint.run(), correct?

My use case is this:

I am hoping to leverage GE to serve as both validation and automated data quality checks of my data in an ETL pipeline.

  1. I want to run expectations and process those expectations to drop records, prior to production, that do not meet important expectations.
  • For this reason, I need unexpected_index_list from the validation results.
  2. Serve up the results of my validations to a team to review data anomalies.
  • I would like to use data docs to make this easily digestible, but I have also leveraged unexpected_index_list to create my own custom exception report that includes other attributes of the data (e.g. values of a PK and other unique identifiers for the record associated with the bad value).

For this reason, I cannot just run validator.validate(), as I was told that data docs cannot be built off of this validation result and need to be built from the results of a checkpoint. Therefore, because the validation results of checkpoint.run() do not currently contain unexpected_index_list, I have to run both validator.validate() and checkpoint.run() in order to obtain unexpected_index_list and generate data docs, respectively.
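For illustration, the way I consume unexpected_index_list to drop bad records looks roughly like this (df and validator are from my script above; this assumes result_format='COMPLETE' and an in-memory pandas dataframe):

# suite_result is the ExpectationSuiteValidationResult returned by validator.validate()
suite_result = validator.validate(result_format='COMPLETE')

bad_indices = set()
for expectation_result in suite_result.results:
    unexpected = (expectation_result.result or {}).get("unexpected_index_list") or []
    bad_indices.update(unexpected)

# Drop the offending rows before the data moves on to production
clean_df = df.drop(index=sorted(bad_indices))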

I am probably not following the intended implementation of GE, but have struggled to find documentation that covers my use case of running validations in an ETL pipeline to trigger actions on my data.

Thanks so much for your detailed responses, have a great thanksgiving if you are celebrating!

@talagluck
Contributor

Hi @abekfenn - thanks for this.

Regarding your script, thank you, this clears things up. I agree that evaluation parameters may simplify things quite a bit. In general, we think about the ExpectationSuite as the source of truth, so that we clearly lay out Expectations about our data, and then validate those Expectations - adding them anew each time to the suite is somewhat counter to this, though I understand your use case. You may also want to take a look at our Rule-Based Profilers (though they are still somewhat experimental). You are correct that you would be running validations three times when adding the expectation, calling validator.validate(), and calling checkpoint.run().
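For example, a rough sketch of how an evaluation parameter could feed a value computed earlier in your pipeline into a stored suite (the parameter name here is arbitrary, and validator/checkpoint are the objects from your script above):

# In the suite, the expectation references a named parameter instead of a literal value
validator.expect_table_row_count_to_be_between(
    min_value=10,
    max_value={"$PARAMETER": "upstream_row_count"},
)
validator.save_expectation_suite(discard_failed_expectations=False)

# At validation time, the pipeline supplies the concrete value
results = checkpoint.run(
    evaluation_parameters={"upstream_row_count": 250},  # e.g. a row count from a prior ETL step
)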

To confirm:

You need to specify result_format in the call to validator.validate(). This is a design flaw which we are in the process of resolving.
-- You are correct, running validator.validate(result_format='COMPLETE') gave me the result I wanted
-- However, running checkpoint.run(result_format='COMPLETE') did result in the validation results from my checkpoint containing unexpected_index_list

Did you mean to say that checkpoint.run(result_format='COMPLETE') did not work? I suspect that what is happening is that you are looking at the topmost CheckpointResult object instead of the ValidationResults within (you can do this using CheckpointResult.list_validation_results()). You should then see unexpected_index_list within there. I recognize that this is confusing (and apologies for that!), though typically, we expect that people are using Expectation and Validation Stores, and so they would see the unexpected_index_list within those Validations, rather than relying on the top level return of a call to Checkpoint.run().
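Concretely, something like the following sketch (using the checkpoint name from your script above):

results = checkpoint.run(result_format="COMPLETE")  # results is a CheckpointResult

# The per-suite validation results live one level below the CheckpointResult
for validation_result in results.list_validation_results():
    for expectation_result in validation_result.results:
        print((expectation_result.result or {}).get("unexpected_index_list"))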

Regarding #3736, I was mistaken - this is a backend issue, as currently unexpected_index_list is only available for pandas.

Thanks also for explaining your use case further. We also hope to add support for returning an ID or PK for unexpected rows in the not-too-distant future, though I don't have a specific date for this, and I suspect that this would help as well.

@talagluck
Contributor

And thank you! Hope you are having a great Thanksgiving weekend.

@abekfenn
Contributor Author

abekfenn commented Nov 26, 2021

Thanks again for your response @talagluck. I was thinking of the code that runs my expectations (batch.expect_x_to_meet_x) as the source of truth, as that is what goes into source control, but I understand this runs counter to the intended usage. I look forward to trying out evaluation parameters, simplifying my implementation of GE, and reducing my runtime 3-fold 😄

Yes, you are right, I made a typo, running checkpoint.run(result_format='COMPLETE') did not result in the validation results from my checkpoint containing unexpected_index_list.

I believe I am looking at the ValidationResults within using the code above/below but please correct me if I'm wrong. FWIW, I DO see the unexpected_index_list in the results from within here after upgrading to GE 0.13.44 which contains the fix related to #3736. However, this wasn't the case in 0.13.43, thank you for this fix! The type of validation_results is great_expectations.core.expectation_validation_result.ExpectationSuiteValidationResult. This wasn't clear at first but after digging around in the GE code base and unit tests, I figured this out 😃

To be clear, I am leveraging GE for in-memory dataframes using pandas at the moment, so this addresses my use case. Nevertheless, we do plan on moving to in-database data storage and maintenance in the relatively near future so I do look forward to this getting added with #3195.

Really appreciate all of your support, I'm very close to being able to release GE to production and having the support of your team has been instrumental in that. Can't wait to see GE revolutionize our data quality!

results = checkpoint.run(result_format='COMPLETE')

validation_result_identifier = results.list_validation_result_identifiers()[0]
print('validation_result_identifier', validation_result_identifier)

validation_results = results.list_validation_result_identifiers()
print('validation_results', validation_results)
context.build_data_docs()

validation_results = results.run_results[validation_result_identifier]["validation_result"]

print(type(validation_results))  # great_expectations.core.expectation_validation_result.ExpectationSuiteValidationResult
print(validation_results)

@talagluck
Contributor

Thanks so much for the follow-up, @abekfenn! Wishing you much luck as you move your implementation forward, and please don't hesitate to reach out with further questions or suggestions.

@abekfenn
Contributor Author

abekfenn commented Dec 2, 2021

Hi @talagluck, let me know if you'd prefer I open another issue, but I have one more related question. Everything is working well for me, but when I try to set up an S3 store for data docs, the docs do not get built and sent to S3. I have a feeling this is probably something to do with my code.
I am leveraging the same code to run my expectations as above.
Do I have to reference this site_name somewhere in the instantiation of my data context, checkpoint, checkpoint.run() or build_data_docs?

I am trying to set up S3-hosted data docs. I can manually upload these files, but I would like to have index.html serve as the home page for each data asset. I am using the below yaml.

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/en/latest/reference/spare_parts/data_context_reference.html#configuration
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/en/latest/reference/core_concepts/datasource.html
datasources:
  test:
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    class_name: Datasource
    data_connectors:
      default_runtime_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/en/latest/reference/core_concepts/evaluation_parameters.html
    class_name: EvaluationParameterStore

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store


data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. The are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at
  # https://docs.greatexpectations.io/en/latestfeatures/data_docs.html
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: false
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
    site_section_builders:
      validations:  # if empty, or one of ['0', 'None', 'False', 'false', 'FALSE', 'none', 'NONE'], section not rendered
        class_name: DefaultSiteSectionBuilder
        source_store_name: validations_store
        # run_name_filter:
        #   ne: profiling
        renderer:
          module_name: great_expectations.render.renderer
          class_name: ValidationResultsPageRenderer
          column_section_renderer:
            class_name: ValidationResultsColumnSectionRenderer
            # table_renderer:
            #   module_name: custom_renderers.custom_table_content_block
            #   class_name: CustomTableContentBlockRenderer
  s3_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: ${S3_BUCKET}
      prefix: multitenant/${SITE}/great_expectations
      # boto3_options: {region_name: us-east-1}
      # put_options: {profile: devprofile}
    # store_backend:
    #   class_name: TupleS3StoreBackend
    #   bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
    #   base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true

anonymous_usage_statistics:
  data_context_id: 0ac4488d-1355-49c0-8ec9-ba725b688387
  enabled: true
notebooks:
concurrency:
  enabled: false

Quick follow-up question: I've noticed that when I do a lot of development locally, the metrics calculations start to take really long (particularly the last step in the metrics calculation). Is this due to a lot of historical results, or is there anything you might recommend to prevent this?

@talagluck
Contributor

Hmm - I'm not sure exactly, but it depends on how you are building docs. I believe DataContext.build_data_docs() builds all sites by default, but individual sites can be specified as well. I would need to look into how this works for checkpoints (I think some Checkpoints use the UpdateDataDocs action, so if a site had never been built I don't know how this would work).
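For example, a minimal sketch of building a single site (the site name has to match the key under data_docs_sites in your great_expectations.yml):

# Build only the S3 site instead of every configured site
context.build_data_docs(site_names=["s3_site"])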

I'd be curious whether the site is being built (like the files are there on S3), but the index is not populating it. And also what will happen if you remove your local_site from the config.

As far as the metrics, this feels strange to me. It's a known issue that DataDocs takes longer to build over time, as the number of Validations increases, but I don't believe that Metrics should take any longer to run (this process is relatively lean as it relies on the backend to run, and so shouldn't be much more expensive than getting the metrics directly with pandas).

@abekfenn
Contributor Author

abekfenn commented Dec 2, 2021

For posterity, figured it out: I had to trigger update_data_docs by specifying an action_list argument to checkpoint.run() that points the UpdateDataDocsAction at my s3_site.

See:

        results = checkpoint.run(
            # batch_request=batch_request,
            evaluation_parameters=None,
            # run_id=_RUN_ID,
            run_name=domain_name,
            run_time=datetime.now(timezone.utc),
            result_format='COMPLETE',
            action_list=[
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction", "site_names": ['s3_site']},
                },
            ],
            )

@talagluck
Contributor

Great - thanks for the update, @abekfenn!
