S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

talagluck · 2021-07-15T14:04:51Z

Describe the bug
This is describing the bug found in #2993 . With the following error

On S3 path looks like
test_configless2/20210705T133824.449268Z/20210705T133824.449268Z/some_string.json

But I get an error

Unable to retrieve object from TupleS3StoreBackend with the following Key: test_configless2/20210705T133824/449268Z/20210705T133824/449268Z/some_string.json

If you specify the prefix of an ExpectationStore as purely the top level prefix, without a delimiter and another name afterward, this causes a failure, e.g.:

# this will not work
  expectations_s3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "test_bucket"
      prefix: "my_prefix/" # does not have an additional word after the delimiter of the prefix

# this will work 
 expectations_s3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "test_bucket"
      prefix: "my_prefix/foo" # has an additional word after the delimiter of the prefix

This is the case whether the DataContext is being instantiated from a YAML or from an in-memory config. This may require S3 stores for Validations and DataDocs as well, though when these other stores lack the final word after the delimiter, this error does not occur.

To Reproduce
Steps to reproduce the behavior:

Create a Great Expectations project
Specify S3 Stores for Expectations, Validations, and Data Docs
Only include the top level "directory" in the prefix for the Expectations Store - ensure that there is nothing after the first delimiter.

Expected behavior
There should not be substitution happening between "." and "/" which appears to be causing an error.

Environment (please complete the following information):

Operating System: MacOS
Great Expectations Version: 0.13.21

donaldheppner · 2021-07-26T14:10:33Z

I've done some investigation, but only have suggestions on a fix.

How to reproduce

Your expectations store and validations store in s3 with the same prefix (in code config snippet)

    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one", # prefix is the same
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one", # prefix is the same
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },

What happens

list_keys in class TupleS3StoreBackend lists all objects in a bucket with a prefix, then:

removes those not starting with a prefix: one
removes those not ending with the expected suffix: .json
the expectations suite () file and the previously generated validations end in .json, so both are identified as keys:
a. ('test_configless2',)
b. ('test_configless2', '20210724T163836.159298Z', '20210724T163836.159298Z', '5bbef67b969ab4509ad5642a107ce61f')

ExpectationsSuiteIdentifier then takes the second key and

from_tuple method creates the identifier by joining the contents of a tuple key using a "." delimiter
to_tuple method splits based on the same delimiter. Because elements of the key contain ".", the original 3 tuple key becomes a 5 tuple key

Finally, TupleS3StoreBackend's _build_s3_object_key method creates a path from the expectation suite identifier, now a 5 tuple key, using the "/" delimiter. The expected s3 path of test_configless2/20210724T163836.159298Z/20210724T163836.159298Z/5bbef67b969ab4509ad5642a107ce61f.json becomes
test_configless2/20210724T163836/159298Z/20210724T163836/159298Z/5bbef67b969ab4509ad5642a107ce61f.json.

The other bug here is that the second key doesn't refer to an expectations suite, but a validation.

Possible Fixes

Check to make sure the prefixes for different stores are unique and force them to be different (probably the quickest fix)
Use configuration in a more prescriptive way; there should be no need to list_keys and then try to filter out the bad ones. Configuration should tell you exactly where to find the expectation suite file (my preferred fix)
Use a different delimiter in from_tuple and to_tuple (or avoid having to do this conversion altogether)

I'm new to the tool and code base, so I'm reluctant to take this further without guidance.

Full code to reproduce

Some paths and bucket names will need to be updated to match your environment. Code is derived from #2993 but uses the file store from the tutorial:

import great_expectations as ge

data_context_config = ge.data_context.types.base.DataContextConfig(
    datasources={
        "my_datasource": ge.data_context.types.base.DatasourceConfig(
            class_name="PandasDatasource",
            batch_kwargs_generators={
                "subdir_reader":{
                    "class_name": "SubdirReaderBatchKwargsGenerator",
                    "base_directory": "get/data"
                }
            },
            module_name="great_expectations.datasource",
            data_asset_type={
                "class_name": "PandasDataset",
                "module_name": "great_expectations.dataset"
            }
        )
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                            "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                            "action": {"class_name": "StoreEvaluationParametersAction"},
                },
            ],
        }
    },
    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one",
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one",
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    data_docs_sites={
        "s3_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "docs",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    store_backend_defaults=ge.data_context.types.base.S3StoreBackendDefaults(
        default_bucket_name="3054"),
)


context = ge.data_context.BaseDataContext(project_config=data_context_config)

batch_kwargs = {
    "datasource": "my_datasource",
    "path": "~/source/get/data/yellow_tripdata_sample_2019-01.csv",
    "limit": 1000,
}

suite = context.create_expectation_suite('test_configless2', overwrite_existing=True)

# batch = context.get_batch(batch_kwargs, suite)
batch = context.get_batch(batch_kwargs, suite)

batch.expect_table_row_count_to_be_between(min_value=1, max_value=100000000)

batch.save_expectation_suite(discard_failed_expectations=False)

results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch])

talagluck · 2021-07-27T21:34:00Z

Hi @donaldheppner - thank you so much for this in-depth investigation and thorough explanation! I will bring your three suggestions to our triage meeting next week and discuss them with the team. Then one of us will get back to you, and we can work together on implementing a solution.

talagluck · 2021-08-18T19:58:32Z

Hi @donaldheppner! Apologies for delays here. After some internal review, we've come to the conclusion that Option 2, from your proposed options, would be the right solution. Option 1 is appealing due its speed, but would break encapsulation (currently we evaluate the Stores separately, and Option 1 would require comparing them). Option 3 seems more roundabout, but I also might not be fully understanding it.

Would you be up for making a contribution for this? I'm happy to offer guidance by way of PR review, and also answering questions about our codebase. Please let me know what would work best for you.

Thanks!

donaldheppner · 2021-08-23T18:14:37Z

Thanks Tal, agreed re: Option 2 -- I'll give it a shot and appreciate the support!

…tions suite by pulling it from the provided parameters instead

donaldheppner · 2021-09-07T23:44:53Z

Hi @talagluck, I've implemented option 2 and have existing tests passing; I could use some guidance on where to add new tests (if necessary) as the bug concerns a conflict between stores. Further details are in the commit log. I took the opportunity to introduce some types along the way. Link: https://github.com/donaldheppner/great_expectations/tree/feature/bug_fix_3054

… is available instead of discovering keys (listing keys) in existing sources. (#3377) * BUGFIX #3054: Removed the file scan for the expectations suite by pulling it from the provided parameters instead * Updated call to DataContext.store_evaluation_parameters, it was being called with the incorrect arguments * Got unit tests passing defined at data_context/store/test_valuation_parameter_store.py, that rely on a store.list_keys() to pass (this may be another bug, as no files are identified as test fixtures) * Updated changelog * Fixed linting issues from PR * Added unit test to test_store_backends * Test passes * Linted * Lilnted merge Co-authored-by: talagluck <tal@superconductive.com>

talagluck added the help wanted Issues we'd love to see community contributions for. Join #contributors-contributing in our Slack! label Jul 15, 2021

spencerhardwick added this to test in test Jul 16, 2021

spencerhardwick removed this from test in test Jul 16, 2021

petermoyer added the core-team label Aug 6, 2021

donaldheppner added a commit to donaldheppner/great_expectations that referenced this issue Sep 7, 2021

BUGFIX great-expectations#3054: Removed the file scan for the expecta…

48453de

…tions suite by pulling it from the provided parameters instead

donaldheppner mentioned this issue Sep 9, 2021

[BUGFIX] Addresses issue #2993 (#3054) by using configuration when it is available instead of discovering keys (listing keys) in existing sources. #3377

Merged

6 tasks

talagluck closed this as completed Oct 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

talagluck commented Jul 15, 2021

donaldheppner commented Jul 26, 2021

talagluck commented Jul 27, 2021

talagluck commented Aug 18, 2021

donaldheppner commented Aug 23, 2021 •

edited

donaldheppner commented Sep 7, 2021

S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

Comments

talagluck commented Jul 15, 2021

donaldheppner commented Jul 26, 2021

How to reproduce

What happens

Possible Fixes

Full code to reproduce

talagluck commented Jul 27, 2021

talagluck commented Aug 18, 2021

donaldheppner commented Aug 23, 2021 • edited

donaldheppner commented Sep 7, 2021

donaldheppner commented Aug 23, 2021 •

edited