Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 InvalidKeyError when S3 prefix for an Expectation Store is at the top level for the prefix #3054

Closed
talagluck opened this issue Jul 15, 2021 · 5 comments
Labels
core-team help wanted Issues we'd love to see community contributions for. Join #contributors-contributing in our Slack!

Comments

@talagluck
Copy link
Contributor

Describe the bug
This is describing the bug found in #2993 . With the following error

On S3 path looks like
test_configless2/20210705T133824.449268Z/20210705T133824.449268Z/some_string.json

But I get an error

Unable to retrieve object from TupleS3StoreBackend with the following Key: test_configless2/20210705T133824/449268Z/20210705T133824/449268Z/some_string.json

If you specify the prefix of an ExpectationStore as purely the top level prefix, without a delimiter and another name afterward, this causes a failure, e.g.:

# this will not work
  expectations_s3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "test_bucket"
      prefix: "my_prefix/" # does not have an additional word after the delimiter of the prefix

# this will work 
 expectations_s3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: "test_bucket"
      prefix: "my_prefix/foo" # has an additional word after the delimiter of the prefix

This is the case whether the DataContext is being instantiated from a YAML or from an in-memory config. This may require S3 stores for Validations and DataDocs as well, though when these other stores lack the final word after the delimiter, this error does not occur.

To Reproduce
Steps to reproduce the behavior:

  1. Create a Great Expectations project
  2. Specify S3 Stores for Expectations, Validations, and Data Docs
  3. Only include the top level "directory" in the prefix for the Expectations Store - ensure that there is nothing after the first delimiter.

Expected behavior
There should not be substitution happening between "." and "/" which appears to be causing an error.

Environment (please complete the following information):

  • Operating System: MacOS
  • Great Expectations Version: 0.13.21
@talagluck talagluck added the help wanted Issues we'd love to see community contributions for. Join #contributors-contributing in our Slack! label Jul 15, 2021
@spencerhardwick spencerhardwick added this to test in test Jul 16, 2021
@spencerhardwick spencerhardwick removed this from test in test Jul 16, 2021
@donaldheppner
Copy link
Member

I've done some investigation, but only have suggestions on a fix.

How to reproduce

Your expectations store and validations store in s3 with the same prefix (in code config snippet)

    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one", # prefix is the same
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one", # prefix is the same
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },

What happens

list_keys in class TupleS3StoreBackend lists all objects in a bucket with a prefix, then:

  1. removes those not starting with a prefix: one
  2. removes those not ending with the expected suffix: .json
  3. the expectations suite () file and the previously generated validations end in .json, so both are identified as keys:
    a. ('test_configless2',)
    b. ('test_configless2', '20210724T163836.159298Z', '20210724T163836.159298Z', '5bbef67b969ab4509ad5642a107ce61f')

ExpectationsSuiteIdentifier then takes the second key and

  1. from_tuple method creates the identifier by joining the contents of a tuple key using a "." delimiter
  2. to_tuple method splits based on the same delimiter. Because elements of the key contain ".", the original 3 tuple key becomes a 5 tuple key

Finally, TupleS3StoreBackend's _build_s3_object_key method creates a path from the expectation suite identifier, now a 5 tuple key, using the "/" delimiter. The expected s3 path of test_configless2/20210724T163836.159298Z/20210724T163836.159298Z/5bbef67b969ab4509ad5642a107ce61f.json becomes
test_configless2/20210724T163836/159298Z/20210724T163836/159298Z/5bbef67b969ab4509ad5642a107ce61f.json.

The other bug here is that the second key doesn't refer to an expectations suite, but a validation.

Possible Fixes

  1. Check to make sure the prefixes for different stores are unique and force them to be different (probably the quickest fix)
  2. Use configuration in a more prescriptive way; there should be no need to list_keys and then try to filter out the bad ones. Configuration should tell you exactly where to find the expectation suite file (my preferred fix)
  3. Use a different delimiter in from_tuple and to_tuple (or avoid having to do this conversion altogether)

I'm new to the tool and code base, so I'm reluctant to take this further without guidance.

Full code to reproduce

Some paths and bucket names will need to be updated to match your environment. Code is derived from #2993 but uses the file store from the tutorial:

import great_expectations as ge

data_context_config = ge.data_context.types.base.DataContextConfig(
    datasources={
        "my_datasource": ge.data_context.types.base.DatasourceConfig(
            class_name="PandasDatasource",
            batch_kwargs_generators={
                "subdir_reader":{
                    "class_name": "SubdirReaderBatchKwargsGenerator",
                    "base_directory": "get/data"
                }
            },
            module_name="great_expectations.datasource",
            data_asset_type={
                "class_name": "PandasDataset",
                "module_name": "great_expectations.dataset"
            }
        )
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                            "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                            "action": {"class_name": "StoreEvaluationParametersAction"},
                },
            ],
        }
    },
    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one",
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "one",
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    data_docs_sites={
        "s3_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "3054",
                "prefix": "docs",
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    store_backend_defaults=ge.data_context.types.base.S3StoreBackendDefaults(
        default_bucket_name="3054"),
)


context = ge.data_context.BaseDataContext(project_config=data_context_config)

batch_kwargs = {
    "datasource": "my_datasource",
    "path": "~/source/get/data/yellow_tripdata_sample_2019-01.csv",
    "limit": 1000,
}

suite = context.create_expectation_suite('test_configless2', overwrite_existing=True)

# batch = context.get_batch(batch_kwargs, suite)
batch = context.get_batch(batch_kwargs, suite)

batch.expect_table_row_count_to_be_between(min_value=1, max_value=100000000)

batch.save_expectation_suite(discard_failed_expectations=False)

results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch])

@talagluck
Copy link
Contributor Author

Hi @donaldheppner - thank you so much for this in-depth investigation and thorough explanation! I will bring your three suggestions to our triage meeting next week and discuss them with the team. Then one of us will get back to you, and we can work together on implementing a solution.

@talagluck
Copy link
Contributor Author

Hi @donaldheppner! Apologies for delays here. After some internal review, we've come to the conclusion that Option 2, from your proposed options, would be the right solution. Option 1 is appealing due its speed, but would break encapsulation (currently we evaluate the Stores separately, and Option 1 would require comparing them). Option 3 seems more roundabout, but I also might not be fully understanding it.

Would you be up for making a contribution for this? I'm happy to offer guidance by way of PR review, and also answering questions about our codebase. Please let me know what would work best for you.

Thanks!

@donaldheppner
Copy link
Member

donaldheppner commented Aug 23, 2021

Thanks Tal, agreed re: Option 2 -- I'll give it a shot and appreciate the support!

donaldheppner added a commit to donaldheppner/great_expectations that referenced this issue Sep 7, 2021
…tions suite by pulling it from the provided parameters instead
@donaldheppner
Copy link
Member

Hi @talagluck, I've implemented option 2 and have existing tests passing; I could use some guidance on where to add new tests (if necessary) as the bug concerns a conflict between stores. Further details are in the commit log. I took the opportunity to introduce some types along the way. Link: https://github.com/donaldheppner/great_expectations/tree/feature/bug_fix_3054

talagluck added a commit that referenced this issue Sep 17, 2021
… is available instead of discovering keys (listing keys) in existing sources. (#3377)

* BUGFIX #3054: Removed the file scan for the expectations suite by pulling it from the provided parameters instead

* Updated call to DataContext.store_evaluation_parameters, it was being called with the incorrect arguments

* Got unit tests passing defined at data_context/store/test_valuation_parameter_store.py, that rely on a store.list_keys() to pass (this may be another bug, as no files are identified as test fixtures)

* Updated changelog

* Fixed linting issues from PR

* Added unit test to test_store_backends

* Test passes

* Linted

* Lilnted merge

Co-authored-by: talagluck <tal@superconductive.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-team help wanted Issues we'd love to see community contributions for. Join #contributors-contributing in our Slack!
Projects
None yet
Development

No branches or pull requests

3 participants