CLI should support GCS and S3 paths for Pandas Datasources #2006

Closed
Aylr opened this issue Oct 21, 2020 · 1 comment

Aylr commented Oct 21, 2020

Describe the bug
The great_expectations suite new CLI command fails when given the path of a file on GCS.

To Reproduce
Steps to reproduce the behavior:
  0. pip install gcsfs (this allows pandas to use gs:// URLs natively in read_csv)
  1. Set up a vanilla pandas datasource:
  gcs:
    data_asset_type:
      class_name: PandasDataset
    class_name: PandasDatasource
    module_name: great_expectations.datasource
  2. Run great_expectations suite new
  3. Choose the gcs datasource
  4. Enter the gs path to an existing file, such as gs://my-bucket/data/foo.csv
  5. Note the error. The CLI assumes a file is local or possibly S3 only.
Enter the path (relative or absolute) of a data file
: gs://my-bucket/data/foo.csv
Error: File 'gs:/my-bucket/data/foo.csv' does not exist.
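
The single slash in the error message (gs:/my-bucket/...) suggests that the input was normalized as a local filesystem path before the existence check. That cause is an assumption and is not confirmed anywhere in this report, but standard POSIX path normalization does reproduce the exact string, as this small sketch shows:

```python
import posixpath

remote_path = "gs://my-bucket/data/foo.csv"

# POSIX normalization drops the empty component between the two slashes,
# so the result is no longer a valid gs:// URL.
normalized = posixpath.normpath(remote_path)
print(normalized)  # gs:/my-bucket/data/foo.csv
```

Any local existence check applied to a remote URL fails regardless of normalization, because the object lives in a bucket rather than on disk.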

Expected behavior
I expect the file to be found and proceed with the suite creation without arduous workarounds that require contributor level experience.

Environment (please complete the following information):

  • OS: macOS
  • GE Version: 0.12.6

Additional context

Once you have a working notebook (which requires selecting a local file in the CLI and then modifying the batch kwargs by hand), you can show that reading from GCS does work:

import datetime
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.data_context.types.resource_identifiers import (
    ValidationResultIdentifier,
)

context = ge.data_context.DataContext()

# Feel free to change the name of your suite here. Renaming this will not
# remove the other one.
expectation_suite_name = "gcs"
suite = context.get_expectation_suite(expectation_suite_name)
suite.expectations = []

batch_kwargs = {
    "data_asset_name": "foo",
    "datasource": "gcs",
    "path": "gs://my-bucket/data/foo.csv",
    "reader_method": "read_csv",
}
batch = context.get_batch(batch_kwargs, suite)
batch.head()
Aylr added the "blocker bug" and "bug" labels Oct 21, 2020
eugmandel added the "triage" and "core-team-priority" labels and removed the "triage" label Oct 22, 2020
eugmandel added a commit that referenced this issue Oct 27, 2020
eugmandel changed the title from "Suite new CLI fails when passed a file on GCS" to "CLI should support GCS and S3 paths for Pandas Datasources" Oct 28, 2020
eugmandel added the "enhancement" label and removed the "blocker bug" and "bug" labels Oct 28, 2020
eugmandel added a commit that referenced this issue Oct 28, 2020
…sts - this will help supporting S3 and GCS (#2014)

* Stop using Click to check if the user provided file exists - this will help supporting S3 and GCS - see #2006

* Modified the wording of the prompt the CLI uses to ask users for the path of a data file for a Pandas datasource to emphasize that s3a:// and gs:// paths are ok too.

* Updated the changelog
eugmandel commented

Implemented. The CLI now accepts filesystem, "s3a://" and "gs://" paths as valid inputs when prompting for a data file for a Pandas Datasource.
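
For illustration, here is a minimal sketch of the kind of check this implies. The helper name is_acceptable_data_path and the REMOTE_SCHEMES set are hypothetical and are not taken from the Great Expectations code base. Paths with an s3a or gs scheme skip the local existence check and are left for the reader (for example pandas) to resolve, while anything else must be an existing local file.

```python
import os
from urllib.parse import urlparse

# Schemes named in the comment above; the set itself is an illustrative assumption.
REMOTE_SCHEMES = {"s3a", "gs"}


def is_acceptable_data_path(path: str) -> bool:
    """Hypothetical helper: accept remote URLs as-is, require local files to exist."""
    scheme = urlparse(path).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        # A remote object cannot be checked with os.path; defer to the reader.
        return True
    return os.path.isfile(path)


print(is_acceptable_data_path("gs://my-bucket/data/foo.csv"))
print(is_acceptable_data_path("no/such/local/file.csv"))
```

The trade-off in this sketch is that a mistyped remote URL is only detected later, when the file is actually read, rather than at the prompt.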

alexsherstinsky pushed a commit to alexsherstinsky/great_expectations that referenced this issue Feb 19, 2021
…sts - this will help supporting S3 and GCS (great-expectations#2014)

* Stop using Click to check if the user provided file exists - this will help supporting S3 and GCS - see great-expectations#2006

* Modified the wording of the prompt the CLI uses to ask users for the path of a data file for a Pandas datasource to emphasize that s3a:// and gs:// paths are ok too.

* Updated the changelog