[FEATURE] Implementing Python code snippets under test for "https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/fluent/filesystem/how_to_connect_to_one_or_more_files_using_spark" #7927

Merged
Changes from all commits (76 commits)
04541c5
Implementing Python code snippets under test for "https://docs.greate…
May 16, 2023
62837fa
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 16, 2023
5720118
Implementing Python code snippets under test for "https://docs.greate…
May 16, 2023
26b5b1c
Update docs/docusaurus/docs/guides/connecting_to_your_data/fluent/dat…
alexsherstinsky May 16, 2023
18dfd99
Update docs/docusaurus/docs/components/connect_to_data/filesystem/_ti…
alexsherstinsky May 16, 2023
e80621f
Update docs/docusaurus/docs/components/connect_to_data/filesystem/_ti…
alexsherstinsky May 16, 2023
32bdf0e
simplify get_context() call
May 16, 2023
27cc017
Merge remote-tracking branch 'upstream/feature/DX-469/DX-441/alexsher…
May 16, 2023
7c912b7
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 16, 2023
dbd7f0d
Implementing Python code snippets under test for "https://docs.greate…
May 16, 2023
5640ba2
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 16, 2023
3f4fabf
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 16, 2023
a677d4e
cleanup
May 16, 2023
06f3483
Merge branch 'feature/DX-469/DX-441/alexsherstinsky/link/docusaurus_d…
May 16, 2023
e529c14
clean up
May 16, 2023
b604534
Merge branch 'feature/DX-469/DX-441/alexsherstinsky/link/docusaurus_d…
May 16, 2023
cb6c1b8
clean up
May 16, 2023
17eb28d
clean up
May 16, 2023
6f88cca
merge
May 17, 2023
252e55a
Make SQL splitter API public.
May 17, 2023
498ca8d
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 17, 2023
7f96cbe
Make all Fluent Datasource/DataAsset splitter API public.
May 17, 2023
b72b9ad
Update docs/docusaurus/docs/guides/connecting_to_your_data/fluent/dat…
alexsherstinsky May 17, 2023
6a0b9d8
Update great_expectations/datasource/fluent/sqlite_datasource.py
alexsherstinsky May 17, 2023
775859d
Update great_expectations/datasource/fluent/sqlite_datasource.py
alexsherstinsky May 17, 2023
fb3f360
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
f2a4bbb
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
94c7298
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
19e0c13
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
9ddc7fd
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
973e208
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
57cea5e
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 17, 2023
935671d
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 17, 2023
526bd2b
Merge remote-tracking branch 'upstream/feature/DX-469/DX-441/alexsher…
May 17, 2023
169d348
lint
May 17, 2023
0859241
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
1e8722f
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 17, 2023
1a1e599
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
290fbdf
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
c1a9c4a
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
f99c707
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
f368ce5
Implementing Python code snippets under test for "https://docs.greate…
May 17, 2023
41159e2
cleanup
May 18, 2023
9e59dfe
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
7161960
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
d72371b
Implementing Python code snippets under test for "https://docs.greate…
May 18, 2023
31641ab
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
673930c
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
a29800e
Add "batch.columns()" convenience method to Fluent DataAsset implemen…
May 18, 2023
281acaf
Merge branch 'feature/DX-469/DX-441/alexsherstinsky/link/docusaurus_d…
May 18, 2023
56123a2
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
f1b962f
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
8489bdd
public_api
May 18, 2023
da7a0cc
Merge branch 'feature/DX-469/DX-441/alexsherstinsky/link/docusaurus_d…
May 18, 2023
7d54063
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
061b9f9
merge
May 18, 2023
37e497e
remove PANDAS dependency -- it is a given
May 18, 2023
27768d5
Merge remote-tracking branch 'upstream/develop' into feature/DX-469/D…
May 18, 2023
28d9255
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
19ee08b
Merge branch 'feature/DX-469/DX-441/alexsherstinsky/link/docusaurus_d…
May 18, 2023
421985d
public API
May 18, 2023
8e9ebc2
public API
May 18, 2023
72b6ab8
public API
May 18, 2023
4ef1bb0
get_context correction
May 18, 2023
33ace8a
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
2e375f5
get_context correction
May 18, 2023
e5d0da1
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
cf5d1bb
docstrings
May 18, 2023
6d79f2d
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
a398c7c
public API exclusions
May 18, 2023
d8763e5
Merge branch 'develop' into feature/DX-469/DX-441/alexsherstinsky/lin…
May 18, 2023
2a52e95
Update docs/docusaurus/docs/guides/connecting_to_your_data/fluent/fil…
alexsherstinsky May 18, 2023
c3c7b4b
docstrings
May 18, 2023
1d1bc4f
Merge remote-tracking branch 'upstream/feature/DX-469/DX-441/alexsher…
May 18, 2023
23f110f
typo
May 18, 2023
8f6415f
Merge develop into feature/DX-469/DX-441/alexsherstinsky/link/docusau…
github-actions[bot] May 18, 2023
@@ -50,17 +50,14 @@ A Filesystem Datasource can be created with two pieces of information:

In our example, we will define these in advance by storing them in the Python variables `datasource_name` and `path_to_folder_containing_csv_files`:

-```python title="Python code"
-datasource_name = "MyNewDatasource"
-path_to_folder_containing_csv_files = "../taxi_data"
+```python name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py define_add_spark_filesystem_args"
```

<InfoFilesystemDatasourceRelativeBasePaths />

Once we have determined our `name` and `base_directory`, we pass them in as parameters when we create our Datasource:

-```python title = "Python code"
-datasource = context.sources.add_spark_filesystem(name=datasource_name, base_path=path_to_folder_containing_csv_files)
+```python name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py create_datasource"
```

<TipFilesystemDatasourceNestedSourceDataFolders />
@@ -75,19 +72,15 @@ A Data Asset requires two pieces of information to be defined:

For this example, we will define these two values in advance by storing them in the Python variables `asset_name` and (since we are connecting to NYC taxi data in this example) `batching_regex`:

-```python title="Python code"
-name = "my_taxi_data_asset"
-batching_regex = "yellow_tripdata_sample_2023_01\.csv"
+```python name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py define_add_csv_asset_args"
```

-Once we have determined those two values, we will pass them in as parameters when we create our Data Asset:
+In addition, the argument `header` informs the Spark `DataFrame` reader that the files contain a header column, while the argument `infer_schema` instructs the Spark `DataFrame` reader to make a best effort to determine the schema of the columns automatically.

-```python title="Python code"
-data_asset = datasource.add_csv_asset(
-    name=name, batching_regex=batching_regex
-)
-```
+Once we have determined those two values as well as the optional `header` and `infer_schema` arguments, we will pass them in as parameters when we create our Data Asset:
+
+```python name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py add_asset"
+```

### 4. Repeat step 3 as needed to add additional files as Data Assets

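For reference, the named snippet fences above are populated at docs-build time from the test script added later in this PR. Assembled into one runnable sketch (the folder path is the docs placeholder, not a real path):

```python
import great_expectations as gx

context = gx.get_context()

# Values from the "define_add_spark_filesystem_args" snippet; the path is a placeholder.
datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<INSERT_PATH_TO_FILES_HERE>"

# "create_datasource" snippet.
datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)

# "define_add_csv_asset_args" and "add_asset" snippets.
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv"
datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
```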
12 changes: 12 additions & 0 deletions docs/sphinx_api_docs_source/public_api_excludes.py
@@ -687,4 +687,16 @@
"great_expectations/datasource/fluent/serializable_types/pyspark.py"
),
),
IncludeExcludeDefinition(
reason='The "columns()" property in this module is not included in the public API',
name="columns",
alexsherstinsky marked this conversation as resolved.
Show resolved Hide resolved
filepath=pathlib.Path("great_expectations/datasource/fluent/sql_datasource.py"),
),
IncludeExcludeDefinition(
reason='The "columns()" property in this module is not included in the public API',
name="columns",
filepath=pathlib.Path(
"great_expectations/datasource/fluent/spark_generic_splitters.py"
),
),
]
20 changes: 18 additions & 2 deletions great_expectations/validator/metrics_calculator.py
@@ -5,6 +5,7 @@

import pandas as pd

+from great_expectations.core._docs_decorators import public_api
from great_expectations.validator.computed_metric import MetricValue # noqa: TCH001
from great_expectations.validator.exception_info import ExceptionInfo # noqa: TCH001
from great_expectations.validator.metric_configuration import MetricConfiguration
@@ -41,9 +42,16 @@ def show_progress_bars(self) -> bool:
    def show_progress_bars(self, enable: bool) -> None:
        self._show_progress_bars = enable

+    @public_api
    def columns(self, domain_kwargs: Optional[Dict[str, Any]] = None) -> List[str]:
        """
        Convenience method to run "table.columns" metric.
+
+        Arguments:
+            domain_kwargs: Optional dictionary of domain kwargs (e.g., containing "batch_id").
+
+        Returns:
+            The list of Batch columns.
        """
        if domain_kwargs is None:
            domain_kwargs = {}
@@ -62,14 +70,22 @@ def columns(self, domain_kwargs: Optional[Dict[str, Any]] = None) -> List[str]:

        return columns

+    @public_api
    def head(
        self,
        n_rows: int = 5,
        domain_kwargs: Optional[Dict[str, Any]] = None,
        fetch_all: bool = False,
    ) -> pd.DataFrame:
-        """
-        Convenience method to run "table.head" metric.
+        """Convenience method to return the first several rows or records from a Batch of data.
+
+        Args:
+            n_rows: The number of rows to return.
+            domain_kwargs: If provided, the domain for which to return records.
+            fetch_all: If True, ignore n_rows and return the entire batch.
+
+        Returns:
+            A Pandas DataFrame containing the records' data.
        """
        if domain_kwargs is None:
            domain_kwargs = {}
18 changes: 8 additions & 10 deletions great_expectations/validator/validator.py
@@ -328,8 +328,6 @@ def get_metric(
    ) -> Any:
        """Convenience method, return the value of the requested metric.

-        (To be deprecated in favor of using methods in "MetricsCalculator" class.)
-
        Args:
            metric: MetricConfiguration
@@ -345,8 +343,6 @@ def get_metrics(
        """
        Convenience method that resolves requested metrics (specified as dictionary, keyed by MetricConfiguration ID).

-        (To be deprecated in favor of using methods in "MetricsCalculator" class.)
-
        Args:
            metrics: Dictionary of desired metrics to be resolved; metric_name is key and MetricConfiguration is value.

@@ -365,8 +361,6 @@ def compute_metrics(
        """
        Convenience method that computes requested metrics (specified as elements of "MetricConfiguration" list).

-        (To be deprecated in favor of using methods in "MetricsCalculator" class.)
-
        Args:
            metric_configurations: List of desired MetricConfiguration objects to be resolved.
            runtime_configuration: Additional run-time settings (see "Validator.DEFAULT_RUNTIME_CONFIGURATION").
@@ -381,11 +375,15 @@
            min_graph_edges_pbar_enable=min_graph_edges_pbar_enable,
        )

+    @public_api
    def columns(self, domain_kwargs: Optional[Dict[str, Any]] = None) -> List[str]:
-        """
-        Convenience method to obtain Batch columns.
+        """Convenience method to obtain Batch columns.

+        Arguments:
+            domain_kwargs: Optional dictionary of domain kwargs (e.g., containing "batch_id").
+
-        (To be deprecated in favor of using methods in "MetricsCalculator" class.)
+        Returns:
+            The list of Batch columns.
        """
        return self._metrics_calculator.columns(domain_kwargs=domain_kwargs)

@@ -396,7 +394,7 @@ def head(
        domain_kwargs: Optional[Dict[str, Any]] = None,
        fetch_all: bool = False,
    ) -> pd.DataFrame:
-        """Return the first several rows or records from a Batch of data.
+        """Convenience method to return the first several rows or records from a Batch of data.

        Args:
            n_rows: The number of rows to return.
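Since `columns()` and `head()` are now decorated with `@public_api`, both can be called directly on a `Validator`. A minimal usage sketch follows; the helper function below is hypothetical and not part of this diff, and only the two method calls are taken from it:

```python
from great_expectations.validator.validator import Validator


def inspect_batch(validator: Validator) -> None:
    """Hypothetical helper: print a Batch's column names and a small preview."""
    print(validator.columns())  # resolves the "table.columns" metric
    print(validator.head(n_rows=3))  # Pandas DataFrame with the first 3 records
```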
@@ -52,4 +52,30 @@
# </snippet>

assert datasource.get_asset_names() == {"my_taxi_data_asset"}
-assert datasource.get_asset(asset_name).name == "my_taxi_data_asset"
+
+my_asset = datasource.get_asset(asset_name)
+assert my_asset
+
+my_batch_request = my_asset.build_batch_request({"year": "2019", "month": "03"})
+batches = my_asset.get_batch_list_from_batch_request(my_batch_request)
+assert len(batches) == 1
+assert set(batches[0].columns()) == {
+    "vendor_id",
+    "pickup_datetime",
+    "dropoff_datetime",
+    "passenger_count",
+    "trip_distance",
+    "rate_code_id",
+    "store_and_fwd_flag",
+    "pickup_location_id",
+    "dropoff_location_id",
+    "payment_type",
+    "fare_amount",
+    "extra",
+    "mta_tax",
+    "tip_amount",
+    "tolls_amount",
+    "improvement_surcharge",
+    "total_amount",
+    "congestion_surcharge",
+}
83 changes: 83 additions & 0 deletions tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py
@@ -0,0 +1,83 @@
"""
To run this code as a local test, use the following console command:
```
pytest -v --docs-tests -m integration -k "how_to_connect_to_one_or_more_files_using_spark" tests/integration/test_script_runner.py
```
"""
import pathlib


# Python
# <snippet name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py get_context">
import great_expectations as gx

context = gx.get_context()
# </snippet>

# Python
# <snippet name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py define_add_spark_filesystem_args">
datasource_name = "my_new_datasource"
path_to_folder_containing_csv_files = "<INSERT_PATH_TO_FILES_HERE>"
# </snippet>

path_to_folder_containing_csv_files = str(
    pathlib.Path(
        gx.__file__,
        "..",
        "..",
        "tests",
        "test_sets",
        "taxi_yellow_tripdata_samples",
    ).resolve(strict=True)
)

# Python
# <snippet name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py create_datasource">
datasource = context.sources.add_spark_filesystem(
    name=datasource_name, base_directory=path_to_folder_containing_csv_files
)
# </snippet>

assert datasource_name in context.datasources

# Python
# <snippet name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py define_add_csv_asset_args">
asset_name = "my_taxi_data_asset"
batching_regex = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2}).csv"
# </snippet>

# Python
# <snippet name="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py add_asset">
datasource.add_csv_asset(
    name=asset_name, batching_regex=batching_regex, header=True, infer_schema=True
)
# </snippet>

assert datasource.get_asset_names() == {"my_taxi_data_asset"}

my_asset = datasource.get_asset(asset_name)
assert my_asset

my_batch_request = my_asset.build_batch_request({"year": "2019", "month": "03"})
batches = my_asset.get_batch_list_from_batch_request(my_batch_request)
assert len(batches) == 1
assert set(batches[0].columns()) == {
    "vendor_id",
    "pickup_datetime",
    "dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "rate_code_id",
    "store_and_fwd_flag",
    "pickup_location_id",
    "dropoff_location_id",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount",
    "congestion_surcharge",
}
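Because `batching_regex` captures `year` and `month` as named groups, other files in the sample folder can be requested by passing different option values. A sketch assuming a 2019-02 sample file also exists in the test data set:

```python
# Hypothetical follow-on to the script above: request a different month.
batch_request_feb = my_asset.build_batch_request({"year": "2019", "month": "02"})
batches_feb = my_asset.get_batch_list_from_batch_request(batch_request_feb)
print(len(batches_feb), [batch.columns() for batch in batches_feb])
```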
@@ -16,7 +16,7 @@
# <snippet name="tests/integration/docusaurus/reference/glossary/batch_request batch_request">
import great_expectations as gx

-context = gx.data_context.FileDataContext.create(full_path_to_project_directory)
+context = gx.get_context()

# data_directory is the full path to a directory containing csv files
datasource = context.sources.add_pandas_filesystem(
10 changes: 7 additions & 3 deletions tests/integration/test_script_runner.py
@@ -323,7 +323,6 @@
        user_flow_script="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/get_existing_data_asset_from_existing_datasource_pandas_filesystem_example.py",
        data_context_dir="tests/integration/fixtures/no_datasources/great_expectations",
        data_dir="tests/test_sets/taxi_yellow_tripdata_samples/first_3_files",
-        backend_dependencies=[BackendDependencies.PANDAS],
    ),
    IntegrationTestFixture(
        name="checkpoints_glossary",
@@ -336,7 +335,6 @@
        user_flow_script="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/organize_batches_in_pandas_filesystem_datasource.py",
        data_context_dir="tests/integration/fixtures/no_datasources/great_expectations",
        data_dir="tests/test_sets/taxi_yellow_tripdata_samples/first_3_files",
-        backend_dependencies=[BackendDependencies.PANDAS],
    ),
    IntegrationTestFixture(
        name="how_to_organize_batches_in_a_sql_based_data_asset",
@@ -348,7 +346,13 @@
        user_flow_script="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_pandas.py",
        data_context_dir="tests/integration/fixtures/no_datasources/great_expectations",
        data_dir="tests/test_sets/taxi_yellow_tripdata_samples/first_3_files",
-        backend_dependencies=[BackendDependencies.PANDAS],
    ),
+    IntegrationTestFixture(
+        name="how_to_connect_to_one_or_more_files_using_spark",
+        user_flow_script="tests/integration/docusaurus/connecting_to_your_data/fluent_datasources/how_to_connect_to_one_or_more_files_using_spark.py",
+        data_context_dir="tests/integration/fixtures/no_datasources/great_expectations",
+        data_dir="tests/test_sets/taxi_yellow_tripdata_samples/first_3_files",
+        backend_dependencies=[BackendDependencies.SPARK],
+    ),
]
