[FEATURE] Directory Asset BatchDefinition API #9874

joshua-stauffer · 2024-05-03T04:08:24Z

This PR is prework for adding a fluent-style batch definition API to directory data assets. To support this, _FilePathDataAsset and its direct descendants have been refactored so that regex-based assets can use RegexPartitioners and directory-based assets can use column partitioners. SparkPartitioners have been brought back as DataframePartitioners.

Other refactors

_FilePathDataAsset has been renamed to PathDataAsset.
The great_expectations.datasource.fluent.data_asset.data_connector package has been moved out of data_asset and up into the fluent package.
Concrete implementations of Spark assets have been moved out of spark_file_path_datasource.py and into the data_asset.spark package.
Concrete implementations of pandas assets have been moved out of pandas_file_path_datasource.py and into the data_asset.pandas package.
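
The split between regex-based and directory-based assets described above can be pictured with a minimal, self-contained sketch. It is not the Great Expectations implementation: the class bodies, the `list_batches` method, and the name `DirectoryDataAsset` are assumptions made for illustration. Only the names `PathDataAsset` and `FileDataAsset` come from this PR.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from typing import Dict, List


class PathDataAsset(ABC):
    """Hypothetical stand-in for the shared base class of path-based assets."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def list_batches(self, paths: List[str]) -> List[Dict[str, str]]:
        """Return one dict of batch metadata per batch (illustrative method)."""


class FileDataAsset(PathDataAsset):
    """Regex-based asset: every file path matching the regex is its own batch."""

    def __init__(self, name: str, batching_regex: str) -> None:
        super().__init__(name)
        self.batching_regex = re.compile(batching_regex)

    def list_batches(self, paths: List[str]) -> List[Dict[str, str]]:
        batches = []
        for path in paths:
            match = self.batching_regex.fullmatch(path)
            if match:
                # Named groups in the regex become batch parameters.
                batches.append({"path": path, **match.groupdict()})
        return batches


class DirectoryDataAsset(PathDataAsset):
    """Directory-based asset (hypothetical name): the directory is read as one
    dataframe, so any further splitting has to happen on column values."""

    def __init__(self, name: str, data_directory: str) -> None:
        super().__init__(name)
        self.data_directory = data_directory.rstrip("/")

    def list_batches(self, paths: List[str]) -> List[Dict[str, str]]:
        prefix = self.data_directory + "/"
        has_files = any(path.startswith(prefix) for path in paths)
        return [{"path": self.data_directory}] if has_files else []


files = ["data/yellow_2024-01.csv", "data/yellow_2024-02.csv", "data/readme.txt"]
file_asset = FileDataAsset("taxi", r"data/yellow_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
dir_asset = DirectoryDataAsset("taxi_dir", "data")
file_batches = file_asset.list_batches(files)
dir_batches = dir_asset.list_batches(files)
```

In this sketch the file asset yields two batches keyed by `year` and `month`, while the directory asset yields a single batch for the whole directory, which is why it would need a column-based partitioner rather than a regex one.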

netlify · 2024-05-03T04:08:40Z

✅ Deploy Preview for niobium-lead-7998 canceled.

Name	Link
🔨 Latest commit	`c05501d`
🔍 Latest deploy log	https://app.netlify.com/sites/niobium-lead-7998/deploys/663553957e2e4300083997ad

codecov · 2024-05-03T17:43:23Z

Codecov Report

Attention: Patch coverage is 80.16701%, with 95 lines in your changes missing coverage. Please review.

Project coverage is 77.99%. Comparing base (bd52c5f) to head (c05501d).
Report is 1 commit behind head on develop.

Files	Patch %	Lines
...e/fluent/data_asset/path/dataframe_partitioners.py	0.00%	58 Missing ⚠️
...tasource/fluent/data_asset/path/directory_asset.py	78.18%	12 Missing ⚠️
...ns/datasource/fluent/data_asset/path/file_asset.py	93.47%	3 Missing ⚠️
...source/fluent/data_asset/path/spark/delta_asset.py	93.54%	2 Missing ⚠️
...asource/fluent/data_asset/path/spark/json_asset.py	96.00%	2 Missing ⚠️
...tasource/fluent/data_asset/path/spark/orc_asset.py	93.33%	2 Missing ⚠️
...urce/fluent/data_asset/path/spark/parquet_asset.py	93.75%	2 Missing ⚠️
...asource/fluent/data_asset/path/spark/text_asset.py	93.54%	2 Missing ⚠️
...tasource/fluent/data_asset/path/path_data_asset.py	80.00%	1 Missing ⚠️
...source/fluent/data_asset/path/spark/spark_asset.py	91.66%	1 Missing ⚠️
... and 10 more

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9874      +/-   ##
===========================================
- Coverage    78.25%   77.99%   -0.26%     
===========================================
  Files          484      495      +11     
  Lines        42394    42537     +143     
===========================================
+ Hits         33174    33176       +2     
- Misses        9220     9361     +141
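
The percentages in the diff above follow directly from the Hits and Lines rows. A small sketch of the arithmetic (the helper name `coverage_pct` is an assumption for illustration, not part of Codecov):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as the percentage of tracked lines that were hit."""
    return 100.0 * hits / lines


# Figures taken from the Coverage Diff table.
base = coverage_pct(33174, 42394)   # develop
head = coverage_pct(33176, 42537)   # this PR
delta = head - base

# Misses are simply the lines that were not hit.
base_misses = 42394 - 33174
head_misses = 42537 - 33176

summary = f"{base:.2f}% -> {head:.2f}% ({delta:+.2f})"
print(summary)  # 78.25% -> 77.99% (-0.26)
```

The drop comes almost entirely from the 143 added lines, of which only 2 net new hits were recorded.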

Flag	Coverage Δ
3.10	`64.21% <77.45%> (-0.19%)`	⬇️
3.11	`64.21% <77.45%> (-0.19%)`	⬇️
3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds	`53.86% <70.98%> (-0.16%)`	⬇️
3.11 aws_deps	`44.78% <73.06%> (-0.17%)`	⬇️
3.11 big	`55.74% <72.65%> (-0.03%)`	⬇️
3.11 databricks	`45.96% <70.77%> (-0.15%)`	⬇️
3.11 filesystem	`61.16% <73.27%> (-0.12%)`	⬇️
3.11 mssql	`48.77% <70.77%> (-0.21%)`	⬇️
3.11 mysql	`48.83% <70.77%> (-0.22%)`	⬇️
3.11 postgresql	`52.74% <70.77%> (-0.17%)`	⬇️
3.11 snowflake	`46.57% <70.77%> (-0.17%)`	⬇️
3.11 spark	`57.16% <77.03%> (-0.03%)`	⬇️
3.11 trino	`50.74% <70.77%> (-0.16%)`	⬇️
3.8	`64.23% <77.45%> (-0.18%)`	⬇️
3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds	`53.86% <70.98%> (-0.16%)`	⬇️
3.8 aws_deps	`44.80% <73.06%> (-0.17%)`	⬇️
3.8 big	`?`
3.8 databricks	`?`
3.8 filesystem	`61.18% <73.27%> (-0.12%)`	⬇️
3.8 mssql	`48.75% <70.77%> (-0.21%)`	⬇️
3.8 mysql	`48.81% <70.77%> (-0.22%)`	⬇️
3.8 postgresql	`?`
3.8 snowflake	`46.58% <70.77%> (-0.17%)`	⬇️
3.8 spark	`57.13% <77.03%> (-0.03%)`	⬇️
3.8 trino	`50.72% <70.77%> (-0.16%)`	⬇️
3.9	`64.23% <77.45%> (-0.19%)`	⬇️
cloud	`0.00% <0.00%> (ø)`
docs-basic	`49.10% <72.86%> (-0.01%)`	⬇️
docs-creds-needed	`50.22% <73.06%> (-0.02%)`	⬇️
docs-spark	`48.31% <74.11%> (-0.01%)`	⬇️

joshua-stauffer · 2024-05-03T21:03:42Z

great_expectations/datasource/fluent/data_asset/path/pandas/dynamic_assets.py

+from __future__ import annotations
+
+from typing import Type
+
+from great_expectations.datasource.fluent.data_asset.path.file_asset import FileDataAsset
+from great_expectations.datasource.fluent.dynamic_pandas import _generate_pandas_data_asset_models
+
+_PANDAS_FILE_TYPE_READER_METHOD_UNSUPPORTED_LIST = (
+    # "read_csv",
+    # "read_json",
+    # "read_excel",
+    # "read_parquet",
+    "read_clipboard",  # not path based
+    # "read_feather",
+    # "read_fwf",
+    "read_gbq",  # not path based
+    # "read_hdf",
+    # "read_html",
+    # "read_orc",
+    # "read_pickle",
+    # "read_sas",  # invalid json schema
+    # "read_spss",
+    "read_sql",  # not path based & type-name conflict
+    "read_sql_query",  # not path based
+    "read_sql_table",  # not path based
+    "read_table",  # type-name conflict
+    # "read_xml",
+)
+_FILE_PATH_ASSET_MODELS = _generate_pandas_data_asset_models(
+    FileDataAsset,
+    blacklist=_PANDAS_FILE_TYPE_READER_METHOD_UNSUPPORTED_LIST,
+    use_docstring_from_method=True,
+    skip_first_param=True,
+)
+CSVAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("csv", FileDataAsset)
+ExcelAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("excel", FileDataAsset)
+FWFAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("fwf", FileDataAsset)
+JSONAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("json", FileDataAsset)
+ORCAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("orc", FileDataAsset)
+ParquetAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("parquet", FileDataAsset)


rename module to generated_assets.py
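
The pattern in the diff above (generate one asset class per supported reader, skip a blacklist, then look each class up with a fallback) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the behavior of `_generate_pandas_data_asset_models`, whose internals are not shown in this PR excerpt. The function `generate_asset_models`, the `reader_method` attribute, and the stand-in `FileDataAsset` class are all hypothetical.

```python
from __future__ import annotations

from typing import Dict, Sequence, Tuple, Type


class FileDataAsset:
    """Hypothetical stand-in for the real base class."""

    reader_method: str = ""


def generate_asset_models(
    base: Type[FileDataAsset],
    reader_methods: Sequence[str],
    blacklist: Tuple[str, ...] = (),
) -> Dict[str, Type[FileDataAsset]]:
    """Build one subclass of `base` per non-blacklisted `read_*` method name."""
    models: Dict[str, Type[FileDataAsset]] = {}
    for method in reader_methods:
        if method in blacklist or not method.startswith("read_"):
            continue
        type_name = method[len("read_"):]  # "read_csv" -> "csv"
        class_name = f"{type_name.capitalize()}Asset"
        # type() creates the class at runtime, mirroring the generated-model idea.
        models[type_name] = type(class_name, (base,), {"reader_method": method})
    return models


# Illustrative inputs only; the real list of readers comes from pandas.
READERS = ("read_csv", "read_parquet", "read_clipboard", "read_sql")
UNSUPPORTED = ("read_clipboard", "read_sql")  # not path based

MODELS = generate_asset_models(FileDataAsset, READERS, blacklist=UNSUPPORTED)

# .get() with the base class as default keeps the module importable even if
# a reader is missing, which is the same fallback the diff uses.
CSVAsset: Type[FileDataAsset] = MODELS.get("csv", FileDataAsset)
ExcelAsset: Type[FileDataAsset] = MODELS.get("excel", FileDataAsset)
```

Here `CSVAsset` is a generated subclass, while `ExcelAsset` silently falls back to the base class because no `read_excel` entry was supplied. That fallback is convenient for imports but can hide a missing reader, so it may be worth asserting on the generated keys in tests.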

joshua-stauffer added 6 commits May 2, 2024 22:26

refactor to package

32df56b

directory and regex asset

57c9bd5

add dataframe partitioners

128f134

rename to path asset

af9a9c9

refactor spark assets to asset package

6b79b91

move data_connector out of asset

d9fcf18

joshua-stauffer added 12 commits May 3, 2024 00:13

move pandas assets into data_assets package

3f22f7e

fix imports

4d99015

fix circular imports/bad path refactors

5f5462b

add back dir override

1a2ee44

fix forward refs

5739771

fix pyi imports

190aa16

update pyi files

970c085

fix imports

06a4413

ensure parent classes define required methods, fix more imports

510c6f8

make non-private

af07644

add pyi to spark path assets

362b4f0

Revert "add pyi to spark path assets"

82890a3

This reverts commit 362b4f0.

joshua-stauffer force-pushed the f/v1-306/directory_asset branch from d0a2e95 to 82890a3 May 3, 2024 15:31

joshua-stauffer added 10 commits May 3, 2024 12:21

revert to RegexPartitioner

ad944a0

fix pydantic imports

286d795

more import fixes

cc91d8a

fix forward refs

4fd568a

up schemas for name changes

dcbbfed

rename pandas assets to dynamic

fbce3e1

rename RegexDataAsset -> FileDataAsset

7844820

refactor _FilePathDataAsset -> PathDataAsset

9473db2

remove method stub, multiple inheritance bug

d242fc9

add abc to intermediate classes

34a19a0

update schemas for rename

40809a1

joshua-stauffer added 3 commits May 3, 2024 13:54

move regex specific logic to FileDataAsset

45294dd

ensure regex specific logic lives in FileDataAsset

95cc4d3

Merge branch 'develop' into f/v1-306/directory_asset

9349bc1

joshua-stauffer marked this pull request as ready for review May 3, 2024 18:20

joshua-stauffer added 14 commits May 3, 2024 14:21

add abc

e791c3b

remove comment

be4b12a

update datasources.build_data_connector

3f00c5f

BatchRequest needs to be runtime type

af0c55a

cant use isinstance

84906b0

schema - random reordering

d286318

Revert "cant use isinstance"

2e9a322

This reverts commit 84906b0.

Revert "update datasources.build_data_connector"

5297270

This reverts commit 3f00c5f.

directory asset accepts batching_regex for now

9aec422

more schema changes

269929a

narrow type on pandas datasources

910cfdd

rename union for clarity

decba68

rename type constant

8a2f7a1

Merge branch 'develop' into f/v1-306/directory_asset

b52bf2a

joshua-stauffer commented May 3, 2024

tyler-hoffman approved these changes May 3, 2024

rename to generated_assets.py

c05501d

joshua-stauffer enabled auto-merge May 3, 2024 21:14

joshua-stauffer added this pull request to the merge queue May 3, 2024

Merged via the queue into develop with commit 3aaa855 May 3, 2024
69 checks passed

joshua-stauffer deleted the f/v1-306/directory_asset branch May 3, 2024 21:52

joshua-stauffer mentioned this pull request May 7, 2024

[FEATURE] DirectoryAsset BatchDefinition API #9888

Merged
