[FEATURE] Directory Asset BatchDefinition API #9874

Merged
merged 47 commits into from
May 3, 2024

@joshua-stauffer joshua-stauffer commented May 3, 2024

This PR is prework for adding a fluent-style batch definition API to directory data assets. To support this, _FilePathDataAsset and its direct descendants have been refactored so that regex-based assets can use RegexPartitioners and directory-based assets can use column partitioners. SparkPartitioners have been brought back as DataframePartitioners.

Other refactors:

  • _FilePathDataAsset has been renamed to PathDataAsset.
  • The great_expectations.datasource.fluent.data_asset.data_connector package has been moved out of data_asset and into the fluent package.
  • Concrete implementations of Spark assets have been moved out of spark_file_path_datasource.py and into the data_asset.spark package.
  • Concrete implementations of pandas assets have been moved out of pandas_file_path_datasource.py and into the data_asset.pandas package.
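
To make the distinction between the two partitioning styles concrete, here is a minimal, library-independent sketch. It is not the Great Expectations implementation: the function names `partition_paths_by_regex` and `partition_rows_by_column`, the file names, and the row data are all hypothetical and exist only for illustration. The first function groups file paths into batches by the named groups of a regex (the file-asset case); the second groups already-loaded rows by the value of a column (the directory-asset case, where the whole directory is read as one dataframe).

```python
from __future__ import annotations

import re
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Pattern, Tuple

# Hypothetical helper: one batch per distinct combination of named-group values.
def partition_paths_by_regex(
    paths: Iterable[str], pattern: Pattern[str]
) -> Dict[Tuple[Tuple[str, str], ...], List[str]]:
    batches: Dict[Tuple[Tuple[str, str], ...], List[str]] = defaultdict(list)
    for path in paths:
        match = pattern.search(path)
        if match is None:
            continue  # paths that do not match the regex belong to no batch
        key = tuple(sorted(match.groupdict().items()))
        batches[key].append(path)
    return dict(batches)

# Hypothetical helper: one batch per distinct value of the chosen column.
def partition_rows_by_column(
    rows: Iterable[Mapping[str, object]], column: str
) -> Dict[object, List[Mapping[str, object]]]:
    batches: Dict[object, List[Mapping[str, object]]] = defaultdict(list)
    for row in rows:
        batches[row[column]].append(row)
    return dict(batches)

# Illustrative inputs (made up for this sketch).
paths = ["taxi_2024_01.csv", "taxi_2024_02.csv", "README.md"]
pattern = re.compile(r"taxi_(?P<year>\d{4})_(?P<month>\d{2})\.csv")
path_batches = partition_paths_by_regex(paths, pattern)

rows = [
    {"vendor": "A", "fare": 10.0},
    {"vendor": "B", "fare": 7.5},
    {"vendor": "A", "fare": 3.2},
]
row_batches = partition_rows_by_column(rows, "vendor")
```

In this sketch the regex style decides batch membership before any data is read, while the column style needs the data loaded first, which is presumably why the PR pairs regex partitioners with file-based assets and column partitioners with directory-based ones.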


netlify bot commented May 3, 2024

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit c05501d
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/663553957e2e4300083997ad


codecov bot commented May 3, 2024

Codecov Report

Attention: Patch coverage is 80.16701%, with 95 lines in your changes missing coverage. Please review.

Project coverage is 77.99%. Comparing base (bd52c5f) to head (c05501d).
Report is 1 commit behind head on develop.

Files | Patch % | Lines
...e/fluent/data_asset/path/dataframe_partitioners.py | 0.00% | 58 Missing ⚠️
...tasource/fluent/data_asset/path/directory_asset.py | 78.18% | 12 Missing ⚠️
...ns/datasource/fluent/data_asset/path/file_asset.py | 93.47% | 3 Missing ⚠️
...source/fluent/data_asset/path/spark/delta_asset.py | 93.54% | 2 Missing ⚠️
...asource/fluent/data_asset/path/spark/json_asset.py | 96.00% | 2 Missing ⚠️
...tasource/fluent/data_asset/path/spark/orc_asset.py | 93.33% | 2 Missing ⚠️
...urce/fluent/data_asset/path/spark/parquet_asset.py | 93.75% | 2 Missing ⚠️
...asource/fluent/data_asset/path/spark/text_asset.py | 93.54% | 2 Missing ⚠️
...tasource/fluent/data_asset/path/path_data_asset.py | 80.00% | 1 Missing ⚠️
...source/fluent/data_asset/path/spark/spark_asset.py | 91.66% | 1 Missing ⚠️
... and 10 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9874      +/-   ##
===========================================
- Coverage    78.25%   77.99%   -0.26%     
===========================================
  Files          484      495      +11     
  Lines        42394    42537     +143     
===========================================
+ Hits         33174    33176       +2     
- Misses        9220     9361     +141     
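
The headline percentages in this report can be reproduced from the hit and line counts in the diff above. The short check below is only an arithmetic sketch, not Codecov's implementation; the helper name `coverage_pct` is made up, and the patch size of 479 lines is an inference from the stated 95 missing lines and 80.16701% patch coverage, not a number the report gives.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines

# Counts taken from the coverage diff (base = develop, head = this PR).
base_pct = coverage_pct(33174, 42394)
head_pct = coverage_pct(33176, 42537)
delta = head_pct - base_pct

# Inferred, not stated in the report: total patch lines implied by
# 95 missing lines at 80.16701% patch coverage.
patch_missing = 95
patch_total = round(patch_missing / (1 - 0.8016701))
patch_pct = coverage_pct(patch_total - patch_missing, patch_total)

print(f"base {base_pct:.2f}%  head {head_pct:.2f}%  delta {delta:+.2f}%")
print(f"patch {patch_pct:.5f}% over {patch_total} lines")
```

The project-level drop comes almost entirely from new lines that are not yet exercised: lines grew by 143 while hits grew by only 2.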
Flag Coverage Δ
3.10 64.21% <77.45%> (-0.19%) ⬇️
3.11 64.21% <77.45%> (-0.19%) ⬇️
3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 53.86% <70.98%> (-0.16%) ⬇️
3.11 aws_deps 44.78% <73.06%> (-0.17%) ⬇️
3.11 big 55.74% <72.65%> (-0.03%) ⬇️
3.11 databricks 45.96% <70.77%> (-0.15%) ⬇️
3.11 filesystem 61.16% <73.27%> (-0.12%) ⬇️
3.11 mssql 48.77% <70.77%> (-0.21%) ⬇️
3.11 mysql 48.83% <70.77%> (-0.22%) ⬇️
3.11 postgresql 52.74% <70.77%> (-0.17%) ⬇️
3.11 snowflake 46.57% <70.77%> (-0.17%) ⬇️
3.11 spark 57.16% <77.03%> (-0.03%) ⬇️
3.11 trino 50.74% <70.77%> (-0.16%) ⬇️
3.8 64.23% <77.45%> (-0.18%) ⬇️
3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 53.86% <70.98%> (-0.16%) ⬇️
3.8 aws_deps 44.80% <73.06%> (-0.17%) ⬇️
3.8 big ?
3.8 databricks ?
3.8 filesystem 61.18% <73.27%> (-0.12%) ⬇️
3.8 mssql 48.75% <70.77%> (-0.21%) ⬇️
3.8 mysql 48.81% <70.77%> (-0.22%) ⬇️
3.8 postgresql ?
3.8 snowflake 46.58% <70.77%> (-0.17%) ⬇️
3.8 spark 57.13% <77.03%> (-0.03%) ⬇️
3.8 trino 50.72% <70.77%> (-0.16%) ⬇️
3.9 64.23% <77.45%> (-0.19%) ⬇️
cloud 0.00% <0.00%> (ø)
docs-basic 49.10% <72.86%> (-0.01%) ⬇️
docs-creds-needed 50.22% <73.06%> (-0.02%) ⬇️
docs-spark 48.31% <74.11%> (-0.01%) ⬇️


@joshua-stauffer joshua-stauffer marked this pull request as ready for review May 3, 2024 18:20
Comment on lines 1 to 40
from __future__ import annotations

from typing import Type

from great_expectations.datasource.fluent.data_asset.path.file_asset import FileDataAsset
from great_expectations.datasource.fluent.dynamic_pandas import _generate_pandas_data_asset_models

_PANDAS_FILE_TYPE_READER_METHOD_UNSUPPORTED_LIST = (
    # "read_csv",
    # "read_json",
    # "read_excel",
    # "read_parquet",
    "read_clipboard",  # not path based
    # "read_feather",
    # "read_fwf",
    "read_gbq",  # not path based
    # "read_hdf",
    # "read_html",
    # "read_orc",
    # "read_pickle",
    # "read_sas",  # invalid json schema
    # "read_spss",
    "read_sql",  # not path based & type-name conflict
    "read_sql_query",  # not path based
    "read_sql_table",  # not path based
    "read_table",  # type-name conflict
    # "read_xml",
)

_FILE_PATH_ASSET_MODELS = _generate_pandas_data_asset_models(
    FileDataAsset,
    blacklist=_PANDAS_FILE_TYPE_READER_METHOD_UNSUPPORTED_LIST,
    use_docstring_from_method=True,
    skip_first_param=True,
)

CSVAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("csv", FileDataAsset)
ExcelAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("excel", FileDataAsset)
FWFAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("fwf", FileDataAsset)
JSONAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("json", FileDataAsset)
ORCAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("orc", FileDataAsset)
ParquetAsset: Type[FileDataAsset] = _FILE_PATH_ASSET_MODELS.get("parquet", FileDataAsset)

rename module to generated_assets.py
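
For readers unfamiliar with the pattern in the module under review, the sketch below imitates it with stand-ins. It is an assumption-laden illustration, not the real code: `generate_asset_models_sketch` is a hypothetical replacement for `_generate_pandas_data_asset_models` (which, in the real library, builds models from pandas reader signatures), and this `FileDataAsset` is a bare placeholder class. What it shows is how a blacklist filters reader methods and how `dict.get(key, FileDataAsset)` falls back to the base class when no model was generated for a type name.

```python
from __future__ import annotations

from typing import Dict, Sequence, Type


class FileDataAsset:
    """Placeholder base class; only here to make the sketch runnable."""

    type: str = "file"


def generate_asset_models_sketch(
    base: Type[FileDataAsset],
    reader_names: Sequence[str],
    blacklist: Sequence[str],
) -> Dict[str, Type[FileDataAsset]]:
    """Hypothetical stand-in: build one subclass per non-blacklisted reader."""
    models: Dict[str, Type[FileDataAsset]] = {}
    for reader in reader_names:
        if reader in blacklist:
            continue  # e.g. readers that are not path based
        type_name = reader[len("read_"):]  # "read_csv" -> "csv"
        models[type_name] = type(
            f"{type_name.upper()}Asset", (base,), {"type": type_name}
        )
    return models


models = generate_asset_models_sketch(
    FileDataAsset,
    reader_names=("read_csv", "read_parquet", "read_table"),
    blacklist=("read_table",),
)

# A generated model exists for "csv", so .get returns the subclass.
CSVAsset: Type[FileDataAsset] = models.get("csv", FileDataAsset)
# "table" was blacklisted, so .get falls back to the base class.
TableAsset: Type[FileDataAsset] = models.get("table", FileDataAsset)
```

The fallback keeps the module importable even when a reader is unavailable, at the cost of silently exporting the base class under a more specific name.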


@joshua-stauffer joshua-stauffer added this pull request to the merge queue May 3, 2024
Merged via the queue into develop with commit 3aaa855 May 3, 2024
69 checks passed
@joshua-stauffer joshua-stauffer deleted the f/v1-306/directory_asset branch May 3, 2024 21:52