[FEATURE] Marshmallow schema for Rule Based Profiler #3982

Merged
Changes from 40 commits
47 commits
ce6acf4
feat: start impl
cdkini Jan 11, 2022
705a4f8
feat: start implementing the config classes
cdkini Jan 11, 2022
69cdcd9
feat: complete first pass
cdkini Jan 11, 2022
67777df
chore: add TODOs
cdkini Jan 11, 2022
a4b9d05
feat: add batch request to schema
cdkini Jan 12, 2022
ba615df
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 12, 2022
2583279
refactor: use DictDot instead of SerializableDictDot
cdkini Jan 12, 2022
e4b0b5d
test: start writing tests
cdkini Jan 12, 2022
3deffaa
test: write initial round of tests for builders
cdkini Jan 12, 2022
b1a1fc6
feat: add logging support
cdkini Jan 12, 2022
c74581e
test: modify unsuccessfuly load tests
cdkini Jan 12, 2022
897fbc8
test: start tests for Rule and RuleBasedProfiler schemas
cdkini Jan 12, 2022
3c82e62
test: finish initial round of tests
cdkini Jan 12, 2022
00eda10
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 12, 2022
0031d37
feat: add default module names if missing
cdkini Jan 12, 2022
735ec28
test: write tests for dump
cdkini Jan 12, 2022
25c3c51
feat: remove nulls with post_dump
cdkini Jan 12, 2022
89fdc4f
feat: add logging to null removals
cdkini Jan 12, 2022
339c254
feat: subclass Schema to remove nulls
cdkini Jan 12, 2022
6157464
refactor: use dataclass impl after discussion with Don
cdkini Jan 12, 2022
71c2a16
chore: add additional logging stmt
cdkini Jan 12, 2022
d0e9364
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 12, 2022
ff402aa
chore: revert logging stmt
cdkini Jan 12, 2022
b953a37
docs: write docstrs
cdkini Jan 12, 2022
966fcc0
chore: add addl comment
cdkini Jan 12, 2022
b504d2b
refactor: change import stmt used with dataclasses
cdkini Jan 12, 2022
ffb4e21
chore: use filter_properties_dict instead of hand-rolled method
cdkini Jan 12, 2022
2ec1d3b
refactor: use existing filter_properties_dict method to clean data
cdkini Jan 12, 2022
5faec87
fix: clean up namespace collision with 'fields'
cdkini Jan 12, 2022
f33bc2c
docs: add ref to Marshmallow docs
cdkini Jan 13, 2022
db55de7
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 13, 2022
9555235
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 13, 2022
10a05aa
feat: finish implementation after 1st review
cdkini Jan 13, 2022
24ea32e
chore: remove frozen nature of RBP config class
cdkini Jan 13, 2022
68fa86c
chore: add type hints
cdkini Jan 13, 2022
8f78f6a
chore: misc updates per Alex's review
cdkini Jan 13, 2022
18e78a9
chore: ensure config is 1.0
cdkini Jan 13, 2022
a91f102
chore: add comment about methods per Alex
cdkini Jan 13, 2022
4e66184
Merge branch 'develop' into feature/great-464/great-481/marshmallow-s…
cdkini Jan 13, 2022
e6f72ec
Merge branch 'develop' into feature/great-464/great-481/marshmallow-s…
cdkini Jan 13, 2022
5d66b11
Merge branch 'develop' into feature/great-464/great-481/marshmallow-s…
cdkini Jan 13, 2022
4f149dd
Merge branch 'develop' of github.com:great-expectations/great_expecta…
cdkini Jan 13, 2022
0cbd156
refactor: rename __config__ to __config_class__
cdkini Jan 13, 2022
21291e4
Merge branch 'feature/great-464/great-481/marshmallow-schema-for-rule…
cdkini Jan 13, 2022
66ef0eb
feat: misc changes per call with Alex and Don
cdkini Jan 13, 2022
6d04ffb
Merge branch 'develop' into feature/great-464/great-481/marshmallow-s…
cdkini Jan 14, 2022
ebdd551
Merge branch 'develop' into feature/great-464/great-481/marshmallow-s…
cdkini Jan 14, 2022
2 changes: 1 addition & 1 deletion great_expectations/data_context/types/base.py
@@ -58,7 +58,7 @@ def object_to_yaml_str(obj):
 class BaseYamlConfig(SerializableDictDot):
     _config_schema_class = None

-    def __init__(self, commented_map: CommentedMap = None):
+    def __init__(self, commented_map: Optional[CommentedMap] = None):
         if commented_map is None:
             commented_map = CommentedMap()
         self._commented_map = commented_map
231 changes: 231 additions & 0 deletions great_expectations/rule_based_profiler/config/base.py
@@ -0,0 +1,231 @@
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Type

from ruamel.yaml.comments import CommentedMap

from great_expectations.data_context.types.base import BaseYamlConfig
from great_expectations.marshmallow__shade import INCLUDE, Schema, fields, post_load
from great_expectations.marshmallow__shade.decorators import post_dump
from great_expectations.types import DictDot
from great_expectations.util import filter_properties_dict


class NotNullSchema(Schema):
    """
    Extension of Marshmallow Schema to facilitate implicit removal of null values before serialization.

    The __config__ attribute is utilized to point a Schema to a configuration. It is the responsibility
    of the child class to define its own __config__ to ensure proper serialization/deserialization.

    Reference: https://marshmallow.readthedocs.io/en/stable/extending.html

    """

    @post_load
    def make_config(self, data: dict, **kwargs) -> DictDot:
        """Hook to convert the schema object into its respective config type.

        Checks against config dataclass signature to ensure that unidentified kwargs are omitted
        from the result object. This design allows us to maintain forward compatibility without
        altering expected behavior.

        Args:
            data: The dictionary representation of the configuration object
            kwargs: Marshmallow-specific kwargs required to maintain hook signature (unused herein)

        Returns:
            An instance of configuration class, which subclasses the DictDot serialization class

        Raises:
            NotImplementedError: If the subclass inheriting NotNullSchema fails to define a __config__

        """
        if not hasattr(self, "__config__"):
            raise NotImplementedError(
                "The subclass extending NotNullSchema must define its own custom __config__"
            )

        # Drop keys that are not fields of the config dataclass before creating the config object
        recognized_attrs = {f.name for f in dataclasses.fields(self.__config__)}
        cleaned_data = filter_properties_dict(
            properties=data,
            keep_fields=recognized_attrs,
            clean_nulls=False,
            clean_falsy=False,
        )

        return self.__config__(**cleaned_data)

    @post_dump
    def remove_nulls(self, data: dict, **kwargs) -> dict:
        """Hook to clear the config object of any null values before being written as a dictionary.

        Args:
            data: The dictionary representation of the configuration object
            kwargs: Marshmallow-specific kwargs required to maintain hook signature (unused herein)

        Returns:
            A cleaned dictionary that has no null values

        """
        cleaned_data = filter_properties_dict(
            properties=data,
            clean_nulls=True,
            clean_falsy=False,
        )
        return cleaned_data
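
The two hooks above boil down to two dictionary filters: on load, keep only the keys that are fields of the config dataclass; on dump, drop keys whose value is `None`. A minimal stdlib-only sketch of that idea follows. `ExampleConfig`, `keep_known_fields`, and `drop_nulls` are hypothetical names used for illustration, and they only approximate what `filter_properties_dict` does in this PR.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ExampleConfig:  # hypothetical stand-in for a config dataclass
    class_name: str
    module_name: Optional[str] = None


def keep_known_fields(config_class: type, data: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys that are declared fields of the dataclass."""
    recognized = {f.name for f in dataclasses.fields(config_class)}
    return {k: v for k, v in data.items() if k in recognized}


def drop_nulls(data: Dict[str, Any]) -> Dict[str, Any]:
    """Drop top-level keys whose value is None; falsy values such as 0 are kept."""
    return {k: v for k, v in data.items() if v is not None}


# "extra" is not a field of ExampleConfig, so it is filtered out before construction.
loaded = {"class_name": "MyBuilder", "module_name": None, "extra": 1}
config = ExampleConfig(**keep_known_fields(ExampleConfig, loaded))

# On the way out, module_name=None is removed from the serialized dictionary.
dumped = drop_nulls(dataclasses.asdict(config))
```

Filtering against `dataclasses.fields` rather than a hand-written key list means the accepted keys follow the dataclass definition automatically.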


@dataclass(frozen=True)
class DomainBuilderConfig(DictDot):
    class_name: str
    module_name: Optional[str] = None
    batch_request: Optional[Dict[str, Any]] = None


class DomainBuilderConfigSchema(NotNullSchema):
    class Meta:
        unknown = INCLUDE

    __config__ = DomainBuilderConfig

    class_name = fields.String(required=True)
    module_name = fields.String(
        required=False,
        allow_none=True,
        missing="great_expectations.rule_based_profiler.domain_builder",
    )
    batch_request = fields.Dict(keys=fields.String(), required=False, allow_none=True)


@dataclass(frozen=True)
class ParameterBuilderConfig(DictDot):
    name: str
    class_name: str
    module_name: Optional[str] = None
    batch_request: Optional[Dict[str, Any]] = None


class ParameterBuilderConfigSchema(NotNullSchema):
    class Meta:
        unknown = INCLUDE

    __config__ = ParameterBuilderConfig

    name = fields.String(required=True)
    class_name = fields.String(required=True)
    module_name = fields.String(
        required=False,
        allow_none=True,
        missing="great_expectations.rule_based_profiler.parameter_builder",
    )
    batch_request = fields.Dict(keys=fields.String(), required=False, allow_none=True)


@dataclass(frozen=True)
class ExpectationConfigurationBuilderConfig(DictDot):
    expectation_type: str
    class_name: str
    module_name: Optional[str] = None
    mostly: Optional[float] = None
    meta: Optional[Dict] = None


class ExpectationConfigurationBuilderConfigSchema(NotNullSchema):
    class Meta:
        unknown = INCLUDE

    __config__ = ExpectationConfigurationBuilderConfig

    class_name = fields.String(required=True)
    module_name = fields.String(
        required=False,
        allow_none=True,
        missing="great_expectations.rule_based_profiler.expectation_configuration_builder",
    )
    expectation_type = fields.String(required=True)
    mostly = fields.Float(required=False, allow_none=True)
    meta = fields.Dict(required=False, allow_none=True)


@dataclass(frozen=True)
class RuleConfig(DictDot):
    name: str
    domain_builder: DomainBuilderConfig
    parameter_builders: List[ParameterBuilderConfig]
    expectation_configuration_builders: List[ExpectationConfigurationBuilderConfig]


class RuleConfigSchema(NotNullSchema):
    class Meta:
        unknown = INCLUDE

    __config__ = RuleConfig

    name = fields.String(required=True)
    domain_builder = fields.Nested(DomainBuilderConfigSchema, required=True)
    parameter_builders = fields.List(
        cls_or_instance=fields.Nested(ParameterBuilderConfigSchema, required=True),
        required=True,
    )
    expectation_configuration_builders = fields.List(
        cls_or_instance=fields.Nested(
            ExpectationConfigurationBuilderConfigSchema, required=True
        ),
        required=True,
    )
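
For orientation, the rule schema above expects a dictionary shaped roughly like the one below. This is a sketch only: the rule name, the builder class names, the expectation type, and the `missing_keys` helper are illustrative assumptions rather than part of the PR. The required keys are taken from the `required=True` fields in the schemas.

```python
from typing import Any, Dict, List

# Required keys per config kind, read off the required=True fields above.
REQUIRED: Dict[str, List[str]] = {
    "rule": [
        "name",
        "domain_builder",
        "parameter_builders",
        "expectation_configuration_builders",
    ],
    "domain_builder": ["class_name"],
    "parameter_builder": ["name", "class_name"],
    "expectation_configuration_builder": ["class_name", "expectation_type"],
}


def missing_keys(kind: str, data: Dict[str, Any]) -> List[str]:
    """Return the required keys for `kind` that are absent from `data`."""
    return [key for key in REQUIRED[kind] if key not in data]


# Illustrative rule; all names and values are placeholders.
rule = {
    "name": "my_rule",
    "domain_builder": {"class_name": "MyDomainBuilder"},
    "parameter_builders": [
        {"name": "my_parameter", "class_name": "MyParameterBuilder"},
    ],
    "expectation_configuration_builders": [
        {
            "class_name": "MyExpectationConfigurationBuilder",
            "expectation_type": "expect_column_values_to_not_be_null",
            "mostly": 0.95,
        },
    ],
}

problems = missing_keys("rule", rule)
```

`module_name` is left out on purpose: each schema supplies a default module path through `missing=...` when the key is absent.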


@dataclass
class RuleBasedProfilerConfig(BaseYamlConfig):
    name: str
    config_version: float
    rules: Dict[str, RuleConfig]
    variables: Optional[Dict[str, Any]] = None
    commented_map: Optional[CommentedMap] = None

    def __post_init__(self):
        # Required to fully set up the commented map and enable serialization
        super().__init__(commented_map=self.commented_map)

    @classmethod
    def get_config_class(cls) -> Type["RuleBasedProfilerConfig"]:
        return cls

    @classmethod
    def get_schema_class(cls) -> Type["RuleBasedProfilerConfigSchema"]:
        return RuleBasedProfilerConfigSchema

    # TODO(cdkini): Implement custom methods to ensure proper var substitution

    # def dump(self, obj: Any, *, many: Optional[bool] = None) -> dict:
    #     pass

    # def __repr__(self):
    #     pass

    # def __deepcopy__(self):
    #     pass


class RuleBasedProfilerConfigSchema(NotNullSchema):
    class Meta:
        unknown = INCLUDE

    __config__ = RuleBasedProfilerConfig

    name = fields.String(required=True)
    config_version = fields.Float(
        required=True,
        validate=lambda x: x == 1.0,
        # error_messages must map an error key to a message; "validator_failed" is the key
        # Marshmallow uses when a validator returns False
        error_messages={
            "validator_failed": "Invalid: config version is not supported; it must be 1.0 per the current version of Great Expectations"
        },
    )
    variables = fields.Dict(keys=fields.String(), required=False, allow_none=True)
    rules = fields.Dict(
        keys=fields.String(),
        values=fields.Nested(RuleConfigSchema, required=True),
        required=True,
    )
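
A note on the `error_messages` argument of `config_version`: Marshmallow expects a mapping from error key to message text. The stdlib-only sketch below assumes that a field merges these overrides into its default messages with `dict.update`, and shows why a set literal holding a single string cannot be used there. `default_messages` and `message` are hypothetical stand-ins, not names from the PR.

```python
from typing import Dict

# Stand-in for a field's default messages (illustrative only).
default_messages: Dict[str, str] = {"validator_failed": "Invalid value."}

message = "config version is not supported; it must be 1.0"

# A set literal has no keys, so dict.update cannot interpret it as key/value pairs.
try:
    dict(default_messages).update({message})
    merged_from_set = True
except ValueError:
    merged_from_set = False

# A mapping from error key to message merges cleanly and overrides the default.
merged = dict(default_messages)
merged.update({"validator_failed": message})
```

With the mapping form, a failed `validate` callable reports the custom message instead of the generic default.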