Merge pull request #635 from great-expectations/feature/custom_types
CustomTypes Doc Updates
jcampbell committed Aug 19, 2019
2 parents f7a7848 + db45460 commit e1de421
Showing 15 changed files with 95 additions and 26 deletions.
40 changes: 40 additions & 0 deletions docs/core_concepts/custom_expectations.rst
@@ -185,4 +185,44 @@ A similar approach works for the command-line tool.
dataset_class=custom_dataset.CustomPandasDataset
Using custom expectations with a DataSource
--------------------------------------------------------------------------------
To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource
configuration or in the batch_kwargs for a specific batch. Continuing the example above, let's suppose you've defined
`CustomPandasDataset` in a module called `custom_dataset.py`. You can configure your datasource to return instances
of your custom DataAsset type by passing in a :ref:`ClassConfig` that describes your source.
If you are working with a DataContext, simply placing `custom_dataset.py` in your configured plugin directory will make it
accessible; otherwise, you need to ensure the module is on the import path.
Once you do this, all the functionality of your new expectations will be available. For example, you could use
the datasource snippet below to configure a PandasDatasource that will produce instances of your new
CustomPandasDataset in a DataContext.
.. code-block:: yaml

    datasources:
      my_datasource:
        type: pandas  # class_name: PandasDatasource
        data_asset_type:
          module_name: custom_dataset
          class_name: CustomPandasDataset
        generators:
          default:
            type: subdir_reader  # class_name: SubdirReaderGenerator
            base_directory: /data
            reader_options:
              sep: \t
.. code-block:: python

    >> import great_expectations as ge
    >> context = ge.DataContext()
    >> my_df = context.get_batch("my_datasource/default/my_file")
    >> my_df.expect_column_values_to_equal_1("all_twos")
    {
      "success": False,
      "unexpected_list": [2,2,2,2,2,2,2,2]
    }
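For reference, here is a minimal sketch of what the ``custom_dataset.py`` module referenced above might contain. It assumes the column-map expectation pattern used for custom expectations earlier in this document; the decorator, imports, and implementation details are illustrative, not part of this diff.

.. code-block:: python

    from great_expectations.dataset import PandasDataset, MetaPandasDataset


    class CustomPandasDataset(PandasDataset):

        # Returns a boolean Series marking which values in the column meet the
        # expectation; Great Expectations aggregates the result into success,
        # unexpected_list, and related fields.
        @MetaPandasDataset.column_map_expectation
        def expect_column_values_to_equal_1(self, column):
            return column.map(lambda value: value == 1)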
4 changes: 2 additions & 2 deletions docs/core_concepts/data_context.rst
@@ -11,13 +11,13 @@ as well as managed expectation suites should be stored in version control.

DataContexts use data sources you're already familiar with. Generators help introspect data stores and data execution
frameworks (such as airflow, Nifi, dbt, or dagster) to describe and produce batches of data ready for analysis. This
enables fetching, validation, profiling, and documentation of your data in a way that is meaningful within your
existing infrastructure and work environment.

DataContexts use a datasource-based namespace, where each accessible type of data has a three-part
normalized *data_asset_name*, consisting of *datasource/generator/generator_asset*.
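To make the three-part namespace concrete, a fully normalized name can be passed straight to ``get_batch``; the names below are placeholders, not part of this diff.

.. code-block:: python

    import great_expectations as ge

    context = ge.DataContext()
    # "my_datasource" is the datasource, "default" the generator, and
    # "events" the generator_asset in the normalized data_asset_name.
    batch = context.get_batch("my_datasource/default/events")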

- The datasource actually connects to a source of materialized data and returns Great Expectations DataAssets \
- The datasource actually connects to a source of data and returns Great Expectations DataAssets \
connected to a compute environment and ready for validation.

- The Generator knows how to introspect datasources and produce identifying "batch_kwargs" that define \
13 changes: 7 additions & 6 deletions docs/core_concepts/datasource.rst
@@ -3,12 +3,10 @@
Datasources
============

Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.

Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
Pandas DataFrame.
Datasources are responsible for connecting data and compute infrastructure. Each Datasource provides
Great Expectations DataAssets (or batches in a DataContext) connected to a specific compute environment, such as a
SQL database, a Spark cluster, or a local in-memory Pandas DataFrame. Datasources know how to access data from
relevant sources such as an existing object from a DAG runner, a SQL database, an S3 bucket, or a local filesystem.

To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
@@ -23,6 +21,9 @@ a SqlAlchemyDataset corresponding to that batch of data and ready for validation
Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.

When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter
to configure the datasource to load and return DataAssets of the custom type.

See :ref:`batch_generator` for more detail about how batch generators interact with datasources and DAG runners.

See datasource module docs :ref:`datasource_module` for more detail about available datasources.
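As a sketch of the data_asset_type parameter described above, the snippet below constructs a PandasDatasource directly. It assumes a ``CustomPandasDataset`` defined in a ``custom_dataset`` module and that ``ClassConfig`` accepts ``module_name`` and ``class_name`` keyword arguments, mirroring the YAML configuration elsewhere in this changeset; in practice the equivalent settings usually live in a DataContext configuration, and additional reader or generator options would normally be supplied as well.

.. code-block:: python

    from great_expectations.datasource.pandas_source import PandasDatasource
    from great_expectations.types import ClassConfig

    # Batches produced by this datasource will be instances of CustomPandasDataset,
    # so the custom expectations defined there are available on every batch.
    datasource = PandasDatasource(
        "my_datasource",
        data_asset_type=ClassConfig(
            module_name="custom_dataset",
            class_name="CustomPandasDataset",
        ),
    )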
2 changes: 1 addition & 1 deletion docs/getting_started/cli_init.rst
@@ -120,7 +120,7 @@ Datasources allow you to configure connections to data to evaluate Expectations.
2. Relational databases via SQLAlchemy
3. Spark DataFrames

Therefore, a Datasource could be a local pandas environment with some configuration to parse CSV files from a directory; a connection to a PostgreSQL instance; a Spark cluster connected to an S3 bucket; etc. In the future, we plan to add support for other compute environments, such as dask and BigQuery. (If you'd like to use or contribute to those environments, please chime in on `GitHub issues <https://github.com/great-expectations/great_expectations/issues>`_.)
Therefore, a Datasource could be a local pandas environment with some configuration to parse CSV files from a directory; a connection to a PostgreSQL instance; a Spark cluster connected to an S3 bucket; etc. In the future, we plan to add support for other compute environments, such as dask. (If you'd like to use or contribute to those environments, please chime in on `GitHub issues <https://github.com/great-expectations/great_expectations/issues>`_.)

Our example project has a ``data/`` folder containing several CSVs. Within the CLI, we can configure a Pandas DataFrame Datasource like so:

7 changes: 5 additions & 2 deletions docs/guides/batch_generator.rst
@@ -13,8 +13,11 @@ the Events table with a timestamp on February 7, 2012," which a SqlAlchemyDatasource
could use to materialize a SqlAlchemyDataset corresponding to that batch of data and
ready for validation.

A batch is a sample from a data asset, sliced according to a particular rule. For
example, an hourly slide of the Events table or “most recent `users` records.”
Batch
------

A batch is a sample from a data asset, sliced according to a particular rule.
For example, an hourly slice of the Events table or the “most recent `users` records.”
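To make "batch_kwargs" concrete, the dictionaries below show the general shape of the identifying information a generator might emit; the exact keys depend on the generator and datasource in use, so treat these as illustrative rather than definitive.

.. code-block:: python

    # A filesystem-oriented generator might identify a batch by path...
    file_batch_kwargs = {"path": "/data/my_file.csv"}

    # ...while a query-oriented generator might identify it by the SQL that
    # selects the slice, e.g. the Events table on February 7, 2012.
    query_batch_kwargs = {
        "query": "SELECT * FROM events WHERE event_date = '2012-02-07'"
    }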

A Batch is the primary unit of validation in the Great Expectations DataContext.
Batches include metadata that identifies how they were constructed--the same “batch_kwargs”
10 changes: 6 additions & 4 deletions docs/roadmap_and_changelog/changelog.rst
@@ -1,21 +1,23 @@
.. _changelog:


v.0.7.7
-----------------
* Fix databricks generator (thanks @sspitz3!)
* Add support for reader_method = "delta" to SparkDFDatasource
* Standardize the way that plugin module loading works. DataContext will begin to use the new-style class and plugin
identification moving forward; yml configs should specify class_name and module_name (with module_name optional for
GE types). For now, it is possible to use the "type" parameter in configuration (as before).
* Add support for custom data_asset_type to all datasources
* Add support for strict_min and strict_max to inequality-based expectations to allow strict inequality checks
(thanks @RoyalTS!)
* Add support for reader_method = "delta" to SparkDFDatasource
* Fix databricks generator (thanks @sspitz3!)
* Fix several memory and performance issues in SparkDFDataset.
- Use only distinct value count instead of bringing values to driver
- Migrate away from UDF for set membership, nullity, and regex expectations
* Fix several UI issues in the data_documentation
- Broken link on Home
- Scroll follows navigation properly
* Add support for strict_min and strict_max to inequality-based expectations to allow strict inequality checks
(thanks @RoyalTS!)


v.0.7.6
-----------------
1 change: 0 additions & 1 deletion great_expectations/data_context/types/__init__.py
@@ -1 +0,0 @@
from .configurations import ClassConfig
16 changes: 9 additions & 7 deletions great_expectations/datasource/datasource.py
@@ -10,7 +10,7 @@

from ..data_context.util import NormalizedDataAssetName
from great_expectations.exceptions import BatchKwargsError
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig
from great_expectations.exceptions import InvalidConfigError
import warnings
from importlib import import_module
@@ -33,12 +33,10 @@ class ReaderMethods(Enum):


class Datasource(object):
"""Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.
Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
Pandas Dataframe.
"""Datasources are responsible for connecting data and compute infrastructure. Each Datasource provides
Great Expectations DataAssets (or batches in a DataContext) connected to a specific compute environment, such as a
SQL database, a Spark cluster, or a local in-memory Pandas DataFrame. Datasources know how to access data from
relevant sources such as an existing object from a DAG runner, a SQL database, an S3 bucket, or a local filesystem.
To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
@@ -52,6 +50,9 @@ class Datasource(object):
Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.
When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter
to configure the datasource to load and return DataAssets of the custom type.
"""

@classmethod
@@ -76,6 +77,7 @@ def __init__(self, name, type_, data_context=None, data_asset_type=None, generat
name: the name for the datasource
type_: the type of the datasource
data_context: data context to which to connect
data_asset_type (ClassConfig): the type of DataAsset to produce
generators: generators to add to the datasource
"""
self._data_context = data_context
2 changes: 1 addition & 1 deletion great_expectations/datasource/pandas_source.py
@@ -7,7 +7,7 @@
from great_expectations.datasource.generator.filesystem_path_generator import SubdirReaderGenerator, GlobReaderGenerator
from great_expectations.datasource.generator.in_memory_generator import InMemoryGenerator
from great_expectations.dataset.pandas_dataset import PandasDataset
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig
from great_expectations.exceptions import BatchKwargsError


2 changes: 1 addition & 1 deletion great_expectations/datasource/spark_source.py
@@ -9,7 +9,7 @@
from great_expectations.datasource.generator.databricks_generator import DatabricksTableGenerator
from great_expectations.datasource.generator.in_memory_generator import InMemoryGenerator

from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig

logger = logging.getLogger(__name__)

2 changes: 1 addition & 1 deletion great_expectations/datasource/sqlalchemy_source.py
@@ -6,7 +6,7 @@
from great_expectations.dataset.sqlalchemy_dataset import SqlAlchemyDataset
from .generator.query_generator import QueryGenerator
from great_expectations.exceptions import DatasourceInitializationError
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig

logger = logging.getLogger(__name__)

4 changes: 4 additions & 0 deletions great_expectations/types/__init__.py
@@ -9,4 +9,8 @@
ExpectationSuite,
ValidationResult,
ValidationResultSuite,
)

from .configurations import (
ClassConfig
)
File renamed without changes.
2 changes: 2 additions & 0 deletions great_expectations/util.py
@@ -16,6 +16,7 @@ def _convert_to_dataset_class(df, dataset_class, expectation_suite=None, profile
"""
Convert a (pandas) dataframe to a great_expectations dataset, with (optional) expectation_suite
"""
# TODO: Refactor this method to use the new ClassConfig (module_name and class_name convention).
if expectation_suite is not None:
# Create a dataset of the new class type, and manually initialize expectations according to
# the provided expectation suite
@@ -37,6 +38,7 @@ def read_csv(
profiler=None,
*args, **kwargs
):
# TODO: Refactor this method to use the new ClassConfig (module_name and class_name convention).
df = pd.read_csv(filename, *args, **kwargs)
df = _convert_to_dataset_class(
df, dataset_class, expectation_suite, profiler)
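As a usage sketch for read_csv with a custom dataset_class (the file name is illustrative, and custom_dataset is the module described in the custom expectations documentation above):

.. code-block:: python

    import great_expectations as ge
    import custom_dataset

    # dataset_class controls which DataAsset class the resulting dataframe becomes,
    # so the custom expectations are immediately available on it.
    my_df = ge.read_csv("my_file.csv", dataset_class=custom_dataset.CustomPandasDataset)
    my_df.expect_column_values_to_equal_1("all_twos")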
16 changes: 16 additions & 0 deletions tests/test_data_asset.py
@@ -12,6 +12,22 @@
from six import PY2


def test_interactive_evaluation(dataset):
    # We should be able to enable and disable interactive evaluation

    # Default is on
    assert dataset.get_config_value("interactive_evaluation") is True
    res = dataset.expect_column_values_to_be_between("naturals", 1, 10, include_config=True)
    assert res["success"] is True

    # Disable
    dataset.set_config_value("interactive_evaluation", False)
    disable_res = dataset.expect_column_values_to_be_between("naturals", 1, 10)  # No need to explicitly include_config
    assert "success" not in disable_res

    assert res["expectation_config"] == disable_res["stored_configuration"]


def test_data_asset_name_inheritance(dataset):
    # A data_asset should have a generic type
    data_asset = ge.data_asset.DataAsset()
