Merge pull request #635 from great-expectations/feature/custom_types
CustomTypes Doc Updates
jcampbell committed Aug 19, 2019
2 parents f7a7848 + db45460 commit e1de421
Showing 15 changed files with 95 additions and 26 deletions.
40 changes: 40 additions & 0 deletions docs/core_concepts/custom_expectations.rst
@@ -185,4 +185,44 @@ A similar approach works for the command-line tool.
dataset_class=custom_dataset.CustomPandasDataset
Using custom expectations with a DataSource
--------------------------------------------------------------------------------
To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource
configuration or in the batch_kwargs for a specific batch. Continuing the example above, let's suppose you've defined
`CustomPandasDataset` in a module called `custom_dataset.py`. You can configure your datasource to return instances
of your custom DataAsset type by passing in a :ref:`ClassConfig` that describes your source.
If you are working with a DataContext, simply placing `custom_dataset.py` in your configured plugin directory will make it
accessible; otherwise, you need to ensure the module is on the import path.
Once you do this, all the functionality of your new expectations will be available. For example, you could use
the datasource snippet below to configure a PandasDatasource that will produce instances of your new
CustomPandasDataset in a DataContext.
.. code-block:: yaml

    datasources:
      my_datasource:
        type: pandas  # class_name: PandasDatasource
        data_asset_type:
          module_name: custom_dataset
          class_name: CustomPandasDataset
        generators:
          default:
            type: subdir_reader  # class_name: SubdirReaderGenerator
            base_directory: /data
            reader_options:
              sep: \t
.. code-block:: python

    >> import great_expectations as ge
    >> context = ge.DataContext()
    >> my_df = context.get_batch("my_datasource/default/my_file")
    >> my_df.expect_column_values_to_equal_1("all_twos")
    {
      "success": False,
      "unexpected_list": [2,2,2,2,2,2,2,2]
    }
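For reference, here is a minimal sketch of what the ``custom_dataset.py`` module referenced above might contain. It assumes the column-map expectation pattern used for custom expectations earlier in this document; the decorator, imports, and implementation details are illustrative, not part of this diff.

.. code-block:: python

    from great_expectations.dataset import PandasDataset, MetaPandasDataset


    class CustomPandasDataset(PandasDataset):

        # Returns a boolean Series marking which values in the column meet the
        # expectation; Great Expectations aggregates the result into success,
        # unexpected_list, and related fields.
        @MetaPandasDataset.column_map_expectation
        def expect_column_values_to_equal_1(self, column):
            return column.map(lambda value: value == 1)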
4 changes: 2 additions & 2 deletions docs/core_concepts/data_context.rst
@@ -11,13 +11,13 @@ as well as managed expectation suites should be stored in version control.

DataContexts use data sources you're already familiar with. Generators help introspect data stores and data execution
frameworks (such as airflow, Nifi, dbt, or dagster) to describe and produce batches of data ready for analysis. This
enables fetching, validation, profiling, and documentation of your data in a way that is meaningful within your
existing infrastructure and work environment.

DataContexts use a datasource-based namespace, where each accessible type of data has a three-part
normalized *data_asset_name*, consisting of *datasource/generator/generator_asset*.
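To make the three-part namespace concrete, a fully normalized name can be passed straight to ``get_batch``; the names below are placeholders, not part of this diff.

.. code-block:: python

    import great_expectations as ge

    context = ge.DataContext()
    # "my_datasource" is the datasource, "default" the generator, and
    # "events" the generator_asset in the normalized data_asset_name.
    batch = context.get_batch("my_datasource/default/events")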

- The datasource actually connects to a source of materialized data and returns Great Expectations DataAssets \
- The datasource actually connects to a source of data and returns Great Expectations DataAssets \
connected to a compute environment and ready for validation.

- The Generator knows how to introspect datasources and produce identifying "batch_kwargs" that define \
13 changes: 7 additions & 6 deletions docs/core_concepts/datasource.rst
@@ -3,12 +3,10 @@
Datasources
============

Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.

Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
Pandas DataFrame.
Datasources are responsible for connecting data and compute infrastructure. Each Datasource provides
Great Expectations DataAssets (or batches in a DataContext) connected to a specific compute environment, such as a
SQL database, a Spark cluster, or a local in-memory Pandas DataFrame. Datasources know how to access data from
relevant sources such as an existing object from a DAG runner, a SQL database, an S3 bucket, or a local filesystem.

To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
@@ -23,6 +21,9 @@ a SqlAlchemyDataset corresponding to that batch of data and ready for validation
Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.

When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter
to configure the datasource to load and return DataAssets of the custom type.

See :ref:`batch_generator` for more detail about how batch generators interact with datasources and DAG runners.

See datasource module docs :ref:`datasource_module` for more detail about available datasources.
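As a sketch of the data_asset_type parameter described above, the snippet below constructs a PandasDatasource directly. It assumes a ``CustomPandasDataset`` defined in a ``custom_dataset`` module and that ``ClassConfig`` accepts ``module_name`` and ``class_name`` keyword arguments, mirroring the YAML configuration elsewhere in this changeset; in practice the equivalent settings usually live in a DataContext configuration, and additional reader or generator options would normally be supplied as well.

.. code-block:: python

    from great_expectations.datasource.pandas_source import PandasDatasource
    from great_expectations.types import ClassConfig

    # Batches produced by this datasource will be instances of CustomPandasDataset,
    # so the custom expectations defined there are available on every batch.
    datasource = PandasDatasource(
        "my_datasource",
        data_asset_type=ClassConfig(
            module_name="custom_dataset",
            class_name="CustomPandasDataset",
        ),
    )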
2 changes: 1 addition & 1 deletion docs/getting_started/cli_init.rst
@@ -120,7 +120,7 @@ Datasources allow you to configure connections to data to evaluate Expectations.
2. Relational databases via SQLAlchemy
3. Spark DataFrames

Therefore, a Datasource could be a local pandas environment with some configuration to parse CSV files from a directory; a connection to a PostgreSQL instance; a Spark cluster connected to an S3 bucket; etc. In the future, we plan to add support for other compute environments, such as dask and BigQuery. (If you'd like to use or contribute to those environments, please chime in on `GitHub issues <https://github.com/great-expectations/great_expectations/issues>`_.)
Therefore, a Datasource could be a local pandas environment with some configuration to parse CSV files from a directory; a connection to a PostgreSQL instance; a Spark cluster connected to an S3 bucket; etc. In the future, we plan to add support for other compute environments, such as dask. (If you'd like to use or contribute to those environments, please chime in on `GitHub issues <https://github.com/great-expectations/great_expectations/issues>`_.)

Our example project has a ``data/`` folder containing several CSVs. Within the CLI, we can configure a Pandas DataFrame Datasource like so:

7 changes: 5 additions & 2 deletions docs/guides/batch_generator.rst
@@ -13,8 +13,11 @@ the Events table with a timestamp on February 7, 2012," which a SqlAlchemyDatasource
could use to materialize a SqlAlchemyDataset corresponding to that batch of data and
ready for validation.

A batch is a sample from a data asset, sliced according to a particular rule. For
example, an hourly slide of the Events table or “most recent `users` records.”
Batch
------

A batch is a sample from a data asset, sliced according to a particular rule.
For example, an hourly slice of the Events table or the “most recent `users` records.”
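To make "batch_kwargs" concrete, the dictionaries below show the general shape of the identifying information a generator might emit; the exact keys depend on the generator and datasource in use, so treat these as illustrative rather than definitive.

.. code-block:: python

    # A filesystem-oriented generator might identify a batch by path...
    file_batch_kwargs = {"path": "/data/my_file.csv"}

    # ...while a query-oriented generator might identify it by the SQL that
    # selects the slice, e.g. the Events table on February 7, 2012.
    query_batch_kwargs = {
        "query": "SELECT * FROM events WHERE event_date = '2012-02-07'"
    }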

A Batch is the primary unit of validation in the Great Expectations DataContext.
Batches include metadata that identifies how they were constructed--the same “batch_kwargs”
10 changes: 6 additions & 4 deletions docs/roadmap_and_changelog/changelog.rst
@@ -1,21 +1,23 @@
.. _changelog:


v.0.7.7
-----------------
* Fix databricks generator (thanks @sspitz3!)
* Add support for reader_method = "delta" to SparkDFDatasource
* Standardize the way that plugin module loading works. DataContext will begin to use the new-style class and plugin
identification moving forward; yml configs should specify class_name and module_name (with module_name optional for
GE types). For now, it is possible to use the "type" parameter in configuration (as before).
* Add support for custom data_asset_type to all datasources
* Add support for strict_min and strict_max to inequality-based expectations to allow strict inequality checks
(thanks @RoyalTS!)
* Add support for reader_method = "delta" to SparkDFDatasource
* Fix databricks generator (thanks @sspitz3!)
* Fix several memory and performance issues in SparkDFDataset.
- Use only distinct value count instead of bringing values to driver
- Migrate away from UDF for set membership, nullity, and regex expectations
* Fix several UI issues in the data_documentation
- Broken link on Home
- Scroll follows navigation properly
* Add support for strict_min and strict_max to inequality-based expectations to allow strict inequality checks
(thanks @RoyalTS!)


v.0.7.6
-----------------
1 change: 0 additions & 1 deletion great_expectations/data_context/types/__init__.py
@@ -1 +0,0 @@
from .configurations import ClassConfig
16 changes: 9 additions & 7 deletions great_expectations/datasource/datasource.py
@@ -10,7 +10,7 @@

from ..data_context.util import NormalizedDataAssetName
from great_expectations.exceptions import BatchKwargsError
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig
from great_expectations.exceptions import InvalidConfigError
import warnings
from importlib import import_module
@@ -33,12 +33,10 @@ class ReaderMethods(Enum):


class Datasource(object):
"""Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.
Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
Pandas Dataframe.
"""Datasources are responsible for connecting data and compute infrastructure. Each Datasource provides
Great Expectations DataAssets (or batches in a DataContext) connected to a specific compute environment, such as a
SQL database, a Spark cluster, or a local in-memory Pandas DataFrame. Datasources know how to access data from
relevant sources such as an existing object from a DAG runner, a SQL database, an S3 bucket, or a local filesystem.
To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
@@ -52,6 +50,9 @@ class Datasource(object):
Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.
When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter
to configure the datasource to load and return DataAssets of the custom type.
"""

@classmethod
@@ -76,6 +77,7 @@ def __init__(self, name, type_, data_context=None, data_asset_type=None, generat
name: the name for the datasource
type_: the type of the datasource
data_context: data context to which to connect
data_asset_type (ClassConfig): the type of DataAsset to produce
generators: generators to add to the datasource
"""
self._data_context = data_context
2 changes: 1 addition & 1 deletion great_expectations/datasource/pandas_source.py
@@ -7,7 +7,7 @@
from great_expectations.datasource.generator.filesystem_path_generator import SubdirReaderGenerator, GlobReaderGenerator
from great_expectations.datasource.generator.in_memory_generator import InMemoryGenerator
from great_expectations.dataset.pandas_dataset import PandasDataset
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig
from great_expectations.exceptions import BatchKwargsError


2 changes: 1 addition & 1 deletion great_expectations/datasource/spark_source.py
@@ -9,7 +9,7 @@
from great_expectations.datasource.generator.databricks_generator import DatabricksTableGenerator
from great_expectations.datasource.generator.in_memory_generator import InMemoryGenerator

from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig

logger = logging.getLogger(__name__)

2 changes: 1 addition & 1 deletion great_expectations/datasource/sqlalchemy_source.py
@@ -6,7 +6,7 @@
from great_expectations.dataset.sqlalchemy_dataset import SqlAlchemyDataset
from .generator.query_generator import QueryGenerator
from great_expectations.exceptions import DatasourceInitializationError
from great_expectations.data_context.types import ClassConfig
from great_expectations.types import ClassConfig

logger = logging.getLogger(__name__)

4 changes: 4 additions & 0 deletions great_expectations/types/__init__.py
@@ -9,4 +9,8 @@
ExpectationSuite,
ValidationResult,
ValidationResultSuite,
)

from .configurations import (
ClassConfig
)
File renamed without changes.
2 changes: 2 additions & 0 deletions great_expectations/util.py
@@ -16,6 +16,7 @@ def _convert_to_dataset_class(df, dataset_class, expectation_suite=None, profile
"""
Convert a (pandas) dataframe to a great_expectations dataset, with (optional) expectation_suite
"""
# TODO: Refactor this method to use the new ClassConfig (module_name and class_name convention).
if expectation_suite is not None:
# Create a dataset of the new class type, and manually initialize expectations according to
# the provided expectation suite
@@ -37,6 +38,7 @@ def read_csv(
profiler=None,
*args, **kwargs
):
# TODO: Refactor this method to use the new ClassConfig (module_name and class_name convention).
df = pd.read_csv(filename, *args, **kwargs)
df = _convert_to_dataset_class(
df, dataset_class, expectation_suite, profiler)
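As a usage sketch for read_csv with a custom dataset_class (the file name is illustrative, and custom_dataset is the module described in the custom expectations documentation above):

.. code-block:: python

    import great_expectations as ge
    import custom_dataset

    # dataset_class controls which DataAsset class the resulting dataframe becomes,
    # so the custom expectations are immediately available on it.
    my_df = ge.read_csv("my_file.csv", dataset_class=custom_dataset.CustomPandasDataset)
    my_df.expect_column_values_to_equal_1("all_twos")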
16 changes: 16 additions & 0 deletions tests/test_data_asset.py
@@ -12,6 +12,22 @@
from six import PY2


def test_interactive_evaluation(dataset):
    # We should be able to enable and disable interactive evaluation

    # Default is on
    assert dataset.get_config_value("interactive_evaluation") is True
    res = dataset.expect_column_values_to_be_between("naturals", 1, 10, include_config=True)
    assert res["success"] is True

    # Disable
    dataset.set_config_value("interactive_evaluation", False)
    disable_res = dataset.expect_column_values_to_be_between("naturals", 1, 10)  # No need to explicitly include_config
    assert "success" not in disable_res

    assert res["expectation_config"] == disable_res["stored_configuration"]


def test_data_asset_name_inheritance(dataset):
    # A data_asset should have a generic type
    data_asset = ge.data_asset.DataAsset()
