
Commit

Merge 830217a into 90ff14b
jcampbell committed Jul 3, 2019
2 parents 90ff14b + 830217a commit 5b217d1
Showing 193 changed files with 18,774 additions and 2,419 deletions.
7 changes: 6 additions & 1 deletion .travis.yml
@@ -1,3 +1,4 @@
# dist: xenial
language: python
os:
- linux
@@ -27,15 +28,19 @@ matrix:
# - dist: xenial
# python: 3.7
# env: PANDAS=latest
addons:
postgresql: "9.4"
services:
- postgresql
- mysql
install:
# - ./travis-java.sh
# - ./travis-java.sh
- pip install --only-binary=numpy,scipy numpy scipy
- if [ "$PANDAS" = "latest" ]; then pip install pandas; else pip install pandas==$PANDAS; fi
- pip install -r requirements-dev.txt
before_script:
- psql -c 'create database test_ci;' -U postgres
- mysql -u root --password="" -e 'create database test_ci;'
script:
- pytest --cov=great_expectations tests/
after_success:
23 changes: 8 additions & 15 deletions README.md
@@ -10,19 +10,6 @@ Great Expectations
*Always know what to expect from your data.*


Coming soon...! (Temporary notice June 2019)
--------------------------------------------------------------------------------

We're making some major revisions to the project right now, so expect a BIG update to documentation by the end of June.

In the meantime, the Great Expectations Slack channel is the best place to get up-to-date information:

https://tinyurl.com/great-expectations-slack

Teaser: the next round of revisions doesn't change the existing behavior of Great Expectations at all, but it does add tons of new support for profiling, documenting, and deploying Expectations. It significantly raises the bar for making Great Expectations fully production-ready.



What is great_expectations?
--------------------------------------------------------------------------------

@@ -46,9 +33,15 @@ To get more done with data, faster. Teams use great_expectations to
How do I get started?
--------------------------------------------------------------------------------

It's easy! Just use pip install:
It's easy!
First use pip install:

$ pip install great_expectations

Then run this command in the root directory of the project you want to try Great Expectations on:

$ pip install great_expectations
$ great_expectations init


You can also clone the repository, which includes examples of using great_expectations.

34 changes: 0 additions & 34 deletions docs/source/autoinspection.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -52,7 +52,7 @@

# General information about the project.
project = u'great_expectations'
copyright = u'2018, The Great Expectations Team'
copyright = u'2019, The Great Expectations Team'
author = u'The Great Expectations Team'

# The version info for the project you're documenting, acts as replacement for
16 changes: 16 additions & 0 deletions docs/source/contributing.rst
@@ -0,0 +1,16 @@
.. _contributing:

Contributing
==================

.. toctree::
:maxdepth: 2

Can I contribute?
-----------------

Absolutely. Yes, please. Start
`here <https://github.com/great-expectations/great_expectations/blob/develop/CONTRIBUTING.md>`__,
and don't be shy with questions!


14 changes: 14 additions & 0 deletions docs/source/core_concepts.rst
@@ -0,0 +1,14 @@
.. _core_concepts:

Core Concepts
==================

.. toctree::
:maxdepth: 2

/core_concepts/expectations
/core_concepts/validation
/core_concepts/data_context
/core_concepts/datasource
/core_concepts/custom_expectations
/core_concepts/glossary
File renamed without changes.
37 changes: 37 additions & 0 deletions docs/source/core_concepts/data_context.rst
@@ -0,0 +1,37 @@
.. _data_context:

Data Context
===================

A DataContext represents a Great Expectations project. It organizes storage and access for
expectation suites, datasources, notification settings, and data fixtures.

The DataContext is configured via a yml file stored in a directory called great_expectations; the configuration file
as well as managed expectation suites should be stored in version control.

DataContexts use data sources you're already familiar with. Generators help introspect data stores and data execution
frameworks (such as airflow, Nifi, dbt, or dagster) to describe and produce batches of data ready for analysis. This
enables fetching, validation, profiling, and documentation of your data in a way that is meaningful within your
existing infrastructure and work environment.

DataContexts use a datasource-based namespace, where each accessible type of data has a three-part
normalized *data_asset_name*, consisting of *datasource/generator/generator_asset*.

- The datasource actually connects to a source of materialized data and returns Great Expectations DataAssets \
connected to a compute environment and ready for validation.

- The generator knows how to introspect datasources and produce identifying "batch_kwargs" that define \
particular slices of data.

- The generator_asset is a specific name -- often a table name or other name familiar to users -- that \
generators can slice into batches.

An expectation suite is a collection of expectations ready to be applied to a batch of data. Since
in many projects it is useful to evaluate different expectations in different contexts--profiling
vs. testing; warning vs. error; high vs. low compute; ML model or dashboard--suites provide a namespace
option for selecting which expectations a DataContext returns.

In many simple projects, the datasource or generator name may be omitted and the DataContext will infer
the correct name when there is no ambiguity.

Similarly, if no expectation suite name is provided, the DataContext will assume the name "default".
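
As an illustration, here is a minimal sketch of this workflow. It is hypothetical: the datasource and
asset names are made up, and exact signatures may differ by version.

.. code-block:: python

    from great_expectations.data_context import DataContext

    # Load the project configuration from the great_expectations directory.
    context = DataContext("great_expectations")

    # Fetch a batch by its normalized data_asset_name
    # (datasource/generator/generator_asset); unambiguous parts may be omitted.
    batch = context.get_batch("my_postgres_db/default/events")

    # Validate against an expectation suite; the name defaults to "default".
    results = batch.validate()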
29 changes: 29 additions & 0 deletions docs/source/core_concepts/datasource.rst
@@ -0,0 +1,29 @@
.. _datasource:

Datasources
============

Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.

Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
pandas DataFrame.

To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
"batch_kwargs", that datasources can use to get individual batches of data. Generators add flexibility
in how to obtain data, such as through time-based partitioning, downsampling, or other techniques
appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents "rows in the Events
table with a timestamp on February 7, 2012," which a SqlAlchemyDatasource could use to materialize
a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
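
For illustration only, the batch_kwargs for that example might look like the following sketch (the
exact keys depend on the datasource and generator):

.. code-block:: python

    # Hypothetical batch_kwargs produced by a generator for a SqlAlchemyDatasource.
    batch_kwargs = {
        "query": "SELECT * FROM events WHERE date(timestamp) = '2012-02-07'"
    }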

Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.

See :ref:`batch_generator` for more detail about how batch generators interact with datasources and DAG runners.

See the datasource module docs, :ref:`datasource_module`, for more detail about available datasources.

@@ -125,13 +125,13 @@ You can also add notes or even structured metadata to expectations to describe t
Saving Expectations
------------------------------------------------------------------------------
At the end of your exploration, call `save_expectations` to store all Expectations from your session to your pipeline test files.
At the end of your exploration, call `save_expectation_suite` to store all Expectations from your session to your pipeline test files.
This is how you always know what to expect from your data.
.. code-block:: bash
>> my_df.save_expectations_config("my_titanic_expectations.json")
>> my_df.save_expectation_suite("my_titanic_expectations.json")
For more detail on how to control expectation output, please see :ref:`standard_arguments` and :ref:`result_format`.
@@ -58,10 +58,12 @@ Datetime and JSON parsing
Aggregate functions
--------------------------------------------------------------------------------

* :func:`expect_column_distinct_values_to_be_in_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_be_in_set>`
* :func:`expect_column_distinct_values_to_contain_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_contain_set>`
* :func:`expect_column_distinct_values_to_equal_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_equal_set>`
* :func:`expect_column_mean_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_mean_to_be_between>`
* :func:`expect_column_median_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_median_to_be_between>`
* :func:`expect_column_quantile_values_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_quantile_values_to_be_between>`
* :func:`expect_column_stdev_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_stdev_to_be_between>`
* :func:`expect_column_unique_value_count_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_unique_value_count_to_be_between>`
* :func:`expect_column_proportion_of_unique_values_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_proportion_of_unique_values_to_be_between>`
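
For example, a minimal usage sketch for one of these (the column name and bounds are illustrative):

.. code-block:: bash

    >> my_df.expect_column_mean_to_be_between("Age", min_value=20, max_value=40)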
@@ -4,16 +4,31 @@
Validation
================================================================================

Once you've constructed and stored Expectations, you can use them to validate new data.
Once you've constructed and stored Expectations, you can use them to validate new data. Validation generates a report
that details any specific deviations from expected values.

We recommend using a :ref:`data_context` to manage expectation suites and coordinate validation across runs.


Validation Result
----------------------------

The report contains information about:

- the overall success (the `success` field),
- summary statistics of the expectations (the `statistics` field), and
- the detailed results of each expectation (the `results` field).

An example report looks like the following:

.. code-block:: bash
>> import json
>> import great_expectations as ge
>> my_expectations_config = json.load(open("my_titanic_expectations.json"))
>> my_expectation_suite = json.load(open("my_titanic_expectations.json"))
>> my_df = ge.read_csv(
"./tests/examples/titanic.csv",
expectations_config=my_expectations_config
expectation_suite=my_expectation_suite
)
>> my_df.validate()
@@ -86,13 +101,6 @@ Once you've constructed and stored Expectations, you can use them to validate ne
}
}
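
Since the full example output is abbreviated above, here is a minimal sketch of the report's
top-level shape (the values are illustrative):

.. code-block:: bash

    {
      "success": false,
      "statistics": {
        "evaluated_expectations": 10,
        "successful_expectations": 9,
        "unsuccessful_expectations": 1,
        "success_percent": 90.0
      },
      "results": [
        ...
      ]
    }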
Calling great_expectations's ``validate`` method generates a JSON-formatted report.
The report contains information about:

- the overall success (the `success` field),
- summary statistics of the expectations (the `statistics` field), and
- the detailed results of each expectation (the `results` field).

Command-line validation
------------------------------------------------------------------------------
@@ -177,10 +185,15 @@ Deployment patterns

Useful deployment patterns include:

* Include validation at the end of a complex data transformation, to verify that no cases were lost, duplicated, or improperly merged.
* Include validation at the *beginning* of a script applying a machine learning model to a new batch of data, to verify that it's distributed similarly to the training and testing set.
* Automatically trigger table-level validation when new data is dropped to an FTP site or S3 bucket, and send the validation report to the uploader and bucket owner by email.
* Include validation at the end of a complex data transformation, to verify that \
no cases were lost, duplicated, or improperly merged.
* Include validation at the *beginning* of a script applying a machine learning model to a new batch of data, to \
  verify that it's distributed similarly to the training and testing set.
* Automatically trigger table-level validation when new data is dropped to an FTP site or S3 bucket, and send the \
validation report to the uploader and bucket owner by email.
* Schedule database validation jobs using cron, then capture errors and warnings (if any) and post them to Slack.
* Validate as part of an Airflow task: if Expectations are violated, raise an error and stop DAG propagation until the problem is resolved. Alternatively, you can implement expectations that raise warnings without halting the DAG.
* Validate as part of an Airflow task (see the sketch after this list): if Expectations are violated, raise an \
  error and stop DAG propagation until the problem is resolved. Alternatively, you can implement expectations that \
  raise warnings without halting the DAG.
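
As an illustration of the Airflow pattern, here is a minimal sketch using Airflow 1.x-style operators;
the DAG name, schedule, and file paths are hypothetical, and the suite file follows the earlier examples.

.. code-block:: python

    import json
    from datetime import datetime

    import great_expectations as ge
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def validate_titanic():
        # Load a stored expectation suite (file name follows the examples above).
        with open("my_titanic_expectations.json") as f:
            my_expectation_suite = json.load(f)
        df = ge.read_csv("./tests/examples/titanic.csv",
                         expectation_suite=my_expectation_suite)
        results = df.validate()
        if not results["success"]:
            # Raising here fails the task and stops downstream DAG propagation.
            raise ValueError("Great Expectations validation failed")

    dag = DAG("validate_titanic", start_date=datetime(2019, 7, 1),
              schedule_interval="@daily")

    validate_task = PythonOperator(task_id="validate_titanic",
                                   python_callable=validate_titanic,
                                   dag=dag)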

For certain deployment patterns, it may be useful to parameterize expectations, and supply evaluation parameters at validation time. See :ref:`evaluation_parameters` for more information.
For certain deployment patterns, it may be useful to parameterize expectations, and supply evaluation parameters at \
validation time. See :ref:`evaluation_parameters` for more information.
52 changes: 0 additions & 52 deletions docs/source/data_context_module.rst

This file was deleted.
