
Commit

Merge 830217a into 90ff14b
jcampbell committed Jul 3, 2019
2 parents 90ff14b + 830217a commit 5b217d1
Showing 193 changed files with 18,774 additions and 2,419 deletions.
7 changes: 6 additions & 1 deletion .travis.yml
@@ -1,3 +1,4 @@
# dist: xenial
language: python
os:
- linux
@@ -27,15 +28,19 @@ matrix:
# - dist: xenial
# python: 3.7
# env: PANDAS=latest
addons:
postgresql: "9.4"
services:
- postgresql
- mysql
install:
# - ./travis-java.sh
# - ./travis-java.sh
- pip install --only-binary=numpy,scipy numpy scipy
- if [ "$PANDAS" = "latest" ]; then pip install pandas; else pip install pandas==$PANDAS; fi
- pip install -r requirements-dev.txt
before_script:
- psql -c 'create database test_ci;' -U postgres
- mysql -u root --password="" -e 'create database test_ci;'
script:
- pytest --cov=great_expectations tests/
after_success:
23 changes: 8 additions & 15 deletions README.md
@@ -10,19 +10,6 @@ Great Expectations
*Always know what to expect from your data.*


Coming soon...! (Temporary notice June 2019)
--------------------------------------------------------------------------------

We're making some major revisions to the project right now, so expect a BIG update to documentation by the end of June.

In the meantime, the Great Expectations Slack channel is the best place to get up-to-date information:

https://tinyurl.com/great-expectations-slack

Teaser: the next round of revisions doesn't change the existing behavior of Great Expectations at all, but it does add tons of new support for profiling, documenting, and deploying Expectations. It significantly raises the bar for making Great Expectations fully production-ready.



What is great_expectations?
--------------------------------------------------------------------------------

@@ -46,9 +33,15 @@ To get more done with data, faster. Teams use great_expectations to
How do I get started?
--------------------------------------------------------------------------------

It's easy! Just use pip install:
It's easy!
First use pip install:

$ pip install great_expectations

Then run this command in the root directory of the project you want to try Great Expectations on:

$ pip install great_expectations
$ great_expectations init


You can also clone the repository, which includes examples of using great_expectations.

34 changes: 0 additions & 34 deletions docs/source/autoinspection.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -52,7 +52,7 @@

# General information about the project.
project = u'great_expectations'
copyright = u'2018, The Great Expectations Team'
copyright = u'2019, The Great Expectations Team'
author = u'The Great Expectations Team'

# The version info for the project you're documenting, acts as replacement for
16 changes: 16 additions & 0 deletions docs/source/contributing.rst
@@ -0,0 +1,16 @@
.. _contributing:

Contributing
==================

.. toctree::
:maxdepth: 2

Can I contribute?
-----------------

Absolutely. Yes, please. Start
`here <https://github.com/great-expectations/great_expectations/blob/develop/CONTRIBUTING.md>`__,
and don't be shy with questions!


14 changes: 14 additions & 0 deletions docs/source/core_concepts.rst
@@ -0,0 +1,14 @@
.. _core_concepts:

Core Concepts
==================

.. toctree::
:maxdepth: 2

/core_concepts/expectations
/core_concepts/validation
/core_concepts/data_context
/core_concepts/datasource
/core_concepts/custom_expectations
/core_concepts/glossary
File renamed without changes.
37 changes: 37 additions & 0 deletions docs/source/core_concepts/data_context.rst
@@ -0,0 +1,37 @@
.. _data_context:

Data Context
===================

A DataContext represents a Great Expectations project. It organizes storage and access for
expectation suites, datasources, notification settings, and data fixtures.

The DataContext is configured via a yml file stored in a directory called great_expectations; the configuration file
as well as managed expectation suites should be stored in version control.

DataContexts use data sources you're already familiar with. Generators help introspect data stores and data execution
frameworks (such as airflow, Nifi, dbt, or dagster) to describe and produce batches of data ready for analysis. This
enables fetching, validation, profiling, and documentation of your data in a way that is meaningful within your
existing infrastructure and work environment.

DataContexts use a datasource-based namespace, where each accessible type of data has a three-part
normalized *data_asset_name*, consisting of *datasource/generator/generator_asset*.

- The datasource actually connects to a source of materialized data and returns Great Expectations DataAssets \
connected to a compute environment and ready for validation.

- The generator knows how to introspect datasources and produce identifying "batch_kwargs" that define \
particular slices of data.

- The generator_asset is a specific name -- often a table name or other name familiar to users -- that \
generators can slice into batches.

An expectation suite is a collection of expectations ready to be applied to a batch of data. Since
in many projects it is useful to evaluate different expectations in different contexts--profiling
vs. testing; warning vs. error; high vs. low compute; ML model or dashboard--suites provide a namespace
option for selecting which expectations a DataContext returns.

In many simple projects, the datasource or generator name may be omitted and the DataContext will infer
the correct name when there is no ambiguity.

Similarly, if no expectation suite name is provided, the DataContext will assume the name "default".
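
As an illustration, here is a minimal sketch of this workflow. It is hypothetical: the datasource and
asset names are made up, and exact signatures may differ by version.

.. code-block:: python

    from great_expectations.data_context import DataContext

    # Load the project configuration from the great_expectations directory.
    context = DataContext("great_expectations")

    # Fetch a batch by its normalized data_asset_name
    # (datasource/generator/generator_asset); unambiguous parts may be omitted.
    batch = context.get_batch("my_postgres_db/default/events")

    # Validate against an expectation suite; the name defaults to "default".
    results = batch.validate()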
29 changes: 29 additions & 0 deletions docs/source/core_concepts/datasource.rst
@@ -0,0 +1,29 @@
.. _datasource:

Datasources
============

Datasources are responsible for connecting to data infrastructure. Each Datasource is a source
of materialized data, such as a SQL database, S3 bucket, or local file directory.

Each Datasource also provides access to Great Expectations data assets that are connected to
a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory
pandas DataFrame.

To bridge the gap between those worlds, Datasources interact closely with *generators*, which
are aware of a source of data and can produce identifying information, called
"batch_kwargs", that datasources can use to get individual batches of data. Generators add flexibility
in how to obtain data, such as through time-based partitioning, downsampling, or other techniques
appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents "rows in the Events
table with a timestamp on February 7, 2012," which a SqlAlchemyDatasource could use to materialize
a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
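
For illustration only, the batch_kwargs for that example might look like the following sketch (the
exact keys depend on the datasource and generator):

.. code-block:: python

    # Hypothetical batch_kwargs produced by a generator for a SqlAlchemyDatasource.
    batch_kwargs = {
        "query": "SELECT * FROM events WHERE date(timestamp) = '2012-02-07'"
    }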

Opinionated DAG managers such as airflow, dbt, prefect.io, or dagster can also act as datasources
and/or generators for a more generic datasource.

See :ref:`batch_generator` for more detail about how batch generators interact with datasources and DAG runners.

See the datasource module docs, :ref:`datasource_module`, for more detail about available datasources.

@@ -125,13 +125,13 @@ You can also add notes or even structured metadata to expectations to describe t
Saving Expectations
------------------------------------------------------------------------------
At the end of your exploration, call `save_expectations` to store all Expectations from your session to your pipeline test files.
At the end of your exploration, call `save_expectation_suite` to store all Expectations from your session to your pipeline test files.
This is how you always know what to expect from your data.
.. code-block:: bash
>> my_df.save_expectations_config("my_titanic_expectations.json")
>> my_df.save_expectation_suite("my_titanic_expectations.json")
For more detail on how to control expectation output, please see :ref:`standard_arguments` and :ref:`result_format`.
@@ -58,10 +58,12 @@ Datetime and JSON parsing
Aggregate functions
--------------------------------------------------------------------------------

* :func:`expect_column_distinct_values_to_be_in_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_be_in_set>`
* :func:`expect_column_distinct_values_to_contain_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_contain_set>`
* :func:`expect_column_distinct_values_to_equal_set <great_expectations.dataset.dataset.Dataset.expect_column_distinct_values_to_equal_set>`
* :func:`expect_column_mean_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_mean_to_be_between>`
* :func:`expect_column_median_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_median_to_be_between>`
* :func:`expect_column_quantile_values_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_quantile_values_to_be_between>`
* :func:`expect_column_stdev_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_stdev_to_be_between>`
* :func:`expect_column_unique_value_count_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_unique_value_count_to_be_between>`
* :func:`expect_column_proportion_of_unique_values_to_be_between <great_expectations.dataset.dataset.Dataset.expect_column_proportion_of_unique_values_to_be_between>`
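
For example, a minimal usage sketch for one of these (the column name and bounds are illustrative):

.. code-block:: bash

    >> my_df.expect_column_mean_to_be_between("Age", min_value=20, max_value=40)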
@@ -4,16 +4,31 @@
Validation
================================================================================

Once you've constructed and stored Expectations, you can use them to validate new data.
Once you've constructed and stored Expectations, you can use them to validate new data. Validation generates a report
that details any specific deviations from expected values.

We recommend using a :ref:`data_context` to manage expectation suites and coordinate validation across runs.


Validation Result
----------------------------

The report contains information about:

- the overall success (the `success` field),
- summary statistics of the expectations (the `statistics` field), and
- the detailed results of each expectation (the `results` field).

An example report looks like the following:

.. code-block:: bash
>> import json
>> import great_expectations as ge
>> my_expectations_config = json.load(open("my_titanic_expectations.json"))
>> my_expectation_suite = json.load(open("my_titanic_expectations.json"))
>> my_df = ge.read_csv(
"./tests/examples/titanic.csv",
expectations_config=my_expectations_config
expectation_suite=my_expectation_suite
)
>> my_df.validate()
@@ -86,13 +101,6 @@ Once you've constructed and stored Expectations, you can use them to validate ne
}
}
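
Since the full example output is abbreviated above, here is a minimal sketch of the report's
top-level shape (the values are illustrative):

.. code-block:: bash

    {
      "success": false,
      "statistics": {
        "evaluated_expectations": 10,
        "successful_expectations": 9,
        "unsuccessful_expectations": 1,
        "success_percent": 90.0
      },
      "results": [
        ...
      ]
    }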
Calling great_expectations's ``validate`` method generates a JSON-formatted report.
The report contains information about:

- the overall success (the `success` field),
- summary statistics of the expectations (the `statistics` field), and
- the detailed results of each expectation (the `results` field).

Command-line validation
------------------------------------------------------------------------------
@@ -177,10 +185,15 @@ Deployment patterns

Useful deployment patterns include:

* Include validation at the end of a complex data transformation, to verify that no cases were lost, duplicated, or improperly merged.
* Include validation at the *beginning* of a script applying a machine learning model to a new batch of data, to verify that it's distributed similarly to the training and testing set.
* Automatically trigger table-level validation when new data is dropped to an FTP site or S3 bucket, and send the validation report to the uploader and bucket owner by email.
* Include validation at the end of a complex data transformation, to verify that \
no cases were lost, duplicated, or improperly merged.
* Include validation at the *beginning* of a script applying a machine learning model to a new batch of data, to \
  verify that it's distributed similarly to the training and testing set.
* Automatically trigger table-level validation when new data is dropped to an FTP site or S3 bucket, and send the \
validation report to the uploader and bucket owner by email.
* Schedule database validation jobs using cron, then capture errors and warnings (if any) and post them to Slack.
* Validate as part of an Airflow task: if Expectations are violated, raise an error and stop DAG propagation until the problem is resolved. Alternatively, you can implement expectations that raise warnings without halting the DAG.
* Validate as part of an Airflow task (see the sketch after this list): if Expectations are violated, raise an \
  error and stop DAG propagation until the problem is resolved. Alternatively, you can implement expectations that \
  raise warnings without halting the DAG.
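
As an illustration of the Airflow pattern, here is a minimal sketch using Airflow 1.x-style operators;
the DAG name, schedule, and file paths are hypothetical, and the suite file follows the earlier examples.

.. code-block:: python

    import json
    from datetime import datetime

    import great_expectations as ge
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def validate_titanic():
        # Load a stored expectation suite (file name follows the examples above).
        with open("my_titanic_expectations.json") as f:
            my_expectation_suite = json.load(f)
        df = ge.read_csv("./tests/examples/titanic.csv",
                         expectation_suite=my_expectation_suite)
        results = df.validate()
        if not results["success"]:
            # Raising here fails the task and stops downstream DAG propagation.
            raise ValueError("Great Expectations validation failed")

    dag = DAG("validate_titanic", start_date=datetime(2019, 7, 1),
              schedule_interval="@daily")

    validate_task = PythonOperator(task_id="validate_titanic",
                                   python_callable=validate_titanic,
                                   dag=dag)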

For certain deployment patterns, it may be useful to parameterize expectations, and supply evaluation parameters at validation time. See :ref:`evaluation_parameters` for more information.
For certain deployment patterns, it may be useful to parameterize expectations, and supply evaluation parameters at \
validation time. See :ref:`evaluation_parameters` for more information.
52 changes: 0 additions & 52 deletions docs/source/data_context_module.rst

This file was deleted.
