
Warning

This doc is spare parts: leftover pieces of old documentation. It's potentially helpful, but may be incomplete, incorrect, or confusing.

Profiling Reference

Profiling produces a special kind of :ref:`data_docs` that is purely descriptive.

Expectations and Profiling

In order to characterize a data asset, Profiling uses an Expectation Suite. Unlike the Expectations that are typically used for data validation, these expectations do not necessarily apply any constraints; they can simply identify statistics or other data characteristics that should be evaluated and made available in GE. For example, when the BasicDatasetProfiler encounters a numeric column, it will add an expect_column_mean_to_be_between expectation, but set both min_value and max_value to None: essentially saying only that it expects a mean to exist.

{
  "expectation_type": "expect_column_mean_to_be_between",
  "kwargs": {
    "column": "rating",
    "min_value": null,
    "max_value": null
  }
}

To "profile" a datasource, therefore, the :class:`~great_expectations.profile.basic_dataset_profiler.\ BasicDatasetProfiler` included in GE will generate a large number of very loosely-specified expectations. Effectively it is asserting that the given statistic is relevant for evaluating batches of that data asset, but it is not yet sure what the statistic's value should be.

In addition to creating an expectation suite, profiling validates the suite against the data. The validation_result contains the output of that expectation suite when validated against the same batch of data. For a loosely specified expectation like the one in our example above, obtaining the observed value is the sole purpose of the expectation.

{
  "success": true,
  "result": {
    "observed_value": 4.05,
    "element_count": 10000,
    "missing_count": 0,
    "missing_percent": 0
  }
}
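As an aside (not from the original doc), the observed values can be collected from a profiling run with a few lines of Python, assuming the validation result is a plain dictionary whose results entries each carry an expectation_config and a result shaped like the example above:

# Hypothetical sketch: collecting observed values from a profiling run.
# Assumes validation_result is a dict whose "results" entries are shaped
# like the JSON example above.
for result in validation_result["results"]:
    expectation_type = result["expectation_config"]["expectation_type"]
    observed = result["result"].get("observed_value")
    if observed is not None:
        print(expectation_type, "->", observed)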

Running a profiler on a data asset is also a useful way to produce a large number of candidate expectations that you can review and selectively transfer to a new expectation suite used for validation in a pipeline. One way to do that is sketched below.
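For illustration only, a profiled expectation can be tightened around its observed value before being promoted into a validation suite. The dictionary layout below mirrors the JSON examples above, and the margin is arbitrary:

# Hypothetical sketch: promoting a loosely specified expectation into a
# real constraint. The dict mirrors the JSON examples above.
profiled = {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {"column": "rating", "min_value": None, "max_value": None},
}

observed_mean = 4.05  # taken from the profiling validation result

# Tighten the open-ended expectation around the observed value (the +/- 0.5
# margin is arbitrary) before adding it to a validation suite.
profiled["kwargs"]["min_value"] = observed_mean - 0.5
profiled["kwargs"]["max_value"] = observed_mean + 0.5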

How to Run Profiling

Run During Init

The great_expectations init command will auto-generate an example Expectation Suite using a very basic profiler that quickly glances at 1,000 rows of your data. This is not a production-ready suite; it is only meant to show examples of Expectations, many of which may not be meaningful.
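For reference, the command is simply:

great_expectations init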

Expectation Suites generated by the profiler will be saved in the configured expectations directory for Expectation Suites. The Expectation Suite name by default is the name of the profiler that generated it. Validation results will be saved in the uncommitted/validations directory by default. When profiling is complete, Great Expectations will build and launch Data Docs based on your data.

Run From Command Line

The GE command-line interface can profile a datasource:

great_expectations datasource profile DATASOURCE_NAME

As with profiling during init, the generated Expectation Suites are saved to the configured expectations directory, validation results are saved in the uncommitted/validations directory by default, and Great Expectations builds and launches Data Docs when profiling is complete.

See :ref:`data_docs` for more information.

Run From Jupyter Notebook

If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# obtain the DataContext object
context = ge.data_context.DataContext()

# load a batch from the data asset
batch = context.get_batch('ratings')

# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)

# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")
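To review the profiled suite in Data Docs, you can rebuild the docs from the same context. build_data_docs is part of the DataContext API in this era of GE, though whether the newly saved suite is rendered depends on your Data Docs configuration:

# rebuild Data Docs so the newly saved suite can be rendered
context.build_data_docs()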

Custom Profilers

Like most things in Great Expectations, Profilers are designed to be extensible. You can develop your own profiler by subclassing DatasetProfiler, or the parent DataAssetProfiler class itself. For help, advice, and ideas on developing custom profilers, please get in touch on the Great Expectations Slack channel.
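As a rough sketch only: in this era of GE, profilers expose a profile classmethod, and subclasses implement _profile, which receives a dataset and returns an expectation suite. The class below is hypothetical; check the DatasetProfiler source for the exact contract:

from great_expectations.profile.base import DatasetProfiler

class MyColumnsExistProfiler(DatasetProfiler):
    """Hypothetical profiler that only asserts each column exists."""

    @classmethod
    def _profile(cls, dataset):
        # Add one loosely specified expectation per column.
        for column in dataset.get_table_columns():
            dataset.expect_column_to_exist(column)
        # Return the accumulated expectations as a suite.
        return dataset.get_expectation_suite(suppress_warnings=True)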

Profiling Limitations

Inferring Data Types

When profiling CSV files, the profiler makes assumptions, such as treating the first line as a header. Overriding these assumptions is currently possible only when running profiling in Python, by passing extra arguments to get_batch.
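For example (a hedged sketch; the exact options depend on your datasource, and the column names here are made up), with a pandas-backed datasource the extra keyword arguments are forwarded to the underlying reader:

# Hypothetical sketch: overriding CSV header inference when loading a batch.
# Extra keyword arguments to get_batch are passed through to the reader;
# the column names below are illustrative.
batch = context.get_batch(
    'ratings',
    header=None,
    names=['user_id', 'movie_id', 'rating'],
)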

Data Samples

Since profiling and expectations are so tightly linked, getting samples of expected data requires a slightly different approach than the normal path for profiling. Stay tuned for more in this area!