# DATA PROBLEM

In this tutorial, we create an Expectation Suite for the example data set. The suite asserts that every taxi ride has at least 1 passenger, based on what we see in the January 2019 data (and on what we expect about taxi rides!). We then use that Expectation Suite to catch data quality issues in the February data set.



The NYC taxi data we're going to use in this tutorial is an open data set that is updated every month. Each record corresponds to one taxi ride and contains information such as the pickup and drop-off locations, the payment amount, and the number of passengers, among others.

In this tutorial, we provide two CSV files, each with a 10,000-row sample of the Yellow Taxi Trip Records data set:

* yellow_tripdata_sample_2019-01.csv: a sample of the January 2019 taxi data

* yellow_tripdata_sample_2019-02.csv: a sample of the February 2019 taxi data
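
To make the goal concrete, here is a minimal standard-library sketch of the check the Expectation Suite will encode: every ride should carry at least 1 passenger. It does not use Great Expectations at all. The three-row CSV excerpt, the column names (`vendor_id`, `passenger_count`, `fare_amount`), and the helper `rides_without_passengers` are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical three-row excerpt; the real sample files have 10,000 rows each.
SAMPLE_CSV = """vendor_id,passenger_count,fare_amount
1,1,7.0
2,0,12.5
1,3,9.0
"""


def rides_without_passengers(csv_text, column="passenger_count"):
    """Return the rows whose passenger count is below 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row[column]) < 1]


bad_rows = rides_without_passengers(SAMPLE_CSV)
print(len(bad_rows), "ride(s) violate the 'at least 1 passenger' rule")
```

An Expectation expresses the same rule declaratively, so that it can be stored, rerun against the February file, and reported on.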

In [1]:
import great_expectations as gx
gx.__version__

'0.17.23'

## **INITIALIZING DATA CONTEXT**

In [2]:
# Empty folder for 'project_root_dir'
data_context_path = r'.\gx_project'

# Initializing File Data Context
context = gx.data_context.FileDataContext.create(project_root_dir=data_context_path)

# Verify by printing the context configuration
print(context)

{
  "anonymous_usage_statistics": {
    "data_context_id": "ed5719c9-c2e2-441d-8659-29f318aad1e6",
    "usage_statistics_url": "https://stats.greatexpectations.io/great_expectations/v1/usage_statistics",
    "explicit_id": true,
    "explicit_url": false,
    "enabled": true
  },
  "checkpoint_store_name": "checkpoint_store",
  "config_variables_file_path": "uncommitted/config_variables.yml",
  "config_version": 3.0,
  "data_docs_sites": {
    "local_site": {
      "class_name": "SiteBuilder",
      "show_how_to_buttons": true,
      "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "uncommitted/data_docs/local_site/"
      },
      "site_index_builder": {
        "class_name": "DefaultSiteIndexBuilder"
      }
    }
  },
  "datasources": {},
  "evaluation_parameter_store_name": "evaluation_parameter_store",
  "expectations_store_name": "expectations_store",
  "fluent_datasources": {
    "default_pandas_datasource": {
      "type": "panda

## **CREDENTIALS**

* Environment variables
* YAML or Secret

[Credential](https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials)

In [None]:
# Environment variables (shell commands, not Python)
# Linux/macOS (bash):
export MY_DB_PW=password
export POSTGRES_CONNECTION_STRING=postgresql://postgres:${MY_DB_PW}@localhost:5432/postgres

# Windows (cmd):
set MY_DB_PW=password
set POSTGRES_CONNECTION_STRING="postgresql://postgres:%MY_DB_PW%@localhost:5432/postgres"

In [None]:
# The password can be added as an environment variable
pg_datasource = context.sources.add_or_update_sql(
    name="my_postgres_db",
    connection_string="postgresql://postgres:${MY_DB_PW}@localhost:5432/postgres",
)

# Alternatively, the full connection string can be added as an environment variable
pg_datasource = context.sources.add_or_update_sql(
    name="my_postgres_db", connection_string="${POSTGRES_CONNECTION_STRING}"
)
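
The `${MY_DB_PW}` placeholder in the connection string is filled in from the environment when the configuration is resolved. The sketch below imitates that idea with Python's `string.Template`. It is only an illustration of the substitution, not how Great Expectations implements it, and `resolve_connection_string` and `demo_env` are hypothetical names.

```python
from string import Template


def resolve_connection_string(template_text, env):
    """Fill ${NAME} placeholders from env; raises KeyError if a name is missing."""
    return Template(template_text).substitute(env)


# Stand-in for the process environment, so the demo does not touch os.environ.
demo_env = {"MY_DB_PW": "password"}

connection_string = resolve_connection_string(
    "postgresql://postgres:${MY_DB_PW}@localhost:5432/postgres", demo_env
)
print(connection_string)
```

Passing `os.environ` instead of `demo_env` would resolve the placeholder from the real environment. Either way, the secret stays out of the notebook and out of version control.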

## **EXPECTATION STORE - SUITE**

By default, new Expectations are stored as Expectation Suites in JSON format in the expectations/ subdirectory of your gx/ folder.

They can also be stored in Amazon S3, Azure Blob Storage, GCP, the local filesystem, or Postgres.

[Documentation](https://docs.greatexpectations.io/docs/guides/setup/configuring_metadata_stores/configure_expectation_stores)
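
As a rough picture of what "stored as JSON in the expectations/ subdirectory" means, the sketch below writes a simplified suite document to a temporary folder and reads it back. The dictionary is a hand-written, reduced stand-in: the file Great Expectations actually writes contains more keys, and both the suite contents and the `passenger_count` column are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Simplified, hand-written stand-in for a stored Expectation Suite.
suite_document = {
    "expectation_suite_name": "first_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "passenger_count"},
            "meta": {},
        }
    ],
    "meta": {},
}

with tempfile.TemporaryDirectory() as project_root:
    # Mirror the default layout: <project root>/gx/expectations/<suite name>.json
    store_dir = Path(project_root) / "gx" / "expectations"
    store_dir.mkdir(parents=True)
    suite_path = store_dir / "first_suite.json"
    suite_path.write_text(json.dumps(suite_document, indent=2))

    reloaded = json.loads(suite_path.read_text())
    relative_path = suite_path.relative_to(project_root).as_posix()

print(relative_path)
```

Because the store is plain JSON on disk, suites can be reviewed and versioned like any other project file.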

From the CLI, a suite can be created with `great_expectations suite new`.

In [4]:
# Create an ExpectationSuite
# The variable is named `suite` because the cells below refer to it by that name
suite = context.add_expectation_suite(expectation_suite_name="first_suite")




In [None]:
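# Change the Store location (shell commands, run inside the gx/ folder)
mkdir shared_expectations
mv expectations/npi_expectations.json shared_expectations/
# Afterwards, the expectations_store base_directory in great_expectations.yml
# must point at shared_expectations/ so the moved suites can be found.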

### **EXPECTATIONS - Directly in the Suite**

In this case there is no Validator, Source Data, or Batch of data, so the Expectations are defined without inspecting any data directly.

In [None]:
from great_expectations.core.expectation_configuration import ExpectationConfiguration

# Create an Expectation
expectation_configuration_1 = ExpectationConfiguration(
    # Name of expectation type being added
    expectation_type="expect_table_columns_to_match_ordered_list",
    # These are the arguments of the expectation
    # The keys allowed in the dictionary are Parameters and
    # Keyword Arguments of this Expectation Type
    kwargs={
        "column_list": [
            "account_id",
            "user_id",
            "transaction_id",
            "transaction_type",
            "transaction_amt_usd",
        ]
    },
    # This is how you can optionally add a comment about this expectation.
    # It will be rendered in Data Docs.
    # See this guide for details:
    # `How to add comments to Expectations and display them in Data Docs`.
    meta={
        "notes": {
            "format": "markdown",
            "content": "Some clever comment about this expectation. **Markdown** `Supported`",
        }
    },
)
# Add the Expectation to the suite
suite.add_expectation(expectation_configuration=expectation_configuration_1)
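
Since no data is inspected at this point, it can help to see what `expect_table_columns_to_match_ordered_list` will later verify. The following plain-Python sketch captures the idea: the table's columns must equal `column_list` in both name and order. `columns_match_ordered_list` is a hypothetical helper, not part of Great Expectations.

```python
def columns_match_ordered_list(actual_columns, column_list):
    """True only if both sequences hold the same names in the same order."""
    return list(actual_columns) == list(column_list)


expected = [
    "account_id",
    "user_id",
    "transaction_id",
    "transaction_type",
    "transaction_amt_usd",
]

# Same names, but the first two columns are swapped.
reordered = ["user_id", "account_id"] + expected[2:]

print(columns_match_ordered_list(expected, expected))
print(columns_match_ordered_list(reordered, expected))
```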

In [None]:
expectation_configuration_2 = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={
        "column": "transaction_type",
        "value_set": ["purchase", "refund", "upgrade"],
    },
    # Note optional comments omitted
)
suite.add_expectation(expectation_configuration=expectation_configuration_2)
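
The idea behind `expect_column_values_to_be_in_set` can be sketched in a few lines of plain Python: collect every value that is not a member of `value_set`. Skipping `None` values is an assumption of this sketch, and `unexpected_values` is a hypothetical helper rather than a Great Expectations function.

```python
def unexpected_values(values, value_set):
    """Return the values that fall outside value_set, skipping None."""
    allowed = set(value_set)
    return [v for v in values if v is not None and v not in allowed]


# Hypothetical column contents for illustration.
transaction_types = ["purchase", "refund", None, "chargeback", "upgrade"]

outside = unexpected_values(transaction_types, ["purchase", "refund", "upgrade"])
print(outside)
```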

In [None]:
expectation_configuration_3 = ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={
        "column": "account_id",
        "mostly": 1.0,
    },
    meta={
        "notes": {
            "format": "markdown",
            "content": "Some clever comment about this expectation. **Markdown** `Supported`",
        }
    },
)
suite.add_expectation(expectation_configuration=expectation_configuration_3)

In [None]:
expectation_configuration_4 = ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={
        "column": "user_id",
        "mostly": 0.75,
    },
    meta={
        "notes": {
            "format": "markdown",
            "content": "Some clever comment about this expectation. **Markdown** `Supported`",
        }
    },
)
suite.add_expectation(expectation_configuration=expectation_configuration_4)
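
The last two configurations differ only in `mostly`: 1.0 requires every `account_id` to be non-null, while 0.75 tolerates up to a quarter of `user_id` values being null. A minimal sketch of that threshold logic, using a hypothetical helper `passes_not_null` and made-up values, could look like this:

```python
def passes_not_null(values, mostly=1.0):
    """True when the fraction of non-null values is at least `mostly`."""
    if not values:
        return True  # assumption of this sketch: an empty column passes
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


user_ids = [101, None, 103, 104]  # 3 of 4 values are non-null (0.75)

print(passes_not_null(user_ids, mostly=0.75))
print(passes_not_null(user_ids, mostly=1.0))
```

The same column therefore passes with `mostly=0.75` and fails with `mostly=1.0`, which is why the threshold should reflect how much missing data is acceptable for each column.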

In [None]:
# Persist the suite, with the Expectations added above, to the Expectation Store
context.save_expectation_suite(expectation_suite=suite)

## **DATA SOURCES**

* [Doc](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/fluent/filesystem/connect_filesystem_source_data)
* [SQL - Postgres](https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/fluent/database/connect_sql_source_data)

Reading directly into a Validator with Pandas

In [7]:
validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

ALTERNATIVE: Creating a Data Source

In [None]:
# Give your Datasource a name (replace None with a string of your choice)
datasource_name = None
datasource = context.sources.add_pandas(datasource_name)

# Give your first Asset a name and point it at your data (replace both None values)
asset_name = None
path_to_data = None
# To use the sample data, uncomment the next line
# path_to_data = "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()