## Module: Analytics Engineering

## Lesson 4 - Part 1

### Lesson 4 agenda:

> ### 1. **Project with "Great Expectations" and PostgreSQL**;
> ### 2. **Development of the final project**.

#### Link to the form for registering the project group members:
https://forms.gle/8kCUMyV7TDZCWz5t6

#### Link to the lesson feedback form:
https://forms.gle/aD2HdXo8jfW8WqRb6

### Installing the "great_expectations" library

In [1]:
!pip install great_expectations



In [2]:
!pip show great_expectations

Name: great-expectations
Version: 0.17.21
Summary: Always know what to expect from your data.
Home-page: https://greatexpectations.io
Author: The Great Expectations Team
Author-email: team@greatexpectations.io
License: Apache-2.0
Location: c:\users\andradema\anaconda3\envs\ada\lib\site-packages
Requires: altair, Click, colorama, cryptography, Ipython, ipywidgets, jinja2, jsonpatch, jsonschema, makefun, marshmallow, mistune, nbformat, notebook, numpy, packaging, pandas, pydantic, pyparsing, python-dateutil, pytz, requests, ruamel.yaml, scipy, tqdm, typing-extensions, tzlocal, urllib3
Required-by: 


### Getting the "context"

In [3]:
import great_expectations as gx

context = gx.get_context()
print(context)

{
  "anonymous_usage_statistics": {
    "enabled": true,
    "explicit_id": true,
    "usage_statistics_url": "https://stats.greatexpectations.io/great_expectations/v1/usage_statistics",
    "explicit_url": false,
    "data_context_id": "5c874057-25d9-42c5-8158-8ea2a75b455d"
  },
  "checkpoint_store_name": "checkpoint_store",
  "config_variables_file_path": "uncommitted/config_variables.yml",
  "config_version": 3.0,
  "data_docs_sites": {
    "local_site": {
      "class_name": "SiteBuilder",
      "show_how_to_buttons": true,
      "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "uncommitted/data_docs/local_site/"
      },
      "site_index_builder": {
        "class_name": "DefaultSiteIndexBuilder"
      }
    }
  },
  "datasources": {},
  "evaluation_parameter_store_name": "evaluation_parameter_store",
  "expectations_store_name": "expectations_store",
  "fluent_datasources": {},
  "include_rendered_content": {
    "expectation_vali

### Initially, the "context" has no data sources

In [4]:
context.list_datasources()

[]

### Configuring a new PostgreSQL data source

In [5]:
# connection string for PostgreSQL
my_connection_string = (
    #"postgresql+psycopg2://<username>:<password>@<host>:<port>/<database>"
    "postgresql+psycopg2://postgres:ada@localhost:5432/ada"
)
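The connection string above hard-codes the credentials. As a minimal sketch, the same URL could be assembled from separate parts, with the user name and password percent-encoded so that special characters do not break the URL. The helper name `build_connection_string` and the `PG_*` environment variable names are assumptions made for illustration; they are not part of the lesson or of Great Expectations.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # Percent-encode the credentials so characters such as "@" or "/" stay valid.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; the defaults mirror the lesson's values.
my_connection_string = build_connection_string(
    user=os.environ.get("PG_USER", "postgres"),
    password=os.environ.get("PG_PASSWORD", "ada"),
    host=os.environ.get("PG_HOST", "localhost"),
    port=os.environ.get("PG_PORT", "5432"),
    database=os.environ.get("PG_DATABASE", "ada"),
)
print(my_connection_string)
```

With none of the variables set, this yields the same string used in the cell above.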

In [6]:
# add a new data source of type Postgres
datasource = context.sources.add_postgres(
    name="ge_datasource", connection_string=my_connection_string
)

### The list of data sources now includes Postgres

In [7]:
context.list_datasources()

[{'type': 'postgres',
  'name': 'ge_datasource',
  'connection_string': PostgresDsn('postgresql+psycopg2://postgres:ada@localhost:5432/ada', )}]
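Note that the listing above shows the password in clear text. A minimal sketch of masking it before logging or sharing notebook output, using only the standard library; `redact_password` is a hypothetical helper, not a Great Expectations function.

```python
from urllib.parse import urlsplit


def redact_password(connection_string):
    """Return the URL with the password replaced by '***' (hypothetical helper)."""
    parts = urlsplit(connection_string)
    if parts.password is None:
        return connection_string  # nothing to hide
    netloc = f"{parts.username}:***@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Rebuild the URL with the masked user-info section.
    return parts._replace(netloc=netloc).geturl()


print(redact_password("postgresql+psycopg2://postgres:ada@localhost:5432/ada"))
```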

### Adding a "data asset" to the new data source; in this case, the "ibm_prices_silver" table from the database

In [8]:
asset_name = "silver"
asset_table_name = "ibm_prices_silver"

table_asset = datasource.add_table_asset(name=asset_name, table_name=asset_table_name)

### Adding another "data asset" to the data source, this time passing a query instead of a table:

In [9]:
asset_name = "gold_filter"
asset_query = "SELECT * from ibm_prices_gold where date > '2023-09-29'"

query_asset = datasource.add_query_asset(name=asset_name, query=asset_query)
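To see which rows a query asset like this would expose, the same filter can be tried on a small in-memory table. The sketch below uses SQLite from the standard library rather than PostgreSQL, stores dates as ISO text (so `>` compares them as strings), and adds one made-up row for 2023-09-29 to show that the boundary date is excluded. All of these are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ibm_prices_gold ("
    "date TEXT, max_high REAL, min_low REAL, mean_diff_high_low REAL)"
)
rows = [
    ("2023-09-29", 140.0, 139.0, 0.5),  # made-up row, only to test the boundary
    ("2023-10-02", 141.46, 139.86, 0.70875),
    ("2023-10-03", 141.64, 139.79, 0.581438),
    ("2023-10-04", 141.33, 139.77, 0.501563),
    ("2023-10-05", 141.7, 140.19, 0.5104),
]
conn.executemany("INSERT INTO ibm_prices_gold VALUES (?, ?, ?, ?)", rows)

# Same query text as the query asset above.
asset_query = "SELECT * from ibm_prices_gold where date > '2023-09-29'"
filtered_dates = sorted(row[0] for row in conn.execute(asset_query))
print(filtered_dates)
conn.close()
```

Because the comparison is strict (`>`), the 2023-09-29 row is left out and only the four October dates remain.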

### Final result with the "data assets" created:

In [10]:
context.list_datasources()

[{'type': 'postgres',
  'name': 'ge_datasource',
  'assets': [{'name': 'silver',
    'type': 'table',
    'order_by': [],
    'batch_metadata': {},
    'table_name': 'ibm_prices_silver',
    'schema_name': None},
   {'name': 'gold_filter',
    'type': 'query',
    'order_by': [],
    'batch_metadata': {},
    'query': "SELECT * from ibm_prices_gold where date > '2023-09-29'"}],
  'connection_string': PostgresDsn('postgresql+psycopg2://postgres:ada@localhost:5432/ada', )}]

### Now that a data source and its components ("datasource" and "data asset") exist, a sample of that data, called a "Batch", can be requested:

In [98]:
my_datasource = context.get_datasource("ge_datasource")  # Postgres data source
my_table_asset = my_datasource.get_asset(asset_name="silver")  # asset for the silver table
batch_request = my_table_asset.build_batch_request()  # builds the request describing which data to fetch from the asset

### Add a new set of expectations, or "Expectation Suite"

In [12]:
context.add_or_update_expectation_suite("suite_silver")

{
  "expectation_suite_name": "suite_silver",
  "ge_cloud_id": null,
  "expectations": [],
  "data_asset_type": null,
  "meta": {
    "great_expectations_version": "0.17.21"
  }
}

### From the "Batch" sample and the "Expectation Suite", create a validator:

In [13]:
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="suite_silver",
)
validator.head()

Unnamed: 0,datetime,1__open,2__high,3__low,4__close,5__volume,diff_high_low
0,2023-10-05 19:00:00,141.52,141.52,141.11,141.5,796259,0.41
1,2023-10-05 18:00:00,141.48,141.52,140.92,141.52,798695,0.6
2,2023-10-05 17:00:00,141.3,141.5,141.01,141.01,378,0.49
3,2023-10-05 16:00:00,141.52,141.52,141.16,141.5,2591487,0.36
4,2023-10-05 15:00:00,141.64,141.7,141.3,141.52,726001,0.4


### Example with the other "asset", from the Gold table

In [32]:
gold_filter_asset = my_datasource.get_asset(asset_name="gold_filter")
batch_request_gold = gold_filter_asset.build_batch_request()

context.add_or_update_expectation_suite("suite_gold_filter")

validator = context.get_validator(
    batch_request=batch_request_gold,
    expectation_suite_name="suite_gold_filter",
)
validator.head()

Unnamed: 0,date,max_high,min_low,mean_diff_high_low
0,2023-10-02,141.46,139.86,0.70875
1,2023-10-03,141.64,139.79,0.581438
2,2023-10-04,141.33,139.77,0.501563
3,2023-10-05,141.7,140.19,0.5104


### Add a new expectation to the suite:

In [33]:
# expectation: values in the "mean_diff_high_low" column must not be null
validator.expect_column_values_to_not_be_null(column="mean_diff_high_low")

{
  "success": true,
  "result": {
    "element_count": 4,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
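The fields in the result above can be read as simple counts. The following is a simplified model in plain Python, not the library's implementation: it counts the `None` values in a list and reports them under the same names as `element_count`, `unexpected_count`, and `unexpected_percent`. The function name `check_not_null` is hypothetical; the values are the four `mean_diff_high_low` rows shown earlier.

```python
def check_not_null(values):
    """Simplified model of a not-null check (hypothetical helper)."""
    unexpected = [v for v in values if v is None]
    element_count = len(values)
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


mean_diff_high_low = [0.70875, 0.581438, 0.501563, 0.5104]
print(check_not_null(mean_diff_high_low))

# The same check with one value missing, to show a failing case.
print(check_not_null([0.70875, None, 0.501563, 0.5104]))
```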

### Save the expectation suite

In [34]:
validator.save_expectation_suite(discard_failed_expectations=False)

### From the validator, create a new checkpoint and run it

In [35]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_gold_filter",
    validator=validator
)

In [36]:
checkpoint_result = checkpoint.run()

### Repeat the whole process with the "asset" from the "silver" layer

In [37]:
silver_asset = my_datasource.get_asset(asset_name="silver")
batch_request_silver = silver_asset.build_batch_request()

context.add_or_update_expectation_suite("suite_silver")

validator = context.get_validator(
    batch_request=batch_request_silver,
    expectation_suite_name="suite_silver",
)
validator.head()

Unnamed: 0,datetime,1__open,2__high,3__low,4__close,5__volume,diff_high_low
0,2023-10-05 19:00:00,141.52,141.52,141.11,141.5,796259,0.41
1,2023-10-05 18:00:00,141.48,141.52,140.92,141.52,798695,0.6
2,2023-10-05 17:00:00,141.3,141.5,141.01,141.01,378,0.49
3,2023-10-05 16:00:00,141.52,141.52,141.16,141.5,2591487,0.36
4,2023-10-05 15:00:00,141.64,141.7,141.3,141.52,726001,0.4


### Add new expectations

In [46]:
# expectation: column '1__open' must be of type 'REAL'
validator.expect_column_values_to_be_of_type(column='1__open', type_='REAL')
# expectation: column '5__volume' must be of type 'INTEGER'
validator.expect_column_values_to_be_of_type(column='5__volume', type_='INTEGER')

# expectation: values in column 'diff_high_low' must be between 0 and 1000
validator.expect_column_values_to_be_between(
    column="diff_high_low",
    min_value=0,
    max_value=1000,
)

# expectation: values in column '5__volume' must be between 0 and 100000
validator.expect_column_values_to_be_between(
    column="5__volume",
    min_value=0,
    max_value=100000,
)

{
  "success": false,
  "result": {
    "element_count": 100,
    "unexpected_count": 65,
    "unexpected_percent": 65.0,
    "partial_unexpected_list": [
      796259,
      798695,
      2591487,
      726001,
      275617,
      256181,
      252763,
      260604,
      291643,
      159429,
      452484,
      452301,
      1543392,
      618636,
      179083,
      151139,
      185099,
      274589,
      299959,
      295922
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 65.0,
    "unexpected_percent_nonmissing": 65.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
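The failing result above can be reproduced in miniature. This is a simplified plain-Python model of a range check, not the library's implementation, applied to the five `5__volume` values from the `validator.head()` output. The name `check_between` is hypothetical, and the cap of 20 items on the partial list is an assumption chosen to match the 20 values printed above.

```python
def check_between(values, min_value, max_value, partial_limit=20):
    """Simplified model of a min/max range check (hypothetical helper)."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    element_count = len(values)
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
        "partial_unexpected_list": unexpected[:partial_limit],
    }


# The five volumes shown by validator.head() for the silver asset.
volumes = [796259, 798695, 378, 2591487, 726001]
result = check_between(volumes, min_value=0, max_value=100000)
print(result)
```

Only the value 378 falls inside the range, so four of the five sample rows are reported as unexpected.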

### Save the new expectation suite and run the new checkpoint

In [47]:
validator.save_expectation_suite(discard_failed_expectations=False)
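Passing `discard_failed_expectations=False` keeps expectations in the saved suite even if they failed their most recent validation, such as the `5__volume` range check above. A rough plain-Python model of the difference follows; `expectations_to_save` and the dictionaries are hypothetical and only mimic the behaviour, and the success flag of the first entry is illustrative because its result is not shown in the notebook.

```python
def expectations_to_save(expectations, discard_failed_expectations=True):
    """Model which expectations end up in the saved suite (hypothetical helper)."""
    if not discard_failed_expectations:
        return list(expectations)  # keep everything, including failures
    return [e for e in expectations if e["success_on_last_run"]]


suite = [
    # Illustrative flag: this result is not displayed in the notebook.
    {"column": "diff_high_low", "success_on_last_run": True},
    # Mirrors the failed range check on '5__volume'.
    {"column": "5__volume", "success_on_last_run": False},
]

kept_all = expectations_to_save(suite, discard_failed_expectations=False)
kept_passing = expectations_to_save(suite)
print(len(kept_all), len(kept_passing))
```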

In [48]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_silver",
    validator=validator
)

In [49]:
checkpoint_result = checkpoint.run()

### Split the data into several "batches" (samples) by year, month, and day

In [66]:
silver_asset = my_datasource.get_asset(asset_name="silver")
silver_asset.add_splitter_year_and_month_and_day(column_name="datetime")  # split the data into several "batches" (samples) by year, month, and day

TableAsset(name='silver', type='table', id=None, order_by=[], batch_metadata={}, splitter=SplitterYearAndMonthAndDay(column_name='datetime', method_name='split_on_year_and_month_and_day'), table_name='ibm_prices_silver', schema_name=None)
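Conceptually, a year/month/day splitter groups rows by the calendar date of the chosen column, and each group becomes one batch. The sketch below models that grouping in plain Python with a few timestamps taken from this notebook's outputs; it is not how the library runs the split against the database.

```python
from collections import defaultdict
from datetime import datetime

# A few "datetime" values that appear in the outputs of this notebook.
timestamps = [
    "2023-10-05 19:00:00",
    "2023-10-05 18:00:00",
    "2023-09-29 19:00:00",
    "2023-09-29 18:00:00",
]

# Group rows by (year, month, day); each key stands for one batch.
batches = defaultdict(list)
for ts in timestamps:
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    batches[(dt.year, dt.month, dt.day)].append(ts)

for key in sorted(batches):
    print(key, len(batches[key]))
```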

In [68]:
my_batch_request = silver_asset.build_batch_request()
batches = silver_asset.get_batch_list_from_batch_request(my_batch_request)  # returns a list with all the batches produced by the splitter defined above
batches

[Batch(datasource=PostgresDatasource(type='postgres', name='ge_datasource', id=None, assets=[TableAsset(name='silver', type='table', id=None, order_by=[], batch_metadata={}, splitter=SplitterYearAndMonthAndDay(column_name='datetime', method_name='split_on_year_and_month_and_day'), table_name='ibm_prices_silver', schema_name=None), QueryAsset(name='gold_filter', type='query', id=None, order_by=[], batch_metadata={}, splitter=None, query="SELECT * from ibm_prices_gold where date > '2023-09-29'")], connection_string=PostgresDsn('postgresql+psycopg2://postgres:ada@localhost:5432/ada', ), create_temp_table=True, kwargs={}), data_asset=TableAsset(name='silver', type='table', id=None, order_by=[], batch_metadata={}, splitter=SplitterYearAndMonthAndDay(column_name='datetime', method_name='split_on_year_and_month_and_day'), table_name='ibm_prices_silver', schema_name=None), batch_request=BatchRequest(datasource_name='ge_datasource', data_asset_name='silver', options={'year': 2023, 'month': 10, 

### Generate the profiling result for the samples (batches)

In [69]:
data_assistant_result = context.assistants.onboarding.run(
    batch_request=my_batch_request)




### Plot the profiling results

In [70]:
data_assistant_result.plot_metrics()

138 Metrics calculated, 45 Metric plots implemented
Use DataAssistantResult.metrics_by_domain to show all calculated Metrics




### Use the samples to create a new validator

In [75]:
silver_asset = my_datasource.get_asset(asset_name="silver")
batch_request_silver = silver_asset.build_batch_request()

context.add_or_update_expectation_suite("suite_silver_multiple")

validator = context.get_validator(
    batch_request=batch_request_silver,
    expectation_suite_name="suite_silver_multiple",
)
validator.head()

Unnamed: 0,datetime,1__open,2__high,3__low,4__close,5__volume,diff_high_low
0,2023-09-29 19:00:00,140.3,140.59,140.09,140.3,1109963,0.5
1,2023-09-29 18:00:00,140.3,140.48,140.1,140.3,1110560,0.38
2,2023-09-29 17:00:00,140.47,140.88,140.3,140.33,17348,0.58
3,2023-09-29 16:00:00,140.34,141.25,139.9,140.48,3579991,1.35
4,2023-09-29 15:00:00,140.42,140.65,139.97,140.34,1232057,0.68


### Create the expectations

In [76]:
# expectation: column '1__open' must be of type 'REAL'
validator.expect_column_values_to_be_of_type(column='1__open', type_='REAL')
# expectation: column '5__volume' must be of type 'INTEGER'
validator.expect_column_values_to_be_of_type(column='5__volume', type_='INTEGER')

# expectation: values in column 'diff_high_low' must be between 0 and 1000
validator.expect_column_values_to_be_between(
    column="diff_high_low",
    min_value=0,
    max_value=1000,
)

# expectation: values in column '5__volume' must be between 0 and 100000
validator.expect_column_values_to_be_between(
    column="5__volume",
    min_value=0,
    max_value=100000,
)

{
  "success": false,
  "result": {
    "element_count": 15,
    "unexpected_count": 10,
    "unexpected_percent": 66.66666666666666,
    "partial_unexpected_list": [
      1109963,
      1110560,
      3579991,
      1232057,
      511054,
      486733,
      444146,
      526265,
      673145,
      450032
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 66.66666666666666,
    "unexpected_percent_nonmissing": 66.66666666666666
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
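The percentages in this result follow directly from the counts. A small sketch that parses a trimmed copy of the JSON above and recomputes `unexpected_percent` from `unexpected_count` and `element_count`; the trimming and the variable names are choices made for illustration.

```python
import json

# Trimmed copy of the validation result shown above.
result_json = """
{
  "success": false,
  "result": {
    "element_count": 15,
    "unexpected_count": 10,
    "unexpected_percent": 66.66666666666666
  }
}
"""

result = json.loads(result_json)["result"]
recomputed = 100.0 * result["unexpected_count"] / result["element_count"]

# Compare with a tolerance, since floating-point rounding may differ slightly.
matches_reported = abs(recomputed - result["unexpected_percent"]) < 1e-9
print(round(recomputed, 2), matches_reported)
```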

### Process the new results

In [77]:
validator.save_expectation_suite(discard_failed_expectations=False)

In [78]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_silver_multiple",
    validator=validator
)

In [79]:
checkpoint_result = checkpoint.run()

### The data can be filtered using the split defined earlier

In [83]:
silver_asset = my_datasource.get_asset(asset_name="silver")

options = silver_asset.batch_request_options  # returns the options available for filtering batches
print(options)

('year', 'month', 'day')


### Create a validator for year 2023 and month 10 only, and generate a new checkpoint with just that sample

In [93]:
silver_asset = my_datasource.get_asset(asset_name="silver")

batch_request_silver = silver_asset.build_batch_request(options={'year': 2023, 'month': 10})  # select only the batches from year 2023, month 10

context.add_or_update_expectation_suite("suite_silver_multiple_query")

validator = context.get_validator(
    batch_request=batch_request_silver,
    expectation_suite_name="suite_silver_multiple_query",
)
validator.head()

Unnamed: 0,datetime,1__open,2__high,3__low,4__close,5__volume,diff_high_low
0,2023-10-05 19:00:00,141.52,141.52,141.11,141.5,796259,0.41
1,2023-10-05 18:00:00,141.48,141.52,140.92,141.52,798695,0.6
2,2023-10-05 17:00:00,141.3,141.5,141.01,141.01,378,0.49
3,2023-10-05 16:00:00,141.52,141.52,141.16,141.5,2591487,0.36
4,2023-10-05 15:00:00,141.64,141.7,141.3,141.52,726001,0.4
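Filtering with `options` can be thought of as keeping only the batches whose identifying values match every option given, while options that are left out (here, `day`) match anything. The following plain-Python model illustrates the idea; the helper `matches` is hypothetical and the batch list is only a subset of days that appear in this notebook's outputs.

```python
def matches(batch_options, requested):
    """True when every requested option equals the batch's value (hypothetical helper)."""
    return all(batch_options.get(key) == value for key, value in requested.items())


# A subset of the days that appear in this notebook's outputs.
all_batches = [
    {"year": 2023, "month": 9, "day": 29},
    {"year": 2023, "month": 10, "day": 2},
    {"year": 2023, "month": 10, "day": 3},
    {"year": 2023, "month": 10, "day": 4},
    {"year": 2023, "month": 10, "day": 5},
]

requested = {"year": 2023, "month": 10}  # "day" is omitted, so any day matches
selected = [b for b in all_batches if matches(b, requested)]
print([b["day"] for b in selected])
```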


In [94]:
# expectation: column '1__open' must be of type 'REAL'
validator.expect_column_values_to_be_of_type(column='1__open', type_='REAL')
# expectation: column '5__volume' must be of type 'INTEGER'
validator.expect_column_values_to_be_of_type(column='5__volume', type_='INTEGER')

# expectation: values in column 'diff_high_low' must be between 0 and 1000
validator.expect_column_values_to_be_between(
    column="diff_high_low",
    min_value=0,
    max_value=1000,
)

# expectation: values in column '5__volume' must be between 0 and 100000
validator.expect_column_values_to_be_between(
    column="5__volume",
    min_value=0,
    max_value=100000,
)

{
  "success": false,
  "result": {
    "element_count": 15,
    "unexpected_count": 10,
    "unexpected_percent": 66.66666666666666,
    "partial_unexpected_list": [
      796259,
      798695,
      2591487,
      726001,
      275617,
      256181,
      252763,
      260604,
      291643,
      159429
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 66.66666666666666,
    "unexpected_percent_nonmissing": 66.66666666666666
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [95]:
validator.save_expectation_suite(discard_failed_expectations=False)

In [96]:
checkpoint = context.add_or_update_checkpoint(
    name="checkpoint_silver_multiple_query",
    validator=validator
)

In [97]:
checkpoint_result = checkpoint.run()
