### Data Quality - Silver Table: Quarterly Payments

#### Data Quality with Great Expectations

We will use Great Expectations to validate, in a standardized way, the ranges and values of the columns in our DataFrame. The checks to cover are:

1. Check for duplicates

2. Check for unique values in columns

3. Check for missing values

4. Categorical value distributions

5. Schema validation

6. Temporal consistency check

7. Cross-field validation

8. Dependency check
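
To make two of these checks concrete before wiring them into Great Expectations, here is a minimal plain-Python sketch of a duplicate check (item 1) and a missing-value check with a tolerance (item 3). The sample rows and both helper functions are hypothetical and exist only for illustration. The sketch assumes a duplicate check that flags every row whose key occurs more than once, and a 90% non-null threshold.

```python
from collections import Counter

# Hypothetical sample rows; the column names mirror the Silver table,
# but the values are made up for illustration.
rows = [
    {"datatrimestre": "2024-06-30", "quantidadePix": 10.0},
    {"datatrimestre": "2024-09-30", "quantidadePix": None},
    {"datatrimestre": "2024-09-30", "quantidadePix": 12.5},
    {"datatrimestre": "2024-12-31", "quantidadePix": 15.0},
]


def duplicated_rows(rows, column):
    """Return every row whose value in `column` occurs more than once."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] > 1]


def non_null_fraction(rows, column):
    """Return the share of rows where `column` is not None."""
    if not rows:
        return 1.0
    present = sum(1 for row in rows if row[column] is not None)
    return present / len(rows)


dupes = duplicated_rows(rows, "datatrimestre")
fraction = non_null_fraction(rows, "quantidadePix")

print(len(dupes))        # number of rows sharing a datatrimestre value
print(fraction >= 0.90)  # whether a 90% non-null threshold would pass
```

Counting every occurrence of a repeated key, rather than only the extra copies, means a column in which all values repeat reports 100% of its rows as unexpected.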

In [5]:
# IMPORTS AND LIBRARIES

import os 

import great_expectations as gx
import great_expectations.expectations as gxe

In [9]:

# Creating the GX context (ephemeral)
context = gx.get_context(mode="ephemeral")


# Data Source
data_source_name = "pagamentos_trimestrais_data"

data_source = context.data_sources.add_spark(name= data_source_name)


# Asset
data_asset_name = "pagamentos_trimestrais_asset"
data_asset = data_source.add_dataframe_asset(name = data_asset_name)

# Batch Definition
batch_definition_name = "full_pagamentos_trimestrais_batch"
batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)

# Creating the Expectation Suite
expectation_suite_name = "pagamentos_trimestrais_suite"
expectation_suite_ref = gx.ExpectationSuite(name = expectation_suite_name )
expectation_suite = context.suites.add(expectation_suite_ref)



#### Loading the Silver table

In [7]:
# PARAMETERS


os.environ["MINIO_KEY"] = "developer"
os.environ["MINIO_SECRET"] = "developer01"
os.environ["MINIO_ENDPOINT"] = "http://minio:9000"


bucket_name = "bank-databr"

# Paths Data Storage
root_path_dir = f"{bucket_name}"
silver_path_dir = f"{root_path_dir}/silver"
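
The cell above writes the MinIO settings straight into `os.environ`, which overwrites anything already configured. As a hedged alternative, the sketch below keeps the same development values as defaults but lets an existing environment variable win. The names `DEV_DEFAULTS` and `load_minio_settings` are hypothetical helpers, not part of the notebook.

```python
import os

# Development defaults taken from this notebook. setdefault keeps any value
# already present in the environment instead of overwriting it.
DEV_DEFAULTS = {
    "MINIO_KEY": "developer",
    "MINIO_SECRET": "developer01",
    "MINIO_ENDPOINT": "http://minio:9000",
}


def load_minio_settings(defaults=DEV_DEFAULTS):
    """Apply defaults only where the variable is unset, then return the settings."""
    for name, value in defaults.items():
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in defaults}


settings = load_minio_settings()
```

The Spark session can keep reading `os.environ["MINIO_ENDPOINT"]` and the other two variables unchanged, since the helper only fills in missing values.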



In [8]:

# SPARK SESSION

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder \
                    .appName("SilverDataQuality") \
                    .config("spark.hadoop.fs.s3a.endpoint", os.environ["MINIO_ENDPOINT"]) \
                    .config("spark.hadoop.fs.s3a.access.key", os.environ["MINIO_KEY"]) \
                    .config("spark.hadoop.fs.s3a.secret.key", os.environ["MINIO_SECRET"]) \
                    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
                    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .getOrCreate()

/opt/spark/bin/load-spark-env.sh: line 68: ps: command not found


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
software.amazon.awssdk#s3 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0a973060-d30f-4544-8c3d-1e717017af97;1.0
	confs: [default]
	found software.amazon.awssdk#s3;2.26.30 in central
	found software.amazon.awssdk#aws-xml-protocol;2.26.30 in central
	found software.amazon.awssdk#aws-query-protocol;2.26.30 in central
	found software.amazon.awssdk#protocol-core;2.26.30 in central
	found software.amazon.awssdk#sdk-core;2.26.30 in central
	found software.amazon.awssdk#annotations;2.26.30 in central
	found software.amazon.awssdk#http-client-spi;2.26.30 in central
	found software.amazon.awssdk#utils;2.26.30 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found org.slf4j#slf4j-api;1.7.36 in central
	found software.amazon.awssdk#metric

In [11]:

from pyspark.sql.functions import col, lit

table_name = "s_cartoes_trimestral_bc"
data_source_file_path = f"s3a://{silver_path_dir}/{table_name}"

dt_partition = "2025-01-20" # Silver table partition (reference date)

print(f"* Data Source Pagamentos Trimestrais: {data_source_file_path}")

df_pagamentos_trimestral_silver = spark.read \
                                       .format("delta") \
                                       .load(data_source_file_path) \
                                       .where(col('dt_partition') == dt_partition)

print("\n* Schema Original do arquivo origem")
# printSchema is a method: without the parentheses nothing is printed
df_pagamentos_trimestral_silver.printSchema()
df_pagamentos_trimestral_silver.show(n=1, vertical=True, truncate=True)

* Data Source Pagamentos Trimestrais: s3a://bank-databr/silver/s_cartoes_trimestral_bc


                                                                                


* Schema Original do arquivo origem


25/02/01 04:21:03 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

-RECORD 0-------------------------------------
 datatrimestre                | 2024-09-30    
 quantidadeBoleto             | 1525449.9     
 quantidadeCartaoCredito      | 5061154.29    
 quantidadeCartaoDebito       | 4121806.1     
 quantidadeCartaoPrePago      | 3132432.21    
 quantidadeCheque             | 44243.73      
 quantidadeConvenios          | 661708.42     
 quantidadeDOC                | 0.0           
 quantidadeDebitoDireto       | 4056152.31    
 quantidadePix                | 1.654642727E7 
 quantidadeSaques             | 664224.81     
 quantidadeTEC                | 0.0           
 quantidadeTED                | 203394.26     
 quantidadeTransIntrabancaria | 771817.52     
 valorBoleto                  | 2463425.06    
 valorCartaoCredito           | 662702.46     
 valorCartaoDebito            | 241932.98     
 valorCartaoPrePago           | 78531.64      
 valorCheque                  | 194330.6      
 valorConvenios               | 961282.82     
 valorDOC    

                                                                                

In [15]:
# Creating Expectations and adding them to the Silver table suite

# Each quarter is expected to appear only once, i.e. datatrimestre values are unique.
expectation_suite.add_expectation(

    gxe.ExpectColumnValuesToBeUnique(
        column = "datatrimestre",
        
    )

)

# The quantidadePix column is expected to exist
expectation_suite.add_expectation(

    gxe.ExpectColumnToExist(
        column = "quantidadePix",
        
    )

)

# The valorPix column is expected to exist
expectation_suite.add_expectation(

    gxe.ExpectColumnToExist(
        column = "valorPix",
        
    )

)

# quantidadePix is expected to have no more than 10% nulls
expectation_suite.add_expectation(

    gxe.ExpectColumnValuesToNotBeNull(
        column = "quantidadePix",
        mostly=0.90
    )

)

# valorPix is expected to have no more than 10% nulls
expectation_suite.add_expectation(

    gxe.ExpectColumnValuesToNotBeNull(
        column = "valorPix",
        mostly=0.90
    )

)

# The minimum of quantidadePix is expected to be at least zero
# (min_value is inclusive unless strict_min=True is passed)
expectation_suite.add_expectation(

    gxe.ExpectColumnMinToBeBetween(
        column = "quantidadePix",
        min_value = 0
    )

)


# The minimum of valorPix is expected to be at least zero
# (min_value is inclusive unless strict_min=True is passed)
expectation_suite.add_expectation(

    gxe.ExpectColumnMinToBeBetween(
        column = "valorPix",
        min_value = 0
    )

)

# TotalQtdTransacoes is expected to have no more than 10% missing values
expectation_suite.add_expectation(

    gxe.ExpectColumnValuesToNotBeNull(
        column = "TotalQtdTransacoes",
        mostly=0.90
    )

)


# TotalValores is expected to have no more than 10% missing values
expectation_suite.add_expectation(

    gxe.ExpectColumnValuesToNotBeNull(
        column = "TotalValores",
        mostly=0.90
    )

)


# Creating the Validation Definition
validation_def_name = "pagamentos_trimestrais_validation"
validation_definition_ref = gx.ValidationDefinition(
    data = batch_definition,
    suite = expectation_suite,
    name = validation_def_name
)


validation_definition = context.validation_definitions.add(validation_definition_ref)

# Defining the batch parameters and running the validation

batch_parameters = {"dataframe": df_pagamentos_trimestral_silver}
validation_results = validation_definition.run(batch_parameters = batch_parameters)

print(validation_results)

Calculating Metrics: 100%|██████████| 40/40 [00:07<00:00,  5.44it/s]            

{
  "success": false,
  "results": [
    {
      "success": false,
      "expectation_config": {
        "type": "expect_column_values_to_be_unique",
        "kwargs": {
          "batch_id": "pagamentos_trimestrais_data-pagamentos_trimestrais_asset",
          "column": "datatrimestre"
        },
        "meta": {},
        "id": "817e5203-5e6e-4e9f-bacf-0d292676295a"
      },
      "result": {
        "element_count": 161,
        "unexpected_count": 161,
        "unexpected_percent": 100.0,
        "partial_unexpected_list": [
          "2019-03-31",
          "2019-03-31",
          "2019-03-31",
          "2019-03-31",
          "2019-03-31",
          "2019-03-31",
          "2019-03-31",
          "2019-06-30",
          "2019-06-30",
          "2019-06-30",
          "2019-06-30",
          "2019-06-30",
          "2019-06-30",
          "2019-06-30",
          "2019-09-30",
          "2019-09-30",
          "2019-09-30",
          "2019-09-30",
          "2019-09-30",
        



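
The printed result is long, and in the run above the uniqueness check on `datatrimestre` fails for all 161 rows because every quarter date is shared by several rows. A small helper can reduce such output to the failing expectations only. The sketch below is a minimal illustration that works on a plain dictionary shaped like the printed output. The function `failed_expectations` and the `sample` dictionary are hypothetical, and the second entry in `sample` is invented to show a passing case.

```python
def failed_expectations(results: dict) -> list[dict]:
    """Collect type, column and unexpected_percent for each failed expectation."""
    failures = []
    for item in results.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        failures.append(
            {
                "type": config.get("type"),
                "column": config.get("kwargs", {}).get("column"),
                "unexpected_percent": item.get("result", {}).get("unexpected_percent"),
            }
        )
    return failures


# Hand-written dictionary that mirrors the shape of the printed output;
# the second entry is invented for illustration.
sample = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "datatrimestre"},
            },
            "result": {
                "element_count": 161,
                "unexpected_count": 161,
                "unexpected_percent": 100.0,
            },
        },
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_to_exist",
                "kwargs": {"column": "quantidadePix"},
            },
            "result": {},
        },
    ],
}

for failure in failed_expectations(sample):
    print(failure)
```

To apply this to the real run, the object returned by `validation_definition.run(...)` would first need to be converted to a dictionary. This assumes the result object exposes a `to_json_dict()` method, which should be confirmed against the installed Great Expectations version.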