# **HiBlu**: Data Validation - Great Expectations

By Mavericks Team - Hacktiv8 | Data Resource: [FAQ Blu](https://blubybcadigital.id/info/faq)

---

# Introduction

Data validation is a crucial step in the ETL (Extract, Transform, Load) process to ensure that the data stored and used in analysis or reports meets the desired quality standards. Great Expectations is an open-source tool that helps ensure data meets predefined expectations.

This notebook performs the following validations:
* Existing Columns
* Data Type
* Not Null
* Question Marks in the `Question` column
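
As a rough illustration of what these four checks mean, here is a minimal pandas sketch on a hypothetical two-row sample. The names `sample` and `basic_checks` are assumptions made for this example only; they are not part of the notebook or of Great Expectations, which performs the actual validation in the sections that follow.

```python
import pandas as pd

# Hypothetical sample rows; the notebook itself reads FAQ_cleaned.csv.
sample = pd.DataFrame({
    "Question": ["Apa itu blu?", "Saya lupa password"],
    "Answer": ["blu merupakan aplikasi mobile banking.", "Reset lewat aplikasi."],
})


def basic_checks(df: pd.DataFrame) -> dict:
    """Return a pass/fail flag for each of the four checks listed above."""
    required = ["Question", "Answer"]
    has_columns = all(col in df.columns for col in required)
    return {
        # Existing columns: both required columns are present.
        "columns_exist": has_columns,
        # Data type: every non-null value is a Python string.
        "all_strings": has_columns and all(
            df[col].dropna().map(lambda v: isinstance(v, str)).all()
            for col in required
        ),
        # Not null: no missing values in either column.
        "no_nulls": has_columns and not df[required].isna().any().any(),
        # Question marks: every Question contains a literal '?'.
        "question_marks": has_columns and bool(
            df["Question"].astype(str).str.contains(r"\?").all()
        ),
    }


results = basic_checks(sample)
print(results)
```

The second sample row is a statement without a question mark, so only the `question_marks` flag fails here.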


*This notebook was run on Google Colab.*

# Import Libraries

In [1]:
# Importing libraries
import pandas as pd

# Installing GX
!pip install -q great-expectations

from great_expectations.data_context import FileDataContext
context = FileDataContext.create(project_root_dir='./')

# Connecting to a Datasource

In [2]:
# Naming datasource
datasource_name = 'faq-csv'
datasource = context.sources.add_pandas(datasource_name)

# Naming data asset
asset_name = 'faq'
path_to_data = '/content/FAQ_cleaned.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Building batch request
batch_request = asset.build_batch_request()

# Creating Expectation Suite

In [3]:
# Create an expectation suite
expectation_suite_name = 'faq-hiblu'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name,
)

# Check the validator
validator.head()


Unnamed: 0,Question,Answer
0,Apa itu blu?,blu merupakan aplikasi mobile banking dari BCA...
1,Apa perbedaan blu dengan BCA Digital?,blu adalah aplikasi mobile banking milik BCA D...
2,Apa perbedaan BCA Digital dengan BCA?,"BCA Digital merupakan anak perusahaan BCA, bag..."
3,Apa keuntungan pakai aplikasi blu?,"Gak terbatas ruang dan waktu, aplikasi blu bis..."
4,Apakah blu punya kantor cabang offline?,"blu gak punya kantor cabang offline, tapi tena..."


# Expectations

## Existing Columns

In [None]:
# Expect question column to exist
validator.expect_column_to_exist('Question')


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "Question",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [None]:
# Expect answer column to exist
validator.expect_column_to_exist('Answer')





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "Answer",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## Data Type

In [None]:
# Expect question to be of type string
validator.expect_column_values_to_be_of_type("Question", "str")


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "Question",
      "type_": "str",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {
    "element_count": 500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [None]:
# Expect answer to be of type string
validator.expect_column_values_to_be_of_type('Answer', "str")


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "Answer",
      "type_": "str",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {
    "element_count": 500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## Not Null

In [None]:
# Expect question column to not be null
validator.expect_column_values_to_not_be_null("Question")


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "Question",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {
    "element_count": 500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [None]:
# Expect answer column to not be null
validator.expect_column_values_to_not_be_null("Answer")





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "Answer",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {
    "element_count": 500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## Question Marks in `Question`

In [5]:
# Expect '?' in Question column
validator.expect_column_values_to_match_regex(column="Question", regex=r".*\?.*")





{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_match_regex",
    "kwargs": {
      "column": "Question",
      "regex": ".*\\?.*",
      "batch_id": "faq-csv-faq"
    },
    "meta": {}
  },
  "result": {
    "element_count": 500,
    "unexpected_count": 15,
    "unexpected_percent": 3.0,
    "partial_unexpected_list": [
      "Apakah ada biaya jika transfer ke BCA pakai blu",
      "Bagaimana bila status transaksi berhasil dan saldo sudah terdebet namun uang belum diterima oleh rekening bank tujuan",
      "Syarat & Ketentuan",
      "Anti-Fraud",
      "Syarat & Ketentuan",
      "Anti-Fraud",
      "Saya lupa PIN saat mau bertransaksi",
      "Saya lupa password",
      "Saya perlu mengubah informasi personal",
      "Saya tinggal di luar negeri, apakah saya bisa menggunakan blu",
      "Saya lupa PIN transaksi akun Bimapay x blu, apa yang harus saya lakukan",
      "Saya lupa PIN transaksi akun blu, apa yang harus saya lakukan",
    

In the last validation check, Great Expectations reports that 15 of the 500 `Question` entries (3%) do not contain a question mark. This is because the FAQ Blu page includes entries that are statements or section titles (for example, "Syarat & Ketentuan" and "Anti-Fraud") rather than interrogative sentences.
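
To inspect such entries outside Great Expectations, the same pattern can be applied with Python's standard `re` module. This is only a sketch: the `questions` list is a small hand-picked subset of values copied from the outputs shown in this notebook, not the full dataset, so the percentage it prints does not match the 3% reported above.

```python
import re

# A few Question values copied from the outputs shown in this notebook.
questions = [
    "Apa itu blu?",
    "Apa keuntungan pakai aplikasi blu?",
    "Syarat & Ketentuan",
    "Anti-Fraud",
    "Saya lupa password",
]

# Same pattern as the expectation: a literal '?' anywhere in the string.
pattern = re.compile(r".*\?.*")

# Collect the entries the pattern does not match.
unexpected = [q for q in questions if not pattern.search(q)]
unexpected_percent = 100 * len(unexpected) / len(questions)

print(unexpected)
print(f"{unexpected_percent:.1f}% of the sample has no question mark")
```

Listing the unmatched entries this way makes it easier to decide whether they should be rephrased as questions, filtered out, or accepted as they are.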