Milestone 3

Name  : Maulidya Fauziyyah
Batch : FTDS-011-HCK

This notebook validates the cleaned supermarket sales data using Great Expectations.

---

In [1]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

In [2]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-supermarket-sales'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'supermarket-sales-data'
path_to_data = '/Users/maulidyaa/github-classroom/FTDS-assignment-bay/p2-ftds011-hck-m3-maulidyafauziyyah/dags/P2M3_maulidya_fauziyyah_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

In [3]:
# Create an expectation suite
expectation_suite_name = 'expectation-supermarket-sales'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,invoice_id,branch,city,customer_type,gender,product_line,unit_price,quantity,tax_5,total,date,time,payment,cogs,gross_margin_percentage,gross_income,rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74,7,7.0,7.0,2019-01-05,2024-01-26 13:00:00,Ewallet,7.0,7.0,7.0,7.0
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15,5,5.0,5.0,2019-03-08,2024-01-26 10:00:00,Cash,5.0,5.0,5.0,5.0
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46,7,7.0,7.0,2019-03-03,2024-01-26 13:00:00,Credit card,7.0,7.0,7.0,7.0
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58,8,8.0,8.0,2019-01-27,2024-01-26 20:00:00,Ewallet,8.0,8.0,8.0,8.0
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86,7,7.0,7.0,2019-02-08,2024-01-26 10:00:00,Ewallet,7.0,7.0,7.0,7.0


This cell creates (or updates) an expectation suite named 'expectation-supermarket-sales' in the data context, builds a validator from that suite and the batch request, and previews the first rows of the data.

In [4]:
# Expectation 1 : Column `invoice_id` must be unique

validator.expect_column_values_to_be_unique('invoice_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Each 'invoice_id' must be unique so that every invoice in the Supermarket Sales dataset can be identified and tracked individually.
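
As a rough illustration of what a uniqueness check verifies, the plain-Python sketch below lists the values that appear more than once. The helper name `find_duplicates` and the repeated ID are hypothetical (the two distinct IDs come from the preview above); this is not how Great Expectations implements the check.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


# Hypothetical sample: the first ID is repeated on purpose.
sample_ids = ["750-67-8428", "226-31-3081", "750-67-8428"]
duplicates = find_duplicates(sample_ids)
print(duplicates)
```

An empty list corresponds to the `"unexpected_count": 0` reported above.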

In [5]:
# Expectation 2 : Column `rating` must be between min_value and max_value

validator.expect_column_values_to_be_between(column='rating',min_value=0,max_value=10.0)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

A 'rating' outside the 0 to 10 range would be invalid, so this check protects the integrity and accuracy of the customer satisfaction measure.
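
The range check can be pictured with a small plain-Python sketch, assuming inclusive bounds on both ends. The helper `out_of_range` and the sample ratings are made up for illustration and are not taken from the dataset.

```python
def out_of_range(values, min_value, max_value):
    """Return the values outside the inclusive range [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


# Hypothetical ratings: two of the five fall outside 0 to 10.
sample_ratings = [7.0, 5.0, 10.0, 11.5, -1.0]
unexpected = out_of_range(sample_ratings, 0, 10.0)
unexpected_percent = 100 * len(unexpected) / len(sample_ratings)
print(unexpected, unexpected_percent)
```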

In [6]:
# Expectation 3 : Column `payment` must contain one of the following 3 values :
# Ewallet
# Cash
# Credit card

validator.expect_column_values_to_be_in_set('payment', ['Ewallet', 'Cash', 'Credit card'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This ensures data integrity by verifying that every entry in the 'payment' column is one of the three specified payment methods, preventing unexpected or invalid entries.
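
A set-membership check is an exact, case-sensitive comparison, which the sketch below tries to make concrete. The helper `unexpected_values` and the two invalid entries are hypothetical; the valid entries use the payment methods listed in the cell above.

```python
ALLOWED_PAYMENTS = {"Ewallet", "Cash", "Credit card"}


def unexpected_values(values, allowed):
    """Return the values that are not members of the allowed set."""
    return [v for v in values if v not in allowed]


# "credit card" (lowercase c) and "Cheque" are hypothetical bad entries.
sample_payments = ["Ewallet", "Cash", "credit card", "Cheque"]
bad_entries = unexpected_values(sample_payments, ALLOWED_PAYMENTS)
print(bad_entries)
```

Because the comparison is exact, a value that differs only in capitalisation is flagged as well.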

In [7]:
# Expectation 4 : Column 'gross_income' must have a type from the type list

validator.expect_column_values_to_be_in_type_list('gross_income', ['float'])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Storing 'gross_income' as a float is essential for consistent and accurate numerical computations on this financial data.

In [8]:
# Expectation 5 : Column 'invoice_id' must match a regex pattern
# (a raw string keeps `\d` from being read as a string escape sequence)

validator.expect_column_values_to_match_regex('invoice_id', r'^\d{3}-\d{2}-\d{4}$')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This ensures that all values in the 'invoice_id' column follow one format: three digits, a hyphen, two digits, another hyphen, then four digits. A consistent format improves data consistency and reliability.
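
To see how the pattern behaves on its own, the sketch below applies the same regular expression with Python's `re` module. The helper `non_matching` and the two malformed IDs are hypothetical; only the first ID comes from the preview above.

```python
import re

# Same pattern as in the expectation, written as a raw string.
INVOICE_ID_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def non_matching(values, pattern):
    """Return the values that the pattern does not match."""
    return [v for v in values if not pattern.search(v)]


# The second and third IDs are deliberately malformed.
sample_ids = ["750-67-8428", "75-067-8428", "750-67-84280"]
malformed = non_matching(sample_ids, INVOICE_ID_PATTERN)
print(malformed)
```

The `^` and `$` anchors matter here: without them, an ID with extra trailing digits would still match.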

In [9]:
# Expectation 6 : Column 'date' must match strftime format

validator.expect_column_values_to_match_strftime_format('date', '%Y-%m-%d')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 1000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This ensures that all values in the 'date' column follow the `%Y-%m-%d` format, which keeps dates consistent and supports accurate date-related operations in data processing and analysis.
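
One way to picture a strftime-format check is to try parsing each string with `datetime.strptime` and treat a `ValueError` as a failure. This is a sketch under that assumption, not the library's implementation; the helper `matches_format` and the two invalid samples are hypothetical.

```python
from datetime import datetime


def matches_format(value, fmt):
    """Return True if the string can be parsed with the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# First date is from the preview; the others are hypothetical bad inputs
# (wrong layout, and a day that does not exist).
samples = ["2019-01-05", "05/01/2019", "2019-02-30"]
results = [matches_format(s, "%Y-%m-%d") for s in samples]
print(results)
```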

In [10]:
# Expectation 7 : Most common value of column 'payment' must be in set

validator.expect_column_most_common_value_to_be_in_set('payment', ['Cash', 'Credit card', 'Ewallet'])

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": [
      "Ewallet"
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This checks that the most common value in the 'payment' column belongs to the predefined set of acceptable payment methods ('Cash', 'Credit card', 'Ewallet'). Here the observed most common value is 'Ewallet', so the expectation passes.
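
Unlike a per-row membership check, this expectation looks only at the mode of the column. The sketch below computes the mode with `collections.Counter` on the five payment values from the preview above; the helper `most_common_values` is a hypothetical name, and ties are returned together as an assumption of this sketch.

```python
from collections import Counter


def most_common_values(values):
    """Return every value that shares the highest count (ties included)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Payment values of the five preview rows.
sample_payments = ["Ewallet", "Cash", "Credit card", "Ewallet", "Ewallet"]
allowed = {"Cash", "Credit card", "Ewallet"}

modes = most_common_values(sample_payments)
success = all(mode in allowed for mode in modes)
print(modes, success)
```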

In [11]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

Saves the expectations defined above into the expectation suite. With `discard_failed_expectations=False`, expectations are kept even if they failed during validation.

In [12]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

Creates a checkpoint named 'checkpoint_1' in the data context and associates it with the validator, so that the validation can be run as a single repeatable step.

In [13]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()

Calculating Metrics:   0%|          | 0/37 [00:00<?, ?it/s]

Runs the checkpoint, which validates the dataset against every expectation in the suite and returns the results.
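
When many expectations are evaluated at once, a short summary is often easier to read than the individual results. The sketch below works on plain dictionaries shaped like the JSON outputs printed earlier in this notebook; the helper `summarize` and the failing entry are hypothetical, and extracting such dictionaries from `checkpoint_result` is not shown here.

```python
def summarize(results):
    """Count evaluated and failed results in a list of result dictionaries."""
    failed = [r for r in results if not r["success"]]
    return {
        "evaluated": len(results),
        "failed": len(failed),
        "success": not failed,  # True only when nothing failed
    }


# Dictionaries mimicking the printed results; the second one is a
# hypothetical failure added for illustration.
sample_results = [
    {"success": True, "result": {"element_count": 1000, "unexpected_count": 0}},
    {"success": False, "result": {"element_count": 1000, "unexpected_count": 3}},
]
summary = summarize(sample_results)
print(summary)
```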

In [14]:
# Build data docs

context.build_data_docs()

{'local_site': 'file:///Users/maulidyaa/github-classroom/FTDS-assignment-bay/p2-ftds011-hck-m3-maulidyafauziyyah/dags/gx/uncommitted/data_docs/local_site/index.html'}

Builds the Data Docs: a local HTML site, at the path shown above, that presents the expectation suite and the validation results in a readable form.