# Milestone 3

## Name: Irvandhi Stanly Winata
## Batch: HCK - 013

### **Objective:** This notebook performs data validation using Great Expectations.

Before validating data with Great Expectations, we must create a data context and connect it to a data source. Then we register our CSV file as a data asset on that data source.

In [1]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-onlinefoods-stanly'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'milestone3_stanly'
path_to_data = 'P2M3_Irvandhi_Stanly_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()


Next, we create an Expectation Suite, a collection of verifiable assertions about the data. A suite groups individual Expectations into a single description of what the data should look like. Suite names are free-form, but each must be unique within a project. A Validator is used to interact with a batch of data and build the suite: each time an Expectation is evaluated with `validator.expect_*`, it is validated immediately against the dataset.

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-onlinefood-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

| Unnamed: 0 | id | age | gender | marital_status | occupation | monthly_income | educational_qualifications | family_size | latitude | longitude | pin_code | output | feedback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20 | Female | Single | Student | No Income | Post Graduate | 4 | 12.9766 | 77.5993 | 560001 | Yes | Positive |
| 1 | 2 | 24 | Female | Single | Student | Below Rs.10000 | Graduate | 3 | 12.977 | 77.5773 | 560009 | Yes | Positive |
| 2 | 3 | 22 | Male | Single | Student | Below Rs.10000 | Post Graduate | 3 | 12.9551 | 77.6593 | 560017 | Yes | Negative |
| 3 | 4 | 22 | Female | Single | Student | No Income | Graduate | 6 | 12.9473 | 77.5616 | 560019 | Yes | Positive |
| 4 | 5 | 22 | Male | Single | Student | Below Rs.10000 | Post Graduate | 4 | 12.985 | 77.5533 | 560010 | Yes | Positive |


## D.1 - Expectations

An Expectation is a verifiable assertion about source data, akin to an assertion in a traditional Python unit test. Expectations provide a flexible, declarative language for describing how the data should behave. In short, they state what we anticipate from the data and serve as benchmarks against which the actual data is validated.
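To make the unit-test analogy concrete, here is a minimal plain-Python sketch. It is not how Great Expectations is implemented. The `rows` sample and the helper `check_unique` are hypothetical, introduced only for illustration, and the returned keys loosely mirror the `success` and `unexpected_count` fields seen in validation results.

```python
from collections import Counter

# Hypothetical sample rows standing in for the real dataset.
rows = [
    {"id": 1, "age": 20},
    {"id": 2, "age": 24},
    {"id": 3, "age": 22},
]

def check_unique(rows, column):
    """Report whether every value in `column` occurs exactly once."""
    counts = Counter(row[column] for row in rows)
    # Count every row whose value is shared with at least one other row.
    unexpected = sum(n for n in counts.values() if n > 1)
    return {"success": unexpected == 0, "unexpected_count": unexpected}

# A unit-test style assertion stops at the first problem ...
assert len({row["id"] for row in rows}) == len(rows)

# ... while an expectation-style check returns a result that can be stored and reported.
result = check_unique(rows, "id")
print(result)
```

The difference in style is the point: an assertion raises and halts, whereas an expectation returns a structured result that can be saved alongside the data.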

In [5]:
# Expectation 1 : Column `id` must be unique because it is our primary key and identifies each order

validator.expect_column_values_to_be_unique('id')

{
  "success": true,
  "result": {
    "element_count": 388,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [6]:
# Expectation 2 : Column `age` must be between 18 and 100, a reasonable age range for a person who can order food online

validator.expect_column_values_to_be_between(
    column='age', min_value=18, max_value=100
)

{
  "success": true,
  "result": {
    "element_count": 388,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [8]:
# Expectation 3 : Column `educational_qualifications` must contain one of the following five values: 'Post Graduate', 'Graduate', 'Ph.D', 'Uneducated', 'School'

validator.expect_column_values_to_be_in_set('educational_qualifications', ['Post Graduate', 'Graduate', 'Ph.D', 'Uneducated', 'School'])

{
  "success": true,
  "result": {
    "element_count": 388,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [11]:
# Expectation 4 : Column `latitude` must be of integer or float type

validator.expect_column_values_to_be_in_type_list('latitude', ['integer', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [13]:
# Expectation 5 : Column `family_size` must be of integer type

validator.expect_column_values_to_be_of_type("family_size", "int")

{
  "success": false,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
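Expectation 5 reports `"success": false` even though the observed type, `int64`, is an integer type. A plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the type string `"int"` is resolved to the platform's default NumPy integer. In NumPy releases before 2.0 that default follows the C `long` type, which is 32 bits on Windows and 64 bits on most Linux and macOS systems. The stdlib-only sketch below checks which case applies on the current machine; the helper name `default_int_name` is hypothetical.

```python
import ctypes

def default_int_name(c_long_bytes):
    """Name of the integer type with the same width as a C long of the given size."""
    return f"int{c_long_bytes * 8}"

# Size of a C long on this machine: 4 bytes on Windows, 8 on most 64-bit Linux/macOS.
c_long_bytes = ctypes.sizeof(ctypes.c_long)
platform_int = default_int_name(c_long_bytes)

observed = "int64"  # the observed_value reported for family_size above
print(f"C long is {platform_int}; observed column type is {observed}")
print("types match" if platform_int == observed else "types differ")
```

If the types differ, passing the explicit type string `"int64"` to `expect_column_values_to_be_of_type` should state the same expectation in a platform-independent way.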

In [14]:
# Expectation 6 : Every value in column `pin_code` must consist of exactly six digits, the standard PIN code format

validator.expect_column_values_to_match_regex("pin_code", r"^\d{6}$")

{
  "success": true,
  "result": {
    "element_count": 388,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
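The pattern used in Expectation 6 can be checked by hand with Python's built-in `re` module. This is a sketch of what the regular expression accepts, not of how Great Expectations applies it; the sample values and the helper `is_valid_pin` are made up for illustration.

```python
import re

PIN_PATTERN = re.compile(r"^\d{6}$")  # same pattern as in Expectation 6

def is_valid_pin(value):
    """Return True when the value, converted to text, is exactly six digits."""
    return PIN_PATTERN.match(str(value)) is not None

# Hypothetical samples: one valid code, then too short, too long, and non-numeric.
samples = [560001, "56001", "5600011", "56000A"]
for sample in samples:
    print(sample, is_valid_pin(sample))
```

The `^` and `$` anchors are what rule out longer strings: without them, any value containing six consecutive digits would match.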

In [21]:
# Expectation 7 : Column `feedback` must contain exactly two unique values (expected to be Positive and Negative)

validator.expect_column_unique_value_count_to_be_between(
    column='feedback', min_value=2, max_value=2
)

{
  "success": true,
  "result": {
    "observed_value": 2
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
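A unique-value count of 2 does not by itself confirm which two values are present. The plain-Python sketch below, using made-up sample data and hypothetical helper names, contrasts a count check with a membership check. Within Great Expectations, `expect_column_values_to_be_in_set` (already used in Expectation 3) performs the membership check.

```python
def unique_count_between(values, min_value, max_value):
    """True when the number of distinct values lies in [min_value, max_value]."""
    return min_value <= len(set(values)) <= max_value

def all_in_set(values, allowed):
    """True when every value belongs to the allowed set."""
    return set(values) <= set(allowed)

expected = {"Positive", "Negative"}
good = ["Positive", "Negative", "Positive"]
odd = ["Positive", "Neutral", "Positive"]  # hypothetical bad data: still two unique values

print(unique_count_between(good, 2, 2), all_in_set(good, expected))
print(unique_count_between(odd, 2, 2), all_in_set(odd, expected))
```

Both lists pass the count check, but only the first passes the membership check, so combining the two gives a stricter guarantee about `feedback`.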

In [22]:
# Save the expectations to the Expectation Suite (failed expectations are kept)

validator.save_expectation_suite(discard_failed_expectations=False)