# Great Expectations

# Objectives

This notebook aims to leverage Great Expectations for data validation to ensure the dataset meets quality standards before analysis. By integrating Great Expectations, we can profile the data, run checks to confirm data consistency, and generate comprehensive validation reports. This process helps maintain data integrity and prepares it for reliable downstream tasks.

# Import Libraries

In [1]:
import pandas as pd
import great_expectations as gx

The libraries have been imported successfully.

# Data Loading

In [2]:
# Read CSV file and save into a DataFrame
df = pd.read_csv('clean_data.csv')

# Show dataset
df

Unnamed: 0,employee_id,department,gender,age,job_title,hire_date,years_at_company,education_level,performance_score,monthly_salary,work_hours_per_week,projects_handled,overtime_hours,sick_days,remote_work_frequency,team_size,training_hours,promotions,employee_satisfaction_score,resigned
0,1,IT,Male,55,Specialist,2022-01-19,2,High School,5,6750.0,33,32,22,2,0,14,66,0,2.63,False
1,2,Finance,Male,29,Developer,2024-04-18,0,High School,5,7500.0,34,34,13,14,100,12,61,2,1.72,False
2,3,Finance,Male,55,Specialist,2015-10-26,8,High School,3,5850.0,37,27,6,3,50,10,1,0,3.17,False
3,4,Customer Support,Female,48,Analyst,2016-10-22,7,Bachelor,2,4800.0,52,10,28,12,100,10,0,1,1.86,False
4,5,Engineering,Female,36,Analyst,2021-07-23,3,Bachelor,2,4800.0,38,11,29,13,100,15,9,1,1.25,False


In [3]:
# Convert the pandas DataFrame to a Great Expectations DataFrame
df_ge = gx.from_pandas(df)

The data has been loaded and converted to a Great Expectations DataFrame.

# Expectations

## Expectation 1: To be Unique

The expectation `expect_column_values_to_be_unique` checks that every value in `employee_id` is unique, so no employee appears more than once. This matches the requirement that each employee has a distinct identifier.

In [4]:
# Expectation to be unique
to_be_unique = df_ge.expect_column_values_to_be_unique("employee_id")
to_be_unique

{
  "success": true,
  "result": {
    "element_count": 100000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `employee_id` column has met the expectation to be unique, indicating that there are no duplicate values. 
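To make the fields in the result above more concrete, here is a minimal plain-Python sketch of how a uniqueness check could be summarized. It is an illustration only, not the library's implementation: the function name `summarize_uniqueness` and the toy IDs are hypothetical, and it assumes that every occurrence of a repeated value counts as unexpected.

```python
from collections import Counter


def summarize_uniqueness(values):
    """Return a small summary dict for a uniqueness check (illustrative only)."""
    counts = Counter(values)
    # Assumption: every occurrence of a repeated value is flagged, not just the repeats.
    unexpected = [v for v in values if counts[v] > 1]
    total = len(values)
    return {
        "success": not unexpected,
        "element_count": total,
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100.0 * len(unexpected) / total if total else 0.0,
        "partial_unexpected_list": unexpected[:20],
    }


# Toy IDs with one duplicated value (hypothetical data, not taken from clean_data.csv)
summary = summarize_uniqueness([1, 2, 2, 3])
print(summary)
```

On a column with no repeats, such as the `employee_id` column here, the same summary would report zero unexpected values.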

## Expectation 2: To be Between

The expectation `expect_column_values_to_be_between` checks that every value in the `age` column lies between 22 and 60, reflecting the productive working age range. This matches the age range described in the dataset's documentation.

In [31]:
# Expectation to be between
to_be_between = df_ge.expect_column_values_to_be_between("age", min_value=22, max_value=60)
to_be_between

{
  "success": true,
  "result": {
    "element_count": 100000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `age` column has met the expectation: all values are between 22 and 60.
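The range check can be sketched in plain Python as follows. This assumes both bounds are inclusive, which is the behaviour expected when the strict-bound options are left at their defaults. The helper `out_of_range` and the sample ages are hypothetical.

```python
def out_of_range(values, min_value, max_value):
    """Return the values that fall outside the closed interval [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


# Hypothetical ages around the boundaries used in this notebook
sample_ages = [21, 22, 41, 60, 61]
violations = out_of_range(sample_ages, 22, 60)
print(violations)
```

With inclusive bounds, 22 and 60 pass while 21 and 61 are reported.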

## Expectation 3: To be in Set

The expectation `expect_column_values_to_be_in_set` checks that every value in the `education_level` column belongs to the valid set: "High School", "Bachelor", "Master", or "PhD". These are the education levels expected for employees in the dataset.

In [16]:
# Expectation to be in set
to_be_in_set = df_ge.expect_column_values_to_be_in_set("education_level", ["High School", "Bachelor", "Master", "PhD"])
to_be_in_set

{
  "success": true,
  "result": {
    "element_count": 100000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `education_level` column has met the expectation: all values are within the valid set.
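Set membership is an exact string comparison, so differences in case or stray whitespace would count as violations. The sketch below illustrates this with plain Python. The helper `not_in_set` and the sample values are hypothetical and do not come from the dataset.

```python
VALID_LEVELS = {"High School", "Bachelor", "Master", "PhD"}


def not_in_set(values, allowed):
    """Return the values that are not exact members of the allowed set."""
    return [v for v in values if v not in allowed]


# Hypothetical values showing case and whitespace mismatches
sample_levels = ["Bachelor", "bachelor", "PhD", "Phd", " Master"]
unexpected_levels = not_in_set(sample_levels, VALID_LEVELS)
print(unexpected_levels)
```

If such variants were present in the data, normalizing them during cleaning would be one way to make the check pass.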

## Expectation 4: To be in Type List

The expectation `expect_column_values_to_be_in_type_list` checks that the `monthly_salary` column has the type float64. A numeric type is needed for accurate calculations on the salary data.

In [36]:
# Expectation to be in type list
to_be_in_type = df_ge.expect_column_values_to_be_in_type_list("monthly_salary", ["float64"])
to_be_in_type

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `monthly_salary` column has met the expectation: its type is float64.

## Expectation 5: To Match Regex

The expectation `expect_column_values_to_match_regex` checks that the values in the `hire_date` column match the date format YYYY-MM-DD. Consistent formatting guards against data entry errors in the hire dates.

In [17]:
# Expectation to match a regular expression
to_be_value_match = df_ge.expect_column_values_to_match_regex(column="hire_date", regex=r"\d{4}-\d{2}-\d{2}")
to_be_value_match

{
  "success": true,
  "result": {
    "element_count": 100000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `hire_date` column has met the expectation of having the date format YYYY-MM-DD.
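The pattern `\d{4}-\d{2}-\d{2}` used above has no `^` or `$` anchors. Assuming the expectation applies the pattern as a search anywhere in the string (as `re.search` does) rather than as a full match, values with extra leading or trailing characters would also pass. The sketch below compares the two variants with the standard `re` module. The sample strings are hypothetical, and the anchored pattern is a suggested stricter alternative, not what the notebook ran.

```python
import re

# Pattern as used in the notebook (no anchors) and a stricter anchored variant
UNANCHORED = re.compile(r"\d{4}-\d{2}-\d{2}")
ANCHORED = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Hypothetical values: one clean date and three malformed ones
samples = ["2022-01-19", "2022-01-19 00:00:00", "x2022-01-19", "19-01-2022"]

results = {
    s: (bool(UNANCHORED.search(s)), bool(ANCHORED.search(s)))
    for s in samples
}
for value, (loose, strict) in results.items():
    print(f"{value!r}: unanchored={loose}, anchored={strict}")
```

Only the anchored pattern rejects the values with a trailing time component or a leading stray character.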

## Expectation 6: To be Dateutil Parseable

The expectation `expect_column_values_to_be_dateutil_parseable` checks that the values in the `hire_date` column can be parsed by the dateutil library. This confirms that the hire dates can be interpreted as dates, not just that they have the right shape.

In [30]:
# Expectation to be dateutil parseable
to_be_parseable = df_ge.expect_column_values_to_be_dateutil_parseable(column='hire_date')
to_be_parseable

{
  "success": true,
  "result": {
    "element_count": 100000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `hire_date` column has met the expectation: all values are parseable by the dateutil library.
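A parse check complements the regex check because a string can have the YYYY-MM-DD shape without being a real calendar date. The sketch below illustrates the difference. To stay dependency-free it uses the standard library's `datetime.strptime` as a stand-in for dateutil, which accepts many more formats. The helper `is_real_date` and the sample values are hypothetical.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_real_date(text):
    """Return True if text is a valid calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical values: one valid date and two that only look like dates
checks = {
    s: (bool(DATE_SHAPE.match(s)), is_real_date(s))
    for s in ["2022-01-19", "2023-02-30", "2024-13-01"]
}
for value, (shape_ok, date_ok) in checks.items():
    print(f"{value}: matches shape={shape_ok}, real date={date_ok}")
```

February 30 and month 13 both match the pattern but fail the parse, which is the kind of error the second check is meant to catch.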

## Expectation 7: Median to be Between

The expectation `expect_column_median_to_be_between` checks that the median of `employee_satisfaction_score` is between 1 and 5, consistent with the score range of 1 to 5 given in the dataset's documentation.

In [33]:
# Expectation median to be between
median_validation = df_ge.expect_column_median_to_be_between('employee_satisfaction_score', min_value=1, max_value=5)
median_validation

{
  "success": true,
  "result": {
    "observed_value": 3.0,
    "element_count": 100000,
    "missing_count": null,
    "missing_percent": null
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {}
}

The `employee_satisfaction_score` column has met the expectation: its median (3.0) is between 1 and 5.
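A median check constrains the centre of the distribution, not every individual value. The following sketch, using the standard `statistics` module on hypothetical scores, shows a case where the median lies in range even though one value does not.

```python
from statistics import median

# Hypothetical scores with one out-of-range value (99.0)
scores = [1.0, 2.5, 3.0, 3.5, 99.0]

score_median = median(scores)
median_ok = 1 <= score_median <= 5
all_in_range = all(1 <= s <= 5 for s in scores)

print(f"median={score_median}, median in range={median_ok}, all values in range={all_in_range}")
```

If every individual score must fall within 1 to 5, a value-level range check like the one used for `age` would be the more direct test.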