# Great Expectations

Name : Ilham Wahdini

Batch : RMT 043

This program checks and validates the data with the great_expectations tool before the next processing step, to make sure the data is of good quality and meets expectations. The dataset used is Global Renewable Energy Usage, FY 2020-2024.

In [None]:
# Import libraries

import pandas as pd
import great_expectations as ge

In [None]:
# Load cleaned data

df = pd.read_csv("P2M3_ilham_wahdini_data_clean.csv")


In [None]:
# Convert to validator

validator = ge.from_pandas(df)
validator.head()

Unnamed: 0,household_id,region,country,energy_source,monthly_usage_kwh,year,household_size,income_level,urban_rural,adoption_year,subsidy_received,cost_savings_usd
0,H01502,North America,USA,Hydro,1043.49,2024,5,Low,Urban,2012,No,10.46
1,H02587,Australia,Australia,Geothermal,610.01,2024,4,High,Rural,2023,No,43.49
2,H02654,North America,USA,Biomass,1196.75,2024,8,Low,Rural,2017,Yes,93.28
3,H01056,South America,Colombia,Biomass,629.67,2024,7,High,Urban,2023,No,472.85
4,H00706,Africa,Egypt,Hydro,274.46,2022,7,Middle,Rural,2010,No,65.98


In [None]:
# 1. To be Unique

validator.expect_column_values_to_be_unique('household_id')

{
  "success": true,
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: every value in household_id is unique, so the data contains no duplicate households.
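As a cross-check that does not depend on Great Expectations, the same uniqueness rule can be sketched in plain Python. The IDs below are the five from the preview above plus one value repeated on purpose for illustration; the helper name `find_duplicates` is hypothetical and not part of the notebook.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


# Five IDs from the preview, plus one repeated on purpose (illustrative only).
household_ids = ["H01502", "H02587", "H02654", "H01056", "H00706", "H01502"]

print(find_duplicates(household_ids))  # ['H01502']
```

An empty list corresponds to the passing case reported above, where `unexpected_count` is 0.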

In [None]:
# 2. To be between min_value and max_value

validator.expect_column_values_to_be_between('year', min_value=2020, max_value=2024)

{
  "success": true,
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: every value in the year column falls within the 2020-2024 range covered by the Global Renewable Energy Usage dataset.

In [None]:
# 3. To be in Set

valid_income = ['Low', 'Middle', 'High']
validator.expect_column_values_to_be_in_set('income_level', valid_income)

{
  "success": true,
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: every value in the income_level column belongs to one of the listed categories (Low, Middle, High).
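To illustrate what a failing value would look like for this kind of check, here is a minimal plain-Python sketch of set membership. The first four values come from the preview above; the lowercase `"low"` is a made-up bad entry, and `unexpected_values` is a hypothetical helper name.

```python
valid_income = {"Low", "Middle", "High"}


def unexpected_values(values, allowed):
    """Return the values that are not in the allowed set, keeping their order."""
    return [value for value in values if value not in allowed]


# First four values are from the preview; "low" is invented to show a failure.
income_levels = ["Low", "High", "Low", "High", "low"]

print(unexpected_values(income_levels, valid_income))  # ['low']
```

In this sketch the membership test is case-sensitive, so inconsistent capitalization would show up as an unexpected value.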

In [None]:
# 4. To be in Type List

validator.expect_column_values_to_be_in_type_list('monthly_usage_kwh', ['integer', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: the monthly_usage_kwh column is stored as float64, which satisfies the integer-or-float requirement. This column has to be numerical.

In [None]:
# 5. To Match Regex

validator.expect_column_values_to_match_regex('household_id', r'^H\d+')


{
  "success": true,
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: every household_id matches the pattern '^H\d+', meaning it starts with 'H' followed by one or more digits.
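One detail worth knowing about this pattern: it has no end anchor, so it only constrains how the ID starts. The sketch below uses Python's standard `re` module to compare it with a stricter variant. The stricter pattern `^H\d{5}$` is an assumption based on the five-digit IDs visible in the preview, not something the notebook defines, and the two non-matching sample strings are invented.

```python
import re

loose = re.compile(r"^H\d+")      # pattern used in the expectation above
strict = re.compile(r"^H\d{5}$")  # hypothetical stricter pattern (assumes 5 digits)

# "H01502" is from the preview; the other two are invented test strings.
samples = ["H01502", "H01502-old", "X01502"]

for sample in samples:
    print(sample, bool(loose.search(sample)), bool(strict.search(sample)))
# H01502 True True
# H01502-old True False
# X01502 False False
```

If trailing characters after the digits should count as invalid, an end-anchored pattern along these lines would be the safer choice.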

In [None]:
# 6. To be between value lengths

validator.expect_column_value_lengths_to_be_between(column='region', min_value=4, max_value=13)


{
  "success": true,
  "result": {
    "element_count": 1000,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: every value in the region column is between 4 characters (Asia) and 13 characters (North America / South America) long.
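The bounds 4 and 13 can be confirmed by measuring the region names directly. This sketch only uses the region names mentioned in this notebook (the preview rows and the note above), so it is not necessarily the full list of regions in the dataset.

```python
# Region names that appear in the preview and in the note above.
regions = ["North America", "Australia", "South America", "Africa", "Asia"]

lengths = {region: len(region) for region in regions}
print(lengths)
# {'North America': 13, 'Australia': 9, 'South America': 13, 'Africa': 6, 'Asia': 4}

print(min(lengths.values()), max(lengths.values()))  # 4 13
```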

In [None]:
# 7. Column minimum to be at least min_value

validator.expect_column_min_to_be_between('household_size', min_value=1)

{
  "success": true,
  "result": {
    "observed_value": 1,
    "element_count": 1000,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The result is True: the smallest household_size in the dataset is 1, so every household has at least one person. A household size cannot be zero.

Based on all the Great Expectations checks above, the cleaned data is valid and reliable for the next step, which is data analysis and visualization.
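If these checks are meant to gate the next process automatically, one option is to collect the results and stop when any of them fails. The following is a minimal sketch under stated assumptions: the `results` dictionary is hand-written to mirror the `"success"` field of the seven outputs above (in the notebook, the values would be the objects returned by the validator calls), and `failed_checks` is a hypothetical helper.

```python
def failed_checks(results):
    """Return the names of checks whose result reports success == False."""
    return [name for name, result in results.items() if not result["success"]]


# Hand-written stand-ins mirroring the "success" field of the outputs above.
results = {
    "household_id is unique": {"success": True},
    "year between 2020 and 2024": {"success": True},
    "income_level in valid set": {"success": True},
    "monthly_usage_kwh is numeric": {"success": True},
    "household_id matches regex": {"success": True},
    "region length between 4 and 13": {"success": True},
    "household_size minimum >= 1": {"success": True},
}

failed = failed_checks(results)
if failed:
    # Stop here so that invalid data never reaches analysis or visualization.
    raise ValueError(f"Data validation failed: {failed}")
print("All checks passed")  # All checks passed
```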