## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup and expectations for column presence and type.

In [None]:
# Write your code from here
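# Run this in a terminal or Jupyter cell before starting.
# ge.from_pandas belongs to the legacy Dataset API, which is not available in
# Great Expectations 1.0 and later, so pin an older release:
# !pip install "great_expectations<1.0" pandas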

import great_expectations as ge
import pandas as pd

# Step 1: Prepare a sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)

# Step 2: Create a Great Expectations dataset
ge_df = ge.from_pandas(df)

# Step 3: Create expectations
ge_df.expect_column_to_exist('name')
ge_df.expect_column_to_exist('age')
ge_df.expect_column_values_to_be_of_type('age', 'int64')
ge_df.expect_column_values_to_not_be_null('email')

# Step 4: Validate the data against expectations
results = ge_df.validate()

# Step 5: Print validation results summary
print("Validation Success:", results.success)
print("Expectations results:")
for res in results.results:
    print(f"{res.expectation_config.expectation_type}: {res.success}")

# Note:
# For a full Great Expectations project with Data Context, you would use:
# great_expectations init
# and work with data_context to create suites and run validations.
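
To make the presence and type checks in this cell less opaque, the following is a plain-Python sketch of what such checks conceptually compute. It does not use Great Expectations and is not how the library is implemented; the function names `check_column_exists` and `check_column_type`, the result keys, and the deliberately wrong `"35"` value are illustrative assumptions.

```python
# Plain-Python illustration only; not Great Expectations internals.
rows = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Charlie", "age": "35"},  # deliberately the wrong type
]


def check_column_exists(rows, column):
    """Succeed only if every row carries the column."""
    return {
        "check": f"column_exists({column})",
        "success": all(column in row for row in rows),
    }


def check_column_type(rows, column, expected_type):
    """Succeed only if every value in the column is an instance of expected_type."""
    unexpected = [
        row.get(column)
        for row in rows
        if not isinstance(row.get(column), expected_type)
    ]
    return {
        "check": f"column_type({column}, {expected_type.__name__})",
        "success": not unexpected,
        "unexpected": unexpected,
    }


results = [
    check_column_exists(rows, "age"),
    check_column_type(rows, "age", int),
]

# The overall result succeeds only if every individual check succeeds.
overall_success = all(r["success"] for r in results)

for r in results:
    print(r)
print("Overall success:", overall_success)
```

The overall flag mirrors the idea behind `results.success` in the cell above: one failing check is enough to fail the whole validation.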


### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
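
As a rough illustration of what "completeness" means here, the sketch below computes the fraction of non-missing values per column for the sample data used in this task. It is plain Python rather than Great Expectations, and the helper `completeness` and the `THRESHOLD` value are hypothetical choices made for the example.

```python
# Plain-Python illustration of a completeness check; not the Great Expectations API.
data = {
    "name": ["Alice", "Bob", None, "David"],
    "age": [25, 30, 22, None],
    "email": ["alice@example.com", None, "charlie@example.com", "david@example.com"],
}

THRESHOLD = 0.9  # hypothetical minimum share of non-missing values per column


def completeness(values):
    """Return the fraction of values that are not None."""
    return sum(v is not None for v in values) / len(values)


completeness_by_column = {col: completeness(vals) for col, vals in data.items()}

# Columns whose completeness falls below the threshold
incomplete_columns = [
    col for col, ratio in completeness_by_column.items() if ratio < THRESHOLD
]

for col, ratio in completeness_by_column.items():
    print(f"{col}: {ratio:.0%} complete")
print("Below threshold:", incomplete_columns)
```

Each column in the sample has one missing value out of four, so a strict not-null expectation on any of them should be expected to fail.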


In [None]:
# Write your code from here
import great_expectations as ge
import pandas as pd

data = {
    'name': ['Alice', 'Bob', None, 'David'],
    'age': [25, 30, 22, None],
    'email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com']
}
df = pd.DataFrame(data)
ge_df = ge.from_pandas(df)

# Define expectations
ge_df.expect_column_to_exist('name')
ge_df.expect_column_values_to_not_be_null('name')
ge_df.expect_column_values_to_not_be_null('age')
ge_df.expect_column_values_to_be_of_type('age', 'float64')  # pandas may interpret int with NaN as float
ge_df.expect_column_values_to_match_regex('email', r"[^@]+@[^@]+\.[^@]+")  # simple email format check

# Validate dataset
results = ge_df.validate()

# Print validation summary
print("Validation Success:", results.success)
for res in results.results:
    print(f"{res.expectation_config.expectation_type}: {res.success}")

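# Generate HTML report file
# The validation result has no ready-made HTML string, so render it explicitly.
# These renderer classes ship with the legacy (pre-1.0) releases used here.
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

document_model = ValidationResultsPageRenderer().render(results)
html_report = DefaultJinjaPageView().render(document_model)
with open("validation_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)

print("Validation report saved as validation_report.html")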
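
The email pattern in this cell has no anchors. Whether that matters depends on how the backend applies the pattern; if it is applied as a search rather than a full match (an assumption worth verifying for your version), a string that merely contains an email-like fragment would pass. The sketch below uses only Python's `re` module to show the difference; the helper `matches` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the cell above
EMAIL_PATTERN = r"[^@]+@[^@]+\.[^@]+"

# Hypothetical sample values
samples = ["alice@example.com", "alice@example.com@evil", "not-an-email"]


def matches(pattern, value):
    """Return (found_anywhere, matches_whole_string) for the given pattern."""
    found_anywhere = re.search(pattern, value) is not None
    matches_whole = re.fullmatch(pattern, value) is not None
    return found_anywhere, matches_whole


outcomes = {value: matches(EMAIL_PATTERN, value) for value in samples}

for value, (found_anywhere, matches_whole) in outcomes.items():
    print(f"{value!r}: search={found_anywhere}, fullmatch={matches_whole}")
```

If stricter validation is wanted, anchoring the pattern with `^` and `$` makes the search and full-match behaviours agree.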


### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., define an expectation that customer IDs must be unique, and schedule a daily check.

In [None]:
# Write your code from here
# Requires APScheduler; install it first if needed:
# !pip install apscheduler
import great_expectations as ge
import pandas as pd
from apscheduler.schedulers.blocking import BlockingScheduler

def load_data():
    # Sample data; in a real job, reload the latest data here on every run
    data = {
        'customer_id': [101, 102, 103, 101],
        'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'purchase_amount': [250, 300, 150, 400]
    }
    return pd.DataFrame(data)

def run_data_quality_checks():
    # Build the dataset inside the job so each scheduled run validates fresh data
    ge_df = ge.from_pandas(load_data())
    ge_df.expect_column_values_to_be_unique('customer_id')
    ge_df.expect_column_values_to_be_between('purchase_amount', min_value=100, max_value=500)

    results = ge_df.validate()
    print("Validation Success:", results.success)
    for res in results.results:
        print(f"{res.expectation_config.expectation_type} on {res.expectation_config.kwargs.get('column')}: {res.success}")

    if not results.success:
        print("Alert: Data quality issues detected!")
scheduler = BlockingScheduler()
scheduler.add_job(run_data_quality_checks, 'interval', days=1)

print("Starting scheduled data quality checks (every 24 hours)...")
run_data_quality_checks()  # Run once immediately
# BlockingScheduler.start() blocks this cell until interrupted (Ctrl+C or kernel interrupt)
scheduler.start()
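
The alert above only says that something failed. As a small follow-up, the sketch below shows one way to find which customer IDs break the uniqueness rule, using only the standard library. The helper `find_duplicates` is a hypothetical name, and the IDs are copied from the sample data in this task.

```python
from collections import Counter

# Customer IDs from the sample data above
customer_ids = [101, 102, 103, 101]


def find_duplicates(values):
    """Return a dict mapping each repeated value to how often it occurs."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}


duplicates = find_duplicates(customer_ids)

if duplicates:
    print("Duplicate customer IDs:", duplicates)
else:
    print("All customer IDs are unique")
```

Including the offending values in the alert message makes a scheduled check easier to act on than a bare pass or fail.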
