## Automate Data Quality Checks with Great Expectations
**Introduction**: In this activity, you will learn how to automate data quality checks using the Great Expectations framework. This includes setting up expectations and generating validation reports.

### Task 1: Setup and Initial Expectations

1. Objective: Set up Great Expectations and create initial expectations for a dataset.
2. Steps:
    - Install Great Expectations using pip.
    - Initialize a data context.
    - Create basic expectations on a sample dataset.
    - E.g., implement a basic setup with expectations for column presence and type.

In [2]:
# Write your code from here

import great_expectations as ge
import pandas as pd
import schedule
import time

# --------------------------------
# Prepare sample data as DataFrame
# --------------------------------

data = {
    "CustomerID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "Email": ["alice@example.com", "bob@example.com", None, "dave@example.com", "eve@example.com"],
    "Age": [30, 25, 22, 40, 28]
}

df = pd.DataFrame(data)

# --------------------------------
# Task 1: Setup Great Expectations
# --------------------------------

# Initialize a Data Context from an in-code configuration (no CLI initialization)
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)

# Configure the context in code; its stores are written under ./great_expectations
data_context_config = DataContextConfig(
    store_backend_defaults=FilesystemStoreBackendDefaults(root_directory="./great_expectations"),
    anonymous_usage_statistics={"enabled": False}
)
context = BaseDataContext(project_config=data_context_config)

# Create a GE dataset from pandas dataframe
ge_df = ge.from_pandas(df)

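# Create the Expectation Suite (overwrites any existing suite with this name).
# The original guard skipped creation whenever *any* suite already existed,
# which made the lookup below fail if this particular suite was missing.
expectation_suite_name = "customer_data_suite"
context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

suite = context.get_expectation_suite(expectation_suite_name)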

# Add basic expectations for column presence and types
ge_df.expect_table_columns_to_match_ordered_list(column_list=["CustomerID", "Name", "Email", "Age"])
ge_df.expect_column_values_to_not_be_null("CustomerID")
ge_df.expect_column_values_to_be_of_type("CustomerID", "int64")
ge_df.expect_column_values_to_not_be_null("Name")
ge_df.expect_column_values_to_be_of_type("Name", "object")
ge_df.expect_column_values_to_be_unique("CustomerID")

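# Collect the expectations recorded on the dataset and save them to the context.
# (ge_df.save_expectation_suite() takes a file path as its first argument, not a suite name.)
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = expectation_suite_name
context.save_expectation_suite(suite, expectation_suite_name)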

print(f"Expectation suite '{expectation_suite_name}' created with basic expectations.\n")

# --------------------------------
# Task 2: Validate Dataset and Generate Report
# --------------------------------

# Validate the data against the expectations recorded on the dataset.
# With no suite argument, validate() uses the dataset's own expectations.
results = ge_df.validate(result_format="SUMMARY")

print("Validation Results Summary:")
print(results['statistics'])

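# Generate a simple HTML report (saved locally).
# Assumes the legacy (pre-1.0) render API: a page renderer builds a document
# model from the validation result, and a Jinja view turns it into HTML.
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

html_report_path = "./validation_report.html"
document_model = ValidationResultsPageRenderer().render(results)
with open(html_report_path, "w", encoding="utf-8") as report_file:
    report_file.write(DefaultJinjaPageView().render(document_model))
print(f"HTML validation report generated at: {html_report_path}\n")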

# --------------------------------
# Task 3: Advanced Expectations & Scheduling
# --------------------------------

# Add advanced expectation: CustomerID uniqueness (already added above)

# Define function to run validation (for scheduling)
def run_validation():
    print("\nRunning scheduled validation...")
    # Re-validate against the expectations recorded on the dataset
    results = ge_df.validate()
    if results["success"]:
        print("Validation successful!")
    else:
        print("Validation failed. Check expectations.")
    # Here you could extend to save/send report, alert, etc.

# Schedule a daily validation check at 10:00
schedule.every().day.at("10:00").do(run_validation)

# For the demo, run every registered job once immediately instead of waiting
# until 10:00. A long-running process would loop instead:
#     while True:
#         schedule.run_pending()
#         time.sleep(60)
print("Running the scheduled validation once for the demo...")
schedule.run_all()

print("Great Expectations automation demo complete.")


### Task 2: Validate Datasets and Generate Reports

1. Objective: Validate a dataset against defined expectations and generate a report.
2. Steps:
    - Execute the validation process on the dataset.
    - Review the validation results and generate a report.
    - E.g., validate completeness and consistency expectations, and view the results.
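
To make "completeness" and "consistency" concrete, here is a minimal plain-Python sketch of what such checks compute on the sample customer data. The helper names `check_not_null` and `check_between` are hypothetical and are not part of Great Expectations; they only mirror the idea behind expectations such as `expect_column_values_to_not_be_null` (with its `mostly` threshold) and `expect_column_values_to_be_between`. The age range of 0 to 120 is an assumed rule chosen for illustration.

```python
# Plain-Python sketch with hypothetical helper names; not the Great Expectations API.
from typing import Any, Optional, Sequence


def check_not_null(values: Sequence[Optional[Any]], mostly: float = 1.0) -> dict:
    """Completeness: pass when the fraction of non-null values is at least `mostly`."""
    total = len(values)
    missing = sum(1 for value in values if value is None)
    fraction_ok = 1.0 if total == 0 else (total - missing) / total
    return {"success": fraction_ok >= mostly, "unexpected_count": missing}


def check_between(values: Sequence[Optional[float]], min_value: float, max_value: float) -> dict:
    """Consistency: pass when every non-null value lies in [min_value, max_value]."""
    bad = [v for v in values if v is not None and not (min_value <= v <= max_value)]
    return {"success": not bad, "unexpected_count": len(bad)}


# Same sample values as the DataFrame used in this activity
emails = ["alice@example.com", "bob@example.com", None, "dave@example.com", "eve@example.com"]
ages = [30, 25, 22, 40, 28]

strict_result = check_not_null(emails)               # one missing email, so this fails
lenient_result = check_not_null(emails, mostly=0.8)  # 4 of 5 present meets the 0.8 threshold
age_result = check_between(ages, 0, 120)             # assumed plausible age range

print("Email complete (strict): ", strict_result)
print("Email complete (mostly): ", lenient_result)
print("Age within assumed range:", age_result)
```

The strict check fails because one email is missing, while the lenient one passes. Which threshold is appropriate depends on how the data is used downstream.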


In [1]:
# Write your code from here
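# Run this install cell before the other cells. A bare `pip install` line is not
# valid Python, so use the %pip magic inside a notebook.
# The code in this notebook uses the legacy (pre-1.0) Great Expectations pandas API,
# so a 0.x release is assumed; names such as BaseDataContext may need an older 0.x version.
%pip install "great_expectations<1.0" schedule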

### Task 3: Advanced Expectations and Scheduling

1. Objective: Create advanced expectations for conditional checks and automate the validation.
2. Steps:
    - Define advanced expectations based on complex conditions.
    - Use scheduling tools to automate periodic checks.
    - E.g., add an expectation that customer IDs must be unique, and schedule a daily check.
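
As a rough illustration of what a uniqueness check and a conditional check evaluate, the sketch below uses plain Python on the sample rows. The helpers `check_unique` and `check_not_null_where` are hypothetical names, and the rule "Email is required only for customers aged 25 or older" is an assumption made up for this example. In the legacy pandas API of Great Expectations, a comparable condition is typically passed to an expectation through the `row_condition` argument together with `condition_parser="pandas"`.

```python
# Plain-Python sketch with hypothetical helper names; not the Great Expectations API.
from collections import Counter

rows = [
    {"CustomerID": 101, "Email": "alice@example.com", "Age": 30},
    {"CustomerID": 102, "Email": "bob@example.com", "Age": 25},
    {"CustomerID": 103, "Email": None, "Age": 22},
    {"CustomerID": 104, "Email": "dave@example.com", "Age": 40},
    {"CustomerID": 105, "Email": "eve@example.com", "Age": 28},
]


def check_unique(rows, column):
    """Pass when no value in `column` appears more than once."""
    counts = Counter(row[column] for row in rows)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"success": not duplicates, "duplicates": duplicates}


def check_not_null_where(rows, column, condition):
    """Conditional check: `column` must be non-null only in rows where `condition(row)` is true."""
    selected = [row for row in rows if condition(row)]
    failing_ids = [row["CustomerID"] for row in selected if row[column] is None]
    return {"success": not failing_ids, "rows_checked": len(selected), "failing_ids": failing_ids}


unique_result = check_unique(rows, "CustomerID")

# Illustrative rule (an assumption): Email is required only for customers aged 25 or older.
conditional_result = check_not_null_where(rows, "Email", lambda row: row["Age"] >= 25)

print("CustomerID unique:      ", unique_result)
print("Email present if Age>=25:", conditional_result)
```

The only row with a missing email belongs to a 22-year-old customer, so the conditional check passes even though an unconditional not-null check on `Email` would fail. A function like `run_validation` from Task 1 could call checks of this kind on each scheduled run.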

In [None]:
# Write your code from here