### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like Name, Email, Age.
2. Metric Definitions:
    - Completeness: Percentage of non-null values per column.
    - Validity: Percentage of Email fields containing '@'.
    - Uniqueness: Number of distinct entries in the Email column.

In [1]:
# Write your code from here
import pandas as pd
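# Note: ge.from_pandas and the ge_df.expect_* calls below belong to the legacy
# Great Expectations Dataset API; this cell assumes an older release that still ships it.
import great_expectations as ge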

# Sample dataset (Task 1)
data = {
    "Name": ["Alice", "Bob", None, "Dave", "Eve", "Alice"],
    "Email": ["alice@example.com", "bob@example", "carol@example.com", "dave@example.com", "eve@example.com", "alice@example.com"],
    "Age": [30, 25, 22, None, 28, 30]
}

df = pd.DataFrame(data)

# Convert to Great Expectations dataframe
ge_df = ge.from_pandas(df)

# Task 1: Data Quality Metrics

# Completeness: % non-null per column
completeness = df.notnull().mean() * 100

# Validity: % of non-null emails containing '@'
# Note: "bob@example" counts as valid here because the check only looks for '@'.
valid_emails = df['Email'].dropna().str.contains('@', regex=False)
validity = valid_emails.mean() * 100

# Uniqueness: count distinct emails
uniqueness = df['Email'].nunique()

print("=== Data Quality Metrics ===")
print(f"Completeness (%) per column:\n{completeness}\n")
print(f"Validity (%) of Email addresses: {validity:.2f}%")
print(f"Number of unique Email addresses: {uniqueness}")

# Task 2: Calculate Overall Data Quality Score (simple average of three metrics)
# We'll take average completeness across columns, plus validity and uniqueness normalized

avg_completeness = completeness.mean()
# Normalize uniqueness by dividing by total rows (to get % unique)
uniqueness_pct = uniqueness / len(df) * 100

quality_score = (avg_completeness + validity + uniqueness_pct) / 3

print(f"\nOverall Data Quality Score: {quality_score:.2f}%")

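# Task 3: Create Great Expectations Suite for Completeness

# Expect completeness: no nulls in Name, Email, Age.
# Each call is recorded on ge_df and evaluated immediately.
ge_df.expect_column_values_to_not_be_null("Name")
ge_df.expect_column_values_to_not_be_null("Email")
ge_df.expect_column_values_to_not_be_null("Age")

# Retrieve the suite only after the expectations are defined. Keep failed expectations,
# because Name and Age contain nulls in the sample and would otherwise be discarded.
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)

# Save suite to a JSON file (the legacy dataset object, not the suite, has the save method)
ge_df.save_expectation_suite("data_quality_suite.json", discard_failed_expectations=False)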

# Task 4: Run validation and generate HTML report
results = ge_df.validate(expectation_suite=suite)

# Save validation result to HTML report
from great_expectations.render.renderer import ValidationResultsPageRenderer
from great_expectations.render.view import DefaultJinjaPageView

# Pass arguments positionally: render the validation result into a document,
# then turn that document into an HTML string.
renderer = ValidationResultsPageRenderer()
document = renderer.render(results)
view = DefaultJinjaPageView()
html_report = view.render(document)

with open("validation_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)

print("\nValidation report saved as validation_report.html")

# Task 5 & 6: Automate data quality score and trigger cleaning if below threshold

THRESHOLD = 80  # example threshold

def clean_data(df):
    print("\nData quality below threshold. Running cleaning script...")
    # Simple cleaning: drop rows with nulls or invalid emails (missing '@')
    cleaned_df = df.dropna(subset=["Name", "Email", "Age"])
    cleaned_df = cleaned_df[cleaned_df["Email"].str.contains("@")]
    return cleaned_df

if quality_score < THRESHOLD:
    df = clean_data(df)
    print("Cleaned Data:")
    print(df)
else:
    print("\nData quality is good. No cleaning needed.")


ModuleNotFoundError: No module named 'great_expectations'

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
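1. Formula: Simple average of all metrics defined in Task 1, each expressed as a percentage. Uniqueness is a count, so divide it by the number of rows before averaging.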
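
As a worked example of this formula, the sketch below averages the three metrics for the six-row sample from Task 1. The function name `overall_quality_score` and the dictionary layout are illustrative choices, not part of the task, and the metric values are entered by hand from the sample rather than computed from a DataFrame.

```python
from statistics import mean


def overall_quality_score(metrics_pct):
    """Return the unweighted mean of metric values given in percent (0-100)."""
    if not metrics_pct:
        raise ValueError("at least one metric is required")
    return mean(metrics_pct.values())


# Hand-derived values for the six-row sample in Task 1:
# Name and Age each have 5 of 6 values present, Email has 6 of 6.
metrics = {
    "completeness": (5 / 6 + 6 / 6 + 5 / 6) / 3 * 100,
    # All six emails contain '@' (including "bob@example").
    "validity": 6 / 6 * 100,
    # Five distinct emails across six rows.
    "uniqueness": 5 / 6 * 100,
}

score = overall_quality_score(metrics)
print(f"Overall Data Quality Score: {score:.2f}%")  # prints 90.74%
```

An unweighted mean treats all three metrics as equally important; a weighted mean would be a natural variation if, say, completeness matters more for a given dataset.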

In [None]:
# Write your code from here

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness.
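
To see what a completeness expectation actually checks, here is a minimal pandas-only sketch. The helper `check_not_null` is a hypothetical stand-in written for illustration; it is not part of the Great Expectations API, and its result keys are assumptions chosen for readability.

```python
import pandas as pd


def check_not_null(df, column):
    """Report whether `column` is free of nulls, plus how many nulls were found."""
    null_count = int(df[column].isnull().sum())
    total = len(df)
    return {
        "column": column,
        "success": null_count == 0,
        "unexpected_count": null_count,
        "unexpected_percent": null_count / total * 100 if total else 0.0,
    }


# Sample dataset from Task 1
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dave", "Eve", "Alice"],
    "Email": ["alice@example.com", "bob@example", "carol@example.com",
              "dave@example.com", "eve@example.com", "alice@example.com"],
    "Age": [30, 25, 22, None, 28, 30],
})

# One completeness check per column, collected like a tiny "suite"
results = [check_not_null(df, col) for col in ["Name", "Email", "Age"]]
for result in results:
    print(result)
```

On this sample, only the Email check succeeds, since Name and Age each contain one null.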

In [None]:
# Write your code from here

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate the data against the expectation suite.
2. Generate an HTML report from the validation result.

In [None]:
# Write your code from here


### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great Expectations.
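
One possible starting point is to wrap the Task 1 and Task 2 calculations in a single reusable function. The sketch below uses pandas only and leaves the Great Expectations integration out; the function name `compute_quality_score` and the `email_col` parameter are assumptions made for illustration.

```python
import pandas as pd


def compute_quality_score(df, email_col="Email"):
    """Return the three Task 1 metrics (in percent) and their simple average."""
    if df.empty:
        raise ValueError("cannot score an empty DataFrame")

    # Completeness: share of non-null cells per column, averaged over columns
    completeness = float(df.notnull().mean().mean() * 100)

    # Validity: share of non-null emails that contain '@'
    emails = df[email_col].dropna().astype(str)
    validity = float(emails.str.contains("@", regex=False).mean() * 100) if len(emails) else 0.0

    # Uniqueness: distinct emails as a share of all rows
    uniqueness = float(df[email_col].nunique() / len(df) * 100)

    return {
        "completeness": completeness,
        "validity": validity,
        "uniqueness": uniqueness,
        "score": (completeness + validity + uniqueness) / 3,
    }


# Sample dataset from Task 1
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dave", "Eve", "Alice"],
    "Email": ["alice@example.com", "bob@example", "carol@example.com",
              "dave@example.com", "eve@example.com", "alice@example.com"],
    "Age": [30, 25, 22, None, 28, 30],
})

report = compute_quality_score(sample)
print(f"Overall Data Quality Score: {report['score']:.2f}%")
```

Returning the individual metrics alongside the score makes it easier to see which dimension pulled the score down when a run falls below the threshold.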

In [None]:
# Write your code from here


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where, if data quality metrics fall below a threshold, automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that triggers only if the quality score is below a threshold, so that cleaning runs automatically.
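
The threshold logic itself can be sketched without Great Expectations. In the example below a plain function, `clean_if_below`, stands in for the action; its name and signature are hypothetical, and wiring it into an actual Great Expectations action list is not shown because that depends on the installed version. The cleaning rule (drop rows with nulls or with an email lacking '@') mirrors the one used earlier in this notebook.

```python
import pandas as pd


def quality_score(df):
    """Simple average of completeness, validity, and uniqueness, in percent."""
    completeness = df.notnull().mean().mean() * 100
    emails = df["Email"].dropna()
    validity = emails.str.contains("@", regex=False).mean() * 100
    uniqueness = df["Email"].nunique() / len(df) * 100
    return float((completeness + validity + uniqueness) / 3)


def clean_data(df):
    """Drop rows with nulls in any key column or with an email lacking '@'."""
    cleaned = df.dropna(subset=["Name", "Email", "Age"])
    return cleaned[cleaned["Email"].str.contains("@", regex=False)]


def clean_if_below(df, threshold):
    """Return (DataFrame, triggered): clean only when the score is below threshold."""
    if quality_score(df) < threshold:
        return clean_data(df), True
    return df, False


# Sample dataset from Task 1 (its score is about 90.74%)
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dave", "Eve", "Alice"],
    "Email": ["alice@example.com", "bob@example", "carol@example.com",
              "dave@example.com", "eve@example.com", "alice@example.com"],
    "Age": [30, 25, 22, None, 28, 30],
})

# With a threshold of 80 the sample passes and is returned unchanged
same, triggered_80 = clean_if_below(sample, threshold=80)

# With a stricter threshold of 95 the cleaning step runs
cleaned, triggered_95 = clean_if_below(sample, threshold=95)
print(f"threshold=80 triggered: {triggered_80}, rows: {len(same)}")
print(f"threshold=95 triggered: {triggered_95}, rows: {len(cleaned)}")
```

Note that this cleaning rule does not remove duplicate emails, so the uniqueness metric can still be below 100% after cleaning.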

In [None]:
# Write your code from here
