### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns like `Name`, `Email`, `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values containing `@`.
    - Uniqueness: Number of distinct entries in the `Email` column.

In [10]:
# Write your code from here
import pandas as pd

# Step 1: Load sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Email': ['alice@example.com', 'bob@example', None, 'dave@example.com'],
    'Age': [25, 30, None, 40]
}

df = pd.DataFrame(data)

# Step 2: Define Data Quality Metrics

# Completeness: % of non-null values per column
completeness = df.notnull().mean() * 100

# Validity: % of valid emails (containing '@')
valid_emails = df['Email'].apply(lambda x: isinstance(x, str) and '@' in x)
validity = valid_emails.mean() * 100

# Uniqueness: number of unique emails (excluding NaN)
uniqueness = df['Email'].nunique()

# Display metrics
print("Completeness (%):\n", completeness.round(2))
print("\nEmail Validity (%):", round(validity, 2))
print("Email Uniqueness (count):", uniqueness)


Completeness (%):
 Name     75.0
Email    75.0
Age      75.0
dtype: float64

Email Validity (%): 75.0
Email Uniqueness (count): 3
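
The `@` check is permissive: `bob@example` counts as valid even though it has no top-level domain. A stricter variant can be sketched as below, assuming a deliberately simple pattern (`EMAIL_PATTERN` and `email_validity` are illustrative names, and the regex is not a full email validator).

```python
import re
import pandas as pd

# Hypothetical, deliberately simple pattern: local part, '@', then a domain with at least one dot.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def email_validity(series: pd.Series) -> float:
    """Return the percentage of entries matching EMAIL_PATTERN (nulls count as invalid)."""
    if len(series) == 0:
        return 0.0
    is_valid = series.apply(lambda x: isinstance(x, str) and bool(EMAIL_PATTERN.match(x)))
    return float(is_valid.mean() * 100)

# Same Email values as the sample dataset above
emails = pd.Series(['alice@example.com', 'bob@example', None, 'dave@example.com'])
strict_validity = email_validity(emails)
print(f"Strict email validity: {strict_validity:.2f}%")
```

With this pattern only two of the four sample values pass, so the metric drops to 50%.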


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
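1. Formula: Simple average of the three metrics from Task 1, each expressed as a percentage (uniqueness becomes the share of non-null emails that are distinct).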
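
Written out, with $C$, $V$, and $U$ standing for completeness, validity, and uniqueness in percent (symbols introduced here only for illustration), the score for the sample dataset works out as:

```latex
\text{Score} = \frac{C + V + U}{3} = \frac{75 + 75 + 100}{3} \approx 83.33
```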

In [11]:
# Write your code from here
import pandas as pd

# Sample dataset (same as in Task 1)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Email': ['alice@example.com', 'bob@example', None, 'dave@example.com'],
    'Age': [25, 30, None, 40]
}

df = pd.DataFrame(data)

# Metric 1: Completeness (% of non-null values per column averaged)
completeness = df.notnull().mean().mean() * 100

# Metric 2: Validity (% of Email values containing '@')
valid_emails = df['Email'].apply(lambda x: isinstance(x, str) and '@' in x)
validity = valid_emails.mean() * 100

# Metric 3: Uniqueness (% of unique non-null emails)
unique_emails = df['Email'].dropna().nunique()
total_emails = df['Email'].dropna().shape[0]
uniqueness = (unique_emails / total_emails * 100) if total_emails > 0 else 0

# Calculate Data Quality Score (simple average)
metrics = [completeness, validity, uniqueness]
data_quality_score = sum(metrics) / len(metrics)

# Display results
print(f"Completeness: {completeness:.2f}%")
print(f"Validity: {validity:.2f}%")
print(f"Uniqueness: {uniqueness:.2f}%")
print(f"\nOverall Data Quality Score: {data_quality_score:.2f}%")


Completeness: 75.00%
Validity: 75.00%
Uniqueness: 100.00%

Overall Data Quality Score: 83.33%


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Create an Expectation Suite.
2. Define expectations for completeness.

In [12]:
# Write your code from here
# Note: this cell uses the legacy Great Expectations API (DataContext, PandasDataset),
# which may not be available in newer releases.
import pandas as pd
import great_expectations as ge

# Step 1: Load the DataFrame using pandas (data.csv must exist in the working directory)
df = pd.read_csv('data.csv')

# Step 2: Create a DataContext (default directory where expectations will be stored)
context = ge.data_context.DataContext()

# Step 3: Create an Expectation Suite (named "data_quality_suite" here)
context.create_expectation_suite('data_quality_suite', overwrite_existing=True)

# Step 4: Wrap the DataFrame so expectation methods are available on it
batch = ge.dataset.PandasDataset(df)

# Step 5: Add expectations

# Example 1: Expect the 'Name' column to contain no nulls
batch.expect_column_values_to_not_be_null('Name')

# Example 2: Expect 'Email' values to contain '@'
batch.expect_column_values_to_match_regex('Email', r".*@.*")

# Example 3: Expect 'Age' to be between 18 and 100
batch.expect_column_values_to_be_between('Age', min_value=18, max_value=100)

# Save the expectations recorded on the batch under the suite name
suite = batch.get_expectation_suite(discard_failed_expectations=False)
context.save_expectation_suite(suite, 'data_quality_suite')


FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
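
The intent of the three checks in this task (`Name` not null, `Email` containing `@`, `Age` between 18 and 100) can also be sketched in plain pandas, which runs without a `data.csv` file or any particular Great Expectations version. The sample rows are reused from Task 1 as a stand-in, the check names are illustrative, and skipping nulls in the email and age checks is an assumption about the desired behavior.

```python
import pandas as pd

# Hypothetical stand-in for data.csv, reusing the Task 1 sample rows.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Email': ['alice@example.com', 'bob@example', None, 'dave@example.com'],
    'Age': [25, 30, None, 40],
})

def run_basic_checks(frame: pd.DataFrame) -> dict:
    """Return one pass/fail flag per check; the check names are illustrative."""
    emails = frame['Email'].dropna()  # assumption: nulls are skipped, not failed
    ages = frame['Age'].dropna()      # assumption: nulls are skipped, not failed
    return {
        'name_not_null': bool(frame['Name'].notna().all()),
        'email_contains_at': bool(emails.str.contains('@', regex=False).all()),
        'age_between_18_and_100': bool(ages.between(18, 100).all()),
    }

check_results = run_basic_checks(df)
for check, passed in check_results.items():
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

On the sample rows only the `Name` check fails, because one name is missing.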

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
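# Write your code from here
import pandas as pd
import great_expectations as ge

# Load data (replace with your own data source)
your_dataframe = pd.read_csv("your_data.csv")

# Wrap the DataFrame so expectation methods are available (legacy GE API)
df = ge.dataset.PandasDataset(your_dataframe)

# Example Expectations
df.expect_column_values_to_be_in_set('column_name', ['value1', 'value2'])
df.expect_column_proportion_of_unique_values_to_be_between('column_name', min_value=0.9)

# Validate all expectations recorded on the dataset and print the result
results = df.validate()
print(results)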
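
Step 2 asks for an HTML report. As a library-independent illustration, the sketch below renders a list of pass/fail outcomes into a small HTML page with `DataFrame.to_html`. This is not Great Expectations Data Docs; the `check_results` entries, the function name, and the output file name are all hypothetical.

```python
import pandas as pd

# Hypothetical validation outcomes; in practice these would come from a validation run.
check_results = [
    {'check': 'name_not_null', 'success': False},
    {'check': 'email_contains_at', 'success': True},
    {'check': 'age_between_18_and_100', 'success': True},
]

def render_html_report(results: list) -> str:
    """Build a small standalone HTML page with one table row per check."""
    table = pd.DataFrame(results).to_html(index=False)
    passed = sum(1 for r in results if r['success'])
    summary = f"<p>{passed} of {len(results)} checks passed</p>"
    return f"<html><body><h1>Validation report</h1>{summary}{table}</body></html>"

report_html = render_html_report(check_results)

# Write the report to the working directory; the file name is arbitrary.
with open('validation_report.html', 'w', encoding='utf-8') as fh:
    fh.write(report_html)
```

Opening `validation_report.html` in a browser shows the summary line followed by the table of checks.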



### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great Expectations.

In [None]:
# Write your code from here
import pandas as pd
import great_expectations as ge

# Function to calculate the Data Quality Score
def calculate_data_quality_score(dataframe):
    # Wrap the DataFrame so expectation methods are available (legacy GE API)
    df = ge.dataset.PandasDataset(dataframe)

    # Example Expectations
    df.expect_column_values_to_be_in_set('column_name', ['value1', 'value2'])
    df.expect_column_proportion_of_unique_values_to_be_between('column_name', min_value=0.9)

    # Run validation
    validation_result = df.validate()

    # Data Quality Score = share of expectations that passed
    expectations_results = validation_result['results']
    total_expectations = len(expectations_results)
    if total_expectations == 0:
        return 0.0  # nothing to evaluate; avoid dividing by zero
    passed_expectations = sum(result['success'] for result in expectations_results)

    return passed_expectations / total_expectations

# Function to generate an HTML report (Data Docs)
def generate_html_report(context):
    context.build_data_docs()
    context.open_data_docs()

# Load data (replace with your own data source)
your_dataframe = pd.read_csv("your_data.csv")

# Calculate Data Quality Score
score = calculate_data_quality_score(your_dataframe)
print(f"Data Quality Score: {score * 100:.2f}%")

# Generate HTML report (replace the path with your project's great_expectations directory)
context = ge.data_context.DataContext("/path/to/great_expectations")
generate_html_report(context)



### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where, if data quality metrics fall below a threshold, automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that only triggers if quality score is below a threshold, automating the cleaning.

In [None]:
# Write your code from here
import pandas as pd

# Clean missing values (for simplicity, replace NaNs in numeric columns with the column mean)
def clean_missing_values(df):
    # numeric_only=True keeps text columns from raising a TypeError in recent pandas
    return df.fillna(df.mean(numeric_only=True))

# Clean duplicates
def remove_duplicates(df):
    return df.drop_duplicates()

# Correct data types (example: converting a column to integer)
def correct_data_types(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    # astype(int) raises if NaNs remain, so run this after clean_missing_values
    df['column_name'] = df['column_name'].astype(int)
    return df

# Apply the cleaning steps in order
def apply_cleaning(df):
    df = clean_missing_values(df)
    df = remove_duplicates(df)
    df = correct_data_types(df)
    return df
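
The threshold trigger described in the steps can be sketched in plain pandas as follows. This is not a Great Expectations action; the action-list wiring depends on the library version and is left out. The threshold value, the use of completeness as the score, and all function names here are assumptions made for illustration.

```python
import pandas as pd

QUALITY_THRESHOLD = 90.0  # hypothetical cut-off, in percent

def completeness_score(frame: pd.DataFrame) -> float:
    """Average share of non-null cells across columns, in percent."""
    if frame.empty:
        return 0.0
    return float(frame.notnull().mean().mean() * 100)

def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column mean, then drop exact duplicate rows."""
    cleaned = frame.fillna(frame.mean(numeric_only=True))
    return cleaned.drop_duplicates()

def clean_if_below_threshold(frame, threshold=QUALITY_THRESHOLD):
    """Run cleaning only when the score is under the threshold."""
    score = completeness_score(frame)
    if score >= threshold:
        return frame, score, False
    return basic_clean(frame), score, True

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, 30, None, 40],
})
cleaned_df, score_before, was_cleaned = clean_if_below_threshold(df)
print(f"Score before: {score_before:.2f}%, cleaned: {was_cleaned}")
print(f"Score after: {completeness_score(cleaned_df):.2f}%")
```

On this sample the score starts at 75%, so cleaning runs and fills the missing `Age`; the missing `Name` is left alone because mean imputation only applies to numeric columns.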

