## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using three metrics: completeness, accuracy, and consistency. Each task below works through one metric with an example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.
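
The steps above can be sketched in a few lines. This is a minimal illustration, assuming a small hypothetical customer table (the `customers` frame and its values are made up for the example); it lists the columns that contain missing values and then computes completeness over all cells.

```python
import pandas as pd

# Hypothetical customer records; None marks a missing value.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Name": ["Ann", None, "Cy", "Di"],
    "City": ["Oslo", "Rome", None, None],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = customers.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Completeness = non-missing cells / total cells, expressed as a percentage.
total_cells = customers.size
non_missing_cells = int(customers.notnull().sum().sum())
completeness_pct = non_missing_cells / total_cells * 100

print("Columns with missing values:", columns_with_missing)
print(f"Completeness: {completeness_pct:.2f}%")
```

Because every column has the same number of rows, this cell-level ratio equals the mean of the per-column completeness percentages.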

In [6]:


import pandas as pd
import re

def calculate_accuracy(df_main, df_ref, key_column, compare_columns):
    """
    Calculate accuracy percentage by comparing specified columns in two dataframes.

    Parameters:
        df_main (pd.DataFrame): Main dataset.
        df_ref (pd.DataFrame): Reference dataset.
        key_column (str): Column name to join both datasets on.
        compare_columns (list of str): List of columns to compare accuracy.

    Returns:
        dict: Accuracy percentage for each compared column.
    """
    if key_column not in df_main.columns or key_column not in df_ref.columns:
        raise ValueError(f"Key column '{key_column}' must exist in both datasets.")

    merged = pd.merge(df_main, df_ref, on=key_column, suffixes=('_main', '_ref'))

    accuracy_results = {}
    for col in compare_columns:
        col_main = f"{col}_main"
        col_ref = f"{col}_ref"
        if col_main not in merged.columns or col_ref not in merged.columns:
            raise ValueError(f"Column '{col}' not found in merged dataset.")
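        # Missing values never compare equal, so they count as mismatches.
        correct = (merged[col_main] == merged[col_ref]).sum()
        total = len(merged)
        # Guard against an empty join (no shared keys) so the division is never by zero.
        accuracy_results[col] = (correct / total) * 100 if total else float('nan')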
    return accuracy_results


def is_phone_consistent(phone, pattern=re.compile(r'^\d{3}-\d{3}-\d{4}$')):
    """
    Check if a phone number matches the expected format: xxx-xxx-xxxx.

    Parameters:
        phone (str): Phone number string.
        pattern (re.Pattern): Compiled regex pattern to check format.

    Returns:
        bool: True if phone matches pattern, False otherwise.
    """
    if pd.isnull(phone):
        return False
    if not isinstance(phone, str):
        return False
    return bool(pattern.match(phone))


def calculate_completeness(df):
    """
    Calculate completeness percentage per column and overall.

    Parameters:
        df (pd.DataFrame): Dataset.

    Returns:
        tuple: (completeness_per_column: pd.Series, overall_completeness: float)
    """
    completeness_per_column = df.notnull().mean() * 100
    overall_completeness = completeness_per_column.mean()
    return completeness_per_column, overall_completeness


# Example usage with try-except for error handling
try:
    # Example datasets: customer records and a trusted reference copy
    main_data = {
        'CustomerID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', None, 'Dave', 'Eve'],
        'Email': ['alice@example.com', None, 'carol@example.com', 'dave@example.com', 'eve@example.com'],
        'Phone': ['123-456-7890', '987-654-3210', '1234567890', '555-555-5555', '111-222-3333'],
        'Age': [30, 25, 22, None, 28]
    }
    reference_data = {
        'CustomerID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
        'Email': ['alice@example.com', 'bob@example.com', 'carol@example.com', 'dave@example.com', 'eve@example.com']
    }
    df_main = pd.DataFrame(main_data)
    df_ref = pd.DataFrame(reference_data)

    # Completeness
    completeness_col, overall_completeness = calculate_completeness(df_main)
    print("Completeness per column:\n", completeness_col)
    print(f"Overall completeness: {overall_completeness:.2f}%\n")

    # Accuracy
    accuracy_results = calculate_accuracy(df_main, df_ref, 'CustomerID', ['Name', 'Email'])
    for col, score in accuracy_results.items():
        print(f"Accuracy for {col}: {score:.2f}%")
    print()

    # Consistency
    consistency_flags = df_main['Phone'].apply(is_phone_consistent)
    consistency_score = consistency_flags.mean() * 100
    print(f"Consistency Score for Phone format: {consistency_score:.2f}%")

except Exception as e:
    print("Error encountered:", e)

Completeness per column:
 CustomerID    100.0
Name           80.0
Email          80.0
Phone         100.0
Age            80.0
dtype: float64
Overall completeness: 88.00%

Accuracy for Name: 80.00%
Accuracy for Email: 80.00%

Consistency Score for Phone format: 80.00%


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select the key column to join on and the columns to check for accuracy.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.
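
As an illustration of these steps, here is a minimal sketch on a hypothetical sales table. The `sales` and `sales_ref` frames, the `OrderID` key, and the values are assumptions made up for the example. It uses a left join with `indicator=True` so that orders missing from the reference are reported rather than silently dropped.

```python
import pandas as pd

# Hypothetical sales records and a trusted reference copy.
sales = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Amount": [100.0, 250.0, 75.0, 40.0],
    "Region": ["North", "South", "East", "West"],
})
sales_ref = pd.DataFrame({
    "OrderID": [1, 2, 3, 5],
    "Amount": [100.0, 200.0, 75.0, 60.0],
    "Region": ["North", "South", "East", "West"],
})

# Left join keeps every row of the main dataset; the _merge column
# records whether a matching reference row was found.
merged = sales.merge(
    sales_ref, on="OrderID", how="left",
    suffixes=("_main", "_ref"), indicator=True,
)

# Score only the rows that have a reference row to compare against.
matched = merged[merged["_merge"] == "both"]
accuracy = {}
for col in ["Amount", "Region"]:
    agree = (matched[f"{col}_main"] == matched[f"{col}_ref"]).sum()
    accuracy[col] = agree / len(matched) * 100 if len(matched) else float("nan")

# Orders present in the main dataset but absent from the reference.
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "OrderID"].tolist()

for col, score in accuracy.items():
    print(f"Accuracy for {col}: {score:.2f}%")
print("Orders without a reference row:", unmatched_ids)
```

Note that `pd.merge` defaults to an inner join, so a plain merge would score only the shared keys without telling you which rows were left out.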

In [5]:
# Write your code from here

import pandas as pd

# calculate_accuracy, is_phone_consistent, and calculate_completeness are defined
# in the Task 1 cell above, so run that cell first. If you move them into a
# module, import them from that module instead.

def test_completeness():
    df = pd.DataFrame({
        'A': [1, None, 3],
        'B': [None, None, 3]
    })
    completeness_col, overall = calculate_completeness(df)
    assert round(completeness_col['A'], 2) == 66.67
    assert round(completeness_col['B'], 2) == 33.33
    assert round(overall, 2) == 50.0

def test_accuracy():
    df_main = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
    df_ref = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'C']})
    result = calculate_accuracy(df_main, df_ref, 'ID', ['Name'])
    assert round(result['Name'], 2) == 50.0

def test_phone_consistency():
    assert is_phone_consistent('123-456-7890') is True
    assert is_phone_consistent('1234567890') is False
    assert is_phone_consistent(None) is False
    assert is_phone_consistent(1234567890) is False

# Run the tests; an AssertionError is raised if any check fails.
test_completeness()
test_accuracy()
test_phone_consistency()

### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
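
One way to approach these steps is a vectorized, rule-based check. The sketch below assumes a hypothetical contact list (the `contacts` frame is made up) and the `xxx-xxx-xxxx` phone format used in Task 1; it treats missing phone numbers as inconsistent, which is a choice you may want to change.

```python
import pandas as pd

# Hypothetical contact list with a mix of formats and one missing value.
contacts = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cy", "Di", "Ed"],
    "Phone": ["123-456-7890", "(123) 456-7890", "555-000-1111", None, "5550001111"],
})

# Expected format: three digits, three digits, four digits, separated by hyphens.
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

# fullmatch requires the whole string to match the pattern;
# na=False makes missing values count as inconsistent.
is_consistent = contacts["Phone"].str.fullmatch(PHONE_PATTERN, na=False)

# Consistency score = consistent entries / total entries, as a percentage.
consistency_pct = is_consistent.mean() * 100

# Keep the offending rows so they can be reviewed or cleaned.
inconsistent_rows = contacts.loc[~is_consistent, ["Name", "Phone"]]

print(f"Consistency score: {consistency_pct:.2f}%")
print(inconsistent_rows)
```

Listing the inconsistent rows alongside the score makes it easier to see whether the failures are formatting variants or genuinely bad values.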

In [None]:
# Write your code from here
