## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using different metrics. You will explore examples where you assess completeness, accuracy, and consistency.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [None]:
# Write your code from here
import pandas as pd

def completeness_score(df):
    total_values = df.size
    non_missing_values = df.count().sum()
    score = (non_missing_values / total_values) * 100 if total_values > 0 else 0.0
    return score

# Example usage:
data = {
    'CustomerID': [1, 2, 3, 4, None],
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Email': ['alice@example.com', None, 'charlie@example.com', 'dave@example.com', 'eve@example.com']
}

df = pd.DataFrame(data)
score = completeness_score(df)
print(f"Completeness Score: {score:.2f}%")
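
The step "Identify the columns with missing values" can be sketched per column as well. The helper name `column_completeness` and the sample frame `customers` are hypothetical and used only for illustration. The sketch relies on the standard pandas methods `isna`, `notna`, `sum`, and `mean`.

```python
import pandas as pd

def column_completeness(df):
    """Return missing counts and completeness (%) for columns that have gaps."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),                   # missing cells per column
        "completeness_pct": df.notna().mean() * 100,  # share of filled cells per column
    })
    # Keep only the columns that actually contain missing values
    return report[report["missing"] > 0]

# Illustrative data mirroring the customer example above
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, None],
    "Name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "Email": ["alice@example.com", None, "charlie@example.com",
              "dave@example.com", "eve@example.com"],
})

report = column_completeness(customers)
print(report)
```

A per-column view like this shows where the gaps are concentrated, which the single overall percentage hides.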


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [None]:
# Write your code from here
import pandas as pd

def accuracy_score(main_df, ref_df, key_columns):
    if main_df.empty or ref_df.empty:
        return 0.0

    merged = pd.merge(main_df, ref_df, on=key_columns, how='inner', suffixes=('_main', '_ref'))
    if merged.empty:
        return 0.0
    
    accuracy_counts = 0
    total_counts = 0  # counts compared cells only, so it must start at zero
    
    # Compare every non-key column that exists in both datasets
    for col in main_df.columns:
        if col not in key_columns and col in ref_df.columns:
            accuracy_counts += (merged[f"{col}_main"] == merged[f"{col}_ref"]).sum()
            total_counts += len(merged)
    
    # Avoid division by zero
    if total_counts == 0:
        return 0.0
    
    accuracy = (accuracy_counts / total_counts) * 100
    return accuracy

# Example usage:
main_data = {
    'ID': [1, 2, 3, 4],
    'Sales': [100, 150, 200, 250],
    'Region': ['East', 'West', 'East', 'North']
}

ref_data = {
    'ID': [1, 2, 3, 4],
    'Sales': [100, 140, 200, 260],
    'Region': ['East', 'West', 'East', 'North']
}

main_df = pd.DataFrame(main_data)
ref_df = pd.DataFrame(ref_data)

score = accuracy_score(main_df, ref_df, key_columns=['ID'])
print(f"Accuracy Score: {score:.2f}%")
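
A per-column breakdown can show which column lowers the overall accuracy. The following is a minimal sketch under the same assumptions as the cell above (both frames share the key columns and keys are unique). The helper name `per_column_accuracy` and the sample frames are hypothetical.

```python
import pandas as pd

def per_column_accuracy(main_df, ref_df, key_columns):
    """Return the match percentage for each non-key column shared by both frames."""
    merged = main_df.merge(ref_df, on=key_columns, how="inner",
                           suffixes=("_main", "_ref"))
    shared = [c for c in main_df.columns
              if c not in key_columns and c in ref_df.columns]
    if merged.empty or not shared:
        return pd.Series(dtype=float)
    # The mean of a boolean Series is the fraction of matching rows
    return pd.Series({
        c: (merged[f"{c}_main"] == merged[f"{c}_ref"]).mean() * 100
        for c in shared
    })

# Illustrative data mirroring the sales example above
sales_main = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Sales": [100, 150, 200, 250],
    "Region": ["East", "West", "East", "North"],
})
sales_ref = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Sales": [100, 140, 200, 260],
    "Region": ["East", "West", "East", "North"],
})

breakdown = per_column_accuracy(sales_main, sales_ref, key_columns=["ID"])
print(breakdown)
```

Note that `==` treats two missing values as unequal, so rows where both datasets have `NaN` count as mismatches in this sketch.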



### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [None]:
# Write your code from here
import pandas as pd
import re

def consistency_score(df, column, pattern):
    if column not in df.columns or df.empty:
        return 0.0
    
    total = len(df)
    if total == 0:
        return 0.0
    
    # Check if each value matches the regex pattern
    consistent_count = df[column].apply(lambda x: bool(re.fullmatch(pattern, str(x)))).sum()
    
    score = (consistent_count / total) * 100
    return score

# Example usage:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Phone': ['123-456-7890', '987-654-3210', '1234567890', '123-45-6789']  # last two do not match the expected format
}

df = pd.DataFrame(data)

phone_pattern = r'\d{3}-\d{3}-\d{4}'  # US phone format: 123-456-7890

score = consistency_score(df, 'Phone', phone_pattern)
print(f"Consistency Score: {score:.2f}%")
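
Besides the score, it is often useful to see which entries fail the check. A minimal sketch, assuming pandas 1.1 or later (where `Series.str.fullmatch` is available). The helper name `inconsistent_rows` and the sample frame `contacts` are hypothetical.

```python
import pandas as pd

def inconsistent_rows(df, column, pattern):
    """Return the rows whose value in `column` does not fully match `pattern`."""
    # Cast to str so missing or numeric values are tested instead of raising
    matches = df[column].astype(str).str.fullmatch(pattern)
    return df[~matches]

# Illustrative data mirroring the contact-list example above
contacts = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Phone": ["123-456-7890", "987-654-3210", "1234567890", "123-45-6789"],
})

bad = inconsistent_rows(contacts, "Phone", r"\d{3}-\d{3}-\d{4}")
print(bad)
```

Listing the offending rows makes it easier to decide whether to normalize the values (for example, reinserting dashes) or to reject them.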
