## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using three metrics: completeness, accuracy, and consistency. Each task works through an example of one metric.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
import pandas as pd
import re

# ----------------------------
# Sample datasets for the tasks
# ----------------------------

# Main dataset (used in Tasks 1, 2, and 3)
main_data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', None, 'Dave', 'Eve'],
    'Email': ['alice@example.com', None, 'carol@example.com', 'dave@example.com', 'eve@example.com'],
    'Phone': ['123-456-7890', '987-654-3210', '1234567890', '555-555-5555', '111-222-3333'],
    'Age': [30, 25, 22, None, 28]
}

# Reference dataset (Task 2)
reference_data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'Email': ['alice@example.com', 'bob@example.com', 'carol@example.com', 'dave@example.com', 'eve@example.com']
}

# Create DataFrames
df_main = pd.DataFrame(main_data)
df_ref = pd.DataFrame(reference_data)

# ----------------------------
# Task 1: Completeness Score
# ----------------------------

print("Task 1: Completeness Score")

missing_cols = df_main.columns[df_main.isnull().any()].tolist()
print("Columns with missing values:", missing_cols)

completeness_per_col = df_main.notnull().mean() * 100
print("Completeness (%) per column:")
print(completeness_per_col)

overall_completeness = completeness_per_col.mean()
print(f"Overall Completeness Score: {overall_completeness:.2f}%\n")

# ----------------------------
# Task 2: Accuracy Score
# ----------------------------

print("Task 2: Accuracy Score")

# We will check accuracy on 'Name' and 'Email' columns by matching values for same CustomerID

# Merge datasets on CustomerID for comparison
merged = pd.merge(df_main, df_ref, on='CustomerID', suffixes=('_main', '_ref'))

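# Calculate the accuracy of one column as the percentage of rows in `merged`
# whose main value equals the reference value. A missing main value never
# equals the reference, so it counts as incorrect.
def accuracy(col_main, col_ref):
    correct = (merged[col_main] == merged[col_ref]).sum()
    total = len(merged)
    return (correct / total) * 100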

accuracy_name = accuracy('Name_main', 'Name_ref')
accuracy_email = accuracy('Email_main', 'Email_ref')

print(f"Accuracy for Name: {accuracy_name:.2f}%")
print(f"Accuracy for Email: {accuracy_email:.2f}%")

overall_accuracy = (accuracy_name + accuracy_email) / 2
print(f"Overall Accuracy Score: {overall_accuracy:.2f}%\n")

# ----------------------------
# Task 3: Consistency Score
# ----------------------------

print("Task 3: Consistency Score")

# Rule: Phone number should follow pattern xxx-xxx-xxxx
pattern = re.compile(r'^\d{3}-\d{3}-\d{4}$')

def is_phone_consistent(phone):
    if pd.isnull(phone):
        return False
    return bool(pattern.match(phone))

consistency_flags = df_main['Phone'].apply(is_phone_consistent)
consistency_score = consistency_flags.mean() * 100

print(f"Consistency Score for Phone format: {consistency_score:.2f}%")

# Also report how many phone numbers are inconsistent
print(f"Inconsistent phone entries: {len(df_main) - consistency_flags.sum()} out of {len(df_main)}")


Task 1: Completeness Score
Columns with missing values: ['Name', 'Email', 'Age']
Completeness (%) per column:
CustomerID    100.0
Name           80.0
Email          80.0
Phone         100.0
Age            80.0
dtype: float64
Overall Completeness Score: 88.00%

Task 2: Accuracy Score
Accuracy for Name: 80.00%
Accuracy for Email: 80.00%
Overall Accuracy Score: 80.00%

Task 3: Consistency Score
Consistency Score for Phone format: 80.00%
Inconsistent phone entries: 1 out of 5


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.
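
The steps above can be sketched for a small sales dataset as follows. This is one possible approach, not the expected solution: the `sales_main` and `sales_ref` tables, their values, and the `column_accuracy` helper are all hypothetical and exist only for illustration. The sketch assumes every `OrderID` appears exactly once in both tables; because the merge is an inner join, rows missing from the reference would be dropped rather than counted as errors.

```python
import pandas as pd

# Hypothetical sales records to be checked (values invented for illustration)
sales_main = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Product": ["Pen", "Book", "Lamp", "Desk"],
    "Amount": [2.50, 12.00, 30.00, 150.00],
})

# Hypothetical trusted reference for the same orders
sales_ref = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Product": ["Pen", "Book", "Lamp", "Chair"],
    "Amount": [2.50, 12.00, 35.00, 150.00],
})

def column_accuracy(main, ref, key, column):
    """Percentage of rows whose `column` value matches the reference for the same `key`."""
    # Inner join on the key so each main row sits next to its reference row
    merged = main.merge(ref, on=key, suffixes=("_main", "_ref"))
    matches = merged[f"{column}_main"] == merged[f"{column}_ref"]
    return matches.mean() * 100

product_accuracy = column_accuracy(sales_main, sales_ref, "OrderID", "Product")
amount_accuracy = column_accuracy(sales_main, sales_ref, "OrderID", "Amount")

print(f"Product accuracy: {product_accuracy:.2f}%")
print(f"Amount accuracy: {amount_accuracy:.2f}%")
```

Passing the DataFrames into the helper, rather than reading a global merged table, makes it easy to reuse the same check on other column and dataset pairs.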

In [None]:
# Write your code from here


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
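
A rule-based check of this kind can also be written without a row-by-row function. The sketch below is one possible approach, assuming a hypothetical `contacts` table with invented values and the same `xxx-xxx-xxxx` rule; it uses the pandas string method `Series.str.fullmatch`, with `na=False` so that missing phone numbers count as inconsistent.

```python
import pandas as pd

# Hypothetical contact list (values invented for illustration)
contacts = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cy", "Di"],
    "Phone": ["123-456-7890", "9876543210", None, "555-555-5555"],
})

# Rule: a phone number must look like xxx-xxx-xxxx
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

# fullmatch requires the whole string to match; na=False marks missing values as failures
is_consistent = contacts["Phone"].str.fullmatch(PHONE_PATTERN, na=False).astype(bool)

consistency_score = is_consistent.mean() * 100
print(f"Consistency score for Phone format: {consistency_score:.2f}%")

# List the rows that break the rule so they can be reviewed or corrected
print(contacts.loc[~is_consistent])
```

Here two of the four entries follow the rule: one fails because it has no dashes and one because it is missing.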

In [None]:
# Write your code from here
