## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for sample datasets using three metrics: completeness, accuracy, and consistency. Each task works through one metric on a small example.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score of each column as the ratio of non-missing values to the total number of rows.
    - Example: a dataset with customer information.

In [None]:
# Write your code from here
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, None, 22, 29],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com', 'eve@example.com'],
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)

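# Count the missing values in each column
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)

# Completeness = non-missing values / total rows, computed per column
completeness_score = df.notnull().sum() / len(df)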

print("\nCompleteness Score for each column:")
print(completeness_score)


Dataset:
      Name   Age              Email
0    Alice  25.0  alice@example.com
1      Bob  30.0    bob@example.com
2  Charlie   NaN               None
3    David  22.0  david@example.com
4      Eve  29.0    eve@example.com

Missing values per column:
Name     0
Age      1
Email    1
dtype: int64

Completeness Score for each column:
Name     1.0
Age      0.8
Email    0.8
dtype: float64
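The objective asks for a percentage for the dataset, while the cell above reports a ratio per column. A minimal sketch of one way to get both, assuming the same sample data and assuming every cell carries equal weight in the overall score (the names `column_completeness` and `overall_completeness` are illustrative, not part of the original exercise):

```python
import pandas as pd

# Same illustrative customer data as in the cell above.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, None, 22, 29],
    'Email': ['alice@example.com', 'bob@example.com', None,
              'david@example.com', 'eve@example.com'],
})

# Per-column completeness as a percentage (mean of a boolean mask = share of True).
column_completeness = df.notna().mean() * 100

# Overall completeness: non-missing cells divided by all cells in the frame.
# Assumption: every cell counts equally, regardless of column.
non_missing_cells = int(df.notna().sum().sum())
overall_completeness = non_missing_cells / df.size * 100

print(column_completeness.round(2))
print(f"\nOverall completeness: {overall_completeness:.2f}%")
```

With this data, 13 of the 15 cells are populated, so the overall score is about 86.67%. If some columns matter more than others, a weighted average of the per-column scores may be a better fit than the cell-level ratio.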


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - Example: an existing dataset with sales information.

In [None]:
import pandas as pd

main_data = {
    'product_id': [101, 102, 103, 104, 105],
    'sales_amount': [150, 200, 250, 180, 220]
}

reference_data = {
    'product_id': [101, 102, 103, 104, 105],
    'sales_amount': [150, 205, 250, 180, 220]  # product 102 differs from main_data (205 vs 200)
}

main_df = pd.DataFrame(main_data)
reference_df = pd.DataFrame(reference_data)

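# Inner join on product_id: only products present in both datasets are compared
merged_df = pd.merge(main_df, reference_df, on='product_id', suffixes=('_main', '_reference'))

# Flag rows where the main value equals the reference value
merged_df['match'] = merged_df['sales_amount_main'] == merged_df['sales_amount_reference']

# Accuracy = matching rows / compared rows, expressed as a percentage
accuracy_percentage = merged_df['match'].sum() / len(merged_df) * 100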

print("\nMerged DataFrame with Match Status:")
print(merged_df)

print(f"\nAccuracy Score: {accuracy_percentage:.2f}%")




Merged DataFrame with Match Status:
   product_id  sales_amount_main  sales_amount_reference  match
0         101                150                     150   True
1         102                200                     205  False
2         103                250                     250   True
3         104                180                     180   True
4         105                220                     220   True

Accuracy Score: 80.00%
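The default inner join silently drops any product that is missing from the reference dataset, which can inflate the score. A hedged sketch of one alternative, assuming that a main record with no reference counterpart should count as a mismatch; the reference data below is hypothetical (product 105 removed, product 106 added) to show the effect:

```python
import pandas as pd

main_df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 105],
    'sales_amount': [150, 200, 250, 180, 220],
})

# Hypothetical reference data: product 105 is absent and product 106 is extra.
reference_df = pd.DataFrame({
    'product_id': [101, 102, 103, 104, 106],
    'sales_amount': [150, 205, 250, 180, 300],
})

# A left join keeps every main record; records without a reference get NaN.
# indicator=True adds a '_merge' column showing where each row came from.
merged_df = main_df.merge(
    reference_df, on='product_id', how='left',
    suffixes=('_main', '_reference'), indicator=True,
)

# NaN never compares equal, so unverifiable records are counted as mismatches.
merged_df['match'] = (
    merged_df['sales_amount_main'] == merged_df['sales_amount_reference']
)

accuracy_percentage = merged_df['match'].mean() * 100
unverified_count = int((merged_df['_merge'] == 'left_only').sum())

print(merged_df)
print(f"\nAccuracy Score: {accuracy_percentage:.2f}%")
print(f"Records with no reference value: {unverified_count}")
```

Here three of the five main records match, one differs, and one cannot be verified. Whether unverifiable records should lower the accuracy score or be reported separately is a design choice that depends on how the score will be used.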


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.

In [None]:
# Write your code from here
import pandas as pd
import re

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'phone_number': ['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123', 'invalid_number']
}

df = pd.DataFrame(data)

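def is_valid_phone_number(phone_number):
    # Expected format: NNN-NNN-NNNN. Non-string values (e.g. None or NaN) count as invalid.
    pattern = r'\d{3}-\d{3}-\d{4}'
    return isinstance(phone_number, str) and re.fullmatch(pattern, phone_number) is not None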

df['valid_phone'] = df['phone_number'].apply(is_valid_phone_number)

valid_count = df['valid_phone'].sum()
total_count = len(df)
consistency_score = (valid_count / total_count) * 100

print("Dataset with Valid Phone Check:")
print(df)

print(f"\nConsistency Score: {consistency_score:.2f}%")




Dataset with Valid Phone Check:
      name    phone_number  valid_phone
0    Alice    123-456-7890         True
1      Bob    234-567-8901         True
2  Charlie    345-678-9012         True
3    David    456-789-0123         True
4      Eve  invalid_number        False

Consistency Score: 80.00%
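The same rule-based check can also be written with pandas string methods instead of `apply`. The sketch below is one possible variant, assuming missing phone numbers should be treated as inconsistent; the extra row for `Frank` with a missing number is hypothetical and added only to show that case:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'phone_number': ['123-456-7890', '234-567-8901', '345-678-9012',
                     '456-789-0123', 'invalid_number', None],  # Frank's row is hypothetical
})

# Expected format: NNN-NNN-NNNN
PHONE_PATTERN = r'\d{3}-\d{3}-\d{4}'

# str.fullmatch requires the entire string to match the pattern;
# na=False marks missing entries as invalid instead of propagating NaN.
df['valid_phone'] = (
    df['phone_number'].str.fullmatch(PHONE_PATTERN, na=False).astype(bool)
)

# Mean of a boolean column = share of valid entries.
consistency_score = df['valid_phone'].mean() * 100

print(df)
print(f"\nConsistency Score: {consistency_score:.2f}%")
```

Four of the six entries follow the expected format, so the score is about 66.67%. If missing values are already penalized by the completeness score, it may be preferable to exclude them here, for example by dropping them before computing the mean.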
