# Measuring Completeness

**Activity Overview**: Evaluate data completeness by checking missing data rates and handling partially available records.

## Title: Customer Profiles

**Task**: Calculate the missing data rate for customer profiles.

**Steps**:
1. List all required fields for a complete customer profile (e.g., name, address, email, phone number).
2. Analyze the dataset to count how many profiles have missing fields.
3. Calculate the percentage of missing data fields across all profiles.
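
Step 3 can be written as a formula. The symbols are notation introduced here for illustration only: $N$ is the number of profiles, $F$ the number of required fields, and $M$ the total number of missing required values.

$$
\text{missing rate} = \frac{M}{N \times F} \times 100\%
$$

For instance, with $N = 4$ profiles, $F = 4$ required fields, and $M = 5$ missing values, the rate is $5 / 16 \times 100\% = 31.25\%$.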

```python
import pandas as pd


def calculate_missing_rate(df, required_fields):
    """Report how many required values are missing across all profiles.

    Raises KeyError if any required field is not a column of df.
    The input DataFrame is not modified.
    """
    # Step 1: Check that all required fields exist in the dataset
    missing_columns = [field for field in required_fields if field not in df.columns]
    if missing_columns:
        raise KeyError(f"Missing required fields: {missing_columns}")

    # Step 2: (Optional) Warn if a field's data type looks incorrect
    # (e.g., numeric instead of string)
    for field in required_fields:
        dtype = df[field].dtype
        if not (pd.api.types.is_object_dtype(dtype) or pd.api.types.is_string_dtype(dtype)):
            print(f"⚠️ Warning: Field '{field}' has unexpected data type: {dtype}")

    # Step 3: Count missing values per field and overall
    missing_per_field = df[required_fields].isna().sum()
    total_required_values = len(df) * len(required_fields)
    total_missing_values = int(missing_per_field.sum())
    # Guard against an empty dataset or an empty field list (0 / 0)
    missing_rate = (
        float(total_missing_values / total_required_values * 100)
        if total_required_values
        else 0.0
    )

    # Step 4: Output results
    print("\n📊 Customer Profile Completeness Report")
    print(f"- Total profiles: {len(df)}")
    print(f"- Total required fields: {total_required_values}")
    print(f"- Total missing values: {total_missing_values}")
    print(f"- Overall missing rate: {missing_rate:.2f}%")

    print("\n🔍 Missing fields per column:")
    print(missing_per_field)

    return {
        "total_profiles": len(df),
        "total_fields": total_required_values,
        "total_missing": total_missing_values,
        "missing_percentage": missing_rate,
        "missing_per_field": missing_per_field,
    }


# === Example Usage with Simulated Data ===

# Sample customer profile dataset
customer_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', None, 'Dana'],
    'address': ['123 Main St', None, '456 Oak Ave', '789 Pine Rd'],
    'email': ['alice@example.com', None, 'carol@example.com', 'dana@example.com'],
    'phone_number': ['111-222-3333', '222-333-4444', None, None]
})

# Define the required fields for completeness
required_fields = ['name', 'address', 'email', 'phone_number']

# Call the function (in a notebook, the returned dict is displayed after the report)
calculate_missing_rate(customer_df, required_fields)
```


```
📊 Customer Profile Completeness Report
- Total profiles: 4
- Total required fields: 16
- Total missing values: 5
- Overall missing rate: 31.25%

🔍 Missing fields per column:
name            1
address         1
email           1
phone_number    2
dtype: int64

{'total_profiles': 4,
 'total_fields': 16,
 'total_missing': 5,
 'missing_percentage': 31.25,
 'missing_per_field': name            1
 address         1
 email           1
 phone_number    2
 dtype: int64}
```
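
The report above measures missing values at the field level. Step 2 of the task and the overview's mention of partially available records are about whole profiles, so a profile-level view may also be useful. The following is a minimal sketch, not part of the original exercise: the function name `profile_completeness` and the labels `complete`, `partial`, and `empty` are hypothetical choices made for illustration. It reuses the sample data from the example.

```python
import pandas as pd


def profile_completeness(df, required_fields):
    """Classify each profile by how many required fields it is missing.

    Hypothetical helper for illustration; the input DataFrame is not modified.
    """
    n_required = len(required_fields)

    # Per-row count of missing required fields
    missing_count = df[required_fields].isna().sum(axis=1)

    # Label each record: all fields present, some present, or none present
    def label(n_missing):
        if n_missing == 0:
            return "complete"
        if n_missing == n_required:
            return "empty"
        return "partial"

    status = missing_count.map(label)

    # Number and share of profiles missing at least one required field
    incomplete = int((missing_count > 0).sum())
    incomplete_rate = incomplete / len(df) * 100 if len(df) else 0.0

    return {
        "missing_count": missing_count,
        "status": status,
        "incomplete_profiles": incomplete,
        "incomplete_rate": incomplete_rate,
    }


# Same sample data as the example above
customer_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', None, 'Dana'],
    'address': ['123 Main St', None, '456 Oak Ave', '789 Pine Rd'],
    'email': ['alice@example.com', None, 'carol@example.com', 'dana@example.com'],
    'phone_number': ['111-222-3333', '222-333-4444', None, None]
})
required_fields = ['name', 'address', 'email', 'phone_number']

summary = profile_completeness(customer_df, required_fields)
print(f"Incomplete profiles: {summary['incomplete_profiles']} of {len(customer_df)}")
print(summary["status"].value_counts())
```

On the sample data, 3 of the 4 profiles (75%) are missing at least one required field, and all three are partial rather than empty. Whether partial records should be kept, flagged for follow-up, or excluded depends on the use case; the sketch only separates them so that decision can be made explicitly.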