### Data Quality Check for dim_customer


- This notebook performs structural, integrity, and consistency checks on dim_customer before it is consumed in Power BI reporting.
- The Power BI dataset connects directly to gdb041.dim_customer.
- Each row of this dataset represents the details for a specific customer_code.

### Database configuration

This notebook reads database credentials from environment variables.
Credentials are **not stored in the notebook or repository**.

Required variables:

- DB_USER
- DB_PASSWORD
- DB_HOST
- DB_NAME

For local development:

- Create a `.env` file (see `.env.example`)
- Load variables using `python-dotenv`
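
If the password contains characters that are reserved in URLs (such as `@`, `/`, or `:`), a connection string assembled by plain string interpolation can be parsed incorrectly. The following is a minimal sketch of one way to guard against that: percent-encode the credentials with the standard library before building the URL. The helper name `build_db_url` and the sample values are hypothetical; the `mysql+mysqlconnector` prefix matches the one used later in this notebook.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a SQLAlchemy-style MySQL URL with percent-encoded credentials.

    `build_db_url` is a hypothetical helper, not part of the notebook.
    """
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    return (
        f"mysql+mysqlconnector://{user}:{password}"
        f"@{env['DB_HOST']}/{env['DB_NAME']}"
    )


# Illustrative values only; real credentials come from the environment.
sample_env = {
    "DB_USER": "report_user",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_NAME": "gdb041",
}
print(build_db_url(sample_env))
```

The returned string could be passed to `create_engine` in place of an interpolated f-string.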

### 1. Import Necessary Libraries

In [1]:
# Core libraries for data manipulation and numerical operations
import numpy as np
import pandas as pd

# SQLAlchemy engine factory used for the database connection
from sqlalchemy import create_engine

### 2. Load Data

In [2]:
# Import libraries
from dotenv import load_dotenv
import os
import sys

# Load environment variables
load_dotenv()

# Required environment variables
REQUIRED_ENV_VARS = ["DB_USER", "DB_PASSWORD", "DB_HOST", "DB_NAME"]

# Validate environment variables (Fail Fast)
missing_vars = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)]

if missing_vars:
    sys.exit(
        f"""
        ❌ Missing required environment variables: {', '.join(missing_vars)}

        Please set them in your .env file or system environment before running the script.
        Example:
            DB_USER=your_username
            DB_PASSWORD=your_password
            DB_HOST=localhost
            DB_NAME=your_database
        """
    )

# Create database engine
engine = create_engine(
    f"mysql+mysqlconnector://{os.getenv('DB_USER')}:"
    f"{os.getenv('DB_PASSWORD')}@"
    f"{os.getenv('DB_HOST')}/"
    f"{os.getenv('DB_NAME')}"
)

# Define query
query = """
SELECT 
    customer_code,
    customer,
    market,
    platform,
    channel
FROM gdb041.dim_customer
"""

try:
    df_customer = pd.read_sql_query(query, engine)
    print("✅ Data loaded successfully.")

except Exception as e:
    # Chain the original exception so the underlying driver error stays in the traceback
    raise RuntimeError(f"❌ Failed to load data from dim_customer: {e}") from e

# Preview
df_customer.head()

✅ Data loaded successfully.


|   | customer_code | customer | market | platform | channel |
|---|---|---|---|---|---|
| 0 | 90002012 | Electricalsocity | India | Brick & Mortar | Retailer |
| 1 | 90002013 | Electricalslytical | India | Brick & Mortar | Retailer |
| 2 | 90002010 | Ebay | India | E-Commerce | Retailer |
| 3 | 90002011 | Atliq Exclusive | India | Brick & Mortar | Retailer |
| 4 | 90002014 | Expression | India | Brick & Mortar | Retailer |


### 3. Initial Data Quality Check

In [3]:
def initial_report(df):
    print(" *** initial report ***\n" + "-"*40)

    print(f"*** Structure:\n- Total Rows: {df.shape[0]}\n- Total Columns: {df.shape[1]}")
    print(f"- Column Names: {list(df.columns)}\n")

    
    print(" *** Data Types:")
    for col, dtype in df.dtypes.items():
        print(f"  {col}: {dtype}")
    print()

    print(" *** Mixed Data Types:")
    has_mixed_types = False
    for col in df.columns:
        try:
            type_counts = df[col].apply(type).value_counts()
            if len(type_counts) > 1:
                has_mixed_types = True
                print(f"  {col}:")
                for t, count in type_counts.items():
                    print(f"    - {t.__name__}: {count}")
        except Exception as e:
            print(f"  {col}: Error checking types - {e}")

    if not has_mixed_types:
        print("  No mixed data types found")
    print()

    print("*** Distinct Values per Column:")
    for col in df.columns:
        print(f"  {col}: {df[col].nunique()}")
    print()

    print("*** Null Values and Percentages:")
    has_null_value=False
    nulls = df.isnull().sum()
    for col in df.columns:
        pct_missing = np.mean(df[col].isnull())
        if nulls[col] > 0: # Only print if there are missing values
            has_null_value=True
            print(f"  {col}: Missing Values: {nulls[col]}, Pct: {round(pct_missing * 100, 3)}%")
    if not has_null_value:
        print("  No null values found")
    print()

    
    print(f"\n*** Duplicates: {df.duplicated().sum()}")

    print("*** Negative or Zero Values:")
    has_issues = False
    for col in df.select_dtypes(include='number').columns:
        zero_count = (df[col] == 0).sum()
        negative_count = (df[col] < 0).sum()
    
        if zero_count > 0 or negative_count > 0:
            has_issues = True
            print(f"  {col}:")
            if zero_count > 0:
                print(f"    - Zero values: {zero_count}")
            if negative_count > 0:
                print(f"    - Negative values: {negative_count}")

    if not has_issues:
        print("  No negative or zero values found")
    print()

initial_report(df_customer)

 *** initial report ***
----------------------------------------
*** Structure:
- Total Rows: 209
- Total Columns: 5
- Column Names: ['customer_code', 'customer', 'market', 'platform', 'channel']

 *** Data Types:
  customer_code: object
  customer: object
  market: object
  platform: object
  channel: object

 *** Mixed Data Types:
  No mixed data types found

*** Distinct Values per Column:
  customer_code: 209
  customer: 74
  market: 27
  platform: 2
  channel: 3

*** Null Values and Percentages:
  No null values found


*** Duplicates: 0
*** Negative or Zero Values:
  No negative or zero values found



### 4. Check which customers are present in the dataset

In [4]:
# Count rows per customer name to review the distinct values and spot potential data quality issues.
df_customer["customer"].value_counts(dropna=False).sort_index()

customer
Acclaimed Stores        2
All-Out                 1
Amazon                 29
Argos (Sainsbury's)     3
Atlas Stores            2
                       ..
Unity Stores            1
Vijay Sales             1
Viveks                  1
Zone                    2
walmart                 2
Name: count, Length: 74, dtype: int64

### 4.1 Identify issues in customer data (whitespace, similarity matching, case variation, typo detection)

In [5]:
# Identify anomalies (whitespace, near-duplicate spellings, case variations) in the customer column
from difflib import SequenceMatcher

def find_categorical_anomalies(df, column_name):
    """
    Identify potential duplicates/typos in any categorical column.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        The DataFrame containing the data
    column_name : str
        Name of the categorical column to analyze
    """
    values = df[column_name].dropna().unique()
    
    print(f"Analyzing column: '{column_name}'")
    print(f"Total unique values: {len(values)}\n")
    
    # 1. Check for leading/trailing whitespace
    print("="*60)
    print("1. WHITESPACE ISSUES")
    print("="*60)
    whitespace_issues = []
    for val in values:
        if val != val.strip():
            whitespace_issues.append(val)
    
    if whitespace_issues:
        print(f"Found {len(whitespace_issues)} values with whitespace:")
        for val in whitespace_issues:
            print(f"  '{val}' -> '{val.strip()}'")
    else:
        print("No whitespace issues found.")
    
    # 2. Check for very similar names (potential typos)
    print("\n" + "="*60)
    print("2. SIMILAR VALUES (Potential Typos)")
    print("="*60)
    
    similar_pairs = []
    values_list = list(values)
    
    for i in range(len(values_list)):
        for j in range(i + 1, len(values_list)):
            val1 = values_list[i].strip().lower()
            val2 = values_list[j].strip().lower()
            
            # Calculate similarity ratio
            similarity = SequenceMatcher(None, val1, val2).ratio()
            
            # Flag the pair if its similarity exceeds the 0.9 threshold
            if similarity > 0.9:
                similar_pairs.append({
                    'value1': values_list[i],
                    'value2': values_list[j],
                    'similarity': round(similarity, 3)
                })
    
    if similar_pairs:
        for pair in sorted(similar_pairs, key=lambda x: x['similarity'], reverse=True):
            print(f"  {pair['similarity']:.1%} similar:")
            print(f"    - '{pair['value1']}'")
            print(f"    - '{pair['value2']}'")
            print()
    else:
        print("No highly similar values found.")
    
    # 3. Check for case variations
    print("="*60)
    print("3. CASE VARIATIONS")
    print("="*60)
    
    case_variations = {}
    for val in values:
        normalized = val.strip().lower()
        if normalized not in case_variations:
            case_variations[normalized] = []
        case_variations[normalized].append(val)
    
    duplicates = {k: v for k, v in case_variations.items() if len(v) > 1}
    
    if duplicates:
        print(f"Found {len(duplicates)} values with case variations:")
        for key, variants in duplicates.items():
            print(f"\n  '{key}' has {len(variants)} variations:")
            for v in variants:
                print(f"    - '{v}'")
    else:
        print("No case variations found.")
    
    return {
        'whitespace_issues': whitespace_issues,
        'similar_pairs': similar_pairs,
        'case_variations': duplicates
    }

# Usage
anomalies = find_categorical_anomalies(df_customer, 'customer')

Analyzing column: 'customer'
Total unique values: 74

1. WHITESPACE ISSUES
No whitespace issues found.

2. SIMILAR VALUES (Potential Typos)
  90.5% similar:
    - 'Electricalsara Stores'
    - 'Electricalsbea Stores'

3. CASE VARIATIONS
No case variations found.
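
To make the 90.5% figure above easier to interpret: `SequenceMatcher.ratio()` returns 2·M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes the score for the flagged pair after the same `strip().lower()` normalization used in `find_categorical_anomalies`; the variable names are illustrative.

```python
from difflib import SequenceMatcher

a = "Electricalsara Stores".strip().lower()
b = "Electricalsbea Stores".strip().lower()

matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks (the final block is a zero-length sentinel).
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(a) + len(b)

ratio = matcher.ratio()
print(f"matched={matched}, total={total}, ratio={ratio:.3f}")
```

The two names share the prefix `electricals` and the suffix `a stores` and differ only in the two characters in between, so the pair sits just above the 0.9 threshold. Whether these are two distinct customers or one misspelled name cannot be decided from the score alone.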


### 5. Check which markets are present in the dataset

In [6]:
df_customer['market'].value_counts(dropna=False).sort_index()

market
Australia          7
Austria            8
Bangladesh         5
Brazil             2
Canada            11
Chile              2
China              3
Columbia           1
France            10
Germany           11
India             18
Indonesia          4
Italy             11
Japan             10
Mexico             2
Netherlands        9
Newzealand         8
Norway             9
Pakistan           5
Philiphines        6
Poland             8
Portugal          12
South Korea        5
Spain             11
Sweden             5
USA               15
United Kingdom    11
Name: count, dtype: int64
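
Some market values in the output above ("Columbia", "Newzealand", "Philiphines") look like nonstandard spellings of country names. Whether they should be corrected depends on how other tables spell the same markets. As a hedged sketch, observed values could be compared against a reference list with `difflib.get_close_matches`; the function `suggest_canonical` and the reference list below are hypothetical and would need to be replaced by an agreed list of market names.

```python
from difflib import get_close_matches


def suggest_canonical(values, reference, cutoff=0.8):
    """Map each value missing from `reference` to its closest reference entry.

    Comparison is case-insensitive; values with no close match map to None.
    """
    ref_by_key = {r.lower(): r for r in reference}
    suggestions = {}
    for value in values:
        key = value.strip().lower()
        if key in ref_by_key:
            continue  # already a known spelling
        match = get_close_matches(key, list(ref_by_key), n=1, cutoff=cutoff)
        suggestions[value] = ref_by_key[match[0]] if match else None
    return suggestions


# Hypothetical reference list; not taken from the database.
reference_markets = ["Colombia", "New Zealand", "Philippines", "India"]
observed_markets = ["Columbia", "Newzealand", "Philiphines", "India"]

suggestions = suggest_canonical(observed_markets, reference_markets)
print(suggestions)
```

Against the loaded data, `df_customer['market'].unique()` could be passed as `values`.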

### 6. Check which platforms are present in the dataset

In [7]:
df_customer['platform'].value_counts(dropna=False).sort_index()

platform
Brick & Mortar    150
E-Commerce         59
Name: count, dtype: int64

### 7. Check which channels are present in the dataset

In [8]:
df_customer['channel'].value_counts(dropna=False).sort_index()

channel
Direct          40
Distributor      5
Retailer       164
Name: count, dtype: int64

### 8. Check grain and functional dependency

In [9]:
# 1. Duplicate key check
dupe_keys = (
    df_customer
        .groupby('customer_code')
        .size()
        .reset_index(name='row_count')
        .query('row_count > 1')
)

# 2. Functional dependency check (customer_code → other columns)
nunique_per_customer = (
    df_customer
        .groupby('customer_code')
        .nunique()
)

fd_violations_mask = nunique_per_customer.gt(1).any(axis=1)
fd_violations = nunique_per_customer[fd_violations_mask]

# 3. Identify which columns violate FD for each customer
violation_columns = (
    nunique_per_customer
        .gt(1)
        .replace(False, pd.NA)
        .dropna(how='all')
)

# 4. Combine everything into one report
report = (
    dupe_keys
        .merge(fd_violations, on='customer_code', how='outer', suffixes=('_dupe', '_fd'))
        .merge(violation_columns, on='customer_code', how='left', suffixes=('', '_violates'))
)

report

|   | customer_code | row_count | customer | market | platform | channel | customer_violates | market_violates | platform_violates | channel_violates |
|---|---|---|---|---|---|---|---|---|---|---|


### Notes:
The report is empty: there are no duplicate keys at the grain level, and all columns are functionally dependent on customer_code. The grain of the table is customer_code.
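
As a cross-check on this conclusion, the dependency of the other columns on customer_code can also be tested without pandas by collecting the distinct values seen per key. The following is a minimal sketch; `find_fd_violations` and the sample rows are hypothetical and are not taken from dim_customer.

```python
from collections import defaultdict


def find_fd_violations(rows, key, dependents):
    """Return {key_value: {column: sorted distinct values}} for every key that
    maps to more than one value in any dependent column."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for col in dependents:
            seen[row[key]][col].add(row[col])

    violations = {}
    for key_value, cols in seen.items():
        bad = {c: sorted(v) for c, v in cols.items() if len(v) > 1}
        if bad:
            violations[key_value] = bad
    return violations


# Hypothetical rows: code "A1" maps to two different markets, "B2" is consistent.
sample_rows = [
    {"customer_code": "A1", "customer": "Shop X", "market": "India"},
    {"customer_code": "A1", "customer": "Shop X", "market": "Japan"},
    {"customer_code": "B2", "customer": "Shop Y", "market": "India"},
]
violations = find_fd_violations(sample_rows, "customer_code", ["customer", "market"])
print(violations)
```

Against the loaded data, the rows could be supplied as `df_customer.to_dict('records')`; an empty result would agree with the empty report above.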

### 9. Check whether a single customer appears in multiple markets

In [10]:
df_customer.groupby('customer')['market'].nunique()

customer
Acclaimed Stores        2
All-Out                 1
Amazon                 25
Argos (Sainsbury's)     3
Atlas Stores            2
                       ..
Unity Stores            1
Vijay Sales             1
Viveks                  1
Zone                    2
walmart                 2
Name: market, Length: 74, dtype: int64

### Notes:
Yes. A single customer can appear in multiple markets; for example, Amazon appears in 25 markets and Acclaimed Stores in 2.

### 10. Check whether a single customer appears on multiple platforms

In [11]:
df_customer.groupby('customer')['platform'].nunique().sort_values(ascending=False)

customer
Acclaimed Stores       1
All-Out                1
Amazon                 1
Argos (Sainsbury's)    1
Atlas Stores           1
                      ..
Unity Stores           1
Vijay Sales            1
Viveks                 1
Zone                   1
walmart                1
Name: platform, Length: 74, dtype: int64

### Notes:
No. The maximum number of distinct platforms per customer is 1, so each customer exists on exactly one platform.

### 11. Check inconsistencies in customer_code

In [12]:
cc_code_anomalies = find_categorical_anomalies(df_customer, 'customer_code')

Analyzing column: 'customer_code'
Total unique values: 209

1. WHITESPACE ISSUES
No whitespace issues found.

2. SIMILAR VALUES (Potential Typos)
No highly similar values found.
3. CASE VARIATIONS
No case variations found.
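
The similarity pass is of limited use for short fixed-length codes. If every code has 8 characters, as in the preview rows, two distinct codes can share at most 7 matched characters and therefore score at most 2·7/16 = 0.875, which is below the 0.9 threshold, so no pair can be flagged. A format check may be more informative here. The sketch below assumes that a valid code consists of exactly 8 digits; this is inferred only from the preview rows and is not a documented rule. `find_malformed_codes` and the sample values are hypothetical.

```python
import re

# Assumption: a valid customer_code is exactly 8 ASCII digits (inferred from the preview rows).
CODE_PATTERN = re.compile(r"[0-9]{8}")


def find_malformed_codes(codes):
    """Return the codes that do not fully match CODE_PATTERN, in input order."""
    return [c for c in codes if not CODE_PATTERN.fullmatch(str(c))]


# Hypothetical sample: the last three values are deliberately malformed.
sample_codes = ["90002012", "90002013", "9000201", "9000201A", " 90002014"]
malformed = find_malformed_codes(sample_codes)
print(malformed)
```

Calling `find_malformed_codes(df_customer['customer_code'])` would apply the same check to the loaded column.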
