# Generate Customer Core Fields Sample Data

## Overview
This notebook generates complete sample data for the Customer table. It reads unique customer names from a CSV file and generates every other required field from rule-based distributions, keeping dependent fields (customer type, relationship tier, dates, trade names) consistent with each other.

## Input File
- **File**: `customer_names_unique_513.csv`
- **Fields**: FirstName, LastName, Gender
- **Records**: 513 unique customer names

## Intelligent Business Rules & Distributions

### Core Fields (from CSV)
- **CustomerId**: Sequential CID-001 to CID-513
- **FirstName, LastName, Gender**: From input CSV file

### Business Classification (Intelligent Assignment)
- **CustomerTypeId**: 70% Individual, 20% Business, 10% Government
- **CustomerRelationshipTypeId**: **Intelligently assigned based on CustomerTypeId**
  - **Individual Customers**: 45% Standard, 40% Premium, 15% VIP
  - **Business Customers**: 60% SMB, 30% Premier, 10% Partner
  - **Government Customers**: 50% Local, 35% State, 15% Federal
- **IsActive**: **Tier-based retention rates**
  - **Top Tiers** (VIP, Premier, Partner, Federal): 98% active
  - **Mid Tiers** (Premium, State): 96% active
  - **Standard Tiers** (Standard, SMB, Local): 94% active
- **CreatedBy**: **Channel-based assignment**
  - **Individual/Business**: 90% Sales, 10% Services
  - **Government**: 10% Sales, 90% Services (procurement-focused)
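
The distributions above imply overall rates that can be compared with the generated data. The sketch below uses only the percentages listed in this section to compute the expected share of active customers and of Sales-created customers. The variable names are illustrative and do not appear in the notebook.

```python
# Expected overall rates implied by the documented distributions.
# All names are illustrative; only the percentages come from the rules above.
type_share = {"Individual": 0.70, "Business": 0.20, "Government": 0.10}

tier_share = {
    "Individual": {"Standard": 0.45, "Premium": 0.40, "VIP": 0.15},
    "Business": {"SMB": 0.60, "Premier": 0.30, "Partner": 0.10},
    "Government": {"Local": 0.50, "State": 0.35, "Federal": 0.15},
}

active_rate = {
    "VIP": 0.98, "Premier": 0.98, "Partner": 0.98, "Federal": 0.98,
    "Premium": 0.96, "State": 0.96,
    "Standard": 0.94, "SMB": 0.94, "Local": 0.94,
}

sales_share = {"Individual": 0.90, "Business": 0.90, "Government": 0.10}

# Weight each tier's retention rate by how often that tier occurs overall.
expected_active = sum(
    type_share[ctype] * share * active_rate[tier]
    for ctype, tiers in tier_share.items()
    for tier, share in tiers.items()
)
# Weight each customer type's Sales share by how often that type occurs.
expected_sales = sum(type_share[c] * sales_share[c] for c in type_share)

print(f"Expected active share: {expected_active:.1%}")
print(f"Expected Sales share:  {expected_sales:.1%}")
```

This works out to roughly 95.4% active and 82.0% Sales-created. A 513-row sample will land near those figures rather than exactly on them.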

### Date Fields (Intelligent Age-Aware Generation)
- **DateOfBirth**: Age distribution as of Dec 31, 2021
  - 90% between ages 23-75
  - 10% between ages 76-100
  - Valid dates with proper leap year handling
- **CustomerEstablishedDate**: **Intelligently generated based on DateOfBirth and CustomerTypeId**
  - **Individual Customers**: Established at ages 19-70 (realistic adult earning years)
  - **Business Customers**: Established at ages 25-65 (decision-maker ages)
  - **Government Customers**: Established at ages 28-60 (mid-level and senior government roles)
  - **Constraint**: Must fall within Jan 1, 2018 to Dec 31, 2021 date range
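  - **Result**: Customers whose age window overlaps 2018-2021 get an establishment date at a plausible age. When the window does not overlap (for example, a 100-year-old Individual), the code falls back to a random date in the full 2018-2021 range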
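
To make the window logic concrete, the sketch below intersects a customer's age-based window with the 2018-2021 range. It is an illustration under stated assumptions, not the notebook's code: it shifts dates by calendar years instead of the notebook's 365-day approximation, and the helper names `add_years` and `establishment_window` are hypothetical.

```python
from datetime import date

RANGE_START = date(2018, 1, 1)
RANGE_END = date(2021, 12, 31)
AGE_RANGES = {"Individual": (19, 70), "Business": (25, 65), "Government": (28, 60)}


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def establishment_window(birth_date, customer_type):
    """Return (start, end) of the allowed window, or None if it is empty."""
    min_age, max_age = AGE_RANGES[customer_type]
    start = max(add_years(birth_date, min_age), RANGE_START)
    end = min(add_years(birth_date, max_age), RANGE_END)
    return (start, end) if start <= end else None


# A Government contact born in mid-1992 turns 28 in mid-2020.
print(establishment_window(date(1992, 6, 15), "Government"))
# An Individual born in 1930 passed age 70 long before 2018.
print(establishment_window(date(1930, 3, 1), "Individual"))
```

The first call yields a window from 2020-06-15 to 2021-12-31. The second returns `None`, which corresponds to the case where the notebook falls back to the full 2018-2021 range.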

### Contact Information
- **PrimaryPhone**: 100% of customers have US phone numbers
  - Format: (XXX) 555-01XX (fake but realistic)
  - Uses valid North American area codes
  - Values: in the United States, the phone numbers 555-0100 through 555-0199 may be used with any North American area code (except 800, 866, 877, 888, or 900). International phone numbers should use sequential numbers.
- **SecondaryPhone**: 30% of customers have secondary phone numbers
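  - Format: (XXX) 555-02XX (a separate block so primary and secondary numbers never collide; note that this block lies outside the 555-0100 to 555-0199 range described above)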
  - US numbers only to match location data
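
The phone rules above can be checked mechanically. The following is a minimal sketch of such a check. The function name `is_valid_sample_phone` and the constant names are assumptions made for illustration and are not part of the notebook.

```python
import re

# Pattern for the documented formats: (XXX) 555-01XX or (XXX) 555-02XX.
PHONE_PATTERN = re.compile(r"^\((\d{3})\) 555-0([12])(\d{2})$")
# Area codes the guideline above says must not be used.
EXCLUDED_AREA_CODES = {"800", "866", "877", "888", "900"}


def is_valid_sample_phone(phone, secondary=False):
    """Check the format, the expected 01/02 block, and the excluded area codes."""
    match = PHONE_PATTERN.match(phone)
    if match is None:
        return False
    area_code, block, _ = match.groups()
    expected_block = "2" if secondary else "1"
    return block == expected_block and area_code not in EXCLUDED_AREA_CODES


print(is_valid_sample_phone("(206) 555-0142"))                  # primary format
print(is_valid_sample_phone("(800) 555-0142"))                  # excluded area code
print(is_valid_sample_phone("(206) 555-0217", secondary=True))  # secondary format
```

The three calls print `True`, `False`, and `True`. The check does not confirm that an area code is actually assigned; it only enforces the shape and the exclusions stated above.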

### Email Information
- **PrimaryEmail**: 100% of customers have email addresses
  - Format: firstname@contoso.com (using FirstName from CSV)
  - Uses pre-approved fictitious domain contoso.com
  - Compliant with official email guidelines
- **SecondaryEmail**: 15% of customers have secondary email addresses
  - Format: firstname@fabrikam.com (using FirstName from CSV)
  - Uses pre-approved fictitious domain fabrikam.com
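
Because each address is built from the first name alone, two customers who share a first name receive the same address. The sketch below shows the format and one way to count such collisions. The helper `primary_email` and the sample names are illustrative only.

```python
from collections import Counter


def primary_email(first_name):
    """Build the documented address: lowercase first name at contoso.com."""
    return f"{first_name.lower()}@contoso.com"


first_names = ["Ana", "Ben", "ana"]  # illustrative values, not from the input CSV
emails = [primary_email(name) for name in first_names]

# Keep only the addresses that occur more than once.
duplicates = {email: n for email, n in Counter(emails).items() if n > 1}
print(duplicates)
```

Here the two spellings of the same first name collapse to one address, so `duplicates` reports `ana@contoso.com` twice. If unique addresses matter for a demo, a check like this shows how many rows are affected.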

### CustomerTradeName Rules
- **Coverage**: Only Business and Government customers get trade names
- **Individual customers**: No trade name records (realistic business logic)
- **TradeNameId**: Sequential TN-001, TN-002, etc.
- **Business Trade Names**: Compliance-approved names only
  - Contoso Corp, Contoso Industries, Contoso Solutions
  - Fabrikam Inc, Fabrikam Systems, Fabrikam Group
- **Government Trade Names**: Fictional planetary authorities (avoids real government mapping)
  - Planet Mars Authority, Jupiter Department of Commerce
  - Saturn Municipal Services, Venus Regional Office
  - Neptune State Agency, Mercury City Government
- **Integration**: Every trade name record carries the CustomerId of an existing Customer row, so the two tables join cleanly on CustomerId
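
The trade name rules can be verified after generation. The sketch below runs the two checks on small hand-written frames that stand in for the two output files. The frame contents are illustrative; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Tiny hand-written frames standing in for the two output files.
customers = pd.DataFrame({
    "CustomerId": ["CID-001", "CID-002", "CID-003"],
    "CustomerTypeId": ["Individual", "Business", "Government"],
})
trade_names = pd.DataFrame({
    "CustomerId": ["CID-002", "CID-003"],
    "TradeNameId": ["TN-001", "TN-002"],
    "TradeName": ["Contoso Corp", "Venus Regional Office"],
})

# Every trade name row must point at an existing customer.
orphans = trade_names[~trade_names["CustomerId"].isin(customers["CustomerId"])]

# Only Business and Government customers should have a trade name, and each should have one.
has_trade_name_type = customers["CustomerTypeId"].isin(["Business", "Government"])
expected_ids = set(customers.loc[has_trade_name_type, "CustomerId"])
actual_ids = set(trade_names["CustomerId"])

print("Orphan rows:", len(orphans))
print("Coverage matches:", expected_ids == actual_ids)
```

With the generated data, the same checks could be run by loading the two CSV files into `customers` and `trade_names` instead of building them by hand.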

## Data Integrity Rules
✅ **CustomerRelationshipTypeId Validation**: Must match CustomerTypeId category
- Individual customers: Only Standard, Premium, VIP
- Business customers: Only SMB, Premier, Partner  
- Government customers: Only Local, State, Federal
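
A minimal sketch of this validation is shown below. The mapping mirrors the three lists above; the function name `find_tier_mismatches` and the sample rows are hypothetical.

```python
# Allowed relationship tiers per customer type, as listed above.
ALLOWED_TIERS = {
    "Individual": {"Standard", "Premium", "VIP"},
    "Business": {"SMB", "Premier", "Partner"},
    "Government": {"Local", "State", "Federal"},
}


def find_tier_mismatches(rows):
    """Return the (customer_type, tier) pairs that violate the rule."""
    return [
        (ctype, tier)
        for ctype, tier in rows
        if tier not in ALLOWED_TIERS.get(ctype, set())
    ]


# The third row is deliberately wrong: Premium is an Individual tier.
sample_rows = [("Individual", "VIP"), ("Business", "SMB"), ("Government", "Premium")]
print(find_tier_mismatches(sample_rows))
```

Against the generated data, the rows could be supplied as `zip(df_output['CustomerTypeId'], df_output['CustomerRelationshipTypeId'])`, and an empty result means the rule holds.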

✅ **Realistic Business Logic**: 
- Higher relationship tiers have higher retention rates (98% vs. 96% vs. 94%)
- Government customers primarily acquired through Services channel
- Business tier distribution reflects market reality (more SMB than Partners)

✅ **Age-Aware Date Logic**: 
- CustomerEstablishedDate intelligently calculated from DateOfBirth
- Realistic age ranges for customer acquisition by segment
- Reduces unrealistic scenarios such as 100-year-old new customers (the fallback to the full date range can still produce them)

✅ **Referential Integrity**: All generated IDs maintain proper relationships

## Generated Fields
- **Core Fields**: CustomerId (generated sequentially); FirstName, LastName, Gender (from CSV)
- **Business Fields**: CustomerTypeId, IsActive, CustomerRelationshipTypeId, CreatedBy
- **Date Fields**: DateOfBirth, CustomerEstablishedDate
- **Contact Fields**: PrimaryPhone, SecondaryPhone, PrimaryEmail, SecondaryEmail

## Output Files
- **Customer_Samples.csv** (the `OUTPUT_FILE` setting in the code): Contains all Customer table fields ready for database import
- **CustomerTradeName.csv**: Contains trade names for Business and Government customers only
- Fully compliant with retail business requirements
- Perfect for realistic reporting scenarios and demos

## Reporting Scenarios Enabled
- ✅ Customer segmentation analysis by type and relationship tier
- ✅ Channel effectiveness (Sales vs Services by customer type)
- ✅ Retention analysis by customer tier
- ✅ Individual vs Business vs Government customer analysis
- ✅ "Show all Contoso companies and their contacts"
- ✅ "Government entities by planetary jurisdiction"
- ✅ Customer relationship mapping with trade names
- ✅ Complete audit trail for all customer types
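
As an illustration of the "Show all Contoso companies and their contacts" scenario, the sketch below joins trade names to customers on CustomerId. The rows are hand-written stand-ins for the two output files, and the names in them are invented for the example.

```python
import pandas as pd

# Hand-written rows in the shape of the two output files (values are illustrative).
customers = pd.DataFrame({
    "CustomerId": ["CID-001", "CID-002", "CID-003"],
    "FirstName": ["Ana", "Ben", "Cara"],
    "LastName": ["Lee", "Ortiz", "Ng"],
    "PrimaryEmail": ["ana@contoso.com", "ben@contoso.com", "cara@contoso.com"],
})
trade_names = pd.DataFrame({
    "CustomerId": ["CID-002", "CID-003"],
    "TradeName": ["Contoso Corp", "Fabrikam Inc"],
})

# Join trade names to their customer rows, then keep the Contoso companies.
contacts = trade_names.merge(customers, on="CustomerId", how="inner")
contoso_contacts = contacts[contacts["TradeName"].str.startswith("Contoso")]

columns = ["TradeName", "FirstName", "LastName", "PrimaryEmail"]
print(contoso_contacts[columns].to_string(index=False))
```

An inner join is used here because only customers with a trade name are of interest; a left join from `customers` would also list Individual customers with an empty trade name.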

---

In [None]:
# Imports, configuration, input loading, and field generation
import pandas as pd
import numpy as np
import random
import os
from datetime import datetime, date, timedelta

# Set seed for reproducible results
random.seed(42)
np.random.seed(42)

# Configuration
INPUT_FOLDER = "C:\\temp\\samples\\input"
INPUT_FILE = "customer_names_unique_513.csv"
OUTPUT_FOLDER = "C:\\temp\\samples\\output"
OUTPUT_FILE = "Customer_Samples.csv"

# Create output directory
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# Remove existing output file if it exists
output_path = os.path.join(OUTPUT_FOLDER, OUTPUT_FILE)
if os.path.exists(output_path):
    os.remove(output_path)
    print(f"🗑️ Removed existing file: {output_path}")

print(f"🎯 GENERATING COMPLETE CUSTOMER SAMPLE DATA")
print(f"Input: {INPUT_FOLDER}\\{INPUT_FILE}")
print(f"Output: {OUTPUT_FOLDER}\\{OUTPUT_FILE}")
print("="*50)

# Read the input CSV file
try:
    input_path = os.path.join(INPUT_FOLDER, INPUT_FILE)
    print(f"📂 Reading input file: {input_path}")
    
    df_input = pd.read_csv(input_path)
    
    print(f"✅ File read successfully!")
    print(f"📊 Input file shape: {df_input.shape}")
    print(f"📋 Input columns: {list(df_input.columns)}")
    
    # Verify expected columns exist
    expected_columns = ['FirstName', 'LastName', 'Gender']
    missing_columns = [col for col in expected_columns if col not in df_input.columns]
    
    if missing_columns:
        print(f"❌ Missing expected columns: {missing_columns}")
        raise ValueError(f"Missing columns: {missing_columns}")
    
    print(f"✅ All expected columns found")
    
except Exception as e:
    print(f"❌ Error reading input file: {e}")
    print("\n💡 Please ensure the file exists and has the correct columns:")
    print("   - FirstName, LastName, Gender")
    raise

# Field generation functions
def generate_customer_ids(num_records):
    """Generate sequential CustomerId values"""
    return [f"CID-{i+1:03d}" for i in range(num_records)]

def generate_customer_type_id(num_records):
    """Generate CustomerTypeId (70% Individual, 20% Business, 10% Government)"""
    customer_types = ['Individual', 'Business', 'Government']
    probabilities = [0.7, 0.2, 0.1]
    return np.random.choice(customer_types, size=num_records, p=probabilities)

def generate_is_active(customer_type_ids, relationship_type_ids):
    """Generate IsActive with tier-based retention rates"""
    is_active_values = []
    
    for ctype, rtype in zip(customer_type_ids, relationship_type_ids):
        # Premium tiers have higher retention rates
        if rtype in ['VIP', 'Premier', 'Partner', 'Federal']:
            # Premium tiers: 98% active
            is_active = np.random.random() < 0.98
        elif rtype in ['Premium', 'State']:
            # Mid tiers: 96% active
            is_active = np.random.random() < 0.96
        else:
            # Standard tiers: 94% active
            is_active = np.random.random() < 0.94
        
        is_active_values.append(is_active)
    
    return is_active_values

def generate_date_of_birth(num_records):
    """Generate DateOfBirth (Age distribution as of Dec 31, 2021)"""
    reference_date = date(2021, 12, 31)
    birth_dates = []
    
    for _ in range(num_records):
        # 90% between ages 23-75, 10% between ages 76-100
        if np.random.random() < 0.9:
            age = np.random.randint(23, 76)
        else:
            age = np.random.randint(76, 101)
        
        birth_year = reference_date.year - age
        birth_month = np.random.randint(1, 13)
        
        # Handle different months' day ranges
        if birth_month in [1, 3, 5, 7, 8, 10, 12]:
            max_day = 31
        elif birth_month in [4, 6, 9, 11]:
            max_day = 30
        else:  # February
            if birth_year % 4 == 0 and (birth_year % 100 != 0 or birth_year % 400 == 0):
                max_day = 29
            else:
                max_day = 28
        
        birth_day = np.random.randint(1, max_day + 1)
        
        try:
            birth_date = date(birth_year, birth_month, birth_day)
            birth_dates.append(birth_date)
        except ValueError:
            birth_dates.append(date(birth_year, 1, 1))
    
    return birth_dates

def generate_customer_established_date_intelligent(birth_dates, customer_type_ids):
    """Generate CustomerEstablishedDate intelligently based on DateOfBirth and CustomerTypeId"""
    established_dates = []
    
    # Date constraints for customer establishment period
    min_establishment_date = date(2018, 1, 1)
    max_establishment_date = date(2021, 12, 31)
    
    # Age ranges for customer establishment by type
    age_ranges = {
        'Individual': (19, 70),    # 19-70 years (realistic adult earning years)
        'Business': (25, 65),      # 25-65 years (decision-maker ages)
        'Government': (28, 60)     # 28-60 years (mid-level and senior government roles)
    }
    
    for birth_date, customer_type in zip(birth_dates, customer_type_ids):
        min_age, max_age = age_ranges.get(customer_type, (19, 70))  # Default to Individual range
        
        # Calculate possible establishment date range based on age constraints
        min_establishment_age_date = birth_date + timedelta(days=min_age * 365)
        max_establishment_age_date = birth_date + timedelta(days=max_age * 365)
        
        # Constrain to business date range (2018-2021)
        actual_min_date = max(min_establishment_age_date, min_establishment_date)
        actual_max_date = min(max_establishment_age_date, max_establishment_date)
        
        # Handle edge cases where date ranges don't overlap
        if actual_min_date > actual_max_date:
            # If person is too young/old for the business date range, use business range
            actual_min_date = min_establishment_date
            actual_max_date = max_establishment_date
        
        # Generate random date within the valid range
        total_days = (actual_max_date - actual_min_date).days
        if total_days <= 0:
            # Fallback to a date within business range
            random_days = np.random.randint(0, (max_establishment_date - min_establishment_date).days + 1)
            establishment_date = min_establishment_date + timedelta(days=random_days)
        else:
            random_days = np.random.randint(0, total_days + 1)
            establishment_date = actual_min_date + timedelta(days=random_days)
        
        established_dates.append(establishment_date)
    
    return established_dates

def generate_created_by(customer_type_ids):
    """Generate CreatedBy with channel-based assignment"""
    created_by_values = []
    
    for ctype in customer_type_ids:
        if ctype == 'Government':
            # Government: 10% Sales, 90% Services (procurement-focused)
            created_by = np.random.choice(['Sales', 'Services'], p=[0.1, 0.9])
        else:
            # Individual/Business: 90% Sales, 10% Services
            created_by = np.random.choice(['Sales', 'Services'], p=[0.9, 0.1])
        
        created_by_values.append(created_by)
    
    return created_by_values

def generate_customer_relationship_type_id(customer_type_ids):
    """Generate CustomerRelationshipTypeId intelligently based on CustomerTypeId"""
    relationship_types = {
        'Individual': ['Standard', 'Premium', 'VIP'],
        'Business': ['SMB', 'Premier', 'Partner'],
        'Government': ['Local', 'State', 'Federal']
    }
    
    probabilities = {
        'Individual': [0.45, 0.40, 0.15],  # 45% Standard, 40% Premium, 15% VIP
        'Business': [0.60, 0.30, 0.10],    # 60% SMB, 30% Premier, 10% Partner
        'Government': [0.50, 0.35, 0.15]   # 50% Local, 35% State, 15% Federal
    }
    
    assigned_relationships = []
    
    for ctype in customer_type_ids:
        if ctype in relationship_types:
            relationship = np.random.choice(relationship_types[ctype], p=probabilities[ctype])
            assigned_relationships.append(relationship)
        else:
            # Fallback for any unexpected customer types
            assigned_relationships.append('Standard')
    
    return assigned_relationships

def generate_primary_phone(num_records):
    """Generate PrimaryPhone (100% of records, US only)"""
    # Telephone numbers: in the United States, the phone number 555-0100 to 555-0199 
    # may be used with any North American area code (except 800, 866, 877, 888, or 900).  
    # International phone numbers should use sequential numbers.
    na_area_codes = [
        201, 202, 203, 205, 206, 207, 208, 209, 210, 212, 213, 214, 215, 216, 217, 218, 219,
        224, 225, 228, 229, 231, 234, 239, 240, 248, 251, 252, 253, 254, 256, 260, 262, 267,
        269, 270, 276, 281, 301, 302, 303, 304, 305, 307, 308, 309, 310, 312, 313, 314, 315,
        316, 317, 318, 319, 320, 321, 323, 325, 330, 331, 334, 336, 337, 339, 341, 347, 351,
        352, 360, 361, 364, 380, 385, 386, 401, 402, 404, 405, 406, 407, 408, 409, 410, 412,
        413, 414, 415, 417, 419, 423, 424, 425, 430, 432, 434, 435, 440, 443, 458, 463, 464,
        469, 470, 475, 478, 479, 480, 484, 501, 502, 503, 504, 505, 507, 508, 509, 510, 512,
        513, 515, 516, 517, 518, 520, 530, 540, 541, 551, 559, 561, 562, 563, 564, 567, 570,
        571, 573, 574, 575, 580, 585, 586, 601, 602, 603, 605, 606, 607, 608, 609, 610, 612,
        614, 615, 616, 617, 618, 619, 620, 623, 626, 628, 630, 631, 636, 641, 646, 650, 651,
        657, 660, 661, 662, 667, 678, 682, 701, 702, 703, 704, 706, 707, 708, 712, 713, 714,
        715, 716, 717, 718, 719, 720, 724, 725, 727, 731, 732, 734, 737, 740, 747, 754, 757,
        760, 762, 763, 765, 769, 770, 772, 773, 774, 775, 781, 785, 786, 801, 802, 803, 804,
        805, 806, 808, 810, 812, 813, 814, 815, 816, 817, 818, 828, 830, 831, 832, 843, 845,
        847, 848, 850, 856, 857, 858, 859, 860, 862, 863, 864, 865, 870, 872, 878, 901, 903,
        904, 906, 907, 908, 909, 910, 912, 913, 914, 915, 916, 917, 918, 919, 920, 925, 928,
        929, 930, 931, 934, 936, 937, 940, 941, 947, 949, 951, 952, 954, 956, 959, 970, 971,
        972, 973, 978, 979, 980, 984, 985, 989
    ]
    
    phone_numbers = []
    for i in range(num_records):
        area_code = np.random.choice(na_area_codes)
        last_two_digits = np.random.randint(0, 100)
        phone = f"({area_code}) 555-01{last_two_digits:02d}"
        phone_numbers.append(phone)
    
    return phone_numbers

def generate_secondary_phone(num_records):
    """Generate SecondaryPhone (30% of records have secondary phone, US only)"""
    # Telephone numbers: in the United States, the phone number 555-0100 to 555-0199 
    # may be used with any North American area code (except 800, 866, 877, 888, or 900).  
    # International phone numbers should use sequential numbers.
    na_area_codes = [
        201, 202, 203, 205, 206, 207, 208, 209, 210, 212, 213, 214, 215, 216, 217, 218, 219,
        224, 225, 228, 229, 231, 234, 239, 240, 248, 251, 252, 253, 254, 256, 260, 262, 267,
        269, 270, 276, 281, 301, 302, 303, 304, 305, 307, 308, 309, 310, 312, 313, 314, 315,
        316, 317, 318, 319, 320, 321, 323, 325, 330, 331, 334, 336, 337, 339, 341, 347, 351,
        352, 360, 361, 364, 380, 385, 386, 401, 402, 404, 405, 406, 407, 408, 409, 410, 412,
        413, 414, 415, 417, 419, 423, 424, 425, 430, 432, 434, 435, 440, 443, 458, 463, 464,
        469, 470, 475, 478, 479, 480, 484, 501, 502, 503, 504, 505, 507, 508, 509, 510, 512,
        513, 515, 516, 517, 518, 520, 530, 540, 541, 551, 559, 561, 562, 563, 564, 567, 570,
        571, 573, 574, 575, 580, 585, 586, 601, 602, 603, 605, 606, 607, 608, 609, 610, 612,
        614, 615, 616, 617, 618, 619, 620, 623, 626, 628, 630, 631, 636, 641, 646, 650, 651,
        657, 660, 661, 662, 667, 678, 682, 701, 702, 703, 704, 706, 707, 708, 712, 713, 714,
        715, 716, 717, 718, 719, 720, 724, 725, 727, 731, 732, 734, 737, 740, 747, 754, 757,
        760, 762, 763, 765, 769, 770, 772, 773, 774, 775, 781, 785, 786, 801, 802, 803, 804,
        805, 806, 808, 810, 812, 813, 814, 815, 816, 817, 818, 828, 830, 831, 832, 843, 845,
        847, 848, 850, 856, 857, 858, 859, 860, 862, 863, 864, 865, 870, 872, 878, 901, 903,
        904, 906, 907, 908, 909, 910, 912, 913, 914, 915, 916, 917, 918, 919, 920, 925, 928,
        929, 930, 931, 934, 936, 937, 940, 941, 947, 949, 951, 952, 954, 956, 959, 970, 971,
        972, 973, 978, 979, 980, 984, 985, 989
    ]
    
    phone_numbers = []
    for i in range(num_records):
        # 30% chance of having a secondary phone
        if np.random.random() < 0.3:
            area_code = np.random.choice(na_area_codes)
            last_two_digits = np.random.randint(0, 100)
            phone = f"({area_code}) 555-02{last_two_digits:02d}"
            phone_numbers.append(phone)
        else:
            phone_numbers.append(None)
    
    return phone_numbers

def generate_primary_email(first_names):
    """Generate PrimaryEmail using FirstName@contoso.com format"""
    primary_emails = []
    
    for first_name in first_names:
        # Convert FirstName to lowercase for email format
        email_username = first_name.lower()
        email = f"{email_username}@contoso.com"
        primary_emails.append(email)
    
    return primary_emails

def generate_secondary_email(first_names):
    """Generate SecondaryEmail for 15% of customers using FirstName@fabrikam.com format"""
    secondary_emails = []
    
    for first_name in first_names:
        # 15% chance of having a secondary email
        if np.random.random() < 0.15:
            # Convert FirstName to lowercase for email format
            email_username = first_name.lower()
            email = f"{email_username}@fabrikam.com"
            secondary_emails.append(email)
        else:
            secondary_emails.append(None)
    
    return secondary_emails

def generate_customer_trade_names(customer_ids, customer_types):
    """Generate CustomerTradeName records for Business and Government customers only"""
    
    # Business trade names (using Contoso/Fabrikam per compliance)
    business_names = [
        "Contoso Corp", "Contoso Industries", "Contoso Solutions",
        "Fabrikam Inc", "Fabrikam Systems", "Fabrikam Group"
    ]
    
    # Government trade names (using fictional planets per compliance)
    government_names = [
        "Planet Mars Authority", "Jupiter Department of Commerce", 
        "Saturn Municipal Services", "Venus Regional Office",
        "Neptune State Agency", "Mercury City Government"
    ]
    
    trade_name_records = []
    trade_name_counter = 1
    
    for customer_id, customer_type in zip(customer_ids, customer_types):
        if customer_type in ['Business', 'Government']:
            # Generate TradeNameId
            trade_name_id = f"TN-{trade_name_counter:03d}"
            
            # Select appropriate trade name based on customer type
            if customer_type == 'Business':
                trade_name = np.random.choice(business_names)
            else:  # Government
                trade_name = np.random.choice(government_names)
            
            trade_name_records.append({
                'CustomerId': customer_id,
                'CustomerTypeId': customer_type,
                'TradeNameId': trade_name_id,
                'TradeName': trade_name,
                'CustomerTradeNameNote': f'Generated {customer_type.lower()} trade name'
            })
            
            trade_name_counter += 1
    
    return trade_name_records

print("🔄 Generating all customer fields...")

# Get number of records from input file
num_records = len(df_input)
print(f"📊 Processing {num_records} customer records")

# Generate all fields with intelligent assignment
customer_ids = generate_customer_ids(num_records)
customer_type_ids = generate_customer_type_id(num_records)
relationship_type_ids = generate_customer_relationship_type_id(customer_type_ids)
birth_dates = generate_date_of_birth(num_records)
established_dates = generate_customer_established_date_intelligent(birth_dates, customer_type_ids)
is_active_values = generate_is_active(customer_type_ids, relationship_type_ids)
created_by_values = generate_created_by(customer_type_ids)
primary_phones = generate_primary_phone(num_records)
secondary_phones = generate_secondary_phone(num_records)
primary_emails = generate_primary_email(df_input['FirstName'].values)
secondary_emails = generate_secondary_email(df_input['FirstName'].values)

# Generate CustomerTradeName data for Business and Government customers
trade_name_records = generate_customer_trade_names(customer_ids, customer_type_ids)

# Create complete output DataFrame with logical column ordering
df_output = pd.DataFrame({
    'CustomerId': customer_ids,
    'CustomerTypeId': customer_type_ids,
    'CustomerRelationshipTypeId': relationship_type_ids,
    'DateOfBirth': birth_dates,
    'CustomerEstablishedDate': established_dates,
    'IsActive': is_active_values,
    'FirstName': df_input['FirstName'].values,
    'LastName': df_input['LastName'].values,
    'Gender': df_input['Gender'].values,
    'PrimaryPhone': primary_phones,
    'SecondaryPhone': secondary_phones,
    'PrimaryEmail': primary_emails,
    'SecondaryEmail': secondary_emails,
    'CreatedBy': created_by_values
})

print("✅ Complete customer data generation complete!")

🎯 GENERATING COMPLETE CUSTOMER SAMPLE DATA
Input: C:\temp\samples\input\customer_names_unique_513.csv
Output: C:\temp\samples\output\Customer_Core_Fields_Samples.csv
📂 Reading input file: C:\temp\samples\input\customer_names_unique_513.csv
✅ File read successfully!
📊 Input file shape: (513, 3)
📋 Input columns: ['FirstName', 'LastName', 'Gender']
✅ All expected columns found
🔄 Generating all customer fields...
📊 Processing 513 customer records
✅ Complete customer data generation complete!


In [6]:
# filepath: c:\Repos\Code\SampleDataPrep\src\notebooks\data\Generate_Customer_Core_Fields.ipynb
# Display comprehensive distributions and statistics
print("\n📊 COMPLETE CUSTOMER DATA ANALYSIS")
print("="*50)

# Basic counts
total_customers = len(df_output)
print(f"📊 Total Customers: {total_customers}")

# CustomerTypeId Distribution
print("\n🎯 CustomerTypeId Distribution:")
type_dist = df_output['CustomerTypeId'].value_counts()
type_pct = df_output['CustomerTypeId'].value_counts(normalize=True) * 100
for ctype in ['Individual', 'Business', 'Government']:
    count = type_dist.get(ctype, 0)
    percent = type_pct.get(ctype, 0)
    print(f"  {ctype:12}: {count:3d} ({percent:5.1f}%)")

# IsActive Distribution
print("\n🎯 IsActive Distribution:")
active_dist = df_output['IsActive'].value_counts()
active_pct = df_output['IsActive'].value_counts(normalize=True) * 100
for active in [True, False]:
    count = active_dist.get(active, 0)
    percent = active_pct.get(active, 0)
    print(f"  {str(active):5}: {count:3d} ({percent:5.1f}%)")

# Gender Distribution
print("\n🎯 Gender Distribution:")
gender_dist = df_output['Gender'].value_counts()
gender_pct = df_output['Gender'].value_counts(normalize=True) * 100
for gender in gender_dist.index:
    count = gender_dist[gender]
    percent = gender_pct[gender]
    print(f"  {gender}: {count:3d} ({percent:5.1f}%)")

# Age Distribution Analysis
print("\n🎯 Age Distribution (as of Dec 31, 2021):")
reference_date = date(2021, 12, 31)
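# Count completed years so accumulated leap days cannot inflate an age
ages = [
    reference_date.year - bd.year - ((reference_date.month, reference_date.day) < (bd.month, bd.day))
    for bd in df_output['DateOfBirth']
]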

age_23_75 = len([age for age in ages if 23 <= age <= 75])
age_76_100 = len([age for age in ages if 76 <= age <= 100])
print(f"  Age 23-75 : {age_23_75:3d} ({age_23_75/len(ages)*100:5.1f}%)")
print(f"  Age 76-100: {age_76_100:3d} ({age_76_100/len(ages)*100:5.1f}%)")
print(f"  Min Age   : {min(ages)}")
print(f"  Max Age   : {max(ages)}")

# CreatedBy Distribution
print("\n🎯 CreatedBy Distribution:")
created_dist = df_output['CreatedBy'].value_counts()
created_pct = df_output['CreatedBy'].value_counts(normalize=True) * 100
for creator in ['Sales', 'Services']:
    count = created_dist.get(creator, 0)
    percent = created_pct.get(creator, 0)
    print(f"  {creator:8}: {count:3d} ({percent:5.1f}%)")

# CustomerRelationshipTypeId Distribution
print("\n🎯 CustomerRelationshipTypeId Distribution:")
relationship_dist = df_output['CustomerRelationshipTypeId'].value_counts()
relationship_pct = df_output['CustomerRelationshipTypeId'].value_counts(normalize=True) * 100
for rel_type in ['VIP', 'Premium', 'Standard', 'SMB', 'Premier', 'Partner', 'Local', 'State', 'Federal']:
    count = relationship_dist.get(rel_type, 0)
    percent = relationship_pct.get(rel_type, 0)
    print(f"  {rel_type:8}: {count:3d} ({percent:5.1f}%)")

# Phone Number Distribution Analysis
print("\n🎯 Phone Number Distribution:")
us_primary = len([p for p in df_output['PrimaryPhone'] if p.startswith('(')])
print(f"  Primary US Numbers  : {us_primary:3d} (100.0%)")

# pd.notna / pd.isna treat both None and NaN as missing
secondary_count = len([p for p in df_output['SecondaryPhone'] if pd.notna(p)])
secondary_null = len([p for p in df_output['SecondaryPhone'] if pd.isna(p)])
print(f"  Has Secondary Phone : {secondary_count:3d} ({secondary_count/len(df_output)*100:5.1f}%)")
print(f"  No Secondary Phone  : {secondary_null:3d} ({secondary_null/len(df_output)*100:5.1f}%)")

# Date Range Analysis
print("\n🎯 Date Ranges:")
min_birth = df_output['DateOfBirth'].min()
max_birth = df_output['DateOfBirth'].max()
min_established = df_output['CustomerEstablishedDate'].min()
max_established = df_output['CustomerEstablishedDate'].max()

print(f"  Birth dates: {min_birth} to {max_birth}")
print(f"  Established: {min_established} to {max_established}")

# Email Distribution Analysis
print("\n🎯 Email Distribution:")
primary_email_count = len([e for e in df_output['PrimaryEmail'] if e and '@contoso.com' in e])
secondary_email_count = len([e for e in df_output['SecondaryEmail'] if pd.notna(e)])
secondary_null = len([e for e in df_output['SecondaryEmail'] if pd.isna(e)])

print(f"  Primary Emails      : {primary_email_count:3d} (100.0%)")
print(f"  Domain: contoso.com : {primary_email_count:3d} (100.0%)")
print(f"  Has Secondary Email : {secondary_email_count:3d} ({secondary_email_count/len(df_output)*100:5.1f}%)")
print(f"  No Secondary Email  : {secondary_null:3d} ({secondary_null/len(df_output)*100:5.1f}%)")

# Sample email analysis
print("\n📧 Sample Email Addresses (First 10):")
sample_emails = df_output[['FirstName', 'PrimaryEmail', 'SecondaryEmail']].head(10)
for idx, row in sample_emails.iterrows():
    secondary_info = f" | {row['SecondaryEmail']}" if row['SecondaryEmail'] else " | None"
    print(f"  {row['FirstName']} -> {row['PrimaryEmail']}{secondary_info}")

# Display sample records
print(f"\n📋 Sample Records (First 10):")
sample_df = df_output.head(10)
print(sample_df.to_string(index=False))

# Save customer core fields to CSV
output_path = os.path.join(OUTPUT_FOLDER, OUTPUT_FILE)
df_output.to_csv(output_path, index=False)

print(f"\n💾 CUSTOMER DATA SAVED TO: {output_path}")

# Generate and save CustomerTradeName data
if trade_name_records:
    df_trade_names = pd.DataFrame(trade_name_records)
    trade_name_output_path = os.path.join(OUTPUT_FOLDER, "CustomerTradeName.csv")
    df_trade_names.to_csv(trade_name_output_path, index=False)
    
    print(f"💾 TRADE NAME DATA SAVED TO: {trade_name_output_path}")
    print(f"📊 Trade Name Records: {len(df_trade_names)}")
    
    # Display trade name distribution
    print(f"\n🎯 CustomerTradeName Distribution:")
    business_count = len([r for r in trade_name_records if r['CustomerTypeId'] == 'Business'])
    government_count = len([r for r in trade_name_records if r['CustomerTypeId'] == 'Government'])
    print(f"  Business Trade Names : {business_count:3d}")
    print(f"  Government Trade Names: {government_count:3d}")
    
    # Sample trade names
    print(f"\n📋 Sample Trade Names (First 10):")
    sample_trade_names = df_trade_names.head(10)
    for idx, row in sample_trade_names.iterrows():
        print(f"  {row['CustomerId']} ({row['CustomerTypeId']}): {row['TradeName']}")

print(f"\n📊 Total Customer Records: {len(df_output)}")
print(f"📈 Customer Columns: {', '.join(df_output.columns)}")
print("\n✅ Complete customer and trade name data generation complete!")
print(f"\n📋 Ready for database import with all required fields!")


📊 COMPLETE CUSTOMER DATA ANALYSIS
📊 Total Customers: 513

🎯 CustomerTypeId Distribution:
  Individual  : 353 ( 68.8%)
  Business    : 102 ( 19.9%)
  Government  :  58 ( 11.3%)

🎯 IsActive Distribution:
  True : 494 ( 96.3%)
  False:  19 (  3.7%)

🎯 Gender Distribution:
  Male: 265 ( 51.7%)
  Female: 248 ( 48.3%)

🎯 Age Distribution (as of Dec 31, 2021):
  Age 23-75 : 457 ( 89.1%)
  Age 76-100:  56 ( 10.9%)
  Min Age   : 23
  Max Age   : 100

🎯 CreatedBy Distribution:
  Sales   : 407 ( 79.3%)
  Services: 106 ( 20.7%)

🎯 CustomerRelationshipTypeId Distribution:
  VIP     :  56 ( 10.9%)
  Premium : 138 ( 26.9%)
  Standard: 159 ( 31.0%)
  SMB     :  72 ( 14.0%)
  Premier :  22 (  4.3%)
  Partner :   8 (  1.6%)
  Local   :  33 (  6.4%)
  State   :  19 (  3.7%)
  Federal :   6 (  1.2%)

🎯 Phone Number Distribution:
  Primary US Numbers  : 513 (100.0%)
  Has Secondary Phone : 134 ( 26.1%)
  No Secondary Phone  : 379 ( 73.9%)

🎯 Date Ranges:
  Birth dates: 1921-05-04 to 1998-08-23
  Establish