# Process Customer Names - Keep Unique LastNames

## Overview
This notebook reads `customer_names.csv` and keeps one record per unique LastName value, sorted alphabetically. This is a one-time task and does not need to be repeated. The output file is then copied to `C:\temp\samples\customer_names_unique_513.csv`, which serves as the input base file for generating Customer data.

## Input
- File: `C:\temp\samples\customer_names.csv`
- Columns: FirstName, LastName, Gender
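
As a minimal sketch of the column check the notebook applies to this input, the snippet below validates the header of a small in-memory CSV. The helper name `missing_columns` and the sample rows are assumptions for illustration and are not part of the notebook.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["FirstName", "LastName", "Gender"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Illustrative data only; the real input is customer_names.csv
sample_csv = io.StringIO("FirstName,LastName\nIda,Abolina\nDeniz,Acar\n")
sample = pd.read_csv(sample_csv)

# The Gender column is deliberately left out of the sample header
print(missing_columns(sample))
```

An empty list means all three expected columns are present and processing can continue.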

## Output  
- File: `C:\temp\samples\customer_names_unique.csv`
- One record per unique LastName (the first occurrence is kept), sorted A-Z
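
The output rule can be sketched with pandas as follows: `drop_duplicates` keeps the first row seen for each LastName, and `sort_values` orders the result. The function name `unique_last_names` and the sample rows are assumptions for illustration.

```python
import pandas as pd

def unique_last_names(df):
    """Keep the first row per LastName, then sort by LastName (hypothetical helper)."""
    deduped = df.drop_duplicates(subset=["LastName"], keep="first")
    return deduped.sort_values("LastName").reset_index(drop=True)

# Illustrative rows only, not taken from the real file
sample = pd.DataFrame(
    {
        "FirstName": ["Luka", "Ida", "Sonja", "Deniz"],
        "LastName": ["Zupan", "Abolina", "Zupan", "Acar"],
        "Gender": ["Male", "Female", "Female", "Male"],
    }
)

result = unique_last_names(sample)
print(result.to_string(index=False))
```

Because only the first occurrence survives, the FirstName and Gender kept for a shared LastName depend on row order in the input. Note also that `sort_values` compares strings by code point, so a name starting with a lowercase or accented letter would sort after the plain A-Z names.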

---

In [None]:
import pandas as pd
import os

# Configuration
INPUT_FOLDER = "C:\\temp\\samples"
INPUT_FILE = "customer_names.csv"
OUTPUT_FILE = "customer_names_unique.csv"

print("🎯 PROCESSING CUSTOMER NAMES")
print(f"Input: {INPUT_FOLDER}\\{INPUT_FILE}")
print(f"Output: {INPUT_FOLDER}\\{OUTPUT_FILE}")
print("="*50)

# Read the CSV file
input_path = os.path.join(INPUT_FOLDER, INPUT_FILE)

# Check for encoding issues first
encoding_ok = True
try:
    print("🔍 Checking file for encoding issues...")
    last_firstname = ""
    last_lastname = ""
    
    with open(input_path, 'r', encoding='utf-8') as f:
        line_number = 0
        for line in f:
            line_number += 1
            # Try to extract first and last name from the line
            if line_number > 1:  # Skip header row
                try:
                    parts = line.strip().split(',')
                    if len(parts) >= 2:
                        last_firstname = parts[0].strip('"')
                        last_lastname = parts[1].strip('"')
                except Exception:
                    pass  # If we can't parse, just continue
                    
    print(f"✅ File is readable with UTF-8 encoding")
    
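except FileNotFoundError:
    # Leave encoding_ok as True so the read step below reports the missing file
    pass
except UnicodeDecodeError as e:
    # line_number is the last line decoded successfully; the bad bytes come after it
    print(f"❌ Encoding error found after line {line_number}")
    print(f"Last successfully read names: {last_firstname} {last_lastname}")
    print(f"Error details: {e}")
    print(f"💡 Please check your CSV file after line {line_number}, following the names {last_firstname} {last_lastname}")
    encoding_ok = False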

# Only proceed if encoding is OK
if encoding_ok:
    try:
        df = pd.read_csv(input_path)
        print(f"✅ Successfully read {len(df)} records")
        print(f"📋 Columns found: {list(df.columns)}")
        
        # Check if required columns exist
        required_columns = ['FirstName', 'LastName', 'Gender']
        missing_columns = [col for col in required_columns if col not in df.columns]
        
        if missing_columns:
            print(f"❌ Missing columns: {missing_columns}")
            print(f"Available columns: {list(df.columns)}")
        else:
            print(f"✅ All required columns found: {required_columns}")
            
            # Display original data info
            print(f"\n📊 Original Data:")
            print(f"   Total records: {len(df)}")
            print(f"   Unique LastNames: {df['LastName'].nunique()}")
            
            # Keep only unique LastName (first occurrence of each LastName)
            df_unique = df.drop_duplicates(subset=['LastName'], keep='first')
            
            # Sort by LastName alphabetically (A to Z)
            df_unique = df_unique.sort_values('LastName').reset_index(drop=True)
            
            print(f"\n📊 After Processing:")
            print(f"   Unique records: {len(df_unique)}")
            print(f"   Records removed: {len(df) - len(df_unique)}")
            
            # Display first and last few records
            print(f"\n📋 First 10 unique LastNames:")
            print(df_unique[['FirstName', 'LastName', 'Gender']].head(10).to_string(index=False))
            
            if len(df_unique) > 10:
                print(f"\n📋 Last 5 unique LastNames:")
                print(df_unique[['FirstName', 'LastName', 'Gender']].tail(5).to_string(index=False))
            
            # Save to output file
            output_path = os.path.join(INPUT_FOLDER, OUTPUT_FILE)
            df_unique.to_csv(output_path, index=False)
            
            print(f"\n💾 SAVED TO: {output_path}")
            print(f"📊 Final file contains {len(df_unique)} unique records")
            print(f"📈 Columns: {', '.join(df_unique.columns)}")
            print("\n✅ Customer names processing complete!")
            
    except FileNotFoundError:
        print(f"❌ File not found: {input_path}")
        print("Please make sure the customer_names.csv file exists in the specified location.")
    except Exception as e:
        print(f"❌ Error processing file: {e}")
else:
    print("❌ Cannot proceed due to encoding issues. Please fix the file first.")

🎯 PROCESSING CUSTOMER NAMES
Input: C:\temp\samples\customer_names.csv
Output: C:\temp\samples\customer_names_unique.csv
🔍 Checking file for encoding issues...
✅ File is readable with UTF-8 encoding
✅ Successfully read 589 records
📋 Columns found: ['FirstName', 'LastName', 'Gender']
✅ All required columns found: ['FirstName', 'LastName', 'Gender']

📊 Original Data:
   Total records: 589
   Unique LastNames: 513

📊 After Processing:
   Unique records: 513
   Records removed: 76

📋 First 10 unique LastNames:
FirstName  LastName Gender
Tsehayetu     Abera Female
    Antra     Abola Female
      Ida   Abolina Female
    Deniz      Acar   Male
    Guner     Aktas   Male
  Danaite Alemseged Female
  Asmerom    Ambaye   Male
    Kiana  Anderson Female
    Jakob    Anhalt   Male
Elisabeth   Arcouet Female

📋 Last 5 unique LastNames:
FirstName LastName Gender
    Sonja    Zorko Female
     Luka    Zupan   Male
   Jozica   Zupanc Female
    Jozef Zupancic   Male
    Alida Zvirbule   Male

💾 SAVED