# Generate Enriched Product Sample Data

## Overview
This notebook enriches Product_Input_Base.csv with CategoryName from ProductCategory_48.csv by left-joining the two files on CategoryID.

## Input Files
- **Product_Input_Base.csv**: Product data with CategoryID but missing CategoryName
- **ProductCategory_48.csv**: CategoryID to CategoryName mapping

## Output
- **Product_Samples.csv**: Enriched product data with CategoryName filled in

## Process
1. Read both input files from the configured input folder (INPUT_FOLDER, under C:\temp\samples)
2. Perform lookup/merge on CategoryID
3. Generate enriched output file with all original data + CategoryName
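
The steps above can be sketched on a few in-memory rows. This is a minimal illustration, not the notebook's actual data: the two small DataFrames are hypothetical stand-ins for the input files, and only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two input files (not the real data)
products = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "Name": ["HL Road Frame - Black, 58", "Sport-100 Helmet, Red", "AWC Logo Cap"],
    "CategoryID": [18, 35, 23],
})
categories = pd.DataFrame({
    "CategoryID": [18, 23, 35],
    "CategoryName": ["Road Frames", "Caps", "Helmets"],
})

# A left merge keeps every product row, even when its CategoryID has no match
enriched = products.merge(categories, on="CategoryID", how="left")
print(enriched)
```

A left join is used so that a product with an unknown CategoryID stays in the output with an empty CategoryName instead of being dropped.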

---

In [None]:
import pandas as pd
import os

# Configuration
INPUT_FOLDER = "C:\\temp\\samples\\input"
PRODUCT_INPUT_FILE = "Product_Input_Base.csv"
CATEGORY_LOOKUP_FILE = "ProductCategory_48.csv"
OUTPUT_FOLDER = "C:\\temp\\samples\\output"
OUTPUT_FILE = "Product_Samples.csv"

print(f"🎯 ENRICHING PRODUCT DATA WITH CATEGORY NAMES")
print(f"Input Folder: {INPUT_FOLDER}")
print(f"Product Input: {PRODUCT_INPUT_FILE}")
print(f"Category Lookup: {CATEGORY_LOOKUP_FILE}")
print(f"Output: {OUTPUT_FILE}")
print("="*60)

try:
    # Read Product Input file
    product_input_path = os.path.join(INPUT_FOLDER, PRODUCT_INPUT_FILE)
    print(f"📂 Reading product input: {product_input_path}")
    df_products = pd.read_csv(product_input_path)
    
    print(f"✅ Product file read successfully!")
    print(f"📊 Shape: {df_products.shape}")
    print(f"📋 Columns: {list(df_products.columns)}")
    
    # Read Category Lookup file
    category_lookup_path = os.path.join(INPUT_FOLDER, CATEGORY_LOOKUP_FILE)
    print(f"\n📂 Reading category lookup: {category_lookup_path}")
    df_categories = pd.read_csv(category_lookup_path)
    
    print(f"✅ Category file read successfully!")
    print(f"📊 Shape: {df_categories.shape}")
    print(f"📋 Columns: {list(df_categories.columns)}")
    
    # Display sample data
    print(f"\n📖 Sample Product Data (First 5 rows):")
    print(df_products.head())
    
    print(f"\n📖 Sample Category Data (First 5 rows):")
    print(df_categories.head())
    
except Exception as e:
    print(f"❌ Error reading input files: {e}")
    # Report the configured folder and file names so the hint matches what was actually read
    print(f"\n💡 Please ensure both files exist in {INPUT_FOLDER}:")
    print(f"   - {PRODUCT_INPUT_FILE}")
    print(f"   - {CATEGORY_LOOKUP_FILE}")
    raise

print("✅ Input files loaded successfully!")

🎯 ENRICHING PRODUCT DATA WITH CATEGORY NAMES
Input Folder: C:\temp\samples
Product Input: Product_Input_Base.csv
Category Lookup: ProductCategory_48.csv
Output: Product_Samples.csv
📂 Reading product input: C:\temp\samples\Product_Input_Base.csv
✅ Product file read successfully!
📊 Shape: (315, 8)
📋 Columns: ['ProductID', 'Name', 'Color', 'StandardCost', 'ListPrice', 'Size', 'Weight', 'CategoryID']

📂 Reading category lookup: C:\temp\samples\ProductCategory_48.csv
✅ Category file read successfully!
📊 Shape: (48, 2)
📋 Columns: ['CategoryID', 'CategoryName']

📖 Sample Product Data (First 5 rows):
   ProductID                       Name  Color  StandardCost  ListPrice Size  \
0          1  HL Road Frame - Black, 58  Black     1059.3100    1431.50   58   
1          2    HL Road Frame - Red, 58    Red     1059.3100    1431.50   58   
2          3      Sport-100 Helmet, Red    Red       13.0863      34.99  NaN   
3          4    Sport-100 Helmet, Black  Black       13.0863      34.99  NaN   
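
Before loading a full file, it can help to confirm that the expected columns are present by reading only the header row. The sketch below assumes a hypothetical helper named `missing_columns`, which is not part of the notebook, and demonstrates it on a throwaway CSV written to a temporary folder.

```python
import os
import tempfile

import pandas as pd


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; not part of the notebook.
    """
    header = pd.read_csv(csv_path, nrows=0)  # nrows=0 parses only the header row
    return [col for col in required if col not in header.columns]


# Demo on a throwaway file that deliberately lacks CategoryID
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_products.csv")
    pd.DataFrame({"ProductID": [1], "Name": ["AWC Logo Cap"]}).to_csv(demo_path, index=False)
    absent = missing_columns(demo_path, ["ProductID", "CategoryID"])

print(absent)
```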


In [2]:
# Perform lookup/merge operation
print("\n🔄 PERFORMING CATEGORY LOOKUP")
print("="*50)

# Check for CategoryID column in both files
if 'CategoryID' not in df_products.columns:
    print(f"❌ CategoryID column not found in {PRODUCT_INPUT_FILE}")
    print(f"Available columns: {list(df_products.columns)}")
    raise ValueError("CategoryID column missing from product data")

if 'CategoryID' not in df_categories.columns:
    print(f"❌ CategoryID column not found in {CATEGORY_LOOKUP_FILE}")
    print(f"Available columns: {list(df_categories.columns)}")
    raise ValueError("CategoryID column missing from category data")

# Find the category name column: the first column whose name contains "name"
# (this also covers "CategoryName" and "category_name")
category_name_col = None
for col in df_categories.columns:
    if 'name' in col.lower():
        category_name_col = col
        break

if category_name_col is None:
    print(f"⚠️  CategoryName column not found in {CATEGORY_LOOKUP_FILE}")
    print(f"Available columns: {list(df_categories.columns)}")
    # Try to use second column if it exists
    if len(df_categories.columns) >= 2:
        category_name_col = df_categories.columns[1]
        print(f"🔄 Using second column as CategoryName: {category_name_col}")
    else:
        raise ValueError("CategoryName column not found")

print(f"✅ Using CategoryID for lookup")
print(f"✅ Using '{category_name_col}' as CategoryName")

# Before merge - check for missing CategoryIDs
product_categories = set(df_products['CategoryID'].dropna())
lookup_categories = set(df_categories['CategoryID'].dropna())
missing_categories = product_categories - lookup_categories

if missing_categories:
    print(f"⚠️  Warning: {len(missing_categories)} CategoryIDs in products not found in lookup:")
    print(f"   Missing: {sorted(missing_categories)}")

# Perform the merge
print(f"\n🔗 Merging product data with category names...")
df_enriched = df_products.merge(
    df_categories[['CategoryID', category_name_col]], 
    on='CategoryID', 
    how='left'
)

# Rename the category name column to standard name if needed
if category_name_col != 'CategoryName':
    df_enriched = df_enriched.rename(columns={category_name_col: 'CategoryName'})

print(f"✅ Merge completed!")
print(f"📊 Original products: {len(df_products)}")
print(f"📊 Enriched products: {len(df_enriched)}")

# Check for null CategoryNames
null_category_names = df_enriched['CategoryName'].isnull().sum()
if null_category_names > 0:
    print(f"⚠️  Warning: {null_category_names} products have null CategoryName")

print("\n📖 Sample Enriched Data (First 10 rows):")
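# Show the key and category name first, followed by the first three remaining columns
other_cols = [col for col in df_enriched.columns if col not in ('CategoryID', 'CategoryName')]
display_cols = ['CategoryID', 'CategoryName'] + other_cols[:3]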
print(df_enriched[display_cols].head(10))


🔄 PERFORMING CATEGORY LOOKUP
✅ Using CategoryID for lookup
✅ Using 'CategoryName' as CategoryName

🔗 Merging product data with category names...
✅ Merge completed!
📊 Original products: 315
📊 Enriched products: 315

📖 Sample Enriched Data (First 10 rows):
   CategoryID CategoryName  ProductID                        Name  Color
0          18  Road Frames          1   HL Road Frame - Black, 58  Black
1          18  Road Frames          2     HL Road Frame - Red, 58    Red
2          35      Helmets          3       Sport-100 Helmet, Red    Red
3          35      Helmets          4     Sport-100 Helmet, Black  Black
4          27        Socks          5      Mountain Bike Socks, M  White
5          27        Socks          6      Mountain Bike Socks, L  White
6          35      Helmets          7      Sport-100 Helmet, Blue   Blue
7          23         Caps          8                AWC Logo Cap  Multi
8          25      Jerseys          9  Long-Sleeve Logo Jersey, S  Multi
9          25 
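
The merge cell compares row counts before and after the join to confirm that no rows were added. As a complementary sketch, pandas can enforce the same expectation at merge time: `validate="many_to_one"` raises an error if the lookup table contains duplicate keys, and `indicator=True` adds a `_merge` column that marks unmatched rows. The sample frames below are hypothetical, and CategoryID 99 is deliberately made up to show an unmatched product.

```python
import pandas as pd

# Hypothetical sample data; CategoryID 99 has no entry in the lookup
products = pd.DataFrame({"ProductID": [1, 2, 3], "CategoryID": [18, 35, 99]})
categories = pd.DataFrame({
    "CategoryID": [18, 35],
    "CategoryName": ["Road Frames", "Helmets"],
})

merged = products.merge(
    categories,
    on="CategoryID",
    how="left",
    validate="many_to_one",  # raises MergeError if CategoryID repeats in the lookup
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Products whose CategoryID was not found in the lookup
unmatched = merged.loc[merged["_merge"] == "left_only", "ProductID"].tolist()
print(unmatched)
```

If this were adopted in the notebook, the `_merge` column would need to be dropped before saving so the output schema stays unchanged.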

In [None]:
# Save enriched data and display summary
print("\n💾 SAVING ENRICHED PRODUCT DATA")
print("="*50)

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# Save to output file
output_path = os.path.join(OUTPUT_FOLDER, OUTPUT_FILE)
df_enriched.to_csv(output_path, index=False)

print(f"✅ Enriched data saved to: {output_path}")
print(f"📊 Total records: {len(df_enriched)}")
print(f"📈 Total columns: {len(df_enriched.columns)}")

# Display summary statistics
print(f"\n📊 ENRICHMENT SUMMARY")
print("="*30)
print(f"Original products: {len(df_products)}")
print(f"Category lookup entries: {len(df_categories)}")
print(f"Final enriched products: {len(df_enriched)}")

# Category distribution
print(f"\n🎯 Category Distribution:")
category_counts = df_enriched['CategoryName'].value_counts()
print(category_counts.head(10))

if len(category_counts) > 10:
    print(f"... and {len(category_counts) - 10} more categories")

# Check data quality
print(f"\n🔍 Data Quality Check:")
print(f"Products with CategoryName: {df_enriched['CategoryName'].notna().sum()}")
print(f"Products missing CategoryName: {df_enriched['CategoryName'].isna().sum()}")

print(f"\n📋 Final Column List:")
print(f"Columns: {', '.join(df_enriched.columns)}")

print(f"\n✅ Product enrichment complete!")
print(f"📁 Output file: {output_path}")


💾 SAVING ENRICHED PRODUCT DATA
✅ Enriched data saved to: C:\temp\samples\Product_Samples.csv
📊 Total records: 315
📈 Total columns: 9

📊 ENRICHMENT SUMMARY
Original products: 315
Category lookup entries: 48
Final enriched products: 315

🎯 Category Distribution:
CategoryName
Road Bikes         43
Road Frames        33
Mountain Bikes     32
Mountain Frames    28
Touring Bikes      22
Touring Frames     18
Wheels             14
Tires and Tubes    11
Saddles             9
Jerseys             8
Name: count, dtype: int64
... and 34 more categories

🔍 Data Quality Check:
Products with CategoryName: 315
Products missing CategoryName: 0

📋 Final Column List:
Columns: ProductID, Name, Color, StandardCost, ListPrice, Size, Weight, CategoryID, CategoryName

✅ Product enrichment complete!
📁 Output file: C:\temp\samples\Product_Samples.csv
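
A quick way to confirm the saved file is to read it back and compare it with the DataFrame that was written. The sketch below is illustrative only: it uses a small hypothetical frame and a temporary folder instead of the notebook's `df_enriched` and `OUTPUT_FOLDER`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the enriched DataFrame
enriched = pd.DataFrame({
    "ProductID": [1, 2],
    "CategoryID": [18, 35],
    "CategoryName": ["Road Frames", "Helmets"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Product_Samples.csv")
    enriched.to_csv(out_path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(out_path)

# Round-trip checks: same shape, same column order, no empty category names
same_shape = reloaded.shape == enriched.shape
same_columns = list(reloaded.columns) == list(enriched.columns)
no_missing_names = bool(reloaded["CategoryName"].notna().all())
print(same_shape, same_columns, no_missing_names)
```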
