# Items Table vs CSV Column Analysis
## Date: July 5, 2025

### Objective
Compare the columns between the Items table in the database and the Items CSV file to identify any differences, missing fields, or inconsistencies.

### Analysis Plan
1. Load Items table from production database
2. Load Items CSV from the latest data backup
3. Compare column structures
4. Identify differences and provide recommendations

## 1. Import Required Libraries
Import pandas for data handling, sqlite3 for database access, and the project's ConfigurationManager for resolving data paths.

In [1]:
import pandas as pd
import sqlite3
import sys
import os
from pathlib import Path

# Add src directory to path for imports
sys.path.append(str(Path('../src').resolve()))

# Import our configuration management
from data_pipeline.config import ConfigurationManager

print("📦 Libraries imported successfully")
print(f"🐍 Python version: {sys.version}")
print(f"🐼 Pandas version: {pd.__version__}")

📦 Libraries imported successfully
🐍 Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
🐼 Pandas version: 2.3.0


## 2. Initialize Configuration
Load configuration to get database and CSV paths dynamically.

In [15]:
# Initialize configuration manager
try:
    config = ConfigurationManager()
    
    # Get project root directory
    project_root = Path('../').resolve()  # Go up from notebooks to project root
    
    # Get database path using correct method - resolve relative to project root
    db_path_relative = config.get('data_sources', 'target_database')
    db_path = project_root / db_path_relative
    print(f"🗄️ Database Path: {db_path}")
    
    # Get CSV data source paths and fix to project root
    paths = config.get_data_source_paths()
    csv_path_raw = paths['csv_backup_path']
    # Extract just the relative part after project root
    csv_relative = Path(csv_path_raw).name if 'Nangsel Pioneers' in csv_path_raw else csv_path_raw
    csv_path = project_root / "data" / "csv" / csv_relative
    
    json_path = paths['json_api_path']
    print(f"📁 CSV Data Path: {csv_path}")
    
    # Construct Items CSV file path
    items_csv_path = csv_path / "Item.csv"
    print(f"📄 Items CSV Path: {items_csv_path}")
    
    # Verify paths exist
    db_exists = os.path.exists(db_path)
    csv_exists = os.path.exists(items_csv_path)
    
    print(f"✅ Database exists: {db_exists}")
    print(f"✅ Items CSV exists: {csv_exists}")
    
    if not db_exists:
        print("⚠️ WARNING: Database file not found!")
    if not csv_exists:
        print("⚠️ WARNING: Items CSV file not found!")
        
except Exception as e:
    print(f"❌ Configuration error: {e}")
    raise

🗄️ Database Path: C:\Users\User\Documents\Projects\Automated_Operations\Zoho_Data_Sync\data\database\production.db
📁 CSV Data Path: C:\Users\User\Documents\Projects\Automated_Operations\Zoho_Data_Sync\data\csv\Nangsel Pioneers_2025-06-22
📄 Items CSV Path: C:\Users\User\Documents\Projects\Automated_Operations\Zoho_Data_Sync\data\csv\Nangsel Pioneers_2025-06-22\Item.csv
✅ Database exists: True
✅ Items CSV exists: True
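
The existence checks in this cell print a warning and let the notebook continue, so a missing file only surfaces as an error in a later cell. A minimal sketch of a fail-fast alternative is shown here; `require_paths` is a hypothetical helper written for illustration and is not part of `data_pipeline`.

```python
from pathlib import Path


def require_paths(**named_paths):
    """Return the given paths as Path objects, raising if any is missing.

    Hypothetical helper for illustration; not part of data_pipeline.
    """
    resolved = {name: Path(p) for name, p in named_paths.items()}
    missing = [f"{name}: {p}" for name, p in resolved.items() if not p.exists()]
    if missing:
        raise FileNotFoundError(
            "Missing required inputs:\n  " + "\n  ".join(missing)
        )
    return resolved
```

In this notebook it could be called as `require_paths(database=db_path, items_csv=items_csv_path)` right after the paths are built.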


## 3. Load Items Table from Database
Connect to the production database and load the Items table structure and data.

In [6]:
# Load Items table from database
try:
    # Connect to database; the finally block closes the connection even if a query fails
    conn = sqlite3.connect(db_path)
    try:
        # Load Items table into DataFrame
        db_items_df = pd.read_sql_query("SELECT * FROM Items", conn)
        
        # Get table schema information
        cursor = conn.cursor()
        cursor.execute("PRAGMA table_info(Items)")
        table_info = cursor.fetchall()
    finally:
        conn.close()
    
    # Display basic information
    print(f"📊 Items table loaded from database:")
    print(f"   Rows: {len(db_items_df):,}")
    print(f"   Columns: {len(db_items_df.columns)}")
    print(f"   Memory usage: {db_items_df.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Extract column names
    db_columns = list(db_items_df.columns)
    print(f"\n🏷️ Database table columns ({len(db_columns)}):")
    for i, col in enumerate(db_columns, 1):
        print(f"   {i:2d}. {col}")
    
    # Show first few rows
    print(f"\n📋 Sample data (first 3 rows):")
    display(db_items_df.head(3))
    
except Exception as e:
    print(f"❌ Error loading Items table: {e}")
    db_items_df = None
    db_columns = []

📊 Items table loaded from database:
   Rows: 925
   Columns: 24
   Memory usage: 1045.0 KB

🏷️ Database table columns (24):
    1. ItemID
    2. ItemName
    3. SKU
    4. ItemType
    5. Category
    6. Description
    7. Rate
    8. Unit
    9. PurchaseRate
   10. TaxID
   11. TaxName
   12. TaxPercentage
   13. PurchaseTaxID
   14. PurchaseTaxName
   15. PurchaseTaxPercentage
   16. InventoryAccountID
   17. InventoryAccountName
   18. AccountID
   19. AccountName
   20. PurchaseAccountID
   21. PurchaseAccountName
   22. IsActive
   23. CreatedTime
   24. LastModifiedTime

📋 Sample data (first 3 rows):


Unnamed: 0,ItemID,ItemName,SKU,ItemType,Category,Description,Rate,Unit,PurchaseRate,TaxID,...,PurchaseTaxPercentage,InventoryAccountID,InventoryAccountName,AccountID,AccountName,PurchaseAccountID,PurchaseAccountName,IsActive,CreatedTime,LastModifiedTime
0,3990265000000085007,ABC Warehouse stock,,Inventory,,Stock for sale to retailers,BTN 1.00,,BTN 0.855,,...,,,,,,,,Inactive,,
1,3990265000000085020,AAB Distributer Goods Direct from Factory,,Sales and Purchases,,Goods directly sold to distributors,BTN 1.00,,BTN 1.00,,...,,,,,,,,Active,,
2,3990265000000130052,Stock Warehouse,,Sales and Purchases,,,BTN 1.00,,BTN 0.87,,...,,,,,,,,Inactive,,
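
The cell above fetches `PRAGMA table_info(Items)` into `table_info` but never displays it. A minimal sketch of turning that result into a DataFrame, so that declared column types can be read next to the column names, might look like this. `table_schema` is a hypothetical helper; the six column labels follow the field order that SQLite documents for `table_info`.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def table_schema(db_path, table):
    """Return the declared schema of `table` as a DataFrame.

    Hypothetical helper. The table name is interpolated into the
    statement, so only pass trusted, hard-coded table names.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return pd.DataFrame(
        rows, columns=["cid", "name", "type", "notnull", "dflt_value", "pk"]
    )
```

Calling `table_schema(db_path, "Items")` would list each column with its declared type, which helps when deciding whether new columns should be `TEXT` or numeric.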


## 4. Load Items CSV into DataFrame
Read the Items CSV file using the dynamically resolved path from configuration.

In [7]:
# Load Items CSV file
try:
    # Read CSV file
    csv_items_df = pd.read_csv(items_csv_path)
    
    # Display basic information
    print(f"📊 Items CSV loaded:")
    print(f"   File: {items_csv_path}")
    print(f"   Rows: {len(csv_items_df):,}")
    print(f"   Columns: {len(csv_items_df.columns)}")
    print(f"   Memory usage: {csv_items_df.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Extract column names
    csv_columns = list(csv_items_df.columns)
    print(f"\n🏷️ CSV file columns ({len(csv_columns)}):")
    for i, col in enumerate(csv_columns, 1):
        print(f"   {i:2d}. {col}")
    
    # Show first few rows
    print(f"\n📋 Sample data (first 3 rows):")
    display(csv_items_df.head(3))
    
    # Check for any null/empty data patterns
    null_counts = csv_items_df.isnull().sum()
    columns_with_nulls = null_counts[null_counts > 0]
    
    if len(columns_with_nulls) > 0:
        print(f"\n⚠️ Columns with null values:")
        for col, count in columns_with_nulls.items():
            print(f"   {col}: {count} nulls ({count/len(csv_items_df)*100:.1f}%)")
    else:
        print(f"\n✅ No null values found in CSV")
    
except Exception as e:
    print(f"❌ Error loading Items CSV: {e}")
    csv_items_df = None
    csv_columns = []

📊 Items CSV loaded:
   File: C:\Users\User\Documents\Projects\Automated_Operations\Zoho_Data_Sync\data\csv\Nangsel Pioneers_2025-06-22\Item.csv
   Rows: 925
   Columns: 41
   Memory usage: 1352.1 KB

🏷️ CSV file columns (41):
    1. Item ID
    2. Item Name
    3. SKU
    4. Description
    5. Rate
    6. Account
    7. Account Code
    8. Tax Name
    9. Tax Percentage
   10. Tax Type
   11. Purchase Tax Name
   12. Purchase Tax Percentage
   13. Purchase Tax Type
   14. Product Type
   15. Source
   16. Reference ID
   17. Last Sync Time
   18. Status
   19. Usage unit
   20. Purchase Rate
   21. Purchase Account
   22. Purchase Account Code
   23. Purchase Description
   24. Inventory Account
   25. Inventory Account Code
   26. Inventory Valuation Method
   27. Reorder Point
   28. Vendor
   29. Opening Stock
   30. Opening Stock Value
   31. Stock On Hand
   32. Item Type
   33. Region
   34. Vehicle
   35. CF.SKU category
   36. CF.Product Sale Category
   37. CF.Item Location
   38. CF.Product Category
   39. CF.Manufacturer
   40. CF.M Box
   41. CF.S Box Qty

Unnamed: 0,Item ID,Item Name,SKU,Description,Rate,Account,Account Code,Tax Name,Tax Percentage,Tax Type,...,Item Type,Region,Vehicle,CF.SKU category,CF.Product Sale Category,CF.Item Location,CF.Product Category,CF.Manufacturer,CF.M Box,CF.S Box Qty
0,3990265000000085007,ABC Warehouse stock,,Stock for sale to retailers,BTN 1.00,Sales,I-1000,,,,...,Inventory,,,Warehouse,,TPH,Others unclassified,,,
1,3990265000000085020,AAB Distributer Goods Direct from Factory,,Goods directly sold to distributors,BTN 1.00,Sales to Distributors,,,,,...,Sales and Purchases,,,DISTRIBUTOR GOODS,,TPH,Others unclassified,,,
2,3990265000000130052,Stock Warehouse,,,BTN 1.00,Sales,I-1000,,,,...,Sales and Purchases,,,,,TPH,Others unclassified,,,



⚠️ Columns with null values:
   SKU: 89 nulls (9.6%)
   Description: 914 nulls (98.8%)
   Account: 8 nulls (0.9%)
   Account Code: 33 nulls (3.6%)
   Tax Name: 925 nulls (100.0%)
   Tax Percentage: 925 nulls (100.0%)
   Tax Type: 925 nulls (100.0%)
   Purchase Tax Name: 925 nulls (100.0%)
   Purchase Tax Percentage: 925 nulls (100.0%)
   Purchase Tax Type: 925 nulls (100.0%)
   Reference ID: 925 nulls (100.0%)
   Last Sync Time: 925 nulls (100.0%)
   Usage unit: 26 nulls (2.8%)
   Purchase Account Code: 33 nulls (3.6%)
   Purchase Description: 905 nulls (97.8%)
   Inventory Account: 8 nulls (0.9%)
   Inventory Account Code: 8 nulls (0.9%)
   Inventory Valuation Method: 35 nulls (3.8%)
   Reorder Point: 649 nulls (70.2%)
   Vendor: 66 nulls (7.1%)
   Opening Stock: 696 nulls (75.2%)
   Opening Stock Value: 696 nulls (75.2%)
   Stock On Hand: 35 nulls (3.8%)
   Region: 925 nulls (100.0%)
   Vehicle: 925 nulls (100.0%)
   CF.SKU category: 1 nulls (0.1%)
   CF.Product Sale Category: 6
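
Several columns in the listing above (for example `Tax Name`, `Region`, and `Vehicle`) are 100% null, which matters later when deciding which unmapped columns actually carry data. A minimal sketch that collects the same null counts into a sortable table and flags fully empty columns follows; `null_summary` is a hypothetical helper name.

```python
import pandas as pd


def null_summary(df):
    """Per-column null counts and percentages, most-empty columns first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "nulls": counts,
        # max(..., 1) avoids division by zero on an empty frame
        "pct_null": (counts / max(len(df), 1) * 100).round(1),
    })
    summary["all_null"] = summary["nulls"] == len(df)
    return summary.sort_values("nulls", ascending=False)
```

Filtering the result with `summary[summary["all_null"]]` would list the columns that contain no values at all in this export.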

## 5. Compare Columns Between Database Table and CSV
Analyze the column structures to identify differences, missing fields, and mismatches.

In [8]:
# Column comparison analysis
if db_columns and csv_columns:
    # Convert to sets for set operations
    db_cols_set = set(db_columns)
    csv_cols_set = set(csv_columns)
    
    # Find differences
    common_columns = db_cols_set & csv_cols_set
    only_in_db = db_cols_set - csv_cols_set
    only_in_csv = csv_cols_set - db_cols_set
    
    print(f"📊 COLUMN COMPARISON ANALYSIS")
    print(f"{'='*60}")
    print(f"Database columns: {len(db_columns)}")
    print(f"CSV columns: {len(csv_columns)}")
    print(f"Common columns: {len(common_columns)}")
    print(f"Only in database: {len(only_in_db)}")
    print(f"Only in CSV: {len(only_in_csv)}")
    
    # Calculate coverage percentages
    db_coverage = (len(common_columns) / len(db_columns)) * 100 if db_columns else 0
    csv_coverage = (len(common_columns) / len(csv_columns)) * 100 if csv_columns else 0
    
    print(f"\n📈 COVERAGE ANALYSIS:")
    print(f"Database coverage (common/total): {db_coverage:.1f}%")
    print(f"CSV coverage (common/total): {csv_coverage:.1f}%")
    
    # Store results for detailed display
    comparison_results = {
        'common': sorted(list(common_columns)),
        'only_db': sorted(list(only_in_db)),
        'only_csv': sorted(list(only_in_csv)),
        'db_total': len(db_columns),
        'csv_total': len(csv_columns),
        'common_count': len(common_columns)
    }
    
    print(f"\n✅ Column comparison completed successfully")
    
else:
    print(f"❌ Cannot perform comparison - missing data")
    comparison_results = None

📊 COLUMN COMPARISON ANALYSIS
Database columns: 24
CSV columns: 41
Common columns: 3
Only in database: 21
Only in CSV: 38

📈 COVERAGE ANALYSIS:
Database coverage (common/total): 12.5%
CSV coverage (common/total): 7.3%

✅ Column comparison completed successfully
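
The comparison above matches column names by exact string equality, so `Item ID` (CSV) and `ItemID` (database) count as different columns even though the sample rows in sections 3 and 4 show the same ID values in both sources. A minimal sketch of a looser comparison that ignores case, spaces, and punctuation is given below. `normalize` and `match_by_normalized_name` are hypothetical helpers, and a name match is only a hint that two columns correspond, not proof that the ETL maps one to the other.

```python
import re


def normalize(name):
    """Lower-case a column name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_by_normalized_name(db_columns, csv_columns):
    """Pair CSV and database columns whose normalized names are equal.

    Returns (matched, unmatched): a dict of CSV name -> database name,
    and the list of CSV names with no counterpart.
    """
    db_by_key = {normalize(c): c for c in db_columns}
    matched = {
        c: db_by_key[normalize(c)]
        for c in csv_columns
        if normalize(c) in db_by_key
    }
    unmatched = [c for c in csv_columns if c not in matched]
    return matched, unmatched
```

Running `match_by_normalized_name(db_columns, csv_columns)` in this notebook would separate pure naming differences from columns that have no database counterpart under any spelling. Renamed fields such as `Usage unit` versus `Unit` would still need a manual mapping.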


## 6. Display Detailed Column Differences
Show the specific columns that are unique to each source and provide recommendations.

In [9]:
# Display detailed differences
if comparison_results:
    print(f"🔍 DETAILED COLUMN ANALYSIS")
    print(f"{'='*80}")
    
    # Common columns
    if comparison_results['common']:
        print(f"\n✅ COMMON COLUMNS ({len(comparison_results['common'])}):")
        print(f"   These columns exist in both database and CSV:")
        for i, col in enumerate(comparison_results['common'], 1):
            print(f"   {i:2d}. {col}")
    
    # Columns only in database
    if comparison_results['only_db']:
        print(f"\n🗄️ COLUMNS ONLY IN DATABASE ({len(comparison_results['only_db'])}):")
        print(f"   These columns exist in the database but NOT in the CSV:")
        for i, col in enumerate(comparison_results['only_db'], 1):
            print(f"   {i:2d}. {col}")
        print(f"   📝 Note: These might be generated fields or defaults added during ETL")
    
    # Columns only in CSV
    if comparison_results['only_csv']:
        print(f"\n📄 COLUMNS ONLY IN CSV ({len(comparison_results['only_csv'])}):")
        print(f"   These columns exist in the CSV but NOT in the database:")
        for i, col in enumerate(comparison_results['only_csv'], 1):
            print(f"   {i:2d}. {col}")
        print(f"   ⚠️ Warning: These CSV fields are being dropped during ETL!")
    
    # Summary and recommendations
    print(f"\n📋 SUMMARY & RECOMMENDATIONS:")
    print(f"{'='*50}")
    
    if comparison_results['only_csv']:
        print(f"🚨 ACTION REQUIRED:")
        print(f"   - {len(comparison_results['only_csv'])} CSV columns are being dropped")
        print(f"   - These fields contain data that is not being preserved")
        print(f"   - Consider updating mappings.py to include these fields")
    
    if comparison_results['only_db']:
        print(f"ℹ️ INFO:")
        print(f"   - {len(comparison_results['only_db'])} database columns are generated/default fields")
        print(f"   - This is normal for ETL processes")
    
    coverage_ratio = comparison_results['common_count'] / comparison_results['csv_total'] * 100
    if coverage_ratio < 90:
        print(f"⚠️ WARNING: Low CSV coverage ({coverage_ratio:.1f}%) - many fields being dropped")
    elif coverage_ratio < 100:
        print(f"⚠️ NOTICE: Partial CSV coverage ({coverage_ratio:.1f}%) - some fields being dropped")
    else:
        print(f"✅ EXCELLENT: Complete CSV coverage ({coverage_ratio:.1f}%)")
        
else:
    print(f"❌ Cannot display detailed analysis - comparison failed")

🔍 DETAILED COLUMN ANALYSIS

✅ COMMON COLUMNS (3):
   These columns exist in both database and CSV:
    1. Description
    2. Rate
    3. SKU

🗄️ COLUMNS ONLY IN DATABASE (21):
   These columns exist in the database but NOT in the CSV:
    1. AccountID
    2. AccountName
    3. Category
    4. CreatedTime
    5. InventoryAccountID
    6. InventoryAccountName
    7. IsActive
    8. ItemID
    9. ItemName
   10. ItemType
   11. LastModifiedTime
   12. PurchaseAccountID
   13. PurchaseAccountName
   14. PurchaseRate
   15. PurchaseTaxID
   16. PurchaseTaxName
   17. PurchaseTaxPercentage
   18. TaxID
   19. TaxName
   20. TaxPercentage
   21. Unit
   📝 Note: These might be generated fields or defaults added during ETL

📄 COLUMNS ONLY IN CSV (38):
   These columns exist in the CSV but NOT in the database:
    1. Account
    2. Account Code
    3. CF.Item Location
    4. CF.M Box
    5. CF.Manufacturer
    6. CF.Product Category
    7. CF.Product Sale Category
    8. CF.S Box

## 7. Export Analysis Results
Create a comparison DataFrame and optionally export the results for documentation.

In [11]:
# Create comprehensive analysis DataFrame
if comparison_results and db_columns and csv_columns:
    # Create a comprehensive comparison DataFrame
    all_columns = sorted(set(db_columns + csv_columns))
    
    analysis_data = []
    for col in all_columns:
        in_db = col in db_columns
        in_csv = col in csv_columns
        status = "COMMON" if (in_db and in_csv) else "DB_ONLY" if in_db else "CSV_ONLY"
        
        analysis_data.append({
            'Column': col,
            'In_Database': in_db,
            'In_CSV': in_csv,
            'Status': status
        })
    
    analysis_df = pd.DataFrame(analysis_data)
    
    # Display summary table
    print(f"📊 COMPREHENSIVE COLUMN ANALYSIS TABLE")
    print(f"{'='*60}")
    
    status_counts = analysis_df['Status'].value_counts()
    print(f"Status Summary:")
    for status, count in status_counts.items():
        print(f"   {status}: {count} columns")
    
    print(f"\n📋 Full Analysis Table:")
    display(analysis_df)
    
    # Optionally save to CSV for documentation
    output_path = "../reports/items_column_analysis_2025_07_05.csv"
    try:
        os.makedirs("../reports", exist_ok=True)
        analysis_df.to_csv(output_path, index=False)
        print(f"\n💾 Analysis results saved to: {output_path}")
    except Exception as e:
        print(f"\n⚠️ Could not save analysis: {e}")
    
    print(f"\n🎯 FINAL ANALYSIS COMPLETE")
    print(f"{'='*40}")
    print(f"Total columns analyzed: {len(all_columns)}")
    print(f"Database coverage: {len([c for c in analysis_data if c['In_Database']])} columns")
    print(f"CSV coverage: {len([c for c in analysis_data if c['In_CSV']])} columns")
    print(f"Perfect match: {status_counts.get('COMMON', 0) == len(all_columns)}")
    
else:
    print(f"❌ Cannot create analysis table - missing data")

📊 COMPREHENSIVE COLUMN ANALYSIS TABLE
Status Summary:
   CSV_ONLY: 38 columns
   DB_ONLY: 21 columns
   COMMON: 3 columns

📋 Full Analysis Table:


|     | Column | In_Database | In_CSV | Status |
| --- | --- | --- | --- | --- |
| 0 | Account | False | True | CSV_ONLY |
| 1 | Account Code | False | True | CSV_ONLY |
| 2 | AccountID | True | False | DB_ONLY |
| 3 | AccountName | True | False | DB_ONLY |
| 4 | CF.Item Location | False | True | CSV_ONLY |
| ... | ... | ... | ... | ... |
| 57 | TaxPercentage | True | False | DB_ONLY |
| 58 | Unit | True | False | DB_ONLY |
| 59 | Usage unit | False | True | CSV_ONLY |
| 60 | Vehicle | False | True | CSV_ONLY |



💾 Analysis results saved to: ../reports/items_column_analysis_2025_07_05.csv

🎯 FINAL ANALYSIS COMPLETE
Total columns analyzed: 62
Database coverage: 24 columns
CSV coverage: 41 columns
Perfect match: False
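
The status table built in this cell can also be produced with an outer merge, using pandas' `indicator=True` option to record which side each column name came from. This is a sketch of an equivalent approach rather than a replacement for the loop above; `column_status` is a hypothetical helper name.

```python
import pandas as pd


def column_status(db_columns, csv_columns):
    """Classify each column name as COMMON, DB_ONLY, or CSV_ONLY."""
    db = pd.DataFrame({"Column": list(db_columns)})
    csv = pd.DataFrame({"Column": list(csv_columns)})
    # indicator=True adds a "_merge" column: both / left_only / right_only
    merged = db.merge(csv, on="Column", how="outer", indicator=True)
    labels = {"both": "COMMON", "left_only": "DB_ONLY", "right_only": "CSV_ONLY"}
    merged["Status"] = merged["_merge"].astype(str).map(labels)
    return merged.drop(columns="_merge").sort_values("Column", ignore_index=True)
```

`column_status(db_columns, csv_columns)["Status"].value_counts()` should then give the same three counts as the status summary printed above, assuming neither list contains duplicate names.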


## 8. Key Findings Summary
Display the most critical differences and recommendations for action.

In [12]:
# Display critical findings with specific examples
print("🚨 CRITICAL FINDINGS - WHY THE DIFFERENCES EXIST")
print("=" * 60)

print(f"\n✅ ONLY 3 COMMON COLUMNS (out of 41 CSV columns):")
common_list = sorted(list(comparison_results['common']))
for i, col in enumerate(common_list, 1):
    print(f"   {i}. '{col}'")

print(f"\n📄 CSV-ONLY COLUMNS BEING DROPPED (38 columns):")
csv_only_list = sorted(list(comparison_results['only_csv']))

# Group by type for better understanding
custom_fields = [col for col in csv_only_list if col.startswith('CF.')]
original_names = [col for col in csv_only_list if any(x in col for x in ['Item ID', 'Item Name', 'Item Type'])]
business_fields = [col for col in csv_only_list if not col.startswith('CF.') and not any(x in col for x in ['Item ID', 'Item Name', 'Item Type'])]

print(f"\n   🏷️ Custom Fields ({len(custom_fields)}) - Business configurations:")
for col in custom_fields:
    print(f"      • {col}")

print(f"\n   📝 Original Zoho Names ({len(original_names)}) - Name mismatches:")
for col in original_names:
    print(f"      • '{col}' (CSV) → needs mapping to canonical name")

print(f"\n   💼 Core Business Fields ({len(business_fields)}) - Essential data:")
important_fields = ['Account', 'Opening Stock', 'Stock On Hand', 'Vendor', 'Status', 'Reorder Point']
for col in business_fields:
    marker = "⚠️ CRITICAL" if any(imp in col for imp in important_fields) else "  "
    print(f"      {marker} • {col}")

print(f"\n🗄️ DATABASE-ONLY COLUMNS (21 columns):")
db_only_list = sorted(list(comparison_results['only_db']))
print("   These are generated/canonical fields (normal for ETL):")
for col in db_only_list[:10]:  # Show first 10
    print(f"      • {col}")
if len(db_only_list) > 10:
    print(f"      ... and {len(db_only_list) - 10} more")

print(f"\n🎯 ROOT CAUSE:")
print(f"   1. Items mapping in mappings.py is SEVERELY INCOMPLETE")
print(f"   2. Only 3 out of 41 CSV fields are properly mapped")
print(f"   3. 93% of CSV data is being silently dropped!")
print(f"   4. Business-critical data (stock, vendor, custom fields) is lost")

print(f"\n📋 IMMEDIATE ACTIONS NEEDED:")
print(f"   1. Update mappings.py to include all 38 missing CSV→DB mappings")
print(f"   2. Map custom fields (CF.*) to preserve business configurations")
print(f"   3. Map stock/inventory fields for proper inventory management")
print(f"   4. Re-run ETL pipeline to verify no data loss")

coverage_pct = (len(common_list) / len(csv_columns)) * 100
print(f"\n⚠️ CURRENT CSV COVERAGE: {coverage_pct:.1f}% (UNACCEPTABLE)")
print(f"   TARGET: 100% (all CSV fields should be preserved)")

🚨 CRITICAL FINDINGS - WHY THE DIFFERENCES EXIST

✅ ONLY 3 COMMON COLUMNS (out of 41 CSV columns):
   1. 'Description'
   2. 'Rate'
   3. 'SKU'

📄 CSV-ONLY COLUMNS BEING DROPPED (38 columns):

   🏷️ Custom Fields (7) - Business configurations:
      • CF.Item Location
      • CF.M Box
      • CF.Manufacturer
      • CF.Product Category
      • CF.Product Sale Category
      • CF.S Box Qty
      • CF.SKU category

   📝 Original Zoho Names (3) - Name mismatches:
      • 'Item ID' (CSV) → needs mapping to canonical name
      • 'Item Name' (CSV) → needs mapping to canonical name
      • 'Item Type' (CSV) → needs mapping to canonical name

   💼 Core Business Fields (28) - Essential data:
      ⚠️ CRITICAL • Account
      ⚠️ CRITICAL • Account Code
      ⚠️ CRITICAL • Inventory Account
      ⚠️ CRITICAL • Inventory Account Code
         • Inventory Valuation Method
         • Last Sync Time
      ⚠️ CRITICAL • Ope

In [14]:
# CONCISE SUMMARY - Why the differences exist
print("🔍 ITEMS TABLE vs CSV - ROOT CAUSE ANALYSIS")
print("=" * 50)

print("\n💥 THE PROBLEM:")
print(f"   • Only {len(comparison_results['common'])} out of {len(csv_columns)} CSV fields are saved to database")
print(f"   • {len(comparison_results['only_csv'])} CSV fields are being completely DROPPED")
print(f"   • Data loss: {(len(comparison_results['only_csv'])/len(csv_columns)*100):.0f}%")

print("\n🔧 WHY THIS HAPPENS:")
print("   • Items mapping in mappings.py is severely incomplete")
print("   • Only basic fields (Rate, SKU, Description) are mapped")
print("   • Custom fields (CF.*) are not mapped at all")
print("   • Business fields (stock, vendor, accounts) are ignored")

print("\n📋 EXAMPLES OF DROPPED FIELDS:")
examples = [col for col in comparison_results['only_csv'] if any(x in col for x in ['Stock', 'CF.', 'Vendor', 'Account'])][:8]
for field in examples:
    print(f"   ❌ '{field}' - LOST!")

print("\n✅ SOLUTION:")
print("   1. Update mappings.py to include ALL CSV fields")
print("   2. Map custom fields to preserve business data")
print("   3. Re-run ETL to verify 100% data preservation")

print(f"\n🎯 TARGET: 100% CSV coverage (currently {len(comparison_results['common'])/len(csv_columns)*100:.1f}%)")

🔍 ITEMS TABLE vs CSV - ROOT CAUSE ANALYSIS

💥 THE PROBLEM:
   • Only 3 out of 41 CSV fields are saved to database
   • 38 CSV fields are being completely DROPPED
   • Data loss: 93%

🔧 WHY THIS HAPPENS:
   • Items mapping in mappings.py is severely incomplete
   • Only basic fields (Rate, SKU, Description) are mapped
   • Custom fields (CF.*) are not mapped at all
   • Business fields (stock, vendor, accounts) are ignored

📋 EXAMPLES OF DROPPED FIELDS:
   ❌ 'Account' - LOST!
   ❌ 'Account Code' - LOST!
   ❌ 'CF.Item Location' - LOST!
   ❌ 'CF.M Box' - LOST!
   ❌ 'CF.Manufacturer' - LOST!
   ❌ 'CF.Product Category' - LOST!
   ❌ 'CF.Product Sale Category' - LOST!
   ❌ 'CF.S Box Qty' - LOST!

✅ SOLUTION:
   1. Update mappings.py to include ALL CSV fields
   2. Map custom fields to preserve business data
   3. Re-run ETL to verify 100% data preservation

🎯 TARGET: 100% CSV coverage (currently 7.3%)
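
The 93% figure above is a share of column names, not of stored values. The null summary in section 4 shows that several CSV-only columns are 100% null, so the share of populated cells that sits in unmatched columns may differ from the column share. A minimal sketch of that value-weighted measure follows; `populated_cell_share` is a hypothetical helper, and it only counts non-null cells without checking whether a value also reaches the database under a different column name.

```python
import pandas as pd


def populated_cell_share(df, columns):
    """Fraction of all non-null cells in `df` that fall in `columns`."""
    non_null = df.notna().sum()
    total = int(non_null.sum())
    if total == 0:
        return 0.0
    # Ignore names that are not actually present in the frame
    present = [c for c in columns if c in df.columns]
    return float(non_null[present].sum()) / total
```

In this notebook, `populated_cell_share(csv_items_df, comparison_results['only_csv'])` would report how much of the populated CSV data lives in the columns flagged as CSV-only.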


In [16]:
# Get the exact CSV columns that need to be added to mappings
print("=== CSV COLUMNS FOR MAPPINGS UPDATE ===")
print()

csv_columns = list(csv_items_df.columns)
print(f"📋 ALL CSV COLUMNS ({len(csv_columns)} total):")
for i, col in enumerate(csv_columns, 1):
    print(f"  {i:2d}. '{col}'")

print()
print("🔍 CSV-ONLY COLUMNS (need to be added to mapping):")
csv_only_fields = [col for col in csv_columns if col not in common_columns]
for i, col in enumerate(csv_only_fields, 1):
    print(f"  {i:2d}. '{col}'")

print()
print("📝 MAPPING ENTRIES NEEDED:")
print("# Add these to ITEMS_CSV_MAP in mappings.py:")
for col in csv_only_fields:
    # Create canonical name (normalize for database)
    canonical = col.replace(' ', '_').replace('.', '_').replace('-', '_')
    print(f"    '{col}': '{canonical}',")

=== CSV COLUMNS FOR MAPPINGS UPDATE ===

📋 ALL CSV COLUMNS (41 total):
   1. 'Item ID'
   2. 'Item Name'
   3. 'SKU'
   4. 'Description'
   5. 'Rate'
   6. 'Account'
   7. 'Account Code'
   8. 'Tax Name'
   9. 'Tax Percentage'
  10. 'Tax Type'
  11. 'Purchase Tax Name'
  12. 'Purchase Tax Percentage'
  13. 'Purchase Tax Type'
  14. 'Product Type'
  15. 'Source'
  16. 'Reference ID'
  17. 'Last Sync Time'
  18. 'Status'
  19. 'Usage unit'
  20. 'Purchase Rate'
  21. 'Purchase Account'
  22. 'Purchase Account Code'
  23. 'Purchase Description'
  24. 'Inventory Account'
  25. 'Inventory Account Code'
  26. 'Inventory Valuation Method'
  27. 'Reorder Point'
  28. 'Vendor'
  29. 'Opening Stock'
  30. 'Opening Stock Value'
  31. 'Stock On Hand'
  32. 'Item Type'
  33. 'Region'
  34. 'Vehicle'
  35. 'CF.SKU category'
  36. 'CF.Product Sale Category'
  37. 'CF.Item Location'
  38. 'CF.Product Category'
  39. 'CF.Manufacturer'
  40. 'CF.M Box'
  41. 'CF.S Box Qty'

🔍 CSV-ONLY COLUMNS (nee

In [17]:
# STRATEGY: Fix the mapping by updating canonical schema
print("🎯 ITEMS MAPPING FIX STRATEGY")
print("="*50)
print()

# The issue is that CSV mapping preserves original names 
# but schema expects canonical names. We need to:
# 1. Add missing columns to canonical schema
# 2. Keep CSV mapping as-is (it's already comprehensive)

print("🔍 Current CSV Mapping Status:")
print(f"   • CSV columns: {len(csv_columns)}")
print(f"   • Common mapped: {len(common_columns)}")
print(f"   • Missing from schema: {len(csv_only_fields)}")

print()
print("💡 FIX APPROACH:")
print("   1. Add missing columns to CANONICAL_SCHEMA['Items']['header_columns']")
print("   2. Keep existing ITEMS_CSV_MAP (it has most fields)")
print("   3. Test ETL pipeline for 100% coverage")

print()
print("📋 COLUMNS TO ADD TO CANONICAL SCHEMA:")
for i, col in enumerate(csv_only_fields[:10], 1):  # Show first 10
    canonical = col.replace(' ', '_').replace('.', '_').replace('-', '_')
    print(f"   {i:2d}. '{canonical}': 'TEXT',  # From CSV: '{col}'")
if len(csv_only_fields) > 10:
    print(f"   ... and {len(csv_only_fields) - 10} more")

🎯 ITEMS MAPPING FIX STRATEGY

🔍 Current CSV Mapping Status:
   • CSV columns: 41
   • Common mapped: 3
   • Missing from schema: 38

💡 FIX APPROACH:
   1. Add missing columns to CANONICAL_SCHEMA['Items']['header_columns']
   2. Keep existing ITEMS_CSV_MAP (it has most fields)
   3. Test ETL pipeline for 100% coverage

📋 COLUMNS TO ADD TO CANONICAL SCHEMA:
    1. 'Item_ID': 'TEXT',  # From CSV: 'Item ID'
    2. 'Item_Name': 'TEXT',  # From CSV: 'Item Name'
    3. 'Account': 'TEXT',  # From CSV: 'Account'
    4. 'Account_Code': 'TEXT',  # From CSV: 'Account Code'
    5. 'Tax_Name': 'TEXT',  # From CSV: 'Tax Name'
    6. 'Tax_Percentage': 'TEXT',  # From CSV: 'Tax Percentage'
    7. 'Tax_Type': 'TEXT',  # From CSV: 'Tax Type'
    8. 'Purchase_Tax_Name': 'TEXT',  # From CSV: 'Purchase Tax Name'
    9. 'Purchase_Tax_Percentage': 'TEXT',  # From CSV: 'Purchase Tax Percentage'
   10. 'Purchase_Tax_Type': 'TEXT',  # From CSV: 'Purchase Tax Type'
   ... and 28 more
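
The canonical names above come from chained `str.replace` calls, which handle spaces, dots, and hyphens but not other punctuation, and which do not notice when two headers collapse to the same name. A minimal sketch of a stricter generator with a collision check is shown below. `to_canonical` and `build_canonical_map` are hypothetical helpers, not functions from `mappings.py`. Note that a name generated this way, such as `Item_ID`, would sit alongside the existing `ItemID` database column rather than replace it, so renamed fields still need a deliberate decision.

```python
import re


def to_canonical(name):
    """Turn a CSV header into a name made of letters, digits, and underscores."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
    # Prefix names that start with a digit so they stay valid identifiers
    return f"_{cleaned}" if cleaned[:1].isdigit() else cleaned


def build_canonical_map(csv_columns):
    """Map each CSV header to a canonical name, raising on collisions."""
    mapping, seen = {}, {}
    for col in csv_columns:
        canonical = to_canonical(col)
        if canonical in seen:
            raise ValueError(
                f"{col!r} and {seen[canonical]!r} both map to {canonical!r}"
            )
        seen[canonical] = col
        mapping[col] = canonical
    return mapping
```

For the headers in this export, `to_canonical("CF.S Box Qty")` gives `CF_S_Box_Qty`, the same result as the chained replacements.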


In [18]:
# Generate the exact canonical schema columns to add
print("📋 CANONICAL SCHEMA COLUMNS TO ADD")
print("="*60)

# Get CSV-only fields that need to be added to schema
# NOTE: 'Purchase Rate' is absent from this hard-coded list, so it is not reported by this cell
missing_fields = [col for col in csv_only_fields if col in ['Account', 'Account Code', 'Item ID', 'Item Name', 'Item Type', 'Tax Name', 'Tax Percentage', 'Tax Type', 'Usage unit', 'Opening Stock', 'Opening Stock Value', 'Stock On Hand', 'Source', 'Status', 'Vendor', 'Vehicle', 'Region', 'Reference ID', 'Last Sync Time', 'Reorder Point', 'Product Type', 'Purchase Account', 'Purchase Account Code', 'Purchase Description', 'Purchase Tax Name', 'Purchase Tax Percentage', 'Purchase Tax Type', 'Inventory Account', 'Inventory Account Code', 'Inventory Valuation Method']]

# Add custom fields
custom_fields = [col for col in csv_only_fields if col.startswith('CF.')]

print("🏢 CORE BUSINESS FIELDS:")
for field in missing_fields:
    # Map to canonical name that matches current CSV mapping
    print(f"            '{field}': 'TEXT',")

print()
print("🔧 CUSTOM FIELDS:")
for field in custom_fields:
    print(f"            '{field}': 'TEXT',")

print()
print(f"📊 SUMMARY:")
print(f"   • Core business fields to add: {len(missing_fields)}")
print(f"   • Custom fields to add: {len(custom_fields)}")
print(f"   • Total new columns: {len(missing_fields) + len(custom_fields)}")
print(f"   • This will fix {len(csv_only_fields)} missing CSV fields")

📋 CANONICAL SCHEMA COLUMNS TO ADD
🏢 CORE BUSINESS FIELDS:
            'Item ID': 'TEXT',
            'Item Name': 'TEXT',
            'Account': 'TEXT',
            'Account Code': 'TEXT',
            'Tax Name': 'TEXT',
            'Tax Percentage': 'TEXT',
            'Tax Type': 'TEXT',
            'Purchase Tax Name': 'TEXT',
            'Purchase Tax Percentage': 'TEXT',
            'Purchase Tax Type': 'TEXT',
            'Product Type': 'TEXT',
            'Source': 'TEXT',
            'Reference ID': 'TEXT',
            'Last Sync Time': 'TEXT',
            'Status': 'TEXT',
            'Usage unit': 'TEXT',
            'Purchase Account': 'TEXT',
            'Purchase Account Code': 'TEXT',
            'Purchase Description': 'TEXT',
            'Inventory Account': 'TEXT',
            'Inventory Account Code': 'TEXT',
            'Inventory Valuation Method': 'TEXT',
            'Reorder Point': 'TEXT',
            'Vendor': 'TEXT',
            'Opening Stock': 'TEXT',


In [19]:
# Simple approach - get all CSV-only fields for schema
print("ALL CSV-ONLY FIELDS TO ADD TO CANONICAL SCHEMA:")
print("-" * 50)

# All CSV fields that aren't in the database schema
all_missing = csv_only_fields

# Show them in a format ready for mappings.py
for field in all_missing:
    print(f"            '{field}': 'TEXT',")

print(f"\nTotal: {len(all_missing)} fields to add")

ALL CSV-ONLY FIELDS TO ADD TO CANONICAL SCHEMA:
--------------------------------------------------
            'Item ID': 'TEXT',
            'Item Name': 'TEXT',
            'Account': 'TEXT',
            'Account Code': 'TEXT',
            'Tax Name': 'TEXT',
            'Tax Percentage': 'TEXT',
            'Tax Type': 'TEXT',
            'Purchase Tax Name': 'TEXT',
            'Purchase Tax Percentage': 'TEXT',
            'Purchase Tax Type': 'TEXT',
            'Product Type': 'TEXT',
            'Source': 'TEXT',
            'Reference ID': 'TEXT',
            'Last Sync Time': 'TEXT',
            'Status': 'TEXT',
            'Usage unit': 'TEXT',
            'Purchase Rate': 'TEXT',
            'Purchase Account': 'TEXT',
            'Purchase Account Code': 'TEXT',
            'Purchase Description': 'TEXT',
            'Inventory Account': 'TEXT',
            'Inventory Account Code': 'TEXT',
            'Inventory Valuation Method': 'TEXT',
            'Reorder Point': 'TE

In [20]:
# TEST THE UPDATED MAPPINGS
print("🧪 TESTING UPDATED ITEMS MAPPINGS")
print("="*50)

# Reload the mappings to get the updated schema
import importlib
import sys
if 'data_pipeline.mappings' in sys.modules:
    importlib.reload(sys.modules['data_pipeline.mappings'])

from data_pipeline.mappings import CANONICAL_SCHEMA, ITEMS_CSV_MAP

# Check the updated schema
items_schema = CANONICAL_SCHEMA['Items']['header_columns']
print(f"📊 Updated schema columns: {len(items_schema)}")

# Check mapping coverage now
csv_columns = list(csv_items_df.columns)
schema_columns = list(items_schema.keys())

# Find overlaps
mapped_columns = set()
for csv_col in csv_columns:
    if csv_col in ITEMS_CSV_MAP:
        canonical_col = ITEMS_CSV_MAP[csv_col]
        if canonical_col in schema_columns:
            mapped_columns.add(csv_col)

print(f"✅ CSV columns successfully mapped: {len(mapped_columns)}/{len(csv_columns)}")
print(f"📈 Coverage improvement: {len(mapped_columns)/len(csv_columns)*100:.1f}%")

# Show any remaining unmapped columns
unmapped = [col for col in csv_columns if col not in mapped_columns]
if unmapped:
    print(f"❌ Still unmapped: {unmapped}")
else:
    print("🎉 ALL CSV COLUMNS ARE NOW MAPPED!")

🧪 TESTING UPDATED ITEMS MAPPINGS
📊 Updated schema columns: 53
✅ CSV columns successfully mapped: 41/41
📈 Coverage improvement: 100.0%
🎉 ALL CSV COLUMNS ARE NOW MAPPED!
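
The coverage check in this cell can be wrapped in a small function so it can be reused for other entities or run as an automated test. This is a sketch under the assumption that the mapping is a plain dict of CSV header to canonical name and that the schema exposes its column names as an iterable, as `ITEMS_CSV_MAP` and `CANONICAL_SCHEMA['Items']['header_columns']` appear to here. `mapping_coverage` is a hypothetical helper name.

```python
def mapping_coverage(csv_columns, csv_map, schema_columns):
    """Split CSV columns into those that map to a schema column and those that do not.

    Returns (mapped, unmapped), both in the original CSV column order.
    """
    schema = set(schema_columns)
    mapped, unmapped = [], []
    for col in csv_columns:
        target = csv_map.get(col)
        if target is not None and target in schema:
            mapped.append(col)
        else:
            unmapped.append(col)
    return mapped, unmapped
```

A check like this confirms only that every CSV header has a destination name in the schema. Whether the values actually arrive in the database would still need to be verified by re-running the ETL and repeating the comparison from section 5.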
