# 🎯 AI Fashion Assistant v2.0 - SSOT Schema Normalization

**Phase 1, Notebook 2/3** ⭐ **CRITICAL**

---

## 🎯 Objectives

1. Define the **SSOT (Single Source of Truth)** schema
2. Validate `schema.py` module
3. Apply schema to cleaned data
4. Create SSOT-compliant dataset
5. Validate product IDs (no name-based matching!)
6. Test normalization functions

---

## 🌟 Why SSOT?

**Problem:** Each notebook follows its own conventions for:
- Column names (`productName` vs `productDisplayName`)
- ID types (`int` vs `str`)
- Text normalization (lowercasing, whitespace stripping, Turkish character handling)

**Solution:** ONE schema for ALL notebooks!

---

## 📋 Quality Gates

- ✓ Schema validation passes for at least 99% of products (target: 100%)
- ✓ All notebooks can import schema
- ✓ ID mapping consistent (int, no duplicates)
- ✓ Normalization deterministic

---

In [1]:
# ============================================================
# 1) SETUP
# ============================================================

from google.colab import drive
drive.mount("/content/drive", force_remount=False)

Mounted at /content/drive


In [2]:
# ============================================================
# 2) ADD src/ TO PATH (IMPORT SCHEMA MODULE)
# ============================================================

import sys
from pathlib import Path

# Project paths
PROJECT_ROOT = Path("/content/drive/MyDrive/ai_fashion_assistant_v2")
SRC_DIR = PROJECT_ROOT / "src"

# Add to Python path
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))
    print(f"✅ Added to path: {SRC_DIR}")

# Test import
try:
    import schema
    print("✅ schema.py imported successfully!")
    print(f"   Location: {schema.__file__}")
except ImportError as e:
    print(f"❌ Failed to import schema: {e}")
    print(f"\nPlease ensure schema.py exists at: {SRC_DIR / 'schema.py'}")
    raise

✅ Added to path: /content/drive/MyDrive/ai_fashion_assistant_v2/src
✅ schema.py imported successfully!
   Location: /content/drive/MyDrive/ai_fashion_assistant_v2/src/schema.py


In [3]:
# ============================================================
# 3) IMPORTS
# ============================================================

import pandas as pd
import numpy as np
import json
from typing import Dict, List, Optional
from tqdm.auto import tqdm
import warnings

# Import SSOT schema classes
from schema import (
    Product,
    QueryRecord,
    Candidate,
    GroundTruth,
    normalize_text,
    validate_product,
    Intent,
    Gender,
    MasterCategory,
    Season,
    Usage
)

warnings.filterwarnings('ignore')

print("✅ All imports successful!")
print(f"\n📋 Available schema classes:")
print("   - Product")
print("   - QueryRecord")
print("   - Candidate")
print("   - GroundTruth")
print("   - normalize_text()")
print("   - validate_product()")

✅ All imports successful!

📋 Available schema classes:
   - Product
   - QueryRecord
   - Candidate
   - GroundTruth
   - normalize_text()
   - validate_product()


In [4]:
# ============================================================
# 4) LOAD CLEANED DATA
# ============================================================

PROCESSED_DIR = PROJECT_ROOT / "data/processed"
SCHEMA_DIR = PROJECT_ROOT / "data/schemas"

# Ensure schema dir exists
SCHEMA_DIR.mkdir(parents=True, exist_ok=True)

print("📂 Loading cleaned data...\n")

# Load meta_clean.csv (from previous notebook)
meta_clean_path = PROCESSED_DIR / "meta_clean.csv"

if not meta_clean_path.exists():
    raise FileNotFoundError(
        f"meta_clean.csv not found!\n"
        f"Please run 01_data_preparation.ipynb first.\n"
        f"Expected: {meta_clean_path}"
    )

df = pd.read_csv(meta_clean_path)

print(f"✅ Loaded: {len(df):,} products")
print(f"   Columns: {list(df.columns)}")
print(f"\nFirst row:")
display(df.head(1))

📂 Loading cleaned data...

✅ Loaded: 44,417 products
   Columns: ['id', 'gender', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'season', 'year', 'usage', 'productDisplayName', 'image_path', 'desc']

First row:


| | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName | image_path | desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011 | Casual | Turtle Check Men Navy Blue Shirt | /content/drive/MyDrive/ai_fashion_assistant_v1... | Turtle Check Men Navy Blue Shirt Apparel Topwe... |


In [5]:
# ============================================================
# 5) TEST NORMALIZATION FUNCTIONS
# ============================================================

print("🧪 TESTING NORMALIZATION FUNCTIONS\n")
print("=" * 80)

# Test cases
test_strings = [
    "Kırmızı Kadın Elbise!",
    "  BEYAZ SPOR AYAKKABI  ",
    "Mavi-Lacivert Gömlek (XL)",
    "çok güzel şapka"
]

for test_str in test_strings:
    print(f"\nOriginal: '{test_str}'")

    # Standard normalization
    norm_standard = normalize_text(test_str, mode='standard')
    print(f"  standard:      '{norm_standard}'")

    # Aggressive normalization
    norm_aggressive = normalize_text(test_str, mode='aggressive')
    print(f"  aggressive:    '{norm_aggressive}'")

    # Turkish-aware normalization
    norm_turkish = normalize_text(test_str, mode='turkish_aware')
    print(f"  turkish_aware: '{norm_turkish}'")

print("\n" + "=" * 80)
print("✅ Normalization functions working!")

🧪 TESTING NORMALIZATION FUNCTIONS


Original: 'Kırmızı Kadın Elbise!'
  standard:      'kırmızı kadın elbise!'
  aggressive:    'kırmızı kadın elbise'
  turkish_aware: 'kirmizi kadin elbise'

Original: '  BEYAZ SPOR AYAKKABI  '
  standard:      'beyaz spor ayakkabi'
  aggressive:    'beyaz spor ayakkabi'
  turkish_aware: 'beyaz spor ayakkabi'

Original: 'Mavi-Lacivert Gömlek (XL)'
  standard:      'mavi-lacivert gömlek (xl)'
  aggressive:    'mavi lacivert gömlek xl'
  turkish_aware: 'mavi lacivert gomlek xl'

Original: 'çok güzel şapka'
  standard:      'çok güzel şapka'
  aggressive:    'çok güzel şapka'
  turkish_aware: 'cok guzel sapka'

✅ Normalization functions working!
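
The three modes can be read off the output above: `standard` lowercases and trims, `aggressive` also replaces punctuation with spaces, and `turkish_aware` additionally folds Turkish letters to ASCII. The real `schema.normalize_text` is not shown in this notebook, so the following is only a minimal sketch that reproduces the printed results; `normalize_text_sketch` and `_TR_FOLD` are hypothetical names, and the actual implementation may differ.

```python
import re

# Hypothetical fold table: lowercase Turkish letters mapped to ASCII look-alikes.
_TR_FOLD = str.maketrans("çğıöşü", "cgiosu")


def normalize_text_sketch(text: str, mode: str = "standard") -> str:
    """Illustrative stand-in for schema.normalize_text (not the real code)."""
    # All modes: lowercase and trim surrounding whitespace.
    out = text.lower().strip()
    if mode == "standard":
        return out
    if mode not in ("aggressive", "turkish_aware"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Replace punctuation with spaces, then collapse runs of whitespace.
    out = re.sub(r"[^\w\s]", " ", out)
    out = re.sub(r"\s+", " ", out).strip()
    if mode == "turkish_aware":
        out = out.translate(_TR_FOLD)
    return out


for mode in ("standard", "aggressive", "turkish_aware"):
    print(mode, "->", normalize_text_sketch("Mavi-Lacivert Gömlek (XL)", mode))
```

Python's `str.lower()` maps `I` to `i` rather than to the Turkish dotless `ı`, which is consistent with `AYAKKABI` becoming `ayakkabi` in every mode above.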


In [6]:
# ============================================================
# 6) CONVERT TO SSOT PRODUCT SCHEMA
# ============================================================

print("🔄 CONVERTING TO SSOT SCHEMA...\n")

def _str_or(value, default: str) -> str:
    """Return str(value), or `default` when the value is missing (NaN/None)."""
    return str(value) if pd.notna(value) else default

def dataframe_to_product(row: pd.Series) -> Product:
    """
    Convert a DataFrame row to the SSOT Product schema.

    Missing categorical values fall back to a default instead of the string 'nan'.
    """
    return Product(
        id=int(row['id']),
        productDisplayName=str(row['productDisplayName']),
        masterCategory=_str_or(row.get('masterCategory'), 'Unknown'),
        subCategory=_str_or(row.get('subCategory'), 'Unknown'),
        articleType=_str_or(row.get('articleType'), 'Unknown'),
        baseColour=_str_or(row.get('baseColour'), 'Unknown'),
        gender=_str_or(row.get('gender'), 'Unisex'),
        season=_str_or(row.get('season'), 'Unknown'),
        year=int(row['year']) if pd.notna(row.get('year')) else None,
        usage=str(row['usage']) if pd.notna(row.get('usage')) else None,
        desc=str(row['desc']) if pd.notna(row.get('desc')) else None,
        image_path=str(row['image_path']) if pd.notna(row.get('image_path')) else None
    )

# Convert sample
print("Converting sample (first 1000 rows)...")
sample_products = []

for idx, row in tqdm(df.head(1000).iterrows(), total=1000, desc="Converting"):
    try:
        product = dataframe_to_product(row)
        validate_product(product)  # Validate
        sample_products.append(product)
    except Exception as e:
        print(f"\n⚠️ Error at row {idx}: {e}")
        continue

print(f"\n✅ Converted {len(sample_products):,} products")
print(f"   Validation passed: {len(sample_products):,}/{1000}")

# Show example
print("\n📋 Example Product (schema):")
example = sample_products[0]
print(json.dumps(example.to_dict(), indent=2, ensure_ascii=False))

🔄 CONVERTING TO SSOT SCHEMA...

Converting sample (first 1000 rows)...



✅ Converted 1,000 products
   Validation passed: 1,000/1000

📋 Example Product (schema):
{
  "id": 15970,
  "productDisplayName": "Turtle Check Men Navy Blue Shirt",
  "masterCategory": "Apparel",
  "subCategory": "Topwear",
  "articleType": "Shirts",
  "baseColour": "Navy Blue",
  "gender": "Men",
  "season": "Fall",
  "year": 2011,
  "usage": "Casual",
  "desc": "Turtle Check Men Navy Blue Shirt Apparel Topwear Shirts Navy Blue Men Fall Casual",
  "image_path": "/content/drive/MyDrive/ai_fashion_assistant_v1/data/raw/images/15970.jpg",
  "text_embedding": null,
  "image_embedding": null,
  "hybrid_embedding": null
}
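
The JSON above shows which fields a `Product` carries, including three embedding slots that are still empty at this stage. Since `schema.py` itself is not reproduced here, the following is a rough sketch of how such a class could be written as a dataclass. `ProductSketch` and `validate_product_sketch` are hypothetical names, and the two validation rules are assumptions chosen for illustration, not the rules the real `validate_product` enforces.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ProductSketch:
    """Hypothetical stand-in for schema.Product; field names follow the JSON above."""
    id: int
    productDisplayName: str
    masterCategory: str
    subCategory: str
    articleType: str
    baseColour: str
    gender: str
    season: str
    year: Optional[int] = None
    usage: Optional[str] = None
    desc: Optional[str] = None
    image_path: Optional[str] = None
    text_embedding: Optional[List[float]] = None
    image_embedding: Optional[List[float]] = None
    hybrid_embedding: Optional[List[float]] = None

    def to_dict(self) -> dict:
        # asdict keeps the declaration order of the fields.
        return asdict(self)


def validate_product_sketch(p: ProductSketch) -> None:
    """Raise ValueError on the first violated rule (the rules are assumptions)."""
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(p.id, int) or isinstance(p.id, bool) or p.id < 0:
        raise ValueError(f"id must be a non-negative int, got {p.id!r}")
    if not p.productDisplayName.strip():
        raise ValueError("productDisplayName must not be empty")


p = ProductSketch(
    id=15970,
    productDisplayName="Turtle Check Men Navy Blue Shirt",
    masterCategory="Apparel",
    subCategory="Topwear",
    articleType="Shirts",
    baseColour="Navy Blue",
    gender="Men",
    season="Fall",
    year=2011,
    usage="Casual",
)
validate_product_sketch(p)
print(p.to_dict()["id"], p.to_dict()["text_embedding"])
```

Dataclasses do not enforce type annotations at runtime, which is why an explicit validation step is still needed to catch a string ID such as `"15970"`.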


In [7]:
# ============================================================
# 7) FULL CONVERSION & VALIDATION
# ============================================================

print("🔄 FULL CONVERSION (ALL PRODUCTS)...\n")

all_products = []
validation_errors = []

for idx, row in tqdm(df.iterrows(), total=len(df), desc="Converting all"):
    try:
        product = dataframe_to_product(row)
        validate_product(product)
        all_products.append(product)
    except Exception as e:
        validation_errors.append({
            'row_index': idx,
            'product_id': row.get('id'),
            'error': str(e)
        })

print(f"\n✅ Converted: {len(all_products):,} products")
print(f"❌ Errors: {len(validation_errors)}")

if validation_errors:
    print("\n⚠️ Validation errors (first 5):")
    for err in validation_errors[:5]:
        print(f"   Row {err['row_index']}: {err['error']}")

    # Save error log
    error_log_path = PROCESSED_DIR / "ssot_validation_errors.json"
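    with open(error_log_path, 'w', encoding='utf-8') as f:
        # default=str keeps the dump from failing on values that are not JSON-native
        json.dump(validation_errors, f, indent=2, ensure_ascii=False, default=str)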
    print(f"\n📝 Error log saved: {error_log_path}")

# Convert back to DataFrame for saving
df_ssot = pd.DataFrame([p.to_dict() for p in all_products])
print(f"\n✅ SSOT DataFrame created: {len(df_ssot):,} rows")

🔄 FULL CONVERSION (ALL PRODUCTS)...




✅ Converted: 44,417 products
❌ Errors: 0

✅ SSOT DataFrame created: 44,417 rows


In [8]:
# ============================================================
# 8) VALIDATE ID CONSISTENCY
# ============================================================

print("🔍 VALIDATING ID CONSISTENCY...\n")
print("=" * 80)

# Check 1: All IDs are integers
id_types = df_ssot['id'].apply(type).value_counts()
print(f"ID types:")
print(id_types)

if len(id_types) == 1 and id_types.index[0] == int:
    print("✅ All IDs are integers")
else:
    print("❌ Mixed ID types detected!")

# Check 2: No duplicate IDs
duplicate_ids = df_ssot['id'].duplicated().sum()
if duplicate_ids == 0:
    print("✅ No duplicate IDs")
else:
    print(f"❌ Found {duplicate_ids} duplicate IDs!")
    print(df_ssot[df_ssot['id'].duplicated(keep=False)][['id', 'productDisplayName']].head(10))

# Check 3: ID range
id_min = df_ssot['id'].min()
id_max = df_ssot['id'].max()
print(f"\nID range: {id_min:,} to {id_max:,}")

# Check 4: No missing IDs
missing_ids = df_ssot['id'].isnull().sum()
if missing_ids == 0:
    print("✅ No missing IDs")
else:
    print(f"❌ Found {missing_ids} missing IDs!")

print("=" * 80)
print("✅ ID validation completed!")

🔍 VALIDATING ID CONSISTENCY...

ID types:
id
<class 'int'>    44417
Name: count, dtype: int64
✅ All IDs are integers
✅ No duplicate IDs

ID range: 1,163 to 60,000
✅ No missing IDs
✅ ID validation completed!
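
The same four checks can also be expressed without a per-element `apply(type)`, by inspecting the column dtype directly. This is a small sketch under the assumption that the ID column is a regular pandas Series; `check_ids` is a hypothetical helper, not part of `schema.py`.

```python
import pandas as pd


def check_ids(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Return a small report on the ID column (hypothetical helper)."""
    ids = df[id_col]
    return {
        "integer_dtype": bool(pd.api.types.is_integer_dtype(ids)),
        "missing": int(ids.isna().sum()),
        "duplicates": int(ids.duplicated().sum()),
        "unique": bool(ids.is_unique),
    }


# IDs taken from the outputs above (first row, range minimum, range maximum).
good = pd.DataFrame({"id": [15970, 1163, 60000]})
# A deliberately broken column: float dtype, one duplicate, one missing value.
bad = pd.DataFrame({"id": [1.0, 1.0, None]})

print(check_ids(good))
print(check_ids(bad))
```

Returning a dictionary instead of printing makes the result easy to reuse in a later quality-gate cell.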


In [9]:
# ============================================================
# 9) SAVE SSOT-COMPLIANT DATASET
# ============================================================

print("💾 SAVING SSOT DATASET...\n")

# Save CSV
output_path = PROCESSED_DIR / "meta_ssot.csv"
df_ssot.to_csv(output_path, index=False, encoding='utf-8')

print(f"✅ Saved: {output_path}")
print(f"   Rows: {len(df_ssot):,}")
print(f"   Size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")

# Save as JSON (sample)
sample_json_path = PROCESSED_DIR / "meta_ssot_sample.json"
sample_data = [p.to_dict() for p in all_products[:100]]
with open(sample_json_path, 'w', encoding='utf-8') as f:
    json.dump(sample_data, f, indent=2, ensure_ascii=False)

print(f"\n✅ Sample JSON saved: {sample_json_path}")
print(f"   (100 products for inspection)")

💾 SAVING SSOT DATASET...

✅ Saved: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed/meta_ssot.csv
   Rows: 44,417
   Size: 11.01 MB

✅ Sample JSON saved: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed/meta_ssot_sample.json
   (100 products for inspection)
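
Because later notebooks will read `meta_ssot.csv` back from disk, it may be worth confirming that the ID type survives the round trip. The sketch below uses an in-memory buffer and a two-row frame instead of the real file, so it runs anywhere; the frame contents are illustrative, and pinning `dtype={"id": "int64"}` on read is a suggestion rather than something the original notebooks are known to do.

```python
import io

import pandas as pd

# Tiny stand-in for df_ssot; the embedding column is entirely empty, as above.
df_out = pd.DataFrame(
    {
        "id": [15970, 1163],
        "productDisplayName": ["Turtle Check Men Navy Blue Shirt", "Kırmızı Kadın Elbise"],
        "text_embedding": [None, None],
    }
)

# Write to an in-memory buffer instead of Google Drive.
buf = io.StringIO()
df_out.to_csv(buf, index=False)
buf.seek(0)

# Pin the ID dtype explicitly so it does not depend on type inference.
df_back = pd.read_csv(buf, dtype={"id": "int64"})

ids_match = df_back["id"].tolist() == df_out["id"].tolist()
names_match = df_back["productDisplayName"].tolist() == df_out["productDisplayName"].tolist()
print(ids_match, names_match, df_back["id"].dtype)
```

Note that `None` values are written as empty cells and come back as `NaN`, so consumers should test for missing embeddings with `isna()` rather than `is None`.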


In [10]:
# ============================================================
# 10) GENERATE SCHEMA DOCUMENTATION
# ============================================================

print("📝 GENERATING SCHEMA DOCUMENTATION...\n")

# Product schema
product_schema = {
    "name": "Product",
    "description": "Single product in catalog",
    "fields": {
        "id": {"type": "int", "required": True, "description": "Unique product ID"},
        "productDisplayName": {"type": "str", "required": True, "description": "Display name"},
        "masterCategory": {"type": "str", "required": True, "description": "Top-level category"},
        "subCategory": {"type": "str", "required": True, "description": "Sub-category"},
        "articleType": {"type": "str", "required": True, "description": "Article type"},
        "baseColour": {"type": "str", "required": True, "description": "Base color"},
        "gender": {"type": "str", "required": True, "description": "Target gender"},
        "season": {"type": "str", "required": True, "description": "Season"},
        "year": {"type": "int", "required": False, "description": "Year"},
        "usage": {"type": "str", "required": False, "description": "Usage context"},
        "desc": {"type": "str", "required": False, "description": "Combined description"},
        "image_path": {"type": "str", "required": False, "description": "Image file path"}
    },
    "example": all_products[0].to_dict()
}

# Query schema
query_schema = {
    "name": "QueryRecord",
    "description": "User query with normalization and attributes",
    "fields": {
        "query_id": {"type": "str", "required": True, "description": "Unique query ID"},
        "query_text_original": {"type": "str", "required": True, "description": "Original input"},
        "query_text_tr": {"type": "str", "required": True, "description": "Turkish normalized"},
        "query_norm": {"type": "str", "required": True, "description": "Fully normalized"},
        "intent": {"type": "Intent", "required": False, "description": "User intent"},
        "slots": {"type": "Dict", "required": False, "description": "Extracted slots"}
    },
    "example": {
        "query_id": "q001",
        "query_text_original": "Kırmızı kadın elbise",
        "query_text_tr": "kırmızı kadın elbise",
        "query_norm": "kirmizi kadin elbise",
        "intent": "search",
        "slots": {"color": "red", "gender": "women", "articleType": "dress"}
    }
}

# Ground truth schema
gt_schema = {
    "name": "GroundTruth",
    "description": "Ground truth relevance labels",
    "fields": {
        "query_id": {"type": "str", "required": True, "description": "Query ID"},
        "product_id": {"type": "int", "required": True, "description": "Product ID (NOT name!)"},
        "relevance": {"type": "int", "required": True, "description": "0=not, 1=relevant, 2=highly"}
    },
    "example": {
        "query_id": "q001",
        "product_id": 1234,
        "relevance": 2
    }
}

# Save all schemas
all_schemas = {
    "version": "1.0",
    "created": pd.Timestamp.now().isoformat(),
    "schemas": {
        "Product": product_schema,
        "QueryRecord": query_schema,
        "GroundTruth": gt_schema
    }
}

schema_doc_path = SCHEMA_DIR / "ssot_schemas.json"
with open(schema_doc_path, 'w', encoding='utf-8') as f:
    json.dump(all_schemas, f, indent=2, ensure_ascii=False)

print(f"✅ Schema documentation saved: {schema_doc_path}")

📝 GENERATING SCHEMA DOCUMENTATION...

✅ Schema documentation saved: /content/drive/MyDrive/ai_fashion_assistant_v2/data/schemas/ssot_schemas.json
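
The `fields` entries written to `ssot_schemas.json` are plain data, so they can double as a lightweight runtime check. Here is a minimal sketch of that idea, assuming only the `type` and `required` keys used above; `check_record` and `_TYPE_MAP` are hypothetical names, and spec types outside the map (such as `Intent` or `Dict`) are simply not type-checked.

```python
from typing import Any, Dict, List

# Maps type names used in the spec to Python types (assumed subset).
_TYPE_MAP = {"int": int, "str": str}


def check_record(record: Dict[str, Any], fields: Dict[str, dict]) -> List[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            if spec.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        expected = _TYPE_MAP.get(spec["type"])
        # bool is a subclass of int, so it is rejected as well.
        if expected is not None and (
            not isinstance(value, expected) or isinstance(value, bool)
        ):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


# Same shape as the GroundTruth "fields" block above, descriptions omitted.
gt_fields = {
    "query_id": {"type": "str", "required": True},
    "product_id": {"type": "int", "required": True},
    "relevance": {"type": "int", "required": True},
}

ok = check_record({"query_id": "q001", "product_id": 1234, "relevance": 2}, gt_fields)
bad = check_record({"query_id": "q001", "product_id": "Navy Blue Shirt"}, gt_fields)
print(ok)
print(bad)
```

The second record fails twice: it uses a product name where an integer ID is required, and it omits `relevance`.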


In [11]:
# ============================================================
# 11) QUALITY GATES VALIDATION
# ============================================================

print("\n🎯 QUALITY GATES VALIDATION")
print("=" * 80)

gates_passed = True

# Gate 1: Schema validation passes
validation_rate = len(all_products) / len(df)
if validation_rate >= 0.99:
    print(f"✅ Gate 1: Schema validation passes ({validation_rate*100:.2f}%)")
else:
    print(f"❌ Gate 1: Only {validation_rate*100:.2f}% validated!")
    gates_passed = False

# Gate 2: All notebooks can import schema
try:
    from schema import Product, QueryRecord
    print("✅ Gate 2: Schema module importable")
except ImportError:
    print("❌ Gate 2: Schema import failed!")
    gates_passed = False

# Gate 3: ID mapping consistent
if duplicate_ids == 0 and missing_ids == 0:
    print("✅ Gate 3: ID mapping consistent")
else:
    print(f"❌ Gate 3: ID issues (duplicates={duplicate_ids}, missing={missing_ids})")
    gates_passed = False

# Gate 4: Normalization deterministic
test_text = "Kırmızı Kadın Elbise"
norm1 = normalize_text(test_text, mode='aggressive')
norm2 = normalize_text(test_text, mode='aggressive')
if norm1 == norm2:
    print("✅ Gate 4: Normalization deterministic")
else:
    print("❌ Gate 4: Normalization non-deterministic!")
    gates_passed = False

print("=" * 80)

if gates_passed:
    print("\n🎉 ALL QUALITY GATES PASSED!")
    print("✅ SSOT schema established!")
    print("✅ Ready for Phase 1, Notebook 3 (EDA)")
else:
    print("\n⚠️ SOME QUALITY GATES FAILED!")
    print("   Please review and fix before proceeding.")


🎯 QUALITY GATES VALIDATION
✅ Gate 1: Schema validation passes (100.00%)
✅ Gate 2: Schema module importable
✅ Gate 3: ID mapping consistent
✅ Gate 4: Normalization deterministic

🎉 ALL QUALITY GATES PASSED!
✅ SSOT schema established!
✅ Ready for Phase 1, Notebook 3 (EDA)
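
Gate 4 compares two calls on a single string. A somewhat stronger check is to run several samples and also test idempotence, meaning that normalizing an already normalized string changes nothing. The sketch below shows this with a stand-in normalizer; `check_normalizer` and `simple_norm` are hypothetical, and in the notebook one would pass a wrapper around `normalize_text` instead.

```python
from typing import Callable, Iterable, List


def check_normalizer(fn: Callable[[str], str], samples: Iterable[str]) -> List[str]:
    """Return the samples for which fn is unstable or not idempotent."""
    failures = []
    for s in samples:
        once = fn(s)
        # Unstable: a second call disagrees. Not idempotent: fn(fn(s)) != fn(s).
        if fn(s) != once or fn(once) != once:
            failures.append(s)
    return failures


def simple_norm(text: str) -> str:
    # Stand-in normalizer for the demo: lowercase, then collapse whitespace.
    return " ".join(text.lower().split())


samples = [
    "Kırmızı Kadın Elbise",
    "  BEYAZ SPOR AYAKKABI  ",
    "Mavi-Lacivert Gömlek (XL)",
]

stable = check_normalizer(simple_norm, samples)
unstable = check_normalizer(lambda t: t + "!", samples)
print(stable)
print(len(unstable))
```

A function that appends a character on every call is deterministic but not idempotent, so it fails this check even though it would pass the two-call comparison used in Gate 4.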


---

## 📋 Summary

**Outputs Created:**
- ✅ `data/processed/meta_ssot.csv` - SSOT-compliant dataset
- ✅ `data/processed/meta_ssot_sample.json` - JSON sample
- ✅ `data/schemas/ssot_schemas.json` - Schema documentation

**Quality Gates:**
- ✅ Schema validation ≥99% (100.00% in this run)
- ✅ Schema module importable
- ✅ ID mapping consistent
- ✅ Normalization deterministic

**Key Achievement:**
- ⭐ **Single Source of Truth established!**
- ⭐ **All future notebooks will use this schema**
- ⭐ **No more ID/name mismatches!**

**Next Notebook:** `03_exploratory_analysis.ipynb`

---