# 📊 AI Fashion Assistant v2.0 - Data Preparation

**Phase 1, Notebook 1/3**

---

## 🎯 Objectives

1. Load and validate `styles.csv`
2. Handle encoding issues (UTF-8 with fallback)
3. Clean and impute missing values
4. Validate image files
5. Build combined description field
6. Save clean dataset

---

## 📋 Quality Gates

- ✓ No missing critical fields (id, productDisplayName)
- ✓ Images loadable (at least 95% of a 1,000-image sample)
- ✓ Encoding validated
- ✓ Statistics logged

---

In [3]:
# ============================================================
# 1) SETUP
# ============================================================

from google.colab import drive
drive.mount("/content/drive", force_remount=False)

# Check GPU
!nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
NVIDIA A100-SXM4-40GB, 40960 MiB


In [7]:
# ============================================================
# 2) IMPORTS & PATH SETUP
# ============================================================

import pandas as pd
import numpy as np
from pathlib import Path
import json
from typing import Dict, List, Optional, Tuple
from tqdm.auto import tqdm
import warnings
from PIL import Image
import os

warnings.filterwarnings('ignore')

# ============================================================
# Paths
# ============================================================

# Project paths
PROJECT_ROOT = Path("/content/drive/MyDrive/ai_fashion_assistant_v2")
RAW_DATA_DIR = PROJECT_ROOT / "data/raw"
PROCESSED_DATA_DIR = PROJECT_ROOT / "data/processed"

# Old project path (for images - avoiding symlink issue)
OLD_PROJECT = Path("/content/drive/MyDrive/ai_fashion_assistant_v1")

# Ensure processed dir exists
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Project Root: {PROJECT_ROOT}")
print(f"📁 Raw Data: {RAW_DATA_DIR}")
print(f"📁 Processed Data: {PROCESSED_DATA_DIR}")
print(f"📁 Old Project: {OLD_PROJECT}")

# ============================================================
# Auto-detect files
# ============================================================

# Find styles.csv (check multiple locations)
print("\n🔍 Searching for styles.csv...")

styles_locations = [
    RAW_DATA_DIR / "styles.csv",
    RAW_DATA_DIR / "text/styles.csv",
    RAW_DATA_DIR / "data/styles.csv",
    OLD_PROJECT / "data/raw/styles.csv",
]

STYLES_CSV = None
for loc in styles_locations:
    if loc.exists():
        STYLES_CSV = loc
        size_mb = loc.stat().st_size / 1024 / 1024
        print(f"✅ Found: {STYLES_CSV}")
        print(f"   Size: {size_mb:.2f} MB")
        break

if STYLES_CSV is None:
    print("\n❌ styles.csv not found in any location!")
    print("   Searched:")
    for loc in styles_locations:
        print(f"   - {loc}")
    raise FileNotFoundError("styles.csv not found!")

# Find images directory (avoid symlink, use old project)
print("\n🔍 Searching for images directory...")

images_locations = [
    OLD_PROJECT / "data/raw/images",  # ⭐ PREFER OLD (no symlink issue)
    RAW_DATA_DIR / "text/images",
    RAW_DATA_DIR / "images",
]

IMAGES_DIR = None
for loc in images_locations:
    print(f"  Checking: {loc}")
    try:
        if loc.exists() and loc.is_dir():
            # Listing the directory once doubles as the readability test
            # (a broken symlink may raise an I/O error here)
            jpg_files = list(loc.glob("*.jpg"))
            if jpg_files:
                IMAGES_DIR = loc
                print(f"  ✅ Readable! {len(jpg_files):,} images")
                break
            else:
                print("  ⚠️ Empty directory")
    except OSError as e:
        print(f"  ❌ I/O error (likely symlink): {e}")
        continue

if IMAGES_DIR is None:
    print("\n❌ images directory not found or not readable!")
    print("   Searched:")
    for loc in images_locations:
        print(f"   - {loc}")
    raise FileNotFoundError("images directory not found or not readable!")

print("\n" + "=" * 80)
print("✅ ALL PATHS RESOLVED!")
print("=" * 80)
print(f"styles.csv: {STYLES_CSV}")
print(f"images/:    {IMAGES_DIR}")
print("=" * 80)

📁 Project Root: /content/drive/MyDrive/ai_fashion_assistant_v2
📁 Raw Data: /content/drive/MyDrive/ai_fashion_assistant_v2/data/raw
📁 Processed Data: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed
📁 Old Project: /content/drive/MyDrive/ai_fashion_assistant_v1

🔍 Searching for styles.csv...
✅ Found: /content/drive/MyDrive/ai_fashion_assistant_v2/data/raw/text/styles.csv
   Size: 4.13 MB

🔍 Searching for images directory...
  Checking: /content/drive/MyDrive/ai_fashion_assistant_v1/data/raw/images
  ✅ Readable! 44,192 images

✅ ALL PATHS RESOLVED!
styles.csv: /content/drive/MyDrive/ai_fashion_assistant_v2/data/raw/text/styles.csv
images/:    /content/drive/MyDrive/ai_fashion_assistant_v1/data/raw/images
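
Both search loops above follow the same pattern: walk an ordered list of candidate paths and keep the first one that exists. A minimal sketch of that pattern as a reusable helper is shown here; `first_existing` is a hypothetical name, not part of the notebook, and the temporary directory only stands in for the Drive layout.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate path that exists, or None if none do."""
    return next((p for p in candidates if p.exists()), None)


# Demo in a throwaway directory that mimics data/raw/text/styles.csv
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "text").mkdir()
    (root / "text" / "styles.csv").write_text("id,gender\n", encoding="utf-8")

    candidates = [root / "styles.csv", root / "text" / "styles.csv"]
    found = first_existing(candidates)
    found_name = found.relative_to(root).as_posix() if found else None
    nothing = first_existing([root / "missing.csv"])

print(found_name)
print(nothing)
```

Because the candidate order encodes the preference, reordering the list is enough to change which location wins.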


In [8]:
# ============================================================
# 3) LOAD DATA WITH ENCODING HANDLING
# ============================================================

def load_styles_csv(filepath: Path) -> pd.DataFrame:
    """
    Load styles.csv with encoding fallback.

    Try UTF-8 first, then ISO-8859-9 (Turkish), then latin1.
    Note: ISO-8859-9 and latin1 can decode any byte sequence, so the
    first of them tried after UTF-8 always succeeds.
    """
    encodings = ['utf-8', 'ISO-8859-9', 'latin1']

    for encoding in encodings:
        try:
            # on_bad_lines='skip' silently drops malformed rows
            df = pd.read_csv(filepath, encoding=encoding, on_bad_lines='skip')
            print(f"✅ Loaded with encoding: {encoding}")
            return df
        except UnicodeDecodeError as e:
            # Only decoding failures trigger the fallback; other errors propagate
            print(f"⚠️ Failed with {encoding}: {e}")
            continue

    raise ValueError("Failed to load CSV with any encoding!")

# Load
styles_path = STYLES_CSV
print(f"\n📋 Loading: {styles_path}\n")

df = load_styles_csv(styles_path)

print(f"\n✅ Loaded {len(df):,} rows")
print(f"   Columns: {list(df.columns)}")


📋 Loading: /content/drive/MyDrive/ai_fashion_assistant_v2/data/raw/text/styles.csv

✅ Loaded with encoding: utf-8

✅ Loaded 44,424 rows
   Columns: ['id', 'gender', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'season', 'year', 'usage', 'productDisplayName']
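
The order of the fallback encodings matters more than it may seem. The sketch below works on raw bytes instead of a CSV file and uses a hypothetical helper, `decode_with_fallback`, to illustrate that single-byte encodings such as ISO-8859-9 and latin1 never fail to decode, so whichever comes first after UTF-8 is the one that gets used. The sample string is made up for illustration.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Sequence[str] = ("utf-8", "iso8859_9", "latin1"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("Failed to decode with any encoding")


# Turkish text stored as ISO-8859-9 is not valid UTF-8
turkish_bytes = "Gömlek Şık".encode("iso8859_9")
text, used = decode_with_fallback(turkish_bytes)

# Plain ASCII is valid UTF-8, so the first encoding wins
ascii_text, ascii_used = decode_with_fallback(b"Navy Blue Shirt")

# latin1 would also "succeed" on the Turkish bytes, but with wrong characters
as_latin1 = turkish_bytes.decode("latin1")

print(used, text)
print(ascii_used, ascii_text)
print(as_latin1 == text)
```

A successful load therefore does not prove the chosen encoding was the right one; spot-checking a few non-ASCII product names is a reasonable follow-up.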


In [9]:
# ============================================================
# 4) INITIAL DATA INSPECTION
# ============================================================

print("📊 DATA OVERVIEW")
print("=" * 60)

# Shape
print(f"\nShape: {df.shape}")

# Columns and types
print("\nColumn Types:")
print(df.dtypes)

# Missing values
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percentage': missing_pct
}).sort_values('Missing', ascending=False)
print(missing_df[missing_df['Missing'] > 0])

# First few rows
print("\nFirst 3 Rows:")
display(df.head(3))

📊 DATA OVERVIEW

Shape: (44424, 10)

Column Types:
id                      int64
gender                 object
masterCategory         object
subCategory            object
articleType            object
baseColour             object
season                 object
year                  float64
usage                  object
productDisplayName     object
dtype: object

Missing Values:
                    Missing  Percentage
usage                   317        0.71
season                   21        0.05
baseColour               15        0.03
productDisplayName        7        0.02
year                      1        0.00

First 3 Rows:


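|   | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|----|--------|----------------|-------------|-------------|------------|--------|------|-------|--------------------|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |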


In [10]:
# ============================================================
# 5) DATA CLEANING
# ============================================================

print("🧹 CLEANING DATA...\n")

# Make a copy
df_clean = df.copy()

# 5.1) Remove duplicates by id
initial_count = len(df_clean)
df_clean = df_clean.drop_duplicates(subset=['id'], keep='first')
print(f"✅ Removed {initial_count - len(df_clean)} duplicate IDs")

# 5.2) Handle missing critical fields
critical_fields = ['id', 'productDisplayName']

for field in critical_fields:
    if field not in df_clean.columns:
        raise ValueError(f"Critical field '{field}' not found!")

    missing_count = df_clean[field].isnull().sum()
    if missing_count > 0:
        print(f"⚠️ Dropping {missing_count} rows with missing '{field}'")
        df_clean = df_clean.dropna(subset=[field])

print(f"\n✅ After cleaning critical fields: {len(df_clean):,} rows")

# 5.3) Impute missing categorical values
categorical_fields = [
    'masterCategory', 'subCategory', 'articleType',
    'baseColour', 'gender', 'season', 'usage'
]

for field in categorical_fields:
    if field in df_clean.columns:
        missing_before = df_clean[field].isnull().sum()
        if missing_before > 0:
            # Fill with 'Unknown'
            df_clean[field] = df_clean[field].fillna('Unknown')
            print(f"  Imputed {missing_before} missing values in '{field}' with 'Unknown'")

# 5.4) Handle year field
if 'year' in df_clean.columns:
    # Convert to int, fill missing with median
    df_clean['year'] = pd.to_numeric(df_clean['year'], errors='coerce')
    median_year = df_clean['year'].median()
    df_clean['year'] = df_clean['year'].fillna(median_year).astype(int)
    print(f"\n  Imputed year with median: {int(median_year)}")

print("\n✅ Data cleaning completed!")

🧹 CLEANING DATA...

✅ Removed 0 duplicate IDs
⚠️ Dropping 7 rows with missing 'productDisplayName'

✅ After cleaning critical fields: 44,417 rows
  Imputed 10 missing values in 'baseColour' with 'Unknown'
  Imputed 21 missing values in 'season' with 'Unknown'
  Imputed 312 missing values in 'usage' with 'Unknown'

  Imputed year with median: 2012

✅ Data cleaning completed!
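
The cleaning cell applies its steps in a fixed order: deduplicate, drop rows missing a critical field, then impute. Since rows are dropped before imputation, the imputed counts can be lower than the missing counts seen during inspection, which is presumably why `baseColour` went from 15 missing to 10 imputed. The sketch below replays the same order on a small, made-up DataFrame; the toy values are illustrative assumptions, not rows from `styles.csv`.

```python
import pandas as pd

# Toy data: one duplicate id, one missing name, missing colours and year
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "productDisplayName": ["Shirt", "Jeans", "Jeans", None, "Watch", "Cap"],
    "baseColour": ["Blue", None, None, None, "Silver", "Red"],
    "year": [2011.0, 2012.0, 2012.0, 2013.0, None, 2013.0],
})

# 1) Deduplicate by id, 2) drop rows missing a critical field
clean = (
    toy.drop_duplicates(subset=["id"], keep="first")
       .dropna(subset=["id", "productDisplayName"])
       .copy()
)

# 3) Impute: 'Unknown' for categoricals, median for year
imputed_colours = int(clean["baseColour"].isnull().sum())
clean["baseColour"] = clean["baseColour"].fillna("Unknown")
median_year = clean["year"].median()
clean["year"] = clean["year"].fillna(median_year).astype(int)

print(len(clean), imputed_colours)
print(clean["baseColour"].tolist())
print(clean["year"].tolist())
```

In the raw toy data three ids have a missing colour, but only one of them is imputed, because one duplicate and one nameless row were removed first.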


In [11]:
# ============================================================
# 6) VALIDATE IMAGES
# ============================================================

print("🖼️ VALIDATING IMAGES...\n")

# Use auto-detected images directory
images_dir = IMAGES_DIR
print(f"📁 Images directory: {images_dir}")

def validate_image(image_id: int, images_dir: Path) -> Tuple[bool, Optional[str]]:
    """
    Check if image exists and is loadable.

    Returns:
        (valid: bool, error: Optional[str])
    """
    image_path = images_dir / f"{image_id}.jpg"

    if not image_path.exists():
        return False, "not_found"

    try:
        # Context manager closes the file handle even if verify() raises
        with Image.open(image_path) as img:
            img.verify()  # Check if corrupted
        return True, None
    except Exception as e:
        return False, f"corrupted: {str(e)}"

# Validate a sample only (first 1000 rows, for speed)
sample_size = 1000
print(f"\n🔍 Validating sample of {sample_size} images...")

sample_ids = df_clean['id'].head(sample_size)
validation_results = []

for img_id in tqdm(sample_ids, desc="Validating"):
    valid, error = validate_image(img_id, images_dir)
    validation_results.append({
        'id': img_id,
        'valid': valid,
        'error': error
    })

validation_df = pd.DataFrame(validation_results)

# Stats
valid_count = validation_df['valid'].sum()
invalid_count = len(validation_df) - valid_count

print(f"\n✅ Valid images: {valid_count}/{len(validation_df)} ({valid_count/len(validation_df)*100:.1f}%)")
print(f"❌ Invalid images: {invalid_count}")

if invalid_count > 0:
    print("\nError breakdown:")
    print(validation_df[~validation_df['valid']]['error'].value_counts())

# Add image_path column (set for every row, including IDs whose image was not found)
df_clean['image_path'] = df_clean['id'].apply(
    lambda x: str(images_dir / f"{x}.jpg")
)

print("\n✅ Image validation completed!")

🖼️ VALIDATING IMAGES...

📁 Images directory: /content/drive/MyDrive/ai_fashion_assistant_v1/data/raw/images

🔍 Validating sample of 1000 images...

✅ Valid images: 995/1000 (99.5%)
❌ Invalid images: 5

Error breakdown:
error
not_found    5
Name: count, dtype: int64

✅ Image validation completed!
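
The cell above checks only the first 1,000 rows, and every failure in that sample was `not_found`. A cheap way to extend the existence check to all IDs is to list the directory once and compare sets, rather than calling `exists()` per ID. This is a minimal sketch under that assumption; `find_missing_images` is a hypothetical helper, it does not detect corrupted files (that still needs `Image.verify()`), and the temporary directory stands in for the real images folder.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Set


def find_missing_images(ids: Iterable[int], images_dir: Path) -> Set[int]:
    """Return the ids that have no '<id>.jpg' file in images_dir."""
    # One directory listing instead of one filesystem call per id
    available = {p.stem for p in images_dir.glob("*.jpg")}
    return {i for i in ids if str(i) not in available}


# Demo with empty placeholder files named like the dataset images
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for image_id in (15970, 39386):
        (tmp_dir / f"{image_id}.jpg").touch()

    missing = find_missing_images([15970, 39386, 59263], tmp_dir)

print(missing)
```

The resulting set could then be used to filter the DataFrame, for example with `df_clean[~df_clean['id'].isin(missing)]`, before `image_path` is relied on downstream.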


In [12]:
# ============================================================
# 7) BUILD COMBINED DESCRIPTION FIELD
# ============================================================

print("📝 BUILDING DESCRIPTION FIELD...\n")

def build_description(row: pd.Series) -> str:
    """
    Create a combined text description from product attributes.

    Format: "{productDisplayName} {masterCategory} {subCategory} {articleType}
             {baseColour} {gender} {season} {usage}"
    """
    parts = []

    fields_in_order = [
        'productDisplayName',
        'masterCategory',
        'subCategory',
        'articleType',
        'baseColour',
        'gender',
        'season',
        'usage'
    ]

    for field in fields_in_order:
        if field in row and pd.notna(row[field]):
            value = str(row[field]).strip()
            if value and value != 'Unknown':
                parts.append(value)

    return ' '.join(parts)

# Apply
print("Creating 'desc' field...")
df_clean['desc'] = df_clean.apply(build_description, axis=1)

# Show examples
print("\nExample descriptions:")
for idx, row in df_clean.head(3).iterrows():
    print(f"\nID {row['id']}:")
    print(f"  {row['desc'][:150]}...")

# Check lengths
desc_lengths = df_clean['desc'].str.len()
print("\n📊 Description lengths:")
print(f"  Mean: {desc_lengths.mean():.0f} chars")
print(f"  Median: {desc_lengths.median():.0f} chars")
print(f"  Min: {desc_lengths.min()} chars")
print(f"  Max: {desc_lengths.max()} chars")

print("\n✅ Description field created!")

📝 BUILDING DESCRIPTION FIELD...

Creating 'desc' field...

Example descriptions:

ID 15970:
  Turtle Check Men Navy Blue Shirt Apparel Topwear Shirts Navy Blue Men Fall Casual...

ID 39386:
  Peter England Men Party Blue Jeans Apparel Bottomwear Jeans Blue Men Summer Casual...

ID 59263:
  Titan Women Silver Watch Accessories Watches Watches Silver Women Winter Casual...

📊 Description lengths:
  Mean: 86 chars
  Median: 84 chars
  Min: 53 chars
  Max: 151 chars

✅ Description field created!
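
To make the skipping rule in `build_description` concrete, the sketch below applies the same logic to a plain dictionary instead of a `pd.Series`. The function name `build_description_from_dict` and the sample row (with `season` set to `'Unknown'` and `usage` missing) are illustrative assumptions.

```python
from typing import Any, Dict

FIELDS_IN_ORDER = [
    "productDisplayName", "masterCategory", "subCategory", "articleType",
    "baseColour", "gender", "season", "usage",
]


def build_description_from_dict(row: Dict[str, Any]) -> str:
    """Join the non-empty attribute values, skipping None and 'Unknown'."""
    parts = []
    for field in FIELDS_IN_ORDER:
        value = row.get(field)
        if value is None:
            continue
        value = str(value).strip()
        if value and value != "Unknown":
            parts.append(value)
    return " ".join(parts)


sample_row = {
    "productDisplayName": "Titan Women Silver Watch",
    "masterCategory": "Accessories",
    "subCategory": "Watches",
    "articleType": "Watches",
    "baseColour": "Silver",
    "gender": "Women",
    "season": "Unknown",   # imputed placeholder, left out of the description
    "usage": None,         # missing value, also left out
}

desc = build_description_from_dict(sample_row)
print(desc)
```

As in the notebook output, values that appear in several fields (here "Watches" and "Silver") are repeated in the description; no deduplication is attempted.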


In [13]:
# ============================================================
# 8) SAVE CLEAN DATA
# ============================================================

print("💾 SAVING CLEAN DATA...\n")

# Output path
output_path = PROCESSED_DATA_DIR / "meta_clean.csv"

# Save
df_clean.to_csv(output_path, index=False, encoding='utf-8')
print(f"✅ Saved to: {output_path}")
print(f"   Rows: {len(df_clean):,}")
print(f"   Size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")

# Save image directory path for future use
meta_info = {
    'images_directory': str(IMAGES_DIR),
    'styles_csv_source': str(STYLES_CSV),
    'total_products': len(df_clean),
    'creation_date': pd.Timestamp.now().isoformat()
}

meta_info_path = PROCESSED_DATA_DIR / "meta_info.json"
with open(meta_info_path, 'w', encoding='utf-8') as f:
    json.dump(meta_info, f, indent=2)

print(f"\n✅ Meta info saved: {meta_info_path}")

💾 SAVING CLEAN DATA...

✅ Saved to: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed/meta_clean.csv
   Rows: 44,417
   Size: 10.88 MB

✅ Meta info saved: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed/meta_info.json


In [14]:
# ============================================================
# 9) GENERATE STATISTICS
# ============================================================

print("📊 GENERATING STATISTICS...\n")

stats = {
    'total_products': len(df_clean),
    'columns': list(df_clean.columns),
    'data_types': {col: str(dtype) for col, dtype in df_clean.dtypes.items()},
    'missing_values': df_clean.isnull().sum().to_dict(),
    'categorical_distributions': {},
    'description_stats': {
        'mean_length': float(desc_lengths.mean()),
        'median_length': float(desc_lengths.median()),
        'min_length': int(desc_lengths.min()),
        'max_length': int(desc_lengths.max())
    }
}

# Categorical distributions
categorical_cols = [
    'masterCategory', 'gender', 'baseColour', 'season', 'articleType'
]

for col in categorical_cols:
    if col in df_clean.columns:
        value_counts = df_clean[col].value_counts().head(10).to_dict()
        stats['categorical_distributions'][col] = value_counts

# Save stats
stats_path = PROCESSED_DATA_DIR / "data_stats.json"
with open(stats_path, 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2, ensure_ascii=False)

print(f"✅ Statistics saved to: {stats_path}")

# Display key stats
print("\n📊 KEY STATISTICS:")
print("=" * 60)
print(f"Total Products: {stats['total_products']:,}")
print(f"\nTop 5 Categories:")
for cat, count in list(stats['categorical_distributions'].get('masterCategory', {}).items())[:5]:
    print(f"  {cat}: {count:,}")
print(f"\nTop 5 Colors:")
for color, count in list(stats['categorical_distributions'].get('baseColour', {}).items())[:5]:
    print(f"  {color}: {count:,}")
print(f"\nGender Distribution:")
for gender, count in stats['categorical_distributions'].get('gender', {}).items():
    print(f"  {gender}: {count:,}")

📊 GENERATING STATISTICS...

✅ Statistics saved to: /content/drive/MyDrive/ai_fashion_assistant_v2/data/processed/data_stats.json

📊 KEY STATISTICS:
Total Products: 44,417

Top 5 Categories:
  Apparel: 21,397
  Accessories: 11,272
  Footwear: 9,219
  Personal Care: 2,398
  Free Items: 105

Top 5 Colors:
  Black: 9,728
  White: 5,538
  Blue: 4,918
  Brown: 3,493
  Grey: 2,741

Gender Distribution:
  Men: 22,144
  Women: 18,627
  Unisex: 2,161
  Boys: 830
  Girls: 655


In [15]:
# ============================================================
# 10) QUALITY GATES VALIDATION
# ============================================================

print("\n🎯 QUALITY GATES VALIDATION")
print("=" * 60)

gates_passed = True

# Gate 1: No missing critical fields
critical_missing = df_clean[['id', 'productDisplayName']].isnull().sum().sum()
if critical_missing == 0:
    print("✅ Gate 1: No missing critical fields")
else:
    print(f"❌ Gate 1: {critical_missing} missing critical values!")
    gates_passed = False

# Gate 2: Images loadable (sample check: at least 95% of the validated sample)
if valid_count / len(validation_df) >= 0.95:
    print("✅ Gate 2: Images loadable (>95%)")
else:
    print(f"❌ Gate 2: Only {valid_count/len(validation_df)*100:.1f}% images loadable!")
    gates_passed = False

# Gate 3: Encoding validated
try:
    df_clean['productDisplayName'].str.encode('utf-8')
    print("✅ Gate 3: Encoding validated (UTF-8)")
except UnicodeEncodeError:
    print("❌ Gate 3: Encoding issues detected!")
    gates_passed = False

# Gate 4: Statistics logged
if stats_path.exists():
    print("✅ Gate 4: Statistics logged")
else:
    print("❌ Gate 4: Statistics not saved!")
    gates_passed = False

print("=" * 60)
if gates_passed:
    print("\n🎉 ALL QUALITY GATES PASSED!")
    print("✅ Ready for Phase 1, Notebook 2 (SSOT Schema)")
else:
    print("\n⚠️ SOME QUALITY GATES FAILED!")
    print("   Please review and fix before proceeding.")


🎯 QUALITY GATES VALIDATION
✅ Gate 1: No missing critical fields
✅ Gate 2: Images loadable (>95%)
✅ Gate 3: Encoding validated (UTF-8)
✅ Gate 4: Statistics logged

🎉 ALL QUALITY GATES PASSED!
✅ Ready for Phase 1, Notebook 2 (SSOT Schema)
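
The gate checks above share one shape: evaluate a condition, report it, and record any failure. One possible way to express that is a list of named checks run by a small helper, sketched below. The names `Gate` and `run_gates` are hypothetical, the sample numbers (995 valid out of 1,000) come from the validation output, and the product names are taken from the first rows shown earlier.

```python
from typing import Callable, List, Tuple

# A gate is a (name, zero-argument check) pair; the check returns True on pass
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run every gate and return the names of those that failed."""
    failed = []
    for name, check in gates:
        try:
            passed = bool(check())
        except Exception:
            # A check that raises counts as a failed gate
            passed = False
        if not passed:
            failed.append(name)
    return failed


valid_count, sample_size = 995, 1000
names = ["Turtle Check Men Navy Blue Shirt", "Titan Women Silver Watch"]

gates: List[Gate] = [
    ("no_missing_names", lambda: all(n is not None for n in names)),
    ("images_loadable_95pct", lambda: valid_count / sample_size >= 0.95),
    ("utf8_encodable", lambda: all(isinstance(n.encode("utf-8"), bytes) for n in names)),
]

failed = run_gates(gates)
print("ALL GATES PASSED" if not failed else f"FAILED: {failed}")
```

Collecting failures into a list, rather than flipping a single boolean, makes it easy to raise an error that names every failed gate before the next notebook runs.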


---

## 📋 Summary

**Outputs Created:**
- ✅ `data/processed/meta_clean.csv` - Clean product data
- ✅ `data/processed/meta_info.json` - Source paths, product count, and creation date
- ✅ `data/processed/data_stats.json` - Dataset statistics

**Quality Gates:**
- ✅ No missing critical fields
- ✅ Images loadable (99.5% of a 1,000-image sample)
- ✅ Encoding validated
- ✅ Statistics logged

**Next Notebook:** `02_schema_normalization_SSOT.ipynb`

---