# AI Fashion Assistant v2.4.5 - Multi-Modal RAG

**Architecture & Planning**

---

**Project:** AI Fashion Assistant (TÜBİTAK 2209-A)  
**Student:** Hatice Baydemir  
**Date:** January 6, 2026  
**Version:** 2.4.5

---

## Goal

Extend the v2.2 RAG pipeline and the v2.4 personalization layer with **image query support**, so users can search the catalog with a photo, with text, or with both.

### Key Features

- Image upload and encoding (CLIP)
- Visual attribute extraction
- Multimodal retrieval (text + image fusion)
- Visual-aware RAG generation
- Integration with v2.4 personalization

---

## PART 1: Setup & Directory Structure

In [1]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/MyDrive/ai_fashion_assistant_v2')

print('Drive mounted')
print(f'Working directory: {os.getcwd()}')

Mounted at /content/drive
Drive mounted
Working directory: /content/drive/MyDrive/ai_fashion_assistant_v2


In [2]:
from pathlib import Path

# Create directory structure
BASE_DIR = Path('v2.4.5-multimodal-rag')
dirs = [
    BASE_DIR / 'notebooks',
    BASE_DIR / 'src',
    BASE_DIR / 'evaluation' / 'results',
    BASE_DIR / 'data' / 'test_images'
]

for dir_path in dirs:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f'Created: {dir_path}')

print('\nDirectory structure ready')

Created: v2.4.5-multimodal-rag/notebooks
Created: v2.4.5-multimodal-rag/src
Created: v2.4.5-multimodal-rag/evaluation/results
Created: v2.4.5-multimodal-rag/data/test_images

Directory structure ready


---

## PART 2: Requirements & Dependencies

In [3]:
# Install additional packages
!pip install -q pillow opencv-python-headless gradio

print('Additional packages installed')

Additional packages installed


In [4]:
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt

print('Base imports complete')

Base imports complete


---

## PART 3: Component Inventory

In [5]:
# Check existing components from previous versions

components = {
    'v2.0 - Baseline': {
        'Text embeddings': 'paraphrase-multilingual-mpnet-base-v2 (768d)',
        'FAISS index': 'Text search with 44,417 products',
        'Performance': 'NDCG@10: 0.974'
    },
    'v2.1 - Visual Features': {
        'CLIP model': 'openai/clip-vit-large-patch14',
        'Attribute extractor': '10 visual categories (color, pattern, style)',
        'Image embeddings': '307K attributes extracted',
        'Learned fusion': 'alpha=0.7 (text weight)'
    },
    'v2.2 - RAG Pipeline': {
        'LLM': 'GROQ Llama-3.3-70B',
        'RAG score': '0.714',
        'Response time': '0.89s average',
        'FashionRAGPipeline': 'Production-ready class'
    },
    'v2.3 - AI Agents': {
        'LangChain': 'Conversational agent with memory',
        'Tools': 'Search, Recommend, GetDetails',
        'Success rate': '100%',
        'Response time': '2.6s average'
    },
    'v2.4 - Personalization': {
        'User management': 'Profiles, history, favorites',
        'Content-based': 'Multi-strategy personalization',
        'Performance': '76.7% preference match, 11.92ms response',
        'Integration': 'Intent-aware agent system'
    }
}

print('Component Inventory')
print('='*60)
for version, comps in components.items():
    print(f'\n{version}:')
    for name, desc in comps.items():
        print(f'  - {name}: {desc}')

print('\n' + '='*60)
print('All components available for v2.4.5')

Component Inventory

v2.0 - Baseline:
  - Text embeddings: paraphrase-multilingual-mpnet-base-v2 (768d)
  - FAISS index: Text search with 44,417 products
  - Performance: NDCG@10: 0.974

v2.1 - Visual Features:
  - CLIP model: openai/clip-vit-large-patch14
  - Attribute extractor: 10 visual categories (color, pattern, style)
  - Image embeddings: 307K attributes extracted
  - Learned fusion: alpha=0.7 (text weight)

v2.2 - RAG Pipeline:
  - LLM: GROQ Llama-3.3-70B
  - RAG score: 0.714
  - Response time: 0.89s average
  - FashionRAGPipeline: Production-ready class

v2.3 - AI Agents:
  - LangChain: Conversational agent with memory
  - Tools: Search, Recommend, GetDetails
  - Success rate: 100%
  - Response time: 2.6s average

v2.4 - Personalization:
  - User management: Profiles, history, favorites
  - Content-based: Multi-strategy personalization
  - Performance: 76.7% preference match, 11.92ms response
  - Integration: Intent-aware agent system

All components available for v2.4.5


---

## PART 4: Architecture Design

### System Architecture

```
User Input
    |
    ├─ Text Query ("blue dress")
    ├─ Image Upload (dress.jpg)
    └─ Multimodal (image + "but in red")
    |
    ↓
┌─────────────────────────────┐
│  Query Processing           │
├─────────────────────────────┤
│ • Image Encoding (CLIP)     │
│ • Attribute Extraction      │
│ • Text Generation from Img  │
└─────────────────────────────┘
    |
    ↓
┌─────────────────────────────┐
│  Multimodal Retrieval       │
├─────────────────────────────┤
│ • Text Index Search         │
│ • Image Index Search        │
│ • Learned Fusion (α=0.7)    │
│ • Attribute Filtering       │
└─────────────────────────────┘
    |
    ↓
┌─────────────────────────────┐
│  Personalization (v2.4)     │
├─────────────────────────────┤
│ • User Profile Match        │
│ • History-based Ranking     │
│ • Preference Filtering      │
└─────────────────────────────┘
    |
    ↓
┌─────────────────────────────┐
│  Visual-Aware RAG (v2.2+)   │
├─────────────────────────────┤
│ • Context with Attributes   │
│ • Visual Similarity Explain │
│ • LLM Generation            │
└─────────────────────────────┘
    |
    ↓
Natural Language Response
+ Ranked Products
+ Visual Explanations
```
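
The fusion step in the diagram can be read as a weighted sum of per-product scores from the two indices. The sketch below assumes each index returns a mapping from product ID to similarity score and that scores are min-max normalized before mixing. `min_max`, `fuse_scores`, the normalization choice, and the sample IDs are illustrative assumptions, not the v2.1 implementation; only α = 0.7 as the text weight comes from the plan.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so text and image scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie: give each the same full score.
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def fuse_scores(
    text_scores: Dict[str, float],
    image_scores: Dict[str, float],
    alpha: float = 0.7,  # text weight, as stated in the plan
) -> List[Tuple[str, float]]:
    """Rank products by alpha * text + (1 - alpha) * image.

    A product missing from one result list contributes 0 for that modality.
    """
    t, v = min_max(text_scores), min_max(image_scores)
    fused = {
        pid: alpha * t.get(pid, 0.0) + (1.0 - alpha) * v.get(pid, 0.0)
        for pid in set(t) | set(v)
    }
    # Sort by score (descending), then by ID so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for four products
text_hits = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
image_hits = {"p2": 0.8, "p3": 0.6, "p4": 0.2}
ranking = fuse_scores(text_hits, image_hits)
print([pid for pid, _ in ranking])
```

With α = 0.7 a strong text match (`p1`) can outrank a product that scores well in both modalities (`p2`); lowering α shifts the balance toward visual similarity.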

In [6]:
# Architecture specifications

architecture_spec = {
    'input_modalities': {
        'text_only': 'Traditional keyword search',
        'image_only': 'Visual similarity search (NEW)',
        'multimodal': 'Image + text refinement (NEW)'
    },
    'query_processing': {
        'image_encoding': 'CLIP ViT-L/14 → 768d embedding',
        'attribute_extraction': 'Zero-shot classification (10 categories)',
        'text_generation': 'LLM-based query generation from attributes'
    },
    'retrieval': {
        'text_index': 'FAISS (44,417 products)',
        'image_index': 'FAISS (44,417 products)',
        'fusion_strategy': 'Learned fusion (α=0.7)',
        'post_filtering': 'Visual attribute matching'
    },
    'personalization': {
        'user_profile': 'Style, colors, size preferences',
        'history': 'Search patterns and favorites',
        're_ranking': 'Content-based filtering'
    },
    'generation': {
        'context_enhancement': 'Visual attributes in prompt',
        'explanation': 'Visual similarity reasoning',
        'llm': 'GROQ Llama-3.3-70B'
    },
    'performance_targets': {
        'ndcg@10': '> 0.75',
        'response_time': '< 2.0s',
        'visual_alignment': '> 0.75'
    }
}

print('Architecture Specification')
print('='*60)
print(json.dumps(architecture_spec, indent=2))

Architecture Specification
{
  "input_modalities": {
    "text_only": "Traditional keyword search",
    "image_only": "Visual similarity search (NEW)",
    "multimodal": "Image + text refinement (NEW)"
  },
  "query_processing": {
    "image_encoding": "CLIP ViT-L/14 \u2192 768d embedding",
    "attribute_extraction": "Zero-shot classification (10 categories)",
    "text_generation": "LLM-based query generation from attributes"
  },
  "retrieval": {
    "text_index": "FAISS (44,417 products)",
    "image_index": "FAISS (44,417 products)",
    "fusion_strategy": "Learned fusion (\u03b1=0.7)",
    "post_filtering": "Visual attribute matching"
  },
  "personalization": {
    "user_profile": "Style, colors, size preferences",
    "history": "Search patterns and favorites",
    "re_ranking": "Content-based filtering"
  },
  "generation": {
    "context_enhancement": "Visual attributes in prompt",
    "explanation": "Visual similarity reasoning",
    "llm": "GROQ Llama-3.3-70B"
  },
  "perfo
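
The `attribute_extraction` entry describes zero-shot classification: the image embedding is compared against text embeddings of candidate labels, and the closest label wins. The sketch below shows only that comparison step on precomputed vectors. The toy 3-d vectors stand in for 768-d CLIP embeddings, and `zero_shot_label`, the prompt wording, and the logit scale of 100 are assumptions for illustration, not the v2.1 extractor.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_label(
    image_emb: List[float],
    label_embs: Dict[str, List[float]],
    scale: float = 100.0,  # assumed logit scale applied before the softmax
) -> Tuple[str, Dict[str, float]]:
    """Return the best label and a softmax distribution over all labels."""
    if not label_embs:
        raise ValueError("at least one candidate label is required")
    logits = {lbl: scale * cosine(image_emb, emb) for lbl, emb in label_embs.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {lbl: math.exp(z - m) for lbl, z in logits.items()}
    total = sum(exps.values())
    probs = {lbl: e / total for lbl, e in exps.items()}
    return max(probs, key=probs.get), probs


# Toy vectors; in practice these would come from the CLIP text and image encoders
color_prompts = {
    "a photo of a blue garment": [0.9, 0.1, 0.0],
    "a photo of a red garment": [0.1, 0.9, 0.0],
    "a photo of a green garment": [0.0, 0.1, 0.9],
}
query_image = [0.8, 0.2, 0.1]
best, probs = zero_shot_label(query_image, color_prompts)
print(best)
```

Running one such comparison per attribute category (color, pattern, style, and so on) would yield the attribute set that later feeds filtering and the RAG prompt.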

---

## PART 5: Data Requirements

In [7]:
# Data needed for v2.4.5

data_requirements = {
    'existing_data': {
        'product_embeddings_text': 'v2.0-baseline/data/processed/embeddings_mpnet.pkl',
        'product_metadata': 'v2.0-baseline/data/processed/products_processed.csv',
        'visual_attributes': 'v2.1-core-ml-plus/data/visual_attributes.csv',
        'faiss_text_index': 'v2.0-baseline/embeddings/text_index.faiss',
        'user_data': 'v2.4-complete/data/users/*.json'
    },
    'new_data_needed': {
        'test_images': {
            'count': '15-20 fashion product images',
            'sources': ['Google Images', 'Unsplash', 'Product websites'],
            'categories': ['dresses', 'shoes', 'jackets', 'pants', 'accessories'],
            'format': 'JPG/PNG, 512x512 minimum'
        },
        'image_embeddings': {
            'model': 'CLIP ViT-L/14',
            'dimension': '768',
            'products': 'All 44,417 products',
            'storage': 'FAISS index'
        },
        'evaluation_set': {
            'text_queries': '10 text-only queries',
            'image_queries': '10 image-only queries',
            'multimodal_queries': '10 image+text queries',
            'ground_truth': 'Manual relevance judgments'
        }
    },
    'estimated_storage': {
        'test_images': '~50 MB',
        'image_embeddings': '~135 MB (44417 × 768 × 4 bytes)',
        'faiss_index': '~150 MB',
        'total_new': '~335 MB'
    }
}

print('Data Requirements')
print('='*60)
print(json.dumps(data_requirements, indent=2))

Data Requirements
{
  "existing_data": {
    "product_embeddings_text": "v2.0-baseline/data/processed/embeddings_mpnet.pkl",
    "product_metadata": "v2.0-baseline/data/processed/products_processed.csv",
    "visual_attributes": "v2.1-core-ml-plus/data/visual_attributes.csv",
    "faiss_text_index": "v2.0-baseline/embeddings/text_index.faiss",
    "user_data": "v2.4-complete/data/users/*.json"
  },
  "new_data_needed": {
    "test_images": {
      "count": "15-20 fashion product images",
      "sources": [
        "Google Images",
        "Unsplash",
        "Product websites"
      ],
      "categories": [
        "dresses",
        "shoes",
        "jackets",
        "pants",
        "accessories"
      ],
      "format": "JPG/PNG, 512x512 minimum"
    },
    "image_embeddings": {
      "model": "CLIP ViT-L/14",
      "dimension": "768",
      "products": "All 44,417 products",
      "storage": "FAISS index"
    },
    "evaluation_set": {
      "text_queries": "10 text-only queries",
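
The embedding storage estimate follows from products × dimensions × 4 bytes per float32 value. The short sketch below reproduces that arithmetic; `embedding_bytes` is a hypothetical helper, and it assumes dense, uncompressed float32 storage with no container overhead.

```python
def embedding_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix, without container overhead."""
    return n_vectors * dim * bytes_per_value


n_products, clip_dim = 44_417, 768  # figures from the plan above
raw = embedding_bytes(n_products, clip_dim)

print(f"{raw:,} bytes")              # 136,449,024 bytes
print(f"{raw / 1e6:.1f} MB")         # 136.4 MB (decimal)
print(f"{raw / 2**20:.1f} MiB")      # 130.1 MiB (binary)
```

The planned ~135 MB sits between the decimal and binary readings, so it is a reasonable round figure. An uncompressed flat index would need roughly the same space again, which is consistent with the ~150 MB budgeted for the FAISS index.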

---

## PART 6: Test Image Collection Guide

### Test Image Collection Instructions

**Goal:** Collect 15-20 diverse fashion product images for testing multimodal search.

**Sources:**
1. **Unsplash** (royalty-free): https://unsplash.com/s/photos/fashion
2. **Pexels** (royalty-free): https://www.pexels.com/search/fashion/
3. **Product websites** (for testing only)

**Categories to Cover:**
- Dresses (3-4 images): casual, formal, floral, solid colors
- Shoes (3-4 images): sneakers, heels, boots
- Jackets (2-3 images): leather, denim, blazer
- Pants (2-3 images): jeans, formal, casual
- Accessories (2-3 images): bags, jewelry, scarves

**Image Requirements:**
- Clear product view (front-facing preferred)
- Good lighting and focus
- Minimal background clutter
- Size: At least 512x512 pixels
- Format: JPG or PNG

**Naming Convention:**
```
category_number_description.jpg

Examples:
dress_01_blue_floral.jpg
shoes_01_white_sneakers.jpg
jacket_01_black_leather.jpg
```
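
A small check can catch misnamed files before they enter the evaluation set. The sketch below parses names that follow the convention above; the exact pattern (lowercase letters, a two-digit number, `.jpg`/`.jpeg`/`.png` extension) is inferred from the examples and is an assumption, as is the helper name `parse_test_image_name`.

```python
import re
from typing import Optional, Tuple

# category_number_description.ext, e.g. dress_01_blue_floral.jpg
FILENAME_RE = re.compile(
    r"^(?P<category>[a-z]+)_(?P<number>\d{2})_"
    r"(?P<description>[a-z0-9]+(?:_[a-z0-9]+)*)"
    r"\.(?P<ext>jpg|jpeg|png)$"
)


def parse_test_image_name(filename: str) -> Optional[Tuple[str, int, str]]:
    """Return (category, number, description), or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return match["category"], int(match["number"]), match["description"]


for name in ["dress_01_blue_floral.jpg", "shoes_01_white_sneakers.jpg", "IMG_1234.JPG"]:
    print(name, "->", parse_test_image_name(name))
```

Parsing the category out of the filename also makes it easy to count images per category against the coverage targets listed above.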

In [8]:
# Helper function to download and save test images
import urllib.request
from pathlib import Path

def download_test_image(url: str, filename: str, save_dir: Path) -> bool:
    """Download an image from a URL into save_dir; return True on success."""
    # Make sure the target folder exists before writing into it
    save_dir.mkdir(parents=True, exist_ok=True)
    save_path = save_dir / filename

    try:
        urllib.request.urlretrieve(url, save_path)
        print(f'Downloaded: {filename}')
        return True
    except (OSError, ValueError) as e:
        # URLError/HTTPError are OSError subclasses; ValueError covers malformed URLs
        print(f'Error downloading {filename}: {e}')
        return False

# Example usage (uncomment and add URLs)
# test_images_dir = BASE_DIR / 'data' / 'test_images'
#
# test_images = [
#     ('https://example.com/dress1.jpg', 'dress_01_blue_floral.jpg'),
#     ('https://example.com/shoes1.jpg', 'shoes_01_white_sneakers.jpg'),
#     # Add more URLs...
# ]
#
# for url, filename in test_images:
#     download_test_image(url, filename, test_images_dir)

print('Image download helper ready')
print('\nManual collection recommended for quality control')

Image download helper ready

Manual collection recommended for quality control


---

## PART 7: Baseline Performance (v2.2)

In [9]:
# Document baseline performance from v2.2 (text-only RAG)

baseline_performance = {
    'version': 'v2.2 - Text-Only RAG',
    'date': 'January 2, 2026',
    'metrics': {
        'rag_score': 0.714,
        'avg_response_time': 0.89,
        'coherence': 4.2,
        'relevance': 4.3,
        'queries_tested': 30
    },
    'limitations': [
        'Text-only queries (no image support)',
        'Cannot handle visual similarity',
        'Users struggle to describe visual style in words',
        'No visual attribute reasoning in responses'
    ],
    'use_cases_not_supported': [
        'Upload image → find similar',
        'Show me items like this picture',
        'Visual style matching',
        'Image + text refinement'
    ]
}

print('Baseline Performance (v2.2 - Text-Only RAG)')
print('='*60)
print(json.dumps(baseline_performance, indent=2))

print('\n' + '='*60)
print('v2.4.5 Goal: Add image query support while maintaining performance')

Baseline Performance (v2.2 - Text-Only RAG)
{
  "version": "v2.2 - Text-Only RAG",
  "date": "January 2, 2026",
  "metrics": {
    "rag_score": 0.714,
    "avg_response_time": 0.89,
    "coherence": 4.2,
    "relevance": 4.3,
    "queries_tested": 30
  },
  "limitations": [
    "Text-only queries (no image support)",
    "Cannot handle visual similarity",
    "Users struggle to describe visual style in words",
    "No visual attribute reasoning in responses"
  ],
  "use_cases_not_supported": [
    "Upload image \u2192 find similar",
    "Show me items like this picture",
    "Visual style matching",
    "Image + text refinement"
  ]
}

v2.4.5 Goal: Add image query support while maintaining performance
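
The unsupported use cases above correspond to the two new input modes in the architecture spec, while text-only queries must keep working as before. One way to keep the three paths explicit is to decide the mode up front from what the user supplied. The sketch below is an assumption about how such routing could look; `QueryMode` and `select_mode` are hypothetical names, with enum values chosen to match the `input_modalities` keys.

```python
from enum import Enum
from typing import Optional


class QueryMode(Enum):
    TEXT = "text_only"
    IMAGE = "image_only"
    MULTIMODAL = "multimodal"


def select_mode(text: Optional[str], image_path: Optional[str]) -> QueryMode:
    """Pick the retrieval path from the inputs that are actually present."""
    has_text = bool(text and text.strip())
    has_image = bool(image_path)
    if has_text and has_image:
        return QueryMode.MULTIMODAL
    if has_image:
        return QueryMode.IMAGE
    if has_text:
        return QueryMode.TEXT
    raise ValueError("a query needs text, an image, or both")


print(select_mode("blue dress", None))          # QueryMode.TEXT
print(select_mode(None, "dress.jpg"))           # QueryMode.IMAGE
print(select_mode("but in red", "dress.jpg"))   # QueryMode.MULTIMODAL
```

Because a text-only query never touches the image branch, the v2.2 behaviour can serve as a regression baseline for that mode.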


---

## PART 8: Implementation Plan

In [10]:
# 7-day implementation timeline

implementation_plan = {
    'Day 1 (Today)': {
        'status': 'IN PROGRESS',
        'tasks': [
            '✓ Repository setup',
            '✓ Architecture design',
            '✓ Component inventory',
            '⏳ Test image collection (15-20 images)',
            '⏳ Baseline documentation'
        ],
        'deliverables': [
            'Architecture document',
            '15 test images in data/test_images/',
            'Baseline metrics documented'
        ]
    },
    'Day 2 (Jan 7)': {
        'status': 'PLANNED',
        'notebook': '02_image_query_processing.ipynb',
        'tasks': [
            'Load CLIP model',
            'Image encoding function',
            'Test on 15 images',
            'Visual attribute extraction',
            'Image → text query generation (GROQ)',
            'Quality check and refinement'
        ],
        'deliverables': [
            'Image encoding working',
            'Attribute extraction integrated',
            'Query generation tested (15 examples)',
            'CSV with image → query mappings'
        ]
    },
    'Day 3 (Jan 8)': {
        'status': 'PLANNED',
        'notebook': '03_multimodal_retrieval.ipynb',
        'tasks': [
            'Load FAISS indices (text + image)',
            'MultiModalRetriever class',
            'Fusion algorithm (α=0.7)',
            'Test 3 strategies (text, image, multimodal)',
            'Attribute-based filtering'
        ],
        'deliverables': [
            'MultiModalRetriever working',
            '3 strategies tested',
            'Attribute filtering implemented',
            'Comparison results saved'
        ]
    },
    'Day 4 (Jan 9)': {
        'status': 'PLANNED',
        'notebook': '04_visual_aware_rag.ipynb',
        'tasks': [
            'Load v2.2 RAG pipeline',
            'VisualRAGPipeline class',
            'Visual prompt generation',
            'Test with 15 images',
            'Test with text refinement',
            'Response quality check'
        ],
        'deliverables': [
            'VisualRAGPipeline implemented',
            '15 image queries tested',
            'LLM responses with visual reasoning',
            'Results saved for evaluation'
        ]
    },
    'Day 5 (Jan 10)': {
        'status': 'PLANNED',
        'notebook': '05_evaluation_metrics.ipynb',
        'tasks': [
            'Create evaluation dataset (30 queries)',
            'Define metrics (retrieval + response quality)',
            'Run evaluation on 3 strategies',
            'Statistical analysis',
            'Visualization (4-panel figure)'
        ],
        'deliverables': [
            '30 queries evaluated',
            'Metrics comparison table',
            'Visualization (4-panel figure)',
            'CSV results for paper'
        ]
    },
    'Day 6 (Jan 11)': {
        'status': 'PLANNED',
        'tasks': [
            'Create production MultiModalRAGPipeline class',
            'Write unit tests (5+ tests)',
            'Integration with v2.4 agent',
            'Test backward compatibility'
        ],
        'deliverables': [
            'Production pipeline class',
            'Unit tests passing',
            'Integration with v2.4 agent',
            'Backward compatibility maintained'
        ]
    },
    'Day 7 (Jan 12)': {
        'status': 'PLANNED',
        'notebook': '06_final_evaluation_and_summary.ipynb',
        'tasks': [
            'Project summary',
            'Complete evaluation summary',
            'Generate comprehensive report',
            'README generation',
            'Update main README',
            'Git commit and push'
        ],
        'deliverables': [
            'Complete v2.4.5 README',
            'Final evaluation report (PDF)',
            'Main README updated',
            'All code committed to GitHub'
        ]
    }
}

print('7-Day Implementation Plan')
print('='*60)
for day, plan in implementation_plan.items():
    print(f"\n{day}: {plan['status']}")
    if 'notebook' in plan:
        print(f"  Notebook: {plan['notebook']}")
    print('  Tasks:')
    for task in plan['tasks']:
        print(f"    {task}")
    print('  Deliverables:')
    for deliverable in plan['deliverables']:
        print(f"    - {deliverable}")

7-Day Implementation Plan

Day 1 (Today): IN PROGRESS
  Tasks:
    ✓ Repository setup
    ✓ Architecture design
    ✓ Component inventory
    ⏳ Test image collection (15-20 images)
    ⏳ Baseline documentation
  Deliverables:
    - Architecture document
    - 15 test images in data/test_images/
    - Baseline metrics documented

Day 2 (Jan 7): PLANNED
  Notebook: 02_image_query_processing.ipynb
  Tasks:
    Load CLIP model
    Image encoding function
    Test on 15 images
    Visual attribute extraction
    Image → text query generation (GROQ)
    Quality check and refinement
  Deliverables:
    - Image encoding working
    - Attribute extraction integrated
    - Query generation tested (15 examples)
    - CSV with image → query mappings

Day 3 (Jan 8): PLANNED
  Notebook: 03_multimodal_retrieval.ipynb
  Tasks:
    Load FAISS indices (text + image)
    MultiModalRetriever class
    Fusion algorithm (α=0.7)
    Test 3 strategies (text, image, multimodal)
    Attribute-based filtering
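
The attribute-based filtering task on Day 3 could be approached as a post-retrieval step that keeps candidates whose visual attributes agree with those requested by the query. The sketch below is one possible shape for it, not the planned implementation: the candidate structure, `attribute_match_ratio`, `filter_by_attributes`, and the 0.5 threshold are all assumptions.

```python
from typing import Dict, List


def attribute_match_ratio(product_attrs: Dict[str, str], query_attrs: Dict[str, str]) -> float:
    """Fraction of requested attributes that the product matches exactly."""
    if not query_attrs:
        return 1.0  # nothing requested, so nothing can mismatch
    hits = sum(1 for key, value in query_attrs.items() if product_attrs.get(key) == value)
    return hits / len(query_attrs)


def filter_by_attributes(
    candidates: List[dict],
    query_attrs: Dict[str, str],
    min_ratio: float = 0.5,  # assumed threshold
) -> List[dict]:
    """Keep candidates meeting the threshold, preserving retrieval order."""
    return [
        c for c in candidates
        if attribute_match_ratio(c["attributes"], query_attrs) >= min_ratio
    ]


# Hypothetical candidates for the query: image of a floral dress + "but in red"
candidates = [
    {"id": "p1", "attributes": {"color": "red", "pattern": "floral"}},
    {"id": "p2", "attributes": {"color": "red", "pattern": "solid"}},
    {"id": "p3", "attributes": {"color": "blue", "pattern": "solid"}},
]
kept = filter_by_attributes(candidates, {"color": "red", "pattern": "floral"})
print([c["id"] for c in kept])
```

A soft threshold rather than an all-or-nothing match leaves some recall when the extracted attributes are noisy.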
  

---

## PART 9: Success Criteria

In [11]:
success_criteria = {
    'technical': {
        'image_query_support': 'Users can upload images and get results',
        'multimodal_ndcg': '> 0.75 (target: 0.758)',
        'response_time': '< 2.0s (acceptable for image queries)',
        'visual_alignment': '> 0.75 (attribute matching)',
        'backward_compatible': 'Text-only queries still work'
    },
    'functional': {
        'three_input_modes': 'Text, image, multimodal all working',
        'attribute_extraction': 'Accurate visual attributes (90%+ agreement)',
        'visual_reasoning': 'LLM explains visual similarity',
        'personalization_integrated': 'Works with v2.4 user profiles'
    },
    'quality': {
        'code_quality': 'Production-ready, documented, tested',
        'notebooks': '6 notebooks, all cells working',
        'evaluation': '30+ queries evaluated, statistical significance',
        'documentation': 'Complete README, architecture docs'
    },
    'user_study_readiness': {
        'stable_system': 'No crashes, consistent responses',
        'ui_integration': 'Streamlit app with image upload',
        'logging': 'All interactions logged for analysis',
        'fast_enough': 'Response time acceptable for real users'
    }
}

print('Success Criteria for v2.4.5')
print('='*60)
for category, criteria in success_criteria.items():
    print(f'\n{category.upper()}:')
    for criterion, description in criteria.items():
        print(f'  ☐ {criterion}: {description}')

print('\n' + '='*60)
print('All criteria must be met before user study')

Success Criteria for v2.4.5

TECHNICAL:
  ☐ image_query_support: Users can upload images and get results
  ☐ multimodal_ndcg: > 0.75 (target: 0.758)
  ☐ response_time: < 2.0s (acceptable for image queries)
  ☐ visual_alignment: > 0.75 (attribute matching)
  ☐ backward_compatible: Text-only queries still work

FUNCTIONAL:
  ☐ three_input_modes: Text, image, multimodal all working
  ☐ attribute_extraction: Accurate visual attributes (90%+ agreement)
  ☐ visual_reasoning: LLM explains visual similarity
  ☐ personalization_integrated: Works with v2.4 user profiles

QUALITY:
  ☐ code_quality: Production-ready, documented, tested
  ☐ notebooks: 6 notebooks, all cells working
  ☐ evaluation: 30+ queries evaluated, statistical significance
  ☐ documentation: Complete README, architecture docs

USER_STUDY_READINESS:
  ☐ stable_system: No crashes, consistent responses
  ☐ ui_integration: Streamlit app with image upload
  ☐ logging: All interactions logged for analysis
  ☐ fast_enough: Response t
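
The `multimodal_ndcg` criterion depends on how NDCG@10 is computed from the manual relevance judgments. The sketch below uses the common formulation with linear gain and a log2 rank discount. The plan does not state which gain function or ideal ranking the earlier versions used, so both choices here are assumptions, as are the function names and sample judgments.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the system ranking divided by DCG of the ideal ranking.

    The ideal ranking is built from the judged results only, which assumes
    every relevant product appears somewhere in the judged list.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments (0-3) for one query, in ranked order
judged = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(judged, k=10)
print(f"NDCG@10 = {score:.3f}")
print("meets target" if score > 0.75 else "below target")
```

Averaging this score over the queries in each mode gives the figure to compare against the 0.75 threshold.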

---

## Summary

In [13]:
print('='*60)
print('ARCHITECTURE & PLANNING COMPLETE')
print('='*60)

print('\nCompleted Today:')
print('  ✓ Repository structure created')
print('  ✓ Architecture designed')
print('  ✓ Component inventory documented')
print('  ✓ Data requirements specified')
print('  ✓ Implementation plan detailed')

print('\nNext Steps:')
print('  1. Collect 15-20 test images')
print('  2. Load CLIP model')
print('  3. Implement image encoding')
print('  4. Test attribute extraction')
print('  5. Generate text queries from images')

print('\nOutput Files:')
print('  - v2.4.5-multimodal-rag/notebooks/01_architecture.ipynb (this file)')
print('  - Directory structure ready for development')

print('\nEstimated Timeline:')
print('  - Days 1-5: Core development')
print('  - Day 6: Production code')
print('  - Day 7: Documentation & finalization')

print('='*60)

ARCHITECTURE & PLANNING COMPLETE

Completed Today:
  ✓ Repository structure created
  ✓ Architecture designed
  ✓ Component inventory documented
  ✓ Data requirements specified
  ✓ Implementation plan detailed

Next Steps:
  1. Collect 15-20 test images
  2. Load CLIP model
  3. Implement image encoding
  4. Test attribute extraction
  5. Generate text queries from images

Output Files:
  - v2.4.5-multimodal-rag/notebooks/01_architecture.ipynb (this file)
  - Directory structure ready for development

Estimated Timeline:
  - Days 1-5: Core development
  - Day 6: Production code
  - Day 7: Documentation & finalization
