# Controllable Fashion Image Synthesis - AMD GPU Version

## Setup Instructions for AMD GPUs (ROCm)

### 1. Install PyTorch with ROCm Support

**IMPORTANT**: Install PyTorch BEFORE running this notebook!

```bash
# Check your ROCm version
rocm-smi --showdriverversion

# For ROCm 6.2 (recommended - best compatibility)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

# For older ROCm 5.7
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
```
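
After installing, it can help to confirm that the wheel is a ROCm build rather than a CUDA or CPU-only one. The sketch below assumes that `torch.version.hip` is a version string on ROCm builds and `None` otherwise; `describe_backend` is a hypothetical helper name, not part of PyTorch.

```python
def describe_backend(hip_version, cuda_version):
    """Classify a PyTorch build from its torch.version.hip / torch.version.cuda values."""
    if hip_version:
        return f"ROCm (HIP {hip_version})"
    if cuda_version:
        return f"CUDA {cuda_version}"
    return "CPU-only"


try:
    import torch

    # getattr guards against builds where an attribute is missing entirely
    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)
    print(f"PyTorch {torch.__version__}: {describe_backend(hip, cuda)}")
except ImportError:
    print("PyTorch is not installed yet.")
```

Note that ROCm builds still report the GPU through the `torch.cuda` API, which is why the cells below call `torch.cuda.is_available()` on AMD hardware.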

### 2. Install Dependencies

```bash
cd /home/husnain/DLP
pip install -r requirements.txt
```

### 3. Prepare Your Dataset

Place your FashionGen dataset at:
- `/home/husnain/DLP/data/fashiongen_256_256_train.h5`

Or update the `FASHIONGEN_PATH` variable in Cell 1.
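
The path listed above and the default `FASHIONGEN_PATH` in Cell 1 point at different locations, so a quick existence check before the long export may save time. This is a minimal sketch; `resolve_dataset_path` is a hypothetical helper, and the two candidate paths are simply the ones that appear in this notebook.

```python
import os


def resolve_dataset_path(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


candidates = [
    "/home/husnain/DLP/data/fashiongen_256_256_train.h5",  # path from the setup notes
    "/home/husnain/DLP/fashiongen_256_256_train.h5",       # default in Cell 1
]
found = resolve_dataset_path(candidates)
print(found if found else "Dataset not found; update FASHIONGEN_PATH in Cell 1.")
```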

### 4. Key Changes for AMD GPUs

- ✅ ROCm-compatible device detection
- ✅ FP16 inference; training runs in FP32 for stability (see Cell 3)
- ✅ No bitsandbytes dependency (not ROCm compatible)
- ✅ Memory efficient attention with XFormers (if available)
- ✅ All paths updated to local filesystem

---

## Workflow Overview

1. **Data Preparation** - Extract and prepare training/evaluation data
2. **Environment Check** - Verify PyTorch and GPU availability
3. **GPU Memory Check** - Monitor GPU memory
4. **Training** - Fine-tune Stable Diffusion with LoRA
5. **Training Visualization** - Plot training loss curves
6. **Evaluation Setup** - Verify evaluation libraries
7. **Image Generation** - Generate baseline and LoRA-enhanced images

---


In [1]:
# CELL 1: Data Preparation
import os
import h5py
import csv
import numpy as np
from PIL import Image
from tqdm.auto import tqdm
import json
import shutil

# --- Configuration ---
FASHIONGEN_PATH = '/home/husnain/DLP/fashiongen_256_256_train.h5'  # Update this path to your dataset
WORKING_DIR = "./working"
TRAIN_ROOT = os.path.join(WORKING_DIR, "fashion_train")
TRAIN_IMAGES_DIR = os.path.join(TRAIN_ROOT, "images")
EVAL_ROOT = os.path.join(WORKING_DIR, "eval_data")
EVAL_GT_DIR = os.path.join(EVAL_ROOT, "gt")

# Create directories
if os.path.exists(TRAIN_ROOT): shutil.rmtree(TRAIN_ROOT)
if os.path.exists(EVAL_ROOT): shutil.rmtree(EVAL_ROOT)
os.makedirs(TRAIN_IMAGES_DIR, exist_ok=True)
os.makedirs(EVAL_GT_DIR, exist_ok=True)

# --- Load Data ---
print("📂 Opening dataset...")
h5_file = h5py.File(FASHIONGEN_PATH, 'r')
num_total = len(h5_file['input_image'])

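# Define Split (fail early if the dataset is smaller than the requested split)
TRAIN_SIZE = 100000
EVAL_SIZE = 10000
assert TRAIN_SIZE + EVAL_SIZE <= num_total, f"Dataset has only {num_total} samples"
train_indices = range(0, TRAIN_SIZE)
eval_indices = range(TRAIN_SIZE, TRAIN_SIZE + EVAL_SIZE)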

# --- 1. Export Training Data ---
train_metadata = []
print(f"🚀 Exporting {TRAIN_SIZE} Training samples...")
for idx in tqdm(train_indices):
    img = Image.fromarray(h5_file['input_image'][idx])
    file_name = f"{idx:06d}.jpg"
    img.save(os.path.join(TRAIN_IMAGES_DIR, file_name), quality=95)
    
    desc = h5_file['input_description'][idx]
    if isinstance(desc, bytes): desc = desc.decode('utf-8', errors='ignore')
    prompt = str(desc).split(',')[0]
    
    # Path relative to TRAIN_ROOT
    train_metadata.append({"file_name": f"images/{file_name}", "text": prompt})

with open(os.path.join(TRAIN_ROOT, "metadata.csv"), 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
    writer.writeheader()
    writer.writerows(train_metadata)

# --- 2. Export Evaluation Data ---
eval_configs = []
print(f"🚀 Exporting {EVAL_SIZE} Evaluation samples...")
for idx in tqdm(eval_indices):
    img = Image.fromarray(h5_file['input_image'][idx])
    img.save(os.path.join(EVAL_GT_DIR, f"{idx:06d}.png"))
    
    desc = h5_file['input_description'][idx]
    if isinstance(desc, bytes): desc = desc.decode('utf-8', errors='ignore')
    prompt = str(desc).split(',')[0]
    
    eval_configs.append({"idx": idx, "prompt": prompt})

with open(os.path.join(EVAL_ROOT, "eval_configs.json"), 'w') as f:
    json.dump(eval_configs, f)

# Release the HDF5 file handle now that both exports are done
h5_file.close()
print("✅ Data Prep Complete.")

📂 Opening dataset...
🚀 Exporting 100000 Training samples...

100%|██████████| 100000/100000 [09:36<00:00, 173.34it/s]

🚀 Exporting 10000 Evaluation samples...

100%|██████████| 10000/10000 [01:56<00:00, 86.01it/s]

✅ Data Prep Complete.
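
Cell 1 takes the first `TRAIN_SIZE` samples for training and the next `EVAL_SIZE` for evaluation, which only works if the file holds at least that many images. The split logic can be sketched as a small function that validates the sizes first; `make_split` is a hypothetical name, and the `num_total` value in the usage line is a placeholder rather than the real dataset size.

```python
def make_split(num_total, train_size, eval_size):
    """Return (train_indices, eval_indices) as contiguous, non-overlapping ranges."""
    needed = train_size + eval_size
    if needed > num_total:
        raise ValueError(
            f"Split needs {needed} samples but the dataset has only {num_total}."
        )
    return range(0, train_size), range(train_size, needed)


# Placeholder total; in the notebook this would be len(h5_file['input_image'])
train_indices, eval_indices = make_split(num_total=200000, train_size=100000, eval_size=10000)
print(len(train_indices), len(eval_indices), eval_indices[0])
```

Because the evaluation range starts exactly where the training range ends, no image appears in both sets.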


In [2]:
# CELL 2: Installation (ROCm/AMD GPU Compatible)
import os
import torch

# Check PyTorch and device availability
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"Device Count: {torch.cuda.device_count()}")
else:
    print("⚠️ No GPU detected. Training will use CPU (very slow).")

# Note: bitsandbytes is NOT installed as it has poor ROCm support
# Training will use FP32 or FP16 depending on GPU capability

# Download Training Script
print("📜 Downloading Training Script...")
os.system("wget -q https://raw.githubusercontent.com/huggingface/diffusers/v0.26.3/examples/text_to_image/train_text_to_image_lora.py -O train_text_to_image_lora.py")

import numpy
print(f"✅ Setup Complete. NumPy Version: {numpy.__version__}")

PyTorch Version: 2.5.1+rocm6.2
CUDA Available: True
Device: AMD Instinct MI250X/MI250
Device Count: 8
📜 Downloading Training Script...
✅ Setup Complete. NumPy Version: 1.26.4


In [3]:
# CELL 2.5: Check GPU Memory
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    
    # Get GPU memory info
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"💾 Total GPU Memory: {total_memory:.2f} GB")
    
    # Clear cache
    torch.cuda.empty_cache()
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"📊 Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")
else:
    print("⚠️ No GPU available, will use CPU")

🎮 GPU: AMD Instinct MI250X/MI250
💾 Total GPU Memory: 63.98 GB
📊 Allocated: 0.00 GB | Reserved: 0.00 GB


In [None]:
# CELL 3: Run Training (ROCm/AMD GPU Compatible)
import torch

OUTPUT_DIR = "./working/fashion_lora_output"

# Mixed precision is fixed here rather than auto-detected.
# Note: FP16 has gradient scaling issues with ROCm, so FP32 is used for stability.
# FP32 is slower but more stable on AMD GPUs.
mixed_precision = "no"  # FP32 for ROCm compatibility

print(f"🚀 Starting training with mixed_precision={mixed_precision}")

!accelerate launch --mixed_precision={mixed_precision} train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="./working/fashion_train" \
  --caption_column="text" \
  --resolution=256 \
  --random_flip \
  --train_batch_size=2 \
  --gradient_accumulation_steps=2 \
  --max_train_steps=5000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=500 \
  --output_dir={OUTPUT_DIR} \
  --checkpointing_steps=1000 \
  --seed=42 \
  --report_to="tensorboard"

print("✅ Training Complete.")

🚀 Starting training with mixed_precision=no
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `8`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--dynamo_backend` was set to a value of `'no'`
  torch.utils._pytree._register_pytree_node(
12/13/2025 18:11:07 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 8
Process index: 7
Local process index: 7
Device: cuda:7

Mixed precision type: no

12/13/202

Downloading data: 100%|██████████| 100001/100001 [00:00<00:00, 145031.10files/s]
Generating train split: 100000 examples [00:06, 16317.90 examples/s]
[rank1]:[W1213 18:11:47.557116190 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W1213 18:11:47.626904663 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W1213 18:11:47.718730583 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7
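
The log above shows `accelerate` defaulting to 8 processes, which changes the effective batch size from what the command-line flags alone suggest. The arithmetic can be sketched as follows, assuming standard data-parallel training where each process consumes its own batches and `--max_train_steps` counts optimizer updates; the helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_processes):
    """Samples consumed per optimizer update across all processes."""
    return per_device_batch * grad_accum_steps * num_processes


def epochs_covered(max_train_steps, batch, dataset_size):
    """Approximate number of passes over the dataset for a given step budget."""
    return max_train_steps * batch / dataset_size


# Values taken from Cell 3 and the launcher output
batch = effective_batch_size(per_device_batch=2, grad_accum_steps=2, num_processes=8)
print(batch)                                # 32
print(epochs_covered(5000, batch, 100000))  # 1.6
```

Under these assumptions, passing `--num_processes=1` (as the launcher message suggests) would drop the effective batch size to 4 and the coverage to 0.2 epochs for the same step count.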

In [None]:
# CELL 4: Plot Training Loss from TensorBoard Logs
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
import glob
import os

print("📊 Extracting training loss...")
log_dir = "./working/fashion_lora_output"

# Find the events file (it's inside a subfolder usually)
event_files = glob.glob(f"{log_dir}/**/events.out.tfevents.*", recursive=True)

if event_files:
    # Load the most recent event file (by modification time)
    ea = EventAccumulator(max(event_files, key=os.path.getmtime))
    ea.Reload()
    
    # Check available tags
    tags = ea.Tags()['scalars']
    if 'train_loss' in tags:
        losses = ea.Scalars('train_loss')
        steps = [x.step for x in losses]
        vals = [x.value for x in losses]
        
        # Plot
        plt.figure(figsize=(10, 6))
        plt.plot(steps, vals, label="Train Loss", color='blue', alpha=0.6)
        
        # Add a trailing moving average for smoothing (each point averages the last `window` values)
        if len(vals) > 20:
            window = 20
            avg_vals = [sum(vals[i:i+window])/window for i in range(len(vals)-window+1)]
            plt.plot(steps[window-1:], avg_vals, color='red', linewidth=2, label='Moving Avg')

        plt.xlabel("Step")
        plt.ylabel("Loss")
        plt.title("Training Loss Curve")
        plt.grid(True, alpha=0.3)
        plt.legend()
        
        # Save
        plt.savefig("training_loss.png", dpi=150)
        print("✅ Saved 'training_loss.png'")
        plt.show()
    else:
        print("⚠️ 'train_loss' not found in logs. (Did you run for enough steps?)")
else:
    print("⚠️ No TensorBoard logs found in output directory.")

📊 Extracting training loss...
⚠️ 'train_loss' not found in logs. (Did you run for enough steps?)
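
The smoothing step in Cell 4 can be isolated and checked without TensorBoard or matplotlib. Below is a minimal sketch of a trailing moving average in which each smoothed value is paired with the step that ends its window; `trailing_moving_average` is a hypothetical helper and the sample values are made up for illustration.

```python
def trailing_moving_average(steps, values, window):
    """Return (steps, averages); each average covers the `window` values ending at that step."""
    if window <= 0 or len(values) < window:
        return [], []
    averages = [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]
    return list(steps[window - 1 :]), averages


# Illustrative data only, not real training logs
steps = [10, 20, 30, 40, 50]
vals = [1.0, 0.8, 0.6, 0.4, 0.2]
smoothed_steps, smoothed_vals = trailing_moving_average(steps, vals, window=2)
print(smoothed_steps, smoothed_vals)
```

Pairing each average with the last step of its window keeps the smoothed curve from appearing shifted relative to the raw loss.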


In [None]:
# CELL 5: Verify Evaluation Libraries
import os

# Note: Evaluation tools should already be installed from requirements.txt
# This cell just verifies they're available

try:
    import lpips
    from skimage.metrics import structural_similarity as ssim
    print("✅ LPIPS and SSIM available")
except ImportError as e:
    print(f"⚠️ Missing evaluation library: {e}")
    print("Run: pip install lpips scikit-image")

try:
    from cleanfid import fid
    print("✅ CleanFID available")
except ImportError:
    print("⚠️ CleanFID not found. Run: pip install clean-fid")

print("✅ Evaluation tools check complete.")

✅ LPIPS and SSIM available
✅ CleanFID available
✅ Evaluation tools check complete.
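
The same check can be done without importing the (fairly heavy) libraries by asking the import system whether each package is installed. This is a sketch using the standard library's `importlib.util.find_spec`; `missing_packages` is a hypothetical helper, and the pip names are the ones mentioned in Cell 5.

```python
import importlib.util


def missing_packages(requirements):
    """Given {import_name: pip_name}, return pip names whose module cannot be found."""
    return [
        pip_name
        for import_name, pip_name in requirements.items()
        if importlib.util.find_spec(import_name) is None
    ]


EVAL_REQUIREMENTS = {
    "lpips": "lpips",
    "skimage": "scikit-image",
    "cleanfid": "clean-fid",
}
missing = missing_packages(EVAL_REQUIREMENTS)
print("pip install " + " ".join(missing) if missing else "All evaluation libraries found.")
```

This only confirms that the packages are installed; a package that is present but broken would still pass, so the `try`/`import` approach in Cell 5 remains the stricter test.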


In [None]:
# CELL 6: Generate Images for Evaluation (ROCm/AMD GPU Compatible)
import torch
import cv2
import numpy as np
import json
import os
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from tqdm.auto import tqdm

# --- Config ---
EVAL_ROOT = "./working/eval_data"
GT_DIR = os.path.join(EVAL_ROOT, "gt")
BASE_DIR = os.path.join(EVAL_ROOT, "baseline")
LORA_DIR = os.path.join(EVAL_ROOT, "lora")
os.makedirs(BASE_DIR, exist_ok=True)
os.makedirs(LORA_DIR, exist_ok=True)

with open(os.path.join(EVAL_ROOT, "eval_configs.json"), 'r') as f:
    eval_configs = json.load(f)

def get_canny_edge(pil_img):
    img = np.array(pil_img)
    edges = cv2.Canny(img, 100, 200)
    edges = np.stack([edges]*3, axis=-1)
    return Image.fromarray(edges)

# --- Device Setup for AMD GPU ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

print(f"🎮 Using device: {device} with dtype: {dtype}")

# --- Load Pipeline ---
print("⚙️ Loading Pipeline...")
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_canny", torch_dtype=dtype)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=dtype, safety_checker=None
).to(device)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Enable memory-efficient attention when a GPU is available (xformers is optional)
if torch.cuda.is_available():
    try:
        pipe.enable_xformers_memory_efficient_attention()
        print("✅ XFormers memory efficient attention enabled")
    except Exception:
        print("⚠️ XFormers not available, using default attention")

# --- 1. Generate Baseline (ControlNet Only) ---
print("🚀 Generating Baseline Images...")
for item in tqdm(eval_configs, desc="Baseline"):
    idx = item['idx']
    if os.path.exists(os.path.join(BASE_DIR, f"{idx:06d}.png")): continue
    
    gt_img = Image.open(os.path.join(GT_DIR, f"{idx:06d}.png")).convert("RGB")
    edge_img = get_canny_edge(gt_img)
    
    with torch.inference_mode():
        gen = pipe(item['prompt'], image=edge_img, num_inference_steps=20).images[0]
    gen.save(os.path.join(BASE_DIR, f"{idx:06d}.png"))

# --- 2. Generate LoRA (ControlNet + Your Style) ---
print("🚀 Generating LoRA Images...")
pipe.load_lora_weights("./working/fashion_lora_output", weight_name="pytorch_lora_weights.safetensors")

for item in tqdm(eval_configs, desc="LoRA"):
    idx = item['idx']
    if os.path.exists(os.path.join(LORA_DIR, f"{idx:06d}.png")): continue
    
    gt_img = Image.open(os.path.join(GT_DIR, f"{idx:06d}.png")).convert("RGB")
    edge_img = get_canny_edge(gt_img)
    
    with torch.inference_mode():
        gen = pipe(item['prompt'], image=edge_img, num_inference_steps=20).images[0]
    gen.save(os.path.join(LORA_DIR, f"{idx:06d}.png"))

print("✅ Generation Complete.")

# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("🧹 GPU memory cleared")



🎮 Using device: cuda with dtype: torch.float16
⚙️ Loading Pipeline...

Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00,  7.39it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.controlnet.pipeline_controlnet.StableDiffusionControlNetPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .


⚠️ XFormers not available, using default attention
🚀 Generating Baseline Images...

Baseline: 100%|██████████| 1000/1000 [00:00<00:00, 177634.42it/s]

🚀 Generating LoRA Images...

LoRA: 100%|██████████| 1000/1000 [00:00<00:00, 175229.95it/s]

✅ Generation Complete.
🧹 GPU memory cleared
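
Both loops above finished in well under a second, which most likely means every item hit the `continue` branch because its output file already existed from an earlier run. A small pre-check makes that explicit before the pipeline is loaded. This is a sketch only; `pending_items` is a hypothetical helper and the two sample configs are illustrative.

```python
import os


def pending_items(eval_configs, out_dir):
    """Return the configs whose output image does not exist yet in out_dir."""
    return [
        item
        for item in eval_configs
        if not os.path.exists(os.path.join(out_dir, f"{item['idx']:06d}.png"))
    ]


# Illustrative configs; in the notebook these come from eval_configs.json
eval_configs = [
    {"idx": 100000, "prompt": "example prompt"},
    {"idx": 100001, "prompt": "example prompt"},
]
todo = pending_items(eval_configs, "./working/eval_data/baseline")
print(f"{len(todo)} of {len(eval_configs)} images still need to be generated")
```

If stale images should be regenerated (for example after retraining the LoRA weights), the output directory needs to be cleared first, since the skip logic checks only whether a file with that name exists.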



