# Data Preparation - Chest X-ray Abnormality Detection

## Section 1: Setup and Imports

In [1]:
# Set working directory to repository root
%cd /home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray

/home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray


In [2]:
# Install Roboflow if not already installed
# !pip install -q roboflow

In [3]:
# Import required libraries
import os
import shutil
import json
import random
from pathlib import Path
from collections import Counter
from typing import Dict, List, Tuple
import yaml

import numpy as np
from PIL import Image
from tqdm import tqdm
from roboflow import Roboflow

# Import preprocessing utilities
import sys
sys.path.insert(0, str(Path.cwd()))

from backend.src.utils.preprocessing import preprocess_image
# from backend.src.utils.class_mapping import get_vietnamese_class_name

print("‚úì Imports successful")

‚úì Imports successful


## Section 2: Download Dataset from Roboflow

Download VinBigData Chest X-ray Symptom Detection dataset version 3 (YOLOv11 format).

In [None]:
# Roboflow configuration
ROBOFLOW_API_KEY = "wQ9S049DhK8xjIhNy6zv"
WORKSPACE_NAME = "vinbigdataxrayproject"
PROJECT_NAME = "chest-xray-symptom-detection"
VERSION = 3
FORMAT = "yolov11"

print("Downloading Dataset from Roboflow")
print("=" * 80)
print(f"  Workspace: {WORKSPACE_NAME}")
print(f"  Project: {PROJECT_NAME}")
print(f"  Version: {VERSION}")
print(f"  Format: {FORMAT}")
print("=" * 80)

Downloading Dataset from Roboflow
  Workspace: vinbigdataxrayproject
  Project: chest-xray-symptom-detection
  Version: 3
  Format: yolov11


In [11]:
# Download dataset
rf = Roboflow(api_key=ROBOFLOW_API_KEY)
project = rf.workspace(WORKSPACE_NAME).project(PROJECT_NAME)
version = project.version(VERSION)

# Download to data/ directory
dataset = version.download(FORMAT, location="data/", overwrite=True)

print(f"\n‚úì Dataset downloaded to: {dataset.location}")

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in data/ to yolov11:: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1110489/1110489 [01:43<00:00, 10774.53it/s]





Extracting Dataset Version Zip to data/ in yolov11:: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 30008/30008 [00:02<00:00, 13078.86it/s]



‚úì Dataset downloaded to: /home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray/data


In [12]:
# Verify downloaded data
dataset_dir = Path(dataset.location)
data_yaml_path = dataset_dir / "data.yaml"

if not data_yaml_path.exists():
    raise FileNotFoundError(f"data.yaml not found at {data_yaml_path}")

# Load data.yaml to get class information
with open(data_yaml_path, 'r') as f:
    data_config = yaml.safe_load(f)

print("\nDataset Information:")
print(f"  Classes: {data_config['nc']}")
print(f"  Class names: {data_config['names']}")

# Count images in each split
for split in ['train', 'valid', 'test']:
    split_dir = dataset_dir / split / 'images'
    if split_dir.exists():
        count = len(list(split_dir.glob('*.jpg'))) + len(list(split_dir.glob('*.png')))
        print(f"  {split.capitalize()}: {count:,} images")


Dataset Information:
  Classes: 14
  Class names: ['Aortic enlargement', 'Atelectasis', 'Calcification', 'Cardiomegaly', 'Consolidation', 'ILD', 'Infiltration', 'Lung Opacity', 'Nodule-Mass', 'Other lesion', 'Pleural effusion', 'Pleural thickening', 'Pneumothorax', 'Pulmonary fibrosis']
  Train: 10,499 images
  Valid: 3,000 images
  Test: 1,499 images


## Section 3: Class Mapping to Vietnamese

Map English class names to Vietnamese for better display in the application.

In [13]:
# Class mapping English -> Vietnamese
CLASS_MAPPING_VI = {
    "Aortic enlargement": "Ph√¨nh ƒë·ªông m·∫°ch ch·ªß",
    "Atelectasis": "X·∫πp ph·ªïi",
    "Calcification": "V√¥i h√≥a",
    "Cardiomegaly": "Tim to",
    "Consolidation": "ƒê√¥ng ƒë·∫∑c ph·ªïi",
    "ILD": "B·ªánh ph·ªïi k·∫Ω",
    "Infiltration": "Th√¢m nhi·ªÖm",
    "Lung Opacity": "ƒê·ª•c ph·ªïi",
    "Nodule/Mass": "N·ªët/Kh·ªëi u",
    "Other lesion": "T·ªïn th∆∞∆°ng kh√°c",
    "Pleural effusion": "Tr√†n d·ªãch m√†ng ph·ªïi",
    "Pleural thickening": "D√†y m√†ng ph·ªïi",
    "Pneumothorax": "Tr√†n kh√≠ m√†ng ph·ªïi",
    "Pulmonary fibrosis": "X∆° ph·ªïi",
    "Normal": "B√¨nh th∆∞·ªùng",
}

print("Class Mapping (English -> Vietnamese):")
print("=" * 80)
for eng, vie in CLASS_MAPPING_VI.items():
    print(f"  {eng:30s} -> {vie}")
print("=" * 80)

# Save mapping to configs/
config_dir = Path('configs')
config_dir.mkdir(exist_ok=True)

mapping_file = config_dir / 'class_mapping_vi.json'
with open(mapping_file, 'w', encoding='utf-8') as f:
    json.dump(CLASS_MAPPING_VI, f, ensure_ascii=False, indent=2)

print(f"\n‚úì Class mapping saved to: {mapping_file}")

Class Mapping (English -> Vietnamese):
  Aortic enlargement             -> Ph√¨nh ƒë·ªông m·∫°ch ch·ªß
  Atelectasis                    -> X·∫πp ph·ªïi
  Calcification                  -> V√¥i h√≥a
  Cardiomegaly                   -> Tim to
  Consolidation                  -> ƒê√¥ng ƒë·∫∑c ph·ªïi
  ILD                            -> B·ªánh ph·ªïi k·∫Ω
  Infiltration                   -> Th√¢m nhi·ªÖm
  Lung Opacity                   -> ƒê·ª•c ph·ªïi
  Nodule/Mass                    -> N·ªët/Kh·ªëi u
  Other lesion                   -> T·ªïn th∆∞∆°ng kh√°c
  Pleural effusion               -> Tr√†n d·ªãch m√†ng ph·ªïi
  Pleural thickening             -> D√†y m√†ng ph·ªïi
  Pneumothorax                   -> Tr√†n kh√≠ m√†ng ph·ªïi
  Pulmonary fibrosis             -> X∆° ph·ªïi
  Normal                         -> B√¨nh th∆∞·ªùng

‚úì Class mapping saved to: configs/class_mapping_vi.json


## Section 4: Analyze Dataset and Label "Normal" Images

- Identify images without bounding boxes (no abnormalities)
- Label them as "Normal" class
- Count samples per class

In [14]:
def analyze_dataset(dataset_dir: Path) -> Dict:
    """
    Analyze dataset and identify images without labels (Normal cases).
    
    Returns:
        Dictionary with analysis results
    """
    results = {
        'train': {'images': [], 'labels': [], 'normal_images': []},
        'valid': {'images': [], 'labels': [], 'normal_images': []},
        'test': {'images': [], 'labels': [], 'normal_images': []},
    }
    
    class_counts = {split: Counter() for split in ['train', 'valid', 'test']}
    
    for split in ['train', 'valid', 'test']:
        images_dir = dataset_dir / split / 'images'
        labels_dir = dataset_dir / split / 'labels'
        
        if not images_dir.exists():
            continue
        
        image_files = sorted(images_dir.glob('*.jpg')) + sorted(images_dir.glob('*.png'))
        
        for img_path in tqdm(image_files, desc=f"Analyzing {split}"):
            results[split]['images'].append(img_path)
            
            # Check if label file exists
            label_path = labels_dir / (img_path.stem + '.txt')
            
            if label_path.exists() and label_path.stat().st_size > 0:
                # Has labels - count classes
                results[split]['labels'].append(label_path)
                
                with open(label_path, 'r') as f:
                    for line in f:
                        parts = line.strip().split()
                        if len(parts) >= 5:
                            class_id = int(parts[0])
                            class_counts[split][class_id] += 1
            else:
                # No labels - Normal case
                results[split]['normal_images'].append(img_path)
    
    return results, class_counts

print("Analyzing dataset...")
analysis_results, class_counts = analyze_dataset(dataset_dir)

print("\nDataset Analysis:")
print("=" * 80)
for split in ['train', 'valid', 'test']:
    total_images = len(analysis_results[split]['images'])
    normal_images = len(analysis_results[split]['normal_images'])
    abnormal_images = total_images - normal_images
    
    print(f"\n{split.upper()}:")
    print(f"  Total images: {total_images:,}")
    print(f"  Abnormal (with labels): {abnormal_images:,}")
    print(f"  Normal (no labels): {normal_images:,} ({normal_images/total_images*100:.1f}%)")

print("\n" + "=" * 80)

Analyzing dataset...


Analyzing train: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10499/10499 [00:00<00:00, 52271.10it/s]
Analyzing valid: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3000/3000 [00:00<00:00, 64892.82it/s]
Analyzing test: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1499/1499 [00:00<00:00, 65956.75it/s]


Dataset Analysis:

TRAIN:
  Total images: 10,499
  Abnormal (with labels): 3,051
  Normal (no labels): 7,448 (70.9%)

VALID:
  Total images: 3,000
  Abnormal (with labels): 919
  Normal (no labels): 2,081 (69.4%)

TEST:
  Total images: 1,499
  Abnormal (with labels): 424
  Normal (no labels): 1,075 (71.7%)






In [15]:
# Display class distribution
print("\nClass Distribution (Abnormal Cases):")
print("=" * 80)

class_names = data_config['names']

for split in ['train', 'valid', 'test']:
    print(f"\n{split.upper()}:")
    total_annotations = sum(class_counts[split].values())
    
    for class_id in sorted(class_counts[split].keys()):
        count = class_counts[split][class_id]
        class_name = class_names[class_id] if class_id < len(class_names) else f"Class_{class_id}"
        class_name_vi = CLASS_MAPPING_VI.get(class_name, class_name)
        
        print(f"  [{class_id:2d}] {class_name:30s} ({class_name_vi:30s}): {count:5,} ({count/total_annotations*100:5.2f}%)")

print("\n" + "=" * 80)


Class Distribution (Abnormal Cases):

TRAIN:
  [ 0] Aortic enlargement             (Ph√¨nh ƒë·ªông m·∫°ch ch·ªß           ): 2,134 (20.04%)
  [ 1] Atelectasis                    (X·∫πp ph·ªïi                      ):   131 ( 1.23%)
  [ 2] Calcification                  (V√¥i h√≥a                       ):   304 ( 2.85%)
  [ 3] Cardiomegaly                   (Tim to                        ): 1,590 (14.93%)
  [ 4] Consolidation                  (ƒê√¥ng ƒë·∫∑c ph·ªïi                 ):   241 ( 2.26%)
  [ 5] ILD                            (B·ªánh ph·ªïi k·∫Ω                  ):   246 ( 2.31%)
  [ 6] Infiltration                   (Th√¢m nhi·ªÖm                    ):   434 ( 4.08%)
  [ 7] Lung Opacity                   (ƒê·ª•c ph·ªïi                      ):   909 ( 8.54%)
  [ 8] Nodule-Mass                    (Nodule-Mass                   ):   577 ( 5.42%)
  [ 9] Other lesion                   (T·ªïn th∆∞∆°ng kh√°c               ):   796 ( 7.48%)
  [10] Pleural effusion               (Tr√†n

## Section 5: Sample "Normal" Images (30% ~ 2000 images)

Keep only 30% of "Normal" images to balance the dataset.

In [16]:
def sample_normal_images(
    normal_images: List[Path],
    target_ratio: float = 0.3,
    max_count: int = 2000,
    random_seed: int = 42
) -> List[Path]:
    random.seed(random_seed)
    
    target_count = min(int(len(normal_images) * target_ratio), max_count)
    sampled = random.sample(normal_images, target_count)
    
    return sampled

# Sample normal images for each split
sampled_normal = {}

print("Sampling 30% of Normal images...")
print("=" * 80)

for split in ['train', 'valid', 'test']:
    normal_images = analysis_results[split]['normal_images']
    
    if len(normal_images) == 0:
        sampled_normal[split] = []
        continue
    
    sampled = sample_normal_images(normal_images, target_ratio=0.3, max_count=2000)
    sampled_normal[split] = sampled
    
    print(f"  {split.upper()}:")
    print(f"    Original Normal images: {len(normal_images):,}")
    print(f"    Sampled Normal images: {len(sampled):,} ({len(sampled)/len(normal_images)*100:.1f}%)")

print("=" * 80)
print(f"\n‚úì Total sampled Normal images: {sum(len(sampled_normal[s]) for s in ['train', 'valid', 'test']):,}")

Sampling 30% of Normal images...
  TRAIN:
    Original Normal images: 7,448
    Sampled Normal images: 2,000 (26.9%)
  VALID:
    Original Normal images: 2,081
    Sampled Normal images: 624 (30.0%)
  TEST:
    Original Normal images: 1,075
    Sampled Normal images: 322 (30.0%)

‚úì Total sampled Normal images: 2,946


## Section 6: Remove Classes with Low Sample Count

Remove classes that have fewer than threshold samples (default 700).

In [19]:
def filter_classes_by_count(
    class_counts: Dict[str, Counter],
    min_samples: int = 700
) -> Tuple[List[int], List[int]]:
    """
    Filter classes based on minimum sample count.
    
    Args:
        class_counts: Dictionary of class counts per split
        min_samples: Minimum number of samples required
    
    Returns:
        Tuple of (kept_class_ids, removed_class_ids)
    """
    # Count total samples per class across all splits
    total_counts = Counter()
    for split_counts in class_counts.values():
        total_counts.update(split_counts)
    
    # Filter classes
    kept_classes = [class_id for class_id, count in total_counts.items() if count >= min_samples]
    removed_classes = [class_id for class_id, count in total_counts.items() if count < min_samples]
    
    return sorted(kept_classes), sorted(removed_classes)

# Filter classes with < 1000 samples
MIN_SAMPLES = 1000

kept_classes, removed_classes = filter_classes_by_count(class_counts, min_samples=MIN_SAMPLES)

print(f"\nFiltering Classes (min_samples={MIN_SAMPLES})")
print("=" * 80)

# Count total samples per class
total_class_counts = Counter()
for split_counts in class_counts.values():
    total_class_counts.update(split_counts)

print("\nKEPT CLASSES:")
for class_id in kept_classes:
    class_name = class_names[class_id] if class_id < len(class_names) else f"Class_{class_id}"
    class_name_vi = CLASS_MAPPING_VI.get(class_name, class_name)
    count = total_class_counts[class_id]
    print(f"  [{class_id:2d}] {class_name:30s} ({class_name_vi:30s}): {count:5,} samples")

print("\nREMOVED CLASSES:")
for class_id in removed_classes:
    class_name = class_names[class_id] if class_id < len(class_names) else f"Class_{class_id}"
    class_name_vi = CLASS_MAPPING_VI.get(class_name, class_name)
    count = total_class_counts[class_id]
    print(f"  [{class_id:2d}] {class_name:30s} ({class_name_vi:30s}): {count:5,} samples (< {MIN_SAMPLES})")

print("=" * 80)
print(f"\n‚úì Kept {len(kept_classes)} classes, removed {len(removed_classes)} classes")


Filtering Classes (min_samples=1000)

KEPT CLASSES:
  [ 0] Aortic enlargement             (Ph√¨nh ƒë·ªông m·∫°ch ch·ªß           ): 3,067 samples
  [ 3] Cardiomegaly                   (Tim to                        ): 2,300 samples
  [ 7] Lung Opacity                   (ƒê·ª•c ph·ªïi                      ): 1,322 samples
  [ 9] Other lesion                   (T·ªïn th∆∞∆°ng kh√°c               ): 1,134 samples
  [10] Pleural effusion               (Tr√†n d·ªãch m√†ng ph·ªïi           ): 1,032 samples
  [11] Pleural thickening             (D√†y m√†ng ph·ªïi                 ): 1,981 samples
  [13] Pulmonary fibrosis             (X∆° ph·ªïi                       ): 1,617 samples

REMOVED CLASSES:
  [ 1] Atelectasis                    (X·∫πp ph·ªïi                      ):   186 samples (< 1000)
  [ 2] Calcification                  (V√¥i h√≥a                       ):   452 samples (< 1000)
  [ 4] Consolidation                  (ƒê√¥ng ƒë·∫∑c ph·ªïi                 ):   353 samples (< 1000

## Section 7: Create Preprocessed Dataset

Apply preprocessing pipeline to all images:
1. Grayscale conversion (if needed)
2. Histogram equalization
3. Normalization to [0, 1]

Save preprocessed images and updated labels to `data/preprocessed/`.

In [20]:
def create_preprocessed_dataset(
    dataset_dir: Path,
    output_dir: Path,
    analysis_results: Dict,
    sampled_normal: Dict,
    kept_classes: List[int],
    class_names: List[str],
    class_mapping_vi: Dict[str, str],
):
    """
    Create preprocessed dataset with filtered images and labels.
    
    - Apply preprocessing to all images
    - Keep only selected normal images
    - Filter out removed classes
    - Update class IDs to be sequential
    - Create "Normal" class (class_id = len(kept_classes))
    """
    # Create output directory structure
    for split in ['train', 'valid', 'test']:
        (output_dir / split / 'images').mkdir(parents=True, exist_ok=True)
        (output_dir / split / 'labels').mkdir(parents=True, exist_ok=True)
    
    # Create class ID mapping (old -> new)
    old_to_new_id = {old_id: new_id for new_id, old_id in enumerate(kept_classes)}
    
    # Add "Normal" class as last class
    normal_class_id = len(kept_classes)
    
    # Process each split
    stats = {'train': {}, 'valid': {}, 'test': {}}
    
    for split in ['train', 'valid', 'test']:
        print(f"\nProcessing {split.upper()} split...")
        
        processed_count = 0
        skipped_count = 0
        normal_count = 0
        
        # Process abnormal images (with labels)
        images_with_labels = [
            img for img in analysis_results[split]['images']
            if img not in analysis_results[split]['normal_images']
        ]
        
        for img_path in tqdm(images_with_labels, desc=f"  Abnormal images"):
            label_path = dataset_dir / split / 'labels' / (img_path.stem + '.txt')
            
            if not label_path.exists():
                continue
            
            # Read and filter labels
            filtered_labels = []
            with open(label_path, 'r') as f:
                for line in f:
                    parts = line.strip().split()
                    if len(parts) >= 5:
                        old_class_id = int(parts[0])
                        
                        # Keep only classes in kept_classes
                        if old_class_id in old_to_new_id:
                            new_class_id = old_to_new_id[old_class_id]
                            filtered_labels.append(f"{new_class_id} {' '.join(parts[1:])}")
            
            # Skip images with no valid labels after filtering
            if len(filtered_labels) == 0:
                skipped_count += 1
                continue
            
            # Preprocess image
            try:
                img = Image.open(img_path).convert('L')
                img_array = np.array(img)
                preprocessed = preprocess_image(img_array, apply_normalization=False)
                
                # Convert back to uint8 for saving
                preprocessed_uint8 = (preprocessed).astype(np.uint8)
                
                # Save preprocessed image
                output_img_path = output_dir / split / 'images' / img_path.name
                Image.fromarray(preprocessed_uint8).save(output_img_path)
                
                # Save filtered labels
                output_label_path = output_dir / split / 'labels' / (img_path.stem + '.txt')
                with open(output_label_path, 'w') as f:
                    f.write('\n'.join(filtered_labels))
                
                processed_count += 1
                
            except Exception as e:
                print(f"    Error processing {img_path.name}: {e}")
                skipped_count += 1
        
        # Process normal images (sampled)
        for img_path in tqdm(sampled_normal[split], desc=f"  Normal images"):
            try:
                img = Image.open(img_path).convert('L')
                img_array = np.array(img)
                preprocessed = preprocess_image(img_array, apply_normalization=False)
                
                # Convert back to uint8 for saving
                preprocessed_uint8 = (preprocessed).astype(np.uint8)
                
                # Save preprocessed image
                output_img_path = output_dir / split / 'images' / img_path.name
                Image.fromarray(preprocessed_uint8).save(output_img_path)
                
                # Create empty label file (no bounding box for normal)
                output_label_path = output_dir / split / 'labels' / (img_path.stem + '.txt')
                output_label_path.touch()
                
                normal_count += 1
                
            except Exception as e:
                print(f"    Error processing {img_path.name}: {e}")
        
        stats[split] = {
            'abnormal': processed_count,
            'normal': normal_count,
            'skipped': skipped_count,
            'total': processed_count + normal_count
        }
        
        print(f"    ‚úì Processed {processed_count:,} abnormal images")
        print(f"    ‚úì Processed {normal_count:,} normal images")
        print(f"    ‚ö† Skipped {skipped_count:,} images")
    
    # Create updated data.yaml
    new_class_names = [class_names[old_id] for old_id in kept_classes] + ["Normal"]
    new_class_names_vi = [class_mapping_vi.get(name, name) for name in new_class_names]
    
    data_yaml = {
        'path': str(output_dir.absolute()),
        'train': 'train/images',
        'val': 'valid/images',
        'test': 'test/images',
        'nc': len(new_class_names),
        'names': new_class_names,
    }
    
    with open(output_dir / 'data.yaml', 'w') as f:
        yaml.dump(data_yaml, f, default_flow_style=False, sort_keys=False)
    
    # Create data_vi.yaml with Vietnamese class names
    data_yaml_vi = data_yaml.copy()
    data_yaml_vi['names'] = new_class_names_vi
    
    with open(output_dir / 'data_vi.yaml', 'w', encoding='utf-8') as f:
        yaml.dump(data_yaml_vi, f, default_flow_style=False, sort_keys=False, allow_unicode=True)
    
    return stats, new_class_names, new_class_names_vi

# Create preprocessed dataset
output_dir = Path('data/preprocessed')

print("\nCreating Preprocessed Dataset")
print("=" * 80)

stats, new_class_names, new_class_names_vi = create_preprocessed_dataset(
    dataset_dir=dataset_dir,
    output_dir=output_dir,
    analysis_results=analysis_results,
    sampled_normal=sampled_normal,
    kept_classes=kept_classes,
    class_names=class_names,
    class_mapping_vi=CLASS_MAPPING_VI,
)

print("\n" + "=" * 80)
print("‚úì Preprocessed dataset created successfully!")
print(f"  Output directory: {output_dir.absolute()}")


Creating Preprocessed Dataset

Processing TRAIN split...


  Abnormal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 3051/3051 [08:42<00:00,  5.84it/s]
  Normal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2000/2000 [05:43<00:00,  5.83it/s]


    ‚úì Processed 3,019 abnormal images
    ‚úì Processed 2,000 normal images
    ‚ö† Skipped 32 images

Processing VALID split...


  Abnormal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 919/919 [02:38<00:00,  5.80it/s]
  Normal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 624/624 [01:45<00:00,  5.90it/s]


    ‚úì Processed 906 abnormal images
    ‚úì Processed 624 normal images
    ‚ö† Skipped 13 images

Processing TEST split...


  Abnormal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 424/424 [01:15<00:00,  5.59it/s]
  Normal images: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 322/322 [00:57<00:00,  5.58it/s]

    ‚úì Processed 423 abnormal images
    ‚úì Processed 322 normal images
    ‚ö† Skipped 1 images

‚úì Preprocessed dataset created successfully!
  Output directory: /home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray/data/preprocessed





## Section 8: Final Summary

In [21]:
# Display final statistics
print("\nFINAL DATASET SUMMARY")
print("=" * 80)

print("\nDataset Location:")
print(f"  {output_dir.absolute()}")

print("\n Classes (English):")
for i, name in enumerate(new_class_names):
    print(f"  [{i:2d}] {name}")

print("\n Classes (Ti·∫øng Vi·ªát):")
for i, name in enumerate(new_class_names_vi):
    print(f"  [{i:2d}] {name}")

print("\nImages per Split:")
for split in ['train', 'valid', 'test']:
    print(f"\n  {split.upper()}:")
    print(f"    Total: {stats[split]['total']:,}")
    print(f"    Abnormal: {stats[split]['abnormal']:,} ({stats[split]['abnormal']/stats[split]['total']*100:.1f}%)")
    print(f"    Normal: {stats[split]['normal']:,} ({stats[split]['normal']/stats[split]['total']*100:.1f}%)")
    print(f"    Skipped: {stats[split]['skipped']:,}")

total_images = sum(stats[s]['total'] for s in ['train', 'valid', 'test'])
total_abnormal = sum(stats[s]['abnormal'] for s in ['train', 'valid', 'test'])
total_normal = sum(stats[s]['normal'] for s in ['train', 'valid', 'test'])

print("\n" + "=" * 80)
print(f"\nTOTAL: {total_images:,} images")
print(f"   Abnormal: {total_abnormal:,} ({total_abnormal/total_images*100:.1f}%)")
print(f"   Normal: {total_normal:,} ({total_normal/total_images*100:.1f}%)")
print("=" * 80)


FINAL DATASET SUMMARY

Dataset Location:
  /home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray/data/preprocessed

 Classes (English):
  [ 0] Aortic enlargement
  [ 1] Cardiomegaly
  [ 2] Lung Opacity
  [ 3] Other lesion
  [ 4] Pleural effusion
  [ 5] Pleural thickening
  [ 6] Pulmonary fibrosis
  [ 7] Normal

 Classes (Ti·∫øng Vi·ªát):
  [ 0] Ph√¨nh ƒë·ªông m·∫°ch ch·ªß
  [ 1] Tim to
  [ 2] ƒê·ª•c ph·ªïi
  [ 3] T·ªïn th∆∞∆°ng kh√°c
  [ 4] Tr√†n d·ªãch m√†ng ph·ªïi
  [ 5] D√†y m√†ng ph·ªïi
  [ 6] X∆° ph·ªïi
  [ 7] B√¨nh th∆∞·ªùng

Images per Split:

  TRAIN:
    Total: 5,019
    Abnormal: 3,019 (60.2%)
    Normal: 2,000 (39.8%)
    Skipped: 32

  VALID:
    Total: 1,530
    Abnormal: 906 (59.2%)
    Normal: 624 (40.8%)
    Skipped: 13

  TEST:
    Total: 745
    Abnormal: 423 (56.8%)
    Normal: 322 (43.2%)
    Skipped: 1


TOTAL: 7,294 images
   Abnormal: 4,348 (59.6%)
   Normal: 2,946 (40.4%)


## Section 9: (Optional) Create Gaussian Blur Augmented Data

T·∫°o th√™m augmented versions c·ªßa training images v·ªõi Gaussian blur.
- Ch·ªâ augment **training set** (kh√¥ng augment valid/test)
- M·ªói ·∫£nh t·∫°o th√™m 1 augmented version
- L∆∞u v√†o folder m·ªõi: `data/preprocessed_with_aug/`
- C·∫≠p nh·∫≠t data.yaml t∆∞∆°ng ·ª©ng

**L∆∞u √Ω:** Section n√†y l√† OPTIONAL. Ch·∫°y n·∫øu mu·ªën th√™m Gaussian blur augmentation.

In [None]:
def create_augmented_dataset(
    source_dir: Path,
    output_dir: Path,
    augment_train_only: bool = True,
    num_augmentations: int = 1,
):
    from backend.src.utils.augmentation import augment_image
    
    print(f"\nüé® Creating Augmented Dataset")
    print("=" * 80)
    print(f"  Source: {source_dir}")
    print(f"  Output: {output_dir}")
    print(f"  Augment train only: {augment_train_only}")
    print(f"  Augmentations per image: {num_augmentations}")
    print("=" * 80)
    
    # Determine which splits to process
    splits_to_augment = ['train'] if augment_train_only else ['train', 'valid', 'test']
    all_splits = ['train', 'valid', 'test']
    
    aug_stats = {}
    
    for split in all_splits:
        source_images_dir = source_dir / split / 'images'
        source_labels_dir = source_dir / split / 'labels'
        
        output_images_dir = output_dir / split / 'images'
        output_labels_dir = output_dir / split / 'labels'
        
        output_images_dir.mkdir(parents=True, exist_ok=True)
        output_labels_dir.mkdir(parents=True, exist_ok=True)
        
        if not source_images_dir.exists():
            continue
        
        print(f"\nProcessing {split.upper()} split...")
        
        # Get all images
        image_files = list(source_images_dir.glob('*.jpg')) + list(source_images_dir.glob('*.png'))
        
        original_count = 0
        augmented_count = 0
        
        # Copy original images
        for img_path in tqdm(image_files, desc=f"  Copying originals"):
            # Copy image
            shutil.copy(img_path, output_images_dir / img_path.name)
            
            # Copy label
            label_path = source_labels_dir / (img_path.stem + '.txt')
            if label_path.exists():
                shutil.copy(label_path, output_labels_dir / label_path.name)
            
            original_count += 1
        
        # Create augmented versions (only for specified splits)
        if split in splits_to_augment:
            for img_path in tqdm(image_files, desc=f"  Creating augmented versions"):
                # Load image
                img = Image.open(img_path).convert('L')
                img_array = np.array(img)
                
                # Create N augmented versions
                for aug_idx in range(num_augmentations):
                    # Apply Gaussian blur augmentation
                    img_augmented = augment_image(img_array, augmentation_probability=1.0)
                    
                    # Save augmented image with suffix
                    aug_img_name = f"{img_path.stem}_aug{aug_idx+1}{img_path.suffix}"
                    aug_img_path = output_images_dir / aug_img_name
                    Image.fromarray(img_augmented).save(aug_img_path)
                    
                    # Copy label with same suffix
                    label_path = source_labels_dir / (img_path.stem + '.txt')
                    if label_path.exists():
                        aug_label_name = f"{img_path.stem}_aug{aug_idx+1}.txt"
                        aug_label_path = output_labels_dir / aug_label_name
                        shutil.copy(label_path, aug_label_path)
                    
                    augmented_count += 1
        
        total_count = original_count + augmented_count
        aug_stats[split] = {
            'original': original_count,
            'augmented': augmented_count,
            'total': total_count
        }
        
        print(f"    ‚úì Original: {original_count:,}")
        print(f"    ‚úì Augmented: {augmented_count:,}")
        print(f"    ‚úì Total: {total_count:,}")
    
    # Copy and update data.yaml
    source_yaml = source_dir / 'data.yaml'
    output_yaml = output_dir / 'data.yaml'
    
    with open(source_yaml, 'r') as f:
        data_yaml = yaml.safe_load(f)
    
    # Update path
    data_yaml['path'] = str(output_dir.absolute())
    
    with open(output_yaml, 'w') as f:
        yaml.dump(data_yaml, f, default_flow_style=False, sort_keys=False)
    
    # Copy data_vi.yaml if exists
    source_yaml_vi = source_dir / 'data_vi.yaml'
    if source_yaml_vi.exists():
        output_yaml_vi = output_dir / 'data_vi.yaml'
        with open(source_yaml_vi, 'r', encoding='utf-8') as f:
            data_yaml_vi = yaml.safe_load(f)
        data_yaml_vi['path'] = str(output_dir.absolute())
        with open(output_yaml_vi, 'w', encoding='utf-8') as f:
            yaml.dump(data_yaml_vi, f, default_flow_style=False, sort_keys=False, allow_unicode=True)
    
    print("\n" + "=" * 80)
    print("‚úì Augmented dataset created successfully!")
    print(f"  Output directory: {output_dir.absolute()}")
    print("=" * 80)
    
    return aug_stats


augmented_output_dir = Path('data/preprocessed_with_aug')

aug_stats = create_augmented_dataset(
    source_dir=output_dir,
    output_dir=augmented_output_dir,
    augment_train_only=True,
    num_augmentations=1,
)

print("\nAugmented Dataset Summary:")
print("=" * 80)
for split in ['train', 'valid', 'test']:
    print(f"\n{split.upper()}:")
    print(f"  Original images: {aug_stats[split]['original']:,}")
    print(f"  Augmented images: {aug_stats[split]['augmented']:,}")
    print(f"  Total images: {aug_stats[split]['total']:,}")

total_original = sum(aug_stats[s]['original'] for s in ['train', 'valid', 'test'])
total_augmented = sum(aug_stats[s]['augmented'] for s in ['train', 'valid', 'test'])
total_all = sum(aug_stats[s]['total'] for s in ['train', 'valid', 'test'])

print("\n" + "=" * 80)
print(f"\nGRAND TOTAL:")
print(f"  Original: {total_original:,}")
print(f"  Augmented: {total_augmented:,}")
print(f"  Total: {total_all:,}")
print(f"  Augmentation ratio: {total_augmented/total_original*100:.1f}%")
print("=" * 80)



üé® Creating Augmented Dataset
  Source: data/preprocessed
  Output: data/preprocessed_with_aug
  Augment train only: True
  Augmentations per image: 1

üîÑ Processing TRAIN split...


  Copying originals: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5019/5019 [00:00<00:00, 8474.65it/s]
  Creating augmented versions: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5019/5019 [02:11<00:00, 38.11it/s]


    ‚úì Original: 5,019
    ‚úì Augmented: 5,019
    ‚úì Total: 10,038

üîÑ Processing VALID split...


  Copying originals: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1530/1530 [00:00<00:00, 8390.74it/s]
  Copying originals: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1530/1530 [00:00<00:00, 8390.74it/s]


    ‚úì Original: 1,530
    ‚úì Augmented: 0
    ‚úì Total: 1,530

üîÑ Processing TEST split...


  Copying originals: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 745/745 [00:00<00:00, 8002.12it/s]

    ‚úì Original: 745
    ‚úì Augmented: 0
    ‚úì Total: 745

‚úì Augmented dataset created successfully!
  Output directory: /home/minhquana/workspace/project_DeepLearning/computer_vision/Abnormal-Prediction-In-Chest-X-Ray/data/preprocessed_with_aug

üìä Augmented Dataset Summary:

TRAIN:
  Original images: 5,019
  Augmented images: 5,019
  Total images: 10,038

VALID:
  Original images: 1,530
  Augmented images: 0
  Total images: 1,530

TEST:
  Original images: 745
  Augmented images: 0
  Total images: 745


GRAND TOTAL:
  Original: 7,294
  Augmented: 5,019
  Total: 12,313
  Augmentation ratio: 68.8%

üí° To use augmented data for training:
   Update data_yaml in train_yolov11s.ipynb to:
   data_yaml = Path('data/preprocessed_with_aug/data.yaml')



