# 📊 SmartCash Dataset Preparation

This notebook walks through preparing the dataset used to train the SmartCash model.

## 📋 Table of Contents
1. [Setup Environment](#setup)
2. [Dataset Preparation](#dataset)
3. [Dataset Validation](#validasi)

## 1. Setup Environment <a id='setup'></a>

First, set up the environment and import the required libraries:

In [1]:
import os
import sys
from pathlib import Path

# Add project root to path
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import required modules
import yaml
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify project structure
required_dirs = ['configs', 'data', 'smartcash']
missing_dirs = [d for d in required_dirs if not (project_root / d).exists()]

if missing_dirs:
    raise RuntimeError(
        f"Missing required directories: {missing_dirs}\n"
        f"Please run this notebook from the 'notebooks' directory"
    )

print(f"✅ Project root: {project_root}")

✅ Project root: /Users/masdevid/Projects/smartcash
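
The cell above assumes the notebook is started from the `notebooks` directory, because it takes the parent of the current working directory as the project root. As a minimal sketch of a less fragile alternative, the helper below walks upward until it finds a directory containing all three required folders. The name `find_project_root` is hypothetical and not part of SmartCash.

```python
from pathlib import Path


def find_project_root(start: Path, markers=("configs", "data", "smartcash")) -> Path:
    """Return `start` or its nearest ancestor that contains every marker directory."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        # A candidate qualifies only if all marker directories exist inside it
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    raise RuntimeError(f"No directory containing {list(markers)} found at or above {start}")
```

With this helper, `project_root = find_project_root(Path.cwd())` would resolve to the same root whether the notebook runs from `notebooks` or from a deeper subfolder.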


### 1.1 Load Configuration

Load the configuration from `base_config.yaml`:

In [2]:
# Load config
config_path = project_root / 'configs' / 'base_config.yaml'

if not config_path.exists():
    raise FileNotFoundError(
        f"Config file not found: {config_path}\n"
        f"Please create base_config.yaml in the configs directory"
    )

with open(config_path, encoding='utf-8') as f:
    config = yaml.safe_load(f)

print(f"✅ Loaded config from: {config_path}")

✅ Loaded config from: /Users/masdevid/Projects/smartcash/configs/base_config.yaml
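
Loading the file only confirms that it parses as YAML. It does not confirm that the settings later cells rely on are present. The sketch below reports which dotted keys are missing from a nested config dictionary. The helper name `missing_config_keys` and the example keys are hypothetical, since the contents of `base_config.yaml` are not shown here.

```python
def missing_config_keys(config, required):
    """Return the dotted keys from `required` that are absent from a nested config dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            # Stop descending once a level is missing or is not a mapping
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Hypothetical keys, for illustration only
example = {"model": {"img_size": 640}, "dataset": {}}
print(missing_config_keys(example, ["model.img_size", "dataset.classes"]))
```

Checking the keys right after loading makes a misconfigured file fail here rather than deep inside the preprocessing loop.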


### 1.2 Import SmartCash Modules

Import the required SmartCash modules:

In [3]:
try:
    from smartcash.utils.logger import SmartCashLogger
    from smartcash.handlers.data_handler import DataHandler
    from smartcash.handlers.roboflow_handler import RoboflowHandler
    from smartcash.utils.preprocessing import ImagePreprocessor
    
    print("✅ Successfully imported SmartCash modules")
except ImportError as e:
    # Chain the original exception so the traceback shows which module failed
    raise ImportError(
        f"Failed to import SmartCash modules: {e}\n"
        "Please make sure all required modules are installed"
    ) from e

✅ Successfully imported SmartCash modules


## 2. Dataset Preparation <a id='dataset'></a>

Set up the data handler and start preparing the dataset:

In [4]:
# Initialize logger
logger = SmartCashLogger('dataset_preparation')

# Setup data directories
data_dir = project_root / 'data'
raw_dir = data_dir / 'raw'
processed_dir = data_dir / 'processed'

# Create directories if they don't exist
raw_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

logger.info(f"Data directories setup complete:\n"
           f"Raw data: {raw_dir}\n"
           f"Processed data: {processed_dir}")

2025-02-20 15:54:49 - ℹ️ Data directories setup complete:
Raw data: /Users/masdevid/Projects/smartcash/data/raw
Processed data: /Users/masdevid/Projects/smartcash/data/processed


### 2.1 Load Dataset

Choose the dataset source (local or Roboflow):

In [5]:
# Choose data source
USE_ROBOFLOW = False  # Set True to use Roboflow

try:
    if USE_ROBOFLOW:
        # Check API key
        api_key = os.getenv('ROBOFLOW_API_KEY')
        if not api_key:
            raise ValueError(
                "ROBOFLOW_API_KEY not found in environment variables\n"
                "Please set it in .env file"
            )
            
        handler = RoboflowHandler(
            config_path=str(config_path),
            data_dir=str(raw_dir),
            api_key=api_key,
            logger=logger
        )
        source_dir = handler.download_dataset()
    else:
        handler = DataHandler(
            config_path=str(config_path),
            data_dir=str(raw_dir),
            logger=logger
        )
        source_dir = str(raw_dir)
        
    logger.success(f"Dataset loaded from: {source_dir}")
except Exception as e:
    logger.error(f"Failed to load dataset: {str(e)}")
    raise

2025-02-20 15:54:49 - ✅ Dataset loaded from: /Users/masdevid/Projects/smartcash/data/raw


### 2.2 Process Dataset

Process the dataset for each data split:

In [6]:
# Initialize preprocessor
preprocessor = ImagePreprocessor(
    config_path=str(config_path),
    logger=logger
)

# Process each split
splits = ['train', 'valid', 'test']

for split in splits:
    logger.info(f"Processing {split} split...")
    
    try:
        # Setup split directories
        split_dir = Path(source_dir) / split
        out_dir = processed_dir / split
        out_dir.mkdir(parents=True, exist_ok=True)
        
        # Get image and label files
        image_dir = split_dir / 'images'
        label_dir = split_dir / 'labels'
        
        if not image_dir.exists() or not label_dir.exists():
            logger.warning(f"Skipping {split}: directories not found")
            continue
            
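        image_files = sorted(image_dir.glob('*.jpg'))

        # Pair each image with the label that shares its file stem; zipping two
        # sorted lists misaligns every later pair once a single file is missing
        for img_path in image_files:
            lbl_path = label_dir / f"{img_path.stem}.txt"
            if not lbl_path.exists():
                logger.warning(f"Skipping {img_path.name}: no matching label file")
                continue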
            try:
                # Process and save
                preprocessor.process_image_and_label(
                    image_path=str(img_path),
                    label_path=str(lbl_path),
                    save_dir=str(out_dir),
                    augment=(split == 'train')
                )
            except Exception as e:
                logger.warning(f"Failed to process {img_path.name}: {str(e)}")
                continue
                
        logger.success(f"Completed processing {split} split")
        
    except Exception as e:
        logger.error(f"Failed to process {split} split: {str(e)}")
        continue

1 validation error for InitSchema
size
  Input should be a valid tuple [type=tuple_type, input_value=640, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/tuple_type


ValidationError: 6 validation errors for InitSchema
p
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
scale
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
ratio
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
size
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
interpolation
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
mask_interpolation
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
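
The first error reports `input_value=640, input_type=int` for a field named `size` that expects a tuple. The fields listed in the second error (`p`, `scale`, `ratio`, `size`, `interpolation`, `mask_interpolation`) appear to match Albumentations' `RandomResizedCrop`. That attribution is an assumption, because the code of `ImagePreprocessor` is not shown in this notebook. As a sketch under that assumption, the hypothetical helper below converts a single integer from the config into the `(height, width)` pair such a field expects.

```python
def as_size_tuple(size):
    """Turn an int (square side) or a 2-item sequence into a (height, width) tuple."""
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(size, int) and not isinstance(size, bool):
        return (size, size)
    if isinstance(size, (list, tuple)) and len(size) == 2:
        height, width = size
        return (int(height), int(width))
    raise ValueError(f"Expected an int or a pair of ints, got {size!r}")


print(as_size_tuple(640))         # (640, 640)
print(as_size_tuple([480, 640]))  # (480, 640)
```

The result could then be passed as the `size` argument wherever the preprocessor builds the transform. The exact place depends on the `ImagePreprocessor` implementation.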

## 3. Dataset Validation <a id='validasi'></a>

Validate the output of the preprocessing step:

In [None]:
# Validate processed dataset
validation_results = {}

for split in splits:
    logger.info(f"Validating {split} split...")
    
    try:
        # Get processed directories
        split_dir = processed_dir / split
        image_dir = split_dir / 'images'
        label_dir = split_dir / 'labels'
        
        # Check directories exist
        if not image_dir.exists() or not label_dir.exists():
            raise FileNotFoundError(f"Missing directories for {split} split")
            
        # Count files
        image_files = list(image_dir.glob('*.jpg'))
        label_files = list(label_dir.glob('*.txt'))
        
        # Store results
        validation_results[split] = {
            'images': len(image_files),
            'labels': len(label_files),
            'status': 'OK' if len(image_files) == len(label_files) else 'ERROR'
        }
        
    except Exception as e:
        validation_results[split] = {
            'images': 0,
            'labels': 0,
            'status': f'ERROR: {str(e)}'
        }

# Print summary
logger.info("\nDataset Validation Summary:")
for split, result in validation_results.items():
    # Log failing splits at error level so they stand out in the summary
    log = logger.info if result['status'] == 'OK' else logger.error
    log(
        f"{split}: {result['images']} images, {result['labels']} labels "
        f"[{result['status']}]"
    )
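
Comparing counts alone can report `OK` even when the files do not correspond, for example when one image lacks a label and one stray label has no image. As a minimal sketch of a stricter check, the helper below compares file stems. The name `find_unpaired` is hypothetical, and it assumes the same `*.jpg` and `*.txt` naming used by the cells above.

```python
from pathlib import Path


def find_unpaired(image_dir, label_dir):
    """Return (images without a label, labels without an image) as sorted lists of stems."""
    image_stems = {p.stem for p in Path(image_dir).glob("*.jpg")}
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}
    # Set differences give the stems that exist on only one side
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

A split is consistent when both returned lists are empty. Otherwise the stems name exactly which files need attention.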