# **Splitting Data**

Data splitting separates the dataset into *training*, *testing*, and *validation* sets.

### **Import Data, Libraries, and Modules**

Mount Google Drive in Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import shutil
from sklearn.model_selection import train_test_split

Access the dataset from Google Drive

In [3]:
image_dir = '/content/drive/MyDrive/Skripsi/Final/Raw/Resized'
mask_dir = '/content/drive/MyDrive/Skripsi/Final/Mask/Resized'
output_dir = '/content/drive/MyDrive/Skripsi/Final/Split/Resized'

os.makedirs(output_dir, exist_ok=True)

In [4]:
# Collect the image files, then keep only the masks that share a name with an image
image_files = sorted(f for f in os.listdir(image_dir) if f.endswith('.jpg'))
image_names = set(image_files)
mask_files = sorted(f for f in os.listdir(mask_dir) if f in image_names)

# Keep only valid (image, mask) pairs; an image and its mask share the same file name
mask_names = set(mask_files)
matched_files = [(img, img) for img in image_files if img in mask_names]
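
The pairing step silently drops any image that has no mask. A minimal sketch for reporting which files are unmatched is given below; the helper `find_unmatched` and the file names are hypothetical and are not part of the original notebook.

```python
def find_unmatched(image_names, mask_names):
    """Return (images without a mask, masks without an image), both sorted."""
    images, masks = set(image_names), set(mask_names)
    return sorted(images - masks), sorted(masks - images)

# Hypothetical file names, for illustration only
images = ['a.jpg', 'b.jpg', 'c.jpg']
masks = ['a.jpg', 'c.jpg', 'd.jpg']

missing_masks, orphan_masks = find_unmatched(images, masks)
print("Images without a mask:", missing_masks)
print("Masks without an image:", orphan_masks)
```

In the notebook, the same helper could be called with `os.listdir(image_dir)` and `os.listdir(mask_dir)` before the split, so that missing masks are noticed instead of being skipped.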

### **Splitting Data**

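Split the data into 70% training, 20% testing, and 10% validation. The split is done in two steps: 20% of the data is first held out for testing, then 12.5% of the remaining 80% (0.125 × 0.8 = 0.1 of the full dataset) is set aside for validation.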

In [5]:
# Split into 70% training, 20% testing, and 10% validation data
train_val_pairs, test_pairs = train_test_split(matched_files, test_size=0.2, random_state=42)
train_pairs, val_pairs = train_test_split(train_val_pairs, test_size=0.125, random_state=42)

def copy_files(pairs, split):
    img_out = os.path.join(output_dir, 'Images', split)
    mask_out = os.path.join(output_dir, 'Labels', split)
    os.makedirs(img_out, exist_ok=True)
    os.makedirs(mask_out, exist_ok=True)

    for img_file, mask_file in pairs:
        shutil.copy(os.path.join(image_dir, img_file), os.path.join(img_out, img_file))
        shutil.copy(os.path.join(mask_dir, mask_file), os.path.join(mask_out, mask_file))

copy_files(train_pairs, 'Train')
copy_files(val_pairs, 'Val')
copy_files(test_pairs, 'Test')

In [6]:
print(f"Total data: {len(matched_files)}")
print(f"Train: {len(train_pairs)}")
print(f"Val: {len(val_pairs)}")
print(f"Test: {len(test_pairs)}")
print("Data splitting complete!")

Total data: 1690
Train: 1183
Val: 169
Test: 338
Data splitting complete!


Of the 1690 images, 1183 are used for training, 338 for testing, and 169 for validation.
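
The reported counts can be checked by hand from the two-step split. The sketch below assumes that `train_test_split` rounds the held-out count up (ceiling) when `test_size` is a fraction and assigns the rest to the other side; the helper `split_sizes` is hypothetical and only reproduces the arithmetic, not the shuffling.

```python
import math

def split_sizes(n_total, test_frac=0.2, val_frac_of_rest=0.125):
    """Return (n_train, n_val, n_test) for a two-step hold-out split."""
    # Step 1: hold out the test set from the full dataset
    n_test = math.ceil(test_frac * n_total)
    n_train_val = n_total - n_test
    # Step 2: hold out the validation set from what remains
    n_val = math.ceil(val_frac_of_rest * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1690)
print(n_train, n_val, n_test)
print(round(n_train / 1690, 3), round(n_val / 1690, 3), round(n_test / 1690, 3))
```

For 1690 images this yields 1183, 169, and 338, that is, 70%, 10%, and 20% of the dataset.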