<a href="https://colab.research.google.com/github/olstjr/CZII-CryoET_Object_Identification/blob/main/making_datasets_for_yolo_additional_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# CZII: Creating Datasets for YOLO with Additional Data

This is a modified version of @ITK8191's notebook that generates datasets with additional synthetic data, denoised with a Gaussian filter. Check out the original notebook by [@ITK8191](https://www.kaggle.com/code/itsuki9180/czii-making-datasets-for-yolo).

There have been various discussions in the competition community regarding whether models trained with synthetic data perform better. For example, see the [David List discussion](https://www.kaggle.com/competitions/czii-cryo-et-object-identification/discussion/555247).

In my experiments, this was indeed true for YOLO. If someone manages to incorporate the original denoise model or IsoNet, I’m sure that better results could be achieved.


Weights for the model can be found here: [CZII YOLO L trained with synthetic data](https://www.kaggle.com/datasets/sersasj/czii-yolo-l-trained-with-synthetic-data).  
The model was trained with TS_5_4, TS_69_2, TS_6_4, and TS_6_6 as validation.


# Install and Import Modules

In [None]:
!pip install zarr opencv-python
!pip install SimpleITK
!rm -rf /kaggle/working/*


In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zarr
from tqdm import tqdm
import glob, os
import cv2
import shutil
import SimpleITK as sitk


In [None]:
runs = sorted(glob.glob('/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/*'))
print(runs)
runs = [os.path.basename(x) for x in runs]
additional_runs = sorted(glob.glob('/kaggle/input/czii10441/10441/T*'))
print(additional_runs)
additional_runs = [os.path.basename(x) for x in additional_runs]
runs = runs + additional_runs
i2r_dict = {i: r for i, r in zip(range(len(runs)), runs)}
r2t_dict = {r: i for i, r in zip(range(len(runs)), runs)}
print("Runs:", i2r_dict)
# Add the extra data, merge it with the real runs, and build the dataset for YOLO training

['/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_5_4', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_69_2', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_6_4', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_6_6', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_73_6', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_86_3', '/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_99_9']
['/kaggle/input/czii10441/10441/TS_0', '/kaggle/input/czii10441/10441/TS_1', '/kaggle/input/czii10441/10441/TS_10', '/kaggle/input/czii10441/10441/TS_11', '/kaggle/input/czii10441/10441/TS_12', '/kaggle/input/czii10441/10441/TS_13', '/kaggle/input/czii10441/10441/TS_14', '/kaggle/input/czii10441/10441/TS_15', '/kaggle/input/czii10441/10441/TS_16', '/kaggle/input/czii10441/1044

# Normalize Function
Normalize the image to a value between 0 and 255.

In [None]:
def convert_to_8bit(x):
    lower, upper = np.percentile(x, (0.5, 99.5))  # find the outlier cut-offs
    x = np.clip(x, lower, upper)  # clip to the range that excludes outliers
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255  # rescale to 0..255
    return x.round().astype("uint8")  # convert to 8-bit integers

    # Function that normalizes the input data
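
To see why the percentile clip matters, here is a small sketch on made-up data. The helper name `to_uint8_percentile` and the random volume are assumptions for illustration only; the logic mirrors `convert_to_8bit` above. A single extreme voxel is enough to squash a plain min-max rescale, while the clipped version keeps the usable contrast.

```python
import numpy as np

def to_uint8_percentile(x, lo_pct=0.5, hi_pct=99.5):
    """Clip to the given percentiles, then rescale linearly to 0..255."""
    lower, upper = np.percentile(x, (lo_pct, hi_pct))
    clipped = np.clip(x, lower, upper)
    scaled = (clipped - clipped.min()) / (clipped.max() - clipped.min() + 1e-12) * 255
    return scaled.round().astype("uint8")

# Hypothetical volume: Gaussian noise plus one extreme outlier voxel
rng = np.random.default_rng(0)
demo = rng.normal(size=(4, 32, 32))
demo[0, 0, 0] = 1e6

# Plain min-max scaling: the outlier takes 255 and everything else collapses near 0
naive = ((demo - demo.min()) / (demo.max() - demo.min()) * 255).round().astype("uint8")
# Percentile clipping first: the outlier is clipped and the rest spans the full range
robust = to_uint8_percentile(demo)

print("naive  mean intensity:", naive.mean())
print("robust mean intensity:", robust.mean())
```

The 0.5 and 99.5 percentiles are the values used in this notebook; other cut-offs may suit other data better.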

# Information about Labels

In [None]:
p2i_dict = {
    'apo-ferritin': 0,
    'beta-amylase': 1,
    'beta-galactosidase': 2,
    'ribosome': 3,
    'thyroglobulin': 4,
    'virus-like-particle': 5
} # map particle names to integer class IDs (YOLO)

i2p = {v: k for k, v in p2i_dict.items()}  # inverse mapping: class ID -> particle name

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
} # radius for each particle type

particle_names = ['apo-ferritin', 'beta-amylase', 'beta-galactosidase',
                  'ribosome', 'thyroglobulin', 'virus-like-particle']
                  # particle names stored as a list

In [None]:

from scipy.ndimage import gaussian_filter, median_filter

def denoise_tomogram(tomogram, method='gaussian', **kwargs):
    """
    Apply denoising to a tomogram.

    Parameters:
        tomogram (np.ndarray): The input tomogram to denoise.
        method (str): The denoising method ('gaussian' or 'median').
        kwargs: Extra settings for the selected filter ('sigma' for Gaussian, 'size' for median).

    Returns:
        np.ndarray: The denoised tomogram.
    """
    if method == 'gaussian':
        return gaussian_filter(tomogram, sigma=kwargs.get('sigma', 1))
    elif method == 'median':
        return median_filter(tomogram, size=kwargs.get('size', 3))
    else:  # raise an error for an unknown method name
        raise ValueError(f"Unsupported denoising method: {method}")

In [None]:
def volume_wise_histogram_matching_3d(sim_tomogram, real_tomogram):
    """
    Apply volume-wise 3D histogram matching to a 3D tomogram, i.e. make the voxel-value distribution of one image match that of another.
    Args:
        sim_tomogram (np.ndarray): Simulated 3D tomogram (NumPy array).
        real_tomogram (np.ndarray): Real 3D tomogram (NumPy array, the reference image).

    Returns:
        np.ndarray: The histogram-matched 3D tomogram (NumPy array).
    """
    # Convert the NumPy arrays to SimpleITK image objects (SimpleITK is a medical image processing library that makes such transforms easy)
    sim_tomogram_sitk = sitk.GetImageFromArray(sim_tomogram)
    real_tomogram_sitk = sitk.GetImageFromArray(real_tomogram)

    # Create the HistogramMatchingImageFilter
    histogram_matcher = sitk.HistogramMatchingImageFilter()

    # Set the filter parameters (optional, tune as needed)
    histogram_matcher.SetNumberOfHistogramLevels(1024)  # number of histogram levels; larger values give finer matching
    histogram_matcher.SetNumberOfMatchPoints(7)   # number of match points

    # Run the histogram matching filter
    matched_tomogram_sitk = histogram_matcher.Execute(sim_tomogram_sitk, real_tomogram_sitk)

    # Convert the SimpleITK image back to a NumPy array and return it
    matched_tomogram_np = sitk.GetArrayFromImage(matched_tomogram_sitk)
    return matched_tomogram_np
    # The simulated voxel values are remapped to follow the distribution of the real data,
    # so the returned volume partly reflects the characteristics of the real data.
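
The idea behind histogram matching can be sketched with NumPy alone. This is not the SimpleITK algorithm used above (which works with histogram levels and a small number of match points); it is a plain quantile mapping, and the function name `match_histogram_np`, the `n_quantiles` parameter, and the random test volumes are assumptions for illustration.

```python
import numpy as np

def match_histogram_np(source, reference, n_quantiles=256):
    """Remap source values so that their quantiles line up with the reference's."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(source, q)      # source value at each quantile
    ref_q = np.quantile(reference, q)   # reference value at each quantile
    # Piecewise-linear lookup: a source value sitting at quantile q
    # is sent to the reference value at the same quantile.
    matched = np.interp(source.ravel(), src_q, ref_q)
    return matched.reshape(source.shape)

# Hypothetical "simulated" and "real" volumes with different intensity statistics
rng = np.random.default_rng(0)
sim = rng.normal(loc=0.0, scale=1.0, size=(8, 32, 32))
real = rng.normal(loc=10.0, scale=3.0, size=(8, 32, 32))

matched = match_histogram_np(sim, real)
print("before:", sim.mean(), sim.std())
print("after: ", matched.mean(), matched.std())
print("target:", real.mean(), real.std())
```

After matching, the mean and standard deviation of the simulated volume should sit close to those of the reference, while the spatial structure of the simulated volume is unchanged because the mapping is monotonic.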

In [None]:
name_map = {
    'apo-ferritin': 'ferritin_complex',
    'beta-amylase': 'beta_amylase',
    'beta-galactosidase': 'beta_galactosidase',
    'ribosome': 'cytosolic_ribosome',
    'thyroglobulin': 'thyroglobulin',
    'virus-like-particle': 'pp7_vlp',
}

In [None]:
def ndjson_to_json(ndjson_path):
    if not os.path.isfile(ndjson_path):
        raise FileNotFoundError(f"The file {ndjson_path} does not exist.")

    data = []
    try:
        with open(ndjson_path, 'r', encoding='utf-8') as ndjson_file:
            for line_number, line in enumerate(ndjson_file, start=1):
                stripped_line = line.strip()
                if stripped_line:
                    try:
                        json_object = json.loads(stripped_line)
                        data.append(json_object)
                    except json.JSONDecodeError as e:
                        raise json.JSONDecodeError(
                            f"Error decoding JSON on line {line_number}: {e.msg}",
                            e.doc,
                            e.pos
                        )
    except Exception as e:
        raise e

    return data
    # Reads an NDJSON file and returns its records as a list of Python objects
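
NDJSON is simply one JSON object per line, so it can be parsed line by line. The sketch below does the same job as `ndjson_to_json` on an in-memory string so it runs without any files. The helper name `parse_ndjson` and the sample records (including the `"type"` field) are made up for illustration; only the `location` key with `x`, `y`, `z` follows what the notebook reads later.

```python
import io
import json

def parse_ndjson(stream):
    """Parse a text stream that holds one JSON object per line."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON on line {line_number}: {e.msg}") from e
    return records

# Hypothetical annotation records, one per line, with a blank line in between
sample = io.StringIO(
    '{"type": "point", "location": {"x": 12.5, "y": 40.0, "z": 7.0}}\n'
    '\n'
    '{"type": "point", "location": {"x": 3.0, "y": 8.25, "z": 90.5}}\n'
)
records = parse_ndjson(sample)
locations = [r["location"] for r in records]
print(locations)
```

Raising `ValueError` with the line number is a design choice of this sketch; the notebook's function re-raises `json.JSONDecodeError` instead.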

In [None]:
import os
import glob
import json
import pandas as pd
import numpy as np
import zarr
import cv2
from tqdm import tqdm


def make_annotate_yolo(run_name, is_train_path=True, is_syntetic=False):
    dataset_split = 'train' if is_train_path else 'val'

    # Path to the denoised volume
    if is_syntetic:
        vol_path = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Tomograms/**/*.zarr', recursive=True)
        if not vol_path:
            print(f"No volume found for run {run_name} in synthetic data.")
            return
        vol_path = vol_path[0]
    else:
        vol_path = f'/kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/{run_name}/VoxelSpacing10.000/denoised.zarr'
    # Set the volume data path
    print(f"Volume path: {vol_path}")
    if not os.path.exists(vol_path):
        print(f"Volume file not found: {vol_path}")
        return

    # Read the volume
    vol = zarr.open(vol_path, mode='r')
    vol = vol[0]
    # Reference tomogram volume
    ref_vol = '/kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/TS_5_4/VoxelSpacing10.000/denoised.zarr'
    ref_vol = zarr.open(ref_vol, mode='r')
    ref_vol = ref_vol[0]

    if is_syntetic:    # for synthetic data, apply Gaussian filtering
        vol = denoise_tomogram(np.array(vol)[:184], method='gaussian', sigma=1.5)  # Apply denoise
        vol = volume_wise_histogram_matching_3d(vol, ref_vol)  # histogram matching against the reference
    vol2 = convert_to_8bit(vol)

    n_imgs = vol2.shape[0]  # one 2D slice per index along the first (z) axis
    print(n_imgs)

    for j in range(n_imgs):  # save each 2D slice of the 3D volume as an image
        newvol = vol2[j]
        newvolf = np.stack([newvol]*3, axis=-1)
        newvolf = cv2.resize(newvolf, (640, 640))  # resize to 640x640
        image_filename = f'images/{dataset_split}/{run_name}_{j*10}.png'
        cv2.imwrite(image_filename, newvolf)
        # Create an empty label file
        label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
        with open(label_filename, 'w') as f:
            pass

    # Process each particle type
    for p, particle in enumerate(tqdm(particle_names, desc=f"Processing particles for run {run_name}")):
        if particle == "beta-amylase":
            continue

        if is_syntetic:  # for synthetic data, find and load the NDJSON file to extract positions
            particle_name_in_file = name_map.get(particle)
            if not particle_name_in_file:
                print(f"Particle name mapping not found for: {particle}")
                continue

            ndjson_each_particle = glob.glob(f'/kaggle/input/czii10441/10441/{run_name}/**/Annotations/**/*.ndjson', recursive=True)
            if not ndjson_each_particle:
                print(f"No NDJSON files found for particle: {particle} in run: {run_name}")
                continue

            filtered_ndjson_files = [f for f in ndjson_each_particle if particle_name_in_file in f]
            if not filtered_ndjson_files:
                print(f"No NDJSON files match the particle: {particle} for run: {run_name}")
                continue

            json_each_particle = ndjson_to_json(filtered_ndjson_files[0])
            df = pd.DataFrame(json_each_particle)
        else:  # for real data, load the JSON file that holds the particle positions
            json_each_particle = f"/kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/{run_name}/Picks/{particle}.json"

            if not os.path.exists(json_each_particle):
                print(f"JSON file not found: {json_each_particle}")
                continue
            print(f"Loading JSON file: {json_each_particle}")
            try:
                df = pd.read_json(json_each_particle)
            except ValueError as e:
                print(f"Error reading JSON file {json_each_particle}: {e}")
                continue
        if is_syntetic:
            column_name = 'location'
        else:
            column_name = 'points'

        if  column_name not in df.columns:
            print(f"'{column_name}' column not found in DataFrame for particle: {particle}")
            continue

        if is_syntetic:
            normalized_data = pd.json_normalize(df[column_name])
            df[['x', 'y', 'z']] = normalized_data * 10.012  # scale the synthetic coordinates by 10.012 so they match the real picks

        else:
            for axis in ["x", "y", "z"]:   # IMPORTANT: extract the x, y, z coordinates of each particle from df

                df[axis] = df[column_name].apply(lambda x: x["location"][axis] if "location" in x and axis in x["location"] else np.nan)
                print("aquiii",df.head())


        df.dropna(subset=["x", "y", "z"], inplace=True)



        radius = particle_radius.get(particle)
        if radius is None:
            print(f"Radius not defined for particle: {particle}")
            continue
        divide_by = 10.012
        for i, row in df.iterrows():

            start_z = np.round(row['z'] - radius).astype(np.int32)
            start_z = max(0, start_z//10)
            end_z = np.round(row['z'] + radius).astype(np.int32)
            end_z = min(n_imgs, end_z//10)
            for j in range(start_z, end_z):


                label_filename = f'labels/{dataset_split}/{run_name}_{j*10}.txt'
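                # Volumes are indexed (z, y, x): normalize x by the width (shape[2]) and y by the height (shape[1]).
                # For square slices the result is the same as before.
                x_center = row["x"] / divide_by / vol2.shape[2]
                y_center = row["y"] / divide_by / vol2.shape[1]
                box_width = (radius * 2) / divide_by / vol2.shape[2]
                box_height = (radius * 2) / divide_by / vol2.shape[1]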
                with open(label_filename, 'a') as f:
                    f.write(f'{p2i_dict.get(particle, 0)} {x_center:.6f} {y_center:.6f} {box_width:.6f} {box_height:.6f}\n')

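      # Compute the start and end slices around the particle's z coordinate, using the radius as the margin
      # For every z in that range, divide the particle coordinates by 10.012 and normalize them by the image size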


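Suppose a particle sits at (x=50, y=60, z=30) with radius=5.
It then occupies the region from z=25 to z=35 along the z axis, centered on z=30.
The 2D slices that fall inside this region are selected, and for each one the particle's center (x, y) and size (width, height) in the image are computed and written to a YOLO label file in the expected format.
In short, the code above uses each particle's 3D position and radius to work out which 2D slices it appears in and generates the matching labels.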
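
The same computation can be written as a small standalone function. This is a sketch that follows the arithmetic in `make_annotate_yolo`; the function name `yolo_labels_for_pick` is hypothetical, the example pick is the first ribosome shown in the TS_5_4 output, and the volume shape `(184, 630, 630)` is an assumption (only the 184 slices are printed by the notebook).

```python
VOXEL_SPACING = 10.012   # value the notebook divides by to go from angstroms to voxels
SLICE_STEP = 10          # the notebook maps z (angstroms) to a slice index with // 10

def yolo_labels_for_pick(x, y, z, radius, class_id, vol_shape):
    """Return (slice_index, label_line) pairs for one particle.

    x, y, z and radius are in angstroms; vol_shape is (n_z, n_y, n_x).
    """
    n_z, n_y, n_x = vol_shape
    # Slices covered by the particle along z
    start_z = max(0, int(round(z - radius)) // SLICE_STEP)
    end_z = min(n_z, int(round(z + radius)) // SLICE_STEP)
    # YOLO expects center and size normalized to the 0..1 range
    x_center = x / VOXEL_SPACING / n_x
    y_center = y / VOXEL_SPACING / n_y
    box_w = 2 * radius / VOXEL_SPACING / n_x
    box_h = 2 * radius / VOXEL_SPACING / n_y
    line = f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"
    return [(j, line) for j in range(start_z, end_z)]

# Ribosome pick from TS_5_4: class 3, radius 150
labels = yolo_labels_for_pick(4601.271, 601.066, 600.934,
                              radius=150, class_id=3, vol_shape=(184, 630, 630))
print(len(labels), "slices, first:", labels[0], "last index:", labels[-1][0])
```

Each slice index `j` corresponds to the files named `{run_name}_{j*10}.png` and `{run_name}_{j*10}.txt`, so the same label line is appended to every slice the particle spans.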

# Prepare Folders

In [None]:
os.makedirs("images/train", exist_ok=True)
os.makedirs("images/val", exist_ok=True)
os.makedirs("labels/train", exist_ok=True)
os.makedirs("labels/val", exist_ok=True)  # create the output directories

# Create Dataset

In [None]:
validation_indices = [0]  # TS_5_4

#runs = runs[:7]

for i, r in enumerate(runs):
    is_train_path = i not in validation_indices  # whether this run goes to the training or the validation split
    is_syntetic = i >= 7  # the first 7 runs (indices 0-6) are real; the rest are synthetic
    print(f"Processing Run {i}: {r}, Is Train: {is_train_path}")
    make_annotate_yolo(r, is_train_path=is_train_path, is_syntetic=is_syntetic)  # build the YOLO images and labels

Processing Run 0: TS_5_4, Is Train: False
Volume path: /kaggle/input/czii-cryo-et-object-identification/train/static/ExperimentRuns/TS_5_4/VoxelSpacing10.000/denoised.zarr
184


Processing particles for run TS_5_4:  17%|█▋        | 1/6 [00:00<00:00,  8.62it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_5_4/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0   TS_5_4            NaN   
1         apo-ferritin  curation           0   TS_5_4            NaN   
2         apo-ferritin  curation           0   TS_5_4            NaN   
3         apo-ferritin  curation           0   TS_5_4            NaN   
4         apo-ferritin  curation           0   TS_5_4            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 468.514, 'y': 5915.906, 'z'...   
1  angstrom  {'location': {'x': 5674.694, 'y': 1114.354, 'z...   
2  angstrom  {'location': {'x': 5744.509, 'y': 1049.172, 'z...   
3  angstrom  {'location': {'x': 5880.769, 'y': 1125.348, 'z...   
4  angstrom  {'location': {'x': 4661.667, 'y': 1269.497, 'z...   

   trust_orientation         x  
0 

Processing particles for run TS_5_4: 100%|██████████| 6/6 [00:00<00:00, 14.85it/s]

  pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0             ribosome  curation           0   TS_5_4            NaN   
1             ribosome  curation           0   TS_5_4            NaN   
2             ribosome  curation           0   TS_5_4            NaN   
3             ribosome  curation           0   TS_5_4            NaN   
4             ribosome  curation           0   TS_5_4            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 4601.271, 'y': 601.066, 'z'...   
1  angstrom  {'location': {'x': 4803.789, 'y': 455.425, 'z'...   
2  angstrom  {'location': {'x': 4715.436, 'y': 825.374, 'z'...   
3  angstrom  {'location': {'x': 5003.275, 'y': 782.745, 'z'...   
4  angstrom  {'location': {'x': 710.459, 'y': 3815.845, 'z'...   

   trust_orientation         x         y         z  
0               True  4601.271   601.066   600.934  
1               True  4803.789   455.425   514.016  
2          




184


Processing particles for run TS_69_2:  50%|█████     | 3/6 [00:00<00:00, 25.59it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_69_2/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0  TS_69_2            NaN   
1         apo-ferritin  curation           0  TS_69_2            NaN   
2         apo-ferritin  curation           0  TS_69_2            NaN   
3         apo-ferritin  curation           0  TS_69_2            NaN   
4         apo-ferritin  curation           0  TS_69_2            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 770.625, 'y': 1111.161, 'z'...   
1  angstrom  {'location': {'x': 828.291, 'y': 1201.673, 'z'...   
2  angstrom  {'location': {'x': 668.986, 'y': 1041.449, 'z'...   
3  angstrom  {'location': {'x': 834.049, 'y': 592.958, 'z':...   
4  angstrom  {'location': {'x': 81.893, 'y': 2152.929, 'z':...   

   trust_orientation        x  
0 

Processing particles for run TS_69_2: 100%|██████████| 6/6 [00:00<00:00, 16.95it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_69_2/Picks/thyroglobulin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0  TS_69_2            NaN   
1        thyroglobulin  curation           0  TS_69_2            NaN   
2        thyroglobulin  curation           0  TS_69_2            NaN   
3        thyroglobulin  curation           0  TS_69_2            NaN   
4        thyroglobulin  curation           0  TS_69_2            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 1697.664, 'y': 2484.89, 'z'...   
1  angstrom  {'location': {'x': 1083.746, 'y': 4336.916, 'z...   
2  angstrom  {'location': {'x': 2535.083, 'y': 3739.817, 'z...   
3  angstrom  {'location': {'x': 719.11, 'y': 2318.413, 'z':...   
4  angstrom  {'location': {'x': 270.918, 'y': 4900.032, 'z'...   

   trust_orientation         x  





184


Processing particles for run TS_6_4:  50%|█████     | 3/6 [00:00<00:00, 23.10it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_6_4/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0   TS_6_4            NaN   
1         apo-ferritin  curation           0   TS_6_4            NaN   
2         apo-ferritin  curation           0   TS_6_4            NaN   
3         apo-ferritin  curation           0   TS_6_4            NaN   
4         apo-ferritin  curation           0   TS_6_4            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 616.51, 'y': 2880.471, 'z':...   
1  angstrom  {'location': {'x': 1099.033, 'y': 1820.423, 'z...   
2  angstrom  {'location': {'x': 1019.831, 'y': 1859.831, 'z...   
3  angstrom  {'location': {'x': 959.708, 'y': 1708.149, 'z'...   
4  angstrom  {'location': {'x': 1010.329, 'y': 1758.816, 'z...   

   trust_orientation         x  
0 

Processing particles for run TS_6_4: 100%|██████████| 6/6 [00:00<00:00, 14.98it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_6_4/Picks/thyroglobulin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0   TS_6_4            NaN   
1        thyroglobulin  curation           0   TS_6_4            NaN   
2        thyroglobulin  curation           0   TS_6_4            NaN   
3        thyroglobulin  curation           0   TS_6_4            NaN   
4        thyroglobulin  curation           0   TS_6_4            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 5251.785, 'y': 2090.452, 'z...   
1  angstrom  {'location': {'x': 5704.207, 'y': 2059.378, 'z...   
2  angstrom  {'location': {'x': 5926.681, 'y': 1787.182, 'z...   
3  angstrom  {'location': {'x': 4967.648, 'y': 2249.978, 'z...   
4  angstrom  {'location': {'x': 1242.919, 'y': 1464.644, 'z...   

   trust_orientation         x  
0




184


Processing particles for run TS_6_6:  50%|█████     | 3/6 [00:00<00:00, 23.92it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_6_6/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0   TS_6_6            NaN   
1         apo-ferritin  curation           0   TS_6_6            NaN   
2         apo-ferritin  curation           0   TS_6_6            NaN   
3         apo-ferritin  curation           0   TS_6_6            NaN   
4         apo-ferritin  curation           0   TS_6_6            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 1916.83, 'y': 3311.797, 'z'...   
1  angstrom  {'location': {'x': 1996.861, 'y': 3231.277, 'z...   
2  angstrom  {'location': {'x': 2206.512, 'y': 2975.302, 'z...   
3  angstrom  {'location': {'x': 285.292, 'y': 1379.331, 'z'...   
4  angstrom  {'location': {'x': 753.781, 'y': 2633.219, 'z'...   

   trust_orientation         x  
0 

Processing particles for run TS_6_6: 100%|██████████| 6/6 [00:00<00:00, 16.51it/s]

aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0   TS_6_6            NaN   
1        thyroglobulin  curation           0   TS_6_6            NaN   
2        thyroglobulin  curation           0   TS_6_6            NaN   
3        thyroglobulin  curation           0   TS_6_6            NaN   
4        thyroglobulin  curation           0   TS_6_6            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 4709.027, 'y': 5996.216, 'z...   
1  angstrom  {'location': {'x': 2822.785, 'y': 2974.861, 'z...   
2  angstrom  {'location': {'x': 2974.63, 'y': 5786.818, 'z'...   
3  angstrom  {'location': {'x': 4622.227, 'y': 4238.002, 'z...   
4  angstrom  {'location': {'x': 1743.59, 'y': 605.843, 'z':...   

   trust_orientation         x  
0               True  4709.027  
1               True  2822.785  
2               True  2974.630  
3               True  4622.227 




184


Processing particles for run TS_73_6:  17%|█▋        | 1/6 [00:00<00:00,  8.96it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_73_6/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0  TS_73_6            NaN   
1         apo-ferritin  curation           0  TS_73_6            NaN   
2         apo-ferritin  curation           0  TS_73_6            NaN   
3         apo-ferritin  curation           0  TS_73_6            NaN   
4         apo-ferritin  curation           0  TS_73_6            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 268.662, 'y': 4730.318, 'z'...   
1  angstrom  {'location': {'x': 238.946, 'y': 4853.061, 'z'...   
2  angstrom  {'location': {'x': 83.114, 'y': 5729.56, 'z': ...   
3  angstrom  {'location': {'x': 582.143, 'y': 2769.968, 'z'...   
4  angstrom  {'location': {'x': 510.389, 'y': 2157.244, 'z'...   

   trust_orientation        x  
0 

Processing particles for run TS_73_6: 100%|██████████| 6/6 [00:00<00:00, 14.47it/s]


Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_73_6/Picks/thyroglobulin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0  TS_73_6            NaN   
1        thyroglobulin  curation           0  TS_73_6            NaN   
2        thyroglobulin  curation           0  TS_73_6            NaN   
3        thyroglobulin  curation           0  TS_73_6            NaN   
4        thyroglobulin  curation           0  TS_73_6            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 1841.78, 'y': 1057.592, 'z'...   
1  angstrom  {'location': {'x': 2387.281, 'y': 2326.345, 'z...   
2  angstrom  {'location': {'x': 3242.6, 'y': 452.621, 'z': ...   
3  angstrom  {'location': {'x': 409.479, 'y': 1796.38, 'z':...   
4  angstrom  {'location': {'x': 4582.511, 'y': 1738.01, 'z'...   

   trust_orientation         x  


Processing particles for run TS_86_3:  50%|█████     | 3/6 [00:00<00:00, 19.94it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_86_3/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0  TS_86_3            NaN   
1         apo-ferritin  curation           0  TS_86_3            NaN   
2         apo-ferritin  curation           0  TS_86_3            NaN   
3         apo-ferritin  curation           0  TS_86_3            NaN   
4         apo-ferritin  curation           0  TS_86_3            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 3870.343, 'y': 4952.714, 'z...   
1  angstrom  {'location': {'x': 4130.897, 'y': 5422.292, 'z...   
2  angstrom  {'location': {'x': 2735.0, 'y': 4668.447, 'z':...   
3  angstrom  {'location': {'x': 2649.615, 'y': 4690.615, 'z...   
4  angstrom  {'location': {'x': 2665.353, 'y': 4810.641, 'z...   

   trust_orientation         x  
0

Processing particles for run TS_86_3: 100%|██████████| 6/6 [00:00<00:00, 12.87it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_86_3/Picks/thyroglobulin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0  TS_86_3            NaN   
1        thyroglobulin  curation           0  TS_86_3            NaN   
2        thyroglobulin  curation           0  TS_86_3            NaN   
3        thyroglobulin  curation           0  TS_86_3            NaN   
4        thyroglobulin  curation           0  TS_86_3            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 3491.458, 'y': 1964.375, 'z...   
1  angstrom  {'location': {'x': 1059.895, 'y': 338.01, 'z':...   
2  angstrom  {'location': {'x': 5583.535, 'y': 5055.373, 'z...   
3  angstrom  {'location': {'x': 3753.362, 'y': 3195.014, 'z...   
4  angstrom  {'location': {'x': 4845.469, 'y': 5984.182, 'z...   

   trust_orientation         x  





184


Processing particles for run TS_99_9:   0%|          | 0/6 [00:00<?, ?it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_99_9/Picks/apo-ferritin.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0         apo-ferritin  curation           0  TS_99_9            NaN   
1         apo-ferritin  curation           0  TS_99_9            NaN   
2         apo-ferritin  curation           0  TS_99_9            NaN   
3         apo-ferritin  curation           0  TS_99_9            NaN   
4         apo-ferritin  curation           0  TS_99_9            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 6072.464, 'y': 4038.0, 'z':...   
1  angstrom  {'location': {'x': 5967.452, 'y': 4228.213, 'z...   
2  angstrom  {'location': {'x': 5847.622, 'y': 5066.71, 'z'...   
3  angstrom  {'location': {'x': 472.853, 'y': 5632.618, 'z'...   
4  angstrom  {'location': {'x': 564.507, 'y': 5604.88, 'z':...   

   trust_orientation         x  
0

Processing particles for run TS_99_9:  50%|█████     | 3/6 [00:00<00:00, 12.87it/s]

Loading JSON file: /kaggle/input/czii-cryo-et-object-identification/train/overlay/ExperimentRuns/TS_99_9/Picks/ribosome.json
aquiii   pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0             ribosome  curation           0  TS_99_9            NaN   
1             ribosome  curation           0  TS_99_9            NaN   
2             ribosome  curation           0  TS_99_9            NaN   
3             ribosome  curation           0  TS_99_9            NaN   
4             ribosome  curation           0  TS_99_9            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 3319.52, 'y': 5476.3, 'z': ...   
1  angstrom  {'location': {'x': 3106.079, 'y': 5574.209, 'z...   
2  angstrom  {'location': {'x': 3676.997, 'y': 5271.281, 'z...   
3  angstrom  {'location': {'x': 3717.18, 'y': 5455.125, 'z'...   
4  angstrom  {'location': {'x': 3525.117, 'y': 5614.423, 'z...   

   trust_orientation         x  
0    

Processing particles for run TS_99_9: 100%|██████████| 6/6 [00:00<00:00, 10.44it/s]

  pickable_object_name   user_id  session_id run_name  voxel_spacing  \
0        thyroglobulin  curation           0  TS_99_9            NaN   
1        thyroglobulin  curation           0  TS_99_9            NaN   
2        thyroglobulin  curation           0  TS_99_9            NaN   
3        thyroglobulin  curation           0  TS_99_9            NaN   
4        thyroglobulin  curation           0  TS_99_9            NaN   

       unit                                             points  \
0  angstrom  {'location': {'x': 638.516, 'y': 1499.883, 'z'...   
1  angstrom  {'location': {'x': 2038.81, 'y': 214.041, 'z':...   
2  angstrom  {'location': {'x': 1504.54, 'y': 3662.02, 'z':...   
3  angstrom  {'location': {'x': 2166.658, 'y': 2000.169, 'z...   
4  angstrom  {'location': {'x': 656.526, 'y': 1988.951, 'z'...   

   trust_orientation         x         y        z  
0               True   638.516  1499.883  279.570  
1               True  2038.810   214.041  361.807  
2             




Volume path: /kaggle/input/czii10441/10441/TS_1/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_1.zarr
184


Processing particles for run TS_1: 100%|██████████| 6/6 [00:01<00:00,  3.30it/s]


Processing Run 9: TS_10, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_10/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_10.zarr
184


Processing particles for run TS_10: 100%|██████████| 6/6 [00:02<00:00,  3.00it/s]


Processing Run 10: TS_11, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_11/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_11.zarr
184


Processing particles for run TS_11: 100%|██████████| 6/6 [00:01<00:00,  3.29it/s]


Processing Run 11: TS_12, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_12/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_12.zarr
184


Processing particles for run TS_12: 100%|██████████| 6/6 [00:01<00:00,  3.14it/s]


Processing Run 12: TS_13, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_13/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_13.zarr
184


Processing particles for run TS_13: 100%|██████████| 6/6 [00:01<00:00,  3.40it/s]


Processing Run 13: TS_14, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_14/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_14.zarr
184


Processing particles for run TS_14: 100%|██████████| 6/6 [00:02<00:00,  2.98it/s]


Processing Run 14: TS_15, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_15/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_15.zarr
184


Processing particles for run TS_15: 100%|██████████| 6/6 [00:01<00:00,  3.14it/s]


Processing Run 15: TS_16, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_16/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_16.zarr
184


Processing particles for run TS_16: 100%|██████████| 6/6 [00:01<00:00,  3.06it/s]


Processing Run 16: TS_17, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_17/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_17.zarr
184


Processing particles for run TS_17: 100%|██████████| 6/6 [00:01<00:00,  3.28it/s]


Processing Run 17: TS_18, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_18/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_18.zarr
184


Processing particles for run TS_18: 100%|██████████| 6/6 [00:02<00:00,  2.72it/s]


Processing Run 18: TS_19, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_19/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_19.zarr
184


Processing particles for run TS_19: 100%|██████████| 6/6 [00:02<00:00,  2.95it/s]


Processing Run 19: TS_2, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_2/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_2.zarr
184


Processing particles for run TS_2: 100%|██████████| 6/6 [00:01<00:00,  3.13it/s]


Processing Run 20: TS_20, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_20/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_20.zarr
184


Processing particles for run TS_20: 100%|██████████| 6/6 [00:01<00:00,  3.18it/s]


Processing Run 21: TS_21, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_21/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_21.zarr
184


Processing particles for run TS_21: 100%|██████████| 6/6 [00:01<00:00,  3.25it/s]


Processing Run 22: TS_22, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_22/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_22.zarr
184


Processing particles for run TS_22: 100%|██████████| 6/6 [00:01<00:00,  3.54it/s]


Processing Run 23: TS_23, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_23/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_23.zarr
184


Processing particles for run TS_23: 100%|██████████| 6/6 [00:01<00:00,  3.21it/s]


Processing Run 24: TS_24, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_24/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_24.zarr
184


Processing particles for run TS_24: 100%|██████████| 6/6 [00:01<00:00,  3.44it/s]


Processing Run 25: TS_25, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_25/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_25.zarr
184


Processing particles for run TS_25: 100%|██████████| 6/6 [00:01<00:00,  3.12it/s]


Processing Run 26: TS_26, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_26/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_26.zarr
184


Processing particles for run TS_26: 100%|██████████| 6/6 [00:01<00:00,  3.37it/s]


Processing Run 27: TS_3, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_3/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_3.zarr
184


Processing particles for run TS_3: 100%|██████████| 6/6 [00:02<00:00,  2.58it/s]


Processing Run 28: TS_4, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_4/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_4.zarr
184


Processing particles for run TS_4: 100%|██████████| 6/6 [00:01<00:00,  3.31it/s]


Processing Run 29: TS_5, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_5/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_5.zarr
184


Processing particles for run TS_5: 100%|██████████| 6/6 [00:01<00:00,  3.06it/s]


Processing Run 30: TS_6, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_6/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_6.zarr
184


Processing particles for run TS_6: 100%|██████████| 6/6 [00:01<00:00,  3.38it/s]


Processing Run 31: TS_7, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_7/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_7.zarr
184


Processing particles for run TS_7: 100%|██████████| 6/6 [00:01<00:00,  3.23it/s]


Processing Run 32: TS_8, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_8/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_8.zarr
184


Processing particles for run TS_8: 100%|██████████| 6/6 [00:01<00:00,  3.42it/s]


Processing Run 33: TS_9, Is Train: True
Volume path: /kaggle/input/czii10441/10441/TS_9/Reconstructions/VoxelSpacing10.000/Tomograms/100/TS_9.zarr
184


Processing particles for run TS_9: 100%|██████████| 6/6 [00:02<00:00,  2.62it/s]


In [None]:
images_train_dir = "images/train"
labels_train_dir = "labels/train"

# Organize Dataset Folder Structure

In [None]:
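# Move the generated train/val images and labels into a single dataset root
# (images/<split> and labels/<split>), the layout referenced by the YOLO config.
os.makedirs('datasets/czii_det2d', exist_ok=True)
shutil.move('images/train', 'datasets/czii_det2d/images/train')
shutil.move('images/val', 'datasets/czii_det2d/images/val')
shutil.move('labels/train', 'datasets/czii_det2d/labels/train')
shutil.move('labels/val', 'datasets/czii_det2d/labels/val')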

'datasets/czii_det2d/labels/val'
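
After the move, a quick sanity check can confirm that images and label files line up. The sketch below is not part of the original notebook: the helper name `find_unpaired` and the set of image extensions are assumptions, and it relies only on the YOLO convention that each image has a label file with the same stem and a `.txt` extension.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed image formats


def find_unpaired(root, split):
    """Return sorted stems of images in root/images/<split> that have
    no matching root/labels/<split>/<stem>.txt file."""
    root = Path(root)
    image_dir = root / "images" / split
    label_dir = root / "labels" / split
    image_stems = {
        p.stem for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS
    }
    label_stems = {p.stem for p in label_dir.glob("*.txt")}
    return sorted(image_stems - label_stems)


dataset_root = Path("datasets/czii_det2d")
if dataset_root.exists():  # only runs where the dataset has been generated
    for split in ("train", "val"):
        missing = find_unpaired(dataset_root, split)
        print(f"{split}: {len(missing)} images without a label file")
```

A non-empty result is not necessarily an error, since slices that contain no particles may legitimately have no label file.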

# Create Configuration File for YOLO

In [None]:
config_content = """
path: /kaggle/input/czii-yolo-datasets-kim/datasets/czii_det2d  # dataset root dir
train: images/train  # train images (relative to 'path')
val: images/val  # val images (relative to 'path')

# Classes
names:
  0: apo-ferritin
  1: beta-amylase
  2: beta-galactosidase
  3: ribosome
  4: thyroglobulin
  5: virus-like-particle
"""    # The config acts as a bridge that tells the YOLO model where the dataset is and how to use it
with open("czii_conf.yaml", "w") as f:
    f.write(config_content.strip())

print("Configuration file 'czii_conf.yaml' created successfully.")

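'''
Q: So the 3D data is sliced along Z, a range is derived from the particle
coordinates, and every slice image inside that range is matched with the same label?

A: Yes, exactly. Using each particle's coordinates and radius, the Z-axis range that
contains the particle is computed, and a label for that particle is assigned to
every slice within the range.

Summary:

1. Extract particle positions from the 3D data.
   Each particle comes with x, y, z coordinates.
   The radius describes the particle's size and determines the range it covers.

2. Compute the Z range.
   start_z is the particle's z coordinate minus the radius.
   end_z is the particle's z coordinate plus the radius.
   This range defines the region the particle occupies along the Z axis.

3. Extract the slice range.
   start_z and end_z give the range of slices that contain the particle.
   A label for the particle is recorded for every slice index j in that range.

4. Write the labels.
   In each slice j, the particle's x, y coordinates are normalized and written as a label.
   A separate label file is created per slice, holding the particles present in that slice.

Example:
   Suppose a particle has z = 50 and radius = 5.
   start_z = 50 - 5 = 45
   end_z = 50 + 5 = 55
   The slices with z = 45, 46, 47, ..., 55 then all share a label for this particle.

Conclusion:
   When a particle spans a range of slices, every slice in that range receives a label
   for that particle, with its x, y coordinates and size normalized and written to the
   label file. In this way a single 3D particle is labeled in every slice where it appears.
'''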


Configuration file 'czii_conf.yaml' created successfully.
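
The slice-labeling scheme described in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name `slice_labels` is hypothetical, all inputs are assumed to be already converted to voxel units, the box side is assumed to be twice the radius, and the 630 x 630 slice size in the usage example is a placeholder (only the depth of 184 appears in the logs).

```python
def slice_labels(class_id, x, y, z, radius, width, height, depth):
    """Map each slice index covered by a particle to a YOLO label line.

    All coordinates and the radius are in voxels (assumption).
    The label format is: class cx cy w h, normalized to [0, 1].
    """
    # Z range covered by the particle, clipped to the volume bounds.
    start_z = max(0, int(round(z - radius)))
    end_z = min(depth - 1, int(round(z + radius)))

    # Normalized box center and size (box side = 2 * radius is an assumption).
    cx = x / width
    cy = y / height
    w = 2 * radius / width
    h = 2 * radius / height
    line = f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

    # Every slice in the inclusive range shares the same label line.
    return {j: line for j in range(start_z, end_z + 1)}


# The example from the notes: z = 50, radius = 5 covers slices 45..55.
labels = slice_labels(3, x=315, y=315, z=50, radius=5,
                      width=630, height=630, depth=184)
print(min(labels), max(labels), len(labels))
```

In a full pipeline, each returned line would be appended to the label file of the corresponding slice, so that a slice containing several particles ends up with one line per particle.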


In [None]:
shutil.make_archive("/kaggle/working/datasets", 'gztar', "/kaggle/working/datasets")

'/kaggle/working/datasets.tar.gz'
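
Because `root_dir` is the `datasets` folder itself, the archive contents start at `czii_det2d` rather than at `datasets`. A minimal sketch for confirming this before uploading, using only the standard-library `tarfile` module (the helper name `list_top_level` is hypothetical):

```python
import os
import tarfile


def list_top_level(archive_path):
    """Return the sorted top-level entry names inside a tar archive."""
    names = set()
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # Drop the leading "." component that make_archive may add.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)


archive = "/kaggle/working/datasets.tar.gz"
if os.path.exists(archive):  # only runs where the archive has been created
    print(list_top_level(archive))
```

This matters when the extracted archive is attached as a Kaggle dataset, since the `path` entry in `czii_conf.yaml` has to point at the folder that actually contains `images` and `labels`.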

# Continue to Training Baseline
Proceed to [Training Baseline...](https://www.kaggle.com/code/itsuki9180/czii-yolo11-training-baseline)