# Data processing

This notebook transforms and augments your dataset. Some models, such as YOLO, expect YOLO-formatted datasets, while most other detection models expect COCO-formatted annotations.
<br>
Here is how this notebook works:
<div style="text-align: center;">
    <img src="images/Data_transform_graph.png" alt="Data transformation graph" title="Data transformation graph">
</div>

The different formats are as follows:
<table>
  <tr>
    <th>COCO</th>
    <th>YOLOv5</th>
    <th>YOLOv8</th>
  </tr>
  <tr>
  <td>
      <pre>
COCO/
└── data/
    ├── images/
    └── annotations.json
  </pre>
</td>
    <td>
      <pre>
YOLOv5/
└── data/
    ├── images/
    │   ├── train/
    │   └── val/
    ├── labels/
    │   ├── train/
    │   └── val/
    └── data.yaml
  </pre>
</td>
<td>
  <pre>
YOLOv8/
└── data/
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── valid/
    │   ├── images/
    │   └── labels/
    └── data.yaml
  </pre>
</td>
  </tr>
</table>
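
To tell the two YOLO layouts apart programmatically, a small heuristic along these lines can help. This is a minimal sketch: the function name `detect_yolo_layout` is hypothetical, and it assumes you pass the folder that directly contains `data.yaml` (the `data/` folder in the trees above).

```python
from pathlib import Path


def detect_yolo_layout(dataset_dir) -> str:
    """Guess which YOLO layout a dataset folder follows.

    Returns "yolov5", "yolov8", or "unknown". The checks mirror the
    folder trees shown above; this is a heuristic, not a validation.
    """
    root = Path(dataset_dir)
    # YOLOv5 layout: images/<split>/ and labels/<split>/
    if (root / "images" / "train").is_dir() and (root / "labels" / "train").is_dir():
        return "yolov5"
    # YOLOv8 layout: <split>/images/ and <split>/labels/
    if (root / "train" / "images").is_dir() and (root / "train" / "labels").is_dir():
        return "yolov8"
    return "unknown"
```

Only the `train` split is checked, so a dataset without a training split is reported as `"unknown"`.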

Note that you can also merge several COCO datasets into one here.

## Library instantiation

In [1]:
from google.colab import drive
drive.mount("/content/drive")

# Change to the project's notebooks folder (adjust the path to your own Drive).
%cd /content/drive/MyDrive/Cartonomics/Cocass/notebooks

Mounted at /content/drive
/content/drive/MyDrive/Cartonomics/Cocass/notebooks


In [2]:
%load_ext autoreload
%autoreload 1

In [3]:
# Change to the project root, where requirements.txt lives.
%cd /content/drive/MyDrive/Cartonomics/Cocass

/content/drive/MyDrive/Cartonomics/Cocass


In [4]:
!pip install -r requirements.txt

Collecting absl-py==2.1.0 (from -r requirements.txt (line 1))
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting aiofiles==24.1.0 (from -r requirements.txt (line 2))
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting aiohappyeyeballs==2.4.0 (from -r requirements.txt (line 3))
  Downloading aiohappyeyeballs-2.4.0-py3-none-any.whl.metadata (5.9 kB)
Collecting aiohttp==3.10.5 (from -r requirements.txt (line 4))
  Downloading aiohttp-3.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting anyio==4.4.0 (from -r requirements.txt (line 6))
  Downloading anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting argcomplete==3.5.0 (from -r requirements.txt (line 7))
  Downloading argcomplete-3.5.0-py3-none-any.whl.metadata (16 kB)
Collecting asttokens==2.4.1 (from -r requirements.txt (line 8))
  Downloading asttokens-2.4.1-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting boto3==1.35.16 (from -r requirements.t

In [5]:
%pip install sahi fiftyone ultralytics huggingface_hub funcy pylabel

Collecting sahi
  Downloading sahi-0.11.19-py3-none-any.whl.metadata (17 kB)
Collecting fiftyone
  Downloading fiftyone-1.1.0-py3-none-any.whl.metadata (12 kB)
Collecting ultralytics
  Downloading ultralytics-8.3.49-py3-none-any.whl.metadata (35 kB)
Collecting funcy
  Using cached funcy-2.0-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting pylabel
  Downloading pylabel-0.1.55-py3-none-any.whl.metadata (3.8 kB)
Collecting opencv-python<=4.9.0.80 (from sahi)
  Using cached opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting pybboxes==0.1.6 (from sahi)
  Using cached pybboxes-0.1.6-py3-none-any.whl.metadata (9.9 kB)
Collecting fire (from sahi)
  Downloading fire-0.7.0.tar.gz (87 kB)
  Preparing metadata (setup.py) ... done
Collecting terminaltables (from sahi)
  Downloading terminaltables-3.1.10-py

In [6]:
# Change back to the project's notebooks folder.
%cd /content/drive/MyDrive/Cartonomics/Cocass/notebooks

/content/drive/MyDrive/Cartonomics/Cocass/notebooks


In [7]:
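from utils import init_notebook
# With "%autoreload 1", only the modules listed here are reloaded before each cell runs.
%aimport datasets, datasets.cocodetr, datasets.data_transform

from pathlib import Path

# Project root: the parent of the notebooks folder we are currently in.
HOME = Path.cwd().parent
HOME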

PosixPath('/content/drive/MyDrive/Cartonomics/Cocass')

## Data transformation

### COCO to split COCO (train/val)

In [None]:
from datasets.cocodetr import create_coco_pth_datasets

annotations_file = (HOME / "data/coco_datasets/train_data/fraw_d.json").as_posix()
images_folder = (HOME / "data/coco_datasets/Cocass/images").as_posix()

# Split a COCO-style dataset into train and val sets.
# Two annotation files are written; both refer to the same image folder,
# but each one lists a different subset of its images.
create_coco_pth_datasets(
    annotations_file,
    images_folder,
    split_only=True,
    train_ann_name="fraw_detailed_train.json",
    val_ann_name="fraw_detailed_val.json",
    test_size=0.2,
)
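
To make the idea of the split concrete, here is a minimal sketch of how a COCO annotation dictionary can be divided by image. It is not the implementation of `create_coco_pth_datasets`; the function name `split_coco`, the `seed` parameter, and the reading of `test_size` as the validation fraction are assumptions for illustration.

```python
import random


def split_coco(coco: dict, test_size: float = 0.2, seed: int = 0):
    """Split one COCO dict into (train, val) dicts sharing the same categories."""
    image_ids = sorted(img["id"] for img in coco["images"])
    random.Random(seed).shuffle(image_ids)  # reproducible shuffle
    n_val = int(round(len(image_ids) * test_size))
    val_ids = set(image_ids[:n_val])

    def subset(keep_val: bool) -> dict:
        # Keep the images on one side of the split, plus their annotations.
        return {
            "images": [i for i in coco["images"] if (i["id"] in val_ids) == keep_val],
            "annotations": [
                a for a in coco["annotations"] if (a["image_id"] in val_ids) == keep_val
            ],
            "categories": coco["categories"],
        }

    return subset(False), subset(True)
```

Splitting by image id rather than by annotation keeps all boxes of one image in the same split, which is why both output files can point at the same image folder.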

### COCO to YOLO format

This function extracts a YOLO-formatted dataset from a full (unsplit) COCO dataset. If your COCO dataset is already split, call the function with `split=False` once for each annotation file of the split.
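
At the level of a single box, the two formats differ as follows: COCO stores `[x_min, y_min, width, height]` in pixels, while a YOLO label line stores the class index followed by the box centre and size, all normalised by the image dimensions. The helper names below are hypothetical and only illustrate that arithmetic; they are not part of `coco2yolo`.

```python
def coco_bbox_to_yolo(bbox, img_width, img_height):
    """Convert a COCO box [x_min, y_min, w, h] in pixels to a YOLO box
    (x_center, y_center, w, h) normalised to the range [0, 1]."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_width,
        (y_min + h / 2) / img_height,
        w / img_width,
        h / img_height,
    )


def yolo_label_line(class_index, bbox, img_width, img_height):
    """Format one line of a YOLO label file for a COCO box."""
    xc, yc, w, h = coco_bbox_to_yolo(bbox, img_width, img_height)
    return f"{class_index} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

Note that YOLO class indices start at 0, whereas COCO category ids are arbitrary integers, so a conversion also needs a mapping from category id to class index.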

In [9]:
from datasets.data_transform import coco2yolo

annotations_file = (HOME / "data/coco_datasets/train_data/train_annotations.json").as_posix()
images_folder = (HOME / "data/coco_datasets/train_data/train_images").as_posix()
output_dir = (HOME / "data/yolo_datasets/Classification").as_posix()

coco2yolo(
    annotations_file,
    images_folder,
    output_dir=output_dir,  # output folder
    copy_images=True,       # copy the images too; if False, only the annotations are extracted
    yolo_type="yolov8",     # only "yolov8" is supported for now
    split=True,             # split into train and val; if False, you get one full YOLO dataset
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[list(set(schema) - set(df.columns))] = ""

In [13]:
# If you ran coco2yolo with split=False on each split, merge the resulting
# data.yaml files into a single data.yaml.
from datasets.data_transform import merge_yaml

output_dir = (HOME / "data/yolo_datasets/Detection/test").as_posix()
merge_yaml(output_dir, train_name="train", valid_name="val", augment=True)

Merged data.yaml files in test into one data.yaml file.
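
For orientation, a YOLO `data.yaml` usually lists the image folder of each split and the class names. The sketch below builds such a file as plain text; the function name `build_data_yaml`, the default paths, and the example class names are placeholders, and the exact keys written by `merge_yaml` may differ.

```python
def build_data_yaml(class_names, train_dir="train/images", val_dir="valid/images"):
    """Return the text of a minimal YOLO data.yaml.

    Class names are written unquoted, so this sketch only suits simple
    names without YAML special characters.
    """
    lines = [
        f"train: {train_dir}",
        f"val: {val_dir}",
        f"nc: {len(class_names)}",  # number of classes
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


# Placeholder class names, for illustration only.
print(build_data_yaml(["class_a", "class_b"]))
```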


### YOLOv5 $\leftrightarrow$ YOLOv8
This part converts dataset folders between the YOLOv5 and YOLOv8 layouts.

#### YOLOv5 $\rightarrow$ YOLOv8

In [None]:
yolov5_folder_path = "<your_yolov5_folder_path>"
yolov8_output_folder = "<your_yolov8_folder_path>"
from datasets.data_transform import convert_yolov5_to_yolov8

convert_yolov5_to_yolov8(yolov5_folder_path, yolov8_output_folder)
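
Going by the folder trees at the top of this notebook, the conversion is essentially a reordering of two directory levels. The following is a minimal sketch of that idea, not the code of `convert_yolov5_to_yolov8`: the function name `copy_yolov5_to_yolov8` is hypothetical, both arguments are assumed to be the folders that directly hold `images/`, `labels/` (source) and `train/`, `valid/` (destination), and `data.yaml` is not handled.

```python
import shutil
from pathlib import Path

# Split folder names as in the layout table: YOLOv5 uses "val", YOLOv8 uses "valid".
SPLIT_NAMES = {"train": "train", "val": "valid"}


def copy_yolov5_to_yolov8(src, dst) -> None:
    """Copy <src>/<kind>/<split>/ to <dst>/<split>/<kind>/ for images and labels."""
    src, dst = Path(src), Path(dst)
    for v5_split, v8_split in SPLIT_NAMES.items():
        for kind in ("images", "labels"):
            source = src / kind / v5_split
            if source.is_dir():  # skip splits that do not exist
                shutil.copytree(source, dst / v8_split / kind, dirs_exist_ok=True)
```

The files are copied rather than moved, so the source dataset stays intact. The `train` and `val` paths inside `data.yaml` would still have to be rewritten for the new layout.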

#### YOLOv8 $\rightarrow$ YOLOv5

In [None]:
from datasets.data_transform import convert_yolov8_to_yolov5

yolov8_folder_path = (HOME / "data/yolo_datasets/Yolass_aug").as_posix()      # or "<your_yolov8_folder_path>"
yolov5_output_folder = (HOME / "data/yolo_datasets/Yolass_augv5").as_posix()  # or "<your_yolov5_folder_path>"

convert_yolov8_to_yolov5(yolov8_folder_path, yolov5_output_folder)

## Data augmentation
Only COCO datasets are augmented here.

The augmentation takes the training annotations file (for example `train_annotations.json`), because only the training set should be augmented. Split your COCO dataset beforehand.

In [None]:
from datasets.data_transform import albu_coco_augmentation

annotations_file = (HOME / "data/coco_datasets/Cocass/ffull_detailed_train.json").as_posix()
images_folder = (HOME / "data/coco_datasets/Cocass/images").as_posix()
output_dir = (HOME / "data/coco_datasets/Cocass_aug").as_posix()  # where the augmented dataset is saved

# Every augmentation technique is disabled by default; enable the ones you want.
albu_coco_augmentation(
    annotations_file,
    images_folder,
    output_folder=output_dir,                 # where the augmented dataset is saved
    annotations_name="ffull_detailed_train",  # name of the augmented annotations file
    blur=True,                     # apply blur
    # blur_limit=15,               # upper limit of the random blur strength
    grayscale=True,                # apply grayscale conversion
    equalize=True,                 # apply histogram equalization
    dropout=True,                  # apply dropout (randomly remove pixels)
    # dropout_percentage=0.15,     # fraction of pixels to remove
    hue_saturation=True,           # apply hue and saturation shifts
    # hue_shift_limit=10,          # limit of the random hue shift
    # saturation_limit=10,         # limit of the random saturation change
    brightness=True,               # apply brightness and contrast changes
    # brightness_limit=0.2,        # limit of the random brightness change
    # contrast_limit=0.2,          # limit of the random contrast change
    gamma=True,                    # apply gamma correction
    gamma_range=(10, 130),         # range from which gamma is drawn
    augmentation_ratio=0.2,        # fraction of the images to augment
    verbose=True,
)
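
To give a feel for what a gamma setting does, here is a minimal sketch of gamma correction on 8-bit pixel values. It assumes the values in `gamma_range` are percentages, as in Albumentations' `RandomGamma` (where a sampled value of 50 means gamma = 0.5); whether `albu_coco_augmentation` passes the range through unchanged is an assumption. The function name `apply_gamma` is hypothetical.

```python
def apply_gamma(pixel_values, gamma_percent):
    """Apply gamma correction to 8-bit pixel values.

    gamma_percent is the gamma exponent in percent (assumed convention):
    100 leaves the image unchanged, lower values brighten it, and higher
    values darken it.
    """
    gamma = gamma_percent / 100.0
    return [round(255 * (value / 255) ** gamma) for value in pixel_values]


print(apply_gamma([0, 64, 255], 50))   # mid-tones become brighter
print(apply_gamma([0, 64, 255], 100))  # unchanged
```

Under that reading, `gamma_range=(10, 130)` spans exponents from 0.1 to 1.3, so most sampled values brighten the image.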

In [None]:
# To get back a single annotations file with train and val merged:
from datasets.data_transform import merge_coco_annotations

dataset_dir = HOME / "data/coco_datasets/cocass_f52_synth_4000_6000_3000_1000_1280_nlabels_aug"
train_annotations_file = (dataset_dir / "train_detailed_nolabelsonly.json").as_posix()
val_annotations_file = (dataset_dir / "val_detailed_nolabelsonly.json").as_posix()

merge_coco_annotations(
    [train_annotations_file, val_annotations_file],
    (dataset_dir / "detailed_nolabelsonly.json").as_posix(),
)

In [None]:
from datasets.data_transform import merge_coco_json

aug_dir = HOME / "data/coco_datasets/Cocass_aug"
merging = [
    (aug_dir / "fraw_detailed_val.json").as_posix(),
    (aug_dir / "ffull_detailed_train.json").as_posix(),
]
merge_coco_json(merging, (aug_dir / "ffull_detailed.json").as_posix())

In [None]:
from datasets.data_transform import merge_coco_json

cocass_dir = HOME / "data/coco_datasets/Cocass"
merging = [
    (cocass_dir / "f006_detailed.json").as_posix(),
    (cocass_dir / "f008_detailed.json").as_posix(),
    (cocass_dir / "f052_detailed.json").as_posix(),
    (cocass_dir / "f165_detailed.json").as_posix(),
    # (cocass_dir / "fsynth_detailed.json").as_posix(),
]
merge_coco_json(merging, (cocass_dir / "fraw_detailed.json").as_posix())
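
Merging COCO files is mostly a matter of avoiding id collisions, since each file typically numbers its images and annotations from scratch. The sketch below shows one way to do it on in-memory dictionaries; it is not the implementation of `merge_coco_json`, and the function name `merge_coco` and the choice to match categories by name are assumptions.

```python
def merge_coco(datasets):
    """Merge several COCO dicts, renumbering ids so that they stay unique."""
    merged = {"images": [], "annotations": [], "categories": []}
    category_ids = {}  # category name -> id in the merged dataset
    for coco in datasets:
        # Map this file's category ids onto shared ids, keyed by category name.
        cat_map = {}
        for cat in coco["categories"]:
            if cat["name"] not in category_ids:
                category_ids[cat["name"]] = len(category_ids) + 1
                merged["categories"].append({**cat, "id": category_ids[cat["name"]]})
            cat_map[cat["id"]] = category_ids[cat["name"]]
        # Give every image a fresh id and remember the old -> new mapping.
        img_map = {}
        for img in coco["images"]:
            img_map[img["id"]] = len(merged["images"]) + 1
            merged["images"].append({**img, "id": img_map[img["id"]]})
        # Re-point each annotation at the new image and category ids.
        for ann in coco["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]) + 1,
                "image_id": img_map[ann["image_id"]],
                "category_id": cat_map[ann["category_id"]],
            })
    return merged
```

This sketch does not deduplicate images: two files that reference the same `file_name` produce two image entries in the result.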

In [None]:
from datasets.data_transform import make_ids_linear
cocass_dir = HOME / "data/coco_datasets/Cocass"
make_ids_linear((cocass_dir / "ffull_detailed_train.json").as_posix(),
                (cocass_dir / "ffull_detailed.json").as_posix())
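
After merging or renumbering, it can be worth checking that the ids in an annotation file are still consistent. The helper below is a minimal, hypothetical sanity check that works on a loaded COCO dictionary; it is independent of `make_ids_linear` and makes no claim about what that function does internally.

```python
def check_coco_ids(coco: dict) -> list:
    """Return a list of human-readable problems found in the ids of a COCO dict."""
    problems = []
    image_ids = [img["id"] for img in coco["images"]]
    ann_ids = [ann["id"] for ann in coco["annotations"]]
    if len(set(image_ids)) != len(image_ids):
        problems.append("duplicate image ids")
    if len(set(ann_ids)) != len(ann_ids):
        problems.append("duplicate annotation ids")
    # Every annotation must refer to an image that exists in the file.
    known = set(image_ids)
    dangling = sum(1 for ann in coco["annotations"] if ann["image_id"] not in known)
    if dangling:
        problems.append(f"{dangling} annotation(s) point to a missing image")
    return problems
```

An empty list means no problem was found. To run it on a file, load the file with `json.load` first.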