# Data preprocessing

This notebook performs all the steps needed to prepare the images and annotations for use by our models. Note that running it takes a long time, because the steps below include downloading, extracting, moving and transforming a large number of files. We therefore suggest not re-running this notebook.

## Table of contents
1. [Downloading files](#download) <br>
2. [Reading json files](#read) <br>
3. [Removing unnecessary images](#remove) <br>
4. [Train, Val, Test split](#tvt) <br>
5. [Generating labels for YOLO](#generate) <br>
6. [Generating grayscale dataset](#grayscale) <br>
    6.1 [Three channels grayscale images](#three) <br>
    6.2 [One channel grayscale images](#one) <br>
7. [Colorize 1 channel grayscale images](#colorize) <br>
8. [Solving class imbalance](#imbalance) <br>
    8.1 [Undersampling](#under) <br>
    8.2 [Oversampling](#over) <br>


In [1]:
import os
import pandas as pd
import random
import numpy as np
import pickle
import shutil # used below to copy images and label files
from zipfile import ZipFile
import tarfile
import glob
from tqdm.notebook import tqdm

## 1. Downloading files <a class="anchor" id="download"></a>

Downloading the ```Aves.tar.gz``` archive (a gzip-compressed tarball) from [OneDrive](https://bocconi-my.sharepoint.com/:u:/g/personal/debora_nozza_unibocconi_it/EWj145j9O41NjVADGAGDJxoBe8QQkbogIY0aTw45YLKBmg?e=WbgCSa)

In [1]:
input_url = 'https://bocconi-my.sharepoint.com/:u:/g/personal/debora_nozza_unibocconi_it/EWj145j9O41NjVADGAGDJxoBe8QQkbogIY0aTw45YLKBmg?e=WbgCSa'
output_dir = "Aves.tar.gz"

split_url = input_url.rfind('?')
converted_url = input_url[:split_url] + '?download=1'

!wget -O "$output_dir" "$converted_url"

--2022-11-12 08:40:55--  https://bocconi-my.sharepoint.com/:u:/g/personal/debora_nozza_unibocconi_it/EWj145j9O41NjVADGAGDJxoBe8QQkbogIY0aTw45YLKBmg?download=1
Resolving bocconi-my.sharepoint.com... 52.105.130.25
Connecting to bocconi-my.sharepoint.com|52.105.130.25|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /personal/debora_nozza_unibocconi_it/Documents/5.%20Corsi/Computer%20Vision%20-%202022-2023/DEEP%20LEARNING%20FOR%20COMPUTER%20VISION%20-%20PROJECT%20DATA/Aves.tar.gz?ga=1 [following]
--2022-11-12 08:40:55--  https://bocconi-my.sharepoint.com/personal/debora_nozza_unibocconi_it/Documents/5.%20Corsi/Computer%20Vision%20-%202022-2023/DEEP%20LEARNING%20FOR%20COMPUTER%20VISION%20-%20PROJECT%20DATA/Aves.tar.gz?ga=1
Reusing existing connection to bocconi-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 51068277745 (48G) [application/x-gzip]
Saving to: ‘Aves.tar.gz’


2022-11-12 08:54:48 (58.5 MB/s) - ‘Aves.tar.gz’ saved [510
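
The link conversion used in the cell above can also be written with the standard library. This is only a sketch: `to_direct_download` is a hypothetical helper name, the example link is made up, and it assumes (as the cell above does) that replacing the query string with `download=1` is what triggers a direct download.

```python
from urllib.parse import urlsplit, urlunsplit

def to_direct_download(share_url: str) -> str:
    """Replace the query string of a share link with ``download=1``."""
    parts = urlsplit(share_url)
    # keep scheme, host and path; drop the original query and any fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "download=1", ""))

# hypothetical share link, used only for illustration
example = "https://example.sharepoint.com/:u:/g/personal/user/FILEID?e=abc123"
print(to_direct_download(example))
```

Compared with `rfind('?')`, this version still behaves sensibly when the link has no query string at all.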

The downloaded archive weighs 47.6 GB. Extracting it completely would write roughly another 47.6 GB to disk. Since we don't need all the categories contained in this file, we use the following code to read the archive and extract only the 10 categories of interest.

In [11]:
aves = tarfile.open("Aves.tar.gz") # opening the tar.gz archive

In [13]:
# the following are the most populated categories, thus the ones we want to unzip
new_cat = ['Ardea herodias',
 'Buteo jamaicensis',
 'Ardea alba',
 'Melospiza melodia',
 'Cardinalis cardinalis',
 'Zenaida macroura',
 'Agelaius phoeniceus',
 'Pandion haliaetus',
 'Junco hyemalis',
 'Picoides pubescens']

In [14]:
members = aves.getmembers() # reading the archive index once
for c in tqdm(new_cat):
    # extracting only the files stored under the folder of the current category
    for m in members:
        if m.name.startswith(f"Aves/{c}/"):
            aves.extract(m)

  0%|          | 0/10 [00:00<?, ?it/s]
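
As an illustration of selective extraction, here is a minimal single-pass sketch. It is not the code used for the project: `extract_categories` is a hypothetical helper, and a tiny in-memory archive with made-up file names stands in for `Aves.tar.gz`.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def extract_categories(archive, categories, root="Aves", dest="."):
    """Extract only the members stored under root/<category>/, in one pass."""
    prefixes = tuple(f"{root}/{c}/" for c in categories)
    extracted = []
    for member in archive:  # a TarFile object iterates over its members
        if member.name.startswith(prefixes):
            archive.extract(member, path=dest)
            extracted.append(member.name)
    return extracted

# tiny in-memory archive standing in for Aves.tar.gz (file names are made up)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name in ["Aves/Ardea alba/a.jpg", "Aves/Bubulcus ibis/b.jpg"]:
        payload = b"fake image bytes"
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp, tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    names = extract_categories(tar, ["Ardea alba"], dest=tmp)
    kept = sorted(p.name for p in Path(tmp).rglob("*.jpg"))
print(names, kept)
```

Passing a tuple of prefixes to `str.startswith` checks all categories at once, so the archive is scanned a single time regardless of how many categories are requested.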

## 2. Reading json files <a class="anchor" id="read"></a>

Image labels and bounding-box information are stored in _.json files_, downloaded from the [iNaturalist](https://github.com/visipedia/inat_comp/tree/master/2017) GitHub page. The next cells import these files and store the relevant information in a pandas dataframe.

In [2]:
from SATM.preprocess import merge_aves_df

We decided not to keep the train/val split provided by iNaturalist, for three reasons:
1. iNaturalist splits the data into two sets only (train and val), whereas we need three: train, val and test.
2. The iNaturalist split is heavily unbalanced towards the train set, leaving very few images in the val set to validate performance.
3. The val set proposed by iNaturalist is not representative of the entire dataset: images with more than one box appear only in the train set, while every val image features a single box.

In [3]:
df = merge_aves_df("train_2017_bboxes.json", "val_2017_bboxes.json")
df.head()

Unnamed: 0,license,file_name,rights_holder,height,width,id,area,iscrowd,image_id,bbox,category_id,id_y,identifier,category_name,super_category_name
139706,3,train_val_images/Aves/Bubulcus ibis/26a9157b48...,greglasley,545,800,213252,37238.0,0,213252,"[230, 199, 433, 172]",2912,153173,26a9157b48f66f71032f75ac70a11db7.jpg,Bubulcus ibis,Aves
214110,3,train_val_images/Insecta/Feltia herilis/8e7ecc...,leplady0209,600,800,318689,53133.5,0,318689,"[194, 138, 323, 329]",4169,235593,8e7ecc6f1bf06ad53acb8326b8696740.jpg,Feltia herilis,Insecta
429742,3,train_val_images/Aves/Tringa solitaria/a766afd...,J. Maughn,591,800,598727,6950.0,0,598727,"[113, 274, 139, 100]",3890,468382,a766afd4faa3b87cece8760ebaa6d9d1.jpg,Tringa solitaria,Aves
191185,3,train_val_images/Insecta/Manduca sexta/4f86fdb...,hobiecat,732,800,284974,194892.0,0,284974,"[146, 56, 654, 596]",3771,210301,4f86fdb62ca07b7257a1ab9a72816d98.jpg,Manduca sexta,Insecta
458018,3,train_val_images/Aves/Haemorhous mexicanus/e28...,cuskelly,600,800,631689,22000.0,0,631689,"[269, 129, 125, 352]",4506,498071,e28ff6dc89c379783c08bba85f66371c.jpg,Haemorhous mexicanus,Aves


## 3. Removing unnecessary images <a class="anchor" id="remove"></a>

Some photos of birds in the dataset come without bounding-box coordinates. We don't need these images, because our goal is object detection. Conversely, the ```df``` dataframe lists images for which we have not downloaded the corresponding _.jpg file_.
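
The two-way consistency check described above boils down to set differences. The sketch below uses made-up file names; `annotated` and `on_disk` are illustrative variables, not names from the notebook.

```python
annotated = {"a.jpg", "b.jpg", "c.jpg"}  # identifiers listed in the dataframe (made up)
on_disk = {"b.jpg", "c.jpg", "d.jpg"}    # files found in the image folders (made up)

images_without_boxes = on_disk - annotated  # files to delete from disk
boxes_without_images = annotated - on_disk  # rows to drop from the dataframe
usable = annotated & on_disk                # images usable for object detection

print(sorted(images_without_boxes), sorted(boxes_without_images), sorted(usable))
```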

In [4]:
from SATM.preprocess import clean_aves
category_list = os.listdir("Aves") # retrieving the list of categories
clean_aves(cat_list = category_list, path = "Aves", element = ".ipynb_checkpoints") # removing dirty files

In [5]:
from SATM.preprocess import encode_df
img_list = [i for cat in category_list for i in os.listdir("Aves/"+cat)] # retrieving the list of images available
df = df[df.identifier.isin(img_list)]
df = encode_df(df)

In [6]:
# some of the downloaded images may not feature a bounding box. We need to delete them
# to identify them, we look for pictures not appearing in the dataframe
img_list = [i for cat in category_list for i in os.listdir("Aves/"+cat)]
identifiers = set(df.identifier) # pictures appearing in the dataframe (a set makes the lookup fast)
to_remove = [i for i in tqdm(img_list) if i not in identifiers] # retrieving the list of images to be deleted
print(f'There are {len(to_remove)} pictures to be removed')

  0%|          | 0/22257 [00:00<?, ?it/s]

There are 0 pictures to be removed


In [7]:
# deleting images
path = "Aves"
rem_count = 0
for i in category_list:
    temp_path = path+"/"+i
    list_dir = os.listdir(temp_path)
    for k in to_remove:
        if k in list_dir:
            os.remove(temp_path+"/"+k)
            rem_count += 1
print(f'{rem_count} pictures removed!')

0 pictures removed!


In [8]:
new_img_list = [i for cat in category_list for i in os.listdir(path+"/"+cat)]
print(f'From {len(img_list)} to {len(new_img_list)} images')
if len(new_img_list) == len(df.drop_duplicates("image_id")):
    print("The number of rows in the dataframe is the same as the number of available images")

From 22257 to 22257 images
The number of rows in the dataframe is the same as the number of available images


## 4. Train, Val, Test split <a class="anchor" id="tvt"></a>

As previously mentioned, we adopt a different split from the one proposed by iNaturalist. To make the split random while preserving the class proportions, we sample 20% of the images of each category and divide this sample in half between validation and test; the remaining images form the training set. The resulting split is 80%-10%-10%. The dataframes are split first, and the _.jpg files_ are then copied into the appropriate ```Train```, ```Val```, and ```Test``` folders.
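
The per-category 80%-10%-10% logic can be sketched with the standard library alone. This is an illustration on toy data, not the pandas code used in this notebook; `stratified_split` and its parameters are hypothetical names.

```python
import random

def stratified_split(ids_by_class, holdout_frac=0.2, seed=810):
    """Hold out `holdout_frac` of each class and halve it between val and test."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for ids in ids_by_class.values():
        ids = list(ids)
        rng.shuffle(ids)
        n_holdout = round(len(ids) * holdout_frac)
        half = n_holdout // 2
        val += ids[:half]
        test += ids[half:n_holdout]
        train += ids[n_holdout:]
    return train, val, test

# toy identifiers for two classes
toy = {"A": [f"a{i}.jpg" for i in range(50)], "B": [f"b{i}.jpg" for i in range(20)]}
train_ids, val_ids, test_ids = stratified_split(toy)
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting inside each class keeps the class proportions roughly equal across the three sets, which a single global shuffle would not guarantee.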

In [9]:
seed = 810
random.seed(seed)
np.random.seed(seed)
images_df = df.drop_duplicates("image_id")
val_identifiers = []
test_identifiers = []
for cat in category_list:
    temp_list = list(images_df[images_df.category_name == cat].sample(frac = 0.2, random_state = 0).identifier)
    l = len(temp_list)//2
    val_identifiers += temp_list[:l]
    test_identifiers += temp_list[l:]
    #print(len(val_identifiers), len(test_identifiers))
random.shuffle(val_identifiers)
random.shuffle(test_identifiers)

In [10]:
val_test_identifiers = val_identifiers+test_identifiers
val_df = df[df.identifier.isin(val_identifiers)]
test_df = df[df.identifier.isin(test_identifiers)]
train_df = df[~df.identifier.isin(val_test_identifiers)]

In [11]:
train_identifiers = list(train_df.drop_duplicates("image_id").identifier)

To check that we have performed the split correctly, we check, for each category, that the number of images available has remained unchanged. We do this check for the images and for the boxes (note that some images have more than one box inside, so the number of boxes and images may not coincide).

In [12]:
if len(train_df) + len(val_df) + len(test_df) == len(df):
    print("Split done correctly!")

Split done correctly!
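
A complementary check, not part of the original notebook, is to verify that no image ends up in more than one set and that no image is lost. The sketch below assumes plain lists of identifiers; `check_split` is a hypothetical helper.

```python
def check_split(train_ids, val_ids, test_ids, all_ids):
    """Return (disjoint, complete) for a train/val/test split of identifiers."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    disjoint = not (train & val or train & test or val & test)
    complete = (train | val | test) == set(all_ids)
    return disjoint, complete

ok = check_split(["a", "b"], ["c"], ["d"], ["a", "b", "c", "d"])
leaky = check_split(["a", "b"], ["b"], ["d"], ["a", "b", "d"])
print(ok, leaky)
```

In the notebook the same function could be called with `train_identifiers`, `val_identifiers`, `test_identifiers` and the identifiers of `df`.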


In [13]:
# number of boxes (dataframe rows) per category in each split
recap_split_box = pd.DataFrame(columns = ["Original", "Train", "Val", "Test", "Sum"])
for cat in category_list:
    counts = [len(d[d.category_name == cat]) for d in (train_df, val_df, test_df)]
    recap_split_box.loc[cat,:] = [len(df[df.category_name == cat]), *counts, sum(counts)]

recap_split_box

Category,Original,Train,Val,Test,Sum
Melospiza melodia,2098,1676,213,209,2098
Ardea alba,3640,2917,356,367,3640
Pandion haliaetus,1999,1588,201,210,1999
Cardinalis cardinalis,2207,1758,220,229,2207
Zenaida macroura,2502,1997,250,255,2502
Agelaius phoeniceus,2348,1856,237,255,2348
Junco hyemalis,1385,1113,134,138,1385
Ardea herodias,4299,3445,422,432,4299
Buteo jamaicensis,3612,2874,372,366,3612
Picoides pubescens,1546,1240,154,152,1546


In [14]:
# number of distinct images per category in each split
recap_split_image = pd.DataFrame(columns = ["Original", "Train", "Val", "Test", "Sum"])
unique_dfs = [d.drop_duplicates("image_id") for d in (df, train_df, val_df, test_df)] # one row per image
for cat in category_list:
    original, *counts = [len(d[d.category_name == cat]) for d in unique_dfs]
    recap_split_image.loc[cat,:] = [original, *counts, sum(counts)]

recap_split_image

Category,Original,Train,Val,Test,Sum
Melospiza melodia,2050,1640,205,205,2050
Ardea alba,2848,2278,285,285,2848
Pandion haliaetus,1794,1435,179,180,1794
Cardinalis cardinalis,2006,1605,200,201,2006
Zenaida macroura,1932,1546,193,193,1932
Agelaius phoeniceus,1888,1510,189,189,1888
Junco hyemalis,1281,1025,128,128,1281
Ardea herodias,3627,2902,362,363,3627
Buteo jamaicensis,3328,2662,333,333,3328
Picoides pubescens,1503,1202,150,151,1503


In [15]:
# saving dataframes
with open('pickles_df/train.pickle', 'wb') as handle:
    pickle.dump(train_df, handle)
with open('pickles_df/val.pickle', 'wb') as handle:
    pickle.dump(val_df, handle)
with open('pickles_df/test.pickle', 'wb') as handle:
    pickle.dump(test_df, handle)

Now that the split has been done at the dataframe level, let's perform the split in the files and directories.

In [None]:
# copying the images into the correct Train, Val, Test folders,
# according to train_df, val_df and test_df
count_train = 0
count_val = 0
count_test = 0

errors = [] # list of images that could not be copied


for cat in tqdm(category_list): # iterating through the categories
    temp_path = "Aves/"+cat
    temp_img_list = os.listdir(temp_path)
    for i in tqdm(temp_img_list):
        if i in train_identifiers:
            try:
                shutil.copyfile(temp_path+"/"+i, "data/images/Train/"+i)
                count_train += 1
            except OSError:
                errors.append(i)
                print(i, "train error")

        elif i in val_identifiers:
            try:
                shutil.copyfile(temp_path+"/"+i, "data/images/Val/"+i)
                count_val += 1
            except OSError:
                errors.append(i)
                print(i, "val error")

        elif i in test_identifiers:
            try:
                shutil.copyfile(temp_path+"/"+i, "data/images/Test/"+i)
                count_test += 1
            except OSError:
                errors.append(i)
                print(i, "test error")

        else:
            errors.append(i)
            print(i, "the image is not contained in the dataframe")

## 5. Generating labels for YOLO <a class="anchor" id="generate"></a>

YOLO expects annotations in a specific format that differs from the one provided by iNaturalist. To solve this incompatibility, we use the ```convert_to_yolov5``` function, which rewrites the annotations so that they can be processed by YOLO.
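
To give an idea of what such a conversion involves, here is a minimal sketch. It is not the implementation of ```convert_to_yolov5```: it assumes that the ```bbox``` column stores COCO-style `[x_min, y_min, width, height]` values in pixels, and that each YOLO label line is `class x_center y_center width height` with coordinates normalized by the image size. The helper names are hypothetical.

```python
def coco_to_yolo(bbox, img_width, img_height):
    """Convert [x_min, y_min, width, height] in pixels to normalized YOLO values."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / img_width
    y_center = (y_min + box_h / 2) / img_height
    return x_center, y_center, box_w / img_width, box_h / img_height

def yolo_line(class_id, bbox, img_width, img_height):
    """Format one label line: class id followed by the four normalized values."""
    values = coco_to_yolo(bbox, img_width, img_height)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

# box and image size taken from the first row of df.head(); class id 0 is arbitrary
print(yolo_line(0, [230, 199, 433, 172], 800, 545))
```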

In [16]:
from SATM.preprocess import convert_to_yolov5
convert_to_yolov5(which = "Train", df = train_df)
convert_to_yolov5(which = "Val", df = val_df)
convert_to_yolov5(which = "Test", df = test_df)

In [16]:
from SATM.preprocess import clean_data
clean_data() # clearing data from dirty files

## 6. Generating the grayscale dataset <a class="anchor" id="grayscale"></a>

Since we want to evaluate the models on grayscale images (train on gray & test on gray, train on color & test on gray, train on gray & test on color), we need to convert the images to grayscale. To do this, it is sufficient to average the three color channels.
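
The channel average can be illustrated on a random array. This sketch uses NumPy rather than the torch tensors of the cells below, and `to_grayscale` is a hypothetical helper name.

```python
import numpy as np

def to_grayscale(image, keep_three_channels=True):
    """Average the channels of a (3, H, W) float array."""
    gray = image.mean(axis=0)  # shape (H, W)
    if keep_three_channels:
        return np.stack([gray, gray, gray])  # shape (3, H, W), identical channels
    return gray

rgb = np.random.default_rng(0).random((3, 4, 5))  # random stand-in for an image
gray3 = to_grayscale(rgb)
gray1 = to_grayscale(rgb, keep_three_channels=False)
print(gray3.shape, gray1.shape)
```

Note that a plain average weights the three channels equally, whereas luminance-based conversions (for example the ITU-R BT.601 weights 0.299, 0.587, 0.114) give more weight to green.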

### 6.1 Three channels grayscale images <a class="anchor" id="three"></a>

With the next code cell we generate the grayscale images, keeping 3 channels, so that they have the same format as the color images.

In [95]:
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
transform = transforms.Compose([transforms.ToTensor()])
for folder in tqdm(["Train", "Val", "Test"]):
    source_p = os.listdir("data/images/"+folder)
    for i in tqdm(source_p):
        image_file = Image.open(f"data/images/{folder}/{i}") # open colour image
        tensor = transform(image_file) # tensor of shape (channels, H, W)
        avg_tensor = tensor.mean(axis = 0) # average across the channels
        new_image = torch.stack([avg_tensor, avg_tensor, avg_tensor]) # replicate the gray channel three times
        print(new_image.shape)
        #save_image(new_image, f'data_bw/images/{folder}/{i}')

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/17806 [00:00<?, ?it/s]

torch.Size([3, 532, 800])


  0%|          | 0/2224 [00:00<?, ?it/s]

torch.Size([3, 800, 800])


  0%|          | 0/2228 [00:00<?, ?it/s]

torch.Size([3, 800, 713])


### 6.2 One channel grayscale images <a class="anchor" id="one"></a>

With the next code cell, instead, we generate single-channel grayscale images. To be compatible with models that expect three-channel color images, these single-channel images must then be recolored.

In [39]:
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
transform = transforms.Compose([transforms.ToTensor()])
for folder in tqdm(["Train", "Val", "Test"]):
    source_p = os.listdir("data/images/"+folder)
    for i in tqdm(source_p):
        image_file = Image.open(f"data/images/{folder}/{i}") # open colour image
        tensor = transform(image_file)
        avg_tensor = tensor.mean(axis = 0) # average across the channels, shape (H, W)
        save_image(avg_tensor, f'data_1channel/images/{folder}/{i}')



  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/17805 [00:00<?, ?it/s]

  0%|          | 0/2224 [00:00<?, ?it/s]

  0%|          | 0/2228 [00:00<?, ?it/s]

## 7. Colorize 1 channel grayscale images <a class="anchor" id="colorize"></a>

We want to colorize the single-channel grayscale images, turning them into three-channel color images that can be processed by our models trained on color images. To do this we used [DeOldify Image Colorization](https://deepai.org/machine-learning-model/colorizer), available on DeepAI. It is a GAN-based model capable of recoloring images and videos ([see GitHub here](https://github.com/jantic/DeOldify)).

In [9]:
#!git clone https://github.com/jantic/DeOldify.git DeOldify 

In [6]:
cd DeOldify # to run the colorizer we need to temporarily change the working directory

/home/labuser/Project/DeOldify


In [7]:
#NOTE:  This must be the first call in order to work properly!
from deoldify import device
from deoldify.device_id import DeviceId
#choices:  CPU, GPU0...GPU7
device.set(device=DeviceId.GPU0)

import torch

if not torch.cuda.is_available():
    print('GPU not available.')

In [13]:
# !pip install -r requirements-colab.txt

In [8]:
import fastai
from deoldify.visualize import *
import warnings
warnings.filterwarnings("ignore", category=UserWarning, message=".*?Your .*? set is empty.*?")

In [15]:
!mkdir 'models'
!wget https://data.deepai.org/deoldify/ColorizeArtistic_gen.pth -O ./models/ColorizeArtistic_gen.pth

--2022-12-02 18:19:11--  https://data.deepai.org/deoldify/ColorizeArtistic_gen.pth
Resolving data.deepai.org... 5.9.140.253
Connecting to data.deepai.org|5.9.140.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255144681 (243M) [application/octet-stream]
Saving to: ‘./models/ColorizeArtistic_gen.pth’


2022-12-02 18:19:13 (108 MB/s) - ‘./models/ColorizeArtistic_gen.pth’ saved [255144681/255144681]



In [9]:
colorizer = get_image_colorizer(artistic=True)



In [11]:
for img in tqdm(os.listdir("../data/images/Test/")):
    img_pil = colorizer.get_transformed_image("../data/images/Test/"+img,
                                render_factor = 35)
    img_pil.save("../colorized/images/Test/"+img)

  0%|          | 0/2228 [00:00<?, ?it/s]

In [12]:
cd .. # setting back the main directory

/home/labuser/Project


## 8. Solving class imbalance <a class="anchor" id="imbalance"></a>

As shown in the ```data_visualization.ipynb``` notebook, the classes are not well balanced. To address this potential problem, we can apply two techniques to the training dataset: _undersampling_ and _oversampling_.

![under-over](https://pulplearning.altervista.org/wp-content/uploads/2020/08/undersampling-oversampling.png)

Within the project, we evaluate the performance of Faster R-CNN and YOLO under both undersampling and oversampling.
* For **Faster R-CNN** we used the ```SATMsampler``` class, defined in the ```dataset.py``` file and used in ```Faster_training.ipynb```.
* For **YOLO** it was instead necessary to create dedicated directories containing the files for each setting (undersampling and oversampling). This processing is shown in the code cells below.

### 8.1 Undersampling <a class="anchor" id="under"></a>

Undersampling simply consists of identifying the least represented class and randomly removing images from all other classes so that their number is reduced to the level of the underpopulated class.
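
The idea can be sketched on toy data with the standard library. `undersample` is a hypothetical helper, not the code used below; it samples without replacement so that every class ends up with exactly as many distinct images as the smallest class.

```python
import random

def undersample(ids_by_class, seed=810):
    """Keep, for every class, as many randomly chosen images as the smallest class has."""
    rng = random.Random(seed)
    target = min(len(ids) for ids in ids_by_class.values())
    # random.sample draws without replacement, so no image is picked twice
    return {cat: rng.sample(list(ids), target) for cat, ids in ids_by_class.items()}

toy = {
    "A": [f"a{i}.jpg" for i in range(5)],
    "B": [f"b{i}.jpg" for i in range(3)],
    "C": [f"c{i}.jpg" for i in range(8)],
}
balanced = undersample(toy)
print({cat: len(ids) for cat, ids in balanced.items()})
```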

In [None]:
# computing the number of images in the least populated class
min_threshold = train_df.drop_duplicates("image_id").groupby("category_name").size().min()

In [78]:
np.random.seed(810) # setting a seed for reproducibility
for cat in tqdm(category_list): # iterating through categories
    temp_df = train_df.drop_duplicates("image_id") # images-only train dataset
    identifiers = temp_df[temp_df.category_name == cat].identifier.values # names of the images
    chosen = np.random.choice(identifiers, min_threshold, replace = False) # choosing, without repetition, the images to be kept
    for pic in tqdm(chosen): # copying the sampled images and the corresponding txt files
        shutil.copyfile("data/images/Train/"+pic, "data_under/images/Train/"+pic)
        shutil.copyfile("data/labels/Train/"+pic[:-3]+"txt", "data_under/labels/Train/"+pic[:-3]+"txt")

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

### 8.2 Oversampling <a class="anchor" id="over"></a>

Oversampling, instead, identifies the most populated class and brings the number of images of every other class up to that level. This process is slightly more complicated, since enlarging the underrepresented classes requires generating new images. To do this we use [Albumentations](https://github.com/albumentations-team/albumentations), a library that transforms images (in our case with rotations and flips) and applies the same transformation to the bounding boxes.
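
To see why the bounding boxes must be transformed together with the image, consider a horizontal flip. The sketch below does the arithmetic by hand for a box in `pascal_voc` format, `(x_min, y_min, x_max, y_max)` in pixels; it is an illustration only, since in this notebook the library performs this step. `hflip_box` is a hypothetical helper and the numbers are made up.

```python
def hflip_box(box, img_width):
    """Mirror a (x_min, y_min, x_max, y_max) box around the vertical axis of the image."""
    x_min, y_min, x_max, y_max = box
    # the old right edge becomes the new left edge, and vice versa
    return (img_width - x_max, y_min, img_width - x_min, y_max)

box = (230, 199, 663, 371)  # made-up box in an 800-pixel-wide image
flipped = hflip_box(box, 800)
print(flipped)
```

Flipping twice returns the original box, which is a quick sanity check for this kind of transformation.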

In [31]:
# identifying the number of images belonging to the most populated class
max_threshold = train_df.drop_duplicates("image_id").groupby("category_name").size().max()

In [33]:
# how many images need to be added per each category?
to_add = max_threshold - train_df.drop_duplicates("image_id").groupby("category_name").size()

In [34]:
to_add

category_name
Agelaius phoeniceus      1392
Ardea alba                624
Ardea herodias              0
Buteo jamaicensis         240
Cardinalis cardinalis    1297
Junco hyemalis           1877
Melospiza melodia        1262
Pandion haliaetus        1467
Picoides pubescens       1700
Zenaida macroura         1356
dtype: int64

In [18]:
# importing the custom function that writes txt files with the correct label(s) and bounding box(es)
from SATM.preprocess import generate_txt

In [153]:
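import albumentations
import cv2
# NOTE: generate_target, torch and save_image must already be available in the session

np.random.seed(810) # setting a seed for reproducibility

# defining the two possible transformations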
bbox_transform_flip = albumentations.Compose([albumentations.HorizontalFlip(p=1)],
                                                    bbox_params = albumentations.BboxParams(format='pascal_voc',
                                                                                            label_fields=['labels']))

bbox_transform_rotate = albumentations.Compose([albumentations.Rotate(p=1)],
                                                    bbox_params = albumentations.BboxParams(format='pascal_voc',
                                                                                           label_fields=['labels']))
wrong = [] # list for possible errors
trans_dic = {}

for cat in tqdm(category_list): # iterating across categories
    #print(to_add[cat])
    temp_df = train_df.drop_duplicates("image_id") # images-only dataframe
    identifiers = temp_df[temp_df.category_name == cat].identifier.values # list of images name
    pota  = to_add[cat] # number of images to be added in the specific class
    trans_dic[cat] = []
    #print(len(identifiers))
    
    # there are two possible cases:
        # 1. the number of available images exceeds the number of images to be added:
            # in this case at most one transformation per image is sufficient.
        # 2. otherwise (that is, if transforming each image once is not enough
            # to reach the threshold), more than one transformation per image is needed.
    
    
    # ------ CASE 1 -------
    if len(identifiers) > pota: 
        to_transform = np.random.choice(identifiers, pota, replace = False) # no repetition, so that no "_v1" file is overwritten
        for img in tqdm(to_transform):
            image = cv2.imread("data/images/Train/"+img)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            targets = generate_target(train_df[train_df.identifier == img])
            try: # try to apply the transformation
                transformed = bbox_transform_flip(image = image,
                                             bboxes = targets['boxes'],
                                             labels = targets['labels'])

                trans_dic[cat].append(transformed)
                generate_txt(img, transformed, "_v1")
                save_image(torch.tensor(transformed["image"]).cpu().permute(2,0,1)/255, "data_over/images/Train/"+img[:-4]+"_v1"+".jpg")
            except:
                wrong.append(img)

            
    # ------ CASE 2 -------       
    else: 
        difference = pota - len(identifiers) # images still missing after transforming every image once
        for img in tqdm(identifiers):
            image = cv2.imread("data/images/Train/"+img)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            targets = generate_target(train_df[train_df.identifier == img])
            
            try: # try to apply the transformation
                transformed = bbox_transform_flip(image = image,
                                             bboxes = targets['boxes'],
                                             labels = targets['labels'])

                trans_dic[cat].append(transformed)
                generate_txt(img, transformed, "_v1")
                save_image(torch.tensor(transformed["image"]).cpu().permute(2,0,1)/255, "data_over/images/Train/"+img[:-4]+"_v1"+".jpg")
            except:
                
                wrong.append(img)
                
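        # re-sampling from the previously transformed images and applying a different transformation
        # (replace = False avoids overwriting "_v2" files; it requires difference <= len(identifiers))
        to_transform = np.random.choice(identifiers, difference, replace = False)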
        for img in tqdm(to_transform):
            image = cv2.imread("data/images/Train/"+img)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            targets = generate_target(train_df[train_df.identifier == img])
            try: # try to apply the transformation
                
                transformed = bbox_transform_rotate(image = image,
                                         bboxes = targets['boxes'],
                                         labels = targets['labels'])
            
            
                trans_dic[cat].append(transformed)
                generate_txt(img, transformed, "_v2")
                save_image(torch.tensor(transformed["image"]).cpu().permute(2,0,1)/255, "data_over/images/Train/"+img[:-4]+"_v2"+".jpg")
            except:
                wrong.append(img)                

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/1262 [00:00<?, ?it/s]

  0%|          | 0/624 [00:00<?, ?it/s]

  0%|          | 0/1435 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

  0%|          | 0/1297 [00:00<?, ?it/s]

  0%|          | 0/1356 [00:00<?, ?it/s]

  0%|          | 0/1392 [00:00<?, ?it/s]

  0%|          | 0/1025 [00:00<?, ?it/s]

  0%|          | 0/852 [00:00<?, ?it/s]

0it [00:00, ?it/s]

  0%|          | 0/240 [00:00<?, ?it/s]

  0%|          | 0/1202 [00:00<?, ?it/s]

  0%|          | 0/498 [00:00<?, ?it/s]