<a href="https://colab.research.google.com/github/JasonManesis/Wildfire-Smoke-Detection/blob/main/Scripts/Wildfires_Detectron2_Dataset_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# Import some common libraries.
import numpy as np
import pandas as pd
import random
import yaml
import os, json, cv2, matplotlib
from google.colab.patches import cv2_imshow
import matplotlib.pyplot as plt
import matplotlib.font_manager
import datetime
import pytz
from termcolor import colored

In [2]:
# Mount google drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Dataset Preprocessing.

```
The folder structure of the original dataset looks like this:

<dataset_dir>/
    test/
        <filename0>.jpg
        <filename1>.jpg
        ...
        _annotations.coco.json
    train/
        <filename0>.jpg
        <filename1>.jpg
        ...
        _annotations.coco.json
    valid/
        <filename0>.jpg
        <filename1>.jpg
        ...
        _annotations.coco.json
```
```
The folder structure of a COCO dataset looks like this:

<dataset_dir>/
    data/
        <filename0>.<ext>
        <filename1>.<ext>
        ...
    labels.json
```
```
So we want to change the folder structure to:

<dataset_dir>/
    test/
        data/
            <filename0>.jpg
            <filename1>.jpg
            ...
        labels.json
    train/
        data/
            <filename0>.jpg
            <filename1>.jpg
            ...
        labels.json
    valid/
        data/
            <filename0>.jpg
            <filename1>.jpg
            ...
        labels.json
```
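
The target layout above can be checked programmatically once the files have been moved. The following is a minimal sketch, not part of the original notebook: the helper name `check_coco_layout` and its `splits` parameter are hypothetical, and it only checks that each split has a `data/` folder and a `labels.json` file.

```python
import os


def check_coco_layout(dataset_dir, splits=("test", "train", "valid")):
    """Return a list of problems; an empty list means the layout matches.

    Hypothetical helper: it only looks for <split>/data/ and
    <split>/labels.json, it does not validate the file contents.
    """
    problems = []
    for split in splits:
        split_dir = os.path.join(dataset_dir, split)
        data_dir = os.path.join(split_dir, "data")
        labels_file = os.path.join(split_dir, "labels.json")
        if not os.path.isdir(data_dir):
            problems.append(f"missing folder: {data_dir}")
        if not os.path.isfile(labels_file):
            problems.append(f"missing file: {labels_file}")
    return problems
```

Calling `check_coco_layout(dataset_dir)` on the restructured dataset should return an empty list if every split follows the layout shown above.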




## Fix the class id key in the original .json file and change the folder structure of the dataset to match the COCO dataset folder structure.

In [8]:
def json_properties(json_name, annotations_directory):

    """Opens a .json file from a given directory and prints its properties.

    Args:
        json_name (str): Name of the .json file for reading.
        annotations_directory (str): Path to the directory that contains the .json file.

    """
    # Extract a list with all file names in the given directory.
    _, _, SubFiles = next(os.walk(annotations_directory))

    if json_name in SubFiles:
        json2open = os.path.join(annotations_directory, json_name)

        try:
            # Load the .json data file; the context manager closes it even on errors.
            with open(json2open) as f:
                dataset = json.load(f)
        except (OSError, json.JSONDecodeError):
            print("Error with data loading")
            return

        # Print dataset properties.
        if isinstance(dataset, dict):
            for i, key in enumerate(dataset.keys()):
                data_structure = type(dataset[key])
                length = len(dataset[key])
                print(i+1,"|",key,(15-len(key))*" ","|",data_structure,
                      "|",length,"element/s")

    else:
        print("The requested .json file is not included in the given directory!")

In [9]:
def print_json(ttv,json_name,annotations_directory):

    """Auxiliary function that works together with json_properties() to format the
        printed output of the given .json file properties.

    Args:
        ttv (str): Takes one of the values "training", "testing" or "validation".
        json_name (str): Name of the .json file for reading.
        annotations_directory (str): Path to the directory that contains the .json file.

    """
    print("Properties of .json file with annotations for " + ttv + ":")
    print('--'*27)
    json_properties(json_name,annotations_directory)
    print('--'*27+'\n'*2)

In [10]:
#Print the properties of the .json files:
print_json('training','_annotations.coco.json',"/content/drive/MyDrive/Datasets/Wildfires_2/train")
print_json('testing','_annotations.coco.json',"/content/drive/MyDrive/Datasets/Wildfires_2/test")
print_json('validation','_annotations.coco.json',"/content/drive/MyDrive/Datasets/Wildfires_2/valid")

Properties of .json file with annotations for training:
------------------------------------------------------
1 | info             | <class 'dict'> | 6 element/s
2 | licenses         | <class 'list'> | 1 element/s
3 | categories       | <class 'list'> | 2 element/s
4 | images           | <class 'list'> | 516 element/s
5 | annotations      | <class 'list'> | 516 element/s
------------------------------------------------------


Properties of .json file with annotations for testing:
------------------------------------------------------
1 | info             | <class 'dict'> | 6 element/s
2 | licenses         | <class 'list'> | 1 element/s
3 | categories       | <class 'list'> | 2 element/s
4 | images           | <class 'list'> | 74 element/s
5 | annotations      | <class 'list'> | 74 element/s
------------------------------------------------------


Properties of .json file with annotations for validation:
------------------------------------------------------
1 | info             | <cl

In [11]:
#Inspection of the categories key in the original annotation file.
f = open('/content/drive/MyDrive/Datasets/Wildfires_2/train/_annotations.coco.json')
dataset = json.load(f)
f.close()
dataset['categories']

[{'id': 0, 'name': 'Smoke', 'supercategory': 'none'},
 {'id': 1, 'name': 'smoke', 'supercategory': 'Smoke'}]

We can see that the 'categories' key of the dataset holds a list of two dictionaries, each with three keys: 'id', 'name' and 'supercategory'. The problem with these annotations is that both entries name the positive class ('Smoke' and 'smoke'), so the same class appears under two different ids. That confuses detectron2's dataset loading functions, so it must be fixed.
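
Before deleting an entry, it can help to check which category id the annotations actually use. The sketch below is an illustration only: the helper name `annotations_per_category` is hypothetical, and the `toy` dictionary is made-up data that mimics the structure shown above, not the real dataset. It relies on the standard COCO `category_id` field of each annotation.

```python
from collections import Counter


def annotations_per_category(dataset):
    """Map each category id to (name, number of annotations using that id)."""
    counts = Counter(ann["category_id"] for ann in dataset.get("annotations", []))
    return {
        cat["id"]: (cat["name"], counts.get(cat["id"], 0))
        for cat in dataset.get("categories", [])
    }


# Toy data (assumption, for illustration): two categories, all boxes use id 1.
toy = {
    "categories": [
        {"id": 0, "name": "Smoke", "supercategory": "none"},
        {"id": 1, "name": "smoke", "supercategory": "Smoke"},
    ],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 1},
        {"id": 1, "image_id": 1, "category_id": 1},
    ],
}

print(annotations_per_category(toy))  # {0: ('Smoke', 0), 1: ('smoke', 2)}
```

If the same check on the real annotation file shows that no annotation refers to id 0, removing the first entry of 'categories' does not orphan any annotation.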

In [59]:
def fix_jsons_keys(dataset_dir):

    """Opens the original annotation file (.json) of every split folder in a given directory,
    deletes the first sub-dictionary from the 'categories' key and saves the result as a new .json file ("labels.json").

    Args:
        dataset_dir (str): Path to the dataset root directory (the one that contains train/, test/ and valid/).

    """
    _, SubFolders, _ = next(os.walk(dataset_dir))

    for folder_name in SubFolders:
        json_2_open = os.path.join(dataset_dir, folder_name, "_annotations.coco.json")

        # Open the original .json.
        with open(json_2_open) as f:
            dataset = json.load(f)

        # Delete the redundant first entry of the 'categories' list.
        if 'categories' in dataset:
            del dataset['categories'][0]

        # Save the corrected annotations as labels.json next to the original file.
        with open(os.path.join(dataset_dir, folder_name, "labels.json"), "w") as jsonFile:
            json.dump(dataset, jsonFile)

    print(colored('Keys in .json files are now corrected !', 'green', attrs=['bold']))

In [60]:
dataset_dir = "/content/drive/MyDrive/Datasets/Wildfires_2"
fix_jsons_keys(dataset_dir)

[1m[32mKeys in .json files are now corrected ![0m


In [14]:
#Print the properties of the .json files:
print_json('training','labels.json',"/content/drive/MyDrive/Datasets/Wildfires_2/train")
print_json('testing','labels.json',"/content/drive/MyDrive/Datasets/Wildfires_2/test")
print_json('validation','labels.json',"/content/drive/MyDrive/Datasets/Wildfires_2/valid")

Properties of .json file with annotations for training:
------------------------------------------------------
1 | info             | <class 'dict'> | 6 element/s
2 | licenses         | <class 'list'> | 1 element/s
3 | categories       | <class 'list'> | 1 element/s
4 | images           | <class 'list'> | 516 element/s
5 | annotations      | <class 'list'> | 516 element/s
------------------------------------------------------


Properties of .json file with annotations for testing:
------------------------------------------------------
1 | info             | <class 'dict'> | 6 element/s
2 | licenses         | <class 'list'> | 1 element/s
3 | categories       | <class 'list'> | 1 element/s
4 | images           | <class 'list'> | 74 element/s
5 | annotations      | <class 'list'> | 74 element/s
------------------------------------------------------


Properties of .json file with annotations for validation:
------------------------------------------------------
1 | info             | <cl

In [15]:
#Inspection of the categories key in the new annotation file.
f = open('/content/drive/MyDrive/Datasets/Wildfires_2/train/labels.json')
dataset = json.load(f)
f.close()
dataset['categories']

[{'id': 1, 'name': 'smoke', 'supercategory': 'Smoke'}]

**OK!**

## Move image and annotation data to new folders.

In [21]:
dataset_dir = '/content/drive/MyDrive/Datasets/Wildfires_2'
fix_jsons_keys(dataset_dir)

In [40]:
def copy_image_data(dataset_dir):

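    """Copies the contents of all three split directories (train, test, valid)
    to three new ones with a different folder structure:
    <dataset_dir>/train -> <dataset_dir>_COCO/train/data

    Because the whole folder content is copied, the annotation files end up
    in data/ as well.

    Args:
        dataset_dir (str): Path to the dataset root directory.

    """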
    _, SubFolders, _ = next(os.walk(dataset_dir))

    for subfoldername in SubFolders:
        #the dot at the end of the string is used because we want to copy only 
        # the files of the folder and not the folder itself.
        old_dir = dataset_dir + '/' + subfoldername + '/' + '.' 
        old_dir_color = colored(old_dir[:-1],'blue', attrs=['bold'])
        new_dir =  dataset_dir +'_COCO'+ '/' + subfoldername + '/data/'  
        new_dir_color = colored(new_dir,'green', attrs=['bold'])
        os.makedirs(new_dir, exist_ok=True) #create the new directory.
        !cp -r $old_dir $new_dir #copy all the images from old to new directory.
        print(f'{subfoldername.capitalize()} images successfully copied from {old_dir_color} to {new_dir_color}')

In [54]:
copy_image_data(dataset_dir)

Test images successfully copied from /content/drive/MyDrive/Datasets/Wildfires_2/test/ to /content/drive/MyDrive/Datasets/Wildfires_2_COCO/test/data/
Train images successfully copied from /content/drive/MyDrive/Datasets/Wildfires_2/train/ to /content/drive/MyDrive/Datasets/Wildfires_2_COCO/train/data/
Valid images successfully copied from /content/drive/MyDrive/Datasets/Wildfires_2/valid/ to /content/drive/MyDrive/Datasets/Wildfires_2_COCO/valid/data/
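
Since `cp -r` on the whole folder also copies the .json annotation files into `data/`, an alternative is to copy only the image files in plain Python. This is a minimal sketch under stated assumptions, not the notebook's method: the helper name `copy_images_only` and the default extension list are hypothetical choices, and the function works on a single split folder.

```python
import os
import shutil


def copy_images_only(src_dir, dst_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Copy only files whose name ends with one of `extensions`.

    Returns the number of files copied. The extension tuple is an
    assumption; adjust it to the formats present in the dataset.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        # Skip sub-folders and anything that is not an image file.
        if os.path.isfile(src) and name.lower().endswith(extensions):
            shutil.copy2(src, os.path.join(dst_dir, name))
            copied += 1
    return copied
```

With this approach no annotation files land in `data/`, so nothing has to be deleted there afterwards; it also runs outside IPython, where the `!cp` shell syntax is not available.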


In [52]:
def copy_annotation_data(dataset_dir):

    """Copies the corrected annotation file to the new directories of the
    dataset and deletes the leftover annotation files from the data/ folders.

    Args:
        dataset_dir (str): Path to the original dataset root directory.

    """
    _, SubFolders, _ = next(os.walk(dataset_dir))

    for subfoldername in SubFolders:
        
        #Copy corrected annotation file to the new directory.
        source = dataset_dir + '/' + subfoldername + '/' + 'labels.json' 
        destination =  (dataset_dir +'_COCO'+ '/' + subfoldername + '/'
            + 'labels.json')  
        source_color = colored(source,'blue', attrs=['bold'])
        destination_color = colored(destination,'green', attrs=['bold'])
        !cp  $source $destination

        #Delete the previous annotation files.
        file_4_del_path_1 = (dataset_dir +'_COCO'+ '/' + subfoldername + '/data/' 
            + '_annotations.coco.json')  
        file_4_del_path_1_c = colored(file_4_del_path_1,'red', attrs=['bold'])
        file_4_del_path_2 = (dataset_dir +'_COCO'+ '/' + subfoldername + '/data/' 
            + 'labels.json') 
        file_4_del_path_2_c = colored(file_4_del_path_2,'red', attrs=['bold'])

        !rm $file_4_del_path_1
        !rm $file_4_del_path_2

        print(f'The file: "labels.json" successfully copied from {source_color} to {destination_color}')
        print(f'The file: "_annotations.coco.json" successfully deleted from {file_4_del_path_1_c}')
        print(f'The file: "labels.json" successfully deleted from {file_4_del_path_2_c}\n')

In [55]:
copy_annotation_data(dataset_dir)

The file: "labels.json" successfully copied from /content/drive/MyDrive/Datasets/Wildfires_2/test/labels.json to /content/drive/MyDrive/Datasets/Wildfires_2_COCO/test/labels.json
The file: "_annotations.coco.json" successfully deleted from /content/drive/MyDrive/Datasets/Wildfires_2_COCO/test/data/_annotations.coco.json
The file: "labels.json" successfully deleted from /content/drive/MyDrive/Datasets/Wildfires_2_COCO/test/data/labels.json

The file: "labels.json" successfully copied from /content/drive/MyDrive/Datasets/Wildfires_2/train/labels.json to /content/drive/MyDrive/Datasets/Wildfires_2_COCO/train/labels.json
The file: "_annotations.coco.json" successfully deleted from /content/drive/MyDrive/Datasets/Wildfires_2_COCO/train/data/_annotations.coco.json
The file: "labels.json" successfully deleted from /content/drive/MyDrive/Datasets/Wildfires_2_COCO/train/data/labels.json

The 

## Fix image paths in .json files

In [57]:
def change_image_path(dataset_dir):

    """Opens the labels.json file of every split folder in a given directory and
    rewrites the image paths so that they point to the split's data/ folder.

    Args:
        dataset_dir (str): Path to the dataset root directory.

    """
    _, SubFolders, _ = next(os.walk(dataset_dir))

    for folder_name in SubFolders:
        images_path = os.path.join(dataset_dir, folder_name, "data")
        json_path = os.path.join(dataset_dir, folder_name, "labels.json")

        # Open .json
        with open(json_path) as f:
            dataset = json.load(f)

        # Iterate over the image records themselves, so every entry is updated
        # even if the number of files in data/ differs from the number of records.
        for image in dataset['images']:
            filename = image['file_name'].split(sep='/')[-1]
            image['file_name'] = os.path.join(images_path, filename)

        # Overwrite the old .json with the corrected one.
        with open(json_path, "w") as jsonFile:
            json.dump(dataset, jsonFile)

    print(colored('Image paths in .json files changed successfully!', 'green', attrs=['bold']))

In [58]:
dataset_dir = '/content/drive/MyDrive/Datasets/Wildfires_2_COCO'
change_image_path(dataset_dir)

[1m[32mImage paths in .json files changed successfully![0m
