# Dataset Exploration

This notebook explores the contents of a drone detection dataset provided via Google Drive.

The focus will be on:
- determining the number of training examples
- determining a train / eval split strategy
- determining a stable indexing strategy (because the dataset is relatively small)
- getting image size and aspect ratio information
- determining a training example to target label mapping strategy

### Imports

In [62]:
from pathlib import Path
from glob import glob
from itertools import chain
import json

### Constants

In [43]:
DATASET_PATH = Path("/mnt/d/drone_data/training-data")

### Top Level Structure

In [44]:
# the dataset is arranged into folders
# sorted() keeps the folder order deterministic; Path.glob order depends on the filesystem
DATASET_FOLDERS = sorted(DATASET_PATH.glob("*_1"))
print(f"The dataset is arranged into {len(DATASET_FOLDERS)} folders")

The dataset is arranged into 18 folders
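
Folder names such as `5_1` and `11_1` sort lexicographically by default, which places `11_1` before `5_1`. If a numeric order is preferred for stable indexing, a small sort key is enough. The sketch below is an illustration only: `folder_sort_key` is a hypothetical helper, and it assumes every folder name consists of integers separated by underscores.

```python
import re

def folder_sort_key(name):
    """Turn a name like '5_1' into (5, 1) so folders sort numerically.

    Hypothetical helper; assumes names contain only integers and underscores.
    """
    return tuple(int(part) for part in re.findall(r"\d+", name))

# A few folder names in the dataset's naming pattern.
names = ["11_1", "5_1", "7_1", "12_1", "6_1"]

lexicographic = sorted(names)                    # default string ordering
numeric = sorted(names, key=folder_sort_key)     # ordering by integer parts

print(lexicographic)
print(numeric)
```

With real paths, the same key can be applied as `sorted(DATASET_FOLDERS, key=lambda p: folder_sort_key(p.name))`.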


### Folder Extension Analysis

In [59]:
# it looks like just .pngs and a .json for labels, confirm:

# collect every file once instead of re-globbing for each statistic
all_files = list(chain.from_iterable(folder.glob("*") for folder in DATASET_FOLDERS))
# use .suffix (the final extension); .suffixes[0] raises IndexError for files without an extension
print(f"The file extensions in each folder are in the set: {set(x.suffix for x in all_files)}")
print(f"There are {len(DATASET_FOLDERS)} dataset folders")
print(f"There are {sum(x.suffix == '.png' for x in all_files)} png files")
print(f"For each folder the count of .json files is in the set of {set(len(list(x.glob('*.json'))) for x in DATASET_FOLDERS)}")


The file extensions in each folder are in the set: {'.json', '.png'}
There are 18 dataset folders
There are 13502 png files
For each folder the count of .json files is in the set of {1}
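
The same check can be expressed as a single tally of extensions, which also gives the per-extension counts in one pass. This is a minimal sketch, not part of the original analysis: `count_suffixes` is a hypothetical helper built on `collections.Counter`, and it counts only regular files directly inside each folder.

```python
from collections import Counter
from pathlib import Path

def count_suffixes(folders):
    """Count files per extension across the given folders (non-recursive).

    Hypothetical helper; files without an extension are counted under ''.
    """
    return Counter(
        path.suffix
        for folder in folders
        for path in Path(folder).iterdir()
        if path.is_file()
    )

# Usage with the notebook's folder list (requires the dataset to be mounted):
# print(count_suffixes(DATASET_FOLDERS))
```

Any unexpected extension, or a file with no extension, appears as an extra key in the result rather than being silently dropped.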


### Folder image count distribution

In [98]:
# count the images once per folder and reuse the result for both reports
counts = {x.name: len(list(x.glob('*.png'))) for x in DATASET_FOLDERS}
for name, count in counts.items():
    print(f"Folder {name} has {count} images")

print(f"The number of images in each folder varies in length from a minimum of {min(counts.values())} to a maximum of {max(counts.values())}")


Folder 11_1 has 487 images
Folder 12_1 has 714 images
Folder 15_1 has 619 images
Folder 25_1 has 818 images
Folder 27_1 has 646 images
Folder 28_1 has 820 images
Folder 33_1 has 790 images
Folder 34_1 has 537 images
Folder 35_1 has 666 images
Folder 36_1 has 728 images
Folder 38_1 has 730 images
Folder 42_1 has 744 images
Folder 43_1 has 679 images
Folder 47_1 has 997 images
Folder 53_1 has 604 images
Folder 5_1 has 1630 images
Folder 6_1 has 774 images
Folder 7_1 has 519 images
The number of images in each folder varies in length from a minimum of 487 to a maximum of 1630
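
These counts can feed a folder-level train / eval split. The sketch below is one possible strategy, not a decision made in this notebook. It assumes that images within a folder are closely related (for example, frames from one recording), so whole folders are held out to avoid leakage between train and eval. `split_by_folder` and the 20% `eval_fraction` are hypothetical; the counts are copied from the output above.

```python
import hashlib

# Image counts per folder, copied from the output above.
FOLDER_COUNTS = {
    "11_1": 487, "12_1": 714, "15_1": 619, "25_1": 818, "27_1": 646,
    "28_1": 820, "33_1": 790, "34_1": 537, "35_1": 666, "36_1": 728,
    "38_1": 730, "42_1": 744, "43_1": 679, "47_1": 997, "53_1": 604,
    "5_1": 1630, "6_1": 774, "7_1": 519,
}

def split_by_folder(counts, eval_fraction=0.2):
    """Hold out whole folders for eval, up to eval_fraction of all images.

    Folders are visited in order of an MD5 hash of their name, so the
    split is reproducible without depending on folder name or size order.
    """
    target = eval_fraction * sum(counts.values())
    ordered = sorted(counts, key=lambda n: hashlib.md5(n.encode()).hexdigest())
    eval_folders, eval_images = [], 0
    for name in ordered:
        # Take the folder only if it keeps eval within the image budget.
        if eval_images + counts[name] <= target:
            eval_folders.append(name)
            eval_images += counts[name]
    train_folders = [n for n in ordered if n not in eval_folders]
    return train_folders, eval_folders

train_folders, eval_folders = split_by_folder(FOLDER_COUNTS)
eval_images = sum(FOLDER_COUNTS[n] for n in eval_folders)
print(f"eval folders: {sorted(eval_folders)}")
print(f"eval images: {eval_images} of {sum(FOLDER_COUNTS.values())}")
```

Because folder sizes range from 487 to 1630 images, the realized eval share will generally fall somewhat below the requested fraction.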


### Label Structure

In [93]:
def analyze_label_folder(folder):
    print(f"the .json file corresponding to {folder}")
    # each folder holds exactly one .json label file; close it after reading
    with next(folder.glob("*.json")).open() as f:
        labels = json.load(f)
    print(f"top level strucutre: {len(labels)}")
    print(f"types at the top level structure {[type(x) for x in labels]}")
    print(f"values at the top level structure {[x for x in labels]}")
    # single quotes inside the f-strings: reusing double quotes is a syntax error before Python 3.12
    print(f"the count of the exist items: {len(labels['exist'])}")
    print(f"the count of the gt_rect items: {len(labels['gt_rect'])}")

    print(f"the value type of exist is: {type(labels['exist'][0])}")
    print(f"existance is in the set: {set(labels['exist'])}")

    print(f"the value type of gt_rect is: {type(labels['gt_rect'][0])}")
    print(f"the lengths of gt_rect lists are in the set: {set(len(x) for x in labels['gt_rect'])}")

for folder in DATASET_FOLDERS:
    analyze_label_folder(folder)
    print("")

the .json file corresponding to /mnt/d/drone_data/training-data/11_1
top level strucutre: 2
types at the top level structure [<class 'str'>, <class 'str'>]
values at the top level structure ['exist', 'gt_rect']
the count of the exist items: 487
the count of the gt_rect items: 487
the value type of exist is: <class 'int'>
existance is in the set: {0, 1}
the value type of gt_rect is: <class 'list'>
the lengths of gt_rect lists are in the set: {4}

the .json file corresponding to /mnt/d/drone_data/training-data/12_1
top level strucutre: 2
types at the top level structure [<class 'str'>, <class 'str'>]
values at the top level structure ['exist', 'gt_rect']
the count of the exist items: 714
the count of the gt_rect items: 714
the value type of exist is: <class 'int'>
existance is in the set: {0, 1}
the value type of gt_rect is: <class 'list'>
the lengths of gt_rect lists are in the set: {4}

the .json file corresponding to /mnt/d/drone_data/training-data/15_1
top level strucutre: 2
types at