# Exploratory Data Analysis

This notebook is the basis for the exploratory data analysis in Task 1.1.

The task is as follows:

> Perform an exhaustive dataset analysis to explore the attributes of the TDT4265 dataset. The analysis should highlight commonalities, limitations and perhaps interesting samples in the dataset (both images and labels). You are free to choose how to present the analysis, and the analysis can be both statistical and qualitative. To get you started, you can try to analyze the size of the objects in the dataset.

The following blog posts were used for inspiration:
* [Popular Object Detection datasets - Analysis and Statistics](https://medium.com/@vijayshankerdubey550/popular-object-detection-datasets-analysis-and-statistics-66acdacc3aa9) (Medium article)
* [How to Do Data Exploration for Image Segmentation and Object Detection](https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection) (Neptune blog)
* [How to work with object detection datasets in COCO format](https://towardsdatascience.com/how-to-work-with-object-detection-datasets-in-coco-format-9bf4fb5848a4) (Medium article)
* [Visualizing Object Detections](https://medium.com/voxel51/visualizing-object-detections-9d0ed766297c) (Medium article)

This led to the following points of interest:
* Intra-class variability
* Inter-class variability
* Image size
* Class imbalance
* Object size relative to image size
* Background clutter
* Occlusion

And to the following statistics of interest:
* Number of objects per category
* Number of objects contained in a single image
* Size of the images present in the dataset
* Ratio of object area to image area, for each object in each image
* Overlap between objects

In [50]:
# Settings

# Autoimport changes in code
%load_ext autoreload
%autoreload 2

import sys, os
sys.path.append(os.path.dirname(os.getcwd())) # Include ../SSD in path

# Third party libraries
import numpy as np
import torch
import matplotlib.pyplot as plt
from vizer.draw import draw_boxes
from tops.config import instantiate, LazyConfig
from ssd import utils

# Local libraries
from dataset_exploration.dataset_statistics import (
    statistics,
    analyze_distribution,
    analyze_bounding_boxes,
    get_config,
    get_dataloader,
)

# Set seed
np.random.seed(0)



### Load Dataset

Only the training data will be used in the statistical analysis.

In [56]:
# Only needed when running locally: change into the SSD directory
%cd SSD

/Users/mariu/dev/school/TDT4265/project/computer-vision-and-deep-learning/src/SSD


In [57]:
# Load data

# Load config
config_path = "/Users/mariu/dev/school/TDT4265/project/computer-vision-and-deep-learning/src/SSD/configs/tdt4265.py"  # This is to be used until Task 2.5
cfg = get_config(config_path)

# Get dataloader
dataset_to_explore = "train"
dataloader = get_dataloader(cfg, dataset_to_explore)

print('Dataloader attributes:')
# List the public attributes of the dataloader instead of hard-coding them
print([attr for attr in dir(dataloader) if not attr.startswith('_')])

Saving SSD outputs to: outputs/
Dataloader attributes:
['batch_sampler', 'batch_size', 'check_worker_number_rationality', 'collate_fn', 'dataset', 'drop_last', 'generator', 'multiprocessing_context', 'num_workers', 'persistent_workers', 'pin_memory', 'prefetch_factor', 'sampler', 'timeout', 'worker_init_fn']


In [58]:
dataset = dataloader.dataset

print('Dataset attributes: ')
# List the public attributes of the dataset instead of hard-coding them
print([attr for attr in dir(dataset) if not attr.startswith('_')])

Dataset attributes: 
['annotate_file', 'class_names', 'data', 'get_annotations_as_coco', 'images', 'img_folder', 'img_keys', 'label_info', 'label_map', 'transform']


In [63]:
cfg.data_train.dataset

{'img_folder': 'data/tdt4265_2022', 'transform': '${train_cpu_transform}', 'annotation_file': 'data/tdt4265_2022/train_annotations.json', '_target_': <class 'ssd.data.tdt4265.TDT4265Dataset'>}

### Statistics

Exploring statistics such as:
* Number of objects per category
* Number of objects contained in a single image
* Size of the images present in the dataset
* Ratio of object area to image area, for each object in each image
* Overlap between objects

In [None]:
# Get basic information
label_map = dataset.label_info


In [None]:
# Calculate number of objects per category
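
A minimal sketch of the per-category count, assuming the annotations are available as a COCO-format dictionary with `categories` and `annotations` keys (for example the parsed contents of the annotation file). The helper `count_objects_per_category` and the toy data are hypothetical and not part of the project code.

```python
from collections import Counter


def count_objects_per_category(coco):
    """Return {category name: number of annotated objects} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(ann["category_id"] for ann in coco["annotations"])
    # Keep categories with zero objects so that class imbalance stays visible.
    return {name: counts.get(cat_id, 0) for cat_id, name in id_to_name.items()}


# Toy annotations (hypothetical, not taken from the TDT4265 dataset)
toy_category_data = {
    "categories": [
        {"id": 1, "name": "car"},
        {"id": 2, "name": "person"},
        {"id": 3, "name": "bus"},
    ],
    "annotations": [
        {"image_id": 0, "category_id": 1, "bbox": [0, 0, 10, 10]},
        {"image_id": 0, "category_id": 2, "bbox": [5, 5, 4, 8]},
        {"image_id": 1, "category_id": 1, "bbox": [20, 0, 10, 10]},
    ],
}

objects_per_category = count_objects_per_category(toy_category_data)
print(objects_per_category)
```

Plotting these counts as a bar chart (for instance with `plt.bar`) is one way to inspect class imbalance.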


In [None]:
# Calculate average number of objects contained in each image
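
One possible way to compute the number of objects per image, again assuming a COCO-format dictionary with `images` and `annotations` keys. The function name `objects_per_image` and the toy data are illustrative assumptions.

```python
from collections import Counter

import numpy as np


def objects_per_image(coco):
    """Return an array with the number of annotated objects for every image."""
    counts = Counter(ann["image_id"] for ann in coco["annotations"])
    # Images without any annotation contribute a count of zero.
    return np.array([counts.get(img["id"], 0) for img in coco["images"]])


# Toy data (hypothetical): three images, the last one has no objects
toy_image_data = {
    "images": [{"id": 0}, {"id": 1}, {"id": 2}],
    "annotations": [
        {"image_id": 0, "category_id": 1},
        {"image_id": 0, "category_id": 2},
        {"image_id": 1, "category_id": 1},
    ],
}

per_image = objects_per_image(toy_image_data)
print("mean:", per_image.mean())
print("median:", np.median(per_image))
print("max:", per_image.max())
```

Reporting the median and maximum alongside the mean helps, since a few crowded images can dominate the average.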


In [None]:
# Get image size
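
A small sketch for checking whether all images share the same size, assuming each entry under `images` in the COCO-format dictionary carries `width` and `height` fields. The helper name and the toy sizes are hypothetical.

```python
from collections import Counter


def image_size_distribution(coco):
    """Count how many images share each (width, height) pair."""
    return Counter((img["width"], img["height"]) for img in coco["images"])


# Toy data (hypothetical sizes, for illustration only)
toy_size_data = {
    "images": [
        {"id": 0, "width": 1024, "height": 128},
        {"id": 1, "width": 1024, "height": 128},
        {"id": 2, "width": 512, "height": 128},
    ]
}

size_counts = image_size_distribution(toy_size_data)
for (width, height), n_images in size_counts.most_common():
    print(f"{width}x{height}: {n_images} image(s)")
```

If the counter holds a single key, every image has the same resolution and no per-image normalization of box sizes is needed.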


In [23]:
# Calculate overlap between objects
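
Overlap between objects can be measured with pairwise intersection over union (IoU). The sketch below assumes boxes in `[xmin, ymin, xmax, ymax]` format; COCO-style `[x, y, width, height]` boxes would need converting first. The function `pairwise_iou` is a hypothetical helper written for this example.

```python
import numpy as np


def pairwise_iou(boxes):
    """Return the (N, N) IoU matrix for boxes given as [xmin, ymin, xmax, ymax]."""
    boxes = np.asarray(boxes, dtype=float)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    # Corners of the intersection rectangle for every pair of boxes.
    top_left = np.maximum(boxes[:, None, :2], boxes[None, :, :2])
    bottom_right = np.minimum(boxes[:, None, 2:], boxes[None, :, 2:])
    # Negative extents mean the boxes do not overlap, so clip them to zero.
    wh = np.clip(bottom_right - top_left, 0, None)
    intersection = wh[..., 0] * wh[..., 1]
    union = area[:, None] + area[None, :] - intersection
    # Avoid division by zero for degenerate (zero-area) boxes.
    return np.divide(
        intersection, union, out=np.zeros_like(intersection), where=union > 0
    )


# Toy boxes (hypothetical): the first two overlap, the third is isolated
toy_boxes = [[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30]]
iou = pairwise_iou(toy_boxes)

# Only the upper triangle holds distinct pairs (i < j).
pair_ious = iou[np.triu_indices(len(toy_boxes), k=1)]
print(pair_ious)
```

Computing this per image and collecting the non-zero pair values gives a rough indication of how often objects occlude each other.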


### Object size to image size

This section explores object sizes independently, and also in relation to the image size (1024, 128, 3).

In [None]:
# Calculate ratio of areas of object size and image size, for each object present in each image
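
A minimal sketch of the area ratio, assuming COCO-format annotations where `bbox` is `[x, y, width, height]` in pixels and each image entry has `width` and `height`. The helper `area_ratios` and the toy values are hypothetical.

```python
import numpy as np


def area_ratios(coco):
    """Return the box area divided by the image area for every annotation."""
    image_area = {
        img["id"]: img["width"] * img["height"] for img in coco["images"]
    }
    return np.array(
        [
            ann["bbox"][2] * ann["bbox"][3] / image_area[ann["image_id"]]
            for ann in coco["annotations"]
        ]
    )


# Toy data (hypothetical): one small object and one that fills the image
toy_ratio_data = {
    "images": [{"id": 0, "width": 1024, "height": 128}],
    "annotations": [
        {"image_id": 0, "category_id": 1, "bbox": [10, 20, 64, 32]},
        {"image_id": 0, "category_id": 2, "bbox": [0, 0, 1024, 128]},
    ],
}

ratios = area_ratios(toy_ratio_data)
print(ratios)
```

Because most ratios are expected to be small, a histogram with a logarithmic x-axis is often easier to read than a linear one.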
