# Dataset Statistics for SynthDet
This example notebook shows how to use datasetinsights to load synthetic datasets generated from the [SynthDet](https://github.com/Unity-Technologies/synthdet) example project and visualize dataset statistics.

In addition to the object bounding boxes, SynthDet produces a mix of built-in and project-specific metrics. Statistics for the `RenderedObjectInfo` metric built into the Perception package are calculated with `datasetinsights.stats.statistics.RenderedObjectInfo`. SynthDet-specific metrics are loaded via `datasetinsights.datasets.unity_perception.Metrics`, and their statistics are calculated directly in this notebook.

## Setup dataset
If the dataset was generated locally, point `data_root` below to the dataset path, replacing the `<GUID>` folder suffix with the one for your dataset.

In [2]:
data_root = "/data/<GUID>"

In [7]:
import pandas as pd
import os
import random
from PIL import Image

import datasetinsights.datasets.unity_perception as sim
import datasetinsights.stats.statistics as stat
from datasetinsights.io import download_manifest, Downloader
from datasetinsights.stats import bar_plot, histogram_plot, rotation_plot
from datasetinsights.stats import plot_bboxes
from datasetinsights.datasets import read_bounding_box_2d

### Unity Simulation [Optional]
If the dataset was generated on Unity Simulation, the following cells can be used to download the metrics needed for dataset statistics.

Provide the `run_execution_id` that generated the dataset, a valid `auth_token`, and the Unity `project_id` in the following cell. The `auth_token` can be generated using the Unity Simulation [CLI](https://github.com/Unity-Technologies/Unity-Simulation-Docs/blob/master/doc/cli.md#usim-inspect-auth).

In [7]:
# data_volume = "/data"   # directory where datasets should be downloaded to and loaded from
# run_execution_id = "ojEawoj"      # Unity Simulation run-execution-id
# auth_token = "xxxx"     # Unity Simulation auth token
# project_id = "xxxx"   # Unity Project ID

# data_root = os.path.join(data_volume, run_execution_id)

Before loading the dataset metadata for statistics, we first download the relevant files from Unity Simulation. Unity Simulation provides a manifest file that lists the file path and a signed URL for each file. `download_manifest()` writes the manifest file to disk, and `Downloader` then uses it to download the metrics and metric definitions.

In [None]:
# manifest_file = os.path.join(data_volume, f"{run_execution_id}.csv")
# download_manifest(run_execution_id, manifest_file, auth_token, project_id)

# dl = Downloader(manifest_file, data_root, use_cache=True)
# dl.download_references()
# dl.download_metrics()
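
To make the manifest idea concrete, here is a minimal sketch of reading such a file with the standard library. The column names (`file_name`, `download_uri`) and the sample rows are assumptions made for illustration only; the actual manifest produced by Unity Simulation may use a different layout, and `Downloader` takes care of all of this in practice.

```python
import csv
import io

# Hypothetical manifest contents: one row per file, with a relative path
# and a signed URL. Column names and rows are illustrative assumptions.
SAMPLE_MANIFEST = """file_name,download_uri
Dataset/metrics_000.json,https://example.com/signed/metrics_000
Dataset/metric_definitions.json,https://example.com/signed/metric_definitions
Dataset/captures_000.json,https://example.com/signed/captures_000
"""


def select_files(manifest_text, keyword):
    """Return (path, url) pairs whose path contains the given keyword."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [
        (row["file_name"], row["download_uri"])
        for row in reader
        if keyword in row["file_name"]
    ]


# Keep only metric-related files, similar in spirit to a metrics-only download.
metric_files = select_files(SAMPLE_MANIFEST, "metric")
for path, url in metric_files:
    print(path, "->", url)
```

Filtering the manifest first is what makes it possible to fetch only the small metric files before committing to the much larger image download.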

## Load dataset metadata
Once the dataset metadata is downloaded, it can be loaded for statistics using `datasetinsights.datasets.unity_perception` (imported above as `sim`). Annotation and metric definitions are loaded into pandas dataframes using `AnnotationDefinitions` and `MetricDefinitions` respectively.

In [None]:
ann_def = sim.AnnotationDefinitions(data_root)
ann_def.table

In [None]:
metric_def = sim.MetricDefinitions(data_root)
metric_def.table
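
As a rough illustration of what "loaded into pandas dataframes" means, the sketch below flattens a list of definition records with `pd.json_normalize`. The record fields are made up for the example, and this is not necessarily how `AnnotationDefinitions` or `MetricDefinitions` are implemented.

```python
import pandas as pd

# Hypothetical definition records; every field name here is an assumption.
definitions = [
    {"id": "def-1", "name": "bounding box", "format": "json",
     "spec": [{"label_id": 1, "label_name": "box_a"}]},
    {"id": "def-2", "name": "segmentation", "format": "PNG",
     "spec": [{"label_id": 1, "label_name": "box_a"},
              {"label_id": 2, "label_name": "box_b"}]},
]

# One row per definition; nested lists stay as Python objects in the cell.
table = pd.json_normalize(definitions)
print(table[["id", "name", "format"]])

# Look up the label spec of a single definition by its id.
spec = table.loc[table["id"] == "def-2", "spec"].iloc[0]
print(pd.DataFrame(spec))
```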

## Built-in Statistics
The following tables and charts are supplied by `datasetinsights.stats.statistics.RenderedObjectInfo` for datasets that include the "rendered object info" metric.

In [None]:
max_samples = 10000          # maximum number of sample points used in histogram plots

rendered_object_info_definition_id = "659c6e36-f9f8-4dd6-9651-4a80e51eabc4"
roinfo = stat.RenderedObjectInfo(data_root=data_root, def_id=rendered_object_info_definition_id)

### Descriptive Statistics

In [None]:
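# Only the last expression of a cell is displayed, so print the capture count explicitly.
print(f"Number of captures: {roinfo.num_captures()}")
roinfo.raw_table.head(3)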

### Total Object Count

In [None]:
total_count = roinfo.total_counts()
total_count

In [None]:
bar_plot(
    total_count, 
    x="label_id", 
    y="count", 
    x_title="Label ID",
    y_title="Count",
    title="Total Object Count in Dataset",
    hover_name="label_name"
)

### Per Capture Object Count

In [None]:
per_capture_count = roinfo.per_capture_counts()
per_capture_count.head(10)

In [None]:
histogram_plot(
    per_capture_count, 
    x="count",  
    x_title="Object Counts Per Capture",
    y_title="Frequency",
    title="Distribution of Object Counts Per Capture",
    max_samples=max_samples
)
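
Conceptually, both the total count and the per-capture count are group-by operations over per-object records. The following sketch reproduces that idea on a toy table. The rows are invented, `capture_id` is an assumed column name, and the real `total_counts()` and `per_capture_counts()` may be computed differently.

```python
import pandas as pd

# Toy per-object records: one row per rendered object in a capture.
# `capture_id` is an assumed column name used only for this example.
raw = pd.DataFrame({
    "capture_id": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "label_id": [1, 1, 2, 1, 3, 2],
    "label_name": ["cereal", "cereal", "soup", "cereal", "tea", "soup"],
    "visible_pixels": [1200, 950, 400, 2100, 310, 780],
})

# Total number of objects per label across the whole dataset.
total = (
    raw.groupby(["label_id", "label_name"])
    .size()
    .reset_index(name="count")
)
print(total)

# Number of objects in each capture, the quantity behind the histogram.
per_capture = raw.groupby("capture_id").size().reset_index(name="count")
print(per_capture)
```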

### Object Visible Pixels

In [None]:
histogram_plot(
    roinfo.raw_table, 
    x="visible_pixels",  
    x_title="Visible Pixels Per Object",
    y_title="Frequency",
    title="Distribution of Visible Pixels Per Object",
    max_samples=max_samples
)
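
`max_samples` caps how many points feed a plot. One plausible way to apply such a cap is random subsampling, sketched below. `subsample` is a hypothetical helper, not a datasetinsights function, and the library may enforce the cap differently.

```python
import pandas as pd


def subsample(df, max_samples, seed=0):
    """Return df unchanged if it is small enough, else a random sample of rows."""
    if len(df) <= max_samples:
        return df
    # Sampling without replacement keeps each original row at most once.
    return df.sample(n=max_samples, random_state=seed)


pixels = pd.DataFrame({"visible_pixels": range(50_000)})
capped = subsample(pixels, max_samples=10_000)
print(len(pixels), "->", len(capped))
```

Capping the sample size keeps interactive plots responsive on large datasets while preserving the overall shape of the distribution.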

## SynthDet Statistics
Metrics specific to the simulation can be loaded using `datasetinsights.datasets.unity_perception.Metrics`.

In [None]:
metrics = sim.Metrics(data_root=data_root)

### Foreground placement info

In [None]:
foreground_placement_info_definition_id = "061e08cc-4428-4926-9933-a6732524b52b"
columns = ("x_rot", "y_rot", "z_rot")

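def read_foreground_placement_info(metrics):
    """Return the foreground placement rotations as x, y, z columns."""
    filtered_metrics = metrics.filter_metrics(foreground_placement_info_definition_id)
    # Each "rotation" entry is a 3-element sequence; expand it into columns.
    combined = pd.DataFrame(filtered_metrics["rotation"].to_list(), columns=columns)

    return combined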

In [None]:
orientation = read_foreground_placement_info(metrics)
orientation.head(10)
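
The key step in `read_foreground_placement_info` is turning a column of rotation triples into three numeric columns. Here is a small standalone sketch of that step with invented values; the toy `filtered` frame merely stands in for the output of `filter_metrics`.

```python
import pandas as pd

columns = ["x_rot", "y_rot", "z_rot"]

# Toy metric values: each entry holds one rotation triple in degrees (invented data).
filtered = pd.DataFrame({
    "rotation": [[10.0, 20.0, 30.0], [45.0, 90.0, 135.0], [0.0, 180.0, 270.0]],
})

# to_list() yields a list of 3-element lists; DataFrame turns each into a row.
orientation = pd.DataFrame(filtered["rotation"].to_list(), columns=columns)
print(orientation)
```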

In [None]:
rotation_plot(
    orientation,
    x="x_rot",
    y="y_rot",
    z="z_rot",
    title="Object orientations",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    orientation, 
    x="x_rot",  
    x_title="Object Rotation (Degree)",
    y_title="Frequency",
    title="Distribution of Object Rotations along X direction",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    orientation, 
    x="y_rot",  
    x_title="Object Rotation (Degree)",
    y_title="Frequency",
    title="Distribution of Object Rotations along Y direction",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    orientation, 
    x="z_rot",  
    x_title="Object Rotation (Degree)",
    y_title="Frequency",
    title="Distribution of Object Rotations along Z direction",
    max_samples=max_samples
)

### Lighting info 

In [None]:
lighting_info_definition_id = "939248ee-668a-4e98-8e79-e7909f034a47"
x_y_columns = ["x_rotation", "y_rotation"]
color_columns = ["color.r", "color.g", "color.b", "color.a"]

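def read_lighting_info(metrics):
    """Combine light rotation angles and RGBA color into one dataframe."""
    filtered_metrics = metrics.filter_metrics(lighting_info_definition_id)
    # Expand each color dict into one column per channel.
    colors = pd.json_normalize(filtered_metrics["color"].to_list())
    # Assumes the color keys arrive in r, g, b, a order.
    colors.columns = color_columns
    # Reset the index so rows align by position with the fresh index of `colors`.
    angles = filtered_metrics[x_y_columns].reset_index(drop=True)
    combined = pd.concat([angles, colors], axis=1)

    return combined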

In [None]:
lighting = read_lighting_info(metrics)
lighting.head(5)
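
Combining the rotation columns with the expanded color columns depends on how pandas aligns rows. The sketch below uses invented values and a deliberately non-contiguous index to show one way of pairing rows by position; whether `filter_metrics` returns such an index is not confirmed here.

```python
import pandas as pd

# Toy lighting metrics (invented values); the index is deliberately not 0..n-1.
filtered = pd.DataFrame(
    {
        "x_rotation": [30.0, 60.0],
        "y_rotation": [-15.0, 45.0],
        "color": [
            {"r": 1.0, "g": 0.9, "b": 0.8, "a": 1.0},
            {"r": 0.7, "g": 0.7, "b": 1.0, "a": 1.0},
        ],
    },
    index=[4, 9],
)

# json_normalize expands each dict into columns and returns a fresh 0..n-1 index.
colors = pd.json_normalize(filtered["color"].to_list()).add_prefix("color.")

# Reset the left-hand index so rows pair up by position, not by label.
angles = filtered[["x_rotation", "y_rotation"]].reset_index(drop=True)
lighting = pd.concat([angles, colors], axis=1)
print(lighting)
```

Using `add_prefix` derives the column names from the dictionary keys, so the result does not depend on the order in which the keys appear.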

In [None]:
rotation_plot(
    lighting,
    x="x_rotation",
    y="y_rotation",
    title="Light orientations",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    lighting, 
    x="x_rotation",  
    x_title="Lighting Rotation (Degree)",
    y_title="Frequency",
    title="Distribution of Lighting Rotations along X direction",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    lighting, 
    x="y_rotation",  
    x_title="Lighting Rotation (Degree)",
    y_title="Frequency",
    title="Distribution of Lighting Rotations along Y direction",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    lighting, 
    x="color.r",  
    x_title="Lighting Color",
    y_title="Frequency",
    title="Distribution of Lighting Color Redness",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    lighting, 
    x="color.g",  
    x_title="Lighting Color",
    y_title="Frequency",
    title="Distribution of Lighting Color Greenness",
    max_samples=max_samples
)

In [None]:
histogram_plot(
    lighting, 
    x="color.b",  
    x_title="Lighting Color",
    y_title="Frequency",
    title="Distribution of Lighting Color Blueness",
    max_samples=max_samples
)

## Images With 2D Bounding Boxes

In this section, we provide sample code to render 2D bounding boxes on top of the captured images.

### Unity Simulation [Optional]
If the dataset was generated on Unity Simulation, the following cells can be used to download the images, captures, and annotations in the dataset. Make sure you have enough disk space to store all files. For example, a dataset with 100K captures requires roughly 300 GiB of storage.

In [None]:
# dl.download_captures()
# dl.download_binary_files()
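
The figure of roughly 300 GiB per 100K captures works out to about 3 MiB per capture. The sketch below scales that figure linearly and compares it with the free space on a volume. Both helper functions are hypothetical, and linear scaling is an assumption: the real size depends on image resolution and on which annotations the dataset contains.

```python
import shutil

GIB = 1024 ** 3


def estimated_size_gib(num_captures, gib_per_100k=300):
    """Scale the ~300 GiB per 100K captures figure linearly (an assumption)."""
    return num_captures / 100_000 * gib_per_100k


def has_room(path, num_captures):
    """Compare the estimate with the free space on the volume holding `path`."""
    free_gib = shutil.disk_usage(path).free / GIB
    return free_gib >= estimated_size_gib(num_captures)


print(f"25K captures need about {estimated_size_gib(25_000):.0f} GiB")
print("Enough room for 25K captures here:", has_room(".", 25_000))
```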

### Captures

In [None]:
bounding_box_definition_id = "c31620e3-55ff-4af6-ae86-884aa0daa9b2"
cap = sim.Captures(data_root)
captures = cap.filter(def_id=bounding_box_definition_id)
captures.head(3)

### Visualize

In [None]:
line_width = 5
resize_scale = 2

def draw_bounding_boxes(filename, annotation):
    filepath = os.path.join(data_root, filename)
    image = Image.open(filepath)
    boxes = read_bounding_box_2d(annotation)
    img_with_boxes = plot_bboxes(image, boxes, box_line_width=line_width)
    print(f"Image: {filename}")
    new_size = (img_with_boxes.width // resize_scale, img_with_boxes.height // resize_scale)
    
    return img_with_boxes.resize(new_size)
    

In [None]:
# pick a row at random, by position
index = random.randrange(len(captures))

# .iloc selects by position, so this works even if the index is not 0..n-1
row = captures.iloc[index]
draw_bounding_boxes(row["filename"], row["annotation.values"])
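
A number drawn with `random.randrange` is a position, which matches the index labels only when the index runs from 0 to n-1. The toy example below shows the difference between label-based and position-based selection on a table whose index has gaps. Whether the table returned by `cap.filter` has such gaps is not confirmed here; the data is invented.

```python
import random

import pandas as pd

# Toy captures table whose index has gaps, as can happen after filtering.
captures = pd.DataFrame(
    {"filename": ["rgb_2.png", "rgb_5.png", "rgb_7.png"]},
    index=[2, 5, 7],
)

position = random.randrange(len(captures))  # 0, 1 or 2

# .iloc selects by position and therefore works for any index.
print(captures.iloc[position]["filename"])

# .loc selects by label; only labels 2, 5 and 7 exist in this table,
# so this membership check tells whether .loc would find the row.
print(position in captures.index)
```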