# Explore the dataset


In this notebook, we will perform an exploratory data analysis (EDA) on the processed Waymo dataset (the data in the `processed` folder). In the first part, you will create a function to display an image and its bounding boxes.

In [None]:
from utils import get_dataset
import numpy as np
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from create_splits import split

In [None]:
dataset = get_dataset("/home/workspace/data/processed/*.tfrecord")

## Write a function to display an image and the bounding boxes

Implement the `display_instances` function below. This function takes a batch as input and displays each image with its corresponding bounding boxes. The only requirement is that the classes should be color coded (e.g., vehicles in red, pedestrians in blue, cyclists in green).

In [None]:
def display_instances(batch):
    """
    This function takes a batch from the dataset and displays each image with
    the associated bounding boxes.
    """
    images = []
    bboxes = []
    classes = []
    
    count = 0 
    
        
    #iterate through the batch and get all the images, bounding boxes and classes
    for data in batch:
        images.append(np.array(data['image'], dtype="uint8"))
        bboxes.append(np.array(data['groundtruth_boxes'], dtype="float32"))
        classes.append(np.array(data['groundtruth_classes'], dtype="int64"))
        count += 1

    # color mapping of classes, assuming the Waymo label ids:
    # 1 = vehicle (red), 2 = pedestrian (green), 4 = cyclist (blue)
    colormap = {1: [1, 0, 0], 2: [0, 1, 0], 4: [0, 0, 1]}

    #create figure and axes
    fig, ax = plt.subplots(5, 2, figsize =(50,50))

    for i in range(count):
        #get x and y for the axes
        x = i % 5
        y = i % 2
        
        #display the image
        ax[x,y].imshow(images[i])
        
        #list of bounding boxes for the current image
        b_box = bboxes[i]
        
        #list of classes for the current image
        b_class = classes[i]
        
        #use the built-in zip function to link the two lists as we iterate
        for cl, bb in zip(b_class, b_box):
            
            #get the image height and width (array shape is height, width, channels)
            h, w, _ = images[i].shape
            
            #get the coordinates of the top left and bottom right of each box
            y1, x1, y2, x2  = bb
            
            #resize the boxes to the current image size.
            x1, x2 = x1 * w, x2 * w
            y1, y2 = y1 * h, y2 * h
                
            #create a rectangle from those points with the edgecolor matching the color set for the class
            rec = patches.Rectangle((x1,y1), x2-x1, y2-y1, fc='none', ec=colormap[cl])
            
            #Add the patch to the current ax
            ax[x,y].add_patch(rec)
            
            #define the interesting area of the image
            ax[x,y].set_xlim([0, 640])
            ax[x,y].set_ylim([640, 0])
            
    plt.tight_layout()
    plt.show()
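
The box conversion used in `display_instances` can be checked in isolation. The sketch below assumes, as the unpacking in the function suggests, that each box is stored as normalized `[ymin, xmin, ymax, xmax]` values in the range 0 to 1. The helper name `to_pixel_box` is hypothetical and is not part of the project code.

```python
def to_pixel_box(box, height, width):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel units.

    Returns (x, y, box_width, box_height), the argument order expected by
    matplotlib.patches.Rectangle((x, y), box_width, box_height).
    Assumption: box coordinates are normalized to the range [0, 1].
    """
    ymin, xmin, ymax, xmax = box
    x = xmin * width    # x values scale with the image width
    y = ymin * height   # y values scale with the image height
    return x, y, (xmax - xmin) * width, (ymax - ymin) * height


# Example: a box covering the central quarter of a 640 x 480 (width x height) image
print(to_pixel_box([0.25, 0.25, 0.75, 0.75], height=480, width=640))
# → (160.0, 120.0, 320.0, 240.0)
```

Keeping the x values tied to the width and the y values tied to the height matters as soon as the images are not square.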


## Display 10 images 

Using the dataset created in the second cell and the function you just coded, display 10 random images with the associated bounding boxes. You can use the methods `take` and `shuffle` on the dataset.

In [None]:
qt = 10

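#shuffle before taking so the images are a random sample (the buffer size of 100 is an arbitrary choice)
batch = dataset.shuffle(100).take(qt)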
display_instances(batch)

## Additional EDA

In this last part, you are free to perform any additional analysis of the dataset. What else would you like to know about the data?
For example, think about data distribution. So far, you have only looked at a single file...
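
One way to look at the data distribution is to count how many objects of each class appear across a sample of records. The following is a minimal sketch, not the project's own code: the function name `count_classes` and the `CLASS_NAMES` mapping are assumptions (the mapping follows the Waymo label ids), and the sample records are synthetic stand-ins for dataset elements.

```python
from collections import Counter

# Assumed mapping from class id to name; adjust it if your label map differs.
CLASS_NAMES = {1: "vehicle", 2: "pedestrian", 4: "cyclist"}


def count_classes(records):
    """Count ground-truth objects per class over an iterable of records.

    Each record is expected to be a dict with a 'groundtruth_classes' entry
    holding the class ids of the objects in one image.
    """
    counts = Counter()
    for record in records:
        for class_id in record["groundtruth_classes"]:
            counts[CLASS_NAMES.get(int(class_id), "other")] += 1
    return counts


# Synthetic records standing in for dataset elements
sample = [
    {"groundtruth_classes": [1, 1, 2]},
    {"groundtruth_classes": [1, 4]},
    {"groundtruth_classes": []},
]
print(count_classes(sample))
```

With the real data, the same function could be called on a few thousand shuffled records, for example by passing an iterator over `dataset.take(...)`. A strong imbalance between classes would be worth knowing about before training.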

The images were taken under a variety of lighting conditions, so we will need to compensate for differences in lightness and brightness with augmentations during training.
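
The observation about lighting can be roughly quantified by summarizing the mean pixel intensity per image. This is only a sketch under simple assumptions: images are RGB `uint8` arrays, mean intensity is a crude proxy for perceived brightness, and the function names `mean_brightness` and `brightness_summary` are hypothetical.

```python
import numpy as np


def mean_brightness(image):
    """Return the mean pixel intensity of a uint8 image, scaled to [0, 1]."""
    return float(np.asarray(image, dtype=np.float32).mean() / 255.0)


def brightness_summary(images):
    """Summarize the per-image mean brightness over an iterable of images."""
    values = np.array([mean_brightness(img) for img in images])
    return {
        "min": float(values.min()),
        "mean": float(values.mean()),
        "max": float(values.max()),
    }


# Synthetic images: one fully dark and one fully bright
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 255, dtype=np.uint8)
summary = brightness_summary([dark, bright])
print(summary)
```

A wide gap between the minimum and maximum values on real images would support adding brightness augmentations during training.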