# Data Preparation

Conversion of the data into a format that can be used to train models.

## Converting the annotations into csv files

Getting the names of the annotation files. `os.listdir` returns file names only; the directory is prepended later, when each file is parsed.

In [1]:
import os
ANNOTATION_FILES_DIR = 'data/annotations'
annotation_files = os.listdir(ANNOTATION_FILES_DIR)

Defining a function that reads a given annotation file and returns a dictionary containing information about the overall image, plus a list under the key 'objects' with one entry per bounding box in the image. The function takes the name of the annotation file (relative to ANNOTATION_FILES_DIR) as input.

In [2]:
import pandas as pd
import xml.etree.ElementTree as ET

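def read_img_info(annotation_file):
    """Parse one annotation file (a file name inside ANNOTATION_FILES_DIR).

    Returns a dict with the image's filename, width, height and depth, plus an
    'objects' list holding one dict per bounding box.
    """
    img_info = {}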
    tree = ET.parse(os.path.join(ANNOTATION_FILES_DIR, annotation_file))
    root = tree.getroot()
    img_info['filename'] = root.find('filename').text
    img_info['width'] = int(root.find('size').find('width').text)
    img_info['height'] = int(root.find('size').find('height').text)
    img_info['depth'] = int(root.find('size').find('depth').text)

    img_info['objects'] = []

    for obj in root.findall('object'):
        obj_info = {}
        obj_info['label'] = obj.find('name').text
        obj_info['pose'] = obj.find('pose').text
        obj_info['truncated'] = int(obj.find('truncated').text)
        obj_info['occluded'] = int(obj.find('occluded').text)
        obj_info['difficult'] = int(obj.find('difficult').text)
        obj_info['xmin'] = int(obj.find('bndbox').find('xmin').text)
        obj_info['ymin'] = int(obj.find('bndbox').find('ymin').text)
        obj_info['xmax'] = int(obj.find('bndbox').find('xmax').text)
        obj_info['ymax'] = int(obj.find('bndbox').find('ymax').text)
        img_info['objects'].append(obj_info)
    
    return img_info
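
To make the expected input and output concrete, here is a minimal sketch that parses a single annotation from a string. The XML below is a hypothetical example assembled from the tags that `read_img_info` reads (the layout resembles Pascal VOC annotations, which is an assumption, since the notebook does not name the format). It uses `findtext` with a path as a shorthand for the chained `find(...).find(...).text` calls and extracts only a subset of the fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation built from the tags that read_img_info looks up
SAMPLE_XML = """
<annotation>
    <filename>example.png</filename>
    <size><width>400</width><height>225</height><depth>3</depth></size>
    <object>
        <name>With Helmet</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <occluded>0</occluded>
        <difficult>0</difficult>
        <bndbox><xmin>193</xmin><ymin>13</ymin><xmax>265</xmax><ymax>65</ymax></bndbox>
    </object>
</annotation>
"""

# Parse from a string instead of a file
root = ET.fromstring(SAMPLE_XML.strip())

# findtext('size/width') is equivalent to find('size').find('width').text
sample_info = {
    'filename': root.findtext('filename'),
    'width': int(root.findtext('size/width')),
    'height': int(root.findtext('size/height')),
    'objects': [
        {
            'label': obj.findtext('name'),
            'xmin': int(obj.findtext('bndbox/xmin')),
            'ymin': int(obj.findtext('bndbox/ymin')),
            'xmax': int(obj.findtext('bndbox/xmax')),
            'ymax': int(obj.findtext('bndbox/ymax')),
        }
        for obj in root.findall('object')
    ],
}
print(sample_info)
```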

Iterating over all the annotation files, calling the function defined above and storing the results in two dataframes. One dataframe holds one row per image with its overall properties; the other holds one row per bounding box.

In [3]:
image_df = pd.DataFrame(columns=['filename', 'width', 'height', 'depth'])
object_df = pd.DataFrame(columns=['filename', 'label', 'pose', 'truncated', 'occluded', 'difficult', 'xmin', 'ymin', 'xmax', 'ymax'])
image_df.set_index('filename', inplace=True)
for annotation_file in annotation_files:
    img_info = read_img_info(annotation_file)  

    image_df.loc[img_info['filename']] = img_info

    for obj in img_info['objects']:
        obj['filename'] = img_info['filename']
        object_df.loc[len(object_df)] = obj

display(image_df.head())
display(object_df.head())

filename,width,height,depth
BikesHelmets765.png,400,225,3
BikesHelmets759.png,400,267,3
BikesHelmets573.png,499,333,3
BikesHelmets215.png,400,280,3
BikesHelmets201.png,500,398,3


Unnamed: 0,filename,label,pose,truncated,occluded,difficult,xmin,ymin,xmax,ymax
0,BikesHelmets765.png,With Helmet,Unspecified,0,0,0,193,13,265,65
1,BikesHelmets759.png,With Helmet,Unspecified,0,0,0,148,99,172,123
2,BikesHelmets759.png,Without Helmet,Unspecified,0,0,0,230,103,247,124
3,BikesHelmets759.png,Without Helmet,Unspecified,0,0,0,64,102,88,125
4,BikesHelmets759.png,Without Helmet,Unspecified,0,0,0,287,97,304,116
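
Growing a dataframe one row at a time with `.loc` works, but it becomes slow as the number of rows increases. As a possible alternative, the sketch below collects plain dictionaries in lists and constructs each dataframe once. The `parsed` list is a hypothetical stand-in for the output of `read_img_info`, and the `sketch_` names are illustrative only.

```python
import pandas as pd

# Hypothetical parsed annotations in the shape returned by read_img_info
parsed = [
    {'filename': 'a.png', 'width': 400, 'height': 225, 'depth': 3,
     'objects': [{'label': 'With Helmet', 'xmin': 193, 'ymin': 13, 'xmax': 265, 'ymax': 65}]},
    {'filename': 'b.png', 'width': 400, 'height': 267, 'depth': 3,
     'objects': [{'label': 'With Helmet', 'xmin': 148, 'ymin': 99, 'xmax': 172, 'ymax': 123},
                 {'label': 'Without Helmet', 'xmin': 230, 'ymin': 103, 'xmax': 247, 'ymax': 124}]},
]

image_rows, object_rows = [], []
for info in parsed:
    # Image-level fields: everything except the nested list of objects
    image_rows.append({k: v for k, v in info.items() if k != 'objects'})
    # Object-level fields, tagged with the image they belong to
    for obj in info['objects']:
        object_rows.append({'filename': info['filename'], **obj})

# Construct each dataframe once instead of growing it row by row
sketch_image_df = pd.DataFrame(image_rows).set_index('filename')
sketch_object_df = pd.DataFrame(object_rows)
print(sketch_image_df)
print(sketch_object_df)
```

One behavioral difference to keep in mind: assigning with `.loc[filename]` overwrites an existing row that has the same filename, whereas this version keeps duplicate filenames as separate rows.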


In [4]:
print('Initial analysis of the image dataframe:')
display(image_df.describe())
image_df.info()

Initial analysis of the image dataframe:


Unnamed: 0,width,height,depth
count,764.0,764.0,764.0
mean,405.15445,298.59555,3.0
std,72.225562,75.225202,0.0
min,150.0,114.0,3.0
25%,400.0,250.0,3.0
50%,400.0,268.0,3.0
75%,400.0,341.0,3.0
max,600.0,600.0,3.0


<class 'pandas.core.frame.DataFrame'>
Index: 764 entries, BikesHelmets765.png to BikesHelmets2.png
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   width   764 non-null    int64
 1   height  764 non-null    int64
 2   depth   764 non-null    int64
dtypes: int64(3)
memory usage: 23.9+ KB


In [5]:
print('Initial analysis of the object dataframe:')
display(object_df.describe())
display(object_df.describe(include='object'))
object_df.info()

Initial analysis of the object dataframe:


Unnamed: 0,truncated,occluded,difficult,xmin,ymin,xmax,ymax
count,1451.0,1451.0,1451.0,1451.0,1451.0,1451.0,1451.0
mean,0.0,0.0,0.0,1929.013094,317.356995,2331.195038,655.063405
std,0.0,0.0,0.0,18584.887083,3878.562254,21732.409862,6142.225765
min,0.0,0.0,0.0,2.0,0.0,27.0,25.0
25%,0.0,0.0,0.0,123.0,17.5,165.0,68.0
50%,0.0,0.0,0.0,181.0,42.0,225.0,90.0
75%,0.0,0.0,0.0,247.0,72.0,288.0,115.0
max,0.0,0.0,0.0,334800.0,72900.0,355600.0,106800.0


Unnamed: 0,filename,label,pose
count,1451,1451,1451
unique,761,2,1
top,BikesHelmets297.png,With Helmet,Unspecified
freq,11,962,1451


<class 'pandas.core.frame.DataFrame'>
Index: 1451 entries, 0 to 1450
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   filename   1451 non-null   object
 1   label      1451 non-null   object
 2   pose       1451 non-null   object
 3   truncated  1451 non-null   int64 
 4   occluded   1451 non-null   int64 
 5   difficult  1451 non-null   int64 
 6   xmin       1451 non-null   int64 
 7   ymin       1451 non-null   int64 
 8   xmax       1451 non-null   int64 
 9   ymax       1451 non-null   int64 
dtypes: int64(7), object(3)
memory usage: 124.7+ KB
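
The summary statistics above report coordinate maxima (for example, an xmax of 355600) far beyond the largest image dimension of 600 pixels, which suggests that a few annotations are malformed. A minimal sketch of how such boxes could be flagged is shown below. The two small dataframes are hypothetical stand-ins for `image_df` and `object_df`, and the out-of-range row reuses the column maxima purely for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of image_df and object_df
mini_image_df = pd.DataFrame(
    {'width': [400, 400], 'height': [225, 267]},
    index=pd.Index(['a.png', 'b.png'], name='filename'),
)
mini_object_df = pd.DataFrame({
    'filename': ['a.png', 'b.png', 'b.png'],
    'xmin': [193, 148, 334800],
    'ymin': [13, 99, 72900],
    'xmax': [265, 172, 355600],
    'ymax': [65, 123, 106800],
})

# Attach each image's width and height to its bounding boxes
merged = mini_object_df.join(mini_image_df, on='filename')

# A box is considered valid if it is non-empty and lies inside the image
inside = (
    (merged['xmin'] >= 0) & (merged['ymin'] >= 0)
    & (merged['xmin'] < merged['xmax']) & (merged['ymin'] < merged['ymax'])
    & (merged['xmax'] <= merged['width']) & (merged['ymax'] <= merged['height'])
)
suspect_boxes = merged[~inside]
print(suspect_boxes)
```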


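There are no null values in the dataframes. The columns 'truncated', 'occluded' and 'difficult' have a constant value of 0 in object_df, and 'pose' is always 'Unspecified'. These columns carry no information, so they are dropped from the dataframe. Note also that the coordinate columns reach maxima far beyond the largest image dimension of 600 pixels, so a few bounding boxes appear to be malformed.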

In [6]:
object_df.drop(['pose', 'truncated', 'occluded', 'difficult'], axis=1, inplace=True, errors='ignore')
display(object_df.head())

Unnamed: 0,filename,label,xmin,ymin,xmax,ymax
0,BikesHelmets765.png,With Helmet,193,13,265,65
1,BikesHelmets759.png,With Helmet,148,99,172,123
2,BikesHelmets759.png,Without Helmet,230,103,247,124
3,BikesHelmets759.png,Without Helmet,64,102,88,125
4,BikesHelmets759.png,Without Helmet,287,97,304,116
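
The constant columns were identified above by reading the summary tables. The same check can be expressed in code; the following is a small sketch on a hypothetical miniature dataframe (`mini_df` is not part of the notebook) rather than on `object_df` itself.

```python
import pandas as pd

# Hypothetical miniature object dataframe with the same kinds of columns
mini_df = pd.DataFrame({
    'label': ['With Helmet', 'Without Helmet', 'With Helmet'],
    'pose': ['Unspecified', 'Unspecified', 'Unspecified'],
    'truncated': [0, 0, 0],
    'xmin': [193, 148, 230],
})

# A column with a single distinct value carries no information for training
constant_cols = [col for col in mini_df.columns
                 if mini_df[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['pose', 'truncated']

reduced_df = mini_df.drop(columns=constant_cols)
print(list(reduced_df.columns))  # ['label', 'xmin']
```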


Extracting the region of interest for each bounding box from the corresponding image and saving it in the directory 'data/cropped_images'. The name of each file is simply the index of the row plus the extension '.jpg', but we also store it in the column 'CroppedImage' of the dataframe to prevent any confusion. We skip any row whose bounding box yields an empty image. On inspection, this dropped only about 20 rows out of more than 1450.

In [7]:
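import cv2

CROPPED_IMAGES_DIR = 'data/cropped_images'
# cv2.imwrite does not create missing directories, so make sure the target exists
os.makedirs(CROPPED_IMAGES_DIR, exist_ok=True)

for row in object_df.itertuples():
    img = cv2.imread(os.path.join('data/images', row.filename))
    if img is None:  # cv2.imread returns None when the image cannot be read
        continue

    # Crop the bounding box; image arrays are indexed as [rows (y), columns (x)]
    img = img[row.ymin:row.ymax, row.xmin:row.xmax]

    # Save only non-empty crops and record their file names
    if img.shape[0] > 0 and img.shape[1] > 0:
        cv2.imwrite(os.path.join(CROPPED_IMAGES_DIR, f'{row.Index}.jpg'), img)
        object_df.loc[row.Index, 'CroppedImage'] = f'{row.Index}.jpg'

# Rows without a saved crop have NaN in 'CroppedImage' and are dropped here
object_df.dropna(inplace=True)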
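
Why can a bounding box produce an empty image at all? Images loaded with OpenCV are NumPy arrays, and NumPy slicing does not raise an error for out-of-range bounds. The sketch below illustrates this with a blank array standing in for a loaded image; the coordinates are illustrative.

```python
import numpy as np

# Stand-in for an image loaded with cv2.imread: height 225, width 400, 3 channels
img = np.zeros((225, 400, 3), dtype=np.uint8)

# A box inside the image gives a crop of shape (ymax - ymin, xmax - xmin, 3)
good_crop = img[13:65, 193:265]
print(good_crop.shape)  # (52, 72, 3)

# A box that starts beyond the image borders yields an empty array, not an error
empty_crop = img[72900:106800, 334800:355600]
print(empty_crop.shape)  # (0, 0, 3)

# A box that only partly overlaps the image is silently clipped to the image bounds
clipped_crop = img[200:300, 350:450]
print(clipped_crop.shape)  # (25, 50, 3)
```

The last case is worth noting: the emptiness check only catches boxes that miss the image entirely, while partially out-of-range boxes still produce a (smaller) crop.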

We have now obtained the cropped images and generated a dataframe that associates each cropped image with its bounding box and original image. We will now save the dataframes as csv files. While image_df may not be needed when training on the bounding boxes, it may be useful for experimenting without the bounding boxes, or for creating a training-validation split that makes visual inspection of the results easier, since many images contain multiple bounding boxes.

In [8]:
object_df.to_csv('data/objects.csv', index=False)
image_df.to_csv('data/images.csv')
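
As an illustration of the image-level training-validation split mentioned above, here is a minimal sketch that splits by filename, so that all bounding boxes from one image end up on the same side. The dataframe, the 20% validation fraction and the seed are assumptions chosen for the example, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for object_df: several boxes may share one image
demo_df = pd.DataFrame({
    'filename': ['a.png', 'b.png', 'b.png', 'c.png', 'd.png', 'd.png', 'e.png'],
    'label': ['With Helmet', 'With Helmet', 'Without Helmet', 'With Helmet',
              'Without Helmet', 'With Helmet', 'With Helmet'],
})

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible
filenames = demo_df['filename'].unique()
shuffled = rng.permutation(filenames)

# Assumption: hold out roughly 20% of the images (not of the boxes) for validation
n_val = max(1, int(round(0.2 * len(shuffled))))
val_files = set(shuffled[:n_val])

# Every box follows its image into either the training or the validation set
is_val = demo_df['filename'].isin(val_files)
train_df, val_df = demo_df[~is_val], demo_df[is_val]

print(sorted(val_files))
print(len(train_df), len(val_df))
```

Splitting on images rather than on individual boxes means that a validation image can be inspected as a whole, with none of its boxes having been seen during training.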