# Notebook 01: Data Wrangling

Neptune.ai has a great blog post about exploratory data analysis for image segmentation datasets (https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection). They suggest:
1. Check general data quality
    1. Inspect it visually
    2. Make sure it's not corrupt and doesn't contain any obvious artifacts like black-only images
    3. Make sure that **all** files are readable
2. Get statistics on image sizes and aspect ratios
    1. Should resizing be done?
3. Are there class imbalances?
    1. Consider adding weights to the loss (i.e. making the contribution of some pixels higher)
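
Part of step 1 can be automated. The sketch below is a minimal example, assuming the images are already loaded as NumPy arrays (with `None` standing in for a file that failed to load). The helper name `find_suspect_images` and the `demo` data are hypothetical and only for illustration.

```python
import numpy as np

def find_suspect_images(images):
    """Return (index, reason) pairs for images that look unusable."""
    suspects = []
    for i, img in enumerate(images):
        if img is None:                 # the loader could not read the file
            suspects.append((i, 'unreadable'))
        elif img.size == 0:             # file was read but holds no pixels
            suspects.append((i, 'empty'))
        elif img.min() == img.max():    # a single value everywhere, e.g. black-only
            suspects.append((i, 'constant'))
    return suspects

# made-up data: a black image, an unreadable file, and a normal image
demo = [
    np.zeros((4, 4, 3), dtype=np.uint8),
    None,
    np.arange(48, dtype=np.uint8).reshape(4, 4, 3),
]
print(find_suspect_images(demo))  # [(0, 'constant'), (1, 'unreadable')]
```

This check does not replace looking at the thumbnails; it only catches the most obvious failures.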

## Imports

In [1]:
import os
import cv2
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from collections import Counter
from easyimages import EasyImageList

## Functions

In [2]:
# function to load image data into a list
def load_images_from_folder(folder_path):
    image_dataset = []
    image_size = []
    image_aspect_ratio = []
    # sort the file names so images and masks loaded in separate calls stay in the same order
    for image_file in tqdm(sorted(os.listdir(folder_path))):
        image_data = cv2.imread(os.path.join(folder_path, image_file))
        if image_data is None:  # cv2.imread returns None instead of raising when a file cannot be read
            raise IOError(f'Could not read image: {image_file}')
        image_size.append(image_data.shape)
        image_aspect_ratio.append(image_data.shape[1] / image_data.shape[0])  # width / height
        # downsizing images saves memory and time; nearest-neighbor interpolation prevents blending colors into values that are not valid mask labels
        image_data = cv2.resize(image_data, (192, 128), interpolation=cv2.INTER_NEAREST)
        image_data = cv2.cvtColor(image_data, cv2.COLOR_BGR2RGB)  # OpenCV loads color channels as BGR and mask_labels are RGB
        image_dataset.append(image_data)
    return image_dataset, image_size, image_aspect_ratio

## 1st Dataset: VDD: Varied Drone Dataset for Semantic Segmentation

### Download Data

In [3]:
# requires manual download from https://vddvdd.com/

### Visualize Images

In [4]:
# view thumbnails of the training images
Li = EasyImageList.from_folder(r"C:\Users\Kaitlin\Documents\GitHub\Semantic-Segmentation-of-Aerial-Cityscapes\data\VDD\train\src")
Li.symlink_images()
Li.html(size=44)

In [5]:
# view thumbnails of the training masks
Li = EasyImageList.from_folder(r"C:\Users\Kaitlin\Documents\GitHub\Semantic-Segmentation-of-Aerial-Cityscapes\data\VDD\train\gt")
Li.symlink_images()
Li.html(size=44)

The ground truth (mask) images for the VDD dataset display as black images. Perhaps the RGB values are between 0 and 1 rather than 0 and 255. Let's load the images to investigate.

In [6]:
# import training images
image_dataset, image_size, image_aspect_ratio = load_images_from_folder('../data/VDD/train/src/')

100%|████████████████████████████████████████████████████████████████████████████████| 280/280 [00:26<00:00, 10.58it/s]


In [7]:
# import training masks
mask_dataset, mask_size, mask_aspect_ratio = load_images_from_folder('../data/VDD/train/gt/')

100%|████████████████████████████████████████████████████████████████████████████████| 280/280 [00:13<00:00, 21.51it/s]


In [8]:
# convert image datasets to numpy arrays
image_dataset = np.array(image_dataset).astype('int')
mask_dataset = np.array(mask_dataset).astype('int')

In [9]:
# check dataset shape
print(image_dataset.shape)
print(mask_dataset.shape)

# view sample image and mask
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9,4.5))
axs[0].imshow(image_dataset[0])
axs[1].imshow(mask_dataset[0])
plt.show()

(280, 128, 192, 3)
(280, 128, 192, 3)




The mask appears black. Let's investigate.

### Check Image Size and Aspect Ratio

In [10]:
# image size
print(np.unique(image_size))
print(np.unique(mask_size))

[   3 3000 4000]
[   3 3000 4000]
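
Note that `np.unique` flattens the list of shape tuples, so the output above mixes heights, widths and channel counts into one sorted list. A minimal sketch that keeps each shape together and counts how often it occurs, using `axis=0` and `return_counts=True`, is shown below. The `shapes` list is made-up illustration data, not this notebook's output.

```python
import numpy as np

# made-up (height, width, channels) tuples, as returned by image.shape
shapes = [(3000, 4000, 3), (3000, 4000, 3), (4000, 6000, 3)]

# axis=0 treats each row (one shape) as a single item
unique_shapes, counts = np.unique(np.array(shapes), axis=0, return_counts=True)
for shape, count in zip(unique_shapes, counts):
    print(tuple(int(v) for v in shape), int(count))
```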


In [11]:
# image aspect ratio
print(np.unique(image_aspect_ratio))
print(np.unique(mask_aspect_ratio))

[1.33333333]
[1.33333333]


In [12]:
# check unique values to inspect "black" mask
np.unique(mask_dataset)

array([0, 1, 2, 3, 4, 5, 6])

There appear to be 7 unique values for the mask colors. To visualize these values, let's rescale them from the range 0 to 6 to the range 0 to 255.

In [13]:
# scale mask rgb
mask_dataset_scaled = (mask_dataset * 255 / 6).astype('int')

In [14]:
# view sample image and mask
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9,4.5))
axs[0].imshow(image_dataset[0])
axs[1].imshow(mask_dataset_scaled[0])
plt.show()



In [15]:
# view sample image and mask
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9,4.5))
axs[0].imshow(image_dataset[10])
axs[1].imshow(mask_dataset_scaled[10])
plt.show()



In [16]:
# view sample image and mask
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9,4.5))
axs[0].imshow(image_dataset[190])
axs[1].imshow(mask_dataset_scaled[190])
plt.show()



OK, great. After viewing a few image-mask comparisons, it's clear that there is a logical mask within the black images, although the masks are encoded in grayscale instead of RGB.
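
Rescaling to gray levels works, but neighboring class values can still be hard to tell apart. As an alternative sketch, a single-channel label mask can be mapped to distinct colors with a lookup table. The `palette` colors below are arbitrary choices for illustration, not official VDD colors, and `label_mask` is a tiny made-up example.

```python
import numpy as np

# one RGB color per class value 0..6 (arbitrary, illustration only)
palette = np.array([
    [0, 0, 0],
    [230, 25, 75],
    [60, 180, 75],
    [0, 130, 200],
    [255, 225, 25],
    [145, 30, 180],
    [245, 130, 48],
], dtype=np.uint8)

# made-up 2x3 mask holding class values
label_mask = np.array([[0, 1, 2],
                       [4, 5, 6]])

# integer-array indexing looks up one palette row per pixel
color_mask = palette[label_mask]
print(color_mask.shape)  # (2, 3, 3)
```

In this notebook, one channel of a mask such as `mask_dataset[0][..., 0]` could play the role of `label_mask`, assuming all three channels hold the same value.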

### Check Pixel Label Distribution

The website for VDD claims the images are labeled into 7 categories: Wall, Roof, Road, Water, Vehicle, Vegetation and Others. Unfortunately, I cannot find information on which color values denote which class, so I will define the class labels myself.

In [17]:
# label grayscale rgb values according to the 7 categories
# NOTE: these need to be updated
mask_labels_tuple = pd.DataFrame({'name':['Wall', 'Roof', 'Road', 'Water', 'Vehicle', 'Vegetation', 'Others'],
                            'rgb':[(0,0,0), (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5), (6,6,6)]})
mask_labels_tuple

Unnamed: 0,name,rgb
0,Wall,"(0, 0, 0)"
1,Roof,"(1, 1, 1)"
2,Road,"(2, 2, 2)"
3,Water,"(3, 3, 3)"
4,Vehicle,"(4, 4, 4)"
5,Vegetation,"(5, 5, 5)"
6,Others,"(6, 6, 6)"


In [18]:
# reshape the mask dataset matrix to a 1-dimensional vector
mask_vector = mask_dataset.reshape(-1, order='C')
mask_vector.shape

(20643840,)

In [19]:
# initialize an array with the length of the number of pixels in this dataset
pixel_classes = np.full(int((np.prod(mask_dataset.shape))/3), fill_value = -1, dtype = int)

# determine each pixel's mask class
unmasked_pixels_i = []
for i in tqdm(range(int(mask_vector.shape[0] / 3))):
    if np.any(mask_labels_tuple['rgb']==tuple(mask_vector[3*i:3*i+3])):
        pixel_classes[i] = mask_labels_tuple[mask_labels_tuple['rgb']==tuple(mask_vector[3*i:3*i+3])].index.values[0]
    else:
        unmasked_pixels_i.append(i)

100%|██████████████████████████████████████████████████████████████████████| 6881280/6881280 [14:11<00:00, 8077.11it/s]
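
The loop above compares one pixel at a time and takes about 14 minutes. Because the labels defined here are gray values whose three channels are equal, a faster sketch is to read the class index from a single channel and count with `np.bincount`. This assumes the equal-channel property holds for every pixel (the code checks it). `toy_masks` is made-up data standing in for `mask_dataset`.

```python
import numpy as np

# made-up stand-in for mask_dataset: 1 mask, 2x2 pixels, 3 channels
toy_masks = np.array([[[[0, 0, 0], [1, 1, 1]],
                       [[5, 5, 5], [5, 5, 5]]]])

# the shortcut is only valid if all three channels agree everywhere
channels_equal = (
    (toy_masks[..., 0] == toy_masks[..., 1]).all()
    and (toy_masks[..., 0] == toy_masks[..., 2]).all()
)
assert channels_equal

# the class index is simply the value of any one channel
pixel_classes_fast = toy_masks[..., 0].reshape(-1)

# count pixels per class; minlength=7 keeps absent classes as zeros
class_counts = np.bincount(pixel_classes_fast, minlength=7)
print(class_counts)  # [1 1 0 0 0 2 0]
```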


In [20]:
# which mask classes are present in this dataset?
np.unique(pixel_classes)

array([0, 1, 2, 3, 4, 5, 6])

In [22]:
# prepare data for plotting
pixel_classes_value_counts = pd.Series(Counter(pixel_classes))
pixel_classes_value_counts.sort_values(ascending=False, inplace=True)
x = range(len(pixel_classes_value_counts.index))  # locations of tick labels
heights = pixel_classes_value_counts.values  # heights of bars
tick_labels = list(mask_labels_tuple.loc[pixel_classes_value_counts.index]['name'])  # tick labels

In [23]:
# do the plotting
fig, ax = plt.subplots()
ax.bar(x=x, height=heights)#, tick_label=tick_labels)
ax.set_title('Pixel Class Distribution')
ax.set_xticks(x)
ax.set_xticklabels(tick_labels, rotation='vertical')
ax.set_ylabel('# of occurrences')
plt.show()



There is a degree of class imbalance here, although given the nature of what is being masked, these classes are not uniformly distributed in the real world either.

## 2nd Dataset: Semantic Drone Dataset

### Download Data

In [24]:
# requires manual download from https://www.tugraz.at/index.php?id=22387

### Visualize Images

In [25]:
Li = EasyImageList.from_folder(r"C:\Users\Kaitlin\Documents\GitHub\Semantic-Segmentation-of-Aerial-Cityscapes\data\semantic_drone_dataset\training_set\images")
Li.symlink_images()
Li.html(size=44)

In [26]:
Li = EasyImageList.from_folder(r"C:\Users\Kaitlin\Documents\GitHub\Semantic-Segmentation-of-Aerial-Cityscapes\data\semantic_drone_dataset\training_set\gt\semantic\label_images")
Li.symlink_images()
Li.html(size=44)

There does not appear to be anything wrong with the above 800 images. All are readable and contain an image with an appropriate mask.

### Load Data

In [27]:
# import training images
image_dataset, image_size, image_aspect_ratio = load_images_from_folder('../data/semantic_drone_dataset/training_set/images/')

100%|████████████████████████████████████████████████████████████████████████████████| 400/400 [01:27<00:00,  4.58it/s]


In [28]:
# import training masks
mask_dataset, mask_size, mask_aspect_ratio = load_images_from_folder('../data/semantic_drone_dataset/training_set/gt/semantic/label_images/')

100%|████████████████████████████████████████████████████████████████████████████████| 400/400 [00:59<00:00,  6.70it/s]


In [29]:
# convert image datasets to numpy arrays
image_dataset = np.array(image_dataset).astype('int')
mask_dataset = np.array(mask_dataset).astype('int')

### Sanity Check

In [30]:
# check dataset shape
print(image_dataset.shape)
print(mask_dataset.shape)

# view sample image and mask
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(9,4.5))
axs[0].imshow(image_dataset[0])
axs[1].imshow(mask_dataset[0])
plt.show()

(400, 128, 192, 3)
(400, 128, 192, 3)




1. image corresponds to mask
2. 400 images, 192 by 128 resolution, 3 channels

### Check Image Size and Aspect Ratio

In [31]:
# image size
print(np.unique(image_size))
print(np.unique(mask_size))

[   3 4000 6000]
[   3 4000 6000]


In [32]:
# image aspect ratio
print(np.unique(image_aspect_ratio))
print(np.unique(mask_aspect_ratio))

[1.5]
[1.5]


All 400 images have 3 channels (RGB) and a resolution of 6000 by 4000 pixels, which gives an aspect ratio of 1.5.

### Check Pixel Label Distribution

In [33]:
# get mask labels
mask_labels = pd.read_csv('../data/semantic_drone_dataset/training_set/gt/semantic/class_dict.csv')
mask_labels_tuple = mask_labels.copy()
mask_labels_tuple['rgb'] = list(zip(mask_labels[' r'], mask_labels[' g'], mask_labels[' b']))
mask_labels_tuple.drop(columns=[' r', ' g', ' b'], inplace=True)
print(mask_labels)

           name    r    g    b
0     unlabeled    0    0    0
1    paved-area  128   64  128
2          dirt  130   76    0
3         grass    0  102    0
4        gravel  112  103   87
5         water   28   42  168
6         rocks   48   41   30
7          pool    0   50   89
8    vegetation  107  142   35
9          roof   70   70   70
10         wall  102  102  156
11       window  254  228   12
12         door  254  148   12
13        fence  190  153  153
14   fence-pole  153  153  153
15       person  255   22   96
16          dog  102   51    0
17          car    9  143  150
18      bicycle  119   11   32
19         tree   51   51    0
20    bald-tree  190  250  190
21    ar-marker  112  150  146
22     obstacle    2  135  115
23  conflicting  255    0    0


In [34]:
print(mask_labels_tuple)

           name              rgb
0     unlabeled        (0, 0, 0)
1    paved-area   (128, 64, 128)
2          dirt     (130, 76, 0)
3         grass      (0, 102, 0)
4        gravel   (112, 103, 87)
5         water    (28, 42, 168)
6         rocks     (48, 41, 30)
7          pool      (0, 50, 89)
8    vegetation   (107, 142, 35)
9          roof     (70, 70, 70)
10         wall  (102, 102, 156)
11       window   (254, 228, 12)
12         door   (254, 148, 12)
13        fence  (190, 153, 153)
14   fence-pole  (153, 153, 153)
15       person    (255, 22, 96)
16          dog     (102, 51, 0)
17          car    (9, 143, 150)
18      bicycle    (119, 11, 32)
19         tree      (51, 51, 0)
20    bald-tree  (190, 250, 190)
21    ar-marker  (112, 150, 146)
22     obstacle    (2, 135, 115)
23  conflicting      (255, 0, 0)


In [35]:
# reshape the mask dataset matrix to a 1-dimensional vector
mask_vector = mask_dataset.reshape(-1, order='C')
mask_vector.shape

(29491200,)

In [36]:
# initialize an array with the length of the number of pixels in this dataset
pixel_classes = np.full(int((np.prod(mask_dataset.shape))/3), fill_value = -1, dtype = int)

# determine each pixel's mask class
unmasked_pixels_i = []
for i in tqdm(range(int(mask_vector.shape[0] / 3))):
    if np.any(mask_labels_tuple['rgb']==tuple(mask_vector[3*i:3*i+3])):
        pixel_classes[i] = mask_labels_tuple[mask_labels_tuple['rgb']==tuple(mask_vector[3*i:3*i+3])].index.values[0]
    else:
        unmasked_pixels_i.append(i)

100%|██████████████████████████████████████████████████████████████████████| 9830400/9830400 [21:35<00:00, 7588.17it/s]
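
The loop above takes over 21 minutes. A vectorized sketch for RGB labels is to pack each color into one integer and look it up in a sorted table with `np.searchsorted`. The helper `encode_rgb` is hypothetical, `class_colors` holds only the first three rows of `class_dict.csv`, and `toy_mask` is made-up data. Pixels that match no class get -1, as in the loop.

```python
import numpy as np

def encode_rgb(rgb):
    """Pack the last axis (r, g, b) into a single integer per pixel."""
    rgb = np.asarray(rgb, dtype=np.int64)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

# first three classes: unlabeled, paved-area, dirt
class_colors = np.array([[0, 0, 0], [128, 64, 128], [130, 76, 0]])

# made-up 2x2 mask; the last pixel matches no class
toy_mask = np.array([[[128, 64, 128], [0, 0, 0]],
                     [[130, 76, 0], [1, 2, 3]]])

codes = encode_rgb(class_colors)
order = np.argsort(codes)          # class index for each sorted code
sorted_codes = codes[order]

pixel_codes = encode_rgb(toy_mask).reshape(-1)
pos = np.searchsorted(sorted_codes, pixel_codes)
pos = np.clip(pos, 0, len(sorted_codes) - 1)   # keep indices in range
matched = sorted_codes[pos] == pixel_codes     # exact color matches only

pixel_classes_fast = np.where(matched, order[pos], -1)
print(pixel_classes_fast)  # [ 1  0  2 -1]
```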


In [37]:
# which mask classes are present in this dataset?
np.unique(pixel_classes)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22])

One mask class is not represented in the dataset: 'conflicting'.
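
Instead of comparing the printed array against the label table by eye, the missing class indices can be computed with `np.setdiff1d`. This is a small sketch: `all_class_ids` assumes the 24 rows of `class_dict.csv`, and `present_ids` reproduces the output printed above.

```python
import numpy as np

all_class_ids = np.arange(24)   # rows 0..23 of the label table
present_ids = np.arange(23)     # classes found in the masks (0..22)

# values in the first array that are absent from the second
missing_ids = np.setdiff1d(all_class_ids, present_ids)
print(missing_ids)  # [23]
```

Index 23 is the 'conflicting' row of the label table.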

In [38]:
# prepare data for plotting
pixel_classes_value_counts = pd.Series(Counter(pixel_classes))
pixel_classes_value_counts.sort_values(ascending=False, inplace=True)
x = range(len(pixel_classes_value_counts.index))  # locations of tick labels
heights = pixel_classes_value_counts.values  # heights of bars
tick_labels = list(mask_labels.loc[pixel_classes_value_counts.index]['name'])  # tick labels

In [39]:
# do the plotting
fig, ax = plt.subplots()
ax.bar(x=x, height=heights)#, tick_label=tick_labels)
ax.set_title('Pixel Class Distribution')
ax.set_xticks(x)
ax.set_xticklabels(tick_labels, rotation='vertical')
ax.set_ylabel('# of occurrences')
plt.show()



It is very clear that there is a class imbalance in the mask categories of this dataset. When modeling I can provide a per-class accuracy as well as add weights to the losses to aid in balancing the predictions if it appears to be a problem.
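
One common way to derive loss weights is inverse class frequency. The sketch below uses the "balanced" heuristic, weight = total / (number of present classes * class count), which is an assumption on my part and not something this notebook has settled on. The function name `inverse_frequency_weights` is hypothetical and `toy_counts` is made-up data, not the counts plotted above.

```python
import numpy as np

def inverse_frequency_weights(counts):
    """Weight each class by total / (n_present_classes * count); absent classes get 0."""
    counts = np.asarray(counts, dtype=float)
    weights = np.zeros_like(counts)
    present = counts > 0   # avoid dividing by zero for absent classes
    weights[present] = counts.sum() / (present.sum() * counts[present])
    return weights

toy_counts = [600, 300, 100, 0]   # made-up pixel counts per class
class_weights = inverse_frequency_weights(toy_counts)
print(class_weights)
```

Rare classes get weights above 1 and frequent classes get weights below 1, so each present class contributes equally to the weighted total.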

### Semantic Labelling with Video Support

This dataset will not be used initially. I plan to introduce video only after image segmentation is successful.

In [40]:
# requires manual download from https://uavid.nl/

## Conclusion

The two datasets that will be used initially for semantically segmenting drone footage are the two static image datasets:
1. Varied Drone Dataset for Semantic Segmentation
2. Semantic Drone Dataset

The aspect ratio is 1.33 for VDD and 1.5 for the Semantic Drone Dataset. The default resolution is 4k by 3k for VDD and 6k by 4k for the Semantic Drone Dataset. Both are high-resolution imagery and will need to be downsized to lower model training time. I plan to study how fixed versus varying aspect ratios, and the amount of downsizing, affect model accuracy.
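
The loader in this notebook resizes every image to 192 by 128, an aspect ratio of 1.5. That matches the Semantic Drone Dataset but stretches the VDD images, whose ratio is 4000 / 3000 ≈ 1.33. As a starting point for the aspect ratio study, here is a minimal sketch that computes a target size preserving the original ratio. The function name `fit_width` is hypothetical.

```python
def fit_width(original_width, original_height, target_width):
    """Return (width, height) with the requested width and the original aspect ratio."""
    target_height = round(target_width * original_height / original_width)
    return target_width, target_height

print(fit_width(4000, 3000, 192))  # VDD: (192, 144)
print(fit_width(6000, 4000, 192))  # Semantic Drone Dataset: (192, 128)
```

The returned tuple is in (width, height) order, which is the order `cv2.resize` expects for its size argument.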

The imagery is masked into categories, also referred to as pixel classes. The classes are not uniformly distributed, so there is a degree of class imbalance. However, real imagery may never contain balanced classes, so this distribution may be the best representation of what the model will see in new "real-world" imagery.

Another study will be to see how well a model performs on the independent datasets versus the combination of datasets.

This is the first draft of a data wrangling notebook for this project and will likely be updated during the course of the project.