## Find differences in class names of the two datasets

We define two lists, one per dataset, holding the class labels taken from the corresponding dataset descriptions:

- [CityScapes class description](https://www.cityscapes-dataset.com/dataset-overview/)
- [BDD100K Semantic Segmentation class description](https://doc.bdd100k.com/format.html#semantic-segmentation) 

This notebook only explores the datasets and does not produce anything used elsewhere. The class merging summarized below is implemented in GIT/utils/dataloaders.



In [1]:
CITYSCAPES = [
    'unlabeled', 'ego vehicle', 'rectification border', 'out of roi', 'static', 'dynamic', 'ground',  # Category "void": 0
    'road', 'sidewalk', 'parking', 'rail track',  # Category "flat": 1
    'building', 'wall', 'fence', 'guard rail', 'bridge', 'tunnel',  # Category "construction": 2
    'pole', 'polegroup', 'traffic light', 'traffic sign',  # Category "object": 3
    'vegetation', 'terrain',  # Category "nature": 4
    'sky',  # Category "sky": 5
    'person', 'rider',  # Category "human": 6
    'car', 'truck', 'bus', 'caravan', 'trailer', 'train', 'motorcycle', 'bicycle', 'license plate',  # Category "vehicle": 7
]

BDD100K = [
    'road', 'sidewalk', 'building', 'wall', 'fence',
    'pole', 'traffic light', 'traffic sign',
    'vegetation', 'terrain', 'sky', 
    'person', 'rider',  
    'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle',  
    'unknown',  # Not used for evaluation; has pixel value 255 (white)
]

Print the classes shared by both datasets, followed by the classes that appear in only one of them.

In [2]:
print("Joint classes:")
joint_classes = [j for j in CITYSCAPES if j in BDD100K]
print(joint_classes)

print("\nOnly in BDD100K:")
only_in_BDD100K = [j for j in BDD100K if j not in CITYSCAPES]
print(only_in_BDD100K)

print("\nOnly in CityScapes:")
only_in_cityscapes = [j for j in CITYSCAPES if j not in BDD100K]
print(only_in_cityscapes)


Joint classes:
['road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle']

Only in BDD100K:
['unknown']

Only in CityScapes:
['unlabeled', 'ego vehicle', 'rectification border', 'out of roi', 'static', 'dynamic', 'ground', 'parking', 'rail track', 'guard rail', 'bridge', 'tunnel', 'polegroup', 'caravan', 'trailer', 'license plate']


## Summary: Class labels
#### Issues
- BDD100K groups built structures into a few broad classes, whereas CityScapes adds more specific ones, e.g. guard rail, bridge and tunnel.
- CityScapes has, in general, more fine-grained classes. Some of them can be merged, e.g. "polegroup" into "pole".

#### What do we do?
For CityScapes, merge the following classes:
- Move guard rail to fence class. 
- Move bridge and tunnel to buildings class. 
- Move polegroup to pole class. 
- Move caravan, trailer and license plate to car. 

- Move parking to road, as we consider it an area where the vehicle may drive/park. 
- Move ground to terrain, as it is part of the area where a vehicle is not supposed to drive.  
- Move rail track to terrain, as it is part of the area where a vehicle is not supposed to drive. 

#### Ignoring out-of-scope/unknown classes
Ignore the following CityScapes classes:
- Unlabeled
- Ego Vehicle
- Rectification border
- Out of ROI
- Static
- Dynamic

Ignore the following BDD100K classes:
- Unknown
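
The merge and ignore rules above can be sketched as a simple lookup table. This is a minimal illustration only, not the implementation in the dataloaders; `MERGE_MAP`, `IGNORED` and `to_bdd100k_class` are hypothetical names introduced here.

```python
# Hypothetical sketch of the merge rules listed above; the names below are
# illustrative and are not taken from utils/dataloaders.
MERGE_MAP = {
    'guard rail': 'fence',
    'bridge': 'building',
    'tunnel': 'building',
    'polegroup': 'pole',
    'caravan': 'car',
    'trailer': 'car',
    'license plate': 'car',
    'parking': 'road',
    'ground': 'terrain',
    'rail track': 'terrain',
}

IGNORED = {
    'unlabeled', 'ego vehicle', 'rectification border',
    'out of roi', 'static', 'dynamic',  # CityScapes
    'unknown',  # BDD100K
}


def to_bdd100k_class(name):
    """Return the merged class for a class name, or None if it is ignored."""
    if name in IGNORED:
        return None
    # Classes without a merge rule keep their own name.
    return MERGE_MAP.get(name, name)
```

For example, `to_bdd100k_class('parking')` gives `'road'`, `to_bdd100k_class('sky')` stays `'sky'`, and `to_bdd100k_class('static')` gives `None`.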

## Find the pixel distribution of classes in the two datasets

In [3]:
import os 
import sys 
import numpy as np
from tqdm.notebook import tqdm

sys.path.append("..")
import utils.utils as CU
import utils.dataloaders as CD


### BDD100K 
Set `DATA_DIR` to the location of your BDD100K images. The dataset must already have been converted to CityScapes format.

In [4]:
DATA_DIR = '/mnt/ml-data-storage/jens/BDD100K/images'
LABEL_DIR = DATA_DIR.replace('images', 'labels')
print(LABEL_DIR)

/mnt/ml-data-storage/jens/BDD100K/labels


In [5]:
# Collect sorted image and label paths for each split
training_images = sorted(CU.list_dir_recursive(os.path.join(DATA_DIR, 'train')))
training_labels = sorted(CU.list_dir_recursive(os.path.join(LABEL_DIR, 'train'), '-mask.png'))

validation_images = sorted(CU.list_dir_recursive(os.path.join(DATA_DIR, 'val')))
validation_labels = sorted(CU.list_dir_recursive(os.path.join(LABEL_DIR, 'val'), '-mask.png'))

test_images = sorted(CU.list_dir_recursive(os.path.join(DATA_DIR, 'test')))
test_labels = sorted(CU.list_dir_recursive(os.path.join(LABEL_DIR, 'test'), '-mask.png'))
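
Because images and labels are paired purely by their position in the sorted lists, a quick sanity check can catch a missing or extra file. The sketch below rests on assumptions: `check_pairs` is a hypothetical helper, and it assumes a label file is named like its image with the extension replaced by the label suffix (for example `-mask.png`), which may not match your layout.

```python
import os


def check_pairs(images, labels, label_suffix):
    """Return (image, label) pairs whose base names differ.

    Assumes (hypothetically) that 'abc.jpg' pairs with 'abc' + label_suffix.
    """
    if len(images) != len(labels):
        raise ValueError('{} images but {} labels'.format(len(images), len(labels)))
    mismatches = []
    for img, lbl in zip(images, labels):
        img_stem = os.path.splitext(os.path.basename(img))[0]
        lbl_name = os.path.basename(lbl)
        # Strip the label suffix only if it is actually present.
        if lbl_name.endswith(label_suffix):
            lbl_stem = lbl_name[:-len(label_suffix)]
        else:
            lbl_stem = lbl_name
        if img_stem != lbl_stem:
            mismatches.append((img, lbl))
    return mismatches
```

Under that naming assumption, `check_pairs(training_images, training_labels, '-mask.png')` returns an empty list when every image lines up with its label.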

In [8]:
from torch.utils.data import DataLoader
bdd100k = CD.BDD100K(training_images, training_labels, classes=CD.BDD100K.CLASSES)

In [9]:
# BDD100K images are 720 x 1280 pixels (height x width)
DIM = 720*1280

bdd100k_classes = len(CD.BDD100K.CLASSES)
bdd100k_distribution = np.zeros((bdd100k_classes,1))

for _, lb in tqdm(bdd100k):
    for c in range(bdd100k_classes): 
        bdd100k_distribution[c] += np.sum(lb[:,:,c])/DIM

  0%|          | 0/2428 [00:00<?, ?it/s]

In [22]:
for name, part in zip(BDD100K, bdd100k_distribution): 
    val = np.round(part[0]/len(bdd100k)*100, 3)
    print('{}: {}%'.format(name, val)) 

val = round(100-(np.sum(bdd100k_distribution/len(bdd100k)*100)), 3) 
print('Unlabeled: {}%'.format(val))

road: 20.653%
sidewalk: 2.108%
building: 15.427%
wall: 0.512%
fence: 1.111%
pole: 1.004%
traffic light: 0.268%
traffic sign: 0.412%
vegetation: 11.717%
terrain: 0.784%
sky: 11.645%
person: 0.327%
rider: 0.024%
car: 8.712%
truck: 1.107%
bus: 0.674%
train: 0.013%
motorcycle: 0.018%
bicycle: 0.073%
unknown: 0.0%
Unlabeled: 23.413%
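
To make the computation above concrete, here is a small sketch on a synthetic label. It assumes, as the indexing `lb[:,:,c]` suggests, that each label is a one-hot array of shape (height, width, classes); the toy array and the name `class_shares` are made up for illustration. A pixel that belongs to none of the listed classes has an all-zero vector, which is presumably why the shares need not sum to 100% and the remainder is reported as unlabeled.

```python
import numpy as np


def class_shares(label):
    """Per-class share of pixels (in %) for one one-hot label of shape (H, W, C)."""
    height, width, _ = label.shape
    # Summing over both spatial axes counts the pixels of every class at once.
    return label.sum(axis=(0, 1)) / (height * width) * 100


# Toy 2 x 2 label with 3 classes; the bottom-right pixel has no class at all.
toy = np.zeros((2, 2, 3))
toy[0, 0, 0] = 1  # class 0
toy[0, 1, 0] = 1  # class 0
toy[1, 0, 2] = 1  # class 2

shares = class_shares(toy)
unlabeled = 100 - shares.sum()
print(shares, unlabeled)
```

Here class 0 covers two of four pixels, class 2 covers one, and the remaining pixel ends up in the unlabeled remainder.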



### CityScapes
Repeat the computation for CityScapes.

Set `DATA_DIR` to your own path. The dataset must already have been re-arranged as described in the Data-prep notebook.

In [11]:
DATA_DIR = '/mnt/ml-data-storage/jens/CityScapes/images'
LABEL_DIR = DATA_DIR.replace('images', 'labels')

training_images = sorted(CU.list_dir_recursive(os.path.join(DATA_DIR, 'train')))
training_labels = sorted(CU.list_dir_recursive(os.path.join(LABEL_DIR, 'train'), 'labelIds.png'))

cityscapes = CD.CityScapes(training_images, training_labels, classes=CD.CityScapes.CLASSES)


In [12]:
# CityScapes images are 2048 x 1024 pixels (width x height)
DIM = 1024*2048

cityscapes_classes = len(CD.CityScapes.CLASSES)
cityscapes_distribution = np.zeros((cityscapes_classes,1))

for _, lb in tqdm(cityscapes):
    for c in range(cityscapes_classes): 
        cityscapes_distribution[c] += np.sum(lb[:,:,c])/DIM

  0%|          | 0/2215 [00:00<?, ?it/s]

In [23]:
for name, part in zip(CITYSCAPES, cityscapes_distribution): 
    val = np.round(part[0]/len(cityscapes)*100, 3)
    print('{}: {}%'.format(name, val)) 

val = round(100-(np.sum(cityscapes_distribution/len(cityscapes)*100)), 3) 
print('Unlabeled: {}%'.format(val))

unlabeled: 0.008%
ego vehicle: 4.445%
rectification border: 1.043%
out of roi: 1.508%
static: 1.323%
dynamic: 0.25%
ground: 1.299%
road: 32.538%
sidewalk: 5.453%
parking: 0.544%
rail track: 0.18%
building: 21.2%
wall: 0.674%
fence: 0.809%
guard rail: 0.01%
bridge: 0.119%
tunnel: 0.05%
pole: 1.047%
polegroup: 0.007%
traffic light: 0.174%
traffic sign: 0.472%
vegetation: 14.198%
terrain: 1.017%
sky: 3.382%
person: 1.007%
rider: 0.105%
car: 6.131%
truck: 0.201%
bus: 0.148%
caravan: 0.033%
trailer: 0.016%
train: 0.22%
motorcycle: 0.094%
bicycle: 0.294%
license plate: 0.0%
Unlabeled: 0.0%
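
Since each pixel carries a single label, the share of a merged class is simply the sum of the shares of the classes it absorbs. As a rough sketch, the merge rules from the summary can be applied directly to the percentages printed above. The `GROUPS` name and layout are illustrative only, and the values are copied from the output, so the results inherit its rounding.

```python
# Shares (in %) copied from the CityScapes output above, grouped by the
# merged class they end up in according to the summary's merge rules.
GROUPS = {
    'road': {'road': 32.538, 'parking': 0.544},
    'fence': {'fence': 0.809, 'guard rail': 0.01},
    'building': {'building': 21.2, 'bridge': 0.119, 'tunnel': 0.05},
    'pole': {'pole': 1.047, 'polegroup': 0.007},
    'terrain': {'terrain': 1.017, 'ground': 1.299, 'rail track': 0.18},
    'car': {'car': 6.131, 'caravan': 0.033, 'trailer': 0.016,
            'license plate': 0.0},
}

# Sum the parts of each group to get the share of the merged class.
merged = {target: round(sum(parts.values()), 3)
          for target, parts in GROUPS.items()}

for target, share in merged.items():
    print('{}: {}%'.format(target, share))
```

Most merges barely move the numbers; the largest relative change is terrain, which absorbs ground and rail track.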


In [4]:
DATA_DIR = '/mnt/ml-data-storage/jens/CityScapes/images'
LABEL_DIR = DATA_DIR.replace('images', 'labels')

training_images = sorted(CU.list_dir_recursive(os.path.join(DATA_DIR, 'train')))
training_labels = sorted(CU.list_dir_recursive(os.path.join(LABEL_DIR, 'train'), 'labelIds.png'))

cityscapes = CD.CityScapes_bdd100k_class_merge(training_images, training_labels, classes=CD.CityScapes_bdd100k_class_merge.CLASSES)