**Cloning Repo**

In [None]:
!rm -rf visiope_project > /dev/null
!git clone https://github.com/lavallone/VISIOPE_project.git visiope_project

# Data processing 🔨

<a href="https://imgur.com/2kk27xI"><img src="https://i.imgur.com/2kk27xI.png" title="source: imgur.com" width=150 height=130/></a> <a href="https://imgur.com/ZJTgfgb"><img src="https://i.imgur.com/ZJTgfgb.png" title="source: imgur.com" width=170 height=90/></a> <a href="https://imgur.com/fO4AaCZ"><img src="https://i.imgur.com/fO4AaCZ.png" title="source: imgur.com" width=150 height=130/></a>

> ***Why data processing?***

The goal of my project is to implement an object detection network that performs well in terms of both accuracy and inference time in a particular scenario: *the streets of Rome*. Due to the lack of annotated datasets of Italian cities for this task, the only solution was to combine several datasets from the autonomous driving field in order to expose the model to as many situations as possible.

> ***Which dataset should we use?*** 

In recent years the datasets dealing with *autonomous driving* have grown drastically. However, most of the images within a single dataset share similar scene content (because they come from the same video clips), which makes a model prone to overfitting. To build a model that generalizes well, it is therefore necessary to use multiple datasets. To tackle this issue and improve the detector, I decided to use:
- **Waymo** Open Dataset: very large and carefully collected. The company has been working on autonomous vehicle technology for over a decade and is considered a leader in the field. Since it operates in San Francisco, the environments are all similar, so the dataset alone is not suited to generalizing to other settings.
- **BDD100K** (Berkeley DeepDrive 100K) is a large-scale diverse dataset for self-driving cars and computer vision research. It contains over 100,000 high-resolution video clips, capturing various driving scenarios in different weather conditions, times of day, and lighting levels. The dataset includes annotations for various objects, such as vehicles and pedestrians. The quality of the videos, however, is not great.
- **Argoverse-HD** was created by Argo AI, a self-driving vehicle company, and is designed to be used for training and evaluating autonomous vehicle algorithms. It contains fewer sample images than the other two datasets. <br>*(it was "created" for the Streaming Perception Challenge 2021, in which the YOLOX architecture won first prize)*


> ***Which classes?***

This is a very important question to answer. The three datasets have distinct annotated categories, so it's necessary to decide which ones will be the final classes for this project. I finally settled on only three classes: **VEHICLE**, **PERSON** and **MOTORBIKE**. I didn't consider bicycles because they are very rare in Rome, and for the purpose of a possible *car warning system* it made more sense to detect people riding bicycles as PERSON. Still keeping in mind the "real" scenario in which I want to test my model, I decided to treat the motorbike as a standalone class, since the city is full of them and I'd like to detect them separately from cars!



> <a href="https://imgur.com/hs272pm"><img src="https://i.imgur.com/hs272pm.png" title="source: imgur.com" width=15 height=15/></a> The images and annotations of the datasets are stored in *Google Drive* because they are too heavy to keep on my local machine.

> 🤯 The overall process took a very long time to complete, because of the slow upload/download speed of Google Drive and the huge amount of data I dealt with!

## **1** - Downloading raw data
`downloading phase`


### Waymo <a href="https://imgur.com/2kk27xI"><img src="https://i.imgur.com/2kk27xI.png" title="source: imgur.com" width=20 height=20/></a>

Instead of downloading the dataset to my local machine, uploading it to Google Drive and finally mounting the Drive in Colab to access the data, I download it directly from the Google Cloud buckets to my Colab machine.

In order to do that, I first have to authenticate with GCP.

In [None]:
!gcloud auth login

This command will ask you to run another command on a machine where a web browser can be launched (your local one).

Do it and follow the instructions; after that you can run the copy commands.


In [None]:
# training files
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/training/training_0000.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/training/training_0001.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/training/training_0002.tar /content

In [None]:
# testing files
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/testing/testing_0000.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/testing/testing_0001.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/testing/testing_0002.tar /content

In [None]:
# validation files
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/validation/validation_0000.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/validation/validation_0001.tar /content
!gsutil cp gs://waymo_open_dataset_v_1_3_2/archived_files/validation/validation_0002.tar /content

In [None]:
# extracting the files to Google Drive takes longer, but this way they are saved 'permanently'!
# NOTE: "000x" is a placeholder, replace x with the shard index (0, 1, 2) and run once per shard
!tar -xvf "/content/training_000x.tar" -C "datasets/Waymo/images/tfrecord"
!rm "/content/training_000x.tar"
!tar -xvf "/content/testing_000x.tar" -C "datasets/Waymo/images/tfrecord"
!rm "/content/testing_000x.tar"
!tar -xvf "/content/validation_000x.tar" -C "datasets/Waymo/images/tfrecord"
!rm "/content/validation_000x.tar"
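
Rather than editing the `000x` placeholder by hand for every shard, the commands can be generated in a loop. The following is only a minimal sketch: the helper name `waymo_shard_commands` and its default arguments are my own assumptions, while the bucket path and the file naming come from the cells above.

```python
BUCKET = "gs://waymo_open_dataset_v_1_3_2/archived_files"

def waymo_shard_commands(split, num_shards=3, dest="/content",
                         extract_to="datasets/Waymo/images/tfrecord"):
    """Return the shell commands that copy, extract and delete each shard of a split."""
    commands = []
    for i in range(num_shards):
        name = f"{split}_{i:04d}.tar"  # e.g. training_0000.tar
        commands.append(f"gsutil cp {BUCKET}/{split}/{name} {dest}")
        commands.append(f'tar -xvf "{dest}/{name}" -C "{extract_to}"')
        commands.append(f'rm "{dest}/{name}"')
    return commands

for cmd in waymo_shard_commands("training"):
    print(cmd)  # in Colab each line could be executed with get_ipython().system(cmd)
```

Printing the commands first makes it easy to check the paths before anything is copied or deleted.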

### BDD100K <a href="https://imgur.com/ZJTgfgb"><img src="https://i.imgur.com/ZJTgfgb.png" title="source: imgur.com" width=40 height=20/></a>

In [None]:
%cd datasets/BDD100K/

## TRAINING
!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-1.zip
!unzip images20-track-train-1.zip
!rm images20-track-train-1.zip

!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-2.zip
!unzip images20-track-train-2.zip
!rm images20-track-train-2.zip

!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-3.zip
!unzip images20-track-train-3.zip
!rm images20-track-train-3.zip

!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-4.zip
!unzip images20-track-train-4.zip
!rm images20-track-train-4.zip

!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-5.zip
!unzip images20-track-train-5.zip
!rm images20-track-train-5.zip

!wget http://dl.yf.io/bdd100k/mot20/images20-track-train-6.zip
!unzip images20-track-train-6.zip
!rm images20-track-train-6.zip

# VALIDATION
!wget http://dl.yf.io/bdd100k/mot20/images20-track-val-1.zip
!unzip images20-track-val-1.zip
!rm images20-track-val-1.zip

## TESTING (not downloaded: the annotations of the test set are not provided!)
# !wget http://dl.yf.io/bdd100k/mot20/images20-track-test-1.zip
# !unzip images20-track-test-1.zip
# !rm images20-track-test-1.zip

# !wget http://dl.yf.io/bdd100k/mot20/images20-track-test-2.zip
# !unzip images20-track-test-2.zip
# !rm images20-track-test-2.zip

🔄

In [None]:
# synchronize Colab with Google Drive (since our Drive is growing fast).
# This should be done whenever we modify the Drive and want to be sure its content is up to date.
from google.colab import drive
drive.flush_and_unmount()

### Argoverse-HD <a href="https://imgur.com/fO4AaCZ"><img src="https://i.imgur.com/fO4AaCZ.png" title="source: imgur.com" width=20 height=20/></a>

It already uses the **COCO annotation format**, the one to which I'll convert the annotations of the other two datasets.

*(also in this dataset the annotations for the test set are not provided)*

In [None]:
# standard Kaggle setup needed to use the API
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
%cd datasets/Argoverse
!kaggle datasets download -d mtlics/argoversehd
!unzip argoversehd.zip
%cd /content

## **2** - Standardizing data 
`building phase`

Now, to be able to feed the data from these different datasets to the model, we need to *standardize* them (i.e. give their annotations the same format and their images the same size). 
We'll process only the annotated data of each dataset. Each dataset subfolder is organised as follows:
       

```
labels/
      / COCO
            / annotations.json # a unique json file for all the training images of the dataset
images/
      / videos
            /video_01
            /video_02
            ...
```

> All the image annotations will be converted to the ***COCO*** dataset format, since it is the most popular format and the de facto standard for detection tasks in computer vision.
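
COCO stores each bounding box as `[x, y, width, height]`, where `(x, y)` is the top-left corner in pixels. Below is a minimal sketch of the kind of conversion the standardization has to perform. The function names are hypothetical, and I'm assuming that one source format gives the two corners of the box (as BDD100K does) and another gives the box centre plus its size (as Waymo's camera labels do); the actual code in `build.py` may differ.

```python
def corners_to_coco(x1, y1, x2, y2):
    """(x1, y1, x2, y2) corners -> COCO [x, y, width, height]."""
    return [x1, y1, x2 - x1, y2 - y1]

def center_to_coco(cx, cy, w, h):
    """Box centre + size -> COCO [x, y, width, height]."""
    return [cx - w / 2, cy - h / 2, w, h]

def make_annotation(ann_id, image_id, category_id, bbox):
    """Build one entry of the COCO "annotations" list."""
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": bbox,
        "area": bbox[2] * bbox[3],  # width * height
        "iscrowd": 0,
    }

ann = make_annotation(1, 42, 0, corners_to_coco(100, 50, 180, 110))
print(ann["bbox"])  # [100, 50, 80, 60]
```

Both helpers describe the same box in the example: a centre of `(140, 80)` with size `80 x 60` gives the same top-left corner `(100, 50)`.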

### Waymo <a href="https://imgur.com/2kk27xI"><img src="https://i.imgur.com/2kk27xI.png" title="source: imgur.com" width=20 height=20/></a>

Since the dataset is distributed as a set of *.tar* archives containing *TFRecord* files, a preprocessing phase is needed to extract the images and their corresponding labels.

*(we downloaded the first 3 .tar files of each split)*

We're able to do that thanks to a modified version of the toolkit developed by Kushal B Kusram. <br>https://github.com/KushalBKusram/WaymoOpenDatasetToolKit

In [None]:
!pip3 install waymo-open-dataset-tf-2-1-0==1.2.0

In [None]:
# run this only if the JSON label dictionary needs to be initialized
import json
d = {"info" : {"num_videos": 116, "num_images": 68760}, "images" : [], "categories": [ {"name" : "vehicle", "id" : 0}, {"name" : "person", "id" : 1}, {"name" : "motorbike", "id" : 2}], "annotations": []}
with open("/content/drive/MyDrive/VISIOPE/Project/datasets/Waymo/labels/COCO/annotations.json", "w") as f:
  json.dump(d, f)

In [None]:
%cd /content/visiope_project
!git pull

!python data_toolkit/building/build.py "waymo"

### BDD100K <a href="https://imgur.com/ZJTgfgb"><img src="https://i.imgur.com/ZJTgfgb.png" title="source: imgur.com" width=40 height=20/></a>

In [None]:
## [IF we want to use the BDD100K Toolkit] ##
!rm -rf bdd100k_toolkit > /dev/null
!git clone https://github.com/bdd100k/bdd100k.git bdd100k_toolkit
%cd bdd100k_toolkit
!pip3 install -r requirements.txt

In [None]:
# run this only if the JSON label dictionary needs to be initialized
import json
d = {"info" : {"num_videos": 1400, "num_images": 277594}, "images" : [], "categories": [ {"name" : "vehicle", "id" : 0}, {"name" : "person", "id" : 1}, {"name" : "motorbike", "id" : 2}], "annotations": []}
with open("/content/drive/MyDrive/VISIOPE/Project/datasets/BDD100K/labels/COCO/annotations.json", "w") as f:
  json.dump(d, f)

In [None]:
%cd /content/visiope_project
!git pull

# now that we filter out fewer annotations, the label JSON file has become huge! --> we found a way to improve the efficiency of the code! ;)
!python data_toolkit/building/build.py "bdd100k"

### Argoverse-HD <a href="https://imgur.com/fO4AaCZ"><img src="https://i.imgur.com/fO4AaCZ.png" title="source: imgur.com" width=20 height=20/></a>

In [None]:
!rm -rf cocoapi > /dev/null
!git clone https://github.com/cocodataset/cocoapi.git cocoapi

In [None]:
import json
# we need to add the 'info' key to the dictionary of both label files
labels_dir = "/content/drive/MyDrive/VISIOPE/Project/datasets/Argoverse/labels"
keys = ["categories", "images", "annotations", "sequences", "seq_dirs", "coco_subset", "coco_mapping", "n_tracks"]
for file_name, num_images in [("old_train.json", 54446), ("old_val.json", 39408)]:
  with open(f"{labels_dir}/{file_name}") as f:
    d = json.load(f)
  # 'info' comes first, then every original key with its original value
  ris_dict = {"info" : {"num_videos": 89, "num_images": num_images}, **{k: None for k in keys}}
  ris_dict.update(d)
  with open(f"{labels_dir}/{file_name}", "w") as f:
    json.dump(ris_dict, f)

In [None]:
%cd /content/visiope_project
!git pull

!python data_toolkit/building/build.py "argoverse"

## Data summary 📈

> After this process, we can check how many images and videos we have standardized from the three datasets, and whether the annotations are "*COCO compatible*".

#### Number of videos and images

In [None]:
import os
waymo = "/content/drive/MyDrive/VISIOPE/Project/datasets/Waymo"
bdd100k = "/content/drive/MyDrive/VISIOPE/Project/datasets/BDD100K"
argoverse = "/content/drive/MyDrive/VISIOPE/Project/datasets/Argoverse"

l = [waymo, bdd100k, argoverse]
ris = { "Waymo" : [0,0], "BDD100K" : [0,0], "Argoverse" : [0,0] }

for dataset_dir in l:
  videos_dir = os.path.join(dataset_dir, "images", "videos")
  num_train_videos = 0
  num_train_images = 0
  for v in os.listdir(videos_dir):
    num_train_videos += 1
    num_train_images += len(os.listdir(os.path.join(videos_dir, v)))
  name = os.path.basename(dataset_dir) # "Waymo", "BDD100K" or "Argoverse"
  ris[name] = [num_train_videos, num_train_images]

print(ris) # --> {'Waymo': [116, 68760], 'BDD100K': [1400, 277594], 'Argoverse': [89, 54446]}

                                            Waymo          BDD100K           Argoverse
                    Number of videos        116            1400              89
                    Number of images        68760          277594            54446

#### COCO compatibility

In [None]:
# To check if the json files created are compatible with COCO format...

from pycocotools.coco import COCO
waymo_json =     "/content/drive/MyDrive/VISIOPE/Project/datasets/Waymo/labels/COCO/annotations.json"
bdd100k_json =   "/content/drive/MyDrive/VISIOPE/Project/datasets/BDD100K/labels/COCO/annotations.json"
argoverse_json = "/content/drive/MyDrive/VISIOPE/Project/datasets/Argoverse/labels/COCO/annotations.json"
coco_1 = COCO(waymo_json)
print("Waymo")
print(len(coco_1.dataset["images"]))
print(len(coco_1.getImgIds())) # it returns the number of unique image ids!
print("------------")
coco_2 = COCO(bdd100k_json)
print("BDD100K")
print(len(coco_2.dataset["images"]))
print(len(coco_2.getImgIds()))
print("------------")
coco_3 = COCO(argoverse_json)
print("Argoverse")
print(len(coco_3.dataset["images"]))
print(len(coco_3.getImgIds()))

loading annotations into memory...
Done (t=3.22s)
creating index...
index created!
Waymo
68760
68760
------------
loading annotations into memory...
Done (t=13.44s)
creating index...
index created!
BDD100K
277594
277594
------------
loading annotations into memory...
Done (t=3.91s)
creating index...
index created!
Argoverse
54446
54446
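
Comparing `len(dataset["images"])` with `len(getImgIds())` only reveals duplicated image ids. As a complementary sanity check, here is a small pure-Python sketch that also looks for annotations pointing to missing images or unknown categories. The function `check_coco` and the toy dictionary are hypothetical, written just for illustration; it assumes the standard COCO keys `images`, `categories` and `annotations`.

```python
from collections import Counter

def check_coco(dataset):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    image_ids = [img["id"] for img in dataset["images"]]
    category_ids = {cat["id"] for cat in dataset["categories"]}
    duplicates = [i for i, n in Counter(image_ids).items() if n > 1]
    if duplicates:
        problems.append(f"duplicate image ids: {duplicates}")
    known_images = set(image_ids)
    for ann in dataset["annotations"]:
        if ann["image_id"] not in known_images:
            problems.append(f"annotation {ann['id']} points to a missing image")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']} has an unknown category")
    return problems

toy = {
    "images": [{"id": 1}, {"id": 2}],
    "categories": [{"id": 0, "name": "vehicle"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 0},
        {"id": 11, "image_id": 3, "category_id": 0},  # image 3 does not exist
    ],
}
print(check_coco(toy))  # ['annotation 11 points to a missing image']
```

An empty list means that none of these three problems was found.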


## **3** - Custom dataset creation
`extraction phase`

In [None]:
# initializing the processed_images_so_far.json file
import json
d = {"images_so_far" : [0]}
with open("/content/drive/MyDrive/VISIOPE/Project/data/processed_images_so_far.json", "w") as f:
  json.dump(d, f)

In [None]:
%cd /content/visiope_project
!git pull

!python data_toolkit/extraction/main.py

<table>
  <tr>
    <th><center> </center></th>
    <th><center>Before</center></th>
    <th><center>After</center></th>
    <td><center><i>(the extraction phase)</i></center></td>
  </tr>
  <tr>
    <td><center><i>Number of images</i></center></td>
    <td><center>400,800</center></td>
    <td><center>125,431</center></td>
  </tr>
  <tr>
    <td><center><i>Number of annotations</i></center></td>
    <td><center>3,795,457</center></td>
    <td><center>1,133,294</center></td>
  </tr>
</table>

> *Image reduction* of about ***68.7%*** *!* <br> *Annotation reduction* of about ***70.1%*** *!*
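
The reduction figures follow directly from the counts in the table. A quick worked check, where `reduction_percent` is just a hypothetical helper name:

```python
def reduction_percent(before, after):
    """Percentage of items removed when going from `before` to `after`."""
    return 100 * (before - after) / before

images_before = 68760 + 277594 + 54446  # Waymo + BDD100K + Argoverse
print(images_before)                                       # 400800
print(round(reduction_percent(images_before, 125431), 1))  # 68.7
print(round(reduction_percent(3795457, 1133294), 1))       # 70.1
```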

❌ *Not all the images have annotated bounding boxes! <br> For simplicity we deleted all those that have NO annotations.* <br> *(from **125431** to **114126** images)*

In [None]:
# code that was run to remove the images without annotations
import json
from tqdm import tqdm

images_list= (json.load(open("/content/drive/MyDrive/VISIOPE/Project/data/images_list.json")))["images_list"]
annotations = (json.load(open("/content/drive/MyDrive/VISIOPE/Project/data/labels/COCO/annotations.json")))

images_annotated = [str(int(ann["image_id"])) for ann in annotations["annotations"]] # 1133294 annotations
images_annotated = set(images_annotated) # 114126 unique image ids (a set makes the membership tests below fast)
print("Total number of images:")
print(len(images_list))
print("Total number of annotated images:")
print(len(images_annotated))

# modify annotations.json
annotations["info"]["num_images"] = len(images_annotated)
images = list(filter(lambda x: str(int(x["id"])) in images_annotated, tqdm(annotations["images"]) ))
annotations["images"] = images
f = open("/content/drive/MyDrive/VISIOPE/Project/data/labels/COCO/annotations.json", "w")
json.dump(annotations, f)
f.close()

# modify images_list.json
img2id = json.load(open("/content/drive/MyDrive/VISIOPE/Project/data/lookup_tables/img2id.json"))
filtered_images = list(filter(lambda x: str(int(img2id[x])) in images_annotated, tqdm(images_list)))
print("Number of annotated images after the filtering:")
print(len(filtered_images))

d = {"images_list" : filtered_images}
f = open("/content/drive/MyDrive/VISIOPE/Project/data/images_list.json", "w")
json.dump(d, f)
f.close()

> We just have to zip all the images and annotations to produce a ***data.zip*** that is always available for faster downloads!
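
One possible way to build the archive from Python is `shutil.make_archive`. This is only a sketch under my own assumptions: the helper `zip_dataset` and the example paths are hypothetical, and the archive should be written outside the folder being zipped so that it does not end up including itself.

```python
import shutil
from pathlib import Path

def zip_dataset(data_dir, archive_base):
    """Zip the whole `data_dir` tree into `<archive_base>.zip` and return its path."""
    data_dir = Path(data_dir)
    return shutil.make_archive(
        base_name=str(archive_base),    # ".zip" is appended automatically
        format="zip",
        root_dir=str(data_dir.parent),
        base_dir=data_dir.name,         # keep the top-level folder inside the archive
    )

# Hypothetical paths, adjust them to wherever the extracted dataset lives:
# zip_dataset("/content/drive/MyDrive/VISIOPE/Project/data", "/content/data")
```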

## 💡 *Observations during the process*

Since our final purpose is to test the detection system in the real world, in our case the streets of Rome, we need to think about which labels are useful and which are not. 
In general, three main aspects have to be discussed to improve the quality of our data:

*   "Fine-tuning" the annotation selection mechanism for each dataset. *Which annotations are worth keeping?*
This is important because, for example, the first version of the BDD100K annotation filtering was too strict: it was so severe that most of the bounding boxes were discarded.
```
Infos about the quality of annotations:
Waymo:        difficulty level --> {LEVEL_1, LEVEL_2}
BDD100K:      crowd, occluded, truncated
Argoverse:    is_crowd 
```


*   Merging the classes of buses, trucks and cars into the class **VEHICLE**.
 To make the detection task easier, and because the Waymo dataset makes no distinction between trucks, buses and cars, we simply merge these 3 classes into the vehicle one.
*   Dealing with people who ride a bicycle or a motorcycle. Some datasets consider the two-wheeled vehicle and its rider as two different entities, others do not. Do we merge bicycles and motorbikes into a single class? 
This was the most intriguing point. Since Rome is full of motorbikes and has far fewer bicycles, and because our model will be trained on data where motorbikes are rare, we expect that our system won't work properly with "*two-wheeled vehicles*". To overcome this issue we made the following decisions:
  *   All the detected bicycles (and therefore their riders) will be considered as **PERSON**, for example the Waymo "*TYPE_CYCLIST*" class or the BDD100K "*rider*" class.
  *   Since Rome is full of them, motorbikes are kept as a separate category (**MOTORBIKE**).

```
Different classes of each dataset:
Waymo:         TYPE_VEHICLE, TYPE_PEDESTRIAN, TYPE_CYCLIST
BDD100K:       pedestrian, rider, car, truck, bus, motorcycle, bicycle
Argoverse:     person, bicycle, car, motorcycle, bus, truck
```
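
The decisions above can be summarized as a lookup table from each dataset's original labels to the three final classes. This is a sketch of my reading of those decisions, not the exact mapping used in the repository: the names `CLASS_MAPPING` and `to_category_id` are hypothetical, and mapping the `bicycle` label itself to PERSON is an assumption. The category ids are the ones used when initializing the label files (vehicle 0, person 1, motorbike 2).

```python
CATEGORY_IDS = {"vehicle": 0, "person": 1, "motorbike": 2}

CLASS_MAPPING = {
    "waymo": {
        "TYPE_VEHICLE": "vehicle",
        "TYPE_PEDESTRIAN": "person",
        "TYPE_CYCLIST": "person",
    },
    "bdd100k": {
        "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
        "pedestrian": "person", "rider": "person",
        "bicycle": "person",  # assumption: bicycles are treated as PERSON
        "motorcycle": "motorbike",
    },
    "argoverse": {
        "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
        "person": "person",
        "bicycle": "person",  # assumption: bicycles are treated as PERSON
        "motorcycle": "motorbike",
    },
}

def to_category_id(dataset, label):
    """Return the unified category id, or None when the label is not kept."""
    name = CLASS_MAPPING[dataset].get(label)
    return None if name is None else CATEGORY_IDS[name]

print(to_category_id("bdd100k", "bus"))         # 0
print(to_category_id("waymo", "TYPE_CYCLIST"))  # 1
```

Labels that are missing from the table return `None`, so they can simply be skipped during the conversion.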