# Dataset creation to train a YOLOv8 model for tank detection

This notebook shows how to build a dataset of annotated images to train a computer vision model for object detection. We combine openly available datasets into a single dataset of military vehicles and format it for YOLOv8 training.

We use [fiftyone](https://github.com/voxel51/fiftyone) to convert, merge, label and format the images before training with [YOLOv8](https://github.com/ultralytics/ultralytics).

### Setup

We start by setting up some logging.

In [None]:
import logging

logging.basicConfig(level=logging.INFO)

### Download images from ImageNet

The first dataset we'll use is ImageNet21k, available at [https://image-net.org/download-images.php](https://image-net.org/download-images.php). You need to register and be granted access to download the images. We use the Winter 21 version because it lets us download the images for a single synset from https://image-net.org/data/winter21_whole/SYNSET_ID.tar, e.g., https://image-net.org/data/winter21_whole/n02352591.tar. A processed version of ImageNet21k is available at https://github.com/Alibaba-MIIL/ImageNet21K, and the class ids and names are listed at https://github.com/google-research/big_transfer/issues/7#issuecomment-640048775.
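
As a small illustration of the URL pattern above, the sketch below builds the archive URL for one synset. The helper name `synset_url` is hypothetical and not part of `adomvi`; actually downloading the archive still requires the registered access mentioned above.

```python
# Sketch: build the Winter 21 download URL for a single synset.
# `synset_url` is a hypothetical helper, not part of adomvi.
WINTER21_URL = "https://image-net.org/data/winter21_whole/{synset_id}.tar"


def synset_url(synset_id: str) -> str:
    """Return the per-synset archive URL for the given synset id."""
    return WINTER21_URL.format(synset_id=synset_id)


print(synset_url("n02352591"))
```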

We'll begin by downloading the ImageNet21k class names and looking for relevant classes that we can use.

In [None]:
from pathlib import Path

imagenet_dir = Path() / "imagenet"

In [None]:
from adomvi.datasets.imagenet import download_class_names, find_class_by_text

classes = download_class_names(imagenet_dir)
find_class_by_text(classes, "military")
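
To make the idea of searching class names concrete, here is a minimal sketch of a case-insensitive text search. It assumes the class names are held in a dictionary mapping synset id to name, which may differ from what `download_class_names` actually returns; `search_classes` and the second sample entry are hypothetical.

```python
# Sketch of a case-insensitive search over class names.
# Assumption: classes are stored as {synset_id: name}; the real structure
# returned by adomvi's download_class_names may differ.
def search_classes(classes: dict[str, str], text: str) -> dict[str, str]:
    """Return the entries whose name contains `text`, ignoring case."""
    needle = text.lower()
    return {cid: name for cid, name in classes.items() if needle in name.lower()}


sample = {
    "n04389033": "tank, army tank, armored combat vehicle",
    "placeholder-id": "some unrelated class",  # hypothetical entry
}
print(search_classes(sample, "TANK"))
```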

We can now download images and annotations for the relevant classes. The `download_imagenet_detections` function will download the images and annotations for the given class ids **if the annotations exist** (not all classes have been annotated).

In [None]:
from adomvi.datasets.imagenet import download_imagenet_detections

class_ids = ["n02740300", "n04389033", "n02740533", "n04464852", "n03764276"]
download_imagenet_detections(class_ids, imagenet_dir)

The data we just downloaded into the `imagenet` directory is not entirely clean: some annotations have no corresponding image. We need to remove those labels; otherwise, importing the data into fiftyone raises errors.

In [None]:
from adomvi.datasets.imagenet import cleanup_labels_without_images

cleanup_labels_without_images(imagenet_dir)
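
For readers curious about what such a cleanup involves, here is a minimal sketch. It is not the implementation of `cleanup_labels_without_images`; it assumes a VOC-style layout with images under `data/` and XML annotations under `labels/`, and the helper name `remove_orphan_labels` is hypothetical.

```python
from pathlib import Path

# Assumption: images live in <root>/data and VOC XML labels in <root>/labels.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def remove_orphan_labels(root: Path) -> list[Path]:
    """Delete XML labels that have no image with the same file stem."""
    image_stems = {
        p.stem
        for p in (root / "data").iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    removed = []
    for label in sorted((root / "labels").glob("*.xml")):
        if label.stem not in image_stems:
            label.unlink()  # remove the annotation without an image
            removed.append(label)
    return removed
```

Calling `remove_orphan_labels(Path("imagenet"))` would return the list of deleted label files, which is handy for logging how much data was discarded.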

We can now create a new dataset with `fiftyone`. Fiftyone allows us to manage images annotated with bounding boxes and labels, to merge datasets from different sources, and to split the datasets and prepare them for processing.

In [None]:
from adomvi.utils import cleanup_existing_dataset

imagenet_name = "military-vehicles"
cleanup_existing_dataset(imagenet_name)

In [None]:
import fiftyone as fo

# Create the dataset
dataset = fo.Dataset.from_dir(
    dataset_dir=imagenet_dir,
    dataset_type=fo.types.VOCDetectionDataset,
    # dataset_name = imagenet_name,
)

dataset.name = imagenet_name

dataset.map_labels(
    "ground_truth",
    {"n04389033":"tank"}
).save()

Once our dataset is created, we can launch a session to display the dataset and view the annotated images.

In [None]:
session = fo.launch_app(dataset, auto=False)

In [None]:
session.show()

### Add Open Images samples

The ImageNet dataset only contained 378 annotated images of tanks, so we'll look into other available datasets to improve training of the model. We’ll load [Open Images](https://storage.googleapis.com/openimages/web/index.html) samples with `Tank` detection labels, passing in `only_matching=True` to only load the `Tank` labels. We then map these labels by changing `Tank` into `tank`.

In [None]:
import fiftyone.zoo as foz

oi_samples = foz.load_zoo_dataset(
    "open-images-v7",
    classes=["Tank"],
    only_matching=True,
    label_types="detections",
).map_labels(
    "ground_truth",
    {"Tank": "tank"},
)

We can add these new samples into our training dataset with `merge_samples()`:

In [None]:
dataset.merge_samples(oi_samples)
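
To our understanding, `merge_samples()` matches samples on their `filepath` by default: samples with a new path are added, and samples with an existing path have their fields merged. The exact behavior depends on parameters not used here. The sketch below only illustrates that keyed-merge idea with plain dictionaries; `merge_by_filepath` and the file names are hypothetical.

```python
# Sketch of a keyed merge: {filepath: [labels]} stands in for a dataset.
# This illustrates the idea only and is not fiftyone's implementation.
def merge_by_filepath(
    base: dict[str, list[str]], incoming: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Add unseen filepaths and combine labels for filepaths already present."""
    merged = {path: list(labels) for path, labels in base.items()}
    for path, labels in incoming.items():
        merged.setdefault(path, []).extend(labels)
    return merged


imagenet_like = {"imagenet/a.jpg": ["tank"]}
open_images_like = {"open-images/b.jpg": ["tank"]}
print(merge_by_filepath(imagenet_like, open_images_like))
```

Because the two sources store their images under different paths, merging them here simply grows the dataset.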

In [None]:
session = fo.launch_app(dataset, auto=False)

In [None]:
session.show()

### Add Roboflow dataset

The dataset now contains 1624 annotated images of tanks, so we'll add another source to further improve training of the model. We’ll load a dataset from [Roboflow Universe](https://universe.roboflow.com/) and map its vehicle labels to `tank`.

In [None]:
from adomvi.datasets.roboflow import download_roboflow_dataset

roboflow_dir = Path() / "roboflow"
url = "https://universe.roboflow.com/ds/P2jPq32qKU?key=E4MIo8mavP"
download_roboflow_dataset(url, roboflow_dir)

In [None]:
from adomvi.datasets.roboflow import restructure_dataset

restructure_dataset(roboflow_dir)

In [None]:
roboflow_name = "russian-military-annotated"
cleanup_existing_dataset(roboflow_name)

In [None]:
# Import the roboflow dataset
dataset_rf = fo.Dataset.from_dir(
    dataset_dir=roboflow_dir,
    dataset_type=fo.types.VOCDetectionDataset,
    name=roboflow_name,
)

existing_labels = ["bm-21", "t-80", "t-64", "btr-80", "mt-lb", "t-72", "bmp-1", "bmp-2", "bmd-2", "btr-70"]

label_mapping = {label: "tank" for label in existing_labels}

dataset_rf.map_labels(
    "ground_truth",
    label_mapping
).save()

In [None]:
# from adomvi.datasets.roboflow import delete_images_without_labels

# delete_images_without_labels(dataset_rf)

We can add these new samples to our training dataset:

In [None]:
dataset.merge_samples(dataset_rf)

In [None]:
session = fo.launch_app(dataset, auto=False)

In [None]:
session.show()

### Add Google dataset

We provide a sample annotated dataset with four classes (*AFV*, *APC*, *LAV* and *MEV*). You can download the dataset from [here](https://github.com/jonasrenault/adomvi/releases/download/v1.2.0/military-vehicles-dataset.tar.gz) and extract it.

In [None]:
from adomvi.datasets.google import download_google_dataset

google_dir = Path() / "google"
url = "https://github.com/jonasrenault/adomvi/releases/download/v1.2.0/military-vehicles-dataset.tar.gz"

download_google_dataset(url, google_dir)

We'll use fiftyone to load and preview the dataset.

In [None]:
google_name = "google-military-vehicles"
cleanup_existing_dataset(google_name)

In [None]:
dataset_google_dir = google_dir / "dataset"

# Create the dataset
dataset_google = fo.Dataset.from_dir(
    dataset_dir=dataset_google_dir,
    dataset_type=fo.types.YOLOv4Dataset,
    name=google_name,
)

We can add these new samples to our training dataset:

In [None]:
dataset.merge_samples(dataset_google)


In [None]:
session = fo.launch_app(dataset, auto=False)

In [None]:
session.open_tab()

In [None]:
session.show()

### Export dataset to disk

Now that our dataset is created, we'll export it into a format supported by YOLOv8 to train our model.

We first remove any existing tags from the dataset, then split it into train, val and test sets.

In [None]:
import fiftyone.utils.random as four

# Delete existing tags to start fresh
dataset.untag_samples(dataset.distinct("tags"))

# Split into train, val and test sets
four.random_split(dataset, {"train": 0.8, "val": 0.1, "test": 0.1})
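
To illustrate what a tag-based random split does, here is a minimal sketch in plain Python. It is not fiftyone's implementation; `tag_random_split` and its `seed` parameter are hypothetical, and the rounding of split sizes is one possible choice.

```python
import random


def tag_random_split(sample_ids, fractions, seed=51):
    """Shuffle the ids and assign each one a split name according to `fractions`."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    names = list(fractions)
    tags = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += fractions[name]
        # The last split takes whatever remains, so every sample gets a tag.
        end = len(ids) if i == len(names) - 1 else round(cumulative * len(ids))
        for sample_id in ids[start:end]:
            tags[sample_id] = name
        start = end
    return tags


tags = tag_random_split(range(100), {"train": 0.8, "val": 0.1, "test": 0.1})
```

Each sample ends up with exactly one of the three tags, which is what lets a later export step select samples per split.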

Once our dataset is split, we can export it to a specific directory.

In [None]:
from adomvi.yolo.utils import export_yolo_data

export_dir = Path() / "dataset"
export_yolo_data(dataset, export_dir, ["tank"], split=["train", "val", "test"], overwrite=True)
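
For reference, here is a minimal sketch of the label conversion that a YOLO export involves. It is not the code of `export_yolo_data`, whose internals are not shown here, and `to_yolo_line` is a hypothetical helper. It assumes fiftyone's relative `[top-left-x, top-left-y, width, height]` bounding boxes and the YOLO text format of one `class x_center y_center width height` line per object, with all values normalized to the image size.

```python
def to_yolo_line(class_index: int, box: list[float]) -> str:
    """Convert a relative [x, y, width, height] box to a YOLO label line."""
    x, y, w, h = box
    x_center = x + w / 2  # shift from the top-left corner to the box center
    y_center = y + h / 2
    return f"{class_index} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"


# Class index 0 corresponds to "tank" when the class list is ["tank"].
print(to_yolo_line(0, [0.1, 0.2, 0.4, 0.5]))
# prints "0 0.300000 0.450000 0.400000 0.500000"
```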