JersonGB22/ObjectDetection-TensorFlow-PyTorch

Object Detection

This repository showcases implementations of Object Detection models, a fundamental task in Computer Vision. Object detection involves identifying and locating instances of objects such as cars, people, buildings, and animals in static images or sequences of images (videos). These models output bounding box coordinates along with a class label and a confidence score for each detected object.

Object detection has diverse applications, including security and surveillance (detecting intruders or unusual behavior), industrial automation (object tracking and quality control), autonomous vehicles (identifying pedestrians, traffic signs, and obstacles), healthcare (analyzing medical images for abnormalities), precision agriculture (monitoring crops and detecting pests), and retail and commerce (customer flow analysis and inventory management).

Some of the models in this repository are built and trained from scratch using Convolutional Neural Networks (CNNs). Others apply transfer learning and/or fine-tuning to models pre-trained on large datasets, using frameworks such as TensorFlow, PyTorch, and Ultralytics. Basic models rely on tools like TensorFlow Hub or the timm library, while advanced, real-time applications use recent models such as YOLO11 from Ultralytics. Training and fine-tuning run on Google Colab TPUs or GPUs, depending on project requirements.

Most of the notebooks in this repository apply data augmentation to the training dataset to improve the model's generalization. The augmentations are implemented manually, with tools like Albumentations, or automatically (e.g., with YOLO11). Callbacks and learning rate schedulers are also used to limit overfitting and improve final performance.
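As an illustration of what a manually implemented augmentation has to handle in object detection, here is a minimal sketch of a horizontal flip that mirrors the bounding boxes together with the image. It is not taken from the notebooks: the function name `hflip_with_boxes`, the nested-list image representation, and the `(x_min, y_min, x_max, y_max)` pixel box format are assumptions chosen to keep the example dependency-free.

```python
def hflip_with_boxes(image, boxes):
    """Flip an image left-right and mirror its bounding boxes.

    image: list of rows, each row a list of pixel values (assumed layout).
    boxes: list of (x_min, y_min, x_max, y_max) tuples in pixel coordinates,
           where x_max / y_max are the exclusive right / bottom edges.
    """
    width = len(image[0])

    # Reverse every row to mirror the image around its vertical axis.
    flipped_image = [row[::-1] for row in image]

    # A box spanning [x_min, x_max) lands on [width - x_max, width - x_min);
    # the vertical coordinates are unchanged by a horizontal flip.
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped_image, flipped_boxes


# Tiny 2x4 "image" with one box covering the left-most column.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]

flipped_image, flipped_boxes = hflip_with_boxes(image, boxes)
print(flipped_boxes)  # the box now covers the right-most column
```

Libraries such as Albumentations perform this kind of coordinate bookkeeping for many transforms at once, which is the main reason to prefer them over hand-written code once the augmentation pipeline grows.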

If you're interested in exploring other Computer Vision tasks beyond Object Detection, you can visit my repository.

Implemented Models:

  • Object Detection with the Pascal VOC 2012 Dataset: The Pascal VOC 2012 dataset for object detection consists of 16,601 images across 20 object classes in realistic scenes, annotated with their respective bounding boxes. Each image may contain one or more instances of different classes. Of these, 13,299 images are used for training and 3,302 for validation. The YOLO11l model from Ultralytics was fine-tuned, achieving a mAP50 of 0.823 and a mAP50-95 of 0.681 on the validation dataset.

  • Object Detection with the VisDrone2019-DET Dataset: The VisDrone2019-DET dataset consists of images captured by various cameras mounted on drones across 14 different cities in China. It is divided into 6,471 images for training, 548 for validation, and 1,610 for the test-dev set, featuring 10 distinct classes. Each image may contain one or more instances of multiple classes. The YOLO11l model from Ultralytics was fine-tuned, achieving a mAP50 of 0.571 and a mAP50-95 of 0.367 on the validation dataset.

  • Fine-Grained Object Detection with the Stanford Dogs Dataset: The Stanford Dogs dataset consists of 20,580 images of 120 dog breeds from around the world, with 12,000 images for training and 8,580 for validation. Each image may contain one or more instances of a single class. The YOLO11l model from Ultralytics was fine-tuned for the fine-grained object detection task, achieving a mAP50 of 0.855 and a mAP50-95 of 0.785 on the validation dataset.

  • Oriented Object Detection with the DIOR-R Dataset: The DIOR-R dataset contains 23,463 optical remote sensing images annotated with oriented bounding boxes. Each image may include one or more instances of various classes. The dataset is divided into 5,862 images for training, 5,863 for validation, and 11,738 for testing. The YOLO11l model from Ultralytics was fine-tuned for the oriented bounding box object detection task, achieving a mAP50 of 0.84 and a mAP50-95 of 0.678 on the validation dataset.

  • Face Detection with the WIDER FACE Dataset: The WIDER FACE dataset consists of face images with wide variation in scale, pose, and occlusion, divided into 12,880 training images and 3,226 validation images. Each image may contain one or multiple faces. The YOLO11m model from Ultralytics was fine-tuned, achieving a mAP50 of 0.802 and a mAP50-95 of 0.461 on the validation dataset.

  • Object Detection with the COCO Dataset: A subset of the COCO dataset was extracted, consisting of 10,135 images across 18 object classes, with each image containing a single bounding box. This subset was split into 80% for training and 20% for validation. The EfficientNetV2-S model was fine-tuned, achieving an accuracy of 97.29% and an average IoU of 0.856 on the validation dataset.

  • Object Detection with the Caltech-101 Dataset: The Caltech-101 dataset comprises 8,677 images categorized into 101 object classes, where each image contains only one bounding box. Since the dataset lacks a predefined split, it was divided into 80% for training and 20% for testing. The EfficientNetV2-S model was fine-tuned, achieving an accuracy of 98.21% and an average IoU of 0.879 on the test dataset.

  • Object Localization with the CUB-200-2011 Dataset: The CUB-200-2011 dataset comprises 11,788 images of 200 bird species, each with a single bounding box. It is split into 5,994 images for training and 5,794 for testing. The EfficientNetV2-S model was fine-tuned for the object localization task, focusing on predicting only the bounding box coordinates, achieving an average IoU of 0.92 on the test dataset.

  • Simple Object Detection with the MNIST Dataset: The MNIST dataset consists of 70,000 grayscale images divided into 10 classes (digits 0-9): 60,000 for training and 10,000 for testing. Since the dataset does not include bounding boxes for the digits, an artificial variation was created: each original 28x28-pixel digit image was placed at a random position on a 112x112-pixel black canvas. A simple object detection model based on convolutional neural networks was then trained from scratch, achieving an accuracy of 99.3% and an average IoU of 0.948 on the test dataset.
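The single-box entries above report an average IoU, and the MNIST entry describes building a synthetic detection sample by pasting a digit onto a larger canvas. The sketch below shows one way both ideas could look in code. It is an illustration under stated assumptions, not the notebooks' implementation: the helper names `place_on_canvas` and `iou`, the `(x_min, y_min, x_max, y_max)` box format, and the all-white stand-in digit are hypothetical, and plain Python lists are used instead of NumPy arrays only to keep the example self-contained.

```python
import random

CANVAS_SIZE = 112  # canvas side length from the MNIST entry above
DIGIT_SIZE = 28    # original MNIST image side length


def place_on_canvas(digit, rng=random):
    """Paste a DIGIT_SIZE x DIGIT_SIZE image at a random spot on a black canvas.

    Returns the canvas and the ground-truth box (x_min, y_min, x_max, y_max),
    where x_max / y_max are the exclusive right / bottom edges.
    """
    max_offset = CANVAS_SIZE - DIGIT_SIZE
    x0 = rng.randint(0, max_offset)  # randint is inclusive on both ends
    y0 = rng.randint(0, max_offset)

    canvas = [[0] * CANVAS_SIZE for _ in range(CANVAS_SIZE)]
    for r in range(DIGIT_SIZE):
        # Overwrite a DIGIT_SIZE-wide slice of the canvas row with the digit row.
        canvas[y0 + r][x0:x0 + DIGIT_SIZE] = digit[r]
    return canvas, (x0, y0, x0 + DIGIT_SIZE, y0 + DIGIT_SIZE)


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Overlap width/height; clamped at zero when the boxes do not intersect.
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Stand-in for a real MNIST digit: an all-white 28x28 square.
digit = [[255] * DIGIT_SIZE for _ in range(DIGIT_SIZE)]
canvas, true_box = place_on_canvas(digit, random.Random(0))

# Hypothetical prediction that is off by two pixels horizontally.
pred_box = (true_box[0] + 2, true_box[1], true_box[2] + 2, true_box[3])
print(round(iou(true_box, pred_box), 3))  # 0.867
```

Averaging `iou` over every (ground truth, prediction) pair in a test set gives an average IoU figure of the kind quoted for the single-box datasets. The mAP50 and mAP50-95 values reported for the YOLO11 models are built on the same overlap measure: mAP50 counts a detection as correct at an IoU threshold of 0.5, while mAP50-95 averages over thresholds from 0.5 to 0.95.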

Some Results

VisDrone2019-DET Dataset


Stanford Dogs Dataset


Pascal VOC 2012 Dataset


DIOR-R Dataset


WIDER FACE Dataset


COCO Dataset


Caltech-101 Dataset


CUB-200-2011 Dataset


MNIST Dataset

More results can be found in the respective notebooks.

Technological Stack

Python TensorFlow PyTorch Ultralytics

Scikit-learn OpenCV Pandas Plotly

Contact

Gmail LinkedIn GitHub