A system that detects scene text on traffic signs through images and videos


Scene Text Detection for Driving Videos

Abstract
As automation increasingly dominates many aspects of human life, the demand for highly accurate and responsive automated systems has become essential. In transportation specifically, self-driving vehicles and automated traffic monitoring and analysis systems must be able to read and comprehend the traffic context at a given moment in order to make informed decisions. My research, "Scene Text Detection for Driving Videos", aims to support automated transportation systems in capturing textual information from traffic signs.

System Pipeline

Scene Text Detection for Driving Videos System pipeline
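The two modules chain together per frame: Module 1 proposes traffic-sign crops, Module 2 detects text boxes inside each crop, and the text boxes are mapped back to full-frame coordinates. A minimal sketch in plain Python; the detector stubs, `SignText` container, and `(x1, y1, x2, y2)` box format are illustrative assumptions, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class SignText:
    sign_box: Box           # traffic-sign box in frame coordinates
    text_boxes: List[Box]   # text boxes, also in frame coordinates

def run_pipeline(frame,
                 detect_signs: Callable[[object], List[Box]],
                 detect_text: Callable[[object], List[Box]]) -> List[SignText]:
    """Module 1 -> crop each sign -> Module 2 -> map text boxes back to frame coords."""
    results = []
    for (x1, y1, x2, y2) in detect_signs(frame):
        # Naive crop for an image stored as a 2D list of rows.
        crop = [row[x1:x2] for row in frame[y1:y2]]
        # Text boxes come back in crop coordinates; offset by the crop origin.
        text = [(tx1 + x1, ty1 + y1, tx2 + x1, ty2 + y1)
                for (tx1, ty1, tx2, ty2) in detect_text(crop)]
        results.append(SignText((x1, y1, x2, y2), text))
    return results
```

In the real system the two detector callables would wrap the fine-tuned PP-YOLOE+ and PP-OCRv3 detection models; the coordinate offset back into frame space is what lets per-crop text boxes be drawn on the original video frame.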

Module 1: Detect and classify traffic signs

PP-YOLOE architecture

Module 2: Detect text boxes on traffic signs

PP-OCRv3 (detection) architecture

Data

| # | Dataset | Description | Detail | M1 Usage | M2 Usage |
| --- | --- | --- | --- | --- | --- |
| #1 | Vietnam Traffic Signs Dataset | Open-source traffic videos recorded around Ho Chi Minh City | 40 videos (total length: 1h24m44s) | Fine-tuning + Testing | Fine-tuning + Testing |
| #2 | VinText | Largest Vietnamese scene text dataset | 2,000 labeled images, ~56,000 text objects (~10,500 unique objects) | Testing | Fine-tuning + Testing |
| #3 | Zalo AI Challenge - Traffic Sign Detection Dataset | Dataset for the 2020 Zalo AI Challenge "Traffic Signs Detection" contest, with images collected from Google Maps Street View | ~8,000 traffic images with traffic sign labels | Testing | Testing |
| #4 | Extra | Self-collected dataset around Ho Chi Minh City | 198 images, 393 traffic sign objects | Improved Fine-tuning + Testing | Testing |

Customized Vietnam Traffic Signs Dataset (Customized VTSD)

Since Dataset #1 was built for another project with a different output, it had to be re-processed to match this project's target:

  • Splitting and filtering images from raw videos
  • Using CVAT to label traffic signs and text
  • Label statistics:

| Statistic | Count |
| --- | --- |
| Images | 296 |
| Traffic sign objects | 603 |
| Traffic sign classes | 12 |
| Word objects | 1,538 (274 unique words) |
| Textline objects | 628 |
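The splitting step above amounts to sampling frames from the raw videos at a fixed stride before filtering and labeling. A minimal sketch; the stride and fps values are illustrative assumptions, not the actual parameters used:

```python
def sample_frame_indices(total_frames: int, fps: float, every_n_seconds: float) -> list:
    """Pick one frame index every `every_n_seconds` of video for labeling."""
    stride = max(1, round(fps * every_n_seconds))
    return list(range(0, total_frames, stride))
```

For example, a 10-second clip at 30 fps sampled every 2 seconds yields five candidate frames, which would then be filtered by hand (e.g. dropping blurred frames or frames without signs) before annotation in CVAT.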

Traffic sign classes and data distribution

Fine-tuning

| Module | Model | Pre-trained dataset | Fine-tuned dataset | Performance | FPS |
| --- | --- | --- | --- | --- | --- |
| #1 | PP-YOLOE+ | Objects365 | Customized VTSD | mAP: ~0.677 | ~18.3 |
| #2 | PP-OCRv3 (detection) | Baidu images + public datasets | Customized VTSD + VinText | H-mean: ~0.82 | ~29.5 |
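For reference, the H-mean reported for Module 2 is the harmonic mean of detection precision and recall, the standard scene-text detection metric:

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score for text detection)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The precision/recall pair behind the reported ~0.82 is not stated in this README; as a hedged illustration, a precision of 0.85 with a recall of 0.79 would give an H-mean of about 0.82.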
  • Improving M1 performance by combining Dataset #4 into the Customized VTSD:
    • The total numbers of images and traffic sign objects increase by ~40%
    • mAP after improvement: ~0.69

Improvement sample
(top image: before improvement; bottom image: after improvement)

Video output samples

  • Sample #1
  • Sample #2
  • Sample #3

Future works

  • Fine-tuning a scene text recognition module and integrating it into the system
  • Building an end-to-end Transformer-based model
  • Developing a web application for demonstration
