A system that detects scene text on traffic signs through images and videos


Scene Text Detection for Driving Videos

Abstract
As automation increasingly dominates many aspects of human life, the demand for highly accurate and responsive automated systems has become essential. In transportation specifically, self-driving vehicles and automated traffic monitoring and analysis systems must be able to read and comprehend the traffic context at a given moment in order to make informed decisions. My research, "Scene Text Detection for Driving Videos", aims to support automated transportation systems in capturing textual information from traffic signs.

System Pipeline

Scene Text Detection for Driving Videos System pipeline
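The two modules chain together per frame: Module 1 proposes traffic-sign crops, Module 2 detects text boxes inside each crop, and the text boxes are mapped back to full-frame coordinates. A minimal sketch in plain Python; the detector stubs, `SignText` container, and `(x1, y1, x2, y2)` box format are illustrative assumptions, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class SignText:
    sign_box: Box           # traffic-sign box in frame coordinates
    text_boxes: List[Box]   # text boxes, also in frame coordinates

def run_pipeline(frame,
                 detect_signs: Callable[[object], List[Box]],
                 detect_text: Callable[[object], List[Box]]) -> List[SignText]:
    """Module 1 -> crop each sign -> Module 2 -> map text boxes back to frame coords."""
    results = []
    for (x1, y1, x2, y2) in detect_signs(frame):
        # Naive crop for an image stored as a 2D list of rows.
        crop = [row[x1:x2] for row in frame[y1:y2]]
        # Text boxes come back in crop coordinates; offset by the crop origin.
        text = [(tx1 + x1, ty1 + y1, tx2 + x1, ty2 + y1)
                for (tx1, ty1, tx2, ty2) in detect_text(crop)]
        results.append(SignText((x1, y1, x2, y2), text))
    return results
```

In the real system the two detector callables would wrap the fine-tuned PP-YOLOE+ and PP-OCRv3 detection models; the coordinate offset back into frame space is what lets per-crop text boxes be drawn on the original video frame.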

Module 1: Detect and classify traffic signs

PP-YOLOE architecture

Module 2: Detect text boxes on traffic signs

PP-OCRv3 (detection) architecture

Data

| # | Dataset | Description | Detail | M1 Usage | M2 Usage |
| --- | --- | --- | --- | --- | --- |
| #1 | Vietnam Traffic Signs Dataset | Open-source traffic videos recorded around Ho Chi Minh City | 40 videos (total length: 1h24m44s) | Fine-tuning + Testing | Fine-tuning + Testing |
| #2 | VinText | Largest Vietnamese scene text dataset | 2,000 labeled images, ~56,000 text objects (~10,500 unique objects) | Testing | Fine-tuning + Testing |
| #3 | Zalo AI Challenge - Traffic Sign Detection Dataset | Dataset for the 2020 Zalo AI Challenge "Traffic Signs Detection" contest, with images collected from Google Maps Street View | ~8,000 traffic images with traffic sign labels | Testing | Testing |
| #4 | Extra | Self-collected dataset around Ho Chi Minh City | 198 images, 393 traffic sign objects | Improved Fine-tuning + Testing | Testing |

Customized Vietnam Traffic Signs Dataset (Customized VTSD)

Since Dataset #1 was built for another project with a different output, it had to be re-processed to match this project's target:

  • Splitting and filtering images from raw videos
  • Using CVAT to label traffic signs and text
  • Label statistics:

| Statistic | Count |
| --- | --- |
| Images | 296 |
| Traffic sign objects | 603 |
| Traffic sign classes | 12 |
| Word objects | 1,538 (274 unique words) |
| Textline objects | 628 |
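The splitting step above amounts to sampling frames from the raw videos at a fixed stride before filtering and labeling. A minimal sketch; the stride and fps values are illustrative assumptions, not the actual parameters used:

```python
def sample_frame_indices(total_frames: int, fps: float, every_n_seconds: float) -> list:
    """Pick one frame index every `every_n_seconds` of video for labeling."""
    stride = max(1, round(fps * every_n_seconds))
    return list(range(0, total_frames, stride))
```

For example, a 10-second clip at 30 fps sampled every 2 seconds yields five candidate frames, which would then be filtered by hand (e.g. dropping blurred frames or frames without signs) before annotation in CVAT.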

Traffic sign classes and data distribution

Fine-tuning

| Module | Model | Pre-trained dataset | Fine-tuned dataset | Performance | FPS |
| --- | --- | --- | --- | --- | --- |
| #1 | PP-YOLOE+ | Objects365 | Customized VTSD | mAP: ~0.677 | ~18.3 |
| #2 | PP-OCRv3 (detection) | Baidu images + public datasets | Customized VTSD + VinText | H-mean: ~0.82 | ~29.5 |
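For reference, the H-mean reported for Module 2 is the harmonic mean of detection precision and recall, the standard scene-text detection metric:

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score for text detection)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The precision/recall pair behind the reported ~0.82 is not stated in this README; as a hedged illustration, a precision of 0.85 with a recall of 0.79 would give an H-mean of about 0.82.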
  • Improving M1 performance by combining Dataset #4 into the Customized VTSD:
    • The total numbers of images and traffic sign objects increase by ~40%
    • mAP after improvement: ~0.69

Improvement sample
(top image: before improvement; bottom image: after improvement)

Video output samples

  • Sample #1
  • Sample #2
  • Sample #3

Future works

  • Fine-tuning a scene text recognition module and integrating it into the system
  • Building an end-to-end Transformer-based model
  • Developing a web application for demonstration
