Video-based Distracted Driver Detection System

1 Problem Description

1.1 Classification of dangerous driving behavior

  • Distracted driving
  • Fatigue driving
  • Drunk driving
  • Aggression
  • ...

1.2 Target Problems

  1. An image-based detection scheme alone cannot accurately detect the actions leading up to the driver's behavior, such as the driver reaching for the phone, and ignoring the action as a whole lowers recognition accuracy. Although current image-based detection schemes achieve high accuracy, they rely heavily on key frames and ignore the driver's other relevant actions.

  2. Detection speed is an important indicator. Low latency is required to bring monitoring as close to real time as possible.

1.3 Objective

What we are going to do is distracted driver detection. On the embedded-system side, we use a camera to record the driver's behavior and dump the video every 5 s or 10 s. Our current solution is to extract frames from this video at a suitable time step on the Raspberry Pi, then send these packaged images to the server side as the raw input for model prediction. The challenge on this side is achieving near real-time detection, because the Pi needs to wait for the model's results before warning the driver through the buzzer.

[Figure: system overview]
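The sketch below illustrates the Pi-side frame grabber described above. It is a minimal sketch assuming OpenCV (cv2) is available on the Pi; capture_clip and its parameters are illustrative names, not the repository's actual code. Note that 70 frames per 5 s clip corresponds to sampling at roughly 14 fps.

import time

import cv2  # assumed available on the Pi for camera capture and JPEG encoding

def capture_clip(duration_s=5, fps=14, device=0):
    """Grab frames for duration_s seconds at ~fps and return them
    JPEG-encoded, ready to be packaged and sent to the server."""
    cap = cv2.VideoCapture(device)
    frames = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        ok, frame = cap.read()
        if not ok:
            break  # camera error: stop early rather than block
        ok, buf = cv2.imencode(".jpg", frame)  # compress before upload
        if ok:
            frames.append(buf.tobytes())
        time.sleep(1.0 / fps)  # crude pacing; good enough for a sketch
    cap.release()
    return frames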

2 Related Works

2.1 Related datasets

  1. AUC: https://abouelnaga.io/projects/auc-distracted-driver-dataset/

    The Distracted Driver's dataset is collected using an ASUS ZenPhone (Model Z00UD) rear camera. The input was collected in a video format and then cut into individual images, 1080×1920 each. The phone was fixed using an arm strap to the car roof handle on top of the front passenger's seat. In our use case, this setup proved to be very flexible as we needed to collect data in different vehicles. All of the distraction activities were performed without actual driving in a parking spot.

    Only images, 10 classes; no reply to our access request yet.

  2. State Farm: https://www.kaggle.com/c/state-farm-distracted-driver-detection/data

    Only images, 10 classes, 4GB

  3. DMD: https://dmd.vicomtech.org

    Videos, 14 classes, 7GB

    For now, we have obtained the lite version of the dataset, which includes 2,446 clips of 70 frames each.


2.2 Reference Papers

2.3 Others

3 Method

3.1 Overall Ideas

https://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review#sec-2

A Guide to Action Recognition (2018):

  • One network for capturing spatiotemporal information vs. two separate networks, one each for spatial and temporal information

  • Fusing predictions across multiple clips

  • End-to-end training vs. feature extraction and classifying separately

3.2 Vision Transformer model

In the ViDDD model, we take 70 consecutive frames from the input video, and each frame is fed into the same ViT model in sequence. Over the 70 per-frame predictions we run a simple statistic: count how many frames are predicted as each class, then select the most frequent class as the prediction for the whole video.

[Figure: ViDDD model architecture]
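A minimal sketch of this majority vote, assuming the clip arrives as a single batch of preprocessed frames; predict_clip and the tensor shapes are illustrative, not the repository's actual interface.

from collections import Counter

import torch

@torch.no_grad()
def predict_clip(vit_model, clip):
    """Classify a clip by majority vote over per-frame ViT predictions.

    clip: tensor of shape (70, 3, 224, 224), one preprocessed frame per row.
    """
    vit_model.eval()
    logits = vit_model(clip)            # (70, num_classes)
    frame_preds = logits.argmax(dim=1)  # one predicted class per frame
    votes = Counter(frame_preds.tolist())
    return votes.most_common(1)[0][0]   # most frequent class wins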

4 Communication Architecture

To bridge the deep learning side and embedded system side, we need to build a real-time communication system for video uploading and results feedback.

For real-time video uploading, on the client side, the camera capturer continuously grabs frames from the camera attached to the Pi. The frames sender keeps streaming frames to the server through the "Video Stream Sending" pipe. Once the frames reach the server side, the frames receiver saves them into a buffer. Every 70 frames in the buffer are popped by the clips dumper and written into a video clip, which the model predictor classifies before saving the result to the Redis database.

At the same time, to obtain the results from the prediction model, the result retriever queries the server for the prediction results every 5 s through the "Result Returning" pipe. The results replier on the server side looks the result up in the database and returns it.

[Figure: communication architecture]
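A minimal sketch of the server-side path from frames receiver to Redis, assuming redis-py and OpenCV; on_frame_received, model_predict, and the "latest_prediction" key are illustrative names, not the repository's actual interface.

import cv2
import numpy as np
import redis

r = redis.Redis()   # result store; the results replier reads from here
frame_buffer = []   # frames accumulated by the frames receiver

def on_frame_received(jpeg_bytes, model_predict):
    """Handle one frame from the "Video Stream Sending" pipe: buffer it,
    and whenever 70 frames are available, dump a clip, predict, and
    store the label for the "Result Returning" pipe to serve."""
    frame = cv2.imdecode(np.frombuffer(jpeg_bytes, np.uint8), cv2.IMREAD_COLOR)
    frame_buffer.append(frame)
    if len(frame_buffer) >= 70:
        clip = frame_buffer[:70]      # clips dumper pops one 70-frame clip
        del frame_buffer[:70]
        label = model_predict(clip)   # e.g., the majority vote shown above
        r.set("latest_prediction", str(label))  # queried every 5 s by the Pi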

5 Experiments

5.1 Dataset

  • Dataset Annotation

    We are grateful for the video annotation tool provided by DMD:

    https://github.com/Vicomtech/DMD-Driver-Monitoring-Dataset/

    Video processing uses FFmpeg or the OpenCV library; when downloading and installing FFmpeg, remember to route git through the global proxy.

  • Process Raw Dataset

    • Cut the videos into smaller clips of 70 frames each.
    • Label each clip using the annotation file.
    • Restructure the folder
  • Data Preprocessing

    For each frame in the clips (see the torchvision sketch after this list):

    • Resize and crop to 224×224, matching the ViT expected input image size.
    • RandomHorizontalFlip
    • ToTensor
    • Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
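A minimal torchvision sketch of the per-frame pipeline above; the exact resize/crop parameters are assumptions, since only the 224×224 target size is stated.

from torchvision import transforms

# Per-frame preprocessing for the 224x224 input of vit_base_patch16_224.
frame_transform = transforms.Compose([
    transforms.Resize(256),        # assumed: scale the short side to 256 px
    transforms.CenterCrop(224),    # assumed: then crop the 224x224 center
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])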

5.2 Model Evaluation

[Figures: model evaluation results]

5.3 Experiments

[Figure: experiments]

6 Code Structures

DDD
├── code
│   ├── __pycache__
│   │   └── mypath.cpython-38.pyc
│   ├── dataLoaders
│   │   ├── DMD-labels.json
│   │   ├── pic_dataset.py
│   │   └── video_dataset.py
│   ├── models
│   │   └── vit_base_patch16_224_in21k.pth
│   ├── mypath.py
│   ├── network
│   │   └── vit_model.py
│   ├── process_raw_dataset.py # processes the raw dataset and saves it into processed_dataset
│   ├── train.py
│   └── utils.py
│
│
├── data	# data ready for training
│   └── DMD-clips-70
│       ├── train
│       └── val
│
│
├── processed_dataset	# raw dataset after processing, not yet ready for training
│   ├── DMD-clips-70
│   │   ├── change_gear
│   │   ├── drinking
│   │   ├── hair_and_makeup
│   │   ├── phonecall_left
│   │   ├── phonecall_right
│   │   ├── radio
│   │   ├── reach_backseat
│   │   ├── reach_side
│   │   ├── safe_drive
│   │   ├── standstill_or_waiting
│   │   ├── talking_to_passenger
│   │   ├── texting_left
│   │   ├── texting_right
│   │   └── unclassified
│   │
│   ├── driver_imgs_list.csv
│   └── imgs
│       └── train
...
