# Deep Learning Final Project: X-Ray Object Detection Literature Review

Soon-Hyuck Lee, Owen Randolph, Marcos Fernanadez, Pratham Dedhiya

##### Abstract

Airports face increasing pressure to improve the accuracy, speed, and consistency of security screening. Current baggage inspection systems rely heavily on human operators, who must analyze complex X-ray images under time constraints and high workload. This dependence introduces significant challenges, including operator fatigue, inconsistent performance, and a heightened risk of human error—particularly when identifying small, occluded, or visually ambiguous threat items.

To address these limitations, airports require a more advanced, scalable solution that can augment human screeners and enhance threat detection reliability. An AI-powered computer vision system based on deep learning provides a promising path forward. Such a system can automatically analyze X-ray images, highlight potential threats, reduce cognitive burden on security personnel, and serve as a foundation for next-generation screening platforms that increasingly automate the detection process.

##### Background

The use of deep learning in X-ray security screening has increased significantly in recent years, largely due to improvements in convolutional neural networks (CNNs). Traditional image-processing methods struggle with X-ray images because objects often overlap, vary in shape, and appear with low contrast. Deep neural networks, however, can automatically learn hierarchical visual patterns—such as edges, textures, materials, and object shapes—making them well suited for detecting prohibited items in complex baggage imagery.

CNNs have become the standard approach for image-based detection and classification because they learn features directly from raw pixels. Early layers capture simple patterns (edges, corners, textures), while deeper layers learn more abstract, high-level features (shapes, semantic structures, or full objects). This multi-level representation is essential in X-ray imaging, where important cues may be subtle or partially occluded.
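As a rough illustration of what an early-layer filter responds to, the sketch below slides a hand-set vertical-edge kernel over a toy image. The kernel values, the toy image, and the helper name `conv2d_valid` are assumptions made for this example; a trained CNN learns its filters from data rather than having them set by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy 4x6 "scan": dark region on the left, bright region on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-set vertical-edge kernel (an assumption; real filters are learned).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = conv2d_valid(image, vertical_edge)
print(response)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The response is nonzero only where the window straddles the dark-to-bright boundary, which is the sense in which an early layer "detects" an edge.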

While earlier CNNs like VGG networks achieved strong performance, researchers found that simply stacking more layers often made training worse, not better. Very deep models suffered from optimization problems such as vanishing gradients, where the gradient signal shrinks as it is propagated backward through many layers, so the layers closest to the input receive almost no useful update. This limited how deep practical networks could be, which matters most in domains requiring high-level abstraction, such as detecting small or concealed objects in X-ray images.
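The vanishing-gradient effect can be sketched numerically. The example below assumes a chain of scalar sigmoid layers with a weight of 1.0. Both choices are illustrative assumptions, not a model from the literature, and the function name `gradient_through_chain` is hypothetical. Because the sigmoid derivative never exceeds 0.25, the gradient reaching the input shrinks at least geometrically with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_chain(depth, weight=1.0, x=0.0):
    """d(output)/d(input) for `depth` stacked scalar layers a = sigmoid(weight * a_prev)."""
    grad, a = 1.0, x
    for _ in range(depth):
        a = sigmoid(weight * a)          # forward step through one layer
        grad *= weight * a * (1.0 - a)   # chain rule: sigmoid'(z) = a * (1 - a)
    return grad

for depth in (2, 10, 50):
    print(depth, gradient_through_chain(depth))
```

The printed gradients fall by many orders of magnitude between depth 2 and depth 50, which is why the earliest layers in such a stack barely update.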

##### ResNet-50

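ResNet is a foundation for many modern computer vision models, and residual connections also appear in later YOLO backbones. The idea is as follows. Instead of asking a stack of layers to learn a full mapping H(x) directly, ResNet has them learn the residual F(x) = H(x) - x, and a shortcut (identity) connection adds the input back so the block outputs F(x) + x. If the identity mapping is already close to optimal, the layers only need to push F(x) toward zero, which reduces the burden on each layer. During backpropagation the shortcut also gives gradients a direct path to earlier layers, so very deep networks become easier to train.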
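A minimal sketch of the residual idea follows. It uses small dense layers on plain Python lists as a stand-in for the convolutions in a real ResNet block, and it omits batch normalization. The helper names (`dense`, `residual_block`) and the weights are assumptions for illustration only.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F is two dense layers with a ReLU between them."""
    residual = dense(w2, relu(dense(w1, x)))             # F(x): the part the block learns
    return relu([f + xi for f, xi in zip(residual, x)])  # shortcut adds x back unchanged

x = [0.5, 1.0, 2.0]

# With all-zero weights F(x) = 0, so the block passes its (non-negative) input through.
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)

# With small weights the output is the input plus a small learned correction.
small = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
out = residual_block(x, small, small)
```

The zero-weight case shows why added residual blocks need not hurt: a block can fall back to the identity by driving its weights toward zero, whereas a plain stack would have to learn the identity explicitly.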

ResNet-50 is a 50-layer deep convolutional neural network. Its residual connections let the network learn differences rather than whole mappings. It is a good fit for our X-ray object detection project because we need deep hierarchical features, from low-level edges and textures up to object shapes.


![image.png](attachment:fcf59ccc-8f66-4b61-9f03-e6481e44eb1b.png)

ResNet introduces residual blocks, which mitigate vanishing gradients and improve small-image recognition, which matters here for detecting objects in X-ray scans of bags.

![image.png](attachment:00e2a1c0-d176-4701-8128-c70dd01de2d5.png)

This structure highlights how ResNet-50 builds its deep feature hierarchy through repeated bottleneck residual blocks, each combining a main convolutional path with a shortcut connection. The result is a network that learns rich, multi-scale representations, remains stable to train, and is effective for downstream tasks such as X-ray object recognition.
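The arithmetic behind the bottleneck design and the "50" in the name can be checked with a short sketch. The channel widths (256 and 64) and the per-stage block counts (3, 4, 6, 3) are taken from the ResNet-50 configuration in He et al., not from the figures above. The "plain" comparison block is a hypothetical alternative used only to show the saving, and the counts ignore biases and batch-norm parameters.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Weight count of one conv layer, ignoring biases and batch-norm parameters."""
    return kernel_size * kernel_size * in_channels * out_channels

# Bottleneck block with a 256-channel input: 1x1 reduce, 3x3, 1x1 expand.
bottleneck = (
    conv_weights(1, 256, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Hypothetical alternative for comparison: two plain 3x3 convs at 256 channels.
plain = 2 * conv_weights(3, 256, 256)

# Depth count: stem conv + 3 convs per bottleneck block + final fully connected layer.
blocks_per_stage = (3, 4, 6, 3)
depth = 1 + 3 * sum(blocks_per_stage) + 1

print(bottleneck, plain, depth)  # 69632 1179648 50
```

Under these assumptions the bottleneck uses roughly 17 times fewer weights than the plain pair of 3x3 layers, which is what makes a 50-layer network affordable.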

##### YOLO for Object Detection
One of the most significant advances in security imaging has been the shift from traditional two-stage detectors (e.g., Faster R-CNN) to faster single-stage architectures such as YOLO (You Only Look Once). Unlike two-stage models, which first generate region proposals and then classify them, YOLO performs detection in a single forward pass, enabling real-time inference without a substantial loss in accuracy.
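To make the "single forward pass" concrete, the sketch below follows the output layout of the original 2016 YOLO formulation: the image is divided into an S x S grid and each cell predicts B boxes plus C class scores. The values S=7, B=2, C=20 come from that paper. The `decode_box` helper and the sample prediction are assumptions for illustration, and later YOLO versions use different heads.

```python
S, B, C = 7, 2, 20                # grid size, boxes per cell, classes (2016 YOLO paper)
values_per_cell = B * 5 + C       # each box carries x, y, w, h, confidence
output_shape = (S, S, values_per_cell)

def decode_box(row, col, x, y, w, h, img_w, img_h, grid=S):
    """Convert one cell-relative prediction to pixel corners (x1, y1, x2, y2)."""
    cx = (col + x) / grid * img_w  # x, y are offsets inside the grid cell
    cy = (row + y) / grid * img_h
    bw, bh = w * img_w, h * img_h  # w, h are fractions of the whole image
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

# Hypothetical prediction from the centre cell of a 448x448 image.
box = decode_box(row=3, col=3, x=0.5, y=0.5, w=0.25, h=0.5, img_w=448, img_h=448)

print(output_shape)  # (7, 7, 30)
print(box)           # (168.0, 112.0, 280.0, 336.0)
```

Because every cell's boxes and class scores come out of one tensor, there is no separate region-proposal stage to run, which is the source of the speed advantage described above.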

YOLO-based detectors consistently outperform two-stage models in throughput and latency, making them especially suitable for airport screening and other high-volume inspection environments where rapid decision-making is critical. Their ability to handle small and irregularly shaped objects across multiple scales is also valuable in X-ray imagery, where prohibited items may appear partially occluded or vary widely in size and orientation.

##### References

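He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778).

He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., & Li, M. (2019). Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779–788).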