# A Comprehensive Computer Vision Project for Safety of Text Walkers


####Detection of People Looking at Their Smartphones
##### Bournemouth University

# Report
##I- Introduction

The growing daily use of smartphones has significantly changed how people interact with one another, conduct business, and access information, news, and updates about their surroundings. Modern life has clearly benefited from these capable devices, but they have also brought new difficulties and potential dangers. One of the most worrying behaviours is people's tendency to use their smartphones while walking, which endangers anyone nearby as well as the users themselves. Our project proposes a computer vision based method for identifying people who are using their smartphones while walking. By deploying such a system in public places, for example at a street corner near traffic lights, we want to increase safety, raise awareness, and ultimately lower the risk of accidents.
##II- Objectives and Roadmap


The main goal of our project is to develop a well-functioning computer vision system that can correctly detect people using their smartphones. We want the system to give accurate predictions even in challenging scenarios such as varying lighting conditions, people holding a coffee, crowded environments, different walking styles, and different ways of holding a smartphone. To achieve these goals, we plan to train and test machine learning models and techniques such as Faster R-CNN, Faster R-CNN with SVM and HOG, YOLOv5, and YOLOv8.

We aim to build a method for identifying smartphone use while walking. The steps in our project are as follows:

1. First, we compile and analyse a sizable and varied dataset of images. The dataset comprises people using their smartphones while walking, along with negative samples, for example people who are holding their smartphone but not looking at it, or people who are holding both a smartphone and a coffee or water bottle.

2. Next, we survey object detection algorithms and choose some of them for our project. We learn how to implement them from their official websites or GitHub repositories and the corresponding documentation.

3. We start the training process and try to improve our models by tuning parameters, feeding more images, applying augmentations, reducing memory usage, increasing model speed, and so on.

4. We then test the performance of the models: Faster R-CNN, Faster R-CNN with SVM and HOG, YOLOv5, and YOLOv8. To achieve better results we add more images to the dataset. For example, when we notice that a model mispredicts certain images, we add similar positive and negative samples to improve its accuracy.

5. As explained in the project sub-brief, we want our notebook to be reproducible from top to bottom. We also train our models to the fullest, up to the point where the loss flattens.

6. The dataset and the fully trained models are uploaded to online services such as Roboflow, Dropbox, and GitHub and imported from there, so that the Google Colab link or Jupyter notebook is accessible and anyone can reproduce our results.

##III- Data Collection
###Sources of data
We did not use ready-made datasets from Kaggle or elsewhere on the internet. We collected the images ourselves from sources such as istock.com, pexels.com, and Google Images. We collected more than a thousand images: positive samples show people looking at their phone, and negative samples show people who are not looking at their smartphone.

###Data Annotation with Roboflow

Roboflow is an easy-to-use platform designed to simplify annotation and data collection for machine learning projects. It provides a simple annotation interface along with features such as version control, data augmentation, preprocessing, image resizing, filters, removal of duplicate images, and marking unannotated images as negative samples. Images can be uploaded and shared across teammates, so annotation can be divided among team members, and the dataset can easily be exported from Roboflow to platforms such as Google Colab. We uploaded our images, annotated them, and split the dataset into three groups: training, testing, and validation, with 70%, 20%, and 10% of the images respectively.
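Roboflow performs the split itself, so the following is only a minimal sketch of what a 70/20/10 split amounts to. The file names, the `split_dataset` helper, and the seed are hypothetical; the count of 1010 images is the dataset size reported later in the notebook.

```python
import random

def split_dataset(filenames, train_pct=70, test_pct=20, seed=0):
    """Shuffle the filenames and split them into train/test/validation lists.

    Integer arithmetic is used for the sizes so that rounding is predictable;
    whatever is left after train and test goes to validation.
    """
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid

# Hypothetical file names standing in for the real images.
files = [f"img_{i:04d}.jpg" for i in range(1010)]
train, test, valid = split_dataset(files)
print(len(train), len(test), len(valid))  # 707 202 101
```

Fixing the seed keeps the split stable between runs, which matters when results from different models are compared on the same test set.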

##IV- Literature Review
### Searching for similar projects
 
To understand how to achieve good results in detecting people who are looking at their smartphones on the street, our team conducted a detailed search for relevant resources, research papers, code snippets, articles, and projects.
The search covered academic databases and Google Scholar as well as popular web sources such as Towards Data Science, Medium, GitHub, and YouTube. The search terms were usually combinations of specific keywords such as "smartphone usage detection," "pedestrian detection," "object detection," "computer vision," "Faster R-CNN," "YOLOv5," "YOLOv8," "HOG," and "SVM."
Several studies have addressed the detection of pedestrians, smartphone users, and text walkers in different contexts. In conclusion, the literature review gave us valuable insight into the current state of the art in object detection for tasks like ours, namely pedestrian and smartphone usage detection. The findings from these searches informed our choice of algorithms and techniques for the project.

### Documentation

After the literature research we decided to implement Faster R-CNN and YOLO (You Only Look Once), because the results reported for them were good. At that point, reading the official documentation was very helpful, because it provides guidelines for training on custom datasets and detailed explanations of other topics, for example running a model on a video. At the end of the notebook we test our YOLOv8 model on a video, and we are satisfied with the results we achieved there.

##V- Object Detection Algorithms

### Faster R-CNN

Faster R-CNN has shown impressive results on a number of object detection benchmarks. Jiang and Learned-Miller report state-of-the-art results on two widely used face detection benchmarks with a Faster R-CNN model trained on a large-scale dataset. Deep convolutional neural networks (CNNs) have come to dominate many computer vision tasks, and region-based CNN detection approaches are currently the dominant paradigm in object detection. Because the field is moving so quickly, three generations of region-based CNN detection models have been proposed in recent years, each more accurate and faster than the last (Jiang, 2017).

Faster R-CNN is a state-of-the-art object detection algorithm that combines a Region Proposal Network (RPN) with a Fast R-CNN model. The RPN generates candidate object bounding boxes, which are passed to the Fast R-CNN model for classification and bounding-box refinement. This two-stage approach lets Faster R-CNN achieve good accuracy while maintaining reasonable speed.
 
#### Integration to the project

Faster R-CNN was implemented as one of our primary object detection algorithms for detecting people using smartphones on the street while walking or standing. The model was trained on our annotated dataset, focusing on identifying pedestrians and their smartphone usage. We trained the model for 10 epochs, which took around two hours on our local computer; the time will vary with the hardware available.


### Support Vector Machines (SVM) and Histogram of Oriented Gradients (HOG) on top of Faster R-CNN

SVM is a supervised machine learning algorithm that is mostly used for classification tasks, and we are familiar with it from everyday applications and university classes. It finds the optimal hyperplane, the one that best separates the classes in feature space. HOG is a feature descriptor that we know from our computer vision lectures and labs. It captures the distribution of gradient orientations, and therefore edge directions, in an image, which makes it suitable for object recognition tasks.
SVM and HOG are used as a secondary filter on top of the Faster R-CNN model that we train. We aim to remove false positives predicted by Faster R-CNN with the help of SVM and HOG. In this way we combine traditional computer vision approaches with neural network models, giving a hybrid solution to our problem.
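To make the idea of a "distribution of gradient orientations" concrete, here is a simplified sketch of the histogram that HOG computes for a single cell. It is pure Python for illustration only; the function name, the 9-bin default, and the toy cell are assumptions, and library implementations such as `skimage.feature.hog` additionally interpolate between bins and normalise over blocks of cells.

```python
import math

def orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees) for one cell.

    `cell` is a 2D list of grey levels. Each interior pixel votes for the bin
    of its gradient direction, weighted by the gradient magnitude.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    rows, cols = len(cell), len(cell[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# Toy cell with a vertical edge: dark on the left, bright on the right.
cell = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
hist = orientation_histogram(cell)
print(hist)  # all of the gradient energy falls into the first bin
```

A vertical edge produces purely horizontal gradients, so only the first bin is populated. Feature vectors built from such histograms are what an SVM would then separate with a hyperplane.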

#### Integration to the project

The SVM and HOG features were combined with Faster R-CNN to improve the detection of smartphone users. HOG features are extracted from the candidate person-with-phone regions proposed by the RPN, and an SVM classifier is then trained on these features to identify people with smartphones.

### YOLOv5
YOLO is a novel approach to object detection. Earlier work on object detection repurposes classifiers to perform detection. YOLO instead frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the entire detection pipeline is a single network, detection performance can be optimised end to end (Redmon, 2016).

YOLOv5 is a free, easy-to-use, and efficient single-stage object detection algorithm. It divides the input image into a grid and predicts bounding boxes and class probabilities for each cell. YOLOv5 is known for its performance, ease of implementation, and good accuracy in various object detection tasks.
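The grid idea can be illustrated with a small sketch. It follows the single-grid formulation of the original YOLO paper, where the cell containing an object's centre is responsible for predicting it; the 7x7 grid size and the function name are assumptions for illustration, and YOLOv5 itself applies the idea at several scales with anchor boxes.

```python
def responsible_cell(box, image_w, image_h, grid_size=7):
    """Return (row, col) of the grid cell that contains the centre of `box`.

    `box` is [xmin, ymin, xmax, ymax] in pixels.
    """
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2
    cy = (ymin + ymax) / 2
    # Clamp so that a centre lying exactly on the right/bottom border stays in range.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col

print(responsible_cell((100, 200, 300, 400), 640, 640))  # (3, 2)
```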

#### Integration to the project

YOLOv5 is implemented as an alternative object detection algorithm for detecting people with smartphones on the street. Our model is trained on the annotated dataset that we download from Roboflow, and we compare its performance with Faster R-CNN and the combined SVM-HOG approach.

### YOLOv8
In a wide range of applications, including autonomous cars, robotics, video surveillance, and augmented reality, real-time object detection has become an essential component. The YOLO (You Only Look Once) framework stands out among object detection algorithms for its balance of speed and accuracy, enabling quick and reliable identification of objects in images. Since its introduction, the YOLO family has gone through several iterations, each building on the previous versions to address limitations and improve performance (Aboah, 2023).

YOLOv8 is a more recent version in the YOLO (You Only Look Once) family of object detection algorithms. It includes various improvements, for example a new backbone architecture, a modified loss function, and improved data augmentation techniques. These improvements allow YOLOv8 to achieve higher accuracy and faster inference than its predecessors.

#### Integration to the project

YOLOv8 was also implemented as an object detection algorithm for detecting people using smartphones on the street. The model was trained on the annotated dataset and its performance was compared with Faster R-CNN, the combined SVM-HOG approach, and YOLOv5 to evaluate the most effective method for the task.




##VI- Implementation and Evaluation
### Training the models

The selected object detection algorithms, Faster R-CNN, the hybrid SVM-HOG approach, YOLOv5, and YOLOv8, were trained on the dataset that we annotated in Roboflow. The dataset was split into training, validation, and testing sets. Hyperparameters were tuned to optimize the models' performance. We trained Faster R-CNN for 10 epochs (around two hours), YOLOv5 for 150 epochs (around 2.5 hours), and YOLOv8 for 150 epochs (around 2.3 hours).
 
### Comparing the results of different algorithms

The performance of the algorithms that we trained was compared using specific evaluation metrics. The aim of the comparison is to identify the most effective method for our task, detecting people looking at their smartphones on the street. Results were analyzed to determine the strengths and weaknesses of each approach and to understand the factors that contributed to its performance.
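The report does not name the metrics here, so the following is only a sketch under the assumption that detections are scored with intersection over union (IoU) and summarised as precision and recall, which is a common choice for object detection. The helper names, the 0.5 threshold, and the example boxes are illustrative.

```python
def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth box can be matched once
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Illustrative boxes: one good detection, one false positive, one missed person.
ground_truth = [[0, 0, 10, 10], [20, 20, 30, 30]]
predictions = [[1, 1, 10, 10], [50, 50, 60, 60]]
print(precision_recall(predictions, ground_truth))  # (0.5, 0.5)
```

Precision penalises false alarms and recall penalises missed people, so reporting both gives a fairer comparison between models than a single accuracy figure.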

### Discussion of challenges and limitations

During implementation and evaluation we encountered several challenges and limitations. These include the varying quality of the input images, the diversity of people's poses and of smartphones, and the different ways people hold their phone or other objects such as a wallet, coffee, watch, or laptop. Computational resources for training and inference were another limitation: running the notebook from top to bottom takes around 9 hours. We also had difficulties with CUDA memory and Colab usage limits. Because we use the free version of Colab and our own local computers, the training process stopped several times for these reasons.

##VII- Conclusion

### Summary of findings

Our project aims to detect people who are looking at their smartphones on the street using different object detection algorithms. The implemented models were Faster R-CNN, the combined SVM-HOG approach, YOLOv5, and YOLOv8. Based on the performance metrics, several of the methods we trained are very effective. The comparison offers insight into the factors that contributed to the models' success. The challenges and limitations faced during the project were also discussed.

### Implications and potential applications

There are numerous implications and potential applications for the study's findings. The developed system might be used for research on people's behaviour, community safety precautions, or urban planning. Additionally, it could be used to enhance pedestrian detection and improve the safety of autonomous vehicles and intelligent transportation systems.

### Future work and recommendations

Future work may focus on improving the models by including more data or exploring more sophisticated techniques, for example attention mechanisms or temporal information from video sequences. Further research might also examine the ethical implications, since people's faces are captured, and any biases in the detection system, to make sure the trained model is fair, transparent, and free of privacy concerns.


## Important Notes About Notebook

To keep the notebook reproducible and to prevent CUDA out-of-memory crashes while it runs, we train all models for just one epoch. To reduce memory usage further, we added a 'number of images' variable to the dataset loader. Since our dataset contains more than a thousand images, making the number of images an input variable lets us limit memory usage.

To prevent PyTorch's deterministic-behaviour error, we set the cuBLAS workspace configuration at the beginning of the notebook:
`os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'`
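As a hedged sketch of how this setting fits into a wider reproducibility setup: PyTorch asks for `CUBLAS_WORKSPACE_CONFIG` when deterministic algorithms are enabled on CUDA, and the variable should be set before any CUDA work starts. The helper name and the seed value are assumptions, and whether the notebook enables deterministic algorithms in exactly this way is not shown in this part of it.

```python
import os
import random

def configure_reproducibility(seed=42):
    """Set the cuBLAS workspace config and seed the random number generators."""
    # Must be in place before the first CUDA operation.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import torch  # optional: only configured when PyTorch is installed
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass

configure_reproducibility(seed=42)
```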

To keep the notebook reproducible for the different machine learning models as well, we use three dataset formats: COCO for Faster R-CNN, and the Roboflow export formats for YOLOv5 and YOLOv8.

We import the fully trained models from Dropbox. Since we use the free version, the links are valid for one month; they can be renewed, and their lifetime can be extended with a subscription.

<br>
<br>
 

## References, Resources, and Links

Jiang, H. and Learned-Miller, E., 2017, May. Face detection with the faster R-CNN. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017) (pp. 650-657). IEEE.

Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).

Aboah, A., Wang, B., Bagci, U. and Adu-Gyamfi, Y., 2023. Real-time multi-class helmet violation detection using few-shot data sampling technique and yolov8. arXiv preprint arXiv:2304.08256.

### Code Cells
2.1 The Dataset

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset

2.3 Augmentations

https://albumentations.ai/docs/api_reference/pytorch/transforms/#albumentations.pytorch.transforms.ToTensorV2

https://github.com/albumentations-team/albumentations

2.4 DataLoaders

https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

2.5 Pre-trained model and 2.6 Training

https://pytorch.org/vision/stable/models.html#torchvision.models.detection.fasterrcnn_resnet50_fpn

https://pytorch.org/vision/stable/_modules/torchvision/models/detection/faster_rcnn.html#FastRCNNPredictor

2.7 Filtering the outputs

https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#object-detection-finishing-touches-nms

https://pyimagesearch.com/2014/11/17/non-maximum-suppression-object-detection-python/

4 Yolov5

https://docs.ultralytics.com/yolov5/

https://github.com/ultralytics/yolov5

5 Yolov8

https://docs.ultralytics.com/

https://github.com/ultralytics/ultralytics

### Example Projects:

https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/examples/tutorial.ipynb

https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb


###Helpful resources:

https://www.youtube.com/watch?v=GRtgLlwxpc4&t=536s&pp=ygUmY3VzdG9tIGRhdGFzZXQgb2JqZWN0IGRldGVjdGlvbiB5b2xvdjU%3D

https://www.youtube.com/watch?v=fu2tfOV9vbY

https://www.youtube.com/watch?v=fhzCwJkDONE&t=845s

https://www.youtube.com/watch?v=PPpKlPYL95c&t=868s

###Sources for Data Collection

https://www.pexels.com/search/people%20talking%20on%20the%20smartphone%20on%20foot/

https://www.istockphoto.com/search/2/image-film?phrase=people%20walking%20with%20smartphone

https://www.istockphoto.com/search/2/image-film?family=creative&phrase=people%20walking%20in%20street%20and%20smartphone

https://www.google.com/search?q=people+walking+while+looking+to+phone&rlz=1C1SQJL_enTR910TR910&sxsrf=APwXEdeOojQij3NQK136Ywc-yg_nv00eqQ:1683239646928&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjKk-ip3Nz-AhWGWcAKHUi4Bq8Q_AUoAXoECAEQAw&biw=1474&bih=762&dpr=1.25

<br>
<br>
 












 


#Implementation

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#9041A0;
           font-size:110%;
           font-family:Segoe UI;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
    Outline of Implementation:
    <li style="padding-left:1em">1.   Importing Libraries</li>
    <li style="padding-left:1em">2.   FasterCNN</li>
    <li style="padding-left:1em">2.1  The Dataset</li>
    <li style="padding-left:1em">2.2  Visualization</li>
    <li style="padding-left:1em">2.3  Augmentations</li>
    <li style="padding-left:1em">2.4  Data Loaders</li>
    <li style="padding-left:1em">2.5  Pre-trained Model</li>
    <li style="padding-left:1em">2.6 Training</li>
    <li style="padding-left:1em">2.7  Filtering The Outputs</li>
    <li style="padding-left:1em">2.8  Testing The Model</li>
    <li style="padding-left:1em">3    Hybrids of Traditional Approach and Neural Networks</li>
    <li style="padding-left:1em">3.1. SVM and HOG</li>
    <li style="padding-left:1em">4.   Yolov5</li>
    <li style="padding-left:1em">4.1  Training</li>
    <li style="padding-left:1em">4.2  Testing</li>
    <li style="padding-left:1em">4.3  Interactive Visualization</li>
    <li style="padding-left:1em">5.   Yolov8</li>
    <li style="padding-left:1em">5.1  Training</li>
    <li style="padding-left:1em">5.2  Testing</li>
    <li style="padding-left:1em">5.3  Testing Model On Video</li>
    <li style="padding-left:1em">6.   Importing Fully Trained Models</li>
    <li style="padding-left:1em">6.1  Loading FasterCNN</li>
    <li style="padding-left:1em">6.2  Defining Display Functions</li>
    <li style="padding-left:1em">6.3  Loading SVM</li>
    <li style="padding-left:1em">6.4  Comparing CNN and Hybrid Approach</li>
    <li style="padding-left:1em">6.5  Loading Yolov5</li>
    <li style="padding-left:1em">6.6  Loading Yolov8</li>
    <li style="padding-left:1em">6.7  Testing Yolov8 Model On Video</li>
             </p> </div>

In [None]:
# Import the auxiliary files that we uploaded to GitHub, for example the introduction image and the video used to test the model.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
!git clone https://github.com/musacim/ObjectDetection_files.git >/dev/null
!mv /content/ObjectDetection_files/* /content/

In [None]:
image1 = mpimg.imread('/content/crossing_road_smartphone.png')
image2 = mpimg.imread('/content/detection_system.png')

fig, axs = plt.subplots(1, 2, figsize=(14, 14))
axs[0].imshow(image2)
axs[0].set_title('With Detection System')
axs[0].axis('off')
axs[0].text(0.5, -0.1, '*Our fully trained YOLOv8 model is used for detection',
            size=8, ha="center", transform=axs[0].transAxes)
axs[1].imshow(image1)
axs[1].set_title('Without Detection System')
axs[1].axis('off')


plt.show()


##Positive and Negative Image Samples
### Dataset
Our dataset contains 1010 images: 70% are positive samples and 30% are negative samples. All of the images were collected and annotated by the project team. We use Roboflow for storing and annotating the images, splitting them into test, train, and validation folders, and importing them into Colab.

### Positive and Negative Samples

Our aim is to detect people who are looking at their phone, whether they are walking or standing. To label an image as positive and annotate the object, there must be one or more people looking at their phone. If a person is holding a phone but looking somewhere else, the sample is negative. People looking at a book, computer, or coffee cup or bottle are negative samples, as are people talking on the phone or showing their phone to friends.

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(10, 6))

axs[0, 0].imshow(Image.open("/content/ps0.jpg"))
axs[0, 1].imshow(Image.open("/content/ps1.jpg"))
axs[0, 2].imshow(Image.open("/content/ps2.jpg"))
axs[0, 3].imshow(Image.open("/content/ps3.jpg"))
axs[1, 0].imshow(Image.open("/content/ns0.jpg"))
axs[1, 1].imshow(Image.open("/content/ns1.jpg"))
axs[1, 2].imshow(Image.open("/content/ns2.jpg"))
axs[1, 3].imshow(Image.open("/content/ns3.jpg"))

axs[0, 0].set_title("Positive Samples")
axs[1, 0].set_title("Negative Samples")
for ax in axs.flat:
    ax.set_xticks([])
    ax.set_yticks([])

fig.subplots_adjust(hspace=0)
plt.show()


#1. Importing Libraries

In [None]:
!pip install torch --quiet
!pip install torchvision --quiet

# Install dependencies
!pip install albumentations
!pip install pycocotools --quiet
 

#Yolov5, its requirements and roboflow
!git clone https://github.com/ultralytics/yolov5 --quiet
!pip install -r yolov5/requirements.txt --quiet
!pip install roboflow --quiet

#Yolov8
!pip install ultralytics --quiet

# Clone TorchVision repo and copy helper files
!git clone https://github.com/pytorch/vision.git
%cd vision
!git checkout v0.3.0
%cd ..
!cp vision/references/detection/utils.py ./
!cp vision/references/detection/transforms.py ./
!cp vision/references/detection/coco_utils.py ./

In [None]:
# basic python and ML Libraries
import os
import random
import numpy as np
import pandas as pd

# for ignoring warnings
import warnings
warnings.filterwarnings('ignore')

# We will be reading images using OpenCV
import cv2

# matplotlib for visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.image as mpimg
from PIL import Image

# torch and torchvision libraries
import torch
import torchvision
from torchvision import transforms as torchtrans  
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import torch.utils.data as utils_data

# helper libraries
from custom_engine import train_one_epoch, evaluate
import utils
import transforms as T

# for image augmentations and tools
import albumentations as A
from albumentations.pytorch.transforms import ToTensorV2
import pycocotools

# sklearn functions and models
from joblib import load
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
 
#Yolo and roboflow
from ultralytics import YOLO
from roboflow import Roboflow 

#some libraries for visualization for videos
from moviepy.editor import *
from IPython.display import HTML
from base64 import b64encode


In [None]:
# PyTorch raises an error about deterministic behaviour on CUDA; setting CUBLAS_WORKSPACE_CONFIG resolves it.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'  

#2. Faster CNN

##2.1 The Dataset 

In [None]:
# Download the dataset from Roboflow in COCO format for use with Faster R-CNN.
!pip install roboflow

from roboflow import Roboflow
rf = Roboflow(api_key="7QnIUGHlluFrILkFBubf")
project = rf.workspace("adasd-ukmfp").project("people_with_smartphone")
dataset = project.version(4).download("coco")


In [None]:
# Load the COCO annotations into a dataframe with one row per annotation
# (or one row per image for images without annotations). Columns: x, y, width,
# height, filename, and has_object, which flags whether the row holds a box.

import json
import pandas as pd

with open('/content/people_with_smartphone-4/train/_annotations.coco.json', 'r') as file:
    data = json.load(file)

df_images = pd.DataFrame(data['images'])[['id', 'file_name']]
df_annotations = pd.DataFrame(data['annotations'])[['id', 'image_id', 'bbox']]

# The left join keeps images that have no annotation (negative samples).
merged_df = df_images.merge(df_annotations, left_on='id', right_on='image_id', how='left')
merged_df = merged_df.drop(columns=['image_id', 'id_y', 'id_x'])
merged_df = merged_df.rename(columns={'bbox': 'bbox_values', 'file_name': 'filename'})

# A missing bbox (NaN after the left join) or an all-zero bbox means "no object".
def has_box(bbox):
    return int(isinstance(bbox, list) and bbox != [0, 0, 0, 0])

merged_df['has_object'] = merged_df['bbox_values'].apply(has_box)

def process_bbox_values(row):
    # COCO boxes are [x, y, width, height]; negatives get an all-zero box.
    if row['has_object'] == 1:
        x, y, width, height = row['bbox_values']
    else:
        x, y, width, height = 0, 0, 0, 0

    return pd.Series([x, y, width, height])

merged_df[['x', 'y', 'width', 'height']] = merged_df.apply(process_bbox_values, axis=1)
df = merged_df[['x', 'y', 'width', 'height', 'filename', 'has_object']]

In [None]:
df

In [None]:
# Define the training and testing image directories.
train_images_dir = '/content/people_with_smartphone-4/train'
test_images_dir = '/content/people_with_smartphone-4/test/'

# Custom dataset for object detection: it loads images from a directory together
# with their annotations, applies optional transforms, and can display a sample
# with its bounding boxes.

class PersonDataset(torch.utils.data.Dataset):

    def __init__(self, files_dir, df,number_of_images_to_load, transforms=None):
        self.transforms = transforms
        self.files_dir = files_dir
        self.df = df
        self.number_of_images_to_load=number_of_images_to_load
         
        self.imgs = sorted([image for image in os.listdir(files_dir) if not image.endswith('.json')])[:number_of_images_to_load]
        
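        # Index 0 is reserved for the background class (the original `_` placeholder is only defined inside IPython).
        self.classes = ['__background__', 'person_lookingto_phone']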
         
    def __getitem__(self, idx):
      img_name = self.imgs[idx]
      image_path = os.path.join(self.files_dir, img_name)

      img = cv2.imread(image_path)
      img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)
      img_res = img_rgb / 255.0

      
      # Collect every annotated box for this image; an image may contain several people.
      rows = self.df[(self.df['filename'] == img_name) & (self.df['has_object'] == 1)]

      if not rows.empty:
          # Convert COCO [x, y, width, height] to [xmin, ymin, xmax, ymax].
          xmin = rows['x'].to_numpy(dtype=np.float32)
          ymin = rows['y'].to_numpy(dtype=np.float32)
          xmax = xmin + rows['width'].to_numpy(dtype=np.float32)
          ymax = ymin + rows['height'].to_numpy(dtype=np.float32)

          boxes = torch.as_tensor(np.stack([xmin, ymin, xmax, ymax], axis=1), dtype=torch.float32)
          labels = torch.ones((boxes.shape[0],), dtype=torch.int64)
          area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
          iscrowd = torch.zeros((boxes.shape[0],), dtype=torch.int64)
      else:
          # Negative sample (or image missing from the dataframe): no boxes.
          boxes = torch.zeros((0, 4), dtype=torch.float32)
          labels = torch.zeros((0,), dtype=torch.int64)
          area = torch.zeros((0,), dtype=torch.float32)
          iscrowd = torch.zeros((0,), dtype=torch.int64)

      target = {
          "boxes": boxes,
          "labels": labels,
          "area": area,
          "iscrowd": iscrowd,
          "image_id": torch.tensor([idx]),
      }

      if self.transforms:
          sample = self.transforms(image=img_res, bboxes=target['boxes'], labels=labels)
          img_res = sample['image']
          target['boxes'] = torch.Tensor(sample['bboxes']) if len(sample['bboxes']) > 0 else torch.zeros((0, 4), dtype=torch.float32)

      return img_res, target

    def __len__(self):
        return len(self.imgs)

    def show_image_with_bbox(self, idx):
        img, target = self.__getitem__(idx)
        img = img * 255.0  # Undo normalization
        fig, ax = plt.subplots(1)
        ax.imshow(img.astype(np.uint8))

        for box in target['boxes']:
            xmin, ymin, xmax, ymax = box
            width = xmax - xmin
            height = ymax - ymin
            rect = patches.Rectangle((xmin, ymin), width, height, linewidth=1, edgecolor='r', facecolor='none')
            ax.add_patch(rect)

        plt.show()
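The dataset class stores boxes as corner coordinates, while COCO annotations give a top-left corner plus a width and height. The conversion can be sketched in isolation as follows; the helper names and the example box are illustrative and not part of the notebook.

```python
def coco_to_pascal_voc(bbox):
    """Convert a COCO box [x, y, width, height] to [xmin, ymin, xmax, ymax]."""
    x, y, width, height = bbox
    return [x, y, x + width, y + height]

def box_area(box):
    """Area of an [xmin, ymin, xmax, ymax] box."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)

voc_box = coco_to_pascal_voc([10, 20, 30, 40])
print(voc_box, box_area(voc_box))  # [10, 20, 40, 60] 1200
```

The corner format is the one torchvision's detection models expect for the `boxes` entry of a target, and it matches the `pascal_voc` format name used by albumentations.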




##2.2 Visualization



In [None]:
# Display one example to check that the boxes are drawn around the object.
dataset = PersonDataset(train_images_dir, df,50)
dataset.show_image_with_bbox(11) 


##2.3 Augmentations

In [None]:
# Apply augmentations. bbox_params tells albumentations to transform the boxes together with the image, so they stay aligned with the object.
def get_transform(train):
  if train:
    return A.Compose(
      [
        A.HorizontalFlip(p=0.5),  # pass p by keyword: the first positional argument is not p in every albumentations release
         
        ToTensorV2(p=1.0) 
      ],
      bbox_params={'format': 'pascal_voc', 'label_fields': ['labels']}
    )
  else:
    return A.Compose(
      [ToTensorV2(p=1.0)],
      bbox_params={'format': 'pascal_voc', 'label_fields': ['labels']}
    )
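What "keeping the boxes aligned" means for a horizontal flip can be shown with a small pixel-coordinate sketch. The function name and the example values are assumptions for illustration; albumentations does the equivalent work internally when `bbox_params` is supplied.

```python
def hflip_box(box, image_width):
    """Mirror an [xmin, ymin, xmax, ymax] box around the vertical image axis."""
    xmin, ymin, xmax, ymax = box
    # The old right edge becomes the new left edge, and vice versa.
    return [image_width - xmax, ymin, image_width - xmin, ymax]

flipped = hflip_box([100, 50, 300, 400], image_width=640)
print(flipped)  # [340, 50, 540, 400]
```

Flipping the image without flipping the box would leave the annotation on the wrong side of the picture, which is why the transform pipeline needs to know about the boxes.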

##2.4 Dataloaders



In [None]:
# Create DataLoader objects for training and testing. To keep CUDA memory usage low,
# only 50 images are loaded for training here. At the end of the notebook we load the fully
# trained models, which were trained on all of the images (more than a thousand):
# 10 epochs for Faster R-CNN and 150 epochs for the YOLO models.

number_of_images_to_load_train=50
number_of_images_to_load_test=250
 
dataset = PersonDataset(train_images_dir,df,number_of_images_to_load_train, transforms=get_transform(train=True))
# Note: df was built from the train annotations only, so test images whose file names
# are not in df are loaded without ground-truth boxes.
dataset_test = PersonDataset(test_images_dir,df,number_of_images_to_load_test, transforms=get_transform(train=False))

 
def collate_fn(batch):
    images, targets = zip(*batch)
    images = list(images)
    targets = list(targets)

    return images, targets

 
data_loader = torch.utils.data.DataLoader(
  dataset,
  batch_size=2 ,
  shuffle=True,
  num_workers=2,
  collate_fn=collate_fn,

)

data_loader_test = torch.utils.data.DataLoader(
  dataset_test,
  batch_size=2,
  shuffle=False,
  num_workers=2,
  collate_fn=collate_fn,

)
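
The custom `collate_fn` is needed because each image can have a different number of boxes, so the targets cannot be stacked into one tensor. The sketch below shows the same `zip(*batch)` idea on plain Python objects; the name `collate_detection_batch` and the sample data are hypothetical, with strings and lists standing in for image and box tensors.

```python
def collate_detection_batch(batch):
    """Regroup [(image, target), ...] into ([images], [targets]) without stacking."""
    images, targets = zip(*batch)
    return list(images), list(targets)


# Hypothetical samples with 1, 0 and 2 boxes respectively.
batch = [
    ("image_0", {"boxes": [[10, 20, 40, 80]]}),
    ("image_1", {"boxes": []}),
    ("image_2", {"boxes": [[5, 5, 30, 60], [50, 10, 90, 70]]}),
]

images, targets = collate_detection_batch(batch)
box_counts = [len(target["boxes"]) for target in targets]
print(box_counts)  # [1, 0, 2]
```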

##2.5 Pre-trained Model

In [None]:
#Load a pretrained Faster R-CNN model so that we do not start from scratch, then replace its box predictor with one sized for our classes.

def get_object_detection_model(num_classes):
   
  model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
  in_features = model.roi_heads.box_predictor.cls_score.in_features
  model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes) 

  return model

##2.6 Training

In [None]:
#Prepare for training: use the GPU as the device if one is available.
#There are 2 classes: one for the object itself and one for the background.

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
num_classes = 2 
model = get_object_detection_model(num_classes)
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

lr_scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer,
  step_size=3,
  gamma=0.1
)
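
As a rough illustration of the schedule configured above, the learning rate under a step schedule follows `base_lr * gamma ** (epoch // step_size)` when the scheduler is stepped once per epoch. The helper `step_lr` below is a hypothetical re-implementation of that formula for inspection only, not the PyTorch class.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in a given epoch for a step schedule stepped once per epoch."""
    return base_lr * gamma ** (epoch // step_size)


# Values for a 10-epoch run with the settings used in this notebook.
schedule = [step_lr(0.005, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With a single epoch the rate never decays; over 10 epochs it is divided by 10 after epochs 3, 6 and 9.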

In [None]:
#Train the model. The fully trained model that we load at the end of the notebook was trained
#for 10 epochs; here we train for just one epoch so that the CUDA memory does not fill up,
#because the notebook should run from top to bottom without intervention.

num_epochs=1

for epoch in range(num_epochs):

    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    lr_scheduler.step()
    evaluate(model, data_loader_test, device=device)


##2.7 Filtering the outputs

In [None]:
# Filter the outputs with non-maximum suppression (NMS). The model may predict several
# overlapping bounding boxes for the same object, so we first drop boxes below a score threshold
# and then use NMS with an IoU threshold to remove the redundant ones.

def apply_nms(orig_prediction, iou_thresh=0.3, score_thresh=0.7):
    
    mask = orig_prediction['scores'].cpu() > score_thresh
    filtered_boxes = orig_prediction['boxes'].cpu()[mask]
    filtered_scores = orig_prediction['scores'].cpu()[mask]
    filtered_labels = orig_prediction['labels'].cpu()[mask]

    keep = torchvision.ops.nms(filtered_boxes, filtered_scores, iou_thresh)

    # Build a new dict so that the model's original prediction is not modified in place.
    final_prediction = {
        'boxes': filtered_boxes[keep],
        'scores': filtered_scores[keep],
        'labels': filtered_labels[keep],
    }

    return final_prediction

def torch_to_pil(img):
  return torchtrans.ToPILImage()(img).convert('RGB')
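
For readers unfamiliar with non-maximum suppression, the following is a minimal pure-Python sketch of the greedy procedure. It is only meant to show the logic; the notebook itself relies on `torchvision.ops.nms`, and the helper names `iou` and `greedy_nms` and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thresh):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.95]
kept = greedy_nms(boxes, scores, iou_thresh=0.3)
print(kept)  # [2, 0]
```

The lower-scoring duplicate (index 1) is suppressed because its IoU with box 0 is well above the threshold.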


##2.8 Testing The Model

In [None]:
#We define two functions here: one draws the bounding boxes on an image,
#and the other displays images together with their predicted bounding boxes.

def plot_img_bbox(ax, img, target):
    ax.imshow(img)

    for box, score in zip(target['boxes'], target['scores']):
        x, y, width, height = box[0], box[1], box[2] - box[0], box[3] - box[1]

        rect = patches.Rectangle(
            (x, y),
            width, height,
            linewidth=2,
            edgecolor='r',
            facecolor='none'
        )
        ax.add_patch(rect)
        ax.text(x, y, f"{score:.2f}", fontsize=12, bbox=dict(facecolor='yellow', alpha=0.5))

 
def display_images(test_dataset, model, num_images, columns=2):
    rows = int(np.ceil(num_images / columns))
    fig, axes = plt.subplots(rows, columns, figsize=(3 * columns, 3 * rows))

     
    axes = axes.ravel()
    for ax in axes[num_images:]:
        fig.delaxes(ax)

    model.eval()

    sequential_indices = list(range(num_images))

    for i, sequential_index in enumerate(sequential_indices):
        img, _ = test_dataset[sequential_index]

        with torch.no_grad():
            prediction = model([img.to(device)])[0]

        nms_prediction = apply_nms(prediction, iou_thresh=0.01, score_thresh=0.01)
        plot_img_bbox(axes[i], torch_to_pil(img), nms_prediction)

    plt.tight_layout()
    plt.show()

In [None]:
display_images(dataset_test, model, num_images=4, columns=4)

#3. Hybrids of Traditional Approach and Neural Networks

In [None]:
#We have trained our Faster R-CNN model, but false positives can still occur.
#To remove them, we train an SVM on HOG features and use the pair as a secondary filter.
#HOG features are extracted from the image regions that the Faster R-CNN model detects,
#and these features are then fed to the SVM classifier to distinguish true positives from false positives.

##3.1 SVM and HOG

In [None]:
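#Define the HOG feature extraction functions.
def extract_hog_features(img, size=(64, 128)):
    # interpolation must be passed by keyword: the third positional argument of cv2.resize is dst.
    img_res = cv2.resize(img, size, interpolation=cv2.INTER_AREA)
    # Newer scikit-image releases replace multichannel=True with channel_axis=-1.
    features, _ = hog(img_res, orientations=9, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2), visualize=True, multichannel=True)
    return np.array(features)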
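
It can help to know how long the HOG descriptor is for these settings, since every sample fed to the SVM must have the same length. The sketch below computes it from the window size, cell size, block size and number of orientations, assuming blocks slide one cell at a time. The helper name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(width, height, orientations=9,
                       pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Number of values in a HOG descriptor with a block stride of one cell."""
    cells_x = width // pixels_per_cell[0]
    cells_y = height // pixels_per_cell[1]
    blocks_x = cells_x - cells_per_block[0] + 1
    blocks_y = cells_y - cells_per_block[1] + 1
    values_per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks_x * blocks_y * values_per_block


# 64x128 window: 8x16 cells, 7x15 blocks, 36 values per block.
print(hog_feature_length(64, 128))  # 3780
```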


In [None]:
def extract_hog_features_from_bbox(img, bbox):
    # Crop the detected region; bbox is (x_min, y_min, x_max, y_max) in pixels.
    x1, y1, x2, y2 = map(int, bbox)
    cropped_img = img[y1:y2, x1:x2]
    # Scale to [0, 1]; extract_hog_features resizes the crop to 64x128 itself.
    features = extract_hog_features(cropped_img / 255.0)
    return features


In [None]:
# Preparing the data to train svm classifier
X = []
y = []

for _, row in df.iterrows():
    img_name = row['filename']
    image_path = os.path.join(train_images_dir, img_name)
    img = cv2.imread(image_path)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)

    has_object = row["has_object"]
    if has_object:
        bbox = [row['x'], row['y'], row['x'] + row['width'], row['y'] + row['height']]
        features = extract_hog_features_from_bbox(img_rgb, bbox)
        X.append(features)
        y.append(1)
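    else:
        # Scale to [0, 1] as is done for the positive crops, so both classes share the same value range.
        X.append(extract_hog_features(img_rgb / 255.0))
        y.append(0)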




X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = np.array(X_train)
X_test = np.array(X_test)


clf = SVC(kernel='linear', C=1, probability=True)
clf.fit(X_train, y_train)
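
The split above holds out 20% of the samples, but the classifier is not scored on them in this cell. As a sketch of how that could be checked, the hypothetical helper below computes precision and recall from true and predicted labels. In the notebook, `y_pred` would come from `clf.predict(X_test)`; the labels used here are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = smartphone user, 0 = negative sample.
y_true_example = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred_example = [1, 1, 0, 0, 1, 1, 0, 0]
precision, recall = precision_recall(y_true_example, y_pred_example)
print(precision, recall)  # 0.75 0.75
```

Since the SVM serves as a false-positive filter, precision on the held-out split is the more relevant of the two numbers.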



In [None]:
#Define the function that applies the SVM to the Faster R-CNN detections in an image to filter out false positives.
def apply_svm_on_proposals(model, img, threshold=0.5):
    
    img_tensor = torch.tensor(img / 255, dtype=torch.float).permute(2, 0, 1).to(device)
    img_tensor = img_tensor.unsqueeze(0)

     
    with torch.no_grad():
        outputs = model(img_tensor)

     
    scores = outputs[0]['scores']
    boxes = outputs[0]['boxes']
    labels = outputs[0]['labels']

    filtered_indices = [idx for idx, score in enumerate(scores) if score > threshold]
    filtered_boxes = boxes[filtered_indices]
    filtered_labels = labels[filtered_indices]

     
    final_boxes = []
    final_labels = []
    final_scores = []

    for i, bbox in enumerate(filtered_boxes):
        hog_features = extract_hog_features_from_bbox(img, bbox)
        svm_pred = clf.predict([hog_features])
        svm_prob = clf.predict_proba([hog_features])

        if svm_pred[0] == 1:
            final_boxes.append(bbox.tolist())
            final_labels.append(filtered_labels[i].tolist())
            final_scores.append(svm_prob[0][1])

    return final_boxes, final_labels, final_scores


In [None]:
#We will show the images after svm is applied.
def show_images_SVM_HOG(test_dataset, model, num_images, columns ):
    rows = int(np.ceil(num_images / columns))
    fig, axes = plt.subplots(rows, columns, figsize=(3 * columns, 3 * rows))
   
    axes = axes.ravel()
    for ax in axes[num_images:]:
        fig.delaxes(ax)

    model.eval()

    for i in range(num_images):
        img, _ = test_dataset[i]

        # apply_svm_on_proposals runs the detector itself, so no separate forward pass is needed here.
        img_np = img.cpu().numpy().transpose((1, 2, 0)) * 255
        boxes, labels, scores = apply_svm_on_proposals(model, img_np, threshold=0.5)

        svm_filtered_prediction = {
            'boxes': torch.tensor(boxes),
            'labels': torch.tensor(labels, dtype=torch.int64),
            'scores': torch.tensor(scores)
        }

        plot_img_bbox(axes[i], torch_to_pil(img), svm_filtered_prediction)

    plt.tight_layout()
    plt.show()

In [None]:
show_images_SVM_HOG(dataset_test, model, num_images=4, columns=4)

#4. Yolov5

In [None]:
%cd /content/yolov5 

In [None]:
#Download the dataset from Roboflow in the format that YOLOv5 expects.

rf = Roboflow(api_key="7QnIUGHlluFrILkFBubf")
project = rf.workspace("adasd-ukmfp").project("people_with_smartphone")
dataset = project.version(3).download("yolov5")

##4.1 Training

In [None]:
# Train the YOLOv5 model. Here we train for a single epoch; the fully trained model used
# at the end of the notebook was trained for 150 epochs.

!python train.py --img 320 --epochs 1 --data people_with_smartphone-3/data.yaml --weights yolov5x.pt

##4.2 Testing

In [None]:
#We are testing our model on the test dataset
!python detect.py --weights runs/train/exp/weights/best.pt --img 640 --conf 0.25 --data people_with_smartphone-3/data.yaml --source people_with_smartphone-3/test/images

In [None]:
#Display some of the test images with the model's predictions.

image_directory = '/content/yolov5/runs/detect/exp'
all_files = os.listdir(image_directory)

# Sort the names so that the same images are shown on every run.
image_filenames = sorted(file for file in all_files if file.endswith(('.jpg', '.png')))

num_images = 4
# Slicing avoids an IndexError when fewer than num_images files were written.
for image_filename in image_filenames[:num_images]:
    img = mpimg.imread(f"{image_directory}/{image_filename}")
    plt.imshow(img)
    plt.show()


#5. Yolov8

In [None]:
%cd /content

In [None]:
#Download the dataset in YOLOv8 format. The Roboflow project object created in the YOLOv5 section is reused.

#rf = Roboflow(api_key="7QnIUGHlluFrILkFBubf")
#project = rf.workspace("adasd-ukmfp").project("people_with_smartphone")
dataset = project.version(3).download("yolov8")

##5.1 Training

In [None]:
# Load a pretrained YOLO model, as recommended for training in the official documentation.
model_yolov8 = YOLO('yolov8x.pt')

#Train the model for one epoch here. As mentioned earlier, the model that we use at the end of
#the notebook was trained for 150 epochs.
results = model_yolov8.train(data='/content/people_with_smartphone-3/data.yaml', epochs=1)

##5.2 Testing

In [None]:
#We are testing our trained model on test dataset
predictions=model_yolov8.predict('/content/people_with_smartphone-3/test/images', save=True, imgsz=640, conf=0.8)

In [None]:
#Display some of the images with the trained model's predictions.

image_directory = 'runs/detect/predict'
all_files = os.listdir(image_directory)

# Sort the names so that the same images are shown on every run.
image_filenames = sorted(file for file in all_files if file.endswith(('.jpg', '.png')))

num_images = 10
# Slicing avoids an IndexError when fewer than num_images files were written.
for image_filename in image_filenames[:num_images]:
    img = mpimg.imread(f"{image_directory}/{image_filename}")
    plt.imshow(img)
    plt.show()


#6. Importing Fully Trained Models

We trained Faster R-CNN for 10 epochs, and YOLOv5 and YOLOv8 for 150 epochs each. Since these models are large, we uploaded them to Dropbox, and we import them here because training them takes a long time: Faster R-CNN with the SVM takes more than two hours for 10 epochs, YOLOv5 takes 2.3 hours for 150 epochs, and YOLOv8 takes about 2.5 hours for 150 epochs.

In [None]:
#Download the trained models from Dropbox and unzip the archive.
!wget -O fully_trained_models.zip "https://www.dropbox.com/s/f8uc0p3j1yqrznp/full_training.zip?dl=0"
!unzip fully_trained_models.zip

##6.1 Loading FasterCNN

In [None]:
#We are loading fully trained model of fastercnn

full_model_fasterCNN = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=2, pretrained_backbone=True) 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 
model_load_path = "/content/faster_cnn_10_epochs.pth"    
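# map_location lets weights that were saved on a GPU load on a CPU-only machine as well.
full_model_fasterCNN.load_state_dict(torch.load(model_load_path, map_location=device))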
full_model_fasterCNN = full_model_fasterCNN.to(device)


##6.2 Defining Display Functions


In [None]:
#We redefine the functions used earlier so that they display random images
#and can be applied to the whole test dataset. The SVM classifier is now passed in as an argument.
#Reusing the earlier functions would avoid repeated code, but to avoid a CUDA out-of-memory error
#we had to define them again.

def apply_svm_on_proposals(model, clf, img, threshold=0.5):
 
    img_tensor = torch.tensor(img / 255, dtype=torch.float).permute(2, 0, 1).to(device)
    img_tensor = img_tensor.unsqueeze(0)
    with torch.no_grad():
        outputs = model(img_tensor)

    scores = outputs[0]['scores']
    boxes = outputs[0]['boxes']
    labels = outputs[0]['labels']

    filtered_indices = [idx for idx, score in enumerate(scores) if score > threshold]
    filtered_boxes = boxes[filtered_indices]
    filtered_labels = labels[filtered_indices]

    final_boxes = []
    final_labels = []
    final_scores = []

    for i, bbox in enumerate(filtered_boxes):
        hog_features = extract_hog_features_from_bbox(img, bbox)
        svm_pred = clf.predict([hog_features])
        svm_prob = clf.predict_proba([hog_features])

        if svm_pred[0] == 1:
            final_boxes.append(bbox.tolist())
            final_labels.append(filtered_labels[i].tolist())
            final_scores.append(svm_prob[0][1])

    return final_boxes, final_labels, final_scores


def show_random_images_SVM_HOG(test_dataset, model, svm_model, num_images, columns, random_seed=None):
    rows = int(np.ceil(num_images / columns))
    fig, axes = plt.subplots(rows, columns, figsize=(3 * columns, 3 * rows))

    axes = axes.ravel()
    for ax in axes[num_images:]:
        fig.delaxes(ax)

    model.eval()

    if random_seed is not None:
        random.seed(random_seed)

    selected_indices = random.sample(range(len(test_dataset)), num_images)

    for i, idx in enumerate(selected_indices):
        img, _ = test_dataset[idx]

        # apply_svm_on_proposals runs the detector itself, so no separate forward pass is needed here.
        img_np = img.cpu().numpy().transpose((1, 2, 0)) * 255
        boxes, labels, scores = apply_svm_on_proposals(model, svm_model, img_np, threshold=0.9)

        svm_filtered_prediction = {
            'boxes': torch.tensor(boxes),
            'labels': torch.tensor(labels, dtype=torch.int64),
            'scores': torch.tensor(scores)
        }

        plot_img_bbox(axes[i], torch_to_pil(img), svm_filtered_prediction)

    plt.tight_layout()
    plt.show()


def show_random_images_faster_cnn(test_dataset, model, num_images, columns=2, random_seed=None):
    rows = int(np.ceil(num_images / columns))
    fig, axes = plt.subplots(rows, columns, figsize=(3 * columns, 3 * rows))

    
    axes = axes.ravel()
    for ax in axes[num_images:]:
        fig.delaxes(ax)

    model.eval()

    if random_seed is not None:
        random.seed(random_seed)

    selected_indices = random.sample(range(len(test_dataset)), num_images)

    for i, idx in enumerate(selected_indices):
        img, _ = test_dataset[idx]

        with torch.no_grad():
            prediction = model([img.to(device)])[0]

        nms_prediction = apply_nms(prediction, iou_thresh=0.01, score_thresh=0.01)
        plot_img_bbox(axes[i], torch_to_pil(img), nms_prediction)

    plt.tight_layout()
    plt.show()

##6.3 Loading SVM

In [None]:
#We are loading svm classifier that we trained fully.
svm_model = load('/content/svm_classifier.joblib')

##6.4 Comparing CNN and Hybrid Approach

In [None]:
#Results before SVM and HOG
random_seed=42
num_images=10
columns=5
show_random_images_faster_cnn(dataset_test, full_model_fasterCNN, num_images=num_images, columns=columns,random_seed=random_seed)

In [None]:
#Results after SVM and HOG
show_random_images_SVM_HOG(dataset_test, full_model_fasterCNN,svm_model, num_images=num_images, columns=columns,random_seed=random_seed)
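
The before/after comparison only works if both cells show the same images, which is why both calls receive the same `random_seed`. A minimal sketch of that behaviour, using a hypothetical helper `pick_indices` and an assumed dataset size of 250:

```python
import random


def pick_indices(dataset_size, num_images, random_seed=None):
    """Sample distinct indices; the same seed yields the same selection."""
    if random_seed is not None:
        random.seed(random_seed)
    return random.sample(range(dataset_size), num_images)


first = pick_indices(250, 10, random_seed=42)
second = pick_indices(250, 10, random_seed=42)
print(first == second)  # True
```

Without a seed, the two display functions would generally pick different images and the comparison would no longer be like for like.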

##6.5 Loading Yolov5

In [None]:
#We are loading yolov5 model that we trained fully.

In [None]:
%cd yolov5

In [None]:
#Displaying the changes of metrics during training.
image = Image.open('/content/yolov5_150_results.png')
fig = plt.figure(figsize=(12, 6))

plt.imshow(image)
plt.show()

In [None]:
#We are making predictions on test dataset by using our fully trained yolov5 model.
!python detect.py --weights /content/yolov5_150_weights.pt --img 640 --conf 0.25 --data /content/people_with_smartphone-3/data.yaml --source /content/people_with_smartphone-3/test/images

In [None]:
#Display some of the images with the fully trained model's predictions.

image_directory = 'runs/detect/exp2'
all_files = os.listdir(image_directory)

# Sort the names so that the same images are shown on every run.
image_filenames = sorted(file for file in all_files if file.endswith(('.jpg', '.png')))

num_images = 5
# Slicing avoids an IndexError when fewer than num_images files were written.
for image_filename in image_filenames[:num_images]:
    img = mpimg.imread(f"{image_directory}/{image_filename}")
    plt.imshow(img)
    plt.show()


##6.6 Loading Yolov8

In [None]:
#We are loading yolov8 model that we trained fully.
model_full_yolov8 = YOLO('/content/yolov8_150_weights.pt')

In [None]:
#Displaying the changes of metrics during training.
image = Image.open('/content/yolov8_150_results.png')
fig = plt.figure(figsize=(12, 6))

plt.imshow(image)
plt.show()

##6.7 Testing Yolov8 Model On Video

In [None]:
#We are testing the fully trained yolov8 model on the video. 

video_file = '/content/text_walkers.mp4'
predictions=model_full_yolov8.predict(video_file, save=True, imgsz=640, conf=0.8)

In [None]:
#Resize the predicted video, re-encode it, and embed it in the notebook for playback.
video_file_predicted='/content/yolov5/runs/detect/predict/text_walkers.mp4'
clip = VideoFileClip(video_file_predicted)
clip_resized = clip.resize(height=360)  
temp_video_file = "/content/temp.mp4"
clip_resized.write_videofile(temp_video_file, codec="libx264")

with open(temp_video_file, "rb") as f:
    video_data = b64encode(f.read()).decode()

HTML(f"""
<video width="80%" controls>
  <source src="data:video/mp4;base64,{video_data}" type="video/mp4">
</video>
""")
