# 💻 You Only Look Once (YOLO) Implementation 
### Group: 4NN
#### Dharani Palanisamy (z5260276)
#### Faiyam Islam (z5258151) 
#### Pooja Saianand (z5312416)
#### Priya Nandyal (z5312288) 

This notebook extends the EDA of the x-ray images with analyses that were not covered in the EDA notebook, then uses the YOLO algorithm to compare the predicted and actual distributions of each abnormality. We will also analyse evaluation metrics such as mAP and IoU and draw a conclusion on the accuracy of this object detection method. 

We've used the following sources which assisted our approach for YOLO: 

https://www.kaggle.com/code/kimse0ha/vinbigdata-eda-infer-analysis-with-yolov5

https://www.kaggle.com/code/awsaf49/vinbigdata-cxr-ad-yolov5-14-class-infer/notebook

https://www.kaggle.com/code/mrutyunjaybiswal/vbd-chest-x-ray-abnormalities-detection-eda/notebook

# Importing Packages

In [1]:
import numpy as np # working with arrays in our images
import pandas as pd # dataframes for loading and manipulating the annotations
from sklearn.model_selection import GroupKFold # group-aware cross-validation splits
from tqdm.notebook import tqdm # progress bars
from glob import glob # used to return all file paths that match a specific pattern 
import shutil # used for file collection, including copying
import os # creating and removing directories and folders
import random # generate random numbers
import cv2 # imports name for opencv in Python
import matplotlib # colormaps and plotting utilities
import matplotlib.pyplot as plt # for plotting
from mpl_toolkits.axes_grid1 import ImageGrid # grid of image axes for the inference plots
import plotly.express as px # interactive bar and box plots
import seaborn as sns # colour palettes (built on matplotlib)
import torch # deep learning
from IPython.display import Image, clear_output # display images and clear cell output

# Importing Datasets and assigning directories

In [2]:
train_dir = '/kaggle/input/vinbigdata-512-image-dataset/vinbigdata/train'
weights_dir = '/kaggle/input/vinbigdata-cxr-ad-yolov5-14-class-train/yolov5/runs/train/exp/weights/best.pt'
train_df = pd.read_csv('../input/vinbigdata-512-image-dataset/vinbigdata/train.csv')
train_df.head()

# Background information of the train dataset

In [3]:
len(train_df) # Total length of train dataset

In [4]:
len(train_df.image_id.unique()) # number of unique images in train dataset

In [5]:
len(train_df) / len(train_df.image_id.unique()) # Average number of annotation rows per image

In [6]:
# Average number of abnormality annotations per image, excluding 'No finding' rows
len(train_df[train_df.class_id != 14]) / len(train_df[train_df.class_id != 14].image_id.unique())

# Distribution of each class chest abnormality

Let's revisit the label distribution of each disease across patients, with and without the class name 'No finding'.

In [7]:
# Before plotting our distributions, let's set a colour palette
color_palette = [px.colors.label_rgb(px.colors.convert_to_RGB_255(x)) 
                for x in sns.color_palette("plasma", 15)]

### Distribution of chest abnormalities

In [8]:
fig = px.bar(train_df.class_name.value_counts().sort_index(), 
             color = train_df.class_name.value_counts().sort_index().index,
             color_discrete_sequence = color_palette,
             title = "Distribution of chest abnormalities")
fig.update_layout(legend_title = "Disease names",
                  xaxis_title ="Diseases",
                  yaxis_title = "Count")
fig.show()

### Distribution of chest abnormalities (without 'No finding') 

In [9]:
fig = px.bar(train_df[train_df.class_id != 14].class_name.value_counts().sort_index(), 
             color = train_df[train_df.class_id!=14].class_name.value_counts().sort_index().index,
             color_discrete_sequence = color_palette,
             title = "Distribution of chest abnormalities (without 'No finding')")
fig.update_layout(legend_title = "Disease names",
                  xaxis_title = "Diseases",
                  yaxis_title = "Count")
fig.show()

# Bounding box placement in Heatmap 

In our EDA, we plotted the distribution of bounding box areas for each category. However, those plots only reflected the bounding box coordinates, so we could not visualise where each abnormality sits or how large it is relative to the image. Here, we generate heatmaps that accumulate the bounding boxes of each class on a grid with the average image dimensions. This gives us a good initial idea of bounding box placement and size. 
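
The accumulation idea can be sketched on a toy grid. This is a minimal illustration only: the 4 x 4 grid, the two boxes and the helper `accumulate_boxes` are made-up for the example and are not part of the notebook. Every cell covered by a box is incremented, so regions where boxes overlap end up with higher counts.

```python
def accumulate_boxes(height, width, boxes):
    """Count how many boxes cover each cell of a height x width grid.

    Each box is (x_min, y_min, x_max, y_max) in fractional coordinates (0 to 1).
    """
    grid = [[0] * width for _ in range(height)]
    for fx_min, fy_min, fx_max, fy_max in boxes:
        # Scale fractional coordinates to grid indices
        x0, x1 = int(fx_min * width), int(fx_max * width)
        y0, y1 = int(fy_min * height), int(fy_max * height)
        # Inclusive upper bound, clipped to the grid
        for y in range(y0, min(y1, height - 1) + 1):
            for x in range(x0, min(x1, width - 1) + 1):
                grid[y][x] += 1
    return grid


# Two hypothetical boxes that overlap in the centre of the grid
toy_boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)]
toy_heatmap = accumulate_boxes(4, 4, toy_boxes)
for row in toy_heatmap:
    print(row)
```

The rows of the grid correspond to the y-axis and the columns to the x-axis, which is the same convention an image array uses.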

In [10]:
boundingbox_df = train_df[train_df.class_id != 14].reset_index(drop = True) 

# Express each box corner as a fraction of the image width / height
boundingbox_df['frac_x_min'] = boundingbox_df.x_min / boundingbox_df.width
boundingbox_df['frac_y_min'] = boundingbox_df.y_min / boundingbox_df.height
boundingbox_df['frac_x_max'] = boundingbox_df.x_max / boundingbox_df.width
boundingbox_df['frac_y_max'] = boundingbox_df.y_max / boundingbox_df.height
boundingbox_df.head()

In [11]:
avg_width  = int(np.mean(boundingbox_df.width))
avg_height = int(np.mean(boundingbox_df.height))

# Rows index the y-axis (height) and columns the x-axis (width), matching the slicing below
heatmap_size = (avg_height, avg_width, 14)
heatmap = np.zeros(heatmap_size, dtype=np.int16)

bbox_np = boundingbox_df[["class_id", "frac_x_min", "frac_x_max", "frac_y_min", "frac_y_max"]].to_numpy()
bbox_np[:, 1:3] *= avg_width; bbox_np[:, 3:5] *= avg_height
bbox_np = np.floor(bbox_np).astype(np.int16)

label_dic = {i:train_df[train_df["class_id"] == i].iloc[0]["class_name"] for i in range(15)}

custom_cmaps = [matplotlib.colors.LinearSegmentedColormap.from_list(colors = [(0.,0.,0.), c, (0.95,0.95,0.95)], 
        name = f"custom_{i}") for i,c in enumerate(sns.color_palette("Spectral", 15))]
custom_cmaps.pop(8) # Drop one of the 15 colormaps so that 14 remain, one per abnormality class

for row in tqdm(bbox_np, total=bbox_np.shape[0]):
    heatmap[row[3]:row[4] + 1, row[1]:row[2] + 1, row[0]] += 1
    
fig = plt.figure(figsize = (20,25))
plt.suptitle("Heatmaps of Bounding Box Placement ", fontsize = 20)
for i in range(15):
    plt.subplot(4, 4, i + 1)
    if i == 0:
        plt.imshow(heatmap.mean(axis=-1), cmap = "bone")
        plt.title("Average of All Classes", fontweight = "bold")
    else:
        plt.imshow(heatmap[:, :, i-1], cmap=custom_cmaps[i - 1])
        plt.title(f"{label_dic[i - 1]} – id : {i - 1}", fontweight = "bold") # class id is i - 1 because subplot 0 holds the average
        
    plt.axis(False)
fig.tight_layout(rect = [0, 0.03, 1, 0.97])
plt.show()

# Percentage of area of Bounding boxes in each image

We can now quantify the size of each abnormality class as a fraction of the image area, which is more accurate and effective than estimating it by inspecting the DICOM images. 
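
As a small worked example of the quantity plotted below, the fraction of an image covered by a box is the product of its fractional width and fractional height. The image size and box coordinates here are hypothetical, and `fractional_bbox_area` is a helper introduced only for this illustration.

```python
def fractional_bbox_area(x_min, y_min, x_max, y_max, width, height):
    """Return the share of the image area covered by a box (0 to 1)."""
    frac_w = (x_max - x_min) / width
    frac_h = (y_max - y_min) / height
    return frac_w * frac_h


# Hypothetical 2000 x 2500 pixel image containing a 500 x 500 pixel box
area = fractional_bbox_area(100, 200, 600, 700, width=2000, height=2500)
print(f"{area:.3f}")  # 0.25 of the width times 0.20 of the height
```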

In [12]:
boundingbox_df["frac_bbox_area"] = (boundingbox_df["frac_x_max"] - boundingbox_df["frac_x_min"]) * (boundingbox_df["frac_y_max"] - boundingbox_df["frac_y_min"])
fig = px.box(boundingbox_df.sort_values(by = "class_name"), x = "class_name", y = "frac_bbox_area", color = "class_name", notched = True,
             color_discrete_sequence = color_palette, 
             labels = {"class_name": "Class Name", "frac_bbox_area" : "BBox Area (fraction of image)"},
             title = "Percentage of bounding box on each image")

fig.update_layout(showlegend = True,
                  yaxis_range = [-0.025,0.40],
                  legend_title_text = None,
                  xaxis_title = "",
                  yaxis_title = "")
fig.show()

# Validation Set with GroupKFold (5 folds) 
Creating a validation set will assist in selecting and tuning the YOLO model. Using GroupKFold() with image_id as the group ensures that the same image is not represented in both the validation and training sets. The code below creates 5 folds (n_splits = 5) and holds out the fold with index 4 for validation. 
Source: https://www.kaggle.com/code/reighns/groupkfold-and-stratified-groupkfold-efficientnet/notebook
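
To illustrate what a group-aware split guarantees, here is a simplified pure-Python sketch. It assigns groups to folds round-robin, which is an assumption made for brevity and not scikit-learn's actual balancing strategy; `assign_group_folds` and the image ids are hypothetical.

```python
def assign_group_folds(groups, n_folds):
    """Assign every row to a fold so that rows sharing a group share a fold.

    Simplified round-robin illustration; scikit-learn's GroupKFold
    balances fold sizes differently.
    """
    fold_of_group = {}
    for group in groups:
        if group not in fold_of_group:
            fold_of_group[group] = len(fold_of_group) % n_folds
    return [fold_of_group[g] for g in groups]


# Hypothetical image ids: several annotation rows can belong to one image
image_ids = ["img_a", "img_a", "img_b", "img_c", "img_c", "img_c", "img_d"]
folds = assign_group_folds(image_ids, n_folds=2)

val_fold = 1
train_ids = {g for g, f in zip(image_ids, folds) if f != val_fold}
val_ids = {g for g, f in zip(image_ids, folds) if f == val_fold}

print(folds)
print(train_ids.isdisjoint(val_ids))  # no image appears on both sides
```

Because the split is made per image rather than per annotation row, all boxes drawn on one x-ray stay together, which avoids leaking an image from training into validation.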

In [13]:
dimension = 512
# Justification of using 512 
# Essentially this is the size of the images. We believed this was the best one to use because there is a good balance between 
# the quality of the images and the time it takes to import the datasets.
# Here is a useful source we looked into deciding which one was more effective for YOLO: 
# https://www.kaggle.com/code/seokhyunseo/256-vs-512-vs-1024-which-dataset-is-useful 
train_df['image_path'] = f'/kaggle/input/vinbigdata-{dimension}-image-dataset/vinbigdata/train/' + train_df.image_id + ('.png' if dimension != 'original' else '.jpg')
train_df.head()

Adding an image path to each row links the annotation features (x_min, y_min, x_max, y_max and class_name) to the image file they describe, which avoids confusion when several rows belong to the same image. 

In [14]:
# Drop all the x-rays that do not contain any abnormality.
train_df = train_df[train_df.class_id != 14].reset_index(drop = True)

fold = 4 # index of the fold held out for validation
gkf  = GroupKFold(n_splits = 5)
train_df['fold'] = -1 # adds a fold column to the end of the dataframe

# Use a separate loop variable so that `fold` keeps the validation fold index
for fold_idx, (train_idx, val_idx) in enumerate(gkf.split(train_df, groups = train_df.image_id.tolist())):
    train_df.loc[val_idx, 'fold'] = fold_idx
    
val_df = train_df[train_df['fold'] == fold]
val_df.head()

In [15]:
train_files = []
val_files   = [] 
val_files += list(train_df[train_df.fold == fold].image_path.unique())
train_files += list(train_df[train_df.fold != fold].image_path.unique())
print(len(train_files)) # number of training images (all folds except the held-out one)
print(len(val_files)) # number of validation images (the held-out fold)

The training set contains the images from every fold except the held-out one, while the validation set contains only the images from the held-out fold, hence the validation set is smaller at 879 files. 

In [16]:
os.makedirs('/kaggle/working/vinbigdata/labels/train', exist_ok = True)
os.makedirs('/kaggle/working/vinbigdata/labels/val', exist_ok = True)
os.makedirs('/kaggle/working/vinbigdata/images/train', exist_ok = True)
os.makedirs('/kaggle/working/vinbigdata/images/val', exist_ok = True)
label_dir = '/kaggle/input/vinbigdata-yolo-labels-dataset/labels'

for file in train_files:
    shutil.copy(file, '/kaggle/working/vinbigdata/images/train')
    filename = file.split('/')[-1].split('.')[0]
    shutil.copy(os.path.join(label_dir, filename+'.txt'), '/kaggle/working/vinbigdata/labels/train')
    
for file in val_files:
    shutil.copy(file, '/kaggle/working/vinbigdata/images/val')
    filename = file.split('/')[-1].split('.')[0]
    shutil.copy(os.path.join(label_dir, filename+'.txt'), '/kaggle/working/vinbigdata/labels/val')
    
val_dir = '/kaggle/working/vinbigdata/images/val'

# Setting up YOLOv5

In [19]:
shutil.copytree('/kaggle/input/yolov5-official-v31-dataset/yolov5', '/kaggle/working/yolov5')
os.chdir('/kaggle/working/yolov5') 
clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

Below we run inference on the validation set of 879 images with an image resolution of 640x640, a confidence threshold of 0.15 and an IoU (Intersection over Union) threshold of 0.4, with the source set to the validation directory. We set the IoU threshold to 0.4 rather than 0.5 or higher because we want to retain as many correctly classified detections as possible, so that we can not only classify the chest abnormalities correctly but also showcase them through data visualisation in charts. 
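
The role of the two thresholds can be sketched in plain Python. This is a simplified, single-class illustration and not YOLOv5's own implementation. It assumes the IoU value acts as the overlap limit when duplicate boxes are suppressed (non-maximum suppression); `iou`, `greedy_nms` and the toy detections are hypothetical names and values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, conf_thres=0.15, iou_thres=0.4):
    """Keep the highest-scoring boxes, dropping overlaps above iou_thres.

    detections: list of (score, box) tuples.
    """
    # Discard low-confidence boxes, then visit the rest from best to worst
    candidates = sorted((d for d in detections if d[0] >= conf_thres),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        if all(iou(box, kept_box) <= iou_thres for _, kept_box in kept):
            kept.append((score, box))
    return kept


toy_detections = [
    (0.90, (10, 10, 110, 110)),
    (0.60, (20, 20, 120, 120)),    # overlaps heavily with the first box
    (0.50, (200, 200, 300, 300)),  # separate region
    (0.10, (400, 400, 450, 450)),  # below the confidence threshold
]
kept = greedy_nms(toy_detections)
print([score for score, _ in kept])
```

Under this reading, lowering the IoU threshold removes more overlapping boxes, while lowering the confidence threshold lets more low-scoring boxes through.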

It is also important to note that, unlike the code in the EDA section, which converts the DICOM x-ray files to NumPy arrays, the 512 image dataset from VinBigData already provides the images as png. This is confirmed by the output below, which lists all 879 png images together with their resolutions and the detected disease class names. 

In [23]:
# --img 640: image resolution of 640 x 640
# --conf 0.15: confidence threshold
# --iou 0.4: Intersection over Union threshold, more information on this can be found in the methodology of report.pdf
# --source: set source to the validation directory
!python detect.py --weights $weights_dir \
--img 640 \
--conf 0.15 \
--iou 0.4 \
--source $val_dir \
--save-txt --save-conf --exist-ok

# Inference plots of object detection using YOLO

It is important to note that the images shown are randomly sampled from the detection output, so running the cell below again will display a different selection. The output does not guarantee a correct prediction, because there are instances where the IoU value is less than 0.5. Further inference and explanation of the output is given in the results section of report.pdf.
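
If a repeatable selection is wanted, one option is to seed the random number generator before sampling. The sketch below uses made-up file names in place of the detection output and is only an illustration of the idea, not something the notebook does.

```python
import random

# Hypothetical file names standing in for the detection output images
toy_files = [f"image_{i:03d}.png" for i in range(100)]

random.seed(42)  # fix the generator state
first_draw = random.sample(toy_files, 15)

random.seed(42)  # same seed, same state
second_draw = random.sample(toy_files, 15)

print(first_draw == second_draw)  # True: both draws pick the same 15 files
```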


In [24]:
files = glob('runs/detect/exp/*png')
for _ in range(1):
    row = 3; col = 5 # we get 15 images here
    grid_files = random.sample(files, row*col)
    images     = []
    for image_path in tqdm(grid_files):
        img          = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
        images.append(img)

    fig = plt.figure(figsize=(col*5, row*5))
    grid = ImageGrid(fig, 111,  # similar to subplot(111)
                     nrows_ncols=(row, col),  # creates a 3x5 grid of axes
                     axes_pad=0.10,  # pad between axes in inches
                     )

    for ax, im in zip(grid, images):
        # Iterating over the grid returns the Axes.
        ax.imshow(im)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()

In [25]:
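def yolo2voc(image_height, image_width, bboxes):
    # Convert normalised YOLO boxes (x_center, y_center, width, height)
    # to pixel VOC boxes (x_min, y_min, x_max, y_max)
    bboxes = bboxes.copy().astype(float)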
    bboxes[..., [0, 2]] = bboxes[..., [0, 2]] * image_width
    bboxes[..., [1, 3]] = bboxes[..., [1, 3]] * image_height
    bboxes[..., [0, 1]] = bboxes[..., [0, 1]] - bboxes[..., [2, 3]] / 2
    bboxes[..., [2, 3]] = bboxes[..., [0, 1]] + bboxes[..., [2, 3]]
    return bboxes
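
To make the conversion concrete, here is a scalar version of the same arithmetic applied to a single box. `yolo_to_voc_single` is a hypothetical helper written only for this illustration, and the image size and box are made-up values.

```python
def yolo_to_voc_single(image_height, image_width, box):
    """Convert one normalised (x_center, y_center, w, h) box to pixel
    (x_min, y_min, x_max, y_max)."""
    x_c, y_c, w, h = box
    # Scale from fractions of the image to pixels
    x_c, w = x_c * image_width, w * image_width
    y_c, h = y_c * image_height, h * image_height
    # Shift from the centre to the top-left corner, then add the size
    x_min, y_min = x_c - w / 2, y_c - h / 2
    return (x_min, y_min, x_min + w, y_min + h)


# A box centred in a 2000 x 1000 (width x height) image,
# covering a quarter of the width and half of the height
voc_box = yolo_to_voc_single(1000, 2000, (0.5, 0.5, 0.25, 0.5))
print(voc_box)  # (750.0, 250.0, 1250.0, 750.0)
```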

In [26]:
image_ids = []; classes = []; scores = []
x_min = []; y_min = []; x_max = []; y_max = []

for file_path in glob('runs/detect/exp/labels/*txt'):
    image_id = file_path.split('/')[-1].split('.')[0]
    w, h = val_df[val_df.image_id == image_id][['width', 'height']].values[0]
    with open(file_path, 'r') as f: # close the label file once it has been read
        data = np.array(f.read().replace('\n', ' ').strip().split(' ')).astype(np.float32).reshape(-1, 6)
    data = data[:, [0, 5, 1, 2, 3, 4]] # reorder to class, confidence, x_center, y_center, width, height
    bboxes = list(np.concatenate((data[:, :2], np.round(yolo2voc(h, w, data[:, 2:]))), axis = 1).reshape(-1))
    for i in range(len(bboxes) // 6):
        image_ids.append(image_id)
        classes.append(int(bboxes[i * 6]))
        scores.append(float(bboxes[i * 6 + 1])) # keep the confidence as a float; int() would truncate it to 0
        x_min.append(int(bboxes[i * 6 + 2]))
        y_min.append(int(bboxes[i * 6 + 3]))
        x_max.append(int(bboxes[i * 6 + 4]))
        y_max.append(int(bboxes[i * 6 + 5]))
        
pred_df = pd.DataFrame({'image_id' : image_ids,'classes' : classes,'scores' : scores,'x_min' : x_min,'y_min' : y_min,'x_max' : x_max,'y_max' : y_max})
pred_df['class_name'] = pred_df.classes.apply(lambda x : label_dic[x])

## Distribution of actual vs predicted

In [27]:
fig = px.bar(val_df.class_name.value_counts().sort_index(), 
             color = val_df.class_name.value_counts().sort_index().index,
             color_discrete_sequence = color_palette,
             title = "Actual Distribution of Validation Dataset")
fig.update_layout(legend_title = "Disease names",
                  xaxis_title = "Diseases",
                  yaxis_title = "Count")
fig.show()

## Distribution of predicted validation dataset

In [28]:
fig = px.bar(pred_df.class_name.value_counts().sort_index(), 
             color = pred_df.class_name.value_counts().sort_index().index,
             color_discrete_sequence = color_palette,
             title = "Predicted Distribution of Validation Dataset")
fig.update_layout(legend_title = "Disease names",
                  xaxis_title = "Diseases",
                  yaxis_title = "Count")
fig.show()

## Distribution of difference in the actual and predicted dataset

In [29]:
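# Align both counts on class name; a class missing from either side counts as 0 rather than NaN
difference = val_df.class_name.value_counts().sort_index().sub(pred_df.class_name.value_counts().sort_index(), fill_value = 0)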
fig = px.bar(difference, 
             color = difference.index,
             color_discrete_sequence = color_palette,
             title = "The Difference between Actual Distribution and Predicted Distribution")
fig.update_layout(legend_title = "Disease names",
                  xaxis_title = "Diseases",
                  yaxis_title = "Count")
fig.show()

This is the end of the notebook. The next step is to calculate mAP (mean average precision) on the dataset and then analyse the PR curves, which show how precision and recall trade off at different thresholds. 
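
As a preview of that trade-off, the sketch below computes precision and recall at a few confidence thresholds. It is a minimal illustration under stated assumptions: the scored predictions and ground-truth count are made up, each prediction is assumed to be already marked as a true or false positive, and `precision_recall` is a hypothetical helper rather than the evaluation code used for the report.

```python
def precision_recall(scored_predictions, n_ground_truth, threshold):
    """Precision and recall when keeping predictions scored >= threshold.

    scored_predictions: list of (score, is_true_positive) pairs.
    """
    kept = [is_tp for score, is_tp in scored_predictions if score >= threshold]
    true_pos = sum(kept)
    # Convention assumed here: precision is 1.0 when nothing is predicted
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / n_ground_truth if n_ground_truth else 0.0
    return precision, recall


# Hypothetical predictions for one class, with 4 ground-truth boxes in total
toy_predictions = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
n_gt = 4
for thr in (0.15, 0.5, 0.85):
    p, r = precision_recall(toy_predictions, n_gt, thr)
    print(f"threshold={thr:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold keeps fewer but more reliable predictions, so precision tends to rise while recall falls. A PR curve traces this relationship across all thresholds.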