# Universidade Federal do Rio Grande do Norte


## Programa de Pós-Graduação em Engenharia Elétrica e de Computação
## EEC1509 - Aprendizagem de Máquina


# Group

## João Lucas Correia Barbosa de Farias

## Júlio Freire Peixoto Gomes


# Project 2 - Traffic Sign Recognition


## About the Project
This project is divided into 6 files, including this one, each representing one step in the process of deploying a machine learning algorithm. In this case, we chose a neural network as the classifier. The goal is to explore learning, generalization and batch-normalization techniques and compare the results.

The dataset has over 50k images of traffic signs. Our goal is to predict which sign a specific image refers to.


### The details about the dataset are shown below.

The German Traffic Sign Benchmark is a multi-class, single-image classification challenge held at the International Joint Conference on Neural Networks (IJCNN) 2011.

*   Single-image, multi-class classification problem
*   More than 40 classes
*   More than 50,000 images in total
*   Large, lifelike database

For more information, visit:

https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign

Also, each class has a respective shape ID, color ID and sign ID. They are described as follows:



1.   Shape ID
  *   0: triangle
  *   1: circle
  *   2: diamond
  *   3: hexagon
  *   4: inverse-triangle
2.   Color ID
  *   0: red
  *   1: blue
  *   2: yellow
  *   3: white
3.   Sign ID
  *   float: value according to Ukrainian Traffic Rule
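As a small illustration, the ID scheme above can be held in two lookup tables. This is only a sketch: it assumes shape IDs index geometric shapes and color IDs index colors, and the names `SHAPE_NAMES`, `COLOR_NAMES` and `describe_sign` are hypothetical helpers, not part of the dataset or of this project.

```python
# Lookup tables for the shape and color IDs (assumed mapping, see lead-in).
SHAPE_NAMES = {
    0: "triangle",
    1: "circle",
    2: "diamond",
    3: "hexagon",
    4: "inverse-triangle",
}
COLOR_NAMES = {0: "red", 1: "blue", 2: "yellow", 3: "white"}


def describe_sign(shape_id, color_id):
    """Return a short human-readable description such as 'red circle'."""
    return f"{COLOR_NAMES[color_id]} {SHAPE_NAMES[shape_id]}"


print(describe_sign(shape_id=1, color_id=0))
```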

## The dataset was taken from Kaggle:
https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign

# 1.0 Install and Load Libraries


In [None]:
%%capture
!pip install wandb

In [None]:
import wandb

# 2.0 Fetch Data

In this first step, the raw data from the dataset is uploaded to wandb. This way, in the following steps, we are able to communicate with wandb and retrieve the dataset.

First, we import 'h5py' for writing HDF5 files, 'numpy' for array operations, 'os' for path operations, 'PIL' for reading images and 'pandas' for handling the label table.

In [None]:
import h5py
import numpy as np
import os
from PIL import Image
import pandas as pd

Since our data is stored in Google Drive, we mounted the drive into Colab and used this to gain access to the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


After mounting the drive, we located the path to the dataset. To do this, open the Files panel in the toolbar on the left of the screen, find the folder you want, click the three dots (...) to the right of its name and choose 'Copy path'. We used this path to set the following variables.

In [None]:
# After uploading the data to your Drive and mounting it to Colab, use the path
# to the folder with the data to create the following variables.

path_to_data = 'path_to_data'
path_to_train = os.path.join(path_to_data, 'Train')
path_to_test = os.path.join(path_to_data, 'Test')

The ratio defined here reduces the number of images used in our classification problem. The full dataset has over 50k images, which makes it hard to train models and tune hyperparameters on the free version of Google Colab. Besides, using about 15k images still gives good results.

In [None]:
# The ratio was calculated with the goal of keeping about 15k of the total 50k 
# images, this way the dataset was reduced to 30% of its original size

ratio = 15/50
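Because each image is kept by an independent random draw, the ratio gives roughly, not exactly, 30% of the images, and a different subset on every run. The sketch below illustrates this with the standard library only. The function name `subsample`, the fixed `seed` and the fake file names are assumptions made for the example; the notebook itself does not set a seed.

```python
import random


def subsample(names, ratio, seed=0):
    """Keep each name with probability `ratio`, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [name for name in names if rng.random() < ratio]


# Hypothetical file names standing in for the ~50k images of the dataset.
names = [f"img_{i:05d}.png" for i in range(50_000)]
kept = subsample(names, ratio=15 / 50)

# The count lands near 15,000 but is not exactly 15,000.
print(len(kept))
```

Fixing the seed makes the selected subset repeatable, which can help when the same reduced dataset has to be rebuilt later.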

We decided to use HDF5 to bundle the images into a single file. This makes it easier to upload the data to W&B. It also makes more sense when designing the pipeline, as a single HDF5 file can be used as the input.

First, we created an HDF5 file for the train set. Here 'ratio' was used to select only about 30% of the images for the HDF5 file. The label of each selected image was appended to an array.

In [None]:
image_labels = []

# Each class has its own sub-folder in Train, named after its integer label
NUM_LABELS = len(os.listdir(path_to_train))

path_to_train_hdf5 = 'raw_data_train.h5'

# Mode 'w' starts from an empty file, so re-running this cell does not fail
# on dataset names that already exist (as append mode 'a' would)
with h5py.File(path_to_train_hdf5, 'w') as hf:
  for label in range(NUM_LABELS):
    folder = os.path.join(path_to_train, str(label))
    for img in os.listdir(folder):
      # Keep each image with probability 'ratio'
      if np.random.rand() < ratio:
        img_name = os.path.join(folder, img)
        img_array = np.array(Image.open(img_name))
        # One HDF5 dataset per image, keyed by the image file name
        hf.create_dataset(img, data=img_array)
        image_labels.append(label)

image_labels = np.array(image_labels)

print(f"Size of HDF5 file: {os.path.getsize(path_to_train_hdf5)}")
print(f"image_labels.shape: {image_labels.shape}")
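To show how images stored this way can be read back, here is a minimal round-trip sketch with h5py. The tiny arrays, their file names and the temporary file are made up for the example and stand in for real images. Each image is its own dataset, so images of different sizes can share one file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-ins for real images: small RGB arrays keyed by file name.
fake_images = {
    "00001_00000_00000.png": np.full((4, 4, 3), 1, dtype=np.uint8),
    "00000_00000_00000.png": np.full((5, 3, 3), 0, dtype=np.uint8),
}

h5_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write one dataset per image; mode 'w' always starts from an empty file.
with h5py.File(h5_path, "w") as hf:
    for name, array in fake_images.items():
        hf.create_dataset(name, data=array)

# Read everything back into memory as NumPy arrays.
with h5py.File(h5_path, "r") as hf:
    stored_names = list(hf.keys())
    restored = {name: hf[name][()] for name in stored_names}

print(stored_names)
print({name: array.shape for name, array in restored.items()})
```

By default h5py does not promise to list names in insertion order, so it is safer to keep each label tied to its image name than to rely on position alone.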

Next, we created the HDF5 file for the test set. Again, 'ratio' was used to select only a portion of the total number of images. We have to make sure to export an array with the corresponding label for each image in the test set. This way, we will be able to evaluate our model later. For this, we use the 'Test.csv' that comes with the dataset and look for the 'ClassId' column.

In [None]:
path_to_test_labels = os.path.join(path_to_data, 'Test.csv')
df_test_labels = pd.read_csv(path_to_test_labels)

In [None]:
path_to_test_hdf5 = 'raw_data_test.h5'

test_labels_1 = []  # class IDs
test_labels_2 = []  # image file names

# Mode 'w' starts from an empty file, so re-running this cell is safe
with h5py.File(path_to_test_hdf5, 'w') as hf:
  folder = path_to_test
  images = os.listdir(folder)
  for img in images:
    if np.random.rand() < ratio:
      # Match the 'Path' column of Test.csv, built here as 'Test/<file name>'
      Class_ID_name = os.path.join('Test', img)
      df_img = df_test_labels[df_test_labels['Path'] == Class_ID_name]
      # Take the single matching value explicitly instead of casting a Series
      Class_ID = int(df_img['ClassId'].iloc[0])
      test_labels_1.append(Class_ID)
      test_labels_2.append(img)
      img_name = os.path.join(folder, img)
      img_array = np.array(Image.open(img_name))
      hf.create_dataset(img, data=img_array)

# Combine labels and file names into one table with columns 'label' and 'path'
test_labels_1 = pd.DataFrame(test_labels_1, columns=['label'])
test_labels_2 = pd.DataFrame(test_labels_2, columns=['path'])
test_labels = pd.concat(objs=[test_labels_1, test_labels_2], axis=1)

print(f"Size of HDF5 file: {os.path.getsize(path_to_test_hdf5)}")
print(f"test_labels.shape: {test_labels.shape}")

Size of HDF5 file: 36544546
test_labels.shape: (3722, 2)
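Filtering the label table once per image works, but it scans the whole table each time. An alternative is to build a path-to-label dictionary once and look each image up directly. The sketch below uses only the standard library; the CSV rows are made up in the layout of the two columns used here, and `label_for` is a hypothetical helper.

```python
import csv
import io

# Made-up rows with the two columns the lookup needs ('ClassId' and 'Path').
sample_csv = io.StringIO(
    "ClassId,Path\n"
    "16,Test/00000.png\n"
    "1,Test/00001.png\n"
    "38,Test/00002.png\n"
)

# Build the mapping once: 'Test/<file name>' -> integer class ID.
label_by_path = {
    row["Path"]: int(row["ClassId"]) for row in csv.DictReader(sample_csv)
}


def label_for(img_name):
    """Return the class ID for a file name found in the Test folder."""
    return label_by_path[f"Test/{img_name}"]


print(label_for("00001.png"))
```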


In [None]:
test_labels.head()

|   | label | path      |
|---|-------|-----------|
| 0 | 3     | 11779.png |
| 1 | 39    | 11272.png |
| 2 | 23    | 10811.png |
| 3 | 12    | 10649.png |
| 4 | 4     | 10880.png |


Now, we send the HDF5 files to W&B as artifacts. The labels are also exported to CSV files and uploaded to W&B.

In [None]:
image_labels.tofile('raw_data_train_labels.csv', sep=',')
test_labels.to_csv('raw_data_test_labels.csv', index=False)

In [None]:
# login to wandb account
!wandb login --relogin

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# upload file raw_data_train.h5 (dataset) to wandb under the
# project called traffic_sign_recognition

!wandb artifact put \
      --name ppgeec-ml-jj/traffic_sign_recognition/raw_data_train.h5 \
      --type raw_data \
      --description "Raw data train (HDF5 file) from Traffic Sign Recognition Dataset (without labels)" raw_data_train.h5
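Once an artifact is uploaded, a later step can retrieve it through the W&B Python API. The sketch below shows one way this might look. The helper names `artifact_ref` and `download_artifact` and the `job_type` value are assumptions for illustration, and the download step needs a logged-in W&B session, so it is only defined here and not called.

```python
def artifact_ref(entity, project, name, alias="latest"):
    """Build the 'entity/project/name:alias' string that W&B uses for artifacts."""
    return f"{entity}/{project}/{name}:{alias}"


def download_artifact(entity, project, name, job_type="fetch_data"):
    """Download an artifact inside a run and return the local directory path."""
    import wandb  # imported lazily so artifact_ref works without wandb installed

    run = wandb.init(entity=entity, project=project, job_type=job_type)
    artifact = run.use_artifact(artifact_ref(entity, project, name))
    local_dir = artifact.download()
    run.finish()
    return local_dir


print(artifact_ref("ppgeec-ml-jj", "traffic_sign_recognition", "raw_data_train.h5"))
```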

In [None]:
# upload file raw_data_train_labels.csv (dataset) to wandb under the
# project called traffic_sign_recognition

!wandb artifact put \
      --name ppgeec-ml-jj/traffic_sign_recognition/raw_data_train_labels.csv \
      --type raw_data \
      --description "Raw data train from Traffic Sign Recognition Dataset (only labels)" raw_data_train_labels.csv

In [None]:
# upload file raw_data_test.h5 (dataset) to wandb under the
# project called traffic_sign_recognition

!wandb artifact put \
      --name ppgeec-ml-jj/traffic_sign_recognition/raw_data_test.h5 \
      --type raw_data \
      --description "Raw data test (HDF5 file) from Traffic Sign Recognition Dataset" raw_data_test.h5

wandb: Uploading file raw_data_test.h5 to: "ppgeec-ml-jj/traffic_sign_recognition/raw_data_test.h5:latest" (raw_data)
wandb: Currently logged in as: jotafarias (ppgeec-ml-jj). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /content/wandb/run-20220726_030111-jx6wl53k
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run stellar-vortex-302
wandb: ⭐️ View project at https://wandb.ai/ppgeec-ml-jj/traffic_sign_recognition
wandb: 🚀 View run at https://wandb.ai/ppgeec-ml-jj/traffic_sign_recognition/runs/jx6wl53k
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("ppgeec-ml-jj/traffic_sign_recognition/raw_data_test.h5:latest")

wandb: Waiting for W&B process to finish

In [None]:
# upload file raw_data_test_labels.csv (dataset) to wandb under the
# project called traffic_sign_recognition

!wandb artifact put \
      --name ppgeec-ml-jj/traffic_sign_recognition/raw_data_test_labels.csv \
      --type raw_data \
      --description "Raw data test from Traffic Sign Recognition Dataset (only labels)" raw_data_test_labels.csv

wandb: Uploading file raw_data_test_labels.csv to: "ppgeec-ml-jj/traffic_sign_recognition/raw_data_test_labels.csv:latest" (raw_data)
wandb: Currently logged in as: jotafarias (ppgeec-ml-jj). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /content/wandb/run-20220726_030140-3kvgwryh
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run bumbling-tree-303
wandb: ⭐️ View project at https://wandb.ai/ppgeec-ml-jj/traffic_sign_recognition
wandb: 🚀 View run at https://wandb.ai/ppgeec-ml-jj/traffic_sign_recognition/runs/3kvgwryh
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("ppgeec-ml-jj/traffic_sign_recognition/raw_data_test_labels.csv:latest")

wandb: Waiting fo