# Task 2: Coral Classification using Support Vector Machine
This notebook trains a support vector machine (SVM) to classify whether an image contains coral. The SVM is trained with the scikit-learn library on HOG (Histogram of Oriented Gradients) features, which are extracted with scikit-image.

### Download Data
The data is available for download through a public link. After downloading, unzip the folder to get access to the data.

In [None]:
import gdown

# Download training and validation set
url = 'https://drive.google.com/uc?id=1Gdxb0R8ohGqI4yB4KufWYESl0wIc8r8o'
output = 'Data_2022_assignment_COMP3007.zip'
gdown.download(url, output)

In [None]:
!unzip {output} >/dev/null

In [None]:
# Download testing set
url = 'https://drive.google.com/uc?id=1vc5avjn2lRfnIDC2i7XOq22R70m6UTrH'
output = 'Testing_Data_2022.zip'
gdown.download(url, output)

In [None]:
!unzip {output} >/dev/null

### Define Directories
To access the data, various directories need to be defined. The data directory contains two subdirectories that correspond to the training and validation set. The test data directory contains the testing set.

In [None]:
import os

# Train and valid directories
DATA_PATH = os.path.join('Data', 'coral image classification')
TRAIN_PATH = os.path.join(DATA_PATH, 'train')
VALID_PATH = os.path.join(DATA_PATH, 'val')

# Test directory
TEST_PATH = os.path.join('TestData', 'CoralImageClassification')

### HOG Features
The scikit-image library provides an implementation of the HOG feature descriptor. HOG features describe the structure of an image based on its gradients. The image is first divided into small cells, and for each cell, the gradient orientations of its pixels, weighted by gradient magnitude, are collected into a histogram. Once all histograms have been computed, neighbouring cells are grouped into blocks, the histograms in each block are normalized, and the feature vector is formed by concatenating the normalized histograms.

https://scikit-image.org/docs/stable/auto_examples/features_detection/plot_hog.html

In [None]:
import cv2
import numpy as np
from skimage.feature import hog

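# Get HOG features from a grayscale version of the image
def get_hog_features(path, image_size=64):
  img = cv2.imread(path)
  # Note: cv2.imread loads channels in BGR order, so COLOR_RGB2GRAY swaps the
  # red and blue weights. COLOR_BGR2GRAY gives the standard conversion, but the
  # call is kept as-is so features stay consistent with models trained on them.
  gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
  gray = cv2.resize(gray, (image_size, image_size))
  # With visualize=True, hog returns a tuple: (feature vector, HOG image)
  return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(4, 4), visualize=True)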
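
To make the per-cell histogram step described above more concrete, here is a simplified NumPy sketch. It is not the scikit-image implementation: it assumes unsigned gradients (0 to 180 degrees), assigns each pixel to a single bin without interpolation, and skips block normalization. The function name `cell_orientation_histograms` is hypothetical.

```python
import numpy as np

def cell_orientation_histograms(gray, cell=8, orientations=9):
    """Simplified per-cell gradient histograms (illustrative sketch only)."""
    gray = np.asarray(gray, dtype=float)

    # Gradients along rows (gy) and columns (gx)
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)

    # Unsigned orientation in [0, 180) degrees
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / orientations
    bins = np.minimum((angle / bin_width).astype(int), orientations - 1)

    n_rows, n_cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((n_rows, n_cols, orientations))
    for r in range(n_rows):
        for c in range(n_cols):
            rows = slice(r * cell, (r + 1) * cell)
            cols = slice(c * cell, (c + 1) * cell)
            # Each pixel votes for its orientation bin, weighted by magnitude
            hist[r, c] = np.bincount(
                bins[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=orientations,
            )
    return hist

# A horizontal ramp: intensity grows by 1 per column, so every gradient points along x
ramp = np.tile(np.arange(16.0), (16, 1))
histograms = cell_orientation_histograms(ramp)
print(histograms.shape)  # (2, 2, 9)
```

For the ramp image, all gradient magnitude falls into the first orientation bin of every cell, which is the kind of structure the descriptor is meant to capture.
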

### Dataset
This cell defines a function that builds a dataset containing the HOG features of the images and their labels. It takes two optional parameters: the image size and a flag that shuffles the samples.

In [None]:
import numpy as np
from tqdm.notebook import tqdm

# Get the HOG features and labels of the dataset
def get_dataset(dataset_path, image_size=64, shuffle=False):
  # Define empty lists to store the HOG features and labels
  x, y = [], []

  # Loop through each label
  labels = sorted(os.listdir(dataset_path))
  for i, label in enumerate(labels):
    # Collect the paths of all images with this label
    label_path = os.path.join(dataset_path, label)
    image_paths = sorted(os.path.join(label_path, path) for path in os.listdir(label_path))

    # Get the HOG features and label of each image
    for path in tqdm(image_paths, desc='{} ({})'.format(label, os.path.basename(dataset_path))):
      # Named `features` to avoid shadowing the imported `hog` function
      features = get_hog_features(path, image_size=image_size)[0]
      x.append(features)
      y.append(i)
  
  # Convert the lists to numpy arrays
  x, y = np.array(x), np.array(y)

  # Optionally shuffle features and labels with the same permutation
  if shuffle:
    p = np.random.permutation(len(x))
    x, y = x[p], y[p]

  return x, y
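
The shuffle step relies on indexing the features and the labels with the same permutation, so each feature row keeps its label. A minimal sketch of that idea on toy data follows; the helper name `shuffle_in_unison` and the `seed` parameter are assumptions added for illustration and are not part of the notebook.

```python
import numpy as np

def shuffle_in_unison(x, y, seed=None):
    """Shuffle features and labels with one shared permutation."""
    rng = np.random.default_rng(seed)
    p = rng.permutation(len(x))
    return x[p], y[p]

# Toy data: row i of x is [2i, 2i + 1], and its label is i
x = np.arange(10).reshape(5, 2)
y = np.arange(5)

x_shuffled, y_shuffled = shuffle_in_unison(x, y, seed=0)

# Every row still matches its label after shuffling
print(np.array_equal(x_shuffled[:, 0], 2 * y_shuffled))  # True
```
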

### Image Size Hyperparameter
While the SVM has its own hyperparameters, the image size was found to have the greatest influence on its accuracy. A larger image yields a longer, more descriptive HOG feature vector. This increases accuracy, but the features take more time and memory to compute.

For this experiment, four image sizes were chosen. A list to store the resulting accuracies is also defined.

In [None]:
image_sizes = [32, 64, 128, 256]
accuracies = []
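
To give a rough sense of how the feature vector grows with image size, the sketch below computes its length for the HOG settings used in this notebook (9 orientations, 8x8 pixels per cell, 4x4 cells per block). It assumes square images and blocks that slide one cell at a time, which matches the layout used by `skimage.feature.hog` as far as I know; the helper name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(image_size, orientations=9, pixels_per_cell=8, cells_per_block=4):
    """Length of the HOG feature vector for a square image (illustrative sketch)."""
    cells = image_size // pixels_per_cell   # cells per side
    blocks = cells - cells_per_block + 1    # block positions per side
    if blocks < 1:
        raise ValueError('image is too small for one full block')
    # Each block holds cells_per_block^2 histograms with `orientations` bins each
    return blocks * blocks * cells_per_block * cells_per_block * orientations

for size in [32, 64, 128, 256]:
    print(size, hog_feature_length(size))
# 32 144
# 64 3600
# 128 24336
# 256 121104
```

Under these assumptions, going from 32 to 256 pixels per side multiplies the number of features per image by several hundred, which is consistent with the higher time and memory cost noted above.
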

In [None]:
x_train, y_train = get_dataset(TRAIN_PATH, image_size=image_sizes[3], shuffle=True)
x_valid, y_valid = get_dataset(VALID_PATH, image_size=image_sizes[3], shuffle=False)

### Train the SVM
This cell trains the SVM with the training set. After training, the SVM is saved to the local directory.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
from sklearn import svm
from joblib import dump

# probability=True enables predict_proba, which is used later to classify a single image
clf = svm.SVC(probability=True)
clf.fit(x_train, y_train)

dump(clf, 'svm_{}.joblib'.format(image_sizes[3]))

### Evaluate the SVM on Valid Set
This cell evaluates the SVM on the validation set. The accuracy score and the classification report are printed.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(x_valid)
accuracy = accuracy_score(y_valid, y_pred)
accuracies.append(accuracy)

print('Accuracy: {}\n'.format(accuracy))
print(classification_report(y_valid, y_pred, target_names=['No Coral', 'Coral']))
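
For readers unfamiliar with the numbers in the report, the following sketch computes accuracy, precision, recall, and F1 for the positive class by hand. It uses made-up labels and predictions, not results from this notebook, and the helper name `binary_metrics` is hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for one class (illustrative sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))

    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up example: 0 = 'No Coral', 1 = 'Coral'
example_true = [0, 0, 1, 1, 1]
example_pred = [0, 1, 1, 1, 0]
metrics = binary_metrics(example_true, example_pred)
print(metrics)
```

In this toy example, three of five predictions are correct (accuracy 0.6), and the 'Coral' class has two true positives, one false positive, and one false negative, giving precision and recall of 2/3 each.
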

### Plot Accuracies
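This cell plots the accuracy of the SVM for the different image sizes. The bar chart is complete only once the dataset, training, and evaluation cells have been run for every entry in `image_sizes`, so that `accuracies` holds one value per size. In these runs, a larger image size led to a more accurate classifier.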

In [None]:
accuracies

In [None]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Bar(x=[str(x) for x in image_sizes], y=accuracies))
fig.update_layout(
    title='Accuracy of SVM with Different Image Sizes',
    title_x=0.5,
    xaxis_title='Image Size',
    yaxis_title='Accuracy'
)
fig.show()

### Evaluate a Trained SVM on Test Set
The following cells download an already trained SVM and evaluate it on the test set.

In [None]:
import gdown

url = 'https://drive.google.com/uc?id=1fY23h4Tbu7tGyivxt-DMbRt3H6Z45kXg'
output = 'svm_256.joblib'
gdown.download(url, output)

In [None]:
from joblib import load
from sklearn.metrics import accuracy_score, classification_report

x_test, y_test = get_dataset(TEST_PATH, image_size=256, shuffle=False)

clf = load('svm_256.joblib')
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy: {}\n'.format(accuracy))
print(classification_report(y_test, y_pred, target_names=['No Coral', 'Coral']))

### Classify Single Image
This cell classifies a single image. The true label and the prediction are both shown.

In [None]:
from google.colab.patches import cv2_imshow

labels = sorted(os.listdir(TEST_PATH))
modified_labels = ['No Coral', 'Coral']
true_label = modified_labels[0]

# Get the image
img_path = os.path.join(TEST_PATH, labels[0], '13-11-41-27_1.1421167342.57-top_right.png')
# Named `features` so the imported `hog` function is not overwritten at module level
features = get_hog_features(img_path, image_size=256)[0]

# Get the prediction
img_preds = clf.predict_proba([features])
img_pred = modified_labels[np.argmax(img_preds)]
img_pred_score = np.max(img_preds) * 100

print('True: {}'.format(true_label))
print('Pred: {} ({:.2f}%)'.format(img_pred, img_pred_score))

img = cv2.imread(img_path)
cv2_imshow(cv2.resize(img, (256, 256)))
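
The mapping from class probabilities to a displayed label can be sketched in isolation as follows. The probabilities here are made up, and the helper name `top_prediction` is hypothetical; with a real classifier, the columns returned by `predict_proba` follow the order of `clf.classes_`.

```python
import numpy as np

def top_prediction(proba_row, class_names):
    """Return the most likely class name and its probability as a percentage."""
    proba_row = np.asarray(proba_row, dtype=float)
    idx = int(np.argmax(proba_row))
    return class_names[idx], float(proba_row[idx]) * 100

# Made-up probabilities for one image: [P(No Coral), P(Coral)]
label, score = top_prediction([0.12, 0.88], ['No Coral', 'Coral'])
print('Pred: {} ({:.2f}%)'.format(label, score))  # Pred: Coral (88.00%)
```
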