### Computer Vision (S1-24_AIMLCZG525) - Assignment 1
#### CV Group 19 
    SAYANTA CHATTERJEE - 2023aa05173

    TAIBA REHMAN - 2023aa05466

    KASHYAP RAJPUROHIT - 2023ab05027

    C RAMAKRISHNA - 2023ab05177

### Problem Statement 2: "Edge-based Image Retrieval"

#### Objective: Implement an image retrieval system that uses edge-based features to find similar images in a dataset. Evaluate the retrieval performance using relevant metrics.


### Task 1.Import the required libraries

Import the necessary libraries for image processing (e.g., OpenCV, scikit-image) and machine learning (e.g., scikit-learn, pandas).

In [6]:
import cv2  
import numpy as np  
import pandas as pd 
import random
import os
import shutil
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import  accuracy_score, precision_score, recall_score, f1_score, average_precision_score, pairwise_distances
from sklearn.preprocessing import label_binarize, StandardScaler
from sklearn.decomposition import PCA

### Task 2. Data Acquisition

•	Download the Caltech 101 dataset from the provided link: Caltech 101 DatasetLinks to an external site.

•	Select any 5 categories from the dataset. Each category should have at least 50 images.

•	Organize the selected categories into separate folders, with each folder containing the corresponding images.

•	Analyze and plot the distribution of images per category using a bar graph or pie chart.

Url for the dataset: 
https://data.caltech.edu/records/mzrjq-6wc02/files/caltech-101.zip?download=1


In [7]:

folder_path = r"C:/Users/asus/OneDrive/Desktop/CVassimgement/caltech-101/"
folder_contents = os.listdir(folder_path)
folder_names = [name for name in folder_contents if os.path.isdir(os.path.join(folder_path, name))]

# Select categories with at least 50 images
randomsa = []
for folder_name in folder_names:
    folder_contents = os.listdir(os.path.join(folder_path, folder_name))
    if len(folder_contents) >= 50:
        randomsa.append(folder_name)

# Select 5 random categories
selected_folders = random.sample(randomsa, 5)

# Create a new directory to organize the selected categories
organized_folder_path = r"C:/Users/asus/OneDrive/Desktop/CVassimgement/organized_caltech101/"
os.makedirs(organized_folder_path, exist_ok=True)

# Organize the selected categories into separate folders
for folder_name in selected_folders:
    src_folder = os.path.join(folder_path, folder_name)
    dest_folder = os.path.join(organized_folder_path, folder_name)
    os.makedirs(dest_folder, exist_ok=True)
    for file_name in os.listdir(src_folder):
        src_file = os.path.join(src_folder, file_name)
        dest_file = os.path.join(dest_folder, file_name)
        shutil.copy(src_file, dest_file)

# Analyze and plot the distribution of images per category
category_counts = {folder_name: len(os.listdir(os.path.join(organized_folder_path, folder_name))) for folder_name in selected_folders}
print(category_counts)
# Plotting the distribution using a bar graph
plt.figure(figsize=(10, 6))
plt.bar(category_counts.keys(), category_counts.values(), color='skyblue')
plt.xlabel('Category')
plt.ylabel('Number of Images')
plt.title('Distribution of Images per Category')
plt.xticks(rotation=45)
plt.show()

# Plotting the distribution using a pie chart
plt.figure(figsize=(8, 8))
plt.pie(category_counts.values(), labels=category_counts.keys(), autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Images per Category')
plt.show()

NameError: name 'os' is not defined

### Task 3.  Data Preparation

Randomly split the dataset into training and testing sets using an 80% training and 20% testing ratio. Ensure that the split is stratified, maintaining the same proportion of images per category in both sets.

**Note:** This step is integrated in the **Task 5** in the code below.


### Task 4: **Preprocess** the images as follows:

•	Convert the images to grayscale.

•	Apply the Canny edge detection algorithm to extract edges from the images.

•	Extract edge-based descriptors such as edge histograms or gradient histograms from the preprocessed images.

•	Create multiple feature sets by varying the parameters of the edge detection and descriptor extraction methods.

•	Store the extracted features in separate dataframes, along with the corresponding category labels.

•	Normalize the feature values using techniques like min-max scaling or z-score normalization.

•	If the feature dimensionality is too high, apply dimensionality reduction techniques such as PCA or resize the images to a smaller fixed size.


In [6]:
def plotImage(image):
    # Display the edge-detected image
    plt.imshow(image, cmap='gray')
    plt.title("Edge Detection")
    plt.axis('off')
    plt.show()

In [7]:
def loadPreProcess(image_path, canny_low, canny_high):
    if(os.path.exists(image_path)):
        # Load the image as gray scale
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE )
        # resize the image for uniform feature length
        resized = cv2.resize(image, (256,128), interpolation=cv2.INTER_AREA)

        # Apply Gaussian Blur
        blurred = cv2.GaussianBlur(resized, (5, 5), 0)
        # Apply Canny for edge detection
        return cv2.Canny(blurred, canny_low, canny_high)
    else:
        return []

In [21]:
def getHOGFeatures(image, cellSize, blockSize, blockStride):
    winSize = (256,128) # scaled image size
    nbins = 9

    # HOG from preprocessed image
    hog = cv2.HOGDescriptor(_winSize = winSize, _cellSize = cellSize, _blockSize = blockSize, _nbins= nbins, _blockStride = blockStride)
    return hog.compute(image)

In [9]:
def loadData(base_path, folder_names, params):
    features = list()
    category = list()
    image_names = list()
    for folder_name in folder_names:
        folder_path = os.path.join(base_path, folder_name)
        # get and extract feature from all files in current folder
        for file_name in os.listdir(folder_path):
            image = loadPreProcess(os.path.join(folder_path, file_name), params['canny_low'], params['canny_high'])
            features.append(getHOGFeatures(image, params['cell_size'], params['block_size'], params['block_stride']))
            category.append(folder_name)
            image_names.append(file_name)

    # z-score normalization feature set
    scaler = StandardScaler()
    norm_data = scaler.fit_transform(features)

    # PCA to reduce dimentionality
    pca = PCA(n_components=0.98)
    pca_data = pca.fit_transform(norm_data)

    data = pd.DataFrame(pca_data)
    data['category'] = category
    data['image_name'] = image_names

    return data

### Task 5. Model Building

* Select a classical machine learning algorithm such as Support Vector Machines (SVM), Random Forest, or XGBoost for training the image retrieval model.

* Train the chosen model on different feature combinations created in the preprocessing step.

* Use appropriate hyperparameter tuning techniques (e.g., grid search, random search) and cross-validation (e.g., k-fold) to optimize the model's performance.


In [10]:
def train(train_x, train_y):
    # define model
    model = SVC()

    # Define the parameter grid
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': [1, 0.1, 0.01, 0.001],
        'kernel': ['rbf', 'linear', 'sigmoid']
    }

    # k-fold with grid search cross validation
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kfold, scoring='accuracy')
    grid_search.fit(train_x, train_y)
    
    print("Best Cross-Validation Accuracy:", grid_search.best_score_)
    model.fit(train_x, train_y)
    return model

### Task 6. Validation Metrics

Evaluate the trained models using the following metrics:

    o	Accuracy
    o	Precision
    o	Recall
    o	F1 score
    o	Mean Average Precision (mAP)

Calculate the mAP by ranking the retrieved images based on their similarity scores and computing the average precision at different recall levels.
Aim for a minimum mAP of 0.7 to consider the retrieval system as acceptable.

Note: The accuracy is siplay in the output of the above cell


In [14]:
def calcultate_map(test_y,pred_y):
    binarized_test_y = label_binarize(test_y, classes=np.unique(test_y))
    binarized_pred_y = label_binarize(pred_y, classes=np.unique(test_y))
    ap_scores = []
    for i in range(binarized_test_y.shape[1]):
        ap = average_precision_score(binarized_test_y[:, i], binarized_pred_y[:, i])
        ap_scores.append(ap)
    
    map_score = np.mean(ap_scores)
    return map_score

              precision    recall  f1-score   support

    car_side       1.00      0.96      0.98        25
 cougar_face       0.75      0.64      0.69        14
   crocodile       0.54      0.70      0.61        10
         cup       0.82      0.82      0.82        11
   dragonfly       0.86      0.86      0.86        14

    accuracy                           0.82        74
   macro avg       0.79      0.80      0.79        74
weighted avg       0.84      0.82      0.83        74



In [None]:
def test(model, test_x, test_y):
    pred_y = model.predict(test_x)
    # Calculate metrics
    accuracy = accuracy_score(test_y, pred_y)
    precision = precision_score(test_y, pred_y, average='weighted')
    recall = recall_score(test_y, pred_y, average='weighted')
    f1 = f1_score(test_y, pred_y, average='weighted')
    map_score = calcultate_map(test_y, pred_y)

    # 8. Analysis and Discussion
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1-score: {f1}")
    print(f"MAP: {map_score}")

**Observation:** The mean average precision (mAP) is calculated above with a value as **0.68**

### Task 7. Model Inference 

•	Randomly select 5 test images from each category and use the best-performing model to retrieve the top-5 most similar images for each query image.

•	Display the query image along with its top-5 retrieved images, their predicted labels, and the actual labels.

•	Justify your choice of features based on the retrieval performance and visual analysis of the results.


In [22]:
def get_top_k_similar_images(model, source, image_data, k=5):
    pred_cat = model.predict(image_data)
    distances = pairwise_distances(source, image_data, metric='cosine').ravel()
    top_k_indices = np.argsort(distances)[:-k]
    return pred_cat, top_k_indices

<class 'cv2.HOGDescriptor'>


TypeError: _BasePCA.transform() missing 1 required positional argument: 'X'

In [None]:
feature_params_1  ={
    'canny_low': 50,
    'canny_high': 150,
    'cell_size' : (16,16), # in pixel
    'block_size' : (32,32), # multiple of cell size
    'block_stride':(16,16) # multiple of cell size
}

data_1 = loadData(organized_folder_path, category_counts.keys(), feature_params_1)
vector_1 =data_1.iloc[:, :-2]
category = data_1.iloc[:, -2]

In [None]:
train_x, test_x, train_y, test_y = train_test_split(vector_1, category, train_size=0.8, stratify=category )
model = train(train_x, train_y)
test(model, test_x, test_y)

In [None]:
# Randomly select 5 test images from each selected folder
for folder in selected_folders:
    folder_dir = os.path.join(organized_folder_path, folder)
    file_name = random.sample(os.listdir(folder_dir), 1)[0]
    image_data = data_1.query(f"category ==  '{folder}' & image_name == '{file_name}'").iloc[:,:-2]
    pred_label, top_k_match = get_top_k_similar_images(model, vector_1, image_data)

    # Display query image and top-5 retrieved images
    plt.figure(figsize=(15, 5))
    plt.subplot(1, 6, 1)
    plt.imshow(cv2.imread(os.path.join(folder_dir, file_name)), cmap='gray')
    plt.title(pred_label)
    plt.axis('off')
    
      
    for i in range(5):
        category = data_1.iloc[top_k_match[i]]['category']
        file_name = data_1.iloc[top_k_match[i]]['image_name']
        img = cv2.imread(os.path.join(organized_folder_path, category, file_name))
        plt.subplot(1, 6, i + 2)
        plt.imshow(img, cmap='gray')
        plt.title(f"Top-{i+1}: {category}")
        plt.axis('off')
    
    plt.show()

In [21]:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics
accuracy = accuracy_score(test_y, pred_y)
precision = precision_score(test_y, pred_y, average='weighted')
recall = recall_score(test_y, pred_y, average='weighted')
f1 = f1_score(test_y, pred_y, average='weighted')

# 8. Analysis and Discussion
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
print(f"mAP: {map_score}")


Accuracy: 0.8356164383561644
Precision: 0.8608393128941075
Recall: 0.8356164383561644
F1-score: 0.8349502723569266
mAP: 0.74053909352056


### Task 8.  Analysis and Discussion

•	Analyze the retrieval results and discuss the effectiveness of the edge-based features for image retrieval.

•	Compare the performance of different feature combinations and highlight the ones that yield the best results.

•	Identify any limitations or challenges encountered during the implementation and suggest potential improvements or alternative approaches.
        1. Resizing of the images due to different size 

        2. CNN for better accuracy
    
        3. Canny edge detection limitation
    
        4. HOG limitation

