https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://kushalvyas.github.io/BOV.html
https://www.geeksforgeeks.org/feature-extraction-in-image-processing-techniques-and-applications/

### The main goal of this exercise is to get a feeling for, and an understanding of, the importance of representing and extracting information from complex media content, in this case images.

1. Start Simple with Colour Histograms
2. Explore Key Point-Based Feature Extraction
    - SIFT: Once you’re comfortable with simple features, try using SIFT for key point detection and descriptor extraction. SIFT descriptors are robust to changes in scale and rotation, which makes them well suited to finding distinctive features in images.
        - Visual Bag of Words: After extracting SIFT features, cluster the descriptors into visual “words” and represent each image as a histogram of word counts. This converts a variable number of key points into a fixed-length feature vector suitable for classifiers.


3. Then try deep learning approaches, such as convolutional neural networks (for images) or recurrent neural networks (for text), or other approaches (but likely not just simple MLPs), and see
how your performance differs. Try at least two different architectures; they can reuse or be based on existing, well-known ones.
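
For step 2, the bag-of-visual-words encoding can be sketched as follows. This is a minimal NumPy-only sketch, not a full pipeline: it assumes the local descriptors have already been extracted (for example with OpenCV's `cv2.SIFT_create().detectAndCompute`, which yields 128-dimensional descriptors) and that a codebook of cluster centres has already been fitted (for example with k-means). The function name `bovw_histogram` and the random stand-in data are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Encode one image's local descriptors as a normalized visual-word histogram.

    descriptors: (n_keypoints, d) array; n_keypoints varies per image and may be 0.
    codebook:    (k, d) array of cluster centres (the visual "words").
    Returns a length-k vector that sums to 1 (all zeros if there are no descriptors).
    """
    k = codebook.shape[0]
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k, dtype=np.float32)
    # Squared Euclidean distance from every descriptor to every centre
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.einsum("nkd,nkd->nk", diffs, diffs)
    words = dists.argmin(axis=1)  # index of the nearest visual word per descriptor
    hist = np.bincount(words, minlength=k).astype(np.float32)
    return hist / hist.sum()

# Stand-in data: random vectors in place of real SIFT descriptors (assumption)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(50, 128))
image_a = rng.normal(size=(37, 128))   # an image with 37 key points
image_b = rng.normal(size=(212, 128))  # an image with 212 key points

feat_a = bovw_histogram(image_a, codebook)
feat_b = bovw_histogram(image_b, codebook)
print(feat_a.shape, feat_b.shape)  # same length for both images
```

Both images end up with a vector of length 50 (the codebook size), whatever their number of key points, which is what allows a standard classifier to be trained on top.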

Compare not just the overall measures, but also perform a detailed per-class comparison and analysis (confusion matrix) to identify whether the two approaches lead to different types of errors in the different classes, and try to identify other patterns.
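
One way to carry out the per-class comparison is to derive per-class recall from each confusion matrix and look at the difference. A minimal sketch, assuming scikit-learn's convention (rows are true classes, columns are predictions); the helper name `per_class_recall` and the two toy 3-class matrices are made up for illustration and are not results from this exercise.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class from a confusion matrix with true classes on the rows."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1)
    # Classes without any test samples get recall 0 instead of a division error
    return np.divide(np.diag(cm), row_sums, out=np.zeros_like(row_sums), where=row_sums > 0)

# Toy confusion matrices for two hypothetical approaches (illustrative values only)
cm_a = np.array([[8, 1, 1],
                 [2, 6, 2],
                 [0, 5, 5]])
cm_b = np.array([[9, 1, 0],
                 [4, 4, 2],
                 [1, 1, 8]])

recall_a = per_class_recall(cm_a)
recall_b = per_class_recall(cm_b)
for cls, (ra, rb) in enumerate(zip(recall_a, recall_b)):
    print(f"class {cls}: A={ra:.2f}  B={rb:.2f}  diff={rb - ra:+.2f}")
```

A class whose recall moves in the opposite direction to the overall accuracy is a good candidate for closer inspection.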

Also perform a detailed comparison of runtime, considering both training and testing time, and including the feature extraction components.
For the datasets, pick two text/image datasets from the list of suggestions below.
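
For the runtime comparison, each pipeline stage can be timed separately so that feature extraction, training and testing are reported on their own. A minimal sketch using only the standard library; the `timed` context manager and the stage names are hypothetical, and the `sum(...)` calls merely stand in for real work.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("feature_extraction"):
    features = [sum(range(1000)) for _ in range(100)]  # placeholder workload
with timed("training"):
    total = sum(features)                              # placeholder workload

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f} sec")
```

`time.perf_counter()` is a monotonic, high-resolution clock, which helps when a stage such as prediction finishes in a few milliseconds.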

For images, you can base your DL implementation on the tutorial provided by colleagues at the institute, available at https://github.com/tuwien-musicir/DL_Tutorial/blob/master/Car_recognition.ipynb (you can also check the rest of the repository for interesting code; credit to Thomas Lidy (http://www.ifs.tuwien.ac.at/~lidy/)). Note also that you will find plenty of examples of how to create and train CNNs / RNNs in various frameworks: TensorFlow, Keras, PyTorch, and others.

In [8]:
import io
import time

import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
from tqdm import tqdm

# Scikit-learn imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Torch and torchvision 
import torch
from torchvision import datasets, transforms

In [None]:
# Reading "hf://" paths with pandas requires the huggingface_hub package to be installed
splits = {'train': 'data/train-00000-of-00001-1359597a978bc4fa.parquet', 'valid': 'data/valid-00000-of-00001-70d52db3c749a935.parquet'}
df = pd.read_parquet("hf://datasets/zh-plus/tiny-imagenet/" + splits["train"])

# Pipeline

### Helper Functions

In [None]:

def extract_color_histogram(image, bins=8):
    """
    Extracts a normalized color histogram from an image.
    
    Args:
        image (numpy.ndarray): Image array in RGB format.
        bins (int): Number of bins per channel.
        
    Returns:
        np.ndarray: Flattened concatenated normalized histogram for each channel.
    """
    # Ensure image is in uint8 format; non-uint8 input is assumed to be float in [0, 1]
    if image.dtype != np.uint8:
        image = np.clip(np.rint(255 * image), 0, 255).astype(np.uint8)
    
    # If image is not in 3 channels, convert (this might happen with grayscale images)
    if len(image.shape) != 3 or image.shape[2] != 3:
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
    
    chans = cv2.split(image)
    features = []
    
    for chan in chans:
        # Compute histogram for the channel
        hist = cv2.calcHist([chan], [0], None, [bins], [0, 256])
        # Normalize histogram (cv2.normalize defaults to the L2 norm with alpha=1)
        hist = cv2.normalize(hist, hist).flatten()
        features.extend(hist)
        
    return np.array(features)



def run_experiment(X_images, y_labels, dataset_name, bins_values=[8, 16]):
    """
    Run experiments on a given dataset using two histogram configurations.
    
    Args:
        X_images (list): List of images as numpy arrays in RGB format.
        y_labels (list or array): Corresponding labels.
        dataset_name (str): Name of the dataset (for printing/plotting).
        bins_values (list): List of bin counts to try (e.g. [8, 16]).
    """
    results = {}
    
    for bins in bins_values:
        print(f"\n==== Running experiment on {dataset_name} with {bins} bins per channel ====")
        t0 = time.time()
        # Extract features for all images; use tqdm for progress
        features = []
        for img in tqdm(X_images, desc="Extracting features"):
            feat = extract_color_histogram(img, bins=bins)
            features.append(feat)
        features = np.array(features)
        extraction_time = time.time() - t0
        print(f"Feature extraction time: {extraction_time:.2f} sec")
        
        # Split into train and test sets
        X_train, X_test, y_train, y_test = train_test_split(
            features, y_labels, test_size=0.3, random_state=42, stratify=y_labels
        )
        
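        # Train classifier and measure training time.
        # multi_class is omitted: lbfgs already fits a multinomial model for multiclass
        # targets, and the parameter is deprecated in recent scikit-learn releases.
        clf = LogisticRegression(max_iter=500, solver="lbfgs")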
        t_train = time.time()
        clf.fit(X_train, y_train)
        train_time = time.time() - t_train
        print(f"Training time: {train_time:.2f} sec")
        
        # Test classifier and measure prediction time
        t_test = time.time()
        y_pred = clf.predict(X_test)
        test_time = time.time() - t_test
        print(f"Testing time: {test_time:.2f} sec")
        
        # Compute overall accuracy
        acc = accuracy_score(y_test, y_pred)
        print(f"Overall Accuracy: {acc:.4f}")
        
        # Compute confusion matrix and classification report
        cm = confusion_matrix(y_test, y_pred)
        report = classification_report(y_test, y_pred, output_dict=True)
        
        results[bins] = {
            "extraction_time": extraction_time,
            "train_time": train_time,
            "test_time": test_time,
            "accuracy": acc,
            "confusion_matrix": cm,
            "classification_report": report,
            "model": clf,
        }
        
        # Plot the confusion matrix for visual inspection
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
        plt.title(f"{dataset_name} Confusion Matrix (bins={bins})")
        plt.xlabel("Predicted")
        plt.ylabel("True")
        plt.show()
        
        # Also display a text report
        print(f"Classification Report (bins={bins}):")
        print(classification_report(y_test, y_pred))
    
    return results
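
To make the shape of the colour-histogram features concrete, here is a NumPy-only sketch of the same idea as `extract_color_histogram`. It is an approximation for illustration, assuming each channel histogram is L2-normalized (the default behaviour of `cv2.normalize`); the function name `color_histogram_np` and the random test image are hypothetical.

```python
import numpy as np

def color_histogram_np(image, bins=8):
    """Per-channel histogram of a uint8 RGB image, each channel L2-normalized."""
    features = []
    for c in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        hist = hist.astype(np.float32)
        norm = np.linalg.norm(hist)
        features.append(hist / norm if norm > 0 else hist)
    return np.concatenate(features)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # random stand-in image

for bins in (8, 16):
    feat = color_histogram_np(image, bins=bins)
    print(bins, feat.shape)  # feature length is 3 * bins
```

With 8 bins per channel the classifier sees 24 numbers per image, and with 16 bins it sees 48; all spatial layout is discarded either way.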




# 1. Tiny ImageNet Experiment


In [4]:

def run_tiny_imagenet_experiment():
    # Define the parquet splits (update paths if needed)
    splits = {
        'train': 'data/train-00000-of-00001-1359597a978bc4fa.parquet',
        'valid': 'data/valid-00000-of-00001-70d52db3c749a935.parquet'
    }
    # Load training data from Tiny ImageNet (here we use the train split)
    # Note: The dataset from Hugging Face hub is accessed via "hf://datasets/..."
    try:
        df = pd.read_parquet("hf://datasets/zh-plus/tiny-imagenet/" + splits["train"])
    except Exception:
        print("Error loading Tiny ImageNet parquet file. Please ensure the path is correct.")
        raise
    
    # The dataframe is assumed to have at least two columns: 'image' and 'label'.
    # Each image is expected to be a dict with keys 'bytes' and 'path'; the bytes are
    # decoded with PIL below. Any other non-array value is passed to np.array as a fallback.
    X_images = []
    for img in tqdm(df['image'], total=len(df), desc="Loading Tiny ImageNet images"):
        # Check if the image is stored as a dict with keys 'bytes' and 'path'
        if isinstance(img, dict):
            if set(img.keys()) == {"bytes", "path"}:
                # Decode the image bytes using PIL
                try:
                    pil_img = Image.open(io.BytesIO(img["bytes"]))
                    pil_img = pil_img.convert("RGB")  # Ensure RGB format
                    img = np.array(pil_img)
                except Exception as e:
                    raise ValueError(f"Error decoding image bytes: {e}")
            else:
                raise ValueError("Unknown image dict format: " + str(img.keys()))
        elif not isinstance(img, np.ndarray):
            img = np.array(img)
        X_images.append(img)
    
    y_labels = df['label'].tolist()
    
    print(f"Tiny ImageNet: Loaded {len(X_images)} images.")
    
    # Run the experiment using two histogram configurations: 8 and 16 bins per channel.
    tiny_results = run_experiment(X_images, y_labels, dataset_name="Tiny ImageNet", bins_values=[8, 16])
    return tiny_results



# 2. CIFAR-10 Experiment


In [5]:

def run_cifar10_experiment():
    # Define a transform to convert PIL images to numpy arrays
    transform = transforms.Compose([
        transforms.ToTensor(),  # Converts to [0,1] tensor in shape (C, H, W)
        # We'll transpose later to get HxWxC and convert to numpy
    ])
    
    # Download and load CIFAR-10 training set (we will use the train split for simplicity)
    cifar10_train = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    
    X_images = []
    y_labels = []
    for img, label in tqdm(cifar10_train, desc="Loading CIFAR-10 images"):
        # img is a tensor of shape (C, H, W); convert to numpy and then transpose to (H, W, C)
        img_np = img.permute(1, 2, 0).numpy()
        # Convert to uint8 format (0-255); round first so float error cannot truncate a value
        img_np = np.rint(img_np * 255).astype(np.uint8)
        X_images.append(img_np)
        y_labels.append(label)
    
    print(f"CIFAR-10: Loaded {len(X_images)} images.")
    
    # Run the experiment using two histogram configurations: 8 and 16 bins per channel.
    cifar_results = run_experiment(X_images, y_labels, dataset_name="CIFAR-10", bins_values=[8, 16])
    return cifar_results



# Main: Run Both Experiments


In [11]:

if __name__ == "__main__":
    print("Starting experiments...\n")
    
    # Run Tiny ImageNet experiment (if the parquet file is available)
    try:
        tiny_results = run_tiny_imagenet_experiment()
    except Exception as e:
        print("Tiny ImageNet experiment could not be run:", e)
        tiny_results = None
    
    # Run CIFAR-10 experiment
    cifar_results = run_cifar10_experiment()
    
    # (Optional) Further analysis: compare per-class errors across approaches, print runtime comparisons, etc.
    # For example, you could compare tiny_results[8]["accuracy"] vs. tiny_results[16]["accuracy"]
    print("\nExperiments completed.")


Starting experiments...



Loading Tiny ImageNet images:  26%|██▌       | 25641/100000 [00:02<00:05, 12670.06it/s]


KeyboardInterrupt: 
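
The optional further analysis mentioned in the comments could start from a small summary of the dictionaries returned by `run_experiment`. A minimal sketch; `summarize_results` is a hypothetical helper, and the numbers in `example_results` are placeholders, not measured results.

```python
def summarize_results(results):
    """Turn {bins: {...metrics...}} into a list of rows sorted by bin count."""
    rows = []
    for bins in sorted(results):
        r = results[bins]
        rows.append({
            "bins": bins,
            "accuracy": r["accuracy"],
            "total_time": r["extraction_time"] + r["train_time"] + r["test_time"],
        })
    return rows

# Placeholder values using the same keys that run_experiment stores (not real results)
example_results = {
    16: {"extraction_time": 2.0, "train_time": 0.5, "test_time": 0.01, "accuracy": 0.40},
    8: {"extraction_time": 1.5, "train_time": 0.25, "test_time": 0.01, "accuracy": 0.35},
}

rows = summarize_results(example_results)
for row in rows:
    print(f"bins={row['bins']:>2}  accuracy={row['accuracy']:.4f}  total_time={row['total_time']:.2f} sec")

best = max(rows, key=lambda row: row["accuracy"])
print("Best configuration:", best["bins"], "bins")
```

Since `tiny_results` is set to `None` when the Tiny ImageNet run fails, check for `None` before passing it to a helper like this.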

In [13]:
import time
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import torch
import torchvision
import torchvision.transforms as transforms
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# --- Feature Extraction Function ---
def extract_grayscale_histogram(img, bins=16):
    """
    Given a 2D image (grayscale values in [0,1]), compute a normalized histogram.
    """
    # Convert to uint8 (0-255), rounding before the cast, then compute histogram
    img_8u = np.rint(img * 255).astype(np.uint8)
    hist, _ = np.histogram(img_8u, bins=bins, range=(0, 256))
    hist = hist.astype(np.float32)
    hist /= (hist.sum() + 1e-7)  # Normalize to sum to 1
    return hist

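# --- Experiment Function ---
# Note: this redefines run_experiment from the colour-histogram cells above;
# this version works on grayscale histograms.
def run_experiment(images, labels, dataset_name, bins_values=[8, 16]):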
    """
    Run experiments on a dataset with different histogram bin settings.
    
    Args:
        images (list): List of 2D numpy arrays (grayscale images).
        labels (list): Corresponding image labels.
        dataset_name (str): Name of the dataset (for printing).
        bins_values (list): List of bin counts to try.
    """
    results = {}
    for bins in bins_values:
        print(f"\n=== {dataset_name}: Experiment with {bins} bins per image ===")
        
        # Feature extraction
        t0 = time.time()
        features = [extract_grayscale_histogram(img, bins=bins) for img in tqdm(images, desc="Extracting features")]
        features = np.array(features)
        extraction_time = time.time() - t0
        print(f"Feature extraction time: {extraction_time:.2f} sec")
        
        # Split into train/test sets
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.3, random_state=42, stratify=labels
        )
        
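        # Train classifier.
        # multi_class is omitted: lbfgs already fits a multinomial model for multiclass
        # targets, and the parameter is deprecated in recent scikit-learn releases.
        clf = LogisticRegression(max_iter=500, solver="lbfgs")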
        t0 = time.time()
        clf.fit(X_train, y_train)
        train_time = time.time() - t0
        print(f"Training time: {train_time:.2f} sec")
        
        # Evaluate classifier
        t0 = time.time()
        y_pred = clf.predict(X_test)
        test_time = time.time() - t0
        print(f"Testing time: {test_time:.2f} sec")
        
        acc = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {acc:.4f}")
        cm = confusion_matrix(y_test, y_pred)
        print("Confusion Matrix:")
        print(cm)
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
        
        results[bins] = {
            "extraction_time": extraction_time,
            "train_time": train_time,
            "test_time": test_time,
            "accuracy": acc,
            "confusion_matrix": cm,
            "model": clf,
        }
    return results

# --- Main Script ---
if __name__ == "__main__":
    # Load FashionMNIST
    transform = transforms.ToTensor()
    dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    print(f"Loaded {len(dataset)} FashionMNIST images.")

    # Convert images (tensor shape: [1, 28, 28]) to 2D numpy arrays
    images, labels = [], []
    for img_tensor, label in tqdm(dataset, desc="Preparing images"):
        images.append(img_tensor.numpy().squeeze())  # Shape: (28,28)
        labels.append(label)
    
    # Run experiment with two bin configurations
    results = run_experiment(images, labels, dataset_name="FashionMNIST", bins_values=[8, 16])
    
    # (Optional) You can add further analysis on the `results` dictionary.


Loaded 60000 FashionMNIST images.


Preparing images: 100%|██████████| 60000/60000 [00:01<00:00, 40923.00it/s]



=== FashionMNIST: Experiment with 8 bins per image ===


Extracting features: 100%|██████████| 60000/60000 [00:01<00:00, 30371.58it/s]


Feature extraction time: 1.99 sec
Training time: 0.45 sec
Testing time: 0.00 sec
Accuracy: 0.3538
Confusion Matrix:
[[ 174   18  205  116  205   32  436   29  438  147]
 [  13  980    8   90   10  293   13  263   26  104]
 [  74    2  687   19  368   19  420   23  158   30]
 [ 103  377   17  399   36  203   43  140   73  409]
 [  75   13  282   72  615    5  238   12  313  175]
 [  25  141    2  160    3 1205    8  215    6   35]
 [ 142    3  446   58  289   21  547   41  161   92]
 [   4  270    0   89    0  685    1  688    2   61]
 [ 157   61  302  127  287   31  166   46  313  310]
 [  61  226   27  291   94   20   18   60  242  761]]
Classification Report:
              precision    recall  f1-score   support

           0       0.21      0.10      0.13      1800
           1       0.47      0.54      0.50      1800
           2       0.35      0.38      0.36      1800
           3       0.28      0.22      0.25      1800
           4       0.32      0.34      0.33      1800
     

Extracting features: 100%|██████████| 60000/60000 [00:01<00:00, 30422.71it/s]


Feature extraction time: 1.98 sec
Training time: 0.51 sec
Testing time: 0.00 sec
Accuracy: 0.3683
Confusion Matrix:
[[ 286   16  168  124  182    7  400   33  443  141]
 [  14 1016   11   98    7  296   13  225   23   97]
 [  81    4  707   12  362    1  438   14  152   29]
 [  96  333   17  489   30  199   51  131   76  378]
 [ 105    9  272   72  590    0  233    8  336  175]
 [  19  158    2  149    3 1207    8  207    6   41]
 [ 181    3  423   64  272    4  569   26  163   95]
 [   6  267    0   90    0  647    0  725    2   63]
 [ 198   52  295  143  267   31  167   41  301  305]
 [  77  231   23  310   82   16   15   58  248  740]]
Classification Report:
              precision    recall  f1-score   support

           0       0.27      0.16      0.20      1800
           1       0.49      0.56      0.52      1800
           2       0.37      0.39      0.38      1800
           3       0.32      0.27      0.29      1800
           4       0.33      0.33      0.33      1800
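
The confusion matrices printed above can be mined for the most frequently confused class pairs. A minimal sketch, with the 8-bin matrix copied from the output above; the helper name `top_confusions` is hypothetical.

```python
import numpy as np

# 8-bin confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [174,  18, 205, 116, 205,   32, 436,  29, 438, 147],
    [ 13, 980,   8,  90,  10,  293,  13, 263,  26, 104],
    [ 74,   2, 687,  19, 368,   19, 420,  23, 158,  30],
    [103, 377,  17, 399,  36,  203,  43, 140,  73, 409],
    [ 75,  13, 282,  72, 615,    5, 238,  12, 313, 175],
    [ 25, 141,   2, 160,   3, 1205,   8, 215,   6,  35],
    [142,   3, 446,  58, 289,   21, 547,  41, 161,  92],
    [  4, 270,   0,  89,   0,  685,   1, 688,   2,  61],
    [157,  61, 302, 127, 287,   31, 166,  46, 313, 310],
    [ 61, 226,  27, 291,  94,   20,  18,  60, 242, 761],
])

def top_confusions(cm, n=3):
    """Return the n largest off-diagonal entries as (true, predicted, count)."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    flat_order = np.argsort(off, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat_order, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols)]

for true_cls, pred_cls, count in top_confusions(cm):
    print(f"true {true_cls} predicted as {pred_cls}: {count} times")

recall = np.diag(cm) / cm.sum(axis=1)
print("Per-class recall:", np.round(recall, 2))
```

For this matrix the largest off-diagonal entry is 685: true class 7 predicted as class 5, which are Sneaker and Sandal in the FashionMNIST label order. The same recall computation also recovers the per-class rows that are cut off in the printed report.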
     