## Perceptual Hashing

### What is Perceptual Hashing?

Perceptual hashing generates a hash value that represents the perceptual (visual) content of an image. Unlike cryptographic hashes, perceptual hashes are designed to be robust against minor modifications such as resizing, compression, or small brightness changes, which makes them useful for comparing images and finding similar ones.
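
To make "similar hash values" concrete: two perceptual hashes are usually compared by their Hamming distance, the number of bits in which they differ. The sketch below uses made-up 64-bit integers (`original`, `slightly_modified`, and `unrelated` are hypothetical values, not hashes of real images) purely to illustrate the comparison.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Hypothetical 64-bit hash values, chosen only for illustration
original = 0xF0E1D2C3B4A59687
slightly_modified = original ^ 0b101   # same hash with two bits flipped
unrelated = 0x0F1E2D3C4B5A6978         # bitwise complement of `original`

print(hamming_distance(original, slightly_modified))  # 2
print(hamming_distance(original, unrelated))          # 64
```

A small distance suggests that two images look alike, while a distance near half the hash length (about 32 for a 64-bit hash) is what two unrelated hashes would produce on average. In the code cells of this notebook, the expression `input_hash - h` on `imagehash` objects computes this same distance.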

### How Does Perceptual Hashing Work?

1. **Image Preprocessing**: The image is converted to grayscale and resized to a smaller, fixed size to normalize and simplify the content.
2. **Feature Extraction**: Features that describe the coarse structure of the image are extracted. For the algorithm used here, these are low-frequency DCT coefficients.
3. **Hash Calculation**: These features are then used to calculate a hash value. The algorithm is designed so that similar images produce similar hash values. Several perceptual hash algorithms exist; this notebook uses the DCT-based perceptual hash (`imagehash.phash`):
    - **3.1** A discrete cosine transform (DCT) is applied to the resized grayscale image.
    - **3.2** The top-left 8x8 block of DCT coefficients, which holds the lowest frequencies, is selected.
    - **3.3** The median value of these coefficients is calculated. (Some descriptions of pHash exclude the DC coefficient, the very first entry, from this step.)
    - **3.4** Each hash bit records whether the corresponding coefficient is above the median, giving a 64-bit hash.
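
The hash calculation steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the `imagehash` implementation: `dct_matrix` and `phash_sketch` are hypothetical helper names, the input is assumed to be an already grayscale, square array (for example 32x32), and the `exclude_dc` flag is an assumption added here to cover both variants of the median step.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Build the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    mat = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    mat[0, :] /= np.sqrt(2.0)         # rescale the DC row so the matrix is orthonormal
    return mat

def phash_sketch(gray: np.ndarray, hash_size: int = 8, exclude_dc: bool = False) -> np.ndarray:
    """Return a flat boolean array of hash_size**2 bits for a square grayscale array."""
    d = dct_matrix(gray.shape[0])
    coeffs = d @ gray @ d.T                      # 3.1: 2-D DCT (rows, then columns)
    low = coeffs[:hash_size, :hash_size]         # 3.2: low-frequency top-left block
    values = low.flatten()
    reference = values[1:] if exclude_dc else values
    median = np.median(reference)                # 3.3: median of the coefficients
    return values > median                       # 3.4: one bit per coefficient

# Synthetic "image" so the example runs without any image file
rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(32, 32))
hash_bits = phash_sketch(image)
print(hash_bits.astype(int).reshape(8, 8))
```

Because the DCT is linear, uniformly scaling the pixel values scales every coefficient together with the median, so the resulting bits stay the same. This is one reason a hash of this kind tolerates small global changes in brightness or contrast.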


In [2]:
import os
import numpy as np
import pickle
from tqdm import tqdm
from PIL import Image
import imagehash
from dask.diagnostics import ProgressBar

In [3]:
# Function to get all image paths in a folder
def get_image_paths(main_folder):
    image_paths = []
    for root, _, files in os.walk(main_folder):
        for file in files:
            if file.lower().endswith(('.jpg', '.jpeg', '.png')):
                full_path = os.path.join(root, file)
                image_paths.append(full_path)
    return image_paths

# Compute hash for an image
def compute_image_hash(image_path, hash_func=imagehash.phash): # uses perceptual hashing by default
    try:
        # The context manager closes the file handle once the hash is computed
        with Image.open(image_path) as img:
            return hash_func(img)
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None

# Precompute hashes for all images in the directory
def precompute_hashes(main_folder, hashes_path, hash_func=imagehash.phash):
    image_paths = get_image_paths(main_folder)
    hashes = []
    valid_image_paths = []

    for img_path in tqdm(image_paths, desc="Computing image hashes"):
        img_hash = compute_image_hash(img_path, hash_func)
        if img_hash is not None:
            hashes.append(img_hash)
            valid_image_paths.append(img_path)

    with open(hashes_path, 'wb') as f:
        pickle.dump((valid_image_paths, hashes), f)

# Find similar images using hash comparison
def find_similar_images_hash(input_img_path, hashes_path, top_n=5, hash_func=imagehash.phash):
    # Load precomputed hashes
    with open(hashes_path, 'rb') as f:
        image_paths, hashes = pickle.load(f)

    # Compute hash for the input image
    input_hash = compute_image_hash(input_img_path, hash_func)
    if input_hash is None:
        print(f"Failed to compute hash for input image: {input_img_path}")
        return []

    # Compute distances between input hash and precomputed hashes
    distances = [input_hash - h for h in hashes]
    indices = np.argsort(distances)[:top_n]
    similar_images = [(image_paths[i], distances[i]) for i in indices]

    return similar_images

# Load precomputed hashes and image paths
# precompute_hashes() stores both as one (image_paths, hashes) tuple in hashes_path,
# so image_paths_path is not read; it is kept only so existing calls still work.
def load_embeddings_and_paths(hashes_path, image_paths_path=None):
    with open(hashes_path, 'rb') as f:
        image_paths, hashes = pickle.load(f)
    return hashes, image_paths


In [4]:
# Example usage
main_folder = "C:\\Users\\lucas\\OneDrive - Hochschule Düsseldorf\\Uni_Drive\\DIV2k"
hashes_path = 'image_hashes.pkl'
image_paths_path = 'image_paths.pkl'


# precompute_hashes() writes paths and hashes together into hashes_path,
# so that file alone tells us whether the hashes still need to be computed.
if not os.path.exists(hashes_path):
    precompute_hashes(main_folder, hashes_path)  # shows its own tqdm progress bar

# Load the stored (image_paths, hashes) tuple
with open(hashes_path, 'rb') as f:
    valid_image_paths, hashes = pickle.load(f)

# time to execute this cell: 100 sec



Computing image hashes:   0%|          | 1/902 [00:00<04:23,  3.42it/s]

Error processing C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\._0008.png: cannot identify image file 'C:\\Users\\lucas\\OneDrive - Hochschule Düsseldorf\\Uni_Drive\\DIV2k\\DIV2K_train_HR\\DIV2K_train_HR\\._0008.png'


Computing image hashes: 100%|██████████| 902/902 [01:26<00:00, 10.46it/s]


In [5]:
# This cell prints the paths of the most similar images; only the input image itself is displayed
input_img_path = "C:\\Users\\lucas\\Downloads\\Gewitter.jpg" # change for new input image
input_image = Image.open(input_img_path)
input_image.show() # display the input image

results = {}

similar_images = find_similar_images_hash(input_img_path, hashes_path)
for img_path, distance in similar_images:
    print(f"Image: {img_path}, Distance: {distance}") # prints the paths of the top n images

Image: C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\0212.png, Distance: 18
Image: C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\0130.png, Distance: 20
Image: C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\0266.png, Distance: 20
Image: C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\0144.png, Distance: 20
Image: C:\Users\lucas\OneDrive - Hochschule Düsseldorf\Uni_Drive\DIV2k\DIV2K_train_HR\DIV2K_train_HR\0300.png, Distance: 20


In [8]:
# This code runs, but the returned images do not look similar to the input; I am going to try to improve this.
similar_images

('C:\\Users\\lucas\\OneDrive - Hochschule Düsseldorf\\Uni_Drive\\DIV2k\\DIV2K_train_HR\\DIV2K_train_HR\\0300.png',
 20)
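
One possible way to act on the note above is to reject matches whose distance is too large instead of always returning the top n. With the default 64-bit hash, distances of 18 to 20 are closer to the roughly 32 bits expected between unrelated hashes than to zero. The sketch below assumes a cutoff of 10 bits, which is only a rule-of-thumb starting point and should be tuned on your own data; `filter_by_distance` and the file name `near_duplicate.png` are hypothetical.

```python
def filter_by_distance(results, max_distance=10):
    """Keep only (path, distance) pairs whose Hamming distance is at most max_distance."""
    return [(path, dist) for path, dist in results if dist <= max_distance]

# Two distances taken from the output above, plus one hypothetical near-duplicate
example_results = [("0212.png", 18), ("0130.png", 20), ("near_duplicate.png", 3)]
close_matches = filter_by_distance(example_results)
print(close_matches)  # [('near_duplicate.png', 3)]
```

Applied to the list returned by `find_similar_images_hash`, an empty result would indicate that the dataset contains no image close enough to the input under the chosen cutoff.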