<a href="https://colab.research.google.com/github/martasannzz/hola_mundo/blob/main/class/NLP/Image_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A powerful **GPU** is highly recommended. You can use one for free by uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/Image_search.ipynb">
        <img src="https://colab.research.google.com/img/colab_favicon_256px.png"  width="50" height="50" style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/Image_search.ipynb">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png"  width="50" height="50" style="padding-bottom:5px;" />View Source on GitHub</a></td>
</table>

# Introduction to Image Similarity

In this notebook, we'll introduce image search using [`sentence-transformers`](https://www.sbert.net/), a Python library for state-of-the-art sentence, text and image embeddings.

## What is Image Similarity?

Image similarity is the task of finding images that are visually alike, from near-identical duplicates to images that share a theme. It underpins systems in:

- **Visual Search:** Retailers and online marketplaces use image similarity to provide product recommendations based on user-uploaded photos. This technology enhances the shopping experience by allowing users to search for products using images instead of words.

- **Content Discovery:** Social media platforms and content management systems rely on image similarity to categorize and recommend content, helping users discover new posts related to what they already like.

- **Digital Archiving:** In libraries and archives, image similarity helps in organizing, indexing, and retrieving visual content from vast databases, making it easier to find historical documents and artworks.

- **Security and Surveillance:** Image similarity algorithms can identify objects or persons of interest across different video frames or locations, contributing to safety and law enforcement efforts.

- **Healthcare:** In medical imaging, similarity measures can help in identifying similar case histories, understanding disease progression, and even assisting in diagnosis by comparing patient scans with a database of known conditions.

## Why is it Challenging?

Images can vary in size, angle, and lighting, and may even be partially obscured. Traditional methods that rely on exact matches fall short under these conditions. Modern image similarity techniques therefore use deep learning, in particular models like CLIP (Contrastive Language-Image Pretraining) developed by OpenAI, to quantify likeness in a way that is closer to human perception.

## Objectives of this Tutorial

In this tutorial, we will see how Sentence Transformers and the CLIP model map images and texts into a shared vector space. This mapping lets us search and retrieve images from textual descriptions, bridging text and visual content.

By the end of this session, you'll understand how to:
- Utilize the CLIP model to create embeddings for images and text.
- Implement an image search system that can find similar images based on text queries.
- Explore practical applications and considerations in deploying image similarity models.

For this exercise we will use the pretrained-models part of this library, and more specifically the CLIP models.
As shown on the website, there are models of different sizes and accuracies, as well as a multilingual model. We simply need to pick the name of the model we want and see how to use it.

# Image search

In this notebook, we'll introduce image search using Sentence Transformers, by mapping images and texts into the same vector space. This enables us to perform search and retrieval tasks for images based on textual descriptions.

We will use CLIP models, which were developed by OpenAI. They are multimodal models (they can process different types of data), in this case images and texts together.

To achieve this, we'll utilize the [CLIP (Contrastive Language-Image Pretraining)](https://openai.com/research/clip) model, which is designed to learn a joint embedding space for both images and texts.

Contrastive Language-Image Pretraining (CLIP) is an AI model developed by OpenAI. It learns visual concepts from natural-language supervision, that is, from images paired with the text that describes them.

1. Multimodal Learning: CLIP is a multimodal model that can understand both images and text. It is pretrained on a large dataset containing pairs of images and their associated text captions, learning to associate visual concepts with natural language.

2. Contrastive Learning: CLIP learns by optimizing a contrastive objective. It is trained to recognize which image-caption pairs are correct among a set of negative examples. By learning to score the correct image-text pairs higher than incorrect ones, the model learns a useful representation for both modalities.

3. Architecture: CLIP uses a Transformer-based architecture for processing text and a Vision Transformer or ResNet architecture for processing images. The image and text encoders are jointly trained, allowing the model to align both modalities in a shared embedding space.
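
The contrastive objective described in point 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: the function name `clip_style_contrastive_loss`, the toy 8-dimensional vectors, and the temperature value of 0.07 are assumptions chosen for the example.

```python
import numpy as np


def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to form a matching pair;
    every other combination in the batch acts as a negative example.
    """
    # L2-normalise so that the dot product equals the cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))                           # toy "caption" embeddings
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))   # each image close to its caption
shuffled_images = aligned_images[[1, 2, 3, 0]]               # every image paired with a wrong caption

loss_aligned = clip_style_contrastive_loss(aligned_images, text_emb)
loss_shuffled = clip_style_contrastive_loss(shuffled_images, text_emb)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is lower when each image sits next to its own caption in the shared space, which is the arrangement that training pushes the two encoders toward.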


In [None]:
# Install the sentence-transformers library
!pip install -U sentence-transformers

In [None]:
import sentence_transformers
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from PIL import Image
import glob
import pickle
import zipfile
import copy
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm

In [None]:
# First, we load the respective CLIP model
model_name = 'clip-ViT-B-32'
model = SentenceTransformer(model_name)  # Create the SentenceTransformer model from the model name
# This downloads the model; the larger the model, the longer it takes, especially when working locally

In [None]:
import requests
from io import BytesIO

# Helper that downloads the image at a given URL and returns it as a PIL Image object
def get_image_from_url(url):
    response = requests.get(url)
    response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
    img = Image.open(BytesIO(response.content))
    return img

To search images, we first need an image set.

In [None]:
img_url_path = 'https://github.com/ezponda/intro_deep_learning/raw/main/images/'
img_urls = [
    f'{img_url_path}eiffel_tower.jpeg',
    f'{img_url_path}taj_mahal.jpeg',
    f'{img_url_path}colosseum.jpeg',
    f'{img_url_path}great_wall_of_china.jpeg',
    f'{img_url_path}statue_of_liberty.jpeg',
]
images = [get_image_from_url(url) for url in img_urls]
# `images` now holds the downloaded images as PIL Image objects

print('Sample images: ')
for url, image in zip(img_urls, images):
    print('_'*50)
    print(f'url: {url}')
    display(image)

In [None]:
# The CLIP model processes images and texts in the same space
# We build a vector representation (embedding) of any object we pass in, whether image or text
img_embeddings = model.encode(images,  # Accepts either a list of images or a list of texts
                       batch_size=128,
                       convert_to_tensor=True,
                       show_progress_bar=True)
img_embeddings = img_embeddings.cpu()
print(img_embeddings.shape)
# Whether the input is an image or a text, it is mapped to a 512-dimensional vector.
# So every object we pass in, whatever its original size, becomes a 512-dimensional vector.
# This is why images and texts can be represented together: they share the same vector representation in the same space.

Now, let's define a function to perform image search, given a query and a list of image embeddings.

One of the first applications is searching a dataset of images with a text query.
We encode the 5 images, which gives us 5 vectors of 512 dimensions each.
When we issue a query such as "give me a building in China", we take the query string and encode it into a vector in the same 512-dimensional space. We then have one 512-dimensional query vector and 5 image vectors of 512 dimensions each.
To retrieve the most similar images in this space, we pick the image vectors closest to the query vector. Closeness can be measured with Euclidean distance or with cosine similarity; here we use cosine similarity.
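
The choice between the two measures can be illustrated with a small sketch. Random vectors stand in for real CLIP embeddings, and the names `cosine_scores`, `toy_embeddings`, and `toy_query` are hypothetical. On unit-normalised vectors the squared Euclidean distance equals `2 - 2 * cosine`, so both measures produce the same ranking; without normalisation they can disagree.

```python
import numpy as np


def cosine_scores(query, embeddings):
    """Cosine similarity between one query vector and each row of `embeddings`."""
    query = query / np.linalg.norm(query)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings @ query


rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(5, 512))                 # stand-in for 5 image embeddings
toy_query = toy_embeddings[3] + 0.5 * rng.normal(size=512)  # a query that lies close to item 3

# Rank by cosine similarity (higher is better)
scores = cosine_scores(toy_query, toy_embeddings)
ranking_cosine = np.argsort(-scores)

# Rank by Euclidean distance on unit-normalised vectors (lower is better)
unit_embeddings = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
distances = np.linalg.norm(unit_embeddings - unit_query, axis=1)
ranking_euclidean = np.argsort(distances)

print("cosine ranking:   ", ranking_cosine)
print("euclidean ranking:", ranking_euclidean)
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common default for embedding search.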

In [None]:
from typing import List, Union

def image_search(query: str, model: SentenceTransformer, img_embeddings: np.ndarray,
                 images: List[Image.Image], top_k: int = 2) -> None:
    """Perform an image search given a text query.

    This function computes the cosine similarity between the query embedding and the image embeddings,
    retrieves the top_k images with the highest similarity, and displays them.

    Args:
        query (str): The search query as a text string.
        model (SentenceTransformer): The SentenceTransformer model used to encode the text and images.
        img_embeddings (np.ndarray): Precomputed embeddings for the images.
        images (List[Image.Image]): A list of PIL Image objects.
        top_k (int): The number of top results to display.

    """
    # Encode the query to get the embeddings
    query_embedding = model.encode([query])[0]  # Encode the query

    # Compute the cosine similarities between the query embedding and the image embeddings
    similarities = cosine_similarity([query_embedding], img_embeddings)[0]  # Compare against the 5 image embeddings we have

    # Get the indices of the top_k most similar images
    top_k_indices = np.argsort(-similarities)[:top_k]  # Sort by descending cosine similarity and keep the top_k closest images

    # Print the input query for reference
    print(f"Input query: {query}\n")

    # Display the top_k similar images along with their similarity scores
    for index in top_k_indices:
        print('_' * 50)
        print(f"Similarity Score: {similarities[index]:.4f}")
        display(images[index])

In [None]:
image_search('A building in Paris', model, img_embeddings, images, top_k=2)
# Searching for a building in Paris returns the most similar images together with their similarity scores

In [None]:
image_search('Find me an image of a famous monument in India', model, img_embeddings, images, top_k=2)

In [None]:
image_search('A building in China', model, img_embeddings, images, top_k=2)

## Unsplash subset dataset

The approach above also works on larger collections. For example, here we use the Unsplash dataset, a public dataset with 25,000 images.

[Unsplash](https://unsplash.com/data) is an openly shared, collaborative image dataset.
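
With a collection of this size (about 25k images), only the few best matches are needed, so sorting every similarity score is unnecessary. The sketch below shows partial selection with `np.argpartition` followed by a sort of just the selected candidates. The helper name `top_k_indices` and the random scores are assumptions made for illustration.

```python
import numpy as np


def top_k_indices(similarities, top_k):
    """Indices of the top_k largest values, ordered from most to least similar."""
    top_k = min(top_k, len(similarities))
    if top_k <= 0:
        return np.array([], dtype=int)
    # argpartition moves the top_k largest values into the last top_k slots (in no particular order)
    candidates = np.argpartition(similarities, -top_k)[-top_k:]
    # Sort only those top_k candidates by descending similarity
    return candidates[np.argsort(-similarities[candidates])]


rng = np.random.default_rng(0)
toy_similarities = rng.uniform(-1.0, 1.0, size=25_000)  # stand-in for 25k cosine scores

fast = top_k_indices(toy_similarities, top_k=5)
slow = np.argsort(-toy_similarities)[:5]  # full sort, for comparison

print("partial selection:", fast)
print("full sort:        ", slow)
```

Both approaches return the same indices; the partial selection avoids ordering the remaining scores, which matters more as the collection grows.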

In [None]:
# To run text searches or similarity searches over these images, we first need to download them

# Next, we get about 25k images from Unsplash
img_folder = './photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   # Download the dataset if it does not exist locally
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)  # The photo archive is hosted on the sentence-transformers site

    #Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

In [None]:
# Now, we need to compute the embeddings
# To speed things up, we distribute pre-computed embeddings
# Otherwise you can also encode the images yourself.
# To encode an image, you can use the following code:
# from PIL import Image
# img_emb = model.encode(Image.open(filepath))
def read_image_from_path(file_path):
    img = Image.open(file_path)
    return img

use_precomputed_embeddings = True  # The embeddings are already computed, so we just download them

if use_precomputed_embeddings:
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):   # Download the embeddings if they do not exist locally
        util.http_get('http://sbert.net/datasets/'+emb_filename, emb_filename)
        # This gives us the 25,000 embeddings of our images

    with open(emb_filename, 'rb') as fIn:
        img_names, img_embeddings = pickle.load(fIn)


    print("Images:", len(img_names))
else:
    # Without precomputed embeddings, encode a reduced subset of 5,000 images ourselves
    img_paths = sorted(glob.glob(os.path.join(img_folder, '*.jpg')))[:5_000]
    # Keep only the file names, since image_search_from_path joins them with img_folder
    img_names = [os.path.basename(img_path) for img_path in img_paths]
    print("Images:", len(img_names))
    images = [read_image_from_path(img_path) for img_path in img_paths]
    # All that remains is to encode these images
    img_embeddings = model.encode(images, batch_size=128, convert_to_tensor=True, show_progress_bar=True)
    img_embeddings = img_embeddings.cpu()
# Either way, we now have one 512-dimensional vector per image (25,000 when using the precomputed file)
# Here we use the precomputed ones

In [None]:
from typing import List, Union
import os
from PIL import Image
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import copy

# We do exactly the same as before: we have a query, the 25,000 embeddings, and the directory of saved images, so that not all images need to be loaded in memory.
# img_names holds the file names of all the downloaded images


def image_search_from_path(query: Union[str, Image.Image], model: SentenceTransformer, img_embeddings: np.ndarray,
                           img_folder: str, img_names: List[str], top_k: int = 2) -> None:
    """Perform an image search for a given text or image query within a set of images located in a specified folder.

    Args:
        query (str or Image.Image): The text or image query used to search for similar images.
        model (SentenceTransformer): The SentenceTransformer model used for encoding.
        img_embeddings (np.ndarray): The precomputed embeddings of the images.
        img_folder (str): The folder where images are stored.
        img_names (List[str]): The filenames of the images.
        top_k (int): The number of top results to return.

    """
    try:
        # Generate the embedding for the query
        query_embedding = model.encode([query])[0]
        # Calculate the similarities between the query embedding and image embeddings
        similarities = cosine_similarity([query_embedding], img_embeddings)[0]
        # Identify the indexes of the top_k most similar images
        indexes = np.argpartition(similarities, -top_k)[-top_k:]
        indexes = indexes[np.argsort(-similarities[indexes])]

        print(f"Input query: {query}\n")
        for index in indexes:
            similarity_score = similarities[index]
            image_name = img_names[index]
            image_path = os.path.join(img_folder, image_name)  # Build the path and load the image from disk so it can be displayed
            try:
                with Image.open(image_path) as img:
                    print('_' * 50)
                    print(f"Similarity: {similarity_score:.4f}")
                    display(copy.deepcopy(img))
            except Exception as e:
                print(f"Error displaying image {image_name}: {e}")
    except Exception as e:
        print(f"Error in image search: {e}")

# For each query, this returns the images among the 25,000 that are most similar to what we asked for


In [None]:
image_search_from_path('A building in Paris', model, img_embeddings, img_folder, img_names, top_k=2)

In [None]:
image_search_from_path('A building in China', model, img_embeddings, img_folder, img_names, top_k=2)

In [None]:
image_search_from_path('Two dogs playing in the snow', model, img_embeddings, img_folder, img_names, top_k=2)

## Image-to-Image Search
The same method also works for image-to-image search.

Since we are working in a vector space, we can directly ask for the images closest to a given image.

To do this, pass the image returned by `get_image_from_url(url)` as the query to the search function.

It will then return similar images.

For example, we take an image of the Eiffel Tower, pass it in, and get back the images most similar to it.
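
Because the query is just another vector in the same space, nothing in the search changes when it comes from an image instead of a string. One practical detail of image-to-image search: if the query image is itself part of the indexed collection, it comes back as its own best match with a similarity close to 1. The sketch below uses random vectors in place of real CLIP embeddings; `nearest_neighbours` and its `exclude_index` parameter are hypothetical names introduced for illustration.

```python
import numpy as np


def nearest_neighbours(query_embedding, corpus_embeddings, top_k=3, exclude_index=None):
    """Return (index, cosine similarity) pairs for the top_k closest corpus vectors."""
    query = query_embedding / np.linalg.norm(query_embedding)
    corpus = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    similarities = corpus @ query
    if exclude_index is not None:
        # Push the excluded item to the bottom of the ranking
        similarities[exclude_index] = -np.inf
    order = np.argsort(-similarities)[:top_k]
    return [(int(i), float(similarities[i])) for i in order]


rng = np.random.default_rng(7)
toy_corpus = rng.normal(size=(6, 512))  # stand-in for 6 indexed image embeddings
query_index = 2                         # the query is an image already in the collection

with_self = nearest_neighbours(toy_corpus[query_index], toy_corpus, top_k=3)
without_self = nearest_neighbours(toy_corpus[query_index], toy_corpus, top_k=3,
                                  exclude_index=query_index)

print("including the query image:", with_self)
print("excluding the query image:", without_self)
```

Whether to drop the self-match depends on the use case: it is noise for recommendations, but it is exactly what a duplicate detector looks for.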

In [None]:
img = get_image_from_url(img_urls[0])
img

In [None]:
image_search_from_path(img, model, img_embeddings, img_folder, img_names, top_k=5)