## Creating Text Embeddings to identify related slides
This notebook identifies groups of similar or related slides across different PDF files. It uses a dictionary of text-slide pairs that was created in the [first notebook](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/Test_Models.ipynb).

To do so, we first load the dictionary from the YAML file:

In [1]:
import yaml

# Load the YAML file containing the image paths and corresponding text
with open("dict_slides_text.yml", "r") as yaml_file:
    slide_dict = yaml.safe_load(yaml_file)

### Pip install the model
Here, the [mxbai-embed-large model](https://ollama.com/library/mxbai-embed-large) is used to create the embeddings. The code below loads it through `sentence-transformers` as `mixedbread-ai/mxbai-embed-large-v1`.

`!pip install -U mixedbread-ai sentence-transformers`

In [2]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

A pandas DataFrame stores the relevant properties of each slide:
- .pdf filename (`pdf_filename`)
- Slide number (`page_index`)
- Text (`text`)
- .png filename (`png_filename`)
- Embedding vector (`embedding`)

In [3]:
page_data = []

# Iterate over the dictionary
for png_filename, text in slide_dict.items():
    # Get the PDF name and slide number from the filename
    pdf_name = png_filename.split('_slide')[0]
    
    # Process only entries from the "WhatIsOMERO" PDF
    if pdf_name == "WhatIsOMERO":
        slide_number = int(png_filename.split('_slide')[-1].split('.png')[0])
        
        # Get embedding
        embedding = model.encode(text)

        page_data.append({
            'pdf_filename': f'{pdf_name}.pdf',
            'page_index': slide_number,
            'text': text,
            'png_filename': png_filename,
            'embedding': embedding
        })
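
The filename parsing in the loop above relies on the `<pdf_name>_slide<number>.png` naming pattern. A minimal sketch of the same step as a standalone helper, assuming that pattern holds for every key in the dictionary. The function name `parse_slide_filename` and the regular expression are illustrative and not part of the original notebook:

```python
import re

# Assumed naming pattern: <pdf_name>_slide<number>.png
SLIDE_PATTERN = re.compile(r"^(?P<pdf>.+)_slide(?P<num>\d+)\.png$")


def parse_slide_filename(png_filename):
    """Return (pdf_name, slide_number); raise ValueError if the name does not match."""
    match = SLIDE_PATTERN.match(png_filename)
    if match is None:
        raise ValueError(f"Unexpected slide filename: {png_filename!r}")
    return match.group("pdf"), int(match.group("num"))


print(parse_slide_filename("WhatIsOMERO_slide3.png"))  # ('WhatIsOMERO', 3)
```

Raising an error on unexpected names makes malformed keys visible early instead of producing a wrong slide number later on.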


In [4]:
import pandas as pd

df = pd.DataFrame(page_data)

In [5]:
df

| | pdf_filename | page_index | text | png_filename | embedding |
|---|---|---|---|---|---|
| 0 | WhatIsOMERO.pdf | 1 | I3D:bio OMERO user training slides\nHOW TO USE... | WhatIsOMERO_slide1.png | [0.4003333, -0.33649126, 0.39981106, -0.473099... |
| 1 | WhatIsOMERO.pdf | 2 | Disclaimer\n• The following slides are intende... | WhatIsOMERO_slide2.png | [0.39082658, -0.28587455, 0.38830236, -0.37186... |
| 2 | WhatIsOMERO.pdf | 3 | Research Data Management for Bioimage Data\nat... | WhatIsOMERO_slide3.png | [0.18631458, -0.3715705, -0.016562128, -0.6950... |
| 3 | WhatIsOMERO.pdf | 4 | OMERO: An open-source software for image data ... | WhatIsOMERO_slide4.png | [0.1806397, -0.6081787, -0.6387917, -0.4824682... |
| 4 | WhatIsOMERO.pdf | 5 | From isolated data silos…\np 5 ADD LOGO SMALL | WhatIsOMERO_slide5.png | [-0.44303596, -0.50006086, 0.52318454, -0.3337... |
| 5 | WhatIsOMERO.pdf | 6 | … to centralized, structured data management\n... | WhatIsOMERO_slide6.png | [-0.16422987, -0.5776923, 0.6634136, -0.554668... |
| 6 | WhatIsOMERO.pdf | 7 | OMERO at the ADD INSTITUTE HERE\nService provi... | WhatIsOMERO_slide7.png | [-0.62620574, 0.008420461, -0.89739054, -0.598... |
| 7 | WhatIsOMERO.pdf | 8 | Advantages of using OMERO\n• Organize your ori... | WhatIsOMERO_slide8.png | [0.030079262, 0.27903175, -0.33694014, -0.8887... |
| 8 | WhatIsOMERO.pdf | 9 | Contact\nPlease review the additional informat... | WhatIsOMERO_slide9.png | [-0.23266867, -0.5224059, 0.050569788, -0.4461... |


### Pip install UMAP
Next, we reduce the dimensionality of the embeddings with UMAP so that the data points (slides) can be shown in a simple 2D plot.

`!pip install -U umap-learn`

In [6]:
import numpy as np
import umap.umap_ as umap

# Convert embedding vectors to numpy array for UMAP
embeddings = np.array(df['embedding'].tolist())

# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

df['UMAP0'] = umap_embeddings[:, 0]
df['UMAP1'] = umap_embeddings[:, 1]

df



| | pdf_filename | page_index | text | png_filename | embedding | UMAP0 | UMAP1 |
|---|---|---|---|---|---|---|---|
| 0 | WhatIsOMERO.pdf | 1 | I3D:bio OMERO user training slides\nHOW TO USE... | WhatIsOMERO_slide1.png | [0.4003333, -0.33649126, 0.39981106, -0.473099... | 22.988615 | 6.322064 |
| 1 | WhatIsOMERO.pdf | 2 | Disclaimer\n• The following slides are intende... | WhatIsOMERO_slide2.png | [0.39082658, -0.28587455, 0.38830236, -0.37186... | 23.344933 | 5.576495 |
| 2 | WhatIsOMERO.pdf | 3 | Research Data Management for Bioimage Data\nat... | WhatIsOMERO_slide3.png | [0.18631458, -0.3715705, -0.016562128, -0.6950... | 23.924927 | 6.251481 |
| 3 | WhatIsOMERO.pdf | 4 | OMERO: An open-source software for image data ... | WhatIsOMERO_slide4.png | [0.1806397, -0.6081787, -0.6387917, -0.4824682... | 23.767023 | 6.962986 |
| 4 | WhatIsOMERO.pdf | 5 | From isolated data silos…\np 5 ADD LOGO SMALL | WhatIsOMERO_slide5.png | [-0.44303596, -0.50006086, 0.52318454, -0.3337... | 22.896355 | 7.774837 |
| 5 | WhatIsOMERO.pdf | 6 | … to centralized, structured data management\n... | WhatIsOMERO_slide6.png | [-0.16422987, -0.5776923, 0.6634136, -0.554668... | 22.52528 | 7.269181 |
| 6 | WhatIsOMERO.pdf | 7 | OMERO at the ADD INSTITUTE HERE\nService provi... | WhatIsOMERO_slide7.png | [-0.62620574, 0.008420461, -0.89739054, -0.598... | 24.787344 | 6.237605 |
| 7 | WhatIsOMERO.pdf | 8 | Advantages of using OMERO\n• Organize your ori... | WhatIsOMERO_slide8.png | [0.030079262, 0.27903175, -0.33694014, -0.8887... | 24.230747 | 7.502076 |
| 8 | WhatIsOMERO.pdf | 9 | Contact\nPlease review the additional informat... | WhatIsOMERO_slide9.png | [-0.23266867, -0.5224059, 0.050569788, -0.4461... | 24.280399 | 5.510377 |
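
UMAP is a stochastic, non-linear method, so it can be useful to cross-check a 2D layout against a simpler projection. The sketch below swaps in plain PCA (a different, linear technique, not what the notebook uses) computed with NumPy only. The helper `pca_2d` is hypothetical, and the random array merely stands in for the real embedding matrix; its shape of 9 slides by 1024 dimensions is an assumption:

```python
import numpy as np


def pca_2d(embeddings):
    """Project the rows of a 2D array onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T


# Random stand-in data, not real slide embeddings
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(9, 1024))

coords = pca_2d(toy_embeddings)
print(coords.shape)  # (9, 2)
```

With the notebook's data, `np.array(df['embedding'].tolist())` could be passed in place of `toy_embeddings`, and the two resulting columns plotted next to `UMAP0` and `UMAP1`.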


### Interactively Plotting Slides and Embedding
In the final step, we can compare different groups of slides and their content.

- The plot on the right shows the 2D representation of the embeddings. Drawing a circle around data points displays the corresponding slides on the left.
- Slides with similar text content should have similar vector representations and therefore appear close to each other in the plot.
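
The notion of "similar vector representations" can also be checked numerically, independent of the 2D plot. A minimal sketch using cosine similarity, written with NumPy so that it runs without the model; the notebook itself imports `cos_sim` from `sentence_transformers.util`, which could serve the same purpose. The helper names and the toy vectors are illustrative assumptions:

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2D array."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms  # assumes no all-zero vectors
    return X_unit @ X_unit.T


def nearest_neighbour(similarity, index):
    """Index of the row most similar to `index`, excluding the row itself."""
    row = similarity[index].copy()
    row[index] = -np.inf
    return int(np.argmax(row))


# Toy vectors standing in for slide embeddings
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sim = cosine_similarity_matrix(toy_embeddings)
print(nearest_neighbour(sim, 0))  # 1
```

With the notebook's data, `np.array(df['embedding'].tolist())` could be passed instead of the toy vectors to list, for each slide, the slide whose text is closest in embedding space.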

In [7]:
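import numpy as np
from skimage.io import imread

def get_images(df):
    """Read the slide image of every row and stack them into one array."""
    images = []
    for _, row in df.iterrows():
        img_path = row['png_filename']  # Path to the slide's PNG file
        img = imread(img_path)  # Read the image
        images.append(img)
    # Stacking assumes that all slide images share the same shape
    return np.asarray(images)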

In [8]:
import stackview
from skimage.io import imread

stackview.sliceplot(df, get_images(df), column_x="UMAP0", column_y="UMAP1", zoom_factor=1.5, zoom_spline_order=2)


### Testing the plotting with a larger Slide Deck
To get a better idea of how this plot works, we now use a larger sample of PDFs from the [training material collection](https://zenodo.org/records/14030307) on Bio-Image Analysis by Robert Haase (licensed under CC-BY 4.0). We also use his implementation to [download the PDFs from Zenodo](https://github.com/haesleinhuepf/stackview/blob/main/docs/sliceplot_datagen.ipynb).
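
The download cell below first filters the record's file list for PDFs and prefixes each filename with the record ID. A minimal sketch of that filtering step in isolation, so it can be tried without network access. The helper `select_pdf_files` is hypothetical, and `example_record` is a hand-written stand-in that only mirrors the fields the notebook reads (`files`, `key`, `links.self`); it is not real Zenodo output:

```python
def select_pdf_files(record, record_id):
    """List the PDF entries of a Zenodo-style record dict as filename/url pairs."""
    selected = []
    for entry in record.get("files", []):
        key = entry["key"]
        if key.lower().endswith(".pdf"):
            selected.append({
                "filename": f"{record_id}_{key}",
                "url": entry["links"]["self"],
            })
    return selected


# Hand-written stand-in for the API response (placeholder URLs)
example_record = {
    "files": [
        {"key": "14_Summary.pdf",
         "links": {"self": "https://example.org/files/14_Summary.pdf/content"}},
        {"key": "readme.txt",
         "links": {"self": "https://example.org/files/readme.txt/content"}},
    ]
}

pdf_files = select_pdf_files(example_record, "12623730")
print(pdf_files)
```

Separating the selection from the actual download keeps the part that depends on the response layout easy to test on its own.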

In [9]:
import requests
import os

def download_pdfs_from_zenodo(record_id):
    """Download all PDF files of a Zenodo record into the 'downloads' folder."""
    base_url = f"https://zenodo.org/api/records/{record_id}"
    response = requests.get(base_url)
    response.raise_for_status()  # Stop early if the record cannot be fetched
    data = response.json()

    # Create the target folder if it does not exist yet
    os.makedirs('downloads', exist_ok=True)

    files_info = []
    for file in data['files']:
        if file['key'].endswith('.pdf'):
            download_url = file['links']['self']
            filename = record_id + "_" + file['key']
            filepath = os.path.join('downloads', filename)

            # Skip files that were already downloaded
            if not os.path.exists(filepath):
                response = requests.get(download_url)
                response.raise_for_status()
                with open(filepath, 'wb') as f:
                    f.write(response.content)

            files_info.append({'filename': filename, 'url': download_url})

    return files_info


# Download PDFs
files_info = download_pdfs_from_zenodo('12623730')
files_info

[{'filename': '12623730_14_Summary.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/14_Summary.pdf/content'},
 {'filename': '12623730_10_function_calling.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/10_function_calling.pdf/content'},
 {'filename': '12623730_11_prompteng_rag_finetuning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/11_prompteng_rag_finetuning.pdf/content'},
 {'filename': '12623730_12_Vision_models.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/12_Vision_models.pdf/content'},
 {'filename': '12623730_09_Deep_Learning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/09_Deep_Learning.pdf/content'},
 {'filename': '12623730_08_Sup_Unsup_Machine_Learning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/08_Sup_Unsup_Machine_Learning.pdf/content'},
 {'filename': '12623730_03_RSM_Image_Processing.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/03_RSM_Image_Processing.pdf/content'},