## Creating Text Embeddings to identify related slides
This notebook identifies similar or related groups of slides across different PDF files. It works on a dictionary of text-slide pairs, which was created in the [first Notebook](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/Test_Models.ipynb).

To do so, we first load the dictionary from the .yml file:

In [1]:
import yaml

# Load the YAML file containing the image paths and corresponding text
with open("dict_slides_text.yml", "r") as yaml_file:
    slide_dict = yaml.safe_load(yaml_file)

### Pip install the model
Here, the [mxbai-embed-large model](https://ollama.com/library/mxbai-embed-large) is used to create the embedding.

`!pip install -U mixedbread-ai sentence-transformers`

In [2]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

A pandas DataFrame is created to save all important properties of each slide:
- .pdf filename
- Slide number
- Text
- .png filename
- Embedding vector

In [3]:
page_data = []

# Iterate over the dictionary
for png_filename, text in slide_dict.items():
    # Get the PDF name and slide number from the filename
    pdf_name = png_filename.split('_slide')[0]
    
    # Process only entries from the "WhatIsOMERO" PDF
    if pdf_name == "WhatIsOMERO":
        slide_number = int(png_filename.split('_slide')[-1].split('.png')[0])
        
        # Get embedding
        embedding = model.encode(text)

        page_data.append({
            'pdf_filename': f'{pdf_name}.pdf',
            'page_index': slide_number,
            'text': text,
            'png_filename': png_filename,
            'embedding': embedding
        })
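The loop above relies on the `<pdf name>_slide<number>.png` naming convention of the dictionary keys. The following is a minimal sketch of that parsing step as a standalone helper. The name `parse_slide_filename` is hypothetical and not part of the project; it uses a regular expression so that an unexpected key raises a readable error instead of failing inside `int()`.

```python
import re

# Hypothetical helper (not part of the original notebook): splits a key such as
# "WhatIsOMERO_slide3.png" into the PDF name and the slide number.
_SLIDE_PATTERN = re.compile(r"^(?P<pdf>.+)_slide(?P<num>\d+)\.png$")


def parse_slide_filename(png_filename):
    """Return (pdf_name, slide_number) for a '<pdf>_slide<n>.png' key."""
    match = _SLIDE_PATTERN.match(png_filename)
    if match is None:
        raise ValueError(f"Unexpected slide filename: {png_filename!r}")
    return match.group("pdf"), int(match.group("num"))


print(parse_slide_filename("WhatIsOMERO_slide3.png"))
```

Unlike `split('_slide')[0]`, the pattern splits at the last `_slide` in the name, which only matters if a PDF name itself contains that string.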


In [4]:
import pandas as pd

df = pd.DataFrame(page_data)

In [5]:
df

Unnamed: 0,pdf_filename,page_index,text,png_filename,embedding
0,WhatIsOMERO.pdf,1,I3D:bio OMERO user training slides\nHOW TO USE...,WhatIsOMERO_slide1.png,"[0.4003333, -0.33649126, 0.39981106, -0.473099..."
1,WhatIsOMERO.pdf,2,Disclaimer\n• The following slides are intende...,WhatIsOMERO_slide2.png,"[0.39082658, -0.28587455, 0.38830236, -0.37186..."
2,WhatIsOMERO.pdf,3,Research Data Management for Bioimage Data\nat...,WhatIsOMERO_slide3.png,"[0.18631458, -0.3715705, -0.016562128, -0.6950..."
3,WhatIsOMERO.pdf,4,OMERO: An open-source software for image data ...,WhatIsOMERO_slide4.png,"[0.1806397, -0.6081787, -0.6387917, -0.4824682..."
4,WhatIsOMERO.pdf,5,From isolated data silos…\np 5 ADD LOGO SMALL,WhatIsOMERO_slide5.png,"[-0.44303596, -0.50006086, 0.52318454, -0.3337..."
5,WhatIsOMERO.pdf,6,"… to centralized, structured data management\n...",WhatIsOMERO_slide6.png,"[-0.16422987, -0.5776923, 0.6634136, -0.554668..."
6,WhatIsOMERO.pdf,7,OMERO at the ADD INSTITUTE HERE\nService provi...,WhatIsOMERO_slide7.png,"[-0.62620574, 0.008420461, -0.89739054, -0.598..."
7,WhatIsOMERO.pdf,8,Advantages of using OMERO\n• Organize your ori...,WhatIsOMERO_slide8.png,"[0.030079262, 0.27903175, -0.33694014, -0.8887..."
8,WhatIsOMERO.pdf,9,Contact\nPlease review the additional informat...,WhatIsOMERO_slide9.png,"[-0.23266867, -0.5224059, 0.050569788, -0.4461..."


### Pip install UMAP
Now, we perform a dimensionality reduction with UMAP to enable simple 2D plotting of our data points (slides).

`!pip install -U umap-learn`

In [6]:
import numpy as np
import umap.umap_ as umap

# Convert embedding vectors to numpy array for UMAP
embeddings = np.array(df['embedding'].tolist())

# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

df['UMAP0'] = umap_embeddings[:, 0]
df['UMAP1'] = umap_embeddings[:, 1]

df



Unnamed: 0,pdf_filename,page_index,text,png_filename,embedding,UMAP0,UMAP1
0,WhatIsOMERO.pdf,1,I3D:bio OMERO user training slides\nHOW TO USE...,WhatIsOMERO_slide1.png,"[0.4003333, -0.33649126, 0.39981106, -0.473099...",22.988615,6.322064
1,WhatIsOMERO.pdf,2,Disclaimer\n• The following slides are intende...,WhatIsOMERO_slide2.png,"[0.39082658, -0.28587455, 0.38830236, -0.37186...",23.344933,5.576495
2,WhatIsOMERO.pdf,3,Research Data Management for Bioimage Data\nat...,WhatIsOMERO_slide3.png,"[0.18631458, -0.3715705, -0.016562128, -0.6950...",23.924927,6.251481
3,WhatIsOMERO.pdf,4,OMERO: An open-source software for image data ...,WhatIsOMERO_slide4.png,"[0.1806397, -0.6081787, -0.6387917, -0.4824682...",23.767023,6.962986
4,WhatIsOMERO.pdf,5,From isolated data silos…\np 5 ADD LOGO SMALL,WhatIsOMERO_slide5.png,"[-0.44303596, -0.50006086, 0.52318454, -0.3337...",22.896355,7.774837
5,WhatIsOMERO.pdf,6,"… to centralized, structured data management\n...",WhatIsOMERO_slide6.png,"[-0.16422987, -0.5776923, 0.6634136, -0.554668...",22.52528,7.269181
6,WhatIsOMERO.pdf,7,OMERO at the ADD INSTITUTE HERE\nService provi...,WhatIsOMERO_slide7.png,"[-0.62620574, 0.008420461, -0.89739054, -0.598...",24.787344,6.237605
7,WhatIsOMERO.pdf,8,Advantages of using OMERO\n• Organize your ori...,WhatIsOMERO_slide8.png,"[0.030079262, 0.27903175, -0.33694014, -0.8887...",24.230747,7.502076
8,WhatIsOMERO.pdf,9,Contact\nPlease review the additional informat...,WhatIsOMERO_slide9.png,"[-0.23266867, -0.5224059, 0.050569788, -0.4461...",24.280399,5.510377
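UMAP's default `n_neighbors` is 15, which exceeds the nine slides embedded here; umap-learn warns in that situation and truncates the value. A small sketch of choosing the value explicitly is given here. The helper name `choose_n_neighbors` is an assumption made for illustration, not part of umap-learn or this project.

```python
# Hypothetical helper: pick a neighbourhood size that fits the dataset.
def choose_n_neighbors(n_samples, preferred=15):
    """Return `preferred`, capped so it stays below the number of samples."""
    if n_samples < 3:
        # UMAP needs at least 2 neighbours, so at least 3 samples are required here.
        raise ValueError("need at least 3 samples")
    return min(preferred, n_samples - 1)


print(choose_n_neighbors(9))

# Possible use in the cell above (requires umap-learn):
# reducer = umap.UMAP(n_components=2, n_neighbors=choose_n_neighbors(len(df)), random_state=42)
```

With so few points the resulting layout should be read as a rough illustration only.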


### Interactively Plotting Slides and Embedding
In the final step, we can compare different groups of slides and their content.

- The plot on the right shows the 2D representation of the embeddings. Draw a circle around data points to view the corresponding slides on the left.
- Slides with similar text content should have similar vector representations and should therefore appear close to each other in the plot.
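The 2D plot only approximates how close the full embedding vectors are. A common way to compare the vectors directly is cosine similarity, which is what the `cos_sim` function imported earlier computes. The sketch below is a plain-Python stand-in that avoids any dependency; the function names and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors, top_k=3):
    """Return (index, similarity) pairs for the vectors closest to one vector."""
    query = vectors[query_index]
    scores = [
        (i, cosine_similarity(query, vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional vectors standing in for real slide embeddings.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(most_similar(0, toy_vectors, top_k=2))
```

With the real data, `vectors` could be `df['embedding'].tolist()`, and the returned indices map back to rows of the DataFrame.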

In [7]:
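import numpy as np
from skimage.io import imread

def get_images(df):
    """Read the slide image of every row and stack them into one array."""
    images = []
    for _, row in df.iterrows():
        img_path = row['png_filename']  # Path of the slide image for this row
        img = imread(img_path)  # Read the image
        images.append(img)
    return np.asarray(images)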

In [8]:
import stackview
from skimage.io import imread

stackview.sliceplot(df, get_images(df), column_x="UMAP0", column_y="UMAP1", zoom_factor=1.5, zoom_spline_order=2)


# Testing the plotting with a larger Slide Deck

### OPTION 1: Downloading PDFs and extracting their slides as images step by step
To get a better idea of how this plotting works, we now use a larger sample of PDFs from the [training material](https://zenodo.org/records/14030307) collection about Bio-Image Analysis from Robert Haase (licensed under CC-BY 4.0). We also use his implementation to [download the PDFs from Zenodo](https://github.com/haesleinhuepf/stackview/blob/main/docs/sliceplot_datagen.ipynb).


In [1]:
import requests
import os

def download_pdfs_from_zenodo(record_id):
    """Download all PDFs of a Zenodo record into the 'downloads' folder."""
    base_url = f"https://zenodo.org/api/records/{record_id}"
    response = requests.get(base_url)
    response.raise_for_status()  # Stop early if the record cannot be fetched
    data = response.json()
    
    os.makedirs('downloads', exist_ok=True)
    
    files_info = []
    for file in data['files']:
        if file['key'].endswith('.pdf'):
            download_url = file['links']['self']
            filename = record_id + "_" + file['key']
            filepath = os.path.join('downloads', filename)

            if not os.path.exists(filepath):
                # Download the file only if it is not already on disk
                file_response = requests.get(download_url)
                file_response.raise_for_status()
                with open(filepath, 'wb') as f:
                    f.write(file_response.content)
            
            files_info.append({'filename': filename, 'url': download_url})
    
    return files_info


# Download PDFs
files_info = download_pdfs_from_zenodo('12623730')
files_info


[{'filename': '12623730_14_Summary.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/14_Summary.pdf/content'},
 {'filename': '12623730_10_function_calling.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/10_function_calling.pdf/content'},
 {'filename': '12623730_11_prompteng_rag_finetuning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/11_prompteng_rag_finetuning.pdf/content'},
 {'filename': '12623730_12_Vision_models.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/12_Vision_models.pdf/content'},
 {'filename': '12623730_09_Deep_Learning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/09_Deep_Learning.pdf/content'},
 {'filename': '12623730_08_Sup_Unsup_Machine_Learning.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/08_Sup_Unsup_Machine_Learning.pdf/content'},
 {'filename': '12623730_03_RSM_Image_Processing.pdf',
  'url': 'https://zenodo.org/api/records/12623730/files/03_RSM_Image_Processing.pdf/content'},
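The selection logic inside the download function can be tried without any network access. The sketch below isolates it as a pure function, assuming the record dictionary has the same `files`, `key` and `links`/`self` fields the notebook reads from the API response. The name `select_pdf_files` and the example record are made up for illustration.

```python
def select_pdf_files(record, record_id):
    """Return filename/url pairs for every PDF listed in a Zenodo record dict."""
    selected = []
    for entry in record.get("files", []):
        key = entry["key"]
        if key.lower().endswith(".pdf"):
            selected.append({
                "filename": f"{record_id}_{key}",
                "url": entry["links"]["self"],
            })
    return selected


# Made-up record with the same fields the notebook reads from the API response.
example_record = {
    "files": [
        {"key": "14_Summary.pdf", "links": {"self": "https://example.org/14_Summary.pdf"}},
        {"key": "notes.txt", "links": {"self": "https://example.org/notes.txt"}},
    ]
}
print(select_pdf_files(example_record, "12623730"))
```

Keeping the filtering separate from the download makes it easy to check which files would be fetched before any data is transferred.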

## Saving all Slides from the pdfs to .png Images

In [2]:
from pdf2image import convert_from_path
from IPython.display import display
from PIL import Image
from pdf_utilities import load_pdf, save_images, text_extraction, text_extract_from_pdfs

In [3]:
downloads_folder = "downloads"
images_folder = os.path.join("downloads", "images")

# Ensure the "images" folder exists
os.makedirs(images_folder, exist_ok=True)

# Iterate over all files in the downloads folder
for file_name in os.listdir(downloads_folder):
    # Check if the file is a PDF
    if file_name.lower().endswith('.pdf'):
        pdf_path = os.path.join(downloads_folder, file_name)
        print(f"Processing PDF: {pdf_path}")
        
        try:
            # Use the save_images function to save images in the "images" folder
            save_images(filepath=images_folder, pdf=pdf_path, new_width=700)
            print(f"Images for {file_name} saved successfully in {images_folder}.")
        except Exception as e:
            print(f"Error processing {file_name}: {e}")

Processing PDF: downloads/12623730_10_function_calling.pdf
Images for 12623730_10_function_calling.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_09_Deep_Learning.pdf
Images for 12623730_09_Deep_Learning.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_06_Chatbots.pdf
Images for 12623730_06_Chatbots.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_05_Surface_Recon_QA.pdf
Images for 12623730_05_Surface_Recon_QA.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_08_Sup_Unsup_Machine_Learning.pdf
Images for 12623730_08_Sup_Unsup_Machine_Learning.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_04_Image_segmentation.pdf
Images for 12623730_04_Image_segmentation.pdf saved successfully in downloads/images.
Processing PDF: downloads/12623730_02_Introduction_RDM_2024.pdf
Images for 12623730_02_Introduction_RDM_2024.pdf saved successfully in downl

## Extracting the text from each slide and saving it to dict_slides_text.yml

In [4]:
text_extract_from_pdfs(downloads_folder="downloads", yaml_file_path="dict_slides_text.yml")

Processing slides for 12623730_01_Introduction_BIDS_2024...
Processing slides for 12623730_02_Introduction_RDM_2024...
Processing slides for 12623730_03_RSM_Image_Processing...
Processing slides for 12623730_04_Image_segmentation...
Processing slides for 12623730_05_Surface_Recon_QA...
Processing slides for 12623730_06_Chatbots...
Processing slides for 12623730_07_distributed_gpu_computing...
Processing slides for 12623730_08_Sup_Unsup_Machine_Learning...
Processing slides for 12623730_09_Deep_Learning...
Processing slides for 12623730_10_function_calling...
Processing slides for 12623730_11_prompteng_rag_finetuning...
Processing slides for 12623730_12_Vision_models...
Processing slides for 12623730_13_quality_assurance...
Processing slides for 12623730_14_Summary...


## Creating the pandas DataFrame

In [6]:
import yaml
# Load the YAML file containing the image paths and corresponding text
with open("dict_slides_text.yml", "r") as yaml_file:
    slide_dict = yaml.safe_load(yaml_file)

In [7]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

In [8]:
import re

page_data = []
last_pdf_name = None # to keep track of processing

# Iterate over the dictionary
for png_filename, text in slide_dict.items():
    # Get the PDF name and slide number from the filename
    pdf_path = png_filename.split('_slide')[0]
    pdf_name = pdf_path.replace('downloads/', '')

    if pdf_name != "WhatIsOMERO":
        slide_number = int(png_filename.split('_slide')[-1].split('.png')[0])
        cleaned_png_name = re.sub(r"^downloads/", "", png_filename)    
        # Get embedding
        embedding = model.encode(text)
    
        page_data.append({
                'pdf_filename': f'{pdf_name}.pdf',
                'page_index': slide_number,
                'text': text,
                'png_filename': cleaned_png_name,
                'embedding': embedding
        })
        
        # Print a message whenever a new PDF name appears, to keep track of progress
        if pdf_name != last_pdf_name:
            print(f"Finished processing slides for {pdf_name}")
            last_pdf_name = pdf_name


Finished processing slides for 12623730_01_Introduction_BIDS_2024
Finished processing slides for 12623730_02_Introduction_RDM_2024
Finished processing slides for 12623730_03_RSM_Image_Processing
Finished processing slides for 12623730_04_Image_segmentation
Finished processing slides for 12623730_05_Surface_Recon_QA
Finished processing slides for 12623730_06_Chatbots
Finished processing slides for 12623730_07_distributed_gpu_computing
Finished processing slides for 12623730_08_Sup_Unsup_Machine_Learning
Finished processing slides for 12623730_09_Deep_Learning
Finished processing slides for 12623730_10_function_calling
Finished processing slides for 12623730_11_prompteng_rag_finetuning
Finished processing slides for 12623730_12_Vision_models
Finished processing slides for 12623730_13_quality_assurance
Finished processing slides for 12623730_14_Summary
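In the loop above, the message is printed when the first slide of a PDF is reached, not after its last slide. If a message after completion is wanted, one option is to group the entries by PDF with `itertools.groupby`. This is a sketch under the assumption that all slides of one PDF are stored consecutively in the dictionary; `pdf_name_of`, `process_by_pdf` and the toy data are hypothetical names for illustration.

```python
from itertools import groupby


def pdf_name_of(png_filename):
    # Same convention as the notebook: everything before "_slide".
    return png_filename.split("_slide")[0].replace("downloads/", "")


def process_by_pdf(slide_dict, handle_slide):
    """Call handle_slide for each entry and report after each PDF is complete."""
    finished = []
    # groupby only merges consecutive keys, hence the ordering assumption above.
    for pdf_name, items in groupby(slide_dict.items(), key=lambda kv: pdf_name_of(kv[0])):
        for png_filename, text in items:
            handle_slide(png_filename, text)
        print(f"Finished processing slides for {pdf_name}")
        finished.append(pdf_name)
    return finished


toy_slides = {
    "downloads/a_slide1.png": "first",
    "downloads/a_slide2.png": "second",
    "downloads/b_slide1.png": "third",
}
seen = []
done = process_by_pdf(toy_slides, lambda name, text: seen.append(name))
```

In the notebook, `handle_slide` would compute the embedding and append the row to `page_data`.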


In [9]:
import pandas as pd

df_all = pd.DataFrame(page_data)
df_all

Unnamed: 0,pdf_filename,page_index,text,png_filename,embedding
0,12623730_01_Introduction_BIDS_2024.pdf,1,CENTER FOR SCALABLE DATA ANALYTICS\nAND ARTIFI...,12623730_01_Introduction_BIDS_2024_slide1.png,"[-0.05327429, -0.044598766, 0.13387947, 0.0016..."
1,12623730_01_Introduction_BIDS_2024.pdf,2,Hello my name is…\n• Robert Haase\n• Applied i...,12623730_01_Introduction_BIDS_2024_slide2.png,"[0.31092378, -0.544821, 0.118443735, -0.434637..."
2,12623730_01_Introduction_BIDS_2024.pdf,3,Survey\nThink about the FAIR principles for\nd...,12623730_01_Introduction_BIDS_2024_slide3.png,"[0.023842672, 0.07436548, 0.23269953, 0.095499..."
3,12623730_01_Introduction_BIDS_2024.pdf,4,Survey\nWhich open-source license might be\nth...,12623730_01_Introduction_BIDS_2024_slide4.png,"[0.87824875, -0.29804212, -0.18778965, -0.3386..."
4,12623730_01_Introduction_BIDS_2024.pdf,5,Survey\nWhich topic is typically not covered i...,12623730_01_Introduction_BIDS_2024_slide5.png,"[0.29033828, -0.38901746, 0.018388608, -0.0018..."
...,...,...,...,...,...
858,12623730_14_Summary.pdf,65,Benchmarking vision models\n• Prompt: „Analyse...,12623730_14_Summary_slide65.png,"[0.9727107, -0.58993036, 0.11184755, -0.058897..."
859,12623730_14_Summary.pdf,66,CLIP scores\n• Example: Prompt optimization\nA...,12623730_14_Summary_slide66.png,"[0.59838593, 0.15697823, 0.71220344, 0.2943070..."
860,12623730_14_Summary.pdf,67,Testing functional correctness: HumanEval\nPub...,12623730_14_Summary_slide67.png,"[0.32575104, 0.009973767, 0.26995668, 0.425572..."
861,12623730_14_Summary.pdf,68,Modified from\nDS-1000\nstackoverflow\n„functi...,12623730_14_Summary_slide68.png,"[-0.055250976, 0.20740864, 0.6866639, -0.17039..."


## Perform Dimensionality Reduction by using UMAP

In [10]:
import numpy as np
import umap.umap_ as umap

# Convert embedding vectors to numpy array for UMAP
embeddings = np.array(df_all['embedding'].tolist())

# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
umap_embeddings = reducer.fit_transform(embeddings)

df_all['UMAP0'] = umap_embeddings[:, 0]
df_all['UMAP1'] = umap_embeddings[:, 1]

df_all



Unnamed: 0,pdf_filename,page_index,text,png_filename,embedding,UMAP0,UMAP1
0,12623730_01_Introduction_BIDS_2024.pdf,1,CENTER FOR SCALABLE DATA ANALYTICS\nAND ARTIFI...,12623730_01_Introduction_BIDS_2024_slide1.png,"[-0.05327429, -0.044598766, 0.13387947, 0.0016...",6.225243,10.467344
1,12623730_01_Introduction_BIDS_2024.pdf,2,Hello my name is…\n• Robert Haase\n• Applied i...,12623730_01_Introduction_BIDS_2024_slide2.png,"[0.31092378, -0.544821, 0.118443735, -0.434637...",7.837998,8.350363
2,12623730_01_Introduction_BIDS_2024.pdf,3,Survey\nThink about the FAIR principles for\nd...,12623730_01_Introduction_BIDS_2024_slide3.png,"[0.023842672, 0.07436548, 0.23269953, 0.095499...",4.069005,11.226105
3,12623730_01_Introduction_BIDS_2024.pdf,4,Survey\nWhich open-source license might be\nth...,12623730_01_Introduction_BIDS_2024_slide4.png,"[0.87824875, -0.29804212, -0.18778965, -0.3386...",2.105043,11.333110
4,12623730_01_Introduction_BIDS_2024.pdf,5,Survey\nWhich topic is typically not covered i...,12623730_01_Introduction_BIDS_2024_slide5.png,"[0.29033828, -0.38901746, 0.018388608, -0.0018...",4.592541,11.582233
...,...,...,...,...,...,...,...
858,12623730_14_Summary.pdf,65,Benchmarking vision models\n• Prompt: „Analyse...,12623730_14_Summary_slide65.png,"[0.9727107, -0.58993036, 0.11184755, -0.058897...",7.367217,9.412332
859,12623730_14_Summary.pdf,66,CLIP scores\n• Example: Prompt optimization\nA...,12623730_14_Summary_slide66.png,"[0.59838593, 0.15697823, 0.71220344, 0.2943070...",5.801350,7.999154
860,12623730_14_Summary.pdf,67,Testing functional correctness: HumanEval\nPub...,12623730_14_Summary_slide67.png,"[0.32575104, 0.009973767, 0.26995668, 0.425572...",5.278358,8.707326
861,12623730_14_Summary.pdf,68,Modified from\nDS-1000\nstackoverflow\n„functi...,12623730_14_Summary_slide68.png,"[-0.055250976, 0.20740864, 0.6866639, -0.17039...",8.575429,6.964963


#### Upload Data
As this file can get quite big, depending on the number of PDFs we feed in, it might be helpful to store it online rather than on disk. One good option is to store it on the Hugging Face Hub.
To do so, first install the required package with:

`pip install datasets`

You also have to create a [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and set it as an environment variable. For more information on how to do that, check the [ReadMe](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/README.md).

In [11]:
from huggingface_hub import login
import os

#Authenticate your current session
login(token=os.getenv("HUGGINGFACE_TOKEN"))
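`os.getenv` returns `None` when the variable is missing, so a typo in the variable name goes unnoticed until the login step. A small sketch of an explicit check follows; the helper name `read_hf_token` is an assumption for illustration and not part of `huggingface_hub`.

```python
import os


def read_hf_token(env=os.environ, name="HUGGINGFACE_TOKEN"):
    """Return the token stored in the environment, or raise a readable error."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {name} is not set or empty.")
    return token


# Demonstration with an explicit mapping instead of the real environment.
print(read_hf_token({"HUGGINGFACE_TOKEN": "hf_example"}))

# Possible use in the cell above:
# login(token=read_hf_token())
```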

To save the DataFrame, first create a Hugging Face Dataset.

In [12]:
from datasets import Dataset

# Create a Hugging Face dataset from the DataFrame
dataset = Dataset.from_pandas(df_all)
dataset

Dataset({
    features: ['pdf_filename', 'page_index', 'text', 'png_filename', 'embedding', 'UMAP0', 'UMAP1'],
    num_rows: 863
})

In [13]:
# Upload the dataset to huggingface
dataset.push_to_hub("lea-33/SlightInsight_Data")


CommitInfo(commit_url='https://huggingface.co/datasets/lea-33/SlightInsight_Data/commit/ddee771036f208cb2551eaf580bb78de161c9285', commit_message='Upload dataset', commit_description='', oid='ddee771036f208cb2551eaf580bb78de161c9285', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/lea-33/SlightInsight_Data', endpoint='https://huggingface.co', repo_type='dataset', repo_id='lea-33/SlightInsight_Data'), pr_revision=None, pr_num=None)

#### Upload Images
To save the corresponding images, another Dataset is created and pushed to Hugging Face like this:

In [14]:
from datasets import Dataset, Features, Image
import os
from natsort import natsorted

image_folder = "downloads/images"

# List and filter only valid image files
valid_extensions = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp"}  # Add more extensions if needed
image_paths = natsorted(
    [
        os.path.join(image_folder, fname)
        for fname in os.listdir(image_folder)
        if os.path.isfile(os.path.join(image_folder, fname)) and os.path.splitext(fname)[1].lower() in valid_extensions
    ]
)

# Create a dataset
data = [{"image": path} for path in image_paths]

# This specifies the column contains images
features = Features({
    "image": Image(),
})

dataset = Dataset.from_list(data, features=features)

# Preview the dataset
print(dataset)

Dataset({
    features: ['image'],
    num_rows: 863
})
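The image order matters because the images are later matched to the DataFrame rows by position. Plain string sorting would place `slide10` before `slide2`, which is presumably why `natsorted` is used above. The sketch below shows the idea with a simplified natural-sort key; `natural_key` is a stand-in written for illustration and not the implementation used by `natsort`.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'slide10' sorts after 'slide2'."""
    return [int(part) if part.isdigit() else part.lower() for part in re.split(r"(\d+)", name)]


names = ["deck_slide10.png", "deck_slide2.png", "deck_slide1.png"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric order of the slide numbers
```

It is still worth checking that the sorted file names match the `png_filename` column of the DataFrame row by row before uploading.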


In [15]:
dataset.push_to_hub("lea-33/SlideInsight_Images")


CommitInfo(commit_url='https://huggingface.co/datasets/lea-33/SlideInsight_Images/commit/c0a1d3e674e1b5f26383d3d88ab89038b8e29ae7', commit_message='Upload dataset', commit_description='', oid='c0a1d3e674e1b5f26383d3d88ab89038b8e29ae7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/lea-33/SlideInsight_Images', endpoint='https://huggingface.co', repo_type='dataset', repo_id='lea-33/SlideInsight_Images'), pr_revision=None, pr_num=None)

### OPTION 2: Loading the DataFrame and corresponding images from Hugging Face
Once the DataFrame and the images are stored on Hugging Face, we can skip the whole download and embedding calculation and work directly with the data by loading it from Hugging Face.

In [16]:
import pandas as pd

df_loaded = pd.read_parquet("hf://datasets/lea-33/SlightInsight_Data/data/train-00000-of-00001.parquet")

In [17]:
df_loaded

Unnamed: 0,pdf_filename,page_index,text,png_filename,embedding,UMAP0,UMAP1
0,12623730_01_Introduction_BIDS_2024.pdf,1,CENTER FOR SCALABLE DATA ANALYTICS\nAND ARTIFI...,12623730_01_Introduction_BIDS_2024_slide1.png,"[-0.05327429, -0.044598766, 0.13387947, 0.0016...",6.225243,10.467344
1,12623730_01_Introduction_BIDS_2024.pdf,2,Hello my name is…\n• Robert Haase\n• Applied i...,12623730_01_Introduction_BIDS_2024_slide2.png,"[0.31092378, -0.544821, 0.118443735, -0.434637...",7.837998,8.350363
2,12623730_01_Introduction_BIDS_2024.pdf,3,Survey\nThink about the FAIR principles for\nd...,12623730_01_Introduction_BIDS_2024_slide3.png,"[0.023842672, 0.07436548, 0.23269953, 0.095499...",4.069005,11.226105
3,12623730_01_Introduction_BIDS_2024.pdf,4,Survey\nWhich open-source license might be\nth...,12623730_01_Introduction_BIDS_2024_slide4.png,"[0.87824875, -0.29804212, -0.18778965, -0.3386...",2.105043,11.333110
4,12623730_01_Introduction_BIDS_2024.pdf,5,Survey\nWhich topic is typically not covered i...,12623730_01_Introduction_BIDS_2024_slide5.png,"[0.29033828, -0.38901746, 0.018388608, -0.0018...",4.592541,11.582233
...,...,...,...,...,...,...,...
858,12623730_14_Summary.pdf,65,Benchmarking vision models\n• Prompt: „Analyse...,12623730_14_Summary_slide65.png,"[0.9727107, -0.58993036, 0.11184755, -0.058897...",7.367217,9.412332
859,12623730_14_Summary.pdf,66,CLIP scores\n• Example: Prompt optimization\nA...,12623730_14_Summary_slide66.png,"[0.59838593, 0.15697823, 0.71220344, 0.2943070...",5.801350,7.999154
860,12623730_14_Summary.pdf,67,Testing functional correctness: HumanEval\nPub...,12623730_14_Summary_slide67.png,"[0.32575104, 0.009973767, 0.26995668, 0.425572...",5.278358,8.707326
861,12623730_14_Summary.pdf,68,Modified from\nDS-1000\nstackoverflow\n„functi...,12623730_14_Summary_slide68.png,"[-0.055250976, 0.20740864, 0.6866639, -0.17039...",8.575429,6.964963


In [18]:
from datasets import load_dataset
import numpy as np

def get_all_images(dataset_name, split="train"):
    # Load the dataset
    dataset = load_dataset(dataset_name, split=split, streaming=True)

    # Extract images
    images = []
    for sample in dataset:
        img = sample["image"]  # Access the image
        images.append(np.array(img))  # Convert the PIL image to a NumPy array
    
    return np.asarray(images)
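Stacking the images into one array assumes that every image has the same shape. If slide decks with different aspect ratios are mixed, that may not hold. The following sketch finds the odd ones out from a list of shape tuples; the helper name `find_shape_outliers` and the toy shapes are made up for illustration.

```python
from collections import Counter


def find_shape_outliers(shapes):
    """Return indices of images whose shape differs from the most common one."""
    if not shapes:
        return []
    most_common_shape, _ = Counter(shapes).most_common(1)[0]
    return [i for i, shape in enumerate(shapes) if shape != most_common_shape]


# Toy (height, width, channels) tuples; with real data one could pass
# [img.shape for img in images] before calling np.asarray.
toy_shapes = [(394, 700, 3), (394, 700, 3), (525, 700, 3)]
print(find_shape_outliers(toy_shapes))
```

Images flagged this way could be resized or padded to a common shape before they are handed to the plot.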

In [19]:
dataset_name = "lea-33/SlideInsight_Images"
images = get_all_images(dataset_name, split="train")


In [20]:
import stackview
from skimage.io import imread

stackview.sliceplot(df_loaded, images, column_x="UMAP0", column_y="UMAP1", zoom_factor=1, zoom_spline_order=2)


### Optional: Deleting the PDFs and images from your local disk (by deleting the whole downloads folder that was just created)

In [27]:
import os
import shutil

In [28]:
# Specify the folder path
folder_path = "downloads"

try:
    # Delete the entire folder and its contents
    shutil.rmtree(folder_path)
    print(f"Deleted the folder: {folder_path}")
except Exception as e:
    print(f"Error deleting folder {folder_path}: {e}")

Deleted the folder: downloads
