### Caching the Embeddings for Each Presentation Slide in Our Training Material
For each Zenodo record listed in our resources file "nfdi4bioimage.yml", we can generate different embeddings (visual, text and mixed).
Instead of recalculating them for every task, we can calculate the embeddings once and store them somewhere (e.g. on Hugging Face) so that they can be loaded again at any time.
For now, the embeddings are stored as a dataset on Hugging Face.

In [1]:
from huggingface_hub import login
import os

# Authenticate the current session with the token from the HUGGINGFACE_TOKEN environment variable
login(token=os.getenv("HUGGINGFACE_TOKEN"))
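
`os.getenv` returns `None` when the variable is not set, and `login` may then fall back to an interactive prompt, which is easy to miss in a batch run. A minimal sketch of a stricter lookup follows; `get_hf_token` is a hypothetical helper and not part of `huggingface_hub` or of our `caching` module.

```python
import os


def get_hf_token(env=os.environ, var_name="HUGGINGFACE_TOKEN"):
    """Return the token stored in `var_name`, or raise a clear error if it is missing.

    Hypothetical helper: `env` is any mapping, which keeps the function easy to test.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {var_name} is not set or empty.")
    return token


# Usage (assumes huggingface_hub is installed):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Failing early here means the later upload steps never start without valid credentials.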

1. Load all Zenodo Record IDs from our Training Material

In [8]:
from caching import get_zenodo_ids_from_yaml
import requests

file_url = "https://raw.githubusercontent.com/NFDI4BIOIMAGE/training/main/resources/nfdi4bioimage.yml"
yaml_file = "nfdi4bioimage.yml"

# Download the current training material YAML file from the Git repository
response = requests.get(file_url, timeout=30)
response.raise_for_status()  # stop here if the download failed
with open(yaml_file, "wb") as file:
    file.write(response.content)
print(f"File downloaded successfully as {yaml_file}")

# Extract the Zenodo Record IDs
zenodo_ids = get_zenodo_ids_from_yaml(yaml_file)
print(f"Found {len(zenodo_ids)} Zenodo records: {zenodo_ids}")

File downloaded successfully as nfdi4bioimage.yml
Found 38 Zenodo records: ['10008464', '10008465', '10083555', '10654775', '10679054', '10687658', '10793699', '10815329', '10816895', '10886749', '10939519', '10942559', '10970869', '10972692', '10990107', '11031746', '11066250', '11107798', '11265038', '11396199', '11472148', '11474407', '11548617', '12623730', '3778431', '4317149', '4328911', '4330625', '4334697', '4461261', '4630788', '4748510', '4748534', '4778265', '8323588', '8329305', '8329306', '8414318']
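
The implementation of `get_zenodo_ids_from_yaml` is not shown in this notebook. As a rough illustration of the idea, the sketch below pulls record IDs out of raw text with a regular expression. The URL and DOI patterns are assumptions about how the YAML file references Zenodo, and `extract_zenodo_ids` is a hypothetical name. The IDs are sorted as strings, which would match the ordering seen in the output above.

```python
import re

# Assumed patterns: record links such as https://zenodo.org/records/10008464
# (or the older /record/ form) and DOIs such as 10.5281/zenodo.10008464.
ZENODO_ID_PATTERN = re.compile(
    r"(?:zenodo\.org/records?/|10\.5281/zenodo\.)(\d+)", re.IGNORECASE
)


def extract_zenodo_ids(text):
    """Return the unique Zenodo record IDs found in `text`, sorted as strings."""
    return sorted(set(ZENODO_ID_PATTERN.findall(text)))


# Usage with the file downloaded above:
# with open("nfdi4bioimage.yml", encoding="utf-8") as fh:
#     ids = extract_zenodo_ids(fh.read())
```

Working on the raw text avoids depending on the exact YAML structure, at the cost of also matching IDs that appear in free-text fields.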


2. Calculate and save all embeddings in a Hugging Face dataset

In [3]:
from caching import ensure_repo_exists, load_cache_dataset, append_rows_to_dataset, embed_text, embed_visual, embed_mixed, cache_hf
from datasets import concatenate_datasets, load_dataset

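repo_name = "lea-33/SlightInsight_Cache"

# Compute and upload the embeddings record by record;
# records that are already in the dataset are skipped
for record_id in zenodo_ids:
    cache_hf(record_id, repo_name=repo_name)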

Repository 'lea-33/SlightInsight_Cache_Zenodo' already exists.



Skipping Zenodo Record 10687658: Already in dataset.
Repository 'lea-33/SlightInsight_Cache_Zenodo' already exists.



No files have been modified since last commit. Skipping to prevent empty commit.


Finished processing Zenodo Record 10793699.
Repository 'lea-33/SlightInsight_Cache_Zenodo' already exists.
Skipping Zenodo Record 10815329: Already in dataset.
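
The log above suggests that records already present in the dataset are skipped rather than recomputed. A minimal in-memory sketch of that pattern is given below, assuming rows are dicts with a `zenodo_record_id` field (as in the cached dataset's columns). `cache_record` and `compute_rows` are hypothetical names, and the real `cache_hf` may work differently.

```python
def cache_record(record_id, rows, compute_rows):
    """Append rows for `record_id` unless the cache already holds that record.

    `rows` is a list of dicts standing in for the cached dataset.
    `compute_rows` is a hypothetical callable that returns the new rows
    (one per slide) for a record.
    """
    cached_ids = {row["zenodo_record_id"] for row in rows}
    if record_id in cached_ids:
        print(f"Skipping Zenodo Record {record_id}: Already in dataset.")
        return False
    rows.extend(compute_rows(record_id))
    print(f"Finished processing Zenodo Record {record_id}.")
    return True
```

Because the check runs before any embedding is computed, rerunning the loop over all record IDs only costs time for records that are new.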


3. Example: load the cached data for a specific slide of a specific record/presentation

In [8]:
from caching import load_hf_cache
import pandas as pd

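# Load the cached entry for slide 3 of Zenodo record 10008464
cached_data = load_hf_cache("10008464", 3, repo_name=repo_name)

# Wrap the single entry in a list to get a one-row DataFrame
df = pd.DataFrame([cached_data])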

df.head()

|   | key | text | visual | mixed | zenodo_record_id | zenodo_filename | page_number |
|---|---|---|---|---|---|---|---|
| 0 | record10008464_pdf1_slide3 | 2 | 2 | 2 | 10008464 | 2023-Moore-N4BI-AHM-Welcome.pdf | 3 |
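
The `key` column appears to encode the record ID, a PDF counter and the slide number. The sketch below builds and parses keys of that shape. The interpretation of `pdf1` as the index of the PDF within the record is an assumption, and `make_key` and `parse_key` are hypothetical helpers, not functions of the `caching` module.

```python
import re

# Key layout inferred from the example above: record<ID>_pdf<N>_slide<M>
KEY_PATTERN = re.compile(r"^record(\d+)_pdf(\d+)_slide(\d+)$")


def make_key(record_id, pdf_index, slide_number):
    """Build a cache key such as 'record10008464_pdf1_slide3'."""
    return f"record{record_id}_pdf{pdf_index}_slide{slide_number}"


def parse_key(key):
    """Split a cache key into (record_id, pdf_index, slide_number)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected cache key format: {key!r}")
    record_id, pdf_index, slide_number = match.groups()
    # The record ID stays a string, matching how the IDs are handled above
    return record_id, int(pdf_index), int(slide_number)
```

A deterministic key like this makes it possible to check for a single slide in the cache without loading the embeddings themselves.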
