## Implement Caching of Text and Visual Embeddings

In this notebook, we set up a persistent cache for embeddings so that the costly calculations do not have to be repeated for the same PDFs and slides. Each embedding is stored once it has been calculated and is simply fetched from the cache when it is needed again, which saves a lot of time.

- caching_local: Calculates the text and visual embeddings if they have not been calculated yet. The results are stored in a local file using Python's shelve module.
- caching_hf: Also calculates the text and visual embeddings if they have not been calculated yet. The results are stored in a cache dataset on Hugging Face.
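The idea behind a shelve-based cache can be illustrated with a minimal sketch. This is not the implementation of `caching_local`, which is not shown in this notebook; `fake_embed` and `cached_embeddings` are hypothetical names, and the placeholder embedding only stands in for a real model. The key format follows the `<pdf>_slide<n>` pattern visible in the cached dataset further down.

```python
import shelve
import tempfile
from pathlib import Path


def fake_embed(slide_text):
    """Hypothetical placeholder for an expensive embedding model."""
    return [float(len(slide_text)), float(sum(map(ord, slide_text)) % 97)]


def cached_embeddings(pdf_name, slides, cache_path):
    """Return one record per slide, computing only those not yet in the cache."""
    results = []
    computed = 0
    with shelve.open(str(cache_path)) as cache:
        for index, slide_text in enumerate(slides, start=1):
            key = f"{pdf_name}_slide{index}"
            if key not in cache:
                # Cache miss: compute the embedding and persist it on disk.
                cache[key] = {"text_embedding": fake_embed(slide_text)}
                computed += 1
            results.append(cache[key])
    return results, computed


# Use a temporary directory so the sketch does not leave files behind in the project.
cache_file = Path(tempfile.mkdtemp()) / "embedding_cache"
slides = ["What is OMERO?", "Importing images", "Sharing data"]

first, computed_first = cached_embeddings("demo.pdf", slides, cache_file)
second, computed_second = cached_embeddings("demo.pdf", slides, cache_file)
print(computed_first, computed_second)  # prints: 3 0
```

The second call computes nothing because every key is already present in the shelf, which is the behaviour the timings below demonstrate.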



In [1]:
from caching import caching_hf, caching_local

### 1. Caching the results on the local disk

In [2]:
import time
pdf_path = "WhatIsOMERO.pdf"  # Path to your PDF

start_time = time.time()

caching_local(pdf_path)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to calculate the embeddings.')

Caching slide 1
Caching slide 2
Caching slide 3
Caching slide 4
Caching slide 5
Caching slide 6
Caching slide 7
Caching slide 8
Caching slide 9
It took 3.67 seconds to calculate the embeddings.


When the same task is performed again, the embeddings are already stored in the cache, so the call should be much faster:

In [3]:
start_time = time.time()

caching_local(pdf_path)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')

Fetching from cache: Slide 1
Fetching from cache: Slide 2
Fetching from cache: Slide 3
Fetching from cache: Slide 4
Fetching from cache: Slide 5
Fetching from cache: Slide 6
Fetching from cache: Slide 7
Fetching from cache: Slide 8
Fetching from cache: Slide 9
It took 0.01 seconds to fetch the embeddings from the cache.
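The start/stop timing pattern used in these cells could also be wrapped in a small helper. The sketch below is one possible way to do that; `timed` is a hypothetical name and not part of the `caching` module. It uses `time.perf_counter`, which is better suited to measuring short durations than `time.time`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, measured with a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration = time.perf_counter() - start
        print(f"It took {duration:.2f} seconds to {label}.")


# Stand-in workload; in this notebook the body would be a caching call.
with timed("run the example workload"):
    total = sum(range(100_000))
```

In this notebook, the body of the `with` block would be `caching_local(pdf_path)` with a label such as `"fetch the embeddings from the cache"`.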


### 2. Caching the results online via Hugging Face

You need to install the Hugging Face Hub package first and create an account on [Hugging Face](https://huggingface.co/). You also have to create a [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) and set it as an environment variable. For more information on how to do that, check out the [README](https://github.com/NFDI4BIOIMAGE/SlideInsight/blob/main/README.md).
In this example, the data is stored in my repository on Hugging Face.
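Before calling `caching_hf`, it can help to verify that the token is actually visible to the Python process. The following is a minimal sketch under the assumption that the token is stored in `HF_TOKEN`, the variable name that `huggingface_hub` reads by default; the README may prescribe a different name. `read_hf_token` and `login_to_hub` are hypothetical helpers, not part of the `caching` module.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable, or raise."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token first."
        )
    return token


def login_to_hub(env_var="HF_TOKEN"):
    """Authenticate this session; needs the huggingface_hub package and network access."""
    # Imported lazily so that read_hf_token works even without the package installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))
```

Calling `read_hf_token()` at the top of the notebook fails early with a readable message instead of an authentication error in the middle of an upload.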

In [4]:
repo_name = "lea-33/SlightInsight_Cache2"  # Change this to your Hugging Face repository name

start_time = time.time()

caching_hf(pdf_path, repo_name)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to calculate the embeddings.')

Repository 'lea-33/SlightInsight_Cache2' created.
Caching Slide 1
Caching Slide 2
Caching Slide 3
Caching Slide 4
Caching Slide 5
Caching Slide 6
Caching Slide 7
Caching Slide 8
Caching Slide 9


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Finished caching WhatIsOMERO.pdf
It took 7.91 seconds to calculate the embeddings.


Again, running the same call a second time should be faster, because the embeddings can now be fetched directly from the storage on Hugging Face.

In [5]:
start_time = time.time()

caching_hf(pdf_path, repo_name)

end_time = time.time()
duration = end_time - start_time
print(f'It took {duration:.2f} seconds to fetch the embeddings from the cache.')

Repository 'lea-33/SlightInsight_Cache2' already exists.


Generating train split:   0%|          | 0/9 [00:00<?, ? examples/s]

Fetching from cache: Slide 1
Fetching from cache: Slide 2
Fetching from cache: Slide 3
Fetching from cache: Slide 4
Fetching from cache: Slide 5
Fetching from cache: Slide 6
Fetching from cache: Slide 7
Fetching from cache: Slide 8
Fetching from cache: Slide 9


Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Finished caching WhatIsOMERO.pdf
It took 3.77 seconds to fetch the embeddings from the cache.
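The log above suggests that the second run still downloads the dataset ("Generating train split") and attempts an upload that is then skipped, which presumably accounts for most of the remaining 3.77 seconds compared with the local cache. A remote cache of this kind could be organised roughly as sketched below. This is an assumption about the general flow, not the actual code of `caching_hf`; `merge_cache_records` and `sync_with_hub` are hypothetical names, and the record layout (`key`, `value`) is taken from the dataset preview in the next section.

```python
def merge_cache_records(existing_records, new_records):
    """Combine two lists of {"key", "value"} records, keeping one entry per key."""
    merged = {record["key"]: record["value"] for record in existing_records}
    added = 0
    for record in new_records:
        if record["key"] not in merged:
            # Only keys that are not cached yet count as new work.
            merged[record["key"]] = record["value"]
            added += 1
    return [{"key": key, "value": value} for key, value in merged.items()], added


def sync_with_hub(repo_name, new_records):
    """Download the cached split, add missing records, upload only if something changed."""
    # Lazy import: this part needs the `datasets` package, a valid token,
    # network access, and an existing repository with a "train" split.
    from datasets import Dataset, load_dataset

    existing = load_dataset(repo_name, split="train").to_list()
    merged, added = merge_cache_records(existing, new_records)
    if added:
        Dataset.from_list(merged).push_to_hub(repo_name)
    return added
```

Skipping the upload when nothing was added avoids one network round trip on a fully cached run.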


### 3. Load the dataset from the cache and convert it to a pandas DataFrame for easy processing

In [6]:
from datasets import load_dataset
import pandas as pd

# Helper that loads the cached dataset from the Hub and prints a short preview
def load_and_display_cache(repo_name):
    # Load the dataset from Hugging Face
    cache_dataset = load_dataset(repo_name, split="train")
    
    # Convert to pandas DataFrame for better visualization
    df = pd.DataFrame(cache_dataset)

    # Display a preview of the dataset
    print("Dataset Preview:")
    print(df.head())
    
    # Example: display the first ten values of each embedding from the first record
    first_record = cache_dataset[0]
    
    print("\nFirst Text Embedding:")
    print(first_record["value"]["text_embedding"][:10], "...") 
    
    print("\nFirst Vision Embedding:")
    print(first_record["value"]["vision_embedding"][:10], "...")


load_and_display_cache("lea-33/SlightInsight_Cache2")


Dataset Preview:
                      key                                              value
0  WhatIsOMERO.pdf_slide1  {'text_embedding': [0.4003332853317261, -0.336...
1  WhatIsOMERO.pdf_slide2  {'text_embedding': [0.39082658290863037, -0.28...
2  WhatIsOMERO.pdf_slide3  {'text_embedding': [0.18631458282470703, -0.37...
3  WhatIsOMERO.pdf_slide4  {'text_embedding': [0.18063969910144806, -0.60...
4  WhatIsOMERO.pdf_slide5  {'text_embedding': [-0.44303596019744873, -0.5...

First Text Embedding:
[0.4003332853317261, -0.33649125695228577, 0.3998110592365265, -0.4730990529060364, -0.5025672316551208, 0.12307340651750565, -0.24336643517017365, -0.3277848958969116, 0.29507237672805786, 0.5909251570701599] ...

First Vision Embedding:
[-0.037381067872047424, 0.4586034417152405, 0.020449191331863403, 0.13002845644950867, 0.3475934863090515, -0.14490166306495667, -0.16358992457389832, 0.13041885197162628, -0.04649023711681366, 0.08413688838481903] ...
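Once the records are loaded, the embeddings can be compared directly. The sketch below computes the cosine similarity between two slides using only the standard library. The records are toy values in the same `key`/`value` layout as the preview above, not real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy records in the same layout as the cached dataset (values are made up).
records = [
    {"key": "demo.pdf_slide1",
     "value": {"text_embedding": [1.0, 0.0], "vision_embedding": [0.0, 1.0]}},
    {"key": "demo.pdf_slide2",
     "value": {"text_embedding": [2.0, 0.0], "vision_embedding": [1.0, 0.0]}},
]

# Index the embeddings by slide key for direct lookup.
text_by_key = {r["key"]: r["value"]["text_embedding"] for r in records}
vision_by_key = {r["key"]: r["value"]["vision_embedding"] for r in records}

text_similarity = cosine_similarity(
    text_by_key["demo.pdf_slide1"], text_by_key["demo.pdf_slide2"]
)
vision_similarity = cosine_similarity(
    vision_by_key["demo.pdf_slide1"], vision_by_key["demo.pdf_slide2"]
)
print(f"{text_similarity:.2f} {vision_similarity:.2f}")  # prints: 1.00 0.00
```

With the real data, `records` could be obtained by iterating over `cache_dataset`, since each row is a dictionary with the same two fields.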
