<a href="https://colab.research.google.com/github/mhmd2015/AI/blob/main/RAG_Test1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.
Collecting PyMuPDF
  Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.14
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting flash-attn
  Downloading flash_attn-2.7.0.post2.tar.gz (2.7 MB)

In [2]:
# Download PDF file
import os
import requests

# Get PDF document
pdf_path = "2205.07690v1.pdf"

# Download PDF if it doesn't already exist
if not os.path.exists(pdf_path):
  print("File doesn't exist, downloading...")

  # The URL of the PDF you want to download
  url = "https://arxiv.org/pdf/2205.07690v1"

  # The local filename to save the downloaded file
  filename = pdf_path

  # Send a GET request to the URL
  response = requests.get(url)

  # Check if the request was successful
  if response.status_code == 200:
      # Open a file in binary write mode and save the content to it
      with open(filename, "wb") as file:
          file.write(response.content)
      print(f"The file has been downloaded and saved as {filename}")
  else:
      print(f"Failed to download the file. Status code: {response.status_code}")
else:
  print(f"File {pdf_path} exists.")

File 2205.07690v1.pdf exists.


In [3]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
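        pages_and_texts.append({"page_number": page_number - 41,  # note: this offset was written for a PDF whose content starts on page 42; this paper starts on page 1, so it yields negative page numbers (use page_number + 1 for 1-based pages)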
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 2365,
  'page_word_count': 330,
  'page_sentence_count_raw': 7,
  'page_token_count': 591.25,
  'text': 'REAL-TIME SEMANTIC SEGMENTATION ON FPGAS FOR AUTONOMOUS VEHICLES WITH HLS4ML Nicolò Ghielmetti∗, Vladimir Loncar†, Maurizio Pierini, Marcel Roed‡, Sioni Summers European Organization for Nuclear Research (CERN) CH-1211 Geneva 23, Switzerland Thea Aarrestad Institute for Particle Physics and Astrophysics, ETH Zürich 8093 Zürich, Switzerland Christoffer Petersson§ Zenseact Gothenburg, 41756, Sweden Hampus Linander University of Gothenburg Gothenburg, 40530, Sweden Jennifer Ngadiuba Fermi National Accelerator Laboratory Batavia, IL 60510, USA Kelvin Lin¶ University of Washington Seattle, WA 98195, USA Philip Harris Massachusetts Institute of Technology Cambridge, MA 02139, USA May 17, 2022 ABSTRACT In this paper, we investigate how ﬁeld programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks releva

In [4]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': -35,
  'page_char_count': 3572,
  'page_word_count': 621,
  'page_sentence_count_raw': 28,
  'page_token_count': 893.0,
  'text': 'Input image b a c a a b c Shift registers Sliding input window c b Figure 5: Schematic representation of the new hls4ml implementation of Convolutional layers, as described in the text. The hyperparameter scan is done sequentially over the blocks, i.e. the Bayesian search over quantization and ﬁlter count of the initial layer is performed ﬁrst and is then frozen for the hyperparameter scan of the ﬁrst bottleneck and so on. The rest of the model is kept in ﬂoating point until everything in the end is quantized. Figure 4 shows the outcome of the heterogeneous QAT, in terms of validation accuracy and total number of bits for the six blocks in the network. The optimal conﬁguration search is performed taking as a baseline the Enet4 model, scanning the kernel bits in {4, 8} and ﬁxing the number of kernels to four times a by-layer multiplicative c

In [5]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,2365,330,7,591.25,REAL-TIME SEMANTIC SEGMENTATION ON FPGAS FOR A...
1,-40,5351,813,36,1337.75,"challenges faced by, for example, the automoti..."
2,-39,3091,508,24,772.75,Figure 1: An downsampled image from the Citysc...
3,-38,2574,437,19,643.5,Maxpool(2) Pad(2) Concat Skip connection Co...
4,-37,2274,364,13,568.5,"Skip branch Main branch Maxpool(2,2) Pad(1) Co..."


In [6]:
# Get stats
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,11.0,11.0,11.0,11.0,11.0
mean,-36.0,3227.0,515.82,30.55,806.75
std,3.32,1376.29,219.43,25.75,344.07
min,-41.0,359.0,59.0,7.0,89.75
25%,-38.5,2469.5,400.5,16.0,617.38
50%,-36.0,3091.0,514.0,24.0,772.75
75%,-33.5,3967.5,636.5,34.5,991.88
max,-31.0,5351.0,814.0,101.0,1337.75
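
The `page_token_count` column is only the rough "1 token = ~4 characters" estimate from `open_and_read_pdf`. As a minimal sketch of why whole pages are later split into chunks, the snippet below reapplies that estimate to three character counts taken from the table above and compares them with a token limit. The limit of 384 and the helper names are assumptions for illustration, so check the model card of the embedding model you actually use.

```python
# Rough token estimate: ~4 characters per token (same heuristic as page_token_count).
CHARS_PER_TOKEN = 4

# ASSUMPTION: 384 tokens is used as the embedding model's input limit.
# This value is illustrative; verify it for your model.
ASSUMED_MAX_TOKENS = 384

def estimate_tokens(char_count: int) -> float:
    """Return the approximate token count for a text with char_count characters."""
    return char_count / CHARS_PER_TOKEN

def fits_in_model(char_count: int, max_tokens: int = ASSUMED_MAX_TOKENS) -> bool:
    """Return True if the estimated token count is within the assumed limit."""
    return estimate_tokens(char_count) <= max_tokens

# Character counts copied from the stats table above.
page_chars = {"first page": 2365, "mean page": 3227, "shortest page": 359}
for name, chars in page_chars.items():
    print(f"{name}: ~{estimate_tokens(chars):.2f} tokens, fits: {fits_in_model(chars)}")
```

Under this assumption most pages are too long to embed in one piece, which is the motivation for the sentence chunking in the cells that follow.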


In [7]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/11 [00:00<?, ?it/s]

In [9]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': -40,
  'page_char_count': 5351,
  'page_word_count': 813,
  'page_sentence_count_raw': 36,
  'page_token_count': 1337.75,
  'text': 'challenges faced by, for example, the automotive industry, will require the capability of processing large amounts of data in real-time, often through edge computing devices with strict latency and power-consumption constraints. This requirement has generated interest in the development of energy-effective neural networks, resulting in efforts like tinyML [3], which aims to reduce power consumption as much as possible without negatively affecting the model accuracy. Advances in Deep Learning for computer vision have had a crucial impact on the development of autonomous vehicles, enabling the vehicles to perceive their environment at ever-increasing levels of accuracy and detail. Deep Neural Networks are used for ﬁnding patterns and extracting relevant information from camera images, such as the precise location of the surrounding vehicles

In [10]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,11.0,11.0,11.0,11.0,11.0,11.0
mean,-36.0,3227.0,515.82,30.55,806.75,27.18
std,3.32,1376.29,219.43,25.75,344.07,19.65
min,-41.0,359.0,59.0,7.0,89.75,5.0
25%,-38.5,2469.5,400.5,16.0,617.38,15.0
50%,-36.0,3091.0,514.0,24.0,772.75,24.0
75%,-33.5,3967.5,636.5,34.5,991.88,34.5
max,-31.0,5351.0,814.0,101.0,1337.75,75.0
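
The `page_sentence_count_raw` and `page_sentence_count_spacy` columns disagree because `text.split(". ")` is only a heuristic. The sketch below shows two ways it miscounts, using a made-up sentence modelled on the paper's wording (the example text is hypothetical, not extracted from the PDF).

```python
# Hypothetical text with two real sentences and one abbreviation ("Fig.").
text = "The initial block, shown in Fig. 2, encodes the input. It is followed by bottlenecks."

# The raw count treats every ". " as a sentence boundary.
naive_pieces = text.split(". ")
print(len(naive_pieces))
for piece in naive_pieces:
    print(repr(piece))

# An empty page still yields one "sentence", because "".split(". ") returns [""].
empty_page_count = len("".split(". "))
print(empty_page_count)
```

The abbreviation produces an extra boundary, so the raw count is 3 for a text with 2 sentences, and an empty page is counted as 1.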


In [11]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of the desired size (iteratively, via slicing)
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/11 [00:00<?, ?it/s]
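
As a quick check of the docstring's example, here is a self-contained sketch that repeats `split_list` and runs it on 17 placeholder sentences (the placeholder strings are made up for illustration).

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# 17 placeholder sentences, chunked in groups of 10.
sentences = [f"Sentence {n}." for n in range(17)]
chunks = split_list(sentences, slice_size=10)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [10, 7]

# Edge case: an empty page produces no chunks at all.
print(split_list([], slice_size=10))  # []
```

The last chunk simply holds whatever is left over, so chunk sizes can vary a lot between pages.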

In [12]:
# Sample an example from the group (note: pages with <=10 sentences have only 1 chunk)
random.sample(pages_and_texts, k=1)

[{'page_number': -39,
  'page_char_count': 3091,
  'page_word_count': 508,
  'page_sentence_count_raw': 24,
  'page_token_count': 772.75,
  'text': 'Figure 1: An downsampled image from the Cityscapes dataset (left) and the corresponding semantic segmentation target (right), in which the pixels belong to one of the classes {background (blue), road (teal), car (yellow), person (red)}. 3 Baseline model The architecture we use is inspired by a fully convolutional residual network called Efﬁcient Neural Network (ENet) [16]. This network was designed for low latency and minimal resource usage. It is designed as a sequence of blocks, summarized in Table 1. The initial block, shown in the left ﬁgure in Fig. 2, encodes the input into a 32×120×76 tensor, which is then processed by a set of sequential blocks of bottlenecks. The ﬁrst three blocks constitute the downsampling encoder, where each block consists of a series of layers as summarized in the left diagram in Fig. 3. The ﬁnal two blocks pro

In [13]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,-36.0,3227.0,515.82,30.55,806.75,27.18,3.18
std,3.32,1376.29,219.43,25.75,344.07,19.65,2.04
min,-41.0,359.0,59.0,7.0,89.75,5.0,1.0
25%,-38.5,2469.5,400.5,16.0,617.38,15.0,2.0
50%,-36.0,3091.0,514.0,24.0,772.75,24.0,3.0
75%,-33.5,3967.5,636.5,34.5,991.88,34.5,4.0
max,-31.0,5351.0,814.0,101.0,1337.75,75.0,8.0


In [14]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/11 [00:00<?, ?it/s]

35

In [15]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': -31,
  'sentence_chunk': '8, 675–686. [23] W. Jia, J. Cui, X. Zheng, and Q. Wu, “Design and implementation of real-time semantic segmentation network based on fpga”, in 2021 7th International Conference on Computing and Artiﬁcial Intelligence, ICCAI 2021, p. 321–325. Association for Computing Machinery, New York, NY, USA, 2021.doi:10.1145/3467707.3467756.11',
  'chunk_char_count': 333,
  'chunk_word_count': 46,
  'chunk_token_count': 83.25}]
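
The sample above contains `2021.doi:...`, where two sentences were joined without a space. The sketch below isolates the join-and-regex step from the chunking cell to show which boundaries it repairs and which it misses. The function name `join_sentences` and the example strings are made up for illustration.

```python
import re

def join_sentences(sentence_chunk: list[str]) -> str:
    """Join sentences exactly as the chunking cell does, then apply its regex fix."""
    joined = "".join(sentence_chunk).replace("  ", " ").strip()
    # ".A" -> ". A": only a full stop followed directly by a capital letter is repaired.
    return re.sub(r'\.([A-Z])', r'. \1', joined)

# Repaired: the next sentence starts with a capital letter.
print(join_sentences(["First sentence.", "Second sentence."]))
# First sentence. Second sentence.

# Not repaired: the next sentence starts with a lowercase letter.
print(join_sentences(["USA, 2021.", "doi:10.1145/3467707.3467756."]))
# USA, 2021.doi:10.1145/3467707.3467756.

# Not repaired: the previous sentence ends with "?" instead of ".".
print(join_sentences(["Is it fast?", "Yes."]))
# Is it fast?Yes.
```

One possible alternative is to join with `" ".join(sentence_chunk)`, which avoids the missing spaces; that would change the character counts shown in this notebook, so it is left as a suggestion only.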

In [16]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,35.0,35.0,35.0,35.0
mean,-35.03,1012.26,160.86,253.06
std,3.31,590.47,99.06,147.62
min,-41.0,66.0,10.0,16.5
25%,-38.0,491.5,66.0,122.88
50%,-34.0,992.0,164.0,248.0
75%,-32.5,1432.0,231.5,358.0
max,-31.0,2364.0,425.0,591.0


In [20]:
# Show up to 3 random chunks with at most min_token_length tokens
min_token_length = 10
short_chunks = df[df["chunk_token_count"] <= min_token_length]
if short_chunks.empty:
    print(f"No chunks with <= {min_token_length} tokens.")
else:
    for _, row in short_chunks.sample(min(3, len(short_chunks))).iterrows():
        print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

In [21]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -41,
  'sentence_chunk': 'REAL-TIME SEMANTIC SEGMENTATION ON FPGAS FOR AUTONOMOUS VEHICLES WITH HLS4ML Nicolò Ghielmetti∗, Vladimir Loncar†, Maurizio Pierini, Marcel Roed‡, Sioni Summers European Organization for Nuclear Research (CERN) CH-1211 Geneva 23, Switzerland Thea Aarrestad Institute for Particle Physics and Astrophysics, ETH Zürich 8093 Zürich, Switzerland Christoffer Petersson§ Zenseact Gothenburg, 41756, Sweden Hampus Linander University of Gothenburg Gothenburg, 40530, Sweden Jennifer Ngadiuba Fermi National Accelerator Laboratory Batavia, IL 60510, USA Kelvin Lin¶ University of Washington Seattle, WA 98195, USA Philip Harris Massachusetts Institute of Technology Cambridge, MA 02139, USA May 17, 2022 ABSTRACT In this paper, we investigate how ﬁeld programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network archi

In [22]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07982697e-02  3.03164832e-02 -2.01217849e-02  6.86484650e-02
 -2.55256221e-02 -8.47686455e-03 -2.07225574e-04 -6.32377118e-02
  2.81606894e-02 -3.33353989e-02  3.02633960e-02  5.30721806e-02
 -5.03527038e-02  2.62288321e-02  3.33313718e-02 -4.51577231e-02
  3.63044813e-02 -1.37122418e-03 -1.20171458e-02  1.14947259e-02
  5.04510924e-02  4.70856987e-02  2.11913940e-02  5.14606535e-02
 -2.03746483e-02 -3.58889401e-02 -6.67763175e-04 -2.94393823e-02
  4.95859198e-02 -1.05639677e-02 -1.52014112e-02 -1.31758570e-03
  4.48197424e-02  1.56023465e-02  8.60379430e-07 -1.21392624e-03
 -2.37978697e-02 -9.09368275e-04  7.34484056e-03 -2.53933994e-03
  5.23370504e-02 -4.68043424e-02  1.66214760e-02  4.71579395e-02
 -4.15599644e-02  9.01976076e-04  3.60277519e-02  3.42214219e-02
  9.68227163e-02  5.94829023e-02 -1.64984372e-02 -3.51249315e-02
  5.92516130e-03 -7.07903586e-04 -2.4103

In [23]:
single_sentence = "Yo! How cool are embeddings?"
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")

Sentence: Yo! How cool are embeddings?
Embedding:
[-1.97448116e-02 -4.51077055e-03 -4.98486962e-03  6.55444860e-02
 -9.87674389e-03  2.72836108e-02  3.66426110e-02 -3.30219767e-03
  8.50078650e-03  8.24952498e-03 -2.28497703e-02  4.02430147e-02
 -5.75200692e-02  6.33691847e-02  4.43207137e-02 -4.49506715e-02
  1.25284614e-02 -2.52011847e-02 -3.55293006e-02  1.29559003e-02
  8.67021922e-03 -1.92917790e-02  3.55635840e-03  1.89505480e-02
 -1.47128161e-02 -9.39848833e-03  7.64175924e-03  9.62184742e-03
 -5.98920882e-03 -3.90168726e-02 -5.47824651e-02 -5.67456335e-03
  1.11644426e-02  4.08067517e-02  1.76319088e-06  9.15305596e-03
 -8.77257995e-03  2.39382870e-02 -2.32784245e-02  8.04999843e-02
  3.19176875e-02  5.12598455e-03 -1.47708450e-02 -1.62525177e-02
 -6.03213124e-02 -4.35689688e-02  4.51211594e-02 -1.79053694e-02
  2.63366792e-02 -3.47866528e-02 -8.89172778e-03 -5.47675341e-02
 -1.24372439e-02 -2.38606706e-02  8.33496898e-02  5.71241677e-02
  1.13328267e-02 -1.49595067e-02  9.2037

In [24]:
%%time

# Send the model to the GPU
embedding_model.to("cuda") # requires a CUDA-capable GPU; for reference, on my local machine I'm using an NVIDIA RTX 4090

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/35 [00:00<?, ?it/s]

CPU times: user 3.1 s, sys: 338 ms, total: 3.44 s
Wall time: 2.49 s


In [25]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

In [26]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: user 940 ms, sys: 4.46 ms, total: 944 ms
Wall time: 818 ms


tensor([[-0.0199,  0.0228,  0.0031,  ..., -0.0688,  0.0063,  0.0173],
        [-0.0090,  0.0336, -0.0204,  ..., -0.0468, -0.0234, -0.0376],
        [-0.0414,  0.0438, -0.0120,  ..., -0.0430, -0.0261, -0.0288],
        ...,
        [-0.0407, -0.0263, -0.0056,  ..., -0.0326, -0.0325, -0.0200],
        [-0.0252,  0.0127, -0.0063,  ..., -0.0500, -0.0211, -0.0151],
        [ 0.0117, -0.0427,  0.0029,  ..., -0.0719,  0.0091,  0.0033]],
       device='cuda:0')

In [27]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [28]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-41,REAL-TIME SEMANTIC SEGMENTATION ON FPGAS FOR A...,2364,329,591.0,[-1.99189372e-02 2.27944255e-02 3.12705827e-...
1,-40,"challenges faced by, for example, the automoti...",1816,256,454.0,[-9.01952107e-03 3.35726589e-02 -2.03959588e-...
2,-40,By applying aggressive ﬁlter-reduction and qua...,1633,234,408.25,[-4.13998924e-02 4.38373275e-02 -1.20260157e-...
3,-40,"With these steps, we obtain a good balance bet...",1188,190,297.0,[-9.95968096e-03 4.90480065e-02 3.64915580e-...
4,-40,In this way all inputs are smaller than one an...,707,129,176.75,[-6.82296976e-02 -3.08206026e-02 1.60923079e-...


In [29]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([35, 768])
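
Saving to CSV turns each embedding into text such as `[-1.99e-02 2.27e-02 ...]`, which is why the column has to be parsed back into numbers after loading. The sketch below is a plain-Python equivalent of that parsing step, shown on a short made-up cell value; `parse_embedding` is a hypothetical helper, not part of any library.

```python
def parse_embedding(cell: str) -> list[float]:
    """Parse a stringified array like '[0.1 0.2\n 0.3]' into a list of floats."""
    # Drop the surrounding brackets, then split on any whitespace (spaces or newlines).
    return [float(value) for value in cell.strip("[]").split()]

# Hypothetical cell value, including the line break that long arrays contain.
cell = "[-1.5e-02  2.25e-02\n  3.0e-03]"
parsed = parse_embedding(cell)
print(parsed)  # [-0.015, 0.0225, 0.003]
```

For larger projects, a binary format (for example `np.save` or `torch.save`) avoids the text round trip and keeps full precision, at the cost of a file that is not human-readable.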

In [30]:
text_chunks_and_embedding_df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-41,REAL-TIME SEMANTIC SEGMENTATION ON FPGAS FOR A...,2364,329,591.0,"[-0.0199189372, 0.0227944255, 0.00312705827, 0..."
1,-40,"challenges faced by, for example, the automoti...",1816,256,454.0,"[-0.00901952107, 0.0335726589, -0.0203959588, ..."
2,-40,By applying aggressive ﬁlter-reduction and qua...,1633,234,408.25,"[-0.0413998924, 0.0438373275, -0.0120260157, 0..."
3,-40,"With these steps, we obtain a good balance bet...",1188,190,297.0,"[-0.00995968096, 0.0490480065, 0.036491558, 0...."
4,-40,In this way all inputs are smaller than one an...,707,129,176.75,"[-0.0682296976, -0.0308206026, 0.0160923079, 0..."


In [31]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device) # choose the device to load the model to

# Searching

In [32]:
# 1. Define the query
# Note: This could be anything. But since we're working with a paper on FPGA-based semantic segmentation, we'll use a query about that paper.
query = "fully-on-chip deployment with a latency"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: fully-on-chip deployment with a latency
Time taken to get scores on 35 embeddings: 0.00296 seconds.


torch.return_types.topk(
values=tensor([0.5494, 0.4831, 0.4765, 0.3932, 0.3429], device='cuda:0'),
indices=tensor([23, 19, 18, 21, 20], device='cuda:0'))
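
To make the search step concrete, here is a minimal pure-Python sketch of what a dot-product score followed by a top-k selection computes. It uses tiny made-up 2-D vectors rather than real embeddings, and the helper names (`dot`, `normalize`, `top_k`) are hypothetical. For unit-length vectors the dot product equals the cosine similarity; whether a given embedding model returns normalized vectors is an assumption worth checking before comparing scores across models.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def top_k(query: list[float], embeddings: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [dot(query, embedding) for embedding in embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in ranked]

# Made-up 2-D "embeddings": index 2 points the same way as the query.
query = normalize([1.0, 0.0])
embeddings = [normalize(v) for v in ([1.0, 1.0], [0.0, 1.0], [2.0, 0.0])]

results = top_k(query, embeddings, k=2)
for index, score in results:
    print(f"index={index} score={score:.4f}")
```

The indices returned this way play the same role as the `indices` tensor above: they are positions in the list of chunks, so they can be used to look up the matching text and page number.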

In [33]:
larger_embeddings = torch.randn(100*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

# Perform dot product across 3,500 random embeddings (100x the number of real ones)
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

Embeddings shape: torch.Size([3500, 768])
Time taken to get scores on 3500 embeddings: 0.02678 seconds.


In [34]:
# Define helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [35]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the paper further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'fully-on-chip deployment with a latency'

Results:
Score: 0.5494
Text:
In order to achieve the lowest possible latency, we implement a fully on-chip
design with high layer parallelism. We optimize for latency, rather than frame
rate, such that in a real-life application the vehicle response time could be
minimized. Keeping up with the camera frame rate is a minimal requirement, but a
latency lower than the frame interval can be utilized. In our approach, each
layer is implemented as a separate module and data is streamed through the
architecture layer by layer. Dedicated per-layer buffers ensure that just enough
data is buffered in order to feed the next layer. This is highly efﬁcient, but
limits the number of layers that can be implemented on the FPGA. Consequently,
in order to ﬁt onto the FPGA in question, our model is smaller and achieves a
lower mIoU. Ref. [23] does not quote a latency, but a frame rate. A best-case
latency is then computed as the inverse of this frame rate