In [4]:
%pip install pymongo

# Embeddings in Data Engineering: A Comprehensive Guide

This section compares storage options for embeddings. Because embeddings are high-dimensional vectors, the choice of storage determines how efficiently they can be retrieved and analyzed. We cover the following aspects:

## Embedding Storage: Relational Databases vs. NoSQL Databases

When dealing with embeddings, the choice of data storage solution is critical. Here, we compare how relational databases and NoSQL databases fare in terms of storing and managing embeddings:

| Aspect                       | Relational Databases (SQL)                                                                 | NoSQL Databases                                               |
|------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| **Schema Design**            | A naive schema uses one column per dimension, which becomes unwieldy for vectors with hundreds of dimensions. Alternatives are a single BLOB column or, where the engine supports it, an array type. | A document can hold an embedding as an array or list in a single field, which is a more compact and natural structure. |
| **Data Retrieval**           | Lookup by key is straightforward with SQL, but similarity search over vectors is not a native operation in most relational engines and can be inefficient without an extension. | Lookup by key is equally simple. Some NoSQL systems can also be optimized for similarity search (for example through a vector index), which is a common operation when working with embeddings. |
| **Storage Efficiency**       | Storing each dimension in its own column may not be space-efficient, potentially increasing storage costs. | Typically scale horizontally, so a large volume of embeddings can be spread across nodes. |
| **Suitability for Embeddings** | Workable for small collections, but may face challenges storing and retrieving high-dimensional embeddings efficiently at scale. | Often better suited to embeddings because of flexible data models and scalability. |

In the next sections, we will delve deeper into each of these aspects with Python examples demonstrating how to store and retrieve embeddings in both SQL and NoSQL databases, highlighting the performance and efficiency considerations for each.
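
Before looking at the databases themselves, it helps to see what "finding similar embeddings" means in code. The sketch below is a brute-force cosine-similarity search over an in-memory NumPy matrix. The function name `top_k_cosine` and the toy vectors are illustrative assumptions, not part of any database API; a real system would read the vectors from storage first.

```python
import numpy as np

def top_k_cosine(query, matrix, k=2):
    """Return (row_index, score) pairs for the k rows of `matrix` most similar to `query`."""
    # Normalise both sides so that a dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit
    # Sort scores in descending order and keep the k best rows
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

# Toy data: three stored 3-dimensional embeddings and one query (illustrative values)
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
query = np.array([1.0, 0.1, 0.0])

results = top_k_cosine(query, stored, k=2)
print(results)
```

Every stored vector is compared with the query, so the cost grows linearly with the number of embeddings. That linear scan is what the "Data Retrieval" row of the table refers to.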





### Storing and Retrieving Embeddings in SQL Databases

In [1]:
import sqlite3
import numpy as np

# Create a new SQLite database (or connect to an existing database)
conn = sqlite3.connect('embeddings.db')
cursor = conn.cursor()

# Create a new table to store embeddings (assuming 3-dimensional embeddings for simplicity)
cursor.execute('''
CREATE TABLE embeddings (
    id INTEGER PRIMARY KEY,
    dimension1 REAL,
    dimension2 REAL,
    dimension3 REAL
)
''')

# Insert an embedding vector into the database
embedding_vector = np.array([0.1, 0.2, 0.3])
cursor.execute('''
INSERT INTO embeddings (dimension1, dimension2, dimension3)
VALUES (?, ?, ?)
''', tuple(embedding_vector))

# Commit the transaction
conn.commit()

# Retrieve the embedding vector from the database
cursor.execute('SELECT * FROM embeddings WHERE id = 1')
retrieved_vector = cursor.fetchone()[1:]

# Close the connection
conn.close()

# Print the retrieved vector
print(retrieved_vector)
# Output: (0.1, 0.2, 0.3)

(0.1, 0.2, 0.3)


In this script, we are utilizing the SQLite database, a type of relational database, to store and retrieve embedding vectors. Here is a step-by-step explanation of the script and what the output indicates:

1. **Importing Necessary Libraries**:
   ```python
   import sqlite3
   import numpy as np
   ```
   We start by importing the `sqlite3` library, which allows us to work with SQLite databases in Python, and the `numpy` library to handle arrays efficiently.

2. **Creating a New SQLite Database and Initializing a Cursor**:
   ```python
   # Create a new SQLite database (or connect to an existing database)
   conn = sqlite3.connect('embeddings.db')
   cursor = conn.cursor()
   ```
   We create a new SQLite database named 'embeddings.db' (or connect to it if it already exists) and initialize a cursor object to interact with the database.

3. **Creating a New Table to Store Embeddings**:
   ```python
   # Create a new table to store embeddings (assuming 3-dimensional embeddings for simplicity)
   cursor.execute('''
   CREATE TABLE embeddings (
       id INTEGER PRIMARY KEY,
       dimension1 REAL,
       dimension2 REAL,
       dimension3 REAL
   )
   ''')
   ```
   We create a new table named 'embeddings' with columns to store each dimension of the embedding vectors. In this example, we assume 3-dimensional embeddings for simplicity.

4. **Inserting an Embedding Vector into the Database**:
   ```python
   # Insert an embedding vector into the database
   embedding_vector = np.array([0.1, 0.2, 0.3])
   cursor.execute('''
   INSERT INTO embeddings (dimension1, dimension2, dimension3)
   VALUES (?, ?, ?)
   ''', tuple(embedding_vector))
   ```
   We create a numpy array to represent a 3-dimensional embedding vector and insert it into the 'embeddings' table in the database.

5. **Committing the Transaction and Retrieving the Embedding Vector**:
   ```python
   # Commit the transaction
   conn.commit()

   # Retrieve the embedding vector from the database
   cursor.execute('SELECT * FROM embeddings WHERE id = 1')
   retrieved_vector = cursor.fetchone()[1:]
   ```
   We commit the transaction to save the changes to the database, and then retrieve the embedding vector we inserted using a SQL SELECT query.

6. **Closing the Connection and Printing the Retrieved Vector**:
   ```python
   # Close the connection
   conn.close()

   # Print the retrieved vector
   print(retrieved_vector)
   # Output: (0.1, 0.2, 0.3)
   ```
   Finally, we close the connection to the database and print the retrieved vector to verify that it matches the vector we inserted.

**Understanding the Output**:
- The output, `(0.1, 0.2, 0.3)`, confirms that the embedding vector was stored in and retrieved from the SQLite database correctly. Note that `fetchone()` returns a plain tuple rather than a NumPy array, so it needs to be converted with `np.array(...)` before any vector arithmetic. The one-column-per-dimension schema works for this toy case but scales poorly: a 768-dimensional embedding, the size produced by `bert-base-uncased`, would need 768 columns.
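
One way around the column-per-dimension limitation is to store the whole vector in a single BLOB column. The following is a minimal sketch of that idea, not the approach used in the cell above. The table name `embeddings_blob`, the `dim` column, and the choice of `float32` are assumptions made for illustration.

```python
import sqlite3
import numpy as np

# Use an in-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings_blob (id INTEGER PRIMARY KEY, dim INTEGER, vector BLOB)"
)

# A 768-dimensional vector stored as raw float32 bytes (illustrative values)
vector = np.linspace(0.0, 1.0, 768, dtype=np.float32)
conn.execute(
    "INSERT INTO embeddings_blob (dim, vector) VALUES (?, ?)",
    (vector.size, vector.tobytes()),
)
conn.commit()

# Read the bytes back and rebuild the array; the dtype must match the one used when writing
dim, blob = conn.execute(
    "SELECT dim, vector FROM embeddings_blob WHERE id = 1"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)
conn.close()

print(dim, restored.shape, np.array_equal(vector, restored))
```

The schema no longer depends on the embedding size, at the cost of making individual dimensions opaque to SQL queries.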

### Storing and Retrieving Embeddings in NoSQL Databases

In [8]:
from pymongo import MongoClient
import numpy as np

# Connect to the MongoDB server
client = MongoClient('mongodb://localhost:27017/')

# Create a new database and a new collection
db = client['embedding_database']
collection = db['embedding_collection']

# Insert an embedding vector into the database (assuming 3-dimensional embeddings for simplicity)
embedding_vector = np.array([0.1, 0.2, 0.3]).tolist()
collection.insert_one({'embedding_vector': embedding_vector})

# Retrieve the embedding vector from the database
retrieved_document = collection.find_one()
retrieved_vector = retrieved_document['embedding_vector']

# Close the connection
client.close()

# Print the retrieved vector
print(retrieved_vector)
# Output: [0.1, 0.2, 0.3]

[0.1, 0.2, 0.3]


In this script, we are utilizing MongoDB, a popular NoSQL database, to store and retrieve embedding vectors. Here is a step-by-step explanation of the script and what the output indicates:

1. **Importing Necessary Libraries**:
   ```python
   from pymongo import MongoClient
   import numpy as np
   ```
   We start by importing the necessary libraries: `pymongo` to interact with MongoDB and `numpy` to handle arrays efficiently.

2. **Connecting to MongoDB and Creating a Database and Collection**:
   ```python
   # Connect to the MongoDB server
   client = MongoClient('mongodb://localhost:27017/')
   
   # Create a new database and a new collection
   db = client['embedding_database']
   collection = db['embedding_collection']
   ```
   We connect to a MongoDB server running locally and obtain handles to a database named 'embedding_database' and, inside it, a collection named 'embedding_collection'. MongoDB creates both lazily on the first insert, so these lines do not write anything yet.

3. **Inserting an Embedding Vector into the Database**:
   ```python
   # Insert an embedding vector into the database (assuming 3-dimensional embeddings for simplicity)
   embedding_vector = np.array([0.1, 0.2, 0.3]).tolist()
   collection.insert_one({'embedding_vector': embedding_vector})
   ```
   We create a numpy array representing a 3-dimensional embedding vector and convert it to a Python list, which is then inserted into the MongoDB collection as a document. The document contains a field 'embedding_vector' holding the list representation of the embedding vector.

4. **Retrieving the Embedding Vector from the Database**:
   ```python
   # Retrieve the embedding vector from the database
   retrieved_document = collection.find_one()
   retrieved_vector = retrieved_document['embedding_vector']
   ```
   We use the `find_one` method without a filter, which returns a single document from the collection, and then extract its 'embedding_vector' field. If the collection already holds documents from earlier runs, this may not be the document just inserted; in that case, pass a filter such as the `_id` returned by `insert_one`.

5. **Closing the Connection and Printing the Retrieved Vector**:
   ```python
   # Close the connection
   client.close()
   
   # Print the retrieved vector
   print(retrieved_vector)
   # Output: [0.1, 0.2, 0.3]
   ```
   After retrieving the vector, we close the connection to the MongoDB server and print the retrieved vector to verify that it matches the vector we inserted.

**Understanding the Output**:
- The output, `[0.1, 0.2, 0.3]`, confirms that the embedding vector was stored in and retrieved from MongoDB correctly. Because a document field can hold a list of any length, the same code works unchanged for embeddings of any dimensionality.

This script shows one way to store and retrieve embedding vectors with Python and MongoDB, taking advantage of the flexible document model and scalability that NoSQL databases provide.

---

### Open-Source Distributed File Systems for Embedding Storage

In this section, we delve into the functionalities of Hadoop HDFS, a scalable open-source distributed file system, and explore its potential as a solution for storing and managing large volumes of embedding data efficiently.

#### Hadoop HDFS

##### Introduction to HDFS

Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop project, designed to store and manage very large files across a cluster of machines. Files are split into blocks that are distributed and replicated across several servers, so operations can run in parallel on the data. This makes HDFS efficient for storing and analyzing large volumes of data, including high-dimensional embeddings. Replication also makes it fault-tolerant, and adding nodes increases capacity, which gives the reliability and scalability needed when storing embeddings at scale.

##### Storing Embeddings in HDFS

To store embeddings effectively in HDFS, it's essential to consider the following steps:

1. **File Format Selection**: Depending on the nature of the embeddings, selecting an appropriate file format is crucial. Common choices could be plain text formats (like CSV) for ease of use or binary formats (like Parquet) that are more space-efficient and support complex data structures.
   
2. **Data Serialization**: Before storing the embeddings, they need to be serialized. Serialization is the process of converting the in-memory representation of the embeddings (like arrays in Python) to a format that can be stored persistently. In Python, libraries such as `pickle` can be used for serialization.

3. **Writing Data to HDFS**: This involves using HDFS APIs or Python libraries like `pydoop` that allow interaction with HDFS to write the serialized data to the file system. The data is generally written in blocks, distributed across different nodes in the HDFS cluster.

##### Retrieving Embeddings from HDFS

When it comes to retrieving embeddings stored in HDFS for further analysis or processing, the following steps are generally involved:

1. **Reading Data from HDFS**: Utilize HDFS APIs or Python libraries that facilitate reading data from HDFS. The embeddings data stored in files in HDFS can be read into the local computational environment for further operations.

2. **Data Deserialization**: After reading the data, it needs to be deserialized to convert it back to its original format, facilitating further analysis or processing. This involves converting the stored format back into in-memory data structures like arrays or lists.
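
The serialization and deserialization steps can be tried without a cluster. The sketch below round-trips a batch of embeddings through an in-memory byte buffer using NumPy's `.npy` format, swapped in here for `pickle` as an illustration. The helper names `serialize_embeddings` and `deserialize_embeddings` are hypothetical. The HDFS write and read are deliberately left out: the resulting `bytes` object is what an HDFS client would write to, and later read from, a file.

```python
import io
import numpy as np

def serialize_embeddings(array):
    """Convert a NumPy array to bytes in the .npy format."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    return buffer.getvalue()

def deserialize_embeddings(payload):
    """Rebuild the NumPy array from bytes produced by serialize_embeddings."""
    return np.load(io.BytesIO(payload), allow_pickle=False)

# Four random 768-dimensional embeddings stand in for real model output
embeddings = np.random.default_rng(0).random((4, 768), dtype=np.float32)

payload = serialize_embeddings(embeddings)   # bytes ready to be written to storage
restored = deserialize_embeddings(payload)   # array rebuilt after reading the bytes back

print(len(payload), restored.shape, np.array_equal(embeddings, restored))
```

The `.npy` format records the shape and dtype in its header, so the reader does not need to know them in advance.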

Setting up and interacting with an HDFS cluster involves considerable practical complexity, so the hands-on Python examples in the following section use MinIO instead, which can be run locally with Docker while following the same store-and-retrieve pattern.


### Using MinIO for Storing Embeddings

In data engineering, and especially when working with embeddings, storing and managing large volumes of data efficiently is a common requirement. Distributed storage systems address this by providing scalable and reliable storage. MinIO is an open-source object storage server that is compatible with the Amazon S3 API. Unlike HDFS, it is an object store rather than a file system, but its performance and scalability make it a good platform for large data volumes such as embeddings.

#### Why Use MinIO?

1. **Ease of Use**: MinIO is easy to set up and use, making it a convenient choice for developers.
2. **High Performance**: It is designed for high-performance object storage, making it suitable for storing large volumes of data, including embeddings.
3. **S3 Compatibility**: MinIO is compatible with Amazon S3 APIs, which allows for easy integration with S3 compatible tools and SDKs.
4. **Open-Source**: Being open-source, it provides the flexibility of customization and integration as per the project requirements.

#### Setting Up MinIO Locally

To set up MinIO locally, follow these steps:

1. **Install Docker**: Ensure Docker is installed on your system. If not, download it from the [Docker official website](https://www.docker.com/get-started).
   
2. **Pull MinIO Docker Image**: Open a terminal and run the following command to pull the MinIO Docker image:
`docker pull minio/minio`

3. **Run MinIO Server**: Start the MinIO server using the following command:
`docker run -p 9000:9000 minio/minio server /data`

This command starts the MinIO server and maps port 9000 on your local machine to port 9000 on the Docker container, allowing you to access the MinIO server at `http://127.0.0.1:9000`.

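4. **Access MinIO Console**: Once the server is running, open a web browser and go to `http://127.0.0.1:9000` to access the MinIO Console. The default access and secret keys are both `minioadmin`. In recent MinIO releases the console listens on its own port, so the browser may be redirected; adding `--console-address ":9001"` to the `server` command and publishing that port with `-p 9001:9001` pins the console to a known address.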

With MinIO set up, we can proceed to create Python scripts to store and retrieve embeddings in MinIO, demonstrating a practical approach to managing embeddings in data engineering tasks.

In [9]:
!pip install minio
!pip install numpy

---

<div class="alert alert-block alert-info">This part illustrates the storage and retrieval of a large set of embeddings using a distributed storage system. The primary objective is to create a substantial dataset of embeddings derived from a rich source of information: PDF files, particularly books. Below, we explain the rationale behind our approach and the choice of the BERT model for generating embeddings.</div>

#### Creating a Large Set of Embeddings from PDFs

1. **Objective of Creating a Large Dataset**:
   The foremost goal is to create a large dataset of embeddings that can effectively demonstrate the capabilities of distributed file systems in handling substantial volumes of complex data, thereby showcasing efficient storage and retrieval mechanisms.

2. **Utilizing PDFs as a Data Source**:
   PDF files, specifically books, encapsulate a vast repository of knowledge and information on varied subjects. By creating embeddings from the text data in these files, we can construct a meaningful and extensive dataset that serves as a significant asset for various data science applications.

3. **Significance of Embeddings**:
   Embeddings play a pivotal role in converting raw text data into a format conducive to analysis and processing. These high-dimensional vectors encapsulate semantic relationships between words and phrases, serving as a fundamental building block in data science and machine learning applications.

4. **Applications in Search and Recommendation Systems**:
   Embeddings also find substantial applications in building intelligent search and recommendation systems. These systems can process natural language queries, provide relevant search results, and offer personalized recommendations, thereby enhancing the user experience.

#### Choice of BERT Model Over spaCy

1. **Contextual Embeddings**: BERT (Bidirectional Encoder Representations from Transformers) is known for creating contextual embeddings. It considers the context of words during embedding generation, capturing more nuanced meanings and semantic relationships compared to traditional embedding methods.

2. **Handling Broader Context**: Books contain long and detailed narratives or discussions. BERT attends to every token in its input window (up to 512 tokens for `bert-base-uncased`), so a whole paragraph can be embedded in context, whereas spaCy's standard static word vectors do not change with the surrounding text. Passages longer than the window must still be chunked or truncated.

3. **Segment-Level Embeddings**: BERT allows for the creation of segment-level embeddings (like paragraphs or sections), preserving more of the document structure and contextual information compared to sentence-level embeddings that might be created using spaCy.

4. **State-of-the-Art Performance**: BERT has delivered state-of-the-art performance on a wide range of NLP tasks, making it a reliable choice for creating high-quality embeddings that can capture complex relationships in the text data extracted from PDFs.

By choosing BERT for this task, we aim to create embeddings that can effectively capture the rich and complex information contained in books, paving the way for more advanced analysis and applications in data science and machine learning.
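
Because BERT has a fixed input window, long book text has to be cut into pieces before it is embedded. The sketch below shows one common way to do this, with overlapping word windows so that context is not lost at chunk boundaries. The function name `chunk_words` and the default sizes are assumptions for illustration. Word counts are only a rough proxy for BERT's WordPiece token counts, so the limits should be chosen conservatively.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split `text` into overlapping windows of at most `max_words` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

# Tiny demonstration with ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be tokenized and embedded on its own, which avoids the silent loss of text that truncation would otherwise cause.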


In the ensuing sections, we will detail the process of using the BERT model to create embeddings from the text extracted from PDFs. Subsequently, we will illustrate how to store and manage these embeddings using MinIO, a high-performance object store.
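
The cells that follow turn BERT's per-token output into a single vector per text by averaging over the token axis (`last_hidden_state.mean(dim=1)`). That is sufficient when each text is encoded on its own, with no padding. If texts are batched and padded, the padding positions should be excluded from the average. The NumPy sketch below illustrates that masked mean under assumed toy shapes; `masked_mean_pool` is a hypothetical helper, not part of the `transformers` API.

```python
import numpy as np

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_vectors:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Two sequences of three tokens with 2-dimensional hidden states (toy values)
token_vectors = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)
```

In the first sequence the padded vector `[100.0, 100.0]` is ignored, so the result is the average of the two real tokens only.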


In [26]:
import io  # Needed for io.BytesIO when uploading to MinIO
import os
import fitz  # PyMuPDF
from transformers import BertModel, BertTokenizer
import torch
from minio import Minio
import pickle
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: Extract text from PDF files
directory = 'gitignore-files'
pdf_text_data = {}

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        file_path = os.path.join(directory, filename)
        
        # Open the PDF file
        doc = fitz.open(file_path)
        text = ""
        
        # Combine text from all pages into a single string
        for page in doc:
            text += page.get_text()
        pdf_text_data[filename] = text

# Step 2: Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Step 3: Create segment-level embeddings for the text from each PDF file
embeddings_data = {}
for filename, text in pdf_text_data.items():
    
    # Split the text into paragraphs
    paragraphs = text.split('\n\n')
    embeddings_data[filename] = []
    
    for paragraph in paragraphs:
        if not paragraph.strip():
            continue
        
        # Tokenize the paragraph and create an embedding
        inputs = tokenizer(paragraph, return_tensors='pt', max_length=512, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings_data[filename].append(outputs.last_hidden_state.mean(dim=1).numpy())

# Step 4: Setup MinIO client and create a bucket
client = Minio(
    "127.0.0.1:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False
)
bucket_name = "embeddings"
if not client.bucket_exists(bucket_name):  # Avoid an error when the cell is re-run
    client.make_bucket(bucket_name)

# Step 5: Serialize and store embeddings in MinIO
for filename, embedding_list in tqdm(embeddings_data.items(), desc="Uploading embeddings", unit="file"):
    for i, embedding in enumerate(embedding_list):
        
        # Create a unique object name based on the file name and segment index
        object_name = f"{filename.replace('.pdf', '')}_segment_{i}_embedding.pkl"
        
        # Serialize the embedding data
        serialized_embedding = pickle.dumps(embedding)
        
        # Upload the serialized data to MinIO
        client.put_object(bucket_name, object_name, io.BytesIO(serialized_embedding), len(serialized_embedding))

# Step 6: Embed the user query and search for similar segments in MinIO
user_question = input("Please enter your question: ")

# Create an embedding for the query
query_inputs = tokenizer(user_question, return_tensors='pt')
with torch.no_grad():
    query_outputs = model(**query_inputs)
query_embedding = query_outputs.last_hidden_state.mean(dim=1).numpy()

# Retrieve embeddings from MinIO and find the most similar segment to the query
segment_similarities = []
for obj in tqdm(client.list_objects(bucket_name), desc="Retrieving and comparing embeddings", unit="file"):
    
    # Get the object data from MinIO and deserialize to get the original embedding
    data = client.get_object(bucket_name, obj.object_name)
    retrieved_embedding = pickle.loads(data.read())
    
    # Calculate the cosine similarity between the query embedding and the retrieved embedding
    similarity = cosine_similarity(query_embedding.reshape(1, -1), retrieved_embedding.reshape(1, -1))
    
    # Store the similarity score along with file and segment information
    segment_similarities.append((obj.object_name, similarity[0][0]))

# Find and display the most similar segment to the query
most_similar_segment = max(segment_similarities, key=lambda x: x[1])
print(f"The most similar segment to the query '{user_question}' is in file: {most_similar_segment[0]}, with a similarity score of {most_similar_segment[1]}")


Please enter your question: who is ahab from moby dick?


Retrieving and comparing embeddings: 3file [00:00, 135.92file/s]

The most similar segment to the query 'who is ahab from moby dick?' is in file: Tom-Sawyer_segment_0_embedding.pkl, with a similarity score of 0.5759598016738892





In [30]:
import io  # Needed for io.BytesIO when uploading to MinIO
import os
import fitz  # PyMuPDF
from transformers import BertModel, BertTokenizer
import torch
from minio import Minio
import pickle
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize

# Step 1: Extract text from PDF files and tokenize into sentences
directory = 'gitignore-files'
pdf_text_data = {}

for filename in tqdm(os.listdir(directory), desc="Reading PDFs", unit="file"):
    if filename.endswith(".pdf"):
        file_path = os.path.join(directory, filename)
        
        # Open the PDF file
        doc = fitz.open(file_path)
        text = ""
        
        # Combine text from all pages into a single string
        for page in doc:
            text += page.get_text()
        pdf_text_data[filename] = sent_tokenize(text)  # Tokenize text into sentences
        break  # Process only the first PDF found to keep the run short

# Step 2: Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Step 3: Create sentence-level embeddings for the text from each PDF file
embeddings_data = {}
for filename, sentences in tqdm(pdf_text_data.items(), desc="Creating Embeddings - Files", unit="file"):
    embeddings_data[filename] = []
    for sentence in tqdm(sentences, desc="Creating Embeddings - Sentences", unit="sentence"):
        # Skip empty sentences
        if not sentence.strip():
            continue
        
        # Tokenize the sentence and create an embedding
        inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings_data[filename].append(outputs.last_hidden_state.mean(dim=1).numpy())

# Step 4: Setup MinIO client and create a bucket
client = Minio(
    "127.0.0.1:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False
)
bucket_name = "embeddings-new"
if not client.bucket_exists(bucket_name):  # Avoid an error when the cell is re-run
    client.make_bucket(bucket_name)

# Step 5: Serialize and store embeddings in MinIO
for filename, embedding_list in tqdm(embeddings_data.items(), desc="Uploading embeddings", unit="file"):
    for i, embedding in enumerate(embedding_list):
        # Create a unique object name based on the file name and sentence index
        object_name = f"{filename.replace('.pdf', '')}_sentence_{i}_embedding.pkl"
        
        # Serialize the embedding data
        serialized_embedding = pickle.dumps(embedding)
        
        # Upload the serialized data to MinIO
        client.put_object(bucket_name, object_name, io.BytesIO(serialized_embedding), len(serialized_embedding))

# Step 6: Query the embeddings
user_question = input("Please enter your question: ")

# Create an embedding for the query
query_inputs = tokenizer(user_question, return_tensors='pt')
with torch.no_grad():
    query_outputs = model(**query_inputs)
query_embedding = query_outputs.last_hidden_state.mean(dim=1).numpy()

# Retrieve embeddings from MinIO, find the most similar sentences, and display them
similar_sentences = []

for obj in tqdm(client.list_objects(bucket_name), desc="Retrieving and comparing embeddings", unit="file"):
    
    # Get the object data from MinIO and deserialize to get the original embedding
    data = client.get_object(bucket_name, obj.object_name)
    retrieved_embedding = pickle.loads(data.read())
    
    # Calculate the cosine similarity between the query embedding and the retrieved embedding
    similarity = cosine_similarity(query_embedding.reshape(1, -1), retrieved_embedding.reshape(1, -1))
    
    # Get the file name and sentence index from the object name
    file_name, sentence_index = obj.object_name.replace('_embedding.pkl', '').rsplit('_sentence_', 1)
    sentence_index = int(sentence_index)
    
    # Get the text of the sentence
    sentence_text = pdf_text_data[file_name + '.pdf'][sentence_index]
    
    # Store the sentence text and similarity score
    similar_sentences.append((sentence_text, similarity[0][0]))

# Find the top 2 most similar sentences and display them
most_similar_sentences = sorted(similar_sentences, key=lambda x: x[1], reverse=True)[:2]
for i, (sentence, score) in enumerate(most_similar_sentences):
    print(f"Most similar sentence {i+1}: {sentence} (Similarity Score: {score})")

Creating Embeddings - Sentences:   6%| 279/4886 [00:12<03:04, 25.03sentence/s]
Creating Embeddings - Sentences:   6%|█████████▋                                                                                                                                                             | 282/4886 [00:12<03:01, 25.35sentence/s][A


Creating Embeddings - Sentences:   8%|████████████▌                                                                                                                                                          | 369/4886 [00:15<03:00, 25.08sentence/s][A
Creating Embeddings - Sentences:   8%|████████████▋                                                                                                                                                          | 372/4886 [00:15<02:58, 25.34sentence/s][A
Creating Embeddings - Sentences:   8%|████████████▊                                                                                                                                                          | 375/4886 [00:16<02:53, 25.94sentence/s][A
Creating Embeddings - Sentences:   8%|████████████▉                                                                                                                                                          | 378/4886 [00:16<02:53, 25.92sentence/s][A


Creating Embeddings - Sentences:  10%|███████████████▉                                                                                                                                                       | 465/4886 [00:19<03:13, 22.85sentence/s][A
Creating Embeddings - Sentences:  10%|███████████████▉                                                                                                                                                       | 468/4886 [00:20<03:11, 23.12sentence/s][A
Creating Embeddings - Sentences:  10%|████████████████                                                                                                                                                       | 471/4886 [00:20<03:07, 23.55sentence/s][A
Creating Embeddings - Sentences:  10%|████████████████▏                                                                                                                                                      | 474/4886 [00:20<03:04, 23.92sentence/s][A


Creating Embeddings - Sentences:  11%|███████████████████▏                                                                                                                                                   | 561/4886 [00:24<03:11, 22.56sentence/s][A
Creating Embeddings - Sentences:  12%|███████████████████▎                                                                                                                                                   | 564/4886 [00:24<03:10, 22.73sentence/s][A
Creating Embeddings - Sentences:  12%|███████████████████▍                                                                                                                                                   | 567/4886 [00:24<03:03, 23.50sentence/s][A
Creating Embeddings - Sentences:  12%|███████████████████▍                                                                                                                                                   | 570/4886 [00:24<03:08, 22.87sentence/s][A


Creating Embeddings - Sentences:  13%|██████████████████████▍                                                                                                                                                | 657/4886 [00:28<02:55, 24.05sentence/s][A
Creating Embeddings - Sentences:  14%|██████████████████████▌                                                                                                                                                | 660/4886 [00:28<03:01, 23.28sentence/s][A
Creating Embeddings - Sentences:  14%|██████████████████████▋                                                                                                                                                | 663/4886 [00:28<02:58, 23.66sentence/s][A
Creating Embeddings - Sentences:  14%|██████████████████████▊                                                                                                                                                | 666/4886 [00:28<02:56, 23.89sentence/s][A


Creating Embeddings - Sentences:  15%|█████████████████████████▋                                                                                                                                             | 753/4886 [00:32<02:43, 25.33sentence/s][A
Creating Embeddings - Sentences:  15%|█████████████████████████▊                                                                                                                                             | 756/4886 [00:32<02:43, 25.27sentence/s][A
Creating Embeddings - Sentences:  16%|█████████████████████████▉                                                                                                                                             | 759/4886 [00:32<02:44, 25.08sentence/s][A
Creating Embeddings - Sentences:  16%|██████████████████████████                                                                                                                                             | 762/4886 [00:32<02:41, 25.58sentence/s][A


Creating Embeddings - Sentences:  17%|█████████████████████████████                                                                                                                                          | 849/4886 [00:36<02:53, 23.23sentence/s][A
Creating Embeddings - Sentences:  17%|█████████████████████████████                                                                                                                                          | 852/4886 [00:36<02:51, 23.55sentence/s][A
Creating Embeddings - Sentences:  17%|█████████████████████████████▏                                                                                                                                         | 855/4886 [00:36<02:46, 24.18sentence/s][A
Creating Embeddings - Sentences:  18%|█████████████████████████████▎                                                                                                                                         | 858/4886 [00:36<02:43, 24.68sentence/s][A


Creating Embeddings - Sentences:  19%|████████████████████████████████▎                                                                                                                                      | 945/4886 [00:39<02:36, 25.16sentence/s][A
Creating Embeddings - Sentences:  19%|████████████████████████████████▍                                                                                                                                      | 948/4886 [00:40<02:37, 25.06sentence/s][A
Creating Embeddings - Sentences:  19%|████████████████████████████████▌                                                                                                                                      | 951/4886 [00:40<02:36, 25.16sentence/s][A
Creating Embeddings - Sentences:  20%|████████████████████████████████▌                                                                                                                                      | 954/4886 [00:40<02:34, 25.49sentence/s][A


Creating Embeddings - Sentences:  21%|███████████████████████████████████▎                                                                                                                                  | 1041/4886 [00:43<02:32, 25.29sentence/s][A
Creating Embeddings - Sentences:  21%|███████████████████████████████████▍                                                                                                                                  | 1044/4886 [00:43<02:43, 23.55sentence/s][A
Creating Embeddings - Sentences:  21%|███████████████████████████████████▌                                                                                                                                  | 1047/4886 [00:43<02:38, 24.28sentence/s][A
Creating Embeddings - Sentences:  21%|███████████████████████████████████▋                                                                                                                                  | 1050/4886 [00:44<02:36, 24.46sentence/s][A


Creating Embeddings - Sentences:  23%|██████████████████████████████████████▋                                                                                                                               | 1137/4886 [00:47<02:31, 24.70sentence/s][A
Creating Embeddings - Sentences:  23%|██████████████████████████████████████▋                                                                                                                               | 1140/4886 [00:47<02:32, 24.50sentence/s][A
Creating Embeddings - Sentences:  23%|██████████████████████████████████████▊                                                                                                                               | 1143/4886 [00:47<02:28, 25.14sentence/s][A
Creating Embeddings - Sentences:  23%|██████████████████████████████████████▉                                                                                                                               | 1146/4886 [00:47<02:31, 24.63sentence/s][A


Creating Embeddings - Sentences:  25%|█████████████████████████████████████████▉                                                                                                                            | 1233/4886 [00:51<02:23, 25.38sentence/s][A
Creating Embeddings - Sentences:  25%|█████████████████████████████████████████▉                                                                                                                            | 1236/4886 [00:51<02:24, 25.25sentence/s][A
Creating Embeddings - Sentences:  25%|██████████████████████████████████████████                                                                                                                            | 1239/4886 [00:51<02:27, 24.72sentence/s][A
Creating Embeddings - Sentences:  25%|██████████████████████████████████████████▏                                                                                                                           | 1242/4886 [00:51<02:26, 24.92sentence/s][A


Creating Embeddings - Sentences:  27%|█████████████████████████████████████████████▏                                                                                                                        | 1329/4886 [00:55<02:18, 25.77sentence/s][A
Creating Embeddings - Sentences:  27%|█████████████████████████████████████████████▎                                                                                                                        | 1332/4886 [00:55<02:19, 25.48sentence/s][A
Creating Embeddings - Sentences:  27%|█████████████████████████████████████████████▎                                                                                                                        | 1335/4886 [00:55<02:22, 24.98sentence/s][A
Creating Embeddings - Sentences:  27%|█████████████████████████████████████████████▍                                                                                                                        | 1338/4886 [00:55<02:22, 24.88sentence/s][A


Creating Embeddings - Sentences:  29%|████████████████████████████████████████████████▍                                                                                                                     | 1425/4886 [00:58<02:21, 24.42sentence/s][A
Creating Embeddings - Sentences:  29%|████████████████████████████████████████████████▌                                                                                                                     | 1428/4886 [00:59<02:19, 24.86sentence/s][A
Creating Embeddings - Sentences:  29%|████████████████████████████████████████████████▌                                                                                                                     | 1431/4886 [00:59<02:17, 25.14sentence/s][A
Creating Embeddings - Sentences:  29%|████████████████████████████████████████████████▋                                                                                                                     | 1434/4886 [00:59<02:17, 25.11sentence/s][A


Creating Embeddings - Sentences:  31%|███████████████████████████████████████████████████▋                                                                                                                  | 1521/4886 [01:02<02:12, 25.47sentence/s][A
Creating Embeddings - Sentences:  31%|███████████████████████████████████████████████████▊                                                                                                                  | 1524/4886 [01:02<02:17, 24.52sentence/s][A
Creating Embeddings - Sentences:  31%|███████████████████████████████████████████████████▉                                                                                                                  | 1527/4886 [01:03<02:14, 24.89sentence/s][A
Creating Embeddings - Sentences:  31%|███████████████████████████████████████████████████▉                                                                                                                  | 1530/4886 [01:03<02:12, 25.32sentence/s][A


Creating Embeddings - Sentences:  33%|██████████████████████████████████████████████████████▉                                                                                                               | 1617/4886 [01:06<02:12, 24.59sentence/s][A
Creating Embeddings - Sentences:  33%|███████████████████████████████████████████████████████                                                                                                               | 1620/4886 [01:06<02:11, 24.87sentence/s][A
Creating Embeddings - Sentences:  33%|███████████████████████████████████████████████████████▏                                                                                                              | 1623/4886 [01:06<02:13, 24.44sentence/s][A
Creating Embeddings - Sentences:  33%|███████████████████████████████████████████████████████▏                                                                                                              | 1626/4886 [01:07<02:08, 25.40sentence/s][A


Creating Embeddings - Sentences:  35%|██████████████████████████████████████████████████████████▏                                                                                                           | 1713/4886 [01:10<02:07, 24.94sentence/s][A
Creating Embeddings - Sentences:  35%|██████████████████████████████████████████████████████████▎                                                                                                           | 1716/4886 [01:10<02:08, 24.68sentence/s][A
Creating Embeddings - Sentences:  35%|██████████████████████████████████████████████████████████▍                                                                                                           | 1719/4886 [01:10<02:07, 24.82sentence/s][A
Creating Embeddings - Sentences:  35%|██████████████████████████████████████████████████████████▌                                                                                                           | 1722/4886 [01:10<02:09, 24.49sentence/s][A


Creating Embeddings - Sentences:  37%|█████████████████████████████████████████████████████████████▍                                                                                                        | 1809/4886 [01:14<02:02, 25.10sentence/s][A
Creating Embeddings - Sentences:  37%|█████████████████████████████████████████████████████████████▌                                                                                                        | 1812/4886 [01:14<02:00, 25.41sentence/s][A
Creating Embeddings - Sentences:  37%|█████████████████████████████████████████████████████████████▋                                                                                                        | 1815/4886 [01:14<02:00, 25.42sentence/s][A
Creating Embeddings - Sentences:  37%|█████████████████████████████████████████████████████████████▊                                                                                                        | 1818/4886 [01:14<02:03, 24.92sentence/s][A


Creating Embeddings - Sentences:  39%|████████████████████████████████████████████████████████████████▋                                                                                                     | 1905/4886 [01:18<02:01, 24.45sentence/s][A
Creating Embeddings - Sentences:  39%|████████████████████████████████████████████████████████████████▊                                                                                                     | 1908/4886 [01:18<02:03, 24.14sentence/s][A
Creating Embeddings - Sentences:  39%|████████████████████████████████████████████████████████████████▉                                                                                                     | 1911/4886 [01:18<02:01, 24.56sentence/s][A
Creating Embeddings - Sentences:  39%|█████████████████████████████████████████████████████████████████                                                                                                     | 1914/4886 [01:18<02:01, 24.41sentence/s][A


Creating Embeddings - Sentences:  41%|███████████████████████████████████████████████████████████████████▉                                                                                                  | 2001/4886 [01:22<01:54, 25.22sentence/s][A
Creating Embeddings - Sentences:  41%|████████████████████████████████████████████████████████████████████                                                                                                  | 2004/4886 [01:22<01:56, 24.83sentence/s][A
Creating Embeddings - Sentences:  41%|████████████████████████████████████████████████████████████████████▏                                                                                                 | 2007/4886 [01:22<01:54, 25.16sentence/s][A
Creating Embeddings - Sentences:  41%|████████████████████████████████████████████████████████████████████▎                                                                                                 | 2010/4886 [01:22<02:03, 23.33sentence/s][A


Creating Embeddings - Sentences:  43%|███████████████████████████████████████████████████████████████████████▏                                                                                              | 2097/4886 [01:26<01:51, 25.08sentence/s][A
Creating Embeddings - Sentences:  43%|███████████████████████████████████████████████████████████████████████▎                                                                                              | 2100/4886 [01:26<01:51, 25.09sentence/s][A
Creating Embeddings - Sentences:  43%|███████████████████████████████████████████████████████████████████████▍                                                                                              | 2103/4886 [01:26<01:51, 24.98sentence/s][A
Creating Embeddings - Sentences:  43%|███████████████████████████████████████████████████████████████████████▌                                                                                              | 2106/4886 [01:26<01:49, 25.49sentence/s][A


Creating Embeddings - Sentences:  45%|██████████████████████████████████████████████████████████████████████████▌                                                                                           | 2193/4886 [01:30<01:44, 25.75sentence/s][A
Creating Embeddings - Sentences:  45%|██████████████████████████████████████████████████████████████████████████▌                                                                                           | 2196/4886 [01:30<01:43, 26.08sentence/s][A
Creating Embeddings - Sentences:  45%|██████████████████████████████████████████████████████████████████████████▋                                                                                           | 2199/4886 [01:30<01:43, 26.04sentence/s][A
Creating Embeddings - Sentences:  45%|██████████████████████████████████████████████████████████████████████████▊                                                                                           | 2202/4886 [01:30<01:43, 25.84sentence/s][A


Creating Embeddings - Sentences:  47%|█████████████████████████████████████████████████████████████████████████████▊                                                                                        | 2289/4886 [01:33<01:47, 24.23sentence/s][A
Creating Embeddings - Sentences:  47%|█████████████████████████████████████████████████████████████████████████████▊                                                                                        | 2292/4886 [01:34<01:43, 24.99sentence/s][A
Creating Embeddings - Sentences:  47%|█████████████████████████████████████████████████████████████████████████████▉                                                                                        | 2295/4886 [01:34<01:44, 24.83sentence/s][A
Creating Embeddings - Sentences:  47%|██████████████████████████████████████████████████████████████████████████████                                                                                        | 2298/4886 [01:34<01:47, 24.04sentence/s][A


Creating Embeddings - Sentences:  49%|█████████████████████████████████████████████████████████████████████████████████                                                                                     | 2385/4886 [01:37<01:35, 26.12sentence/s][A
Creating Embeddings - Sentences:  49%|█████████████████████████████████████████████████████████████████████████████████▏                                                                                    | 2388/4886 [01:37<01:34, 26.36sentence/s][A
Creating Embeddings - Sentences:  49%|█████████████████████████████████████████████████████████████████████████████████▏                                                                                    | 2391/4886 [01:37<01:37, 25.63sentence/s][A
Creating Embeddings - Sentences:  49%|█████████████████████████████████████████████████████████████████████████████████▎                                                                                    | 2394/4886 [01:38<01:41, 24.63sentence/s][A


Creating Embeddings - Sentences:  51%|████████████████████████████████████████████████████████████████████████████████████▎                                                                                 | 2481/4886 [01:41<01:36, 24.92sentence/s][A
Creating Embeddings - Sentences:  51%|████████████████████████████████████████████████████████████████████████████████████▍                                                                                 | 2484/4886 [01:41<01:38, 24.32sentence/s][A
Creating Embeddings - Sentences:  51%|████████████████████████████████████████████████████████████████████████████████████▍                                                                                 | 2487/4886 [01:41<01:37, 24.60sentence/s][A
Creating Embeddings - Sentences:  51%|████████████████████████████████████████████████████████████████████████████████████▌                                                                                 | 2490/4886 [01:41<01:36, 24.71sentence/s][A


Creating Embeddings - Sentences:  53%|███████████████████████████████████████████████████████████████████████████████████████▌                                                                              | 2576/4886 [01:45<01:36, 23.99sentence/s][A
Creating Embeddings - Sentences:  53%|███████████████████████████████████████████████████████████████████████████████████████▌                                                                              | 2579/4886 [01:45<01:32, 24.93sentence/s][A
Creating Embeddings - Sentences:  53%|███████████████████████████████████████████████████████████████████████████████████████▋                                                                              | 2582/4886 [01:45<01:36, 23.98sentence/s][A
Creating Embeddings - Sentences:  53%|███████████████████████████████████████████████████████████████████████████████████████▊                                                                              | 2585/4886 [01:45<01:31, 25.21sentence/s][A


Creating Embeddings - Sentences:  55%|██████████████████████████████████████████████████████████████████████████████████████████▊                                                                           | 2672/4886 [01:49<01:26, 25.70sentence/s][A
Creating Embeddings - Sentences:  55%|██████████████████████████████████████████████████████████████████████████████████████████▉                                                                           | 2675/4886 [01:49<01:26, 25.53sentence/s][A
Creating Embeddings - Sentences:  55%|██████████████████████████████████████████████████████████████████████████████████████████▉                                                                           | 2678/4886 [01:49<01:25, 25.71sentence/s][A
Creating Embeddings - Sentences:  55%|███████████████████████████████████████████████████████████████████████████████████████████                                                                           | 2681/4886 [01:50<01:24, 26.01sentence/s][A


Creating Embeddings - Sentences:  57%|██████████████████████████████████████████████████████████████████████████████████████████████                                                                        | 2768/4886 [01:53<01:21, 26.03sentence/s][A
Creating Embeddings - Sentences:  57%|██████████████████████████████████████████████████████████████████████████████████████████████▏                                                                       | 2771/4886 [01:53<01:21, 26.08sentence/s][A
Creating Embeddings - Sentences:  57%|██████████████████████████████████████████████████████████████████████████████████████████████▏                                                                       | 2774/4886 [01:53<01:21, 25.91sentence/s][A
Creating Embeddings - Sentences:  57%|██████████████████████████████████████████████████████████████████████████████████████████████▎                                                                       | 2777/4886 [01:53<01:19, 26.49sentence/s][A


Creating Embeddings - Sentences:  59%|█████████████████████████████████████████████████████████████████████████████████████████████████▎                                                                    | 2864/4886 [01:57<01:21, 24.67sentence/s][A
Creating Embeddings - Sentences:  59%|█████████████████████████████████████████████████████████████████████████████████████████████████▍                                                                    | 2867/4886 [01:57<01:19, 25.28sentence/s][A
Creating Embeddings - Sentences:  59%|█████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                    | 2870/4886 [01:57<01:20, 25.17sentence/s][A
Creating Embeddings - Sentences:  59%|█████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                    | 2873/4886 [01:57<01:26, 23.22sentence/s][A


Creating Embeddings - Sentences:  61%|████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                 | 2960/4886 [02:01<01:20, 24.01sentence/s][A
Creating Embeddings - Sentences:  61%|████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                                 | 2963/4886 [02:01<01:20, 23.87sentence/s][A
Creating Embeddings - Sentences:  61%|████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                 | 2966/4886 [02:01<01:20, 23.91sentence/s][A
Creating Embeddings - Sentences: 100%|██████████| 4886/4886 [03:18<00:00, 24.64sentence/s]
Creating Embeddings - Files: 100%|██████████| 1/1 [03:18<00:00, 198.27s/file]
Upl

Please enter your question: name of the book?


Retrieving and comparing embeddings: 4886file [00:15, 311.97file/s]

Most similar sentence 1: Was the sacred presence there? (Similarity Score: 0.7459015846252441)
Most similar sentence 2: What's his other name?" (Similarity Score: 0.7040351033210754)
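
The similarity scores in this notebook are cosine similarities between the query embedding and each stored sentence embedding. As a minimal sketch of what that measure computes, here is a pure-Python version. It is not the `cosine_similarity` function the notebook calls; the helper name `cosine_similarity_1d` and the sample vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for high-dimensional embeddings.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # a scaled copy of the query
orthogonal = [3.0, 0.0, -1.0]      # dot product with the query is zero

print(cosine_similarity_1d(query, same_direction))  # close to 1.0
print(cosine_similarity_1d(query, orthogonal))      # 0.0
```

Because the score depends only on the angle between vectors and not on their length, a high score means the two embeddings point in a similar direction. It does not by itself guarantee that the retrieved sentence answers the question.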





In [32]:
# Step 6: Query the embeddings
user_question = input("Please enter your question: ")

# Create an embedding for the query by mean-pooling the model's last hidden state
query_inputs = tokenizer(user_question, return_tensors='pt')
with torch.no_grad():
    query_outputs = model(**query_inputs)
query_embedding = query_outputs.last_hidden_state.mean(dim=1).numpy()

# Retrieve embeddings from MinIO, find the most similar sentences, and display them
similar_sentences = []

for obj in tqdm(client.list_objects(bucket_name), desc="Retrieving and comparing embeddings", unit="file"):

    # Get the object data from MinIO and deserialize it to recover the stored embedding.
    # Note: only unpickle objects from a bucket you trust.
    response = client.get_object(bucket_name, obj.object_name)
    try:
        retrieved_embedding = pickle.loads(response.read())
    finally:
        # Release the HTTP connection back to the pool once the data is read
        response.close()
        response.release_conn()

    # Calculate the cosine similarity between the query embedding and the retrieved embedding
    similarity = cosine_similarity(query_embedding.reshape(1, -1), retrieved_embedding.reshape(1, -1))

    # Get the file name and sentence index from the object name
    # (object names look like "<file>_sentence_<index>_embedding.pkl")
    file_name, sentence_index = obj.object_name.replace('_embedding.pkl', '').rsplit('_sentence_', 1)
    sentence_index = int(sentence_index)

    # Get the text of the sentence
    sentence_text = pdf_text_data[file_name + '.pdf'][sentence_index]

    # Store the sentence text and similarity score as a plain float
    similar_sentences.append((sentence_text, float(similarity[0][0])))

# Find the top 2 most similar sentences and display them
most_similar_sentences = sorted(similar_sentences, key=lambda x: x[1], reverse=True)[:2]
for i, (sentence, score) in enumerate(most_similar_sentences, start=1):
    print(f"Most similar sentence {i}: {sentence} (Similarity Score: {score})")

Please enter your question: tom sawyer lied


Retrieving and comparing embeddings: 4886file [00:16, 298.64file/s]

Most similar sentence 1: Tom Sawyer's Gang! (Similarity Score: 0.7714118957519531)
Most similar sentence 2: Tom Sawyer find it! (Similarity Score: 0.7477138042449951)
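
The query cell above does two things that are easy to test in isolation: it parses each object name into a file name and a sentence index, and it picks the top-scoring sentences. The sketch below illustrates both under stated assumptions. The helper names `parse_object_name` and `top_k`, the object name `huck_finn_sentence_12_embedding.pkl`, and the sample scores are all hypothetical. It also swaps the full sort for `heapq.nlargest`, which avoids sorting every result when only a few are needed.

```python
import heapq


def parse_object_name(object_name, suffix="_embedding.pkl"):
    """Split '<file>_sentence_<index>_embedding.pkl' into (file, index)."""
    if not object_name.endswith(suffix):
        raise ValueError(f"unexpected object name: {object_name!r}")
    stem = object_name[: -len(suffix)]
    # rsplit with maxsplit=1 keeps any earlier underscores inside the file name.
    file_name, sentence_index = stem.rsplit("_sentence_", 1)
    return file_name, int(sentence_index)


def top_k(scored_sentences, k=2):
    """Return the k (sentence, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored_sentences, key=lambda pair: pair[1])


# Hypothetical data for illustration only.
name = "huck_finn_sentence_12_embedding.pkl"
file_name, index = parse_object_name(name)

scored = [("alpha", 0.41), ("beta", 0.77), ("gamma", 0.63), ("delta", 0.12)]
best = top_k(scored, k=2)

print(file_name, index)
print(best)
```

Separating these steps from the MinIO loop makes it possible to check the name parsing and the ranking logic without a running object store or a loaded model.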



