In [6]:
#!pip install transformers -U > /dev/null
!pip install elasticsearch > /dev/null


### Elasticsearch - Enhancing the Search Interface

**Elasticsearch** is a highly scalable open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine for applications with complex search features and requirements.

Integrating Elasticsearch can enhance the search interface in several ways:

1. **Full-Text Search**: Elasticsearch is designed to help you find the most relevant information quickly by performing advanced full-text search operations on large datasets. It can analyze the text contents and find the best matches based on various criteria such as term frequency, proximity, and so forth.

2. **Scalability and Speed**: Elasticsearch is distributed: each index can be divided into shards, and each shard can have zero or more replicas. Spreading data across nodes this way lets search scale horizontally, which improves the speed and efficiency of searches on large volumes of data.

3. **Complex Query Language**: Elasticsearch supports a rich and flexible query language (Query DSL) that allows for the formulation of complex queries to find exactly what you need.

4. **Relevance Scoring and Ranking**: Elasticsearch calculates the relevance score for each document in response to a query, allowing for the ranking of results based on their relevance, which can help in providing more accurate answers to user queries.

5. **Analysis and Tokenization**: Elasticsearch can analyze and tokenize text data in various ways, making it possible to handle linguistic nuances such as stemming, synonyms, etc., which can improve the accuracy of the search.

6. **Integration with Embeddings**: In our setup, Elasticsearch will be used not only to search the text data but also to search the embeddings created from the text data. This integration allows for semantically intelligent searches, where we can find sentences that are semantically similar to the query, not just textually similar.

7. **Aggregations for Analytics**: Elasticsearch provides powerful aggregations that can help you summarize and analyze your data, which can be used to build complex analytics and visualization interfaces.

8. **Real-Time Operations**: Elasticsearch indexes and searches data in near real time: the delay between indexing a document and that document becoming searchable is short (about one second with the default refresh interval).

In the context of our project, integrating Elasticsearch will allow us to build a more powerful and flexible search interface where we can perform semantically intelligent searches on the embeddings created from the text data in the PDF files. The search results can be ranked based on various criteria, including textual and semantic similarity, to provide more relevant and accurate answers to user queries.

In the following sections, we will explore how to integrate Elasticsearch into our setup and use it to enhance the search functionality.
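
As a small illustration of the Query DSL and aggregations described above, the sketch below builds a request body as a plain Python dictionary. The helper name `build_text_search_body` and the aggregation name `hits_per_file` are hypothetical; the `sentence` and `filename` fields are assumed to match the index mapping created later in this notebook.

```python
from typing import Any, Dict


def build_text_search_body(query_text: str, size: int = 5) -> Dict[str, Any]:
    """Build a request body with a full-text match and a per-file hit count."""
    return {
        "size": size,
        # Full-text match on the analyzed `sentence` field (assumed field name)
        "query": {"match": {"sentence": {"query": query_text}}},
        # Terms aggregation: count matching sentences per source file
        "aggs": {"hits_per_file": {"terms": {"field": "filename"}}},
    }


body = build_text_search_body("white whale")
print(body["query"])
```

A body like this could be passed to the client's `search` call once a connection exists. It only exercises textual matching; the embedding-based search used in this project is shown further down.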

In [5]:
import os

import fitz  # PyMuPDF
import torch
from nltk.tokenize import sent_tokenize  # requires NLTK's Punkt tokenizer data to be installed
from tqdm import tqdm
from transformers import BertModel, BertTokenizer

# Step 1: Extract text from PDF files and tokenize into sentences
directory = 'gitignore-files'
pdf_text_data = {}

for filename in tqdm(os.listdir(directory), desc="Reading PDFs", unit="file"):
    if filename.endswith(".pdf"):
        file_path = os.path.join(directory, filename)
        
        # Open the PDF file and combine the text from all pages into a single string
        with fitz.open(file_path) as doc:
            text = "".join(page.get_text() for page in doc)

        # Tokenize text into sentences
        pdf_text_data[filename] = sent_tokenize(text)

# Step 2: Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


# Step 3: Create sentence-level embeddings for the text from each PDF file
embeddings_data = {}

# Calculate the total number of sentences across all files
total_sentences = sum(len(sentences) for sentences in pdf_text_data.values())

# Create a single tqdm progress bar to track progress across all sentences
progress_bar = tqdm(total=total_sentences, desc="Creating Embeddings", unit="sentence")

for filename, sentences in pdf_text_data.items():
    embeddings_data[filename] = []
    for sentence in sentences:
        # Update the progress bar
        progress_bar.update(1)
        
        # Skip empty sentences
        if not sentence.strip():
            continue
        
        # Tokenize the sentence and create an embedding
        inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings_data[filename].append(outputs.last_hidden_state.mean(dim=1).numpy())

# Close the progress bar
progress_bar.close()

Reading PDFs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.51file/s]
Creating Embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12706/12706 [10:14<00:00, 20.69sentence/s]
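
The loop above runs one forward pass per sentence. One possible speed-up, sketched here under assumptions and not part of the original run, is to group sentences into fixed-size batches first. The helper `iter_batches` and the batch size of 32 are hypothetical choices.

```python
from typing import Iterable, Iterator, List


def iter_batches(sentences: Iterable[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` non-empty sentences, in order."""
    batch: List[str] = []
    for sentence in sentences:
        # Skip empty sentences, mirroring the check in the embedding loop
        if not sentence.strip():
            continue
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch


for batch in iter_batches(["Call me Ishmael.", " ", "It is a whale.", "Aye."], batch_size=2):
    print(batch)
```

Each batch could then be passed to the tokenizer as a list with `padding=True` and `truncation=True`. With padded batches, the mean over tokens should be weighted by the attention mask so that padding tokens do not dilute the sentence embedding.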


## Integrating Elasticsearch with Our Embeddings Application

### Step 1: Setting Up an Elasticsearch Account

1. **Create an Elastic Cloud Account**: Visit the official [Elastic website](https://www.elastic.co/) and sign up for an account.
2. **Start a Deployment**: Once your account is set up, create a new deployment from the Elastic Cloud console. The deployment provides the credentials you need: the Elasticsearch endpoint, the Cloud ID, and an API key.
3. **Environment Variables**: For security reasons, store your credentials as environment variables instead of hard-coding them in the notebook. You can read them in your Python script with the `os.getenv` function. Set the following environment variable keys:
   - `YOUR_ELASTICSEARCH_ENDPOINT`
   - `YOUR_ELASTICSEARCH_CLOUD_ID`
   - `YOUR_ELASTICSEARCH_API_KEY`
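
A minimal sketch of reading these variables and failing early when one is missing is shown below. The helper name `load_es_settings` is hypothetical, and the sketch checks only the two keys that the client code in this notebook actually reads (endpoint and API key).

```python
import os
from typing import Dict, Mapping, Sequence

# Keys read by the client code in this notebook
REQUIRED_KEYS = ("YOUR_ELASTICSEARCH_ENDPOINT", "YOUR_ELASTICSEARCH_API_KEY")


def load_es_settings(
    env: Mapping[str, str] = os.environ,
    keys: Sequence[str] = REQUIRED_KEYS,
) -> Dict[str, str]:
    """Return the required settings, raising if any is missing or empty."""
    missing = [key for key in keys if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in keys}


# Demonstration with an explicit mapping; omit the argument to read os.environ
demo = load_es_settings(
    {"YOUR_ELASTICSEARCH_ENDPOINT": "example.invalid", "YOUR_ELASTICSEARCH_API_KEY": "placeholder"}
)
print(sorted(demo))
```

Checking up front gives a clear error message, whereas a missing variable would otherwise surface later as `None` inside the client configuration.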

### Step 2: Connecting to Elasticsearch in Python

1. **Installing the Elasticsearch Client**: To interact with your Elasticsearch deployment from Python, install the official Elasticsearch client with `pip install elasticsearch`.
2. **Initializing the Client**: Use the client to establish a connection to your Elasticsearch instance, reading the credentials from the environment variables set earlier so that they never appear in the script.

### Step 3: Creating an Index with Appropriate Mappings

Before you can index your sentence embeddings, you need to create an index with the appropriate mappings. The mappings specify the structure of your documents by setting the data type of each field (such as `keyword`, `text`, and `dense_vector`). The `dims` value of the `dense_vector` field must match the embedding size, which is 768 for `bert-base-uncased`.

### Step 4: Indexing Sentence Embeddings

Once your index is ready, proceed to index your sentence embeddings. This process involves iterating over your sentences and their corresponding embeddings and adding them to the index one by one.

### Step 5: Querying the Index

After indexing your data, create a function to query the index. This function should be able to:
1. Accept a user query and create an embedding for it.
2. Use this embedding to query the Elasticsearch index and retrieve the most similar sentences.
3. Display the retrieved sentences along with their similarity scores and the documents they belong to.

### Step 6: Testing the Setup

Finally, test your setup by entering various queries and observing the results. Ensure the system is able to find and return the most relevant sentences from your indexed documents.

Remember to handle the 'quit' command to allow users to exit the query loop gracefully.

This setup will allow you to perform semantic searches on your indexed documents, retrieving the most relevant sentences based on the embeddings created from your PDF files.


In [8]:
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch

# Load credentials from a local .env file into the environment
load_dotenv()

# Initialize the Elasticsearch client; the API key is sent as an "ApiKey" Authorization header
es = Elasticsearch(
    hosts=[{"host": os.getenv("YOUR_ELASTICSEARCH_ENDPOINT"), "port": 443, "scheme": "https"}],
    api_key=os.getenv("YOUR_ELASTICSEARCH_API_KEY"),
)

# Create an index with mappings to define the structure of your documents.
# ignore_status=400 suppresses the "index already exists" error on re-runs.
es.options(ignore_status=400).indices.create(
    index="pdf_embeddings",
    mappings={
        "properties": {
            "filename": {"type": "keyword"},
            "sentence": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768},  # BERT-base hidden size
        }
    },
)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'pdf_embeddings'})

In [11]:
# Step 4: Index data (sentences and embeddings) into Elasticsearch

# Calculate the total number of embeddings to index
total_embeddings = sum(len(embed_list) for embed_list in embeddings_data.values())

# Create a single tqdm progress bar to track progress across all embeddings
progress_bar = tqdm(total=total_embeddings, desc="Indexing data", unit="embedding")

for filename, embeddings in embeddings_data.items():
    # Empty sentences were skipped when the embeddings were created, so drop them
    # here too; otherwise sentences and embeddings would fall out of alignment
    sentences = [s for s in pdf_text_data[filename] if s.strip()]
    for sentence, embedding in zip(sentences, embeddings):
        # Index each sentence and its embedding into Elasticsearch
        es.index(index="pdf_embeddings", document={
            "filename": filename,
            "sentence": sentence,
            "embedding": embedding.flatten().tolist()  # Flatten the (1, 768) array into a plain list
        })

        # Update the progress bar
        progress_bar.update(1)

# Close the progress bar
progress_bar.close()

Indexing data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12706/12706 [08:53<00:00, 23.80embedding/s]
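
Indexing one document per request took several minutes in the run above. A possible alternative, not used in this notebook, is the client's bulk helper, which accepts an iterable of actions. The sketch below only builds those actions; `generate_actions` is a hypothetical helper, and the embeddings are assumed to be plain lists of floats (for example after `embedding.flatten().tolist()`).

```python
from typing import Any, Dict, Iterator, Sequence


def generate_actions(
    index_name: str,
    filename: str,
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per (sentence, embedding) pair."""
    for sentence, embedding in zip(sentences, embeddings):
        yield {
            "_index": index_name,
            "_source": {
                "filename": filename,
                "sentence": sentence,
                "embedding": list(embedding),
            },
        }


actions = list(
    generate_actions("pdf_embeddings", "example.pdf", ["First.", "Second."], [[0.1, 0.2], [0.3, 0.4]])
)
print(len(actions))

# With a connected client `es`, the actions could be sent in batches:
# from elasticsearch.helpers import bulk
# bulk(es, generate_actions("pdf_embeddings", filename, sentences, vectors))
```

Sending documents in batches reduces the number of HTTP round trips, which is usually the dominant cost when indexing many small documents.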


In [18]:
def search():
    while True:
        user_query = input("Please enter your query (or type 'quit' to exit): ").strip()

        if user_query.lower() == 'quit':
            break
        if not user_query:
            continue  # Ignore empty input

        # Step 1: Create an embedding for the user's query (same pooling and truncation as the indexed sentences)
        inputs = tokenizer(user_query, return_tensors='pt', max_length=512, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        query_embedding = outputs.last_hidden_state.mean(dim=1).numpy()

        # Step 2: Create a script query to calculate the cosine similarity between the query embedding and stored embeddings.
        # Elasticsearch does not allow negative scores, so 1.0 is added: scores range from 0 to 2.
        script_query = {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_embedding[0].tolist()}
                }
            }
        }

        # Step 3: Execute the query and retrieve the top 5 most similar sentences
        response = es.search(
            index="pdf_embeddings",
            body={
                "size": 5,
                "query": script_query,
                "_source": {"includes": ["filename", "sentence"]}
            }
        )

        # Print the top 5 most similar sentences
        for hit in response["hits"]["hits"]:
            print(f"Match: {hit['_source']['sentence']} from file {hit['_source']['filename']} (score: {hit['_score']})")

search()

Please enter your query (or type 'quit' to exit): does ahab kill the whale?
Match: how the richer or better is Ahab now? from file moby-dick.pdf (score: 1.8019496)
Match: Why did the old Persians hold the sea holy? from file moby-dick.pdf (score: 1.7933602)
Match: Is Ahab, Ahab? from file moby-dick.pdf (score: 1.7905133)
Match: But what is this lesson that the book of Jonah teaches? from file moby-dick.pdf (score: 1.7855389)
Match: Doesn’t the devil live for ever;
who ever heard that the devil was dead? from file moby-dick.pdf (score: 1.7788833)
Please enter your query (or type 'quit' to exit): quit
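
The scores above are not raw cosine similarities: the script adds 1.0, so a score of 1.80 corresponds to a cosine similarity of about 0.80. The pure-Python sketch below illustrates that relationship; the function names `cosine`, `script_score`, and `score_to_cosine` are hypothetical and only mirror what the script query computes.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def script_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Mirror of the script: cosine similarity shifted into the range [0, 2]."""
    return cosine(a, b) + 1.0


def score_to_cosine(score: float) -> float:
    """Recover the cosine similarity from a returned score."""
    return score - 1.0


print(script_score([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(script_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(score_to_cosine(1.8019496))
```

Subtracting 1.0 from `hit['_score']` before printing would show the cosine similarity directly.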


---

This iteration of the project replaces the `MinIO`-based storage of the previous iteration with an `Elasticsearch` index.

The table below compares the two approaches:


| Aspect                   | Previous Iteration (Using MinIO) | Current Iteration (Using Elasticsearch) |
|--------------------------|----------------------------------|-----------------------------------------|
| **Storage System**       | Utilized MinIO, an object storage service, for storing the embeddings. | Transitioned to using Elasticsearch, a powerful search and analytics engine, for storing and indexing the embeddings. |
| **Indexing Strategy**    | Serialized embeddings were stored in MinIO, with each sentence embedding stored as a separate object. | Embeddings are indexed directly into Elasticsearch, facilitating more efficient data retrieval and search capabilities. |
| **Search Capability**    | Employed a basic search strategy where embeddings were retrieved from MinIO and cosine similarity calculations were performed in Python. | Uses Elasticsearch's search capabilities: the cosine similarity is computed inside the search query with `script_score`, so embeddings no longer need to be downloaded and compared in Python, which is potentially faster. |
| **Scalability**          | Although MinIO can handle large data storage, the search and retrieval process might face challenges in scalability due to the computational intensity of similarity computations in Python. | Being a distributed system, Elasticsearch can scale horizontally, enhancing both the speed and scalability of the search process, especially with large datasets. |
| **Data Retrieval Speed** | The speed of data retrieval and similarity computation might be slower, particularly as the dataset grows, due to the Python-based computation process. | Expected to offer faster data retrieval speeds due to the in-built search and analytics capabilities of Elasticsearch. |
| **Data Visualization**   | Did not natively support data visualization; would require integration with other tools for data analysis and visualization. | Can be coupled with Kibana for intuitive data visualization and analysis, paving the way for more advanced data analytics in future developments. |

In summary, the current iteration with Elasticsearch integration promises a more robust and scalable solution, capable of handling larger datasets more efficiently and offering sophisticated querying capabilities. It marks a substantial step forward in developing a feature-rich application compared to the previous MinIO-based approach.
