# OpenSearch Index Creation and Document Ingestion

This notebook demonstrates how to:
1. Create an OpenSearch index with proper mappings for hybrid search
2. Process PDF documents into chunks
3. Generate embeddings for the chunks
4. Ingest the chunks with their embeddings into OpenSearch

All functions are defined directly in this notebook, allowing you to modify them and experiment with different approaches.


In [1]:
# Import necessary libraries
import json
import os
import re
import sys
from typing import Dict, Any, List

from PyPDF2 import PdfReader
from opensearchpy import OpenSearch, helpers
from sentence_transformers import SentenceTransformer





In [2]:
# Set up Python path to access project modules
sys.path.insert(0, "..")

%load_ext autoreload
%autoreload 2


In [3]:
# Define constants
# You can modify these values to experiment with different settings

# OpenSearch connection settings
OPENSEARCH_HOST = "localhost"  # OpenSearch host
OPENSEARCH_PORT = 9200  # OpenSearch port
OPENSEARCH_INDEX = "tech-document-2"  # Index name for document storage

# Embedding settings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # Model for generating embeddings
EMBEDDING_DIMENSION = 384  # Embedding dimension for the model
ASSYMETRIC_EMBEDDING = False  # Whether to use asymmetric embeddings

# Chunking settings
TEXT_CHUNK_SIZE = 500  # Number of space-delimited words per chunk
TEXT_CHUNK_OVERLAP = 100  # Number of words shared between consecutive chunks (must be < TEXT_CHUNK_SIZE)

print("Constants defined. You can modify these values to experiment with different settings.")


Constants defined. You can modify these values to experiment with different settings.


In [4]:
# Utility functions for text processing

def clean_text(text: str) -> str:
    """
    Cleans text extracted from a PDF by joining words hyphenated across line breaks,
    unwrapping single newlines, and collapsing repeated newlines and whitespace.

    Args:
        text (str): The text to clean.

    Returns:
        str: The cleaned text.
    """
    # Remove hyphens at line breaks (e.g., 'exam-\nple' -> 'example')
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

    # Replace newlines within sentences with spaces
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

    # Replace multiple newlines with a single newline
    text = re.sub(r"\n+", "\n", text)

    # Remove excessive whitespace
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()


def chunk_text(text: str, chunk_size: int, overlap: int = 100) -> List[str]:
    """
    Splits text into chunks with a specified overlap.

    Args:
        text (str): The text to split.
        chunk_size (int): The number of space-delimited words in each chunk.
        overlap (int): The number of words shared between consecutive chunks; must be smaller than chunk_size.

    Returns:
        List[str]: A list of text chunks.
    """
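    # Validate inputs: with overlap >= chunk_size, `start` would never advance
    # and the loop below would not terminate
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Clean the text before chunking
    text = clean_text(text)

    # Split on spaces; each "token" here is a space-delimited word, not a model token
    tokens = text.split(" ")

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        start = end - overlap  # Advance by chunk_size - overlap words

    return chunks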

print("Utility functions defined. You can modify these functions to experiment with different text processing techniques.")


Utility functions defined. You can modify these functions to experiment with different text processing techniques.
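
To make the cleaning rules concrete, here is a small standalone sketch that applies the same four regular expressions as `clean_text` to a made-up string. The sample text is invented for illustration, and the function is copied under the hypothetical name `clean_text_demo` only so the snippet runs on its own.

```python
import re


def clean_text_demo(text: str) -> str:
    """Standalone copy of the notebook's clean_text rules, for illustration only."""
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)  # join words hyphenated across a line break
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)    # a single newline inside a paragraph becomes a space
    text = re.sub(r"\n+", "\n", text)               # collapse runs of newlines into one
    text = re.sub(r"[ \t]+", " ", text)             # collapse runs of spaces and tabs
    return text.strip()


sample = "exam-\nple text\nwraps here.\n\n\nNew   paragraph"
print(repr(clean_text_demo(sample)))
# prints 'example text wraps here.\nNew paragraph'
```

Note that the paragraph break survives as a single `\n`. Because `chunk_text` splits on spaces only, `here.\nNew` stays together as one token.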


In [5]:
# text = "wow what a great day . could become shitty at any point. but for now, it is a great day"
# TEXT_CHUNK_OVERLAP=2
# TEXT_CHUNK_SIZE = 5
# chunks = chunk_text(text, TEXT_CHUNK_SIZE, TEXT_CHUNK_OVERLAP)
# chunks
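
The overlap logic is easier to see on a tiny input. The sketch below uses a hypothetical helper, `sliding_chunks`, which is not part of the notebook. It produces the same windows as the loop in `chunk_text` (each window starts `chunk_size - overlap` words after the previous one) and rejects overlaps that would stop the window from advancing.

```python
from typing import List


def sliding_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Hypothetical helper: windows of chunk_size words, advancing by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [words[start:start + chunk_size] for start in range(0, len(words), step)]


words = "a b c d e f g h i j".split()
for chunk in sliding_chunks(words, chunk_size=5, overlap=2):
    print(" ".join(chunk))
# prints:
# a b c d e
# d e f g h
# g h i j
# j
```

The last window (`j`) is fully contained in the one before it. This windowing scheme always emits such a tail when the final full step lands inside the text, so you may want to drop or merge it when experimenting.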

In [6]:
# Embedding functions

def get_embedding_model():
    """
    Loads and returns the sentence transformer embedding model.
    
    Returns:
        SentenceTransformer: The loaded embedding model.
    """
    print(f"Loading embedding model: {EMBEDDING_MODEL_NAME}")
    model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    return model


def generate_embeddings(texts: List[str]):
    """
    Generates embeddings for a list of text chunks.
    
    Args:
        texts (List[str]): List of text chunks to embed.
        
    Returns:
        numpy.ndarray: Array of shape (len(texts), EMBEDDING_DIMENSION), one row per input text.
    """
    model = get_embedding_model()
    
    # If using asymmetric embeddings, prefix each text with "passage: "
    if ASSYMETRIC_EMBEDDING:
        texts = [f"passage: {text}" for text in texts]
        
    # Generate embeddings
    embeddings = model.encode(texts)
    return embeddings

print("Embedding functions defined. You can modify these functions to experiment with different embedding techniques.")

Embedding functions defined. You can modify these functions to experiment with different embedding techniques.
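
`generate_embeddings` calls `get_embedding_model()` on every invocation, so the model is loaded again each time the function runs. One possible way to avoid that is to memoize the loader with `functools.lru_cache`. This is a sketch only: `get_cached_model` is a hypothetical helper that is not part of the notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_cached_model(model_name: str):
    """Hypothetical helper: load the model once per process and reuse it afterwards."""
    # Imported lazily so that defining this function does not require the library
    from sentence_transformers import SentenceTransformer

    print(f"Loading embedding model: {model_name}")
    return SentenceTransformer(model_name)


# Usage (left commented out because it downloads and loads the model):
# model = get_cached_model("sentence-transformers/all-MiniLM-L6-v2")
# embeddings = model.encode(["first chunk", "second chunk"])
```

With `maxsize=1`, switching to a different model name evicts the previous model, which keeps memory use bounded while you experiment.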


In [7]:
# embeddings =  generate_embeddings(chunks)
# print(f"Generated {len(chunks)} chunks")
# print(f"Generated {len(embeddings)} embeddings")
# print(f"Embedding shape: {embeddings[0].shape}")
# embeddings

## 1. Connect to OpenSearch and Create Index

Now that we have all our utility functions defined, let's connect to OpenSearch and create an index with the right mappings for hybrid search.

The index configuration includes three main components:
1. **Text Field (`text`)**: Used for full-text search with the BM25 algorithm
2. **Vector Field (`embedding`)**: Used for semantic search with k-nearest neighbors (k-NN)
3. **Metadata Field (`document_name`)**: Used for filtering and organizing documents

Make sure you have OpenSearch running locally (typically in a Docker container).
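
As a rough illustration of how each field is meant to be used, the sketch below builds one query body per role. The function names are hypothetical and nothing is sent to OpenSearch here. The bodies follow the standard `match`, `knn`, and `bool`/`term` query shapes, with field names taken from the mapping described above.

```python
import json
from typing import Any, Dict, List


def lexical_query(query_text: str) -> Dict[str, Any]:
    """BM25 full-text search against the `text` field."""
    return {"query": {"match": {"text": {"query": query_text}}}}


def vector_query(query_vector: List[float], k: int = 3) -> Dict[str, Any]:
    """k-NN search against the `embedding` field."""
    return {"size": k, "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}}}


def lexical_query_for_document(query_text: str, document_name: str) -> Dict[str, Any]:
    """Full-text search restricted to one document via the `document_name` keyword field."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": {"query": query_text}}}],
                "filter": [{"term": {"document_name": document_name}}],
            }
        }
    }


print(json.dumps(lexical_query_for_document("endpoint detection", "my-paper"), indent=2))
```

Each body could then be passed as `body=` to `client.search`. The `term` filter only matches when `document_name` is mapped as `keyword`, which is why the mapping uses that type.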


In [8]:
# Create an OpenSearch client
client = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": OPENSEARCH_PORT}],
    http_compress=True,
    timeout=30,
    max_retries=3,
    retry_on_timeout=True,
)

# Check connection
try:
    info = client.info()
    print(f"Successfully connected to OpenSearch {info['version']['number']}")
except Exception as e:
    print(f"Failed to connect to OpenSearch: {e}")
    print(f"Make sure OpenSearch is running on {OPENSEARCH_HOST}:{OPENSEARCH_PORT}")
    raise



Successfully connected to OpenSearch 2.11.0


In [9]:
# Define the index configuration
def create_index_config() -> Dict[str, Any]:
    """
    Creates the index configuration with mappings for text, embeddings, and metadata.
    
    Returns:
        Dict[str, Any]: The index configuration.
    """
    config = {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 0,
                "knn": True
            }
        },
        "mappings": {
            "properties": {
                "text": {
                    "type": "text"  # For standard text search
                },
                "embedding": {
                    "type": "knn_vector",
                    "dimension": EMBEDDING_DIMENSION,  # Match your embedding model's dimension
                    "method": {
                        "engine": "faiss",
                        "space_type": "l2",
                        "name": "hnsw",
                        "parameters": {}
                    }
                },
                "document_name": {
                    "type": "keyword"  # For exact match on document names
                }
            }
        }
    }
    return config

# Get the index configuration
index_config = create_index_config()
print("\nIndex Configuration:")
print(json.dumps(index_config, indent=2))



Index Configuration:
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "document_name": {
        "type": "keyword"
      }
    }
  }
}


In [10]:
# Create the index if it doesn't exist
if not client.indices.exists(index=OPENSEARCH_INDEX):
    response = client.indices.create(index=OPENSEARCH_INDEX, body=index_config)
    print(f"\nCreated index {OPENSEARCH_INDEX} with response: {response}")
else:
    print(f"\nIndex {OPENSEARCH_INDEX} already exists")


Created index tech-document-2 with response: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'tech-document-2'}


In [11]:
## Before continuing, set up the search pipeline with script #1 (run it from OpenSearch Dashboards on port 5601)
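
If you would rather create the search pipeline from Python than from Dashboards, the sketch below sends an equivalent request through the client's low-level transport. The body mirrors the pipeline definition that the next cell reads back. The contents of script #1 are not shown in this notebook, so treat this as an assumption about what it does. `build_pipeline_body` and `put_search_pipeline` are hypothetical helper names.

```python
from typing import Any, Dict, Sequence


def build_pipeline_body(weights: Sequence[float] = (0.3, 0.7)) -> Dict[str, Any]:
    """Pipeline that min-max normalizes sub-query scores and averages them with weights."""
    return {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": list(weights)},
                    },
                }
            }
        ],
    }


def put_search_pipeline(client, name: str, body: Dict[str, Any]) -> Any:
    """Create or overwrite a search pipeline with a PUT to /_search/pipeline/<name>."""
    return client.transport.perform_request("PUT", f"/_search/pipeline/{name}", body=body)


# Usage with the notebook's client (commented out; requires a running cluster):
# put_search_pipeline(client, "personal-paper-search-pipeline", build_pipeline_body())
```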

In [12]:
from opensearchpy.exceptions import NotFoundError
#pipeline_name = "nlp-search-pipeline"
pipeline_name = "personal-paper-search-pipeline"

try:
    result = client.transport.perform_request(
        "GET",
        f"/_search/pipeline/{pipeline_name}"
    )
    print(f"\n✅ Search pipeline '{pipeline_name}' exists.")
    print(result)
except NotFoundError:
    print(f"\n⚠️ Search pipeline '{pipeline_name}' does NOT exist.")
except Exception as e:
    print(f"\n🚨 Error: {e}")


✅ Search pipeline 'personal-paper-search-pipeline' exists.
{'personal-paper-search-pipeline': {'description': 'Post processor for hybrid search', 'phase_results_processors': [{'normalization-processor': {'normalization': {'technique': 'min_max'}, 'combination': {'technique': 'arithmetic_mean', 'parameters': {'weights': [0.3, 0.7]}}}}]}}
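
The pipeline shown above min-max normalizes each sub-query's scores and then combines them with a weighted arithmetic mean. The sketch below illustrates that arithmetic on made-up scores. It is a simplification rather than OpenSearch's implementation, which also has to handle cases such as documents returned by only one sub-query. It also assumes the weights apply to the sub-queries in the order they are listed in the hybrid query, which is worth confirming in the OpenSearch documentation.

```python
from typing import Dict, Sequence


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted arithmetic mean of per-sub-query scores for one document."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up raw scores for three chunks from the two sub-queries
bm25_scores = {"chunk_a": 2.0, "chunk_b": 4.0, "chunk_c": 6.0}
knn_scores = {"chunk_a": 0.9, "chunk_b": 0.5, "chunk_c": 0.1}
weights = [0.3, 0.7]  # lexical weight, vector weight

bm25_norm, knn_norm = min_max(bm25_scores), min_max(knn_scores)
combined = {
    doc: weighted_mean([bm25_norm[doc], knn_norm[doc]], weights) for doc in bm25_scores
}
for doc, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{doc}: {score:.2f}")
```

In this toy example `chunk_a` ranks first even though it has the lowest BM25 score, because the vector sub-query carries the larger weight.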


## 2. Process PDF Document

Now let's process a PDF document to extract its content:
1. Read the PDF and extract the text
2. Clean and chunk the text into smaller segments
3. Generate embeddings for each chunk

You can replace the PDF path with your own document if you want to experiment with different content.


In [13]:
import os

cwd = os.getcwd()
print(cwd)

/Users/parulpandey/Library/CloudStorage/OneDrive-Personal/Ext Github Repos/RAG_UI_fresh/notebooks


In [14]:
# Read and process the PDF
# pdf_path = "attention is all you need.pdf"  # Alternative document; path relative to notebook directory
pdf_path = "Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory MachiImproved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learningne Learning.pdf"

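# Read the PDF file and concatenate the text of every page
# (`or ""` guards against pages that yield no extractable text)
reader = PdfReader(pdf_path)
text = "".join(page.extract_text() or "" for page in reader.pages)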
print(f"Extracted {len(text)} characters from {pdf_path}")

# Show a sample of the extracted text
print("\nSample of extracted text:")
print(text[:500] + "...")

# Clean the text
cleaned_text = clean_text(text)
print(f"\nText cleaned. Length: {len(cleaned_text)} characters")

# Chunk the text
chunks = chunk_text(cleaned_text, chunk_size=TEXT_CHUNK_SIZE, overlap=TEXT_CHUNK_OVERLAP)
print(f"Split text into {len(chunks)} chunks")

# Display a sample chunk
print("\nSample chunk:")
print(chunks[0])
print(f"Number of chunks: {len(chunks)}")

Extracted 46610 characters from Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory MachiImproved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learningne Learning.pdf

Sample of extracted text:
Citation: Kim, Y.J.; Song, J.H.; Cho,
K.H.; Shin, J.H.; Kim, J.S.; Yoon, J.S.;
Hong, S.J. Improved Plasma Etch
Endpoint Detection Using
Attention-Based Long Short-Term
Memory Machine Learning.
Electronics 2024 ,13, 3577. https://
doi.org/10.3390/electronics13173577
Academic Editors: Claudio Turchetti
and Laura Falaschetti
Received: 31 July 2024
Revised: 2 September 2024
Accepted: 5 September 2024
Published: 9 September 2024
Copyright: ©2024 by the authors.
Licensee MDPI, Basel, Switzerland.
This...

Text cleaned. Length: 46528 characters
Split text into 15 chunks

Sample chunk:
Citation: Kim, Y.J.; Song, J.H.; Cho, K.H.; Shin, J.H.; Kim, J.S.; Yoon, J.S.; Hong, S.J. Improved Plasma Etch Endpoint Detection Using Attention-Bas

In [15]:
pdf_file_name = pdf_path.replace('.pdf', '')

# Generate embeddings for the chunks
print("Generating embeddings for chunks. This might take a moment...")
embeddings = generate_embeddings(chunks)
print(f"Generated {len(chunks)} chunks")
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings[0].shape}")
print(f"Embedding shape: {embeddings.shape}")

# Display a sample embedding (prints the full vector for the first chunk)
print("\nSample embedding:")
print(embeddings[0])

# Prepare documents for indexing
documents_to_index = [
    {
        "doc_id": f"{pdf_file_name}_{i}",
        "text": chunk,
        "embedding": embedding,
        "document_name": pdf_file_name,
    }
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
]

print(f"\nPrepared {len(documents_to_index)} documents for indexing")


Generating embeddings for chunks. This might take a moment...
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Generated 15 chunks
Generated 15 embeddings
Embedding shape: (384,)
Embedding shape: (15, 384)

Sample embedding:
[ 1.05578750e-02  6.27988353e-02  7.31234998e-02 -1.91180296e-02
  6.34288266e-02  2.84780236e-03 -2.56312601e-02 -3.32025476e-02
 -3.85175401e-04 -4.29034270e-02 -3.03010773e-02 -4.74573635e-02
 -3.76687311e-02 -4.51399712e-04 -1.03420904e-02 -6.69936612e-02
  3.93932946e-02 -2.49275193e-02  1.27381766e-02 -2.90004369e-02
  6.56380951e-02 -2.35572457e-02  2.47118203e-03  1.78134406e-03
 -3.48995142e-02  6.87616244e-02  6.14231415e-02  3.15121487e-02
 -5.21094389e-02 -4.79027852e-02  2.80021615e-02 -2.75245998e-02
 -3.21146883e-02  4.95025851e-02 -5.21548502e-02  1.91652849e-02
 -5.89792356e-02  5.31596802e-02 -4.61701043e-02  9.97116556e-04
  3.27665098e-02 -7.22514391e-02 -2.38627102e-02 -1.45818666e-02
  1.49840891e-01  7.41566624e-03  2.69192588e

In [16]:
print(len(documents_to_index))
documents_to_index[1]

15


{'doc_id': 'Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory MachiImproved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learningne Learning_1',
 'text': 'growth of various sectors, including cloud services and mobile devices. The latest generation of 3D-NAND features more than 200 layers of vertical gate stacks. To manufacture such a highly integrated 3D-NAND flash memory, high-aspect-ratio (HAR) etching technology is essential, enabling the precise etching of numerous layers, ranging in the several hundreds [ 1,2]. In these structures, accurately controlling the etching depth and precisely detecting the etching endpoint for each layer has become increasingly important. Consequently, reliable etch endpoint detection (EPD) techniques are crucial. EPD plays a vital role in determining the yield and quality of the etching process, and its significance continues to grow [3]. Accurate etch endpoint detection plays a 

## 3. Ingest Documents into OpenSearch

Now that we've processed the document and generated embeddings, let's ingest them into OpenSearch. We'll:

1. Format each document with its text, embedding, and metadata
2. Use the bulk API to efficiently insert all documents
3. Verify that the documents were properly indexed

This creates searchable content in the OpenSearch index that we can later query using hybrid search.


In [17]:
## Each chunk is indexed as its own document, with its own embedding

In [18]:
documents_to_index[0]['text']

'Citation: Kim, Y.J.; Song, J.H.; Cho, K.H.; Shin, J.H.; Kim, J.S.; Yoon, J.S.; Hong, S.J. Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learning. Electronics 2024 ,13, 3577. https:// doi.org/10.3390/electronics13173577 Academic Editors: Claudio Turchetti and Laura Falaschetti Received: 31 July 2024 Revised: 2 September 2024 Accepted: 5 September 2024 Published: 9 September 2024 Copyright: ©2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/). electronics Article Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learning Ye Jin Kim1, Jung Ho Song2, Ki Hwan Cho2, Jong Hyeon Shin2, Jong Sik Kim2 , Jung Sik Yoon2 and Sang Jeen Hong1,* 1Department of Semiconductor Engineering, Myongji University, Yongin 17058, Republic

In [19]:
## Index the documents in OpenSearch: build one bulk action per document

In [20]:
# Prepare bulk actions for OpenSearch
actions = []
for doc in documents_to_index:
    # Handle asymmetric embedding if enabled
    if ASSYMETRIC_EMBEDDING:
        prefixed_text = f"passage: {doc['text']}"
    else:
        prefixed_text = doc['text']
    
    
    # Create a bulk action for this document; each action tells OpenSearch
    # which index, ID, and source fields to write
    action = {
        "_index": OPENSEARCH_INDEX,
        "_id": doc["doc_id"],
        "_source": {
            "text": prefixed_text,
            "embedding": doc["embedding"].tolist(),  # Convert numpy array to list
            "document_name": doc["document_name"],
        },
    }
    actions.append(action)

In [21]:
# Perform bulk indexing
print(f"Indexing {len(actions)} documents into OpenSearch...")
try:
    success, errors = helpers.bulk(client, actions, raise_on_error=True)
    if errors:
        print(f"Indexed {success} documents with {len(errors)} errors")
        print(f"First error: {errors[0]}")
    else:
        print(f"Successfully indexed {success} documents")
except Exception as e:
    print(f"Error during bulk indexing: {e}")    

Indexing 15 documents into OpenSearch...
Successfully indexed 15 documents


In [22]:
# Verify the documents are indexed
response = client.count(index=OPENSEARCH_INDEX)
print(f"Total documents in index: {response['count']}")

# Get one document to verify content
if response['count'] > 0:
    sample = client.search(
        index=OPENSEARCH_INDEX, 
        body={
            "size": 1,
            "_source": {"excludes": ["embedding"]},  # Exclude embeddings as they're large
            "query": {"match_all": {}}
        }
    )
    print("\nSample document from index:")
    print(json.dumps(sample['hits']['hits'][0]['_source'], indent=2))

Total documents in index: 0
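
A count of 0 right after a successful bulk request most likely means the index had not refreshed yet. Newly indexed documents become visible to search and count requests only after a refresh, which by default runs periodically rather than after every write. A minimal sketch, assuming the same `client` and index as above, is to refresh explicitly before counting. `count_after_refresh` is a hypothetical helper.

```python
def count_after_refresh(client, index: str) -> int:
    """Force a refresh so recent writes are visible, then return the document count."""
    client.indices.refresh(index=index)
    return client.count(index=index)["count"]


# Usage with the notebook's client (commented out; requires a running cluster):
# print(f"Total documents in index: {count_after_refresh(client, OPENSEARCH_INDEX)}")
```

An explicit refresh is convenient in a notebook. For large ingestion jobs it is usually better to refresh once at the end than after every batch.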


In [23]:
# Search with keyword matching

query = {
    "size": 2,
    "_source": {"excludes": ["embedding"]},
    "query": {
        "match": {
            "text": "transformers"
        }
    }
}
results = client.search(index=OPENSEARCH_INDEX, body=query)
for hit in results['hits']['hits']:
    print(json.dumps(hit['_source'], indent=3))

In [24]:
#query_text = "What are transformers?"
query_text = "How can attention mechanism be used for CD predictions?"

# Generate embedding
query_embedding = generate_embeddings([query_text])[0].tolist()

# Set top_k
top_k = 3

query_body = {
    "_source": {"excludes": ["embedding"]},
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text": {"query": query_text}}},
                {
                    "knn": {
                        "embedding": {
                            "vector": query_embedding,
                            "k": top_k
                        }
                    }
                }
            ]
        }
    },
    "size": top_k
}

response = client.search(
    index=OPENSEARCH_INDEX, body=query_body, search_pipeline=pipeline_name
)

# Print the results
print(f"\nTop {top_k} results for query: '{query_text}'\n")
for i, hit in enumerate(response['hits']['hits'], 1):
    print(f"Result {i}:")
    print(json.dumps(hit['_source'], indent=2))
    print("-" * 60)

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2

Top 3 results for query: 'How can attention mechanism be used for CD predictions?'

Result 1:
{
  "document_name": "Improved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory MachiImproved Plasma Etch Endpoint Detection Using Attention-Based Long Short-Term Memory Machine Learningne Learning",
  "text": "model. The acquired OES data are characterized by a gradual change in intensity as the etching process progresses, with different intensity widths at certain critical points where the intensity changes in unit time. To effectively capture these change features, we used an attention mechanism. The attention mechanism can focus on points in a sequence of OES data that have different intensity changes. It emphasizes the important parts of the sequence by assigning higher weights to points where the change in intensity represents a change from the previous state. During the learning process of the mo

## Conclusion

Congratulations! You've successfully:
1. Created an OpenSearch index with proper mappings for hybrid search
2. Processed a PDF document and split it into chunks
3. Generated embeddings for each chunk
4. Ingested the chunks with their embeddings into OpenSearch

All the code is defined directly in this notebook, so you can experiment with different:
- Text cleaning and chunking strategies
- Embedding models and parameters
- OpenSearch index configurations

In the next notebook, you'll learn how to perform hybrid search on this indexed content and generate responses using LLMs.
