### OpenSearch - Hybrid Search 

Importing Required Libraries

In [1]:
from OpenSearchVectorSearch import OpenSearchVectorSearch
from utils import pretty_print
import os

Loading all environment variables

In [2]:
from dotenv import load_dotenv

load_dotenv()

True

Instantiating OpenSearchVectorSearch class

The constructor needs the OpenSearch URL, port, username, and password, read here from environment variables.

In [3]:
opensearch_vector_search = OpenSearchVectorSearch(
    OPENSEARCH_URL=os.getenv('OPENSEARCH_URL'),
    OPENSEARCH_PORT=os.getenv('OPENSEARCH_PORT'),
    OPENSEARCH_USERNAME=os.getenv('OPENSEARCH_USERNAME'),
    OPENSEARCH_PASSWORD=os.getenv('OPENSEARCH_PASSWORD'),
)
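`os.getenv` returns `None` for a variable that is not set, so a missing entry in `.env` would only surface later as a connection error. A minimal sketch of an upfront check follows; `read_opensearch_settings` is a hypothetical helper, not part of the `OpenSearchVectorSearch` module, and the variable names are the ones used in the cell above.

```python
import os

# Names taken from the constructor call above.
REQUIRED_VARS = (
    "OPENSEARCH_URL",
    "OPENSEARCH_PORT",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
)


def read_opensearch_settings(env=None):
    """Return the four connection settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Because the keys match the constructor's keyword arguments, the result could be passed as `OpenSearchVectorSearch(**read_opensearch_settings())`.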

Creating Search Pipeline: 

In [4]:
opensearch_vector_search.create_search_pipeline(
    search_pipeline_name="search_pipeline_1",
    keyword_weight=0.3,
    vector_weight=0.7,
)



Search pipeline search_pipeline_1 created successfully....!
Response: {'acknowledged': True}
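The internals of `create_search_pipeline` are not shown in this notebook. Assuming it wraps OpenSearch's normalization processor, the request body sent to `PUT /_search/pipeline/<name>` could look like the sketch below. `build_search_pipeline_body` is a hypothetical name, and the choice of `min_max` normalization with `arithmetic_mean` combination is an assumption.

```python
def build_search_pipeline_body(keyword_weight, vector_weight):
    """Build a request body for PUT /_search/pipeline/<name> (sketch)."""
    # Sanity check mirroring the convention used in this notebook:
    # the two weights always add up to 1.0.
    if abs(keyword_weight + vector_weight - 1.0) > 1e-9:
        raise ValueError("keyword_weight and vector_weight should sum to 1.0")
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    # Rescale each sub-query's scores to a common range.
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Order must match the sub-query order in the hybrid query.
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }
```

With `keyword_weight=0.3` and `vector_weight=0.7`, the `weights` list becomes `[0.3, 0.7]`, so the keyword sub-query is assumed to come first in the hybrid query.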


Hybrid Search

In [5]:
results = opensearch_vector_search.hybrid_search(
    query="what is langchain?",
    top_k=3, 
    index_name="semantic-index",
    search_pipeline_name="search_pipeline_1",
)
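As a rough picture of what `hybrid_search` might send, the sketch below builds an OpenSearch `hybrid` query with one keyword (`match`) sub-query and one `knn` sub-query. The `text` field appears in the `_source` of the results in this notebook; `embedding` is a hypothetical vector field name, and `query_vector` would come from an embedding model that is not shown here.

```python
def build_hybrid_query(query_text, query_vector, top_k,
                       text_field="text", vector_field="embedding"):
    """Build a hybrid query body: keyword sub-query first, vector sub-query second."""
    return {
        "size": top_k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical (BM25) match on the chunk text.
                    {"match": {text_field: {"query": query_text}}},
                    # Approximate nearest-neighbour search on the embedding.
                    {"knn": {vector_field: {"vector": list(query_vector), "k": top_k}}},
                ]
            }
        },
    }
```

The search pipeline is selected per request through the `search_pipeline` query parameter, for example with opensearch-py as `client.search(index="semantic-index", body=body, params={"search_pipeline": "search_pipeline_1"})`.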



In [6]:
results

[{'_index': 'semantic-index',
  '_id': 'doc333#chunk0',
  '_score': 1.0,
  '_source': {'metadata': {'date': '2024-02-28',
    'parent_id': 'doc333',
    'name': 'Langchain',
    'source': 'https://api.python.langchain.com/',
    'published': False,
    'lang': 'eng'},
   'text': "LangChain's developers highlight the framework's applicability to use-cases including chatbots,[7] retrieval-augmented generation,[8] document summarization,[9] and synthetic data generation. As of March 2023, LangChain included integrations with systems including Amazon, Google, and Microsoft Azure cloud storage; API wrappers for news, movie information, and weather; Bash for summarization,"}},
 {'_index': 'semantic-index',
  '_id': 'doc333#chunk4',
  '_score': 0.13202284,
  '_source': {'metadata': {'date': '2024-02-28',
    'parent_id': 'doc333',
    'name': 'Langchain',
    'source': 'https://api.python.langchain.com/',
    'published': False,
    'lang': 'eng'},
   'text': 'text mapping for k-nearest neigh

In [7]:
pretty_print(results)

Document-1:
Content: 
 LangChain's developers highlight the framework's applicability to use-cases including chatbots,[7] retrieval-augmented generation,[8] document summarization,[9] and synthetic data generation. As of March 2023, LangChain included integrations with systems including Amazon, Google, and Microsoft Azure cloud storage; API wrappers for news, movie information, and weather; Bash for summarization,


Metadata: 
 {'date': '2024-02-28', 'parent_id': 'doc333', 'name': 'Langchain', 'source': 'https://api.python.langchain.com/', 'published': False, 'lang': 'eng'}


Score: 1.0


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 text mapping for k-nearest neighbors search; time zone conversion and calendar operations; tracing and recording stack symbols in threaded and asynchronous subprocess runs; and the Wolfram Alpha website and SDK.[13] As of April 2023, it can read from more
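The source of `pretty_print` in `utils` is not included here. Judging only from the printed layout, a formatter along the following lines would produce similar output; `format_results` and the separator width are hypothetical, and the hits are assumed to have the `_source`/`_score` shape shown in cell 6.

```python
SEPARATOR = "-" * 125  # assumed width, chosen to resemble the printed rule


def format_results(results):
    """Render hits as text in roughly the layout printed above."""
    blocks = []
    for position, hit in enumerate(results, start=1):
        source = hit["_source"]
        blocks.append(
            f"Document-{position}:\n"
            f"Content: \n {source['text']}\n\n\n"
            f"Metadata: \n {source['metadata']}\n\n\n"
            f"Score: {hit['_score']}\n\n"
        )
    return ("\n" + SEPARATOR + "\n").join(blocks)
```

Returning a string instead of printing keeps the helper easy to test; `print(format_results(results))` gives the notebook-style display.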

### Experiment - 1




#### Hybrid Search 

keyword_weight = 1.0, vector_weight = 0.0

In [6]:
opensearch_vector_search.create_search_pipeline(
    search_pipeline_name="search_pipeline_keyword_1_vector_0",
    keyword_weight=1.0,
    vector_weight=0.0,
)

Search pipeline search_pipeline_keyword_1_vector_0 created successfully....!
Response: {'acknowledged': True}




In [7]:
results = opensearch_vector_search.hybrid_search(
    query="what are the country named in our database?",
    top_k=3, 
    index_name="semantic-index",
    search_pipeline_name="search_pipeline_keyword_1_vector_0",
)

results



[{'_index': 'semantic-index',
  '_id': 'doc653#chunk0',
  '_score': 1.0,
  '_source': {'metadata': {'date': '2022-06-01',
    'parent_id': 'doc653',
    'name': 'Vector Store',
    'source': 'https://api.python.langchain.com/',
    'published': False,
    'lang': 'eng'},
   'text': 'A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms,[1][2] so that one can search the database with a query vector to retrieve the closest matching database records.Vectors are mathematical'}},
 {'_index': 'semantic-index',
  '_id': 'doc653#chunk1',
  '_score': 0.7864764,
  '_source': {'metadata': {'date': '2022-06-01',
    'parent_id': 'doc653',
    'name': 'Vector Store',
    'source': 'https://api.python.langchain.com/',
    'published': False,
    'lang': 'eng'},
   'text': "database records.Vectors are mathem

#### Keyword Search

In [40]:
results = opensearch_vector_search.keyword_search(
    index_name="semantic-index",
    query="what are the country named in our database?",
    top_k=3,
)

pretty_print(results)

OpenSearch client created successfully....!
Document-1:
Content: 
 A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms,[1][2] so that one can search the database with a query vector to retrieve the closest matching database records.Vectors are mathematical


Metadata: 
 {'date': '2022-06-01', 'parent_id': 'doc653', 'name': 'Vector Store', 'source': 'https://api.python.langchain.com/', 'published': False, 'lang': 'eng'}


Score: 5.952817


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 database records.Vectors are mathematical representations of data in a high-dimensional space. In this space, each dimension corresponds to a feature of the data, with the number of dimensions ranging from few h



### Experiment - 2

#### Hybrid Search 

keyword_weight = 0.0, vector_weight = 1.0

In [10]:
# create search pipeline
opensearch_vector_search.create_search_pipeline(
    search_pipeline_name="search_pipeline_keyword_0_vector_1",
    keyword_weight=0.0,
    vector_weight=1.0,
) 

Search pipeline search_pipeline_keyword_0_vector_1 created successfully....!
Response: {'acknowledged': True}




In [34]:
results = opensearch_vector_search.hybrid_search(
    query="what are the country named in our database?",
    top_k=3, 
    index_name="semantic-index",
    search_pipeline_name="search_pipeline_keyword_0_vector_1",
)

pretty_print(results)

Document-1:
Content: 
 India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia. It is the seventh-largest country by area; the most populous country with effect from June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name': 'India', 'source': 'https://india.com/', 'published': True, 'lang': 'eng'}


Score: 1.0


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 Italy,[a] officially the Italian Republic,[b] is a country in Southern[12] and Western[13][c] Europe. It is located on a peninsula that extends into the middle of the Mediterranean Sea, with the Alps on its northern land border, as well as islands, notably Sicily and Sar



#### Vector Search

In [35]:
results = opensearch_vector_search.similarity_search(
    index_name="semantic-index",
    query="what are the country named in our database?",
    top_k=3,
)

pretty_print(results)



OpenSearch client created successfully....!
Document-1:
Content: 
 India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia. It is the seventh-largest country by area; the most populous country with effect from June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name': 'India', 'source': 'https://india.com/', 'published': True, 'lang': 'eng'}


Score: 1.7695986


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 Italy,[a] officially the Italian Republic,[b] is a country in Southern[12] and Western[13][c] Europe. It is located on a peninsula that extends into the middle of the Mediterranean Sea, with the Alps on its northern land 
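Experiments 1 and 2 suggest that a pipeline with all weight on one sub-query returns the same ranking as that search run alone, while the scores differ (1.0 versus 1.7695986 for the top hit here) because the pipeline normalizes them. A small check of that observation could compare document ids in order. `ranked_ids` and `same_ranking` are hypothetical helpers, and they assume both searches return raw hits carrying an `_id` field, as in the `results` dump earlier in the notebook.

```python
def ranked_ids(results):
    """Return the document ids of the hits, in ranking order."""
    return [hit["_id"] for hit in results]


def same_ranking(first, second):
    """True if both result lists contain the same ids in the same order."""
    return ranked_ids(first) == ranked_ids(second)
```

Only the order is compared, so normalized and raw scores do not need to match.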



### Experiment - 3

#### Hybrid Search - balanced

keyword_weight = 0.5, vector_weight = 0.5

In [8]:
opensearch_vector_search.create_search_pipeline(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    keyword_weight=0.5,
    vector_weight=0.5,
)

Search pipeline search_pipeline_keyword_05_vector_05 created successfully....!
Response: {'acknowledged': True}




In [9]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what are the country named in our database?",
    top_k=3,
    index_name="semantic-index",
)

pretty_print(results)

Document-1:
Content: 
 India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia. It is the seventh-largest country by area; the most populous country with effect from June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name': 'India', 'source': 'https://india.com/', 'published': True, 'lang': 'eng'}


Score: 0.5005


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms,[1][2] so that one 
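The score of 0.5005 for the first document is consistent with min-max normalization followed by a weighted arithmetic mean, under two assumptions that are not confirmed by this notebook: the lowest-ranked hit of a sub-query is floored at 0.001 instead of 0, and a document missing from a sub-query contributes 0. The raw scores passed in below are hypothetical, apart from the two top scores (5.952817 and 1.7695986) reported by the keyword and vector searches earlier.

```python
MIN_SCORE = 0.001  # assumed floor for the lowest-ranked hit of a sub-query


def min_max(score, lowest, highest):
    """Min-max normalize one score within its sub-query's score range."""
    if highest == lowest:
        return 1.0  # a single result (or all-equal scores) is treated as a top hit
    normalized = (score - lowest) / (highest - lowest)
    return normalized if normalized > 0.0 else MIN_SCORE


def combine(normalized_scores, weights):
    """Weighted arithmetic mean; None marks a document missing from a sub-query."""
    total = sum(
        weight * (score if score is not None else 0.0)
        for score, weight in zip(normalized_scores, weights)
    )
    return total / sum(weights)


# Hypothetical case: weakest keyword hit, strongest vector hit.
keyword = min_max(2.0, lowest=2.0, highest=5.952817)
vector = min_max(1.7695986, lowest=1.2, highest=1.7695986)
print(round(combine([keyword, vector], [0.5, 0.5]), 4))
```

Under these assumptions the combined score is 0.5 * 0.001 + 0.5 * 1.0, which is about 0.5005. A document that tops the keyword ranking but is absent from the vector results would get 0.5.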



## Hybrid Search - Metadata Filtering

#### Example-1: 

Retrieve only the top-k documents that are not yet published and relate to this query: "what are the country named in our database?"

published == False

Weights: keyword_weight=0.5, vector_weight=0.5

In [10]:
opensearch_vector_search.create_search_pipeline(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    keyword_weight=0.5,
    vector_weight=0.5,
)

Search pipeline search_pipeline_keyword_05_vector_05 created successfully....!
Response: {'acknowledged': True}




In [11]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what are the country named in our database?",
    top_k=3,
    index_name="semantic-index",
    post_filter= {"bool": {"filter": {"term": {"metadata.published": False}}}}
)

pretty_print(results)

Document-1:
Content: 
 A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms,[1][2] so that one can search the database with a query vector to retrieve the closest matching database records.Vectors are mathematical


Metadata: 
 {'date': '2022-06-01', 'parent_id': 'doc653', 'name': 'Vector Store', 'source': 'https://api.python.langchain.com/', 'published': False, 'lang': 'eng'}


Score: 0.5


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 database records.Vectors are mathematical representations of data in a high-dimensional space. In this space, each dimension corresponds to a feature of the data, with the number of dimensions ranging from few hundreds to tens of thousands, depending on the co



Same query, without any metadata filtering

In [13]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what are the country named in our database",
    top_k=3,
    index_name="semantic-index"
)

pretty_print(results)

Document-1:
Content: 
 India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia. It is the seventh-largest country by area; the most populous country with effect from June 2023;[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name': 'India', 'source': 'https://india.com/', 'published': True, 'lang': 'eng'}


Score: 0.5005


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor (ANN) algorithms,[1][2] so that one 



#### Example-2: 

Retrieve only the top-k documents that are in Italian and relate to this query: "what are the news"

lang == 'ita'

Weights: keyword_weight=0.5, vector_weight=0.5

In [8]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what are the news",
    top_k=3,
    index_name="semantic-index",
    post_filter= {"bool": {"filter": {"term": {"metadata.lang": "ita"}}}}
)

pretty_print(results)

Document-1:
Content: 
 software continui. La smentita fa seguito alla reazione negativa sui commenti fatti dal CEO Hanneke Faber, che aveva accennato alla possibilità durante un'intervista podcast con The Verge.In risposta a questi rapporti, la responsabile delle comunicazioni di Logitech, Nicole Kenyon, ha chiarito: Non ci sono piani per un mouse in abbonamento. Questa dichiarazione è stata rilasciata a diversi media


Metadata: 
 {'date': '2023-07-28', 'parent_id': 'doc513', 'name': 'L’ultimo aggiornamento', 'section': 'news', 'source': 'https://dailyhunt.com', 'published': True, 'lang': 'ita'}


Score: 1.0


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 La smentita fa seguito alla reazione negativa sui commenti fatti dal CEO Hanneke Faber, che aveva accennato alla possibilità durante un'intervista podcast con The Verge.In risposta a questi rapporti, la responsabile delle comunicazi



#### Example-3

Retrieve only the top-k documents that come from the 'sports' section and relate to this query: "what is latest sports news today"

section == 'sports'

Weights: keyword_weight=0.5, vector_weight=0.5

In [16]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what is latest sports news today",
    top_k=3,
    index_name="semantic-index",
    post_filter= {"bool": {"filter": {"term": {"metadata.section": "sports"}}}}
)

pretty_print(results)

Document-1:
Content: 
 place in the morning of the first competition day. On the second competition day, wrestlers who have qualified for the finals and repechage are weighed in again.It is with regret that the Indian contingent shares news of the disqualification of Vinesh Phogat from the women’s wrestling 50kg class,” the Indian Olympic Association (IOA) said in a statement. “Despite the best efforts by the team


Metadata: 
 {'date': '2024-08-07', 'parent_id': 'doc680', 'name': 'Why Vinesh Phogat was disqualified from Paris 2024 Olympics wrestling', 'section': 'sports', 'source': 'https://newstoday.com', 'published': True, 'lang': 'eng'}


Score: 0.5005


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 statement. “Despite the best efforts by the team through the night, she weighed in a few grams over 50kg this morning.No further comments will be made by the contingent at this time. T



#### Example-4

Retrieve only the top-k documents related to this query: "what is latest sports news today"

& (post filter)

which are dated between '2018-01-01' and '2021-01-01', both dates included

**Condition:** range: greater than or equal to 2018-01-01 & less than or equal to 2021-01-01

Weights: keyword_weight=0.5, vector_weight=0.5

In [19]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what is latest sports news today",
    top_k=3,
    index_name="semantic-index",
    post_filter={
        "bool": {
            "filter": {
                "range": {"metadata.date": {"gte": "2018-01-01", "lte": "2021-01-01"}}
            }
        }
    },
)

pretty_print(results)

Document-1:
Content: 
 of Helen telling him they need to go home. He breaks into a nearby animal clinic, treats his wounds, and adopts a pit bull puppy scheduled to be euthanized before beginning to walk home.


Metadata: 
 {'date': '2020-09-16', 'parent_id': 'doc123', 'name': 'John wick: Chapter 1', 'source': 'https://johnwick.com/', 'published': True, 'lang': 'eng'}


Score: 0.5


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE.[32] By 1200 BCE, an archaic form of Sanskrit, an Indo-European language, had diffused into India from the northwest.[33][34] Its evidence today is found in the hymns of the Rigveda. Preserved by an oral tradition that was resolutely vigilant, the Rigveda records the


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name
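The `gte`/`lte` bounds in the range filter are inclusive, so a document dated exactly 2018-01-01 or 2021-01-01 would pass. The local check below mimics that condition on a metadata dict; `in_date_range` is a hypothetical helper for reasoning about the filter, not something OpenSearch or the notebook's class provides, and it assumes dates are ISO `YYYY-MM-DD` strings as in the metadata shown.

```python
from datetime import date


def in_date_range(metadata, start, end):
    """True if metadata['date'] lies in [start, end], both ends included."""
    document_date = date.fromisoformat(metadata["date"])
    return date.fromisoformat(start) <= document_date <= date.fromisoformat(end)
```

For the documents in this notebook, the entries dated 2020-09-16 and 2018-05-19 fall inside the window, while the sports article dated 2024-08-07 does not.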



#### Example-5 

Combining multiple metadata filters

In [22]:
results = opensearch_vector_search.hybrid_search(
    search_pipeline_name="search_pipeline_keyword_05_vector_05",
    query="what is latest sports news today",
    top_k=3,
    index_name="semantic-index",
    post_filter={
        "bool": {
            "filter": [
                {"range": {"metadata.date": {"gte": "2018-01-01", "lte": "2021-01-01"}}},
                {"term": {"metadata.published": True}},
            ]
        }
    },
)

pretty_print(results)

Document-1:
Content: 
 of Helen telling him they need to go home. He breaks into a nearby animal clinic, treats his wounds, and adopts a pit bull puppy scheduled to be euthanized before beginning to walk home.


Metadata: 
 {'date': '2020-09-16', 'parent_id': 'doc123', 'name': 'John wick: Chapter 1', 'source': 'https://johnwick.com/', 'published': True, 'lang': 'eng'}


Score: 0.5


-----------------------------------------------------------------------------------------------------------------------------
Document-2:
Content: 
 margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisation of the third millennium BCE.[32] By 1200 BCE, an archaic form of Sanskrit, an Indo-European language, had diffused into India from the northwest.[33][34] Its evidence today is found in the hymns of the Rigveda. Preserved by an oral tradition that was resolutely vigilant, the Rigveda records the


Metadata: 
 {'date': '2018-05-19', 'parent_id': 'doc633', 'name
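The `post_filter` dictionaries in Examples 1 to 5 all follow the same `bool`/`filter` shape, so they can be assembled from small pieces. The helpers below are a sketch with hypothetical names (`term`, `date_range`, `all_of`); they only build dictionaries and assume every filtered field lives under `metadata.`, as in the examples.

```python
def term(field, value):
    """Exact-match clause on a metadata field."""
    return {"term": {f"metadata.{field}": value}}


def date_range(field, gte=None, lte=None):
    """Inclusive range clause; bounds left as None are omitted."""
    bounds = {key: value for key, value in (("gte", gte), ("lte", lte)) if value is not None}
    return {"range": {f"metadata.{field}": bounds}}


def all_of(*clauses):
    """Combine clauses so that a document must satisfy every one of them."""
    return {"bool": {"filter": list(clauses)}}
```

The filter from Example-5 could then be written as `all_of(date_range("date", gte="2018-01-01", lte="2021-01-01"), term("published", True))` and passed as `post_filter`.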

