# Semantic search using OpenSearch and OpenAI

This notebook walks through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings, and shows how to perform semantic search on top of it.

## Prerequisites

Before you begin, make sure you have created the accounts and services described in the [README](./README.md):
- You have an [Aiven account](./README.md#setup-your-aiven-account)
- You have created your [OpenSearch service](./README.md#create-an-opensearch-service)
- You have an OpenAI account
- You have created and saved an OpenAI API key
- You have set up your Python environment for this notebook

## Adding our Environment Variables
To avoid leaking API keys, we will store them in a `.env` file that is ignored by version control.

**Make a copy of `.env_sample`**

In [None]:
! cp .env_sample .env

## Add our OpenAI API key

Open `.env` and replace `<YOUR_OPENAI_API_KEY>` with the key that you saved from OpenAI.

## Add our OpenSearch Service URI

Verify the Aiven for OpenSearch service is in the `RUNNING` state.

![OpenSearch service in the running state](./assets/opensearch-running-state.png)

Select the running service and copy the **Service URI**.

![Copy the OpenSearch Service URI](assets/copy-opensearch-service-uri.png)

Add the OpenSearch Service URI to the `.env` file you created above, replacing `<OPENSEARCH_SERVICE_URI>`.
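
Before moving on, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch and not part of the original notebook: the helper name `find_unset_variables` is hypothetical, and `OPENAI_API_KEY` is an assumed variable name (check `.env_sample` for the exact keys). Only `OPENSEARCH_SERVICE_URI` is read later in this notebook.

```python
import os


def find_unset_variables(names, environ=None):
    """Return the names that are missing, empty, or still a <PLACEHOLDER>."""
    environ = os.environ if environ is None else environ
    unset = []
    for name in names:
        value = environ.get(name, "").strip()
        # A value such as "<OPENSEARCH_SERVICE_URI>" means the sample was not edited
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(name)
    return unset


# Example with an in-memory mapping (made-up values) instead of the real environment
sample = {
    "OPENSEARCH_SERVICE_URI": "<OPENSEARCH_SERVICE_URI>",
    "OPENAI_API_KEY": "sk-example",  # assumed variable name, made-up value
}
print(find_unset_variables(["OPENSEARCH_SERVICE_URI", "OPENAI_API_KEY"], sample))
```

In the notebook itself you would call `find_unset_variables([...])` after `load_dotenv()` so that it inspects the real environment.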

## Load our environment variables and Connect to our OpenSearch Service

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

from opensearchpy import OpenSearch

connection_string = os.getenv("OPENSEARCH_SERVICE_URI")

# Create the client with SSL/TLS enabled and a 100-second request timeout.
client = OpenSearch(connection_string, use_ssl=True, timeout=100)

# Verify the connection

client.info()

## Download the dataset
To avoid recalculating embeddings for a large dataset, we use a pre-calculated OpenAI embeddings dataset covering Wikipedia articles. We can download the file and unzip it with:

In [None]:
import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")
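
Re-running the cell above downloads and extracts the archive again. A minimal sketch of a guard that skips extraction when the file is already present could look like this; the helper name `extract_once` is hypothetical, and it only covers the unzip step, not the download.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path, target_dir, expected_name):
    """Extract zip_path into target_dir unless expected_name already exists there.

    Returns the path of the (possibly already extracted) file.
    """
    target = Path(target_dir) / expected_name
    if not target.exists():
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(target_dir)
    return target
```

For this notebook the call would be `extract_once("vector_database_wikipedia_articles_embedded.zip", "data", "vector_database_wikipedia_articles_embedded.csv")`, using the CSV file name that the indexing step opens later.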

The extracted CSV file contains the following columns:
* `id` a unique Wikipedia article identifier
* `url` the Wikipedia article URL
* `title` the title of the Wikipedia page
* `text` the text of the article
* `title_vector` and `content_vector` the embeddings calculated on the title and content of the Wikipedia article, respectively
* `vector_id` the id of the vector
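
To make the column layout concrete, here is a minimal sketch that parses a single row. The sample row is made up and uses 3-dimensional vectors to stay readable (the real file uses 1536 dimensions), and `parse_row` is a hypothetical helper. It relies on the vectors being stored as JSON-encoded lists, which is how the indexing code later in this notebook reads them.

```python
import csv
import io
import json

# Made-up one-row sample with the same columns as the real CSV file
sample_csv = io.StringIO(
    "id,url,title,text,title_vector,content_vector,vector_id\n"
    '1,https://example.org/a,A,Text about A,"[0.1, 0.2, 0.3]","[0.4, 0.5, 0.6]",0\n'
)


def parse_row(row):
    """Decode the JSON-encoded vectors and the numeric vector id of one CSV row."""
    parsed = dict(row)
    parsed["title_vector"] = json.loads(row["title_vector"])
    parsed["content_vector"] = json.loads(row["content_vector"])
    parsed["vector_id"] = int(row["vector_id"])
    return parsed


rows = [parse_row(row) for row in csv.DictReader(sample_csv)]
print(rows[0]["title"], len(rows[0]["content_vector"]))
```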

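We can create an OpenSearch mapping for this data with the settings below. Only `content_vector` is mapped as a `knn_vector` field; `title_vector` is not part of the mapping.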

In [None]:
index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            },
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}
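
The `"space_type": "l2"` setting means that nearest neighbors are ranked by Euclidean distance. The sketch below shows that distance in plain Python on made-up 2-dimensional vectors. The score conversion `1 / (1 + d)`, with `d` the squared distance, is how we understand the k-NN plugin documentation to describe `l2` scoring; treat it as an assumption and check the documentation for your OpenSearch version. Both function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Assumed k-NN score for the l2 space: closer vectors score nearer to 1."""
    return 1.0 / (1.0 + squared_l2(a, b))


# Made-up vectors: identical vectors get the highest possible score
print(squared_l2([0.0, 0.0], [3.0, 4.0]))
print(l2_score([1.0, 2.0], [1.0, 2.0]))
```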

## Create an index in Aiven for OpenSearch

This is where we will store our data.

In [None]:
index_name = "openai_wikipedia_index"
client.indices.create(index=index_name, body={"settings": index_settings, "mappings":index_mapping})
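
Running the cell above a second time fails because the index already exists. A minimal sketch of an idempotent variant, assuming the `client`, `index_settings`, and `index_mapping` objects defined earlier, is shown below; `ensure_index` is a hypothetical helper name.

```python
def ensure_index(client, name, settings, mappings):
    """Create the index only if it does not exist yet.

    Returns True when the index was created, False when it was already there.
    """
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body={"settings": settings, "mappings": mappings})
    return True
```

In this notebook the call would be `ensure_index(client, index_name, index_settings, index_mapping)`. Note that an existing index keeps its original mapping, so delete it first if you changed the mapping.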

## Index data into OpenSearch

Now it's time to parse the CSV file and index the data into OpenSearch using the Bulk API. The following generator function yields one bulk action per row of the file:

In [None]:
import json
import csv

from opensearchpy import helpers
from rich.console import Console

csv_file = open("data/vector_database_wikipedia_articles_embedded.csv", encoding="utf-8", newline="")

def _load_data():
    # Yield one bulk action per CSV row
    wikipedia_articles = csv.DictReader(csv_file)

    for row in wikipedia_articles:
        yield {
            "_index": index_name,
            "_id": row["id"],
            "_source": {
                "url": row["url"],
                "title": row["title"],
                "text": row["text"],
                # The embedding is stored in the CSV as a JSON-encoded list
                "content_vector": json.loads(row["content_vector"]),
                "vector_id": row["vector_id"]
            }
        }

succeeded = []
failed = []

# Use a distinct class import so the instance does not shadow the rich module
console = Console()

with console.status("Indexing data...") as status:
    # raise_on_error=False collects failed items instead of raising on the first one
    for success, item in helpers.parallel_bulk(client, actions=_load_data(), raise_on_error=False):
        if success:
            succeeded.append(item)
        else:
            failed.append(item)

    if len(failed) > 0:
        console.print(f"There were {len(failed)} errors:", style="bold red")
        for item in failed:
            print(item["index"]["error"])

    if len(succeeded) > 0:
        console.print(f"Bulk-inserted {len(succeeded)} items (parallel_bulk).", style="bold green")

    csv_file.close()

## Verify that our index is populated in the Aiven Console

In the Aiven Console, select **Indexes** in the sidebar and verify that the index contains documents. There should be over 20,000 documents.

![OpenSearch Indexes in the Aiven Console](assets/opensearch-indexes.png)
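
The same check can be done from the notebook. The sketch below assumes the `client` and `index_name` objects defined earlier; `document_count` is a hypothetical helper name. It refreshes the index first so that recently indexed documents are visible to the count.

```python
def document_count(client, index):
    """Refresh the index and return the number of documents it contains."""
    client.indices.refresh(index=index)
    return client.count(index=index)["count"]
```

Calling `document_count(client, index_name)` after the bulk load should return a number that matches the one shown in the Aiven Console.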

Once all the documents are indexed, let's try a full-text `match` query to retrieve the documents whose text contains `Pizza`:

In [None]:
from pprint import pprint

res = client.search(index=index_name, body={
    "_source": {},
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})

pprint(res["hits"]["hits"][0])
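
The query above is a keyword match. A semantic search instead sends a query embedding to the `content_vector` field. The following is a minimal sketch of such a request body, not the notebook's own method: `build_knn_query` is a hypothetical helper, and it assumes you already have a 1536-dimensional query embedding produced by the same OpenAI model as the dataset (the embedding call itself is not shown here).

```python
def build_knn_query(query_vector, k=5, field="content_vector", dimension=1536):
    """Build an OpenSearch k-NN search body for a query embedding."""
    if len(query_vector) != dimension:
        raise ValueError(f"expected {dimension} dimensions, got {len(query_vector)}")
    return {
        "size": k,
        # Leave the large vector out of the returned documents
        "_source": {"excludes": [field]},
        "query": {"knn": {field: {"vector": list(query_vector), "k": k}}},
    }


# Usage (assumption: `embedding` is a list of 1536 floats for the search phrase):
# res = client.search(index=index_name, body=build_knn_query(embedding, k=3))
```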

[![](https://img.shields.io/badge/2-Create%20Search%20with%20LangChain-48a1db?style=for-the-badge)](2-search-with-langchain.ipynb)