# Cohere Embedding Models for Semantic Search in OCI OpenSearch
In this tutorial, we will walk through the steps to conduct a semantic search using the Cohere embedding model with OCI Search.

### Prerequisites
- You have a running instance of OCI Search.

To learn how to spin up an instance of OCI Search, see [Search and visualize data using OCI Search Service with OpenSearch](https://docs.oracle.com/en/learn/oci-opensearch/index.html#introduction).

In [None]:
!pip install --upgrade oci oracle-ads langchain opensearch-py

### Step 1: Load and Split your Documents Into Chunks
Suppose you want to build a search engine that lets users search through documentation stored as Markdown files. The file format does not actually matter, because LangChain provides document loaders for many formats. This tutorial uses a Markdown file as an example.

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

with open("your_markdown_file.md") as f:
    report = f.read()

headers_to_split_on = [
    ("#", "head 1"),
    ("##", "head 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(report)
texts = [text.page_content for text in md_header_splits]
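
To make the splitting step concrete, here is a minimal pure-Python sketch of what header-based splitting does. It is a simplified illustration, not LangChain's implementation: the function name split_markdown_by_headers and the sample text are made up for this example, and it ignores edge cases such as # characters inside code blocks.

```python
def split_markdown_by_headers(markdown, markers=("#", "##")):
    """Start a new chunk at every header whose marker is listed in `markers`."""
    # Check longer markers first so "##" is never mistaken for "#".
    ordered = sorted(markers, key=len, reverse=True)
    sections = [(None, [])]  # (header title, body lines); text before any header has no title
    for line in markdown.splitlines():
        stripped = line.strip()
        marker = next((m for m in ordered if stripped.startswith(m + " ")), None)
        if marker is None:
            sections[-1][1].append(line)
        else:
            sections.append((stripped[len(marker):].strip(), []))
    chunks = []
    for title, lines in sections:
        text = "\n".join(lines).strip()
        if text:  # skip sections that have no body text
            chunks.append({"header": title, "text": text})
    return chunks


# Made-up sample document, for illustration only.
sample = "# Install\nRun pip.\n## Docs\nBuild with make html.\n### Detail\nmore"
chunks = split_markdown_by_headers(sample)
for chunk in chunks:
    print(chunk["header"], "->", repr(chunk["text"]))
```

Headers deeper than the listed levels (here, ###) stay inside the chunk of their parent section, which keeps small subsections together with their context.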

### Step 2: Embed your Documents

You can use oracle-ads to access the Generative AI embedding models. The embedding models return embedding vectors of length 1024, which is the dimension value used when creating the index in Step 3.

In [None]:
import ads
from ads.llm import GenerativeAIEmbeddings
 
ads.set_auth(auth="resource_principal")
 
oci_embedings = GenerativeAIEmbeddings(
    compartment_id="ocid1.compartment.oc1.######",
    client_kwargs=dict(service_endpoint="generativeai_service_endpoint")
)
# One embedding vector per chunk, in the same order as texts
embeddings = oci_embedings.embed_documents(texts)
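
Before indexing, it can help to check that every chunk produced a vector of the expected length, because the index in Step 3 is created with a fixed dimension of 1024. The helper below is a hypothetical sketch, not part of oracle-ads. It uses made-up vectors so that it runs without calling the embedding service.

```python
def validate_embeddings(texts, embeddings, expected_dim=1024):
    """Raise ValueError if the embeddings do not line up with the texts."""
    if len(texts) != len(embeddings):
        raise ValueError(
            f"got {len(embeddings)} embeddings for {len(texts)} texts"
        )
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {position} has length {len(vector)}, expected {expected_dim}"
            )
    return len(embeddings)


# Placeholder vectors standing in for real model output.
count = validate_embeddings(["a", "b"], [[0.0] * 1024, [0.1] * 1024])
print(f"{count} embeddings validated")
```

In the notebook you would call validate_embeddings(texts, embeddings) right after the embedding step.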

### Step 3: Create an Index for your Documents

First, connect to your OCI Search cluster.

In [None]:
# Connect to the opensearch cluster.
from opensearchpy import OpenSearch
 
# Create a connection to your OpenSearch cluster
es = OpenSearch(
    ['https://####'],  # Replace with your OpenSearch endpoint URL
    http_auth=('username', 'password'),  # Replace with your credentials
    verify_certs=False,  # Set to True if you want to verify SSL certificates
    timeout=30
)
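
Rather than hardcoding the endpoint and credentials in the notebook, you might read them from environment variables. The sketch below only assembles the keyword arguments. The variable names OPENSEARCH_URL, OPENSEARCH_USER, OPENSEARCH_PASSWORD, and OPENSEARCH_VERIFY_CERTS are assumptions made for this example, not names that OCI or opensearch-py define.

```python
import os


def opensearch_settings(env=None):
    """Build OpenSearch client keyword arguments from environment-style settings."""
    env = os.environ if env is None else env
    required = ("OPENSEARCH_URL", "OPENSEARCH_USER", "OPENSEARCH_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return {
        "hosts": [env["OPENSEARCH_URL"]],
        "http_auth": (env["OPENSEARCH_USER"], env["OPENSEARCH_PASSWORD"]),
        # Certificates are verified unless explicitly switched off.
        "verify_certs": env.get("OPENSEARCH_VERIFY_CERTS", "true").lower() == "true",
        "timeout": 30,
    }


# Example with an explicit dictionary (placeholder values) instead of the real environment.
settings = opensearch_settings(
    {
        "OPENSEARCH_URL": "https://example.invalid:9200",
        "OPENSEARCH_USER": "username",
        "OPENSEARCH_PASSWORD": "password",
    }
)
print(settings["hosts"], settings["verify_certs"])
```

With the variables set in your session, the client could then be created with OpenSearch(**opensearch_settings()).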

Next, create a k-NN index by setting the index.knn parameter to true. This setting tells the k-NN plugin to generate native library indexes specifically tailored for k-NN searches.

Then add one or more fields of the knn_vector data type. This example creates an index with one knn_vector field and one text field. The knn_vector field uses the Lucene engine.

See the [documentation](https://opensearch.org/docs/2.7/search-plugins/knn/knn-index#method-definitions) for details on the parameter definitions.

In [None]:
INDEX_NAME = "arxiv-cosine-1"
VECTOR_1_NAME = "embedding_vector"
VECTOR_2_NAME = "text"
 
body = {
    "settings": {"index": {"knn": "true", "knn.algo_param.ef_search": 100}}, # Index setting: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index
    "mappings": { # Explicit mapping: https://opensearch.org/docs/2.11/field-types/index/#explicit-mapping
        "properties": {
            VECTOR_1_NAME: {
                "type": "knn_vector", # Supported field types: https://opensearch.org/docs/2.11/field-types/supported-field-types/index/
                "dimension": 1024,
                "method": { # Method definition: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index#method-definitions
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                    "parameters": {"ef_construction": 128, "m": 24},
                },
            },
            VECTOR_2_NAME: {
                "type": "text",
            },
        }
    },
}
response = es.indices.create(INDEX_NAME, body=body)
response
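
The mapping above sets space_type to cosinesimil, so nearest neighbors are ranked by cosine similarity, which compares the direction of two vectors and ignores their length. The following pure-Python sketch shows the calculation on tiny made-up vectors. It is only an illustration of the metric, not the code OpenSearch runs internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> -1.0
```

Because only direction matters, a long chunk and a short query can still score highly when they are about the same topic.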

### Step 4: Insert the Embedding Vectors for your Documents
Now populate the index with the embedding vectors computed from your documents by the Cohere embedding model.

In [None]:
INDEX_NAME = "arxiv-cosine-1"
i = 0  # number of documents indexed successfully
# Insert the documents into the index one at a time
for text, embed in zip(texts, embeddings):
    try:
        body = {
            VECTOR_1_NAME: embed,
            VECTOR_2_NAME: text,
        }
        response = es.index(index=INDEX_NAME, body=body)
    except Exception as e:
        # Report the failure and move on to the next document
        print(f"[ERROR]: {e}")
        continue
    i += 1
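
Indexing one document per request is simple but slow for large collections. A possible alternative is the bulk helper in opensearch-py, which takes a list of actions. The sketch below only builds that list so it can run without a cluster. build_bulk_actions is a hypothetical helper name, and the two-value vectors are toy placeholders (the real index expects 1024 values per vector).

```python
def build_bulk_actions(index_name, texts, embeddings,
                       vector_field="embedding_vector", text_field="text"):
    """Pair each text with its embedding as one bulk index action."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    return [
        {
            "_index": index_name,
            "_source": {vector_field: list(vector), text_field: text},
        }
        for text, vector in zip(texts, embeddings)
    ]


# Toy input, for illustration only.
actions = build_bulk_actions(
    "arxiv-cosine-1",
    ["first chunk", "second chunk"],
    [[0.1, 0.2], [0.3, 0.4]],
)
print(len(actions), "actions prepared")

# With a live cluster, the actions could be sent in one call:
#   from opensearchpy import helpers
#   helpers.bulk(es, actions)
```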


### Step 5: Search your Documents
When a new query comes in, first calculate its embedding vector, then use that vector in a k-NN search against the index.

In [None]:
query_vector = oci_embedings.embed_query("how to build the html documentation")
query = {
    "size": 2,
    "query": {"knn": {VECTOR_1_NAME: {"vector": query_vector, "k": 2}}},
}
 
response = es.search(body=query, index=INDEX_NAME)  # search the index populated in Step 4
response
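
The search response is a nested dictionary. The matching documents sit under hits.hits, each with a _score and the stored _source fields. The sketch below pulls out the score and a short text snippet for each hit. The sample response is made up to show the shape only (its IDs, scores, and text are not real results), and summarize_hits is a hypothetical helper name.

```python
def summarize_hits(response, text_field="text", max_chars=80):
    """Return the id, score, and a text snippet for each hit in a search response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        text = hit.get("_source", {}).get(text_field, "")
        results.append(
            {
                "id": hit.get("_id"),
                "score": hit.get("_score"),
                "snippet": text[:max_chars],
            }
        )
    return results


# Made-up response with the same structure as a real search result.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc-a", "_score": 0.9, "_source": {"text": "Build the docs with make html."}},
            {"_id": "doc-b", "_score": 0.7, "_source": {"text": "Install the package with pip."}},
        ]
    }
}
for row in summarize_hits(sample_response):
    print(row["score"], row["snippet"])
```

In the notebook you would pass the response object returned by es.search instead of sample_response.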

You can also run the same search from a terminal. First, write the query vector to a file.

In [None]:
import json

# Save the query vector as a JSON array so it can be placed directly
# into the "vector" field of the request body
with open('my_vector.txt', 'w') as file:
    json.dump(query_vector, file)

In [None]:
export vector_data=$(<my_vector.txt)
 
# Construct the curl command with the vector data
# (replace private_endpoint, username, and password with your own values; -k skips certificate verification)
curl -X GET "https://private_endpoint/arxiv-cosine-1/_search" -k -u username:password -H "Content-Type: application/json" -d '{
  "size": 2,
  "query": {
    "knn": {
      "embedding_vector": {
        "vector": '"$vector_data"',
        "k": 2
      }
    }
  }
}'
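
As an alternative to splicing a shell variable into the request body, you could build the whole body in Python and let curl read it from a file with -d @file. This is a sketch under assumptions: write_knn_query and knn_query.json are hypothetical names, and the short vector stands in for the 1024-value query_vector.

```python
import json


def write_knn_query(path, vector, field="embedding_vector", k=2):
    """Write a k-NN search request body to `path` and return it."""
    body = {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }
    with open(path, "w") as f:
        json.dump(body, f)
    return body


# Placeholder vector; in the notebook, pass query_vector instead.
body = write_knn_query("knn_query.json", [0.1, 0.2, 0.3])
print(json.dumps(body))

# The file can then be sent with curl, using the same placeholders as above:
#   curl -X GET "https://private_endpoint/arxiv-cosine-1/_search" -k \
#        -u username:password -H "Content-Type: application/json" -d @knn_query.json
```

Serializing with json.dump avoids quoting problems in the shell and guarantees that the body is valid JSON.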