# Cohere Embedding Models for Semantic Search in OCI OpenSearch
In this tutorial, we will walk through the steps to conduct a semantic search using the Cohere embedding model with OCI Search.

### Prerequisites
- You have a running instance of OCI Search.
- The OpenSearch version is at least 2.8.0.
- You need to install `langchain`, `opensearch-py`, and `oracle-ads`.

To learn how to spin up an instance of OCI Search, see [Search and visualize data using OCI Search Service with OpenSearch](https://docs.oracle.com/en/learn/oci-opensearch/index.html#introduction).

In [None]:
!pip install --upgrade oracle-ads langchain opensearch-py

### Step 1: Load and Split your Documents Into Chunks
Let's say you want to build a search engine that lets users search through documentation stored as markdown files. The file format of your documentation does not actually matter, because LangChain supports various types of document loaders. In this tutorial, we use a markdown file as an example.

In [1]:
import fsspec
from langchain.text_splitter import MarkdownHeaderTextSplitter

with fsspec.open(
    "https://raw.githubusercontent.com/oracle-samples/oci-data-science-ai-samples/155f76109d24860ceeb72ed6b742ced33a46ce22/README.md",
    "r"
) as f:
    report = f.read()
    
    
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(report)
texts = [text.page_content for text in md_header_splits]

In [2]:
print(f"Number of documents: {len(texts)}")
print(f"First document:\n{texts[0]}")

Number of documents: 15
First document:
The Oracle Cloud Infrastructure (OCI) Data Science service has created this repo to make demos, tutorials, and code examples that highlight various features of the [OCI Data Science service](https://www.oracle.com/data-science/cloud-infrastructure-data-science.html) and AI services. We welcome your feedback and would like to know what content is useful and what content is missing. Open an [issue](https://github.com/oracle/oci-data-science-ai-samples/issues) to do this. We know that a lot of you are creating great content and we would like to help you share it. See the [contributions](CONTRIBUTING.md) document.  
Oracle Cloud Infrastructure (OCI) Data Science Services provide a powerful suite of tools for data scientists, enabling faster machine learning model development and deployment. With features like the Accelerated Data Science (ADS) SDK, distributed training, batch processing and machine learning pipelines, OCI Data Science Services offer 
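
To make the splitting step more concrete, here is a minimal pure-Python sketch of header-based chunking. It is not LangChain's implementation: `split_on_headers` is a hypothetical helper that only approximates the idea of starting a new chunk at each `#`, `##`, or `###` heading, and it does not treat fenced code blocks specially.

```python
import re

# Hypothetical helper for illustration only; not part of LangChain.
# Matches markdown headers of level 1 to 3, e.g. "## Title".
HEADER_PATTERN = re.compile(r"^#{1,3}\s+(.*)$")


def _append_chunk(chunks, header, lines):
    """Store the collected lines as one chunk, skipping empty chunks."""
    content = "\n".join(lines).strip()
    if content:
        chunks.append({"header": header, "content": content})


def split_on_headers(markdown_text):
    """Split markdown text into chunks that each start at a level 1-3 header."""
    chunks = []
    header = None  # text before the first header has no header
    lines = []
    for line in markdown_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            _append_chunk(chunks, header, lines)
            header = match.group(1).strip()
            lines = []
        else:
            lines.append(line)
    _append_chunk(chunks, header, lines)
    return chunks
```

Calling `split_on_headers(report)` on the README loaded above would give a rough idea of where the chunk boundaries fall, although the exact chunks can differ from what `MarkdownHeaderTextSplitter` produces.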

### Step 2: Embed your Documents

You can use oracle-ads to access the Generative AI embedding models. The embedding model used here returns embedding vectors of length 1024. oracle-ads is an open source library that speeds up common data science activities by providing tools that automate and simplify common tasks. It also gives data scientists a friendly Pythonic interface to OCI services. See the [oracle-ads GitHub repository](https://github.com/oracle/accelerated-data-science) for more information.

In [3]:
import ads
from ads.llm import GenerativeAIEmbeddings
 
oci_embedings = GenerativeAIEmbeddings(
    compartment_id="ocid1.compartment.oc1.######",
    client_kwargs=dict(service_endpoint="https://generativeai.aiservice.us-chicago-1.oci.oraclecloud.com") # this can be omitted after Generative AI service is GA.
)
embeddings = oci_embedings.embed_documents(texts=texts)



In [4]:
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimensions: {len(embeddings[0])}")

Number of embeddings: 15
Embedding dimensions: 1024
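
The index created in Step 3 compares vectors with the `cosinesimil` space type. As a sanity check on the embeddings, the following sketch computes cosine similarity in plain Python. `cosine_similarity` is a hypothetical helper written for this illustration; it is not part of oracle-ads or OpenSearch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(embeddings[0], embeddings[1])` returns a value between -1 and 1, where values closer to 1 suggest that the two chunks are semantically closer.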


### Step 3: Create an Index for your Documents

First, connect to your OCI Search cluster.

In [6]:
from opensearchpy import OpenSearch
 
# Create a connection to your OpenSearch cluster
es = OpenSearch(
    ['https://####'],  # Replace with your OpenSearch endpoint URL
    http_auth=('username', 'password'),  # Replace with your credentials
    verify_certs=False,  # Set to True if you want to verify SSL certificates
    timeout=30
)
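
Because the prerequisites call for OpenSearch 2.8.0 or later, it can help to check the cluster version right after connecting. The sketch below assumes the usual shape of the `es.info()` response, with the version string under `version.number`. `parse_version` and `meets_minimum` are hypothetical helpers, not part of opensearch-py.

```python
import re

MIN_VERSION = (2, 8, 0)  # minimum version stated in the prerequisites


def parse_version(version):
    """Turn a string such as '2.11.0' into a tuple of three ints."""
    numbers = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '2.8'
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(info, minimum=MIN_VERSION):
    """Check the 'version.number' field of a cluster info response."""
    return parse_version(info["version"]["number"]) >= minimum
```

With the client above, `meets_minimum(es.info())` should return `True` on a suitable cluster. Comparing tuples of integers avoids the pitfall of comparing version strings, where `"2.11.0"` sorts before `"2.8.0"`.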

Next, create a k-NN index by setting the `index.knn` parameter to `true`. This setting tells the plugin to generate native library indexes specifically tailored for k-NN searches.

Then add one or more fields of the `knn_vector` data type. This example creates an index with one `knn_vector` field and one `text` field. The `knn_vector` field uses the Lucene engine.

See the [documentation](https://opensearch.org/docs/2.7/search-plugins/knn/knn-index#method-definitions) for more details on the parameter definitions.

**Note**: The Lucene engine supports vectors of up to 1,024 dimensions.

In [7]:
INDEX_NAME = "cosine-similarity"
VECTOR_1_NAME = "embedding_vector"
VECTOR_2_NAME = "text"
 
body = {
    # Index setting: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index
    "settings": {"index": {"knn": "true", "knn.algo_param.ef_search": 100}},
    # Explicit mapping: https://opensearch.org/docs/2.11/field-types/index/#explicit-mapping
    "mappings": { 
        "properties": {
            VECTOR_1_NAME: {
                # Supported field types: https://opensearch.org/docs/2.11/field-types/supported-field-types/index/
                "type": "knn_vector", 
                "dimension": 1024,
                # Method definition: https://opensearch.org/docs/2.11/search-plugins/knn/knn-index#method-definitions
                "method": { 
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                    "parameters": {"ef_construction": 128, "m": 24},
                },
            },
            VECTOR_2_NAME: {
                 "type": "text"
               },
        }
    },
}
response = es.indices.create(INDEX_NAME, body=body)
response
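
Rerunning the cell above against an index that already exists typically fails, so a small guard can make the notebook easier to rerun. This is a sketch under the assumption that `client` behaves like the `OpenSearch` client created earlier. `create_index_if_missing` is a hypothetical helper name.

```python
def create_index_if_missing(client, index_name, body):
    """Create the index only when it does not exist yet.

    `client` is assumed to expose `indices.exists` and `indices.create`
    like the opensearch-py client. Returns the creation response, or
    None when the index was already present.
    """
    if client.indices.exists(index=index_name):
        return None
    return client.indices.create(index=index_name, body=body)
```

In this tutorial the call would be `create_index_if_missing(es, INDEX_NAME, body)`. Note that the guard does not verify that an existing index has the same mapping.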

### Step 4: Insert the Embedding Vectors for your Documents
Now let's populate the index using the embedding vectors calculated from your documents using Cohere Embedding Models. 

In [8]:
i = 0  # number of documents indexed successfully
# Insert the documents into the index one at a time
for text, embed in zip(texts, embeddings):
 
    try:
         
        body = {
            VECTOR_1_NAME: embed,
            VECTOR_2_NAME: text,
        }
        response = es.index(index=INDEX_NAME, body=body)
    except Exception as e:
        print(f"[ERROR]: {e}")
        continue
    i += 1
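
Inserting documents one at a time is fine for 15 chunks, but for larger collections the bulk helper in opensearch-py is usually a better fit. The sketch below only builds the list of actions. `build_bulk_actions` is a hypothetical helper, and the commented call at the end is an assumption about how you would use it with the `es` client from Step 3.

```python
def build_bulk_actions(index_name, texts, embeddings, vector_field, text_field):
    """Return one bulk 'index' action per (text, embedding) pair."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    return [
        {
            "_index": index_name,
            "_source": {vector_field: embedding, text_field: text},
        }
        for text, embedding in zip(texts, embeddings)
    ]


# Assumed usage with the objects defined in this tutorial:
# from opensearchpy import helpers
# actions = build_bulk_actions(INDEX_NAME, texts, embeddings, VECTOR_1_NAME, VECTOR_2_NAME)
# success_count, errors = helpers.bulk(es, actions)
```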


When a new query comes in, first calculate its embedding vector and then conduct a semantic search.

- `k`: the number of neighbors the search will return.
- `size`: (required) how many results the query actually returns. The plugin returns `k` results for each shard (and each segment) and `size` results for the entire query. The plugin supports a maximum `k` value of 10,000.
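
The interaction between `k` and `size` can be illustrated with a toy simulation. This is a simplified model, not the plugin's actual code: it ignores segments, and `simulate_knn_search` is a hypothetical function that works on made-up scores.

```python
import heapq


def simulate_knn_search(shard_scores, k, size):
    """Toy model of how `k` and `size` interact.

    `shard_scores` maps a shard name to a list of (doc_id, score) pairs.
    Each shard contributes its top-k hits; the merged list is cut to `size`.
    """
    candidates = []
    for hits in shard_scores.values():
        candidates.extend(heapq.nlargest(k, hits, key=lambda hit: hit[1]))
    return heapq.nlargest(size, candidates, key=lambda hit: hit[1])
```

In this model, two shards with `k=2` produce at most four candidates, so asking for `size=3` returns the three best of those four, while `size=10` still returns only four hits.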

In [9]:
query_vector = oci_embedings.embed_query(text="oci job")
query = {
    "size": 2,
    "query": {"knn": {VECTOR_1_NAME: {"vector": query_vector, "k": 3}}},
}
 
response = es.search(body=query, index=INDEX_NAME)  # run the k-NN query against the index
print(response["hits"]["hits"][0]['_source']['text'])

Oracle Cloud Infrastructure (OCI) [Data Science Jobs](https://docs.oracle.com/en-us/iaas/data-science/using/jobs-about.htm) is a powerful tool that allows you to define and run repeatable machine learning tasks on a fully managed infrastructure. With Jobs, you have the flexibility to apply custom tasks to meet your specific use cases, such as data preparation, model training, hyperparameter optimization, batch inference, large model training and more.  
On-demand jobs and batch processing are especially important for businesses that need to process large volumes of data on a regular basis, as they enable companies to automate data processing workflows, reduce the need for manual intervention, and save costs associated with running compute resources for extended periods of time. With the ability to define and schedule jobs to run at specific times, businesses can automate their data processing workflows and reduce the need for manual intervention. This helps to improve efficiency, reduc
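
The cell above prints only the first hit and would raise an `IndexError` if the search returned nothing. A small helper can list every hit with its score instead. This sketch assumes the standard search response layout (`hits.hits[]._score` and `hits.hits[]._source`). `summarize_hits` is a hypothetical name.

```python
def summarize_hits(response, text_field="text", max_chars=80):
    """Return (score, snippet) pairs for every hit in a search response."""
    summary = []
    for hit in response["hits"]["hits"]:
        text = hit["_source"].get(text_field, "")
        # Collapse whitespace so the snippet fits on one line.
        snippet = " ".join(text.split())[:max_chars]
        summary.append((hit["_score"], snippet))
    return summary
```

For example, `for score, snippet in summarize_hits(response): print(score, snippet)` prints one line per returned document and prints nothing when there are no hits.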

#### Comparing Semantic Search with Keyword Search

Here is the result of a keyword search. Because keyword matching does not understand the meaning of "oci job", it returns an irrelevant result.

In [10]:
query = {
  "query": {
    "match": {
      "text": {
        "query": "oci job",
        "analyzer": "standard"
      }
    }
  }
}
 
response = es.search(body=query, index=INDEX_NAME)  # the same as before
print(response["hits"]["hits"][0]['_source']['text'])

Check out the following resources for more information about the OCI Data Science and AI services:  
* [ADS class documentation](https://accelerated-data-science.readthedocs.io/en/latest/modules.html)
* [ADS user guide](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
* [AI & Data Science blog](https://blogs.oracle.com/ai-and-datascience/)
* [OCI Data Science service guide](https://docs.oracle.com/en-us/iaas/data-science/using/data-science.htm)
* [OCI Data Science service release notes](https://docs.cloud.oracle.com/en-us/iaas/releasenotes/services/data-science/)
* [YouTube playlist](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
* [OCI Data Labeling Service guide](https://docs.oracle.com/en-us/iaas/data-labeling/data-labeling/using/home.htm)
* [OCI DLS DP API](https://docs.oracle.com/en-us/iaas/api/#/en/datalabeling-dp/20211001/)
* [OCI DLS CP API](https://docs.oracle.com/en-us/iaas/api/#/en/datalabeling/20211001/)
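
One possible explanation for the keyword result above is that the `standard` analyzer matches literal tokens and does not stem, so the query token `job` does not match `Jobs`, while `oci` appears in many chunks. The sketch below is a rough approximation of that behavior and not the real analyzer, which uses Unicode text segmentation. `toy_standard_tokens` and `matching_terms` are hypothetical helpers.

```python
import re


def toy_standard_tokens(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def matching_terms(query, document):
    """Return the query tokens that literally appear in the document."""
    doc_tokens = set(toy_standard_tokens(document))
    return [token for token in toy_standard_tokens(query) if token in doc_tokens]
```

Under this approximation, `matching_terms("oci job", "OCI Data Science Jobs")` finds only `oci`, which is the same overlap a generic sentence mentioning OCI would have.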
