## About this example
This sample is for developers and data scientists who want to build an index from their own data and use it in the RAG (retrieval-augmented generation) pattern.

This sample shows how to:
- create an index locally or on the cloud with Azure AI resources
- register a local index to the cloud
- retrieve an index from the cloud
- consume an index in LangChain

## Before you begin

### Parameters

In [None]:
# project details
subscription_id: str = "<your-subscription-id>"
resource_group_name: str = "<your-resource-group>"
project_name: str = "<your-project-name>"

# connection details
ai_search_connection_name: str = "<your-ai-search-connection>"
aoai_connection_name: str = "<your-aoai-connection>"
# serverless_connection_name: str = "<your-serverless-connection>"

# names of indexes we will create
local_index_name = "local-index"
cloud_index_name = "cloud-index"

# model used for embedding
embedding_model_aoai: str = "text-embedding-ada-002"
deployment_name_aoai: str = "text-embedding-ada-002"
embedding_model_cohere: str = "cohere-embed-v3-multilingual" # or "cohere-embed-v3-english"

### Connect to your project

To start, create a config file with your project details. This file can be used in this sample or in other samples to connect to your workspace. You can find the required details on the Project Overview page in AI Studio.

In [None]:
import json
from pathlib import Path

config = {
    "subscription_id": subscription_id,
    "resource_group": resource_group_name,
    "project_name": project_name,
}

p = Path("config.json")

with p.open(mode="w") as file:
    file.write(json.dumps(config))
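
Before connecting, it can help to confirm that the config file no longer contains the `<your-...>` placeholders from the Parameters cell. The following is a minimal sketch, not part of the SDK: `check_project_config` and `REQUIRED_KEYS` are hypothetical names, and the key list simply mirrors the keys written by the cell above.

```python
import json
from pathlib import Path

# Keys written by the cell above (assumption: adjust if your config differs).
REQUIRED_KEYS = ("subscription_id", "resource_group", "project_name")


def check_project_config(path="config.json"):
    """Return the parsed config, raising ValueError on missing or placeholder values."""
    config = json.loads(Path(path).read_text())
    problems = []
    for key in REQUIRED_KEYS:
        value = str(config.get(key, ""))
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value.startswith("<") and value.endswith(">"):
            # Values such as "<your-subscription-id>" were never filled in.
            problems.append(f"{key} still holds the placeholder {value}")
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

Calling `check_project_config("config.json")` right after writing the file surfaces an unfilled value as a readable error instead of a failure later during authentication.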

Initialize `MLClient` to interact with the resources in your Azure AI Studio project.

Please make sure you have connections for your embedding model and Azure AI Search in this workspace. 

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

client=MLClient.from_config(DefaultAzureCredential(), path="./config.json")

### Retrieve connections to embedding model and AI Search
We will use an Azure OpenAI service to access the LLM and the embedding model, and an Azure AI Search (formerly Azure Cognitive Search) service to store the index. Retrieve the details of both connections from your project.

In [None]:
aoai_connection = client.connections.get(aoai_connection_name)
ai_search_connection = client.connections.get(ai_search_connection_name)
# serverless_connection = client.connections.get(serverless_connection_name)

print(f"aoai connection id is {aoai_connection.id}")
print(f"ai search connection id is {ai_search_connection.id}")

### 1. Build an Index
You can build an index locally or on the cloud with Azure AI resources.

#### 1.1 Build index locally

##### 1.1.1 Input types
You can build an index from:
1. local files, or
2. an existing AI Search index

In [None]:
from promptflow.rag.resources import AzureAISearchSource, LocalSource
# local files
local_input_source_local = LocalSource(input_data="product-info/")

# existing ai search index 
# keys might be different, please refer to your MLIndex
local_input_source_ai_search = AzureAISearchSource(ai_search_index_name="<index-name>",
                                   ai_search_content_key="content",
                                   ai_search_embedding_key="contentVector",
                                   ai_search_title_key="title",
                                   ai_search_metadata_key="meta_json_string",
                                   ai_search_connection_id=ai_search_connection.id
                                 )

##### 1.1.3 With AOAI embedding model
To connect to your AOAI embedding model, you can either set your API key and endpoint in environment variables, or pass in the connection id if you have a connection to the model deployment in your workspace.

In [None]:
from promptflow.rag.resources import LocalSource, AzureAISearchConfig, EmbeddingsModelConfig
from promptflow.rag import build_index

ai_search_index_path=build_index(
    name=local_index_name + "aoai",  # name of your index
    embeddings_model_config=EmbeddingsModelConfig(
        model_name=embedding_model_aoai,
        deployment_name=deployment_name_aoai,
        connection_id=aoai_connection.id
    ),
    input_source=LocalSource(input_data="product-info/"),  # the location of your file/folders
    index_config=AzureAISearchConfig(
        ai_search_index_name=local_index_name + "-aoai-store" # the name of the index store inside the azure ai search service
    ),
    tokens_per_chunk = 800, # Optional field - Maximum number of tokens per chunk
    token_overlap_across_chunks = 0, # Optional field - Number of tokens to overlap between chunks
)
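
The two optional parameters above control how documents are split before embedding. As an illustration only (this is not the splitter that `build_index` uses, and whitespace-separated words stand in for real model tokens), the sketch below shows how a chunk size and an overlap interact in a sliding window. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, tokens_per_chunk=800, token_overlap_across_chunks=0):
    """Split a token list into windows of at most tokens_per_chunk tokens.

    Consecutive windows share token_overlap_across_chunks tokens.
    """
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    if not 0 <= token_overlap_across_chunks < tokens_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than tokens_per_chunk")

    # Each new window starts this many tokens after the previous one.
    step = tokens_per_chunk - token_overlap_across_chunks
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, tokens_per_chunk=4, token_overlap_across_chunks=1):
    print(chunk)
```

With an overlap of 0 the chunks are disjoint; a larger overlap repeats context at chunk boundaries at the cost of producing more chunks to embed.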

##### 1.1.4 With Cohere embedding model
To use your Cohere embedding model, specify the connection id (or `connection_config`) of the model deployment you want to use.

In [None]:
from promptflow.rag.resources import LocalSource, AzureAISearchConfig, EmbeddingsModelConfig, ConnectionConfig
from promptflow.rag import build_index

ai_search_index_path=build_index(
    name=local_index_name + "cohere",  # name of your index
    embeddings_model_config=EmbeddingsModelConfig(
        model_name=embedding_model_cohere,
        connection_id=serverless_connection.id  # requires the serverless_connection lookup (commented out earlier) to be enabled
        # connection_config=ConnectionConfig(
        #     subscription = "<subscription>",
        #     resource_group = "<resource-group>",
        #     workspace = "<workspace>",
        #     connection_name = "<connection-name>"
        # )
    ),
    input_source=LocalSource(input_data="product-info/"),  # the location of your file/folders
    index_config=AzureAISearchConfig(
        ai_search_index_name=local_index_name + "cohere-store" # the name of the index store inside the azure ai search service
    ),
    tokens_per_chunk = 800, # Optional field - Maximum number of tokens per chunk
    token_overlap_across_chunks = 0, # Optional field - Number of tokens to overlap between chunks
)

##### 1.1.5 Register the index
Register the index so that it shows up in the AI Studio Project.

In [None]:

from azure.ai.ml.entities import Index
client.indexes.create_or_update(Index(name=local_index_name, path=ai_search_index_path, version="1", stage="Development"))

#### 1.2 Build index on cloud

##### 1.2.1 Input types
You can build an index from the following four types of input:
1. Local files/folders
2. GitHub repo
3. Azure Storage
4. Existing AI Search index

Examples of various data sources:


In [None]:
## Input sources
from azure.ai.ml.entities._indexes import LocalSource, AISearchSource, GitSource

# Local source
input_source_local = LocalSource(input_data="product-info/")

# Github repo
input_source_git = GitSource(git_url="https://github.com/rust-lang/book.git", git_branch_name="main", git_connection_id="")

# Azure Storage
input_source_subscription = "<subscription>"
input_source_resource_group = "<resource_group>"
input_source_workspace = "<workspace>"
input_source_datastore = "<datastore_name>"
input_source_path = "path"
input_source_urls=f"azureml://subscriptions/{input_source_subscription}/resourcegroups/{input_source_resource_group}/workspaces/{input_source_workspace}/datastores/{input_source_datastore}/paths/{input_source_path}"

# Existing AI Search index
input_source_ai_search = AISearchSource(ai_search_index_name="remote_index",
                                        ai_search_index_content_key="content",
                                        ai_search_index_embedding_key="contentVector",
                                        ai_search_index_title_key="title",
                                        ai_search_index_metadata_key="meta_json_string",
                                        ai_search_index_connection_id=ai_search_connection.id
                                        )
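
The Azure Storage input in the cell above is addressed by an `azureml://` datastore URI assembled from five parts. The sketch below wraps that same pattern in two small helpers so the parts can be built and checked separately. `build_datastore_uri` and `parse_datastore_uri` are hypothetical names introduced here for illustration, not SDK functions.

```python
import re

# Mirrors the URI layout used in the f-string of the cell above.
_DATASTORE_URI = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<path>.*)$"
)


def build_datastore_uri(subscription, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI from its components."""
    return (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"  # avoid a double slash after /paths
    )


def parse_datastore_uri(uri):
    """Split an azureml:// datastore URI back into its components."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not an azureml datastore URI: {uri}")
    return match.groupdict()
```

Parsing a URI before submitting a job is a cheap way to catch a misspelled segment such as `resourcegroup` instead of `resourcegroups`.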

##### 1.2.2 Connections

The following connection types are supported:
1. Entra ID connections
2. API key-based connections
3. Connections to serverless models (Cohere)

Please make sure you have deployments for the embedding model in this workspace.

In [None]:
from azure.ai.ml.entities._indexes import ModelConfiguration
## aoai and acs connections
aoai_connection = client.connections.get("<aoai_connection>", populate_secrets=True)
ai_search_connection = client.connections.get("<search_connection>")
embeddings_model_config = ModelConfiguration.from_connection(aoai_connection, 
                                                             model_name="text-embedding-ada-002",
                                                             deployment_name="text-embedding-ada-002")
# workaround for connections.get() not returning api_key
# os.environ["AZURE_OPENAI_API_KEY"] = "<aoai_api_key>"

## aoai and acs connections - entra id
## TODO: You will hit embedding error with aoai entra-id connection, fix in progress
# aoai_connection = client.connections.get("<aoai_entra_id_connection_name>")
# ai_search_connection = client.connections.get("<search_entra_id_connection_name>")
# embeddings_model_config = ModelConfiguration.from_connection(aoai_connection, 
#                                                              model_name="text-embedding-ada-002",
#                                                              deployment_name="text-embedding-ada-002")

## cohere embedding model
# embeddings_model_config = ModelConfiguration.from_connection(serverless_connection)

##### 1.2.3 Build the index

You can change `input_source` to any of the input sources listed above. `input_source_credential` is needed only for Azure Storage input.

In [None]:
from azure.ai.ml.entities._credentials import UserIdentityConfiguration
from azure.ai.ml.entities._indexes import AzureAISearchConfig

client.indexes.build_index(
    name=cloud_index_name, # name of your index
    embeddings_model_config=embeddings_model_config,
    input_source=input_source_local, 
    # input_source_credential=UserIdentityConfiguration(), # user specified identity used to access the data.
    index_config=AzureAISearchConfig(
        ai_search_index_name=cloud_index_name,  # the name of the index store in AI search service
        ai_search_connection_id=ai_search_connection.id, # AI Search connection details
    ),
    tokens_per_chunk = 800, # Optional field - Maximum number of tokens per chunk
    token_overlap_across_chunks = 0, # Optional field - Number of tokens to overlap between chunks
)

### 2. Retrieve index from the cloud
Get the index object once the build job has finished.

In [None]:
ml_index=client.indexes.get(name=cloud_index_name, label="latest")

### 3. Consume the index as a langchain retriever

Known issue: this is currently broken for indexes built on the cloud.

In [None]:
# retriever = ml_index.as_langchain_retriever()
# retriever.get_relevant_documents("which tent is the most waterproof?")

Workaround: specify the storage URI of the MLIndex file and consume it directly. The path below is an example; replace it with the URI of your own index.

In [None]:
path = "azureml://subscriptions/f375b912-331c-4fc5-8e9f-2d7205e3e036/resourcegroups/rg-jingyizhuai/workspaces/jingyizhu-project-2/datastores/workspaceblobstore/paths/indexes/remote-local-02/3a76509a-600b-4c65-a593-1b0944fa68ff/"
from azureml.rag.mlindex import MLIndex as InternalMLIndex
retriever = InternalMLIndex(str(path)).as_langchain_retriever()
retriever.get_relevant_documents("which tent is the most waterproof?")
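
Conceptually, a retriever embeds the query and returns the stored chunks whose embeddings are closest to it. The toy sketch below illustrates that idea with cosine similarity over hand-written two-dimensional vectors. It is an assumption-laden illustration only: Azure AI Search and the LangChain retriever may rank results differently, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, documents, k=2):
    """Return the texts of the k documents most similar to the query.

    documents is a list of (text, embedding_vector) pairs.
    """
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings purely for demonstration.
docs = [
    ("tent", [1.0, 0.0]),
    ("stove", [0.0, 1.0]),
    ("tarp", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], docs, k=2))
```

In the real pipeline the vectors come from the embedding model configured at build time, which is why the same model must be available when the index is queried.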

