## About this example
This sample shows how to create an index on the cloud with Azure AI resources.

This sample is useful for developers and data scientists who want to build an index from their own data for use in the retrieval-augmented generation (RAG) pattern.

### Parameters





In [None]:
subscription_id = "<subscription_id>"
resource_group = "<resource_group>"
workspace = "<workspace>"

index_name = "<your_index_name>"

# Change the model name and deployment name if yours are different.
# These are not needed if you are using serverless models.
embedding_model_name = "text-embedding-ada-002"
embedding_deployment_name = "text-embedding-ada-002"
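
Every value in the cell above that is wrapped in angle brackets is a template value that has to be replaced before the later cells can succeed. As a minimal sketch, a check like the following could flag values that were left unedited. `find_unset_parameters` is a hypothetical helper written for this illustration, not part of the Azure SDK, and the example values are made up.

```python
import re

# Hypothetical helper (not part of the Azure SDK): matches a value that is
# still an unedited "<...>" template placeholder.
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def find_unset_parameters(params: dict) -> list:
    """Return the names of parameters whose value is still a <placeholder>."""
    return [name for name, value in params.items() if _PLACEHOLDER.match(value)]


example = {
    "subscription_id": "<subscription_id>",   # still a placeholder
    "resource_group": "my-resource-group",    # made-up value for illustration
    "index_name": "<your_index_name>",        # still a placeholder
}
unset = find_unset_parameters(example)
print("Parameters still to fill in:", unset)
```

In the notebook itself, the same function could be called with the real variables from the cell above.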

### MLClient

Initialize `MLClient` to interact with resources in Azure AI Studio.
Make sure this workspace has connections for your embedding model and Azure AI Search.

In [None]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

client = MLClient(
    DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace,
)

# Alternatively, load the workspace details from a config file:
# client = MLClient.from_config(DefaultAzureCredential(), path="./config.json")
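
The commented-out `MLClient.from_config` call reads the workspace details from a `config.json` file rather than from the notebook variables. The sketch below writes such a file with the standard library only. The key names (`subscription_id`, `resource_group`, `workspace_name`) are assumed from the usual Azure Machine Learning workspace config layout, and `write_workspace_config` is a hypothetical helper, so check the file against one downloaded from your own workspace.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json file and return its path.

    The key names are an assumption based on the usual Azure ML workspace
    config layout; this helper is not part of the SDK.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Demonstrate with a temporary directory and the notebook's placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp, "<subscription_id>", "<resource_group>", "<workspace>"
    )
    loaded = json.loads(config_path.read_text(encoding="utf-8"))

print(loaded)
```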

### Input types

You can build an index from the following four types of input:
1. Local files or folders
2. GitHub repository
3. Azure Storage
4. Existing Azure AI Search index

Examples of various data sources:

In [None]:
## Input sources
from azure.ai.ml.entities._indexes import LocalSource, AISearchSource, GitSource

# Local source
input_source_local = LocalSource(input_data="product-info/")

# GitHub repo
input_source_git = GitSource(git_url="https://github.com/rust-lang/book.git", git_branch_name="main", git_connection_id="")

# Azure Storage: an azureml:// URI that points to a path in a workspace datastore
input_source_subscription = "<subscription>"
input_source_resource_group = "<resource_group>"
input_source_workspace = "<workspace>"
input_source_datastore = "<datastore_name>"
input_source_path = "path"
input_source_urls = f"azureml://subscriptions/{input_source_subscription}/resourcegroups/{input_source_resource_group}/workspaces/{input_source_workspace}/datastores/{input_source_datastore}/paths/{input_source_path}"

# Existing AI Search index
# Note: ai_search_connection is defined in the Connections cell further down;
# run that cell first if you use this source.
input_source_ai_search = AISearchSource(
    ai_search_index_name="remote_index",
    ai_search_index_content_key="content",
    ai_search_index_embedding_key="contentVector",
    ai_search_index_title_key="title",
    ai_search_index_metadata_key="meta_json_string",
    ai_search_index_connection_id=ai_search_connection.id,
)
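
The Azure Storage input above is a plain string in the `azureml://` datastore URI layout. As a small sketch, the same string can be assembled by a function so that each segment is named. `build_datastore_uri` is a hypothetical helper introduced here for illustration; it only reproduces the f-string from the cell above and additionally strips leading slashes from the path.

```python
def build_datastore_uri(subscription, resource_group, workspace, datastore, path):
    """Assemble an azureml:// datastore URI in the layout used by this sample."""
    return (
        f"azureml://subscriptions/{subscription}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"  # avoid a double slash before the path
    )


uri = build_datastore_uri(
    "<subscription>", "<resource_group>", "<workspace>", "<datastore_name>", "/path"
)
print(uri)
```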

### Connections

The following connection types are supported:
1. Microsoft Entra ID connections
2. API key-based connections
3. Connections to serverless models (Cohere)

Make sure this workspace has a deployment for the embedding model.

In [None]:
from azure.ai.ml.entities._indexes import ModelConfiguration

# Run only one of the sections below; each one assigns the same variables.

# Entra ID: Azure OpenAI and AI Search connections
aoai_connection = client.connections.get("<aoai_entra_id_connection>")
ai_search_connection = client.connections.get("<search_entra_id_connection>")
embeddings_model_config = ModelConfiguration.from_connection(
    aoai_connection,
    model_name=embedding_model_name,
    deployment_name=embedding_deployment_name,
)

# API key: Azure OpenAI and AI Search connections
aoai_connection = client.connections.get("<aoai_connection>", populate_secrets=True)
ai_search_connection = client.connections.get("<search_connection>")
# Workaround if connections.get() does not return the API key:
# import os
# os.environ["AZURE_OPENAI_API_KEY"] = "<aoai_api_key>"
embeddings_model_config = ModelConfiguration.from_connection(
    aoai_connection,
    model_name=embedding_model_name,
    deployment_name=embedding_deployment_name,
)

# Cohere: serverless model connection (no model or deployment name needed)
cohere_serverless_connection = client.connections.get("<cohere_serverless>")
ai_search_connection = client.connections.get("<search_connection>")
embeddings_model_config = ModelConfiguration.from_connection(cohere_serverless_connection)

### Build index on cloud

You can change `input_source` to any of the input sources listed above. `input_source_credential` is required for Azure Storage input.

In [None]:
from azure.ai.ml.entities._credentials import UserIdentityConfiguration
from azure.ai.ml.entities._indexes import AzureAISearchConfig

client.indexes.build_index(
    name=index_name, # name of your index
    embeddings_model_config=embeddings_model_config,
    input_source=input_source_local, 
    # input_source_credential=UserIdentityConfiguration(), # user specified identity used to access the data.
    index_config=AzureAISearchConfig(
        ai_search_index_name=index_name,  # the name of the index store in AI search service
        ai_search_connection_id=ai_search_connection.id, # AI Search connection details
    ),
    tokens_per_chunk=800,  # Optional: maximum number of tokens per chunk
    token_overlap_across_chunks=0,  # Optional: number of tokens to overlap between chunks
)
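
To illustrate how `tokens_per_chunk` and `token_overlap_across_chunks` interact, here is a minimal sliding-window sketch. It is not the service's implementation: the real job uses its own tokenizer and splitting rules, while this example assumes the input is already a list of tokens. `chunk_tokens` is a hypothetical function written for this illustration.

```python
def chunk_tokens(tokens, tokens_per_chunk, token_overlap):
    """Split a token list into windows of at most tokens_per_chunk tokens,
    where consecutive windows share token_overlap tokens."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    if not 0 <= token_overlap < tokens_per_chunk:
        raise ValueError("token_overlap must be in [0, tokens_per_chunk)")

    step = tokens_per_chunk - token_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = list(range(10))  # stand-in for ten tokens
no_overlap = chunk_tokens(tokens, tokens_per_chunk=4, token_overlap=0)
with_overlap = chunk_tokens(tokens, tokens_per_chunk=4, token_overlap=2)
print(no_overlap)
print(with_overlap)
```

With no overlap, each token appears in exactly one chunk. A non-zero overlap repeats tokens at chunk boundaries, which produces more chunks for the same input.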

Get the index object once the job has finished.

In [None]:
ml_index = client.indexes.get(name=index_name, label="latest")

Consume the index as a LangChain retriever.

Known issue: this is currently broken.

In [None]:
# retriever = ml_index.as_langchain_retriever()
# retriever.get_relevant_documents("which tent is the most waterproof?")

Workaround: specify the storage URI of the MLIndex file and consume it directly.

In [None]:
from azureml.rag.mlindex import MLIndex as InternalMLIndex

# Storage URI of the MLIndex folder; replace it with the URI of your own index.
path = "azureml://subscriptions/f375b912-331c-4fc5-8e9f-2d7205e3e036/resourcegroups/rg-jingyizhuai/workspaces/jingyizhu-project-2/datastores/workspaceblobstore/paths/indexes/remote-local-02/3a76509a-600b-4c65-a593-1b0944fa68ff/"
retriever = InternalMLIndex(path).as_langchain_retriever()
retriever.get_relevant_documents("which tent is the most waterproof?")