# RAG Vector Index (and Sample Pipeline) Generation

This notebook shows how to create a RAG vector index from your own data (a Git repository) and walks through a sample pipeline that builds it.

In [None]:
%pip install -U azure-ai-ml
%pip install -U 'azureml-rag[faiss]>=0.1.11'
%pip install -U azureml-core
%pip install azure-identity
%pip install azureml-fsspec
%pip install pandas
%pip install openai~=0.27.8 # version pinned to allow data-plane deployment inference
%pip install python-dotenv

In [None]:
# If `import win32file` fails with a DLL error then run the following and restart kernel:
# %pip uninstall -y pywin32
# %conda install -y --force-reinstall pywin32

### .env File Setup

Make sure to create a .env file in the same directory as this Jupyter notebook.
The .env file needs to contain the following:

```text
AOAI_API_KEY=<AOAI_API_KEY>
AOAI_ENDPOINT=<AOAI_TARGET_ENDPOINT>
AOAI_API_VERSION=<AOAI_API_VERSION>
GIT_REPO_URL=<GIT_REPO_URL>
AOAI_CONNECTION_NAME=<AOAI_CONNECTION_NAME>
AOAI_COMPLETION_MODEL_NAME=<AOAI_COMPLETION_MODEL_NAME>
AOAI_COMPLETION_DEPLOYMENT_NAME=<AOAI_COMPLETION_DEPLOYMENT_NAME>
AOAI_EMBEDDING_MODEL_NAME=<AOAI_EMBEDDING_MODEL_NAME>
AOAI_EMBEDDING_DEPLOYMENT_NAME=<AOAI_EMBEDDING_DEPLOYMENT_NAME>
```

In [None]:
from os import environ as env
from dotenv import load_dotenv

print("Loading environment variables from .env file")
load_dotenv(".env")
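
A missing or unfilled entry in the `.env` file only surfaces later as a `KeyError` or a failed connection, so it can help to check the values right after loading them. The following is a minimal sketch, assuming the variable names from the template above; `find_missing_vars` is a hypothetical helper and not part of any Azure package.

```python
import os

# Variable names taken from the .env template in this notebook.
REQUIRED_VARS = [
    "AOAI_API_KEY",
    "AOAI_ENDPOINT",
    "AOAI_API_VERSION",
    "GIT_REPO_URL",
    "AOAI_CONNECTION_NAME",
    "AOAI_COMPLETION_MODEL_NAME",
    "AOAI_COMPLETION_DEPLOYMENT_NAME",
    "AOAI_EMBEDDING_MODEL_NAME",
    "AOAI_EMBEDDING_DEPLOYMENT_NAME",
]


def find_missing_vars(environment, required=REQUIRED_VARS):
    """Return required names that are absent, empty, or still a '<PLACEHOLDER>'."""
    missing = []
    for name in required:
        value = environment.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Hypothetical usage: run this after load_dotenv(".env").
missing = find_missing_vars(os.environ)
if missing:
    print(f"Missing or placeholder values in .env: {', '.join(missing)}")
else:
    print("All expected variables are set")
```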

### User Input Parameters

Make sure to change the variables in the next section to fit your experiment needs.

In [None]:
# User Input
git_url = '<GIT_REPO_URL>'
data_source_url = '<GIT_REPO_SOURCE_URL>'
chunk_size = "1024"
chunk_overlap = "0"
chunk_prepend_summary = False
temperature = "0.5"
max_tokens = "2000"
serverless_instance_count = 1
serverless_instance_type = "Standard_D4s_v3"
embeddings_dataset_name = "<VECTOR_INDEX_NAME>"

experiment_name = 'qa_faiss_index_generation'
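
To get a feel for how `chunk_size` and `chunk_overlap` interact, the sketch below splits a string into fixed-size windows that share a number of trailing characters with the next window. It is only an illustration: the ingestion component applies its own splitter, its units may be tokens rather than characters, and `chunk_text` is a hypothetical helper written for this example.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap repeats more context across neighbouring chunks, at the price of more chunks to embed.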

## Get client for AzureML Workspace

The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

In [None]:
# Defaults
registry_name = "azureml"

In [None]:
%%writefile workspace.json
{
    "subscription_id": "<YOUR-SUBSCRIPTION-ID>",
    "resource_group": "<YOUR-RESOURCE-GROUP-NAME>",
    "workspace_name": "<YOUR-WORKSPACE-NAME>"
}

In [None]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient
from azureml.core import Workspace

# try:
#     credential = DefaultAzureCredential()
#     # Check if given credential can get token successfully.
#     credential.get_token("https://management.azure.com/.default")
# except Exception as ex:
#     # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
credential = InteractiveBrowserCredential()

try:
    ml_client = MLClient.from_config(credential=credential, path='workspace.json')
except Exception as ex:    
    raise Exception(
        "Failed to create MLClient from config file. Please modify and then run the above cell with your AzureML Workspace (associated with the AOAI connection) details."
    ) from ex
ws = Workspace(subscription_id=ml_client.subscription_id, resource_group=ml_client.resource_group_name, workspace_name=ml_client.workspace_name)
print(ml_client)

## Azure OpenAI

We recommend using the gpt-35-turbo model or newer to get good-quality output. [Follow these instructions](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal) to set up an Azure OpenAI instance and deploy the model. Once the model is deployed in AOAI, you can specify its model name and deployment name below.

In [None]:
aoai_completion_model_name = env['AOAI_COMPLETION_MODEL_NAME']
aoai_completion_deployment_name = env['AOAI_COMPLETION_DEPLOYMENT_NAME']
aoai_embedding_model_name = env['AOAI_EMBEDDING_MODEL_NAME']
aoai_embedding_deployment_name = env['AOAI_EMBEDDING_DEPLOYMENT_NAME']
aoai_connection = env['AOAI_CONNECTION_NAME']

Everything below this point does not require user input! Time to watch the magic happen :)

In [None]:
from azureml.rag.utils.connections import get_connection_by_name_v2, create_connection_v2

try:
    aoai_connection = get_connection_by_name_v2(ws, aoai_connection)
    aoai_connection_id = aoai_connection['id']
except Exception as ex:
    print(f"Could not get connection '{aoai_connection}', creating a new one")

    # Use .get() so a missing variable yields None instead of raising KeyError,
    # which lets the checks below report a clear message.
    target = env.get('AOAI_ENDPOINT') # example: 'https://<endpoint>.openai.azure.com/'
    key = env.get('AOAI_API_KEY')
    api_version = env.get('AOAI_API_VERSION') # example: 2023-03-15-preview

    if not key:
        raise RuntimeError("Please provide a valid key for the Azure OpenAI service")
    if not target:
        raise RuntimeError("Please provide a valid target for the Azure OpenAI service")
    if not api_version:
        raise RuntimeError("Please provide a valid api-version for the Azure OpenAI service")
    # Keep the whole connection object rather than only its id, so that later
    # cells receive a connection object in both branches instead of just the name.
    aoai_connection = create_connection_v2(
        workspace=ws,
        name=aoai_connection,
        category='AzureOpenAI',
        target=target,
        auth_type='ApiKey',
        credentials={
            'key': key
        },
        metadata={
            'apiType': 'azure',
            'apiVersion': api_version
        }
    )
    aoai_connection_id = aoai_connection['id']

In [None]:
# Uncomment to upgrade azureml-rag if infer_deployment is unrecognized in the package
# %pip install azureml-rag --upgrade

from azureml.rag.utils.deployment import infer_deployment

aoai_completion_deployment_name = infer_deployment(aoai_connection, aoai_completion_model_name)
print(f"Deployment name in AOAI workspace for model '{aoai_completion_model_name}' is '{aoai_completion_deployment_name}'")

### Setup Pipeline

In [None]:
ml_registry = MLClient(credential=credential, registry_name = registry_name)
git_to_faiss_component = ml_registry.components.get('llm_ingest_git_to_faiss_basic', label='latest')

In [None]:
from azure.ai.ml import Output
from azure.ai.ml.dsl import pipeline

# def use_automatic_compute(component, instance_count=1, instance_type='Standard_D4s_v3'):
#     component.set_resources(instance_count=instance_count, instance_type=instance_type, properties={'compute_specification': {'automatic': True}})
#     return component

# def use_aoai_connection(component, aoai_connection_id, custom_env:str=None):
#     if custom_env is not None:
#         component.environment_variables[custom_env] = aoai_connection_id
#     if aoai_connection_id is not None:
#         component.environment_variables['AZUREML_WORKSPACE_CONNECTION_ID_AOAI'] = aoai_connection_id

# @pipeline(compute=dedicated_cpu_compute)
@pipeline(default_compute='serverless')
def qa_faiss_index_generation(
    git_url,
    data_source_url,
    llm_completion_config,
    embeddings_model,
    aoai_connection_id=None,
    chunk_size=1024,
    chunk_overlap=0,
    chunk_prepend_summary=False,
    serverless_instance_count=1,
    serverless_instance_type="Standard_D4s_v3",
    embeddings_dataset_name="git-repository_VectorIndex",
):

    # Ingest Git to Faiss Vector Index
    git_to_faiss = git_to_faiss_component(
        git_repository = git_url,
        data_source_url = data_source_url,
        llm_config = llm_completion_config,
        llm_connection = aoai_connection_id,
        embeddings_model = embeddings_model,
        embedding_connection = aoai_connection_id,
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap,
        chunk_prepend_summary = chunk_prepend_summary,
        serverless_instance_count = serverless_instance_count,
        serverless_instance_type = serverless_instance_type,
        embeddings_dataset_name = embeddings_dataset_name,
    )


    return {
        'qa_faiss_index': git_to_faiss.outputs.faiss_index,
    }

In [None]:
# Defaults
embeddings_model = f'azure_open_ai://deployment/{aoai_embedding_deployment_name}/model/{aoai_embedding_model_name}'
llm_completion_config = f'{{"type":"azure_open_ai","model_name":"{aoai_completion_model_name}","deployment_name":"{aoai_completion_deployment_name}","temperature":"{temperature}","max_tokens":"{max_tokens}"}}'
print(embeddings_model)
print(llm_completion_config)
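
Building the JSON configuration with an f-string works as long as none of the values contains a quote or backslash. As a possible alternative, the sketch below builds the same string with `json.dumps`, which escapes values for you. The helper names `build_llm_config` and `build_embeddings_uri` and the model and deployment names are placeholders chosen for this example; the field names and the URI layout are copied from the cell above.

```python
import json


def build_llm_config(model_name, deployment_name, temperature, max_tokens):
    """Serialize the completion settings into the compact JSON string used above."""
    return json.dumps(
        {
            "type": "azure_open_ai",
            "model_name": model_name,
            "deployment_name": deployment_name,
            # The cell above passes these two values as strings, so do the same here.
            "temperature": str(temperature),
            "max_tokens": str(max_tokens),
        },
        separators=(",", ":"),
    )


def build_embeddings_uri(deployment_name, model_name):
    """Compose the embeddings model URI in the layout used above."""
    return f"azure_open_ai://deployment/{deployment_name}/model/{model_name}"


# Placeholder names for illustration only.
config = build_llm_config("gpt-35-turbo", "my-gpt-deployment", "0.5", "2000")
print(config)
print(build_embeddings_uri("my-embedding-deployment", "text-embedding-ada-002"))
```

Parsing the result with `json.loads` is a quick way to confirm the string is valid JSON before submitting the pipeline.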

In [None]:
from azure.ai.ml import Input
from azure.ai.ml.entities import UserIdentityConfiguration

# data_source_glob=data_source_glob,
# asset_name=asset_name,
# document_path_replacement_regex=document_path_replacement_regex,
pipeline_job = qa_faiss_index_generation(
    git_url = git_url,
    data_source_url = data_source_url,
    llm_completion_config = llm_completion_config,
    embeddings_model = embeddings_model,
    aoai_connection_id=aoai_connection_id,
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    chunk_prepend_summary = chunk_prepend_summary,
    serverless_instance_count=serverless_instance_count,
    serverless_instance_type=serverless_instance_type,
    embeddings_dataset_name=embeddings_dataset_name,
)

pipeline_job.identity = UserIdentityConfiguration()
pipeline_job.settings.continue_on_step_failure = False

### Submit Pipeline
Click the generated link below to access the job details in the studio. Make sure the URL includes all flights required to access these preview features.

**In case of any errors see [TROUBLESHOOT.md](../../TROUBLESHOOT.md).**

In [None]:
running_pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name=experiment_name
)
running_pipeline_job

### Review token usage

In [None]:
# running_pipeline_job = ml_client.jobs.get("<pipeline run id>")
child_runs = ml_client.jobs.list(parent_job_name=running_pipeline_job.name)
child_runs = list(child_runs)
data_generation_run = child_runs[-1]

In [None]:
from azureml.core import Run

run = Run.get(ws, data_generation_run.name)
metrics = run.get_metrics()

In [None]:
# print(f"Tokens used: {metrics['total_tokens']}")
# print(f"Model used: {metrics['llm_model_name']}")

Given the token usage and the model, you can compute the cost using the pricing listed at https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/
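
The arithmetic is tokens divided by 1,000, multiplied by the price per 1,000 tokens. The sketch below assumes a single flat rate; the price and the token count are made-up placeholders, not real pricing or real metrics, and `estimate_cost` is a hypothetical helper. Actual pricing depends on the model and may differ between prompt and completion tokens, so check the pricing page for your deployment.

```python
def estimate_cost(total_tokens, price_per_1k_tokens):
    """Estimate cost as (tokens / 1000) * price per 1,000 tokens."""
    if total_tokens < 0 or price_per_1k_tokens < 0:
        raise ValueError("token count and price must be non-negative")
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder values for illustration only.
example_metrics = {"total_tokens": 250_000}
hypothetical_price_per_1k = 0.002  # assumed rate in USD, not a real quote

cost = estimate_cost(example_metrics["total_tokens"], hypothetical_price_per_1k)
print(f"Estimated cost: ${cost:.2f}")
```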