# Query
- Query the search index built earlier and generate answers from the results.

## Use case
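- The input data is the AI strategy documents published by the Cabinet Office of Japan.
  - https://www8.cao.go.jp/cstp/ai/index.html
- The documents contain text, tables, figures, and graphs. The goal is to build a RAG application that can answer questions based on all of them.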

In [None]:
! pip install azure-search-documents==11.6.0b4
! pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-ai-textanalytics promptflow-evals promptflow-azure

## Set up

In [1]:
import os

from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    AIServicesVisionParameters,
    AIServicesVisionVectorizer,
    AIStudioModelCatalogName,
    AzureMachineLearningVectorizer,
    AzureOpenAIVectorizer,
    AzureOpenAIModelName,
    AzureOpenAIParameters,
    AzureOpenAIEmbeddingSkill,
    BlobIndexerDataToExtract,
    BlobIndexerParsingMode,
    CognitiveServicesAccountKey,
    DefaultCognitiveServicesAccount,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    FieldMapping,
    HnswAlgorithmConfiguration,
    HnswParameters,
    IndexerExecutionStatus,
    IndexingParameters,
    IndexingParametersConfiguration,
    InputFieldMappingEntry,
    KeyPhraseExtractionSkill,
    OutputFieldMappingEntry,
    ScalarQuantizationCompressionConfiguration,
    ScalarQuantizationParameters,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataIdentity,
    SearchIndexerDataSourceConnection,
    SearchIndexerIndexProjections,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    SearchIndexerSkillset,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    SplitSkill,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
    VisionVectorizeSkill
)
from azure.search.documents.models import (
    HybridCountAndFacetMode,
    HybridSearch,
    SearchScoreThreshold,
    VectorizableTextQuery,
    VectorizableImageBinaryQuery,
    VectorizableImageUrlQuery,
    VectorSimilarityThreshold,
)
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv
import pandas as pd
from IPython.display import Image, display, HTML
from openai import AzureOpenAI
from azure.ai.textanalytics import TextAnalyticsClient

from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

In [2]:
# Load environment variables
load_dotenv(override=True)

# Configuration
AZURE_AI_VISION_API_KEY = os.getenv("AZURE_AI_VISION_API_KEY")
AZURE_AI_VISION_ENDPOINT = os.getenv("AZURE_AI_VISION_ENDPOINT")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
STORAGE_ACCOUNT_NAME = os.getenv("STORAGE_ACCOUNT_NAME")
BLOB_CONTAINER_NAME = "rag-knowledge-03-business"
BLOB_CONNECTION_STRING = os.getenv("BLOB_CONNECTION_STRING")
INDEX_NAME = "rag-search-index-push-03"
AZURE_SEARCH_ADMIN_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_AI_MULTI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_MULTI_SERVICE_ENDPOINT")
AZURE_AI_MULTI_SERVICE_KEY = os.getenv("AZURE_AI_MULTI_SERVICE_KEY")
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT")
AZURE_DOCUMENT_INTELLIGENCE_KEY = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")
SUBSCRIPTION_ID = os.getenv("SUBSCRIPTION_ID")
RESOURCE_GROUP_NAME = os.getenv("RESOURCE_GROUP_NAME")
PROJECT_NAME = os.getenv("PROJECT_NAME")
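
Every value above comes from `os.getenv`, which returns `None` when a variable is missing, so a typo in `.env` only surfaces later as an authentication or endpoint error. The following is a minimal sketch of an early check. The helper name `find_missing_settings` and the choice of which settings count as required are assumptions for illustration, not part of the original notebook.

```python
import os


def find_missing_settings(names, environ=None):
    """Return the setting names whose value is unset or empty."""
    # `environ` can be injected for testing; it defaults to the process environment.
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Hypothetical selection: the settings the query pipelines in this notebook rely on.
REQUIRED_SETTINGS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
]

missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
```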

## Query-Design
Tips for Azure AI Search Query-Design

A separate pipeline is built for each search method and query design. Their specifications are as follows.

#### Keyword Search Pipeline
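- **Purpose**: Run a simple keyword-based search.
- **Characteristics**: The query is rephrased and then run as a plain keyword search. The aim is to retrieve related results quickly rather than to maximize retrieval precision.
- **How it differs**: It relies on keyword matching only, with no vector search and no semantic ranker, so the implementation is simple and lightweight.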

#### Vector Search Pipeline
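- **Purpose**: Search by the semantic similarity of the query.
- **Characteristics**: The query is vectorized and matched against the vectors stored in the index, returning the most similar ones. Because no reranking step is involved, relevant results come back quickly.
- **How it differs**: It scores semantic similarity instead of keyword overlap, which makes the search more flexible and more tolerant of wording differences.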
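
To make "similarity between vectors" concrete, here is a small sketch that ranks toy document vectors against a query vector by cosine similarity. The two-dimensional vectors are made up, and cosine is an assumption: the metric actually configured on the index is not shown in this notebook. In the pipelines below, the search service embeds the query text itself through `VectorizableTextQuery`, so none of this code is needed at query time.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vectors, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data for illustration only.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], docs))
```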

#### Hybrid Search Pipeline
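- **Purpose**: Combine the strengths of keyword search and vector search to return richer results.
- **Characteristics**: The keyword search and the vector search run at the same time and their results are merged, so both the meaning of the text and its explicit keywords are taken into account.
- **How it differs**: It draws on the properties of both keyword and vector search and therefore covers a wider range of search needs.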
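
Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method it uses to merge the keyword and vector result lists of a hybrid query. The sketch below shows the idea on two made-up rankings. The service performs this step internally, and the constant `k=60` is the commonly cited default, treated here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list of (doc_id, score)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found by both lists add up.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the order deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy rankings for illustration only.
keyword_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that only one method found, which is the behavior the hybrid pipeline relies on.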

#### Hybrid Search + Semantic Ranker Pipeline
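- **Purpose**: Rerank the hybrid search results with the semantic ranker so that the most relevant results come first.
- **Characteristics**: Reranking with the semantic ranker returns results that follow the user's intent more closely.
- **How it differs**: Adding the semantic ranker improves result quality, and it is especially effective for long queries and queries with complex intent.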

#### RAG with HyDE Pipeline
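- **Purpose**: Use Hypothetical Document Embedding (HyDE) to generate more accurate answers to the user's question.
- **Characteristics**: A hypothetical answer is generated for the query and the search is run with that answer instead of the raw question. This retrieves more relevant documents, which are then used to answer the user's question.
- **How it differs**: Because HyDE searches with a generated passage, it can surface useful documents even when the query wording has no direct match in the corpus.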

In [None]:
# User-specified parameter
USE_AAD_FOR_SEARCH = False  # True: authenticate with Microsoft Entra ID (AAD); False: use the API key

def authenticate_azure_search(api_key=None, use_aad_for_search=False):
    if use_aad_for_search:
        print("Using AAD for authentication.")
        credential = DefaultAzureCredential()
    else:
        print("Using API keys for authentication.")
        if api_key is None:
            raise ValueError("API key must be provided if not using AAD for authentication.")
        credential = AzureKeyCredential(api_key)
    return credential

azure_search_credential = authenticate_azure_search(api_key=AZURE_SEARCH_ADMIN_KEY, use_aad_for_search=USE_AAD_FOR_SEARCH)

In [4]:
import re
import os
from openai import AzureOpenAI
import json

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
  api_version="2024-02-01"
)


def generate_answer(query, context):
    """Answer `query` with the gpt-4o deployment, grounded in the retrieved `context`."""
    # The retrieved context is embedded in the system prompt so the model answers from it only.
    system_message = f"""
    system:
    You are an AI assistant that helps users answer questions given a specific context. You will be given a context and asked a question based on that context. Your answer should be as precise as possible and should only come from the context.
    You must generate a response in markdown format. You must include the image url for showing the image in the response, if the context corresponding to the answer contains "image_url".
    Please add citation after each sentence when possible in a form "(Source: citation)".
    context: {context}
    user:
    """
    message_text = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query},
    ]
    completion = client.chat.completions.create(
        model="gpt-4o",  # Azure OpenAI deployment name
        messages=message_text,
        temperature=0,  # deterministic output for evaluation
    )
    return completion.choices[0].message.content

In [5]:

# Kept under its own name so that a later cell cannot overwrite the prompt this function uses.
REPHRASE_SYSTEM_MESSAGE = """
# Your Task
- Given the following conversation history and the user's next question, rephrase the question to be a standalone question.
- You must output JSON.

# JSON format example:
{
    "questions": [
        "rephrased question content ...."
    ]
}
"""

def generate_rephrase_query(text):
    """Return a JSON string whose "questions" list holds the rephrased, standalone question."""
    message_text = [
        {"role": "system", "content": REPHRASE_SYSTEM_MESSAGE},
        {"role": "user", "content": text},
    ]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure OpenAI deployment name
        messages=message_text,
        response_format={"type": "json_object"},
        temperature=0,
    )
    return completion.choices[0].message.content

In [6]:

# Kept under its own name so that it does not replace the prompt used by generate_rephrase_query.
EXPAND_SYSTEM_MESSAGE = """
# Your Task
- Given the following conversation history and the user's next question, rephrase the question to be a standalone question.
- You also need to extend the original question to generate 5 related queries. This is done to capture the broader context of the user's question.
- You must output JSON. In other words, you must output an array of exactly 5 questions.

# JSON format example:
{
    "questions": [
        "related question 1",
        "related question 2",
        "related question 3",
        "related question 4",
        "related question 5"
    ]
}
"""

def generate_expanded_query(text):
    """Return a JSON string whose "questions" list holds 5 related queries."""
    message_text = [
        {"role": "system", "content": EXPAND_SYSTEM_MESSAGE},
        {"role": "user", "content": text},
    ]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure OpenAI deployment name
        messages=message_text,
        response_format={"type": "json_object"},
        temperature=0,
    )
    return completion.choices[0].message.content
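
`generate_expanded_query` returns a raw JSON string, and model output is not guaranteed to follow the requested shape exactly. The following is a minimal sketch of a defensive parser that could sit between the model call and the search calls. The helper name `parse_expanded_queries` and the fallback to the original question are assumptions for illustration. The pipelines in this notebook do not use it.

```python
import json


def parse_expanded_queries(raw_json, original_question, limit=5):
    """Extract a de-duplicated list of query strings, falling back to the original question."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return [original_question]
    questions = payload.get("questions", []) if isinstance(payload, dict) else []
    if not isinstance(questions, list):
        return [original_question]

    seen = set()
    cleaned = []
    for question in questions:
        if not isinstance(question, str):
            continue  # ignore anything that is not plain text
        text = question.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned[:limit] or [original_question]
```

A call such as `parse_expanded_queries(generate_expanded_query(question), question)` would then always yield at least one usable query.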


In [7]:


def generate_hypothetical_query(text):
    hypothetical_gen_instruction = f"""Please write a passage to answer the question
	Question: {text}
	Passage:
	"""
    message_text = [
		{"role":"system","content": "You are an AI assistant."},
		{"role":"user","content": hypothetical_gen_instruction}
	]
    completion = client.chat.completions.create(
		model="gpt-4o-mini", # model = "deployment_name"
		messages = message_text,
		# response_format={"type": "json_object"},
		temperature=0,
		)
    return completion.choices[0].message.content

In [8]:
search_client = SearchClient(endpoint=AZURE_SEARCH_ENDPOINT, index_name=INDEX_NAME, credential=azure_search_credential)

In [9]:
# Define keyword search pipeline
def keyword_search_pipeline(search_client, search_text):
    # Perform the search
    search_text_rephrased = json.loads(generate_rephrase_query(search_text))["questions"][0]
    results = search_client.search(
        query_type='simple',  # without semantic ranker
        query_language='ja',
        search_text=search_text_rephrased,
        top=5,
        select="content, title, image_url",
        search_fields=["content", "title", "key_phrases"],
    )
    
    # Collecting search results
    context_text = ""
    retrieved_results = []
    for result in results:
        context_text += result["content"] + " "
        retrieved_results.append({
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url")
        })
    
    # Generate the final answer
    final_answer = generate_answer(search_text_rephrased, context_text)
    
    # Return a dictionary with prompt, retrieved results, and the final answer
    return {
        "prompt": search_text,
        "rephrased_prompt": search_text_rephrased,
        "retrieved_results": retrieved_results,
        "final_answer": final_answer
    }

In [10]:
# Define vector search pipeline
def vector_search_pipeline(search_client, search_text):
    # Rephrase the search text
    search_text_rephrased = json.loads(generate_rephrase_query(search_text))["questions"][0]
    vector_query = VectorizableTextQuery(
        text=search_text_rephrased,
        k_nearest_neighbors=50,
        fields="vector",
    )
    
    # Perform the search
    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        top=5,
        select="content, title, image_url",
    )
    
    # Collecting search results
    context_text = ""
    retrieved_results = []
    for result in results:
        context_text += result["content"] + " "
        retrieved_results.append({
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url")
        })
    
    # Generate the final answer
    final_answer = generate_answer(search_text_rephrased, context_text)
    
    # Return a dictionary with prompt, retrieved results, and the final answer
    return {
        "prompt": search_text,
        "rephrased_prompt": search_text_rephrased,
        "retrieved_results": retrieved_results,
        "final_answer": final_answer
    }


In [11]:
# Define Hybrid search pipeline
def hybrid_search_pipeline(search_client, search_text):
    # Rephrase the search text
    search_text_rephrased = json.loads(generate_rephrase_query(search_text))["questions"][0]
    vector_query = VectorizableTextQuery(
        text=search_text_rephrased,
        k_nearest_neighbors=50,
        fields="vector",
    )
    
    # Perform the search
    results = search_client.search(
        query_type='simple',  # without semantic ranker
        query_language='ja',
        search_text=search_text_rephrased,
        vector_queries=[vector_query],
        top=5,
        select="content, title, image_url",
        search_fields=["content", "title", "key_phrases"],
    )
    
    # Collecting search results
    context_text = ""
    retrieved_results = []
    for result in results:
        context_text += result["content"] + " "
        retrieved_results.append({
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url")
        })
    
    # Generate the final answer
    final_answer = generate_answer(search_text_rephrased, context_text)
    
    # Return a dictionary with prompt, retrieved results, and the final answer
    return {
        "prompt": search_text,
        "rephrased_prompt": search_text_rephrased,
        "retrieved_results": retrieved_results,
        "final_answer": final_answer
    }


In [12]:
# Define Hybrid search + Semantic Ranker pipeline
def hybrid_semantic_pipeline(search_client, search_text):
    # Rephrase the search text
    search_text_rephrased = json.loads(generate_rephrase_query(search_text))["questions"][0]
    vector_query = VectorizableTextQuery(
        text=search_text_rephrased,
        k_nearest_neighbors=50,
        fields="vector",
    )
    
    # Perform the search
    results = search_client.search(
        query_type='semantic',
        query_language='ja',
        semantic_configuration_name='my-semantic-config',
        search_text=search_text_rephrased,
        vector_queries=[vector_query],
        top=5,
        select="content, title, image_url",
        search_fields=["content", "title", "key_phrases"],
    )
    
    # Collecting search results
    context_text = ""
    retrieved_results = []
    for result in results:
        context_text += result["content"] + " "
        retrieved_results.append({
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url")
        })
    
    # Generate the final answer
    final_answer = generate_answer(search_text_rephrased, context_text)
    
    # Return a dictionary with prompt, retrieved results, and the final answer
    return {
        "prompt": search_text,
        "rephrased_prompt": search_text_rephrased,
        "retrieved_results": retrieved_results,
        "final_answer": final_answer
    }


In [13]:
# Define RAG with HyDE Pipeline
def rag_pipeline_with_hyde(search_client, search_text):
    # Generate a hypothetical answer
    hypothetical_answer = generate_hypothetical_query(search_text)
    vector_query = VectorizableTextQuery(
        text=hypothetical_answer,
        k_nearest_neighbors=50,
        fields="vector",
    )
    
    # Perform the search
    results = search_client.search(
        query_type='semantic',
        query_language='ja',
        semantic_configuration_name='my-semantic-config',
        search_text=hypothetical_answer,
        vector_queries=[vector_query],
        top=5,
        select="content, title, image_url",
        search_fields=["content", "title", "key_phrases"],
    )
    
    # Collecting search results
    context_text = ""
    retrieved_results = []
    for result in results:
        context_text += result["content"] + " "
        retrieved_results.append({
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url")
        })
    
    # Generate the final answer
    final_answer = generate_answer(search_text, context_text)
    
    # Return a dictionary with the query, hypothetical answer, retrieved results, and the final answer
    return {
        "prompt": search_text,
        "rephrased_prompt": hypothetical_answer,
        "retrieved_results": retrieved_results,
        "final_answer": final_answer
    }
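
All five pipelines repeat the same result-collection loop, and the context they pass to `generate_answer` contains only the `content` text, so the `image_url` and title that the answer prompt refers to never reach the model. One possible way to address both points is sketched below. The helper name `build_context` and the one-JSON-object-per-document layout are assumptions for illustration, not the notebook's method.

```python
import json


def build_context(results):
    """Collect search hits and serialize them, including title and image_url, as the LLM context."""
    retrieved_results = []
    context_lines = []
    for result in results:
        record = {
            "title": result.get("title"),
            "content": result.get("content"),
            "image_url": result.get("image_url"),
        }
        retrieved_results.append(record)
        # Drop empty fields so the prompt only mentions "image_url" when a document has one.
        visible = {key: value for key, value in record.items() if value}
        context_lines.append(json.dumps(visible, ensure_ascii=False))
    return "\n".join(context_lines), retrieved_results


# Toy hits for illustration only; real hits come from search_client.search(...).
hits = [
    {"title": "t1", "content": "c1", "image_url": None},
    {"title": "t2", "content": "c2", "image_url": "https://example.com/a.png"},
]
context_text, retrieved_results = build_context(hits)
print(context_text)
```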


## Evaluation

In [15]:
azure_ai_project = { 
    "subscription_id": SUBSCRIPTION_ID,
    "resource_group_name": RESOURCE_GROUP_NAME,
    "project_name": PROJECT_NAME
}

In [16]:
env_var = {
    "gpt-4o": {
        # Azure OpenAI REST paths sit under /openai on the resource endpoint.
        "endpoint": f"{AZURE_OPENAI_ENDPOINT}/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01",
        "key": f"{AZURE_OPENAI_API_KEY}",
    },
}

In [None]:
df_eval_input = pd.read_json("../eval/03/input/eval_data.jsonl", lines=True)
df_eval_input

In [18]:
output_folder = "../eval/03/output"

# Create the output folder if it does not exist yet.
os.makedirs(output_folder, exist_ok=True)

In [19]:
# Create a new DataFrame to store the results
df_eval_output = df_eval_input.copy()

# Execute the hybrid_semantic_pipeline function for each question
for index, row in df_eval_output.iterrows():
	question = row['question']
	
	# RAG with hybrid semantic pipeline
	result = hybrid_semantic_pipeline(search_client, question)
	
	# Add the result to the DataFrame
	df_eval_output.at[index, 'rephrased_prompt'] = result['rephrased_prompt']
	df_eval_output.at[index, 'retrieved_results'] = json.dumps(result['retrieved_results'], ensure_ascii=False)
	df_eval_output.at[index, 'answer'] = result['final_answer']

# Save the new DataFrame
df_eval_output.to_json("../eval/03/output/eval_output.jsonl", orient="records", lines=True, force_ascii=False)
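
The evaluator configuration further down reads `${data.context}`, while the loop above only adds `rephrased_prompt`, `retrieved_results`, and `answer`. Whether `eval_data.jsonl` already carries a `context` column is not visible here. If it does not, a column could be derived from the retrieved documents before saving, as in this sketch. The helper name `context_from_retrieved` is an assumption for illustration.

```python
import json


def context_from_retrieved(retrieved_json):
    """Join the `content` of every retrieved document into one context string."""
    documents = json.loads(retrieved_json)
    return "\n".join(doc["content"] for doc in documents if doc.get("content"))


# Example with the same JSON shape that the loop above stores in `retrieved_results`.
sample = json.dumps(
    [{"title": "t1", "content": "first"}, {"title": "t2", "content": None}, {"title": "t3", "content": "third"}]
)
print(context_from_retrieved(sample))
```

Applied to the DataFrame, this would read `df_eval_output["context"] = df_eval_output["retrieved_results"].apply(context_from_retrieved)`, placed before the `to_json` call.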

In [None]:
df_eval_output

### Azure AI Evaluation

In [20]:
from promptflow.core import AzureOpenAIModelConfiguration

configuration = AzureOpenAIModelConfiguration(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-02-01",
    azure_deployment="gpt-4o",
)

In [None]:
from eval.app_target import ModelEndpoints
import pathlib
import random

from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import (
    ContentSafetyEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    GroundednessEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
)


content_safety_evaluator = ContentSafetyEvaluator(project_scope=azure_ai_project)
relevance_evaluator = RelevanceEvaluator(model_config=configuration)
coherence_evaluator = CoherenceEvaluator(model_config=configuration)
groundedness_evaluator = GroundednessEvaluator(model_config=configuration)
fluency_evaluator = FluencyEvaluator(model_config=configuration)
similarity_evaluator = SimilarityEvaluator(model_config=configuration)

models = [
    "gpt-4o",
]

path = "../eval/03/output/eval_output.jsonl"

for model in models:
    randomNum = random.randint(1111, 9999)
    results = evaluate(
        azure_ai_project=azure_ai_project,
        evaluation_name="Eval-Run-" + str(randomNum) + "-" + model.title(),
        data=path,
        # target=ModelEndpoints(env_var, model),
        evaluators={
            # "content_safety": content_safety_evaluator,
            "coherence": coherence_evaluator,
            "relevance": relevance_evaluator,
            "groundedness": groundedness_evaluator,
            "fluency": fluency_evaluator,
            "similarity": similarity_evaluator,
        },
        evaluator_config={
            # "content_safety": {"question": "${data.question}", "answer": "${data.answer}"},
            "coherence": {"answer": "${data.answer}", "question": "${data.question}"},
            "relevance": {"answer": "${data.answer}", "context": "${data.context}", "question": "${data.question}"},
            "groundedness": {
                "answer": "${data.answer}",
                "context": "${data.context}",
                "question": "${data.question}",
            },
            "fluency": {"answer": "${data.answer}", "context": "${data.context}", "question": "${data.question}"},
            "similarity": {"answer": "${data.answer}", "ground_truth": "${data.ground_truth}", "question": "${data.question}"},
        },
    )