<h1 align="center">
    <img 
        src="./img/Microsoft-Logo.png" 
        width="400"/>
</h1>
<h1 align="center">
    <b>Practical Guide</b>
</h1>
<h4 align="center">
    for the creation of an AI Solution using an accelerator from the <a href="https://www.ds-toolkit.com/">Data Science Toolkit</a>
</h4>

# What to expect

* **Challenge 1:** *Create your own AI solution*
* **Challenge 2:** *Getting answers from our new Copilot*
* **Challenge 3:** *Evaluate the quality of our new copilot*

# Challenge 1: *Create your own AI solution*

Here we are going to create a RAG-based Copilot that answers questions about six sustainability reports, published in 2021 or 2022 by Microsoft, Apple, Amazon, Google, Meta and Netflix. The documents are part of the [Mini ESG Bench Dataset](https://llamahub.ai/l/llama_datasets/Mini%20ESG%20Bench%20Dataset?from=llama_datasets).

## Challenge 1 - Step 1: *Let's import the libraries to be used in this notebook*

> This step takes around **30 to 90 seconds** to complete. It runs in quiet mode, so only errors are displayed (if any occur). To see what is installed, check the [requirements.txt](./requirements.txt) file.

In summary, two main tools are installed and used in this notebook:

* **LlamaIndex**, which is used to download the dataset and to create the semantic index.
  > It is also possible to use **Azure AI Search** to create the semantic index. However, because the index is small, we build it in memory with LlamaIndex to keep things simple.
* **RAGAS**, which is used to calculate the quality metrics for the Copilot that we are going to create.

In [None]:
import os
import pandas as pd
from dotenv import load_dotenv

# To create the RAG-based copilot
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import VectorStoreIndex, Settings

# To calculate the Generative AI quality metrics
from ragas.metrics import (
    Faithfulness,
    ContextPrecision,
    ContextRecall
)
from ragas.llms import LlamaIndexLLMWrapper
from ragas.embeddings import LlamaIndexEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
from ragas.evaluation import evaluate
from ragas.run_config import RunConfig

# To create a simple PDF visualization tool
import pymupdf
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

# For formatting purposes
import textwrap

load_dotenv()
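
`load_dotenv()` loads variables from a local `.env` file, and later cells read `OPENAI_API_KEY`, `OPENAI_API_VERSION` and `AZURE_OPENAI_ENDPOINT` from `os.environ`. An optional check such as the following can surface a missing variable early. It is only a sketch: the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# Variable names taken from the cells in this notebook that read os.environ
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `environ` makes the helper easy to try without touching the real environment.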

## Challenge 1 - Step 2: *Let's load the documents and test questions*

> This step takes around **1 minute** to complete.

The following code loads and prepares data from two sources (a JSON file and a directory of files) for later use:

1. The first statement creates an instance of `LabelledRagDataset` by loading data from the JSON file `./data/rag_dataset.json`. `from_json` is a class method that reads the JSON file, parses its content, and initializes the `LabelledRagDataset` object with the parsed data, so the data is checked against the expected format before it is used.
2. The second statement initializes an instance of `SimpleDirectoryReader` with the directory path `./data/source_files/`. The `SimpleDirectoryReader` class is designed to read files from a specified directory. The `load_data` method is then called on this instance to load the documents from the directory.

In [None]:
# Load the labelled question/answer pairs and the source documents

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files/").load_data(
    show_progress=True
)
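
To picture what "read, parse and validate" means for the question file, here is a minimal sketch using only the standard library. It assumes the file is a JSON object with an `examples` list whose items carry the fields used later in this notebook (`query`, `reference_contexts`, `reference_answer`); that layout is an assumption inferred from those attribute names, the sample record is made up, and `parse_examples` is a hypothetical helper, not the LlamaIndex implementation.

```python
import json

# Fields referenced later in this notebook; the file layout itself is an assumption
REQUIRED_FIELDS = ("query", "reference_contexts", "reference_answer")


def parse_examples(raw_json):
    """Parse a JSON string and check that every example has the required fields."""
    payload = json.loads(raw_json)
    examples = payload["examples"]
    for position, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if field not in example]
        if missing:
            raise ValueError(f"example {position} is missing fields: {missing}")
    return examples


# Made-up record, for illustration only
raw = json.dumps(
    {"examples": [{"query": "Q?", "reference_contexts": ["c1"], "reference_answer": "A."}]}
)
print(len(parse_examples(raw)))
```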

### Tool to visualize the reports just downloaded

The following cell creates a tool to visualize the PDF files we just downloaded. If you want to take a look at the downloaded reports manually, just navigate to the `data/source_files` folder.

In [None]:
data_source_path = "./data/source_files"

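# List of PDF files just downloaded (non-PDF files are skipped, sorted for a stable order)
pdf_files = sorted(
    os.path.join(data_source_path, file_name)
    for file_name in os.listdir(data_source_path)
    if file_name.lower().endswith(".pdf")
)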

# Function to render a specific page of a PDF
def render_pdf_page(pdf_path, page_number=0):
    # Open the PDF file
    pdf_document = pymupdf.open(pdf_path)
    
    # Ensure the page number is valid
    if page_number < 0 or page_number >= len(pdf_document):
        raise ValueError("Invalid page number.")
    
    # Get the page and render it as an image
    page = pdf_document[page_number]
    pix = page.get_pixmap()
    pdf_document.close()
    
    # Display the image using Matplotlib
    plt.figure(figsize=(10, 8))
    plt.imshow(pix.pil_image())
    plt.axis("off")
    plt.show()

# Function to update the displayed page
def update_page(step):
    global current_page
    pdf_document = pymupdf.open(dropdown.value)
    total_pages = len(pdf_document)
    pdf_document.close()
    
    # Update the current page index
    current_page += step
    if current_page < 0:
        current_page = 0
    elif current_page >= total_pages:
        current_page = total_pages - 1
    
    with output:
        output.clear_output()
        render_pdf_page(dropdown.value, current_page)

# Function to reset the viewer when a new PDF is selected
def reset_viewer(change):
    global current_page
    current_page = 0  # Reset to the first page
    with output:
        output.clear_output()
        render_pdf_page(dropdown.value, current_page)

# Create widgets
dropdown = widgets.Dropdown(
    options=pdf_files,
    description="Select a PDF:",
    style={"description_width": "initial"}
)

prev_button = widgets.Button(description="Previous Page")
next_button = widgets.Button(description="Next Page")
output = widgets.Output()

# Attach event listeners
prev_button.on_click(lambda _: update_page(-1))
next_button.on_click(lambda _: update_page(1))
dropdown.observe(reset_viewer, names="value")

# Initial display
reset_viewer(None)

# Display widgets and output
display(widgets.VBox([dropdown, widgets.HBox([prev_button, next_button]), output]))

## Challenge 1 - Step 3: *Let's create the semantic index*

> This process can take up to **15 seconds** to complete

A semantic index organizes and retrieves information based on the meaning and context of the content rather than on keywords alone. Unlike traditional keyword-based search systems, a semantic index captures the relationships between concepts and terms within the data, which allows for more accurate and relevant search results.

In ML and NLP, a semantic index relies on embeddings: numerical representations of words, phrases, or documents in a high-dimensional space. These embeddings capture the semantic meaning of the content, so the system can identify similar or related items even when they share no exact keywords.

The following code creates a semantic index using Azure OpenAI embeddings. The embedding model generates embeddings for the documents, which are then indexed using `VectorStoreIndex` from LlamaIndex. This allows for efficient and meaningful retrieval of information based on the semantic content of the documents.

In [None]:
%%time
# Create the embedding model
embed_model = AzureOpenAIEmbedding(
    model='text-embedding-3-small', # Update with the embeddings deployment name
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

# Pass the embedding model to LlamaIndex
Settings.embed_model = embed_model

# Create the actual index
index = VectorStoreIndex.from_documents(
    documents=documents,
    show_progress=True
)
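
The cell above delegates embedding and similarity search to LlamaIndex and Azure OpenAI. As a rough illustration of the underlying idea, the sketch below ranks a few texts by cosine similarity to a query vector. The three-dimensional vectors are made up for the example (real embedding models return far more dimensions), and `cosine_similarity` and `retrieve` are hypothetical helpers rather than LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": each text is mapped to a tiny vector
toy_index = {
    "carbon emissions targets": [0.9, 0.1, 0.0],
    "water usage in data centers": [0.1, 0.9, 0.1],
    "employee diversity programs": [0.0, 0.2, 0.9],
}


def retrieve(query_vector, index, top_k=2):
    """Return the `top_k` texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


# Pretend embedding of a question about greenhouse gas reduction
query_vector = [0.8, 0.2, 0.1]
print(retrieve(query_vector, toy_index))
```

The query shares no keyword with the top-ranked text; it is ranked first only because the two vectors point in a similar direction.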


## Challenge 1 - Step 4: *Let's create the copilot*

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation, translation, summarization, and more. These models are characterized by their vast number of parameters, which enable them to understand and generate human-like text with high accuracy and fluency. LLMs are trained on extensive datasets using self-supervised learning, allowing them to acquire a deep understanding of language, including syntax, semantics, and contextual relationships.

Here we create an instance of a large language model using Azure OpenAI. In the cell below, `engine` is the name of the Azure OpenAI deployment and `model` is the name of the underlying model (`gpt-4o` in both cases); `temperature=0.0` keeps the answers as deterministic as possible, and the API credentials are read from environment variables. The LLM is then registered in `Settings` and combined with the previously created semantic index through `index.as_query_engine()`, which gives us the copilot.

In [None]:
# Create the Large Language Model
llm = AzureOpenAI(
    engine="gpt-4o", # Update with the language model deployment name 
    model="gpt-4o", # Update with the language model name
    temperature=0.0,
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

Settings.llm = llm

query_engine = index.as_query_engine() # this is the copilot
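
Conceptually, a RAG query engine retrieves the chunks most similar to the question, places them in a prompt, and asks the LLM to answer from that context. The sketch below illustrates this flow with stand-ins so that it runs without any cloud service. The names `build_rag_prompt`, `answer`, `fake_retriever` and `fake_llm`, as well as the prompt wording, are hypothetical; this is not LlamaIndex's actual prompt template or API.

```python
def build_rag_prompt(question, contexts):
    """Combine the retrieved chunks and the user question into one prompt string."""
    context_block = "\n---\n".join(contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, retriever, llm):
    """Retrieve contexts, build the prompt, and return the answer with its contexts."""
    contexts = retriever(question)
    prompt = build_rag_prompt(question, contexts)
    return llm(prompt), contexts


# Stand-in for the semantic index lookup
def fake_retriever(question):
    return ["Chunk A about emissions.", "Chunk B about water."]


# Stand-in for the call to the language model
def fake_llm(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


response, used_contexts = answer("What are the emission targets?", fake_retriever, fake_llm)
print(response)
print(used_contexts)
```

Returning the contexts together with the answer matters later on, because the quality metrics need to know which chunks the copilot actually used.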

### Let's play with the copilot just created

In the test dataset, we have downloaded not only the source data (PDFs), but also 50 examples of question/answer pairs to use with our Copilot. The following are some examples of the questions:

In [None]:
num_samples = 5

# Load some questions as example
samples = rag_dataset.to_pandas().loc[:num_samples-1, 'query'].values

display(Markdown("\n".join([ "* " + sample for sample in samples])))

The following is a simple tool that lets you ask the new copilot questions:

In [None]:
# Function to get a response
def get_response(user_input):
    return query_engine.query(user_input).response

# Interactive chatbot function
def chatbot():
    output = widgets.Output()
    scrollable_chat = widgets.Box(
        [output],
        layout=widgets.Layout(
            width="100%",
            height="300px",
            overflow="auto",
            border="1px solid #ccc",
            padding="5px",
            display="flex",
            flex_flow="column"
        )
    )
    text_box = widgets.Text(
        placeholder="Type your message here",
        description="You:",
        style={'description_width': 'initial'},
        layout=widgets.Layout(width="80%")
    )
    submit_button = widgets.Button(
        description="Send",
        button_style="primary",
        layout=widgets.Layout(width="10%")
    )

    chat_container = widgets.VBox([scrollable_chat, widgets.HBox([text_box, submit_button])])
    
    # Function to handle submission
    def on_submit(_):
        with output:

            user_message = text_box.value
            if user_message.strip():  # Process non-empty input
                display(widgets.HTML(
                    f"<div style='color: blue; font-weight: bold;'>You:</div> {text_box.value}"
                ))
                bot_response = get_response(user_message)
                display(widgets.HTML(
                    f"<div style='color: green; font-weight: bold;'>Bot:</div> {bot_response}"
                ))
            text_box.value = ""  # Clear the text box after submission
        
    # Attach event handlers
    submit_button.on_click(on_submit)
    
    display(chat_container)

# Run the chatbot
chatbot()

# Example question: What is Microsoft's most important contribution to the environment? → this is a question outside the dataset.

### Interesting case ...

<div align="left">
    <img 
        src="./img/map.png" 
        width="800"/>
</div>


# Challenge 2: *Getting answers from our new Copilot*

## Challenge 2 - Step 1: *Let's take a look at the test dataset*

First, let's see an example from the test dataset that we downloaded together with the PDF files.

In [None]:
instance_idx = 2

def create_instance_md(k, v):
    md_str = f"**{k}:**\n\n"
    if k == "reference_contexts":
        return md_str + "\n".join([f"* {c}\n" for c in v])
    return md_str + f"{v}"

display(Markdown("### This is one instance in the dataset:\n\n"))
display(Markdown("\n\n".join([create_instance_md(k, v) for k,v in rag_dataset.to_pandas().iloc[instance_idx].items()])))

### The dataset
In total, the test dataset has 50 instances like the one detailed above. Let's take a look at some of them:

In [None]:
rag_dataset.to_pandas().head(10)

## Challenge 2 - Step 2: *Let's calculate the responses*

Now, let's have the model we created **provide answers to the questions** in the dataset. To be able to calculate the quality metrics, we also need to keep the **retrieved contexts**, that is, the chunks of data retrieved from the semantic index and used to answer each question.

> This process could take hours depending on the throughput of the LLM used. To keep this practical guide within a reasonable time, let's calculate the **responses** and **retrieved contexts** for a small sample of questions. For later steps, we provide the full dataset with pre-calculated answers.

In [None]:
# Select a subset sample to test
sample_size = 5
sub_dataset = LabelledRagDataset(examples=rag_dataset.examples[:sample_size])
sub_dataset.to_pandas()

Now, let's calculate the responses for the **subset** (this should take about 1 minute):

In [None]:
%%time
predictions = sub_dataset.make_predictions_with(
    predictor = query_engine,
    show_progress = True
)

Let's look at the results:

> Because we are already preparing the data for the next challenge, the column names have changed:
> * `user_input` is the `query` from the previous format.
> * `retrieved_contexts` is a list with the chunks of data used by our copilot to answer the question.
> * `response` is the answer to the question generated by our new copilot.
> * `reference` is the `reference_answer` from the previous format, that is, the expected answer created by a human.

In [None]:
list_of_samples = []

# Pair each labelled example with the prediction made for it
for example, prediction in zip(sub_dataset.examples, predictions.predictions):
    list_of_samples.append(
        SingleTurnSample(
            user_input=example.query,
            reference=example.reference_answer,
            response=prediction.response,
            retrieved_contexts=prediction.contexts
        )
    )

ragas_evaluation_dataset = EvaluationDataset(list_of_samples)
ragas_evaluation_dataset.to_pandas()
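
The cell above is essentially a field-by-field renaming from the LlamaIndex format to the format expected by the evaluation step. The same mapping can be pictured with plain dictionaries, as in the sketch below. The records are made up for illustration and `to_evaluation_record` is a hypothetical helper, not a RAGAS or LlamaIndex function.

```python
def to_evaluation_record(example, prediction):
    """Rename the fields of one labelled example and its prediction."""
    return {
        "user_input": example["query"],
        "reference": example["reference_answer"],
        "response": prediction["response"],
        "retrieved_contexts": list(prediction["contexts"]),
    }


# Made-up records, for illustration only
example = {"query": "Q?", "reference_answer": "Expected answer."}
prediction = {"response": "Generated answer.", "contexts": ["chunk 1", "chunk 2"]}

record = to_evaluation_record(example, prediction)
print(record)
```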

# Challenge 3: *Evaluate the quality of our new copilot*

Evaluating Generative AI (GenAI) solutions, especially those using Retrieval-Augmented Generation (RAG) frameworks, involves several key steps to ensure the quality and effectiveness of the generated outputs. 

GenAI solutions are evaluated using various metrics to measure the quality of the responses generated by the AI models. Libraries like RAGAS or promptflow are instrumental in this process. They help assess the solution's performance using metrics such as faithfulness, groundedness, context precision, and context recall. These metrics provide insights into how accurate, relevant, and contextually appropriate the AI-generated responses are.

The evaluation process typically includes:

* **Model Selection and Building:** Choosing the right model and developing the GenAI application using iterative development practices and advanced prompt engineering techniques.
* **Quality Metrics:** Using metrics like faithfulness (whether the response is supported by the retrieved context), groundedness (whether the response is anchored in the source material), context precision (whether the retrieved chunks are relevant to the question), and context recall (whether the retrieved chunks cover the information needed for the reference answer).
* **Explainability:** Adding explainability to these metrics to understand why certain responses are rated higher or lower. This helps in debugging and improving the overall solution.
* **User Feedback:** Collecting feedback from users to identify unexpected issues and improve the user experience.

For this demonstration, we use RAGAS, a specialized evaluation framework designed to assess the performance of RAG systems. It provides a structured approach to evaluate the effectiveness of RAG implementations by leveraging advanced Large Language Models (LLMs) as judges. 

## Challenge 3 - Step 1: *Initialize the LLM and Embedding models*

The first step is to initialize the LLM and Embedding models to be used to calculate the GenAI metrics:

In [None]:
evaluator_llm = LlamaIndexLLMWrapper(llm)
evaluator_embeddings = LlamaIndexEmbeddingsWrapper(embed_model)

## Challenge 3 - Step 2: *Calculate the GenAI metrics*

Calculating the GenAI metrics for all the questions in the test dataset could take hours, depending on the throughput of the LLM and the embedding model used. To keep this guide short, we are going to calculate only 3 metrics for the subset created before:

* **Context precision:** Measures how many of the retrieved chunks are actually relevant to the question, rewarding retrievals that rank relevant chunks first.
* **Context recall:** Measures how much of the information needed for the reference answer is present in the retrieved chunks.
* **Faithfulness:** Measures how well the generated response is supported by the retrieved context. Faithfulness is about avoiding "hallucinations" (made-up or false information).
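
To make these three metrics less abstract, the sketch below computes them from hand-written verdicts. In RAGAS the verdicts come from an LLM judge; here they are made-up boolean lists. The formulas follow the RAGAS documentation as we understand it: context precision averages precision@k over the ranks that hold a relevant chunk, while context recall and faithfulness are fractions of supported claims. The helper names are hypothetical.

```python
def context_precision(relevance):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    `relevance` lists, in ranked order, whether each retrieved chunk was judged relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def supported_fraction(verdicts):
    """Fraction of claims judged as supported (0.0 when there are no claims)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Made-up verdicts standing in for the LLM judge
retrieved_chunk_is_relevant = [True, False, True]
reference_claim_found_in_context = [True, True, False]
response_claim_supported_by_context = [True, False]

print(round(context_precision(retrieved_chunk_is_relevant), 3))           # 0.833
print(round(supported_fraction(reference_claim_found_in_context), 3))     # context recall: 0.667
print(round(supported_fraction(response_claim_supported_by_context), 3))  # faithfulness: 0.5
```

Note that context precision depends on the order of the chunks: the same two relevant chunks ranked first and second would score 1.0.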


> This process can take up to **25 seconds** to complete

In [None]:
%%time

metrics = [
    Faithfulness(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm)
]
ragas_evaluation_result = evaluate(
    dataset=ragas_evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    run_config=RunConfig(timeout=1800, max_wait=180, max_retries=20),
    show_progress=True,
    batch_size=5
)

The following are the results of the calculation of the GenAI metrics:

In [None]:
df_ragas_result = ragas_evaluation_result.to_pandas()
df_ragas_result
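
A common next step is to average each metric over the samples to get one score per metric. The sketch below does this with the standard library on invented scores (they are not results from this notebook) and skips missing values, in case some rows have no score. The column names `faithfulness`, `context_precision` and `context_recall` are assumed to match the metric names, and `mean_ignoring_missing` is a hypothetical helper.

```python
import math
from statistics import fmean


def mean_ignoring_missing(values):
    """Mean of the values that are neither None nor NaN (NaN if nothing is left)."""
    kept = [v for v in values if v is not None and not math.isnan(v)]
    return fmean(kept) if kept else float("nan")


# Invented scores, for illustration only
rows = [
    {"faithfulness": 1.0, "context_precision": 0.5, "context_recall": 1.0},
    {"faithfulness": 0.5, "context_precision": 1.0, "context_recall": float("nan")},
]

summary = {metric: mean_ignoring_missing([row[metric] for row in rows]) for metric in rows[0]}
print(summary)
```

With the DataFrame from the cell above, the pandas equivalent would be `df_ragas_result[["faithfulness", "context_precision", "context_recall"]].mean()`, assuming those column names are present; pandas skips NaN values by default.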

### Full dataset results

Let's take a look at a table with the responses, retrieved contexts and metrics already pre-calculated:

In [None]:
df_test_dataset = pd.read_json('./test-dataset.json', orient='records')
df_test_dataset.head(20)