# Challenge 04-B - Retrieval Augmented Generation (RAG) for Unstructured Data


## Introduction

Businesses hold a lot of proprietary information that must be taken into account when answering users' questions, and these questions cannot always be answered from the data that the GPT models were trained on.

In the last notebook, we worked primarily with structured data. Enterprise data, however, is often not limited to structured formats such as CSV files or SQL tables. It may also include unstructured data such as PDF documents or images, and a single document can mix structured and unstructured content. Extracting information from these diverse formats in a comprehensible manner is a challenge. Tools like Azure Form Recognizer extract data from unstructured sources such as forms or documents. Once the data is extracted into a structured JSON format, Azure Cognitive Search can consolidate the information from the different data types into indexes, which makes it possible to retrieve the relevant documents.

In this notebook, we walk through a Retrieval Augmented Generation (RAG) use case that works with unstructured data. The RAG approach combines several technologies to improve the quality and relevance of generated outputs. We use Azure Form Recognizer to process complex documents, relying on the layout API to extract text and tables. We use Azure Cognitive Search to create an index with a semantic search configuration, which enables retrieval of the relevant document pages. Embeddings are then used to narrow those pages down to the content most closely aligned with the user's question. Finally, Azure OpenAI's ChatGPT model uses the extracted content to generate a more meaningful answer. This grounding process follows the RAG pattern introduced in the previous notebook and helps reduce inaccuracies in the generated responses.
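The retrieve, re-rank, and generate flow described above can be sketched as a small skeleton. This is only an illustration under stated assumptions, not the notebook's implementation: `answer_with_rag`, `TOY_PAGES`, and the three `toy_*` functions are hypothetical stand-ins for Cognitive Search, embedding-based re-ranking, and the chat model, chosen so that the sketch runs without any Azure resources.

```python
from typing import Callable, List


def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    generate: Callable[[str], str],
    candidates: int = 10,
    keep: int = 3,
) -> str:
    """Retrieve candidate pages, keep the closest ones, and answer from them."""
    pages = retrieve(question, candidates)      # search step
    best_pages = rerank(question, pages, keep)  # similarity step
    context = "\n\n".join(best_pages)
    prompt = (
        "Answer the question using only the pages below.\n\n"
        f"Question: {question}\n\nPages:\n{context}\n\nAnswer:"
    )
    return generate(prompt)                     # generation step


# Hypothetical stand-in data and components (assumptions, not real services)
TOY_PAGES = [
    "AutoPrompt builds prompts with a gradient-guided search.",
    "Chain-of-thought prompting elicits step-by-step reasoning.",
    "Prompt tuning learns soft prompts for a frozen model.",
]


def toy_retrieve(question: str, top: int) -> List[str]:
    # Rank pages by how many lowercase words they share with the question
    words = set(question.lower().split())
    ranked = sorted(
        TOY_PAGES,
        key=lambda page: len(words & set(page.lower().split())),
        reverse=True,
    )
    return ranked[:top]


def toy_rerank(question: str, pages: List[str], keep: int) -> List[str]:
    # A real pipeline would compare embeddings here; this toy keeps the order
    return pages[:keep]


def toy_generate(prompt: str) -> str:
    # Echo the prompt so the grounding text sent to the model is visible
    return prompt


grounded_prompt = answer_with_rag(
    "What is prompt tuning?", toy_retrieve, toy_rerank, toy_generate, keep=1
)
print(grounded_prompt)
```

Passing the three steps in as functions keeps the skeleton independent of any particular search service or model deployment.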

Your goals for this challenge are to read through this notebook, run each code block, observe the results, and then be able to answer the questions posed in the student guide.


In [36]:
# Import Azure Form Recognizer, Azure Cognitive Search, OpenAI, and other Python modules

import os, json, requests, sys, re
from pprint import pprint
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient 
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings
)


import openai
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity

from dotenv import load_dotenv
load_dotenv()

True

In [37]:
# This is a secure and recommended way to load OpenAI resource credentials and deployment names

openai.api_key = os.environ['OPENAI_API_KEY']
openai.api_base = os.environ['OPENAI_API_BASE']
openai.api_type = os.environ['OPENAI_API_TYPE']
openai.api_version = os.environ['OPENAI_API_VERSION']

chat_model = os.environ['CHAT_MODEL_NAME']
embedding_model=os.environ['EMBEDDING_MODEL_NAME']

**NOTE:** The path in the code cell below refers to the `/data/unstructured/raw` folder of the `Resources.zip` file. You may need to update this path if you are running this notebook from a location other than the one where you extracted it.

In [38]:
# -- raw data
RAW_DATA_FOLDER= '../data/unstructured/raw'
# -- extracted json file 
EXTRACTED_DATA_FOLDER = '../data/unstructured/extracted'

In [39]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
key = os.environ["AZURE_FORM_RECOGNIZER_KEY"]

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

We want to convert our unstructured data into a format that is easier for the model to work with. Form Recognizer does this through its prebuilt layout model. Here we work primarily with PDFs, but Form Recognizer also supports image formats such as JPG and PNG.

For each document, we specify how the information is extracted. In this use case, each document has many pages. To keep track of the pages, we store each page's number in `page_number`, and we store the text extracted from that page in `page_content`.
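To make the target shape concrete, here is a minimal sketch of the per-document structure. It assumes each page is already available as a list of text lines. `pages_to_record` and the sample lines are hypothetical and stand in for the Form Recognizer result, while the `filename`, `content`, `page_number`, and `page_content` keys match the ones written by the extraction code in this notebook.

```python
import json
from typing import Dict, List


def pages_to_record(file_name: str, pages: List[List[str]]) -> Dict:
    """Join each page's lines into one string and number the pages from 1."""
    content = [
        {"page_number": number, "page_content": " ".join(lines)}
        for number, lines in enumerate(pages, start=1)
    ]
    return {"filename": file_name, "content": content}


# Hypothetical sample input: two pages, each given as a list of lines
sample_pages = [
    ["Title of the paper", "Abstract text"],
    ["1 Introduction", "First paragraph"],
]
record = pages_to_record("example.pdf", sample_pages)
print(json.dumps(record, indent=2))
```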

In [40]:
def extract_local_single_file(file_name: str):
    # Send the file to the prebuilt layout model and wait for the analysis to finish
    with open(file_name, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            "prebuilt-layout", document=f
        )
        result = poller.result()
    return get_page_content(file_name, result)

def extract_files( folder_name: str, destination_folder_name: str):
    os.makedirs(destination_folder_name, exist_ok=True)
    for file in os.listdir(folder_name):
        if file[-3:].upper() in ['PDF','JPG','PNG']:
            print('Processing file:', file, end='')
        
            page_content = extract_local_single_file(os.path.join(folder_name, file))
            output_file = os.path.join(destination_folder_name, file[:-3] +'json')
            print(f'  write output to {output_file}')
            with open(output_file, "w") as f:
                f.write(json.dumps(page_content))


def get_page_content(file_name:str, result):
    page_content = []
    for page in result.pages:
        all_lines_content = []
        for line_idx, line in enumerate(page.lines):
            all_lines_content.append(' '.join([word.content for word in line.get_words()]))
        page_content.append({'page_number':page.page_number, 
                                'page_content':' '.join(all_lines_content)})
    return {'filename':file_name, 'content':page_content}





In [41]:
extract_files(RAW_DATA_FOLDER, EXTRACTED_DATA_FOLDER)

Processing file: AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf  write output to ./data/unstructured/extracted\AutoPrompt_Eliciting_Knowledge_From_LanguageModels.json
Processing file: Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.pdf  write output to ./data/unstructured/extracted\Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.json
Processing file: Generated_Knowledge_Prompting_for_Commonsense_Reasoning.pdf  write output to ./data/unstructured/extracted\Generated_Knowledge_Prompting_for_Commonsense_Reasoning.json
Processing file: LLMs_are_Human-Level_Prompt_Engineers.pdf  write output to ./data/unstructured/extracted\LLMs_are_Human-Level_Prompt_Engineers.json
Processing file: Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf  write output to ./data/unstructured/extracted\Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.json
Processing file: Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.pdf  write output to ./data/unstructured/extracted\Precis

## More About our data

For this walkthrough, we will look at several research papers on LLM topics, stored as PDF documents. The topics include autoprompting, chain-of-thought prompting, precise zero-shot dense retrieval, and more. The dataset contains several kinds of unstructured content, such as text, tables, graphs, and formulas.

## Data Description

The relevant schema for our work today consists of the following fields:

- document_id
- document_name
- file_path
- page_number
- page_text
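As an illustration of how one extracted file maps onto these fields, the sketch below builds one record per page. It is a sketch under assumptions: `build_index_records` and the sample data are hypothetical, and it uses the standard-library `ntpath` module so that both `\` and `/` separators in the stored path are handled. Unlike a split on the first dot, `ntpath.splitext` removes only the final extension.

```python
import ntpath
from typing import Dict, List


def build_index_records(extracted: Dict) -> List[Dict]:
    """Turn one extracted file into a list of per-page records."""
    file_path = extracted["filename"]
    # ntpath treats both '\\' and '/' as separators, whatever the current OS
    document_name = ntpath.basename(file_path)
    stem = ntpath.splitext(document_name)[0]
    return [
        {
            "document_id": f"{stem}-{page['page_number']}",
            "document_name": document_name,
            "file_path": file_path,
            "page_number": page["page_number"],
            "page_text": page["page_content"],
        }
        for page in extracted["content"]
    ]


# Hypothetical extracted file with a mixed-separator path
sample = {
    "filename": "../data/unstructured/raw\\Example_Paper.pdf",
    "content": [
        {"page_number": 1, "page_content": "first page text"},
        {"page_number": 2, "page_content": "second page text"},
    ],
}
records = build_index_records(sample)
print(records[0]["document_id"], records[0]["document_name"])
```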


In [7]:
documents=[]
for file in os.listdir(EXTRACTED_DATA_FOLDER):
    with open(os.path.join(EXTRACTED_DATA_FOLDER, file)) as f:
        page_content= json.loads(f.read())
    # The stored path may use '\' (Windows) or '/' separators, so split on both
    document_name = re.split(r'[\\/]', page_content['filename'])[-1]
    documents.extend(
        [
            {
                'document_id':document_name.split('.')[0] + '-' + str(page['page_number']),
                'document_name':document_name,
                'file_path':page_content['filename'],
                'page_number':page['page_number'],
                'page_text':page['page_content']
            }
            for page in page_content['content']
        ]
    )

In [8]:
#Example of a single page of research paper file that will be indexed in Azure Cognitive Search
documents[1]

{'document_id': 'AutoPrompt_Eliciting_Knowledge_From_LanguageModels-2',
 'document_name': 'AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf',
 'file_path': './data/unstructured/raw\\AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf',
 'page_number': 2,
 'page_text': 'Original Input xinp a real joy. AUTOPROMPT Xprompt a real joy. atmosphere alot dialogue Clone totally [MASK]. Trigger Tokens Œtrig atmosphere, alot, dialogue, Clone ... Masked LM p([MASK] | @prompt) Cris marvelous philanthrop Template (xinp, "trig) {sentence }[T][T][T][T][T][P]. p(y|xprompt ) + positive worse incompetence Worse + negative Figure 1: Illustration of AUTOPROMPT applied to probe a masked language model\'s (MLM\'s) ability to per- form sentiment analysis. Each input, «inp, is placed into a natural language prompt, «prompt, which contains a single [MASK] token. The prompt is created using a template, A, which combines the original input with a set of trigger tokens, «trig. The trigger tokens are share

This section will focus on Cognitive Search and the following topics:
1. Creating an index client
2. Defining the index fields with necessary attributes
3. Creating a semantic configuration
4. Loading our index with the document pages

In [9]:
# Create an SDK client
service_endpoint = os.getenv("AZURE_COGNITIVE_SEARCH_ENDPOINT")
key = os.getenv("AZURE_COGNITIVE_SEARCH_KEY")
credential = AzureKeyCredential(key)

index_name = "research-paper-index"

index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
index_client

<azure.search.documents.indexes._search_index_client.SearchIndexClient at 0x2f357dc9cd0>

In [10]:
fields = [
    SimpleField(name="document_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="page_number", type=SearchFieldDataType.Int64),
    SimpleField(name="file_path", type=SearchFieldDataType.String),
    SearchableField(name="document_name", type=SearchFieldDataType.String,
                searchable=True, retrievable=True),
    SearchableField(name="page_text", type=SearchFieldDataType.String,
                filterable=True, searchable=True, retrievable=True),
]

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="document_id"),
        prioritized_keywords_fields=[SemanticField(field_name="document_name")],
        prioritized_content_fields=[SemanticField(field_name="page_text")]
    )
)


# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 research-paper-index created


In [11]:
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded {len(documents)} documents") 

Uploaded 179 documents


In [12]:
len(result)

179

Here we see Azure Cognitive Search in action! We can retrieve the most relevant pages out of all the documents that we are working with.

In [13]:
query = "What is automated prompt engineering?"
count = 10
results = search_client.search(search_text=query, top=count, include_total_count=True)
page_chunks = []
citations = []
for result in results:
    page_chunks.append(result['page_text'])
    citations.append(result['document_name'])
    
    

In [16]:
embed_df = pd.DataFrame(page_chunks, columns = ["page_chunks"]) # DataFrame with document chunks
embed_df

Unnamed: 0,page_chunks
0,A PROMPT ENGINEERING IN THE WILD Large models ...
1,LARGE LANGUAGE PROMPT ENGINEERS MODELS ARE HUM...
2,AUTOPROMPT: Eliciting Knowledge from Language ...
3,Question Tracy used a piece of wire 4 feet lon...
4,Instruction Only In-context Only Instruction +...
5,Original Input xinp a real joy. AUTOPROMPT Xpr...
6,"> p([MASK] = w| @prompt) WEVy However, writing..."
7,Table 24: Few-shot exemplars for full chain of...
8,Task CSQA2 Prompt Generate some knowledge abou...
9,Task NumerSense Prompt Generate some numerical...


Once we have the most relevant documents, let us create embeddings for all the page chunks. This will help us find the most similar documents to our given user query.
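The similarity measure used for this ranking is cosine similarity, the dot product of two vectors divided by the product of their lengths. The following is a minimal NumPy sketch with toy two-dimensional vectors in place of real embeddings. `cosine_sim` and `rank_by_similarity` are hypothetical helper names, not the `cosine_similarity` function imported from `openai.embeddings_utils`.

```python
import numpy as np


def cosine_sim(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_by_similarity(query_vec, chunk_vecs, top_n=3):
    """Return (chunk index, score) pairs, most similar first."""
    scores = [cosine_sim(vec, query_vec) for vec in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]


# Toy vectors standing in for real embeddings (assumption for illustration)
toy_query = [1.0, 0.0]
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_by_similarity(toy_query, toy_chunks)
print(ranking)
```

Because cosine similarity depends only on direction, a long page and a short page on the same topic can score similarly even though their raw vectors differ in magnitude.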

In [17]:
# Handling Rate Limits

from openai.error import RateLimitError
from time import sleep


# Note: this overrides the get_embedding imported from openai.embeddings_utils
def get_embedding(text: str, engine: str = "text-embedding-ada-002"):
    # Retry until the request succeeds, pausing briefly whenever the rate limit is hit
    while True:
        try:
            embedding = openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]
            break
        except RateLimitError:
            sleep(2)
    return np.array(embedding).astype(np.float32)

def get_completion(prompt, model="gpt-35-turbo"): 
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        engine=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
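The retry loop above waits a fixed two seconds and retries indefinitely. A common alternative is exponential backoff with a cap on the number of attempts. The following is a minimal, library-agnostic sketch: `call_with_backoff`, `TransientError`, and `flaky` are hypothetical names used for illustration and are not part of the `openai` package. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error."""


def call_with_backoff(func, retry_on=(TransientError,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call func(), doubling the wait after each failure, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demonstration: a function that fails twice, then succeeds
delays = []
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


# Record the requested delays instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)
```

In the notebook, the same idea could wrap the embedding call by passing `retry_on=(RateLimitError,)`, assuming that exception class is imported as in the cell above.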


In [19]:
#Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, engine = embedding_model))

In [20]:
embed_df

Unnamed: 0,page_chunks,embedding
0,A PROMPT ENGINEERING IN THE WILD Large models ...,"[-0.024409005, 0.0011260926, 0.006095364, -0.0..."
1,LARGE LANGUAGE PROMPT ENGINEERS MODELS ARE HUM...,"[-0.02069425, -0.003942089, 0.00025089542, -0...."
2,AUTOPROMPT: Eliciting Knowledge from Language ...,"[-0.019387873, 0.0026623704, 0.0016202723, -0...."
3,Question Tracy used a piece of wire 4 feet lon...,"[0.012753163, 0.006085661, 0.015384983, -0.032..."
4,Instruction Only In-context Only Instruction +...,"[-0.023519488, 0.011074509, 0.017816115, -0.00..."
5,Original Input xinp a real joy. AUTOPROMPT Xpr...,"[-0.027597042, -0.014452211, 0.0037054217, -0...."
6,"> p([MASK] = w| @prompt) WEVy However, writing...","[-0.039878074, -0.0032404505, -0.0033636412, -..."
7,Table 24: Few-shot exemplars for full chain of...,"[0.013435593, 0.012619638, 0.033219036, -0.006..."
8,Task CSQA2 Prompt Generate some knowledge abou...,"[0.017375462, 0.031131035, 0.036393795, -0.005..."
9,Task NumerSense Prompt Generate some numerical...,"[-0.0027567996, 0.012855793, 0.03290082, -0.00..."


In [21]:
query_embedding = get_embedding(query, engine=embedding_model)
embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

top_results = (
    embed_df.sort_values("similarities", ascending=False)
    .reset_index(drop=True)
    .head(3)
)
top_results

Unnamed: 0,page_chunks,embedding,similarities
0,A PROMPT ENGINEERING IN THE WILD Large models ...,"[-0.024409005, 0.0011260926, 0.006095364, -0.0...",0.891098
1,LARGE LANGUAGE PROMPT ENGINEERS MODELS ARE HUM...,"[-0.02069425, -0.003942089, 0.00025089542, -0....",0.866633
2,AUTOPROMPT: Eliciting Knowledge from Language ...,"[-0.019387873, 0.0026623704, 0.0016202723, -0....",0.841785


In [23]:
prompt = f"""
Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```{query}```
List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

Answer:
"""

print(prompt)


Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```What is automated prompt engineering?```
List of Extracted Pages: ```['A PROMPT ENGINEERING IN THE WILD Large models with natural language interfaces, including models for text generation and image synthesis, have seen an increasing amount of public usage in recent years. As finding the right prompt can be difficult for humans, a number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. Among others, see, for example: · https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/ . https://techcrunch.com/2022/07/29/a-startup-is-charging-1-99-for-strings-of-text-to-feed-to-dall-e-2/ . https://news.ycombinator.com/item?id=32943224 . https://promptomania.com/stable-d

In [24]:
response = get_completion(prompt, chat_model)
print(response)

Automated prompt engineering refers to the process of automatically generating natural language instructions or prompts to steer large language models (LLMs) towards desired behaviors. This is done by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. The quality of the selected instruction is evaluated by evaluating the zero-shot performance of another LLM following the selected instruction. A number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. Some examples of these tools include Promptomania, Stable Diffusion Prompt Builder, and MagicPrompt. The goal of automated prompt engineering is to reduce the human effort involved in creating and validating effective instructions, and to improve the performance of LLMs on downstream tasks.


In [27]:

def query_search(query, count=10):
    results = search_client.search(search_text=query, top=count, include_total_count=True)
    page_chunks = []
    for result in results:
        page_chunks.append(result['page_text'])
        
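    # Build a fresh DataFrame from this query's search results (not the global one from the earlier cell)
    embed_df = pd.DataFrame(page_chunks, columns=["page_chunks"])
    # Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
    embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, engine = embedding_model))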

    query_embedding = get_embedding(query, engine=embedding_model)
    embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

    top_results = (
        embed_df.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(3)
    )
    
    prompt = f"""
    Provided below are user query and list of extracted pages from research papers separated by triple backticks.
    Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

    User Query: ```{query}```
    List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

    Answer:
    """
    
    response = get_completion(prompt, chat_model)
    return response

In [28]:
answer = query_search("How does automated prompt engineering work?", 5)
print(answer)

Automated prompt engineering involves using natural language instructions to steer large language models (LLMs) towards desired behaviors. This is achieved through the generation and selection of effective prompts, which can be difficult for humans to find. To address this, a number of guides and tools have been developed, including the Automatic Prompt Engineer (APE) algorithm, which uses LLMs to automatically generate and select instructions. APE has been shown to outperform prior LLM baselines and achieve better or comparable performance to instructions generated by human annotators on 19/24 tasks. Another automated method for creating prompts is AUTOPROMPT, which uses a gradient-guided search to create prompts for a diverse set of tasks. Using AUTOPROMPT, it has been shown that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with

In [29]:
answer = query_search("what is prompt tuning?", 10)
print(answer)

Prompt tuning refers to the process of finding the right prompt for natural language interfaces, including models for text generation and image synthesis. It can be difficult for humans to find the right prompt, so a number of guides on prompt engineering as well as tools to aid in prompt discovery have been developed. AUTOPROMPT is an automated method for generating prompts for any task, illustrated in Figure 1. Given a task, e.g., sentiment analysis, AUTOPROMPT creates a prompt by combining the original task inputs (e.g. reviews) with a collection of trigger tokens according to a template. The same set of trigger tokens is used for all inputs, and is learned using a variant of the gradient-based search strategy proposed in Wallace et al. (2019). The LM predictions for the prompt are converted to class probabilities by marginalizing over a set of associated label tokens, which can either be learned or specified ahead of time, enabling the LM to be evaluated the same as one would any o