#### __Use case__
Enterprise knowledge bases are common and can hold large numbers of documents across many categories, which makes retrieving the content relevant to a user query challenging. This notebook walks through how to use Azure OpenAI models to create embeddings from the documents, store them in memory, and use vector similarity search to retrieve the text most relevant to a query.

#### __Step-by-step process flow__

1. Read news articles and count the number of tokens (chunking)
2. Create embeddings using the embeddings model in Azure OpenAI
3. Leverage vector similarity to get top n similar chunks
4. Take the top n similar texts as context and build a prompt for a passage-based QnA solution

#### __Step 0: Configure connection parameters__

In [1]:
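# This notebook uses openai.embeddings_utils and openai.ChatCompletion, which were removed in openai 1.0
!pip install "openai<1.0"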
!pip install azure-identity
!pip install azure-keyvault-secrets
!pip install pandas
!pip install matplotlib
!pip install plotly
!pip install scipy
!pip install scikit-learn
!pip install transformers
!pip install azure-storage-blob



In [2]:
import openai
import os
import re
import requests
import sys
import pandas as pd
from openai.embeddings_utils import get_embedding, cosine_similarity
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

In [3]:
OPENAI_API_KEY = "<INSERT-YOUR-API-KEY>" # to be completed
AZURE_OPENAI_RESOURCE_ENDPOINT = "<INSERT-YOUR-RESOURCE-ENDPOINT>" # to be completed


openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = AZURE_OPENAI_RESOURCE_ENDPOINT

# Enter the connection string for Blob storage (in your storage account, open the "Access keys" tab and retrieve the connection string)
STORAGE_ACCOUNT_CONNECTION_STRING = "<INSERT-YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>" # to be completed
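
Hard-coded keys are easy to leak when a notebook is shared. A minimal sketch of reading the same three values from environment variables instead; the variable names (`AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_STORAGE_CONNECTION_STRING`) and the helper `read_setting` are illustrative assumptions, not names this notebook requires:

```python
import os

def read_setting(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset or empty.

    Raises KeyError when the variable is missing and no default is given.
    """
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if default is not None:
        return default
    raise KeyError(f"Environment variable {name} is not set")

# Hypothetical variable names; fall back to the notebook's placeholders when unset.
OPENAI_API_KEY = read_setting("AZURE_OPENAI_API_KEY", "<INSERT-YOUR-API-KEY>")
AZURE_OPENAI_RESOURCE_ENDPOINT = read_setting("AZURE_OPENAI_ENDPOINT", "<INSERT-YOUR-RESOURCE-ENDPOINT>")
STORAGE_ACCOUNT_CONNECTION_STRING = read_setting(
    "AZURE_STORAGE_CONNECTION_STRING", "<INSERT-YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>"
)
```

Dropping the default argument makes a missing variable fail fast with a `KeyError` rather than sending a placeholder to the service.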

#### __Step 1: Tokenization__
Read news articles and count the number of tokens (chunking)

https://platform.openai.com/tokenizer

In [4]:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
print(tokenizer.encode('Hi there'))
print(len(tokenizer.encode('Hi there')))
print(tokenizer.encode('Let us build a poc on azure openai'))
print(len(tokenizer.encode('Let us build a poc on azure openai')))

[17250, 612]
2
[5756, 514, 1382, 257, 279, 420, 319, 35560, 495, 1280, 1872]
11
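
The cells in this step count tokens but do not split long texts. One possible way to cut a text into pieces of at most `max_tokens` tokens is sketched below. `chunk_by_tokens` is a hypothetical helper, and the whitespace-based `toy_encode`/`toy_decode` pair is only a stand-in so the example runs without downloading a model; in this notebook, `tokenizer.encode` and `tokenizer.decode` could be passed instead.

```python
def chunk_by_tokens(text, encode, decode, max_tokens):
    """Split `text` into pieces whose encoded length is at most `max_tokens`."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    token_ids = encode(text)
    chunks = []
    # Walk over the token sequence in fixed-size windows and decode each window.
    for start in range(0, len(token_ids), max_tokens):
        chunks.append(decode(token_ids[start:start + max_tokens]))
    return chunks

# Stand-in tokenizer: one token per whitespace-separated word (NOT the GPT-2 tokenizer).
def toy_encode(text):
    return text.split()

def toy_decode(tokens):
    return " ".join(tokens)

pieces = chunk_by_tokens("Let us build a poc on azure openai", toy_encode, toy_decode, max_tokens=3)
print(pieces)  # ['Let us build', 'a poc on', 'azure openai']
```

Fixed-size windows can cut a sentence in half; splitting on sentence boundaries or overlapping consecutive windows are common refinements.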


In [5]:
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Connect to the Azure Storage container
connection_string = STORAGE_ACCOUNT_CONNECTION_STRING
container_name = "sampledata"
blob_name = "sample_data_dayinlifeofdatascientist.json"

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob_name)

# Download the JSON file from Azure Storage
downloaded_blob = blob_client.download_blob()
json_data = downloaded_blob.readall()



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [6]:
import io 

# Convert the bytes data to a file-like object
file_obj = io.BytesIO(json_data)

# Load JSON data into a Pandas DataFrame
df = pd.read_json(file_obj, orient='records')

# Access the articles data
#articles_df = df['articles']

# Create a new DataFrame with 'header' and 'body' as separate columns
articles_df = pd.DataFrame(df['articles'].tolist())

# Access the 'header' and 'body' columns
header_column = articles_df['header']
body_column = articles_df['body']

# Print the resulting DataFrame
print(articles_df)

                                              header  \
0  Paris to charge SUV drivers higher parking fee...   
1  Trump asks for classified documents trial to t...   
2          Tom Holland: ‘I felt enslaved to alcohol’   
3  Six dead as tourist helicopter crashes in Ever...   

                                                body  
0  Paris city hall is to impose higher parking fe...  
1  Donald Trump asked the federal judge overseein...  
2  Tom Holland, the British actor best known for ...  
3  All six people onboard a tourist helicopter in...  


In [7]:
articles_df['#tokens'] = articles_df['body'].apply(lambda x: len(tokenizer.encode(x)))
articles_df.head()

|   | header | body | #tokens |
|---|---|---|---|
| 0 | Paris to charge SUV drivers higher parking fee... | Paris city hall is to impose higher parking fe... | 503 |
| 1 | Trump asks for classified documents trial to t... | Donald Trump asked the federal judge overseein... | 510 |
| 2 | Tom Holland: ‘I felt enslaved to alcohol’ | Tom Holland, the British actor best known for ... | 670 |
| 3 | Six dead as tourist helicopter crashes in Ever... | All six people onboard a tourist helicopter in... | 590 |


#### __Step 2:  Create embeddings__
Create embeddings using the embeddings model in Azure OpenAI

In [8]:
#from openai.embeddings_utils import get_embedding, cosine_similarity

openai.api_type = "azure"
openai.api_key = OPENAI_API_KEY
openai.api_base = AZURE_OPENAI_RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

url = openai.api_base + "/openai/deployments?api-version=2022-12-01" 

r = requests.get(url, headers={"api-key": OPENAI_API_KEY})

print(r.text)

{
  "data": [
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-embedding-ada-002",
      "owner": "organization-owner",
      "id": "text-embedding-ada-002",
      "status": "succeeded",
      "created_at": 1689094873,
      "updated_at": 1689094873,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "gpt-35-turbo",
      "owner": "organization-owner",
      "id": "gpt-35-turbo",
      "status": "succeeded",
      "created_at": 1689107924,
      "updated_at": 1689107924,
      "object": "deployment"
    }
  ],
  "object": "list"
}
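
The `engine` argument used in the later cells has to match a deployment `id` from this listing. A small sketch of pulling those ids out of the response body with the standard library; `deployment_ids_for_model` is a hypothetical helper, and `sample` is an abbreviated copy of the output above:

```python
import json

def deployment_ids_for_model(response_text, model_name):
    """Return ids of successfully provisioned deployments that serve `model_name`."""
    payload = json.loads(response_text)
    return [
        item["id"]
        for item in payload.get("data", [])
        if item.get("model") == model_name and item.get("status") == "succeeded"
    ]

# Abbreviated copy of the deployments listing shown above.
sample = '''
{"data": [
  {"model": "text-embedding-ada-002", "id": "text-embedding-ada-002", "status": "succeeded"},
  {"model": "gpt-35-turbo", "id": "gpt-35-turbo", "status": "succeeded"}
], "object": "list"}
'''
print(deployment_ids_for_model(sample, "text-embedding-ada-002"))  # ['text-embedding-ada-002']
```

In the notebook, `r.text` would take the place of `sample`.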


In [9]:
pd.options.mode.chained_assignment = None #https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

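# s is the input text; sep_token is unused and kept only so existing calls still work
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # drop stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()
    
    return s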

articles_df['body']= articles_df["body"].apply(lambda x : normalize_text(x))
articles_df

|   | header | body | #tokens |
|---|---|---|---|
| 0 | Paris to charge SUV drivers higher parking fee... | Paris city hall is to impose higher parking fe... | 503 |
| 1 | Trump asks for classified documents trial to t... | Donald Trump asked the federal judge overseein... | 510 |
| 2 | Tom Holland: ‘I felt enslaved to alcohol’ | Tom Holland, the British actor best known for ... | 670 |
| 3 | Six dead as tourist helicopter crashes in Ever... | All six people onboard a tourist helicopter in... | 590 |


In [10]:
get_embedding("hello", engine = 'text-embedding-ada-002')

[-0.02504645101726055,
 -0.01940273866057396,
 -0.027782395482063293,
 -0.03103380836546421,
 -0.024649936705827713,
 0.027438750490546227,
 -0.012470357120037079,
 -0.00849861092865467,
 -0.01743338815867901,
 -0.008465568535029888,
 0.03254055976867676,
 0.004275739658623934,
 -0.024583851918578148,
 -0.0006298786029219627,
 0.01412910595536232,
 -0.0015034478856250644,
 0.03938703238964081,
 0.002009002957493067,
 0.026843979954719543,
 -0.012569485232234001,
 -0.02101522870361805,
 0.008881907910108566,
 0.008445742540061474,
 -0.0030630684923380613,
 -0.005362848285585642,
 -0.00950311217457056,
 0.01106934156268835,
 -0.0016967483097687364,
 0.003452973673120141,
 -0.023235704749822617,
 0.006730820517987013,
 -0.007903840392827988,
 -0.02392299473285675,
 -0.008901732973754406,
 0.00683986209332943,
 -0.01367972418665886,
 0.00950311217457056,
 -0.014115888625383377,
 0.02176860347390175,
 -0.010573700070381165,
 0.0034133223816752434,
 -0.014591705054044724,
 0.0052438941784203

In [11]:
articles_df['ada_v2'] = articles_df['body'].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002')) # engine should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
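
The cell above calls the embeddings endpoint once per row every time it runs. A small in-memory cache keyed by a hash of the text is one way to avoid embedding the same text twice when the cell is re-run. This is only a sketch: `make_cached_embedder` is a hypothetical helper, and `fake_embed` stands in for `get_embedding` so the example runs offline.

```python
import hashlib

def make_cached_embedder(embed_fn):
    """Wrap `embed_fn` so each distinct text is embedded only once per session."""
    cache = {}

    def cached(text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)  # only reached on a cache miss
        return cache[key]

    return cached

# Stand-in for get_embedding: records each call and returns a dummy vector.
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 0.0]

embed = make_cached_embedder(fake_embed)
embed("hello")
embed("hello")  # served from the cache, fake_embed is not called again
embed("world")
print(len(calls))  # 2
```

In the notebook, the wrapped function could be `lambda t: get_embedding(t, engine='text-embedding-ada-002')`, with `embed` passed to `apply`.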

In [12]:
articles_df

|   | header | body | #tokens | ada_v2 |
|---|---|---|---|---|
| 0 | Paris to charge SUV drivers higher parking fee... | Paris city hall is to impose higher parking fe... | 503 | [0.02916806936264038, 0.004837016109377146, 0.... |
| 1 | Trump asks for classified documents trial to t... | Donald Trump asked the federal judge overseein... | 510 | [-0.034529391676187515, -0.007772498764097691,... |
| 2 | Tom Holland: ‘I felt enslaved to alcohol’ | Tom Holland, the British actor best known for ... | 670 | [0.014175072312355042, -0.03149457275867462, 0... |
| 3 | Six dead as tourist helicopter crashes in Ever... | All six people onboard a tourist helicopter in... | 590 | [-0.0024575560819357634, -0.01783216930925846,... |


#### __Step 3: Vector search__
Leverage vector similarity to get top n similar chunks
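
The ranking idea behind this step can be sketched without any API calls: cosine similarity is the dot product of two vectors divided by the product of their lengths, and the top n chunks are the ones with the highest scores against the query vector. The helpers `cosine_sim` and `top_n` below are illustrative names (the notebook itself uses `cosine_similarity` from `openai.embeddings_utils`), and the 2-dimensional vectors are toy stand-ins for the 1536-dimensional `text-embedding-ada-002` embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def top_n(query_vec, doc_vecs, n=3):
    """Return (index, score) pairs for the n vectors most similar to query_vec."""
    scored = [(i, cosine_sim(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy vectors: document 0 points almost the same way as the query, document 1 does not.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_n([1.0, 0.1], docs, n=2)
print([index for index, _ in ranked])  # [0, 2]
```

A full scan like this is fine for a handful of articles; larger collections usually call for a vector index instead.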

In [13]:
# rank the rows of df by cosine similarity between their embeddings and the query embedding
def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine="text-embedding-ada-002" # engine should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
    )
    # score every row of the DataFrame that was passed in
    df["similarities"] = df.ada_v2.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res

question = "How many women were there during the helicopter crash in Nepal of July 11?"
res = search_docs(articles_df, question, top_n=1)

|   | header | body | #tokens | ada_v2 | similarities |
|---|---|---|---|---|---|
| 3 | Six dead as tourist helicopter crashes in Ever... | All six people onboard a tourist helicopter in... | 590 | [-0.0024575560819357634, -0.01783216930925846,... | 0.866517 |


In [14]:
pd.set_option('display.max_colwidth', None)
str(res['body'])

'3    All six people onboard a tourist helicopter in Nepal have been killed after it crashed soon after takeoff in the Everest region. The Manang Air flight was heading for the capital, Kathmandu, from near Lukla, a gateway for climbing expeditions to the world’s highest peak, with five Mexican tourists – two men and three women – and a Nepali pilot onboard. The helicopter lost contact eight minutes after taking off on Tuesday morning, the Civil Aviation Authority of Nepal (CAAN) said in a statement. “The six bodies have been recovered and brought to Kathmandu,” Pratap Babu Tiwari, the general manager at the Tribhuvan international airport, told AFP. Two helicopters were deployed for search and rescue but could not land at the crash site because of the weather. “The teams on the ground brought the bodies to the helicopters which were able to land close by,” Tiwari said. Lhakpa Sherpa, a resident who joined search and rescue efforts, described the scene as “very scary”. “It looks like t

#### __Step 4: ChatGPT__
Take the top n similar texts as context and build a prompt for a passage-based QnA solution

In [16]:
openai.api_version = "2023-05-15" 

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo", # The deployment name you chose when you deployed the GPT-35-Turbo or GPT-4 model.
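    messages=[
        # join the retrieved bodies as plain text so the prompt does not depend on pandas display settings
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI. Only use the following information: " + "\n\n".join(res["body"])},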
        {"role": "user", "content": question}
    ]
)

#print(response)

print(response['choices'][0]['message']['content'])

There were three women onboard the tourist helicopter that crashed in Nepal on July 11.
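
The prompt assembly in the cell above can also be factored into a small helper that accepts several retrieved chunks and caps the amount of context. This is a sketch under stated assumptions: `build_messages` is a hypothetical helper, the system prompt wording and the 6000-character budget are illustrative choices, and the example chunk is a shortened paraphrase of the retrieved article.

```python
def build_messages(chunks, question, max_context_chars=6000):
    """Build a chat `messages` list that restricts the model to the given chunks."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before the context budget is exceeded
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    system = (
        "Answer using only the information below. "
        "If the answer is not in it, say you do not know.\n\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_messages(
    ["Five Mexican tourists, two men and three women, and a Nepali pilot were onboard."],
    "How many women were onboard?",
)
print(messages[0]["content"])
```

In the notebook, `build_messages(res["body"].tolist(), question)` could be passed as the `messages` argument of `openai.ChatCompletion.create`.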


#### __Appendix__
Commented-out cells kept for debugging

In [None]:
'''
# Load JSON data into a Pandas DataFrame
df = pd.read_json(json_data, orient='records')

# Access the articles data
#articles_df = df['articles']

# Create a new DataFrame with 'header' and 'body' as separate columns
articles_df = pd.DataFrame(df['articles'].tolist())

# Access the 'header' and 'body' columns
header_column = articles_df['header']
body_column = articles_df['body']

# Print the resulting DataFrame
articles_df
'''

In [None]:
'''
import math
from openai.embeddings_utils import get_embedding, cosine_similarity

# Define the chunk size
CHUNK_SIZE = 1000

import re

# Perform light data cleaning (removing redundant whitespace and cleaning up punctuation)
def normalize_text(s, sep_token = " \n "):
    s = re.sub(r'\s+',  ' ', s).strip()
    # drop stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", " ")
    s = s.strip()
    
    return s

def create_embeddings(text):
    """Splits the text into chunks and returns a list of (chunk text, embeddings)."""
    # Calculate the number of chunks
    num_chunks = math.ceil(len(text) / CHUNK_SIZE)

    # Initialize an empty list to store the embeddings
    embeddings = []

    # Loop over the chunks of text
    for i in range(num_chunks):
        start = i * CHUNK_SIZE
        end = (i + 1) * CHUNK_SIZE

        chunk_text = text[start:end]
        embedding = get_embedding(normalize_text(chunk_text), engine = TEXT_SEARCH_DOC_EMBEDDING_ENGINE)
        embeddings.append((chunk_text, embedding))
    return embeddings
    
def create_embeddings_for_query(query):
    """get embeddings for the given query."""
    return get_embedding(normalize_text(query), engine = TEXT_SEARCH_QUERY_EMBEDDING_ENGINE)

'''