# Azure OpenAI Own Data - Local CSV

This notebook shows how to use a local CSV file as your own data with Azure OpenAI Service. You can pass your CSV content directly in the prompt or use it as a separate data source. With Python (or any other language that supports data handling), you can find the most relevant rows of the CSV and ask questions based on them. This is usually not the best way to access external data, but in some cases it can be useful. When you need better access to external data (for example, a large number of files), use Retrieval Augmented Generation (RAG) instead. Other notebooks cover that in more detail.

In the end, this notebook helped me understand how embeddings work and why they are so important for finding the correct answer in your data.

While creating this notebook I took inspiration from [Shweta Lodha's video](https://www.youtube.com/watch?v=wdhWQuGnnwo). The code in the video no longer works because it was written for an older version of the OpenAI services, but I have used her idea for handling a local CSV file.


## Prerequisites

- Create an Azure OpenAI resource and deploy a chat (GPT) model and an embedding model, since the code below uses both. Fill in your own *config.jsonc* file; *example-config.jsonc* shows the expected format.
- A CSV file with some data. You can use my AI-generated *synchro_elements.csv* or your own file.
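
One caveat about the *.jsonc* extension: Python's `json` module rejects comments, so a config file that contains them will not load with `json.load()`. The following is a minimal sketch of a comment-stripping loader, assuming only `//` and `/* */` comments are present (trailing commas are not handled). The helper names `strip_jsonc_comments` and `load_jsonc` are hypothetical and not part of this notebook, and the sample values are placeholders.

```python
import json
import re

# Matches either a double-quoted JSON string (kept as-is) or a // or /* */ comment (dropped).
# Matching strings first means "//" inside a URL is never mistaken for a comment.
_JSONC_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|//[^\n]*|/\*.*?\*/', re.DOTALL)


def strip_jsonc_comments(text: str) -> str:
    """Remove comments while leaving string contents such as URLs untouched."""
    return _JSONC_TOKEN.sub(
        lambda m: m.group(0) if m.group(0).startswith('"') else "", text
    )


def load_jsonc(path: str) -> dict:
    """Read a .jsonc file and parse it after stripping comments."""
    with open(path, encoding="utf-8") as f:
        return json.loads(strip_jsonc_comments(f.read()))


# Placeholder values only; the key names follow the ones this notebook reads.
sample = '''
{
    // Azure OpenAI resource settings
    "azure_oai_endpoint": "https://<your-resource>.openai.azure.com/",
    "azure_oai_key": "<your-key>", /* keep this secret */
    "azure_oai_gpt_model_name": "<your-gpt-deployment>"
}
'''
parsed = json.loads(strip_jsonc_comments(sample))
print(parsed["azure_oai_endpoint"])
```

If your config file is plain JSON without comments, the standard `json.load()` call is enough and this helper is not needed.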

## Initialize OpenAI service

In [None]:
%pip install --upgrade --quiet openai pandas tiktoken numpy

In [None]:
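# Import configuration and initialize the client
from openai import AzureOpenAI
import json

# Note: json.load() does not accept comments, so config.jsonc must contain plain JSON
with open('config.jsonc') as f:
    config = json.load(f)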

client = AzureOpenAI(
    api_version=config['azure_oai_api_version'],
    azure_endpoint=config['azure_oai_endpoint'],
    api_key=config['azure_oai_key']
)
gpt_model_name=config['azure_oai_gpt_model_name']
embedding_name=config['azure_oai_embedding_model_name']

## Import local CSV

We import the local CSV file as a pandas DataFrame, build a one-line summary of each row, and count the tokens each summary will consume in the prompt.
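
To make the shape of the summary concrete without needing the real file, here is a small pandas-free sketch using the standard `csv` module. The column names match the ones used in this notebook, but the two rows are hypothetical placeholders and are not taken from *synchro_elements.csv*. The `summarize` helper is also an assumption made for illustration.

```python
import csv
import io

# Hypothetical two-row file; the values are placeholders, not real data.
raw = (
    "abbreviation;name;description;level_features\n"
    "X1;Example element;First placeholder description;feature a, feature b\n"
    "X2;Another element;;feature c\n"
)

FIELDS = ["abbreviation", "name", "description", "level_features"]


def summarize(row: dict) -> str:
    # Empty cells stay as empty strings, so one missing value does not blank the whole summary.
    return "; ".join(f"{field}: {row.get(field) or ''}" for field in FIELDS)


rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))
summaries = [summarize(r) for r in rows]
for s in summaries:
    print(s)
```

In the pandas version, concatenating a string with a missing cell yields `NaN` for the whole summary, so calling `fillna("")` on the text columns before building the summary may be worthwhile if your CSV has gaps.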

In [None]:
import pandas as pd
import tiktoken
import numpy as np

# Set up tiktoken
encoding = tiktoken.get_encoding("cl100k_base")

# Read the CSV data
df = pd.read_csv('synchro_elements.csv',delimiter=';')
df["summarized"] = ("abbreviation: " + df["abbreviation"] + "; name: " + df["name"] + "; description: " + df["description"] + "; level_features: " + df["level_features"])
df["tokens"] = df["summarized"].apply(lambda x: len(encoding.encode(str(x))))
df


## Create functions to handle the data

Each function is described before the cell that defines it.

### Get_text_embedding

Get an embedding (a vector) for a single text.

### Get_dataframe_embeddings

Get an embedding for each row of the dataframe and return a dictionary that maps each dataframe index to its vector.
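
Calling the service once per row works for a small CSV, but larger files may benefit from sending several texts per request. The sketch below shows one way to batch, assuming an `embed_batch` callable that returns one vector per input text in the same order. The names `embed_in_batches` and `fake_embed_batch` are hypothetical, and the stand-in embedder exists only so the example runs offline.

```python
from typing import Callable, Hashable, Sequence

Vector = list[float]


def embed_in_batches(
    texts: dict[Hashable, str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 16,
) -> dict[Hashable, Vector]:
    """Embed texts in groups of batch_size and map each key to its vector."""
    keys = list(texts)
    result: dict[Hashable, Vector] = {}
    for start in range(0, len(keys), batch_size):
        batch_keys = keys[start:start + batch_size]
        vectors = embed_batch([texts[k] for k in batch_keys])
        result.update(zip(batch_keys, vectors))
    return result


# Stand-in embedder for offline use: NOT a real embedding, just text length and vowel count.
def fake_embed_batch(batch: Sequence[str]) -> list[Vector]:
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in batch]


demo = embed_in_batches({0: "alpha", 1: "beta", 2: "gamma"}, fake_embed_batch, batch_size=2)
print(demo)
```

With the client from this notebook, `embed_batch` could be something like `lambda batch: [d.embedding for d in client.embeddings.create(model=embedding_name, input=list(batch)).data]`, assuming the results come back in input order (each returned item also carries an `index` field that could be used to sort them).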

In [None]:
# Embedding functions
# Get embeddings for a single text
def get_text_embedding(text):
    result = client.embeddings.create(
        model=embedding_name,
        input=text
    )
    return result.data[0].embedding

# Get embeddings for a dataframe, keyed by row index (integers for a CSV read with the default index)
def get_dataframe_embeddings(df: pd.DataFrame) -> dict[int, list[float]]:
    return { idx: get_text_embedding(r.summarized) for idx, r in df.iterrows() }

csv_embeddings = get_dataframe_embeddings(df)
csv_embeddings



### Calculate_vector_similarity

Take two vectors and calculate their similarity as a dot product with NumPy's *np.dot()* function.
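
A plain dot product ranks vectors the same way as cosine similarity only when the vectors have unit length. OpenAI documents its embeddings as normalized to length 1; if that holds for your deployment, the dot product is sufficient. The following pure-Python sketch illustrates the relationship on two toy vectors (the helper names are illustrative and not part of this notebook).

```python
import math


def dot(x: list[float], y: list[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x: list[float], y: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def normalize(x: list[float]) -> list[float]:
    # Scale a vector to unit length.
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                             # depends on vector lengths
cos = cosine_similarity(a, b)                   # length-independent
unit_dot = dot(normalize(a), normalize(b))      # matches cos for unit vectors

print(raw_dot, cos, unit_dot)
```

If you swap in an embedding model whose vectors are not normalized, dividing by the vector lengths as in `cosine_similarity` keeps long vectors from dominating the ranking.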

### Get_docs_with_similarity

Take a query and a dictionary of embeddings, and return the similarity between the query's embedding and each entry in the dictionary. The results are sorted from highest to lowest similarity.
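
The ranking step can be tried without calling the service by using hand-made vectors. This is a minimal sketch with toy two-dimensional "embeddings"; the function name `rank_by_similarity` and the vectors are hypothetical.

```python
def rank_by_similarity(
    query_vec: list[float],
    doc_vecs: dict[int, list[float]],
) -> list[tuple[float, int]]:
    """Return (similarity, index) pairs sorted from most to least similar."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in doc_vecs.items()
    ]
    # Sort on the score only, so tied scores never fall back to comparing indexes.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy unit-length vectors standing in for real embeddings.
toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.6, 0.8]}
ranking = rank_by_similarity([0.8, 0.6], toy_embeddings)
print([idx for _, idx in ranking])
```

Sorting with an explicit `key` is a small design choice: sorting the raw tuples also works for integer indexes, but it would compare the indexes whenever two scores are equal.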

In [None]:
# Calculate similarity by taking in two vectors and returning their dot product (a higher value means a closer match)
def calculate_vector_similarity(x: list[float], y: list[float]) -> float:
    return np.dot(np.array(x), np.array(y))

# Take a query and a dictionary of embeddings, and return a list of (similarity, document index) tuples
def get_docs_with_similarity(query: str, df_embedding: dict[int, list[float]]) -> list[tuple[float, int]]:
    query_embedding = get_text_embedding(query)

    document_similarities = sorted([
        (calculate_vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in df_embedding.items()
    ], reverse=True)

    return document_similarities


get_docs_with_similarity("In which element, skaters are not holding from each other while skating?", csv_embeddings)[:3]


### Create_prompt

This function builds the prompt. It takes the question, the context embeddings and the CSV data as a dataframe, and returns a list of chat messages (a system message and a user message). The function ranks the rows by similarity to the question and adds them in that order, stopping before the selected rows would reach 500 tokens. The chosen rows are joined with the question into the variable *joined_content*, which becomes the user message. The returned list is ready to send to the Azure OpenAI Service.
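
The token-budget selection can be looked at in isolation. The sketch below takes sections that are already ranked, each with a precomputed token count, and keeps adding them until the next one would not fit. The function name, the toy sections and the token counts are hypothetical. Note that it uses a strict "would exceed the budget" check and a configurable budget, which differs slightly from the fixed 500-token cutoff in this notebook.

```python
def select_within_budget(
    ranked_sections: list[tuple[str, int]],
    budget: int,
    separator_tokens: int = 0,
) -> tuple[list[str], int]:
    """Keep sections in ranked order until the next one would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for text, tokens in ranked_sections:
        cost = tokens + separator_tokens
        if used + cost > budget:
            # Stop at the first section that does not fit, as the notebook's loop does.
            break
        chosen.append(text)
        used += cost
    return chosen, used


# Toy (text, token count) pairs, already sorted from most to least relevant.
sections = [("most relevant", 200), ("second", 250), ("third", 100), ("fourth", 10)]
chosen, used = select_within_budget(sections, budget=500, separator_tokens=2)
print(chosen, used)
```

Stopping at the first section that does not fit keeps the context strictly in relevance order; a variant that skips the oversized section and continues would use the budget more fully at the cost of including less relevant rows.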

In [None]:
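# Use the same encoding that was used to count the row tokens
encoding = tiktoken.get_encoding("cl100k_base")
# Token length of the separator that is prepended to every chosen section
separator_len = len(encoding.encode("\n- "))

def create_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> list[dict[str, str]]: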
    relevant_document_selections = get_docs_with_similarity(question, context_embeddings)

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []

    for _, section_index in relevant_document_selections:
        document_section = df.loc[section_index]

        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len >= 500:
            break

        chosen_sections.append("\n- " + document_section.summarized.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))

    system_msg = """Answer the question truthfully and to the best quality you can using the provided context. If there is not an answer in the data, say I do not have the data."""
    joined_content = "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
    message_text = [
        {
            "role": "system",
            "content": system_msg
        },
        {
            "role": "user",
            "content": joined_content
        }
    ]

    return message_text

### Get_answer

This function takes the query, the source CSV as a dataframe and the dictionary of document embeddings. It sends the prompt to the Azure OpenAI Service and returns the full response object, from which the answer text and the token usage can be read.

In [None]:
from openai.types.chat import ChatCompletion

def get_answer(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[int, list[float]]
) -> ChatCompletion:
    prompt = create_prompt(
        query,
        document_embeddings,
        df
    )

    response = client.chat.completions.create(
        messages=prompt,
        temperature=0,
        max_tokens=1000,
        model=gpt_model_name
    )

    return response

In [None]:
query = "If you should make an interesting program with five different elements, which one would you choose and why?"
response = get_answer(query, df, csv_embeddings)
print(f"\nQ: {query}\nA: {response.choices[0].message.content}")
print("Total tokens used:", response.usage.total_tokens)