# Item Attribution Recommender Project Testing

Test out the capabilities of the API on Azure OpenAI Studio.

Here are some resources:
- [Video on using Python with the studio](https://www.youtube.com/watch?v=gNCk8tW5QS8)
- [Azure Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=command-line)
- [Medium article tutorial](https://medium.com/microsoftazure/querying-structured-data-with-azure-openai-e59ee43867e5)
- [OpenAI API Documentation on Fine-Tuning](https://platform.openai.com/docs/guides/fine-tuning)
- [Microsoft documentation on Bringing Your Own Data in Python](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-python#create-the-python-app)
- [Azure SDK Community Standup 9/14/23](https://www.youtube.com/watch?v=hFBXtGqTo7A)
- [Azure OpenAI REST API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference)

In [None]:
# # Imports and API setup
# import openai
# import tiktoken
# import pandas as pd
# import numpy as np
# import requests
# import json
# import dotenv
# from azure.identity import DefaultAzureCredential
# 
# # May need to update API version and deployment for the custom data section
# openai.api_key = dbutils.secrets.get(scope="openai", key="access-token") # Added using the Databricks CLI, see below for details
# openai.api_base = dbutils.secrets.get(scope="openai", key="endpoint") 
# openai.api_type = 'azure'
# openai.api_version = '2023-05-15'
# 
# # Switch out deployments HERE
# deployment_name='gpt-4-item-recommender' #This will correspond to the custom name you chose for your deployment when you deployed a model. 

To create/utilize secrets, you'll need to use the Databricks CLI. [Here's](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/install) how to install the CLI on your local machine (I directly used a ZIP of the CLI exe). Once the CLI is installed, you can create an access token in your Databricks user profile in the top right corner of the workspace. Add this profile in the CLI by following this [page](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/authentication). 

Once all that's set up, you can use the help command on the `databricks secrets put-secret` and `databricks secrets create-scope` commands to add secrets for use in your notebooks. Don't follow the Microsoft syntax for these commands, as they use a different version of the CLI.

In [None]:
# Send a completion call to generate an answer
# Make sure to use the 2023-05-15 version of the API for this cell

#start_phrase = 'Python can be used '
#response = openai.Completion.create(engine=deployment_name, prompt=start_phrase, max_tokens=100)
#text = response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip()
#print(start_phrase+text)

In [None]:
# To add the unfi data, use the Azure Cognitive Search Index "unfi-sample" instead of the data directly in Databricks
# Set up utility functions
# These utility functions came directly from the Azure SDK Community Standup from 9/14/23, see above for link
#openai.api_type = 'azure'
#openai.api_version = '2023-08-01-preview' #Following a tutorial in terms of versioning - switch this on in case you use this cell

In [None]:
# def setup_byod(deployment_id: str) -> None:
#     """ Sets up the OpenAI Python SDK to use your own data for the chat endpoint 
# 
#        :param deployment_id: The deployment ID for the model to use with your own data.
# 
#        To remove this configuration, simply set openai.requestssession to None.
#     """
#     class BringYourOwnDataAdapter(requests.adapters.HTTPAdapter):
# 
#         def send(self, request, **kwargs):
#             request.url = f"{openai.api_base}/openai/deployments/{deployment_id}/extensions/chat/completions?api-version={openai.api_version}"
#             return super().send(request, **kwargs)
# 
#     session = requests.Session()
#     session.mount(
#         prefix=f"{openai.api_base}/openai/deployments/{deployment_id}",
#         adapter=BringYourOwnDataAdapter()
#     )
# 
#     openai.requestssession = session 

In [None]:
# SDK Configuration
# setup_byod(deployment_name)

# Set up the chat functionality
#def ask_question(question: str, show_context: bool = False) -> str:
    #completion = openai.ChatCompletion.create(
     #   messages=[{"role": "user", "content": question}],
      #  temperature=0.1,  # Keep temp low for deterministic answers
       # deployment_id=deployment_name,
        #dataSources=[
         #   {
          #      "type": "AzureCognitiveSearch",
           #     "parameters": {
            #        "endpoint": dbutils.secrets.get(
             #           scope="openai", key="search-endpoint"
              #      ),
               #     "key": dbutils.secrets.get(scope="openai", key="search-key"),
                #    "indexName": "unfi-sample",
                #},
            #}
        #],
    #)
    #if show_context:
    #    print("Context:")
    #    print(
    #        json.dumps(
     #           json.loads(completion.choices[0].message.context.messages[0].content),
      #          indent=2,
       #     )
       # )
    #return f"Answer: {completion.choices[0].message.content}"

The example above works for simple queries against a sample of the dataset in txt format, but not for the pattern completion we need from this model. Instead of continuing with the BYOD chat completion, let's try some prompt engineering and [few shot prompting](https://www.promptingguide.ai/techniques/fewshot) to get the model to recognize the pattern.
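
As a rough illustration of few-shot prompting, the sketch below turns worked (input, output) pairs into a chat message list that ends with the new query. This is one common layout, not what the cells below do (they embed the sample rows in a single prompt string). The function name `build_few_shot_messages` and the example rows are made up for illustration.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    examples: List[Tuple[str, str]],
    query: str,
    system_prompt: str = "You are a model used to fill in missing data.",
) -> List[Dict[str, str]]:
    """Turn worked (input, output) pairs into a chat message list ending in the new query."""
    messages = [{"role": "system", "content": system_prompt}]
    for source_row, completed_row in examples:
        # Each example is shown as a user turn followed by the answer we want back
        messages.append({"role": "user", "content": source_row})
        messages.append({"role": "assistant", "content": completed_row})
    # The row we actually want completed goes last
    messages.append({"role": "user", "content": query})
    return messages


# Made-up rows purely for illustration; real rows would come from the item list
examples = [
    ("SVBrand: ACME | SVDescrip: ORG APPLE JUICE", "ADV_Brand: Acme | ADV_Category: Juice"),
    ("SVBrand: GLOBEX | SVDescrip: SEA SALT CHIPS", "ADV_Brand: Globex | ADV_Category: Snacks"),
]
query = "SVBrand: INITECH | SVDescrip: OAT MILK"
messages = build_few_shot_messages(examples, query)
print(len(messages))
```

The resulting list has the same shape as the `messages` argument passed to `openai.ChatCompletion.create` elsewhere in this notebook.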

In [None]:
# Bring in data from warehouse
# df_unfi = spark.read.table("unfi_item_list")
# 
# # Separate this table into one free of null values and one containing nulls
# df_unfi_train = df_unfi.filter(df_unfi["ADV_Category"] != "NA")
# 
# df_unfi_test = df_unfi.filter(df_unfi["ADV_Category"] == "NA")

In [None]:
# # Convert a sample of the above dataframe to a str
# unfi_sample = str(df_unfi_train.head(45))
# 
# # Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
# unfi_sample_val = str(df_unfi_test.select("SVItemCD", "SVBrand", "SVDescrip", "VendPack", "SVSize", "SVPack", "UPCItem", "UPCCase").tail(5))

In [None]:
# Call the same model via the chat completion API
# First, write out the prompt input
#setup_byod(deployment_name)

# prompt = f"""
# The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
# The first eight columns are used to fill in the last 9 columns. Can you autocomplete the last 9 columns of the following sample from the UNFI Item List:
# {unfi_sample_val} and format your answer into a table?
# """
# 
# response = openai.ChatCompletion.create(
#     engine=deployment_name,
#     messages=[
#         {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
#         {"role": "user", "content": prompt}
#         ]
#     
# )
# 
# print(f"Answer: {response['choices'][0]['message']['content']}")

In [None]:
# Validation set for the above query
# display(df_unfi_test.tail(5))

Results from this are better than above - the model is actually recognizing the data and attempting to impute nulls. However, the results are inconsistent in their quality, so we'll have to keep experimenting with prompt engineering as that's been the most successful method so far.

In [None]:
# Try out splitting the task into sub-prompts
# Imputation of Category and SubCategory
# prompt = f"""
# The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
# The first eight columns are used to fill in the last 9 columns. Can you fill in the null values in the ADV_Category and ADV_SubCategory columns of the following sample from the UNFI Item List:
# {unfi_sample_val} and format your answer into a table? The ADV_Category column is based on the SVDescrip column and the ADV_SubCategory column is based on the SVBrand column.
# """
# 
# response = openai.ChatCompletion.create(
#     engine=deployment_name,
#     messages=[
#         {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
#         {"role": "user", "content": prompt}
#         ]
#     
# )
# 
# print(f"Answer: {response['choices'][0]['message']['content']}")

In [None]:
# Imputation of the rest of the columns
# prompt = f"""
# The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
# The first eight columns are used to fill in the last 7 columns. Can you autocomplete the last 7 columns of the following sample from the UNFI Item List:
# {unfi_sample_val} and format your answer into a table?
# """
# 
# response = openai.ChatCompletion.create(
#     engine=deployment_name,
#     messages=[
#         {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
#         {"role": "user", "content": prompt}
#         ]
#     
# )
# 
# print(f"Answer: {response['choices'][0]['message']['content']}")

In [None]:
# Validation set for the above queries
# display(df_unfi_test.tail(5))

In [None]:
# display(unfi_sample_val)

## Testing API Changes/2nd Code Draft

The code below is a more refined and updated version of the stuff above. I'll keep all of the preceding code for posterity's sake, but know that the code below is closer to the final version. 

In [None]:
# # New data import - use Pandas instead
# # Bring in data from warehouse
# df_unfi = spark.read.table("unfi_item_list")
# 
# # Convert to Pandas dataframe
# df_unfi = df_unfi.toPandas()
# 
# # Replace na with actual null values
# df_unfi = df_unfi.replace('NA', np.nan)
# #df_unfi.isnull().sum()
# 
# # Separate this table into one free of null values and one containing nulls
# df_unfi_test = df_unfi[~(df_unfi.notna().all(axis=1))]
# 
# df_unfi_train = df_unfi[df_unfi.notna().all(axis=1)]
# 
# # Convert a sample of the above dataframe to a str
# unfi_sample = df_unfi_train.head(800).to_string(sparsify=False, justify="center") # Keep this at 800 max while using gpt-4 turbo
# 
# # Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
# unfi_sample_val = df_unfi_test.head(20).to_string(sparsify=False, justify="center")

Between the timeout errors and the model running out of output tokens, the practical limit for output looks like **15 rows at a time**. This means we'll have to implement a loop, and we may be able to parallelize the task so several loops run at once. Asking for more than 15 rows makes the run time balloon: it goes from a few minutes to over an hour. This might be because trying to output over the token limit makes the model hang until it times out, so another option is to get a quota increase on the output tokens like we did with the input tokens.
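
A minimal sketch of the batching loop, assuming the rows to impute are already in a list or similar sequence. The helper name `iter_batches` is hypothetical, and the default of 15 comes from the limit observed above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(rows: Sequence[T], batch_size: int = 15) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# 40 placeholder rows split into batches that stay under the 15-row limit
rows = [f"row {i}" for i in range(40)]
batch_sizes = [len(batch) for batch in iter_batches(rows)]
print(batch_sizes)  # [15, 15, 10]
```

With a Pandas dataframe the same idea works by slicing with `df.iloc[start:start + batch_size]` and building one prompt per slice.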

In [None]:
# # Call the same model via the chat completion API
# # Also add more detail in the prompt on what columns map to each other
# # To-do: Create a user defined method of mapping columns
# prompt = f"""
# The following data is a sample of a dataset of product attributes called the UNFI Item List: {unfi_sample} 
# The missing item attribution values need to be filled in. Here's how the columns map to each other:
# the ADV_Category column is based on the SVItemCD column,
# the ADV_Brand column is based on the SVBrand column,
# the ADV_SubCategory column is inferred using the SVItemCD and SVDescrip columns,
# the ADV_ItemDescrip column is based on the SVDescrip column,
# the ADV_Size column is based on the SVSize column,
# the ADV_Store_Pack column is based on the VendPack column,
# the ADV_ItemUPC column is copied from the UPCItem column and should not be formatted in scientific notation,
# the ADV_CaseUPC10 column is exactly the same as the ADV_ItemUPC column and should not be formatted in scientific notation,
# and the RptLvlFilter column values can be left null.
# 
# Autocomplete the last 9 columns of the following validation sample from the UNFI Item List:
# {unfi_sample_val} and format your answer into a table using | to separate columns. Make sure your output starts with '| SVItemCD |' and ends with '| EXCLUDE      |'. Do not insert a blank row under the header in the output. 
# """
# 
# response = openai.ChatCompletion.create(
#     engine=deployment_name,
#     request_timeout=3600, # Up the time to timeout significantly - this is in seconds
#     temperature=1, # Try raising this to suppress the "I can't generate data" response
#     messages=[
#         {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
#         {"role": "user", "content": prompt}
#         ]
#     
# )
# 
# print(f"Answer: {response['choices'][0]['message']['content']}")

If you need help with the Chat Completion API and Python function, [here](https://platform.openai.com/docs/api-reference/introduction?lang=python) is a reference from OpenAI and [here's](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chatgpt?tabs=python&pivots=programming-language-chat-completions) one from Microsoft. Here's a good [Stack Overflow thread](https://stackoverflow.com/questions/75787638/openai-gpt-3-api-error-request-timed-out) on the timeout issues and how to turn the above code into production code.

In [None]:
# # Take the response from the bot and transfer the data into an exportable format
# from io import StringIO # Needed for read_csv below
# output = response['choices'][0]['message']['content']
# 
# # Trim out the non-tabular data
# output_start = output.find('| SVItemCD |')
# output_end = output.rfind('|') + 1 # Slice through the final pipe so the last cell is kept and the match doesn't depend on exact padding
# 
# # StringIO converts the trimmed response string to a file pandas read_csv can convert to a dataframe
# output_df = pd.read_csv(StringIO(output[output_start:output_end]), delimiter='|', index_col=[0], skiprows=[1]) # Skip the second line (row 1), where the model keeps putting dash marks
# 
# display(output_df)
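
For comparison, here is a standard-library sketch of the same trimming step that doesn't depend on the exact padding of the first and last cells. It keeps only lines that start with `|`, drops the dashed separator row, and leaves every value as a string (so UPCs can't drift into scientific notation). The function name `parse_pipe_table` and the sample reply are made up for illustration.

```python
from typing import Dict, List


def parse_pipe_table(text: str) -> List[Dict[str, str]]:
    """Parse a '|'-delimited table from a model reply into a list of row dicts."""
    # Keep only the lines that look like table rows; surrounding chatter is dropped
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    if not lines:
        return []

    def split_row(line: str) -> List[str]:
        inner = line[1:]  # drop the leading pipe
        if inner.endswith("|"):
            inner = inner[:-1]  # drop the trailing pipe
        return [cell.strip() for cell in inner.split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip the dashed separator row the model tends to put under the header
        if all(cell != "" and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


reply = """Here is the completed table:
| SVItemCD | ADV_Brand | RptLvlFilter |
|----------|-----------|--------------|
| 1001     | Acme      | EXCLUDE      |
| 1002     | Globex    | EXCLUDE      |
Let me know if you need anything else."""

table = parse_pipe_table(reply)
print(table)
```

If a dataframe is needed afterwards, `pd.DataFrame(table)` accepts this list of dicts directly.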

In [None]:
# # Validation set for the above query
# display(df_unfi_test.head(5))

## Testing out the Model on the AWG dataset

This dataset doesn't have any natural null values to impute, so we'll need to create nulls artificially using numpy. Source [here](https://cmdlinetips.com/2019/05/how-to-randomly-add-nan-to-pandas-dataframe/).

In [None]:
# Bring in AWG data
df_awg = spark.read.table("awg_item_list")

# Convert to Pandas dataframe
df_awg = df_awg.toPandas()

# Create artificial null values - the number of nulls is higher here than in the UNFI dataset to stress test the model a bit
# Make sure the nulls only occur in the ADV columns though!
# Define the columns where you want to introduce NaNs  
target_columns = ['ADV_Brand', 'ADV_Category', 'ADV_SubCategory', 'ADV_ItemDescrip', 'ADV_ItemUPC', 'ADV_CaseUPC10', 'ADV_Size', 'ADV_StorePack', 'RptLvlFilter']  
  
# Define the proportion of NaNs you want in your DataFrame  
nan_proportion = 0.4  # 40% NaNs  
  
# Calculate the number of values to replace with NaN for each column  
nan_count_per_column = {column: int(df_awg.shape[0] * nan_proportion) for column in target_columns}  
  
# Randomly choose indices to be replaced with NaN for each target column  
for column in target_columns:  
    nan_indices = np.random.choice(df_awg.index, nan_count_per_column[column], replace=False)  
    df_awg.loc[nan_indices, column] = np.nan  

# Separate this table into one free of null values and one containing nulls
df_awg_test = df_awg[~(df_awg.notna().all(axis=1))]

df_awg_train = df_awg[df_awg.notna().all(axis=1)]

# Convert a sample of the above dataframe to a str
awg_sample = df_awg_train.head(800).to_string(sparsify=False, justify="center") # 800 is the max while using gpt-4 turbo; drop this to 200 while using the 32k gpt 4 model

# Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
awg_sample_val = df_awg_test.head(5).to_string(sparsify=False, justify="center")

In [None]:
# # Model prompt
# prompt = f"""
# The following data is a sample of a dataset of product attributes called the AWG Item List: {awg_sample} 
# The missing item attribution values need to be filled in. Here's how the columns map to each other:
# the ADV_Category column is based on the Category Name column,
# the ADV_Brand column is based on the Brand column,
# the ADV_SubCategory column is inferred using the Sub Category Name column,
# the ADV_ItemDescrip column is based on the Item Description column,
# the ADV_Size column is based on the Size column,
# the ADV_Store_Pack column is based on the Store Pack column,
# the ADV_ItemUPC column is copied from the UPCItem column and should not be formatted in scientific notation,
# the ADV_CaseUPC10 column is exactly the same as the UPCCase column and should not be formatted in scientific notation,
# and the RptLvlFilter column values can be left null.
# 
# Autocomplete the last 9 columns of the following validation sample from the AWG Item List:
# {awg_sample_val} and format your answer into a table using | to separate columns. Make sure your output starts with '| Item Code |' and ends with '| INCLUDE      |'.
# """
# 
# response = openai.ChatCompletion.create(
#     engine=deployment_name,
#     temperature=1, # Try raising this to suppress the "I can't generate data" response
#     messages=[
#         {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
#         {"role": "user", "content": prompt}
#         ]
# 
# )
# 
# print(f"Answer: {response['choices'][0]['message']['content']}")

## Final Draft on Dev Environment

After meeting with the team on 1/22/24, I'm trying a new approach to the imputation where the data will be split column-wise and fed into the model separately. This will hopefully avoid the token/timeout issues we've been experiencing lately and allow us to feed the model more input columns for training. It also makes sense from a parallelization standpoint; we can run 6 of these column API calls at the same time to save time on the overall processing in Azure ML (our compute cluster there has 6 cores). Once the columns are imputed correctly, we'll zip the columns back together along with the original data for export. I'll also update our process to use the Azure key vault we set up instead of a local Databricks CLI.
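
A minimal sketch of the column-wise fan-out, assuming each column can be imputed independently. The `fake_impute` function is a stand-in for the real API call, and `impute_columns_in_parallel` is a hypothetical helper name. Threads are used here because the work is mostly waiting on the API, so the 6-worker default simply mirrors the core count mentioned above rather than being a hard limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def impute_columns_in_parallel(
    columns: Dict[str, List[Optional[str]]],
    impute_fn: Callable[[str, List[Optional[str]]], List[str]],
    max_workers: int = 6,
) -> Dict[str, List[str]]:
    """Run impute_fn once per column, up to max_workers columns at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(impute_fn, name, values)
            for name, values in columns.items()
        }
        # result() re-raises any exception from the worker, so failures are not silent
        return {name: future.result() for name, future in futures.items()}


def fake_impute(name: str, values: List[Optional[str]]) -> List[str]:
    """Stand-in for the API call: fills missing entries with a placeholder."""
    return [v if v is not None else f"<{name}?>" for v in values]


columns = {"ADV_Brand": ["Acme", None], "ADV_Size": [None, "12 OZ"]}
imputed = impute_columns_in_parallel(columns, fake_impute)

# Zip the imputed columns back together into rows for export
rows = [dict(zip(imputed.keys(), row)) for row in zip(*imputed.values())]
print(rows)
```

Because each column comes back as a list in its original order, zipping the lists restores the rows without needing a join key.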

In [3]:
# Imports and API setup
import openai
import tiktoken
import pandas as pd
import numpy as np
import requests
import json
# import dotenv
from azure.identity import DefaultAzureCredential
from io import StringIO
import time

# Use newer API version here
#openai.api_key = dbutils.secrets.get(scope="openai", key="access-token")
#openai.api_base = dbutils.secrets.get(scope="openai", key="endpoint") 
openai.api_type = 'azure'
openai.api_version = '2023-12-01-preview'

# Switch out deployments HERE too
deployment_name='item-recommender-main' #This will correspond to the custom name you chose for your deployment when you deployed a model. 

In [None]:
# Temporary api key and base environment setting while I can't get the databricks CLI installed
# DELETE THE STRINGS AFTER RUNNING THIS COMMAND
openai.api_key = ""
openai.api_base = ""

In [4]:
# Use the AWG data here since there's more control over where nulls can exist and the data's very similar to the UNFI data anyways
df_awg = spark.read.table("awg_item_list")

# Convert to Pandas dataframe
df_awg = df_awg.toPandas()

# Create artificial null values - the number of nulls is higher here than in the UNFI dataset to stress test the model a bit
# Make sure the nulls only occur in the ADV columns though!
# Define the columns where you want to introduce NaNs and the source columns for the imputer to read 
target_columns = ['ADV_Brand', 'ADV_Category', 'ADV_SubCategory', 'ADV_ItemDescrip', 'ADV_ItemUPC', 'ADV_CaseUPC10', 'ADV_Size', 'ADV_StorePack', 'RptLvlFilter'] # Need to make these read from the frontend in the future to be more programmatic
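# Order matters: source_columns[n] is the source for target_columns[n] in the imputation loop
# RptLvlFilter has no real source (its values can be left null), so Item Code is only a placeholder there
source_columns = ["Brand", "Category Name", "Sub Category Name", "Item Description", "UPCItem", "UPCCase", "Size", "Store Pack", "Item Code"] 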

# Define the proportion of NaNs you want in your DataFrame  
nan_proportion = 0.4  # 40% NaNs  
  
# Calculate the number of values to replace with NaN for each column  
nan_count_per_column = {column: int(df_awg.shape[0] * nan_proportion) for column in target_columns}  
  
# Randomly choose indices to be replaced with NaN for each target column
np.random.seed(42) # Reproducibility
for column in target_columns:  
    nan_indices = np.random.choice(df_awg.index, nan_count_per_column[column], replace=False)  
    df_awg.loc[nan_indices, column] = np.nan  

# Separate this table into one free of null values and one containing nulls
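# .copy() makes these independent frames, so writing imputed values back later doesn't trigger SettingWithCopyWarning
df_awg_test = df_awg[~(df_awg.notna().all(axis=1))].copy()

df_awg_train = df_awg[df_awg.notna().all(axis=1)].copy()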

NameError: name 'spark' is not defined

In [None]:
# Loop through the dataframe column-wise and call the model to impute nulls in the ADV_ columns
for n, column in enumerate(target_columns):
  # Basic prompt setup - plug in column names here from the frontend
  # Show the source column next to the target column so the model can see what each value is based on
  pair = [source_columns[n], column]
  prompt = f"""
    The following data is a source column and its target column from a dataset of product attributes called the AWG Item List: {df_awg_train.to_string(index=False, columns=pair, justify='center')} 
    The missing item attribution values need to be filled in. Here's how missing values in the ADV_ column are imputed manually:
    the {target_columns[n]} column is based on the {source_columns[n]} column. 

    Autocomplete the data in the {target_columns[n]} column of the following validation sample from the AWG Item List:
    {df_awg_test.to_string(index=False, columns=pair, justify='center')} and return only the {target_columns[n]} column as a table with one header row and one value per line.
  """ 
  # Retry loop to prevent errors if the API doesn't connect successfully
  retries = 3
  while retries > 0:
    try:
      print("Imputing...")
      # API call to OpenAI GPT-4 deployment
      response = openai.ChatCompletion.create(
        engine=deployment_name,
        temperature=1,
        messages=[
          {"role": "system", "content": "You are a model used to fill in missing data."},
          {"role": "user", "content": prompt}
        ]
      )
      data_out = response['choices'][0]['message']['content']
      print(f"Imputed in column {target_columns[n]}")
      time.sleep(2)
      break # Success: leave the retry loop (this also skips the while-else below)
    except Exception as e:
      print(e)
      retries -= 1
      if retries > 0:
        print('Request failed, retrying...')
        time.sleep(5)
  else:
    # The else belongs to the while: it runs only when the loop ends without a break, i.e. every retry failed
    print('API is not responding, all retries exhausted. Raising exception...')
    raise ValueError("Time out error - try restarting the script and checking your connection to OpenAI Studio")
  
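  # Turn the output into a Pandas series, skipping blank lines and the header row
  # This assumes the model returns one value per test row, in the same order
  rows = [row.strip() for row in data_out.split('\n') if row.strip()]
  values = rows[1:len(df_awg_test) + 1]
  # Reuse the test df's index; a fresh 0..n-1 index would not line up with the filtered rows
  data_out_series = pd.Series(values, index=df_awg_test.index[:len(values)], name=column)

  # Write the series back to the test df
  df_awg_test.loc[data_out_series.index, column] = data_out_series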


In [None]:
# Zip up the test data back with the training data now that it's filled out
display(df_awg_test.head(100))

In [None]:
### Some fixes to the above code


In [None]:
%md
# Item Attribution Recommender Project Testing

Test out the capabilities of the API on Azure OpenAI Studio

Here's some resources:
- [Video on using Python with the studio](https://www.youtube.com/watch?v=gNCk8tW5QS8)
- [Azure Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=command-line)
- [Medium article tutorial](https://medium.com/microsoftazure/querying-structured-data-with-azure-openai-e59ee43867e5)
- [OpenAI API Documentation on Fine-Tuning](https://platform.openai.com/docs/guides/fine-tuning)
- [Microsoft documentation on Bringing Your Own Data in Python](https://learn.microsoft.com/en-us/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-python#create-the-python-app)
- [Azure SDK Community Standup 9/14/23](https://www.youtube.com/watch?v=hFBXtGqTo7A)
- [Azure OpenAI REST API Documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference)
# Imports and API setup
import openai
import tiktoken
import pandas as pd
import numpy as np
import requests
import json
# import dotenv
# from azure.identity import DefaultAzureCredential

# May need to update API version and deployment for the custom data section
openai.api_key = dbutils.secrets.get(scope="openai", key="access-token") # Added using the Databricks CLI, see below for details
openai.api_base = dbutils.secrets.get(scope="openai", key="endpoint")
openai.api_type = 'azure'
openai.api_version = '2023-05-15'

# Switch out deployments HERE
deployment_name='gpt-4-item-recommender' #This will correspond to the custom name you chose for your deployment when you deployed a model. 
%md
To create/utilize secrets, you'll need to use the Databricks CLI. [Here's](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/install) how to install the CLI on your local machine (I directly used a ZIP of the CLI exe). Once the CLI is installed, you can create an access token in your Databricks user profile in the top right corner of the workspace. Add this profile in the CLI by following this [page](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/authentication).

Once all that's set up, you can use the help command on the `databricks secrets put-secrets` and `databricks secrets create-scope` commands to add secrets for use in your notebooks. Don't follow the Microsoft syntax for these commands, as they use a different version of the CLI.
# Send a completion call to generate an answer
# Make sure to use the 2023-05-15 version of the API for this cell

#start_phrase = 'Python can be used '
#response = openai.Completion.create(engine=deployment_name, prompt=start_phrase, max_tokens=100)
#text = response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip()
#print(start_phrase+text)
# To add the unfi data, use the Azure Cognitive Search Index "unfi-sample" instead of the data directly in Databricks
# Set up utility functions
# These utility functions came directly from the Azure SDK Community Standup from 9/14/23, see above for link
#openai.api_type = 'azure'
#openai.api_version = '2023-08-01-preview' #Following a tutorial in terms of versioning - switch this on in case you use this cell

def setup_byod(deployment_id: str) -> None:
    """ Sets up the OpenAI Python SDK to use your own data for the chat endpoint 

       :param deployment_id: The deployment ID for the model to use with your own data.

       To remove this configuration, simply set openai.requestssession to None.
    """
    class BringYourOwnDataAdapter(requests.adapters.HTTPAdapter):

        def send(self, request, **kwargs):
            request.url = f"{openai.api_base}/openai/deployments/{deployment_id}/extensions/chat/completions?api-version={openai.api_version}"
            return super().send(request, **kwargs)

    session = requests.Session()
    session.mount(
        prefix=f"{openai.api_base}/openai/deployments/{deployment_id}",
        adapter=BringYourOwnDataAdapter()
    )

    openai.requestssession = session
# SDK Configuration
# setup_byod(deployment_name)

# Set up the chat functionality
#def ask_question(question: str, show_context: bool = False) -> str:
#completion = openai.ChatCompletion.create(
#   messages=[{"role": "user", "content": question}],
#  temperature=0.1,  # Keep temp low for deterministic answers
# deployment_id=deployment_name,
#dataSources=[
#   {
#      "type": "AzureCognitiveSearch",
#     "parameters": {
#        "endpoint": dbutils.secrets.get(
#           scope="openai", key="search-endpoint"
#      ),
#     "key": dbutils.secrets.get(scope="openai", key="search-key"),
#    "indexName": "unfi-sample",
#},
#}
#],
#)
#if show_context:
#    print("Context:")
#    print(
#        json.dumps(
#           json.loads(completion.choices[0].message.context.messages[0].content),
#          indent=2,
#     )
# )
#return f"Answer: {completion.choices[0].message.content}"
%md
The example above is working for simple queries of a sample of the dataset in txt format, but not for the pattern completion we need of this model. Instead of continuing with the BYOD chat completion, let's instead perform some prompt engineering and [few shot prompting](https://www.promptingguide.ai/techniques/fewshot) to try and get the model to recognize the pattern.
# Bring in data from warehouse
df_unfi = spark.read.table("unfi_item_list")

# Separate this table into one free of null values and one containing nulls
df_unfi_train = df_unfi.filter(df_unfi["ADV_Category"] != "NA")

df_unfi_test = df_unfi.filter(df_unfi["ADV_Category"] == "NA")
# Convert a sample of the above dataframe to a str
unfi_sample = str(df_unfi_train.head(45))

# Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
unfi_sample_val = str(df_unfi_test.select("SVItemCD", "SVBrand", "SVDescrip", "VendPack", "SVSize", "SVPack", "UPCItem", "UPCCase").tail(5))
# Call the same model via the chat completion API
# First, write out the prompt input
#setup_byod(deployment_name)

prompt = f"""
The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
The first eight columns are used to fill in the last 9 columns. Can you autocomplete the last 9 columns of the following sample from the UNFI Item List:
{unfi_sample_val} and format your answer into a table?
"""

response = openai.ChatCompletion.create(
    engine=deployment_name,
    messages=[
        {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
        {"role": "user", "content": prompt}
    ]

)

print(f"Answer: {response['choices'][0]['message']['content']}")
# Validation set for the above query
display(df_unfi_test.tail(5))
%md
Results from this are better than above - the model is actually recognizing the data and attempting to impute nulls. However, the results are inconsistent in their quality, so we'll have to keep experimenting with prompt engineering as that's been the most successful method so far.
# Try out splitting the task into sub-prompts
# Imputation of Category and SubCategory
prompt = f"""
The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
The first eight columns are used to fill in the last 9 columns. Can you fill in the null values in the ADV_Category and ADV_SubCategory columns of the following sample from the UNFI Item List:
{unfi_sample_val} and format your answer into a table? The ADV_Category column is based on the SVDescrip column and the ADV_SubCategory column is based on the SVBrand column.
"""

response = openai.ChatCompletion.create(
    engine=deployment_name,
    messages=[
        {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
        {"role": "user", "content": prompt}
    ]

)

print(f"Answer: {response['choices'][0]['message']['content']}")
# Imputation of the rest of the columns
prompt = f"""
The following data is a sample of a dataset of product attributes called the UNFI Item List. {unfi_sample} 
The first eight columns are used to fill in the last 7 columns. Can you autocomplete the last 7 columns of the following sample from the UNFI Item List:
{unfi_sample_val} and format your answer into a table?
"""

response = openai.ChatCompletion.create(
    engine=deployment_name,
    messages=[
        {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
        {"role": "user", "content": prompt}
    ]

)

print(f"Answer: {response['choices'][0]['message']['content']}")
# Validation set for the above queries
display(df_unfi_test.tail(5))
display(unfi_sample_val)
%md
## Testing API Changes/2nd Code Draft

The code below is a more refined and updated version of the stuff above. I'll keep all of the preceding code for posterity's sake, but know that the code below is closer to the final version.
# New data import - use Pandas instead
# Bring in data from warehouse
df_unfi = spark.read.table("unfi_item_list")

# Convert to Pandas dataframe
df_unfi = df_unfi.toPandas()

# Replace na with actual null values
df_unfi = df_unfi.replace('NA', np.nan)
#df_unfi.isnull().sum()

# Separate this table into one free of null values and one containing nulls
df_unfi_test = df_unfi[~(df_unfi.notna().all(axis=1))]

df_unfi_train = df_unfi[df_unfi.notna().all(axis=1)]

# Convert a sample of the above dataframe to a str
unfi_sample = df_unfi_train.head(800).to_string(sparsify=False, justify="center") # Keep this at 800 max while using gpt-4 turbo

# Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
unfi_sample_val = df_unfi_test.head(20).to_string(sparsify=False, justify="center")
%md
Between the timeout errors and the model running out of output tokens, it looks like the limit for output is **15 rows at a time**. This means we'll have to implement a loop, and maybe we can parallelize the task so we can run multiple loops at once. Trying to get outputs beyond 15 rows makes the run time increase sharply: it goes from a few minutes to over an hour. This might be because asking for output over the token limit makes the model hang until it times out. Another possible solution is to get a quota increase on the output tokens like we did with the input tokens.
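
A minimal sketch of the batching loop mentioned above, assuming the 15-row ceiling holds. `batch_bounds` is a hypothetical helper that isn't part of this notebook; it only works out the row positions for each batch.

```python
def batch_bounds(n_rows, batch_size=15):
    """Yield (start, stop) row positions that cover n_rows in chunks of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n_rows, batch_size):
        # The last batch is shorter when n_rows isn't a multiple of batch_size
        yield start, min(start + batch_size, n_rows)


# Example with 40 rows: two full batches and one partial batch
bounds = list(batch_bounds(40))
print(bounds)
```

Each `(start, stop)` pair could then be passed to something like `df_unfi_test.iloc[start:stop].to_string(...)` to build one validation sample per API call.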
# Call the same model via the chat completion API
# Also add more detail in the prompt on what columns map to each other
# To-do: Create a user defined method of mapping columns
prompt = f"""
The following data is a sample of a dataset of product attributes called the UNFI Item List: {unfi_sample} 
The missing item attribution values need to be filled in. Here's how the columns map to each other:
the ADV_Category column is based on the SVItemCD column,
the ADV_Brand column is based on the SVBrand column,
the ADV_SubCategory column is inferred using the SVItemCD and SVDescrip columns,
the ADV_ItemDescrip column is based on the SVDescrip column,
the ADV_Size column is based on the SVSize column,
the ADV_Store_Pack column is based on the VendPack column,
the ADV_ItemUPC column is copied from the UPCItem column and should not be formatted in scientific notation,
the ADV_CaseUPC10 column is exactly the same as the ADV_ItemUPC column and should not be formatted in scientific notation,
and the RptLvlFilter column values can be left null.

Autocomplete the last 9 columns of the following validation sample from the UNFI Item List:
{unfi_sample_val} and format your answer into a table using | to separate columns. Make sure your output starts with '| SVItemCD |' and ends with '| EXCLUDE      |'. Do not insert a blank row under the header in the output. 
"""
##### I received a few responses that some variables were undefined, so I defined them earlier. Then I received
##### this message for the response object:
# To fix the error, you need to reduce the length of the messages sent to the OpenAI API. The error states that the maximum context length is 8192 tokens, but your messages resulted in 79184 tokens. Here are a few suggestions to reduce the length:
#
# 1. Remove unnecessary details from the prompt.
#    Go through the prompt and remove any extra information that is not essential for generating the desired output.
#    Focus on providing the required information and instructions concisely.
# 2. Split the prompt into multiple parts or messages.
#    Instead of sending the entire prompt as a single message, split it into smaller parts if possible.
#    Send the parts sequentially as separate messages to stay within the token limit.
# 3. Reduce the length of variable values.
#    If the variables unfi_sample and unfi_sample_val contain long strings, try to shorten them or use a summarization technique.
# 4. Use shorter or more concise language.
#    Review the content of the prompt and messages, and see if you can express the same information using fewer words or shorter sentences.

response = openai.ChatCompletion.create(
    engine=deployment_name,
    request_timeout=3600, # Up the time to timeout significantly - this is in seconds
    temperature=1, # Try raising this to suppress the "I can't generate data" response
    messages=[
        {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
        {"role": "user", "content": prompt}
    ]

)

print(f"Answer: {response['choices'][0]['message']['content']}")
%md
If you need help with the Chat Completion API and its Python function, [here](https://platform.openai.com/docs/api-reference/introduction?lang=python) is a reference from OpenAI and [here's](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chatgpt?tabs=python&pivots=programming-language-chat-completions) one from Microsoft. Here's a good [Stack Overflow thread](https://stackoverflow.com/questions/75787638/openai-gpt-3-api-error-request-timed-out) on the timeout issues and how to turn the above code into production code.
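
As a rough illustration of the production-style handling discussed in that thread, here's a sketch of a retry wrapper with a doubling wait between attempts. `call_with_retries` and its parameters are hypothetical names for this example, not part of the OpenAI library, and the flaky function only simulates a timeout.

```python
import time


def call_with_retries(fn, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fn() up to `retries` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as err:  # In real code, narrow this to the client's timeout/rate-limit errors
            last_error = err
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"All {retries} attempts failed") from last_error


# Simulated API call that times out twice before succeeding
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

delays = []  # Record the waits instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In this notebook the wrapped function would be something like `lambda: openai.ChatCompletion.create(engine=deployment_name, request_timeout=600, messages=[...])`, with a shorter `request_timeout` so a hung call fails quickly and gets retried.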
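# Take the response from the bot and transfer the data into an exportable format
from io import StringIO # Needed here - the first setup cell doesn't import it

output = response['choices'][0]['message']['content']

# Trim out the non-tabular data
output_start = output.find('| SVItemCD |')
# Match on the text only since the model's cell padding varies, then keep the rest of that last row
output_end = output.find('\n', output.rfind('EXCLUDE'))
if output_end == -1: # The table runs to the end of the response
    output_end = len(output)

# StringIO converts the trimmed response string to a file pandas read_csv can convert to a dataframe
output_df = pd.read_csv(StringIO(output[output_start:output_end]), delimiter='|', index_col=[0], skiprows=[1]) # Skip the first row where the model keeps putting dash marks

display(output_df)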
# Validation set for the above query
display(df_unfi_test.head(5))
%md
## Testing out the Model on the AWG dataset

This dataset doesn't have any natural null values to impute, so we'll need to create nulls artificially using numpy. Source [here](https://cmdlinetips.com/2019/05/how-to-randomly-add-nan-to-pandas-dataframe/)
# Bring in AWG data
df_awg = spark.read.table("awg_item_list")

# Convert to Pandas dataframe
df_awg = df_awg.toPandas()

# Create artificial null values - the number of nulls is higher here than in the UNFI dataset to stress test the model a bit
# Make sure the nulls only occur in the ADV columns though!
# Define the columns where you want to introduce NaNs  
target_columns = ['ADV_Brand', 'ADV_Category', 'ADV_SubCategory', 'ADV_ItemDescrip', 'ADV_ItemUPC', 'ADV_CaseUPC10', 'ADV_Size', 'ADV_StorePack', 'RptLvlFilter']

# Define the proportion of NaNs you want in your DataFrame  
nan_proportion = 0.4  # 40% NaNs  

# Calculate the number of values to replace with NaN for each column  
nan_count_per_column = {column: int(df_awg.shape[0] * nan_proportion) for column in target_columns}

# Randomly choose indices to be replaced with NaN for each target column  
for column in target_columns:
    nan_indices = np.random.choice(df_awg.index, nan_count_per_column[column], replace=False)
    df_awg.loc[nan_indices, column] = np.nan

# Separate this table into one free of null values and one containing nulls
df_awg_test = df_awg[~(df_awg.notna().all(axis=1))]

df_awg_train = df_awg[df_awg.notna().all(axis=1)]

# Convert a sample of the above dataframe to a str
awg_sample = df_awg_train.head(800).to_string(sparsify=False, justify="center") # 800 rows is the max for gpt-4 turbo - drop this to 200 while using the 32k gpt 4 model

# Create a "validation" set to test the AI's ability to recognize the pattern from the first sample
awg_sample_val = df_awg_test.head(5).to_string(sparsify=False, justify="center")
# Model prompt
prompt = f"""
The following data is a sample of a dataset of product attributes called the AWG Item List: {awg_sample} 
The missing item attribution values need to be filled in. Here's how the columns map to each other:
the ADV_Category column is based on the Category Name column,
the ADV_Brand column is based on the Brand column,
the ADV_SubCategory column is inferred using the Sub Category Name column,
the ADV_ItemDescrip column is based on the Item Description column,
the ADV_Size column is based on the Size column,
the ADV_StorePack column is based on the Store Pack column,
the ADV_ItemUPC column is copied from the UPCItem column and should not be formatted in scientific notation,
the ADV_CaseUPC10 column is exactly the same as the UPCCase column and should not be formatted in scientific notation,
and the RptLvlFilter column values can be left null.

Autocomplete the last 9 columns of the following validation sample from the AWG Item List:
{awg_sample_val} and format your answer into a table using | to separate columns. Make sure your output starts with '| Item Code |' and ends with '| INCLUDE      |'.
"""

response = openai.ChatCompletion.create(
    engine=deployment_name,
    temperature=1, # Try raising this to suppress the "I can't generate data" response
    messages=[
        {"role": "system", "content": "You are a model used to fill in missing data."}, # Try out different language to avoid imputation tactics on the part of the model
        {"role": "user", "content": prompt}
    ]

)

print(f"Answer: {response['choices'][0]['message']['content']}")
%md
## Final Draft on Dev Environment

After meeting with the team on 1/22/24, I'm trying a new approach to the imputation where the data will be split column-wise and fed into the model separately. This will hopefully avoid the token/timeout issues we've been experiencing lately and allow us to feed the model more input columns for training. It also makes sense from a parallelization standpoint; we can run 6 of these column API calls at the same time to save time on the overall processing in Azure ML (our compute cluster there has 6 cores). Once the columns are imputed correctly, we'll zip the columns back together along with the original data for export. I'll also update our process to use the Azure key vault we set up instead of a local Databricks CLI.
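
To make the parallelization idea concrete, here's a minimal sketch using a thread pool from the standard library. `impute_column` is a hypothetical stand-in for the per-column API call and just returns fake output; threads are an assumption here, chosen because the calls spend most of their time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def impute_column(column):
    """Placeholder for one per-column API call; returns the column name and fake output."""
    return column, f"imputed values for {column}"


columns = ["ADV_Brand", "ADV_Category", "ADV_SubCategory"]

# max_workers=6 mirrors the 6 cores mentioned above; API rate limits may call for fewer
with ThreadPoolExecutor(max_workers=6) as pool:
    # pool.map returns results in the same order as the input columns
    results = dict(pool.map(impute_column, columns))

print(results["ADV_Brand"])
```

Since `results` is keyed by column name, the imputed columns can be matched back up with the original data before export regardless of which call finishes first.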
# Temporary api key and base environment setting while I can't get the databricks CLI installed
# DELETE THE STRINGS AFTER RUNNING THIS COMMAND
import openai # Needs to be imported before the key and base can be set

openai.api_key = ""
openai.api_base = ""
# I commented out some package imports that did not seem to be used
# and were problematic to install

# Imports and API setup
import openai
import tiktoken
import pandas as pd
import numpy as np
import requests
import json
# import dotenv
# from azure.identity import DefaultAzureCredential
from io import StringIO
import time

# Use newer API version here
#openai.api_key = dbutils.secrets.get(scope="openai", key="access-token")
#openai.api_base = dbutils.secrets.get(scope="openai", key="endpoint") 
openai.api_type = 'azure'
openai.api_version = '2023-12-01-preview'

# Switch out deployments HERE too
deployment_name='item-recommender-main' #This will correspond to the custom name you chose for your deployment when you deployed a model. 
# Use the AWG data here since there's more control over where nulls can exist and the data's very similar to the UNFI data anyways
df_awg = spark.read.table("awg_item_list")

# Convert to Pandas dataframe
df_awg = df_awg.toPandas()

# Create artificial null values - the number of nulls is higher here than in the UNFI dataset to stress test the model a bit
# Make sure the nulls only occur in the ADV columns though!
# Define the columns where you want to introduce NaNs and the source columns for the imputer to read 
target_columns = ['ADV_Brand', 'ADV_Category', 'ADV_SubCategory', 'ADV_ItemDescrip', 'ADV_ItemUPC', 'ADV_CaseUPC10', 'ADV_Size', 'ADV_StorePack', 'RptLvlFilter'] # Need to make these read from the frontend in the future to be more programmatic
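# Ordered to line up index-for-index with target_columns. RptLvlFilter has no real source column, so Item Code is only a placeholder there
source_columns = ["Brand", "Category Name", "Sub Category Name", "Item Description", "UPCItem", "UPCCase", "Size", "Store Pack", "Item Code"]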

# Define the proportion of NaNs you want in your DataFrame  
nan_proportion = 0.4  # 40% NaNs  

# Calculate the number of values to replace with NaN for each column  
nan_count_per_column = {column: int(df_awg.shape[0] * nan_proportion) for column in target_columns}

# Randomly choose indices to be replaced with NaN for each target column
np.random.seed(42) # Reproducibility
for column in target_columns:
    nan_indices = np.random.choice(df_awg.index, nan_count_per_column[column], replace=False)
    df_awg.loc[nan_indices, column] = np.nan

# Separate this table into one free of null values and one containing nulls
df_awg_test = df_awg[~(df_awg.notna().all(axis=1))].copy() # Copy so the imputed columns can be written back without pandas warnings

df_awg_train = df_awg[df_awg.notna().all(axis=1)]
# Loop through the dataframe column-wise and call the model to impute nulls in the ADV_ columns
for n, column in enumerate(target_columns):
    source = source_columns[n]
    # Basic prompt setup - plug in column names from here from frontend
    # Send the source column alongside the target column so the model has something to impute from
    prompt = f"""
    The following data is a pair of columns from a dataset of product attributes called the AWG Item List: {df_awg_train.to_string(index=False, columns=[source, column], justify='center')} 
    The missing item attribution values need to be filled in. Here's how missing values in the ADV_ column are imputed manually:
    the {column} column is based on the {source} column. 

    Autocomplete the data in the {column} column of the following validation sample from the AWG Item List:
    {df_awg_test.to_string(index=False, columns=[source, column], justify='center')} and format your answer into a table.
    """
    # While loop to prevent errors if the API doesn't connect successfully
    retries = 3
    while retries > 0:
        try:
            print("Imputing...")
            # API call to OpenAI GPT-4 deployment
            response = openai.ChatCompletion.create(
                engine=deployment_name,
                temperature=1,
                messages=[
                    {"role": "system", "content": "You are a model used to fill in missing data."},
                    {"role": "user", "content": prompt}
                ]
            )
            # Only read the response after the call succeeds - reading it before the try block fails on the first pass
            data_out = response['choices'][0]['message']['content']
            print(f"Imputed in column {column}")
            time.sleep(2)
            break # End the while loop after it succeeds in calling the API
        except Exception as e:
            print(e)
            retries -= 1
            if retries > 0:
                print('API call failed, retrying...')
                time.sleep(5)
    else:
        # while/else: this only runs when the loop ends without hitting break, i.e. every retry failed
        print('API is not responding, all retries exhausted. Raising exception...')
        raise ValueError("Time out error - try restarting the script and checking your connection to OpenAI Studio")

    # Turn the output into a Pandas series
    rows = data_out.split("\n") # Split on actual newlines, not the quoted literal
    data_out_series = pd.Series(rows[1:], name=column)

    # Append the series back to the test df
    # Assign by position since the series index (0..n) doesn't line up with the test df's index
    if len(data_out_series) == len(df_awg_test):
        df_awg_test[column] = data_out_series.values
    else:
        print(f"Row count mismatch in {column}: got {len(data_out_series)} rows, expected {len(df_awg_test)}")

# Zip up the test data back with the training data now that it's filled out
display(df_awg_test.head(100))