# Creating Human Classifier with LLMs - A Python SDK Experience

This notebook takes a deep dive into a custom evaluator that emulates human scoring. Such an evaluator can be useful in system operations for evaluating benchmark and test datasets, where you need a metric that correlates with both the business's evaluation criteria and the evaluations conducted by humans. This is different from simply evaluating the structure and nuance of the English language: human evaluation also includes elements that are not structurally consistent, such as bias and variability.

# ⚠️ 🧪 CAUTION: This is not meant for production workloads!

Please note that this document does **not recommend** using LLMs as a replacement for human evaluations. Multiple studies have been conducted across standardized datasets, and as of June 2024 the research suggests LLMs are not ready to fully replace humans as evaluators. However, we proceed with the framework below to anticipate advances and to explore the possibility using techniques known to us.

To read more, refer to the study [here](https://arxiv.org/html/2406.18403v1).

## 💡 Strategy

To attempt this, we will use LLMs to understand the relationship between the prompt, context, and answer, and have the model deduce a score from it. The typical progression starts with adding your data, moves on to prompt engineering, and ends with fine-tuning. Pre-training would be considered only in extreme circumstances, so we rule it out as an option here.

We will demonstrate how to fine-tune the model, and acknowledge that there are additional strategies that we will not cover. These include:

1. RLHF (reinforcement learning from human feedback), which further improves the model based on human input as the model progresses over time
2. Chain-of-Thought evaluation, citing the human-correlation results reported in the G-Eval paper: [https://arxiv.org/abs/2303.16634](https://arxiv.org/abs/2303.16634)

## 🧠 Data Inputs

All of our inputs follow a similar structure:

* Question - this is the question being asked on behalf of our QnA system
* Ground Truth - this is the actual answer the bot should supply
* Prompt Answer - this is the answer our QnA system would provide
* Human Evaluation - this is the output our evaluator would provide
* Generalized Context - this is general context provided to the prompt, perhaps on a specific topic retrieved by the question.
* Specialized Context - this is specialized information that can give greater context specific to the question being asked

The above structure can be adapted as you see fit for your use case. Consider the following as an example record for your dataset, nothing more.

```json
{
    "question": "Are eye check-ups included in my plan?",
    "truth": "Some additional services provided by the health plan include: INN: $0.00 copay per visit. (Includes check-up and assessment). Limited to one visit per year. OON: Not included",
    "answer": "Eye exams are covered in-network with a $0.00 copay, limited to one annual visit. Out-of-network services are not included.",
    "evaluation": "5: Excellent response",
    "general_context": "Eye exams are generally not included under standard healthcare plans. However, they may be part of supplemental benefits under certain health plans, possibly managed by a third-party provider.",
    "specialized_context": "Eye Examination (with Assessment) - In Network: - $0.00 copay per visit (Includes check-up and assessment) - Limited to one visit per year - Copay/Coinsurance does not contribute toward the In-Network Maximum - Out of Network: - Not covered. Review the plan details for further information."
}
```
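Before any record reaches a prompt or a fine-tuning file, it can help to check that every field listed above is present. The following is a minimal sketch; the helper name `missing_fields` and the shortened sample record are illustrative assumptions, not part of the notebook's `src.utils`.

```python
import json

# Field names taken from the example record above.
REQUIRED_FIELDS = (
    "question", "truth", "answer", "evaluation",
    "general_context", "specialized_context",
)

def missing_fields(record):
    """Return the required fields that are absent, None, or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

# Shortened, illustrative record with one blank field.
sample = json.loads("""
{
    "question": "Are eye check-ups included in my plan?",
    "truth": "INN: $0.00 copay per visit. OON: Not included",
    "answer": "Eye exams are covered in-network with a $0.00 copay.",
    "evaluation": "5: Excellent response",
    "general_context": "Eye exams may be part of supplemental benefits.",
    "specialized_context": ""
}
""")
print(missing_fields(sample))  # ['specialized_context']
```

Records that come back with a non-empty list can be fixed or dropped before splitting the dataset.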

## 🛠️ Prerequisites

To move forward, we need to set up and have the following available:

### Infrastructure

* An Azure Subscription
* Access to Azure OpenAI Service, in a [region](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#fine-tuning-models) that supports fine-tuning

### Data

* Access to prepared datasets for both training and validations:
    * Preferably at least 50 high-quality samples; thousands are even better.

### Software

* Python 3.7.1 or greater available
* Python libraries installed, consistent with `requirements.txt`
* Jupyter Notebook runtime available to run

## ⚙️ Setup

1. To run this notebook, set up a `.env` file consistent with `.env.sample`. Populate it with the Azure OpenAI key and endpoint.
2. Install the pip requirements with `pip install -r requirements.txt`. We recommend installing into a `virtualenv`.
    1. Conda users can create an environment with: `conda create --name custom-human-evaluator python=3.9 -y`
    2. Then activate it: `conda activate custom-human-evaluator`
3. Run the following cells to work through the examples
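A quick way to confirm the `.env` file is complete is to check the variables before creating the client. This is a small sketch: the three names are the ones the credential-loading cell in Step 1 reads, while the helper `missing_env_vars` is an assumption added for illustration.

```python
import os

# Variable names as read by the notebook's client setup in Step 1.
REQUIRED_ENV_VARS = (
    "AZURE_AOAI_DEPLOYMENT_VERSION",
    "AZURE_AOAI_ENDPOINT",
    "AZURE_AOAI_API_KEY",
)

def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
```

Run it after `load_dotenv(".env")` so that values from the file are visible in `os.environ`.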

### Step 1: Python Dependencies and Setup

#### Import required Python libraries 

In [1]:
import os
import time
import openai
from openai import AzureOpenAI
import requests
import tiktoken
import numpy as np

from dotenv import load_dotenv

#### Load Azure OpenAI credentials

In [2]:
load_dotenv(".env")

client = AzureOpenAI(
    api_version=os.getenv("AZURE_AOAI_DEPLOYMENT_VERSION"),
    azure_endpoint=os.getenv("AZURE_AOAI_ENDPOINT"),
    api_key=os.getenv("AZURE_AOAI_API_KEY"),
)

### Step 2: Prepare Training & Validation Datasets

#### Convert CSV to JSONL with field mappings

Feel free to use the following code to help stage the file for you.

In [3]:
from src.utils import csv_to_jsonl_with_mapping

# Example usage - paths are resolved relative to the current working directory
source_csv = "%s/my_utils/data/evaluations/jsonl/trainingdata.csv" % os.getcwd()
output_jsonl = "%s/my_utils/data/evaluations/jsonl/CustomHumanEvaluator-Customer.jsonl" % os.getcwd()

# Define the key mapping from input keys to output keys
key_mapping = {
    "Question": "question",
    "Ground Truth": "truth",
    "Blended Response - Prompt 1": "answer",
    "Rating 1-3-5 (P1)": "evaluation",
    "General Documents Summary": "general_context",
    "Specialized  Benefit information summary": "specialized_context"
}

# Call the function with the source CSV, output path, and key mapping
csv_to_jsonl_with_mapping(source_csv, output_jsonl, key_mapping)
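The helper `csv_to_jsonl_with_mapping` lives in `src/utils` and is not shown in this notebook. As a rough idea of what such a conversion involves, here is a sketch using only the standard library. The function `csv_rows_to_jsonl` and its in-memory signature are assumptions for illustration, and the real helper may behave differently.

```python
import csv
import io
import json

def csv_rows_to_jsonl(csv_text, key_mapping):
    """Rename the mapped CSV columns and emit one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Keep only the mapped columns; unmapped columns are dropped.
        record = {new: (row.get(old) or "").strip() for old, new in key_mapping.items()}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

demo_csv = "Question,Ground Truth,Unused\nIs dental covered?,Yes,x\n"
print(csv_rows_to_jsonl(demo_csv, {"Question": "question", "Ground Truth": "truth"}))
```

Each output line is a standalone JSON object, which is the JSONL layout the later steps read.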

#### Generate the training and validation test sets

In [4]:
import json
from src.utils import split_dataset

# Example usage
# output_jsonl = 'path_to_your_jsonl_file.jsonl'  # Uncomment with your file path if not using above.
field_to_optimize = 'evaluation'  # The field on which to optimize the distribution
train_data, incr_data, val_data = split_dataset(output_jsonl, field_to_optimize, training_split=0.6, incremental_split=0.2, random_seed=10)

--- Training Dataset Stats ---
5: 5 instances
3: 2 instances
1: 4 instances
Total: 11

--- Incremental Training Dataset Stats ---
5: 3 instances
Total: 3

--- Validation Dataset Stats ---
5: 2 instances
1: 2 instances
3: 1 instances
Total: 5
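`split_dataset` is also part of `src/utils` and is not shown here. One plausible way to split on a label field is a stratified split, sketched below. The function `stratified_split` is an assumption for illustration only; the statistics above suggest the notebook's own helper distributes labels differently.

```python
import random
from collections import defaultdict

def stratified_split(records, field, train_frac=0.6, incr_frac=0.2, seed=10):
    """Split records into train / incremental / validation lists, label by label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record[field]].append(record)

    train, incr, val = [], [], []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_incr = int(len(group) * incr_frac)
        train += group[:n_train]
        incr += group[n_train:n_train + n_incr]
        val += group[n_train + n_incr:]  # whatever remains goes to validation
    return train, incr, val

# Made-up records: five rated 1, five rated 3, ten rated 5.
demo = [{"id": i, "evaluation": label} for i, label in enumerate([1] * 5 + [3] * 5 + [5] * 10)]
train, incr, val = stratified_split(demo, "evaluation")
print(len(train), len(incr), len(val))
```

Splitting per label keeps every rating represented in each subset, which matters when some ratings are rare.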



### Step 2b: Try an n-shot learning approach

We will try an n-shot learning approach so that the model can best predict the output. Here we use the validation set as the test set, matching the loop in the cell below.

In [8]:
from src.utils import generate_nshot_prompt
import re

base_prompt = "You are an evaluator providing scores for responses by call-center agents. Your answer should only be one number between 0 and 5. Your goal is to match the behavior of the human evaluators as closely as possible. \n\n"
evaluations = []
for data in val_data:
    print("Processing data!")
    # Build the n-shot prompt from the base instructions on every iteration,
    # so that examples from earlier iterations do not accumulate in the prompt
    prompt = generate_nshot_prompt(base_prompt, train_data, data, n=9)

    # Call Azure OpenAI to evaluate the new example
    completion = client.chat.completions.create(
        model="gpt-4o",  # Update to the deployment you're using, such as "gpt-35-turbo" or "gpt-4"
        messages=[
            {"role": "system", "content": "You are a healthcare call center advocate, who is evaluating answers based on answer completeness and accuracy."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,  # Adjust as necessary
        temperature=0.7  # Controls randomness, adjust based on your requirements
    )

    # Extract the first digit from the reply, so extra text around the score does not break int()
    llmEvaluation = completion.choices[0].message.content.strip()
    match = re.search(r"\d", llmEvaluation)
    evaluations.append((data, int(match.group()) if match else None, data['evaluation']))

Processing data!
Processing data!
Processing data!
Processing data!
Processing data!
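`generate_nshot_prompt` comes from `src/utils` and is not shown in this notebook. As a rough sketch of what an n-shot prompt builder could look like, the example below appends scored examples and then the unscored target. The name `build_nshot_prompt`, the `Question` / `Answer` / `Score` layout, and the `seed` parameter are assumptions, not the notebook's implementation.

```python
import random

def build_nshot_prompt(instructions, examples, target, n=9, seed=0):
    """Append up to n scored examples, then the unscored target, to the instructions."""
    rng = random.Random(seed)
    shots = rng.sample(examples, min(n, len(examples)))
    parts = [instructions.strip(), ""]
    for shot in shots:
        parts.append(f"Question: {shot['question']}")
        parts.append(f"Answer: {shot['answer']}")
        parts.append(f"Score: {shot['evaluation']}")
        parts.append("")
    # The target ends with an open "Score:" for the model to complete.
    parts.append(f"Question: {target['question']}")
    parts.append(f"Answer: {target['answer']}")
    parts.append("Score:")
    return "\n".join(parts)

demo_train = [{"question": f"Q{i}", "answer": f"A{i}", "evaluation": 5} for i in range(3)]
demo_target = {"question": "Is dental covered?", "answer": "Yes, in-network."}
print(build_nshot_prompt("Score the answer from 1 to 5.", demo_train, demo_target, n=2))
```

Because the function returns a new string and never modifies `instructions`, calling it in a loop produces a prompt of the same size for every target.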


We're able to calculate the accuracy based on our data so far:

In [9]:
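# Compare as strings so that a score of 5 matches whether the dataset stores it as 5 or "5"
accuracy = sum(1 for data, llmEval, humanEvaluation in evaluations if str(llmEval) == str(humanEvaluation).strip()) / len(evaluations)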
print(accuracy * 100)

60.0
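Accuracy treats every miss the same, whether the model was off by one rating step or by the whole scale. A complementary view is the mean absolute error between the two scores. The sketch below uses made-up score pairs for illustration, not the notebook's results, and the helper `agreement_report` is an assumption.

```python
def agreement_report(pairs):
    """Summarise agreement for (llm_score, human_score) integer pairs."""
    total = len(pairs)
    exact = sum(1 for llm, human in pairs if llm == human)
    abs_error = sum(abs(llm - human) for llm, human in pairs)
    return {
        "accuracy": exact / total,
        "mean_abs_error": abs_error / total,
    }

# Illustrative pairs only: two exact matches, one miss by 2, one miss by 4.
demo_pairs = [(5, 5), (3, 5), (1, 1), (5, 1)]
print(agreement_report(demo_pairs))  # {'accuracy': 0.5, 'mean_abs_error': 1.5}
```

A low accuracy with a small mean absolute error would suggest the model is usually close to the human score, while a large error points to outright disagreements.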


This approach is simple, but at 60% agreement it falls well short of near-perfect correlation with human evaluations. Additionally, we quickly run out of context window, so this approach would not scale when we want to learn from thousands of examples.

To achieve better results, you can use a more advanced approach, such as fine-tuning the model on your data. This requires more data and computational resources, but it can significantly improve the performance of the model. Let's try this next.

### Step 3: Upload Datasets for Fine-Tuning

We want fine-tuning to detect the behaviors in our training dataset; that is, it takes the full context paired with the input data and maps it to a target human evaluation response. Fine-tuning on Azure OpenAI uses LoRA (low-rank adaptation), which trains a small set of additional low-rank weight matrices rather than updating all of the model's parameters. The more data points we provide, the better the model can become at detecting patterns in question and answer pairs.
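To make the LoRA (low-rank adaptation) idea concrete, here is a toy NumPy sketch of a single adapted layer. The dimensions, rank, and scaling value are arbitrary illustrative choices, and nothing here reflects how Azure OpenAI configures its training internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8  # illustrative sizes only

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init

def lora_forward(x):
    """Apply the frozen weight plus the scaled low-rank update B @ A."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True: B starts at zero, so nothing changes yet
print(W.size, A.size + B.size)              # 4096 512
```

Only `A` and `B` would be updated during training, which in this toy layer is 512 values instead of 4096.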

In [10]:
from src.utils import convert_to_jsonl_in_memory, generate_prompt
import io

# Upload the training and validation dataset files to Azure OpenAI with the SDK.
training_file_name = "training_set.jsonl"
validation_file_name = "validation_set.jsonl"

SYSTEM_PROMPT = "You are a healthcare call center agent, who is evaluating answers based on answer completeness and accuracy."

def to_chat_example(data):
    # One chat-format example: system prompt, user prompt, and the human score as the assistant reply
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": generate_prompt(data)},
            {"role": "assistant", "content": str(data["evaluation"])},  # message content must be a string
        ]
    }

finetuning_train_data = [to_chat_example(data) for data in train_data]
finetuning_val_data = [to_chat_example(data) for data in val_data]

training_jsonl_content = convert_to_jsonl_in_memory(finetuning_train_data)
training_jsonl_bytes = io.BytesIO(training_jsonl_content.encode('utf-8'))

validation_jsonl_content = convert_to_jsonl_in_memory(finetuning_val_data)
validation_jsonl_bytes = io.BytesIO(validation_jsonl_content.encode('utf-8'))

training_response = client.files.create(
    file=(training_file_name, training_jsonl_bytes), 
    purpose="fine-tune"
)
training_file_id = training_response.id

validation_response = client.files.create(
    file=(validation_file_name, validation_jsonl_bytes), 
    purpose="fine-tune"
)
validation_file_id = validation_response.id
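A failed upload or a rejected training job costs time, so it can be worth checking the JSONL locally first. The sketch below looks for a few common problems in chat-format examples. The helper `validate_chat_jsonl` and the specific checks are assumptions for illustration, and they do not replace the validation the service performs.

```python
import json

def validate_chat_jsonl(jsonl_text):
    """Return a list of problems found in chat-format fine-tuning JSONL."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {number}: not a JSON object with 'messages'")
            continue
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append(f"line {number}: 'messages' is not a list of objects")
            continue
        if [m.get("role") for m in messages][-1:] != ["assistant"]:
            problems.append(f"line {number}: last message is not from the assistant")
        if any(not isinstance(m.get("content"), str) for m in messages):
            problems.append(f"line {number}: non-string content")
    return problems

good = json.dumps({"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": "5"}]})
bad = json.dumps({"messages": [{"role": "user", "content": "Q"}, {"role": "assistant", "content": 5}]})
print(validate_chat_jsonl(good + "\n" + bad))  # ['line 2: non-string content']
```

In the notebook, the same check could be run on `training_jsonl_content` and `validation_jsonl_content` before calling `client.files.create`.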

We must first confirm that the file status has completed!

In [12]:
# Retrieve file status, polling until both files have finished processing
while True:
    training_file_status = client.files.retrieve(training_file_id)
    print(f"Training file status: {training_file_status.status}")

    validation_file_status = client.files.retrieve(validation_file_id)
    print(f"Validation file status: {validation_file_status.status}")
    if training_file_status.status not in ("pending", "running") and \
        validation_file_status.status not in ("pending", "running"):
        break
    time.sleep(5)  # Wait between checks instead of polling in a tight loop


Training file status: processed
Validation file status: processed
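The same wait-and-check pattern appears again when tracking the fine-tuning job, so it can be factored into a small helper with a timeout. This is a sketch: `wait_until` and its parameters are assumptions, and the set of terminal states passed in the demo is illustrative.

```python
import time

def wait_until(check, done_states, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call check() until it returns a state in done_states or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        state = check()
        if state in done_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{state}' after {timeout} seconds")
        sleep(interval)

# Demo with a fake status source and a no-op sleep. In the notebook, check could be
# lambda: client.files.retrieve(training_file_id).status
fake_states = iter(["pending", "running", "processed"])
print(wait_until(lambda: next(fake_states), {"processed", "error"}, sleep=lambda s: None))  # processed
```

The timeout guards against a loop that never ends if a file or job gets stuck in a non-terminal state.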


### Step 4: Begin Fine-Tuning Job

Now you can submit your fine-tuning training job. 

The fine-tuning job will take some time to start and complete.

You can use the job ID to monitor the status of the fine-tuning job. 

In [14]:
response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-4o-2024-08-06", # Enter base model name. Note that in Azure OpenAI the model name contains dashes and cannot contain dot/period characters. 
    seed=105  # The seed parameter controls reproducibility of the fine-tuning job. If no seed is specified, one will be generated automatically.
)

job_id = response.id

print("Job ID:", response.id)
print("Status:", response.status)
print(response)

Job ID: ftjob-dcf10753567b43d0aa67af4bcf42b59b
Status: pending
FineTuningJob(id='ftjob-dcf10753567b43d0aa67af4bcf42b59b', created_at=1725588833, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs=-1, batch_size=-1, learning_rate_multiplier=1), model='gpt-4o-2024-08-06', object='fine_tuning.job', organization_id=None, result_files=None, seed=None, status='pending', trained_tokens=None, training_file='file-d53050ccdcf548d793854354a31f5c69', validation_file='file-2ba3a48947584c49b3aff3404e018b07', estimated_finish=None, integrations=None, updated_at=1725588833)


### Step 5: Track Fine-Tuning Job Status

You can track the training job status by running:

In [16]:
from IPython.display import clear_output

# Track fine-tuning job training status
start_time = time.time()

# Get the status of our fine-tuning job.
response = client.fine_tuning.jobs.retrieve(job_id)

status = response.status

# If the job isn't done yet, poll it every 10 seconds until it reaches a terminal state.
while status not in ["succeeded", "failed", "cancelled"]:
    time.sleep(10)
    
    response = client.fine_tuning.jobs.retrieve(job_id)
    print(response)
    print("Elapsed time: {} minutes {} seconds".format(int((time.time() - start_time) // 60), int((time.time() - start_time) % 60)))
    status = response.status
    print(f"Status: {status}")
    clear_output(wait=True)

print(f"Fine-tuning job {job_id} finished with status: {status}")

# List all fine-tuning jobs for this resource.
print("Checking other fine-tune jobs for this resource.")
response = client.fine_tuning.jobs.list()
print(f'Found {len(response.data)} fine-tune jobs.')

Fine-tuning job ftjob-dcf10753567b43d0aa67af4bcf42b59b finished with status: succeeded
Checking other fine-tune jobs for this resource.
Found 2 fine-tune jobs.


To get the full results, you can run the following:

In [17]:
# Retrieve fine_tuned_model name
response = client.fine_tuning.jobs.retrieve(job_id)
print(response)

fine_tuned_model = response.fine_tuned_model

FineTuningJob(id='ftjob-dcf10753567b43d0aa67af4bcf42b59b', created_at=1725588833, error=None, fine_tuned_model='gpt-4o-2024-08-06.ft-dcf10753567b43d0aa67af4bcf42b59b', finished_at=1725592251, hyperparameters=Hyperparameters(n_epochs=9, batch_size=1, learning_rate_multiplier=1), model='gpt-4o-2024-08-06', object='fine_tuning.job', organization_id=None, result_files=['file-9faaf6d1afba4814a4500a4ddd4ebdaf'], seed=None, status='succeeded', trained_tokens=30447, training_file='file-d53050ccdcf548d793854354a31f5c69', validation_file='file-2ba3a48947584c49b3aff3404e018b07', estimated_finish=None, integrations=None, updated_at=1725592251)


### Step 6: Deploy The Fine-Tuned Model

Model deployment must be done using the [REST API](https://learn.microsoft.com/en-us/rest/api/cognitiveservices/accountmanagement/deployments/create-or-update?view=rest-cognitiveservices-accountmanagement-2023-05-01&tabs=HTTP), which requires separate authorization, a different API path, and a different API version.

<table>
<thead>
<tr>
<th>Variable</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>token</td>
<td>There are multiple ways to generate an authorization token. The easiest method for initial testing is to launch the Cloud Shell from the <a href="https://portal.azure.com">Azure portal</a>. Then run <a href="https://learn.microsoft.com/en-us/cli/azure/account#az-account-get-access-token"><code>az account get-access-token</code></a>. You can use this token as your temporary authorization token for API testing. We recommend storing this in a new environment variable.</td>
</tr>
<tr>
<td>subscription</td>
<td>The subscription ID for the associated Azure OpenAI resource</td>
</tr>
<tr>
<td>resource_group</td>
<td>The resource group name for your Azure OpenAI resource</td>
</tr>
<tr>
<td>resource_name</td>
<td>The Azure OpenAI resource name</td>
</tr>
<tr>
<td>model_deployment_name</td>
<td>The custom name for your new fine-tuned model deployment. This is the name that will be referenced in your code when making chat completion calls.</td>
</tr>
<tr>
<td>fine_tuned_model</td>
<td>Retrieve this value from your fine-tuning job results in the previous step. It will look like <code>gpt-35-turbo-0613.ft-b044a9d3cf9c4228b5d393567f693b83</code>. You will need to add that value to the deploy_data json.</td>
</tr>
</tbody>
</table>

Make sure you have run `az login` and are authenticated, otherwise the cell below will fail!

In [19]:
import subprocess
from urllib.parse import urlparse

# Get an Azure access token via the Azure CLI
result = subprocess.run(["az", "account", "get-access-token"], capture_output=True, text=True, check=True)
token = json.loads(result.stdout)["accessToken"]

subscription = os.getenv("AZURE_AI_STUDIO_SUBSCRIPTION_ID")
resource_group = os.getenv("AZURE_AI_STUDIO_RESOURCE_GROUP_NAME")
# Derive the resource name from the AOAI endpoint host; works with or without a trailing slash
resource_name = urlparse(os.getenv("AZURE_AOAI_ENDPOINT")).hostname.split('.')[0]
model_deployment_name = "custom-evaluator-model"

deploy_params = {"api-version": "2023-05-01"} 
deploy_headers = {"Authorization": "Bearer {}".format(token), "Content-Type": "application/json"}
deploy_data = {
    "sku": {"name": "standard", "capacity": 1}, 
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": fine_tuned_model, #retrieve this value from the previous call, it will look like gpt-35-turbo-0613.ft-b044a9d3cf9c4228b5d393567f693b83
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

print("Creating a new deployment...")
request_url = f"https://management.azure.com/subscriptions/{subscription}/resourceGroups/{resource_group}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{model_deployment_name}"
r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

print(r)
print(r.reason)
print(r.json())

Creating a new deployment...
<Response [201]>
Created
{'id': '/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourceGroups/uhg-advassist-slm-eval/providers/Microsoft.CognitiveServices/accounts/slm-human-eval/deployments/custom-evaluator-model', 'type': 'Microsoft.CognitiveServices/accounts/deployments', 'name': 'custom-evaluator-model', 'sku': {'name': 'standard', 'capacity': 1}, 'properties': {'model': {'format': 'OpenAI', 'name': 'gpt-4o-2024-08-06.ft-dcf10753567b43d0aa67af4bcf42b59b', 'version': '1'}, 'versionUpgradeOption': 'NoAutoUpgrade', 'capabilities': {'chatCompletion': 'true', 'maxContextToken': '128000', 'maxOutputToken': '16384'}, 'provisioningState': 'Creating'}, 'systemData': {'createdBy': 'marcjimenez@microsoft.com', 'createdByType': 'User', 'createdAt': '2024-09-06T19:25:44.7487275Z', 'lastModifiedBy': 'marcjimenez@microsoft.com', 'lastModifiedByType': 'User', 'lastModifiedAt': '2024-09-06T19:25:44.7487275Z'}, 'etag': '"e87a8003-6555-48f6-a41f-28ce886593c6"'}


This will take quite a bit of time to run. Be sure to keep an eye on it and clean up resources as soon as you're done!
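The response above reports `'provisioningState': 'Creating'`, so the deployment is not usable yet. One way to keep an eye on it is to send a GET to the same `request_url` and read that field from the response body. The sketch below shows only the parsing step. Treating `"Succeeded"` as the success value is an assumption based on common Azure Resource Manager conventions, and the helper names are illustrative.

```python
def deployment_state(deployment):
    """Read properties.provisioningState from a deployment response body."""
    return deployment.get("properties", {}).get("provisioningState", "Unknown")

def is_ready(deployment):
    # Assumption: "Succeeded" marks a finished deployment, following the
    # usual Azure Resource Manager provisioning-state convention.
    return deployment_state(deployment) == "Succeeded"

# Shape copied from the response printed above, trimmed to the relevant keys.
creating = {"name": "custom-evaluator-model", "properties": {"provisioningState": "Creating"}}
print(deployment_state(creating), is_ready(creating))  # Creating False
```

In the notebook, the input could come from `requests.get(request_url, params=deploy_params, headers=deploy_headers).json()`.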

### Step 7: Test And Use The Deployed Fine-Tuned Model

After your fine-tuned model is deployed, you can use it like any other deployed model in either the [Chat Playground of Azure OpenAI Studio](https://oai.azure.com/), or via the chat completion API. 

For example, you can send a chat completion call to your deployed model, as shown in the following Python code snippet. 

In [None]:
for data in finetuning_val_data:
    response = client.chat.completions.create(
        model=model_deployment_name,  # The custom deployment name you chose for your fine-tuned model
        messages=data["messages"][:-1],  # Remove the last message, which is the ground truth
    )
    # Compare the human score (last message) with the fine-tuned model's reply
    print("Expected %s, got: %s" % (data["messages"][-1]['content'], response.choices[0].message.content))

### Step 8: Delete The Deployment

It is **strongly recommended** that you delete the model deployment once you're done with this tutorial and have tested a few chat completion calls against your fine-tuned model, since fine-tuned / customized models have an [hourly hosting cost](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/#pricing) associated with them once they are deployed.
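Deletion can go through the same management endpoint used for deployment, with a DELETE request instead of a PUT. The sketch below only builds the request and does not send it. The helper `build_delete_request` and the placeholder argument values are assumptions for illustration, and the API version simply mirrors the one used in Step 6.

```python
import urllib.request

def build_delete_request(subscription, resource_group, resource_name,
                         deployment_name, token, api_version="2023-05-01"):
    """Build (but do not send) a DELETE request for a model deployment."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.CognitiveServices"
        f"/accounts/{resource_name}"
        f"/deployments/{deployment_name}"
        f"?api-version={api_version}"
    )
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": f"Bearer {token}"}
    )

# Placeholder values; in the notebook these would be the variables from Step 6.
request = build_delete_request("my-subscription", "my-group", "my-resource",
                               "custom-evaluator-model", "my-token")
print(request.get_method(), request.full_url)
# To actually delete the deployment: urllib.request.urlopen(request)
```

Keeping the send step commented out avoids deleting a deployment by accident when re-running the notebook.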