<a href="https://colab.research.google.com/github/nirb28/llm/blob/main/PII_FT_Llama_Eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Supercharge Small Language Models

Author: Thierry Moreau - Co-founder, Head of DevRel @ OctoAI

Authored date: Aug 22nd 2024

### Intro

In this notebook you'll learn how to fine-tune an open source LLM (Llama3.1-8B) from scratch to perform a specialized task - PII redaction via function calling.

**We'll show that by taking a small and efficient LLM like Llama3.1-8B and fine-tuning it, you can achieve significant quality improvements over a state-of-the-art model like GPT-4o, while also achieving significant cost savings (>32x cheaper).**

![llm evaluation results](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/leaderboard-results.png)

This notebook is divided into 4 parts:

1. Fine-tuning dataset collection
2. Kick off fine-tuning on OpenPipe
3. Deploy your fine-tune on OctoAI
4. Evaluate your fine-tune against GPT-4o under different prompting scenarios

### Pre-requisites

## OctoAI

We'll use OctoAI to deploy the resulting fine-tune on an inference endpoint.

OctoAI offers efficient, customizable and reliable GenAI inference endpoints. You can sign up for an account on http://octoai.cloud/, and get access to the latest and greatest open source LLMs.

Create an API token by following [this guide](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) and save it somewhere safe: we'll need it later in this notebook.

By signing up you will automatically get $10 in credits, enough to generate 66.7M Llama3.1-8B tokens.
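The 66.7M figure follows from the $0.15 per million tokens price that this notebook's cost table lists for Llama3.1-8B. A quick check of the arithmetic, with illustrative variable names:

```python
# Back-of-the-envelope check of how far the sign-up credits go.
# Assumes the $0.15 per million tokens price from this notebook's cost table.
credits_usd = 10.0
price_usd_per_million_tokens = 0.15

tokens_in_millions = credits_usd / price_usd_per_million_tokens
print(f"{tokens_in_millions:.1f}M tokens")
```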

## OpenPipe

We'll use OpenPipe to fine tune our Llama3.1-8B model. OpenPipe lets developers build fine tuning datasets, fire off fine-tune jobs across a variety of base models, and run comprehensive quality evaluations.

Get started by signing up for an account: https://openpipe.ai/.

## Weights & Biases

We'll use Weights & Biases to build an evaluation dashboard to compare the performance of the fine-tune against other models across different metrics (quality, performance, cost).

Get started by signing up for an account: https://wandb.ai/site.

## OpenAI (optional)

We'll use OpenAI for quality comparisons against our fine-tuned LLM. More specifically we'll compare our fine-tune against GPT-4o and GPT-4o-mini.

You can create an OpenAI account at the following URL: https://platform.openai.com. Create a new API key on [this link](https://platform.openai.com/api-keys) once you've created an account.

In [None]:
import os
from getpass import getpass

# Enter your OctoAI Token
OCTOAI_TOKEN = getpass()
os.environ["OCTOAI_TOKEN"] = OCTOAI_TOKEN

In [None]:
# Enter your OpenAI Token
OPENAI_API_KEY = getpass()
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

## Install the OctoAI CLI

To deploy the OpenPipe Llama3.1-8B fine tune onto OctoAI inference endpoints, we'll need to use the OctoAI CLI.

In [None]:
%%capture
# install the octoai CLI
!curl https://s3.amazonaws.com/downloads.octoai.cloud/octoai/install_octoai_cli_and_sdk.sh -sSfL | sh
!apt-get install jq

In [None]:
# Log into the OctoAI CLI using the $OCTOAI_TOKEN set in your environment
!octoai login

## Install Python Packages

Run the cell below to install the necessary pip packages.

In [None]:
# Install and read in required packages
print('⏳ Installing packages')
%pip install -q openai datasets weave tqdm ipywidgets requests
print('✅ Packages installed')

## Log into Weave from W&B

Run the cell below to authenticate yourself into weave with your W&B account.

In [None]:
import weave

# Initialize W&B Weave project
weave.init('llama31-vs-gpt4o-quality-exploration')

# 1. Building a Fine-Tuning Dataset

A fine-tuned model is only as good as the dataset it's been trained on. Therefore the dataset collection step is critical in order to get a high quality fine tune.

We'll build our fine-tuning dataset by taking an already labeled dataset from HuggingFace and turning it into a synthetic LLM request log, which we'll then upload to OpenPipe in JSONL format.

## PII Masking Dataset

The dataset we're using in this notebook is the Personally Identifiable Information (PII) masking 200k dataset from [AI4Privacy](https://www.ai4privacy.com/), available via this [link on HuggingFace](https://huggingface.co/datasets/ai4privacy/pii-masking-200k).

This dataset has about 200k synthetic text samples that each contain one or more PII entries across 54 PII classes.

An example of PII redaction looks as follows.

Input text:
```
Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools.
```

Redacted text:
```
Dear [FIRSTNAME], as per our records, your license [VEHICLEVIN] is still registered in our records for access to the educational tools.
```

Privacy mask:
```
[ { "value": "Omer", "start": 5, "end": 9, "label": "FIRSTNAME" }, { "value": "78B5R2MVFAHJ48500", "start": 44, "end": 61, "label": "VEHICLEVIN" } ]
```
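To see how the three pieces relate, here is a minimal sketch that rebuilds the redacted text from the input text and the privacy mask using the `start`/`end` character offsets. The helper name `apply_privacy_mask` is illustrative and not part of the dataset tooling.

```python
# Apply a privacy mask to a text using the character offsets it carries.
source_text = (
    "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still "
    "registered in our records for access to the educational tools."
)
privacy_mask = [
    {"value": "Omer", "start": 5, "end": 9, "label": "FIRSTNAME"},
    {"value": "78B5R2MVFAHJ48500", "start": 44, "end": 61, "label": "VEHICLEVIN"},
]

def apply_privacy_mask(text: str, mask: list[dict]) -> str:
    # Replace spans from right to left so earlier offsets stay valid.
    for entry in sorted(mask, key=lambda e: e["start"], reverse=True):
        # Sanity check: the offsets must point at the recorded value.
        assert text[entry["start"]:entry["end"]] == entry["value"]
        text = text[:entry["start"]] + f"[{entry['label']}]" + text[entry["end"]:]
    return text

redacted_text = apply_privacy_mask(source_text, privacy_mask)
print(redacted_text)
```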

## Function calling for PII redaction

Instead of training an LLM to do the PII redaction directly on the input text, we'll use the LLM's tool calling ability to call a function that will perform the redaction on the original text instead.

Using a function to perform the redaction gives us flexibility to implement different redaction approaches after the LLM has been fine tuned.
* We can redact by replacing the PII with the PII class it belongs to, e.g. `Omer` becomes `[FIRSTNAME]`.
* We can redact by replacing the PII with masked information, e.g. `Omer` becomes `XXXXXX`.
* We can redact by replacing the PII with a fake PII by mapping each unique original PII to a corresponding fake PII substitute from a database, e.g. `Omer` becomes `Kendall`.
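The three approaches above can be sketched as interchangeable functions that all consume the same list of fields. The function names are hypothetical, and the `fake_values` dictionary stands in for the database lookup; masking one character per original character is just one possible choice.

```python
# Three hypothetical redaction back-ends that consume the same tool-call arguments.
text = "Dear Omer, your license 78B5R2MVFAHJ48500 is still registered."
fields_to_redact = [
    {"string": "Omer", "pii_type": "FIRSTNAME"},
    {"string": "78B5R2MVFAHJ48500", "pii_type": "VEHICLEVIN"},
]

def redact_with_class(text: str, fields: list[dict]) -> str:
    # Replace each PII string with its class label in brackets.
    for f in fields:
        text = text.replace(f["string"], f"[{f['pii_type']}]")
    return text

def redact_with_mask(text: str, fields: list[dict], mask_char: str = "X") -> str:
    # Replace each PII string with one mask character per original character.
    for f in fields:
        text = text.replace(f["string"], mask_char * len(f["string"]))
    return text

def redact_with_fake(text: str, fields: list[dict], fake_values: dict) -> str:
    # fake_values stands in for a database lookup: original PII -> substitute.
    for f in fields:
        text = text.replace(f["string"], fake_values[f["string"]])
    return text

print(redact_with_class(text, fields_to_redact))
print(redact_with_mask(text, fields_to_redact))
print(redact_with_fake(text, fields_to_redact,
                       {"Omer": "Kendall", "78B5R2MVFAHJ48500": "1HGCM82633A004352"}))
```

Because the LLM only has to identify the strings and their classes, swapping one back-end for another requires no retraining.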

With the system prompt and the function call definition (both defined in the cells below), we'll invoke the LLM by passing in the text that needs to be redacted:

```python
import requests

# Example input text containing PII
user_prompt = "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools."

response = requests.post("https://text.octoai.run/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OCTOAI_TOKEN}"
    },
    json={
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "model": "meta-llama-3.1-8b-instruct-lora",
        "max_tokens": 512,
        "presence_penalty": 0,
        "temperature": 0,
        "top_p": 0.9,
        "peft": lora_asset_name, # we'll talk about what to set this to next
        "tool_choice": tool_choice, # defined in the next cell
        "tools": tools # defined in the next cell
    }
)

# The arguments of the first tool call hold the PII fields to redact
print(response.json()["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])
```

When the model has been fine-tuned correctly, this should return the following chat completions response containing the arguments to pass into the `redact()` function call:
```json
{
  "fields_to_redact":
  [
    {
      "string": "Omer",
      "pii_type": "FIRSTNAME"
    },
    {
      "string": "78B5R2MVFAHJ48500",
      "pii_type": "VEHICLEVIN"
    }
  ]
}
```
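The arguments arrive as a JSON string, so they need to be decoded before being handed to `redact()`. The sketch below shows one defensive way to do that; the helper name `parse_redact_arguments` is hypothetical. It also handles the case, worked around in this notebook's evaluation code, where `fields_to_redact` itself comes back as a JSON-encoded string.

```python
import json

def parse_redact_arguments(raw_arguments: str) -> list[dict]:
    """Return the list of fields to redact, or an empty list if parsing fails."""
    try:
        arguments = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(arguments, dict):
        return []
    fields = arguments.get("fields_to_redact", [])
    # Some responses carry the list as a JSON-encoded string; decode it once more.
    if isinstance(fields, str):
        try:
            fields = json.loads(fields)
        except json.JSONDecodeError:
            return []
    return fields if isinstance(fields, list) else []

expected = [{"string": "Omer", "pii_type": "FIRSTNAME"}]
well_formed = json.dumps({"fields_to_redact": expected})
double_encoded = json.dumps({"fields_to_redact": json.dumps(expected)})

print(parse_redact_arguments(well_formed))
print(parse_redact_arguments(double_encoded))
print(parse_redact_arguments("not json"))
```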

With that, let's recap how the LLM will be invoked to perform PII redaction via function calling:

![llm deployment cycle](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/llm_call_overview.png)

In [None]:
# Define the system prompt with all of the PII categories
# The nice thing about this system prompt is that it can be very easily extended
# to fit your unique use case.
piir_system_prompt = """
You are an expert model trained to redact potentially sensitive information from documents. You have been given a document to redact. Your goal is to accurately redact the sensitive information from the document. Sensitive information can be in one of the following categories:

- ACCOUNTNAME: name of an account
- ACCOUNTNUMBER: number of an account
- AGE: a person's age
- AMOUNT: information indicating a certain monetary amount
- BIC: a business identifier code
- BITCOINADDRESS: bitcoin address, generally stored in a cryptocurrency wallet
- BUILDINGNUMBER: number of a building in a physical address
- CITY: name of a city indicating location or address
- COMPANYNAME: name of a company
- COUNTY: name of a county indicating location or address
- CREDITCARDCVV: credit card CVV
- CREDITCARDISSUER: credit card issuer
- CREDITCARDNUMBER: credit card number
- CURRENCY: currency of a balance or transaction
- CURRENCYCODE: the code of a currency (e.g. USD)
- CURRENCYNAME: name of a currency (e.g. US dollar)
- CURRENCYSYMBOL: symbol of a currency (e.g. $)
- DATE: a specific calendar date
- DOB: a specific calendar date representing birth
- EMAIL: an email ID
- ETHEREUMADDRESS: ethereum address, generally stored in a cryptocurrency wallet
- EYECOLOR: eye color, used to identify a person
- FIRSTNAME: first name of a person
- GENDER: a gender identifier
- HEIGHT: height of a person
- IBAN: international banking account number
- IP: IP address
- IPV4: IP v4 address
- IPV6: IP v6 address
- JOBAREA: job area, specialization or category
- JOBTITLE: job title
- LASTNAME: last name of a person
- LITECOINADDRESS: litecoin address, generally stored in a cryptocurrency wallet
- MAC: MAC address
- MASKEDNUMBER: masked number
- MIDDLENAME: middle name of a person
- NEARBYGPSCOORDINATE: nearby GPS coordinates
- ORDINALDIRECTION: ordinal direction (north, south, northeast, etc.)
- PASSWORD: a secure string used for authentication
- PHONEIMEI: the IMEI of a phone
- PHONENUMBER: a telephone number
- PIN: a personal identification number (PIN)
- PREFIX: prefix used to identify a person (Mr., Mrs., Dr. etc.)
- SECONDARYADDRESS: a secondary physical address
- SEX: a sex identifier (male/female)
- SSN: a social security number
- STATE: name of a state indicating location or address
- STREET: name of a street indicating location or address
- TIME: time of the day
- URL: URL of a website
- USERAGENT: user agent to identify the application, operating system, vendor etc.
- USERNAME: user name to identify user
- VEHICLEVIN: vehicle identification number or license number
- VEHICLEVRM: vehicle registration mark
- ZIPCODE: zipcode indicating location or address

You should return the specific string that needs to be redacted, along with the category of sensitive information that it belongs to. If there is no sensitive information in the document, return no strings.
"""

In [None]:
# Define the tools for the LLM to invoke

piir_tool_choice = {
  "type": "function",
  "function": {"name": "redact"}
}

piir_tools = [
  {
    "function": {
      "name": "redact",
      "parameters": {
        "type": "object",
        "properties": {
          "fields_to_redact": {
            "type": "array",
            "items": {
              "type": "object",
              "required": [
                "string",
                "pii_type"
              ],
              "properties": {
                "string": {
                  "type": "string",
                  "description": "The exact matching string to redact. Include any whitespace or punctuation. Must be an exact string match!"
                },
                "pii_type": {
                  "enum": [
                    "ACCOUNTNAME",
                    "ACCOUNTNUMBER",
                    "AGE",
                    "AMOUNT",
                    "BIC",
                    "BITCOINADDRESS",
                    "BUILDINGNUMBER",
                    "CITY",
                    "COMPANYNAME",
                    "COUNTY",
                    "CREDITCARDCVV",
                    "CREDITCARDISSUER",
                    "CREDITCARDNUMBER",
                    "CURRENCY",
                    "CURRENCYCODE",
                    "CURRENCYNAME",
                    "CURRENCYSYMBOL",
                    "DATE",
                    "DOB",
                    "EMAIL",
                    "ETHEREUMADDRESS",
                    "EYECOLOR",
                    "FIRSTNAME",
                    "GENDER",
                    "HEIGHT",
                    "IBAN",
                    "IP",
                    "IPV4",
                    "IPV6",
                    "JOBAREA",
                    "JOBTITLE",
                    "JOBTYPE",
                    "LASTNAME",
                    "LITECOINADDRESS",
                    "MAC",
                    "MASKEDNUMBER",
                    "MIDDLENAME",
                    "NEARBYGPSCOORDINATE",
                    "ORDINALDIRECTION",
                    "PASSWORD",
                    "PHONEIMEI",
                    "PHONENUMBER",
                    "PIN",
                    "PREFIX",
                    "SECONDARYADDRESS",
                    "SEX",
                    "SSN",
                    "STATE",
                    "STREET",
                    "TIME",
                    "URL",
                    "USERAGENT",
                    "USERNAME",
                    "VEHICLEVIN",
                    "VEHICLEVRM",
                    "ZIPCODE"
                  ],
                  "type": "string"
                }
              }
            }
          }
        }
      }
    },
    "type": "function"
  }
]

## Preparing the dataset

In the next cell we'll load the dataset from HuggingFace and build a 10k-sample fine-tuning dataset, using the input text as the LLM prompt and a re-worked privacy mask as the expected tool call response from the LLM.

Since there are 200k samples and we're only using 10k for our fine-tuning dataset, we have more than enough samples remaining as a holdout set to run post-training evaluations.

In [None]:
from datasets import load_dataset

# Load the dataset from huggingface
dataset = load_dataset("ai4privacy/pii-masking-200k")

In [None]:
import json
from google.colab import files

# Fine-tuning dataset size
# To improve on accuracy results you can bump the size to a larger value
TRAINING_SIZE = 10000

# Create a dataset for fine tuning
training_dataset = []
for idx, item in enumerate(dataset['train'].select(range(0, TRAINING_SIZE))):
    function_arguments = {
        "fields_to_redact": []
    }
    for i in item["privacy_mask"]:
        function_arguments["fields_to_redact"].append({
            "string": i["value"],
            "pii_type": i["label"]
        })
    dataitem = {
        "messages": [
            {
                "role": "system",
                "content": piir_system_prompt
            },
            {
                "role": "user",
                "content": item["source_text"]
            },
            {
                "role": "assistant",
                "content": None,
                "tool_calls":
                    [
                        {
                            "id":"",
                            "type":"function",
                            "function":
                            {
                                "name": "redact",
                                "arguments": json.dumps(function_arguments, indent=2)
                            }
                        }
                    ]
            },
        ],
        "tools": piir_tools,
        "tool_choice": piir_tool_choice
    }
    training_dataset.append(dataitem)

file_path='training_dataset.jsonl'
with open(file_path, 'w') as outfile:
    for entry in training_dataset:
        json.dump(entry, outfile)
        outfile.write('\n')

# This lets you download the file from your browser in case you'd like to view it
files.download(file_path)
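Before uploading, it can be worth confirming that every line of the JSONL file parses and ends with a well-formed `redact` tool call. The sketch below is one way to do that; `validate_jsonl` is a hypothetical helper, and the tiny sample file only exists so the example runs on its own. Point it at `training_dataset.jsonl` to check the real export.

```python
import json
import os
import tempfile

def validate_jsonl(path: str) -> int:
    """Check each line is a JSON object whose last message is a `redact` tool call."""
    count = 0
    with open(path) as infile:
        for line_number, line in enumerate(infile, start=1):
            entry = json.loads(line)
            tool_call = entry["messages"][-1]["tool_calls"][0]
            assert tool_call["function"]["name"] == "redact", f"line {line_number}"
            # The arguments are stored as a JSON string and must decode cleanly.
            arguments = json.loads(tool_call["function"]["arguments"])
            assert "fields_to_redact" in arguments, f"line {line_number}"
            count += 1
    return count

# Stand-in entry that mirrors the structure built in the cell above.
sample_entry = {
    "messages": [
        {"role": "system", "content": "system prompt goes here"},
        {"role": "user", "content": "Dear Omer, welcome."},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "",
                "type": "function",
                "function": {
                    "name": "redact",
                    "arguments": json.dumps(
                        {"fields_to_redact": [{"string": "Omer", "pii_type": "FIRSTNAME"}]}
                    ),
                },
            }],
        },
    ]
}
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w") as outfile:
    outfile.write(json.dumps(sample_entry) + "\n")

num_entries = validate_jsonl(sample_path)
print(f"{num_entries} valid entries")
```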

# 2. Fine tune the LLM

We'll use OpenPipe for this step.

## Upload the dataset to OpenPipe

Check your downloads folder: you should find a `training_dataset.jsonl` file there.

Now follow the instructions on [this page](https://docs.openpipe.ai/features/exporting-data#dataset-export) to upload your dataset to OpenPipe.

1. On your OpenPipe console, click on "Datasets" listed in the bar on the left.
2. Click on "+ New Dataset" button at the top right of the window.
3. Click on "Upload Data" button at the top left of the window.
4. Drop the jsonl file that was just downloaded in the "Upload File" window.
5. Click on the Upload button and wait for the dataset to get uploaded.

You'll get the confirmation window below if the dataset gets successfully uploaded.

![upload confirmation window](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/openpipe_dataset_uploaded.png)


You'll see your dataset entries under the "Dataset view": 10,000 of them, which should have been split into 9,000 training entries and 1,000 test entries.

![dataset view](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/dataset_view.png)

Hit "Settings", and rename your Dataset if need be.

## Launch a fine-tune

Under the Dataset view on the OpenPipe console, you can launch a fine tune by clicking on the "Fine Tune" button at the top right of the window.

You can define a model ID - this lets us uniquely identify the resulting fine tuned model.

Next you can choose your base model:
* You can choose between open source models (Llama, Mistral) or closed source models (OpenAI GPT). Selecting an open source model gives you ownership of the weights, and lets you deploy the model on the platform of your choice. For this notebook we'll select the "Llama 3.1 8B" model.
* You'll see that 10k training samples give us a good dataset size to work with.
* Finally you can choose to tweak the advanced options but we'll leave them as-is.

![fine tuning settings](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/finetuning_launch_llama31.png)

Let's go ahead and hit "Start Training" to kick off the fine-tuning job.


# 3. Deploy the fine-tuned LLM

In this section we'll export the model weights in LoRA FP16 form to be hosted on OctoAI.

You'll be informed that the fine tune job has completed by email. Once you've been notified you can proceed with the steps below.

## Export your model weights

Access your fine-tuned model by clicking on "Fine Tune" on the OpenPipe console.
Click on the model that was just fine tuned.

![model fine tune](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/model_finetune.png)

At the bottom of the page, you can select the format in which the model weights get exported. Select "LoRA:FP16" under the "Format" drop-down and hit "Export Weights".

It will take a couple of minutes until your model weights are ready for download. When the weights are ready, right click on "Download Weights" to copy the link to the weights and set the URL aside, which we'll need in the next step to set `lora_url`.

![fine tune download link](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/model_finetune_link.png)

Below, you'll need to set the `lora_url` to the URL you just copied from the "Download Weights" link in the previous step.




In [None]:
import random

# Llama 3.1 instruct
checkpoint_name = "octoai:meta-llama-3-1-8b-instruct"
lora_url = "SET ME"
assert lora_url != "SET ME", "Please update lora_url"

# Define an asset name on OctoAI to uniquely identify the LoRA
lora_asset_name = "pii-redaction-finetune-{}".format(str(random.randint(1000, 999999)))

# Set checkpoint name and LoRA URL as env vars
%env checkpoint_name=$checkpoint_name
%env lora_url=$lora_url
%env lora_asset_name=$lora_asset_name

Below, we upload the LoRA to OctoAI.

We need to specify what base checkpoint and architecture ("engine") the model corresponds to.

The command below uses "--upload-from-url" which lets you upload these files from the OpenPipe download URL. Note also that there is an "upload-from-dir" that lets you specify a local directory if you've downloaded the LoRA zip file on your local drive.

The "--wait" flag blocks until the upload has completed, which makes scripting possible.

In [None]:
%%bash

octoai asset create \
  --checkpoint $checkpoint_name \
  --format safetensors \
  --type lora \
  --engine text/llama-3.1-8b \
  --name $lora_asset_name \
  --data-type fp16 \
  --upload-from-url $lora_url \
  --wait

You can double check that the LoRA got added using the `octoai asset list` command.

In [None]:
!octoai asset list

## Sanity checking that the fine-tune is running on OctoAI

Let's go ahead and use OctoAI to run a test inference with the code below. We do this by supplying the LoRA name as the `peft` parameter (a LoRA is a type of Parameter-Efficient Fine Tune) when making a call to the model.

In [None]:
import requests
import json

print("Using asset ", lora_asset_name)

test_prompt = "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools. Please feedback on it's operability."

messages=[
    {
        "role": "system",
        "content": piir_system_prompt
    },
    {
        "role": "user",
        "content": test_prompt
    },
]

req = requests.post("https://text.octoai.run/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer {}".format(os.environ["OCTOAI_TOKEN"])
    },
    json={
        "messages": messages,
        "model": "meta-llama-3.1-8b-instruct-lora",
        "max_tokens": 512,
        "presence_penalty": 0,
        "temperature": 0,
        "top_p": 0.9,
        "peft": lora_asset_name,
        "tool_choice": piir_tool_choice,
        "tools": piir_tools
    }
)

print(json.dumps(req.json(),indent=4))

The output looks good! Let's now run a more exhaustive set of quality evaluations to see where this model stands next to very capable LLMs like GPT-4o.


# 4. Quality Evaluations Using W&B Weave

## Model Coverage

In this section we'll run quality tests between:
* Our Llama3.1-8b model fine tune hosted by OctoAI
* A Llama3.1-8b-instruct model hosted by OctoAI (not tuned for the task)
* A GPT4o model hosted by OpenAI

## Techniques Coverage
We evaluate different prompting techniques for the non-tuned Llama3.1-8b and GPT4o models, namely:
* Zero-shot prompting
* Single-shot prompting
* Few-shot prompting

With zero-shot prompting, we invoke the language model by passing in a system prompt and a user prompt. This is the most basic way to invoke the model’s chat completions API.

With single-shot prompting, we insert in the messages a full conversation round trip that includes an example user prompt (example text to be redacted), the agent’s first response (language model determining what tool to call and what arguments to pass in the tool call, i.e. PII to redact), the tool response (containing the correctly redacted text scrubbed from its PII), and the agent’s second response (language model providing the correctly redacted text back to the user).

With few-shot prompting, we insert more than one full conversation round trip to supplement additional usage examples.


![few-shot prompting](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/few_shot_prompting.png)
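To make the difference between the three techniques concrete, the sketch below builds the `messages` list for each one. `build_messages` is a hypothetical helper with placeholder content; it follows the same round-trip structure (user, assistant tool call, tool response, assistant answer) as the `get_few_shot_prompts` function defined in this notebook.

```python
def build_messages(system_prompt: str, examples: list[tuple], user_prompt: str) -> list[dict]:
    """examples: list of (example_text, tool_arguments_json, redacted_text) tuples."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, tool_arguments, redacted_text in examples:
        # One full round trip per example
        messages.append({"role": "user", "content": example_text})
        messages.append({
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "",
                "type": "function",
                "function": {"name": "redact", "arguments": tool_arguments},
            }],
        })
        messages.append({"role": "tool", "tool_call_id": "", "name": "redact",
                         "content": redacted_text})
        messages.append({"role": "assistant", "content": redacted_text})
    # The actual request always comes last
    messages.append({"role": "user", "content": user_prompt})
    return messages

example = (
    "Dear Omer, welcome.",
    '{"fields_to_redact": [{"string": "Omer", "pii_type": "FIRSTNAME"}]}',
    "Dear [FIRSTNAME], welcome.",
)

zero_shot = build_messages("system prompt", [], "text to redact")
single_shot = build_messages("system prompt", [example], "text to redact")
few_shot = build_messages("system prompt", [example, example], "text to redact")
print(len(zero_shot), len(single_shot), len(few_shot))
```

Each extra example adds four messages to the request, so prompt token counts (and therefore cost) grow with the number of shots.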

## Metrics

We'll use the Weights & Biases Weave library to monitor various metrics across our LLMs:
* `has_response` as a measure of LLM robustness/availability overall
* `has_tools_call` as a measure of LLM's ability to properly handle tools call requests
* `cost_cents` as a measure of how expensive the LLM inference is
* `quality_score` as a measure of how high the LLM quality is at the PII redaction task
* `model_latency` as a measure of performance/speed
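As a worked example of the `cost_cents` metric, the sketch below applies the per-token prices from this notebook's cost table to a hypothetical request of 1,000 prompt tokens and 100 completion tokens. The token counts are made up for illustration and are not measured results.

```python
# Prices in $ per million tokens, copied from this notebook's cost table.
prices = {
    "meta-llama-3.1-8b-instruct": {"input": 0.15, "output": 0.15},
    "gpt-4o-2024-08-06": {"input": 2.5, "output": 10},
}

def cost_in_cents(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    price = prices[model]
    dollars = (prompt_tokens * price["input"]
               + completion_tokens * price["output"]) / 1_000_000
    return dollars * 100

# Hypothetical request: 1,000 prompt tokens and 100 completion tokens.
llama_cents = cost_in_cents("meta-llama-3.1-8b-instruct", 1000, 100)
gpt4o_cents = cost_in_cents("gpt-4o-2024-08-06", 1000, 100)
print(f"Llama3.1-8B: {llama_cents:.4f} cents, GPT-4o: {gpt4o_cents:.4f} cents")
```

The gap between the two models depends on the input/output mix, since GPT-4o output tokens are priced higher than its input tokens.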

## Quality Evaluation

All quality evaluations start by defining a quality metric. In our case, we already have a labeled dataset from AI4Privacy, which we can use as our evaluation ground truth.

We introduce a scoring system that works fairly simply. Each PII that needs to be redacted is represented as a pair containing:
* The PII string itself, e.g. `5943919109159496`
* The PII class, e.g. `CREDITCARDNUMBER`

We use `SequenceMatcher` from Python's `difflib` module to obtain a similarity score between the ground truth PII and the one inferred by the LLM.

If the PII string and class match perfectly, we get a score of 1.0. If any information starts to diverge (e.g. the LLM classifies PII as `MIDDLENAME` instead of `FIRSTNAME`), the score becomes lower, but is not 0.
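A minimal sketch of this scoring idea, using the same "string, class" formatting as the `similar` helper defined later in this notebook (the function name `similarity` here is illustrative):

```python
from difflib import SequenceMatcher

def similarity(reference: dict, predicted: dict) -> float:
    # Compare "string, class" pairs as plain text
    ref = "{}, {}".format(reference["string"], reference["pii_type"])
    pred = "{}, {}".format(predicted["string"], predicted["pii_type"])
    return SequenceMatcher(None, ref, pred).ratio()

truth = {"string": "Omer", "pii_type": "FIRSTNAME"}
exact = {"string": "Omer", "pii_type": "FIRSTNAME"}
wrong_class = {"string": "Omer", "pii_type": "MIDDLENAME"}

print(similarity(truth, exact))        # perfect match
print(similarity(truth, wrong_class))  # partial credit: lower, but above 0
```

A partially correct prediction therefore still earns some credit, which gives a smoother signal than exact-match accuracy.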

In [None]:
from openai import OpenAI
import copy

# PII redaction function
def redact(text: str, fields_to_redact: list[dict]) -> str:
    for elem in fields_to_redact:
        text = text.replace(elem["string"], "[{}]".format(elem["pii_type"]))
    return text

# Generates few-shot prompts in messages payload
def get_few_shot_prompts(dataset, n=1):
    assert n > 0
    messages = [
      dataset[0]['messages'][0],
    ]
    for idx in range(0, n):
        messages.append(dataset[idx]['messages'][1]) # user message
        messages.append(dataset[idx]['messages'][2]) # assistant tool call
        # Compute the tool response by applying the redaction
        f_value = redact(
            text=dataset[idx]['messages'][1]['content'],
            fields_to_redact=json.loads(dataset[idx]['messages'][2]['tool_calls'][0]['function']['arguments'])['fields_to_redact']
        )
        messages.append(
            {
                "tool_call_id": "",
                "role": "tool",
                "name": "redact",
                "content": f_value
            }
        )
        messages.append({"role": "assistant", "content": f_value})

    return messages

In [None]:
from openai import OpenAI

# Temperature
# Set to 0 for reproducibility
temperature = 0

# OctoAI max tokens
octoai_max_tokens = 32768

# Cost Table for LLMs from OctoAI/OpenAI
# As of Aug 22nd 2024
# All prices are in $ per million tokens
cost_table = {
    lora_asset_name: {
        "input": 0.15,
        "output": 0.15
    },
    "meta-llama-3.1-8b-instruct": {
        "input": 0.15,
        "output": 0.15
    },
    "gpt-4o-2024-08-06": {
        "input": 2.5,
        "output": 10
    }
}

def chat_completions_infer(client, model: str, data: dict, shot_prompting: int) -> dict:
    system_msg = data['messages'][0]
    system_msg_content = system_msg['content']
    user_msg = data['messages'][1]
    user_msg_content = user_msg['content']

    # Pre-process - multi-shot
    if shot_prompting > 0:
        messages = get_few_shot_prompts(
            # sorted_training_dataset,
            training_dataset,
            n=shot_prompting
        )
        messages.append(user_msg)
        data['messages'] = messages

    # Invoke LLM
    try:
        response = client.chat.completions.create(**data)
        response = json.loads(response.model_dump_json())
    except Exception:
        # Request failed: has_response will score this as False
        return None
    # Process outputs
    output = None
    usage = None
    try:
      tool_calls = response['choices'][0]['message']['tool_calls']
      output = json.loads(tool_calls[0]['function']['arguments'])
      usage = response['usage']
    except Exception:
      # Missing or malformed tool call: leave output as None
      pass
    # Fix for OctoAI Llama3.1 models, in case fields_to_redact
    # comes back as a JSON-encoded string
    try:
        output["fields_to_redact"] = json.loads(output["fields_to_redact"])
    except Exception:
        pass

    return {
        "model_name": model,
        "output": output,
        "usage": usage
    }


class OctoAILLM_FT(weave.Model):
    model: str
    shot_prompting: int

    @weave.op()
    def predict(self, system_prompt: str, user_prompt: str, tool_choice: str, tools: str) -> dict:
        client = OpenAI(
            base_url="https://text.octoai.run/v1",
            api_key=os.environ["OCTOAI_TOKEN"],
        )
        data = {
            "model": "meta-llama-3.1-8b-instruct-lora",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": temperature,
            "max_tokens": octoai_max_tokens,
            "tool_choice": tool_choice,
            "tools": tools,
            "extra_body": {"peft": self.model}
        }

        return chat_completions_infer(client, self.model, data, self.shot_prompting)


class OctoAILLM(weave.Model):
    model: str
    shot_prompting: int

    @weave.op()
    def predict(self, system_prompt: str, user_prompt: str, tool_choice: str, tools: str) -> dict:
        client = OpenAI(
            base_url="https://text.octoai.run/v1",
            api_key=os.environ["OCTOAI_TOKEN"],
        )
        data = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": temperature,
            "max_tokens": octoai_max_tokens,
            "tool_choice": tool_choice,
            "tools": tools
        }
        return chat_completions_infer(client, self.model, data, self.shot_prompting)


class OpenAILLM(weave.Model):
    model: str
    shot_prompting: int

    @weave.op()
    def predict(self, system_prompt: str, user_prompt: str, tool_choice: str, tools: str) -> dict:
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        data = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": temperature,
            "tool_choice": tool_choice,
            "tools": tools
        }
        return chat_completions_infer(client, self.model, data, self.shot_prompting)


# Llama3.1-8B-FT
octoai_llama31_8b_ft = OctoAILLM_FT(model=lora_asset_name, shot_prompting=0)
# Llama3.1-8B
octoai_llama31_8b = OctoAILLM(model='meta-llama-3.1-8b-instruct', shot_prompting=0)
octoai_llama31_8b_ss = OctoAILLM(model='meta-llama-3.1-8b-instruct', shot_prompting=1)
octoai_llama31_8b_ms = OctoAILLM(model='meta-llama-3.1-8b-instruct', shot_prompting=2)
# GPT-4o
openai_gpt4o = OpenAILLM(model='gpt-4o-2024-08-06', shot_prompting=0)
openai_gpt4o_ss = OpenAILLM(model='gpt-4o-2024-08-06', shot_prompting=1)
openai_gpt4o_ms = OpenAILLM(model='gpt-4o-2024-08-06', shot_prompting=2)

In [None]:
# Let's build a weave dataset out of holdout data from our original
# huggingface dataset
from weave import Dataset

# Set to a larger number for a full evaluation; as a first test, run with 10 to
# sanity check that all is working as intended
TEST_SIZE = 10

pii_redaction_test_data = Dataset(
    name="pii_redaction_data",
    rows=[
        {
            "source_text": t["source_text"],
            "fields_to_redact": [
                {
                    "string": elem["value"],
                    "pii_type": elem["label"]
                } for elem in t["privacy_mask"]
            ],
        } for t in dataset['train'].select(range(TRAINING_SIZE, TRAINING_SIZE+TEST_SIZE))
    ]
)

In [None]:
from difflib import SequenceMatcher

# All of the instantiated models
models = [
    octoai_llama31_8b_ft,
    octoai_llama31_8b,
    octoai_llama31_8b_ss,
    octoai_llama31_8b_ms,
    openai_gpt4o,
    openai_gpt4o_ss,
    openai_gpt4o_ms
]

# Similarity score helper function for assessing quality
def similar(a, b):
    a_string = "{}, {}".format(a['string'], a['pii_type'])
    b_string = "{}, {}".format(b['string'], b['pii_type'])
    return SequenceMatcher(None, a_string, b_string).ratio()

# Define our scoring functions
@weave.op()
def has_response(model_output: dict) -> dict:
    return {'has_response': model_output is not None}

@weave.op()
def has_tools_call(model_output: dict) -> dict:
    try:
        return {'has_tools_call': model_output['output'] is not None}
    except:
        return {'has_tools_call': False}

@weave.op()
def is_expensive(model_output: dict) -> dict:
    try:
        usage = model_output['usage']
        model = model_output['model_name']
        cost = int(usage['prompt_tokens']) * cost_table[model]['input'] + \
              int(usage['completion_tokens']) * cost_table[model]['output']
        cost = cost / 1000000 * 100
        return {'cost_cents': cost}
    except:
        return {'cost_cents': None}

@weave.op()
def is_good_quality(fields_to_redact: dict, model_output: dict) -> dict:
    try:
        # Assess accuracy
        score = 0
        # sum of all of the best similarity scores across the test PII
        for t in model_output['output']['fields_to_redact']:
            # we retain the best similarity score across all pairwise PII comparisons
            best_score = 0
            for r in fields_to_redact:
                sim_score = similar(r, t)
                if sim_score > best_score:
                    best_score = sim_score
            score += best_score
        # divide the sum by the max of PII classes in reference data, and test data
        # this is a simple formula to introduce a penalty in case we have a false positive or false negative
        score = score/max(len(fields_to_redact), len(model_output['output']['fields_to_redact']))
        return {'quality_score': score}
    except:
        return {'quality_score': 0}

# Define the preprocess_model_input function
def preprocess_model_input(row):
    return {
        'system_prompt': piir_system_prompt,
        'user_prompt': row['source_text'],
        'tool_choice': piir_tool_choice,
        'tools': piir_tools
    }

# Define the evaluation
ner_evaluation = weave.Evaluation(
    name='pii_redaction_eval',
    dataset=pii_redaction_test_data,
    trials=1,
    scorers=[
        has_response,
        has_tools_call,
        is_expensive,
        is_good_quality
    ],
    preprocess_model_input=preprocess_model_input
)

In [None]:
# Run evaluation for each model
# Setting this to 2 runs the eval slowly, but you won't run into RPM limits
# with any of the providers
os.environ["WEAVE_PARALLELISM"] = "2"

pii_results = {}
for model in models:
    print(f"Evaluating {model.model}...")
    result = await ner_evaluation.evaluate(model)
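    # Key on model and shot count so runs of the same model don't overwrite each other
    pii_results[f"{model.model}-{model.shot_prompting}-shot"] = result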