# Fine-tuning OpenAI's base models using Kili-technology

In this tutorial, we'll learn how to build a model that can assign one of 8 predefined categories to short, Twitter-length news items. We'll use one of OpenAI's base models and fine-tune it on our data, hoping to reach far better results than the one-size-fits-all standard models can provide.

We'll start by finding and downloading an interesting dataset to train our model with.
Next, we'll process the dataset so that it fits our needs and set up our project in the Kili app.
Then, we'll use OpenAI's Curie model to generate the first model-based inference labels, which we'll upload to the Kili app.
After that, we'll simulate human labeling to benchmark our model against human-generated labels.
Having done that, we'll download the *gold standard* labels from Kili and use them to fine-tune Curie, one of OpenAI's base models.
Finally, we'll ask the fine-tuned model to generate predictions on a new dataset.

## Finding the dataset to work with

After rummaging through Kaggle for some time, we managed to locate something that seems interesting: a dataset listing HuffPost's articles published over the course of several years, with links to articles, short descriptions, authors, and publication dates.

https://www.kaggle.com/datasets/rmisra/news-category-dataset

It's published under a very generous license https://creativecommons.org/licenses/by/4.0/ so we can wring it in any way we want; which we will, because the original file is huuuuuge. Plus, we don't really need most of its contents. What we're interested in is headlines, their respective short descriptions, and assigned categories.

To save some time and to eliminate some of the complexity, we've decided to filter the dataset and leave only the data matching 8 most basic categories that should be fairly easy to process by an off-the-shelf, general-purpose model.

We ended up removing all the entries with vague categories like `IMPACT` or `PARENTING`. Most of the time they matched other categories, too, and we bet that the next time they got tagged by humans, the chosen category would be different. The `WEIRD NEWS` category, with summaries like `There is guano on the court,` got removed too, of course. We have to admit we had a moment of hesitation on whether or not we should combine `CRIME` and `POLITICS` into one category, but we ended up not doing that. This is the final list:

- MEDIA & ENTERTAINMENT
- WORLD NEWS
- CULTURE & ARTS
- SCIENCE & TECHNOLOGY
- SPORTS
- POLITICS
- MONEY & BUSINESS
- STYLE & BEAUTY

To make matters as simple as possible, the assumption is that one piece of news can match only one of these categories.

Based on OpenAI's recommendations:
> The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.

To meet this recommendation, we've created 4 sample files, each one containing 100 labeled examples for each one of the 8 classes (800 examples per file). Should this not be enough, you can easily process the original dataset to get more.
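If the four sample files aren't enough, a sketch like the one below could draw more balanced samples from the original Kaggle file. It assumes the file is in JSON Lines format with `headline`, `short_description`, and `category` fields. The `CATEGORY_MAP` dictionary and the `sample_balanced` helper are hypothetical names for illustration; the mapping shown is only an example, not the one we actually used to build our files.

```python
import json
import random
from collections import defaultdict

# Hypothetical mapping from original dataset categories to our classes.
# Extend it to cover all 8 classes before using it on real data.
CATEGORY_MAP = {
    "ENTERTAINMENT": "MEDIA & ENTERTAINMENT",
    "SPORTS": "SPORTS",
    "POLITICS": "POLITICS",
}


def sample_balanced(json_lines, per_class, seed=0):
    """Return up to `per_class` 'text|CATEGORY' lines for each mapped class."""
    by_class = defaultdict(list)
    for raw in json_lines:
        record = json.loads(raw)
        target = CATEGORY_MAP.get(record.get("category"))
        if target is None:
            continue  # category is not one of the classes we keep
        text = f"{record['headline']}. {record['short_description']}"
        # "|" separates text from class in our files, so remove it from the text.
        text = text.replace("|", " ").replace("\n", " ")
        by_class[target].append(f"{text}|{target}")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = []
    for target in sorted(by_class):
        lines = by_class[target]
        sampled.extend(rng.sample(lines, min(per_class, len(lines))))
    return sampled
```

You could then pass an open file handle as `json_lines` and write the returned lines to a new text file in the same `text|CATEGORY` layout as the sample files.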

Let's now download the training and fine-tuning datasets that we created based on the original dataset:

In [None]:
%pip install wget

In [None]:
!wget https://raw.githubusercontent.com/kili-technology/kili-python-sdk/main/recipes/datasets/curie1.txt https://raw.githubusercontent.com/kili-technology/kili-python-sdk/main/recipes/datasets/curie2.txt https://raw.githubusercontent.com/kili-technology/kili-python-sdk/main/recipes/datasets/curie3.txt https://raw.githubusercontent.com/kili-technology/kili-python-sdk/main/recipes/datasets/curie4.txt

## Setting up a project in the Kili app

In [None]:
%pip install  kili

For the next step, you'll need a Kili API key. Here's how you can create it: https://docs.kili-technology.com/docs/creating-an-api-key

In [None]:
from kili.client import Kili

kili = Kili(api_key="<YOUR KILI API KEY>")

This code is basically your Kili project ontology in JSON format. Note all the categories that we listed earlier on, plus an additional `UNABLE_TO_CLASSIFY` category for situations when, for whatever reason, the model fails to do the job:

In [None]:
interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "content": {
                "categories": {
                    "MEDIA_AND_ENTERTAINMENT": {
                        "children": [],
                        "name": "MEDIA AND ENTERTAINMENT",
                        "id": "category11",
                    },
                    "WORLD_NEWS": {"children": [], "name": "WORLD NEWS", "id": "category12"},
                    "CULTURE_AND_ARTS": {
                        "children": [],
                        "name": "CULTURE AND ARTS",
                        "id": "category13",
                    },
                    "SCIENCE_AND_TECHNOLOGY": {
                        "children": [],
                        "name": "SCIENCE AND TECHNOLOGY",
                        "id": "category14",
                    },
                    "SPORTS": {"children": [], "name": "SPORTS", "id": "category15"},
                    "POLITICS": {"children": [], "name": "POLITICS", "id": "category16"},
                    "MONEY_AND_BUSINESS": {
                        "children": [],
                        "name": "MONEY AND BUSINESS",
                        "id": "category17",
                    },
                    "STYLE_AND_BEAUTY": {
                        "children": [],
                        "name": "STYLE AND BEAUTY",
                        "id": "category18",
                    },
                    "UNABLE_TO_CLASSIFY": {
                        "children": [],
                        "name": "UNABLE TO CLASSIFY",
                        "id": "category19",
                    },
                },
                "input": "radio",
            },
            "instruction": "Select a matching category:",
            "mlTask": "CLASSIFICATION",
            "required": 1,
            "isChild": False,
            "isNew": False,
        }
    }
}

We'll now use this ontology as an input parameter to a method that creates an actual Kili project:

In [None]:
result = kili.create_project(
    title="[Kili SDK Notebook]: Kili GPT fine-tuning project",
    description="Kili GPT fine-tuning project",
    input_type="TEXT",
    json_interface=interface,
)

project_id = result["id"]

As the next step, we need to extract the data from the curated dataset and upload the news headlines to Kili. Each line of the input file contains the input text and its assigned class, separated by the "|" symbol.

In [None]:
full_content_dict = dict()

with open("curie1.txt") as cur:
    for e in enumerate(cur):
        entry = e[1].split("|")
        full_content_dict[e[0]] = (entry[0], entry[1].replace("\n", ""))

content_array = [i[0] for i in full_content_dict.values()]
categories_array = [i[1] for i in full_content_dict.values()]
external_id_array = [f"text_{i}" for i in full_content_dict.keys()]

This code creates the actual assets (in bulk) in our Kili project:

In [None]:
kili.append_many_to_dataset(
    project_id=project_id,
    content_array=content_array,
    external_id_array=external_id_array,
)

## Generating first predictions

Now, let's install and instantiate openai. You'll need your OpenAI organization id and your OpenAI API key to complete these steps.

In [None]:
%pip install --upgrade openai

In [None]:
import openai

openai.api_key = "<YOUR OPENAI API KEY>"
openai.organization = "<YOUR OPENAI ORGANIZATION ID>"

Based on OpenAI's documentation, this is the list of OpenAI's base models that can be fine-tuned (only GPT-3 models can be fine-tuned):

- Davinci –  *Most capable GPT-3 model. Can do any task the other models can do, often with higher quality*.
- Curie –  *Very capable, but faster and lower cost than Davinci*.
- Babbage – *Capable of straightforward tasks, very fast, and lower cost*.
- Ada – *Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost*.

The cost is an important factor here. According to OpenAI:

> Tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:
> - 1 token ~= 4 chars in English
> - 1 token ~= ¾ words
> - 100 tokens ~= 75 words

More information on tokens here: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.

Davinci is the most sophisticated of the lot and seems best fit for our needs, but it's also the most expensive ($0.0200 / 1K tokens) so if we're going to use more-or-less Twitter-length messages to train and test the model, with many fine-tuning iterations, the cost may be significantly higher than expected.
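To get a feel for the numbers, here is a rough back-of-the-envelope estimator based on the "1 token ~= 4 chars" rule of thumb quoted above. The function names are hypothetical, and the result is only an approximation: it ignores the prompt template, the completion tokens, and any separate training charges.

```python
def estimate_tokens(text):
    """Rough token count using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))


def estimate_cost_usd(texts, price_per_1k_tokens=0.02):
    """Approximate cost of sending `texts` once at the given per-1K-token price."""
    total_tokens = sum(estimate_tokens(text) for text in texts)
    return total_tokens / 1000 * price_per_1k_tokens
```

For example, 800 messages of 280 characters each come to roughly 56,000 tokens, or about $1.12 per pass at the Davinci price quoted above.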

OpenAI suggests using Ada for classification:

> Classifiers are the easiest models to get started with. For classification problems we suggest using ada, which generally tends to perform only very slightly worse than more capable models once fine-tuned, whilst being significantly faster and cheaper.

... but for some reason it kept failing when we were testing it at our end. So we decided to try and use the second-best Curie model, as it seemed best value-for-money for our specific POC-type task.

You can, of course, try it out and experiment with it yourself.

Right, so with that behind us, let's now try to write some code that gets actual predictions from the base Curie model:

In [None]:
prompt_text = """Classify the text of the following message as exactly one of the following:
MEDIA AND ENTERTAINMENT, WORLD NEWS, CULTURE AND ARTS, SCIENCE AND TECHNOLOGY,
SPORTS, POLITICS, MONEY AND BUSINESS, STYLE AND BEAUTY.
message: <SAMPLE_MESSAGE_TO_CLASSIFY>.
text: """


def classify_text(content, model):
    classification = openai.Completion.create(
        model=model,
        engine="text_curie_001",
        prompt=prompt_text.replace("<SAMPLE_MESSAGE_TO_CLASSIFY>", content),
        max_tokens=10,
        temperature=0,
    )

    # Based on OpenAI's tokenizer, https://platform.openai.com/tokenizer,
    # our longest class is 6-tokens long.
    # Just in case, we've set the max_tokens value to 10

    returned_value = classification["choices"][0]["text"]  # type: ignore

    all_predefined_classes = [
        "POLITICS",
        "MEDIA AND ENTERTAINMENT",
        "WORLD NEWS",
        "CULTURE AND ARTS",
        "SCIENCE AND TECHNOLOGY",
        "SPORTS",
        "MONEY AND BUSINESS",
        "STYLE AND BEAUTY",
    ]

    result = "UNABLE_TO_CLASSIFY"

    # Sometimes the model returns more than one class, so let's keep only
    # the class that appears first in the returned text:

    upper_value = returned_value.upper()
    index = len(upper_value)
    for predefined_class in all_predefined_classes:
        position = upper_value.find(predefined_class)
        if position != -1 and position < index:
            index = position
            result = predefined_class

    return result

In [None]:
sample_message_to_classify = (
    "Golden Globes Returning To NBC In January After Year Off-Air. For the past 18 months,"
    " Hollywood has effectively boycotted the Globes after reports that the HFPA's 87 members of"
    " non-American journalists included no Black members."
)
# sample_message_to_classify_2 = "Amazon Greenlights 'Blade Runner 2099' Limited Series Produced By Ridley Scott. The director of the original 1982 film joins a writer of the 2017 sequel for the newest installment in the sci-fi franchise."
# sample_message_to_classify_3 = "Tips For Hand-Washing Clothes In The Tub. Doing laundry at home during the coronavirus pandemic? Experts share their tried-and-true ways to clean clothing by hand."

classify_text(content=sample_message_to_classify, model="curie")
# classify_text(content=sample_message_to_classify_2, model="curie")
# classify_text(content=sample_message_to_classify_3, model="curie")

'CULTURE AND ARTS'

OK, at first glance the base Curie model seems to work fine. Let's now try to create 800 predictions and upload them to Kili.

In [None]:
predicted_classes_dict = dict()

for entry in full_content_dict:
    content = full_content_dict[entry][0]

    # For some weird reason, the model doesn't want to predict on some of the
    # messages. We'll need to find a way around this limitation:

    try:
        predicted_class = classify_text(content=content, model="curie")
    except Exception as e:
        print(f"getting predictions on entry # {entry} failed.\nException: {e}")
        predicted_class = "UNABLE_TO_CLASSIFY"
        print(predicted_class)

    predicted_classes_dict[entry] = predicted_class

As you can see in the code above, trying to assign a class to some of the content fails for reasons we couldn't pin down. We've added an exception handler that labels these entries as "UNABLE_TO_CLASSIFY".
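If some of these failures turn out to be transient (rate limits or timeouts, for example), retrying the call before giving up may recover part of them. The helper below is a minimal sketch of retrying with exponential backoff; `call_with_retries` is a hypothetical name, and since we don't know why the calls failed, there's no guarantee that retries would help in this particular case.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            sleep(delay)
```

Inside the prediction loop, you could wrap the call like this: `call_with_retries(lambda: classify_text(content=content, model="curie"))`, keeping the existing `except` branch as the final fallback.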

Now we'll upload the model-generated labels to the assets that we uploaded to Kili earlier on.
For more information on Kili's INFERENCE-type labels, see https://docs.kili-technology.com/docs/asset-lifecycle#inference.

In [None]:
inference_cat_array = ["UNABLE_TO_CLASSIFY"] * 800
# This is just in case, really.
# 100% of the UNABLE_TO_CLASSIFY labels should have been handled by the code that queried the model earlier on.
# Setting the length of this list to 800, which is the length of the dataset we're working with (8 categories, 100 examples per each category)

for i in range(len(predicted_classes_dict)):
    inference_cat_array[i] = predicted_classes_dict[i]

# Here, we'll dynamically generate an array of labels so that we can upload them to Kili in bulk.
kili.append_labels(
    json_response_array=[
        {
            "CLASSIFICATION_JOB": {
                "categories": [
                    {
                        "confidence": 100,
                        "name": i.replace(" ", "_").replace("&", "AND").replace("\n", ""),
                    }
                ]
            }
        }
        for i in inference_cat_array
    ],
    model_name="Curie",
    label_type="INFERENCE",
    project_id=project_id,
    asset_external_id_array=external_id_array,
)

## Simulating human labeling in a Kili project

Now that we have our assets labeled by the model, we need to ask humans to add their own labels. We'll simulate this here by adding a bunch of labels from our pre-labeled dataset. 

In [None]:
kili.append_labels(
    json_response_array=[
        {
            "CLASSIFICATION_JOB": {
                "categories": [
                    {
                        "confidence": 100,
                        "name": i.replace(" ", "_").replace("&", "AND").replace("\n", ""),
                    }
                ]
            }
        }
        for i in categories_array
    ],
    label_type="DEFAULT",
    project_id=project_id,
    asset_external_id_array=external_id_array,
)

OK, this is just a simulation. In a real-world project where a lot is at stake and you want to avoid situations where different labelers assign different classes to ambiguous content, we'd recommend using one of the numerous Kili QA settings.

You could start off by scrutinizing your dataset and creating very detailed instructions that take into account possible corner cases.
Then you'd want to set up and then closely monitor honeypot and consensus metrics to trace how much your labelers agree with each other and how far away from ground truth they can be. If you pair this with Kili's powerful QA plugins, you can even set up a system that automatically adds issues to labels that differ from the *gold standard*.
On top of that, you'd want to monitor the questions being asked by your labelers and adjust your instructions accordingly.
Kili offers a ton of options and we strongly encourage you to explore them: https://docs.kili-technology.com/docs/best-practices-for-quality-workflow.


Let's now use Kili's KPIs to check how well the model did when compared to our *ground truth*. This is how it's calculated for single-category classification tasks:

1. Take all the selected categories. If the available categories were "MEDIA AND ENTERTAINMENT" and "POLITICS", and all the labelers selected the same category, this value will be 1. If two different categories were selected, the value will be 2.
2. Iterate through the selected categories:
For each labeler who selected a category, we calculate 1 / total number of labelers. For 2 labelers who selected "MEDIA AND ENTERTAINMENT", this value will be 1/2 for each one of them, which gives us a total score of 1 for this category. If one labeler selected "MEDIA AND ENTERTAINMENT" and the model selected "POLITICS", both of these categories will have a score of 1/2
3. Add all the scores per category and then divide by the number of selected categories. In our case, if the model and our simulated *ground truth* labelers both pointed to the same class, the score will be 1 / 1 = 1 (100%). If two different classes were selected, the score will be 1/2 = 0.5 (50%).

So in essence, in our project, there are only two possible IoU scores we can have per asset: either 50% for differing results or 100% for aligned results.
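The three steps above can be sketched in a few lines of Python. This is only an illustration of the calculation as described here, not Kili's actual implementation, and `agreement_score` is a hypothetical name.

```python
from collections import Counter


def agreement_score(selected_categories):
    """Score for one asset, given the category chosen by each labeler (or model)."""
    counts = Counter(selected_categories)
    total_labelers = len(selected_categories)
    # Step 2: each labeler contributes 1 / total_labelers to the category they picked.
    per_category = [count / total_labelers for count in counts.values()]
    # Step 3: sum the per-category scores, then divide by the number of
    # distinct categories that were selected (step 1).
    return sum(per_category) / len(counts)
```

With one human label and one model label per asset, `agreement_score(["POLITICS", "POLITICS"])` gives 1.0 and `agreement_score(["MEDIA AND ENTERTAINMENT", "POLITICS"])` gives 0.5, matching the two cases described above.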

Now, let's count how many assets in our dataset have differing results and try and get a percentage score:

In [None]:
# inference_mark_lte ("lower than or equal to") filters assets whose agreement score is at most 50%

low_scores = kili.count_assets(project_id=project_id, inference_mark_lte=0.50)
low_scores

671

The total number of assets was 800 (100 per category). So the total score is:

In [None]:
initial_score = (len(content_array) - low_scores) / len(content_array)
initial_score

0.16125

When we ran the test, 671 examples were wrongly classified. This means that the model was only right 16% of the time.

Since we have 8 classes, a random "dumb" model would be able to have 1/8 = 12.5% accuracy. This means that our base LLM is just a little bit smarter than a dumb, untrained model.

We'll need to fine-tune the model to make this score as high as possible.
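A single overall number doesn't tell us which classes the model struggles with. As a complement to the Kili metric, the sketch below computes per-class accuracy locally from two parallel lists of labels. The helper names are hypothetical, and the `normalize` step assumes that labels may differ only in case, in `&` versus `AND`, and in spaces versus underscores.

```python
from collections import Counter


def normalize(label):
    """Make 'MEDIA & ENTERTAINMENT' and 'MEDIA_AND_ENTERTAINMENT' comparable."""
    return label.strip().upper().replace("&", "AND").replace(" ", "_")


def per_class_accuracy(predictions, truths):
    """Return {class: fraction of its examples that were predicted correctly}."""
    totals, correct = Counter(), Counter()
    for predicted, truth in zip(predictions, truths):
        truth_key = normalize(truth)
        totals[truth_key] += 1
        if normalize(predicted) == truth_key:
            correct[truth_key] += 1
    return {name: correct[name] / totals[name] for name in totals}
```

In this notebook, you could call it as `per_class_accuracy(inference_cat_array, categories_array)` to see where the base model goes wrong most often.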

## Fine-tuning a base model

We'll first have to export our labels from the Kili project. As in this case we don't need anything fancy, let's just select the "raw" format:

In [None]:
kili.export_labels(project_id=project_id, filename="kili_export.zip", fmt="raw")

For more information on exporting annotations, see: https://python-sdk-docs.kili-technology.com/2.135/sdk/label/#kili.entrypoints.queries.label.__init__.QueriesLabel.export_labels. For more information on supported export formats and general info on exporting with Kili, see https://docs.kili-technology.com/docs/exporting-project-data.

To make this a bit simpler, you can also retrieve the assets and labels directly using `kili.assets` or `kili.labels`.

Let's now extract all the exported labels:

In [None]:
!unzip kili_export.zip

To be able to use this data to fine-tune our base model, we'll need to upload it to the OpenAI instance as a JSONL document, where each line is a prompt-completion pair corresponding to a training example, ending with the newline symbol.

> To fine-tune a model, you'll need a set of training examples that each consist of a single input ("prompt") and its associated output ("completion"). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.


> Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.


> Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.


> Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.


> For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.


For more information, see OpenAI's documentation here: https://platform.openai.com/docs/guides/fine-tuning.
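Before uploading anything, it can be worth checking that every training record follows the formatting rules quoted above. The sketch below builds one record and lists any rule it breaks. The helper names are hypothetical, and `[PROMPT_STOP]` and `[END]` are simply the separator and stop sequence we chose for this tutorial; any other strings that never appear in your data would do.

```python
import json

PROMPT_SEPARATOR = "[PROMPT_STOP]"  # separator chosen for this tutorial
STOP_SEQUENCE = "[END]"  # stop sequence chosen for this tutorial


def make_record(text, label):
    """Build one prompt-completion pair following the formatting rules."""
    return {
        "prompt": f"{text}{PROMPT_SEPARATOR}",
        "completion": f" {label}{STOP_SEQUENCE}",
    }


def find_format_problems(record):
    """Return a list of human-readable problems with one training record."""
    problems = []
    prompt, completion = record["prompt"], record["completion"]
    if not prompt.endswith(PROMPT_SEPARATOR):
        problems.append("prompt does not end with the separator")
    if prompt.count(PROMPT_SEPARATOR) > 1:
        problems.append("separator appears more than once in the prompt")
    if not completion.startswith(" "):
        problems.append("completion does not start with a whitespace")
    if not completion.endswith(STOP_SEQUENCE):
        problems.append("completion does not end with the stop sequence")
    return problems


record = make_record("Some headline. Some description.", "SPORTS")
print(json.dumps(record))  # one line of the JSONL file
print(find_format_problems(record))  # an empty list means the record looks fine
```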

In [None]:
import json

field_path = ["latestLabel", "jsonResponse", "CLASSIFICATION_JOB", "categories"]


def get_nested_field(json_data, field_path):
    current_data = json_data
    try:
        for field in field_path:
            current_data = current_data[field]
        return current_data
    except (KeyError, TypeError):
        return None


# Write one prompt-completion pair per line (JSONL format).
with open("kili-fine-tune.jsonl", "w") as kc:
    for external_id in external_id_array:
        with open(f"labels/{external_id}.json") as file:
            json_data = json.load(file)

        # External ids look like "text_<n>", where <n> indexes into content_array.
        ordinal = int(external_id.split("_")[1])
        nested_field = get_nested_field(json_data, field_path)
        exported_class = nested_field[0]["name"]

        prompt_j = {
            "prompt": f"{content_array[ordinal]}[PROMPT_STOP]",
            "completion": f" {exported_class}[END]",
        }
        kc.write(f"{json.dumps(prompt_j)}\n")

In [None]:
fine_tune_file = "/content/kili-fine-tune.jsonl"

In [None]:
!head /content/kili-fine-tune.jsonl

Now, we'll save this file to the OpenAI instance so that it can be used for fine-tuning:

In [None]:
openai.File.create(
    file=open(fine_tune_file, "r"),
    purpose="fine-tune",
    user_provided_filename="kili-fine-tune1.jsonl",
)

<File file id=file-16879c80ed3b4ae3a5fe3932a0c30ff6 at 0x7fb2a0ab58a0> JSON: {
  "bytes": 196213,
  "created_at": 1685719658,
  "filename": "kili-fine-tune1.jsonl",
  "id": "file-16879c80ed3b4ae3a5fe3932a0c30ff6",
  "object": "file",
  "purpose": "fine-tune",
  "status": "notRunning",
  "updated_at": 1685719658
}

Let the fine-tuning begin!

In [None]:
kili_ft = openai.FineTune.create(
    training_file="file-16879c80ed3b4ae3a5fe3932a0c30ff6", model="curie"
)

kili_ft

<FineTune fine-tune id=ft-8e922209fe7a4298871a11f26da59ed1 at 0x7fb2a0d23b50> JSON: {
  "created_at": 1685719688,
  "events": [
    {
      "created_at": 1685719688,
      "level": "info",
      "message": "Job enqueued. Waiting for jobs ahead to complete.",
      "object": "fine-tune-event"
    }
  ],
  "hyperparams": {
    "batch_size": 1,
    "compute_classification_metrics": false,
    "learning_rate_multiplier": 0.2,
    "n_epochs": 2,
    "prompt_loss_weight": 0.1
  },
  "id": "ft-8e922209fe7a4298871a11f26da59ed1",
  "model": "curie",
  "object": "fine-tune",
  "status": "notRunning",
  "training_files": [
    {
      "created_at": 1685719658,
      "filename": "m+nhs1g1ThJnWJwIlGQgkeEny1cpTnS6IrutA9XJco9xxYg658QM51gbmmgGpAFr",
      "id": "file-16879c80ed3b4ae3a5fe3932a0c30ff6",
      "object": "file",
      "purpose": "fine-tune",
      "statistics": {
        "examples": 800,
        "tokens": 23835
      },
      "status": "succeeded",
      "updated_at": 1685719672
    }
  ]

Fine-tuning can take a while, so you should monitor your fine-tuning job and wait for it to end:

In [None]:
import time

events_gen = openai.FineTune.stream_events(id="ft-8e922209fe7a4298871a11f26da59ed1")

output = None  # last event received so far
for _ in range(1000):
    try:
        output = next(events_gen)
        print(output)
    except StopIteration:
        # No more events for now; stop once the final billing event has been seen.
        if "Training hours billed" in str(output):
            break
        time.sleep(10)

{
  "created_at": 1685719688,
  "level": "info",
  "message": "Job enqueued. Waiting for jobs ahead to complete.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685719767,
  "level": "info",
  "message": "Job started.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685719819,
  "level": "info",
  "message": "Preprocessing started.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685720229,
  "level": "info",
  "message": "Training started.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685720537,
  "level": "info",
  "message": "Created results file: file-c0dfab7764ca4963998f16b938f3dfea",
  "object": "fine-tune-event"
}
{
  "created_at": 1685721291,
  "level": "info",
  "message": "Postprocessing started.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685722399,
  "level": "info",
  "message": "Job succeeded.",
  "object": "fine-tune-event"
}
{
  "created_at": 1685722424,
  "level": "info",
  "message": "Completed results file: file-c0dfab7764ca4963998f16b938

To get our fine-tuned model ID, we'll need to get the full list of models. Our model will be the last one in the models list:

In [None]:
openai.Model.list()

<OpenAIObject list at 0x7fb2a0b7ecf0> JSON: {
  "data": [
    {
      "capabilities": {
        "completion": true,
        "embeddings": false,
        "fine_tune": true,
        "inference": false,
        "scale_types": [
          "manual"
        ]
      },
      "created_at": 1646092800,
      "deprecation": {
        "fine_tune": 1709251200,
        "inference": 1709251200
      },
      "id": "ada",
      "lifecycle_status": "preview",
      "object": "model",
      "status": "succeeded",
      "updated_at": 1646092800
    },
    {
      "capabilities": {
        "completion": true,
        "embeddings": false,
        "fine_tune": true,
        "inference": false,
        "scale_types": [
          "manual"
        ]
      },
      "created_at": 1646092800,
      "deprecation": {
        "fine_tune": 1709251200,
        "inference": 1709251200
      },
      "id": "babbage",
      "lifecycle_status": "preview",
      "object": "model",
      "status": "succeeded",
      "updat

In [None]:
tuned_model_id = "curie.ft-8e922209fe7a4298871a11f26da59ed1"

Let's now test the newly fine-tuned model. To test the model, we'll use some completely fresh data.

In [None]:
sample_text = (
    "'Lost Daughter,' 'Drive My Car' Win Top Prizes At Independent Spirit Awards. The show's hosts"
    " and honorary chair also spoke about the war in Ukraine.[PROMPT_STOP]."
)
# sample_text2 = "Most Of Beijing To Be Tested For COVID-19 Amid Lockdown Worry. While only 70 cases have been found since the outbreak surfaced, authorities have followed a “zero-COVID” approach to try to prevent a further spread of the virus."

tuned_model_class = classify_text(content=sample_text, model=tuned_model_id)
tuned_model_class

'CULTURE AND ARTS'
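OpenAI's guidance quoted earlier says that, for inference, prompts should be formatted the same way as the training data and should use the same stop sequence. Our training prompts ended with `[PROMPT_STOP]`, and the completions were Kili category names with underscores followed by `[END]`. The helpers below are a hypothetical sketch of inference-side formatting that mirrors that layout; they are not the `classify_text` function used in this notebook, and we haven't measured how much they would change the results.

```python
PROMPT_SEPARATOR = "[PROMPT_STOP]"  # same separator as in the training file
STOP_SEQUENCE = "[END]"  # same stop sequence as in the training file

# Category names as they appear in the training completions.
KNOWN_CLASSES = {
    "MEDIA_AND_ENTERTAINMENT",
    "WORLD_NEWS",
    "CULTURE_AND_ARTS",
    "SCIENCE_AND_TECHNOLOGY",
    "SPORTS",
    "POLITICS",
    "MONEY_AND_BUSINESS",
    "STYLE_AND_BEAUTY",
}


def build_inference_prompt(content):
    """Format a message exactly like the prompts in the training file."""
    return f"{content}{PROMPT_SEPARATOR}"


def parse_completion(text):
    """Extract the class name from the raw completion text."""
    label = text.split(STOP_SEQUENCE)[0].strip().upper()
    return label if label in KNOWN_CLASSES else "UNABLE_TO_CLASSIFY"
```

To try this out, you would send `build_inference_prompt(content)` as the prompt, pass the stop sequence through the completion endpoint's `stop` parameter, and run the returned text through `parse_completion`.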

Let's create a new batch of assets in our project and check how well the fine-tuned model is doing against *ground truth*. To make it simpler, we'll take a random set of 100 new assets and benchmark the model's predictions on them against our *ground truth*.


In [None]:
import random

content_dict_new = dict()

# Start at 801 to assign new external ids to an existing project
with open("curie2.txt") as cur:
    for index, line in enumerate(cur, 801):
        entry = line.split("|")
        content_dict_new[index] = (entry[0], entry[1].replace("\n", ""))

hundred_random_entries = random.sample(range(801, 801 + len(content_dict_new)), 100)

content_array_new = [
    content_dict_new[i][0] for i in content_dict_new if i in hundred_random_entries
]
categories_array_new = [
    content_dict_new[i][1] for i in content_dict_new if i in hundred_random_entries
]
external_id_array_new = [f"text_{i}" for i in range(801, 901)]

In [None]:
kili.append_many_to_dataset(
    project_id=project_id,
    content_array=content_array_new,
    external_id_array=external_id_array_new,
)

## Generating predictions on new assets using the fine-tuned model

In [None]:
predicted_classes_dict_new = dict()

for entry in content_dict_new:
    if int(entry) in hundred_random_entries:
        content = content_dict_new[entry][0]
        try:
            predicted_class = classify_text(content=content, model=tuned_model_id)
        except Exception as e:
            print(f"getting predictions on entry # {entry} failed.\nException: {e}")
            predicted_class = "UNABLE_TO_CLASSIFY"
            print(predicted_class)

        predicted_classes_dict_new[entry] = predicted_class

Now we add our predicted labels to our newly-added Kili assets:

In [None]:
inference_cat_array_new = list()

for entry in predicted_classes_dict_new:
    inference_cat_array_new.append(predicted_classes_dict_new[entry])

kili.append_labels(
    json_response_array=[
        {
            "CLASSIFICATION_JOB": {
                "categories": [
                    {
                        "confidence": 100,
                        "name": i.replace(" ", "_").replace("&", "AND").replace("\n", ""),
                    }
                ]
            }
        }
        for i in inference_cat_array_new
    ],
    model_name="Curie",
    label_type="INFERENCE",
    project_id=project_id,
    asset_external_id_array=external_id_array_new,
)

## Summary

Let's simulate human labeling now, to get ground truth labels:

In [None]:
kili.append_labels(
    json_response_array=[
        {
            "CLASSIFICATION_JOB": {
                "categories": [
                    {
                        "confidence": 100,
                        "name": i.replace(" ", "_").replace("&", "AND").replace("\n", ""),
                    }
                ]
            }
        }
        for i in categories_array_new
    ],
    model_name="Ground Truth",
    label_type="DEFAULT",
    project_id=project_id,
    asset_external_id_array=external_id_array_new,
)

Let's calculate the IoU score for the new 100 assets:

In [None]:
low_scores = kili.count_assets(
    project_id=project_id, inference_mark_lte=0.51, external_id_strictly_in=external_id_array_new
)


score_after_finetuning = (len(content_array_new) - low_scores) / len(content_array_new)
score_after_finetuning

0.19

In our case, the original score was ~16%, so we may technically be heading in the right direction, but properly fine-tuning the model would probably require more samples and a few more iterations. You'd need to label another ~800 assets, export them from Kili, save them to the OpenAI instance, fine-tune some more, and see if it helps.

If you want to further fine-tune your fine-tuned model, you need to specify that in code:

In [None]:
kili_ft = openai.FineTune.create(
    training_file="file-16879c80ed3b4ae3a5fe3932a0c30ff6", model=tuned_model_id
)

kili_ft