<a href="https://colab.research.google.com/github/kili-technology/kili-python-sdk/blob/master/recipes/ner_pre_annotations_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing OpenAI NER pre-annotations

with the OpenAI `text-davinci-003` model

## Introduction

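This recipe uses the OpenAI `text-davinci-003` model in a zero-shot setting to predict named entities, then imports those predictions into Kili as pre-annotations.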


## Setup

In [None]:
!pip install kili datasets evaluate ipywidgets "openai<1"

In [None]:
import os
import getpass
import json
import requests
import uuid
import numpy as np
from pprint import pprint
from tqdm import tqdm

## Data preparation
The pipeline is: load the CoNLL-2003 dataset from Hugging Face, keep the first N examples, get OpenAI predictions from a prompt, and convert them to Kili predictions.



In [None]:
from datasets import load_dataset

In [None]:
N = 10

In [None]:
dataset = load_dataset("conll2003", split="train").filter(
    lambda datapoint: int(datapoint["id"]) < N
)

In [None]:
print(dataset)

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 10
})


In [None]:
print(dataset[0])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}


Here is the meaning of each feature in the dataset:

    id: A unique identifier for each sentence (example) in the dataset.
    tokens: The tokens (words or punctuation marks) in a sentence.
    pos_tags: Part-of-speech tags for each token in the sentence. Part-of-speech tagging is the process of assigning a tag to each word in a sentence that indicates its part of speech (e.g., noun, verb, adjective, etc.).
    chunk_tags: Chunking tags for each token in the sentence. Chunking is the process of grouping words into meaningful phrases based on their syntactic structure.
    ner_tags: Named Entity Recognition (NER) tags for each token in the sentence. NER is the task of identifying named entities in text and classifying them into pre-defined categories such as person, organization, location, etc.

In [None]:
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

NER_TAGS_ONTOLOGY is a dictionary that maps the named entity tags in the CoNLL2003 dataset to integer labels. Here is the meaning of each key-value pair in the dictionary:

    "O": 0: Represents the tag "O" which means that the token is not part of a named entity.
    "B-PER": 1: Represents the beginning of a person named entity.
    "I-PER": 2: Represents a token inside a person named entity.
    "B-ORG": 3: Represents the beginning of an organization named entity.
    "I-ORG": 4: Represents a token inside an organization named entity.
    "B-LOC": 5: Represents the beginning of a location named entity.
    "I-LOC": 6: Represents a token inside a location named entity.
    "B-MISC": 7: Represents the beginning of a miscellaneous named entity.
    "I-MISC": 8: Represents a token inside a miscellaneous named entity.

During the training of a Named Entity Recognition model, the entity tags are typically converted to integer labels using a dictionary like NER_TAGS_ONTOLOGY. This allows the model to predict the integer labels during training and inference, instead of predicting the string tags directly.
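
To read the integer labels back as tags, the mapping can be inverted. The sketch below is only an illustration: the helper name `decode_ner_tags` is hypothetical, and the sample values are copied from the `dataset[0]` output shown earlier.

```python
# Mapping from tag name to integer label, as defined above.
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

# Inverse mapping: integer label -> tag name.
ID_TO_TAG = {index: tag for tag, index in NER_TAGS_ONTOLOGY.items()}


def decode_ner_tags(tag_ids):
    """Convert a list of integer labels to their IOB tag names (hypothetical helper)."""
    return [ID_TO_TAG[tag_id] for tag_id in tag_ids]


# Values taken from the first example of the dataset.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = decode_ner_tags(ner_tags)
print(list(zip(tokens, decoded)))
```

In this example, "EU" is decoded as `B-ORG`, while "German" and "British" are decoded as `B-MISC`.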

## Connect with ChatGPT API

In [None]:
class ChatGptClient:
    def __init__(self, authorization, cookie):
        self._headers = {
            "Authorization": authorization,
            "Content-Type": "application/json",
            "Cookie": cookie,
            "User-Agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like"
                " Gecko) Chrome/110.0.0.0 Safari/537.36"
            ),
        }
        self._conversation_id = None
        self._parent_message_id = None

    def ask(self, text):
        """
        Send the prompt to ChatGPT.
        """
        question_data = {
            "action": "next",
            "messages": [
                {
                    "id": str(uuid.uuid4()),
                    "author": {"role": "user"},
                    "role": "user",
                    "content": {
                        "content_type": "text",
                        "parts": [text],
                    },
                }
            ],
            "parent_message_id": str(uuid.uuid4())
            if self._parent_message_id is None
            else self._parent_message_id,
            "model": "text-davinci-003",
        }
        self._parent_message_id = question_data["messages"][0]["id"]
        if self._conversation_id is not None:
            question_data["conversation_id"] = self._conversation_id
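        url = "https://chat.openai.com/backend-api/conversation"
        response = requests.post(url=url, headers=self._headers, json=question_data)
        # Fail early on HTTP errors (for example, expired credentials).
        response.raise_for_status()
        # The reply is streamed line by line; keep only the lines that carry message text.
        response_parts = [
            d
            for d in response.content.decode("utf-8").split("\n")
            if '"content": {"content_type": "text", "parts": [' in d
        ]
        if not response_parts:
            raise ValueError("No message found in the ChatGPT response.")
        # The last matching line holds the complete answer; skip its 6-character
        # prefix (presumably "data: ") before parsing the JSON payload.
        response_last_part = response_parts[-1][6:]
        response_data = json.loads(response_last_part)
        self._conversation_id = response_data["conversation_id"]
        return response_data["message"]["content"]["parts"][0]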

    def reset_conversation(self):
        self._conversation_id = None
        self._parent_message_id = None


## Evaluation
Show that the accuracy of OpenAI Davinci is good enough for its predictions to serve as pre-annotations.

In [None]:
import evaluate

In [None]:
accuracy_metric = evaluate.load("accuracy")
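
For reference, token-level accuracy is the fraction of positions where the predicted label equals the reference label. The plain-Python sketch below illustrates that computation without the `evaluate` library. The function name `token_accuracy` and the prediction values are made up for the example.

```python
def token_accuracy(references, predictions):
    """Fraction of positions where the prediction equals the reference."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(ref == pred for ref, pred in zip(references, predictions))
    return correct / len(references)


# Reference labels of the first dataset example.
references = [3, 0, 7, 0, 0, 0, 7, 0, 0]
# Made-up prediction that misses the last MISC entity.
predictions = [3, 0, 7, 0, 0, 0, 0, 0, 0]
# Baseline that predicts "O" (label 0) for every token.
all_o_baseline = [0] * len(references)

print(token_accuracy(references, predictions))
print(token_accuracy(references, all_o_baseline))
```

Because most tokens are tagged `O`, the all-`O` baseline already scores 6 out of 9 on this sentence, so accuracy alone can overstate the quality of the entity predictions.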

The completion request takes the following parameters:

model: The model to use. The available models are listed at https://platform.openai.com/docs/models/overview

temperature: The sampling temperature, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.

max_tokens: The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).

prompt: The prompt to complete, given as a string or an array.
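
The constraints described above can be checked before sending a request. The sketch below builds the keyword arguments only and makes no API call. The helper name `build_completion_kwargs` is hypothetical, and array prompts are restricted to lists of strings to keep the sketch short.

```python
def build_completion_kwargs(prompt, model="text-davinci-003", temperature=0.0, max_tokens=100):
    """Validate the request parameters and return them as a dictionary."""
    is_string = isinstance(prompt, str)
    is_string_list = isinstance(prompt, (list, tuple)) and all(
        isinstance(item, str) for item in prompt
    )
    if not (is_string or is_string_list):
        raise TypeError("prompt must be a string or a list of strings")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


# A temperature of 0 keeps the output as deterministic as possible.
kwargs = build_completion_kwargs("Say 'this is a test'")
print(kwargs)
```

The resulting dictionary can then be unpacked into the completion call, for example `openai.Completion.create(**kwargs)`.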

In [None]:
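import openai  # the legacy `openai.Completion` interface requires openai<1.0

# The API key is read from the OPENAI_API_KEY environment variable.
response = openai.Completion.create(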
    model="text-davinci-003", prompt="Say 'this is a test'", temperature=0, max_tokens=100
)

In [None]:
print(response)


## Import to Kili
Import both assets and predictions and show a few screenshots: a good example and a bad example.
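
Before predictions can be imported, the IOB tags have to be turned into entity spans with character offsets. The sketch below shows one way to do this, assuming the asset text is the tokens joined by single spaces. The annotation layout (`categories`, `beginOffset`, `content`) and the job name `NER_JOB` are assumptions; check them against the Kili documentation and your project's ontology.

```python
def iob_to_annotations(tokens, iob_tags):
    """Group IOB-tagged tokens into entity spans with character offsets."""
    annotations = []
    current = None  # entity span currently being extended, if any
    offset = 0  # character offset of the current token in " ".join(tokens)
    for token, tag in zip(tokens, iob_tags):
        prefix, _, category = tag.partition("-")
        continues_current = (
            prefix == "I"
            and current is not None
            and current["categories"][0]["name"] == category
        )
        if continues_current:
            current["content"] += " " + token
        elif prefix in ("B", "I"):
            # Start a new span; a stray "I-" tag is treated like a "B-" tag.
            current = {
                "categories": [{"name": category}],
                "beginOffset": offset,
                "content": token,
            }
            annotations.append(current)
        else:
            current = None
        offset += len(token) + 1  # +1 for the joining space
    return annotations


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
iob_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]

annotations = iob_to_annotations(tokens, iob_tags)
# "NER_JOB" is a hypothetical job name; use the one defined in your project.
json_response = {"NER_JOB": {"annotations": annotations}}
print(json_response)
```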

In [None]:
if "KILI_API_KEY" in os.environ:
    KILI_API_KEY = os.environ["KILI_API_KEY"]
else:
    KILI_API_KEY = getpass.getpass("Please enter your Kili API key: ")

In [None]:
from kili.client import Kili

In [None]:
kili = Kili(
    api_key=KILI_API_KEY,  # no need to pass the API_KEY if it is already in your environment variables
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)


class LabelGPT:
    def __init__(self, chatgpt, instructions):
        self._chatgpt = chatgpt
        self._instructions = instructions

    def __get_tokens_index(self, text_tokens, entity_tokens):
        index = list(range(len(text_tokens)))
        for i, token in enumerate(entity_tokens):
            token_index = list(np.where(np.array(text_tokens) == token)[0] - i)
            index = list(set(index) & set(token_index))
        return index

    def ask_iob(self, tokens):
        text = " ".join(tokens)
        question = f"{self._instructions}\n\n{text}"
        response = self._chatgpt.ask(question)
        self._chatgpt.reset_conversation()
        ner_tags = ["O"] * len(tokens)
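        try:
            response_json = json.loads(response)
        except json.JSONDecodeError:
            # The model did not return valid JSON: fall back to all-"O" tags.
            return ner_tags
        if not isinstance(response_json, dict):
            return ner_tags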
        for key, values in response_json.items():
            for value in values:
                entity_tokens = value.split(" ")
                entity_index = self.__get_tokens_index(
                    text_tokens=tokens,
                    entity_tokens=entity_tokens,
                )
                for i_1 in entity_index:
                    for i_2, token in enumerate(entity_tokens):
                        prefix = "B" if i_2 == 0 else "I"
                        ner_tags[i_1 + i_2] = prefix + "-" + key
        return ner_tags


def main():
    authorization = "Bearer eyJhbGciOiJSUzI1NiIsIn..."
    cookie = "__Host-next-auth.csrf-token=45d80225..."
    chatgpt = ChatGptClient(authorization, cookie)
    instructions = """Give, for the following text, the list of:
- organisation names
- location names
- people names
- miscellaneous entity names.
The output should be formatted as a json with the following keys:
- ORG for organisation names
- LOC for location names
- PER for people names
- MISC for miscellaneous.
"""
    dataset = load_dataset("conll2003", split="train").filter(
        lambda x: int(x["id"]) < 10
    )
    labelgpt = LabelGPT(chatgpt, instructions)
    accuracy_metric = evaluate.load("accuracy")
    references = []
    predictions = []
    for d in tqdm(dataset):
        tokens = d["tokens"]
        iob = labelgpt.ask_iob(tokens)
        pprint(list(zip(tokens, iob)))
        predictions += [NER_TAGS_ONTOLOGY.get(e, 0) for e in iob]
        references += d["ner_tags"]
    assert len(predictions) == len(references)
    results = accuracy_metric.compute(references=references, predictions=predictions)
    print(results)


if __name__ == "__main__":
    main()
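
The tag-assignment step in `LabelGPT.ask_iob` can be tested offline by feeding it an already parsed answer. The sketch below is a plain-Python variant of that step, not the class itself: the function name `entities_to_iob` and the sample answers are made up, and it uses list slicing instead of NumPy to find every occurrence of an entity.

```python
def entities_to_iob(tokens, entities):
    """Tag every occurrence of each entity mention with B-/I- prefixes."""
    tags = ["O"] * len(tokens)
    for category, mentions in entities.items():
        for mention in mentions:
            mention_tokens = mention.split(" ")
            width = len(mention_tokens)
            # Slide a window over the sentence and tag each exact match.
            for start in range(len(tokens) - width + 1):
                if tokens[start:start + width] == mention_tokens:
                    for position in range(width):
                        prefix = "B" if position == 0 else "I"
                        tags[start + position] = f"{prefix}-{category}"
    return tags


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
# Made-up model answer, already parsed from JSON.
entities = {"ORG": ["EU"], "LOC": [], "PER": [], "MISC": ["German", "British"]}

iob = entities_to_iob(tokens, entities)
print(list(zip(tokens, iob)))
```

An entity mentioned several times in the sentence is tagged at every occurrence, and a mention that cannot be matched token by token is silently ignored.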