<a href="https://colab.research.google.com/github/kili-technology/kili-python-sdk/blob/master/recipes/ner_pre_annotations_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing OpenAI NER pre-annotations

This recipe uses the OpenAI `text-davinci-003` model.

## Introduction

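This notebook uses the OpenAI `text-davinci-003` model in a zero-shot setting to produce named entity recognition (NER) pre-annotations, which are then imported into Kili.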


## Setup

In [None]:
!pip install kili datasets evaluate ipywidgets openai

In [None]:
import os
import getpass
import json
import requests
import uuid
import numpy as np
import openai
from pprint import pprint
from tqdm import tqdm
from typing import List

## Data preparation
The pipeline is: load the CoNLL-2003 dataset from Hugging Face, filter it, ask OpenAI for predictions with a prompt, and convert the answers into Kili predictions.



In [None]:
from datasets import load_dataset

In [None]:
MAX_DATAPOINTS = 5
MIN_NB_TOKENS_PER_SENTENCE = 9

In [None]:
dataset = load_dataset("conll2003", split="train").filter(
    lambda datapoint: int(datapoint["id"]) < MAX_DATAPOINTS
    and len(datapoint["tokens"]) >= MIN_NB_TOKENS_PER_SENTENCE
)


In [None]:
print(dataset)

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3
})


In [None]:
print(dataset[0])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}


Here is the meaning of each feature in the dataset:

- `id`: A unique identifier for each sentence (datapoint) in the dataset.
- `tokens`: The tokens (words or punctuation marks) in the sentence.
- `pos_tags`: Part-of-speech tags for each token in the sentence. Part-of-speech tagging assigns each word a tag that indicates its part of speech (noun, verb, adjective, etc.).
- `chunk_tags`: Chunking tags for each token in the sentence. Chunking groups words into meaningful phrases based on their syntactic structure.
- `ner_tags`: Named Entity Recognition (NER) tags for each token in the sentence. NER is the task of identifying named entities in text and classifying them into pre-defined categories such as person, organization, or location.

In [None]:
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-MISC": 7,
    "I-MISC": 8,
}

`NER_TAGS_ONTOLOGY` is a dictionary that maps the named entity tags in the CoNLL-2003 dataset to integer labels. Here is the meaning of each key-value pair in the dictionary:

- `"O": 0`: The token is not part of a named entity.
- `"B-PER": 1`: The beginning of a person named entity.
- `"I-PER": 2`: A token inside a person named entity.
- `"B-ORG": 3`: The beginning of an organization named entity.
- `"I-ORG": 4`: A token inside an organization named entity.
- `"B-LOC": 5`: The beginning of a location named entity.
- `"I-LOC": 6`: A token inside a location named entity.
- `"B-MISC": 7`: The beginning of a miscellaneous named entity.
- `"I-MISC": 8`: A token inside a miscellaneous named entity.

During the training of a Named Entity Recognition model, the entity tags are typically converted to integer labels using a dictionary like NER_TAGS_ONTOLOGY. This allows the model to predict the integer labels during training and inference, instead of predicting the string tags directly.
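
To read the integer labels back as tags, the mapping can be inverted. The sketch below redefines the dictionary so that it runs on its own; the names `ID_TO_TAG` and `decode_tags` are illustrative and are not part of the original notebook.

```python
from typing import Dict, List

NER_TAGS_ONTOLOGY: Dict[str, int] = {
    "O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4,
    "B-LOC": 5, "I-LOC": 6, "B-MISC": 7, "I-MISC": 8,
}

# Inverse mapping: integer label -> string tag (hypothetical helper name).
ID_TO_TAG: Dict[int, str] = {label: tag for tag, label in NER_TAGS_ONTOLOGY.items()}


def decode_tags(ner_tags: List[int]) -> List[str]:
    """Convert a list of integer NER labels into their string tags."""
    return [ID_TO_TAG[label] for label in ner_tags]


# The ner_tags of the first datapoint printed earlier in this notebook.
print(decode_tags([3, 0, 7, 0, 0, 0, 7, 0, 0]))
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
```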

## Connect to the OpenAI API

In [None]:
if "OPENAI_API_KEY" in os.environ:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
else:
    OPENAI_API_KEY = getpass.getpass("Please enter your OpenAI API key: ")

In [None]:
openai.api_key = OPENAI_API_KEY

The completion request takes the following parameters:

- `temperature`: The sampling temperature to use, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.
- `max_tokens`: The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).
- `prompt`: The prompt to generate a completion for, given as a string or an array of strings.

In [None]:
def ask_openai(prompt: str) -> str:
    openai_query_params = {"model": "text-davinci-003", "temperature": 0, "max_tokens": 1024}
    response = openai.Completion.create(
        prompt=prompt,
        **openai_query_params,
    )
    return response["choices"][0]["text"]

In [None]:
print(ask_openai("Hello, are you here?"))



Yes, I am here. How can I help you?


## Prompt creation

In [None]:
base_prompt = """In the sentence below, give me the list of:
- organisation names
- location names
- people names
- miscellaneous entity names.
Format the output in json with the following keys:
- ORG for organisation names
- LOC for location names
- PER for people names
- MISC for miscellaneous.
Sentence below:
"""

In [None]:
test_sentence = (
    "Elon Musk is the CEO of Tesla and SpaceX. He was born in South Africa and now lives in the"
    " USA. He is one of the founders of OpenAI."
)

In [None]:
print(ask_openai(base_prompt + test_sentence))



{
  "ORG": ["Tesla", "SpaceX", "OpenAI"],
  "LOC": ["South Africa", "USA"],
  "PER": ["Elon Musk"],
  "MISC": []
}
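
The completion is returned as plain text, so it has to be parsed before it can be used, and the model may wrap the JSON in blank lines or omit a key. The following is a minimal sketch of a tolerant parser, assuming the answer contains a single JSON object; `parse_answer` and `EXPECTED_KEYS` are hypothetical names introduced for illustration.

```python
import json
from typing import Dict, List

EXPECTED_KEYS = ("ORG", "LOC", "PER", "MISC")


def parse_answer(answer: str) -> Dict[str, List[str]]:
    """Extract the JSON object from a completion and normalise its keys."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in answer: {answer!r}")
    parsed = json.loads(answer[start : end + 1])
    # Keep only the expected keys; fall back to an empty list when one is missing or null.
    return {key: list(parsed.get(key) or []) for key in EXPECTED_KEYS}


# A completion with leading blank lines and a missing "MISC" key.
raw_answer = '\n\n{\n  "ORG": ["Tesla", "SpaceX", "OpenAI"],\n  "LOC": ["South Africa", "USA"],\n  "PER": ["Elon Musk"]\n}\n'
print(parse_answer(raw_answer))
```

Raising an error when no JSON object is found keeps malformed answers visible instead of silently turning them into empty predictions.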


## Create the pre-annotations

In [None]:
openai_answers = []
for datapoint in dataset:
    sentence = " ".join(datapoint["tokens"])
    answer = ask_openai(base_prompt + sentence)
    answer_json = json.loads(answer)
    openai_answers.append(answer_json)

In [None]:
openai_answers

[{'ORG': ['EU', 'German'], 'LOC': ['British'], 'PER': [], 'MISC': ['lamb']},
 {'ORG': ['European Commission'],
  'LOC': ['Germany', 'Britain'],
  'PER': [],
  'MISC': ['mad cow disease']},
 {'ORG': ['European Union', 'Britain'],
  'LOC': ['Germany'],
  'PER': ['Werner Zwingmann'],
  'MISC': []}]
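
These answers only list entity strings, whereas a NER pre-annotation also needs the position of each entity in the text. The sketch below locates every mention by plain substring search and builds one annotation per occurrence. The keys `categories`, `beginOffset` and `content` are assumed to follow Kili's NER JSON response format and should be checked against the Kili documentation; `JOB_NAME` and `to_annotations` are hypothetical names. Substring search is a simplification: a short mention such as "EU" would also match inside a longer word.

```python
from typing import Dict, List

JOB_NAME = "NER_JOB"  # hypothetical job name; use the one defined in your Kili project


def to_annotations(sentence: str, entities: Dict[str, List[str]]) -> List[dict]:
    """Locate every occurrence of each mention and describe it as an annotation."""
    annotations = []
    for category, mentions in entities.items():
        for mention in mentions:
            if not mention:
                continue  # skip empty strings, which would match everywhere
            start = sentence.find(mention)
            while start != -1:
                annotations.append(
                    {
                        "categories": [{"name": category}],
                        "beginOffset": start,
                        "content": mention,
                    }
                )
                start = sentence.find(mention, start + len(mention))
    return sorted(annotations, key=lambda annotation: annotation["beginOffset"])


sentence = "EU rejects German call to boycott British lamb ."
entities = {"ORG": ["EU", "German"], "LOC": ["British"], "PER": [], "MISC": ["lamb"]}
json_response = {JOB_NAME: {"annotations": to_annotations(sentence, entities)}}
print(json_response)
```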


## Import to Kili
Import both assets and predictions and show a few screenshots: a good example and a bad example.

In [None]:
if "KILI_API_KEY" in os.environ:
    KILI_API_KEY = os.environ["KILI_API_KEY"]
else:
    KILI_API_KEY = getpass.getpass("Please enter your Kili API key: ")

In [None]:
from kili.client import Kili

In [None]:
kili = Kili(
    api_key=KILI_API_KEY,  # no need to pass the API_KEY if it is already in your environment variables
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)




## Evaluation
Show that the accuracy of OpenAI Davinci is good enough for its predictions to serve as pre-annotations.

In [None]:
import evaluate

In [None]:
accuracy_metric = evaluate.load("accuracy")
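
One possible way to score the predictions is token-level accuracy on entity types, ignoring the `B-`/`I-` distinction. The sketch below is an assumption about how such a comparison could be done, not necessarily the evaluation this recipe intends; the helper names are hypothetical, and the accuracy is computed in plain Python so that the example runs without downloading a metric.

```python
from typing import Dict, List

ENTITY_TYPES = ("PER", "ORG", "LOC", "MISC")


def coarse_reference(tags: List[str]) -> List[str]:
    """Strip the B-/I- prefix so that 'B-ORG' and 'I-ORG' both become 'ORG'."""
    return [tag.split("-", 1)[-1] for tag in tags]


def coarse_prediction(tokens: List[str], entities: Dict[str, List[str]]) -> List[str]:
    """Label a token with an entity type when it belongs to a predicted mention."""
    labels = ["O"] * len(tokens)
    for entity_type in ENTITY_TYPES:
        for mention in entities.get(entity_type, []):
            words = mention.split()
            if not words:
                continue
            # Slide over the sentence to find the mention as a run of consecutive tokens.
            for i in range(len(tokens) - len(words) + 1):
                if tokens[i : i + len(words)] == words:
                    labels[i : i + len(words)] = [entity_type] * len(words)
    return labels


def token_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of tokens whose predicted label equals the reference label."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# First datapoint and first OpenAI answer shown earlier in this notebook.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
reference_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
answer = {"ORG": ["EU", "German"], "LOC": ["British"], "PER": [], "MISC": ["lamb"]}

predicted = coarse_prediction(tokens, answer)
print(predicted)
print(token_accuracy(predicted, coarse_reference(reference_tags)))
```

Once both label lists are mapped to integers, the same comparison can be passed to `accuracy_metric.compute(predictions=..., references=...)`.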

`model`: The ID of the model to use. See https://platform.openai.com/docs/models/overview to choose a model.