[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/development/llms/ner/entity-extraction.ipynb)


# <a id="top">Named entity recognition with LLMs</a>

This notebook illustrates how an LLM used for NER can be uploaded to the Openlayer platform.

## <a id="toc">Table of contents</a>

1. [**Problem statement**](#problem) 

2. [**Downloading the dataset**](#dataset-download)

3. [**Adding the model outputs to the dataset**](#model-output)

4. [**Uploading to the Openlayer platform**](#upload)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
        - [Direct-to-API models](#direct-to-api)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/development/llms/ner/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="problem">1. Problem statement </a>

[Back to top](#top)


In this notebook, we will use an LLM to extract entities from input sentences. The entities we care about are `Person`, `Organization`, `Location`, and `Event`.

For example, if the LLM received the sentence:
```
IBM's Watson beat human players in Jeopardy!
```
it should output a list of entities (JSON formatted):
```
[
    {
        "entity_group": "Organization",
        "score": 0.75,
        "word": "IBM",
        "start": 0,
        "end": 3
    },
    {
        "entity_group": "Event",
        "score": 0.70,
        "word": "Jeopardy",
        "start": 35,
        "end": 43
    }
]
```

To do so, we start with a dataset of sentences and ground truths, use an LLM to extract the entities, and finally upload the dataset and the LLM to the Openlayer platform to evaluate the results.
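
To make the `start` and `end` fields concrete, here is a minimal sketch of how character offsets relate to the sentence. It assumes that `start` is 0-based and `end` is exclusive, which is consistent with `IBM` spanning 0 to 3 in the example. The helper `locate_entity` is hypothetical and is not part of `openlayer`.

```python
def locate_entity(sentence: str, word: str) -> dict:
    """Locate the first occurrence of `word` and return its character span."""
    start = sentence.find(word)
    if start == -1:
        raise ValueError(f"{word!r} does not occur in the sentence")
    # `end` is exclusive (an assumption), so sentence[start:end] == word.
    return {"word": word, "start": start, "end": start + len(word)}


sentence = "IBM's Watson beat human players in Jeopardy!"
for word in ("IBM", "Jeopardy"):
    span = locate_entity(sentence, word)
    # Slicing the sentence with the offsets recovers the entity text.
    print(span, "->", sentence[span["start"]:span["end"]])
```

Under this convention, slicing the sentence with an entity's offsets should always return the entity's `word`, which gives a quick sanity check for any extracted entity.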

## <a id="dataset-download">2. Downloading the dataset </a>

[Back to top](#top)

The dataset we'll use to evaluate the LLM is stored in an S3 bucket. Run the cells below to download it and inspect it:

In [None]:
%%bash

if [ ! -e "ner_dataset.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/llms/ner/ner_dataset.csv" --output "ner_dataset.csv"
fi

In [None]:
import pandas as pd

In [None]:
dataset = pd.read_csv("ner_dataset.csv")

dataset.head()

Our dataset has two columns: `sentence`, with the input sentences, and `ground_truth`, with the list of entities (such as `Person`, `Location`, or `Organization`) mentioned in each sentence.

Note that even though ground truths are available in our case, they are not required to use Openlayer. You can check out the other Jupyter Notebook examples, where we work on problems without access to ground truths.

We will now use an LLM to extract the entities from each `sentence`.

## <a id="model-output">3. Adding model outputs to the dataset </a>

[Back to top](#top)

As mentioned, we now want to add an extra column to our dataset: the `model_output` column with the LLM's prediction for each row.

There are many ways to achieve this goal, and you can pursue the path you're most comfortable with. 

One possibility is to use the `openlayer` Python client with one of the supported LLMs, such as GPT-4.

We show how to do this below. **This assumes you have an OpenAI API key.** **If you prefer not to make requests to OpenAI**, you can [skip to this cell and download the resulting dataset with the model outputs](#download-model-output).

First, let's pip install `openlayer`:

In [None]:
!pip install openlayer

The `openlayer` Python client comes with LLM runners, which are wrappers around common LLMs -- such as OpenAI's. The idea is that these LLM runners adhere to a common interface and can be called to make predictions on pandas dataframes. 

To use `openlayer`'s LLM runners, we follow these steps:

**1. Prepare the config**

We need to prepare a config for the LLM:

In [None]:
# One of the pieces of information that will go into our config is the `prompt`, built from this template
prompt_template = """
You will be provided with a `sentence`, and your task is to generate a list
of entities mentioned in the sentence. Each item from the entity list must be
a JSON with the following attributes:
{
    "entity_group": a string. The entity type that the `word` belongs to. Must be one of "Person", "Organization", "Event", or "Location",
    "score": a float between 0 and 1. Expresses how confident you are that the `word` belongs to this `entity_group`,
    "word": a string. The word from the `sentence`,
    "start": an int. Starting character of the `word` in the `sentence`,
    "end": an int. Ending character (exclusive) of the `word` in the `sentence`
}


For example, given:
```
Sentence: IBM's Watson beat human players in Jeopardy!
```

the output should be something like:
```
[
    {
        "entity_group": "Organization",
        "score": 0.75,
        "word": "IBM",
        "start": 0,
        "end": 3
    },
    {
        "entity_group": "Event",
        "score": 0.70,
        "word": "Jeopardy",
        "start": 35,
        "end": 43
    }
]

```

Sentence: {{ sentence }}
"""
prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_template}
]

In [None]:
# Note the camelCase for the keys
model_config = {
    "prompt": prompt,
    "inputVariableNames": ["sentence"],
    "modelProvider": "OpenAI",
    "model": "gpt-3.5-turbo",
    "modelParameters": {
        "temperature": 0
    },
}

To highlight a few important fields:
- `prompt`: this is the prompt that will get sent to the LLM. Notice that our variables are referred to in the prompt template with double handlebars `{{ }}`. When we make the request, each placeholder is replaced with the value of the corresponding input variable from the pandas dataframe. Also, the messages follow OpenAI's convention, with `role` and `content` keys, regardless of the LLM provider you choose.
- `inputVariableNames`: this is a list with the names of the input variables. Each input variable should be a column in the pandas dataframe that we will use. Furthermore, these are the input variables referenced in the `prompt` with the handlebars.
- `modelProvider`: one of the supported model providers, such as `OpenAI`.
- `model`: name of the model from the `modelProvider`. In our case `gpt-3.5-turbo`.
- `modelParameters`: a dictionary with the model parameters for that specific `model`. For `gpt-3.5-turbo`, for example, we could specify the `temperature`, the `tokenLimit`, etc.
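
To illustrate what the double-handlebar substitution amounts to, here is a minimal sketch that fills `{{ name }}` placeholders in a list of messages. It is only an illustration of the idea and not `openlayer`'s actual implementation; the function `render_prompt` is a hypothetical name.

```python
import re

# Matches placeholders such as `{{ sentence }}` or `{{sentence}}`.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(messages: list, variables: dict) -> list:
    """Return a copy of `messages` with every placeholder replaced by its value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value provided for input variable {name!r}")
        return str(variables[name])

    return [
        {"role": message["role"], "content": PLACEHOLDER.sub(substitute, message["content"])}
        for message in messages
    ]


messages = [{"role": "user", "content": "Sentence: {{ sentence }}"}]
row = {"sentence": "IBM's Watson beat human players in Jeopardy!"}
rendered = render_prompt(messages, row)
print(rendered[0]["content"])
```

Single braces, such as those in the JSON example inside the prompt, are left untouched because the pattern only matches doubled braces around a variable name.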

**2. Get the model runner**

Now we can import `models` from `openlayer` and call the `get_model_runner` function, which will return a `ModelRunner` object. This is where we'll pass the OpenAI API key. For a different LLM `modelProvider` you might need to pass a different argument -- refer to our documentation for details.

In [None]:
from openlayer import models, tasks

llm_runner = models.get_model_runner(
    task_type=tasks.TaskType.LLM,
    openai_api_key="YOUR_OPENAI_API_KEY_HERE",
    **model_config
)
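
Rather than pasting the key into the notebook, you may prefer to read it from an environment variable. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; both that name and the helper `read_api_key` are assumptions made for illustration.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Possible usage with the runner from the cell above
# (commented out so that this cell has no side effects):
# llm_runner = models.get_model_runner(
#     task_type=tasks.TaskType.LLM,
#     openai_api_key=read_api_key(),
#     **model_config,
# )
```

This keeps the key out of the saved notebook, which matters if the notebook is shared or committed to version control.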

In [None]:
llm_runner

**3. Run the LLM to get the predictions**

Every model runner comes with a `run` method. This method takes a pandas dataframe containing the input variables and returns a pandas dataframe with a single column: the predictions.

For example, to get the output for the first few rows of our dataset:

In [None]:
llm_runner.run(dataset[:3])

Now, we can get the predictions for our full dataset and add them to the column `model_output`. 

**Note that this can take some time and incurs costs.**

In [None]:
# There are costs in running this cell!
dataset["model_output"] = llm_runner.run(dataset)["output"]

<a id="download-model-output">**Run the cells below if you did not make requests to OpenAI:**</a>

In [None]:
%%bash

if [ ! -e "ner_dataset_with_outputs.csv" ]; then
    curl "https://openlayer-static-assets.s3.us-west-2.amazonaws.com/examples-datasets/llms/ner/ner_dataset_with_outputs.csv" --output "ner_dataset_with_outputs.csv"
fi

In [None]:
dataset = pd.read_csv("ner_dataset_with_outputs.csv")

dataset.head()
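
The `model_output` column holds the LLM's responses as text. If you want to inspect them as Python objects, the sketch below shows one way to parse a response, assuming it is a JSON-like list of entities. LLMs sometimes emit trailing commas, which strict JSON rejects, so the hypothetical helper `parse_entities` falls back to `ast.literal_eval` and returns an empty list when the text cannot be parsed.

```python
import ast
import json


def parse_entities(raw_output: str) -> list:
    """Parse a model response into a list of entity dicts (empty list on failure)."""
    text = raw_output.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            # Python literal syntax tolerates trailing commas, unlike strict JSON.
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return []
    # Anything other than a list is treated as an unusable response.
    return parsed if isinstance(parsed, list) else []


example_output = '[{"entity_group": "Organization", "word": "IBM", "start": 0, "end": 3,},]'
for entity in parse_entities(example_output):
    print(entity["entity_group"], entity["word"])
```

With pandas, this could be applied row by row with something like `dataset["model_output"].apply(parse_entities)`, assuming every value in the column is a string.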

## <a id="upload">4. Uploading to the Openlayer platform </a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

### <a id="client">Instantiating the client</a>

In [None]:
import openlayer

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

### <a id="project">Creating a project on the platform</a>

In [None]:
from openlayer.tasks import TaskType

project = client.create_or_load_project(
    name="NER with LLMs",
    task_type=TaskType.LLM,
    description="Evaluating entity extracting LLM."
)

### <a id="dataset">Uploading datasets</a>

Before adding the datasets to a project, we need to prepare a `dataset_config`.

This is a Python dictionary that contains all the information needed by the Openlayer platform to utilize the dataset. It should include the column names, the input variable names, etc. For details on the `dataset_config` items, see the [API reference](https://reference.openlayer.com/reference/api/openlayer.OpenlayerClient.add_dataset.html#openlayer.OpenlayerClient.add_dataset).

Let's prepare the `dataset_config` for our validation set:

In [None]:
# Some variables that will go into the `dataset_config`
input_variable_names = ["sentence"]
ground_truth_column_name = "ground_truth"
output_column_name = "model_output"

In [None]:
validation_dataset_config = {
    "inputVariableNames": input_variable_names,
    "label": "validation",
    "outputColumnName": output_column_name,
    "groundTruthColumnName": ground_truth_column_name
}
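
Before uploading, it can help to confirm that every column named in the config actually exists in the dataframe. The sketch below is a local sanity check only, not something the Openlayer client requires; the helper `missing_columns` is hypothetical, and it only looks at the three config keys used in this notebook.

```python
def missing_columns(dataset_config: dict, columns) -> list:
    """Return the column names referenced by the config that are absent from `columns`."""
    referenced = list(dataset_config.get("inputVariableNames", []))
    for key in ("outputColumnName", "groundTruthColumnName"):
        if dataset_config.get(key) is not None:
            referenced.append(dataset_config[key])
    available = set(columns)
    return [name for name in referenced if name not in available]


example_config = {
    "inputVariableNames": ["sentence"],
    "label": "validation",
    "outputColumnName": "model_output",
    "groundTruthColumnName": "ground_truth",
}
# A dataframe's `columns` attribute could be passed here instead of a list.
print(missing_columns(example_config, ["sentence", "ground_truth"]))
```

In this notebook, the check could be run as `missing_columns(validation_dataset_config, dataset.columns)`; an empty list means every referenced column is present.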

In [None]:
# Validation set
project.add_dataframe(
    dataset_df=dataset,
    dataset_config=validation_dataset_config,
)

We can confirm that the validation set is now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="model">Uploading models</a>

When it comes to uploading models to the Openlayer platform, there are a few options:

- The first one is to upload a **shell model**. Shell models are the most straightforward way to get started. They consist of metadata, and all of the analysis is done via their predictions (which are [uploaded with the datasets](#dataset), in the `outputColumnName`).
- The second one is to upload a **direct-to-API model**. This is analogous to using one of `openlayer`'s model runners in the notebook environment. By doing so, you'll be able to interact with the LLM using the platform's UI and also perform a series of robustness assessments on the model using data that is not in your dataset.


Since we used an LLM runner in this notebook, we'll follow the **direct-to-API** approach. Refer to the other notebooks for shell model examples.

#### <a id="direct-to-api"> Direct-to-API </a>

To upload a direct-to-API LLM to Openlayer, you will need to create (or point to) a model config, which can live in a YAML file or, as in this notebook, be passed as a Python dictionary. This model config contains the `prompt`, the `modelProvider`, and so on: essentially everything the Openlayer platform needs to make direct requests to the LLM you're using.

Note that to use a direct-to-API model on the platform, you'll need to **provide your model provider's API key (such as the OpenAI API key) using the platform's UI**, under the project settings.

Since we used an LLM runner in this notebook, we already wrote a model config for the LLM. We repeat it here for completeness, adding the `modelType` key:

In [None]:
# Note the camelCase for the keys
model_config = {
    "prompt": prompt,
    "inputVariableNames": ["sentence"],
    "modelProvider": "OpenAI",
    "model": "gpt-3.5-turbo",
    "modelParameters": {
        "temperature": 0
    },
    "modelType": "api",
}

In [None]:
# Adding the model
project.add_model(
    model_config=model_config,
)

We can confirm that both the model and the validation set are now staged using the `project.status()` method. 

In [None]:
project.status()

### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version to the platform. 

In [None]:
project.commit("Initial commit!")

In [None]:
project.status()

In [None]:
project.push()