# NLP Tasks Chaining with GenAI

## Overview

In this notebook, we show how to chain NLP tasks with GenAI. For simplicity we use OpenAI's API, but every step can be reproduced with LangChain and/or other models; the logic is the important part. We use only the `user` role for chat completion, because a system message is not necessarily available in other models.

We build a chain in which:

1. **Natural Language Generation**: We generate a list of named entity types given a persona and a purpose.
2. **Named Entity Recognition**: We pass the generated list of named entity types to an NER task to extract the named entities from a given article.
3. **Natural Language Inference**: We pass the extracted named entities to an NLI task, which acts as a proxy for a data quality check.
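
The three steps above can be sketched as a plain function composition. This is a minimal sketch, not the notebook's implementation: `run_chain` and the three `stub_*` functions are hypothetical names, and the stubs stand in for the model calls made later in the notebook.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (label, text)


def run_chain(
    persona: str,
    purpose: str,
    article: str,
    suggest_types: Callable[[str, str], List[str]],
    extract_entities: Callable[[str, List[str]], List[Entity]],
    check_entity: Callable[[str, Entity], str],
) -> List[Tuple[Entity, str]]:
    """Feed the output of each step into the next one."""
    entity_types = suggest_types(persona, purpose)       # step 1: NLG
    entities = extract_entities(article, entity_types)   # step 2: NER
    return [(e, check_entity(article, e)) for e in entities]  # step 3: NLI


# Hypothetical stubs standing in for the real model calls.
def stub_suggest_types(persona: str, purpose: str) -> List[str]:
    return ["COMPUTER_BRAND"]


def stub_extract_entities(article: str, entity_types: List[str]) -> List[Entity]:
    candidates = {"COMPUTER_BRAND": "SmithTech"}
    return [
        (label, text)
        for label, text in candidates.items()
        if label in entity_types and text in article
    ]


def stub_check_entity(article: str, entity: Entity) -> str:
    return "entailment" if entity[1] in article else "contradiction"


report = run_chain(
    "eCommerce merchandiser",
    "to provide relevant shopping information for computers to shoppers",
    "SmithTech's Quantum X1: A Revolution in Computing",
    stub_suggest_types,
    stub_extract_entities,
    stub_check_entity,
)
print(report)
```

Keeping each step behind a callable makes it easy to swap the provider for any step without touching the chaining logic.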



In [1]:
# load dotenv
import os
from dotenv import load_dotenv
load_dotenv("../.env")

# get OPENAI_API_KEY
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

## Defining Models / Schema

The term "data model" used here should not be confused with a machine learning model. A data model describes how data should be structured and how its parts relate to each other. A machine learning model is a mathematical model that learns to make predictions from data.

See [How to Create a Data Model in 9 Steps](https://budibase.com/blog/data/how-to-create-a-data-model/).

[Pydantic](https://docs.pydantic.dev/latest/) is a very widely used library for data validation and settings management based on Python type hints. [Instructor](https://github.com/jxnl/instructor) is a library that extends Pydantic to support OpenAI's function calling. [Function calling](https://openai.com/blog/function-calling-and-other-api-updates) essentially passes a JSON schema to the model as part of the prompt, in an attempt to force the model to generate JSON output of a specific shape. Even if you are not using OpenAI, you can still pass a JSON schema to your model and see whether it generates the output you want. In either case, you should still use Pydantic or one of its alternatives to validate the output.
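
As a rough illustration of validating model output regardless of provider, the sketch below parses raw JSON text with a plain Pydantic model. It assumes Pydantic v2 (`model_validate_json`); `EntityTypeSketch` and the two raw strings are hypothetical stand-ins for text returned by a model.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class EntityTypeSketch(BaseModel):
    """Hypothetical, trimmed-down schema used only for this illustration."""

    name: str = Field(..., description="Name of the entity type.")
    examples: List[str] = Field(..., description="Examples of the entity type.")


# Hypothetical raw outputs, standing in for text returned by any model.
good_raw = '{"name": "PHONE_NUMBER", "examples": ["555-0100"]}'
bad_raw = '{"name": "PHONE_NUMBER"}'  # the required "examples" field is missing

parsed = EntityTypeSketch.model_validate_json(good_raw)
print(parsed.name)

try:
    EntityTypeSketch.model_validate_json(bad_raw)
    bad_is_valid = True
except ValidationError as err:
    # Malformed output is rejected instead of flowing into the next step.
    bad_is_valid = False
    print(f"rejected: {err.error_count()} validation error(s)")
```

Rejecting malformed output at this boundary keeps a bad generation from silently propagating into the later NER and NLI steps.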

Notice that some of the field descriptions are written much like a prompt. This is because function calling passes these descriptions to the model as part of the prompt, and the model uses them to infer what the output should be. If you are instead writing an API endpoint that exposes your model, prompt-style descriptions are less appropriate; write them in a way that is better suited to API documentation.

In [200]:
from enum import Enum
from typing import List

from instructor import OpenAISchema
from pydantic import Field


class EntityType(OpenAISchema):
    name: str = Field(..., description="Name of the entity type.")
    description: str = Field(..., description="Description of the entity type.")
    examples: List[str] = Field(..., description="List of examples of the entity type.")


class EntityTypeSuggestionsOutput(OpenAISchema):
    # The description doubles as a prompt: it tells the model how to name entity types.
    entity_types: List[EntityType] = Field(
        ...,
        description=(
            "List of entity types suggested. Entity type names should be all uppercase "
            "and separated by underscores (e.g. PERSON, LOCATION, PHONE_NUMBER)."
        ),
    )
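
The naming convention requested in the field description (uppercase, underscore-separated) is only an instruction to the model, so it can be worth checking afterwards. The sketch below is one possible check using only the standard library; the pattern and both helper names are assumptions for illustration.

```python
import re

# Assumed convention: uppercase alphanumeric words joined by single underscores.
ENTITY_NAME_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def is_valid_entity_type_name(name: str) -> bool:
    """True if the whole name follows the uppercase/underscore convention."""
    return ENTITY_NAME_PATTERN.fullmatch(name) is not None


def normalize_entity_type_name(name: str) -> str:
    """Best-effort cleanup: replace runs of other characters with '_' and uppercase."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_").upper()


print(is_valid_entity_type_name("PHONE_NUMBER"))
print(is_valid_entity_type_name("Phone Number"))
print(normalize_entity_type_name("Phone Number"))
```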

## Workflow

### Step 1: Natural Language Generation
Typically, users already have specific named entities in mind when they set up a Named Entity Recognition task. In some business cases, however, users may not yet know which named entities they want to extract. In that case, we can use a Natural Language Generation task to suggest a list of named entity types.

In [279]:
import openai

PERSONA = 'eCommerce merchandiser'
PURPOSE = 'to provide relevant shopping information for computers to shoppers'

# Note: openai.ChatCompletion is the interface of the pre-1.0 openai Python SDK.
entity_suggestions = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    functions=[EntityTypeSuggestionsOutput.openai_schema],
    # Force the model to call this function so the reply follows the schema.
    function_call={"name": EntityTypeSuggestionsOutput.openai_schema["name"]},
    messages=[
        {
            "role": "user",
            "content": f"""
You are given a persona and a purpose. Act as the persona, understand the persona's goal, and generate a list of named entity types that
specifically relate to the persona and the purpose. Consider the following context:
Persona: {PERSONA}
Purpose: {PURPOSE}
""",
        }
    ],
)

In [280]:
# entity suggestions is an OpenAI Chat Completion object
entity_suggestions

<OpenAIObject chat.completion id=chatcmpl-8C9BAaQA8kZewAoibUHRHM2x5PBPo at 0x7f645fe86b80> JSON: {
  "id": "chatcmpl-8C9BAaQA8kZewAoibUHRHM2x5PBPo",
  "object": "chat.completion",
  "created": 1697905912,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "EntityTypeSuggestionsOutput",
          "arguments": "{\n  \"entity_types\": [\n    {\n      \"name\": \"COMPUTER_BRAND\",\n      \"description\": \"The brand of the computer\",\n      \"examples\": [\"HP\", \"Dell\", \"Apple\"]\n    },\n    {\n      \"name\": \"COMPUTER_MODEL\",\n      \"description\": \"The model name or number of the computer\",\n      \"examples\": [\"Inspiron 15\", \"MacBook Pro\"]\n    },\n    {\n      \"name\": \"COMPUTER_FEATURE\",\n      \"description\": \"The features or specifications of the computer\",\n      \"examples\": [\"Processor\", \"RAM\", \"Storage\", \"Graphi

In [281]:
# OpenAISchema can parse the raw response back into the Pydantic model
EntityTypeSuggestionsOutput.from_response(entity_suggestions)

EntityTypeSuggestionsOutput(entity_types=[EntityType(name='COMPUTER_BRAND', description='The brand of the computer', examples=['HP', 'Dell', 'Apple']), EntityType(name='COMPUTER_MODEL', description='The model name or number of the computer', examples=['Inspiron 15', 'MacBook Pro']), EntityType(name='COMPUTER_FEATURE', description='The features or specifications of the computer', examples=['Processor', 'RAM', 'Storage', 'Graphics Card']), EntityType(name='COMPUTER_PRICE', description='The price range or specific price of the computer', examples=['$500 - $1000', '$1500', 'under $500']), EntityType(name='COMPUTER_REVIEW', description='The reviews or ratings of the computer', examples=['5 stars', 'positive review', 'customer rating'])])
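
For models or clients without a helper like `from_response`, the same information can be pulled out by hand, because the response shown above carries the arguments as a JSON string under `choices[0].message.function_call.arguments`. The sketch below assumes a dict-shaped response; `extract_function_arguments` is a hypothetical helper, and the sample response is a trimmed copy of the output above.

```python
import json

# Trimmed stand-in for the response shown above (one entity type only).
response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": None,
                "function_call": {
                    "name": "EntityTypeSuggestionsOutput",
                    "arguments": (
                        '{"entity_types": [{"name": "COMPUTER_BRAND", '
                        '"description": "The brand of the computer", '
                        '"examples": ["HP", "Dell", "Apple"]}]}'
                    ),
                },
            },
        }
    ]
}


def extract_function_arguments(resp: dict, expected_name: str) -> dict:
    """Return the decoded function-call arguments of the first choice."""
    call = resp["choices"][0]["message"].get("function_call")
    if call is None:
        raise ValueError("response contains no function_call")
    if call["name"] != expected_name:
        raise ValueError(f"unexpected function name: {call['name']}")
    # The arguments arrive as a JSON string; json.loads raises
    # json.JSONDecodeError if the model produced invalid or truncated JSON.
    return json.loads(call["arguments"])


arguments = extract_function_arguments(response, "EntityTypeSuggestionsOutput")
print([t["name"] for t in arguments["entity_types"]])
```

The decoded dictionary still needs schema validation before it is used in the next step.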

### Step 2: Named Entity Recognition
Once we have a list of named entity types, we can pass it to a Named Entity Recognition task to extract the named entities from a given article.

In [204]:
class Entity(OpenAISchema):
    start: int = Field(..., alias="start_pos", description="The starting position of the entity in the text.")
    end: int = Field(..., alias="end_pos", description="The ending position of the entity in the text.")
    label: str = Field(..., description="The label of the entity.")
    text: str = Field(..., description="The text of the entity.")

class NEROutput(OpenAISchema):
    entities: List[Entity] = Field(..., description="The list of entities found in the text.")
    

In [291]:
ENTITY_TYPES = EntityTypeSuggestionsOutput.from_response(entity_suggestions).entity_types

ARTICLE = """
Title: "SmithTech's Quantum X1: A Revolution in Computing"

Description:
Introducing the SmithTech Quantum X1, a masterpiece of computing innovation by renowned tech visionary, John Smith. This cutting-edge computer is not just a device; it's a breakthrough in the world of technology.

Key Features:

Powered by the latest SmithTech Quantum Processor, designed by the genius himself, John Smith, delivering unrivaled speed and efficiency.
Immerse yourself in stunning 4K visuals with the Quantum X1's advanced graphics card.
Lightning-fast NVMe SSD storage ensures rapid data access and load times.
Experience seamless multitasking with ample RAM, allowing you to conquer any task.
Sleek, futuristic design that complements any workspace, embodying the essence of SmithTech's commitment to excellence.
John Smith's Quantum X1 is not just a computer; it's a statement of innovation and a testament to SmithTech's legacy in pushing the boundaries of what's possible in computing technology. Elevate your computing experience to a whole new dimension with the Quantum X1 – where visionary technology meets reality.
"""

In [283]:
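# Serialize each entity type as one JSON line for use in the prompt.
# This rebinds ENTITY_TYPES from a list to a string, so run this cell only once.
ENTITY_TYPES = '\n'.join([x.model_dump_json() for x in ENTITY_TYPES])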

In [292]:
import openai

extracted_entities = openai.ChatCompletion.create(
    model="gpt-4",
    functions=[NEROutput.openai_schema],
    function_call={"name": NEROutput.openai_schema["name"]},
    messages=[
        {
            "role": "user",
            "content": f"""
You are a named entity recognition model. Please be as strict as possible in your predictions.
You are given a text, and you are to extract the following entity types: {ENTITY_TYPES}
""",
        },
        {
            "role": "user",
            "content": f"Text: {ARTICLE}",
        },
    ],
)

In [293]:
ENTITIES = NEROutput.from_response(extracted_entities).entities

In [318]:
ENTITIES

[Entity(start=13, end=36, label='COMPUTER_MODEL', text="SmithTech's Quantum X1"),
 Entity(start=69, end=79, label='COMPUTER_BRAND', text='SmithTech'),
 Entity(start=154, end=160, label='COMPUTER_FEATURE', text='device'),
 Entity(start=170, end=198, label='COMPUTER_MODEL', text='SmithTech Quantum X1'),
 Entity(start=210, end=221, label='COMPUTER_BRAND', text='SmithTech'),
 Entity(start=237, end=248, label='COMPUTER_FEATURE', text='Quantum Processor'),
 Entity(start=265, end=275, label='COMPUTER_BRAND', text='SmithTech'),
 Entity(start=343, end=356, label='COMPUTER_FEATURE', text='graphics card'),
 Entity(start=368, end=376, label='COMPUTER_FEATURE', text='NVMe SSD'),
 Entity(start=392, end=418, label='COMPUTER_FEATURE', text='rapid data access and load times'),
 Entity(start=423, end=427, label='COMPUTER_FEATURE', text='RAM'),
 Entity(start=444, end=462, label='COMPUTER_FEATURE', text='conquer any task'),
 Entity(start=483, end=490, label='COMPUTER_FEATURE', text='design'),
 Entity(star
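
The character offsets reported by the model are not guaranteed to line up with the text. In the output above, for example, the 9-character `'SmithTech'` is reported with the spans 69-79 and 210-221. One possible safeguard is to verify each span against the source and re-locate it when it does not match. The sketch below uses only the standard library; `Span` and `realign` are hypothetical names, and `Span` mirrors the fields of `Entity`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    start: int
    end: int
    label: str
    text: str


def realign(span: Span, source: str) -> Optional[Span]:
    """Return a span whose offsets match `source`, or None if the text is absent."""
    if not span.text:
        return None
    if source[span.start:span.end] == span.text:
        return span  # offsets are already correct
    # Collect every occurrence of the text, then keep the one closest
    # to the offset the model reported.
    positions: List[int] = []
    idx = source.find(span.text)
    while idx != -1:
        positions.append(idx)
        idx = source.find(span.text, idx + 1)
    if not positions:
        return None  # the model produced text that is not in the source
    best = min(positions, key=lambda p: abs(p - span.start))
    return Span(best, best + len(span.text), span.label, span.text)


source = 'Title: "SmithTech\'s Quantum X1: A Revolution in Computing"'
reported = Span(13, 36, "COMPUTER_MODEL", "SmithTech's Quantum X1")
fixed = realign(reported, source)
print(fixed)
```

Entities for which `realign` returns `None` are good candidates to drop or to send for human review, since their text does not appear in the article at all.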

### Step 3: Natural Language Inference
Once we have the extracted named entities, we can pass them to a Natural Language Inference task that acts as a proxy for a data quality check. Note that this kind of automatic check is only a proxy, not a replacement for human review. Typically, you would treat human evaluation as the gold standard and use the model to flag potential issues for human review.

In [295]:
class Nli(Enum):
    contradiction = "contradiction"
    neutral = "neutral"
    entailment = "entailment"
    
class NliOutput(OpenAISchema):
    explanation: str = Field(..., description="Think step by step whether the premise entails the hypothesis")    
    label: Nli = Field(..., description="The label of the text.")

Here we pass one entity at a time instead of the entire list. The model is more likely to lose context if the entire list is passed at once, and a single request for the whole list cannot take advantage of concurrent processing.

In [296]:
def nli_check(entity):
    """Run the NLI check for one entity and return (NliOutput or None, entity)."""
    try:
        output = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            functions=[NliOutput.openai_schema],
            function_call={"name": NliOutput.openai_schema["name"]},
            messages=[
                {
                    "role": "user",
                    "content": f"""
Premise: Text: {ARTICLE}
Hypothesis: The text contains {entity.label}: {entity.text}
""",
                },
                {
                    "role": "user",
                    "content": f"""
Entity Span: {entity.start} - {entity.end}
Entity Type: {entity.label}
Entity Value: {entity.text}
""",
                },
            ],
        )
        return NliOutput.from_response(output), entity
    except Exception:
        # The API call or the parsing failed. `output` may not exist at this
        # point, so return None and keep the entity to show which check failed.
        return None, entity

In [297]:
import concurrent.futures  # importing `concurrent` alone does not load the `futures` submodule
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    evaluations = list(executor.map(nli_check, ENTITIES))
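
The fan-out pattern used here can be tried without any API access. In the sketch below, `fake_check` is a hypothetical stand-in for the per-entity model call; it fails for one input on purpose to show that a single failure does not abort the batch. `executor.map` returns results in the same order as the inputs, so each result can be matched to its entity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple


def fake_check(item: str) -> Tuple[Optional[str], str]:
    """Hypothetical stand-in for one API call; returns (label or None, item)."""
    try:
        if item == "bad":
            raise RuntimeError("simulated API failure")
        return "entailment", item
    except Exception:
        # Swallow the error per item so the rest of the batch still completes.
        return None, item


items: List[str] = ["SmithTech", "bad", "NVMe SSD"]
with ThreadPoolExecutor(max_workers=2) as executor:
    # Results come back in input order, even if calls finish out of order.
    results = list(executor.map(fake_check, items))

failed = [item for label, item in results if label is None]
print(results)
print(failed)
```

Threads suit this workload because each call spends most of its time waiting on the network; `max_workers` should stay within the rate limits of the provider being called.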

In [298]:
evaluations

[(NliOutput(explanation="The premise text describes a computer called SmithTech's Quantum X1. The hypothesis states that the text contains the computer model 'SmithTech's Quantum X1'. Based on the provided information, the hypothesis is correct.", label=<Nli.entailment: 'entailment'>),
  Entity(start=13, end=36, label='COMPUTER_MODEL', text="SmithTech's Quantum X1")),
 (NliOutput(explanation="The premise explicitly mentions 'SmithTech' several times, including in the title and description. Therefore, the hypothesis that the text contains the computer brand 'SmithTech' is true.", label=<Nli.entailment: 'entailment'>),
  Entity(start=69, end=79, label='COMPUTER_BRAND', text='SmithTech')),
 (NliOutput(explanation="The premise describes the SmithTech Quantum X1 as a computer, not just a device. Therefore, the hypothesis that the text contains the computer feature 'device' is incorrect.", label=<Nli.contradiction: 'contradiction'>),
  Entity(start=154, end=160, label='COMPUTER_FEATURE', tex
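
To use these evaluations as a flagging mechanism, one simple option is to send every entity whose label is not `entailment` to human review. The sketch below uses only the standard library; `Label` mirrors the `Nli` enum, `flag_for_review` is a hypothetical helper, and the sample pairs are taken from the first three results above.

```python
from enum import Enum
from typing import List, Tuple


class Label(Enum):
    contradiction = "contradiction"
    neutral = "neutral"
    entailment = "entailment"


def flag_for_review(pairs: List[Tuple[str, Label]]) -> List[Tuple[str, Label]]:
    """Keep every (entity text, label) pair that the NLI check did not entail."""
    return [(text, label) for text, label in pairs if label is not Label.entailment]


# (entity text, NLI label) pairs taken from the first three results above.
sample = [
    ("SmithTech's Quantum X1", Label.entailment),
    ("SmithTech", Label.entailment),
    ("device", Label.contradiction),
]

flagged = flag_for_review(sample)
print(flagged)
print(f"{len(flagged)} of {len(sample)} entities flagged for human review")
```

Treating `neutral` as a flag as well is a conservative choice: it sends more items to reviewers but avoids trusting extractions the model could not confirm.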