# Cost Efficient NLP using OpenAI GPT-3

Over the past few years, developers have encountered significant challenges when dealing with text data, despite the availability of rich resources such as `SpaCy` and other NLP libraries. These challenges persist due to the constantly increasing volume of data, which demands considerable human involvement, thought, and effort. 

Large language models (LLMs) are changing the way we handle text data. LLMs can analyze text data much faster than humans and can understand it in a way that's similar to how we do. So why not leverage their advantages for NLP tasks?

This notebook demonstrates how to use the **OpenAI GPT-3.5 model** for cost-effective NLP tasks. Prompt engineering plays a crucial role in guiding the model to produce desired outputs with minimal token generation, thereby reducing API usage costs.

## Setup

In [1]:
# Installing required packages if not already installed
!pip install openai

In [2]:
# Importing required libraries (OpenAI)
from openai import OpenAI

# Setting up the OpenAI API key (Replace "YOUR_OPENAI_API_KEY_HERE" with your OpenAI API key)
client = OpenAI(api_key="YOUR_OPENAI_API_KEY_HERE")
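
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative is to read the key from an environment variable. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original notebook.

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return api_key

# Usage with the client from the cell above
# (left commented so this block runs without the openai package):
# client = OpenAI(api_key=load_api_key())
```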

We'll be using the `gpt-3.5-turbo-0125` language model (as of March 2024). You can see the latest pricing [here](https://openai.com/pricing#:~:text=Output-,gpt%2D3.5%2Dturbo%2D0125,-%240.50%C2%A0/%201M). Let's check out its specs, like pricing, token limits, and more.

| Model              | Context window | Training data  | Input price       | Output price      |
|--------------------|----------------|----------------|-------------------|-------------------|
| gpt-3.5-turbo-0125 | 16,385 tokens  | Up to Sep 2021 | $0.50 / 1M tokens | $1.50 / 1M tokens |

In [3]:
# Setting up the model name
model_name = 'gpt-3.5-turbo-0125'

# setting the temperature
temperature = 0.2

The `temperature` parameter controls the randomness of the output: a lower value makes the output more deterministic, while a higher value makes it more random. We set it to `0.2` to keep the results consistent across runs.
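
To build some intuition for this, here is a small sketch of temperature-scaled softmax, the general technique commonly used to turn model scores into sampling probabilities. It illustrates the idea only and is not OpenAI's internal implementation. The scores in `logits` and the function name `softmax_with_temperature` are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this illustration")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)

print("temperature 0.2:", [round(p, 3) for p in low])
print("temperature 1.5:", [round(p, 3) for p in high])
```

At the lower temperature almost all of the probability mass lands on the top-scoring candidate, which is why low values give more repeatable answers.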

We'll create a function that takes the user's prompt, the model name, and the temperature as inputs. The function will return a dictionary containing the following:
1. API response text
2. Input tokens price
3. Output tokens price
4. Total price (input + output)

In [4]:
# function to chat with the OpenAI model
def openai_chat(user_prompt: str, model_name: str, temperature: float) -> dict:

    # Create a chat completion using the OpenAI client
    response = client.chat.completions.create(
        model=model_name, 
        messages=[{"role": "user", "content": user_prompt}],
        temperature=temperature
    )

    # Response from the chat completion
    text_response = response.choices[0].message.content

    # Calculate the price for input tokens
    input_tokens_price = response.usage.prompt_tokens * (0.5 / 1e6)

    # Calculate the price for output tokens
    output_tokens_price = response.usage.completion_tokens * (1.5 / 1e6)

    # Calculate the total price
    total_price = input_tokens_price + output_tokens_price

    # Return the answer, input price, output price, and total price
    return {
        "answer": text_response,
        "input_price": f"$ {input_tokens_price}",
        "output_price": f"$ {output_tokens_price}",
        "total_price": f"$ {total_price}"
    }

The price is calculated based on the pricing table shown earlier for the **gpt-3.5-turbo-0125** model. Let’s see how this function works by trying it out with a simple, short prompt.

In [5]:
# Calling our function with the prompt
openai_chat("What is the capital of China?", model_name, temperature)

{'answer': 'The capital of China is Beijing.',
 'input_price': '$ 7e-06',
 'output_price': '$ 1.0500000000000001e-05',
 'total_price': '$ 1.7500000000000002e-05'}
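
The arithmetic behind these numbers can be checked without calling the API. In the sketch below, the token counts (14 prompt tokens, 7 completion tokens) are back-calculated from the printed prices rather than read from the API response, and `estimate_cost` is a hypothetical helper name.

```python
# Prices per 1M tokens, copied from the pricing table above (they may change over time)
PRICES_PER_MILLION = {"gpt-3.5-turbo-0125": {"input": 0.50, "output": 1.50}}

def estimate_cost(prompt_tokens, completion_tokens, model="gpt-3.5-turbo-0125"):
    """Return the input, output, and total cost in dollars for one request."""
    rates = PRICES_PER_MILLION[model]
    input_cost = prompt_tokens * rates["input"] / 1_000_000
    output_cost = completion_tokens * rates["output"] / 1_000_000
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}

# 7e-06 / 0.5e-06 = 14 prompt tokens, 1.05e-05 / 1.5e-06 = 7 completion tokens
cost = estimate_cost(prompt_tokens=14, completion_tokens=7)
print(f"${cost['total']:.8f}")  # prints $0.00001750
```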

## Text Pattern Extraction

When extracting patterns like emails or numbers from text data, defining patterns using regex or other libraries is a common approach. However, the LLM has a significant advantage due to its training on large text data. Unlike regex, you just need to name the pattern you want to extract, such as email or phone number, making it more convenient.

Our input text contains several names, a phone number, aliases, and more. Let's see how the model extracts these patterns.

In [6]:
# defining the user input
user_input = """
In the bustling metropolis of New Alexandria, Detective Miller was hot on the trail of a notorious hacker known only as "Cipher." A cryptic message intercepted by the cybercrime unit mentioned a meeting at "The Crimson Cafe, 3pm, table 7," and listed two aliases: "Silver Fox" and "Blackbird." Miller knew these were likely the hacker's accomplices. 
His informant, a nervous young man named Alex Turner, claimed Cipher frequented an online forum under the ShadowHunter1337 and often boasted about their exploits. Turner also provided a burner phone number, 555-987-2104, supposedly used by Cipher to contact their associates. 
Armed with this information, Miller headed to The Crimson Cafe. He arrived early, taking a seat across from table 7, his eyes scanning the room. At precisely 3pm, two individuals approached the table. One, a woman with silver hair and piercing blue eyes, exuded an air of confidence and cunning. The other, a man clad in black, remained silent and watchful. 
Miller approached them, flashing his badge. "Excuse me, I'm Detective Miller. I believe you might have information regarding an individual known as Cipher." 
The woman, her lips curling into a sly smile, replied, "Cipher? Never heard of them." 
"""

# Patterns to detect the user input
patterns = "phone_numbers, person_names, location_names, time_expressions, aliases, usernames, physical_descriptions"

I defined the patterns as a comma-separated string. This is one of the advantages of using an LLM: you are not bound to a specific format and can define the patterns in any way you want.

Next we need to create a prompt template that the model can use to extract the desired pattern.

In [7]:
# Defining the prompt template
prompt_template = f'''
Given the input text:
user input: {user_input}

extract following patterns from it: {patterns}

output must be in this format:
pattern_name: pattern_values
...
'''

The prompt template is straightforward, but the output format is the key part. The model extracts the patterns and returns them in the format we specify. If the format is not clearly defined, the model may return a different layout on every run, which breaks the code that maps the extracted values to the defined patterns.

In [8]:
# Calling our function with the prompt
extracted_patterns = openai_chat(prompt_template, model_name, temperature)

# Printing the extracted patterns
print(extracted_patterns['answer'])

phone_numbers: 555-987-2104
person_names: Detective Miller, Alex Turner
location_names: New Alexandria, The Crimson Cafe
time_expressions: 3pm
aliases: Cipher, Silver Fox, Blackbird
usernames: ShadowHunter1337
physical_descriptions: woman with silver hair and piercing blue eyes, man clad in black


The patterns were extracted successfully. To make the result reusable, we can convert the text output into a structured format such as a dictionary or a list.

In [9]:
# Convert extracted patterns to a dictionary
extracted_patterns_list = extracted_patterns['answer'].split('\n')
extracted_patterns_dict = {}
for pattern in extracted_patterns_list:
    if pattern:
        # Splitting the pattern into key and values
        key, values = pattern.split(':', 1)
        key = key.strip()
        # Splitting the values into a list
        values = [value.strip() for value in values.split(',')]
        # Adding the key and values to the dictionary
        extracted_patterns_dict[key] = values


# Printing the extracted patterns
extracted_patterns_dict

{'phone_numbers': ['555-987-2104'],
 'person_names': ['Detective Miller', 'Alex Turner'],
 'location_names': ['New Alexandria', 'The Crimson Cafe'],
 'time_expressions': ['3pm'],
 'aliases': ['Cipher', 'Silver Fox', 'Blackbird'],
 'usernames': ['ShadowHunter1337'],
 'physical_descriptions': ['woman with silver hair and piercing blue eyes',
  'man clad in black']}
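
Because the model's output format can drift between runs, the parsing step benefits from being a little defensive. The following is a sketch of one way to wrap it in a function. The name `parse_key_values` and the sample string are hypothetical and only meant to show how stray lines and values containing colons could be handled.

```python
def parse_key_values(answer: str) -> dict:
    """Parse 'name: value1, value2' lines into a dictionary of lists."""
    parsed = {}
    for line in answer.splitlines():
        # Skip blank lines and anything without a "name: values" shape
        if ":" not in line:
            continue
        key, raw_values = line.split(":", 1)  # split once so values may contain colons
        values = [v.strip() for v in raw_values.split(",") if v.strip()]
        # Merge values if the model repeats a pattern name on another line
        parsed.setdefault(key.strip(), []).extend(values)
    return parsed

sample = "phone_numbers: 555-987-2104\ntime_expressions: 3:00 pm, noon\nHere is some stray text\n"
print(parse_key_values(sample))
# prints {'phone_numbers': ['555-987-2104'], 'time_expressions': ['3:00 pm', 'noon']}
```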

With just a few lines of code, we have converted the extracted patterns into a structure similar to what `SpaCy` or other NLP libraries provide. The key point is that extracting patterns such as usernames or physical descriptions would be a hectic task with core NLP libraries, but with an LLM it is just a matter of naming the pattern and extracting it.

## Spell Correction

This is where prompt engineering comes in handy. We can define the prompt template so that the model corrects the spelling of the given text but outputs only the words that need correction, along with their corrected versions.

Our input text contains several spelling errors. Let's see how the model corrects them.

In [10]:
# defining the user input
user_input = """She walkd alonng the breezy beech, feelin the kool breeze on her facee. The waves whispered soft melodies, soothin her troubled mind, as she strolld aimlesly alonng the shore."""

Next we need to create a prompt template that the model can use to correct misspelled words.

In [11]:
# Defining the prompt template
prompt_template = f'''Given the input text:
user input: {user_input}

output must be in this format:
misspelled_word:corrected_word
...
output must not contain anyother information
'''

Within the prompt template, we’ve explicitly defined the desired response format. The API should highlight misspelled words and provide their correct replacements. We’ve also specified that no additional information should be included in the response, as we’ll be coding to replace these words in the original **user_input**.

In [12]:
# Calling our function with the prompt
misspelled_words = openai_chat(prompt_template, model_name, temperature)

# Printing the misspelled words
print(misspelled_words['answer'][:])

walkd:walked
alonnng:along
beech:beach
feelin:feeling
kool:cool
facee:face
strolld:strolled
aimlesly:aimlessly


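The language model identified most of the misspelled words and provided their corrected forms. It is not perfect, though: it returned `alonnng` where the input has `alonng`, and it did not flag `soothin`, so both remain uncorrected when the replacements are applied. Let’s see how much this task cost us.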

In [13]:
print("Input Price:", misspelled_words['input_price'])
print("Output Price:", misspelled_words['output_price'])
print("Total Price:", misspelled_words['total_price'])

Input Price: $ 4.4999999999999996e-05
Output Price: $ 7.05e-05
Total Price: $ 0.0001155


The total cost for this task comes to **$0.0001155**. If we had asked the model to return the fully corrected text instead, we would likely have spent more on longer inputs, because the output rate of $1.50 per million tokens would apply to the entire text being returned rather than only to the corrections.
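
To make the saving concrete, here is a rough worked comparison. The rates come from the pricing table. Every token count below is a hypothetical assumption chosen for illustration and was not measured from a real request.

```python
INPUT_RATE = 0.50 / 1_000_000   # dollars per input token (from the pricing table)
OUTPUT_RATE = 1.50 / 1_000_000  # dollars per output token (from the pricing table)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one request with the given token counts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical workload: a 2,000-token document containing 20 misspellings
document_tokens = 2_000
instruction_tokens = 40      # assumed size of the prompt instructions
tokens_per_correction = 6    # assumed size of one "wrong:right" line

full_rewrite = request_cost(document_tokens + instruction_tokens, document_tokens)
corrections_only = request_cost(document_tokens + instruction_tokens, 20 * tokens_per_correction)

print(f"full rewrite:     ${full_rewrite:.6f}")
print(f"corrections only: ${corrections_only:.6f}")
```

Under these assumptions the input cost is identical in both cases, so the whole difference comes from the output side. The gap widens as the document gets longer and the share of misspelled words gets smaller.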

We have to code a bit to replace the misspelled words with their corrected form. But this is a one-time effort, and we can reuse this code for any other task that requires spell correction.

In [14]:
# Splitting the misspelled words and their corrected versions
misspelled_words_list = misspelled_words['answer'].split("\n")
result = [tuple(part.strip() for part in line.split(":", 1)) for line in misspelled_words_list if ":" in line]

# Replace the misspelled words with the corrected words
for misspelled_word, corrected_word in result:
    user_input = user_input.replace(misspelled_word, corrected_word)

# Print the corrected user input
print(user_input)

She walked alonng the breezy beach, feeling the cool breeze on her face. The waves whispered soft melodies, soothin her troubled mind, as she strolled aimlessly alonng the shore.
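
One caveat with `str.replace` is that it also rewrites matches inside longer words. A possible refinement, sketched below with the standard `re` module, is to replace whole words only. The helper name `apply_corrections` and the sample sentence are made up for this example.

```python
import re

def apply_corrections(text: str, corrections: dict) -> str:
    """Replace each misspelled word only where it appears as a whole word."""
    for wrong, right in corrections.items():
        # \b anchors keep "kool" from matching inside a longer word
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement avoids backslash handling in the corrected word
        text = re.sub(pattern, lambda _match, fixed=right: fixed, text)
    return text

sample = "The kool kid drank koolaid on the beech."
print(apply_corrections(sample, {"kool": "cool", "beech": "beach"}))
# prints The cool kid drank koolaid on the beach.
```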


## NER Detection

The same prompt template can be used to extract named entities from the text data. We just need to define the named entities we want to extract.

Our input text contains several named entities. Let's see how the model extracts them.

In [15]:
# defining the user input
user_input = """In the heart of New York City, John Smith walked briskly down Fifth Avenue, clutching his briefcase tightly. He was headed to a meeting with Microsoft Corporation, where he would discuss the latest advancements in artificial intelligence. As he passed by Central Park, he couldn't help but admire the towering skyscrapers surrounding him. Suddenly, his phone buzzed with a notification from his friend Sarah, inviting him to dinner at her favorite Italian restaurant, Giovanni's. John quickly replied, agreeing to meet her there later that evening."""

In [16]:
# defining the NER tags
ner_tags = '''FAC, CARDINAL, NUMBER, DEMONYM, QUANTITY, TITLE, PHONE_NUMBER, NATIONAL, JOB, PERSON, LOC, NORP, TIME, CITY, EMAIL, GPE, LANGUAGE, PRODUCT, ZIP_CODE, ADDRESS, MONEY, ORDINAL, DATE, EVENT, CRIMINAL_CHARGE, STATE_OR_PROVINCE, RELIGION, DURATION, URL, WORK_OF_ART, PERCENT, CAUSE_OF_DEATH, COUNTRY, ORG, LAW, NAME'''

# Generate the prompt template
prompt_template = f'''Given the input text:
user input: {user_input}

perform NER detection on it.
Following are the NER Tags: {ner_tags}

output must be in the format
tag:value
'''

In [17]:
# Calling our function with the prompt
extracted_entities = openai_chat(prompt_template, model_name, temperature)

# Printing the extracted entities
print(extracted_entities['answer'][:])

LOC: New York City
PERSON: John Smith
ORG: Microsoft Corporation
FAC: Central Park
PERSON: Sarah
FAC: Giovanni's
TIME: later that evening


The parsing code is almost the same as the one we used for the patterns. Because a tag such as `PERSON` can appear on several lines, each tag maps to a list of values so that later matches don't overwrite earlier ones.

In [18]:
# Convert extracted entities to a dictionary of lists
extracted_entities_list = extracted_entities['answer'].split('\n')
extracted_entities_dict = {}
for entity in extracted_entities_list:
    if entity:
        # Splitting the entity into tag and value (only on the first colon)
        tag, value = entity.split(':', 1)
        tag = tag.strip()
        value = value.strip()
        # Appending the value so repeated tags keep every match
        extracted_entities_dict.setdefault(tag, []).append(value)

# Printing the extracted entities
extracted_entities_dict

{'LOC': ['New York City'],
 'PERSON': ['John Smith', 'Sarah'],
 'ORG': ['Microsoft Corporation'],
 'FAC': ['Central Park', "Giovanni's"],
 'TIME': ['later that evening']}

NER detection using OpenAI's GPT-3.5 model can be a more powerful approach than using `SpaCy` or other NLP libraries. Because the model has been trained on a large amount of text data, it can pick up a wide range of entity types, including custom tags, without any task-specific training.

## Wrapping up

In this notebook, we explored how to use the OpenAI GPT-3.5 model for NLP tasks, using prompt engineering to guide the model toward the desired outputs. Other NLP tasks, such as POS tagging and sentiment analysis, can be performed with the same approach. It is cost-effective because it reduces the number of tokens the model generates, which lowers the cost of using the API.

As a next step, you can try out different NLP tasks such as sentence segmentation where you can split the text into meaningful sentences by defining the prompt template accordingly.
```python
prompt_template = '''
Break the above text into meaningful chunks. Each chunk must contain one complete piece of information and can span multiple sentences. Don't return the sentences themselves.

provide output in this format:

last word [break] first word
last word [break] first word
last word [break] first word
'''
```

This way the model won't return the complete text, only the break points, which you can then use in code to split the text into meaningful segments.
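
The splitting step could look something like the sketch below. It assumes the model returns one `last word [break] first word` line per break point, in document order. The function name `split_on_breaks` and the sample text are hypothetical, and break lines that cannot be found in the text are simply skipped.

```python
import re
from typing import List

def split_on_breaks(text: str, break_lines: str) -> List[str]:
    """Split text at each 'last word [break] first word' marker, in order."""
    segments = []
    start = 0
    for line in break_lines.splitlines():
        if "[break]" not in line:
            continue
        last_word, first_word = (part.strip() for part in line.split("[break]", 1))
        if not last_word or not first_word:
            continue
        # Look for the last word, some whitespace, then the first word of the next segment
        pattern = re.compile(re.escape(last_word) + r"(\s+)" + re.escape(first_word))
        match = pattern.search(text, start)
        if match is None:
            continue  # the model returned a break point that is not in the text
        segments.append(text[start:match.start(1)].strip())
        start = match.end(1)
    segments.append(text[start:].strip())
    return segments

sample_text = "Cats sleep a lot. They nap daily. Dogs bark loudly. They guard homes."
sample_breaks = "daily. [break] Dogs"
print(split_on_breaks(sample_text, sample_breaks))
# prints ['Cats sleep a lot. They nap daily.', 'Dogs bark loudly. They guard homes.']
```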