# Extracting and querying facts with GPT-3.5: Study 
In this study, we examine how to use GPT-3.5 to extract facts from natural language (NL) utterances and, later, to query those facts from a database. To do so, we will define:
  - Some simple NL example utterances.
  - Prompts to extract facts from those utterances.
  - Auxiliary mechanisms to store the extracted facts. Here, we will use a simple Pandas DataFrame.
  - Prompts to help query the facts from the database. As we shall see, even here we can leverage GPT-3.5 (i.e., the original ChatGPT model) to help us.
  - Auxiliary mechanisms to apply all of the above to the example utterances in a way that is convenient for our study.

Please note that to execute this notebook you need a working [OpenAI API](https://openai.com/api/) key.

## Setup

In [1]:
from openai import OpenAI
import os
import pandas as pd
from ast import literal_eval

## Auxiliary Functions

Though the OpenAI library itself does most of the heavy lifting, it is still useful to have some auxiliary functions to help with our particular use.

**For security**, we recommend providing the API key in a way that is not recorded in the notebook: define it as an environment variable called `OPENAI_API_KEY`, which the OpenAI client picks up by default. On Windows, go to *System Properties* -> *Environment Variables...* and add a new one. You will probably have to restart your shell (and Jupyter) before the variable becomes visible here.
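Before creating the client, it can help to confirm that the variable is actually visible to the notebook process. The following is a minimal sketch; `api_key_is_configured` is a hypothetical helper introduced only for illustration, and it never prints the key itself.

```python
import os

def api_key_is_configured(env=os.environ):
    """Return True if a non-empty OPENAI_API_KEY is present in `env`."""
    return bool(env.get("OPENAI_API_KEY", "").strip())

# Warn early instead of failing later inside the OpenAI client.
if not api_key_is_configured():
    print("OPENAI_API_KEY is not set; define it and restart Jupyter before continuing.")
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real environment.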

In [2]:
class ChatCompletionClient:
    """Thin wrapper around the OpenAI chat completions API that keeps a message history."""

    def __init__(self, init_system_message="You are an intelligent agent.", model='gpt-3.5-turbo'):
        # OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
        self.openai_client = OpenAI()

        self.init_system_message = init_system_message
        self.model = model
        self.reset()

    def add_user_message(self, content):
        self.current_messages.append({'role': 'user', 'content': content})

    def reset(self):
        # Start a fresh conversation containing only the system message.
        self.current_messages = [{"role": "system", "content": self.init_system_message}]

    def complete(self, user_msg_content,
                 add_to_chat=False,
                 temperature=0.7, max_tokens=1000,
                 top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0, stop=None):
        # Work on a copy so the stored history only changes when add_to_chat is True.
        messages = self.current_messages.copy()
        messages.append({'role': 'user', 'content': user_msg_content})

        response = self.openai_client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            stop=stop
        )

        # Keep only the fields that are valid when the message is sent back to the API.
        reply = response.choices[0].message
        next_message = {'role': reply.role, 'content': reply.content}

        if add_to_chat:
            # Persist both the user message and the model's reply.
            self.current_messages = messages + [next_message]

        return next_message['content']

In [3]:
client = ChatCompletionClient()

In [4]:
client.complete("Hello, how are you?")

"Hello! I'm an AI and I don't have feelings, but I'm here to help you. How can I assist you today?"

In [5]:
def apply_to_examples(examples, prompt_func, temperature=0.1):
    results = []
    for example in examples:
        print(f">>> INPUT: {example}")
        result = client.complete(prompt_func(example), temperature=temperature)
        print(f">>> OUTPUT:\n{result}")
        results.append(result)
        print("========================================\n\n")
    return results

In [6]:
def extract_lines_from_result(result):
    """
    Extracts the lines from the result string.
    """
    lines = [line.strip(' -*') for line in result.split('\n') if len(line) > 0]
    return lines

In [7]:
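def string_to_tuples(s):
    """
    Converts a string with one tuple literal per line into a list of Python values.
    Returns an empty list if any line fails to parse.
    """
    try:
        return [literal_eval(line.strip()) for line in extract_lines_from_result(s)]
    except Exception:
        # literal_eval raises ValueError, SyntaxError and others on malformed input.
        return []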

In [8]:
def extract_terms_from_all_results(results):
    """
    Extracts the lines (terms) from each result string, one list per result.
    """
    terms = []
    for result in results:
        terms.append(extract_lines_from_result(result))
    return terms

## Example Data
Let's define some standard data to exercise the solution below. It includes the *information input* to the system, as well as related *queries* to be asked later.

In [9]:
example_information_to_save = ["Flu shot cost = $80", 
                               "things my boss likes: cricket, science and vegetarian food",
                               "my wife wants a vegetarian food book",
                               "sales guy email = jp@example.com",
                               "vanessa's email vanessa@outlook.com, rember to send the ppts she asked",
                               "+55 11 27670-0987 -> pedro whatsapp",
                               "need to buy milk, eggs and bread", 
                               "need to sell old video game, chair",
                               "december receipts for gym: yoga, ballet, ??",
                               "book with KWG the hardware setup",
                               "dog day with the foreign visitors",
                               "event support: we failed :-(",
                               "floor layout: we forgot various details!!",
                               "first aid kit in the reception",
                               "ask the pediatrician about when to start brushing teeth"]

In [10]:
example_queries = ["What is the cost of a flu shot?", 
                   "shopping list", 
                   "what does my boss like?", 
                   "books my wife wants", 
                   "vegetarian food",
                   "emails",
                   "questions for the pediatrician"]

## Prompt Exercises: Inputting Information

### Prompt 1

A very simple prompt.

In [52]:
def extraction_prompt_1(x):

    prompt =\
f"""
Extract pieces of information, like phone numbers, email addresses, names, trivia, reminders, etc.
Input: {x}
"""
    return prompt 

In [13]:
apply_to_examples(example_information_to_save, extraction_prompt_1, temperature=1.0)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
- Piece of information: Flu shot cost
    - Cost: $80


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
Based on the input provided, here is the information I extracted:

- Interests: cricket, science
- Dietary preference: vegetarian food

Is there anything else I can assist you with?


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
Alright, I understand that you are looking for a vegetarian food book for your wife. Is there a specific name or author you have in mind?


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
Extracted information:
- Phone number: N/A
- Email address: jp@example.com
- Name: N/A
- Trivia: N/A
- Reminder: N/A


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
Extracted pieces of information:

1. Email address: vanessa@outlook.com
2. Reminder: Remember to send the ppts she asked for.


>>> INPUT: +55 11 27670-0987 -> pedro 

['- Piece of information: Flu shot cost\n    - Cost: $80',
 'Based on the input provided, here is the information I extracted:\n\n- Interests: cricket, science\n- Dietary preference: vegetarian food\n\nIs there anything else I can assist you with?',
 'Alright, I understand that you are looking for a vegetarian food book for your wife. Is there a specific name or author you have in mind?',
 'Extracted information:\n- Phone number: N/A\n- Email address: jp@example.com\n- Name: N/A\n- Trivia: N/A\n- Reminder: N/A',
 'Extracted pieces of information:\n\n1. Email address: vanessa@outlook.com\n2. Reminder: Remember to send the ppts she asked for.',
 'Extracted information:\n- Phone number: +55 11 27670-0987\n- Name: Pedro\n- Platform: WhatsApp',
 'As an intelligent agent, I can extract the following pieces of information from your input:\n\n- Trivia: There is no trivia in the input provided.\n- Reminders: There are no reminders in the input provided.\n- Grocery items: Milk, eggs, and bread a

The outputs are far too creative. As a first step, let's reduce the temperature.
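To build some intuition for why a lower temperature helps, the sketch below assumes the usual softmax formulation, in which token scores are divided by the temperature before being turned into probabilities. The scores are made up, `softmax_with_temperature` is a hypothetical helper, and how the OpenAI API applies the parameter internally is not something this notebook verifies.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them.

    Assumes temperature > 0 (a value of 0 would divide by zero in this sketch).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature almost all of the probability mass lands on the highest-scoring candidate, which is why the sampled text becomes less varied.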

In [53]:
apply_to_examples(example_information_to_save, extraction_prompt_1, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
Extracted information:
- Flu shot cost: $80


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
Based on the input provided, here are the extracted pieces of information:

1. Interests: Cricket, Science, Vegetarian food
2. No specific phone numbers, email addresses, or names were mentioned.
3. No trivia or reminders were mentioned.


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
There are several vegetarian food books available. Some popular ones include "The Vegetarian Flavor Bible" by Karen Page, "Thug Kitchen: The Official Cookbook" by Thug Kitchen, and "Plenty: Vibrant Vegetable Recipes from London's Ottolenghi" by Yotam Ottolenghi.


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
Extracted information:
- Email address: jp@example.com


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
Extracted information:
- Email address: vanessa@outlook.c

['Extracted information:\n- Flu shot cost: $80',
 'Based on the input provided, here are the extracted pieces of information:\n\n1. Interests: Cricket, Science, Vegetarian food\n2. No specific phone numbers, email addresses, or names were mentioned.\n3. No trivia or reminders were mentioned.',
 'There are several vegetarian food books available. Some popular ones include "The Vegetarian Flavor Bible" by Karen Page, "Thug Kitchen: The Official Cookbook" by Thug Kitchen, and "Plenty: Vibrant Vegetable Recipes from London\'s Ottolenghi" by Yotam Ottolenghi.',
 'Extracted information:\n- Email address: jp@example.com',
 'Extracted information:\n- Email address: vanessa@outlook.com\n- Reminder: Send the PowerPoint presentations Vanessa asked for',
 'Phone Number: +55 11 27670-0987\nName: Pedro\nNote: WhatsApp',
 'The information extracted from the input "need to buy milk, eggs and bread" is:\n\n- Items to buy: milk, eggs, bread',
 "To sell your old video game and chair, you can try the foll

This is still not particularly useful, although the generated text is now closer to the information provided as input. The output format remains unusable, since it is just free-form text. So let's be more specific about the kind of output we want.

### Prompt 2

Now including some desired output structure and semantics.

In [54]:
def extraction_prompt_2(x):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Key, Value)
Input: {x}
"""
    return prompt 

In [55]:
apply_to_examples(example_information_to_save, extraction_prompt_2, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Trivia", "Flu shot cost", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
(Category, Key, Value)
("Boss", "Likes", "cricket")
("Boss", "Likes", "science")
("Boss", "Likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
(Category, Key, Value)
("Category", "Interest", "Food")
("Category", "Preference", "Vegetarian")
("Category", "Relationship", "Wife")
("Category", "Book", "Vegetarian Food Book")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Email", "Sales Guy", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
[('Email', 'Vanessa', 'vanessa@outlook.com'), ('Reminder', 'Vanessa', 'remember to send the ppts she asked')]


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
(Category, Key, Value)
("Phone Number", "Pedro", "+55 11 27670-0987")
("Contact", "Pedro", "WhatsApp")


>>> INPU

['("Trivia", "Flu shot cost", "$80")',
 '(Category, Key, Value)\n("Boss", "Likes", "cricket")\n("Boss", "Likes", "science")\n("Boss", "Likes", "vegetarian food")',
 '(Category, Key, Value)\n("Category", "Interest", "Food")\n("Category", "Preference", "Vegetarian")\n("Category", "Relationship", "Wife")\n("Category", "Book", "Vegetarian Food Book")',
 '("Email", "Sales Guy", "jp@example.com")',
 "[('Email', 'Vanessa', 'vanessa@outlook.com'), ('Reminder', 'Vanessa', 'remember to send the ppts she asked')]",
 '(Category, Key, Value)\n("Phone Number", "Pedro", "+55 11 27670-0987")\n("Contact", "Pedro", "WhatsApp")',
 "Sorry, but I can't assist with that request.",
 '(Category, Key, Value)\n("Item", "Video Game", "old")\n("Item", "Chair", "old")',
 "Sorry, but I can't assist with that request.",
 "Sorry, but I can't assist with that request.",
 '(Category, Key, Value)\n(Category, Key, Value)\n(Category, Key, Value)\n(Category, Key, Value)\n(Category, Key, Value)\n(Category, Key, Value)\n(Cat

We see various problems: the results include headers, which we don't want (in one case, dozens of them); multiple facts are sometimes put together rather than emitted as separate tuples; some extractions are simply wrong; and in a few cases the model refuses to answer at all. Adding some examples might make our intent clearer. So let's try another prompt.
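Some of these defects can also be tolerated on the parsing side. The sketch below is one possible approach, not the notebook's own method: unlike `string_to_tuples`, which returns an empty list as soon as one line fails, the hypothetical `parse_tuples_leniently` parses each line separately, skips anything that is not a literal tuple (such as the `(Category, Key, Value)` header or a refusal), and flattens list-style answers.

```python
from ast import literal_eval

def parse_tuples_leniently(result):
    """Parse each line on its own and keep only the tuples that were found."""
    facts = []
    for line in result.split("\n"):
        line = line.strip()
        if not line:
            continue
        try:
            value = literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # e.g. a "(Category, Key, Value)" header or a refusal
        # A line may hold one tuple or a list of tuples; normalise to a list.
        candidates = value if isinstance(value, list) else [value]
        facts.extend(c for c in candidates if isinstance(c, tuple))
    return facts

# Two answers in the shapes recorded above.
header_case = '(Category, Key, Value)\n("Boss", "Likes", "cricket")\n("Boss", "Likes", "science")'
list_case = "[('Email', 'Vanessa', 'vanessa@outlook.com'), ('Reminder', 'Vanessa', 'remember to send the ppts she asked')]"
print(parse_tuples_leniently(header_case))
print(parse_tuples_leniently(list_case))
```

Lenient parsing hides formatting noise but not wrong extractions, so improving the prompt is still worthwhile.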

### Prompt 3

Now including an example.

In [56]:
def extraction_prompt_3(x):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Key, Value)

Example input: "Mom's phone number is 555-555-5555"
Example output: ("Family", "mom's phone number", "555-555-5555")

Input: {x}
"""
    return prompt 

In [57]:
apply_to_examples(example_information_to_save, extraction_prompt_3, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Medical", "flu shot cost", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
("Boss", "likes", "cricket")
("Boss", "likes", "science")
("Boss", "likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "wife's food preference", "vegetarian")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Sales", "sales guy email", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
[("Personal", "Name", "Vanessa"), ("Personal", "Email", "vanessa@outlook.com"), ("Reminder", "Task", "Send the ppts she asked")]


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
("Contact", "Pedro", "+55 11 27670-0987")
("Social Media", "Pedro", "WhatsApp")


>>> INPUT: need to buy milk, eggs and bread
>>> OUTPUT:
("Shopping", "grocery list", "milk, eggs, bread")


>>> INPUT: need to sell old video game, chair
>>

['("Medical", "flu shot cost", "$80")',
 '("Boss", "likes", "cricket")\n("Boss", "likes", "science")\n("Boss", "likes", "vegetarian food")',
 '("Family", "wife\'s food preference", "vegetarian")',
 '("Sales", "sales guy email", "jp@example.com")',
 '[("Personal", "Name", "Vanessa"), ("Personal", "Email", "vanessa@outlook.com"), ("Reminder", "Task", "Send the ppts she asked")]',
 '("Contact", "Pedro", "+55 11 27670-0987")\n("Social Media", "Pedro", "WhatsApp")',
 '("Shopping", "grocery list", "milk, eggs, bread")',
 '("To Do", "sell", "old video game, chair")',
 '("Expenses", "December gym receipts", "yoga, ballet")',
 '("Book", "Title", "KWG the hardware setup")',
 '("Event", "description", "dog day with the foreign visitors")',
 '("Event", "support", "we failed")',
 '("Location", "floor layout", "unknown")',
 '("Location", "reception", "first aid kit")',
 '("Health", "pediatrician", "ask about when to start brushing teeth")']

Problems remain with the category names and with inputs that contain multiple facts.

### Prompt 4

To address these points, the next prompt introduces a multi-fact example, an additional assumption regarding multiple facts, and a constraint on the valid categories. This is now a rather dynamic prompt: both the user input and the valid categories are parameters. In general, parameterizing prompts in this way makes them more general and easier to connect to the client application.

In [58]:
def extraction_prompt_4(x, categories=["Family", "Work", "Friends", "Shopping", 
                                       "Health", "Finance", "Travel", "Home", 
                                       "Pets", "Hobbies", "Other"]):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Key, Value)
Assume everything mentioned refers to the same thing. Constraints:
  - Allowed Categories: {', '.join(categories)}


Example input: "Mom's phone number is 555-555-5555"
Example output: ("Family", "mom's phone number", "555-555-5555")

Example input: "Need to do: lab work, ultrasound, buy aspirin"
Example output: 
("Health", "to do", "lab work")
("Health", "to do", "ultrasound")
("Health", "buy", "aspirin")	

Input: {x}
"""
    return prompt 
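A constraint stated in a prompt is only a request, so it can be worth checking the parsed tuples against the same category list in code. The sketch below assumes the tuples have already been parsed; `split_by_allowed_category` and `ALLOWED_CATEGORIES` are hypothetical names that mirror the default `categories` argument above.

```python
ALLOWED_CATEGORIES = ["Family", "Work", "Friends", "Shopping",
                      "Health", "Finance", "Travel", "Home",
                      "Pets", "Hobbies", "Other"]

def split_by_allowed_category(facts, allowed=ALLOWED_CATEGORIES):
    """Separate facts whose first field is an allowed category from the rest."""
    accepted, rejected = [], []
    for fact in facts:
        # Empty tuples and unknown categories both end up in `rejected`.
        (accepted if fact and fact[0] in allowed else rejected).append(fact)
    return accepted, rejected

facts = [("Health", "flu shot cost", "$80"),
         ("Medical", "flu shot cost", "$80")]  # "Medical" is a category seen under Prompt 3
ok, bad = split_by_allowed_category(facts)
print("accepted:", ok)
print("rejected:", bad)
```

Rejected tuples could be logged, retried, or mapped to "Other"; which option fits best depends on the client application.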

In [59]:
apply_to_examples(example_information_to_save, extraction_prompt_4, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Health", "flu shot cost", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
("Work", "boss likes", "cricket")
("Work", "boss likes", "science")
("Work", "boss likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "wife", "vegetarian food book")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Work", "sales guy email", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
("Friends", "vanessa's email", "vanessa@outlook.com")
("Friends", "to do", "send the ppts she asked")


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
("Friends", "pedro", "+55 11 27670-0987")
("Friends", "pedro", "whatsapp")


>>> INPUT: need to buy milk, eggs and bread
>>> OUTPUT:
("Shopping", "to buy", "milk")
("Shopping", "to buy", "eggs")
("Shopping", "to buy", "bread")


>>> INPUT: need to sell old vid

['("Health", "flu shot cost", "$80")',
 '("Work", "boss likes", "cricket")\n("Work", "boss likes", "science")\n("Work", "boss likes", "vegetarian food")',
 '("Family", "wife", "vegetarian food book")',
 '("Work", "sales guy email", "jp@example.com")',
 '("Friends", "vanessa\'s email", "vanessa@outlook.com")\n("Friends", "to do", "send the ppts she asked")',
 '("Friends", "pedro", "+55 11 27670-0987")\n("Friends", "pedro", "whatsapp")',
 '("Shopping", "to buy", "milk")\n("Shopping", "to buy", "eggs")\n("Shopping", "to buy", "bread")',
 '("Shopping", "to sell", "old video game")\n("Shopping", "to sell", "chair")',
 '("Finance", "gym receipts", "December")\n("Hobbies", "gym activities", "yoga")\n("Hobbies", "gym activities", "ballet")',
 '("Hobbies", "book", "KWG the hardware setup")',
 '("Pets", "event", "dog day with the foreign visitors")',
 '("Work", "event support", "we failed")',
 '("Home", "floor layout", "we forgot various details")',
 '("Health", "reception", "first aid kit")',
 

### Prompt 5

Let us increase the expressivity of our facts. By adding the new fields *Type* and *People*, we hope to capture the information provided more effectively.

In [60]:
def extraction_prompt_5(x, categories=["Family", "Work", "Friends", "Shopping", 
                                       "Health", "Finance", "Travel", "Home", 
                                       "Pets", "Hobbies", "Other"]):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Type, People, Key, Value)
Assume everything mentioned refers to the same thing. Constraints:
  - Allowed Categories: {', '.join(categories)}
  - Allowed Types: "List", "Email", "Phone", "Address", "Document", "Pendency", "Price", "Reminder", 
                   "Note", "Doubt", "Wish", "Other"
  - People contain the name or description of the people or organizations concerned, or is empty if no 
    person or organization is mentioned.
  
Example input: "Mom's phone number is 555-555-5555"
Example output: ("Family", "Phone", "mom", "mom's number", "555-555-5555")

Example input: "email of the building administration = adm@example.com"
Example output: ("Work", "Email", "building administration", "email", "adm@example.com")

Example input: "Need to do: lab work, ultrasound, buy aspirin"
Example output: 
("Health", "List", "", "to do", "lab work")
("Health", "List", "", "to do", "ultrasound")
("Shopping", "List", "", "aspirin", "buy")	

Input: {x}
"""
    return prompt 
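With five fields per fact, a small record type makes the tuples easier to handle and gives a natural place to enforce the allowed Types. This is a sketch under stated assumptions: `Fact`, `to_fact`, and `ALLOWED_TYPES` are hypothetical names, and the Type list is copied from the prompt above.

```python
from typing import NamedTuple

ALLOWED_TYPES = {"List", "Email", "Phone", "Address", "Document", "Pendency",
                 "Price", "Reminder", "Note", "Doubt", "Wish", "Other"}

class Fact(NamedTuple):
    category: str
    fact_type: str
    people: str
    key: str
    value: str

def to_fact(raw):
    """Build a Fact from a 5-tuple, or return None if the shape or Type is off."""
    if len(raw) != 5 or raw[1] not in ALLOWED_TYPES:
        return None
    return Fact(*raw)

print(to_fact(("Health", "Price", "", "flu shot", "$80")))
# A Type such as "Book" is not in the allowed list, so this one is rejected.
print(to_fact(("Family", "Book", "wife", "food book", "vegetarian")))
```

Named fields also make later lookups, such as filtering by `people` or `fact_type`, more readable than positional indexing.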

In [61]:
apply_to_examples(example_information_to_save, extraction_prompt_5, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Health", "Price", "", "flu shot", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
("Work", "List", "boss", "likes", "cricket")
("Work", "List", "boss", "likes", "science")
("Work", "List", "boss", "likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Work", "Email", "sales guy", "email", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")
("Friends", "Reminder", "vanessa", "send ppts", "she asked")


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")


>>> INPUT: need to buy milk, eggs and bread
>>> OUTPUT:
("Shopping", "List", "", "groceries", "buy milk")
("

['("Health", "Price", "", "flu shot", "$80")',
 '("Work", "List", "boss", "likes", "cricket")\n("Work", "List", "boss", "likes", "science")\n("Work", "List", "boss", "likes", "vegetarian food")',
 '("Family", "Book", "wife", "food book", "vegetarian")',
 '("Work", "Email", "sales guy", "email", "jp@example.com")',
 '("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")\n("Friends", "Reminder", "vanessa", "send ppts", "she asked")',
 '("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")',
 '("Shopping", "List", "", "groceries", "buy milk")\n("Shopping", "List", "", "groceries", "buy eggs")\n("Shopping", "List", "", "groceries", "buy bread")',
 '("Other", "List", "", "to sell", "old video game")\n("Home", "List", "", "to sell", "chair")',
 '("Finance", "Document", "", "receipts", "december receipts for gym")\n("Hobbies", "List", "", "yoga", "december receipts for gym")\n("Hobbies", "List", "", "ballet", "december receipts for gym")',
 '("Other", "Document", "", "book"

### Prompt 6

We still see some defects regarding multiple facts. We'll now add an instruction to restrict how many tuples are generated, and a new example to further demonstrate what must be done.

In [62]:
def extraction_prompt_6(x, categories=["Family", "Work", "Friends", "Shopping", 
                                       "Health", "Finance", "Travel", "Home", 
                                       "Pets", "Hobbies", "Other"]):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Type, People, Key, Value)
Assume everything mentioned refers to the same thing. Constraints:
  - Allowed Categories: {', '.join(categories)}
  - Allowed Types: "List", "Email", "Phone", "Address", "Document", "Pendency", "Price", "Reminder", 
                   "Note", "Doubt", "Wish", "Other"
  - People contain the name or description of the people or organizations concerned, or is empty if no 
    person or organization is mentioned.
  - Put as much information in each tuple as possible, only breaking in multiple tuples if really needed.
  
Example input: "Mom's phone number is 555-555-5555"
Example output: ("Family", "Phone", "mom", "mom's number", "555-555-5555")

Example input: "email of the building administration = adm@example.com"
Example output: ("Work", "Email", "building administration", "email", "adm@example.com")

Example input: "Need to do: lab work, ultrasound, buy aspirin"
Example output: 
("Health", "List", "", "to do", "lab work")
("Health", "List", "", "to do", "ultrasound")
("Shopping", "List", "", "aspirin", "buy")	

Example input: "2024 investment ideas for company: AI, electric cars, heavy industry, come up with more"
Example output: 
("Finance", "List", "company", "2024 investment idea", "AI")
("Finance", "List", "company", "2024 investment idea", "electric cars")
("Finance", "List", "company", "2024 investment idea", "heavy industry")	
("Finance", "Pendency", "company", "2024 investment ideas", "come up with more")	

Input: {x}
"""
    return prompt 

In [63]:
# this one was causing problems, let's focus on it
apply_to_examples(["december receipts for gym: yoga, ballet, ??"], extraction_prompt_6, temperature=0.1)

>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "List", "gym", "december receipts", "yoga")
("Finance", "List", "gym", "december receipts", "ballet")
("Finance", "Doubt", "gym", "december receipts", "??")




['("Finance", "List", "gym", "december receipts", "yoga")\n("Finance", "List", "gym", "december receipts", "ballet")\n("Finance", "Doubt", "gym", "december receipts", "??")']

In [64]:
apply_to_examples(example_information_to_save, extraction_prompt_6, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Health", "Price", "", "flu shot", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
("Work", "List", "boss", "likes", "cricket")
("Work", "List", "boss", "likes", "science")
("Work", "List", "boss", "likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "vegetarian food book", "")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Work", "Email", "sales guy", "email", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")
("Friends", "Reminder", "vanessa", "send ppts", "she asked")


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")


>>> INPUT: need to buy milk, eggs and bread
>>> OUTPUT:
("Shopping", "List", "", "groceries", "buy milk")
(

['("Health", "Price", "", "flu shot", "$80")',
 '("Work", "List", "boss", "likes", "cricket")\n("Work", "List", "boss", "likes", "science")\n("Work", "List", "boss", "likes", "vegetarian food")',
 '("Family", "Book", "wife", "vegetarian food book", "")',
 '("Work", "Email", "sales guy", "email", "jp@example.com")',
 '("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")\n("Friends", "Reminder", "vanessa", "send ppts", "she asked")',
 '("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")',
 '("Shopping", "List", "", "groceries", "buy milk")\n("Shopping", "List", "", "groceries", "buy eggs")\n("Shopping", "List", "", "groceries", "buy bread")',
 '("Finance", "List", "", "to sell", "old video game")\n("Finance", "List", "", "to sell", "chair")',
 '("Finance", "List", "gym", "december receipts", "yoga")\n("Finance", "List", "gym", "december receipts", "ballet")\n("Finance", "Doubt", "gym", "december receipts", "??")',
 '("Other", "Note", "", "book", "with KWG the hardwa

In [65]:
# this one was causing problems, let's focus on it
apply_to_examples(["my wife wants a vegetarian food book"]*5, extraction_prompt_6, temperature=0.0)

>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Book", "wife", "food book", "vegetarian")




['("Family", "Book", "wife", "food book", "vegetarian")',
 '("Family", "Book", "wife", "food book", "vegetarian")',
 '("Family", "Book", "wife", "food book", "vegetarian")',
 '("Family", "Book", "wife", "food book", "vegetarian")',
 '("Family", "Book", "wife", "food book", "vegetarian")']
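Repeating the same input several times is a quick consistency check, and tallying the distinct answers makes the result explicit. The sketch below uses `collections.Counter` on the five outputs recorded above; `summarize_runs` is a hypothetical helper and assumes a non-empty list of outputs.

```python
from collections import Counter

def summarize_runs(outputs):
    """Count how often each distinct output occurs across repeated runs."""
    counts = Counter(outputs)
    most_common_output, hits = counts.most_common(1)[0]
    return {"distinct": len(counts),
            "agreement": hits / len(outputs),
            "most_common": most_common_output}

# The five identical answers recorded above.
runs = ['("Family", "Book", "wife", "food book", "vegetarian")'] * 5
summary = summarize_runs(runs)
print(summary["distinct"], summary["agreement"])
```

Full agreement only shows that the answer is stable, not that it is correct.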

### Prompt 7

Let us add some final touches: an instruction to avoid redundant tuples and one more example.

In [81]:
def extraction_prompt_7(x, categories=["Family", "Work", "Friends", "Shopping", 
                                       "Health", "Finance", "Travel", "Home", 
                                       "Pets", "Hobbies", "Other"]):

    prompt =\
f"""
Extract pieces of personal information, like phone numbers, email addresses, 
names, trivia, reminders, etc., as tuples with the following format: (Category, Type, People, Key, Value)
Assume everything mentioned refers to the same thing. Constraints:
  - Allowed Categories: {', '.join(categories)}
  - Allowed Types: "List", "Email", "Phone", "Address", "Document", "Pendency", "Price", "Reminder", 
                   "Note", "Doubt", "Wish", "Other"
  - People contain the name or description of the people or organizations concerned, or is empty if no 
    person or organization is mentioned.
  - Put as much information in each tuple as possible, only breaking in multiple tuples if really needed.
  - Don't extract redundant tuples.
  
Example input: "Mom's phone number is 555-555-5555"
Example output: ("Family", "Phone", "mom", "mom's number", "555-555-5555")

Example input: "email of the building administration = adm@example.com"
Example output: ("Work", "Email", "building administration", "email", "adm@example.com")

Example input: "Need to do: lab work, ultrasound, buy aspirin"
Example output: 
("Health", "List", "", "to do", "lab work")
("Health", "List", "", "to do", "ultrasound")
("Shopping", "List", "", "aspirin", "buy")	

Example input: "2024 investment ideas for company: AI, electric cars, heavy industry, come up with more"
Example output: 
("Finance", "List", "company", "2024 investment idea", "AI")
("Finance", "List", "company", "2024 investment idea", "electric cars")
("Finance", "List", "company", "2024 investment idea", "heavy industry")	
("Finance", "Pendency", "company", "2024 investment ideas", "come up with more")	

Example input: "teacher's day with school visitors -> clean up"
Example output: 
("Work", "Reminder", "school visitors", "teacher's day", "clean up")

Input: {x}
"""
    return prompt 
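The instruction against redundant tuples can be backed up by a simple filter after parsing. The sketch below is an assumption about what "redundant" might mean in practice (identical fields once case and surrounding whitespace are ignored); `drop_redundant` is a hypothetical helper, and the sample tuples are made up.

```python
def drop_redundant(facts):
    """Remove repeated tuples, ignoring case and outer whitespace, keeping order."""
    seen = set()
    unique = []
    for fact in facts:
        # Normalised copy used only for comparison; the original tuple is kept.
        fingerprint = tuple(str(field).strip().lower() for field in fact)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(fact)
    return unique

facts = [("Shopping", "List", "", "to buy", "milk"),
         ("Shopping", "List", "", "to buy", "Milk "),
         ("Shopping", "List", "", "to buy", "eggs")]
print(drop_redundant(facts))
```

This only catches near-verbatim repeats; tuples that restate the same fact in different words would still need the prompt, or a stricter comparison, to be filtered out.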

In [82]:
# this one was causing problems, let's focus on it
apply_to_examples(["december receipts for gym: yoga, ballet, ??"]*5, extraction_prompt_7, temperature=0.1)

>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "Document", "gym", "december receipts", "yoga")
("Finance", "Document", "gym", "december receipts", "ballet")


>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "Document", "gym", "december receipts", "yoga")
("Finance", "Document", "gym", "december receipts", "ballet")


>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "Document", "gym", "december receipts", "yoga")
("Finance", "Document", "gym", "december receipts", "ballet")


>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "Document", "gym", "december receipts", "yoga")
("Finance", "Document", "gym", "december receipts", "ballet")


>>> INPUT: december receipts for gym: yoga, ballet, ??
>>> OUTPUT:
("Finance", "Document", "gym", "december receipts", "yoga")
("Finance", "Document", "gym", "december receipts", "ballet")




['("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")',
 '("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")',
 '("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")',
 '("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")',
 '("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")']

In [83]:
# this one was causing problems, let's focus on it
apply_to_examples(["my wife wants a vegetarian food book"]*5, extraction_prompt_7, temperature=0.1)

>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")




['("Family", "Wish", "wife", "vegetarian food book", "")',
 '("Family", "Wish", "wife", "vegetarian food book", "")',
 '("Family", "Wish", "wife", "vegetarian food book", "")',
 '("Family", "Wish", "wife", "vegetarian food book", "")',
 '("Family", "Wish", "wife", "vegetarian food book", "")']

In [84]:
apply_to_examples(example_information_to_save, extraction_prompt_7, temperature=0.1)

>>> INPUT: Flu shot cost = $80
>>> OUTPUT:
("Health", "Price", "", "flu shot", "$80")


>>> INPUT: things my boss likes: cricket, science and vegetarian food
>>> OUTPUT:
("Work", "List", "boss", "likes", "cricket")
("Work", "List", "boss", "likes", "science")
("Work", "List", "boss", "likes", "vegetarian food")


>>> INPUT: my wife wants a vegetarian food book
>>> OUTPUT:
("Family", "Wish", "wife", "vegetarian food book", "")


>>> INPUT: sales guy email = jp@example.com
>>> OUTPUT:
("Work", "Email", "sales guy", "email", "jp@example.com")


>>> INPUT: vanessa's email vanessa@outlook.com, rember to send the ppts she asked
>>> OUTPUT:
("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")
("Friends", "Reminder", "vanessa", "send ppts", "she asked")


>>> INPUT: +55 11 27670-0987 -> pedro whatsapp
>>> OUTPUT:
("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")


>>> INPUT: need to buy milk, eggs and bread
>>> OUTPUT:
("Shopping", "List", "", "groceries", "buy milk")
(

['("Health", "Price", "", "flu shot", "$80")',
 '("Work", "List", "boss", "likes", "cricket")\n("Work", "List", "boss", "likes", "science")\n("Work", "List", "boss", "likes", "vegetarian food")',
 '("Family", "Wish", "wife", "vegetarian food book", "")',
 '("Work", "Email", "sales guy", "email", "jp@example.com")',
 '("Friends", "Email", "vanessa", "email", "vanessa@outlook.com")\n("Friends", "Reminder", "vanessa", "send ppts", "she asked")',
 '("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987")',
 '("Shopping", "List", "", "groceries", "buy milk")\n("Shopping", "List", "", "groceries", "buy eggs")\n("Shopping", "List", "", "groceries", "buy bread")',
 '("Finance", "List", "", "to sell", "old video game")\n("Finance", "List", "", "to sell", "chair")',
 '("Finance", "Document", "gym", "december receipts", "yoga")\n("Finance", "Document", "gym", "december receipts", "ballet")',
 '("Other", "Note", "", "book", "with KWG the hardware setup")',
 '("Pets", "Reminder", "foreign vis

That looks very good now! We'll thus stop iterating on this prompt. To finalize, let's bind the best prompt function and its temperature to names that denote their importance.

In [85]:
best_input_prompt = extraction_prompt_7
best_temperature = 0.1

## Prompt Exercises: Querying Information

We have just gone through some iterations of the information input prompt, and we are satisfied with the results for the time being. Now we can take the next step: populate a database using this mechanism, and then engineer a prompt to query that database. Let's get the input tuples.

In [86]:
input_tuples = []
for utterance in example_information_to_save:  # avoid shadowing the built-in input()
    raw_output = client.complete(best_input_prompt(utterance), temperature=best_temperature)
    for fact_tuple in string_to_tuples(raw_output):
        print(f">>> Processing: {fact_tuple}")
        input_tuples.append(fact_tuple)

input_tuples

>>> Processing: ('Health', 'Price', '', 'flu shot', '$80')
>>> Processing: ('Work', 'List', 'boss', 'likes', 'cricket')
>>> Processing: ('Work', 'List', 'boss', 'likes', 'science')
>>> Processing: ('Work', 'List', 'boss', 'likes', 'vegetarian food')
>>> Processing: ('Family', 'Wish', 'wife', 'vegetarian food book', '')
>>> Processing: ('Work', 'Email', 'sales guy', 'email', 'jp@example.com')
>>> Processing: ('Friends', 'Email', 'vanessa', 'email', 'vanessa@outlook.com')
>>> Processing: ('Friends', 'Reminder', 'vanessa', 'send ppts', 'she asked')
>>> Processing: ('Friends', 'Phone', 'pedro', 'whatsapp', '+55 11 27670-0987')
>>> Processing: ('Shopping', 'List', '', 'groceries', 'buy milk')
>>> Processing: ('Shopping', 'List', '', 'groceries', 'buy eggs')
>>> Processing: ('Shopping', 'List', '', 'groceries', 'buy bread')
>>> Processing: ('Finance', 'List', '', 'to sell', 'old video game')
>>> Processing: ('Finance', 'List', '', 'to sell', 'chair')
>>> Processing: ('Finance', 'Document', '

[('Health', 'Price', '', 'flu shot', '$80'),
 ('Work', 'List', 'boss', 'likes', 'cricket'),
 ('Work', 'List', 'boss', 'likes', 'science'),
 ('Work', 'List', 'boss', 'likes', 'vegetarian food'),
 ('Family', 'Wish', 'wife', 'vegetarian food book', ''),
 ('Work', 'Email', 'sales guy', 'email', 'jp@example.com'),
 ('Friends', 'Email', 'vanessa', 'email', 'vanessa@outlook.com'),
 ('Friends', 'Reminder', 'vanessa', 'send ppts', 'she asked'),
 ('Friends', 'Phone', 'pedro', 'whatsapp', '+55 11 27670-0987'),
 ('Shopping', 'List', '', 'groceries', 'buy milk'),
 ('Shopping', 'List', '', 'groceries', 'buy eggs'),
 ('Shopping', 'List', '', 'groceries', 'buy bread'),
 ('Finance', 'List', '', 'to sell', 'old video game'),
 ('Finance', 'List', '', 'to sell', 'chair'),
 ('Finance', 'Document', 'gym', 'december receipts', 'yoga'),
 ('Finance', 'Document', 'gym', 'december receipts', 'ballet'),
 ('Other', 'Note', '', 'book', 'with KWG the hardware setup'),
 ('Pets', 'Reminder', 'foreign visitors', 'dog d
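
The loop above relies on `string_to_tuples`, defined earlier in the notebook, to turn the model's text output into Python tuples. As a rough illustration of what such a parser has to do, here is a minimal sketch built on `ast.literal_eval`. The name `parse_tuple_lines` and its skip-on-failure behaviour are assumptions made for this example, not the notebook's actual implementation.

```python
from ast import literal_eval


def parse_tuple_lines(raw_output, expected_len=5):
    """Hypothetical helper: parse one tuple per line from the model output.

    Lines that are not valid Python literals, or that do not yield a tuple
    of the expected length, are skipped instead of raising.
    """
    parsed = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            candidate = literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # e.g. a truncated line such as "("
        if isinstance(candidate, tuple) and len(candidate) == expected_len:
            parsed.append(candidate)
    return parsed


raw = '("Health", "Price", "", "flu shot", "$80")\n(\nnot a tuple'
print(parse_tuple_lines(raw))
```

Skipping malformed lines keeps one bad reply from aborting a whole batch, at the price of silently dropping facts, so a real application would probably want to log what was skipped.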

For simplicity, in this exercise our "database" shall be just a Pandas dataframe, but naturally this can be extended to any actual database system. Below we populate it with the information we have already parsed.

In [87]:
database = pd.DataFrame(input_tuples, columns=["Category", "Type", "People", "Key", "Value"])
database

Unnamed: 0,Category,Type,People,Key,Value
0,Health,Price,,flu shot,$80
1,Work,List,boss,likes,cricket
2,Work,List,boss,likes,science
3,Work,List,boss,likes,vegetarian food
4,Family,Wish,wife,vegetarian food book,
5,Work,Email,sales guy,email,jp@example.com
6,Friends,Email,vanessa,email,vanessa@outlook.com
7,Friends,Reminder,vanessa,send ppts,she asked
8,Friends,Phone,pedro,whatsapp,+55 11 27670-0987
9,Shopping,List,,groceries,buy milk


Now we can begin experimenting with the prompts for querying.

### Prompt 1 (querying):

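A very naive solution: for each row of the dataframe, ask the model whether that row is related to the query. This might work, but it requires one API call per row for every query, which seems too costly.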
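
To put a number on the cost: this approach issues one completion request per (query, row) pair, so the number of calls is the product of the two. The sizes below are illustrative assumptions, not measurements from this notebook.

```python
def count_llm_calls(num_queries, num_rows):
    """One completion request per (query, row) pair."""
    return num_queries * num_rows


# Illustrative sizes only (assumed, not taken from the notebook's data).
for rows in (20, 1_000, 100_000):
    print(f"{rows} rows x 7 queries -> {count_llm_calls(7, rows)} calls")
```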

In [88]:
def querying_promt_1(query, example):

    prompt = \
f"""
Determine whether the query "{query}" is related to the tuple "{example}".
Answer (yes/no):
"""

    return prompt

In [89]:
for query in example_queries:
    print(f"INPUT QUERY: {query}")
    for i, row in database.iterrows():
        row_string = f"{tuple(row.values)}"
        print(f"INPUT ROW: {row_string}")
        print(f"OUTPUT: {client.complete(querying_promt_1(query, row_string))}")
    
    print(f"====================")

INPUT QUERY: What is the cost of a flu shot?
INPUT ROW: ('Health', 'Price', '', 'flu shot', '$80')


OUTPUT: Yes.
INPUT ROW: ('Work', 'List', 'boss', 'likes', 'cricket')
OUTPUT: No
INPUT ROW: ('Work', 'List', 'boss', 'likes', 'science')
OUTPUT: No
INPUT ROW: ('Work', 'List', 'boss', 'likes', 'vegetarian food')
OUTPUT: No
INPUT ROW: ('Family', 'Wish', 'wife', 'vegetarian food book', '')
OUTPUT: No
INPUT ROW: ('Work', 'Email', 'sales guy', 'email', 'jp@example.com')
OUTPUT: No
INPUT ROW: ('Friends', 'Email', 'vanessa', 'email', 'vanessa@outlook.com')
OUTPUT: No
INPUT ROW: ('Friends', 'Reminder', 'vanessa', 'send ppts', 'she asked')
OUTPUT: no
INPUT ROW: ('Friends', 'Phone', 'pedro', 'whatsapp', '+55 11 27670-0987')
OUTPUT: No
INPUT ROW: ('Shopping', 'List', '', 'groceries', 'buy milk')
OUTPUT: No
INPUT ROW: ('Shopping', 'List', '', 'groceries', 'buy eggs')
OUTPUT: No
INPUT ROW: ('Shopping', 'List', '', 'groceries', 'buy bread')
OUTPUT: No
INPUT ROW: ('Finance', 'List', '', 'to sell', 'old video game')
OUTPUT: No
INPUT ROW: ('Finance', 'List', '', 'to sell', 'chair')
OUTPUT: No
INPUT ROW
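
Note that the answers come back in slightly different shapes ("Yes.", "No", "no"). If this approach were pursued, the replies would need normalizing before they could drive a filter. A minimal sketch follows; `is_affirmative` is a hypothetical helper, not something defined in this notebook.

```python
def is_affirmative(answer):
    """Hypothetical helper: map a free-text yes/no reply to a boolean.

    Returns True for "yes", False for "no", and None for anything else,
    ignoring case, surrounding whitespace and trailing punctuation.
    """
    normalized = answer.strip().strip(".!").strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    return None


for reply in ["Yes.", "No", "no", "Maybe"]:
    print(repr(reply), "->", is_affirmative(reply))
```

Returning `None` for unexpected replies, rather than guessing, leaves it to the caller to decide whether to retry or discard the row.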

### Prompt 2 (querying):

What if, instead, we extract some key terms from the query and use the result for searching?

In [90]:
def querying_promt_2(query):

    prompt = \
f"""
Extract the main entities (one per line, without bullets) in the following sentence: "{query}"
"""

    return prompt

In [91]:
results = apply_to_examples(example_queries, querying_promt_2, temperature=0.5)

>>> INPUT: What is the cost of a flu shot?
>>> OUTPUT:
cost
flu shot


>>> INPUT: shopping list
>>> OUTPUT:
shopping list


>>> INPUT: what does my boss like?
>>> OUTPUT:
boss


>>> INPUT: books my wife wants
>>> OUTPUT:
books
wife


>>> INPUT: vegetarian food
>>> OUTPUT:
- vegetarian food


>>> INPUT: emails
>>> OUTPUT:
emails


>>> INPUT: questions for the pediatrician
>>> OUTPUT:
- questions
- pediatrician




In [92]:
examples_terms = extract_terms_from_all_results(results)
examples_terms

[['cost', 'flu shot'],
 ['shopping list'],
 ['boss'],
 ['books', 'wife'],
 ['vegetarian food'],
 ['emails'],
 ['questions', 'pediatrician']]
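
Although the prompt asks for no bullets, a few replies above still contain them ("- vegetarian food"), yet the resulting term lists are clean. That cleanup is done by `extract_terms_from_all_results`, which is defined elsewhere in the notebook. For readers who want the gist, a minimal sketch of this kind of cleanup might look like the following. `clean_lines` and its regular expression are assumptions for illustration, not the notebook's code.

```python
import re

# Leading bullet ("-", "*", "•") or numbering ("1.", "2)") plus whitespace.
_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def clean_lines(raw_result):
    """Hypothetical helper: split a reply into lines and strip list markers."""
    terms = []
    for line in raw_result.splitlines():
        term = _PREFIX.sub("", line).strip()
        if term:
            terms.append(term)
    return terms


print(clean_lines("- questions\n- pediatrician"))
print(clean_lines("cost\nflu shot"))
```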

In [93]:
best_terms_extraction_prompt = querying_promt_2

### Prompt 3 (querying):

As we can see, entity extraction seems to work. But what if the database uses a different word for the same thing, that is, a synonym? Let's try some data augmentation to complement the previous prompt.

In [94]:
def querying_promt_3(term):

    prompt = \
f"""
List some synonyms to the following term: "{term}"
Synonyms (one synonym per line):
"""

    return prompt

In [95]:
augmentation_raw_results = []
for terms in examples_terms:
    augmentation_raw_result = apply_to_examples(terms, querying_promt_3)
    augmentation_raw_results.append(augmentation_raw_result)

augmentation_raw_results

>>> INPUT: cost


>>> OUTPUT:
price
expense
fee
charge
outlay
expenditure
bill
payment
tariff
rate


>>> INPUT: flu shot
>>> OUTPUT:
- Influenza vaccination
- Flu vaccine
- Flu jab
- Flu immunization
- Influenza shot
- Flu inoculation
- Flu injection
- Flu booster
- Flu preventive
- Flu prophylaxis


>>> INPUT: shopping list
>>> OUTPUT:
grocery list
to-buy list
purchase list
shopping checklist
shopping inventory
shopping roster
shopping agenda
shopping register
shopping record
shopping schedule


>>> INPUT: boss
>>> OUTPUT:
manager
supervisor
chief
director
head
leader
executive
administrator
overseer
controller


>>> INPUT: books
>>> OUTPUT:
- Novels
- Literature
- Texts
- Publications
- Tomes
- Manuscripts
- Volumes
- Works
- Publications
- Written works


>>> INPUT: wife
>>> OUTPUT:
spouse
partner
better half
mate
life partner
significant other
companion
helpmate
better half
beloved
bride
woman
missus
lady
husband


>>> INPUT: vegetarian food
>>> OUTPUT:
plant-based food
vegan food
herbivorous food
m

[['price\nexpense\nfee\ncharge\noutlay\nexpenditure\nbill\npayment\ntariff\nrate',
  '- Influenza vaccination\n- Flu vaccine\n- Flu jab\n- Flu immunization\n- Influenza shot\n- Flu inoculation\n- Flu injection\n- Flu booster\n- Flu preventive\n- Flu prophylaxis'],
 ['grocery list\nto-buy list\npurchase list\nshopping checklist\nshopping inventory\nshopping roster\nshopping agenda\nshopping register\nshopping record\nshopping schedule'],
 ['manager\nsupervisor\nchief\ndirector\nhead\nleader\nexecutive\nadministrator\noverseer\ncontroller'],
 ['- Novels\n- Literature\n- Texts\n- Publications\n- Tomes\n- Manuscripts\n- Volumes\n- Works\n- Publications\n- Written works',
  'spouse\npartner\nbetter half\nmate\nlife partner\nsignificant other\ncompanion\nhelpmate\nbetter half\nbeloved\nbride\nwoman\nmissus\nlady\nhusband'],
 ['plant-based food\nvegan food\nherbivorous food\nmeatless food\nnon-meat food\ncruelty-free food\nanimal-free food\nveggie food\ngreen food\nplant-powered food'],
 ['1.

In [96]:
example_augmentations = []
for raw_result in augmentation_raw_results:
    example_augmentation = extract_terms_from_all_results(raw_result)
    flat_example_augmentation = [item for sublist in example_augmentation for item in sublist]
    example_augmentations.append(flat_example_augmentation)

example_augmentations    

[['price',
  'expense',
  'fee',
  'charge',
  'outlay',
  'expenditure',
  'bill',
  'payment',
  'tariff',
  'rate',
  'Influenza vaccination',
  'Flu vaccine',
  'Flu jab',
  'Flu immunization',
  'Influenza shot',
  'Flu inoculation',
  'Flu injection',
  'Flu booster',
  'Flu preventive',
  'Flu prophylaxis'],
 ['grocery list',
  'to-buy list',
  'purchase list',
  'shopping checklist',
  'shopping inventory',
  'shopping roster',
  'shopping agenda',
  'shopping register',
  'shopping record',
  'shopping schedule'],
 ['manager',
  'supervisor',
  'chief',
  'director',
  'head',
  'leader',
  'executive',
  'administrator',
  'overseer',
  'controller'],
 ['Novels',
  'Literature',
  'Texts',
  'Publications',
  'Tomes',
  'Manuscripts',
  'Volumes',
  'Works',
  'Publications',
  'Written works',
  'spouse',
  'partner',
  'better half',
  'mate',
  'life partner',
  'significant other',
  'companion',
  'helpmate',
  'better half',
  'beloved',
  'bride',
  'woman',
  'missus'

In [97]:
best_augmentation_prompt = querying_promt_3
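
The augmented lists above contain repeats (for example, "Publications" and "better half" each appear twice) and at least one term that is not a synonym at all ("husband" for "wife"). Repeats are harmless for the search but easy to remove; wrong terms would need a separate check. Below is a minimal sketch of order-preserving, case-insensitive de-duplication, where `dedupe_terms` is a hypothetical helper name.

```python
def dedupe_terms(terms):
    """Hypothetical helper: drop repeated terms, ignoring case, keeping order."""
    seen = set()
    unique = []
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    return unique


print(dedupe_terms(["spouse", "better half", "mate", "Better half", "spouse"]))
```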

### Querying Demo

We can now actually query our database! Naturally, a proper search mechanism would be much more sophisticated, but this does show the potential of the approach. Below we decouple a basic filtering mechanism from the actual keyword search.

In [98]:
def database_filtered_by(df, categories=None, entry_types=None, people=None):

    def aux_filter(df, column, values):
        if values is not None and len(values) > 0:
            return df[df[column].str.lower().isin([v.lower() for v in values])]
        else:
            return df
    
    df = aux_filter(df, "Category", categories)
    df = aux_filter(df, "Type", entry_types)
    df = aux_filter(df, "People", people)
        
    return df
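
As a quick illustration of the filter's behaviour, the snippet below repeats the function on a three-row toy dataframe (the rows are copied from the examples above). Matching is case-insensitive and exact per value, and a `None` or empty filter leaves the dataframe untouched.

```python
import pandas as pd


def database_filtered_by(df, categories=None, entry_types=None, people=None):
    # Same logic as the notebook cell above, repeated so the snippet runs alone.
    def aux_filter(df, column, values):
        if values is not None and len(values) > 0:
            return df[df[column].str.lower().isin([v.lower() for v in values])]
        return df

    df = aux_filter(df, "Category", categories)
    df = aux_filter(df, "Type", entry_types)
    df = aux_filter(df, "People", people)
    return df


toy = pd.DataFrame(
    [
        ("Health", "Price", "", "flu shot", "$80"),
        ("Work", "List", "boss", "likes", "cricket"),
        ("Work", "Email", "sales guy", "email", "jp@example.com"),
    ],
    columns=["Category", "Type", "People", "Key", "Value"],
)

# Filters combine with AND across columns; values within a filter are ORed.
print(database_filtered_by(toy, categories=["work"], entry_types=["LIST"]))
print(len(database_filtered_by(toy)))  # no filters: every row is kept
```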

In [99]:
def search_dataframe(df, original_terms, augmented_terms):
    """
    Searches the database for the specified terms.
    """
    all_terms = original_terms + augmented_terms
    df = df.fillna("")

    df_results = None
    for column in df.columns:
        df_result = df[df[column].str.contains("|".join(all_terms), case=False).fillna(False)]
        if df_results is None:
            df_results = df_result
        else:
            df_results = pd.concat([df_results, df_result])
            
    return df_results
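
Two properties of the function above are worth knowing. First, the terms are joined into a regular expression unescaped, so a term containing a metacharacter (such as the "+" in a phone number) is interpreted as a pattern. Second, a row that matches in more than one column is returned once per matching column, which is why the same row can appear twice in the results. The variant below is a sketch of one way to address both, using `re.escape` and a single boolean mask. `search_dataframe_unique` is a hypothetical name and is not part of the notebook's pipeline.

```python
import re

import pandas as pd


def search_dataframe_unique(df, original_terms, augmented_terms):
    """Hypothetical variant: literal, case-insensitive match in any column,
    returning each matching row once and in its original order."""
    all_terms = [t for t in original_terms + augmented_terms if t]
    if not all_terms:
        return df.iloc[0:0]  # nothing to search for: empty result

    pattern = "|".join(re.escape(term) for term in all_terms)
    text = df.fillna("").astype(str)

    mask = pd.Series(False, index=df.index)
    for column in text.columns:
        mask |= text[column].str.contains(pattern, case=False, regex=True)
    return df[mask]


toy = pd.DataFrame(
    [
        ("Health", "Price", "", "flu shot", "$80"),
        ("Friends", "Phone", "pedro", "whatsapp", "+55 11 27670-0987"),
        ("Work", "List", "boss", "likes", "cricket"),
    ],
    columns=["Category", "Type", "People", "Key", "Value"],
)

print(search_dataframe_unique(toy, ["cost", "flu shot"], ["price"]))
print(search_dataframe_unique(toy, ["+55"], []))
```

Whether duplicates should be dropped is a design choice: the number of matching columns could also serve as a crude relevance score, in which case counting the hits per row would be more useful than discarding them.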

In [100]:
database

Unnamed: 0,Category,Type,People,Key,Value
0,Health,Price,,flu shot,$80
1,Work,List,boss,likes,cricket
2,Work,List,boss,likes,science
3,Work,List,boss,likes,vegetarian food
4,Family,Wish,wife,vegetarian food book,
5,Work,Email,sales guy,email,jp@example.com
6,Friends,Email,vanessa,email,vanessa@outlook.com
7,Friends,Reminder,vanessa,send ppts,she asked
8,Friends,Phone,pedro,whatsapp,+55 11 27670-0987
9,Shopping,List,,groceries,buy milk


In [101]:
for i, original_terms in enumerate(examples_terms):
    augmented_terms = example_augmentations[i]
    print(f"Search terms: {original_terms}")
    print(f"Augmented terms: {augmented_terms}")
    print(search_dataframe(database, original_terms, augmented_terms))
    print(f"====================")

Search terms: ['cost', 'flu shot']
Augmented terms: ['price', 'expense', 'fee', 'charge', 'outlay', 'expenditure', 'bill', 'payment', 'tariff', 'rate', 'Influenza vaccination', 'Flu vaccine', 'Flu jab', 'Flu immunization', 'Influenza shot', 'Flu inoculation', 'Flu injection', 'Flu booster', 'Flu preventive', 'Flu prophylaxis']
  Category   Type People       Key Value
0   Health  Price         flu shot   $80
0   Health  Price         flu shot   $80
Search terms: ['shopping list']
Augmented terms: ['grocery list', 'to-buy list', 'purchase list', 'shopping checklist', 'shopping inventory', 'shopping roster', 'shopping agenda', 'shopping register', 'shopping record', 'shopping schedule']
Empty DataFrame
Columns: [Category, Type, People, Key, Value]
Index: []
Search terms: ['boss']
Augmented terms: ['manager', 'supervisor', 'chief', 'director', 'head', 'leader', 'executive', 'administrator', 'overseer', 'controller']
  Category  Type People    Key            Value
1     Work  List   boss  l

## Complete Solution

Now that we have seen the individual components of the solution, we can put them together into two functions: one for inserting information and one for querying it. These are essentially what we'll use in the final application, carrying the results of our notebook studies over to the actual product.

In [102]:
facts_database = pd.DataFrame([], columns=["Category", "Type", "People", "Key", "Value"])

In [103]:
def extract_facts(facts_utterance):
    fact_tuples = string_to_tuples(client.complete(best_input_prompt(facts_utterance), temperature=best_temperature))
    return fact_tuples

In [104]:
def insert_facts(facts_utterance, database):
    """
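    Extracts the facts contained in an utterance and appends them to the database.
    Returns the updated dataframe; the dataframe passed in is not modified.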
    """
    fact_tuples = extract_facts(facts_utterance)
    print(f"Facts: {fact_tuples}")
    for fact_tuple in fact_tuples:
        # we add the tuple only if at least one of the important information fields is not empty
        if len(fact_tuple[2]) > 0 or len(fact_tuple[3]) > 0 or len(fact_tuple[4]) > 0:
            df_to_add = pd.DataFrame([fact_tuple], columns=["Category", "Type", "People", "Key", "Value"])
            database = pd.concat([database, df_to_add], ignore_index=True)
    return database

In [105]:
def query(fact_query, database, categories=None, entry_types=None, people=None, verbose=False):
    """
    Queries the database for a fact. If requested, prior to keyword search, filters the database 
    by categories, entry types and people.
    """
    raw_original_terms = client.complete(best_terms_extraction_prompt(fact_query))
    original_terms = extract_lines_from_result(raw_original_terms)
    if verbose:
        print(original_terms)

    augmented_terms = []
    for original_term in original_terms:
        raw_augmented_terms = client.complete(best_augmentation_prompt(original_term))
        augmented_terms += extract_lines_from_result(raw_augmented_terms)
    if verbose:
        print(augmented_terms)
    
    
    return search_dataframe(database_filtered_by(database, categories, entry_types, people), 
                            original_terms, augmented_terms)
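
Because `query` calls the API once for term extraction and once more per extracted term, it is awkward to test offline. One option, sketched below under assumptions, is to pass the two language-model steps in as plain functions so that they can be replaced by stubs. The names `query_with`, `fake_extract` and `fake_augment` are hypothetical, and the search step is reduced to a simple substring match for brevity.

```python
import pandas as pd


def query_with(fact_query, database, extract_terms, augment_term):
    """Hypothetical offline-friendly variant of the query pipeline.

    extract_terms: callable mapping a query string to a list of terms.
    augment_term: callable mapping one term to a list of related terms.
    """
    original_terms = extract_terms(fact_query)
    augmented_terms = []
    for term in original_terms:
        augmented_terms += augment_term(term)

    all_terms = [t.lower() for t in original_terms + augmented_terms]
    text = database.fillna("").astype(str)

    def row_matches(row):
        # A row matches when any term is a substring of any of its cells.
        return any(term in cell.lower() for cell in row for term in all_terms)

    return database[text.apply(row_matches, axis=1)]


# Stubs standing in for the two language-model calls (canned answers).
def fake_extract(fact_query):
    return ["vaccine", "cost"]


def fake_augment(term):
    return {"vaccine": ["flu shot"], "cost": ["price"]}.get(term, [])


facts = pd.DataFrame(
    [
        ("Health", "Price", "", "flu shot", "$80"),
        ("Shopping", "List", "", "to buy", "detergent"),
    ],
    columns=["Category", "Type", "People", "Key", "Value"],
)

print(query_with("vaccine cost", facts, fake_extract, fake_augment))
```

With the stubs in place, the search logic can be exercised without an API key, and the canned answers make failures reproducible.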


### Early Demo
Before you actually invest more time in building an application, it is perhaps wise to exercise it right here in a notebook. You can do it yourself, or you can invite some other stakeholders to do it. It is unlikely that your non-technical stakeholders will be able to use the notebook, so ideally you should try some scenarios together, with you operating the notebook and them providing the input and evaluation feedback.

In [106]:
facts_database

Unnamed: 0,Category,Type,People,Key,Value


In [107]:
facts_database = insert_facts("Flu shot cost = $80", database=facts_database)
facts_database = insert_facts("Don't forget car checkup", database=facts_database)
facts_database = insert_facts("Buy MSFT stock at $250", database=facts_database)
facts_database = insert_facts("MSFT stock PE ratio is around 25", database=facts_database)
facts_database = insert_facts("Mom's phone is 555-0000-1111", database=facts_database)

Facts: [('Health', 'Price', '', 'flu shot', '$80')]
Facts: [('Car', 'Reminder', '', 'car checkup', "don't forget")]
Facts: [('Finance', 'List', '', 'to buy', 'MSFT stock'), ('Finance', 'Price', '', 'MSFT stock', '$250')]
Facts: [('Finance', 'Price', 'MSFT stock', 'PE ratio', 'around 25')]
Facts: [('Family', 'Phone', 'mom', "mom's phone", '555-0000-1111')]


In [108]:
facts_database = insert_facts("need to buy paper towels, tonic water and detergent", database=facts_database)

Facts: [('Shopping', 'List', '', 'to buy', 'paper towels'), ('Shopping', 'List', '', 'to buy', 'tonic water'), ('Shopping', 'List', '', 'to buy', 'detergent')]


In [109]:
# try adding some nonsense
facts_database = insert_facts("ka lkaj kljakl jakl jla;a;;;;", database=facts_database)

Facts: []


In [110]:
facts_database = insert_facts("Colorless green ideas sleep furiously", database=facts_database)

Facts: [('Other', 'Note', '', '', 'Colorless green ideas sleep furiously')]


In [111]:
facts_database

Unnamed: 0,Category,Type,People,Key,Value
0,Health,Price,,flu shot,$80
1,Car,Reminder,,car checkup,don't forget
2,Finance,List,,to buy,MSFT stock
3,Finance,Price,,MSFT stock,$250
4,Finance,Price,MSFT stock,PE ratio,around 25
5,Family,Phone,mom,mom's phone,555-0000-1111
6,Shopping,List,,to buy,paper towels
7,Shopping,List,,to buy,tonic water
8,Shopping,List,,to buy,detergent
9,Other,Note,,,Colorless green ideas sleep furiously


In [112]:
query("vaccine cost", database=facts_database)

Unnamed: 0,Category,Type,People,Key,Value
0,Health,Price,,flu shot,$80
3,Finance,Price,,MSFT stock,$250
4,Finance,Price,MSFT stock,PE ratio,around 25
0,Health,Price,,flu shot,$80


In [113]:
query("vaccine cost", database=facts_database, categories=["Health"])

Unnamed: 0,Category,Type,People,Key,Value
0,Health,Price,,flu shot,$80
0,Health,Price,,flu shot,$80


In [114]:
query("things to purchase", database=facts_database)

Unnamed: 0,Category,Type,People,Key,Value
2,Finance,List,,to buy,MSFT stock
6,Shopping,List,,to buy,paper towels
7,Shopping,List,,to buy,tonic water
8,Shopping,List,,to buy,detergent
1,Car,Reminder,,car checkup,don't forget
