## Summary

Repo: https://github.com/pgahq/instructor-groq-openai-llm-examples

This notebook shows how to use Instructor to extract structured information from unstructured text. The twist is that the list of entities to extract is specified at runtime rather than hard-coded in a Pydantic model. This could easily be turned into an API. For example:

```
Endpoint: /extract_strings
Body:
{
    "content": "The Boy Who Cried Wolf\n\nOnce upon a time, there was a young shepherd boy...",
    "attribute": [
        ["moral", "Moral of the story"]
    ],
    "model_notes": ""
}
```
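
A minimal sketch of how a server might validate a body like the one above before running the extraction, using only the standard library. `ExtractRequest` and `parse_request` are hypothetical names for illustration and are not part of the repo; the rule that attribute names must be valid Python identifiers is an assumption, made because the names later become model field names.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractRequest:
    """Hypothetical container for the request body shown above."""
    content: str
    attribute: List[Tuple[str, str]]
    model_notes: str = ""


def parse_request(raw_body: str) -> ExtractRequest:
    """Parse and validate a JSON body; raise ValueError on malformed input."""
    data = json.loads(raw_body)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")

    content = data.get("content")
    if not isinstance(content, str) or not content:
        raise ValueError("'content' must be a non-empty string")

    attribute = data.get("attribute")
    if not isinstance(attribute, list) or not attribute:
        raise ValueError("'attribute' must be a non-empty list")

    pairs = []
    for item in attribute:
        # Each attribute is a two-element [name, description] array of strings.
        if (not isinstance(item, list) or len(item) != 2
                or not all(isinstance(part, str) for part in item)):
            raise ValueError("each attribute must be a [name, description] pair")
        name, description = item
        # Assumption: names become field names later, so require identifiers.
        if not name.isidentifier():
            raise ValueError(f"invalid attribute name: {name!r}")
        pairs.append((name, description))

    model_notes = data.get("model_notes", "")
    if not isinstance(model_notes, str):
        raise ValueError("'model_notes' must be a string")
    return ExtractRequest(content, pairs, model_notes)


body = '{"content": "Once upon a time...", "attribute": [["moral", "Moral of the story"]], "model_notes": ""}'
request = parse_request(body)
print(request.attribute)
```

The parsed `content`, `attribute`, and `model_notes` values line up with the arguments of the `extract_strings` function defined later in this notebook.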


Note: this notebook assumes you're using Google Colab. You can safely edit and experiment here, or go to `File` -> `Save a copy in Google Drive` to make your own version.

In [1]:
!pip install --quiet instructor groq openai jsonref


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In the left sidebar, click the key icon and add two secrets holding your API keys. Be sure to enable "Notebook access" for each of them. This is how Google Colab stores secrets; you're not sharing your keys with anyone.

OPENAI_API_KEY - get a key from https://platform.openai.com/api-keys

GROQ_API_KEY - get a key from https://console.groq.com/keys

In [2]:
import instructor
import openai
import groq
from pydantic import BaseModel, Field
from typing import Optional, List
import os
from rich import print as rprint

try:
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = '' or userdata.get('OPENAI_API_KEY') # or put your key in the '' on this line
    os.environ['GROQ_API_KEY'] = '' or userdata.get('GROQ_API_KEY')
except Exception as e:
    # print(e)
    pass

if not os.environ.get('OPENAI_API_KEY') or not os.environ.get('GROQ_API_KEY'):
    raise ValueError("Both OPENAI_API_KEY and GROQ_API_KEY environment variables must be set and non-empty. Read the text in the notebook (above this block) for more info.")


In [11]:
inference_provider = "openai"   # "openai" or "groq"
client = instructor.from_openai(openai.OpenAI()) if inference_provider == "openai" else instructor.from_groq(groq.Groq())

def extract_strings(content: str, attribute: List[tuple], model_notes: str = "") -> dict:
    # Each entry in `attribute` is a (field_name, description) pair.
    # Every field is typed Optional[str], so the LLM may return None for a value it cannot find.
    annotations = {attr: Optional[str] for attr, _ in attribute}
    fields = {attr: Field(description=desc) for attr, desc in attribute}

    # Create the ExtractStrings class dynamically with a docstring
    ExtractStrings = type('ExtractStrings', (BaseModel,), {
        '__annotations__': annotations,
        '__doc__': model_notes,
        **fields
    })

    result = client.chat.completions.create(
        model="llama-3.1-70b-versatile" if inference_provider == "groq" else "gpt-4o-mini",
        response_model=ExtractStrings,
        temperature=0.0,
        messages=[{"role": "user", "content": content}]
    )
    return result.model_dump()
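
As an alternative to calling `type()` directly, the same dynamic model could be built with Pydantic's `create_model` helper. This is a sketch assuming Pydantic v2; `build_extract_model` is a hypothetical name, and unlike the cell above it sets `default=None` so that each field is not required.

```python
from typing import List, Optional, Tuple, Type

from pydantic import BaseModel, Field, create_model


def build_extract_model(
    attribute: List[Tuple[str, str]], model_notes: str = ""
) -> Type[BaseModel]:
    """Build a model with one Optional[str] field per (name, description) pair."""
    field_definitions = {
        # default=None marks the field as not required, so a missing value validates.
        name: (Optional[str], Field(default=None, description=description))
        for name, description in attribute
    }
    model = create_model("ExtractStrings", **field_definitions)
    # Attach the notes as the class docstring, mirroring the '__doc__' entry above.
    model.__doc__ = model_notes or None
    return model


ExtractStrings = build_extract_model(
    [("name", "Name of the user"), ("age", "Age of the user")],
    model_notes="Extract details about the user.",
)
print(ExtractStrings.model_json_schema()["properties"]["name"]["description"])
```

The returned class could be passed as `response_model` in the same way as the dynamically created `ExtractStrings` class.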



## Let it fly!
`extract_strings` could be turned into an API. Note that this code returns only strings and treats every field as optional. You could easily extend it to return more structured data.
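
One way such an extension might start is to let each attribute carry a type name alongside its description. The sketch below only builds the annotations dictionary; the triple format, the `TYPE_NAMES` mapping, and `build_annotations` are hypothetical and not part of the repo.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from type names accepted in a request to Python types.
TYPE_NAMES = {"string": str, "integer": int, "number": float, "boolean": bool}


def build_annotations(attribute: List[Tuple[str, str, str]]) -> Dict[str, object]:
    """Map (name, type_name, description) triples to Optional[...] annotations."""
    annotations = {}
    for name, type_name, _description in attribute:
        if type_name not in TYPE_NAMES:
            raise ValueError(f"unsupported type for {name!r}: {type_name!r}")
        # Keep every field optional, as the string-only version does.
        annotations[name] = Optional[TYPE_NAMES[type_name]]
    return annotations


typed_fields = [
    ("name", "string", "Name of the user"),
    ("age", "integer", "Age of the user in years"),
]
print(build_annotations(typed_fields))
```

The resulting dictionary has the same shape as the `annotations` dictionary that `extract_strings` builds, with `Optional[int]` and similar types in place of `Optional[str]`.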

In [12]:
fields_to_extract = [
    ("name", "Name of the user"),       # these could be passed in as API args
    ("age", "Age of the user"),
    ("email", "Email of the user"),
]

content = "Jason is the user and he's 25 years older than Rick who was born 3 years ago."   # this could be passed in as an API arg

rprint(extract_strings(content, fields_to_extract))


## Detailed descriptions
The text describing a field can be quite sophisticated to nudge the LLM to give exactly the desired results. LLMs do well with markdown.

In [15]:
fields_to_extract = [
    ("firstName", """
     ## Requirements
     First name or first initial. You must add a trailing period to an initial if it doesn't have one.
     
     ## Additional info
     You will receive a string containing an unformatted name or part of a name (e.g. initials). If there's a comma, that means the last name is first. A middle initial would be ignored because it is neither first nor last name.
     """),
     
    ("lastName", "Last name or last initial. Initials should include a period after the letter. You will receive a string containing an unformatted name or part of a name (e.g. initials). If there's a comma, that means the last name is first. A middle initial would be ignored because it is neither first nor last name."),

    ("punctuationCheck", "Your thought process for punctuating the parts of the name.")
]

print(extract_strings("Mark St. Anthony", fields_to_extract))
print(extract_strings("M. St. Anthony", fields_to_extract))
print(extract_strings("M St. Anthony", fields_to_extract))
print(extract_strings("St. Anthony, Mark", fields_to_extract))
print(extract_strings("Daniel Rios-Munoz", fields_to_extract))
print(extract_strings("Rios-Munoz, D", fields_to_extract))
print(extract_strings("Rios-Munoz, D.", fields_to_extract))
print(extract_strings("Brian J Jeter", fields_to_extract))
print(extract_strings("Brian Jeter", fields_to_extract))
print(extract_strings("Jeter, Brian", fields_to_extract))
print(extract_strings("Jeter, Brian J.", fields_to_extract))

{'firstName': 'Mark', 'lastName': 'St. Anthony', 'punctuationCheck': 'The name is formatted correctly with a first name and a last name.'}
{'firstName': 'M.', 'lastName': 'St. Anthony', 'punctuationCheck': "The first name is an initial with a period, and the last name includes a period after 'St'."}
{'firstName': 'M.', 'lastName': 'St. Anthony', 'punctuationCheck': "The first name is an initial with a period, and the last name is a compound name with a period after 'St'."}
{'firstName': 'Mark', 'lastName': 'St. Anthony', 'punctuationCheck': "Last name is first, so it should be formatted as 'St. Anthony, M.'"}
{'firstName': 'Daniel', 'lastName': 'Rios-Munoz', 'punctuationCheck': "First name is 'Daniel' and last name is 'Rios-Munoz'."}
{'firstName': 'D.', 'lastName': 'Rios-Munoz', 'punctuationCheck': 'The last name is first due to the comma, and the first initial is provided.'}
{'firstName': 'D.', 'lastName': 'Rios-Munoz', 'punctuationCheck': 'The first name is an initial, so it should h
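
The triple-quoted `firstName` description above carries the source code's indentation and surrounding blank lines into the text sent to the LLM. Whether that affects results is not established here, but a small sketch of normalizing such text with the standard library's `inspect.cleandoc` follows; the description is shortened for illustration.

```python
import inspect

raw_description = """
     ## Requirements
     First name or first initial.

     ## Additional info
     If there's a comma, that means the last name is first.
     """

# cleandoc removes the common indentation and the blank lines at both ends.
description = inspect.cleandoc(raw_description)
print(description)
```

The cleaned string could then be used as the second element of the `("firstName", ...)` pair.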

In [6]:
fields_to_extract = [
    ("customer_rights", "Customer rights"),
    ("customer_obligations", "Customer obligations"),
    ("vendor_rights", "Vendor rights"),
    ("vendor_obligations", "Vendor obligations"),
]

content = """
5. Ownership of Work Product. Upon full payment of all fees owed to the Service Provider, the Service Provider agrees to assign and transfer to the Customer all rights, title, and interest in and to the Software, including all intellectual property rights, free and clear of any encumbrances.
"""
rprint(extract_strings(content, fields_to_extract))

In [7]:
fields_to_extract = [
    ("customer_concerns", "Short descriptions of all issues that are not favorable to the customer."),
    ("suggested_changes", "Suggested changes to the contract that would resolve the customer's concerns."),
]

content = """
4. Change Requests. If the Customer requests any material changes to the scope of services (including specifications, design, or functionality), the Service Provider will assess the impact on the project timeline and costs. The Service Provider will provide the Customer with a written change order detailing the additional costs and time required to implement the changes. A $1,000,000 fee will be assessed. The Customer must approve the change order in writing before the Service Provider proceeds with the changes.

5. Ownership of Work Product. Upon full payment of all fees owed to the Service Provider, the Service Provider agrees to assign and transfer to the Customer all rights, title, and interest in and to the Software, including all intellectual property rights, free and clear of any encumbrances.

6. Termination. Either party may terminate this Agreement upon thirty (30) days’ written notice to the other party. In the event of termination, the Customer will pay the Service Provider for all services rendered and expenses incurred up to the date of termination. If the Customer terminates the Agreement without cause, the Customer will pay the Service Provider for any committed and non-cancelable costs incurred by the Service Provider plus a $1,000,000 fee.

7. Confidentiality. Both parties agree to keep confidential all Confidential Information disclosed by the other party during the term of this Agreement. Confidential Information does not include information that is publicly known through no fault of the receiving party, was in the receiving party's possession before receipt from the disclosing party, or was independently developed by the receiving party without use of the disclosing party's Confidential Information.
"""
rprint(extract_strings(content, fields_to_extract))