## Summary

This notebook shows how to use Instructor to extract structured information from unstructured text. The twist is that the list of entities to extract is specified at runtime rather than hard-coded in the model. Because that list is plain data, the same function could be exposed as an API.
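
To make "specified at runtime" concrete, here is a minimal sketch of how a field list might arrive as JSON and be turned into the `(name, description)` pairs used later in this notebook. The JSON layout and the `parse_field_spec` helper are assumptions for illustration only; they are not part of Instructor.

```python
import json
from typing import List, Tuple


def parse_field_spec(payload: str) -> List[Tuple[str, str]]:
    """Turn a JSON array of {"name", "description"} objects into (name, description) pairs."""
    # Hypothetical payload format: one object per field to extract.
    return [(item["name"], item["description"]) for item in json.loads(payload)]


payload = (
    '[{"name": "name", "description": "Name of the user"},'
    ' {"name": "age", "description": "Age of the user"}]'
)
print(parse_field_spec(payload))
# [('name', 'Name of the user'), ('age', 'Age of the user')]
```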

Note: this notebook assumes you're using Google Colab. You can safely edit and experiment here, or go to `File` -> `Save a copy in Google Drive` to make your own version.

In [21]:
%%capture
!pip install instructor groq openai

In the left sidebar, click the key icon and add two secrets containing your API keys. Be sure to enable "Notebook access" for each one. Colab secrets are private to your account, so you're not sharing your keys with anyone.

OPENAI_API_KEY

GROQ_API_KEY

In [22]:
import instructor
import openai
import groq
from pydantic import BaseModel, Field
from typing import Optional, List
import os

try:
    from google.colab import userdata
    os.environ['OPENAI_API_KEY'] = '' or userdata.get('OPENAI_API_KEY') # or put your key in the '' on this line
    os.environ['GROQ_API_KEY'] = '' or userdata.get('GROQ_API_KEY')
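except Exception:
    # Not running in Colab, or a secret is missing or not shared with this notebook:
    # fall back to whatever keys are already set in the environment.
    pass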
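
Because the cell above silently ignores any failure, a quick check that the keys are actually set can save a confusing authentication error later. This is a small sketch; `missing_keys` is a hypothetical helper, not something provided by Colab or Instructor.

```python
import os
from typing import List


def missing_keys(names: List[str]) -> List[str]:
    """Return the names of environment variables that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


missing = missing_keys(["OPENAI_API_KEY", "GROQ_API_KEY"])
if missing:
    print("Missing keys:", ", ".join(missing))
```

Only the key for the provider you select needs to be present, so a missing entry for the other provider is harmless.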


In [25]:
inference_provider = "openai"   # "openai" or "groq"
client = instructor.from_openai(openai.OpenAI()) if inference_provider == "openai" else instructor.from_groq(groq.Groq())

def extract_strings(content: str, attribute: List[tuple], model_notes: str = "") -> dict:
    """Extract the (name, description) fields in `attribute` from `content` and return them as a dict."""
    # Create the annotations and fields dictionaries
    annotations = {attr: Optional[str] for attr, _ in attribute}
    fields = {attr: Field(description=desc) for attr, desc in attribute}

    # Create the ExtractStrings class dynamically with a docstring
    ExtractStrings = type('ExtractStrings', (BaseModel,), {
        '__annotations__': annotations,
        '__doc__': model_notes,
        **fields
    })

    result = client.chat.completions.create(
        model="llama3-70b-8192" if inference_provider == "groq" else "gpt-4o",
        response_model=ExtractStrings,
        temperature=0.0,
        messages=[{"role": "user", "content": content}]
    )
    return result.model_dump()
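
Pydantic also provides a `create_model` helper for building models at runtime. The sketch below is an alternative to the `type(...)` call above, not the notebook's own method, and assumes Pydantic v2 (which the `model_dump()` call already implies). `build_string_model` is a hypothetical name chosen for this example.

```python
from typing import List, Optional, Tuple, Type

from pydantic import BaseModel, Field, create_model


def build_string_model(
    attributes: List[Tuple[str, str]], model_notes: str = ""
) -> Type[BaseModel]:
    """Build a model with one Optional[str] field per (name, description) pair."""
    # Each field is declared as (type, FieldInfo). Field(...) makes the field
    # required but nullable, intended to mirror the type(...) version above.
    field_definitions = {
        name: (Optional[str], Field(..., description=description))
        for name, description in attributes
    }
    model = create_model("ExtractStrings", **field_definitions)
    # Set the docstring after creation, as the original does via '__doc__'.
    model.__doc__ = model_notes
    return model


Model = build_string_model(
    [("name", "Name of the user"), ("email", "Email of the user")],
    "Extract user details.",
)
print(Model(name="Jason", email=None).model_dump())
# {'name': 'Jason', 'email': None}
```

The resulting class could be passed as `response_model` in the same way as `ExtractStrings`.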



## Let it fly!
`extract_strings` could be turned into an API. Note that this code only returns strings and treats every field as optional. You could easily extend it to return more structured data.
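
One way to extend this to typed fields is to let each attribute carry a type name alongside its description. The sketch below only builds and validates the model and makes no LLM call. The three-element tuple format, the `TYPE_NAMES` mapping, and `build_typed_model` are assumptions made for illustration, and Pydantic v2 is assumed.

```python
from typing import Dict, List, Optional, Tuple, Type

from pydantic import BaseModel, Field, ValidationError, create_model

# Hypothetical mapping from type names, as a caller might send them, to Python types.
TYPE_NAMES: Dict[str, type] = {"str": str, "int": int, "float": float, "bool": bool}


def build_typed_model(attributes: List[Tuple[str, str, str]]) -> Type[BaseModel]:
    """Build a model from (name, type_name, description) triples; every field is nullable."""
    field_definitions = {
        name: (Optional[TYPE_NAMES[type_name]], Field(..., description=description))
        for name, type_name, description in attributes
    }
    return create_model("ExtractTyped", **field_definitions)


UserModel = build_typed_model([
    ("name", "str", "Name of the user"),
    ("age", "int", "Age of the user"),
])
print(UserModel(name="Jason", age=28).model_dump())
# {'name': 'Jason', 'age': 28}

# Values that cannot be interpreted as the declared type are rejected.
try:
    UserModel(name="Jason", age="twenty-eight")
except ValidationError:
    print("age must be an integer or null")
```

With a model like this, `age` would come back as an integer rather than the string `'28'`.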

In [26]:
fields_to_extract = [
    ("name", "Name of the user"),
    ("age", "Age of the user"),
    ("email", "Email of the user"),
]

content = "Jason is the user and he's 25 years older than Rick who was born 3 years ago."

print(extract_strings(content, fields_to_extract))


{'name': 'Jason', 'age': '28', 'email': None}


In [40]:
fields_to_extract = [
    ("firstName", "First name or first initial. Initials should include a period after the letter."),
    ("lastName", "Last name or last initial. Initials should include a period after the letter."),
]

notes = """
You will receive a string containing an unformatted name or part of a name (e.g. initials). If there's a comma, that means the last name is first. A middle initial would be ignored because it is neither first nor last name.
"""

content = "Mark St. Anthony"

print(extract_strings(content, fields_to_extract, notes))
print(extract_strings("M. St. Anthony", fields_to_extract, notes))
print(extract_strings("St. Anthony, Mark", fields_to_extract, notes))
print(extract_strings("Daniel Rios-Munoz", fields_to_extract, notes))
print(extract_strings("Rios-Munoz, D", fields_to_extract, notes))
print(extract_strings("Brian J Jeter", fields_to_extract, notes))
print(extract_strings("Brian Jeter", fields_to_extract, notes))
print(extract_strings("Jeter, Brian", fields_to_extract, notes))
print(extract_strings("Jeter, Brian J.", fields_to_extract, notes))




{'firstName': 'Mark', 'lastName': 'St. Anthony'}
{'firstName': 'M.', 'lastName': 'St. Anthony'}
{'firstName': 'Mark', 'lastName': 'St. Anthony'}
{'firstName': 'Daniel', 'lastName': 'Rios-Munoz'}
{'firstName': 'D.', 'lastName': 'Rios-Munoz'}
{'firstName': 'Brian', 'lastName': 'Jeter'}
{'firstName': 'Brian', 'lastName': 'Jeter'}
{'firstName': 'Brian', 'lastName': 'Jeter'}
{'firstName': 'Brian', 'lastName': 'Jeter'}
