### Expert Structured Output (Using Kor)

[Kor Library](https://eyurtsev.github.io/kor/nested_objects.html)
[Eugene Twitter](https://twitter.com/veryboldbagel)

For complicated data extraction you need a robust library. Kor is an awesome tool just for this.

We are going to explore using Kor with a practical use case.

**Why is this important?**
LLMs are great at text output, but what if you want them to give you structured information that you won't have to parse yourself? Getting a dictionary back makes this easy.

Spoiler: Jump down to the bottom to see a bona fide business idea that you can start and manage today.

In [97]:
import pandas as pd
import requests
import time
import json
from datetime import datetime
from langchain.chat_models import ChatOpenAI
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain.llms import OpenAI
from bs4 import BeautifulSoup

In [198]:
# It's better to load this from an environment variable, but it's in plain text here for clarity
openai_api_key = 'your_api_key'

In [252]:
openai_api_key = '...'

In [201]:
llm = ChatOpenAI(
#     model_name="gpt-3.5-turbo", # Cheaper but less reliable
    model_name="gpt-4",
    temperature=0,
    max_tokens=2000,
    openai_api_key=openai_api_key
)

### Kor Hello World Example

Create an object that holds information about the fields you'd like to extract

In [202]:
person_schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

Create a chain that will extract the information and then parse it. This uses LangChain under the hood.

In [203]:
chain = create_extraction_chain(llm, person_schema)

In [204]:
text = "My name is Bobby. My sister's name is Rachel. My brother's name Joe."
chain.predict_and_parse(text=(text))["data"]

{'person': [{'first_name': 'Bobby'},
  {'first_name': 'Rachel'},
  {'first_name': 'Joe'}]}

Kor also returns an empty list, rather than raising an error, when the LLM doesn't find what you're looking for.

In [205]:
chain.predict_and_parse(text=("The dog went to the park"))["data"]

{'person': []}
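
Because a miss comes back as an empty list instead of an exception, downstream code can handle both cases the same way. Here is a minimal sketch that uses hard-coded dictionaries shaped like the two outputs above rather than live LLM calls. The `first_names` helper is hypothetical and not part of Kor.

```python
def first_names(data):
    """Return the extracted first names, or an empty list if nothing was found."""
    # .get() also covers the case where the "person" key is absent altogether
    return [p["first_name"] for p in data.get("person", []) if "first_name" in p]

# Stand-ins for the two results shown above
found = {"person": [{"first_name": "Bobby"}, {"first_name": "Rachel"}, {"first_name": "Joe"}]}
missing = {"person": []}

print(first_names(found))
print(first_names(missing))
```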

### Multiple Fields
You can pass multiple fields if you're looking for more information

In [206]:
plant_schema = Object(
    id="plant",
    description="Information about a plant",
    examples=[
        (
            "Roses are red, lilies are white and a 8 out of 10.",
            [
                {"plant_type": "Roses", "color": "red"},
                {"plant_type": "Lily", "color": "white", "rating" : 8},
            ],
        )
    ],
    attributes=[
        Text(
            id="plant_type",
            description="The common name of the plant."
        ),
        Text(
            id="color",
            description="The color of the plant"
        ),
        Number(
            id="rating",
            description="The rating of the plant."
        )
    ],
    many=True,
)

In [207]:
text="Palm trees are brown and a 6 rating"

chain = create_extraction_chain(llm, plant_schema)
chain.predict_and_parse(text=text)['data']

{'plant': [{'plant_type': 'Palm tree', 'color': 'brown', 'rating': '6.0'}]}
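
Note that `rating` came back as the string `'6.0'` even though the schema declares it as a `Number`. If downstream code needs real numbers, one option is to convert the field after extraction. This is a minimal sketch on a hard-coded copy of the output above; `coerce_numbers` is a hypothetical helper, not a Kor function.

```python
def coerce_numbers(records, fields):
    """Return copies of the records with the named fields converted to float where possible."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the original result is left untouched
        for field in fields:
            try:
                record[field] = float(record[field])
            except (KeyError, TypeError, ValueError):
                # Field is missing or not numeric: leave the record as it was
                pass
        cleaned.append(record)
    return cleaned

# Hard-coded copy of the extraction result shown above
data = {"plant": [{"plant_type": "Palm tree", "color": "brown", "rating": "6.0"}]}
plants = coerce_numbers(data["plant"], fields=["rating"])
print(plants)
```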

### Working With Lists

You can also extract lists.

Note: This example uses a nested object. The `parts` object is passed as an attribute of `cars_schema`.

In [208]:
parts = Object(
    id="parts",
    description="A single part of a car",
    attributes=[
        Text(id="part", description="The name of the part")
    ],
    examples=[
        (
            "the jeep has wheels and windows",
            [
                {"part": "wheel"},
                {"part": "window"}
            ],
        )
    ],
    many=True,  # <-- PLEASE NOTE THIS CHANGE
)

cars_schema = Object(
    id="car",
    description="Information about a car",
    examples=[
        (
            "the bmw is red and has an engine and steering wheel",
            [
                {"type": "BMW", "color": "red", "parts" : [{"part": "engine"}, {"part": "steering wheel"}]}
            ],
        )
    ],
    attributes=[
        Text(
            id="type",
            description="The make or brand of the car"
        ),
        Text(
            id="color",
            description="The color of the car"
        ),
        parts
    ],
    many=True,
)

In [210]:
# To do nested objects you need to specify encoder_or_encoder_class="json"
text = "The blue jeep has wheels, windows, rims"

chain = create_extraction_chain(llm, cars_schema, encoder_or_encoder_class="json")
chain.predict_and_parse(text=text)['data']

{'car': [{'type': 'jeep',
   'color': 'blue',
   'parts': [{'part': 'wheel'}, {'part': 'window'}, {'part': 'rim'}]}]}
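
Nested results like this are often easier to store or analyze once they are flattened into one row per part. A minimal sketch on a hard-coded copy of the output above; `flatten_parts` is a hypothetical helper and not something Kor provides.

```python
def flatten_parts(data):
    """Turn nested car records into one flat row per (car, part) pair."""
    rows = []
    for car in data.get("car", []):
        for part in car.get("parts", []):
            rows.append({
                "type": car.get("type"),
                "color": car.get("color"),
                "part": part.get("part"),
            })
    return rows

# Hard-coded copy of the extraction result shown above
data = {"car": [{"type": "jeep", "color": "blue",
                 "parts": [{"part": "wheel"}, {"part": "window"}, {"part": "rim"}]}]}
rows = flatten_parts(data)
for row in rows:
    print(row)
```

Since `rows` is a list of dictionaries, it can be handed to `pd.DataFrame(rows)` if you want a table.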

View the prompt that was sent over

In [211]:
prompt = chain.prompt.format_prompt(text=text).to_string()

print(prompt)

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

car: Array<{ // Information about a car
 type: string // The make or brand of the car
 color: string // The color of the car
 parts: Array<{ // A single part of a car
  part: string // The name of the part
 }>
}>
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.

Input: the bmw is red and has an engine and steering wheel
Output: <json>{"car": [{"type": "BMW", "colo

## Opening Attributes - Real World Example

[Opening Attributes](https://twitter.com/GregKamradt/status/1643027796850253824) (my sample project for this application)

If anyone wants to strategize on this project, DM me on Twitter.

In [212]:
llm = ChatOpenAI(
    # model_name="gpt-3.5-turbo",
    model_name="gpt-4",
    temperature=0,
    max_tokens=2000,
    openai_api_key=openai_api_key
)

We are going to be pulling jobs from Greenhouse. No API key is needed.

In [215]:
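def pull_from_greenhouse(board_token):
    # If doing this in production, make sure you do retries and backoffs

    # Get your URL ready to accept a parameter
    url = f'https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true'

    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException as e:
        # Network-level failure (DNS, timeout, refused connection, ...)
        print (f"Whoops, error: {e}")
        return []

    status_code = response.status_code

    if status_code != 200:
        # For example an unknown board token: there is no 'jobs' key to read
        print (f"{board_token}: {status_code}, no jobs returned")
        return []

    jobs = response.json()['jobs']

    print (f"{board_token}: {status_code}, Found {len(jobs)} jobs")

    return jobs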
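
The comment in the function mentions retries and backoffs for production use. One way that could look is a small wrapper that retries a callable with exponentially growing waits. This is only a sketch: `with_retries`, `flaky_fetch`, and the delay values are made up for illustration, and the demo uses a stand-in function and a fake sleep so it runs without any network access.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting base_delay * 2**n between failures."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo: a stand-in fetch that fails twice before succeeding
calls = {"count": 0}
waits = []  # records the requested delays instead of actually sleeping

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["job-a", "job-b"]

result = with_retries(flaky_fetch, attempts=4, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook you would wrap the HTTP call itself, for example `with_retries(lambda: requests.get(url, timeout=30))`. Keep in mind that `requests.get` only raises on network errors, so HTTP error statuses would need `response.raise_for_status()` inside the callable to trigger a retry.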

Let's try it out for [Okta](https://www.okta.com/)

In [216]:
jobs = pull_from_greenhouse("okta")

okta: 200, Found 143 jobs


Let's look at a sample job and its raw dictionary.

In [122]:
# Here job_id is just a list index. The job at this position will likely differ when you run this,
# depending on the company's current postings
job_id = 1

In [114]:
print ("Preview:\n")
print (json.dumps(jobs[job_id])[:400])

Preview:

{"absolute_url": "https://www.okta.com/company/careers/opportunity/4858786?gh_jid=4858786", "data_compliance": [{"type": "gdpr", "requires_consent": false, "requires_processing_consent": false, "requires_retention_consent": false, "retention_period": null}], "education": "education_optional", "internal_job_id": 2474271, "location": {"name": "United States"}, "metadata": null, "id": 4858786, "updat


Let's clean this up a bit

In [217]:
# I parsed through an output to create the function below
def describeJob(job_description):
    # Format the date by hand: strftime's '%-d' is not available on Windows
    updated_at = datetime.fromisoformat(job_description['updated_at'])
    print(f"Job ID: {job_description['id']}")
    print(f"Link: {job_description['absolute_url']}")
    print(f"Updated At: {updated_at:%B} {updated_at.day}, {updated_at:%Y}")
    print(f"Title: {job_description['title']}\n")
    print(f"Content:\n{job_description['content'][:550]}")

In [240]:
# Note: I'm using a hard coded job id below. You'll need to switch this if this job ever changes
# and it most definitely will!
job_id = 4982726

job_description = [item for item in jobs if item['id'] == job_id][0]
    
describeJob(job_description)

Job ID: 4982726
Link: https://www.okta.com/company/careers/opportunity/4982726?gh_jid=4982726
Updated At: April 10, 2023
Title: Staff Software Engineer 

Content:
&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;strong&gt;Get to know Okta&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;br&gt;&lt;/span&gt;Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at t


I want to convert the HTML to text, so we'll use BeautifulSoup. There are multiple methods you could choose from. Pick what's best for you.

In [241]:
soup = BeautifulSoup(job_description['content'], 'html.parser')

In [242]:
text = soup.get_text()
print (text[:600])

<div class="content-intro"><p><span style="color: #000000;"><strong>Get to know Okta</strong></span></p>
<p><span style="color: #000000;"><br></span>Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.&nbsp;<br><br>At Okta, we celebrate a variety of perspectives and experiences. We are not looking for som
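
Notice that the printed text still contains tags such as `<div>` and `<p>`. The `content` field appears to be entity-escaped HTML (`&lt;div ...&gt;`), so a single parsing pass only decodes the entities and leaves the markup behind. One possible fix is to unescape first and strip tags second. Below is a minimal standard-library sketch on a short made-up sample; `strip_escaped_html` and `_TextCollector` are hypothetical names.

```python
import html
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_escaped_html(content):
    """Decode entity-escaped HTML, then drop the tags and tidy whitespace."""
    collector = _TextCollector()
    collector.feed(html.unescape(content))  # "&lt;p&gt;" becomes "<p>" before parsing
    collector.close()
    return " ".join("".join(collector.chunks).split())

# Made-up sample in the same escaped style as the job content above
sample = "&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;strong&gt;Get to know Okta&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;"
print(strip_escaped_html(sample))
```

With BeautifulSoup, the same idea would be `BeautifulSoup(html.unescape(job_description['content']), 'html.parser').get_text()`. Less leftover markup also means fewer tokens in the prompt.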


Let's create a Kor object that will look for tools. This is the meat and potatoes of the application

In [243]:
tools = Object(
    id="tools",
    description="""
        A tool, application, or other company that is listed in a job description.
        Analytics, eCommerce and GTM are not tools
    """,
    attributes=[
        Text(
            id="tool",
            description="The name of a tool or company"
        )
    ],
    examples=[
        (
            "Experience in working with Netsuite, SQL, or Looker a plus.",
            [
                {"tool": "Netsuite"},
                {"tool": "Looker"},
            ],
        ),
        (
           "Experience with Microsoft Excel",
            [
               {"tool": "Microsoft Excel"}
            ] 
        ),
        (
           "You must know AWS to do well in the job",
            [
               {"tool": "AWS"}
            ] 
        ),
        (
           "Troubleshooting customer issues and debugging from logs (Splunk, Syslogs, etc.) ",
            [
               {"tool": "Splunk"},
            ] 
        )
    ],
    many=True,
)

In [244]:
chain = create_extraction_chain(llm, tools)

In [245]:
chain.predict_and_parse(text=text)["data"]

{'tools': [{'tool': 'Java'},
  {'tool': 'Hibernate'},
  {'tool': 'Spring Boot'},
  {'tool': 'SQL'},
  {'tool': 'ElasticSearch'},
  {'tool': 'Docker'},
  {'tool': 'Kubernetes'},
  {'tool': 'AWS'},
  {'tool': 'GCP'}]}
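
Once you run this over many postings, you will probably want to aggregate the results. A minimal sketch of counting how many postings mention each tool, assuming each result has the same shape as the output above. The sample results are made up, and `count_tools` is a hypothetical helper.

```python
from collections import Counter

def count_tools(results):
    """Count how many postings mention each tool (case-insensitive, once per posting)."""
    counts = Counter()
    for data in results:
        # A set ensures a tool repeated within one posting is only counted once
        seen = {t["tool"].strip().lower() for t in data.get("tools", []) if t.get("tool")}
        counts.update(seen)
    return counts

# Made-up results standing in for three separate job postings
results = [
    {"tools": [{"tool": "Java"}, {"tool": "AWS"}, {"tool": "Docker"}]},
    {"tools": [{"tool": "aws"}, {"tool": "Kubernetes"}, {"tool": "AWS"}]},
    {"tools": []},
]
counts = count_tools(results)
print(counts.most_common(1))
```

Lowercasing is a crude form of normalization. Real data would likely need more, such as mapping "Amazon Web Services" to "AWS".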

### Salary

Let's grab salary information while we are at it.

Not all jobs will list this information. When they do, the format is rarely consistent across postings, which makes this a great use case for an LLM.

In [246]:
salary_range = Object(
    id="salary_range",
    description="""
        The range of salary offered for a job mentioned in a job description
    """,
    attributes=[
        Number(
            id="low_end",
            description="The low end of a salary range"
        ),
        Number(
            id="high_end",
            description="The high end of a salary range"
        )
    ],
    examples=[
        (
            "This position will make between $140 thousand and $230,000.00",
            [
                {"low_end": 140000, "high_end": 230000},
            ]
        )
    ]
)

In [247]:
jobs = pull_from_greenhouse("cruise")

cruise: 200, Found 229 jobs


In [248]:
job_id = 4858414

job_description = [item for item in jobs if item['id'] == job_id][0]
    
describeJob(job_description)

soup = BeautifulSoup(job_description['content'], 'html.parser')
text = soup.get_text()

Job ID: 4858414
Link: https://boards.greenhouse.io/cruise/jobs/4858414?gh_jid=4858414
Updated At: April 10, 2023
Title: Senior Data Center Technician

Content:
&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We&#39;re Cruise, a self-driving service designed for the cities we love.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We’re building the world’s most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.&lt;/span&gt;


In [249]:
r = create_extraction_chain(llm, salary_range)
r.predict_and_parse(text=text)["data"]

{'salary_range': [{'low_end': '112300', 'high_end': '165000'}]}

> The salary range for this position is $112,300 - 165,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus, restricted stock units, and benefits. These ranges are subject to change.

Awesome!

[OpenAI GPT4 Pricing](https://help.openai.com/en/articles/7127956-how-much-does-gpt-4-cost)

In [250]:
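# Note: `chain` is still the tools chain from earlier (not the salary chain `r`),
# so this prints the tools prompt filled in with the Cruise job text
prompt = chain.prompt.format_prompt(text=text).to_string()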
print (prompt)

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

tools: Array<{ // 
        A tool, application, or other company that is listed in a job description.
        Analytics, eCommerce and GTM are not tools
    
 tool: string // The name of a tool or company
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.

Input: Experience in working with Netsuite, SQL, or Looker a plus.
Output: tool
Netsuite
Looker

Input: Experience with Microsoft Excel
Output: tool
Microsoft Excel

Input: You must know AWS to do well in the job
Output: tool
AWS

Input: Troubleshootin

In [251]:
num_tokens = llm.get_num_tokens(prompt)

# GPT-4 prompt-token pricing as of 2023-04-11 (completion tokens are billed separately)
gpt4_pricing_per_1k_tokens = .03

cost = (num_tokens / 1000) * gpt4_pricing_per_1k_tokens

print (f"Running this prompt will cost: ${cost:.2f}")

Running this prompt will cost: $0.09
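
The figure above covers prompt tokens only. For a fuller estimate you can add the completion side as well. The sketch below assumes the 8K-context GPT-4 prices published at the time (\$0.03 per 1K prompt tokens and \$0.06 per 1K completion tokens). The token counts are hypothetical, and `estimate_cost` is a made-up helper.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_price_per_1k=0.03, completion_price_per_1k=0.06):
    """Estimate the dollar cost of one call from token counts and per-1K-token prices."""
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost

# Hypothetical token counts, for illustration only
cost = estimate_cost(prompt_tokens=3000, completion_tokens=200)
print(f"Estimated cost: ${cost:.3f}")
```

In the notebook, `llm.get_num_tokens(prompt)` would supply `prompt_tokens`, and the `max_tokens=2000` setting on the model gives an upper bound for `completion_tokens`.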


To do:

* Reduce amount of HTML and low-signal text that gets put into the prompt
* Gather list of 1000s of companies
* Run through most jobs (You'll likely start to see duplicate information after the first 10-15 jobs per department)
* Store results
* Snapshot daily as you look for new jobs
* Follow [Greg](https://twitter.com/GregKamradt) on Twitter for more tools or if you want to chat about this project
* Read the user feedback below for what else to build out with this project (I reached out to everyone who signed up on twitter)
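
For the daily snapshot item, one simple approach is to store the job IDs from each pull and compare them with the previous day's set. A minimal sketch with made-up IDs; `diff_snapshots` is a hypothetical helper, and persisting the snapshots (file, database) is left out.

```python
def diff_snapshots(previous_ids, current_ids):
    """Compare two snapshots of job IDs and report what was added and removed."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }

# Made-up job IDs standing in for two daily pulls of the same board
yesterday = [4858786, 4982726, 4858414]
today = [4982726, 4858414, 5000001]

changes = diff_snapshots(yesterday, today)
print(changes)
```

With the Greenhouse data above, the IDs for a snapshot could come from `[job['id'] for job in jobs]`. Only the jobs in `added` would then need to go through the LLM, which keeps the cost down.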


### Business idea: Job Data As A Service

Start a data service that collects information about companies' job postings. This can be sold to investors looking for an edge.

After I posted [this tweet](https://twitter.com/GregKamradt/status/1643027796850253824), 80 people signed up for the trial. I emailed all of them, and most were job seekers looking for companies that used the tech they specialized in.

The more interesting use cases were sales teams and investors.

Interesting Investor User Feedback:

> Hey Gregory, thanks for reaching out.
>
> I always thought that job posts were a gold mine of information, and often suggest identifying targets based on these (go look at relevant job posts for companies that might want to work with you). Secondly, I also automatically ping BuiltWith from our CRM and send that to OpenAI and have a summarized tech stack created - so I see the benefit of having this as an investor.
>
> For me personally, I like to get as much data as possible about a company. Would love to see job post cadence, type of jobs they post and when, notable keywords/phrases used, tech stack (which you have), and any other information we can glean from the job posts (sometimes they have the title of who you'll report to, etc.).
>
> For sales people, I think finer searches, maybe even in natural language if possible - such as "search for companies who posted a data science related job for the first time" - would be powerful.

If you do this, let me know! I'd love to hear how it goes.