In [None]:
!wget --quiet -O items.zip https://github.com/jsoma/nicar24-beyond-chatgpt/raw/main/items.zip
!unzip -o -q items.zip

Before we dive into the API, let's check out the [OpenAI model playground](https://platform.openai.com/playground?mode=complete).

## Working with the API

The [chat interface](https://chat.openai.com/) is just one method for interacting with GPT. Another option is to write Python code that skips the website! This is called an **API**. Technically it stands for *Application Programming Interface*, but no one actually remembers that: we can just think of it as a way for two computers to talk directly to each other.

To use the OpenAI GPT API with Python, we're going to need to install the [openai Python package](https://github.com/openai/openai-python).

In [None]:
!pip install --quiet openai

## Asking questions

You use the Python interface just like the chat one: you start from a series of messages and ask the API what's next.

In [None]:
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # replace with your own secret key

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": "What is the ratio of water to vinegar used when making pickles?",
    }
]

chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo'
)

chat_completion.choices[0].message.content

The code above includes the **system prompt**, a set of guidelines for GPT that sets the tone for the conversation. [You can see the complete ChatGPT one here](https://pastebin.com/vnxJ7kQk) (or rather, one of them), but we're just going to use the traditional "You are a helpful assistant."
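
To make the role of the system prompt a little more concrete, here's a small sketch of a helper that builds the message list. The `build_messages` function and the "terse" system prompt are made up for illustration (they aren't part of the `openai` package); the only thing that changes between the two conversations is the system message.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Return a list of messages in the shape chat.completions.create expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

question = "What is the ratio of water to vinegar used when making pickles?"

# Same question, two different system prompts
default_messages = build_messages(question)
terse_messages = build_messages(
    question,
    system_prompt="You are a terse assistant. Answer in one sentence.",  # hypothetical
)

for message in terse_messages:
    print(message["role"], "->", message["content"])
```

Either list can be passed as `messages=` to `client.chat.completions.create`, exactly like the hand-written one.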

## Model selection

People who use ChatGPT for free get access to GPT-3.5, while those who pay gain access to GPT-4. GPT-4 can deal with more text at once (a report! a book!) and just generally gives better answers. These are called **models**, and they're the tech that lives behind the interface. You can find more about [the available OpenAI models here](https://platform.openai.com/docs/models).

In [None]:
chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-4'
)

chat_completion.choices[0].message.content

Different companies each produce different, competing models: Anthropic has made [Claude](https://claude.ai/), Google has [Bard/Gemini](https://gemini.google.com/), Facebook/Meta made [LLaMA](https://www.llama2.ai/), Mistral has [Mixtral](https://mixtral.replicate.dev/).

People have models [fight it out](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) on leaderboards, and at the time of writing GPT-4 sits at the top.

For many of these models, you're sending information off into the cloud every time you make a request or ask a question. Of the models listed above, LLaMA and Mixtral are the ones you can download and run on your own machine! Notice how the [leaderboards](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) start with famous ones that I've listed from big companies (Mixtral, GPT, Gemini) but then descend into things you've never heard of, like Vicuna, Yi, and WizardLM.

Check the "License" column: **the models you can download and run yourself are the ones with a non-proprietary license.**

Most concepts can be transferred from model to model, so just know that even if we're working with OpenAI's GPT in the examples below, we could just as well be working with another one (and maybe we'll give one a try later on!).

## Temperature

When you're using Bing's AI assistant, it has three modes: Creative, Balanced or Precise. They're described like so:

> - Balanced: best for the most common tasks, like search, maximum speed.
> - Creative: whenever you need to generate new content, longer output, more expressive, slower.
> - Precise: most factual, minimizing conjectures.

You can imagine the system prompt - "You are a helpful assistant." - being changed for each of these behind the scenes.

Another change that can be made is **temperature**, which is how "crazy" you let your model get. [This Financial Times piece](https://ig.ft.com/generative-ai/) does a good job explaining how it's just a simple matter of statistics: based on the text it's seen so far, what's the most likely next word?

The `temperature` setting allows you to use less likely words instead of the most likely next one. Even though it isn't the same as Creative mode, I like to think of them as being similar. Increasing the temperature makes the text more unpredictable, and potentially more creative!

By default the temperature in the OpenAI API is 1.0, which allows a fair amount of creativity. If you lower the temperature to 0.0, the same prompt will almost always produce the same result! The maximum is 2.0, which can get some pretty wild results [in the playground](https://platform.openai.com/playground).
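
If you'd like to see the statistics in action, here's a toy sketch of how temperature reshapes the odds of the next word. The scores are invented, real models juggle tens of thousands of possible tokens, and treating temperature 0 as "always take the top word" is a common convention that I'm assuming here rather than something OpenAI documents in detail.

```python
import math

def apply_temperature(scores, temperature):
    """Turn raw word scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means "always pick the top-scoring word"
        top = max(scores, key=scores.get)
        return {word: 1.0 if word == top else 0.0 for word in scores}
    scaled = {word: score / temperature for word, score in scores.items()}
    biggest = max(scaled.values())  # subtract the max to keep exp() well-behaved
    exps = {word: math.exp(value - biggest) for word, value in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

# Made-up scores for the word that follows "Once upon a"
toy_scores = {"time": 5.0, "midnight": 3.0, "mattress": 1.0}

for temperature in [0.0, 0.7, 2.0]:
    probabilities = apply_temperature(toy_scores, temperature)
    print(temperature, {word: round(p, 3) for word, p in probabilities.items()})
```

At low temperatures nearly all of the probability sits on "time," while at 2.0 the unlikely "mattress" gets a real chance of being picked.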

In [None]:
messages = [
    {
        "role": "user",
        "content": "Tell me a short story in one paragraph.",
    }
]


chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo',
    temperature=0.0
)

chat_completion.choices[0].message.content

What if we try the same prompt again, with the same `0.0` temperature?

In [None]:
messages = [
    {
        "role": "user",
        "content": "Tell me a short story in one paragraph.",
    }
]


chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo',
    temperature=0.0
)

chat_completion.choices[0].message.content

If we're tired of being read the same bedtime story every single night, we can increase the temperature to allow GPT to pick less likely words each time it steps forward.

In [None]:
messages = [
    {
        "role": "user",
        "content": "Tell me a short story in one paragraph.",
    }
]


chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo',
    temperature=0.2
)

chat_completion.choices[0].message.content

Notice how the further into the text it gets, the further it drifts from the original temperature `0.0` version! This is because each less-likely word choice changes what's likely to come next, so small differences compound as the text goes on and push the story down completely different paths.

And if we want to get something completely different right out of the gate? Let's maximize the temperature!

In [None]:
messages = [
    {
        "role": "user",
        "content": "Tell me a short story in one paragraph.",
    }
]


chat_completion = client.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo',
    temperature=2.0
)

chat_completion.choices[0].message.content

## Automatic categorization

One of my favorite uses for the API is to **put things in categories**, which goes by the technical term *classification*. Later we'll look at how to do this with tools that aren't GPT, but let's give it a try for now.

In [None]:
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # replace with your own secret key

prompt = """
Categorize the following legislative bill as ENVIRONMENT, HEALTHCARE, IMMIGRATION, TAXES/FINES, or OTHER. 

Bill text: 

The "Celestial Reforestation Act" proposes the ambitious endeavor of transporting arboreal 
specimens beyond Earth's confines. Acknowledging the pivotal role of trees in sustaining 
ecosystems and combating ecological degradation, this bill outlines a strategic roadmap 
for the selection, launch, and maintenance of arboreal lifeforms into outer space. Through 
collaboration with aerospace entities and research institutions, this legislation aims to 
establish extraterrestrial arboreal habitats, facilitating scientific exploration and the 
expansion of green spaces beyond planetary boundaries. By fostering innovation in space biology
and exploration, this act seeks to pioneer sustainable solutions for global challenges while 
advancing humanity's reach into the cosmos.
"""

messages = [
    { "role": "system", "content": "You are a helpful assistant."},
    { "role": "user", "content": prompt}
]

chat_completion = client.chat.completions.create(
    messages=messages,
    model="gpt-3.5-turbo",
    temperature=0
)

chat_completion.choices[0].message.content

The original response says "This legislative bill would be categorized as ENVIRONMENT," which is *not* okay. I want it to just say the category name!

Head on over to [ChatGPT itself](https://chat.openai.com) to engineer a good prompt. Re-run the cell above with your new prompt until you feel happy with the result.
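
A better prompt is the real fix, but as a safety net you can also clean up chatty answers in code. This is just a sketch: `extract_category` is a made-up helper, and it only returns a category when exactly one of them shows up in the response.

```python
import re

CATEGORIES = ["ENVIRONMENT", "HEALTHCARE", "IMMIGRATION", "TAXES/FINES", "OTHER"]

def extract_category(response):
    """Return the single category named in the response, or None if it's unclear."""
    text = response.upper()
    found = [
        category
        for category in CATEGORIES
        # The lookarounds stop OTHER from matching inside a word like "another"
        if re.search(r"(?<![A-Z])" + re.escape(category) + r"(?![A-Z])", text)
    ]
    return found[0] if len(found) == 1 else None

print(extract_category("This legislative bill would be categorized as ENVIRONMENT."))
print(extract_category("It could be HEALTHCARE or OTHER."))
```

When the helper returns `None`, that's a sign the response needs a human look instead of a guess.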

## Bulk processing

Oftentimes you end up with a looooong spreadsheet or database of things that need to be categorized. But whether we're technical or not, it's easy to tackle!

### Python/pandas

> If you're not a Python person: it's fine! Just think about this in terms of concepts. You just want to be familiar with the idea of an **API key** and a **prompt**.

Let's say we have a dataset that looks like this:

In [None]:
import pandas as pd

df = pd.DataFrame({
    'title': [
        'Trees in space',
        'Taxes on people who are in outer space',
        'Medical expenses for aliens',
        'Pinecones orbiting the planet'
    ]
})
df

We want to add a new column to this, called `llm_category`. To run our request for every row in a pandas dataframe, we make two adjustments to the code above:

We build a template for our prompt, which now has a placeholder of `{text}` where our bill details will go:

```python
prompt_template = """
Categorize the following legislative bill as ENVIRONMENT, HEALTHCARE, IMMIGRATION, TAXES/FINES, 
or OTHER. Only respond with the category name.

Bill title: {text}
"""
```

We also write a function called `llm_request`, which receives a single row of data and uses it to complete the template:

```python
prompt = prompt_template.format(text=row['title'])
```

That prompt is then sent to the LLM and the result returned.

In [None]:
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # replace with your own secret key

prompt_template = """
Categorize the following legislative bill as ENVIRONMENT, HEALTHCARE, IMMIGRATION, TAXES/FINES, 
or OTHER. Only respond with the category name.

Bill title: {text}
"""

def llm_request(row):
    prompt = prompt_template.format(text=row['title'])
    
    messages = [
        { "role": "system", "content": "You are a legislative assistant."},
        { "role": "user", "content": prompt}
    ]

    chat_completion = client.chat.completions.create(
        messages=messages,
        model="gpt-3.5-turbo",
        temperature=0
    )

    return chat_completion.choices[0].message.content

We can try it out with the first row of our data.

In [None]:
first_row = df.loc[0]

print(first_row)
llm_request(first_row)

Or with a made-up row, just to be able to experiment a little more freely.

In [None]:
llm_request({'title': 'The bill to let people from Mars move to planet Earth'})

If we're happy with how it works in small doses, we can move on to using it with every row. We're going to use the Python library [tqdm](https://github.com/tqdm/tqdm) to get some nice progress bars while it chugs along.

In [None]:
!pip install --quiet tqdm

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
df['llm_category'] = df.progress_apply(llm_request, axis=1)
df

It's magic!
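
One caveat for long spreadsheets: if a single request raises an error (a timeout, say), the whole `progress_apply` run stops partway through. Here's a hedged sketch of one way around that. The `with_retries` wrapper and the `flaky_request` stand-in are invented for this example, and the retry counts and wait times are arbitrary.

```python
import time

def with_retries(request_fn, max_attempts=3, wait_seconds=2.0):
    """Wrap request_fn so a row that keeps failing returns None instead of raising."""
    def wrapped(row):
        for attempt in range(1, max_attempts + 1):
            try:
                return request_fn(row)
            except Exception as error:
                print(f"Attempt {attempt} failed: {error}")
                if attempt < max_attempts:
                    time.sleep(wait_seconds * attempt)  # wait a little longer each time
        return None  # give up; the empty cell can be retried later
    return wrapped

# Stand-in for llm_request that fails once, then succeeds (demonstration only)
calls = {"count": 0}

def flaky_request(row):
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("pretend the API timed out")
    return "ENVIRONMENT"

safe_request = with_retries(flaky_request, wait_seconds=0)
print(safe_request({"title": "Trees in space"}))
```

With the real function, that would look like `df.progress_apply(with_retries(llm_request), axis=1)`.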

### Google sheets

I personally use [GPT for work](https://gptforwork.com/). I'll show you a demo, but I don't think we can install things on your Google account without getting a little too crazy.

## Checking the results

We can't just say, "oh wow, computers seem cool, let's just trust whatever it says!" Our job as journalists who request trust from our audience is to **actually test the results.**

We can do this one of two ways:

1. Run the classifier over all of our bills, then take a sample of the results to hand-label and compare with the LLM's judgment.
2. Have a small hand-labeled test dataset that we use to verify the LLM's results before moving on to classify everything.

The second path is usually the best since it allows you to tweak your results before running your prompt against *everything*. The first path is useful if you have a workflow already and need to check whether it's [still working](https://www.reddit.com/r/ChatGPT/comments/182ubh7/chatgpt_has_become_unusably_lazy/) or has [shifted unexpectedly](https://arxiv.org/abs/2307.09009).

In [None]:
df = pd.read_csv("labeled-bills.csv")
df.head()

Let's take our dataset of human-labeled content and see what the AI thinks about it.

In [None]:
df['llm_category'] = df.progress_apply(llm_request, axis=1)
df.head()

How do the two compare? We'll use `crosstab` to do it here, but you can use a pivot table if you're in Excel. It looks complicated, but it's just matching up the `human_label` and `llm_category` columns and seeing how often they match.

In [None]:
pd.crosstab(df.human_label, df.llm_category, margins=True)

In this case, environmental bills often got confused for bills about taxes/fines. Healthcare came out with a 100% score.

> I said "in this case," but the answers actually change if you re-run the LLM category assignment! Even with a small 88-row dataset the LLM will change its mind on some of them.
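
If you want a number to go with the table, you can work out how often the human and the LLM agree, overall and per category. The labels below are made up for illustration (they are not the workshop's results), and `agreement_by_category` is just a name I picked.

```python
from collections import Counter

# Invented labels; in the notebook these would be the
# human_label and llm_category columns
human_labels = ["ENVIRONMENT", "ENVIRONMENT", "HEALTHCARE", "TAXES/FINES", "OTHER"]
llm_labels = ["ENVIRONMENT", "TAXES/FINES", "HEALTHCARE", "TAXES/FINES", "OTHER"]

def agreement_by_category(human, llm):
    """For each human-assigned category, the share of rows where the LLM agreed."""
    totals = Counter(human)
    matches = Counter(h for h, l in zip(human, llm) if h == l)
    return {category: matches[category] / totals[category] for category in totals}

overall = sum(h == l for h, l in zip(human_labels, llm_labels)) / len(human_labels)
per_category = agreement_by_category(human_labels, llm_labels)

print("Overall agreement:", overall)
for category, share in per_category.items():
    print(category, share)
```

In pandas, the overall figure is a one-liner: `(df.human_label == df.llm_category).mean()`.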

If you were going to do this in Google Sheets, it would be the same thing! You'd just add a `human_label` column, then do a pivot table to see if they match. I [wrote a little Apps Script](https://gist.github.com/jsoma/06783a9e759003e2e69389d677f83c0f) that adds a helper for you.

## Guardrails

Honestly, it did a remarkably good job at returning the results we're looking for! In other situations it's been a little more unpredictable for me. One way to force the AI to return what you're looking for is using a tool like [Guardrails](https://github.com/guardrails-ai/guardrails) to validate responses.

> If you aren't technical: using ChatGPT is fine and good because as a human it's easy to understand the response. The difference between getting "YES, a donut" as a response and "Yes - a donut" is meaningless to a person, but they're very different things when you're a computer! Trying to automatically parse responses when the LLM can decide to go rogue can be a real pain in the neck.
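
To see the donut problem from the computer's point of view, here's a tiny sketch. The `normalize` helper is made up for this example, and it only papers over capitalization and punctuation, which is a long way from the full validation a tool like Guardrails aims for.

```python
import re

def normalize(text):
    """Lowercase, swap punctuation for spaces, and collapse extra whitespace."""
    lowered = text.lower()
    no_punctuation = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return " ".join(no_punctuation.split())

first = "YES, a donut"
second = "Yes - a donut"

print(first == second)                        # False
print(normalize(first) == normalize(second))  # True
```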
 
There are a LOT of other options for validating/repairing responses, including [LMQL](https://lmql.ai/playground/) and [outlines](https://outlines-dev.github.io/outlines/welcome/). **They are all pretty awful,** though Guardrails is probably the best.


In [None]:
!pip install --quiet guardrails-ai pydantic

In [None]:
from guardrails import validators, Guard
from rich import print
from pydantic import BaseModel, Field
import json
import openai
import os

os.environ['OPENAI_API_KEY'] = 'YOUR_OPENAI_API_KEY'  # replace with your own secret key

We'll start by setting up the structure we'd like our response to be in...

In [None]:
prompt = """
    ${text}    
    ${gr.complete_json_suffix_v2}
"""

class Comment(BaseModel):
    name: str = Field(description="Author's name")
    food_item: str = Field(description="Food item being discussed, in English")
    email: str = Field(description="Author's email", validators=[
        validators.ValidLength(min=1, max=32, on_fail='reask'),
        validators.RegexMatch(regex=r'.+@.+', on_fail='reask')
    ])
    language: str = Field(description="Comment language", validators=[
        validators.ValidChoices(
            choices=['English', 'Spanish', 'German', 'Other'],
            on_fail='reask'
        )
    ])

guard = Guard.from_pydantic(output_class=Comment, prompt=prompt)

...and now we'll run it against a single example.

In [None]:
text = """
After the broccoli incident, I never want to look at broccoli again. Please remove me 
from the broccoli email list.

Sincerely,
Jonathan
jonathan.soma@gmail.com
"""

res = guard(
    openai.chat.completions.create,
    prompt_params={"text": text},
    response_format={"type": "json_object"},
    max_tokens=4096,
    temperature=0.3,
)

# Print the validated output from the LLM
print(json.dumps(res.validated_output, ensure_ascii=False, indent=2))

To look at the conversation it had, we can examine the `guard` object. If we want to have a *fun time* we can intentionally break it by (for example) misspelling a language that the LLM will "helpfully" correct.

In [None]:
print(guard.history.last.tree)
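
The "reask" idea itself is simple enough to sketch by hand. To be clear, this is a rough illustration of the concept and not how Guardrails is actually implemented: `ask_until_valid` and `fake_llm` are invented names, and the fake LLM just replays canned answers so the example runs without an API key.

```python
def ask_until_valid(ask_fn, prompt, choices, max_attempts=3):
    """Call ask_fn until it returns one of the allowed choices, or give up."""
    current_prompt = prompt
    for _ in range(max_attempts):
        answer = ask_fn(current_prompt).strip()
        if answer in choices:
            return answer
        # Tell the model what went wrong and ask again
        current_prompt = (
            f"{prompt}\n\n"
            f"Your previous answer was {answer!r}, which is not allowed. "
            f"Respond with exactly one of: {', '.join(choices)}."
        )
    return None

# Fake LLM for demonstration: misspells the language first, then complies
fake_answers = iter(["Englsh", "English"])

def fake_llm(prompt):
    return next(fake_answers)

result = ask_until_valid(
    fake_llm,
    "What language is this comment written in?",
    ["English", "Spanish", "German", "Other"],
)
print(result)
```

Swapping `fake_llm` for a function that calls the real API would turn this into a bare-bones validator, minus everything else a dedicated library handles for you.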

## Reflection

In this notebook we looked at how AI isn't just one *thing*: it has all sorts of versions and options, same as everything else. Even ChatGPT has different versions, 3.5 and 4, of which 4 is more powerful in a handful of ways (even if slower and more expensive!).

A simple task for AI is putting things in categories, also known as classification. When doing bulk classification, it's important to examine the outputs systematically and not just spot check! That way you know how or why the AI might be going wrong, and either tweak your prompt or build knowledge of the mistakes into your process.

Finally, LLMs don't always listen to your rules. They might respond in formats you didn't ask for, add extra categories, or overrule the rules you specified. Validation and repair frameworks like Guardrails re-prompt the LLM to demand it respond in line with your rules, which usually works pretty well.