## 4: Labeling and Data Extraction


As before, we'll start with the imports.

In [1]:
import os, openai, json

Now we need to set the API key and, if you belong to one, the organization. You can enter them manually or read them from the environment.

In [2]:
openai.api_key = os.environ['OPENAI_KEY']

In [3]:
try:
    openai.organization = os.environ['OPENAI_ORG']
    print('Organization Added')
except KeyError:
    # OPENAI_ORG isn't set in the environment, so stick with the default
    print('No Organization')

Organization Added


In [4]:
# Run this if you want to use your personal account instead of the default org
openai.organization = None
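
If you'd rather read both settings in one place, here is a minimal sketch. The helper name `load_credentials` is made up for this example; it assumes the same `OPENAI_KEY` and `OPENAI_ORG` variable names used above and treats the organization as optional.

```python
import os

def load_credentials(env=os.environ):
    """Return (api_key, organization); organization is None when it isn't set."""
    api_key = env.get('OPENAI_KEY')
    if not api_key:
        # Fail early with a clear message instead of on the first API call
        raise KeyError('OPENAI_KEY is not set')
    return api_key, env.get('OPENAI_ORG')

# Usage (commented out so the sketch doesn't depend on your environment):
# openai.api_key, openai.organization = load_credentials()
```

Passing a plain dict as `env` makes it easy to try without touching your real environment.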

To call OpenAI, we use openai.Completion.create(), which takes a number of parameters. To make this easier, we'll create a wrapper function that (1) sets some default parameters and (2) saves each response to a file for us.

Let's start by creating a folder to drop prompts into.

In [5]:
os.makedirs('prompts', exist_ok=True)

Now we'll make the wrapper function itself, which has defaults but also accepts overrides as keyword arguments.

In [6]:
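def query(prompt, full=False, **kwargs):
    """Send a prompt, save the response to prompts/, and return the completion text."""
    # Defaults: temperature 0 for repeatable output, short completions, stop at a newline
    modelKwargs = {
        'temperature':0,
        'engine':'davinci',
        'max_tokens':10,
        'stop':['\n']
    }
    
    # Any keyword argument passed in overrides the matching default
    modelKwargs.update(kwargs)
    
    completion = openai.Completion.create(prompt=prompt, **modelKwargs)
    
    # Save the full response under its completion id
    with open('prompts/{}.json'.format(completion['id']), 'w') as fh:
        json.dump(completion, fh, indent=4)
    
    if full:
        return completion
    # When n completions were requested, return the text of every choice
    if 'n' in modelKwargs:
        return [x['text'] for x in completion['choices']]
    return completion['choices'][0]['text']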

Let's make sure it works!

In [8]:
query('1+1=')

'2.'
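
If you want to poke at the wrapper's two jobs (merging defaults with overrides, and saving to file) without spending API calls, something like the sketch below could work. `merge_params`, `save_response`, and the `fake` response are all made up for this example; the fake only mimics the `id` and `choices` fields the wrapper reads and is not real API output.

```python
import json
import os
import tempfile

DEFAULTS = {'temperature': 0, 'engine': 'davinci', 'max_tokens': 10, 'stop': ['\n']}

def merge_params(defaults, **overrides):
    """Copy the defaults, then let any keyword argument override them."""
    params = dict(defaults)
    params.update(overrides)
    return params

def save_response(response, folder):
    """Write the response to <folder>/<id>.json and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '{}.json'.format(response['id']))
    with open(path, 'w') as fh:
        json.dump(response, fh, indent=4)
    return path

# Hypothetical stand-in for a completion, NOT real API output
fake = {'id': 'cmpl-demo', 'choices': [{'text': '2.'}]}

params = merge_params(DEFAULTS, max_tokens=30, stop=['\n\n'])
saved = save_response(fake, tempfile.mkdtemp())
```

The defaults dict is copied before updating, so overriding `max_tokens` for one call doesn't leak into the next.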

Let's start with a dataset, say, a CSV.

In [11]:
import pandas as pd
foo = [{'name':'bob', 'age':25, 'pet':'turtle'}, {'name':'tom', 'age':18, 'pet':'squid'}]
pd.DataFrame(foo).to_csv()

',name,age,pet\n0,bob,25,turtle\n1,tom,18,squid\n'

In [16]:
prompt = """',name,age,pet\n0,bob,25,turtle\n1,tom,18,squid\n'
names: bob, tom
pets:"""

In [17]:
query(prompt)

' turtle, squid'
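
Pasting the CSV string by hand gets old fast, so here is a rough sketch of building the same kind of prompt in code. `rows_to_csv` and `extraction_prompt` are hypothetical helpers, not part of pandas or openai; `rows_to_csv` uses the standard library to imitate the layout `to_csv()` printed above (index column first), and the prompt leaves out the stray quote characters from the pasted version.

```python
import csv
import io

def rows_to_csv(rows):
    """Render a list of dicts as CSV text with a leading index column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    header = list(rows[0].keys())
    writer.writerow([''] + header)          # empty name for the index column
    for i, row in enumerate(rows):
        writer.writerow([i] + [row[key] for key in header])
    return buf.getvalue()

def extraction_prompt(csv_text, example_label, example_values, target_label):
    """CSV, then one worked example line, then the label we want completed."""
    return '{}{}: {}\n{}:'.format(
        csv_text, example_label, ', '.join(example_values), target_label)

rows = [{'name': 'bob', 'age': 25, 'pet': 'turtle'},
        {'name': 'tom', 'age': 18, 'pet': 'squid'}]
prompt = extraction_prompt(rows_to_csv(rows), 'names', ['bob', 'tom'], 'pets')
```

From there, `query(prompt)` would be called exactly as in the cell above.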

So what if we add a color column and give two owners the same kind of pet?

In [21]:
foo = [{'name':'bob', 'age':25, 'pet':'turtle', 'color':'brown'}, {'name':'tom', 'age':18, 'pet':'squid', 'color':'pink'}, {'name':'jerry', 'age':50, 'pet':'squid', 'color':'red'}]
pd.DataFrame(foo).to_csv()

',name,age,pet,color\n0,bob,25,turtle,brown\n1,tom,18,squid,pink\n2,jerry,50,squid,red\n'

In [34]:
prompt = """',name,age,pet,color\n0,bob,25,turtle,brown\n1,tom,18,squid,pink\n2,jerry,50,squid,red\n'
colors:
turtles: ['brown']
"""

In [35]:
query(prompt, stop=['\n\n'], max_tokens=30)

"squids: ['pink', 'red']"

Cool! It can pull out that the squids are pink and red.
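
Because the example line in the prompt used Python list syntax, the completion comes back in a shape we can turn into a real list. A small sketch, assuming the model keeps to that format (the helper names are made up, and `ast.literal_eval` will raise `ValueError` or `SyntaxError` if it doesn't):

```python
import ast

def parse_list(text):
    """Turn text like " ['tom', 'jerry']" into a Python list."""
    return ast.literal_eval(text.strip())

def parse_labeled_list(line):
    """Turn "squids: ['pink', 'red']" into ('squids', ['pink', 'red'])."""
    label, _, values = line.partition(':')
    return label.strip(), parse_list(values)

label, colors = parse_labeled_list("squids: ['pink', 'red']")
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text that came back from a model.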

What about categories?

In [40]:
prompt = """',name,age,pet,color\n0,bob,25,turtle,brown\n1,tom,18,squid,pink\n2,jerry,50,squid,red\n'
Reptile Owners: ['bob']
Mollusk Owners:"""

In [41]:
query(prompt)

" ['tom', 'jerry']"

But we want to make sure that isn't by chance, so let's add another animal just in case.

In [42]:
foo = [{'name':'bob', 'age':25, 'pet':'turtle', 'color':'brown'}, {'name':'tom', 'age':18, 'pet':'squid', 'color':'pink'}, {'name':'jerry', 'age':50, 'pet':'squid', 'color':'red'}, {'name':'amy', 'age':32, 'pet':'horse', 'color':'brown, white'}]
pd.DataFrame(foo).to_csv()

',name,age,pet,color\n0,bob,25,turtle,brown\n1,tom,18,squid,pink\n2,jerry,50,squid,red\n3,amy,32,horse,"brown, white"\n'

In [43]:
prompt = """',name,age,pet,color\n0,bob,25,turtle,brown\n1,tom,18,squid,pink\n2,jerry,50,squid,red\n3,amy,32,horse,"brown, white"\n'

Reptile Owners: ['bob']
Mollusk Owners:"""

In [44]:
query(prompt)

" ['tom', 'jerry']"

Pretty cool!
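
Another way to check it isn't chance is to compute the answer ourselves and compare. The sketch below does that for the four-row dataset; the `PET_CATEGORY` mapping is hand-written for this example (the "Mammal" label in particular is my assumption, it never appears in the prompt), and `model_answer` is the list parsed from the response above.

```python
rows = [
    {'name': 'bob', 'age': 25, 'pet': 'turtle', 'color': 'brown'},
    {'name': 'tom', 'age': 18, 'pet': 'squid', 'color': 'pink'},
    {'name': 'jerry', 'age': 50, 'pet': 'squid', 'color': 'red'},
    {'name': 'amy', 'age': 32, 'pet': 'horse', 'color': 'brown, white'},
]

# Hand-written ground truth for this toy dataset (an assumption of the sketch)
PET_CATEGORY = {'turtle': 'Reptile', 'squid': 'Mollusk', 'horse': 'Mammal'}

def owners_by_category(rows, pet_category):
    """Group owner names under the category of the pet they own."""
    owners = {}
    for row in rows:
        owners.setdefault(pet_category[row['pet']], []).append(row['name'])
    return owners

expected = owners_by_category(rows, PET_CATEGORY)
model_answer = ['tom', 'jerry']   # parsed from the completion shown above
matches = sorted(model_answer) == sorted(expected['Mollusk'])
```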

So how can we do this if we don't know the labels ahead of time? What if we want to come up with labels for each input? Well, let's try this with sentiment analysis.

In [47]:
prompt = """Review: I thought the movie sucked.
Sentiment: negative

Review:{}
Sentiment:"""

In [49]:
query(prompt.format('The movie was great!'))

' positive'

In [48]:
query(prompt.format('The movie was great... if you were blind'))

' negative'
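
To label a whole batch of reviews, the same pattern can be wrapped in a couple of helpers. This is only a sketch: `few_shot_prompt` and `label_reviews` are made-up names, the prompt adds a space after "Review:" (the template above has none), and `query_fn` is passed in so you can hand it the `query` wrapper or, as here, a fake for a dry run.

```python
def few_shot_prompt(examples, review):
    """Build labeled examples followed by the review we want labeled."""
    blocks = ['Review: {}\nSentiment: {}'.format(text, label)
              for text, label in examples]
    blocks.append('Review: {}\nSentiment:'.format(review))
    return '\n\n'.join(blocks)

def label_reviews(reviews, examples, query_fn):
    """Map each review to the stripped label returned by query_fn."""
    return {review: query_fn(few_shot_prompt(examples, review)).strip()
            for review in reviews}

examples = [('I thought the movie sucked.', 'negative')]

def fake_query(prompt):
    # Stand-in for the real model, for a dry run only
    last_review = prompt.rsplit('Review:', 1)[1]
    return ' positive' if 'great' in last_review else ' negative'

labels = label_reviews(['The movie was great!', 'It was dull.'], examples, fake_query)
```

With the real wrapper, the call would be `label_reviews(reviews, examples, query)`.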

Pretty cool, but what else can you pull from it?

In [54]:
prompt = """Review: I thought the movie sucked.
Sentiment: Negative
Mood: Serious

Review: The movie was great... if you were blind.
Sentiment: Negative
Mood: Sarcastic

Review:{}
Sentiment:"""

In [55]:
r = query(prompt.format('The movie was great!'), stop=['\n\n'])
print(r)

 Positive
Mood: Happy


In [56]:
r = query(prompt.format('That movie was so bad I loved it'), stop=['\n\n'])
print(r)

 Positive
Mood: Excited


In [57]:
r = query(prompt.format('The director seemed to want to just take a shit on the audience'), stop=['\n\n'])
print(r)

 Negative
Mood: Sarcastic


In [59]:
r = query(prompt.format('The actors did their best with a bad script but it was still a bad movie'), stop=['\n\n'])
print(r)

 Negative
Mood: Disappointed
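
Since the completion picks up right after "Sentiment:", the raw text is a bare value followed by "Mood: ..." on the next line. A minimal sketch for turning that into a dict, assuming the model keeps the one-field-per-line layout (`parse_fields` is a hypothetical helper):

```python
def parse_fields(response, first_field='Sentiment'):
    """Parse ' Positive\\nMood: Happy' into {'Sentiment': 'Positive', 'Mood': 'Happy'}."""
    # The prompt ended with "Sentiment:", so put that label back on the first line
    text = first_field + ':' + response
    fields = {}
    for line in text.splitlines():
        if ':' in line:
            key, _, value = line.partition(':')
            fields[key.strip()] = value.strip()
    return fields

parsed = parse_fields(' Positive\nMood: Happy')
```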


Cool, so we can see GPT starts applying its own labels that we hadn't given it. We can restrict the moods it chooses from by listing them at the start of the prompt.

In [64]:
prompt = """Movie review sentiments fall in [Positive, Negative] and the mood can be either [Serious, Sarcastic]

Review: I thought the movie sucked.
Sentiment: Negative
Mood: Serious

Review: The movie was great... if you were blind.
Sentiment: Negative
Mood: Sarcastic

Review:{}
Sentiment:"""

In [65]:
r = query(prompt.format('The actors did their best with a bad script but it was still a bad movie'), stop=['\n\n'])
print(r)

 Negative
Mood: Serious


In [66]:
r = query(prompt.format('The director seemed to want to just take a shit on the audience'), stop=['\n\n'])
print(r)

 Negative
Mood: Serious


In [68]:
r = query(prompt.format('The movie flopped... right into my heart'), stop=['\n\n'])
print(r)

 Positive
Mood: Sarcastic
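
Listing the allowed labels in the prompt is a nudge, and the model may still wander outside the list, so it can be worth checking what comes back. A small sketch, assuming the response has already been parsed into a dict of fields (`ALLOWED` and `invalid_labels` are made up for this example):

```python
ALLOWED = {
    'Sentiment': {'Positive', 'Negative'},
    'Mood': {'Serious', 'Sarcastic'},
}

def invalid_labels(fields, allowed):
    """Return only the fields whose value is outside the allowed set."""
    return {key: value for key, value in fields.items()
            if key in allowed and value not in allowed[key]}

ok = invalid_labels({'Sentiment': 'Positive', 'Mood': 'Sarcastic'}, ALLOWED)
bad = invalid_labels({'Sentiment': 'Positive', 'Mood': 'Happy'}, ALLOWED)
```

An empty dict means every label was one we asked for; anything else is a candidate for a retry or a manual look.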


K, I might not be the best movie reviewer of all time.

What labels would you want to try to extract/create?

In [69]:
prompt = """
YOUR PROMPT"""