In [1]:
import os
import pandas as pd
import importlib

## Basic usage

### Import the module

In [2]:
import sys
sys.path.append('../..')
import openai_data_tools as dt

### Load some data to code

This example loads data from a CSV into a Pandas dataframe, but you can load your data any way you like. All you need is a list where each item is a string containing the text to be coded.

In [3]:
my_data = pd.read_csv('./social_data.csv')
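If your data isn't already a clean list of strings, a conversion along these lines may help. This is only a sketch: the small in-memory dataframe stands in for `social_data.csv`, and the cleaning steps (dropping missing values, casting to `str`) are assumptions about what your data might need, not something the module requires.

```python
import pandas as pd

# Stand-in for the CSV above; only the 'item' column name comes from this example.
df = pd.DataFrame({'item': ['I love pizza.', None, 'Traffic was awful today.', 42]})

# Drop missing rows, cast every remaining value to str, and convert to a plain list.
items = df['item'].dropna().astype(str).tolist()

print(items)
```

Note that dropping rows changes the row count, so keep track of which rows you kept if you plan to join the codes back onto the original dataframe.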

### Create a coder object

In [4]:
my_coder = dt.DataCoder(
    api_key = os.getenv("OPENAI_API_KEY"),
    model = 'gpt-3.5-turbo',
    instructions = 'You will be provided with sentences from social media posts. For each sentence, determine if the sentence mentions food, cooking, or eating.'
)

To make requests to the OpenAI API, you will need an API key so the system knows who you are (and can charge you for the requests you make). You can create an API key on the OpenAI site on this page:

https://platform.openai.com/account/api-keys

Anyone with the API key can make requests using your account, so I don’t recommend storing it directly in your script. In the example above, it has been stored in an environment variable.
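One thing to watch for: `os.getenv` quietly returns `None` when the variable isn't set, which can lead to a confusing error later on. A small helper like the one below fails early with a clear message instead. The function name `get_api_key` is hypothetical and not part of the module.

```python
import os

def get_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of openai_data_tools.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key
```

You could then pass `api_key = get_api_key()` when creating the coder object.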

Use the `model` argument to specify which model you want to make requests to. In this example, we are using gpt-3.5-turbo.

Finally, the `instructions` argument is where you tell ChatGPT how to code the items it sees. In addition to the instructions you provide, ChatGPT will be asked to give a yes/no response for each item, so write your instructions with that in mind.


### Code some items

The `process` method asks ChatGPT to code each item in the list you provide. It returns a list of 0’s and 1’s, where 0 means the code was not applied, and 1 means it was.

In [5]:
my_coding = my_coder.process(my_data['item'])

Progress: 100%


In [6]:
my_coding

[1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

### Evaluate the coding

If you know the correct coding for each item, you can calculate classification metrics. In the example below, `my_data['target']` holds the correct response for each item, again as 0’s and 1’s. The scorer reports accuracy, precision, and recall for the coding you pass to it.

In [7]:
my_scorer = dt.ClassificationScorer(my_coding, my_data['target'])
print(f'Accuracy: {my_scorer.accuracy()}')
print(f'Precision: {my_scorer.precision()}')
print(f'Recall: {my_scorer.recall()}')

Accuracy: 0.9
Precision: 1.0
Recall: 0.8
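To see where these numbers come from, here is a sketch that computes the three metrics by hand using their standard definitions. The `target` list is an assumption: it was chosen to be consistent with the outputs shown in this notebook, since the actual data file isn't reproduced here. How `ClassificationScorer` computes its values internally isn't shown, so treat this as an illustration rather than the module's implementation.

```python
coding = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]  # ChatGPT's codes, as shown in the output above
target = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]  # assumed correct codes (not from the real data file)

pairs = list(zip(coding, target))
tp = sum(1 for c, t in pairs if c == 1 and t == 1)  # true positives
fp = sum(1 for c, t in pairs if c == 1 and t == 0)  # false positives
fn = sum(1 for c, t in pairs if c == 0 and t == 1)  # false negatives
tn = sum(1 for c, t in pairs if c == 0 and t == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)  # share of items coded correctly
precision = tp / (tp + fp)         # of the items coded 1, how many should be 1
recall = tp / (tp + fn)            # of the items that should be 1, how many were coded 1

print(accuracy, precision, recall)  # 0.9 1.0 0.8
```

Precision and recall are undefined when their denominators are zero (for example, if nothing was coded 1), so a real implementation would need to handle that case.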


### Get an item-by-item scoring

If you want to know how ChatGPT did for each item, you can score its responses. This will return a list of 0’s and 1’s, where 1 indicates a correct response from ChatGPT and 0 indicates an incorrect response. Examining the specific items ChatGPT is getting wrong can help you revise your instructions.

In [8]:
my_scorer.scores()

[1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
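To pull out just the items ChatGPT got wrong, you can pair each item with its score and keep the ones scored 0. In this sketch the `items` and `scores` lists are made-up stand-ins; in practice you would use your own item list and the output of `my_scorer.scores()`.

```python
# Hypothetical stand-in data for illustration.
items = [
    'I had pasta for lunch.',
    'The bus was late again.',
    'Grabbing a bite before the movie.',
    'My laptop battery died.',
]
scores = [1, 1, 0, 1]  # 1 = ChatGPT was correct, 0 = incorrect

# Keep the position and text of every item that was coded incorrectly.
wrong = [(i, item) for i, (item, score) in enumerate(zip(items, scores)) if score == 0]

for i, item in wrong:
    print(f'Item {i}: {item}')
```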

## Advanced usage

### Examining ChatGPT’s explanations

The module asks ChatGPT to explain its answer for every item it codes. This information is stored as part of the coder object, and can be useful to look at when you’re trying to improve performance. You can get the explanation for a particular response like this:

In [9]:
my_coder.explanations()[3]

'The sentence does not mention food, cooking, or eating.'

This shows the explanation for the 4th item (as usual in Python, lists use zero-based indexing, so index 3 is the fourth item).

### Few-shot learning

You may be able to improve coding performance by providing ChatGPT with some examples where you specify how each one should be coded and why. Examples should be in the form of a list of dicts, each with `item`, `target`, and `explanation` keys, like this:

In [10]:
examples = [{'item': 'My mom used to bake bread all the time when I was a kid.',
  'target': 1,
  'explanation': 'The sentence mentions baking, which is a kind of cooking.'},
 {'item': 'I need to go to the store to buy napkins.',
  'target': 0,
  'explanation': 'Napkins are used while eating, but the sentence does not directly mention food, cooking, or eating.'}]
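If you keep your examples in a spreadsheet or dataframe rather than typing them out, pandas can produce the same list-of-dicts structure. This is a sketch that assumes your dataframe has columns named `item`, `target`, and `explanation`; the dataframe here is built in memory just for illustration.

```python
import pandas as pd

# Stand-in dataframe; in practice this might come from pd.read_csv.
examples_df = pd.DataFrame({
    'item': ['My mom used to bake bread all the time when I was a kid.',
             'I need to go to the store to buy napkins.'],
    'target': [1, 0],
    'explanation': ['The sentence mentions baking, which is a kind of cooking.',
                    'The sentence does not directly mention food, cooking, or eating.'],
})

# orient='records' gives one dict per row, keyed by column name.
examples = examples_df[['item', 'target', 'explanation']].to_dict(orient='records')

print(examples[0])
```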

You then provide these examples when creating the coder object:

In [11]:
my_coder = dt.DataCoder(
    api_key = os.getenv("OPENAI_API_KEY"), 
    model = 'gpt-3.5-turbo', 
    instructions = 'You will be provided with sentences from social media posts. For each sentence, determine if the sentence mentions food, cooking, or eating.',
    examples = examples
)

### Saving and reloading coding data

Once you’ve coded some data, the coder object stores a lot of information about the coding, such as the explanations described earlier. However, once your Python session ends, that object goes away, and all that information is lost. If you want to save that information so you can look at it later, you can write it to a file like this:

In [12]:
my_coder.dump('my_coder_data')

You can then reload it in another session like this:

In [13]:
my_coder.restore('my_coder_data')

Note that you have to create the coder object before you can reload coding data into that object.
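If you also want a human-readable copy of your results that you can open without the module, one option is to write the codes and explanations to a JSON file yourself. This sketch is separate from `dump` and `restore` and makes no assumptions about the file format they use; the `results` dict and the file name are made up for illustration.

```python
import json
import os
import tempfile

# Stand-in results; in practice these would come from your own coding run.
results = {
    'coding': [1, 0],
    'explanations': ['The sentence mentions baking.',
                     'The sentence does not mention food, cooking, or eating.'],
}

# Write to the system temp directory here; use any path you like.
path = os.path.join(tempfile.gettempdir(), 'my_coding_results.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2)

# Read the file back to confirm it round-trips.
with open(path, encoding='utf-8') as f:
    reloaded = json.load(f)
```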