# Set-up

## Required packages

Besides the `openai_batch_manager` package, this example uses `configparser` to load an OpenAI API key without hard-coding it, and `pandas` to build the sample dataframe.

In [1]:
import os
from openai_batch_manager.batch_manager import BatchManager
from openai_batch_manager.jsonl_filewriter import write_jsonl_file_from_df
import configparser
import pandas as pd

### Get OpenAI key

In [2]:
config = configparser.ConfigParser()
config.read('../../config.ini')
openai_key = config['API_KEYS']['openai']
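The cell above expects a `config.ini` file with an `API_KEYS` section containing an `openai` entry. A minimal sketch of that layout follows, parsed from a string so that it runs without a file on disk. The key value is a placeholder, and both the `load_openai_key` helper and the environment-variable fallback are illustrative assumptions, not part of `openai_batch_manager`.

```python
import configparser
import os

# Hypothetical contents of config.ini; the key value is a placeholder.
SAMPLE_CONFIG = """
[API_KEYS]
openai = sk-your-key-here
"""

def load_openai_key(config_text, env_var="OPENAI_API_KEY"):
    """Return the key from the config text, else from an environment variable."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # fallback=None avoids an exception when the section or option is missing
    key = config.get("API_KEYS", "openai", fallback=None)
    return key or os.environ.get(env_var)

openai_key_example = load_openai_key(SAMPLE_CONFIG)
```

Keeping `config.ini` out of version control (for example, via `.gitignore`) is what makes this approach safer than hard-coding the key.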

## Create sample dataframe

Suppose our task is to take [Yelp reviews](https://huggingface.co/datasets/Yelp/yelp_review_full) of doctors and use ChatGPT to help us rate their bedside manner. Our sample dataframe contains some Yelp reviews:

In [3]:
df = pd.DataFrame({
    "text": [
        "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
        "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
        "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life.",
        "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new position there in June. He will be missed very much. \n\nI think finding a new doctor in NYC that you actually like might almost be as awful as trying to find a date!",
        "I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you call the office, they'll put you through to a voice mail, that NO ONE ever answers or returns your call. Both my adult children and husband have decided to leave this practice after experiencing such frustration. The entire office has an attitude like they are doing you a favor. Give me a break! Stay away from this doc and the practice. You deserve better and they will not be there when you really need them. I have never felt compelled to write a bad review about anyone until I met this pathetic excuse for a doctor who is all about the money."
    ]
})

# `jsonl_filewriter`: Module to create batch files from dataframes

## Prompt

Let's say that we want to apply the same instructions to all comments. In this example, I set the prompt in `system`.

In [4]:
system = """
You are a specialist in patient-focused healthcare in charge of rating a doctor's bedside manner using Yelp comments.
For every entry, you will rate the doctor's bedside manner in a scale from 1-5, where 1 is the worst and 5 is the best.
If the comment contains no relevant information, please give a rating of 0.

Respond in JSON format:
{
    'bedside_rating':<0-5>,
    'explanation':<20 or less word-long fragment of the comment that justifies your choice, 'none' if irrelevant>
}
"""

## Writing input batch file

The OpenAI batch API takes `jsonl` files as inputs. The function `write_jsonl_file_from_df` allows us to convert a dataframe column into one such file in a single step. We only need to input `df`, the `system` prompt, the `df` column from which we will create `user` messages (in our case, `text`), and the model. We can also set other features, such as `max_tokens`.

In [5]:
jsonl_content = write_jsonl_file_from_df(
    df, system, 'text', model="gpt-4o-mini"
)

The function returns a string that is ready to be written to a `jsonl` file.

In [6]:
print(jsonl_content)

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "system", "content": "\nYou are a specialist in patient-focused healthcare in charge of rating a doctor's bedside manner using Yelp comments.\nFor every entry, you will rate the doctor's bedside manner in a scale from 1-5, where 1 is the worst and 5 is the best.\nIf the comment contains no relevant information, please give a rating of 0.\n\nRespond in JSON format:\n{\n    'bedside_rating':<0-5>,\n    'explanation':<20 or less word-long fragment of the comment that justifies your choice, 'none' if irrelevant>\n}\n"}, {"role": "user", "content": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surge
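Judging from the printed output, each line is a single JSON object with a `custom_id`, the HTTP `method`, the endpoint `url`, and a `body` holding the model and the messages. The sketch below rebuilds that structure by hand for illustration. `build_batch_lines` is a hypothetical helper and should not be read as the implementation of `write_jsonl_file_from_df`.

```python
import json

def build_batch_lines(texts, system, model="gpt-4o-mini", max_tokens=None):
    """Return a JSONL string with one chat-completion request per text."""
    lines = []
    for i, text in enumerate(texts, start=1):
        body = {
            "model": model,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": text},
            ],
        }
        if max_tokens is not None:
            body["max_tokens"] = max_tokens
        request = {
            "custom_id": f"request-{i}",  # numbering pattern copied from the output above
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": body,
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

example = build_batch_lines(["Great doctor.", "Rude staff."], "Rate the review.")
first = json.loads(example.split("\n")[0])
print(first["custom_id"], first["body"]["messages"][1]["content"])
```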

We then write it to a jsonl file as follows:

In [7]:
with open('infile.jsonl','w') as f:
    f.write(jsonl_content)
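Before uploading, it can be worth checking that every line parses as JSON and that no `custom_id` repeats, since the Batch API uses `custom_id` to match outputs back to inputs. The helper below is a hypothetical sanity check written for this walkthrough, not something provided by `openai_batch_manager`.

```python
import json

def validate_jsonl(content):
    """Parse every non-empty line, check custom_id uniqueness, return the request count."""
    seen = set()
    for line_number, line in enumerate(content.split("\n"), start=1):
        if not line.strip():
            continue
        request = json.loads(line)  # raises json.JSONDecodeError on a malformed line
        custom_id = request["custom_id"]
        if custom_id in seen:
            raise ValueError(f"Duplicate custom_id {custom_id!r} on line {line_number}")
        seen.add(custom_id)
    return len(seen)
```

Calling `validate_jsonl(jsonl_content)` on the string from the earlier cell should return the number of requests, assuming the writer emits one line per dataframe row.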

# `BatchManager`: Class to upload files, check status and download outputs

We initialize the `BatchManager` class with the OpenAI API key and a path to a `csv` file where the GPT inputs and outputs will be logged (this file is created if it does not yet exist).

In [8]:
manager = BatchManager(
    api_key=openai_key,
    log_path='log.csv'
)

## Submitting batch files

Once initialized, we can use the `BatchManager` class to upload our newly created input file.

In [9]:
manager.submit_batch(
    'infile.jsonl',
    purpose='Rate bedside manner'
)

Logged submission: batch_682cf07d93408190a26fecf77b109a4b
Uploaded file: file-5GCTwc4cDMcMoAxUM9wgoA
Submitted batch: batch_682cf07d93408190a26fecf77b109a4b


('file-5GCTwc4cDMcMoAxUM9wgoA', 'batch_682cf07d93408190a26fecf77b109a4b')

The log file is automatically updated with this information.

In [10]:
manager.logger.df

Unnamed: 0,id,submission_datetime,purpose,infile,infile_id,status,status_datetime,outfile,outfile_id
0,batch_682cf07d93408190a26fecf77b109a4b,2025-05-20T17:13:33.859758,Rate bedside manner,infile.jsonl,file-5GCTwc4cDMcMoAxUM9wgoA,created,2025-05-20T17:13:33.859758,,


### Avoiding duplicate uploads

The `duplicate_skip` argument of `submit_batch` controls what happens when the log file shows that a file has already been uploaded. It defaults to `True`, which aborts the submission. If set to `False`, the user gets a warning, but the upload proceeds.

If we try to re-upload `infile.jsonl`, this is the output we get:

In [11]:
manager.submit_batch(
    'infile.jsonl',
    purpose='Rate bedside manner',
    duplicate_skip=True
)



ValueError: File 'infile.jsonl' has already been submitted — aborting.
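The duplicate check depends on the `infile` column of the log. Here is a standalone sketch of the same idea using the standard `csv` module, with the column names of the log shown earlier (minus the index column) and placeholder IDs. `already_submitted` and `guard_submission` are hypothetical helpers; `BatchManager` may implement the check differently.

```python
import csv
import io

# One row using the log's column names; the IDs are shortened placeholders.
SAMPLE_LOG = (
    "id,submission_datetime,purpose,infile,infile_id,"
    "status,status_datetime,outfile,outfile_id\n"
    "batch_abc,2025-05-20T17:13:33,Rate bedside manner,infile.jsonl,"
    "file-abc,created,2025-05-20T17:13:33,,\n"
)

def already_submitted(log_text, infile):
    """Return True if any logged submission used this input file path."""
    reader = csv.DictReader(io.StringIO(log_text))
    return any(row["infile"] == infile for row in reader)

def guard_submission(log_text, infile, duplicate_skip=True):
    """Raise on a repeat upload when duplicate_skip is True, otherwise only warn."""
    if already_submitted(log_text, infile):
        if duplicate_skip:
            raise ValueError(f"File {infile!r} has already been submitted")
        print(f"Warning: {infile!r} was submitted before; uploading again.")
    return infile
```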

## Checking on batch status

Through `BatchManager`, the user can easily check the progress of their batches. The only required input is the `batch_id` that identifies the relevant task.

In [12]:
manager.check_status(manager.logger.df.iloc[0]['id'])

Batch ID: batch_682cf07d93408190a26fecf77b109a4b
Submitted: 2025-05-20 17:13:33
Status: completed
Updated status for batch: batch_682cf07d93408190a26fecf77b109a4b


Batch(id='batch_682cf07d93408190a26fecf77b109a4b', completion_window='24h', created_at=1747775613, endpoint='/v1/chat/completions', input_file_id='file-5GCTwc4cDMcMoAxUM9wgoA', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1747785874, error_file_id=None, errors=None, expired_at=None, expires_at=1747862013, failed_at=None, finalizing_at=1747785873, in_progress_at=1747775618, metadata={'description': 'Rate bedside manner'}, output_file_id='file-CUz44cMqQ1hT1GZZTCknHi', request_counts=BatchRequestCounts(completed=5, failed=0, total=5))
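The `created_at` and `completed_at` fields in the `Batch` object are Unix timestamps in seconds. As a small illustration, they can be converted to timezone-aware datetimes to see how long the batch took; the two values below are copied from the output above.

```python
from datetime import datetime, timezone

created_at = 1747775613    # created_at from the Batch object above
completed_at = 1747785874  # completed_at from the Batch object above

created = datetime.fromtimestamp(created_at, tz=timezone.utc)
completed = datetime.fromtimestamp(completed_at, tz=timezone.utc)
elapsed = completed - created

print(created.isoformat())  # 2025-05-20T21:13:33+00:00
print(elapsed)              # 2:51:01
```

The batch therefore finished in just under three hours, well inside the `24h` completion window. The `status_datetime` column in the log records when the status was last checked, not when the batch finished.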

By default, checking the status also updates the log file. To turn this off, set `update_log=False`.

In [13]:
manager.logger.df

Unnamed: 0,id,submission_datetime,purpose,infile,infile_id,status,status_datetime,outfile,outfile_id
0,batch_682cf07d93408190a26fecf77b109a4b,2025-05-20T17:13:33.859758,Rate bedside manner,infile.jsonl,file-5GCTwc4cDMcMoAxUM9wgoA,completed,2025-05-21T08:56:49.755675,,
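Since a batch can take hours, one option is to poll until it reaches a terminal state. The sketch below is a generic polling loop written for this walkthrough: `wait_for_batch` is a hypothetical helper, and `get_status` is any zero-argument callable that returns the current status string.

```python
import time

# Terminal statuses in the OpenAI Batch API; anything else means "keep waiting".
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(get_status, poll_seconds=60.0, max_polls=100, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"Batch not finished after {max_polls} polls")
```

Assuming `check_status` returns the `Batch` object, as the output above suggests, a call might look like `wait_for_batch(lambda: manager.check_status(batch_id).status)`.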


## Downloading files

Once the batch has completed, `BatchManager` can be used to download the output file and log its ID and path.

In [14]:
manager.download_if_ready(
    batch_id=manager.logger.df.iloc[0]['id'],
    out_path='outfile.json'
)

Updated status for batch: batch_682cf07d93408190a26fecf77b109a4b
Updated outfile for batch: batch_682cf07d93408190a26fecf77b109a4b
Downloaded results for batch: batch_682cf07d93408190a26fecf77b109a4b


True

In [15]:
manager.logger.df

Unnamed: 0,id,submission_datetime,purpose,infile,infile_id,status,status_datetime,outfile,outfile_id
0,batch_682cf07d93408190a26fecf77b109a4b,2025-05-20T17:13:33.859758,Rate bedside manner,infile.jsonl,file-5GCTwc4cDMcMoAxUM9wgoA,completed,2025-05-21T08:56:55.126287,outfile.json,file-CUz44cMqQ1hT1GZZTCknHi
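The downloaded file is itself in JSON Lines form, with one record per request, and the Batch API does not guarantee that records come back in input order. A minimal sketch of reading the replies by `custom_id` follows. The sample line is made up to mirror the shape of the Batch API output format and is not an actual result from this batch; `extract_replies` and `row_position` are hypothetical helpers.

```python
import json

# Made-up record in the shape of a Batch API output line; not a real result.
SAMPLE_OUTPUT_LINE = json.dumps({
    "custom_id": "request-1",
    "response": {
        "status_code": 200,
        "body": {
            "choices": [
                {"message": {"role": "assistant", "content": "PLACEHOLDER REPLY"}}
            ]
        },
    },
    "error": None,
})

def extract_replies(jsonl_text):
    """Map each custom_id to the assistant's reply text (None if the request failed)."""
    replies = {}
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            replies[record["custom_id"]] = None
            continue
        replies[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return replies

def row_position(custom_id):
    """Assumes IDs follow the 'request-N' pattern from the input file, with N starting at 1."""
    return int(custom_id.rsplit("-", 1)[1]) - 1

print(extract_replies(SAMPLE_OUTPUT_LINE))
```

With the real file, the text of `outfile.json` could be passed to `extract_replies`, and `row_position` could then align each reply with its row in `df`, provided the `request-N` numbering follows the dataframe order.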
