# Batch processing with the Batch API

The new Batch API lets you **create asynchronous batch jobs at a lower price and with higher rate limits**.

Batches will be completed within 24h, but may be processed sooner depending on global usage. 

Ideal use cases for the Batch API include:

- Tagging, captioning, or enriching content on a marketplace or blog
- Categorizing and suggesting answers for support tickets
- Performing sentiment analysis on large datasets of customer feedback
- Generating summaries or translations for collections of documents or articles

and much more!

We will start with an example that categorizes movies using the `gpt-4o-mini` model.

Please note that multiple models are available through the Batch API, and that you can use the same parameters in your Batch API calls as with the Chat Completions endpoint.

## Setup

In [None]:
# Make sure you have the latest version of the SDK available to use the Batch API
%pip install openai -U -q

In [47]:
import json
from openai import AzureOpenAI
import pandas as pd
from IPython.display import Image, display
import os

from dotenv import load_dotenv
load_dotenv()

True

In [48]:
# Initializing the Azure OpenAI client (OpenAI quickstart: https://platform.openai.com/docs/quickstart?context=python)
client = AzureOpenAI(
    api_version=os.getenv("OPENAI_API_VERSION_GPT4O_MINI_BATCH"),
    api_key=os.getenv("AZURE_OPENAI_KEY_GPT4O_MINI_BATCH"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT_GPT4O_MINI_BATCH")
)

## First example: Categorizing movies

In this example, we will use `gpt-4o-mini` to extract movie categories from a description of the movie. We will also extract a 1-sentence summary from this description. 

We will use [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to extract categories as an array of strings and the 1-sentence summary in a structured format. 

For each movie, we want to get a result that looks like this:

```
{
    categories: ['category1', 'category2', 'category3'],
    summary: '1-sentence summary'
}
```
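JSON mode yields syntactically valid JSON, but it does not by itself enforce a particular set of keys, so it can help to check each result against the shape above before using it. The following is a minimal sketch in plain Python; `parse_movie_result` is a hypothetical helper name (not part of the OpenAI SDK), and the sample string is only an illustration of the expected shape.

```python
import json


def parse_movie_result(raw):
    """Parse one model response and check it has the expected shape.

    Hypothetical helper: raises ValueError if the payload does not contain
    a list of string categories and a string summary.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    categories = data.get("categories")
    summary = data.get("summary")

    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("'categories' must be a list of strings")
    if not isinstance(summary, str):
        raise ValueError("'summary' must be a string")

    return {"categories": categories, "summary": summary}


# Illustrative input in the target format
example = '{"categories": ["crime", "drama"], "summary": "An aging crime lord hands over his empire to his hesitant son."}'
parsed = parse_movie_result(example)
print(parsed["categories"])
print(parsed["summary"])
```

Raising on a malformed payload, rather than returning a partial result, makes it easier to spot requests that should be retried.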

### Loading data

We will use the IMDB top 1000 movies dataset for this example. 

In [49]:
dataset_path = "../data/imdb_top_1000.csv"

df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


### Processing step 

Here, we will prepare our requests by first trying them out with the Chat Completions endpoint.

Once we're happy with the results, we can move on to creating the batch file.

In [50]:
categorize_system_prompt = '''
Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:

{
    categories: string[] // Array of categories based on the movie description,
    summary: string // 1-sentence summary of the movie based on the movie description
}

Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.
'''

def get_categories(description):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        # Enable JSON mode so that responses are valid JSON objects
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": categorize_system_prompt
            },
            {
                "role": "user",
                "content": description
            }
        ],
    )

    return response.choices[0].message.content

In [51]:
# Testing on a few examples
for _, row in df[:5].iterrows():
    description = row['Overview']
    title = row['Series_Title']
    result = get_categories(description)
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")

TITLE: The Shawshank Redemption
OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

RESULT: {
    "categories": ["drama", "crime"],
    "summary": "Two imprisoned men develop a deep friendship and find redemption through their shared experiences and acts of kindness."
}


----------------------------


TITLE: The Godfather
OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

RESULT: {
    "categories": ["crime", "drama"],
    "summary": "An aging crime lord hands over his empire to his hesitant son."
}


----------------------------


TITLE: The Dark Knight
OVERVIEW: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

RESULT: {
    "categories": ["action", "thriller", "superhero"],
    "summary": "Batman 

### Creating the batch file

The batch file, in `jsonl` format, must contain one line (a JSON object) per request.
Each request is defined as follows:

```
{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}
```

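Note: the `custom_id` must be unique within a batch. Use it to match results back to the original inputs, because results are not guaranteed to be returned in the same order as the requests. Also note that the cell below, which runs against Azure OpenAI, sets `url` to `/chat/completions` and passes `gpt-4o-mini-batch` as the `model`.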
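The two points in this note can be sketched in a few lines of plain Python: checking that no `custom_id` is repeated before submitting, and putting results back into input order afterwards. The helper names `find_duplicate_ids` and `restore_input_order` are hypothetical, and the `content` key in the sample results is a simplified stand-in for the real response payload.

```python
from collections import Counter


def find_duplicate_ids(tasks):
    """Return the custom_id values that appear more than once, sorted."""
    counts = Counter(task["custom_id"] for task in tasks)
    return sorted(cid for cid, n in counts.items() if n > 1)


def restore_input_order(tasks, results):
    """Reorder results to follow the order of the submitted tasks.

    Results are looked up by custom_id; a task without a matching
    result yields None in its position.
    """
    by_id = {result["custom_id"]: result for result in results}
    return [by_id.get(task["custom_id"]) for task in tasks]


# Three submitted tasks and their results, returned out of order
tasks = [{"custom_id": f"task-{i}"} for i in range(3)]
results = [
    {"custom_id": "task-2", "content": "c"},
    {"custom_id": "task-0", "content": "a"},
    {"custom_id": "task-1", "content": "b"},
]

duplicates = find_duplicate_ids(tasks)
ordered = restore_input_order(tasks, results)
ordered_contents = [r["content"] for r in ordered]

print(duplicates)
print(ordered_contents)
```

Keying on `custom_id` instead of line position means the matching keeps working even if some requests fail and are missing from the output.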

In [59]:
# Creating an array of json tasks

tasks = []

for index, row in df[:5].iterrows():
    
    description = row['Overview']
    
    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini-batch",
            "temperature": 0.1,
            "response_format": { 
                "type": "json_object"
            },
            "messages": [
                {
                    "role": "system",
                    "content": categorize_system_prompt
                },
                {
                    "role": "user",
                    "content": description
                }
            ],
        }
    }
    
    tasks.append(task)

In [60]:
# Creating the file

file_name = "batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
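Before uploading, it can be worth reading the file back to confirm that every line parses and carries the four top-level keys from the request format above. This is a rough local check only, not the validation the service itself performs; `validate_batch_file` and the sample file name are hypothetical.

```python
import json
import os
import tempfile

# Top-level keys taken from the request format described above
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_file(path):
    """Return a list of (line_number, problem) tuples; empty if none found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                request = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(request, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - request.keys()
            if missing:
                problems.append((line_number, f"missing keys: {sorted(missing)}"))
    return problems


# Demo on a throwaway file: one good line, one incomplete, one malformed
sample_lines = [
    json.dumps({"custom_id": "task-0", "method": "POST", "url": "/chat/completions", "body": {}}),
    json.dumps({"custom_id": "task-1", "method": "POST"}),
    "{not valid json",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_tasks.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")
    problems = validate_batch_file(sample_path)

for line_number, problem in problems:
    print(f"line {line_number}: {problem}")
```

In the notebook, the same function could be called as `validate_batch_file(file_name)` right after writing the file.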

### Uploading the file

In [61]:
# Open the file in a context manager so the handle is closed after the upload
with open(file_name, "rb") as f:
    batch_file = client.files.create(
        file=f,
        purpose="batch"
    )

print(batch_file.model_dump_json(indent=2))
batch_file_id = batch_file.id

{
  "id": "file-350a82a513584d15a3fadc78b29dccef",
  "bytes": 5594,
  "created_at": 1724382168,
  "filename": "batch_tasks_movies.jsonl",
  "object": "file",
  "purpose": "batch",
  "status": "pending",
  "status_details": null
}


### Track file upload status

In [62]:
# Wait until the uploaded file is in processed state
import time
import datetime 

status = "pending"
while status != "processed":
    time.sleep(15)
    file_response = client.files.retrieve(batch_file_id)
    status = file_response.status
    print(f"{datetime.datetime.now()} File Id: {batch_file_id}, Status: {status}")

2024-08-23 11:03:09.921213 File Id: file-350a82a513584d15a3fadc78b29dccef, Status: processed
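The loop above keeps polling until the status is exactly `processed`, so it would never exit if processing ended in any other state. One way to guard against that is a small polling helper with explicit failure statuses and a timeout. This is a sketch under assumptions: `wait_for_status` is a hypothetical helper, and the demo uses a fake status source instead of a real API call.

```python
import time


def wait_for_status(fetch_status, success, failure=(), interval=15.0, timeout=3600.0):
    """Poll fetch_status() until it returns a success or failure status.

    Returns the success status, raises RuntimeError on a failure status,
    and raises TimeoutError if neither is reached within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in success:
            return status
        if status in failure:
            raise RuntimeError(f"reached failure status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Demo with a fake status source (no network calls, no waiting)
fake_statuses = iter(["pending", "pending", "processed"])
final_status = wait_for_status(
    lambda: next(fake_statuses),
    success={"processed"},
    interval=0,
)
print(final_status)
```

With the real client, the status source could be `lambda: client.files.retrieve(batch_file_id).status`. Which statuses count as failures (for example `"error"`) is an assumption to verify against the service documentation.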


### Creating the batch job

In [63]:
# Submit a batch job with the file

batch_response = client.batches.create(
  input_file_id=batch_file_id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

# Save batch ID for later use
batch_id = batch_response.id

print(batch_response.model_dump_json(indent=2))

{
  "id": "batch_09150d34-892a-486a-8023-db19d9fd2ea9",
  "completion_window": "24h",
  "created_at": 1724382196,
  "endpoint": "/chat/completions",
  "input_file_id": "file-350a82a513584d15a3fadc78b29dccef",
  "object": "batch",
  "status": "validating",
  "cancelled_at": null,
  "cancelling_at": null,
  "completed_at": null,
  "error_file_id": null,
  "errors": null,
  "expired_at": null,
  "expires_at": 1724468596,
  "failed_at": null,
  "finalizing_at": null,
  "in_progress_at": null,
  "metadata": null,
  "output_file_id": null,
  "request_counts": {
    "completed": 0,
    "failed": 0,
    "total": 0
  }
}
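The `created_at` and `expires_at` fields in this response are Unix timestamps in seconds. As a quick sanity check, the sketch below converts the two values printed above into UTC datetimes and computes the gap between them, which should correspond to the `24h` completion window. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the batch response shown above
created_at = 1724382196
expires_at = 1724468596

created = datetime.fromtimestamp(created_at, tz=timezone.utc)
expires = datetime.fromtimestamp(expires_at, tz=timezone.utc)
window = expires - created

print(f"created: {created.isoformat()}")
print(f"expires: {expires.isoformat()}")
print(f"window:  {window}")
```

The same conversion applies to the other `*_at` fields once they are populated.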


### Track batch job progress

Note: this can take up to 24h, but it usually completes sooner.

Keep checking until the batch reaches a terminal status such as `completed` or `failed`.

In [64]:
import time
import datetime 

status = "validating"
# Stop polling once the batch reaches a terminal status
while status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(15)
    batch_response = client.batches.retrieve(batch_id)
    status = batch_response.status
    print(f"{datetime.datetime.now()} Batch Id: {batch_id},  Status: {status}")

2024-08-23 11:03:38.963729 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:03:55.539754 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:04:12.151968 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:04:28.891453 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:04:45.438923 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:05:02.141668 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:05:18.844527 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:05:35.727572 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:05:52.306819 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: validating
2024-08-23 11:06:10.523481 Batch Id: batch_09150d34-892a-486a-8023-db19d9fd2ea9,  Status: v

### Retrieving results

In [67]:
import json

# Download the output file; each line holds the JSON result of one request
file_response = client.files.content(batch_response.output_file_id)
raw_responses = file_response.text.strip().split('\n')

for raw_response in raw_responses:
    json_response = json.loads(raw_response)
    task_id = json_response['custom_id']
    # Recover the DataFrame row position from the task id (e.g. "task-3" -> 3)
    index = int(task_id.split('-')[-1])
    result = json_response['response']['body']['choices'][0]['message']['content']
    movie = df.iloc[index]
    description = movie['Overview']
    title = movie['Series_Title']
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")
    # Uncomment to inspect the full raw response
    # print(json.dumps(json_response, indent=2))

TITLE: The Godfather
OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

RESULT: {
    "categories": ["crime", "drama"],
    "summary": "An aging crime lord hands over his empire to his hesitant son."
}


----------------------------


TITLE: The Shawshank Redemption
OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

RESULT: {
    "categories": ["drama"],
    "summary": "Two imprisoned men develop a deep bond over the years, discovering solace and redemption through their shared acts of kindness."
}


----------------------------


TITLE: 12 Angry Men
OVERVIEW: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.

RESULT: {
    "categories": ["drama", "thriller"],
    "summary": "A jury holdout fights to ensure justice is served by urging his fellow jurors to reevaluate the evide