# Text Classification with kluster.ai Batch API

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kluster-ai/klusterai-cookbook/blob/main/text-classification.ipynb)

Welcome to the text classification notebook with kluster.ai Batch API!

This notebook demonstrates how to use the <a href="https://kluster.ai/" target="_blank">kluster.ai</a> Batch API to classify movies into predefined categories based on their descriptions. Whether you’re working with a small dataset or scaling up to handle thousands of records, this notebook makes it simple to get started.

By following this guide, you’ll be able to:
1. **Input your API Key:** The only required input to connect to the kluster.ai platform. If you don’t have one, sign up <a href="https://platform.kluster.ai/signup" target="_blank">on our platform</a>, and you'll receive a free API key.
2. **Run the Classification Process:** The notebook comes preloaded with a sample dataset containing movie descriptions. Simply run the provided cells to submit the batch job to the API and receive categorized results.

This notebook is designed for users of all experience levels, so whether you’re familiar with APIs or completely new to large language models, you’ll find it easy to use. By the end, you’ll have a categorized dataset and insights ready for your project, all without needing to write any additional code.

Let’s get started!


## Config

Enter your personal kluster.ai API key (make sure it has no blank spaces). Remember to <a href="https://platform.kluster.ai/signup" target="_blank">sign up</a> if you don't have one yet.

In [2]:
from getpass import getpass
# Enter your personal kluster.ai API key (make sure it has no blank spaces)
api_key = getpass("Enter your kluster.ai API key: ")

Enter your kluster.ai API key:  ········


## Setup

In [3]:
%pip install openai

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
import pandas as pd
from openai import OpenAI
import time
import json
from IPython.display import clear_output, display

# Widen the pandas display limits so long movie descriptions are not truncated
pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows', 1000, 'display.max_colwidth', 500)

In [5]:
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)

## Get the data

This notebook includes a preloaded sample dataset derived from the Top 1000 IMDb Movies dataset. It contains movie descriptions ready for classification. No additional setup is needed—simply proceed to the next steps to begin working with this data.

In [6]:
df = pd.DataFrame({
    "text": [
        "Breakfast at Tiffany's: A young New York socialite becomes interested in a young man who has moved into her apartment building, but her past threatens to get in the way.",
        "Giant: Sprawling epic covering the life of a Texas cattle rancher and his family and associates.",
        "From Here to Eternity: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.",
        "Lifeboat: Several survivors of a torpedoed merchant ship in World War II find themselves in the same lifeboat with one of the crew members of the U-boat that sank their ship.",
        "The 39 Steps: A man in London tries to help a counter-espionage Agent. But when the Agent is killed, and the man stands accused, he must go on the run to save himself and stop a spy ring which is trying to steal top secret information."
    ]
})
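To classify your own data instead, the rest of the notebook only needs a DataFrame with a `text` column. The following is a minimal sketch, assuming a hypothetical source table with `title` and `description` columns (these names are illustrative and are not taken from the IMDb dataset). It builds the same "Title: description" strings used above and resets the index, since later cells use the row index to match results back to rows.

```python
import pandas as pd

def build_text_column(source_df, title_col="title", description_col="description"):
    """Return a DataFrame with a single `text` column in "Title: description" form."""
    # Column names are hypothetical; adjust them to match your own data.
    cleaned = source_df.dropna(subset=[title_col, description_col])
    text = cleaned[title_col].str.strip() + ": " + cleaned[description_col].str.strip()
    # Building from a list gives a fresh 0..n-1 index
    return pd.DataFrame({"text": text.tolist()})

# Toy rows for illustration only; the second row is dropped because it has no description
movies = pd.DataFrame({
    "title": ["Giant", "Lifeboat "],
    "description": [" Sprawling epic covering the life of a Texas cattle rancher.", None],
})
own_df = build_text_column(movies)
```

Assigning the result to `df` in place of the sample dataset would let the remaining cells run on your own rows.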

## Define the task

Here, we define the prompt to guide the model in classifying the movies into specific genres based on their descriptions, with the output formatted as a JSON object.

In [7]:
SYSTEM_PROMPT = '''
    Classify the main genre of the given movie description based on the following genres: "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Historical", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Sports", "Thriller", "War", "Western", "Family", "Noir", "Superhero", "Martial Arts", "Psychological Thriller", "Black Comedy", "Post-Apocalyptic", "Cyberpunk", "Steampunk", "Disaster", "Film Noir", "Silent Film", "Slasher", "Zombie", "Paranormal", "Mockumentary", "Anthology", "Coming of Age", "Period Piece", "Road Movie", "Science Fantasy", "Surreal", "Spy", "Teen", "Epic", "Found Footage", "Heist", "Political", "Gangster", "Experimental", "Satire", "Tragedy", "Romantic Comedy", "Dark Fantasy", "Supernatural", "Time Travel", "Vampire", "Alien Invasion", "Musical Drama", "Historical Fiction", "Urban", "Chick Flick", "Buddy Film", "Disaster Comedy", "Adventure Comedy", "Space Opera", "Sword and Sorcery". 
    Provide only a JSON object with the following structure:
    {
        "category": string, // The primary category of the provided movie description
        "confidence": float, // A value between 0 and 1 indicating confidence in the classification
    }
    '''

## Batch Predictions

To execute the batch predictions, we’ll follow two simple steps:
1. **Create the Batch File:** We’ll generate a file containing one request per movie description, each using the prompt defined earlier, to be processed by the model.
2. **Upload the Batch File:** Once the batch file is ready, we’ll upload it to the kluster.ai platform using the API, where the tasks will be executed.

Everything is set up for you – just run the cells below to watch it all come together!

### Create the Batch File

This example uses the `klusterai/Meta-Llama-3.1-70B-Instruct-Turbo` model. To use a different model, change the model name in the following cell. Refer to our <a href="https://docs.kluster.ai/getting-started/#list-supported-models" target="_blank">documentation</a> for a list of the models we support.

In [8]:
def create_tasks(df, task_type, system_prompt):
    """Build one chat-completion request per DataFrame row, keyed by the row index."""
    tasks = []
    for index, row in df.iterrows():
        content = row['text']
        
        task = {
            "custom_id": f"{task_type}-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "klusterai/Meta-Llama-3.1-70B-Instruct-Turbo",
                "temperature": 0.5,
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": content}
                ],
            }
        }
        tasks.append(task)
    return tasks

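def save_tasks(tasks, task_type):
    """Write the tasks to a JSONL file (one JSON object per line) and return its path."""
    filename = f"data/batch_tasks_{task_type}.jsonl"
    # Create the data/ directory if it does not exist yet
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, 'w') as file:
        for task in tasks:
            file.write(json.dumps(task) + '\n')
    return filename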

In [9]:
task_type = 'movie_classification'

task_list = create_tasks(df, task_type=task_type, system_prompt=SYSTEM_PROMPT)
filename = save_tasks(task_list, task_type=task_type)
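Before uploading, it can be worth checking the batch file locally. The sketch below is one possible check, not part of the kluster.ai API: the helper name `validate_batch_lines` and the set of required keys are assumptions based on the task structure built by `create_tasks`. It confirms that each line is valid JSON, carries the expected top-level keys, and has a unique `custom_id`.

```python
import json

# Top-level keys that create_tasks puts in every request (assumed to be the required set)
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Check that every JSONL line parses, has the expected keys, and a unique custom_id."""
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        task = json.loads(line)  # raises json.JSONDecodeError on malformed lines
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        if task["custom_id"] in seen_ids:
            raise ValueError(f"line {line_number}: duplicate custom_id {task['custom_id']!r}")
        seen_ids.add(task["custom_id"])
    return len(seen_ids)

# Example with one in-memory line shaped like the tasks created above
sample_line = json.dumps({
    "custom_id": "movie_classification-0",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {"model": "klusterai/Meta-Llama-3.1-70B-Instruct-Turbo", "messages": []},
})
valid_count = validate_batch_lines([sample_line])
```

To check the file written by `save_tasks`, pass the open file object directly, for example `with open(filename) as f: validate_batch_lines(f)`.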

### Upload Batch File to kluster.ai

In [10]:
def create_batch_job(file_name):
    print(f"Creating batch job for {file_name}")
    # Upload the JSONL file; the context manager closes the handle afterwards
    with open(file_name, "rb") as f:
        batch_file = client.files.create(
            file=f,
            purpose="batch"
        )

    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )

    return batch_job

job = create_batch_job(filename)

Creating batch job for data/batch_tasks_movie_classification.jsonl


### Check Job progress

The batch job has been created, and its tasks are now being processed. The next cell polls the job status every 10 seconds and displays its progress until the job finishes.

**REMEMBER:** A batch inference job can take up to 24 hours to complete, although it typically finishes in much less time.

In [11]:
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

# Statuses after which the job makes no further progress. These names follow
# the OpenAI Batch API and are assumed to apply to kluster.ai as well.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

while True:
    updated_job = client.batches.retrieve(job.id)
    status = updated_job.status

    if status == "completed":
        message = f"{task_type.capitalize()} job completed!"
    else:
        completed = updated_job.request_counts.completed
        total = updated_job.request_counts.total
        message = f"{task_type.capitalize()} job status: {status} - Progress: {completed}/{total}"

    # Clear the output and display updated status
    clear_output(wait=True)
    display(message)

    # Stop on any terminal status so a failed job does not loop forever
    if status in TERMINAL_STATUSES:
        break
    time.sleep(10)

'Movie_classification job completed!'
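Because a job can take up to 24 hours, you may prefer a wait with an explicit deadline. The following is a minimal sketch under stated assumptions: `wait_for_terminal_status` is a hypothetical helper, and the terminal status names are those of the OpenAI Batch API, assumed here to match kluster.ai. The status lookup is passed in as a function, so the example runs offline with a fake status sequence.

```python
import time

def wait_for_terminal_status(get_status, poll_seconds=10, timeout_seconds=24 * 60 * 60,
                             terminal=("completed", "failed", "expired", "cancelled")):
    """Poll get_status() until it returns a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)

# Offline demonstration with a fake status sequence instead of a real job
fake_statuses = iter(["validating", "in_progress", "completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
```

With the client from this notebook, the lookup would be `lambda: client.batches.retrieve(job.id).status`.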

## Get the results

In [12]:
batch_job = client.batches.retrieve(job.id)
result_file_id = batch_job.output_file_id
# The output file is JSONL: one response object per line
raw_results = client.files.content(result_file_id).content
results = parse_json_objects(raw_results)

for res in results:
    task_id = res['custom_id']
    # custom_id has the form "<task_type>-<row index>"
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    text = df.iloc[int(index)]['text']
    print('\n -------------------------- \n')
    print(f"Task ID: {task_id}. \n\nTEXT: {text}\n\nRESULT: {result}")


 -------------------------- 

Task ID: movie_classification-0. 

TEXT: Breakfast at Tiffany's: A young New York socialite becomes interested in a young man who has moved into her apartment building, but her past threatens to get in the way.

RESULT: {
    "category": "Romance",
    "confidence": 0.8
}

 -------------------------- 

Task ID: movie_classification-1. 

TEXT: Giant: Sprawling epic covering the life of a Texas cattle rancher and his family and associates.

RESULT: {
    "category": "Epic",
    "confidence": 0.8
}

 -------------------------- 

Task ID: movie_classification-2. 

TEXT: From Here to Eternity: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.

RESULT: {
    "category": "Romance",
    "confidence": 0.7
}

 -------------------------- 

Task ID: movie_classification-3. 

TEXT: Lifeboat: Several survivors of a torpedoed merchant ship in World War II find themselve
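Each `RESULT` above is a JSON string, so it can be parsed into structured values for further analysis. The sketch below assumes the reply follows the `category` and `confidence` structure requested in the system prompt. `parse_classification` is a hypothetical helper that returns `(None, None)` for malformed replies rather than raising, since a model is not guaranteed to follow the format.

```python
import json

def parse_classification(content, allowed_categories=None):
    """Parse one model reply into (category, confidence); (None, None) if malformed."""
    try:
        obj = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None, None
    if not isinstance(obj, dict):
        return None, None

    category = obj.get("category")
    if not isinstance(category, str):
        return None, None
    # Optionally reject categories outside the list given in the prompt
    if allowed_categories is not None and category not in allowed_categories:
        return None, None

    confidence = obj.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        confidence = None  # keep the category but drop an unusable confidence
    return category, confidence

# Example using a reply shaped like the output shown above
category, confidence = parse_classification('{"category": "Romance", "confidence": 0.8}')
```

Applied inside the results loop, `parse_classification(result)` would give a category and confidence that can be stored as new columns in `df`.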

## Conclusion

You’ve successfully completed the classification task using the kluster.ai Batch API! The same steps used here (building a JSONL file of requests, uploading it, monitoring the job, and parsing the output file) apply unchanged to much larger datasets, which makes the Batch API a practical way to scale classification workflows.