## Configure Azure Key Vault and OpenAI Credentials

Retrieve the OpenAI API key from Azure Key Vault and expose it as the `OPENAI_API_KEY` environment variable.
This keeps the credential out of the notebook source instead of hardcoding it.

In [1]:
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
import os

def get_openai_key():
    """Retrieve OpenAI API key from Azure Key Vault"""
    try:
        # Initialize the Azure credentials
        credential = DefaultAzureCredential()
        
        # Create a secret client
        vault_url = "https://kvrunithesis.vault.azure.net/"
        secret_client = SecretClient(vault_url=vault_url, credential=credential)
        
        # Get the secret
        secret = secret_client.get_secret("alon-thesis-openai-key")
        
        # Set as environment variable
        os.environ["OPENAI_API_KEY"] = secret.value
        
        print("Successfully retrieved OpenAI API key from Azure Key Vault")
    except Exception as e:
        print(f"Error retrieving secret from Key Vault: {str(e)}")
        raise

# Retrieve and set the OpenAI API key
get_openai_key()

# The OpenAI client can now be initialized; it reads OPENAI_API_KEY from the environment automatically

Successfully retrieved OpenAI API key from Azure Key Vault
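
Later cells read the key back with `os.getenv("OPENAI_API_KEY")`, so it can help to fail fast when the variable is missing, for example after a kernel restart. The helper below is a minimal sketch of that check using only the standard library; `require_env` and `mask_secret` are hypothetical names introduced for illustration and are not part of the Azure or OpenAI SDKs.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; run the Key Vault cell first."
        )
    return value


def mask_secret(secret: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret that shows only its last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example usage (hypothetical): log that a key is present without printing it.
# api_key = require_env("OPENAI_API_KEY")
# print(f"Using OpenAI key {mask_secret(api_key)}")
```

Masking the value before logging keeps the secret out of saved notebook outputs.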


In [4]:
import pandas as pd

# Load the test data
csv_file = r'C:\Users\orgrd\workspace\data\patentmatch_test\patentmatch_test_no_claims.csv'
df = pd.read_csv(csv_file)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,index,claim_id,patent_application_id,cited_document_id,text,text_b,label,date
0,5113165,5113165,111187_0,EP3157302A1,EP2903333,A network of handling a paging procedure in a ...,FIG.16 is a diagram illustrating an example of...,0,20170419
1,5658863,5658863,209068_1,EP3202314A1,EP2229880,A sensor information processing program for ca...,In a first step the fundamental movement frequ...,1,20170809
2,5584990,5584990,171472_0,EP3196007A1,EP2939828,A moulded trim part for a vehicle according to...,It was found that the thermoplastic polyuretha...,0,20170726
3,5137320,5137320,87572_0,EP3160147A1,EP1670252,A method for fast channel change characterized...,As to the issue of delivery modes the strategy...,0,20170426
4,5800528,5800528,204115_0,EP3217403A1,EP1855216,An audio asset information storage system comp...,Further it is assumed in the above circumstanc...,0,20170913


## Prepare JSONL Files for OpenAI Processing

This section prepares the data for batch processing with OpenAI's API. Here's what we're doing:

1. **Setup**: Import required libraries and configure logging
2. **Data Model**: Define a Pydantic model `NegationResponse` whose JSON schema specifies the structure expected in OpenAI's responses
3. **Batch Processing**: 
   - Split data into batches of 1000 rows each
   - Create JSONL files with proper OpenAI API format
   - Each line contains:
     - Custom ID for tracking
     - API endpoint
     - Request body with messages and response format
4. **Output**: Save batches as separate JSONL files in `output_jsonl` directory

The JSONL format is required for OpenAI's batch processing endpoint.
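
The batching and line format described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `chunk_ranges` and `build_request_line` are hypothetical helpers, the model name and the sample `custom_id` are placeholders, and the actual cell builds each body from a DataFrame row plus the Pydantic schema.

```python
import json


def chunk_ranges(n_rows: int, batch_size: int):
    """Yield (start, end) index pairs covering n_rows without an empty trailing batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def build_request_line(custom_id: str, text: str, model: str = "placeholder-model") -> str:
    """Serialize one batch request as a single JSONL line."""
    request = {
        "custom_id": custom_id,          # used to match results back to input rows
        "method": "POST",
        "url": "/v1/chat/completions",   # endpoint the request is sent to
        "body": {
            "model": model,              # placeholder, not a real model name
            "messages": [
                {"role": "system", "content": "Analyze the text for negations and identify their types."},
                {"role": "user", "content": f"Analyze the following text: {text}"},
            ],
            "max_tokens": 500,
        },
    }
    return json.dumps(request, ensure_ascii=False)


# 2500 rows in batches of 1000 give three ranges, the last one shorter.
ranges = list(chunk_ranges(2500, 1000))

# The custom_id below is made up for the example.
line = build_request_line("request_text_EP0000000A1_0", "A method that does not require heating.")
```

Because `json.dumps` escapes newlines inside strings, each request stays on one physical line, which is what the JSONL format relies on.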

In [4]:
# Prepare the JSONL files required by the OpenAI Batch API
import json
import logging
from pathlib import Path
from typing import List, Optional
from pydantic import BaseModel

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Define the NegationResponse model
class NegationResponse(BaseModel):
    negation_present: bool
    negation_types: Optional[List[str]]
    short_explanation: str

# Create output directory
output_dir = Path('output_jsonl')
output_dir.mkdir(exist_ok=True)

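# Process in smaller batches
batch_size = 1000
# Ceiling division: avoids an empty trailing batch when len(df) is an exact multiple of batch_size
num_batches = (len(df) + batch_size - 1) // batch_size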

def create_jsonl_line(row, column):
    text = row[column]
    messages = [
        {"role": "system", "content": "Analyze the text for negations and identify their types."},
        {"role": "user", "content": f"Analyze the following text: {text}"}
    ]
    
    body = {
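        # NOTE: the "json_schema" response format (Structured Outputs) requires a model that supports it,
        # such as gpt-4o-2024-08-06 or later; gpt-4-turbo models only offer JSON mode ("json_object").
        "model": "gpt-4-turbo-preview",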
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "negation_response",
                "schema": NegationResponse.model_json_schema()
            }
        },
        "max_tokens": 500
    }
    
    return {
        "custom_id": f'request_{column}_{row["patent_application_id"]}_{row["index"]}',
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body
    }

# Process batches
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(df))
    batch_df = df.iloc[start_idx:end_idx]
    
    print(f"Processing batch {i + 1}/{num_batches}")
    lines = batch_df.apply(create_jsonl_line, axis=1, args=('text',))
    
    with open(output_dir / f"batch_{i}.jsonl", "w", encoding='utf-8') as f:
        for line in lines:
            f.write(json.dumps(line, ensure_ascii=False) + "\n")

# Show sample output
print("\nSample output from first batch:")
with open(output_dir / "batch_0.jsonl", "r", encoding='utf-8') as f:
    print(f.readline())


Processing batch 1/373
Processing batch 2/373
Processing batch 3/373
Processing batch 4/373
Processing batch 5/373
Processing batch 6/373
Processing batch 7/373
Processing batch 8/373
Processing batch 9/373
Processing batch 10/373
Processing batch 11/373
Processing batch 12/373
Processing batch 13/373
Processing batch 14/373
Processing batch 15/373
Processing batch 16/373
Processing batch 17/373
Processing batch 18/373
Processing batch 19/373
Processing batch 20/373
Processing batch 21/373
Processing batch 22/373
Processing batch 23/373
Processing batch 24/373
Processing batch 25/373
Processing batch 26/373
Processing batch 27/373
Processing batch 28/373
Processing batch 29/373
Processing batch 30/373
Processing batch 31/373
Processing batch 32/373
Processing batch 33/373
Processing batch 34/373
Processing batch 35/373
Processing batch 36/373
Processing batch 37/373
Processing batch 38/373
Processing batch 39/373
Processing batch 40/373
Processing batch 41/373
Processing batch 42/373
...

In [None]:
import os
import json
import asyncio
import aiohttp
import logging
from pathlib import Path
from datetime import datetime
import openai
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception, after_log
import aio_pika
import aiofiles
import redis.asyncio as redis

# --- Logging Setup ---
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# --- Global Concurrency Limit ---
CONCURRENCY_LIMIT = 20
semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

# --- Global Exchange Variable (set in main) ---
GLOBAL_EXCHANGE = None

# --- OpenAI API Key ---
openai.api_key = os.getenv("OPENAI_API_KEY")
if not openai.api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

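# --- RabbitMQ Connection Details ---
# Local development default; override via the RABBITMQ_URL environment variable (variable name is an assumption).
RABBITMQ_URL = os.getenv("RABBITMQ_URL", "amqp://admin:password@localhost:5672/")

# --- Redis Connection ---
# Assumes Redis runs on localhost with password "password"; override via REDIS_URL (variable name is an assumption).
redis_url = os.getenv("REDIS_URL", "redis://:password@localhost:6379/0")
redis_conn = redis.from_url(redis_url)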

# --- Directory with Files ---
batch_files_dir = Path(r'C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl')

# --- Exchange & Queue Names ---
EXCHANGE_NAME = "thesis"
DLX_NAME = "thesis_dl"
UPLOAD_QUEUE = "upload_queue"
BATCH_QUEUE = "batch_queue"
COMPLETED_QUEUE = "completed_queue"

# --- Retry Helper ---
def is_rate_limit_error(exception):
    error_str = str(exception)
    return ("rate_limit_exceeded" in error_str or 
            "10054" in error_str or 
            "502" in error_str or 
            "Bad Gateway" in error_str)

@retry(
    stop=stop_after_attempt(10),
    wait=wait_exponential(multiplier=3, min=5, max=120),
    retry=retry_if_exception(is_rate_limit_error),
    after=after_log(logger, logging.WARNING)
)
async def submit_batch_job(session, file_id):
    """Submit the file for batch processing and return the batch_id."""
    url = 'https://api.openai.com/v1/batches'
    headers = {
        'Authorization': f"Bearer {openai.api_key}",
        'Content-Type': 'application/json'
    }
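    # The Batch API expects `input_file_id`, `endpoint`, and `completion_window` in the request body.
    data = {
        'input_file_id': file_id,
        'endpoint': '/v1/chat/completions',
        'completion_window': '24h'
    }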
    async with session.post(url, headers=headers, json=data) as response:
        if response.content_type != "application/json":
            text = await response.text()
            raise Exception(f"Unexpected content type in batch submission for file {file_id}: {response.content_type}: {text}")
        if response.status != 200:
            response_json = await response.json()
            error_message = response_json.get("error", {}).get("message", "Unknown error")
            raise Exception(f"Error submitting batch job for file {file_id}: {error_message}")
        response_json = await response.json()
        batch_id = response_json.get('id')
        if not batch_id:
            raise Exception(f"Received null batch_id for file {file_id}, retrying...")
        return batch_id

@retry(
    stop=stop_after_attempt(10),
    wait=wait_exponential(multiplier=3, min=5, max=120),
    retry=retry_if_exception(is_rate_limit_error),
    after=after_log(logger, logging.WARNING)
)
async def upload_file(session, file_path: Path):
    """Upload a file to OpenAI and return metadata including file_id."""
    url = 'https://api.openai.com/v1/files'
    headers = {'Authorization': f"Bearer {openai.api_key}"}
    data = aiohttp.FormData()
    data.add_field('purpose', 'batch')
    # Read file contents with aiofiles so the file is closed properly.
    async with aiofiles.open(file_path, 'rb') as f:
        file_bytes = await f.read()
    data.add_field('file', file_bytes, filename=file_path.name, content_type='application/jsonl')
    async with session.post(url, headers=headers, data=data) as response:
        if response.content_type != "application/json":
            text = await response.text()
            raise Exception(f"Unexpected content type {response.content_type} for file {file_path.name}: {text}")
        if response.status != 200:
            try:
                response_json = await response.json()
                error_message = response_json.get("error", {}).get("message", "Unknown error")
            except Exception:
                error_message = await response.text()
            raise Exception(f"Error uploading {file_path.name}: {response.status} {error_message}")
        response_json = await response.json()
        file_id = response_json['id']
        metadata = {
            "file_id": file_id,
            "original_filename": file_path.name,
            "status": response_json['status'],
            "created_at": response_json['created_at'],
            "upload_timestamp": datetime.now().isoformat(),
            "bytes": response_json['bytes'],
            "purpose": response_json['purpose']
        }
        return metadata

# --- Consumer Callbacks ---

async def upload_consumer(message: aio_pika.IncomingMessage):
    """Upload consumer: uploads the file and publishes a batch task."""
    async with semaphore:
        async with message.process():
            try:
                payload = json.loads(message.body.decode())
                file_path = Path(payload["file_path"])
                logger.info(f"Uploading file: {file_path}")
                async with aiohttp.ClientSession() as session:
                    metadata = await upload_file(session, file_path)
                    logger.info(f"Uploaded {file_path.name}, file_id: {metadata['file_id']}")
                    # Write metadata to Redis
                    await redis_conn.hset(f"file:{metadata['file_id']}", mapping=metadata)
                    # Publish batch task message using the global exchange.
                    batch_payload = {"file_id": metadata["file_id"]}
                    await GLOBAL_EXCHANGE.publish(
                        aio_pika.Message(body=json.dumps(batch_payload).encode()),
                        routing_key="batch"
                    )
            except Exception as e:
                logger.error(f"Upload failed: {e}")
                raise e  # Reject message so it goes to DLX

async def batch_consumer(message: aio_pika.IncomingMessage):
    """Batch consumer: submits a batch job and publishes a completion message."""
    async with semaphore:
        async with message.process():
            try:
                payload = json.loads(message.body.decode())
                file_id = payload["file_id"]
                logger.info(f"Submitting batch for file_id: {file_id}")
                async with aiohttp.ClientSession() as session:
                    batch_id = await submit_batch_job(session, file_id)
                    logger.info(f"Batch submitted for file {file_id}, batch_id: {batch_id}")
                    # Update Redis with batch_id and status
                    await redis_conn.hset(f"file:{file_id}", mapping={"batch_id": batch_id, "status": "batch_submitted", "batch_timestamp": datetime.now().isoformat()})
                    # Publish completion message
                    completed_payload = {"file_id": file_id, "batch_id": batch_id}
                    await GLOBAL_EXCHANGE.publish(
                        aio_pika.Message(body=json.dumps(completed_payload).encode()),
                        routing_key="completed"
                    )
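            except Exception as e:
                # `payload` may be unbound if JSON decoding failed, so log the raw message body instead.
                logger.error(f"Batch submission failed for message {message.body!r}: {e}")
                raise  # Reject message so it goes to DLX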

async def completed_consumer(message: aio_pika.IncomingMessage):
    """Completed consumer: logs the completion and updates Redis."""
    async with semaphore:
        async with message.process():
            payload = json.loads(message.body.decode())
            file_id = payload["file_id"]
            batch_id = payload["batch_id"]
            logger.info(f"Completed processing: file_id {file_id}, batch_id {batch_id}")
            # Update Redis status to completed
            await redis_conn.hset(f"file:{file_id}", mapping={"status": "completed", "completed_timestamp": datetime.now().isoformat()})

# --- Main Function: Setup Exchanges, Queues, Consumers, and Producers ---
async def main():
    global GLOBAL_EXCHANGE
    connection = await aio_pika.connect_robust(RABBITMQ_URL)
    async with connection:
        channel = await connection.channel()
        # Declare exchanges
        GLOBAL_EXCHANGE = await channel.declare_exchange(EXCHANGE_NAME, aio_pika.ExchangeType.DIRECT)
        dlx = await channel.declare_exchange(DLX_NAME, aio_pika.ExchangeType.FANOUT)
        # Declare queues with DLX arguments
        upload_queue = await channel.declare_queue(UPLOAD_QUEUE, durable=True, arguments={"x-dead-letter-exchange": DLX_NAME})
        batch_queue = await channel.declare_queue(BATCH_QUEUE, durable=True, arguments={"x-dead-letter-exchange": DLX_NAME})
        completed_queue = await channel.declare_queue(COMPLETED_QUEUE, durable=True, arguments={"x-dead-letter-exchange": DLX_NAME})
        # Declare dead-letter queue and bind it
        dead_letter_queue = await channel.declare_queue("dead_letter_queue", durable=True)
        await dead_letter_queue.bind(dlx, routing_key="")
        # Bind queues to the exchange with routing keys
        await upload_queue.bind(GLOBAL_EXCHANGE, routing_key="upload")
        await batch_queue.bind(GLOBAL_EXCHANGE, routing_key="batch")
        await completed_queue.bind(GLOBAL_EXCHANGE, routing_key="completed")
        # Start consumers
        await upload_queue.consume(upload_consumer)
        await batch_queue.consume(batch_consumer)
        await completed_queue.consume(completed_consumer)
        # Producer: Enqueue an upload task for every file matching "batch_*.jsonl"
        for file in sorted(batch_files_dir.glob("batch_*.jsonl")):
            payload = {"file_path": str(file)}
            await GLOBAL_EXCHANGE.publish(
                aio_pika.Message(body=json.dumps(payload).encode()),
                routing_key="upload"
            )
            logger.info(f"Enqueued upload task for {file.name}")
        logger.info("Pipeline is running. Press CTRL+C to exit.")
        await asyncio.Future()  # Run indefinitely

# Run the main function in the current event loop.
await main()


INFO:__main__:Enqueued upload task for batch_0.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl\batch_0.jsonl
INFO:__main__:Enqueued upload task for batch_1.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl\batch_1.jsonl
INFO:__main__:Enqueued upload task for batch_10.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl\batch_10.jsonl
INFO:__main__:Enqueued upload task for batch_100.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl\batch_100.jsonl
INFO:__main__:Enqueued upload task for batch_101.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project\notebooks\output_jsonl\batch_101.jsonl
INFO:__main__:Enqueued upload task for batch_102.jsonl
INFO:__main__:Uploading file: C:\Users\orgrd\workspace\repos\runi-thesis-project

CancelledError: 

## Process OpenAI Batch Requests

This section handles batch processing with OpenAI, including:
1. Reading file IDs from tracking directory
2. Managing batch submissions (max 50 concurrent batches)
3. Tracking progress and handling errors
4. Retrying failed requests
5. Saving results as they arrive
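
The concurrency cap and retry behaviour listed above can be illustrated with `asyncio` alone. The sketch below is not the pipeline used in the cells that follow (those rely on RabbitMQ and `tenacity`); `retry_with_backoff`, `run_bounded`, `make_flaky_job`, and `demo` are hypothetical names, and the flaky job merely stands in for a real batch submission.

```python
import asyncio


async def retry_with_backoff(make_call, attempts=4, base_delay=0.01, max_delay=0.1):
    """Await make_call(), retrying on any exception with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(min(max_delay, base_delay * 2 ** attempt))


async def run_bounded(jobs, limit=50):
    """Run coroutine factories with at most `limit` in flight; return results or exceptions."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await retry_with_backoff(job)

    return await asyncio.gather(*(guarded(job) for job in jobs), return_exceptions=True)


def make_flaky_job(name, failures):
    """Hypothetical stand-in for a submission that fails `failures` times before succeeding."""
    state = {"left": failures}

    async def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("rate_limit_exceeded")
        return f"{name}: submitted"

    return job


async def demo():
    jobs = [
        make_flaky_job("batch_0", 2),   # succeeds on the third attempt
        make_flaky_job("batch_1", 0),   # succeeds immediately
        make_flaky_job("batch_2", 10),  # exhausts all four attempts
    ]
    return await run_bounded(jobs, limit=2)
```

In a notebook, `results = await demo()` runs the sketch; in a standalone script, `asyncio.run(demo())` does the same. Passing `return_exceptions=True` to `asyncio.gather` keeps one permanently failing job from cancelling the others, so failures can be inspected and retried later.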

In [2]:
import os
import json
import asyncio
import aiohttp
import logging
from pathlib import Path
from datetime import datetime
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception, after_log
import aio_pika

# --- Logging Setup ---
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# --- RabbitMQ Connection Details ---
# (Using the docker compose settings: username=admin, password=password, host=rabbitmq)
RABBITMQ_URL = "amqp://admin:password@rabbitmq:5672/"

# --- OpenAI API Key ---
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# --- Directory with Files to Process ---
output_dir = Path('output_jsonl')  # This folder should contain files matching "batch_*.jsonl"

# --- Define Exchange and Queue Names ---
EXCHANGE_NAME = "thesis"         # main exchange
DLX_NAME = "thesis_dl"           # dead letter exchange
UPLOAD_QUEUE = "upload_queue"
BATCH_QUEUE = "batch_queue"
COMPLETED_QUEUE = "completed_queue"

# --- Retry Helpers for Rate Limits ---
def is_rate_limit_error(exception):
    return "rate_limit_exceeded" in str(exception)

@retry(
    after=after_log(logger, logging.WARNING),
    stop=stop_after_attempt(10),
    wait=wait_exponential(multiplier=3, min=5, max=120),
    retry=retry_if_exception(is_rate_limit_error)
)
async def submit_batch_job(session, file_id):
    """
    Submit the uploaded file for batch processing.
    Returns the batch_id on success.
    """
    url = 'https://api.openai.com/v1/batches'
    headers = {
        'Authorization': f"Bearer {api_key}",
        'Content-Type': 'application/json'
    }
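    # The Batch API expects `input_file_id`, `endpoint`, and `completion_window` in the request body.
    data = {
        'input_file_id': file_id,
        'endpoint': '/v1/chat/completions',
        'completion_window': '24h'
    }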
    async with session.post(url, headers=headers, json=data) as response:
        response_json = await response.json()
        if response.status != 200:
            error_message = response_json.get("error", {}).get("message", "Unknown error")
            raise Exception(f"Error submitting batch job for file {file_id}: {error_message}")
        batch_id = response_json.get('id')
        if not batch_id:
            raise Exception(f"Received null batch_id for file {file_id}, retrying...")
        return batch_id

@retry(
    after=after_log(logger, logging.WARNING),
    stop=stop_after_attempt(10),
    wait=wait_exponential(multiplier=3, min=5, max=120),
    retry=retry_if_exception(is_rate_limit_error)
)
async def upload_file(session, file_path: Path):
    """
    Upload a file to OpenAI and return the file_id along with metadata.
    """
    url = 'https://api.openai.com/v1/files'
    headers = {
        'Authorization': f"Bearer {api_key}",
    }
    data = aiohttp.FormData()
    data.add_field('purpose', 'batch')
    # Read the bytes up front so no file handle is left open, and each retry re-reads the file.
    data.add_field('file', file_path.read_bytes(), filename=file_path.name, content_type='application/jsonl')
    async with session.post(url, headers=headers, data=data) as response:
        response_json = await response.json()
        if response.status != 200:
            error_message = response_json.get("error", {}).get("message", "Unknown error")
            raise Exception(f"Error uploading {file_path.name}: {error_message}")
        file_id = response_json['id']
        metadata = {
            "file_id": file_id,
            "original_filename": file_path.name,
            "status": response_json['status'],
            "created_at": response_json['created_at'],
            "upload_timestamp": datetime.now().isoformat(),
            "bytes": response_json['bytes'],
            "purpose": response_json['purpose']
        }
        return metadata

# --- Consumer Callbacks ---
async def on_upload_message(message: aio_pika.IncomingMessage):
    """Process messages from the 'upload' queue: upload file and then forward for batch submission."""
    async with message.process():
        try:
            payload = json.loads(message.body)
            file_path_str = payload.get("file_path")
            file_path = Path(file_path_str)
            async with aiohttp.ClientSession() as session:
                metadata = await upload_file(session, file_path)
                logger.info(f"Uploaded {file_path.name}: file_id {metadata['file_id']}")
                # Publish to the batch queue
                batch_payload = {
                    "file_id": metadata["file_id"],
                    "original_filename": metadata["original_filename"]
                }
                channel = message.channel
                exchange = await channel.get_exchange(EXCHANGE_NAME)
                await exchange.publish(
                    aio_pika.Message(body=json.dumps(batch_payload).encode()),
                    routing_key="batch"
                )
        except Exception as e:
            logger.error(f"Upload processing failed: {e}")
            raise e  # Message will be rejected and eventually dead-lettered

async def on_batch_message(message: aio_pika.IncomingMessage):
    """Process messages from the 'batch' queue: submit batch job."""
    async with message.process():
        try:
            payload = json.loads(message.body)
            file_id = payload.get("file_id")
            async with aiohttp.ClientSession() as session:
                batch_id = await submit_batch_job(session, file_id)
                logger.info(f"Batch submitted for file {file_id}: batch_id {batch_id}")
                # Publish to the completed queue
                completed_payload = {
                    "file_id": file_id,
                    "batch_id": batch_id,
                    "timestamp": datetime.now().isoformat()
                }
                channel = message.channel
                exchange = await channel.get_exchange(EXCHANGE_NAME)
                await exchange.publish(
                    aio_pika.Message(body=json.dumps(completed_payload).encode()),
                    routing_key="completed"
                )
        except Exception as e:
            logger.error(f"Batch processing failed: {e}")
            raise e

# --- Main Function: Setup Exchanges, Queues, Producers, and Consumers ---
async def main():
    # Connect to RabbitMQ
    connection = await aio_pika.connect_robust(RABBITMQ_URL)
    async with connection:
        channel = await connection.channel()

        # Declare main exchange (direct) and dead-letter exchange (fanout)
        exchange = await channel.declare_exchange(EXCHANGE_NAME, aio_pika.ExchangeType.DIRECT)
        dlx = await channel.declare_exchange(DLX_NAME, aio_pika.ExchangeType.FANOUT)

        # Declare queues with dead-letter exchange arguments
        upload_queue = await channel.declare_queue(
            UPLOAD_QUEUE,
            durable=True,
            arguments={"x-dead-letter-exchange": DLX_NAME}
        )
        batch_queue = await channel.declare_queue(
            BATCH_QUEUE,
            durable=True,
            arguments={"x-dead-letter-exchange": DLX_NAME}
        )
        completed_queue = await channel.declare_queue(
            COMPLETED_QUEUE,
            durable=True,
            arguments={"x-dead-letter-exchange": DLX_NAME}
        )
        # Declare a dead-letter queue for messages that fail repeatedly
        dead_letter_queue = await channel.declare_queue("dead_letter_queue", durable=True)
        await dead_letter_queue.bind(dlx, routing_key="")

        # Bind queues to the exchange using routing keys
        await upload_queue.bind(exchange, routing_key="upload")
        await batch_queue.bind(exchange, routing_key="batch")
        await completed_queue.bind(exchange, routing_key="completed")

        # Start consumers for the upload and batch queues
        await upload_queue.consume(on_upload_message)
        await batch_queue.consume(on_batch_message)

        # --- Producer: Enqueue Upload Tasks ---
        # Publish an upload task for every file matching "batch_*.jsonl" in output_dir
        for file in sorted(output_dir.glob("batch_*.jsonl")):
            payload = {"file_path": str(file)}
            await exchange.publish(
                aio_pika.Message(body=json.dumps(payload).encode()),
                routing_key="upload"
            )
            logger.info(f"Enqueued upload task for {file.name}")

        logger.info("Waiting for messages. To exit press CTRL+C")
        # Run indefinitely
        await asyncio.Future()

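# Run the main function. A notebook already has a running event loop, so use top-level await;
# in a standalone script, call asyncio.run(main()) instead.
await main()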


ValueError: Please set the OPENAI_API_KEY environment variable.

In [18]:
import os
import json
import asyncio
import aiohttp
from pathlib import Path
from datetime import datetime, timedelta
import openai
from tenacity import retry, stop_after_attempt, wait_exponential

# Set your OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')
if not openai.api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# Define directories
tracking_dir = Path('output_jsonl/tracking')
results_dir = tracking_dir / "results"
results_dir.mkdir(parents=True, exist_ok=True)

# Retry configuration: 3 attempts with exponential backoff
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60))
async def fetch_batch_status(session, batch_id):
    """Fetch the status of a batch job."""
    url = f'https://api.openai.com/v1/batches/{batch_id}'
    headers = {
        'Authorization': f"Bearer {openai.api_key}",
    }
    async with session.get(url, headers=headers) as response:
        if response.status == 200:
            return await response.json()
        else:
            error_text = await response.text()
            raise Exception(f"Error fetching status for batch {batch_id}: {error_text}")

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60))
async def download_file(session, file_id, destination):
    """Download a file from OpenAI and save it to the destination."""
    url = f'https://api.openai.com/v1/files/{file_id}/content'
    headers = {
        'Authorization': f"Bearer {openai.api_key}",
    }
    async with session.get(url, headers=headers) as response:
        if response.status == 200:
            with open(destination, 'wb') as f:
                f.write(await response.read())
        else:
            error_text = await response.text()
            raise Exception(f"Error downloading file {file_id}: {error_text}")

async def process_batch(session, metadata_path):
    """Process a single batch based on its metadata file."""
    with open(metadata_path, 'r') as f:
        metadata = json.load(f)

    batch_id = metadata.get('batch_id')
    if not batch_id:
        print(f"No batch_id found in metadata {metadata_path.name}. Skipping.")
        return

    # Check batch status
    try:
        print(f"Checking status for batch {batch_id}...")
        batch_status = await fetch_batch_status(session, batch_id)
        metadata['batch_status'] = batch_status['status']

        # Save updated metadata
        with open(metadata_path, 'w') as f:
            json.dump(metadata, f, indent=2)

        if batch_status['status'] == 'completed':
            output_file_id = batch_status.get('output_file_id')
            if output_file_id:
                output_path = results_dir / f"{batch_id}_results.jsonl"
                await download_file(session, output_file_id, output_path)
                print(f"Results for batch {batch_id} saved to {output_path}")
            else:
                print(f"Batch {batch_id} completed, but no output file found.")
        elif batch_status['status'] in ['failed', 'expired']:
            print(f"Batch {batch_id} failed or expired.")

    except Exception as e:
        print(f"Error processing batch {batch_id}: {e}")

async def main_check_batches():
    """Main function to check the status of all batch jobs."""
    async with aiohttp.ClientSession() as session:
        tasks = []
        for metadata_file in tracking_dir.glob("*_metadata.json"):
            tasks.append(process_batch(session, metadata_file))

        await asyncio.gather(*tasks)

# Run the batch checking function
await main_check_batches()


No batch_id found in metadata file-14kUprdXHVfy4c9QFkGbrF_metadata.json. Skipping.
No batch_id found in metadata file-14wPTMV99qr2uNevJawPRd_metadata.json. Skipping.
No batch_id found in metadata file-1aDJ7SoQhDh12BAGzFNbgm_metadata.json. Skipping.
No batch_id found in metadata file-1ip9b9hkBn5hn6H8aFjRWy_metadata.json. Skipping.
No batch_id found in metadata file-1Kd8cmQieKgDQCnXmd5La5_metadata.json. Skipping.
No batch_id found in metadata file-1tHtrGc4YUDvVikckkYSJh_metadata.json. Skipping.
No batch_id found in metadata file-1trT6Y1JFP6y9tPP3ZvMjo_metadata.json. Skipping.
No batch_id found in metadata file-1xSYGqcq6bgi3nrGpi89Lw_metadata.json. Skipping.
No batch_id found in metadata file-2Bx2AnihwVWttqq4hnZk2T_metadata.json. Skipping.
No batch_id found in metadata file-2C7cEJYYYmT9q2DyNtQ8RM_metadata.json. Skipping.
No batch_id found in metadata file-2TB2XdH27JgaSA6qWwu8ih_metadata.json. Skipping.
No batch_id found in metadata file-2uV4Hyee78BvXhRgiSPsoh_metadata.json. Skipping.
No b