# 🔄 Running Batch AI Validation Jobs with Azure OpenAI

This notebook demonstrates how to submit and monitor batch inference jobs using Azure OpenAI. Batch jobs allow you to process large volumes of input data asynchronously, making them ideal for running validation jobs with AI at scale. 



## Importing libraries

In [1]:
from openai import AzureOpenAI
import requests
import json
import pandas as pd



## Importing Keys 

These should be defined in a `config.py` file at the root level of your project folder.

In [2]:
from config import AZURE_API_KEY, AZURE_API_VERSION, AZURE_ENDPOINT, AZURE_MODEL_NAME
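
For reference, a `config.py` along these lines would satisfy the import above. This is a minimal sketch, not the project's actual file: reading the values from environment variables is an assumption, and the default API version is simply the one that appears in the request URLs later in this notebook.

```python
# config.py (hypothetical sketch): keep secrets out of the notebook
import os

# Environment variable names are assumptions; use whatever your setup defines
AZURE_API_KEY = os.environ.get("AZURE_API_KEY", "")
AZURE_API_VERSION = os.environ.get("AZURE_API_VERSION", "2025-01-01-preview")
AZURE_ENDPOINT = os.environ.get("AZURE_ENDPOINT", "")      # resource base URL
AZURE_MODEL_NAME = os.environ.get("AZURE_MODEL_NAME", "")  # batch deployment name
```

Keeping the key in the environment rather than in the file makes it harder to commit a secret by accident.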

## Importing Case Validation Data

We'll use a cross-validation set of 921 cases whose insights were manually labelled by the SEC team.

In [4]:
csv_path = "cross_validation_set.csv"

cases = pd.read_csv(csv_path)

print(cases.shape)

pd.set_option('display.max_columns', None)
display(cases[:5])

(921, 23)


Unnamed: 0,ID,Case Number,UPN,Line Of Business,Business Goals and Needs,Business Goal Validation,Business Goal Comment,Product Led Growth Conversation,PLG Conversation Validation,PLG Conversation Comment,Product Feedback and Limitations,Product Feedback and Limitations Validation,Product Feedback and Limitations Comment,Copilot Insights,Copilot Insights Validation,Copilot Insights Comment,Recommendation Details,Recommendation Details Validation,Recommendation Details Comment,UT_ID,Submission Day,Invalids_Detection,CaseNumber_C
0,29072,2504070040002405,gig_wfh_jewan@microsoftsupport.com,Trials Nurturing Proactive,An upcoming nonprofit focused on health initia...,1,It details the need of the customer,I recommended Forms which can help with creati...,1,Recommendation addresses the need of the customer,No data submitted,0,No data submitted,No data submitted,0,No data submitted,Nonprofit will focus on health initiatives and...,0,The entry lacks detail of a copilot recommende...,PT67847,"04/09 Wednesday, 2025",Invalids_detected,2504070040002405
1,29067,2504091420002457,gig_wfh_abelr@microsoftsupport.com,Business Advisor Reactive,- A consulting firm specializing in Lean Manag...,1,The entry details the need of the customer,M365 PLG: \n- I recommended that the customer...,0,The entry does not align with the customer's b...,No data submitted,0,No data submitted,- The customer expressed dissatisfaction with ...,1,Insight highlights customer dissatisfaction wi...,Copilot Value add: \n- Given the customer's fr...,1,"It details the feature, and it benefit on cust...",UT142039,"04/09 Wednesday, 2025",Invalids_detected,2504091420002457
2,29063,2503190050003421,gig_wfh_romce@microsoftsupport.com,Business Advisor Reactive,This company is a residential letting agency a...,0,"Focuses on tool cx is interested in using, no ...",M365 PLG: \nGiven the customer's interest in u...,0,"No clear business goal was identified, and PLG...",M365 Product Insights: \nCustomer mentioned th...,1,Valid,No data submitted,0,No data submitted,Copilot Value add: \nalked about Copilot. The ...,1,VAlid,UT142023,"04/09 Wednesday, 2025",Invalids_detected,2503190050003421
3,29059,2504090030004589,gig_wfh_balpa@microsoftsupport.com,Business Advisor Reactive,"The cx, operating in the agriculture sector, ...",1,Valid: mentions a need that can be actioned wi...,M365 PLG: \nProvided an overview of the Teams ...,1,Valid: Addressed features related to business ...,M365 Product Insights: \nDuring the migration ...,1,Valid: actioanble feedback.,No data submitted,0,No data submitted,Copilot Value add: \nCoPilot can assist in cre...,0,"Invalid: not relative to customer's goal, busi...",UT141994,"04/09 Wednesday, 2025",Invalids_detected,2504090030004589
4,29058,2504091420001155,gig_wfh_atbal@microsoftsupport.com,Business Advisor Reactive,Our customer is dedicated to reducing the ener...,0,Not added to DFM case note,"M365 PLG: \nI recommend using PowerPoint's ""Cr...",0,Not added to DFM case note,M365 Product Insights: \nOur customer finds us...,0,Not added to DFM case note,Customer dissatisfaction with Copilot often ar...,0,Not added to DFM case note,Copilot Value add: \nI recommend using Copilot...,0,Not added to DFM case note,UT141992,"04/09 Wednesday, 2025",Invalids_detected,2504091420001155


## Importing our prompts 

As you can see in the dataset, we have different kinds of insights. Each of them needs its own validation prompt.

In [15]:
prompt_files = {
    "BUSINESS_GOAL_PROMPT": "./prompts/business_goal.md",
    "COPILOT_FEEDBACK_PROMPT": "./prompts/copilot_feedback.md",
    "COPILOT_VALUE_ADD_PROMPT": "./prompts/copilot_value_add.md",
    "M365_PRODUCT_FEEDBACK_PROMPT": "./prompts/m365_product_feedback.md",
    "M365_RECOMMENDATION": "./prompts/m365_recommendation.md"
}

# Load prompts into a dictionary
prompts = {}
for name, path in prompt_files.items():
    with open(path, "r", encoding="utf-8") as f:
        prompts[name] = f.read()

# Access like this
print(prompts["BUSINESS_GOAL_PROMPT"][:15])

You are an AI a


## Validation File Preparation in JSONL (JSON Lines)

Our Azure batch processor expects the input file to be in JSONL format — short for JSON Lines. JSONL is a convenient format for streaming structured data where each line is a separate, self-contained JSON object.

📌 Format Rules:
- Each line must be a valid JSON object.
- No commas between lines (unlike in a JSON array).
- The file must not begin with \[ or end with \] — it's not a JSON array.
- Lines should be separated by newline characters (\n).

Example of JSON Lines formatted as expected by Azure:

```
{"custom_id": "task-0", "method": "POST", "url": "/chat/completions", "body": {"model": "REPLACE-WITH-MODEL-DEPLOYMENT-NAME", "messages": [{"role": "system", "content": "You are an AI assistant that validates product feedback"}, {"role": "user", "content": "Copilot is really great, helps me a lot daily"}]}}
{"custom_id": "task-1", "method": "POST", "url": "/chat/completions", "body": {"model": "REPLACE-WITH-MODEL-DEPLOYMENT-NAME", "messages": [{"role": "system", "content": "You are an AI assistant that validates product feedback"}, {"role": "user", "content": "Copilot is really great, helps me a lot daily"}]}}

```
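
The format rules above can be checked before uploading. The following is a minimal sketch: the function name `find_jsonl_problems` is hypothetical, and the set of required keys is taken from the example lines above rather than from an official schema.

```python
import json

# Keys seen in the example request lines; treated here as required (an assumption)
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def find_jsonl_problems(text):
    """Return a list of (line_number, problem) tuples for a JSONL batch payload."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped here; the service may treat them differently
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems

good = '{"custom_id": "task-0", "method": "POST", "url": "/chat/completions", "body": {}}'
bad = '[{"custom_id": "task-1"}]'  # a JSON array, which the rules forbid
print(find_jsonl_problems(good + "\n" + bad))  # [(2, 'not a JSON object')]
```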

### Important


The custom_id is required to allow you to identify which individual batch request corresponds to a given response. *Responses won't be returned in identical order to the order defined in the .jsonl batch file.* 

The model attribute must be set to match the name of the Global Batch deployment you wish to target for inference responses. The same Global Batch model deployment name must be present on each line of the batch file. If you want to target a different deployment you must do so in a separate batch file/job.
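
Both constraints can be verified on a batch file before it is submitted. This is a sketch under assumptions: `summarize_batch_lines` is a hypothetical helper, the sample lines are made up, and treating duplicate `custom_id` values as a problem is a reading of the note above (duplicates would make responses ambiguous), not a documented rule.

```python
import json
from collections import Counter

def summarize_batch_lines(lines):
    """Return (duplicate_custom_ids, models) for an iterable of JSONL request lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    id_counts = Counter(record["custom_id"] for record in records)
    duplicates = sorted(cid for cid, count in id_counts.items() if count > 1)
    models = {record["body"]["model"] for record in records}
    return duplicates, models

# Made-up request lines, trimmed to the fields this check reads
sample = [
    '{"custom_id": "BG-1", "body": {"model": "my-batch-deployment"}}',
    '{"custom_id": "BG-1", "body": {"model": "my-batch-deployment"}}',
    '{"custom_id": "BG-2", "body": {"model": "other-deployment"}}',
]
duplicates, models = summarize_batch_lines(sample)
print(duplicates)   # ['BG-1']
print(len(models))  # 2, so this file mixes deployments and should be split
```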




In [20]:
column_mapping = { 
    "BUSINESS_GOAL_PROMPT": {'column_name': "Business Goals and Needs", 'identifier': 'BG'},
    "COPILOT_FEEDBACK_PROMPT": {'column_name': "Copilot Insights", 'identifier': 'CFB'},
    "COPILOT_VALUE_ADD_PROMPT": {'column_name': "Recommendation Details", 'identifier': 'CVA'},
    "M365_PRODUCT_FEEDBACK_PROMPT": {'column_name': "Product Feedback and Limitations", 'identifier': 'PFB'},
    "M365_RECOMMENDATION": {'column_name': "Product Led Growth Conversation", 'identifier': 'PREC'}
}


def transform_csv_to_jsonl(prompt_template, jsonl_path, case_limit=15):
    """Write one batch request per case to a JSONL file, using the first `case_limit` cases."""

    if prompt_template not in column_mapping:
        raise ValueError(f"Prompt '{prompt_template}' not found in column_mapping.")

    prompt = prompts[prompt_template]
    identifier = column_mapping[prompt_template]['identifier']
    insights_column = column_mapping[prompt_template]['column_name']

    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as output_file:
        for _, row in cases.head(case_limit).iterrows():
            insight = row[insights_column]
            # Skip missing values: pandas reads empty CSV cells as NaN, not '' or None
            if pd.isna(insight) or str(insight).strip() == '':
                continue

            jsonl_row = {
                "custom_id": f"{identifier}-{row['UT_ID']}",
                "method": "POST",
                "url": "/chat/completions",
                "body": {
                    "model": AZURE_MODEL_NAME,
                    "messages": [
                        {"role": "system", "content": prompt},
                        {"role": "user", "content": str(insight)}
                    ],
                    # Structured outputs: the schema is nested under "schema", next to a "name"
                    "response_format": {
                        "type": "json_schema",
                        "json_schema": {
                            "name": "insight_validation",  # arbitrary label chosen here
                            "strict": True,
                            "schema": {
                                "type": "object",
                                "properties": {
                                    "valid": {"type": "boolean"},
                                    "comment": {"type": "string"}
                                },
                                "required": ["valid", "comment"],
                                "additionalProperties": False
                            }
                        }
                    }
                }
            }

            output_file.write(json.dumps(jsonl_row, ensure_ascii=False) + "\n")
            written += 1

    # Report the rows actually written, which can be fewer than case_limit
    print(f"✅ Created JSONL at {jsonl_path} with {written} insights.")

In [21]:
transform_csv_to_jsonl('BUSINESS_GOAL_PROMPT', './business_goals_batch.jsonl')

✅ Created JSONL at ./business_goals_batch.jsonl with 15 insights.


## Creating a batch job 

Since each type of insight requires a different validation prompt, it is a better approach to run independent batches for each type of insight. This will keep things easier when it comes to debugging, retrying and optimising prompts.

Each batch file must target a single deployment (see the note above), so keeping one prompt per job as well makes every job uniform, which is simpler and safer.
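
One way to organise this is to derive one JSONL path per insight type and loop over them. The sketch below is illustrative only: `plan_batch_files` and `example_mapping` are hypothetical names, the mapping is a trimmed copy of two entries from `column_mapping`, and the file naming scheme is an assumption.

```python
from pathlib import Path

# Trimmed, hypothetical copy of two column_mapping entries, to keep the sketch self-contained
example_mapping = {
    "BUSINESS_GOAL_PROMPT": {"identifier": "BG"},
    "COPILOT_FEEDBACK_PROMPT": {"identifier": "CFB"},
}

def plan_batch_files(mapping, out_dir="."):
    """Return {prompt_name: jsonl_path}, one batch file per insight type."""
    return {
        name: str(Path(out_dir) / f"{spec['identifier'].lower()}_batch.jsonl")
        for name, spec in mapping.items()
    }

for prompt_name, path in plan_batch_files(example_mapping, "batches").items():
    # In the notebook, transform_csv_to_jsonl(prompt_name, path) followed by
    # submit_batch_job(path) would run here, once per insight type
    print(prompt_name, "->", path)
```

With one file per type, a failed or poorly performing prompt can be rerun without touching the other jobs.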

In [None]:
def submit_batch_job(input_file_path):
    """Upload a JSONL file and submit it as a batch job; returns the batch JSON."""

    client = AzureOpenAI(
        api_key=AZURE_API_KEY,
        api_version=AZURE_API_VERSION,
        azure_endpoint=AZURE_ENDPOINT
    )
    # The upload must go to the /files route; sending it to completions fails

    # Upload the file with a purpose of "batch"; the context manager closes the handle
    with open(input_file_path, "rb") as batch_file:
        file = client.files.create(
            file=batch_file,
            purpose="batch",
            # Optional: expire the file after 14-30 days (1209600-2592000 seconds)
            # extra_body={"expires_after": {"seconds": 1209600, "anchor": "created_at"}}
        )

    print(file.model_dump_json(indent=2))

    #print(f"File expiration: {datetime.fromtimestamp(file.expires_at) if file.expires_at is not None else 'Not set'}")

    # Submit a batch job with the file
    headers = {
        "api-key": AZURE_API_KEY,
    }

    body = {
        "input_file_id": file.id,
        "endpoint": "/chat/completions",
        "completion_window": "24h",
        "model": AZURE_MODEL_NAME,  # Use your Azure deployment name here
        "output_format": "jsonl",  # Optional, usually jsonl
        "output_expires_after": {
            "seconds": 1209600,
            "anchor": "created_at"
        }
    }

    # Assumes AZURE_ENDPOINT is the resource base URL (https://<resource>.openai.azure.com);
    # posting to the bare base URL does not reach the batches route
    batches_url = f"{AZURE_ENDPOINT.rstrip('/')}/openai/batches?api-version={AZURE_API_VERSION}"
    response = requests.post(batches_url, headers=headers, json=body)

    print(response.status_code)
    print(response.json())

    return response.json()

In [23]:
response_case = submit_batch_job("./business_goals_batch.jsonl")

{
  "id": "file-a7b023ce23f84682ba146fb085204ab7",
  "bytes": 33259,
  "created_at": 1745232945,
  "filename": "business_goals_batch.jsonl",
  "object": "file",
  "purpose": "batch",
  "status": "processed",
  "expires_at": null,
  "status_details": null
}
200
{'cancelled_at': None, 'cancelling_at': None, 'completed_at': None, 'completion_window': '24h', 'created_at': 1745232946, 'error_file_id': '', 'expired_at': None, 'expires_at': 1745319346, 'failed_at': None, 'finalizing_at': None, 'id': 'batch_29d3fb4b-27f5-430e-93fc-f905613087d2', 'in_progress_at': None, 'input_file_id': 'file-a7b023ce23f84682ba146fb085204ab7', 'errors': None, 'metadata': None, 'object': 'batch', 'output_file_id': '', 'request_counts': {'total': 0, 'completed': 0, 'failed': 0}, 'status': 'validating', 'endpoint': '/chat/completions'}
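
The response above shows the job starting in the `validating` status, so the result has to be polled for. Below is a minimal polling sketch under stated assumptions: `wait_for_batch` is a hypothetical helper, the status callback is injected so the loop can be tried without network access, and the set of terminal statuses follows the Batch API's documented lifecycle rather than anything shown in this notebook.

```python
import time

# Statuses after which a batch job no longer changes (assumed from the Batch API lifecycle)
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}

def wait_for_batch(fetch_status, poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    return status  # still not terminal after max_polls checks

# Simulated job: four status checks, no real waiting
fake_statuses = iter(["validating", "in_progress", "finalizing", "completed"])
print(wait_for_batch(lambda: next(fake_statuses), poll_seconds=0))  # completed
```

In the notebook, `fetch_status` could wrap a request for the batch object and return its `status` field.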


## Tracking the batch job: checking status and response

In [29]:
# Track the batch job
headers = {
    "api-key": AZURE_API_KEY,
}

def track_batch(batch_id):
    """Fetch a batch job and print its status; returns the batch JSON."""
    url = f"https://validationtest.openai.azure.com/openai/batches/{batch_id}?api-version={AZURE_API_VERSION}"

    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    info = resp.json()
    print(f"Batch status: {info['status']}")
    print(f"Batch json: {info}")
    if info["status"] == "completed":
        # The model answers live in the output file, not in the batch object
        print(f"Output file: {info['output_file_id']}")
    # The batch object has no "valid" or "comment" keys, so return it whole
    return info

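# track_batch expects the batch ID ("batch_..."), not the uploaded file ID ("file-...")
case_results = track_batch(response_case["id"])

Passing the file ID instead, as in `track_batch("file-a7b023ce23f84682ba146fb085204ab7")`, fails with:

HTTPError: 404 Client Error: The requested job 'file-a7b023ce23f84682ba146fb085204ab7' does not exist under the account 'Validati for url: https://validationtest.openai.azure.com/openai/batches/file-a7b023ce23f84682ba146fb085204ab7?api-version=2025-01-01-preview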
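
Once a job reports `completed`, the answers have to be read from the output file referenced by `output_file_id`, matching each one to its request through `custom_id`. The sketch below assumes the file's text has already been downloaded (for example through the SDK's file content call) and that each output line nests a chat completion under `response.body`. `parse_batch_output` is a hypothetical helper and the sample line is made up.

```python
import json

def parse_batch_output(output_text):
    """Map each custom_id to the model's parsed JSON answer (None when unusable)."""
    answers = {}
    for line in output_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        try:
            content = record["response"]["body"]["choices"][0]["message"]["content"]
            answers[record["custom_id"]] = json.loads(content)
        except (KeyError, IndexError, TypeError, json.JSONDecodeError):
            # Failed request or an answer that is not valid JSON
            answers[record["custom_id"]] = None
    return answers

# Made-up output line for illustration; not a real model answer
sample_line = json.dumps({
    "custom_id": "BG-PT67847",
    "response": {"status_code": 200, "body": {"choices": [
        {"message": {"role": "assistant", "content": '{"valid": true, "comment": "ok"}'}}
    ]}},
})
print(parse_batch_output(sample_line))
# {'BG-PT67847': {'valid': True, 'comment': 'ok'}}
```

Keying the result by `custom_id` matters because responses are not returned in request order.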

## Basic structure for the scores table

In [31]:
scores = pd.DataFrame([], columns=["test_name", "sensitivity", "precision", "f1_score"])

scores

Unnamed: 0,test_name,sensitivity,precision,f1_score


## Performance Metrics


In [None]:
column_validation = [
    "Business Goal Validation",
    "PLG Conversation Validation",
    "Product Feedback and Limitations Validation",
    "Copilot Insights Validation",
    "Recommendation Details Validation",
]

def catch_key_metrics(test_name, results_store, dataset, response_batch, column_theme):
    '''
    Evaluates the performance of a prompt against the human labels

    Args:
        - test_name: name of the test, can be used as an identifier
        - results_store: dataframe where we can store the results
        - dataset: dataframe with the cases and their human validation columns
        - response_batch: assumed here to be a dict mapping each case's UT_ID
          to the raw JSON string the model returned for that case
        - column_theme: name of the human validation column to compare against
    '''

    # counters to evaluate metrics
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    for _, row in dataset.iterrows():
        # Look up the answer for this specific case instead of reusing one response
        raw_response = response_batch.get(row["UT_ID"])
        if raw_response is None:
            continue  # no model answer for this case

        try:
            response = json.loads(raw_response)
        except json.JSONDecodeError as e:
            print(f"[WARN] Failed to parse LLM response as JSON: {e}")
            continue

        predicted_valid = bool(response['valid'])
        human_validation = row[column_theme]
        if predicted_valid and human_validation == 1:
            tp += 1
        elif not predicted_valid and human_validation == 0:
            tn += 1
        elif predicted_valid and human_validation == 0:
            fp += 1
        elif not predicted_valid and human_validation == 1:
            fn += 1

    # Guard against division by zero when a denominator is empty
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f_1 = 2 * (precision * sensitivity) / (precision + sensitivity)
    else:
        f_1 = 0.0

    new_results_row = pd.DataFrame({"test_name": [test_name], "sensitivity": [sensitivity], "precision": [precision], "f1_score": [f_1]})

    test_values = results_store["test_name"].values

    if test_name in test_values:
        # Overwrite the existing row for this test
        index_to_replace = results_store[results_store["test_name"] == test_name].index[0]
        results_store.loc[index_to_replace] = new_results_row.iloc[0]
    else:
        results_store = pd.concat([results_store, new_results_row], ignore_index=True)

    return results_store

In [None]:

scores = catch_key_metrics("./cross_validation_set.csv", scores, cases, case_results, column_validation[0])
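
As a sanity check on the three metrics, here is a small worked example with made-up counts. The helper name `classification_metrics` is hypothetical; the formulas are the same ones used in `catch_key_metrics`.

```python
def classification_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # share of human-valid cases the model caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of model-valid cases that were right
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return sensitivity, precision, f1

# Made-up counts: 6 true positives, 2 false positives, 4 false negatives
sensitivity, precision, f1 = classification_metrics(tp=6, fp=2, fn=4)
print(round(sensitivity, 3), round(precision, 3), round(f1, 3))  # 0.6 0.75 0.667
```

True negatives do not enter any of the three metrics, so a prompt that correctly rejects many invalid insights gets no credit for it here.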

## Todo next

- Test the submit_batch_job method and make sure it works well (I would put a cap of 10-20 cases) ✅
- Build logic to measure F1 (look into the previous Jupyter notebook)
- Play with structured outputs (telling Azure to always give back "valid" and "reasoning" fields)

Miguel
- Make sure Toni has credentials to RodriguezAyala Azure