## Dataset Generation for News Headlines Mood Analysis

This notebook performs the following tasks:

1. Loads a dataset of news headlines using the Hugging Face datasets library
2. Samples 30,000 headlines randomly from the dataset
3. Generates JSONL files containing API requests that ask GPT-4.1 to rate each headline's mood impact:
   - Splits the headlines into 2 batches of 15,000 each
   - Creates requests asking for a mood impact rating from 1 to 10
4. Processes the API response files to extract the mood ratings:
   - Validates each rating response
   - Normalizes ratings to a 0-1 scale
   - Creates separate DataFrames for ratings and validation flags

The generated dataset will be used for training a model to assess the emotional impact of news headlines.

In [None]:
from datasets import load_dataset
import pandas as pd
import json
from dotenv import load_dotenv
import os

In [None]:
load_dotenv()

huff_post = load_dataset(os.getenv("DATASET"))
huff_post = pd.DataFrame(huff_post['test'])

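# Fix the seed so a re-run reproduces the same sample (and the same row order)
# that the batch request files were generated from.
headlines = huff_post[os.getenv("HEADLINE_COLUMN")].sample(n=30000, random_state=42)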

### Rating Scale Choice (1-10)
GPT-4.1 produced more consistent and reliable ratings when asked for a value on a 1-10 scale than when asked for a normalized value between 0 and 1. The ratings are therefore collected on the 1-10 scale and normalized afterwards.

### Batch Processing Implementation
The API limits the size of a batch request, so the 30,000 headlines are split into smaller chunks:
1. Each batch contains 15,000 headlines to stay within API limits
2. Smaller batches make error handling and recovery easier
3. Batches can be processed in parallel, which may reduce total processing time
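
The splitting step can be sketched generically as below. This is a minimal illustration on a plain Python list, not the notebook's own code: `split_into_batches` is a hypothetical helper name, and the toy data stands in for the real headlines.

```python
from typing import Iterator, Sequence


def split_into_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy example: 7 headlines in batches of 3 give batch sizes 3, 3 and 1
toy = [f"headline {n}" for n in range(7)]
batch_sizes = [len(batch) for batch in split_into_batches(toy, 3)]
print(batch_sizes)
```

With 30,000 items and a batch size of 15,000 this yields the two batches used in this notebook. For a pandas Series, slicing with `.iloc` makes the positional intent explicit.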

In [None]:
from typing import Iterable


def generate_batch(batch_id: int, data: Iterable[str]) -> None:
    """
    Generates a batch file containing JSON lines for API requests. Each request
    is tailored for evaluating the mood impact of a given headline.

    :param batch_id: The unique identifier for the batch being generated.
    :param data: An iterable of headlines (e.g. a list or pandas Series) to
        include in the batch file.
    :return: None
    """
    # Make sure the output directory exists before writing
    os.makedirs('openai', exist_ok=True)

    with open(f'openai/headlines_batch_{batch_id}.jsonl', 'w') as f:
        for i, headline in enumerate(data, 1):
            request_data = {
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4.1",
                    "messages": [
                        {
                            "role": "system",
                            "content": "How much would this headline negatively impact a user's mood? Only output a number between 1 and 10."
                        },
                        {
                            "role": "user",
                            "content": headline
                        }
                    ],
                    "max_tokens": 1000
                }
            }
            f.write(json.dumps(request_data) + '\n')

In [None]:
generate_batch(1, headlines.iloc[:15000])
generate_batch(2, headlines.iloc[15000:])

### GPT Data Validation

The validation process for GPT-generated ratings includes:

1. Checking that the response is a valid integer
2. Ensuring the rating falls within the expected 1-10 range
3. Normalizing ratings to a 0-1 scale
4. Flagging invalid responses for quality control

In [None]:
def extract_ratings_from_batch_file(filepath: str) -> tuple:
    """
    Extract ratings and validation results from a batch file containing JSON lines.

    This function reads a file containing JSON objects in each line, processes the
    `content` field in each entry, and extracts ratings and their validity status.
    Ratings are normalised by dividing by 10. Invalid ratings are flagged as not
    valid.

    :param filepath: The path to the file containing batch JSON lines.
    :type filepath: str
    :return: A tuple containing two pandas DataFrames:
             - One for ratings with a column named 'ratings'.
             - Another for validation flags with a column named 'validation'.
    :rtype: tuple
    """
    ratings = []
    validation = []

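    with open(filepath, 'r') as f:
        for line in f:
            batch_response = json.loads(line.strip())
            # Failed requests may carry no response body or no message content
            response = batch_response.get('response') or {}
            choices = (response.get('body') or {}).get('choices') or []
            content = (choices[0]['message'].get('content') or '') if choices else ''
            content = content.strip()

            # Invalid responses keep a placeholder rating of 0 and are flagged
            rating = 0
            validity = False
            # Accept only plain ASCII integers within the expected 1-10 range
            if content.isascii() and content.isdigit() and 1 <= int(content) <= 10:
                rating = int(content) / 10
                validity = True

            ratings.append(rating)
            validation.append(validity)

    return pd.DataFrame({'ratings': ratings}), pd.DataFrame({'validation': validation})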

In [None]:
ratings_1, validation_1 = extract_ratings_from_batch_file('../data/openai/headlines_batch_1_output.json')
ratings_2, validation_2 = extract_ratings_from_batch_file('../data/openai/headlines_batch_2_output.json')

In [None]:
headlines.reset_index(drop=True, inplace=True)

all_ratings = pd.concat([ratings_1, ratings_2], axis=0, ignore_index=True)
all_validation = pd.concat([validation_1, validation_2], axis=0, ignore_index=True)
all_data = pd.concat([headlines, all_ratings], axis=1)
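
### Response Ordering Caveat

The positional concatenation above assumes that each output file lists its responses in the same order as the requests. The OpenAI Batch API documentation notes that output line order may not match input order and recommends matching on `custom_id` instead. A minimal sketch of restoring the order, assuming the `request-{i}` ids written by `generate_batch` (1-based within each file); `order_by_custom_id` is a hypothetical helper that is not part of the original notebook:

```python
import json
from typing import Iterable, List, Optional


def order_by_custom_id(lines: Iterable[str], n_requests: int) -> List[Optional[dict]]:
    """Return response records ordered by the index in 'request-{i}'.

    Slots stay None for requests that are missing from the output.
    """
    ordered: List[Optional[dict]] = [None] * n_requests
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # custom_id looks like 'request-17'; the suffix is the 1-based position
        position = int(record["custom_id"].rsplit("-", 1)[1]) - 1
        ordered[position] = record
    return ordered


# Toy output lines in shuffled order (the 'answer' field is illustrative only)
shuffled = [
    json.dumps({"custom_id": "request-3", "answer": "c"}),
    json.dumps({"custom_id": "request-1", "answer": "a"}),
    json.dumps({"custom_id": "request-2", "answer": "b"}),
]
answers = [rec["answer"] for rec in order_by_custom_id(shuffled, 3)]
print(answers)
```

Applied per output file, the reordered records could then be parsed into ratings so that row `i` of each batch lines up with headline `i` of that batch.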

In [None]:
data = all_data[all_validation['validation']]
data.columns = ['text', 'labels']
data.to_csv('headline_data.csv', index=False)