In [1]:
from openai import OpenAI
from dotenv import load_dotenv
import json
import os
import random
import re
import sys

In [2]:
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [3]:
# Paths for the input corpus, the text-only copy, and the generated questions
input_file = "json/20ng/topic_1_2_documents.json"
masked_file = "json/20ng/documents_masked_12.json"
output_file = "json/output_questions_v2.json"

# Read the input file once
with open(input_file, "r") as f:
    data = json.load(f)

# Keep only the text field of every document and save it to a new JSON file
masked_data = [{"text": item["text"]} for item in data]
with open(masked_file, "w") as f:
    json.dump(masked_data, f)

# Sample 100 documents from the dataset, keeping only the text field
sampled_data = [{"text": item["text"]} for item in random.sample(data, 100)]

# convert the sampled data to a pandas DataFrame
import pandas as pd

df = pd.DataFrame(sampled_data)
sampled_data

[{'text': "Hi,\n\tWe've been having problems on a few setups when printing to a\nserial printer (dmp or Laser). I have used Works and Windows Write. The\noutput is OK from DOS and if I send plain text output, but anything\nfancy garbles or just doesn't output. The exception is outputting to a\nLserjet 4 which 'appears' to be fast enough receiving data, not to\nbother about handshaking messages. I'm sure I'm not alone in this. I've\ntried most of the Print/Network manager options I can think of. Anyone\nhad similar problems they've cured and would like to tell me 'bout it??\nThanks"},
 {'text': "\nWhich translates to 7% not satisfied.  I don't think it's the awkward \nrecursive deletion that's bugging people, it certainly isn't the nice Windows \ninterfaces for new DOS accessories (CPAV, defrager, undelete).\n\n\tAs far as I've noticed, it's DoubleSpace crashes.\n\nFrankly, the fairly high rates of DoubleSpace crashes I've heard of surprises \nme!  I figured that since the OS is presuma
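
The sampling cell above draws a different set of 100 documents on every run, and `random.sample` raises `ValueError` if the file holds fewer than 100 records. A minimal sketch of a reproducible variant follows; the helper name `sample_texts` and the `seed` parameter are assumptions for illustration, not part of the original notebook.

```python
import random


def sample_texts(records, k=100, seed=42):
    """Return up to k records reduced to their "text" field.

    Hypothetical helper: a private Random instance seeded with `seed`
    makes the draw repeatable without touching the global random state.
    """
    rng = random.Random(seed)
    k = min(k, len(records))  # avoid ValueError on small inputs
    return [{"text": r["text"]} for r in rng.sample(records, k)]
```

With this in place, `sampled_data = sample_texts(data)` could replace the sampling line, so that later prompt runs see the same documents.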

In [None]:
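def generate_difference(model, articles):
    """Ask the model to summarize the key subtopics across `articles`.

    Uses the module-level `client`. Returns the model's text, or an
    error string if the request fails.
    """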
    narrative_prompt = f"""
    
    You are an AI assistant tasked with analyzing a corpus of documents.
    Your objective is to identify and summarize the key subtopics or themes that emerge across the documents.

    Please follow these instructions carefully:
	1.	Thoroughly read and analyze each document in the corpus.
	2.	Identify and list the key sports games that are present, ensuring you capture nuanced distinctions between them.
	3.	If applicable, highlight contrasts or overlaps in how the same subtopic is addressed in different documents.
	4.	Your analysis will later be used to generate multiple-choice questions aimed at assessing the ability to distinguish between documents based on their themes and content.

    Make your output detailed, well-structured, and easy to adapt for MCQ creation.

    Here is the corpus of documents you need to analyze:
    {articles}
    """
    messages = [{"role": "user", "content": narrative_prompt}]
    try:
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=0
        )

        content = response.choices[0].message.content
        return content
    except Exception as e:  # if the model fails to return a response
        print(f"Error: {e}")
        return "Sorry, error from GPT."
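
Because `{articles}` is interpolated directly into the f-string, the prompt receives the Python `repr` of a list of dicts, including braces, quotes, and escaped `\n` sequences. One possible way to hand the model cleaner input is sketched below; `format_articles`, the `max_chars` cutoff, and the `---` separator are illustrative assumptions rather than anything the notebook defines.

```python
def format_articles(articles, max_chars=2000):
    """Render a list of {"text": ...} dicts as numbered plain-text blocks.

    Hypothetical helper: each document is stripped, cut to `max_chars`
    characters, and labeled so the model can refer to it by number.
    """
    blocks = []
    for i, item in enumerate(articles, start=1):
        text = item["text"].strip()
        if len(text) > max_chars:
            text = text[:max_chars] + " [truncated]"
        blocks.append(f"Document {i}:\n{text}")
    return "\n\n---\n\n".join(blocks)
```

The call would then become `generate_difference("gpt-4o-mini", format_articles(sampled_data))`. Truncating long posts also keeps the prompt size bounded, at the cost of dropping the tail of long documents.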

In [5]:
response = generate_difference("gpt-4o-mini", sampled_data)
print(response)

Based on the analysis of the provided corpus of documents, several key subtopics related to sports, particularly hockey and baseball, emerge. Below is a detailed summary of the key subtopics, including the identification of specific sports games, contrasts, and overlaps in themes across the documents.

### Key Subtopics Identified

1. **Hockey Playoffs and Teams**
   - **NHL Playoffs**: Several documents discuss the NHL playoffs, including specific matchups and team performances. For example, there are mentions of the New Jersey Devils, Pittsburgh Penguins, and New York Islanders, highlighting their playoff standings and results.
   - **Player Performances**: Individual player performances are discussed, such as Mario Lemieux and his scoring ability, as well as the impact of players like Ulf Dahlén and Kevin Stevens on their respective teams.
   - **Coaching Changes**: The hiring of coaches, such as Mike Keenan for the Rangers, is noted, along with the implications for team dynamics an

In [12]:
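def generate_questions(chunk, client, model="gpt-4o-mini"):
    """Turn an analysis summary (`chunk`) into multiple-choice questions.

    Returns the model's text, or an error string if the request fails.
    """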
    prompt = f"""
You are an advanced AI assistant tasked with analyzing a corpus of documents focused on sports games. Your role is to design multiple-choice questions (MCQs) that help distinguish key differences among the documents and identify shared subtopics or thematic elements.

You will be provided with an excerpt of pre-analyzed document summaries or content segments. Based on this input, generate clear, non-redundant, and comprehensive MCQs that:
	•	Highlight key differentiators across documents relating to sports games.
	•	Target subtopics or themes that are common across multiple documents, not document-specific trivia.
	•	Include “None of the above” as one of the answer choices for every question.
	•	Follow the format provided below, where each question focuses on one analytical aspect.

Example Format:

1. What is the primary focus of this document?
   - A. Sports Results & Analysis
   - B. Political & International Affairs
   - C. None of the above

2. What is the narrative structure of this document?
   - A. Straight News
   - B. Opinion Pieces
   - C. None of the above

Your task:

Generate unique, comprehensive MCQs from the analysis results provided below. Design questions that are:
	•	Useful for distinguishing between themes.
	•	Easy for an AI system to score against each document.
	•	Focused on broader categorizations rather than niche or overly specific detail.

Documents:

{chunk}

"""
    messages = [{"role": "user", "content": prompt}]
    try:
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=0
        )

        content = response.choices[0].message.content
        return content
    except Exception as e:  # if the model fails to return a response
        print(f"Error: {e}")
        return "Sorry, error from GPT."
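
The function returns the questions as free text in the numbered format requested by the prompt. To store them in `output_file` as JSON, or to score them against documents later, the text has to be parsed first. The sketch below assumes the model follows the example format exactly (`1. Question` followed by `- A. Option` lines); `parse_mcqs` and `missing_none_option` are hypothetical names, and real responses may need looser patterns.

```python
import re

# "1. What is ...?" at the start of a line
QUESTION_RE = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")
# "   - A. Some option"
OPTION_RE = re.compile(r"^\s*-\s*([A-Z])\.\s+(.*\S)\s*$")


def parse_mcqs(text):
    """Parse numbered MCQ text into a list of question dicts."""
    questions = []
    for line in text.splitlines():
        q = QUESTION_RE.match(line)
        o = OPTION_RE.match(line)
        if q:
            questions.append(
                {"id": int(q.group(1)), "question": q.group(2), "options": {}}
            )
        elif o and questions:
            # Attach the option to the most recently seen question
            questions[-1]["options"][o.group(1)] = o.group(2)
    return questions


def missing_none_option(questions):
    """Return ids of questions that lack a "None of the above" choice."""
    return [
        q["id"]
        for q in questions
        if "none of the above" not in {v.lower() for v in q["options"].values()}
    ]
```

The parsed list can be written with `json.dump(parse_mcqs(questions), f, indent=2)` after opening `output_file` for writing, and `missing_none_option` offers a quick check that the prompt's "None of the above" requirement was honored.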

In [13]:
questions = generate_questions(response, client)
print(questions)

1. What is the primary focus of the documents analyzed?
   - A. Hockey and Baseball Performance
   - B. Political & International Affairs
   - C. None of the above

2. Which theme is commonly discussed across the documents?
   - A. Player Statistics and Performance
   - B. Environmental Issues
   - C. None of the above

3. What aspect of hockey is highlighted in several documents?
   - A. NHL Playoffs and Team Performances
   - B. Historical Development of the Sport
   - C. None of the above

4. How do the documents address fan engagement?
   - A. Through discussions of Fan Reactions and Media Coverage
   - B. By analyzing Economic Impact of Sports
   - C. None of the above

5. What comparative analysis is present in the documents?
   - A. Differences between Hockey and Baseball
   - B. Comparison of Sports Equipment
   - C. None of the above

6. Which topic is emphasized in the context of team management?
   - A. Franchise Moves and Management Decisions
   - B. Player Health and Safet