# Question Generation per Cluster Pipeline

## Overview

In this notebook, we continue from the **clusters obtained in the previous notebook**.  

For each cluster, we **generate relevant questions** using a large language model (LLM).  

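We query the model three times per cluster and rely on **nucleus sampling** (the `temperature` and `top_p` settings in `get_questions_from_openai`) to produce a diverse set of questions while maintaining coherence and relevance.  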

Next, we perform **deduplication** to remove redundant or overlapping questions, ensuring a concise and high-quality question set for each cluster.  
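
As a rough illustration of what nucleus (top-p) sampling does, the sketch below filters a toy next-token distribution. The hosted model applies this step internally, so nothing here is part of the pipeline itself; the helper names `nucleus_filter` and `sample_token` and the toy probabilities are assumptions made for the example.

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed to reach top_p (always at least one).
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()           # renormalise over the kept tokens

def sample_token(probs, top_p, rng):
    """Draw one token index from the filtered distribution."""
    return int(rng.choice(len(probs), p=nucleus_filter(probs, top_p)))

toy_probs = [0.5, 0.3, 0.15, 0.05]             # made-up distribution over 4 tokens
filtered = nucleus_filter(toy_probs, top_p=0.9)
token = sample_token(toy_probs, top_p=0.9, rng=np.random.default_rng(0))
print(filtered, token)
```

With `top_p=0.9` the least likely token is dropped before sampling; with `top_p=1` every token stays eligible, so diversity then depends on the temperature alone.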

## Configuration

At the beginning of the notebook, update the **variables** and **path definitions** to specify the input clusters, model configuration, sampling parameters, and output directories used throughout the workflow.


In [None]:
prompt_number = 1
file_name = "Israel_Israel-Hamas war-Week 19 2024"
event = "Israel and Palestine conflict"
country = 'Israel'
input_file = f"Results/Cluster/Clusters+Headline /clusters-{file_name}.json"
#input_file = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook-Reports/Pipeline_DatasetDerya/Results/Clusters+Headline /clusters-{input_file}.json"
output_dir = "Results/Questions/Test questions different prompts/1-Questions generated/Dev set"
#output_dir_deduplicated = "Results/Questions/Test questions different prompts/2-Questions Deduplicated across clusters"


generation_model_path = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook/question_generation/models/t5-base-canard-mode"
duplicate_question_model_path = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook/question_generation/models/cross-encoder"  
expand_question_model_path = "/Users/decostanzi/Desktop/Project-ISI/SmartBook/SmartBook/question_generation/models/quora-roberta-base-model"
duplicate_treshold = 0.7
output_name = f"questions-{file_name}-prompt-{prompt_number}.json"

openaikey = ""
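
Before running the rest of the notebook, it can help to check that the configured paths are usable. The following is a minimal sketch, not part of the original workflow: `check_paths` is a hypothetical helper, and the temporary files only stand in for the real `input_file` and `output_dir`.

```python
import tempfile
from pathlib import Path

def check_paths(input_file, output_dir):
    """Fail early if the cluster file is missing; create the output folder."""
    input_path = Path(input_file)
    if not input_path.is_file():
        raise FileNotFoundError(f"Cluster file not found: {input_path}")
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    return input_path, output_path

# Demonstration on throwaway files (names are made up for the example).
with tempfile.TemporaryDirectory() as tmp:
    toy_input = Path(tmp) / "clusters-toy.json"
    toy_input.write_text("{}")
    _, out_dir = check_paths(toy_input, Path(tmp) / "questions" / "dev")
    created = out_dir.is_dir()
print(created)
```

In the notebook, the call would presumably be `check_paths(input_file, output_dir)` right after the configuration cell.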

In [None]:
import json
import os
import random
from copy import deepcopy

import backoff  # available for retrying rate-limited API calls (currently unused)
import numpy as np
import openai
from sentence_transformers import CrossEncoder
from tqdm import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Functions

In [None]:
import os

from openai import OpenAI

# The client reads the key from the OPENAI_API_KEY environment variable.
os.environ["OPENAI_API_KEY"] = openaikey
client = OpenAI()

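def get_questions_from_openai(prompt):
    """Send the prompt to the chat model and return the raw text of its reply."""
    chatgpt_output = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        # temperature=0.0 makes decoding close to deterministic, so repeated calls
        # return near-identical text; raise it (and/or lower top_p) for more diversity.
        temperature=0.0,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    questions = chatgpt_output.choices[0].message.content.strip()
    return questions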
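
The `backoff` package is imported above but the API call is not retried on failure. The sketch below shows the general idea with the standard library only, so it can run without network access. `call_with_backoff` and `flaky` are hypothetical names, and the simulated error stands in for a real rate-limit exception.

```python
import random
import time

def call_with_backoff(func, *args, max_tries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_tries):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_tries - 1:
                raise                          # out of attempts: surface the error
            # Wait base_delay * 2^attempt seconds, plus up to one second of jitter.
            sleep(base_delay * 2 ** attempt + random.random())

# Stand-in for the API call: fails twice, then succeeds.
calls = {"n": 0}

def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return f"1. Question about {prompt}?"

result = call_with_backoff(flaky, "water access", sleep=lambda seconds: None)
print(result, calls["n"])
```

In the notebook one would pass `get_questions_from_openai` as `func` and narrow `retry_on` to the rate-limit exception of the installed `openai` version (assumed here to be `openai.RateLimitError`).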

# Prompts 

In [None]:
def get_prompt(headline, prompt_number, event, country):
    prompt1 = f'''
You are an expert in developing strategic and tactical questions to analyze and address humanitarian situations, based exclusively on the provided data.
Your task is to generate clear, specific, and insightful questions tailored for a humanitarian situational report.

Input:
You will be provided with:
1.  A set of paragraphs extracted from humanitarian documents.
2.  A headline summarizing the cluster.
3.  The specific event: {event}.
4.  The relevant country: {country}.

Instructions for Generating Questions:

1.  Data-Driven: Each question must rely *solely* on the information present in the provided text. Do not introduce external knowledge or ask about information not contained in the text.
2.  Relevance: Ensure every question is directly relevant to the content of the provided paragraphs, the specified {event}, and the {country}. The questions should aim to capture key aspects of the humanitarian situation described.
3.  Precision: Questions must be well-defined, focused, and unambiguous.
4.  Action-Oriented: Frame questions to elicit actionable insights that can support humanitarian decision-making and response, based *only* on the data provided.
5.  Explicit Acronyms: If an acronym is used in a question (e.g., WHO), it must be explicitly defined within that question (e.g., "What actions has the World Health Organization (WHO) taken..."), assuming the definition is available or inferable from the provided text. If the text uses an acronym without defining it, the question may also use it if it's central to the text, but define it if possible from context.
6.  Neutral and Non-Political: Questions must maintain a strictly neutral tone. Avoid any political content, expressions of personal or subjective opinions, or leading questions.

Output Format Requirements:

* Questions Only: Your output must consist *only* of the generated questions. Do not include any introductory text, preambles, explanations, or concluding remarks (e.g., do NOT start with "Here are some questions:" or similar phrases).
* No Categorization or Prefixes: Do not add any category titles, labels (e.g., "Humanitarian Needs:"), or any other descriptive text before individual questions.
* No Formatting: Generate questions in plain text. Do not use any formatting such as bolding, italics, or asterisks.
* Listing: Present the questions as a simple list. Each question should ideally start on a new line. Numbering (e.g., 1., 2., 3.) is acceptable and preferred if multiple questions are generated.


Headline: {headline}  

Content:
'''
    prompt2 = f'''
You are an expert in creating strategic and tactical questions to analyze and address humanitarian situations based solely on the provided data.  
Your task is to generate clear, specific, and insightful questions tailored for a humanitarian situational report.  

Requirements for the questions:  
1. **Data-Driven:** The questions must rely solely on the information in the provided text, without requiring any external knowledge.  
2. **Relevance:** Ensure the questions are directly relevant to the content and capture key aspects of the text. The questions should be relevant to the event {event} and related to the country {country}.
3. **Precision:** Questions should be well-defined and focused, avoiding ambiguity.  
4. **Action-Oriented:** Aim to elicit actionable insights that can support humanitarian decision-making. 
5. **Explicit Acronyms:** If an acronym is used, it must be explicitly defined within the question.

**Chain of Thought:**
Understand the Context: Read the headline and content carefully to grasp the key humanitarian issues.
Highlight Important Details: Identify critical facts such as needs, risks, or affected groups.
Ask Clarifying Questions: Think of questions that provide a deeper understanding without needing outside information.
Ensure Answerability: Make sure each question can be answered directly from the content provided.

You will receive a set of paragraphs extracted from humanitarian documents along with a headline summarizing the cluster.  

Headline: {headline}  

Content:
'''

    prompt3 = f'''
You are an expert in creating strategic and tactical questions to analyze and address humanitarian situations based solely on the provided data.  
Your task is to generate clear, specific, and insightful questions tailored for a humanitarian situational report.  

Requirements for the questions:  
1. **Data-Driven:** The questions must rely solely on the information in the provided text, without requiring any external knowledge.  
2. **Relevance:** Ensure the questions are directly relevant to the content and capture key aspects of the text. The questions should be relevant to the event {event} and related to the country {country}.
3. **Precision:** Questions should be well-defined and focused, avoiding ambiguity.  
4. **Action-Oriented:** Aim to elicit actionable insights that can support humanitarian decision-making. 
5. **Explicit Acronyms:** If an acronym is used, it must be explicitly defined within the question.

**Chain of Thought:**
Understand the Context: Read the headline and content carefully to grasp the key humanitarian issues.
Highlight Important Details: Identify critical facts such as needs, risks, or affected groups.
Ask Clarifying Questions: Think of questions that provide a deeper understanding without needing outside information.
Ensure Answerability: Make sure each question can be answered directly from the content provided.
    
**Examples of good questions for a situational report:**  
- What are the possible motives for sabotage of the Nord Stream gas pipelines?  
- What is the strategic importance of the new U.S. Embassy in Tonga's capital, Nuku'alofa?
- What are the patterns emerging from the frequency and magnitude of the aftershocks following the main quake in southern Turkey?  
- What measures has the IAEA taken to ensure the safety of Ukraine's nuclear power plants?
The questions above are only examples of the expected style. Generate new questions based on the provided text; do not repeat these examples.

You will receive a set of paragraphs extracted from humanitarian documents along with a headline summarizing the cluster.  

Headline: {headline}  

Content:
'''
    
    prompts = [prompt1, prompt2, prompt3]
    
    # Ensure valid prompt selection
    if 1 <= prompt_number <= len(prompts):
        return prompts[prompt_number - 1]
    else:
        raise ValueError("Invalid prompt number. Please select 1, 2, or 3.")

In [None]:


def generate_questions(data, prompt_number, event, country, generation_model_path=None):
    """Generate three question sets per cluster with the LLM.

    `generation_model_path` is kept in the signature but is not used.
    """
    for cluster_index in tqdm(data):
        item = data[cluster_index]
        articles = item["cluster_articles"]
        headline = item["cluster_headline"]

        # Start from the prompt template, then append the numbered article texts.
        input_LLM = get_prompt(headline, prompt_number=prompt_number, event=event, country=country)

        non_null_texts = [article["text"] for article in articles if article["text"].strip()]
        for index, article in enumerate(non_null_texts):
            text = " ".join(article.split("\n")).strip()
            input_LLM += f"{index + 1}) {text}\n"

        print(input_LLM)

        # Query the model three times to collect several candidate question sets.
        questions = [get_questions_from_openai(input_LLM) for _ in range(3)]

        # The first line of each article is treated as its title.
        data[cluster_index]["article_titles"] = [
            article["text"].split("\n")[0].strip() for article in articles
        ]
        # One list of lines per model reply.
        data[cluster_index]["question_sets"] = [reply.split("\n") for reply in questions]

        # The article texts are no longer needed downstream.
        del data[cluster_index]["cluster_articles"]

    return data


In [None]:
def expand_questions(data, expand_question_model_path):
    """Rewrite each follow-up question so that it can be read on its own.

    Note: the model is loaded from the Hugging Face hub by name;
    `expand_question_model_path` is currently not used.
    """
    tokenizer = T5Tokenizer.from_pretrained("castorini/t5-base-canard")
    model = T5ForConditionalGeneration.from_pretrained("castorini/t5-base-canard")
    #model.to("cuda")
    print("Model for expanded questions loaded correctly")
    
    for cluster_index in tqdm(data):
        cluster = data[cluster_index]
        title = cluster["cluster_headline"]
        data[cluster_index]["expanded_questions"] = list()
        for question_set in cluster["question_sets"]:
            expanded = list()  # renamed so it no longer shadows the function name
            # Drop the first token of each line, assumed to be the list number ("1.").
            question_base = " ".join(question_set[0].split()[1:])
            expanded.append(question_base)

            for question in question_set[1:]:
                # Model input: headline ||| first question ||| question to rewrite
                context = title + " ||| " + question_base + " ||| " + " ".join(question.split()[1:])
                #input_ids = tokenizer(context,return_tensors="pt").input_ids.cuda()
                input_ids = tokenizer(context,return_tensors="pt").input_ids
                outputs = model.generate(input_ids, max_length=100)
                question_new = tokenizer.decode(outputs[0], skip_special_tokens=True)
                expanded.append(question_new)
            data[cluster_index]["expanded_questions"].append(deepcopy(expanded))

    return data

In [None]:
def filter_questions(data):
    for cluster_index in tqdm(data):
        data[cluster_index]["filtered_questions"] = list()
        cluster = data[cluster_index]
        for question_set in cluster["question_sets"]:
            filtered_questions = list()
            for question in question_set:
                question = question.strip()
                if question and question[-1] == "?":
                    filtered_questions.append(question)
            if len(filtered_questions) >= 1:
                data[cluster_index]["filtered_questions"].append(deepcopy(filtered_questions))
    return data
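
To make the filtering step concrete, here is a small sketch that applies the same rule (keep only lines ending with a question mark) to one raw model reply, and additionally strips list numbering with a regular expression. The numbering removal is an assumption added for the example, as are the names `NUMBERING` and `clean_question_set` and the made-up reply text.

```python
import re

# Matches leading list markers such as "1.", "2)", "-" or "*".
NUMBERING = re.compile(r"^\s*(?:\d+[.)]|[-*])\s*")

def clean_question_set(raw_output):
    """Split a raw reply into lines and keep only well-formed questions."""
    questions = []
    for line in raw_output.split("\n"):
        line = NUMBERING.sub("", line).strip()
        if line.endswith("?"):                 # same test as filter_questions
            questions.append(line)
    return questions

raw = "1. What aid was delivered?\n\n2) Which hospitals remain open?\nHere is a note."
cleaned = clean_question_set(raw)
print(cleaned)
```

Blank lines and the trailing remark are discarded because they do not end with `?`.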

In [None]:
def remove_duplicates(data, duplicate_question_model_path, threshold):
    """Greedily drop near-duplicate questions within each cluster, then pick up to six.

    Note: the cross-encoder is loaded from the Hugging Face hub by name;
    `duplicate_question_model_path` is currently not used.
    """
    model = CrossEncoder("cross-encoder/quora-roberta-base", device='cpu')
    for cluster_index in tqdm(data):
        cluster = data[cluster_index]
        print("Title: ", cluster["cluster_headline"])
        print("\n")
        all_questions = list()
        for s in cluster["filtered_questions"]:
            all_questions.extend(s)
        # Keep a question only if it scores below the threshold against every kept one.
        qset = all_questions[:1]  # stays empty when the cluster has no valid question
        for question in all_questions[1:]:
            q_list = [(q, question) for q in qset]
            scores = model.predict(q_list)
            if np.max(scores) < threshold:
                qset.append(question)
        data[cluster_index]["unique_questions"] = deepcopy(qset)
        # Always keep the first question, plus up to five others chosen at random.
        rest = qset[1:]
        random.shuffle(rest)
        data[cluster_index]["picked_questions"] = qset[:1] + rest[:5]
    
    return data
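
The greedy selection used in `remove_duplicates` can be tried without loading the cross-encoder. In the sketch below a simple word-overlap (Jaccard) score stands in for `model.predict`; this substitution, the helper names `jaccard` and `greedy_dedup`, the threshold of 0.6, and the toy questions are all assumptions for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a crude stand-in for the cross-encoder."""
    tokens_a = set(a.lower().strip("?").split())
    tokens_b = set(b.lower().strip("?").split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def greedy_dedup(questions, score, threshold):
    """Keep a question only if it is dissimilar to every question kept so far."""
    kept = []
    for question in questions:
        if all(score(k, question) < threshold for k in kept):
            kept.append(question)
    return kept

toy_questions = [
    "How many people were displaced?",
    "How many people were displaced last week?",
    "Which roads are blocked?",
]
unique = greedy_dedup(toy_questions, jaccard, threshold=0.6)
print(unique)
```

Because the procedure is order-dependent, the first question of a group of near-duplicates is always the one that survives.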


In [None]:
def remove_duplicates_across_clusters(data, duplicate_question_model_path, threshold):
    """
    Removes duplicate questions across clusters and equalizes similar ones.
    
    Parameters:
        data (dict): The data dictionary containing clusters.
        duplicate_question_model_path (str): Path to the pre-trained model.
        threshold (float): Similarity threshold to consider two questions as duplicates.
    
    Returns:
        dict: Updated data dictionary with duplicates removed across clusters.
    """
    # Load the model
    model = CrossEncoder(duplicate_question_model_path, device='cpu')
    
    # Extract all picked questions with their cluster indices
    cluster_questions = [
        (cluster_index, question)
        for cluster_index in data
        for question in data[cluster_index]["picked_questions"]
    ]
    
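    # Group questions: each key is the first-seen (canonical) question, and each
    # value lists the (cluster_index, question) pairs judged equivalent to it.
    unique_questions = {}
    
    for cluster_index, question in tqdm(cluster_questions, desc="Processing clusters"):
        duplicate_found = False
        for unique_question in unique_questions:
            # Compare current question with all unique questions
            score = model.predict([(unique_question, question)])[0]
            if score >= threshold:
                # If a duplicate is found, map the current question to the unique one
                unique_questions[unique_question].append((cluster_index, question))
                duplicate_found = True
                break
        
        if not duplicate_found:
            # If no duplicate is found, add the question as a new unique one
            unique_questions[question] = [(cluster_index, question)]

    # Assumed completion (the original cell ended above without returning anything):
    # keep only the first occurrence of each group, in the cluster where it appeared.
    kept = {}
    for members in unique_questions.values():
        first_cluster, first_question = members[0]
        kept.setdefault(first_cluster, []).append(first_question)
    for cluster_index in data:
        data[cluster_index]["picked_questions"] = kept.get(cluster_index, [])

    return data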

In [None]:
def duplicates_across_clusters(data, duplicate_question_model_path, threshold):
    """
    Tracks duplicate questions across clusters while keeping all questions intact.

    Args:
        data (dict): The data dictionary containing clusters.
        duplicate_question_model_path (str): Path to the model for detecting duplicate questions.
        threshold (float): Threshold for considering questions as duplicates.

    Returns:
        dict: The updated data dictionary with duplicates tracked across clusters.
    """
    model = CrossEncoder(duplicate_question_model_path, device='cpu')
    global_question_map = {}  # To map questions to their original cluster

    for cluster_index in tqdm(data):
        cluster = data[cluster_index]
        print("Processing cluster:", cluster["cluster_headline"])
        
        # Initialize duplicate tracking for the current cluster
        cluster["duplicate_questions"] = []
        
        for question in cluster["picked_questions"]:
            is_duplicate = False
            for seen_question, seen_cluster in global_question_map.items():
                score = model.predict([(seen_question, question)])[0]
                if score >= threshold:
                    is_duplicate = True
                    # Add duplicate info (question, first seen cluster) to current cluster
                    cluster["duplicate_questions"].append((question, seen_cluster))
                    break
            
            if not is_duplicate:
                # Add the question to the global map if it's not a duplicate
                global_question_map[question] = cluster_index
    
    return data
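
`duplicates_across_clusters` scores one pair per call. Since a cross-encoder's `predict` accepts a list of pairs, the comparisons for a question can be sent in a single batch. The sketch below shows that variant with a toy scorer so it runs without the model; `track_duplicates` and `toy_predict` are hypothetical names. Note one behavioural difference: it reports the best-scoring earlier question rather than the first one above the threshold.

```python
def track_duplicates(picked, predict, threshold):
    """picked maps cluster id -> questions; returns cluster id -> [(question, first cluster)]."""
    seen = []                                  # (question, cluster id) of non-duplicates
    duplicates = {}
    for cluster_id, questions in picked.items():
        duplicates[cluster_id] = []
        for question in questions:
            if seen:
                # One batched call scores the question against everything seen so far.
                scores = predict([(s, question) for s, _ in seen])
                best = max(range(len(scores)), key=lambda i: scores[i])
                if scores[best] >= threshold:
                    duplicates[cluster_id].append((question, seen[best][1]))
                    continue
            seen.append((question, cluster_id))
    return duplicates

def toy_predict(pairs):
    """Stand-in scorer: 1.0 for case-insensitive exact matches, else 0.0."""
    return [1.0 if a.lower() == b.lower() else 0.0 for a, b in pairs]

picked = {
    "0": ["Which roads are blocked?"],
    "1": ["which roads are blocked?", "Where is fuel needed?"],
}
tracked = track_duplicates(picked, toy_predict, threshold=0.7)
print(tracked)
```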


# Run the pipeline

In [None]:

with open(input_file, "rb") as f:
    headline_data = json.load(f)

# Process the data (each step updates the same dictionary in place)

questions = generate_questions(headline_data, prompt_number=prompt_number, event=event, country=country)
expanded_questions = expand_questions(questions, expand_question_model_path)
print("Questions expanded correctly")

# Note: filtering reads "question_sets", not the "expanded_questions" produced above.
filtered_questions = filter_questions(expanded_questions)
print("Questions filtered correctly")

final_questions = remove_duplicates(filtered_questions, duplicate_question_model_path, duplicate_treshold)

# Create the output folder if needed so that the write below cannot fail on a missing path.
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, output_name)
with open(output_file, "w") as f:
    json.dump(final_questions, f, indent=4)

print(f"Final questions saved to {output_file}")


# deduplicated_data = duplicates_across_clusters(final_questions, duplicate_question_model_path, duplicate_treshold)
# output_file_deduplicated = os.path.join(output_dir_deduplicated, output_name)
# with open(output_file_deduplicated, "w") as f:
#     json.dump(deduplicated_data, f, indent=4)
    
# print(f"Deduplicated questions saved to {output_file_deduplicated}")
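
Once the file is written, a quick check of how many questions each cluster ended up with can catch empty clusters. This is a minimal sketch assuming the saved JSON maps cluster indices to dictionaries with a `picked_questions` list, as produced above; `count_questions` is a hypothetical helper and the toy file contents are made up.

```python
import json
import os
import tempfile

def count_questions(path):
    """Return {cluster index: number of picked questions} for a saved results file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {index: len(cluster.get("picked_questions", []))
            for index, cluster in data.items()}

# Toy file standing in for the real output.
toy = {
    "0": {"cluster_headline": "A", "picked_questions": ["Q1?", "Q2?"]},
    "1": {"cluster_headline": "B", "picked_questions": []},
}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "questions-toy.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy, f, indent=4)
    counts = count_questions(toy_path)
print(counts)
```

In the notebook the call would presumably be `count_questions(output_file)`.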

In [None]:
filtered_questions