## Question generation: example prompt and question summary

In [None]:
from pprint import pprint

import pandas as pd

from info_salience import qgen

In [8]:
from info_salience.llm import LitellmGenerator

llm = LitellmGenerator(
    "gpt-4o-2024-08-06",
    caching=True,
    report_costs=True,
    disk_cache_dir=".litellm_cache",
)

In [2]:
df_summaries = pd.read_json('../output/cs-cl/Meta-Llama-3.1-8B-Instruct/summaries/temperature0.3-0.json')

In [25]:
prompts = qgen.get_prompts(df_summaries, 'cs-cl', batch_size=5)
print(prompts[1][0]['content'])

response = llm.generate([prompts[1]])[0][0]
response = qgen.parse_response(response)
for length, qa_pairs in response.items():
    for qa_pair in qa_pairs:
        print('Q:', qa_pair['question'])
        print('A:', qa_pair['example_answer'])
        print()

Your task is to analyze summaries of different lengths within a given genre. Your goal is to create question-answer pairs that capture the essence of information typically included in various summary lengths. Below is the dataset where each document was summarized in 5 different lengths.

# Dataset

## Document 1

### Summary 10 words
Benchmarking initiatives evaluate language and time series model performance.

### Summary 20 words
Benchmarking initiatives evaluate language models and time series prediction models for various tasks and challenges.

### Summary 50 words
The document discusses benchmarking initiatives in NLP, time series analysis, and multi-task learning. It highlights GLUE, SuperGLUE, and SQuAD benchmarks for NLP, M3 and M4 competitions for time series forecasting, and multi-task learning techniques for improving model performance and generalization on multiple tasks.

### Summary 100 words
The document discusses benchmarking initiatives in Natural Language Processing 
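
The loop in the cell above implies that `qgen.parse_response` returns a mapping from summary length to a list of question-answer dictionaries with `question` and `example_answer` keys. The following is a minimal sketch, assuming that shape, of flattening such a mapping into one record per question. The `flatten_qa_pairs` helper and the sample `parsed` dictionary (including its key format) are hypothetical and not part of `info_salience`.

```python
def flatten_qa_pairs(parsed):
    """Turn {length: [qa_pair, ...]} into a flat list of row dictionaries."""
    rows = []
    for length, qa_pairs in parsed.items():
        for qa_pair in qa_pairs:
            rows.append({
                'length': length,
                'question': qa_pair['question'],
                'example_answer': qa_pair['example_answer'],
            })
    return rows


# Hypothetical parsed response; the real keys and contents may differ.
parsed = {
    '10 words': [
        {'question': 'What is the main topic?',
         'example_answer': 'Benchmarking initiatives.'},
    ],
    '50 words': [
        {'question': 'Which benchmarks are named?',
         'example_answer': 'GLUE, SuperGLUE, and SQuAD.'},
        {'question': 'Which competitions are named?',
         'example_answer': 'M3 and M4.'},
    ],
}

rows = flatten_qa_pairs(parsed)
for row in rows:
    print(row['length'], '|', row['question'], '|', row['example_answer'])
```

A list in this form can be passed directly to `pd.DataFrame(rows)` if a tabular view is more convenient.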

## Deduplicated Questions

In [None]:
from IPython.display import display


def highlight_rows(row):
    """Color rows marked 'x' in the 'selected' column green, all others gray."""
    color = 'lightgreen' if row['selected'] == 'x' else 'lightgray'
    return [f'background-color: {color}'] * len(row)


def show_deduplicated_qas(qa_file):
    """Display the clustered questions stored in an annotation workbook."""
    df = pd.read_excel(qa_file, sheet_name='clustered')
    df = df.rename(columns={'centroid': 'question'})
    df = df[['question', 'group_name', 'cluster_size', 'selected']]
    # Mask the centroid question for rows in the NOISE group.
    df.loc[df['group_name'] == 'NOISE', 'question'] = '--'

    styled_df = (
        df.style
        .apply(highlight_rows, axis=1)  # Apply row-wise styling
        .hide(['selected'], axis=1)     # Hide the 'selected' column
    )
    display(styled_df)

In [31]:
show_deduplicated_qas('../data/annotations/questions/qmsum-generic-clusters-manual.xlsx')

| | question | group_name | cluster_size |
|---|---|---|---|
| 0 | -- | NOISE | 293 |
| 1 | Who are the participants and their roles discussed in the meeting? | Participants -- roles | 44 |
| 2 | Who were the key participants or roles involved in the meeting? | Participants -- roles | 35 |
| 3 | What main topic was discussed in the meeting? | Topic -- main | 50 |
| 4 | What are the main topics discussed in the meeting? | Topic -- main | 25 |
| 5 | What is the primary topic of the meeting? | Topic -- main | 23 |
| 6 | What were the main objectives or goals discussed in the meeting? | Topic -- goals | 27 |
| 7 | Which aspects of the main topic were covered in the discussion? | Topic -- aspects | 64 |
| 8 | What aspect of the project was primarily discussed? | Topic -- aspects | 33 |
| 9 | What are the identified challenges or concerns discussed? | Discussion -- challenges | 28 |
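
Several centroid questions in this table share a `group_name`, so summing `cluster_size` per group gives a rough sense of how many generated questions each group covers. The following is a minimal sketch that uses only the non-NOISE rows displayed above; the sheet may contain further rows, so the totals are partial. The `rows` list and the `group_totals` name are illustrative.

```python
from collections import Counter

# (group_name, cluster_size) pairs copied from the rows displayed above.
rows = [
    ('Participants -- roles', 44),
    ('Participants -- roles', 35),
    ('Topic -- main', 50),
    ('Topic -- main', 25),
    ('Topic -- main', 23),
    ('Topic -- goals', 27),
    ('Topic -- aspects', 64),
    ('Topic -- aspects', 33),
    ('Discussion -- challenges', 28),
]

# Sum the cluster sizes of all centroids that belong to the same group.
group_totals = Counter()
for group_name, cluster_size in rows:
    group_totals[group_name] += cluster_size

# Print groups from largest to smallest total.
for group_name, total in group_totals.most_common():
    print(f'{group_name}: {total}')
```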


In [30]:
show_deduplicated_qas('../data/annotations/questions/pubmed-sample-clusters-manual.xlsx')

| | question | group_name | cluster_size |
|---|---|---|---|
| 0 | -- | NOISE | 609 |
| 1 | What is the main focus of the study? | Topic | 26 |
| 2 | Which population is the study focused on? | Population | 53 |
| 3 | Which patient population is the study concerned with? | Population | 39 |
| 4 | What condition is being addressed in the study? | Population -- Condition | 37 |
| 5 | What is the participant demographic or characteristics in the study? | Population -- demographics | 33 |
| 6 | What was the main intervention used in the study? | Intervention | 44 |
| 7 | What is the primary intervention being studied? | Intervention | 29 |
| 8 | What is the main treatment or intervention being evaluated? | Intervention | 36 |
| 9 | What is the main intervention or treatment studied? | Intervention | 34 |


In [29]:
show_deduplicated_qas('../data/annotations/questions/cs-cl-clusters-manual.xlsx')

| | question | group_name | cluster_size |
|---|---|---|---|
| 0 | -- | NOISE | 541 |
| 1 | What main topic is the document addressing? | Main topic | 71 |
| 2 | What is the main focus of the research? | Main topic | 49 |
| 3 | What is the main focus of the study? | Main topic | 31 |
| 4 | What is the main focus of the document? | Main topic | 28 |
| 5 | What domain or area does the research emphasize? | Main topic | 19 |
| 6 | What are the main approaches or techniques discussed in the document? | Main techniques | 96 |
| 7 | What is the main technique or method discussed? | Main techniques | 48 |
| 8 | Which specific technologies or methodologies are mentioned? | Main techniques | 38 |
| 9 | What recent advancements or innovations are highlighted in the document? | Recent advances | 92 |


In [27]:
show_deduplicated_qas('../data/annotations/questions/astro-ph-clusters-manual.xlsx')

| | question | group_name | cluster_size |
|---|---|---|---|
| 0 | -- | NOISE | 273 |
| 1 | What is the main focus of the study? | Topic - main | 37 |
| 2 | What is the primary focus or issue discussed? | Topic - main | 29 |
| 3 | What specific phenomena or processes are being investigated in the study? | Topic - phenomena processes | 109 |
| 4 | What broader context or field does the study contribute to or address? | Topic -- context | 15 |
| 5 | What specific challenges or limitations does the study address or identify? | Challenges and limitations | 66 |
| 6 | What are the main challenges or limitations identified? | Challenges and limitations | 34 |
| 7 | What methodology or techniques are employed in the study? | Method | 99 |
| 8 | What methodologies or techniques are discussed? | Method | 56 |
| 9 | What comparisons are made within the study? | Method -- comparisons | 25 |
