## Generate synthetic clinical summaries for MCQ context

Goal:
- For each group of questions that share the same `focus`, generate one compact clinical summary
- Summary format: a plain-text paragraph that simulates a patient vignette
- Use llama3.1 to generate each summary from the grouped QA pairs
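
The internals of `SummaryGenerator` are not shown in this notebook. As a minimal sketch of how a group of QA pairs could be flattened into one prompt for the model, the helper below is purely illustrative: `build_summary_prompt`, its `max_chars_per_answer` parameter, and the prompt wording are all assumptions, not the code in `src.summary_generator`.

```python
from typing import Iterable, Tuple


def build_summary_prompt(
    qas: Iterable[Tuple[str, str]],
    max_chars_per_answer: int = 1500,  # hypothetical cap to keep prompts short
) -> str:
    """Hypothetical prompt builder; NOT the actual SummaryGenerator logic."""
    blocks = []
    for i, (question, answer) in enumerate(qas, start=1):
        # Collapse newlines and repeated spaces so each answer is one line.
        answer = " ".join(str(answer).split())
        if len(answer) > max_chars_per_answer:
            answer = answer[:max_chars_per_answer].rstrip() + " [truncated]"
        blocks.append(f"Q{i}: {question}\nA{i}: {answer}")

    qa_text = "\n\n".join(blocks)
    return (
        "You are given question-answer pairs about one medical topic.\n"
        "Write a compact clinical summary as a single plain-text paragraph, "
        "in the style of a patient vignette. Use only facts from the pairs.\n\n"
        f"{qa_text}\n\nSummary:"
    )


# Toy pairs for illustration only; real input would be focus_groups[<focus>].
toy_prompt = build_summary_prompt(
    [("What is X?", "X is a  condition.\nIt is common."), ("How is X treated?", "With rest.")]
)
```

Capping each answer is one possible way to keep long groups within the model's context window; whether the real generator truncates, chunks, or passes everything through is not stated here.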

In [21]:
import sys
from pathlib import Path

# Add the root directory to sys.path
root_path = Path().resolve().parent
sys.path.append(str(root_path))

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [50]:
import os
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
from itertools import islice
from src.summary_generator import SummaryGenerator

In [23]:
IN_PATH = "../data/parsed_csv/qa_clean.csv"
OUT_PATH = "../data/synthetic_contexts/summary_per_focus.csv"

In [24]:
qa_df = pd.read_csv(IN_PATH)
qa_df

,question_id,document_id,source,focus,question_type,question,answer
0,0000309-1,0000309,MPlusHealthTopics,E. Coli Infections,information,What is (are) E. Coli Infections?,E. coli is the name of a type of bacteria that...
1,0000139-1,0000139,MPlusHealthTopics,Calcium,information,Do you have information about Calcium,Summary: You have more calcium in your body th...
2,0000794-1,0000794,MPlusHealthTopics,Scabies,information,What is (are) Scabies?,Scabies is an itchy skin condition caused by t...
3,0000410-1,0000410,MPlusHealthTopics,Granulomatosis with Polyangiitis,information,What is (are) Granulomatosis with Polyangiitis?,"Granulomatosis with polyangiitis (GPA), previo..."
4,0000677-1,0000677,MPlusHealthTopics,Ozone,information,Do you have information about Ozone,Summary: Ozone is a gas. It can be good or bad...
...,...,...,...,...,...,...,...
16349,0000009-4,0000009,NHLBI,Asbestos-Related Lung Diseases,susceptibility,Who is at risk for Asbestos-Related Lung Disea...,"Until the late 1970s, asbestos was widely used..."
16350,0000009-5,0000009,NHLBI,Asbestos-Related Lung Diseases,symptoms,What are the symptoms of Asbestos-Related Lung...,The signs and symptoms of asbestos-related lun...
16351,0000009-6,0000009,NHLBI,Asbestos-Related Lung Diseases,exams and tests,How to diagnose Asbestos-Related Lung Diseases?,Your doctor will diagnose an asbestos-related ...
16352,0000009-7,0000009,NHLBI,Asbestos-Related Lung Diseases,treatment,What are the treatments for Asbestos-Related L...,No treatments can reverse the effects of asbes...


In [25]:
# Collect (question, answer) pairs under their shared focus
focus_groups = defaultdict(list)
for _, row in qa_df.iterrows():
    focus_groups[row['focus']].append((row['question'], row['answer']))

print(len(focus_groups))
print(focus_groups['Sialadenitis'])

5125
[('What is (are) Sialadenitis?', 'Sialadenitis is an infection of the salivary glands. It is usually caused by a virus or bacteria. The parotid (in front of the ear) and submandibular (under the chin) glands are most commonly affected. Sialadenitis may be associated with pain, tenderness, redness, and gradual, localized swelling of the affected area. There are both acute and chronic forms. Although it is quite common among elderly adults with salivary gland stones, sialadenitis can also occur in other age groups, including infants during the first few weeks of life. Without proper treatment, sialadenitis can develop into a severe infection, especially in people who are debilitated or elderly.'), ('What are the symptoms of Sialadenitis?', 'What are the signs and symptoms of sialadenitis? Signs and symptoms of sialadenitis may include fever, chills, and unilateral pain and swelling in the affected area. The affected gland may be firm and tender, with redness of the overlying skin. P
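
The grouping above walks the frame row by row with `iterrows`. A possible alternative, sketched here on a toy frame rather than the real `qa_df`, is to let pandas do the grouping. The function name `group_qa_by_focus` is hypothetical, and one behavioral difference is worth noting: `groupby` drops rows with a missing `focus` by default, whereas the loop above would keep them under a NaN key.

```python
import pandas as pd

# Toy frame standing in for qa_df (the real one is loaded from IN_PATH).
toy_df = pd.DataFrame(
    {
        "focus": ["A", "B", "A"],
        "question": ["q1", "q2", "q3"],
        "answer": ["a1", "a2", "a3"],
    }
)


def group_qa_by_focus(df: pd.DataFrame) -> dict:
    """Map each focus to its list of (question, answer) tuples."""
    groups = {}
    # sort=False keeps foci in order of first appearance, like the loop above.
    # Rows whose focus is NaN are skipped (groupby's default dropna=True).
    for focus, part in df.groupby("focus", sort=False):
        groups[focus] = list(zip(part["question"], part["answer"]))
    return groups


toy_groups = group_qa_by_focus(toy_df)
```

With about 16,000 rows either approach finishes quickly, so this is mainly a readability choice.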

In [48]:
summary_generator = SummaryGenerator()

results = []

# Summarize only the first 100 focus groups; each call takes roughly 90 s
for focus, qas in tqdm(islice(focus_groups.items(), 0, 100)):
    summary = summary_generator.generate_summary(qas)
    results.append({"focus": focus, "summary": summary})

100it [2:32:20, 91.41s/it]
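
At the observed 91.41 s per group, covering all 5125 focus groups would take roughly 130 hours, and the loop above keeps every result in memory until the end. One way to make a long run survive interruptions is to append each summary to the CSV as soon as it is produced and skip foci that are already saved. The sketch below assumes this approach; `summarize_resumable` and its parameters are hypothetical names, and `generate` stands for any callable with the same shape as `summary_generator.generate_summary`.

```python
import csv
import os
from typing import Callable, Dict, List, Optional, Tuple

QAPairs = List[Tuple[str, str]]


def summarize_resumable(
    focus_groups: Dict[str, QAPairs],
    generate: Callable[[QAPairs], str],
    out_path: str,
    limit: Optional[int] = None,
) -> int:
    """Append one (focus, summary) row per group, skipping saved foci.

    Returns the number of rows written during this call.
    """
    # Read back any foci that a previous run already saved.
    done = set()
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            done = {row["focus"] for row in csv.DictReader(f)}

    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Write the header only when starting a fresh (or empty) file.
    write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    written = 0
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["focus", "summary"])
        if write_header:
            writer.writeheader()
        for focus, qas in focus_groups.items():
            if focus in done:
                continue
            if limit is not None and written >= limit:
                break
            writer.writerow({"focus": focus, "summary": generate(qas)})
            f.flush()  # push the row to disk before the next slow call
            written += 1
    return written
```

In this notebook the call might look like `summarize_resumable(focus_groups, summary_generator.generate_summary, OUT_PATH, limit=100)`; rerunning it would then continue with the next 100 unsaved groups. The output has the same `focus` and `summary` columns as the `to_csv(OUT_PATH, index=False)` call used here.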


In [51]:
os.makedirs(os.path.dirname(OUT_PATH), exist_ok=True)
sum_df = pd.DataFrame(results)
sum_df.to_csv(OUT_PATH, index=False)
print(f"Saved: {len(sum_df)} summaries → {OUT_PATH}")
sum_df

Saved: 100 summaries → ../data/synthetic_contexts/summary_per_focus.csv


,focus,summary
0,E. Coli Infections,E. coli infections are caused by certain types...
1,Calcium,Calcium is an essential mineral that plays a c...
2,Scabies,Scabies is an itchy skin condition caused by t...
3,Granulomatosis with Polyangiitis,Granulomatosis with polyangiitis (GPA) is a ra...
4,Ozone,Ozone is a gas that exists in two forms: good ...
...,...,...
95,Indoor Air Pollution,Indoor Air Pollution poses a significant healt...
96,Kidney Cancer,Kidney cancer is a type of cancer that develop...
97,Oxygen Therapy,Oxygen therapy is a treatment that provides ex...
98,Salmonella Infections,Salmonella infections are a common cause of fo...


In [47]:
# Example: summarize a single focus group
summary_generator = SummaryGenerator()
summary_generator.generate_summary(focus_groups['Sialadenitis'])

'Sialadenitis is an infection of the salivary glands that can cause pain, tenderness, redness, and swelling in the affected area. It often occurs in elderly adults with salivary gland stones, but can also occur in other age groups, including infants. The symptoms may include fever, chills, unilateral pain and swelling, and pus draining through the gland into the mouth.\n\nThe infection is usually caused by bacteria such as Staphylococcus aureus or viruses like mumps virus. Certain conditions, including hyposecretion of saliva, duct obstruction, dehydration, malnutrition, immunosuppression, and radiation therapy to the oral cavity, can increase the risk of developing sialadenitis.\n\nTreatment typically involves antibiotics active against S. aureus, hydration, and good oral hygiene. Abscesses may need to be drained, and in some cases, surgical excision of the affected gland may be necessary. Early treatment is essential to prevent complications, especially in debilitated or elderly indi

In [52]:
qa_df['question_type'].unique()

array(['information', 'symptoms', 'causes', 'inheritance',
       'exams and tests', 'treatment', 'complications', 'prevention',
       'considerations', 'susceptibility', 'frequency', 'outlook',
       'research', 'support groups', 'genetic changes', 'stages'],
      dtype=object)