# Automated Question-Answer Generation from Pharmaceutical Data

This notebook generates question-answer (QA) pairs from pharmaceutical data sources, including:
1. PubMed target passages
2. DrugBank tables
3. Related pharmaceutical data

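Each QA pair includes four fields:
- Question
- Answer
- Source text/passage (recorded as the passage ID, or "None" if the passage was not used)
- Related table (recorded as the table name, or "None" if no table was used)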
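
As a rough illustration of the record described above, the sketch below models one QA pair as a dataclass. The `QAPair` class is hypothetical and is not used elsewhere in this notebook, which stores each pair as a plain dict with the same four keys. The example values are taken from the sample output further down.

```python
from dataclasses import dataclass, asdict


@dataclass
class QAPair:
    """Hypothetical container mirroring the four fields of one QA pair."""
    question: str
    answer: str
    text: str = "None"   # passage ID, or "None" if the passage was not used
    table: str = "None"  # table name, or "None" if no table was used


example = QAPair(
    question="What is the preferred route of administration for Flumazenil?",
    answer="The preferred route of administration for Flumazenil is intravenous.",
    text="Target-8693922",
)

# asdict() yields the same key layout that the notebook writes to its output file.
print(asdict(example))
```

Defaulting `text` and `table` to the string "None" follows the convention the prompt asks the model to use for unused sources.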

In [1]:
# Import required libraries
import os
import pandas as pd
import json
from pathlib import Path
from openai import OpenAI
import csv
from tqdm import tqdm
from collections import defaultdict

In [2]:
# Configuration and paths
PUBMED_TARGETS_DIR = 'inputs/pubmed-targets'
DRUGBANK_TABLES_DIR = 'inputs/drugbank-tables'
MAPPING_FILE = 'inputs/pubmed-drugbank-tables.gt'
OUTPUT_FILE = 'test_output.gt'

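# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
# Set it in your shell, or uncomment the line below, before running this cell.
# os.environ["OPENAI_API_KEY"] = "..."
client = OpenAI()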


In [7]:
def load_passage_table_mapping():
    """Load the mapping between passages and their relevant tables"""
    mapping = defaultdict(list)
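    with open(MAPPING_FILE, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, which would break the unpacking below
            # Split on the first comma only: "<passage_id>,<table_name>"
            passage_id, table_name = line.split(',', 1)
            mapping[passage_id].append(table_name)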
    # print("mapping", mapping)
    return mapping

def load_target_passages():
    """Load all target passages from the pubmed-targets directory"""
    passages = {}
    target_files = Path(PUBMED_TARGETS_DIR).glob('Target-*')
    
    for file_path in target_files:
        target_id = file_path.name
        with open(file_path, 'r') as f:
            passages[target_id] = f.read()
    # print("passages", passages)
    return passages

def load_drugbank_tables():
    """Load all relevant DrugBank tables"""
    tables = {}
    csv_files = Path(DRUGBANK_TABLES_DIR).glob('*.csv')
    
    for file_path in csv_files:
        table_name = file_path.stem
        tables[table_name] = pd.read_csv(file_path)
    
    # print("tables", tables)
    return tables
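
The mapping loader assumes one `passage_id,table_name` pair per line. The sketch below shows that parsing step on an in-memory sample, so it runs without the input files. The helper name `parse_mapping_lines` is hypothetical, and the sample lines are an assumption about the file layout, inferred from the mapping printed in the output further down.

```python
from collections import defaultdict


def parse_mapping_lines(lines):
    """Group table names by passage ID from 'passage_id,table_name' lines."""
    mapping = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        # Split on the first comma only, so the table name is kept whole.
        passage_id, table_name = line.split(",", 1)
        mapping[passage_id.strip()].append(table_name.strip())
    return mapping


# Assumed file layout: one pair per line, several lines per passage.
sample = [
    "Target-20210689,drugbank-drugs_links.csv\n",
    "Target-20210689,drugbank-drug.csv\n",
    "\n",
    "Target-12970383,drugbank-targets.csv\n",
]
mapping = parse_mapping_lines(sample)
print(dict(mapping))
```

Note that the mapped names keep their `.csv` extension, while the table loader keys tables by file stem, which is why the extension is stripped before lookup in the next cell.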

In [44]:
def get_relevant_table_content(tables, table_names, max_rows=5):
    """Extract relevant content from tables for context"""
    print("Debug - Available tables:", tables.keys())
    print("Debug - Looking for table_names:", table_names)

    table_content = {}
    for table_name in table_names:
        # Remove .csv extension if present
        base_table_name = table_name.replace('.csv', '')
        
        if base_table_name in tables:
            df = tables[base_table_name]
            table_content[base_table_name] = {
                'columns': list(df.columns),
                'sample': df.head(max_rows).to_dict('records')
            }
        else:
            print(f"Debug - Table '{base_table_name}' not found in available tables")
    return table_content

def generate_questions_for_passage(passage_id, passage_text, tables, relevant_table_names, model="gpt-4"):
    """Generate questions for a given passage and its relevant tables using LLM"""

    # print("tables", tables)
    # print("relevant_table_names", relevant_table_names)
    
    # Truncate long passages to the first 1000 characters to keep the prompt short
    if len(passage_text) > 1000:
        passage_text = passage_text[:1000] + "..."
    
    # Limit to maximum 3 relevant tables
    relevant_table_names = relevant_table_names[:3]
    
    # Get relevant table content
    table_content = get_relevant_table_content(tables, relevant_table_names)

    
    prompt = f"""
    Given the following passage and related tables, generate 3-5 meaningful question-answer pairs.
    Each question should be answerable using information from either the passage or tables or both.
    Focus on pharmaceutical and medical aspects, similar to these example formats:
    - "What is the mechanism of action of [drug]?"
    - "What are the different dosage levels of [drug]?"
    - "For which conditions is [drug] approved?"
    - "What are the key interactions of [drug]?"
    
    Passage (ID: {passage_id}):
    {passage_text}
    
    Related Tables:
    {json.dumps(table_content, indent=2)}
    
    Generate questions in the following format:
    1. question: [specific question about drug/treatment]
       answer: [detailed answer combining information from passage and/or tables]
       text: [passage ID if information from passage was used, "None" if not used]
       table: [table name if information from table was used, "None" if not used]
    """
    
    # Send the prompt to the chat completions endpoint
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a medical and pharmaceutical expert tasked with generating detailed question-answer pairs about drugs and treatments."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )

    response_text = response.choices[0].message.content
    print(response_text)
    
    # Parse the response into structured QA pairs
    qa_pairs = parse_llm_response(response_text, passage_id)
    return qa_pairs

def parse_llm_response(response_text, passage_id):
    """Parse the LLM response into structured QA pairs"""
    qa_pairs = []
    
    # Split the response into individual QA entries
    entries = response_text.strip().split('\n\n')
    
    for entry in entries:
        if not entry.strip():
            continue
            
        lines = entry.strip().split('\n')
        current_qa = {
            'question': '',
            'answer': '',
            'text': passage_id,
            'table': 'None'  # Default value
        }
        
        for line in lines:
            line = line.strip()
            # Skip empty lines and numbering
            if not line or line.replace('.', '').strip().isdigit():
                continue
                
            # Parse "<field>: <value>" lines, ignoring any leading "1. " numbering.
            # Matching the key at the start of the line avoids misreading an
            # answer that happens to contain "question:" or "table:".
            key, sep, value = line.partition(':')
            key = key.lstrip('0123456789. ').lower()
            if not sep or key not in current_qa:
                continue
            value = value.strip()
            if key == 'table' and value.upper() in ['NA', 'N/A', 'NONE']:
                value = 'None'  # Handle NA, N/A, None cases
            current_qa[key] = value
        
        # Only add complete QA pairs that have both question and answer
        if current_qa['question'] and current_qa['answer']:
            qa_pairs.append(current_qa.copy())  # Use copy to avoid reference issues
    
    return qa_pairs
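
To make the expected response layout concrete, the sketch below parses a single QA entry with a regular expression. It is a standalone variant, not the notebook's own parser: `FIELD_RE` and `parse_qa_block` are hypothetical names. The sample entry is copied from the output further down, except that its table value is changed to `N/A` here to exercise the normalization step.

```python
import re

# Matches an optional "1. " prefix followed by one of the four field names.
FIELD_RE = re.compile(r"^\s*(?:\d+\.\s*)?(question|answer|text|table):\s*(.*)$")


def parse_qa_block(block, default_text):
    """Parse one 'question/answer/text/table' entry into a dict."""
    qa = {"question": "", "answer": "", "text": default_text, "table": "None"}
    for line in block.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            qa[key] = value.strip()
    # Normalize the different ways a model may write "no table used".
    if qa["table"].upper() in {"NA", "N/A", "NONE"}:
        qa["table"] = "None"
    return qa


sample = """1. question: What is the preferred route of administration for Flumazenil?
   answer: The preferred route of administration for Flumazenil is intravenous.
   text: Target-8693922
   table: N/A"""

qa = parse_qa_block(sample, default_text="Target-8693922")
print(qa)
```

Anchoring the match at the start of the line means a field name that appears inside an answer is not mistaken for a new field. Like the notebook's parser, this sketch drops any answer text that wraps onto a second line.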

In [54]:
def main():
    # Load mappings and data
    print("Loading passage-table mappings...")
    passage_table_mapping = load_passage_table_mapping()
    
    print("Loading target passages...")
    passages = load_target_passages()
    
    print("Loading DrugBank tables...")
    tables = load_drugbank_tables()
    
    # Initialize output list
    qa_pairs = []
    
    # Process each passage with its relevant tables.
    # Only the first 5 passages are processed here; remove the [:5] slice for a full run.
    for passage_id, passage_text in tqdm(list(passages.items())[:5]):
        if passage_id in passage_table_mapping:
            relevant_tables = passage_table_mapping[passage_id]
            # print(passage_id, passage_text, tables, relevant_tables)
            
            # Generate QA pairs using the passage and its relevant tables
            new_qa_pairs = generate_questions_for_passage(
                passage_id,
                passage_text,
                tables,
                relevant_tables
            )
            # new_qa_pairs = []
            print(new_qa_pairs)
            qa_pairs.extend(new_qa_pairs)
        else:
            print(f"No table mapping found for passage {passage_id}")
    
    # Save results
    with open(OUTPUT_FILE, 'w', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(['question', 'answer', 'text', 'table'])  # header
        for qa_pair in qa_pairs:
            writer.writerow([
                qa_pair['question'],
                qa_pair['answer'],
                qa_pair['text'],
                qa_pair['table']
            ])
        
    print(f"Generated {len(qa_pairs)} question-answer pairs")
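
The results file is a fully quoted CSV with a header row, so it can be read back with `csv.DictReader`. The sketch below checks that round trip in memory, using an `io.StringIO` buffer in place of the real output file. The sample row is abbreviated from the output further down.

```python
import csv
import io

FIELDS = ["question", "answer", "text", "table"]

rows = [
    {
        "question": "What is the mechanism of action of dicyclomine, "
                    "trihexyphenidyl, pirenzepine and atropine?",
        "answer": "Dicyclomine, trihexyphenidyl, pirenzepine, and atropine "
                  "are muscarinic receptor antagonists.",
        "text": "Target-2432979",
        "table": "None",
    }
]

# Write with the same settings as the notebook: every field quoted, header first.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(FIELDS)
for row in rows:
    writer.writerow([row[name] for name in FIELDS])

# Read the data back; the commas inside the quoted fields survive intact.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["answer"])
```

Quoting every field matters here because generated answers routinely contain commas, which would otherwise be read as column separators.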

In [55]:
if __name__ == "__main__":
    main()

Loading passage-table mappings...
mapping defaultdict(<class 'list'>, {'Target-20210689': ['drugbank-drugs_links.csv', 'drugbank-drug.csv', 'drugbank-targets.csv', 'drugbank-drug_mixtures.csv', 'drugbank-drug_pharmacology.csv', 'drugbank-drug_drug_interactions.csv', 'drugbank-drug_reactions.csv', 'drugbank-targets_polypeptides.csv'], 'Target-12970383': ['drugbank-drugs_links.csv', 'drugbank-drug.csv', 'drugbank-targets.csv', 'drugbank-drug_mixtures.csv', 'drugbank-drug_pharmacology.csv', 'drugbank-drug_drug_interactions.csv', 'drugbank-drug_reactions.csv', 'drugbank-targets_polypeptides.csv'], 'Target-15277270': ['drugbank-drugs_links.csv', 'drugbank-drug.csv', 'drugbank-targets.csv', 'drugbank-drug_mixtures.csv', 'drugbank-drug_pharmacology.csv', 'drugbank-drug_drug_interactions.csv', 'drugbank-drug_reactions.csv', 'drugbank-targets_polypeptides.csv'], 'Target-15955613': ['drugbank-drugs_links.csv', 'drugbank-drug.csv', 'drugbank-targets.csv', 'drugbank-drug_mixtures.csv', 'drugbank-d

  tables[table_name] = pd.read_csv(file_path)
  0%|          | 0/5 [00:00<?, ?it/s]

Debug - Available tables: dict_keys(['drugbank-drug_trans_links', 'drugbank-enzymes_polypeptides_ext_id', 'drugbank-drug_manufacturers', 'drugbank-drug_sequences', 'drugbank-drug_external_links', 'drugbank-drug_international_brands', 'drugbank-drug_mixtures', 'drugbank-drug_packagers', 'drugbank-drug_reactions_enzymes', 'drugbank-drug_enzymes_articles', 'drugbank-targets', 'drugbank-drug_carriers_textbooks', 'drugbank-drug_trans_textbooks', 'drugbank-carriers_polypeptides_ext_id', 'drugbank-drugs_attachments', 'drugbank-drug_carriers_articles', 'drugbank-enzymes_polypeptides', 'drugbank-drug_trans_articles', 'drugbank-drug_experimental_properties', 'drugbank-transporters_polypeptides_ext_id', 'drugbank-drug_dosages', 'drugbank-drug_syn', 'drugbank-drug_targ_textbooks', 'drugbank-drug_enzymes_attachments', 'drugbank-drug_carriers_links', 'drugbank-drug_salts', 'drugbank-drugs_articles', 'drugbank-targets_actions', 'drugbank-drug_patents', 'drugbank-drug_affected_organisms', 'drugbank-qu

 20%|██        | 1/5 [00:14<00:59, 14.84s/it]

1. question: What is the mechanism of action of Flumazenil?
   answer: Flumazenil is a specific competitive antagonist at benzodiazepine receptors, which are associated with receptors for gamma-aminobutyric acid, the most important inhibitory neurotransmitter in the central nervous system. Its usual clinical role is to reverse the effects of benzodiazepine sedation; however, administered before, or with, other benzodiazepines, it modifies their effects. Flumazenil also reverses adverse physiological effects of benzodiazepines.
   text: Target-8693922
   table: None

2. question: What is the preferred route of administration for Flumazenil?
   answer: The preferred route of administration for Flumazenil is intravenous.
   text: Target-8693922
   table: None

3. question: What are the clinical indications for the use of Flumazenil?
   answer: The clinical indications for Flumazenil include reversal of benzodiazepine-induced sedation, termination of benzodiazepine-induced anaesthesia, ret

100%|██████████| 5/5 [00:40<00:00,  8.06s/it]

1. question: What is the mechanism of action of dicyclomine, trihexyphenidyl, pirenzepine and atropine?
   answer: Dicyclomine, trihexyphenidyl, pirenzepine, and atropine are muscarinic receptor antagonists. They demonstrate high affinity for the M1 muscarinic receptor subtype. In competition experiments, these drugs show low affinity for cardiac receptors and intermediate affinity for glandular receptors. Thus, these drugs differentiate between the M1 (cortical) and the peripheral muscarinic subtypes (cardiac and glandular). However, atropine displays similar affinities for either subtype. 
   text: Target-2432979
   table: None

2. question: What are the differences in the selectivity profiles of dicyclomine, trihexyphenidyl, pirenzepine and atropine?
   answer: Dicyclomine, trihexyphenidyl, and pirenzepine have the highest affinity for the M1 muscarinic receptor subtype as revealed in competition experiments against [3H]-pirenzepine labelling of cortical membranes. Their affinity va


