<a href="https://colab.research.google.com/github/msquareddd/ai-engineering-notebooks/blob/main/document_processor_for_qa_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Processor for Q&A Generation

This notebook processes documents (DOCX, PPTX, PDF) and generates question-answer pairs using Hugging Face models.

## How to use:
1. Run the setup cells to install dependencies and import libraries
2. Configure your model settings
3. Upload your document
4. Generate Q&A pairs
5. Download the results as JSON

## 1. Setup and Installation

In [None]:
# Check if running in Google Colab
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Not running in Google Colab")

# Install required packages
print("Installing required packages...")
!pip install -q docling transformers torch accelerate bitsandbytes triton kernels
!pip install -q python-docx
!pip install -q tqdm

if IN_COLAB:
    print("Installing additional Colab-specific packages...")
    !pip install -q ipywidgets

print("All packages installed successfully!")

In [None]:
# Import necessary libraries
import os
import json
import re
import io
import torch
from datetime import date
from pathlib import Path
from tqdm.notebook import tqdm
import warnings
import bitsandbytes
warnings.filterwarnings("ignore")

# Document processing
from docling.document_converter import DocumentConverter

# Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

# Colab specific imports
if IN_COLAB:
    from google.colab import files
    import ipywidgets as widgets
    from IPython.display import display, JSON, clear_output

print("Libraries imported successfully!")

## 2. Configuration and Model Selection

In [None]:
# System prompt for Q&A generation
SYS_PROMPT = """
IMPORTANT: Your response must be raw JSON without any markdown formatting. Do NOT use ```json or ``` tags. Start your response directly with [ and end with ].
IMPORTANT: Generate at least 15-20 Q&A pairs per document section.
IMPORTANT: You **must** create specific Q&A pairs for any mention of key business entities if they are present in the document. This includes, but is not limited to: **project names, supplier names, client names, part numbers, product info, and other critical business-specific identifiers**
IMPORTANT: Always reference the project name or client name in every Q&A pair.

You are a specialized AI assistant expert in creating high-quality training datasets. Your purpose is to generate question-and-answer pairs from a given document to be used for fine-tuning a large language model.

Your task is to read the provided document in Markdown format and generate a series of varied, high-quality, and natural-sounding question-and-answer pairs. Your entire response must be a single, valid JSON array and nothing else.

### Your Guiding Principles

1.  **Language Check and Translation**: Your first step is to identify the language of the provided document. If the document is not in English, you must translate the entire text into high-quality, fluent English before proceeding. All subsequent steps will be performed on this translated English version.
2.  **Strictly Grounded**: All questions must be answerable from the provided text, and all answers must be derived exclusively from it. Do not use any external knowledge or make assumptions beyond what is written.
3.  **High-Quality and Natural Language**: The Q&A pairs should be well-written, grammatically correct, and phrased in a way that a human would naturally ask and answer them.
4.  **Comprehensive Coverage**: Aim to create questions that cover the main topics, key entities, processes, and conclusions mentioned in the document.
5.  **Reference Components**: When discussing test results, always include the names of the parts tested in the Q&A pair.
6.  **Reference Project Name/Client**: Always reference the project name or client name in every Q&A pair.

### Instructions for Generating Questions

-   **Prioritize Key Entities**: You **must** create specific Q&A pairs for any mention of key business entities if they are present in the document. This includes, but is not limited to: **project names, supplier names, client names, part numbers, product info, and other critical business-specific identifiers**. This information is highly valuable for the fine-tuning process.
-   **Diverse Question Types**: Generate a mix of questions, including:
    -   **Factual Recall**: Questions that ask for specific details, definitions, names, or numbers (e.g., "What is the name of the framework?", "When was the company founded?").
    -   **Summarization**: Questions that require summarizing a paragraph or a concept (e.g., "What are the main advantages of this approach?", "Can you summarize the findings of the study?").
    -   **Inferential**: Questions that require connecting information from different parts of the document (e.g., "Based on the challenges mentioned, what was the likely reason for the project's delay?").
    -   **Procedural**: Questions that ask to explain a process or a sequence of steps (e.g., "How does the system authenticate a user?").
-   **Varied Phrasing**: Avoid repetitive question structures. Use different ways to ask about the same or similar information to ensure a diverse dataset. For example, instead of only asking "What is X?", use variations like "Can you explain what X is?", "Describe the purpose of X.", or "What role does X play?".

### Instructions for Generating Answers

-   **Concise and Accurate**: Answers should be direct, to the point, and accurately reflect the information in the source text.
-   **Natural Tone**: Write answers in a clear, conversational, and informative tone.
-   **Self-Contained**: Each answer should make sense on its own without needing to read the entire source document.
-   **No New Information**: Do not add any information, context, or explanations that are not explicitly present in the provided text.
-   **Avoid Talking About Images**: If the document refers to points or results that are shown only in charts or images, do not include them.

### Constraints & Output Format

-   **No Ambiguity**: If a section of the text is ambiguous or lacks sufficient detail to form a clear question and answer, skip it.
-   **No Opinions**: Do not generate questions or answers that are subjective or require an opinion, unless the document itself presents an opinion and the question is about that stated opinion (e.g., "What was the author's main criticism of the policy?").
-   **Strict JSON Output**: Your entire response **MUST** be only a single, valid JSON array. Do not include any introductory text (like "Here is the JSON you requested:"), explanations, or concluding remarks. **CRITICAL: Do NOT wrap your response in markdown code blocks (```json ... ```).** Your response must start directly with the opening bracket `[` and end with the closing bracket `]`. The root of the JSON must be a list, where each element is an object containing two string keys: `"question"` and `"answer"`.

### Enhanced Detail Instructions
- **Multi-level Questions**: Generate questions at different depth levels:
  - Surface-level: "What is X?"
  - Detailed: "Can you explain the key components and functions of X?"
  - Analytical: "How does X compare to Y in terms of performance and cost?"
- **Follow-up Questions**: Create question chains where answers to one question lead to more detailed follow-ups
- **Contextual Relationships**: Generate questions that explore relationships between different concepts in the document

### Quantity Targets
- **Minimum Output**: Generate at least 15-20 Q&A pairs per document section
- **Depth Coverage**: For each major concept, create 3-4 questions exploring different aspects
- **Cross-references**: Include questions that connect information from different sections

### Example

**If the input document is:**
"Project 'Phoenix' was initiated for our client, 'Innovate Corp', to overhaul their logistics. The main supplier for the new hardware is 'Tech Solutions Inc.'. The key component is the 'Sensor Model T-1000'. Two components were tested, component 1 and component 2. Stacks with smaller angular subdivision show reduced performance in polarization, and consequently in permeability, and in core loss values."

**Your output MUST be ONLY the following JSON content:**
[
    {
        "question": "What is the name of the project initiated for Innovate Corp?",
        "answer": "The project is named 'Phoenix'."
    },
    {
        "question": "Who is the client for the 'Phoenix' project?",
        "answer": "The client for the 'Phoenix' project is 'Innovate Corp'."
    },
    {
        "question": "In the Phoenix project, which company is the main supplier for the new hardware?",
        "answer": "In the Phoenix project, the main supplier for the new hardware is 'Tech Solutions Inc.'."
    },
    {
        "question": "In the Phoenix project, what is the name of the key component mentioned in the document?",
        "answer": "In the Phoenix project, the key component is the 'Sensor Model T-1000'."
    },
    {
        "question": "In the Phoenix project, which components were tested and what did the tests show about stacks with smaller angular subdivision?",
        "answer": "In the Phoenix project, component 1 and component 2 were tested. Stacks with smaller angular subdivision showed reduced performance in polarization, and consequently in permeability, and in core loss values."
    }
]
Now, carefully read the document provided below and generate the Q&A pairs according to these instructions.
"""

print("System prompt loaded successfully!")
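
The output contract described in the prompt (a JSON list whose elements are objects with string `"question"` and `"answer"` keys) can be checked before any results are saved. The following is a minimal sketch; `validate_qa_pairs` is a hypothetical helper and is not part of this notebook.

```python
def validate_qa_pairs(data):
    """Return a list of problems found; an empty list means the data matches the contract."""
    if not isinstance(data, list):
        return [f"root must be a list, got {type(data).__name__}"]
    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected an object, got {type(item).__name__}")
            continue
        # Both keys must be present and hold non-empty strings
        for key in ("question", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: '{key}' must be a non-empty string")
    return problems


sample = [
    {"question": "Who is the client for the 'Phoenix' project?",
     "answer": "The client for the 'Phoenix' project is 'Innovate Corp'."},
    {"question": "Incomplete pair"},  # missing "answer" on purpose
]
print(validate_qa_pairs(sample))
```

Running a check like this on the parsed response, before the download step, makes a malformed pair visible early instead of surfacing later as a `KeyError` in the preview or statistics code.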

In [None]:
# Model configuration
class ModelConfig:
    def __init__(self):
        self.available_models = {
            "Mistral-7B Instruct": "mistralai/Mistral-7B-Instruct-v0.2",
            "Llama-3.1 8B Instruct": "meta-llama/Llama-3.1-8B-Instruct",
            "Qwen2.5-14B-Instruct-1M" : "Qwen/Qwen2.5-14B-Instruct-1M",
            "Qwen2.5-14B-Instruct" : "Qwen/Qwen2.5-14B-Instruct",
            "Qwen2.5-14B" : "Qwen/Qwen2.5-14B",
            "GPT-OSS-20B": "openai/gpt-oss-20b",
        }

        # Default settings
        self.model_name = "Qwen2.5-14B-Instruct"
        self.model_id = self.available_models[self.model_name]
        self.temperature = 0.7
        self.max_new_tokens = 1024 * 3  # 3072 new tokens by default
        self.top_p = 0.9
        self.repetition_penalty = 1.1

    def update_model(self, model_name):
        if model_name in self.available_models:
            self.model_name = model_name
            self.model_id = self.available_models[model_name]
            return True
        return False

# Create model configuration instance
model_config = ModelConfig()

if IN_COLAB:
    # Create model selection dropdown
    model_dropdown = widgets.Dropdown(
        options=list(model_config.available_models.keys()),
        value=model_config.model_name,
        description='Select Model:',
        style={'description_width': 'initial'}
    )

    # Create parameter sliders
    temp_slider = widgets.FloatSlider(
        value=model_config.temperature,
        min=0.1,
        max=2.0,
        step=0.1,
        description='Temperature:',
        style={'description_width': 'initial'}
    )

    max_tokens_slider = widgets.IntSlider(
        value=model_config.max_new_tokens,
        min=512,
        max=8192,
        step=512,
        description='Max Tokens:',
        style={'description_width': 'initial'}
    )

    # Display configuration widgets
    print("Configure your model settings:")
    display(model_dropdown, temp_slider, max_tokens_slider)
else:
    print(f"Default model: {model_config.model_name}")
    print(f"Temperature: {model_config.temperature}")
    print(f"Max tokens: {model_config.max_new_tokens}")

## 3. Model Loading

In [None]:
# Update model configuration based on user selection
if IN_COLAB:
    model_config.update_model(model_dropdown.value)
    model_config.temperature = temp_slider.value
    model_config.max_new_tokens = max_tokens_slider.value

print(f"Loading model: {model_config.model_name} ({model_config.model_id})")
print("This may take a few minutes...")

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer and model
try:
    tokenizer = AutoTokenizer.from_pretrained(model_config.model_id)

    # Prepare keyword arguments for model loading
    model_kwargs = {
        "dtype": torch.float16 if device == "cuda" else torch.float32,
        "device_map": "auto" if device == "cuda" else None,
    }

    # Add quantization config only if not using GPT-OSS-20B
    if model_config.model_name != "GPT-OSS-20B":
        model_kwargs["quantization_config"] = quant_config

    model = AutoModelForCausalLM.from_pretrained(
        model_config.model_id,
        **model_kwargs
    )

    # Create text generation pipeline
    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=model_config.max_new_tokens,
        temperature=model_config.temperature,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    print(f"Model loaded successfully on {device}!")

except Exception as e:
    print(f"Error loading model: {str(e)}")
    print("Please try a different model or check your internet connection.")
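
The 4-bit configuration above matters mostly because of weight memory. As a rough back-of-the-envelope sketch, weight memory is about parameters × bits per weight / 8. The helper name `estimate_weight_memory_gb` is hypothetical, the 14B figure is the nominal size taken from the model name, and the estimate ignores quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 1e9


# Compare common precisions for a nominal 14B-parameter model
for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    gb = estimate_weight_memory_gb(14, bits)
    print(f"14B parameters at {label}: ~{gb:.1f} GB for weights")
```

Under these assumptions a 14B model needs roughly four times less weight memory at 4 bits than at float16, which is the difference between fitting and not fitting on many single-GPU runtimes.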

## 4. Document Upload

In [None]:
# Initialize document converter
converter = DocumentConverter()

# Function to upload and process document
def upload_document():
    if IN_COLAB:
        print("Please upload your document (DOCX, PPTX, PDF):")
        uploaded = files.upload()

        if not uploaded:
            raise ValueError("No file uploaded. Please select a file to continue.")

        # Get the uploaded file path
        file_path = list(uploaded.keys())[0]
        print(f"File uploaded: {file_path}")

        return file_path
    else:
        # For local execution, ask for the file path
        file_path = input("Enter the path to your document: ")
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")
        return file_path

# Function to convert document to markdown
def convert_to_markdown(file_path):
    print(f"Converting {file_path} to markdown...")

    try:
        result = converter.convert(file_path)
        markdown_content = result.document.export_to_markdown()
        print("Document converted successfully!")
        return markdown_content
    except Exception as e:
        print(f"Error converting document: {str(e)}")
        raise

# Upload and convert document
try:
    uploaded_file = upload_document()
    doc_markdown = convert_to_markdown(uploaded_file)

    # Display a preview of the converted content
    print("\nDocument preview (first 500 characters):")
    print("-" * 50)
    print((doc_markdown[:500] + "...") if len(doc_markdown) > 500 else doc_markdown)
    print("-" * 50)
    print(f"Total document length: {len(doc_markdown)} characters")

except Exception as e:
    print(f"Error: {str(e)}")
    print("Please try uploading a different document.")

## 5. Q&A Generation

In [None]:
# Utility functions for JSON parsing
def clean_json_response(response_text):
    """
    Clean JSON response that might be wrapped in markdown tags.
    """
    # Remove markdown code blocks if present
    json_pattern = r'```(?:json)?\s*(.*?)\s*```'
    match = re.search(json_pattern, response_text, re.DOTALL | re.IGNORECASE)

    if match:
        # Extract JSON from within markdown tags
        cleaned_json = match.group(1).strip()
        return cleaned_json
    else:
        # If no markdown tags found, return the original text
        return response_text.strip()

def parse_llm_response(response_text):
    """
    Parse LLM response, handling both raw JSON and markdown-wrapped JSON.
    """
    # First try to parse the response as-is
    try:
        return json.loads(response_text)
    except json.JSONDecodeError:
        # If that fails, try cleaning markdown tags and parsing again
        cleaned_response = clean_json_response(response_text)
        return json.loads(cleaned_response)

# Function to generate Q&A pairs
def generate_qa_pairs(markdown_content):
    print("Generating Q&A pairs...")
    print("This may take a few minutes depending on the document size.")

    # Prepare the prompt
    messages = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": markdown_content}
    ]

    # Format messages for the model
    if "Instruct" in model_config.model_name or model_config.model_name == "GPT-OSS-20B":
        # For chat-tuned models, use the tokenizer's chat template
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        print("Chat template applied successfully!")
    else:
        # For other models, combine messages manually
        prompt = f"{SYS_PROMPT}\n\nDocument:\n{markdown_content}\n\nGenerate Q&A pairs:"

    try:
        # Generate response
        with tqdm(total=100, desc="Generating Q&A pairs") as pbar:
            response = generator(
                prompt,
                max_new_tokens=model_config.max_new_tokens,
                temperature=model_config.temperature,
                top_p=model_config.top_p,
                repetition_penalty=model_config.repetition_penalty,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
                return_full_text=False
            )
            pbar.update(100)

        # Extract the generated text
        generated_text = response[0]['generated_text']

        print("\nQ&A generation completed!")
        return generated_text

    except Exception as e:
        print(f"Error generating Q&A pairs: {str(e)}")
        raise

# Generate Q&A pairs
try:
    if 'doc_markdown' in locals():
        llm_response = generate_qa_pairs(doc_markdown)

        # Parse the response
        print("\nParsing response...")
        qa_data = parse_llm_response(llm_response)

        print(f"Successfully generated {len(qa_data)} Q&A pairs!")

        # Display first few Q&A pairs as preview
        print("\nPreview of generated Q&A pairs:")
        print("-" * 50)
        for i, qa in enumerate(qa_data[:3]):
            print(f"Q{i+1}: {qa['question']}")
            print(f"A{i+1}: {qa['answer']}")
            print()
        print("-" * 50)
        print(f"Showing {min(3, len(qa_data))} of {len(qa_data)} Q&A pairs")
    else:
        print("No document processed. Please upload a document first.")

except Exception as e:
    print(f"Error: {str(e)}")
    print("Please try again with a different document or model.")
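
Some chat templates reject a `system` role, in which case `apply_chat_template` raises an error; older Mistral instruct templates are a commonly reported case, though this depends on the tokenizer version. One possible workaround, sketched below, is to fold the system prompt into the first user message before applying the template. `merge_system_into_user` is a hypothetical helper, not part of this notebook or of Transformers.

```python
def merge_system_into_user(messages):
    """Fold all system messages into the first user message.

    Returns a new list; the input list and its dicts are left unchanged.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    others = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return others
    prefix = "\n\n".join(system_parts)
    for m in others:
        if m["role"] == "user":
            m["content"] = f"{prefix}\n\n{m['content']}"
            break
    else:
        # No user message to attach to: send the system text as a user turn
        others.insert(0, {"role": "user", "content": prefix})
    return others


messages = [
    {"role": "system", "content": "Return raw JSON only."},
    {"role": "user", "content": "# Report\nProject 'Phoenix' was initiated for 'Innovate Corp'."},
]
merged = merge_system_into_user(messages)
print(merged[0]["role"])
print(merged[0]["content"])
```

The merged list could then be passed to `tokenizer.apply_chat_template` in place of the original `messages` when the selected model's template does not accept a system turn.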

## 6. Results and Download

In [None]:
# Function to save and download results
def save_and_download_results(qa_data):
    # Generate filename with current date
    today = date.today()
    filename = f"{today.strftime('%Y%m%d')}_qa_dataset.json"

    # Save to file
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(qa_data, f, indent=2, ensure_ascii=False)

    print(f"Results saved to {filename}")

    # Download file in Colab
    if IN_COLAB:
        files.download(filename)
        print("Download initiated!")
    else:
        print(f"File saved locally at: {os.path.abspath(filename)}")

    return filename

# Save and download results if available
try:
    if 'qa_data' in locals():
        output_file = save_and_download_results(qa_data)

        # Display statistics
        print("\nGeneration Statistics:")
        print(f"- Total Q&A pairs: {len(qa_data)}")
        n_pairs = max(len(qa_data), 1)  # avoid division by zero on an empty result
        print(f"- Average question length: {sum(len(qa['question']) for qa in qa_data) / n_pairs:.1f} characters")
        print(f"- Average answer length: {sum(len(qa['answer']) for qa in qa_data) / n_pairs:.1f} characters")
        print(f"- Model used: {model_config.model_name}")
        print(f"- Document processed: {uploaded_file}")

        # Display full results in Colab
        if IN_COLAB:
            print("\nFull Results:")
            display(JSON(qa_data))
    else:
        print("No Q&A data to save. Please generate Q&A pairs first.")

except Exception as e:
    print(f"Error saving results: {str(e)}")

## 7. Process Multiple Documents (Optional)

In [None]:
# Optional: Batch processing for multiple documents
def process_multiple_documents():
    if not IN_COLAB:
        print("Batch processing is only available in Google Colab.")
        return

    print("Upload multiple documents for batch processing:")
    uploaded = files.upload()

    if not uploaded:
        print("No files uploaded.")
        return

    all_qa_data = []

    for filename in tqdm(uploaded.keys(), desc="Processing documents"):
        try:
            print(f"\nProcessing {filename}...")

            # Convert to markdown
            doc_markdown = convert_to_markdown(filename)

            # Generate Q&A pairs
            llm_response = generate_qa_pairs(doc_markdown)
            qa_data = parse_llm_response(llm_response)

            # Add source document info
            for qa in qa_data:
                qa['source_document'] = filename

            all_qa_data.extend(qa_data)
            print(f"Generated {len(qa_data)} Q&A pairs from {filename}")

        except Exception as e:
            print(f"Error processing {filename}: {str(e)}")
            continue

    if all_qa_data:
        # Save combined results
        today = date.today()
        filename = f"{today.strftime('%Y%m%d')}_batch_qa_dataset.json"

        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(all_qa_data, f, indent=2, ensure_ascii=False)

        print(f"\nBatch processing completed!")
        print(f"Total Q&A pairs generated: {len(all_qa_data)}")
        print(f"Results saved to: {filename}")

        files.download(filename)
    else:
        print("No Q&A pairs were generated.")

# Uncomment the line below to run batch processing
# process_multiple_documents()

## Usage Instructions

### Single Document Processing:
1. Run all cells in order
2. Configure your model settings in Section 2
3. Upload your document when prompted in Section 4
4. Wait for Q&A generation to complete
5. Download the JSON file with the results

### Batch Processing (Optional):
1. Uncomment the last line in Section 7
2. Run the cell to upload multiple documents
3. Wait for all documents to be processed
4. Download the combined results

### Tips:
- For large documents, consider using a smaller model or reducing max tokens
- If you encounter memory errors, try restarting the runtime and using a smaller model
- The quality of Q&A pairs depends on the model used and the document content
- Some models may require authentication tokens from Hugging Face
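
For the large-document tip above, another option is to split the converted Markdown at its headings and generate Q&A pairs chunk by chunk. The sketch below assumes headings use the `#` style that Markdown export normally produces; `split_markdown_sections` and the `max_chars` default are hypothetical choices, not part of this notebook.

```python
import re


def split_markdown_sections(markdown, max_chars=8000):
    """Split Markdown at heading lines, then pack sections into chunks of at most max_chars.

    A single section longer than max_chars is kept whole rather than cut mid-text.
    """
    # Split before every line that starts with 1-6 '#' characters followed by a space
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    chunks, current = [], ""
    for section in sections:
        # Start a new chunk when adding this section would exceed the limit
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks


doc = "# Intro\nalpha\n## Tests\nbeta\n## Results\ngamma\n"
chunks = split_markdown_sections(doc, max_chars=30)
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {len(chunk)} characters")
```

Each chunk could be passed to `generate_qa_pairs` in turn and the parsed lists concatenated, which also lines up with the prompt's "per document section" quantity target.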

### Troubleshooting:
- **Memory Issues**: Try a smaller model or lower the max tokens setting; 4-bit quantization is already applied to every model except GPT-OSS-20B
- **Slow Processing**: Consider using GPU runtime in Colab
- **JSON Parsing Errors**: The model may not have followed the output format exactly - try regenerating
- **Document Conversion Errors**: Ensure your document is in a supported format (DOCX, PPTX, PDF)
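
For JSON parsing errors, regenerating is not the only option: two frequent causes are extra text around the array and a trailing comma before the closing bracket. The following best-effort sketch handles both; `extract_json_array` is a hypothetical helper that could be tried as a further fallback after `parse_llm_response`, and it will not repair every malformed response.

```python
import json
import re


def extract_json_array(text):
    """Best-effort recovery of a JSON array from model output.

    Keeps the span from the first '[' to the last ']' and drops commas that sit
    directly before a closing bracket or brace. Raises ValueError if no array
    is found or the cleaned text is still not valid JSON.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in response")
    candidate = text[start:end + 1]
    # Caveat: this would also alter a string value that contains ',]' or ',}'
    candidate = re.sub(r",\s*([\]}])", r"\1", candidate)
    return json.loads(candidate)


raw = (
    "Here is the JSON:\n"
    "[\n"
    '  {"question": "Q1", "answer": "A1"},\n'
    "]\n"
    "Let me know if you need more."
)
print(extract_json_array(raw))
```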