### Step 2: Extract Q&A Markers and Save to Excel

This step processes human-verified `.docx` files from Step 1 to extract structured Q&A content and save it in `.xlsx` format.

#### **Workflow**:
1. **Manual Processing**:
   - Start with the file `Extracted.docx` generated in Step 1, located in the main directory.
   - Open `Extracted.docx` and verify the content:
     - Correct any misnumbered questions or truncated text.
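     - Ensure every question begins with its "Q" marker and every answer with its "A" marker. The extraction code below expects markers that match `Q\w+_\d+\.` for questions and `A\d+\.` for answers.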
   - Save the verified `.docx` file in the `Reviewed` folder with an appropriate name (e.g., `BGD.docx` for Bangladesh).
2. **Automated Processing with the Step 2 script**:
   - Read `.docx` files from the `Reviewed` folder.
   - Extract questions (starting with "Q") and their corresponding answers (starting with "A") using regex.
   - Save the extracted content to an Excel file for each `.docx` file.

#### **Inputs**:
- **Initial Input**: The `Extracted.docx` file from Step 1, located in the main directory.
- **Final Input**: Human-verified `.docx` files, renamed and placed in the `Reviewed` folder.

#### **Outputs**:
- `.xlsx` files containing structured questions and answers, saved in the `Reviewed` folder.

#### **Notes**:
- Manual verification and cleaning of the `Extracted.docx` file is a crucial step before automated processing.
- Ensure the verified `.docx` files are appropriately named (e.g., using country codes) before running this step.
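
Part of the manual check can be supported by a small pre-check before the files are moved to `Reviewed`. The sketch below is not part of the original pipeline: the helper name `find_questions_without_answers` and the sample lines are invented for illustration, and the marker patterns are assumed from the regexes used later in this step. It reports question markers that are not followed by any answer line.

```python
import re

# Marker patterns assumed from the extraction regexes used later in this step.
QUESTION_MARKER = re.compile(r"^Q\w+_\d+\.")
ANSWER_MARKER = re.compile(r"^A\d+\.")

def find_questions_without_answers(paragraphs):
    """Return the question markers that have no answer line before the next question."""
    missing = []
    current = None       # marker of the question currently being scanned
    has_answer = False
    for text in paragraphs:
        line = text.strip()
        question = QUESTION_MARKER.match(line)
        if question:
            # Close the previous question before starting a new one.
            if current is not None and not has_answer:
                missing.append(current)
            current = question.group(0)
            has_answer = False
        elif ANSWER_MARKER.match(line):
            has_answer = True
    # Close the last question in the file.
    if current is not None and not has_answer:
        missing.append(current)
    return missing

# Invented sample paragraphs; real input would come from doc.paragraphs.
sample = [
    "QA_1. What is your age?",
    "A1. Under 18",
    "A2. 18 or older",
    "QA_2. Where do you live?",
    "QB_1. Do you own a phone?",
    "A1. Yes",
]
print(find_questions_without_answers(sample))  # ['QA_2.']
```

A non-empty result points to questions whose answers were lost or mislabeled and should be fixed by hand before running the extraction.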


### Extract Questions and Answers
This function reads a `.docx` file, extracts each question and its answers using regex, and returns the data as a list of `(question, answers)` tuples.
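
To see what the two-stage regex does without opening a Word file, the sketch below applies the same idea to an in-memory string. The helper name `pair_questions_and_answers` and the sample text are assumptions made for illustration; the patterns mirror the ones in the function, except that the lookahead uses no capturing group so `findall` returns plain strings.

```python
import re

# Stage 1: a block runs from one Q marker up to the next Q marker or the end of the text.
BLOCK = re.compile(r"(Q\w+_\d+\..*?)(?=Q\w+_\d+\.|$)", re.DOTALL)
# Stage 2: inside a block, everything before the first A marker is the question.
SPLIT = re.compile(r"(Q\w+_\d+\..*?)(A\d+\..*)", re.DOTALL)

def pair_questions_and_answers(content):
    """Return a list of (question, answers) tuples found in content."""
    pairs = []
    for block in BLOCK.findall(content):
        match = SPLIT.search(block)
        if match:  # blocks without an answer marker are skipped
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs

# Invented sample text in the assumed marker format.
text = (
    "QA_1. What is your age?\n"
    "A1. Under 18\n"
    "A2. 18 or older\n"
    "QA_2. Do you own a phone?\n"
    "A1. Yes\n"
    "A2. No"
)
pairs = pair_questions_and_answers(text)
print(pairs[0])  # ('QA_1. What is your age?', 'A1. Under 18\nA2. 18 or older')
```

Note that a question with no answer marker is dropped silently, which is one reason the manual review in the workflow matters.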


In [None]:
# Go over the extracted content, split the text on Q markers, and save to Excel (single file)
import openpyxl
from docx import Document
import re

def extract_questions_and_answers(docx_path):
    doc = Document(docx_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    
    content = "\n".join(full_text)
    
    # Find all question blocks: each runs from one Q marker up to the next one (or the end of the text)
    question_blocks = re.findall(r'(Q\w+_\d+\..*?)(?=Q\w+_\d+\.|$)', content, re.DOTALL)
    
    # Extract questions and corresponding answers
    paired_qa = []
    for block in question_blocks:
        # Split each block at its first answer marker (A<number>.)
        question_match = re.search(r'(Q\w+_\d+\..*?)(A\d+\..*)', block, re.DOTALL)
        if question_match:
            question = question_match.group(1).strip()
            answers = question_match.group(2).strip()
            paired_qa.append((question, answers))
    
    return paired_qa

### Save Extracted Q&A to Excel (Single File)

This section demonstrates how to process a single `.docx` file. The workflow includes:
1. Manually verifying and saving the `.docx` file in the `Reviewed` folder (e.g., `BGD.docx` for Bangladesh).
2. Extracting questions and answers using the `extract_questions_and_answers` function.
3. Saving the extracted content to a corresponding `.xlsx` file in the same folder (e.g., `BGD_questions.xlsx`).

#### Notes:
- Update the `country_code` variable in the `save_to_excel` function, along with `docx_path` and `output_xlsx_path`, to match the file being processed.
- Use this method for processing one file at a time, ideal for testing or small-scale workflows.
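
Editing the country code and both paths by hand is easy to get out of sync. As a hedged alternative, the sketch below derives them from the `.docx` path. The helper names `country_code_from_path` and `output_path_for` are hypothetical and not part of the original notebook; the sketch assumes files are named by country code, as in `BGD.docx`.

```python
from pathlib import Path

def country_code_from_path(path):
    """Return the part of the file name before the first underscore or the extension."""
    return Path(path).stem.split("_")[0]

def output_path_for(docx_path):
    """Return the matching <name>_questions.xlsx path in the same folder."""
    p = Path(docx_path)
    return p.with_name(f"{p.stem}_questions.xlsx")

docx_path = "Reviewed/BGD.docx"
out_path = output_path_for(docx_path)
print(out_path.name)                      # BGD_questions.xlsx
print(country_code_from_path(docx_path))  # BGD
print(country_code_from_path(out_path))   # BGD
```

With helpers like these, only `docx_path` needs to change when switching to another country file.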

In [None]:
def save_to_excel(questions, output_xlsx_path):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = "Questions and Answers"

    # Write the header
    sheet.cell(row=1, column=1).value = "country"
    sheet.cell(row=1, column=2).value = "question"
    sheet.cell(row=1, column=3).value = "answers"
    
    country_code = "BGD"  # Country code corresponding to the file name, BGD used as example, amend as necessary
    for i, (question, answers) in enumerate(questions, start=2):  # Start at row 2
        sheet.cell(row=i, column=1).value = country_code
        sheet.cell(row=i, column=2).value = question
        sheet.cell(row=i, column=3).value = answers
    
    workbook.save(output_xlsx_path)

# Path to the DOCX file
docx_path = "Reviewed/BGD.docx"
# Output Excel file path
output_xlsx_path = "Reviewed/BGD_questions.xlsx"

extracted_questions = extract_questions_and_answers(docx_path)
save_to_excel(extracted_questions, output_xlsx_path)
print(f"Questions and answers have been saved to {output_xlsx_path}")

### Process All `.docx` Files (Batch Processing)

This script offers an alternative path for batch processing multiple `.docx` files. It processes all files in the `Reviewed` folder:
1. Place multiple human-verified `.docx` files in the `Reviewed` folder (e.g., `BGD.docx`, `KEN.docx`).
2. The script extracts Q&A content from each file using the `extract_questions_and_answers` function.
3. The script saves the content to a corresponding `.xlsx` file in the same folder, named after the `.docx` file (e.g., `BGD_questions.xlsx`, `KEN_questions.xlsx`).

#### Notes:
- Use this approach when working with a large number of files; it avoids editing paths and country codes for each file by hand.
- Ensure all `.docx` files are verified and follow a logical naming convention (e.g., country codes) before running the batch script.
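
Before running the batch, it can help to preview which files will be picked up. The sketch below is an illustration only: the helper name `list_reviewed_docx` is hypothetical, and a temporary folder with empty files stands in for `Reviewed`. It applies the same filter as the batch script: keep `.docx` files and skip the `~$` lock files that Word creates while a document is open.

```python
import tempfile
from pathlib import Path

def list_reviewed_docx(folder):
    """Return sorted .docx paths in folder, skipping Word lock files (~$...)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() == ".docx" and not p.name.startswith("~$")
    )

# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["KEN.docx", "BGD.docx", "~$BGD.docx", "notes.txt", "BGD_questions.xlsx"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_reviewed_docx(tmp)]

print(found)  # ['BGD.docx', 'KEN.docx']
```

Sorting the result keeps the processing order stable across runs, which makes the printed log easier to compare.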

In [None]:
# Loop over all .docx files and extract each one to a corresponding Excel file

import openpyxl
from docx import Document
import re
import os

def extract_questions_and_answers(docx_path):
    doc = Document(docx_path)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    
    content = "\n".join(full_text)
    
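    # Split content into question blocks at each Q marker (e.g. "Q<name>_<number>."),
    # using the same marker pattern as the single-file version so that ordinary
    # words containing "Q" do not start a new block
    question_blocks = re.split(r'(?=Q\w+_\d+\.)', content)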

    paired_qa = []
    for block in question_blocks:
        lines = block.strip().split('\n')
        if lines and lines[0].startswith('Q'):
            question = lines[0].strip()
            answers = '\n'.join(lines[1:]).strip()
            paired_qa.append((question, answers))
    
    return paired_qa

def save_to_excel(questions, output_xlsx_path):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = "Questions and Answers"

    # Write the header
    sheet.cell(row=1, column=1).value = "country"
    sheet.cell(row=1, column=2).value = "question"
    sheet.cell(row=1, column=3).value = "answers"
    
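    # Extract the country code from the file name; drop the extension first so that
    # both "BGD.xlsx" and "BGD_questions.xlsx" yield "BGD"
    country_code = os.path.splitext(os.path.basename(output_xlsx_path))[0].split('_')[0]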
    for i, (question, answers) in enumerate(questions, start=2):  # Start at row 2
        sheet.cell(row=i, column=1).value = country_code
        sheet.cell(row=i, column=2).value = question
        sheet.cell(row=i, column=3).value = answers
    
    workbook.save(output_xlsx_path)

def process_all_docx_files(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx") and not filename.startswith('~$'):
            docx_path = os.path.join(folder_path, filename)
            # Name the output "<name>_questions.xlsx", as described above
            xlsx_filename = f"{os.path.splitext(filename)[0]}_questions.xlsx"
            output_xlsx_path = os.path.join(folder_path, xlsx_filename)
            
            extracted_questions = extract_questions_and_answers(docx_path)
            save_to_excel(extracted_questions, output_xlsx_path)
            print(f"Processed {filename} and saved to {output_xlsx_path}")

# Path to the folder containing DOCX files
folder_path = "Reviewed"

process_all_docx_files(folder_path)

### Conclusion

In this step, we successfully:
1. Processed human-verified `.docx` files from Step 1.
2. Extracted questions and answers into a structured format.
3. Saved the structured data into `.xlsx` files for further analysis.

### Next Steps
The extracted Q&A data in Excel format is now ready for analysis or integration into downstream systems.
