# Introduction: Simplifying Dataset Preparation from Hard Copy Questions

This workflow helps you turn questions from printed or scanned materials (such as books or PDF files) into a well-organized digital dataset. The process cleans and structures the data so it is ready for use, without requiring advanced technical knowledge.

By using tools like **Tesseract-OCR** and **EasyOCR**, you can extract text accurately from images or PDFs. Additionally, **PyMuPDF** helps process PDF files efficiently. The steps are easy to follow, even for beginners, making it ideal for creating datasets on platforms like Kaggle.

---

## Workflow Overview

### **Step 1–4: Question Cropping Workflow**

1. **PDF to Image Conversion**  
   - Convert each page of the PDF into an image.  
   - This step simplifies processing and helps in removing parts you don’t need, such as headers or footers.

2. **Removing Unnecessary Sections**  
   - Identify and eliminate redundant parts like repeated titles or sections (e.g., "Kertas 2").  
   - This ensures the dataset is focused on relevant questions.

3. **Tag Cleanup**  
   - Detect and remove irrelevant tags or extra text that might clutter your dataset.

4. **Column & Question Cropping**  
   - Split pages formatted in multiple columns (e.g., two or three columns).  
   - Process each column separately to ensure clean and organized extraction of questions and content.

---

### **Step 5: Answer Extraction Workflow**

5. **Answer Extraction**  
   - Use OCR tools (like EasyOCR or Tesseract) to automatically detect and extract answer keys or solutions from the images.  
   - Save extracted answers in a structured format (e.g., JSON), ensuring both question and answer data are preserved for further analysis.

---

### **Step 6–7: Dataset Consolidation Workflow**

6. **Combine Questions and Answers**  
   - Merge the datasets containing questions and answers into a unified JSON file.  
   - Each entry in the JSON file will include fields for the question, its associated answer, and any relevant metadata (e.g., page or section number).

7. **Export to Final Dataset**  
   - Convert the consolidated JSON file into a user-friendly format (e.g., `.csv` or `.xlsx`).  
   - This final dataset can be used for reviewing, sharing, or additional machine learning tasks.
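
Steps 6 and 7 can be sketched with the standard library alone. The field names (`question_id`, `question_image`, `answer`, `page`) and the sample records below are illustrative assumptions, not the notebook's actual schema.

```python
import csv
import io
import json

# Hypothetical sample records; the real field names may differ.
questions = [
    {"question_id": "C1_Q1", "question_image": "FC065244_C1_Q1.png", "page": 1},
    {"question_id": "C1_Q2", "question_image": "FC065244_C1_Q2.png", "page": 1},
]
answers = [
    {"question_id": "C1_Q1", "answer": "B"},
]


def merge_questions_and_answers(questions, answers):
    """Join questions with answers on question_id; unanswered questions get None."""
    answer_by_id = {a["question_id"]: a["answer"] for a in answers}
    return [{**q, "answer": answer_by_id.get(q["question_id"])} for q in questions]


def to_csv_text(records):
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


merged = merge_questions_and_answers(questions, answers)
merged_json = json.dumps(merged, indent=2)  # Step 6: unified JSON
csv_text = to_csv_text(merged)              # Step 7: CSV export
print(csv_text)
```

Keeping unanswered questions with an empty answer, rather than dropping them, makes gaps in the OCR output easy to spot when reviewing the exported file.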

---

## Summary of Outputs

- **Step 1–4**: A set of cropped images containing only relevant questions, saved as PNG files.  
- **Step 5**: A JSON file containing extracted answers, organized for each question.  
- **Step 6–7**: A consolidated dataset with questions and answers in JSON, then exported to `.csv` format for end users.


In [None]:
# Install Tesseract OCR, PyMuPDF, and EasyOCR
!sudo apt-get install -y tesseract-ocr --quiet
!pip install PyMuPDF --quiet
!pip install --upgrade easyocr  --quiet

# Workflow for PDF to Image Conversion & Removing Unnecessary Sections

### Workflow Summary Table

| **Book Name**       | **Header Keyword**    | **Header Padding** | **Footer Keyword**               | **Footer Padding** | **Footer Adjustment**                                              |
|----------------------|-----------------------|---------------------|-----------------------------------|---------------------|----------------------------------------------------------------------|
| **FC065244 Book**    | `Praktis`            | 320                 | `"Sasbadi Sdn"`                  | 30                  | Use OCR to detect footer keyword; adjust crop if detected.           |
| **FC064244 Book**    | `Praktis`            | 320                 | `"Sasbadi Sdn"`                  | 30                  | Use OCR to detect footer keyword; adjust crop if detected.           |
| **QC174032 Book**    | `Ujian`              | 530                 | `"Penerbitan Pelangi Sdn"`       | 30                  | Use OCR to detect footer keyword; adjust crop if detected.           |
| **KM24SF1 & KM24SMA Book** | `Kertas`      | 30                  | `KM`                             | 50                  | Look for footer keyword `KM` and adjust footer crop dynamically.     |
| **GG24SFI4 Book**    | `Gerak`              | 320                 | Digit in last few footer texts   | Dynamic             | Use EasyOCR and Tesseract to find numbers in footer; apply dynamic padding if undetected. |
| **IB4MA Book**       | `Bidang`             | 340                 | Digit in last few footer texts   | Dynamic             | Same workflow as GG24SFI4: Use OCR to detect numbers in footer and apply dynamic padding if undetected. |
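
One way to keep the table's settings in code is a lookup dictionary keyed by book code. The values below are copied from the table; the dictionary layout and the helper name `get_book_settings` are assumptions for illustration, and `None` stands in for the "dynamic" footer entries.

```python
# Values copied from the summary table; the structure itself is hypothetical.
BOOK_SETTINGS = {
    "FC065244": {"header_keyword": "Praktis", "header_padding": 320,
                 "footer_keyword": "Sasbadi Sdn", "footer_padding": 30},
    "FC064244": {"header_keyword": "Praktis", "header_padding": 320,
                 "footer_keyword": "Sasbadi Sdn", "footer_padding": 30},
    "QC174032": {"header_keyword": "Ujian", "header_padding": 530,
                 "footer_keyword": "Penerbitan Pelangi Sdn", "footer_padding": 30},
    "KM24SF1": {"header_keyword": "Kertas", "header_padding": 30,
                "footer_keyword": "KM", "footer_padding": 50},
    "KM24SMA": {"header_keyword": "Kertas", "header_padding": 30,
                "footer_keyword": "KM", "footer_padding": 50},
    # Footer is located by page digits, so keyword and padding are left unset.
    "GG24SFI4": {"header_keyword": "Gerak", "header_padding": 320,
                 "footer_keyword": None, "footer_padding": None},
    "IB4MA": {"header_keyword": "Bidang", "header_padding": 340,
              "footer_keyword": None, "footer_padding": None},
}


def get_book_settings(pdf_filename):
    """Return the settings for the book whose code prefixes the filename, or None."""
    for book_code, settings in BOOK_SETTINGS.items():
        if pdf_filename.startswith(book_code):
            return settings
    return None


print(get_book_settings("QC174032_C7.pdf"))
```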

---

### Header and Footer Extraction Process

1. **Header Extraction**  
   - Analyze the **top quarter** of the page for header text using OCR.  
   - Detect and compare text against a customizable list of header keywords:  
     `header_keywords = ["Bidang", "Jawapan", "Praktis", "Fizik Tingkatan 5 Praktis", "Ujian", "Kertas Model", "Gerak"]`  
   - Determine the crop position based on the detected keywords:  
     - `"Bidang"` → Set header crop to `340 + padding`.  
     - `"Kertas"` → Set header crop to `30 + padding`.  
     - No matching keyword → Do not crop the header.

2. **Footer Extraction**  
   - Analyze the **bottom 1/9 region** of the page using OCR.  
   - Detect text against a customizable list of footer keywords:  
     `footer_keywords = ["KM"]`  (can include `"Penerbitan Pelangi Sdn"` and `"Sasbadi Sdn"`).  
   - If detected, adjust the footer crop dynamically.  
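   - For books like **IB4MA** that have no footer keyword: if a page number is detected among the last few footer words, apply `footer_y = h + 10 - padding`; if neither a keyword nor a number is found, apply `footer_y = h + 40 - padding`.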
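
The crop decisions can be isolated from the OCR step, which makes them easy to check by hand. This is a minimal sketch: the offsets mirror the notebook's `crop_header_footer_keywords` function, while the function names and the sample strings are made up for illustration.

```python
import re


def header_crop_y(header_text, padding=50):
    """Return the row where the header crop ends, using the first two OCR words."""
    first_words = " ".join(header_text.lower().split()[:2])
    if "bidang" in first_words:
        return 340 + padding
    if "kertas" in first_words:
        return 30 + padding
    return 0  # no matching keyword: keep the header


def footer_crop_y(footer_text, page_height, footer_keywords=("KM",), padding=50):
    """Return the row where the page is cut at the bottom."""
    text = footer_text.lower()
    last_words = text.split()[-5:]
    if any(keyword.lower() in text for keyword in footer_keywords):
        return page_height - 30 - padding
    if any(re.search(r"\b\d+\b", word) for word in last_words):
        return page_height + 10 - padding  # page number found
    return page_height + 40 - padding      # nothing recognised


print(header_crop_y("Bidang Pembelajaran 1"))  # made-up header text
print(footer_crop_y("halaman 12", page_height=1000))
```

Note that the keyword test is a plain substring check, so a short keyword such as `KM` also matches any longer word that contains those letters.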

---

### Parameters You Can Customize

1. **Header Keywords**  
   - Adjust the list of keywords to fit the type of document.  
   - Example:  
     `header_keywords = ["Bidang", "Praktis", "Fizik Tingkatan 5 Praktis", "Kertas Model", "Gerak"]`

2. **Footer Keywords**  
   - Modify the footer keyword list to detect specific text or publisher details.  
   - Example:  
     `footer_keywords = ["KM", "Penerbitan Pelangi Sdn", "Sasbadi Sdn"]`

3. **Padding**  
   - Configure the padding for cropping areas.  
   - Example:  
     - `header_padding = 340` (for larger header spaces)  
     - `footer_padding = 30` (for smaller footer areas)

4. **Dynamic Adjustment**  
   - Adjust footer cropping dynamically using bounding box positions and additional offsets.

---

### Output

1. **Processed Images**  
   - Cropped images with detected keywords in the header or footer are saved in the output folder.

2. **Fallback for Missing Keywords**  
   - If no keywords are detected, the original image is saved without cropping.

3. **Organized File Structure**  
   - Images are saved with meaningful filenames, including chapter and page numbers for easy navigation.
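
The naming scheme can be sketched as a small helper. It follows the same pattern the notebook uses (`<base>_C<chapter>_P<page>.png`); the helper name and the fallback chapter `0` for filenames without a `_C<number>` part are assumptions.

```python
import os
import re


def build_page_filename(pdf_path, page_number):
    """Build '<base>_C<chapter>_P<page>.png' from a PDF path like 'QC174032_C7.pdf'."""
    file_name = os.path.basename(pdf_path)
    # Everything before the last underscore-separated part is the book code.
    base_name = "_".join(file_name.split("_")[:-1])
    match = re.search(r"_C(\d+)", file_name)
    chapter = int(match.group(1)) if match else 0  # 0 is an assumed fallback
    return f"{base_name}_C{chapter}_P{page_number}.png"


print(build_page_filename("/kaggle/input/qc174032/QC174032_C7.pdf", 3))
```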

In [None]:
import os
import fitz  # PyMuPDF
import cv2
import numpy as np
from scipy.ndimage import rotate
import matplotlib.pyplot as plt
import pytesseract
import easyocr
import re

def pdf_to_images(pdf_path, output_folder):
    """
    Converts each page of a PDF file to an image, processes each image,
    and saves only the cropped output images.
    
    Args:
        pdf_path (str): The path to the PDF file.
        output_folder (str): Currently overridden below by a folder derived
            from the PDF filename (./output_final_images_<base_name>).
    """
    pdf_document = fitz.open(pdf_path)

    # Extract the base name: everything before the last underscore-separated
    # part of the filename (e.g. "QC174032_C7.pdf" -> "QC174032")
    base_name = '_'.join(os.path.basename(pdf_path).split('_')[:-1])

    # Define the dynamic output folder based on base_name
    output_folder = f"./output_final_images_{base_name}"
    os.makedirs(output_folder, exist_ok=True)

    # Extract chapter number from the PDF filename; fall back to 0 so that
    # `chapter` is always defined when the filename has no "_C<number>" part
    chapter_match = re.search(r'_C(\d+)', pdf_path)
    chapter = int(chapter_match.group(1)) if chapter_match else 0

    for page_number in range(pdf_document.page_count):
        # Convert PDF page to image
        page = pdf_document.load_page(page_number)
        pix = page.get_pixmap(matrix=fitz.Matrix(4, 4))
        
        # Convert the page to a numpy array without saving initially
        img_data = np.frombuffer(pix.samples, dtype=np.uint8)
        img_data = img_data.reshape((pix.height, pix.width, pix.n))
        image = cv2.cvtColor(img_data, cv2.COLOR_RGB2BGR)  # Convert RGB to BGR for OpenCV

        # Plot and save the original image after converting from PDF
        plt.figure(figsize=(10, 5))
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        plt.title(f'Original Image - Page {page_number + 1}')
        plt.axis('off')
        plt.show()
        
        # Run OCR on the full page before cropping (result kept for debugging; not used below)
        full_text = easyocr_extract_text(image)
        
        # Crop the page down to the region that contains text
        cropped_image = crop_questions_and_answers(image) 

        # Crop header and footer if matching keywords are found
        cropped_image = crop_header_footer_keywords(cropped_image) 

        # Apply projection-profile skew correction so the text lines are straight
        _, line_corrected_image = correct_skew(cropped_image)

        # Plot the final cropped image without header and footer, and with line correction
        plt.figure(figsize=(10, 5))
        plt.imshow(cv2.cvtColor(line_corrected_image, cv2.COLOR_BGR2RGB))
        plt.title(f'Final Cropped Image (Line Corrected) - Page {page_number + 1}')
        plt.axis('off')
        plt.show()

        # Save only the cropped image with updated chapter number
        new_image_path = os.path.join(output_folder, f"{base_name}_C{chapter}_P{page_number + 1}.png")
        cv2.imwrite(new_image_path, line_corrected_image)
        print(f"Corrected skew and cropped for page {page_number + 1} and saved: {new_image_path}")

    pdf_document.close()
    
def correct_skew(image, delta=1, limit=5):
    """
    Corrects the skew of an image.
    Args:
        image (numpy.ndarray): The input image to correct.
        delta (int): The increment for the angle to test.
        limit (int): The range of angles to test for skew correction.
    Returns:
        tuple: The best angle and the corrected image.
    """
    def determine_score(arr, angle):
        data = rotate(arr, angle, reshape=False, order=0)
        histogram = np.sum(data, axis=1, dtype=float)
        score = np.sum((histogram[1:] - histogram[:-1]) ** 2, dtype=float)
        return histogram, score

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    scores = []
    angles = np.arange(-limit, limit + delta, delta)
    for angle in angles:
        _, score = determine_score(thresh, angle)
        scores.append(score)

    best_angle = angles[scores.index(max(scores))]

    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, best_angle, 1.0)
    corrected = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    return best_angle, corrected


def crop_questions_and_answers(image, margin=40, exclusion_threshold=0.1, min_text_area=10):
    """
    Dynamically crop the image to retain only the question and answer sections.
    Crops out areas without text and respects exclusion zones for the right and bottom sides.

    Args:
        image (numpy.ndarray): The input image.
        margin (int): Additional padding to add around the cropped region.
        exclusion_threshold (float): Proportion of the width considered as the exclusion zone on the right side.
        min_text_area (int): Minimum area of a text region to be considered significant.

    Returns:
        numpy.ndarray: The cropped image.
    """
    # Convert to grayscale and binary image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)

    # Step 1: Detect text regions using EasyOCR
    reader = easyocr.Reader(['en', 'ms'])
    results = reader.readtext(image)

    # Step 2: Filter text regions
    significant_text_regions = []
    image_width = image.shape[1]
    image_height = image.shape[0]
    right_exclusion_zone = image_width * (1 - exclusion_threshold)
    bottom_exclusion_zone = image_height * (1 - exclusion_threshold)

    for res in results:
        ((x_min, y_min), _, (x_max, y_max), _) = res[0]
        text_width = x_max - x_min
        text_height = y_max - y_min
        text_area = text_width * text_height

        # Exclude text regions that fall within the exclusion zone
        if (
            text_area >= min_text_area
            and x_max < right_exclusion_zone
            and y_max < bottom_exclusion_zone
        ):
            significant_text_regions.append((x_min, y_min, x_max, y_max))

    # Step 3: Determine cropping boundaries
    if significant_text_regions:
        x_min = min([region[0] for region in significant_text_regions])
        y_min = min([region[1] for region in significant_text_regions])
        x_max = max([region[2] for region in significant_text_regions])
        y_max = max([region[3] for region in significant_text_regions])

        # Apply margin
        x_min = max(0, int(x_min) - margin)
        y_min = max(0, int(y_min) - margin)
        x_max = min(image.shape[1], int(x_max) + margin)
        y_max = min(image.shape[0], int(y_max) + margin)

        # Crop the image
        cropped_image = image[y_min:y_max, x_min:x_max]

        # Step 4: Additional crop for empty areas without text
        binary_cropped = cv2.threshold(cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY), 240, 255, cv2.THRESH_BINARY_INV)[1]
        row_sums = np.sum(binary_cropped, axis=1)
        col_sums = np.sum(binary_cropped, axis=0)

        # Find the first and last rows/columns that contain text pixels.
        # The +1 keeps the last text row/column, since slice ends are exclusive.
        empty_top = np.where(row_sums > 0)[0][0] if np.any(row_sums > 0) else 0
        empty_bottom = np.where(row_sums > 0)[0][-1] + 1 if np.any(row_sums > 0) else cropped_image.shape[0]
        empty_left = np.where(col_sums > 0)[0][0] if np.any(col_sums > 0) else 0
        empty_right = np.where(col_sums > 0)[0][-1] + 1 if np.any(col_sums > 0) else cropped_image.shape[1]

        # Trim the empty border around the text
        cropped_image = cropped_image[empty_top:empty_bottom, empty_left:empty_right]

        return cropped_image
    else:
        # If no significant text is detected, return the original image
        return image


def crop_header_footer_keywords(image, padding=50):
    """
    Crops the header and footer from an image if keywords are present.
    Args:
        image (numpy.ndarray): The input image to crop.
        padding (int): The number of pixels to add around detected text regions.
    Returns:
        numpy.ndarray: The cropped image with header and footer removed if keywords match.
    """
    header_keywords = ["Bidang","Jawapan", "Praktis", "Fizik Tingkatan 5 Praktis", "Ujian", "Kertas Model", "Gerak" ]
    footer_keywords = [ "KM"] #"Penerbitan Pelangi Sdn", "Sasbadi Sdn"

    h, w = image.shape[:2]

    # Convert image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Header cropping
    # Limit header search area to the upper quarter of the page
    header_area = gray[0:h//4, :]  
    header_text = ' '.join(easyocr_extract_text(header_area).strip().lower().split()[:2])
    # Praktis = 300 FC065244 , fizik 30 
    # UJian = 530 QC174032, matematik 30
    # Print the extracted text for debugging
    print(f"Detected header text: '{header_text}'")
    
    # Logic for determining header crop position
    if "bidang" in header_text:
        header_y = 340 + padding
        print("Detected 'Bidang' in header text. Setting header_y to:", header_y)
    elif "kertas" in header_text:
        header_y = 30 + padding
        print("Detected 'Kertas' in header text. Setting header_y to:", header_y)
    else:
        header_y = 0
        print("No matching keyword found in header text. Setting header_y to:", header_y)


    # Initialize OCR reader
    reader = easyocr.Reader(['en', 'ms'])
    
    # ---------------- FOOTER CROPPING ----------------
    footer_area = gray[h - (h // 9):h, :]  # Search area: bottom 1/9 of the page
    
    # EasyOCR extraction
    footer_text_easyocr = ' '.join(reader.readtext(footer_area, detail=0)).lower()
    # Tesseract extraction
    footer_text_tesseract = pytesseract.image_to_string(footer_area, config='--psm 6').lower()
    
    # Combine results
    footer_text = footer_text_easyocr + " " + footer_text_tesseract
    footer_last_words = footer_text.split()[-5:]  # Take only the last 5 words for detection
    print(f"[EasyOCR + Tesseract] Last detected words in bottom 1/9 region: '{' '.join(footer_last_words)}'")  # Debugging print
    
    footer_y = h
    # Check for footer keywords or page numbers in the bottom 1/9
    if any(keyword.lower() in footer_text for keyword in footer_keywords):
        footer_y = h - 30 - padding
        print("Detected footer keyword in bottom 1/9 region")
    elif any(re.search(r'\b\d+\b', word) for word in footer_last_words):  
        # A standalone number among the last few words is treated as a page number
        footer_y = h + 10 - padding  # same value as h - (-10) - padding
        print(f"Detected page number in bottom 1/9 region: '{' '.join(footer_last_words)}'")
    else:
        # Neither a keyword nor a number was found: use the loosest crop
        footer_y = h + 40 - padding  # same value as h - (-40) - padding
        print(f"No footer keyword or page number detected; last words: '{' '.join(footer_last_words)}'")
    
    # Crop the image to remove the detected footer
    cropped_image = image[header_y:footer_y, :]
    return cropped_image

def easyocr_extract_text(image):
    """
    Extract text from an image using EasyOCR.
    Args:
        image (numpy.ndarray): The input image to extract text from.
    Returns:
        str: The extracted text.
    """
    reader = easyocr.Reader(['en', 'ms']) 
    results = reader.readtext(image, detail=0)
    return ' '.join(results)

# Process all PDF files in a folder
input_folder = "/kaggle/input/ib4ma-pdf"  
pdf_paths =  sorted([os.path.join(input_folder, file) for file in os.listdir(input_folder) if file.endswith(".pdf")])

# Run the function to convert the PDF to images
#pdf_paths = ["/kaggle/input/qc174032/QC174032_C7.pdf"]
output_folder = "./output_final_images"
for pdf_path in pdf_paths:
    pdf_to_images(pdf_path, output_folder)

# Workflow: Detecting and Cropping "Kertas 2" or Similar Keywords from Images

## Step-by-Step Workflow

1. **Setup Input and Output Folders**
   - Input images are stored in a folder (e.g., `/kaggle/working/output_final_images_IB4MA`).
   - Processed images are saved to another folder (e.g., `/kaggle/working/processed_images_IB4MA_rK2`).

2. **Keyword Detection with EasyOCR**
   - EasyOCR is used to extract text from the image.
   - The script searches for a main keyword (e.g., `"Kertas"`) and its associated numbers or text (e.g., `"2"`, `"Bahagian A"`).
   - It checks up to **10 text detections following the keyword** (configurable via `proximity`) to match the associated keywords.

3. **Crop Based on Detection**
   - When a keyword (e.g., `"Kertas 2"`) is detected:
     - The script takes the top edge of the keyword's bounding box and keeps only the part of the image above it, discarding the keyword and everything below. The `padding` value shifts the cut line (e.g., `-70` moves it 70 pixels higher for a tighter crop).
   - The cropped image is saved to the output folder.

4. **Fallback for Undetected Keywords**
   - If no keyword or associated text is found, the script saves the original image to the output folder without cropping.
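
The detection and crop-row logic can be tried without running OCR by feeding it mock results in EasyOCR's `(bounding_box, text, confidence)` format. This is a simplified sketch under stated assumptions: the function name `find_crop_row`, the case-insensitive matching, the clamp to row 0, and the mock detections are not taken from the notebook.

```python
def find_crop_row(ocr_results, keyword="Kertas", keyword_numbers=("2", "Bahagian A"),
                  padding=-70, proximity=10):
    """Return the row at which to cut the image, or None if nothing matches."""
    targets = [k.lower() for k in keyword_numbers]
    for i, (bbox, text, _confidence) in enumerate(ocr_results):
        if keyword.lower() not in text.lower():
            continue
        # Look only at the next `proximity` detections after the keyword.
        nearby = [r[1].lower() for r in ocr_results[i + 1:i + 1 + proximity]]
        joined = " ".join(nearby)
        for target in targets:
            # Single tokens must match a detection exactly; phrases may span detections.
            if target in nearby or (" " in target and target in joined):
                top_y = int(bbox[0][1])  # y of the keyword's top-left corner
                return max(0, top_y + padding)
    return None


# Mock detections: (four corner points, text, confidence)
mock_results = [
    ([[100, 50], [300, 50], [300, 90], [100, 90]], "1. Sample question", 0.95),
    ([[100, 800], [260, 800], [260, 840], [100, 840]], "Kertas", 0.91),
    ([[270, 800], [300, 800], [300, 840], [270, 840]], "2", 0.88),
]
crop_row = find_crop_row(mock_results)
print(crop_row)  # rows [0:crop_row] of the image would be kept
```

If `find_crop_row` returns `None`, the fallback in step 4 applies and the image is saved unchanged.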


---

## Parameters You Can Customize
- **Keyword**: The main word to detect in the image (default is `"Kertas"`).
- **Keyword Numbers**: Additional associated text to search for (e.g., `["2", "Bahagian A"]`).
- **Padding**: Adjusts the crop area relative to the detected keyword (e.g., `-70` for tighter cropping).
- **Proximity**: Number of OCR text detections after the keyword to search for associated text (default is `10`).

---
## Output
- Cropped images with detected keywords are saved to the output folder.
- If no keywords are detected, the original image is saved without modification.



In [None]:
import cv2
import easyocr
import os
import matplotlib.pyplot as plt


def detect_and_crop_kertas2_easyocr(image_path, output_path, keyword="Kertas", keyword_numbers=None, padding=20, proximity=10):
    """
    Detects a keyword with specific associated keyword numbers (e.g., "Kertas 2" or "Bahagian A")
    in an image and crops the bottom part of the image if found using EasyOCR.

    Args:
        image_path (str): Path to the input image.
        output_path (str): Path to save the processed image.
        keyword (str): The main keyword to detect in the image (e.g., "Kertas").
        keyword_numbers (list): List of associated keywords or numbers (e.g., ["2", "Bahagian A"]).
        padding (int): Additional padding to include when cropping.
        proximity (int): Number of words to check near the detected keyword.

    Returns:
        None
    """
    if keyword_numbers is None:
        keyword_numbers = []

    # Initialize EasyOCR reader
    reader = easyocr.Reader(['en'], gpu=False)

    # Load the main image
    image = cv2.imread(image_path)
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Extract text using EasyOCR
    results = reader.readtext(gray_image, detail=1)

    # Debugging: Print all detected text with positions
    print(f"\nProcessing Image: {os.path.basename(image_path)}")

    # Check for the keyword in the extracted text
    for i, result in enumerate(results):
        detected_text = result[1]
        if keyword.lower() in detected_text.lower():
            print(f"\nEasyOCR: Keyword '{keyword}' found.")
            # Print the next `proximity` words for debugging
            next_words = [results[j][1] for j in range(i + 1, min(i + 1 + proximity, len(results)))]
            print(f"EasyOCR: Next {proximity} words: {next_words}")

            # Check for 'Bahagian A' or other keyword numbers
            for key_num in keyword_numbers:
                if key_num in next_words or "bahagian a" in " ".join(next_words).lower():
                    print(f"EasyOCR: Detected '{keyword} {key_num}' or 'Bahagian A'.")
                    y = int(result[0][0][1]) 
                    cropped_image = image[:y + padding, :]  

                    # Save the cropped image
                    cv2.imwrite(output_path, cropped_image)
                    print(f"EasyOCR: Cropped and saved: {output_path}")

                    # Display the cropped image for verification
                    plt.figure(figsize=(10, 5))
                    plt.imshow(cv2.cvtColor(cropped_image, cv2.COLOR_BGR2RGB))
                    plt.title(f"EasyOCR Cropped Image: {os.path.basename(output_path)}")
                    plt.axis("off")
                    plt.show()
                    return

    # If the keyword is not found, save the original image
    cv2.imwrite(output_path, image)
    print(f"No keyword '{keyword}' or any of the keyword numbers {keyword_numbers} found by EasyOCR. Original image saved: {output_path}")


# Paths for processing
input_folder = "/kaggle/working/output_final_images_IB4MA"  
output_folder = "/kaggle/working/processed_images_IB4MA_rK2" 

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# List of keyword numbers (e.g., ["2", "Bahagian A"])
keyword_numbers_list = ["2", "Bahagian A"]

# Custom image path for testing
custom_image_path = ""  

if custom_image_path:
    # Process only the custom image
    custom_output_path = os.path.join(output_folder, os.path.basename(custom_image_path))
    detect_and_crop_kertas2_easyocr(
        image_path=custom_image_path,
        output_path=custom_output_path,
        keyword="Kertas",  
        keyword_numbers=keyword_numbers_list,  
        padding=-60,  
        proximity=10  
    )
else:
    # Process all images in the folder
    for image_filename in sorted(os.listdir(input_folder)):
        if image_filename.lower().endswith((".png", ".jpg", ".jpeg")):
            input_image_path = os.path.join(input_folder, image_filename)
            output_image_path = os.path.join(output_folder, image_filename)

            # Run the function to detect and crop "Kertas 2" or other keyword numbers
            detect_and_crop_kertas2_easyocr(
                image_path=input_image_path,
                output_path=output_image_path,
                keyword="Kertas",  
                keyword_numbers=keyword_numbers_list,  
                padding=-70,
                proximity=10  
            )


# Workflow: Removing Tags from Images with Template Matching

This workflow automates the process of detecting and removing unwanted tags (logos, watermarks, etc.) from images using **template matching** and a white fill over each matched region.

---

## Step-by-Step Workflow

1. **Set Input and Output Paths**
   - **Input Folder**: Place the images you want to process in a folder (e.g., `/kaggle/input/fc065244-pdf-latest`).  
   - **Templates Folder**: Place template images of the tags you want to remove in another folder (e.g., `/kaggle/input/remove-tags`).  
   - **Output Folder**: Processed images will be saved in the specified output folder (e.g., `/kaggle/working/processed_images`).

2. **Detect Tags**
   - Each template in the **Templates Folder** is matched against the input images using **template matching**.  
   - The algorithm identifies regions in the input image that closely resemble the template.

3. **Remove Tags**
   - Detected tag regions are covered with a filled white rectangle (a simple fill rather than true inpainting).  
   - The bounding box can be **shrunk or adjusted** to control the area of removal.

4. **Save and Display Processed Images**
   - The processed images are saved to the **Output Folder**.  
   - Each processed image is displayed for verification.
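
The removal step amounts to painting the matched box white. The sketch below shows this on a synthetic array with NumPy slicing, assuming the match location is already known; the function name and the test image are illustrative.

```python
import numpy as np


def fill_region_white(image, top_left, bottom_right):
    """Paint the rectangle between two (x, y) corners white, in place."""
    x0, y0 = top_left
    x1, y1 = bottom_right
    image[y0:y1, x0:x1] = 255  # rows are y, columns are x
    return image


# Synthetic dark page, 100 rows by 200 columns, 3 colour channels
page = np.zeros((100, 200, 3), dtype=np.uint8)
fill_region_white(page, top_left=(20, 10), bottom_right=(80, 50))
print(int((page == 255).all(axis=2).sum()))  # number of white pixels
```

NumPy slices exclude the end index, whereas `cv2.rectangle` draws the bottom-right corner pixel as well, so the two differ by one row and one column.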

---

## Parameters You Can Customize

1. **Similarity Threshold**  
   - Controls how similar a detected region must be to the template to count as a match.  
   - **Default**: `0.5` (useful values range from 0 to 1; the underlying `TM_CCOEFF_NORMED` score can also be negative for poor matches).  
   - **Example**:  
     - A higher value (e.g., `0.7`) makes the match stricter, reducing false positives.  
     - A lower value (e.g., `0.3`) detects more regions but may include irrelevant areas.

2. **Shrink Factor**  
   - Shrinks the bounding box around the detected tag to avoid overlapping unnecessary regions.  
   - **Default**: `0` in the function signature; the example call in this notebook passes `5` (pixels).  
   - **Example**:  
     - A higher value reduces the tag area more aggressively.  
     - A lower value keeps the bounding box closer to the original detected region.

3. **Templates Directory**  
   - The folder containing template images of the tags you want to remove.  
   - **Example**: `/kaggle/input/remove-tags`.
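
The effect of the two numeric parameters can be seen on plain numbers. The template names and scores below are invented for illustration, and both helper functions are hypothetical.

```python
def select_matches(match_scores, similarity_threshold=0.5):
    """Keep the templates whose best match score meets the threshold."""
    return {name: score for name, score in match_scores.items()
            if score >= similarity_threshold}


def shrink_box(top_left, bottom_right, shrink_factor):
    """Move each side of the box inward by shrink_factor pixels."""
    (x0, y0), (x1, y1) = top_left, bottom_right
    return ((x0 + shrink_factor, y0 + shrink_factor),
            (x1 - shrink_factor, y1 - shrink_factor))


# Invented best-match scores for three templates
scores = {"logo.png": 0.82, "watermark.png": 0.46, "stamp.png": 0.31}

print(sorted(select_matches(scores, 0.7)))  # strict threshold
print(sorted(select_matches(scores, 0.3)))  # loose threshold
print(shrink_box((20, 10), (80, 50), shrink_factor=5))
```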

---

## Output

1. **Processed Images**  
   - Images with detected tags removed are saved to the **Output Folder** (e.g., `/kaggle/working/processed_images`).

2. **Fallback for Missing Tags**  
   - If no tags are detected in an image, the original image is saved without modifications.

3. **Display Processed Images**  
   - The processed image is displayed for manual verification during the workflow.

---

## Example Usage
- **Input Folder**: `/kaggle/input/fc065244-pdf-latest`  
- **Templates Folder**: `/kaggle/input/remove-tags`  
- **Output Folder**: `/kaggle/working/processed_images`

---

### For **Task 1: FC065244 Book**

```python
output_dir = "/kaggle/working/output_final_images_FC065244/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_FC065244/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_FC065244"
```

---

### For **Task 2: FC064244 Book**

```python
output_dir = "/kaggle/working/output_final_images_FC064244/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_FC064244/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_FC064244"
```

---


In [None]:
import cv2
import os
import matplotlib.pyplot as plt  # needed for the preview at the end of remove_tags

def remove_tags(image_path, templates_dir, output_path, similarity_threshold=0.5, shrink_factor=0):
    """
    Detects and removes multiple tags from an image using template matching and inpainting.

    Args:
        image_path (str): Path to the input image.
        templates_dir (str): Directory containing template images of tags.
        output_path (str): Path to save the processed image.
        similarity_threshold (float): Threshold for template matching (0 to 1).
        shrink_factor (int): Pixels to shrink bounding box on each side.

    Returns:
        None
    """
    # Load the main image
    image = cv2.imread(image_path)
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Iterate through all template images in the directory
    for template_filename in os.listdir(templates_dir):
        template_path = os.path.join(templates_dir, template_filename)
        if template_filename.lower().endswith((".png", ".jpg", ".jpeg")):
            # Load the template image
            template = cv2.imread(template_path, 0)

            # Template matching
            result = cv2.matchTemplate(gray_image, template, cv2.TM_CCOEFF_NORMED)
            min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)

            # Process matches that exceed the similarity threshold
            if max_val >= similarity_threshold:
                top_left = max_loc
                h, w = template.shape
                bottom_right = (top_left[0] + w, top_left[1] + h)

                # Shrink the bounding box slightly to avoid overlapping too much
                top_left = (top_left[0] + shrink_factor, top_left[1] + shrink_factor)
                bottom_right = (bottom_right[0] - shrink_factor, bottom_right[1] - shrink_factor)

                cv2.rectangle(image, top_left, bottom_right, (255, 255, 255), -1)

    # Save the result
    cv2.imwrite(output_path, image)

    # Display the processed image
    plt.figure(figsize=(10, 5))
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.title(f"Processed Image: {os.path.basename(output_path)}")
    plt.axis("off")
    plt.show()

# Paths for processing
input_folder = "/kaggle/input/fc065244-pdf-latest"
templates_dir = "/kaggle/input/remove-tags"
output_folder = "/kaggle/working/processed_images"

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# Process all images in the input folder
for image_filename in os.listdir(input_folder):
    if image_filename.lower().endswith((".png", ".jpg", ".jpeg")):
        input_image_path = os.path.join(input_folder, image_filename)
        output_image_path = os.path.join(output_folder, image_filename)

        # Run the function to remove tags
        remove_tags(
            image_path=input_image_path,
            templates_dir=templates_dir,
            output_path=output_image_path,
            similarity_threshold=0.5,
            shrink_factor=5
        )

print(f"Processed images are saved in: {output_folder}")


# Workflow: Extracting and Organizing Questions from Images (3 Columns)

This workflow automates the process of extracting questions and their options from a set of images. It handles column-based layouts, detects questions, extracts relevant regions, and organizes them into structured outputs.

---

## Step-by-Step Workflow

### 1. Organizing Input and Output Folders
- **Input Folder**: Store images to process in the folder (e.g., `/kaggle/input/fc065244-image-completed`).
- **Output Folders**:  
  - `stitched_output_dir`: Saves vertically stitched column images.  
  - `output_dir`: Saves cropped question images, organized by chapter and question number.

---

### 2. Stitched Column Layout
- The image is divided into **three columns**.
- These columns are stitched vertically into a single image for easier question extraction.
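
A minimal sketch of the column stitching, assuming the three columns have equal width (the actual notebook may locate column boundaries differently). When the page width is not divisible by the column count, the leftover pixels on the right are dropped.

```python
import numpy as np


def stitch_columns(image, num_columns=3):
    """Cut the page into equal-width vertical strips and stack them top to bottom."""
    width = image.shape[1]
    column_width = width // num_columns
    columns = [image[:, i * column_width:(i + 1) * column_width]
               for i in range(num_columns)]
    return np.vstack(columns)


# Synthetic page: each column gets its own grey level so the order is visible
page = np.zeros((100, 300, 3), dtype=np.uint8)
page[:, 100:200] = 128
page[:, 200:300] = 255

stitched = stitch_columns(page)
print(stitched.shape)
```

Stacking the columns in left-to-right order preserves the reading order, so questions that continue from the bottom of one column to the top of the next stay contiguous.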

---

### 3. Detect and Extract Questions
- Text is extracted using **pytesseract**, and numbered questions (e.g., `1.`, `2.`) are detected.
- Options (`A, B, C, D`) and any text following the last option are also captured.
- The algorithm checks the **ordering of question numbers**:
  - Skips a detected number that is more than one below the last accepted number, since it is likely a stray number from the question text.
  - Accepts any other number, including a repeat of the last number or a forward jump.
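
The ordering check can be illustrated on a plain list of detected numbers. This is a minimal sketch of the rule as it appears in the code cell further down; the function name `filter_question_numbers` is made up for the example.

```python
def filter_question_numbers(candidates):
    """Drop any number that falls more than one below the last accepted number."""
    accepted = []
    last = 0
    for number in candidates:
        if last - number > 1:
            continue  # e.g. a stray "1" that shows up after question 3
        accepted.append(number)
        last = number
    return accepted

print(filter_question_numbers([1, 2, 3, 1, 4, 9, 5, 10]))
# [1, 2, 3, 4, 9, 10]
```

Note that a forward jump (the `9`) is accepted, which then causes the genuine `5` to be skipped. Misread numbers can therefore hide real questions, so the cropped output is worth spot-checking.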

---

### 4. Cropping Questions and Options
- For each detected question:
  - The bounding box of the question and its associated options is determined.
  - The region of interest (ROI) is cropped, including a padding of 15 pixels to capture context.
- The cropped question images are saved in the **output folder**.
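
As a rough sketch of the cropping step, the snippet below merges several `(x, y, w, h)` boxes into one padded region and clamps it to the image bounds. The helper name `crop_union` and the sample boxes are invented for illustration.

```python
import numpy as np

def crop_union(image, boxes, padding=15):
    """Crop the smallest rectangle covering all (x, y, w, h) boxes, plus padding."""
    min_x = min(x for x, y, w, h in boxes)
    min_y = min(y for x, y, w, h in boxes)
    max_x = max(x + w for x, y, w, h in boxes)
    max_y = max(y + h for x, y, w, h in boxes)

    # Clamp the padded rectangle so it never leaves the image.
    top = max(min_y - padding, 0)
    bottom = min(max_y + padding, image.shape[0])
    left = max(min_x - padding, 0)
    right = min(max_x + padding, image.shape[1])
    return image[top:bottom, left:right]

page = np.zeros((200, 300, 3), dtype=np.uint8)
# Hypothetical boxes: a question number, an option label, and a line of trailing text.
boxes = [(10, 20, 30, 12), (10, 60, 15, 12), (10, 100, 120, 12)]
roi = crop_union(page, boxes)
print(roi.shape)  # (122, 145, 3)
```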

---

### 5. Stitching Across Pages
- If a question is cut across two pages:
  - The narrower image is padded with the background color to match widths.
  - The question is stitched vertically to continue the context.
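
A minimal sketch of the page-to-page stitching, assuming a white background and using a hypothetical helper named `stack_with_matching_width`:

```python
import numpy as np

def stack_with_matching_width(top, bottom, fill=255):
    """Pad the narrower image on the right, then stack the two images vertically."""
    target = max(top.shape[1], bottom.shape[1])

    def pad(img):
        extra = target - img.shape[1]
        return np.pad(img, ((0, 0), (0, extra), (0, 0)), constant_values=fill)

    return np.vstack((pad(top), pad(bottom)))

# The last cropped question of one page (narrower) is placed above the next page.
carried_question = np.full((40, 180, 3), 255, dtype=np.uint8)
next_page = np.full((100, 200, 3), 255, dtype=np.uint8)
combined = stack_with_matching_width(carried_question, next_page)
print(combined.shape)  # (140, 200, 3)
```

`np.vstack` requires equal widths, which is why the padding step has to come first.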

---

## Parameters You Can Customize

1. **Paths**  
   - `dataset_folder`: Folder containing input images.  
   - `stitched_output_dir`: Folder to save stitched column images.  
   - `output_dir`: Folder to save cropped questions.

2. **Text Recognition Settings**  
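   - Uses pytesseract with `--psm 6`, which treats the input as a single uniform block of text. This suits the stitched image, where the columns have already been stacked into one column.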

3. **Padding**  
   - Adjusts the padding around the cropped questions. Default: `15` pixels.

4. **Symbols to Ignore**  
   - Customize the list of ignored symbols or units during question detection.  
   - Example: `ignore_symbols = ["°", "×", "™", "%", "kg", "m"]`.

5. **Column Handling**  
   - The image is divided into 3 columns by default.
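
To see how the ignore list affects question detection, here is a small regex sketch. It uses alternation in a negative lookahead, which is one way to express the rule and not necessarily how the notebook's function behaves in every case; the name `find_question_numbers` is an assumption for this example.

```python
import re

IGNORE_SYMBOLS = ["°", "×", "™", "%", "kg", "m"]

def find_question_numbers(token, ignore_symbols=IGNORE_SYMBOLS):
    """Return numbers followed by '.' or ',' that do not look like measurements."""
    ignore = "|".join(map(re.escape, ignore_symbols))
    # A number at the start (or after whitespace), then "." or ",",
    # not followed by a digit or by any ignored symbol or unit.
    pattern = rf"(?:^|\s)(\d+)[.,](?!\d|{ignore})"
    return [n for n in re.findall(pattern, token) if n != "0"]

print(find_question_numbers("12."))   # ['12']
print(find_question_numbers("3.5"))   # []  (a decimal, not a question number)
print(find_question_numbers("5.kg"))  # []  (followed by a unit)
```

Adding a unit or symbol to the list makes detection stricter, so extend it when values from the question text are being mistaken for question numbers.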

---

## Output

1. **Stitched Images**
   - All columns in the input image are stitched vertically and saved to `stitched_output_dir`.

2. **Cropped Questions**
   - Each detected question and its options are saved as individual image files in `output_dir`.
   - Images are named based on chapter and question number (e.g., `FC065244_C1_Q1.png`).

3. **Unprocessed Images**
   - The stitched image of every page is written to `stitched_output_dir`, so pages where no questions were detected can still be reviewed there.

---

## Example Usage

### Input Files
- Images: `/kaggle/input/fc065244-image-completed/FC065244_C1_P1.png`.

### Output Files
- **Stitched Images**:  
  `/kaggle/working/FC065244_stitched_images/FC065244_C1_P1.png`.
- **Cropped Questions**:  
  `/kaggle/working/FC065244_cropped_questions_test4/FC065244_C1_Q1.png`
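
The file names above encode the chapter (`C1`) and the page (`P1`). The sketch below shows one way to pull both out and sort pages numerically within a chapter; the helper name `chapter_and_page` is hypothetical.

```python
import re

def chapter_and_page(filename):
    """Return (chapter, page) parsed from names like 'FC065244_C1_P1.png'."""
    chapter = filename.split("_")[1]
    match = re.search(r"P(\d+)", filename)
    page = int(match.group(1)) if match else 0  # fall back to 0 when no page tag exists
    return chapter, page

files = ["FC065244_C1_P10.png", "FC065244_C1_P2.png", "FC065244_C2_P1.png"]
print(sorted(files, key=chapter_and_page))
# ['FC065244_C1_P2.png', 'FC065244_C1_P10.png', 'FC065244_C2_P1.png']
```

Pages are compared as integers, so `P2` sorts before `P10`. Chapters are still compared as strings, which means `C10` would sort before `C2`.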

In [None]:
import cv2
import numpy as np
import pytesseract
import re
import os
import matplotlib.pyplot as plt

# Define paths
output_dir = "/kaggle/working/FC065244_cropped_questions_test4"
stitched_output_dir = "/kaggle/working/FC065244_stitched_images"
dataset_folder = '/kaggle/input/fc065244-image-completed'
custom_image_path = ""  # Optional: specify a single image for processing
os.makedirs(output_dir, exist_ok=True)
os.makedirs(stitched_output_dir, exist_ok=True)


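def extract_numbered_list(text):
    """
    Extracts question numbers (e.g., "1." or "2,") from a text token, ignoring numbers
    that are followed by a digit or by one of the listed symbols or units.
    """
    ignore_symbols = ["°", "•", "×", "+", "™", "©", "®", "%", "$", "€", "¥", "N", "kg", "m"]
    ignore_patterns = r"|".join(map(re.escape, ignore_symbols))
    # Use alternation in the lookahead: inside a character class, "kg" and the "|"
    # separators would be read as the single characters "k", "g" and "|".
    pattern = rf'(^|\s)(\d+)[\.,](?!\d|{ignore_patterns})'
    matches = re.findall(pattern, text)

    # The whole token is rejected if it contains any ignored symbol anywhere.
    return [match[1] for match in matches if not re.search(ignore_patterns, text) and match[1] != "0"]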


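def crop_and_stitch_columns(image):
    """
    Splits the image into three columns, then vertically stitches them into one image.
    """
    height, width = image.shape[:2]
    column_width = width // 3
    # Take exactly three slices; the last one absorbs leftover pixels when the width is not
    # divisible by 3 (range(0, width, column_width) would produce a fourth sliver column).
    columns = [image[:, i * column_width:(width if i == 2 else (i + 1) * column_width)] for i in range(3)]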

    # Equalize column widths and stitch vertically
    max_width = max(col.shape[1] for col in columns)
    for i, col in enumerate(columns):
        if col.shape[1] < max_width:
            padding = np.zeros((col.shape[0], max_width - col.shape[1], 3), dtype=np.uint8)
            columns[i] = np.hstack((col, padding))

    return np.vstack(columns)


def get_text_and_boxes(image):
    """
    Extracts text and bounding boxes from an image using pytesseract.
    """
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    return [(data['text'][i].strip(), (data['left'][i], data['top'][i], data['width'][i], data['height'][i]))
            for i in range(len(data['text'])) if data['text'][i].strip()]


def detect_and_extract_questions_with_options(image):
    """
    Detects questions and their options from an image, skipping any number that falls
    more than one below the last accepted question number.
    """
    text_boxes = get_text_and_boxes(image)
    questions_with_options = []
    current_question = None
    current_options = []
    after_options_text = []
    option_labels = ["A", "B", "C", "D"]
    detected_last_option = False
    last_detected_question_number = 0

    for text, (x, y, w, h) in text_boxes:
        numbered_list = extract_numbered_list(text)
        if numbered_list:
            current_number = int(numbered_list[0])

            # Validate question order
            if current_number <= last_detected_question_number and (last_detected_question_number - current_number) > 1:
                continue
            last_detected_question_number = current_number

            # Finalize the previous question
            if current_question:
                questions_with_options.append((current_question, current_options, after_options_text))
            current_question = (current_number, (x, y, w, h))
            current_options = []
            after_options_text = []
            detected_last_option = False

        elif current_question:
            # Check if the text is part of the options
            qx, qy, qw, qh = current_question[1]
            if y > qy and abs(y - (qy + qh)) < 3000:
                if text in option_labels and not detected_last_option:
                    current_options.append((text, (x, y, w, h)))
                    if text == "D":
                        detected_last_option = True
                elif detected_last_option:
                    after_options_text.append((text, (x, y, w, h)))

    # Append the last question if needed
    if current_question:
        questions_with_options.append((current_question, current_options, after_options_text))

    cropped_questions = []
    for question, options, after_texts in questions_with_options:
        question_number, (qx, qy, qw, qh) = question
        min_x, min_y = qx, qy
        max_x, max_y = qx + qw, qy + qh

        for _, (x, y, w, h) in options + after_texts:
            min_x, min_y, max_x, max_y = min(min_x, x), min(min_y, y), max(max_x, x + w), max(max_y, y + h)

        padding = 15
        roi = image[max(min_y - padding, 0):min(max_y + padding, image.shape[0]),
                    max(min_x - padding, 0):min(max_x + padding, image.shape[1])]

        cropped_questions.append((roi, question_number, [opt[0] for opt in options]))

    return cropped_questions


def pad_to_match_width(image1, image2):
    """
    Pads the narrower image on the right to match the width of the wider image, filling with
    the mean color of that image's five leftmost pixel columns (used as a background estimate).
    """
    height1, width1 = image1.shape[:2]
    height2, width2 = image2.shape[:2]

    if width1 < width2:
        padding = np.full((height1, width2 - width1, 3), np.mean(image1[:, :5], axis=(0, 1)), dtype=np.uint8)
        return np.hstack((image1, padding)), image2
    elif width2 < width1:
        padding = np.full((height2, width1 - width2, 3), np.mean(image2[:, :5], axis=(0, 1)), dtype=np.uint8)
        return image1, np.hstack((image2, padding))
    return image1, image2


def main():
    stitched_images_by_chapter = {}

    # Organize images by chapter
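    # Sort by chapter, then by page number; files without a "P<number>" tag sort as page 0
    image_files = sorted([f for f in os.listdir(dataset_folder) if f.endswith(".png")],
                         key=lambda x: (x.split('_')[1],
                                        int(re.search(r'P(\d+)', x).group(1)) if re.search(r'P(\d+)', x) else 0))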

    for image_file in image_files:
        chapter_key = image_file.split('_')[1]
        image_path = os.path.join(dataset_folder, image_file)
        image = cv2.imread(image_path)
        stitched_image = crop_and_stitch_columns(image)

        if chapter_key not in stitched_images_by_chapter:
            stitched_images_by_chapter[chapter_key] = []
        stitched_images_by_chapter[chapter_key].append((stitched_image, image_file))

    # Process each chapter
    for chapter_key, chapter_images in stitched_images_by_chapter.items():
        last_cropped_question = None

        for i, (stitched_image, image_file) in enumerate(chapter_images):
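            # Compare with None explicitly: the truth value of a NumPy array is ambiguous and raises ValueError
            if last_cropped_question is not None: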
                stitched_image, last_cropped_question = pad_to_match_width(stitched_image, last_cropped_question)
                stitched_image = np.vstack((last_cropped_question, stitched_image))

            stitched_image_path = os.path.join(stitched_output_dir, f"{image_file}")
            cv2.imwrite(stitched_image_path, stitched_image)

            question_regions = detect_and_extract_questions_with_options(stitched_image)

            for roi, question_number, _ in question_regions:
                cropped_question_path = os.path.join(output_dir, f"FC065244_{chapter_key}_Q{question_number}.png")
                cv2.imwrite(cropped_question_path, roi)

            if question_regions:
                last_cropped_question, _, _ = question_regions[-1]
            else:
                last_cropped_question = None


if __name__ == "__main__":
    main()


# Workflow: Extracting and Organizing Questions from Images (2 Columns)

This workflow automates the process of extracting questions and their options from a set of images. It handles **two-column layouts** in the same way as the three-column workflow, detects questions, extracts relevant regions, and organizes them into structured outputs.

---

### Example Usage
To run the script on a different book, update the output paths and the dataset folder. The settings used for each book are listed below; to process another book, substitute its identifier in the paths.

---

### Example Usage for Each Book

For **Task 3: QC174032 Book**

```python
output_dir = "/kaggle/working/output_final_images_QC174032/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_QC174032/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_QC174032"
```

---

For **Task 4: KM24SF1 Book**

```python
output_dir = "/kaggle/working/output_final_images_KM24SF1/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_KM24SF1/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_KM24SF1"
```

---

For **Task 5: KM24SMA Book**

```python
output_dir = "/kaggle/working/output_final_images_KM24SMA/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_KM24SMA/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_KM24SMA"
```

---

For **Task 6: GG24SFI4 Book**

```python
output_dir = "/kaggle/working/output_final_images_GG24SFI4/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_GG24SFI4/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_GG24SFI4"
```

---

For **Task 7: IB4MA Book**

```python
output_dir = "/kaggle/working/output_final_images_IB4MA/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_IB4MA/stitched_images"
dataset_folder = "/kaggle/working/output_final_images_IB4MA"
```
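
Since the per-book settings differ only in the book identifier, they could also be generated by a small helper. This is a sketch; `book_paths` and its `base_dir` parameter are hypothetical names, not something the notebook defines.

```python
def book_paths(book_id, base_dir="/kaggle/working"):
    """Build the three path settings for one book from its identifier."""
    root = f"{base_dir}/output_final_images_{book_id}"
    return {
        "dataset_folder": root,
        "stitched_output_dir": f"{root}/stitched_images",
        "output_dir": f"{root}/cropped_question",
    }

paths = book_paths("KM24SF1")
print(paths["output_dir"])
# /kaggle/working/output_final_images_KM24SF1/cropped_question
```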

---

In [None]:
import cv2
import numpy as np
import pytesseract
import re
import os
import matplotlib.pyplot as plt
import fnmatch

# Define paths
output_dir = "/kaggle/working/output_final_images_QC174032/cropped_question"
stitched_output_dir = "/kaggle/working/output_final_images_QC174032/stitched_images"
dataset_folder = '/kaggle/working/output_final_images_QC174032'
os.makedirs(output_dir, exist_ok=True)
os.makedirs(stitched_output_dir, exist_ok=True)

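def extract_numbered_list(text):
    """
    Extracts numbered questions (e.g., "1.", "2.") from a text token, ignoring numbers
    that are followed by a digit or by one of the listed symbols or units.
    """
    ignore_symbols = ["°", "•", "×", "+", "™", "©", "®", "%", "$", "€", "¥", "N", "kg", "m"]
    ignore_patterns = r"|".join(map(re.escape, ignore_symbols))
    # Use alternation in the lookahead: inside a character class, "kg" and the "|"
    # separators would be read as the single characters "k", "g" and "|".
    pattern = rf'(^|\s)(\d+)\.(?!\d|{ignore_patterns})'
    matches = re.findall(pattern, text)
    # The whole token is rejected if it contains any ignored symbol anywhere.
    return [match[1] for match in matches if not re.search(ignore_patterns, text) and match[1] != "0"]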

def tesseract_extract_text(image):
    config = '--psm 6'
    return pytesseract.image_to_string(image, config=config)

def crop_and_stitch_columns(image):
    """
    Stitches two columns of an image vertically.
    """
    column_width = image.shape[1] // 2
    columns = [
        image[:, :column_width],
        image[:, column_width:]
    ]
    max_width = max(col.shape[1] for col in columns)
    for i in range(len(columns)):
        current_width = columns[i].shape[1]
        if current_width < max_width:
            padding = np.zeros((columns[i].shape[0], max_width - current_width, 3), dtype=np.uint8)
            columns[i] = np.hstack((columns[i], padding))
    return np.vstack(columns)

def get_text_and_boxes(image):
    """
    Extracts text and bounding boxes from the image using pytesseract.
    """
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    return [(data['text'][i].strip(), (data['left'][i], data['top'][i], data['width'][i], data['height'][i]))
            for i in range(len(data['text'])) if data['text'][i].strip()]

def detect_and_extract_questions_with_options(image):
    """
    Detects and extracts questions and their options from the image.
    """
    text_boxes = get_text_and_boxes(image)
    questions_with_options = []
    current_question = None
    current_options = []
    after_options_text = []
    option_labels = ["A", "B", "C", "D"]
    detected_last_option = False
    skip_current_question = False
    expected_question_number = None

    for text, (x, y, w, h) in text_boxes:
        if skip_current_question:
            skip_current_question = False
            continue
        numbered_list = extract_numbered_list(text)
        if numbered_list:
            current_number = int(numbered_list[0])
            if expected_question_number is not None and current_number != expected_question_number:
                continue
            expected_question_number = current_number + 1
            if current_question:
                questions_with_options.append((current_question, current_options, after_options_text))
            current_question = (numbered_list[0], (x, y, w, h))
            current_options = []
            after_options_text = []
            detected_last_option = False
            skip_current_question = False
        elif current_question:
            qx, qy, qw, qh = current_question[1]
            if y > qy and abs(y - (qy + qh)) < 2000:
                if text in option_labels and not detected_last_option:
                    current_options.append((text, (x, y, w, h)))
                    if text == "D":
                        detected_last_option = True
                elif detected_last_option:
                    if re.match(r'^[\w\s\.,!?\'"-]*$', text) and len(text) > 1:
                        if after_options_text and abs(y - after_options_text[-1][1][1]) > 100:
                            skip_current_question = True
                            continue
                        after_options_text.append((text, (x, y, w, h)))
                    else:
                        current_options.append((text, (x, y, w, h)))

    if current_question:
        questions_with_options.append((current_question, current_options, after_options_text))

    cropped_questions = []
    for question, options, after_texts in questions_with_options:
        question_number, (qx, qy, qw, qh) = question
        min_x, min_y = qx, qy
        max_x, max_y = qx + qw, qy + qh
        for _, (x, y, w, h) in options + after_texts:
            min_x, min_y, max_x, max_y = min(min_x, x), min(min_y, y), max(max_x, x + w), max(max_y, y + h)
        padding = 15
        roi = image[max(min_y - padding, 0):min(max_y + padding, image.shape[0]),
                    max(min_x - padding, 0):min(max_x + padding, image.shape[1])]
        cropped_questions.append((roi, question_number, [opt[0] for opt in options]))
    return cropped_questions

def main():
    image_files = sorted(
        [f for f in os.listdir(dataset_folder) if fnmatch.fnmatch(f, "QC174032_7_*.png")]
    )
    if not image_files:
        print(f"No files matching the pattern 'QC174032_7_*.png' found in {dataset_folder}.")
        return
    for image_file in image_files:
        image_path = os.path.join(dataset_folder, image_file)
        image = cv2.imread(image_path)
        stitched_image = crop_and_stitch_columns(image)
        stitched_image_path = os.path.join(stitched_output_dir, f"stitched_{image_file}")
        cv2.imwrite(stitched_image_path, stitched_image)
        question_regions = detect_and_extract_questions_with_options(stitched_image)
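        # Drop the extension so outputs are not named like "page.png_question_3.png"
        base_name = os.path.splitext(image_file)[0]
        for roi, question_number, options in question_regions:
            cropped_question_path = os.path.join(output_dir, f"{base_name}_question_{question_number}.png")
            cv2.imwrite(cropped_question_path, roi)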

if __name__ == "__main__":
    main()


# Improved Version: Cropping Questions from 2-Column Pages (Fixes the Numbered-List Problem)
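
Judging from the code in the next cell, two checks set this version apart: a token counts as a question number only if it consists solely of digits, and only if it sits near the left edge of the stitched image. The sketch below isolates that test; the function name `is_question_number` is made up for the example.

```python
import re

def is_question_number(token, x, image_width, left_threshold=0.1):
    """True for a digits-only token whose left edge lies in the leftmost part of the page."""
    digits_only = re.fullmatch(r"\d+", token) is not None
    near_left_edge = x <= image_width * left_threshold
    return digits_only and near_left_edge

print(is_question_number("12", x=40, image_width=1000))   # True
print(is_question_number("12", x=400, image_width=1000))  # False: a number inside the text
print(is_question_number("12.", x=40, image_width=1000))  # False: not digits only
```

The position check is what filters out numbers that appear inside a question's own text or options.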

In [None]:
import cv2
import numpy as np
import pytesseract
import re
import os
import matplotlib.pyplot as plt

# Define paths
output_dir = "/kaggle/working/IB4MA_cropped_questions_final"
stitched_output_dir = "/kaggle/working/IB4MA_stitched_images"
dataset_folder = '/kaggle/working/processed_images_IB4MA_rK2'
os.makedirs(output_dir, exist_ok=True) 
os.makedirs(stitched_output_dir, exist_ok=True)

def extract_numbered_list(text):
    """
    Extracts numbers from text where the text contains only digits (e.g., '1', '23', '456').
    """
    numbered_list = []
    if re.match(r'^\d+$', text):  
        numbered_list.append(text)  # Add the detected number
    return numbered_list

def crop_and_stitch_columns(image):
    """
    Splits the image into two columns and stitches them vertically.
    """
    height, width = image.shape[:2]
    column_width = width // 2
    columns = [(0, column_width), (column_width, width)]
    column_images = []

    for start_x, end_x in columns:
        column_image = image[:, start_x:end_x]
        column_images.append(column_image)

    max_width = max(col.shape[1] for col in column_images)
    for i in range(len(column_images)):
        current_width = column_images[i].shape[1]
        if current_width < max_width:
            padding = np.zeros((column_images[i].shape[0], max_width - current_width, 3), dtype=np.uint8)
            column_images[i] = np.hstack((column_images[i], padding))

    stitched_image = np.vstack(column_images)
    return stitched_image

def get_text_and_boxes(image):
    """
    Extracts text and bounding boxes from the entire image using pytesseract.
    """
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    text_boxes = []
    for i in range(len(data['text'])):
        text = data['text'][i].strip()
        if text:  # Ignore empty text
            x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
            text_boxes.append((text, (x, y, w, h)))
    return text_boxes

def detect_and_extract_questions_with_options(image):
    """
    Detects and extracts questions and options from the image.
    """
    text_boxes = get_text_and_boxes(image)
    questions_with_options = []
    current_question = None
    current_options = []
    after_options_text = []
    option_labels = ["A", "B", "C", "D"]
    detected_last_option = False
    left_threshold = 0.1
    last_detected_question_number = 0
    image_width = image.shape[1]

    for text, (x, y, w, h) in text_boxes:
        numbered_list = extract_numbered_list(text)
        if numbered_list:
            current_number = int(numbered_list[0])
            if x > image_width * left_threshold:
                continue          
            if current_number <= last_detected_question_number and (last_detected_question_number - current_number) > 1:
                continue
            last_detected_question_number = current_number

            if current_question is not None:
                questions_with_options.append((current_question, current_options, after_options_text))

            current_question = (numbered_list[0], (x, y, w, h))
            current_options = []
            after_options_text = []
            detected_last_option = False

        elif current_question is not None:
            qx, qy, qw, qh = current_question[1]
            if y > qy and abs(y - (qy + qh)) < 3000:
                if text in option_labels and not detected_last_option:
                    current_options.append((text, (x, y, w, h)))
                    if text == "D":
                        detected_last_option = True
                elif detected_last_option:
                    if re.match(r'^[\w\s\.,]*$', text) and len(text) > 1:
                        after_options_text.append((text, (x, y, w, h)))

    if current_question is not None:
        questions_with_options.append((current_question, current_options, after_options_text))

    cropped_questions = []
    for question, options, after_texts in questions_with_options:
        question_number, (qx, qy, qw, qh) = question
        min_x, min_y = qx, qy
        max_x, max_y = qx + qw, qy + qh

        for _, (x, y, w, h) in options + after_texts:
            min_x, min_y = min(min_x, x), min(min_y, y)
            max_x, max_y = max(max_x, x + w), max(max_y, y + h)

        padding = 50
        y_padded = max(min_y - padding, 0)
        x_padded = max(min_x - padding, 0)
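        # Note: max() makes each slice end at least as large as the image size, so every crop
        # extends to the right and bottom edges of the image. Use min() to clamp to the padded box.
        cropped_questions.append((image[y_padded:max(max_y + padding, image.shape[0]),
                                        x_padded:max(max_x + padding, image.shape[1])], question_number, [opt[0] for opt in options]))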
    return cropped_questions

def pad_to_match_width(image1, image2):
    """
    Pads the narrower image to match the width of the wider image.
    """
    height1, width1 = image1.shape[:2]
    height2, width2 = image2.shape[:2]
    white_bg_color = (255, 255, 255)

    if width1 < width2:
        padding = np.full((height1, width2 - width1, 3), white_bg_color, dtype=np.uint8)
        return np.hstack((image1, padding)), image2
    elif width2 < width1:
        padding = np.full((height2, width1 - width2, 3), white_bg_color, dtype=np.uint8)
        return image1, np.hstack((image2, padding))
    return image1, image2

def main():
    stitched_images_by_chapter = {}
    image_files = sorted(
        [f for f in os.listdir(dataset_folder) if f.endswith(".png")], 
        key=lambda x: (
            x.split('_')[1],
            int(re.search(r'P(\d+)', x).group(1)) if re.search(r'P(\d+)', x) else 0
        )
    )

    for image_file in image_files:
        chapter_key = image_file.split('_')[1]
        image_path = os.path.join(dataset_folder, image_file)
        image = cv2.imread(image_path)
        stitched_image = crop_and_stitch_columns(image)

        if chapter_key not in stitched_images_by_chapter:
            stitched_images_by_chapter[chapter_key] = []
        stitched_images_by_chapter[chapter_key].append((stitched_image, image_file))

    for chapter_key, chapter_images in stitched_images_by_chapter.items():
        last_cropped_question = None

        for i, (stitched_image, image_file) in enumerate(chapter_images):
            if last_cropped_question is not None:
                stitched_image, last_cropped_question = pad_to_match_width(stitched_image, last_cropped_question)
                stitched_image = np.vstack((last_cropped_question, stitched_image))

            stitched_image_path = os.path.join(stitched_output_dir, f"{image_file}")
            cv2.imwrite(stitched_image_path, stitched_image)

            question_regions = detect_and_extract_questions_with_options(stitched_image)

            for roi, question_number, _ in question_regions:
                cropped_question_path = os.path.join(output_dir, f"IB4MA_{chapter_key}_Q{question_number}.png")
                cv2.imwrite(cropped_question_path, roi)

            if question_regions:
                last_cropped_question, _, _ = question_regions[-1]
            else:
                last_cropped_question = None

if __name__ == "__main__":
    main()


# Answer Extraction with OCR
This process uses **Tesseract** and **EasyOCR** to extract answers from images containing multiple-choice answers. It is designed specifically for images with a **single-column layout**, as multi-column formats may cause errors during text extraction.

## Process Overview
1. **Input Requirements**:
   - The script processes images in a structured, single-column format.
   - Multi-column layouts are not supported without additional preprocessing.
2. **OCR Methods**:
   - Both **Tesseract** and **EasyOCR** are used for extracting text.
   - Results are compared; when the two engines disagree on an answer, the EasyOCR result is used and the question is flagged for rechecking.
3. **Output**:
   - Extracted answers are saved in a structured format (JSON) for further use.
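
The comparison step can be sketched on plain strings, without running either OCR engine. The example below assumes answers appear as `number. letter` and prefers the EasyOCR reading on a conflict; `merge_answers` is a hypothetical helper, simpler than the full script in the next cell.

```python
import re

ANSWER_PATTERN = re.compile(r"(\d+)\.\s*([A-D])")

def merge_answers(tesseract_text, easyocr_text):
    """Merge two OCR readings; on disagreement keep EasyOCR and record the conflict."""
    tess = dict(ANSWER_PATTERN.findall(tesseract_text))
    easy = dict(ANSWER_PATTERN.findall(easyocr_text))
    final, conflicts = {}, []
    for number in sorted(set(tess) | set(easy), key=int):
        t, e = tess.get(number), easy.get(number)
        if t and e and t != e:
            conflicts.append(int(number))
            final[int(number)] = e
        else:
            final[int(number)] = t or e
    return final, conflicts

final, conflicts = merge_answers("1. A 2. B 3. C", "1. A 2. D 4. B")
print(final)      # {1: 'A', 2: 'D', 3: 'C', 4: 'B'}
print(conflicts)  # [2]
```

Keeping the list of conflicting question numbers makes it easy to recheck just those answers by hand against the source image.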

---

## Supported Format (Single Column)
Example:  
`/kaggle/input/answer-extraction-instruction/FC4_C2_ANS.PNG`

---

## Unsupported Format (Two Columns)
Example:  
`/kaggle/input/answer-extraction-instruction/FC065244_C2_ANS_1.png`


In [None]:
import cv2
import pytesseract
import re
import easyocr
import numpy as np
import json
from collections import defaultdict
import os

def preprocess_image(image):
    # Preprocess the image to enhance OCR accuracy: resizing, grayscale conversion, blurring, sharpening, and thresholding.
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(blurred, -1, kernel)
    # With THRESH_OTSU the threshold is computed automatically; the 150 passed here is ignored.
    _, thresh = cv2.threshold(sharpened, 150, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return thresh

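def tesseract_extract_text(image):
    # Extract text using Tesseract OCR, keeping only words with a positive confidence.
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    # float() copes with confidences reported as ints, floats, or strings such as '96.5'
    extracted_text = " ".join([data['text'][i] for i in range(len(data['text'])) if float(data['conf'][i]) > 0])
    return extracted_text

_easyocr_reader = None  # created on first use and then reused, since loading the models is slow

def easyocr_extract_text(image):
    # Extract text using EasyOCR.
    global _easyocr_reader
    if _easyocr_reader is None:
        _easyocr_reader = easyocr.Reader(['en', 'ms'])
    results = _easyocr_reader.readtext(image, detail=0, contrast_ths=0.7, text_threshold=0.6, low_text=0.5)
    return ' '.join(results)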

def extract_answers(text):
    # Extract answers in the format 'number. answer' (e.g., '1. A').
    answer_pattern = re.compile(r'(\d+)\.\s*([A-D])')
    return answer_pattern.findall(text)

def compare_ocr_methods(image_path):
    # Compare the results of Tesseract and EasyOCR, and resolve conflicts by using EasyOCR answers when discrepancies are found.
    image = cv2.imread(image_path)
    preprocessed_image = preprocess_image(image)

    tesseract_text = tesseract_extract_text(preprocessed_image)
    tesseract_answers = extract_answers(tesseract_text)

    easyocr_text = easyocr_extract_text(preprocessed_image)
    easyocr_answers = extract_answers(easyocr_text)

    answer_dict = defaultdict(lambda: {"Tesseract": None, "EasyOCR": None})
    for number, answer in tesseract_answers:
        answer_dict[number]["Tesseract"] = answer

    for number, answer in easyocr_answers:
        if answer_dict[number]["Tesseract"] is None:
            answer_dict[number]["EasyOCR"] = answer
        elif answer_dict[number]["Tesseract"] != answer:
            answer_dict[number]["EasyOCR"] = answer

    print(f"Results for {image_path}:")
    print(f"Tesseract OCR Results: {tesseract_text}")
    print(f"EasyOCR Results: {easyocr_text}")
    print("Combined Results (Sorted):")

    missing_questions = []
    conflicting_questions = []

    max_question = max(map(int, answer_dict.keys())) if answer_dict else 0
    for i in range(1, max_question + 1):
        str_i = str(i)
        if str_i not in answer_dict:
            missing_questions.append(i)
        else:
            tesseract_answer = answer_dict[str_i]["Tesseract"]
            easyocr_answer = answer_dict[str_i]["EasyOCR"]

            if tesseract_answer and easyocr_answer and tesseract_answer != easyocr_answer:
                conflicting_questions.append(i)
                print(f"Question {i}: Conflicting Answers Detected: Tesseract: {tesseract_answer}, EasyOCR: {easyocr_answer} (Using EasyOCR)")
                answer_dict[str_i]["FinalAnswer"] = easyocr_answer
            elif tesseract_answer:
                print(f"Question {i}: Answer {tesseract_answer} (Detected by Tesseract)")
                answer_dict[str_i]["FinalAnswer"] = tesseract_answer
            elif easyocr_answer:
                print(f"Question {i}: Answer {easyocr_answer} (Detected by EasyOCR)")
                answer_dict[str_i]["FinalAnswer"] = easyocr_answer

    if missing_questions:
        print(f"\nMissing Questions: {', '.join(map(str, missing_questions))}")
    if conflicting_questions:
        print(f"\nConflicting Questions: {', '.join(map(str, conflicting_questions))} (Needs Recheck)")
    print("\n" + "="*50 + "\n")

    for i in missing_questions:
        print(f"Rechecking missing question {i}...")
        str_i = str(i)

        # (?<!\d) stops question 1 from matching the tail of "11 A" or "21 A"
        alternative_pattern = re.compile(fr'(?<!\d){i}\s*([A-D])')
        alternative_match = alternative_pattern.search(tesseract_text)
        if alternative_match:
            tesseract_answer = alternative_match.group(1)
            answer_dict[str_i]["Tesseract"] = tesseract_answer
            answer_dict[str_i]["FinalAnswer"] = tesseract_answer
            print(f"Question {i}: Found answer {tesseract_answer} in Tesseract recheck")
            continue

        alternative_match = alternative_pattern.search(easyocr_text)
        if alternative_match:
            easyocr_answer = alternative_match.group(1)
            answer_dict[str_i]["EasyOCR"] = easyocr_answer
            answer_dict[str_i]["FinalAnswer"] = easyocr_answer
            print(f"Question {i}: Found answer {easyocr_answer} in EasyOCR recheck")

    combined_results = {
        int(number): answer_dict[number]["FinalAnswer"]
        for number in sorted(answer_dict, key=int)
        # .get() avoids a KeyError for entries that never received a FinalAnswer (e.g. a stray "0.")
        if answer_dict[number].get("FinalAnswer")
    }
    conflict_results = {int(number): {"Tesseract": answer_dict[number]["Tesseract"], "EasyOCR": answer_dict[number]["EasyOCR"]} for number in conflicting_questions}

    output_data = {
        "CombinedResults": combined_results,
        "ConflictingQuestions": conflict_results
    }

    output_filename = f'ocr_results_{os.path.basename(image_path).split(".")[0]}.json'
    with open(output_filename, 'w') as json_file:
        json.dump(output_data, json_file, indent=4)
    print(f"Results saved to '{output_filename}'\n")

    print("Final Combined Results:")
    for number, answer in sorted(combined_results.items()):
        print(f"Question {number}: Answer {answer}")

    print("\nJSON Content:")
    print(json.dumps(output_data, indent=4))

    print("\nConflicting Answers:")
    print(json.dumps(conflict_results, indent=4))

# Set the directory containing images
image_directory = '/kaggle/input/qc174032-ans'

# Collect all image file paths in the directory
image_paths = sorted([os.path.join(image_directory, file) for file in os.listdir(image_directory) if file.lower().endswith(('png', 'jpg', 'jpeg'))])

# Process each image in the directory
for image_path in image_paths:
    compare_ocr_methods(image_path)


# Testing on KM24SF1 Answer

In [None]:
import cv2
import pytesseract
import re
import easyocr
import numpy as np
import json
from collections import defaultdict
import os

def preprocess_image(image):
    """ Preprocess the image to enhance OCR accuracy. """
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(blurred, -1, kernel)
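    # With THRESH_OTSU the threshold is chosen automatically and the value passed here is ignored, so 0 is used.
    _, thresh = cv2.threshold(sharpened, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)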
    return thresh

def tesseract_extract_text(image):
    """ Extract text using Tesseract OCR. """
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    extracted_text = " ".join([data['text'][i] for i in range(len(data['text'])) if int(data['conf'][i]) > 0])
    return extracted_text

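_easyocr_reader = None

def easyocr_extract_text(image):
    """ Extract text using EasyOCR. """
    # Build the reader once and reuse it; reloading the models for every image is slow.
    global _easyocr_reader
    if _easyocr_reader is None:
        _easyocr_reader = easyocr.Reader(['en', 'ms'])
    results = _easyocr_reader.readtext(image, detail=0, contrast_ths=0.7, text_threshold=0.6, low_text=0.5)
    return ' '.join(results)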

def extract_answers(text):
    """
    Extract answers where a number is followed by a valid answer (A, B, C, or D).
    Ignores invalid entries.
    """
    answer_pattern = re.compile(r'(\d+)\s+([A-D])\b')  # Matches 'number' followed by a valid answer A-D
    return answer_pattern.findall(text)

def detect_missing_questions(final_answers):
    """ Add missing question numbers with a None (JSON null) value and return the answers sorted by number. """
    if not final_answers:
        return {}  # Nothing was extracted, so there is no range to fill
    existing_numbers = set(map(int, final_answers.keys()))
    missing_numbers = sorted(set(range(1, max(existing_numbers) + 1)) - existing_numbers)

    for num in missing_numbers:
        final_answers[str(num)] = None
    return dict(sorted(final_answers.items(), key=lambda x: int(x[0])))

def process_chapter_images(image_paths, chapter):
    """ Process images for a single chapter. """
    chapter_results = {}
    final_answers = {}
    conflicting_questions = {}

    for image_path in image_paths:
        image = cv2.imread(image_path)
        preprocessed_image = preprocess_image(image)

        # Extract text with both OCR methods
        tesseract_text = tesseract_extract_text(preprocessed_image)
        easyocr_text = easyocr_extract_text(preprocessed_image)

        # Extract answers
        tesseract_answers = extract_answers(tesseract_text)
        easyocr_answers = extract_answers(easyocr_text)

        # Combine and resolve answers
        combined_answers = defaultdict(lambda: {"Tesseract": None, "EasyOCR": None, "Final": None})

        # Store answers from Tesseract
        for num, ans in tesseract_answers:
            combined_answers[num]["Tesseract"] = ans

        # Store answers from EasyOCR and resolve conflicts
        for num, ans in easyocr_answers:
            combined_answers[num]["EasyOCR"] = ans
            if combined_answers[num]["Tesseract"]:
                if combined_answers[num]["Tesseract"] == ans:
                    combined_answers[num]["Final"] = ans  # Both agree
                else:
                    combined_answers[num]["Final"] = ans  # Prioritize EasyOCR
                    conflicting_questions[num] = {
                        "Tesseract": combined_answers[num]["Tesseract"],
                        "EasyOCR": ans
                    }
            else:
                combined_answers[num]["Final"] = ans  # Tesseract is empty, use EasyOCR

        # Update logic to use Tesseract if EasyOCR is None
        for num, answer_data in combined_answers.items():
            if answer_data["EasyOCR"] is None and answer_data["Tesseract"] is not None:
                combined_answers[num]["Final"] = answer_data["Tesseract"]  # Use Tesseract if EasyOCR is None

        # Consolidate answers into final format
        for num, answer_data in combined_answers.items():
            final_answers[num] = answer_data["Final"]

        # Store intermediate results
        chapter_results[os.path.basename(image_path)] = {
            "TesseractText": tesseract_text,
            "EasyOCRText": easyocr_text,
            "Answers": combined_answers
        }

    # Detect missing questions and fill them with None (null in the JSON output)
    final_answers = detect_missing_questions(final_answers)
    return chapter_results, final_answers, conflicting_questions



def save_combined_results(final_answers, conflicting_questions, output_directory, chapter):
    """ Save combined results in proper numerical sequence for each chapter. """
    output_data = {
        "CombinedResults": final_answers,
        "ConflictingQuestions": conflicting_questions
    }
    output_file = os.path.join(output_directory, f"chapter_C{chapter}_final.json")
    with open(output_file, 'w') as json_file:
        json.dump(output_data, json_file, indent=4)
    print(f"Final combined results saved to {output_file}")

def main():
    image_directory = '/kaggle/input/km24sf1-ans'  # Set to your folder path
    output_directory = '/kaggle/working/km24sf1-ocr-ans-final'
    os.makedirs(output_directory, exist_ok=True)
    
    # Group files by chapter (C1, C2, C3)
    chapter_groups = defaultdict(list)
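    for file in os.listdir(image_directory):
        if file.lower().endswith(('.png', '.jpg', '.jpeg')):
            chapter_match = re.search(r'_C(\d+)_', file)  # Extract the chapter number from a tag such as '_C1_'
            if chapter_match is None:
                print(f"Skipping {file}: no chapter tag like '_C1_' in the file name")
                continue
            chapter_groups[chapter_match.group(1)].append(os.path.join(image_directory, file))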
    
    # Process each chapter
    all_results = {}

    for chapter, files in chapter_groups.items():
        print(f"\n{'=' * 20} Processing Chapter C{chapter} ({len(files)} files) {'=' * 20}\n")
        chapter_results, final_answers, conflicting_questions = process_chapter_images(sorted(files), chapter)

        # Print results for each file
        for file_name, result in chapter_results.items():
            print(f"\nFile: {file_name}")
            print(f"Tesseract Text: {result['TesseractText']}")
            print(f"EasyOCR Text: {result['EasyOCRText']}")
            print("Extracted Answers:")
            for qnum, answers in result['Answers'].items():
                print(f"  Q{qnum}: Tesseract={answers['Tesseract']}, EasyOCR={answers['EasyOCR']}, Final={answers['Final']}")

        # Save all chapter results
        all_results[f"Chapter_C{chapter}"] = chapter_results

        # Save detailed chapter results as JSON
        output_file = os.path.join(output_directory, f"chapter_C{chapter}_results.json")
        with open(output_file, 'w') as json_file:
            json.dump(chapter_results, json_file, indent=4)
        print(f"\nSaved detailed results for Chapter C{chapter} to {output_file}")

        # Save final consolidated results for the chapter
        save_combined_results(final_answers, conflicting_questions, output_directory, chapter)

        # Save final consolidated answers for this chapter into an individual JSON file
        final_answers_file = os.path.join(output_directory, f"chapter_C{chapter}_final_answers.json")
        with open(final_answers_file, 'w') as json_file:
            json.dump(final_answers, json_file, indent=4)
        print(f"Saved final consolidated answers for Chapter C{chapter} to {final_answers_file}")

        # Print sorted consolidated answers
        print(f"\nFinal Consolidated Answers for Chapter C{chapter}:")
        for qnum, answer in sorted(final_answers.items(), key=lambda x: int(x[0])):
            print(f"  Question {qnum}: {answer if answer else 'Missing'}")
        print("=" * 60)

    # Save all results combined into a single file
    all_results_file = os.path.join(output_directory, 'all_chapters_results.json')
    with open(all_results_file, 'w') as json_file:
        json.dump(all_results, json_file, indent=4)
    print(f"\nAll detailed results saved to {all_results_file}")


if __name__ == "__main__":
    main()


# Testing on Answer GG24SF14_ans column cropped

In [None]:
# This script runs Tesseract and EasyOCR on images of multiple-choice answer keys.
# Input images must be laid out as a single column for the extraction to work.
# Only images formatted like 'Picture 2' are supported; 'Picture 1' has two separate columns, which makes OCR unreliable.
# The answers from both OCR methods are compared, and conflicts are resolved in favour of the EasyOCR result.


import cv2
import pytesseract
import re
import easyocr
import numpy as np
import json
from collections import defaultdict

def preprocess_image(image):
    # Preprocess the image to enhance OCR accuracy: resizing, grayscale conversion, blurring, sharpening, and thresholding.
    image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(blurred, -1, kernel)
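    # With THRESH_OTSU the threshold is chosen automatically and the value passed here is ignored, so 0 is used.
    _, thresh = cv2.threshold(sharpened, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)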
    return thresh

def tesseract_extract_text(image):
    # Extract text using Tesseract OCR.
    data = pytesseract.image_to_data(image, config='--psm 6', output_type=pytesseract.Output.DICT)
    extracted_text = " ".join([data['text'][i] for i in range(len(data['text'])) if int(data['conf'][i]) > 0])
    return extracted_text

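_easyocr_reader = None

def easyocr_extract_text(image):
    # Extract text using EasyOCR. The reader is built once and reused, since reloading the models per image is slow.
    global _easyocr_reader
    if _easyocr_reader is None:
        _easyocr_reader = easyocr.Reader(['en', 'ms'])
    results = _easyocr_reader.readtext(image, detail=0, contrast_ths=0.7, text_threshold=0.6, low_text=0.5)
    return ' '.join(results)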

def extract_answers(text):
    # Match answers with or without a period
    answer_pattern = re.compile(r'(\d+)\s*\.?\s*([A-D])')
    return answer_pattern.findall(text)


def compare_ocr_methods(image_path):
    # Compare the results of Tesseract and EasyOCR, and resolve conflicts by using EasyOCR answers when discrepancies are found.
    image = cv2.imread(image_path)
    preprocessed_image = preprocess_image(image)

    tesseract_text = tesseract_extract_text(preprocessed_image)
    tesseract_answers = extract_answers(tesseract_text)

    easyocr_text = easyocr_extract_text(preprocessed_image)
    easyocr_answers = extract_answers(easyocr_text)

    answer_dict = defaultdict(lambda: {"Tesseract": None, "EasyOCR": None})
    for number, answer in tesseract_answers:
        answer_dict[number]["Tesseract"] = answer

    for number, answer in easyocr_answers:
        if answer_dict[number]["Tesseract"] is None:
            answer_dict[number]["EasyOCR"] = answer
        elif answer_dict[number]["Tesseract"] != answer:
            answer_dict[number]["EasyOCR"] = answer

    print(f"Results for {image_path}:")
    print(f"Tesseract OCR Results: {tesseract_text}")
    print(f"EasyOCR Results: {easyocr_text}")
    print("Combined Results (Sorted):")

    missing_questions = []
    conflicting_questions = []

    max_question = max(map(int, answer_dict.keys())) if answer_dict else 0
    for i in range(1, max_question + 1):
        str_i = str(i)
        if str_i not in answer_dict:
            missing_questions.append(i)
        else:
            tesseract_answer = answer_dict[str_i]["Tesseract"]
            easyocr_answer = answer_dict[str_i]["EasyOCR"]

            if tesseract_answer and easyocr_answer and tesseract_answer != easyocr_answer:
                conflicting_questions.append(i)
                print(f"Question {i}: Conflicting Answers Detected: Tesseract: {tesseract_answer}, EasyOCR: {easyocr_answer} (Using EasyOCR)")
                answer_dict[str_i]["FinalAnswer"] = easyocr_answer
            elif tesseract_answer:
                print(f"Question {i}: Answer {tesseract_answer} (Detected by Tesseract)")
                answer_dict[str_i]["FinalAnswer"] = tesseract_answer
            elif easyocr_answer:
                print(f"Question {i}: Answer {easyocr_answer} (Detected by EasyOCR)")
                answer_dict[str_i]["FinalAnswer"] = easyocr_answer

    if missing_questions:
        print(f"\nMissing Questions: {', '.join(map(str, missing_questions))}")
    if conflicting_questions:
        print(f"\nConflicting Questions: {', '.join(map(str, conflicting_questions))} (Needs Recheck)")
    print("\n" + "="*50 + "\n")

    for i in missing_questions:
        print(f"Rechecking missing question {i}...")
        str_i = str(i)

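        # The lookbehind keeps the number from matching inside a longer one (e.g. 1 inside 11 or 21)
        alternative_pattern = re.compile(fr'(?<!\d){i}\s*([A-D])')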
        alternative_match = alternative_pattern.search(tesseract_text)
        if alternative_match:
            tesseract_answer = alternative_match.group(1)
            answer_dict[str_i]["Tesseract"] = tesseract_answer
            answer_dict[str_i]["FinalAnswer"] = tesseract_answer
            print(f"Question {i}: Found answer {tesseract_answer} in Tesseract recheck")
            continue

        alternative_match = alternative_pattern.search(easyocr_text)
        if alternative_match:
            easyocr_answer = alternative_match.group(1)
            answer_dict[str_i]["EasyOCR"] = easyocr_answer
            answer_dict[str_i]["FinalAnswer"] = easyocr_answer
            print(f"Question {i}: Found answer {easyocr_answer} in EasyOCR recheck")

    combined_results = {
        int(number): answer_dict[number].get("FinalAnswer", None)
        for number in sorted(answer_dict, key=int)
        if answer_dict[number].get("FinalAnswer") and 1 <= int(number) <= 50  # Valid range filter
    }
    
    conflict_results = {
        int(number): {"Tesseract": answer_dict[number]["Tesseract"], "EasyOCR": answer_dict[number]["EasyOCR"]}
        for number in conflicting_questions
    }
    
    output_data = {
        "CombinedResults": combined_results,
        "ConflictingQuestions": conflict_results
    }
    
    output_filename = f'ocr_results_{image_path.split("/")[-1].split(".")[0]}.json'
    with open(output_filename, 'w') as json_file:
        json.dump(output_data, json_file, indent=4)
    print(f"Results saved to '{output_filename}'\n")


    print("Final Combined Results:")
    for number, answer in sorted(combined_results.items()):
        print(f"Question {number}: Answer {answer}")

    print("\nJSON Content:")
    print(json.dumps(output_data, indent=4))

    print("\nConflicting Answers:")
    print(json.dumps(conflict_results, indent=4))

image_paths = [
    "/kaggle/input/qc174032-ans/Ujian_1.png",
    "/kaggle/input/qc174032-ans/Ujian_2.png",
    "/kaggle/input/qc174032-ans/Ujian_3.png",
    "/kaggle/input/qc174032-ans/Ujian_4.png",
    "/kaggle/input/qc174032-ans/Ujian_5.png",
    "/kaggle/input/qc174032-ans/Ujian_6.png",
    "/kaggle/input/qc174032-ans/Ujian_7.png",
    "/kaggle/input/qc174032-ans/Ujian_8.png",
    "/kaggle/input/qc174032-ans/Ujian_9.png",
    "/kaggle/input/qc174032-ans/Ujian_10.png",
    "/kaggle/input/qc174032-ans/Ujian_11.png",
    "/kaggle/input/qc174032-ans/Ujian_12.png",
    "/kaggle/input/qc174032-ans/Ujian_13.png",
    "/kaggle/input/qc174032-ans/Ujian_14.png"    
]

for image_path in image_paths:
    compare_ocr_methods(image_path)


# Copying and renaming the answer files (adding the chapter prefix 'C')

In [None]:
import os
import shutil
import re

# Define the source and destination directories
source_directory = "/kaggle/input/qc174032-ans-latest-231224"
destination_directory = "/kaggle/working/qc174032-ans-latest-renamed_1"

# Create the destination directory if it doesn't exist
os.makedirs(destination_directory, exist_ok=True)

# Regular expression to match and extract the chapter number
pattern = r"ocr_results_QC174032_(\d+)\.json"

# Copy and rename the files
for file_name in os.listdir(source_directory):
    match = re.match(pattern, file_name)
    if match:
        # Extract the chapter number using the regex
        chapter = match.group(1)  # Extract the numeric part (e.g., '12')

        # Create the new file name
        new_file_name = f"ocr_results_QC174032_C{chapter}.json"  # Add 'C' before the chapter number

        # Define old and new paths
        old_path = os.path.join(source_directory, file_name)
        new_path = os.path.join(destination_directory, new_file_name)

        # Copy the file to the destination directory with the new name
        shutil.copy2(old_path, new_path)
        print(f"Copied and renamed: {file_name} -> {new_file_name}")

print("Renaming and copying complete!")


# Comparing the response and result file names

In [None]:
import os
import re

# Directories for the response and result files
response_dir = "/kaggle/input/qc174032-question-latest231224"
result_dir = "/kaggle/working/QC174032_QA_en_ms_latest/english"

def extract_chapter_question(file_name, pattern):
    """Extract chapter and question from a file name based on a pattern."""
    match = re.search(pattern, file_name)
    if match:
        return match.group(1), match.group(2)  # Return chapter and question as strings
    return None, None

def get_chapters_questions(directory, pattern):
    """Get chapters and questions from files in a directory."""
    chapters_questions = set()
    unmatched_files = []  # To track unmatched files
    for file_name in os.listdir(directory):
        chapter, question = extract_chapter_question(file_name, pattern)
        if chapter and question:
            chapters_questions.add((chapter, question))
        else:
            unmatched_files.append(file_name)  # Track unmatched files
    # Log unmatched files
    if unmatched_files:
        print(f"Unmatched files in {directory}:")
        print(unmatched_files)
    return chapters_questions

def compare_files(response_dir, result_dir):
    """Compare response and result files and identify missing entries."""
    response_pattern = r"C(\d+)_Q(\d+)_response\.json"
    result_pattern = r"C(\d+)_Q(\d+)_en\.json"

    response_chapters_questions = get_chapters_questions(response_dir, response_pattern)
    result_chapters_questions = get_chapters_questions(result_dir, result_pattern)

    # Find missing in results
    missing_in_results = response_chapters_questions - result_chapters_questions

    # Find extra in results
    extra_in_results = result_chapters_questions - response_chapters_questions

    return missing_in_results, extra_in_results, len(response_chapters_questions), len(result_chapters_questions)

# Compare the files
missing_in_results, extra_in_results, total_response, total_results = compare_files(response_dir, result_dir)

# Print results
print(f"Total files in response: {total_response}")
print(f"Total files in results: {total_results}")
print("Missing in results:")
for chapter, question in sorted(missing_in_results):
    print(f"Chapter: {chapter}, Question: {question}")

print("\nExtra in results:")
for chapter, question in sorted(extra_in_results):
    print(f"Chapter: {chapter}, Question: {question}")


### Workflow for Combining Responses and Answers into a Dataset

This workflow explains how to process and combine questions (responses) and their answers from the provided directories into structured outputs.

---

### **Step 1: Set Directories**
- Define directories for:
  - **Responses**: Contains question files (e.g., `C1_Q1_response.json`).
  - **Answers**: Contains answer files (e.g., `C1_answers.json`).

---

### **Step 2: Load Metadata**
- Load **chapter metadata** for each book to add information like:
  - Book name, publisher, ISBN.
  - Chapter topics.
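
A minimal sketch of the topic lookup, assuming each book's metadata file has the shape the combining script below reads: top-level fields such as `book_name` and `publisher`, plus a `chapters` list whose entries carry a `label` (for example `C1`) and a bilingual `topic`. The sample values and the helper name `topic_for` are illustrative and not taken from a real book file.

```python
# Illustrative metadata; the real files hold one JSON document per book.
sample_metadata = {
    "book_name": "Sample Science Workbook",  # hypothetical value
    "publisher": "Sample Publisher",         # hypothetical value
    "chapters": [
        {"label": "C1", "topic": {"english": "Scientific Skills", "malay": "Kemahiran Saintifik"}},
        {"label": "C2", "topic": {"english": "Cells", "malay": "Sel"}},
    ],
}

def topic_for(metadata, chapter_number, language):
    """Return the topic of a chapter in the given language, or "" if it is not listed."""
    label = f"C{chapter_number}"
    for chapter in metadata.get("chapters", []):
        if chapter.get("label") == label:
            return chapter.get("topic", {}).get(language, "")
    return ""

print(topic_for(sample_metadata, 2, "malay"))    # Sel
print(topic_for(sample_metadata, 9, "english"))  # prints an empty line: chapter 9 is not listed
```

Returning an empty string for unknown chapters keeps the output records uniform even when the metadata is incomplete.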

---

### **Step 3: Match Files**
- For each response file:
  1. Find the corresponding answer file using naming patterns.
  2. Extract the question details and match them with the answer.
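
The matching step can be sketched as follows, assuming response files are named `<BOOK>_C<chapter>_Q<question>_response.json` and that an answer file either nests its answers under `CombinedResults` or stores them at the top level. The helper names `parse_response_name` and `lookup_answer` are hypothetical, and the sample data is made up.

```python
import re

RESPONSE_NAME = re.compile(
    r"^(?P<book>[A-Z0-9]+)_C(?P<chapter>\d+)_Q(?P<question>\d+)_response\.json$"
)

def parse_response_name(file_name):
    """Split a response file name into (book, chapter, question), or None if it does not fit."""
    match = RESPONSE_NAME.match(file_name)
    if match is None:
        return None
    return match.group("book"), match.group("chapter"), match.group("question")

def lookup_answer(answers, question):
    """Read one answer from either supported layout; return None when it is absent or empty."""
    table = answers.get("CombinedResults", answers)
    return table.get(str(question)) or None

parsed = parse_response_name("QC174032_C3_Q12_response.json")
chapter_answers = {"CombinedResults": {"11": "B", "12": "D"}, "ConflictingQuestions": {}}
print(parsed)                                     # ('QC174032', '3', '12')
print(lookup_answer(chapter_answers, parsed[2]))  # D
```

Question numbers are kept as strings because JSON object keys are always strings after a round trip through `json.dump` and `json.load`.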

---

### **Step 4: Enrich Data**
- Combine the question and answer details.
- Add metadata such as:
  - Book details.
  - Chapter topics.
  - Question text, options, and figures.
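
As a rough illustration of the enrichment step, the sketch below flattens a bilingual response into a single-language record. The field names (`text`, `figures`, `options`, `answers`) mirror the combining script further down; the function name `build_record` and the sample question are assumptions made for the example.

```python
def build_record(response, answer, language, topic=""):
    """Flatten a bilingual response into a record for one language ('english' or 'malay')."""
    return {
        "topic": topic,
        "text": response.get("text", {}).get(language, ""),
        "figures": [
            {"label": fig.get("label", {}).get(language, ""), "path": fig.get("path", "")}
            for fig in response.get("figures", [])
        ],
        "options": {key: value.get(language, "") for key, value in response.get("options", {}).items()},
        "answers": answer,
    }

# Made-up sample question in both languages.
sample_response = {
    "text": {"english": "Which gas do plants absorb?", "malay": "Gas manakah yang diserap oleh tumbuhan?"},
    "figures": [],
    "options": {
        "A": {"english": "Oxygen", "malay": "Oksigen"},
        "B": {"english": "Carbon dioxide", "malay": "Karbon dioksida"},
    },
}

record = build_record(sample_response, "B", "malay", topic="Fotosintesis")
print(record["options"])  # {'A': 'Oksigen', 'B': 'Karbon dioksida'}
```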

---

### **Step 5: Save Outputs**
- Save the combined data into two formats:
  - **English**: `C1_Q1_en.json`
  - **Malay**: `C1_Q1_ms.json`
- Organize outputs into:
  - `/english`
  - `/malay`
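
One possible way to lay out the two output folders is sketched below. It writes into a temporary directory so it can be run anywhere; the helper name `save_bilingual`, the `LANGUAGE_FOLDERS` mapping, and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile

LANGUAGE_FOLDERS = {"en": "english", "ms": "malay"}

def save_bilingual(records, root, stem):
    """Write one JSON file per language under root/<folder>/ and return the written paths."""
    paths = {}
    for code, record in records.items():
        folder = os.path.join(root, LANGUAGE_FOLDERS[code])
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, f"{stem}_{code}.json")
        # UTF-8 with ensure_ascii=False keeps non-ASCII characters readable in the file.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(record, handle, indent=4, ensure_ascii=False)
        paths[code] = path
    return paths

root = tempfile.mkdtemp()
paths = save_bilingual(
    {"en": {"text": "Sample question"}, "ms": {"text": "Soalan contoh"}},
    root,
    "C1_Q1",
)
for code, path in sorted(paths.items()):
    print(code, os.path.relpath(path, root))
```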

---

### **Step 6: Handle Missing Data**
- Log questions that couldn't be processed due to:
  - Missing answer files.
  - Empty or unmatched answers.
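
The script below only prints a message when a question is skipped. A minimal sketch of collecting those cases into a list instead, so they can be reviewed or saved later, might look like this; the function name `split_answered` and the reason strings are hypothetical.

```python
def split_answered(questions, answers):
    """Separate questions that have a usable answer from those that need to be logged."""
    answered, skipped = {}, []
    for question in questions:
        key = str(question)
        answer = answers.get(key)
        if answer:
            answered[key] = answer
        else:
            reason = "empty answer" if key in answers else "no entry in answer file"
            skipped.append({"question": key, "reason": reason})
    return answered, skipped

answered, skipped = split_answered(["1", "2", "3"], {"1": "A", "2": None})
print(answered)  # {'1': 'A'}
for entry in skipped:
    print(entry)
```

Keeping the reason alongside the question number makes it easier to tell an OCR miss (a null answer) from a question the answer file never mentioned.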

---

### Example:
1. Input:
   - Responses: `/qc174032-response`
   - Answers: `/qc174032-answers`
2. Output:
   - `/english/qc174032_C1_Q1_en.json`
   - `/malay/qc174032_C1_Q1_ms.json`



In [None]:
import os
import json
import re
from pathlib import Path
import glob

# Define the pairs of response and answer directories
data_pairs = [
    {
        "responses_dir": "/kaggle/input/fc064244-cleaned",
        "answers_dir": "/kaggle/input/fc064244-ans-latest-1"
    },
    {
        "responses_dir": "/kaggle/input/fc064244-response-2",
        "answers_dir": "/kaggle/input/fc064244-ans-latest-1"
    },
    {
        "responses_dir": "/kaggle/input/fc065244-response",
        "answers_dir": "/kaggle/input/fc065244-ans-latest-1"
    },
    {
        "responses_dir": "/kaggle/input/qc174032-question-latest231224",
        "answers_dir": "/kaggle/input/qc174032-ans-latest1-23-12-14"
    },
    {
        "responses_dir": "/kaggle/input/km24sf1-response",
        "answers_dir": "/kaggle/input/km24sf1-ans-json"
    },
    {
        "responses_dir": "/kaggle/input/km24sma-response",
        "answers_dir": "/kaggle/input/km24sma-ans-latest"
    },
    {
        "responses_dir": "/kaggle/input/gg24sf14-response",
        "answers_dir": "/kaggle/input/gg24sf14-ans-latest-1"
    }
]

# Define chapters directory
chapters_dir = "/kaggle/input/book-info"

def load_chapters(book_name):
    """Load chapters JSON for a specific book."""
    chapter_file = os.path.join(chapters_dir, f"{book_name}.json")
    if os.path.exists(chapter_file):
        with open(chapter_file, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_json(data, directory, filename):
    """Helper function to save JSON files."""
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    print(f"Saved: {path}")
    print(json.dumps(data, indent=4, ensure_ascii=False))  # Print the JSON content

def get_topic_from_chapters(chapters, chapter_label):
    """Fetch the topic from chapters based on the chapter label."""
    for chapter in chapters.get("chapters", []):
        if chapter.get("label", "") == chapter_label:
            return chapter.get("topic", {})
    return {}

def prepare_output(book_name, chapters, chapter, question, response, answer, language):
    """Build the record for one question in the given language ('english' or 'malay')."""
    topic = get_topic_from_chapters(chapters, f"C{chapter}")
    return {
        "book_no": book_name,
        "book_name": chapters.get("book_name", ""),
        "publisher": chapters.get("publisher", ""),
        "IBSN": chapters.get("IBSN", ""),
        "subject": chapters.get("subject", {}).get(language, ""),
        "topic": topic.get(language, ""),
        "text": response.get("text", {}).get(language, ""),
        "figures": [
            {
                "label": figure.get("label", {}).get(language, ""),
                "path": figure.get("path", "")
            }
            for figure in response.get("figures", [])
        ],
        "options": {k: v.get(language, "") for k, v in response.get("options", {}).items()},
        "answers": answer
    }

def prepare_english_output(book_name, chapters, chapter, question, response, answer):
    """English record for one question."""
    return prepare_output(book_name, chapters, chapter, question, response, answer, "english")

def prepare_malay_output(book_name, chapters, chapter, question, response, answer):
    """Malay record for one question."""
    return prepare_output(book_name, chapters, chapter, question, response, answer, "malay")

def find_answer_file(book_name, chapter, answers_dir):
    """Find the correct answer file dynamically based on various naming patterns."""
    patterns = [
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}_ANS.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}_ans.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_*C{chapter}*.json"),
        os.path.join(answers_dir, f"{book_name}_C{chapter}_ans.json"),
        os.path.join(answers_dir, f"{book_name}_C{chapter}*.json"),
        os.path.join(answers_dir, f"*{chapter}*.json"),  # Broad wildcard for unmatched cases
    ]
    
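    def names_chapter(path):
        # Drop the book code (it contains digits) and require the chapter number to stand
        # alone, so that chapter 1 does not pick up a C10 or C11 file through the wildcards.
        name = os.path.basename(path).replace(book_name, "")
        return re.search(rf"(?<!\d){chapter}(?!\d)", name) is not None

    for pattern in patterns:
        # Sort the matches so the same file is chosen on every run
        files = sorted(f for f in glob.glob(pattern) if names_chapter(f))
        if files:
            return files[0]

    print(f"No matching answer file found for book: {book_name}, chapter: {chapter}")
    return None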

def main():
    print("Processing responses and answers...")

    for pair in data_pairs:
        responses_dir = pair["responses_dir"]
        answers_dir = pair["answers_dir"]

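        response_files = [f for f in os.listdir(responses_dir) if f.endswith("_response.json")]
        # The book code is the leading part of the first response file name
        book_match = re.match(r"([A-Z0-9]+)_", response_files[0]) if response_files else None
        if book_match is None:
            print(f"No usable response files in {responses_dir}; skipping.")
            continue
        book_name = book_match.group(1)
        chapters = load_chapters(book_name)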

        # Special condition for KM24SF1 to rename the output folder and file names to KM24SFI
        if book_name == "KM24SF1":
            book_name_to_save = "KM24SFI"
        else:
            book_name_to_save = book_name

        output_dir_english = os.path.join(f"./{book_name_to_save}_QA_en_ms_latest", "english")
        output_dir_malay = os.path.join(f"./{book_name_to_save}_QA_en_ms_latest", "malay")
        os.makedirs(output_dir_english, exist_ok=True)
        os.makedirs(output_dir_malay, exist_ok=True)

        for file in response_files:
            match = re.match(rf"{re.escape(book_name)}_C(\d+)_Q(\d+)_response\.json$", file)
            if match:
                chapter = match.group(1)
                question = match.group(2)
                response_path = os.path.join(responses_dir, file)

                # Dynamically find the answer file
                answer_path = find_answer_file(book_name, chapter, answers_dir)
                if not answer_path:
                    print(f"No answer file for {file}")
                    continue

                with open(response_path, "r", encoding="utf-8") as f:
                    response = json.load(f)
                with open(answer_path, "r", encoding="utf-8") as f:
                    answers = json.load(f)

                # Check for CombinedResults or directly search for the question key
                if "CombinedResults" in answers:
                    answer = answers.get("CombinedResults", {}).get(question, "")
                else:
                    # Fallback: Directly access the question as a key in the top-level dictionary
                    answer = answers.get(question, "")

                # Skip if no answer is found
                if not answer:
                    print(f"No answer found for question {question} in {answer_path}.")
                    continue

                english_output = prepare_english_output(book_name, chapters, chapter, question, response, answer)
                malay_output = prepare_malay_output(book_name, chapters, chapter, question, response, answer)

                save_json(english_output, output_dir_english, f"{book_name_to_save}_C{chapter}_Q{question}_en.json")
                save_json(malay_output, output_dir_malay, f"{book_name_to_save}_C{chapter}_Q{question}_ms.json")

    print("Processing complete!")


if __name__ == "__main__":
    main()

# Testing with one response directory and one answer directory

In [None]:
import os
import json
import re
from pathlib import Path

# Define file directories
responses_dir = "/kaggle/input/km24sf1-response"
answers_dir = "/kaggle/input/km24sf1-ans-json"

chapters_dir = "/kaggle/input/book-info"  

output_dir_english = "./processed_outputs/english"
output_dir_malay = "./processed_outputs/malay"
os.makedirs(output_dir_english, exist_ok=True)
os.makedirs(output_dir_malay, exist_ok=True)


def load_chapters(book_name):
    """Load chapters JSON for a specific book."""
    chapter_file = os.path.join(chapters_dir, f"{book_name}.json")
    if os.path.exists(chapter_file):
        with open(chapter_file, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_json(data, directory, filename):
    """Helper function to save JSON files (UTF-8, non-ASCII characters kept as-is)."""
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    print(f"Saved: {path}")
    print(json.dumps(data, indent=4, ensure_ascii=False))  # Print the JSON content

def get_topic_from_chapters(chapters, chapter_label):
    """Fetch the topic from chapters based on the chapter label."""
    for chapter in chapters.get("chapters", []):
        if chapter.get("label", "") == chapter_label:
            return chapter.get("topic", {})
    return {}

def prepare_english_output(book_name, chapters, chapter, question, response, answer):
    topic = get_topic_from_chapters(chapters, f"C{chapter}")
    return {
        "book_no": book_name,
        "book_name": chapters.get("book_name", ""),
        "publisher": chapters.get("publisher", ""),
        "IBSN": chapters.get("IBSN", ""),
        "subject": chapters.get("subject", {}).get("english", ""),
        "topic": topic.get("english", ""),
        "text": response.get("text", {}).get("english", ""),
        "figures": [
            {
                "label": figure.get("label", {}).get("english", ""),
                "path": figure.get("path", "")
            }
            for figure in response.get("figures", [])
        ],
        "options": {k: v.get("english", "") for k, v in response.get("options", {}).items()},
        "answers": answer
    }

def prepare_malay_output(book_name, chapters, chapter, question, response, answer):
    topic = get_topic_from_chapters(chapters, f"C{chapter}")
    return {
        "book_no": book_name,
        "book_name": chapters.get("book_name", ""),
        "publisher": chapters.get("publisher", ""),
        "IBSN": chapters.get("IBSN", ""),
        "subject": chapters.get("subject", {}).get("malay", ""),
        "topic": topic.get("malay", ""),
        "text": response.get("text", {}).get("malay", ""),
        "figures": [
            {
                "label": figure.get("label", {}).get("malay", ""),
                "path": figure.get("path", "")
            }
            for figure in response.get("figures", [])
        ],
        "options": {k: v.get("malay", "") for k, v in response.get("options", {}).items()},
        "answers": answer
    }

import glob

def find_answer_file(book_name, chapter):
    """Find the correct answer file dynamically based on various naming patterns."""
    patterns = [
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}_ANS.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}_ans.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_C{chapter}.json"),
        os.path.join(answers_dir, f"ocr_results_{book_name}_*C{chapter}*.json"),
        os.path.join(answers_dir, f"{book_name}_C{chapter}_ans.json"),
        os.path.join(answers_dir, f"{book_name}_C{chapter}*.json"),
        os.path.join(answers_dir, f"*{chapter}*.json"),  # Broad wildcard for unmatched cases
    ]
    
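    def names_chapter(path):
        # Drop the book code (it contains digits) and require the chapter number to stand
        # alone, so that chapter 1 does not pick up a C10 or C11 file through the wildcards.
        name = os.path.basename(path).replace(book_name, "")
        return re.search(rf"(?<!\d){chapter}(?!\d)", name) is not None

    for pattern in patterns:
        # Sort the matches so the same file is chosen on every run
        files = sorted(f for f in glob.glob(pattern) if names_chapter(f))
        if files:
            return files[0]

    # If no matching file is found
    print(f"No matching answer file found for book: {book_name}, chapter: {chapter}")
    return None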



def main():
    print("Processing responses and answers...")
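    response_files = [f for f in os.listdir(responses_dir) if f.endswith("_response.json")]
    # The book code is the leading part of the first response file name
    book_match = re.match(r"([A-Z0-9]+)_", response_files[0]) if response_files else None
    if book_match is None:
        print(f"No usable response files in {responses_dir}; nothing to process.")
        return
    book_name = book_match.group(1)
    chapters = load_chapters(book_name)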

    for file in response_files:
        match = re.match(rf"{book_name}_C(\d+)_Q(\d+)_response.json", file)
        if match:
            chapter = match.group(1)
            question = match.group(2)
            response_path = os.path.join(responses_dir, file)

            # Dynamically find the answer file
            answer_path = find_answer_file(book_name, chapter)
            if not answer_path:
                print(f"No answer file for {file}")
                continue
            
            with open(response_path, "r", encoding="utf-8") as f:
                response = json.load(f)
            with open(answer_path, "r", encoding="utf-8") as f:
                answers = json.load(f)

            # Check for CombinedResults or directly search for the question key
            if "CombinedResults" in answers:
                answer = answers.get("CombinedResults", {}).get(question, "")
            else:
                # Fallback: Directly access the question as a key in the top-level dictionary
                answer = answers.get(question, "")

            # Skip if no answer is found
            if not answer:
                print(f"No answer found for question {question} in {answer_path}.")
                continue

            english_output = prepare_english_output(book_name, chapters, chapter, question, response, answer)
            malay_output = prepare_malay_output(book_name, chapters, chapter, question, response, answer)

            save_json(english_output, output_dir_english, f"{book_name}_C{chapter}_Q{question}_en.json")
            save_json(malay_output, output_dir_malay, f"{book_name}_C{chapter}_Q{question}_ms.json")

    print("Processing complete!")


    
if __name__ == "__main__":
    main()


In [None]:
import os
import shutil

# Define the root directory and new output directories
root_dir = "/kaggle/working"
output_dir = "/kaggle/working/combined_output"
english_output_dir = os.path.join(output_dir, "english")
malay_output_dir = os.path.join(output_dir, "malay")

# Create the output directories if they do not exist
os.makedirs(english_output_dir, exist_ok=True)
os.makedirs(malay_output_dir, exist_ok=True)

# Walk through the directory structure
for root, dirs, files in os.walk(root_dir):
    # Do not descend into the combined output folder itself
    dirs[:] = [d for d in dirs if os.path.join(root, d) != output_dir]

    # Pick the destination from the language named in the folder path
    if "english" in root:
        dest_dir = english_output_dir
    elif "malay" in root:
        dest_dir = malay_output_dir
    else:
        continue

    for file in files:
        src_file = os.path.join(root, file)
        dest_file = os.path.join(dest_dir, file)
        # Skip if the file already exists at the destination
        if not os.path.exists(dest_file):
            shutil.copy(src_file, dest_file)

print(f"Files combined successfully into {output_dir}.")


In [None]:
import os

# Define the root directory
root_dir = "/kaggle/working"

# Dictionary to store the counts
folder_file_counts = {}

# Iterate through each folder in the root directory
for folder in os.listdir(root_dir):
    folder_path = os.path.join(root_dir, folder)
    if os.path.isdir(folder_path):  # Ensure it's a directory
        folder_file_counts[folder] = {}
        for subfolder in ["english", "malay"]:
            subfolder_path = os.path.join(folder_path, subfolder)
            if os.path.isdir(subfolder_path):  # Check if subfolder exists
                # Count the number of files in the subfolder
                file_count = len([file for file in os.listdir(subfolder_path) if os.path.isfile(os.path.join(subfolder_path, file))])
                folder_file_counts[folder][subfolder] = file_count
            else:
                folder_file_counts[folder][subfolder] = 0  # No subfolder present

# Print the results
for folder, subfolder_counts in folder_file_counts.items():
    print(f"Folder: {folder}")
    for subfolder, count in subfolder_counts.items():
        print(f"  {subfolder}: {count} file(s)")


# Workflow for Exporting JSON to Table

This script processes JSON files containing question-and-answer data and exports them into a structured CSV table. Here's the simplified workflow:

---

### **Workflow**
1. **Input Directory**:
   - JSON files are read from the directory `final_dataset_dir` (the script uses `/kaggle/working/combined_output/malay`; a Kaggle input path such as `/kaggle/input/final-dataset-ms` also works).
   - Each JSON file represents a question, its options, answer, and related metadata.

2. **Process Each JSON File**:
   - For every file:
     - Load the JSON data.
     - Extract fields:
       - **IBSN**: Unique identifier for the book (its ISBN; the JSON key is spelled `IBSN`).
       - **Subject**: Subject of the question.
       - **Topics**: Related chapter or topic.
       - **Questions**: Main question text.
       - **Figures**: Image details, if any (e.g., figure label and file path).
       - **Options**: Answer choices (e.g., A, B, C, D).
       - **Answers**: Correct answer for the question.
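     - Format the data into a row, adding a **FileName** column derived from the file name (without the `_en`/`_ms` suffix and the `.json` extension).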

3. **Build a Table**:
   - Combine all rows into a structured **DataFrame** using Pandas.

4. **Export as CSV**:
   - Save the table to a CSV file (`final_dataset_table_ms.csv`) in the `output_table_dir` directory (e.g., `/kaggle/working/table`).

---

### **Customizable Parameters**
- **Input Directory** (`final_dataset_dir`): Change to point to the directory containing the JSON files.
- **Output Directory** (`output_table_dir`): Set the location to save the exported table.
- **Fields to Extract**:
  - You can add/remove fields to customize what is included in the table (e.g., add a new metadata field from the JSON).
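
As a rough sketch of how fields could be added or removed, the column-to-key mapping can be kept in one dictionary. The names `FIELD_MAP` and `json_to_row` and the sample record are hypothetical and not part of the script; the `publisher` key is assumed to be present because the consolidation step writes it into each JSON file.

```python
# Hypothetical mapping from CSV column name to JSON key.
# Add or delete entries here to change which columns are exported.
FIELD_MAP = {
    "IBSN": "IBSN",
    "Subject": "subject",
    "Topics": "topic",
    "Questions": "text",
    "Answers": "answers",
    "Publisher": "publisher",  # example of an extra metadata field
}


def json_to_row(data, field_map=FIELD_MAP):
    """Build one table row from a loaded question dictionary."""
    # Missing keys fall back to an empty string, as in the export script.
    return {column: data.get(key, "") for column, key in field_map.items()}


# Made-up record for illustration only.
sample = {"IBSN": "000", "subject": "Sains", "topic": "Bab 1",
          "text": "Contoh soalan?", "answers": "B", "publisher": "Contoh"}
row = json_to_row(sample)
print(row)
```

List-valued fields such as figures and options would still need their own formatting step before being placed in the row.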

---

### **Output**
- A CSV file (`final_dataset_table_ms.csv`) containing:
  - Columns: FileName, IBSN, Subject, Topics, Questions, Figures, Options, Answers.
  - Rows: One row per JSON file/question.
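
One thing to be aware of when reading the table back: the Options column holds a Python list, and a list is written to CSV as its text representation. The sketch below uses the standard `csv` module as a stand-in for the pandas export (assuming both stringify the list the same way) and shows one possible way to recover the options. The `parse_options` helper and the sample row are hypothetical.

```python
import ast
import csv
import io

# Write one row with the options as a Python list, as the export does.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["FileName", "Options"])
writer.writeheader()
writer.writerow({"FileName": "BOOK_C1_Q1", "Options": ["A: 1 m", "B: 2 m"]})

# Read it back: the Options cell is now plain text, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["Options"]).__name__)


def parse_options(cell):
    """Turn the stored text back into a dict such as {"A": "1 m"}."""
    items = ast.literal_eval(cell)  # safely parse the list literal
    return dict(item.split(": ", 1) for item in items)


options = parse_options(row["Options"])
print(options)
```

With pandas, the same parsing could be applied to a loaded table via `df["Options"].apply(ast.literal_eval)`.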

In [None]:
import os
import json
import pandas as pd

# Define the directory containing the final JSON dataset
final_dataset_dir = "/kaggle/working/combined_output/malay"

# Initialize an empty list to store rows for the table
rows = []

# Iterate over all JSON files in the dataset directory
for file_name in os.listdir(final_dataset_dir):
    if file_name.endswith(".json"):
        file_path = os.path.join(final_dataset_dir, file_name)
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        
        # Extract the required fields
        IBSN = data.get("IBSN", "")
        subject = data.get("subject", "")
        topic = data.get("topic", "")
        questions = data.get("text", "")
        figures = [f"{figure.get('label', '')}: {figure.get('path', '')}" for figure in data.get("figures", [])]
        # Format options as ["A: ...", "B: ...", ...], sorted by option key
        options = [f"{key}: {value}" for key, value in sorted(data.get("options", {}).items())]
        answers = data.get("answers", "")

        # Extract and clean the file name
        clean_file_name = file_name.replace("_en", "").replace("_ms", "").replace(".json", "")
        
        # Append a row to the table
        rows.append({
            "FileName": clean_file_name,
            "IBSN": IBSN,
            "Subject": subject,
            "Topics": topic,
            "Questions": questions,
            "Figures": ", ".join(figures),
            "Options": options,
            "Answers": answers
        })

# Convert the rows into a Pandas DataFrame
df = pd.DataFrame(rows)

# Define the output directory and create it if it doesn't exist
output_table_dir = "/kaggle/working/table"
os.makedirs(output_table_dir, exist_ok=True)

# Define the output CSV file path
output_csv_path = os.path.join(output_table_dir, "final_dataset_table_ms.csv")

# Export the DataFrame to a CSV file
df.to_csv(output_csv_path, index=False)

# Display the DataFrame for verification
print(df.head())

# Display success message
print(f"Table successfully created and saved to {output_csv_path}")


# Download the Output from Kaggle

In [None]:
import shutil

# Define the directory containing the output files
output_directory = "/kaggle/working/table"
output_zip_path = "Combined_dataset.zip"

# Create a ZIP file containing all files and folders within the output directory
shutil.make_archive(output_zip_path.replace(".zip", ""), 'zip', output_directory)

print(f"ZIP file created at: {output_zip_path}")
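
Before downloading, it can be worth checking what actually went into the archive. The following is a minimal sketch that builds a small archive in a temporary folder (a stand-in for `/kaggle/working/table`, with a made-up CSV) and lists its contents with the standard `zipfile` module.

```python
import os
import shutil
import tempfile
import zipfile

# Throwaway folder standing in for /kaggle/working/table.
work_dir = tempfile.mkdtemp()
table_dir = os.path.join(work_dir, "table")
os.makedirs(table_dir)
csv_path = os.path.join(table_dir, "final_dataset_table_ms.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("FileName,Answers\n")

# make_archive appends ".zip" to the base name and returns the full path.
zip_path = shutil.make_archive(os.path.join(work_dir, "Combined_dataset"), "zip", table_dir)

# List the archive members to confirm the table was included.
with zipfile.ZipFile(zip_path) as archive:
    names = archive.namelist()
print(names)
```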