## **Project Overview**  

This project focuses on developing an OCR system using **CRNN (Convolutional Recurrent Neural Network)** and **Weighted CRNN** for improved text recognition. We also implement **CTC (Connectionist Temporal Classification) loss** and **Beam Search Decoding** to enhance sequence prediction accuracy.  

### **Key Components:**  
- **CRNN & Weighted CRNN** â€“ Combining CNN for feature extraction and RNN for sequential text prediction.  
- **CTC Loss** â€“ Handling variable-length text sequences without pre-segmented labels.  
- **Beam Search Decoding** â€“ Improving text output by considering multiple possible sequences.  

This setup ensures robust recognition of ancient Spanish handwritten text.  


### **Data Preprocessing Overview**

In this project, we are preparing data for our OCR system by processing PDF documents and extracting labeled word images. The preprocessing steps include:

1. **PDF to Image Conversion** â€“ Converting multi-page PDFs into single-page images.
2. **Word Detection with CRAFT** â€“ Using the CRAFT model to detect words in images.
3. **Cropping Word Images** â€“ Extracting words based on detected coordinates.
4. **Manual Labeling** â€“ Assigning correct labels to the cropped word images.
5. **Data Augmentation** â€“ Applying transformations like rotation and Gaussian noise.
6. **CSV Generation & Cleaning** â€“ Creating a CSV file for image-label mapping and removing unlabeled entries.

This dataset will be used to train our OCR model for recognizing ancient Spanish text.


# ***Code Explaination***

### ðŸ“¥ Imports and Setup    


In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import tensorflow.data as tfd
import shutil

### **Google Drive Mounting (For Colab)**

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


### **Defining Constants**

In [None]:
BATCH_SIZE=16
AUTOTUNE=tfd.AUTOTUNE
IMG_WIDTH=200
IMG_HEIGHT=50

### **PDF to Image Conversion**

In [None]:
import fitz  # PyMuPDF
import os

text_dir = '/content/drive/My Drive/GSOC_Naresh_Meena/data/Texts'  # Directory containing PDFs
output_dir = '/content/drive/My Drive/GSOC_Naresh_Meena/data/images'  # Directory to save images

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

for file in os.listdir(text_dir):
    if not file.endswith(".pdf"):  # Skip non-PDF files
        continue

    pdf_path = os.path.join(text_dir, file)
    pdf = fitz.open(pdf_path)

    # Create a subfolder for this PDF's images
    pdf_output_dir = os.path.join(output_dir, os.path.splitext(file)[0])
    os.makedirs(pdf_output_dir, exist_ok=True)

    for page_num in range(len(pdf)):
        page = pdf[page_num]
        pix = page.get_pixmap()  # Convert page to image

        # Save image as PNG
        image_path = os.path.join(pdf_output_dir, f"page_{page_num + 1}.png")
        pix.save(image_path)

    print(f" PDF '{file}' converted to images in '{pdf_output_dir}'")


 PDF 'Paredes - Reglas generales.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales'
 PDF 'Ezcaray - Vozes.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Ezcaray - Vozes'
 PDF 'Constituciones sinodales Calahorra 1602.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Constituciones sinodales Calahorra 1602'
 PDF 'Mendo - Principe perfecto.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Mendo - Principe perfecto'
 PDF 'Buendia - Instruccion.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Buendia - Instruccion'
 PDF 'PORCONES.228.35 â€“ 1636.pdf' converted to images in '/content/drive/My Drive/GSOC_Naresh_Meena/data/images/PORCONES.228.35 â€“ 1636'


### **Image Splitting (Double-Page to Single Page)**

In [None]:
import cv2
import os

# Path where your double-page images are saved
input_dir = "/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Buendia - Instruccion"  # Change this to your folder path
output_dir = "/content/drive/My Drive/GSOC_Naresh_Meena/data/images/Buendia_split_pages"

# Ensure output directory exists
os.makedirs(output_dir, exist_ok=True)

# Process each image in the input directory
for filename in os.listdir(input_dir):
    img_path = os.path.join(input_dir, filename)
    img = cv2.imread(img_path)

    if img is None:
        print(f"Skipping {filename}, could not read file.")
        continue

    # Get image dimensions
    height, width, _ = img.shape
    mid = width // 2  # Middle of the image

    # Split into left and right images
    left_half = img[:, :mid]
    right_half = img[:, mid:]

    # Save the two halves
    base_name, ext = os.path.splitext(filename)
    cv2.imwrite(os.path.join(output_dir, f"{base_name}_left{ext}"), left_half)
    cv2.imwrite(os.path.join(output_dir, f"{base_name}_right{ext}"), right_half)

    print(f"Split {filename} into two pages.")

print(" Splitting complete!")


Split page_1.png into two pages.
Split page_2.png into two pages.
Split page_3.png into two pages.
Split page_4.png into two pages.
Split page_5.png into two pages.
Split page_6.png into two pages.
 Splitting complete!


### **Renaming and Moving Images**

In [None]:
import os
import shutil

# Root directory containing images
images_root = "/content/drive/My Drive/GSOC_Naresh_Meena/data/images"

# Destination folder (to store renamed images in one place)
output_folder = "/content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images"
os.makedirs(output_folder, exist_ok=True)  # Create if it doesn't exist

# Supported image extensions
image_extensions = (".jpg", ".jpeg", ".png", ".bmp", ".tiff")

# Traverse through all subdirectories
for root, _, files in os.walk(images_root):
    folder_name = os.path.basename(root)  # Get current folder name

    for file in files:
        if file.lower().endswith(image_extensions):  # Check if it's an image
            old_path = os.path.join(root, file)
            new_filename = f"{folder_name}_{file}"  # Rename format
            new_path = os.path.join(output_folder, new_filename)

            # Move and rename the file
            shutil.move(old_path, new_path)
            print(f"Renamed & Moved: {old_path} â†’ {new_path}")

print("Renaming & moving completed successfully!")


Renamed & Moved: /content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales/page_1.png â†’ /content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images/Paredes - Reglas generales_page_1.png
Renamed & Moved: /content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales/page_2.png â†’ /content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images/Paredes - Reglas generales_page_2.png
Renamed & Moved: /content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales/page_3.png â†’ /content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images/Paredes - Reglas generales_page_3.png
Renamed & Moved: /content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales/page_4.png â†’ /content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images/Paredes - Reglas generales_page_4.png
Renamed & Moved: /content/drive/My Drive/GSOC_Naresh_Meena/data/images/Paredes - Reglas generales/page_5.png â†’ /content/drive/My Drive/GSOC_Naresh

### **Running CRAFT for Text Detection**

We tried using PaddleOCR, but it was detecting lines instead of individual words. Therefore, we used the CRAFT model, which was better suited for extracting words from images.

In [None]:
!python3 "/content/drive/My Drive/GSOC_Naresh_Meena/CRAFT-pytorch/test.py" \
  --trained_model="/content/drive/My Drive/GSOC_Naresh_Meena/CRAFT-pytorch/weights/craft_mlt_25k.pth" \
  --test_folder="/content/drive/My Drive/GSOC_Naresh_Meena/data/images/renamed_images" \
  --cuda=False


Loading weights from checkpoint (/content/drive/My Drive/GSOC_Naresh_Meena/CRAFT-pytorch/weights/craft_mlt_25k.pth)
elapsed time : 4.76837158203125e-06s


### **Word Cropping from Images**

This script extracts bounding boxes, groups them into lines, and crops words from images.

**Steps**:
1. **Load images**.
2. **Group bounding boxes**.
3. **Crop and save words**.


In [None]:
import cv2
import numpy as np
import os

# Root directories
images_root = "/content/drive/My Drive/GSOC_Naresh_Meena/data/renamed_images"
results_root = "/content/drive/My Drive/GSOC_Naresh_Meena/result"
output_root = "/content/drive/My Drive/GSOC_Naresh_Meena/data/Words"

# Ensure output root exists
os.makedirs(output_root, exist_ok=True)

# Supported image formats
image_extensions = (".jpg", ".jpeg", ".png", ".bmp", ".tiff")

# Function to get min y and min x of a bounding box
def get_y_min_x_min(box):
    x_coords = [box[i] for i in range(0, 8, 2)]
    y_coords = [box[i] for i in range(1, 8, 2)]
    return min(y_coords), min(x_coords)

# **Global counter for total saved images**
total_saved_images = 0

# Iterate through all subfolders in images_root
for root, _, files in os.walk(images_root):
    for file in files:
        if file.lower().endswith(image_extensions):  # Check if it's an image
            image_path = os.path.join(root, file)
            txt_filename = f"res_{os.path.splitext(file)[0]}.txt"  # Expected text file name
            txt_path = os.path.join(results_root, txt_filename)

            # Check if corresponding text file exists
            if not os.path.exists(txt_path):
                print(f"Skipping {image_path}, no bounding box file found.")
                continue

            # Load image safely
            image = cv2.imread(image_path, cv2.IMREAD_COLOR)
            if image is None:
                print(f"Error: Could not load image at {image_path}")
                continue

            img_height, img_width = image.shape[:2]  # Get image dimensions

            # Read bounding boxes from text file
            bounding_boxes = []
            with open(txt_path, "r") as txt_file:  # Renamed `file` to `txt_file`
                for line in txt_file:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        coords = list(map(int, line.split(',')))
                        if len(coords) == 8:
                            bounding_boxes.append(coords)
                        else:
                            print(f"Skipping invalid line: {line}")
                    except ValueError:
                        print(f"Skipping non-numeric line: {line}")

            if not bounding_boxes:
                print(f"No bounding boxes found in {txt_path}, skipping.")
                continue

            # Step 1: Sort bounding boxes by `y_min` first
            bounding_boxes.sort(key=lambda box: get_y_min_x_min(box)[0])

            # Step 2: Group bounding boxes into lines based on y_min proximity
            line_threshold = 10  # Pixels within which words are considered in the same row
            lines = []  # List to store grouped bounding boxes
            current_line = []

            for box in bounding_boxes:
                y_min, _ = get_y_min_x_min(box)

                # If the line is empty or within threshold, add to current line
                if not current_line or abs(get_y_min_x_min(current_line[-1])[0] - y_min) <= line_threshold:
                    current_line.append(box)
                else:
                    # Sort current line by `x_min` before adding to `lines`
                    current_line.sort(key=lambda box: get_y_min_x_min(box)[1])
                    lines.append(current_line)
                    current_line = [box]  # Start a new line

            # Add last line if exists
            if current_line:
                current_line.sort(key=lambda box: get_y_min_x_min(box)[1])
                lines.append(current_line)

            # Prepare output directory structure
            image_name = os.path.splitext(file)[0]  # Image filename without extension
            output_folder = os.path.join(output_root, image_name, "images")
            os.makedirs(output_folder, exist_ok=True)

            # Step 3: Process and save words row-wise
            word_counter = 1
            for line in lines:
                for box in line:
                    x_coords = [box[i] for i in range(0, 8, 2)]
                    y_coords = [box[i] for i in range(1, 8, 2)]

                    x_min, x_max = min(x_coords), max(x_coords)
                    y_min, y_max = min(y_coords), max(y_coords)

                    # Ensure coordinates are within image bounds
                    x_min = max(0, min(x_min, img_width - 1))
                    x_max = max(0, min(x_max, img_width))
                    y_min = max(0, min(y_min, img_height - 1))
                    y_max = max(0, min(y_max, img_height))

                    # Crop and save word if valid
                    if x_max > x_min and y_max > y_min:
                        cropped_word = image[y_min:y_max, x_min:x_max]
                        word_filename = os.path.join(output_folder, f"word_{word_counter}.jpg")
                        cv2.imwrite(word_filename, cropped_word)
                        print(f"Saved {word_filename}")
                        word_counter += 1
                        total_saved_images += 1  # **Increment total count**
                    else:
                        print(f"Skipping invalid crop for {image_name} word {word_counter} with coordinates: {x_min}, {x_max}, {y_min}, {y_max}")

# **Print the total number of images saved**
print("\n-----------------------------")
print(f"Total cropped words saved: {total_saved_images}")
print("-----------------------------")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_326.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_327.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_328.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_329.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_330.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_331.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_332.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena/data/Words/PORCONES.228.35 â€“ 1636_page_8/images/word_333.jpg
Saved /content/drive/My Drive/GSOC_Naresh_Meena

data manually labeled for 5000 images

### **Image Preprocessing**  

Resizing and padding images while maintaining aspect ratio for OCR training.  


In [None]:
import tensorflow as tf
import pandas as pd
import cv2
import numpy as np

def resize_with_padding(image, target_size=(IMG_WIDTH, IMG_HEIGHT), pad_value=0):
    """
    Resizes an image while maintaining aspect ratio and pads it to fit the target size.

    Args:
        image: A NumPy array (H, W, C) or grayscale (H, W).
        target_size: Tuple of (width, height) for the final image.
        pad_value: Value to use for padding (default is 0 for black).

    Returns:
        Padded and resized image as a NumPy array.
    """
    target_w, target_h = target_size
    original_h, original_w = image.shape[:2]

    # Compute scale factor to maintain aspect ratio
    scale = min(target_w / original_w, target_h / original_h)

    # Compute new resized dimensions
    new_w = int(round(original_w * scale))
    new_h = int(round(original_h * scale))

    # Resize image
    image_resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    # Compute padding (to center the resized image)
    pad_top = (target_h - new_h) // 2
    pad_bottom = target_h - new_h - pad_top
    pad_left = (target_w - new_w) // 2
    pad_right = target_w - new_w - pad_left

    # Pad the image
    image_padded = cv2.copyMakeBorder(image_resized, pad_top, pad_bottom, pad_left, pad_right,
                                      borderType=cv2.BORDER_CONSTANT, value=pad_value)

    return image_padded

def process_images_from_df(df, img_column, target_size=(200, 50)):
    """
    Reads, resizes, pads, and saves images from a DataFrame.

    Args:
        df: Pandas DataFrame containing file paths.
        img_column: Column name in df containing file paths.
        target_size: Tuple (width, height) for resizing.

    Returns:
        None (images are saved at the same paths).
    """
    for _, row in df.iterrows():
        img_path = row[img_column]

        # Read image
        image = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)  # Read as grayscale

        if image is None:
            print(f"Warning: Could not read {img_path}")
            continue

        # Process image
        image_padded = resize_with_padding(image, target_size)

        # Save processed image
        cv2.imwrite(img_path, image_padded)

        print(f"Processed and saved: {img_path}")
create_csv_with_augmentation("/content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working","cleaned_working.csv")
# Example usage
df=pd.read_csv('cleaned_working.csv') # Replace with actual paths
process_images_from_df(df, 'path')


Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/es (2).jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/fer.jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/deben.jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/mucho.jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/lo titulo.jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/pequena.jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/como (2).jpg
Processed and saved: /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/split_pages_page_1_right/images/entre.jpg
Processed and saved: /co

### **Data Augmentation: Rotation**  

Applying slight rotations (-5Â° to +5Â°) to images for improving OCR model robustness.  


In [None]:
import os
import cv2
import numpy as np

def rotation_aug(root_dir):
    """
    Applies rotation augmentation to images stored in 'images/' folders under subdirectories.

    Args:
        root_dir (str): Path to the root directory containing subdirectories with images.
    """
    for subdir in os.listdir(root_dir):
        subdir_path = os.path.join(root_dir, subdir)
        images_path = os.path.join(subdir_path, 'images')

        if not os.path.isdir(images_path):
            continue  # Skip if the 'images' folder does not exist

        for filename in os.listdir(images_path):
            if filename.endswith(('.png', '.jpg', '.jpeg')):  # Process only image files
                img_path = os.path.join(images_path, filename)

                # Read the image
                img = cv2.imread(img_path)

                if img is None:
                    print(f"Error loading image: {filename}")
                    continue

                # Get image dimensions
                h, w = img.shape[:2]
                center = (w // 2, h // 2)

                # Loop from -5 to +5 degrees (excluding 0)
                for angle in range(-5, 6):
                    if angle == 0:
                        continue  # Skip 0-degree (original image)

                    # Compute rotation matrix
                    rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)

                    # Rotate the image
                    rotated_img = cv2.warpAffine(img, rotation_matrix, (w, h), borderMode=cv2.BORDER_REPLICATE)

                    # Construct the output filename
                    new_filename = f"{os.path.splitext(filename)[0]}_rot{angle}{os.path.splitext(filename)[1]}"

                    # Save the rotated image in the same 'images/' folder
                    cv2.imwrite(os.path.join(images_path, new_filename), rotated_img)

                    print(f"Saved: {new_filename} in {images_path}")

# Example Usage
root_directory = "/content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working" # Change this to your actual directory path
rotation_aug(root_directory)
print("Rotation augmentation completed!")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Saved: )_rot-4.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot-3.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot-2.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot-1.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot1.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot2.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot3.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: )_rot4.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe pe

### **Data Augmentation: Gaussian Noise**  

Adding Gaussian noise to images to enhance OCR model robustness against real-world distortions.  


In [None]:
import os
import cv2
import numpy as np

def add_gaussian_noise(image, mean=0, std=25):
    """
    Adds Gaussian noise to an image.

    Args:
        image (numpy.ndarray): Input image.
        mean (int): Mean of Gaussian noise.
        std (int): Standard deviation of Gaussian noise.

    Returns:
        numpy.ndarray: Noisy image.
    """
    noise = np.random.normal(mean, std, image.shape).astype(np.uint8)
    noisy_image = cv2.add(image, noise)  # Adding noise while ensuring pixel values remain valid
    return noisy_image

def gaussian_noise_aug(root_dir):
    """
    Applies Gaussian noise augmentation to images stored in 'images/' folders under subdirectories.

    Args:
        root_dir (str): Path to the root directory containing subdirectories with images.
    """
    for subdir in os.listdir(root_dir):
        subdir_path = os.path.join(root_dir, subdir)
        images_path = os.path.join(subdir_path, 'images')

        if not os.path.isdir(images_path):
            continue  # Skip if the 'images' folder does not exist

        for filename in os.listdir(images_path):
            if filename.endswith(('.png', '.jpg', '.jpeg')):  # Process only image files
                img_path = os.path.join(images_path, filename)

                # Read the image
                img = cv2.imread(img_path)

                if img is None:
                    print(f"Error loading image: {filename}")
                    continue

                # Apply Gaussian noise
                noisy_img = add_gaussian_noise(img)

                # Construct the output filename
                new_filename = f"{os.path.splitext(filename)[0]}_noise{os.path.splitext(filename)[1]}"

                # Save the noisy image in the same 'images/' folder
                cv2.imwrite(os.path.join(images_path, new_filename), noisy_img)

                print(f"Saved: {new_filename} in {images_path}")

# Example Usage
root_directory ="/content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working"  # Change this to your actual directory path
gaussian_noise_aug(root_directory)
print("Gaussian noise augmentation completed!")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Saved: GVZMAN_rot1_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: GVZMAN_rot2_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: GVZMAN_rot3_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: GVZMAN_rot4_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: GVZMAN_rot5_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: ensenanzas._rot-5_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: ensenanzas._rot-4_noise.jpg in /content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working/Mendo - Principe perfecto_page_1/images
Saved: ensena

### **Extra Image Augmentation**

This script was initially used to apply contrast and brightness adjustments to 50% of images listed in a CSV file, generating additional augmented images. However, since we have already achieved 100k images, this augmentation is no longer in use.

**Key steps (if used)**:
1. **Select 50% of images** from a CSV file.
2. **Apply contrast adjustment** to selected images.
3. **Apply brightness adjustment** to selected images.


In [None]:
import os
import cv2
import numpy as np
import pandas as pd
import random

def load_image_list(csv_path):
    """Loads image filenames from a CSV file."""
    df = pd.read_csv(csv_path)
    image_list = df['path'].tolist()
    return set(random.sample(image_list, len(image_list) // 2))  # Select 50% of images

def adjust_contrast(img, alpha=1.5, beta=0):
    """Adjusts the contrast of the image."""
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)

def adjust_brightness(img, value=30):
    """Adjusts the brightness of the image."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    v = np.clip(v + value, 0, 255)
    final_hsv = cv2.merge((h, s, v))
    return cv2.cvtColor(final_hsv, cv2.COLOR_HSV2BGR)

def augment_images(root_dir, csv_path):
    """Applies contrast and brightness augmentation to 50% of images listed in a CSV file."""
    selected_images = load_image_list(csv_path)

    for subdir in os.listdir(root_dir):
        subdir_path = os.path.join(root_dir, subdir)
        images_path = os.path.join(subdir_path, 'images')

        if not os.path.isdir(images_path):
            continue  # Skip if the 'images' folder does not exist

        for filename in os.listdir(images_path):
            if filename not in selected_images or not filename.endswith(('.png', '.jpg', '.jpeg')):
                continue  # Skip if not in the selected 50%

            img_path = os.path.join(images_path, filename)
            img = cv2.imread(img_path)
            if img is None:
                print(f"Error loading image: {filename}")
                continue

            # Contrast adjustment
            contrast_img = adjust_contrast(img)
            contrast_filename = f"{os.path.splitext(filename)[0]}_contrast{os.path.splitext(filename)[1]}"
            cv2.imwrite(os.path.join(images_path, contrast_filename), contrast_img)
            print(f"Saved: {contrast_filename} in {images_path}")

            # Brightness adjustment
            bright_img = adjust_brightness(img)
            bright_filename = f"{os.path.splitext(filename)[0]}_bright{os.path.splitext(filename)[1]}"
            cv2.imwrite(os.path.join(images_path, bright_filename), bright_img)
            print(f"Saved: {bright_filename} in {images_path}")

# Example usage
root_directory = data  # Change this to your actual directory path
csv_file_path = "/content/drive/My Drive/OCR/cleaned_working.csv"  # Path to CSV file containing image names
augment_images(root_directory, csv_file_path)
print("Image augmentation completed!")


### **Dataset Labeling & Cleanup & CSV Generation**  

Generating a clean CSV file with image paths and labels by removing augmentation-related suffixes for accurate OCR training.  


In [None]:
import os
import csv
import re

def clean_label(filename):
    """Cleans the filename by removing augmentation suffixes and numbering patterns."""
    label = os.path.splitext(filename)[0]  # Remove file extension
    label = re.sub(r'_rot-?\d+', '', label)  # Remove rotation suffix (_rotX or _rot-X)
    label = re.sub(r'_noise', '', label)  # Remove Gaussian noise suffix
    label = re.sub(r'\s*\(\d+\)$', '', label)  # Remove numbering patterns (1), (2), etc.
    return label

def create_csv_with_augmentation(root_dir, output_file):
    """
    Creates a CSV file containing image paths and corresponding labels,
    ensuring clean labels without augmentation suffixes.

    Args:
        root_dir (str): Path to the root directory containing subdirectories with images.
        output_file (str): Path to the output CSV file.
    """
    with open(output_file, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['path', 'label'])

        for subdir in os.listdir(root_dir):
            subdir_path = os.path.join(root_dir, subdir)
            images_path = os.path.join(subdir_path, 'images')

            if os.path.isdir(images_path):
                for image in os.listdir(images_path):
                    image_path = os.path.join(images_path, image)
                    if os.path.isfile(image_path):
                        label = clean_label(image)  # Clean label from filename
                        writer.writerow([image_path, label])

# Example usage
root_directory ="/content/drive/My Drive/GSOC_Naresh_Meena/Cleaned_Working"  # Change this to your actual dataset directory
output_csv = "cleaned_working.csv"
create_csv_with_augmentation(root_directory, output_csv)
print(f"CSV file '{output_csv}' created successfully!")


CSV file 'cleaned_working.csv' created successfully!


###  **The dataset has been cleaned to exclude rows with numeric labels. After filtering, we have 100k images in the dataset.**

In [3]:
df=pd.read_csv("/content/drive/My Drive/GSOC_Naresh_Meena/CSV/cleaned_working.csv")
df = df[~df["label"].astype(str).str.contains(r"\d")]
df.to_csv("/content/drive/My Drive/GSOC_Naresh_Meena/CSV/cleaned_working.csv",index=False)
df.shape

(103092, 2)

In [None]:
df=pd.read_csv("/content/drive/My Drive/GSOC_Naresh_Meena/CSV/cleaned_working.csv")
df.head()
# df.shape

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,path,label
0,0,0,/content/drive/My Drive/GSOC_Naresh_Meena/Clea...,secondenado
1,1,1,/content/drive/My Drive/GSOC_Naresh_Meena/Clea...,muerte
2,2,2,/content/drive/My Drive/GSOC_Naresh_Meena/Clea...,la
3,3,3,/content/drive/My Drive/GSOC_Naresh_Meena/Clea...,"suplicado,"
4,4,4,/content/drive/My Drive/GSOC_Naresh_Meena/Clea...,y
