# Practice of Minor Language Text Recognition R&D Based on ERNIE 4.5 and PaddleOCR

## 1. Background Introduction

PaddleOCR is widely used for its text recognition accuracy and its end-to-end development workflow. Developing a recognition model for a new scenario, however, usually requires a large amount of annotated text line data, and annotation is expensive. Manual annotation is time-consuming and labor-intensive, and it is prone to subjective bias, which leads to inconsistent label quality. Data acquisition and annotation become even harder in practical applications with diverse scenarios and complex semantics. This tutorial addresses the issue by using ERNIE 4.5 to annotate text lines automatically, which in turn helps improve the performance of text recognition models in real-world scenarios.

The automatic annotation process based on ERNIE 4.5 works as follows. First, images containing text are collected. The PP-OCRv5 detection model of PaddleOCR detects and locates the text lines in these images, and each text line is cropped into an individual text line image. ERNIE 4.5 then predicts the text of each cropped image twice, independently. Only images for which both predictions agree are kept, and the agreed recognition result is taken as the ground truth label. This filter reduces the impact of hallucinations that can occur with large models, which improves the quality of the automatically annotated data and provides more reliable training data for the text recognition model.

<div align="center">
<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-OCRv5/cookbook/ocr_rec_data_labeled.png" width="800"/>
</div>
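
The agreement filter described above can be sketched as a small pure function. This is a minimal illustration, not the tutorial's actual labeling loop: the name `accept_label` is hypothetical, the `###` marker is the one the prompt in Section 4.1 asks the model to emit for text-free images, and discarding empty outputs is an extra assumption added here.

```python
from typing import Optional

NO_TEXT_MARKER = "###"  # marker the prompt asks the model to emit when no text is visible


def accept_label(first: str, second: str, no_text_marker: str = NO_TEXT_MARKER) -> Optional[str]:
    """Return the agreed label, or None if the image should be discarded."""
    first, second = first.strip(), second.strip()
    if not first or first == no_text_marker:
        return None  # empty output (assumption) or explicit "no text" marker
    if first != second:
        return None  # the two predictions disagree, so the label is not trusted
    return first


print(accept_label("Привет, мир", "Привет, мир "))  # predictions agree after stripping
print(accept_label("Привет", "Привет!"))            # predictions disagree
print(accept_label("###", "###"))                   # model reported no text
```

Exact string equality is a deliberately strict criterion: it discards some correct labels, but a label that survives has been produced identically by two separate requests.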

This tutorial uses a **Russian character recognition** dataset as an example to demonstrate how to achieve automatic data annotation based on ERNIE 4.5. The original Russian data used in this tutorial was collected from the internet. Users can use this [dataset](https://paddle-model-ecology.bj.bcebos.com/paddlex/data/russian_dataset_demo.tar) for batch automatic annotation operations, quickly completing the high-quality annotation and training process.

## 2. Environment Setup

This project depends on PaddlePaddle, PaddleOCR, the OpenAI SDK, and common Python utility packages. Please ensure all required dependencies are installed before use. For detailed installation instructions, refer to the [Environment Setup Documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/installation.md).

In [None]:
%pip install openai matplotlib

## 3. Text Line Detection and Cropping

Text detection is the first step in the OCR process. In this tutorial, the detection model PP-OCRv5_server_det is used to automatically locate each line of text in an image. Once located, the corresponding regions are cropped into individual text line images, which facilitates subsequent label prediction using ERNIE 4.5 and the training of text recognition models. This approach helps improve overall recognition accuracy and efficiency.

### 3.1 Detecting and Cropping Text Lines

In [None]:
# Obtain the Russian sample dataset
!wget https://paddle-model-ecology.bj.bcebos.com/paddlex/data/russian_dataset_demo.tar
!tar -xf russian_dataset_demo.tar

In [None]:
import base64
import copy
import glob
import os
import time

import cv2
import numpy as np
from openai import OpenAI
from tqdm import tqdm


def get_rotate_crop_image(img: np.ndarray, points: np.ndarray) -> np.ndarray:
    """
    Crop and rotate the image region to obtain a small text line image after perspective transformation.
    """
    assert len(points) == 4, "shape of points must be 4*2"
    img_crop_width = int(
        max(
            np.linalg.norm(points[0] - points[1]),
            np.linalg.norm(points[2] - points[3]),
        )
    )
    img_crop_height = int(
        max(
            np.linalg.norm(points[0] - points[3]),
            np.linalg.norm(points[1] - points[2]),
        )
    )
    pts_std = np.float32(
        [
            [0, 0],
            [img_crop_width, 0],
            [img_crop_width, img_crop_height],
            [0, img_crop_height],
        ]
    )
    M = cv2.getPerspectiveTransform(points, pts_std)
    dst_img = cv2.warpPerspective(
        img,
        M,
        (img_crop_width, img_crop_height),
        borderMode=cv2.BORDER_REPLICATE,
        flags=cv2.INTER_CUBIC,
    )
    dst_img_height, dst_img_width = dst_img.shape[0:2]
    if dst_img_height * 1.0 / dst_img_width >= 1.5:
        dst_img = np.rot90(dst_img)
    return dst_img


def get_minarea_rect_crop(img: np.ndarray, points: np.ndarray) -> np.ndarray:
    """
    Crop the minimum-area rectangular region from the detected set of points.
    """
    bounding_box = cv2.minAreaRect(np.array(points).astype(np.int32))
    points = sorted(cv2.boxPoints(bounding_box), key=lambda x: x[0])
    index_a, index_b, index_c, index_d = 0, 1, 2, 3
    if points[1][1] > points[0][1]:
        index_a = 0
        index_d = 1
    else:
        index_a = 1
        index_d = 0
    if points[3][1] > points[2][1]:
        index_b = 2
        index_c = 3
    else:
        index_b = 3
        index_c = 2

    box = [points[index_a], points[index_b], points[index_c], points[index_d]]
    crop_img = get_rotate_crop_image(img, np.array(box))
    return crop_img


def crop_and_save(image_path, output_dir, ocr):
    """
    Detect and crop all text lines in the image, and save them to output_dir.
    """
    img_name = os.path.splitext(os.path.basename(image_path))[0]
    try:
        result = ocr.predict(image_path)
        for res in result:
            # Each result holds the detected polygons and the decoded input image
            for cnt, quad_box in enumerate(res['dt_polys']):
                img_crop = get_minarea_rect_crop(res['input_img'], copy.deepcopy(quad_box))
                cv2.imwrite(os.path.join(output_dir, f"{img_name}_crop{cnt:04d}.jpg"), img_crop)
    except Exception as e:
        print(f"Failed to process {image_path}: {e}")


# Usage example (assuming all your images are in the russian_dataset_demo/ directory)
input_dir = 'russian_dataset_demo'
output_dir = 'crops'  # The cropped images will be saved to this directory.
os.makedirs(output_dir, exist_ok=True)

image_paths = glob.glob(os.path.join(input_dir, '*.jpg')) + glob.glob(os.path.join(input_dir, '*.png'))

# Batch processing
from paddleocr import TextDetection

ocr = TextDetection(
    model_name="PP-OCRv5_server_det",
    device='gpu',
)
for path in tqdm(image_paths):
    crop_and_save(path, output_dir, ocr)
print(f"Cropping completed, saved to the {output_dir} directory")

### 3.2 Visualization of Cropping Results

After cropping, it is recommended to randomly sample some of the small images to verify the detection and cropping quality.

In [None]:
import random

import matplotlib.pyplot as plt

crop_imgs = glob.glob(os.path.join(output_dir, '*.jpg'))

if len(crop_imgs) >= 5:
    show_imgs = random.sample(crop_imgs, 5)
else:
    show_imgs = crop_imgs  # Display all if there are fewer than 5 images

for crop_path in show_imgs:
    img = cv2.imread(crop_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(os.path.basename(crop_path))
    plt.axis('off')
    plt.show()

## 4. Using ERNIE 4.5 to Predict on Cropped Images

By directly utilizing ERNIE 4.5, the efficiency and accuracy of automatic annotation for unstructured text images can be greatly improved:

- No need for manual verification of each image; the large model directly outputs the text content.
- Multi-round consistency checks effectively reduce the risks of hallucinations or misreading by the model.
- Supports complex scenarios, such as handwritten, printed, cursive, and blurry text samples.

### 4.1 Deploying ERNIE 4.5 and Setting Key Parameters

In this example, the ERNIE large model is invoked through service requests, so it needs to be deployed as a local service. The deployment can be accomplished using the FastDeploy tool, which is an inference deployment tool for large models open-sourced by PaddlePaddle. For deployment methods, please refer to the [FastDeploy official documentation](https://github.com/PaddlePaddle/FastDeploy).

After deploying the model as a backend service with FastDeploy, fill in the service URL in the configuration below and run the test script. If the output includes “Test successful!”, the service is available; otherwise, the service is unavailable and you should troubleshoot based on the error message.

In [None]:
base_url = ""  # Please fill in the URL of the local service, e.g., http://0.0.0.0:8000/v1
model_name = "xxx"  # Select the model to invoke
prompt = "Identify the text content in the image and output it as plain text. If there is no text in the image, output ###. Do not explain, do not add line breaks, do not translate, and do not output extra content."  # Can be modified according to the actual situation
api_key = "api_key"  # No modification is needed for local deployment

try:
    import openai

    client = openai.OpenAI(base_url=base_url, api_key=api_key)
    question = "Who are you?"
    response1 = client.chat.completions.create(model=model_name, messages=[{"role": "user", "content": question}])
    reply = response1.choices[0].message.content
except Exception as e:
    print(f"Test failed! The error message is:\n{e}")
else:
    # Report success only when the request completed without raising
    print(f"Test successful!\nQuestion: {question}\nAnswer: {reply}")

### 4.2 Main Process of Automatic Label Generation Based on ERNIE 4.5

This step labels, in batch, the text line images cropped by the detection model: it scans all images in a specified folder, calls ERNIE 4.5 to recognize the text in each image, and appends the results to an output file. The code supports resuming after an interruption: images whose labels have already been written to the output file are skipped, so a restarted run continues with the remaining images. To ensure label accuracy, each image is inferred twice, the second time with an additional restriction appended to the prompt, and a label is accepted only if the two results are identical. Requests that raise an exception are retried automatically with exponential backoff, up to `max_retries` times.

In [None]:
import base64
import glob
import os
import time

from openai import OpenAI
from tqdm import tqdm


def encode_image(image_path):
    """Convert image to base64 string, compatible with multimodal API."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def initialize_client(api_key, base_url):
    """Initialize OpenAI compatible API client."""
    return OpenAI(api_key=api_key, base_url=base_url)


def read_processed_images(output_file):
    """Read processed image paths to avoid duplicate labeling."""
    processed = set()
    if os.path.exists(output_file):
        with open(output_file, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    image_path = line.split('\t')[0]
                    processed.add(image_path)
    return processed


def append_result(output_file, line):
    """Append label result to output file."""
    with open(output_file, "a", encoding="utf-8") as f:
        f.write(f"{line}\n")


# Suppose all your images are in the './crops' folder
image_folder = './crops'  # Change to your image folder path
image_list_file = 'image_list.txt'  # Store all image paths

image_paths = glob.glob(os.path.join(image_folder, "*.jpg"))

with open(image_list_file, "w", encoding="utf-8") as f:
    f.writelines(f"{img_path}\n" for img_path in image_paths)

print(f"Collected {len(image_paths)} image paths and wrote to {image_list_file}")

output_file = "label_output.txt"

max_retries = 3  # Maximum number of retries after failure

LIMIT_PROMPT_SUFFIX = "Please strictly output only the text content in the image, do not output any explanation or other content. Do not write in formula encoding format either."  # Restriction, used for the second prompt

with open(image_list_file, "r", encoding="utf-8") as f:
    all_images = [line.strip() for line in f if line.strip()]
client = initialize_client(api_key, base_url)
processed_images = read_processed_images(output_file)
remaining_images = [img for img in all_images if img not in processed_images]

if not remaining_images:
    print("All images have been processed.")
else:
    with tqdm(total=len(remaining_images), desc="Batch Image Labeling", unit="image") as pbar:
        for idx, image_path in enumerate(remaining_images, 1):
            retries = 0
            while retries < max_retries:
                try:
                    base64_image = encode_image(image_path)
                    # First inference
                    response1 = client.chat.completions.create(
                        model=model_name,
                        messages=[
                            {
                                "role": "user",
                                "content": [
                                    {"type": "text", "text": prompt},
                                    {
                                        "type": "image_url",
                                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                                    },
                                ],
                            }
                        ],
                        stream=False,
                    )
                    rec_text1 = response1.choices[0].message.content.strip()

                    # Second inference
                    response2 = client.chat.completions.create(
                        model=model_name,
                        messages=[
                            {
                                "role": "user",
                                "content": [
                                    {"type": "text", "text": prompt + " " + LIMIT_PROMPT_SUFFIX},
                                    {
                                        "type": "image_url",
                                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                                    },
                                ],
                            }
                        ],
                        stream=False,
                    )
                    rec_text2 = response2.choices[0].message.content.strip()

                    # Compare two results
                    if rec_text1 == rec_text2 and rec_text1 != "###":
                        result_line = f"{image_path}\t{rec_text1}"
                        append_result(output_file, result_line)
                        print(
                            f"Successfully processed image: {image_path} ({idx}/{len(remaining_images)}), both results are consistent."
                        )
                        break  # Success, break retry loop
                    else:
                        print(
                            f"Image {image_path} two results are inconsistent or there is no text in the image, discarded. Result 1: {rec_text1}, Result 2: {rec_text2}"
                        )
                        break  # No more retries, just skip
                except Exception as e:
                    retries += 1
                    print(f"Error processing image {image_path} (attempt {retries}/{max_retries}): {e}")
                    if retries < max_retries:
                        sleep_time = 2**retries
                        print(f"Retrying after waiting {sleep_time} seconds.")
                        time.sleep(sleep_time)
                    else:
                        print(f"Image {image_path} failed after reaching maximum retries.")
            pbar.set_postfix({"Current image": os.path.basename(image_path)})
            pbar.update(1)

print("All processing completed. Results saved to", output_file)
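
Note that the loop above writes only accepted labels to `label_output.txt`, so images that were discarded (inconsistent results or no text) are sent to the model again on every restart. One possible way to skip them as well is to log them to a separate file. The sketch below assumes a hypothetical `rejected_images.txt` and hypothetical helper names `load_seen` and `record_rejected`; none of them are part of the original code.

```python
import os


def load_seen(*files: str) -> set:
    """Collect image paths (first tab-separated field) from every file that exists."""
    seen = set()
    for name in files:
        if not os.path.exists(name):
            continue
        with open(name, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    seen.add(line.rstrip("\n").split("\t")[0])
    return seen


def record_rejected(rejected_file: str, image_path: str, reason: str) -> None:
    """Append a discarded image and the reason for discarding it to the rejection log."""
    with open(rejected_file, "a", encoding="utf-8") as f:
        f.write(f"{image_path}\t{reason}\n")
```

With these helpers, the list of remaining images could be built as `[p for p in all_images if p not in load_seen("label_output.txt", "rejected_images.txt")]`, and `record_rejected` would be called in the branch that currently only prints the discard message. Keeping rejected images out of `label_output.txt` matters because that file is later passed directly to training.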

### 4.3 Viewing the Final Labeling Results

In [None]:
with open(output_file, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 9:  # Only display the first 10 lines, the rest are omitted
            print('......')
            break

Each line of the output has the format `image path\trecognized text`, that is, the image path and the recognized text separated by a tab. The file can be used directly for OCR training or for manual verification. The results after running are shown as follows:

<div align="center">
<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-OCRv5/cookbook/labeled_show.png" width="600"/>
</div>
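
Before training, it can be worth checking that every line of the label file really follows the `path<TAB>text` format, for example because a model output containing a raw line break or tab would corrupt it. The following is a minimal sketch under that assumption; `validate_label_file` is a hypothetical helper, not part of PaddleOCR.

```python
import os
from typing import List, Tuple


def validate_label_file(
    label_file: str, check_exists: bool = True
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split a label file into valid (path, text) pairs and a list of problem messages."""
    valid, problems = [], []
    with open(label_file, "r", encoding="utf-8") as f:
        for lineno, raw in enumerate(f, 1):
            line = raw.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            parts = line.split("\t")
            if len(parts) != 2 or not parts[1].strip():
                problems.append(f"line {lineno}: expected 'path<TAB>text'")
            elif check_exists and not os.path.isfile(parts[0]):
                problems.append(f"line {lineno}: missing image {parts[0]}")
            else:
                valid.append((parts[0], parts[1]))
    return valid, problems
```

Calling `validate_label_file("label_output.txt")` returns the usable pairs together with a list of messages that can be reviewed manually; an empty problem list suggests the file is structurally sound, although it says nothing about whether the recognized text is correct.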

## 5. Training a Russian Text Recognition Model Based on Labeled Data

By cropping text line images with the text detection model and labeling them automatically with ERNIE 4.5, we obtain labeled data that can be used to train the Russian text recognition model.

### 5.1 Clone the PaddleOCR Repository and Initiate Training

In [None]:
# Clone the PaddleOCR repository
!git clone https://github.com/PaddlePaddle/PaddleOCR.git

# Install dependencies required for training
%pip install -r PaddleOCR/requirements.txt

In [None]:
# Start Training
# When passing in the dataset path, please ensure that the evaluation dataset has been prepared in advance. For demonstration purposes, the evaluation set and the training set are set to be the same dataset here.

!python PaddleOCR/tools/train.py -c PaddleOCR/configs/rec/PP-OCRv5/multi_language/eslav_PP-OCRv5_mobile_rec.yml -o Train.dataset.data_dir=./  Train.dataset.label_file_list=./label_output.txt Eval.dataset.data_dir=./  Eval.dataset.label_file_list=./label_output.txt Global.epoch_num=20 Global.character_dict_path=PaddleOCR/ppocr/utils/dict/ppocrv5_eslav_dict.txt Global.pretrained_model=https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_mobile_rec_pretrained.pdparams Global.eval_batch_step=100 Train.loader.batch_size_per_card=32  Eval.loader.batch_size_per_card=32 Train.sampler.first_bs=32

The model weights are saved in the directory `./output/eslav_rec_ppocr_v5`. For more information on how to initiate training, please refer to the [documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version2.x/ppocr/model_train/recognition.md).
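
Because the demonstration command evaluates on the training data itself, the reported accuracy will be optimistic. One possible way to hold out part of the labels for evaluation is sketched below. The helper names and the output file names `train_list.txt` and `val_list.txt` are hypothetical, and the 10% ratio is only an example.

```python
import random
from typing import List, Tuple


def split_labels(lines: List[str], val_ratio: float = 0.1, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle label lines deterministically and hold out a fraction for evaluation."""
    lines = [ln for ln in lines if ln.strip()]  # copy, dropping blank lines
    random.Random(seed).shuffle(lines)
    # Keep at least one evaluation sample whenever there is more than one line
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    return lines[n_val:], lines[:n_val]


def write_split(label_file: str, train_file: str, val_file: str, val_ratio: float = 0.1) -> None:
    """Read a label file and write the train/eval split to two new files."""
    with open(label_file, "r", encoding="utf-8") as f:
        train, val = split_labels(f.read().splitlines(), val_ratio)
    for name, rows in ((train_file, train), (val_file, val)):
        with open(name, "w", encoding="utf-8") as f:
            f.writelines(f"{row}\n" for row in rows)
```

After running `write_split("label_output.txt", "train_list.txt", "val_list.txt")`, the two files could be passed to the training command through `Train.dataset.label_file_list` and `Eval.dataset.label_file_list` respectively.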

### 5.2 Model Export

The models saved during training are checkpoints: they contain only the model parameters and are mainly used for tasks such as resuming training. The inference model (saved using `paddle.jit.save`) is intended for prediction and deployment. Unlike a checkpoint, it also stores the model's structural information, which makes it better suited to deployment and accelerated inference and easier to integrate into real-world systems.

The method for converting a recognition model to an inference model is as follows:

In [None]:
# The Global.pretrained_model parameter sets the path of the trained model to be converted.
# The Global.save_inference_dir parameter sets the directory where the converted model will be saved.

!python3 PaddleOCR/tools/export_model.py -c PaddleOCR/configs/rec/PP-OCRv5/multi_language/eslav_PP-OCRv5_mobile_rec.yml -o Global.pretrained_model=./output/eslav_rec_ppocr_v5/best_accuracy Global.save_inference_dir=./inference/eslav_PP-OCRv5_mobile_rec_infer/ Global.character_dict_path=PaddleOCR/ppocr/utils/dict/ppocrv5_eslav_dict.txt

After successful conversion, there are three files under the directory:

```
inference/eslav_PP-OCRv5_mobile_rec_infer/
    ├── inference.pdiparams         # Parameter file for the recognition inference model
    ├── inference.json              # Program file for the recognition inference model
    └── inference.yaml              # Configuration file for the recognition inference model
```
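
A quick way to confirm that the export produced the three files listed above is sketched below. The function name `missing_export_files` is hypothetical; the file names are the ones shown in the listing.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = ("inference.pdiparams", "inference.json", "inference.yaml")


def missing_export_files(model_dir: str) -> List[str]:
    """Return the expected inference files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, `missing_export_files("./inference/eslav_PP-OCRv5_mobile_rec_infer/")` returns an empty list when all three files are present.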

### 5.3 Model Prediction

Use the exported static graph model to predict images of Russian text lines. You can download the [test image](https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/demo_images/labeled_test.jpg) and use the following code for prediction:

In [None]:
!paddleocr text_recognition -i https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/demo_images/labeled_test.jpg --model_name eslav_PP-OCRv5_mobile_rec --model_dir ./inference/eslav_PP-OCRv5_mobile_rec_infer/

The prediction results are saved in the `./output` directory, and the visualization results are shown in the figure: 

<div align="center">
<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-OCRv5/cookbook/labeled_test_res.jpg" width="400"/>
</div>

The exported static graph model can also be integrated into PP-OCRv5. It should be noted that the text detection model does not need to be trained separately for minor languages, as it already possesses strong text feature detection capabilities. You can directly use the PP-OCRv5_server_det model for text line detection and refer to the following code for prediction.



In [None]:
!paddleocr ocr -i https://paddle-model-ecology.bj.bcebos.com/paddlex/PaddleX3.0/demo_images/ru_pipeline_test.jpg --text_detection_model_name PP-OCRv5_server_det --text_recognition_model_name eslav_PP-OCRv5_mobile_rec --text_recognition_model_dir inference/eslav_PP-OCRv5_mobile_rec_infer/


The prediction results are saved in the `./output` directory, and the visualization results are shown in the figure:

<div align="center">
<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/paddleocr/PP-OCRv5/cookbook/ru_pipeline_result.jpg" width="600"/>
</div>

## 6. Summary

Using Russian text recognition as an example, this tutorial demonstrates the complete R&D workflow for low-resource language text recognition with ERNIE 4.5 and PaddleOCR. The workflow consists of text line detection and cropping, automatic label generation for the cropped images with ERNIE 4.5, and training and inference of the text recognition model. By following these steps, you can quickly label data automatically for a specific scenario and train a text recognition model tailored to your needs, which significantly reduces manual labeling cost and improves development efficiency.

It should be noted that the number of sample images used in this tutorial is relatively small, so the accuracy of the resulting model may be limited. The main purpose of the tutorial is to help you become familiar with the complete workflow of image labeling and model training based on ERNIE 4.5. In practical applications, if you wish to achieve higher model accuracy and stronger generalization ability, it is recommended to use more and richer image data for labeling and training. This will allow you to fully leverage the advantages of large models and achieve better recognition results in real-world scenarios.

## 7. Frequently Asked Questions and Optimization Suggestions

- What should I do if the image detection or cropping results are unsatisfactory?
  - Check if the image resolution is too low. It is recommended to enlarge the image appropriately and try again.
  - Try fine-tuning the parameters of the PP-OCRv5 model, such as `box_thresh`, `unclip_ratio`, etc. For specific parameter adjustment methods, please refer to the [documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/module_usage/text_detection.md).
- What if an error occurs or the recognition results are abnormal when calling ERNIE?
  - Check whether the service URL is filled in correctly and whether the port is open.
  - Try modifying or optimizing the prompt content. Clearly specify "output text only" to prevent the model from outputting explanatory content.
  - Check whether the image’s base64 encoding method and format are consistent with the API requirements.
- Other suggestions and precautions
  - Resuming after interruption: The data labeling process can be resumed. After an interruption, simply rerun the labeling cell; images whose labels have already been saved to the output file are skipped, which avoids repeated labeling.
  - Multi-process acceleration: When there are a large number of images, you can use multi-processing to improve labeling efficiency.
  - Consistent label format with training configuration: Make sure the output label format (such as separating image path and text content with a `\t`) is consistent with the PaddleOCR training script requirements.
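
The acceleration suggestion above can be sketched as follows. Note the substitution: this sketch uses a thread pool rather than separate processes, on the assumption that labeling time is dominated by waiting for HTTP responses from the model service. The function `label_concurrently` and its `label_one` callback are hypothetical; `label_one` stands for a function that performs the two inferences for one image and returns the accepted label or `None`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def label_concurrently(
    image_paths: Iterable[str],
    label_one: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run label_one over image_paths in worker threads and keep only accepted labels."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(label_one, path): path for path in image_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                label = future.result()
            except Exception as exc:  # one failed request should not stop the batch
                print(f"{path} failed: {exc}")
                continue
            if label is not None:
                results[path] = label  # results are collected in the main thread only
    return results
```

Collecting results in the main thread means that writing to the label file does not need locking. Whether concurrency actually helps depends on how many parallel requests the deployed service can handle, so `max_workers` should be tuned against the service rather than the client machine.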