# Fine-tuning PaddleOCR's text detection models

We'll cover the following steps to fine-tune a PaddleOCR text detection model and run inference with the fine-tuned model.

1. Clone the PaddleOCR GitHub repository
2. Install the GPU version of the PaddlePaddle framework
3. Install the package PaddleOCR
4. Get the data
5. Get the pretrained model
6. Train and evaluate the model
7. Make inference on new images


First, we need to clone the GitHub repository of PaddleOCR.

Next, we have to install the GPU version of PaddlePaddle, the core deep learning framework needed to run PaddleOCR. You may want to consult the [PaddleOCR Quick Start](https://github.com/PaddlePaddle/PaddleOCR/blob/58e876d38d92b722f527946954c12231cd7ef7c6/doc/doc_en/quickstart_en.md).

In [1]:
!python -m pip install -q paddlepaddle-gpu==2.5.0.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
!pip install -q "paddleocr>=2.0.1" # version 2.0.1 or later is recommended
!pip install -q imutils
!pip install -q gdown

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.10.0 requires tensorflow==2.17.0, but you have tensorflow 2.17.1 which is incompatible.

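Let's verify that the GPU build of PaddlePaddle was installed correctly. `is_compiled_with_cuda()` reports whether the installed build has CUDA support, and `paddle.utils.run_check()` then runs a small test program on the available GPUs: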

In [2]:
import paddle
gpu_available  = paddle.device.is_compiled_with_cuda()
print("GPU available:", gpu_available)

GPU available: True


In [3]:
paddle.utils.run_check()

Running verify PaddlePaddle program ... 
PaddlePaddle works well on 1 GPU.
PaddlePaddle works well on 2 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.


## The pretrained model
PaddleOCR’s detection model supports 3 backbones (please refer to the [detection models documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/detection_en.md)):
1. MobileNetV3
2. ResNet18_vd
3. ResNet50_vd

The backbone is the model's feature extractor: its layers are pretrained to recognize hierarchical patterns in images, and the detection neck and head are built on top of it.

PaddleOCR includes several text detection algorithms, for example [DBNet](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/algorithm_det_db_en.md), [SAST](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/algorithm_det_sast_en.md), and [EAST](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/algorithm_det_east_en.md).

We'll use PP-OCRv3, regarded as the best PaddleOCR model for its precision and generalization capabilities, as mentioned in the [fine-tuning guide](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/finetune_en.md) in the docs.

* Download the .tar file from this [download link](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_distill_train.tar).
* Extract the files from the .tar archive, where you'll find `student.pdparams`, which will serve as our pretrained model.
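
The extraction step can also be scripted. The following is a minimal sketch using only Python's standard library; the helper name `extract_and_find` and the destination directory are illustrative assumptions, not part of PaddleOCR.

```python
import tarfile
from pathlib import Path


def extract_and_find(tar_path, dest_dir, filename="student.pdparams"):
    """Extract a .tar archive and return the path of `filename`, or None."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest)
    # The archive may unpack into a subdirectory, so search recursively.
    matches = sorted(dest.rglob(filename))
    return matches[0] if matches else None
```

For example, once the archive has been downloaded, `extract_and_find("ch_PP-OCRv3_det_distill_train.tar", "./pretrain_models")` would return the location of `student.pdparams`, or `None` if the file is not in the archive.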

The next step is to get [the matching configuration file](https://github.com/PaddlePaddle/PaddleOCR/blob/main/configs/det/ch_PP-OCRv3/ch_PP-OCRv3_det_student.yml) for the student model.

## The configuration file

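The cell further below writes the configuration we'll use to `/kaggle/working/config.yml`. Note that, as written, it configures the DB algorithm with a MobileNetV3 backbone and points `pretrained_model` at the MobileNetV3 backbone weights, so it differs from the PP-OCRv3 student configuration linked above.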

In [4]:
!git clone https://github.com/PaddlePaddle/PaddleOCR.git

Cloning into 'PaddleOCR'...
remote: Enumerating objects: 138306, done.
remote: Counting objects: 100% (9655/9655), done.
remote: Compressing objects: 100% (1104/1104), done.
remote: Total 138306 (delta 9202), reused 8713 (delta 8551), pack-reused 128651 (from 4)
Receiving objects: 100% (138306/138306), 819.61 MiB | 37.00 MiB/s, done.
Resolving deltas: 100% (107872/107872), done.


In [5]:
TXT = '''
Global:
  use_gpu: true
  use_xpu: false
  use_mlu: false
  epoch_num: 50
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/db_mv3/
  save_epoch_step: 50
  # evaluation is run every 50 iterations
  eval_batch_step: [0, 50]
  cal_metric_during_train: False
  pretrained_model: ./pretrain_models/MobileNetV3_large_x0_5_pretrained
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./output/det_db/predicts_db.txt

Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /kaggle/input/scalex5-plate
    label_file_list:
      - /kaggle/input/scalex5-plate/chi/Label.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [640, 640]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 16
    num_workers: 8
    use_shared_memory: True

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /kaggle/input/scalex5-plate
    label_file_list:
      - /kaggle/input/scalex5-plate/chi/Label.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          image_shape: [736, 1280]
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 8
    use_shared_memory: True

'''
PATH_TXT = "/kaggle/working/config.yml"

with open(PATH_TXT, 'w', encoding = "utf-8") as f:
    f.write(TXT)

In [6]:
cd /kaggle/working/

/kaggle/working


In [None]:
!cat /kaggle/working/config.yml

Among other things, `config.yml` contains:

* Hyperparameters. The most important are `pretrained_model`, the batch size (`batch_size_per_card`), and `learning_rate`. Suggested starting points (the configuration above sets 16 and 0.001 instead):
    * For a single GPU:
        * batch_size = 8
        * learning_rate = 1e-4
    * For a single GPU with memory limitations:
        * batch_size = 4
        * learning_rate = 5e-5
* The paths of our images and annotations.
* The algorithm and backbone. 
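
The two batch size and learning rate pairs listed above are consistent with scaling the learning rate linearly with the batch size. The sketch below encodes that rule and builds matching `-o` overrides for the keys used in `config.yml`. The linear rule, the helper names, and the assumption that `-o` accepts these dotted keys in the same way as `Global.pretrained_model` are illustrative, not taken from the PaddleOCR code.

```python
def scaled_learning_rate(batch_size, base_batch_size=8, base_lr=1e-4):
    """Scale the learning rate linearly with the batch size (assumed rule)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size


def override_args(batch_size):
    """Build `-o` style overrides for the keys present in config.yml."""
    lr = scaled_learning_rate(batch_size)
    return [
        f"Train.loader.batch_size_per_card={batch_size}",
        f"Optimizer.lr.learning_rate={lr:g}",
    ]


print(override_args(4))
```

With a batch size of 4 this reproduces the second pair above (learning rate 5e-5); the resulting strings could be appended after `-o` on the training command line.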

In [None]:
!pip install pyyaml

In [None]:
import yaml

# Load the YAML file
with open('/kaggle/working/config.yml', 'r') as file:
    data = yaml.safe_load(file)

In [None]:
print(f'pretrained model: {data["Global"]["pretrained_model"]}')
print(f'number of epochs: {data["Global"]["epoch_num"]}')
print(f'evaluate every: {data["Global"]["eval_batch_step"][1]} iterations')
print(f'learning rate: {data["Optimizer"]["lr"]["learning_rate"]}')
print(f'training batch size: {data["Train"]["loader"]["batch_size_per_card"]}')
print(f'training images path: {data["Train"]["dataset"]["data_dir"]}')
print(f'training annotations file path: {data["Train"]["dataset"]["label_file_list"]}')
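
The annotations file referenced by `label_file_list` is assumed here to follow the usual PaddleOCR detection layout: one image per line, with the image path and a JSON list of boxes separated by a tab. The sketch below parses one such line; the sample line and the helper name `parse_label_line` are made up for illustration and are not taken from the dataset.

```python
import json


def parse_label_line(line):
    """Split one label line into (image_path, list of box dicts)."""
    image_path, annotation = line.rstrip("\n").split("\t", 1)
    boxes = json.loads(annotation)
    for box in boxes:
        # Each box is expected to carry a polygon and its text.
        if "points" not in box or "transcription" not in box:
            raise ValueError(f"Malformed box in {image_path}: {box}")
    return image_path, boxes


# Hypothetical sample line, not taken from the real dataset.
sample = (
    "chi/example.jpg\t"
    '[{"transcription": "ABC123", '
    '"points": [[10, 20], [110, 20], [110, 60], [10, 60]]}]'
)
path, boxes = parse_label_line(sample)
print(path, len(boxes), boxes[0]["points"][0])
```

Running a check like this over every line of `Label.txt` before training is a cheap way to catch malformed annotations early.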

## Training and evaluation

We'll use Kaggle's `GPU T4 x 2` accelerator, which provides two GPUs. Note that the training command below starts a single process; the multi-GPU form shown in the [documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/main/doc/doc_en/detection_en.md) wraps the same script as `python -m paddle.distributed.launch --gpus '0,1' tools/train.py ...`.

The model is evaluated every `50` iterations, as specified by `eval_batch_step: [0, 50]` in the `Global` section of `config.yml` (`data["Global"]["eval_batch_step"]`).
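
To get a feel for how often evaluation runs, the arithmetic can be sketched as follows. The dataset size of 1000 images is a made-up value; `batch_size_per_card: 16`, `epoch_num: 50`, and `eval_batch_step: [0, 50]` come from `config.yml`. The function names and the exact triggering condition are assumptions about how the `[start, interval]` pair is interpreted.

```python
import math


def iterations_per_epoch(num_images, batch_size_per_card, num_gpus=1):
    """Number of training iterations in one epoch (with drop_last: False)."""
    return math.ceil(num_images / (batch_size_per_card * num_gpus))


def evaluation_count(total_iterations, eval_batch_step):
    """Count evaluations for eval_batch_step = [start, interval].

    Assumes an evaluation runs at every step s with s > start and
    (s - start) divisible by interval.
    """
    start, interval = eval_batch_step
    if total_iterations <= start:
        return 0
    return (total_iterations - start) // interval


# 1000 is a hypothetical dataset size; the other values come from config.yml.
per_epoch = iterations_per_epoch(num_images=1000, batch_size_per_card=16)
total = per_epoch * 50  # epoch_num: 50
print(per_epoch, total, evaluation_count(total, [0, 50]))
```

Under these assumptions a 1000-image dataset gives 63 iterations per epoch, 3150 iterations in total, and 63 evaluation runs.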

### Metrics
For the evaluation, the detection model returns hmean as the main metric, and also computes precision and recall, as you can confirm [here](https://github.com/PaddlePaddle/PaddleOCR/blob/58e876d38d92b722f527946954c12231cd7ef7c6/ppocr/metrics/det_metric.py).
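
Here hmean is the harmonic mean of precision and recall, the same quantity as the F1 score. The sketch below shows only that arithmetic; matching predicted boxes to ground-truth boxes is assumed to have been done already, and the function name and the counts are illustrative rather than taken from `det_metric.py`.

```python
def detection_metrics(num_matched, num_detected, num_ground_truth):
    """Compute precision, recall and their harmonic mean from box counts."""
    precision = num_matched / num_detected if num_detected else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    hmean = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "hmean": hmean}


# Hypothetical counts: 8 correct boxes out of 10 predictions and 16 labels.
print(detection_metrics(num_matched=8, num_detected=10, num_ground_truth=16))
```

Because the harmonic mean is pulled toward the smaller of the two values, a detector only scores a high hmean when precision and recall are both high.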


In [None]:
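# Note: `!cd` runs in a subshell and does not change the notebook's working directory,
# so the files below are saved under /kaggle/working/pretrain_models/.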
# Download the pre-trained model of MobileNetV3
!wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/pretrained/MobileNetV3_large_x0_5_pretrained.pdparams
# or, download the pre-trained model of ResNet18_vd
!wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/pretrained/ResNet18_vd_pretrained.pdparams
# or, download the pre-trained model of ResNet50_vd
!wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/pretrained/ResNet50_vd_ssld_pretrained.pdparams


In [7]:
cd ./PaddleOCR

/kaggle/working/PaddleOCR


In [None]:
!python tools/infer_det.py -c /kaggle/working/config.yml -o Global.infer_img="/kaggle/input/scalex5-plate/chi/crop_plate_957.jpg" Global.pretrained_model="/kaggle/working/pretrain_models/MobileNetV3_large_x0_5_pretrained.pdparams"

In [None]:
!cat /kaggle/working/PaddleOCR/output/det_db/predicts_db.txt

In [None]:
!python tools/train.py -c /kaggle/working/config.yml -o Global.pretrained_model="/kaggle/working/pretrain_models/MobileNetV3_large_x0_5_pretrained"

In [12]:
!python tools/export_model.py -c /kaggle/working/config.yml -o Global.pretrained_model="/kaggle/input/det-best-model/best_model/best_model.pdparams" Global.save_inference_dir="./inference/det_db_inference/"

W0403 08:04:31.192308   197 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.6, Runtime API Version: 11.8
W0403 08:04:31.193373   197 gpu_resources.cc:149] device: 0, cuDNN Version: 8.9.
[2025/04/03 08:04:31] ppocr INFO: load pretrain successful from /kaggle/input/det-best-model/best_model/best_model
[2025/04/03 08:04:31] ppocr INFO: Export inference config file to ./inference/det_db_inference/inference.yml
Skipping import of the encryption module
I0403 08:04:33.830382   197 interpretercore.cc:237] New Executor is Running.
[2025/04/03 08:04:34] ppocr INFO: inference model is saved to ./inference/det_db_inference/inference


In [None]:
!python3 tools/infer_det.py -c /kaggle/working/config.yml \
  -o Global.pretrained_model="./output/db_mv3/best_accuracy" \
     Global.infer_img="/kaggle/input/scalex5-plate/chi/crop_plate_1036.jpg"


In [None]:
!python tools/infer/predict_det.py --det_algorithm="DB" --det_model_dir="./inference/det_db_inference/" --image_dir="/kaggle/input/scalex5-plate/chi/crop_plate_1036.jpg" --use_gpu=True

In [None]:
!cat /kaggle/working/PaddleOCR/inference_results/det_results.txt

In [None]:
import json
import os

import cv2
import matplotlib.pyplot as plt
import numpy as np

# Read the test image (the visualization saved by infer_det.py)
img_path = "/kaggle/working/PaddleOCR/output/det_db/det_results/crop_plate_1036.jpg"
image = cv2.imread(img_path)
if image is None:
    raise FileNotFoundError(f"Could not read image: {img_path}")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Inference results file (if it exists)
result_json_path = "/kaggle/working/PaddleOCR/output/det_db/predicts_db.txt"

# Guard against a missing or malformed results file
try:
    with open(result_json_path, "r") as f:
        results = f.readlines()

    # Find the entry for the image under test. The results file records the
    # input image path, so match on the file name rather than the full path.
    img_name = os.path.basename(img_path)
    detected_boxes = []
    for line in results:
        parts = line.strip().split("\t")
        if len(parts) > 1 and os.path.basename(parts[0]) == img_name:
            detected_boxes = json.loads(parts[1])  # parse the JSON list of boxes
            break

    # Draw the bounding boxes, if any were detected
    for box in detected_boxes:
        points = np.array(box["points"], np.int32).reshape((-1, 1, 2))
        cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)  # red (image is RGB)

    # Show the image
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    plt.axis("off")
    plt.show()

except Exception as e:
    print("Error while reading the results file:", e)
