# Objectives

1. Applying Mindee's docTR Optical Character Recognition (OCR) to extract the text of multiple PDFs.

2. Translating those texts using Meta AI's M2M-100.

3. Building translated PDF/A documents (searchable PDFs).

# Code

## From .pdf to RGB arrays

Listing all PDF files in the data folder

In [10]:
import os

# paths where data is stored
PDF_PATH = '../DATA/'
PDFA_PATH = '../DATA/PDFA/'
WORK_PATH = '../WORK/'

# list of PDF file names, without the '.pdf' extension
files = [os.path.splitext(f)[0] for f in os.listdir(PDF_PATH) if f.endswith('.pdf')]
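The same listing can also be written with `pathlib`. The sketch below is only an illustration: `list_pdf_stems` is a hypothetical helper that is not part of the notebook, and unlike the cell above it sorts the names and skips directories. The temporary folder and its file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def list_pdf_stems(folder):
    """Return the sorted names (without extension) of the .pdf files directly inside `folder`."""
    return sorted(p.stem for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix == '.pdf')

# small self-contained demonstration in a temporary folder (made-up file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ('report.v2.pdf', 'scan.pdf', 'notes.txt'):
        (Path(tmp) / name).touch()
    stems = list_pdf_stems(tmp)
    print(stems)
```

Only the last extension is dropped, so a name such as `report.v2.pdf` keeps its inner dot.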

Adding the toolkit packages to the Python path

In [2]:
import sys

sys.path.append('../packages/')

Converting to RGB numpy arrays:
* Scaling (zoom)
* Gray scaling
* Deskewing

In [18]:
import pickle

from tqdm.notebook import tqdm
from ocr_toolkit import pdf_to_array

# scaling parameter to be applied to original PDF files
zooming = 3 

# iteration over each PDF file
for file in tqdm(files):

    # getting array of RGB values from pdf file (rotated for straight pages)
    docs = pdf_to_array(PDF_PATH, file+'.pdf', zooming=zooming)

    # saving the arrays so that this step can be skipped later
    with open(f'{WORK_PATH+file}_array.pkl', 'wb') as f:
        pickle.dump(docs, f)
    # with open(f'{WORK_PATH+file}_array.pkl', 'rb') as f:
    #     docs = pickle.load(f)

func:'pdf_to_array' took: 21.8331 sec
func:'pdf_to_array' took: 7.2664 sec
func:'pdf_to_array' took: 17.1023 sec
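The internals of `pdf_to_array` are not shown in this notebook, so the following is only a rough sketch of what the scaling and gray-scaling steps mean for an RGB array. `zoom_nearest` and `to_gray_rgb` are hypothetical helpers. Deskewing is left out, and a real PDF renderer would rasterise the page at a higher resolution rather than repeat pixels.

```python
import numpy as np

def zoom_nearest(image, factor):
    """Upscale an H x W x C image by an integer factor (nearest neighbour)."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def to_gray_rgb(image):
    """Convert RGB to luminance, replicated on 3 channels so the array stays RGB-shaped."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights
    gray = np.rint(image[..., :3] @ weights).astype(np.uint8)
    return np.stack([gray] * 3, axis=-1)

# tiny made-up 2 x 2 "page": one red pixel, one white pixel, the rest black
page = np.zeros((2, 2, 3), dtype=np.uint8)
page[0, 0] = (255, 0, 0)
page[1, 1] = (255, 255, 255)

processed = to_gray_rgb(zoom_nearest(page, 3))
print(processed.shape)
```

With a factor of 3, matching the `zooming = 3` setting used above, each side of the page grows threefold, so the number of pixels grows ninefold. That is one plausible reason why the conversion takes several seconds per file.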


## Applying OCR

* Optical Character Recognition with Mindee's docTR:

https://mindee.github.io/doctr/

In [None]:
import pickle

from doctr.models import ocr_predictor
from ocr_toolkit import array_to_ocr_xml

# docTR pretrained models for rotated text 
# model = ocr_predictor(
#     det_arch='linknet_resnet18_rotation', reco_arch='crnn_vgg16_bn', pretrained=True, 
#     assume_straight_pages=False, export_as_straight_boxes=True)

# docTR pretrained models for straight text
model = ocr_predictor(
    det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True, 
    assume_straight_pages=True, export_as_straight_boxes=True)

# ocr of the pdf file
xml_outputs = array_to_ocr_xml(docs, model)

# saving the OCR output so that this step can be skipped later
with open(f'{WORK_PATH+file}_xml_outputs.pkl', 'wb') as f:
    pickle.dump(xml_outputs, f)
# with open(f'{WORK_PATH+file}_xml_outputs.pkl', 'rb') as f:
#     xml_outputs = pickle.load(f)

 The versions of TensorFlow you are currently using is 2.6.5 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
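The exact structure returned by `array_to_ocr_xml` is not shown here. Assuming it carries hOCR markup, which is the format docTR's XML export produces, the recognised words and their bounding boxes can be read back with the standard library. In the sketch below, `SAMPLE_HOCR` is a hand-written, hypothetical fragment and `hocr_words` is an illustrative helper, not part of `ocr_toolkit`.

```python
import xml.etree.ElementTree as ET

# hand-written hOCR fragment, used only for illustration
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" id="page_1" title="bbox 0 0 600 800">
      <span class="ocr_line" id="line_1" title="bbox 50 40 300 70">
        <span class="ocrx_word" id="word_1" title="bbox 50 40 120 70; x_wconf 98">Hello</span>
        <span class="ocrx_word" id="word_2" title="bbox 130 40 300 70; x_wconf 95">world</span>
      </span>
    </div>
  </body>
</html>"""

def hocr_words(xml_text):
    """Return (text, (x0, y0, x1, y1)) for every element of class 'ocrx_word'."""
    words = []
    for element in ET.fromstring(xml_text).iter():
        if element.get('class') != 'ocrx_word':
            continue
        # the title attribute looks like 'bbox x0 y0 x1 y1; x_wconf NN'
        bbox_field = element.get('title', '').split(';')[0]
        coords = tuple(int(v) for v in bbox_field.split()[1:5])
        words.append(((element.text or '').strip(), coords))
    return words

words = hocr_words(SAMPLE_HOCR)
print(' '.join(text for text, _ in words))
```

Filtering on the `class` attribute instead of the tag name keeps the helper independent of whether the tags carry the XHTML namespace.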



Converting to a PDF/A file

In [194]:
from ocr_toolkit import xml_to_pdfa

# building text and adding it to the original PDF file (image)
# the file-name suffix is assumed to be '.pdf', mirroring the pdf_to_array call
pdfa_dict = xml_to_pdfa(PDFA_PATH, file+'.pdf', docs, xml_outputs)
with open(f'{WORK_PATH+file}_dict.pkl', 'wb') as f:
    pickle.dump(pdfa_dict, f)
# with open(f'{WORK_PATH+file}_dict.pkl', 'rb') as f:
#     pdfa_dict = pickle.load(f)

func:'xml_to_pdfa' took: 8.2860 sec
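Each stage above pickles its result and keeps a commented-out `pickle.load` line so that a slow step can be skipped on a later run. A minimal sketch of that pattern as a single helper could look as follows. `cached` and `slow_step` are hypothetical names introduced for illustration, not part of the notebook or of `ocr_toolkit`.

```python
import os
import pickle
import tempfile

def cached(path, compute):
    """Load a pickled result from `path`, or call `compute()` and pickle its result there."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

# stand-in for an expensive stage; it records how many times it really runs
calls = []
def slow_step():
    calls.append(1)
    return {'pages': 3}

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'demo_array.pkl')
    first = cached(target, slow_step)   # computed and written to disk
    second = cached(target, slow_step)  # read back from disk, slow_step is not called again
```

Under the same assumption, the first stage could be written as `docs = cached(f'{WORK_PATH+file}_array.pkl', lambda: pdf_to_array(PDF_PATH, file+'.pdf', zooming=zooming))`. A stale pickle has to be deleted by hand whenever the inputs or parameters change.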
