# PDF text translator
This notebook implements a tool that reads the text of any PDF file and translates it into a specified language.

## Problem statement and solution strategy
I have often received PDF documents written in a foreign language and needed a simple tool, running on my own machine, to get a sense of their textual content. The problem can be broken down into two sub-problems:
1. First, the plain text has to be parsed from the PDF file. The file may have been generated by a dedicated suite (like the popular PDF editors), but it may also come from a scanned document. In the latter case the PDF does not contain strings and is practically an image. To handle both types of input, the text is parsed via optical character recognition (OCR). Several OCR libraries are freely available. For this project I used the Tesseract OCR engine (open source, long sponsored by Google), which can be downloaded and installed as a standalone executable on the PC. So the first step to run this notebook is to get the Tesseract program suite and save it in a dedicated folder (e.g. somewhere on the "C:\\" drive).
The Python code below uses Tesseract to parse the text (in the original language of the document) from the input PDF file.


2. Text translation is done via an open-source library called "deep_translator", which offers a choice of translation back ends. I decided to use "GoogleTranslator", a wrapper around the Google Translate service. The list of strings (practically: the sentences) in the original language is passed to the translator and translated into the target language chosen by the user. There is no need to specify the original language, since GoogleTranslator detects it automatically. That is very useful, especially if the document contains sentences written in different languages.

## Libraries
All special libraries used in this notebook have been listed in the requirements file, so the user can simply uncomment the cell below and run the installation command only once. 

In [None]:
#!pip install -r requirements.txt

In [1]:
import os
import re
import pytesseract
import fitz  # if not working properly, install using: pip install PyMuPDF
from PIL import Image
from deep_translator import GoogleTranslator
from googletrans import LANGUAGES
from pathlib import Path

import ipywidgets as widgets

## Utility functions
The utility functions below do the "back-office work". The file finder function returns the full path of a file or directory, given a base directory in which it is supposed to be located. I wrote it because, after a while, I could no longer remember the full path of the Tesseract executable, which must be provided to the PDF translator class.
The "translate_text" function sends text strings to Google Translator and returns the translated text.

In [2]:
def find_file(fname, base_dir):
    """
    Find the full path of a file given a base directory (e.g. 'C://') and the file name (e.g. my_file.txt)

    Parameters
    ----------
    fname : str
        File name, including the extension if it has one.
    base_dir : str
        Base directory to search in, commonly a drive root such as 'C://' or 'D://'.
        
    Returns
    -------
    results_paths : list of strings
        All files found in the base dir as document path string, if any.

    """
    file_name = "**/" + fname
    
    results = list(Path(base_dir).glob(file_name))

    results_paths = [str(x) for x in results]
    
    return results_paths
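
To see how the recursive search behaves, the same `"**/" + fname` glob pattern can be run on a throwaway directory. This is only a sketch using the standard library: the helper name `find_file_demo`, the directory layout and the file name `demo.txt` are invented for the illustration.

```python
import tempfile
from pathlib import Path


def find_file_demo(fname, base_dir):
    # Same pattern as find_file: "**/" makes glob descend into all sub-directories.
    return [str(p) for p in Path(base_dir).glob("**/" + fname)]


with tempfile.TemporaryDirectory() as tmp:
    # Build a small nested tree: <tmp>/a/b/demo.txt
    nested = Path(tmp) / "a" / "b"
    nested.mkdir(parents=True)
    (nested / "demo.txt").write_text("hello", encoding="utf-8")

    found = find_file_demo("demo.txt", tmp)
    missing = find_file_demo("not_there.txt", tmp)
    found_names = [Path(p).name for p in found]
```

An empty list means nothing matched, so indexing the result with `[0]` raises an `IndexError` in that case. Searching a whole drive this way can take a long time; passing a narrower base directory is usually faster.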

In [3]:
def translate_text(text, output_language, input_language = "auto", output_type = "list"):
    """
    Translate the given text (string) to another language, using Google translate API.
    
    Parameters
    ----------
    text : str or list of str
        Input plain text, either as a single string or as a list of strings (paragraphs).
    output_language : str
        Two-letter code identifying the output language (e.g. 'it' for Italian).
    input_language : str
        Two-letter code identifying the input language. Default is 'auto': the program automatically detects the language of 
        the input document. Otherwise, user can input e.g. 'en' for English, 'de' for German and so forth.
    output_type : str
        Choose the type of the output. Default is 'list' to output the translated text as a list of strings. 
        The other choice is 'str', to return the result as a string of text.
        
    Returns
    -------
    output_text : list or str
        Translated text.
    """
    if output_language not in LANGUAGES and output_language not in LANGUAGES.values():
        raise ValueError("Unable to support translation to: {}. Check the argument spelling.\nAvailable languages are: {}".format(output_language, list(LANGUAGES.keys())))
    
    output_text = list()
    if isinstance(text, list):
        # enumerate gives the right counter even when two paragraphs are identical
        for index, item in enumerate(text, start = 1):
            print("Translating paragraph: {} of {}".format(index, len(text)))
            for line in item.splitlines():
                try:
                    translation = GoogleTranslator(source = input_language, target = output_language).translate(line)
                except Exception:
                    # keep the original line if the translation request fails
                    translation = str(line)
                output_text.append(translation)
            
    elif isinstance(text, str):
        for line in text.splitlines():
            # only lines containing Latin letters are sent to the translator
            if re.search("[A-Za-z]+", line):
                try:
                    translation = GoogleTranslator(source = input_language, target = output_language).translate(line)
                except Exception:
                    # keep the original line if the translation request fails
                    translation = str(line)
                output_text.append(translation)
            else:
                output_text.append(" ")          
    
    if output_type == "str":
        output_text = "\n".join(output_text)
        
    return output_text
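
The line-by-line strategy with a fallback can be tried without network access by injecting the translation step. The sketch below is a simplified illustration, not the function above: `translate_lines` and `fake_translate` are hypothetical names, `fake_translate` stands in for `GoogleTranslator(...).translate`, and blank lines are detected with `strip()` rather than with the Latin-letter test.

```python
def translate_lines(text, translate):
    """Translate text line by line, keeping the original line when a call fails."""
    output = []
    for line in text.splitlines():
        if not line.strip():
            output.append(" ")  # placeholder for blank lines, as in translate_text
            continue
        try:
            output.append(translate(line))
        except Exception:
            output.append(line)  # fall back to the untranslated line
    return output


def fake_translate(line):
    # Hypothetical stand-in for GoogleTranslator(...).translate; no network access.
    vocabulary = {"Liebes Tagebuch,": "Caro diario,"}
    if line not in vocabulary:
        raise ValueError("no translation available")
    return vocabulary[line]


result = translate_lines("Liebes Tagebuch,\n\nheute", fake_translate)
```

Because a failed call silently returns the source line, untranslated fragments in the output are a hint that some requests did not succeed.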

## PDF translator class
I coded this class to handle both methods (text parsing with OCR and translation methods) and data (original and translated text) in the same object.

In [4]:
class pdf_translator():
    
    """
    This class reads the text from a pdf file, using OCR technology, and performs translation to a custom language, using
    GoogleTranslator library. The result can be exported to a .txt file.
    
    """
    
    def __init__(self, tesseract_path = None):
        """
        Initialize the class by getting the full path of tesseract.exe. 
        Tesseract is the engine used for optical character recognition (OCR).
        
        Parameters
        ----------
        tesseract_path : str
            Optional full path of 'tesseract.exe' program. User can specify this path, if known. 
            If None, the executable will be searched in "C:\\" base directory. Default is None.
        
        """
        if tesseract_path:
            self.tesseract_path = tesseract_path
        else:
            tesseract_paths = find_file("tesseract.exe", r"C://") 
            if len(tesseract_paths) > 0:
                self.tesseract_path = tesseract_paths[0]
            else:
                raise Exception("Error: tesseract.exe not found in the default 'C:\\' drive.\nMake sure you have installed it as external program.")
        
    def read_with_OCR(self, pdf_file_path, return_array= False, remove_empty_lines = False, zoom = 1):
        """
        Read the given pdf file with OCR and return the text content.
        
        Parameters
        ----------
        pdf_file_path : str
            full path of the input pdf file
        return_array : bool
            True if the output should be a list of strings, False to output a simple text string.
            Default is False.
        remove_empty_lines : bool
            True to remove empty text lines (applied only when return_array is True). Default is False.
        zoom : float
            Scaling factor applied when rendering each page to an image. Default is 1 (no zooming).
        
        Returns
        -------
        fulltext : list of strings or str
            Scanned text.
        """
        self.pdf_file_path = pdf_file_path
        
        # attach tesseract exe path
        pytesseract.pytesseract.tesseract_cmd = self.tesseract_path
        fulltext = ""   
        
        doc = fitz.open(pdf_file_path) # open pdf files using fitz bindings 
        mat = fitz.Matrix(zoom, zoom)
        n_pages = doc.page_count 
        
        if isinstance(n_pages, int):
            for page_number in range(n_pages):              
                                
                page = doc.load_page(page_number) # load the current page
                pix = page.get_pixmap(matrix = mat) # render the page to an image, scaled by the zoom factor
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

                text = pytesseract.image_to_string(img)
                
                fulltext = fulltext + text
            
            if return_array:
                fulltext = fulltext.splitlines() 
                
                if remove_empty_lines:
                    fulltext = [s for s in fulltext if s != ""]
                    
            return fulltext
        

    def translate(self, pdf_file_path, output_language):
        """
        Calls the function 'translate_text', using the OCR-scanned text as input.
        
        Parameters
        ----------
        pdf_file_path : str
            full path of the input pdf file
        output_language : str
            Two-letter code identifying the output language.
        
        """
        
        self.original_text = self.read_with_OCR(pdf_file_path, return_array = False)
                
        self.translated_text = translate_text(self.original_text, output_language = output_language, 
                                     input_language = "auto", output_type = "str")
        
    def export_to_txt(self, output_file_path = ""):
        
        """
        Exports the translated text to a .txt document, by default in the same directory as the input pdf file.
        
        Parameters
        ----------
        output_file_path : str
            Custom full path provided to export the .txt file. Default is an empty string, so the program will use the same 
            input full path, but changing the file type from '.pdf' to '.txt'.
        
        """
        
        if output_file_path == "":
            output_file_path = self.pdf_file_path.replace(".pdf", ".txt")
            
        # the 'with' block closes the file automatically; UTF-8 keeps accented characters intact
        with open(output_file_path, "w", encoding = "utf-8") as text_file:
            text_file.write(self.translated_text)
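
The default export path is obtained by swapping the file extension. As a possible alternative to a plain string replacement, `pathlib` can change only the final suffix. The sketch below is an assumption about how this could be done, not part of the class; `default_output_path` and the sample path are invented for the illustration.

```python
from pathlib import Path


def default_output_path(pdf_file_path):
    # Swap only the final suffix, whatever its letter case.
    return str(Path(pdf_file_path).with_suffix(".txt"))


sample = "reports.pdf_archive/sample_text_de.PDF"
example = default_output_path(sample)

# For comparison: str.replace touches every '.pdf' occurrence and is case-sensitive.
naive = sample.replace(".pdf", ".txt")
```

With this sample path the plain replacement renames the folder part and leaves the upper-case `.PDF` extension untouched, while `with_suffix` changes only the extension.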

## Upload the PDF document
The widget below is a simple interface that lets the user upload a PDF file. It opens a file explorer window, so one can browse the directories on the local machine and pick the file to translate.

In [5]:
uploader = widgets.FileUpload(
    description = "Upload your file",
    accept = ".pdf",   # accepted file extension, e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple = False   # True to accept multiple files, False for a single file
)
display(uploader)

FileUpload(value={}, accept='.pdf', description='Upload your file')

In [6]:
file_name = list(uploader.value.keys())[0]
file_name

'sample_text_de.pdf'
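
The cell above relies on `uploader.value` being a dictionary keyed by file name, which matches the `value={}` shown in the widget output. To my knowledge, ipywidgets 8.x changed this attribute to a tuple of dictionaries, each with a `"name"` key. The helper below is a hedged sketch that accepts both shapes; `uploaded_file_names` is a hypothetical name and the sample values are hand-written, not real widget output.

```python
def uploaded_file_names(value):
    """Return the uploaded file names from a FileUpload value of either shape."""
    if isinstance(value, dict):
        # ipywidgets 7.x (assumed): {file_name: {"metadata": ..., "content": ...}}
        return list(value.keys())
    # ipywidgets 8.x (assumed): a tuple of dicts, each with a "name" key
    return [item["name"] for item in value]


old_style = {"sample_text_de.pdf": {"content": b"%PDF-"}}
new_style = ({"name": "sample_text_de.pdf", "content": b"%PDF-"},)

names_old = uploaded_file_names(old_style)
names_new = uploaded_file_names(new_style)
```

In the notebook this could be used as `file_name = uploaded_file_names(uploader.value)[0]`.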

In [7]:
file_full_path = find_file(file_name, "C://")[0]
file_full_path

'C:\\Users\\GRI018\\Desktop\\Python_text_translation\\test_data\\sample_text_de.pdf'

## Read & translate

### Initialize the translator
The user can initialize the translator class by passing the full path of the Tesseract executable, if it is known. Otherwise, the code searches for it through all sub-directories of "C:\\".

In [None]:
my_translator = pdf_translator()
my_translator.tesseract_path

In [8]:
tesseract_path = r"C://Users\GRI018\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
my_translator = pdf_translator(tesseract_path)

### Return the parsed text
The code below reads the input file with OCR and returns the parsed text. Please note that some characters or even words could be misspelled. This mainly depends on how messy the layout of the original document is and on the special characters that may appear in it.
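
One setting that may reduce recognition errors is the `zoom` argument of `read_with_OCR`, which scales the page when it is rendered to an image. PDF page sizes are expressed in points (72 per inch), so a zoom of 1 corresponds to 72 dpi. The sketch below only does the arithmetic; `zoom_for_dpi` and `rendered_size` are hypothetical helpers, and 595 x 842 points is the usual rounded size of an A4 page.

```python
PDF_POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 72 per inch


def zoom_for_dpi(dpi):
    # zoom = 1 renders one pixel per point, i.e. 72 dpi
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt, height_pt, zoom):
    # Approximate pixel size of the rendered page image
    return round(width_pt * zoom), round(height_pt * zoom)


zoom_300 = zoom_for_dpi(300)                             # about 4.17
a4_pixels = rendered_size(595, 842, zoom_for_dpi(144))   # A4 page at 144 dpi
```

The value could then be passed as `my_translator.read_with_OCR(file_full_path, zoom = zoom_for_dpi(300))`. Around 300 dpi is a commonly suggested resolution for Tesseract input, but the best value depends on the scan, and larger images take longer to process.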

In [9]:
pdf_text = my_translator.read_with_OCR(file_full_path.replace("\\", "//"), return_array = True, remove_empty_lines = True)
pdf_text

['es',
 'Kommission Juvenes Translatores | DE-2020',
 'Es geht nur gemeinsam',
 'Liebes Tagebuch,',
 'heute ist schon der zehnte Tag der Ausgangssperre. Nicht im Traum hatte ich daran gedacht, dass ich es',
 'so lange zu Hause aushalten wirde, ohne ein einziges Mal meine Freunde zu. treffen',
 'Uberraschenderweise geht es aber doch, und eigentlich ist es auch gar nicht so schlimm. Schlielch',
 'haben wir hier 2u Hause alles, was wir brauchen, und mit meinen Freunden tree ich mich vorerst eben',
 'nur online. lm Moment masten wir ale ein Stickchen Freiheit opfern, damit wir sie hoffentlch bald',
 'wieder in vollen Zigen genieen kénnen. Das ist das Mindeste, was wir ett alle leisten mussen.',
 '\\was mir in den ersten Tagen der Ausgangssperre schwer 2u schaffen gemacht hat, war gar nicht so sehr',
 'der plétlche Kontaktentaug, sondern die unaufhérliche Flut von negativen Nachrichten: Stindlich',
 'steigende Infektionszahlen. Arzte, die entscheiden missen, welche Patienten behandelt werde

### Translate the document
The user can also perform text parsing and translation in a single step, as follows. Only the output language has to be specified, using a two-letter code that identifies it (e.g. "it" for Italian).
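
The validation in `translate_text` accepts either a key or a value of the `LANGUAGES` dictionary, i.e. a code or a language name. The sketch below shows one way to normalise both to a code before translating. It is an illustration under assumptions: `resolve_language` is a hypothetical helper, and `LANGUAGES_SUBSET` is a hand-written excerpt with the same code-to-name shape as `googletrans.LANGUAGES`.

```python
# Hypothetical excerpt with the same shape as googletrans.LANGUAGES (code -> name).
LANGUAGES_SUBSET = {"de": "german", "en": "english", "it": "italian"}


def resolve_language(language, languages=LANGUAGES_SUBSET):
    """Return the language code for a code or a language name."""
    key = language.strip().lower()
    if key in languages:
        return key
    for code, name in languages.items():
        if name == key:
            return code
    raise ValueError("Unsupported language: {}".format(language))


codes = [resolve_language("it"), resolve_language("Italian")]
```

Both calls resolve to the same code, so the rest of the pipeline only ever has to deal with codes.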

In [10]:
my_translator.translate(file_full_path, "it")

### Print the translated text
The translation is saved in an attribute of the class as a string. The user can split it on newline characters to get a more readable output.

In [11]:
my_translator.translated_text.splitlines()

['es',
 'Commissione Juvenes Translatores | DE-2020',
 ' ',
 'Funziona solo insieme',
 ' ',
 'Caro diario,',
 ' ',
 "oggi è il decimo giorno del coprifuoco. Mai nei miei sogni ho pensato che l'avrei fatto",
 'così tanto tempo a casa senza vedere i miei amici. incontro',
 'Sorprendentemente, funziona, e in realtà non è così male. finalmente',
 'abbiamo tutto ciò di cui abbiamo bisogno qui a casa, e per il momento sto uscendo con i miei amici',
 'solo in linea. Al momento dobbiamo tutti sacrificare un bastoncino di libertà per poterlo sperare presto',
 'può godere di nuovo al massimo. Questo è il minimo che tutti dobbiamo fare.',
 ' ',
 'Ciò che mi ha reso le cose difficili nei primi giorni del coprifuoco non è stato tanto',
 'la plétlche mancanza di contatti, ma il diluvio incessante di notizie negative: ogni ora',
 'numero crescente di infezioni. Medici che mancano di decidere quali pazienti trattare',
 "può e quale no. L'incredibile numero di morti nel mondo. a questi",
 'notizie terr

### Export the result to a txt file

In [12]:
my_translator.export_to_txt()