# Translate long PDF-Reports in Python.
## Automatically extract, translate and write from PDF for free using `pdfplumber` and Google Translate, demonstrated with a complete German Central Bank report.

For work, I recently had to translate many old Central Bank reports from various OECD countries. While, luckily, gibberish online translation is a thing of the past, common manual solutions are often not viable when working with many long documents. A multitude of useful Python packages exist to help with this task and are introduced in a variety of excellent existing [articles](https://towardsdatascience.com/pdf-text-extraction-in-python-5b6ab9e92dd). However, when faced with this task, I found that the commonly used examples are often too *stylised* and that many of the established tools are *no longer maintained*, having been superseded by community-built follow-up projects.
That is why, with this article, I want to **1)** provide a real-world example of PDF extraction & translation and **2)** give an update on the best packages to use.

## 2 + 1 Tasks
So, together we're going to translate a Central Bank report, which, just like the code, you can find in my GitHub [repository](https://github.com/pcschreiber1/PDF_Extraction-Translation).

To get started, we need a clear idea of what it is that we want to do. In our case, we need to extract the content of a PDF, translate it and then (potentially) bring it into a format easily readable by humans: Extract -> Translate -> Write. We deal with each task separately and tie them together in the end.

### Extract

As you might already have experienced yourself, retrieving the text from a PDF can be quite tricky. The reason for this is that PDFs only store the *location* of characters and do not record what constitutes words or lines. Our library of choice is the new `pdfplumber` project, which is built on the very good `pdfminer.six` library (itself the successor to `PDFMiner`) but sports better documentation and exciting new features. One feature we'll make use of here is the filtering of tables. Note that the popular `PyPDF2` package is better suited to merging PDFs than to extracting text.
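
To make the "locations only" point concrete, here is a minimal sketch of how lines can be rebuilt from positioned characters. The character dictionaries are hand-made stand-ins modelled on the `text`, `x0` and `top` keys that `pdfplumber` exposes for each character. The helper `group_chars_into_lines` and its `tolerance` parameter are hypothetical and far cruder than what the library does internally.

```python
from collections import defaultdict

def group_chars_into_lines(chars, tolerance=2):
    """Rebuild text lines from characters that only carry coordinates."""
    rows = defaultdict(list)
    for char in chars:
        # Characters whose vertical position falls into the same bucket share a line
        rows[round(char["top"] / tolerance)].append(char)

    lines = []
    for key in sorted(rows):  # top of the page first
        ordered = sorted(rows[key], key=lambda c: c["x0"])  # left to right
        lines.append("".join(c["text"] for c in ordered))
    return lines

# Made-up characters, deliberately stored out of reading order
chars = [
    {"text": "b", "x0": 16, "top": 50.2},
    {"text": "A", "x0": 10, "top": 50.0},
    {"text": "C", "x0": 10, "top": 62.0},
]
print(group_chars_into_lines(chars))  # ['Ab', 'C']
```

Even in this toy version, the result depends on a tolerance choice, which hints at why extraction quality varies so much between documents.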

In [1]:
import pdfplumber

pdf = pdfplumber.open("src/examples/1978-geschaeftsbericht-data.pdf")

We import the library and open the desired document. The central object of `pdfplumber` is `pages`, which allows us to access each page and its content individually. Note that, while we could simply extract all text at once, reducing the pdf to one large string causes us to lose a lot of useful information.

We can use indices to access individual pages and simply apply the `extract_text()` method to access text.

In [2]:
page11 = pdf.pages[11]

print(page11.extract_text()[:500])

While this already looks great (for comparison, check the 12th page of the PDF), we see that sentences are interrupted by end-of-line breaks, which will create problems for translation. Since a paragraph typically ends with a full stop followed by a line break, we can exploit this to keep the desired paragraph breaks while removing all others.

In [3]:
def extract(page):
    """Extract PDF text.

    Delete in-paragraph line breaks.
    """
    # Get text
    extracted = page.extract_text()

    # Delete in-paragraph line breaks
    extracted = extracted.replace(".\n", "**/m" # keep paragraph breaks
                        ).replace(". \n", "**/m" # keep paragraph breaks
                        ).replace("\n", "" # delete in-paragraph breaks (i.e. all remaining \n)
                        ).replace("**/m", ".\n\n") # restore paragraph breaks

    return extracted

print(extract(page11)[:500])

2  schließlich diese Wende in ihrer Politik durch die Heraufsetzung des Diskont- und Lom bardsatzes. Die Bundesbank versucht hiermit, das Ihre für eine gedeihliche Fortentwick lung der Wirtschaft zu tun. Die Aufrechterhaltung der nach schwierigen Jahren und un ter manchen Opfern annähernd erreichten Preisstabilität bildet dafür eine wesentliche Voraussetzung.

1.  Konjunkturbelebung im Jahresverlauf Zunahme des realen  Die Wirtschaftsaktivität in der Bundesrepublik Deutschland hat sich im Verlau


Much better! But looking at the next page, we see that we have problems with the tables in the document.

In [4]:
page12 = pdf.pages[12]

print(extract(page12)[:500])

1  3 Zur Entwicklung des Wirtschaftswachstums Jährliche Veränderung in o;o Zum Vergleich: I  Bruttoin-Brutto- Iandsprodukt Brutto- inlands- je Einwohner, mlands- Produk- Arbeits- Erwerbs- produkt 1)  OECD-Länder Jahr  produkt 1)  tivität 2)  volumen 3)  tätige 4)  je Einwohner  insgesamt 5) JD 1961-1969  + 461  + 5,2  - 0,6  + 0,1  + 3,6  + 3,9 JD 1970-1973  + 4,4  + 4,6  - 0,1  + 0,4  + 3,6  + 3,5 JD1974-1978  + 1,8  + 3,7  - 1,8  - 1,2  + 2,1  + 1,7 1974  + 0,5  + 3,6  - 2,9  - 1.9  + 0.4  - 0


#### Filtering-out tables

A highlight of the `pdfplumber` package is the `filter` method. The library comes with built-in functionality for finding tables, but combining it with `filter` requires some ingenuity. Essentially, `pdfplumber` can locate each table as a bounding box ("bbox"), and `filter` keeps only those objects on the page for which a test function returns true. For the sake of brevity, I won't explain the `not_within_bboxes` function in detail but point towards the original GitHub [issue](https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246) and the library's own [implementation](https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404). In short, we collect the bounding boxes of the tables found on the page and pass them to `not_within_bboxes`, which checks whether a character lies outside all of them. Importantly, since the `filter` method expects a function that takes only the object as its argument, we freeze the `bboxes` argument using `partial`. We add this as a prior step to our `extract` function.

In [5]:
from functools import partial

In [6]:
def not_within_bboxes(obj, bboxes):
    """Check that the object is not inside any of the tables' bboxes."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

def extract(page):
    """Extract PDF text.

    Filter out tables and delete in-paragraph line-breaks.
    """
    # Filter-out tables
    tables = page.find_tables()
    if tables:
        # Get the bounding boxes of the tables on the page.
        # Adapted from https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246
        bboxes = [table.bbox for table in tables]

        bbox_not_within_bboxes = partial(not_within_bboxes, bboxes=bboxes)

        # Filter-out tables from page
        page = page.filter(bbox_not_within_bboxes)

    # Extract text (fall back to "" if nothing is returned for an empty page)
    extracted = page.extract_text() or ""

    # Delete in-paragraph line breaks
    extracted = extracted.replace(".\n", "**/m" # keep paragraph breaks
                        ).replace(". \n", "**/m" # keep paragraph breaks
                        ).replace("\n", "" # delete in-paragraph breaks (i.e. all remaining \n)
                        ).replace("**/m", ".\n\n") # restore paragraph breaks
    
    return extracted
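
To see the midpoint test and the role of `partial` in isolation, here is a small sketch that runs without any PDF. The function `midpoint_outside`, the table coordinates and the two character dictionaries are made up for illustration. Only the key names (`x0`, `x1`, `top`, `bottom`) follow the ones used in the function above.

```python
from functools import partial

def midpoint_outside(obj, bboxes):
    """Return True if the object's midpoint lies outside every bounding box."""
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for x0, top, x1, bottom in bboxes
    )

# One made-up table region: (x0, top, x1, bottom)
table_bboxes = [(100, 200, 400, 300)]

# partial fixes `bboxes`, leaving a predicate that takes only the object
keep = partial(midpoint_outside, bboxes=table_bboxes)

body_char = {"text": "D", "x0": 50, "x1": 56, "top": 100, "bottom": 110}
table_char = {"text": "4", "x0": 150, "x1": 156, "top": 250, "bottom": 260}

kept = [c["text"] for c in (body_char, table_char) if keep(c)]
print(kept)  # ['D']
```

Testing the midpoint rather than the corners means a character that merely touches the edge of a table is still kept as body text.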

In [7]:
print(extract(page12)[:500])

3 des Produktionspotentials anzusehen. Die statistischen Möglichkeiten lassen es nur an näherungsweise zu,  die auf Grund langfristiger Schrumpfungstendenzen anfallenden Stillegungen physisch noch vorhandener, d. h. noch nicht "verschrotteter", wirtschaft lich aber nicht mehr zu nutzender Anlagen zu berücksichtigen.

Die Produktion je geleistete Erwerbstätigenstunde hat 1978 in der Bundesrepublik um  Niedrige knapp 4 O/o zugenommen. Der Produktivitätsfortschritt hielt sich damit zwar etwa in der


Fantastic! The table was successfully filtered out, and we can now see that the page starts with a sentence cut in half by the page break. We leave extraction at that, but I encourage you to play around with more features, such as extracting page numbers, improving paragraph separation and fixing frequent mistakes such as "0/o" for "%".
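
As a starting point for the last of these suggestions, here is a hedged sketch of a clean-up step for misread percent signs. The helper `fix_percent` is hypothetical, and its pattern only covers the variants seen in the outputs above ("O/o", "o;o") plus "0/o". Other documents will likely need other patterns.

```python
import re

# Stand-alone tokens such as "O/o", "0/o" or "o;o" that are really a "%" sign
PERCENT_MISREAD = re.compile(r"\b[oO0][/;][oO0]\b")

def fix_percent(text):
    """Replace common misreadings of the percent sign with '%'."""
    return PERCENT_MISREAD.sub("%", text)

print(fix_percent("um knapp 4 O/o zugenommen"))  # um knapp 4 % zugenommen
print(fix_percent("Veränderung in o;o"))         # Veränderung in %
```

The word boundaries keep the pattern from firing inside longer tokens, so a fraction like "10/0" is left untouched.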

### Translation

AWS and DeepL offer two prominent APIs for high-quality text translation, but their character-based pricing can turn out extremely costly if we want to translate several long reports. To translate free of charge, we use the Google Translate API together with a key workaround that enables the translation of long texts.

In [8]:
from deep_translator import GoogleTranslator

Since the free Google Translate API is not officially maintained by Google, the community has repeatedly faced issues with translation. That is why we use the `deep_translator` package here, which acts as a useful wrapper for the API and enables us to switch seamlessly between translation engines, should we wish to. Importantly, `GoogleTranslator` can automatically identify the source language (in our example German), so we only need to specify our target language: English.

In [9]:
translate = GoogleTranslator(source='auto', target='en').translate

With this wrapper, translation is extremely simple, as the following example demonstrates.

In [10]:
translate("Ich liebe Python programmieren.")

'I love Python programming.'

However, the key issue is that most translation engines have a 5000-byte upload limit. If a job exceeds it, the connection is simply terminated, which would, for example, prevent the translation of `page11`. Of course, we could translate every word or sentence individually, but this impairs the translation quality. That is why we collect chunks of sentences just below the upload limit and translate them together.
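
Because the limit described here is counted in bytes rather than characters, German text reaches it sooner than its character count suggests: umlauts and "ß" take two bytes each in UTF-8. A tiny illustration, with `utf8_size` being a hypothetical helper name:

```python
def utf8_size(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

word = "Veränderung"
print(len(word), utf8_size(word))  # 11 12
```

This is why sizes are measured on the encoded text rather than with a plain `len()` on the string.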

Originally, I found this workaround [here](https://aws.amazon.com/de/blogs/media/how-to-translate-large-text-documents-with-amazon-translate/). It uses the popular natural language processing tool `nltk` for identifying sentences. The package's documentation is great, and I recommend that anyone interested try out the package. Here, we limit our attention to the package's sentence tokenizer. It cannot be stressed enough that only high-quality input will lead to high-quality translation output, so going the extra mile in these preparation steps will easily pay off!

Because this can be daunting for first-time users, I present here the shell commands to install the relevant nltk functionality (on Windows). The "popular" collection includes what the `nltk.tokenize` package needs, which we will use now.
```
# Shell script
pip install nltk
python -m nltk.downloader popular
```

In [11]:
from nltk.tokenize import sent_tokenize

The `sent_tokenize` function splits a text into a list of sentences, as the following example shows. The language argument defaults to English, which works just fine for most European languages. Please check the `nltk` documentation to see if the language you need is supported.

In [12]:
text = "I love Python. " * 2
sent_tokenize(text, language = "english")

['I love Python.', 'I love Python.']

Now, the second ingredient we need is an algorithm for collecting chunks of sentences below the upload limit. Once we find that adding another sentence would exceed 5000 bytes, we translate the collection and start a new chunk with the current sentence. Importantly, if a sentence itself is longer than 5000 bytes (which, remember, corresponds to roughly a page), we simply discard it and provide an in-text note instead. Combining i) the set-up of the translation client, ii) sentence tokenization, and iii) chunk-wise translation, we end up with the following translation function.

In [13]:
def translate_extracted(extracted):
    """Wrapper for Google Translate with upload workaround.

    Collects chunks of sentences below the limit and translates them.
    """
    # Set-up and wrap translation client
    translate = GoogleTranslator(source='auto', target='en').translate

    # Split input text into a list of sentences
    sentences = sent_tokenize(extracted)

    # Initialize containers
    translated_text = ''
    source_text_chunk = ''

    # Collect chunks of sentences below the limit and translate them individually
    for sentence in sentences:
        sentence_size = len(sentence.encode('utf-8'))
        chunk_size = len(source_text_chunk.encode('utf-8'))

        # If the chunk, a joining space and the current sentence fit below the limit, add the sentence
        if sentence_size + chunk_size + 1 < 5000:
            source_text_chunk += ' ' + sentence

        # Else translate the chunk and start a new one with the current sentence
        else:
            # Only send non-empty chunks to the translator
            if source_text_chunk.strip():
                translated_text += ' ' + (translate(source_text_chunk) or '')

            # If the current sentence is smaller than 5000 bytes, start a new chunk
            if sentence_size < 5000:
                source_text_chunk = sentence

            # Else, replace the sentence with a notification message
            else:
                translated_text += ' <<Omitted sentence longer than 5000 bytes>>'

                # Reset text container to empty
                source_text_chunk = ''

    # Translate the final chunk of input text, if there is any text left to translate
    if source_text_chunk.strip():
        translated_text += ' ' + (translate(source_text_chunk) or '')

    return translated_text
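
Since every call to the function above goes over the network, it can help to test the chunking logic on its own first. The following is a minimal sketch of the same greedy idea without any translation. The function `chunk_sentences` and its `limit` parameter are hypothetical names, and the tiny limit of 12 bytes is chosen only to make the behaviour visible.

```python
def chunk_sentences(sentences, limit=5000):
    """Greedily pack sentences into chunks that stay below `limit` UTF-8 bytes."""
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence.encode("utf-8")) >= limit:
            # A single oversized sentence cannot be sent: flush the chunk and skip it
            if current:
                chunks.append(current)
                current = ""
            continue

        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.encode("utf-8")) < limit:
            current = candidate  # still fits, keep collecting
        else:
            chunks.append(current)  # chunk is full, start a new one
            current = sentence

    if current:
        chunks.append(current)
    return chunks

sentences = ["aaaa.", "bbbb.", "cccc.", "x" * 30]
print(chunk_sentences(sentences, limit=12))  # ['aaaa. bbbb.', 'cccc.']
```

Returning the chunks as a list, instead of translating inside the loop, also makes it easy to plug in a different translation engine later.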


To see if it works, we apply our translation function to a page we already worked with earlier. As is now also evident for the non-German speakers, apparently the hourly productivity rate increased by roughly 4% in 1978.

In [14]:
extracted = extract(pdf.pages[12])
translated = translate_extracted(extracted)[:500]
print(translated)

 3 of the production potential. The statistical possibilities allow only an approximation of the closures that still occur physically due to long-term shrinkage tendencies, i. H. systems that have not yet been "scrapped" but are no longer economically viable. Production per hour worked increased in 1978 in the Federal Republic by just under 4%. The progress in productivity thus remained roughly at the same level as in the previous year. However, it appears that its pace has generally slowed in r


### Writing
We almost have everything we need, but as I said, you might, like me, need to bring your extracted text back into a format *easily* readable by humans. While it is easy to save strings to ".txt" in Python, the lack of line breaks makes such files a poor choice for long reports. Instead, we will write them back to PDF using the `fpdf2` [library](https://pyfpdf.github.io/fpdf2/index.html), which apparently succeeds the no-longer-maintained `pyfpdf` package.

In [26]:
from fpdf import FPDF

After initializing an FPDF object, we can add a page object for every page we translated and write the text onto it. This will help us maintain the structure of the original document. Two things to note: firstly, in `multi_cell` we set the width to zero to use the full page width and choose a height of 5 for slim line spacing. Secondly, since the pre-installed fonts are not Unicode compatible, we convert the text to "latin-1", replacing any characters that this encoding cannot represent. For instructions on downloading and using Unicode-compatible fonts, see the `fpdf2` [website](https://pyfpdf.github.io/fpdf2/Unicode.html).

In [27]:
fpdf = FPDF()
fpdf.set_font("Helvetica", size = 7)

fpdf.add_page()
fpdf.multi_cell(w=0,
               h=5,
               txt= translated.encode("latin-1",
                                      errors = "replace"
                             ).decode("latin-1")
               )
fpdf.output("output/page12.pdf")
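
To make the effect of the encode/decode round trip visible, here is a short sketch using only the standard library. The helper name `to_latin1` and the sample sentence are made up for illustration.

```python
def to_latin1(text):
    """Swap characters that Latin-1 cannot represent for '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")

print(to_latin1("Preisstabilität kostet 5 €"))  # Preisstabilität kostet 5 ?
```

German umlauts survive because they are part of Latin-1, whereas symbols such as the euro sign are replaced by a question mark. If such characters matter for your documents, a Unicode font is the better route.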

Now, just like in extraction, there is obviously a lot more you could do with `fpdf2`, such as adding page numbers, a title layout, etc. However, for the purpose of this article, this minimal set-up will suffice.

## Tying everything together

We'll now bring everything together in one pipeline. Remember that, to avoid losing too much information, we operate on each page individually. Importantly, we make two adaptations to the translation. Firstly, since some pages are empty, but empty strings are not valid input for `GoogleTranslator`, we place an if condition before the translation. Secondly, because `nltk` allocates our paragraph breaks (i.e., "\n\n") to the beginning of the *following* sentence, `GoogleTranslator` ignores them. That is why we translate each paragraph individually using a list comprehension.

In [29]:
# Open PDF
with pdfplumber.open("src/examples/1978-geschaeftsbericht-data.pdf") as pdf:
    # Initialize FPDF file to write on
    fpdf = FPDF()
    fpdf.set_font("Helvetica", size = 7)
    
    # Treat each page individually
    for page in pdf.pages[:30]:
        # Extract Page
        extracted = extract(page)

        # Translate Page
        if extracted != "":
            # Translate paragraphs individually to keep breaks
            paragraphs = extracted.split("\n\n")
            translated = "\n\n".join(
                [translate_extracted(paragraph) for paragraph in paragraphs]
                )
        else:
            translated = extracted

        # Write Page
        fpdf.add_page()
        fpdf.multi_cell(w=0,
                        h=5,
                        txt= translated.encode("latin-1",
                                               errors = "replace"
                                      ).decode("latin-1")
                        )
    
    # Save all FPDF pages
    fpdf.output("output/trans_1978-geschaeftsbericht-data.pdf")
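
Before running the full pipeline against the translation service, the paragraph handling can be dry-run with a stand-in translator. In this sketch, `translate_paragraphs` is a hypothetical helper and `str.upper` merely plays the role of the translation function. The check for blank paragraphs is an added assumption, motivated by the fact that text ending in a paragraph break produces an empty last element when split.

```python
def translate_paragraphs(text, translate_fn):
    """Translate each paragraph separately and restore the blank lines between them."""
    paragraphs = text.split("\n\n")
    # Leave empty or whitespace-only paragraphs untouched instead of sending them off
    translated = [translate_fn(p) if p.strip() else p for p in paragraphs]
    return "\n\n".join(translated)

sample = "Erster Absatz.\n\nZweiter Absatz.\n\n"
print(repr(translate_paragraphs(sample, str.upper)))
# 'ERSTER ABSATZ.\n\nZWEITER ABSATZ.\n\n'
```

Swapping `str.upper` for the real translation function then changes nothing about the paragraph structure.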


## Conclusion

Thank you for staying to the end. I hope this article gave you a hands-on example of how to translate PDFs and an overview of the current state-of-the-art packages. Throughout the article I pointed towards potential extensions of this rudimentary example (e.g., adding page numbers, layout, etc.), so please share your approaches for these. I'd love to hear them. And, of course, I am also always eager to hear suggestions on how to improve the code. Stay safe and stay in touch!