<a href="https://colab.research.google.com/github/gmoulantz2/Translate-PDF-files-with-Hugginface-and-PyPDF2/blob/main/Translation_with_Hugginface_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing and importing the required modules from Hugging Face Transformers:

In [None]:
pip install transformers sentencepiece sacremoses 

In [3]:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


Defining the Translator class. Supported languages are English, French and German.

In [4]:
class Translator:

  def __init__(self):

    self.languages = ['en', 'fr', 'de']
    # Cache of (model, tokenizer) per language pair, so each pair is loaded only once.
    self.models = {}

  def translate(self, text, src_lang, tgt_lang):

    if src_lang not in self.languages:
      raise ValueError('Source language not supported.')
    if tgt_lang not in self.languages:
      raise ValueError('Target language not supported.')
    if src_lang == tgt_lang:
      raise ValueError('Source and target languages must differ.')

    # Load the model for this language pair on first use, then reuse it.
    pair = (src_lang, tgt_lang)
    if pair not in self.models:
      model_name = 'Helsinki-NLP/opus-mt-' + src_lang + '-' + tgt_lang
      self.models[pair] = (AutoModelForSeq2SeqLM.from_pretrained(model_name),
                           AutoTokenizer.from_pretrained(model_name))
    model, tokenizer = self.models[pair]

    # Inputs longer than 512 tokens are truncated to fit the model.
    inputs = tokenizer.encode(text, return_tensors = 'pt', truncation = True, max_length = 512)
    tokens = model.generate(inputs, max_length = 512)
    decoded = tokenizer.decode(tokens[0], skip_special_tokens = True)

    return decoded
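
The `translate` method selects a checkpoint by concatenating the two language codes. As a rough sketch of that naming scheme (the helper `opus_mt_model_name` and the constant `SUPPORTED_LANGUAGES` are hypothetical names, not part of the notebook or of Transformers), the model IDs implied by the three supported languages can be listed without downloading anything:

```python
from itertools import permutations

SUPPORTED_LANGUAGES = ['en', 'fr', 'de']  # same codes as Translator.languages

def opus_mt_model_name(src_lang, tgt_lang):
    """Hypothetical helper: build the model ID the same way Translator.translate does."""
    for lang in (src_lang, tgt_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f'Language not supported: {lang}')
    if src_lang == tgt_lang:
        raise ValueError('Source and target languages must differ.')
    return f'Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}'

# Every ordered pair of distinct languages maps to its own model ID.
model_names = [opus_mt_model_name(s, t) for s, t in permutations(SUPPORTED_LANGUAGES, 2)]
print(model_names)
```

Three languages give six ordered pairs, so a session that translates in every direction would need six separate models.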

    
    


In [None]:
translator_object = Translator()

Example: Translating a simple sentence from French to English.

In [6]:
translator_object.translate("j'aime manger des croissants et boire du chocolat chaud", 'fr', 'en')

'I like to eat croissants and drink hot chocolate'

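Testing the accuracy of our translations with NLTK's BLEU score. Note that in the cells below the model's output is stored in 'reference' and the hand-written translation in 'candidate'; BLEU conventionally treats the human translation as the reference and the model output as the candidate.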

In [None]:
pip install nltk

In [11]:
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize

In [9]:
reference = translator_object.translate('Lundi dernier, nous nous étions baladés avec Emilie, et sa fille Myrtille, pour aller acheter du pain au château de Boussan, en Occitanie.', 'fr', 'en')
candidate = 'Last Monday, we had taken a walk in the countryside with Emilie and her daughter Myrtille to buy bread at Boussan castle, in the Occitane region of France.'

In [10]:
print(reference)

Last Monday, we rode with Emilie, and her daughter Myrtille, to buy bread at the castle of Boussan in Occitanie.


In [15]:
score = sentence_bleu([reference.split()], candidate.split())
print(score)

0.17098323692758396
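
To help interpret a score like this one: BLEU is built from clipped n-gram precisions, that is, the fraction of candidate n-grams that also occur in the reference, where each n-gram is credited at most as often as it appears in the reference. The following is a simplified pure-Python sketch of that single ingredient, not the full BLEU formula (which also combines several n-gram orders and applies a brevity penalty). The function names are chosen here for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(reference_tokens, candidate_tokens, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    candidate_counts = Counter(ngrams(candidate_tokens, n))
    reference_counts = Counter(ngrams(reference_tokens, n))
    if not candidate_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())

reference_tokens = 'the cat is on the mat'.split()
candidate_tokens = 'the the the cat'.split()
unigram_precision = clipped_ngram_precision(reference_tokens, candidate_tokens, 1)
bigram_precision = clipped_ngram_precision(reference_tokens, candidate_tokens, 2)
print(unigram_precision, bigram_precision)
```

Higher-order precisions drop quickly when two sentences share words but not word order or phrasing, which is one likely reason two acceptable translations of the same sentence can still receive a low BLEU score.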


Function that converts a PDF file to plain text. The output will be used for translation.

In [16]:
pip install PyPDF2


In [17]:
import PyPDF2

def convert_pdf_to_txt(input_path, output_path):
    # Extract the text of every page and write it to a text file, line by line.
    with open(input_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        # UTF-8 keeps accented characters intact regardless of the platform default.
        with open(output_path, 'w', encoding='utf-8') as txt_file:
            for page in pdf_reader.pages:
                text = page.extract_text()
                for line in text.split('\n'):
                    txt_file.write(line + '\n')

Example: Converting the PDF file 'baking bread in france.pdf' to text format.

In [18]:
convert_pdf_to_txt('baking bread in france.pdf', 'baking bread in france.txt')
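
PDF extraction usually breaks lines where the page layout wraps, not where sentences end, so a text file produced this way may contain sentence fragments on each line. One possible preprocessing step before translation, sketched here under the assumption that sentences end with `.`, `!` or `?` (the helper `lines_to_sentences` is hypothetical and not part of the notebook), is to merge the lines and split them again on sentence boundaries:

```python
import re

def lines_to_sentences(lines):
    """Merge wrapped lines and split the result into sentences (naive heuristic)."""
    # Join the non-blank lines with single spaces, dropping the layout line breaks.
    merged = ' '.join(line.strip() for line in lines if line.strip())
    # Split after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as 'M. Dupont' will be split as well.
    return [s for s in re.split(r'(?<=[.!?])\s+', merged) if s]

wrapped = ['Le pain est cuit', 'au four. Il est vendu', 'le matin.']
sentences = lines_to_sentences(wrapped)
print(sentences)
```

This is only a heuristic; a dedicated sentence tokenizer would handle abbreviations and quotations more reliably.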

Function that takes a text file and creates a new file containing the translation of the text from a specified source language to a specified target language.

In [20]:
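def translate_file(input_file_path, src, tgt):
    # Read the input as UTF-8 and drop the trailing newline of each line.
    with open(input_file_path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()

    # Translate line by line with the global translator_object defined above.
    # Blank lines are copied through rather than sent to the model.
    translated_list = [translator_object.translate(line, src, tgt) if line.strip() else ''
                       for line in lines]

    output_file_path = input_file_path.replace('.txt', '_translated.txt')
    with open(output_file_path, 'w', encoding='utf-8') as f:
        for line in translated_list:
            f.write(line + '\n')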

In [21]:
translate_file('baking bread in france.txt', 'fr', 'en')
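
The file handling can be checked without downloading a model by passing the translation step in as a function. The sketch below is a hypothetical variant, not the notebook's `translate_file`: the names `translate_lines` and `fake_translate` are made up for illustration, and the stand-in "translator" simply upper-cases each line.

```python
import os
import tempfile

def translate_lines(input_path, output_path, translate_fn):
    """Apply translate_fn to every non-blank line of input_path and write the result."""
    with open(input_path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()
    with open(output_path, 'w', encoding='utf-8') as f:
        for line in lines:
            # Blank lines are written back as empty lines instead of being translated.
            f.write((translate_fn(line) if line.strip() else '') + '\n')

def fake_translate(line):
    """Stand-in for a real translator so the example runs offline."""
    return line.upper()

with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, 'sample.txt')
    target_path = os.path.join(tmp_dir, 'sample_translated.txt')
    with open(source_path, 'w', encoding='utf-8') as f:
        f.write('bonjour\n\nle pain\n')
    translate_lines(source_path, target_path, fake_translate)
    with open(target_path, 'r', encoding='utf-8') as f:
        result = f.read()

print(repr(result))
```

With the notebook's objects, the callable could be `lambda line: translator_object.translate(line, 'fr', 'en')`.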