## Convert a PDF to DOCX

IBM Watson™ Language Translator support for PDF files is in beta, and the service handles Word documents better. For that reason, we first convert the PDF to a Word document with the open-source library pdf2docx.

In [32]:
#Uncomment the following command in order to install it directly.
#!pip install pdf2docx

In [48]:
from pdf2docx import parse

pdf_file = 'files/dutch/tax_statement_notes_2020.pdf'
docx_file = 'files/dutch/tax_statement_notes_2020.docx'

# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=None)

Parsing Page 44: 44/44...
Creating Page 44: 44/44...
--------------------------------------------------
Terminated in 5.666066796002269s.
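
The output path in the cell above is typed by hand. As a minimal sketch, it could instead be derived from the PDF path with `pathlib`. The helper name `docx_path_for` is hypothetical and is not part of pdf2docx.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> str:
    """Return the path of a .docx file next to the given PDF (hypothetical helper)."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {pdf_path}")
    # with_suffix swaps only the final extension and keeps the directory.
    return path.with_suffix(".docx").as_posix()


pdf_file = "files/dutch/tax_statement_notes_2020.pdf"
docx_file = docx_path_for(pdf_file)
print(docx_file)
```

The resulting `pdf_file` and `docx_file` values can be passed to `parse` exactly as in the cell above.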


## Getting started with Language Translator

IBM Watson™ Language Translator allows you to translate text programmatically from one language into another language.

Set up an instance in IBM Cloud: https://cloud.ibm.com/docs/language-translator?topic=language-translator-gettingstarted




## Introduction

This Jupyter Notebook helps to translate files from one language to another while preserving the original format. More than 12 different file formats can be translated, including MS Office, Open Office, and PDF.

Documentation: https://cloud.ibm.com/docs/language-translator?topic=language-translator-document-translator-tutorial


## 1. Setup

To prepare your environment, you need to install some packages and enter credentials for the Watson services.

In [1]:
import json
from ibm_watson import LanguageTranslatorV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('fg_jOKI4m-ZgGvxz29B6LckAbNZROT0wwdwI87yaw0v2')
language_translator = LanguageTranslatorV3(
    version='2018-05-01',
    authenticator=authenticator
)
language_translator.set_service_url('https://api.eu-de.language-translator.watson.cloud.ibm.com/instances/cd151abf-6d35-4b75-b262-412574d0c603')
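
Credentials written directly into a notebook cell are shared with everyone who receives the notebook. One possible alternative, sketched here under assumptions, is to read them from environment variables. The helper `load_translator_credentials` and the variable names `LANGUAGE_TRANSLATOR_APIKEY` and `LANGUAGE_TRANSLATOR_URL` are choices made for this sketch and do not come from the cell above.

```python
import os


def load_translator_credentials(env=None):
    """Return (apikey, url) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    names = ("LANGUAGE_TRANSLATOR_APIKEY", "LANGUAGE_TRANSLATOR_URL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


# Demonstration with placeholder values instead of real credentials.
apikey, url = load_translator_credentials({
    "LANGUAGE_TRANSLATOR_APIKEY": "placeholder-key",
    "LANGUAGE_TRANSLATOR_URL": "https://example.invalid/instances/placeholder",
})
print(url)
```

The returned `apikey` would then go to `IAMAuthenticator(apikey)` and the returned `url` to `language_translator.set_service_url(url)`.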

## 2. List submitted documents
Lists documents that have been submitted for translation.

In [2]:
result = language_translator.list_documents().get_result()
print(json.dumps(result, indent=2))

{
  "documents": []
}
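
When the list is not empty, the document IDs are what the later cells need. A small sketch that pulls them out of the result dictionary follows. `submitted_document_ids` is a hypothetical helper, and it assumes each entry carries a `document_id` key, as the status output later in this notebook does.

```python
def submitted_document_ids(result: dict) -> list:
    """Collect the document_id of every entry in a list_documents() result."""
    return [doc["document_id"] for doc in result.get("documents", [])]


# The empty result shown above yields an empty list.
print(submitted_document_ids({"documents": []}))
```

Each returned ID could be passed as `document_id` to `delete_document` or `get_document_status`.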


## 3. Delete translated documents (if required)
Deletes a document.

In [None]:
language_translator.delete_document(document_id='ADD ID') #add document ID


## 4. Add file for translation

Submit a document for translation. You can submit the document contents in the file parameter, or you can reference a previously submitted document by document ID.

a) Add the local file location<br>
b) Add the content type (see below)<br>
c) Add the file name<br>
d) Add the translation model ID in source-target form, for example nl-en for Dutch to English<br>

***Application Microsoft***
<br>
application/powerpoint, application/mspowerpoint, application/x-rtf, application/json, application/xml, application/vnd.ms-excel, application/vnd.ms-powerpoint
<br>

***Application Open Office***<br>
application/vnd.openxmlformats-officedocument.presentationml.presentation, application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document,
application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.presentation, application/vnd.oasis.opendocument.text, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
<br>

***Application Other formats***<br>
application/pdf, application/rtf,
text/html, text/json, text/plain, text/richtext, text/rtf, or text/xml
<br>
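
Picking the content type by hand is easy to get wrong, for example by using the `.doc` type for a `.docx` file. As a sketch, the type can be looked up from the file extension. The table covers only a subset of the content types listed above, and `content_type_for` is a hypothetical helper, not part of the Watson SDK.

```python
from pathlib import Path

# Subset of the content types listed above, keyed by file extension.
CONTENT_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
    ".odp": "application/vnd.oasis.opendocument.presentation",
    ".pdf": "application/pdf",
    ".rtf": "application/rtf",
    ".html": "text/html",
    ".json": "application/json",
    ".xml": "application/xml",
    ".txt": "text/plain",
}


def content_type_for(filename: str) -> str:
    """Return the content type for a file name, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return CONTENT_TYPES[suffix]
    except KeyError:
        raise ValueError(f"no known content type for extension {suffix!r}") from None


print(content_type_for("files/dutch/tax_statement_notes_2020.docx"))
```

The returned string is the kind of value expected by the `file_content_type` parameter of `translate_document`.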




In [50]:
with open('files/dutch/tax_statement_notes_2020.docx', 'rb') as file: #add file location
    result = language_translator.translate_document(
        file=file,
        file_content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document', # add content type (this one is for .docx)
        filename='tax_statement_notes_2020_English.docx', #add file name
        model_id='nl-en').get_result()  #add language parameters 
    print(json.dumps(result, indent=2))

{
  "document_id": "8d3dd46a-9729-4b4f-811f-3c60a3c43cc0",
  "filename": "tax_statement_notes_2020_English.docx",
  "model_id": "nl-en",
  "source": "nl",
  "target": "en",
  "status": "processing",
  "created": "2021-03-17T15:50:53Z"
}


### 4.1 Get document status
Gets the translation status of a document.

In [52]:
result = language_translator.get_document_status(
    document_id='8d3dd46a-9729-4b4f-811f-3c60a3c43cc0').get_result()  # Add document ID number
print(json.dumps(result, indent=2))


{
  "document_id": "8d3dd46a-9729-4b4f-811f-3c60a3c43cc0",
  "filename": "tax_statement_notes_2020_English.docx",
  "model_id": "nl-en",
  "source": "nl",
  "target": "en",
  "status": "available",
  "created": "2021-03-17T15:50:53Z",
  "completed": "2021-03-17T15:51:01Z",
  "word_count": 19305,
  "character_count": 120194
}
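
The status moves from `processing` to `available` once the translation is done, so the status cell may need to be re-run several times. A minimal polling sketch follows. `wait_until_available` is a hypothetical helper; the status lookup is passed in as a callable so the loop can be tried without calling the service, and the `failed` status value is an assumption that should be checked against the API documentation.

```python
import time


def wait_until_available(get_status, document_id, interval=2.0, timeout=300.0,
                         sleep=time.sleep):
    """Poll get_status(document_id) until the document is available.

    get_status must return a dict with a "status" key, like the output above.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(document_id)
        state = status.get("status")
        if state == "available":
            return status
        if state == "failed":  # assumed failure value, see the note above
            raise RuntimeError(f"translation failed for document {document_id}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"document {document_id} still {state!r} after {timeout}s")
        sleep(interval)
```

With the client from the setup cell, the callable could be `lambda doc_id: language_translator.get_document_status(document_id=doc_id).get_result()`.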


### 4.2 Get translated document
Gets the translated document associated with the given document ID.

In [56]:
with open('files/english/tax_statement_notes_2020_English.docx', 'wb') as f: #add name for translated document
    result = language_translator.get_translated_document(
        document_id='8d3dd46a-9729-4b4f-811f-3c60a3c43cc0', #add document ID
        accept='application/vnd.openxmlformats-officedocument.wordprocessingml.document').get_result()  # add content type (this one is for .docx)
    f.write(result.content)
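
Because the submitted file was a `.docx`, it can be worth checking which Word format the downloaded bytes actually are before choosing the file extension. The sketch below inspects the leading bytes: a `.docx` file is a ZIP archive, while a legacy `.doc` file is an OLE2 compound file. `word_extension_for` is a hypothetical helper. The ZIP signature is also shared by other Office formats such as `.xlsx` and `.pptx`, so the check only separates the two Word formats.

```python
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header, used by .docx
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file, used by legacy .doc


def word_extension_for(data: bytes) -> str:
    """Guess the Word file extension from the leading bytes, or '' if unknown."""
    if data.startswith(ZIP_MAGIC):
        return ".docx"
    if data.startswith(OLE2_MAGIC):
        return ".doc"
    return ""


print(word_extension_for(b"PK\x03\x04rest-of-archive"))
```

In the cell above, the bytes to inspect would be `result.content`.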