# DeepL Document 
* This notebook explains why DeepL's document translation is implemented as follows:
```py
    def translate_document(self, text: list[str], src_lang: str, tgt_lang: str) -> list[str]:
        out_buffer = BytesIO()
        out_buffer.name = 'out_text.txt'
        in_text = '\n'.join(text)
        in_buffer = BytesIO(in_text.encode('utf-8'))
        in_buffer.name = 'in_text.txt'
        
        # Logger omitted for brevity

        self.client.translate_document(
            input_document=in_buffer,
            output_document=out_buffer,
            source_lang=src_lang.upper(),
            target_lang=get_deepl_code(tgt_lang),
        )
        out = out_buffer.getvalue()
        out_text = out.decode('utf-8')
        out_sents = out_text.splitlines()
        return out_sents
```
* **TL;DR**: This implementation lets us use the `/document` endpoint without first writing the source text to a local file for DeepL's client to read.
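* A minimal sketch of the underlying idea, using only the standard library: an `io.BytesIO` buffer exposes the same `read()` interface as a file opened with `"rb"`, so a consumer sees identical bytes either way. The sample sentences and the temporary file are illustrative assumptions, not part of the project code.

```python
import os
import tempfile
from io import BytesIO

sents = ['Erster Satz.', 'Zweiter Satz mit Umlaut: ä.']  # illustrative input
payload = '\n'.join(sents).encode('utf-8')

# Variant A: write the text to disk, then read it back in binary mode.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(payload)
with open(tmp.name, 'rb') as in_file:
    from_disk = in_file.read()
os.unlink(tmp.name)  # clean up the temporary file

# Variant B: keep everything in memory.
in_buffer = BytesIO(payload)
in_buffer.name = 'in_text.txt'  # a real file object has a name; here we set one by hand
from_memory = in_buffer.read()

same_bytes = from_disk == from_memory
print(same_bytes)
```

* Both variants yield the same bytes, which is why the buffer can stand in for the file.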

In [1]:
from scripts.translators import DeepLClient
myClient = DeepLClient()
deeplCli = myClient.client

* This is how you would normally translate a document, using `translate_document_from_filepath`.

In [2]:
from scripts.data_management import EPManager
dm = EPManager()
de_sents, _ = dm.get_sentence_pairs('de', 'en', num_of_sents=5)
with open('de.txt', 'w', encoding='utf-8') as f:
    for s in de_sents:
        print(s, file=f)

In [3]:
!cat de.txt | head -n 10

Wiederaufnahme der Sitzungsperiode
Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.
Wie Sie feststellen konnten, ist der gefürchtete "Millenium-Bug " nicht eingetreten. Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden.
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen.
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen -, allen Opfern der Stürme, insbesondere in den verschiedenen Ländern der Europäischen Union, in einer Schweigeminute zu gedenken.


In [4]:
!cat de.txt | wc -l

5


In [5]:
deeplCli.translate_document_from_filepath(
    input_path='de.txt',
    output_path='en.txt',
    source_lang='DE',
    target_lang='EN-GB'
)

<deepl.api_data.DocumentStatus at 0x2224ece09e0>

In [6]:
!cat en.txt | head -n 10

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December, wish you all the best for the New Year and hope you had a good holiday.
As you have seen, the dreaded "millennium bug" has not materialised. However, citizens of some of our Member States have been victims of terrible natural disasters.
There is a desire in Parliament for a debate during this part-session in the next few days.
Today I would like to ask you - and this is also the wish of some of my fellow Members - to observe a minute's silence in memory of all the victims of the storms, particularly in the various countries of the European Union.


In [7]:
!cat en.txt | wc -l

5


* This approach works, but it forces us to write the source text to a local file first, which is awkward because the rest of the setup works with datasets that provide text as lists of strings.
* If we inspect the source of `translate_document_from_filepath` [here](https://github.com/DeepLcom/deepl-python/blob/7469a47e833ff87d5cfc5564d62a8a21f417e67d/deepl/translator.py#L530), we see the following:

```py
        # Determine output_format from output path
        in_ext = pathlib.PurePath(input_path).suffix.lower()
        out_ext = pathlib.PurePath(output_path).suffix.lower()
        output_format = None if in_ext == out_ext else out_ext[1:]

        with open(input_path, "rb") as in_file:
            with open(output_path, "wb") as out_file:
                try:
                    return self.translate_document(
                        in_file,
                        out_file,
                        target_lang=target_lang,
                        source_lang=source_lang,
                        formality=formality,
                        glossary=glossary,
                        output_format=output_format,
                        timeout_s=timeout_s,
                    )
                except Exception as e:
                    out_file.close()
                    os.unlink(output_path)
                    raise e
```
* It opens the input file in binary mode (`with open(input_path, "rb") as in_file:`) and passes the file object to `translate_document`.
* So we can do the same ourselves. `translate_document` has the following signature:
```py
    def translate_document(
        self,
        input_document: Union[TextIO, BinaryIO, Any],
        output_document: Union[TextIO, BinaryIO, Any],
        *,
        source_lang: Optional[str] = None,
        target_lang: str,
        formality: Union[str, Formality] = Formality.DEFAULT,
        glossary: Union[str, GlossaryInfo, None] = None,
        filename: Optional[str] = None,
        output_format: Optional[str] = None,
        timeout_s: Optional[int] = None,
    ) -> DocumentStatus:
```
* It accepts file-like (I/O) objects, or `Any`, for both the input and the output document.
* Following the calls, `translate_document` in turn calls:
```py
        handle = self.translate_document_upload(
            input_document,
            target_lang=target_lang,
            source_lang=source_lang,
            formality=formality,
            glossary=glossary,
            filename=filename,
            output_format=output_format,
        )
```
* And `translate_document_upload` is defined as:
```py
    def translate_document_upload(
        self,
        input_document: Union[TextIO, BinaryIO, str, bytes, Any],
        *,
        source_lang: Optional[str] = None,
        target_lang: str,
        formality: Union[str, Formality, None] = None,
        glossary: Union[str, GlossaryInfo, None] = None,
        filename: Optional[str] = None,
        output_format: Optional[str] = None,
    ) -> DocumentHandle:
```
* In other words, `translate_document_upload` also accepts the `input_document` as `bytes` or `str`.
* This is where the API call is made:
```py
        files: Dict[str, Any] = {}
        if isinstance(input_document, (str, bytes)):
            if filename is None:
                raise ValueError(
                    "filename is required if uploading file content as string "
                    "or bytes"
                )
            files = {"file": (filename, input_document)}
        else:
            files = {"file": input_document}
        status, content, json = self._api_call(
            "v2/document", data=request_data, files=files
        )
```
* So if we pass the input as `bytes` or `str`, we must provide a `filename` as well.
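* The branch above can be reproduced in isolation to see both outcomes without calling the API. This is a sketch only: `build_files_payload` is a hypothetical stand-alone function that mirrors the excerpt, not something the `deepl` package exposes.

```python
from typing import Any, Dict, Optional


def build_files_payload(input_document: Any, filename: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical copy of the branching logic from the excerpt above."""
    if isinstance(input_document, (str, bytes)):
        # Raw content carries no name of its own, so one must be supplied.
        if filename is None:
            raise ValueError(
                "filename is required if uploading file content as string or bytes"
            )
        return {"file": (filename, input_document)}
    # Anything else (e.g. a file-like object) is passed through unchanged.
    return {"file": input_document}


with_name = build_files_payload(b'Hallo Welt', filename='infile.txt')

try:
    build_files_payload(b'Hallo Welt')
    error_message = None
except ValueError as e:
    error_message = str(e)

print(with_name)
print(error_message)
```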


In [8]:
from io import BytesIO
in_text = '\n'.join(de_sents)
in_bytes = in_text.encode('utf-8')
# Equivalent to writing de_sents to de.txt and reading it back with open('de.txt', 'rb')
out_file = BytesIO()

try:
    out = deeplCli.translate_document(
        input_document=in_bytes,
        output_document=out_file,
        source_lang='DE',
        target_lang='EN-GB'
    )
except Exception as e:
    print(str(e))

filename is required if uploading file content as string or bytes


* As expected, the call fails because we did not provide a filename.
* DeepL needs the filename to infer the document type.
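* The `translate_document_from_filepath` excerpt shown earlier applies the same idea on the client side: it derives `output_format` from the file suffixes with `pathlib`. A small sketch of that logic as a stand-alone function (the name `infer_output_format` and the example paths are hypothetical; which formats the API actually accepts is not covered here):

```python
import pathlib
from typing import Optional


def infer_output_format(input_path: str, output_path: str) -> Optional[str]:
    """Return the output suffix without the dot, or None if both suffixes match."""
    in_ext = pathlib.PurePath(input_path).suffix.lower()
    out_ext = pathlib.PurePath(output_path).suffix.lower()
    return None if in_ext == out_ext else out_ext[1:]


same_type = infer_output_format('de.txt', 'en.TXT')   # suffixes match after lowercasing
other_type = infer_output_format('in.docx', 'out.pdf')  # suffixes differ

print(same_type, other_type)
```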

In [9]:
from io import BytesIO
in_text = '\n'.join(de_sents)
in_bytes = in_text.encode('utf-8')
in_filename = 'infile.txt'
out_file = BytesIO()
try:
    out = deeplCli.translate_document(
        input_document=in_bytes,
        output_document=out_file,
        source_lang='DE',
        target_lang='EN-GB',
        filename=in_filename
    )
except Exception as e:
    print(str(e))

In [10]:
out

<deepl.api_data.DocumentStatus at 0x2224eaae960>

* `translate_document` returns a `DocumentStatus` object with metadata, not the translation itself.

In [11]:
out.billed_characters

768

* The translation itself was written to our in-memory output buffer.

In [12]:
out_file.getvalue()

b'Resumption of the session\nI declare resumed the session of the European Parliament adjourned on Friday 17 December, wish you all the best for the New Year and hope you had a good holiday.\nAs you have seen, the dreaded "millennium bug" has not materialised. However, citizens of some of our Member States have been victims of terrible natural disasters.\nThere is a desire in Parliament for a debate during this part-session in the next few days.\nToday I would like to ask you - and this is also the wish of some of my fellow Members - to observe a minute\'s silence in memory of all the victims of the storms, particularly in the various countries of the European Union.'

In [13]:
out_text = out_file.getvalue().decode('utf-8')
out_text

'Resumption of the session\nI declare resumed the session of the European Parliament adjourned on Friday 17 December, wish you all the best for the New Year and hope you had a good holiday.\nAs you have seen, the dreaded "millennium bug" has not materialised. However, citizens of some of our Member States have been victims of terrible natural disasters.\nThere is a desire in Parliament for a debate during this part-session in the next few days.\nToday I would like to ask you - and this is also the wish of some of my fellow Members - to observe a minute\'s silence in memory of all the victims of the storms, particularly in the various countries of the European Union.'

In [14]:
out_sents = out_text.splitlines()
out_sents

['Resumption of the session',
 'I declare resumed the session of the European Parliament adjourned on Friday 17 December, wish you all the best for the New Year and hope you had a good holiday.',
 'As you have seen, the dreaded "millennium bug" has not materialised. However, citizens of some of our Member States have been victims of terrible natural disasters.',
 'There is a desire in Parliament for a debate during this part-session in the next few days.',
 "Today I would like to ask you - and this is also the wish of some of my fellow Members - to observe a minute's silence in memory of all the victims of the storms, particularly in the various countries of the European Union."]

* So a list of strings goes in, and a list of strings comes out.
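* One caveat of joining with `'\n'` and splitting with `splitlines()`: the result only maps back to the same list if no input string contains a line break itself and the last string is not empty. A minimal sketch of a pre-flight check follows; `survives_round_trip` is a hypothetical helper, not part of the project. It only checks the joining and splitting and says nothing about how the translation service treats line breaks.

```python
from typing import List


def survives_round_trip(sents: List[str]) -> bool:
    """True if joining with newlines and splitting again reproduces the list."""
    return '\n'.join(sents).splitlines() == sents


clean = survives_round_trip(['Erster Satz.', 'Zweiter Satz.'])
embedded_break = survives_round_trip(['Erster\nSatz.', 'Zweiter Satz.'])
trailing_empty = survives_round_trip(['Erster Satz.', ''])

print(clean, embedded_break, trailing_empty)
```

* Running such a check before the upload makes a misaligned result easier to trace back to the input.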

In [15]:
!rm de.txt
!rm en.txt