In [87]:
import numpy as np # dealing with arrays
import re, string  # regular expressions and string objects
import codecs      # file I/O
import time        # timing code blocks
import os          # various OS interfaces
import sys
from bs4 import BeautifulSoup   # XML parsing

## Task description

The main tasks for this project can be summarised as:

+ Parsing and cleaning up TMX files
+ Splitting source/target sentences into aligned monolingual files 
+ Computing summary metrics on each monolingual file

# 1. Parsing TMX files and splitting by language

First things first – loading TMX files. Rather than writing our own parsing functions from scratch using regular expressions, a more robust (and faster) way is to use a pre-existing tool. A popular Python library for parsing HTML and XML documents (like TMX files) is BeautifulSoup (BS4).

BeautifulSoup often does a good job of figuring out which encoding to use from header tags, so we'll leave it for now on this first pass.

In [146]:
def extract_data_from_tmx(filepath):
    """
    Load TMX file, parse XML, build a nested soup object.
    """
    print('Extracting data from "{0}" ({1:.2f}MB)... this might take 1-2 mins...'.format(filepath, os.path.getsize(filepath)/1000000))
    
    start_time = time.time()
    # the b mode specifier treats files as binaries for now
    with open(filepath, 'rb') as file:
        soup = BeautifulSoup(file, "lxml")
    print('Data extracted in {0:.2f} seconds.'.format(time.time() - start_time))
    return soup

Let's see what this file looks like when read in.

In [147]:
soup = extract_data_from_tmx('small_sample.tmx')
soup

Extracting data from "small_sample.tmx" (0.01MB)... this might take 1-2 mins...
Data extracted in 0.01 seconds.


<?xml version="1.0" encoding="UTF-8" ?><html><body><tmx version="1.4">
<header adminlang="en" creationdate="Sun Oct 23 00:52:07 2011" creationtool="Uplug" creationtoolversion="unknown" datatype="PlainText" o-tmf="unknown" segtype="sentence" srclang="en"></header>
<tu>
<tuv xml:lang="en"><seg>EMEA/ H/ C/ 471</seg></tuv>
<tuv xml:lang="fr"><seg>EMEA/ H/ C/ 471</seg></tuv>
</tu>
<tu>
<tuv xml:lang="en"><seg>ABILIFY</seg></tuv>
<tuv xml:lang="fr"><seg>ABILIFY</seg></tuv>
</tu>
<tu>
<tuv xml:lang="en"><seg>What is Abilify?</seg></tuv>
<tuv xml:lang="fr"><seg>Qu’ est -ce qu'Abilify?</seg></tuv>
</tu>
<tu>
<tuv xml:lang="en"><seg>Abilify is a medicine containing the active substance aripiprazole.</seg></tuv>
<tuv xml:lang="fr"><seg>Abilify est un médicament qui contient le principe actif aripiprazole.</seg></tuv>
</tu>
<tu>
<tuv xml:lang="en"><seg>It is available as 5 mg, 10 mg, 15 mg and 30 mg tablets, as 10 mg, 15 mg and 30 mg orodispersible tablets (tablets that dissolve in the mouth), as 

Looks like the things we want are on lines with `<tuv>` tags. We can find all of them like so:

In [148]:
tuv_lines = soup.find_all('tuv')
tuv_lines[0:6]

[<tuv xml:lang="en"><seg>EMEA/ H/ C/ 471</seg></tuv>,
 <tuv xml:lang="fr"><seg>EMEA/ H/ C/ 471</seg></tuv>,
 <tuv xml:lang="en"><seg>ABILIFY</seg></tuv>,
 <tuv xml:lang="fr"><seg>ABILIFY</seg></tuv>,
 <tuv xml:lang="en"><seg>What is Abilify?</seg></tuv>,
 <tuv xml:lang="fr"><seg>Qu’ est -ce qu'Abilify?</seg></tuv>]

Given one of these lines, we can grab the language tag and the sentence text from the soup object as follows:

In [149]:
one_tuv_line = tuv_lines[42]
print(one_tuv_line)

<tuv xml:lang="en"><seg>Abilify has not been studied in children aged below 18 years or adults aged over 65 years.</seg></tuv>


In [150]:
print(one_tuv_line.attrs['xml:lang'])
print(one_tuv_line.find('seg').get_text())

en
Abilify has not been studied in children aged below 18 years or adults aged over 65 years.


We can build up arrays of paired language/sentence tags for a corpus, which looks like this:

In [151]:
tuv_line_sentences = np.array([[line.attrs['xml:lang'], line.find('seg').get_text()] for line in tuv_lines])
tuv_line_sentences[0:6]

array([['en', 'EMEA/ H/ C/ 471'],
       ['fr', 'EMEA/ H/ C/ 471'],
       ['en', 'ABILIFY'],
       ['fr', 'ABILIFY'],
       ['en', 'What is Abilify?'],
       ['fr', "Qu’ est -ce qu'Abilify?"]],
      dtype='<U593')

Since we want to split the TMX file into one file per language, we need to find the total number of languages present:

In [152]:
list(set(tuv_line_sentences[:,0]))

['en', 'fr']

We can then filter to get just the English sentences, say, using a boolean mask on the language column:

In [153]:
tuv_line_sentences[tuv_line_sentences[:,0] == 'en'][:,1][0:10]

array(['EMEA/ H/ C/ 471', 'ABILIFY', 'What is Abilify?',
       'Abilify is a medicine containing the active substance aripiprazole.',
       'It is available as 5 mg, 10 mg, 15 mg and 30 mg tablets, as 10 mg, 15 mg and 30 mg orodispersible tablets (tablets that dissolve in the mouth), as an oral solution (1 mg/ ml) and as a solution for injection (7.5 mg/ ml).',
       'What is Abilify used for?',
       'Abilify is used to treat adults with the following mental illnesses: • schizophrenia, a mental illness with a number of symptoms, including disorganised thinking and speech, hallucinations (hearing or seeing things that are not there), suspiciousness and delusions (mistaken beliefs); • bipolar I disorder, a mental illness in which patients have manic episodes (periods of abnormally high mood), alternating with periods of normal mood.',
       'They may also have episodes of depression.',
       'Abilify is used to treat moderate to severe manic episodes and to prevent manic episodes 
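One thing to keep in mind with the NumPy approach: a string array with a dtype such as `<U593` stores every element at the width of the longest sentence, which may become memory-hungry on large corpora. As a minimal alternative sketch, assuming the same (language, sentence) pairs as above, the sentences can be grouped in a plain dictionary. The `group_by_language` helper is hypothetical and not part of the original code.

```python
from collections import defaultdict

def group_by_language(pairs):
    """Group (language, sentence) pairs into {language: [sentences]}."""
    grouped = defaultdict(list)
    for lang, sentence in pairs:
        # sentences keep their original file order within each language
        grouped[lang].append(sentence)
    return dict(grouped)

pairs = [
    ('en', 'ABILIFY'),
    ('fr', 'ABILIFY'),
    ('en', 'What is Abilify?'),
    ('fr', "Qu’ est -ce qu'Abilify?"),
]
grouped = group_by_language(pairs)
print(sorted(grouped))   # ['en', 'fr']
print(grouped['en'])     # ['ABILIFY', 'What is Abilify?']
```

The dictionary keys double as the list of detected languages, so no separate `set()` call is needed.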

Putting all of these things together, we can write the below function. 

[Note: This code used to work by assuming sentences in each language were in a consistent ordering throughout the file (e.g. ['en', 'fr', 'es', 'en', 'fr', 'es', ...]), but I realised this isn't actually an enforced standard of any sort, so it now extracts monolingual sentences by explicitly searching for the language tags.]

In [154]:
def split_into_separate_files(doc_soup, encoding='utf8', monolingual_path='split_monolingual_files'):
    """
    Takes soup object, detects languages, extracts sentences, 
    outputs separate text file per language. 
    """
    
    # look for <tuv> tags to find sentence text in the doc soup
    print('Finding and organising language tags and sentences in the TMX file...')
    sentence_lines = doc_soup.find_all('tuv')
    
    # get array of language tag and sentence text arrays
    try:
        langs_sentences = np.array([[line.attrs['xml:lang'], line.find('seg').get_text()] for line in sentence_lines])
    except KeyError:
        langs_sentences = np.array([[line.attrs['lang'], line.find('seg').get_text()] for line in sentence_lines])

    # find all langs present
    languages = list(set(langs_sentences[:,0]))
    print('Detected {0} languages in file: {1}'.format(len(languages), languages))
    
    # make dir to store split files if it doesn't already exist
    if not os.path.exists(monolingual_path):
        os.makedirs(monolingual_path)
    
    # iterate over all languages in the file
    for lang in languages:

        # get all sentences in a given language and
        # put each sentence on a new line
        sentences_in_given_lang = langs_sentences[langs_sentences[:,0] == lang][:,1]
        text = '\n'.join('{}'.format(sentence) for sentence in sentences_in_given_lang)

        # write out files with sentences
        out_file = os.path.join(monolingual_path, lang + '_sentences.txt')
        print('Writing split {0} sentences to {1}...'.format(lang, out_file))
        with codecs.open(out_file,'w',encoding=encoding) as file:
            file.write(text)

In this toy case, the output of the function would look like:

In [155]:
split_into_separate_files(soup)

Finding and organising language tags and sentences in the TMX file...
Detected 2 languages in file: ['en', 'fr']
Writing split en sentences to split_monolingual_files/en_sentences.txt...
Writing split fr sentences to split_monolingual_files/fr_sentences.txt...
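
Filtering by language tag keeps each language's sentences in file order, but the output files only stay line-aligned if every `<tu>` contains every language. The sketch below shows one way to guard against that. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup (a swapped-in parser, chosen so the example has no dependencies), and the `aligned_rows` helper is hypothetical. Missing translations are padded with an empty string.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def aligned_rows(tmx_source):
    """Yield one {language: sentence} dict per <tu> element."""
    for _, elem in ET.iterparse(tmx_source, events=('end',)):
        if elem.tag != 'tu':
            continue
        row = {}
        for tuv in elem.iter('tuv'):
            lang = tuv.get(XML_LANG) or tuv.get('lang')
            seg = tuv.find('seg')
            row[lang] = ''.join(seg.itertext()) if seg is not None else ''
        yield row
        elem.clear()  # drop the children of units already processed

sample = io.BytesIO(b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header srclang="en"/><body>
<tu><tuv xml:lang="en"><seg>ABILIFY</seg></tuv><tuv xml:lang="fr"><seg>ABILIFY</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>What is Abilify?</seg></tuv></tu>
</body></tmx>""")

rows = list(aligned_rows(sample))
languages = sorted({lang for row in rows for lang in row})
# pad missing translations so every language has one line per <tu>
columns = {lang: [row.get(lang, '') for row in rows] for lang in languages}
print(columns)
```

Because the file is read incrementally, this approach may also help with the parse times seen on larger files, though that has not been measured here.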


# 2. Processing each monolingual file

### Loading monolingual files

First, we need to load up each file that we produced. Here is a function to load a single file: 

In [156]:
def load_monolingual_corpus(file):
    """
    Loads a monolingual corpus containing 
    one sentence per line in UTF-8 format.
    """
    print('-'*50)
    print('Working on file "{0}" of size {1:.2f}MB.'.format(file, os.path.getsize(file)/1000000))
    try:
        with codecs.open(file,'rb',encoding='utf8') as f:
            text = f.read()
    except UnicodeDecodeError:
        with codecs.open(file,'rb',encoding='utf16') as f:
            text = f.read()
    return(text)

Here, the function tries to load the file as UTF-8 and falls back to UTF-16 if that fails (there are probably smarter ways to guess the encoding). For example, the `fr_sentences.txt` file can be loaded like this:

In [157]:
text = load_monolingual_corpus(os.path.join('split_monolingual_files', 'fr_sentences.txt'))
text[0:200]

--------------------------------------------------
Working on file "split_monolingual_files/fr_sentences.txt" of size 0.00MB.


"EMEA/ H/ C/ 471\nABILIFY\nQu’ est -ce qu'Abilify?\nAbilify est un médicament qui contient le principe actif aripiprazole.\nIl est disponible sous la forme de comprimés de 5 mg, 10 mg, 15 mg et 30 mg, de c"

### Normalising white space and splitting corpus by sentence

We need a function to normalise the white space in a given corpus. We can compile a regex pattern that matches runs of spaces and tabs, while keeping the new line `\n` and carriage return `\r` characters out of the pattern so as not to interfere with the sentence splitting.

In [158]:
def whitespace_normalise_corpus(text):
    """
    Reduce whitespace in a corpus down to a single space; preserve new lines. 
    """
    print('Normalising white space in corpus...')
    pattern = re.compile(r"[ \t\f\v]+")
    clean = pattern.sub(" ", text) 
    return(clean)

In [50]:
text = whitespace_normalise_corpus(text)
text[0:200]

Normalising white space in corpus...


"EMEA/ H/ C/ 471\nABILIFY\nQu’ est -ce qu'Abilify?\nAbilify est un médicament qui contient le principe actif aripiprazole.\nIl est disponible sous la forme de comprimés de 5 mg, 10 mg, 15 mg et 30 mg, de c"

The difference isn't visible in this example, but given an input such as the test sentence below, the function collapses each run of spaces and tabs into a single space while leaving the line breaks alone:

In [51]:
test_sentence = ' this is    a test \t\t sentence \n   and another  sentence  \n ok   one more .  '
whitespace_normalise_corpus(test_sentence)

Normalising white space in corpus...


' this is a test sentence \n and another sentence \n ok one more . '
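
The character class used above covers ASCII spaces, tabs, form feeds and vertical tabs only. French text in particular often contains non-breaking spaces (`\u00a0`) before punctuation, which that pattern would leave untouched. A hedged variant, assuming we want to collapse any Unicode whitespace except line breaks, uses a negated class instead. The `normalise_unicode_whitespace` name is hypothetical.

```python
import re

# any whitespace character except carriage return and line feed
NON_NEWLINE_WHITESPACE = re.compile(r"[^\S\r\n]+")

def normalise_unicode_whitespace(text):
    """Collapse runs of non-newline whitespace into a single space."""
    return NON_NEWLINE_WHITESPACE.sub(" ", text)

sample = "Qu\u2019est-ce\u00a0\u00a0que \t c'est\u00a0?\nligne  deux"
print(repr(normalise_unicode_whitespace(sample)))
```

Whether non-breaking spaces should be flattened is a preprocessing choice; it changes the token counts for any language that uses them.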

You could extend this whitespace normalisation to delete any whitespace at the start or end of each line using the `str.strip()` method. First, let's split the corpus into sentences.

In [52]:
def split_corpus_into_sentences(text):
    """
    Split a corpus into sentences; returns a list
    with one string per sentence.
    """
    print('Finding sentences in the corpus...')
    sentences = text.split('\n')
    return(sentences)

In [53]:
sentence_arrays = split_corpus_into_sentences(text)

Finding sentences in the corpus...


Taking a look at the output:

In [54]:
sentence_arrays[0:5]

['EMEA/ H/ C/ 471',
 'ABILIFY',
 "Qu’ est -ce qu'Abilify?",
 'Abilify est un médicament qui contient le principe actif aripiprazole.',
 'Il est disponible sous la forme de comprimés de 5 mg, 10 mg, 15 mg et 30 mg, de comprimés orodispersibles (comprimés qui se dissolvent dans la bouche) de 10 mg, 15 mg et 30 mg, sous la forme d’ une solution buvable (1 mg/ ml) et sous la forme d’ une solution injectable (7,5 mg/ ml).']
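
Splitting on `'\n'` assumes Unix line endings. With Windows line endings each sentence would keep a trailing `\r` until the later `strip()` step removes it. `str.splitlines()` handles `\r\n`, but it also splits on less common separators such as `\u2028`, which could break the line alignment between language files. A small sketch that splits on the three common line endings only, with `split_lines_portably` as a hypothetical name:

```python
import re

# Windows (\r\n) must come first so it is not split twice
LINE_ENDINGS = re.compile(r"\r\n|\r|\n")

def split_lines_portably(text):
    """Split on Windows, classic Mac and Unix line endings only."""
    return LINE_ENDINGS.split(text)

print(split_lines_portably("one\r\ntwo\nthree\rfour"))
# ['one', 'two', 'three', 'four']

# a Unicode line separator inside a sentence does not cause a split
parts = split_lines_portably("kept\u2028together\nsplit")
print(len(parts))
# 2
```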

Looks good. And now a quick function to get rid of any whitespace lingering at the beginning and end of lines:

In [55]:
def remove_whitespace_padding(sentence_arrays):
    """
    Remove leading and trailing white spaces in sentences.
    """
    print('Removing any whitespace padding around sentences...')
    sentence_arrays = [sentence.strip() for sentence in sentence_arrays]
    return(sentence_arrays)

In [58]:
sentence_arrays = remove_whitespace_padding(sentence_arrays)

Removing any whitespace padding around sentences...


### Tokenising sentence arrays

Given an array containing a sentence, we can split by white space to create tokens. Note that this function only works on languages that use spaces to denote token boundaries (e.g. excludes Chinese, Japanese, Thai, Khmer etc.)

In [60]:
def tokenize_sentence_arrays(sentence_arrays):
    """
    Splitting sentence arrays on spaces. 
    """
    print('Tokenising sentence arrays...')
    sentence_tokens = [sentence.split() for sentence in sentence_arrays]
    return(sentence_tokens)

We can tokenise our corpus and see what the results look like.

In [61]:
tokenised_sentences = tokenize_sentence_arrays(sentence_arrays)
tokenised_sentences[0:4]

Tokenising sentence arrays...


[['EMEA/', 'H/', 'C/', '471'],
 ['ABILIFY'],
 ['Qu’', 'est', '-ce', "qu'Abilify?"],
 ['Abilify',
  'est',
  'un',
  'médicament',
  'qui',
  'contient',
  'le',
  'principe',
  'actif',
  'aripiprazole.']]
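
As the output shows, whitespace splitting leaves punctuation attached to tokens (`'aripiprazole.'`, `"qu'Abilify?"`). If that matters for the token counts, one possible refinement is a regex tokeniser that emits punctuation marks as separate tokens. This is only a sketch, not a substitute for a proper language-aware tokeniser, and `tokenise_with_punctuation` is a hypothetical name.

```python
import re

# a run of word characters, or a single non-space, non-word character
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenise_with_punctuation(sentence):
    """Split a sentence into words and individual punctuation marks."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenise_with_punctuation("Abilify est un médicament."))
# ['Abilify', 'est', 'un', 'médicament', '.']
```

Like the whitespace version, this still assumes a language that separates words with spaces.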

### Some other cleaning functions of potential interest

Functions like these could be integrated later as optional preprocessing steps. 

In [62]:
def standardise_case(tokens):
    """
    Convert all tokens to lower case. 
    """
    print('Standardising case...')
    tokens = [token.lower() for token in tokens]
    return(tokens)

In [63]:
def remove_punctuation(tokens):
    """
    Removing punctuation.
    """
    print('Removing punctuation...')
    pattern = re.compile(r'[\W_]+', re.UNICODE)  # raw string avoids an invalid-escape warning
    tokens = [pattern.sub('', token) for token in tokens]
    return(tokens)
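
To make the behaviour concrete, here is how the two steps might be applied to a single tokenised sentence. The sketch restates the same logic without the progress messages so that it stands alone. Note that tokens consisting only of punctuation become empty strings, so the hypothetical `clean_tokens` helper drops them.

```python
import re

# same character class as remove_punctuation: non-word characters and underscores
PUNCTUATION = re.compile(r"[\W_]+")

def clean_tokens(tokens):
    """Lower-case tokens, strip non-alphanumeric characters, drop empties."""
    lowered = [token.lower() for token in tokens]
    stripped = [PUNCTUATION.sub('', token) for token in lowered]
    return [token for token in stripped if token]

print(clean_tokens(['EMEA/', 'H/', 'C/', '471', '•', "qu'Abilify?"]))
# ['emea', 'h', 'c', '471', 'quabilify']
```

The last token shows a side effect worth knowing about: removing the apostrophe merges the elided `qu'` with the following word.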

### Putting together all the monolingual corpus processing functions

The previous preprocessing functions can be united into one pipeline:

In [66]:
def process_monolingual_corpus(text, whitespace_normalise=True, whitespace_padding_normalise=True, 
                               standardise_case=False, remove_punctuation=False):
    """
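    Cleans a loaded corpus using a variety of options.
    Returns the list of sentences and the list of tokenised sentences.
    Note: the standardise_case and remove_punctuation flags are
    placeholders and are not applied yet.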
    """
    
    if whitespace_normalise:
        text = whitespace_normalise_corpus(text)
    
    sentence_arrays = split_corpus_into_sentences(text)
    
    if whitespace_padding_normalise:
        sentence_arrays = remove_whitespace_padding(sentence_arrays)
    
    tokenised_sentences = tokenize_sentence_arrays(sentence_arrays)
        
    return(sentence_arrays, tokenised_sentences)  

That should do the job of cleaning up a monolingual corpus to make it ready for summarisation. Trying this out on the tiny example:

In [103]:
text = load_monolingual_corpus(os.path.join('split_monolingual_files', 'fr_sentences.txt'))
sentence_arrays, tokenised_sentences = process_monolingual_corpus(text)

--------------------------------------------------
Working on file "split_monolingual_files/fr_sentences.txt" of size 0.00MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any whitespace padding around sentences...
Tokenising sentence arrays...


# 3. Computing summary statistics on corpora

The task asks for some simple summary statistics on each corpus. These are easy to compute from the sentence arrays and tokenised sentences created above.

In [69]:
def write_summary_report(sentence_arrays, tokenised_sentences):
    """
    Compute simple summary statistics on corpus. 
    """
    print('Generating report...')
    flat_tokens = [token for sentence in tokenised_sentences for token in sentence]

    print('Number of sentences in this corpus: {}'.format(len(sentence_arrays)))
    print('Number of unique sentences in this corpus: {}'.format(len(set(sentence_arrays))))
    print('Number of total tokens: ' + str(len(flat_tokens)))
    print('Number of unique tokens: ' + str(len(set(flat_tokens))) + '\n')

In [70]:
write_summary_report(sentence_arrays, tokenised_sentences)

Generating report...
Number of sentences in this corpus: 24
Number of unique sentences in this corpus: 24
Number of total tokens: 482
Number of unique tokens: 236



**NB:** The number of unique sentences/tokens will depend heavily on preprocessing choices (case standardisation, stemming, lemmatisation, removing non-alphanumeric characters, removing punctuation etc.)
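
A tiny illustration of that dependence, using made-up tokens rather than real corpus data:

```python
tokens = ['Abilify', 'ABILIFY', 'abilify', 'is', 'Is', 'tablets.', 'tablets']

# no preprocessing: every surface form counts separately
unique_raw = set(tokens)
# case standardisation only
unique_lower = set(token.lower() for token in tokens)
# case standardisation plus stripping surrounding punctuation
unique_clean = set(token.lower().strip('.,;:!?') for token in tokens)

print(len(unique_raw), len(unique_lower), len(unique_clean))
# 7 4 3
```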

### Processing every file in the directory

It would be good to be able to apply this processing to every file in the directory generated to hold the monolingual aligned files. This can be done just by iterating over files in the directory and processing each file in turn:

In [159]:
def process_monolingual_files(directory='split_monolingual_files'):
    """
    Apply the processing pipeline on all files in a directory.  
    """
    for file in os.listdir(directory):
        # join with os.path.join so the directory doesn't need a trailing separator
        if not file.startswith('.') and os.path.getsize(os.path.join(directory, file))>0:
            text = load_monolingual_corpus(os.path.join(directory, file))
            sentence_arrays, tokenised_sentences = process_monolingual_corpus(text)
            write_summary_report(sentence_arrays, tokenised_sentences)

In [160]:
process_monolingual_files(os.path.join('split_monolingual_files', ''))

--------------------------------------------------
Working on file "split_monolingual_files/en_sentences.txt" of size 0.00MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any whitespace padding around sentences...
Tokenising sentence arrays...
Generating report...
Number of sentences in this corpus: 24
Number of unique sentences in this corpus: 24
Number of total tokens: 396
Number of unique tokens: 199

--------------------------------------------------
Working on file "split_monolingual_files/fr_sentences.txt" of size 0.00MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any whitespace padding around sentences...
Tokenising sentence arrays...
Generating report...
Number of sentences in this corpus: 24
Number of unique sentences in this corpus: 24
Number of total tokens: 482
Number of unique tokens: 236



# 4. Example output: Running the functions

Let's see how the above functions run on raw TMX files all together. 

I've noticed that TMX files can have quite a few different formats, so I've tried to be general and test on some different data sources, but I'm sure there's more work to be done here to handle edge cases and improve generalisation. 

The following TMX files were used:

1. Your example file, from the European Medicines Agency (EMEA) parallel [corpus](http://opus.nlpl.eu/EMEA.php) containing French/English sentences
2. The Bulgarian-Hungarian corpus (48.3MB) from the Europarl [dataset](http://opus.nlpl.eu/Europarl.php). It appears that something strange has happened with this file; maybe the curators have mislabelled the languages. I am taking their data as given and not attempting to fix it.
3. An 11-language [dataset](https://data.europa.eu/euodp/data/dataset/dgt-translation-memory) from the Acquis Communautaire, the body of European legislation

### 1. Running the code on the EMEA corpus

In [75]:
tmx_file = 'en-fr.tmx'
filestring, _ = os.path.splitext(tmx_file)

doc_soup = extract_data_from_tmx(tmx_file)
split_into_separate_files(doc_soup, encoding='utf8', monolingual_path=filestring+'_split_monolingual_files')
process_monolingual_files(os.path.join(filestring+'_split_monolingual_files', ''))

Extracting data from "en-fr.tmx" (115.13MB)... this might take 1-2 mins...
Data extracted in 74.35 seconds.
Detected 2 languages in file: ['en', 'fr']
Writing split en sentences to en-fr_split_monolingual_files/en_sentences.txt...
Writing split fr sentences to en-fr_split_monolingual_files/fr_sentences.txt...
--------------------------------------------------
Working on file "en-fr_split_monolingual_files/en_sentences.txt" of size 34.54MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any whitespace padding around sentences...
Tokenising sentence arrays...
Generating report...
Number of sentences in this corpus: 373152
Number of unique sentences in this corpus: 290498
Number of total tokens: 5361232
Number of unique tokens: 133845

--------------------------------------------------
Working on file "en-fr_split_monolingual_files/fr_sentences.txt" of size 41.74MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any white

### 2. Running the code on the Europarl Bulgarian-Hungarian corpus

In [81]:
tmx_file = 'bg-hu.tmx'
filestring, _ = os.path.splitext(tmx_file)

doc_soup = extract_data_from_tmx(tmx_file)
split_into_separate_files(doc_soup, encoding='utf8', monolingual_path=filestring+'_split_monolingual_files')
process_monolingual_files(os.path.join(filestring+'_split_monolingual_files', ''))

Extracting data from "bg-hu.tmx" (207.61MB)... this might take 1-2 mins...
Data extracted in 112.63 seconds.
Finding and organising language tags and sentences in the TMX file...
Detected 2 languages in file: ['hu', 'bg']
Writing split hu sentences to bg-hu_split_monolingual_files/hu_sentences.txt...
Writing split bg sentences to bg-hu_split_monolingual_files/bg_sentences.txt...
--------------------------------------------------
Working on file "bg-hu_split_monolingual_files/bg_sentences.txt" of size 104.97MB.
Normalising white space in corpus...
Finding sentences in the corpus...
Removing any whitespace padding around sentences...
Tokenising sentence arrays...
Generating report...
Number of sentences in this corpus: 370236
Number of unique sentences in this corpus: 368042
Number of total tokens: 8675621
Number of unique tokens: 187145

--------------------------------------------------
Working on file "bg-hu_split_monolingual_files/hu_sentences.txt" of size 64.51MB.
Normalising white 

### 3. Running the code on the 11-language corpus from the Acquis Communautaire

In [144]:
tmx_file = 'DGT-TM.tmx'
filestring, _ = os.path.splitext(tmx_file)

doc_soup = extract_data_from_tmx(tmx_file)
split_into_separate_files(doc_soup, encoding='utf16', monolingual_path=filestring+'_split_monolingual_files')
process_monolingual_files(os.path.join(filestring+'_split_monolingual_files', ''))

Extracting data from "DGT-TM.tmx" (0.29MB)... this might take 1-2 mins...
Data extracted in 0.13 seconds.
Finding and organising language tags and sentences in the TMX file...
Detected 11 languages in file: ['EL-01', 'DE-DE', 'ES-ES', 'PT-PT', 'NL-NL', 'DA-01', 'FR-FR', 'FI-01', 'IT-IT', 'SV-SE', 'EN-GB']
Writing split EL-01 sentences to DGT-TM_split_monolingual_filesEL-01_sentences.txt...
Writing split DE-DE sentences to DGT-TM_split_monolingual_filesDE-DE_sentences.txt...
Writing split ES-ES sentences to DGT-TM_split_monolingual_filesES-ES_sentences.txt...
Writing split PT-PT sentences to DGT-TM_split_monolingual_filesPT-PT_sentences.txt...
Writing split NL-NL sentences to DGT-TM_split_monolingual_filesNL-NL_sentences.txt...
Writing split DA-01 sentences to DGT-TM_split_monolingual_filesDA-01_sentences.txt...
Writing split FR-FR sentences to DGT-TM_split_monolingual_filesFR-FR_sentences.txt...
Writing split FI-01 sentences to DGT-TM_split_monolingual_filesFI-01_sentences.txt...
Writi

# Packaging the above code into a stand-alone programme

The standalone Python script is basically just a copy and paste of the above functions, with a few additions:

+ simple argument parsing from command line input and checking for input validity (see below)
+ writing a main function to execute the processing script if the script is run alone (otherwise, the functions can just be imported for other purposes)


There are all sorts of ways a user could provide a bad argument. We can check for a few of the obvious ones:

In [89]:
def check_tmx_file_CLI(file):
    """
    Checking if the file provided as a CLI argument
    is ok to proceed with; exits with a message if not.
    """
    
    # check user has provided a file as input
    # (sys.exit is used instead of assert: assert statements are skipped
    # under "python -O", and print() as the assert message evaluates to None)
    if len(sys.argv) != 2:
        sys.exit("Please provide the name of one TMX file to process.")

    # check file actually exists
    if not os.path.exists(file):
        sys.exit("A file with that name cannot be found.")

    # check that file is nonempty
    if os.path.getsize(file) == 0:
        sys.exit("This file doesn't contain anything.")

    # check if file extension is .tmx
    filestring, extension = os.path.splitext(file)
    if extension != '.tmx':
        sys.exit("Unsuitable file. File extension must be .tmx")

    return(filestring)
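
The argument parsing and main function mentioned above are not shown in this notebook. As an assumption about what they might look like, here is a sketch using the standard library's `argparse`. The `build_parser` and `main` names and the wiring are illustrative only, and the calls to the notebook's processing functions are left as comments.

```python
import argparse
import os

def build_parser():
    """Hypothetical command-line interface: one positional TMX path."""
    parser = argparse.ArgumentParser(
        description='Split a TMX file into monolingual files and summarise them.')
    parser.add_argument('tmx_file', help='path to the .tmx file to process')
    return parser

def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], as in a real script
    args = build_parser().parse_args(argv)
    filestring, extension = os.path.splitext(args.tmx_file)
    if extension.lower() != '.tmx':
        raise SystemExit('Unsuitable file. File extension must be .tmx')
    # The notebook's functions would be called here, e.g.:
    # doc_soup = extract_data_from_tmx(args.tmx_file)
    # split_into_separate_files(doc_soup, monolingual_path=filestring + '_split_monolingual_files')
    # process_monolingual_files(filestring + '_split_monolingual_files')
    return filestring

# passing the arguments explicitly keeps the example independent of sys.argv
print(main(['small_sample.tmx']))
# small_sample
```

In a script, the call would normally sit under an `if __name__ == '__main__':` guard as `main()`, so that importing the module does not trigger any processing.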

# 5. Technical details

The code was originally developed using Python 3.5.2 on macOS Sierra 10.12. 

### Running on a Mac

Create a virtual environment with:

`virtualenv tmx_env`

Activate the virtualenv:

`source tmx_env/bin/activate`

Install package requirements:

`python3 -m pip install -r requirements.txt`

Run code:

`python parse_tmx.py small_sample.tmx`

Note: sometimes the lxml installation is wonky and you get a `bs4.FeatureNotFound: Couldn't find a tree builder` error. In this case, `pip uninstall lxml` followed by `python3 -m pip install lxml` fixes it.

When done, you can deactivate the env:

`deactivate`


### Running on a Windows machine

Sorry about that, I didn't have time for this.

# 6. Future work

This was intended to be a quick piece of work, but with a bit more time, I would work on:
+ Ensuring Windows compatibility (mostly checking file pathing with os.path and new line/carriage return characters)
+ Testing on more .tmx files to ensure generalisability
+ Looking into speed optimisations
+ Examining more edge cases
+ Potential conversion to OOP, especially if this is the paradigm in the group, rather than this more procedural approach. A corpus and a sentence could be conceptualised as two classes with associated preprocessing methods
+ Dockerising for easier transportability
+ Support for automatic encoding detection on monolingual files
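
For the last item, a low-cost first step could be to look for a byte-order mark (BOM) before falling back to trial and error. Files written with Python's `utf16` codec, as in the DGT-TM example above, start with one. This is a sketch under that assumption: the `guess_encoding` helper is hypothetical, it ignores UTF-32, and a file without a BOM simply gets the default.

```python
import codecs

# byte-order marks paired with a codec that consumes the BOM when decoding
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def guess_encoding(raw_bytes, default='utf-8'):
    """Guess an encoding from a leading byte-order mark, if any."""
    for bom, encoding in BOMS:
        if raw_bytes.startswith(bom):
            return encoding
    return default

raw = 'médicament'.encode('utf-16')   # the utf-16 codec prepends a BOM
encoding = guess_encoding(raw)
print(encoding, raw.decode(encoding))
# utf-16 médicament
```

In practice only the first few bytes of a file need to be read in binary mode to make this check.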