# Segmentation and Alignment of Ebooks
In this exercise, we will segment an ebook and its translation, then align the segments to create a translation memory. Rather than aligning by position or length alone, we will predict the most probable alignment using BLEU scores. Each source segment is machine translated, and the machine translation is compared against the segments of the translated ebook. When the machine translation of a source segment gets a high BLEU score against a target segment, the two segments likely correspond.
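
To make the idea concrete, here is a minimal sketch of matching by n-gram overlap. It does not compute real BLEU (which combines several n-gram orders and a brevity penalty); it uses a simplified bigram precision instead. The helper names `ngram_precision` and `best_match` and the sample sentences are invented for illustration and are not part of any tool used later.

```python
from collections import Counter


def ngram_precision(candidate, reference, n=2):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each n-gram count by how often it appears in the reference
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


def best_match(machine_translation, target_segments):
    """Return the index of the target segment that best overlaps the MT."""
    scores = [ngram_precision(machine_translation, t) for t in target_segments]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented data: the MT of one source segment and three candidate target segments
mt = "el gato se sentó en la alfombra"
targets = [
    "El perro ladró toda la noche.",
    "El gato se sentó en la alfombra roja.",
    "Nadie respondió a la puerta.",
]
best_index = best_match(mt, targets)
```

Here `best_index` points at the second candidate, the only one that shares bigrams with the machine translation. Bleualign applies the same intuition with a proper BLEU variant and a search over the whole document.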

# Download Calibre
From Calibre's website: "calibre is a powerful and easy to use e-book manager. Users say it’s outstanding and a must-have. It’ll allow you to do nearly everything and it takes things a step beyond normal e-book software. It’s also completely free and open source and great for both casual users and computer experts."
Download and install Calibre [here](https://calibre-ebook.com/download).

# Import and Convert Ebooks to TXT
Press `a` to add a book to Calibre and select the book you'd like to add. Its title and details will appear in the list. Select the book and click **Convert books**. In the top right of the conversion screen, change the **Output format** to TXT, then click **OK** and wait for the job to complete. The available formats are displayed in the preview panel on the right, where you can also use **Click to open** next to **Path**. Repeat these steps for each ebook: import it into Calibre and convert it to TXT.

# Prepare for Segmentation
If you have git, clone the srx_segmenter GitHub repo with this command:

`git clone https://github.com/nicklambson/srx_segmenter.git`

If you don't have git, just [download](https://github.com/nicklambson/srx_segmenter/archive/master.zip) the zip file and extract it into your project folder (the same directory as your Python script).

Also be sure to install `regex`, an alternative to the standard library's `re` module.

`pip install regex`

If you need to install `regex` in your jupyter notebook, add this block of code to your notebook and run it:

    import sys
    !{sys.executable} -m pip install regex
    
The srx segmentation rules are stored locale-by-locale in the 0_srx_rules folder. They were downloaded from memoQ.

# Script Setup

In [2]:
from segmentation_tools import parse, SrxSegmenter

# Set the name of the text file you'd like to segment.
# Forward slashes work on Windows, macOS, and Linux.
TEXT = r"3_extracted_text/Como ganar amigos e influir sob - Dale Carnegie.txt"

# Set the locale of the file
# This will be used in the segmented filename
# It is also important in order to locate the SRX file for that locale

LOCALE = "es"
LOCALE_MAP = {"en_US": "English (United States)",
              "zh_CN": "Chinese (PRC)",
              "es": "Spanish"}

# Locale code for your segmentation rules
SRX_PATH = r"0_srx_rules/" + LOCALE + ".srx"

# Parse the srx file
SRX = parse(srx_filepath=SRX_PATH)

# Get the language rules from the SRX file
LANGUAGE_RULES = SRX[LOCALE_MAP[LOCALE]]

# The filename of your resulting segmented file
SEGMENTED_FILEPATH = r"4_segmented_text/" + "segmented_" + LOCALE + ".txt"
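
The values in `LOCALE_MAP` have to match the language rule names inside the SRX file, otherwise the lookup `SRX[LOCALE_MAP[LOCALE]]` will presumably fail. The sketch below lists those names using only the standard library. It assumes the file follows SRX 2.0, where each `<languagerule>` element carries a `languagerulename` attribute; the helper name `list_language_rule_names` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET


def list_language_rule_names(srx_text):
    """Return the languagerulename of every <languagerule> in an SRX document."""
    root = ET.fromstring(srx_text)
    names = []
    for element in root.iter():
        # Strip any XML namespace, e.g. "{http://www.lisa.org/srx20}languagerule"
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "languagerule":
            names.append(element.get("languagerulename"))
    return names


# Tiny invented SRX document used only to demonstrate the helper
SAMPLE_SRX = """<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <body>
    <languagerules>
      <languagerule languagerulename="Spanish">
        <rule break="yes">
          <beforebreak>[.?!]</beforebreak>
          <afterbreak>\\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
  </body>
</srx>"""

rule_names = list_language_rule_names(SAMPLE_SRX)
```

To inspect a real rules file, you could read the text of `SRX_PATH` and pass it to the helper, then compare the printed names with the values in `LOCALE_MAP`.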

# Begin Segmentation
We pass the language rules to the segmenter along with the content. It returns to us both segments and whitespace, but we will only use the segments.

In [3]:
# Open the file to read from and the file to write to
with open(TEXT, "r", encoding="utf-8") as r:
    with open(SEGMENTED_FILEPATH, "w", encoding="utf-8") as w:
        
        # Read the content of the text file
        content = r.read()
        segmenter = SrxSegmenter(LANGUAGE_RULES, content)
        
        # Segment the text into segments and whitespace
        segments, whitespace = segmenter.extract()
        
        # Write the segments to a new file
        for s in segments:
            w.write(s + "\n")

# Data Cleaning
Open your segmented files. The segmentation is probably not perfect. Are there any ways you would improve the segmentation rules? If you can figure it out, you can add some rules to the srx file and improve the segmentation.  

You may notice some extra text or recurring patterns that make the text a little unclean. You can remove those by searching in the text with regular expressions. Clean up the files as much as possible before moving to the next step.
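
As a starting point for that cleanup, here is a minimal sketch that drops lines matching boilerplate patterns and tidies whitespace in the remaining lines. The patterns shown (bare page numbers and a running header) are hypothetical; inspect your own files to find what actually recurs. Note that this sketch also drops empty lines, which you may not want if you rely on them as paragraph markers.

```python
import re


def clean_lines(lines, boilerplate_patterns):
    """Drop lines matching any boilerplate pattern and tidy whitespace in the rest."""
    compiled = [re.compile(p) for p in boilerplate_patterns]
    cleaned = []
    for line in lines:
        # Collapse runs of spaces and tabs, then trim the ends
        text = re.sub(r"[ \t]+", " ", line).strip()
        if not text:
            continue
        # Skip lines that consist entirely of a boilerplate pattern
        if any(p.fullmatch(text) for p in compiled):
            continue
        cleaned.append(text)
    return cleaned


# Hypothetical patterns: bare page numbers and a repeated running header
PATTERNS = [r"\d+", r"Chapter \d+ - Page \d+"]

sample = [
    "How to  win friends.",
    "42",
    "Chapter 3 - Page 17",
    "   ",
    "Smile.\t",
]
result = clean_lines(sample, PATTERNS)
```

Using `fullmatch` means a pattern only removes a line when it covers the whole line, so a sentence that merely contains a number is kept.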

# Machine Translate the Source Text

Next, we are going to generate a machine translation of the source text, line by line. After we machine translate the source text, we can get the BLEU score of the MT and use it to predict which segments should be aligned together.

First, go to the [Baidu Translation API](http://api.fanyi.baidu.com/api/trans/product/desktop) to sign up for an API key. Paste your APP_ID and SECRET_KEY in the appropriate variables below.

[Languages Available](http://api.fanyi.baidu.com/api/trans/product/apidoc#languageList)

[API Reference](http://api.fanyi.baidu.com/api/trans/product/apidoc)

# Import Modules

In [None]:
import http.client
import hashlib
import urllib.parse
import random
import traceback
import json
import time

# Set App ID and Secret Key

In [None]:
# Replace with your own App ID and Secret Key
APP_ID = '1234567890'
SECRET_KEY = '1234567890'

BASE_URL = 'api.fanyi.baidu.com'
API_URL = r'/api/trans/vip/translate'

# Set up the Translate Function
The function below waits up to 20 seconds before an HTTP request times out; change the `timeout` value if you'd like. If a request fails, the traceback is printed and the function returns `None`.

In [None]:
def translate(from_lang, to_lang, query_text):
    """Return the machine translation of query_text, or None if the request fails."""
    http_client = None
    url = get_url(from_lang, to_lang, query_text)
    try:
        http_client = http.client.HTTPConnection(BASE_URL, timeout=20)
        http_client.request('GET', url)
        response = http_client.getresponse()
        result = json.loads(response.read())
        # Print the raw response so API errors are visible while the job runs
        print(result)
        # Join the translated pieces into a single string
        return "".join(ret["dst"] for ret in result["trans_result"])
    except Exception:
        # Covers network failures and responses without a "trans_result" key
        traceback.print_exc()
        return None
    finally:
        if http_client:
            http_client.close()

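# Set up the get_url Function
This builds the request URL that is sent to the HTTP client, including the random salt and the MD5 signature computed from the App ID, the query text, the salt, and the secret key. Just leave the settings as they are.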

In [None]:
def get_url(from_lang, to_lang, query_text):
    # The signature is the MD5 hash of APP_ID + query + salt + SECRET_KEY
    salt = str(random.randint(32768, 65536))
    sign = hashlib.md5((APP_ID + query_text + salt + SECRET_KEY).encode("utf-8")).hexdigest()
    url = (API_URL + '?appid=' + APP_ID
           + '&q=' + urllib.parse.quote(query_text)
           + '&from=' + from_lang + '&to=' + to_lang
           + '&salt=' + salt + '&sign=' + sign)
    return url

# Do the Machine Translation

In [None]:
# Note that the language code for Spanish for Baidu translate is spa, not es
FROM = "en"
TO = "spa"
SOURCE_FILE = r"4_segmented_text/segmented_en_US.txt"
MT_FILEPATH = r"5_machine_translation/machineTranslated_" + TO + ".txt"

# Change the limit to stop the process after this many lines
count = 0
limit = 2000

with open(SOURCE_FILE, "r", encoding="utf-8") as f, \
     open(MT_FILEPATH, "w", encoding="utf-8") as w:
    for line in f:
        # The translation stops once the limit has been reached.
        if count >= limit:
            break
        count += 1

        text = line.rstrip("\n")
        if not text:
            # Keep empty lines so line numbers stay in sync with the source
            w.write("\n")
            continue

        trans_result = translate(FROM, TO, text)
        # translate() returns None on failure; write an empty line in that case
        w.write((trans_result or "") + "\n")

        # Sleep for 5 seconds between translation requests
        # This helps avoid the "invalid access limit" error
        time.sleep(5)
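
If some requests still fail despite the pause, one option is to retry them before giving up. The sketch below is an optional addition, not part of the original workflow. It assumes the translate function returns `None` on failure, and the name `translate_with_retry` and its parameters are hypothetical.

```python
import time


def translate_with_retry(translate_fn, from_lang, to_lang, text,
                         retries=3, wait_seconds=5, sleep=time.sleep):
    """Call translate_fn until it returns a string or the retries run out."""
    for attempt in range(retries):
        result = translate_fn(from_lang, to_lang, text)
        if result is not None:
            return result
        # Wait longer after each failed attempt before trying again
        sleep(wait_seconds * (attempt + 1))
    return None
```

In the loop above, `translate(FROM, TO, ...)` could then be replaced with `translate_with_retry(translate, FROM, TO, ...)`. The `sleep` parameter exists only so the waiting can be stubbed out when testing.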

# Bleualign
Let's install bleualign. Don't use the bleualign package from PyPI, because it's not up to date. Instead, install it with pip directly from GitHub.

`pip install git+https://github.com/rsennrich/Bleualign.git`

Paste the filenames of your source file, translated file, and machine translation file into the options below. `srctotarget` is the option for the machine translated file. `output-src` and `output-target` are set here to the filenames that will receive the aligned source and target segments.

The last line of code writes the results to the aligned source and target filenames.

Copy three files into a folder called `6_for_alignment`: source segmented file, original segmented translation, and your machine translation.

If you limited your machine translation to 2000 lines, do the following to your files in `6_for_alignment`:
 - In the source file, delete every line after line 2000.
 - In the original segmented translation, locate the line that corresponds to the last line of your machine translation. This is a little tricky. Look for chapter numbers or proper names as hints, and use the source text to understand what the translation means.
 - Delete everything after that line. It might be line 1922, or line 2157; it won't match exactly. In any case, bleualign will do the work of aligning things.
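
The first step in the list above, trimming the source file to a fixed number of lines, can be scripted rather than done by hand. This is a minimal sketch; the function name `keep_first_lines` is hypothetical. It writes to a separate output file so the original stays untouched.

```python
def keep_first_lines(in_path, out_path, max_lines):
    """Copy at most max_lines lines from in_path to out_path; return the number copied."""
    copied = 0
    with open(in_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if copied >= max_lines:
                break
            dst.write(line)
            copied += 1
    return copied
```

For example, `keep_first_lines("4_segmented_text/segmented_en_US.txt", "6_for_alignment/segmented_en_US.txt", 2000)` would produce the trimmed source file. Finding the matching cut-off line in the translation still has to be done by eye.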

Run the code below in your Python IDE. It doesn't work in Jupyter Notebook. Bleualign will calculate the BLEU score of the machine translated segments and will use that score to predict more accurate alignments. The aligned source and target segments are written to separate files, which should have the same line count so that each line in one file corresponds to the same line in the other.

In [None]:
import os
from bleualign.align import Aligner

options = {'srcfile': r'6_for_alignment/segmented_en_US.txt',
           'targetfile': r'6_for_alignment/segmented_es.txt',
           'srctotarget': [r'6_for_alignment/machineTranslated_spa.txt'],
           'targettosrc': [],
           'output-src': 'output_src.txt',
           'output-target': "output_tgt.txt", }

a = Aligner(options)
a.mainloop()
output_src, output_target = a.results()
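
Once the alignment has run, it is worth confirming that the two output files really do have the same number of lines before building anything from them. The sketch below reads both files and pairs them up line by line; the function name `read_aligned_pairs` is hypothetical, and the paths you pass in are whatever you set for `output-src` and `output-target`.

```python
def read_aligned_pairs(src_path, tgt_path):
    """Read two aligned files and return (source, target) pairs, line by line."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(tgt_path, "r", encoding="utf-8") as tgt:
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]
    # A mismatch means the files are not a line-for-line alignment
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line counts differ: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    return list(zip(src_lines, tgt_lines))
```

Raising an error on a mismatch is a deliberate choice: silently truncating to the shorter file would hide an alignment problem.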

# Create TMX
At this point, it should be possible to create a TMX. There are several ways to do it. One way is to paste the text from the two aligned files into adjacent columns in Excel, save the sheet as Unicode text, and then import the resulting tab-delimited file into Okapi Olifant. Finally, save the result as a translation memory.
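
Another way, sketched here as an alternative to the Excel and Olifant route, is to write the TMX directly from Python with the standard library. The sketch assumes you already have a list of (source, target) pairs; the function name `build_tmx`, the tool name in the header, and the sample pairs are invented for illustration. The header attributes follow the TMX 1.4 format.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 ElementTree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "ebook-alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plaintext",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # Skip pairs where either side is empty
        if not source.strip() or not target.strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


# Invented sample pairs; the second is skipped because its source is empty
tree = build_tmx(
    [("Smile.", "Sonría."), ("", "Huérfano")],
    src_lang="en-US",
    tgt_lang="es",
)
# To save: tree.write("aligned.tmx", encoding="utf-8", xml_declaration=True)
```

Skipping pairs with an empty side is one possible policy. You may prefer to review such pairs by hand before building the translation memory.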