# Practice Translating Similar Sentences

## Learn grammar rules by translating sentences that are very close to each other

### Set up your text to speech
To have text in your target language read aloud, your computer needs a text-to-speech voice for that language. To set this up, open your computer's settings, search for text to speech, and try to set the language to your target language. Your computer will prompt you to download the required files.

### Load the necessary libraries

In [None]:
# to store a corpus once it is computed
import pickle
# to select a random set of sentences to work with
import numpy as np
# to display the javascript interface
from IPython.display import HTML
# to find the pickled file to load
from os import path
# to read in the data downloaded from http://tatoeba.org/eng/downloads
import csv
# to filter out white space etc.
import re

## Download the necessary data

Download the following archives:
- http://downloads.tatoeba.org/exports/sentences.tar.bz2
- http://downloads.tatoeba.org/exports/links.tar.bz2

Extract them and save the contents as **`sentences.csv`** and **`links.csv`** in the notebook's working directory.
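
If you would rather extract the archives from Python than with a desktop tool, the standard `tarfile` module can do it. This is a minimal sketch: `extract_archive` is a hypothetical helper, not part of this notebook, and it assumes the archives were already downloaded into the working directory.

```python
import tarfile

def extract_archive(archive_path, target_dir="."):
    """Extract every member of a .tar.bz2 archive and return the member names."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names

# Usage (assumes both downloads sit next to the notebook):
# extract_archive("sentences.tar.bz2")
# extract_archive("links.tar.bz2")
```

Check the returned names: if an extracted file is not already called `sentences.csv` or `links.csv`, rename it before running the cells below.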

### Next select the languages you would like to translate between

The languages tested so far are:

- "eng" for English with 575759 sentences
- "tur" for Turkish with 378117 sentences
- "cmn" for Chinese (Mandarin) with 48906 sentences

For the full set of codes and number of sentences in each go to http://tatoeba.org/eng/stats/sentences_by_language

If you would like sentences in the source or target language read out to you, set **`voice_from_language`** or **`voice_to_language`** to **`True`**.

In [None]:
language_to_translate_from = "cmn"
language_to_translate_to = "eng"
voice_from_language = True
voice_to_language = False

### Next load the sentences for the language you selected

If you have already built the corpus, the program loads it from a pickled dump file.

If not, it builds the corpus from scratch using the data dumps from http://tatoeba.org/eng/downloads and saves it as a dump file for next time.
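
To make the input format concrete, here is a small sketch of how the two files relate. It assumes Python 3 and the layout the loader relies on: `sentences.csv` holds tab-separated rows of id, language code and text, and `links.csv` holds tab-separated pairs of ids. The sample rows and the `read_pairs` helper are made up for illustration.

```python
import csv
import io

# Illustrative rows: id <TAB> language code <TAB> text
sample_sentences = "1\tcmn\t我們試試看！\n2\teng\tLet's try something.\n3\tfra\tEssayons quelque chose !\n"
# Illustrative rows: sentence id <TAB> id of a sentence that translates it
sample_links = "1\t2\n2\t1\n1\t3\n"

def read_pairs(sentences_file, links_file, from_lang, to_lang):
    """Return (source text, target text) pairs for the two chosen languages."""
    by_id = {}
    for sent_id, lang, text in csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if lang in (from_lang, to_lang):
            by_id[sent_id] = (lang, text)
    pairs = []
    for id_1, id_2 in csv.reader(links_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        # skip links that point at a sentence in some other language
        if id_1 in by_id and id_2 in by_id:
            (lang_1, text_1), (lang_2, text_2) = by_id[id_1], by_id[id_2]
            if lang_1 == from_lang and lang_2 == to_lang:
                pairs.append((text_1, text_2))
    return pairs

pairs = read_pairs(io.StringIO(sample_sentences), io.StringIO(sample_links), "cmn", "eng")
print(pairs)
```

Only the link from sentence 1 to sentence 2 survives: the reverse link runs in the wrong direction, and sentence 3 is in a language that was not selected.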

In [None]:
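import sys  # only used to pick the right file mode for the csv module

class sentence:
    """A single sentence together with the ids of its translations."""
    def __init__(self, id, lang, text):
        self.id = id
        self.lang = lang
        self.text = text
        self.tokens = tokenize(self.text, lang)
        self.translations = []

class corpus_class:
    """Sentences in the two chosen languages, indexed by sentence id."""
    def __init__(self, from_lang, to_lang):
        self.to_lang = to_lang
        self.from_lang = from_lang
        self.to_lookup = {}
        self.from_lookup = {}

    def add_sentence(self, sent_id, lang, text):
        if lang == self.from_lang:
            self.from_lookup[sent_id] = sentence(sent_id, lang, text)
        if lang == self.to_lang:
            self.to_lookup[sent_id] = sentence(sent_id, lang, text)

    def add_translation(self, id_1, id_2):
        # record the link in whichever direction matches the language pair
        if (id_1 in self.from_lookup and id_2 in self.to_lookup):
            self.from_lookup[id_1].translations.append(id_2)
            self.to_lookup[id_2].translations.append(id_1)
        if (id_2 in self.from_lookup and id_1 in self.to_lookup):
            self.from_lookup[id_2].translations.append(id_1)
            self.to_lookup[id_1].translations.append(id_2)

    def clean_up(self):
        # drop sentences that have no translation in the other language;
        # copy the keys first so the dict is not modified while iterating
        for sent_id in list(self.to_lookup.keys()):
            if len(self.to_lookup[sent_id].translations) == 0:
                del self.to_lookup[sent_id]
        for sent_id in list(self.from_lookup.keys()):
            if len(self.from_lookup[sent_id].translations) == 0:
                del self.from_lookup[sent_id]

def tokenize(text, lang):
    if lang == 'cmn':
        if isinstance(text, bytes):
            # UTF-8 byte string (Python 2): assumes three bytes per character
            return [text[i:i + 3] for i in range(0, len(text) - len(text) % 3, 3)]
        # Unicode string (Python 3): one token per character
        return list(text)
    tokens = [t for t in re.split('[ \t.()"!?]+', text) if len(t) > 0]
    return tokens

def open_tsv(name):
    # Python 3's csv module needs text mode; Python 2's needs binary mode
    if sys.version_info[0] >= 3:
        return open(name, 'r', encoding='utf-8', newline='')
    return open(name, 'rb')

filename = '_'.join([language_to_translate_from, language_to_translate_to, 'corpus.dump'])
if path.isfile(filename):
    with open(filename, 'rb') as dump_file:
        corpus = pickle.load(dump_file)
else:
    corpus = corpus_class(language_to_translate_from, language_to_translate_to)
    # QUOTE_NONE assumes the exports are plain tab-separated text, so a
    # sentence that starts with a quote mark is not read as a quoted field
    with open_tsv('sentences.csv') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in csv_reader:
            sent_id, lang, text = row
            if lang != language_to_translate_from and lang != language_to_translate_to:
                continue
            corpus.add_sentence(sent_id, lang, text)
    with open_tsv('links.csv') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in csv_reader:
            id_1, id_2 = row
            # each link is expected in both directions; keep only one of them
            # (comparing the ids as strings is enough to pick one)
            if id_1 < id_2:
                corpus.add_translation(id_1, id_2)
    corpus.clean_up()
    with open(filename, 'wb') as dump_file:
        pickle.dump(corpus, dump_file, pickle.HIGHEST_PROTOCOL)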

### Next we have a script to randomly pick a set of sentences which are very similar to each other
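
"Similar" here means sharing many tokens with a seed sentence. The idea can be sketched in isolation as follows; `closest_sentences` and the sample token lists are hypothetical and stand in for the corpus objects used by the cell below.

```python
def closest_sentences(seed_tokens, candidates, limit=10):
    """Return the `limit` token lists that share the most tokens with the seed."""
    seed = set(seed_tokens)
    # score each candidate by the size of its overlap with the seed
    return sorted(candidates, key=lambda tokens: len(set(tokens) & seed), reverse=True)[:limit]

candidates = [
    ["I", "like", "tea"],
    ["I", "like", "green", "tea"],
    ["She", "reads", "books"],
    ["I", "drink", "tea"],
]
ranked = closest_sentences(["I", "like", "tea"], candidates, limit=3)
print(ranked)
```

Because the score is a raw count of shared tokens, a long sentence that happens to contain all of the seed's words ranks as high as the seed itself.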

In [None]:
def rand_practice():
    # start from a source sentence picked uniformly at random
    all_ids = list(corpus.from_lookup.keys())
    index = all_ids[np.random.randint(len(all_ids))]
    start_sentence = corpus.from_lookup[index]
    return practice(set(start_sentence.tokens))

def practice_sentence(start_string):
    # start from a seed sentence typed in by the user
    start_token_set = set(tokenize(start_string, language_to_translate_from))
    return practice(start_token_set)

def practice_short(max_length):
    # start from a random source sentence with at most max_length tokens
    short_keys = [s_id for s_id in corpus.from_lookup.keys() if len(corpus.from_lookup[s_id].tokens) <= max_length]
    index = short_keys[np.random.randint(len(short_keys))]
    start_sentence = corpus.from_lookup[index]
    return practice(set(start_sentence.tokens))

def practice(start_token_set):
    global all_pairs, current_pair, first_try
    # rank every source sentence by how many tokens it shares with the seed
    # and keep the ten closest
    key_fun = lambda s: len(set(s.tokens) & start_token_set)
    close_list = sorted(corpus.from_lookup.values(), key=key_fun, reverse=True)[:10]

    # each pair is (list of accepted translations, source sentence)
    all_pairs = []
    for s in close_list:
        translations = [corpus.to_lookup[trans_id] for trans_id in s.translations]
        all_pairs.append(([' '.join(t.tokens) for t in translations], s))

    if len(all_pairs) == 0:
        return rand_practice()
    current_pair = all_pairs.pop(0)
    first_try = True
    return HTML(practice_box('Translate ' + current_pair[1].text + ":"))

def make_guess(answer):
    global current_pair, first_try
    # normalise the answer the same way the translations were tokenized
    answer = ' '.join([t for t in re.split('[ \t.()"!?]+', answer) if len(t) > 0])
    if answer == 'STOP':
        return "Finished"
    if answer == 'FIRST TIME':
        # initial call made by the interface when the box is first drawn
        return 'Translate ' + current_pair[1].text + ":"
    if answer not in current_pair[0]:
        # wrong answer: requeue the sentence once, then show the accepted
        # translations followed by the same prompt
        if first_try:
            all_pairs.append(current_pair)
            first_try = False
        return '<br/>'.join(current_pair[0] + ['Translate ' + current_pair[1].text + ':'])
    if len(all_pairs) > 0:
        # correct answer: move on to the next sentence in the queue
        first_try = True
        current_pair = all_pairs.pop(0)
        return 'Translate ' + current_pair[1].text + ":"
    return "Finished"

### Now we make a script to interact with the user using JavaScript

In [None]:
def practice_box(prompt):
    # `prompt` is not used: the script below fetches the first prompt itself
    # by calling make_guess() with the input's initial value "FIRST TIME"

    input_form = """
    <div style="background-color:gainsboro; border:solid black; padding:20px;">
    <p id='prompt'></p>
    <input type="text" value="FIRST TIME" id='guess'><br>
    <button id='guess-submit' onclick="make_guess()">Guess</button>
    </div>
    """

    javascript = """
    <script type="text/javascript">
    function handle_output(out){
        // ignore messages that carry no result (e.g. stream output)
        if (!out.content.data) { return; }
        // the kernel sends the Python repr of a string; eval turns it into a JS string
        var res = eval(out.content.data['text/plain']);
        try {
            // a Python 2 kernel sends UTF-8 bytes; reinterpret them as text
            res = decodeURIComponent(escape(res));
        } catch (e) {
            // the string was already proper Unicode (Python 3 kernel)
        }
        document.getElementById('prompt').innerHTML = res;
        document.getElementById('guess').value = "";"""

    if voice_to_language:
        # read out the guess if it was right, otherwise the first accepted answer
        javascript += """
        if (guess != "FIRST TIME") {
            var break_index = res.indexOf("<br/>");
            if (break_index == -1) {
                say_to(guess);
            } else {
                say_to(res.substr(0,break_index));
            }
        }
        """

    if voice_from_language:
        # read out the sentence between the last "Translate " and the final colon
        javascript += """
        if (res != "Finished") {
            var start = res.lastIndexOf("<br/>");
            start = (start == -1 ? 0 : start + "<br/>".length) + "Translate ".length;
            say_from(res.substring(start, res.lastIndexOf(":")));
        }"""

    javascript += """
        if (res == "Finished") {
            document.getElementById('guess').style.display = 'none';
            document.getElementById('guess-submit').style.display = 'none';
        }
    }

    var guess = "";
    function make_guess(){
        guess = document.getElementById('guess').value;
        // JSON.stringify quotes and escapes the guess so that quote marks or
        // backslashes in it cannot break out of the Python string literal
        var command = 'make_guess(' + JSON.stringify(guess) + ')';
        console.log("Executing Command: " + command);

        var kernel = IPython.notebook.kernel;
        var msg_id = kernel.execute(command, {'iopub': {'output' : handle_output}}, {silent:false});
    }

    function say_to(text) {
        console.log("say: "+text);
        var msg = new SpeechSynthesisUtterance(text);
        msg.lang = '"""

    javascript += language_to_translate_to

    javascript += """';
        window.speechSynthesis.speak(msg);
    }

    function say_from(text) {
        console.log("say: "+text);
        var msg = new SpeechSynthesisUtterance(text);
        msg.lang = '"""

    javascript += language_to_translate_from

    javascript += """';
        window.speechSynthesis.speak(msg);
    }
    make_guess();
    </script>
    """

    return input_form + javascript

## Finally we can actually run the program!

#### Run Commands

To start practicing from a random sentence, enter the command `rand_practice()`.

To limit the number of tokens in the starting sentence, enter the command `practice_short(<max_length>)`.

To start from a particular seed sentence, enter the command `practice_sentence(<your_sentence>)`.

#### How to use

Once you run the command, wait a bit and a box should appear below the cell. The box will ask you to translate a sentence from the language you picked for `language_to_translate_from`. Write that sentence in the language you picked for `language_to_translate_to`.

If you don't know the translation, just click the "Guess" button and the program will tell you the answer (it will do the same thing if you get the answer wrong).

You will be given 10 similar sentences to translate from the source language to the target language. If you don't get a sentence on the first try, it will be added to the end of the queue. Once the queue is empty, the box will say "Finished". If you want to keep practicing at this point, you can run the script again.
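
The queue behaviour can be sketched on its own. The `simulate` function below is a hypothetical stand-in for the notebook's `make_guess`: it takes (prompt, accepted answers) pairs and a list of guesses, and returns the prompts in the order they would be shown.

```python
from collections import deque

def simulate(pairs, guesses):
    """Replay guesses against a queue of (prompt, accepted answers) pairs."""
    queue = deque(pairs)
    current = queue.popleft()
    first_try = True
    shown = [current[0]]
    for guess in guesses:
        if guess in current[1]:
            if not queue:
                shown.append("Finished")
                break
            # correct: move on to the next sentence
            current = queue.popleft()
            first_try = True
        elif first_try:
            # first miss: requeue the sentence once for a later retry
            queue.append(current)
            first_try = False
        shown.append(current[0])
    return shown

order = simulate([("A", ["a"]), ("B", ["b"])], ["x", "a", "b", "a"])
print(order)
```

Sentence A is missed once, so it is asked again immediately and then once more after B, before the session finishes.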

In [None]:
practice_short(5)