### Some Previous Discussions

[Points noted down to be improved in previous AgMT APIs](https://teams.microsoft.com/l/file/A8BD6EC4-4946-482C-A7C0-5DFA0AC96ACA?tenantId=dc5352cb-2fb1-4f19-a355-61b4398ec2e1&fileType=docx&objectUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI%2FShared%20Documents%2FAgMT%20-%20VachanAPI%2FAPI%20Refactoring%2FAgMT%20API%20Revision.docx&baseUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI&serviceName=teams&threadId=19:eafa29b748664314b67c8a8105d7caec@thread.tacv2&groupId=0d8df138-370a-4ec7-917d-8ec0699577f6)

[The discussion document on suggestions module](https://teams.microsoft.com/l/file/44E2A83F-30FC-49CA-8E74-5BDDF1824DFC?tenantId=dc5352cb-2fb1-4f19-a355-61b4398ec2e1&fileType=docx&objectUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI%2FShared%20Documents%2FAgMT%20-%20VachanAPI%2FAPI%20Refactoring%2FSuggestions%20Module.docx&baseUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI&serviceName=teams&threadId=19:eafa29b748664314b67c8a8105d7caec@thread.tacv2&groupId=0d8df138-370a-4ec7-917d-8ec0699577f6)



## Tokenization

How about we use single-word tokens for now (beta release in June)?

*Answer*: No. It is better to use phrases from the start; otherwise the design choices we make now would make it impossible to upgrade later.

Issues in using phrases
1. The best way to capture alignments is to use single-word tokens. As we plan to use alignments to automatically identify token translations and enrich the translation memory, single-word tokens would be easier to begin with. As we are planning to give context-based suggestions, phrases of at least 3 words would always be considered in effect, so we may still get the quality improvement of phrase tokens over single words.
2. Using phrase tokens considerably increases the number of tokens to be translated. Translators don't seem very happy about it. Now that context-based translations are also going to be encouraged, they may feel that the token translation phase demands too much work.

Other changes in tokenization
* The function below takes **any list of sentences** as input (as a list of (id, sentence) tuples) and tokenizes them into single-word tokens. This gives us the flexibility to do tokenization in any desired manner: the whole Bible at once, some books, some chapters of a book, etc. Later, in Autographa or outside, the same function can be used to translate other text contents like commentaries, notes or stories.
* The tokens are returned **in the order they appear in the input text**. I hope this will give the user a better connection to the source while translating, and also a better idea of the progress they are making through the source text.
* The tokens are returned as a dict/JSON, and the value of each token key is **the list of occurrences** of the token. This would come in handy for the UI app for highlighting occurrences, and also for returning to the server the occurrences where each sense is to be applied. The server can also directly save the offsets returned by the UI and use them during draft generation.

In [5]:
import re
import utils

def tokenize(sent_list, punctuations=utils.punctuations()+utils.numbers()):
    '''Get single word tokens and their occurrences from input sentence list
    input: [(sent_id, sent_text), (sent_id, sent_text), ...]
    output: {"token": [(sent_id, offset),(sent_id, offset)..],
             "token": [(sent_id, offset),(sent_id, offset)..], ...}'''
    unique_tokens = {}
    for sent_id, sent_text in sent_list:
        # replace punctuation and digits with spaces so that they act as separators
        clean_sent = sent_text
        for punct in punctuations:
            clean_sent = clean_sent.replace(punct, " ")
        # split() without arguments splits on any whitespace run and drops empty strings
        words = clean_sent.split()
        start = 0
        for word in words:
            offset = sent_text.find(word, start)
            # resume after the end of this word, so that the next search
            # cannot match inside the word just found
            start = offset + len(word)
            unique_tokens.setdefault(word, []).append((sent_id, offset))
    return unique_tokens
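
As a rough illustration of how a UI could consume this output for highlighting, the sketch below regroups the occurrences per sentence. The helper name `occurrences_by_sentence` and the English sample data are hypothetical; only the `{token: [(sent_id, offset), ...]}` shape comes from `tokenize()`.

```python
from collections import defaultdict

def occurrences_by_sentence(tokens):
    """Regroup {token: [(sent_id, offset), ...]} as
    {sent_id: [(offset, token), ...]}, sorted by offset within each sentence."""
    grouped = defaultdict(list)
    for token, occurrences in tokens.items():
        for sent_id, offset in occurrences:
            grouped[sent_id].append((offset, token))
    return {sent_id: sorted(spans) for sent_id, spans in grouped.items()}

# Hypothetical sample in the same shape as the tokenize() output
sentences = {1: "the light shines", 2: "the light is true"}
tokens = {
    "the": [(1, 0), (2, 0)],
    "light": [(1, 4), (2, 4)],
    "shines": [(1, 10)],
    "is": [(2, 10)],
    "true": [(2, 13)],
}

by_sentence = occurrences_by_sentence(tokens)
for sent_id, spans in by_sentence.items():
    for offset, token in spans:
        # each offset points at the start of the token in the original sentence
        assert sentences[sent_id][offset:offset + len(token)] == token
```

With the occurrences grouped this way, a client can highlight every token of one sentence in a single pass over its spans.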


In [6]:
sample_sentences = [(62001001,"उस जीवन के वचन के विषय में जो आदि से था*, जिसे हमने सुना, और जिसे अपनी आँखों से देखा, वरन् जिसे हमने ध्यान से देखा और हाथों से छुआ।"),
(62001002,"(यह जीवन प्रगट हुआ, और हमने उसे देखा, और उसकी गवाही देते हैं, और तुम्हें उस अनन्त जीवन का समाचार देते हैं जो पिता के साथ था और हम पर प्रगट हुआ)।"),
(62001003,"जो कुछ हमने देखा और सुना है उसका समाचार तुम्हें भी देते हैं, इसलिए कि तुम भी हमारे साथ सहभागी हो; और हमारी यह सहभागिता पिता के साथ, और उसके पुत्र यीशु मसीह के साथ है।"),
(62001004,"और ये बातें हम इसलिए लिखते हैं, कि तुम्हारा आनन्द पूरा हो जाए*।"),
(62001005,"जो समाचार हमने उससे सुना, और तुम्हें सुनाते हैं, वह यह है; कि परमेश्‍वर ज्योति हैं और उसमें कुछ भी अंधकार नहीं*।"),
(62001006,"यदि हम कहें, कि उसके साथ हमारी सहभागिता है, और फिर अंधकार में चलें, तो हम झूठ बोलते है और सत्य पर नहीं चलते।"),
(62001007,"पर यदि जैसा वह ज्योति में है, वैसे ही हम भी ज्योति में चलें, तो एक दूसरे से सहभागिता रखते हैं और उसके पुत्र यीशु मसीह का लहू हमें सब पापों से शुद्ध करता है। (यशा. 2:5)"),
(62001008,"यदि हम कहें, कि हम में कुछ भी पाप नहीं, तो अपने आप को धोखा देते हैं और हम में सत्य नहीं।"),
(62001009,"यदि हम अपने पापों को मान लें, तो वह हमारे पापों को क्षमा करने, और हमें सब अधर्म से शुद्ध करने में विश्वासयोग्य और धर्मी है। (भज. 32:5, नीति. 28:13)"),
(62001010,"यदि हम कहें कि हमने पाप नहीं किया, तो उसे झूठा ठहराते हैं, और उसका वचन हम में नहीं है।")]

tokenize(sample_sentences)

{'उस': [(62001001, 0), (62001002, 73)],
 'जीवन': [(62001001, 3), (62001002, 4), (62001002, 82)],
 'के': [(62001001, 8),
  (62001001, 15),
  (62001002, 114),
  (62001003, 124),
  (62001003, 156)],
 'वचन': [(62001001, 11), (62001010, 67)],
 'विषय': [(62001001, 18)],
 'में': [(62001001, 23),
  (62001006, 58),
  (62001007, 22),
  (62001007, 51),
  (62001008, 19),
  (62001008, 74),
  (62001009, 94),
  (62001010, 74)],
 'जो': [(62001001, 27), (62001002, 106), (62001003, 0), (62001005, 0)],
 'आदि': [(62001001, 30)],
 'से': [(62001001, 34),
  (62001001, 77),
  (62001001, 107),
  (62001001, 124),
  (62001007, 73),
  (62001007, 139),
  (62001009, 80)],
 'था': [(62001001, 37), (62001002, 121)],
 'जिसे': [(62001001, 42), (62001001, 61), (62001001, 91)],
 'हमने': [(62001001, 47),
  (62001001, 96),
  (62001002, 23),
  (62001003, 7),
  (62001005, 10),
  (62001010, 15)],
 'सुना': [(62001001, 52), (62001003, 20), (62001005, 20)],
 'और': [(62001001, 58),
  (62001001, 115),
  (62001002, 20),
  (62001002,

## Get Text functions

How about we define a get-text function on every content table (Bible, commentary, etc.) that returns the cleaned text field contents along with an id for that specific table?

*Answer*: Yes, but design it as an abstract class that is inherited and implemented for each kind of source.

* This list of sentences could be used as input for tokenization and draft generation.
* It could also be used by apps like Autographa or BridgeEngine to display reference texts on screen, as it would contain just the clean contents and no footnotes, cross-references, Strong's markup, alignments or other non-relevant content from the USFM files.
* It would also come in handy for model-building scripts to get the texts from various content tables.
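
A minimal sketch of what the abstract-class design from the answer could look like. The class names `TextSource` and `InMemoryBibleSource`, and the `start_id`/`end_id` parameters, are assumptions made for illustration; a real subclass would query its own content table instead of a dict.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

class TextSource(ABC):
    """Hypothetical base class: one subclass per kind of content table."""

    @abstractmethod
    def get_text(self, start_id: Optional[int] = None,
                 end_id: Optional[int] = None) -> List[Tuple[int, str]]:
        """Return cleaned text as [(id, sentence), ...]."""

class InMemoryBibleSource(TextSource):
    """Stand-in for a Bible table; a real subclass would query the database."""

    def __init__(self, verses):
        self._verses = dict(verses)  # {ref_id: verse_text}

    def get_text(self, start_id=None, end_id=None):
        low = 0 if start_id is None else start_id
        high = float("inf") if end_id is None else end_id
        return [(ref_id, text) for ref_id, text in sorted(self._verses.items())
                if low <= ref_id <= high]

source = InMemoryBibleSource({62001002: "second verse", 62001001: "first verse"})
all_text = source.get_text()
one_verse = source.get_text(start_id=62001002)
```

Callers such as tokenization or draft generation would then depend only on `get_text()`, regardless of which kind of content the sentences come from.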

In [18]:
import schemas, db_models

def get_text_from_bible(db_, source_name, ref_start:schemas.Reference=None,
    ref_end:schemas.Reference=None):
    '''Fetches text contents from bible_cleaned tables to be used for translation apps
    or for model building.
    Output format: [(id, sentence), (id, sentence), ....]'''
    # NotAvailableException and TypeException are assumed to be the project's
    # own exception classes, defined elsewhere
    if source_name not in db_models.dynamicTables:
        raise NotAvailableException('%s not found in database.'%source_name)
    if not source_name.endswith('_bible'):
        raise TypeException('The operation is supported only on bible')
    model_cls = db_models.dynamicTables[source_name+'_cleaned']

    def to_ref_id(ref):
        '''converts a reference to the numeric id: bookId*1000000 + chapter*1000 + verse'''
        book = db_.query(db_models.BibleBook).filter(
            db_models.BibleBook.bookCode == ref.bookCode).first()
        if not book:
            raise NotAvailableException("Book %s, not found in database"%ref.bookCode)
        return book.bookId*1000000 + ref.chapter*1000 + ref.verseNumber

    # without explicit bounds, cover the whole id range
    ref_id_start = to_ref_id(ref_start) if ref_start else 0
    ref_id_end = to_ref_id(ref_end) if ref_end else 999999999
    query = db_.query(model_cls).filter(model_cls.refId >= ref_id_start,
        model_cls.refId <= ref_id_end, model_cls.active == True)
    return [(item.refId, item.verseText) for item in query.all()]


In [22]:
from database import SessionLocal, engine
db_ = SessionLocal()

get_text_from_bible(db_, source_name="hin_IRV_1_bible")

[(41001001, 'इब्राहीम की सन्\u200dतान, दाऊद की ...')]
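
The ids returned here pack book, chapter and verse into one integer (`bookId*1000000 + chapter*1000 + verseNumber`, as computed in `get_text_from_bible()`). A small sketch of the round trip, where the helper names `encode_ref_id` and `decode_ref_id` are hypothetical:

```python
def encode_ref_id(book_id, chapter, verse):
    """Pack a reference into a single integer id."""
    return book_id * 1000000 + chapter * 1000 + verse

def decode_ref_id(ref_id):
    """Split an id back into (book_id, chapter, verse)."""
    book_id, rest = divmod(ref_id, 1000000)
    chapter, verse = divmod(rest, 1000)
    return book_id, chapter, verse

ref_id = encode_ref_id(62, 1, 10)
decoded = decode_ref_id(ref_id)
```

The encoding assumes chapter and verse numbers stay below 1000, so that the three fields never overlap.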

## Draft Generation

In V1, draft generation was done by find-and-replace of tokens (in descending order of token length) on the USFM file.

In V2, as we are doing context-based translation, this needs to change: each token will be replaced with its translation at the specific occurrence where that translation applies.
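
One possible way to do occurrence-specific replacement is sketched below, assuming the offsets refer to the original sentence and the occurrences do not overlap. The function name `replace_at_offsets`, the tuple layout and the bracketed placeholder translations are all hypothetical.

```python
def replace_at_offsets(sentence, replacements):
    """Replace tokens at specific offsets of the original sentence.
    replacements: [(offset, source_token, translation), ...]"""
    result = sentence
    # work from the rightmost occurrence so that earlier offsets stay valid
    for offset, token, translation in sorted(replacements, reverse=True):
        end = offset + len(token)
        if sentence[offset:end] != token:
            raise ValueError(f"{token!r} not found at offset {offset}")
        result = result[:offset] + translation + result[end:]
    return result

sentence = "the bank by the river bank"
draft = replace_at_offsets(sentence, [
    (4, "bank", "[institution]"),   # first occurrence, one sense
    (22, "bank", "[shore]"),        # second occurrence, another sense
])
```

Applying the replacements from right to left means no offset has to be recalculated after a translation of a different length is inserted.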

How about we do not use the input (source/reference USFM) for this replacement and instead create a fresh minimal USFM with the translated verses? For this we would take the clean verse text that was extracted from the USFM and is returned by the `get_text_from_bible()` function (the same text given for tokenization and displayed on the UI), translate it using token replacement, and then attach the minimum required markers `\id`, `\c`, `\p` and `\v` appropriately.

*Answer*: Yes

By doing this, all non-verse contents present in the source/reference USFM would be absent in the generated draft. 

An existing issue in V1 drafts is that some words are not replaced with translations even though they are present in the token list and translated there. I think this happens because they are part of phrase tokens, and these phrases are broken apart in the USFM file by additional markup in between, so the find-and-replace does not work. Similar issues would occur in V2 as well, even with offsets. So I think using the cleaned text for replacement would be better.
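
A minimal sketch of assembling such a draft from translated verses, assuming the ids follow the `bookId*1000000 + chapter*1000 + verse` scheme and all verses belong to one book. The function name `build_minimal_usfm` is hypothetical, the book code is passed in explicitly, and emitting a single `\p` at the start of each chapter is a simplifying assumption.

```python
def build_minimal_usfm(book_code, verses):
    """Build a minimal USFM string with only \\id, \\c, \\p and \\v markers.
    verses: [(ref_id, translated_text), ...] for a single book."""
    lines = ["\\id " + book_code]
    current_chapter = None
    for ref_id, text in sorted(verses):
        chapter = ref_id // 1000 % 1000
        verse = ref_id % 1000
        if chapter != current_chapter:
            # open a new chapter with one paragraph marker
            lines.append("\\c " + str(chapter))
            lines.append("\\p")
            current_chapter = chapter
        lines.append("\\v " + str(verse) + " " + text)
    return "\n".join(lines) + "\n"

usfm = build_minimal_usfm("1JN", [
    (62001001, "first verse"),
    (62001002, "second verse"),
    (62002001, "third verse"),
])
```

Because the draft is built only from the cleaned verse text, there is no intervening markup that could split a token.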

In [None]:
def translate(db_, project_id, sent_ids=None):
    '''Does token replacement translation with translation memory'''
    # fetch sentences from source document
    if sent_ids is None:
        sentences = db_.query(db_models.TranslationMemory.sourceDocument).filter(
            db_models.TranslationMemory.projectId == project_id).first()
    else:
        # TODO: restrict the query to sent_ids; the second condition below is a bare
        # attribute, not a comparison, so it does not filter by sentence yet
        sentences = db_.query(db_models.TranslationMemory.sourceDocument).filter(
            db_models.TranslationMemory.projectId == project_id,
            db_models.TranslationMemory.sentences).all()

## Suggestions

For every language pair for which we have translation projects or previously available parallel aligned data, we will have a *translation memory learned model*. When we encounter a word in a particular context, we use this learned model to get all possible translations of the word and score them based on the current context window. These scored translations can be given to the user as suggestions.

I think the key thing to decide here is how we store the translation memory so that it can be used efficiently.
Or, what do we mean by a learned translation model for a language pair?

The options that occur to me are as follows:
1. Query the translation memory (alignment) **SQL table** based on the key word, then check and sort the results based on the context window. (I am afraid this will perform poorly in terms of time and space.)
2. Periodically build a **trie** structure from the translation memory (alignment) table and query this trie for suggestions. I am not yet familiar with tries; I hope they allow us to search based on a context window efficiently. One drawback I can see is that "learning" will not happen in real time: data the user adds will take time (depending on how often we run the learning script) to improve the quality of suggestions.
3. While we keep a translation memory table in the SQL DB, build a **graph** structure from it in parallel in DGraph. Use this graph for suggestions and the table in the SQL DB for draft generation.
4. Build a **neural network** (or ML) model that can be trained with a word and its context window and can predict the translation. I am not sure if we can get such a model to give multiple translations with varying scores. Building, storing, and using such models can also become expensive in terms of time and space.

*Answer*: Option 2, trie built from SQL table, for now and 3, 4 for later

**Proposed trie structure for AgMT**

* Have one trie per source-target language pair; this would reduce the size and thus improve search performance at level 1
* Each node will have
	* a key: the context. The window size increases by one at each level
	* translations: list of all seen translations and their counts for the given context. The count and current level can be used for scoring suggestions (score = level*count/total_occurrences, total_occurrences = sumOfCountsAtLevel1)
	* children: context increases by one word to the right or left of the current context
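
To make the scoring rule in the list above concrete, here is a small worked example. The counts and the translation labels are hypothetical; only the formula `level*count/total_occurrences` comes from the proposal.

```python
def score(level, count, total_occurrences):
    """Score of a translation seen `count` times at a node on `level`."""
    return level * count / total_occurrences

# Hypothetical counts: the token alone (level 1) was seen 5 times in total ...
level_1 = {"translation_a": 3, "translation_b": 2}
# ... and twice together with one particular neighbouring word (level 2)
level_2 = {"translation_b": 2}

total = sum(level_1.values())
scores_level_1 = {t: score(1, c, total) for t, c in level_1.items()}
scores_level_2 = {t: score(2, c, total) for t, c in level_2.items()}
```

Here `translation_a` leads at level 1 (0.6 against 0.4), but once the neighbouring word matches, `translation_b` scores 0.8, so a longer context match can outrank a more frequent translation.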

input: 
```
[
{"token": "house","context":"They use barrels to house their pets","translation":"പാര്‍പ്പിക്കുക"},
{"token": "house","context":"His house is to the left","translation":"വീട്"},
{"token": "house","context":"Their house construction methods are different","translation":"ഭവന"},
{"token": "house","context":"Last time I went to his house,","translation":"വീട്ടിലേക്ക്"},
{"token": "house","context":"Museums house large collection of Roman sculpture","translation":"ഉള്‍ക്കൊള്ളുന്നു"}
]
```
A trie of window size 3
![trie diagram](example_trie.png)


In [2]:
def form_trie_keys(prefix, to_left, to_right, prev_keys):
    '''Recursively builds the trie keys for a token by extending its context
    one word at a time to the left (/L:) and to the right (/R:)'''
    keys = prev_keys
    a = b = None
    if len(to_left) > 0:
        a = '/L:'+to_left.pop(0)
    if len(to_right) > 0:
        b = '/R:'+to_right.pop(0)
    if a:
        key_left = prefix + a
        keys.append(key_left)
        if not b:
            keys = form_trie_keys(key_left, to_left.copy(), to_right.copy(), keys)
    if b:
        key_right = prefix + b
        keys.append(key_right)
        if not a:
            keys = form_trie_keys(key_right, to_left.copy(), to_right.copy(), keys)
    if a and b:
        key_both_1 = prefix + a + b
        key_both_2 = prefix + b + a
        keys.append(key_both_1)
        keys.append(key_both_2)
        keys = form_trie_keys(key_both_1, to_left.copy(), to_right.copy(), keys)
        keys = form_trie_keys(key_both_2, to_left, to_right, keys)
    return keys

token = "house"
#context = ["house"]
context = ["his", "house", "is"]
#context = ["his", "house"]
#context = ["house", "is"]
#context = ["says","his", "house", "is", "in", "town"]
#context = ["house", "is", "in", "town"]
#context = ["he","says","his", "house"]
#context = ["he","says","his", "house", "is", "in", "town"]

token_index = context.index(token)
to_left = [context[i] for i in range(token_index-1, -1, -1)]
to_right = context[token_index+1:]
form_trie_keys(token, to_left, to_right, [token])

['house', 'house/L:his', 'house/R:is', 'house/L:his/R:is', 'house/R:is/L:his']

In [4]:
import pygtrie

def build_trie(token_context__trans_list):
    '''Build a trie tree from scratch
    input: [(token,context_list, translation), ...]'''
    t = pygtrie.StringTrie()
    for item in token_context__trans_list:
        token = item[0]
        context = item[1]
        translation = item[2]
        token_index = context.index(token)
        to_left = [context[i] for i in range(token_index-1, -1, -1)]
        to_right = context[token_index+1:]
        keys = form_trie_keys(token, to_left, to_right, [token])
        for key in keys:
            if t.has_key(key):
                value = t[key]
                if translation in value.keys():
                    value[translation] += 1
                else:
                    value[translation] = 1
                t[key] = value
            else:
                t[key] = {translation: 1}
    return t

training_data = [
    ("bank", ["bank", "is", "closed"], "ബാങ്ക്"),
    ("bank", ["they", "bank", "on", "us"], "ആശ്രയിക്കുക"),
    ("bank", ["pay", "bank", "back"], "ബാങ്ക്"),
    ("bank", ["river", "bank", "is", "muddy"], "തീരം"),
    ("bank", ["bank", "manager", "spoke"], "ബാങ്ക്")]
t = build_trie(training_data)

In [25]:
def get_translation_suggestion(word, context, t):
    '''find the context based translation suggestions for a word.
    Makes use of the learned model, t, for the lang pair, based on translation memory
    output format: [(translation1, score1), (translation2, score2), ...]'''
    token_index = context.index(word)
    to_left = [context[i] for i in range(token_index-1, -1, -1)]
    to_right = context[token_index+1:]
    keys = form_trie_keys(word, to_left, to_right, [word])
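    suggestions = {}
    # counts of all translations seen for the word alone (level 1)
    single_word_match = t[word]
    total_count = sum(single_word_match.values())
    for k in keys:
        # deepest node in the trie that matches a prefix of this context key
        match = t.longest_prefix(k)
        levels = len(match.key.split("/"))
        for trans in match.value:
            # note: the level is squared here, which favours longer context
            # matches more strongly than the level*count/total formula above
            score = match.value[trans]*levels*levels / total_count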
            if trans in suggestions:
                if suggestions[trans] < score:
                    suggestions[trans] = score
            else:
                suggestions[trans] = score
    return [(key, suggestions[key]) for key in suggestions]

#get_translation_suggestion("bank", ["pay", "bank", "the", "money"], t)
#get_translation_suggestion("bank", ["bank", "the", "money"], t)
#get_translation_suggestion("bank", ["river", "bank", "is", "near"], t)
#get_translation_suggestion("bank", ["people", "bank", "on", "others"], t)
get_translation_suggestion("bank", ["bank"], t)

[('ബാങ്ക്', 0.6), ('ആശ്രയിക്കുക', 0.2), ('തീരം', 0.2)]
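
Before showing these to a user, the list would probably be ordered by score. A minimal sketch, where the function name `rank_suggestions`, the `top_n` parameter and the placeholder translations are assumptions:

```python
def rank_suggestions(suggestions, top_n=None):
    """Sort [(translation, score), ...] by descending score, optionally truncated."""
    ranked = sorted(suggestions, key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]

# Hypothetical suggestions in the output format of get_translation_suggestion()
suggestions = [("translation_a", 0.2), ("translation_b", 0.6), ("translation_c", 0.2)]
ranked = rank_suggestions(suggestions)
best = rank_suggestions(suggestions, top_n=1)
```

Python's sort is stable, so translations with equal scores keep the order in which they were found.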

## Tables

![Projects and Translation memory tables](agmt_tables.png)