### Some Previous Discussions

[Points noted down to be improved in previous AgMT APIs](https://teams.microsoft.com/l/file/A8BD6EC4-4946-482C-A7C0-5DFA0AC96ACA?tenantId=dc5352cb-2fb1-4f19-a355-61b4398ec2e1&fileType=docx&objectUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI%2FShared%20Documents%2FAgMT%20-%20VachanAPI%2FAPI%20Refactoring%2FAgMT%20API%20Revision.docx&baseUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI&serviceName=teams&threadId=19:eafa29b748664314b67c8a8105d7caec@thread.tacv2&groupId=0d8df138-370a-4ec7-917d-8ec0699577f6)

[The discussion document on suggestions module](https://teams.microsoft.com/l/file/44E2A83F-30FC-49CA-8E74-5BDDF1824DFC?tenantId=dc5352cb-2fb1-4f19-a355-61b4398ec2e1&fileType=docx&objectUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI%2FShared%20Documents%2FAgMT%20-%20VachanAPI%2FAPI%20Refactoring%2FSuggestions%20Module.docx&baseUrl=https%3A%2F%2Fbridgeconn.sharepoint.com%2Fsites%2FDevTeam-AgMT-VachanAPI&serviceName=teams&threadId=19:eafa29b748664314b67c8a8105d7caec@thread.tacv2&groupId=0d8df138-370a-4ec7-917d-8ec0699577f6)



## Tokenization

How about we use single word tokens for now(beta release in June)?

Issues in using phrases
1. The best way to capture alignments is to use single word tokens. As we plan to use alignments to automaticaly identify token translation and enrich translation memory, using single word tokens would be easier to begin with. As we are planning to give context based suggestions, phrases of at least 3 words would always be considered in effect. So we may still get the quality improvement of using phrase tokens instead of single word.
2. Using phrase tokens increase the number of tokens to be translated considerably. Translators doesn't seem very happy about it. Now that context based translations are also going to be encouraged, they may feel their work is too much in token translation phase.

**Answer**: No, better use phrases now itself or the design choices we make now would make it not possible to upgrade later

Main changes from older method(V1)
* The fucntion below takes **any list of sentences** as input(as list of (id, sentence) tuples) and tokenize them into single word tokens. This gives us flexibility to do tokenization in any desired manner: whole bible at once, some books, some chapters of a book etc. Later, in Autographa or outside, the same function can be used to translate other text contents like commenatries, notes or stories.
* We **dont use statistical models** for phrase identification. That makes it possible to work with data of very less size too (like one story)
* we get more control over tokenization. For example we can load a set of predefined phrases to the translation memory table and make sure they get treated as tokens(eg: translation words, named entities etc)
* we can **use language specific knowledge** like stopwords, puctuations etc if available
* tokenization **changes(improves) with user data**. As translation memory table is added with newer phrases for a language, from different projects in the App or alignment data obtained from the app or otherwise, the tokens may change accordingly.
* The tokens are returned in chronological order, that is **in the order they appear in the input text**. I hope this will give the user a better connection to the source while translating and also a better idea about the progress he is making through the source text.
* The tokens are returned as a dict/json and the value of each token-key would be **the list of occurances** of the token. This would become handy for UI app for highlighlighting occurances and also returning the occurances where each sense is to be applied back to server. 
* In the **absence of any prior knowledge**, ie. no phrases in translation memory and no known stopwords for that source language, the base line system would give single word tokens only

In the UI, the user should be allowed to **fiddle with the machine generated tokens**. He should be able to split a phrase token into component words, if he finds that suits translation better. Also he should be allowed to combine adjacent tokens to form bigger phrases. This would allow us to learn better phrases and would be easier for user than going to a separate alignment mode for it.(I am not suggesting replacing alignment mode)

In [10]:
## Chunking based on punctuations 
import re
import utils
import pdb
punctuations = utils.punctuations()+utils.numbers()
sample_text = "hello! My dear friend. It's a tasty, healthy food. Did you make it? Thanks!!! "
chunks = [chunk.strip() for chunk in re.split(r'['+"".join(punctuations)+']+', sample_text)]
print('Chunking input:', sample_text)
print('Chunking output:',chunks)

# matching phrase alternate
def find_phrases(text, stop_words):
    '''try forming phrases as <preposition stop word>* <content word> <postposition stop word>*'''
    #pdb.set_trace()
    words = text.split()
    phrases = []
    current_phrase = ''
    state = 'pre'
    i = 0
    while i < len(words):
        word = words[i]
        if state == 'pre':
            if word in stop_words['prepositions']:
                current_phrase += ' '+ word # adds prepostion, staying in 'pre' state
            else:
                current_phrase += ' '+ word # adds one content word and goes to 'post' state
                state = 'post'
        elif state == 'post':
            if word in stop_words['postpositions']:
                current_phrase += ' '+ word # adds postposition, staying in 'post' state
            else:
                phrases.append(current_phrase.strip()) # stops the phrase building
                current_phrase = word
                if word in stop_words['prepositions']:
                    state = 'pre'
                else:
                    state = 'post'
        #pdb.set_trace()
        i += 1
        
    phrases.append(current_phrase.strip())
    return phrases

text = 'उस जीवन के वचन के विषय में जो आदि से था'
stop_words = utils.stopwords('hin')

print('\n\nphrases input(one chunk):', text)
print('phrases output:')
find_phrases(text, stop_words)

Chunking input: hello! My dear friend. It's a tasty, healthy food. Did you make it? Thanks!!! 
Chunking output: ['hello', 'My dear friend', 'It', 's a tasty', 'healthy food', 'Did you make it', 'Thanks', '']


phrases input(one chunk): उस जीवन के वचन के विषय में जो आदि से था
phrases output:


['उस जीवन के', 'वचन के', 'विषय में जो', 'आदि से था']

In [11]:
def display_tree(tree):
    for path in tree.items():
        nodes = path[0].split('/')
        for nod in nodes:
            print('\t-',nod,end='')
        print(' => ',path[1])

In [36]:
import pygtrie, re

def build_memory_trie(translation_memory):
    memory_trie = pygtrie.StringTrie()
    space_pattern = re.compile('\s+')
    for token in translation_memory:
        key = re.sub(space_pattern,'/', token)
        memory_trie[key] = 0
    return memory_trie

#fetch all distinct tokens for the source language from translation memory in DB
mock_translation_memory = ["जीवन के वचन", "जीवन का", "अपनी आँखों से देखा", "पिता के साथ", "यीशु मसीह", "परमेश्‍वर ज्योति", "झूठा ठहराते",
                          "Here is it", "hare", "no"]

#build a trie using the fetched data. words in tokens will form the path 
memory_trie = build_memory_trie(mock_translation_memory)
display_tree(memory_trie)

	- जीवन	- के	- वचन =>  0
	- जीवन	- का =>  0
	- अपनी	- आँखों	- से	- देखा =>  0
	- पिता	- के	- साथ =>  0
	- यीशु	- मसीह =>  0
	- परमेश्‍वर	- ज्योति =>  0
	- झूठा	- ठहराते =>  0
	- Here	- is	- it =>  0
	- hare =>  0
	- no =>  0


In [13]:
import re
import utils
import pdb

def tokenize(src_lang, sent_list, punctuations=None,
            stop_words = None):
    '''Get phrase and single word tokens and their occurances from input sentence list.
    Performs tokenization using two knowledge sources: translation memory and stopwords list
    input: [(sent_id, sent_text), (sent_id, sent_text), ...]
    output: {"token": [(sent_id, start_offset, end_offset),(sent_id, start_offset, end_offset)..],
             "token": [(sent_id, start_offset, end_offset),(sent_id, start_offset, end_offset)..], ...}'''
    #pdb.set_trace()
    unique_tokens = {}
    if stop_words is None:
        stop_words = utils.stopwords(src_lang)
    if punctuations is None:
        punctuations = utils.punctuations()+utils.numbers()
    # fetch all known tokens for the language and build a trie with it
    memory_trie = build_memory_trie(mock_translation_memory)
    for sent in sent_list:
        phrases = []
        text = re.sub(r'[\n\r]+', ' ', sent[1])
        #first split the text into chunks based on punctuations
        chunks = [chunk.strip() for chunk in re.split(r'['+"".join(punctuations)+']+', text)]
        updated_chunks = []
        for i,chunk in enumerate(chunks):
            #search the trie to get the longest matching phrases known to us
            temp = chunk
            new_chunks = ['']
            while temp != "":
                key = '/'.join(temp.split())
                lngst = memory_trie.longest_prefix(key)
                if lngst.key is not None:
                    new_chunks.append("###"+lngst.key.replace('/',' '))
                    temp = temp[len(lngst.key):]
                    new_chunks.append('')
                else:
                    if " " in temp:
                        indx = temp.index(' ')
                        new_chunks[-1] += temp[:indx+1]
                        temp = temp[indx+1:]
                    else:
                        new_chunks[-1] += temp
                        temp = ""
                #pdb.set_trace()
            updated_chunks += new_chunks
            #pdb.set_trace()
        chunks = [ chk.strip() for chk in updated_chunks if chk.strip() != '']       
        for i,chunk in enumerate(chunks):
            # from the left out words in above step, try forming phrases 
            # as <preposition stop word>* <content word> <postposition stop word>* 
            if chunk.startswith('###'):
                phrases.append(chunk.replace("###",""))
            else:
                phrases+=find_phrases(chunk,stop_words)
        start = 0
        for phrase in phrases:
            offset = sent[1].find(phrase, start)
            if offset == -1:
                #raise "token not found in sentence"
                pdb.set_trace()
            start = offset+1
            if phrase not in unique_tokens:
                unique_tokens[phrase] = [(sent[0], offset, offset+len(phrase))]
            else: 
                unique_tokens[phrase].append((sent[0], offset, offset+len(phrase)))
    return unique_tokens


In [5]:
sample_sentences = [(62001001,"उस जीवन के वचन के विषय में जो आदि से था*, जिसे हमने सुना, और जिसे अपनी आँखों से देखा, वरन् जिसे हमने ध्यान से देखा और हाथों से छुआ।"),
(62001002,"(यह जीवन प्रगट हुआ, और हमने उसे देखा, और उसकी गवाही देते हैं, और तुम्हें उस अनन्त जीवन का समाचार देते हैं जो पिता के साथ था और हम पर प्रगट हुआ)।"),
(62001003,"जो कुछ हमने देखा और सुना है उसका समाचार तुम्हें भी देते हैं, इसलिए कि तुम भी हमारे साथ सहभागी हो; और हमारी यह सहभागिता पिता के साथ, और उसके पुत्र यीशु मसीह के साथ है।"),
(62001004,"और ये बातें हम इसलिए लिखते हैं, कि तुम्हारा आनन्द पूरा हो जाए*।"),
(62001005,"जो समाचार हमने उससे सुना, और तुम्हें सुनाते हैं, वह यह है; कि परमेश्‍वर ज्योति हैं और उसमें कुछ भी अंधकार नहीं*।"),
(62001006,"यदि हम कहें, कि उसके साथ हमारी सहभागिता है, और फिर अंधकार में चलें, तो हम झूठ बोलते है और सत्य पर नहीं चलते।"),
(62001007,"पर यदि जैसा वह ज्योति में है, वैसे ही हम भी ज्योति में चलें, तो एक दूसरे से सहभागिता रखते हैं और उसके पुत्र यीशु मसीह का लहू हमें सब पापों से शुद्ध करता है। (यशा. 2:5)"),
(62001008,"यदि हम कहें, कि हम में कुछ भी पाप नहीं, तो अपने आप को धोखा देते हैं और हम में सत्य नहीं।"),
(62001009,"यदि हम अपने पापों को मान लें, तो वह हमारे पापों को क्षमा करने, और हमें सब अधर्म से शुद्ध करने में विश्वासयोग्य और धर्मी है। (भज. 32:5, नीति. 28:13)"),
(62001010,"यदि हम कहें कि हमने पाप नहीं किया, तो उसे झूठा ठहराते हैं, और उसका वचन हम में नहीं है।")]

tokenize("hin", sample_sentences)

{'उस': [(62001001, 0, 2)],
 'जीवन के वचन': [(62001001, 3, 14)],
 'के': [(62001001, 8, 10)],
 'विषय में जो': [(62001001, 18, 29)],
 'आदि से था': [(62001001, 30, 39)],
 'जिसे': [(62001001, 42, 46), (62001001, 61, 65), (62001001, 91, 95)],
 'हमने': [(62001001, 47, 51),
  (62001001, 96, 100),
  (62001002, 23, 27),
  (62001005, 10, 14),
  (62001010, 15, 19)],
 'सुना': [(62001001, 52, 56), (62001005, 20, 24)],
 'और': [(62001001, 58, 60),
  (62001002, 20, 22),
  (62001002, 38, 40),
  (62001002, 62, 64),
  (62001003, 98, 100),
  (62001003, 132, 134),
  (62001005, 26, 28),
  (62001006, 44, 46),
  (62001009, 63, 65),
  (62001010, 59, 61)],
 'अपनी आँखों से देखा': [(62001001, 66, 84)],
 'वरन्': [(62001001, 86, 90)],
 'ध्यान से': [(62001001, 101, 109)],
 'देखा और': [(62001001, 110, 117), (62001003, 12, 19)],
 'हाथों से': [(62001001, 118, 126)],
 'छुआ': [(62001001, 127, 130)],
 'यह जीवन': [(62001002, 1, 8)],
 'प्रगट हुआ': [(62001002, 9, 18), (62001002, 133, 142)],
 'उसे': [(62001002, 28, 31), (62001

## Get Text functions

How about we define a get text function on every content table(like bible, commentary etc), which would return the cleaned text field contents along with an id for that specific table?

*Answer*: Yes. But design it as an abstract class which is inherited and implemented for each kind of sources

* This list of sentences could be used as input for tokenization and draft generation. 
* Also this could be used for apps like Autographa or BridgeEngine to display reference texts on screen, as it would contain just the clean contents and no foot notes, cross-refs, strongs markups, alignments or any other non-relevant contents in USFM files
* This would also come in handy for model building scripts to get the texts from varoius content tables

In [None]:
import schemas, db_models
import main

def get_text_from_bible(db_, source_name, ref_start:schemas.Reference=None, 
    ref_end:schemas.Reference=None):
    '''fetched text contents from bible_cleaned tables to be used for translations apps 
    or for model building.
    Output format: [(id, sentance), (id, sentance), ....]'''
    if source_name not in db_models.dynamicTables:
        print(db_models.dynamicTables)
        raise NotAvailableException('%s not found in database.'%source_name)
    if not source_name.endswith('_bible'):
        raise TypeException('The operation is supported only on bible')
    model_cls = db_models.dynamicTables[source_name+'_cleaned']
    ref_id_start = ref_id_end = None
    if ref_start:
    	book = db_models.BibleBook.filter(db_models.BibleBook.bookCode == ref_start.bookCode).first()
    	if not book:
    		raise NotAvailableException("Book %s, not found in database"%ref_start.bookCode)
    	ref_id_start = book.bookId*1000000 + ref_start.chapter*1000 + ref_start.verseNumber
    if ref_end:
    	book = db_models.BibleBook.filter(db_models.BibleBook.bookCode == ref_end.bookCode).first()
    	if not book:
    		raise NotAvailableException("Book %s, not found in database"%ref_end.bookCode)
    	ref_id_end = book.bookId*1000000 + ref_end.chapter*1000 + ref_end.verseNumber
    if not ref_id_start:
    	ref_id_start = 0
    if not ref_id_end:
    	ref_id_end = 999999999
    query = db_.query(model_cls).filter(model_cls.refId >= ref_id_start,
    	model_cls.refId <= ref_id_end, model_cls.active == True)
    res = query.all()
    formatted_res = []
    for item in res:
        formatted_res.append((item.refId, item.verseText))
    return formatted_res


In [7]:
from database import SessionLocal, engine
from custom_exceptions import NotAvailableException, TypeException, AlreadyExistsException

db_ = SessionLocal()

get_text_from_bible(db_, source_name="hin_KJV_1_bible")

[(41001001, 'इब्राहीम की सन्\u200dतान, दाऊद की ...'),
 (42001001, 'इब्राहीम की सन्\u200dतान, दाऊद की ...'),
 (43001001, 'इब्राहीम की सन्\u200dतान, दाऊद की ...')]

## Draft Generation

In V1, draft generation was done by find and replace of tokens(in the descending order of length of token) on the USFM file.

In V2, as we are doing context based translation, it needs to be changed. We will be doing a replacement of tokens with translations on specific occurance.

How about we do not use the input(source/reference USFM) for this replacement, instead create a fresh minimal USFM with the translated verses? For this we will be using the clean verse text we extracted from USFM and obtained using the the `get_text_from_bible()` function(the same text given for tokenizarion and displaying on UI), translate it using token replacement and then attach the minimum required markers \id, \c, \p and \v appropriately.

By doing this, all non-verse contents present in the source/reference USFM would be absent in the generated draft. 

An existing issue in the draft of V1 is that some words are not replaced with translations even though, they are present in tokens list and translated there. I think the issue happens because they are part of phrase tokens and these phrases are broken apart in USFM file with additional markup in between them. So the find and replace doesnt work. Similar issues will occur for us in V2 also even if we are using offsets. So I think, using the cleaned text for replacement would be better

*Answer*: Yes. It is better to generate.

In [56]:
def display_alignment(source, draft, draft_meta):
    for item in draft_meta:
        src = source[item[0][0]:item[0][1]]
        trg = draft[item[1][0]:item[1][1]]
        status = item[2]
        print(src," --> ",trg,"(",status,")")
    return


In [57]:
import pdb

def replace_token(source, token_offset, translation, draft="", draft_meta=[], tag="confirmed"):
    '''make a token replacement and return updated sentence and draft_meta'''
#     pdb.set_trace()
    token_length = token_offset[1] - token_offset[0]
    trans_length = len(translation)
    updated_meta = []
    updated_draft = ""
    translation_offset = [None, None]
    if len(draft_meta) == 0:
        draft = source
        draft_meta.append(((0,len(source)), (0,len(source)), "untranslated"))
    for meta in draft_meta:
        tkn_offset = meta[0]
        trans_offset = meta[1]
        status = meta[2]
        intersection = set(range(token_offset[0],token_offset[1])).intersection(range(tkn_offset[0],tkn_offset[1]))
        if len(intersection) > 0: # our area of interest overlaps with this segment 
            if token_offset[0] == tkn_offset[0]: #begining is same
                translation_offset[0] = trans_offset[0]
                updated_draft += translation
            elif token_offset[0] > tkn_offset[0]: # begins within this segment
                updated_draft += source[tkn_offset[0]: token_offset[0]]
                new_seg_len = token_offset[0] - tkn_offset[0]
                updated_meta.append(((tkn_offset[0], token_offset[0]), (trans_offset[0], trans_offset[0]+new_seg_len),"untranslated"))
                translation_offset[0] = trans_offset[0]+new_seg_len
                updated_draft += translation
            else: # begins before this segment
                pass
            if token_offset[1] == tkn_offset[1]: # ending is the same
                translation_offset[1] = translation_offset[0]+trans_length
                updated_meta.append((token_offset, translation_offset, tag))
                offset_diff = translation_offset[1] - trans_offset[1]
            elif token_offset[1] < tkn_offset[1]: # ends within this segment
                trailing_seg = source[token_offset[1]: tkn_offset[1]]
                translation_offset[1] = translation_offset[0]+trans_length
                updated_meta.append((token_offset, translation_offset, tag))
                updated_draft += trailing_seg
                updated_meta.append(((token_offset[1], tkn_offset[1]),(translation_offset[1],translation_offset[1]+len(trailing_seg)),"untranslated"))
                offset_diff = translation_offset[1]+len(trailing_seg) - trans_offset[1]
            else: # ends after this segment
                pass
        elif tkn_offset[1] < token_offset[1]: # our area of interest come after this segment
            updated_draft += draft[trans_offset[0]: trans_offset[1]]
            updated_meta.append(meta)
        else: # our area of interest was before this segment
            updated_draft += draft[trans_offset[0]: trans_offset[1]]
            updated_meta.append((tkn_offset, (trans_offset[0]+offset_diff, trans_offset[1]+offset_diff),status))
    return updated_draft, updated_meta
    
source = "hello, my dear friend!"
    
draft, meta = replace_token(source, (0,5), "ഹലോ")
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)
print("---------------\n")
draft, meta = replace_token(source, (15,21), "ചങ്ങാതീ", draft, meta, "suggestion")
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)
print("---------------\n")
draft, meta = replace_token(source, (7,9), "എന്റെ", draft, meta)
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)
print("---------------\n")
draft, meta = replace_token(source, (7,14), "എന്റെ പ്രിയ", draft, meta)
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)
print("---------------\n")
draft, meta = replace_token(source, (15,22), "സുഹൃത്തെ***", draft, meta)
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)
print("---------------\n")
draft, meta = replace_token(source, (15,21), "സുഹൃത്തെ", draft, meta)
print(draft, meta)
print('\n')
display_alignment(source, draft, meta)


ഹലോ, my dear friend! [((0, 5), [0, 3], 'confirmed'), ((5, 22), (3, 20), 'untranslated')]


hello  -->  ഹലോ ( confirmed )
, my dear friend!  -->  , my dear friend! ( untranslated )
---------------

ഹലോ, my dear ചങ്ങാതീ! [((0, 5), [0, 3], 'confirmed'), ((5, 15), (3, 13), 'untranslated'), ((15, 21), [13, 20], 'suggestion'), ((21, 22), (20, 21), 'untranslated')]


hello  -->  ഹലോ ( confirmed )
, my dear   -->  , my dear  ( untranslated )
friend  -->  ചങ്ങാതീ ( suggestion )
!  -->  ! ( untranslated )
---------------

ഹലോ, എന്റെ dear ചങ്ങാതീ! [((0, 5), [0, 3], 'confirmed'), ((5, 7), (3, 5), 'untranslated'), ((7, 9), [5, 10], 'confirmed'), ((9, 15), (10, 16), 'untranslated'), ((15, 21), (16, 23), 'suggestion'), ((21, 22), (23, 24), 'untranslated')]


hello  -->  ഹലോ ( confirmed )
,   -->  ,  ( untranslated )
my  -->  എന്റെ ( confirmed )
 dear   -->   dear  ( untranslated )
friend  -->  ചങ്ങാതീ ( suggestion )
!  -->  ! ( untranslated )
---------------

ഹലോ, എന്റെ പ്രിയ ചങ്ങാതീ! [((0, 5), [0, 3

In [48]:
import re
from math import floor, ceil
def extract_context(token, offset, sentence, window_size=5, punctuations=utils.punctuations()+utils.numbers()):
    '''return token index and context array'''
    punct_pattern = re.compile('['+''.join(punctuations)+']')
    front = sentence[:offset[0]]
    rear = sentence[offset[1]:]
    front = re.sub(punct_pattern, "", front)
    rear = re.sub(punct_pattern, "", rear)
    front = front.split()
    rear = rear.split()
    if len(front) >= window_size/2:
        front  = front[-floor(window_size/2):]
    if len(rear) >= window_size/2:
        rear = rear[:ceil(window_size/2)]
    index = len(front)
    context = front + [token] + rear
    return index, context

print(extract_context("work", (14,18), "better do the work and then come back to check"))
print(extract_context("the work", (5,13), "I do the work and then come back to check"))
print(extract_context("the work", (5,13), "I do the work and, then come back to check"))
print(extract_context("work", (0,4), "work and then come back to check"))

(2, ['do', 'the', 'work', 'and', 'then', 'come'])
(2, ['I', 'do', 'the work', 'and', 'then', 'come'])
(2, ['I', 'do', 'the work', 'and', 'then', 'come'])
(0, ['work', 'and', 'then', 'come'])


In [64]:
import re
import pdb
import utils
   
def auto_translate(sentence_list, source_lang, target_lang,
    punctuations=None, stop_words=None):
    '''Attempts to tokenize the input sentence and replace each token with top suggestion.
    If draft_meta is provided indicating some portion of sentence is user translated, it is left untouched.
    Output is of the format [(sent_id, translated text, metadata)]
    metadata: List of (start_offset, end_offset, confirmed/suggestion/untranslated)'''
    if not punctuations:
        punctuations = utils.punctuations()+utils.numbers()
        punct_pattern = re.compile('['+''.join(punctuations)+']')
    if not stop_words:
        stop_words = utils.stopwords(source_lang)
    sentence_dict = {}
    for item in sentence_list:
        sent_obj = {
            "source": item[1],
#             "source_chunks": [chunk.strip() for chunk in re.split(punct_pattern, item[1])],
            "draft":item[2],
            "draft_meta":item[3]
        }
        if sent_obj['draft'] is None:
            sent_obj['draft'] = ''
        if sent_obj['draft_meta'] is None:
            sent_obj['draft_meta'] = []
        sentence_dict[item[0]]=sent_obj 
    suggestions_model = t # load corresponding trie for source and target if not already in memory
    tokens = tokenize(source_lang, sentence_list)
    for token in tokens:
        for occurence in tokens[token]:
            offset = (occurence[1], occurence[2])
            index, context = extract_context(token, offset, sentence_dict[occurence[0]]['source'])
            suggestions = get_translation_suggestion(index, context, t)
            if len(suggestions) > 0:
                draft, meta = replace_token(sentence_dict[occurence[0]]['source'], offset, suggestions[0][0], 
                              sentence_dict[occurence[0]]['draft'], sentence_dict[occurence[0]]['draft_meta'], 
                              "suggestion")
                sentence_dict[occurence[0]]['draft'] = draft
                sentence_dict[occurence[0]]['draft_meta'] = meta
            elif sentence_dict[occurence[0]]['draft'] == '':
                sentence_dict[occurence[0]]['draft'] = sentence_dict[occurence[0]]['source']
                indices = (0,len(sentence_dict[occurence[0]]['source']))
                sentence_dict[occurence[0]]['draft_meta'] = [(indices, indices, "untranslated")]
                
    return sentence_dict

sentences = [(41001001, "Here is it, a sample: it tells us what? yes a lot!!! isn't it", None, None),
             (41001002, "Once there was a hare and a tortoise.", None, None),
             (41001003, "One day, they got into an argument.", None, None),
             (41001004, "The hare said, 'I am the fastest'.", None, None),
             (41001005, "Then tortoise replied, 'no no no'", None, None)]
auto_translate(sentences, "eng", "mal")

{41001001: {'source': "Here is it, a sample: it tells us what? yes a lot!!! isn't it",
  'draft': "Here is it, a sample: it tells us what? yes a lot!!! isn't it",
  'draft_meta': [((0, 61), (0, 61), 'untranslated')]},
 41001002: {'source': 'Once there was a hare and a tortoise.',
  'draft': 'Once there was a മുയല്\u200d and a tortoise.',
  'draft_meta': [((0, 17), (0, 17), 'untranslated'),
   ((17, 21), [17, 23], 'suggestion'),
   ((21, 37), (23, 39), 'untranslated')]},
 41001003: {'source': 'One day, they got into an argument.',
  'draft': 'One day, they got into an argument.',
  'draft_meta': [((0, 35), (0, 35), 'untranslated')]},
 41001004: {'source': "The hare said, 'I am the fastest'.",
  'draft': "The മുയല്\u200d said, 'I am the fastest'.",
  'draft_meta': [((0, 4), (0, 4), 'untranslated'),
   ((4, 8), [4, 10], 'suggestion'),
   ((8, 34), (10, 36), 'untranslated')]},
 41001005: {'source': "Then tortoise replied, 'no no no'",
  'draft': "Then tortoise replied, 'no no no'",
  'draf

In [None]:
def get_draft(db_, project_id, sent_ids=None):
    '''fetches drafts for a set of sentences in a project, to be displayed on UI or sent for USFM creation'''
    project = db_.query(AgmtProject).filter(db_models.AgmtProject.projectId == project_id).first()
    if project is None:
        raise NotAvailableException("The given project not available in database")
    
    query = db_.query(db_models.TranslationDrafts).filter(
        db_models.TranslationDrafts.project_id == project_id)
    if sent_ids:
        query = query.filter(db_models.TranslationDrafts..sentenceId in sent_ids)
    return drafts    
        
    

In [9]:
import pdb
import utils
from custom_exceptions import NotAvailableException

def create_usfm(sent_drafts):
    '''Creates minimal USFM file with basic markers from the input verses list
    input: List of (bbbcccvvv, "generated translation")
    output: List of usfm files, one file per bible book'''
    book_start = '\\id {}\n'
    chapter_start = '\\c {}\n\\p\n'
    verse = '\\v {} {}'
    usfm_files = []
    file = ''
    prev_book = 0
    prev_chapter = 0
    prev_verse = 0
    book_code = ''
    sentences = sorted(sent_drafts, key=lambda x:x[0], reverse=False)
    #pdb.set_trace()
    for sent in sentences:
        #pdb.set_trace()
        verse_num = sent[0] % 1000
        chapter_num = int((sent[0] /1000) % 1000)
        book_num = int(sent[0] / 1000000)
        if book_num != prev_book:
            if file != '':
                usfm_files.append(file)
            book_code = utils.book_code(book_num)
            if book_code is None:
                #pdb.set_trace()
                raise NotAvailableException("Book number %s not a valid one" %book_num)
            file = book_start.format(book_code)
            prev_book = book_num
        if chapter_num != prev_chapter:
            file += chapter_start.format(chapter_num)
            prev_chapter = chapter_num
        file += verse.format(verse_num, sent[1])
    if file != '':
        usfm_files.append(file)
    return usfm_files

draft = [
    (1001001,"the first verse",[(0,2,"confirmed"), (4,13,"suggestion")]), #with metadata
    (1001002,"the second verse"),
    (1002001,"the first verse of new chapter"),
    (1001003,"the verse out of order"), 
    (1002002,"the next verse"),
    (1002003,"the last verse"),
    (2001001,"the first verse of new book"),
    (2001001,"the next verse"),
    #(40001001, "invalid book"), 
]
create_usfm(draft)

['\\id GEN\n\\c 1\n\\p\n\\v 1 the first verse\\v 2 the second verse\\v 3 the verse out of order\\c 2\n\\p\n\\v 1 the first verse of new chapter\\v 2 the next verse\\v 3 the last verse',
 '\\id EXO\n\\c 1\n\\p\n\\v 1 the first verse of new book\\v 1 the next verse']

## Suggestions

For every language pair for which we have translation projects or previously available parallel aligned data, we will have a *translation memory learned model*. When we encounter a word in a particular context, we use this learned model to get all possible translations of the word and get them scored based on the current context window. These scored translation can be given to user as suggestions.

I think the key thing to decide in this is, how do we store the translation memory to be used efficiently.
Or what do we mean by a learned translation model for a language pair?

The options that occur to me are as follows:
1. Query the translation memory(alignment) **SQL table** based on the key word and check and sort them based on the context window(I am afraid, this will have poor performace in terms of time and space)
2. Periodically build a **trie** structure from translation memory(alignment) table and query this trie for suggestions. I am not yet familiar with trie. I hope it allows us to search based on a context window efficiently. One draw back I can see in this is "learning" will not happen in real time and data user adds will take time(depending on how often we run the learning script) to improve the suggestions quality.
3. While we keep a translation memory table in SQL DB, parallely bulid a **graph** structure with it in DGraph. Use this graph for suggestions and use the table in SQL DB for draft generation. 
4. Build a **nueral network** (or ML) model that can be trained with word and context window and can predict the translation. I am not sure if we can get such a model to give multiple translations with varing scores. Building, storing, and using such models can also become expensive in terms of time and space. 

*Answer*: Option 2, trie built from SQL table, for now and 3, 4 for later

#### Proposed tire structure for AgMT

* Have one trie per source-target language pair, this would reduce the size and thus increase search performance at level 1
* Each node will have
	* a key: the context. The window size increases by one at each level
	* translations: list of all seen translations and their count for the given context. The count and current level can be used for scoring suggestions(score = level*count/total_occurances, total_occurance=sumOfCountsAtLevel1)
	* children: context increases by one word to right or left from the current context

input: 
```
[
{"token": "house","context":"They use barrels to house their pets","translation":"പാര്‍പ്പിക്കുക"},
{"token": "house","context":"His house is to the left","translation":"വീട്"},
{"token": "house","context":"Their house contruction methods are different","translation":"ഭവന"},
{"token": "house","context":"Last time I went to his house,","translation":"വീട്ടിലേക്ക്"},
{"token": "house","context":"Museums house large collection of Roman sculpture","translation":"ഉള്‍ക്കൊള്ളുന്നു"}
]
```
A trie of window size 3
![trie diagram](example_trie.png)


In [21]:
def form_trie_keys(prefix, to_left, to_right, prev_keys):
    '''build the trie tree recursively'''    
    keys = prev_keys
    a = b = None
    if len(to_left) > 0:
        a = '/L:'+to_left.pop(0)
    if len(to_right) > 0:
        b = '/R:'+to_right.pop(0)
    if a:
        key_left = prefix + a
        keys.append(key_left)
        keys = form_trie_keys(key_left, to_left.copy(), to_right.copy(), keys)
    if b:
        key_right = prefix + b
        keys.append(key_right)
        keys = form_trie_keys(key_right, to_left.copy(), to_right.copy(), keys)
    if a and b:
        key_both_1 = prefix + a + b
        key_both_2 = prefix + b + a
        keys.append(key_both_1)
        keys.append(key_both_2)
        keys = form_trie_keys(key_both_1, to_left.copy(), to_right.copy(), keys)
        keys = form_trie_keys(key_both_2, to_left.copy(), to_right.copy(), keys)
    return keys

token = "house"
#context = ["house"]
context = ["his", "house", "is"]
#context = ["his", "house"]
#context = ["house", "is"]
#context = ["says","his", "house", "is", "in", "town"]
#context = ["house", "is", "in", "town"]
#context = ["he","says","his", "house"]
#context = ["he","says","his", "house", "is", "in", "town"]

token_index = context.index(token)
to_left = [context[i] for i in range(token_index-1, -1, -1)]
to_right = context[token_index+1:]
form_trie_keys(token, to_left, to_right, [token])

['house', 'house/L:his', 'house/R:is', 'house/L:his/R:is', 'house/R:is/L:his']

In [53]:
import pygtrie
from custom_exceptions import TypeException

def build_trie(token_context__trans_list):
    '''Build a trie tree from scratch
    input: [(token,context_list, translation), ...]'''
    t = pygtrie.StringTrie()
    for item in token_context__trans_list:
        context = item[1]
        translation = item[2]
        if isinstance(item[0], str):
            token = item[0]
            token_index = context.index(token)
        elif isinstance(item[0], int):
            token_index = item[0]
            token = context[token_index]
        else:
            raise TypeException("Expects the token, as string, or index of token, as int, in first field of input tuple")
        to_left = [context[i] for i in range(token_index-1, -1, -1)]
        to_right = context[token_index+1:]
        keys = form_trie_keys(token, to_left, to_right, [token])
        for key in keys:
            if t.has_key(key):
                value = t[key]
                if translation in value.keys():
                    value[translation] += 1
                else:
                    value[translation] = 1
                t[key] = value
            else:
                t[key] = {translation: 1}
    return t

training_data = [
    ("bank", ["bank", "is", "closed"], "ബാങ്ക്"),
    ("bank", ["they", "bank", "on", "us"], "ആശ്രയിക്കുക"),
    ("bank", ["pay", "bank", "back"], "ബാങ്ക്"),
    ("bank", ["river", "bank", "is", "muddy"], "തീരം"),
    ("bank", ["Ganga","has", "wide" , "bank", "on", "sides"], "തീരം"),
    ("bank", ["bank", "manager", "spoke"], "ബാങ്ക്"),
    ("hare", ["a", "hare", "is", "same", "as", "rabbit"], "മുയല്‍")]
t = build_trie(training_data)


display_tree(t)

	- bank =>  {'ബാങ്ക്': 3, 'ആശ്രയിക്കുക': 1, 'തീരം': 2}
	- bank	- R:is =>  {'ബാങ്ക്': 1, 'തീരം': 1}
	- bank	- R:is	- R:closed =>  {'ബാങ്ക്': 1}
	- bank	- R:is	- R:muddy =>  {'തീരം': 1}
	- bank	- R:is	- L:river =>  {'തീരം': 1}
	- bank	- R:is	- L:river	- R:muddy =>  {'തീരം': 1}
	- bank	- L:they =>  {'ആശ്രയിക്കുക': 1}
	- bank	- L:they	- R:us =>  {'ആശ്രയിക്കുക': 1}
	- bank	- L:they	- R:on =>  {'ആശ്രയിക്കുക': 1}
	- bank	- L:they	- R:on	- R:us =>  {'ആശ്രയിക്കുക': 1}
	- bank	- R:on =>  {'ആശ്രയിക്കുക': 1, 'തീരം': 1}
	- bank	- R:on	- R:us =>  {'ആശ്രയിക്കുക': 1}
	- bank	- R:on	- L:they =>  {'ആശ്രയിക്കുക': 1}
	- bank	- R:on	- L:they	- R:us =>  {'ആശ്രയിക്കുക': 1}
	- bank	- R:on	- L:has =>  {'തീരം': 1}
	- bank	- R:on	- L:has	- L:Ganga =>  {'തീരം': 1}
	- bank	- R:on	- L:has	- R:sides =>  {'തീരം': 1}
	- bank	- R:on	- L:has	- R:sides	- L:Ganga =>  {'തീരം': 1}
	- bank	- R:on	- R:sides =>  {'തീരം': 1}
	- bank	- R:on	- R:sides	- L:Ganga =>  {'തീരം': 1}
	- bank	- R:on	- R:sides	- L:has =>  {'തീരം': 1}
	- b

In [23]:
import pdb;

def get_translation_suggestion(word, context, t):
    '''find the context based translation suggestions for a word.
    Makes use of the learned model, t, for the lang pair, based on translation memory
    output format: [(translation1, score1), (translation2, score2), ...]'''
    if isinstance(word, str):
        token_index = context.index(word)
    elif isinstance(word, int):
        token_index = word
        word = context[token_index]
    single_word_match = list(t.prefixes(word))
    #pdb.set_trace()
    if len(single_word_match) == 0:
        return []
    total_count = sum(single_word_match[0].value.values())
    to_left = [context[i] for i in range(token_index-1, -1, -1)]
    to_right = context[token_index+1:]
    keys = form_trie_keys(word, to_left, to_right, [word])
    keys = sorted(keys, key = lambda x : len(x), reverse=True)
    suggestions = {}
    prev_path_length = 0
    for k in keys:
        if len(k) < prev_path_length:
            # avoid searching with all the lower level keys
            break
        prev_path_length = len(k)
        all_matches = t.prefixes(k)
        for match in all_matches:
            levels = len(match.key.split("/"))
            #pdb.set_trace()
            for trans in match.value:
                score = match.value[trans]*levels*levels / total_count
                if trans in suggestions:
                    if suggestions[trans] < score:
                        suggestions[trans] = score
                else:
                    suggestions[trans] = score
    sorted_suggestions = {k: suggestions[k] for k in sorted(suggestions, key=suggestions.get, reverse=True)}
    return [(key, suggestions[key]) for key in sorted_suggestions]

#get_translation_suggestion("bank", ["pay", "bank", "the", "money"], t)
#get_translation_suggestion("bank", ["bank", "the", "money"], t)
get_translation_suggestion(1, ["river", "bank", "is", "near"], t)
#get_translation_suggestion("bank", ["people", "bank", "on", "others"], t)
#get_translation_suggestion("bank", ["bank"], t)
#get_translation_suggestion("hill", ["that","hill", "is"], t)


[('തീരം', 1.5),
 ('ബാങ്ക്', 0.6666666666666666),
 ('ആശ്രയിക്കുക', 0.16666666666666666)]

## Tables

![Projects and Translation memory tables](agmt_tables.png)

### Changes to above design

* Do not store suggestions in DB. Obtain them from trie dynamically (removes it from translation memory). Or may be we can store it in the drafts/draft_metadata field, by keeping top 3 suggestions there, and not just the top one.
* Move the source sentences and drafts from projects table to a separate table(drafts)
* Store sentence/verse wise drafts in drafts table along with which part of that draft has
    * user confirmed translation
    * top scored suggestions
    * not translated
    
    in metadata
    eg: `[((0, 5), (0, 3), 'confirmed'), ((5, 15), (3, 13), 'untranslated'), ((15, 21), (13, 20), 'suggestion')]`
    
    This will allow UI to display them with different styles. Also allow us to have phrase tokens and convert the data into the form of an alignment JSON for data export
* Also store who corrected each verse/sentence last. The SB alignment flavor has to specify which user did which reference range. Also adds user info to translation memory table
* split earlier composite fields into separate columns(occurences and sentences)
* The `translation memory` table is made independant of projects. 
* user table is removed, as that data will be stored in Kratos    

![Revised tables](./agmt_tables3.png)

### How these table structures work

* **Projects table**: For every AgMT project user starts, there will be one entry here. The document type will indicate the output format, which will be USFM for the current scope of AgMT mode(ie., only bible translation)
* **Drafts table**: For all the AgMT projects their **entire source and coressponding draft**(according to current project status) will be saved here. Each row will have one sentence/verse. For a project that uses all/some books from an exiting bible in DB, these source sentences will be copied from the corresponding `_bible_cleaned` table. If the user uploads hisown usfm, the verse texts from that will be copied here. This will allow user to use their own sources/refernce bibles. Also user can add more books to a existing project.
    * `sentence_id`: For a bible translation project this will be `ref_id`(bbbcccvvv). For a Notes or commentary project, we want to include later, we can use this `ref_id` itself. For a story project, may be paragraghCount-sentenceCount can be used. The idea is, we should be able to generate the draft in a required format as USFM, CSV, doc etc from this numbering system for that specific documentType/projectType. `sentence_id` and `project_id` combo will be unique and indexed. The table will be searched based on these fields for retrieving all occurences of a token, for displaying those sentences/verses and drafts on UI.
    * `sentence`: will have the actual source sentence, the verse, for a bible translation project. when user is working on a token, we can display all sentences/verses where it occurs on screen, so he can choose different senses for each occurance, if applicable.
    * `draft`: will have the current generated draft, for the corresponding sentence/verse. This may include some user-translated tokens, some untranslated tokens and some tokens replaced by top suggestions we provide. This field, along with the next `draft_meta` field will be where the project data will be updated as the translation progresses. Saving the draft like this will allow to display them on screen in real time, for all the sentence/verses where the current token occurs, allowing user to make proper decisions on which sense to be used in the context and also get a better sense of translation progress. This field will a plain text field. For the USFM generation we will be taking this field and generating the USFM based on the `sentence_id` values.
    * `draft_meta`: This will a JSON field of the following format. `[((0, 5), (0, 3), 'confirmed'), ((5, 15), (3, 13), 'untranslated'), ((15, 21), (13, 20), 'suggestion')]`. It is here the source sentence is split up into segments by start and end offsets and indicate the status of how the draft is obtained for each segment. When user makes a translation those tokens will be replaced in the draft and corresponding change will be made here setting that source token's offset values to "confirmed". Similary when we run the suggestions module on the entire project or specific sentences, we would replace tokens in drafts with top suggestions and update those offsets and values to "suggestion" here. This info can be used to display the generated draft with different styles on the UI, for example a dark colour for confirmed translations, a moderated colour for suggestions and lighter shade for un-translated. Thus user will have a clear idea on progress of work upon seeing the drafts displayed.
* **translation_memory table**: This stores all the known tokens and their known translations.
    * updation: Normally it is updated, along with the draft table, when ever user makes a translation on the AgMT UI and saves it. If we are able to add an alignmnet mode to the tool which allows to user to change alginement smanually, that data will also be updated here. It can also be updated externally if we have aligned training data from else where.
    * accesing: This table can be refered to check if a token is known to us, while tokenization. Also it can be used to get all known translations of a token irrespective of context. The source_lang, target_lang, token combo is unique and indexed.
* One info missing, with this kind of table structure, would be, who made a specific translation, as we only store last upated user for the whole sentence/verse
* The tokens generated are not stored anywhere in DB, unless they have a user confirmed translation. This allows us to keep tokenization dynamic and based on tokens marked by user via alignments.
