# GCM Library Mode Demo

- Library Mode is a lightweight Python wrapper around the GCM toolkit.
- It is designed for quick prototyping in Python's friendly programming style.
- It is not a standalone library; it depends on the GCM repository.

Let's see how we can use this nifty interface for experimenting with Code-mixing!

## Importing libraries and Loading the corpus

In [73]:
import matplotlib
matplotlib.use('Agg')

from nltk.tree import Tree
import matplotlib.pyplot as plt

In [74]:
# visualizing the tree
def draw_tree(treestring):
    tree = Tree.fromstring(treestring)
    tree.pretty_print()

We've defined the following corpus for code-mixing. It contains 20 sentences in Chinese (already segmented into space-separated words) and their corresponding English translations. Note that `lang1` is the source language and `lang2` is the target language. Each side is a single string with one sentence per line, and line *i* of `lang1` is the translation of line *i* of `lang2`. This is the format in which GCM expects the input data.

In [75]:
corpus = {
    'lang1': """如果 您 感到 任何 不适 ， 请 立即 联系 医生 。
但 人们 一直 要求 降低 投票 年龄 。
而且 ， 以 你 的 主 的 名义 ， 我们 确实 将 聚集 他们 和 魔鬼 ， 然后 我们 将 把 他们 带到 地狱 周围 ， 蹲伏 着 。
创建 一个 接触 这条线 的 圆
人 难道 没有 经历 过 一段 时期 ， 他 甚至 没有 被 提及 吗 ？
该州 有 20 种 竹子 ， 其中 以 melocanna baccifera 为主 ， 占 该州 竹林 面积 的 95% 。   竹子 在 该州 被 广泛 使用
现在 ， 还有 另 一件 事 影响 定价 ， 我们 应该 考虑一下 。
董事会 常设 委员会 如下 ：
当 联系人 离线 时 激活 通知
哮喘 是 一种 影响 呼吸道 的 严重 疾病 。
伤口 愈合 后 无法 通过 手术 切除 的 空洞 。
这样 ， 规定 的 费用 将 被 发送 ；
有点 可怕 ， 因为 他会 说出 这样的话 。
为什么 你 能 对 他 有 耐心 ， 摩西 说 保持 闲暇 如果 上帝 愿意 你 会 发现 我 是 一个 重新 列出 的 人
日 视图 的 第二 时 区
感知 是 有 限度 的 。
城市 地区 发展 了 一大批 服装 及 服装 加工 单位 。
提供 对 国际 信息源 的 访问
从 收入 来源 收到 的 存款 金额 。
是否 应该 推迟 审判 并 在 今天 晚些时候 重审 此案 ？""",
    'lang2': """contact the doctor immediately if you experience any discomfort.
But there was a constant demand that the voting age should be reduced .
and , by thy lord , verily we shall assemble them and the devils , then we shall bring them , crouching , around hell .
Create a circle that touches this line
has there not come upon man a period of time when he was not a thing even mentioned ?
it has 20 bamboo species , of which melocanna baccifera mautak is predominant and occupies 95 of the bamboo afforested land in the state . bamboo is widely used in the state
Now , another thing affects pricing , and we should think about it .
The Standing Committees of the Board are as follows:
Activate notification when the contact goes offline
Asthma is a serious disease that affects your respiratory tracts .
A cavity which cannot be surgically removed after healing the wound .
With this , the prescribed fee will be sent;
It was kind of scary , because he would say such things .
Why can you be patient with him , Moses said keep your leisure If God willing you will find me a relist of man
the second timezone for a day view
There is a limit to perception .
a large number of garment and garment processing units have developed in urban areas .
Providing access to international information sources
The amount of deposits received from an income resource .
Should the trial be postponed and the case be retried later today ?"""
}
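
Because the two sides are paired by line position, it can help to confirm that they have the same number of non-empty lines before running any stage. The following is a minimal sketch, not part of the GCM API: `check_parallel` is a hypothetical helper name, and the two-sentence `sample` stands in for the `corpus` dict above.

```python
def check_parallel(corpus):
    """Return the number of sentence pairs, or raise if the two sides do not line up."""
    lang1_lines = corpus['lang1'].split('\n')
    lang2_lines = corpus['lang2'].split('\n')
    if len(lang1_lines) != len(lang2_lines):
        raise ValueError(
            f"lang1 has {len(lang1_lines)} lines but lang2 has {len(lang2_lines)}"
        )
    # An empty line on either side would shift every later pair out of sync.
    for idx, (src, tgt) in enumerate(zip(lang1_lines, lang2_lines)):
        if not src.strip() or not tgt.strip():
            raise ValueError(f"empty sentence at line {idx}")
    return len(lang1_lines)


# Two pairs taken from the corpus above, used here as a stand-in.
sample = {
    'lang1': "感知 是 有 限度 的 。\n日 视图 的 第二 时 区",
    'lang2': "There is a limit to perception .\nthe second timezone for a day view",
}
print(check_parallel(sample))
```

Calling `check_parallel(corpus)` on the full dictionary should return 20 for the corpus defined above.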

The GCM toolkit runs as a sequential pipeline with the following three stages (Aligner, Pre-GCM, and GCM):

![](https://drive.google.com/uc?id=1ez8VH-Gs0MSEc6TW-7k_-5grigG5EQ69)

We will now see how to invoke each stage.

## I. Aligner stage

 - In this stage, we generate word-level alignments between the source and target sentence pairs.
 - Currently, we support the `fast_align` aligner, which works well for our use case, but you can swap in your own aligner.

In [76]:
from gcm.aligners import fast_align

# code to generate alignments using fast_align
aligns = fast_align.gen_aligns(corpus)

loading fast_align..
ARG=i
ARG=o
ARG=v
INITIAL PASS 
expected target length = source length * 1.22243
ITERATION 1
  log_e likelihood: -5491.67
  log_2 likelihood: -7922.8
     cross entropy: 29.8974
        perplexity: 1e+09
      posterior p0: 0.0827692
 posterior al-feat: -0.311961
       size counts: 17
ITERATION 2
  log_e likelihood: -2276.74
  log_2 likelihood: -3284.64
     cross entropy: 12.3949
        perplexity: 5385.5
      posterior p0: 0.113942
 posterior al-feat: -0.287999
       size counts: 17
ITERATION 3
  log_e likelihood: -1411.41
  log_2 likelihood: -2036.24
     cross entropy: 7.68391
        perplexity: 205.63
      posterior p0: 0.127499
 posterior al-feat: -0.277739
       size counts: 17
ITERATION 4
  log_e likelihood: -1320.36
  log_2 likelihood: -1904.88
     cross entropy: 7.18821
        perplexity: 145.837
      posterior p0: 0.129792
 posterior al-feat: -0.272101
       size counts: 17
ITERATION 5 (FINAL)
  log_e likelihood: -1296.87
  log_2 likelihood: -

In [77]:
# generated alignments
aligns

['0-5 1-2 1-3 1-4 1-6 1-7 1-8 2-2 3-2 4-2 10-9',
 '0-0 1-0 2-0 3-0 4-0 5-0 6-0 7-13',
 '1-0 8-7 9-7 10-1 11-1 12-1 13-1 14-1 15-1 28-25',
 '0-0 0-2 1-1 2-0 3-0 4-1 5-0',
 '0-5 1-2 2-0 2-1 2-2 2-3 2-4 3-2 4-2 5-2 6-2 7-15 14-18',
 '0-8 0-9 0-10 0-12 0-23 1-11 2-3 3-3 4-3 5-0 7-5 11-6 16-7 18-24 20-17',
 '0-1 1-0 6-4 8-8 9-6 9-8 10-7 12-13',
 '0-1 0-2 1-1 2-1 3-1 4-1',
 '0-0 0-1 0-2 1-0 2-0 3-3',
 '0-0 1-1 2-0 3-6 5-2 8-10',
 '0-0 0-1 1-0 2-0 3-0 4-0 5-0 6-0 7-2 9-11',
 '0-0 3-1 5-2 6-3',
 '0-0 1-0 2-5 7-12',
 '1-0 4-23 5-20 7-7 8-2 9-2 10-2 11-2 12-2 13-2 14-2 15-2 16-2 17-2 18-2 19-11 23-17',
 '0-1 0-2 0-3 1-1 2-4 3-1 4-0 5-1',
 '0-0 1-1 2-1 3-0 4-4 5-6',
 '0-4 1-4 2-4 3-4 4-4 5-4 6-4 7-4 8-4 9-4 10-14',
 '0-0 0-1 4-2',
 '0-1 1-1 2-1 3-1 7-9',
 '0-2 0-3 0-4 1-5 5-1 10-12']
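
As the output above shows, the aligner returns one string per sentence pair, each holding space-separated `i-j` links. Assuming that this list is all the later stages need from an aligner (the interface is not spelled out here), a replacement could be sketched as follows. `diagonal_aligns` is a hypothetical stand-in that links tokens along the diagonal; it only illustrates the output shape and is not a usable alignment model.

```python
def diagonal_aligns(corpus):
    """Return one 'i-j i-j ...' string per sentence pair, linking tokens along the diagonal."""
    aligns = []
    pairs = zip(corpus['lang1'].split('\n'), corpus['lang2'].split('\n'))
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        links = []
        if tgt_tokens:
            for i in range(len(src_tokens)):
                # Scale the source position onto the target sentence length.
                j = i * len(tgt_tokens) // len(src_tokens)
                links.append(f"{i}-{j}")
        aligns.append(' '.join(links))
    return aligns


sample = {
    'lang1': "感知 是 有 限度 的 。",
    'lang2': "There is a limit to perception .",
}
print(diagonal_aligns(sample))
```

A real replacement would compute the links with an actual alignment model and keep only the return format.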

- Since `fast_align` is a statistical aligner, alignment quality depends heavily on the size of the input corpus. Here we used only 20 sentence pairs, so the alignments above are poor.
- We recommend at least 1,000 sentence pairs for better alignments; the bigger the corpus, the better.
- The alignments below cover the same 20 sentences but were generated from a much larger corpus. We'll use these going forward, but feel free to compare them with the ones above and to check how they affect the quality of the generated code-mixed text.

In [78]:
alignments_generated_bigger_corpus = ['0-4 1-5 2-6 3-7 4-8 7-3 8-0 9-2 9-1',
'0-0 2-4 3-5 4-12 5-8 6-9',
'0-0 2-2 9-7 10-8 11-9 13-10 15-13 23-24',
'0-0 1-1 2-4 3-6 5-2',
'0-5 2-2 6-7 8-11 9-16 10-13 12-17',
'0-23 3-4 4-3 15-18 15-19 16-20 19-24 23-27 24-28',
'0-0 5-3 6-4 7-5 9-8 10-9 11-10 11-11 11-12',
'0-5 1-1 2-2 3-8',
'0-2 1-4 2-6 4-0 5-1',
'0-0 1-1 2-2 3-6 4-8 4-9 6-3 7-4',
'0-10 1-8 3-3 6-6 8-1',
'0-1 2-4 4-5 5-6 7-8',
'1-4 3-6 4-7 5-9 6-10 6-11',
'0-0 1-2 2-1 4-6 6-4 8-8 9-9 10-10 11-12 12-13 13-14 14-15 15-16 17-18 18-19 20-20 24-23',
'0-5 1-6 3-1 4-2 5-2',
'0-5 1-1 3-3',
'0-12 1-13 2-10 5-4 6-5 7-6 8-7 9-8',
'0-0 2-3 3-4 3-5',
'0-5 1-7 2-5 3-4 5-3 6-1',
'0-1 1-1 2-4 3-2 4-5 6-11 7-10 8-9 9-7',]
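
To compare alignment quality by eye, it helps to turn the index pairs back into words. In each `i-j` link, `i` indexes a token of the `lang1` sentence and `j` a token of the `lang2` sentence (both zero-based, split on whitespace), which is the convention `fast_align` emits. The sketch below applies a hypothetical helper, `aligned_word_pairs`, to the first sentence pair and its large-corpus alignment.

```python
def aligned_word_pairs(lang1_sentence, lang2_sentence, alignment):
    """Map an 'i-j i-j ...' alignment string to (lang1 word, lang2 word) tuples."""
    src_tokens = lang1_sentence.split()
    tgt_tokens = lang2_sentence.split()
    pairs = []
    for link in alignment.split():
        i, j = (int(part) for part in link.split('-'))
        pairs.append((src_tokens[i], tgt_tokens[j]))
    return pairs


lang1_first = "如果 您 感到 任何 不适 ， 请 立即 联系 医生 。"
lang2_first = "contact the doctor immediately if you experience any discomfort."
alignment_first = "0-4 1-5 2-6 3-7 4-8 7-3 8-0 9-2 9-1"

for zh, en in aligned_word_pairs(lang1_first, lang2_first, alignment_first):
    print(zh, '<->', en)
```

Running the same helper on the small-corpus alignment for this sentence (`'0-5 1-2 1-3 ...'`) makes the difference in quality easy to see.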

In [79]:
import spacy

# Run spaCy's English NER over the lang2 (English) sentences.
nlp = spacy.load("en_core_web_md")
text = corpus['lang2'].split('\n')

# Entity types that are left untouched (numbers, amounts, dates, times).
Label_No_Replacement = ['QUANTITY', 'ORDINAL', 'CARDINAL', 'PERCENT', 'MONEY', 'DATE', 'TIME']

# Name_Entity[i] holds the entity strings found in sentence i.
Name_Entity = []
for l in text:
    doc = nlp(l)
    tmp_list = []
    for ent in doc.ents:
        if ent.label_ not in Label_No_Replacement:
            tmp_list.append(ent.text)
    Name_Entity.append(tmp_list)



In [80]:
Name_Entity

[[],
 [],
 [],
 [],
 [],
 ['melocanna baccifera'],
 [],
 ['The Standing Committees of the Board'],
 [],
 [],
 [],
 [],
 [],
 ['Moses'],
 [],
 [],
 [],
 [],
 [],
 []]

In [81]:
# Translations of the named entities above, looked up online and segmented
# with the same tokenization tool that was used for the lang1 sentences.
Name_Entity_Translation=[[],
 [],
 [],
 [],
 [],
 ['melocanna baccifera'],
 [],
 ['董事会 常设 委员会'],
 [],
 [],
 [],
 [],
 [],
 ['摩西'],
 [],
 [],
 [],
 [],
 [],
 []]

In [82]:
# Replace each English named entity in lang2 with its lang1 translation,
# but only when that translation actually appears in the paired lang1 sentence.
lang2_after_replacement = []
for names, names_trans, lang2_sen, lang1_sen in zip(Name_Entity, Name_Entity_Translation, corpus['lang2'].split('\n'), corpus['lang1'].split('\n')):
    for entity, entity_trans in zip(names, names_trans):
        if entity_trans in lang1_sen:
            lang2_sen = lang2_sen.replace(entity, entity_trans)
    lang2_after_replacement.append(lang2_sen)
corpus['lang2'] = "\n".join(lang2_after_replacement)
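
The same substitution, wrapped as a function and applied to a single sentence pair, makes the effect easier to see. This is a sketch of the loop above with a hypothetical function name (`replace_entities`); it substitutes an entity only when its translation actually occurs in the `lang1` sentence.

```python
def replace_entities(lang2_sentence, lang1_sentence, entities, translations):
    """Swap each entity in the lang2 sentence for its translation, if lang1 contains it."""
    for entity, translation in zip(entities, translations):
        if translation in lang1_sentence:
            lang2_sentence = lang2_sentence.replace(entity, translation)
    return lang2_sentence


print(replace_entities(
    "The Standing Committees of the Board are as follows:",
    "董事会 常设 委员会 如下 ：",
    ["The Standing Committees of the Board"],
    ["董事会 常设 委员会"],
))
```

Replacing the entity before parsing means the parser sees it as part of the English sentence, so it ends up in the parse tree as the translated tokens.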

In [83]:
corpus

{'lang1': '如果 您 感到 任何 不适 ， 请 立即 联系 医生 。\n但 人们 一直 要求 降低 投票 年龄 。\n而且 ， 以 你 的 主 的 名义 ， 我们 确实 将 聚集 他们 和 魔鬼 ， 然后 我们 将 把 他们 带到 地狱 周围 ， 蹲伏 着 。\n创建 一个 接触 这条线 的 圆\n人 难道 没有 经历 过 一段 时期 ， 他 甚至 没有 被 提及 吗 ？\n该州 有 20 种 竹子 ， 其中 以 melocanna baccifera 为主 ， 占 该州 竹林 面积 的 95% 。   竹子 在 该州 被 广泛 使用\n现在 ， 还有 另 一件 事 影响 定价 ， 我们 应该 考虑一下 。\n董事会 常设 委员会 如下 ：\n当 联系人 离线 时 激活 通知\n哮喘 是 一种 影响 呼吸道 的 严重 疾病 。\n伤口 愈合 后 无法 通过 手术 切除 的 空洞 。\n这样 ， 规定 的 费用 将 被 发送 ；\n有点 可怕 ， 因为 他会 说出 这样的话 。\n为什么 你 能 对 他 有 耐心 ， 摩西 说 保持 闲暇 如果 上帝 愿意 你 会 发现 我 是 一个 重新 列出 的 人\n日 视图 的 第二 时 区\n感知 是 有 限度 的 。\n城市 地区 发展 了 一大批 服装 及 服装 加工 单位 。\n提供 对 国际 信息源 的 访问\n从 收入 来源 收到 的 存款 金额 。\n是否 应该 推迟 审判 并 在 今天 晚些时候 重审 此案 ？',
 'lang2': 'contact the doctor immediately if you experience any discomfort.\nBut there was a constant demand that the voting age should be reduced .\nand , by thy lord , verily we shall assemble them and the devils , then we shall bring them , crouching , around hell .\nCreate a circle that touches this line\nhas there not come upon man a period o

## II. Pre-GCM stage

- This stage is responsible for collecting input sentences, alignments, parse trees etc. and packaging them into a format that GCM can work with.

### Generating Parse Trees

- To work with GCM, we need to provide a parse tree for each sentence in one of the input languages (here, the English side, `lang2`).
- We currently support two parsers: `benepar` (Berkeley Neural Parser) and `stparser` (Stanford Parser).

In [84]:
from gcm.parsers import benepar, stparser

# code to use benepar to generate parse trees from the corpus
pt_benepar = benepar.parse(corpus['lang2'])

Loading benepar...
parsing sentences...


In [85]:
# code to use stanford parser to generate parse trees from the corpus
pt_stparser = stparser.parse(corpus['lang2'])

Launching stanford parser server...
parsing sentences...


In [86]:
corpus['lang2'].split("\n")[7]

'董事会 常设 委员会 are as follows:'

In [87]:
# draw the parse tree generated by benepar
draw_tree(pt_benepar[7])

            ROOT                     
             |                        
             S                       
      _______|_____________________   
     |            VP               | 
     |        ____|___             |  
     |       |       SBAR          | 
     |       |     ___|______      |  
     NP      |    |         SINV   | 
  ___|___    |    |          |     |  
NNP  NN  CD VBP   IN        VBZ    : 
 |   |   |   |    |          |     |  
董事会  常设 委员会 are   as      follows  : 



In [88]:
# draw the parse tree generated by stparser
draw_tree(pt_stparser[7])

            ROOT                     
             |                        
            SINV                     
      _______|_____________________   
     |       |       SBAR          | 
     |       |     ___|______      |  
     |       |    |          S     | 
     |       |    |          |     |  
     NP      VP   |          VP    | 
  ___|___    |    |          |     |  
NNP NNP NNP VBP   IN        VBZ    : 
 |   |   |   |    |          |     |  
 董会  常设  委会 are   as      follows  : 



Now that the parse trees are generated, everything is ready for the Pre-GCM stage to start its work.

### Running Pre-GCM

In [89]:
from gcm.stages import pregcm, gcm

# get pregcm output using parse trees generated by benepar
pgcm_benepar = pregcm.process(corpus, alignments_generated_bigger_corpus, pt_benepar)
print(pgcm_benepar[0])

setting up pre-gcm...
running pre-gcm...
pre_gcm: INFO: 2024-03-17 00:27:20,843: ROOT DIR: /home/weihua/CODE-MIX-GENERATE/COD2/CodeMixed-Text-Generator/CodeMixed-Text-Generator
data/ch-to-en-input_lang1
pre_gcm: INFO: 2024-03-17 00:27:20,847: Parsing 20 sentences
pre_gcm: INFO: 2024-03-17 00:27:20,849: Parsing sentences: 0, 19
len of source sents: 20 len of parse trees: 20
pre-gcm completed..
0	0.0
如果 您 感到 任何 不适 ， 请 立即 联系 医生 。
contact the doctor immediately if you experience any discomfort.
(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) (SBAR (IN if) (S (NP (PRP you)) (VP (VBP experience) (NP (DT any) (NN discomfort)))))) (. .)))
0-4 1-5 2-6 3-7 4-8 7-3 8-0 9-2 9-1


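Notice how each record in the Pre-GCM output bundles the information for one sentence pair: a header line that starts with the sentence index, the `lang1` sentence, the `lang2` sentence, the parse tree, and the word alignments.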
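
Assuming each record is a newline-separated string laid out exactly as in the printed output above (five lines per record, which is inferred from that output and not from a documented format), its fields can be pulled apart as sketched here. `split_pregcm_record` is a hypothetical helper name.

```python
def split_pregcm_record(record):
    """Split one Pre-GCM record into named fields, assuming a five-line layout."""
    lines = record.strip().split('\n')
    if len(lines) != 5:
        raise ValueError(f"expected 5 lines in a record, got {len(lines)}")
    header, lang1, lang2, tree, alignment = lines
    return {
        'header': header,
        'lang1': lang1,
        'lang2': lang2,
        'tree': tree,
        'alignment': alignment.split(),
    }


# The first record, copied from the output above.
sample_record = (
    "0\t0.0\n"
    "如果 您 感到 任何 不适 ， 请 立即 联系 医生 。\n"
    "contact the doctor immediately if you experience any discomfort.\n"
    "(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) "
    "(SBAR (IN if) (S (NP (PRP you)) (VP (VBP experience) (NP (DT any) (NN discomfort)))))) (. .)))\n"
    "0-4 1-5 2-6 3-7 4-8 7-3 8-0 9-2 9-1\n"
)

fields = split_pregcm_record(sample_record)
print(fields['lang2'])
print(fields['alignment'])
```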

## III. GCM stage

- The GCM stage is the final stage in the GCM pipeline.
- It takes the output of the Pre-GCM stage and generates code-mixed (CM) sentences.
- Each output CM sentence is followed by its parse tree, which can be easily visualised using libraries like `nltk`.

In [90]:
# generate gcm
gcm_benepar = gcm.gen(pgcm_benepar)

setting up gcm...
running gcm process...


In [91]:
print(gcm_benepar)

['[CM]contact the doctor immediately if you experience any discomfort\n', '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (NP (DT_e any) (NN_e discomfort))))))))\n', '[CM]contact the doctor immediately if you experience any 不适\n', '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (NP (DT_e any) (NN_h 不适))))))))\n', '[CM]contact the doctor immediately if you experience 任何 discomfort\n', '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (NP (DT_h 任何) (NN_e discomfort))))))))\n', '[CM]contact the doctor immediately if you experience 任何 不适\n', '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (N

In [92]:
gcm_benepar[23:25]

['[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_h 您)) (VP (VBP_e experience) (NP (DT_h 任何) (NN_h 不适))))))))\n',
 '[CM]contact the doctor immediately 如果 you experience any discomfort\n']
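
In the output above, `[CM]` and `[TREE]` entries alternate in a flat list. A small sketch of how they might be grouped into (sentence, tree) pairs follows; `pair_cm_with_trees` is a hypothetical helper, and it assumes every `[TREE]` entry belongs to the `[CM]` entry immediately before it.

```python
def pair_cm_with_trees(gcm_output):
    """Group a flat ['[CM]...', '[TREE]...', ...] list into (sentence, tree) tuples."""
    pairs = []
    sentence = None
    for entry in gcm_output:
        entry = entry.strip()
        if entry.startswith('[CM]'):
            sentence = entry[len('[CM]'):]
        elif entry.startswith('[TREE]') and sentence is not None:
            pairs.append((sentence, entry[len('[TREE]'):]))
            sentence = None
    return pairs


# The first two sentence/tree pairs, copied from the printed output above.
sample_output = [
    '[CM]contact the doctor immediately if you experience any discomfort\n',
    '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (NP (DT_e any) (NN_e discomfort))))))))\n',
    '[CM]contact the doctor immediately if you experience any 不适\n',
    '[TREE](ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (S (NP (PRP_e you)) (VP (VBP_e experience) (NP (DT_e any) (NN_h 不适))))))))\n',
]

for sentence, tree in pair_cm_with_trees(sample_output):
    print(sentence)
```

With the pairs in hand, the tree half of any pair can be passed straight to `draw_tree`.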

In [93]:
# visualising CM parse tree
draw_tree(gcm_benepar[23:25][0].split('[TREE]')[1])

                                       ROOT                                   
                                        |                                      
                                       S_e                                    
                                        |                                      
                                       VP_e                                   
    ____________________________________|_____________                         
   |           |                |                    SBAR                     
   |           |                |        _____________|_______                 
   |           |                |       |                     S               
   |           |                |       |      _______________|___             
   |           |                |       |     |                   VP          
   |           |                |       |     |        ___________|____        
   |           NP              ADVP     |     
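
The suffixes on the leaf-level labels appear to mark the language of each word: in the output above, `_e` labels carry English words and `_h` labels carry `lang1` (here Chinese) words. Assuming that reading is right, word-level language tags can be recovered from a `[TREE]` string with a regular expression. `leaf_language_tags` is a hypothetical helper and not part of GCM.

```python
import re

# Matches a leaf node such as "(NN_h 不适)" or "(DT+NN_e the doctor)":
# a label, an underscore, a one-letter language suffix, then the words.
LEAF_PATTERN = re.compile(r'\(([^\s()]+)_([a-z]) ([^()]+)\)')


def leaf_language_tags(tree_string):
    """Return (word, language suffix) tuples for every leaf word, in sentence order."""
    tags = []
    for _label, lang, words in LEAF_PATTERN.findall(tree_string):
        # A leaf such as "the doctor" holds several words sharing one tag.
        for word in words.split():
            tags.append((word, lang))
    return tags


sample_tree = (
    "(ROOT (S_e (VP_e (VB_e contact) (NP (DT+NN_e the doctor)) (ADVP (RB_e immediately)) "
    "(SBAR (IN_e if) (S (NP (PRP_h 您)) (VP (VBP_e experience) (NP (DT_h 任何) (NN_h 不适))))))))"
)

for word, lang in leaf_language_tags(sample_tree):
    print(f"{word}/{lang}", end=' ')
print()
```

Internal nodes such as `S_e` are skipped because their content starts with another bracket, so only leaves contribute tags.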

## Advanced Generation Options

- The Python interface above is deliberately simple to keep it user-friendly.
- The GCM toolkit offers many advanced features and tuning options, such as generating a language tag for each output word, limiting the number of generated sentences, and choosing a sampling technique.
- Each of these options is set in the `CodeMixed-Text-Generator/config.ini` file.
- Once you have changed the config, simply re-run this code and GCM will pick up the new options from there.
- Please read the documentation on **Batch Mode** to understand how the config file works.
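
Since the option names are defined by the toolkit's own `config.ini`, a safe first step is to print what the file currently contains before editing it. The sketch below uses only Python's standard `configparser` and assumes the file follows standard INI syntax. The `read_config` helper is hypothetical, and the path is the one mentioned above, which may need adjusting to your checkout.

```python
import configparser
from pathlib import Path


def read_config(path):
    """Load an INI file into a {section: {option: value}} dictionary."""
    # Interpolation is disabled so that literal '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding='utf-8') as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}


config_path = Path('CodeMixed-Text-Generator/config.ini')
if config_path.exists():
    for section, options in read_config(config_path).items():
        print(f'[{section}]')
        for key, value in options.items():
            print(f'  {key} = {value}')
else:
    print(f'{config_path} not found; adjust the path to your GCM checkout')
```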