# GCM Library Mode Demo

- Library Mode is a lightweight Python wrapper around the GCM toolkit.
- It is designed for quick prototyping in Python's friendly programming style.
- It is not a standalone library; it depends on the GCM repository.

Let's see how we can use this nifty interface for experimenting with Code-mixing!

## Importing libraries and Loading the corpus

In [1]:
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required

from nltk.tree import Tree
import matplotlib.pyplot as plt

In [2]:
# visualizing the tree
def draw_tree(treestring):
    tree = Tree.fromstring(treestring)
    tree.pretty_print()

We've defined the following corpus for code-mixing. It contains 20 sentences in Hindi (a language spoken in India) and their English translations, one sentence per line. Note that `lang1` is the source language and `lang2` is the target language; GCM expects its input data in this format.

In [3]:
corpus = {
    'lang1': """यदि आप तुरंत डॉक्टर से संपर्क करें
परंतु लगातार यह मांग की जा रही थी कि मतदान की आयु कम होनी चाहिए .
और , तेरे प्रभु द्वारा , वास्तव में हम उन्हें और असुरों को इकट्ठा करेंगे , फिर हम उन्हें नरक के चारों ओर , दुबकते हुए लाएंगे .
इस लकीर को स्पर्श करती वृत्त बनाएँ
क्या मनुष्य पर उस समय की अवधि नहीं आई है जब उसका उल्लेख भी नहीं किया गया था ?
बांस की 20 प्रजातियां हैं , जिनमें से मेलैकना बाकीफेरा माटेक प्रमुख है और राज्य में बांस की 95 भूमि पर है . बांस का व्यापक रूप से राज्य में उपयोग किया जाता है .
अब , एक और बात के मूल्य निर्धारण को प्रभावित करता है , और हम इसके बारे में सोचना चाहिए .
बोर्ड की स्थायी समितियाँ निम्नवत हैः-
अधिसूचना सक्रिय करें जब संपर्क ऑफ़लाइन चला जाता है
दमा एक गंभीर बीमारी है , जो आपकी श्वास नलिकाओं को प्रभावित करती है
घाव भरने के पश्चात् रहने वाली एक गुहा जिसे शल्यक्रिया द्वारा नहीं मिटाया जा सकता
इसके साथ निर्धारित शुल्क भेजा जाएगा;
यह एक तरह से डरावना था . क्युंकि वो ऐसी चीज़े कहेगा
उस पर आप सब्र क्योंकर कर सकते हैं मूसा ने कहा आप इत्मिनान रखिए अगर ख़ुदा ने चाहा तो आप मुझे साबिर आदमी पाएँगें
एक दिन के दृश्य के लिए दूसरी बार क्षेत्र
एक सीमा की धारणा है .
शहरी क्षेत्रों में बडी संख्या में परिधान और परिधान प्रसंस्करण इकाइयां विकसित हुई हैं .
अन्तरररष्ट्रीय सूचना स्रोतों पर सुलभता प्रदान करना
किसी आयगत संसाधन से प्राप्त जमा राशि .
क्या मुकदमे की सुनवाई मुल्तवी की जाये और केस की सुनवाई आज बाद में फिर की जाये ?""",
    'lang2': """contact the doctor immediately if you
But there was a constant demand that the voting age should be reduced .
and , by thy lord , verily we shall assemble them and the devils , then we shall bring them , crouching , around hell .
Create a circle that touches this line
has there not come upon man a period of time when he was not a thing even mentioned ?
it has 20 bamboo species , of which melocanna baccifera mautak is predominant and occupies 95 of the bamboo afforested land in the state . bamboo is widely used in the state
Now , another thing affects pricing , and we should think about it .
The Standing Committees of the Board are as follows:
Activate notification when the contact goes offline
Asthma is a serious disease that affects your respiratory tracts .
A cavity which cannot be surgically removed after healing the wound .
With this , the prescribed fee will be sent;
It was kind of scary , because he would say such things .
Why can you be patient with him , Moses said keep your leisure If God willing you will find me a relist of man
the second timezone for a day view
There is a limit to perception .
a large number of garment and garment processing units have developed in urban areas .
Providing access to international information sources
The amount of deposits received from an income resource .
Should the trial be postponed and the case be retried later today ?"""
}
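Because the two sides of the corpus are matched line by line, a quick sanity check before running the pipeline can catch a missing or empty line early. The helper below is a minimal sketch and not part of GCM; `check_parallel_corpus` and the two-sentence `sample` are names introduced here only for illustration.

```python
def check_parallel_corpus(corpus):
    """Return the number of sentence pairs, or raise if the two sides do not line up."""
    src = corpus["lang1"].split("\n")
    tgt = corpus["lang2"].split("\n")
    if len(src) != len(tgt):
        raise ValueError(f"lang1 has {len(src)} lines but lang2 has {len(tgt)}")
    # An empty line on either side would shift every later sentence pair.
    empty = [i for i, (s, t) in enumerate(zip(src, tgt)) if not s.strip() or not t.strip()]
    if empty:
        raise ValueError(f"empty sentence at line(s): {empty}")
    return len(src)


# Hypothetical two-pair sample in the same {'lang1': ..., 'lang2': ...} layout.
sample = {
    "lang1": "एक सीमा की धारणा है .\nइस लकीर को स्पर्श करती वृत्त बनाएँ",
    "lang2": "There is a limit to perception .\nCreate a circle that touches this line",
}
n_pairs = check_parallel_corpus(sample)
print(n_pairs)
```

Calling `check_parallel_corpus(corpus)` on the full corpus above should report 20 pairs.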

The GCM toolkit runs as a sequential pipeline with three stages (Aligner, Pre-GCM and GCM):

![](https://drive.google.com/uc?id=1ez8VH-Gs0MSEc6TW-7k_-5grigG5EQ69)

We will now see how to invoke each stage.

## I. Aligner stage

 - In this stage, we generate word-level alignments between the source and target sentences of each pair.
 - Currently, we support the `fast_align` aligner, which works well for our use case, but you can swap in your own aligner.

In [4]:
from gcm.aligners import fast_align

# code to generate alignments using fast_align
aligns = fast_align.gen_aligns(corpus)

loading fast_align..
ARG=i
ARG=o
ARG=v
INITIAL PASS 
expected target length = source length * 0.973115
ITERATION 1
  log_e likelihood: -5408.77
  log_2 likelihood: -7803.21
     cross entropy: 29.8974
        perplexity: 1e+09
      posterior p0: 0.0678847
 posterior al-feat: -0.315943
       size counts: 17
ITERATION 2
  log_e likelihood: -2336.91
  log_2 likelihood: -3371.45
     cross entropy: 12.9174
        perplexity: 7736.35
      posterior p0: 0.0975656
 posterior al-feat: -0.295793
       size counts: 17
ITERATION 3
  log_e likelihood: -1359.59
  log_2 likelihood: -1961.47
     cross entropy: 7.51521
        perplexity: 182.938
      posterior p0: 0.109483
 posterior al-feat: -0.287589
       size counts: 17
ITERATION 4
  log_e likelihood: -1260.24
  log_2 likelihood: -1818.15
     cross entropy: 6.96608
        perplexity: 125.026
      posterior p0: 0.109761
 posterior al-feat: -0.286315
       size counts: 17
ITERATION 5 (FINAL)
  log_e likelihood: -1240.09
  log_2 likeliho

In [5]:
# generated alignments
aligns

['0-2 0-3 0-4 1-5 2-2 3-2 4-1 5-0 6-0',
 '2-2 4-0 5-11 10-7 14-10 15-13',
 '1-6 1-7 1-8 1-9 1-10 6-1 7-0 8-1 9-1 10-0 11-1 12-1 13-1 14-1 16-12 27-25',
 '0-0 1-0 4-3',
 '0-18 3-5 4-2 5-1 6-2 7-2 7-3 7-4 8-2 9-0 10-10 15-8 17-11 17-12',
 '0-23 1-17 7-6 12-11 13-0 14-3 15-21 22-24 23-23',
 '1-8 3-0 9-4 14-1 15-1 16-1 17-7 19-9 20-13',
 '0-1 0-2 1-4',
 '0-0 0-1 1-0 2-4 3-2 7-3',
 '1-10 2-0 3-0 4-1 10-5 11-6 12-5',
 '0-0 0-1 1-0 6-11 13-4',
 '0-2 1-0 2-0 3-0 4-0 5-0',
 '0-1 2-0 2-2 2-4 3-3 5-7 6-12',
 '0-23 2-0 2-1 2-2 3-2 4-2 5-2 6-2 7-22',
 '2-1',
 '1-0 2-2 3-0 4-1 5-6',
 '2-11 5-11 6-4 6-6 6-7 6-8 6-9 6-10 6-12 6-13 7-5 8-4 9-4 10-4 11-4 12-4 13-3 14-14',
 '0-0 0-1 0-2 0-3 0-4 0-5 1-0 2-0',
 '0-1 1-1 2-1 3-2 4-1 5-1 6-1 7-9',
 '0-12 1-3 2-0 2-2 2-4 3-3 4-3 13-5 14-1']
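Since the aligner output is just a list with one string of `i-j` links per sentence pair, a custom aligner only has to produce the same shape. The sketch below is a deliberately naive positional baseline, not something GCM ships: `diagonal_align` is a hypothetical name, and whether such alignments give useful code-mixed output is a separate question.

```python
def diagonal_align(corpus):
    """Align each lang1 token to the proportionally closest lang2 position."""
    src_lines = corpus["lang1"].split("\n")
    tgt_lines = corpus["lang2"].split("\n")
    aligns = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            aligns.append("")  # nothing to align for an empty sentence
            continue
        links = []
        for i in range(n_src):
            # Scale the source index onto the target length, clamped to the last token.
            j = min(n_tgt - 1, i * n_tgt // n_src)
            links.append(f"{i}-{j}")
        aligns.append(" ".join(links))
    return aligns


# Toy input: four source tokens, two target tokens.
toy = {"lang1": "a b c d", "lang2": "x y"}
toy_aligns = diagonal_align(toy)
print(toy_aligns)
```

The left index of each link refers to a `lang1` token and the right index to a `lang2` token, matching the `fast_align` output shown above.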

- Since `fast_align` is a statistical aligner, the quality of its alignments depends greatly on the size of the input corpus. In the above case we've only used 20 sentence pairs, so the alignments won't be as good.
- Hence, we recommend having at least 1k sentences (in both languages) in your corpus for better alignments. The bigger the better.
- The following are the alignments for the same sentences as above, but generated when the corpus was much bigger. We'll be using these going forward, but feel free to compare them with the above alignments and check the difference in quality of CM generation.

In [6]:
alignments_generated_bigger_corpus = """0-4 1-5 2-3 3-2 5-0 6-0 6-1
0-0 1-4 2-6 3-5 4-5 7-2 8-6 9-8 11-9 12-12 13-10 13-11 14-10 15-13
0-0 1-1 2-3 3-4 4-2 5-1 6-6 7-6 8-7 9-10 11-13 13-9 14-9 16-15 19-24 20-12 21-23 22-23 24-21 26-17 26-18 27-25
0-5 1-6 3-4 4-3 5-2 6-0 6-1
1-5 2-4 3-4 4-9 5-8 6-7 7-1 7-2 8-3 9-15 10-10 11-11 12-17 13-16 16-12 17-12 18-18
0-3 1-17 2-2 3-4 4-4 5-5 6-7 11-12 12-11 13-13 14-23 15-21 18-15 19-20 20-14 22-24 24-6 25-27 26-27 27-27 30-28 31-28 32-28
0-0 1-1 2-2 3-7 4-3 6-5 7-5 8-4 9-4 10-4 13-7 14-8 15-12 16-11 17-11 18-10 19-9 20-13
0-5 1-3 1-4 2-1 3-2 4-7 4-8 5-6
0-1 1-0 2-0 2-3 3-2 4-4 5-6 6-5 7-5 8-5
0-0 1-2 2-3 3-4 4-1 6-5 7-7 8-8 8-9 9-9 10-6 11-6 12-6
0-10 1-8 2-9 3-7 5-0 6-0 7-1 8-2 9-5 9-6 10-5 11-3 12-3 13-3 13-4 14-3
0-1 0-2 1-0 2-4 3-5 4-8 5-6 5-7 5-8
0-0 2-2 3-5 4-4 5-1 6-12 7-6 8-7 9-10 10-11 11-8 11-9
2-2 3-4 4-0 5-1 6-1 7-1 8-8 9-9 10-9 12-12 13-10 13-11 14-13 15-14 18-7 20-19 21-21 22-23 23-3
0-4 1-5 2-0 3-6 4-3 5-3 6-1 7-2 8-2
0-2 1-3 3-5 4-1 5-6
0-12 1-13 2-11 3-0 3-1 4-2 6-4 6-6 7-5 8-4 9-7 10-8 11-10 12-10 13-9 14-14
0-0 0-3 1-4 2-5 4-1 5-0 6-0
0-6 1-7 2-8 3-5 4-4 5-3 6-1 7-9
0-0 1-2 2-1 3-9 4-9 6-3 7-5 8-7 11-11 12-10 13-10 17-12""".split("\n")
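To read an alignment line, it helps to map the index pairs back to words. The sketch below does this for the first sentence pair and its alignment from the bigger corpus; `parse_alignment` is a helper name introduced here, not a GCM function.

```python
def parse_alignment(line):
    """Turn '0-4 1-5' into [(0, 4), (1, 5)] (lang1 index, lang2 index)."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]


src = "यदि आप तुरंत डॉक्टर से संपर्क करें".split()
tgt = "contact the doctor immediately if you".split()
links = parse_alignment("0-4 1-5 2-3 3-2 5-0 6-0 6-1")

# Word pairs connected by the alignment.
pairs = [(src[i], tgt[j]) for i, j in links]
for s, t in pairs:
    print(s, "->", t)

# lang1 tokens that received no link at all.
aligned_src = {i for i, _ in links}
unaligned = [w for i, w in enumerate(src) if i not in aligned_src]
print("unaligned:", unaligned)
```

Here `यदि` maps to `if`, `डॉक्टर` to `doctor`, and the postposition `से` is left unaligned.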

## II. Pre-GCM stage

- This stage is responsible for collecting input sentences, alignments, parse trees etc. and packaging them into a format that GCM can work with.

### Generating Parse Trees

- To work with GCM, we need to provide a parse tree for one of the input languages. Here we parse the English side (`lang2`).
- We currently support two parsers: `benepar` (Berkeley Neural Parser) and `stparser` (Stanford Parser).

In [7]:
from gcm.parsers import benepar, stparser

# code to use benepar to generate parse trees from the corpus
pt_benepar = benepar.parse(corpus['lang2'])

Loading benepar...


parsing sentences...


In [8]:
# code to use stanford parser to generate parse trees from the corpus
pt_stparser = stparser.parse(corpus['lang2'])

Launching stanford parser server...
parsing sentences...


In [9]:
corpus['lang2'].split("\n")[7]

'The Standing Committees of the Board are as follows:'

In [10]:
# draw the parse tree generated by benepar
draw_tree(pt_benepar[7])

                            ROOT                                   
                             |                                      
                             S                                     
                  ___________|_____________________________         
                 NP                         |             SBAR     
        _________|___________               |        ______|_____   
       |                     PP             |      SBAR          | 
       |                  ___|____          |    ___|______      |  
       NP                |        NP        |   |          VP    | 
  _____|_________        |    ____|____     |   |          |     |  
 DT   VBG       NNS      IN  DT        NN  VBP  IN        VBZ    : 
 |     |         |       |   |         |    |   |          |     |  
The Standing Committees  of the      Board are  as      follows  : 



In [11]:
# draw the parse tree generated by stparser
draw_tree(pt_stparser[7])

                            ROOT                                   
                             |                                      
                             S                                     
                  ___________|___________________________________   
                 NP                             |                | 
        _________|___________                   |                |  
       |                     PP                 VP               | 
       |                  ___|____           ___|___             |  
       NP                |        NP        |      ADJP          | 
  _____|_________        |    ____|____     |    ___|______      |  
 DT   VBG       NNS      IN  DT       NNP  VBP  RB         JJ    : 
 |     |         |       |   |         |    |   |          |     |  
The Standing Committees  of the      Board are  as      follows  : 



Now that the parse trees are generated, everything is ready for the Pre-GCM stage to start its work.

### Running Pre-GCM

In [12]:
from gcm.stages import pregcm, gcm

# get pregcm output using parse trees generated by benepar
pgcm_benepar = pregcm.process(corpus, alignments_generated_bigger_corpus, pt_benepar)
print(pgcm_benepar[0])

setting up pre-gcm...
running pre-gcm...
pre_gcm: INFO: 2021-04-11 16:31:05,842: ROOT DIR: /home/sanad/CM_Text_Generator
data/hi-to-en-input_lang1
pre_gcm: INFO: 2021-04-11 16:31:05,854: Parsing sentences: 0, 19
len of source sents: 20 len of parse trees: 20
pre-gcm completed..
0	0.0
यदि आप तुरंत डॉक्टर से संपर्क करें
contact the doctor immediately if you
(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) (SBAR (IN if) (NP (PRP you))))))
0-4 1-5 2-3 3-2 5-0 6-0 6-1


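Notice how each entry of the Pre-GCM output bundles everything GCM needs for one sentence pair: a header line, the `lang1` sentence, the `lang2` sentence, the parse tree of the `lang2` sentence, and the word alignments.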
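If you want to inspect individual fields, the printed entry can be split into its lines. This is a sketch that assumes each entry is a newline-separated string laid out exactly as printed above; `PreGCMRecord` and `split_record` are hypothetical names, and the meaning of the first line is not interpreted here.

```python
from typing import NamedTuple


class PreGCMRecord(NamedTuple):
    header: str     # first line of the entry, kept verbatim
    lang1: str      # source sentence
    lang2: str      # target sentence
    tree: str       # parse tree of the lang2 sentence
    alignment: str  # word alignments as 'i-j' links


def split_record(text):
    """Split one printed Pre-GCM entry into named fields (assumed 5-line layout)."""
    lines = [ln for ln in text.split("\n") if ln.strip()]
    if len(lines) != 5:
        raise ValueError(f"expected 5 non-empty lines, got {len(lines)}")
    return PreGCMRecord(*lines)


# The entry printed above, reproduced as a string for illustration.
entry = (
    "0\t0.0\n"
    "यदि आप तुरंत डॉक्टर से संपर्क करें\n"
    "contact the doctor immediately if you\n"
    "(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) "
    "(SBAR (IN if) (NP (PRP you))))))\n"
    "0-4 1-5 2-3 3-2 5-0 6-0 6-1\n"
)
record = split_record(entry)
print(record.lang2)
print(record.alignment)
```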

## III. GCM stage

- The GCM stage is the final stage in the GCM pipeline.
- It takes the output of the Pre-GCM stage and generates code-mixed (CM) sentences.
- Each output CM sentence is followed by its parse tree, which can be easily visualised using libraries like `nltk`.

In [13]:
# generate gcm
gcm_benepar = gcm.gen(pgcm_benepar)

setting up gcm...
running gcm process...


In [14]:
print(gcm_benepar)

['[CM]contact the doctor immediately if you\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_e doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (NP (PRP_e you)))))\n', '[CM]contact the doctor immediately if आप\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_e doctor)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (NP (PRP_h आप)))))\n', '[CM]contact the doctor immediately यदि you\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_e doctor)) (ADVP (RB_e immediately)) (SBAR (IN_h यदि) (NP (PRP_e you)))))\n', '[CM]contact the डॉक्टर immediately if you\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_h डॉक्टर)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (NP (PRP_e you)))))\n', '[CM]contact the डॉक्टर immediately if आप\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_h डॉक्टर)) (ADVP (RB_e immediately)) (SBAR (IN_e if) (NP (PRP_h आप)))))\n', '[CM]contact the डॉक्टर immediately यदि you\n', '[TREE](ROOT (VP_e (VB+DT_e contact the) (NP (NN_h डॉक्टर)) (ADVP (RB_e immediately)) (S

In [16]:
gcm_benepar[24:26]

['[CM]यदि आप तुरंत doctor संपर्क करें\n',
 '[TREE](ROOT (VP_h (SBAR (IN_h यदि) (NP (PRP_h आप))) (ADVP (RB_h तुरंत)) (NP (NN_e doctor)) (VB+DT_h संपर्क करें)))\n']
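The output is a flat list in which each `[CM]` line is followed by its `[TREE]` line, so it can be convenient to regroup it into (sentence, tree) pairs. The sketch below assumes only the two prefixes visible in the output above; `pair_outputs` is a helper name introduced here.

```python
def pair_outputs(lines):
    """Group a flat ['[CM]...', '[TREE]...', ...] list into (sentence, tree) tuples."""
    pairs = []
    sentence = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("[CM]"):
            sentence = line[len("[CM]"):]
        elif line.startswith("[TREE]") and sentence is not None:
            pairs.append((sentence, line[len("[TREE]"):]))
            sentence = None  # wait for the next [CM] line
    return pairs


# Two lines copied from the output above.
sample_output = [
    "[CM]यदि आप तुरंत doctor संपर्क करें\n",
    "[TREE](ROOT (VP_h (SBAR (IN_h यदि) (NP (PRP_h आप))) (ADVP (RB_h तुरंत)) "
    "(NP (NN_e doctor)) (VB+DT_h संपर्क करें)))\n",
]
cm_pairs = pair_outputs(sample_output)
for sentence, tree in cm_pairs:
    print(sentence)
    print(tree)
```

With the real output, `pair_outputs(gcm_benepar)` would give one tuple per generated sentence.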

In [17]:
# visualising CM parse tree
draw_tree(gcm_benepar[25].split('[TREE]')[1])

                 ROOT                           
                  |                              
                 VP_h                           
       ___________|____________________          
     SBAR         |     |              |        
  ____|_____      |     |              |         
 |          NP   ADVP   NP             |        
 |          |     |     |              |         
IN_h      PRP_h  RB_h  NN_e         VB+DT_h     
 |          |     |     |       _______|_____    
यदि         आप  तुरंत doctor संपर्क         करें



## Advanced Generation Options

- We've kept the above Python interface simple to make it user-friendly.
- The GCM toolkit offers many advanced features and tuning options, such as generating language tags for each output word, limiting the number of generated sentences, and choosing a sampling technique.
- Each of these options can be set in the 'CodeMixed-Text-Generator/config.ini' file.
- Once you have changed the config, simply re-run this code and GCM will pick up the new options from there.
- Please read the documentation on **Batch Mode** to understand how the config file works.
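Before re-running, it can help to print what the config file currently contains. The sketch below uses Python's standard `configparser` and makes no assumption about the section or option names GCM defines; `load_config` is a helper introduced here, and the path is the one mentioned above, relative to wherever the repository is checked out.

```python
import configparser


def load_config(path):
    """Read an INI file into {section: {option: value}}; return {} if it is missing."""
    # interpolation=None keeps values containing '%' from raising errors.
    parser = configparser.ConfigParser(interpolation=None)
    found = parser.read(path, encoding="utf-8")
    if not found:
        return {}
    return {section: dict(parser[section]) for section in parser.sections()}


settings = load_config("CodeMixed-Text-Generator/config.ini")
for section, options in settings.items():
    print(f"[{section}]")
    for name, value in options.items():
        print(f"  {name} = {value}")
```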