## CODE-MIX GENERATOR - MODIFIED DOCKER IMAGE

- This modified Docker image exposes the aligner and the code-mix generator through simple API calls.

### Installation instructions (after pulling docker image)

```
docker run -p 5000:5000 -p 6000:6000 prakod/codemix-gcm-generator
```
- This creates and starts a container from the Docker image (the same can be done from Docker Desktop). Get the container ID from the Desktop app or with `docker ps`.
- Then run:
```
docker exec -it <container_id> bash
```
- This opens a bash shell inside the container. In that shell, activate the conda environment and clone the toolkit:
```
conda activate gcm-venv
git clone https://github.com/prashantkodali/CodeMixToolkit.git
```

### Running jupyter notebook

```
jupyter notebook --ip 0.0.0.0 --port 5000 --no-browser --allow-root
```

### Running the Flask API

- Make sure you are in the `library` folder.

- Run these commands (change the host and port as required):
```
export FLASK_APP=gcmgenerator
flask run -h 0.0.0.0 -p 6000
```
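
Before sending requests from the notebook, it can help to confirm that something is listening on the Flask port. The sketch below is a minimal check that uses only the Python standard library. The helper name `is_port_open` is hypothetical and not part of the toolkit; the host and port are assumed from the `flask run` command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port are assumptions taken from `flask run -h 0.0.0.0 -p 6000`.
    print("Flask API reachable:", is_port_open("127.0.0.1", 6000))
```

A `False` result only says that nothing accepted the connection; it does not distinguish a stopped server from a port that was not published with `-p` when the container was started.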

```python
from flask import Blueprint, render_template, request, jsonify, flash
import requests
import json
from tqdm import tqdm
import time
import pandas as pd
import sys, os
```

## ALIGNER

- The two sentences are posted to the aligner endpoint, which returns their word alignment.

```python
# alignment generation
l1 = "यदि आप तुरंत डॉक्टर से संपर्क करें"
l2 = "contact the doctor immediately if you"

td = {'l1': l1, 'l2': l2}
alignment_api_endpoint = "http://127.0.0.1:6000/statistical_aligner_enhi"

response = requests.post(alignment_api_endpoint, json=td)

print(response)

print(response.json())

aligner_output = response.json()
alignments = aligner_output['alignment']
```

Output:

```
<Response [200]>
{'alignment': '0-4 1-5 2-3 3-2 4-0 5-0', 'l1': 'यदि आप तुरंत डॉक्टर से संपर्क करें', 'l2': 'contact the doctor immediately if you'}
```
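
The `alignment` string is a space-separated list of index pairs. Judging from the sample output, each pair appears to be `<l1 token index>-<l2 token index>` over whitespace-split tokens (for instance, `0-4` links यदि to "if"). That reading is inferred from this one example rather than from a documented format. A minimal sketch for turning the string into token pairs, using hypothetical helper names:

```python
def parse_alignment(alignment):
    """Parse '0-4 1-5' into [(0, 4), (1, 5)]."""
    pairs = []
    for item in alignment.split():
        left, right = item.split("-")
        pairs.append((int(left), int(right)))
    return pairs


def aligned_tokens(l1, l2, alignment):
    """Map each index pair to the (l1 token, l2 token) it points at.

    Assumes both sentences are tokenised by whitespace and that the first
    index of each pair refers to l1 (an inference from the sample output).
    """
    l1_tokens = l1.split()
    l2_tokens = l2.split()
    return [(l1_tokens[i], l2_tokens[j]) for i, j in parse_alignment(alignment)]


l1 = "यदि आप तुरंत डॉक्टर से संपर्क करें"
l2 = "contact the doctor immediately if you"
for hindi, english in aligned_tokens(l1, l2, "0-4 1-5 2-3 3-2 4-0 5-0"):
    print(hindi, "->", english)
```

Printing the pairs this way makes it easier to eyeball a questionable alignment before passing it to the generator.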


## CODE-MIXED SENTENCE GENERATOR

- Code-mixed sentences are generated from the two sentences and their alignment.

### Expected Outputs

- If an error occurs during code-mixed sentence generation, the program errors out with the message:
```
fail
```

- Sometimes no alignments can be generated, in which case the program returns an empty array.
- If an alignment error occurs, the code-mixed sentence may also skip a few words.
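
A client can branch on these cases before using the result. The sketch below is illustrative only: the helper `classify_cm_result` is hypothetical, and it assumes the value being inspected is either the literal string `fail` or a list of generated sentences (possibly empty). The exact shape in which the API reports `fail` should be checked against a real response.

```python
def classify_cm_result(cm_sentences):
    """Return 'fail', 'empty', or 'ok' for a generation result."""
    # Assumption: failure is signalled by the literal string "fail",
    # either on its own or as the only element of a list.
    if cm_sentences == "fail" or cm_sentences == ["fail"]:
        return "fail"
    # An empty list (or other empty value) means nothing was generated.
    if not cm_sentences:
        return "empty"
    return "ok"


for sample in ("fail", [], ["contact the तुरंत if you"]):
    print(repr(sample), "->", classify_cm_result(sample))
```

Skipped words cannot be detected this way, because such a result still looks like a normal list of sentences.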

### Success case

```python
# cm-sentences generation
choice = 2  # choice of language used to generate parse trees
data = {
    "lang1": l1,
    "lang2": l2,
    "alignments": alignments,
    "choice": choice
}

gcm_api_endpoint = "http://127.0.0.1:6000/gcm_enhi"
# the endpoint code is found in gcmgenerator.py

response = requests.post(gcm_api_endpoint, json=data)
print(response)
# print(response.json())

retdata = response.json()
print("Sentence 1: ", retdata['lang1'])
print("Sentence 2: ", retdata['lang2'])
print("Alignments: ", retdata['alignments'])
for i in retdata['cm_sentences']:
    print(i)
```

Output:

```
<Response [200]>
Sentence 1:  यदि आप तुरंत डॉक्टर से संपर्क करें
Sentence 2:  contact the doctor immediately if you
Alignments:  ['0-4 1-5 2-3 3-2 4-0 5-0']
[IDX]	0

[L1]	यदि आप तुरंत डॉक्टर से संपर्क करें

[L2]	contact the doctor immediately if you

[L2_Tree]	(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) (SBAR (IN if) (NP (PRP you))))))

Alignments	0-4 1-5 2-3 3-2 4-0 5-0

Theory	ec

[CM]contact the तुरंत if you

[TREE](ROOT (VP_e (VB_e contact) (NP_e (DT_e the)) (ADVP (RB_h तुरंत)) (SBAR (IN_e if) (NP (PRP_e you)))))



[IDX]	0

[L1]	यदि आप तुरंत डॉक्टर से संपर्क करें

[L2]	contact the doctor immediately if you

[L2_Tree]	(ROOT (S (VP (VB contact) (NP (DT the) (NN doctor)) (ADVP (RB immediately)) (SBAR (IN if) (NP (PRP you))))))

Alignments	0-4 1-5 2-3 3-2 4-0 5-0

Theory	ec

[CM]contact the तुरंत if आप

[TREE](ROOT (VP_e (VB_e contact) (NP_e (DT_e the)) (ADVP (RB_h तुरंत)) (SBAR (IN_e if) (NP (PRP_h आप)))))



[IDX]	0

[L1]	यदि आप तुरंत डॉक्टर से संपर्क
```
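
Each entry printed from `cm_sentences` looks like a block of tagged lines (`[L1]`, `[L2]`, `[CM]`, `[TREE]`, and so on). Assuming that layout holds, a small helper can pull out just the code-mixed sentence from a block. The function name `extract_tagged` is hypothetical, the format is inferred from the printed output rather than from a documented schema, and the sample block is abbreviated.

```python
def extract_tagged(block, tag):
    """Return the text after `tag` on the first line that starts with it, else None."""
    for line in block.splitlines():
        if line.startswith(tag):
            return line[len(tag):].strip()
    return None


# Abbreviated sample modelled on the printed output (the tree is shortened).
sample_block = (
    "[IDX]\t0\n\n"
    "[L2]\tcontact the doctor immediately if you\n\n"
    "Theory\tec\n\n"
    "[CM]contact the तुरंत if you\n\n"
    "[TREE](ROOT (VP_e (VB_e contact)))\n"
)
print(extract_tagged(sample_block, "[CM]"))
```

Applied to a real response, this would be used as `[extract_tagged(block, "[CM]") for block in retdata['cm_sentences']]`.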

### Error example (Case - `fail`)

```python
l1 = "बांस की 20 प्रजातियां हैं , जिनमें से मेलैकना बाकीफेरा माटेक प्रमुख है और राज्य में बांस की 95 भूमि पर है . बांस का व्यापक रूप से राज्य में उपयोग किया जाता है ."
l2 = "it has 20 bamboo species , of which melocanna baccifera mautak is predominant and occupies 95 of the bamboo afforested land in the state . bamboo is widely used in the state"
td = {'l1': l1, 'l2': l2}

# step 1: alignment generation
response = requests.post(alignment_api_endpoint, json=td)

print(response)

print(response.json())

aligner_output = response.json()
alignments = aligner_output['alignment']

# step 2: cm-sentences generation
choice = 2  # choice of language used to generate parse trees
data = {
    "lang1": l1,
    "lang2": l2,
    "alignments": alignments,
    "choice": choice
}
response = requests.post(gcm_api_endpoint, json=data)
print(response)
# print(response.json())

retdata = response.json()
print("Sentence 1: ", retdata['lang1'])
print("Sentence 2: ", retdata['lang2'])
print("Alignments: ", retdata['alignments'])
for i in retdata['cm_sentences']:
    print(i)
```

Output:

```
<Response [200]>
{'alignment': '0-3 1-6 2-2 3-4 5-5 6-7 11-12 12-11 13-13 15-21 17-16 18-15 19-20 22-24 23-18 23-25 25-27 26-27 28-23 29-29 30-28 31-28 32-28 33-26', 'l1': 'बांस की 20 प्रजातियां हैं , जिनमें से मेलैकना बाकीफेरा माटेक प्रमुख है और राज्य में बांस की 95 भूमि पर है . बांस का व्यापक रूप से राज्य में उपयोग किया जाता है .', 'l2': 'it has 20 bamboo species , of which melocanna baccifera mautak is predominant and occupies 95 of the bamboo afforested land in the state . bamboo is widely used in the state'}
<Response [200]>
Sentence 1:  बांस की 20 प्रजातियां हैं , जिनमें से मेलैकना बाकीफेरा माटेक प्रमुख है और राज्य में बांस की 95 भूमि पर है . बांस का व्यापक रूप से राज्य में उपयोग किया जाता है .
Sentence 2:  it has 20 bamboo species , of which melocanna baccifera mautak is predominant and occupies 95 of the bamboo afforested land in the state . bamboo is widely used in the state
Alignments:  ['0-3 1-6 2-2 3-4 5-5 6-7 11-12 12-11 13-13 15-21 17-16 18-15 19-20 22-24 23-18 23-25 25-2
```
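
The examples repeat the same two calls: post the sentence pair to the aligner, then post the sentences and the alignment to the generator. A sketch that wraps this sequence with a timeout and an HTTP status check is given below. The function name, the `base_url` and `post` parameters, and the 60-second timeout are assumptions made for illustration; the endpoint paths and JSON keys are the ones used in the cells above.

```python
def generate_code_mixed(l1, l2, choice=2, base_url="http://127.0.0.1:6000",
                        post=None, timeout=60):
    """Run the aligner and then the generator, returning the generator's JSON."""
    if post is None:
        # Imported lazily so a different `post` callable can be injected for testing.
        import requests
        post = requests.post

    # Step 1: word alignment for the sentence pair.
    aligner = post(f"{base_url}/statistical_aligner_enhi",
                   json={"l1": l1, "l2": l2}, timeout=timeout)
    aligner.raise_for_status()
    alignment = aligner.json()["alignment"]

    # Step 2: code-mixed sentence generation from the sentences and alignment.
    data = {"lang1": l1, "lang2": l2, "alignments": alignment, "choice": choice}
    response = post(f"{base_url}/gcm_enhi", json=data, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

In the `fail` example both calls return `<Response [200]>`, so `raise_for_status()` only guards against transport-level problems; the `cm_sentences` field of the returned JSON still has to be inspected.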

### Error example (Case - words skipped in code-mixed sentence)

Here, one of the generated sentences, "The Standing of the Board are as follows", drops the word "Committees" because of alignment issues.
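
One rough way to spot this kind of omission is to compare the words of the English sentence with the generated sentence. The sketch below is a heuristic with hypothetical helper names. A word missing from the output is expected whenever it has been replaced by a Hindi word, so the check also counts Devanagari tokens; only an output with more missing words than Hindi tokens suggests that something was skipped.

```python
import string


def _normalise(token):
    """Lower-case a token and strip surrounding ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def l2_words_missing_from_cm(l2_sentence, cm_sentence):
    """Return L2 words (punctuation stripped) that do not occur in the CM sentence."""
    cm_tokens = {_normalise(tok) for tok in cm_sentence.split()}
    missing = []
    for tok in l2_sentence.split():
        word = _normalise(tok)
        if word and word not in cm_tokens:
            missing.append(tok.strip(string.punctuation))
    return missing


def count_devanagari_tokens(sentence):
    """Count tokens containing at least one character from the Devanagari block."""
    return sum(
        1 for tok in sentence.split()
        if any("\u0900" <= ch <= "\u097F" for ch in tok)
    )


l2 = "The Standing Committees of the Board are as follows:"
cm = "The Standing of the Board are as follows"
missing = l2_words_missing_from_cm(l2, cm)
hindi_tokens = count_devanagari_tokens(cm)
print("missing:", missing, "| Hindi tokens:", hindi_tokens)
if len(missing) > hindi_tokens:
    print("possible skipped words")
```

The comparison is deliberately crude: it ignores word order and repeated words, and it treats one Hindi token as a replacement for exactly one English word, which will not always be true.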

```python
l1 = "बोर्ड की स्थायी समितियाँ निम्नवत हैः-"
l2 = "The Standing Committees of the Board are as follows:"
td = {'l1': l1, 'l2': l2}

# step 1: alignment generation
response = requests.post(alignment_api_endpoint, json=td)

print(response)

print(response.json())

aligner_output = response.json()
alignments = aligner_output['alignment']

# step 2: cm-sentences generation
choice = 2  # choice of language used to generate parse trees
data = {
    "lang1": l1,
    "lang2": l2,
    "alignments": alignments,
    "choice": choice
}
response = requests.post(gcm_api_endpoint, json=data)
print(response)
# print(response.json())

retdata = response.json()
print("Sentence 1: ", retdata['lang1'])
print("Sentence 2: ", retdata['lang2'])
print("Alignments: ", retdata['alignments'])
for i in retdata['cm_sentences']:
    print(i)
```

<Response [200]>
{'alignment': '0-5 1-3 1-4 2-1 3-2 4-6 4-7 4-8 5-8', 'l1': 'बोर्ड की स्थायी समितियाँ निम्नवत हैः-', 'l2': 'The Standing Committees of the Board are as follows:'}
<Response [200]>
Sentence 1:  बोर्ड की स्थायी समितियाँ निम्नवत हैः-
Sentence 2:  The Standing Committees of the Board are as follows:
Alignments:  ['0-5 1-3 1-4 2-1 3-2 4-6 4-7 4-8 5-8']
[IDX]	0

[L1]	बोर्ड की स्थायी समितियाँ निम्नवत हैः-

[L2]	The Standing Committees of the Board are as follows:

[L2_Tree]	(ROOT (S (NP (NP (DT The) (VBG Standing) (NNS Committees)) (PP (IN of) (NP (DT the) (NN Board)))) (VBP are) (SBAR (SBAR (IN as) (VP (VBZ follows))) (: :))))

Alignments	0-5 1-3 1-4 2-1 3-2 4-6 4-7 4-8 5-8

Theory	ec

[CM]The Standing of the Board are as follows

[TREE](ROOT (S_e (NP_e (NP_e (DT_e The) (VBG_e Standing)) (PP_e (IN+DT_e of the) (NP (NN_e Board)))) (VBP+IN+VBZ_e are as follows)))



[IDX]	0

[L1]	बोर्ड की स्थायी समितियाँ निम्नवत हैः-

[L2]	The Standing Committees of the Board are as follows:

[