---

# Basic sentence generation using a list of IDs as input
## Concept ID Database

### POS for qutr

* sent: complete sentences and greetings
* vp: verb phrases used as sentence templates, with an empty noun-phrase slot marked by `*`
* np: nouns and noun phrases
* adj: adjectives
* adv: adverbs
* conj: conjunctions
* prt: particles or other function words
* num: cardinal numbers
* X: other

"A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald

In [3]:
import pandas as pd

In [4]:
tri = pd.read_csv("basic_tri_db.csv")

In [5]:
## POS tag column shared by all language tables (one tag per phrase row)
pos = ["sent","sent","sent","sent","sent","sent","sent","sent","vp","sent","adj","adv","adv","vp","np","np","np","vp","sent","prt","prt","vp","prt","adv","sent","sent","sent","sent","sent","sent","sent","adj","adj","sent","sent","vp","sent","conj","sent","vp","sent"]
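Because the tag list is typed by hand, a quick check against the tag inventory can catch typos before the tags are attached to the language tables. This is a minimal sketch, assuming the nine tags listed under "POS for qutr" are the full inventory; `ALLOWED_TAGS`, `invalid_tags`, and `sample` are hypothetical names that do not appear in the notebook.

```python
# Tag inventory copied from the POS list (assumed to be complete).
ALLOWED_TAGS = {"sent", "vp", "np", "adj", "adv", "conj", "prt", "num", "X"}

def invalid_tags(tags):
    """Return (position, tag) pairs for tags outside the inventory."""
    return [(i, t) for i, t in enumerate(tags) if t not in ALLOWED_TAGS]

# Illustrative tag list with one deliberate typo ("snt").
sample = ["sent", "vp", "snt", "np"]
print(invalid_tags(sample))
```

An empty result means every tag is recognised; the same function could be called on the full `pos` list.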

In [6]:
en = pd.DataFrame(tri.en)
ar = pd.DataFrame(tri.ar)
cn = pd.DataFrame(tri.cn)

en = en.rename(columns={"en":"phrase"})
ar = ar.rename(columns={"ar":"phrase"})
cn = cn.rename(columns={"cn":"phrase"})

for langdf in [en, ar, cn]:
    langdf['pos'] = pos
    ## index rows as p0, p1... so that string IDs cannot be confused with cardinal numbers
    langdf.index = ["p" + str(i) for i in range(len(langdf))]
    langdf.index.name = "id"
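The three language tables go through identical steps, so the construction could be wrapped in one helper that accepts any number of language columns. The sketch below is one possible shape, not the notebook's own code: `build_lang_tables` and the three-row `toy` frame are assumptions standing in for `basic_tri_db.csv`.

```python
import pandas as pd

def build_lang_tables(db, pos, langs):
    """Return {lang: DataFrame(phrase, pos)} indexed by p0, p1, ..."""
    if len(pos) != len(db):
        raise ValueError("pos must have one tag per database row")
    ids = pd.Index(["p" + str(i) for i in range(len(db))], name="id")
    tables = {}
    for lang in langs:
        tables[lang] = pd.DataFrame(
            {"phrase": db[lang].to_numpy(), "pos": pos}, index=ids
        )
    return tables

# Toy stand-in for the real database (three rows only).
toy = pd.DataFrame({
    "en": ["Hello", "My name is *", "Apple"],
    "cn": ["你好", "我叫*", "苹果"],
})
tables = build_lang_tables(toy, ["sent", "vp", "np"], ["en", "cn"])
print(tables["cn"].loc["p1", "phrase"])
```

Returning a dictionary keyed by language code means a fourth language only needs a new column in the source file and a new entry in `langs`.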

In [7]:
en

id,phrase,pos
p0,Hello,sent
p1,Hi,sent
p2,Welcome,sent
p3,Good Morning,sent
p4,Good Afternoon,sent
p5,Good Evening,sent
p6,What would you like?,sent
p7,What is your name?,sent
p8,My name is *,vp
p9,Nice to meet you,sent


In [8]:
cn

id,phrase,pos
p0,你好,sent
p1,您好,sent
p2,欢迎,sent
p3,早上好,sent
p4,下午好,sent
p5,晚上好,sent
p6,你需要什么？,sent
p7,你叫什么名字？,sent
p8,我叫*,vp
p9,很高兴见到你,sent


In [9]:
ar

id,phrase,pos
p0,أهلا,sent
p1,مرحبا,sent
p2,السلام عليكم,sent
p3,صباح الخير,sent
p4,طاب مسائك,sent
p5,مساء الخير,sent
p6,ماذا تحب؟,sent
p7,ما اسمك؟,sent
p8,اسمي *,vp
p9,تشرفنا,sent


## Notes
* queries need to be split so that each one holds a single sentence; longer queries are hard to parse, since only one template phrase can be filled per query
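One way to enforce this constraint is to count the template phrases in a query before compiling it. The sketch below assumes a template is any `vp` row whose phrase contains `*`; `count_templates` and the toy table (with made-up IDs `p0` to `p3`) are hypothetical and only illustrate the check.

```python
import pandas as pd

def count_templates(query, target):
    """Count template phrases (pos 'vp' containing '*') in a query."""
    count = 0
    for pid in query:
        if isinstance(pid, int):
            continue  # bare numbers are slot fillers, not phrase IDs
        row = target.loc[pid]
        if row["pos"] == "vp" and "*" in row["phrase"]:
            count += 1
    return count

# Toy table; these IDs do not match the real database.
toy = pd.DataFrame(
    {"phrase": ["Good Morning", "Do you have *?", "How much are *?", "Apple"],
     "pos": ["sent", "vp", "vp", "np"]},
    index=["p0", "p1", "p2", "p3"],
)
print(count_templates(["p0", "p1", "p3"], toy))  # one template
print(count_templates(["p1", "p3", "p2"], toy))  # two templates
```

A query with a count above one would need to be split by the caller before it is compiled.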

In [10]:
## general translate method for all languages; the target language table is passed as a parameter
def translate(query, target):
    """Compile and print one sentence from a list of phrase IDs (pids) and bare integers."""
    final = ""   # compiled output
    filler = ""  # text that fills the * slot of the template (renamed from np, the usual numpy alias)
    temp = ""    # the templating sentence, if any
    raw = ""     # phrases as looked up, for the debug line

    for pid in query:
        if isinstance(pid, int):
            ## bare integers are cardinal numbers and go straight into the slot
            filler += str(pid)
            raw += str(pid) + " | "
        else:
            prow = target.loc[pid]
            raw += prow.phrase + " | "
            if prow.pos == "sent":
                final += prow.phrase + "\t"
            elif "*" in prow.phrase and prow.pos == "vp":
                ## there can only be one templating sentence per query
                temp = prow.phrase
            else:
                filler += prow.phrase.replace("*", "").lower()

    if temp != "":
        temp = temp.replace("*", filler)
    final += temp

    print("Query IDs were: " + raw + "\n")
    print(final)
    print("---\n")
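`translate` prints its result, which makes it awkward to reuse or test. A variant that returns the compiled string might look like the sketch below. It follows the same steps, with one deliberate difference: sentences are joined with tabs, so there is no trailing tab after a lone sentence. `compile_sentence` and the toy table (with made-up IDs) are assumptions for illustration.

```python
import pandas as pd

def compile_sentence(query, target):
    """Same steps as translate(), but return the sentence instead of printing."""
    sentences = []  # complete sentences, in query order
    filler = ""     # text that goes into the template's * slot
    template = ""   # the single vp template, if any
    for pid in query:
        if isinstance(pid, int):
            filler += str(pid)
            continue
        row = target.loc[pid]
        if row["pos"] == "sent":
            sentences.append(row["phrase"])
        elif row["pos"] == "vp" and "*" in row["phrase"]:
            template = row["phrase"]
        else:
            filler += row["phrase"].replace("*", "").lower()
    if template:
        sentences.append(template.replace("*", filler))
    return "\t".join(sentences)

# Toy table; these IDs do not match the real database.
toy = pd.DataFrame(
    {"phrase": ["Good Morning", "I would like *", "* kilograms",
                "Do you have *?", "Apple"],
     "pos": ["sent", "vp", "prt", "vp", "np"]},
    index=["p0", "p1", "p2", "p3", "p4"],
)
print(compile_sentence(["p1", 5, "p2"], toy))
print(compile_sentence(["p0", "p3", "p4"], toy))
```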

### Example sentence compilation

In [11]:
translate(["p21", 5, "p22"], en)
translate(["p36"], en)
translate(["p3", "p35", "p15"], en)
translate(["p16", "p17"], en)

Query IDs were: I would like * | 5 | * kilograms | 

I would like 5 kilograms
---

Query IDs were: I do not understand | 

I do not understand	
---

Query IDs were: Good Morning | Do you have *? | Apple | 

Good Morning	Do you have apple?
---

Query IDs were: Mango | How much are *? | 

How much are mango?
---



In [12]:
translate(["p21", 5, "p22"], cn)
translate(["p36"], cn)
translate(["p3", "p35", "p15"], cn)
translate(["p16", "p17"], cn)

Query IDs were: 我要* | 5 | *公斤 | 

我要5公斤
---

Query IDs were: 我听不懂 | 

我听不懂	
---

Query IDs were: 早上好 | 有没有*？ | 苹果 | 

早上好	有没有苹果？
---

Query IDs were: 芒果 | *多少钱？ | 

芒果多少钱？
---



In [13]:
translate(["p21", 5, "p22"], ar)
translate(["p36"], ar)
translate(["p3", "p35", "p15"], ar)
translate(["p16", "p17"], ar)

Query IDs were: أود * | 5 | *كلغ | 

أود 5كلغ
---

Query IDs were: لا افهم | 

لا افهم	
---

Query IDs were: صباح الخير | هل تمتلك *؟ | تفاح | 

صباح الخير	هل تمتلك تفاح؟
---

Query IDs were: مانجو | كم هي *؟ | 

كم هي مانجو؟
---



---

## Jan 28 Notes

### Implementing a multilingual phrase compilation model

How do we build a single model that compiles sentences accurately for all languages?

* a clean, well-organized database is essential
* how well does the database scale?
* how should gender and age be specified: by adding more concept phrases, or by inflecting existing phrases?
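One possible answer to the organization and scalability questions is a single long-format table with one row per (language, concept ID), so that adding a language adds rows rather than columns. The sketch below is only an illustration of that layout, not a decided design; `db` and `lookup` are hypothetical names, and the phrases are taken from the tables shown earlier.

```python
import pandas as pd

# Toy long-format table: one row per (language, concept ID).
rows = [
    ("en", "p0", "Hello", "sent"),
    ("cn", "p0", "你好", "sent"),
    ("ar", "p0", "أهلا", "sent"),
    ("en", "p8", "My name is *", "vp"),
    ("cn", "p8", "我叫*", "vp"),
    ("ar", "p8", "اسمي *", "vp"),
]
db = pd.DataFrame(rows, columns=["lang", "id", "phrase", "pos"])
db = db.set_index(["lang", "id"]).sort_index()

def lookup(lang, pid):
    """Return the (phrase, pos) row for one concept in one language."""
    return db.loc[(lang, pid)]

print(lookup("cn", "p8")["phrase"])
```

Under this layout, a further index level (for example gender) could hold inflected variants of the same concept, though whether that is preferable to adding new concept phrases remains open.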

---