---

# Basic sentence generation using a list of IDs as input
## Concept ID Database

### POS for qutr

* sent: complete sentences and greetings (tagged `phrs` in the code below)
* vp: verb phrases, i.e. sentence templates with an empty noun-phrase slot (marked `*`)
* np: nouns and noun phrases
* adj: adjectives
* adv: adverbs
* conj: conjunctions
* prt: particles or other function words
* num: cardinal numbers
* X: other

"A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald

In [1]:
import pandas as pd

In [2]:
tri = pd.read_csv("basic_tri_db.csv")

In [17]:
pos = ["phrs","phrs","phrs","phrs","phrs","phrs","phrs","phrs","vp","phrs","adj","adv","adv","vp","np","np","np","vp","phrs","prt","prt","vp","prt","adv","phrs","phrs","phrs","phrs","phrs","phrs","phrs","adj","adj","phrs","phrs","vp","phrs","conj","phrs","vp","phrs"]
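
Before the tags are attached to the phrases, the list can be checked against the tag set defined earlier. A minimal sketch, assuming a hypothetical `ALLOWED_POS` set and `check_pos` helper (neither is part of the notebook); `phrs` is included because the `pos` list uses it for complete phrases:

```python
# Hypothetical helper: ALLOWED_POS and check_pos are illustrative names.
# "phrs" is the tag the pos list uses for the documented "sent" category.
ALLOWED_POS = {"sent", "phrs", "vp", "np", "adj", "adv", "conj", "prt", "num", "X"}

def check_pos(tags):
    """Return the tags that are not in the tag set, in order of first appearance."""
    unknown = []
    for tag in tags:
        if tag not in ALLOWED_POS and tag not in unknown:
            unknown.append(tag)
    return unknown

# A mistyped tag is reported; valid tags pass silently.
print(check_pos(["phrs", "vp", "noun", "adj"]))  # ['noun']
```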

In [18]:
en = pd.DataFrame(tri.en)
ar = pd.DataFrame(tri.ar)
cn = pd.DataFrame(tri.cn)

en = en.rename(columns={"en":"phrase"})
ar = ar.rename(columns={"ar":"phrase"})
cn = cn.rename(columns={"cn":"phrase"})

for langdf in [en, ar, cn]:
    langdf['pos'] = pos
    # Use string IDs p0, p1, ... as the index so that concept IDs
    # cannot be confused with cardinal numbers.
    langdf.index = pd.Index(["p" + str(i) for i in range(len(langdf))], name="id")

In [19]:
en

| id | phrase | pos |
|----|--------|-----|
| p0 | Hello | phrs |
| p1 | Hi | phrs |
| p2 | Welcome | phrs |
| p3 | Good Morning | phrs |
| p4 | Good Afternoon | phrs |
| p5 | Good Evening | phrs |
| p6 | What would you like? | phrs |
| p7 | What is your name? | phrs |
| p8 | My name is * | vp |
| p9 | Nice to meet you | phrs |


In [20]:
cn

| id | phrase | pos |
|----|--------|-----|
| p0 | 你好 | phrs |
| p1 | 您好 | phrs |
| p2 | 欢迎 | phrs |
| p3 | 早上好 | phrs |
| p4 | 下午好 | phrs |
| p5 | 晚上好 | phrs |
| p6 | 你需要什么？ | phrs |
| p7 | 你叫什么名字？ | phrs |
| p8 | 我叫* | vp |
| p9 | 很高兴见到你 | phrs |


In [21]:
ar

| id | phrase | pos |
|----|--------|-----|
| p0 | أهلا | phrs |
| p1 | مرحبا | phrs |
| p2 | السلام عليكم | phrs |
| p3 | صباح الخير | phrs |
| p4 | طاب مسائك | phrs |
| p5 | مساء الخير | phrs |
| p6 | ماذا تحب؟ | phrs |
| p7 | ما اسمك؟ | phrs |
| p8 | اسمي * | vp |
| p9 | تشرفنا | phrs |


In [22]:
ar.to_csv("ar40.csv")
en.to_csv("en40.csv")
cn.to_csv("cn40.csv")
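
When the saved files are read back, the `id` column has to be named as the index; otherwise pandas adds a fresh integer index and `p0`, `p1`, ... become an ordinary column. A minimal sketch of the round trip, using an in-memory buffer and a two-row stand-in frame (`en_demo` is hypothetical) so that it runs without the CSV files:

```python
import io
import pandas as pd

# Small stand-in for the `en` frame; the real one has more rows.
en_demo = pd.DataFrame(
    {"phrase": ["Hello", "My name is *"], "pos": ["phrs", "vp"]},
    index=pd.Index(["p0", "p8"], name="id"),
)

buffer = io.StringIO()
en_demo.to_csv(buffer)  # same call as above, writing to memory instead of a file
buffer.seek(0)

# index_col="id" restores the string IDs as the index instead of a plain column.
reloaded = pd.read_csv(buffer, index_col="id")
print(reloaded.loc["p8", "phrase"])  # My name is *
```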


In [36]:
tri = pd.merge(en, ar, left_index=True, right_index=True)
tri = pd.merge(tri, cn, left_index=True, right_index=True)

In [46]:
tri = tri.rename(columns={"phrase_x": "en", "phrase_y": "ar", "phrase": "cn"})
tri = tri.drop(["pos_x","pos_y"], axis=1)

In [47]:
tri

| id | en | ar | cn | pos |
|----|----|----|----|-----|
| p0 | Hello | أهلا | 你好 | phrs |
| p1 | Hi | مرحبا | 您好 | phrs |
| p2 | Welcome | السلام عليكم | 欢迎 | phrs |
| p3 | Good Morning | صباح الخير | 早上好 | phrs |
| p4 | Good Afternoon | طاب مسائك | 下午好 | phrs |
| p5 | Good Evening | مساء الخير | 晚上好 | phrs |
| p6 | What would you like? | ماذا تحب؟ | 你需要什么？ | phrs |
| p7 | What is your name? | ما اسمك؟ | 你叫什么名字？ | phrs |
| p8 | My name is * | اسمي * | 我叫* | vp |
| p9 | Nice to meet you | تشرفنا | 很高兴见到你 | phrs |
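
With the merged table in place, generating output from a list of IDs comes down to a lookup per ID plus filling the `*` slot of any `vp` template. The sketch below is one possible reading of that step, not the notebook's implementation: `PHRASES` is a plain-dict stand-in for three rows of the table, and `generate` and its `fillers` argument are hypothetical names.

```python
# Stand-in for a few rows of the merged table; a plain dict keeps the sketch self-contained.
PHRASES = {
    "p0": {"en": "Hello", "ar": "أهلا", "cn": "你好", "pos": "phrs"},
    "p8": {"en": "My name is *", "ar": "اسمي *", "cn": "我叫*", "pos": "vp"},
    "p9": {"en": "Nice to meet you", "ar": "تشرفنا", "cn": "很高兴见到你", "pos": "phrs"},
}

def generate(ids, lang, fillers=()):
    """Look up each ID in order and fill `*` slots in vp templates with fillers.

    Hypothetical helper: the `*` slot convention comes from the table,
    the filling strategy is an assumption.
    """
    remaining = list(fillers)
    parts = []
    for concept_id in ids:
        row = PHRASES[concept_id]
        text = row[lang]
        if row["pos"] == "vp" and "*" in text and remaining:
            # Replace only the first slot and consume one filler.
            text = text.replace("*", remaining.pop(0), 1)
        parts.append(text)
    return parts

print(generate(["p0", "p8", "p9"], "en", fillers=["Sam"]))
# ['Hello', 'My name is Sam', 'Nice to meet you']
```

The same ID list works for any language column, for example `generate(["p8"], "cn", fillers=["Sam"])`. A template with no filler available is returned unchanged, with its `*` still visible.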


## Notes
* queries need to be split into single sentences; otherwise they are hard to parse
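
One rough way to split a query into single sentences is to cut after sentence-final punctuation. The sketch below assumes a small, non-exhaustive punctuation set covering the three languages in the table, and the name `split_sentences` is hypothetical. Abbreviations and decimal numbers would be split incorrectly, so it is a starting point only.

```python
import re

# Assumed sentence-final punctuation for English, Chinese, and Arabic (not exhaustive).
_END = ".!?。！？؟"
# A sentence is a run of non-terminal characters followed by any terminal punctuation.
_SENTENCE = re.compile(rf"[^{_END}]+[{_END}]*")

def split_sentences(query):
    """Split a query into sentences, keeping the closing punctuation."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(query))
    return [piece for piece in pieces if piece]

print(split_sentences("Hello. What is your name?"))
# ['Hello.', 'What is your name?']
```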

---

## Jan 28 Notes

### Implementing a multilingual phrase compilation model

How do we build a single model that builds sentences accurately for all languages?

* importance of a clean, well-organized database
* scalability of the database?
* how to specify gender and age: by adding more concept phrases, or by inflecting existing phrases?
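
To make the first option in the last question concrete, here is a minimal sketch of storing extra phrases per feature and falling back to the base phrase when no variant exists. Everything in it is hypothetical: the names `BASE`, `VARIANTS`, and `lookup`, the `"f"`/`"m"` feature values, and the placeholder strings, which stand in for real phrases.

```python
# Hypothetical data: base phrases keyed by ID, optional variants keyed by (ID, gender).
# The strings are placeholders, not real translations.
BASE = {"p7": "<p7, default form>"}
VARIANTS = {("p7", "f"): "<p7, form addressing a woman>"}

def lookup(concept_id, gender=None):
    """Return the gender-specific variant if one exists, else the base phrase."""
    return VARIANTS.get((concept_id, gender), BASE[concept_id])

print(lookup("p7", "f"))  # <p7, form addressing a woman>
print(lookup("p7", "m"))  # <p7, default form>
```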

---