## Sentence Generation from Keywords

1. During generation-rule acquisition, generation rules for each keyword are acquired automatically.
2. During candidate-text construction, candidate-text sentences are built by applying the rules acquired in the first step. Each candidate-text sentence is represented by a graph or dependency tree.
3. During evaluation, each candidate-text sentence is assigned a score, and the candidates are ranked by that score. The score is a probability estimated with a keyword-production model and a language model, both trained on a corpus.
4. The candidate-text sentence that maximizes the score, or the candidate-text sentences whose scores exceed a threshold, are selected as output. The system can also output the candidate-text sentences ranked within the top N.
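
The ranking and selection in steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` type, the function names, and the toy probabilities are all hypothetical, and multiplying the two model probabilities is an assumed way of combining them, since the notes above do not say how the models are combined.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    keyword_prob: float   # assumed: probability from the keyword-production model
    language_prob: float  # assumed: probability from the language model


def score(c: Candidate) -> float:
    # Assumption: the two probabilities are combined by multiplication.
    return c.keyword_prob * c.language_prob


def rank(cands: List[Candidate]) -> List[Candidate]:
    # Step 3: order candidates from highest to lowest score.
    return sorted(cands, key=score, reverse=True)


def select_best(cands: List[Candidate]) -> Candidate:
    # Step 4, option 1: the single candidate that maximizes the score.
    return rank(cands)[0]


def select_over_threshold(cands: List[Candidate], threshold: float) -> List[Candidate]:
    # Step 4, option 2: every candidate whose score exceeds the threshold.
    return [c for c in rank(cands) if score(c) > threshold]


def select_top_n(cands: List[Candidate], n: int) -> List[Candidate]:
    # Step 4, option 3: the N highest-ranked candidates.
    return rank(cands)[:n]


# Toy values, made up for illustration only.
candidates = [
    Candidate("sentence A", 0.5, 0.25),
    Candidate("sentence B", 0.5, 0.5),
    Candidate("sentence C", 0.25, 0.25),
]
print(select_best(candidates).text)
print([c.text for c in select_top_n(candidates, 2)])
```

Keeping the three selection modes as separate functions over one shared `rank` keeps the scoring assumption in a single place.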


headword: rightmost content word

content word: word whose part of speech is verb, adjective, noun, demonstrative, adverb, conjunction, attribute, interjection, or undefined word.

function word: everything else, as well as formal nouns and the auxiliary verbs “SURU (do)” and “NARU (become)”.
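
The definitions above can be sketched as a small lookup. This is a hedged illustration, not a tagger: the tag names, the `formal_noun` tag, matching SURU and NARU by surface form, and the romanised example tokens are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); tag names are assumed

# Parts of speech listed as content words in the definition above.
CONTENT_POS = {
    "verb", "adjective", "noun", "demonstrative", "adverb",
    "conjunction", "attribute", "interjection", "undefined",
}

# Assumption: SURU and NARU are recognised by surface form, and formal nouns
# carry their own "formal_noun" tag, which is not in CONTENT_POS.
FUNCTION_EXCEPTIONS = {"SURU", "NARU"}


def is_content_word(token: Token) -> bool:
    surface, pos = token
    if surface in FUNCTION_EXCEPTIONS:
        return False
    return pos in CONTENT_POS


def headword(tokens: List[Token]) -> Optional[Token]:
    # The headword is the rightmost content word, so scan from the right.
    for token in reversed(tokens):
        if is_content_word(token):
            return token
    return None


# Made-up example: a noun followed by SURU and an auxiliary.
print(headword([("benkyou", "noun"), ("SURU", "verb"), ("ta", "auxiliary")]))
```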

In [1]:
import re

In [2]:
t = "打给 美国/澳洲/英国/加拿大 是多少钱? (<i>dǎgěi měiguó/àozhōu/yīngguó/jiānádà shì duōshǎo qián?)</i>"
hi = t.split(" (<i>")[1]
print(hi[:-5])

dǎgěi měiguó/àozhōu/yīngguó/jiānádà shì duōshǎo qián?
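
The split-and-slice above relies on the entry ending in exactly `)</i>`. A regex-based variant is sketched below as an alternative; it assumes every entry follows the same `(<i>...)</i>` layout as the string above, and `transliteration` is a hypothetical helper name.

```python
import re

# Assumed layout: "<phrase> (<i><transliteration>)</i>", as in the string above.
TRANSLIT = re.compile(r"\(<i>(.*?)\)</i>")


def transliteration(entry):
    # Return the text between "(<i>" and ")</i>", or None if the markup is absent.
    match = TRANSLIT.search(entry)
    return match.group(1) if match else None


t = "打给 美国/澳洲/英国/加拿大 是多少钱? (<i>dǎgěi měiguó/àozhōu/yīngguó/jiānádà shì duōshǎo qián?)</i>"
print(transliteration(t))
```

Returning `None` for entries without the markup avoids the `IndexError` that indexing the result of `split` would raise.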


In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

# Fetch the Wikitravel Arabic phrasebook and parse the HTML.
wiki = "http://wikitravel.org/en/Arabic_phrasebook"
req = requests.get(wiki)
soup = BeautifulSoup(req.text, "lxml")

df = pd.DataFrame(columns=["en","ar"])
cat = ""


In [4]:
labels = []
for dl in soup.find_all('dl'):
    if dl.find_all('dt') and not dl.find_all('dl'):
        labels.append(dl)     

In [5]:
# Pair each <dt> with the <dd> at the same position in the list.
df = pd.DataFrame(columns=["en","ar"])
for label in labels:
    dt = label.find_all("dt")
    dd = label.find_all("dd")

    # Drop entries that break the one-to-one <dt>/<dd> pairing.
    # Iterate over copies: removing from a list while iterating over it skips elements.
    for d in list(dd):
        if "Good night" in d.text:
            dd.remove(d)
            print("removed 1")

    for t in list(dt):
        if any(s in t.text for s in ("antunna", "(two people)", "momken tfaye")):
            dt.remove(t)
            print("removed 1")

    if len(dt) == len(dd):
        for i in range(len(dt)):
            df.loc[len(df)] = [dt[i].text.strip(), dd[i].text.strip()]
    else:
        # Report lists whose terms and definitions still do not line up.
        print(len(dt),len(dd))
        print("failed")
        print(label)
        


removed 1
removed 1
removed 1
removed 1


In [6]:
df.to_csv("en_ar_phrases_raw.csv")

In [7]:
df

Unnamed: 0,en,ar
0,OPEN,مفتوح (maftūh)
1,CLOSED,مغلق (mughlaq)
2,ENTRANCE,دخول (dukhūl)
3,EXIT,خروج (khurūj)
4,PUSH,ادفع (idfa`)
5,PULL,اسحب (ishab)
6,TOILET,حمام (hammām)
7,MEN,رجال (rijāl)
8,WOMEN,سيدات (sayyidāt)
9,FORBIDDEN,ممنوع (mamnū`)


In [8]:
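# Compile the patterns once, outside the loop.
brack = re.compile(r'\(.*\)')   # parenthesised transliteration
unders = re.compile(r'_+')      # blanks written as runs of underscores
eng = re.compile(r'[a-zA-Z0-9āīūēō\-.,~\'`?!\[\]]+')  # leftover Latin text

for i, row in df.iterrows():
    en, ar = row.en, row.ar
    if en in ar:
        # str.strip(chars) strips a set of characters, not a substring, so use replace.
        ar = ar.replace(en, "").strip(" .")
    if en[0:3] == "...":
        en = en[3:len(en)-1]
    if ar[0:3] == "...":
        temp = ar.split("(")
        ar = temp[0][3:len(ar)-1]
    ar = brack.sub("", ar)
    ar = unders.sub("*", ar)
    ar = eng.sub("", ar)
    # Write back explicitly; mutating the row from iterrows() is not guaranteed to change df.
    df.at[i, "en"] = en
    df.at[i, "ar"] = ar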
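
The cleanup applied to the `ar` column can also be sketched as a standalone function that is easy to test on a single string. `clean_arabic` is a hypothetical helper name, and the sketch deliberately uses `str.replace` to drop the English text, because `str.strip(chars)` removes a set of characters from both ends rather than a substring.

```python
import re

BRACKETED = re.compile(r"\(.*\)")   # parenthesised transliteration
BLANKS = re.compile(r"_+")          # runs of underscores marking a blank
LATIN = re.compile(r"[a-zA-Z0-9āīūēō\-.,~'`?!\[\]]+")  # leftover Latin text


def clean_arabic(en, ar):
    # Remove an embedded copy of the English phrase, if present.
    if en in ar:
        ar = ar.replace(en, "")
    ar = BRACKETED.sub("", ar)   # drop "(maftūh)"-style transliterations
    ar = BLANKS.sub("*", ar)     # normalise blanks to a single "*"
    ar = LATIN.sub("", ar)       # drop any remaining Latin characters
    return ar.strip()


# Input taken from the first row of the raw table above.
print(clean_arabic("OPEN", "مفتوح (maftūh)"))
```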
    
    

In [9]:
df

Unnamed: 0,en,ar
0,OPEN,مفتوح
1,CLOSED,مغلق
2,ENTRANCE,دخول
3,EXIT,خروج
4,PUSH,ادفع
5,PULL,اسحب
6,TOILET,حمام
7,MEN,رجال
8,WOMEN,سيدات
9,FORBIDDEN,ممنوع


In [10]:
df.to_csv("cleaned_arabic.csv")

In [11]:
cndf = pd.read_csv("../clean_en_cn.tsv", sep="\t")

In [12]:
cndf = cndf.drop(['Unnamed: 0','category'], axis=1)

In [13]:
ardf = df.copy()
cndf = cndf.rename(columns={"english": "en", "chinese": "cn"})

In [14]:
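articles = re.compile(r'\b(?:the|a|an) ')  # whole-word articles only

for i, row in cndf.iterrows():
    en = row.en.strip()
    cn = row.cn.strip()
    if en[0:3] == "...":
        en = en[3:len(en)-2]
        # A plain replace("a ", "") would also eat word endings such as "pizza ".
        en = articles.sub("", en)
    if cn[0:3] == "...":
        cn = cn[3:len(cn)-2]
    # Write back explicitly; mutating the row from iterrows() is not guaranteed to change cndf.
    cndf.at[i, "en"] = en
    cndf.at[i, "cn"] = cn

# Strip surrounding whitespace so the merge keys line up.
ardf["en"] = ardf["en"].str.strip()
ardf["ar"] = ardf["ar"].str.strip()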

In [15]:
cndf

Unnamed: 0,en,cn
0,Hello.,你好。
1,How are you?,你好吗？
2,"Fine, thank you.","很好, 谢谢。"
3,"May I please ask, what is your name?",请问你叫什么名?
4,What is your name?,你叫什么名字？
5,My name is * .,我叫 * 。
6,Nice to meet you.,很高兴认识你。
7,Please.,请。
8,Thank you.,谢谢。
9,You're welcome.,不客气。


In [18]:
pd.merge?

In [19]:
result = pd.merge(ardf, cndf, how='outer', on=['en'])

In [21]:
len(result)

563

In [22]:
result.sort_values("en")

Unnamed: 0,en,ar,cn
318,(bubbly) water,مياه غازية,
307,(fresh) fruit,),
306,(fresh) vegetables,),
527,"* (hard liquor) and * (mixer), please.",,请给我*和*。
478,* day(s),,* 天
477,* hour(s),,* 小时
476,* minute(s),,* 分钟
480,* month(s),,* 月
479,* week(s),,* 星期
481,* year(s),,* 年


In [24]:
nodup = result.drop_duplicates().sort_values("en")

In [26]:
nodup.to_csv("cleaned_merged.csv")