# Constructing Chinese N-Ball Embeddings

contributed by **Shichen Zhan**

These instructions were written for the lab AI Language Technology (WiSe 2018) offered by Fraunhofer IAIS. They describe how to generate n-ball embedding files for the Chinese language.

This notebook constructs the child and catcode files using the synset index of the English WordNet (Princeton WordNet 3.1). Each Chinese word is mapped to its corresponding English synset, and the Chinese words inherit the relations (hypernym and hyponym) of the English words.
Because the vocabulary of the pre-trained word embeddings differs from that of the Chinese wordnet,
words that are missing from either side must be removed so that the two vocabularies are identical.


## Implementation
We use two packages, bs4 and nltk. bs4 parses the Chinese wordnet file (wn-cmn-lmf.xml). nltk looks up the English synset that corresponds to each Chinese synset, so that the word relations (hypernym and hyponym) can be obtained.

In [None]:
from bs4 import BeautifulSoup as bs
from nltk.corpus import wordnet as wn
soup = bs(open("/home/zhanshichen/Desktop/code/wn-cmn-lmf.xml"),"xml")
pa_child = {}
parent = {}
tag = soup.LexicalEntry
# all Chinese words that appear in w2v.vector (the file of pre-trained word embeddings)
vector = []
with open("/home/zhanshichen/nball4tree/w2v.vector","r") as f:
    lines = f.readlines()     
    for line in lines:
        vec = line.split(" ")
        vector.append(vec[0])
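
The cell above keeps the vocabulary in a list, so every later `in vector` test scans the whole list. A minimal sketch of the same loading step with a set is shown here; the function name `load_vocabulary` and the three-line demo file are assumptions made for illustration, not part of the original notebook.

```python
import os
import tempfile


def load_vocabulary(path):
    """Return the set of first tokens (the words) of a word2vec-style text file."""
    vocabulary = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            word = line.split(" ")[0].strip()
            if word:  # skip empty lines
                vocabulary.add(word)
    return vocabulary


# Hypothetical three-line embedding file, used only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".vector", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("猫 0.1 0.2\n狗 0.3 0.4\n动物 0.5 0.6\n")
    demo_path = tmp.name

demo_vocabulary = load_vocabulary(demo_path)
os.remove(demo_path)
print("猫" in demo_vocabulary, "树" in demo_vocabulary)
```

A set gives constant-time membership tests on average, which matters when the wordnet contains tens of thousands of entries.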

Here we create a dictionary **pa_child** that stores every Chinese word-sense obtained from the Chinese wordnet, keyed by its synset id. Furthermore, it will be used to record the relations between word-senses.

In [None]:
# all words that appear in the wordnet; used later to trim the vector file
word_name = []

while not tag.name == "Synset":
    tag_sense = tag.Sense
    tag_id = tag_sense['synset']
    tag_name_o = tag.Lemma['writtenForm']  
    part = tag.Lemma['partOfSpeech']  
    i = 1
    number = str(i)

    #remove "+" from tag_name
    tag_name = tag_name_o.replace("+","")     


    if (tag_name in vector) and (len(pa_child) < 10000):
    # if tag_name in vector :
        word_name.append(tag_name)
        pa_child[tag_id] = {}


        tag_total_name = tag_name + '.' + part + '.' + number
        pa_child[tag_id][tag_total_name] = []


        # if a word has more than one sense
        while tag_sense.next_sibling.next_sibling:
            i = i + 1
            number = str(i)
            tag_sense = tag_sense.next_sibling.next_sibling
            tag_id = tag_sense['synset']
            pa_child[tag_id] = {}
            tag_total_name = tag_name + '.' + part + '.' + number
            pa_child[tag_id][tag_total_name] = []

    tag = tag.next_sibling.next_sibling
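
The cell above walks the XML through `next_sibling` links. As a rough illustration of the same bookkeeping, the sketch below iterates over `LexicalEntry` elements with the standard-library `xml.etree.ElementTree` instead of bs4. The XML fragment, its ids and the function name `build_pa_child` are made up for this example; only the attribute names (`writtenForm`, `partOfSpeech`, `synset`) are taken from the cell above, and the real file may be nested or namespaced differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical LMF fragment; the ids and words are invented for illustration.
SAMPLE_LMF = """
<Lexicon>
  <LexicalEntry id="w1">
    <Lemma writtenForm="猫" partOfSpeech="n"/>
    <Sense id="s1" synset="cmn-10-00000001-n"/>
    <Sense id="s2" synset="cmn-10-00000002-n"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="狗+" partOfSpeech="n"/>
    <Sense id="s3" synset="cmn-10-00000003-n"/>
  </LexicalEntry>
  <Synset id="cmn-10-00000001-n"/>
</Lexicon>
"""


def build_pa_child(xml_text, vocabulary):
    """Map synset id -> {word-sense name: [child synset ids]} for known words."""
    pa_child = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        word = lemma.get("writtenForm").replace("+", "")
        if word not in vocabulary:
            continue
        pos = lemma.get("partOfSpeech")
        # Senses are numbered 1, 2, ... in document order.
        for number, sense in enumerate(entry.findall("Sense"), start=1):
            pa_child[sense.get("synset")] = {f"{word}.{pos}.{number}": []}
    return pa_child


demo_pa_child = build_pa_child(SAMPLE_LMF, {"猫", "狗"})
for synset_id, senses in demo_pa_child.items():
    print(synset_id, senses)
```

As in the cell above, a synset id that is reached from several lemmas keeps only the last word-sense name written to it.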


Then we create another dictionary, **parent**, that stores the hypernym relation of each word-sense. The relations are found by looking up the synset in the English WordNet and mapping the related synsets back to their Chinese ids. If a word-sense has more than one hypernym, we keep only one of them and ignore the others.

In [None]:

for id in pa_child: 
    b = id.split('-')
    english_id =  b[2] + b[3]
    #15028818n format

    parent[id] = []
    for name in pa_child[id]: 
        parent[id].append(name)

    try:
        # of2ss maps an id such as '15028818n' to Synset('isoagglutinin.n.01')
        english_name = wn.of2ss(english_id)
    except Exception:
        # no matching synset in the installed English WordNet: skip this id
        continue
    else:
        children_names = english_name.hyponyms()
        if children_names:  
            for child_name in children_names:
                child_id = str(child_name.offset()).zfill(8) + '-' + child_name.pos()
                chinese_child_id = 'cmn-10-' + child_id
                if chinese_child_id in pa_child.keys():
                    for name in pa_child[id]:
                        pa_child[id][name].append(chinese_child_id)

        parent_names = english_name.hypernyms()  
        if parent_names:  
            label = 0
            for parent_name in parent_names:


                only_parent_name = parent_name  
                only_parent_id = str(only_parent_name.offset()).zfill(8) + '-' + only_parent_name.pos()
                chinese_parent_id = 'cmn-10-' + only_parent_id
                if (chinese_parent_id in pa_child.keys()) and (label == 0):  
                    parent[id].append(chinese_parent_id)
                    label = 1
                elif (chinese_parent_id in pa_child.keys()) and (label == 1):
                    # if a word has several parents, keep only the first one and
                    # remove this word from the child lists of the other parents
                    for delete_parent_name in pa_child[chinese_parent_id]:
                        if id in pa_child[chinese_parent_id][delete_parent_name]:
                            pa_child[chinese_parent_id][delete_parent_name].remove(id)
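
The lookup above relies on converting between the Chinese wordnet id (`cmn-10-15028818-n`) and the offset-plus-POS key passed to `wn.of2ss` (`15028818n`). The two helper functions below are hypothetical names introduced only to make that conversion explicit; they assume the id always has the four dash-separated parts seen in the cell above.

```python
def cmn_id_to_wordnet_key(cmn_id):
    """'cmn-10-15028818-n' -> '15028818n' (offset followed by part of speech)."""
    _, _, offset, pos = cmn_id.split("-")
    return offset + pos


def wordnet_to_cmn_id(offset, pos, prefix="cmn-10-"):
    """(15028818, 'n') -> 'cmn-10-15028818-n'; offsets are zero-padded to 8 digits."""
    return f"{prefix}{str(offset).zfill(8)}-{pos}"


key = cmn_id_to_wordnet_key("cmn-10-15028818-n")
round_trip = wordnet_to_cmn_id(int(key[:8]), key[-1])
print(key, round_trip)
```

The zero padding matters because NLTK returns offsets as integers, so a small offset such as 1740 has to be written as `00001740` to match the ids in the Chinese wordnet file.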


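We need to check that each child tree really is a tree and not a graph with cycles; in other words, that no node has one of its ancestors among its descendants. We do this with a depth-first traversal, removing every edge that leads back to a node that has already been visited.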

In [None]:

for id in parent:
    if len(parent[id]) == 1:  
        parent_path = []
        # parent_path.append(id)  
        for x in pa_child[id]:  
            name = x
        node_id = id

        queue=[]
        queue.append(node_id)    
        while queue:
            v = queue.pop()    
            parent_path.append(v) 
            for x in pa_child[v]:
                name = x
            for children_id in reversed(pa_child[v][name]):   
                if children_id in pa_child.keys():
                    if children_id in parent_path:
                        pa_child[v][name].remove(children_id)
                        print("deleted 1 \n")
                    else:
                        queue.append(children_id)
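
To show the idea on a small example, the sketch below runs an iterative depth-first walk over a toy graph and drops every edge that points to an already seen node. It is a simplified variant, not the notebook's exact procedure: it works on a plain `children` dictionary (a hypothetical structure) and marks a node as seen when it is first discovered rather than when it is popped, so each node keeps at most one incoming edge.

```python
def break_cycles(children, root):
    """Walk depth-first from root and remove edges that lead to seen nodes.

    `children` maps a node to the list of its child nodes and is modified
    in place. Returns the list of removed (parent, child) edges.
    """
    seen = {root}
    stack = [root]
    removed = []
    while stack:
        node = stack.pop()
        kept = []
        for child in children.get(node, []):
            if child in seen:
                removed.append((node, child))  # back edge or second parent
            else:
                seen.add(child)
                kept.append(child)
                stack.append(child)
        children[node] = kept
    return removed


# Toy graph: "b" has two parents ("a" and "c"), and "d" points back to "a".
toy_children = {"a": ["b", "c"], "b": ["d"], "c": ["b"], "d": ["a"]}
removed_edges = break_cycles(toy_children, "a")
print(removed_edges)
print(toy_children)
```

After the call, the part of the graph reachable from the root is a tree.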

Then we generate **child_trans.txt**, which records the parent-children relations among word-senses. If an entry in the **parent** dictionary has no parent node, we take the `*root*` node as its parent.

In [None]:
with open("/home/zhanshichen/child_trans.txt","w") as f:
                            
    f.write("*root* ")
    for id in parent:
        if len(parent[id]) == 1:
            f.write(parent[id][0] + ' ')  # write the name of a top-level word-sense
            parent[id].append("*root*")
    f.write('\n')    
    
    for pa_id in pa_child:
        for pa_name in pa_child[pa_id]:
            # if pa_name in half_wordname:    
            f.write(pa_name + ' ')
            for children_id in pa_child[pa_id][pa_name]:  
                if children_id in pa_child.keys():
                    for children_name in pa_child[children_id]:
                        f.write(children_name)
                        f.write(" ")
        f.write("\n")
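
The layout of the resulting file can be illustrated on toy data: the first line lists `*root*` followed by all word-senses without a parent, and every further line lists one word-sense followed by its children. The function and the three dictionaries below are hypothetical simplifications of `pa_child` and `parent`. The sketch joins tokens with single spaces, which differs from the trailing space written by the cell above.

```python
def child_file_lines(names, children, parent_of):
    """Build the lines of a child file from simplified toy structures.

    names:     synset id -> word-sense name
    children:  synset id -> list of child synset ids
    parent_of: synset id -> parent synset id (absent for top-level nodes)
    """
    top_level = [name for node, name in names.items() if node not in parent_of]
    lines = [" ".join(["*root*"] + top_level)]
    for node, name in names.items():
        kids = [names[c] for c in children.get(node, []) if c in names]
        lines.append(" ".join([name] + kids))
    return lines


toy_names = {"id1": "动物.n.1", "id2": "猫.n.1", "id3": "狗.n.1"}
toy_tree = {"id1": ["id2", "id3"]}
toy_parent_of = {"id2": "id1", "id3": "id1"}
demo_lines = child_file_lines(toy_names, toy_tree, toy_parent_of)
print("\n".join(demo_lines))
```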

Next, we re-create the file of pre-trained word embeddings (w2v.vector) so that it contains only the words that also occur in the wordnet vocabulary.

In [None]:

with open("/home/zhanshichen/nball4tree/w2v.vector","r") as f:
    with open("/home/zhanshichen/w2v_new_trans.txt","w") as f_w:    
        lines = f.readlines()
        for line in lines:
            vec = line.strip().split(" ")
            if vec[0] in word_name:
                for each_vector in vec:
                    f_w.write(each_vector + " ")
            f_w.write("\n")
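
The cell above writes a newline for every input line, so each skipped word leaves a blank line that has to be removed in a later step. As an alternative, a minimal sketch that writes only the matching lines in a single pass is shown here; `filter_embeddings` is a hypothetical helper, and in-memory `io.StringIO` objects stand in for the real files.

```python
import io


def filter_embeddings(src, dst, keep_words):
    """Copy only the lines whose first token is in keep_words; return the count."""
    kept = 0
    for line in src:
        fields = line.strip().split(" ")
        if fields[0] in keep_words:
            dst.write(" ".join(fields) + "\n")
            kept += 1
    return kept


# In-memory stand-ins for w2v.vector and the filtered output file.
source = io.StringIO("猫 0.1 0.2\n树 0.3 0.4\n狗 0.5 0.6\n")
target = io.StringIO()
kept_count = filter_embeddings(source, target, {"猫", "狗"})
print(kept_count)
print(target.getvalue(), end="")
```

Passing `keep_words` as a set keeps each membership test cheap even for a large vocabulary.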
            

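Here we write catcode_trans.txt, which stores the location code of each word-sense in the tree structure: the sequence of child positions on the path from the root down to the word-sense, padded with zeros to a fixed length.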

In [None]:

longest_dimension = 0
with open("/home/zhanshichen/catcode_trans.txt","w") as f:
    for id in parent:  
        # if parent[id][0] in half_wordname:  
        f.write(parent[id][0]+' ')   
        node_id = id
        position = []  
        
        while not parent[node_id][1] == "*root*":  
            parent_id =  parent[node_id][1]
            number = 0
            for x in pa_child[parent_id]:
                for child_id in pa_child[parent_id][x]:
                    number = number + 1
                    if child_id == node_id:
                        position.append(str(number) + " ")
                        break
            node_id = parent_id  
            if len(position) == 5:  #deepest tree
                print("root node5:" + parent[node_id][0])

            
        position.append("1")
        dimension = len(position)
        if dimension > longest_dimension:
            longest_dimension = dimension

        level = 0 
        for po_number in position[::-1]:
            f.write(po_number + ' ')
            level = level + 1
        while level < 9:
            f.write("0" + " ")
            level = level + 1
        f.write("\n")
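
The location code can be illustrated on a toy tree. The sketch below climbs from a node to the top of the tree, records the 1-based position of each node among its siblings, prefixes the code with 1 for the top level and pads with zeros. The function `catcode` and its two dictionaries are hypothetical simplifications; the width of 9 mirrors the padding used in the cell above.

```python
def catcode(node, parent_of, children, width=9):
    """Return the location code of `node` as a list of integers.

    parent_of: node -> parent node (absent for top-level nodes)
    children:  node -> ordered list of child nodes
    """
    positions = []
    current = node
    while current in parent_of:
        parent = parent_of[current]
        # 1-based position of `current` among the children of its parent
        positions.append(children[parent].index(current) + 1)
        current = parent
    positions.append(1)  # the top-level node gets position 1
    code = positions[::-1]
    # Pad with zeros; codes longer than `width` are returned unpadded.
    return code + [0] * (width - len(code))


toy_parent = {"猫.n.1": "动物.n.1", "狗.n.1": "动物.n.1"}
toy_kids = {"动物.n.1": ["猫.n.1", "狗.n.1"]}
dog_code = catcode("狗.n.1", toy_parent, toy_kids)
animal_code = catcode("动物.n.1", toy_parent, toy_kids)
print(dog_code)
print(animal_code)
```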

Finally, remove the blank lines from the newly generated file of pre-trained word embeddings.

In [None]:
# read the file written by the filtering step above
with open("/home/zhanshichen/w2v_new_trans.txt","r") as f_r:
    with open("/home/zhanshichen/w2v_2.txt","w") as f_w:
        for line in f_r.readlines():                                  
            data=line.strip()
            if len(data)!=0:
                f_w.write(data)
                f_w.write('\n')

## Training 
Training follows [nball4tree](https://github.com/gnodisnait/nball4tree), which trains and evaluates n-ball embedding files.


In [None]:
$ git clone https://github.com/gnodisnait/nball4tree.git
$ cd nball4tree
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt


In [None]:
$ python nball.py --train_nball nball.txt --w2v w2v_2.txt  --ws_child child_trans.txt  --ws_catcode catcode_trans.txt  --log log.txt
% --train_nball: output file of nball embeddings
% --w2v: file of pre-trained word embeddings (w2v_2.txt)
% --ws_child: file of parent-children relations among word-senses (child_trans.txt)
% --ws_catcode: file of the parent location code of a word-sense in the tree structure (catcode_trans.txt)
% --log: log file, shall be located in the same directory as the file of nball embeddings

After several hours, training generates **nball.txt** and prints the training result.

## Result
We successfully generated the parent-child relation file, the catcode file and the matching word-embedding file for the Chinese language. Training n-ball embeddings on these files yields the nball-embedding file, which encodes the tree structures and their relations.
### Results of the experiment
| Name | Result |
| --- | --- |
| parent-children relations among word-senses | [child_trans.txt](https://drive.google.com/open?id=1MLveoPRB4JN4HJF01a3cXWtZ-ox27JWL) |
| parent location code | [catcode_trans.txt](https://drive.google.com/open?id=1JI_UUbse-oJtdumfGy6wJxJOH_uqjuNJ) |
| pre-trained word embeddings | [w2v_2.txt](https://drive.google.com/open?id=1FGb6eIapn8VA_NafSzFRpP-cAigcQ4hf) |
| log file | [training.log](https://drive.google.com/open?id=1xHEgiq60YMW_V8Sf5tVTHCrpilurrAsI) |
| nball word embedding | [nball.txt](https://drive.google.com/open?id=1yqOyzzq6N_b54IU-tZCPcrLF90Od6Rcv) |

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>