# Data Generation
### Contributed by Praveen Yadav, [GitHub Repo](https://github.com/fnc11/nball4treehindi)
This part starts from two input files: the tree structure, i.e. the path from every node to the topmost parent, and the set-to-word mapping that we printed from the modernSyn2Words dictionary in the pre-processing section.
First, we built two dictionaries from the set-to-word mapping: one from set to word and one in the reverse direction. As a result, we kept only the first word of each set and ignored all the others, because that was the easiest way to handle this data.
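A minimal sketch of the "keep only the first word" choice, assuming the synsets are available as an in-memory dictionary from set number to a list of words. The name `first_word_mappings` and the sample data are hypothetical and only illustrate the idea; they are not taken from the real dictionary.

```python
# Hypothetical sample data: set numbers mapped to lists of words.
# Neither the numbers nor the words come from the real dictionary.
syn2words = {
    "101": ["jal", "paani", "neer"],
    "102": ["agni", "aag"],
}


def first_word_mappings(syn2words):
    """Return (set2word, word2set), keeping only the first word of each set."""
    set2word = {}
    word2set = {}
    for set_id, words in syn2words.items():
        if not words:
            continue  # a set without words cannot be mapped
        first = words[0]
        set2word[set_id] = first
        word2set[first] = set_id
    return set2word, word2set


set2word, word2set = first_word_mappings(syn2words)
print(set2word["101"])      # jal
print("paani" in word2set)  # False: only the first word of a set is kept
```

One consequence of this design: if two sets share the same first word, the reverse dictionary keeps only the set that was inserted last.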
We wrote a Tree class with the standard attributes and methods, which are easiest to understand by reading the implementation, plus some extra methods that we needed for later experiments.
Before building the tree from the paths, we passed all the sets through the new set-to-word dictionary once more to make sure that no tree node carries a set number for which we have no corresponding word. After this filtering, we built the tree, with *root* as the ultimate parent of every node. Where a set number is available, the nodes store the set number rather than the word; further details can be found in the implementation.
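The filtering step can be sketched as follows, assuming each path is a string of tokens joined by `<-`, with the word first and its ancestor sets after it. The helper name `clean_path` and the sample values are hypothetical.

```python
def clean_path(path, known_sets, sep="<-"):
    """Keep the first token (the word) and drop every later token
    that is not a known set number."""
    tokens = path.split(sep)
    kept = [tokens[0]] + [t for t in tokens[1:] if t in known_sets]
    return sep.join(kept)


# Hypothetical example: set 999 has no word, so it is removed from the path.
known = {"100", "101"}
cleaned = clean_path("jal<-101<-999<-100", known)
print(cleaned)  # jal<-101<-100
```

Dropping an unknown set this way attaches its descendants to the next known ancestor instead of discarding the whole path.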
With that in place, we printed the files needed for experiment 1 in the required format. Writing wordSenseChildren.txt is a simple tree traversal, whereas printing the category codes takes more involved code because the codes of the intermediate nodes must be printed as well.
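To illustrate what a category code encodes, here is a simplified sketch: a node's code is the sequence of sibling positions on the path from the root down to that node, padded with zeros to a fixed width. The function name `category_codes`, the plain-dictionary tree and the width of 6 are assumptions made for this example; the notebook's own code works on the path files and uses a longer code.

```python
def category_codes(children, root="root", width=6):
    """Map every node name to a zero-padded list of sibling positions.

    `children` maps a parent name to a list of child names.
    Siblings are ranked alphabetically, starting at 1.
    """
    codes = {}
    stack = [(root, [1])]  # the root itself gets position 1
    while stack:
        name, prefix = stack.pop()
        codes[name] = prefix + [0] * (width - len(prefix))
        ranked = sorted(children.get(name, []))
        for rank, child in enumerate(ranked, start=1):
            stack.append((child, prefix + [rank]))
    return codes


# Hypothetical tree: root -> a, b and a -> c, d
tree = {"root": ["b", "a"], "a": ["d", "c"]}
codes = category_codes(tree)
print(codes["d"])  # [1, 1, 2, 0, 0, 0]
```

Intermediate nodes such as `a` receive a code of their own, which is the part that makes the real printing code more involved than a leaf-only traversal.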

## Difficulties/Choices:
1. We took only the first word from the modernSyn2Words dictionary and ignored all other words in the set.
2. While building the tree, we stored a reference to every node object in a dictionary for faster access, so that child nodes can be attached directly. This dictionary was also used in later experiments.
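The second choice can be sketched like this: a dictionary from node name to node object lets each path be processed with constant-time lookups instead of a search through the tree. `Node` and `build_tree` are hypothetical names, and the sketch is deliberately simpler than the notebook's code (it does not translate words to set numbers).

```python
class Node:
    """Minimal tree node for this example."""

    def __init__(self, name):
        self.name = name
        self.children = []


def build_tree(paths, sep="<-"):
    """Build a tree from 'child<-parent<-...<-topmost' paths.

    Returns the root and the name-to-node registry.
    """
    root = Node("root")
    nodes = {"root": root}
    for path in paths:
        parent = root
        # Walk from the topmost ancestor down to the node itself.
        for name in reversed(path.split(sep)):
            if name not in nodes:
                node = Node(name)
                nodes[name] = node
                parent.children.append(node)
            parent = nodes[name]
    return root, nodes


# Hypothetical paths; the numbers stand in for set numbers.
root, nodes = build_tree(["4<-2<-1", "5<-2<-1", "3<-1"])
print([c.name for c in nodes["1"].children])  # ['2', '3']
```

Because a node is created only when its name is missing from the registry, shared ancestors such as `1` and `2` are created once and reused by later paths.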

## Run the code below
### To build the tree from the paths and generate the wordSenseChildren.txt and catCodes.txt files that we ultimately need.

### Tree Class
* A utility class that we need for building the tree from our data.

In [None]:
class Tree(object):
    """Generic tree node."""
    def __init__(self, name='root', children=None):
        self.name = name
        self.children = []
        if children is not None:
            for child in children:
                self.add_child(child)

    def add_child(self, node):
        assert isinstance(node, Tree)
        self.children.append(node)

    def printTree(self):
        # Print all node names depth-first, using a list as an explicit stack.
        assert isinstance(self, Tree)
        stck = [self]
        while stck:
            temp = stck.pop(0)
            print(temp.name)
            for child in temp.children:
                stck.insert(0, child)

    def printLevelOrder(self):
        assert isinstance(self, Tree)
        que = [self]
        nums_at_level = 1
        levelNum = 0
        # maxLevel = -1
        levelDict = {0:[]}
        for i in range(1,15):
            levelDict[i]=[]
        count = 0
        while(True):
            if nums_at_level == 0:
                # print("level: "+str(levelNum))
                # if levelNum > maxLevel:
                #     maxLevel = levelNum
                levelNum += 1
                nums_at_level = count
                count = 0

            if len(que) != 0:
                nums_at_level -= 1
                temp = que.pop(0)
                # print(temp.name)
                levelDict[levelNum].append(temp.name)
                if len(temp.children) > 0:
                    for child in temp.children:
                        que.append(child)
                        count += 1
            else:
                break
        return levelDict
        # print(maxLevel)

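    def child_size(self, sz):
        # Count the nodes in this subtree, including this node.
        # Returns early once the running count exceeds sz.
        assert isinstance(self, Tree)
        count = 0
        for child in self.children:
            if count > sz:
                return count
            # recurse with child_size; the class defines no method named size
            count += child.child_size(sz)
        return count + 1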

    def node_and_children(self):
        assert isinstance(self, Tree)
        stck = [self]
        child_list = []
        while len(stck) != 0:
            node = stck.pop()
            child_list.append(node.name)
            for child in node.children:
                stck.append(child)
        return child_list

    def print_all_paths(self, path, filest):
        assert isinstance(self, Tree)
        new_path = self.name+" "+path
        # print(new_path)
        filest.write(new_path+"\n")
        if len(self.children)!= 0:
            for child in self.children:
                child.print_all_paths(new_path, filest)


### set2Word and word2Set Dictionaries
* The code below builds two dictionaries that let us convert synsets to words and vice versa whenever we need to.

In [None]:
set2Word = {}
word2Set = {}

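with open("data/set2WordV.txt", "r") as s2w:
    cont = s2w.read()
    # entries are separated by "$"; each entry has the form "<set number>:<word>"
    for entry in cont.split("$"):
        pairs = entry.split(":")
        # skip empty or malformed entries, e.g. after a trailing "$"
        if len(pairs) < 2:
            continue
        num = pairs[0]
        word = pairs[1]
        set2Word[num] = word
        word2Set[word] = num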

### Building the tree 
* Building the tree out of paths.

In [None]:
# Code to generate tree and printing wordsense and catcodes starts from here
root = Tree("root")
allNodes = {"root": root}

# final cleansing of the paths
paths = []
clean_paths = []
with open("data/tree_struct.txt", 'r') as tree_struct:
    struct_cont = tree_struct.read()
    paths = struct_cont.split("$")
    # count = 0
    for path in paths:
        tokens = path.split("<-")
        num_of_tokens = len(tokens)
        new_path = tokens[0]
        for i in range(1, num_of_tokens):
            if tokens[i] in set2Word.keys():
                new_path += "<-" + tokens[i]
        clean_paths.append(new_path)

    for path in clean_paths:
        tokens = path.split("<-")
        # print(tokens)
        num_tokens = len(tokens)
        # remember the parent token so that a missing child can be attached to it
        prev_token = tokens[num_tokens - 1]
        # the loop stops at tokens[2]; the word itself (tokens[0]) is attached separately below
        for i in range(0, num_tokens - 2):
            name = str(tokens[num_tokens - 1 - i])
            # to avoid index out of bound error
            if i > 0:
                prev_token = tokens[num_tokens - i]
            if name in allNodes.keys():
                continue
            else:
                # create a new node and save its mapping in the allNodes dict
                temp = Tree(name)
                allNodes[name] = temp
                if i == 0:
                    # if it's the node that goes below root
                    allNodes["root"].add_child(temp)
                else:
                    # if it is some other node's child, other than root(prev_token's)
                    allNodes[prev_token].add_child(temp)
        # now for the last node, as word
        name = str(tokens[0])
        # if the word is not directly attached to the root
        if num_tokens >= 3:
            if name in word2Set.keys():
                name = word2Set[name]
            if name not in allNodes.keys():
                temp = Tree(name)
                allNodes[name] = temp
                allNodes[tokens[2]].add_child(temp)
        else:
            if name in word2Set.keys():
                name = word2Set[name]
            if name not in allNodes.keys():
                temp = Tree(name)
                allNodes[name] = temp
                allNodes["root"].add_child(temp)

### setOrderNum Dictionary
* This code builds a dictionary that records the position of each set among its siblings under its parent, in alphabetical order of the words. It is used later when printing the catCodes of the words.

In [None]:
# a dictionary to hold the order of the set in certain level under certain parent
setOrderNum = {'root': 1 }
with open("data/sameLevelWords.txt", "w") as slw:
    stck =[root]
    while(True):
        if len(stck) != 0:
            temp = stck.pop(0)
            if len(temp.children) > 0:
                o_list = temp.children
                c_list  = []
                for child in o_list:
                    word_name = child.name
                    if child.name in set2Word.keys():    
                        word_name = set2Word[child.name]
                    c_list.append(word_name)
                    stck.append(child)
                c_list.sort()
                k=0
                for i in c_list:
                    k=k+1
                    if i in word2Set.keys():
                        setOrderNum[word2Set[i]] = k
                    else:
                        setOrderNum[i] = k
        else:
            break
        slw.write("\n\n")

### Printing CatCodes 
* Printing CatCodes for all the words

In [None]:
# printing cat codes of all the words in a file
def print_catcodes():
    with open("data/catCodes.txt", "w") as ctcd:
        # count = 0
        # A dictionary to hold all the words whose cat_codes are already generated
        cat_printed = {}
        for path in clean_paths:
            tokens = path.split("<-")
            num_of_tokens = len(tokens)

            for j in range(0, num_of_tokens):
                leaf = tokens[j]
                if leaf in set2Word.keys():
                    leaf = set2Word[leaf]
                if leaf not in cat_printed:
                    cat_printed[leaf]=True
                    sen = ""
                    slen = 0
                    for i in range(j, num_of_tokens):
                        if j==0 and i==1:
                            slen = 1
                            continue
                        word = tokens[i]
                        if word in set2Word.keys():
                            word = set2Word[word]
                        if word in word2Set.keys():
                            sen = str(setOrderNum[word2Set[word]])+" "+sen
                        # discussion need to be done here
                        elif word in setOrderNum.keys():
                            # count += 1
                            sen = str(setOrderNum[word])+" "+sen
                        else:
                            print("Evil case")
                            continue
                    # root order for every word
                    # need to add special case for root*
                    sen = "1 "+sen.strip()
                    for k in range(0, 13-(num_of_tokens-j-slen)):
                        sen += " 0"
                    ctcd.write(leaf+" "+sen+"\n")
        # print(count)

print_catcodes()

### Printing WordSenses
* Printing WordSenses for all the words

In [None]:
# root.printTree()
# printing word sense children file as English
def print_word_senses():
    with open("data/wordSenseChildren.txt","w") as wsc:
        stck = [root]
        while (True):
            if len(stck) != 0:
                temp = stck.pop(0)

                if temp.name in set2Word.keys():
                    wsc.write(set2Word[temp.name])
                else:
                    wsc.write(temp.name)

                if len(temp.children) > 0:
                    for child in temp.children:
                        stck.insert(0, child)
                        if child.name in set2Word.keys():
                            wsc.write(" "+set2Word[child.name])
                        else:
                            wsc.write(" "+child.name)

                wsc.write("\n")
            else:
                break

print_word_senses()

## Some modifications have to be made to both files
* In the wordSenseChildren.txt file, change root to \*root\*.
* In catCodes.txt, also change root to \*root\* and then change the second 1 in its code to 0.
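These two manual changes could also be scripted. The sketch below works on lists of lines and assumes that each line is a space-separated list of tokens, that `root` appears as an exact token, and that the root line in catCodes.txt starts with `root 1 1`. The function names are hypothetical, and reading and writing the files is left out.

```python
def fix_word_sense_lines(lines):
    """Rename the exact token 'root' to '*root*' in every line."""
    fixed = []
    for line in lines:
        tokens = ["*root*" if t == "root" else t for t in line.split()]
        fixed.append(" ".join(tokens))
    return fixed


def fix_cat_code_lines(lines):
    """Rename the root entry and set the second digit of its code to 0."""
    fixed = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "root":
            tokens[0] = "*root*"
            # tokens[1] is the first code digit, tokens[2] the second
            if len(tokens) > 2:
                tokens[2] = "0"
        fixed.append(" ".join(tokens))
    return fixed


# Hypothetical, shortened lines for illustration.
ws_fixed = fix_word_sense_lines(["root a b", "a c"])
cc_fixed = fix_cat_code_lines(["root 1 1 0 0", "a 1 1 2 0"])
print(ws_fixed[0])  # *root* a b
print(cc_fixed[0])  # *root* 1 0 0 0
```

Matching whole tokens rather than substrings avoids touching any word that merely contains the letters "root".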

## Now you are ready to use these files for experiments

<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, C. Bauckhage<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>