# Data Pre-Processing:
### Contributed by Parth Wadhwa, [GitHub Repo](https://github.com/fnc11/nball4treehindi)

The files used for Hindi data generation are taken from this [Bitbucket repository](https://bitbucket.org/sivareddyg/python-hindi-wordnet/src/master), which in turn draws mainly on data from [IIT Bombay](http://www.cfilt.iitb.ac.in/).

You need to download the word vectors (w2v) from this [website](https://fasttext.cc/docs/en/crawl-vectors.html) and remove the first line of the file, because it only holds the number of words and the vector dimension.
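
As a minimal sketch, the header can also be skipped while reading instead of editing the file by hand. The helper name `iter_vec_words` is hypothetical, and the check assumes the header consists of exactly two integers (word count and dimension), which is the usual layout of fastText `.vec` files.

```python
import io

def iter_vec_words(lines):
    """Yield the first field of every line, skipping a fastText-style header."""
    for index, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # blank line, nothing to extract
        # Assumed header shape: exactly two integers, "<word count> <dimension>".
        if index == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue
        yield fields[0]

# Tiny in-memory stand-in for cc.hi.300.vec (the vectors are made up).
sample = io.StringIO("2 3\nनमस्ते 0.1 0.2 0.3\nदुनिया 0.4 0.5 0.6\n")
words = list(iter_vec_words(sample))
print(words)
```

Because the function accepts any iterable of lines, it works the same way on an open file handle and never loads the whole file into memory.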


## Extract Embedding words:
First, extract the words from the word-vector file so that we can filter out the words in our database for which we have no vectors.
* Since the w2v file is large (about 4 GB), this has already been done; you can find the extracted words in the data folder.
* Keep in mind that this step is computationally intensive, so run it only once.

In [None]:
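# Stream the file line by line; reading all ~4 GB at once can exhaust memory.
with open("cc.hi.300.vec", 'r', encoding="utf-8") as vec, \
        open("data/wordEmbs.txt", 'w', encoding="utf-8") as word_embs:
    for line in vec:
        # The word is the first space-separated field of each line.
        word = line.rstrip("\n").split(" ")[0]
        if word:
            word_embs.write(word + "$")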

## Download these three files from the mentioned [repository](https://bitbucket.org/sivareddyg/python-hindi-wordnet/src/master/)
* WordSynsetDict.pk, SynsetHypernym.pk and SynsetWords.pk (the last one can be skipped). These are also available in the data folder.
## Data Format
Here is a description of the data format of these files and of what we tried to derive from them.
We used the WordSynsetDict.pk and SynsetHypernym.pk dictionaries from Siva Reddy's repository. These files have the following structure:
WordSynsetDict.pk
Actual:
नरहरि
{'1': [u'19440']}
झौंस
{'1': [u'12911', u'8884'], '3': [u'11454', u'8391']}
सेनजित्
{'1': [u'21358', u'21357']}

English Analogy:
word1
{'1': [u'set19440']}
word2
{'1': [u'set12911', u'set8884'], '3': [u'set11454', u'set8391']}
…
In other words, each word is a dictionary key whose value is another dictionary. That inner dictionary maps a word-type code ('1' noun, '2' adjective, '3' verb, '4' adverb) to the list of sets that contain the word. For example, झौंस belongs to two noun sets (12911 and 8884) and two verb sets (11454 and 8391).
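
The following sketch walks this structure on a miniature copy of the sample above. The names `word2synsets_demo`, `TYPE_NAMES` and `senses` are hypothetical and only serve the illustration.

```python
# Miniature stand-in for WordSynsetDict.pk, copied from the sample entries.
word2synsets_demo = {
    "नरहरि": {"1": ["19440"]},
    "झौंस": {"1": ["12911", "8884"], "3": ["11454", "8391"]},
}

# Word-type codes as described in the text (string keys in this file).
TYPE_NAMES = {"1": "noun", "2": "adjective", "3": "verb", "4": "adverb"}

def senses(word, table):
    """Return (word type, synset id) pairs for one word; empty if unknown."""
    pairs = []
    for typ in sorted(table.get(word, {})):
        for synset in table[word][typ]:
            pairs.append((TYPE_NAMES[typ], synset))
    return pairs

print(senses("झौंस", word2synsets_demo))
```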

SynsetHypernym.pk
{'12836': {1: [u'196']}, '12835': {1: [u'1070']}, '12834': {1: [u'652']}, '8545': {1: [u'1439', u'7290']}, '8544': {1: [u'564']}, '12839': {3: [u'13028']}, '8546': {1: [u'3139']}, '8541': {3: [u'5367']}..........}

This is also a dictionary. Set numbers are the keys, and each value is again a dictionary that stores the immediate parent sets of that set. The parents are not stored as one flat list but grouped by type (noun (1), adjective (2), verb (3) and adverb (4)). Note that the type keys here are integers, whereas WordSynsetDict.pk uses strings. You may wonder why a single set would have parents of different types. That is simply how the data is organised; one possible reason is that a set contains a word which behaves as a noun or a verb depending on the context, so the set can have both noun and verb parent sets.
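
A small lookup sketch, using a few entries from the sample above, shows how parents are retrieved per type. The helper `parents` is a hypothetical name; it converts the type to `int` because this file uses integer type keys.

```python
# Miniature stand-in for SynsetHypernym.pk, copied from the sample entries.
synset2hypes_demo = {
    "12836": {1: ["196"]},
    "8545": {1: ["1439", "7290"]},
    "12839": {3: ["13028"]},
}

def parents(synset_id, typ, table):
    """Return the parent synsets of one synset for one word type, or []."""
    # Type keys are integers here, so a string code such as "1" is converted.
    return table.get(synset_id, {}).get(int(typ), [])

print(parents("8545", "1", synset2hypes_demo))   # two noun parents
print(parents("12839", "1", synset2hypes_demo))  # no noun parents, only verb ones
```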

Our main goal is to bring the data into a format from which the input for the nball4tree code can be generated. To that end, we print the path from each word to the root, or to the last parent it has, so that the next section can simply take these paths and build a tree.
Note: in this context we use "set" and "synset" very frequently and interchangeably; we will point it out wherever they mean something different.
The first step was to identify the sets in which no word has an embedding. While doing this, we had to make sure not to remove a set if it contains at least one word for which we do have an embedding. We collected the numbers of all synsets that need to be removed entirely, but we did not delete them right away, because doing so might break the hierarchical structure for many words; instead, we remove them from the final paths.
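
The filtering rule can be sketched on toy data as follows. The function name `split_synsets` and the words `w1`/`w2` are hypothetical; the notebook code further down implements the same idea on the real dictionaries.

```python
def split_synsets(word2synsets, embedded_words):
    """Return (to_keep, to_remove) synset ids according to the rule above."""
    to_keep, to_remove = set(), set()
    for word, by_type in word2synsets.items():
        # A word with an embedding protects every synset it belongs to.
        target = to_keep if word.strip() in embedded_words else to_remove
        for synsets in by_type.values():
            target.update(synsets)
    # A synset shared with an embedded word must survive.
    return to_keep, to_remove - to_keep

demo = {
    "w1": {"1": ["10", "20"]},  # has an embedding
    "w2": {"1": ["20", "30"]},  # has no embedding
}
to_keep, to_remove = split_synsets(demo, {"w1"})
print(sorted(to_keep), sorted(to_remove))
```

Synset `20` stays because `w1` belongs to it, even though `w2` has no embedding; only `30` ends up in `to_remove`.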
Since one word can belong to several sets, we numbered its occurrences by word type and by position in the list, for example:
झौंस
{'1': [u'12911', u'8884'], '3': [u'11454', u'8391']}

झौंस.n.01 -> 12911
झौंस.n.02 -> 8884
झौंस.v.01 -> 11454
झौंस.v.02 -> 8391
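
A minimal sketch of this numbering scheme follows. The name `label_senses` is hypothetical; the letters `n`, `j`, `v` and `r` are the ones returned by `assign_alpha` later in this notebook.

```python
# Letter per word-type code, matching assign_alpha in the code below.
TYPE_LETTER = {"1": "n", "2": "j", "3": "v", "4": "r"}

def label_senses(word, by_type):
    """Return (label, synset id) pairs such as ('झौंस.n.01', '12911')."""
    labels = []
    for typ in sorted(by_type):
        for position, synset in enumerate(by_type[typ], start=1):
            label = "{}.{}.{:02d}".format(word, TYPE_LETTER[typ], position)
            labels.append((label, synset))
    return labels

labels = label_senses("झौंस", {"1": ["12911", "8884"], "3": ["11454", "8391"]})
for label, synset in labels:
    print(label, "->", synset)
```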

We also built a reverse dictionary (modernSyn2Words) in which the sets are the keys and each value is a list of words, like the following:
{'12911': ['झौंस.n.01'], '8884': ['झौंस.n.02'], …}
We need this dictionary later. Although each set maps to a list of words, we print only the first word of each list as the representative of that set; this is an optimization that will be explained later.
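
The reverse mapping and the "first word only" choice can be sketched like this. The function `invert` and the second word `क.n.01` are hypothetical and only illustrate a set with more than one member.

```python
from collections import defaultdict

def invert(labelled_pairs):
    """Group word labels by synset id, preserving insertion order."""
    syn2words = defaultdict(list)
    for label, synset in labelled_pairs:
        syn2words[synset].append(label)
    return dict(syn2words)

pairs = [
    ("झौंस.n.01", "12911"),
    ("झौंस.n.02", "8884"),
    ("क.n.01", "12911"),  # hypothetical second member of set 12911
]
syn2words = invert(pairs)

# Only the first word of each list is written out as the set's representative.
representative = {synset: words[0] for synset, words in syn2words.items()}
print(representative)
```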
After formatting the words, it was time to print the paths, which we built from the hypernym relation between the sets. We do not have a hypernym relation for individual words, only for entire sets, so we kept the set numbers on the leaf-to-root paths. Our paths looked like this:  
स्वरभंग.n.01<-17449<-1423<-652\\$  
नाइजीरिया.n.01<-20242<-7440<-3108<-2022<-923<-3259\\$  
….  
Keeping the set numbers preserves more information and lets us decide at a later point how to deal with them. As mentioned above, the sets for which we have no embeddings are removed only here, after the paths have been built.
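
The path construction can be sketched in simplified form. This is an illustration under assumptions, not the notebook's `printUptoRoot`: the hypothetical `build_path` ignores word types, always follows the first parent, and uses a toy hypernym table built from the first example path above.

```python
def build_path(label, start, hypernyms, to_remove):
    """Follow first parents from `start` upwards, then drop unwanted sets."""
    path, seen, current = [], set(), start
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in the hierarchy
        path.append(current)
        parent_list = hypernyms.get(current, [])
        current = parent_list[0] if parent_list else None
    # Removal happens only now, so the walk itself never loses a link.
    kept = [synset for synset in path if synset not in to_remove]
    return "<-".join([label] + kept)

hypernyms_demo = {"17449": ["1423"], "1423": ["652"]}
full = build_path("स्वरभंग.n.01", "17449", hypernyms_demo, set())
pruned = build_path("स्वरभंग.n.01", "17449", hypernyms_demo, {"1423"})
print(full)
print(pruned)
```

Deleting set `1423` before the walk would have cut the link between `17449` and `652`; filtering afterwards keeps `652` reachable.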

## Difficulties/Choices:
1. The first difficulty we faced was computational resources: the w2v file is almost 4 GB in size, so we had to extract the words from it separately in order to run the remaining steps efficiently.
2. While building the paths for the nodes, we found that the data contains branches that split and merge: a single set can have multiple parents that follow separate paths and later meet at a common node. For simplicity, we took only the first parent set number from each list of hypernym sets.
3. We kept the set numbers on intermediate nodes instead of the actual words.
4. We deferred removing sets from the paths to a later point, a choice we arrived at after a few attempts.
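
To illustrate the second point, the sketch below contrasts enumerating every path with following only the first parent. The synset ids `A` to `D` and both function names are hypothetical.

```python
def all_paths(synset, hypernyms, prefix=()):
    """Enumerate every leaf-to-root path through a hierarchy with merges."""
    if synset in prefix:
        return []  # cycle guard: abandon a path that revisits a set
    path = prefix + (synset,)
    parent_list = hypernyms.get(synset, [])
    if not parent_list:
        return [path]
    found = []
    for parent in parent_list:
        found.extend(all_paths(parent, hypernyms, path))
    return found

def first_parent_path(synset, hypernyms):
    """Follow only the first parent at every step."""
    path = []
    while synset is not None and synset not in path:
        path.append(synset)
        parent_list = hypernyms.get(synset, [])
        synset = parent_list[0] if parent_list else None
    return tuple(path)

# A splits into B and C, which merge again at D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(all_paths("A", diamond))
print(first_parent_path("A", diamond))
```

With the first-parent rule each word yields exactly one path, so the result is a tree rather than a general graph, at the cost of discarding the alternative route through `C`.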


## Run the code below step by step; it generates the tree structure, i.e. the paths from the leaves to the last parent.

### Read the packages


In [None]:
import pickle

# Loading input dictionaries (context managers close the files after loading)
with open("data/WordSynsetDict.pk", 'rb') as f:
    word2synsets = pickle.load(f)
# synset2words = pickle.load(open("data/SynsetWords.pk", 'rb'))
with open("data/SynsetHypernym.pk", 'rb') as f:
    synset2hypes = pickle.load(f)

### Writing paths

In [None]:
# All the paths from the leaf/word to the root will be written to tree_struct.txt
tree_struct = open("data/tree_struct.txt", 'w+', encoding="utf-8")

### Assigning words according to their type 
A function to assign characters based upon the word type.

In [None]:
"""
This function assigns one special character to each word type in the wordnet.
    1: n (noun), 2: j (adjective), 3: v (verb), anything else: r (adverb)
"""


def assign_alpha(num):
    if num == "1":
        return "n"
    elif num == "2":
        return "j"
    elif num == "3":
        return "v"
    else:
        return "r"

### Filtering out the sets
* Build two sets, to_remove and to_keep, which hold the synsets we need to remove and keep.

In [None]:
""" 
Compares the words from the Hindi Wordnet and the Hindi word embeddings, and keeps track of the synsets that
contain words which are not in the word embeddings, so that they can be removed later from the Hindi Wordnet.
"""

# Synsets of the words which are not in the word-embeddings file.
to_remove = set()

# Synsets of the common words between both files.
to_keep = set()

# wordEmbs.txt contains all the words which are present in the Hindi word embeddings file
# sets2remove.txt contains all the synsets which need to be removed from the Hindi Wordnet
with open("data/wordEmbs.txt", 'r', encoding="utf-8") as emb_word_f, open("data/sets2remove.txt", 'w') as inspectf:
    embedding_content = emb_word_f.read()
    # a set gives constant-time membership tests; a list would be scanned for every word
    bwords = set(embedding_content.split("$"))
    print(len(bwords))
    # count = 0
    # first creating two sets one set for all the sets which are going to be removed.
    # second mapping for the sets which definitely have some words in the ball embeddings
    wordsto_remove = []
    for word, value in word2synsets.items():
        normword = word.strip(" ")
        # print(normword)
        # check if the words from wordnet are present in the embedding words
        if normword in bwords:
            # print("prs")
            # count += 1
            for typ, lis in value.items():
                for i in range(0, len(lis)):
                    # why not save type
                    to_keep.add(lis[i])
        else:
            wordsto_remove.append(word)
            # print("abs")
            for typ, lis in value.items():
                for i in range(0, len(lis)):
                    to_remove.add(lis[i])

    # removing the words from the word2synsets dictionary
    # this ensures we don't pursue paths for such words
    for word in wordsto_remove:
        del word2synsets[word]

    # A synset shared with at least one embedded word must not be removed.
    # The in-place set difference also avoids a loop variable named `set`,
    # which would shadow the builtin and break later calls to set().
    to_remove -= to_keep

    # write out the synsets in which no word has an embedding
    inspectf.write(" ".join(to_remove))

### Method to print Paths
* This method writes the leaf-to-parent path of the given word to tree_struct.txt and returns the number of sets removed from that path.

In [None]:
def printUptoRoot(sen, key, wordtype):
    key_exp = {}
    # path length
    count = 0
    tlen = 0
    while (True):
        if key not in key_exp:
            key_exp[key] = True
            sen += "<-" + str(key)
            tlen += 1
            # reminder take len1 words
            if key in synset2hypes:
                # check here wordtype not used
                if int(wordtype) in synset2hypes[key].keys():
                    lis = synset2hypes[key][int(wordtype)]
                    key = lis[0]
                    for i in range(0, len(lis)):
                        # we could do the to_remove check here for better results
                        if lis[i] not in to_remove:
                            key = lis[i]
                            break
                else:
                    print("Rare case!")
                    # reminder why not add them directly to the root
                    # no parents of the word's own type: fall back to any other type
                    for typ, lis in synset2hypes[key].items():
                        # use `lis` from the loop; re-indexing synset2hypes with `key`
                        # would fail once `key` has been reassigned below
                        key = lis[0]
                        to_break = False
                        for i in range(0, len(lis)):
                            # we could do the to_remove check here for better results
                            if lis[i] not in to_remove:
                                to_break = True
                                key = lis[i]
                                break
                        if to_break:
                            break
            else:
                # Analyze this path and remove the sets which belongs to to_remove set
                tokens = sen.split("<-")
                newsen = tokens[0]
                for i in range(1, len(tokens)):
                    if tokens[i] not in to_remove:
                        newsen += "<-" + str(tokens[i])
                    else:
                        count += 1
                if (tlen > 1):
                    tree_struct.write(newsen + "$")
                break
        else:
            break
    return count

### Printing paths
* This code runs through all the words and prints their paths until no further parent is present.

In [None]:
wordList = []
sentence = ""
num = 0
modernSyn2Words = {}
count = 0
for word, value in word2synsets.items():
    # num += 1
    for typ, lis in value.items():
        alpha = assign_alpha(typ)
        for i in range(0, len(lis)):
            # print(word)
            wordversion = word + "." + alpha + "." + "{:02d}".format(i+1)

            # if the set already exist in the modernSyn2Words just add this word also
            # otherwise add new list with this word
            if lis[i] in modernSyn2Words.keys():
                modernSyn2Words[lis[i]].append(wordversion)
            else:
                modernSyn2Words[lis[i]] = [wordversion]

            count += printUptoRoot(wordversion, lis[i], typ)
            # printHch(sentence, lis[i], 0)
            # wordList.append((key+alpha+str(i+1), lis[i]))
    # if num == 1000:
    #     break
print("Total sets removed:{}".format(count))
tree_struct.write("root")
# Close the file so that all buffered paths are flushed to disk.
tree_struct.close()

### Printing new set to word pairs
* The code below writes the modernSyn2Words dictionary we created to the set2WordV.txt file so that it can be used in the data generation process.

In [None]:
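with open("data/set2WordV.txt", 'w', encoding="utf-8") as s2w:
    # `synset` rather than `set`, so the builtin set type is not shadowed
    for synset, values in modernSyn2Words.items():
        # only the first word of each list represents the set
        s2w.write(str(synset) + ":" + values[0] + "$")
    s2w.write("root:root")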
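
For the later data generation step, the two output files can be read back with a sketch like the one below. The function names `parse_paths` and `parse_set2word` are hypothetical, and the parsing assumes that no word contains the separators `$`, `<-` or `:`.

```python
def parse_paths(text):
    """Split tree_struct.txt content into one list of nodes per path."""
    return [entry.split("<-") for entry in text.split("$") if entry]

def parse_set2word(text):
    """Turn set2WordV.txt content into a {set id: representative word} dict."""
    return dict(entry.split(":", 1) for entry in text.split("$") if entry)

# In-memory samples in the same layout the cells above write to disk.
paths = parse_paths("स्वरभंग.n.01<-17449<-1423<-652$root")
set2word = parse_set2word("12911:झौंस.n.01$8884:झौंस.n.02$root:root")
print(paths)
print(set2word)
```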