# Preparation of data for Russian nballs construction

To prepare the data for constructing Russian nballs we need to:
* convert the tsv file with the Russian word2vec vectors to txt form
* create the catcode file
* create the word-sense-children file

## Pre-processing of word2vec file

[Pre-trained Russian word2vec](https://github.com/Kyubyong/wordvectors) vectors were used as features. Download the zip archive with the Russian word2vec model [here](https://drive.google.com/file/d/0B0ZXk88koS2KMUJxZ0w0WjRGdnc/view) manually, or with the following commands:

In [128]:
%%script bash
export filename=ru.zip
export fileid=0B0ZXk88koS2KMUJxZ0w0WjRGdnc
wget -q --save-cookies cookies.txt 'https://docs.google.com/uc?export=download&id='$fileid -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt
wget -q --load-cookies cookies.txt -O $filename 'https://docs.google.com/uc?export=download&id='$fileid'&confirm='$(<confirm.txt)
echo -e "File ru.zip is saved to the current directory"

File ru.zip is saved to the current directory


Once ru.zip is downloaded, extract it and make sure that ru.tsv is in the current directory. This is the original feature file, which will be converted to txt form.

In [132]:
! unzip ru.zip

Archive:  ru.zip
  inflating: ru.bin                  
  inflating: ru.tsv                  
  inflating: ru.bin.syn1neg.npy      
  inflating: ru.bin.syn0.npy         


The following step converts the tsv file to txt form using this [python program](https://github.com/valerie94/russian_nballs/blob/master/format_w2v_file.py).

In [144]:
! python russian_nballs/format_w2v_file.py
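The cells further down read the converted file (`ru_w2v.txt`) line by line, split each line on single spaces and take the first token as the word. A minimal sketch of a sanity check for that layout is given here. `parse_w2v_line` is a hypothetical helper, not part of `format_w2v_file.py`, and the sample line is made up for illustration.

```python
def parse_w2v_line(line):
    """Split one line of the converted file into (word, vector).

    Assumes the layout the later cells rely on: the word comes first,
    followed by its vector components, all separated by single spaces.
    """
    parts = line.rstrip("\n").split(" ")
    word = parts[0]
    # Skip empty tokens that stray double or trailing spaces would produce.
    vector = [float(x) for x in parts[1:] if x]
    return word, vector


# Made-up sample line, only to show the expected shape.
word, vector = parse_w2v_line("математика 0.12 -0.5 1.0\n")
print(word, len(vector))
```

Running such a check on the first few lines of the real file is a cheap way to confirm that the conversion produced what the following cells expect.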

# Installation of Russian wordnet
To use the [Russian wordnet](https://wiki-ru-wordnet.readthedocs.io/en/latest/), install the package first (for example with `pip install wiki-ru-wordnet`) and then run the following lines:

In [133]:
from wiki_ru_wordnet import WikiWordnet
#the Russian wordnet package which is used
wikiwordnet = WikiWordnet()

Example of usage: find the parents and children of the word "математика" (mathematics).

In [145]:
synsets = wikiwordnet.get_synsets('математика') #get synsets for the word 'математика' (mathematics)
synset1 = synsets[0]  #a word can have several meanings; choose, for example, the first one
print("Parents")
for hypernym in wikiwordnet.get_hypernyms(synset1): 
    print({w.lemma() for w in hypernym.get_words()}) #print parents (hypernyms) of this word
print("Children")
for hyponym in wikiwordnet.get_hyponyms(synset1):
    print({w.lemma() for w in hyponym.get_words()}) #print children of this word

Parents
{'точная наука'}
Children
{'комбинаторная логика'}
{'логистика'}
{'комбинаторика'}
{'комбинаторная математика', 'комбинаторный анализ'}
{'арифметика'}
{'геометрия'}
{'алгебра'}


Parents: 'точная наука' (exact science)

Children: 'комбинаторная логика' (combinatorial logic), 'логистика' (logistics), 'комбинаторика' (combinatorics), 'комбинаторная математика' (combinatorial mathematics), 'комбинаторный анализ' (combinatorial analysis), 'арифметика' (arithmetic), 'геометрия' (geometry), 'алгебра' (algebra).

It should be noted that the wordnet contains errors. For example, the word 'шайба' (puck, the small disk used in ice hockey) has the parent 'снаряд', which has several meanings, including a piece of sports equipment and a projectile (a missile designed to be fired from a rocket or gun). The parent of 'снаряд' is in turn 'боеприпас' (ammunition), because the authors of the wordnet did not separate these word senses. By the logic of this wordnet, an ice hockey puck and other sports equipment such as a trampoline, a hoop and a tennis racquet are types of military ammunition, which is not true.

These errors are ignored, since they cannot be detected automatically and fixing them would require a native speaker or expert to re-process the entire wordnet.
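To inspect a suspicious chain such as the one described above by hand, one can walk upward through the hypernyms. The sketch below uses only the calls already shown in the usage example (`get_synsets`, `get_hypernyms`, `get_words`, `lemma`). `hypernym_chain` is a hypothetical helper, and always following the first synset and the first hypernym is a simplifying assumption.

```python
def hypernym_chain(wordnet, word, max_depth=10):
    """Follow the first hypernym of the first synset upward and collect lemmas.

    `wordnet` is expected to offer get_synsets and get_hypernyms as used above.
    Returns a list of lemma sets, starting with the synset of `word` itself.
    """
    synsets = wordnet.get_synsets(word)
    if not synsets:
        return []
    chain = []
    current = synsets[0]
    for _ in range(max_depth):  # the depth limit guards against cycles
        chain.append({w.lemma() for w in current.get_words()})
        hypernyms = list(wordnet.get_hypernyms(current))
        if not hypernyms:
            break
        current = hypernyms[0]  # assumption: only the first parent is followed
    return chain


# Usage with the object created above (output depends on the wordnet version):
# for level in hypernym_chain(wikiwordnet, 'шайба'):
#     print(level)
```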

# Creation of catcode file and word sense children

Not all words in the Russian wordnet have word2vec features and, vice versa, not all words that have word2vec features are in the wordnet.

Another problem is that some words have several meanings and therefore have several instances in the wordnet.

To keep track of the Russian words that are contained in both the wordnet and the word2vec model, and to assign them indices, we create the file idx.dat. Each entry contains the index of the word sense in the new database, the word, the number of the word sense (the position of its synset among the word's synsets) and its definition (in Russian).

In [146]:
idx = 2 #start with index 2 because index 1 is reserved for *root*
idx_dict = {} #key: index, value: definition of the word sense
word_dict = {} #key: word, value: all indices of its different meanings (word senses)
with open("ru_w2v.txt") as w2v_file, open("idx.dat", "w") as index_file:
    for line in w2v_file:
        word = line.split(" ")[0]
        synsets = wikiwordnet.get_synsets(word) #get different meanings of the word (empty if the word is not in wordnet)
        for num_of_sense, syn in enumerate(synsets): #loop through all word senses
            for w in syn.get_words():
                if w.lemma() == word:
                    index_file.write(str(idx) + " " + word + " " + str(num_of_sense) + " " + w.definition() + "\n") #create the entry in index file
                    word_dict.setdefault(word, []).append(idx)
                    idx_dict[idx] = w.definition()
                    idx += 1
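For orientation, each entry written to idx.dat by the cell above has the form `index word sense_number definition`. A minimal sketch of reading such an entry back is shown here. `parse_idx_line` is a hypothetical helper and the sample definition is made up.

```python
def parse_idx_line(line):
    """Parse 'index word sense_number definition' into its four parts.

    Returns None for lines that do not start with an integer index, e.g. the
    continuation of a definition that spans more than one line.
    """
    parts = line.rstrip("\n").split(" ", 3)  # the definition may contain spaces
    if len(parts) < 3 or not parts[0].isdigit() or not parts[2].isdigit():
        return None
    definition = parts[3] if len(parts) > 3 else ""
    return int(parts[0]), parts[1], int(parts[2]), definition


# Made-up entry, only to show the shape of a line.
print(parse_idx_line("2 алгебра 0 раздел математики\n"))
```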

Please note that some word senses are repeated several times in the wordnet (errors in the wordnet), so the same word sense with one definition can get several indices in the new database. These repetitions will be deleted later.

The following code creates two dictionaries: one whose key is the index of a word sense and whose value is the index of its parent, and one whose key is the index and whose value is the identifier word.n.number_of_word_sense. Note that a word can have several parents in the wordnet, but we keep only one parent per word sense (the last matching one found) in order to preserve the tree structure.

In [147]:
parent_dict = {} #key: index of a word sense, value: index of its parent
i2w = {} #key: index, value: word.n.number_of_word_sense
with open("idx.dat", 'r') as idx_file:
    for line in idx_file:
        l_array = line.split(" ")
        idx = l_array[0]
        if idx.isdigit(): #skip continuation lines: definitions of word senses sometimes span more than 1 line
            word = l_array[1] #the word itself
            syn_num = l_array[2] #number of word sense (number of corresponding synset)
            i2w[int(idx)] = word + ".n." + syn_num #word unique identifier word.n.word_sense_number
            synsets = wikiwordnet.get_synsets(word)
            syn = synsets[int(syn_num)]
            for hypernym in wikiwordnet.get_hypernyms(syn): #loop through parent synsets
                for w in hypernym.get_words(): #loop through parent words
                    parent_word = w.lemma()
                    d = w.definition() #get definition of the parent
                    if parent_word in word_dict:
                        for i in word_dict[parent_word]:
                            if idx_dict[i] == d: #get index of parent word by finding its definition, which is unique
                                parent_dict[int(idx)] = i #keep one parent (later matches overwrite earlier ones)

The next lines of code create a dictionary where the key is a word sense and the value is the list of its children. If the list is empty, the word sense has no children.

In [148]:
child_dict = {}
for i in idx_dict:
    word = i2w[i]
    child_dict[word] = [] #create array of children for each word_sense
for par in parent_dict:
    word_idx = par
    word = i2w[word_idx]
    parent_idx = parent_dict[par]
    parent_word = i2w[parent_idx] #get parent word
    if word not in child_dict[parent_word]:
        child_dict[parent_word].append(word) #write word to the children list of it's parent
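The step above is essentially an inversion of the child-to-parent mapping. A toy sketch of the same idea on word-sense identifiers is shown below. `invert_parents` is a hypothetical helper, and the sense number 0 in the identifiers is assumed for illustration. Unlike the cell above, the sketch does not create empty lists for word senses without children.

```python
from collections import defaultdict


def invert_parents(parent_of):
    """Turn a child -> parent mapping into a parent -> sorted children mapping."""
    children_of = defaultdict(list)
    for child, parent in parent_of.items():
        children_of[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children_of.items()}


# Toy data based on the children of 'математика' printed earlier.
toy_parent_of = {
    "алгебра.n.0": "математика.n.0",
    "геометрия.n.0": "математика.n.0",
    "арифметика.n.0": "математика.n.0",
}
print(invert_parents(toy_parent_of))
```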

The following helper returns the index of the parent of a word sense, or 1 (the root) if it has no parent. Applying it repeatedly yields the indices of all ancestors up to the root, which is what the catcode needs.

In [149]:
def add_parent(idx): #function which returns index of the parent
    if idx in parent_dict:
        return parent_dict[idx]
    else:
        return 1
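Because the catcode cell that follows keeps calling `add_parent` until it reaches the root, a cycle in `parent_dict` (conceivable given the wordnet errors discussed earlier) would keep that loop from terminating. The following is a minimal sketch of a check that could be run beforehand. `indices_leading_to_cycle` is a hypothetical helper and is not part of the original pipeline.

```python
def indices_leading_to_cycle(parent_of):
    """Return the set of indices whose parent chain never reaches the root.

    A chain terminates when it reaches an index with no entry in the mapping
    (add_parent treats such an index as a child of the root). A chain that
    revisits an index is cyclic.
    """
    bad = set()
    for start in parent_of:
        seen = set()
        node = start
        while node in parent_of:
            if node in seen:  # came back to an index already on this chain
                bad.add(start)
                break
            seen.add(node)
            node = parent_of[node]
    return bad


# Toy mapping: 2 and 3 point at each other, 4 hangs below them, 5 is fine.
print(indices_leading_to_cycle({2: 3, 3: 2, 4: 2, 5: 6}))
```

If the returned set is empty for the real `parent_dict`, the parent walk is guaranteed to stop.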

Finally, we can create catcode file:

In [150]:
catcode_file = open("catcode.dat", "w")
for i in idx_dict:
    word = i2w[i]
    catcode_file.write(word) #write word to file
    idx = i
    parents = []
    while idx != 1: #walk up through the parents until the root is reached
        p_idx = add_parent(idx)
        parents.append(p_idx)
        idx = p_idx
    parents = parents[::-1] #reverse the order so that it runs from the root towards the leaves
    length = len(parents)
    array = [0] * (17 - length)
    parents.extend(array) #pad with zeros to get the standard 17-length format
    for p in parents:
        catcode_file.write(" " + str(p))
    catcode_file.write("\n")
catcode_file.close()
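The same computation can be illustrated on a toy parent mapping. In the sketch below, `catcode_path` is a hypothetical helper that mirrors the loop above. One difference is an assumption about the desired behaviour: the cell above would write an unpadded line longer than 17 entries for a path deeper than 17 levels (a negative repeat count gives an empty list in Python), whereas the sketch raises an error.

```python
def catcode_path(idx, parent_of, width=17, root=1):
    """Return the ancestors of `idx` from the root downward, zero-padded to `width`."""
    path = []
    while idx != root:
        idx = parent_of.get(idx, root)  # indices without a parent hang under the root
        path.append(idx)
    path.reverse()  # root first
    if len(path) > width:
        raise ValueError("path of length %d does not fit into %d slots" % (len(path), width))
    return path + [0] * (width - len(path))


# Toy mapping: 4 -> 3 -> 2 -> root (1)
print(catcode_path(4, {3: 2, 4: 3}))
```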

We also create the file with the word sense children:

In [151]:
word_sense_children_file = open("children.dat", "w")
child_dict["*root*"] = [] #add root for the correct format
for i in idx_dict:
    if i not in parent_dict: #word has no parent
        if i2w[i] not in child_dict["*root*"]: #and not in the children of root yet
            child_dict["*root*"].append(i2w[i]) #add it to the children of root
word_sense_children_file.write("*root*")
for c in child_dict["*root*"]:
    word_sense_children_file.write(" " + c)
word_sense_children_file.write("\n")
for i in idx_dict:
    word = i2w[i]
    word_sense_children_file.write(word)
    for c in child_dict[word]:
        word_sense_children_file.write(" " + c)
    word_sense_children_file.write("\n") #write children of word in one line
word_sense_children_file.close()
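Since the result is meant to be a tree, every word sense should be listed as a child at most once. A minimal sketch of reading the file format written above and checking this is given below. `parse_children_lines` and `multiply_listed` are hypothetical helpers and the sample lines are made up.

```python
from collections import Counter


def parse_children_lines(lines):
    """Map each word sense to its list of children; the first token is the parent."""
    children_of = {}
    for line in lines:
        parts = line.split()
        if parts:
            children_of[parts[0]] = parts[1:]
    return children_of


def multiply_listed(children_of):
    """Return the word senses that occur in more than one children list, sorted."""
    counts = Counter(c for kids in children_of.values() for c in kids)
    return sorted(w for w, n in counts.items() if n > 1)


# Made-up lines in the format of children.dat; c.n.0 is listed under two parents.
sample = ["*root* a.n.0\n", "a.n.0 b.n.0 c.n.0\n", "b.n.0 c.n.0\n", "c.n.0\n"]
children_of = parse_children_lines(sample)
print(multiply_listed(children_of))
```

For the real file, the lines would come from `open("children.dat")` instead of the sample list.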

The following function solves the problem of the same word sense being repeated under several indices:

In [152]:
def delete_repititions(file_name): #in wordnet some words in a synset repeat, which causes duplicate lines in the file. This function writes a copy of the given file without the duplicate lines
    seen = set() #lines that have already been written
    with open(file_name, 'r') as source_file, open(file_name + "_no_duplicates", 'w') as new_file:
        for line in source_file:
            if line not in seen: #keep only the first occurrence, preserving the order
                seen.add(line)
                new_file.write(line)

In [153]:
delete_repititions("catcode.dat")
delete_repititions("children.dat")

Finally, we have all the files required to train Russian nballs: the catcode file (catcode.dat_no_duplicates), the word sense children file (children.dat_no_duplicates) and the word2vec file (ru_w2v.txt).