# Training Russian nballs
To train Russian nballs, the following sources were used:
* [Russian wordnet](https://wiki-ru-wordnet.readthedocs.io/en/latest/)
* [Pre-trained Russian word2vec](https://github.com/Kyubyong/wordvectors)

You can install the Russian wordnet with the following command:

In [1]:
! pip install wiki-ru-wordnet


The next step is to download the zip archive with the Russian word2vec model, either manually from [here](https://drive.google.com/file/d/0B0ZXk88koS2KMUJxZ0w0WjRGdnc/view) or with the following commands:

In [2]:
%%script bash
export filename=ru.zip
export fileid=0B0ZXk88koS2KMUJxZ0w0WjRGdnc
wget -q --save-cookies cookies.txt 'https://docs.google.com/uc?export=download&id='$fileid -O-      | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt 
wget -q --load-cookies cookies.txt -O $filename      'https://docs.google.com/uc?export=download&id='$fileid'&confirm='$(<confirm.txt)
echo -e "File ru.zip is saved to the current directory"

File ru.zip is saved to the current directory


Extract the archive to obtain the file ru.tsv, which contains word2vec features for 50101 Russian words.

In [3]:
! unzip ru.zip

Archive:  ru.zip
  inflating: ru.bin                  
  inflating: ru.tsv                  
  inflating: ru.bin.syn1neg.npy      
  inflating: ru.bin.syn0.npy         
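
If only ru.tsv is needed, the extraction step can also be done from Python. The following is a minimal sketch using the standard `zipfile` module; the helper name `extract_member` is hypothetical and not part of the russian_nballs project.

```python
import zipfile
from pathlib import Path


def extract_member(archive_path, member, dest_dir="."):
    """Extract a single member from a zip archive and return its path.

    Hypothetical helper for illustration; raises if the member is missing.
    """
    with zipfile.ZipFile(archive_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {archive_path}")
        # ZipFile.extract returns the path of the extracted file as a string.
        return Path(archive.extract(member, path=dest_dir))


# Example usage, assuming ru.zip is in the current directory:
# extract_member("ru.zip", "ru.tsv")
```

Unlike `unzip ru.zip`, this leaves the other archive members (ru.bin and the .npy files) unextracted.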


Then clone the project for training Russian nballs from the following [github repository](https://github.com/valerie94/russian_nballs).

In [4]:
! git clone https://github.com/valerie94/russian_nballs

Cloning into 'russian_nballs'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 22 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (22/22), done.
Checking connectivity... done.


The next step is to run the following Python script, which converts the tsv file with word2vec features to txt format:

In [5]:
! python russian_nballs/format_w2v_file.py

The file ru_w2v.txt should now exist. It contains the word2vec features in text format.
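
A quick sanity check of the converted file can catch formatting problems before the long training run. The sketch below assumes that each line of ru_w2v.txt holds a word followed by whitespace-separated floats; that layout is an assumption, not something confirmed by the project, and the function names are hypothetical.

```python
def load_text_vectors(path):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats.

    Assumes a whitespace-separated text layout (an assumption about ru_w2v.txt).
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def vector_lengths(vectors):
    """Return the set of vector lengths; a consistent file yields one value."""
    return {len(v) for v in vectors.values()}


# Example usage:
# vectors = load_text_vectors("ru_w2v.txt")
# print(len(vectors), vector_lengths(vectors))
```

If `vector_lengths` returns more than one value, some lines were probably split or truncated during conversion.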

To create the catcode file and the word sense children file, run the following Python script:

In [9]:
! python  russian_nballs/make_russian_dataset.py

The files catcode.dat_no_duplicates and children.dat_no_duplicates should now exist. They contain the catcodes and word sense children for a total of 15153 words (the intersection of the words in the wordnet and in the word2vec file).
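
The intersection mentioned above can be illustrated with plain Python sets. This is only a sketch of the idea with made-up toy vocabularies; it is not the code of make_russian_dataset.py, and `shared_vocabulary` is a hypothetical name.

```python
def shared_vocabulary(wordnet_words, w2v_words):
    """Return the sorted words that occur in both collections."""
    return sorted(set(wordnet_words) & set(w2v_words))


# Toy vocabularies for illustration only.
wordnet_words = ["кофе", "чай", "время"]
w2v_words = ["чай", "кофе", "футбол"]
common = shared_vocabulary(wordnet_words, w2v_words)
print(len(common), common)
```

Only words with both a wordnet entry and a pre-trained vector can receive a ball, which is why the dataset is smaller than either source.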

Train the Russian nballs using the process described [here](https://github.com/gnodisnait/nball4tree). First, clone the following github project:

In [7]:
! git clone https://github.com/gnodisnait/nball4tree

Cloning into 'nball4tree'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 165 (delta 17), reused 27 (delta 12), pack-reused 128
Receiving objects: 100% (165/165), 1.30 MiB | 658.00 KiB/s, done.
Resolving deltas: 100% (76/76), done.
Checking connectivity... done.


To train the nballs, change lines 418, 517 and 564 in main_training_process.py so that the first child is initialized with время.n.3 instead of entity.n.01. Alternatively, execute the following command, which replaces the file:
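
For readers who prefer to patch the file by hand, the manual edit can be sketched as a line-restricted string replacement. The sketch assumes that the literal `entity.n.01` appears verbatim on the listed lines, which has not been verified here; `replace_on_lines` is a hypothetical helper.

```python
def replace_on_lines(text, line_numbers, old, new):
    """Replace `old` with `new` only on the given 1-indexed lines of `text`.

    Raises ValueError if a target line does not contain `old`, so that a
    shifted line number is noticed instead of silently ignored.
    """
    targets = set(line_numbers)
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines, start=1):
        if index in targets:
            if old not in line:
                raise ValueError(f"line {index} does not contain {old!r}")
            lines[index - 1] = line.replace(old, new)
    return "".join(lines)


# Example usage (the path is an assumption about where the file lives):
# path = "nball4tree/nball4tree/main_training_process.py"
# with open(path, encoding="utf-8") as f:
#     patched = replace_on_lines(f.read(), [418, 517, 564],
#                                "entity.n.01", "время.n.3")
# with open(path, "w", encoding="utf-8") as f:
#     f.write(patched)
```

Restricting the replacement to explicit line numbers avoids touching other occurrences of the root sense elsewhere in the file.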

In [8]:
! mv russian_nballs/main_training_process.py nball4tree/nball4tree

Then train the Russian nballs on the constructed dataset with the following command:

In [None]:
! python nball4tree/nball.py --train_nball nball.txt --w2v ru_w2v.txt --ws_child children.dat_no_duplicates --ws_catcode catcode.dat_no_duplicates --log log.txt

Please note that the training process takes more than 6 hours. You can access the pre-trained model via the following [link](https://drive.google.com/uc?id=1gx2RSBOdf6BNZI5qOPBXz0rMIZyYbI0W&export=download) or by executing the following commands:

In [9]:
%%script bash
export filename=nballs2.tgz
export fileid=1gx2RSBOdf6BNZI5qOPBXz0rMIZyYbI0W
wget -q --save-cookies cookies.txt 'https://docs.google.com/uc?export=download&id='$fileid -O-      | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt
wget -q --load-cookies cookies.txt -O $filename      'https://docs.google.com/uc?export=download&id='$fileid'&confirm='$(<confirm.txt)
echo -e "File nballs2.tgz is saved to the current directory"

File nballs2.tgz is saved to the current directory


If you don't have time to train the nballs, you can use the pre-trained model for evaluation. Extract it from the tar file downloaded above:

In [34]:
! tar xvzf nballs2.tgz

nballs2.txt


Let's conduct several experiments to find the closest neighbors. The first test word is синий.n.0 (blue):

In [1]:
! python nball4tree/nball.py --neighbors синий.n.0 --ball nballs2.txt --num 6

loading balls....
15153  balls are loaded

{   'синий.n.0': [   'синий.n.1',
                     'оранжевый.n.0',
                     'оранжевый.n.1',
                     'жёлтый.n.5',
                     'жёлтый.n.4',
                     'жёлтый.n.1']}


The closest neighbors are: синий.n.1 (blue.n.1), оранжевый.n.0 (orange.n.0), оранжевый.n.1 (orange.n.1), жёлтый.n.5 (yellow.n.5), жёлтый.n.4 (yellow.n.4), жёлтый.n.1 (yellow.n.1). As we can see, all neighbors are colors as well.
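
To give an intuition for what a neighbor query does, here is a toy sketch that ranks balls by the Euclidean distance between their centers. This is an assumption made for illustration: it is not taken from nball.py, whose actual ranking may differ, and the ball names and coordinates below are invented.

```python
import math


def nearest_balls(balls, query, num):
    """Return the `num` ball names whose centers are closest to `query`.

    `balls` maps a name to a (center, radius) pair. The radius is ignored
    in this simplified ranking (an assumption, not nball.py's method).
    """
    query_center, _ = balls[query]
    distances = []
    for name, (center, _radius) in balls.items():
        if name == query:
            continue
        distances.append((math.dist(query_center, center), name))
    distances.sort()
    return [name for _, name in distances[:num]]


# Invented two-dimensional toy balls.
toy_balls = {
    "a": ([0.0, 0.0], 1.0),
    "b": ([1.0, 0.0], 0.5),
    "c": ([5.0, 5.0], 0.2),
    "d": ([0.0, 2.0], 0.1),
}
print(nearest_balls(toy_balls, "a", 2))
```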

The next test word is март.n.0 (March):

In [2]:
! python nball4tree/nball.py --neighbors март.n.0 --ball nballs2.txt --num 6

loading balls....
15153  balls are loaded

{   'март.n.0': [   'ноябрь.n.0',
                    'апрель.n.0',
                    'июнь.n.0',
                    'февраль.n.0',
                    'сентябрь.n.0',
                    'июль.n.0']}


The closest neighbors are: ноябрь.n.0 (November), апрель.n.0 (April), июнь.n.0 (June), февраль.n.0 (February), сентябрь.n.0 (September), июль.n.0 (July), i.e. other months.
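
The labels used in these queries follow a `lemma.pos.index` pattern, as in март.n.0. A small sketch for splitting such a label is shown below; `parse_sense` is a hypothetical helper, and the interpretation of the three parts is inferred from the examples in this document.

```python
def parse_sense(label):
    """Split a 'lemma.pos.index' label into (lemma, pos, index).

    Splitting from the right keeps lemmas that contain dots intact.
    """
    lemma, pos, index = label.rsplit(".", 2)
    return lemma, pos, int(index)


print(parse_sense("март.n.0"))
```

This makes it easy to group query results by lemma, for example to see how many distinct words appear among the neighbors.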

The next test word is кофе.n.0 (coffee):

In [3]:
! python nball4tree/nball.py  --neighbors кофе.n.0 --ball nballs2.txt --num 6

loading balls....
15153  balls are loaded

{   'кофе.n.0': [   'кофе.n.2',
                    'кофе.n.1',
                    'виски.n.0',
                    'чай.n.2',
                    'чай.n.6',
                    'чай.n.0']}


The closest neighbors are: кофе.n.2 (coffee.n.2), кофе.n.1 (coffee.n.1), виски.n.0 (whiskey.n.0), чай.n.2 (tea.n.2), чай.n.6 (tea.n.6), чай.n.0 (tea.n.0), all of which are drinks as well.

Another test word is футбол.n.0 (football.n.0):

In [4]:
! python nball4tree/nball.py  --neighbors футбол.n.0  --ball nballs2.txt --num 6

loading balls....
15153  balls are loaded

{   'футбол.n.0': [   'теннис.n.0',
                      'баскетбол.n.0',
                      'бокс.n.4',
                      'бокс.n.2',
                      'бокс.n.1',
                      'бокс.n.3']}


The closest neighbors are: теннис.n.0 (tennis.n.0), баскетбол.n.0 (basketball.n.0), бокс.n.4 (boxing.n.4), бокс.n.2 (boxing.n.2), бокс.n.1 (boxing.n.1), бокс.n.3 (boxing.n.3), all of which are sports as well.