# Exercise: Find correspondences between old and modern English

The purpose of this exercise is to use two vecsigrafos, one built on UMBC and WordNet and another produced by running Swivel directly against a corpus of Shakespeare's complete works, to find correlations between old and modern English, e.g. "thou" -> "you", "dost" -> "do", "raiment" -> "clothing". For example, you can pick a set of 100 words from the "ye olde" English corpus and see how they correlate to UMBC over WordNet.

![William Shakespeare](https://github.com/HybridNLP2018/tutorial/blob/master/images/220px-Shakespeare.jpg?raw=1)

Next, we prepare the embeddings from the Shakespeare corpus and load a UMBC vecsigrafo, which will provide the two vector spaces to correlate.

## Download a small text corpus

First, we download the corpus into our environment. We will use Shakespeare's complete works, published as part of Project Gutenberg and publicly available.

In [0]:
import os

In [0]:
%ls

5class/                           figures_5class.h5
5class.zip                        figures_5class_weights.h5
captions_5class_cross.h5          quality5class.h5
captions_5class_cross_weights.h5  qualityMix5class.h5
captions_5class.h5                qualityUni5class.h5
captions_5class_weights.h5        sample_data/
cross.h5                          title_abstract_5class.h5
cross_weights.h5                  title_abstract_5class_weights.h5
figures_5class_cross.h5           tutorial/
figures_5class_cross_weights.h5


In [0]:
#!rm -r tutorial
!git clone https://github.com/HybridNLP2018/tutorial

fatal: destination path 'tutorial' already exists and is not an empty directory.


Let us see if the corpus is where we think it is:

In [0]:
%cd tutorial/lit
%ls 

/content/tutorial/lit
coocs/  shakespeare_complete_works.txt  swivel/  wget-log


Next, we download Swivel:

In [0]:
!wget http://expertsystemlab.com/hybridNLP18/swivel.zip
!unzip swivel.zip
!rm swivel/*
!rm swivel.zip


Redirecting output to ‘wget-log.1’.
Archive:  swivel.zip
  inflating: swivel/analogy.cc       
  inflating: swivel/distributed.sh   
  inflating: swivel/eval.mk          
  inflating: swivel/fastprep.cc      
  inflating: swivel/fastprep.mk      
  inflating: swivel/glove_to_shards.py  
  inflating: swivel/nearest.py       
  inflating: swivel/prep.py          
  inflating: swivel/README.md        
  inflating: swivel/swivel.py        
  inflating: swivel/text2bin.py      
  inflating: swivel/vecs.py          
  inflating: swivel/wordsim.py       


## Learn the Swivel embeddings over the Old Shakespeare corpus

### Calculating the co-occurrence matrix

In [0]:
corpus_path = '/content/tutorial/lit/shakespeare_complete_works.txt'
coocs_path = '/content/tutorial/lit/coocs'
shard_size = 512
freq=3
!python /content/tutorial/scripts/swivel/prep.py --input={corpus_path} --output_dir={coocs_path} --shard_size={shard_size} --min_count={freq}

running with flags 
/content/tutorial/scripts/swivel/prep.py:
  --bufsz: The number of co-occurrences to buffer
    (default: '16777216')
    (an integer)
  --input: The input text.
    (default: '')
  --max_vocab: The maximum vocabulary size
    (default: '1048576')
    (an integer)
  --min_count: The minimum number of times a word should occur to be included in
    the vocabulary
    (default: '5')
    (an integer)
  --output_dir: Output directory for Swivel data
    (default: '/tmp/swivel_data')
  --shard_size: The size for each shard
    (default: '4096')
    (an integer)
  --vocab: Vocabulary to use instead of generating one
    (default: '')
  --window_size: The window size
    (default: '10')
    (an integer)

tensorflow.python.platform.app:
  -h,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')

absl.flags:
  --flagfile: Insert flag definitions from the given file in

In [0]:
%ls {coocs_path} | head -n 10

col_sums.txt
col_vocab.txt
row_sums.txt
row_vocab.txt
shard-000-000.pb
shard-000-001.pb
shard-000-002.pb
shard-000-003.pb
shard-000-004.pb
shard-000-005.pb
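
To make the idea behind the co-occurrence matrix concrete, here is a deliberately simplified sketch of windowed co-occurrence counting. It is not the logic of `prep.py` itself: the function name `cooccurrence_counts` is hypothetical, the counts are plain integers, and the sharding into `.pb` files is omitted. The `window_size` and `min_count` parameters only mirror the flags of the same name listed in the script's help output.

```python
from collections import Counter


def cooccurrence_counts(tokens, window_size=10, min_count=1):
    """Count symmetric co-occurrences of tokens within a sliding window.

    Hypothetical helper for illustration only; a simplification of what
    a real preprocessing script does.
    """
    # Keep only words that occur at least min_count times.
    freq = Counter(tokens)
    vocab = {w for w, c in freq.items() if c >= min_count}

    counts = Counter()
    for i, word in enumerate(tokens):
        if word not in vocab:
            continue
        # Look ahead only; each pair is recorded in both directions,
        # so the resulting matrix is symmetric.
        for j in range(i + 1, min(i + 1 + window_size, len(tokens))):
            other = tokens[j]
            if other not in vocab:
                continue
            counts[(word, other)] += 1
            counts[(other, word)] += 1
    return counts


tokens = "thou art more lovely and more temperate".split()
counts = cooccurrence_counts(tokens, window_size=2)
print(counts[("more", "lovely")])
```

Each `(row word, column word)` key corresponds to one cell of the matrix that the embedding step later factorizes.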


### Learning the embeddings from the matrix

In [0]:
vec_path = '/content/tutorial/lit/vec/'
!python /content/tutorial/scripts/swivel/swivel.py --input_base_path={coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=20 --dim=300 \
    --submatrix_rows={shard_size} --submatrix_cols={shard_size}

Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
'Tensor' object has no attribute 'to_proto'
2018-10-08 13:14:16.156023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-08 13:14:16.156566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-10-08 13:14:16.156611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-08 13:14:18.064223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with streng

Checking the contents of the 'vec' directory. It should contain checkpoints of the model plus tsv files for the column and row embeddings.

In [0]:
os.listdir(vec_path)


['model.ckpt-0.index',
 'model.ckpt-42320.index',
 'model.ckpt-42320.data-00000-of-00001',
 'model.ckpt-0.data-00000-of-00001',
 'row_embedding.tsv',
 'checkpoint',
 'col_embedding.tsv',
 'model.ckpt-42320.meta',
 'model.ckpt-0.meta',
 'graph.pbtxt',
 'events.out.tfevents.1539004459.46972dad0a54']

Converting tsv to bin:

In [0]:
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={vec_path}vocab.txt --output={vec_path}vecs.bin \
        {vec_path}row_embedding.tsv \
        {vec_path}col_embedding.tsv

executing text2bin
merging files ['/content/tutorial/lit/vec/row_embedding.tsv', '/content/tutorial/lit/vec/col_embedding.tsv'] into output bin


In [0]:
%ls {vec_path}

checkpoint
col_embedding.tsv
events.out.tfevents.1539004459.46972dad0a54
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-42320.data-00000-of-00001
model.ckpt-42320.index
model.ckpt-42320.meta
row_embedding.tsv
vecs.bin
vocab.txt


### Read stored binary embeddings and inspect them

In [0]:
import importlib.util

spec = importlib.util.spec_from_file_location("vecs", "/content/tutorial/scripts/swivel/vecs.py")
m = importlib.util.module_from_spec(spec)
spec.loader.exec_module(m)
shakespeare_vecs = m.Vecs(vec_path + 'vocab.txt', vec_path + 'vecs.bin')

Opening vector with expected size 23552 from file /content/tutorial/lit/vec/vocab.txt
vocab size 23552 (unique 23552)
read rows


## Basic method to print the k nearest neighbors for a given word

In [0]:
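def k_neighbors(vec, word, k=10):
    # Print the k nearest neighbors of `word` together with their similarity.
    res = vec.neighbors(word)
    if not res:
        # Suggest a random word from the same vocabulary as a fallback.
        print('%s is not in the vocabulary, try e.g. %s' % (word, vec.random_word_in_vocab()))
    else:
        for neighbor, sim in res[:k]:
            print('%0.4f: %s' % (sim, neighbor))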

In [0]:
k_neighbors(shakespeare_vecs, 'strife')

1.0000: strife
0.4599: tutors
0.3981: tumultuous
0.3530: future
0.3368: daughters’
0.3229: cease
0.3018: Nought
0.2866: strike.
0.2852: War
0.2775: nature.


In [0]:
k_neighbors(shakespeare_vecs,'youth')

1.0000: youth
0.3436: tall,
0.3350: vanity,
0.2945: idleness.
0.2929: womb;
0.2847: tall
0.2823: suffering
0.2742: stillness
0.2671: flow'ring
0.2671: observation
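
The scores in the neighbor lists above behave like cosine similarities (the query word scores 1.0000 against itself). As a rough illustration of that kind of ranking, and not the actual implementation in `vecs.py`, the following sketch computes the top-k cosine neighbors over a tiny hand-made matrix. The function name `cosine_neighbors` and the toy vocabulary and vectors are assumptions made up for this example.

```python
import numpy as np


def cosine_neighbors(matrix, vocab, word, k=10):
    """Return the k (word, similarity) pairs closest to `word`.

    Hypothetical helper: `matrix` holds one embedding per row and
    `vocab` lists the words in the same order.
    """
    idx = vocab.index(word)
    # Normalize every row to unit length so a dot product is a cosine.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    sims = unit @ unit[idx]
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in order]


# Toy data, invented for illustration only.
toy_vocab = ["strife", "war", "peace", "youth"]
toy_matrix = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.0, 1.0]])
result = cosine_neighbors(toy_matrix, toy_vocab, "strife", k=2)
for neighbor, sim in result:
    print('%0.4f: %s' % (sim, neighbor))
```

With real embeddings the matrix would have one row per vocabulary entry, which is why the query word itself always comes first in the list.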


## Load vecsigrafo from UMBC over WordNet

In [0]:
%ls

coocs/  shakespeare_complete_works.txt  swivel/  vec/  wget-log  wget-log.1


In [0]:
!wget https://zenodo.org/record/1446214/files/vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz
%ls


Redirecting output to ‘wget-log.2’.
coocs/
shakespeare_complete_works.txt
swivel/
vec/
vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz
wget-log
wget-log.1
wget-log.2


In [0]:
!tar -xvzf vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz
!rm vecsigrafo_umbc_tlgs_ls_f_6e_160d_row_embedding.tar.gz


vecsi_tlgs_wnscd_ls_f_6e_160d/row_embedding.tsv


In [0]:
umbc_wn_vec_path = '/content/tutorial/lit/vecsi_tlgs_wnscd_ls_f_6e_160d/'

Extracting the vocabulary from the .tsv file:

In [0]:
with open(umbc_wn_vec_path + 'vocab.txt', 'w', encoding='utf_8') as f:
  with open(umbc_wn_vec_path + 'row_embedding.tsv', 'r', encoding='utf_8') as vec_lines:
    vocab = [line.split('\t')[0].strip() for line in vec_lines]
    for word in vocab:
      print(word, file=f)

Converting tsv to bin:

In [0]:
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={umbc_wn_vec_path}vocab.txt --output={umbc_wn_vec_path}vecs.bin \
        {umbc_wn_vec_path}row_embedding.tsv 

executing text2bin
merging files ['/content/tutorial/lit/vecsi_tlgs_wnscd_ls_f_6e_160d/row_embedding.tsv'] into output bin


In [0]:
%ls

coocs/                          vec/                            wget-log.1
shakespeare_complete_works.txt  vecsi_tlgs_wnscd_ls_f_6e_160d/  wget-log.2
swivel/                         wget-log


In [0]:
umbc_wn_vecs = m.Vecs(umbc_wn_vec_path + 'vocab.txt', umbc_wn_vec_path + 'vecs.bin')

Opening vector with expected size 1499136 from file /content/tutorial/lit/vecsi_tlgs_wnscd_ls_f_6e_160d/vocab.txt
vocab size 1499136 (unique 1499125)
read rows


In [0]:
k_neighbors(umbc_wn_vecs, 'lem_California')

1.0000: lem_California
0.6301: lem_Central Valley
0.5959: lem_University of California
0.5542: lem_Southern California
0.5254: lem_Santa Cruz
0.5241: lem_Astro Aerospace
0.5168: lem_San Francisco Bay
0.5092: lem_San Diego County
0.5074: lem_Santa Barbara
0.5069: lem_Santa Rosa


# Add your solution to the proposed exercise here

Follow the instructions given in the previous lesson (*Vecsigrafos for curating and interlinking knowledge graphs*) to find correlations between terms in old English extracted from the Shakespeare corpus and terms in modern English extracted from UMBC. You will need to generate a dictionary relating pairs of lemmas between the two vocabularies and use it to produce a pair of translation matrices that transform vectors from one vector space to the other. Then apply the k_neighbors method to identify the correlations.
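
As a starting point, the sketch below shows one common way to obtain a translation matrix: an ordinary least-squares fit of a linear map between paired vectors. This is an assumption about the approach, and the previous lesson may use a different method. The helper name `learn_translation_matrix` is hypothetical, and the data is synthetic; in an actual solution the rows of `src` and `tgt` would be the Shakespeare and UMBC vectors of the word pairs in your dictionary (note that the UMBC vocabulary uses prefixed entries such as `lem_California`).

```python
import numpy as np


def learn_translation_matrix(source_vecs, target_vecs):
    """Fit W so that source_vecs @ W approximates target_vecs.

    Hypothetical helper: row i of each array holds the vector of the
    i-th dictionary pair in its own space.
    """
    W, _, _, _ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W


# Synthetic stand-in for dictionary pairs: 50 pairs, a 4-dimensional
# source space and a 3-dimensional target space.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))
true_W = rng.normal(size=(4, 3))
tgt = src @ true_W

W = learn_translation_matrix(src, tgt)
# Map one source vector into the target space.
translated = src[0] @ W
print(W.shape)
```

Swapping the two arguments gives the matrix for the opposite direction, which is how a pair of translation matrices can be obtained. With the embeddings in this notebook the source vectors would have 300 dimensions (the `--dim=300` setting above) and the target vectors 160 (per the `160d` in the vecsigrafo file name).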

# Conclusion

This notebook uses Shakespeare's complete works and UMBC to provide the student with embeddings that can be exploited for different operations between the two vector spaces. In particular, we propose identifying terms and their correlations across these spaces.

# Acknowledgements

In memory of Dr. Jack Brandabur, whose passion for Shakespeare and Cervantes inspired this notebook.