[View in Colaboratory](https://colab.research.google.com/github/iricelino/hybridNLP2018/blob/master/01_capturing_word_embeddings.ipynb)

## Download a small text corpus
First, let's download a corpus into our environment. We will use a small sample of the UMBC corpus that has been pre-tokenized and that we have included as part of our GitHub repository. To access it from this environment, we start by cloning the repo.

In [0]:
%ls

In [1]:
!git clone https://github.com/hybridNLP2018/tutorial.git

Cloning into 'tutorial'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 601 (delta 1), reused 6 (delta 1), pack-reused 592
Receiving objects: 100% (601/601), 58.56 MiB | 40.38 MiB/s, done.
Resolving deltas: 100% (338/338), done.


# Generate word embeddings using Swivel

## Overview

In this notebook we show how to generate word embeddings based on the UMBC corpus sample using the [Swivel algorithm](https://arxiv.org/pdf/1602.02215). In particular, we reuse the implementation included in the [TensorFlow models repo on GitHub](https://github.com/tensorflow/models/tree/master/research/swivel) (with some small modifications).

The dataset comes as a zip file, so we unzip it by executing the following cell. We also define a variable pointing to the corpus file:

In [2]:
!unzip /content/tutorial/datasamples/umbc_t_5K.zip -d /content/tutorial/datasamples/
input_corpus='/content/tutorial/datasamples/umbc_t_5K'

Archive:  /content/tutorial/datasamples/umbc_t_5K.zip
  inflating: /content/tutorial/datasamples/umbc_t_5K  


You can inspect the file using the `%less` command, which displays the whole input file at the bottom of the screen. It is quicker to print just the first line:

In [3]:
#%less {input_corpus}
!head -n1 {input_corpus}

the mayan image collection was contributed by oberlin college faculty and library staff . professor linda grimm , associate professor of anthropology and project coordinator at oberlin , explained the educational goals for this online project .


The output above shows that the input text has already been pre-processed:
 * All words have been converted to lowercase (this avoids having two separate words for *The* and *the*).
 * Punctuation marks have been separated from words. This avoids creating "words" such as "staff." or "grimm," in the example above.
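
The two pre-processing steps listed above can be illustrated with a minimal Python sketch. This is not the script that was used to prepare the UMBC sample; `preprocess` is a hypothetical helper, and its rule (every character that is neither a word character nor whitespace becomes its own token) is a simplifying assumption that also splits hyphens and apostrophes.

```python
import re

def preprocess(line):
    """Lowercase a line and separate punctuation marks from words."""
    line = line.lower()
    # Put spaces around every character that is neither a word character nor whitespace.
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(line.split())

print(preprocess("Professor Linda Grimm, of Oberlin, explained the goals."))
```

A real pipeline would typically use a proper tokenizer, but the effect on the vocabulary is the same: "Grimm," and "grimm" end up as the single token `grimm`.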

## `swivel`: an algorithm for learning word embeddings
Now that we have a corpus, we need an (implementation of an) algorithm for learning embeddings. There are various libraries and implementations for this:
  * [word2vec](https://pypi.org/project/word2vec/): the system proposed by Mikolov that introduced many of the techniques now commonly used for learning word embeddings. It directly generates word embeddings from the text corpus by using a sliding window and trying to predict a target word based on neighbouring context words.
  * [GloVe](https://github.com/stanfordnlp/GloVe): an alternative algorithm by Pennington, Socher and Manning. It splits the process into two steps:
    1. calculating a word-word co-occurrence matrix
    2. learning embeddings from this matrix
  * [FastText](https://fasttext.cc/): a more recent algorithm by Mikolov et al. (now at Facebook) that extends the original word2vec algorithm in various ways. Among other things, it takes subword information into account.
  
In this tutorial we will be using [Swivel](https://github.com/tensorflow/models/tree/master/research/swivel), an algorithm similar to GloVe, which makes it easier to extend to include both words and concepts (which we will do in [notebook 03 vecsigrafo](https://colab.research.google.com/github/HybridNLP2018/tutorial/blob/master/03_vecsigrafo.ipynb)). As with GloVe, Swivel first extracts a word-word co-occurrence matrix from a text corpus and then uses this matrix to learn the embeddings.

The official [Swivel](https://github.com/tensorflow/models/tree/master/research/swivel) implementation has a few issues when running on Colaboratory, hence we have included a slightly modified version as part of the HybridNLP2018 GitHub repository.

In [4]:
%ls /content/tutorial/scripts/swivel/

analogy.cc      fastprep.cc         __init__.py  README.md    vecs.py
distributed.sh  fastprep.mk         nearest.py   swivel.py    wordsim.py
eval.mk         glove_to_shards.py  prep.py      text2bin.py


## Learn embeddings

### Generate co-occurrence matrix using Swivel `prep`

Call Swivel's `prep` command to calculate the word co-occurrence matrix. We use the `!` shell escape to run the Python file as a program, which lets us pass parameters as if using a command-line terminal.

We set the `shard_size` to 512 since the corpus is quite small. For larger corpora we could use the default value of 4096.

In [5]:
coocs_path = '/content/umbc/coocs/t_5K/'
shard_size = 512
!python /content/tutorial/scripts/swivel/prep.py \
  --input={input_corpus} \
  --output_dir={coocs_path} \
  --shard_size={shard_size}

running with flags 
/content/tutorial/scripts/swivel/prep.py:
  --bufsz: The number of co-occurrences to buffer
    (default: '16777216')
    (an integer)
  --input: The input text.
    (default: '')
  --max_vocab: The maximum vocabulary size
    (default: '1048576')
    (an integer)
  --min_count: The minimum number of times a word should occur to be included in
    the vocabulary
    (default: '5')
    (an integer)
  --output_dir: Output directory for Swivel data
    (default: '/tmp/swivel_data')
  --shard_size: The size for each shard
    (default: '4096')
    (an integer)
  --vocab: Vocabulary to use instead of generating one
    (default: '')
  --window_size: The window size
    (default: '10')
    (an integer)

tensorflow.python.platform.app:
  -h,--[no]help: show this help
    (default: 'false')
  --[no]helpfull: show full help
    (default: 'false')
  --[no]helpshort: show this help
    (default: 'false')

absl.flags:
  --flagfile: Insert flag definitions from the given file in

The expected output is:

```
   ... tensorflow parameters ...
    vocabulary contains 5120 tokens

    writing shard 100/100
    done!
```

We see that the algorithm first determined the **vocabulary** $V$, that is, the list of words for which an embedding will be generated. Since the corpus is fairly small, so is the vocabulary, which consists of only 5120 words (large corpora can result in vocabularies with millions of words).

The co-occurrence matrix is a sparse matrix of $|V| \times |V|$ elements. Swivel uses shards to split it into sub-matrices of $s \times s$ elements, where $s$ is the `shard_size` specified above. In this case $|V| / s = 5120 / 512 = 10$, so we have $10 \times 10 = 100$ sub-matrices.

All this information is stored in the output folder we specified above. It consists of 100 files, one per shard/sub-matrix, and a few additional files:

In [6]:
%ls {coocs_path} | head -n 10

col_sums.txt
col_vocab.txt
row_sums.txt
row_vocab.txt
shard-000-000.pb
shard-000-001.pb
shard-000-002.pb
shard-000-003.pb
shard-000-004.pb
shard-000-005.pb
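
As a quick sanity check on the folder listed above, the sketch below counts the vocabulary entries and the shard files. `summarize_coocs` is a hypothetical helper, not part of Swivel, and it assumes that `row_vocab.txt` holds one token per line and that shard files follow the `shard-*.pb` naming visible in the listing.

```python
import glob
import os

def summarize_coocs(coocs_dir):
    """Return (vocabulary size, number of shard files) for a prep output folder."""
    vocab_file = os.path.join(coocs_dir, "row_vocab.txt")
    with open(vocab_file, encoding="utf-8") as f:
        # Assumption: one token per line; blank lines are ignored.
        vocab_size = sum(1 for line in f if line.strip())
    num_shards = len(glob.glob(os.path.join(coocs_dir, "shard-*.pb")))
    return vocab_size, num_shards

coocs_dir = "/content/umbc/coocs/t_5K/"
if os.path.isdir(coocs_dir):  # only runs where the prep output exists
    print(summarize_coocs(coocs_dir))
```

If the `prep` run matched the expected output shown earlier, the two numbers should be 5120 and 100.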



The `prep` step does the following:
  - it uses basic whitespace tokenization to get sequences of tokens
  - in a first pass through the corpus, it counts all tokens and keeps only those that occur at least `min_count` times (5 by default). It then trims this list so that its size is a multiple of the `shard_size`. The tokens that are kept form the **vocabulary** with size $v = |V|$.
  - in a second pass through the corpus, it uses a sliding window to count co-occurrences between the focus token and the context tokens (similar to `word2vec`). The result is a sparse co-occurrence matrix of size $v \times v$.
  - for easier storage and manipulation, Swivel uses *sharding* to split the co-occurrence matrix into sub-matrices of size $s \times s$, where $s$ is the `shard_size`.
  ![Swivel co-occurrence matrix sharding](https://github.com/hybridNLP2018/tutorial/blob/master/images/swivel-sharding.PNG?raw=1)
  - it stores the sharded co-occurrence sub-matrices as [protobuf files](https://developers.google.com/protocol-buffers/).
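
The first two passes described above can be sketched in plain Python. This is a simplified illustration rather than the code in `prep.py`: `build_vocab` and `count_cooccurrences` are hypothetical names, every co-occurrence is counted as 1 regardless of distance (the real implementation may weight pairs differently), and sharding and protobuf output are left out.

```python
from collections import Counter

def build_vocab(tokens, min_count=5, shard_size=512):
    """First pass: keep frequent tokens, trimmed to a multiple of shard_size."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically to keep the order deterministic.
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))
    size = (len(kept) // shard_size) * shard_size
    return kept[:size]

def count_cooccurrences(tokens, vocab, window_size=10):
    """Second pass: count symmetric co-occurrences within a sliding window."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(t) for t in tokens]  # None marks out-of-vocabulary tokens
    coocs = Counter()
    for pos, focus in enumerate(ids):
        if focus is None:
            continue
        # Look only to the right and record both directions of each pair.
        for ctx in ids[pos + 1: pos + 1 + window_size]:
            if ctx is None:
                continue
            coocs[(focus, ctx)] += 1
            coocs[(ctx, focus)] += 1
    return coocs

tokens = "the cat sat on the mat . the cat ate .".split()
vocab = build_vocab(tokens, min_count=2, shard_size=3)
coocs = count_cooccurrences(tokens, vocab, window_size=2)
print(vocab)
print(coocs[(vocab.index("the"), vocab.index("cat"))])
```

The `Counter` keyed by `(row, column)` plays the role of the sparse $v \times v$ matrix; only pairs that were actually observed take up memory.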

### Learn embeddings from co-occurrence matrix
With the sharded co-occurrence matrix it is now possible to learn embeddings:
 - the input is the folder with the co-occurrence matrix (protobuf files with the sparse matrix).
 - `--submatrix_rows` and `--submatrix_cols` need to be the same as the `shard_size` used in the `prep` step.
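
For some intuition about what training aims at: as we understand the Swivel paper, the dot product of a row embedding and a column embedding is fitted to the pointwise mutual information (PMI) of the corresponding word pair. The sketch below only computes PMI values from a tiny, made-up co-occurrence matrix with NumPy. It is not taken from `swivel.py`, and `pmi_matrix` is a hypothetical helper. Pairs that never co-occur have no finite PMI; the paper handles them separately, whereas this sketch simply marks them as missing.

```python
import numpy as np

def pmi_matrix(coocs):
    """PMI for every cell of a co-occurrence matrix; NaN where the count is zero."""
    coocs = np.asarray(coocs, dtype=float)
    total = coocs.sum()
    row_sums = coocs.sum(axis=1, keepdims=True)
    col_sums = coocs.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # log( p(i, j) / (p(i) * p(j)) ) written in terms of raw counts
        pmi = np.log(coocs * total / (row_sums * col_sums))
    pmi[coocs == 0] = np.nan
    return pmi

toy_counts = [[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]]
print(pmi_matrix(toy_counts))
```

A positive value means two words co-occur more often than their individual frequencies would suggest, which is the signal the embeddings are meant to capture.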

In [7]:
vec_path = '/content/umbc/vec/t_5K/'
!python /content/tutorial/scripts/swivel/swivel.py --input_base_path={coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=40 --dim=150 \
    --submatrix_rows={shard_size} --submatrix_cols={shard_size}

Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
'Tensor' object has no attribute 'to_proto'
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path /content/umbc/vec/t_5K/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
'Tensor' object has no attribute 'to_proto'
INFO:tensorflow:local_step=10 global_step=10 loss=56.9, 0.2% complete
INFO:tensorflow:local_step=20 global_step=20 loss=56.3, 0.5% complete
INFO:tensorflow:local_step=30 global_step=30 loss=53.6, 0.8% complete
INFO:tensorflow:local_step=40 global_step=40 loss=55.6, 1.0% complete
INFO:tensorflow:local_step=50 global_step=50 loss=43.9, 1.2% comp

This should take a few minutes, depending on your machine.
The result is a list of files in the specified output folder, including:
 - checkpoints of the model
 - `tsv` files for the column and row embeddings.

In [8]:
%ls {vec_path}

checkpoint
col_embedding.tsv
events.out.tfevents.1539016377.cc4a08418f7e
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-4000.data-00000-of-00001
model.ckpt-4000.index
model.ckpt-4000.meta
row_embedding.tsv
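
To inspect one of the `tsv` files from Python, something like the following sketch could be used. It assumes that each line holds a token followed by tab-separated float values; that layout is an assumption worth checking against your own `row_embedding.tsv`. `read_embedding_tsv` is a hypothetical helper, not part of Swivel.

```python
import numpy as np

def read_embedding_tsv(path):
    """Read 'token<TAB>v1<TAB>v2...' lines into a word list and a float32 matrix."""
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            words.append(fields[0])
            rows.append([float(x) for x in fields[1:]])
    return words, np.array(rows, dtype=np.float32)
```

For the run above, `read_embedding_tsv(vec_path + 'row_embedding.tsv')` would be expected to return 5120 words and a matrix with 150 columns, matching the vocabulary size and the `--dim` flag.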


One thing missing from the output folder is a file with just the vocabulary, which we'll need later on. We copy this file from the folder with the co-occurrence matrix.

In [0]:
%cp {coocs_path}row_vocab.txt {vec_path}vocab.txt

### Convert `tsv` files to `bin` file
The `tsv` files are easy to inspect, but they take up a lot of space and are slow to load, since we need to convert the individual values to floats and pack them as vectors. Swivel offers a utility to convert the `tsv` files into a binary (`bin`) format. At the same time it combines the column and row embeddings into a single space (it simply adds the two vectors for each word in the vocabulary).
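
The idea of adding the two vectors per word and storing the result compactly can be sketched with NumPy. This is only an illustration on random data: `combine_and_pack` is a hypothetical helper, and the raw 4-byte float layout is an assumption, not a description of what `text2bin.py` actually writes.

```python
import numpy as np

def combine_and_pack(row_vecs, col_vecs):
    """Add row and column embeddings and serialize them as raw float32 bytes."""
    combined = (np.asarray(row_vecs, dtype=np.float32)
                + np.asarray(col_vecs, dtype=np.float32))
    return combined, combined.tobytes()

# Random stand-ins with the shape used in this notebook: 5120 words, 150 dimensions.
rng = np.random.default_rng(0)
row_vecs = rng.normal(size=(5120, 150))
col_vecs = rng.normal(size=(5120, 150))
combined, payload = combine_and_pack(row_vecs, col_vecs)
print(combined.shape, len(payload))
```

A 5120 × 150 matrix of 4-byte floats takes 3,072,000 bytes, which is consistent with the roughly 3 MB size of `vecs.bin` in the folder listing, and much smaller than the two `tsv` files of more than 8 MB each.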

In [10]:
!python /content/tutorial/scripts/swivel/text2bin.py --vocab={vec_path}vocab.txt --output={vec_path}vecs.bin \
        {vec_path}row_embedding.tsv \
        {vec_path}col_embedding.tsv

executing text2bin
merging files ['/content/umbc/vec/t_5K/row_embedding.tsv', '/content/umbc/vec/t_5K/col_embedding.tsv'] into output bin


The folder with the vectors now also contains `vocab.txt` and `vecs.bin`:

In [11]:
%ls -lah {vec_path}

total 56M
drwxr-xr-x 2 root root 4.0K Oct  8 16:36 ./
drwxr-xr-x 3 root root 4.0K Oct  8 16:32 ../
-rw-r--r-- 1 root root  199 Oct  8 16:34 checkpoint
-rw-r--r-- 1 root root 8.5M Oct  8 16:34 col_embedding.tsv
-rw-r--r-- 1 root root 225K Oct  8 16:34 events.out.tfevents.1539016377.cc4a08418f7e
-rw-r--r-- 1 root root 291K Oct  8 16:32 graph.pbtxt
-rw-r--r-- 1 root root  18M Oct  8 16:32 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 root root  366 Oct  8 16:32 model.ckpt-0.index
-rw-r--r-- 1 root root 116K Oct  8 16:32 model.ckpt-0.meta
-rw-r--r-- 1 root root  18M Oct  8 16:34 model.ckpt-4000.data-00000-of-00001
-rw-r--r-- 1 root root  366 Oct  8 16:34 model.ckpt-4000.index
-rw-r--r-- 1 root root 116K Oct  8 16:34 model.ckpt-4000.meta
-rw-r--r-- 1 root root 8.6M Oct  8 16:34 row_embedding.tsv
-rw-r--r-- 1 root root 3.0M Oct  8 16:36 vecs.bin
-rw-r--r-- 1 root root  43K Oct  8 16:36 vocab.txt


## Read stored binary embeddings and inspect them

Swivel provides the `vecs` library which implements the basic `Vecs` class. It accepts a `vocab_file` and a file for the binary serialization of the vectors (`vecs.bin`).

In [0]:
from tutorial.scripts.swivel import vecs

Now we can load the vectors. We assume you managed to generate the embeddings by following the tutorial up to now. Note that, due to the random initialization of weights during the training step, your results may differ from the ones presented below.

In [13]:
# uncomment the following two lines if you did not manage to train embeddings above
#!tar -xzf /content/tutorial/datasamples/umbc_swivel_vec_t_5K.tar.gz -C /
#vec_path = '/content/umbc/vec/t_5K/'
vectors = vecs.Vecs(vec_path + 'vocab.txt', 
            vec_path + 'vecs.bin')

Opening vector with expected size 5120 from file /content/umbc/vec/t_5K/vocab.txt
vocab size 5120 (unique 5120)
read rows


We have extended the standard implementation of `swivel.vecs.Vecs` to include a method `k_neighbors`. It accepts a string with the word and an optional `k` parameter that defaults to 10. It returns a list of Python dictionaries with fields:
  * `word`: a word in the vocabulary that is near the input word
  * `cosim`: the cosine similarity between the input word and the near word.

It's easier to display the results as a `pandas` table:

In [14]:
import pandas as pd
pd.DataFrame(vectors.k_neighbors('california'))

Unnamed: 0,cosim,word
0,1.0,california
1,0.422428,santa
2,0.386643,university
3,0.376657,berkeley
4,0.371043,barbara
5,0.337771,"california,"
6,0.336956,southern
7,0.332485,melvyl
8,0.331769,state
9,0.301233,recommender
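
To make clear what a nearest-neighbour lookup by cosine similarity involves, here is a minimal NumPy sketch on a toy vocabulary. It is not the `k_neighbors` method from the tutorial's `vecs.py`; the function below is a stand-alone illustration that merely mimics the `word`/`cosim` output fields.

```python
import numpy as np

def k_neighbors(word, vocab, vectors, k=10):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    index = {w: i for i, w in enumerate(vocab)}
    if word not in index:
        raise KeyError(f"{word!r} is not in the vocabulary")
    # Normalize rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[index[word]]
    best = np.argsort(-sims)[:k]
    return [{"word": vocab[i], "cosim": float(sims[i])} for i in best]

toy_vocab = ["california", "berkeley", "conference"]
toy_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
print(k_neighbors("california", toy_vocab, toy_vectors, k=2))
```

As in the table above, the query word itself comes first with a similarity of 1.0, since every vector is maximally similar to itself.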


In [15]:
pd.DataFrame(vectors.k_neighbors('knowledge'))

Unnamed: 0,cosim,word
0,1.0,knowledge
1,0.395314,organization
2,0.385889,workers
3,0.355162,warehousing
4,0.336204,skills
5,0.323475,procedural
6,0.315733,expertise
7,0.315241,management
8,0.30093,exchange
9,0.298531,exploiting


In [16]:
pd.DataFrame(vectors.k_neighbors('semantic'))

Unnamed: 0,cosim,word
0,1.0,semantic
1,0.402217,heterogeneity
2,0.340584,similarities
3,0.337173,employs
4,0.336691,spaces
5,0.321805,common
6,0.321429,expressing
7,0.314453,deciding
8,0.312328,procedural
9,0.307621,relationships


In [17]:
pd.DataFrame(vectors.k_neighbors('conference'))

Unnamed: 0,cosim,word
0,1.0,conference
1,0.442867,secretariat
2,0.433434,annual
3,0.398876,workshops
4,0.380836,2005
5,0.364306,international
6,0.354203,jcdl
7,0.333424,fee
8,0.324881,presentations
9,0.312933,amia


The cells above should display results similar to the following (for the words *california* and *conference*):

| cosim | word | cosim | word |
| ------ | ---------- | ------ | ------------- |
| 1.000 | california | 1.0000 | conference |
| 0.5060 | university | 0.4320 | international |
| 0.4239 | berkeley | 0.4063 | secretariat |
| 0.4103 | barbara | 0.3857 | jcdl |
| 0.3941 | santa | 0.3798 | annual |
| 0.3899 | southern | 0.3708 | conferences |
| 0.3673 | uc | 0.3705 | forum |
| 0.3542 | johns | 0.3629 | presentations |
| 0.3396 | indiana | 0.3601 | workshop |
| 0.3388 | melvy | 0.3580 | ... |



### Compound words

Note that the vocabulary only has single words, i.e. compound words are not present:

In [18]:
pd.DataFrame(vectors.k_neighbors('semantic web'))

"semantic web" is not in vocab, try "alternative"
semantic web is not in the vocabulary, try e.g. licenses


A common way to work around this issue is to use the average vector of the two individual words (of course this only works if both words are in the vocabulary):

In [19]:
semantic_vec = vectors.lookup('semantic')
web_vec = vectors.lookup('web')
semweb_vec = (semantic_vec + web_vec)/2
pd.DataFrame(vectors.k_neighbors(semweb_vec))

Unnamed: 0,cosim,word
0,0.554806,web
1,0.554806,semantic
2,0.294989,employs
3,0.263869,sites
4,0.258899,crawling
5,0.250607,crawler
6,0.239681,browser
7,0.234558,characterizing
8,0.234176,common
9,0.222663,remarkable
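
In the output above, *web* and *semantic* have exactly the same similarity to the averaged vector. That is what one would expect whenever the two word vectors have the same length, for instance if the stored vectors are normalized to unit length (an assumption here, not something verified against `vecs.py`). The sketch below checks this on random unit vectors and also shows that averaging and summing give the same cosine similarities, because cosine similarity ignores vector length.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
a = rng.normal(size=150)
a /= np.linalg.norm(a)  # unit length
b = rng.normal(size=150)
b /= np.linalg.norm(b)  # unit length

avg = (a + b) / 2
# The two values printed on this line are equal for equal-length inputs.
print(cosine(avg, a), cosine(avg, b))
# Averaging vs. summing: the difference is zero up to floating-point error.
print(cosine(avg, a) - cosine(a + b, a))
```

If the two input vectors had different lengths, the longer one would dominate the average and end up closer to it.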


## Conclusion

In this notebook, we used Swivel to generate word embeddings and explored the resulting embeddings using the `k_neighbors` method.

# Optional Exercise

## Create word embeddings for texts from Project Gutenberg

### Download and pre-process the corpus

You can try generating new embeddings using the small `gutenberg` corpus that is provided as part of the NLTK library. It consists of a few public-domain works published as part of Project Gutenberg.

First, we download the dataset into our environment:

In [20]:
import os
import nltk
nltk.download('gutenberg')
%ls '/root/nltk_data/corpora/gutenberg/'

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
austen-emma.txt          carroll-alice.txt        README
austen-persuasion.txt    chesterton-ball.txt      shakespeare-caesar.txt
austen-sense.txt         chesterton-brown.txt     shakespeare-hamlet.txt
bible-kjv.txt            chesterton-thursday.txt  shakespeare-macbeth.txt
blake-poems.txt          edgeworth-parents.txt    whitman-leaves.txt
bryant-stories.txt       melville-moby_dick.txt
burgess-busterbrown.txt  milton-paradise.txt


As you can see, the corpus consists of various books, one per file. Most word2vec implementations require you to pass a corpus as a single text file. We can issue a few commands to do this by concatenating all the `txt` files in the folder into a single `all.txt` file, which we will use later on.

A few of the files are encoded using iso-8859-1 or binary encodings, which will cause trouble later on, so we rename them to avoid including them in our corpus.

In [21]:
%cd /root/nltk_data/corpora/gutenberg/
# avoid including books with incorrect encoding
!mv chesterton-ball.txt chesterton-ball.badenc-txt
!mv milton-paradise.txt milton-paradise.badenc-txt
!mv shakespeare-caesar.txt shakespeare-caesar.badenc-txt
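# now concatenate all other files into 'all.txt'
# (remove any previous 'all.txt' first so re-running the cell does not duplicate text)
!rm -f all.txt
!cat *.txt > all.txt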
# print result
%ls -lah '/root/nltk_data/corpora/gutenberg/all.txt'
# go back to standard folder 
%cd /content/

/root/nltk_data/corpora/gutenberg
-rw-r--r-- 1 root root 11M Oct  8 16:39 /root/nltk_data/corpora/gutenberg/all.txt
/content


The full dataset is about 11MB.
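
As an alternative to renaming files by hand, the concatenation could also be done in Python, skipping any file that does not decode as UTF-8. This is only a sketch: `concatenate_utf8_files` is a hypothetical helper, and which files it would skip on the real corpus has not been verified here.

```python
import glob
import os

def concatenate_utf8_files(folder, output_name="all.txt"):
    """Concatenate all UTF-8 decodable .txt files in `folder`; return skipped names."""
    output_path = os.path.join(folder, output_name)
    texts, skipped = [], []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        if os.path.abspath(path) == os.path.abspath(output_path):
            continue  # never feed a previous output file back into itself
        try:
            with open(path, encoding="utf-8") as f:
                texts.append(f.read())
        except UnicodeDecodeError:
            skipped.append(os.path.basename(path))
    with open(output_path, "w", encoding="utf-8") as out:
        out.write("\n".join(texts))
    return skipped

corpus_dir = "/root/nltk_data/corpora/gutenberg/"
if os.path.isdir(corpus_dir):  # only runs where the corpus has been downloaded
    print(concatenate_utf8_files(corpus_dir))
```

Because the output file is excluded from the inputs and rewritten from scratch, running the function twice produces the same `all.txt`.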

### Learn embeddings

Run the steps described above to generate embeddings for the gutenberg dataset.
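
As a starting point, the sketch below assembles the three command lines used earlier in this notebook for the new corpus. The flags are the ones shown above; the output folders under `/content/gutenberg/` are hypothetical names chosen for this example, and the shard size, dimension and number of epochs are simply reused from the UMBC run, so they may need tuning. `swivel_commands` itself is a hypothetical helper.

```python
SCRIPTS = "/content/tutorial/scripts/swivel/"

def swivel_commands(corpus, coocs_path, vec_path, shard_size=512, dim=150, num_epochs=40):
    """Build the prep, training and text2bin command lines as argument lists."""
    prep = ["python", SCRIPTS + "prep.py",
            f"--input={corpus}",
            f"--output_dir={coocs_path}",
            f"--shard_size={shard_size}"]
    train = ["python", SCRIPTS + "swivel.py",
             f"--input_base_path={coocs_path}",
             f"--output_base_path={vec_path}",
             f"--num_epochs={num_epochs}", f"--dim={dim}",
             f"--submatrix_rows={shard_size}", f"--submatrix_cols={shard_size}"]
    to_bin = ["python", SCRIPTS + "text2bin.py",
              f"--vocab={vec_path}vocab.txt", f"--output={vec_path}vecs.bin",
              f"{vec_path}row_embedding.tsv", f"{vec_path}col_embedding.tsv"]
    return [prep, train, to_bin]

commands = swivel_commands(
    corpus="/root/nltk_data/corpora/gutenberg/all.txt",
    coocs_path="/content/gutenberg/coocs/",  # hypothetical output folder
    vec_path="/content/gutenberg/vec/",      # hypothetical output folder
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, or the printed lines can be pasted into cells prefixed with `!`. Remember to copy `row_vocab.txt` to `vocab.txt` in the vector folder before the `text2bin.py` step, as was done above. Unlike the UMBC sample, the NLTK files are raw text, so you may want to lowercase and tokenize them first.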

### Inspect embeddings
Use methods similar to the ones shown above to get a feeling for whether the generated embeddings have captured interesting relations between words.