## Compressing Word Embeddings

Provide a downloadable version of the GloVe embedding (with a fallback source).

It is probably best to include installation instructions for the Levy test suite, so that any given embedding can be tested.

Two main sections are then required:
 
*  Lloyd embedding generation

*  Sparsified embedding generation

Include a downloadable version of the sparsified GloVe embedding, served from our own hosting.

Also include functions/tools for playing with the loaded embedding (of whatever type).
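As a rough idea of what such tools might look like, here is a minimal pure-Python sketch that parses GloVe-style text lines (`word v1 v2 ...`) and ranks neighbours by cosine similarity. The names `parse_embedding_lines`, `cosine` and `nearest` are hypothetical, and the toy vectors are made up for illustration; a real loader would use numpy and would need to skip any `count dim` header line.

```python
import math

def parse_embedding_lines(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into {word: [floats]}."""
    embedding = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embedding[parts[0]] = [float(x) for x in parts[1:]]
    return embedding

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(embedding, word, k=3):
    """Return the k words most similar to `word`, most similar first."""
    target = embedding[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embedding.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Toy data (invented for this example, not real GloVe values)
toy = parse_embedding_lines([
    "king 0.9 0.1 0.0",
    "queen 0.8 0.2 0.1",
    "apple 0.0 0.1 0.9",
])
print(nearest(toy, "king", k=2))
```

With a real file, the same parser could be fed `open(path)` directly, since it only iterates over lines.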

### Download the Omer-Levy Test Regime

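The test suite is the `hyperwords` code that accompanies Levy, Goldberg and Dagan, "Improving Distributional Similarity with Lessons Learned from Word Embeddings" (TACL 2015): https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf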

```
wget https://bitbucket.org/omerlevy/hyperwords/get/688addd64ca2.zip
unzip 688addd64ca2.zip
rm 688addd64ca2.zip

mv omerlevy-hyperwords-688addd64ca2 omerlevy

chmod 755 omerlevy/*.sh omerlevy/scripts/*.sh
```

### Function to test a (text) Embedding

Based on this script:
```
more omerlevy/test-vectors.sh 
#!/bin/sh

# ./test-vectors.sh /home/andrewsm/sketchpad/redcatlabs/embeddings/data/1-glove-1-billion-and-wiki/window11-lc-36/vectors.txt 

# arg1 == filepath of word-vectors file
VECTORS_FILE=$1
  
# Fix up the 'file header' of a 'glove' vectors file into the one expected here
VECTORS_WORDS=${VECTORS_FILE}.words

if [ ! -f ${VECTORS_WORDS} ]; then 
  echo "Creating ${VECTORS_WORDS}"
  #echo "262144 300" > ${VECTORS_WORDS}
  #head -262144 ${VECTORS_FILE} >> ${VECTORS_WORDS}

  ## Glove min-freq : 36 -> 263633 words (just above 2^18=262144 words)
  echo "131072 300" > ${VECTORS_WORDS}
  head -131072 ${VECTORS_FILE} >> ${VECTORS_WORDS}
fi

VECTORS_NPY=${VECTORS_WORDS}.npy


#word2vecf/word2vecf -train w2.sub/pairs -pow 0.75 -cvocab w2.sub/counts.contexts.vocab -wvocab w2.sub/counts.words.vocab -dumpcv w2.sub/sgns.contexts -output w2.sub/sgns.words -threads 10 -negative 15 -size 500;

python hyperwords/text2numpy.py ${VECTORS_WORDS}

# No need for this temporary file now
##rm ${VECTORS_WORDS}


#python hyperwords/text2numpy.py w2.sub/sgns.contexts
#rm w2.sub/sgns.contexts


echo
echo "Similarity"
echo "----------"
# Evaluate on Word Similarity
#python hyperwords/ws_eval.py --neg 5 PPMI  w2.sub/pmi testsets/ws/ws353.txt
#python hyperwords/ws_eval.py --eig 0.5 SVD w2.sub/svd testsets/ws/ws353.txt
#python hyperwords/ws_eval.py --w+c SGNS    w2.sub/sgns testsets/ws/ws353.txt

#echo -n "WS353 Results     "
#python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/ws353.txt

echo -n "WS353 Similarity  "
python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/ws353_similarity.txt

echo -n "WS353 Relatedness "
python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/ws353_relatedness.txt

echo -n "Bruni MEN         "
python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/bruni_men.txt

echo -n "Radinsky M.Turk   "
python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/radinsky_mturk.txt

echo -n "Luong Rare Words  "
python hyperwords/ws_eval.py VECTORS ${VECTORS_FILE} testsets/ws/luong_rare.txt

echo
echo "Geometry"
echo "--------"
# Evaluate on Analogies
#python hyperwords/analogy_eval.py PPMI        w2.sub/pmi testsets/analogy/google.txt
#python hyperwords/analogy_eval.py --eig 0 SVD w2.sub/svd testsets/analogy/google.txt
#python hyperwords/analogy_eval.py SGNS        w2.sub/sgns testsets/analogy/google.txt

echo -n "Google Analogy Results  "
python hyperwords/analogy_eval.py VECTORS ${VECTORS_FILE} testsets/analogy/google.txt

echo -n "MSR Analogy Results     "
python hyperwords/analogy_eval.py VECTORS ${VECTORS_FILE} testsets/analogy/msr.txt

echo
```
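Each word-similarity test set pairs two words with a human-assigned score. Benchmarks of this kind are usually scored by the Spearman rank correlation between the human scores and the model's cosine similarities, and the sketch below assumes that is the number `ws_eval.py` prints. It is a pure-Python illustration with made-up scores, not the hyperwords implementation; `ranks`, `pearson` and `spearman` are hypothetical helper names.

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up scores for four word pairs: human judgements vs. model cosine similarities
human = [9.0, 7.5, 3.0, 1.0]
model = [0.80, 0.85, 0.30, 0.10]
print(spearman(human, model))
```

Because only the ordering matters, a compressed embedding can change every cosine value and still score the same, as long as it ranks the word pairs in the same order.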

```
import os, subprocess

def test_embedding_file(vectors_txt, vocab_max=131072):
    # Do we need to process VECTORS_FILE->{ VECTORS_WORDS, VECTORS_NPY }?
    # Answer = YES : the .words is required, and is used to create .npy and .vocab

    vectors_txt_words = '%s.words' % (vectors_txt,)
    if not os.path.isfile(vectors_txt_words):
        # This is just a copy of the 'text file' with the vocab_size and embedding_size prepended
        #echo "131072 300" > ${VECTORS_WORDS}
        #head -131072 ${VECTORS_FILE} >> ${VECTORS_WORDS}
        with open(vectors_txt) as fin:
            first_line = fin.readline()
            embedding_dim = len(first_line.strip().split()) - 1
            # Count the remaining lines without holding the whole file in memory
            vocab_size = sum(1 for _ in fin) + 1

        if vocab_size > vocab_max:
            vocab_size = vocab_max

        with open(vectors_txt) as fin:
            with open(vectors_txt_words, 'wt') as fout:
                # Write the first line, which, ironically, will be discarded by the omerlevy code
                fout.write("%d %d\n" % (vocab_size, embedding_dim))

                # And copy over at most vocab_size lines of the original file
                for i, line in enumerate(fin):
                    if i >= vocab_size:
                        break
                    fout.write(line)

    vectors_txt_npy = '%s.npy' % (vectors_txt_words,)
    if not os.path.isfile(vectors_txt_npy):
        # Sadly, we can't just invoke this as a python function - need to go via shell...
        # (the relative script path assumes the working directory is the omerlevy checkout)
        subprocess.call(["python", "hyperwords/text2numpy.py", vectors_txt_words])
```
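The header-prepending step can be checked on its own with a tiny file. The sketch below is one way to do that, assuming a non-empty input file with one `word v1 v2 ...` entry per line; `write_words_file` is a hypothetical helper name, and the three-dimensional vectors are invented for the example. It uses `itertools.islice` so that counting stops as soon as `vocab_max` lines have been seen.

```python
import itertools
import os
import tempfile

def write_words_file(vectors_txt, words_path, vocab_max=131072):
    """Copy at most vocab_max vectors to words_path, prefixed by a 'count dim' header."""
    with open(vectors_txt) as fin:
        first_line = fin.readline()
        embedding_dim = len(first_line.split()) - 1
        # Count lazily, stopping once vocab_max lines (including the first) are reached
        vocab_size = 1 + sum(1 for _ in itertools.islice(fin, vocab_max - 1))

    with open(vectors_txt) as fin, open(words_path, 'wt') as fout:
        fout.write("%d %d\n" % (vocab_size, embedding_dim))
        fout.writelines(itertools.islice(fin, vocab_size))
    return vocab_size, embedding_dim

# Try it on a 5-word toy file, truncating to the first 3 words
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "vectors.txt")
dst = src + ".words"
with open(src, "wt") as f:
    for word in ["the", "of", "and", "to", "in"]:
        f.write("%s 0.1 0.2 0.3\n" % word)

size, dim = write_words_file(src, dst, vocab_max=3)
with open(dst) as f:
    out_lines = f.readlines()
print(size, dim, len(out_lines))
```

The output file should hold the header plus exactly `vocab_size` vector lines, which is the property that the `i >= vocab_size` comparison in the copy loop is meant to guarantee.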
    