# Loading word vectors from text files


| Author | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-27 |

This notebook illustrates how to use `gensim` to load pre-trained word vectors.

## Intro

Sometimes you will find a resource online that provides word vectors in the form of a text file.
For example, the German organization *Deepset* provides download links to word2vec embeddings they have trained on German-language Wikipedia entries.

In cases like this one, we need to read the word embeddings from a text file and convert them to a `gensim` `KeyedVectors` object.

## Explanation

The absolute minimum information you need to use pre-trained word embeddings is:

1. the vocabulary
2. the word vectors
3. which vector belongs to which word in the vocabulary

There are three common ways to share this information with others:

1. Just save the `KeyedVectors` object as a serialized (binary) file. The file can then be loaded in Python with `KeyedVectors.load(file)`.
2. Save the `KeyedVectors` in the standard word2vec format:
    - The first line records the vocabulary size $v$ and the number of embedding dimensions $d$ (two numbers separated by white space).
    - Each following line records first the word and then its $d$ embedding values, all separated by white space.
3. Save the vocabulary and the word vectors in different files. In this case, we assume that the $i$-th word vector corresponds to the $i$-th word in the vocabulary file.
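
To make the word2vec text format of Option 2 concrete, here is a minimal sketch that uses only Python's standard library. The helper names `write_word2vec_text` and `read_word2vec_text` and the toy words are made up for illustration; in practice you would let `gensim` read and write such files.

```python
import os
import tempfile

def write_word2vec_text(path, embeddings):
    """Write a {word: list of floats} dict in the word2vec text format."""
    dims = {len(vec) for vec in embeddings.values()}
    assert len(dims) == 1, "all vectors must have the same length"
    d = dims.pop()
    with open(path, 'w', encoding='utf-8') as f:
        # header line: vocabulary size v and number of dimensions d
        f.write(f"{len(embeddings)} {d}\n")
        # one line per word: the word, then its d values
        for word, vec in embeddings.items():
            f.write(word + ' ' + ' '.join(str(x) for x in vec) + '\n')

def read_word2vec_text(path):
    """Read a word2vec text file back into a {word: list of floats} dict."""
    with open(path, 'r', encoding='utf-8') as f:
        v, d = (int(x) for x in f.readline().split())
        embeddings = {}
        for line in f:
            entries = line.rstrip().split(' ')
            embeddings[entries[0]] = [float(x) for x in entries[1:]]
    # the header must agree with what we actually read
    assert len(embeddings) == v
    assert all(len(vec) == d for vec in embeddings.values())
    return embeddings

# toy data (hypothetical values, only for illustration)
toy = {'haus': [0.1, 0.2, 0.3], 'maus': [0.4, 0.5, 0.6]}
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_vectors.txt')
write_word2vec_text(toy_path, toy)
restored = read_word2vec_text(toy_path)
```

The header check at the end of `read_word2vec_text` is what distinguishes this format from a plain "word plus numbers" text file without a first line recording $v$ and $d$.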

Regarding file formats, people use a number of different options:

1. Files saved with Option 1 (i.e., with `KeyedVectors.save()`) are usually written to a file with the extension '.kv' (or sometimes '.vectors').
2. Files saved with Option 2 are usually named with the file extension '.vectors'. If you can't view the file in a text editor (try `head -n 1 <file-path>` in Mac's Terminal or `Get-Content <file-path> -TotalCount 1` in Windows PowerShell), it's saved as a binary file (see below).
3. Vocab files are typically saved as '.txt' files (just read with `pandas` or native Python) or as '.pkl' files (read with Python's `pickle` module). Word vector files are saved as '.txt', '.pkl', or '.npy' files (the latter read with `numpy`'s `load()` function).
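
If you prefer to check from within Python whether a file is plain text or binary, the following sketch may help. It is only a rough heuristic, not something `gensim` provides: the helper name `looks_like_text` and the two toy files are assumptions made for this illustration.

```python
import codecs
import os
import struct
import tempfile

def looks_like_text(path, n_bytes=1024):
    """Rough heuristic: True if the first n_bytes decode cleanly as UTF-8."""
    with open(path, 'rb') as f:
        head = f.read(n_bytes)
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a multi-byte character cut off at the chunk end
        text = decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    # NUL bytes decode fine but practically never occur in text files
    return '\x00' not in text

# two toy files (hypothetical content): one text, one with raw float32 bytes
tmp_dir = tempfile.mkdtemp()
text_fp = os.path.join(tmp_dir, 'toy.txt')
with open(text_fp, 'w', encoding='utf-8') as f:
    f.write('1 3\nhaus 0.1 0.2 0.3\n')
binary_fp = os.path.join(tmp_dir, 'toy.bin')
with open(binary_fp, 'wb') as f:
    f.write(b'1 3\nhaus ' + struct.pack('<3f', 0.1, 0.2, 0.3))

print(looks_like_text(text_fp), looks_like_text(binary_fp))
```

A result of `False` suggests trying `binary=True` when loading; a result of `True` suggests a text file you can also inspect by eye.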


## Examples

### Setup

In [6]:
# setup
import os

import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors

import numpy as np
import pickle

data_path = os.path.join('..', 'data', 'models')

In [2]:
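# helper function for file download
import os
import requests

# download a file from a URL and write it to a local file path
def download_file(url, fp):
    # create the destination folder if needed ('.' if fp has no folder part)
    os.makedirs(os.path.dirname(fp) or '.', exist_ok=True)
    with requests.get(url, stream=True) as r:
        # fail early on HTTP errors instead of saving an error page
        r.raise_for_status()
        with open(fp, 'wb') as f:
            # stream the response in chunks to keep memory use low
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)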

### Option 1: Load our *UK Commons* word embeddings

When we saved the vectors of the word2vec model we trained on *UK House of Commons* speeches, we used the `KeyedVectors` `save()` method.

So we can just `load()` them again:

In [14]:
from gensim.models import KeyedVectors

fp = os.path.join(data_path, 'gbr_commons', 'gbr_commons_word2vec_w5_d100.kv')
vectors = KeyedVectors.load(fp)
type(vectors)

gensim.models.keyedvectors.KeyedVectors

### Option 2: The *Google News Corpus* word2vec model

Remember how I told you that after downloading the 'word2vec-google-news-300' model with `gensim`'s downloader module, it's saved on our computer?

In [6]:
import os
import gensim.downloader as api

print(api.BASE_DIR)
os.listdir(api.BASE_DIR)

/Users/hlicht/gensim-data


['word2vec-google-news-300', 'information.json']

Well, the 'word2vec-google-news-300' folder in this location contains a gzipped file that records the word vectors:

In [8]:
model_name = 'word2vec-google-news-300'
os.listdir(os.path.join(api.BASE_DIR, model_name))

['word2vec-google-news-300.gz', '__init__.py', '__pycache__']

**_Note:_** `gensim` stores the file there the first time you download the model. The file is in the (binary) word2vec format, which is why we read it with the `load_word2vec_format()` method.

So we can just load this file as follows:

In [12]:
fp = os.path.join(api.BASE_DIR, model_name, model_name+'.gz')
vectors = KeyedVectors.load_word2vec_format(fp, binary=True)
type(vectors)

gensim.models.keyedvectors.KeyedVectors

### Option 2: German word embeddings

The German word embeddings at https://devmount.github.io/GermanWordEmbeddings are an example of Option 2.

The file we are interested in is called `'german.model'`!

To find out that this file has been saved in the word2vec format, I had to find the relevant [line of code](https://github.com/devmount/GermanWordEmbeddings/blob/662af36834643963f5d10485a3d86e186a9ead17/training.py#L65) used to train the model.
There you'll see that the developer has saved the model's word vectors using gensim's `save_word2vec_format()` method.


Let's first download the file:

In [None]:
# file URL
url = 'https://cloud.devmount.de/d2bc5672c523b086/german.model'

# file destination on our computer
model_dir = os.path.join(data_path, 'german_word_embeddings')
os.makedirs(model_dir, exist_ok=True)
fp = os.path.join(model_dir, 'word2vec.model')
# note: we keep the '.model' extension of the original file;
#        the file is in the binary word2vec format, so we load it
#        with load_word2vec_format() below

download_file(url, fp) # takes 0.5-1 minutes

Now we can load the file:

In [4]:
# load word vectors from file
from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format(fp, binary=True)

In [5]:
vectors.similarity('Haus', 'Maus') 

0.30662322

### Option 2: Deepset German word embeddings

source: https://www.deepset.ai/german-word-embeddings

Let's first download the text file recording the embeddings from the link they provide on their website:

In [None]:
url = 'https://int-emb-word2vec-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt'

model_dir = os.path.join(data_path, 'deepnet_german_word2vec')
os.makedirs(model_dir, exist_ok=True)
fp = os.path.join(model_dir, 'vectors.txt')

download_file(url, fp) # takes 3-4 minutes

Let's have a look at the first line of the text file:

In [17]:
# read only the first line
with open(fp, 'r') as file:
    line = file.readline()
# show the first 100 characters of the line
line[:100]

"b'UNK' -0.07903 0.01641 0.006979 -0.035038 0.006474 0.002469 -0.050103 0.142654 -0.03505 0.003106 -0"

You see that the first entry in the line is the word (`UNK`) and then all subsequent entries are numbers.
The numbers are the word's embedding.

*Note:* You might also have noticed that the word looks a little weird (`b'UNK'`). We'll deal with this later.
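
For a single line, separating the word from its embedding values can be sketched as follows. The sample string only contains the first few entries shown in the output above (the real line is much longer), and the variable names are chosen for this illustration.

```python
# first few entries of the first line, as shown in the output above
sample = "b'UNK' -0.07903 0.01641 0.006979 -0.035038"

# the first entry is the word, all remaining entries are the embedding values
word, *values = sample.rstrip().split(' ')
embedding = [float(x) for x in values]

print(word, embedding)
```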

So what we need to do is iterate over all lines in the file and for each get the word and its corresponding embedding.
The code below does just that, storing all embeddings in a standard Python dictionary and converting each embedding to a 1-dimensional numpy array (i.e., a vector):

In [18]:
import ast
import numpy as np
from tqdm.auto import tqdm # for progress bar

vectors = {}
# count the lines first so the progress bar knows the total
with open(fp, 'r') as file:
    n_lines = sum(1 for _ in file)
with open(fp, 'r') as file:
    for line in tqdm(file, total=n_lines):
        # drop the trailing newline and split entries at white spaces
        entries = line.rstrip().split(' ')
        # get the word (first entry)
        k = entries[0]
        # convert embedding to 1-d numpy array and add to dict
        vectors[k] = np.array(entries[1:], dtype=np.float32)


  0%|          | 0/854776 [00:00<?, ?it/s]

Let's look at the first 10 words in the embeddings dictionary:

In [19]:
list(vectors.keys())[:10]

["b'UNK'",
 "b'der'",
 "b'und'",
 "b'die'",
 "b'in'",
 "b'von'",
 "b'im'",
 "b'des'",
 "b'den'",
 "b'kategorie'"]

Apparently the creators of the text file have messed up the encoding a little (see https://stackoverflow.com/a/53730346).
So we'll fix this:

In [20]:
# NOTE: the next line is typically not necessary
vectors = {ast.literal_eval(k).decode(): v for k, v in vectors.items()}
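
To see what this line does to a single key, consider the sketch below. The key `b'sch\xc3\xb6n'` is a made-up example (not taken from the file), and the helper `clean_key` is a hypothetical, slightly more defensive variant that leaves keys without the `b'...'` wrapper untouched.

```python
import ast

# the text as it would appear in the file: b'sch\xc3\xb6n'
raw_key = "b'sch\\xc3\\xb6n'"

# step 1: interpret the text as a Python bytes literal
as_bytes = ast.literal_eval(raw_key)
# step 2: decode the bytes (assumed to be UTF-8) into a normal string
as_text = as_bytes.decode('utf-8')

def clean_key(key):
    """Strip the b'...' wrapper if present, otherwise return the key as is."""
    if key.startswith(("b'", 'b"')):
        return ast.literal_eval(key).decode('utf-8')
    return key

print(as_text, clean_key("b'UNK'"), clean_key('haus'))
```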

Now we are ready to create a `KeyedVectors` object from our dictionary of word--embedding pairs:

In [21]:
from gensim.models import KeyedVectors

# create a gensim KeyedVectors object from the dictionary of word vectors
kv = KeyedVectors(vector_size=vectors['UNK'].shape[0])
kv.add_vectors(list(vectors.keys()), list(vectors.values()))

Check that it works just right:

In [22]:
kv.similarity('die', 'der')

0.6345489
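
The number returned by `similarity()` is the cosine similarity between the two word vectors. As a minimal sketch of what that means, here is the computation in plain `numpy` with made-up 3-dimensional vectors (the function name `cosine_similarity` is ours, not part of `gensim`):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy vectors (hypothetical values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction get a value close to 1, orthogonal vectors a value close to 0.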

Now we can save it as a binary file to disk (in word2vec format) so it's easier to load it again sometime later:

In [None]:
# save to disk
kv.save_word2vec_format(fp.replace('.txt', '.kv'), binary=True)

### Example of Option 3

One of your fellow course participants pointed out that Stanford NLP provides word embeddings trained on the historical COHA corpus (https://www.english-corpora.org/coha/).

The embeddings are available for download here: https://nlp.stanford.edu/projects/histwords/

For example, the link underlying "All English (1800s-1990s)" (`http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip`) will download a very large zip file.

If you unzip this file, the `sgns/` folder will contain a number of files:

- '1800-vocab.pkl'
- '1800-w.npy'
- '1810-vocab.pkl'
- ...

In my understanding, these files correspond to a set of decade-specific word embedding models.

In [9]:
url = 'http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip'
fp = os.path.join(data_path, 'sgns.zip')
download_file(url, fp)

In [None]:
import zipfile

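# Step 1: Extract the archive into a folder named like the archive
model_dir = fp.removesuffix('.zip')  # str.removesuffix requires Python 3.9+
with zipfile.ZipFile(fp, 'r') as f:
    f.extractall(model_dir)

# Step 2: Remove the archive file
os.remove(fp)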

In [7]:
import pickle

# load word vectors from file
vectors = np.load(os.path.join(model_dir, '1800-w.npy'))
print(vectors.shape)

# load the matching vocabulary (same decade as the vectors) from file
with open(os.path.join(model_dir, '1800-vocab.pkl'), 'rb') as f:
    vocab = pickle.load(f)
print(len(vocab))
vocab[:10]

(100000, 300)
100000


['the', 'of', 'to', 'and', 'in', 'a', 'that', 'is', 'it', 'be']

In [8]:
# create a gensim KeyedVectors object from the vectors and vocab
kv = KeyedVectors(vector_size=vectors.shape[1])
kv.add_vectors(vocab, vectors)
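
Since Option 3 relies on row $i$ of the vector matrix belonging to word $i$ of the vocabulary, a quick sanity check before building the `KeyedVectors` object is cheap. The sketch below uses made-up toy files; the helper name `load_aligned` and the file names are assumptions for illustration and not part of the HistWords data.

```python
import os
import pickle
import tempfile

import numpy as np

def load_aligned(vocab_fp, vectors_fp):
    """Load a pickled vocab list and a .npy matrix and check that they match."""
    with open(vocab_fp, 'rb') as f:
        vocab = pickle.load(f)
    vectors = np.load(vectors_fp)
    if len(vocab) != vectors.shape[0]:
        raise ValueError(
            f"{len(vocab)} words but {vectors.shape[0]} vectors: "
            "do the two files belong to the same model?"
        )
    return vocab, vectors

# write toy files (hypothetical content)
tmp_dir = tempfile.mkdtemp()
toy_vocab = ['the', 'of', 'to']
toy_matrix = np.arange(12, dtype=np.float32).reshape(3, 4)
vocab_fp = os.path.join(tmp_dir, 'toy-vocab.pkl')
vectors_fp = os.path.join(tmp_dir, 'toy-w.npy')
with open(vocab_fp, 'wb') as f:
    pickle.dump(toy_vocab, f)
np.save(vectors_fp, toy_matrix)

loaded_vocab, loaded_vectors = load_aligned(vocab_fp, vectors_fp)

# a vocab with a different length is rejected
short_fp = os.path.join(tmp_dir, 'short-vocab.pkl')
with open(short_fp, 'wb') as f:
    pickle.dump(toy_vocab[:2], f)
try:
    load_aligned(short_fp, vectors_fp)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Note that matching lengths are only a necessary condition: the check cannot tell whether the words are in the right order, so always pair the vocab and vector files of the same model.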