In [1]:
import gensim.downloader as api
from htrc_features import Volume, transformations

## Load a model with Gensim

The easiest way is using the downloader API, but you can also load locally downloaded models.

Just for the demonstration, I'm using the smallest GloVe model, the 25-dimension Twitter model. I highly encourage using the models trained on the Gigaword and Wikipedia corpus, with more dimensions (e.g. try `'glove-wiki-gigaword-300'`).

In [27]:
model = api.load('glove-twitter-25')
model['test']



array([ 1.1768  ,  0.82329 , -0.19366 , -0.25328 ,  0.99367 , -0.1751  ,
        0.95619 , -0.14049 ,  0.90307 ,  0.77942 ,  0.052748,  0.015829,
       -3.0639  ,  0.79883 ,  0.97166 ,  0.1536  ,  0.54858 , -0.062755,
       -1.1394  , -0.53928 , -0.49389 , -0.17549 , -0.41542 ,  0.62815 ,
       -0.33548 ], dtype=float32)

## Loading a Volume and transforming it to WEM

Here is a quick way to load a volume. *This loads directly from the HTRC's servers* and shouldn't be done at scale. At scale, load local files that you've rsynced.

In [5]:
vol = Volume('nyp.33433061424580')
vol

The Feature Reader library has a function called `transformations.chunk_to_wem`, which gives you word embeddings for an EF tokenlist, either by page or by 'chunk' (a grouping of words from multiple pages, aimed at a target length). You can use stoplisting with `stop=True`, which requires the spaCy library, and you can log-transform the token counts with `log=True` so that very common words are not overly represented. The final vector per page/chunk is an average of all the word vectors, weighted by the (optionally log-transformed) word counts.
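
To make the weighting concrete, here is a minimal sketch of a count-weighted average of word vectors. It is not the library's code: the function name `weighted_average_vector`, the toy two-dimensional vectors, and the choice of `log(1 + count)` as the log transform are all my assumptions for illustration.

```python
import numpy as np

def weighted_average_vector(tokens, counts, vectors, log=True):
    """Average the vectors of known tokens, weighted by their counts.

    Hypothetical helper for illustration; `vectors` is any mapping from
    token to a 1-D numpy array (a toy stand-in for a Gensim model).
    """
    # Keep only tokens that have a vector
    kept = [(t, c) for t, c in zip(tokens, counts) if t in vectors]
    if not kept:
        return None
    matrix = np.stack([vectors[t] for t, _ in kept])
    weights = np.array([c for _, c in kept], dtype=float)
    if log:
        # Assumed transform: log(1 + count) dampens very frequent words
        weights = np.log1p(weights)
    return np.average(matrix, axis=0, weights=weights)

# Toy vectors; 'zzz' has no vector, so it is ignored
toy_vectors = {"whale": np.array([1.0, 0.0]), "sea": np.array([0.0, 1.0])}
plain = weighted_average_vector(["whale", "sea", "zzz"], [9, 1, 5], toy_vectors, log=False)
logged = weighted_average_vector(["whale", "sea", "zzz"], [9, 1, 5], toy_vectors, log=True)
print(plain, logged)
```

With raw counts, 'whale' accounts for 90% of the weight. With the log transform its share drops, which is the point of the `log` option.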

You can also set `min_ncount`, which filters out infrequent words, and provide a vocabulary of the words in the model through `vocab`, which will speed up the code a bit. Note that the default `min_ncount` of 10 is tuned for 10k-word chunks; at the page level it might be sensible to set it to 1 or 2.
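
If you want to pass `vocab`, one way to build it once up front is sketched below. The helper name `model_vocabulary` is hypothetical, and I'm assuming a plain `set` of token strings is an acceptable `vocab` value. Gensim 4 exposes the vocabulary as `key_to_index`, while Gensim 3 used `vocab`, so the helper checks for both. The `FakeModel` class is only a stand-in so the snippet runs without downloading a model.

```python
def model_vocabulary(model):
    """Return the model's known tokens as a set (hypothetical helper)."""
    # Gensim 4.x KeyedVectors: dict mapping token -> row index
    key_to_index = getattr(model, "key_to_index", None)
    if key_to_index is not None:
        return set(key_to_index)
    # Gensim 3.x KeyedVectors: dict mapping token -> Vocab object
    return set(model.vocab)

class FakeModel:
    """Stand-in for a loaded Gensim model, for demonstration only."""
    key_to_index = {"test": 0, "whale": 1}

vocab = model_vocabulary(FakeModel())
print(sorted(vocab))
```

In the notebook you would call `model_vocabulary(model)` on the GloVe model loaded above and reuse the result across every `chunk_to_wem` call.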

In [6]:
transformations.chunk_to_wem?

Signature:
transformations.chunk_to_wem(
    chunk_tl,
    model,
    vocab=None,
    stop=True,
    log=True,
    min_ncount=10,
)
Docstring: Take a file that has ['token', 'count'] data and convert to a WEM vector
File:      ~/htrc-feature-reader/htrc_features/transformations.py
Type:      function


In [29]:
tokenlist = vol.tokenlist(pos=False, drop_section=True).reset_index()
tokenlist.sample(3)

|       | page | token | count |
|-------|------|-------|-------|
| 41950 | 353  | w     | 1     |
| 1414  | 21   | male  | 3     |
| 11357 | 92   | 4     | 1     |


In [32]:
for page, group in tokenlist.groupby('page'):
    vec = transformations.chunk_to_wem(group, model, vocab=None, stop=False, log=True, min_ncount=1)
    print("Page:", page, "Vector:", vec)
    print("Breaking loop, just showing a single example")
    break

Page: 7 Vector: [ 0.7371707  -0.78592011 -0.2511647   0.17435114 -0.08399114  0.5422133
  0.87715426  1.88172583 -0.45720661  1.03818983  0.37447861 -0.60314632
 -3.32686159  0.22923918 -0.22501893 -0.74857824 -0.79556065 -0.02725386
 -1.11331707 -0.60854221  0.25328588 -0.16529209 -0.51196496  0.46722921
  0.47008771]
Breaking loop, just showing a single example


Here's an example using 5000 word chunks:

In [33]:
tokenlist = vol.tokenlist(chunk=True, chunk_target=5000, pos=False, drop_section=True).reset_index()
tokenlist.sample(3)

|       | chunk | token      | count |
|-------|-------|------------|-------|
| 31799 | 25    | inevitable | 4     |
| 15603 | 11    | times      | 1     |
| 25324 | 20    | published  | 4     |


In [35]:
for chunk, group in tokenlist.groupby('chunk'):
    vec = transformations.chunk_to_wem(group, model, vocab=None, stop=False, log=True, min_ncount=1)
    print("Chunk:", chunk, "Vector:", vec)
    break

Chunk: 1 Vector: [-0.19062095  0.233643   -0.14757691  0.08613188 -0.05712432 -0.15239129
  0.58771984 -0.58536827 -0.07212887 -0.120707   -0.0931105   0.26524463
 -3.18245603  0.24303958 -0.01725546  0.07850345  0.2587761   0.01840434
  0.24959177 -0.25468259 -0.03151951  0.23090205 -0.17642475 -0.15148901
 -0.30538046]
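
The loops above stop after the first group. To keep a vector for every page or chunk, you could collect them into a DataFrame, as in this sketch. The helper `vectors_by_group` and the toy `toy_to_vector` function are hypothetical. In the notebook, the callable would instead be something like `lambda g: transformations.chunk_to_wem(g, model, stop=False, log=True, min_ncount=1)`.

```python
import numpy as np
import pandas as pd

def vectors_by_group(tokenlist, group_col, to_vector):
    """Apply `to_vector` to each group and stack the results, one row per group."""
    rows = {}
    for key, group in tokenlist.groupby(group_col):
        rows[key] = np.asarray(to_vector(group))
    matrix = pd.DataFrame.from_dict(rows, orient="index")
    matrix.index.name = group_col
    return matrix

def toy_to_vector(group):
    # Stand-in for chunk_to_wem: returns [total count, number of distinct rows]
    return [group["count"].sum(), len(group)]

toy = pd.DataFrame({
    "chunk": [1, 1, 2],
    "token": ["a", "b", "a"],
    "count": [3, 1, 2],
})
matrix = vectors_by_group(toy, "chunk", toy_to_vector)
print(matrix)
```

Each row of the result is indexed by its page or chunk number, which makes it easy to join back to the tokenlist or to compare chunks with each other later.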
