# Learning Embeddings from Wikipedia Text
We can use a stream of articles from Wikipedia as a corpus for learning word representations that capture various relationships between words. First, we'll set up the corpus, build a vocabulary of words to model, and check the size of this vocabulary:

In [1]:
from pysem.corpora import Wikipedia
from pysem.embeddings import ContextEmbedding, OrderEmbedding, SyntaxEmbedding

# Limit the corpus to 100 articles from a local copy of Wikipedia
wiki = Wikipedia('/Users/peterblouw/corpora/wikipedia', article_limit=100)
wiki.build_vocab()

print('Vocab size: ', len(wiki.vocab))

Vocab size:  14648
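
As a rough illustration of what building a vocabulary involves, here is a minimal sketch that tokenizes two toy articles and keeps the words seen at least `min_count` times. The `toy_build_vocab` helper, its regex tokenizer, and the frequency threshold are assumptions made for illustration; `wiki.build_vocab()` may tokenize and filter words differently.

```python
import re
from collections import Counter

def toy_build_vocab(articles, min_count=2):
    """Return the set of lowercased words that occur at least min_count times."""
    counts = Counter()
    for text in articles:
        # Hypothetical tokenizer: runs of ASCII letters only
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, n in counts.items() if n >= min_count}

articles = [
    "The king ruled the country.",
    "The country elected a president.",
]
vocab = toy_build_vocab(articles, min_count=2)
print('Vocab size: ', len(vocab))
```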


## Context Encoding
Now, we can build a basic random indexing model to encode the word co-occurrence patterns in these articles into a set of high-dimensional vectors. This method is related to well-known algorithms such as LSA, Word2Vec (CBOW and skip-gram), and GloVe. One benefit of random indexing is that it is easy to parallelize, and hence efficient to run on large corpora:

In [2]:
dim = 512

model = ContextEmbedding(corpus=wiki, vocab=wiki.vocab)
model.train(dim=dim, batchsize=100)
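
To make the idea concrete, here is a minimal NumPy sketch of random indexing under simplifying assumptions: each word gets a fixed random index vector, and a word's context vector is the sum of the index vectors of the other tokens it shares a sentence with. The function names, the toy sentences, and the sentence-level context window are hypothetical and are not taken from `ContextEmbedding`.

```python
import numpy as np

def train_random_indexing(sentences, dim=512, seed=0):
    """Return (words, context_vectors) for a list of tokenized sentences."""
    rng = np.random.default_rng(seed)
    words = sorted({w for s in sentences for w in s})
    word_to_id = {w: i for i, w in enumerate(words)}
    # Fixed random index vectors; nearly orthogonal to each other in high dimensions
    index_vectors = rng.normal(0, 1 / np.sqrt(dim), size=(len(words), dim))
    context_vectors = np.zeros((len(words), dim))
    for sentence in sentences:
        ids = [word_to_id[w] for w in sentence]
        if not ids:
            continue
        total = index_vectors[ids].sum(axis=0)
        for i in ids:
            # Add the index vectors of every other token in the sentence
            context_vectors[i] += total - index_vectors[i]
    return words, context_vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentences = [
    ['football', 'team', 'league'],
    ['football', 'team', 'national'],
    ['wine', 'grain', 'shipments'],
    ['wine', 'grain', 'fabrics'],
]
words, vectors = train_random_indexing(sentences)
vec = dict(zip(words, vectors))
print('football ~ team:', cosine(vec['football'], vec['team']))
print('football ~ wine:', cosine(vec['football'], vec['wine']))
```

Because each sentence only adds to running sums, batches of sentences can be processed independently and their partial sums added together afterwards, which is what makes this kind of model easy to parallelize.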

We can find the nearest neighbors to any word in the resulting 'semantic space' with just a few lines of code. Note that with this small amount of training data, the results will be specific to the topics of the Wikipedia articles that happen to have been chosen:

In [3]:
word_list = ['brain','movie','king','wine','football','president']

for word in word_list:
    print('Nearest neighbors to "%s":' % word)
    model.get_nearest(word)
    print('')

Nearest neighbors to "brain":
brain 1.0
autism 0.54344561018
disorders 0.469028282092
pathophysiology 0.466330738368
abnormal 0.441061449132

Nearest neighbors to "movie":
movie 1.0
conferred 0.519436362895
chances 0.431285975196
spend 0.428732468579
awards 0.414598293813

Nearest neighbors to "king":
king 1.0
tsar 0.500444391653
elisabeth 0.495399346033
empress 0.462959086472
austria 0.460324661169

Nearest neighbors to "wine":
wine 1.0
fabrics 0.520567542197
shipments 0.491140089023
grain 0.479177993768
lamps 0.473673349113

Nearest neighbors to "football":
football 1.0
team 0.733451490448
national 0.574808780484
league 0.552105716309
competitions 0.53870027669

Nearest neighbors to "president":
president 1.0
lincoln 0.583303474198
government 0.572080966555
power 0.540296170581
elected 0.540091293746
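
A nearest-neighbor query like this usually comes down to ranking words by cosine similarity. The sketch below assumes that is the measure in use (each query word scoring 1.0 against itself is consistent with this, but the library may differ); `nearest_neighbors` and the three toy vectors are made up for illustration.

```python
import numpy as np

def nearest_neighbors(word, words, vectors, n=5):
    """Return the n (word, cosine) pairs most similar to `word`, best first."""
    # Normalize rows so that dot products are cosine similarities
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[words.index(word)]
    top = np.argsort(sims)[::-1][:n]
    return [(words[i], float(sims[i])) for i in top]

words = ['king', 'tsar', 'wine']
vectors = np.array([[1.0, 0.0, 0.0],
                    [0.8, 0.6, 0.0],
                    [0.0, 0.0, 2.0]])

for neighbor, score in nearest_neighbors('king', words, vectors, n=2):
    print(neighbor, score)
```

As in the output above, the query word itself comes back first, since every vector has a cosine of 1 with itself.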



## Order Encoding

To make the model a bit more interesting, we can encode positional information about the words that tend to occur around each target word in our vocabulary. This amounts to adding information about n-grams in the corpus to each semantic vector. Computation is a bit more costly in this case, because several circular convolutions must be computed per word occurrence. Again, though, this computation can be parallelized, so it's not too bad to implement.

In [4]:
model = OrderEmbedding(corpus=wiki, vocab=wiki.vocab)
model.train(dim=dim, batchsize=100)
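
The binding step can be sketched as follows, assuming a window of one word on each side: a neighbor's index vector is bound to a random position vector by circular convolution and added to the target word's order vector, and circular correlation approximately undoes the binding at query time. All names here (`cconv`, `ccorr`, `position`, `completions`) and the toy sentences are hypothetical; `OrderEmbedding` may use wider windows or a different scheme.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def ccorr(a, b):
    """Circular correlation: approximately recovers x from cconv(a, x)."""
    return np.fft.irfft(np.conj(np.fft.rfft(a)) * np.fft.rfft(b), n=len(a))

dim = 512
rng = np.random.default_rng(0)
sentences = [
    ['abraham', 'lincoln', 'was', 'president'],
    ['president', 'of', 'the', 'academy'],
    ['abraham', 'lincoln', 'argued', 'that'],
]
words = sorted({w for s in sentences for w in s})
index = {w: rng.normal(0, 1 / np.sqrt(dim), dim) for w in words}
# One random vector per relative position: -1 is the word before, +1 the word after
position = {k: rng.normal(0, 1 / np.sqrt(dim), dim) for k in (-1, 1)}

order = {w: np.zeros(dim) for w in words}
for s in sentences:
    for i, w in enumerate(s):
        for offset, pos_vec in position.items():
            j = i + offset
            if 0 <= j < len(s):
                # Bind the neighbor's index vector to its relative position
                order[w] += cconv(pos_vec, index[s[j]])

def completions(word, offset, n=3):
    """Rank words by how strongly they are stored at `offset` from `word`."""
    decoded = ccorr(position[offset], order[word])
    scores = {w: float(decoded @ index[w]) for w in words}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

for w, score in completions('abraham', 1):
    print(w, score)
```

Each occurrence costs one FFT-based convolution per position in the window, which is where the extra computation comes from.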

The resulting vectors can be queried for words that are likely to occur in positions to the left and right of a target word. We can also find words that tend to occur in the same 'order contexts' as a target word:

In [5]:
word_list = ['president','abraham','of','academy','argued','give','each','smallest']

for word in word_list:
    print('Likely words next to "%s":' % word)
    model.get_completions(word, position=1)
    print('')

Likely words next to "president":
of 0.382755494788
the 0.180164686655
open 0.15547932252
organ 0.153229070419
rank 0.153081494071

Likely words next to "abraham":
lincoln 0.658149572882
became 0.272971910354
four 0.175399749547
was 0.170755109111
microvertebrate 0.16943944256

Likely words next to "of":
the 0.613220994693
fame 0.171407702238
delegation 0.169932453691
indus 0.168967915638
fort 0.162275386815

Likely words next to "academy":
awards 0.387565945673
of 0.23705046183
award 0.185014873278
cultivated 0.176263114853
renovated 0.167384534068

Likely words next to "argued":
that 0.839421791859
in 0.15748030515
inclined 0.156978640176
confirm 0.153613452494
attracting 0.151525390146

Likely words next to "give":
the 0.21482697857
settlement 0.180179484466
up 0.179351508232
outsourcing 0.157924780219
buy 0.157780325143

Likely words next to "each":
other 0.484684055273
of 0.26651835649
minute 0.170385495395
microvertebrate 0.169999011835
armenia 0.169330322612

Likely words next t

There's a more accurate way to find preceding and subsequent words: we simply look for order vectors that tend to encode the target word in particular positions. (It's helpful to consider why this is more accurate.)

In [6]:
phrase_list = ['promoted __ rights', 'which lincoln promoted __ rights for', 'president __',
               ' __ civil war', 'aristotle held more __ theories']

for phrase in phrase_list:
    print('Phrase completion for %s' % phrase)
    model.get_resonants(phrase)
    print('')

Phrase completion for promoted __ rights
womens 0.295232661799
voting 0.260559512777
human 0.174971002393
exploitation 0.168301908727
gross 0.162211002789

Phrase completion for which lincoln promoted __ rights for
voting 0.285143378689
descriptive 0.272722302065
practically 0.26657383359
filters 0.262916394025
taxable 0.259761047626

Phrase completion for president __
abdelaziz 0.453357412357
sali 0.406385216355
richard 0.395164728279
nixons 0.367285183688
franklin 0.363557401577

Phrase completion for  __ civil war
yearlong 0.452919201331
spanish 0.274829115928
devastating 0.265167431973
commanderinchief 0.230518228242
ensuing 0.20712482073

Phrase completion for aristotle held more __ theories
accurate 0.329408727259
definitely 0.285656253704
efficient 0.275589221452
than 0.249757187713
distant 0.231377148262
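
One way to picture this kind of lookup is sketched below, under the assumption that the blank in a phrase is turned into a probe built from the surrounding words, which is then compared against every word's order vector. The `bind_context` and `resonants` helpers, the window of two words per side, and the toy sentences are all hypothetical; `get_resonants` may work differently.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

dim = 512
rng = np.random.default_rng(1)
sentences = [
    ['lincoln', 'promoted', 'voting', 'rights'],
    ['they', 'promoted', 'human', 'rights'],
    ['lincoln', 'promoted', 'the', 'general'],
    ['taxable', 'property', 'rights'],
]
words = sorted({w for s in sentences for w in s})
index = {w: rng.normal(0, 1 / np.sqrt(dim), dim) for w in words}
# One random vector per relative position, up to two words away
position = {k: rng.normal(0, 1 / np.sqrt(dim), dim) for k in (-2, -1, 1, 2)}

def bind_context(tokens, i):
    """Sum of position-bound index vectors for the known words around slot i."""
    total = np.zeros(dim)
    for offset, pos_vec in position.items():
        j = i + offset
        if 0 <= j < len(tokens) and tokens[j] in index:
            total += cconv(pos_vec, index[tokens[j]])
    return total

# A word's order vector accumulates the bound context of each occurrence
order = {w: np.zeros(dim) for w in words}
for s in sentences:
    for i, w in enumerate(s):
        order[w] += bind_context(s, i)

def resonants(phrase, n=3):
    """Rank words whose order vectors best match the context around '__'."""
    tokens = phrase.split()
    probe = bind_context(tokens, tokens.index('__'))
    scores = {w: float(probe @ v / (np.linalg.norm(probe) * np.linalg.norm(v)))
              for w, v in order.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

for w, score in resonants('promoted __ rights'):
    print(w, score)
```

In this sketch the probe combines several positional constraints at once, so a candidate has to fit both the word before and the word after the blank to score well, which is one plausible reason the approach is more selective.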



## Syntax Encoding
It is possible to extend the methods used for encoding order information to encode information about the syntactic structure of the sentences a word typically occurs in. We'll use dependency structures to model this information, primarily because they are simpler than constituency structures and thus easier to encode in vectors with a limited capacity for storing structured information (all the usual facts about the capacity of holographic reduced representations, or HRRs, apply here). Instead of encoding words to the left or right of a target word, we'll encode words that occur as parents or children of a target word in a dependency tree.

In [7]:
from IPython.display import Image

# Display an illustration of constituency versus dependency structure
Image(url="http://taweb.aichi-u.ac.jp/tmgross/pix/PSG-DG.png")

The resulting vectors can be used to query a target word for words that are commonly linked to it by a given dependency relation, such as its nominal subject (`nsubj`) or direct object (`dobj`).

In [8]:
model = SyntaxEmbedding(corpus=wiki, vocab=wiki.vocab)
model.train(dim=dim, batchsize=100)

In [9]:
word_list = ['emphasized','invited','appeals']

for word in word_list:
    print('Common nsubj for "%s":' % word)
    model.get_verb_neighbors(word, 'nsubj')
    print('')

for word in word_list:
    print('Common dobj for "%s":' % word)
    model.get_verb_neighbors(word, 'dobj')
    print('')

Common nsubj for "emphasized":
historians 0.308699289221
iatrochemistry 0.30119167908
anthropology 0.296073104435
lincoln 0.259356753825
change 0.258065401129

Common nsubj for "invited":
leaders 0.310169961697
legislature 0.261423996759
massoud 0.251087800359
government 0.230902382508
they 0.229253309749

Common nsubj for "appeals":
party 0.519144348455
defendant 0.497877799977
who 0.440538492479
anarchocommunist 0.1948900773
cfb 0.163253282246

Common dobj for "emphasized":
doctrine 0.358332525265
application 0.291008756448
opposition 0.289021486144
independence 0.266479683281
rights 0.260209440485

Common dobj for "invited":
einstein 0.326522653232
lincoln 0.296432232376
states 0.275441344005
departments 0.264893546054
priestess 0.255444756649

Common dobj for "appeals":
it 0.407735316625
conviction 0.38481297082
bees 0.166581341639
rosa 0.159904901159
clement 0.151520139973
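
The same binding machinery carries over to dependencies if position vectors are replaced by relation vectors. The sketch below assumes a handful of hand-written (head, relation, dependent) triples standing in for parser output; `child_role`, `parent_role`, and `verb_neighbors` are hypothetical names, and `SyntaxEmbedding` may encode trees differently.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def ccorr(a, b):
    """Circular correlation: approximately recovers x from cconv(a, x)."""
    return np.fft.irfft(np.conj(np.fft.rfft(a)) * np.fft.rfft(b), n=len(a))

dim = 512
rng = np.random.default_rng(2)
# Toy (head, relation, dependent) triples, written by hand for illustration
triples = [
    ('emphasized', 'nsubj', 'historians'),
    ('emphasized', 'dobj', 'doctrine'),
    ('invited', 'nsubj', 'leaders'),
    ('invited', 'dobj', 'einstein'),
    ('appeals', 'nsubj', 'defendant'),
    ('appeals', 'dobj', 'conviction'),
]
words = sorted({w for head, _, dep in triples for w in (head, dep)})
relations = sorted({rel for _, rel, _ in triples})
index = {w: rng.normal(0, 1 / np.sqrt(dim), dim) for w in words}
# Separate role vectors for "has this child" and "has this parent"
child_role = {r: rng.normal(0, 1 / np.sqrt(dim), dim) for r in relations}
parent_role = {r: rng.normal(0, 1 / np.sqrt(dim), dim) for r in relations}

syntax = {w: np.zeros(dim) for w in words}
for head, rel, dep in triples:
    syntax[head] += cconv(child_role[rel], index[dep])   # child of the head
    syntax[dep] += cconv(parent_role[rel], index[head])  # parent of the dependent

def verb_neighbors(verb, rel, n=3):
    """Rank words by how strongly they are stored as `rel` children of `verb`."""
    decoded = ccorr(child_role[rel], syntax[verb])
    scores = {w: float(decoded @ index[w]) for w in words}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

for w, score in verb_neighbors('emphasized', 'nsubj'):
    print(w, score)
```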

