In [1]:
from mxnet import nd
from mxnet.contrib import text
import numpy as np

# Greg Bruss
## KEN2570 - Spring 2019 Project

In our Natural Language Processing class, we saw how to train a word2vec word embedding model on a large-scale corpus. These "pre-trained" word vectors can be applied to a variety of tasks. In this notebook I look at one such task: finding analogies.

An analogy is a comparison between one thing and another that expresses a form of correspondence or similarity. Although the two terms compared are different, they share fundamental similarities that allow a relationship to be inferred. An example would be "Man is to woman as son is to daughter".

#### Why analogy finding is useful
"Analogical reasoning" is an important skill that anyone who truly understands language should know. An "analogical argument" is an explicit representation of a form of analogical reasoning that cites accepted similarities between two systems to support the conclusion that some further similarity exists [1]. It is clear that any AI system would need to have a grasp of these similarities, particularly because much of human speech uses analogy as a valid form of expression.

### Getting Pretrained Word Vectors

We need a way to get the pretrained word vectors. One option is the GluonNLP package (https://gluon-nlp.mxnet.io/), which makes it easy to train and evaluate word embeddings using word2vec, fastText, or GloVe models. Word2vec models were introduced by Mikolov et al. [3], fastText models by Bojanowski et al. [4], and GloVe models by Pennington et al. [5].

In this notebook I use the mxnet.contrib.text API, which loads pre-trained embedding vectors for text tokens and stores them in the mxnet.ndarray.NDArray format (MXNet documentation: https://mxnet.incubator.apache.org/api/python/contrib/text.html).

The keys of the pretrained file names are glove and fasttext, as these are the embeddings supported in the MXNet model zoo.

In [2]:
text.embedding.get_pretrained_file_names().keys()

dict_keys(['glove', 'fasttext'])

### Training Text

I will use GloVe word embeddings. The glove.6B vectors are trained on the "Wikipedia 2014 + Gigaword 5" dataset. A quick summary of this dataset:

    6 billion tokens
    400K vocabulary (uncased)
    50-, 100-, 200-, and 300-dimensional vectors available
    822 MB download (glove.6B.zip)

    (See https://nlp.stanford.edu/projects/glove/ for further details.)


In [3]:
print(text.embedding.get_pretrained_file_names('glove'))

['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', 'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', 'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt']
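
To make clearer what these files contain: each line of a GloVe .txt file stores a token followed by its space-separated vector components. The sketch below parses that layout with NumPy. The function name `parse_glove_lines` and the two sample entries are made up for illustration; this is not how MXNet loads the files internally.

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Parse 'token v1 v2 ... vn' lines into a token list and a 2-D array."""
    tokens, rows = [], []
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        tokens.append(parts[0])
        rows.append([float(v) for v in parts[1:]])
    return tokens, np.array(rows, dtype=np.float32)

# Two made-up 3-dimensional entries standing in for a real file
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
tokens, matrix = parse_glove_lines(sample)
print(tokens, matrix.shape)
```

To read a real file, an open file handle could be passed in place of the `io.StringIO` object.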


I can instantiate a pre-trained embedding using MXNet's text.embedding.create API. Here I load two of them: the 50-dimensional glove.6B vectors and the 300-dimensional glove.42B vectors.

In [5]:
glove_6b50d = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.50d.txt')

In [6]:
glove_42b300d = text.embedding.create(
    'glove', pretrained_file_name='glove.42B.300d.txt')

How many words are in each pre-trained embedding's dictionary?

In [35]:
print("The dictionary size of the 50d glove model is:", len(glove_6b50d))

The dictionary size of the 50d glove model is: 400001


In [36]:
print("The dictionary size of the 300d glove model is:", len(glove_42b300d))

The dictionary size of the 300d glove model is: 1917495


To get a feel for this dictionary, let's look up the indices of a few words using the embedding's token_to_idx mapping.

In [37]:
print("Index of the word 'knowledge' is:", glove_6b50d.token_to_idx['knowledge'])
print("Index of the word 'data' is:", glove_6b50d.token_to_idx['data'])
print("Index of the word 'robot' is:", glove_6b50d.token_to_idx['robot'])
print("Index of the word 'human' is:", glove_6b50d.token_to_idx['human'])

Index of the word 'knowledge' is: 2490
Index of the word 'data' is: 934
Index of the word 'robot' is: 9248
Index of the word 'human' is: 474


## Making use of the GloVe Word Embedding

We can use the word embedding for the analogy task by searching the vector space for nearby words. The relationship between A and B is represented by a "relationship vector", and applying that vector to C should lead to a point close to D.

#### "A --> B : C --> D". Given A, B, relationship(A, B), and C, find D

We can use a K-Nearest Neighbours search [6] for this, with cosine similarity as the measure of closeness.

In [38]:
def knn(W, x, k):
    # The added 1e-9 is for numerical stability
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]
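
As a rough illustration of what `knn` computes, here is the same cosine-similarity ranking written in plain NumPy on a tiny made-up matrix. The name `knn_cosine` and the toy values are assumptions for this sketch only; the notebook itself keeps using the MXNet version above.

```python
import numpy as np

def knn_cosine(W, x, k):
    """Return indices and cosine similarities of the k rows of W closest to x."""
    # Row norms of W; the small constant guards against all-zero rows
    row_norms = np.sqrt(np.sum(W * W, axis=1) + 1e-9)
    x_norm = np.sqrt(np.sum(x * x))
    cos = W @ x / (row_norms * x_norm)
    # argsort is ascending, so negate to get the largest similarities first
    topk = np.argsort(-cos)[:k]
    return topk, cos[topk]

# Toy 2-D "embedding" matrix with three rows (made-up values)
W_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
idx, sims = knn_cosine(W_toy, np.array([2.0, 0.1]), k=2)
print(idx, sims)
```

The query points almost along the first axis, so row 0 ranks first and the diagonal row 2 ranks second.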

To solve the analogy problem, we need to find the word vector that is most similar to the result vector vec(c) + vec(b) - vec(a).

In [39]:
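def get_analogy(token_a, token_b, token_c, embed):
    # Look up the vectors for a, b, and c
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    # vec(b) - vec(a) + vec(c) should land near vec(d)
    x = vecs[1] - vecs[0] + vecs[2]
    # Take the single nearest neighbour; the query words are not excluded
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[topk[0]]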

# Types of Analogies: Will use 3 categories of analogy with 15 test cases each

The different types of analogy are:

#### capital: country

In [40]:
get_analogy('beijing', 'china', 'tokyo', glove_42b300d)

'japan'

#### adjective: superlative

In [42]:
get_analogy('bad', 'worst', 'big', glove_42b300d)

'biggest'

#### present-tense verb: past-tense verb

In [None]:
get_analogy('accept', 'accepted', 'achieve', glove_6b50d)

In [149]:
# Each line of the file holds four whitespace-separated words: a b c d
with open("antonyms.txt", "r") as f:
    analogies = [line.lower().split() for line in f]

print(analogies)

[['after', 'before', 'ahead', 'behind'], ['anterior', 'posterior', 'backward', 'forward'], ['before', 'after', 'beginning', 'end'], ['below', 'above', 'climb', 'descend'], ['dead', 'alive', 'decrement', 'increment'], ['descend', 'ascend', 'dive', 'emerge'], ['down', 'up', 'downslope', 'upslope'], ['drop', 'lift', 'dynamic', 'static'], ['employ', 'dismiss', 'exit', 'entrance'], ['fall', 'rise', 'first', 'last']]


In [150]:
# The expected answer is the last word of each analogy
real_answer = [analogy[-1] for analogy in analogies]

In [151]:
predicted_answer=[get_analogy(analogy[0], analogy[1], analogy[2], glove_6b50d) for analogy in analogies]

In [152]:
print(predicted_answer)
print(real_answer)

['ahead', 'backward', 'beginning', 'climb', 'decrement', 'dive', 'downslope', 'dynamic', 'exit', 'first']
['behind', 'forward', 'end', 'descend', 'increment', 'emerge', 'upslope', 'static', 'entrance', 'last']
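
Comparing the two lists with the input, every prediction is simply the third word of its query ('ahead', 'backward', 'beginning', and so on). This suggests that the nearest neighbour of vec(b) - vec(a) + vec(c) is often vec(c) itself. A common remedy is to exclude the three query words from the candidates. The sketch below shows the effect on made-up 2-D vectors using NumPy; `toy_analogy`, its `exclude_inputs` flag, and all the values are hypothetical and chosen only to illustrate the idea, so it says nothing about how the GloVe results would change.

```python
import numpy as np

def toy_analogy(a, b, c, vocab, vectors, exclude_inputs=True):
    """Return the word closest to vec(b) - vec(a) + vec(c) by cosine similarity."""
    index = {word: i for i, word in enumerate(vocab)}
    x = vectors[index[b]] - vectors[index[a]] + vectors[index[c]]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(x) + 1e-9
    cos = vectors @ x / norms
    if exclude_inputs:
        # Mask the query words so that they cannot be returned
        for word in (a, b, c):
            cos[index[word]] = -np.inf
    return vocab[int(np.argmax(cos))]

# Made-up vectors; 'apple' is an unrelated distractor
toy_vocab = ['man', 'woman', 'king', 'queen', 'apple']
toy_vectors = np.array([[1.0, 0.5],
                        [1.0, 1.0],
                        [5.0, 1.0],
                        [5.0, 3.0],
                        [-1.0, 2.0]])

print(toy_analogy('man', 'woman', 'king', toy_vocab, toy_vectors, exclude_inputs=False))
print(toy_analogy('man', 'woman', 'king', toy_vocab, toy_vectors))
```

Without the mask the toy query returns 'king', the third input word; with the mask it returns 'queen'.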


In [153]:
total = len(predicted_answer)
correct = 0
for i in range(len(predicted_answer)):
    if predicted_answer[i] == real_answer[i]:
        correct += 1

In [154]:
accuracy = correct/total * 100
print("Total correct: ", correct)
print("Total asked: ", total)
print("The accuracy on the ", f.name,"dataset is: ",accuracy)

Total correct:  0
Total asked:  10
The accuracy on the  antonyms.txt dataset is:  0.0
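
Since the plan is to evaluate several analogy categories, the counting and percentage steps above could be wrapped in one small helper. This is only a sketch in plain Python; the name `accuracy_percent` is made up, and the demo lists are the first three entries of the output printed earlier.

```python
def accuracy_percent(predicted, expected):
    """Percentage of positions where the prediction equals the expected word."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(predicted) * 100

# First three predictions and answers copied from the output above
demo_predicted = ['ahead', 'backward', 'beginning']
demo_expected = ['behind', 'forward', 'end']
print(accuracy_percent(demo_predicted, demo_expected))
```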


## References:

[1] Bartha, Paul, "Analogy and Analogical Reasoning", The Stanford Encyclopedia of Philosophy (Spring 2019 Edition), Edward N. Zalta (ed.), URL = <https://plato.stanford.edu/archives/spr2019/entries/reasoning-analogy/>.

[2] Zhang, A., Lipton, Z., Li, M., and Smola, A., "Dive into Deep Learning" (2019), URL = https://d2l.ai/

[3] Mikolov et al., "Efficient Estimation of Word Representations in Vector Space", ICLR Workshop 2013.

[4] Bojanowski et al., "Enriching Word Vectors with Subword Information", TACL 2017.

[5] Pennington et al., "GloVe: Global Vectors for Word Representation", EMNLP 2014.

[6] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

