# Word Embedding Playground

In [1]:
import spacy
import gensim.downloader as api
import gensim
import csv
import re


import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

from bs4 import BeautifulSoup
import requests

We can mess with word embeddings. 

# Option 1: Our embeddings

We can use the embeddings that we trained! Or that we will train!

# Option 2: Pre-trained embeddings

Alternatively, we can use pretrained embeddings downloaded from the internet. We can use Word2Vec, or GloVe, a model that came out a few years later and works very well.

We use `gensim` which is a great library for working with embeddings and training topic models. It has some 

In [2]:
info = api.info()

Print out available models (i.e. embeddings)

In [3]:
for model_name, model_data in sorted(info['models'].items()):
    print(
        '%s (%d records): %s' % (
            model_name,
            model_data.get('num_records', -1),
            model_data['description'][:40] + '...',
        )
    )

__testing_word2vec-matrix-synopsis (-1 records): [THIS IS ONLY FOR TESTING] Word vecrors ...
conceptnet-numberbatch-17-06-300 (1917247 records): ConceptNet Numberbatch consists of state...
fasttext-wiki-news-subwords-300 (999999 records): 1 million word vectors trained on Wikipe...
glove-twitter-100 (1193514 records): Pre-trained vectors based on  2B tweets,...
glove-twitter-200 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-25 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-twitter-50 (1193514 records): Pre-trained vectors based on 2B tweets, ...
glove-wiki-gigaword-100 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-200 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-300 (400000 records): Pre-trained vectors based on Wikipedia 2...
glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia 2...
word2vec-google-news-300 (3000000 records): Pre-trai

The first time you load a new dataset it will take a while to download, and the files can be quite large.

In [41]:

# keep this commented out - I'm using this dataset 
#model_news = api.load("word2vec-google-news-300")
# You should use these embeddings, and compare your results with mine!
model_wiki =  api.load("glove-wiki-gigaword-50")
model_twitter =  api.load("glove-twitter-100")




# Option 3: Train Embeddings

You can also train embeddings from a saved corpus. 

For our experiments, we're going to use the abstracts of all arXiv papers in the category cs.CL (computation and language) that were published before mid-April 2021, a total of around 25,000 documents. We tokenize these abstracts with spaCy.

We define a wrapper to read csv data with two columns per row, of which we only keep the second.

Each row contains an abstract of an NLP paper.

If your csv looks different, you will have to change this function to get the right column.

In [9]:
class Corpus(object):

    def __init__(self, filename):
        self.filename = filename
        self.nlp = spacy.blank("en")
        
    def __iter__(self):
        with open(self.filename, "r") as i:
            reader = csv.reader(i, delimiter=",")
            for _, abstract in reader:
                tokens = [t.text.lower() for t in self.nlp(abstract)]
                yield tokens
                            

In [10]:
documents = Corpus("../data/arxiv.csv")
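
If your csv has a header row, picking the column by name can be less fragile than unpacking by position. Here is a minimal sketch of that idea. The helper `iter_tokens`, the column name `abstract` and the in-memory sample file are all made up for illustration, and plain whitespace splitting stands in for the spaCy tokenizer used above.

```python
import csv
import io


def iter_tokens(csv_file, column):
    """Yield one lowercased token list per row, reading the named column."""
    reader = csv.DictReader(csv_file)  # uses the first row as column names
    for row in reader:
        # whitespace tokenization is a simplification (assumption), not spaCy
        yield row[column].lower().split()


# a tiny in-memory csv standing in for a real file (hypothetical data)
sample = io.StringIO("id,abstract\n1,We study Word Embeddings\n2,Parsing is Hard\n")
docs = list(iter_tokens(sample, column="abstract"))
print(docs)
```

For a file on disk you would pass an open file handle instead of the `io.StringIO` object.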

In [11]:
arxiv_model = gensim.models.Word2Vec(documents, min_count=50, window=5, vector_size=100)
arxiv_model.save("word2vec.arxiv.model")

In [12]:
arxiv_model = arxiv_model.wv

When we train our word embeddings, gensim allows us to set a number of parameters. The most important of these are `min_count`, `window`, `vector_size` and `sg`:

* `min_count` is the minimum frequency of the words in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we set it to 50 to limit the size of our model even more.
* `window` is the number of words to the left and to the right that make up the context that word2vec will take into account.
* `vector_size` is the dimensionality of the word vectors. This is generally between 100 and 1000. This dimensionality often forces us to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.
* `sg`: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (sg=0).
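
To get a feel for what `min_count` and `window` do, here is a small sketch that lists the (target, context) pairs a skip-gram model would be trained on. This is a simplified illustration, not gensim's actual code: the function `skipgram_pairs` and the toy corpus are made up, and real implementations add further tricks such as subsampling very frequent words.

```python
from collections import Counter


def skipgram_pairs(sentences, window=2, min_count=1):
    """Yield (target, context) pairs from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    for sent in sentences:
        # drop words rarer than min_count before building contexts
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, target in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield target, kept[j]


# hypothetical toy corpus: "train" and "plot" occur only once each
toy = [
    ["we", "train", "word", "embeddings"],
    ["we", "plot", "word", "embeddings"],
]
pairs = list(skipgram_pairs(toy, window=1, min_count=2))
print(pairs)
```

With `min_count=2` the two rare words disappear, so `we` and `word` become neighbours. CBOW uses the same windows, but predicts the target from all of its context words at once.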


# Playing with Embeddings

Let's take a look at the trained model. The word embeddings are on its `wv` attribute, and we can access them by using the token as key. For example, here is the embedding for nlp, with the requested 100 dimensions.

We can also easily find the similarity between two words. Similarity is measured as the cosine between the two word embeddings, and therefore ranges between -1 and +1. The higher the cosine, the more similar two words are. As expected, the figures below show that nmt (neural machine translation) is closer to smt (statistical machine translation) than to ner (named entity recognition).
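
If you want to see what the cosine measures, here it is computed by hand. This is a minimal sketch on made-up two-dimensional vectors; the function name `cosine_similarity` is our own and not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to +1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # close to 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter, only their direction.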

In [57]:
# change to model = whatever vectors you have loaded to explore them in the code below

model = model_twitter

In [62]:


print(model.similarity("language", "logic"))
print(model.similarity("language", "play"))

model.most_similar(positive=['biden'],topn=10)

0.5545392
0.4175607


[('romney', 0.8368731737136841),
 ('potus', 0.7850850224494934),
 ('obama', 0.7849058508872986),
 ('clinton', 0.7623938322067261),
 ('barack', 0.7508866190910339),
 ('rnc', 0.7207996249198914),
 ('boehner', 0.7167794704437256),
 ('president', 0.6993654370307922),
 ('palin', 0.6886443495750427),
 ('debates', 0.6864507794380188)]



In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embedding to bert are all semantically related to it: other types of pretrained models such as roberta, mbert, xlm, as well as the more general model type BERT represents (transformer and transformers), and more generally related words (pretrained).

In [35]:
model.similar_by_word("cooperative", topn=10)

[('promote', 0.7328717112541199),
 ('establishing', 0.7325128316879272),
 ('development', 0.7282374501228333),
 ('promoting', 0.7234703302383423),
 ('cooperation', 0.7193742990493774),
 ('partnership', 0.7100776433944702),
 ('sustainable', 0.7055977582931519),
 ('productive', 0.7023764848709106),
 ('develop', 0.7007005214691162),
 ('collaborative', 0.6980806589126587)]

Interestingly, we can look for words that are similar to a set of words and dissimilar to another set of words at the same time. This allows us to look for analogies of the type BERT is to a transformer like an LSTM is to .... Our embedding model correctly predicts that LSTMs are a type of RNN, just like BERT is a particular type of transformer.

    > solves
    > 3 : 1 :: 2 : _____
    > man : king :: woman : _______

In [None]:
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
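
The idea behind this call can be sketched as vector arithmetic: for a : b :: c : ?, we compute b - a + c and look for the nearest remaining word. The toy vectors and the helper `analogy` below are invented for illustration. gensim's `most_similar` follows the same spirit but works on normalized vectors, so its scores will not match this sketch exactly.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # the input words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# hypothetical 3-d vectors; dimensions loosely mean (male, female, royal)
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, 0.0],
}
answer = analogy(toy_vectors, "man", "king", "woman")
print(answer)
```

In gensim terms, `b` and `c` go into `positive` and `a` goes into `negative`.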

#### Exercise: solve the analogy with your vectors!
    
    > pasta : pizza :: poutine : _____

In [28]:
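# work here
# note: the call below computes poutine + pasta - pizza;
# pasta : pizza :: poutine : ? corresponds to positive=['pizza', 'poutine'], negative=['pasta']
model.most_similar(positive=['poutine','pasta'],negative=['pizza'])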


[('poos', 0.6951389908790588),
 ('charr', 0.6802443265914917),
 ('milfoil', 0.6759202480316162),
 ('adnet', 0.6674808263778687),
 ('poliakov', 0.6673842668533325),
 ('kalyuzhny', 0.6571604013442993),
 ('dauphinois', 0.6463200449943542),
 ('delamontagne', 0.6451795697212219),
 ('vedrine', 0.6449277400970459),
 ('splenomegaly', 0.6439857482910156)]

## Disambiguation



Similarly, we can also zoom in on one of the meanings of ambiguous words. For example, in NLP tree has a very specific meaning, which is obvious from its nearest neighbours constituency, parse, dependency and syntax.


In [None]:
model.most_similar(positive=["pizza", "poutine"], negative=["pasta"], topn=1)

In [None]:
model.most_similar(positive=["tree"], topn=10)



However, if we specify we're looking for words that are similar to tree, but dissimilar to forest, suddenly it gives a different, more domesticated image of a tree.


In [38]:
model.most_similar(positive=["cybernetic"], topn=10)

[('prosthetic', 0.6965978145599365),
 ('robotic', 0.67524653673172),
 ('reptilian', 0.6654907464981079),
 ('mind-control', 0.6636196970939636),
 ('cad/cam', 0.6501995921134949),
 ('prosthetics', 0.6475244760513306),
 ('exfoliating', 0.6458118557929993),
 ('prosthesis', 0.6454856395721436),
 ('constructs', 0.6423043608665466),
 ('cyborg', 0.6421464681625366)]



Finally, we can present the word2vec model with a list of words and ask it to identify the odd one out. It then uses the word embeddings to identify the word that is least similar to the other ones. For example, in the list lstm cnn gru svm transformer, it correctly identifies svm as the only non-neural model. In the list bert word2vec gpt-2 roberta xlnet, it correctly singles out word2vec as the only non-transformer model. In word2vec bert glove fasttext elmo, bert is singled out as the only transformer.
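
Roughly, the odd one out is the word whose vector points furthest away from the average direction of the group. Here is a minimal sketch of that idea under our own assumptions: the helper `odd_one_out` and the two-dimensional toy vectors are made up, and this is not gensim's actual implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the normalized vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = unit([sum(units[w][k] for w in words) / len(words) for k in range(dim)])
    # for unit vectors the dot product equals the cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# hypothetical vectors: three words point one way, "drab" points another
toy = {
    "effervescent": [1.0, 0.0],
    "bubbly": [1.0, 0.1],
    "sparkling": [0.9, 0.2],
    "drab": [0.0, 1.0],
}
odd = odd_one_out(toy, ["effervescent", "bubbly", "sparkling", "drab"])
print(odd)
```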


In [39]:
print(model.doesnt_match("effervescent bubbly sparkling drab".split()))
print(model.doesnt_match("monarchy democracy bureaucracy communism socialism".split()))
print(model.doesnt_match("bert ernie elmo barney".split()))

drab
bureaucracy
elmo


# Plotting Embeddings



Let's now visualize some of our embeddings. To plot embeddings with a dimensionality of 100 or more, we first need to map them to a dimensionality of 2. We do this with the popular t-SNE method. T-SNE, short for t-distributed Stochastic Neighbor Embedding, helps us visualize high-dimensional data by mapping similar data to nearby points and dissimilar data to distant points in the low-dimensional space.

T-SNE is present in Scikit-learn. To run it, we just have to specify the number of dimensions we'd like to map the data to (n_components), and the similarity metric that t-SNE should use to compute the similarity between two data points (metric). We're going to map to 2 dimensions and use the cosine as our similarity metric. Additionally, we use PCA as an initialization method to remove some noise and speed up computation. The Scikit-learn user guide contains some additional tips for optimizing performance.

Plotting all the embeddings in our vector space would result in a very crowded figure where the labels are hardly legible. Therefore we'll focus on a subset of embeddings by selecting the 100 most similar words to a target word.



In [None]:

import numpy as np

# list of words to map
target_word = "disaster"
selected_words = [w[0] for w in model.most_similar(positive=[target_word], topn=100)] + [target_word]

# array of embeddings for the selected words (one row per word)
embeddings = np.array([model[w] for w in selected_words])

# 2-D reduction of embeddings
mapped_embeddings = TSNE(n_components=2, metric='cosine', init='pca').fit_transform(embeddings)



If we take bert as our target word, the figure shows some interesting patterns. In the immediate vicinity of bert, we find the similar transformer models that we already identified as its nearest neighbours earlier: xlm, mbert, gpt-2, and so on. Other parts of the picture have equally informative clusters of NLP tasks and benchmarks (squad and glue), languages (german and english), neural-network architectures (lstm, gru, etc.), embedding types (word2vec, glove, fasttext, elmo), etc.


In [None]:
%matplotlib inline


plt.figure(figsize=(20,20))
x = mapped_embeddings[:,0]
y = mapped_embeddings[:,1]
plt.scatter(x, y)


# add labels to our map visualization
for i, txt in enumerate(selected_words):
    plt.annotate(txt, (x[i], y[i]))

# Compare to Twitter

Here are the 100-dimensional GloVe vectors trained on Twitter (`glove-twitter-100`, loaded above as `model_twitter`). Use these for the exercises below.

In [44]:
# download the model and return as object ready for use
# embeddings_glove_twitter = api.load("glove-twitter-25")
model_glove_twitter = model_twitter



## Exercise: Semantic Similarity


What is the semantic similarity between 'meaning' and 'interpretation' in twitter space?
What about meaning and extract?

How does this compare to the arxiv embeddings?

In [None]:
# answer here
#
#



Pick two more concepts and compare their cosine similarity in two different vector space models

## Exercise: Near Neighbors

What are the nearest neighbors of these concepts?

In [None]:
# answer here
#
#
#

## Exercise: Analogy

How does twitter solve the tree analogy?

In [None]:
# answer here
#
#


Try another analogy. What is the result? In your mind, what should the answer be?

In [None]:
# answer here
#
#


# Exercise: Plot spatial relationships

Pick a set of words that are related to each other in the same way. For instance you could use countries and their capitals, or adjectives and their comparatives.

e.g.

```
Rome - Italy

Beijing - China

Berlin - Germany

Ottawa - Canada
```

comparatives
``` 
bad - worse

good - better

warm - warmer

red - redder

blue - bluer
```

What do you notice?

In [None]:
"""
Answer here

You'll need to copy the code for visualization from above 
but edit certain parts to give you the embeddings of the words 
you want to look at
""" 





"""
"""

# Training new embeddings

Challenge: train a word2vec model on a corpus of your choosing or of your creation. Now that we have one line to do all of the work of yesterday, the challenge is getting the data into the right shape. The function

`gensim.models.Word2Vec(documents, min_count=x, window=x, vector_size=x)`

should be able to take any kind o


## Load your own corpus

#### CSV Data

Load the corpus into memory (returns a Dataset object, which is the exact kind of object we need).

We define a wrapper to read csv data with two columns per row, of which we only keep the second.

Each row contains an abstract of an NLP paper.

If your csv looks different, you will have to change this function to get the right column.

In [45]:
class Corpus(object):

    def __init__(self, filename):
        self.filename = filename
        self.nlp = spacy.blank("en")
        
    def __iter__(self):
        with open(self.filename, "r") as i:
            reader = csv.reader(i, delimiter=",")
            for _, abstract in reader:
                tokens = [t.text.lower() for t in self.nlp(abstract)]
                yield tokens
                            

Then you can create your corpus by initializing a Corpus object with the relative path to your csv

In [46]:
documents = Corpus("../data/arxiv.csv")

#### Load text from a website

In [47]:
url = 'https://theanarchistlibrary.org/library/the-invisible-committe-now.muse'
res = requests.get(url)
html_page = res.text

# Parse the source code using BeautifulSoup
soup = BeautifulSoup(html_page, 'html.parser')

# Extract the plain text content
text = soup.get_text()

In [48]:
class Corpus(object):

    def __init__(self, text):
        # separate punctuation from previous word
        spaced_corpus = re.sub(r'(\w)([.,?!;:])', r'\1 \2', text)
        # split into sentences on full stops
        self.sentences = spaced_corpus.split('.')

    def __iter__(self):
        for sentence in self.sentences:
            words = sentence.split() # split on whitespace
            words = [word.lower() for word in words]
            yield words

In [49]:
corpus = Corpus(text)
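
It can help to check on a tiny string what this preprocessing produces before training on it. The sketch below repeats the same two steps (regex spacing, then splitting on full stops) in a standalone helper; the name `split_sentences` and the example strings are made up for illustration.

```python
import re


def split_sentences(text):
    """Space out punctuation, split on '.', then lowercase and tokenize."""
    spaced = re.sub(r'(\w)([.,?!;:])', r'\1 \2', text)
    return [sentence.lower().split() for sentence in spaced.split('.')]


sentences = split_sentences("Now is the time. Act, then rest!")
print(sentences)

# a text that ends in a full stop leaves an empty token list at the end
trailing = split_sentences("Done.")
print(trailing)
```

Two things stand out: punctuation such as `,` and `!` becomes its own token, and only full stops end a sentence, so text separated by `?` or `!` stays in one "sentence". If that matters for your corpus, you may want to filter out empty lists or split on more characters.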

Train a model.

In [50]:
now_model = gensim.models.Word2Vec(corpus, min_count=0, window=5, vector_size=100)
now_model = now_model.wv

In [51]:
now_model["now"]

array([-0.17780584,  0.14165595, -0.02023686,  0.04731496, -0.03528563,
       -0.41216475,  0.08436754,  0.5550741 , -0.29759452, -0.3302673 ,
        0.0532984 , -0.3605916 ,  0.00634871,  0.19888034,  0.1136161 ,
       -0.12106553,  0.09652579, -0.16625862, -0.08971754, -0.44926703,
        0.16550542, -0.02957334,  0.27742964, -0.17005129,  0.05981685,
        0.08914005, -0.2739902 ,  0.0081562 , -0.20762356,  0.10143223,
        0.37559932, -0.09632047,  0.14442205, -0.27530015,  0.01123251,
        0.2604527 ,  0.06122362, -0.05353915, -0.05280174, -0.30480772,
        0.09729952, -0.23737392, -0.17903922, -0.00507795,  0.20530565,
       -0.04261357, -0.13149163, -0.05699653,  0.17866972,  0.16170652,
        0.20821849, -0.23083907,  0.01231476, -0.02665225, -0.04280896,
        0.08481832,  0.1451759 ,  0.04481702, -0.0545764 ,  0.12536135,
        0.00405   , -0.02745885,  0.00898578, -0.01171332, -0.26224548,
        0.30893615, -0.01165136,  0.21606153, -0.2752248 ,  0.26