# Get Embeddings

Hi!

In this notebook, I create multiple embedding representations for the phrasal verb data found [here](https://github.com/johnstarr-ling/light-verb-construction-embeddings/tree/main/data).

The kinds of embeddings that I currently use are:
1. pre-trained Word2Vec word embeddings (300d, trained on Google News Corpus)
2. pre-trained Word2Vec word embeddings (300d, trained on Google News Corpus), with phrasal verb as one compositional embedding (mean pooling)
3. pre-trained GLoVe word embeddings (300d, trained on Wikipedia 2014 + Gigaword 5 corpora)
4. pre-trained GLoVe word embeddings (300d, trained on Wikipedia 2014 + Gigaword 5 corpora), with phrasal verb as one compositional embedding (mean pooling)
5. [InferSent](https://github.com/facebookresearch/InferSent) sentence embeddings

Embedding representations that I expect to use in the future include:
1. BERT

__NOTE ON RUNNING THIS NOTEBOOK:__ This notebook requires the following files to be in the same directory for a successful run:
1. [models.py](https://github.com/facebookresearch/InferSent/blob/main/models.py) file from InferSent
2. [a pretrained GLoVe embedding .txt file](https://nlp.stanford.edu/projects/glove/)

## Data Preparation

In [16]:
import numpy as np
import pandas as pd
import torch 

In [17]:
# import stuff
%load_ext autoreload
%autoreload 2
%matplotlib inline

from random import randint

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We'll load in our phrasal verb data:

In [39]:
data = pd.read_csv('data/pvc_data.csv')

In [40]:
data.head()

Unnamed: 0,pvc_lemmas,file_path,row,is_phrasal,annotator_agreement_percentage,verb_idx,sents
0,"['take', 'on']",B/BN/BNN.xml,291,True,1.0,"['12', '13']","['At', 'about', 'the', 'same', 'time', 'the', ..."
1,"['give', 'in']",B/B1/B1E.xml,670,False,0.7051,"['15', '16']","['Production', 'is', 'centred', 'in', 'the', '..."
2,"['take', 'after']",K/K3/K3E.xml,56,False,0.6733,"['21', '22']","['By', 'Echo', 'reporter', 'CORONATION', 'Stre..."
3,"['get', 'out']",C/CK/CK9.xml,1654,True,1.0,"['19', '20']","['Mrs', 'Aggie', ',', 'I', 'do', 'want', 'to',..."
4,"['get', 'through']",G/G2/G2E.xml,2734,True,1.0,"['9', '10']","['He', 'was', 'charged', 'for', 'a', 'call', '..."


In [41]:
# Making sure all our data are in the right form
data['pvc_lemmas'] = data['pvc_lemmas'].apply(eval)
data['row'] = data['row'].apply(int)
data['annotator_agreement_percentage'] = data['annotator_agreement_percentage'].apply(float)
data['verb_idx'] = data['verb_idx'].apply(eval)
data['sents'] = data['sents'].apply(eval)
data['pvc_strings'] = data['pvc_lemmas'].apply(lambda x: ' '.join(x))
data['sent_strings'] = data['sents'].apply(lambda x: ' '.join(x))


## InferSent

Most of the following code is taken from [here](https://github.com/facebookresearch/InferSent/blob/main/demo.ipynb).

In [21]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johns\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [22]:
# Load model
from model import InferSent
model_version = 1
MODEL_PATH = "infersent%s.pkl" % model_version
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

<All keys matched successfully>

In [23]:
# If infersent1 -> use GloVe embeddings. If infersent2 -> use InferSent embeddings.
W2V_PATH = 'glove.6B.300d.txt' if model_version == 1 else 'fastText/crawl-300d-2M.vec'
model.set_w2v_path(W2V_PATH)

In [24]:
# Load embeddings of K most frequent words
model.build_vocab_k_words(K=100000)

Vocab size : 100000


In [43]:
# Some test sentences!
test_lvcs = ['I gave John a bath .', 'I gave John a book .', 'I gave John a chance .']

In [44]:
test_embeds = model.encode(test_lvcs, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(test_lvcs)))

Nb words kept : 12/24 (50.0%)
Speed : 124.8 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 3


In [27]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [45]:
cosine(model.encode(['I gave John a bath.'])[0], model.encode(['I gave John a book.'])[0])

0.86921847

In [47]:
# Phrasal verb embedding
pv_embeddings = model.encode(data['pvc_strings'], bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

Nb words kept : 2616/5142 (50.9%)


  sentences = np.array(sentences)[idx_sort]


Speed : 804.0 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 1263


In [48]:
# Whole sentence embedding
sent_embeddings = model.encode(data['sent_strings'], bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

Nb words kept : 30795/36617 (84.1%)
Speed : 66.4 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 1263


In [49]:
# Adding these to the DataFrame
pv_infer_list = list(pv_embeddings)
sent_infer_list = list(sent_embeddings)

In [50]:
data['pv_infer_embeds'] = pv_infer_list
data['sent_infer_embeds'] = sent_infer_list

In [51]:
data.head()

Unnamed: 0,pvc_lemmas,file_path,row,is_phrasal,annotator_agreement_percentage,verb_idx,sents,pvc_strings,sent_strings,pv_infer_embeds,sent_infer_embeds
0,"[take, on]",B/BN/BNN.xml,291,True,1.0,"[12, 13]","[At, about, the, same, time, the, aliens, depa...",take on,At about the same time the aliens department o...,"[0.04849784, 0.082029976, 0.010337902, 0.00278...","[0.12564819, 0.06340436, 0.055802934, 0.096010..."
1,"[give, in]",B/B1/B1E.xml,670,False,0.7051,"[15, 16]","[Production, is, centred, in, the, Pacific, no...",give in,Production is centred in the Pacific northwest...,"[0.06049606, 0.066955656, -0.0019042839, -0.05...","[0.07641017, 0.13320042, 0.008155916, 0.096234..."
2,"[take, after]",K/K3/K3E.xml,56,False,0.6733,"[21, 22]","[By, Echo, reporter, CORONATION, Street, actre...",take after,By Echo reporter CORONATION Street actress Lyn...,"[0.04559154, 0.013943457, 0.013154965, -0.0398...","[0.116348945, 0.07574034, 0.013408521, 0.07490..."
3,"[get, out]",C/CK/CK9.xml,1654,True,1.0,"[19, 20]","[Mrs, Aggie, ,, I, do, want, to, go, to, a, sc...",get out,"Mrs Aggie , I do want to go to a school where ...","[0.04216131, -0.040760946, -0.071714684, -0.00...","[0.08019499, 0.06795283, 0.053973824, 0.092289..."
4,"[get, through]",G/G2/G2E.xml,2734,True,1.0,"[9, 10]","[He, was, charged, for, a, call, that, never, ...",get through,He was charged for a call that never got throu...,"[0.07250941, -0.0235534, -0.04305069, 0.039528...","[0.07009956, 0.052870482, 0.08822129, 0.059288..."


## W2V

In [33]:
import gensim.downloader

In [34]:
w2v = gensim.downloader.load('word2vec-google-news-300')


In [38]:
# Creating an <UNK> that is the mean of all vectors in the space
w2v['<UNK>'] = np.mean([w2v[x] for x in w2v.key_to_index.keys()], axis=0)


In [36]:
# Gets the vectors to the list
def get_w2v_list(word_list, vectors):
    vector_list = []
    for word in word_list:
        try:
            vector_list.append(vectors[word])
        except:
            vector_list.append(vectors['<UNK>'])
    return vector_list

In [55]:
# Creating embedding representations
data['pv_w2v_embed'] = data['pvc_lemmas'].apply(lambda x: get_w2v_list(x, w2v))
data['sent_w2v_embed'] = data['sents'].apply(lambda x: get_w2v_list(x, w2v))

In [56]:
data.head()

Unnamed: 0,pvc_lemmas,file_path,row,is_phrasal,annotator_agreement_percentage,verb_idx,sents,pvc_strings,sent_strings,pv_infer_embeds,sent_infer_embeds,pv_w2v_embed,sent_w2v_embed
0,"[take, on]",B/BN/BNN.xml,291,True,1.0,"[12, 13]","[At, about, the, same, time, the, aliens, depa...",take on,At about the same time the aliens department o...,"[0.04849784, 0.082029976, 0.010337902, 0.00278...","[0.12564819, 0.06340436, 0.055802934, 0.096010...","[[-0.05102539, 0.0041503906, 0.024902344, -0.0...","[[-0.088378906, -0.011962891, 0.21484375, 0.05..."
1,"[give, in]",B/B1/B1E.xml,670,False,0.7051,"[15, 16]","[Production, is, centred, in, the, Pacific, no...",give in,Production is centred in the Pacific northwest...,"[0.06049606, 0.066955656, -0.0019042839, -0.05...","[0.07641017, 0.13320042, 0.008155916, 0.096234...","[[0.06201172, -0.122558594, 0.016845703, 0.086...","[[-0.027954102, -0.19042969, -0.02746582, -0.1..."
2,"[take, after]",K/K3/K3E.xml,56,False,0.6733,"[21, 22]","[By, Echo, reporter, CORONATION, Street, actre...",take after,By Echo reporter CORONATION Street actress Lyn...,"[0.04559154, 0.013943457, 0.013154965, -0.0398...","[0.116348945, 0.07574034, 0.013408521, 0.07490...","[[-0.05102539, 0.0041503906, 0.024902344, -0.0...","[[-0.047851562, -0.29492188, 0.375, 0.359375, ..."
3,"[get, out]",C/CK/CK9.xml,1654,True,1.0,"[19, 20]","[Mrs, Aggie, ,, I, do, want, to, go, to, a, sc...",get out,"Mrs Aggie , I do want to go to a school where ...","[0.04216131, -0.040760946, -0.071714684, -0.00...","[0.08019499, 0.06795283, 0.053973824, 0.092289...","[[0.033203125, -0.08984375, -0.29492188, 0.115...","[[0.23242188, -0.1875, -0.28125, -0.06542969, ..."
4,"[get, through]",G/G2/G2E.xml,2734,True,1.0,"[9, 10]","[He, was, charged, for, a, call, that, never, ...",get through,He was charged for a call that never got throu...,"[0.07250941, -0.0235534, -0.04305069, 0.039528...","[0.07009956, 0.052870482, 0.08822129, 0.059288...","[[0.033203125, -0.08984375, -0.29492188, 0.115...","[[-0.038085938, 0.34570312, 0.103027344, -0.03..."


## GLoVe

__NOTE:__ Uncomment the following final line in the following code block if you have _not_ created a w2v filetype from the GLoVe embeddings already:

In [58]:
from gensim.scripts.glove2word2vec import glove2word2vec

glove_filename = 'glove.6B.300d.txt'
word2vec_output_file = glove_filename+'.word2vec'
# glove2word2vec(glove_filename, word2vec_output_file)

In [59]:
# This can take a while to run :( 
from gensim.models import KeyedVectors

glove = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

In [60]:
# Creating an <UNK> representation that consists of the mean of the embedding space
glove['<UNK>'] = np.mean([glove[x] for x in glove.key_to_index.keys()], axis=0)

In [61]:
def get_glove(token_list, model):
    vector_list = []
    for word in token_list:
        try:
            vector_list.append(model[word])
        except:
            vector_list.append(model['<UNK>'])
    return vector_list

In [62]:
# Retrieving vector representations for everything!
data['pv_glove_embed'] = data['pvc_lemmas'].apply(lambda x: get_glove(x, glove))
data['sent_glove_embed'] = data['pvc_lemmas'].apply(lambda x: get_glove(x, glove))

In [63]:
data.head()

Unnamed: 0,pvc_lemmas,file_path,row,is_phrasal,annotator_agreement_percentage,verb_idx,sents,pvc_strings,sent_strings,pv_infer_embeds,sent_infer_embeds,pv_w2v_embed,sent_w2v_embed,pv_glove_embed,sent_glove_embed
0,"[take, on]",B/BN/BNN.xml,291,True,1.0,"[12, 13]","[At, about, the, same, time, the, aliens, depa...",take on,At about the same time the aliens department o...,"[0.04849784, 0.082029976, 0.010337902, 0.00278...","[0.12564819, 0.06340436, 0.055802934, 0.096010...","[[-0.05102539, 0.0041503906, 0.024902344, -0.0...","[[-0.088378906, -0.011962891, 0.21484375, 0.05...","[[-0.015879, 0.11807, -0.12769, -0.16302, -0.0...","[[-0.015879, 0.11807, -0.12769, -0.16302, -0.0..."
1,"[give, in]",B/B1/B1E.xml,670,False,0.7051,"[15, 16]","[Production, is, centred, in, the, Pacific, no...",give in,Production is centred in the Pacific northwest...,"[0.06049606, 0.066955656, -0.0019042839, -0.05...","[0.07641017, 0.13320042, 0.008155916, 0.096234...","[[0.06201172, -0.122558594, 0.016845703, 0.086...","[[-0.027954102, -0.19042969, -0.02746582, -0.1...","[[0.1088, -0.21724, -0.55772, -0.15096, 0.0473...","[[0.1088, -0.21724, -0.55772, -0.15096, 0.0473..."
2,"[take, after]",K/K3/K3E.xml,56,False,0.6733,"[21, 22]","[By, Echo, reporter, CORONATION, Street, actre...",take after,By Echo reporter CORONATION Street actress Lyn...,"[0.04559154, 0.013943457, 0.013154965, -0.0398...","[0.116348945, 0.07574034, 0.013408521, 0.07490...","[[-0.05102539, 0.0041503906, 0.024902344, -0.0...","[[-0.047851562, -0.29492188, 0.375, 0.359375, ...","[[-0.015879, 0.11807, -0.12769, -0.16302, -0.0...","[[-0.015879, 0.11807, -0.12769, -0.16302, -0.0..."
3,"[get, out]",C/CK/CK9.xml,1654,True,1.0,"[19, 20]","[Mrs, Aggie, ,, I, do, want, to, go, to, a, sc...",get out,"Mrs Aggie , I do want to go to a school where ...","[0.04216131, -0.040760946, -0.071714684, -0.00...","[0.08019499, 0.06795283, 0.053973824, 0.092289...","[[0.033203125, -0.08984375, -0.29492188, 0.115...","[[0.23242188, -0.1875, -0.28125, -0.06542969, ...","[[-0.14124, -0.11836, -0.30782, 0.098416, 0.22...","[[-0.14124, -0.11836, -0.30782, 0.098416, 0.22..."
4,"[get, through]",G/G2/G2E.xml,2734,True,1.0,"[9, 10]","[He, was, charged, for, a, call, that, never, ...",get through,He was charged for a call that never got throu...,"[0.07250941, -0.0235534, -0.04305069, 0.039528...","[0.07009956, 0.052870482, 0.08822129, 0.059288...","[[0.033203125, -0.08984375, -0.29492188, 0.115...","[[-0.038085938, 0.34570312, 0.103027344, -0.03...","[[-0.14124, -0.11836, -0.30782, 0.098416, 0.22...","[[-0.14124, -0.11836, -0.30782, 0.098416, 0.22..."


In [64]:
# Saving to data/ folder:
data.to_csv('data/pvc_with_embeds.csv', index=False)


## Conclusion

That's it! This is how I constructed the embedding representations from a modified version of the PVC corpus described in [this paper](https://cogcomp.seas.upenn.edu/papers/TuRoth12.pdf).