# Introduction

This notebook gives readers an overview of our proposed methods for extracting phrase embeddings.  
For the complete code release and all training scripts, please see [this GitHub repo](https://github.com/NTHU-NLPLAB/TAAI_gen_paraphrase).  

# Prepare environment

In [None]:
# If you want to install all packages at once, directly run this cell

# !pip install -r requirement.txt

# Load data

For simplicity, we simply load our preprocessed data.  
The data contains one sentence per line. All sentences have been tokenized and lemmatized, so after splitting on whitespace they can be fed directly into Word2Vec for training.

In [1]:
def load_data(data_path):
    # read the file line by line and strip surrounding whitespace
    with open(data_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

The result will be a list of processed sentences.

In [2]:
ret = load_data('data/all_hyphened_sent.txt')
ret[:3]

['and end at houli horse farm ( ) , or -PRON- could go further on to the lovely pilu buddhist monastery ( p )',
 'the dumpling be a big favourite',
 'a chill venue where musicindustry type hang , and everyone seem to know everyone else']
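
gensim's Word2Vec expects every sentence as a list of tokens rather than as a single string. Since the sentences are already tokenized and separated by spaces, a whitespace split is enough. The following is a minimal sketch; `to_token_lists` is a hypothetical helper that is not part of the original code.

```python
def to_token_lists(sentences):
    """Split whitespace-tokenized sentences into lists of tokens."""
    return [sentence.split() for sentence in sentences]

# Two of the sentences shown above
sample = ['the dumpling be a big favourite',
          'a chill venue where musicindustry type hang , and everyone seem to know everyone else']
tokenized = to_token_lists(sample)
print(tokenized[0])
```

Punctuation such as `,` stays a separate token because the preprocessed data already puts spaces around it.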

# Train Word2Vec

## Download pretrained model  
For convenience, we do not train the whole model from scratch; instead, we finetune an existing pretrained model.  
Hence, we first download Google's pretrained Word2Vec model. This model was trained on the Google News dataset, which contains about 100 billion words.  
For more details, please refer to [Google's website](https://code.google.com/archive/p/word2vec/).

In [3]:
!pip install gdown



In [4]:
# File size: ~1.65 GB (compressed)

!mkdir models
!gdown -O models/GoogleNews-vectors-negative300.bin.gz --id 0B7XkCwpI5KDYNlNUTTlSS21pQmM

Downloading...
From: https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
To: /home/nlplab/joaw/projects/TAAI_gen_paraphrase/models/GoogleNews-vectors-negative300.bin.gz
1.65GB [01:18, 21.0MB/s]


In [5]:
!gzip -d models/GoogleNews-vectors-negative300.bin.gz

## Finetune our own model 

After the model is downloaded, we use the [gensim](https://github.com/RaRe-Technologies/gensim) package to finetune it.  

**gensim** is a useful toolkit for NLP and IR tasks. It implements many well-known models, such as Word2Vec, Doc2Vec, and FastText. Here we use gensim's API to load and finetune the Word2Vec model.  
For more information about gensim, please refer to:
 - [Official Word2Vec Tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html)
 - [Word2Vec's API documentation](https://radimrehurek.com/gensim/models/word2vec.html)

In [6]:
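# gensim should have been installed in the first cell. If it was not installed correctly, please run this command.
# Note: this notebook uses the gensim 3.x API (`size=`, `.vocab`); gensim 4 renamed these to `vector_size=` and `.key_to_index`.

!pip install "gensim<4"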



In [7]:
from gensim.models import Word2Vec             # This is Word2Vec's base model in gensim
from gensim.models import KeyedVectors         # This stores all vocabulary information

First, we create a function that sets up a Word2Vec object with the specified parameters.  
Since we are using Google's model, we set the embedding dimension (`size`) to the dimension of the pretrained model.  
`min_count` is set to `1` so that the pretrained model's vocabulary list is loaded correctly.  
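
To see what `min_count` controls, the toy sketch below counts word frequencies and keeps only the words that occur at least `min_count` times. It illustrates the idea only and is not gensim's implementation; `build_vocab_sketch` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count=1):
    """Return the set of words occurring at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}

sentences = [['look', 'for', 'the', 'key'],
             ['check', 'out', 'the', 'menu']]

# With min_count=1 every word is kept, including words seen only once.
print(sorted(build_vocab_sketch(sentences, min_count=1)))
# With min_count=2 only 'the' survives in this toy corpus.
print(sorted(build_vocab_sketch(sentences, min_count=2)))
```

This matters later in the notebook: the pretrained words are added as one list in which every word appears exactly once, so any threshold above `1` would discard them.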


In [8]:
def create_model(training_data, emb_dim=300):
    model = Word2Vec(size = emb_dim,
                     min_count = 1)
    model.build_vocab(training_data)
    example_count = model.corpus_count
    return model, example_count

Now we can load Google's pretrained weights into our model.  
Since fine-tuning is not officially supported by gensim, we have to do some of the work ourselves:  

1. open the pretrained model  
2. add all pretrained words to our vocabulary list  
3. copy all weights from the pretrained model to our model  

Note that this takes a while because the pretrained model is large.  
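
The three steps can be pictured with plain dictionaries. This is a toy sketch under simplifying assumptions (a `dict` stands in for the gensim objects, and `merge_pretrained` is a hypothetical name); it is not how gensim stores or copies weights internally.

```python
import random

def merge_pretrained(own_vectors, pretrained_vectors, dim):
    """Toy version of the steps: extend the vocabulary, then copy weights."""
    merged = dict(own_vectors)
    # Step 2: add every pretrained word that is missing, with a random initial vector
    for word in pretrained_vectors:
        if word not in merged:
            merged[word] = [random.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    # Step 3: overwrite the vectors of all shared words with the pretrained weights
    for word, vector in pretrained_vectors.items():
        merged[word] = list(vector)
    return merged

# Step 1: "open" a toy pretrained model (a plain dict standing in for KeyedVectors)
pretrained = {'look': [0.1, 0.2], 'for': [0.3, 0.4]}
own = {'look': [0.0, 0.0], 'houli': [0.5, 0.5]}

merged = merge_pretrained(own, pretrained, dim=2)
print(sorted(merged))
```

Words that only occur in our own data (`houli` in the toy example) keep their own vectors, while every pretrained word ends up with its pretrained weights.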

In [9]:
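def load_pretrained_model(model, pretrained_path):
    # 1. open the pretrained model
    pretrained_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
    # 2. add all pretrained words to our vocabulary
    model.build_vocab([list(pretrained_model.vocab.keys())], update=True)
    del pretrained_model   # free memory
    # 3. copy the pretrained weights for all shared words
    # Note: in gensim 3.x, lockf=0.0 keeps the imported vectors fixed during training; lockf=1.0 lets them be updated
    model.intersect_word2vec_format(pretrained_path, binary=True, lockf=0.0)
    return model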

Then we can start training. We use 10 epochs by default because the model performed best with this setting.  
Note that training also takes a while (about ? minutes for ? epochs).

In [10]:
def train_model(training_data, model, example_count, epochs=10):
    return model.train(training_data,
                       total_examples = example_count,
                       epochs = epochs)

# Get phrase embeddings

To generate phrase embeddings, we propose two methods.  
We introduce and demonstrate both in the following sections.  

## Method A
Simply extract the embeddings of the phrases listed in `T9856_phrase_all.txt` from the `vector.kv` file.    
You need to create a folder for the extracted `.npy` files; we use `embeddings` here.

![](images/MethodA_model.png)
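
Method A relies on each phrase being a single token in the training corpus, which is why the lookup code replaces spaces with underscores. The preprocessing that produced the hyphened corpus is not shown in this notebook, so the sketch below is only one plausible way to join known phrases inside a sentence; `join_phrases` and the example sentence are hypothetical.

```python
import re

def join_phrases(sentence, phrases):
    """Replace each known phrase in `sentence` with its underscore-joined form."""
    # Longer phrases first, so a long phrase is not broken up by a shorter one
    for phrase in sorted(phrases, key=len, reverse=True):
        joined = phrase.replace(' ', '_')
        # Match the phrase only when it is delimited by whitespace or string ends
        pattern = r'(?<!\S)' + re.escape(phrase) + r'(?!\S)'
        sentence = re.sub(pattern, lambda match: joined, sentence)
    return sentence

print(join_phrases('we look for the key', ['look for the']))
```

After this step, `look_for_the` is an ordinary vocabulary entry, so its embedding can be looked up like that of any other word.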

In [None]:
from gensim.models import KeyedVectors

import numpy as np

In [None]:
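# read one phrase per line
lb = []
with open('data/T8956_phrase_all.txt', 'r') as f:
    for lines in f:
        lb.append(lines.replace('\n', ''))

# load the word vectors of the trained model
word_vectors = KeyedVectors.load('model/w3_a0.025_300_10i/vector.kv')

# phrases are looked up as single tokens joined by underscores
lb_dash = [lbs.replace(' ', '_') for lbs in lb]

# save the embedding of every phrase found in the vocabulary
for lbs in lb_dash:
    if lbs in word_vectors:
        path = 'embeddings/'+lbs
        np.save(path, word_vectors[lbs])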

After extracting some phrase embeddings, you can now go to [Compare similarities](#Compare-similarities) section to see how similar the phrases are, or you can continue to go through [our Method B](#Method-B) first.

## Method B

Unlike Method A, which joins each phrase into a single token and trains a new embedding model, Method B extracts the embeddings of **every word in a phrase**. We then use a sentence embedding model to **encode those words into a single phrase embedding**, as shown in the picture below.  
This is reasonable because phrases are combinations of words, and their meanings usually derive from those words.  
![](images/MethodB_model.png)
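
To make the idea concrete, here is a deliberately simplified stand-in for the sentence encoder: element-wise max pooling over the word vectors. InferSent itself runs a BiLSTM before pooling, so this sketch only shows how several word vectors collapse into one fixed-size vector. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def max_pool_phrase(word_embeddings):
    """Collapse a list of word vectors into one vector by element-wise max."""
    return np.max(np.stack(word_embeddings), axis=0)

# Three toy 3-dimensional "word embeddings" for a three-word phrase
toy_words = [np.array([0.1, -0.3, 0.5]),
             np.array([0.4, 0.2, -0.1]),
             np.array([-0.2, 0.0, 0.3])]

phrase_vec = max_pool_phrase(toy_words)
print(phrase_vec.shape)
```

Whatever the number of words in the phrase, the result has the same dimensionality, which is what makes phrases of different lengths comparable.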

For simplicity, we use [InferSent](https://github.com/facebookresearch/InferSent) with Facebook's pretrained model as our sentence embedding model.  
Before we start, we should prepare our environment for InferSent first.  

In [11]:
# The packages should have been installed in the first cell. If they were not installed correctly, please run these commands.
# If you cannot install PyTorch correctly, please refer to the official install instructions (https://pytorch.org/get-started/locally/)

!pip install nltk
!pip install torch



In [12]:
# If this is your first time using nltk, remember to download the punkt data first
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/nlplab/joaw/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

To use the InferSent model, we first need to download Facebook's pretrained weights.

In [13]:
# File size: 146M

!mkdir encoders
!curl -Lo encoders/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146M  100  146M    0     0  11.3M      0  0:00:12  0:00:12 --:--:-- 13.5M


The environment is now ready, so we can extract word embeddings from our finetuned Word2Vec model and feed them into InferSent to get phrase embeddings.  
There are four steps to achieve this.  
1. Set up the Word2Vec model
2. Get all word embeddings
3. Set up the InferSent model
4. Get phrase embeddings

### Set up Word2Vec

In the first step, we need a Word2Vec model from which to extract the word embeddings.  
To get the Word2Vec model for our task, you can either [run the training](#Training-by-yourself) (which takes about 10~20 minutes) or skip it and [download our finetuned model](#Load-our-finetuned-model) in the next section.

#### Training by yourself

Since all utility functions have been created above, we can simply call them to set up our training pipeline.  
1. load our training data
2. create gensim's Word2Vec model
3. load Google's pretrained model (this takes a while, about 8~10 minutes)
4. finetune the model

In [14]:
from time import time

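# gensim expects each sentence as a list of tokens, so split the sentences on whitespace
methodB_training_data = [sentence.split() for sentence in load_data('data/all_unhyphened_sent.txt')]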

print('Creating model...')
t = time()
w2v_model, example_count = create_model(methodB_training_data)
w2v_model = load_pretrained_model(w2v_model, 'models/GoogleNews-vectors-negative300.bin')
print(f'model loaded. time: {time()-t} sec.')

# train model
print('training model...')
train_model(methodB_training_data, w2v_model, example_count, epochs=5)
print(f'training finished')

Creating model...
model loaded. time: 433.63464617729187 sec.
training model...
training finished


After the model is trained, we only need its word vectors (`w2v_model.wv`) to look up embeddings.

In [15]:
w2v = w2v_model.wv

#### Load our finetuned model
If you have trained the model yourself, you can skip this section and go to [the next part](#Get-list-of-word-embeddings).  
Otherwise, please run the cells below to download the word vectors trained on our task from [this Google Drive](https://drive.google.com/file/d/1iRj7OVlETT2mDXafm7JXCmvWpPhj7mAS/view?usp=sharing), and extract all contents into the `models/` folder.  

In [16]:
# file size: 1.7G

!gdown -O models/unhyphened_model.tar.gz --id 1iRj7OVlETT2mDXafm7JXCmvWpPhj7mAS

Downloading...
From: https://drive.google.com/uc?id=1iRj7OVlETT2mDXafm7JXCmvWpPhj7mAS
To: /home/nlplab/joaw/projects/TAAI_gen_paraphrase/models/unhyphened_model.tar.gz
1.77GB [00:36, 48.3MB/s]


In [17]:
!tar xzvf models/unhyphened_model.tar.gz -C models/

unhyphened_model.kv
unhyphened_model.kv.vectors.npy


Then we can load and use those word vectors!

In [18]:
w2v = KeyedVectors.load('models/unhyphened_model.kv')

### Get list of word embeddings

Now that the Word2Vec model is ready, we can extract word embeddings from it!   
To make the whole process easier, we first create a function that extracts the embeddings of all words in a phrase string. 

In [19]:
def get_word_embeddings(model, phrase):
    words = phrase.split(' ')
    word_embeddings, unfound_words = [], []
    for word in words:
        try:
            emb = model[word]
            word_embeddings.append(emb)
        except KeyError:
            # the word is not in the vocabulary, so skip it
            unfound_words.append(word)
    return word_embeddings

Here's a simple test to check the results of `get_word_embeddings`.  
You should get a list of embeddings in return.  

In [20]:
methodB_phrase1 = 'look for the'
word_embs = get_word_embeddings(w2v, methodB_phrase1)

In [21]:
print(len(word_embs))
print(word_embs[0].shape)

3
(300,)
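
Note that `get_word_embeddings` skips words that are missing from the vocabulary, so the returned list can be shorter than the phrase. The sketch below shows the same split into known and unknown words with a plain `set` standing in for the Word2Vec vocabulary; `split_known_unknown` and the made-up word `zzyzx` are hypothetical.

```python
def split_known_unknown(vocab, phrase):
    """Separate the words of `phrase` into in-vocabulary and out-of-vocabulary words."""
    known, unknown = [], []
    for word in phrase.split(' '):
        if word in vocab:
            known.append(word)
        else:
            unknown.append(word)
    return known, unknown

toy_vocab = {'look', 'for', 'the'}
known, unknown = split_known_unknown(toy_vocab, 'look for the zzyzx')
print(known, unknown)
```

Checking the length of the result against the number of words in the phrase is a cheap way to notice dropped words.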


### Setup InferSent

We create an InferSent object with the specified parameters and then load the pretrained model we downloaded above.  

In [22]:
import torch
from infersent import InferSent

# default config of InferSent
config = {'bsize': 64, 
          'word_emb_dim': 300, 
          'enc_lstm_dim': 2048,
          'pool_type': 'max', 
          'dpout_model': 0.0, 
          'version': 2}

infersent = InferSent(config)
infersent.load_state_dict(torch.load('encoders/infersent2.pkl'))

<All keys matched successfully>

### Get phrase embeddings

Before we use the InferSent model, we have to convert the word embeddings into an InferSent-compatible batch. We create a function to do the job.  

In [23]:
import os
import numpy as np

In [24]:
def transform_batch(word_embs):
    # load beginning-of-sent and end-of-sent embedding
    emb_bos = np.load(os.path.join('word_embs', 'bos.npy'))
    emb_eos = np.load(os.path.join('word_embs', 'eos.npy'))
    
    # extract embeddings
    lengths = len(word_embs) + 2
    embeddings = np.vstack((emb_bos, np.array(word_embs), emb_eos))
    
    batch = np.zeros((lengths, 1, 300))
    for i in range(len(embeddings)):
        batch[i][0][:] = embeddings[i]
    
    return torch.FloatTensor(batch), np.array([lengths])

We can use the word embeddings extracted above to check the output tensor.

In [25]:
batch, length = transform_batch(word_embs)

Once the tensor is prepared, we can extract phrase embeddings from InferSent!

In [26]:
with torch.no_grad():
    methodB_pharse_emb1 = infersent.forward((batch, length)).data.cpu().numpy()
print(methodB_pharse_emb1)

[[ 0.00746889 -0.06208688  0.0579672  ... -0.01622153 -0.02536337
  -0.01013366]]


### Method B - All in one

In [27]:
methodB_phrase2 = 'check out the'
word_embs = get_word_embeddings(w2v, methodB_phrase2)
batch, length = transform_batch(word_embs)
with torch.no_grad():
    methodB_pharse_emb2 = infersent.forward((batch, length)).data.cpu().numpy()
print(methodB_pharse_emb2)

# save the embeddings in numpy format if you want
# out_path = 'phrase'
# np.save(out_path, methodB_pharse_emb2)

[[ 0.00746889 -0.08161387  0.05412931 ... -0.02657945 -0.02917238
  -0.01013366]]


# Compare similarities

Once we have two embeddings, we can compare them using cosine similarity.  

In [28]:
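import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    # flatten the inputs so that (1, d) arrays are handled like plain vectors
    a, b = np.ravel(a), np.ravel(b)
    ret = np.dot(a, b) / (norm(a) * norm(b))
    return 0.0 if np.isnan(ret) else float(ret)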

In [29]:
# phrase1, phrase2 = methodA_pharse_emb1, methodA_pharse_emb2
phrase1, phrase2 = methodB_pharse_emb1, methodB_pharse_emb2

In [30]:
print('{:.3f}'.format(cosine_similarity(phrase1, phrase2)))

0.894
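
As a quick sanity check of the formula, here is a pure-Python sketch with vectors that can be verified by hand. The name `cosine` is hypothetical and separate from `cosine_similarity` above.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))   # same direction      -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal          -> 0.0
print(cosine([1.0, 0.0], [-2.0, 0.0]))  # opposite direction  -> -1.0
```

The score depends only on the angle between the vectors, not on their lengths, so the third pair scores -1.0 even though the second vector is twice as long.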


## Find the most similar phrases
To find the most similar phrases, we first have to extract the embeddings of all phrases and store them in a folder. To keep the tutorial simple, we do not do this in the notebook; you can use the Python scripts in [our GitHub](https://github.com/NTHU-NLPLAB/TAAI_gen_paraphrase) to do the job.  
Once all phrase embeddings are stored in a folder, we can load them and do the comparison.

First, load all phrase embeddings.

In [None]:
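emb_folder = ''   # set this to the folder that holds the saved .npy phrase embeddings
bundles_emb = {}
for filename in os.listdir(emb_folder):
    # file names use underscores instead of spaces, e.g. look_for_the.npy
    bundle = os.path.splitext(filename)[0].replace('_', ' ')
    emb = np.load(os.path.join(emb_folder, filename), allow_pickle=True)
    bundles_emb[bundle] = emb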

Then we define a utility function to help us find and print the most similar phrases.

In [None]:
def most_similar(target, bundles, n=5):
    similarities = []
    target_emb = bundles[target]
    for bundle, bundle_emb in bundles.items():
        if bundle == target: continue
        similarities.append((target, bundle, cosine_similarity(target_emb, bundle_emb)))
    similarities.sort(key=lambda emb:-emb[2])
    return similarities[:n]

def print_similarity(tuples):
    head = True
    for t in tuples:
        if head:
            print(f'{t[0]}')
            head = False
        print(f'  > {t[1]}\t{t[2]:.2f}')
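
When there are many phrases, the per-pair loop can be replaced by one matrix operation. The following is a minimal sketch, assuming all embeddings fit in memory; `top_n_similar` and the toy two-dimensional embeddings are hypothetical and not part of the original code.

```python
import numpy as np

def top_n_similar(target, embeddings, n=5):
    """Return the n phrases most similar to `target` as (phrase, score) pairs."""
    phrases = [p for p in embeddings if p != target]
    matrix = np.stack([np.ravel(embeddings[p]) for p in phrases])
    query = np.ravel(embeddings[target])
    # Cosine similarity of the query against every row at once
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Indices of the highest scores first
    order = np.argsort(-scores)[:n]
    return [(phrases[i], float(scores[i])) for i in order]

toy = {'look for': np.array([1.0, 0.0]),
       'search for': np.array([0.9, 0.1]),
       'give up': np.array([-1.0, 0.2])}

result = top_n_similar('look for', toy, n=2)
print(result)
```

The ranking is the same as with the loop-based version; only the amount of Python-level iteration changes.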

Then we can set up an interactive search loop. Enjoy!

In [None]:
while True:
    query = input('input: ')
    if query in ['quit', 'q']: break
    if query not in bundles_emb:
        print('phrase not found')
        continue
    print_similarity(most_similar(query, bundles_emb))