# Introduction

This notebook walks the reader through our proposed methods for extracting phrase embeddings.  
For the full code release and all training scripts, please see [this GitHub repo](https://github.com/NTHU-NLPLAB/TAAI_gen_paraphrase).  

# Main

## prepare environment

In [None]:
# !pip install -r requirement.txt

## load data

To keep things simple, we load our preprocessed data directly.  
The data contains one sentence per line. All sentences have been tokenized and lemmatized, so they can be fed directly into Word2Vec for training.

In [1]:
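def load_data(data_path):
    # Read one preprocessed sentence per line; `with` closes the file afterwards
    with open(data_path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]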

The result will be a list of processed sentences.

In [2]:
ret = load_data('data/all_hyphened_sent.txt')
ret[:3]

['and end at houli horse farm ( ) , or -PRON- could go further on to the lovely pilu buddhist monastery ( p )',
 'the dumpling be a big favourite',
 'a chill venue where musicindustry type hang , and everyone seem to know everyone else']
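
gensim's Word2Vec consumes each sentence as a list of tokens rather than as a raw string. Because the preprocessed lines are already tokenized and space-separated, a whitespace split should be enough. The sketch below uses two of the sentences shown above; the helper name `to_tokens` is our own illustration and is not part of the released scripts.

```python
# Minimal sketch: turn preprocessed lines into token lists.
# `to_tokens` is a hypothetical helper, not part of the released code.
def to_tokens(sentences):
    # The lines are already tokenized, so splitting on whitespace is enough.
    return [sentence.split() for sentence in sentences]

sample = [
    'the dumpling be a big favourite',
    'a chill venue where musicindustry type hang , and everyone seem to know everyone else',
]
tokenized = to_tokens(sample)
print(tokenized[0])
```

Punctuation such as `,` stays as its own token, which matches how the preprocessed file separates it with spaces.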

## train Word2Vec

### Download pretrained model  
For convenience, we do not train the model from scratch; instead, we fine-tune an existing pretrained model.  
Hence, we first download Google's pretrained Word2Vec model. It was trained on part of the Google News dataset, which contains about 100 billion words.  
For more details, please refer to [Google's website](https://code.google.com/archive/p/word2vec/).

In [3]:
!pip install gdown


In [4]:
# File size: 1.5G

!mkdir models
!gdown -O models/GoogleNews-vectors-negative300.bin.gz --id 0B7XkCwpI5KDYNlNUTTlSS21pQmM

mkdir: models: File exists
Downloading...
From: https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
To: /Users/joaw/workspace/TAAI_gen_paraphrase/models/GoogleNews-vectors-negative300.bin
1.65GB [01:13, 22.4MB/s]


In [22]:
!gzip -d models/GoogleNews-vectors-negative300.bin.gz

### Finetune our own model 

After the model is downloaded, we use the [gensim](https://github.com/RaRe-Technologies/gensim) package to fine-tune it.  
  
**gensim** is a library for training and using models for NLP and IR tasks. It implements many well-known models, such as Word2Vec, Doc2Vec, and FastText. Here we use gensim's API to load and fine-tune the Word2Vec model.  
For more information on gensim, please refer to:
 - [Official Word2Vec Tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html)
 - [Word2Vec's API documentation](https://radimrehurek.com/gensim/models/word2vec.html)

In [5]:
# gensim should have been installed in the first cell. If it was not installed correctly, please run this command.

!pip install gensim


In [6]:
from gensim.models import Word2Vec             # This is Word2Vec's base model in gensim
from gensim.models import KeyedVectors         # This stores all vocabulary information
# from gensim.models.callbacks import CallbackAny2Vec   # This makes us able to record

First, we create a function that sets up a Word2Vec object with the specified parameters.  
Since we are using Google's model, we set the embedding dimension (`size`, renamed `vector_size` in gensim 4) to match the dimension of the pretrained model. 
`min_count` is set to `1` so that the pretrained model's vocabulary list is loaded correctly.  


In [16]:
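def create_model(training_data, emb_dim=300):
    # training_data: iterable of tokenized sentences (each one a list of words)
    model = Word2Vec(size = emb_dim,
                     min_count = 1)
    model.build_vocab(training_data)
    # corpus_count is the number of sentences seen while building the vocabulary
    example_count = model.corpus_count
    return model, example_count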
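
To see why `min_count` matters here, recall that it discards every word whose total frequency is below the threshold. The pretrained vocabulary is later added through a single pseudo-sentence in which each word occurs exactly once, so any threshold above `1` would drop those words. The following sketch mimics the thresholding with a plain counter on a made-up corpus; it is an illustration of the idea, not gensim's implementation.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is a list of tokens.
corpus = [['look', 'for', 'the', 'key'],
          ['check', 'out', 'the', 'menu'],
          ['look', 'out']]

def vocabulary(sentences, min_count):
    # Count every token and keep those seen at least `min_count` times,
    # which mirrors what the `min_count` threshold does to the vocabulary.
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

print(sorted(vocabulary(corpus, min_count=1)))
print(sorted(vocabulary(corpus, min_count=2)))
```

With `min_count=2` only the words that occur twice survive, while `min_count=1` keeps all seven.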

Now we can load Google's pretrained weights into our model.  
Since fine-tuning is not officially supported by gensim, some of the work has to be done by ourselves:  

1. open the pretrained model  
2. add all pretrained vocabulary entries to our vocabulary list  
3. copy all weights from the pretrained model to our model  

Note that this takes a while because of the pretrained model's large size.  

In [8]:
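def load_pretrained_model(model, pretrained_path):
    # 1. open the pretrained vectors
    pretrained_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
    # 2. add every pretrained word to our vocabulary (as one pseudo-sentence)
    model.build_vocab([list(pretrained_model.vocab.keys())], update=True)
    del pretrained_model   # free memory
    # 3. copy the pretrained weights; lockf=0.0 keeps the imported vectors frozen
    #    during later training, while lockf=1.0 would let them be updated
    model.intersect_word2vec_format(pretrained_path, binary=True, lockf=0.0)
    return model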
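
The three steps can be pictured on a toy example without gensim. The sketch below uses plain dictionaries and NumPy arrays as stand-ins for the real models; the words, the 4-dimensional vectors, and the random initialisation of unseen words are all assumptions made for illustration and do not reproduce gensim's internals.

```python
import numpy as np

# Toy stand-ins, not real data: a "pretrained" table and our own vocabulary.
rng = np.random.default_rng(0)
dim = 4
pretrained = {'look': rng.normal(size=dim), 'the': rng.normal(size=dim)}
our_vocab = ['look', 'houli']          # 'houli' is missing from the pretrained table

# Step 2: merge both vocabularies into one list.
merged_vocab = sorted(set(our_vocab) | set(pretrained))

# Step 3: copy pretrained weights where they exist, random-initialise the rest.
weights = {}
for word in merged_vocab:
    if word in pretrained:
        weights[word] = pretrained[word].copy()
    else:
        weights[word] = rng.normal(scale=0.1, size=dim)

print(merged_vocab)
```

Words found in the pretrained table start from their pretrained vectors, and words that only occur in our corpus start from scratch.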

Then we can start training. We use 10 epochs by default because the model performs best at this setting.  
Note that training also takes a while (about ? minutes for ? epochs).

In [9]:
def train_model(model, example_count, epochs=10):
    # NOTE: reads the tokenized sentences from the global variable `training_data`
    return model.train(training_data,
                       total_examples = example_count,
                       epochs = epochs)

## Get phrase embeddings

To generate phrase embeddings, we propose two methods.  
We introduce and demonstrate both in the following sections.  

### Method A
Simply look up the embedding of each phrase listed in T9856_phrase_all.txt in the vector.kv file.    
You need a folder to store the extracted .npy files; we use 'embeddings' here.

![](images/MethodA_model.png)

In [None]:
from gensim.models import KeyedVectors

import numpy as np

In [None]:
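import os

# Read one phrase per line
with open('data/T8956_phrase_all.txt', 'r') as f:
    lb = [line.replace('\n', '') for line in f]

word_vectors = KeyedVectors.load('model/w3_a0.025_300_10i/vector.kv')

# Phrases are stored in the model with '_' joining their words
lb_dash = [lbs.replace(' ', '_') for lbs in lb]

# Create the output folder if it does not exist yet
os.makedirs('embeddings', exist_ok=True)
for lbs in lb_dash:
    if lbs in word_vectors:
        np.save(os.path.join('embeddings', lbs), word_vectors[lbs])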
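
The lookup only succeeds for phrases that appear as a single key in the trained model. The sketch below replays the same key convention on a plain dictionary, so the skipping behaviour is easy to see; the phrases and vectors are invented for illustration.

```python
# Toy stand-in for the KeyedVectors object: phrase keys use '_' between words.
toy_vectors = {'look_for': [0.1, 0.2], 'check_out': [0.3, 0.4]}

phrases = ['look for', 'check out', 'hang around']

found, missing = {}, []
for phrase in phrases:
    key = phrase.replace(' ', '_')      # same key convention as the cell above
    if key in toy_vectors:
        found[phrase] = toy_vectors[key]
    else:
        missing.append(phrase)          # phrase has no key in the toy table

print(sorted(found), missing)
```

Phrases without a key are silently skipped by the cell above, so collecting them in a list like `missing` can help when checking coverage.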

After extracting some phrase embeddings, you can now go to [Compare similarities](#Compare-similarities) section to see how similar the phrases are, or you can continue to go through [our Method B](#Method-B) first.

### Method B

Unlike Method A, which hyphenates all phrases and trains a new embedding model, Method B extracts the embedding of **every word in a phrase**. We then use a sentence embedding model to **encode those words into a single phrase embedding**, as the picture below shows.  
This is reasonable because phrases are combinations of words, and their meanings usually come from those words.  
![](images/MethodB_model.png)
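
As a rough intuition for "encoding several word vectors into one phrase vector", the sketch below applies element-wise max pooling directly to the word vectors. This is a deliberate simplification: InferSent runs a bidirectional LSTM over the words before pooling, and that part is left out here. The function name and the 3-dimensional vectors are hypothetical.

```python
import numpy as np

def pool_phrase(word_vectors):
    # Stack the word vectors into a (num_words, dim) matrix and take the
    # element-wise maximum, giving one fixed-size vector per phrase.
    return np.max(np.stack(word_vectors), axis=0)

# Hypothetical 3-dimensional word vectors for a two-word phrase.
look = np.array([0.2, -0.5, 0.1])
out = np.array([-0.3, 0.4, 0.0])
phrase_vec = pool_phrase([look, out])
print(phrase_vec)
```

Whatever the number of words, the result has the same dimension as a single word vector, which is what makes phrases of different lengths comparable.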

For simplicity, we use [InferSent](https://github.com/facebookresearch/InferSent) with Facebook's pretrained model as our sentence embedding model.  
Before we start, we should prepare our environment for InferSent first.  

In [12]:
# These packages should have been installed in the first cell. If they were not installed correctly, please run these commands.
# If you cannot install pytorch correctly, please refer to the official install instructions (https://pytorch.org/get-started/locally/)

!pip install nltk
!pip install torch



In [13]:
# If this is your first time using nltk, remember to download the punkt data first
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/joaw/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

To use the InferSent model, we first need to download Facebook's pretrained weights.

In [14]:
# File size: 146M

!mkdir encoders
!curl -Lo encoders/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

mkdir: encoders: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146M  100  146M    0     0  12.1M      0  0:00:12  0:00:12 --:--:-- 14.5M


The environment is now ready, so we can extract word embeddings from our fine-tuned Word2Vec model and feed them into InferSent to get phrase embeddings.  
There are four steps to achieve this.  
1. Set up Word2Vec model
2. Get all word embeddings
3. Set up InferSent model
4. Get phrase embeddings

#### Set up Word2Vec

In the first step, we set up the Word2Vec model from which the word embeddings will be extracted.  
Since all utility functions have been created above, we simply call them to set up our training pipeline.  
1. load our training data
2. create gensim's Word2Vec model
3. load Google's pretrained model (this takes a while)
4. fine-tune the model (this also takes a while)

In [None]:
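# Word2Vec expects each sentence as a list of tokens, so split on whitespace
training_data = [sentence.split() for sentence in load_data('data/all_unhyphened_sent.txt')]

print('Creating model...')
w2v, example_count = create_model(training_data)
w2v = load_pretrained_model(w2v, 'models/GoogleNews-vectors-negative300.bin')

# train model (train_model reads the global `training_data`)
print('Training model...')
train_model(w2v, example_count, epochs=5)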

Now we extract word embeddings from our model!   
To make the whole process easier, we first create a function that extracts the embeddings of all words in a phrase string. 

#### Get list of word embeddings

In [None]:
def get_word_embeddings(model, phrase):
    words = phrase.split(' ')
    word_embeddings, unfound_words = [], []
    for word in words:
        try:
            emb = model.wv[word]
            word_embeddings.append(emb)
        except KeyError:
            # the word is not in the model's vocabulary, so skip it
            unfound_words.append(word)
    return word_embeddings

Here's a simple test to check results from `get_word_embeddings`.  
You should see a list of embeddings as return.  

In [None]:
methodB_phrase1 = 'look for the'
word_embs = get_word_embeddings(w2v, methodB_phrase1)

#### Setup InferSent

We have to create an InferSent object with the specified parameters, and then load the pretrained model we downloaded above.  

In [None]:
import torch
from infersent import InferSent

# default config of infersent
config = {'bsize': 64, 
          'word_emb_dim': 300, 
          'enc_lstm_dim': 2048,
          'pool_type': 'max', 
          'dpout_model': 0.0, 
          'version': 2}

infersent = InferSent(config)
infersent.load_state_dict(torch.load('encoders/infersent2.pkl'))

#### Get phrase embeddings

Before we use the InferSent model, we have to convert the word embeddings into an InferSent-compatible batch. We create a function here to do the job.  

In [None]:
import os

def transform_batch(word_embs):
    # load beginning-of-sent and end-of-sent embedding
    emb_bos = np.load(os.path.join('word_embs', 'bos.npy'))
    emb_eos = np.load(os.path.join('word_embs', 'eos.npy'))

    # surround the word embeddings with bos and eos: shape (seq_len, 300)
    embeddings = np.stack([emb_bos, *word_embs, emb_eos])
    seq_len = len(embeddings)

    # batch layout is (seq_len, batch_size, emb_dim); batch_size is 1 here
    batch = np.zeros((seq_len, 1, 300))
    batch[:, 0, :] = embeddings

    # lengths holds one entry per sentence in the batch
    return torch.FloatTensor(batch), np.array([seq_len])

We can use the word embeddings extracted above to check the output tensor.

In [None]:
batch, length = transform_batch(word_embs)
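
To make the expected shapes concrete, the sketch below builds a batch for one three-word phrase with NumPy only. The `(sequence length, batch size, embedding dimension)` layout is read off the `np.zeros((…, 1, 300))` allocation in the function above; the random vectors are placeholders for the real bos, word, and eos embeddings.

```python
import numpy as np

emb_dim = 300
rng = np.random.default_rng(0)

# Random placeholders for bos, three words and eos; the real values come
# from bos.npy, the fine-tuned Word2Vec model and eos.npy.
sequence = [rng.normal(size=emb_dim) for _ in range(5)]

# Time-major layout: (sequence length, batch size, embedding dimension).
batch = np.stack(sequence)[:, np.newaxis, :]
lengths = np.array([len(sequence)])

print(batch.shape, lengths)
```

A three-word phrase therefore yields a sequence length of 5, because the bos and eos embeddings are counted as well.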

Once the tensor is prepared, we can extract phrase embeddings from InferSent!

In [None]:
with torch.no_grad():
    methodB_pharse_emb1 = infersent.forward((batch, length)).data.cpu().numpy()
print(methodB_pharse_emb1)

#### Method B - All in one

In [None]:
methodB_phrase2 = 'check out the'
word_embs = get_word_embeddings(w2v, methodB_phrase2)
batch, length = transform_batch(word_embs)
with torch.no_grad():
    methodB_pharse_emb2 = infersent.forward((batch, length)).data.cpu().numpy()
print(methodB_pharse_emb2)

# save the embedding in numpy format if you want
# out_path = 'phrase'
# np.save(out_path, methodB_pharse_emb2)

# Compare similarities

Once we have two embeddings, we can measure how similar they are with cosine similarity.  

In [None]:
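import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    # flatten first so that (1, dim) outputs from InferSent give a scalar result
    a, b = np.ravel(a), np.ravel(b)
    ret = np.inner(a, b) / (norm(a) * norm(b))
    return 0 if np.isnan(ret) else ret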

In [None]:
# phrase1, phrase2 = methodA_pharse_emb1, methodA_pharse_emb2
phrase1, phrase2 = methodB_pharse_emb1, methodB_pharse_emb2

In [None]:
print('{:.3f}'.format(cosine_similarity(phrase1, phrase2)))
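
For readers who want a feel for the numbers, here is a small worked example on hand-picked 2-dimensional vectors. It defines its own helper, `cosine`, so that it runs independently of the cells above; the vectors are invented for illustration.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))
print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) ones score 0, and opposite ones score close to -1, regardless of their lengths.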

### Find the most similar phrases
To find the most similar phrases, we first have to extract the embeddings of all phrases and store them in a folder. To keep this tutorial simple, we do not do that in this notebook; the Python scripts in [our GitHub](https://github.com/NTHU-NLPLAB/TAAI_gen_paraphrase) can do the job for you.  
Once all phrase embeddings are stored in a folder, we can load them and do the comparison.

First, load all phrase embeddings.

In [None]:
import os

emb_folder = ''   # set this to the folder that holds the saved .npy phrase embeddings
bundles_emb = {}
for filename in os.listdir(emb_folder):
    if not filename.endswith('.npy'): continue
    bundle = os.path.splitext(filename)[0].replace('_', ' ')
    emb = np.load(os.path.join(emb_folder, filename), allow_pickle=True)
    bundles_emb[bundle] = emb
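
The file-name convention can be checked with a small round trip through a temporary folder. The sketch below saves two made-up embeddings as one `.npy` file per phrase, with `_` standing in for spaces, and then rebuilds the phrase-to-vector dictionary from the file names. The phrases and vectors are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as folder:
    # Save two toy embeddings: one .npy file per phrase, with '_' standing
    # in for the spaces (np.save appends the .npy extension itself).
    np.save(os.path.join(folder, 'look_for'), np.array([1.0, 0.0]))
    np.save(os.path.join(folder, 'check_out'), np.array([0.0, 1.0]))

    # Load them back into a {phrase: vector} dictionary.
    loaded = {}
    for filename in sorted(os.listdir(folder)):
        phrase = os.path.splitext(filename)[0].replace('_', ' ')
        loaded[phrase] = np.load(os.path.join(folder, filename))

print(sorted(loaded))
```

One caveat of this convention is that a phrase containing a literal underscore would come back with a space in its place.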

Then we define a utility function to help us find and print the most similar phrases.

In [None]:
def most_similar(target, bundles, n=5):
    similarities = []
    target_emb = bundles[target]
    for bundle, bundle_emb in bundles.items():
        if bundle == target: continue
        similarities.append((target, bundle, cosine_similarity(target_emb, bundle_emb)))
    similarities.sort(key=lambda emb:-emb[2])
    return similarities[:n]

def print_similarity(tuples):
    head = True
    for t in tuples:
        if head:
            print(f'{t[0]}')
            head = False
        print(f'  > {t[1]}\t{t[2]:.2f}')
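
The ranking logic can be tried out on a handful of made-up vectors before the real embeddings are available. The sketch below is self-contained and uses its own small helper, `rank_neighbours`, on three hypothetical 2-dimensional phrase embeddings; it follows the same idea as `most_similar` but is not a drop-in replacement for it.

```python
import numpy as np

# Hypothetical 2-dimensional phrase embeddings.
toy = {
    'look for': np.array([1.0, 0.0]),
    'search for': np.array([0.9, 0.1]),
    'check out': np.array([0.0, 1.0]),
}

def rank_neighbours(target, table, n=2):
    # Score every other phrase against the target and sort best-first.
    t = table[target]
    scored = [
        (name, float(np.dot(t, v) / (np.linalg.norm(t) * np.linalg.norm(v))))
        for name, v in table.items() if name != target
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

ranking = rank_neighbours('look for', toy)
print([name for name, _ in ranking])
```

The phrase whose vector points in nearly the same direction as the target comes first, and the orthogonal one comes last.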

Then we can set up an interactive search loop. Enjoy!

In [None]:
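while True:
    query = input('input: ')
    if query in ['quit', 'q']: break
    if query not in bundles_emb:
        # avoid a KeyError for phrases that have no stored embedding
        print('phrase not found')
        continue
    print_similarity(most_similar(query, bundles_emb))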