# Introduction

 intro

# Related work

related work

# Main

## Prepare environment

In [None]:
!pip install 

## Load data

For simplicity, we simply load our preprocessed data.
The data contain one sentence per line. All sentences have already been tokenized and lemmatized, so they can be fed directly into Word2Vec for training.

In [4]:
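def load_data(data_path):
    # Each line holds one tokenized sentence; split it into a list of tokens,
    # which is the input format gensim's Word2Vec expects.
    with open(data_path, 'r') as f:
        return [line.split() for line in f]

The result will be a list of processed sentences, each one a list of tokens.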

In [None]:
ret = load_data('data/all_hyphened_sent.txt')
ret[:3]
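
Before training, it can help to sanity-check the loaded corpus. The following is a minimal sketch, assuming one tokenized sentence per item as described above; the sample sentences and the helper name `corpus_stats` are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

def corpus_stats(sentences):
    """Count sentences, tokens, and per-token frequencies in a tokenized corpus."""
    counts = Counter()
    for sent in sentences:
        # Accept either a whitespace-separated string or a list of tokens.
        tokens = sent.split() if isinstance(sent, str) else sent
        counts.update(tokens)
    return len(sentences), sum(counts.values()), counts

# Hypothetical sample; the real corpus is not reproduced here.
sample = ['the cat sit on the mat', 'we look_for the key']
n_sent, n_tok, counts = corpus_stats(sample)
print(n_sent, n_tok, counts.most_common(1))
```

The token frequencies give a rough idea of how many words a `min_count` threshold would keep.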

## Train Word2Vec

### Download pretrained model

First, download Google's pretrained Word2Vec model. It was trained on part of the Google News dataset (about 100 billion words).  
You can find more details of this model on [Google's website](https://code.google.com/archive/p/word2vec/).

In [None]:
!pip install gdown

In [None]:
!mkdir -p models
!gdown -O models/GoogleNews-vectors-negative300.bin --id 0B7XkCwpI5KDYNlNUTTlSS21pQmM

### Finetune our own model

We use [gensim](https://github.com/RaRe-Technologies/gensim) to help us train the model.

In [None]:
!pip install gensim

In [None]:
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec

In [None]:
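def create_model(training_data, emb_dim=300):
    # `size` is the gensim 3.x name for the embedding dimension
    # (gensim 4.x renamed it to `vector_size`).
    model = Word2Vec(size=emb_dim,
                     min_count=1)
    model.build_vocab(training_data)
    example_count = model.corpus_count
    return model, example_count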

Now we can load Google's pretrained weights into our model.  
(This may take a while.)

In [None]:
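def load_pretrained_model(model, pretrained_path):
    pretrained_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
    # add the pretrained vocabulary to our model (gensim 3.x API)
    model.build_vocab([list(pretrained_model.vocab.keys())], update=True)
    del pretrained_model   # free memory
    # copy the pretrained vectors for the words our model knows;
    # lockf=1.0 keeps them trainable during finetuning
    model.intersect_word2vec_format(pretrained_path, binary=True, lockf=1.0)
    return model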

Then we can start training. We use 10 epochs by default because the model performed best with this setting.  
Note that training takes a while, too (about ? minutes for ? epochs).

In [None]:
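def train_model(model, example_count, epochs=10):
    # NOTE: reads the notebook-level variable `training_data`,
    # so that variable must hold the corpus passed to create_model.
    return model.train(training_data,
                       total_examples=example_count,
                       epochs=epochs)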

## Get phrase embeddings

simple intro of two method

### Method A
Look up each phrase listed in T8956_phrase_all.txt in the trained vector.kv file and save its embedding.  
Create a folder for the extracted .npy files first; we use 'embeddings' here.

In [21]:
from gensim.models import KeyedVectors

import numpy as np

In [42]:
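# read the target phrases, one per line
with open('data/T8956_phrase_all.txt', 'r') as f:
    phrases = [line.rstrip('\n') for line in f]

word_vectors = KeyedVectors.load('model/w3_a0.025_300_10i/vector.kv')

# the words of a phrase are joined with '_' before the lookup
phrase_keys = [phrase.replace(' ', '_') for phrase in phrases]

# save the embedding of every phrase found in the model as embeddings/<phrase>.npy
for key in phrase_keys:
    if key in word_vectors:
        np.save('embeddings/' + key, word_vectors[key])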

![](images/MethodA_model.png)
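
To read the saved vectors back later, something like the following sketch could be used. It assumes the folder layout produced above (one `.npy` file per phrase, named after the underscore-joined phrase). The helper name `load_phrase_embeddings` is hypothetical, and the demo writes a random vector to a temporary folder instead of touching real data.

```python
import os
import tempfile

import numpy as np

def load_phrase_embeddings(folder):
    """Map each phrase (file name without .npy) to its saved vector."""
    embeddings = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith('.npy'):
            phrase = name[:-len('.npy')]
            embeddings[phrase] = np.load(os.path.join(folder, name))
    return embeddings

# Demo with a made-up phrase and a random 300-dimensional vector.
with tempfile.TemporaryDirectory() as folder:
    np.save(os.path.join(folder, 'look_for'), np.random.rand(300))
    loaded = load_phrase_embeddings(folder)

print({phrase: vec.shape for phrase, vec in loaded.items()})
```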

### Method B

As an alternative to hyphenating phrases and training a new embedding model, Method B looks up the embedding of every word in a phrase in a general word embedding model. We then use a sentence embedding model to encode those word embeddings into a single phrase embedding, as the picture below shows. This is reasonable because a phrase is a combination of words.

![](images/MethodB_model.png)

For simplicity, we use InferSent with Facebook's released pretrained weights as our sentence embedding model.  
This lets us extract word embeddings directly from our finetuned word2vec model and then feed them into InferSent.  

First we need to load the training data and the pretrained word2vec model.

In [None]:
training_data = load_data('data/all_unhyphened_sent.txt')

print('Creating model...')
w2v, example_count = create_model(training_data)
w2v = load_pretrained_model(w2v, 'models/GoogleNews-vectors-negative300.bin')

# train model (train_model reads the notebook-level `training_data`)
print('Training model...')
train_model(w2v, example_count, epochs=5)

Now we extract word embeddings from our model!

In [None]:
def get_word_embeddings(model, phrase):
    words = phrase.split()
    word_embeddings, unfound_words = [], []
    for word in words:
        try:
            word_embeddings.append(model.wv[word])
        except KeyError:   # word is not in the model's vocabulary
            unfound_words.append(word)
    if unfound_words:
        print('Words not found in the model:', unfound_words)
    return word_embeddings

In [None]:
my_phrase = 'look for the'
word_embs = get_word_embeddings(w2v, my_phrase)

To use the InferSent model, we need to download Facebook's pretrained weights first.

In [None]:
!mkdir -p encoders
!curl -Lo encoders/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

Then we load the pretrained model.

In [None]:
import torch
from infersent import InferSent

# default config of InferSent
config = {'bsize': 64, 
          'word_emb_dim': 300, 
          'enc_lstm_dim': 2048,
          'pool_type': 'max', 
          'dpout_model': 0.0, 
          'version': 2}

infersent = InferSent(config)
infersent.load_state_dict(torch.load('encoders/infersent2.pkl'))

Before we use the InferSent model, we need to convert the word embeddings into a batch that is compatible with InferSent.

In [None]:
import os

def transform_batch(word_embs):
    # load beginning-of-sentence and end-of-sentence embeddings
    emb_bos = np.load(os.path.join('word_embs', 'bos.npy'))
    emb_eos = np.load(os.path.join('word_embs', 'eos.npy'))

    # surround the word embeddings with the bos and eos embeddings
    embeddings = np.stack([emb_bos, *word_embs, emb_eos])
    length = len(embeddings)

    # InferSent expects a batch of shape (seq_len, batch_size, emb_dim)
    batch = np.zeros((length, 1, 300))
    batch[:, 0, :] = embeddings

    # one length per sentence in the batch
    return torch.FloatTensor(batch), np.array([length])

In [None]:
batch, length = transform_batch(word_embs)
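
The batch layout can be checked without the real embeddings. The sketch below uses random vectors in place of the word, `bos`, and `eos` embeddings (an assumption made only for illustration) and builds an array of shape `(seq_len, batch_size, emb_dim)`, which is the layout InferSent's `forward` takes as input.

```python
import numpy as np

emb_dim = 300
rng = np.random.default_rng(0)

# Stand-ins for the real vectors (assumption: random data for illustration).
emb_bos = rng.standard_normal(emb_dim)
emb_eos = rng.standard_normal(emb_dim)
word_embs = [rng.standard_normal(emb_dim) for _ in range(3)]  # e.g. 'look for the'

# Put the bos and eos vectors around the words, then add a batch axis of size 1.
embeddings = np.stack([emb_bos, *word_embs, emb_eos])  # (seq_len, emb_dim)
batch = embeddings[:, np.newaxis, :]                   # (seq_len, 1, emb_dim)
lengths = np.array([len(embeddings)])                  # one length per batch item

print(batch.shape, lengths)
```

A three-word phrase therefore yields a sequence length of five, since the two sentence-boundary vectors are counted as well.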

Now we can extract phrase embeddings from InferSent!

In [None]:
with torch.no_grad():
    phrase_emb = infersent.forward((batch, length)).data.cpu().numpy()
print(phrase_emb)

# Compare similarities

To use our model for finding similar phrases, we first need to extract the embeddings of all phrases.
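
One common way to compare embeddings is cosine similarity. The sketch below assumes that measure (the text above does not name one) and ranks a few made-up 3-dimensional vectors against a query. The phrases, the vectors, and the helper name `most_similar` are all hypothetical.

```python
import numpy as np

def most_similar(query, embeddings, top_k=2):
    """Return the top_k (phrase, cosine similarity) pairs for a query vector."""
    names = list(embeddings)
    matrix = np.stack([embeddings[name] for name in names])
    # Normalize rows so that a dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    order = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i])) for i in order]

# Toy embeddings for illustration only.
toy = {
    'look_for':   np.array([1.0, 0.0, 0.0]),
    'search_for': np.array([0.9, 0.1, 0.0]),
    'give_up':    np.array([0.0, 1.0, 0.0]),
}
ranked = most_similar(toy['look_for'], toy)
print(ranked)
```

With real data, `embeddings` would hold the phrase vectors produced by Method A or Method B, and the query phrase itself would normally be dropped from the ranking.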