# Sentence similarity
Compare an incoming sentence to a list of candidate sentences and return the most similar one.  
Here we use the pre-trained word2vec model from Google. 

Reference:  
1. [Gensim package tutorial](https://radimrehurek.com/gensim/index.html)  
2. The Google pre-trained model can be downloaded from [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)  
3. [Official website for word2vec.](https://code.google.com/archive/p/word2vec/)  


## 1. Preparation: load the Google pre-trained word2vec model
The pre-trained model contains vectors for 3 million words and phrases, and 'GoogleNews-vectors-negative300.bin' is larger than 3 gigabytes. For convenience, we extract the word vectors, normalize each one to unit length, and save them to a pickle file. This step only needs to run once: as long as vocab_dict.pickle has been generated, the following code can be skipped.  

In [None]:
'''
import gensim
import pickle
from scipy.linalg import norm

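model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
# gensim >= 4.0 exposes the vocabulary as key_to_index; older releases use model.vocab
vocab = set(model.key_to_index) if hasattr(model, 'key_to_index') else set(model.vocab)

vocab_dict = dict()
for word in vocab:
    try:
        # index the KeyedVectors object directly; the .wv attribute is not needed here
        tmp = model[word]
        vocab_dict[word] = tmp / norm(tmp)
    except KeyError:
        # report any word whose vector cannot be looked up
        print(word)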


with open('vocab_dict.pickle','wb') as f:
    pickle.dump(vocab_dict, f)
'''
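
The reason for normalizing each vector before saving it is that the dot product of two unit-length vectors equals their cosine similarity, so no norms need to be recomputed at query time. The following is a minimal sketch of that identity using two made-up 3-dimensional vectors (the real word2vec vectors have 300 dimensions); the names `a`, `b`, `a_unit` and `b_unit` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Two made-up "word vectors" used only for illustration.
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Cosine similarity computed from the raw vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once (as the preparation step does), then use a plain dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

# Both values agree up to floating-point rounding (24/25 = 0.96).
print(cosine, dot_of_units)
```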

## 2. Start from here

In [1]:

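import pickle

import numpy as np
# the NLTK 'stopwords' and 'wordnet' corpora must be downloaded first (nltk.download)
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# shared lemmatizer used by QuerySimilarSentence.str_to_vec
wordnet = WordNetLemmatizer()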

In [5]:
class QuerySimilarSentence:
    '''
    Given a list of candidate sentences and a new input sentence, return the most similar candidate
    '''
    def __init__(self, candidates, vocab_dict_path='vocab_dict.pickle'):
        '''
        Arguments:
            candidates: a list of strings
            vocab_dict_path: the path of the vocab_dict.pickle file
        '''
        with open(vocab_dict_path,'rb') as f:
            self.vocab_dict = pickle.load(f)
        self.vocab = set(self.vocab_dict.keys())
        # stop words
        self.stop_words = set(stopwords.words('english')) 
        # tokenize the sentences
        self.tokenizer = RegexpTokenizer(r'\w+')
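        # candidate sentences and their word-vector matrices
        # (copy the list so that add_candidate does not modify the caller's list)
        self.candidates = list(candidates)
        self.candidates_vecs = [self.str_to_vec(s) for s in self.candidates]  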
    
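    def add_candidate(self, candidates):
        # append new candidate sentences to the existing ones
        candidates = list(candidates)
        self.candidates += candidates
        self.candidates_vecs += [self.str_to_vec(s) for s in candidates]
    
    def update_candidate(self, candidates):
        # replace the existing candidate sentences
        candidates = list(candidates)
        self.candidates = candidates
        self.candidates_vecs = [self.str_to_vec(s) for s in candidates]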
        
    def str_to_vec(self, s):
        '''
        Convert a sentence to a 2D array of word vectors
        Arguments:
            s: sentence string
        Return:
            vec: a 2D numpy array with one normalized word vector per row
                 (empty if no word of the sentence is kept)
        '''
        s = self.tokenizer.tokenize(s)
        # lemmatize each word as a noun (the lemmatizer's default), for example: things -> thing
        s = [wordnet.lemmatize(w.lower()) for w in s]
        # keep only words that are in the vocabulary and are not stop words
        s = [w for w in s if w in self.vocab and w not in self.stop_words]  
        vec = np.array([self.vocab_dict[w] for w in s])
        return vec

    def get_sim(self, vec_1, vec_2):
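        '''
        Compute a similarity score between two sentences from their word vectors:
        for each word in vec_1, take the highest cosine similarity to any word in vec_2,
        then average these maxima over the words of vec_1
        Arguments:
            vec_1,vec_2: 2D numpy arrays of normalized word vectors, one row per word
        Return:
            res: float, similarity of vec_1 to vec_2 (not symmetric in general)
        '''
        # a sentence with no in-vocabulary words cannot be matched
        if len(vec_1) == 0 or len(vec_2) == 0:
            return 0.0
        return np.mean(np.max(np.dot(vec_1, vec_2.T),axis=1))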
    
    def query(self, s):
        '''
        Return the candidate sentence that is most similar to the input sentence
        Arguments:
            s: string, input sentence
        Return:
            res: string, the most similar sentence in the candidate list
        '''
        s_vec = self.str_to_vec(s)
        # compute the similarity
        similarities = [self.get_sim(s_vec,c) for c in self.candidates_vecs]
        return self.candidates[np.argmax(similarities)]
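
To make the scoring in `get_sim` concrete, here is a small worked example on hand-made unit vectors. It is only a sketch: the standalone `get_sim` function below copies the expression used in the class, and the 2-dimensional vectors are hypothetical stand-ins for real word vectors. It also shows that the score depends on the argument order, because the average is taken over the words of the first argument.

```python
import numpy as np

def get_sim(vec_1, vec_2):
    # Same expression as QuerySimilarSentence.get_sim: for each row of vec_1,
    # take its best cosine match among the rows of vec_2, then average.
    return np.mean(np.max(np.dot(vec_1, vec_2.T), axis=1))

# Hypothetical unit-length "word vectors", one word per row.
query = np.array([[1.0, 0.0],
                  [0.0, 1.0]])      # a two-word sentence
candidate = np.array([[1.0, 0.0]])  # a one-word sentence

# query -> candidate: the first word matches perfectly (1.0), the second not at all (0.0).
score_qc = get_sim(query, candidate)

# candidate -> query: its only word has a perfect match in the query.
score_cq = get_sim(candidate, query)

print(score_qc, score_cq)  # 0.5 1.0
```

Because `query` passes the input sentence as the first argument, words of the input that have no counterpart in a candidate lower that candidate's score, while extra words in the candidate do not.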

In [6]:
candidates = ["""What's the temperature today?""","""How is the traffic?""","""What's the date today?""","""Play some music."""]
question = """How about the traffic?"""

In [7]:
query_s = QuerySimilarSentence(candidates)

In [8]:
query_s.query(question)

'How is the traffic?'

In [9]:
query_s.query("What's the degree?")

"What's the temperature today?"

In [10]:
query_s.query("Find a song")

'Play some music.'

## 3. Model improvement
1. The advantage of using the Google pre-trained model is its robustness to different wordings of the same request; the drawback is its large memory footprint (more than 3 gigabytes).  
2. There are three practical approaches to reducing the memory usage:  
    (1). The vocabulary contains 3 million words, most of which are irrelevant for a given application. Drop them and keep only a small subset of the vocabulary.  
    (2). Train your own word2vec model.  
    (3). Use one-hot vectors, which require no pre-trained model but only match identical words.
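
Approach (1) could be sketched as follows. This is an illustration under stated assumptions, not part of the original notebook: `prune_vocab`, `keep_words` and `small_vocab_dict` are hypothetical names, and the tiny `vocab_dict` here stands in for the real dictionary loaded from vocab_dict.pickle. How to choose `keep_words` (for example, from the words that occur in your own candidate sentences and expected queries) is left open.

```python
import pickle

def prune_vocab(vocab_dict, keep_words):
    """Return a new dict holding only the words in keep_words that exist in vocab_dict."""
    return {w: vocab_dict[w] for w in keep_words if w in vocab_dict}

# Toy stand-in for the real vocab_dict (word -> normalized vector).
vocab_dict = {
    'traffic': [1.0, 0.0],
    'music': [0.0, 1.0],
    'zyzzyva': [0.6, 0.8],
}

# Hypothetical application vocabulary; 'weather' is absent from the toy dict and is skipped.
keep_words = {'traffic', 'music', 'weather'}

small_vocab_dict = prune_vocab(vocab_dict, keep_words)
print(sorted(small_vocab_dict))  # ['music', 'traffic']

# Serialize the smaller dictionary. With the real data, pickle.dump to a file
# would be used instead, and that file's path passed as vocab_dict_path.
payload = pickle.dumps(small_vocab_dict)
```

The trade-off is that any word outside the pruned vocabulary is silently ignored by `str_to_vec`, so the subset should cover the wording users are expected to use.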

In [11]:
len(query_s.vocab)

3000000