## Word Mover’s Distance

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.wmdistance.html
    
https://www.yelp.com/dataset_challenge/dataset

~~~~~~

WMD: gensim.models.Word2Vec.wmdistance

1. Download the data from https://www.yelp.com/dataset_challenge/dataset
2. Preprocess the data (remove stopwords, etc.).
3. Filter the reviews of 6 restaurants (Earl of Sandwich, Wicked Spoon, Serendipity 3, Bacchanal Buffet,
   The Buffet, Mon Ami Gabi) from the downloaded dataset.
4. Load the pretrained Google News vectors (GoogleNews-vectors-negative300) with gensim.
5. Convert the text into word2vec vectors.
6. Compute the distance between each review and the query input using Word Mover's Distance (Word2Vec.wmdistance).
7. Retrieve the top 10 most similar reviews.

~~~~~~
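
Steps 3, 6 and 7 above can be sketched roughly as follows. This is only an illustration, not the code used on the Yelp data: the review records, the `name` and `text` field names, and the helper names `filter_reviews`, `top_k_similar` and `jaccard_distance` are assumptions made for this sketch, and a simple Jaccard distance stands in for WMD so that the example runs without the pretrained vectors.

```python
# Restaurant names taken from step 3 of the list above.
RESTAURANTS = {
    "Earl of Sandwich", "Wicked Spoon", "Serendipity 3",
    "Bacchanal Buffet", "The Buffet", "Mon Ami Gabi",
}


def filter_reviews(reviews, names=RESTAURANTS):
    """Keep only reviews whose (assumed) 'name' field is a target restaurant."""
    return [r for r in reviews if r["name"] in names]


def top_k_similar(query_tokens, reviews, distance, k=10):
    """Score every review against the query and return the k closest ones."""
    scored = [
        (distance(query_tokens, r["text"].lower().split()), r["text"])
        for r in reviews
    ]
    scored.sort(key=lambda pair: pair[0])  # ascending: smallest distance first
    return scored[:k]


def jaccard_distance(a, b):
    """Stand-in distance for this sketch only; the notebook uses WMD instead."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 1.0 - len(set_a & set_b) / len(union) if union else 0.0


# Made-up review records, for illustration only.
reviews = [
    {"name": "Wicked Spoon", "text": "great buffet with fresh seafood"},
    {"name": "Some Other Place", "text": "great buffet"},
    {"name": "Mon Ami Gabi", "text": "lovely patio and steak frites"},
]

selected = filter_reviews(reviews)
ranked = top_k_similar(["fresh", "seafood", "buffet"], selected, jaccard_distance)
for score, text in ranked:
    print(round(score, 3), text)
```

With the pretrained vectors loaded, the `distance` argument could be the model's `wmdistance` method instead of the stand-in, since it also takes two token lists.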

### Importing Packages

In [13]:
import numpy as np
import pandas as pd
from os import listdir

# --- NLTK PACKAGE ---
import nltk
# Tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize, PunktSentenceTokenizer, RegexpTokenizer
# Stemming and Lemmatizing
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Stopwords
from nltk.corpus import stopwords, state_union, brown, movie_reviews, treebank
# Wordnet
from nltk.corpus import wordnet

# --- GENSIM PACKAGE ---
import gensim
from gensim.models import Word2Vec, doc2vec, Doc2Vec
from gensim.models.tfidfmodel import TfidfModel
from gensim import corpora, models, similarities
from gensim.models import KeyedVectors

### Loading WMD Google Pre-trained Model

In [14]:
WMD_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [10]:
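## OUR MODEL (alternative: train Word2Vec on our own data instead of loading the Google vectors)
#  `words` would need to be an iterable of tokenized sentences; gensim >= 4.0 renames `size` to `vector_size`.
#  model = Word2Vec(words, size = 100, window = 10, hs=1, negative=0, workers = 4, min_count=1)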

### Preprocessing Data

In [15]:
query = "Who is Pranjal ?"

In [16]:
data = '''My name is Pranjal Pathak. 
          My gender is Male. I am 23 years old. 
          I live in Bangalore. I like driving. 
          I have lived in Varanasi before but I like Bangalore more. 
          Phani is a nice girl. Her gender is Female.'''

In [29]:
check_similarity(query, data)

|   | Sentence | WMD_Score |
|---|----------|-----------|
| 0 | My name is Pranjal Pathak. | 2.61156 |
| 6 | Phani is a nice girl. | 3.309831 |
| 2 | I am 23 years old. | 3.331509 |
| 4 | I like driving. | 3.365052 |
| 3 | I live in Bangalore. | 3.378908 |
| 5 | I have lived in Varanasi before but I like Ban... | 3.408394 |
| 1 | My gender is Male. | 3.7782 |
| 7 | Her gender is Female. | 3.869779 |
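
The scores above are distances, so the smallest value marks the sentence closest to the query. To give a feel for what is being measured, here is a minimal sketch of the *relaxed* WMD lower bound: every word travels to its nearest word in the other document, and the larger of the two directions is kept. This is a simplification, not the exact optimal-transport computation that gensim's `wmdistance` performs. The 2-D vectors and the names `TOY_VECTORS`, `one_sided_cost` and `relaxed_wmd` are invented for illustration.

```python
import math

# Hypothetical 2-D embeddings, invented for this sketch only.
TOY_VECTORS = {
    "king": (1.0, 0.0),
    "queen": (0.9, 0.1),
    "apple": (0.0, 1.0),
    "pear": (0.1, 0.9),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Average distance from each src token to its nearest dst token."""
    total = sum(
        min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src
    )
    return total / len(src)


def relaxed_wmd(doc1, doc2, vectors=TOY_VECTORS):
    """Relaxed WMD lower bound: the larger of the two one-sided costs."""
    # Tokens without a vector are ignored.
    doc1 = [w for w in doc1 if w in vectors]
    doc2 = [w for w in doc2 if w in vectors]
    if not doc1 or not doc2:
        return float("inf")  # choice made for this sketch: nothing to compare
    return max(
        one_sided_cost(doc1, doc2, vectors),
        one_sided_cost(doc2, doc1, vectors),
    )


print(relaxed_wmd(["king"], ["queen"]))  # small: the vectors are close
print(relaxed_wmd(["king"], ["apple"]))  # larger: the vectors are far apart
```

Documents whose words sit close together in the embedding space get a small distance even when they share no words, which is the property that makes WMD useful for ranking sentences against a query.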


In [28]:
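def check_similarity(query, data):
    """Rank the sentences of `data` by Word Mover's Distance to `query` (lower = closer)."""
    list_distances = []

    # NLTK stop words are lowercase, so this check is case-sensitive:
    # capitalized tokens such as "Who" or "My" are kept.
    stop_words = set(stopwords.words("english"))
    sentence1 = [word for word in word_tokenize(query) if word not in stop_words]

    # Split the document into sentences and score each one against the query.
    sentences_in_document = sent_tokenize(data)

    for each_sentence in sentences_in_document:
        sentence2 = [word for word in word_tokenize(each_sentence) if word not in stop_words]
        similarity_distance = WMD_model.wmdistance(sentence1, sentence2)
        list_distances.append(similarity_distance)

    # Sort ascending so the most similar sentence comes first.
    WMD_Dataframe = pd.DataFrame(
        {'Sentence': sentences_in_document, 'WMD_Score': list_distances}
    ).sort_values(by=['WMD_Score'], ascending=True)
    return WMD_Dataframe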
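
The stop-word filter in `check_similarity` compares tokens exactly as written, so capitalized stop words and punctuation tokens pass through. A possible variant is sketched below, assuming we want to compare in lowercase while keeping each token's original casing (the pretrained vectors are looked up by exact string). The name `clean_tokens` and the short `STOP_WORDS` set are hypothetical; the notebook itself uses NLTK's English list and `word_tokenize`.

```python
import string

# Small hypothetical stop-word list for this sketch;
# the notebook uses NLTK's English stop words instead.
STOP_WORDS = {"who", "is", "my", "i", "am", "a", "in", "her"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        # Compare in lowercase, but keep the original casing of the token.
        if token and token.lower() not in stop_words:
            tokens.append(token)
    return tokens


print(clean_tokens("Who is Pranjal ?"))            # ['Pranjal']
print(clean_tokens("My name is Pranjal Pathak."))  # ['name', 'Pranjal', 'Pathak']
```

Swapping this kind of filter into `check_similarity` would change the token lists and therefore the scores, so the numbers in the table above would no longer be reproduced exactly.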