<a href="https://colab.research.google.com/github/jhihan/Text-Search-Engine/blob/master/Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Build a search engine with two methods: the fast BM25 ranking function, and a machine-learning-based method intended to give more relevant results, the Doc2Vec embedding technique.**

In [23]:
!pip install rank_bm25
!pip install langdetect

Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
     |████████████████████████████████| 983kB 2.9MB/s
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.8-cp36-none-any.whl size=993193 sha256=2c3721b440488943311b1923152eb2d2439ab05de5fab9a1bc81bbadd5469cb1
  Stored in directory: /root/.cache/pip/wheels/8d/b3/aa/6d99de9f3841d7d3d40a60ea06e6d669e8e5012e6c8b947a57
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.8


In [0]:
import gzip
import gensim 
import logging
from langdetect import detect
import pandas
from rank_bm25 import BM25Okapi
from gensim.summarization.bm25 import get_bm25_weights
import operator
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import os
import os.path
from gensim.test.utils import get_tmpfile
from google.colab import files

In [0]:
def query_BM25_search( query , documents, taggedDocuments, topn = 10, model = None ):
    """
    Search the documents using the BM25 ranking function.

    Parameters
    ----------
    query: string, the query word or sentence.

    documents: list of string, the documents to be searched.

    taggedDocuments: list of namedtuple(words,tags), a list of tagged documents.

    topn: int, optional (default=10), number of results to be retrieved.

    model: BM25Okapi object (default=None), bag-of-words ranking model.
          If None, a new BM25Okapi model is built from the words of taggedDocuments.

    Return
    ------
    list of the top n matching documents, best match first

    """
    if model is None:
      model = BM25Okapi( list(map(operator.attrgetter('words'), taggedDocuments)) )

    doc_scores = model.get_scores( query.split() )
    # Sort the (document index, score) pairs by descending score.
    doc_scores = sorted( enumerate(doc_scores), key=lambda x:x[1], reverse=True )
    # Slicing avoids an IndexError when topn exceeds the number of documents.
    return [ documents[i] for i, _ in doc_scores[:topn] ]
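
To make clearer what the BM25 scores represent, here is a minimal pure-Python sketch of Okapi BM25 scoring on a toy corpus. It is not the rank_bm25 implementation: the function name `bm25_scores`, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the smoothed "+ 1" IDF variant are assumptions made for illustration, so the numbers will not necessarily match `BM25Okapi.get_scores`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF (assumed "+ 1" variant), which is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
toy_corpus = [
    "the room was dirty and the towels were dirty".split(),
    "lovely clean room with a great view".split(),
    "friendly staff and good breakfast".split(),
]
toy_scores = bm25_scores(["dirty"], toy_corpus)
best_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best_index, toy_scores)
```

Documents that do not contain any query term score zero, which is why BM25 is fast but cannot find reviews that describe a dirty room in other words.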

In [0]:
def query_tfidf_search( query , documents, taggedDocuments, topn = 10 ):
  pass

In [0]:
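# test: load a cached Doc2Vec model, or train and save a new one.
# Requires taggedDocuments, which is built by read_input further below.
import os
import os.path

model_name = "doc2vec"   # file name of the cached model
is_new_model = False     # set to True to force retraining

if is_new_model and os.path.isfile(model_name):
  os.remove(model_name)

# Use one path for the existence check, the load and the save,
# so that a previously saved model is actually found.
if os.path.isfile(model_name):
  model = gensim.models.Doc2Vec.load(model_name)
else:
  model = gensim.models.Doc2Vec (vector_size=300, window=10, hs=0, min_count=2, dbow_words=1, workers=10)
  model.build_vocab(taggedDocuments)
  model.train(taggedDocuments,total_examples=len(taggedDocuments),epochs=10)
  model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
  model.save(model_name)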

In [0]:
def query_doc2vec_search( query , documents, taggedDocuments, model_name, topn = 10, is_new_model = False ):
    """
    Search the documents using the Doc2Vec embedding algorithm.

    Parameters
    ----------
    query: string, the query word or sentence.

    documents: list of string, the documents to be searched.

    taggedDocuments: list of namedtuple(words,tags), a list of tagged documents.

    model_name: string, path of the file in which the trained model is cached.

    topn: int, optional (default=10), number of results to be retrieved.

    is_new_model: bool, optional (default=False). If True, any cached model
          is deleted and a new model is trained.

    Return
    ------
    list of the top n matching documents, best match first

    """

    if is_new_model and os.path.isfile(model_name):
      os.remove(model_name)

    # Use one path for the existence check, the load and the save,
    # so that a previously saved model is actually found.
    if os.path.isfile(model_name):
      model = gensim.models.Doc2Vec.load(model_name)
    else:
      model = gensim.models.Doc2Vec (vector_size=300, window=10, hs=0, min_count=2, dbow_words=1, workers=10)
      model.build_vocab(taggedDocuments)
      model.train(taggedDocuments,total_examples=len(taggedDocuments),epochs=10)
      model.save(model_name)

    # Infer a vector for the query and look up the most similar document vectors.
    query_vector = model.infer_vector(query.split() ,epochs = 10)
    tagsim = model.docvecs.most_similar([query_vector], topn=topn)

    # Each entry of tagsim is (tag, similarity); the tag is the index into documents.
    return [ documents[tag] for tag, _ in tagsim ]
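
The similarity lookup above ranks documents by how close their vectors are to the inferred query vector. As a rough illustration of that idea, the sketch below ranks toy vectors by cosine similarity in plain Python. The helper names `cosine_similarity` and `most_similar_sketch` and the two-dimensional vectors are hypothetical; this is not gensim's code, only the general technique.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(query_vector, doc_vectors, topn=10):
    """Return (index, similarity) pairs for the topn most similar vectors."""
    sims = [(i, cosine_similarity(query_vector, vec)) for i, vec in enumerate(doc_vectors)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:topn]

# Hypothetical 2-dimensional "document vectors"; real Doc2Vec vectors here have 300 dimensions.
toy_doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_result = most_similar_sketch([1.0, 0.1], toy_doc_vectors, topn=2)
print(toy_result)
```

Because the ranking depends on vector direction rather than on shared words, a review can rank highly without containing the query term at all.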

**Read the data from the dataset. While reading, we also apply some simple preprocessing. Because the data may contain non-English reviews, we use the detect function from the langdetect package to filter them out.**

The data can be downloaded from: http://kavita-ganesan.com/entity-ranking-data/#.XnGMCpNKiL8 . It consists of hotel reviews.

In [74]:
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
lemmatizer = WordNetLemmatizer()

In [34]:
uploaded = files.upload()

Saving reviews_data.txt.gz to reviews_data.txt.gz


In [0]:

def read_input(input_file):
  """
  Read the gzipped input file and yield the English reviews.

  Parameters
  ----------
  input_file: string, the full path and file name of the input file

  Yield
  -----
  ( bytes, namedtuple ), the raw line and the tagged document of a single review.

  """
  logging.info("reading file {0}...this may take a while".format(input_file))
  with gzip.open(input_file, 'rb') as f:
    index = -1
    for i,line in enumerate (f):
      if (i%10000==0):
        logging.info ("read {0} reviews".format (i))
      # Lowercase and tokenize the line.
      words = gensim.utils.simple_preprocess (line)
      if (len(words) >= 2):
        sentence = ' '.join(words)
        # Keep only the reviews detected as English.
        if (detect(sentence) == 'en'):
          index += 1
          # The tag is the position of the review in the resulting document list.
          yield line, gensim.models.doc2vec.TaggedDocument(
                      words = [lemmatizer.lemmatize(w) for w in words],
                      tags=[index] )
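
The documents are kept as raw byte lines, which is why the search results are printed with a `b'...'` prefix. Judging from those printed results, each line appears to hold a date, a title and the review text separated by tabs. Under that assumption, and assuming UTF-8 encoding, a hypothetical helper `parse_review` could split a line into readable fields:

```python
def parse_review(line):
    """Split one raw review line into (date, title, text)."""
    # Assumption: UTF-8 text with tab-separated fields, as seen in the printed results.
    fields = line.decode("utf-8", errors="replace").rstrip("\r\n").split("\t")
    # Pad so that short lines still unpack into three fields.
    fields += [""] * (3 - len(fields))
    date, title, text = (field.strip() for field in fields[:3])
    return date, title, text

sample = b'May 10 2005\tshocking loud and dirty\t\t\r\n'
print(parse_review(sample))
```

This is only meant for display; the search functions keep working on the unmodified lines.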



In [0]:
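def query_preprocessing(query):
  """
  Preprocess the query string: tokenize, lowercase, remove English stopwords and lemmatize.

  Parameters
  ----------
  query: string, the query word or sentence

  Return
  ------
  the preprocessed string

  """
  # Build the stopword set once instead of once per token.
  stop_words = set(stopwords.words('english'))
  # Lowercase before filtering so that capitalised stopwords are removed too.
  tokens = [w.lower() for w in nltk.word_tokenize(query)]
  query_pre = ' '.join( lemmatizer.lemmatize(w) for w in tokens if w not in stop_words )
  return query_pre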
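
The order of the preprocessing steps matters: stopword lists such as NLTK's English list are lowercase, so a capitalised token like "Where" only matches after lowercasing. The sketch below shows this with a small hypothetical stopword set and whitespace tokenization, so it runs without any NLTK downloads; it is an illustration, not a replacement for the function above.

```python
# Hypothetical stopword set, much smaller than NLTK's English list.
HYPOTHETICAL_STOPWORDS = {"where", "is", "the", "most"}

def simple_query_filter(query, stop_words=HYPOTHETICAL_STOPWORDS):
    """Lowercase the tokens first, then drop the stopwords."""
    tokens = [w.lower() for w in query.split()]
    return " ".join(w for w in tokens if w not in stop_words)

print(simple_query_filter("Where is the most beautiful hotel"))
```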

In [39]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
data_file="reviews_data.txt.gz"
documents, taggedDocuments = list( map(list,zip( *read_input(data_file)  )) )
logging.info ("Done reading data file")

2020-03-18 07:30:50,121 : INFO : reading file reviews_data.txt.gz...this may take a while
2020-03-18 07:30:50,130 : INFO : read 0 reviews
2020-03-18 07:32:02,917 : INFO : read 10000 reviews
2020-03-18 07:33:10,016 : INFO : read 20000 reviews
2020-03-18 07:34:26,083 : INFO : read 30000 reviews
2020-03-18 07:35:38,555 : INFO : read 40000 reviews
2020-03-18 07:36:51,039 : INFO : read 50000 reviews
2020-03-18 07:38:08,077 : INFO : read 60000 reviews
2020-03-18 07:39:20,206 : INFO : read 70000 reviews
2020-03-18 07:40:32,042 : INFO : read 80000 reviews
2020-03-18 07:41:43,519 : INFO : read 90000 reviews
2020-03-18 07:42:53,654 : INFO : read 100000 reviews
2020-03-18 07:44:06,894 : INFO : read 110000 reviews
2020-03-18 07:45:16,089 : INFO : read 120000 reviews
2020-03-18 07:46:26,071 : INFO : read 130000 reviews
2020-03-18 07:47:35,599 : INFO : read 140000 reviews
2020-03-18 07:48:46,029 : INFO : read 150000 reviews
2020-03-18 07:49:56,517 : INFO : read 160000 reviews
2020-03-18 07:51:03,891
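
The loading cell relies on the `zip(*...)` idiom to split the stream of (document, tagged document) pairs into two parallel lists. A minimal sketch of the same idiom, with a hypothetical generator `toy_reader` standing in for `read_input`:

```python
def toy_reader():
    """Yield (document, tokens) pairs, like a simplified read_input."""
    yield "Dirty room", ["dirty", "room"]
    yield "Great view", ["great", "view"]

# zip(*pairs) transposes the pairs; map(list, ...) turns each tuple into a list.
docs, token_lists = list(map(list, zip(*toy_reader())))
print(docs)
print(token_lists)
```

Note that the two-name unpacking fails if the generator yields nothing, so this pattern assumes at least one review survives the filtering.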

In [81]:
# query can also be several words or a sentence
# for example: 'dirty hotel', 'where is the most beautiful hotel'
query ='dirty'
query_pre = query_preprocessing(query)
result_BM25 = query_BM25_search( query_pre , documents, taggedDocuments, topn = 10 )
print("-------------results from quick search--------------------")
print('Top search with respect to' , query , ':' )
for i, item in enumerate(result_BM25):
    print(i+1, item)  

-------------results from quick search--------------------
Top search with respect to dirty :
1 b'Jul 26 2004\tHorrible\tThe room wasnt ready when we arrived, then it was dirty. Dirty towels, dirty ripped bedding, dirty fridge. Wont be going there again.\t\r\n'
2 b'Jul 1 2009\tDirty dirty dirty hotel\tThe worse hotel I ever stayed. Stay away from it. The management forgot the meaning of the word &quot;clean&quot;. Because every inch of this hotel is dirty.\t\r\n'
3 b"Apr 13 2004\tTerrible hotel....\tThis was the worst hotel I have ever been in my life. The uncomfortable bed, the poor breakfast, the size of the room, they were nothing compared to the dirty of the place. It is so dirty that even the rats run away from it ! The toilet/shower are very dirty, the carpets and walls are dirty, the bed-linnen are dirty, the windows are dirty.... Don't ever think about going to this place ! You'd better sleep in the subway !\t\r\n"
4 b'Nov 23 2007 \tPick another hotel!!\tThe positives:1. Great 

In [90]:
result_doc2vec = query_doc2vec_search( query_pre , documents, taggedDocuments, "doc2vec",topn = 10 )
print("-------------results from more relevant search--------------------")
print('Top search with respect to' , query , ':' )
for i, item in enumerate(result_doc2vec):
    print(i+1, item)  

2020-03-18 10:24:32,514 : INFO : collecting all words and their counts
2020-03-18 10:24:32,516 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-03-18 10:24:32,868 : INFO : PROGRESS: at example #10000, processed 1771213 words (5060562/s), 23424 word types, 10000 tags
2020-03-18 10:24:33,195 : INFO : PROGRESS: at example #20000, processed 3495619 words (5289611/s), 31835 word types, 20000 tags
2020-03-18 10:24:33,588 : INFO : PROGRESS: at example #30000, processed 5603532 words (5387558/s), 44281 word types, 30000 tags
2020-03-18 10:24:33,961 : INFO : PROGRESS: at example #40000, processed 7527347 words (5169041/s), 52461 word types, 40000 tags
2020-03-18 10:24:34,361 : INFO : PROGRESS: at example #50000, processed 9608543 words (5217547/s), 59124 word types, 50000 tags
2020-03-18 10:24:34,757 : INFO : PROGRESS: at example #60000, processed 11682075 words (5253450/s), 66314 word types, 60000 tags
2020-03-18 10:24:35,072 : INFO : PROGRESS: at example #7

-------------results from more relevant search--------------------
Top search with respect to dirty :
1 b"Feb 15 2005 \tDon't Stay Here\tI stayed here in Feburary while traveling on business. It was awful. Very dirty.Staff tried but not much you can do when the rooms are run down and dirty. In my opinion, this property gives Hyatt a bad name.\t\r\n"
2 b'May 11 2005\tDishonest Management\tRude and dishonest management.Too noisy (right on the busy Lombard) - not a good sleep.Too dirty.\t\r\n'
3 b'May 10 2005\tshocking loud and dirty\t\t\r\n'
4 b"Sep 12 2009 \tAbsolutely Horrible - Don't Go There!!!\tRoom tiny and dirty. . Staff uninterested and unhelpful. Breakfast disgusting. Doesn't deserve any stars.\t\r\n"
5 b'Nov 10 2003\tDeceiving Sneaky and Rude!\t\t\r\n'
6 b'Apr 14 2007 \tworst rooms ever !!\tworst rooms have ever stayed in,small is an understatement,dirty,cold shower and noisy!\t\r\n'
7 b'Nov 4 2007 \tBAAAADDDDDDDD\tdont go there is the worst hotel i ever stay in my hole life.it

