### This notebook implements the optimized search engine on the user's end.
#### 1. Search for an optimal product based on its reviews, using the similarity matrix, positive sentiment, and the weights of the keywords selected by the user.
#### 2. Retrieve the products together with their reviews, based on relevance to the query.

In [None]:
!pip install -U git+https://github.com/huggingface/transformers.git
# !pip install rake-nltk    

In [10]:
from collections import defaultdict
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
from tensorflow import keras

import spacy
# !python -m spacy download en
from itertools import chain
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

from gensim.summarization import keywords as keywords_extractor
nlp = spacy.load('en')

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import matplotlib.pyplot as plt

import nltk
from nltk import pos_tag, word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from rake_nltk import Rake

import gensim.downloader as api
from utils import *

[nltk_data] Downloading package punkt to /Users/jungakim/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jungakim/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jungakim/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jungakim/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Load the dataset of products with 10 or more reviews.

In [11]:
reviews = pd.read_pickle('data/products_10_or_more_reviews.pkl')

#### Load the model for sentiment analysis

In [12]:
embed = hub.load('use')
# embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5") # Can load from Tensorflow_hub link
model = keras.models.load_model('model') # From google colab. 40 epochs 2**16 batchsize

INFO:absl:resolver HttpCompressedFileResolver does not support the provided handle.
INFO:absl:resolver GcsCompressedFileResolver does not support the provided handle.


#### Load the model for summarization

In [13]:
tokenizer_reddit = AutoTokenizer.from_pretrained("google/pegasus-reddit_tifu")
model_reddit = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-reddit_tifu")

Embeddings of the product titles

In [14]:
product_embeddings = pd.read_pickle('data/product_embeds.pkl') # Run google colab GPU to get the embeddings of all unique products in the dataset

Getting the embeddings of the product titles<br>
Done in Google Colab with GPU.
<code style="font-size: 10px; background-color:transparent;">
def tensors_from_series(series, lower, upper):
  return tf.convert_to_tensor(series.iloc[lower:upper].apply(lambda x: str(x)).values)
unique_products = reviews[['product_parent','product_title']].drop_duplicates()
product_names = unique_products.product_title
chunksize = unique_products.shape[0] // 1000
bounds = chunksize * np.arange(unique_products.shape[0] // chunksize)
</code>
<code style="font-size: 10px; background-color:transparent;">
bounds_tuples = list()
for i, bound in enumerate(bounds):
  if i == unique_products.shape[0] // chunksize - 1: break
  bounds_tuples.append((bounds[i], bounds[i+1]))
bounds_tuples.append((bounds_tuples[-1][1], unique_products.shape[0]))
<br>
with tf.device('/GPU:0'):
  print(f"{len(bounds_tuples)} chunks to embed.")
  embeddings = embed(tensors_from_series(product_names, 0, bounds_tuples[0][1]))
  for i, bounds_tuple in enumerate(bounds_tuples[1:]):
    if i % 100 == 0: 
      print(i,end='  ')
    embeddings = tf.concat([embeddings, embed(tensors_from_series(product_names, bounds_tuple[0], bounds_tuple[1]))], 0)
product_embeddings = embeddings.numpy()
</code>
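The chunk boundaries above can also be produced more compactly. The following is a minimal sketch, not the code that was run in Colab: `chunk_bounds` is a hypothetical helper, and unlike the loop above it leaves a short final chunk rather than merging the remainder into the last full chunk.

```python
def chunk_bounds(n_items, chunk_size):
    """Return (lower, upper) index pairs that cover range(n_items) in order.

    Hypothetical helper for illustration; the last pair may be shorter
    than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_items))
        for start in range(0, n_items, chunk_size)
    ]


# Example: 10 items in chunks of 4 items each.
bounds_example = chunk_bounds(10, 4)
```

Each pair can then be passed to `tensors_from_series(product_names, lower, upper)` in the same way as `bounds_tuples`.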

### Searching for optimal products

**Within the category-matched top_k_products, look at the top_k most helpful reviews and save their "sentiment * similarity" scores.**
<br><br>
The positive similarity score of a sentence (in a review) is defined here as<br>
&emsp;"its predicted sentiment probability * (weighted) similarity score with a keyword/query sentence"
<br><br>
The positive similarity score of a review is the mean, over the review's sentences, of each sentence's maximum positive similarity score across all keywords/query sentences.<br>
The following pseudocode, written first as a for-loop and then as a vectorized matrix operation, may explain this concept better.<br>

**For-loop** to compute the positive similarity score between a review and a query
<code style="font-size: 11px; background-color:transparent;">
For each "sentence in a review":
    Compute the positive similarity with each "keyword of a query" using the 'get_sentiment_scores' and 'similarity_scores' functions (positive similarity = sentiment score * similarity score).
    Take the maximum positive similarity over all keywords.
pos_sim_score_keywords = Mean of these maximum positive similarities.
Repeat the same for-loop with the "sentences in a query" in place of the keywords.
pos_sim_score_sentences = Mean of those maximum positive similarities.
Return the smaller (more conservative) of pos_sim_score_keywords and pos_sim_score_sentences.
</code><br>
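The for-loop pseudocode above can be sketched in plain Python. This is only an illustration under stated assumptions: the sentiment probabilities and similarity scores are passed in as precomputed lists (in the notebook they would come from `get_sentiment_scores` and `similarity_scores`), the function names `review_score` and `conservative_score` are hypothetical, and keyword weights are left out.

```python
from statistics import mean


def review_score(sentiments, sims_to_targets):
    """Mean over review sentences of the best (sentiment * similarity).

    sentiments:      one sentiment probability per review sentence.
    sims_to_targets: one list per review sentence, holding its similarity
                     to each target (query keyword or query sentence).
    """
    best_per_sentence = []
    for sentiment, sims in zip(sentiments, sims_to_targets):
        best_per_sentence.append(max(sentiment * s for s in sims))
    return mean(best_per_sentence)


def conservative_score(sentiments, sims_to_keywords, sims_to_query_sentences):
    """Return the smaller of the keyword-based and sentence-based scores."""
    pos_sim_score_keywords = review_score(sentiments, sims_to_keywords)
    pos_sim_score_sentences = review_score(sentiments, sims_to_query_sentences)
    return min(pos_sim_score_keywords, pos_sim_score_sentences)


# Toy review with two sentences, a query with two keywords and one sentence.
toy_score = conservative_score(
    sentiments=[0.9, 0.5],
    sims_to_keywords=[[0.2, 0.8], [0.6, 0.4]],
    sims_to_query_sentences=[[0.5], [0.5]],
)
```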
**Vectorization** to compute the positive similarity score between a review and a query<br><br>
$\begin{bmatrix} & \scriptsize\textit{query's Keyword1} & \scriptsize\textit{query's Keyword2} & \scriptsize\textit{query's Keyword3} &  \\ 
\scriptsize\textit{Review's Sentence1} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \cdots \\ 
\scriptsize\textit{Review's Sentence2} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \cdots \\ 
\scriptsize\textit{Review's Sentence3} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \scriptsize\text{Similarity} & \cdots \\ 
\vdots & \vdots & \vdots & \vdots & \end{bmatrix}$
*
$\begin{bmatrix} \scriptsize\text{Sentiment score of Review's Sentence1} \\ 
\scriptsize\text{Sentiment score of Review's Sentence2} \\ \scriptsize\text{Sentiment score of Review's Sentence3} \\ \vdots \end{bmatrix}$
*
$\begin{bmatrix} \scriptsize\text{Weight of query's Keyword1} & \scriptsize\text{Weight of query's Keyword2} & \cdots \end{bmatrix}$
$=\begin{bmatrix}
\scriptsize\textit{Review's Sentence1} & \scriptsize\text{similarity with Keyword1 * sentiment * weight} & \scriptsize\text{similarity with Keyword2 * sentiment * weight} & \cdots \\ 
\scriptsize\textit{Review's Sentence2} & \scriptsize\text{similarity with Keyword1 * sentiment * weight} & \scriptsize\text{similarity with Keyword2 * sentiment * weight} & \cdots \\ 
\scriptsize\textit{Review's Sentence3} & \scriptsize\text{similarity with Keyword1 * sentiment * weight} & \scriptsize\text{similarity with Keyword2 * sentiment * weight} & \cdots \\ 
\vdots & \vdots & \vdots & \end{bmatrix}$
$\leftarrow$ Take the max (axis=1) $\rightarrow$ shape = (num_sentences, 1) $\rightarrow$ Take the mean(axis = 0) $\rightarrow$ scalar $\Leftarrow$ This measures the positivity and similarity of a review with respect to the query.
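
The matrix form above maps directly onto NumPy broadcasting. The sketch below assumes the similarity matrix, the per-sentence sentiment scores, and the per-keyword weights are already available as arrays; the function name `positive_similarity` is hypothetical and is not the implementation in `utils`.

```python
import numpy as np


def positive_similarity(similarity, sentiment, weights):
    """Vectorized sentiment * similarity * weight score for one review.

    similarity: shape (num_sentences, num_keywords)
    sentiment:  shape (num_sentences,)
    weights:    shape (num_keywords,)
    """
    similarity = np.asarray(similarity, dtype=float)
    sentiment = np.asarray(sentiment, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Broadcast sentiment down the rows and weights across the columns.
    scored = similarity * sentiment[:, None] * weights[None, :]

    # Best keyword for each sentence, then the average over sentences.
    best_per_sentence = scored.max(axis=1)
    return float(best_per_sentence.mean())


# Toy example: two review sentences, two query keywords.
toy_value = positive_similarity(
    similarity=[[0.8, 0.2], [0.4, 0.6]],
    sentiment=[0.5, 1.0],
    weights=[1.0, 2.0],
)
```

In the toy example the weighted matrix is `[[0.4, 0.2], [0.4, 1.2]]`, the row maxima are `[0.4, 1.2]`, and their mean is `0.8`.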

##### Type the query

In [16]:
query = "backlit wireless keyboard" #  <---------- Set by User

##### Get the products in the most likely category

In [17]:
%time most_similar_indices, sim_scores = get_similarity_score_with_product(query)

CPU times: user 50.1 s, sys: 12.2 s, total: 1min 2s
Wall time: 43.3 s
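
`get_similarity_score_with_product` is imported from `utils` and its body is not shown in this notebook. One plausible way to obtain ranked indices and scores like these is cosine similarity between the query embedding and the product title embeddings. The sketch below rests on that assumption, uses a hypothetical function name, and works on small hand-made vectors rather than Universal Sentence Encoder output.

```python
import numpy as np


def rank_by_cosine_similarity(query_vec, item_vecs):
    """Return (indices, scores) sorted from most to least similar.

    query_vec: shape (dim,)
    item_vecs: shape (num_items, dim)
    """
    query_vec = np.asarray(query_vec, dtype=float)
    item_vecs = np.asarray(item_vecs, dtype=float)

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = item_vecs @ query_vec / np.maximum(norms, 1e-12)

    order = np.argsort(scores)[::-1]  # descending similarity
    return order, scores[order]


# Toy example with three 2-dimensional "embeddings".
toy_order, toy_scores = rank_by_cosine_similarity(
    [1.0, 0.0],
    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
)
```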


##### Set up the hyperparameters
- top_k_products: Choose the number of products to search (from 'most_similar_indices')
- top_k_reviews: Choose the number of reviews to search (sorted by 'helpful_votes' column)
- emphasized_keywords: a list of keywords that the user wants to emphasize (treat as more important)

In [66]:
top_k_products, top_k_reviews, emphasized_keywords = 5, 5, ['wireless', 'keyboard'] #  <---------- Set by User
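
How `emphasized_keywords` is turned into keyword weights is handled inside `utils` and is not shown here. As a rough illustration only, one simple scheme gives emphasized keywords a larger weight than the rest. The function name and the `boost` value below are assumptions, not the notebook's actual weighting.

```python
def keyword_weights(keywords, emphasized, boost=2.0):
    """Return one weight per keyword: `boost` if emphasized, else 1.0.

    Hypothetical scheme for illustration; `boost=2.0` is an arbitrary choice.
    """
    emphasized_lower = {word.lower() for word in emphasized}
    return [boost if word.lower() in emphasized_lower else 1.0 for word in keywords]


# Weights for the query keywords used in this notebook.
example_weights = keyword_weights(
    ["backlit", "wireless", "keyboard"],
    ["wireless", "keyboard"],
)
```

A weight vector like this plays the role of the "Weight of query's Keyword" row vector in the vectorized formula above.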

##### Search the products

In [67]:
print(f"We will examine reviews of {top_k_products} products that are most similar to the query")
matching_scores = defaultdict(pd.Series)
unique_products = reviews[['product_parent','product_title']].drop_duplicates()
matching_products = unique_products.iloc[most_similar_indices[:top_k_products],0].values

print("Processing the review of product")
for i, product_id in enumerate(matching_products):
    print(f"...{i+1}", end='')
    reviews_list = reviews.loc[reviews.product_parent == product_id, ['review_body', 'helpful_votes', 'review_date']].\
    sort_values(ascending = False, by = ['helpful_votes', 'review_date'])[:top_k_reviews]['review_body']
    matching_scores[product_id] = reviews_list.apply(sentimental_similarity_score_of_a_review, args=(query, 'positive', emphasized_keywords))

We will examine reviews of 5 products that are most similar to the query
Processing the review of product
...1...2...3...4...5

**Print the product with the highest median sentiment * similarity score.**

In [25]:
index = 0 #  <---------- Set by User

med_matching_scores = [np.nanmedian(v) for k, v in matching_scores.items()]
product_scores = np.sort(med_matching_scores)[::-1]
product_indices = np.argsort(med_matching_scores)[::-1]
matched_product_id = matching_products[product_indices[index]]
product_title = reviews.loc[reviews.product_parent == matched_product_id, :].head(1).product_title.values[0]
print(f"Your query matched with:\n{product_title}")
print(f"\nThe median of the Geometric mean of Positive Similarity score between \"{query}\" and \"{product_title}\" is {product_scores[index]:.3f}")
# Geometric mean takes into account the effect of compounding, therefore, better suited for calculating the returns.

Your query matched with:
RK728 Wireless Keyboard

The median of the Geometric mean of Positive Similarity score between "backlit wireless keyboard" and "RK728 Wireless Keyboard" is 0.279
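
The ranking step above can be seen in isolation on toy data. This sketch assumes the per-review scores are held in a plain dictionary keyed by product id; the function name and the scores are made up for illustration. `np.nanmedian` ignores reviews whose score is NaN.

```python
import numpy as np


def rank_products_by_median(scores_by_product):
    """Return (product_id, median_score) pairs, best product first."""
    product_ids = list(scores_by_product)
    medians = np.array(
        [np.nanmedian(scores_by_product[pid]) for pid in product_ids]
    )
    order = np.argsort(medians)[::-1]  # descending median score
    return [(product_ids[i], float(medians[i])) for i in order]


# Toy scores: product "a" has one review whose score could not be computed.
toy_ranking = rank_products_by_median({
    "a": [0.1, 0.3, float("nan")],
    "b": [0.5, 0.7, 0.6],
    "c": [0.4, 0.4],
})
```

Setting `index = 1` in the cell above corresponds to taking the second entry of such a ranking.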


### Show the reviews

**For the matched product, create a dataframe that records whether each review's rating is 5 and the review's similarity score.**

In [None]:
matched_reviews = reviews.loc[reviews.product_parent == matched_product_id, ['review_body', 'star_rating', 'review_date']]
matched_reviews['review_date'] = matched_reviews['review_date'].apply(lambda x: str(x).split()[0])

pos = (matched_reviews.star_rating == 5.0).astype(int)
sim = matched_reviews.review_body.apply(lambda x: sentimental_similarity_score_of_a_review(x, query, None, emphasized_keywords) if x is not None else None)
pos_sim = pd.concat([pos.rename('positive'), sim.rename('similarity')], axis=1)
matched_reviews[['positive', 'similarity']] = pos_sim

#### Create a summary if the review is too long.

In [30]:
def create_summary(text):
    """
    Create summary of the input text using the NLP model
    """
    tokenized_text = tokenizer_reddit.encode(text, return_tensors="pt", truncation=True)
    summary_ids = model_reddit.generate(tokenized_text,
                                          num_beams=4,
                                          no_repeat_ngram_size=2,
                                          min_length=30,
                                          max_length=150,#  <---------- Set by User
                                          early_stopping=True)
    return "..."+tokenizer_reddit.decode(summary_ids[0], skip_special_tokens=True)+"..."

**Select the 'show_k' reviews most similar to the query among those with a positive rating (5.0), and likewise among those with a negative rating (<=4.0).**

In [69]:
show_k = 5 #  <---------- Set by User
maximum_len = 300 # Reviews with at least this many characters are summarized  <---------- Set by User

In [70]:
pos_sim_indices = pos_sim.loc[pos_sim.positive == 1, 'similarity'].sort_values(ascending = False).head(show_k).index
neg_sim_indices = pos_sim.loc[pos_sim.positive == 0, 'similarity'].sort_values(ascending = False).head(show_k).index

k_pros = matched_reviews.loc[pos_sim_indices, ['review_body', 'star_rating', 'review_date', 'similarity']]
k_cons = matched_reviews.loc[neg_sim_indices, ['review_body', 'star_rating', 'review_date', 'similarity']]

In [71]:
mask = k_pros.review_body.map(lambda x: len(x)) >= maximum_len

k_pros.loc[mask, 'original'] = k_pros.loc[mask, 'review_body'].values
k_pros.loc[mask, 'review_body'] = k_pros.loc[mask, 'review_body'].apply(create_summary)
k_pros.loc[:, 'similarity'] = k_pros.loc[:, 'similarity'].apply(lambda x: str(round(x,2)))
k_pros.loc[:, ['review_body', 'star_rating', 'review_date', 'similarity']].style.set_properties(subset=['review_body'], **{'width': '600px'})

Unnamed: 0,review_body,star_rating,review_date,similarity
2903224,I used this with my laptop because I never liked the feel of the crammed laptop layout. It allows me to comfortably type in crammed spaces while in flight. Plus I like to type on my lap and is so much better not having a hot heavy notebook on my lap.,5.0,2008-07-18,0.58
2900544,If you are looking for the the keyboard to complete your laptop/hdtv combination THIS IS PERFECT. I just got this last week opened the box plugged it in and it started surfing the net rite away never even rebooted the computer. Netflicks has never been better!!!!,5.0,2008-08-02,0.58
2903353,this product is similar to the FK760 I purchased recently and it has an integrated touchpad. I like it. Very good price and the company delivered the product quickly.,5.0,2008-07-17,0.41
1258198,Loved It,5.0,2014-07-22,0.35
2117844,"good staff, we like it very much easy to use and easy to operations, no problems at alll, if you like it go get it.",5.0,2013-02-13,0.06


In [73]:
mask = k_cons.review_body.map(lambda x: len(x)) >= maximum_len

k_cons.loc[mask, 'original'] = k_cons.loc[mask, 'review_body'].values
k_cons.loc[mask, 'review_body'] = k_cons.loc[mask, 'review_body'].apply(create_summary)
k_cons.loc[:, 'similarity'] = k_cons.loc[:, 'similarity'].apply(lambda x: str(round(x,2)))
k_cons.loc[:, ['original', 'review_body', 'star_rating', 'review_date', 'similarity']].style.set_properties(subset=['original'], **{'width': '600px'})

Unnamed: 0,original,review_body,star_rating,review_date,similarity
2847875,,"this keyboard has some very nice features, but it is very hard to push the keys down, and the keyboard is cramped. also there is some delay on the touch pad, not much, but fine movements are hard. this is a great keyboard for use with a media center, but not for your primary keyboard.",3.0,2009-03-18,0.68
2882919,"It's good to have this option, I need a keyboard being able to put on lap to type, my shoulder keeps pain if I use keyboard on high table. This product is light enough. And wireless is also needed, there is no so many choices on the market. USB receiver is good. setup is easy. A bad is that the key press, they should really making keypress better. Plus the mouse point move not fast using the built-in touchpad. A good compliment, but sometimes you would prefer changing back to normal keyboard/mouse","...wireless keyboard/mouse combo is a good compliment, but sometimes you would prefer changing back to normal keyboard or mouse. ;-(;)...",3.0,2008-11-12,0.58
2829066,"I have a wireless system and the idea of separate board and mouse on the couch are ridiculous to me so I've tried several of these keyboards with pointing devices built in. I have to say this is the best one so far. The Media center buttons are great and probably the strong point. The cons are the key return and layout are horrible, if you touch type you will see at least a 50% drop in your speed and accuracy. The Touch pad it's self is quirky jumping and hanging, I find I have to actually bang the unit to get it to go back to normal, and you often have to re-sync with the dongle, which is a pain. But in the end it's cool and it works, maybe I'll get better on the touch typing but it's been several months so I doubt it.","...don't touch type on a wireless keyboard, it's a pain and you'll get a 50% drop in speed and accuracy if you touch....",3.0,2009-06-10,0.56
1715141,,"Good as a Media Center keyboard. Trackpad works ok, but it is old style (does not click as Macbook ones do). Decent range. Cons: No backlight",3.0,2013-11-24,0.55
2480950,PROS:  Battery life is fantastic. Range is superb. Tactile feedback is exactly what you expect from a real keyboard. CONS:  The track-pad can be a bit goofy. The keyboard doesn't have an indicator for NumLock or CapsLock... this really bugs me. It's just a bit too big for coffee table use.,"...keyboard is too big for coffee table use./b>br /> a href=""http://i.imgur.com/a/f0ttxx"" target=""_blank"" rel=""book"" title=""https://www.amazon.co.uk/gp/product/listing.asp?ie=u-bloblot&qid=1&sr=0&keywords=keyboard"" width=""640"" height=""480"" />...",3.0,2011-12-18,0.49


Although it is not used here, the probable category of the query can be found using Rake and NLTK.
<code style="font-size: 11px; background-color:transparent;">
lemmatizer = WordNetLemmatizer()
r = Rake()  # Uses NLTK's English stopwords and all punctuation characters.
def get_ranked_phrases(query):
    r.extract_keywords_from_text(query)
    phrases = {token for token in r.get_ranked_phrases()}
    nouns = {word.lower() for word, pos in pos_tag(word_tokenize(query)) if pos in ('NN', 'NNS', 'NNP')}
    ranked_phrases = list(chain(*[[token for token in phrases if noun in token.split()] for noun in nouns]))
    ranked_phrases_split = [set(query_product.split()) for query_product in ranked_phrases]
    category_set = [{lemmatizer.lemmatize(token) for token in token_set} for token_set in ranked_phrases_split]
    return category_set
</code><br>
Also not used here:
Word Mover's Distance, from [link](https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-similarity-a-use-case-in-nlp-e83e08469c1c)
<code style="font-size: 10px; background-color:transparent;">
stop_words = stopwords.words('english')
def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]
model_wv = api.load('word2vec-google-news-300')
query = "quiet keyboards"
product_titles = unique_products.product_title.map(preprocess)
%time sim_scores = product_titles.apply(model_wv.wmdistance, args = (preprocess(query), ))  # WMD is a distance: lower means more similar
unique_products.iloc[np.argsort(sim_scores.values), 1].head(10).values  # ascending order puts the closest titles first
</code>