<a href="https://colab.research.google.com/github/pprathi2018/Amazon-Product-Review-Rankings/blob/main/Final_Proj_Amazon_Product_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Completed:

- pre-processing: tokenization, removing punctuation, lowercase, removing stop words, stemming
- query likelihood model with laplace smoothing
- query likelihood model with linear interpolation
- word embedding model with cosine similarity of vectors
- bigram model
- comparing weightage of title vs body
- evaluation with MAP
- results

Load the data from https://nijianmo.github.io/amazon/index.html into a dataframe. We selected the AMAZON FASHION reviews dataset with over 880000 reviews to process. Perform some pre-processing to replace NaN values with something easier to handle later.

In [179]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [180]:
import pandas as pd
import gzip
import json

def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('/content/drive/MyDrive/Information Retrieval/AMAZON_FASHION.json.gz')

In [181]:
df['reviewText'] = df['reviewText'].fillna("")
df['summary'] = df['summary'].fillna("")
df['vote'] = df['vote'].fillna(0)
df = df.reset_index()

In [182]:
df.head()

Unnamed: 0,index,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,0,5.0,True,"10 20, 2014",A1D4G1SNUZWQOT,7106116521,Tracy,Exactly what I needed.,perfect replacements!!,1413763200,0,,
1,1,2.0,True,"09 28, 2014",A3DDWDH9PX2YX2,7106116521,Sonja Lau,"I agree with the other review, the opening is ...","I agree with the other review, the opening is ...",1411862400,3,,
2,2,4.0,False,"08 25, 2014",A2MWC41EW7XL15,7106116521,Kathleen,Love these... I am going to order another pack...,My New 'Friends' !!,1408924800,0,,
3,3,2.0,True,"08 24, 2014",A2UH2QQ275NV45,7106116521,Jodi Stoner,too tiny an opening,Two Stars,1408838400,0,,
4,4,3.0,False,"07 27, 2014",A89F3LQADZBS5,7106116521,Alexander D.,Okay,Three Stars,1406419200,0,,


For 3 unique products, write 3 different queries and annotate the results for all the reviews of that product

In [183]:
# Product ID: B0097AJS2U

p1_id = [v for v in df.loc[df['asin'] == 'B0097AJS2U', 'index']]

# Do the glasses fit well?
relevance_q1_1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
answers_q1_1 = {p1_id[i]:relevance_q1_1[i] for i in range(len(p1_id))}

# Are these glasses worth the money?
# relevance_array2 = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Are these glasses stylish?
relevance_q1_2 = [0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0 ]
answers_q1_2 = {p1_id[i]:relevance_q1_2[i] for i in range(len(p1_id))}


# Product ID: B00IOHDIV0

p2_id = [v for v in df.loc[df['asin'] == 'B00IOHDIVO', 'index']]

# Is it big enough to carry a lot of stuff?
relevance_q2_1 = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
answers_q2_1 = {p2_id[i]:relevance_q2_1[i] for i in range(len(p2_id))}

# Is the material good quality?
relevance_q2_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
answers_q2_2 = {p2_id[i]:relevance_q2_2[i] for i in range(len(p2_id))}


# Product ID: B00QV3OFMY

p3_id = [v for v in df.loc[df['asin'] == 'B00QV3OFMY', 'index']]

# Does it rust or change color?
relevance_q3_1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
answers_q3_1 = {p3_id[i]:relevance_q3_1[i] for i in range(len(p3_id))}

# Does it last long without damage?
relevance_q3_2 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
answers_q3_2 = {p3_id[i]:relevance_q3_2[i] for i in range(len(p3_id))}

In [184]:
from collections import Counter
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stopwords_dict = Counter(stopwords.words('english'))
ps = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [185]:
# Pre-processing of both query and reviews 
def preprocess_doc(document):
  # remove capitalization and punctuation
  document = document.lower().translate(str.maketrans('', '', string.punctuation))

  # removing stopwords
  document_split = [word for word in document.split() if word not in stopwords_dict]

  # remove stopwords and stemming
  # document_split = [ps.stem(word) for word in document.split() if word not in stopwords_dict]

  # stemming
  # document_split = [ps.stem(word) for word in document.split()]

  # return document
  return document_split

# Create the bag of words for the document
def count_terms(document):
  return Counter(preprocess_doc(document))

In [186]:
# Convert the relevant information acquired from the dataset in a dictionary that is easier to manipulate
# The final structure look something like:
# product_id:
#   - review_id:
#     - text: bag of words
#     - summary: bag of words
#     - time: time
#     - vote: number of helpful votes

def store_metadata(df_group):
  ids = df_group['index']
  text = df_group['reviewText']
  summary = df_group['summary']
  time = df_group['unixReviewTime']
  vote = df_group['vote']

  return {id: {'text': count_terms(text[id]), 'summary': count_terms(summary[id]), 'time': time[id], 'vote': vote[id]} for id in ids}

In [187]:
bag_of_words = df.reset_index().groupby('asin').apply(store_metadata).to_dict()

In [188]:
len(bag_of_words)

186189

In [189]:
# example
bag_of_words['B0097AJS2U'][94073]

{'text': Counter({'awesome': 1,
          'glasss': 1,
          'bought': 1,
          '2': 1,
          'pairs': 1,
          'black': 1,
          'one': 1,
          'brown': 1}),
 'summary': Counter({'awesome': 1}),
 'time': 1406678400,
 'vote': 0}

In [190]:
def query_likelihood_laplace(product_id, review_id, section, term, V):
  
  n = bag_of_words[product_id][review_id][section][term] + 1
  d = sum(bag_of_words[product_id][review_id][section].values()) + V
  return n / d

In [191]:
# MODEL 1
# Based on query likelihood model with laplace smoothing  

# Iterate through and score the reviews of the given product
# Return the top results 
def search_reviews_laplace(product_id, query, num_results=10, summary_weight = 0):
  results = Counter()
  query_term_count = count_terms(query)

  V = len(set([term for i in bag_of_words[product_id].keys()
   for term in bag_of_words[product_id][i]['summary'] + bag_of_words[product_id][i]['text']]))

  product_reviews = bag_of_words[product_id]
  for review_id in product_reviews:

    score = 0
    for term in query_term_count:
      score += ((1 - summary_weight) * (query_term_count[term] * query_likelihood_laplace(product_id, review_id, 'text', term, V)))
      score += ((summary_weight) * (query_term_count[term] * query_likelihood_laplace(product_id, review_id, 'summary', term, V)))

    results[review_id] = score
    
  return results.most_common(num_results)

In [192]:
# MODEL 2
# Based on query likelihood model using linear interpolation smoothing
# Considers both term frequency and inverse document frequency 

def query_likelihood_model(product_id, query, num_results=10, summary_weight = 0):
  smoothing_param=0.5
  results = Counter()
  query_term_count = count_terms(query)

  product_reviews = bag_of_words[product_id]

  collection_count_text = 0
  collection_count_summary = 0
  for x in product_reviews:
    collection_count_text += sum(product_reviews[x]['text'].values())
    collection_count_summary += sum(product_reviews[x]['summary'].values())

  collection_query_term_count_text = Counter()
  collection_query_term_count_summary = Counter()
  for term in query_term_count:
    collection_term_frequency_text = 0
    collection_term_frequency_summary = 0
    for x in product_reviews:
      if term in product_reviews[x]['text']:
        collection_term_frequency_text += product_reviews[x]['text'][term]
      if term in product_reviews[x]['summary']:
        collection_term_frequency_summary += product_reviews[x]['summary'][term]
    collection_query_term_count_text[term] = (collection_term_frequency_text / collection_count_text)
    collection_query_term_count_summary[term] = (collection_term_frequency_summary / collection_count_summary)

  for review_id in product_reviews:
    text_score = 1
    summary_score = 1

    doc_count_text = sum(product_reviews[review_id]['text'].values())
    doc_count_summary = sum(product_reviews[review_id]['summary'].values())
    for term in query_term_count:
      doc_term_count_text = product_reviews[review_id]['text'][term] if term in product_reviews[review_id]['text'] else 0
      doc_term_count_summary = product_reviews[review_id]['summary'][term] if term in product_reviews[review_id]['summary'] else 0

      text_score *= (((smoothing_param * (doc_term_count_text / doc_count_text)) + ((1 - smoothing_param) * collection_query_term_count_text[term])) ** query_term_count[term])
      summary_score *= (((smoothing_param * (doc_term_count_summary / doc_count_summary)) + ((1 - smoothing_param) * collection_query_term_count_summary[term])) ** query_term_count[term])

    score = ((1 - summary_weight) * text_score) + (summary_weight * summary_score)
    results[review_id] = score
  
  return results.most_common(num_results)
    


In [193]:
# Sample results
search_reviews_laplace('B0097AJS2U', "Do the glasses fit well?")

[(94074, 0.03333333333333334),
 (434957, 0.02777777777777778),
 (94093, 0.027472527472527472),
 (94090, 0.026881720430107527),
 (434966, 0.02247191011235955),
 (94091, 0.022099447513812154),
 (434965, 0.022099447513812154),
 (94080, 0.02197802197802198),
 (434960, 0.02197802197802198),
 (94096, 0.02185792349726776)]

In [194]:
# Sample results
query_likelihood_model('B0097AJS2U', "Do the glasses fit well?")

[(434957, 0.0001657508980666387),
 (94093, 8.13710728955696e-05),
 (94090, 3.519827077919216e-05),
 (434966, 1.9691392423008617e-05),
 (434960, 1.605128013282469e-05),
 (94096, 1.3971215967005303e-05),
 (94099, 1.3971215967005303e-05),
 (94074, 1.2807927486726555e-05),
 (94091, 8.771055552456837e-06),
 (434965, 8.771055552456837e-06)]

In [195]:
# Sample results 
i = 0
for (k,v) in query_likelihood_model('B0097AJS2U', "Do the glasses fit well?"):
  i += 1
  print (f"Rank {i}")
  print(df.iloc[k]['reviewText'])
  print('\n')

Rank 1
Wished they fit over glasses, other than that ok.


Rank 2
Love these glasses, fits great, very well made


Rank 3
Love these glasses! So comfortable, well made and so beautiful! I have gotten several complements! I so recommend these!


Rank 4
Love these glasses


Rank 5
Absolutely love these. Best fit, great quality.


Rank 6
What can I say? The picture does NOT do these justice! These are fantastic! They fit perfect, they are stylish!


Rank 7
Perfect gift for our daughter. Well made and a good value.


Rank 8
Like my glasses ALOT....pic doesn't do justice the glasses look way better in person...like having the case.  Checked these out in store and run from $40-$50.  So price is good.  Very happy with my glasses,  lens quality seems real good so far, I've only had them a short time,  but they seem well made :)!!!  They are so Cool!! B)
Olo


Rank 9
These glasses are Amazing I Love Them!! Thanks So Much!!


Rank 10
LARGE but FABULOUS, "STAR" glasses. LOVE THEM




Word Embeddings

In [196]:
import gensim
import numpy as np
from numpy.linalg import norm
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [197]:
# store lists of tokens that are pre-processed
def store_metadata_embeddings(df_group):
  ids = df_group['index']
  text = df_group['reviewText']
  summary = df_group['summary']
  time = df_group['unixReviewTime']
  vote = df_group['vote']

  return {id: {'text': preprocess_doc(text[id]), 'summary': preprocess_doc(summary[id])} for id in ids}

In [198]:
bag_of_word_embeddings = df.reset_index().groupby('asin').apply(store_metadata_embeddings).to_dict()

In [199]:
# build list of lists of tokens for training of word embeddings 
lines = []

for product_dict in bag_of_word_embeddings.values():
  for review_dict in product_dict.values():
    lines.append(review_dict['text'])
    lines.append(review_dict['summary'])

In [200]:
# model = Word2Vec(lines)
# word embedding model is trained to output vectors of size 100
model = Word2Vec(
    lines,
    size=100,
    min_count=3,  
    sg = 1,       
    window=7,
    workers=4    
)       

In [201]:
# converts each token in the provided doc to a word embedding vector and calculates
# the mean of all the vectors to return one vector representing
# the word embedding for the doc
def get_word_embeddings(doc):
  embeddings = []
  if len(doc) < 1:
    return np.zeros(300)
  else:
    for word in doc:
      if word in model.wv.vocab:
        embeddings.append(model.wv.word_vec(word))
      else:
        embeddings.append(np.random.rand(100))
    return np.mean(embeddings, axis=0)

In [202]:
def cos_sim(a, b):
  return np.dot(a, b)/(norm(a)*norm(b))

# MODEL 3
# Based on word embedding and vector space model using cosine similarity to compare vectors

def word_embedding_rankings(product_id, query, num_results=10, summary_weight=0):
  tokenized_query = preprocess_doc(query)
  results = Counter()

  for review_id in bag_of_word_embeddings[product_id]:
    text_cs = cos_sim(np.array(get_word_embeddings(bag_of_word_embeddings[product_id][review_id]['text'])),np.array(get_word_embeddings(tokenized_query)))
    summary_cs = cos_sim(np.array(get_word_embeddings(bag_of_word_embeddings[product_id][review_id]['summary'])),np.array(get_word_embeddings(tokenized_query)))

    results[review_id] = ((1 - summary_weight) * text_cs) + (summary_weight * summary_cs)
  
  return results.most_common(num_results)

In [203]:
# Sample results
word_embedding_rankings('B0097AJS2U', 'Do the glasses fit well?')

[(94093, 0.8823963403701782),
 (434957, 0.881686270236969),
 (434966, 0.8397279381752014),
 (94074, 0.8249606660818134),
 (94097, 0.7908575149952524),
 (94091, 0.7872361540794373),
 (434961, 0.7703776955604553),
 (434965, 0.7618000507354736),
 (94092, 0.7583695650100708),
 (94090, 0.756409227848053)]

In [204]:
i = 0
for (k,v) in word_embedding_rankings('B0097AJS2U', 'Do the glasses fit well?'):
  i += 1
  print(f'Rank {i}')
  print(df.iloc[k]['reviewText'])
  print('\n')

Rank 1
Love these glasses, fits great, very well made


Rank 2
Wished they fit over glasses, other than that ok.


Rank 3
Love these glasses


Rank 4
Like my glasses ALOT....pic doesn't do justice the glasses look way better in person...like having the case.  Checked these out in store and run from $40-$50.  So price is good.  Very happy with my glasses,  lens quality seems real good so far, I've only had them a short time,  but they seem well made :)!!!  They are so Cool!! B)
Olo


Rank 5
These glasses are gorgeous I always get compliments...come in a really nice zippered case. Fast shipping


Rank 6
These glasses are Amazing I Love Them!! Thanks So Much!!


Rank 7
Absolutely love these sunglasses, only ones I'll buy!!


Rank 8
LARGE but FABULOUS, "STAR" glasses. LOVE THEM


Rank 9
Love these. Fit and look great, was presently surprised to see they came in a nice case with a cleaning cloth. Received them faster than expected.


Rank 10
Love these glasses! So comfortable, well made and

Evaluation Model

In [206]:
from statistics import mean

def average_precision_query(results, answers):
  i = 0
  num_relevant = 0
  precisions= []
  for (review_id, score) in results:
    i += 1
    if answers[review_id]:
      num_relevant += 1
      precisions.append(num_relevant/i)
  if len(precisions) > 0:
    return mean(precisions)
  else:
    return 0
  

def evaluate_map(model, summary_weight = 0):
  map = []

  # product 1 query 1
  p1q1_results = model('B0097AJS2U', "Do the glasses fit well?", 10, summary_weight)
  map.append(average_precision_query(p1q1_results, answers_q1_1))

  # product 1 query 2
  p1q2_results = model('B0097AJS2U', "Are these glasses stylish?", 10, summary_weight)
  map.append(average_precision_query(p1q2_results, answers_q1_2))

  # product 2 query 1
  p2q1_results = model('B00IOHDIVO', "Is it big enough to carry a lot of stuff?", 10, summary_weight)
  map.append(average_precision_query(p2q1_results, answers_q2_1))

  # product 2 query 2
  p2q2_results = model('B00IOHDIVO', "Is the material good quality?", 10, summary_weight)
  map.append(average_precision_query(p2q2_results, answers_q2_2))

  # product 3 query 1
  p3q1_results = model('B00QV3OFMY', "Does it rust or change color?", 10, summary_weight)
  map.append(average_precision_query(p3q1_results, answers_q3_1))

  # product 3 query 2
  p3q2_results = model('B00QV3OFMY', "Does it last long without damage?", 10, summary_weight)
  map.append(average_precision_query(p3q2_results, answers_q3_2))

  return mean(map)

In [207]:
evaluate_map(search_reviews_laplace)

0.6653659611992945

In [208]:
evaluate_map(query_likelihood_model)

0.4017019400352734

In [209]:
evaluate_map(word_embedding_rankings)

0.8141132842025699

Evaluation with modified text vs. summary weights

In [210]:
def range_with_floats(start, stop, step):
    while stop > start:
        yield start
        start += step
for i in range_with_floats(0.0, 0.6, 0.1):
  print(f"Summary weight - {round(i, 1)}")
  print("------------")
  print(f'Laplace model - {round(evaluate_map(search_reviews_laplace, i), 4)}')
  print(f'Linear interpolation model - {round(evaluate_map(query_likelihood_model, i), 4)}')
  print(f'Word embedding model - {round(evaluate_map(word_embedding_rankings, i), 4)}')
  print ('\n')


Summary weight - 0.0
------------
Laplace model - 0.6654
Linear interpolation model - 0.4017
Word embedding model - 0.8094


Summary weight - 0.1
------------
Laplace model - 0.6808
Linear interpolation model - 0.4017
Word embedding model - 0.8348


Summary weight - 0.2
------------
Laplace model - 0.6889
Linear interpolation model - 0.4017
Word embedding model - 0.7923


Summary weight - 0.3
------------
Laplace model - 0.6919
Linear interpolation model - 0.4123
Word embedding model - 0.7749


Summary weight - 0.4
------------
Laplace model - 0.6919
Linear interpolation model - 0.4053
Word embedding model - 0.7621


Summary weight - 0.5
------------
Laplace model - 0.69
Linear interpolation model - 0.3923
Word embedding model - 0.743




N-grams

In [211]:
def count_bigrams(document):
  document_split = preprocess_doc(document)
  bigrams = []

  bigrams.append("%s, " + document_split[0])
  for i in range(len(document_split) - 2):
    bigrams.append(document_split[i] + ", " + document_split[i+1])

  bigrams.append(document_split[len(document_split) - 1] + ", %s")

  return Counter(bigrams)

In [212]:
# Convert the relevant information acquired from the dataset in a dictionary that is easier to manipulate
# The final structure look something like:
# product_id:
#   - review_id:
#     - text: bag of words
#     - summary: bag of words
#     - time: time
#     - vote: number of helpful votes

def store_metadata_bigrams(df_group):
  ids = df_group['index']
  text = df_group['reviewText']
  summary = df_group['summary']
  time = df_group['unixReviewTime']
  vote = df_group['vote']

  return {id: {'text': count_bigrams(text[id]), 'summary': count_bigrams(summary[id]), 'time': time[id], 'vote': vote[id]} for id in ids}

In [213]:
bag_of_words_bigrams = df.reset_index().groupby('asin').apply(store_metadata).to_dict()

In [218]:
# Iterate through and score the reviews of the given product
# Return the top results 
def search_reviews_bigrams(product_id, query, num_results=10, summary_weight = 0.5):
  results = Counter()
  query_term_count = count_bigrams(query)

  V = len(set([term for i in bag_of_words[product_id].keys()
   for term in bag_of_words[product_id][i]['summary'] + bag_of_words[product_id][i]['text']]))
  
  product_reviews = bag_of_words[product_id]
  for review_id in product_reviews:

    score = 0
    for term in query_term_count:
      score += ((1 - summary_weight) * (query_term_count[term] * query_likelihood_laplace(product_id, review_id, 'text', term, V)))
      score += (summary_weight * (query_term_count[term] * query_likelihood_laplace(product_id, review_id, 'summary', term, V)))
      # score += vote
      # score += f(time)

    results[review_id] = score
    
  return results.most_common(num_results)

In [219]:
evaluate_map(search_reviews_bigrams)

0.10246913580246914

Model Results and Evaluation

lowercase and punctuation

removing stopwords

laplace model - 0.6900

linear interpolation model - 0.402

word embedding model - 0.826

bigram model - 0.017

Summary weight - 0.0
------------
Laplace model - 0.6654

Linear interpolation model - 0.4017

Word embedding model - 0.8286


Summary weight - 0.1
------------
Laplace model - 0.6808

Linear interpolation model - 0.4017

Word embedding model - 0.8347


Summary weight - 0.2
------------
Laplace model - 0.6889

Linear interpolation model - 0.4017

Word embedding model - 0.7922


Summary weight - 0.3
------------
Laplace model - 0.6919

Linear interpolation model - 0.4123

Word embedding model - 0.7867


Summary weight - 0.4
------------
Laplace model - 0.6919

Linear interpolation model - 0.4053

Word embedding model - 0.7655


Summary weight - 0.5
------------
Laplace model - 0.69

Linear interpolation model - 0.3923

Word embedding model - 0.7494

--------------------------

lowercase and punctuation

removing stopwords

stemming tokens

laplace model - 0.6803

linear interpolation model - 0.398

word embedding model - 0.8451

bigram model - 0.0796

Summary weight - 0.0
------------
Laplace model - 0.6529

Linear interpolation model - 0.398

Word embedding model - 0.832


Summary weight - 0.1
------------
Laplace model - 0.6721

Linear interpolation model - 0.398

Word embedding model - 0.8796


Summary weight - 0.2
------------
Laplace model - 0.684

Linear interpolation model - 0.398

Word embedding model - 0.8408


Summary weight - 0.3
------------
Laplace model - 0.6826

Linear interpolation model - 0.3961

Word embedding model - 0.8222


Summary weight - 0.4
------------
Laplace model - 0.6826

Linear interpolation model - 0.4011

Word embedding model - 0.8138


Summary weight - 0.5
------------
Laplace model - 0.6803

Linear interpolation model - 0.4011

Word embedding model - 0.7986

-----------

lowercase and punctuation

stemming tokens

laplace model - 0.7723

linear interpolation model - 0.4085

word embedding model - 0.8044

bigram model - 0.014

Summary weight - 0.0
------------
Laplace model - 0.8006

Linear interpolation model - 0.4085

Word embedding model - 0.8044


Summary weight - 0.1
------------
Laplace model - 0.7847

Linear interpolation model - 0.4085

Word embedding model - 0.7693


Summary weight - 0.2
------------
Laplace model - 0.7882

Linear interpolation model - 0.4085

Word embedding model - 0.7594


Summary weight - 0.3
------------
Laplace model - 0.8091

Linear interpolation model - 0.4085

Word embedding model - 0.7704


Summary weight - 0.4
------------
Laplace model - 0.8034

Linear interpolation model - 0.4085

Word embedding model - 0.8425


Summary weight - 0.5
------------
Laplace model - 0.7723

Linear interpolation model - 0.4085

Word embedding model - 0.8131
