##### CSCE 670 :: Information Storage and Retrieval - Final Project Report

<h1><center>Amazon Fake Reviews Classifier and Analysis</center></h1>
<h4><center> Josiah Coad, Savinay Narendra, Sheelabhadra Dey, Chaiwei Chang, Kevin Chang</center></h4>
Github: [Click Me](https://github.com/josiahcoad/Faker)<br>
Data: Please refer to the `library` folder in the repository. (Source: Amazon)<br>


## 1. Introduction

  Online reviews have a major influence on the purchasing decisions of potential customers, and positive reviews can translate into significant financial gains. This creates a strong incentive for fraudulent reviews, a practice commonly called opinion spamming, which includes fake blogs, fake reviews, deceptive advertising and more. Our work focuses on product reviews, using Amazon datasets from the University of Illinois at Chicago [1] and from Professor Caverlee’s lab at Texas A&M University. Reports indicate that 2-6% of reviews are fake on average, rising to as much as 20% on sites such as Yelp, which leads to an unrealistic representation of places and products on the internet. Fake review cases have also been reported in the news, for example [6].

  Building on existing machine learning techniques, we applied unsupervised clustering to observe whether spam/fake reviewers cluster together. To improve the learning process, we preprocessed the data to focus on suspicious groups and individuals, because we believe that some obvious features are shared within these groups.
  
  The key difficulty in detecting fake reviews is that they are extremely hard for humans to identify. In one related work, it reportedly took a team of industry experts eight weeks to develop a labeled data set. We believe that a machine can do better by efficiently extracting the implicit information and behavioral features inside the reviews.





## 2. Data Initialization



In [2]:
from __future__ import print_function
from collections import defaultdict

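def parserJSON(path, numLines=None):
  # each non-empty line of the file holds one review as a Python dict literal
  with open(path) as txt:
    lines = [line for line in txt if line.strip()]
  if numLines is not None:
    lines = lines[:numLines]
  # eval executes whatever is on the line, so only use this on a trusted data file
  reviews = [eval(line) for line in lines]
  print("Number of reviews:", len(reviews))
  return reviews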


def get_reviewers(reviews):
   reviewers = {}
   for review in reviews:
      reviewerId = review["memberId"]
      if reviewerId not in reviewers:
         reviewers[reviewerId] = [review]
      else:
         reviewers[reviewerId].append(review)
   print("Number of reviewers:", len(reviewers))
   return reviewers

def remove_lessthan3(reviewers_reviews):
   final = {}
   for reviewer, reviews in reviewers_reviews.items():
      reviews = list(filter(lambda review: review["Rate"] == 1 or review["Rate"] == 5, reviews))
      if len(reviews) >= 3:
            final[reviewer] = sorted(reviews, key=lambda review: review["productId"])
   print("Number of reviewers with 3+ reviews rated 1 or 5 star:", len(final))
   return final

def get_products(reviews):
   products = {}
   for review in reviews:
      productId = review["productId"]
      if productId not in products:
         products[productId] = [review]
      else:
         products[productId].append(review)
   return products

def normalizedVector(vector):
    total = 0
    for key in vector:
        total += vector[key] ** 2
    total = total ** 0.5
    for key in vector:
        vector[key] /= total
    return vector


# from modules.amazon_parser import *
reviewers_products = []

# get a list of dictionaries, one per review (including metadata like product id and user id)
reviews = parserJSON('./library/amazon-review-data.json')
# map each reviewer id to the list of review objects that reviewer wrote
reviewers_reviews_dict = get_reviewers(reviews)
reviewers_reviews = reviewers_reviews_dict
# keep only 1-star and 5-star reviews, and drop reviewers with fewer than 3 of them
reviewers_reviews = remove_lessthan3(reviewers_reviews)

# create a new list of tuples... with first entry being the reviewer 
# and second being a list of the product ids reviewed
for reviewer, reviews in reviewers_reviews.items():
   reviewers_products.append( (reviewer, [review["productId"] for review in reviews]) )

# get a sorted list of reviews that a user left for products which match 'productIds'
def get_product_reviews(productIds, userId):
   return [review for review in reviewers_reviews_dict[userId] if review["productId"] in productIds]



Number of reviews: 99117
Number of reviewers: 4743
Number of reviewers with 3+ reviews rated 1 or 5 star: 3268
('A2R1SS382YW679', [])

('A135L0KYJC3K4H', [])

('A1NGEEN1F7FVMK', [])

('A3L7Z3ZXGIMWD3', [])
Number of groups:  261
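
Review lines stored as Python dict literals can also be parsed without executing arbitrary code. The following is a minimal sketch, assuming each line holds exactly one dict literal; `parse_review_lines` and the sample records are hypothetical and only illustrate the idea with the standard library's `ast.literal_eval`.

```python
import ast

def parse_review_lines(lines, num_lines=None):
    """Parse an iterable of text lines, each holding one dict literal."""
    reviews = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        # literal_eval only accepts literals, so code on a line is rejected
        reviews.append(ast.literal_eval(line))
        if num_lines is not None and len(reviews) >= num_lines:
            break
    return reviews

# hypothetical sample records using the field names from this notebook
sample = [
    "{'memberId': 'u1', 'productId': 'p1', 'Rate': 5}",
    "",
    "{'memberId': 'u2', 'productId': 'p1', 'Rate': 1}",
]
parsed = parse_review_lines(sample)
first_only = parse_review_lines(sample, num_lines=1)
```

Passing an open file object instead of `sample` would parse a file line by line without loading it twice.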



### 2.1 Group Indicator
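
The grouping step in this section pairs a reference reviewer with every later reviewer who shares at least three products with them. The sketch below reproduces that idea on toy data; `candidate_groups`, `min_shared`, and the toy IDs are hypothetical names used only for illustration, not part of the project code.

```python
def candidate_groups(reviewers_products, min_shared=3):
    """Collect, for each reviewer, the later reviewers sharing enough products."""
    groups = []
    for i, (ref_id, ref_products) in enumerate(reviewers_products):
        members = [ref_id]
        for other_id, other_products in reviewers_products[i + 1:]:
            shared = set(ref_products) & set(other_products)
            if len(shared) >= min_shared:
                members.append(other_id)
        if len(members) >= 2:  # a group needs the reference user plus one more
            groups.append(members)
    return groups

# toy (reviewer, product ids) tuples in the same shape as reviewers_products
toy = [
    ("u1", ["p1", "p2", "p3", "p4"]),
    ("u2", ["p1", "p2", "p3"]),
    ("u3", ["p2", "p3", "p4", "p5"]),
    ("u4", ["p9"]),
]
found = candidate_groups(toy)

# products reviewed by every member of the first group
common = set(toy[0][1]) & set(toy[1][1]) & set(toy[2][1])
```

In this toy input `u2` and `u3` share only two products with each other, yet both join the group of `u1`, so the set of products common to every member (`p2` and `p3` here) can be smaller than the pairwise threshold.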




In [6]:
groups = []
for i in range(len(reviewers_products)-1):
   ref_user = reviewers_products[i]
   newgroup = [ref_user]
   for j in range(i+1, len(reviewers_products)):
      compare_user = reviewers_products[j]
      shared_products = set(ref_user[1]).intersection(set(compare_user[1]))
      if len(shared_products) >= 3:
         newgroup.append(compare_user)
   if len(newgroup) >= 2:
      group_products = sorted(list(set(ref_user[1]).intersection(*[set(user[1]) for user in newgroup])))
      newgroup = [( user[0], get_product_reviews(group_products, user[0]) ) for user in newgroup]
      groups.append(newgroup)
print(*groups[0], sep="\n\n")
print("Number of groups: ", len(groups))

import math, re, string
from collections import defaultdict
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def purify(s):
   # lowercase, strip punctuation, drop a few stop words, then stem each remaining word
   s = re.sub('[%s]' % re.escape(string.punctuation), '', s.lower())
   words = [word for word in re.split(r'\W+', s) if word and word not in ('a', 'an', 'and', 'but', 'the')]
   return [ps.stem(word) for word in words]


def cosine_sim(string1, string2):
   count1 = defaultdict(int)
   count2 = defaultdict(int)
   # purify() already lowercases and stems, so the tokens are counted directly
   for word in purify(string1):
      count1[word] += 1
   for word in purify(string2):
      count2[word] += 1
   dot_product = sum(count1[key]*count2.get(key, 0) for key in count1)
   magnitude = math.sqrt(sum(val**2 for val in count1.values())) * math.sqrt(sum(val**2 for val in count2.values()))
   return dot_product/magnitude if magnitude else 0

from numpy import mean as avg

review_objects = parserJSON('./library/amazon-review-data.json')

products_dict  = get_products(review_objects) # create a dict with product ID as the key and a list of the product's reviews as the value



MAX_USERS = 5 # found previously
MAX_PRODS = 7 # found previously


with open("./library/groups.txt") as f:
   groups = eval(f.read())

# takes a dictionary of groups which are organized by groupID as the key and a list of tuples as the value
# return a list of groups where each group is structured as: [(product, [reviews]), (product, [reviews])]
def organize_by_product(groups_dict):
   group_list = []
   for groupId, group in groups_dict.items():
      reviews = []
      for user, user_reviews in group:
         reviews.extend(user_reviews)
      products_reviews = defaultdict(list)
      for review in reviews:
         products_reviews[review["productId"]].append(review)
      group_list.append( products_reviews.items() )
   return group_list

groups_by_products = organize_by_product(groups)

# takes a dictionary of groups which are organized by groupID as the key and a list of tuples as the value
# return a list of groups where each group is structured as: [(reviewer, [reviews]), (reviewer, [reviews])]
def organize_by_user(groups_dict):
   return [groups_dict[key] for key in groups_dict]

groups_by_reviewers = organize_by_user(groups)

def get_avg(Name):
    # mean star rating over all reviews of the product
    product_reviews = products_dict[Name]
    if len(product_reviews) > 0:
        total = sum(review["Rate"] for review in product_reviews)
        # convert before dividing so Python 2 does not truncate the mean
        return float(total) / len(product_reviews)
    else:
        return 0

# Group Deviation (GD)
def GD(group):
    Deviation = []
    handle = set()
    for cur_user in group:
        for item in cur_user[1]:
            if item["productId"] not in handle:
                handle.add(item["productId"])
                # deviation of the 1- or 5-star rating from the product mean, scaled to [0, 1]
                if item["Rate"] == 5:
                    Deviation.append(abs(5 - get_avg(item["productId"])) / 4)
                elif item["Rate"] == 1:
                    Deviation.append(abs(get_avg(item["productId"]) - 1) / 4)
    # a group with no 1- or 5-star reviews has no deviation to report
    return max(Deviation) if Deviation else 0

# Group Member Content Similarity
def GMCS(group):
  MCS = []
  for cur_user in group:
    user_reviews = cur_user[1]
    total = 0.0
    count = 0
    for x in range(len(user_reviews)-1): # each pair of reviews by the same member
      for y in range(x+1, len(user_reviews)):
        total += cosine_sim(user_reviews[x]["reviewText"], user_reviews[y]["reviewText"])
        count += 1
    # a member with fewer than two reviews has no pairs to compare
    MCS.append(total / count if count else 0.0)
  return sum(MCS) / len(group)

# Group Size (GS) (number of users in group)
def GS(group_by_users):
    return float(len(group_by_users)) / MAX_USERS

# Group Size Ratio (GSR) (returns 1 if each product in the group were only reviewed by the group members)
def GSR(group_by_products):
  return avg ( [gsr(product, reviews) for product, reviews in group_by_products] )

def gsr(product, reviews):
  return float(len(reviews)) / len(products_dict[product])
# ------------------------

def GTW(group):
   return max([prod_TW(reviews) for product, reviews in group])

def prod_TW(reviews):
   GTW_MAXTIME = 345600 # number of seconds in 4 days
   timestamps = [float(review["Date"]) for review in reviews]
   _range = max(timestamps)-min(timestamps)
   return 1-_range/GTW_MAXTIME if _range < GTW_MAXTIME else 0

def GCS(group):
   return max([CS(reviews) for product, reviews in group])

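def CS(reviews):
   texts = [review["reviewText"] for review in reviews]
   # compare each pair of distinct reviews once; comparing a review with itself always gives 1
   pairs = [(texts[i], texts[j]) for i in range(len(texts)) for j in range(i+1, len(texts))]
   return avg([cosine_sim(review1, review2) for review1, review2 in pairs]) if pairs else 0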

def GETF(group):
   return max([GTF(product, reviews) for product, reviews in group])

def GTF(product, reviews):
   GTF_MAXTIME = 15552000 # seconds in 6 months
   earliest_product_review = min([float(review["Date"]) for review in products_dict[product]])
   latest_group_review = max([float(review["Date"]) for review in reviews])
   _range = latest_group_review-earliest_product_review
   return 1-_range/GTF_MAXTIME if _range < GTF_MAXTIME else 0

# Group Support Count (GSUP) (number of products in group)
def GSUP(group):
  return float(len(group)) / MAX_PRODS

# Indicator scores for one group, in the order GCS, GTW, GETF, GSUP, GS, GSR, GD, GMCS
def scores(gbp, gbr):
   return [GCS(gbp), GTW(gbp), GETF(gbp), GSUP(gbp), GS(gbr), GSR(gbp), GD(gbr), GMCS(gbr)]


def get_all_scores():
  all_scores = []
  for i in range(len(groups_by_reviewers)):
     all_scores.append(scores(groups_by_products[i], groups_by_reviewers[i]))
  return all_scores

# ranked = [( i, sum(score) ) for i, score in enumerate(get_all_scores())]
# fakest_indexes = sorted(ranked, key=lambda k: k[1], reverse=True)
# for fakest_index, total in fakest_indexes[:5]:
#   fakest_users = [reviewer for reviewer, reviews in groups_by_reviewers[fakest_index]]
#   print(fakest_users)


('A2R1SS382YW679', [])

('A135L0KYJC3K4H', [])

('A1NGEEN1F7FVMK', [])

('A3L7Z3ZXGIMWD3', [])
Number of groups:  261
Number of reviews: 99117
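
To make the time-window indicator concrete, here is a small worked sketch of the per-product score used by `prod_TW`: reviews posted close together score near 1, and reviews spread over four days or more score 0. The function name `time_window_score` and the toy timestamps are assumptions for illustration; the four-day cutoff matches `GTW_MAXTIME`.

```python
FOUR_DAYS = 4 * 24 * 60 * 60  # 345600 seconds, the same cutoff as GTW_MAXTIME

def time_window_score(timestamps, max_window=FOUR_DAYS):
    """Score in [0, 1]: 1 when all reviews share a timestamp, 0 beyond the window."""
    span = max(timestamps) - min(timestamps)
    if span >= max_window:
        return 0.0
    return 1 - span / float(max_window)

# toy timestamps in seconds
same_day = time_window_score([0, 3600, 7200])      # reviews within two hours
one_day_apart = time_window_score([0, 86400])      # reviews one day apart
spread_out = time_window_score([0, 10 * 86400])    # reviews ten days apart
```

The early-time-frame score in `GTF` follows the same pattern, with the span measured from the product's first review and a six-month cutoff.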





### 2.2 Individual Indicator


## 3. Clustering

In [9]:
import numpy
import math
from random import randint
class som:
    def __init__(self, input, maxIterations=10, sigmaInitial=4, somCol=3, somRow=3):
        self.somCol = somCol
        self.somRow = somRow
        self.input = input
        self.maxIterations = maxIterations
        self.sigmaInitial = sigmaInitial
    
 
    def trainmodel(self):
        input = self.input
       
        somCol = self.somCol
        somRow = self.somRow
        inputvectorlen = len(input[0,:])
        inputsSize = len(input[:,0])
        #initialise neurons layer, indexed as [row, column, feature]
        somMap = numpy.zeros(shape=(somRow,somCol,inputvectorlen))
        #print somMap

        #Max number of iterations
        maxIterations = self.maxIterations

        # Initial effective width
        sigmaInitial = self.sigmaInitial

        # Time constant for sigma
        t1 = maxIterations / numpy.log(sigmaInitial)

        #Initialise matrix to store eucledian distances.
        #euclideanD = numpy.zeros(shape =(somRow, somCol))
       
        # Initialize a somRow x somCol matrix to store the neighbourhood function
        # of each neuron on the map
        neighbourhoodFunctionVal = numpy.zeros(shape =(somRow, somCol))

        # initial learning rate
        learningRateInitial = 0.1;

        #time constant for eta
        t2 = maxIterations;
        #Assign random weight vectors for all the neurons
        for num in range (0,(somRow)):
            for iter in range (0,(somCol)):
                #Squeezed the matrix into an ndarray.
                somMap[num,iter,:] = numpy.squeeze(numpy.random.rand(inputvectorlen,1))
                    
#         print "Again #printing the som with randomly initialised weight vectors"
#         print somMap

        count = 1;
        while(count < maxIterations):
            
            sigma = sigmaInitial * numpy.exp(-count/t1)
            variance = pow(sigma,2)
            #float() keeps the decay from being truncated by Python 2 integer division
            eta = learningRateInitial * numpy.exp(-count/float(t2))
    
            #Prevent eta from falling below 0.01
            if (eta < 0.01):
                eta = 0.01
            #Randomly select a weight vector from the input weight vectors.
            inputIndex = randint(0,inputsSize-1)
            selectedWeightVector = input[inputIndex,:]
          
            #Select the winning neuron which has the weight vector closest to that of selected input weight vector.
            #Find the indices of minimum eucledian distance element.
            mineuclideanD=numpy.linalg.norm(selectedWeightVector-somMap[0,0,:])
            minr=0
            minc=0
            for num in range (0,somRow):
                for iter in range (0,somCol):
                    
                    temp  = numpy.linalg.norm(selectedWeightVector-somMap[num,iter,:])
                    
                    if(temp <=mineuclideanD):
                        minr=num
                        minc=iter
                        mineuclideanD=temp
            #print euclideanD
                   
            #print 'indices are',minr,minc
        
            #compute the neighbourhood function for all the neurons
            #For the winning node the value is 1
            for r in range (0,somRow):
                for c in range (0,somCol):
                    if (r == minr and c == minc):
                        neighbourhoodFunctionVal[r, c] = 1
                        continue
                    else:
                        #squared grid distance to the winning node (** is power; ^ would be bitwise XOR)
                        distance = (minr - r)**2 + (minc - c)**2
                        neighbourhoodFunctionVal[r, c] = numpy.exp(-distance/(2*variance))
            
            #print 'neighbourhood functions are',neighbourhoodFunctionVal
            #Update weights 

            for r in range (0,somRow):
                for c in range (0,somCol):
                    oldWeightVector = somMap[r, c,:]
                    somMap[r, c,:]     = oldWeightVector + eta*neighbourhoodFunctionVal[r, c]*(selectedWeightVector - oldWeightVector)
   
            #Increment the counter
            count +=1
        
        #Return updated map of neurons.
        return somMap
        


In [11]:
import numpy as np
with open("./library/groups.txt") as f:
    groups = eval(f.read())


final_input = np.array(get_all_scores())
somCol = 10
somRow = 10
# use a separate name for the instance so the som class is not overwritten
som_model = som(final_input,12,4,somCol,somRow)
ans = som_model.trainmodel()
print ('trained model is',ans)

trained model is [[[ 0.86873737  0.722772    0.37563978  0.60210187  0.83515825  0.73670491
    0.11304375  0.36711602]
  [ 0.86879263  0.80270152  0.69239661  0.89949096  0.75284733  0.71704691
    0.75496031  0.77311934]
  [ 0.87523371  0.52073934  0.70366966  0.20782134  0.42443584  0.43883043
    0.46791777  0.73526197]
  [ 0.39466976  0.55776331  0.61962557  0.7222089   0.56766668  0.14485464
    0.36985992  0.59708854]
  [ 0.36499381  0.60159268  0.68025203  0.26062214  0.61416041  0.36399045
    0.07610384  0.51782515]
  [ 0.74068678  0.64419985  0.85684241  0.82316189  0.38558285  0.2470232
    0.35141184  0.30793878]
  [ 0.81968896  0.55032513  0.9085003   0.49987986  0.63856564  0.21034179
    0.2521369   0.58636993]
  [ 0.69955258  0.93980972  0.98241344  0.84026044  0.39584135  0.31046762
    0.07363662  0.38181184]
  [ 0.760588    0.85069933  0.77104002  0.67463311  0.38024     0.23136249
    0.13964104  0.2902743 ]
  [ 0.66962262  0.8243615   0.95790111  0.74175401  0.434
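
The notebook stops at the trained weight map. One common way to turn such a map into cluster assignments, shown here only as a hedged sketch, is to assign each score vector to its best matching unit (the neuron whose weights are closest in Euclidean distance). The helpers `best_matching_unit` and `assign_clusters` and the toy map are hypothetical; they accept nested lists as well as NumPy arrays.

```python
import math
from collections import defaultdict

def best_matching_unit(som_map, vector):
    """Return the (row, col) of the neuron closest to the vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(som_map):
        for c, weights in enumerate(row):
            dist = math.sqrt(sum((w - v) ** 2 for w, v in zip(weights, vector)))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def assign_clusters(som_map, vectors):
    """Group vector indexes by the neuron they map to."""
    clusters = defaultdict(list)
    for index, vector in enumerate(vectors):
        clusters[best_matching_unit(som_map, vector)].append(index)
    return dict(clusters)

# toy 2x2 map with 3 features per neuron
toy_map = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]],
]
toy_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [1.0, 0.0, 0.1]]
bmu = best_matching_unit(toy_map, toy_vectors[0])
clusters = assign_clusters(toy_map, toy_vectors)
```

Applied to this project, the vectors would be the rows returned by `get_all_scores()` and the map would be the array returned by `trainmodel()`.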

## 4. Result and Analysis


## 5. Reference

[1] http://liu.cs.uic.edu/download/data/

[2] https://www.cs.uic.edu/~liub/publications/WWW-2012-group-spam-camera-final.pdf

[3] https://www.hindawi.com/journals/mpe/2016/4935792/

[4] https://www.cs.uic.edu/~liub/FBS/fake-reviews.html

[5] http://cs231n.github.io/neural-networks-1/

[6] http://www.bbc.com/news/technology-22166606


# {delete before submission} Proposal -- You may need to refer to this paragraph

Prior work on opinion spam focused on detecting fake reviews and individual fake reviewers. However, a fake reviewer group (a group of reviewers who work collaboratively to write fake reviews) is even more damaging, because its size lets it take total control of the sentiment on a target product. We therefore follow our base paper and try to identify fake reviews by clustering spammers into groups, also called group spammers [1]. Group spamming refers to a group of reviewers writing fake reviews together to promote or demote some target products. The experiments in the base paper [1] show that it is hard to detect spammer groups using review content features, or even indicators of abnormal individual behavior, because a group has more manpower to post reviews and each member may therefore no longer appear to behave abnormally. A group of reviewers refers to a set of reviewer IDs. The actual reviewers behind the IDs could be a single person with multiple IDs (sockpuppets), multiple persons, or a combination of both.


We will also implement the relation-based model “GSRank” described in [2], using an Artificial Neural Network [5], as shown in Figure 1. GSRank is an unsupervised iterative algorithm that works differently from the traditional supervised learning approach to spam detection. The paper [2] concludes from its experiments that GSRank performs better than state-of-the-art supervised classification, regression, and learning-to-rank algorithms. We follow this paper [2] to build a more effective model that considers the inter-relationships among products, groups, and group members when computing group spamicity; in other words, we will try to reimplement the paper from scratch. Finally, after obtaining the two sets of results mentioned above, we will evaluate them with Precision, Recall and NDCG.
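
The evaluation metrics named above can be sketched as follows for a ranked list of groups. This is a minimal illustration, assuming binary relevance labels (1 for a spam group, 0 otherwise) listed in ranked order; the function names and the toy labels are hypothetical, and the NDCG variant uses the common `rel / log2(rank + 1)` discount.

```python
import math

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked_labels[:k]
    return sum(top) / float(len(top)) if top else 0.0

def recall_at_k(ranked_labels, k):
    """Fraction of all relevant items that appear in the top k."""
    total_relevant = sum(ranked_labels)
    return sum(ranked_labels[:k]) / float(total_relevant) if total_relevant else 0.0

def dcg_at_k(ranked_labels, k):
    # enumerate starts at 0, so rank 1 is discounted by log2(2) = 1
    return sum(rel / math.log(i + 2, 2) for i, rel in enumerate(ranked_labels[:k]))

def ndcg_at_k(ranked_labels, k):
    """DCG normalised by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal else 0.0

# toy ranking: spam groups at ranks 1 and 3
labels = [1, 0, 1, 0]
p2 = precision_at_k(labels, 2)
r2 = recall_at_k(labels, 2)
n4 = ndcg_at_k(labels, 4)
```

For the toy ranking, half of the top two items are relevant and half of the relevant items are found in the top two, while NDCG is below 1 because the second spam group is ranked third instead of second.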



