# D-Cube vs. M-Zoom on the Yelp Dataset
In this part of the notebook, we apply the D-Cube algorithm to the Yelp review dataset and compare the results with those of D-Cube's main competitor, the M-Zoom algorithm.

In [1]:
import json
import numpy as np
import pandas as pd

In [3]:
# Load the Yelp review file: one JSON object per line.
data_raw = []
with open('Yelp-data/yelp_academic_dataset_review.json') as f:
    for line in f:
        data_raw.append(json.loads(line))

# Collect the distinct values along each of the three tensor dimensions.
user_set = list(set(review['user_id'] for review in data_raw))
business_set = list(set(review['business_id'] for review in data_raw))
date_set = list(set(review['date'] for review in data_raw))

print("The size of the Yelp review dataset is as follows:")
print("Unique Users:\t\t" + str(len(user_set)))
print("Unique Businesses:\t" + str(len(business_set)))
print("Unique Dates:\t\t" + str(len(date_set)))

The size of the Yelp review dataset is as follows:
Unique Users:		1029432
Unique Businesses:	144072
Unique Dates:		4221


In [4]:
# Map each distinct user, business, and date to an integer index.
user = dict(zip(user_set, np.arange(0, len(user_set))))
business = dict(zip(business_set, np.arange(0, len(business_set))))
date = dict(zip(date_set, np.arange(0, len(date_set))))

# Encode every real review as [user index, business index, date index, stars].
real_review = []
for review in data_raw:
    real_review.append([user[review['user_id']], business[review['business_id']],
                        date[review['date']], review['stars']])
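The index mapping above relies on `set()`, so the integer assigned to a given ID can change from one run to the next. As a minimal sketch (not the notebook's method), the same encoding can be made reproducible by numbering values in first-seen order. The helper `build_index` and the toy records are hypothetical and only stand in for the parsed review lines.

```python
def build_index(values):
    """Map each distinct value to a consecutive integer, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index

# Hypothetical toy records standing in for the parsed Yelp review lines.
toy_reviews = [
    {"user_id": "u_a", "business_id": "b_x", "date": "2017-01-01", "stars": 5},
    {"user_id": "u_b", "business_id": "b_x", "date": "2017-01-02", "stars": 1},
    {"user_id": "u_a", "business_id": "b_y", "date": "2017-01-02", "stars": 4},
]

user_index = build_index(r["user_id"] for r in toy_reviews)
business_index = build_index(r["business_id"] for r in toy_reviews)
date_index = build_index(r["date"] for r in toy_reviews)

# Each review becomes [user index, business index, date index, stars].
encoded = [
    [user_index[r["user_id"]], business_index[r["business_id"]],
     date_index[r["date"]], r["stars"]]
    for r in toy_reviews
]
print(encoded)  # [[0, 0, 0, 5], [1, 0, 1, 1], [0, 1, 1, 4]]
```

With first-seen numbering, the written tensor files stay comparable across runs, which makes it easier to check which injected indices an algorithm reports.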

## Fake Review Generator
After loading the original Yelp review dataset, we inject fake reviews and test whether D-Cube and M-Zoom can find them (that is, the users, businesses, and dates involved). We consider the following four types of injected data to challenge both algorithms.

### Type 1: Fraud Users (Fake users are generated to review businesses)
In this type of injected dataset, we generate fraud users who give only fake (positive or negative) reviews to some businesses.

In [5]:
# Positive block: 50 new (fake) user indices placed after the real ones,
# plus 50 randomly chosen businesses and 20 randomly chosen dates.
fake_user_positive = np.arange(len(user_set), len(user_set) + 50)
fake_business_positive = np.random.permutation(len(business_set))[:50]
fake_date_positive = np.random.permutation(len(date_set))[:20]

# Negative block: another 50 new user indices with their own businesses and dates.
fake_user_negative = np.arange(len(user_set) + 50, len(user_set) + 100)
fake_business_negative = np.random.permutation(len(business_set))[:50]
fake_date_negative = np.random.permutation(len(date_set))[:20]

print("Type 1 Positive Fake Review Size: ")
print(str(len(fake_user_positive)) + " * " + str(len(fake_business_positive)) + " * " + str(len(fake_date_positive)))

print("Type 1 Negative Fake Review Size: ")
print(str(len(fake_user_negative)) + " * " + str(len(fake_business_negative)) + " * " + str(len(fake_date_negative)))

Type 1 Positive Fake Review Size: 
50 * 50 * 20
Type 1 Negative Fake Review Size: 
50 * 50 * 20


In [7]:
# Build two fully dense blocks: every fake user reviews every chosen
# business on every chosen date.
fake_review = []
for u in fake_user_positive:
    for b in fake_business_positive:
        for d in fake_date_positive:
            fake_review.append([u, b, d, 5])  # positive block: 5 stars

for u in fake_user_negative:
    for b in fake_business_negative:
        for d in fake_date_negative:
            fake_review.append([u, b, d, 1])  # negative block: 1 star

all_review = real_review + fake_review

print("After injecting Type 1 fake reviews, here is the dataset we currently have:")
print("Size of all reviews:\t" + str(len(all_review)))
print("Size of real reviews:\t" + str(len(real_review)))
print("Size of fake reviews:\t" + str(len(fake_review)))

After injecting Type 1 fake reviews, here is the dataset we currently have:
Size of all reviews:	4253150
Size of real reviews:	4153150
Size of fake reviews:	100000
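The count of 100,000 fake reviews follows from the block shapes: each block holds 50 × 50 × 20 = 50,000 entries, and there are two blocks. As an alternative sketch of the same construction, `itertools.product` enumerates every (user, business, date) combination without the nested loops. The helper `make_block` and the small index sets are hypothetical and chosen only to keep the example short.

```python
from itertools import product

def make_block(users, businesses, dates, stars):
    """Return one [user, business, date, stars] record per combination."""
    return [[u, b, d, stars] for u, b, d in product(users, businesses, dates)]

# Hypothetical small index sets; the notebook uses 50 x 50 x 20 per block.
block = make_block(users=[100, 101], businesses=[7, 8, 9], dates=[0, 1], stars=5)
print(len(block))  # 12
print(block[0])    # [100, 7, 0, 5]
```

The records come out in the same order as the nested loops would produce them, so the two approaches are interchangeable here.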


### Shuffle the tensor
After injecting the fake reviews, we shuffle the whole tensor so that the fake reviews are randomly distributed throughout it. In other words, after shuffling, the dense-block detection algorithms cannot rely on the input order and must reorganize each dimension themselves to recover the dense blocks.

In [8]:
np.random.shuffle(all_review)
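Shuffling changes only the order of the records, not the tensor itself, so the density of the injected block is unaffected. The sketch below illustrates this with the arithmetic average mass density (entries inside the block divided by the mean side length of the block), which is one commonly used density measure in dense-block detection. Using the entry count as the mass is an assumption made here for illustration; the notebook does not state which measure or mass was configured for D-Cube and M-Zoom. The function name and toy data are hypothetical.

```python
import random

def arithmetic_density(records, users, businesses, dates):
    """Entries inside the block divided by the mean of its three side lengths."""
    users, businesses, dates = set(users), set(businesses), set(dates)
    # Assumption: each record contributes a mass of 1 (entry count).
    mass = sum(1 for u, b, d, _ in records
               if u in users and b in businesses and d in dates)
    mean_side = (len(users) + len(businesses) + len(dates)) / 3.0
    return mass / mean_side

# Hypothetical toy tensor: a full 2 x 2 x 2 block plus two background records.
block = [[u, b, d, 5] for u in (0, 1) for b in (0, 1) for d in (0, 1)]
background = [[5, 9, 3, 4], [6, 8, 2, 2]]
records = block + background

before = arithmetic_density(records, (0, 1), (0, 1), (0, 1))
random.Random(0).shuffle(records)  # reorder the records in place
after = arithmetic_density(records, (0, 1), (0, 1), (0, 1))
print(before, after)  # 4.0 4.0
```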

<img src="2.jpg">

In [9]:
# Write one "user,business,date,stars," record per line.
# Open in 'w' mode so that re-running the cell does not append duplicate records.
with open('yelp_reviews_with_fake_1.txt', 'w') as f:
    for review in all_review:
        f.write(','.join(str(value) for value in review) + ',' + '\n')

#### Type 1 Fake Review Result:
<img src="01.png">

### Type 2: Employed Users (Real users are employed to review businesses)
Next, we make the task harder for both algorithms. In the Type 2 injected dataset, we use real users instead of generated fraud users: existing users are employed to give fake (positive or negative) reviews to some businesses. We also shuffle the tensor after injecting them.

In [10]:
# Both blocks now use 50 randomly chosen real users.
fake_user_positive = np.random.permutation(len(user_set))[:50]
fake_business_positive = np.random.permutation(len(business_set))[:50]
fake_date_positive = np.random.permutation(len(date_set))[:20]

fake_user_negative = np.random.permutation(len(user_set))[:50]
fake_business_negative = np.random.permutation(len(business_set))[:50]
fake_date_negative = np.random.permutation(len(date_set))[:20]

print("Type 2 Positive Fake Review Size: ")
print(str(len(fake_user_positive)) + " * " + str(len(fake_business_positive)) + " * " + str(len(fake_date_positive)))

print("Type 2 Negative Fake Review Size: ")
print(str(len(fake_user_negative)) + " * " + str(len(fake_business_negative)) + " * " + str(len(fake_date_negative)))

# Build the two fully dense blocks.
fake_review = []
for u in fake_user_positive:
    for b in fake_business_positive:
        for d in fake_date_positive:
            fake_review.append([u, b, d, 5])  # positive block: 5 stars

for u in fake_user_negative:
    for b in fake_business_negative:
        for d in fake_date_negative:
            fake_review.append([u, b, d, 1])  # negative block: 1 star

all_review = real_review + fake_review

# Shuffle so that the injected reviews are spread throughout the tensor.
np.random.shuffle(all_review)

# Open in 'w' mode so that re-running the cell does not append duplicate records.
with open('yelp_reviews_with_fake_2.txt', 'w') as f:
    for review in all_review:
        f.write(','.join(str(value) for value in review) + ',' + '\n')

Type 2 Positive Fake Review Size: 
50 * 50 * 20
Type 2 Negative Fake Review Size: 
50 * 50 * 20


#### Type 2 Fake Review Result:
<img src="02.png">

### Type 3: Employed Users with Incomplete Blocks (30% of the fake reviews are randomly removed)

Building on the Type 2 injected data, the Type 3 data randomly deletes 30% of the injected fake reviews to make the injected blocks more realistic. In the real world, some of the employed users or their reviews may be detected and blocked by Yelp, so a block of fake reviews is unlikely to remain perfectly dense. Each block here spans 200 users, 10 businesses, and 10 dates.
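As a minimal sketch of the thinning step, keeping 70% of a block amounts to sampling that many records without replacement. The helper `keep_fraction`, its `seed` parameter, and the toy block are hypothetical; the notebook's own cell uses a NumPy permutation for the same purpose.

```python
import random

def keep_fraction(records, fraction, seed=None):
    """Return a random subset holding int(len(records) * fraction) records."""
    rng = random.Random(seed)
    n_keep = int(len(records) * fraction)
    return rng.sample(records, n_keep)  # sampling without replacement

# Hypothetical 4 x 5 x 5 block with 100 entries, all rated 5 stars.
block = [[u, b, d, 5] for u in range(4) for b in range(5) for d in range(5)]
kept = keep_fraction(block, 0.7, seed=42)
print(len(block), len(kept))  # 100 70
```

After thinning, the block keeps its side lengths (as long as no user, business, or date loses all of its entries) but only about 70% of its cells are filled, so its density drops accordingly.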

<img src="3.jpg">

In [11]:
# Each block uses 200 real users, 10 businesses, and 10 dates.
fake_user_positive = np.random.permutation(len(user_set))[:200]
fake_business_positive = np.random.permutation(len(business_set))[:10]
fake_date_positive = np.random.permutation(len(date_set))[:10]

fake_user_negative = np.random.permutation(len(user_set))[:200]
fake_business_negative = np.random.permutation(len(business_set))[:10]
fake_date_negative = np.random.permutation(len(date_set))[:10]

print("Type 3 Positive Fake Review Size: ")
print(str(len(fake_user_positive)) + " * " + str(len(fake_business_positive)) + " * " + str(len(fake_date_positive)))

print("Type 3 Negative Fake Review Size: ")
print(str(len(fake_user_negative)) + " * " + str(len(fake_business_negative)) + " * " + str(len(fake_date_negative)))

# Build the two fully dense blocks.
fake_review = []
for u in fake_user_positive:
    for b in fake_business_positive:
        for d in fake_date_positive:
            fake_review.append([u, b, d, 5])  # positive block: 5 stars

for u in fake_user_negative:
    for b in fake_business_negative:
        for d in fake_date_negative:
            fake_review.append([u, b, d, 1])  # negative block: 1 star

# Keep a random 70% of the fake reviews (that is, delete 30%).
a = np.array(fake_review)
fake_review = list(a[np.random.permutation(a.shape[0])[:int(a.shape[0] * 0.7)]])

all_review = real_review + fake_review

# Open in 'w' mode so that re-running the cell does not append duplicate records.
with open('yelp_reviews_with_fake_3.txt', 'w') as f:
    for review in all_review:
        f.write(','.join(str(value) for value in review) + ',' + '\n')

Type 3 Positive Fake Review Size: 
200 * 10 * 10
Type 3 Negative Fake Review Size: 
200 * 10 * 10


#### Type 3 Fake Review Result:
<img src="03.png">

### Type 4: Smarter Employed Users (Real users are employed to give 4 or 5 stars for positive reviews, and 1 or 2 stars for negative reviews)
Building on the Type 3 injected dataset, we add a further challenge. In the real world, fraud users do not give only 5 stars for positive reviews; they also give 4 stars to make the fake reviews look more realistic. To model this, the injected fake reviews now use either 4 or 5 stars for positive reviews and either 1 or 2 stars for negative reviews.
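A minimal sketch of the rating rule described above: each fake review draws its star value uniformly from a two-element set. The function `fake_rating` and the seeded generator are hypothetical and serve only to illustrate the idea.

```python
import random

def fake_rating(positive, rng):
    """Draw 4 or 5 stars for a positive fake review, 1 or 2 for a negative one."""
    return rng.choice([4, 5]) if positive else rng.choice([1, 2])

rng = random.Random(0)  # seeded only to make the sketch repeatable
positive = [fake_rating(True, rng) for _ in range(1000)]
negative = [fake_rating(False, rng) for _ in range(1000)]
print(sorted(set(positive)), sorted(set(negative)))  # [4, 5] [1, 2]
```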

In [12]:
# Each block uses 50 real users, 50 businesses, and 20 dates.
fake_user_positive = np.random.permutation(len(user_set))[:50]
fake_business_positive = np.random.permutation(len(business_set))[:50]
fake_date_positive = np.random.permutation(len(date_set))[:20]

fake_user_negative = np.random.permutation(len(user_set))[:50]
fake_business_negative = np.random.permutation(len(business_set))[:50]
fake_date_negative = np.random.permutation(len(date_set))[:20]

print("Type 4 Positive Fake Review Size: ")
print(str(len(fake_user_positive)) + " * " + str(len(fake_business_positive)) + " * " + str(len(fake_date_positive)))

print("Type 4 Negative Fake Review Size: ")
print(str(len(fake_user_negative)) + " * " + str(len(fake_business_negative)) + " * " + str(len(fake_date_negative)))

# Build the two blocks with a randomly chosen rating per review.
fake_review = []
for u in fake_user_positive:
    for b in fake_business_positive:
        for d in fake_date_positive:
            fake_review.append([u, b, d, np.random.choice([4, 5])])  # positive: 4 or 5 stars

for u in fake_user_negative:
    for b in fake_business_negative:
        for d in fake_date_negative:
            fake_review.append([u, b, d, np.random.choice([1, 2])])  # negative: 1 or 2 stars

# Keep a random 70% of the fake reviews (that is, delete 30%).
a = np.array(fake_review)
fake_review = list(a[np.random.permutation(a.shape[0])[:int(a.shape[0] * 0.7)]])

all_review = real_review + fake_review

# Open in 'w' mode so that re-running the cell does not append duplicate records.
with open('yelp_reviews_with_fake_4.txt', 'w') as f:
    for review in all_review:
        f.write(','.join(str(value) for value in review) + ',' + '\n')

Type 4 Positive Fake Review Size: 
50 * 50 * 20
Type 4 Negative Fake Review Size: 
50 * 50 * 20


#### Type 4 Fake Review Result:
<img src="04.png">

## Conclusion
Based on the results above, we conclude that both D-Cube and M-Zoom successfully detect the injected data in all four settings. In terms of speed, D-Cube was much faster than M-Zoom in our experiments.
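One way to make "successfully detect" quantitative is to compare the indices in a reported block against the injected ones using precision and recall. The sketch below assumes the detected users are available as a list of integer indices; the function `precision_recall` and all values shown are hypothetical and are not results from the experiments above.

```python
def precision_recall(detected, injected):
    """Precision and recall of a detected index set against the injected one."""
    detected, injected = set(detected), set(injected)
    hits = len(detected & injected)
    precision = hits / float(len(detected)) if detected else 0.0
    recall = hits / float(len(injected)) if injected else 0.0
    return precision, recall

# Hypothetical example: 50 injected users, of which 45 are detected,
# together with 5 users that were not injected.
injected_users = range(1000, 1050)
detected_users = list(range(1000, 1045)) + [3, 17, 77, 256, 901]
p, r = precision_recall(detected_users, injected_users)
print(p, r)  # 0.9 0.9
```

The same comparison can be repeated for the business and date dimensions of each block.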