<h1> Data Collection </h1>

Data Source: Yelp Dataset Challenge https://www.yelp.com/dataset_challenge

The original file is too large to process at once, so we separate the original JSON file into smaller files, each containing only one feature.

Below is an example of how we do the data extraction. The JSON file is too large, so it is NOT uploaded to GitHub. We ran data collection on the Brazos Cluster.

See more in ./preprocess/data-extraction/DATA.README
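
The per-feature split described above can be sketched as follows. This is a minimal illustration, not the script used on the cluster: the helper name `split_fields` and the two sample records are made up, and it only assumes that the dataset stores one JSON object per line.

```python
import json

def split_fields(lines, fields):
    """Collect each requested field of line-delimited JSON records into its own list."""
    columns = {name: [] for name in fields}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name in fields:
            # .get() stores None where a record lacks the field
            columns[name].append(record.get(name))
    return columns

# Two made-up records in the one-object-per-line layout (hypothetical values)
sample = [
    '{"business_id": "b1", "city": "CityA", "latitude": 35.2, "longitude": -80.8}',
    '{"business_id": "b2", "city": "CityB", "latitude": 35.0, "longitude": -80.9}',
]
columns = split_fields(sample, ["business_id", "city"])
print(columns["city"])
```

Each list in `columns` could then be written to its own file, which keeps every per-feature file small enough to load on its own.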

In [None]:
import json

PreProcess = False
if PreProcess:
    # Each line of the business file is one JSON object
    fields = ["business_id", "latitude", "longitude", "city", "state", "postal_code"]
    location = []
    with open("./yelp_academic_dataset_business.json", 'r') as fInput:
        for line in fInput:
            record = json.loads(line)
            location.append([record[k] for k in fields])

# Geo-clustering

First, we extracted location information from the Yelp business data and clustered the businesses using the DBSCAN implementation in scikit-learn. We got 126 clusters. We chose one cluster, with longitude between -81.4 and -81.4 and latitude between 34.8 and 35.6.


It takes some time to run clustering. 
See more details in ./preprocess/location-cluster/DBSCAN.py
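
To make the distance threshold concrete, the sketch below converts 1.5 km into radians and checks whether two points fall within it under the haversine distance. It is only an illustration in plain Python: the function name `haversine_radians` and the two coordinates are hypothetical, while the Earth radius constant is the one used in the notebook.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, as in the notebook

def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees, in radians."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))

epsilon = 1.5 / KMS_PER_RADIAN  # 1.5 km expressed as an angle in radians

# Hypothetical points 0.01 degrees of latitude apart (roughly 1.1 km)
d = haversine_radians(35.00, -81.00, 35.01, -81.00)
print(d * KMS_PER_RADIAN, d <= epsilon)
```

Because the haversine metric works on angles, both the coordinates and the threshold have to be in radians, which is why the clustering cell passes `np.radians(coords)` and divides the 1.5 km by the Earth radius.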

In [None]:
RunCluster = False
if RunCluster:
    import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from geopy.distance import great_circle
    from shapely.geometry import MultiPoint

    kms_per_radian = 6371.0088
    
    df = pd.read_csv('./preprocess/location-cluster/local2.csv', encoding='utf-8')
    df.head()

    coords = df[['lat', 'lon']].values  # DataFrame.as_matrix() was removed in pandas 1.0

    # define epsilon as 1.5 kilometers (converted to radians) to get medium-sized clusters
    epsilon = 1.5 / kms_per_radian

    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
    cluster_labels = db.labels_
    num_clusters = len(set(cluster_labels))
    clusters = pd.Series([coords[cluster_labels==n] for n in range(num_clusters)])

    def get_centermost_point(cluster):
        centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
        centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
        return tuple(centermost_point)
    centermost_points = clusters.map(get_centermost_point)

<img src="./preprocess/fig/Distribution-All.png">
<img src="./preprocess/fig/Distribution-local2.png">

# Review statistics

Next, we plot the number of users against the number of reviews per user, and the number of businesses against the number of reviews per business. The results are shown below.

Based on these statistics, users and businesses with fewer than 30 reviews make up a large proportion of the data, yet their reviews contribute little to recommendation (they only make the recommendation matrix sparser). So we decided to analyze only users and businesses with more than 30 reviews.

Since we will build a model based on ratings, reviews, and relations, we also exclude users with fewer than 30 friends to make the relation network denser. The final dataset used for our project is in ../new_5k
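
The two thresholds can be sketched as a simple filter. This is an assumption-laden illustration: the `users` dictionary, its field names, and the helper `keep_user` are hypothetical, and treating exactly 30 friends as "kept" is our reading of the rule above rather than something the text states.

```python
MIN_REVIEWS = 30
MIN_FRIENDS = 30

def keep_user(review_count, friends):
    """Keep users with more than MIN_REVIEWS reviews and at least MIN_FRIENDS friends."""
    return review_count > MIN_REVIEWS and len(friends) >= MIN_FRIENDS

# Hypothetical users: only u1 passes both thresholds
users = {
    "u1": {"review_count": 45, "friends": ["f%d" % i for i in range(40)]},
    "u2": {"review_count": 12, "friends": ["f%d" % i for i in range(50)]},
    "u3": {"review_count": 80, "friends": ["f1", "f2"]},
}
kept = sorted(uid for uid, u in users.items()
              if keep_user(u["review_count"], u["friends"]))
print(kept)
```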

In [None]:
plot_user = False
plot_business = False

import math
import matplotlib.pylab as plt
import numpy as np

u = open('./preprocess/user+business_distribution/user-stat.out', 'r').read()
user = eval(u)
b = open('./preprocess/user+business_distribution/business-stat.out', 'r').read()
business = eval(b)

review_count  = user[0]
average_stars = user[5]
reviews = business[0]
stars = business[1]

if plot_user:
    review_count_log = {}
    for k in review_count: review_count_log[int(k/30)] = 0 
    for k in review_count: review_count_log[int(k/30)] += review_count[k]
    for k in review_count_log: review_count_log[k] = 1.0*math.log(review_count_log[k]+1.0)
    xname = "count(review)/30"
    yname = "log( count(user) )"
    title_name = "distribution of reviews per user"
    fig_name = "user_review_count"

    lists = sorted(review_count_log.items()) 
    x, y = zip(*lists) 
    plt.plot(x, y)
    plt.xlabel(xname)
    plt.ylabel(yname)
    plt.title(title_name)
    plt.savefig(fig_name)
    plt.close('all')

if plot_business:
    review_count_log = {}
    for k in reviews: review_count_log[int(k/30)] = 0 
    for k in reviews: review_count_log[int(k/30)] += reviews[k]
    for k in review_count_log: review_count_log[k] = 1.0*math.log(review_count_log[k]+1.0)
    xname = "count(review)/30"
    yname = "log( count(business) )"
    title_name = "distribution of reviews per business"
    fig_name = "business_review_count"
    lists = sorted(review_count_log.items()) 
    x, y = zip(*lists) 

    plt.plot(x, y)
    plt.xlabel(xname)
    plt.ylabel(yname)
    plt.title(title_name)
    plt.savefig(fig_name)
    plt.close('all')

<img src="./preprocess/fig/user_review_count.png">
<img src="./preprocess/fig/business_review_count.png">

<h1> Recommendation System </h1>

File 5k-data stores all business_ID, business_avg_rating, user_ID, and user_avg_rating.

File 5k-relation stores all users and their friends.

File 5k-review stores all business_ID(review_to), user_ID(review_by), and rating and review_text.

In [None]:
f = open('./new_5k/5k-data', 'r')
business_name = eval(f.readline())
business_avg = eval(f.readline())
user_name = eval(f.readline())
user_avg = eval(f.readline())

f = open('./new_5k/5k-relation', 'r').read()
relation = eval(f)

f = open('./new_5k/5k-review', 'r')
review5k_business = eval(f.readline())
review5k_user = eval(f.readline())
review5k_rating = eval(f.readline())
review5k_text = eval(f.readline())

In [None]:
import random 
import math
import numpy as np
from operator import itemgetter
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF
from sklearn.utils.extmath import randomized_svd
from sklearn.decomposition import TruncatedSVD
from scipy.sparse.linalg import svds
from sklearn.metrics import mean_squared_error

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer


<h1>Parameters</h1>

Below, we run the model once to show how it works, using the first 10% of the reviews as test data and the remaining 90% as training data. We also use 10-fold cross-validation to select parameters. For more details about the 10-fold model, please refer to
./bin/TopicMF-10fold-YY.py
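
As a rough sketch of how a 10-fold split generalizes the single 10%/90% split, the generator below yields contiguous test and train index lists. It is not taken from TopicMF-10fold-YY.py; the name `kfold_indices` and the fold boundaries are assumptions made for illustration.

```python
def kfold_indices(n, k=10):
    """Yield (test_idx, train_idx) for k contiguous folds over range(n)."""
    for fold in range(k):
        start = fold * n // k
        stop = (fold + 1) * n // k
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n))
        yield test_idx, train_idx

# With 25 reviews, the first fold holds out the first two reviews
folds = list(kfold_indices(25, k=10))
first_test, first_train = folds[0]
print(len(first_test), len(first_train))
```

The first fold corresponds to the single run in this notebook: the first tenth of the reviews is held out for testing and everything else is used for training.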

In [None]:
train_user = []
train_business = []
train_rating = []
train_text = []

test_user = []
test_business = []
test_rating = []

for r in xrange(len(review5k_rating)):
    if r < len(review5k_rating)/10:
        test_user.append(review5k_user[r])
        test_business.append(review5k_business[r])
        test_rating.append(review5k_rating[r])
    else:
        train_user.append(review5k_user[r])
        train_business.append(review5k_business[r])
        train_rating.append(review5k_rating[r])
        train_text.append(review5k_text[r])
        
K_topic = 10
Times = 50
DocWord = 300
DocTopic = 5

lbd0 = 0.4 # Topic effect
lbd1 = 0.5 # Relation (friends) effect
lbd2 = 0.5 # VIP effect
lbd3 = 0.5 # Similarity effect
lbd4 = 0.5 # Topic effect in Tri-model
lbd5 = 0.5 # VIP effect in Tri-model

<h1>Initialization</h1>

In [None]:
num_user = len(user_avg)
num_business = len(business_avg)
num_train = len(train_rating)
num_test = len(test_rating)

mu = np.mean(train_rating)
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

def prep(doc):
    raw = doc.lower().replace("\n", "").replace("\t", "")
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in en_stop]
    texts = [p_stemmer.stem(i) for i in stopped_tokens]
    return (" ").join(texts)

from sklearn import linear_model
def MLR(X, Y):
    reg = linear_model.LinearRegression()
    reg.fit(X, Y)
    return reg.coef_

<h1>Basic Model</h1>

In [None]:
Ubb = []
for i in xrange(num_user):
    Ubb.append([])
    for j in xrange(num_business):
        val = user_avg[i] + business_avg[j] - mu
        if val < 1: val = 1
        if val > 5: val = 5
        Ubb[i].append(val) 
        
UbbPd = []
for r in xrange(num_test):
    UbbPd.append(Ubb[test_user[r]][test_business[r]]) 

Ubb_rmse = math.sqrt(mean_squared_error(test_rating, UbbPd))  # RMSE is the square root of the MSE

print "Ubb_rmse =", Ubb_rmse

In [None]:
BasicIn = []
for i in xrange(num_user):
    BasicIn.append([])
    for j in xrange(num_business):
        val = user_avg[i] + business_avg[j] - mu
        if val < 1: val = 1
        if val > 5: val = 5
        BasicIn[i].append(val) 
        
for r in xrange(num_train):
    BasicIn[train_user[r]][train_business[r]] = train_rating[r]
model = NMF(n_components=K_topic, init='random', random_state=0)
U = model.fit_transform(BasicIn);
V = model.components_;
BasicOut = np.dot(U,V)
BasicPd = []
for r in xrange(num_test):
    BasicPd.append(BasicOut[test_user[r]][test_business[r]]) 

BasicPd_rmse = math.sqrt(mean_squared_error(test_rating, BasicPd))  # RMSE is the square root of the MSE

print "BasicPd_rmse =", BasicPd_rmse

<h1>Topic Model</h1>

cell-1: For the topic model, we first preprocess all reviews (stemming, removing stop words, and so on).

cell-2: For each business, we group all of its reviews in the training data into a bag of words, and then use the LDA implementation in scikit-learn to compute the topic distribution for each business. To shorten the calculation time, we choose 5 topics.

|      | topic1 | topic2 | topic3 | ... |
|------|--------|--------|--------|-----|
| doc1 | 0.0513 | 0.4686 | 0.0092 | ... |
| doc2 | 0.0006 | 0.8601 | 0.0006 | ... |
| doc3 | 0.9989 | 0.0003 | 0.0003 | ... |
| ...  | ...    | ...    | ...    | ... |

cell-3: For each user, we compute his/her topic ratings using linear regression on the residuals, that is, his/her real ratings in the training data minus Ubb. We manually add six points to the linear regression in order to scale down the regression results: X1 = [0,0,0,0,0], Y1 = 0; X2 = [1,0,0,0,0], Y2 = 0; X3 = [0,1,0,0,0], Y3 = 0; X4 = [0,0,1,0,0], Y4 = 0; X5 = [0,0,0,1,0], Y5 = 0; X6 = [0,0,0,0,1], Y6 = 0.


$$delta = real\_rating - mu - b_x - b_i$$
$$predicted(rating_i - Ubb) = delta\_reg = coef_1 * topic_1 + coef_2 * topic_2 + ...$$

cell-4: calculation of rmse
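
To illustrate why the extra zero-valued points in cell-3 scale the regression down, here is a simplified one-topic sketch. It is not the notebook's five-topic regression: the helper `fit_line` and all numbers are hypothetical, and it only shows that anchors with a residual of 0 pull the fitted slope toward zero.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical user: topic weight of two rated businesses and the residual (rating - Ubb)
xs, ys = [0.2, 0.8], [0.5, 2.0]
slope_raw, _ = fit_line(xs, ys)

# Anchor points in the spirit of cell-3: topic weight 0 and 1, both with residual 0
slope_anchored, _ = fit_line(xs + [0.0, 1.0], ys + [0.0, 0.0])
print(slope_raw, slope_anchored)
```

With only a handful of ratings per user, an unanchored fit can produce large coefficients; the anchored fit keeps the sign of the effect but reduces its magnitude.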

In [None]:
print list(DocTopDist[1])

In [None]:
Breview = [""]*num_business
for r in xrange(num_train):
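    # add a separator so the last word of one review does not merge with the first word of the next
    Breview[train_business[r]] += prep(train_text[r]) + " "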

In [None]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=DocWord, stop_words='english')
tf = tf_vectorizer.fit_transform(Breview)

# n_topics was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=DocTopic, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
DocTopDist = lda.fit_transform(tf)
print "topic distribution for the first business is"
print list(DocTopDist[0])

In [None]:
# delta -> difference between rating and Ubb [u*b], delta2 -> rating indicator (1 if rated)
delta = np.zeros((num_user, num_business))
delta2 = np.zeros((num_user, num_business))
for r in xrange(num_train):
    delta[train_user[r]][train_business[r]] = train_rating[r] - Ubb[train_user[r]][train_business[r]]
    delta2[train_user[r]][train_business[r]] = 1

deltaReg = []
for i in xrange(num_user):
    X = [[0,0,0,0,0], [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], [0,0,0,0,1]]; Y = [0,0,0,0,0,0]
    for j in xrange(num_business):
        if delta2[i][j]:
            X.append(list(DocTopDist[j]))
            Y.append(delta[i][j])
    coef = MLR(X, Y)
    deltaReg.append(np.dot(DocTopDist, coef))

In [None]:
TopicIn = np.zeros((num_user, num_business))
for i in xrange(num_user):
    for j in xrange(num_business):
        val = Ubb[i][j] + lbd0*deltaReg[i][j]
        if val < 1: val = 1
        if val > 5: val = 5
        TopicIn[i][j] = val

TopicInPd = []
for r in xrange(num_test):
    TopicInPd.append(TopicIn[test_user[r]][test_business[r]]) 

TopicInPd_rmse = math.sqrt(mean_squared_error(test_rating, TopicInPd))  # RMSE is the square root of the MSE
print TopicInPd_rmse

<h1> Social Model</h1>
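We compare three variants of the social model.

cell-1: Based on the paper, we calculate each user's relation impact from his/her rank, the user similarity from real ratings in the training data, and a relation matrix with Tij = 1 if users i and j are friends and 0 otherwise.

relationship: $T_{ij} = 1$ if $i$ and $j$ are friends, $0$ otherwise

relation impact: $W_i = \frac{1.0}{1.0 + \log(rank_i)}$, where $rank_i$ is the position of user $i$ when users are sorted by number of friends in descending order (the code uses this ranking in place of PageRank)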

user similarity: $Cos_{ij} = \frac{rating\_vector_i * rating\_vector_j}{length(rating\_vector_i) * length(rating\_vector_j)}$

cell-2
$Ubb = mu + b_x + b_i$

social-model1: $predicted(rating_i-Ubb) = \frac{\sum (rating_j-Ubb)}{Num(total\ friends)}$, summed over the friends j of i

social-model2: $predicted(rating_i-Ubb) = \frac{\sum Social\_impact*(rating_j-Ubb)}{Num(total\ user)}$, summed over all users j

social-model3: $predicted(rating_i-Ubb) = \frac{\sum User\_similarity*(rating_j-Ubb)}{Num(total\ user)}$, summed over all users j

cell-3: calculation of rmse. It turns out that using VIP users to predict ratings gives the lowest rmse, which suggests that social impact is the most accurate of the three signals for recommendation.
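
The two weighting schemes from cell-1 can be sketched in plain Python as below. This is only an illustration: the function names `rank_weights` and `cosine`, the friend counts, and the vectors are hypothetical, and the rank is taken from friend counts, which is how the code cell that follows builds it.

```python
import math

def rank_weights(friend_counts):
    """W_i = 1 / (1 + log(rank_i)), where rank 1 is the user with the most friends."""
    order = sorted(friend_counts, key=friend_counts.get, reverse=True)
    return {user: 1.0 / (1.0 + math.log(pos + 1)) for pos, user in enumerate(order)}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical friend counts: u0 has the most friends, so it gets the largest weight
weights = rank_weights({"u0": 120, "u1": 35, "u2": 60})
print(weights["u0"], weights["u2"], weights["u1"])

# Two hypothetical residual-rating vectors pointing in the same direction
print(cosine([1.0, 0.0, -0.5], [2.0, 0.0, -1.0]))
```

The logarithm makes the weight fall off slowly, so highly connected users dominate without completely silencing everyone else.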

In [None]:
Rij = np.zeros((num_user, num_business))
Sij = np.zeros((num_user, num_business))
Tij = np.zeros((num_user, num_user))
Uij = np.zeros((num_user, num_user))
Vij = np.zeros((num_user, num_user))

# user-item rating matrix. If ui gives a rating to vj, Rij is the rating score, otherwise 0
for r in xrange(num_train):
    Rij[train_user[r]][train_business[r]] = train_rating[r] - Ubb[train_user[r]][train_business[r]]
    Sij[train_user[r]][train_business[r]] = 1
       
# user-user social relations where Tij = 1 if ui,uj has a relation and zero otherwise
for u in relation:
    for f in u[1]:
        Tij[u[0]][f] = 1

# VIP matrix where Uij = 1 if uj is a VIP and zero otherwise        
PR = {}
for u in relation:
    PR[u[0]] = len(u[1])    
sorted_PR = sorted(PR.items(), key=itemgetter(1), reverse=True)
rank = {}
for u in xrange(num_user):
    rank[sorted_PR[u][0]] = u+1
Wi = []
for u in xrange(num_user):
    Wi.append(1.0/(1.0+ math.log(rank[u])))

for u in xrange(num_user):
    for v in xrange(num_user):
        Uij[u][v] = Wi[v]
        Vij[u][v] = 1

# user-user similarity
Cos_norm = []
for u in xrange(num_user):
    Cos_norm.append(math.sqrt(np.dot(Rij[u], Rij[u])))

Cos = np.dot(Rij, Rij.T)
for i in xrange(num_user):
    for j in xrange(num_user):
        if Cos_norm[i] and Cos_norm[j]:
            Cos[i][j] = Cos[i][j]/Cos_norm[i]/Cos_norm[j]
        else: 
            Cos[i][j] = 0

TR = np.dot(Tij, Rij)
TS = np.dot(Tij, Sij)
UR = np.dot(Uij, Rij)
VS = np.dot(Uij, Sij)
CR = np.dot(Cos, Rij)

Vij1 = np.zeros((num_user, num_business))
Vij2 = np.zeros((num_user, num_business))
Vij3 = np.zeros((num_user, num_business))

for i in xrange(num_user):
    for j in xrange(num_business):
        if TS[i][j]: Vij1[i][j] = TR[i][j]/TS[i][j]
        if VS[i][j]: Vij2[i][j] = UR[i][j]/VS[i][j]
        if num_user: Vij3[i][j] = CR[i][j]/num_user 

In [None]:
SocialIn1 = np.zeros((num_user, num_business))
SocialIn2 = np.zeros((num_user, num_business))
SocialIn3 = np.zeros((num_user, num_business))

for i in xrange(num_user):
    for j in xrange(num_business):
        val1 = Ubb[i][j] + lbd1*Vij1[i][j]
        val2 = Ubb[i][j] + lbd2*Vij2[i][j]
        val3 = Ubb[i][j] + lbd3*Vij3[i][j]
        
        if val1 < 1: val1 = 1
        if val1 > 5: val1 = 5
        if val2 < 1: val2 = 1
        if val2 > 5: val2 = 5
        if val3 < 1: val3 = 1
        if val3 > 5: val3 = 5
            
        SocialIn1[i][j] = val1
        SocialIn2[i][j] = val2
        SocialIn3[i][j] = val3

SocialInPd1 = []
SocialInPd2 = []
SocialInPd3 = []
for r in xrange(num_test):
    SocialInPd1.append(SocialIn1[test_user[r]][test_business[r]]) 
    SocialInPd2.append(SocialIn2[test_user[r]][test_business[r]]) 
    SocialInPd3.append(SocialIn3[test_user[r]][test_business[r]]) 

# RMSE is the square root of the MSE
SocialInPd_rmse1 = math.sqrt(mean_squared_error(test_rating, SocialInPd1))
SocialInPd_rmse2 = math.sqrt(mean_squared_error(test_rating, SocialInPd2))
SocialInPd_rmse3 = math.sqrt(mean_squared_error(test_rating, SocialInPd3))

print "relation rmse =", SocialInPd_rmse1
print "VIP rmse =", SocialInPd_rmse2
print "similarity rmse =", SocialInPd_rmse3

if SocialInPd_rmse1 < SocialInPd_rmse2 and SocialInPd_rmse1 < SocialInPd_rmse3:
    print "relation between users dominant rating"
    SocialInPd_rmse = SocialInPd_rmse1
elif SocialInPd_rmse2 < SocialInPd_rmse3:
    print "VIP user dominant rating"
    SocialInPd_rmse = SocialInPd_rmse2
else:
    print "user similarity dominant rating"
    SocialInPd_rmse = SocialInPd_rmse3


<h1>Tri-Model</h1>

Tri-Model is a linear combination of the topic model and the social model.

$$predicted\_rating = Ubb + lbd4 * Topic\_predicted(rating_i - Ubb) +
lbd5 * Social\_predicted(rating_i - Ubb)$$ 
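
For a single user-business pair, the combination above can be sketched as follows. The helper names `clip_rating` and `tri_predict` and the input values are hypothetical; the default weights mirror the values of lbd4 and lbd5 set in the Parameters section.

```python
def clip_rating(val, lo=1.0, hi=5.0):
    """Keep a predicted rating inside the valid star range."""
    return max(lo, min(hi, val))

def tri_predict(ubb, topic_delta, social_delta, lbd4=0.5, lbd5=0.5):
    """Baseline plus weighted topic and social corrections, clipped to [1, 5]."""
    return clip_rating(ubb + lbd4 * topic_delta + lbd5 * social_delta)

# Hypothetical inputs: a baseline of 3.8 nudged up by topics and down by the social term
print(tri_predict(3.8, 0.6, -0.2))
# A prediction that would exceed 5 stars is clipped
print(tri_predict(4.9, 1.0, 0.4))
```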

In [None]:
Tri = np.zeros((num_user, num_business))

for i in xrange(num_user):
    for j in xrange(num_business):
        val = Ubb[i][j] + lbd4*deltaReg[i][j] + lbd5*Vij2[i][j]

        # clip the combined prediction to the valid rating range [1, 5]
        if val < 1: val = 1
        if val > 5: val = 5

        Tri[i][j] = val

TriPd = []
for r in xrange(num_test):
    TriPd.append(Tri[test_user[r]][test_business[r]]) 

TriPd_rmse = math.sqrt(mean_squared_error(test_rating, TriPd))  # RMSE is the square root of the MSE
print TriPd_rmse

<h1>Discussion</h1>

In [None]:
print "Ubb_rmse =", Ubb_rmse
print "Basic-model_rmse =", BasicPd_rmse
print "TopicIn-model_rmse =", TopicInPd_rmse
print "Social-model_rmse =", SocialInPd_rmse
print "Tri-model_rmse =", TriPd_rmse

<h1>10-fold results and parameter optimization</h1>


<img src="./bin/fig-lbd0.png">
<img src="./bin/fig-lbd1.png">
<img src="./bin/fig-lbd2.png">
<img src="./bin/fig-lbd3.png">
<img src="./bin/fig-10fold.png">