## Yelp Challenge

Dataset Documentation: <br>
https://www.yelp.com/dataset/documentation/main

In [1]:
import pandas as pd
import tarfile
from tqdm import tqdm
import json

In [2]:
zf = tarfile.open('yelp_dataset.tar')
# TarFile.list() prints the listing itself and returns None, so call it
# directly; use zf.getnames() when the names are needed as a list.
zf.list()

?rw-r--r-- daniellg/users  138279749 2018-11-15 11:22:39 business.json 
?rw-r--r-- daniellg/users  408807658 2018-11-15 11:25:00 checkin.json 
?rw-r--r-- daniellg/users 5347475638 2018-11-15 11:35:37 review.json 
?rw-r--r-- daniellg/users  244535478 2018-11-15 11:26:18 tip.json 
?rw-r--r-- daniellg/users 2485747393 2018-11-15 11:24:48 user.json 
?rw-r--r-- daniellg/users   25661152 2019-01-11 19:06:09 photo.json 
?rw-r--r-- daniellg/users     101186 2019-01-14 11:31:35 Dataset_Challenge_Dataset_Agreement.pdf 
?rw-r--r-- daniellg/users     111822 2019-01-14 11:35:09 Yelp_Dataset_Challenge_Round_13.pdf 

In [4]:
## Feel free to extract more files here
zf.extract("review.json")
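`TarFile.list()` only prints, so it is not useful when the member names or sizes are needed as values. The sketch below shows `getnames()`, `getmembers()` and `extract()` on a tiny stand-in archive built on the fly; the archive name `toy_dataset.tar`, its contents and the `workdir` directory are assumptions for illustration, not part of the Yelp dataset.

```python
import io
import tarfile
import tempfile
from pathlib import Path

# Build a tiny stand-in archive so the sketch runs without the real dataset.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "toy_dataset.tar"  # hypothetical file name
with tarfile.open(archive_path, "w") as tf:
    for name, payload in [("review.json", b'{"stars": 5.0}\n'), ("tip.json", b"{}\n")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

with tarfile.open(archive_path) as tf:
    # getnames() returns a list of member names; list() only prints them.
    member_names = tf.getnames()
    # getmembers() exposes metadata such as the size in bytes.
    member_sizes = {m.name: m.size for m in tf.getmembers()}
    # Extract a single member into a chosen directory.
    tf.extract("review.json", path=workdir)

print(member_names)
print(member_sizes)
```

Checking `member_sizes` before extracting is a cheap way to see which files (for example the ~5 GB `review.json`) will need streaming rather than a full in-memory load.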

In [5]:
# Count lines by streaming, so tqdm can show progress without loading the ~5 GB file into memory
with open("review.json", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)

user_ids, business_ids, stars, dates, texts = [], [], [], [], []
with open("review.json", encoding="utf-8") as f:
    for line in tqdm(f, total=line_count):
        blob = json.loads(line)  # one JSON object per line
        user_ids.append(blob["user_id"])
        business_ids.append(blob["business_id"])
        stars.append(blob["stars"])
        dates.append(blob["date"])
        texts.append(blob["text"])
ratings = pd.DataFrame(
    {"user_id": user_ids, "business_id": business_ids, "rating": stars, "date": dates, "text": texts}
)
# Keep only active users, i.e. users with at least 5 reviews
user_counts = ratings["user_id"].value_counts()
active_users = user_counts.loc[user_counts >= 5].index.tolist()
ratings = ratings.loc[ratings.user_id.isin(active_users)]

100%|██████████| 6685900/6685900 [01:37<00:00, 68835.62it/s]
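The loading cell keeps every review in memory and filters inactive users only afterwards. A minimal sketch of a two-pass alternative follows, assuming the same JSON-lines layout and field names; the helper `load_active_reviews` and the toy file are hypothetical and only illustrate the idea.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_active_reviews(path, min_reviews=5):
    """Return review dicts for users with at least `min_reviews` reviews."""
    # Pass 1: count reviews per user without keeping any review in memory.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["user_id"]] += 1
    active = {user for user, n in counts.items() if n >= min_reviews}

    # Pass 2: keep only the needed fields of reviews written by active users.
    fields = ("user_id", "business_id", "stars", "date", "text")
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            blob = json.loads(line)
            if blob["user_id"] in active:
                rows.append({key: blob[key] for key in fields})
    return rows


# Toy JSON-lines file: user "a" has 5 reviews, user "b" has only 1.
toy_path = Path(tempfile.mkdtemp()) / "toy_review.json"
with open(toy_path, "w", encoding="utf-8") as f:
    for i, user in enumerate(["a"] * 5 + ["b"]):
        record = {"user_id": user, "business_id": f"biz{i}", "stars": 4.0,
                  "date": "2018-01-01 00:00:00", "text": "ok"}
        f.write(json.dumps(record) + "\n")

toy_rows = load_active_reviews(toy_path)
print(len(toy_rows))
```

The trade-off is that the file is parsed twice in exchange for a smaller peak memory footprint. The result can be turned into the same frame with `pd.DataFrame(toy_rows).rename(columns={"stars": "rating"})`.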


In [6]:
active_users[0:5]

['CxDOIDnH8gp9KXzpBHJYXw',
 'bLbSNkLggFnqwNNzzq-Ijw',
 'PKEzKWv_FktMm2mGPjwd0Q',
 'ELcQDlf69kb-ihJfxZyL0A',
 'DK57YibC5ShBmqQl97CKog']

In [7]:
ratings.head()

| | user_id | business_id | rating | date | text |
|---|---|---|---|---|---|
| 0 | hG7b0MtEbXx5QzbzE6C_VA | ujmEBvifdJM6h6RLv4wQIg | 1.0 | 2013-05-07 04:34:36 | Total bill for this horrible service? Over $8G... |
| 2 | n6-Gk65cPZL6Uz8qRm3NYw | WTqjgwHlXbSFevF32_DJVw | 5.0 | 2016-11-09 20:09:03 | I have to say that this office really has it t... |
| 6 | jlu4CztcSxrKx56ba1a5AQ | 3fw2X5bZYeW9xCz_zGhOHg | 3.0 | 2016-05-07 01:21:02 | Tracy dessert had a big name in Hong Kong and ... |
| 7 | d6xvYpyzcfbF_AZ8vMB7QA | zvO-PJCpNk4fgAVUnExYAA | 1.0 | 2010-10-05 19:12:35 | This place has gone down hill. Clearly they h... |
| 8 | sG_h0dIzTKWa3Q6fmb4u-g | b2jN2mm9Wf3RcrZCgfo1cg | 2.0 | 2015-01-18 14:04:18 | I was really looking forward to visiting after... |


In [8]:
ratings.shape

(4538272, 5)

In [9]:
n_users = len(ratings.user_id.unique())
n_restaurants = len(ratings.business_id.unique())
print('Unique Users: {0}, unique restaurants: {1}'.format(n_users, n_restaurants))

Unique Users: 286130, unique restaurants: 185723


### Notes

1. Baseline
   - User based (Bowen)
   - ALS (later)
2. Cold start (<5 reviews)
   - Content based: nearest neighbors on review text and metadata text (Zhongling)
3. Main model
   - Field-aware factorization machine (James)
   - Locality Sensitive Hashing (Ujjwal)
   - Collective Matrix Factorization (Bowen)
   - Hybrid approach
4. Metrics
   - TBD

### Train / Holdout 

The review dataset has about 6.7 million rows (about 4.5 million after dropping users with fewer than 5 reviews), so the group-by and aggregation operations needed to construct the holdout set are slow. To speed up testing, I randomly subsample 1/10 of the users. Because the sample is drawn over users rather than over rows, every sampled user keeps all of their reviews, so downsampling does not create new inactive users. This is not optimal practice; I did it solely to shorten the data pre-processing and modeling cycle.
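The difference between sampling rows and sampling users can be seen on a toy frame. This is a sketch with made-up data (four hypothetical users with exactly 5 reviews each), not the notebook's actual sample.

```python
import pandas as pd

# Toy ratings: 4 users with exactly 5 reviews each (the activity threshold).
toy = pd.DataFrame({
    "user_id": [u for u in "abcd" for _ in range(5)],
    "rating": [float(i % 5 + 1) for i in range(20)],
})

# Row-level sampling: 7 of 20 rows, so at least one user loses reviews.
by_row = toy.sample(n=7, replace=False, random_state=1)

# User-level sampling: pick half of the users, then keep all of their rows.
sampled_users = pd.Series(toy["user_id"].unique()).sample(frac=0.5, random_state=1)
by_user = toy[toy["user_id"].isin(sampled_users)]

print(by_row["user_id"].value_counts().min())   # fewer than 5 for some user
print(by_user["user_id"].value_counts().min())  # every sampled user keeps 5
```

Row-level sampling would push some users below the 5-review threshold, which is why the sample is drawn over user ids.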

In [10]:
# ratings_sample = ratings.sample(frac= 1/10, replace=False, random_state=1)
# Downsample by users
user_id_unique = ratings.user_id.unique()
user_id_sample = pd.DataFrame(user_id_unique, columns=['unique_user_id']) \
                    .sample(frac= 1/10, replace=False, random_state=1)

In [11]:
ratings_sample = ratings.merge(user_id_sample, left_on='user_id', right_on='unique_user_id') \
                    .drop(['unique_user_id'], axis=1)
print(ratings_sample.head())
print(ratings_sample.shape)

                  user_id             business_id  rating  \
0  n6-Gk65cPZL6Uz8qRm3NYw  WTqjgwHlXbSFevF32_DJVw     5.0   
1  n6-Gk65cPZL6Uz8qRm3NYw  hk5wpV-_pi5jmDDVPeG8DA     5.0   
2  n6-Gk65cPZL6Uz8qRm3NYw  30Q5xBagQHmkwp8Q9I1FCg     5.0   
3  n6-Gk65cPZL6Uz8qRm3NYw  UtWngqS-WloIY_A53W5K-Q     5.0   
4  n6-Gk65cPZL6Uz8qRm3NYw  dU-Nt1-LjV9mAgFOVcdAJw     5.0   

                  date                                               text  
0  2016-11-09 20:09:03  I have to say that this office really has it t...  
1  2018-09-14 18:50:19  I highly recommend Arizona Pet Mortuary, David...  
2  2018-02-03 23:27:43  First time at this restaurant our server "Ramo...  
3  2016-02-18 06:42:16  Such an amazing hospital with friendly staff, ...  
4  2018-08-15 22:14:18  Went for my yearly GYN exam and was seen by Lo...  
(463277, 5)


In [12]:
#ratings_sample.to_csv('ratings_sample.csv')

In [25]:
# Hold out each user's most recent review (rows tied on the latest timestamp are all held out)
ratings_user_date = ratings_sample.loc[:, ['user_id', 'date']]
ratings_user_date.date = pd.to_datetime(ratings_user_date.date)
index_holdout = ratings_user_date.groupby(['user_id'], sort=False)['date'].transform('max') == ratings_user_date['date']
ratings_holdout = ratings_sample[index_holdout]
ratings_traincv = ratings_sample[~index_holdout]

# Repeat on the remainder: each user's most recent remaining review becomes the cv set
ratings_user_date = ratings_traincv.loc[:, ['user_id', 'date']]
ratings_user_date.date = pd.to_datetime(ratings_user_date.date)
index_cv = ratings_user_date.groupby(['user_id'], sort=False)['date'].transform('max') == ratings_user_date['date']
ratings_cv = ratings_traincv[index_cv]
ratings_train = ratings_traincv[~index_cv]
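One property of the mask-based split is that a user with several reviews sharing the latest timestamp contributes all of them to the held-out set. If exactly one row per user is wanted, a possible alternative is to sort and take `groupby(...).tail(1)`. The helper `split_last_review` and the toy data below are hypothetical, and the sketch assumes the frame has a unique index.

```python
import pandas as pd


def split_last_review(df):
    """Split df into (earlier, last), where `last` has one latest row per user."""
    ordered = df.assign(date=pd.to_datetime(df["date"])).sort_values(["user_id", "date"])
    # tail(1) keeps exactly one row per user, even when timestamps tie.
    last = ordered.groupby("user_id", sort=False).tail(1)
    earlier = ordered.drop(index=last.index)  # assumes a unique index
    return earlier, last


toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "b"],
    "business_id": ["x", "y", "z", "x", "y", "z"],
    "date": ["2018-01-01", "2018-03-01", "2018-03-01",  # user "a" ties on the last date
             "2017-05-01", "2017-06-01", "2017-07-01"],
})

# Mask-based split as in the cell above: both tied rows of user "a" are selected.
dates = pd.to_datetime(toy["date"])
mask = dates.groupby(toy["user_id"]).transform("max") == dates
print(mask.sum())

# tail(1)-based split: one holdout row and one cv row per user.
rest, holdout = split_last_review(toy)
train, cv = split_last_review(rest)
print(len(holdout), len(cv), len(train))
```

Which behavior is preferable depends on the evaluation protocol; the sketch keeps one of the tied rows and returns the others to the earlier set.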

In [26]:
ratings_cv = ratings_cv[~ratings_cv.user_id.isin(['HiT9sg9pvDiEVMFHJYihXg'])]
ratings_holdout = ratings_holdout[~ratings_holdout.user_id.isin(['HiT9sg9pvDiEVMFHJYihXg'])]
ratings_holdout.to_csv('ratings_sample_holdout.csv')
ratings_cv.to_csv('ratings_sample_cv.csv')
ratings_train.to_csv('ratings_sample_train.csv')

In [27]:
print('There are {0} rows, {1} columns in training set.'.format(ratings_train.shape[0], ratings_train.shape[1]))
print('There are {0} rows, {1} columns in cv set.'.format(ratings_cv.shape[0], ratings_cv.shape[1]))
print('There are {0} rows, {1} columns in holdout set.'.format(ratings_holdout.shape[0], ratings_holdout.shape[1]))

There are 406042 rows, 5 columns in training set.
There are 28615 rows, 5 columns in cv set.
There are 28612 rows, 5 columns in holdout set.
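Row counts alone do not show whether the split respects time order. A small check such as the one below could confirm that no training review is dated after the same user's cv review, and likewise for cv versus holdout. The function name `count_order_violations` and the toy frames are assumptions for illustration.

```python
import pandas as pd


def count_order_violations(earlier, later):
    """Count rows in `earlier` dated after the same user's first row in `later`."""
    earlier_dates = pd.to_datetime(earlier["date"])
    first_later = pd.to_datetime(later["date"]).groupby(later["user_id"]).min()
    # Map each earlier row to its user's first later date (NaT if the user is absent).
    cutoff = earlier["user_id"].map(first_later)
    return int((earlier_dates > cutoff).sum())


toy_train = pd.DataFrame({"user_id": ["a", "a", "b"],
                          "date": ["2018-01-01", "2018-02-01", "2017-01-01"]})
toy_cv = pd.DataFrame({"user_id": ["a", "b"], "date": ["2018-03-01", "2017-02-01"]})
toy_holdout = pd.DataFrame({"user_id": ["a", "b"], "date": ["2018-04-01", "2017-03-01"]})

print(count_order_violations(toy_train, toy_cv))
print(count_order_violations(toy_cv, toy_holdout))
# Swapping the arguments simulates a leaky split and should report violations.
print(count_order_violations(toy_holdout, toy_cv))
```

On the real data the same calls would take `ratings_train`, `ratings_cv` and `ratings_holdout` as arguments.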
