# Applications of NLP in recommender systems

The main goal of the project is to create a recommender system of restaurant based on the reviews of the user. The dataset finally selected is a Yelp Dataset that contains information about different business in the main US and Canada cities as well as complete information about the reviews and users. The dataset in json format can be found here: https://www.yelp.com/dataset 

From all the differents json in the data set, *business* and *reviews* were selected from this project. From the *business* dataset only restaurant were selected and only 10 different cuisines were chosen among the most popular cuisenes.

In [1]:
# Import libraries

import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
import json

In [2]:
## Load Bussiness Data
path = '../data/yelp_academic_dataset_business.json'
business = pd.read_json(path, lines=True)
print(f'The dimensions of the dataset are: {business.shape[0]} rows and {business.shape[1]} columns')

The dimensions of the dataset are: 209393 rows and 14 columns


The columns from the dataset *business* contains the following information:

- business_id: *string*, 22 character unique string business id
- name: *string*, the business's name
- address: *string*, the full address of the business
- city: *string*, the city
- state: *string*, the state (if aplicable)
- postal code: *string*, the postal code
- latitude: *float*, latitude
- longitude: *float*, longitude
- stars: *float*, rounded mean of rating
- review_count: *integer*, number of reviews
- is_open: *integer*, 1 is is open
- attributes: *object*, business attributes to values
- hours: an *dictionary* of key day to value hours

In [3]:
df_business = business[business['is_open']==1].drop(['hours','is_open','review_count'], axis=1)
df_explode = df_business.assign(categories = df_business.categories.str.split(', ')).explode('categories')

In [4]:
business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",
3,6OAZjbxqM5ol29BuHsil3w,Nevada House of Hose,1015 Sharp Cir,North Las Vegas,NV,89030,36.219728,-115.127725,2.5,3,0,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Hardware Stores, Home Services, Building Suppl...","{'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ..."
4,51M2Kk903DFYI6gnB5I6SQ,USE MY GUY SERVICES LLC,4827 E Downing Cir,Mesa,AZ,85205,33.428065,-111.726648,4.5,26,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Home Services, Plumbing, Electricians, Handyma...","{'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '..."


In [5]:
# Select only business == Restaurant
df_business = business[business['is_open']==1].drop(['hours','is_open'], axis=1)
df_explode = df_business.assign(categories = df_business.categories.str.split(', ')).explode('categories')
restaurants = df_explode[df_explode['categories'].str.contains('Restaurants', case=True, na=False)]

In [6]:
# Filter restaurant by cuisine type
cuisine_list = ['Mexican', 'American', 'Japanese', 'Chinese', 'Italian', 'French', 'Indian', 'Mediterranean', 'Thai']
cuisine_restaurants = pd.DataFrame()
for cuisine in cuisine_list:
    data = restaurants[business['categories'].str.contains(cuisine, case=True, na=False)]
    data['categories'] = cuisine
    cuisine_restaurants = pd.concat([cuisine_restaurants, data])

  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [7]:
# Remove restaurant that has various types of cuisines
cuisine_restaurants['business_id'].drop_duplicates(inplace=True)
print(f'The dimensions of the dataset are: {cuisine_restaurants.shape[0]} rows' +
      f' and {cuisine_restaurants.shape[1]} columns')

The dimensions of the dataset are: 24176 rows and 12 columns


In [8]:
# Export to csv
cuisine_restaurants.to_csv('../data/yelp_restaurants.csv', index=False)

In [116]:
# Load the big Dataset (+6GB) of reviews avoiding the memory error
path = '../data/yelp_academic_dataset_review.json'
size = 50000
review = pd.read_json(path, lines=True,
                      dtype={'review_id':str,'user_id':str,
                             'business_id':str,'stars':int,
                             'date':str,'text':str,'useful':int,
                             'funny':int,'cool':int},
                      chunksize=size)

# Loop over each chunk of data and merge with the bussiness information
chunk_list = []
for chunk_review in review:
    # Drop columns that aren't needed
    chunk_review = chunk_review.drop(['review_id','useful','funny','cool'], axis=1)
    # Renaming column name to avoid conflict with business overall star rating
    chunk_review = chunk_review.rename(columns={'stars': 'review_stars'})
    # Inner merge with edited business file so only reviews related to the business remain
    chunk_merged = pd.merge(chunk_review, cuisine_restaurants, on='business_id', how='inner')
    chunk_list.append(chunk_merged)
    
df = pd.concat(chunk_list, ignore_index=True, join='outer', axis=0)

The columns from the dataset business contains the following information:

- review_id: *string*, 22 character unique review id
- user_id: *string*, 22 character unique user id, maps to the user in user.json
- business_id: *string*, 22 character unique string business id
- stars: *integer*, rating 
- date: *string*, date format
- text: *string*, the review itself
- useful: *integer*, number of useful votes received
- funny: *integer*, number of funny votes received
- cool: *integer*, number of cool votes received

In [117]:
df.head()

Unnamed: 0,user_id,business_id,review_stars,text,date,name,address,city,state,postal_code,latitude,longitude,stars,attributes,categories
0,V34qejxNsCbcgD8C0HVk-Q,HQl28KMwrEKHqhFrrDqVNQ,5,I love Deagan's. I do. I really do. The atmosp...,2015-12-05 03:18:11,Deagan's Kitchen & Bar,14810 Detroit Ave,Lakewood,OH,44107,41.485192,-81.800145,4.0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...",American
1,zFCuveEe6M-ijY1iy23IJg,HQl28KMwrEKHqhFrrDqVNQ,5,"We walked into Melt. ""Did you want to put your...",2011-08-25 04:24:23,Deagan's Kitchen & Bar,14810 Detroit Ave,Lakewood,OH,44107,41.485192,-81.800145,4.0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...",American
2,4V985R3RG-rv0B7WCPQzeQ,HQl28KMwrEKHqhFrrDqVNQ,1,I commented on how slow the service was last A...,2015-03-04 20:37:43,Deagan's Kitchen & Bar,14810 Detroit Ave,Lakewood,OH,44107,41.485192,-81.800145,4.0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...",American
3,nFGcoL6wuPQzxsNJVSfGrA,HQl28KMwrEKHqhFrrDqVNQ,4,We walked in off the streets on a September ni...,2014-09-10 01:38:55,Deagan's Kitchen & Bar,14810 Detroit Ave,Lakewood,OH,44107,41.485192,-81.800145,4.0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...",American
4,CJqgUQeWhdgbDyLAFy7xvQ,HQl28KMwrEKHqhFrrDqVNQ,4,Brunch on Saturday was excellent. The Bloody M...,2018-01-21 18:50:29,Deagan's Kitchen & Bar,14810 Detroit Ave,Lakewood,OH,44107,41.485192,-81.800145,4.0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...",American


In [115]:
# Dimensions of the data
df.shape

(3140427, 15)

In [118]:
# Distribution of rating
df.review_stars.value_counts()

5    1354320
4     766599
3     382444
1     364645
2     272419
Name: review_stars, dtype: int64

In [119]:
df.to_csv("../data/yelp_reviews_restaurant.csv", index=False)