In [1]:
import pickle

def load_obj(name):
    """Load a pickled object from obj/<name>.pkl."""
    with open('obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

master_bizdet_df = load_obj('master_bizdet_df2')
my_rev_features = load_obj('my_rev_features')
my_bm_features = load_obj('my_bm_features')
friend_rev_features = load_obj('friend_rev_features')
friend_bm_features = load_obj('friend_bm_features')
both_sim = load_obj('both_sim')
r_sim = load_obj('r_sim')
rec_details = load_obj('rec_details')

# Yelp Recommendation Engine: 

## Background:
Yelp is a local-search service powered by crowd-sourced reviews of local businesses. I visit Yelp frequently and rely heavily on its reviews to decide which restaurants to visit and what to order there; outside of food, I also rely on Yelp reviews for the services I seek. In addition to leaving reviews with a rating from 1 to 5, users with a Yelp account can add friends and create Bookmarks for businesses they want to visit in the future. Yelp also has its own API, with various endpoints that provide information on businesses, events, and categories.

## Purpose: 
To build a recommendation system for Yelp that uses bookmarks and review ratings to find the users most similar to me, then recommends businesses that those users have bookmarked or reviewed.

I spend a lot of time going through Yelp trying to find a place to eat, especially if I am not craving anything in particular. With a recommendation engine, I can narrow down the list of places that I may enjoy.

## Acquiring Data Set: 

#### Scraped
The following data will be scraped from www.Yelp.com:
- list of friends of my account - `user_id`
- list of my bookmarks - `business_id`, `url`
- list of my reviews - `business_id`, `url`, `user_rating`
- list of my friends' bookmarks - `business_id`, `url`
- list of my friends' reviews - `business_id`, `url`, `user_rating`

#### API
For each business in the lists above, details are pulled from the Business Details endpoint of the __[Yelp Fusion API](https://www.yelp.com/developers/documentation/v3/business)__.

Note: Yelp's API has a limit of 5000 calls per day.

Relevant fields:

| Name | Type   |  Description   |
|------|------|------|
|   categories | object[]| A list of category title and alias pairs associated with this business.|
|   categories[x].alias | string| Alias of a category, when searching for business in certain categories, use alias rather than the title.|
|   alias | string| Unique Yelp alias of this business. Can contain unicode characters. Example: 'yelp-san-francisco'. Also see: What's the difference between the Yelp business ID and business alias?|
|   is_closed | bool| Whether business has been (permanently) closed|
|   price | string| Price level of the business|
|   rating | decimal| Rating for this business|
|   review_count | int| Number of reviews for this business.|
|   id | string| Unique Yelp ID of this business.|
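
As a rough sketch of how these fields could be flattened into one row per business, the function below takes a Business Details response that has already been parsed into a dict. The function name `flatten_business`, the hand-written example payload, and the rule that `price_val` equals the number of dollar signs are my assumptions for illustration; the column names mirror the `master_bizdet_df` output shown later.

```python
def flatten_business(payload, max_categories=5):
    """Flatten one parsed Business Details response into a flat row (dict)."""
    row = {
        "alias": payload.get("alias"),
        "id": payload.get("id"),
        "is_closed": payload.get("is_closed"),
        "price": payload.get("price"),
        "rating": payload.get("rating"),
        "review_count": payload.get("review_count"),
    }
    # Spread the category aliases over cat_0 .. cat_{max_categories - 1}.
    aliases = [c["alias"] for c in payload.get("categories", [])]
    for i in range(max_categories):
        row[f"cat_{i}"] = aliases[i] if i < len(aliases) else None
    # Assumption: price_val is simply the count of dollar signs in `price`.
    price = payload.get("price")
    row["price_val"] = float(len(price)) if price else None
    return row


# Hand-written example payload (values copied from one row of the output below).
example = {
    "id": "TzIJzamxdVGc3zReKbLGaA",
    "alias": "providence-los-angeles-2",
    "categories": [{"alias": "seafood", "title": "Seafood"}],
    "is_closed": False,
    "price": "$$$$",
    "rating": 4.5,
    "review_count": 2478,
}
row = flatten_business(example)
print(row["cat_0"], row["price_val"])
```

A business without a `price` field ends up with a null `price_val`, which matches the blank price cells seen in the data.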

## Cleaning Data and Feature Engineering

1. For each business, create a dataframe containing its business details (`master_bizdet_df`).
2. For my own bookmarks + reviews and my friends' bookmarks + reviews, create a dataframe of features that will be used to calculate similarities:
    - `price_val` - average price level of the businesses bookmarked/reviewed (1 being cheapest, 4 being priciest)
    - `rating` - average rating of the businesses bookmarked/reviewed (1 being lowest, 5 being highest)
    - `review_count` - number of reviews the business has
    - `categories` - one column per category, holding that category's share of the user's total bookmarks/reviews. Note that a business can have up to 4 different categories
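
A minimal sketch of the per-user feature step, assuming the input is one row per business that a single user bookmarked or reviewed, with the `cat_*`, `price_val`, and `rating` columns from `master_bizdet_df`. The function name `user_features` and the toy data are hypothetical, and I am assuming each category share is computed over the total number of category tags; only the average price, average rating, and category shares are covered here.

```python
import pandas as pd


def user_features(biz):
    """Build one feature vector from the businesses of a single user."""
    cat_cols = [c for c in biz.columns if c.startswith("cat_")]
    # Pool every category slot into one Series and drop the empty slots.
    cats = pd.Series(biz[cat_cols].to_numpy().ravel()).dropna()
    # Share of each category among all category tags (assumed denominator).
    shares = cats.value_counts(normalize=True)
    features = {
        "price_val": biz["price_val"].mean(),  # missing prices are skipped
        "rating": biz["rating"].mean(),
    }
    features.update(shares.to_dict())
    return pd.Series(features)


# Toy data: three businesses for one user.
toy = pd.DataFrame({
    "cat_0": ["coffee", "seafood", "coffee"],
    "cat_1": ["australian", None, None],
    "price_val": [2.0, 4.0, None],
    "rating": [4.5, 4.5, 4.0],
})
feats = user_features(toy)
print(feats)
```

Stacking one such vector per user and filling the missing categories with 0 would give a wide table like the feature dataframes shown below.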

In [9]:
master_bizdet_df.head() # relevant business details

|   | alias | cat_0 | cat_1 | cat_2 | cat_3 | cat_4 | id | is_closed | price | rating | review_count | price_val |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | fork-in-aussie-pies-santa-monica-2 | coffee | australian |  |  |  | 5lJMJRTui0tQ3G8gdEHo7A | True | `$$` | 4.5 | 634 | 2.0 |
| 0 | providence-los-angeles-2 | seafood |  |  |  |  | TzIJzamxdVGc3zReKbLGaA | False | `$$$$` | 4.5 | 2478 | 4.0 |
| 0 | society-billiards-and-cafe-pacific-beach-3 | poolhalls | bars |  |  |  | _nEYB0urvuAxSBrFpYxwcg | True | `$$` | 3.0 | 154 | 2.0 |
| 0 | island-vintage-coffee-honolulu-4 | coffee |  |  |  |  | UdvXV2ux3uOj3UqP04cjqA | False | `$$` | 4.5 | 1317 | 2.0 |
| 0 | beat-the-lock-escape-rooms-santa-clara | escapegames | kids_activities |  |  |  | 0qj734IaYhGgAg2jOpiI2g | False |  | 4.5 | 82 |  |


In [7]:
my_rev_features.head() # features used to calculate similarity

| user_id | price_val | rating | guamanian | eyelashservice | japanese | newamerican | seafood | pizza | thai | mobilephones | ... | tradamerican | museums | othersalons | tapas | desserts | buffets | laserlasikeyes | musicvenues | sportsbars | num |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| ShHBKjuJbQAVBLs7DgA95A | 2.072727 | 4.064516 | 0.014925 | 0.014925 | 0.134328 | 0.044776 | 0.119403 | 0.014925 | 0.074627 | 0.014925 | ... | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 0.014925 | 1.0 |


In [29]:
print('my bookmark features shape: ', my_bm_features.shape)
print('my review features shape: ', my_rev_features.shape)
print('friends bookmark features shape: ', friend_bm_features.shape)
print('friends review features shape: ', friend_rev_features.shape)

my bookmark features shape:  (1, 124)
my review features shape:  (1, 73)
friends bookmark features shape:  (144, 693)
friends review features shape:  (59, 554)


## Exploring Data

Looks like my friends and I have high standards for the places we review and bookmark, but like to eat at cheaper places. At least 75% of all businesses have a rating of 4 stars or more, and at least 75% of the businesses with a listed price level have 2 dollar signs or fewer.

In [10]:
master_bizdet_df.describe()

|   | rating | review_count | price_val |
|------|------|------|------|
| count | 16707.0 | 16707.0 | 15074.0 |
| mean | 4.060962 | 561.238702 | 1.857967 |
| std | 0.487062 | 776.005924 | 0.676829 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 4.0 | 121.0 | 1.0 |
| 50% | 4.0 | 314.0 | 2.0 |
| 75% | 4.5 | 701.0 | 2.0 |
| max | 5.0 | 15746.0 | 4.0 |


Next, I want to see the 10 most common categories among all users' bookmarks and reviews and compare them to my own bookmarks and my own reviews.

<img src="./Yelp Pics/All Categories Bar Chart.png">
<img src="./Yelp Pics/My Reviews_Category Bar Chart.png">
<img src="./Yelp Pics/My Bookmarks_Category Bar Chart.png">

I definitely see similarities among the categories already. Looks like we all enjoy coffee, breakfast/brunch, sushi, Japanese, seafood, and bakeries.

Now let's look at where I stand in comparison to the top 10 most common categories among my friends.

<img src= "./Yelp Pics/top10_histogram.png">

<img src= "./Yelp Pics/cat10_boxplot2.png">

## Calculate Similarity

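To measure how similar other users are to me, I used cosine similarity, which compares the direction of two feature vectors rather than their magnitude. It is commonly used in high-dimensional, non-negative feature spaces, which fits this case with over 700 different features. Note that in the output below, lower values mean more similar: my own row scores 0.0, so the `similarity` column appears to hold the cosine distance (1 - cosine similarity) rather than the similarity itself.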
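
For reference, here is a minimal sketch of the cosine similarity calculation using only NumPy and pandas. The function name `cosine_similarity_to` and the toy feature values are hypothetical, and this is not necessarily how `r_sim` was produced. It assumes a category that a user never touched counts as 0.

```python
import numpy as np
import pandas as pd


def cosine_similarity_to(me, others):
    """Cosine similarity between my feature vector and each row of `others`."""
    # Align both sides on the union of feature columns, filling gaps with 0.
    cols = me.index.union(others.columns)
    me_v = me.reindex(cols).fillna(0.0).to_numpy(dtype=float)
    oth = others.reindex(columns=cols).fillna(0.0).to_numpy(dtype=float)
    # cos(a, b) = a . b / (|a| * |b|); an all-zero row would give NaN here.
    sims = oth @ me_v / (np.linalg.norm(oth, axis=1) * np.linalg.norm(me_v))
    return pd.Series(sims, index=others.index, name="similarity").sort_values(
        ascending=False
    )


# Toy data: my category shares versus two friends.
me = pd.Series({"coffee": 0.5, "seafood": 0.5})
friends = pd.DataFrame(
    {"coffee": [0.5, 0.0], "pizza": [0.0, 1.0], "seafood": [0.5, 0.0]},
    index=["friend_a", "friend_b"],
)
sims = cosine_similarity_to(me, friends)
print(sims)
```

Subtracting these values from 1 gives the cosine distance, where 0 means the two vectors point in the same direction.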

In [2]:
r_sim[:10] # top 10 similar users

| user_id | similarity |
|------|------|
| ShHBKjuJbQAVBLs7DgA95A | 0.0 |
| WJfNVm4mXXDt1Vakg13CpA | 0.000376 |
| MzC1_5kXxGw336fMYSrsdg | 0.000501 |
| UaKdT4twgZ4DguHJhT6vPw | 0.000522 |
| -OKmukwdCrHq6bkF2_gSwQ | 0.000609 |
| MHpv_wmNjt3lw6gX68n8Fw | 0.000669 |
| QjLiYeQLeMIqYu-ncmkUXg | 0.000712 |
| 5RifcJP_Lf-MzojTHybBNw | 0.000727 |
| yx3eatgbMTnzzzAcnBjxCQ | 0.000738 |
| AtBlOFl4FUtmLOzebCibEw | 0.000758 |


## Get Recommendations

To get recommendations:
1. Start with the business IDs from the reviews and bookmarks of the three most similar users.
2. Remove the businesses that I have already bookmarked or reviewed.
3. Keep only the businesses that all three users have bookmarked or reviewed.
4. Filter out businesses that are closed (`is_closed == True`).
5. Keep only businesses with a rating greater than 4 stars.
6. Create a dataframe with the resulting business IDs and merge it with the master business detail dataframe.
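
The steps above can be sketched roughly as follows. The function name `recommend`, the dict-of-sets input, and the toy IDs are assumptions for illustration; the `id`, `is_closed`, and `rating` columns follow `master_bizdet_df`. The steps are applied in a slightly different order than listed, which does not change the result.

```python
import pandas as pd


def recommend(similar_user_biz, my_biz, details, min_rating=4.0):
    """similar_user_biz maps user_id -> set of bookmarked/reviewed business IDs."""
    # Steps 1 and 3: businesses that every similar user has in common.
    common = set.intersection(*similar_user_biz.values())
    # Step 2: drop anything I have already bookmarked or reviewed.
    candidates = common - set(my_biz)
    # Step 6: attach the business details.
    recs = details[details["id"].isin(candidates)]
    # Steps 4 and 5: keep open businesses rated above the threshold.
    is_open = ~recs["is_closed"].astype(bool)
    return recs[is_open & (recs["rating"] > min_rating)].reset_index(drop=True)


# Toy data with made-up business IDs.
details = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "is_closed": [False, True, False, False],
    "rating": [4.5, 4.5, 4.0, 5.0],
})
similar = {
    "u1": {"a", "b", "c", "d"},
    "u2": {"a", "b", "c", "d"},
    "u3": {"a", "b", "c", "d", "e"},
}
recs = recommend(similar, my_biz={"d"}, details=details)
print(recs)
```

In the toy data only business `a` survives: `b` is closed, `c` is rated exactly 4.0, `d` is already on my list, and `e` is not shared by all three users.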

In [3]:
rec_details

|   | alias | cat_0 | cat_1 | cat_2 | cat_3 | cat_4 | id | is_closed | price | rating | review_count | price_val |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
| 0 | eight-am-san-francisco | newamerican | breakfast_brunch |  |  |  | 4-ra3RxOy1PpvnaK49dy8w | False | `$$` | 4.5 | 1019.0 | 2 |
| 0 | kokkari-estiatorio-san-francisco | greek | mediterranean |  |  |  | PsY5DMHxa5iNX_nX0T-qPA | False | `$$$` | 4.5 | 4240.0 | 3 |
| 0 | sixth-course-san-francisco | desserts | chocolate | gelato |  |  | K0r1oltM3JbM14ApsTe_yA | False | `$$` | 4.5 | 319.0 | 2 |
| 0 | wood-tavern-oakland | newamerican |  |  |  |  | bhnKl105GwMVVlsiUnwr2w | False | `$$$` | 4.5 | 1816.0 | 3 |


<img src="./Yelp Pics/Recommendations Bar Chart.png">

## Results

The results are a list of 4 businesses that each have a rating of 4.5 and appear very popular, with hundreds to thousands of reviews. While I didn't add a column to indicate location, the business alias includes the city, and all four businesses are in the Bay Area (three in San Francisco and one in Oakland).

## Limitations and Further Improvement

The recommendations do not exclude businesses that I have been to but have not reviewed or bookmarked, because no check-in data is available (and even if it were, it would only help if I used the check-in feature consistently).

Yelp's API only allows businesses to be called using their Yelp name and not the business ID. Because some business names contain special characters, those businesses could not be joined with their business details. As a result, 1021 businesses have null business details. I determined that 1021/16707, or about 6% of the data, is small enough to continue with the analysis.
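
One possible improvement, which I have not tested against the API, is to percent-encode the name before building the request URL so that special characters are sent as UTF-8 escapes. The sketch below only builds the URL and makes no network call; the helper name `business_details_url` and the example alias are hypothetical, and whether this would recover the 1021 missing businesses is an assumption.

```python
from urllib.parse import quote

API_ROOT = "https://api.yelp.com/v3/businesses/"


def business_details_url(alias):
    """Build a Business Details URL with the alias percent-encoded."""
    # safe="" also escapes "/" so the alias stays a single path segment.
    return API_ROOT + quote(alias, safe="")


# Hypothetical alias containing a non-ASCII character.
url = business_details_url("café-exemplo-san-francisco")
print(url)
```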