
# ML-1: Collaborative filtering
## Project: Building a Restaurant Recommendation System

## Improvement Model


**ML-1 Cohort 1** <br>

People who worked on this project:

Nehal Sharma 11675

Keerthi Jayaram 11688

Manila Devaraj 11699

Gurushankar K. 500



# Model Improvement Using Cosine Similarity

Clearly, the base model recommendations based on stars are not so great. We have to consider other columns of data to make better recommendations. 


Here, we use the tips that users write about a business. Tips are shorter than reviews and tend to convey quick suggestions. We estimate how similar two restaurants are by comparing the text of their tips.

Text similarity measures how "close" two pieces of text are, both lexically (surface closeness) and semantically (meaning).
We want a metric that gives higher scores to texts about the same topic and lower scores to texts about different topics. As a first approximation, we convert each tip into a vector of word counts using `CountVectorizer`. Word counts capture lexical overlap, so they approximate semantic similarity only to the extent that tips with similar meaning share vocabulary.

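To compare the resulting vectors, we use cosine similarity. It is a similarity measure rather than a clustering algorithm: it scores each pair of tips, and we rank candidates by that score.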
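To make the idea concrete, here is a minimal sketch on three made-up tips (they are invented for illustration and are not taken from the Yelp data). It shows that tips sharing most of their words score high and tips sharing none score zero.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tips, invented for illustration (not from the Yelp data).
toy_tips = [
    "great tacos and friendly staff",     # tip 0
    "friendly staff and great burritos",  # tip 1
    "parking lot is always full",         # tip 2
]

# Each tip becomes a row of word counts over the shared vocabulary.
toy_counts = CountVectorizer().fit_transform(toy_tips)

# Pairwise cosine similarity between the count vectors.
toy_sim = cosine_similarity(toy_counts)

# Tips 0 and 1 share four of their five words; tip 2 shares none with them.
print(toy_sim.round(2))
```

Tips 0 and 1 differ only in one word, so their score is close to 1, while tip 2 has no word in common with either and scores 0.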


## Table of Contents 

* Loading Data

  

### **Importing the necessary libraries**

In [33]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Loading Data

In [34]:
df_b=pd.read_json("yelp_academic_dataset_business.json", lines=True)
df_b.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,f9NumwFMBDn751xgFiRbNA,The Range At Lake Norman,10913 Bailey Rd,Cornelius,NC,28031,35.462724,-80.852612,3.5,36,1,"{'BusinessAcceptsCreditCards': 'True', 'BikePa...","Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh...","{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'..."
1,Yzvjg0SayhoZgCljUJRF9Q,"Carlos Santo, NMD","8880 E Via Linda, Ste 107",Scottsdale,AZ,85258,33.569404,-111.890264,5.0,4,1,"{'GoodForKids': 'True', 'ByAppointmentOnly': '...","Health & Medical, Fitness & Instruction, Yoga,...",
2,XNoUzKckATkOD1hP6vghZg,Felinus,3554 Rue Notre-Dame O,Montreal,QC,H4C 1P4,45.479984,-73.58007,5.0,5,1,,"Pets, Pet Services, Pet Groomers",
3,6OAZjbxqM5ol29BuHsil3w,Nevada House of Hose,1015 Sharp Cir,North Las Vegas,NV,89030,36.219728,-115.127725,2.5,3,0,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Hardware Stores, Home Services, Building Suppl...","{'Monday': '7:0-16:0', 'Tuesday': '7:0-16:0', ..."
4,51M2Kk903DFYI6gnB5I6SQ,USE MY GUY SERVICES LLC,4827 E Downing Cir,Mesa,AZ,85205,33.428065,-111.726648,4.5,26,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Home Services, Plumbing, Electricians, Handyma...","{'Monday': '0:0-0:0', 'Tuesday': '9:0-16:0', '..."


In [35]:
df_b.shape

(209393, 14)

We perform the same transformations on the business JSON file as we did in the EDA notebook:
* Filter on the `is_open` column and keep only businesses with the value `1`.
* Use `DataFrame.explode` to give each of a business's categories its own row.
* Keep only the rows whose category is "Restaurants".
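The three steps above can be sketched on a tiny, made-up table (the names and values below are hypothetical, not rows from the dataset). The point is to show how splitting and exploding the comma-separated `categories` string turns one business row into one row per category.

```python
import pandas as pd

# Hypothetical mini version of the business table (values invented).
toy_b = pd.DataFrame({
    "name": ["A Diner", "B Salon", "C Cafe"],
    "is_open": [1, 1, 0],
    "categories": [
        "Restaurants, Diners",
        "Hair Salons, Beauty & Spas",
        "Restaurants, Cafes",
    ],
})

# Step 1: keep only open businesses.
toy_open = toy_b[toy_b["is_open"] == 1]

# Step 2: split the string into a list, then give each category its own row.
toy_exploded = toy_open.assign(
    categories=toy_open["categories"].str.split(", ")
).explode("categories")

# Step 3: keep only the rows whose category is exactly "Restaurants".
toy_restaurants = toy_exploded[toy_exploded["categories"] == "Restaurants"]

print(toy_exploded)
print(toy_restaurants)
```

The closed cafe is dropped in step 1, the two open businesses expand to four rows in step 2, and only the diner survives step 3.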

In [36]:
df_b = df_b[df_b['is_open']==1]

In [37]:
df_explode = df_b.assign(categories = df_b.categories.str.split(', ')).explode('categories')

In [38]:
df_explode.shape

(715205, 14)

In [39]:
df_restaurants = df_explode[df_explode['categories']=="Restaurants"]

In [40]:
del df_b
del df_explode

In [41]:
df_restaurants.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
8,pQeaRpvuhoEqudo3uymHIQ,The Empanadas House,404 E Green St,Champaign,IL,61820,40.110446,-88.233073,4.5,5,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:30-14:30', 'Tuesday': '11:30-14..."
24,eBEfgOPG7pvFhb2wcG9I7w,Philthy Phillys,"15480 Bayview Avenue, unit D0110",Aurora,ON,L4G 7J1,44.010962,-79.448677,4.5,4,1,"{'RestaurantsTableService': 'False', 'Restaura...",Restaurants,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'..."
25,lu7vtrp_bE9PnxWfA8g4Pg,Banzai Sushi,300 John Street,Thornhill,ON,L3T 5W4,43.820492,-79.398466,4.5,7,1,"{'GoodForKids': 'True', 'RestaurantsTakeOut': ...",Restaurants,
30,9sRGfSVEfLhN_km60YruTA,Apadana Restaurant,13071 Yonge Street,Richmond Hill,ON,L4E 1A5,43.947011,-79.454862,3.0,3,1,"{'Ambience': '{'touristy': False, 'hipster': F...",Restaurants,"{'Tuesday': '12:0-21:0', 'Wednesday': '12:0-21..."
33,vjTVxnsQEZ34XjYNS-XUpA,Wetzel's Pretzels,"4550 East Cactus Rd, #KSFC-4",Phoenix,AZ,85032,33.602822,-111.983533,4.0,10,1,"{'GoodForKids': 'True', 'RestaurantsTakeOut': ...",Restaurants,"{'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'..."


***Note: We encountered a MemoryError while running the notebook because the state "ON" has a large number of restaurants. To work around this system limitation, we filtered the dataframe to state "IL".***
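One plausible contributor to a MemoryError of this kind is the dense tip-by-tip similarity matrix, whose size grows with the square of the number of tips. The sketch below is an assumption about how that could be avoided, not what this notebook does: it computes similarities one block of rows at a time and keeps only the top `k` neighbours per tip. The helper `top_k_similar` and its parameters are hypothetical names introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(matrix, k=10, block_size=1000):
    """Return, for every row, the indices of its k most similar other rows.

    Only a (block_size x n) slice of the similarity matrix exists at a time.
    """
    n = matrix.shape[0]
    k = min(k, n - 1)
    neighbours = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Similarity of this block of rows against all rows (dense block).
        block = cosine_similarity(matrix[start:stop], matrix)
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Sort each row by descending similarity and keep the first k columns.
        order = np.argsort(-block, axis=1, kind="stable")
        neighbours[start:stop] = order[:, :k]
    return neighbours


# Hypothetical tips, invented for illustration.
toy_tips = ["great tacos", "great tacos here", "slow service", "service was slow"]
toy_matrix = CountVectorizer().fit_transform(toy_tips)
print(top_k_similar(toy_matrix, k=1, block_size=2))
```

With this layout the memory needed for similarities is proportional to `block_size` times the number of tips, plus the small table of neighbour indices, so a larger state may fit where the full matrix does not.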

In [42]:
df = df_restaurants[df_restaurants['state']=="IL"]

In [43]:
tip = pd.read_json("yelp_academic_dataset_tip.json", lines=True)

In [44]:
tip.shape

(1320761, 5)

In [45]:
tip.head(20)

Unnamed: 0,user_id,business_id,text,date,compliment_count
0,hf27xTME3EiCp6NL6VtWZQ,UYX5zL_Xj9WEc_Wp-FrqHw,Here for a quick mtg,2013-11-26 18:20:08,0
1,uEvusDwoSymbJJ0auR3muQ,Ch3HkwQYv1YKw_FO06vBWA,Cucumber strawberry refresher,2014-06-15 22:26:45,0
2,AY-laIws3S7YXNl_f_D6rQ,rDoT-MgxGRiYqCmi0bG10g,Very nice good service good food,2016-07-18 22:03:42,0
3,Ue_7yUlkEbX4AhnYdUfL7g,OHXnDV01gLokiX1ELaQufA,It's a small place. The staff is friendly.,2014-06-06 01:10:34,0
4,LltbT_fUMqZ-ZJP-vJ84IQ,GMrwDXRlAZU2zj5nH6l4vQ,"8 sandwiches, $24 total...what a bargain!!! An...",2011-04-08 18:12:01,0
5,HHNBqfbDR8b1iq-QGxu8ww,ALwAlxItASeEs2vYAeLXHA,Great ramen! Not only is the presentation gorg...,2015-05-20 20:17:38,0
6,r0j4IpUbcdC1-HfoMYae4w,d_L-rfS1vT3JMzgCUGtiow,Cochinita Pibil was memorable & delicious !,2014-09-01 01:23:48,0
7,gxVQZJVeKBUk7jEhSyqv-A,5FIOXmUE3qMviX9GafGH-Q,Get a tsoynami for sure.,2010-01-30 02:03:16,0
8,2hdR7KYAmnCk2FjTnPFsuw,rcaPajgKOJC2vo_l3xa42A,Kelly is an awesome waitress there!,2012-05-29 02:05:56,0
9,DsWg3leomfasGs3j0rOfbQ,hfBrethLHS9iXeBNR8MtzQ,Check out the great assortment of organic & co...,2011-09-30 18:38:47,0


In [46]:
df_merge = pd.merge(df,tip,on='business_id',how='inner')

In [47]:
df_merge.head(10)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,user_id,text,date,compliment_count
0,R32Yh0XxxanldkIp11fuRg,BoBo's BBQ,1511 W Springfield Ave,Champaign,IL,61821,40.112515,-88.271575,3.5,45,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",bOTKDN3RcrbZoAZlGwFkow,Keep driving. Three better BBQ places within a...,2019-11-17 21:28:45,0
1,R32Yh0XxxanldkIp11fuRg,BoBo's BBQ,1511 W Springfield Ave,Champaign,IL,61821,40.112515,-88.271575,3.5,45,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",2oNh_vWxKo-E6aClobmL9Q,"Great food, down to business management, great...",2019-05-14 19:38:28,0
2,R32Yh0XxxanldkIp11fuRg,BoBo's BBQ,1511 W Springfield Ave,Champaign,IL,61821,40.112515,-88.271575,3.5,45,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",qZHvguRWXdfzCaD9lXRBPw,Yelp has their hours as open till 10 pm on Sat...,2015-04-26 02:19:31,0
3,R32Yh0XxxanldkIp11fuRg,BoBo's BBQ,1511 W Springfield Ave,Champaign,IL,61821,40.112515,-88.271575,3.5,45,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",qZHvguRWXdfzCaD9lXRBPw,"Good food, good prices, and nice staff.",2015-05-07 02:32:44,0
4,R32Yh0XxxanldkIp11fuRg,BoBo's BBQ,1511 W Springfield Ave,Champaign,IL,61821,40.112515,-88.271575,3.5,45,1,"{'RestaurantsAttire': 'u'casual'', 'Restaurant...",Restaurants,"{'Monday': '11:0-20:0', 'Tuesday': '11:0-20:0'...",m8D0_D-w3zGid_QhLrIx5Q,"Not feeling the political post at the store, o...",2018-11-21 00:40:42,0
5,wTBfpTjdWG_zoE62B_bLtA,Monical's Pizza,102 W Vine St,Tolono,IL,61880,39.98906,-88.262436,3.0,10,1,"{'BusinessParking': '{'garage': False, 'street...",Restaurants,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...",7hLausNdpLFwBsN8Az6VSg,Breadsticks and nacho cheese!,2013-09-01 00:09:44,0
6,wTBfpTjdWG_zoE62B_bLtA,Monical's Pizza,102 W Vine St,Tolono,IL,61880,39.98906,-88.262436,3.0,10,1,"{'BusinessParking': '{'garage': False, 'street...",Restaurants,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...",lEK4qOghgynWrEYjrwFPTg,I live in FL and Monicals is always my first s...,2015-12-14 13:14:28,0
7,Z7r_FJXEyfyvVsyv2y7gFQ,Cactus Grill,"1405 S Neil St, Ste C",Champaign,IL,61820,40.099648,-88.244768,4.0,60,1,"{'RestaurantsDelivery': 'True', 'RestaurantsRe...",Restaurants,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...",3wfIh8zXcgPmJs0UBgs_IQ,State that you refuse to pay if the order take...,2014-04-29 01:42:34,0
8,Z7r_FJXEyfyvVsyv2y7gFQ,Cactus Grill,"1405 S Neil St, Ste C",Champaign,IL,61820,40.099648,-88.244768,4.0,60,1,"{'RestaurantsDelivery': 'True', 'RestaurantsRe...",Restaurants,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...",Fehq1PxRzvT49YvYN5dg2A,Cactus Grill was close to my hotel and it was ...,2016-07-13 02:20:47,0
9,Z7r_FJXEyfyvVsyv2y7gFQ,Cactus Grill,"1405 S Neil St, Ste C",Champaign,IL,61820,40.099648,-88.244768,4.0,60,1,"{'RestaurantsDelivery': 'True', 'RestaurantsRe...",Restaurants,"{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'...",e2ht1J2xp7aqjvTllCPPNw,Loaded nachos r the best steak and chicken r g...,2015-04-09 18:50:21,0


In [48]:
df_merge.shape

(2733, 18)

In [49]:
df_final0 = df_merge.copy()

In [50]:
df_final = df_final0.drop(['business_id','address','city','state','postal_code','latitude', 'longitude','is_open','attributes','categories','hours','user_id','date','compliment_count'],axis=1)

In [51]:
df_final.head()

Unnamed: 0,name,stars,review_count,text
0,BoBo's BBQ,3.5,45,Keep driving. Three better BBQ places within a...
1,BoBo's BBQ,3.5,45,"Great food, down to business management, great..."
2,BoBo's BBQ,3.5,45,Yelp has their hours as open till 10 pm on Sat...
3,BoBo's BBQ,3.5,45,"Good food, good prices, and nice staff."
4,BoBo's BBQ,3.5,45,"Not feeling the political post at the store, o..."


In [52]:
listwords = []
for i in df_final.text:
    listwords.append(i.split())

In [53]:
df_final['bag_of_words'] = pd.Series(listwords)

In [54]:
df_final.head(20)

Unnamed: 0,name,stars,review_count,text,bag_of_words
0,BoBo's BBQ,3.5,45,Keep driving. Three better BBQ places within a...,"[Keep, driving., Three, better, BBQ, places, w..."
1,BoBo's BBQ,3.5,45,"Great food, down to business management, great...","[Great, food,, down, to, business, management,..."
2,BoBo's BBQ,3.5,45,Yelp has their hours as open till 10 pm on Sat...,"[Yelp, has, their, hours, as, open, till, 10, ..."
3,BoBo's BBQ,3.5,45,"Good food, good prices, and nice staff.","[Good, food,, good, prices,, and, nice, staff.]"
4,BoBo's BBQ,3.5,45,"Not feeling the political post at the store, o...","[Not, feeling, the, political, post, at, the, ..."
5,Monical's Pizza,3.0,10,Breadsticks and nacho cheese!,"[Breadsticks, and, nacho, cheese!]"
6,Monical's Pizza,3.0,10,I live in FL and Monicals is always my first s...,"[I, live, in, FL, and, Monicals, is, always, m..."
7,Cactus Grill,4.0,60,State that you refuse to pay if the order take...,"[State, that, you, refuse, to, pay, if, the, o..."
8,Cactus Grill,4.0,60,Cactus Grill was close to my hotel and it was ...,"[Cactus, Grill, was, close, to, my, hotel, and..."
9,Cactus Grill,4.0,60,Loaded nachos r the best steak and chicken r g...,"[Loaded, nachos, r, the, best, steak, and, chi..."


In [55]:
df_sreview = df_final.drop(['stars','review_count','text'],axis=1)

In [56]:
df_sreview.head()

Unnamed: 0,name,bag_of_words
0,BoBo's BBQ,"[Keep, driving., Three, better, BBQ, places, w..."
1,BoBo's BBQ,"[Great, food,, down, to, business, management,..."
2,BoBo's BBQ,"[Yelp, has, their, hours, as, open, till, 10, ..."
3,BoBo's BBQ,"[Good, food,, good, prices,, and, nice, staff.]"
4,BoBo's BBQ,"[Not, feeling, the, political, post, at, the, ..."


In [57]:
df_sreview['liststring'] = [','.join(map(str, l)) for l in df_sreview['bag_of_words']]

In [58]:
df_sreview.rename(columns={'bag_of_words':'bag_of_words_list','liststring':'bag_of_words'},inplace=True)

In [59]:
df_sreview = df_sreview.set_index('name')

In [60]:
df_sreview.shape

(2733, 2)

**We convert each tip into a vector of word counts using `CountVectorizer`.**
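As a small illustration of what the vectorizer produces, the sketch below counts the words of one tip that appears in the data above, once as raw text and once after the split-and-join-with-commas step used in this notebook. With the default settings, `CountVectorizer` lowercases the text and treats both spaces and commas as separators, so the two versions yield the same counts. The variable names are chosen for this example only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

raw_tip = "Good food, good prices, and nice staff."
# Same preprocessing as the notebook: split on whitespace, rejoin with commas.
joined_tip = ",".join(raw_tip.split())

vec = CountVectorizer()
counts = vec.fit_transform([raw_tip, joined_tip])

# vocabulary_ maps each word to its column; sort words by column position.
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# One row per text, one column per vocabulary word.
table = pd.DataFrame(counts.toarray(), columns=words, index=["raw", "joined"])
print(table)
```

Both rows are identical, which suggests the raw `text` column could be passed to the vectorizer directly; the extra split and join steps do not change the counts under the default tokenizer.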

In [61]:
count = CountVectorizer()

In [62]:
count_matrix = count.fit_transform(df_sreview['bag_of_words'])
indices = pd.Series(df_sreview.index)

In [63]:
indices

0                    BoBo's BBQ
1                    BoBo's BBQ
2                    BoBo's BBQ
3                    BoBo's BBQ
4                    BoBo's BBQ
                 ...           
2728                  Taco Bell
2729                  Taco Bell
2730    Blaze Fast-Fire'd Pizza
2731    Blaze Fast-Fire'd Pizza
2732    Blaze Fast-Fire'd Pizza
Name: name, Length: 2733, dtype: object

In [64]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.04166667, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.03636965, 0.        ,
        0.        ],
       ...,
       [0.        , 0.04166667, 0.03636965, ..., 1.        , 0.06804138,
        0.08333333],
       [0.        , 0.        , 0.        , ..., 0.06804138, 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.08333333, 0.        ,
        1.        ]])


**Why is cosine similarity considered a good metric for understanding similarity between two texts?**

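Cosine similarity measures the angle between two vectors and ignores their lengths, so a long tip and a short tip that use the same words in similar proportions still score as similar. The score is large when the vectors point in the same direction. In embedding models such as word2vec, training places semantically related words in roughly the same direction in the high-dimensional vector space, so direction can be read as meaning. Our vectors are word counts from `CountVectorizer`, not word2vec embeddings, so here direction reflects which words a tip uses, and cosine similarity rewards shared vocabulary.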
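For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). The short sketch below, using two arbitrary example vectors, computes it by hand, checks it against scikit-learn, and shows that scaling a vector leaves the score unchanged.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Arbitrary example vectors (think of them as word counts for two tips).
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

by_hand = cosine(a, b)                          # 4 / sqrt(30)
scaled = cosine(a, 3 * b)                       # tripling b changes only its length
from_sklearn = cosine_similarity([a], [b])[0, 0]

print(by_hand, scaled, from_sklearn)
```

All three values agree, which is the length-invariance property that makes the measure convenient for tips of different lengths.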

In [95]:
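def recommendation(name, cosine_sim=cosine_sim):
    # Row position of the first tip written for this restaurant.
    idx = indices[indices == name].index[0]
    # Similarity of that tip to every tip, highest first.
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # Skip position 0, which is normally the tip itself (similarity 1.0),
    # and take the next 10 tips.
    top_10_indexes = list(score_series.iloc[1:11].index)
    # Map tip positions back to restaurant names; a name can repeat
    # because every tip is a separate row.
    return [indices[i] for i in top_10_indexes]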
    

## Recommendations
### Recommending restaurants based on the cosine similarity of their tips (short texts)

In [96]:
recommendation('Cactus Grill')

['Ming Garden',
 'Sonic Drive-In',
 'Kofusion',
 'Courier Cafe',
 'Black Dog Smoke & Ale House',
 'Black Dog Smoke & Ale House',
 'Kofusion',
 "J Gumbo's",
 'Maize Mexican Grill',
 'Black Dog Smoke & Ale House']

In [97]:
recommendation('Taco Bell')

['Kofusion',
 'Maize Mexican Grill',
 'Fresh International Market',
 'Sonic Drive-In',
 'Old Chicago',
 "O'Charley's Restaurant & Bar",
 "Chili's",
 'Fat Sandwich Company',
 'Penn Station',
 'Texas Roadhouse']

## Limitations

* **In some cases the recommender returns the same restaurant several times. A restaurant usually has multiple tips, each tip is a separate row in the similarity matrix, and the recommender treats each of those rows as a separate candidate.**


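* **Cosine similarity on raw word counts has drawbacks of its own: it rewards shared words only, so very common words such as "the" and "and" inflate scores, and tips that express the same idea in different words score as dissimilar.**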
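One possible way to address the repeated recommendations, sketched here under assumptions and not part of the model above, is to merge all tips of a restaurant into a single document before vectorizing, so that each restaurant occupies exactly one row of the similarity matrix. The data and the helper `recommend_unique` below are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the name/text columns (values invented).
toy = pd.DataFrame({
    "name": ["Taco Hut", "Taco Hut", "Burrito Bar", "Noodle Nook"],
    "text": [
        "great tacos",
        "fast friendly service",
        "great burritos and tacos",
        "tasty ramen",
    ],
})

# One document per restaurant: concatenate all of its tips.
docs = toy.groupby("name")["text"].apply(" ".join)

# Restaurant-by-restaurant similarity, labelled with restaurant names.
sim = pd.DataFrame(
    cosine_similarity(CountVectorizer().fit_transform(docs)),
    index=docs.index,
    columns=docs.index,
)


def recommend_unique(name, k=10):
    """Top-k other restaurants, each appearing at most once."""
    scores = sim[name].drop(name)  # never recommend the query itself
    return scores.sort_values(ascending=False).head(k).index.tolist()


print(recommend_unique("Taco Hut", k=2))
```

Grouping by `name` merges different branches of a chain into one entry; grouping by `business_id` instead would keep locations separate, at the cost of needing a lookup from id to name.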

## Conclusion and Future Work

**We recommended restaurants using user ratings in our base model and short reviews (tips) in the improvement model.
However, both models have the limitations listed above, which can be addressed as part of future work.**

**Future work for this model would include adding more features, such as:**
* Location-based features (latitude and longitude)
* Restaurant categories
* Restaurant attributes (valet availability, whether pets are allowed, etc.)
