# **Restaurant Recommender System**

##### **Group 1 (i.e. Foodies) members**:
<ul type='square'> 
    <li> David Mwiti</li>
    <li> Karen Amanya</li>
    <li> Mercy Onduso </li>
    <li> Nicholus Magak </li>
    <li> Penina Wanyama </li>
    <li> Stephen Thuo </li>
</ul>

# Business Understanding

## **Overview**

### **Problem Statement**
The current lack of personalized recommendations based on explicit feedback on restaurant recommendation platforms is resulting in decreased visits and revenue. Traditional approaches based on cuisine or ratings are no longer sufficient as customers prefer personal recommendations from friends and influencers, as they provide more informative and reliable reviews. 

To address this challenge, a content recommender system that incorporates explicit feedback and relevant keywords can provide a more personalized experience and deliver relevant recommendations, ultimately improving website traffic and revenue for restaurant platforms. This presents a unique opportunity for platforms like EatOut to gain a competitive advantage by enhancing their recommendations to enable users to easily find restaurants and hotels based on their desired experience, rather than just cuisine or ratings.


### **Objectives**

> **General Objective:**

The general objective of this project is to develop a personalized recommender system that analyzes customer preferences and recommends the most suitable restaurants on EatOut, with the goal of increasing customer satisfaction, engagement rates, and revenue.


> **Specific Objectives:**

1. Develop a user-friendly interface that allows customers to easily input their preferences for cuisine type, price range, and location.
2. Implement a collaborative filtering algorithm that can analyze customer data and generate personalized restaurant recommendations based on their preferences.
3. Integrate the recommender system with the restaurant platform and display personalized recommendations to customers as part of their overall platform experience.

> **Research Questions:**

1. What data sources should be used to train the recommendation engine, and how can this data be effectively processed and analyzed to identify patterns in customer preferences?
2. What features should be included in the user interface to allow customers to input their preferences easily and efficiently?
3. What algorithms should be implemented on customer data to generate personalized restaurant recommendations based on their preferences?
4. What are the technical requirements for integrating the recommender system with the restaurant platform, and how can the system be seamlessly integrated into the overall platform experience for customers?


### **Success Criteria**

As the aim of the project is to add a personalized experience to the website, the project’s success will be measured by its ability to provide fast and relevant recommendations based on the keywords in a person’s search.

The success criteria depend on the predictive accuracy of the recommendations. This means we will measure how close the estimated ratings are to the genuine user ratings, an approach suited to evaluating non-binary ratings (e.g. the 1-5 scale used here). Since relevant restaurant recommendations are crucial for a platform that is in business, this is the metric we decided to use.

The two metrics that we will use are Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), because the rating scale is the same throughout.
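
As a minimal sketch of how the two metrics relate, the snippet below computes both for a handful of made-up ratings. The rating values and the helper names `mse` and `rmse` are illustrative assumptions, not results or code from this project.

```python
import math

def mse(true_ratings, predicted_ratings):
    """Mean of the squared differences between true and predicted ratings."""
    errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return sum(errors) / len(errors)

def rmse(true_ratings, predicted_ratings):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mse(true_ratings, predicted_ratings))

# Hypothetical ratings on a 1-5 scale
true_ratings = [5, 3, 4, 1]
predicted_ratings = [4, 3, 5, 3]

print(mse(true_ratings, predicted_ratings))   # 1.5
print(rmse(true_ratings, predicted_ratings))  # roughly 1.22
```

Because RMSE is in rating units, an RMSE of about 1.2 can be read as "predictions are typically off by a little over one star".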

## **Importing the required libraries**

In [1]:
#install required libraries
# ! pip install dataprep
# ! pip install pycountry
# ! pip install surprise
# ! pip install sidetable

In [2]:
import pandas as pd
import numpy as np
import string

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set_context('notebook')

from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob

import folium

from ast import literal_eval
from dataprep.clean import clean_country #pip install dataprep
import pycountry #conda install -c conda-forge pycountry
from surprise import Reader, Dataset, SVD, accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
import sidetable
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import operator


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# download nltk packages
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

import warnings
warnings.filterwarnings("ignore")


## **Loading the Data**

### **Restaurants_df**

In [None]:
restaurant_df = pd.read_csv('https://raw.githubusercontent.com/ThuoM/Restaurant-Recommender-System/base_modeling/restaurant_data.csv', encoding='utf-8', on_bad_lines='skip', low_memory=False)
restaurant_df.head()

### **Users_df**

In [None]:
data_url = 'https://raw.githubusercontent.com/ThuoM/Restaurant-Recommender-System/main/final_revs.csv'
user_revs_df = pd.read_csv(data_url)
columns=['User_Name','Account','Location','Date_of_review','Rating','Comment','URL']
user_revs_df.columns = columns
user_revs_df.head()

# **Data Understanding**

The data used in this project was obtained by scraping Yelp. It is meant to serve as a mockup of how EatOut's data could look if our recommender is accepted.

The data consists of two files:

* **_restaurant_data.csv_**

Contains the restaurants we want to recommend. Notable features in the dataset are the restaurant name, average rating, pricing range, and cuisine. The restaurants are in New York, so there is also a location field with each restaurant's individual location.

* **_final_revs.csv_**

Contains user reviews of the individual restaurants. Users are identified by their account links. Other features include the username, date of review, and individual rating. We were also able to acquire the users' comments, which can be used to give the restaurants more context.


##### _**1. restaurants_df**_

In [None]:
restaurant_df.info()

# Observations:
#   16 columns & 500 rows
#   It has some missing phone & display phone numbers
#   Missing quite a bit of data on the pricing section

In [None]:
# Converting the cuisine text to a more human readable format
restaurant_df['Cuisine'] = restaurant_df['Cuisine'].map(lambda x: x.replace(',',''))
restaurant_df['Cuisine'] = restaurant_df['Cuisine'].apply(lambda x: x.replace(" ", ""))

In [None]:
# viewing the types of cuisines 
set(restaurant_df['Cuisine'])

restaurant_df['Cuisine'].head(10)

In [None]:
# viewing the number of unique restaurants
len(restaurant_df['Restaurant ID'].unique())

In [None]:
# Types of transactions occurring in the restaurants
set(restaurant_df['Transactions'])

In [None]:
# Preview the locations
restaurant_df['Location'].head(10)

In [None]:
# viewing the rating scale of the restaurants
set(restaurant_df['Rating'])

In [None]:
fig, ax = plt.subplots(figsize=(11,8))
sns.histplot(restaurant_df['Rating'], bins = np.arange(8) - 0.5, ax=ax)
ax.set_xticks(range(1,6))
ax.set_xlabel('Ratings')
ax.set_ylabel('Number of Restaurants')
ax.set_title('Restaurant Ratings');

rating_count = restaurant_df['Rating'].value_counts().sort_index()
# Annotate each rating value with its share of restaurants
for rating, val in rating_count.items():
    ax.text(rating, val+5, f'{round((val/rating_count.sum())*100, 1)}%', ha='center', va='bottom', size=12)

# Observations:
# Restaurant ratings are on Yelp's 1 to 5 scale

##### _**2. users_df**_

In [None]:
user_revs_df.info()

In [None]:
# Checking how many unique users rated
len(user_revs_df['Account'].unique())

In [None]:
user_revs_df.describe()

In [None]:
# # List of locations to users majority
# list_of_locs = set(user_revs_df['Location'])
# dic_loc = {}

# for location in list_of_locs:
#     num = len(user_revs_df[user_revs_df['Location'] == location])
#     dic_loc[location] = num
# dic_loc

**Checking the trend in the number of restaurant ratings over time**

In [None]:
dic_years = {}
for years in user_revs_df['Date_of_review']:
    if int(years[-4:]) in dic_years.keys():
        dic_years[int(years[-4:])] += 1
    else:
        dic_years[int(years[-4:])] = 1
list(dic_years.keys())

In [None]:
# years and comments trend made
myKeys = list(dic_years.keys())
myKeys.sort()

X = myKeys
y = []
for i in myKeys:
    y.append(dic_years[i]) 

year_plot = pd.DataFrame()
year_plot['years'] = X
year_plot['numbers'] = y

plt.figure(figsize=(10, 6))
ax = sns.lineplot(x='years', y='numbers', data=year_plot, color='teal')

ax.set(xlabel='Years', ylabel='Number of reviews', 
       title='Growth in the number of reviews per year')
plt.show()
# Observations:
    # Over time, there has been an increasing trend in people embracing rating restaurants on Yelp as a means of sharing their experiences and opinions with others.

Conclusion: The number of reviews retrieved per year has grown exponentially between 2005 and 2023, except in 2020, when there was a lockdown. The 2023 figure covers only the first quarter of the year, which is when we scraped the data.

### Merging the 2 Datasets

In [None]:
#merge the two datasets
user_restaurant_df = pd.merge(user_revs_df, restaurant_df, on='URL', how='left')
user_restaurant_df.head(3)

In [None]:
user_restaurant_df.columns

The 'Account' column is the unique identifier for the users. 'Location_x' is the location of the restaurant where the users dined. The 'Date_of_review' column is when the user reviewed the restaurant. 'Rating_x' is the rating given by the user. 'Comment' holds the user's comments. 'URL' is the unique identifier of the restaurant. 'Rating_y' is the average rating the restaurant received from its users. 'Pricing' identifies the restaurant's price level (high-end, middle-class, affordable, cheap). 'Transactions' specifies which services the restaurant offers. 'Location_y' is the restaurant's location. The 'Reviews' column holds the reviews the restaurant received.
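
The `_x` and `_y` suffixes come from pandas itself: when both tables share a column name that is not the merge key, the left table's copy gets `_x` and the right table's copy gets `_y`. The sketch below reproduces this on two tiny, made-up tables (all values are hypothetical), and also shows why a left merge leaves gaps for reviews whose restaurant is absent from the restaurant table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
reviews = pd.DataFrame({
    'Account': ['user_a', 'user_b'],
    'Location': ['Queens, NY', 'Austin, TX'],
    'Rating': [5, 3],
    'URL': ['rest-1', 'rest-2'],
})
restaurants = pd.DataFrame({
    'URL': ['rest-1'],
    'Location': ['12 Main St, Brooklyn, NY'],
    'Rating': [4.5],
})

# A left merge keeps every review; overlapping column names get _x / _y suffixes
merged = pd.merge(reviews, restaurants, on='URL', how='left')
print(merged.columns.tolist())

# Reviews with no matching restaurant end up with missing restaurant fields
print(merged['Rating_y'].isna().sum())
```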

In [None]:
# renaming columns
user_restaurant_df.rename(columns={'Account': 'user_ID', 'Rating_x': 'user_rating', 'Rating_y': 'avg_restaurant_rating',
                                    'Cuisine': 'cuisine', 'Pricing': 'pricing', 
                                    'URL': 'url', 'Location_x': 'location', 
                                    'Restaurant ID': 'id', 'Date_of_review': 'review_date', 
                                    'Transactions': 'transactions', 'Number of Reviews':'number_of_reviews',
                                   'Comment': 'comments', 'Reviews': 'reviews'}, inplace=True)

## Data Cleaning

In [None]:
#Drop unnecessary columns
unnecessary_columns = ['User_Name', 'Name', 'Phone', 'Display Phone', 'Distance', 'Location_y',
                       'Review Count']
user_restaurant_df = user_restaurant_df.drop(unnecessary_columns, axis =1)
user_restaurant_df.head(3)

### **Handling Missing Values**

In [None]:
user_restaurant_df.stb.missing()

From the table above, we can see the missing values in each column: pricing is missing in 27.97% of rows, while transactions, reviews, avg_restaurant_rating, cuisine and id are each missing in 20.56%. The pricing column is categorical, so its missing values can be replaced by the mode. The other columns with missing values can either be dropped or filled in.
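
As a small illustration of mode imputation on a categorical column, the sketch below fills the gaps in a made-up pricing series. The data is hypothetical; only the `mode()` and `fillna()` pattern mirrors what the notebook does.

```python
import pandas as pd

# Hypothetical pricing column with gaps
pricing = pd.Series(['$$', '$', None, '$$', None, '$$$'])

# mode() returns a Series (there can be ties), so take the first entry
most_common = pricing.mode()[0]
filled = pricing.fillna(most_common)

print(most_common)
print(filled.tolist())
```

Mode imputation keeps the column categorical, but it also inflates the share of the most common class, which is worth keeping in mind when a large fraction of the column is missing.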

In [None]:
# Since pricing is categorical data, we can use the mode
user_restaurant_df['pricing'] = user_restaurant_df['pricing'].fillna(user_restaurant_df['pricing'].mode()[0])
# Missing transactions filled with the mode
user_restaurant_df['transactions'] = user_restaurant_df['transactions'].fillna(user_restaurant_df['transactions'].mode()[0])
# Missing cuisines filled with the placeholder string 'random'
user_restaurant_df['cuisine'] = user_restaurant_df['cuisine'].fillna('random')
# Missing average rating filled with the mode
user_restaurant_df['avg_restaurant_rating'] = user_restaurant_df['avg_restaurant_rating'].fillna(user_restaurant_df['avg_restaurant_rating'].mode()[0])
# Missing review filled with no reviews
user_restaurant_df['reviews'] = user_restaurant_df['reviews'].fillna('no reviews')
# Missing id filled with no identifier
user_restaurant_df['id'] = user_restaurant_df['id'].fillna('no identifier')

In [None]:
user_restaurant_df.stb.missing()

Encoding the pricing column

In [None]:
user_restaurant_df['pricing'].nunique()

In [None]:
# from sklearn.preprocessing import OrdinalEncoder

# # create an instance of the encoder
# encoder = OrdinalEncoder(categories=[['$', '$$', '$$$', '$$$$']])

# # fit and transform the pricing column
# encoded_pricing = encoder.fit_transform(user_restaurant_df['pricing'].values.reshape(-1, 1))

# # replace the original pricing column with the encoded values
# user_restaurant_df['pricing'] = encoded_pricing.astype(int)
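
The encoder above is commented out, but the idea behind it can be sketched without scikit-learn: because the price symbols have a natural order, each one can be mapped to its rank. The names `PRICE_ORDER` and `price_to_rank` and the sample values are assumptions made for this illustration.

```python
# Explicit order from cheapest to most expensive
PRICE_ORDER = ['$', '$$', '$$$', '$$$$']
price_to_rank = {symbol: rank for rank, symbol in enumerate(PRICE_ORDER)}

# Hypothetical sample of pricing values
sample = ['$$', '$', '$$$$', '$$']
encoded = [price_to_rank[symbol] for symbol in sample]
print(encoded)  # [1, 0, 3, 1]
```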

In [None]:
user_restaurant_df['pricing'].head()

In [None]:
user_restaurant_df.dtypes

# **Exploratory Data Analysis**

## **Data transformation**

### **Preprocess text data**

In [None]:
# Converting the columns to lower case
user_restaurant_df['url'] = user_restaurant_df['url'].str.lower()
user_restaurant_df['location'] = user_restaurant_df['location'].str.lower()
user_restaurant_df['cuisine'] = user_restaurant_df['cuisine'].str.lower()
user_restaurant_df['comments'] = user_restaurant_df['comments'].str.lower()

user_restaurant_df.sample(5)

In [None]:
# converts the strings to python list
user_restaurant_df['transactions'] = user_restaurant_df['transactions'].apply(lambda x: literal_eval(x))

# joins the created python lists together
user_restaurant_df['transactions'] = user_restaurant_df['transactions'].apply(lambda x: ', '.join(x))

In [None]:
user_restaurant_df['transactions'].sample(10)

<p> We chose to use the NLTK WordNet lemmatizer and to remove English stop words and non-alphabetic tokens from the comments, cuisine and transactions attributes. </p>

In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def process_sentences(text):
    temp_sent =[]

    # Tokenize words
    words = nltk.word_tokenize(text)

    # Lemmatize each of the words based on their position in the sentence
    tags = nltk.pos_tag(words)
    for i, word in enumerate(words):
        # only verbs
        if tags[i][1] in ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'):
            lemmatized = lemmatizer.lemmatize(word, 'v')
        else:
            lemmatized = lemmatizer.lemmatize(word)
        
        # Remove stop words and non alphabet tokens
        if lemmatized not in stop_words and lemmatized.isalpha(): 
            temp_sent.append(lemmatized)

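    # Some other clean-up: expand contraction fragments. Tokens containing an
    # apostrophe are already dropped by the isalpha() filter above, so these
    # replacements are effectively no-ops.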
    full_sentence = ' '.join(temp_sent)
    full_sentence = full_sentence.replace("n't", " not")
    full_sentence = full_sentence.replace("'m", " am")
    full_sentence = full_sentence.replace("'s", " is")
    full_sentence = full_sentence.replace("'re", " are")
    full_sentence = full_sentence.replace("'ll", " will")
    full_sentence = full_sentence.replace("'ve", " have")
    full_sentence = full_sentence.replace("'d", " would")
    return full_sentence
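
To make the effect of this kind of cleaning concrete, here is a deliberately simplified, dependency-free sketch. It is not the NLTK pipeline above: it skips lemmatization, uses a regular expression instead of `word_tokenize`, and relies on a tiny hypothetical stop-word list (`STOP_WORDS`) rather than NLTK's full English list.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {'the', 'a', 'an', 'and', 'was', 'is', 'were', 'it', 'to', 'of'}

def simple_clean(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return ' '.join(token for token in tokens if token not in STOP_WORDS)

print(simple_clean("The ramen was AMAZING, and the staff were friendly!!"))
# ramen amazing staff friendly
```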

In [None]:
# creation of filtered comments
user_restaurant_df['processed_comments'] = user_restaurant_df['comments'].apply(process_sentences)

In [None]:
# creation of filtered cuisines
user_restaurant_df['processed_cuisine'] = user_restaurant_df['cuisine'].apply(process_sentences)

In [None]:
# creation of filtered transactions
user_restaurant_df['processed_transactions'] = user_restaurant_df['transactions'].apply(process_sentences)

In [None]:
# Preview of the processed columns
user_restaurant_df[['processed_comments', 'comments', 'processed_cuisine', 'cuisine', 'processed_transactions', 'transactions']].sample(5)

<p>Finally, let's create a bag of words (the <code>bogs</code> column) by combining our new preprocessed attributes:</p>
<br>

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; <strong>bag of words = processed cuisine + processed comments + processed transactions</strong>

In [None]:
user_restaurant_df['bogs'] = user_restaurant_df['processed_cuisine'] + ' ' + user_restaurant_df['processed_comments'] + ' ' + user_restaurant_df['processed_transactions']
display('A sample of bag of words', user_restaurant_df[['processed_comments', 'processed_cuisine', 'processed_transactions','bogs']].sample(5))

### **Mapping the location of the Restaurants**

In [None]:
user_restaurant_df.shape

In [None]:
reviewed_data = user_restaurant_df[user_restaurant_df['reviews'] != 'no reviews']

reviewed_data.shape

In [None]:
# Create a map centered on a specific location
map_restaurants = folium.Map(location=[40.7128, -74.0060], zoom_start=11, zoom_control=False)

# Add markers for each restaurant using the latitude and longitude data
for index, row in restaurant_df.iterrows():
    name = row['Name']
    latitude = row['Latitude']
    longitude = row['Longitude']
    marker = folium.Marker([latitude, longitude], popup=name)
    marker.add_to(map_restaurants)

# Display the map
map_restaurants


Most restaurants in our data are in New York.

### **Common Words in Reviews**

In [None]:
# Creating a word cloud of the reviews

# concatenate all the reviews into a single string
all_reviews = ' '.join(user_restaurant_df['processed_comments'].values)

# create a WordCloud object
wc = WordCloud(width=800, height=400, background_color='white', max_words=200).generate(all_reviews)

# display the word cloud
plt.figure(figsize=(12,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The most common words have positive connotations, such as good, amazing, delicious and great.

### **Sentiment Analysis of Reviews**

In [None]:
#calculating subjectivity and polarity scores using TextBlob

user_restaurant_df['Subjectivity'] = user_restaurant_df['processed_comments'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
user_restaurant_df['Polarity'] = user_restaurant_df['processed_comments'].apply(lambda x: TextBlob(x).sentiment.polarity)
user_restaurant_df['Review_sentiment'] = user_restaurant_df['Polarity'].apply(lambda x: 'positive' if x > 0 else (
                                'negative' if x < 0 else 'neutral'))

user_restaurant_df.head(3)
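
The labelling rule in the cell above can be read as a three-way threshold on the polarity score, which TextBlob reports in the range -1 to 1. The sketch below isolates that rule in a plain function; the function name and the sample scores are hypothetical.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1, 1] to a sentiment class."""
    if polarity > 0:
        return 'positive'
    if polarity < 0:
        return 'negative'
    return 'neutral'

# Hypothetical polarity scores
scores = [0.8, -0.25, 0.0]
print([label_sentiment(score) for score in scores])
# ['positive', 'negative', 'neutral']
```

With this rule, only a score of exactly 0 counts as neutral, so even faintly positive or negative text is assigned to one of the two other classes.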

In [None]:
# Show the value counts
print(user_restaurant_df['Review_sentiment'].value_counts())

# bar plot showing the sentiment categories
plt.subplots(figsize= (11, 5))
plt.title('Review Sentiment Analysis')
plt.xlabel('Sentiment')
plt.ylabel('Counts')
plt.xticks(rotation=45)
user_restaurant_df['Review_sentiment'].value_counts().plot(kind='bar')
plt.show()

### **Most popular cuisines**

In [None]:
cuisine_processed = restaurant_df['Cuisine'].apply(process_sentences)

# Convert the series into string text
text = ' '.join(cuisine_processed.values)

# create a WordCloud object
wc = WordCloud(width=800, height=400, background_color='white', max_words=200).generate(text)
# display the word cloud
plt.figure(figsize=(12,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Italian, Japanese, Mexican, American (New) and Korean are the most popular cuisine styles.

### **Top 10 Most reviewed restaurants**

In [None]:
restaurants_review_df = restaurant_df[['Name','Number of Reviews']]
#get the top10 restaurants
top10 = restaurants_review_df.nlargest(10, 'Number of Reviews')
# create a barplot of the top 10 restaurants
plt.figure(figsize=(11,6))
sns.barplot(data=top10, x='Name', y='Number of Reviews')
plt.title("Top 10 Restaurants by Review Counts")
plt.xlabel("Restaurant Name")
plt.ylabel("Review Counts")
plt.xticks(rotation=45)
plt.show()

The review counts of the ten most-reviewed restaurants range from about 14,000 down to about 2,500.

### **Restaurant's Ratings over Time**

In [None]:
# Extract the years from the date column and create a new column with them
user_revs_df['Year'] = pd.DatetimeIndex(user_revs_df['Date_of_review']).year

# Plot a histogram of the years using seaborn.histplot
plt.figure(figsize=(11,6))
sns.histplot(data=user_revs_df, x='Year', bins=10)
plt.title("Ratings Over Time")
plt.xlabel("Year")
plt.ylabel("Review Counts")
plt.xticks(rotation=45)
plt.show();

### **Ratings Distribution**

In [None]:
# Checking the distribution of restaurant ratings
plt.figure(figsize=(11,6))
sns.histplot(data=user_revs_df, x='Rating', bins=10)
plt.title('Distribution of restaurant ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

**Correlation Matrix**

In [None]:
plt.figure(figsize=(11,6))
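# Correlation is only defined for numeric columns, so select those first
corr_matrix = user_restaurant_df.select_dtypes(include='number').corr()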
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True)
plt.title('Correlation matrix')
plt.show()

From the correlation matrix we see that most of our features have low collinearity, since their Pearson correlation coefficients are close to 0, lying between -0.3 and 0.3. This means that there is little or no linear relationship between the features. For this reason, we will retain all the features, since each carries unique information for modeling.
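
For reference, the Pearson coefficient shown in each heatmap cell can be computed by hand as the covariance of two columns divided by the product of their standard deviations. The sketch below does this with plain Python on made-up numbers; the `pearson` helper and the sample lists are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature values
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]   # exactly linear in a
c = [2, 1, 3, 1, 2]    # no linear trend with a

print(pearson(a, b))            # 1.0
print(round(pearson(a, c), 6))  # approximately 0
```

A coefficient near 0 only rules out a linear relationship; two features can still be related in a non-linear way.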

In [None]:
# create a scatter plot with number of reviews on the x-axis and user rating on the y-axis
plt.figure(figsize=(11,6))
plt.scatter(user_restaurant_df['number_of_reviews'], user_restaurant_df['user_rating'])

# set the axis labels and title
plt.xlabel('Number of Reviews')
plt.ylabel('User Rating')
plt.title('Relationship between Number of Reviews and User Rating')

# show the plot
plt.show()

In [None]:
# group the data by restaurant ID and count the number of comments for each restaurant
plt.figure(figsize=(11,6))
comments_by_restaurant = user_restaurant_df.groupby('url')['comments'].count()

# create a histogram with the number of comments on the x-axis and frequency on the y-axis
plt.hist(comments_by_restaurant, bins=20)

# set the axis labels and title
plt.xlabel('Number of Comments')
plt.ylabel('Frequency')
plt.title('Distribution of Comments by Restaurant')

# show the plot
plt.show()

In [None]:
# create a box plot of user rating by pricing
# plt.figure(figsize=(11,6))
# user_restaurant_df.boxplot(column='user_rating', by='pricing')
# plt.xlabel('Pricing')
# plt.ylabel('User Rating')
# plt.title('Relationship between Pricing and User Rating')
# plt.show()

In [None]:
# plt.scatter(user_restaurant_df['pricing'], user_restaurant_df['number_of_reviews'])
# plt.xlabel('Pricing')
# plt.ylabel('Number of Reviews')
# plt.title('Relationship between Pricing and Number of Reviews')
# plt.show()

# **Data Preparation**

### **Data Reduction**

<ul>
    <li> Remove attributes that we don't need for content </li>
    <li> Rename attributes for better convention </li>
</ul>
    

In [None]:
# viewing the restaurant data set before cleaning occurs
restaurant_df.head(3)

In [None]:
# giving the restaurants a numerical id
restaurant_df['Rest_num_id'] = pd.factorize(restaurant_df['URL'])[0]

In [None]:
# Pick out the number of restaurants already scraped in the users_revs side

filtered_restaurant_df = restaurant_df[restaurant_df['URL'].isin(user_revs_df['URL'].unique())]
len(filtered_restaurant_df)

In [None]:
# user_revs_df['Account_2'] = pd.factorize(user_revs_df['Account'])[0]
# user_revs_df.head()

In [None]:
# 
# url_ids = restaurant_df[['Restaurant ID','URL']][restaurant_df['URL'].isin(user_revs_df['URL'].unique())]
# url_ids
# user_revs_df = pd.merge(user_revs_df, url_ids, on="URL", how='left')
# user_revs_df.head()

In [None]:
# filtered_users_df = user_revs_df.copy()

# # removing unnecessary columns
# filtered_users_df.drop(columns=['User_Name', 'Location', 'Date_of_review', 'Comment', 'Account', 'URL'], inplace=True)

# # renaming the remaining columns appropriately
# filtered_users_df.rename(columns={'Account_2': 'user-id', 'Restaurant ID': 'restaurant-id',
#                                  'Rating': 'rating'}, inplace=True)

# display('After reduction on user df', filtered_users_df)


### **Data transformation**

In [None]:
user_restaurant_df.info()

<p>Time to deal with the price column and convert the signs to meaningful values.
<br>Listing all the possible values of price in the dataset first</p>

<br>

* All the values in the user df are in the correct format for now


In [None]:
user_restaurant_df['pricing'].unique()

<p>There are 4 ranges so far:</p>

<ul>
<li> Cheap Eats (1 sign) </li>
<li> Affordable (2 signs) </li>
<li> Mid-range (3 signs) </li>
<li> Pricey Dining (4 signs) </li>
</ul>

In [None]:
# replacing the signs with named price classes, from cheapest ('$') to most expensive ('$$$$')
user_restaurant_df['pricing'] = user_restaurant_df['pricing'].replace(
    {'$': 'low', '$$': 'popular-eats', '$$$': 'mid-range', '$$$$': 'pricey-dining'})

In [None]:
constituents_list = list(set(user_restaurant_df['location'].str.split(',').str[-2].str.strip(string.punctuation)))

In [None]:
# viewing the locations restaurants are commonly located
constituents_list
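
The expression used to build `constituents_list` takes the second-to-last comma-separated part of each address. A minimal sketch of the same idea on plain strings follows; the addresses are made up, and unlike the notebook's expression this helper (`extract_city`, a hypothetical name) also strips surrounding whitespace and guards against addresses with no comma.

```python
import string

def extract_city(address):
    """Return the second-to-last comma-separated part of an address."""
    parts = address.split(',')
    if len(parts) < 2:
        return None  # no comma, so there is no city component to take
    return parts[-2].strip(string.punctuation + ' ')

# Hypothetical addresses in the lower-cased form used by the notebook
print(extract_city('123 bedford ave, brooklyn, ny 11211'))  # brooklyn
print(extract_city('new york'))                             # None
```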

In [None]:
user_restaurant_df['location']

In [None]:
# Need to work on locations
# filtered_restaurant_df['location'] = filtered_restaurant_df['location'].str.split(',').str[-2].str.strip(string.punctuation)
# display('All changes done before preprocessing on text data', filtered_restaurant_df.head(3))

In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:
user_restaurant_df[['processed_comments', 'comments', 'processed_cuisine', 'cuisine', 'processed_transactions', 'transactions']].sample(5)

In [None]:
price_map = {
    'low':('everybody', 'no-expense', 'accomodating', 'inexpensive', 'cheap', 'ample', 'rock-bottom'), 
    'popular-eats': ('low-price', 'low-cost', 'economical', 'economic', 'modest'),
    'mid-range': ('moderate', 'fair', 'mid-price', 'reasonable', 'average'),
    'pricey-dining': ('expensive', 'fancy', 'lavish', 'fine', 'extravagant')
}
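
The map above is used later to turn a free-text request into a price class by checking whether any keyword occurs in the request. The sketch below shows that lookup in isolation, with an abbreviated, hypothetical map (`keyword_map`) and a hypothetical helper name (`detect_price_class`).

```python
# Abbreviated, hypothetical keyword map in the same shape as price_map
keyword_map = {
    'low': ('cheap', 'inexpensive'),
    'mid-range': ('moderate', 'reasonable'),
    'pricey-dining': ('expensive', 'fancy'),
}

def detect_price_class(description, mapping):
    """Return the first price class whose keywords appear in the description."""
    text = description.lower()
    for price_class, keywords in mapping.items():
        if any(keyword in text for keyword in keywords):
            return price_class
    return None

print(detect_price_class('a reasonable breakfast in Brooklyn', keyword_map))  # mid-range
print(detect_price_class('for chinese food', keyword_map))                    # None
```

Note that this is substring matching, so a keyword such as "fair" would also match "fairly". Matching on whole tokens would avoid that, at the cost of missing hyphenated variants.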

# **Modeling**

## **Content Based Recommendation**

In [None]:
def contentB_recommend(description):
    # Convert user input to lowercase
    description = description.lower()

    data = user_restaurant_df.copy()

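    # Extract cities: look for each known city name inside the description
    constituents_input = []
    for city in constituents_list:
        # skip missing (NaN) or empty entries in the list
        if isinstance(city, str) and city.strip() and city in description:
            constituents_input.append(city)
            description = description.replace(city, "")

    if constituents_input:
        # compare against the same address component used to build constituents_list
        cities = data['location'].str.split(',').str[-2].str.strip(string.punctuation)
        data = data[cities.isin(constituents_input)].copy()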

    # Extract price class
    for key, value in price_map.items():
        if any(v in description for v in value):
            data = data[data['pricing'] == key]
            break
    
    # Process user description text input 
    description = process_sentences(description)
    description = description.strip()
    print('Processed user feedback:', description)

    # Init a TF-IDF vectorizer
    tfidfvec = TfidfVectorizer()

    # Fit data on processed reviews
    vec = tfidfvec.fit(data["bogs"])
    features = vec.transform(data["bogs"])

    # Transform user input data based on fitted model
    description_vector =  vec.transform([description])

    # Calculate cosine similarities between users processed input and reviews
    cos_sim = linear_kernel(description_vector, features)

    # Add similarities to data frame
    data['similarity'] = cos_sim[0]

    # Sort data frame by similarities
    data.sort_values(by='similarity', ascending=False, inplace=True)

    return data[['url', 'avg_restaurant_rating', 'location', 'pricing', 'cuisine', 'transactions', 'comments', 'similarity']]
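
The core of the function is the TF-IDF and similarity step. The sketch below runs that step alone on three made-up bag-of-words strings, so the ranking is easy to follow. The documents and the query are hypothetical; the scikit-learn calls are the same ones the notebook imports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical bag-of-words strings, one per review
docs = [
    'korean bbq delivery great kimchi',
    'italian pasta pizza cozy',
    'chinese dumpling noodle delivery',
]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)  # rows are L2-normalised by default
query_vector = vectorizer.transform(['korean delivery'])

# With L2-normalised rows, the dot product equals the cosine similarity
similarities = linear_kernel(query_vector, features)[0]
best = similarities.argmax()
print(docs[best])  # the Korean review shares both query terms
```

Using `linear_kernel` instead of `cosine_similarity` is a common shortcut here: it skips the normalisation that TF-IDF has already applied.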

In [None]:
user_restaurant_df.columns

In [None]:
# specified cuisine
contentB_recommend('for chinese food')

In [None]:
# with price class and location
contentB_recommend('a reasonable breakfast in Brooklyn')

In [None]:
# with transaction and price class and location
contentB_recommend('a delivery shop for burgers in New york')

## **Collaborative Filtering recommendation Model**

* We will not be implementing collaborative filtering from scratch. Instead, we will use the Surprise library, which provides powerful algorithms such as **Singular Value Decomposition (SVD)** that minimise RMSE (Root Mean Square Error) and give great recommendations.
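
For intuition, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of learned user and item factor vectors, clipped to the rating scale. The sketch below evaluates that formula for one hypothetical user and restaurant; every number is invented for illustration, and in practice these parameters are learned by `model.fit`.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 5)):
    """Biased matrix-factorisation estimate, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    estimate = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    return min(max(estimate, low), high)

# Hypothetical learned parameters for one user and one restaurant
estimate = predict_rating(3.5, 0.25, -0.5, [0.5, 0.5], [1.0, -0.5])
print(estimate)  # 3.5
```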

In [None]:
# surprise reader
reader = Reader(rating_scale=(1, 5))

In [None]:
user_restaurant_df.columns

In [None]:
data = Dataset.load_from_df(user_restaurant_df[['user_ID', 'id', 'user_rating']], reader)

trainset, testset = train_test_split(data, test_size=0.25)

In [None]:
model = SVD()
# cross_validate(model, trainset, measures=['rmse', 'mae'], cv=5)

In [None]:
# data_train = data.build_full_trainset()
model.fit(trainset)

In [None]:
predictions = model.test(testset)

In [None]:
# Viewing contents of test set
for uid, bid, rating in testset[:5]:
    print(f"User {uid} rated restaurant {bid} with a rating of {rating}")

In [None]:
# Viewing predictions
for prediction in predictions[0:5]:
    print(prediction)

In [None]:
# Print the performance metrics
accuracy.rmse(predictions)

In [None]:
true_ratings = [pred.r_ui for pred in predictions]
est_ratings = [pred.est for pred in predictions]
uids = [pred.uid for pred in predictions]

**Recommending unseen restaurants to the test set**

In [None]:
# Get list of user ids from test set
users = list(set(uids))

In [None]:
# restaurants which the users have not yet evaluated
restaurants = trainset.build_anti_testset()

In [None]:
# using an example of 15 users
for user_id in users[0:15]:
    # (user, restaurant, fill value) triples for restaurants this user has not rated
    rests_unseen = list(filter(lambda x: x[0] == user_id, restaurants))
    
    print(f'User {user_id} has {len(rests_unseen)} restaurants left to rate')
    
    # generate recommendations, sorted by estimated rating (field 3 of a Prediction)
    recommendations = model.test(rests_unseen)
    recommendations.sort(key=operator.itemgetter(3), reverse=True)
    
    print(f"This user {user_id}'s recommendations:")
    # viewing 3 recommendations if available
    for reco in recommendations[0:3]:
        # look up the restaurant's url from its id
        name = user_restaurant_df.loc[user_restaurant_df['id'] == reco[1], 'url'].iloc[0]
        print(f'Restaurant {name} with estimated rating {reco[3]}')
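
The loop above filters the full list of unseen pairs once per user. An alternative sketch, assuming predictions are available as plain `(user, item, estimate)` triples, groups them in a single pass and keeps the best few per user. The helper name `top_n` and the sample triples are hypothetical.

```python
from collections import defaultdict

def top_n(predictions, n=3):
    """Group (user, item, estimate) triples by user and keep the n best."""
    per_user = defaultdict(list)
    for user_id, item_id, estimate in predictions:
        per_user[user_id].append((item_id, estimate))
    return {
        user_id: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, items in per_user.items()
    }

# Hypothetical predictions for restaurants the users have not rated yet
preds = [
    ('u1', 'r1', 4.2), ('u1', 'r2', 4.8), ('u1', 'r3', 3.1), ('u1', 'r4', 4.5),
    ('u2', 'r1', 2.0),
]
print(top_n(preds)['u1'])  # [('r2', 4.8), ('r4', 4.5), ('r1', 4.2)]
```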

## **Hybrid Recommendation Model**

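* In this section, we try to build a simple hybrid recommender that brings together the techniques implemented in the content-based and collaborative filtering engines. It works as follows: the content-based step selects candidate restaurants from the user's description, and the SVD model then re-ranks them by the rating it predicts for that user.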

In [None]:
id_map = pd.read_csv('https://raw.githubusercontent.com/ThuoM/Restaurant-Recommender-System/base_modeling/restaurant_data.csv')
id_map.drop_duplicates(subset=['Restaurant ID'], inplace=True)
len(id_map)

In [None]:
def hybrid_recommender(userId, description):
    # Convert user input to lowercase
    description = description.lower()

    data = user_restaurant_df.copy()

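    # Extract cities: look for each known city name inside the description
    constituents_input = []
    for city in constituents_list:
        # skip missing (NaN) or empty entries in the list
        if isinstance(city, str) and city.strip() and city in description:
            constituents_input.append(city)
            description = description.replace(city, "")

    if constituents_input:
        # compare against the same address component used to build constituents_list
        cities = data['location'].str.split(',').str[-2].str.strip(string.punctuation)
        data = data[cities.isin(constituents_input)].copy()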

    # Extract price class
    for key, value in price_map.items():
        if any(v in description for v in value):
            data = data[data['pricing'] == key]
            break
    
    # Process user description text input 
    description = process_sentences(description)
    description = description.strip()
    print('Processed user feedback:', description)

    # Init a TF-IDF vectorizer
    tfidfvec = TfidfVectorizer()

    # Fit data on processed reviews
    vec = tfidfvec.fit(data["bogs"])
    features = vec.transform(data["bogs"])

    # Transform user input data based on fitted model
    description_vector =  vec.transform([description])

    # Calculate cosine similarities between users processed input and reviews
    cos_sim = linear_kernel(description_vector, features)
    
    # Add similarities to data frame
    data['similarity'] = cos_sim[0]

    # Row positions ordered from most to least similar
    restaurant_indices = cos_sim[0].argsort()[::-1]
    rest = data.iloc[restaurant_indices][['id', 'url', 'avg_restaurant_rating', 'location', 'pricing', 'cuisine', 'comments', 'similarity']].copy()

    # Estimate this user's rating for each candidate with the fitted SVD model
    rest['est'] = rest['id'].apply(lambda x: model.predict(userId, x).est)
    rest = rest.sort_values('est', ascending=False)
    return display('Predicted:', rest.head(10))

In [None]:
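def hybrid_recommender(userId, description):
    # content-based step: candidates ranked by similarity to the description
    rest = contentB_recommend(description)
    # collaborative step: predicted rating of each candidate for this user.
    # The restaurant id is looked up by row index because contentB_recommend
    # does not return it, and the SVD model was trained on the 'id' column.
    rest['est'] = user_restaurant_df.loc[rest.index, 'id'].apply(lambda x: model.predict(userId, x).est)
    rest = rest.sort_values('est', ascending=False)
    print('\n\n\n\n')
    return rest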
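
The two-stage idea can be sketched independently of pandas and Surprise. In the sketch below, `candidates` stands in for the content-based output and `predict` for the fitted model's rating estimate; both are hypothetical stand-ins. The shortlist step (`top_k`) is an assumption added here, since the function above re-ranks every candidate rather than only the most similar ones.

```python
def hybrid_rank(candidates, predict, user_id, top_k=2):
    """Shortlist by content similarity, then re-rank by predicted rating.

    candidates: list of (restaurant_id, similarity) pairs
    predict:    callable (user_id, restaurant_id) -> estimated rating
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    scored = [(rid, predict(user_id, rid)) for rid, _ in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarities and a stand-in for model.predict(...).est
candidates = [('r1', 0.9), ('r2', 0.7), ('r3', 0.1)]
fake_estimates = {'r1': 3.2, 'r2': 4.6, 'r3': 4.9}

ranked = hybrid_rank(candidates, lambda user, rid: fake_estimates[rid], user_id='u1')
print(ranked)  # [('r2', 4.6), ('r1', 3.2)]
```

Shortlisting first keeps a restaurant like `r3`, which has a high predicted rating but barely matches the description, out of the final list.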

In [None]:
user_restaurant_df[user_restaurant_df['user_ID'] == 1]

In [None]:
user_restaurant_df[user_restaurant_df['id'] == 'hdiuRS9sVZSMReZm4oV5SA']

In [None]:
hybrid_recommender(1, 'good korean with delivery')

# Evaluation

# Deployment