### Building a recommendation system with restaurants in London

### Ideas

#### < Maybe find a map dataset or something to get geo data for each restaurant, would be nice to have a cuisine datapoint too

#### < NLP angle with the reviews textual data

#### < Can use date as a factor: the more recent the review, the more weight it should have

#### < What does good look like? Like what is a 'good' recommendation?
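The recency idea above could be sketched as an exponential decay on review age. This is only an illustration, not the method used later in the notebook: the `review_date` column name, the `recency_weight` helper and the one-year half-life are all assumptions, and the three rows are toy data.

```python
import numpy as np
import pandas as pd

def recency_weight(review_dates, as_of, half_life_days=365.0):
    """Hypothetical helper: weight 1.0 for a review written on `as_of`,
    halving for every `half_life_days` of age."""
    age_days = (pd.Timestamp(as_of) - pd.to_datetime(review_dates)).dt.days
    age_days = age_days.clip(lower=0)  # future-dated reviews get full weight
    return np.power(0.5, age_days / half_life_days)

# Toy reviews; `review_date` is an assumed column name
reviews = pd.DataFrame({
    "rating_review": [5.0, 2.0, 4.0],
    "review_date": ["2021-01-01", "2020-01-02", "2019-01-02"],
})
reviews["weight"] = recency_weight(reviews["review_date"], as_of="2021-01-01")

# Recency-weighted mean rating
weighted_mean = np.average(reviews["rating_review"], weights=reviews["weight"])
```

A shorter half-life makes the score react faster to recent reviews, at the cost of discarding more history.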

_______________________________________________________________________

### Factor Release Plan

#### Release 1

##### 1. Initial recommender based on review value and volume
##### 2. Involve date as a factor of review reliability
##### 3. Introduce textual NLP analysis of reviews as another way of weighting
##### 4. Bring in geo data for people to select areas or something?
##### 5. Would be nice to webscrape cuisine as a datapoint too
##### 6. Could also include author id as a factor - potentially someone who posts lots of reviews is more reliable?
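For step 1, one common way to combine review value and volume is a Bayesian (damped) average, which pulls restaurants with few reviews towards the global mean. The sketch below is an assumption about how that step might look, not the notebook's actual scoring: `bayesian_average` and `prior_weight` are hypothetical names, and the data is a toy example using the `restaurant_name` and `rating_review` columns from the dataset.

```python
import pandas as pd

def bayesian_average(df, name_col="restaurant_name", rating_col="rating_review",
                     prior_weight=10):
    """Hypothetical scorer: mean rating shrunk towards the global mean,
    as if every restaurant had `prior_weight` extra average reviews."""
    global_mean = df[rating_col].mean()
    stats = df.groupby(name_col)[rating_col].agg(n="count", mean="mean")
    stats["score"] = (
        (stats["n"] * stats["mean"] + prior_weight * global_mean)
        / (stats["n"] + prior_weight)
    )
    return stats.sort_values("score", ascending=False)

# Toy data: A has a single 5-star review, B has 20 mostly 5-star reviews
toy = pd.DataFrame({
    "restaurant_name": ["A"] + ["B"] * 20 + ["C"] * 20,
    "rating_review": [5.0] + [5.0] * 16 + [4.0] * 4 + [3.0] * 20,
})
ranking = bayesian_average(toy)
```

With these toy numbers B ranks above A even though A has the higher raw mean, because a single review is not enough evidence to outweigh the prior.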

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import dateutil.parser as parser
from datetime import datetime, date, timedelta
import torch
import skorch
import scipy
import torch.nn as nn
import torch.nn.functional as F
import sys
from skorch.helper import DataFrameTransformer
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
from sklearn import metrics
from sklearn.preprocessing import FunctionTransformer
from skorch.callbacks import EarlyStopping
from sklearn.pipeline import Pipeline
from skorch import NeuralNetRegressor
import pickle
import emoji
import requests
from bs4 import BeautifulSoup

In [2]:
initial_df = pd.read_csv('Data/London_reviews.csv')


In [3]:
print(initial_df)

       Unnamed: 0 parse_count       restaurant_name rating_review    sample  \
0               0           1  Cocotte_Notting_Hill           5.0  Positive   
1               1           2  Cocotte_Notting_Hill           5.0  Positive   
2               2           3  Cocotte_Notting_Hill           5.0  Positive   
3               3           4  Cocotte_Notting_Hill           5.0  Positive   
4               4           5  Cocotte_Notting_Hill           5.0  Positive   
...           ...         ...                   ...           ...       ...   
996562     999995      999996       The_Old_Brewery           4.0  Positive   
996563     999996      999997       The_Old_Brewery           2.0  Negative   
996564     999997      999998       The_Old_Brewery           5.0  Positive   
996565     999998      999999       The_Old_Brewery           5.0  Positive   
996566     999999     1000000       The_Old_Brewery           3.0  Negative   

               review_id                           

In [4]:
# dropping some unwanted columns from the initial dataset
df = initial_df.drop(['Unnamed: 0','parse_count'],axis=1)
df = df.dropna(subset=['url_restaurant'])
# making a new name column which removes the underscores
df['restaurant_name_clean'] = df['restaurant_name'].astype(str).str.replace('_', ' ', regex=False)


In [7]:
gmaps1 = pd.read_csv('Data/Restaurant_Location_Details_2.0.csv')
gmaps1 = gmaps1.drop_duplicates(keep='first')

gmaps2 = pd.read_csv('Data/Restaurant_Location_Additional_Details.csv')
gmaps2 = gmaps2.drop_duplicates(keep='first')

gmapsdata = gmaps2.merge(gmaps1, how='left', on='place_id')

gmaps3 = pd.read_csv('Data/Restaurant_Cuisine_and_Loc.csv')
gmaps3 = gmaps3.drop_duplicates(keep='first')

gmapsdata2 = gmaps3.merge(gmapsdata, how='left', left_on='id', right_on='place_id')


In [8]:
df_merged = df.merge(gmapsdata2, how='left', left_on='restaurant_name_clean', right_on='TA_Names')

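# index=False stops the row index being written out and coming back as an 'Unnamed: 0' column on reload
df_merged.to_csv('Data/Merged_TA_Gmaps_Dataset_2.0.csv', index=False)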
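Because the left merge above joins on restaurant names, any review whose cleaned name has no exact match in `TA_Names` silently ends up with empty Google Maps columns. A quick way to measure that is the `indicator=True` option of `DataFrame.merge`. The sketch below uses toy frames with the same key columns; the `primaryType` value in it is made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `df` and `gmapsdata2`
reviews = pd.DataFrame({
    "restaurant_name_clean": ["Cocotte Notting Hill", "The Old Brewery", "The Old Brewery"],
})
places = pd.DataFrame({
    "TA_Names": ["Cocotte Notting Hill"],
    "primaryType": ["french_restaurant"],  # illustrative value only
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
checked = reviews.merge(
    places, how="left",
    left_on="restaurant_name_clean", right_on="TA_Names",
    indicator=True,
)

match_rate = (checked["_merge"] == "both").mean()
unmatched = checked.loc[checked["_merge"] == "left_only", "restaurant_name_clean"].unique()
```

Inspecting `unmatched` shows which names need cleaning or manual mapping before the geo and cuisine columns can be trusted.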

In [9]:
df_merged.groupby('primaryType').agg({'primaryType': 'count'})

                      primaryType
primaryType
american_restaurant         15443
art_gallery                   271
bakery                       1835
bar                        104554
barbecue_restaurant          4110
beauty_salon                  298
brazilian_restaurant         1849
breakfast_restaurant         3439
brunch_restaurant             847
cafe                        15107


In [2]:
# The primaryType column is helpful, but as the counts above show, 296,000 rows are still marked as
# just "restaurant", which isn't useful for categorising by cuisine. So I will attempt to use the
# overview column to pull out key words such as "European" or "British", which denote the type
# of cuisine served.

df_merged = pd.read_csv('Data/Merged_TA_Gmaps_Dataset_2.0.csv')

In [3]:
df_merged['overview'] = df_merged['overview'].fillna('')
restaurant_subset = df_merged.loc[df_merged['primaryType'] == 'restaurant']

restaurant_subset['overview'].unique()

array(['',
       'Inmates prep and serve European lunches in a smart dining room in Brixton Prison. Bookings only.',
       'Polished eatery specialising in elevated Pan-Asian staples such as curries, seafood & meat dishes.',
       'Modern European brasserie run by trainee chefs, for contemporary cuisine in a sophisticated space.',
       'Cosy, hip eatery with an open kitchen offering contemporary, locally sourced European cuisine.',
       'Understated Modern British restaurant with fixed-price menu using rare herbs and seasonal produce.',
       'Low-lit Moroccan, Algerian & Tunisian restaurant with traditional artefacts and colourful drapes.',
       'Old-fashioned, long-standing chip shop offering fried fish, burgers & pies, plus a take-out option.',
       'High-end, eclectic tasting menus and à la carte dishes in a polished basement with craft cocktails.',
       'Chic restaurant & bar with a large terrace serving contemporary Mexican & Peruvian small plates.',
       'Chargri

In [4]:
# From the above I have manually created a mapping which maps specific cuisine-denoting words in the overview
# into a cuisine category, as well as creating more overall categories for regions/subcontinents.

cuisine_mapping = pd.read_excel('Data/Manual Cuisine Mapping.xlsx')

# pattern = '|'.join(cuisine_mapping['Keyword'])
# df_merged['cuisine_mapping'] = df_merged[df_merged['overview'].str.contains(pattern, na=False)]


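# Return the first keyword (in mapping order) that appears in an overview.
# Keywords are treated as regular expressions and matching is case-sensitive.
def extract_pattern(text, patterns):
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group()  # the matched keyword text
    return None

# Drop blank spreadsheet rows so re.search never receives NaN
keywords = cuisine_mapping['Keyword'].dropna().astype(str).tolist()
df_merged['cuisine_mapping'] = df_merged['overview'].apply(lambda x: extract_pattern(x, keywords))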
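On roughly a million rows, the row-by-row `apply` above can be slow. A possible vectorised alternative, sketched here under assumptions, is to join the keywords into one alternation and use `Series.str.extract`. The keyword list below is hypothetical (the real one lives in the mapping spreadsheet), and the keywords are escaped with `re.escape`, so they are matched literally rather than as regular expressions.

```python
import re
import pandas as pd

# Hypothetical keywords standing in for cuisine_mapping['Keyword']
keywords = pd.Series(["European", "Modern European", "British", "Pan-Asian", None])

overview = pd.Series([
    "Inmates prep and serve European lunches in a smart dining room.",
    "Understated Modern British restaurant with fixed-price menu.",
    "Modern European brasserie run by trainee chefs.",
    "",
])

# Longest keywords first, so "Modern European" beats "European" at the same position
ordered = sorted(keywords.dropna().astype(str), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(k) for k in ordered) + ")"

# One capture group with expand=False returns a Series; rows without a match are NaN
matched = overview.str.extract(pattern, expand=False)
```

Note one behavioural difference: the loop returns the first keyword in mapping order that occurs anywhere in the text, whereas the regex returns whichever keyword occurs earliest in the text, so the two can disagree when an overview mentions several cuisines.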


In [6]:
restaurant_subset = df_merged.loc[df_merged['primaryType'] == 'restaurant']

df_merged2 = df_merged.merge(cuisine_mapping, how='left', left_on='cuisine_mapping', right_on='Keyword')

In [8]:
# I have also mapped the primaryType values obtained earlier through the Google Maps API to broader
# overall areas, to give a wider search range and more data to process at once

overall_mapping = pd.read_excel('Data/PrimaryType Overall Mapping.xlsx')

df_merged3 = df_merged2.merge(overall_mapping, how='left', on='primaryType')
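Since both mapping files are maintained by hand, a duplicated key in either spreadsheet would silently multiply review rows in these left merges. One way to guard against that, sketched here on toy data, is the `validate` argument of `DataFrame.merge`. The `overall` column name and its values are assumptions for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({"primaryType": ["bar", "cafe", "bar"]})

# Toy mapping with an accidental duplicate key ("bar" appears twice)
mapping = pd.DataFrame({
    "primaryType": ["bar", "cafe", "bar"],
    "overall": ["Drinks", "Cafe", "Pub"],  # hypothetical column and values
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique
try:
    reviews.merge(mapping, how="left", on="primaryType", validate="many_to_one")
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True

# After de-duplicating the mapping, the merge keeps exactly one row per review
deduped = mapping.drop_duplicates(subset="primaryType", keep="first")
safe = reviews.merge(deduped, how="left", on="primaryType", validate="many_to_one")
```

Comparing `len(safe)` with `len(reviews)` after each merge is a cheap extra check that no rows were duplicated.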