# Obtaining Restaurant Data

Our restaurant data was scraped from [Zomato](https://www.zomato.com/), an Indian restaurant aggregator and delivery service with an extensive database of restaurants in India.

At the time of the project, I was unable to obtain an API key from Zomato, so I have instead used the dataset uploaded to Kaggle by Himanshu Poddar [(link)](https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants).

Since this dataset lacks latitude and longitude information, we will have to geocode the restaurants ourselves, that is, look up coordinates from their names and addresses. Also, since the data is a couple of years old (uploaded March 2019), it will not fully reflect the current situation. For a general overview, however, it should be more than enough.

In [16]:
import pandas as pd
import numpy as np

from ratelimiter import RateLimiter
from diskcache import Cache
from tqdm.notebook import tqdm

from geopy.distance import distance
from geopy import Nominatim

import plotly.express as px

## Loading Data

We will load only the required columns from the dataset.

**List of columns in the dataset**

| Index | Column                      | Description                                                                                                                |
|-------|-----------------------------|----------------------------------------------------------------------------------------------------------------------------|
| 0     | url                         | contains the url of the restaurant in the zomato website                                                                   |
| 1     | address                     | contains the address of the restaurant in Bengaluru                                                                        |
| 2     | name                        | contains the name of the restaurant                                                                                        |
| 3     | online_order                | whether online ordering is available in the restaurant or not                                                              |
| 4     | book_table                  | table book option available or not                                                                                         |
| 5     | rate                        | contains the overall rating of the restaurant out of 5                                                                     |
| 6     | votes                       | contains the total number of ratings for the restaurant as of the above mentioned date                                     |
| 7     | phone                       | contains the phone number of the restaurant                                                                                |
| 8     | location                    | contains the neighborhood in which the restaurant is located                                                               |
| 9     | rest_type                   | restaurant type                                                                                                            |
| 10    | dish_liked                  | dishes people liked in the restaurant                                                                                      |
| 11    | cuisines                    | food styles, separated by comma                                                                                            |
| 12    | approx_cost(for two people) | contains the approximate cost for meal for two people                                                                      |
| 13    | reviews_list                | list of tuples containing reviews for the restaurant, each tuple consists of two values, rating and review by the customer |
| 14    | menu_item                   | contains list of menus available in the restaurant                                                                         |
| 15    | listed_in(type)             | type of meal                                                                                                               |
| 16    | listed_in(city)             | contains the neighborhood in which the restaurant is listed                                                                |

In [2]:
# Load only the required columns from the downloaded zip file

df_zomato = pd.read_csv(
    '../data/zomato_data.zip',
    usecols = [
        'name',
        'address',
        'location',
        'rest_type',
        'cuisines',
        'approx_cost(for two people)',
        'rate',
        'votes'
    ]
).rename(columns = {'approx_cost(for two people)': 'cost'})

print(df_zomato.shape)
df_zomato.head()

(51717, 8)


|       | address | name | rate | votes | location | rest_type | cuisines | cost |
|-------|---------|------|------|-------|----------|-----------|----------|------|
| 0     | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | 4.1/5 | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800 |
| 1     | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | 4.1/5 | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800 |
| 2     | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | 3.8/5 | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800 |
| 3     | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | 3.7/5 | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300 |
| 4     | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | 3.8/5 | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600 |


In [3]:
# Display summary statistics for all columns
df_zomato.describe(include = 'all')

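|        | address | name | rate | votes | location | rest_type | cuisines | cost |
|--------|---------|------|------|-------|----------|-----------|----------|------|
| count  | 51717 | 51717 | 43942 | 51717.0 | 51696 | 51490 | 51672 | 51371.0 |
| unique | 11495 | 8792 | 64 | | 93 | 93 | 2723 | 70.0 |
| top    | Delivery Only | Cafe Coffee Day | NEW | | BTM | Quick Bites | North Indian | 300.0 |
| freq   | 128 | 96 | 2208 | | 5124 | 19132 | 2913 | 7576.0 |
| mean   | | | | 283.697527 | | | | |
| std    | | | | 803.838853 | | | | |
| min    | | | | 0.0 | | | | |
| 25%    | | | | 7.0 | | | | |
| 50%    | | | | 41.0 | | | | |
| 75%    | | | | 198.0 | | | | |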


## Data Cleaning

Let's go through the columns one by one, correcting formats and removing unnecessary rows.

We can see from the descriptive statistics above that all rows have at least a name and an address. Some have not been assigned restaurant types, cuisines, or cost values.

We will be using these attributes in our exploration, and since each of them is missing in less than 1% of the rows, we can drop those rows from our set.

In [4]:
df_zomato.dropna(axis = 0, how = 'any', subset = ['rest_type', 'cuisines', 'cost'], inplace = True)
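Before dropping rows, it can help to put a number on "a very small fraction". In the first summary table, for instance, `cost` has 51,371 non-null values out of 51,717 rows, so roughly 0.7% are missing. The sketch below shows one way to compute that share per column; the `toy` frame and its values are made up for illustration and merely stand in for `df_zomato`.

```python
import pandas as pd

# Toy frame standing in for df_zomato; the values are invented for illustration.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'rest_type': ['Quick Bites', None, 'Cafe', 'Cafe'],
    'cuisines': ['North Indian', 'Chinese', None, 'Cafe'],
    'cost': ['300', '800', '500', None],
})

# Share of missing values per column, as a fraction of all rows
missing_share = toy.isna().mean()
print(missing_share)

# Rows that remain when a missing value in any of the three columns drops the row
kept = toy.dropna(subset=['rest_type', 'cuisines', 'cost'])
print(len(kept), 'of', len(toy), 'rows kept')
```

Running the same two steps on the real frame gives the per-column share and the number of rows the `dropna` call removes.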

In [5]:
df_zomato['address'].value_counts()

Delivery Only                                           127
14th Main, 4th Sector, HSR, Bangalore                    71
The Ritz-Carlton, 99, Residency Road, Bangalore          61
Citrus Hotels, 34, Cunningham Road, Bangalore            53
Conrad Bengaluru, Kensington Road, Ulsoor, Bangalore     49

In [6]:
# Drop Delivery Only locations - we are looking at dine-in restaurants
df_zomato = df_zomato[df_zomato['address'] != 'Delivery Only']
df_zomato.shape

(51021, 8)

In [7]:
df_zomato['rate'].unique()

array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
       '3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
       '4.3/5', 'NEW', '2.9/5', '3.5/5', nan, '2.6/5', '3.8 /5', '3.4/5',
       '4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
       '3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
       '4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
       '3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
       '4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
       '4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
       '2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)

In [8]:
# Replace 'NEW' and '-' in rate with NaN, and strip the '/5' suffix from ratings
df_zomato['rate'] = df_zomato['rate'].str.replace('NEW', '', case = False)\
    .str.replace('/5', '')\
    .str.replace('-','')\
    .str.strip()\
    .replace('',np.nan)\
    .astype(float)
df_zomato['rate'].unique()

array([4.1, 3.8, 3.7, 3.6, 4.6, 4. , 4.2, 3.9, 3.1, 3. , 3.2, 3.3, 2.8,
       4.4, 4.3, nan, 2.9, 3.5, 2.6, 3.4, 4.5, 2.5, 2.7, 4.7, 2.4, 2.2,
       2.3, 4.8, 4.9, 2.1, 2. , 1.8])
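As an alternative to chaining several replacements, the numeric part of each rating can be pulled out with a single regular expression. This is only a sketch on a hand-picked sample of the raw values listed above, not the cell the notebook actually ran; it assumes every valid rating starts with a digit, a dot, and a digit.

```python
import numpy as np
import pandas as pd

# A few of the raw formats seen in the 'rate' column
raw = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', np.nan, '2.0 /5'])

# Capture the leading "d.d" number; entries without one ('NEW', '-', NaN) become NaN
parsed = raw.str.extract(r'^\s*(\d\.\d)', expand=False).astype(float)
print(parsed)
```

The advantage of this form is that any unexpected placeholder text ends up as `NaN` instead of raising an error during the float conversion.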

In [9]:
df_zomato['cost'].unique() # Check why cost is being interpreted as string

array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
       '900', '200', '750', '150', '850', '100', '1,200', '350', '250',
       '950', '1,000', '1,500', '1,300', '199', '80', '1,100', '160',
       '1,600', '230', '130', '50', '190', '1,700', '1,400', '180',
       '1,350', '2,200', '2,000', '1,800', '1,900', '330', '2,500',
       '2,100', '3,000', '2,800', '3,400', '40', '1,250', '3,500',
       '4,000', '2,400', '2,600', '120', '1,450', '469', '70', '3,200',
       '60', '560', '240', '360', '6,000', '1,050', '2,300', '4,100',
       '5,000', '3,700', '1,650', '2,700', '4,500', '140'], dtype=object)

In [10]:
# Remove commas and convert to integers
df_zomato['cost'] = df_zomato['cost'].str.replace(',','').astype(int)
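The conversion above works because rows with a missing cost were already dropped; `astype(int)` raises an error if a `NaN` is still present. A slightly more defensive sketch, assuming we wanted to keep such rows, uses `pd.to_numeric` and pandas' nullable integer type. The sample values are taken from the list of unique costs above, plus one deliberately missing entry.

```python
import pandas as pd

# Sample cost strings, with one missing value added on purpose
raw_cost = pd.Series(['800', '1,200', '6,000', None])

# Strip thousands separators, then convert; the missing entry stays missing
cost = pd.to_numeric(
    raw_cost.str.replace(',', '', regex=False),
    errors='coerce',
).astype('Int64')
print(cost)
```

Another option is to pass `thousands=','` to `pd.read_csv`, which lets pandas parse such numbers while loading the file.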

Let's view the data once more to confirm that it is clean.

In [11]:
df_zomato.dtypes # Check datatypes are correct

address       object
name          object
rate         float64
votes          int64
location      object
rest_type     object
cuisines      object
cost           int32
dtype: object

In [12]:
df_zomato.describe(include = 'all')

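|        | address | name | rate | votes | location | rest_type | cuisines | cost |
|--------|---------|------|------|-------|----------|-----------|----------|------|
| count  | 51021 | 51021 | 41177.0 | 51021.0 | 51021 | 51021 | 51021 | 51021.0 |
| unique | 11393 | 8698 | | | 92 | 93 | 2698 | |
| top    | 14th Main, 4th Sector, HSR, Bangalore | Cafe Coffee Day | | | BTM | Quick Bites | North Indian | |
| freq   | 71 | 96 | | | 5071 | 19044 | 2851 | |
| mean   | | | 3.702397 | 285.479606 | | | | 556.428235 |
| std    | | | 0.440094 | 807.392688 | | | | 439.99352 |
| min    | | | 1.8 | 0.0 | | | | 40.0 |
| 25%    | | | 3.4 | 7.0 | | | | 300.0 |
| 50%    | | | 3.7 | 41.0 | | | | 400.0 |
| 75%    | | | 4.0 | 200.0 | | | | 700.0 |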


Everything seems to be in order, so we save the table to disk.

In [13]:
df_zomato.reset_index(drop = True, inplace = True)
df_zomato.to_feather('../data/zomato_bangalore.feather')

## Adding coordinates

Our dataset does not contain the latitude and longitude that we require for our analysis, so we will need to geocode the restaurants, that is, look up coordinates from their names and locations. We have around 50k data points, so we will need an API with a high free usage limit.
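With this many rows, it is worth avoiding repeated lookups: the summary above shows only 8,698 unique names across 51,021 rows, so many queries are likely to recur. The sketch below wraps any geocoding function with a cache and a minimum delay between remote calls. It uses only the standard library; `make_cached_geocoder` and `fake_geocode` are hypothetical names introduced here, and the coordinates are placeholders, not real lookup results.

```python
import time

def make_cached_geocoder(geocode, min_interval=1.0, cache=None):
    """Wrap a geocode callable with a cache and a minimum delay between calls."""
    cache = {} if cache is None else cache
    last_call = [None]  # monotonic timestamp of the previous remote call

    def lookup(query):
        if query in cache:
            return cache[query]  # answered locally, no remote call
        if last_call[0] is not None:
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)  # throttle remote calls
        result = geocode(query)
        last_call[0] = time.monotonic()
        cache[query] = result  # failed lookups (None) are cached too
        return result

    return lookup

# Stand-in for a real geocoder so the example runs offline
calls = []
def fake_geocode(query):
    calls.append(query)
    return (12.97, 77.59)  # placeholder coordinates

lookup = make_cached_geocoder(fake_geocode, min_interval=0.0)
first = lookup('Truffles, Koramangala 5th Block, Bengaluru')
second = lookup('Truffles, Koramangala 5th Block, Bengaluru')
print(len(calls), 'remote call(s) for 2 lookups')
```

In the notebook, the wrapped function would be something like the `geocode` method of the `Nominatim` object, and the `cache` argument could be any dict-like store. The imported `diskcache.Cache` looks like a natural candidate for persisting results between sessions, though that combination is an assumption and is not tested here.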

In [3]:
df_zomato = pd.read_feather('../data/zomato_bangalore.feather') # Read file from disk
df_zomato.head()

#TODO If no API works well enough, attempt to link the data in some other way. Maybe fuzzy matching restaurant names and locations?

|       | address | name | rate | votes | location | rest_type | cuisines | cost |
|-------|---------|------|------|-------|----------|-----------|----------|------|
| 0     | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | 4.1 | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800 |
| 1     | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | 4.1 | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800 |
| 2     | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | 3.8 | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800 |
| 3     | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | 3.7 | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300 |
| 4     | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | 3.8 | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600 |


In [14]:
testgc = Nominatim(user_agent='coursera_capstone')
testlmt = RateLimiter(max_calls=1, period=1)
testset = df_zomato.sample(n=10)
testset.head(10)

|       | address | name | rate | votes | location | rest_type | cuisines | cost |
|-------|---------|------|------|-------|----------|-----------|----------|------|
| 17190 | 2nd Floor, biriyani&39 House 24th Main, Near P... | Biriyani's House | | 0 | HSR | Quick Bites | Biryani, North Indian | 200 |
| 37575 | 286, 2nd Floor, Commercial Plaza, Near Westsid... | Petoo | 3.7 | 32 | Commercial Street | Casual Dining | North Indian, Fast Food, Street Food | 800 |
| 35549 | 28, 4th 'B' Cross, Koramangala 5th Block, Bang... | Truffles | 4.7 | 14723 | Koramangala 5th Block | Cafe, Casual Dining | Cafe, American, Burger, Steak | 900 |
| 28214 | 61, 1st Main Road, Koramangala 7th Block, Bang... | Smokey Tribe Restaurant | 3.8 | 267 | Koramangala 7th Block | Casual Dining | North Eastern, Chinese | 650 |
| 7016  | 35/1B, Munnekolala, Marathahalli, Bangalore | Bangaliana | 3.6 | 124 | Marathahalli | Quick Bites | Bengali | 600 |
| 35697 | 1st cross, Tavarekere Main Road, SG Palya, BTM... | Oveanly | | 0 | BTM | Delivery | Bakery | 200 |
| 26133 | 6, Hennur Village, Kalyan Nagar Post, Near Gov... | Samosa Singh | | 0 | Kalyan Nagar | Quick Bites | North Indian, Mithai | 250 |
| 37882 | 47/48, Residency Road, Ashok Nagar, Bangalore | Hit and Run | 3.6 | 15 | Central Bangalore | Food Truck | Fast Food, Continental | 350 |
| 15269 | 69, 1st Floor, M.M Road, Frazer Town, Bangalore | Alibaba Cafe and Restaurant | 3.9 | 450 | Frazer Town | Casual Dining | Arabian, Middle Eastern | 700 |
| 24596 | 959, 5th Main, 2nd Block, Kammanahalli, Bangalore | Al-Badia Restaurant | 4.2 | 89 | Kammanahalli | Casual Dining | Arabian | 700 |


In [15]:
for i,r in testset.iterrows():
    q = ', '.join([r['name'], r['location'], 'Bengaluru'])
    print(q)
    with testlmt:
        resp = testgc.geocode(q)
    print(resp)
    print('----------\n')

Biriyani's House, HSR, Bengaluru
None
----------

Petoo, Commercial Street, Bengaluru
None
----------

Truffles, Koramangala 5th Block, Bengaluru
Truffles, 1st A Cross Road, Koramangala 7 Block, Adugodi, South Zone, Bengaluru, Bangalore South, Bangalore Urban, Karnataka, 560029, India
----------

Smokey Tribe Restaurant, Koramangala 7th Block, Bengaluru
None
----------

Bangaliana, Marathahalli, Bengaluru
None
----------

Oveanly, BTM, Bengaluru
None
----------

Samosa Singh, Kalyan Nagar, Bengaluru
None
----------

Hit and Run, Central Bangalore, Bengaluru
None
----------

Alibaba Cafe and Restaurant, Frazer Town, Bengaluru
Alibaba Cafe & Restaurant, No.69,1st floor,, Bourdillon Road, Frazer Town, Pulikeshinagar, East Zone, Bengaluru, Bangalore North, Bangalore Urban, Karnataka, 560005, India
----------

Al-Badia Restaurant, Kammanahalli, Bengaluru
None
----------
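Only two of the ten test queries returned a match, and one of those came back under a slightly different spelling ("Alibaba Cafe & Restaurant" for "Alibaba Cafe and Restaurant"). The fuzzy-matching idea from the TODO note could be sketched with the standard library's `difflib`, as below. The helper names and the 0.8 threshold are assumptions chosen for illustration, not tuned values.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    scored = [(name_similarity(name, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# Names taken from the test run above
candidates = ['Alibaba Cafe & Restaurant', 'Truffles']
print(best_match('Alibaba Cafe and Restaurant', candidates))
print(best_match('Petoo', candidates))
```

A check like this could be used to confirm that a geocoder result really refers to the restaurant that was queried, or to link the Zomato rows to another source that already carries coordinates.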

