# Introduction and Business Problem

# Introduction
Mumbai, India is a large city packed with restaurants, nightlife, and amazing
people. For people who are new to Mumbai, it can be daunting to figure out which 
restaurants are worth going to and where they are. Whether you used to live in Mumbai or are just visiting, 
how do you know what the best places are to get something to eat?
    
# Business Problem
For this project, I create a simple guide on where to eat based on Foursquare likes, restaurant category, and geographic location data for restaurants in Mumbai. I then cluster these restaurants based on their similarities so that a 
user can easily determine which types of restaurants are best to eat at based on Foursquare user feedback.

# Data Requirements and Methodology

# Data Requirements
For this project, I will use the Foursquare API to pull the following location data on restaurants in Mumbai, India:
•	Venue Name
•	Venue ID
•	Venue Location
•	Venue Category
•	Count of Likes

# Data Acquisition Approach
To acquire the data mentioned above, I will need to do the following:
•	Get latitude and longitude coordinates for Mumbai, India with a geolocator
•	Use Foursquare API to get a list of all venues in Mumbai
o	Get venue name, venue ID, location, category, and likes


In [460]:

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# uncomment the next line if geopy is not installed
# (keep the comment on its own line: a trailing comment on a "!" shell line is passed to conda as arguments)
# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
try:
    from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# uncomment the next line if folium is not installed
# !conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# import beautiful soup
from urllib.request import urlopen
from bs4 import BeautifulSoup


print('Libraries imported.')

Libraries imported.


In [461]:
CLIENT_ID = 'D0YVHBVBS1MCYPR2AGEOK5V3VSCQMLHFRJ3Q2JG5F2Z21GS0' # your Foursquare ID
CLIENT_SECRET = 'P34NPCVIHKJQHZJJFYUEU5Q4YY1C3YIMFNORC3U4BKPGHV1A' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: D0YVHBVBS1MCYPR2AGEOK5V3VSCQMLHFRJ3Q2JG5F2Z21GS0
CLIENT_SECRET:P34NPCVIHKJQHZJJFYUEU5Q4YY1C3YIMFNORC3U4BKPGHV1A
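
The cell above hardcodes the credentials and prints them in full. A minimal sketch of an alternative, assuming the keys are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are an assumption made for this sketch.
    """
    names = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

def mask(secret, visible=4):
    """Keep the first `visible` characters and replace the rest with '*'."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)
```

With this in place, the cell could print `mask(CLIENT_SECRET)` instead of the full secret, so the key does not end up in the saved notebook output.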



In [462]:
LIMIT=100
radius=50000
latitude=19.0330
longitude=73.0297
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=D0YVHBVBS1MCYPR2AGEOK5V3VSCQMLHFRJ3Q2JG5F2Z21GS0&client_secret=P34NPCVIHKJQHZJJFYUEU5Q4YY1C3YIMFNORC3U4BKPGHV1A&v=20180605&ll=19.033,73.0297&radius=50000&limit=100'
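
The same URL can be assembled from a dictionary of query parameters, which avoids the stray `&` after `?` and handles escaping. This is a sketch using only the standard library; `build_explore_url` is a hypothetical helper name, and the endpoint and parameter names are taken from the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes values, so the comma in "ll" becomes %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params)
```

Alternatively, `requests.get(EXPLORE_ENDPOINT, params=params)` accepts the dictionary directly and performs the encoding itself.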

In [463]:
results=requests.get(url).json()
results

{'meta': {'code': 429,
  'errorType': 'quota_exceeded',
  'errorDetail': 'Quota exceeded',
  'requestId': '5cfd02b7f594df57e97da3b2'},
 'response': {}}
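
The response above carries code 429 (`quota_exceeded`) and an empty `response`, so any later lookup of `results['response']['groups']` raises a `KeyError`. A small sketch of checking `meta` before parsing follows; `extract_venue_items` is a hypothetical helper, and treating only code 200 as success is an assumption.

```python
def extract_venue_items(results):
    """Return the venue items from an explore response, or raise with the API error."""
    meta = results.get("meta", {})
    code = meta.get("code")
    if code != 200:  # assumption: only 200 counts as success
        detail = meta.get("errorDetail", "no detail")
        raise RuntimeError("Foursquare request failed with code {}: {}".format(code, detail))
    groups = results.get("response", {}).get("groups", [])
    # the notebook reads the items of the first group
    return groups[0].get("items", []) if groups else []
```

Calling `extract_venue_items(results)` on the quota response would then fail with a message naming the quota error rather than with `KeyError: 'groups'`.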

In [464]:
def get_category_type(row):
    # the category list may sit under "categories" or "venue.categories"
    try:
        categories_list = row["categories"]
    except KeyError:
        categories_list = row["venue.categories"]

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]["name"]

In [465]:
# Pull actual data from Foursquare App
venues= results['response']['groups'][0]['items']
nearby_venues=json_normalize(venues)
filtered_columns=['venue.name','venue.id','venue.categories','venue.location.lat','venue.location.lng']
nearby_venues

nearby_venues=nearby_venues.loc[:,filtered_columns]
nearby_venues['venue.categories']=nearby_venues.apply(get_category_type,axis=1)
nearby_venues

KeyError: 'groups'

In [440]:
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues

Unnamed: 0,name,id,categories,lat,lng
0,Barbeque Nation,4e29d25281dc3da11a7c119a,BBQ Joint,19.043827,73.008713
1,Baker's Treat,4ce9136ef3bda143c4cabde4,Bakery,19.03195,73.021683
2,Pronto Pasta and Noodles,4f48ecfde4b024e50ca6c4c5,Italian Restaurant,19.088129,73.004427
3,Health Juice And Fast Food,4d94b41ecf46224b27179b94,Juice Bar,19.088218,73.004234
4,Khau Galli,4b4c8f5df964a5201cb626e3,Food Truck,19.080442,72.904319
5,Hamleys,51d2a531454ad6055f94cdce,Toy / Game Store,19.086655,72.889783
6,Mehman Nawazi,5107ff0ce4b05e493f979b9a,Indian Restaurant,19.111485,72.909
7,Elephanta,50c0278ee4b0b9f404296a6f,Historic Site,18.971931,72.930505
8,Starbucks Coffee: A Tata Alliance,50f63f31e4b050d06e1cb448,Coffee Shop,19.116317,72.9097
9,PVR Cinemas,50950f23e4b0578aae7a7a00,Movie Theater,19.086643,72.889839


In [441]:
# list the unique categories returned by the API so we can see which ones do or do not count as restaurants
nearby_venues['categories'].unique()

array(['BBQ Joint', 'Bakery', 'Italian Restaurant', 'Juice Bar',
       'Food Truck', 'Toy / Game Store', 'Indian Restaurant',
       'Historic Site', 'Coffee Shop', 'Movie Theater', 'Dessert Shop',
       'Restaurant', 'Hotel', 'Shopping Mall', 'Snack Place', 'Café',
       'Gym Pool', 'Vegetarian / Vegan Restaurant', 'Bar',
       'North Indian Restaurant', 'Dim Sum Restaurant',
       'Punjabi Restaurant', 'Salad Place', 'Deli / Bodega', 'Park',
       'Multiplex', 'Theater', 'Scenic Lookout', 'Seafood Restaurant',
       'Ice Cream Shop', 'Irani Cafe', 'Pub', 'Chinese Restaurant',
       'General Entertainment', 'Fast Food Restaurant', 'Cupcake Shop',
       'Lounge', 'Playground', 'Breakfast Spot', 'Art Gallery',
       'Performing Arts Venue', 'Pizza Place', 'Grocery Store'],
      dtype=object)

In [442]:
# create a list of categories to remove from our dataframe because they are not restaurants
# this could be automated with a function, but since the list is small, I built it manually

removal_list = ['Gym / Fitness Center', 'Bakery', 'Park', "Women's Store", 'Sporting Goods Shop', 'Dog Run', 'Gaming Cafe',
               'Optical Shop', 'Yoga Studio', 'Pet Store', 'Shoe Repair', 'Jewelry Store', 'Record Shop', 'Juice Bar', 
               'Cosmetics Shop', 'Business Service', 'Salon / Barbershop', 'Liquor Store', 'Grocery Store', 'Stationery Store',
               'Pilates Studio', 'Dessert Shop', 'Bookstore', 'Concert Hall', 'Video Game Store', 'Pharmacy', 'Mobile Phone Shop',
               'Deli / Bodega']

nearby_venues2 = nearby_venues.copy()


# keep only venues whose category is not in the removal list
# (categories missing from the list, such as 'Toy / Game Store', are still kept)
nearby_venues2 = nearby_venues2[~nearby_venues2['categories'].isin(removal_list)]
nearby_venues2

Unnamed: 0,name,id,categories,lat,lng
0,Barbeque Nation,4e29d25281dc3da11a7c119a,BBQ Joint,19.043827,73.008713
2,Pronto Pasta and Noodles,4f48ecfde4b024e50ca6c4c5,Italian Restaurant,19.088129,73.004427
4,Khau Galli,4b4c8f5df964a5201cb626e3,Food Truck,19.080442,72.904319
5,Hamleys,51d2a531454ad6055f94cdce,Toy / Game Store,19.086655,72.889783
6,Mehman Nawazi,5107ff0ce4b05e493f979b9a,Indian Restaurant,19.111485,72.909
7,Elephanta,50c0278ee4b0b9f404296a6f,Historic Site,18.971931,72.930505
8,Starbucks Coffee: A Tata Alliance,50f63f31e4b050d06e1cb448,Coffee Shop,19.116317,72.9097
9,PVR Cinemas,50950f23e4b0578aae7a7a00,Movie Theater,19.086643,72.889839
11,IVY Restaurant & Banquets,4dc6a998b0fb5556cd0e99a2,Restaurant,19.069663,72.900535
12,Sofitel Mumbai BKC,4d9b39ed2ae860fc0f5a81cb,Hotel,19.067628,72.869499
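
The manual removal list leaves non-restaurant rows such as Hamleys and PVR Cinemas in the output above. One possible way to filter at scale is to keep categories by keyword instead of removing them by name. The keyword tuple and the helper `is_food_category` below are assumptions chosen for illustration, not the notebook's method, and would need tuning against the full category list.

```python
# Keywords that suggest a food venue; this tuple is an assumption for the sketch.
FOOD_KEYWORDS = ("restaurant", "joint", "food", "café", "cafe", "coffee",
                 "pizza", "snack", "breakfast")

def is_food_category(category, keywords=FOOD_KEYWORDS):
    """Return True when the category name contains any food keyword."""
    name = category.lower()
    return any(keyword in name for keyword in keywords)

sample = ['BBQ Joint', 'Toy / Game Store', 'Indian Restaurant',
          'Historic Site', 'Coffee Shop', 'Movie Theater', 'Hotel']
food_only = [c for c in sample if is_food_category(c)]
print(food_only)  # ['BBQ Joint', 'Indian Restaurant', 'Coffee Shop']
```

On the dataframe, the same predicate could be applied as `nearby_venues[nearby_venues['categories'].apply(is_food_category)]`.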


In [443]:
venue_id_list=nearby_venues2['id'].tolist()
venue_id_list

['4e29d25281dc3da11a7c119a',
 '4f48ecfde4b024e50ca6c4c5',
 '4b4c8f5df964a5201cb626e3',
 '51d2a531454ad6055f94cdce',
 '5107ff0ce4b05e493f979b9a',
 '50c0278ee4b0b9f404296a6f',
 '50f63f31e4b050d06e1cb448',
 '50950f23e4b0578aae7a7a00',
 '4dc6a998b0fb5556cd0e99a2',
 '4d9b39ed2ae860fc0f5a81cb',
 '4b27c85ff964a520dc8924e3',
 '4bc085064cdfc9b6a5ee9221',
 '54a3a602498e6d5ac992f927',
 '546f25c5498e2d4d056ad6eb',
 '4b0587cef964a52077a222e3',
 '4b0587dbf964a52069a422e3',
 '4b5b4ecdf964a5206af328e3',
 '4b0587d9f964a52021a422e3',
 '4b865eeaf964a520af8731e3',
 '5a21b1cea4236243c9628ad2',
 '55463c55498ef91e86b51620',
 '4b0587d8f964a52008a422e3',
 '4b0abdf5f964a520292723e3',
 '4ee333ba9adf3982fe44da3b',
 '4bc201084cdfc9b6fb589521',
 '5496e94e498e349305d81639',
 '4ed139495c5c9528fa406b06',
 '55c4347b498e1fc7af0b14aa',
 '4b0587ccf964a52024a222e3',
 '524edc34498ebd9556e084fc',
 '4bc78f5793bdeee1848337ae',
 '5275159f11d265069773cd85',
 '4f2791aee4b0fa5ce519a11c',
 '4b0587caf964a520dfa122e3',
 '4b0587dcf964

In [444]:
url_list=[]
like_list=[]
json_list=[]

for i in venue_id_list:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(i,CLIENT_ID, CLIENT_SECRET, VERSION)
    url_list.append(venue_url)

for link in url_list:
    result=requests.get(link).json()
    likes=result['response']['likes']['count']
    like_list.append(likes)
print(like_list)
    

KeyError: 'likes'
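
The `KeyError` above occurs when a response has no `likes` entry, which is what a quota-exceeded reply looks like. A hedged sketch of a tolerant lookup follows; `extract_like_count` is a hypothetical helper name.

```python
def extract_like_count(result):
    """Return the like count from a venue likes response, or None if it is absent."""
    likes = result.get("response", {}).get("likes")
    if not isinstance(likes, dict):
        return None  # e.g. a quota-exceeded reply with an empty response
    return likes.get("count")
```

Appending `extract_like_count(result)` inside the loop would keep `like_list` the same length as `venue_id_list`, with `None` marking venues whose likes could not be fetched.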

In [None]:
print(len(like_list))
print(len(venue_id_list))

# Data Prep Intro
The thought process behind this is that likes are a proxy for quality: the more likes a restaurant has, the better it is assumed to be. This assumption might be incorrect, but API call limits (how many calls I can make for free) hold me back from getting price and rating data. I will then bin the likes into a categorical quality variable so we can cluster appropriately.

I am also going to create new categorical variables for the restaurants to better group them by type of cuisine. This way you can look for good Indian food or know what type of food might be best to eat in Mumbai if you are new to the area.

In [445]:
mumbai_venues=nearby_venues2.copy()
mumbai_venues

Unnamed: 0,name,id,categories,lat,lng
0,Barbeque Nation,4e29d25281dc3da11a7c119a,BBQ Joint,19.043827,73.008713
2,Pronto Pasta and Noodles,4f48ecfde4b024e50ca6c4c5,Italian Restaurant,19.088129,73.004427
4,Khau Galli,4b4c8f5df964a5201cb626e3,Food Truck,19.080442,72.904319
5,Hamleys,51d2a531454ad6055f94cdce,Toy / Game Store,19.086655,72.889783
6,Mehman Nawazi,5107ff0ce4b05e493f979b9a,Indian Restaurant,19.111485,72.909
7,Elephanta,50c0278ee4b0b9f404296a6f,Historic Site,18.971931,72.930505
8,Starbucks Coffee: A Tata Alliance,50f63f31e4b050d06e1cb448,Coffee Shop,19.116317,72.9097
9,PVR Cinemas,50950f23e4b0578aae7a7a00,Movie Theater,19.086643,72.889839
11,IVY Restaurant & Banquets,4dc6a998b0fb5556cd0e99a2,Restaurant,19.069663,72.900535
12,Sofitel Mumbai BKC,4d9b39ed2ae860fc0f5a81cb,Hotel,19.067628,72.869499


In [446]:
# add in the list of likes

mumbai_venues['total likes'] = like_list
mumbai_venues.head()

ValueError: Length of values does not match length of index
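
The `ValueError` comes from assigning a list whose length differs from the number of rows. One way to avoid positional alignment is to key the likes by venue id. The sketch below assumes a callable `fetch_likes` (hypothetical) that returns the like count for one venue id and may raise when the lookup fails.

```python
def likes_by_venue_id(venue_ids, fetch_likes):
    """Map each venue id to its like count, using None when the lookup fails."""
    likes = {}
    for venue_id in venue_ids:
        try:
            likes[venue_id] = fetch_likes(venue_id)
        except (KeyError, ValueError):
            likes[venue_id] = None  # keep the id so every row still gets a value
    return likes
```

With such a dictionary, `mumbai_venues['total likes'] = mumbai_venues['id'].map(likes)` aligns on the id column, so a missing venue yields a missing value instead of a length mismatch.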

In [447]:
# summary statistics for total likes, used to choose the bins
print(mumbai_venues['total likes'].max())
print(mumbai_venues['total likes'].min())
print(mumbai_venues['total likes'].median())
print(mumbai_venues['total likes'].mean())

KeyError: 'total likes'

In [448]:
# let's visualize our total likes based on a histogram

import matplotlib.pyplot as plt
mumbai_venues['total likes'].hist(bins=4)
plt.show()


KeyError: 'total likes'

In [449]:
# quartiles of total likes, used as bin edges
print(np.percentile(mumbai_venues['total likes'], 25))
print(np.percentile(mumbai_venues['total likes'], 50))
print(np.percentile(mumbai_venues['total likes'], 75))

KeyError: 'total likes'

In [450]:
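# split the venues into quality groups; each comparison is parenthesized because & binds tighter than > and <=
poor=mumbai_venues[mumbai_venues['total likes'] <= 40]
below_avg=mumbai_venues[(mumbai_venues['total likes']>40) & (mumbai_venues['total likes'] <=104)]
abv_avg=mumbai_venues[(mumbai_venues['total likes'] > 104) & (mumbai_venues['total likes'] <= 194)]
good=mumbai_venues[mumbai_venues['total likes'] > 194]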

KeyError: 'total likes'

In [451]:
def condition(s):
    if s['total likes']<= 40:
        return 'poor'
    if s['total likes'] <= 104:
        return 'below_avg'
    if s['total likes'] <=194:
        return 'abv_avg'
    if s['total likes'] >194:
        return 'good'
    
mumbai_venues['total_likes_cat']=mumbai_venues.apply(condition,axis=1)

KeyError: ('total likes', 'occurred at index 0')
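
The chain of `if` statements in `condition` can also be written as a lookup against sorted bin edges. The sketch below uses the thresholds 40, 104, and 194 from the cell above; `likes_category` is a hypothetical name, and the standard-library `bisect` module stands in for a dataframe operation.

```python
from bisect import bisect_left

BIN_EDGES = [40, 104, 194]  # upper bounds (inclusive) of the first three bins
BIN_LABELS = ["poor", "below_avg", "abv_avg", "good"]

def likes_category(total_likes):
    """Return the quality label for a like count, with inclusive upper edges."""
    # bisect_left counts the edges strictly below total_likes,
    # so a value equal to an edge stays in the lower bin
    return BIN_LABELS[bisect_left(BIN_EDGES, total_likes)]

print([likes_category(n) for n in (40, 41, 104, 194, 195)])
# ['poor', 'below_avg', 'below_avg', 'abv_avg', 'good']
```

In pandas, `pd.cut` with the bins `[-inf, 40, 104, 194, inf]` and these labels should give the same grouping, since its intervals are closed on the right by default.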

In [452]:
mumbai_venues['categories'].unique()

array(['BBQ Joint', 'Italian Restaurant', 'Food Truck',
       'Toy / Game Store', 'Indian Restaurant', 'Historic Site',
       'Coffee Shop', 'Movie Theater', 'Restaurant', 'Hotel',
       'Shopping Mall', 'Snack Place', 'Café', 'Gym Pool',
       'Vegetarian / Vegan Restaurant', 'Bar', 'North Indian Restaurant',
       'Dim Sum Restaurant', 'Punjabi Restaurant', 'Salad Place',
       'Multiplex', 'Theater', 'Scenic Lookout', 'Seafood Restaurant',
       'Ice Cream Shop', 'Irani Cafe', 'Pub', 'Chinese Restaurant',
       'General Entertainment', 'Fast Food Restaurant', 'Cupcake Shop',
       'Lounge', 'Playground', 'Breakfast Spot', 'Art Gallery',
       'Performing Arts Venue', 'Pizza Place'], dtype=object)

In [453]:
bars=['Bar','Pub']
Hang_out=['Scenic Lookout','Historic Site','Shopping Mall','Gym Pool','Multiplex','Art Gallery','Theater','Lounge','Movie Theater']
Entertainment=['Toy / Game Store','Performing Arts Venue','Playground','General Entertainment']
veg_restaurant=['Indian Restaurant','Vegetarian / Vegan Restaurant','Punjabi Restaurant','Salad Place','BBQ Joint','Restaurant']
fast_food=['Pizza Place','Fast Food Restaurant','Café','Snack Place','Coffee Shop','Food Truck','Breakfast Spot']
other_restaurant=['Italian Restaurant','North Indian Restaurant','Dim Sum Restaurant','Seafood Restaurant','Chinese Restaurant']
dessert=['Ice Cream Shop','Cupcake Shop']
Hotels=['Hotel']

def conditions2(s):
    if s['categories'] in bars:
        return 'bars'
    if s['categories'] in Hotels:
        return 'Hotels'
    if s['categories'] in Hang_out:
        return 'Hang_out'
    if s['categories'] in Entertainment:
        return 'Entertainment'
    if s['categories'] in veg_restaurant:
        return 'veg_restaurant'
    if s['categories'] in fast_food:
        return 'fast_food'
    if s['categories'] in other_restaurant:
        return 'other_restaurant'
    if s['categories'] in dessert:
        return 'dessert'
mumbai_venues['categories_new']=mumbai_venues.apply(conditions2,axis=1)
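
The grouping above can also be expressed as a single lookup table, which avoids one `if` per group. This is a sketch with an abbreviated set of groups; the names `CATEGORY_GROUPS`, `GROUP_BY_CATEGORY`, and `group_for` are hypothetical.

```python
# Abbreviated copy of the groups defined above, keyed by group label.
CATEGORY_GROUPS = {
    "bars": ["Bar", "Pub"],
    "Hotels": ["Hotel"],
    "fast_food": ["Pizza Place", "Fast Food Restaurant", "Café", "Snack Place"],
}

# Invert to a flat category -> group dictionary for direct lookup.
GROUP_BY_CATEGORY = {
    category: group
    for group, members in CATEGORY_GROUPS.items()
    for category in members
}

def group_for(category):
    """Return the group label for a category, or None when it is not listed."""
    return GROUP_BY_CATEGORY.get(category)
```

On the dataframe, `mumbai_venues['categories'].map(GROUP_BY_CATEGORY)` would produce the same column as the row-wise `apply`, with missing values for unlisted categories.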

In [454]:
mumbai_venues

Unnamed: 0,name,id,categories,lat,lng,categories_new
0,Barbeque Nation,4e29d25281dc3da11a7c119a,BBQ Joint,19.043827,73.008713,veg_restaurant
2,Pronto Pasta and Noodles,4f48ecfde4b024e50ca6c4c5,Italian Restaurant,19.088129,73.004427,other_restaurant
4,Khau Galli,4b4c8f5df964a5201cb626e3,Food Truck,19.080442,72.904319,fast_food
5,Hamleys,51d2a531454ad6055f94cdce,Toy / Game Store,19.086655,72.889783,Entertainment
6,Mehman Nawazi,5107ff0ce4b05e493f979b9a,Indian Restaurant,19.111485,72.909,veg_restaurant
7,Elephanta,50c0278ee4b0b9f404296a6f,Historic Site,18.971931,72.930505,Hang_out
8,Starbucks Coffee: A Tata Alliance,50f63f31e4b050d06e1cb448,Coffee Shop,19.116317,72.9097,fast_food
9,PVR Cinemas,50950f23e4b0578aae7a7a00,Movie Theater,19.086643,72.889839,Hang_out
11,IVY Restaurant & Banquets,4dc6a998b0fb5556cd0e99a2,Restaurant,19.069663,72.900535,veg_restaurant
12,Sofitel Mumbai BKC,4d9b39ed2ae860fc0f5a81cb,Hotel,19.067628,72.869499,Hotels


In [None]:
# one hot encoding
# Now let's create dummy variables for our total likes and categories so we can cluster
mumbai_onehot = pd.get_dummies(mumbai_venues[['categories_new', 'total_likes_cat']], prefix="", prefix_sep="")

# add the venue name column back to the dataframe
mumbai_onehot['Name'] = mumbai_venues['name'] 

# move the name column to the first position
fixed_columns = [mumbai_onehot.columns[-1]] + list(mumbai_onehot.columns[:-1])
mumbai_onehot = mumbai_onehot[fixed_columns]

mumbai_onehot.head()
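
To make the encoding step concrete, here is a plain-Python sketch of what one-hot encoding does to the two categorical columns. It is not how `pd.get_dummies` is implemented; the helper `one_hot` and the two sample rows are illustrative assumptions, not data from the notebook.

```python
def one_hot(rows, columns):
    """One-hot encode the given columns of a list of dict rows."""
    # distinct values per column, sorted for a stable column order
    levels = {col: sorted({row[col] for row in rows}) for col in columns}
    encoded = []
    for row in rows:
        out = {}
        for col in columns:
            for level in levels[col]:
                out[level] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded

# illustrative rows only
rows = [
    {"categories_new": "fast_food", "total_likes_cat": "good"},
    {"categories_new": "Hotels", "total_likes_cat": "poor"},
]
encoded = one_hot(rows, ["categories_new", "total_likes_cat"])
print(encoded[0])  # {'Hotels': 0, 'fast_food': 1, 'good': 1, 'poor': 0}
```

Each venue becomes a row of 0/1 indicators, which is the numeric form that k-means needs to compute distances.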


In [None]:
cluster_df = mumbai_onehot.drop('Name', axis=1)

k_clusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(cluster_df)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
mumbai_venues['label'] = kmeans.labels_
mumbai_venues.head()

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i+x+(i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_venues['lat'], mumbai_venues['lng'], mumbai_venues['name'], mumbai_venues['label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# characteristics (label = 0)
 o Good quality food
 
 o Mostly fast food, hang-out spots, or ready-made food

In [None]:
mumbai_venues.loc[mumbai_venues['label']==0]

In [None]:
# characteristics (label = 1)
# below average food
# Mostly veg_restaurant or fast_food

In [None]:
mumbai_venues.loc[mumbai_venues['label']==1]

In [None]:
# characteristics (label = 2)
# Poor quality food
# Mostly all types of restaurants

In [None]:
mumbai_venues.loc[mumbai_venues['label']==2]

In [None]:
# characteristics (label = 3)
# Above average food
# Mostly Hotels

In [None]:
mumbai_venues.loc[mumbai_venues['label']==3]