<h1>Restaurant Site Selection in California</h1>

<h4>Setting up Environment</h4>

In [6]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes 
print('Environment solved.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         238 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.50-py_0        conda-forge
    geopy:         1.20.0-py_0      conda-forge

The following packages will be UPDATED:

    certifi:       2019.6.

<h4>Importing required Libraries</h4>

In [7]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<h4> Load and explore the dataset </h4>

In [8]:
us_data = pd.read_csv('uscities.csv')
us_data.head()

Unnamed: 0,city,city_ascii,state_id,state_name,county_fips,county_name,county_fips_all,county_name_all,lat,lng,population,density,source,military,incorporated,timezone,ranking,zips,id
0,South Creek,South Creek,WA,Washington,53053,Pierce,53053,Pierce,46.9994,-122.3921,2500.0,125.0,polygon,False,True,America/Los_Angeles,3,98580 98387 98338,1840116412
1,Roslyn,Roslyn,WA,Washington,53037,Kittitas,53037,Kittitas,47.2507,-121.0989,947.0,84.0,polygon,False,True,America/Los_Angeles,3,98941 98068 98925,1840097718
2,Sprague,Sprague,WA,Washington,53043,Lincoln,53043,Lincoln,47.3048,-117.9713,441.0,163.0,polygon,False,True,America/Los_Angeles,3,99032,1840096300
3,Gig Harbor,Gig Harbor,WA,Washington,53053,Pierce,53053,Pierce,47.3352,-122.5968,9507.0,622.0,polygon,False,True,America/Los_Angeles,3,98332 98335,1840097082
4,Lake Cassidy,Lake Cassidy,WA,Washington,53061,Snohomish,53061,Snohomish,48.0639,-122.092,3591.0,131.0,polygon,False,True,America/Los_Angeles,3,98223 98258 98270,1840116371


In [6]:
us_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28889 entries, 0 to 28888
Data columns (total 19 columns):
city               28889 non-null object
city_ascii         28889 non-null object
state_id           28889 non-null object
state_name         28889 non-null object
county_fips        28889 non-null int64
county_name        28889 non-null object
county_fips_all    28889 non-null object
county_name_all    28889 non-null object
lat                28889 non-null float64
lng                28889 non-null float64
population         28889 non-null float64
density            28889 non-null float64
source             28889 non-null object
military           28889 non-null bool
incorporated       28889 non-null bool
timezone           28889 non-null object
ranking            28889 non-null int64
zips               28888 non-null object
id                 28889 non-null int64
dtypes: bool(2), float64(4), int64(3), object(10)
memory usage: 3.8+ MB


In [7]:
us_data.shape

(28889, 19)

<h4> Data Cleaning and Wrangling </h4>

Since our area of interest is California, let's remove the data of all other states.

In [9]:
california_data = us_data[us_data['state_name']=='California'].reset_index(drop=True)
california_data.head()

Unnamed: 0,city,city_ascii,state_id,state_name,county_fips,county_name,county_fips_all,county_name_all,lat,lng,population,density,source,military,incorporated,timezone,ranking,zips,id
0,El Dorado Hills,El Dorado Hills,CA,California,6017,El Dorado,6017,El Dorado,38.675,-121.049,45104.0,359.0,polygon,False,False,America/Los_Angeles,3,95672 95762 95682,1840112094
1,Lemon Cove,Lemon Cove,CA,California,6107,Tulare,6107,Tulare,36.379,-119.0312,232.0,107.0,polygon,False,True,America/Los_Angeles,3,93244,1840112606
2,Dillon Beach,Dillon Beach,CA,California,6041,Marin,6041,Marin,38.2436,-122.956,156.0,20.0,polygon,False,False,America/Los_Angeles,3,94929 94971,1840112676
3,Patterson Tract,Patterson Tract,CA,California,6107,Tulare,6107,Tulare,36.3795,-119.2956,2320.0,619.0,polygon,False,True,America/Los_Angeles,3,93291,1840116495
4,Redcrest,Redcrest,CA,California,6023,Humboldt,6023,Humboldt,40.3987,-123.9474,36.0,23.0,polygon,False,False,America/Los_Angeles,3,95569,1840117585


Let us also remove the fields which will not be used for data analyzation.

In [10]:
california_data.drop(['city_ascii','state_id','state_name','county_name','county_fips','county_fips_all','county_name_all',
                      'source','military','timezone','zips','id'],axis=1,inplace=True)
california_data.head()

Unnamed: 0,city,lat,lng,population,density,incorporated,ranking
0,El Dorado Hills,38.675,-121.049,45104.0,359.0,False,3
1,Lemon Cove,36.379,-119.0312,232.0,107.0,True,3
2,Dillon Beach,38.2436,-122.956,156.0,20.0,False,3
3,Patterson Tract,36.3795,-119.2956,2320.0,619.0,True,3
4,Redcrest,40.3987,-123.9474,36.0,23.0,False,3


Field incorporated contains TRUE if the place is a city/town and FALSE if the place is just a commonly known name for a
populated area. Hence, let's filter the data for only the places which are city/town of California.

In [11]:
california_df = california_data[california_data.incorporated == True].reset_index(drop=True)
california_df.head()

Unnamed: 0,city,lat,lng,population,density,incorporated,ranking
0,Lemon Cove,36.379,-119.0312,232.0,107.0,True,3
1,Patterson Tract,36.3795,-119.2956,2320.0,619.0,True,3
2,Madera,36.964,-120.0803,83636.0,1602.0,True,3
3,Stanton,33.8002,-117.9935,38528.0,4802.0,True,2
4,Amador City,38.419,-120.8232,190.0,236.0,True,3


In [12]:
california_df.drop(['incorporated'],axis=1,inplace=True)
california_df.head()

Unnamed: 0,city,lat,lng,population,density,ranking
0,Lemon Cove,36.379,-119.0312,232.0,107.0,3
1,Patterson Tract,36.3795,-119.2956,2320.0,619.0,3
2,Madera,36.964,-120.0803,83636.0,1602.0,3
3,Stanton,33.8002,-117.9935,38528.0,4802.0,2
4,Amador City,38.419,-120.8232,190.0,236.0,3


Now, let's use the population and density i.e. the estimated population per square kilometer to get the estimated km size 
of the city.

In [13]:
california_df['area(km)'] = california_df['population']/california_df['density']
california_df.head()

Unnamed: 0,city,lat,lng,population,density,ranking,area(km)
0,Lemon Cove,36.379,-119.0312,232.0,107.0,3,2.168224
1,Patterson Tract,36.3795,-119.2956,2320.0,619.0,3,3.747981
2,Madera,36.964,-120.0803,83636.0,1602.0,3,52.207241
3,Stanton,33.8002,-117.9935,38528.0,4802.0,2,8.023324
4,Amador City,38.419,-120.8232,190.0,236.0,3,0.805085


Now, we can remove the density field since it's no longer required.

In [14]:
california_df.drop(['density'],axis=1,inplace=True)
california_df.head()

Unnamed: 0,city,lat,lng,population,ranking,area(km)
0,Lemon Cove,36.379,-119.0312,232.0,3,2.168224
1,Patterson Tract,36.3795,-119.2956,2320.0,3,3.747981
2,Madera,36.964,-120.0803,83636.0,3,52.207241
3,Stanton,33.8002,-117.9935,38528.0,2,8.023324
4,Amador City,38.419,-120.8232,190.0,3,0.805085


In [15]:
california_df.shape

(623, 6)

The above dataframe now contains only the relevant fields and records which will be further analysed.

<h4>Defining and Visualising locations</h4>

Let us first get the coordinates of California state.

In [15]:
address = 'California, US'

geolocator = Nominatim(user_agent="cal_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinates of California state are {}, {}.'.format(latitude, longitude))

The geograpical coordinates of California state are 36.7014631, -118.7559974.


Now we will visualize our location data including California state and all the cities of our dataset on a map.

In [16]:
# create map of New York using latitude and longitude values
map_california = folium.Map(location=[latitude, longitude], zoom_start=6)

state = 'California'

# add markers to map
for lat, lng, city in zip(california_df['lat'], california_df['lng'], california_df['city']):
    label = '{}, {}'.format(city, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_california)  
    
map_california

The above map clearly shows the cities inside California that we will study for further analysis. Let us now start 
utilizing the Foursquare API to explore these cities.

<h4>Utilizing the Foursquare API</h4>

Defining Foursquare credentials and version

In [16]:
CLIENT_ID = 'I2ANMNMA1RO24SLJ45C1MEKG3QAZYJNQGSVUV02DYOJPHCYH' #Foursquare ID
CLIENT_SECRET = 'YPBRAMVYEWN2C21KTIPMECKS4Q15DI2IWNJXOAOONJ03EHC2' #Foursquare Secret
VERSION = '20180605' # Foursquare API version

Now we will start exploring the cities through Foursquare API.

<h4>Exploring the cities</h4>

Let's begin by creating a function to get the details of the cities from our dataset.

In [17]:
def getCityDetails(i):
        city_name = california_df.loc[i, 'city']
        city_lat = california_df.loc[i, 'lat'] 
        city_lng = california_df.loc[i, 'lng'] 
        city_area = california_df.loc[0,'area(km)']
        return city_name, city_lat, city_lng, city_area

Let us now create a function to query the Foursquare API and get response of all the restaurants in the cities of our dataset.

In [20]:
def getQueryResponse(i):
    city_name, city_lat, city_lng, city_area = getCityDetails(i)
    LIMIT = 10 # limit of number of venues returned by Foursquare API
    radius = city_area*1000 #take radius as area of city(in metres)
    catId = '4d4b7105d754a06374d81259' #take the category id of Food to get all restaurants in a city
    
    url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    city_lat, 
    city_lng,
    catId,
    radius,
    LIMIT)
    
    results = requests.get(url).json()
    return results


Let us also create a dataframe in which we will save our parsed response for all the cities.

In [21]:
city_df = pd.DataFrame(columns=['city','place','id','category','lat','lng'])
city_df

Unnamed: 0,city,place,id,category,lat,lng


Now lets create a function to clean the json and structure the query response into a *pandas* dataframe.

In [22]:
def getCityInfo(i):

    results = getQueryResponse(i)
    venues = results['response']['venues']
    
    k = city_df.shape[0]

    for venue in venues:
        city_df.loc[k,'city'] = california_df.loc[i, 'city']
        city_df.loc[k,'place'] = venue['name']
        city_df.loc[k,'id'] = venue['id']

        catList = venue['categories']
        for j in range(0,len(catList)):
            if(j == len(catList)-1):
                cat = catList[j]['name']
            else:
                cat = catList[j]['name']+','
        city_df.loc[k,'category'] = cat

        city_df.loc[k,'lat'] = venue['location']['lat']
        city_df.loc[k,'lng'] = venue['location']['lng']
        k = k+1

Let us use the above functions to make our final dataframe with data for all the cities and restaurants in it

In [25]:
#for i in range(0,california_df.shape[0]):
#Taking into account only the 10 cities since there is a limit on free plan of Foursquare API
for i in range(0,10):
    getCityInfo(i)
    
city_df.head()

Unnamed: 0,city,place,id,category,lat,lng
0,Lemon Cove,Red Barn Bar-B-Q,4f32a87919836c91c7ed28b5,Food,36.3968,-119.02
1,Lemon Cove,Alfarez Rustic Orchard (Alferez Rustic Orchard),5782a420498e9779ee1e695d,Diner,36.395,-119.021
2,Patterson Tract,El Taco Chino,52db2947498ee5f6278bb6fc,Food Truck,36.3748,-119.297
3,Patterson Tract,McDonald's,4c3e88121ef0d13aef879280,Fast Food Restaurant,36.3572,-119.297
4,Patterson Tract,Starbucks,4fc94580d4f24895b4467ca9,Coffee Shop,36.3605,-119.297


Now we will create a function to query the Foursquare API and get the response indicating the count of users that like the above venues.

In [26]:
def getLikeCount(i):
    venue_id = city_df.loc[i,'id']
    
    url = 'https://api.foursquare.com/v2/venues/{}/likes?&client_id={}&client_secret={}&v={}'.format(
    venue_id,CLIENT_ID, CLIENT_SECRET, VERSION)
    
    results = requests.get(url).json()
    return results

Let us add another column indicating the count of users who liked the above venues

In [27]:
for i in range(0,city_df.shape[0]):
    results = getLikeCount(i)
    count = results['response']['likes']['count']
    city_df.loc[i,'like_count'] = count
    
city_df.head()

Unnamed: 0,city,place,id,category,lat,lng,like_count
0,Lemon Cove,Red Barn Bar-B-Q,4f32a87919836c91c7ed28b5,Food,36.3968,-119.02,0.0
1,Lemon Cove,Alfarez Rustic Orchard (Alferez Rustic Orchard),5782a420498e9779ee1e695d,Diner,36.395,-119.021,0.0
2,Patterson Tract,El Taco Chino,52db2947498ee5f6278bb6fc,Food Truck,36.3748,-119.297,0.0
3,Patterson Tract,McDonald's,4c3e88121ef0d13aef879280,Fast Food Restaurant,36.3572,-119.297,6.0
4,Patterson Tract,Starbucks,4fc94580d4f24895b4467ca9,Coffee Shop,36.3605,-119.297,6.0


Let's try to analyze the above dataset

In [28]:
city_df.shape

(219, 7)

In [29]:
print(len(city_df['category'].unique()))

30


This shows that there are many types of food places present in these cities. Let's analyze these values further.

<h4>Analyzing each city</h4>

In [30]:
# one hot encoding
category_onehot = pd.get_dummies(city_df[['category']], prefix="", prefix_sep="")

# add city column back to dataframe
category_onehot['city'] = city_df['city']

# move city column to the first column
fixed_columns = [category_onehot.columns[-1]] + list(category_onehot.columns[:-1])
category_onehot = category_onehot[fixed_columns]

category_onehot.head()

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Lemon Cove,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Lemon Cove,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Patterson Tract,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Patterson Tract,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Patterson Tract,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
category_onehot.shape

(219, 31)

Given the above data, we will also take into account the likability of these places.

In [32]:
for i,column in enumerate(category_onehot.columns):
    if(i == 0):
        continue
    category_onehot[column] = category_onehot[column]*city_df['like_count']
    
category_onehot.head()

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Lemon Cove,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Lemon Cove,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Patterson Tract,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Patterson Tract,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Patterson Tract,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, we will group rows by cities and take the mean of the frequency of occurrence of each category.

In [95]:
city_grouped = category_onehot.groupby('city').mean().reset_index()
city_grouped.head()

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Amador City,0.0,0.0,0.0,0.0,12.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Fairfax,0.0,0.0,0.1,0.0,0.0,0.0,2.9,0.0,6.6,0.0,0.0,1.5,0.0,0.0,0.0,0.0,5.3,1.5,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,2.0,0.0
2,Inglewood,2.7,0.0,0.0,0.0,0.0,22.7,0.0,0.9,6.4,0.0,2.2,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.3,0.0,0.0,5.1,0.0
3,Lemon Cove,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Madera,0.0,1.6,0.0,0.0,0.0,0.0,0.0,0.0,6.5,0.0,1.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.4,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.1


Now we will also take into account the importance of cities. This is indicated by ranking in our original dataset.

In [96]:
for i,column in enumerate(city_grouped.columns):
    if(i == 0):
        continue
    city_grouped[column] = city_grouped[column]*california_df['ranking']
    
city_grouped.head()

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Amador City,0.0,0.0,0.0,0.0,37.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Fairfax,0.0,0.0,0.3,0.0,0.0,0.0,8.7,0.0,19.8,0.0,0.0,4.5,0.0,0.0,0.0,0.0,15.9,4.5,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,6.0,0.0
2,Inglewood,8.1,0.0,0.0,0.0,0.0,68.1,0.0,2.7,19.2,0.0,6.6,0.0,0.0,0.0,7.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,111.9,0.0,0.0,15.3,0.0
3,Lemon Cove,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Madera,0.0,4.8,0.0,0.0,0.0,0.0,0.0,0.0,19.5,0.0,4.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.2,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.3


Looking at the above dataset, we can realize that food places with zero likability are insignificant to realize people's taste. Hence, we will remove the cities with zero likability for all the categories.

In [98]:
for i in range(0,city_grouped.shape[0]):
    k = 0
    for j in range(1,city_grouped.shape[1]):
        if(city_grouped.iloc[i,j] == 0):
            k = k+1
    if(k == city_grouped.shape[1]-1):
        city_grouped = city_grouped.drop([i],axis=0)

city_grouped.reset_index(drop=True)
city_grouped.head()

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Amador City,0.0,0.0,0.0,0.0,37.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Fairfax,0.0,0.0,0.3,0.0,0.0,0.0,8.7,0.0,19.8,0.0,0.0,4.5,0.0,0.0,0.0,0.0,15.9,4.5,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,6.0,0.0
2,Inglewood,8.1,0.0,0.0,0.0,0.0,68.1,0.0,2.7,19.2,0.0,6.6,0.0,0.0,0.0,7.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,111.9,0.0,0.0,15.3,0.0
4,Madera,0.0,4.8,0.0,0.0,0.0,0.0,0.0,0.0,19.5,0.0,4.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.2,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.3
5,Patterson Tract,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.333333,0.0,1.333333,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,4.666667,0.0,0.0,0.666667,0.666667,0.0,0.0,0.0,0.0,0.0


In [99]:
#This is the final dataset we will use for analysis
print('There are {} cities and {} categories of food venues within them.'.format(
    city_grouped.shape[0],
    city_grouped.shape[1]))

There are 8 cities and 31 categories of food venues within them.


<h4>ANALYSIS 1: Analyzing most common food places for all the cities</h4>

In [100]:
num_top_venues = 5
for data in city_grouped['city']:
    print("****"+data+"****")
    temp = city_grouped[city_grouped['city'] == data].T.reset_index()
    temp.columns = ['Food_place','likability_indicator']
    temp = temp.iloc[1:]
    temp['likability_indicator'] = temp['likability_indicator'].astype(float)
    temp = temp.round({'likability_indicator': 2})
    print(temp.sort_values('likability_indicator', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

****Amador City****
           Food_place  likability_indicator
0              Bakery                  37.5
1  African Restaurant                   0.0
2      Ice Cream Shop                   0.0
3          Taco Place                   0.0
4    Sushi Restaurant                   0.0


****Fairfax****
           Food_place  likability_indicator
0         Coffee Shop                  19.8
1      Ice Cream Shop                  15.9
2                Café                   8.7
3          Taco Place                   6.0
4  Italian Restaurant                   4.5


****Inglewood****
                        Food_place  likability_indicator
0  Southern / Soul Food Restaurant                 111.9
1                     Burger Joint                  68.1
2                      Coffee Shop                  19.2
3                       Taco Place                  15.3
4               African Restaurant                   8.1


****Madera****
             Food_place  likability_indicator
0        

Now, let's write a function to sort the food places in descending order.

In [46]:
def get_most_liked_food_places(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 5 venues for each neighborhood.

In [47]:
num_top_venues =5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['city']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cities_venues_sorted = pd.DataFrame(columns=columns)
cities_venues_sorted['city'] = city_grouped['city']

for ind in np.arange(city_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 1:] = get_most_liked_food_places(city_grouped.iloc[ind, :], num_top_venues)

cities_venues_sorted.head()

Unnamed: 0,city,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Amador City,Bakery,Theme Restaurant,Taco Place,American Restaurant,BBQ Joint
1,Fairfax,Coffee Shop,Ice Cream Shop,Café,Taco Place,Italian Restaurant
2,Inglewood,Southern / Soul Food Restaurant,Burger Joint,Coffee Shop,Taco Place,African Restaurant
4,Madera,Coffee Shop,Mexican Restaurant,American Restaurant,Fast Food Restaurant,Theme Restaurant
5,Patterson Tract,Mexican Restaurant,Fast Food Restaurant,Coffee Shop,Sandwich Place,Pizza Place


The above analysis shows the data for all the cities. If somebody is willing to start a new food venue in a particular city, then above data clearly suggests what these venues should be. 

<h4>ANALYSIS 2: Analyzing best cities to open a particular food place</h4>

This analysis will help us determine which city is best suited to open up a particular food venue.

In [101]:
city_grouped

Unnamed: 0,city,African Restaurant,American Restaurant,BBQ Joint,Bagel Shop,Bakery,Burger Joint,Café,Caribbean Restaurant,Coffee Shop,Diner,Fast Food Restaurant,Food,Food Truck,Fried Chicken Joint,Grocery Store,Hawaiian Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Pie Shop,Pizza Place,Sandwich Place,Southern / Soul Food Restaurant,Supermarket,Sushi Restaurant,Taco Place,Theme Restaurant
0,Amador City,0.0,0.0,0.0,0.0,37.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Fairfax,0.0,0.0,0.3,0.0,0.0,0.0,8.7,0.0,19.8,0.0,0.0,4.5,0.0,0.0,0.0,0.0,15.9,4.5,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0,0.0,0.0,6.0,0.0
2,Inglewood,8.1,0.0,0.0,0.0,0.0,68.1,0.0,2.7,19.2,0.0,6.6,0.0,0.0,0.0,7.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,111.9,0.0,0.0,15.3,0.0
4,Madera,0.0,4.8,0.0,0.0,0.0,0.0,0.0,0.0,19.5,0.0,4.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.2,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0,0.3
5,Patterson Tract,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.333333,0.0,1.333333,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,4.666667,0.0,0.0,0.666667,0.666667,0.0,0.0,0.0,0.0,0.0
6,Scotts Valley,0.0,0.0,3.6,0.3,0.0,0.0,0.6,0.0,9.3,0.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.6,0.0,0.0,13.5,4.2,0.0
7,Stanton,0.0,0.0,0.0,0.0,0.0,12.6,0.0,0.0,27.9,0.0,5.7,0.0,0.0,1.2,5.4,0.0,0.0,0.0,32.1,16.2,0.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Vallejo,0.0,0.0,0.0,0.0,0.0,2.1,0.0,0.0,37.5,0.0,11.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,2.4,0.0,0.3,0.0,0.0,0.0


We will now create a new dataframe to display the most preferred city for each city as per our data. 

In [144]:
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['category']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Preferred City'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Preferred City'.format(ind+1))

# create a new dataframe
cities_sorted = pd.DataFrame(columns=columns)
cities_sorted['category'] = city_grouped.columns

cities_sorted = cities_sorted.set_index('category')

#Remove the first row with city(first column from previous dataframe) as category
cities_sorted = cities_sorted.drop(['city'],axis=0)

cities_sorted.head()

Unnamed: 0_level_0,1st Most Preferred City,2nd Most Preferred City,3rd Most Preferred City,4th Most Preferred City,5th Most Preferred City
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
African Restaurant,,,,,
American Restaurant,,,,,
BBQ Joint,,,,,
Bagel Shop,,,,,
Bakery,,,,,


Let's now fill in the dataframe above using values from our original dataset.

In [145]:
num_top_cities = 5
for i,category in enumerate(cities_sorted.index.values.tolist()):
    temp = city_grouped.sort_values(by=[category],ascending=False)
    for j in range(0,num_top_cities):
        cities_sorted.iloc[i,j] = temp.iloc[j,0]
        
cities_sorted.head()

Unnamed: 0_level_0,1st Most Preferred City,2nd Most Preferred City,3rd Most Preferred City,4th Most Preferred City,5th Most Preferred City
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
African Restaurant,Inglewood,Amador City,Fairfax,Madera,Patterson Tract
American Restaurant,Madera,Amador City,Fairfax,Inglewood,Patterson Tract
BBQ Joint,Scotts Valley,Fairfax,Amador City,Inglewood,Madera
Bagel Shop,Scotts Valley,Amador City,Fairfax,Inglewood,Madera
Bakery,Amador City,Fairfax,Inglewood,Madera,Patterson Tract


In [92]:
cities_sorted.shape

(30, 5)

The above dataframe clearly shows which city should be most preferred for opening a particular category of food place in California.

<h4>Clustering Cities for further analysis</h4>

Lets begin by creating a dataframe to facilitate our clustering.

In [146]:
cities_clustering = city_grouped.transpose()
cities_clustering = cities_clustering.drop('city', 0)
cities_clustering.head()

Unnamed: 0,0,1,2,4,5,6,7,8
African Restaurant,0.0,0.0,8.1,0.0,0,0.0,0,0
American Restaurant,0.0,0.0,0.0,4.8,0,0.0,0,0
BBQ Joint,0.0,0.3,0.0,0.0,0,3.6,0,0
Bagel Shop,0.0,0.0,0.0,0.0,0,0.3,0,0
Bakery,37.5,0.0,0.0,0.0,0,0.0,0,0


We will run *k*-means to cluster the food categories into 5 clusters.

In [147]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cities_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 3, 0, 0, 2, 0], dtype=int32)

Let's create a new dataframe that includes the cluster labels as well as the top 5 cities for each food category.

In [168]:
cluster_merged = cities_sorted
cluster_merged['Cluster labels'] = kmeans.labels_
cluster_merged.head()

Unnamed: 0_level_0,1st Most Preferred City,2nd Most Preferred City,3rd Most Preferred City,4th Most Preferred City,5th Most Preferred City,Cluster labels
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
African Restaurant,Inglewood,Amador City,Fairfax,Madera,Patterson Tract,0
American Restaurant,Madera,Amador City,Fairfax,Inglewood,Patterson Tract,0
BBQ Joint,Scotts Valley,Fairfax,Amador City,Inglewood,Madera,0
Bagel Shop,Scotts Valley,Amador City,Fairfax,Inglewood,Madera,0
Bakery,Amador City,Fairfax,Inglewood,Madera,Patterson Tract,0


We can also group categories by cluster labels. Categories belonging to same cluster are similarly liked by people.

In [169]:
#Combining neighborhoods of same PostalCode
cluster_grouped = cluster_merged.reset_index()
cluster_grouped = cluster_grouped.groupby('Cluster labels').agg({'1st Most Preferred City':'first',
                                                                 '2nd Most Preferred City':'first',
                                                                 '3rd Most Preferred City':'first',
                                                                 '4th Most Preferred City':'first',
                                                                 '5th Most Preferred City':'first',
                                                                 'category': ', '.join}).reset_index()
cluster_grouped.head()

Unnamed: 0,Cluster labels,1st Most Preferred City,2nd Most Preferred City,3rd Most Preferred City,4th Most Preferred City,5th Most Preferred City,category
0,0,Inglewood,Amador City,Fairfax,Madera,Patterson Tract,"African Restaurant, American Restaurant, BBQ J..."
1,1,Inglewood,Amador City,Fairfax,Madera,Patterson Tract,Southern / Soul Food Restaurant
2,2,Vallejo,Stanton,Fairfax,Madera,Inglewood,Coffee Shop
3,3,Inglewood,Stanton,Vallejo,Amador City,Fairfax,Burger Joint
4,4,Stanton,Amador City,Fairfax,Inglewood,Madera,"Japanese Restaurant, Mediterranean Restaurant,..."


The above analysis show that all categories in a particular cluster are equally liked by people in the cities mentioned for that cluster. So these facts can be used to combine the menu of a new food place. For eg, a person opening up a Restaurant in Inglewood can keep its menu a mixture of African and American dishes since these are the most liked restaurants in this city of California.

This concludes our analysis. Thank you for reading!

<h4>Developed by PREETI SETHI</h4>