# Final capstone project
## Based on data of Auckland

## 1 The problem:
Which zones best match a client's food preferences?

Background:  
In today's world, people move across the globe quite often. A person who lives in North America could originally be from South Africa, or vice versa. It can therefore matter to people to live close to restaurants serving their favourite food. Existing research also suggests that cuisines from some countries or continents are similar to one another. Hence it is possible to cluster zones based on the types of restaurants nearby.

Audience:
The target audience is anyone interested in where restaurants serving their favourite cuisine are located. They could even use this information to choose where to live, since delicious food is essential to daily life.

## 2 The data
Zone locations:  
The zone location data includes the name, postal code, and coordinates of each zone so that the zones can be placed on a map. The data needs to be localised to a particular city rather than the whole world. For this project, I will try to get data from the City Council for Auckland, New Zealand. If I can't, I will use Toronto or New York instead.

Restaurants:  
Restaurant data is retrieved through the Foursquare API. A 500 m radius is used at first. If too few restaurants are returned, the radius is increased to 1 km. If there are too many, they are filtered by rating.
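
The radius rule described above could be sketched as a small helper. This is only an illustration: `pick_radius` and the `min_venues` threshold are hypothetical names and values, not part of the notebook, and the counts are made up.

```python
def pick_radius(count_at_radius, radii=(500, 1000), min_venues=5):
    """Return the smallest radius (metres) whose restaurant count reaches min_venues.

    count_at_radius maps a radius in metres to the number of restaurants found.
    Falls back to the largest radius when no radius reaches the threshold.
    """
    for r in sorted(radii):
        if count_at_radius.get(r, 0) >= min_venues:
            return r
    return max(radii)

# Illustrative counts only, not measured data
print(pick_radius({500: 2, 1000: 9}))    # too few at 500 m, so widen the search
print(pick_radius({500: 12, 1000: 30}))  # 500 m already returns enough
```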

## 3 Methodology


Postal code and zone data downloaded from:  
https://www.aggdata.com/free/new-zealand-postal-codes

## Exploratory data analysis

Load data into pandas DataFrame

In [190]:
import pandas as pd
import numpy as np
file_name = 'nz_postal_codes.csv'
df = pd.read_csv(file_name)
print(df.shape)
df.head()

(1737, 4)


Unnamed: 0,Postal Code,Place Name,Latitude,Longitude
0,110,Woodhill,-35.749,174.327
1,112,Whau Valley,-35.693,174.3001
2,114,Waro,-35.5909,174.2815
3,116,Ruakaka,-35.892,174.4644
4,140,Riverside,-35.7245,174.3213


Keep only places in Auckland, selected by postal codes from 0600 to 2299

In [197]:
auckland_df = df[df['Postal Code'].isin(range(600, 2300))]
auckland_df.reset_index(drop = True, inplace = True)
auckland_df.columns = ['postcode', 'zone', 'latitude', 'longitude']
print(auckland_df.shape)
auckland_df.head()

(231, 4)


Unnamed: 0,postcode,zone,latitude,longitude
0,600,Blockhouse Bay,-36.9158,174.6922
1,602,Kelston,-36.9042,174.6474
2,604,Huia,-36.9808,174.5827
3,610,Te Atatu South,-36.8531,174.6466
4,612,McLaren Park,-36.8988,174.5918
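
Because pandas reads the `Postal Code` column as integers, the code `0600` appears as `600` in the tables above. A minimal sketch of restoring the four-digit form, assuming the same column names; the inline sample stands in for the CSV file, and its last row is a made-up entry used only to show the upper bound of the filter.

```python
import io
import pandas as pd

# Inline sample standing in for nz_postal_codes.csv (last row is hypothetical)
sample = io.StringIO(
    "Postal Code,Place Name,Latitude,Longitude\n"
    "110,Woodhill,-35.749,174.327\n"
    "600,Blockhouse Bay,-36.9158,174.6922\n"
    "2300,Outside Range,-37.0,175.0\n"
)
df_sample = pd.read_csv(sample)

# Numeric filter, equivalent to isin(range(600, 2300)); between() is inclusive
in_auckland = df_sample["Postal Code"].between(600, 2299)
auckland_sample = df_sample[in_auckland].copy()

# Restore the four-digit display form, e.g. 600 -> "0600"
auckland_sample["postcode_str"] = (
    auckland_sample["Postal Code"].astype(str).str.zfill(4)
)
print(auckland_sample[["postcode_str", "Place Name"]])
```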


Group by zone

In [198]:
auckland_df = auckland_df.drop('postcode', axis = 1).groupby(['zone']).mean().reset_index()
print(auckland_df.shape)
auckland_df.head()

(160, 3)


Unnamed: 0,zone,latitude,longitude
0,Ahuroa,-36.501,174.5307
1,Albany,-36.72795,174.70135
2,Algies Bay,-36.4404,174.7235
3,Arch Hill,-36.863,174.7432
4,Auckland,-36.8625,174.7658
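
The row count drops from 231 to 160 because several postcodes share a zone name, and `mean()` then averages their coordinates into one point per zone. A toy sketch of that behaviour, with hypothetical zone names and coordinates:

```python
import pandas as pd

# Hypothetical rows: two postcodes share zone "A"
rows = pd.DataFrame({
    "postcode": [1, 2, 3],
    "zone": ["A", "A", "B"],
    "latitude": [-36.70, -36.80, -36.90],
    "longitude": [174.70, 174.72, 174.60],
})

# Same pattern as the cell above: drop the postcode, average per zone
centroids = rows.drop("postcode", axis=1).groupby("zone").mean().reset_index()
print(centroids)
```

Zone "A" ends up with a single row whose latitude and longitude are the averages of its two postcodes.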


Get the geographical coordinates of Auckland

In [199]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'Auckland'
geolocator = Nominatim(user_agent = "ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('{}, {}'.format(latitude, longitude))

-36.852095, 174.7631803


Create a map of Auckland with zones superimposed on top

In [201]:
import folium # map rendering library
map_auckland = folium.Map(location = [latitude, longitude], zoom_start = 10)
# add markers to map
for lat, lng, zone in zip(auckland_df['latitude'], auckland_df['longitude'], auckland_df['zone']):
    label = '{}'.format(zone)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_auckland)    
map_auckland

Define Foursquare credentials and version

In [204]:
CLIENT_ID = 'EB1TH5JVTSQEZ3KR1ZRVO315PQAOHWBCOQYBQPEUN444O5ST' # your Foursquare ID
CLIENT_SECRET = '2DUTWBOODRK551DXQQ4RVYEQRDHSN1CSL0JZUSVFIWSXFZIL' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
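
Hard-coding keys in a shared notebook exposes them to every reader. One possible alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` below are assumptions chosen for this sketch, not names required by Foursquare.

```python
import os

# Hypothetical variable names; set them in the shell before starting the notebook
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")
VERSION = "20180605"  # same API version string as above

if not CLIENT_ID or not CLIENT_SECRET:
    print("Foursquare credentials are not set; API calls will fail.")
```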

Get nearby venues of zones in Auckland from Foursquare and save to auckland_venues.csv

In [205]:
import requests # library to handle requests
def getNearbyVenues(names, latitudes, longitudes, radius = 1000, LIMIT = 100): 
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
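
The list comprehension in `getNearbyVenues` indexes `categories[0]`, which would raise an `IndexError` for any venue returned without a category. A hedged sketch of a more defensive parsing step follows; `items_to_rows` is a hypothetical helper, and the second item is hand-written to mimic the response shape used above rather than real API output.

```python
import pandas as pd

def items_to_rows(name, lat, lng, items):
    """Flatten 'explore' items into tuples, skipping venues with no category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        if not categories:
            continue  # nothing to classify this venue by
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows

# Hand-written items; the second one is a made-up venue without a category
fake_items = [
    {"venue": {"name": "Matsu Sushi",
               "location": {"lat": -36.728465, "lng": 174.708699},
               "categories": [{"name": "Sushi Restaurant"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": -36.73, "lng": 174.70},
               "categories": []}},
]
rows = items_to_rows("Albany", -36.72795, 174.70135, fake_items)
nearby = pd.DataFrame(rows, columns=[
    "Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
    "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"])
print(nearby)
```

In the request itself, passing `timeout=` to `requests.get` and calling `raise_for_status()` on the response would surface network or authentication failures before the JSON is parsed.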

In [207]:
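# The API call is commented out here, so auckland_venues must already be in memory from an earlier run
#auckland_venues = getNearbyVenues(names = auckland_df['zone'], latitudes = auckland_df['latitude'], longitudes = auckland_df['longitude'])
auckland_venues.to_csv('auckland_venues.csv')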

Check the size of the resulting dataframe

In [208]:
print(auckland_venues.shape)
auckland_venues.head()

(2537, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albany,-36.72795,174.70135,QBE Stadium,-36.726937,174.702059,Stadium
1,Albany,-36.72795,174.70135,茶顏觀色 Bubble Tea Cafe,-36.726178,174.697295,Bubble Tea Shop
2,Albany,-36.72795,174.70135,Albany Mega Centre,-36.73122,174.706719,Shopping Mall
3,Albany,-36.72795,174.70135,K-Mart,-36.728644,174.709722,Department Store
4,Albany,-36.72795,174.70135,Opium Cafe,-36.725174,174.695026,Café


Keep only restaurant venues

In [209]:
auckland_food = auckland_venues[auckland_venues['Venue Category'].str.contains('Restaurant')].reset_index(drop = True)
print(auckland_food.shape)
auckland_food.head()

(645, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albany,-36.72795,174.70135,Matsu Sushi,-36.728465,174.708699,Sushi Restaurant
1,Albany,-36.72795,174.70135,The Wine Box,-36.724628,174.694747,Restaurant
2,Albany,-36.72795,174.70135,McDonald's,-36.729671,174.703299,Fast Food Restaurant
3,Albany,-36.72795,174.70135,Nando's,-36.732591,174.708883,Portuguese Restaurant
4,Albany,-36.72795,174.70135,Lone Star Albany,-36.72282,174.704525,American Restaurant


Check how many venues were returned for each zone

In [210]:
auckland_food.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Albany,12,12,12,12,12,12
Arch Hill,26,26,26,26,26,26
Auckland,32,32,32,32,32,32
Avondale,1,1,1,1,1,1
Balmoral,19,19,19,19,19,19
...,...,...,...,...,...,...
Wesley,4,4,4,4,4,4
West Harbour,4,4,4,4,4,4
Western Springs,13,13,13,13,13,13
Whangaparaoa,1,1,1,1,1,1


## Inferential statistical testing

Analyze each zone

In [212]:
# one hot encoding
auckland_onehot = pd.get_dummies(auckland_food[['Venue Category']], prefix = "", prefix_sep = "")
# add neighborhood column back to dataframe
auckland_onehot['Neighborhood'] = auckland_food['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = list(auckland_onehot.columns)
fixed_columns.remove('Neighborhood')
fixed_columns = ['Neighborhood'] + fixed_columns
auckland_onehot = auckland_onehot[fixed_columns]
auckland_onehot.rename(columns = {'Neighborhood': 'zone'}, inplace = True)
print(auckland_onehot.shape)
auckland_onehot.tail()

(645, 49)


Unnamed: 0,zone,American Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,...,Seafood Restaurant,South Indian Restaurant,Sri Lankan Restaurant,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yakitori Restaurant
640,Wiri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
641,Wiri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
642,Wiri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
643,Wiri,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
644,Wiri,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group rows by zone, taking the mean of the frequency of occurrence of each category

In [213]:
auckland_grouped = auckland_onehot.groupby('zone').mean().reset_index()
auckland_grouped.head()

Unnamed: 0,zone,American Restaurant,Argentinian Restaurant,Asian Restaurant,Australian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Colombian Restaurant,Comfort Food Restaurant,...,Seafood Restaurant,South Indian Restaurant,Sri Lankan Restaurant,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yakitori Restaurant
0,Albany,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0
1,Arch Hill,0.038462,0.038462,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.038462,0.0,0.0,0.0,0.115385,0.0,0.0,0.076923,0.0
2,Auckland,0.03125,0.0,0.09375,0.0,0.0,0.0,0.03125,0.0,0.0,...,0.0,0.0625,0.0,0.125,0.0,0.0625,0.0,0.0,0.0,0.03125
3,Avondale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Balmoral,0.0,0.0,0.105263,0.0,0.0,0.0,0.157895,0.0,0.0,...,0.0,0.052632,0.052632,0.0,0.0,0.157895,0.0,0.0,0.0,0.0
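
To see why the mean of one-hot columns is a frequency, here is a toy version of the two cells above. The zones and venue lists are illustrative only; the `astype(float)` call is added so the sketch behaves the same across pandas versions that return boolean or integer dummies.

```python
import pandas as pd

# Toy data: zone A has four restaurants, zone B has one
toy = pd.DataFrame({
    "zone": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Thai Restaurant", "Thai Restaurant",
                       "Sushi Restaurant", "Restaurant", "Sushi Restaurant"],
})

# One indicator column per category, one row per restaurant
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="").astype(float)
onehot.insert(0, "zone", toy["zone"])

# Averaging indicators per zone gives each category's share of that zone's restaurants
freq = onehot.groupby("zone").mean()
print(freq)
```

Each row of `freq` sums to 1, so zones with very different restaurant counts become comparable before clustering.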


Get each zone along with its top 5 most common restaurant categories

In [215]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False) 
    return row_categories_sorted.index.values[0:num_top_venues]

In [216]:
num_top_venues = 5
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['zone']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Restaurant'.format(ind + 1, indicators[ind]))
    except IndexError:
        # no 'st'/'nd'/'rd' suffix left, fall back to 'th'
        columns.append('{}th Most Common Restaurant'.format(ind + 1))
# create a new dataframe
auckland_venues_sorted = pd.DataFrame(columns = columns)
auckland_venues_sorted['zone'] = auckland_grouped['zone']
for ind in np.arange(auckland_grouped.shape[0]):
    auckland_venues_sorted.iloc[ind, 1:] = return_most_common_venues(auckland_grouped.iloc[ind, :], num_top_venues)
auckland_venues_sorted.tail()

Unnamed: 0,zone,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
100,Wesley,Indian Restaurant,Asian Restaurant,Restaurant,Ethiopian Restaurant,Yakitori Restaurant
101,West Harbour,Fast Food Restaurant,American Restaurant,Restaurant,Dutch Restaurant,Indian Restaurant
102,Western Springs,Thai Restaurant,Japanese Restaurant,Middle Eastern Restaurant,Vegetarian / Vegan Restaurant,Turkish Restaurant
103,Whangaparaoa,Indian Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Greek Restaurant
104,Wiri,Fast Food Restaurant,Japanese Restaurant,Sushi Restaurant,Portuguese Restaurant,Australian Restaurant
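
A zone with a single restaurant, such as Whangaparaoa in the counts table, still gets five entries here, and the extra ones are categories with a frequency of zero in arbitrary order. One possible refinement is to list only categories that actually occur. `top_categories` is a hypothetical helper and the sample frequencies are illustrative.

```python
import pandas as pd

def top_categories(freq_row, n=5):
    """Return up to n category names with non-zero frequency, most frequent first."""
    nonzero = freq_row[freq_row > 0]
    return list(nonzero.sort_values(ascending=False).index[:n])

# Illustrative frequency rows (category -> share of the zone's restaurants)
single = pd.Series({"Indian Restaurant": 1.0,
                    "Yakitori Restaurant": 0.0,
                    "Vietnamese Restaurant": 0.0})
mixed = pd.Series({"Thai Restaurant": 0.5,
                   "Sushi Restaurant": 0.3,
                   "Restaurant": 0.2})

print(top_categories(single))
print(top_categories(mixed, n=2))
```

With this variant, a zone's list can be shorter than five entries, so it would need padding before being written into a fixed-width table like the one above.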


## Machine learning used

Cluster zones

In [217]:
from sklearn.cluster import KMeans # import k-means from clustering stage
# set number of clusters
kclusters = 10
auckland_grouped_clustering = auckland_grouped.drop('zone', axis = 1) # positional axis is not accepted by recent pandas
# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(auckland_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([7, 7, 7, 0, 5, 2, 7, 7, 8, 7, 7, 7, 7, 7, 7, 2, 7, 5, 7, 1, 4, 7,
       5, 4, 2, 7, 2, 7, 5, 7, 7, 5, 9, 5, 7, 5, 5, 1, 0, 7, 2, 8, 8, 7,
       7, 7, 7, 5, 5, 5, 3, 7, 0, 7, 7, 8, 5, 7, 0, 7, 1, 7, 8, 5, 0, 5,
       8, 4, 5, 0, 8, 7, 7, 2, 8, 7, 1, 7, 7, 1, 5, 7, 8, 8, 6, 7, 0, 7,
       8, 5, 5, 8, 2, 7, 7, 8, 1, 1, 5, 0, 7, 8, 5, 9, 8], dtype=int32)
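
The value `kclusters = 10` is set by hand. One common way to compare candidate values is the silhouette score from scikit-learn. The sketch below runs on synthetic data (three well-separated blobs) as a stand-in for `auckland_grouped_clustering`; it is not a result for the Auckland data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: three tight blobs in a 4-dimensional feature space
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 4))
               for c in (0.0, 1.0, 2.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real frequency data the scores are usually much closer together, so they are a guide rather than a rule for picking the number of clusters.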

Add clustering labels

In [219]:
auckland_venues_sorted['Cluster Labels'] = kmeans.labels_
auckland_merged = auckland_df
# merge auckland_grouped with auckland_df to add latitude/longitude for each zone
auckland_merged = auckland_merged.join(auckland_venues_sorted.set_index('zone'), on = 'zone')
# zones with no restaurants have no label; mark them as -1
auckland_merged['Cluster Labels'] = auckland_merged['Cluster Labels'].fillna(-1).astype(int)
auckland_merged.head()

Unnamed: 0,zone,latitude,longitude,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant,Cluster Labels
0,Ahuroa,-36.501,174.5307,,,,,,-1
1,Albany,-36.72795,174.70135,Fast Food Restaurant,Mexican Restaurant,Japanese Restaurant,Restaurant,Indonesian Restaurant,7
2,Algies Bay,-36.4404,174.7235,,,,,,-1
3,Arch Hill,-36.863,174.7432,Japanese Restaurant,Italian Restaurant,Restaurant,Thai Restaurant,Mexican Restaurant,7
4,Auckland,-36.8625,174.7658,Sushi Restaurant,Japanese Restaurant,Asian Restaurant,Thai Restaurant,French Restaurant,7


## 4 Results

Visualize the resulting clusters. Note that rural areas with no restaurant within 1 km are marked in black.

In [221]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 10)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i * x) ** 2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
# black for NaN cluster
rainbow = ['#000000'] + [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(auckland_merged['latitude'], auckland_merged['longitude'],
                                  auckland_merged['zone'], auckland_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[cluster + 1],
        fill = True,
        fill_color = rainbow[cluster + 1],
        fill_opacity = 0.7).add_to(map_clusters)
map_clusters

Check one cluster for example

In [227]:
auckland_merged.loc[auckland_merged['Cluster Labels'] == 0, auckland_merged.columns[[0] + list(range(3, auckland_merged.shape[1]))]]

Unnamed: 0,zone,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant,Cluster Labels
5,Avondale,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
55,Mangere East,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
72,New Windsor,Fast Food Restaurant,Chinese Restaurant,Yakitori Restaurant,Eastern European Restaurant,Indonesian Restaurant,0
80,Onehunga,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
87,Otara,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
94,Papakura,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
124,Takanini,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
148,Wellsford,Fast Food Restaurant,Yakitori Restaurant,Vietnamese Restaurant,Indonesian Restaurant,Indian Restaurant,0
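
Reading a cluster row by row works for small clusters. For a quicker summary, the category frequencies could be averaged per cluster to see what characterises each one. A minimal sketch with toy values standing in for `auckland_grouped` and `kmeans.labels_`:

```python
import pandas as pd

# Toy stand-in for the per-zone frequency table (values are illustrative)
grouped = pd.DataFrame({
    "zone": ["A", "B", "C", "D"],
    "Fast Food Restaurant": [1.0, 0.8, 0.0, 0.1],
    "Thai Restaurant": [0.0, 0.2, 0.6, 0.5],
    "Sushi Restaurant": [0.0, 0.0, 0.4, 0.4],
})
labels = [0, 0, 1, 1]  # toy cluster labels, one per zone

# Mean frequency of each category within each cluster
profile = (grouped.drop(columns="zone")
                  .assign(cluster=labels)
                  .groupby("cluster")
                  .mean())
dominant = profile.idxmax(axis=1)  # most frequent category per cluster
print(profile)
print(dominant)
```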


## 5 Discussion
### Observations
Based on this project, it is quite clear that different zones have different restaurant profiles. For example, if you love fast food, the zones in cluster 0 would be a great choice.  
It is very interesting to see that the central Auckland area is mainly split between two clusters, while the East, West, North, and South areas have mixed clusters.  
This model also identifies the black zones, which have no restaurants within convenient reach. It can be used to identify the boundary between urban and rural areas as well.
### Recommendations
Some points noted during the process:  
1 With the initial 500 m radius, the number of restaurants was very small for many zones, so the radius was increased to 1 km. This could be due to Auckland's low density compared with other big cities in the world.  
2 At this stage, the project focuses only on restaurants. However, other food choices such as cafés, ice cream, and bubble tea could be taken into consideration in the next stage.  
3 The number of features and clusters could also be adjusted further to fit the client's taste.
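
As a sketch of the second point, the restaurant filter could be widened with a regular expression. The keyword list is an assumption made for this example, and the sample categories are illustrative rather than taken from the full venue table.

```python
import pandas as pd

# Illustrative categories, including a missing value
venues = pd.DataFrame({"Venue Category": [
    "Sushi Restaurant", "Café", "Bubble Tea Shop",
    "Stadium", "Ice Cream Shop", None]})

# Hypothetical keyword list; extend it to match the client's idea of "food"
food_pattern = r"Restaurant|Café|Cafe|Bubble Tea|Ice Cream"

# na=False treats venues without a category as non-food instead of failing
is_food = venues["Venue Category"].str.contains(food_pattern, case=False, na=False)
food = venues[is_food].reset_index(drop=True)
print(food)
```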

## 6 Conclusion
This project successfully shows a way of clustering zones by restaurant type. Clients could easily use this tool to identify which zones best fit their favourite food.  
Most importantly, this process shows that complex problems can be analysed using accessible real-world data.  
For myself, this project helped me practise what I've learned from this Data Science certificate course. I will use these tools to unlock the power of data.

Please refer to the final_capstone_20200426.pdf for the presentation