**Introduction**

Mexico City is a huge city with a correspondingly large number of venues. I want to use k-means to cluster the venues by category so that a tourism agency can quickly generate the right icons for the venues on a map without going through the dataset manually.

**Data**

I will base my entire analysis on data from the Foursquare API, combined with visualisation in folium.

**Methodology**

I will analyse what kinds of venues exist in the dataset that I get from the Foursquare API, which unfortunately is limited to 100 venues. I will then use k-means to group the venues into 10 clusters and present them with different colours and tables.

In [343]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [344]:
CLIENT_ID = 'hidden' # your Foursquare ID
CLIENT_SECRET = 'hidden' # your Foursquare Secret
VERSION = '20200723' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [345]:
latitude = 19.42847
longitude =  -99.12766
radius = 50000
LIMIT = 1000

In [346]:
# build the Foursquare explore request and fetch the results
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f19ea61c0cd9c37324adebd'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Mexico City',
  'headerFullLocation': 'Mexico City',
  'headerLocationGranularity': 'city',
  'totalResults': 231,
  'suggestedBounds': {'ne': {'lat': 19.87847045000045,
    'lng': -98.65137912489877},
   'sw': {'lat': 18.978469549999552, 'lng': -99.60394087510124}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd5ea83cfa7b713b2ac26da',
       'name': 'Al Andalus',
       'location': {'address': 'Mesones 171',
        'crossStreet': 'Entre Las Cruces y Jesús María',
        'lat': 19.427880891457818,
        'lng': -99.12922428863952,
        'labeledLatLngs': [{'lab
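
The request URL in the cell above is assembled with string formatting. As a minimal sketch, the same query string can be built with the standard library's `urllib.parse.urlencode`, which escapes the values. `build_explore_url` is a hypothetical helper name; the parameter names are the ones already used in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Return the explore URL with an escaped query string (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude as one value
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return BASE_URL + '?' + urlencode(params)

url = build_explore_url('hidden', 'hidden', 19.42847, -99.12766, '20200723', 50000, 1000)
print(url)
```

The resulting `url` can be passed to `requests.get(url).json()` exactly as in the cell above.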

In [347]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [348]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

df=nearby_venues

df.head()



Unnamed: 0,name,categories,lat,lng
0,Al Andalus,Middle Eastern Restaurant,19.427881,-99.129224
1,Centro Histórico,Plaza,19.430583,-99.13449
2,Museo Nacional de Arte (MUNAL),Art Museum,19.436018,-99.139603
3,Palacio de Bellas Artes,Opera House,19.434953,-99.141959
4,Museo del Templo Mayor,History Museum,19.434839,-99.131642


In [349]:
print('{} venues were returned by Foursquare.'.format(df.shape[0]))

100 venues were returned by Foursquare.


In [350]:
# create map of Mexico City using latitude and longitude values
map_cdmx = folium.Map(location=[latitude, longitude], zoom_start=13)

In [351]:
# add markers to map
for la, ln, category, name in zip(df['lat'], df['lng'], df['categories'], df['name']):
    label = '{}, {}'.format(name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [la, ln],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_cdmx)  
    
map_cdmx

In [352]:
df.groupby("categories").count()["name"].sort_values(ascending=False)

categories
Ice Cream Shop                     12
Bakery                              6
Mexican Restaurant                  5
Park                                4
Taco Place                          4
Hotel                               3
Museum                              3
Seafood Restaurant                  3
Coffee Shop                         3
Pet Store                           2
Plaza                               2
Vegetarian / Vegan Restaurant       2
Japanese Restaurant                 2
Art Museum                          2
Forest                              1
Food Truck                          1
Department Store                    1
Flower Shop                         1
Dog Run                             1
Fountain                            1
Dessert Shop                        1
Yoga Studio                         1
Cycle Studio                        1
Cupcake Shop                        1
Health & Beauty Service             1
Coworking Space                     1
C

In [353]:
print('There are {} unique categories.'.format(len(df['categories'].unique())))

There are 61 unique categories.


In [354]:
# one hot encoding
onehot = pd.get_dummies(df[['categories']], prefix="", prefix_sep="")
onehot

Unnamed: 0,Arepa Restaurant,Argentinian Restaurant,Art Museum,Asian Restaurant,Bakery,Baseball Stadium,Bookstore,Burger Joint,Cafeteria,Café,Church,Coffee Shop,Concert Hall,Coworking Space,Cupcake Shop,Cycle Studio,Department Store,Dessert Shop,Dog Run,Flower Shop,Food Truck,Forest,Fountain,Gourmet Shop,Health & Beauty Service,History Museum,Hotel,Hotel Bar,Ice Cream Shop,Japanese Restaurant,Jewelry Store,Juice Bar,Mexican Restaurant,Middle Eastern Restaurant,Monument / Landmark,Motel,Museum,Music Store,Opera House,Optical Shop,Paella Restaurant,Park,Pet Store,Pie Shop,Plaza,Post Office,Public Art,Racetrack,Salon / Barbershop,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Southern / Soul Food Restaurant,Spa,Taco Place,Tattoo Parlor,Theater,Track,University,Vegetarian / Vegan Restaurant,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [355]:
# add the venue name column back to the dataframe
onehot['name'] = df['name']

# move the venue name column to the first position
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,name,Arepa Restaurant,Argentinian Restaurant,Art Museum,Asian Restaurant,Bakery,Baseball Stadium,Bookstore,Burger Joint,Cafeteria,Café,Church,Coffee Shop,Concert Hall,Coworking Space,Cupcake Shop,Cycle Studio,Department Store,Dessert Shop,Dog Run,Flower Shop,Food Truck,Forest,Fountain,Gourmet Shop,Health & Beauty Service,History Museum,Hotel,Hotel Bar,Ice Cream Shop,Japanese Restaurant,Jewelry Store,Juice Bar,Mexican Restaurant,Middle Eastern Restaurant,Monument / Landmark,Motel,Museum,Music Store,Opera House,Optical Shop,Paella Restaurant,Park,Pet Store,Pie Shop,Plaza,Post Office,Public Art,Racetrack,Salon / Barbershop,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Southern / Soul Food Restaurant,Spa,Taco Place,Tattoo Parlor,Theater,Track,University,Vegetarian / Vegan Restaurant,Yoga Studio
0,Al Andalus,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Centro Histórico,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Museo Nacional de Arte (MUNAL),0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Palacio de Bellas Artes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Museo del Templo Mayor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [356]:
onehot.shape

(100, 62)

In [357]:
# set number of clusters
kclusters = 10

clustering = onehot.drop(columns='name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 3, 0, 0, 1, 0], dtype=int32)
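
Why the labels come out grouped by category can be seen from the geometry of one-hot rows: two venues with the same category are at distance 0, and any two venues with different categories are at the same distance, the square root of 2. The sketch below checks this in plain Python on three made-up rows; it illustrates the encoding and is not part of the notebook's pipeline.

```python
import math

def one_hot(category, vocabulary):
    """Encode a category as a 0/1 list over a fixed, ordered vocabulary."""
    return [1 if category == item else 0 for item in vocabulary]

def euclidean(a, b):
    """Euclidean distance between two equally long numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocabulary = ['Bakery', 'Park', 'Taco Place']
taco_a = one_hot('Taco Place', vocabulary)
taco_b = one_hot('Taco Place', vocabulary)
park = one_hot('Park', vocabulary)

same = euclidean(taco_a, taco_b)      # identical categories
different = euclidean(taco_a, park)   # different categories
print(same, different)
```

Since every cross-category distance is equal, k-means has no notion of "similar" categories here: a Taco Place is as far from a Mexican Restaurant as it is from a Park. With this encoding it can only pull frequent categories out into their own clusters.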

In [358]:
# add clustering labels
df.insert(0, 'Cluster Labels', kmeans.labels_)

#merged = df

# merge grouped with df to add latitude/longitude for each restaurant name
#merged = merged.join(df.set_index('name'), on='name')



In [359]:
df.tail() # check the last rows!

Unnamed: 0,Cluster Labels,name,categories,lat,lng
95,1,Pujol,Mexican Restaurant,19.432378,-99.194808
96,2,Chiandoni,Ice Cream Shop,19.386255,-99.176928
97,0,Yoga Espacio,Yoga Studio,19.385674,-99.171723
98,0,Tiffany & Co.,Jewelry Store,19.432026,-99.201028
99,2,Amorino,Ice Cream Shop,19.429808,-99.196587


In [360]:
df.describe()

Unnamed: 0,Cluster Labels,lat,lng
count,100.0,100.0,100.0
mean,1.78,19.415821,-99.161217
std,2.702711,0.021407,0.023884
min,0.0,19.356467,-99.201028
25%,0.0,19.40252,-99.176688
50%,0.0,19.419805,-99.167228
75%,2.25,19.429997,-99.146045
max,9.0,19.484026,-99.08643


In [361]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df['lat'], df['lng'], df['categories'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # index the palette directly by cluster label
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
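
The colour list only needs one distinct hex colour per cluster. Here is a minimal standard-library sketch, assuming evenly spaced hues are acceptable. `cluster_palette` is a hypothetical helper, and its colours differ from matplotlib's `rainbow` colormap used above.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters hex colours with evenly spaced hues (hypothetical helper)."""
    palette = []
    for i in range(n_clusters):
        # hue runs over [0, 1); saturation and value are fixed, arbitrary choices
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(10)
print(palette)
```

An entry such as `palette[cluster]` could then be passed as `color` and `fill_color` to `folium.CircleMarker`.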

In [362]:
df.loc[df['Cluster Labels'] == 0, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
0,Middle Eastern Restaurant
1,Plaza
2,Art Museum
3,Opera House
4,History Museum
6,Church
7,Art Museum
9,Post Office
11,Scenic Lookout
13,Flower Shop


In [363]:
df.loc[df['Cluster Labels'] == 1, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
8,Mexican Restaurant
14,Mexican Restaurant
29,Mexican Restaurant
72,Mexican Restaurant
95,Mexican Restaurant


In [364]:
df.loc[df['Cluster Labels'] == 2, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
12,Ice Cream Shop
23,Ice Cream Shop
30,Ice Cream Shop
35,Ice Cream Shop
37,Ice Cream Shop
45,Ice Cream Shop
53,Ice Cream Shop
61,Ice Cream Shop
66,Ice Cream Shop
71,Ice Cream Shop


In [365]:
df.loc[df['Cluster Labels'] == 3, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
5,Taco Place
22,Taco Place
39,Taco Place
83,Taco Place


In [366]:
df.loc[df['Cluster Labels'] == 4, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
36,Park
47,Park
58,Park
67,Park


In [367]:
df.loc[df['Cluster Labels'] == 5, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
17,Vegetarian / Vegan Restaurant
46,Vegetarian / Vegan Restaurant


In [368]:
df.loc[df['Cluster Labels'] == 6, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
60,Museum
75,Museum
94,Museum


In [369]:
df.loc[df['Cluster Labels'] == 7, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
16,Bakery
25,Bakery
31,Bakery
77,Bakery
81,Bakery
84,Bakery


In [370]:
df.loc[df['Cluster Labels'] == 8, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
10,Hotel
32,Hotel
44,Hotel


In [371]:
df.loc[df['Cluster Labels'] == 9, df.columns[[2] + list(range(5, df.shape[1]))]]

Unnamed: 0,categories
21,Coffee Shop
41,Coffee Shop
79,Coffee Shop


**Results**

K-means works very well at separating the venues from each other. The first cluster (label 0) collects the leftover venues whose categories appear only once or a few times, while each of the other clusters contains a single category. With more data from the Foursquare API this would have been a much more interesting analysis, but I see it as a proof of concept that k-means can save a lot of time and offer insights into large datasets that would take hours to inspect by hand in more complex cases than this one.
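
Because each venue carries exactly one category, the clusters can be sanity-checked against plain category counts. The sketch below is not k-means. It is a frequency-based check that gives each of the `k - 1` most frequent categories its own group and sends everything else to a catch-all group 0. The `expected_groups` helper is hypothetical and the sample list is made up.

```python
from collections import Counter

def expected_groups(categories, k):
    """Give each of the k-1 most frequent categories its own group id (1..k-1);
    every other category falls into the catch-all group 0."""
    top = [name for name, _ in Counter(categories).most_common(k - 1)]
    group_of = {name: i + 1 for i, name in enumerate(top)}
    return [group_of.get(name, 0) for name in categories]

sample = ['Ice Cream Shop', 'Bakery', 'Ice Cream Shop', 'Forest',
          'Bakery', 'Ice Cream Shop', 'Fountain']
groups = expected_groups(sample, k=3)
print(groups)
```

K-means is not guaranteed to match this check exactly. In the run above, Vegetarian / Vegan Restaurant (2 venues) received its own cluster, while Seafood Restaurant (3 venues) did not.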

From this graphical analysis it is very easy to create an icon for every colour and put a taco on the map for every Taco Place, which is fantastic.
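
A minimal sketch of such an icon lookup, assuming a hand-written mapping from category to icon name. The icon names and the `icon_for` helper are hypothetical placeholders; the real names depend on the icon set the agency uses.

```python
# Hypothetical icon names; replace them with names from the actual icon set.
ICON_BY_CATEGORY = {
    'Taco Place': 'taco',
    'Ice Cream Shop': 'ice-cream',
    'Bakery': 'bread',
    'Park': 'tree',
    'Coffee Shop': 'coffee',
}
DEFAULT_ICON = 'map-pin'  # fallback for categories without their own icon

def icon_for(category):
    """Return the icon name for a category, or the default icon."""
    return ICON_BY_CATEGORY.get(category, DEFAULT_ICON)

venues = [('Pujol', 'Mexican Restaurant'), ('Chiandoni', 'Ice Cream Shop')]
icons = [(name, icon_for(category)) for name, category in venues]
print(icons)
```

The fallback icon plays the same role as cluster 0: rare categories share one generic marker.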

**Discussion**

The Foursquare API is not very helpful if you don't pay for it. Also, many people no longer use Foursquare, but I picked Mexico City because people have been using it here for years, as opposed to Europe, for example, where the dataset would not have been representative at all because hardly anyone has used it.

The number of ice cream shops was surprising to me. I also wondered how many taco places were never reviewed because they should remain a secret. The number of bakeries surprised me as well.

**Conclusion**

I am excited about how fast k-means is. I will keep working with it on datasets that don't have to rely on the Foursquare API, as this assignment did, and I am thrilled to discover patterns in my data that I had never noticed before.