<h1 align="center">Clustering the cities of the Netherlands based on venues</h1>

# Business Problem

In this project, we cluster the largest cities in the Netherlands based on their venues. We take into consideration the venue categories (e.g. drugstores, cafés, bus stations, pubs, restaurants, shoe stores, bakeries, etc.) and the relative frequency of each venue category in each city.

By clustering the cities, we gain insight into the similarity between cities. This information can be used for many purposes, such as helping tourists choose their next destination based on cities they previously enjoyed visiting. Similarly, it helps people who are thinking about moving within the Netherlands. Furthermore, our findings can help stakeholders make informed business decisions and address concerns they have about competitors.


# Data Description

## Cities

We need the names, populations and geolocations of the biggest cities in the Netherlands. We scrape the names and populations from the WikiKids page https://wikikids.nl/Lijst_van_grote_Nederlandse_steden and use the following columns:

1. *Naam* : Name of the city
2. *Inwoners* : The population of that city

This WikiKids page lists the biggest cities in the Netherlands together with their populations, but it lacks information about their geographical locations. To fill this gap we use the ArcGIS API.

### ArcGIS API

ArcGIS Online enables you to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. Share your insights with the world or specific groups. 

More specifically, we use ArcGIS to get the geolocations of the cities in the Netherlands. The following columns are added to our initial dataset:

3. *Latitude* : Latitude for city
4. *Longitude* : Longitude for city

## Foursquare API Data

We need data about the venues in each city, which we obtain from Foursquare. Foursquare is a location data provider with information about all manner of venues and events within an area of interest, including venue names, locations, menus and even photos. The Foursquare location platform is our sole source of venue data, since all the required information can be obtained through its API.

After finding the list of cities, we then connect to the Foursquare API to gather information about venues inside each city. For all cities, we have chosen the radius to be 3 kilometers.

The data retrieved from Foursquare contains information on venues within the specified distance of each city's latitude and longitude. The information obtained per venue is as follows:

1. *Stad* : Name of the city
2. *Stad Latitude* : Latitude of the city
3. *Stad Longitude* : Longitude of the city
4. *Venue* : Name of the venue
5. *Venue Latitude* : Latitude of venue
6. *Venue Longitude* : Longitude of venue
7. *Venue Category* : Category of venue

Based on the information collected for the cities, we have sufficient data to build our model. We cluster the cities together based on similar venue categories. We then present our observations and findings. Using this data, our stakeholders can take the necessary decisions.

In [5]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans

<h1>Exploring the largest cities in the Netherlands</h1>

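We scrape the webpage and take the table that lists the cities (index 1 in the list returned by `pd.read_html`). We need only the city name (Naam, renamed to Stad) and the population (Inwoners) for further steps, so we drop the rank (Nº), province (Provincie) and city image (Stadsbeeld) columns.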

In [6]:
wiki_url = requests.get("https://wikikids.nl/Lijst_van_grote_Nederlandse_steden")
wiki_data = pd.read_html(wiki_url.text)
wiki_data = wiki_data[1]
data = wiki_data.drop(labels=['Nº', 'Provincie', 'Stadsbeeld'], axis=1)
data = data.rename(columns={'Naam': 'Stad'})
data

Unnamed: 0,Stad,Inwoners
0,Amsterdam,862.965
1,Rotterdam,581.750
2,Den Haag,537.833
3,Utrecht,352.866
4,Eindhoven,231.642
...,...,...
58,Den Helder,56.707
59,Doetinchem,56.418
60,Hoogeveen,54.699
61,Terneuzen,54.687


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Stad      63 non-null     object 
 1   Inwoners  63 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.1+ KB
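
Note that `Inwoners` is reported as `float64`. The source page appears to write populations with a dot as thousands separator (for example `862.965`), so the parsed values are effectively expressed in thousands of inhabitants. The marker sizes used later in this notebook rely on that scale. If whole inhabitant counts are preferred, the conversion could be sketched as follows, assuming the raw values are available as strings (the sample values are copied from the table above; `raw_inwoners` and `inwoners_int` are illustrative names):

```python
import pandas as pd

# Sample population strings in Dutch notation, taken from the table above
raw_inwoners = pd.Series(["862.965", "581.750", "56.707"])

# Remove the thousands separator, then convert to integers
inwoners_int = raw_inwoners.str.replace(".", "", regex=False).astype(int)
print(inwoners_int.tolist())
```

`pd.read_html` also accepts `thousands` and `decimal` arguments, which may achieve the same result at parsing time.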


<h1>Geolocations of the cities</h1>

### ArcGIS API
We need the geographical coordinates of the cities to plot our map. We use the arcgis package to do so. ArcGIS doesn't have a limitation on the number of API calls made.

In [8]:
!pip install arcgis



In [9]:
from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()

Defining arcgis geocode function to return latitude and longitude for all cities in the Netherlands

In [10]:
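def get_x_y(address1):
    """Geocode a city name and return its coordinates as a 'lat,lng' string."""
    # geocode() returns a list of candidate matches; we take the first one
    g = geocode(address='{}, Netherlands, NL'.format(address1))[0]
    # ArcGIS reports x = longitude and y = latitude
    lng_coords = g['location']['x']
    lat_coords = g['location']['y']
    return str(lat_coords) + "," + str(lng_coords)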

We copy over the city names to pass it into the geolocator function that we defined above

In [11]:
geo_coordinates = data['Stad']
coordinates_latlng = geo_coordinates.apply(lambda x: get_x_y(x))
coordinates_latlng

0      52.36993000000007,4.907880000000034
1      51.91438000000005,4.487160000000074
2      52.08409000000006,4.317320000000052
3      52.08965000000006,5.114350000000059
4     51.435880000000054,5.485460000000046
                      ...                 
58    52.958380000000034,4.758910000000071
59    51.963700000000074,6.291360000000054
60    52.725970000000075,6.475520000000074
61    51.33814000000007,3.8275500000000306
62      51.49541000000005,3.60964000000007
Name: Stad, Length: 63, dtype: object

### Latitude & Longitude

We extract the latitude and longitude from the collected coordinates and merge them with our source data.

In [12]:
lat = coordinates_latlng.apply(lambda x: x.split(',')[0])
lng = coordinates_latlng.apply(lambda x: x.split(',')[1])

In [13]:
merged = pd.concat([data, lat.astype(float), lng.astype(float)], axis=1)
merged.columns= ['Stad','Inwoners','Latitude','Longitude']
merged

Unnamed: 0,Stad,Inwoners,Latitude,Longitude
0,Amsterdam,862.965,52.36993,4.90788
1,Rotterdam,581.750,51.91438,4.48716
2,Den Haag,537.833,52.08409,4.31732
3,Utrecht,352.866,52.08965,5.11435
4,Eindhoven,231.642,51.43588,5.48546
...,...,...,...,...
58,Den Helder,56.707,52.95838,4.75891
59,Doetinchem,56.418,51.96370,6.29136
60,Hoogeveen,54.699,52.72597,6.47552
61,Terneuzen,54.687,51.33814,3.82755


In [14]:
merged.dtypes

Stad          object
Inwoners     float64
Latitude     float64
Longitude    float64
dtype: object
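
The two `split` calls above could also be written as a single step. The sketch below assumes coordinate strings in the `lat,lng` format returned by `get_x_y`; the sample values are rounded copies of the first two rows, and `coords` and `latlng` are illustrative names:

```python
import pandas as pd

# Sample "lat,lng" strings in the format returned by get_x_y (values rounded)
coords = pd.Series(["52.36993,4.90788", "51.91438,4.48716"])

# Split once into two columns and convert both to floats
latlng = coords.str.split(",", expand=True).astype(float)
latlng.columns = ["Latitude", "Longitude"]
print(latlng)
```

The resulting frame can be concatenated with `data` in the same way as the separate `lat` and `lng` series.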

### Co-ordinates for the Netherlands
We geocode the Netherlands itself so that we can center the map on the country.

In [15]:
nederland = geocode(address='Netherlands, NL')[0]
nederland_lng_coords = nederland['location']['x']
nederland_lat_coords = nederland['location']['y']
print('Coordinates:', nederland_lng_coords, nederland_lat_coords)

Coordinates: 5.616126398000063 52.24937529300007


## Visualize the map of the Netherlands
To help visualize the map of the Netherlands and its cities, we make use of the folium package. The size of the marker is based on the population of the specific city.

In [22]:
# create the map of the Netherlands, centered on the geocoded country location
map_nederland = folium.Map(location=[nederland_lat_coords, nederland_lng_coords], zoom_start=7, tiles='cartodbpositron')

# add one circle marker per city
for latitude, longitude, stad, inwoners in zip(merged['Latitude'], merged['Longitude'], merged['Stad'], merged['Inwoners']):
    label = folium.Popup('{}'.format(stad), parse_html=True)
    # marker radius scales linearly with the population value
    radius_size = inwoners / 18
    folium.CircleMarker(
        [latitude, longitude],
        radius=radius_size,
        popup=label,
        color='blue',
        fill=True
        ).add_to(map_nederland)

# display the map
map_nederland

In [23]:
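import os
# Credentials are read from environment variables (the variable names are an assumption) instead of being hard-coded
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20210225'  # Foursquare API version date (YYYYMMDD)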

In [24]:
LIMIT = 200  # number of venues requested per city

def getNearbyVenues(names, latitudes, longitudes, radius=3000):
    """Return one row per venue that Foursquare reports within `radius` metres of each city."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # build the API request; requests URL-encodes the query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and keep only the list of returned items
        results = requests.get(url, params=params).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-city lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Stad', 'Stad Latitude', 'Stad Longitude', 'Venue', 'Venue Category']

    return nearby_venues
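
The function above indexes `categories[0]` directly, which would raise an `IndexError` if a venue came back without any category. Whether Foursquare actually returns such venues is an assumption here. The sketch below shows one defensive way to extract the fields; `sample_response` is a made-up payload that only mimics the nesting used above, and `extract_venues` is a hypothetical helper:

```python
# Made-up payload that mimics the nesting used above:
# response -> groups[0] -> items -> venue -> name / categories
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue A", "categories": [{"name": "Café"}]}},
        {"venue": {"name": "Venue B", "categories": []}},
    ]}]}
}

def extract_venues(payload):
    """Return (venue name, category) pairs, using None when no category is listed."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], category))
    return rows

print(extract_venues(sample_response))
```

Rows with a missing category could then be dropped or given a placeholder label before one-hot encoding.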

### Venues in the Netherlands

The function above relies on the Foursquare API credentials defined earlier.

Using the Foursquare API, we retrieve the venues and venue categories for each city in the Netherlands.

In [25]:
venues = getNearbyVenues(merged['Stad'], merged['Latitude'], merged['Longitude'])

Amsterdam
Rotterdam
Den Haag
Utrecht
Eindhoven
Tilburg
Groningen
Almere
Breda
Nijmegen
Enschede
Apeldoorn
Haarlem
Amersfoort
Zaanstad
Arnhem
Haarlemmermeer
's Hertogenbosch
Zoetermeer
Zwolle
Maastricht
Leiden
Dordrecht
Ede
Emmen
Westland
Venlo
Delft
Deventer
Leeuwarden
Alkmaar
Sittard-Geleen
Helmond
Heerlen
Hilversum
Oss
Amstelveen
Súdwest-Fryslân
Hengelo
Purmerend
Roosendaal
Schiedam
Lelystad
Alphen aan den Rijn
Leidschendam-Voorburg
Almelo
Spijkenisse
Hoorn
Gouda
Vlaardingen
Assen
Bergen op Zoom
Capelle aan den IJssel
Veenendaal
Katwijk
Zeist
Nieuwegein
Roermond
Den Helder
Doetinchem
Hoogeveen
Terneuzen
Middelburg


In [26]:
venues.head()

Unnamed: 0,Stad,Stad Latitude,Stad Longitude,Venue,Venue Category
0,Amsterdam,52.36993,4.90788,HPS,Cocktail Bar
1,Amsterdam,52.36993,4.90788,Sotto,Pizza Place
2,Amsterdam,52.36993,4.90788,De Hortus,Botanical Garden
3,Amsterdam,52.36993,4.90788,Black Gold,Coffee Shop
4,Amsterdam,52.36993,4.90788,Rosalia's Menagerie,Cocktail Bar


In [27]:
venues.shape

(5083, 5)

In total, we have retrieved 5083 venues for 63 cities.

### Grouping by Venue Categories
We will check how many Venue Categories there are for further processing

In [28]:
venues.groupby('Venue Category').max()

Unnamed: 0_level_0,Stad,Stad Latitude,Stad Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,Roermond,51.19614,5.98372,Michael Kors Outlet
Advertising Agency,Zwolle,52.51621,6.09247,Dopit Media
Afghan Restaurant,Tilburg,51.69088,5.48546,Zaher
African Restaurant,Amersfoort,52.15252,5.38626,Restaurant De Olifant
Airport,Hoogeveen,52.72597,6.47552,Vliegveld Hoogeveen (EHHO)
...,...,...,...,...
Wine Shop,Utrecht,53.21687,6.57393,Wijnkoperij Platenburg
Women's Store,Rotterdam,51.91438,4.48716,Dearhunter Vintage Clothing & Accessories
Yoga Studio,Den Haag,52.36993,4.90788,Delight Yoga
Zoo,Maastricht,52.78223,6.89636,Wildlands Adventure Zoo Emmen


The result has 334 rows, one per distinct venue category, which indicates a great diversity of venues (or very finely defined venue categories) in the Netherlands.
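
The `groupby(...).max()` call above is used only to see how many categories exist. A more direct count could be sketched as follows; `toy_venues` is a made-up stand-in for the `venues` table:

```python
import pandas as pd

# Toy stand-in for the `venues` table (values are made up)
toy_venues = pd.DataFrame({
    "Stad": ["A", "A", "B", "B"],
    "Venue Category": ["Bar", "Café", "Bar", "Hotel"],
})

# Number of distinct categories overall
n_categories = toy_venues["Venue Category"].nunique()

# Number of venues per category, most frequent first
category_counts = toy_venues["Venue Category"].value_counts()
print(n_categories)
print(category_counts)
```

`value_counts()` additionally shows which categories dominate the data set, which `max()` does not.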

### One Hot Encoding 
We need to encode our venue categories for our clustering

In [29]:
venue_cat = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
venue_cat

Unnamed: 0,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Apres Ski Bar,Arcade,Arepa Restaurant,...,Warehouse Store,Waterfront,Whisky Bar,Windmill,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5078,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5079,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5081,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We add the city names as the first column.

In [30]:
venue_cat['Stad'] = venues['Stad'] 

# moving city column to the first column
fixed_columns = [venue_cat.columns[-1]] + list(venue_cat.columns[:-1])
venue_cat = venue_cat[fixed_columns]

venue_cat.head()

Unnamed: 0,Stad,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Apres Ski Bar,Arcade,...,Warehouse Store,Waterfront,Whisky Bar,Windmill,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Amsterdam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Venue categories mean value
We group by city and calculate the mean of each one-hot column, which is the relative frequency of that venue category in the city.

In [31]:
grouped = venue_cat.groupby('Stad').mean().reset_index()
grouped.head()

Unnamed: 0,Stad,Accessories Store,Advertising Agency,Afghan Restaurant,African Restaurant,Airport,American Restaurant,Amphitheater,Apres Ski Bar,Arcade,...,Warehouse Store,Waterfront,Whisky Bar,Windmill,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,'s Hertogenbosch,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alkmaar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Almelo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Almere,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alphen aan den Rijn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0
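
To make the two steps above concrete, here is a small sketch on made-up data. `toy_venues`, `toy_onehot` and `toy_grouped` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Made-up venues for two cities
toy_venues = pd.DataFrame({
    "Stad": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Café", "Hotel", "Bar", "Hotel"],
})

# One column per category, 1 where the venue belongs to that category
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
toy_onehot.insert(0, "Stad", toy_venues["Stad"])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby("Stad").mean().reset_index()
print(toy_grouped)
```

Each row sums to 1, so cities with different numbers of retrieved venues remain comparable.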


The following function returns the most common venue categories for a city.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

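Since there are too many venue categories (334) to inspect side by side, we list only the 25 most common categories per city. Note that the k-means step further below is fitted on the full frequency table (`grouped`), not on this top-25 list.

The following code builds the column labels (1st, 2nd, 3rd, ... Most Common Venue).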

In [35]:
num_top_venues = 25

# ordinal suffixes for the first three positions; every later position gets 'th'
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Stad']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
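
The loop above labels positions 21 to 23 as "21th", "22th" and "23th", which is how they appear in the tables in this notebook. If correct English ordinals are wanted, a hypothetical helper such as `ordinal` below could be used instead. This is a sketch and not the code that produced the outputs shown here; using it would change the column names:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 22 -> '22nd'."""
    # numbers whose last two digits are 10 to 20 (this covers 11, 12 and 13) take 'th'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 26)]
print(labels[0], labels[10], labels[20], sep=' | ')
```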

### Top venue categories

Getting the top venue categories for each city in the Netherlands

In [36]:
# create a new dataframe
venues_sorted2 = pd.DataFrame(columns=columns)
venues_sorted2['Stad'] = grouped['Stad']

for ind in np.arange(grouped.shape[0]):
    venues_sorted2.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

venues_sorted2.head()

Unnamed: 0,Stad,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,21th Most Common Venue,22th Most Common Venue,23th Most Common Venue,24th Most Common Venue,25th Most Common Venue
0,'s Hertogenbosch,Supermarket,Bar,Café,Restaurant,Mediterranean Restaurant,Drugstore,Gastropub,Coffee Shop,Hotel,...,Italian Restaurant,Gym / Fitness Center,Art Museum,Beer Garden,Spa,Bistro,Bookstore,Big Box Store,Garden Center,Frozen Yogurt Shop
1,Alkmaar,Supermarket,Coffee Shop,Café,Restaurant,Fast Food Restaurant,Park,Italian Restaurant,Drugstore,French Restaurant,...,Pharmacy,Soccer Field,Pizza Place,Pool,Sandwich Place,Church,Scottish Restaurant,Skating Rink,Cheese Shop,Comfort Food Restaurant
2,Almelo,Supermarket,Shopping Mall,Drugstore,Café,Restaurant,Pub,Department Store,Grocery Store,Furniture / Home Store,...,Stadium,Soccer Stadium,Soccer Field,Flea Market,Food & Drink Shop,Garden Center,Multiplex,Sandwich Place,Cafeteria,Discount Store
3,Almere,Supermarket,Sushi Restaurant,Restaurant,Snack Place,Bar,Hotel,Gym,Chinese Restaurant,Train Station,...,Gym / Fitness Center,Park,Asian Restaurant,Hockey Field,Spa,Pool,Discount Store,Soccer Stadium,Soccer Field,Cultural Center
4,Alphen aan den Rijn,Supermarket,Restaurant,Drugstore,Gym,Soccer Field,Bookstore,Discount Store,Japanese Restaurant,Italian Restaurant,...,Pizza Place,Motorsports Shop,Bowling Alley,Middle Eastern Restaurant,Farmers Market,Spanish Restaurant,Sporting Goods Shop,Fast Food Restaurant,Sports Bar,French Restaurant
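
How `return_most_common_venues` picks the labels can be seen on a single toy row. The values below are made up, and `toy_row`, `shares` and `top_two` are illustrative names:

```python
import pandas as pd

# Toy frequency row in the layout of `grouped`: city name first, then category shares
toy_row = pd.Series({"Stad": "A", "Bar": 0.5, "Café": 0.3, "Hotel": 0.2})

# Skip the city name, sort the shares in descending order, keep the top 2 labels
shares = toy_row.iloc[1:].astype(float)
top_two = shares.sort_values(ascending=False).index.values[:2]
print(list(top_two))
```

One consequence of this approach: if a city has fewer distinct categories than `num_top_venues`, the remaining positions are filled with categories whose share is 0, so the tail of the list should be read with care.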


### K Means clustering
Let's cluster the cities using k-means clustering.

In [37]:
# set number of clusters
k_num_clusters = 5

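# keep only the numeric category frequencies (the positional axis argument is not accepted by recent pandas versions)
grouped_clustering = grouped.drop(columns='Stad')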

# run k-means clustering
kmeans_nl5 = KMeans(n_clusters=k_num_clusters, random_state=0).fit(grouped_clustering)
kmeans_nl5

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

### Labelling Clustered Data

In [38]:
kmeans_nl5.labels_

array([2, 2, 1, 2, 1, 0, 4, 4, 0, 0, 2, 2, 4, 1, 4, 4, 2, 4, 2, 0, 2, 4,
       2, 2, 0, 4, 4, 0, 0, 0, 2, 4, 1, 0, 3, 2, 0, 1, 1, 2, 0, 1, 4, 2,
       2, 2, 0, 4, 2, 2, 1, 0, 0, 4, 4, 1, 2, 1, 3, 2, 2, 2, 2])
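
The number of clusters (5) is fixed by hand above. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic data, not on the city data: `X` is a made-up stand-in for `grouped_clustering` with three clearly separated groups, so the scores say nothing about the Dutch cities themselves.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for `grouped_clustering`: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several values of k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

Replacing `X` with `grouped_clustering` would give the corresponding scores for the city data.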

In [41]:
venues_sorted2.insert(0, 'Cluster Labels', kmeans_nl5.labels_ +1)

We join `merged` with `venues_sorted2` to add the latitude and longitude of each city, which prepares the data for plotting.

In [42]:
nl_data = merged

nl_data = nl_data.join(venues_sorted2.set_index('Stad'), on='Stad')

nl_data.head()

Unnamed: 0,Stad,Inwoners,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,...,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,21th Most Common Venue,22th Most Common Venue,23th Most Common Venue,24th Most Common Venue,25th Most Common Venue
0,Amsterdam,862.965,52.36993,4.90788,5,Hotel,Coffee Shop,Bar,Restaurant,Café,...,Yoga Studio,Pub,Food Truck,French Restaurant,Beer Bar,Spiritual Center,Steakhouse,Beer Store,Soup Place,Indonesian Restaurant
1,Rotterdam,581.75,51.91438,4.48716,5,Bar,Hotel,Coffee Shop,Café,French Restaurant,...,Japanese Restaurant,Bagel Shop,Market,Vietnamese Restaurant,Hostel,Italian Restaurant,Shopping Plaza,Bookstore,Food Truck,Sporting Goods Shop
2,Den Haag,537.833,52.08409,4.31732,5,Coffee Shop,Restaurant,Park,Bar,Hotel,...,Gym,Thai Restaurant,Indonesian Restaurant,Plaza,Asian Restaurant,Butcher,Bike Shop,Wine Bar,Soccer Field,Snack Place
3,Utrecht,352.866,52.08965,5.11435,5,Coffee Shop,Restaurant,Park,Bar,Italian Restaurant,...,Café,Sandwich Place,Hotel,Farm,Monument / Landmark,Beer Store,Steakhouse,Cocktail Bar,Squash Court,Snack Place
4,Eindhoven,231.642,51.43588,5.48546,5,Coffee Shop,Bar,Restaurant,Hotel,Plaza,...,French Restaurant,Bakery,Breakfast Spot,Skate Park,Science Museum,Skating Rink,Lounge,Bookstore,Soccer Stadium,Food Court


We drop any city without a cluster label (NaN), since such a city cannot be assigned a color on the map.

In [43]:
nl_data_nonan = nl_data.dropna(subset=['Cluster Labels'])
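
The join followed by `dropna` behaves as in the sketch below. The toy tables are made up, with city `B` deliberately missing from the cluster table. In this notebook the label array has 63 entries, one per city, so the `dropna` call appears to act only as a safeguard.

```python
import pandas as pd

# Made-up city table and cluster table; city B has no cluster label
toy_cities = pd.DataFrame({"Stad": ["A", "B", "C"], "Inwoners": [100.0, 80.0, 60.0]})
toy_clusters = pd.DataFrame({"Stad": ["A", "C"], "Cluster Labels": [1, 2]})

# Left join on the city name: city B has no match and gets NaN
toy_joined = toy_cities.join(toy_clusters.set_index("Stad"), on="Stad")

# Keep only the cities that received a cluster label
toy_clean = toy_joined.dropna(subset=["Cluster Labels"])
print(toy_clean)
```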

## Examining our Clusters

In [51]:
pd.set_option('display.max_rows', 64)
nl_data.sort_values('Cluster Labels')

Unnamed: 0,Stad,Inwoners,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,...,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,21th Most Common Venue,22th Most Common Venue,23th Most Common Venue,24th Most Common Venue,25th Most Common Venue
62,Middelburg,47.754,51.49541,3.60964,1,Bar,Restaurant,Café,Hotel,Plaza,...,Tea Room,Gastropub,Drugstore,Seafood Restaurant,Snack Place,Sports Club,Bagel Shop,Nightclub,Bookstore,Bakery
61,Terneuzen,54.687,51.33814,3.82755,1,Restaurant,Supermarket,Hotel,Bakery,Clothing Store,...,Mediterranean Restaurant,Rock Club,Chinese Restaurant,Beach Bar,Beach,Shoe Store,Gastropub,Snack Place,Boat or Ferry,Theater
37,Súdwest-Fryslân,82.572,53.03369,5.66133,1,Supermarket,Restaurant,Harbor / Marina,Gastropub,Drugstore,...,Cocktail Bar,Spa,Pub,Coffee Shop,Plaza,Indonesian Restaurant,Department Store,Discount Store,Bed & Breakfast,Museum
40,Roosendaal,77.097,51.53141,4.45749,1,Restaurant,Bar,Supermarket,Bus Stop,Shopping Mall,...,Theater,Furniture / Home Store,Chinese Restaurant,Bookstore,Snack Place,Mexican Restaurant,Gastropub,Sandwich Place,Tapas Restaurant,Music Venue
22,Dordrecht,118.702,51.81195,4.65647,1,Restaurant,Café,Ice Cream Shop,Sushi Restaurant,Bar,...,Sandwich Place,Music Venue,Hotel,Coffee Shop,Chinese Restaurant,Seafood Restaurant,Liquor Store,Creperie,South American Restaurant,Snack Place
21,Leiden,120.105,52.15363,4.49381,1,Restaurant,Supermarket,Museum,Bar,Drugstore,...,Diner,Fast Food Restaurant,Park,Coffee Shop,Church,Indonesian Restaurant,Chocolate Shop,Climbing Gym,Record Shop,Scenic Lookout
47,Hoorn,71.567,52.64243,5.05206,1,Harbor / Marina,Restaurant,Supermarket,Bar,History Museum,...,Gastropub,Tea Room,Beach Bar,Snack Place,Beer Bar,Skating Rink,Theater,Drugstore,Bistro,Sandwich Place
32,Helmond,89.139,51.48223,5.65825,1,Restaurant,Supermarket,Fast Food Restaurant,Ice Cream Shop,Drugstore,...,Sushi Restaurant,Soccer Field,Music Venue,Multiplex,Gas Station,Café,Sandwich Place,Rock Club,Diner,Department Store
15,Arnhem,150.354,51.98038,5.90333,1,Restaurant,Supermarket,Café,Park,Coffee Shop,...,Drugstore,Record Shop,Concert Hall,Discount Store,Chinese Restaurant,Plaza,Sporting Goods Shop,Food Court,South American Restaurant,Liquor Store
48,Gouda,70.981,52.01,4.71071,1,Supermarket,Café,Restaurant,Drugstore,Bar,...,Shopping Mall,Greek Restaurant,French Restaurant,Fish Market,Soccer Field,Bistro,Beer Store,Beer Garden,Library,Brasserie


### Visualizing the clustered cities
Plotting the clusters

In [52]:
map_clusters_nl = folium.Map(location=[nederland_lat_coords, nederland_lng_coords], zoom_start=7, tiles='cartodbpositron')

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_num_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster, inwoners in zip(nl_data_nonan['Latitude'], nl_data_nonan['Longitude'], nl_data_nonan['Stad'], nl_data_nonan['Cluster Labels'], nl_data_nonan['Inwoners']):
    label = folium.Popup('Cluster ' + str(int(cluster)) + '\n' + str(poi) , parse_html=True)
    radius_size = inwoners / 20
    folium.CircleMarker(
        [lat, lon],
        radius=radius_size,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)]
        ).add_to(map_clusters_nl)
        
map_clusters_nl
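
The discussion refers to clusters by color, so it helps to know which color belongs to which cluster label. The sketch below rebuilds the color list in the same way as the plotting cell and prints the mapping. With five clusters the colors should run from purple (label 1) through blue, green and orange to red (label 5); `cluster_colors` is an illustrative name.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k_num_clusters = 5

# Same color construction as in the plotting cell: evenly spaced rainbow colors
hex_colors = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k_num_clusters))]

# Cluster labels are 1-based, list indices are 0-based
cluster_colors = {label: hex_colors[label - 1] for label in range(1, k_num_clusters + 1)}
for label, color in cluster_colors.items():
    print(label, color)
```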

# Results and Discussion

## Almere, a city of business opportunities?
<p style="font-size:16px">Of the 10 largest cities, 9 belong to the Red Cluster (Amsterdam, Rotterdam, Den Haag, Utrecht, Eindhoven, Groningen, Tilburg, Breda, Nijmegen), indicating that, in general, large cities share a similar composition of venues. The one exception is Almere, the eighth largest city (Green Cluster). The data suggest that this discrepancy can be attributed to Almere having noticeably fewer tourist-focused venues, such as bars, coffee shops or hotels. Interestingly, Almere has experienced one of the largest population increases over the last few years [1]. In addition, based on prognoses by CBS, Almere is expected to increase its population by 22.8% between 2018 and 2035, the largest increase among the top 10 cities [2].</p>

<p style="font-size:16px">Considering the lack of tourist-focused venues and the projected growth of Almere, the city might offer promising business opportunities for entrepreneurs. However, more research is required into the expected future tourist influx and the venue-related demands of the locals before any conclusions can be drawn.</p>

## Deventer as an affordable alternative to large cities

<p style="font-size:16px">Looking further at the results, one can distinguish several smaller cities that exhibit large-city-like characteristics (Red Cluster). These include Hilversum, Deventer, Delft and Amstelveen. Deventer might be of particular interest to people who like larger cities, yet are unable or unwilling to pay the high rents asked in these cities. While the average rent per 100 m² for the 10 largest cities is €1560 (unweighted average), for Deventer it is only €1067 [3].</p>

## Small to medium-sized cities
<p style="font-size:16px">The blue cluster contains only small cities with a population between ~50,000 and ~70,000. In this cluster, venues that are more focused on locals, such as supermarkets, drugstores and fitness centers, are prevalent.</p>

<p style="font-size:16px">The green and purple clusters can be considered as something in between the blue and red clusters. Both contain small to medium-sized cities with a population between ~50,000 and ~200,000. The purple cluster seems to be slightly more similar to the large cities, with most of its cities having restaurant as the most common venue category. For cities in the green cluster, this is mostly the supermarket category. However, it is quite possible that the arbitrary radius of 3 kilometers influences the restaurant/supermarket ratio and, subsequently, the cluster arrangement.</p>

## Seaside towns share the same cluster

<p style="font-size:16px">As expected, the seaside cities Westland and Katwijk share the same orange cluster, since both have many beach-related venues.</p>

### References

[1] P. Vissers, “Nederlandse steden worden drukker, slimmer, rijker - dus ook exclusiever,” Trouw, 05-May-2019. [Online]. Available: https://www.trouw.nl/nieuws/nederlandse-steden-worden-drukker-slimmer-rijker-dus-ook-exclusiever~b3b71cd6. [Accessed: 03-Mar-2021]

[2] “Sterke groei in steden en randgemeenten verwacht,” CBS, 10-Sep-2019. [Online]. Available: https://www.cbs.nl/nl-nl/nieuws/2019/37/sterke-groei-in-steden-en-randgemeenten-verwacht. [Accessed: 03-Mar-2021]

[3] “Dit betaal je gemiddeld aan maandhuur in 33 Nederlandse steden op de vrije markt,” Business Insider Nederland, 21-Jan-2020. [Online]. Available: https://www.businessinsider.nl/huur-prijs-maand-vrije-sector-steden-nederland-randstad/. [Accessed: 03-Mar-2021]