# The Battle of the Neighborhoods: Is Copenhagen like Paris?

## Applied Data Science Capstone, Coursera/IBM

**Author: Paw Hermansen (https://pawhermansen.dk)**  
**Date: November 9, 2018**

This notebook contains the first part of my Capstone Project for the Coursera/IBM course series *IBM Data Science Professional Certificate Specialization*.

# Import Python Code Libraries

In [1]:
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import os
import numpy as np
import pandas as pd
import requests
import csv
import folium
import geopy.distance

# 1. The Problem

Travel guide publisher *Lonely Planet* recently put the Danish capital Copenhagen at the top of its list of the best cities to visit in 2019. Copenhagen has a lot to offer, says *Lonely Planet*, mentioning the cyclists, the many green spaces, the old and new architecture, the great museums, Tivoli Gardens, the galleries, and the restaurants, including fancy New Nordic restaurants and even marvelous street food markets and indie bars.

That made some Copenhageners claim in the local newspapers that Copenhagen is like Paris in the summer. If this is true, it is interesting not only for tourists looking for new and exciting destinations but also for the Copenhagen tourist association, visitcopenhagen.dk, which could direct its marketing to compete directly against cities like Paris.

It is not stated exactly in what way Copenhagen and Paris are supposed to be alike. It is clearly not the weather, because the Copenhageners compare only Copenhagen in the summer to Paris and do not mention the winter. It is clearly not the language either, even though Copenhagen and Paris are alike in that both speak a language that is totally incomprehensible to anyone else. In Copenhagen, however, nearly everyone also speaks fluent English, which is certainly not the case in Paris.

The likeness between Copenhagen and Paris is probably more a feeling that, when you walk around in either city, you see the same kinds and the same distribution of restaurants, bars, sights, bakeries and all other kinds of venues. This is the definition of likeness that I choose to investigate.

This notebook uses tools from Data Science and Machine Learning to investigate whether Copenhagen is like Paris in the above-mentioned sense.

# 2. Machine Learning Approach

My approach is to divide each of the two cities into neighborhoods that I consider homogeneous with respect to their venue types.

To see how alike the Copenhagen and Paris neighborhoods are, I will then run a cluster analysis of all the neighborhoods together, based on the frequencies of their most frequently occurring venue types. For example, if I cluster all the neighborhoods into two or more clusters and all the Copenhagen neighborhoods end up in their own clusters while all the Paris neighborhoods end up in other clusters, then the neighborhoods of each city are more like each other than like those of the other city. But if the clusters mix neighborhoods from the two cities, then there is good reason to claim that Copenhagen, or at least some neighborhoods of Copenhagen, is like Paris.

If the two cities do in fact have neighborhoods that are alike, then the clustering will also show exactly which neighborhoods from the two cities are most like each other.
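The reading of the clusters described above can be sketched as follows. This is only a minimal illustration that assumes the clustering has already assigned a label to every neighborhood; the labels below are invented and are not results from this notebook, and `summarize_clusters` and `mixed_clusters` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def summarize_clusters(assignments):
    """Count neighborhoods per city in each cluster.

    assignments: iterable of (neighborhood, city, cluster_label) tuples.
    Returns {cluster_label: Counter({city: count})}.
    """
    summary = defaultdict(Counter)
    for _neighborhood, city, label in assignments:
        summary[label][city] += 1
    return dict(summary)

def mixed_clusters(summary):
    """Return the labels of clusters that contain more than one city."""
    return sorted(label for label, cities in summary.items() if len(cities) > 1)

# Invented example labels, NOT output from this notebook.
example = [
    ("Indre By", "København", 0),
    ("Louvre", "Paris", 0),
    ("Bourse", "Paris", 0),
    ("Valby", "København", 1),
    ("Vanløse", "København", 1),
    ("Passy", "Paris", 2),
]

summary = summarize_clusters(example)
for label in sorted(summary):
    print(label, dict(summary[label]))
print("Mixed clusters:", mixed_clusters(summary))
```

In this invented example only cluster 0 mixes the two cities, which is the situation that would support the claim that some Copenhagen neighborhoods are like Paris.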

# 3. Data Requirements

Two sets of data are needed for each city for the cluster analysis: the neighborhoods and the venue types within each neighborhood.

## The neighborhoods

To be used in the clustering algorithm, the venue types must be brought into so-called one-hot encoding, from which their frequency within each neighborhood can be computed.

Comparing two neighborhoods fairly requires more than just their one or two most frequent venue types. I decide to compare the ten most frequent venue types within each neighborhood.

The neighborhoods should be small enough to be close to homogeneous with respect to their venue types, but large enough to contain enough venues and venue types that it makes sense to take the ten most frequent venue types. My guess is that each neighborhood needs at least about 25 venues.
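To make the idea of venue type frequencies concrete, here is a small sketch using only the Python standard library. The venue categories are invented for one hypothetical neighborhood, and the helper names `category_frequencies` and `top_categories` are my own for this illustration. Counting categories and dividing by the number of venues gives the same numbers as one-hot encoding each venue and averaging the columns per neighborhood.

```python
from collections import Counter

def category_frequencies(categories):
    """Return {category: share of venues} for one neighborhood."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def top_categories(categories, n=10):
    """Return the n most frequent categories, most frequent first."""
    return [category for category, _ in Counter(categories).most_common(n)]

# Invented venue categories for one hypothetical neighborhood.
venue_categories = ["Café", "Bakery", "Café", "Park", "Café", "Bakery", "Bar", "Café"]

freqs = category_frequencies(venue_categories)
print(freqs["Café"])                          # share of venues that are cafés
print(top_categories(venue_categories, n=2))  # the two most frequent categories
```

With only a handful of venues the tail of a top-ten list is decided by single venues, which is the reason for requiring a minimum number of venues per neighborhood.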

The neighborhoods can be found in several ways. One way is to base them on postal codes, but this might not define neighborhoods that align with the common tourist point of view. Instead, the neighborhoods can be scraped from tourist websites, and their locations can be found from Wikipedia or from their addresses through geocoding services. The Paris data might be available from OpenData Paris.

## The Venue types within each neighborhood

Foursquare is a service for finding the best places to eat, drink, shop, or visit in any city in the world. It also offers access through an open API, with some limitations; registration is required.

We can ask the Foursquare API for a list of venues and their types within a certain distance from any location in Copenhagen and Paris. This means that for our purpose each neighborhood is defined by a center location and a radius around this center.
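A neighborhood defined as a center and a radius can be sketched as a small data structure. The class `SearchArea`, the function `haversine_m` and the coordinates are assumptions made for this illustration only. The haversine formula treats the Earth as a sphere, so it is an approximation of the ellipsoidal distance that geopy computes later in the notebook.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

@dataclass
class SearchArea:
    """A neighborhood approximated by a circle around its center."""
    name: str
    latitude: float
    longitude: float
    radius_m: float

    def contains(self, lat, lon):
        """True if the point lies within radius_m of the center."""
        return haversine_m(self.latitude, self.longitude, lat, lon) <= self.radius_m

# Illustrative center and radius, not values used elsewhere in this notebook.
area = SearchArea("Example center", 55.6761, 12.5683, 1000.0)
print(area.contains(55.6800, 12.5700))  # a point a few hundred meters away
print(area.contains(55.7200, 12.5700))  # a point several kilometers away
```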

# 4. Data Collection

## The neighborhoods of Paris

Paris is divided into 20 so-called arrondissements, administrative zones that fit most tourists' view of the neighborhoods of Paris. The name and the geographical coordinates of each arrondissement can be downloaded in different formats from the [Paris Data](https://opendata.paris.fr) website - the data is covered by the [Open Database License (ODbL)](https://opendatacommons.org/licenses/odbl/).

In [2]:
arrondissementsUrl = "https://opendata.paris.fr/explore/dataset/arrondissements/download"
df_arrondissements = pd.read_csv(arrondissementsUrl, sep=';')

print("Number of rows = {}, number of columns = {}".format(df_arrondissements.shape[0], df_arrondissements.shape[1]))
df_arrondissements.head(3)

Number of rows = 20, number of columns = 12


Unnamed: 0,n_sq_ar,c_ar,c_arinsee,l_ar,l_aroff,n_sq_co,surface,perimetre,geom_x_y,geom,objectid,longueur
0,750000002,2,75102,2ème Ardt,Bourse,750001537,991153.7,4554.10436,"48.8682792225, 2.34280254689","{""type"": ""Polygon"", ""coordinates"": [[[2.351518...",2,4553.938764
1,750000003,3,75103,3ème Ardt,Temple,750001537,1170883.0,4519.263648,"48.86287238, 2.3600009859","{""type"": ""Polygon"", ""coordinates"": [[[2.363828...",3,4519.071982
2,750000012,12,75112,12ème Ardt,Reuilly,750001537,16314780.0,24089.666298,"48.8349743815, 2.42132490078","{""type"": ""Polygon"", ""coordinates"": [[[2.413879...",12,24088.038922


I select the name and location fields.

In [3]:
names = df_arrondissements['l_aroff'].str.strip()

coordinates = df_arrondissements['geom_x_y'].str.split(',', expand=True)
coordinates[0] = pd.to_numeric(coordinates[0])
coordinates[1] = pd.to_numeric(coordinates[1])

df_parisNeighborhoods = pd.concat([names, coordinates], axis=1)
df_parisNeighborhoods.columns = ['Neighborhood', 'Latitude', 'Longitude']

df_parisNeighborhoods.to_csv('data/paris_neighborhoods.csv', quoting=csv.QUOTE_ALL, index=False)

print("Number of rows = {}, number of columns = {}".format(df_parisNeighborhoods.shape[0], df_parisNeighborhoods.shape[1]))
df_parisNeighborhoods.head(3)

Number of rows = 20, number of columns = 3


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Bourse,48.868279,2.342803
1,Temple,48.862872,2.360001
2,Reuilly,48.834974,2.421325


Later I will use Foursquare to find venues in each neighborhood. To return venues, Foursquare requires the latitude and longitude of each neighborhood center and a maximum distance from the center to search. This gives me a circular search-area centered in each neighborhood. I want my search-areas to be large enough to reflect the venues in the real neighborhoods, but I want to avoid too much overlap with other search-areas as this might dilute the result.

To achieve this, I compute the distance in meters from each neighborhood to its closest neighbor, and later I set the radius of each search-area to half of this distance. Finally I save the Paris neighborhoods to a local csv file.

In [4]:
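def nearestDistances(df):
    """Return, for each row, the distance in meters to the nearest other row.

    Assumes df has a default 0..n-1 index and 'Latitude'/'Longitude' columns.
    """
    nearest_distance = []
    for index, row in df.iterrows():
        minimumDist = float('inf')
        coords_1 = (row['Latitude'], row['Longitude'])
        for i in range(0, df.shape[0]):
            if i != index:
                coords_2 = (df.loc[i, 'Latitude'], df.loc[i, 'Longitude'])
                # geodesic() replaces vincenty(), which was removed in geopy 2.0;
                # results may differ marginally from the values printed below
                dist = geopy.distance.geodesic(coords_1, coords_2).m
                if dist < minimumDist:
                    minimumDist = dist
        nearest_distance.append(minimumDist)
    return nearest_distance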

df_parisNeighborhoods['Distance to Nearest'] = nearestDistances(df_parisNeighborhoods)

df_parisNeighborhoods.to_csv('data/paris_neighborhoods.csv', quoting=csv.QUOTE_ALL, index=False)

print("Number of neighborhoods is", df_parisNeighborhoods.shape[0])
df_parisNeighborhoods

Number of neighborhoods is 20


Unnamed: 0,Neighborhood,Latitude,Longitude,Distance to Nearest
0,Bourse,48.868279,2.342803,788.555891
1,Temple,48.862872,2.360001,964.527702
2,Reuilly,48.834974,2.421325,3495.637308
3,Louvre,48.862563,2.336443,788.555891
4,Hôtel-de-Ville,48.854341,2.35763,964.527702
5,Élysée,48.872721,2.312554,1678.621184
6,Observatoire,48.829245,2.326542,2260.122186
7,Buttes-Chaumont,48.887076,2.384821,2145.755629
8,Ménilmontant,48.863461,2.401188,1625.823751
9,Luxembourg,48.84913,2.332898,1407.726786
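A short argument, added here as a sanity check, for why half the nearest-neighbor distance is a safe search radius. Let $d_{ij}$ be the distance between the centers of neighborhoods $i$ and $j$, let $d_i = \min_{j \ne i} d_{ij}$ be the value in the *Distance to Nearest* column, and let $r_i = d_i / 2$ be the search radius. Since $d_i \le d_{ij}$ and $d_j \le d_{ij}$ for any pair,

$$
r_i + r_j = \frac{d_i + d_j}{2} \le \frac{d_{ij} + d_{ij}}{2} = d_{ij}
$$

so two search-areas can touch but never overlap, treating the city as flat at this scale. In the table above, Bourse and Louvre appear to be each other's nearest neighbors at about 788.6 m, so both get a radius of about 394.3 m and their circles just touch.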


My search-areas for the neighborhoods of Paris look like this:

In [5]:
def createTownMap(df, zoom = 12):
    # Create map centered around the mean latitude and longitude values
    latitude = df['Latitude'].mean()
    longitude = df['Longitude'].mean()

    townmap = folium.Map(location=[latitude, longitude], zoom_start=zoom)

    # Add the search-areas to map.
    for lat, lng, neighborhood, radius in zip(df['Latitude'],
                                              df['Longitude'],
                                              df['Neighborhood'],
                                              df['Distance to Nearest'] / 2):
        label = folium.Popup(neighborhood, parse_html=True)

        folium.Marker(
            [lat, lng],
            popup = neighborhood).add_to(townmap) 
        
        folium.Circle(
            radius=radius,
            location=[lat, lng],
            popup=label,
            color='blue',
            stroke= False,
            fill=True,
            fill_opacity=0.2).add_to(townmap)
    return townmap

In [6]:
createTownMap(df_parisNeighborhoods)

## The neighborhoods of Copenhagen

The Copenhagen neighborhoods are a little more difficult to get. From the Wikipedia page [Bydele i Københavns Kommune](https://da.wikipedia.org/wiki/Bydele_i_K%C3%B8benhavns_Kommune) ("neighborhoods in Copenhagen Municipality") I collect the ten administrative areas of Copenhagen. They seem to fit most tourists' view of the neighborhoods of Copenhagen.

The neighborhood *Indre By* ("Inner City") can be subdivided into smaller functional neighborhoods, but it turns out that Foursquare, which I will use later, has too few venues registered for some of the smaller neighborhoods, so I keep *Indre By* as one neighborhood.

I also include *Frederiksberg*, which is not administratively part of Copenhagen Municipality but geographically lies inside the borders of Copenhagen (see [https://www.quora.com/Why-is-Frederiksberg-not-a-part-of-Copenhagen](https://www.quora.com/Why-is-Frederiksberg-not-a-part-of-Copenhagen) for more information about this curiosity).

I get the data from the Wikipedia page and add *Frederiksberg* by hand.

In [7]:
baseUrl = 'https://da.wikipedia.org/'
url = baseUrl + 'wiki/Bydele_i_K%C3%B8benhavns_Kommune'

sauce = requests.get(url).content
soup = BeautifulSoup(sauce, 'lxml')

table = soup.find('table', {'class': 'navbox'})
td = table.find('td', {'class': 'navbox-list'})
links = td.find_all('a', href=True)

cphNames = []
for link in links:
    cphNames.append(link.text.strip())

cphNames.append('Frederiksberg')

print("Number of neighborhoods is " + str(len(cphNames)))
cphNames

Number of neighborhoods is 11


['Amager Vest',
 'Amager Øst',
 'Bispebjerg',
 'Brønshøj-Husum',
 'Indre By',
 'Nørrebro',
 'Valby',
 'Vanløse',
 'Vesterbro/Kongens Enghave',
 'Østerbro',
 'Frederiksberg']

I am going to use the Nominatim geocoder to find geographical coordinates near the center of each neighborhood. However, not all of the neighborhood names are found, and some that are found return coordinates far from the center.

I fix this by adding an address for each neighborhood and setting the address to a name that Nominatim can find.

In [8]:
address = cphNames.copy()

address[address.index('Amager Vest')] = 'Vestamager'
address[address.index('Amager Øst')] = 'Øresundsvej'
address[address.index('Vesterbro/Kongens Enghave')] = 'Sønder Boulevard'
address[address.index('Indre By')] = 'Kongens Nytorv'

df_cphNeighborhoods = pd.DataFrame({'Neighborhood': cphNames, 'Address': address})
df_cphNeighborhoods['Latitude'] = np.nan
df_cphNeighborhoods['Longitude'] = np.nan

print("Number of rows = {}, number of columns = {}".format(df_cphNeighborhoods.shape[0], df_cphNeighborhoods.shape[1]))
df_cphNeighborhoods.head(3)

Number of rows = 11, number of columns = 4


Unnamed: 0,Neighborhood,Address,Latitude,Longitude
0,Amager Vest,Vestamager,,
1,Amager Øst,Øresundsvej,,
2,Bispebjerg,Bispebjerg,,


I use Nominatim to find and add the geographical locations of the neighborhoods.

In [9]:
# Create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent='dk.pawhermansen')

for name in df_cphNeighborhoods['Address']:
    location = geolocator.geocode(name + ', København, Danmark')
    if location is not None:
        isCurrent = df_cphNeighborhoods['Address'] == name
        df_cphNeighborhoods.loc[isCurrent, 'Latitude'] = location.latitude
        df_cphNeighborhoods.loc[isCurrent, 'Longitude'] = location.longitude

print("Number of rows = {}, number of columns = {}".format(df_cphNeighborhoods.shape[0], df_cphNeighborhoods.shape[1]))
df_cphNeighborhoods.head(3)

Number of rows = 11, number of columns = 4


Unnamed: 0,Neighborhood,Address,Latitude,Longitude
0,Amager Vest,Vestamager,55.619371,12.575584
1,Amager Øst,Øresundsvej,55.661537,12.626487
2,Bispebjerg,Bispebjerg,55.71095,12.534


As explained for the Paris neighborhoods, I want to find an appropriate size for the search-areas, so I compute the distance in meters from each neighborhood to its closest neighbor. Finally I remove the address column again and save the Copenhagen neighborhoods to a local csv file.

In [10]:
df_cphNeighborhoods['Distance to Nearest'] = nearestDistances(df_cphNeighborhoods)
df_cphNeighborhoods = df_cphNeighborhoods.drop(columns=['Address'])

df_cphNeighborhoods.to_csv('data/cph_neighborhoods.csv', quoting=csv.QUOTE_ALL, index=False)

print("Number of neighborhoods is", df_cphNeighborhoods.shape[0])
df_cphNeighborhoods

Number of neighborhoods is 11


Unnamed: 0,Neighborhood,Latitude,Longitude,Distance to Nearest
0,Amager Vest,55.619371,12.575584,5364.845473
1,Amager Øst,55.661537,12.626487,3372.547947
2,Bispebjerg,55.71095,12.534,1812.305497
3,Brønshøj-Husum,55.704536,12.501445,2167.519938
4,Indre By,55.680889,12.585253,2698.854219
5,Nørrebro,55.695894,12.544956,1812.305497
6,Valby,55.661802,12.516952,2056.828338
7,Vanløse,55.685625,12.488809,2250.439743
8,Vesterbro/Kongens Enghave,55.665257,12.549573,1776.504126
9,Østerbro,55.705084,12.582614,2579.354408


My search-areas for the neighborhoods of Copenhagen look like this:

In [11]:
createTownMap(df_cphNeighborhoods)

## Venues for the neighborhoods

### Define Foursquare Credentials and Version

I define my personal Foursquare credentials in environment variables on the machine where I execute the notebook and read them from there below.

In [12]:
CLIENT_ID = os.environ.get('FOURSQUARE_ID') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_SECRET') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# os.environ.get() returns None for a missing variable, so the check below
# also catches variables that are not set at all
if CLIENT_ID and CLIENT_SECRET:
    print('Got your credentials!')
else:
    raise ValueError('Foursquare credentials are missing - should be set as environment variables')

Got your credentials!


### Define function to call the Foursquare API and find venues nearby a given location

The function is copied from the *Applied Data Science Capstone* course material, with my own addition that joins the categories if a venue has more than one category instead of just taking the first one.

The parameters are lists, with one entry per neighborhood, of:
* *names*: the name of the location or neighborhood.
* *cities*: the name of the city the neighborhood belongs to.
* *latitudes*: the latitude of the location or neighborhood.
* *longitudes*: the longitude of the location or neighborhood.
* *radii*: the maximal distance in meters from the location to search for venues.

Returns a dataframe with a row for each venue found within the given radius of each location.

In [13]:
def getNearbyVenues(names, cities, latitudes, longitudes, radii):
    
    venues_list=[]
    for name, city, lat, lng, radius in zip(names, cities, latitudes, longitudes, radii):
        print(name)
            
        # create the API request URL
        LIMIT = 100  # max according to foursquare documentation
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            city,
            lat,
            lng,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            ', '.join([c['name'] for c in v['venue']['categories']])) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'City',
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return(nearby_venues)
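The parsing step inside *getNearbyVenues* can be tried offline without calling Foursquare. The sketch below assumes the nesting used in the function above (`response`, `groups`, `items`, `venue`); the helper name `flatten_items` and every value in `mock_response` are invented for this illustration.

```python
def flatten_items(name, city, lat, lng, items):
    """Turn the 'items' list of one explore response into flat row tuples."""
    return [
        (
            name,
            city,
            lat,
            lng,
            item["venue"]["name"],
            item["venue"]["location"]["lat"],
            item["venue"]["location"]["lng"],
            # join all category names instead of taking only the first one
            ", ".join(c["name"] for c in item["venue"]["categories"]),
        )
        for item in items
    ]

# Hand-made stand-in for requests.get(url).json(); all values are invented.
mock_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Café",
                               "location": {"lat": 55.68, "lng": 12.58},
                               "categories": [{"name": "Café"}, {"name": "Bakery"}]}},
                    {"venue": {"name": "Example Park",
                               "location": {"lat": 55.69, "lng": 12.57},
                               "categories": [{"name": "Park"}]}},
                ]
            }
        ]
    }
}

items = mock_response["response"]["groups"][0]["items"]
rows = flatten_items("Example Neighborhood", "Example City", 55.68, 12.58, items)
for row in rows:
    print(row)
```

The first invented venue has two categories, so its last field becomes a comma-separated string, which is the case that the later check for commas in *Venue Category* looks for.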

### Find the venues from Foursquare

First I create and save a combined table of all the neighborhoods.

In [14]:
df_parisNeighborhoods['City'] = 'Paris'
df_cphNeighborhoods['City'] = 'København'

# pd.concat() replaces DataFrame.append(), which was removed in pandas 2.0
df_neighborhoods = pd.concat([df_cphNeighborhoods, df_parisNeighborhoods]).reset_index(drop=True)
df_neighborhoods = df_neighborhoods[['Neighborhood', 'City', 'Latitude', 'Longitude', 'Distance to Nearest']]

df_neighborhoods.to_csv('data/neighborhoods.csv', quoting=csv.QUOTE_ALL, index=False)

print("Number of rows = {}, number of columns = {}".format(df_neighborhoods.shape[0], df_neighborhoods.shape[1]))
df_neighborhoods.head(3)

Number of rows = 31, number of columns = 5


Unnamed: 0,Neighborhood,City,Latitude,Longitude,Distance to Nearest
0,Amager Vest,København,55.619371,12.575584,5364.845473
1,Amager Øst,København,55.661537,12.626487,3372.547947
2,Bispebjerg,København,55.71095,12.534,1812.305497


Then I find the venues for each neighborhood.

In [15]:
df_venues = getNearbyVenues(df_neighborhoods['Neighborhood'],
                            df_neighborhoods['City'],
                            df_neighborhoods['Latitude'],
                            df_neighborhoods['Longitude'],
                            df_neighborhoods['Distance to Nearest'] / 2)

print()
print('Total number of found venues for all neighborhoods: ', df_venues.shape[0])
df_venues.head(3)

Amager Vest
Amager Øst
Bispebjerg
Brønshøj-Husum
Indre By
Nørrebro
Valby
Vanløse
Vesterbro/Kongens Enghave
Østerbro
Frederiksberg
Bourse
Temple
Reuilly
Louvre
Hôtel-de-Ville
Élysée
Observatoire
Buttes-Chaumont
Ménilmontant
Luxembourg
Opéra
Batignolles-Monceau
Vaugirard
Panthéon
Palais-Bourbon
Entrepôt
Popincourt
Gobelins
Passy
Buttes-Montmartre

Total number of found venues for all neighborhoods:  2774


Unnamed: 0,Neighborhood,City,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Amager Vest,København,55.619371,12.575584,Naturcenter Amager,55.61462,12.574853,Park
1,Amager Vest,København,55.619371,12.575584,8tallet,55.618353,12.571841,Building
2,Amager Vest,København,55.619371,12.575584,Royal Arena,55.625469,12.573884,Event Space


### No Multiple Categories were found

In the function *getNearbyVenues* defined earlier, I separate multiple categories in the *Venue Category* column with a comma. However, as the count of zero in the next cell shows, none of the results from Foursquare actually have multiple categories, so nothing more needs to be done.

In [16]:
df_venues['Venue Category'].str.contains(',').sum()

0

### 'Neighborhood' is a venue category

One of the venues returned by Foursquare has the venue category 'Neighborhood'.

In [17]:
df_venues[df_venues['Venue Category'] == 'Neighborhood']

Unnamed: 0,Neighborhood,City,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
605,Vesterbro/Kongens Enghave,København,55.665257,12.549573,Kødbyen,55.668336,12.558662,Neighborhood


Later in this notebook I will use one-hot encoding of the venue categories. One-hot encoding will create a column named 'Neighborhood' for this venue category. But I already use the column name 'Neighborhood' for the name of the neighborhood, and two different columns should not share the same name. To solve this, I rename the venue category 'Neighborhood' to 'Locality'.

In [18]:
df_venues['Venue Category'] = df_venues['Venue Category'].str.replace('Neighborhood', 'Locality')

df_venues[df_venues['Venue Category'] == 'Locality']

Unnamed: 0,Neighborhood,City,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
605,Vesterbro/Kongens Enghave,København,55.665257,12.549573,Kødbyen,55.668336,12.558662,Locality


### Save the Venues

Finally I can save the venues table to a local csv file.

In [19]:
df_venues.to_csv('data/venues.csv', quoting=csv.QUOTE_ALL, index=False)

print("Number of rows = {}, number of columns = {}".format(df_venues.shape[0], df_venues.shape[1]))
df_venues.head(3)

Number of rows = 2774, number of columns = 8


Unnamed: 0,Neighborhood,City,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Amager Vest,København,55.619371,12.575584,Naturcenter Amager,55.61462,12.574853,Park
1,Amager Vest,København,55.619371,12.575584,8tallet,55.618353,12.571841,Building
2,Amager Vest,København,55.619371,12.575584,Royal Arena,55.625469,12.573884,Event Space


# 5. Data Understanding

### More Paris neighborhoods than Copenhagen neighborhoods

It is clear from the lists of neighborhoods shown earlier that Paris has nearly twice as many neighborhoods as Copenhagen (20 against 11). This must of course be remembered later when comparing how many neighborhoods from each city are in the clusters found.

### Number of Unique Venues Categories

In [20]:
df_categories = pd.DataFrame(df_venues['Venue Category'].unique(), columns = ['Venue Category'])

print('There are {} unique categories.'.format(len(df_categories)))

There are 290 unique categories.


That is quite a lot of different venue categories, so there should be no problem expressing the differences between the neighborhoods.

On the other hand, the venue categories might be too detailed, for example with restaurants that are categorized by their cuisine's country of origin. After seeing the first results of the clustering, it might become relevant to consider whether, for example, a Scandinavian Restaurant in Copenhagen should be counted as different from a French Restaurant in Paris.

In [21]:
df_restaurantCategories = df_categories[df_categories['Venue Category'].str.contains("Restaurant")]

print('Number of different Restaurant categories in the venues data is', len(df_restaurantCategories))
df_restaurantCategories.head(10)

Number of different Restaurant categories in the venues data is 65


Unnamed: 0,Venue Category
9,Scandinavian Restaurant
11,Tapas Restaurant
20,Fast Food Restaurant
24,Indian Restaurant
25,Restaurant
46,Sushi Restaurant
56,Italian Restaurant
60,Thai Restaurant
63,Chinese Restaurant
86,Middle Eastern Restaurant


### Counting the found venues in each neighborhood

The following table shows how many venues Foursquare returned for each neighborhood.

Two of the neighborhoods have only around 30 venues, which is at the lower end of what each neighborhood needs. However, because it is only two neighborhoods and the rest have more than fifty venues each, I will use them as they are.

On the other hand, several neighborhoods have exactly 100 venues and no neighborhood has more than 100. This is caused by the Foursquare API, which returns at most 100 venues per request. I assume that the returned venues are representative of all the venues in the neighborhood and use them as they are.

In [22]:
df_venues.groupby('Neighborhood').size().reset_index(name='Venue count')

Unnamed: 0,Neighborhood,Venue count
0,Amager Vest,60
1,Amager Øst,100
2,Batignolles-Monceau,100
3,Bispebjerg,35
4,Bourse,100
5,Brønshøj-Husum,28
6,Buttes-Chaumont,80
7,Buttes-Montmartre,100
8,Entrepôt,100
9,Frederiksberg,93
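The two observations about the venue counts can be checked mechanically. The sketch below copies the ten rows shown in the table above into a plain dictionary; the threshold `MIN_VENUES` of fifty is an assumption chosen to match the remark that the remaining neighborhoods have more than fifty venues.

```python
API_LIMIT = 100   # maximum number of venues returned per request, as noted above
MIN_VENUES = 50   # illustrative threshold, an assumption for this sketch

# Counts copied from the rows of the table above.
venue_counts = {
    "Amager Vest": 60, "Amager Øst": 100, "Batignolles-Monceau": 100,
    "Bispebjerg": 35, "Bourse": 100, "Brønshøj-Husum": 28,
    "Buttes-Chaumont": 80, "Buttes-Montmartre": 100, "Entrepôt": 100,
    "Frederiksberg": 93,
}

# Neighborhoods where the API limit may have cut off the result
capped = sorted(name for name, count in venue_counts.items() if count >= API_LIMIT)
# Neighborhoods with few venues to base a top-ten list on
sparse = sorted(name for name, count in venue_counts.items() if count < MIN_VENUES)

print("At the API limit:", capped)
print("Below the threshold:", sparse)
```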


### Size of the Search Area and the Number of Venues from Foursquare

The table below shows that Foursquare returned almost three times as many venues per square kilometer for Paris as for Copenhagen.

This could indicate that Paris has more venues that are interesting enough to make it into Foursquare, but I think it is much more likely that the Foursquare app is more popular in France than in Denmark, and I assume that this has no influence on the results in this notebook.

In [23]:
df_parisSearchAreas = np.square(df_parisNeighborhoods['Distance to Nearest'] / 2) * 3.1416
df_cphSearchAreas = np.square(df_cphNeighborhoods['Distance to Nearest'] / 2) * 3.1416

df_venuesByCity = df_venues.groupby('City').size().reset_index(name='Venue count')
df_venuesByCity['Search Area'] = [df_cphSearchAreas.sum(), df_parisSearchAreas.sum()]
df_venuesByCity['Venues per km2'] = 1e6 * df_venuesByCity['Venue count'] / df_venuesByCity['Search Area']

df_venuesByCity

Unnamed: 0,City,Venue count,Search Area,Venues per km2
0,København,816,63591110.0,12.831982
1,Paris,1958,53602750.0,36.527975
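The arithmetic behind the table can be reproduced by hand. The sketch below uses the venue counts and total search areas printed above; the function names are my own, and `math.pi` is used where the cell above uses the rounded value 3.1416, which only matters when recomputing the areas.

```python
import math

def search_area_m2(distance_to_nearest_m):
    """Area in square meters of one circular search-area.

    The radius is half the distance to the nearest neighbor.
    """
    radius = distance_to_nearest_m / 2
    return math.pi * radius ** 2

def venues_per_km2(venue_count, area_m2):
    """Venue density per square kilometer (1 km2 = 1e6 m2)."""
    return 1e6 * venue_count / area_m2

# Totals copied from the table above.
cph_density = venues_per_km2(816, 63591110.0)
paris_density = venues_per_km2(1958, 53602750.0)
ratio = paris_density / cph_density

print(round(cph_density, 2), round(paris_density, 2), round(ratio, 2))
```

The ratio comes out a little below three, which is why the text speaks of almost three times as many venues per square kilometer in Paris.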
