# Concert halls in Switzerland : a territorial study

## Introduction

Culture is a major factor when measuring a city's attractivity, whether it is for attracting tourists, skilled people, or simply to increase the well being
of the inhabitants and their quality of living.

There are regular discussions about cultural fundings, and here in Switzerland, the language situation (with 4 different official languages, but we will probably not have 'Romanche' speaking cities, given it is a rural population, and represents less than one % to the total population), and the federal regime, with highly independant states (called canton), makes this issue even more closely followed.

One of the component of cultural activity, is the music scene, and one way to measure how lively is the musical scene in a city, is through the concert halls.

The following study, aims to compare the major cities in Swizerland in terms of concert halls per capita, to provide an insight to the federal authorities, music professionals or simply to people interested in the cultural state of things from a numbers perspective, on the current distribution of the venues between the cities, opening the door for more focused studies on the budget side of things, to see how the discrepancies, if any, between the different territories could be explained, solved.

By grouping the cities into distinct clusers, based on the number of venues per capita, but also the number of inhabitants, we should be able to provide the insight needed to explore further the issue.
Through a chloropeth map, we will also be able to see how the territory part, affects the distribution.

## Data sources

### I. Data on Cities 

Data on cities will include:

- Geospatial coordinates
- The state in which the city is located
- The population
- The name of the city

All this data will be scrapped from the website https://simplemaps.com/data/ch-cities as a downloadable csv.



### II. Data on concert halls

Based on foursquare, we will be able to retrieve the venues, located in the different cities.
Here is an example of concert hall as described in foursquare.
https://foursquare.com/v/lusine/4adcdab5f964a5209a5021e3
A concert hall, is considered a 'Category' and this data should be easy to extract using the API.

We will be only focusing on the number of venues, we will need to defines the radius used to include a venue in the city or not.


# Data processing

## I. Get the data on cities


In [None]:
#Import and install all the needed libraries

%pip install geopy
%pip install requests
!pip install folium

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import requests # library to handle requests
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
#let's drop all unwanted columns : iso2 / country / capital /population

cities_data = df_data_1.drop(['iso2', 'country','capital','population'], axis=1)

print(cities_data.shape)

In [None]:
#This step will save some API calls and some time

cities_data.tail()

cities_data = cities_data[cities_data['population_proper'].notna()]
print(cities_data.shape)

## II. Get the foursquare data

### Before gathering the foursquare data, let's first have a look of a map of Switzerland major cities

In [None]:
address = 'Switzerland, CH'

geolocator = Nominatim(user_agent="ch_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Switzerland are {}, {}.'.format(latitude, longitude))

In [None]:
import folium

# create map of Toronto using latitude and longitude values
map_ch = folium.Map(location=[latitude, longitude], zoom_start=8)

# add markers to map
for lat, lng, city in zip(cities_data['lat'], cities_data['lng'],cities_data['city']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ch)  
    
map_ch

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 5000 # define radius


url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&categoryId=4d4b7104d754a06370d81259,4bf58dd8d48988d1e5931735&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)


In [None]:
results = requests.get(url).json()
#results

print(results)

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
venues = results['response']['venues'][0]

print(venues)

nearby_venues = json_normalize(venues) # flatten JSON

print(nearby_venues)

In [None]:
LIMIT = 50
radius = 8000

In [None]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&categoryId=5032792091d4c4b30a586d5c,4bf58dd8d48988d1e5931735&ll={},{}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat, 
            lng, 
            v['name'],
            v['location']['lat'], 
            v['location']['lng'],
            v['categories'][0]['name'])
            for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['city',
                  'lat', 
                  'lng', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
interest_categories ='4d4b7104d754a06370d81259'

ch_venues = getNearbyVenues(names=cities_data['city'],
                                latitudes=cities_data['lat'],
                                longitudes=cities_data['lng'],
                                )

In [None]:
print(ch_venues.shape)
ch_venues.head(200)

In [None]:
ch_test = ch_venues.groupby('Venue Category').count()
ch_test_distinct = ch_venues['Venue Category'].unique()
print(ch_test_distinct)

### The category as interpreted by foursquare, is not standardized, meaning that the user who created the venue, affects one or multiple categories, what will happen here is that some data will be missing, cause the main category is not 'Concert Hall' or the subcategories that we choose to retain. <br>
### Also the data might be expired...


In [None]:
ch_concert_test = ch_venues[ch_venues['Venue Category'].isin(['Cultural Center','Rock Club','Concert Hall','Music Venue','Opera House','Performing Arts Venue','Nightclub','Arts & Entertainment','Piano Bar','Jazz Club'])].reset_index(drop=True)
#ch_concert_halls

In [None]:
ch_concert_test.head()
print(ch_concert_test.shape)

In [None]:
ch_concert_test.head(75)

### By looking at the data, we can see that we have several issues:

i. Duplicates, related to the 'radius' parameter used in our foursquare calls <br>
ii. Inclusions in the 'city' columns, of neighborhoods, rather than cities, specially for the main cities (Zurich, Geneva and Bern)

Given that the result set is small, we will try to clean it up manually before going further

In [None]:
#first let's remove the duplicates
ch_cleaned1 = ch_concert_test.drop_duplicates(subset=['Venue'], keep='first').reset_index(drop=True)
print(ch_cleaned1.shape)

print(ch_cleaned1.city.unique())

### Let's clean out the data, in order to avoid cities belonging to a bigger aglomeration in our table, once again we will be using data from
https://simplemaps.com/data/ch-cities


In [None]:
ch_cleaned1.head(100)

In [None]:
ch_test = ch_cleaned1
ch_test.city.replace(['Onex','Carouge','Meyrin'],'Geneva', inplace=True)
ch_test.head()
ch_test.city.unique()

## Now let's examine the results for Geneva, just to assess how reliable is the data, we have various way of doing so, but we will compare the foursqure results with data from a cultural agenda.
https://www.leprogramme.ch/concerts/Geneve

In [None]:

ch_geneva = ch_test[ch_test['city'] == 'Laufenburg']


#ch_test2 = ch_zurich.groupby('Venue Category').count()
#ch_test2_distinct = ch_zurich['Venue Category'].unique()
#print(ch_test2_distinct)
ch_geneva.head(100)

## Foursquare limits

### When comparing with the agenda mentionned above, some concert halls are missing, some are not listed in foursquare, others are listed but linked to cities that do not belong to our referential, or just not returned by the endpoint we decided to use.
Let's see what went wrongwith l'arena by making the call to a distinct endpoint and gather the json result.


In [None]:
# The code was removed by Watson Studio for sharing.

## Let's start the clustering process

In [None]:
#Let's count the number of venues
#df with the venues
ch_test.head()

ch_count = ch_test.groupby('city').count().reset_index()
ch_count.head()



In [None]:

#add population
cities_data.head()

df_for_clustering = cities_data.merge(ch_count,left_on='city',right_on='city')

df_for_clustering.head()

df_for_clustering.sort_values(by=['Venue'],ascending=False)

In [None]:
#let's keep only relevant data

df_for_clustering_final = df_for_clustering.drop(['lat_y','lng_y','Venue Latitude','Venue Longitude','Venue Category'],1).reset_index(drop=True)
df_for_clustering_final.head()

In [None]:
#Faire un petit histogram par admin
import matplotlib.pyplot as plt

df_for_histo = df_for_clustering_final.groupby(['admin_name','Venue']).count().reset_index()
df_for_histo2 = df_for_histo.sort_values(by =['Venue'],ascending=False)
df_for_histo2.head()


fig = plt.figure(figsize=(15, 10))
ax = fig.add_axes([0,0,1,1])
langs = df_for_histo2['admin_name']
students = df_for_histo2['Venue']
ax.barh(langs,students)
plt.show()


In [237]:
df_for_clustering_final.head()
df_for_clustering_last = df_for_clustering_final.drop('admin_name',1).reset_index(drop=True)
df_for_clustering_last.head()

df_for_clustering_final['bypop'] = df_for_clustering_final['Venue'] / df_for_clustering_final['population_proper']

#ecarter laufenburg quêst ce que cêst que #On va ajuster en rajoutant, venue par population
#cette merde
#ecarter du clustering
df_for_clustering_final.sort_values(by='bypop',ascending=False).head(100)


Unnamed: 0,Cluster Labels,city,lat_x,lng_x,admin_name,population_proper,Venue,bypop
44,1,Interlaken,46.6881,7.8646,Bern,5592.0,24,0.004292
45,4,Bad Zurzach,47.5872,8.2944,Aargau,4242.0,5,0.001179
17,4,Sion,46.2304,7.3661,Valais,34708.0,36,0.001037
39,4,Davos,46.8091,9.8398,Graubünden,10937.0,11,0.001006
37,4,Glarus,47.0331,9.0664,Glarus,12515.0,12,0.000959
29,4,Martigny,46.1,7.0728,Valais,17998.0,15,0.000833
34,0,Sierre,46.2918,7.532,Valais,16860.0,12,0.000712
43,0,Appenzell,47.3333,9.4167,Appenzell Innerrhoden,5649.0,4,0.000708
16,0,Chur,46.8521,9.5297,Graubünden,35038.0,23,0.000656
42,0,Altdorf,46.8806,8.6394,Uri,9401.0,6,0.000638


In [252]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 4


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_for_clustering_last[['Venue','population_proper','lat_x','lng_x']])
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(df_for_clustering_final[['bypop']])


# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
kmeans2.labels_[0:10]

array([3, 3, 3, 3, 3, 3, 3, 2, 2, 3], dtype=int32)

In [256]:


df_new = df_for_clustering_last.drop(['Cluster Labels'], axis= 1)

df_for_clustering_last.head()

#df_for_clustering_final.insert(0, 'Cluster Labels', kmeans2.labels_)

df_new.insert(0, 'Cluster Labels', kmeans.labels_)

df_new.sort_values(by='bypop').head(50)

Unnamed: 0,Cluster Labels,city,lat_x,lng_x,population_proper,Venue,bypop
18,1,Uster,47.3492,8.7192,34442.0,1,2.9e-05
22,1,Montreux,46.4333,6.9167,25984.0,1,3.8e-05
30,1,Muttenz,47.5228,7.6452,17805.0,1,5.6e-05
32,1,Grenchen,47.1931,7.3958,17140.0,1,5.8e-05
21,1,Rapperswil-Jona,47.2286,8.8317,26989.0,2,7.4e-05
38,1,Weinfelden,47.5698,9.112,11534.0,1,8.7e-05
0,2,Zürich,47.3786,8.54,434008.0,38,8.8e-05
5,3,Winterthur,47.4992,8.7267,109775.0,10,9.1e-05
9,3,Biel/Bienne,47.1372,7.2472,54456.0,5,9.2e-05
25,1,Riehen,47.5794,7.6512,21448.0,2,9.3e-05


In [234]:
import numpy as np

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=8)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_for_clustering_last['lat_x'], df_for_clustering_last['lng_x'], df_for_clustering_last['city'], df_for_clustering_last['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [235]:
import numpy as np

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=8)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_for_clustering_final['lat_x'], df_for_clustering_final['lng_x'], df_for_clustering_final['city'], df_for_clustering_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [257]:
import numpy as np

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=8)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_new['lat_x'], df_new['lng_x'], df_new['city'], df_new['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters