In [1]:
%matplotlib inline
#!pip install folium geopy
import json
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

<center><h1>Venue Data Analysis of Thessaloniki</h1></center>

## A. Introduction

### A.1. Discussion

Thessaloniki is a Greek port city and the second-largest city in Greece, with over 1 million inhabitants and a population density of 7,100 residents per square kilometer. The city is divided into 20 districts. This analysis compares its neighborhoods and examines how they cluster together based on the venues they contain.

### A.2. Data Description

To address the problem, the following data is required:

  * Place data from the Thessaloniki Risk Data Portal [http://riskdata.thessaloniki.gr/].
  * The Foursquare API, to get the most common venues of each neighborhood of Thessaloniki.

In [None]:
!wget -q -O 'data.json' "http://riskdata.thessaloniki.gr/geoserver/wfs?srsName=EPSG%3A4326&typename=geonode%3Aall_places&outputFormat=json&version=1.0.0&service=WFS&request=GetFeature"
with open('data.json') as json_data:
    neighborhoods_data = json.load(json_data)
    neighborhoods_data = neighborhoods_data['features']

## B. Methodology

Our master data consists of the name, latitude, and longitude of each neighborhood in the city.

In [None]:
column_names = ['Neighborhood', 'Latitude', 'Longitude']
rows = []
for data in neighborhoods_data:
    # Keep only features tagged as neighbourhoods or suburbs
    if data['properties']['place'] in ['neighbourhood', 'suburb']:
        neighborhood_name = data['properties']['name']
        # GeoJSON stores point coordinates as [longitude, latitude]
        latlon = data['geometry']['coordinates']
        rows.append({'Neighborhood': neighborhood_name,
                     'Latitude': latlon[1],
                     'Longitude': latlon[0]})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods
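
The extraction above relies on the GeoJSON convention that point coordinates are stored as `[longitude, latitude]`. Below is a minimal sketch of the same filtering on a hand-made feature list. The two sample features and the helper name `features_to_rows` are illustrative assumptions, not records from the portal.

```python
def features_to_rows(features, place_types=("neighbourhood", "suburb")):
    """Turn GeoJSON point features into flat dicts, keeping only the given place types."""
    rows = []
    for feature in features:
        props = feature["properties"]
        if props.get("place") not in place_types:
            continue
        lon, lat = feature["geometry"]["coordinates"][:2]  # GeoJSON order: lon, lat
        rows.append({"Neighborhood": props["name"], "Latitude": lat, "Longitude": lon})
    return rows

# Hypothetical sample features, shaped like the entries of neighborhoods_data
sample_features = [
    {"properties": {"place": "suburb", "name": "Sample Suburb"},
     "geometry": {"type": "Point", "coordinates": [22.95, 40.63]}},
    {"properties": {"place": "village", "name": "Sample Village"},
     "geometry": {"type": "Point", "coordinates": [23.10, 40.70]}},
]
rows = features_to_rows(sample_features)
print(rows)
```

Only the feature tagged `suburb` survives the filter, and its latitude is taken from the second coordinate.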

Let's start by visualizing the geographic details of Thessaloniki and its neighborhoods. We create a map of Thessaloniki with neighborhoods superimposed on top.

In [None]:
loc = Nominatim(user_agent="saloniki_explorer").geocode('Thessaloniki, Greece')
map_saloniki = folium.Map(location=[loc.latitude, loc.longitude], zoom_start=12)
for lat, lon, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_saloniki)
map_saloniki

Using the Foursquare API, we explore the neighborhoods and segment them. For each neighborhood we request up to 100 venues within a radius of 600 meters of its latitude and longitude.
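
As a rough sketch of what such a request looks like, the query string can be assembled from a dictionary with the standard library. The endpoint and parameter names are the ones used in this notebook. The helper name `build_explore_url`, the placeholder credentials, and the sample coordinates are assumptions for illustration only.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, lat, lng,
                      radius=600, limit=100, version="20180605"):
    """Build the explore URL for one neighborhood center."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,   # search radius in meters
        "limit": limit,     # maximum number of venues requested
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials and coordinates, not real values
example_url = build_explore_url("MY_ID", "MY_SECRET", 40.64, 22.94)
print(example_url)
```

Building the query from a dictionary keeps each parameter next to its name and lets `urlencode` handle escaping.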

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
# CLIENT_ID and CLIENT_SECRET are expected to be defined in the hidden cell above
VERSION = '20180605'
radius=600
limit=100
venues_list=[]
for name, lat, lng in zip(neighborhoods['Neighborhood'], neighborhoods['Latitude'], neighborhoods['Longitude']):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng, 
        radius, 
        limit)
    results = requests.get(url).json()["response"]['groups'][0]['items']
    venues_list.append([(
        name, 
        lat, 
        lng, 
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        v['venue']['categories'][0]['name']) for v in results])
# Flatten the per-neighborhood lists into a single list of venue rows
saloniki_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
saloniki_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
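
The list comprehension above assumes every venue carries at least one category. Below is a hedged sketch of a more defensive flattening step, run on a hand-written dictionary that mimics the response shape used above. The sample payload and the helper name `extract_venues` are assumptions, not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into (neighborhood, ..., category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder when a venue has no category
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload with one categorized and one uncategorized venue
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 40.641, "lng": 22.941},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 40.642, "lng": 22.942},
               "categories": []}},
]}]}}
venue_rows = extract_venues(sample_payload, "Sample Neighborhood", 40.64, 22.94)
print(venue_rows)
```

An empty or malformed response then yields an empty list instead of raising inside the loop.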

In [None]:
print('There are {} unique venue categories.'.format(len(saloniki_venues['Venue Category'].unique())))

In [None]:
venue_count = saloniki_venues.groupby('Neighborhood').count()
# The Series index (neighborhood names) is used for the x axis
venue_count['Venue'].plot.bar(rot=90)

We see that Νεάπολη has the most venues. The following table lists the top 10 venue categories for each neighborhood.

In [None]:
saloniki_onehot = pd.get_dummies(saloniki_venues[['Venue Category']], prefix="", prefix_sep="")
saloniki_onehot['Neighborhood'] = saloniki_venues['Neighborhood'] 
fixed_columns = [saloniki_onehot.columns[-1]] + list(saloniki_onehot.columns[:-1])
saloniki_onehot = saloniki_onehot[fixed_columns]
saloniki_grouped = saloniki_onehot.groupby('Neighborhood').mean().reset_index()
num_top_venues = 10
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in range(num_top_venues):
    try:
        columns.append('{}{}'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th'.format(ind+1))
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = saloniki_grouped['Neighborhood']
for ind in range(saloniki_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(saloniki_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted
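
To make the ranking step concrete, here is a small sketch on made-up data. One-hot encoding followed by a per-neighborhood mean yields the share of each category, and sorting each row gives the most frequent categories. The toy venues and the neighborhood names `N1` and `N2` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["N1", "N1", "N1", "N2", "N2"],
    "Venue Category": ["Cafe", "Cafe", "Bar", "Bakery", "Bar"],
})

# One column per category, one row per venue
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of the indicator columns = relative frequency of each category
freq = onehot.groupby("Neighborhood").mean()

# Top 2 categories per neighborhood, most frequent first
top2 = {
    name: row.sort_values(ascending=False).index[:2].tolist()
    for name, row in freq.iterrows()
}
print(freq)
print(top2)
```

Because the values are shares rather than counts, neighborhoods with many venues and neighborhoods with few are compared on the same scale.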

Since the neighborhoods share some common venue categories, we can use K-means, an unsupervised learning algorithm, to cluster them. We first apply the elbow method to look for the optimal k.

In [None]:
saloniki_grouped_clustering = saloniki_grouped.drop(columns='Neighborhood')
inertias = []
for k in range(1,5):
    km = KMeans(n_clusters=k)
    km.fit(saloniki_grouped_clustering)
    inertias.append(km.inertia_)
plt.plot(range(1,5), inertias, 'bx-')
plt.show()

The plot shows no clear optimal number of clusters. Nevertheless, we will try 4 clusters.
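
For reference, this is roughly what a clear elbow looks like. The sketch below runs the same inertia loop on synthetic data with three well-separated groups. The blob centers and sample size are arbitrary assumptions, and the data has nothing to do with the Thessaloniki venues.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points around three hand-picked, well-separated centers
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 7))
toy_inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    toy_inertias.append(km.inertia_)

# Drop in inertia gained by each additional cluster
drops = [toy_inertias[i] - toy_inertias[i + 1] for i in range(len(ks) - 1)]
for k, drop in zip(ks[1:], drops):
    print("k={}: inertia falls by {:.1f}".format(k, drop))
```

On data like this the drop collapses after k=3, which is the elbow. When the drops shrink smoothly instead, the method gives no clear answer.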

In [None]:
kclusters = 4
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(saloniki_grouped_clustering)
kmeans.labels_[0:10] 

Here is the merged table with a cluster label for each neighborhood.

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
saloniki_merged = neighborhoods
saloniki_merged = saloniki_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
saloniki_merged
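
Note that `join` performs a left join by default, so a neighborhood for which no venues were returned keeps its row but gets missing values in the cluster and venue columns. Below is a toy sketch of that behavior with made-up names and labels.

```python
import pandas as pd

places = pd.DataFrame({"Neighborhood": ["N1", "N2", "N3"],
                       "Latitude": [40.60, 40.62, 40.64],
                       "Longitude": [22.90, 22.92, 22.94]})
# Cluster results exist only for N1 and N3
clustered = pd.DataFrame({"Cluster Labels": [0, 1],
                          "Neighborhood": ["N1", "N3"],
                          "1st": ["Cafe", "Bar"]})

merged = places.join(clustered.set_index("Neighborhood"), on="Neighborhood")
print(merged)

# Keep only rows that received a label and restore the integer dtype
labeled = merged.dropna(subset=["Cluster Labels"]).copy()
labeled["Cluster Labels"] = labeled["Cluster Labels"].astype(int)
print(labeled)
```

Missing values turn the label column into floats, so casting back to integers matters if the labels are later used as list indices.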

We can visualize the clustered neighborhoods on the map.

In [None]:
map_clusters = folium.Map(location=[loc.latitude, loc.longitude], zoom_start=12)
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
markers_colors = []
for lat, lon, poi, cluster in zip(saloniki_merged['Latitude'],
                                  saloniki_merged['Longitude'],
                                  saloniki_merged['Neighborhood'],
                                  saloniki_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
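
The color lookup can be reduced to one hex color per cluster label. Below is a minimal sketch, assuming four clusters as above. The variable name `palette` is illustrative.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

n_clusters = 4
# Sample the rainbow colormap at evenly spaced positions, one per cluster
rgba = cm.rainbow(np.linspace(0, 1, n_clusters))
palette = [colors.to_hex(c) for c in rgba]

# Label k maps directly to palette[k]
for label, color in enumerate(palette):
    print(label, color)
```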

We will now explore the clusters.

In [None]:
# Column 0 is the neighborhood name; columns 4 onward are the ranked venue categories
saloniki_merged.loc[saloniki_merged['Cluster Labels'] == 0, saloniki_merged.columns[[0] + list(range(4, saloniki_merged.shape[1]))]]

In [None]:
saloniki_merged.loc[saloniki_merged['Cluster Labels'] == 1, saloniki_merged.columns[[0] + list(range(4, saloniki_merged.shape[1]))]]

In [None]:
saloniki_merged.loc[saloniki_merged['Cluster Labels'] == 2, saloniki_merged.columns[[0] + list(range(4, saloniki_merged.shape[1]))]]

In [None]:
saloniki_merged.loc[saloniki_merged['Cluster Labels'] == 3, saloniki_merged.columns[[0] + list(range(4, saloniki_merged.shape[1]))]]

Examining the above tables we can label each cluster as follows:

  * Cluster 0 : "Food Venues"
  * Cluster 1 : "Bar & Food Venues"
  * Cluster 2 : "Social Venues"
  * Cluster 3 : "Market Venues"

## C. Results

Based on the analysis above, we find that the neighborhoods resemble one another in terms of the venues available.

## D. Discussion

Thessaloniki is a big city with a high population density in a narrow area. The number of data points and the population density can vary across its 20 districts. Because of this complexity, very different approaches can be tried in clustering and classification studies. It should be noted that not every method will yield results of the same quality.

The K-means algorithm was used as part of this clustering study. Testing with the elbow method, no optimum k value could be found. The following can be considered:

* Maybe K-means is the wrong algorithm for the problem.
* Maybe the preprocessing wasn't done correctly and better work is needed.
* Maybe there is a single cluster after all.
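
One possible way to probe these hypotheses, not used in this notebook, is the silhouette score, which is close to 1 for compact, well-separated clusters and near 0 when clusters overlap. The sketch below applies scikit-learn's `silhouette_score` to synthetic data with arbitrarily chosen blob centers. The same loop could be pointed at `saloniki_grouped_clustering`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print("k={}: silhouette={:.2f}".format(k, score))
print("best k:", best_k)
```

Scores that stay low for every k would be consistent with weak cluster structure, although the score cannot evaluate k=1 directly.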

## E. Conclusion

We found that more work is needed, perhaps with a different algorithm or different preprocessing. Alternatively, the neighborhoods may really be homogeneous.