# Applied Data Science Capstone Final Project

## Idea
Moving to a new city is a big step, and choosing the right neighborhood is difficult: descriptions help, but they rarely convey a neighborhood's real character. This project aims to help people moving from Berlin to Munich, or vice versa, by clustering the neighborhoods of both cities in one pool to create groups with a similar character. If you know the city you are moving from, you should then be better able to estimate the character of the neighborhoods in the new one.

## Data
I will use lists of neighborhoods in the respective cities from the web (Wikipedia). The geographic coordinates of these neighborhoods are then looked up with the geopy module (Nominatim geocoder). A collection of venues around these locations is then requested from the Foursquare database to capture the character of each neighborhood.

# Munich

In [544]:
import pandas as pd
from geopy.geocoders import Nominatim 
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import folium
pd.options.display.max_columns = None
pd.options.display.max_rows = None

### Load Munich boroughs from Wikipedia and edit the names to make them compatible with geopy

In [545]:
boroughs_muc = pd.read_html('https://de.wikipedia.org/wiki/Liste_der_Stadtteile_M%C3%BCnchens')[0][['Stadtteil']]

In [546]:
boroughs_muc.drop([22],inplace=True)
boroughs_muc.loc[18] = 'Obergiesing'
boroughs_muc.loc[19] = 'Untergiesing'
boroughs_muc.loc[21] = 'Holzapfelkreuth'
boroughs_muc.loc[44] = 'Schwabing'
boroughs_muc.loc[45] = 'Nord Schwabing'
boroughs_muc.loc[47] = 'Sendling'
boroughs_muc.loc[48] = 'Untersendling'
boroughs_muc.loc[49] = 'Obersendling'


In [321]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
extra = pd.DataFrame({'Stadtteil':['Altschwabing','West Schwabing','Mittersendling']})
boroughs_muc = pd.concat([boroughs_muc, extra],ignore_index=True).sort_values(by='Stadtteil').reset_index(drop=True)

In [322]:
address = 'Munich, DE'
location_muc = Nominatim(user_agent="toronto").geocode(address)
longitude_muc = location_muc.longitude
latitude_muc = location_muc.latitude

In [323]:
geolocator = Nominatim(user_agent="toronto")  # create the geocoder once and reuse it
for stadtteil in boroughs_muc['Stadtteil']:
    print(stadtteil)
    address = str(stadtteil)+' Munich, DE'
    location = geolocator.geocode(address)
    if location is None:  # geocode returns None when no match is found
        continue
    boroughs_muc.loc[boroughs_muc['Stadtteil'] == stadtteil,'Latitude'] = location.latitude
    boroughs_muc.loc[boroughs_muc['Stadtteil'] == stadtteil,'Longitude'] = location.longitude

Allach
Altschwabing
Altstadt
Am Hart
Am Moosfeld
Am Riesenfeld
Au
Aubing
Berg am Laim
Bogenhausen
Daglfing
Denning
Englschalking
Fasangarten
Feldmoching
Forstenried
Freiham
Freimann
Fürstenried
Hadern
Haidhausen
Harlaching
Hasenbergl
Holzapfelkreuth
Isarvorstadt
Johanneskirchen
Laim
Langwied
Lehel
Lochhausen
Ludwigsvorstadt
Maxvorstadt
Milbertshofen
Mittersendling
Moosach
Neuhausen
Nord Schwabing
Nymphenburg
Oberföhring
Obergiesing
Obermenzing
Obersendling
Pasing
Perlach
Ramersdorf
Riem
Schwabing
Schwanthalerhöhe
Sendling
Solln
Steinhausen
Thalkirchen
Trudering
Untergiesing
Untermenzing
Untersendling
West Schwabing
Zamdorf
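
The geocoding loop above assumes every lookup succeeds. A minimal sketch of a more defensive variant follows, assuming a hypothetical helper `geocode_names` that takes any callable mapping an address string to an object with `latitude` and `longitude` attributes (or `None`). With geopy one could pass `Nominatim(...).geocode`, ideally wrapped in `geopy.extra.rate_limiter.RateLimiter` to respect the service's usage limits. The stub geocoder here only stands in for that call so the example runs offline.

```python
from types import SimpleNamespace

def geocode_names(names, geocode, suffix=" Munich, DE"):
    """Return (found, missing): coordinates per name and names without a result.

    `geocode` is any callable taking an address string and returning an object
    with .latitude/.longitude, or None when nothing was found.
    """
    found, missing = {}, []
    for name in names:
        location = geocode(f"{name}{suffix}")
        if location is None:
            missing.append(name)   # keep track instead of crashing
        else:
            found[name] = (location.latitude, location.longitude)
    return found, missing

# Stub geocoder standing in for Nominatim(...).geocode (hypothetical, offline)
_fake_db = {"Allach Munich, DE": SimpleNamespace(latitude=48.195994, longitude=11.457013)}

def fake_geocode(address):
    return _fake_db.get(address)

found, missing = geocode_names(["Allach", "Nowhere"], fake_geocode)
print(found)    # coordinates for the names that were resolved
print(missing)  # names that need manual renaming, as done for some boroughs earlier
```

Collecting the unresolved names makes it easy to see which borough names still need manual editing before the coordinates are saved.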


In [234]:
#boroughs_muc.to_csv('boroughs_muc.csv',index=False)

In [329]:
boroughs_muc = pd.read_csv('boroughs_muc.csv')

In [308]:
boroughs_muc.head()

Unnamed: 0,Stadtteil,Latitude,Longitude
0,Allach,48.195994,11.457013
1,Altschwabing,48.151286,11.572806
2,Altstadt,48.137108,11.575382
3,Am Hart,48.195925,11.571815
4,Am Moosfeld,48.133866,11.666309


In [309]:
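import os
# read the credentials from the environment instead of hard-coding them in the notebook
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumption)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret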
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Using the function from the lab to retrieve venues

In [310]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
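
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` for a venue without a category. As a hedged sketch, the hypothetical helper `parse_items` below performs the same extraction on an already decoded response and skips such venues. The sample payload is made up to mirror the nesting the function accesses (`response → groups → items → venue`); it is not real Foursquare output.

```python
def parse_items(payload, name, lat, lng):
    """Flatten the items of one explore-style response into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        if not categories:          # skip venues without a category
            continue
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     categories[0]["name"]))
    return rows

# Made-up payload with the same nesting as accessed in getNearbyVenues
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Bäckerei Schuhmair",
               "location": {"lat": 48.197175, "lng": 11.459016},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 48.19, "lng": 11.45},
               "categories": []}},
]}]}}

rows = parse_items(sample, "Allach", 48.195994, 11.457013)
print(rows)
```

Separating the parsing from the HTTP request also makes it possible to test the extraction without calling the API.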

In [330]:
names = boroughs_muc['Stadtteil']
latitudes = boroughs_muc['Latitude']
longitudes = boroughs_muc['Longitude']

#df = getNearbyVenues(names,latitudes,longitudes,radius=1000)


In [331]:
#df.to_csv('venues_muc.csv',index=False)

In [312]:
# errors='ignore' tolerates files saved with or without an index column
df = pd.read_csv('venues_muc.csv').drop(columns='Unnamed: 0', errors='ignore')

In [313]:
df.head(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Allach,48.195994,11.457013,Bäckerei Schuhmair,48.197175,11.459016,Bakery


Only use categories that occur more than 50 times

In [314]:
venue_count = df.groupby('Venue Category').size().sort_values(ascending=False)

In [315]:
df = df[df['Venue Category'].isin(venue_count[venue_count>50].index)]

Only consider boroughs with 10 or more entries

In [242]:
borough_count = df.groupby('Neighborhood').size().sort_values(ascending=False)

In [243]:
df = df[df['Neighborhood'].isin(borough_count[borough_count>=10].index)].reset_index(drop=True)
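
To make the two filters above concrete, here is a small sketch on a made-up table, with the thresholds lowered to 1 and 2 so that the toy data is affected. It assumes the same column names as `df`; `value_counts` is used as a shorthand for the `groupby(...).size()` calls above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "Venue Category": ["Bakery", "Café", "Café", "Bakery", "Zoo", "Café"],
})

# keep only categories that occur more than once overall
category_count = toy["Venue Category"].value_counts()
toy = toy[toy["Venue Category"].isin(category_count[category_count > 1].index)]

# keep only neighborhoods with at least 2 remaining venues
borough_count = toy["Neighborhood"].value_counts()
toy = toy[toy["Neighborhood"].isin(borough_count[borough_count >= 2].index)].reset_index(drop=True)

print(toy)
```

Note that the order of the filters matters: neighborhood "B" is dropped only because its "Zoo" venue was already removed by the category filter.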

One-hot encoding of the venue categories

In [244]:
one_hot = pd.get_dummies(df['Venue Category'])

In [245]:
df = pd.merge(df,one_hot,left_index=True,right_index=True,how='inner')

In [246]:
df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Bakery,Bus Stop,Café,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Plaza,Restaurant,Supermarket
0,Allach,48.195994,11.457013,Bäckerei Schuhmair,48.197175,11.459016,Bakery,1,0,0,0,0,0,0,0,0,0,0
1,Allach,48.195994,11.457013,Würmtalhof,48.188834,11.46068,German Restaurant,0,0,0,0,1,0,0,0,0,0,0
2,Allach,48.195994,11.457013,Westside Hotel,48.201045,11.458564,Hotel,0,0,0,0,0,1,0,0,0,0,0
3,Allach,48.195994,11.457013,Sicilia,48.193331,11.459387,Italian Restaurant,0,0,0,0,0,0,0,1,0,0,0
4,Allach,48.195994,11.457013,dm-drogerie markt,48.194118,11.46564,Drugstore,0,0,0,1,0,0,0,0,0,0,0


Get distribution of venue types

In [247]:
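# average the one-hot columns per neighborhood; selecting them by name avoids relying on
# column positions and avoids averaging non-numeric columns (an error in pandas >= 2.0)
venue_freq = df.groupby('Neighborhood')[one_hot.columns].mean()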

In [248]:
venue_freq.head()

Neighborhood,Bakery,Bus Stop,Café,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Plaza,Restaurant,Supermarket
Allach,0.2,0.1,0.0,0.2,0.1,0.1,0.0,0.1,0.0,0.0,0.2
Altschwabing,0.075,0.0,0.5,0.0,0.1,0.0,0.075,0.15,0.075,0.025,0.0
Altstadt,0.0,0.0,0.258065,0.032258,0.096774,0.258065,0.032258,0.032258,0.16129,0.129032,0.0
Am Hart,0.071429,0.214286,0.142857,0.071429,0.071429,0.142857,0.0,0.071429,0.0,0.071429,0.142857
Au,0.0,0.0,0.333333,0.0,0.111111,0.074074,0.074074,0.185185,0.185185,0.037037,0.0
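
The one-hot encoding followed by a per-neighborhood mean yields, for each neighborhood, the share of its venues that falls into each category, so every row sums to 1. A compact alternative sketch (not the approach used in this notebook) computes the same table with `pd.crosstab` and `normalize='index'`. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Café", "Café", "Bakery", "Hotel", "Bakery", "Bakery"],
})

# route used in the notebook: one-hot encode, then average per neighborhood
one_hot_toy = pd.get_dummies(toy["Venue Category"])
freq_mean = pd.concat([toy[["Neighborhood"]], one_hot_toy], axis=1).groupby("Neighborhood").mean()

# alternative: row-normalized contingency table
freq_crosstab = pd.crosstab(toy["Neighborhood"], toy["Venue Category"], normalize="index")

print(freq_mean)
print(freq_crosstab)
```

Both tables hold the same shares, which is a quick way to confirm that the features passed to the clustering step are relative frequencies rather than raw counts.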


Clustering in Munich

In [250]:
from sklearn.cluster import KMeans, DBSCAN
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
freq_copy = venue_freq.copy()

In [251]:
venue_freq = freq_copy.copy()

In [252]:
k = 4
cluster = KMeans(n_clusters=k).fit(venue_freq)
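
The value `k = 4` is chosen by hand. One common way to sanity-check such a choice, offered here only as a sketch and not as something this notebook does, is to compare silhouette scores for several values of `k`. The data below is synthetic (three well-separated blobs standing in for `venue_freq`); `random_state` and `n_init` are set only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# synthetic stand-in for venue_freq: three tight blobs in 2D
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    scores[n_clusters] = silhouette_score(X, labels)   # closer to 1 is better

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real venue frequencies the scores are usually far less clear-cut, so such a check is a hint rather than a decision rule.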

In [253]:
venue_freq['Cluster Label'] = cluster.labels_
venue_freq.reset_index(inplace=True)

In [254]:
boroughs_muc_used = boroughs_muc[boroughs_muc['Stadtteil'].isin(venue_freq['Neighborhood'].unique())]

In [255]:
venue_freq.head()
venue_freq_muc = venue_freq.copy()

Unnamed: 0,Neighborhood,Bakery,Bus Stop,Café,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Plaza,Restaurant,Supermarket,Cluster Label
0,Allach,0.2,0.1,0.0,0.2,0.1,0.1,0.0,0.1,0.0,0.0,0.2,2
1,Altschwabing,0.075,0.0,0.5,0.0,0.1,0.0,0.075,0.15,0.075,0.025,0.0,3
2,Altstadt,0.0,0.0,0.258065,0.032258,0.096774,0.258065,0.032258,0.032258,0.16129,0.129032,0.0,1
3,Am Hart,0.071429,0.214286,0.142857,0.071429,0.071429,0.142857,0.0,0.071429,0.0,0.071429,0.142857,0
4,Au,0.0,0.0,0.333333,0.0,0.111111,0.074074,0.074074,0.185185,0.185185,0.037037,0.0,3


In [256]:
# create map (centered on the Munich coordinates obtained above)
map_clusters = folium.Map(location=[latitude_muc, longitude_muc], zoom_start=11)

# set color scheme for the clusters (k+1 colors; the markers index it with cluster-1)
colors_array = cm.jet(np.linspace(0, 1, k+1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boroughs_muc_used['Latitude'], boroughs_muc_used['Longitude'], venue_freq['Neighborhood'], venue_freq['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
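
The color lookup in the map cell builds `k+1` colors and indexes them with `cluster-1`, so label 0 wraps around to the last entry. A more direct mapping, sketched here with the standard-library `colorsys` module instead of matplotlib's `jet` colormap (a deliberate substitution to keep the example dependency-free), assigns one evenly spaced hue per label. `cluster_colors` is a hypothetical helper name.

```python
import colorsys

def cluster_colors(k):
    """Return k visually distinct hex colors, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)  # evenly spaced hues
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_colors(4)
labels = [2, 3, 1, 0, 3]                 # the first five cluster labels from the Munich table
marker_colors = [palette[label] for label in labels]
print(marker_colors)
```

With this mapping, `palette[cluster]` can be passed as `color` and `fill_color` without any index offset.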

# Berlin

### Load Berlin boroughs from Wikipedia

In [257]:
boroughs_berlin = pd.read_html('https://de.wikipedia.org/wiki/Verwaltungsgliederung_Berlins')[1][['Ortsteil']]

In [258]:
boroughs_berlin.head()

Unnamed: 0,Ortsteil
0,Mitte
1,Moabit
2,Hansaviertel
3,Tiergarten
4,Wedding


In [285]:
address = 'Berlin, DE'
location_berlin = Nominatim(user_agent="Berlin").geocode(address)
longitude_berlin = location_berlin.longitude
latitude_berlin = location_berlin.latitude

In [260]:
geolocator = Nominatim(user_agent="Berlin")  # create the geocoder once and reuse it
for stadtteil in boroughs_berlin['Ortsteil']:
    print(stadtteil)
    address = str(stadtteil)+' Berlin, DE'
    location = geolocator.geocode(address)
    if location is None:  # geocode returns None when no match is found
        continue
    boroughs_berlin.loc[boroughs_berlin['Ortsteil'] == stadtteil,'Latitude'] = location.latitude
    boroughs_berlin.loc[boroughs_berlin['Ortsteil'] == stadtteil,'Longitude'] = location.longitude

Mitte
Moabit
Hansaviertel
Tiergarten
Wedding
Gesundbrunnen
Friedrichshain
Kreuzberg
Prenzlauer Berg
Weißensee
Blankenburg
Heinersdorf
Karow
Stadtrandsiedlung Malchow
Pankow
Blankenfelde
Buch
Französisch Buchholz
Niederschönhausen
Rosenthal
Wilhelmsruh
Charlottenburg
Wilmersdorf
Schmargendorf
Grunewald
Westend
Charlottenburg-Nord
Halensee
Spandau
Haselhorst
Siemensstadt
Staaken
Gatow
Kladow
Hakenfelde
Falkenhagener Feld
Wilhelmstadt
Steglitz
Lichterfelde
Lankwitz
Zehlendorf
Dahlem
Nikolassee
Wannsee
Schöneberg
Friedenau


KeyboardInterrupt: 

In [91]:
#boroughs_berlin.to_csv('boroughs_berlin.csv',index=False)

In [286]:
boroughs_berlin = pd.read_csv('boroughs_berlin.csv')

In [287]:
boroughs_berlin.head()

Unnamed: 0,Ortsteil,Latitude,Longitude
0,Mitte,52.51769,13.402376
1,Moabit,52.530102,13.342542
2,Hansaviertel,52.519123,13.341873
3,Tiergarten,52.509778,13.35726
4,Wedding,52.550123,13.34197


Using the function from the lab to retrieve venues

In [282]:
names = boroughs_berlin['Ortsteil']
latitudes = boroughs_berlin['Latitude']
longitudes = boroughs_berlin['Longitude']

#df = getNearbyVenues(names,latitudes,longitudes,radius=1000)

In [102]:
#df.to_csv('venues_berlin.csv',index=False)

In [288]:
df = pd.read_csv('venues_berlin.csv')

In [289]:
df.head(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Mitte,52.51769,13.402376,Lustgarten,52.518469,13.399454,Garden


Only use categories that occur more than 50 times

In [290]:
venue_count = df.groupby('Venue Category').size().sort_values(ascending=False)

In [291]:
df = df[df['Venue Category'].isin(venue_count[venue_count>50].index)]

Only consider boroughs with 10 or more entries

In [292]:
borough_count = df.groupby('Neighborhood').size().sort_values(ascending=False)

In [293]:
df = df[df['Neighborhood'].isin(borough_count[borough_count>=10].index)].reset_index(drop=True)

One-hot encoding of the venue categories

In [294]:
one_hot = pd.get_dummies(df['Venue Category'])

In [295]:
df = pd.merge(df,one_hot,left_index=True,right_index=True,how='inner')

In [296]:
df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Bakery,Bar,Bus Stop,Café,Coffee Shop,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Park,Plaza,Supermarket,Tram Station
0,Mitte,52.51769,13.402376,Radisson Blu,52.519561,13.402857,Hotel,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,Mitte,52.51769,13.402376,Nikolaikirchplatz,52.5167,13.406839,Plaza,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Mitte,52.51769,13.402376,James-Simon-Park,52.521907,13.399361,Park,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,Mitte,52.51769,13.402376,The Greens,52.515485,13.408987,Café,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,Mitte,52.51769,13.402376,Capri By Fraser Berlin,52.513972,13.404902,Hotel,0,0,0,0,0,0,0,1,0,0,0,0,0,0


Get distribution of venue types

In [297]:
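# average the one-hot columns per neighborhood; selecting them by name avoids relying on
# column positions and avoids averaging non-numeric columns (an error in pandas >= 2.0)
venue_freq = df.groupby('Neighborhood')[one_hot.columns].mean()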

In [298]:
venue_freq.head()

Neighborhood,Bakery,Bar,Bus Stop,Café,Coffee Shop,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Park,Plaza,Supermarket,Tram Station
Alt-Hohenschönhausen,0.0,0.071429,0.0,0.071429,0.071429,0.142857,0.071429,0.071429,0.071429,0.0,0.0,0.0,0.214286,0.214286
Alt-Treptow,0.107143,0.035714,0.107143,0.142857,0.0,0.035714,0.035714,0.0,0.035714,0.142857,0.178571,0.0,0.178571,0.0
Baumschulenweg,0.230769,0.0,0.0,0.153846,0.0,0.153846,0.0,0.0,0.076923,0.076923,0.153846,0.0,0.153846,0.0
Britz,0.076923,0.0,0.230769,0.0,0.0,0.0,0.076923,0.076923,0.0,0.076923,0.153846,0.0,0.307692,0.0
Charlottenburg,0.088235,0.029412,0.0,0.205882,0.058824,0.029412,0.088235,0.029412,0.029412,0.235294,0.029412,0.058824,0.117647,0.0


Clustering in Berlin

In [434]:
from sklearn.cluster import KMeans, DBSCAN
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib import pyplot as plt
freq_copy = venue_freq.copy()

In [300]:
venue_freq = freq_copy.copy()

In [301]:
k = 4
cluster = KMeans(n_clusters=k).fit(venue_freq)

In [302]:
venue_freq['Cluster Label'] = cluster.labels_
venue_freq.reset_index(inplace=True)

In [303]:
boroughs_berlin_used = boroughs_berlin[boroughs_berlin['Ortsteil'].isin(venue_freq['Neighborhood'].unique())].copy()

In [304]:
boroughs_berlin_used.sort_values(by='Ortsteil', inplace=True)

In [305]:
venue_freq.sort_values(by='Neighborhood', inplace=True)

In [306]:
# create map
map_clusters = folium.Map(location=[latitude_berlin, longitude_berlin], zoom_start=11)

# set color scheme for the clusters (k+1 colors; the markers index it with cluster-1)
colors_array = cm.jet(np.linspace(0, 1, k+1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boroughs_berlin_used['Latitude'], boroughs_berlin_used['Longitude'], venue_freq['Neighborhood'], venue_freq['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# Combined clustering

In [335]:
boroughs_muc = pd.read_csv('boroughs_muc.csv')
boroughs_muc.head(1)

Unnamed: 0,Stadtteil,Latitude,Longitude
0,Allach,48.195994,11.457013


In [336]:
venues_muc = pd.read_csv('venues_muc.csv')
venues_muc.head(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Allach,48.195994,11.457013,Bäckerei Schuhmair,48.197175,11.459016,Bakery


In [337]:
boroughs_ber = pd.read_csv('boroughs_berlin.csv')
boroughs_ber.head(1)

Unnamed: 0,Ortsteil,Latitude,Longitude
0,Mitte,52.51769,13.402376


In [338]:
venues_ber = pd.read_csv('venues_berlin.csv')
venues_ber.head(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Mitte,52.51769,13.402376,Lustgarten,52.518469,13.399454,Garden


In [339]:
boroughs_ber.rename({'Ortsteil':'Neighborhood'},axis=1,inplace=True)
boroughs_muc.rename({'Stadtteil':'Neighborhood'},axis=1,inplace=True)

In [445]:
venues = pd.concat([venues_ber,venues_muc])

In [446]:
venues.tail(1)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2889,Zamdorf,48.141649,11.638602,H Vollmannstraße,48.147437,11.62833,Bus Stop


In [447]:
boroughs = pd.concat([boroughs_ber, boroughs_muc])

In [448]:
boroughs.tail(1)

Unnamed: 0,Neighborhood,Latitude,Longitude
57,Zamdorf,48.141649,11.638602
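
After concatenation the tables no longer record which city a row came from, and a neighborhood name shared by both cities would be pooled into one group. A possible safeguard, sketched on a few rows and not part of the original analysis, is to tag each table with a `City` column (an assumed name) before `pd.concat` and to check for shared names.

```python
import pandas as pd

ber = pd.DataFrame({"Neighborhood": ["Mitte", "Moabit"], "Latitude": [52.51769, 52.530102]})
muc = pd.DataFrame({"Neighborhood": ["Allach", "Zamdorf"], "Latitude": [48.195994, 48.141649]})

# tag each table with its city, then stack them with a fresh index
combined = pd.concat(
    [ber.assign(City="Berlin"), muc.assign(City="Munich")],
    ignore_index=True,
)

# names that occur in more than one city would be pooled by groupby('Neighborhood')
cities_per_name = combined.groupby("Neighborhood")["City"].nunique()
shared_names = cities_per_name[cities_per_name > 1].index.tolist()

print(combined)
print("shared names:", shared_names)
```

Using `ignore_index=True` also avoids the duplicated index values that result from stacking two tables that each start at 0.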


In [470]:
venue_type_count = venues.groupby('Venue Category').size().sort_values(ascending=False)
used_venue_types = venue_type_count[venue_type_count > 100].index.values

In [471]:
used_venues = venues[venues['Venue Category'].isin(used_venue_types)]

In [472]:
venue_one_hot = pd.concat([used_venues[['Neighborhood']], pd.get_dummies(used_venues['Venue Category'])],axis=1)

In [473]:
venue_type_freq = venue_one_hot.groupby('Neighborhood').mean()

In [474]:
venue_type_freq.head()

Neighborhood,Bakery,Bar,Bus Stop,Café,Drugstore,German Restaurant,Hotel,Ice Cream Shop,Italian Restaurant,Park,Plaza,Restaurant,Supermarket
Adlershof,0.0,0.0,0.0,0.0,0.142857,0.142857,0.0,0.0,0.142857,0.142857,0.0,0.0,0.428571
Allach,0.181818,0.0,0.090909,0.0,0.181818,0.090909,0.090909,0.0,0.090909,0.090909,0.0,0.0,0.181818
Alt-Hohenschönhausen,0.0,0.1,0.0,0.1,0.2,0.1,0.1,0.1,0.0,0.0,0.0,0.0,0.3
Alt-Treptow,0.103448,0.034483,0.103448,0.137931,0.034483,0.034483,0.0,0.034483,0.137931,0.172414,0.0,0.034483,0.172414
Altglienicke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [538]:
k = 5
clusters = KMeans(n_clusters=k).fit(venue_type_freq)

In [539]:
labeled_boroughs = pd.concat([venue_type_freq.reset_index()['Neighborhood'],pd.Series(clusters.labels_,name='Labels')],axis=1)
labeled_boroughs.head(1)

Unnamed: 0,Neighborhood,Labels
0,Adlershof,4


In [540]:
boroughs_to_plot = pd.merge(boroughs, labeled_boroughs, left_on='Neighborhood', right_on='Neighborhood', how='inner')

In [541]:
boroughs_to_plot.head(1)

Unnamed: 0,Neighborhood,Latitude,Longitude,Labels
0,Mitte,52.51769,13.402376,2
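
Cluster numbers alone say little about what the groups mean. One way to characterize them, shown as a sketch on made-up frequencies rather than the real `venue_type_freq`, is to average the category shares per label and list the dominant categories of each cluster.

```python
import pandas as pd

freq = pd.DataFrame(
    {"Café": [0.5, 0.4, 0.0, 0.1],
     "Supermarket": [0.1, 0.2, 0.8, 0.6],
     "Hotel": [0.4, 0.4, 0.2, 0.3]},
    index=pd.Index(["A", "B", "C", "D"], name="Neighborhood"),
)
labels = pd.Series([0, 0, 1, 1], index=freq.index, name="Labels")

# mean category share per cluster
profile = freq.groupby(labels).mean()

# two most frequent categories per cluster
top_categories = {int(label): row.nlargest(2).index.tolist() for label, row in profile.iterrows()}

print(profile)
print(top_categories)
```

Applied to the real data, such a profile would indicate, for example, whether a cluster is dominated by cafés and restaurants or by supermarkets and bus stops, which is what a person comparing neighborhoods across the two cities would want to know.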


In [542]:
# create map
map_clusters = folium.Map(location=[latitude_muc, longitude_muc], zoom_start=11)

# set color scheme for the clusters (k+1 colors; the markers index it with cluster-1)
colors_array = cm.jet(np.linspace(0, 1, k+1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boroughs_to_plot['Latitude'], boroughs_to_plot['Longitude'], boroughs_to_plot['Neighborhood'], boroughs_to_plot['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [543]:
# create map
map_clusters = folium.Map(location=[latitude_berlin, longitude_berlin], zoom_start=11)

# set color scheme for the clusters (k+1 colors; the markers index it with cluster-1)
colors_array = cm.jet(np.linspace(0, 1, k+1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boroughs_to_plot['Latitude'], boroughs_to_plot['Longitude'], boroughs_to_plot['Neighborhood'], boroughs_to_plot['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters