# Applied Data Science Capstone Final Project

## Idea
Moving to a new city is always a big step. Choosing the right neighborhood is hard: descriptions help, but they rarely convey a neighborhood's real character. This project tries to help people moving from Berlin to Munich, or vice versa, make that decision by clustering the neighborhoods of both cities in one pool to create groups with a similar character. If you know the city you are moving from, you should then be better able to judge the character of the different neighborhoods in the new one.

## Data
I use lists of the neighborhoods in both cities taken from Wikipedia. The geographic coordinates of these neighborhoods are looked up with the geopy module. A collection of venues around each location is then requested from the Foursquare database to capture the character of the neighborhood.

## Methodology

Venues within a 1000 m radius of the center of each borough were collected. Only venue categories with more than 100 entries in total across both cities were considered. Each borough was described by its frequency distribution over these venue categories. A k-means clustering algorithm was then used to group the boroughs into clusters. Attempts with different values of k suggested k=4 as the best choice. With fewer clusters, boroughs that I know to be very different were grouped together. With more, the clear distinction between the groups was lost.

## Results

The numeric results and a discussion of the groups can be found at the bottom of this notebook.

A clear classification of boroughs in Berlin and Munich could be made, grouping them into shared clusters. Only one cluster was exclusive to a single city; every other cluster contains boroughs from both cities.

# Munich

In [586]:
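import numpy as np
import pandas as pd
import requests
import folium
import matplotlib.cm as cm          # colormap used for the cluster markers
import matplotlib.colors as colors  # rgb2hex conversion for the cluster markers
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

# show all rows and columns when displaying DataFrames
pd.options.display.max_columns = None
pd.options.display.max_rows = None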

### Load Munich boroughs from Wikipedia and edit the names to make them compatible with geopy

In [548]:
boroughs_muc = pd.read_html('https://de.wikipedia.org/wiki/Liste_der_Stadtteile_M%C3%BCnchens')[0][['Stadtteil']]

In [549]:
boroughs_muc.drop([22],inplace=True)
boroughs_muc.loc[18] = 'Obergiesing'
boroughs_muc.loc[19] = 'Untergiesing'
boroughs_muc.loc[21] = 'Holzapfelkreuth'
boroughs_muc.loc[44] = 'Schwabing'
boroughs_muc.loc[45] = 'Nord Schwabing'
boroughs_muc.loc[47] = 'Sendling'
boroughs_muc.loc[48] = 'Untersendling'
boroughs_muc.loc[49] = 'Obersendling'


In [550]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
boroughs_muc = pd.concat([boroughs_muc, pd.DataFrame({'Stadtteil':['Altschwabing','West Schwabing','Mittersendling']})],ignore_index=True).sort_values(by='Stadtteil').reset_index(drop=True)

### Get Munich coordinates

In [551]:
address = 'Munich, DE'
location_muc = Nominatim(user_agent="toronto").geocode(address)
longitude_muc = location_muc.longitude
latitude_muc = location_muc.latitude

### Get Munich boroughs coordinates

In [552]:
for stadtteil in boroughs_muc['Stadtteil']:
    print(stadtteil)
    address = str(stadtteil)+' Munich, DE'
    location = Nominatim(user_agent="toronto").geocode(address)
    boroughs_muc.loc[boroughs_muc['Stadtteil'] == stadtteil,'Latitude'] = location.latitude
    boroughs_muc.loc[boroughs_muc['Stadtteil'] == stadtteil,'Longitude'] = location.longitude

Allach
Altschwabing
Altstadt
Am Hart
Am Moosfeld
Am Riesenfeld
Au
Aubing
Berg am Laim
Bogenhausen
Daglfing
Denning
Englschalking
Fasangarten
Feldmoching
Forstenried
Freiham
Freimann
Fürstenried
Hadern
Haidhausen
Harlaching
Hasenbergl
Holzapfelkreuth
Isarvorstadt
Johanneskirchen
Laim
Langwied
Lehel
Lochhausen
Ludwigsvorstadt
Maxvorstadt
Milbertshofen
Mittersendling
Moosach
Neuhausen
Nord Schwabing
Nymphenburg
Oberföhring
Obergiesing
Obermenzing
Obersendling
Pasing
Perlach
Ramersdorf
Riem
Schwabing
Schwanthalerhöhe
Sendling
Solln
Steinhausen
Thalkirchen
Trudering
Untergiesing
Untermenzing
Untersendling
West Schwabing
Zamdorf
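
The loop above assumes that every lookup succeeds. geopy's `geocode` returns `None` when it finds no match, and `location.latitude` then raises an `AttributeError`. The following is a minimal sketch of a more defensive lookup, not the notebook's actual code: `geocode_names` and `fake_geocode` are hypothetical names introduced for illustration, and the stand-in geocoder replaces the real Nominatim call so that the example runs offline.

```python
from collections import namedtuple

# Stand-in for a geopy Location; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_names(names, geocode, suffix):
    """Look up each name; return found coordinates and the names that failed."""
    found, missing = {}, []
    for name in names:
        location = geocode(f"{name} {suffix}")
        if location is None:  # the geocoder found no match for this address
            missing.append(name)
        else:
            found[name] = (location.latitude, location.longitude)
    return found, missing

# Hypothetical offline geocoder for demonstration only. In the notebook this
# role would be played by Nominatim(user_agent=...).geocode.
def fake_geocode(address):
    known = {"Allach Munich, DE": Location(48.195994, 11.457013)}
    return known.get(address)

found, missing = geocode_names(["Allach", "Nowhere"], fake_geocode, "Munich, DE")
print(found)
print(missing)
```

Collecting the failed names instead of stopping at the first one makes it easier to see which borough names still need manual editing.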


In [553]:
boroughs_muc.head()

| | Stadtteil | Latitude | Longitude |
|---|---|---|---|
| 0 | Allach | 48.195994 | 11.457013 |
| 1 | Altschwabing | 48.151286 | 11.572806 |
| 2 | Altstadt | 48.137108 | 11.575382 |
| 3 | Am Hart | 48.195925 | 11.571815 |
| 4 | Am Moosfeld | 48.133867 | 11.666309 |


In [554]:
CLIENT_ID = 'remove' # your Foursquare ID
CLIENT_SECRET = 'removed' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

### Using the function from the lab to retrieve venues

In [555]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
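
The function above indexes straight into the response (`["response"]['groups'][0]['items']` and `categories[0]`), so a response without groups or a venue without a category would raise an exception. As a hedged sketch of how the parsing step could be separated out and made tolerant of such gaps, the hypothetical helper `extract_venues` below works on an already decoded payload. The sample payload is made up for illustration and only mimics the structure that the function above relies on.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one explore-style response into rows like those built above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing on an empty category list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Made-up payload; the second venue deliberately has no category.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Bäckerei Schuhmair",
               "location": {"lat": 48.197175, "lng": 11.459016},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Venue without category",
               "location": {"lat": 48.2, "lng": 11.46},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Allach", 48.195994, 11.457013)
for row in rows:
    print(row)
```

Keeping the parsing apart from the HTTP request also means it can be checked without API credentials.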

In [557]:
names = boroughs_muc['Stadtteil']
latitudes = boroughs_muc['Latitude']
longitudes = boroughs_muc['Longitude']

venues_muc = getNearbyVenues(names,latitudes,longitudes,radius=1000)

Allach
Altschwabing
Altstadt
Am Hart
Am Moosfeld
Am Riesenfeld
Au
Aubing
Berg am Laim
Bogenhausen
Daglfing
Denning
Englschalking
Fasangarten
Feldmoching
Forstenried
Freiham
Freimann
Fürstenried
Hadern
Haidhausen
Harlaching
Hasenbergl
Holzapfelkreuth
Isarvorstadt
Johanneskirchen
Laim
Langwied
Lehel
Lochhausen
Ludwigsvorstadt
Maxvorstadt
Milbertshofen
Mittersendling
Moosach
Neuhausen
Nord Schwabing
Nymphenburg
Oberföhring
Obergiesing
Obermenzing
Obersendling
Pasing
Perlach
Ramersdorf
Riem
Schwabing
Schwanthalerhöhe
Sendling
Solln
Steinhausen
Thalkirchen
Trudering
Untergiesing
Untermenzing
Untersendling
West Schwabing
Zamdorf


# Berlin

### Load Berlin boroughs from Wikipedia

In [561]:
boroughs_ber = pd.read_html('https://de.wikipedia.org/wiki/Verwaltungsgliederung_Berlins')[1][['Ortsteil']]

In [562]:
boroughs_ber.head()

| | Ortsteil |
|---|---|
| 0 | Mitte |
| 1 | Moabit |
| 2 | Hansaviertel |
| 3 | Tiergarten |
| 4 | Wedding |


### Get Berlin coordinates

In [563]:
address = 'Berlin, DE'
location_berlin = Nominatim(user_agent="Berlin").geocode(address)
longitude_berlin = location_berlin.longitude
latitude_berlin = location_berlin.latitude

### Get Berlin boroughs coordinates

In [571]:
for stadtteil in boroughs_ber['Ortsteil']:
    print(stadtteil)
    address = str(stadtteil)+' Berlin, DE'
    location = Nominatim(user_agent="Berlin").geocode(address)
    boroughs_ber.loc[boroughs_ber['Ortsteil'] == stadtteil,'Latitude'] = location.latitude
    boroughs_ber.loc[boroughs_ber['Ortsteil'] == stadtteil,'Longitude'] = location.longitude

Mitte
Moabit
Hansaviertel
Tiergarten
Wedding
Gesundbrunnen
Friedrichshain
Kreuzberg
Prenzlauer Berg
Weißensee
Blankenburg
Heinersdorf
Karow
Stadtrandsiedlung Malchow
Pankow
Blankenfelde
Buch
Französisch Buchholz
Niederschönhausen
Rosenthal
Wilhelmsruh
Charlottenburg
Wilmersdorf
Schmargendorf
Grunewald
Westend
Charlottenburg-Nord
Halensee
Spandau
Haselhorst
Siemensstadt
Staaken
Gatow
Kladow
Hakenfelde
Falkenhagener Feld
Wilhelmstadt
Steglitz
Lichterfelde
Lankwitz
Zehlendorf
Dahlem
Nikolassee
Wannsee
Schöneberg
Friedenau
Tempelhof
Mariendorf
Marienfelde
Lichtenrade
Neukölln
Britz
Buckow
Rudow
Gropiusstadt
Alt-Treptow
Plänterwald
Baumschulenweg
Johannisthal
Niederschöneweide
Altglienicke
Adlershof
Bohnsdorf
Oberschöneweide
Köpenick
Friedrichshagen
Rahnsdorf
Grünau
Müggelheim
Schmöckwitz
Marzahn
Biesdorf
Kaulsdorf
Mahlsdorf
Hellersdorf
Friedrichsfelde
Karlshorst
Lichtenberg
Falkenberg
Malchow
Wartenberg
Neu-Hohenschönhausen
Alt-Hohenschönhausen
Fennpfuhl
Rummelsburg
Reinickendorf
Tegel
Kon

In [572]:
boroughs_ber.head()

| | Ortsteil | Latitude | Longitude |
|---|---|---|---|
| 0 | Mitte | 52.51769 | 13.402376 |
| 1 | Moabit | 52.530102 | 13.342542 |
| 2 | Hansaviertel | 52.519123 | 13.341872 |
| 3 | Tiergarten | 52.509778 | 13.35726 |
| 4 | Wedding | 52.550123 | 13.34197 |


### Using the function from the lab to retrieve venues

In [138]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [573]:
names = boroughs_ber['Ortsteil']
latitudes = boroughs_ber['Latitude']
longitudes = boroughs_ber['Longitude']

venues_ber = getNearbyVenues(names,latitudes,longitudes,radius=1000)

Mitte
Moabit
Hansaviertel
Tiergarten
Wedding
Gesundbrunnen
Friedrichshain
Kreuzberg
Prenzlauer Berg
Weißensee
Blankenburg
Heinersdorf
Karow
Stadtrandsiedlung Malchow
Pankow
Blankenfelde
Buch
Französisch Buchholz
Niederschönhausen
Rosenthal
Wilhelmsruh
Charlottenburg
Wilmersdorf
Schmargendorf
Grunewald
Westend
Charlottenburg-Nord
Halensee
Spandau
Haselhorst
Siemensstadt
Staaken
Gatow
Kladow
Hakenfelde
Falkenhagener Feld
Wilhelmstadt
Steglitz
Lichterfelde
Lankwitz
Zehlendorf
Dahlem
Nikolassee
Wannsee
Schöneberg
Friedenau
Tempelhof
Mariendorf
Marienfelde
Lichtenrade
Neukölln
Britz
Buckow
Rudow
Gropiusstadt
Alt-Treptow
Plänterwald
Baumschulenweg
Johannisthal
Niederschöneweide
Altglienicke
Adlershof
Bohnsdorf
Oberschöneweide
Köpenick
Friedrichshagen
Rahnsdorf
Grünau
Müggelheim
Schmöckwitz
Marzahn
Biesdorf
Kaulsdorf
Mahlsdorf
Hellersdorf
Friedrichsfelde
Karlshorst
Lichtenberg
Falkenberg
Malchow
Wartenberg
Neu-Hohenschönhausen
Alt-Hohenschönhausen
Fennpfuhl
Rummelsburg
Reinickendorf
Tegel
Kon

# Combined clustering

### Check that all data is OK

In [574]:
boroughs_muc.head(1)

| | Stadtteil | Latitude | Longitude |
|---|---|---|---|
| 0 | Allach | 48.195994 | 11.457013 |


In [575]:
venues_muc.head(1)

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Allach | 48.195994 | 11.457013 | Bäckerei Schuhmair | 48.197175 | 11.459016 | Bakery |
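
One quick sanity check on the Allach row above is whether the venue really lies inside the 1000 m radius that was requested. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km. The function name `haversine_m` is an assumption made for this example and is not part of the notebook.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Borough center and venue coordinates taken from the Allach row above.
distance = haversine_m(48.195994, 11.457013, 48.197175, 11.459016)
print(round(distance), "m from the borough center")
print("inside 1000 m radius:", distance < 1000)
```

The bakery turns out to be a few hundred metres from the geocoded center, well inside the radius.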


In [576]:
boroughs_ber.head(1)

| | Ortsteil | Latitude | Longitude |
|---|---|---|---|
| 0 | Mitte | 52.51769 | 13.402376 |


In [577]:
venues_ber.head(1)

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Mitte | 52.51769 | 13.402376 | Lustgarten | 52.518469 | 13.399454 | Garden |


### Unify naming

In [578]:
boroughs_ber.rename({'Ortsteil':'Neighborhood'},axis=1,inplace=True)
boroughs_muc.rename({'Stadtteil':'Neighborhood'},axis=1,inplace=True)

### Combine the datasets

In [579]:
venues = pd.concat([venues_ber,venues_muc])

In [446]:
venues.tail(1)

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 2889 | Zamdorf | 48.141649 | 11.638602 | H Vollmannstraße | 48.147437 | 11.62833 | Bus Stop |


In [447]:
boroughs = pd.concat([boroughs_ber, boroughs_muc])

In [448]:
boroughs.tail(1)

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 57 | Zamdorf | 48.141649 | 11.638602 |


### Limit the data to venue categories that appear more than 100 times in total

In [580]:
venue_type_count = venues.groupby('Venue Category').size().sort_values(ascending=False)
used_venue_types = venue_type_count[venue_type_count > 100].index.values

In [581]:
used_venues = venues[venues['Venue Category'].isin(used_venue_types)]
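
Note that the comparison is strict: a category with exactly 100 entries is dropped. The toy example below shows the same filter with the standard library only, using a threshold of 2 instead of 100. The helper name `frequent_categories` and the toy data are assumptions made for illustration.

```python
from collections import Counter

def frequent_categories(categories, threshold):
    """Return the categories that occur strictly more than `threshold` times."""
    counts = Counter(categories)
    return {cat for cat, n in counts.items() if n > threshold}

# Toy data: 'Bakery' occurs 3 times, 'Bar' exactly 2 times, 'Zoo' once.
toy = ["Bakery", "Bakery", "Bakery", "Bar", "Bar", "Zoo"]
kept = frequent_categories(toy, threshold=2)
filtered = [c for c in toy if c in kept]

print(kept)      # {'Bakery'}
print(filtered)  # ['Bakery', 'Bakery', 'Bakery']
```

`Bar` sits exactly on the threshold and is therefore excluded, just as a category with exactly 100 venues would be in the notebook.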

### One hot encoding and calculating the frequency of venues

In [582]:
venue_one_hot = pd.concat([used_venues[['Neighborhood']], pd.get_dummies(used_venues['Venue Category'])],axis=1)

In [583]:
venue_type_freq = venue_one_hot.groupby('Neighborhood').mean()
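
Averaging the one-hot rows per neighborhood yields the share of each category among that neighborhood's retained venues, so every row sums to 1. A small sketch on made-up data illustrates this. The toy neighborhoods `A` and `B` are assumptions, and `pd.crosstab` with `normalize='index'` is shown only as an equivalent way to arrive at the same table.

```python
import pandas as pd

# Made-up venues: neighborhood A has four venues, B has two.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bakery", "Bakery", "Bar", "Park", "Bar", "Bar"],
})

# Same steps as above: one-hot encode, then average per neighborhood.
one_hot = pd.concat(
    [toy[["Neighborhood"]], pd.get_dummies(toy["Venue Category"], dtype=float)],
    axis=1,
)
freq = one_hot.groupby("Neighborhood").mean()

# Equivalent view: category counts normalised within each neighborhood.
alt = pd.crosstab(toy["Neighborhood"], toy["Venue Category"], normalize="index")

print(freq)
```

Because the frequencies are relative, a borough with 5 venues and one with 100 venues can end up with the same profile. Boroughs whose venues all fall into filtered-out categories do not appear in the result at all.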

In [584]:
venue_type_freq.head()

| Neighborhood | Bakery | Bar | Bus Stop | Café | Drugstore | German Restaurant | Hotel | Ice Cream Shop | Italian Restaurant | Park | Plaza | Restaurant | Supermarket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adlershof | 0.0 | 0.0 | 0.0 | 0.0 | 0.142857 | 0.142857 | 0.0 | 0.0 | 0.142857 | 0.142857 | 0.0 | 0.0 | 0.428571 |
| Allach | 0.181818 | 0.0 | 0.090909 | 0.0 | 0.181818 | 0.090909 | 0.090909 | 0.0 | 0.090909 | 0.090909 | 0.0 | 0.0 | 0.181818 |
| Alt-Hohenschönhausen | 0.0 | 0.1 | 0.0 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 |
| Alt-Treptow | 0.103448 | 0.034483 | 0.103448 | 0.137931 | 0.034483 | 0.034483 | 0.0 | 0.034483 | 0.137931 | 0.172414 | 0.0 | 0.034483 | 0.172414 |
| Altglienicke | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### Clustering

In [614]:
k = 5
clusters = KMeans(n_clusters=k).fit(venue_type_freq)
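
The number of clusters was chosen here by inspecting the resulting groups. As a complementary and purely illustrative check, which is not the method used in this notebook, the silhouette score can be compared across several candidate values. The sketch below runs on synthetic, well-separated data rather than on `venue_type_freq`. The variable names are assumptions, and `random_state` is fixed only so that the run is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the frequency table: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for candidate_k in range(2, 7):
    model = KMeans(n_clusters=candidate_k, n_init=10, random_state=0).fit(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[candidate_k] = silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
for candidate_k, score in scores.items():
    print(candidate_k, round(score, 3))
print("best k:", best_k)
```

On real venue data the scores are usually much closer together, so such a metric is best read alongside domain knowledge of the boroughs rather than instead of it. Fixing `random_state` in the notebook's own `KMeans` call would also keep the cluster numbering stable between runs.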

In [615]:
labeled_boroughs = pd.concat([venue_type_freq.reset_index()['Neighborhood'],pd.Series(clusters.labels_,name='Labels')],axis=1)
labeled_boroughs.head(1)

| | Neighborhood | Labels |
|---|---|---|
| 0 | Adlershof | 0 |


### Combine neighborhood names, coordinates and labels

In [616]:
boroughs_to_plot = pd.merge(boroughs, labeled_boroughs, left_on='Neighborhood', right_on='Neighborhood', how='inner')
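
Because this is an inner merge, any borough without a cluster label (for example one whose venues all belong to filtered-out categories) silently disappears from the map. A hedged sketch of how such boroughs could be listed uses a left merge with `indicator=True`. The toy frames and the placeholder row `Example borough` are made up for illustration.

```python
import pandas as pd

boroughs_toy = pd.DataFrame({
    "Neighborhood": ["Mitte", "Allach", "Example borough"],
    "Latitude": [52.51769, 48.195994, 0.0],    # last row is a placeholder
    "Longitude": [13.402376, 11.457013, 0.0],
})
labels_toy = pd.DataFrame({"Neighborhood": ["Mitte", "Allach"], "Labels": [1, 1]})

# A left merge keeps every borough; the _merge column marks unmatched rows.
check = pd.merge(boroughs_toy, labels_toy, on="Neighborhood",
                 how="left", indicator=True)
dropped = check.loc[check["_merge"] == "left_only", "Neighborhood"].tolist()

print(dropped)  # ['Example borough']
```

Running the same check on `boroughs` and `labeled_boroughs` would show how many boroughs are missing from the maps below.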

In [617]:
boroughs_to_plot.head(1)

| | Neighborhood | Latitude | Longitude | Labels |
|---|---|---|---|---|
| 0 | Mitte | 52.51769 | 13.402376 | 1 |


### Plot Munich

In [618]:
# create map
map_clusters = folium.Map(location=[latitude_muc, longitude_muc], zoom_start=11)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.jet(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(boroughs_to_plot['Latitude'], boroughs_to_plot['Longitude'], boroughs_to_plot['Neighborhood'], boroughs_to_plot['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
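
The map needs one clearly distinguishable color per cluster label. As an alternative sketch that avoids the matplotlib dependency, evenly spaced hues can be generated with the standard library's `colorsys` module. This is a swapped-in technique rather than the `jet` colormap used above, and `cluster_colors` is a hypothetical helper name.

```python
import colorsys

def cluster_colors(n):
    """Return n hex colors with evenly spaced hues, one per cluster label."""
    palette = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.85, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_colors(5)
# palette[label] is the color for cluster `label`; labels start at 0.
print(palette)
```

Indexing the palette directly with the label keeps the color assignment identical on the Munich and Berlin maps.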

### Plot Berlin

In [619]:
# create map
map_clusters = folium.Map(location=[latitude_berlin, longitude_berlin], zoom_start=11)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.jet(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(boroughs_to_plot['Latitude'], boroughs_to_plot['Longitude'], boroughs_to_plot['Neighborhood'], boroughs_to_plot['Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Evaluation

In [620]:
eval_data = pd.merge(boroughs_to_plot, venue_type_freq, left_on='Neighborhood', right_index=True, how='inner')

In [621]:
for label in sorted(eval_data['Labels'].unique()):
    print(label)
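    # numeric_only=True skips the text column 'Neighborhood', which pandas 2.0+ would otherwise reject
    print(eval_data.groupby('Labels').get_group(label).mean(numeric_only=True).sort_values(ascending=False))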
    print('\n')

0
Latitude              51.719009
Longitude             13.039810
Supermarket            0.330491
Bus Stop               0.208573
Bakery                 0.090506
Italian Restaurant     0.064827
Drugstore              0.063579
Park                   0.057007
Restaurant             0.040105
German Restaurant      0.038342
Plaza                  0.029471
Hotel                  0.027418
Café                   0.023992
Ice Cream Shop         0.023290
Bar                    0.002399
Labels                 0.000000
dtype: float64


1
Latitude              49.788074
Longitude             12.267709
Labels                 1.000000
Supermarket            0.136262
Hotel                  0.132172
Italian Restaurant     0.121721
Bakery                 0.115024
Bus Stop               0.080905
German Restaurant      0.076528
Drugstore              0.067634
Café                   0.059040
Ice Cream Shop         0.057394
Plaza                  0.049921
Restaurant             0.045234
Park               

### Description of labels

0. Quiet residential boroughs. Characterized by opportunities for grocery shopping, but still with some options to eat out or go to a café.
1. Similar to 0, but with a more diverse venue selection.
2. These boroughs have the most cafés, bars, and restaurants of all categories. They are the more expensive boroughs to live in.
3. Category exclusive to Berlin in this comparison. It has very few venues aside from supermarkets and appears to be mostly in former East Berlin.
4. Very few venues in total. Mostly on the outskirts.

In [635]:
for label in sorted(eval_data['Labels'].unique()):
    
    print(f'Category {label}:')
    print(eval_data.groupby('Labels').get_group(label)['Neighborhood'].sort_values().values)
    print('\n')

Category 0:
['Adlershof' 'Blankenburg' 'Britz' 'Buch' 'Buckow' 'Charlottenburg-Nord'
 'Daglfing' 'Denning' 'Falkenhagener Feld' 'Fennpfuhl' 'Forstenried'
 'Freimann' 'Friedrichsfelde' 'Gatow' 'Hadern' 'Hakenfelde' 'Haselhorst'
 'Hasenbergl' 'Hellersdorf' 'Holzapfelkreuth' 'Johannisthal' 'Karow'
 'Kaulsdorf' 'Kladow' 'Konradshöhe' 'Lankwitz' 'Lichtenrade'
 'Lichterfelde' 'Mariendorf' 'Marienfelde' 'Marzahn' 'Niederschöneweide'
 'Niederschönhausen' 'Pankow' 'Plänterwald' 'Siemensstadt' 'Staaken'
 'Tempelhof' 'Untermenzing' 'Waidmannslust' 'Wannsee' 'Wilhelmsruh'
 'Wilhelmstadt' 'Wittenau']


Category 1:
['Allach' 'Alt-Hohenschönhausen' 'Alt-Treptow' 'Am Hart' 'Am Moosfeld'
 'Am Riesenfeld' 'Aubing' 'Baumschulenweg' 'Berg am Laim' 'Biesdorf'
 'Bogenhausen' 'Borsigwalde' 'Englschalking' 'Fasangarten' 'Feldmoching'
 'Fürstenried' 'Gropiusstadt' 'Grünau' 'Halensee' 'Harlaching'
 'Johanneskirchen' 'Karlshorst' 'Laim' 'Ludwigsvorstadt' 'Lübars'
 'Mahlsdorf' 'Milbertshofen' 'Mitte' 'Mittersendl