## Assignment 'The Battle Of Neighborhoods'

___
#### In this analysis we compare neighborhoods of Berlin with several major cities to find out whether there are similarities. The comparison is done by obtaining the top venue categories of each place and clustering them with K-Means, taking the venue categories of the cities we want to compare with as initial mean values. The idea is that after the K-Means clustering we can find all the Berlin neighborhoods similar to a city in the corresponding cluster.
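
To make the idea concrete, here is a minimal, dependency-free sketch of K-Means started from fixed seed centroids. It is an illustration only: the notebook itself uses `sklearn.cluster.KMeans` with `init=...`, and the function name `seeded_kmeans` and the toy points below are assumptions made for this example. Note that a seed is only a starting point; after the update steps, a seed point may end up in a different cluster than the one it initialized.

```python
import math

def seeded_kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm started from the given seed centroids.

    Returns (labels, centroids); cluster i starts at seeds[i].
    """
    centroids = [list(s) for s in seeds]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy category-frequency vectors (made up): two "cities" act as seeds.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.15]]
seeds = [points[0], points[2]]
labels, centroids = seeded_kmeans(points, seeds)
print(labels)
```

Points that share a label with a seed city are the ones the procedure treats as "similar" to that city.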


___

In [1]:
import numpy as np
import pandas as pd
import requests
from geopy.geocoders import Nominatim
!conda install -c conda-forge folium=0.5.0 --yes
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors


___
#### Retrieve information about Berlin from Wikipedia (Borough = 'Bezirk' / Neighborhood = 'Ortsteil') and clean the dataset.

* Get boroughs and neighborhoods by parsing the Wikipedia page <br>
* Drop small neighborhoods, i.e. those with <br>
 ≤ 5 square km <br>
 ≤ 20,000 citizens <br>
 ≤ 5,000 citizens per square km

In [2]:
url='https://de.wikipedia.org/wiki/Liste_der_Bezirke_und_Ortsteile_Berlins'
html = requests.get(url).text
# German number format: drop '.' (thousands separator) and turn ',' into '.' (decimal separator).
# Note that this rewrites every '.' and ',' in the page, not only those inside numbers.
html = html.replace('\t', '').replace('\n', '').replace('\r', '').replace('.', '').replace(',', '.')
tables = pd.read_html(html, header=0)
# table with the neighborhoods (Ortsteile)
df = pd.DataFrame(tables[2])
df['Einwohnerpro km²']=df['Einwohnerpro km²'].astype('int32')
df.columns=['Nr','Ortsteil','Bezirk','Area','Citizens','Citizens per Area']
# keep only neighborhoods above all three thresholds
df=df[df['Area']>5]
df=df[df['Citizens']>20000]
df=df[df['Citizens per Area']>5000]
df.reset_index(drop=True, inplace=True)
print('Size of resultset (rows, cols): ',df.shape)
df.head()

Size of resultset (rows, cols):  (27, 6)


Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area
0,101,Mitte,Mitte,10.7,99998,9346
1,102,Moabit,Mitte,7.72,78491,10167
2,105,Wedding,Mitte,9.23,86468,9368
3,106,Gesundbrunnen,Mitte,6.13,94293,15382
4,201,Friedrichshain,Friedrichshain-Kreuzberg,9.78,131953,13492


#### Add geolocations to dataset and display folium map.
The radius of each marker depends on the area of the neighborhood. <br>
This becomes important later, when we search for venue categories through the Foursquare API using a different radius for each location.
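
As a worked example of that radius, the sketch below treats each neighborhood as a circle of the same area, so that r = sqrt(A / π). The helper name `search_radius_m` is an assumption for illustration; the notebook inlines the same formula with π approximated as 3.14, so its values differ very slightly.

```python
import math

def search_radius_m(area_km2):
    """Radius in metres of a circle whose area equals area_km2."""
    return math.sqrt(area_km2 / math.pi) * 1000

# Areas in square km as they appear in the dataset.
for name, area in [("Mitte", 10.7), ("Paris", 105.0)]:
    print(name, int(search_radius_m(area)), "m")
```

The same value is used both as the marker radius on the map and as the search radius for the venue query, which keeps the two consistent.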

In [3]:
# get coordinates
latitude, longitude = [],[]
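# recent geopy versions require a user_agent; any descriptive application name works (this one is a placeholder)
geolocator = Nominatim(user_agent='berlin_neighborhoods')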
for neighborhood in df['Ortsteil']:
    address = neighborhood + ', Berlin'
    location = geolocator.geocode(address)
    latitude.append(location.latitude)
    longitude.append(location.longitude)
df['Latitude']=latitude
df['Longitude']=longitude
#display map
berlin_coordinates=[52.517690, 13.402376]
map_berlin = folium.Map(location=berlin_coordinates, zoom_start=10)
for lat, lng, borough, neighborhood, area in zip(df['Latitude'], df['Longitude'], df['Bezirk'], df['Ortsteil'], df['Area']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    marker=folium.Circle(
        [lat, lng],
        radius=int(np.sqrt(area/3.14)*1000),
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_berlin)  
map_berlin

___
#### Add the cities we want to compare Berlin neighborhoods with.
* Define the list of cities
* Add it to the dataframe


In [4]:
city_list=[{'name': 'Paris','coordinates': [48.856614,2.3522219],'nr': 2000, 'area': 105},
           {'name': 'London','coordinates': [51.507351,-0.127758],'nr': 3000, 'area': 100},
           {'name': 'Madrid','coordinates': [40.416775,-3.703790],'nr': 4000, 'area': 105},
           {'name': 'Tokio','coordinates': [35.689487,139.691706],'nr': 5000, 'area': 122},
           {'name': 'Peking','coordinates': [39.904211,116.407395],'nr': 6000, 'area': 100},
           {'name': 'New York','coordinates': [40.712784,-74.005941],'nr': 7000, 'area': 100},
           {'name': 'Sydney','coordinates': [-33.867487,151.206990],'nr': 8000, 'area': 100},
           {'name': 'Moskau','coordinates': [55.755826,37.617300],'nr': 9000, 'area': 100}
          ]
number_of_cities = len(city_list)

In [5]:
# add to dataframe (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
city_rows = pd.DataFrame([{'Nr': data['nr'],
                           'Area': data['area'],
                           'Bezirk': data['name'],
                           'Ortsteil': data['name'],
                           'Latitude': data['coordinates'][0],
                           'Longitude': data['coordinates'][1]} for data in city_list])
df = pd.concat([df, city_rows], ignore_index=True, sort=False)
df.tail(10)

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude
25,1110,Alt-Hohenschönhausen,Lichtenberg,9.33,48458.0,5194.0,52.549382,13.504673
26,1201,Reinickendorf,Reinickendorf,10.5,82945.0,7900.0,52.604763,13.295287
27,2000,Paris,Paris,105.0,,,48.856614,2.352222
28,3000,London,London,100.0,,,51.507351,-0.127758
29,4000,Madrid,Madrid,105.0,,,40.416775,-3.70379
30,5000,Tokio,Tokio,122.0,,,35.689487,139.691706
31,6000,Peking,Peking,100.0,,,39.904211,116.407395
32,7000,New York,New York,100.0,,,40.712784,-74.005941
33,8000,Sydney,Sydney,100.0,,,-33.867487,151.20699
34,9000,Moskau,Moskau,100.0,,,55.755826,37.6173


#### Now fetch nearby venue categories for each neighborhood / city
* Define and call a function that returns all venue categories for the given locations from the Foursquare API <br>
* One-hot encode the data and calculate the grouped mean value for each location <br>
* Define initial clusters by setting cities as centroids and run K-Means
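
The one-hot and grouped-mean step can be sketched without pandas. The example below is an illustration under assumptions (the function name `category_frequencies` and the venue rows are made up); it computes what `pd.get_dummies` followed by `groupby(...).mean()` yields in the notebook: for each place, the share of its venues that falls into each category.

```python
from collections import Counter, defaultdict

def category_frequencies(venue_rows):
    """Map each place to {category: share of that place's venues}."""
    counts = defaultdict(Counter)
    for place, category in venue_rows:
        counts[place][category] += 1
    # Every place gets the same set of categories, with 0.0 where it has none.
    categories = sorted({category for _, category in venue_rows})
    return {
        place: {c: counter[c] / sum(counter.values()) for c in categories}
        for place, counter in counts.items()
    }

# Made-up (place, venue category) rows.
rows = [
    ("A", "Café"), ("A", "Café"), ("A", "Park"), ("A", "Bar"),
    ("B", "Bar"), ("B", "Bar"),
]
freqs = category_frequencies(rows)
print(freqs["A"])
print(freqs["B"])
```

Because each vector sums to 1, places with very different numbers of venues remain comparable, which matters here since the cities are searched with a much larger radius than the neighborhoods.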

In [6]:
# CLIENT_ID, CLIENT_SECRET and VERSION (Foursquare credentials and API version date)
# must be defined before this function is called.
def getNearbyVenues(nr, names, latitudes, longitudes, area, radius=500, LIMIT=100):
    # note: `radius` is unused; the search radius is derived from each location's area
    venues_list=[]
    for number, name, lat, lng, ar in zip(nr, names, latitudes, longitudes, area):
        print(name,'\t - area: ',ar)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            int(np.sqrt(ar/3.14)*1000),  # radius in metres of a circle with area `ar` (km²)
            LIMIT)
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except Exception:
            # request failed or the response contained no venue groups: treat as no venues
            results = []
        # one row per venue, tagged with the location it was found for
        venues_list.append([(
            number,
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Nr',
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
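
The request URL used in the function can also be assembled from a parameter dictionary, which handles the escaping of values. This is a sketch rather than the notebook's method: the helper name `build_explore_url` is hypothetical, and the credential and version strings are placeholders, since the notebook does not show where `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` are defined. The endpoint and parameter names are the ones that appear in the function.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, radius_m, limit, client_id, client_secret, version):
    """Build the venues/explore URL with properly escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": int(radius_m),
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials and version string; replace with real values.
url = build_explore_url(52.51769, 13.402376, 1845, 100,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "YYYYMMDD")
print(url)
```

No request is sent here; the sketch only shows how the query string is put together.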

In [8]:
venues = getNearbyVenues(nr=df['Nr'],
                         names=df['Ortsteil'],
                         latitudes=df['Latitude'],
                         longitudes=df['Longitude'],
                         area=df['Area'] 
                         )

Mitte 	 - area:  10.7
Moabit 	 - area:  7.72
Wedding 	 - area:  9.23
Gesundbrunnen 	 - area:  6.13
Friedrichshain 	 - area:  9.78
Kreuzberg 	 - area:  10.4
Prenzlauer Berg 	 - area:  11.0
Weißensee 	 - area:  7.93
Pankow 	 - area:  5.66
Charlottenburg 	 - area:  10.6
Wilmersdorf 	 - area:  7.16
Falkenhagener Feld 	 - area:  6.88
Steglitz 	 - area:  6.79
Lankwitz 	 - area:  6.99
Schöneberg 	 - area:  10.6
Tempelhof 	 - area:  12.2
Mariendorf 	 - area:  9.38
Lichtenrade 	 - area:  10.1
Neukölln 	 - area:  11.7
Buckow 	 - area:  6.35
Marzahn 	 - area:  19.5
Hellersdorf 	 - area:  8.1
Friedrichsfelde 	 - area:  5.55
Lichtenberg 	 - area:  7.22
Neu-Hohenschönhausen 	 - area:  5.16
Alt-Hohenschönhausen 	 - area:  9.33
Reinickendorf 	 - area:  10.5
Paris 	 - area:  105.0
London 	 - area:  100.0
Madrid 	 - area:  105.0
Tokio 	 - area:  122.0
Peking 	 - area:  100.0
New York 	 - area:  100.0
Sydney 	 - area:  100.0
Moskau 	 - area:  100.0


In [9]:
# one hot encoding
venues_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
venues_onehot['Ortsteil'] = venues['Neighborhood'] 
venues_onehot['Nr']=venues['Nr']
fixed_columns = [venues_onehot.columns[-1]] + [venues_onehot.columns[-2]] + list(venues_onehot.columns[:-2])
venues_onehot = venues_onehot[fixed_columns]
venues_grouped = venues_onehot.groupby('Ortsteil').mean().reset_index()

In [10]:
kclusters = number_of_cities # number of clusters
venues_grouped_clustering = venues_grouped.drop(columns=['Ortsteil', 'Nr'])
# set initial centroids: the venue-category profile of each comparison city,
# so that cluster i starts at centroid_cities[i]
centroid_cities = ['London', 'Madrid', 'New York', 'Paris', 'Peking', 'Sydney', 'Moskau', 'Tokio']
init = np.array([venues_grouped_clustering[venues_grouped['Ortsteil'] == city].iloc[0].tolist()
                 for city in centroid_cities])
kmeans = KMeans(n_clusters=kclusters, init=init, n_init=1).fit(venues_grouped_clustering)
kmeans.labels_

array([2, 1, 5, 2, 2, 5, 5, 2, 5, 2, 0, 2, 0, 3, 2, 2, 3, 5, 6, 2, 5, 3, 2,
       3, 4, 5, 2, 5, 5, 5, 5, 7, 5, 2, 5], dtype=int32)

___
#### Explore the result. Are the neighborhoods of Berlin similar to any of the given cities?
As cities to compare with, we set: <br>
* London
* Paris
* Madrid
* Tokio
* Peking
* New York
* Sydney
* Moskau

Explore which cluster each Berlin neighborhood falls into:

In [11]:
venues_explore=df.sort_values(by='Ortsteil')
venues_explore['cluster']=kmeans.labels_
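
The assignment above is positional: it relies on the sorted dataframe and the grouped venue table having the same rows in the same order. If a location returned no venues, it would be missing from the grouped table and the lengths would no longer match. A more defensive variant joins by name instead. The sketch below is illustrative only; the helper names are hypothetical and "Moabit" is shown as unclustered purely to demonstrate the missing case.

```python
def labels_by_name(grouped_names, labels):
    """Pair each clustered row's name with its cluster label."""
    if len(grouped_names) != len(labels):
        raise ValueError("names and labels must have the same length")
    return dict(zip(grouped_names, labels))

def attach_clusters(places, name_to_cluster):
    """Return (place, cluster) pairs; None marks places that were not clustered."""
    return [(place, name_to_cluster.get(place)) for place in places]

# Names in the order of the clustered table, with their labels.
name_to_cluster = labels_by_name(["Buckow", "London", "Mitte"], [1, 0, 3])
# Places in an arbitrary order, including one without venue data.
result = attach_clusters(["Mitte", "Moabit", "Buckow", "London"], name_to_cluster)
print(result)
```

In pandas the same join could be written as `df['Ortsteil'].map(dict(zip(venues_grouped['Ortsteil'], kmeans.labels_)))`, which leaves `NaN` for any place that was not clustered.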

#### Explore clusters 0, 1, 2, ... , 7

In [12]:
### cluster 0:
venues_explore[venues_explore['cluster']==0]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
23,1103,Lichtenberg,Lichtenberg,7.22,40759.0,5645.0,52.532161,13.511893,0
28,3000,London,London,100.0,,,51.507351,-0.127758,0


In [13]:
venues_explore[venues_explore['cluster']==1]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
19,803,Buckow,Neukölln,6.35,40708.0,6411.0,52.418662,13.42895,1


In [14]:
venues_explore[venues_explore['cluster']==2]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
25,1110,Alt-Hohenschönhausen,Lichtenberg,9.33,48458.0,5194.0,52.549382,13.504673,2
11,508,Falkenhagener Feld,Spandau,6.88,38569.0,5606.0,52.552403,13.166894,2
22,1101,Friedrichsfelde,Lichtenberg,5.55,52502.0,9460.0,52.502936,13.520546,2
21,1005,Hellersdorf,Marzahn-Hellersdorf,8.1,81177.0,10022.0,52.536854,13.604774,2
13,603,Lankwitz,Steglitz-Zehlendorf,6.99,42877.0,6134.0,52.433698,13.345486,2
17,706,Lichtenrade,Tempelhof-Schöneberg,10.1,51280.0,5077.0,52.393456,13.40204,2
16,704,Mariendorf,Tempelhof-Schöneberg,9.38,52248.0,5570.0,52.44008,13.390028,2
20,1001,Marzahn,Marzahn-Hellersdorf,19.5,111215.0,5703.0,52.542948,13.563142,2
24,1109,Neu-Hohenschönhausen,Lichtenberg,5.16,56469.0,10944.0,52.566331,13.514065,2
8,307,Pankow,Pankow,5.66,63492.0,11218.0,52.597811,13.436383,2


In [15]:
venues_explore[venues_explore['cluster']==3]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
29,4000,Madrid,Madrid,105.0,,,40.416775,-3.70379,3
0,101,Mitte,Mitte,10.7,99998.0,9346.0,52.51769,13.402376,3
32,7000,New York,New York,100.0,,,40.712784,-74.005941,3
27,2000,Paris,Paris,105.0,,,48.856614,2.352222,3


In [16]:
venues_explore[venues_explore['cluster']==4]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
31,6000,Peking,Peking,100.0,,,39.904211,116.407395,4


In [17]:
venues_explore[venues_explore['cluster']==5]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
9,401,Charlottenburg,Charlottenburg-Wilmersdorf,10.6,129010.0,12171.0,52.515747,13.309683,5
4,201,Friedrichshain,Friedrichshain-Kreuzberg,9.78,131953.0,13492.0,52.512215,13.45029,5
3,106,Gesundbrunnen,Mitte,6.13,94293.0,15382.0,52.55092,13.384846,5
5,202,Kreuzberg,Friedrichshain-Kreuzberg,10.4,154010.0,14809.0,52.497644,13.411914,5
1,102,Moabit,Mitte,7.72,78491.0,10167.0,52.530102,13.342542,5
18,801,Neukölln,Neukölln,11.7,166714.0,14249.0,52.48115,13.43535,5
6,301,Prenzlauer Berg,Pankow,11.0,163481.0,14862.0,52.539847,13.428565,5
14,701,Schöneberg,Tempelhof-Schöneberg,10.6,122770.0,11582.0,52.482157,13.35519,5
12,601,Steglitz,Steglitz-Zehlendorf,6.79,75278.0,11087.0,52.457257,13.322287,5
33,8000,Sydney,Sydney,100.0,,,-33.867487,151.20699,5


In [18]:
venues_explore[venues_explore['cluster']==6]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
34,9000,Moskau,Moskau,100.0,,,55.755826,37.6173,6


In [19]:
venues_explore[venues_explore['cluster']==7]

Unnamed: 0,Nr,Ortsteil,Bezirk,Area,Citizens,Citizens per Area,Latitude,Longitude,cluster
30,5000,Tokio,Tokio,122.0,,,35.689487,139.691706,7


#### Display clusters on a folium map

In [20]:
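# cities found in each cluster according to the cluster tables above ('NN' = no city in the cluster)
cluster_labels = {'0': 'London',
                  '1': 'NN',
                  '2': 'NN',
                  '3': 'Madrid, New York, Paris',
                  '4': 'Peking',
                  '5': 'Sydney',
                  '6': 'Moskau',
                  '7': 'Tokio'}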
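
Such a label dictionary can also be derived from the clustering result instead of being typed by hand, which avoids mismatches when the clusters change between runs. This is a minimal sketch under assumptions: the function name `derive_cluster_labels` is hypothetical, and the sample rows are a small subset of the tables above.

```python
from collections import defaultdict

def derive_cluster_labels(names, labels, cities, empty="NN"):
    """Map each cluster id (as a string) to the comparison cities it contains."""
    by_cluster = defaultdict(list)
    for name, label in zip(names, labels):
        if name in cities:
            by_cluster[label].append(name)
    return {
        str(k): ", ".join(sorted(by_cluster[k])) if by_cluster.get(k) else empty
        for k in sorted(set(labels))
    }

names = ["Buckow", "London", "Madrid", "Mitte", "Paris"]
labels = [1, 0, 3, 3, 3]
cities = {"London", "Madrid", "Paris"}
derived = derive_cluster_labels(names, labels, cities)
print(derived)
```

Clusters that contain only Berlin neighborhoods get the placeholder `'NN'`, matching the convention used in the dictionary above.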

In [21]:
# create map
map_clusters = folium.Map(location=[52.517690, 13.402376], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster, area in zip(venues_explore['Latitude'], venues_explore['Longitude'], venues_explore['Ortsteil'], venues_explore['cluster'], venues_explore['Area']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster_labels[str(cluster)]), parse_html=True)
    folium.Circle(
        [lat, lon],
        radius=int(np.sqrt(area/3.14)*1000),
        popup=label,
        color=rainbow[cluster],  # cluster ids start at 0, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Thank you for your attention!