# Battle of the Neighborhoods
## A Foursquare-based recommender
#### (and also the capstone project for the IBM Data Science Professional Certificate on Coursera)

Disclaimer: This notebook covers the technical aspects of the problem. For a higher-level overview, please see the report or the blog post.

### The Problem

Let's assume we have to move to another city. Finding a suitable place to live can be challenging, so what can we do? We look for a place that is as similar as possible to what we're used to. This recommender system helps us select our new spot using Foursquare's API.

### The Data

For this project we'll use Budapest, the capital of Hungary, as an example. It's also the most populous city in the country, with a population of approx. 1.7M. It's a popular place to live, with thousands of venues and sights.

Budapest has 23 districts, and each district is associated with one or more neighborhoods. District-level resolution is quite coarse, but we cannot dig deeper for now because the neighborhoods are distributed very unevenly: some districts have only one neighborhood, while others have as many as 8-10.


### The Drawbacks

This system has a huge drawback: if you don't have a track record on Foursquare, you won't be able to use it. Sorry about that.
(But I've exported everything to csv files, so we'll just load them in this notebook. I left the original code in as well, for educational reasons.)

One other thing: this notebook is optimized specifically for Budapest. With minor work, you can adapt it to any other major city in the world.

### Imports and installs

Here come all the libraries we need. The conda commands run in dry-run mode (`-d`), so they only report what would be installed and don't flood the output with unnecessary information.

In [32]:
!conda install -d -y -c anaconda lxml
!conda install -d -y -c conda-forge geopy 
!pip install --upgrade gap-stat

import time
import numpy as np
import pandas as pd
import folium
import requests
import json 
import unicodedata
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim 
from pandas.io.json import json_normalize 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

print('All imports and dependency installs done!')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - lxml


The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2020.4.5~ --> anaconda::ca-certificates-2020.1.1-0
  certifi            conda-forge::certifi-2020.4.5.1-py36h~ --> anaconda::certifi-2020.4.5.1-py36_0
  openssl            conda-forge::openssl-1.1.1g-h516909a_0 --> anaconda::openssl-1.1.1g-h7b6447c_0



DryRunExit: Dry run. Exiting.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Requirement already up-to-date: gap-stat in /home/jupyterlab/conda/envs/python/lib/python3.6/site-packages (2.0.1)
All imports and dependency installs done!


### Credentials

Here are the credentials needed for the Foursquare API. Of course, I don't leave my keys under the doormat, but feel free to fill in the gaps with your own.
Or, if you don't want to mess around with a Foursquare developer account, you can just import everything from csv files later.

In [None]:
CLIENT_ID = 'your Foursquare client_id here'
CLIENT_SECRET = 'your Foursquare client_secret here'
VERSION = '20180605' # Foursquare API version
TOKEN = 'your oauth access token here' # to generate see: https://developer.foursquare.com/docs/places-api/authentication/

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
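
If you'd rather not paste keys into the notebook at all, one option is to read them from environment variables and print only a masked version. This is just a sketch: the variable names `FSQ_CLIENT_ID`, `FSQ_CLIENT_SECRET` and `FSQ_TOKEN`, as well as the helpers `load_credentials` and `mask`, are my own placeholders and not something Foursquare prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (names are hypothetical)."""
    keys = ['FSQ_CLIENT_ID', 'FSQ_CLIENT_SECRET', 'FSQ_TOKEN']
    missing = [k for k in keys if not env.get(k)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {k: env[k] for k in keys}

def mask(secret, visible=4):
    """Show only the last few characters of a secret when printing it."""
    return '*' * max(len(secret) - visible, 0) + secret[-visible:]

# demo with a fake environment, so nothing real gets printed
fake_env = {'FSQ_CLIENT_ID': 'abcd1234', 'FSQ_CLIENT_SECRET': 'efgh5678', 'FSQ_TOKEN': 'ijkl9012'}
creds = load_credentials(fake_env)
print('CLIENT_SECRET: ' + mask(creds['FSQ_CLIENT_SECRET']))
```

In a real session you would call `load_credentials()` without arguments, after exporting the variables in your shell.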


### Helper functions

These are needed for Foursquare data magic, but again, you don't have to use them if you import from csv.

In [34]:
def getCheckins(userID=None):
    # 'self' refers to the user who owns the OAuth token
    if not userID:
        userID = 'self'
    
    url = 'https://api.foursquare.com/v2/users/{}/checkins?&oauth_token={}&v={}'.format(userID, TOKEN, VERSION)
    results = requests.get(url).json()["response"]["checkins"]["items"]

    c = []

    for item in results:
        # collect the name of every category of the checked-in venue
        # (extend, not append: append would store the generator object itself)
        c.extend(cat['name'] for cat in item['venue']['categories'])

    checkins = pd.DataFrame(c, columns=['Checkins'])
    return checkins

In [46]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        time.sleep(2) # not spamming the fsq api
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
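
To see what `getNearbyVenues` does with a response without calling the API, here is a small offline sketch. The nesting (`response`, then `groups`, then `items`, then `venue`) is taken from the function above; the `mock` payload is hand-written and not a real Foursquare response, and `flatten_items` is a hypothetical helper. Unlike the function above, it falls back to `None` when a venue has no category instead of raising an error.

```python
def flatten_items(district, lat, lng, items):
    """Turn the 'items' list of an explore-style response into flat tuples."""
    rows = []
    for v in items:
        venue = v['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((district, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# hand-written stand-in for requests.get(url).json(); not a real API response
mock = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Dísz tér', 'location': {'lat': 47.4991, 'lng': 19.036163},
               'categories': [{'name': 'Plaza'}]}},
    {'venue': {'name': 'Unnamed spot', 'location': {'lat': 47.5, 'lng': 19.03},
               'categories': []}},
]}]}}

rows = flatten_items('I.', 47.499163, 19.035143, mock['response']['groups'][0]['items'])
print(rows)
```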

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
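
A quick way to convince yourself that `return_most_common_venues` does what it says is to feed it a made-up row. The function body below is copied from the cell above so the snippet runs on its own; the numbers in `toy_row` are invented.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # same logic as above: skip the first entry (the district name), sort the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# a made-up row: district name first, then the share of each category
toy_row = pd.Series({'District': 'I.', 'Bakery': 0.10, 'Café': 0.30, 'Park': 0.05, 'Hotel': 0.20})
top3 = return_most_common_venues(toy_row, 3)
print(list(top3))
```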

### Data import

Now, let's get to the data. First we need the geographical coordinates of Budapest. We're using OpenStreetMap's Nominatim engine to find them.

In [37]:
address = 'Budapest, HU'

geolocator = Nominatim(user_agent="bp_capstone")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Budapest are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Budapest are 47.4983815, 19.0404707.


Next, we're collecting some other data:
- our past check-ins from Foursquare (this works only with your own check-ins at the moment, so if you don't have any, feel free to import mine)
- flat prices by district (we will need them at the end)
- districts of Budapest with coordinates (I've already collected them so you don't have to spam Nominatim)

In [39]:
# use this if you have a Foursquare record
# checkins = getCheckins()

# comment the following out, if you are using your own Foursquare check-in data
checkins = pd.read_csv('data/my_checkins.csv')
checkins.drop(['Unnamed: 0'], axis=1,inplace=True)
checkins

Unnamed: 0,Checkins
0,Café
1,Supermarket
2,Café
3,Café
4,Diner
5,Butcher
6,Café
7,Diner
8,Café
9,Café


In [40]:
df_bp_prices = pd.read_csv('data/flat_prices.csv', sep=';')
df_bp_prices = df_bp_prices[['District', 'AvgPrice']]
df_bp_prices

Unnamed: 0,District,AvgPrice
0,I.,1023321
1,II.,892150
2,III.,754493
3,IV.,591152
4,V.,1222010
5,VI.,939904
6,VII.,827268
7,VIII.,746541
8,IX.,826198
9,X.,577233


In [41]:
bp_districts = pd.read_csv('data/bp_districts_only.csv')
bp_districts.drop(['Unnamed: 0'], axis=1,inplace=True)
bp_districts

Unnamed: 0,District,Latitude,Longitude
0,I.,47.499163,19.035143
1,II.,47.538887,18.982636
2,III.,47.568691,19.027668
3,IV.,47.577779,19.093164
4,V.,47.499945,19.050549
5,VI.,47.509494,19.065323
6,VII.,47.502627,19.077243
7,VIII.,47.490595,19.08734
8,IX.,47.465356,19.090356
9,X.,47.482235,19.156494


I have already requested the following data from Foursquare, so again, you just have to import it. Here come the venues for every district, within 500 meters of the district center (a maximum of 100 per district).

In [44]:
# To avoid spamming the Foursquare API, the output of the following cell was saved to csv before, so
# this is left here for educational purposes only

# bp_venues = getNearbyVenues(names=bp_districts['District'],
#                                      latitudes=bp_districts['Latitude'],
#                                      longitudes=bp_districts['Longitude']
#                                     )


bp_venues = pd.read_csv('data/bp_venues_by_district.csv')
bp_venues.drop(columns='Unnamed: 0', axis=1, inplace=True)
bp_venues.rename(columns={'Neighborhood': 'District', 'Neighborhood Latitude': 'District Latitude', 'Neighborhood Longitude': 'District Longitude'}, inplace=True)

After a little data cleanup (renaming columns and dropping unnecessary stuff), our venue data looks like this:

In [45]:
bp_venues.head()

Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,I.,47.499163,19.035143,Dísz tér,47.4991,19.036163,Plaza
1,I.,47.499163,19.035143,Halászbástya | Fisherman's Bastion (Halászbástya),47.502029,19.035058,Historic Site
2,I.,47.499163,19.035143,Stand25 Bisztró,47.497673,19.032679,Bistro
3,I.,47.499163,19.035143,Várnegyed,47.501195,19.032261,Scenic Lookout
4,I.,47.499163,19.035143,Honda Dream,47.498561,19.031825,Motorcycle Shop


Let's group them by Venue Category to see how many venues fall into each category:

In [47]:
bp_venues.groupby('Venue Category').count()

Venue Category,District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude
Afghan Restaurant,1,1,1,1,1,1
Airport,2,2,2,2,2,2
Airport Food Court,1,1,1,1,1,1
Airport Service,1,1,1,1,1,1
Airport Terminal,1,1,1,1,1,1
...,...,...,...,...,...,...
Wine Bar,12,12,12,12,12,12
Wine Shop,7,7,7,7,7,7
Yoga Studio,6,6,6,6,6,6
Zoo,1,1,1,1,1,1


For the clustering later on, we do a bit of one-hot encoding here. After that we'll calculate the mean of each category, grouped by district.

In [48]:
# empty prefix and separator, so the new columns are named exactly like the categories
bp_onehot = pd.get_dummies(bp_venues[['Venue Category']], prefix="", prefix_sep="")

bp_onehot['District'] = bp_venues['District'] 

fixed_columns = [bp_onehot.columns[-1]] + list(bp_onehot.columns[:-1])
bp_onehot = bp_onehot[fixed_columns]

bp_onehot.head()

Unnamed: 0,District,Afghan Restaurant,Airport,Airport Food Court,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Aquarium,Arcade,...,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,I.,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,I.,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,I.,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,I.,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,I.,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [49]:
bp_grouped = bp_onehot.groupby('District').mean().reset_index()
bp_grouped.head()

Unnamed: 0,District,Afghan Restaurant,Airport,Airport Food Court,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Aquarium,Arcade,...,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Zoo,Zoo Exhibit
0,I.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.01,0.0,0.03,0.02,0.0,0.0,0.0
1,II.,0.0,0.027778,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.013889,0.0,0.0,0.0
2,III.,0.0,0.0,0.0,0.0,0.0,0.016667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,IV.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,IX.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
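
If the one-hot-then-mean step feels abstract, a toy example may help. It's a sketch on invented data (`toy_venues` is not part of the project data): after one-hot encoding, the mean of a category column within a district is simply the share of that district's venues that belong to the category, so every row sums to 1.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'District': ['I.', 'I.', 'I.', 'II.', 'II.'],
    'Venue Category': ['Café', 'Café', 'Park', 'Park', 'Bakery'],
})

# one column per category, 1 where the venue belongs to it
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'District', toy_venues['District'])

# mean per district = share of each category among the district's venues
toy_grouped = toy_onehot.groupby('District').mean().reset_index()
print(toy_grouped)
```

District I. has two cafés out of three venues, so its `Café` value is 2/3.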


Now we're getting closer! Our next step is to determine the 10 most common venue categories for each district, according to our data.

In [50]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['District']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['District'] = bp_grouped['District']

for ind in np.arange(bp_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bp_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,I.,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop
1,II.,Scenic Lookout,Grocery Store,Park,Pizza Place,Tram Station,Bakery,Track,Ice Cream Shop,Bus Stop,Bar
2,III.,Bus Stop,Supermarket,Electronics Store,Grocery Store,Flower Shop,Pizza Place,Plaza,Historic Site,Arts & Crafts Store,Pharmacy
3,IV.,Bus Stop,Dessert Shop,Athletics & Sports,Grocery Store,Park,Hungarian Restaurant,Playground,Gastropub,Tram Station,Burger Joint
4,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place
5,V.,Hotel,Coffee Shop,Italian Restaurant,Ice Cream Shop,Cocktail Bar,Pizza Place,Hungarian Restaurant,Wine Bar,Plaza,Restaurant
6,VI.,Coffee Shop,Beer Bar,Bar,Pizza Place,Hungarian Restaurant,Ice Cream Shop,Restaurant,Italian Restaurant,Dessert Shop,Bakery
7,VII.,Coffee Shop,Bar,Restaurant,Hotel,Pub,Park,Ice Cream Shop,Burger Joint,Beer Bar,Toy / Game Store
8,VIII.,Coffee Shop,Hotel,Beer Bar,Wine Bar,Bar,Multiplex,Burger Joint,Café,Vietnamese Restaurant,Thai Restaurant
9,X.,Bus Stop,Bakery,Grocery Store,Supermarket,Park,Brewery,Gym,Historic Site,Fast Food Restaurant,Bus Station
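
The `try`/`except` trick above is fine for ten columns, but it would label the 21st column "21th" if you ever raised `num_top_venues` that far. A small helper can take care of the suffixes; `ordinal` is a hypothetical name of mine, shown here only as a sketch.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column names as in the cell above, built with the helper
columns = ['District'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[4], '|', columns[10])
```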


We'll soon do a K-means clustering, but first we have to determine the optimal K. To prepare for that, we standardize the features, so every category column has zero mean and unit variance.

In [51]:
X = bp_grouped.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet



array([[-0.21320072, -0.21320072, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072],
       [-0.21320072,  4.69041576, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072],
       [-0.21320072, -0.21320072, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072],
       ...,
       [-0.21320072, -0.21320072, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072],
       [-0.21320072, -0.21320072, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072],
       [-0.21320072, -0.21320072, -0.21320072, ..., -0.44328547,
        -0.21320072, -0.21320072]])
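
If you wonder why the same -0.2132 keeps showing up in the array above: that is what standardization produces for a category that is present in exactly one of the 23 districts. The 22 zeros all map to -1/sqrt(22) and the single non-zero entry maps to sqrt(22), whatever its original value was. The snippet below is a sketch that reproduces this on a synthetic column (the 0.027778 is borrowed from the Airport column of district II. above).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# one feature that is zero in 22 districts and non-zero in a single one
col = np.zeros((23, 1))
col[1, 0] = 0.027778

scaled = StandardScaler().fit_transform(col)
print(round(scaled[0, 0], 4), round(scaled[1, 0], 4))
```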

I'm a bit of a lazy guy, so I'll be using Miles Granger's Gap Statistic library to determine the K.

In [52]:
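from gap_statistic import OptimalK  # provided by the gap-stat package installed above

optimalK = OptimalK()
use_this_k = optimalK(Clus_dataSet, n_refs=5, cluster_array=np.arange(1, 11))
print('Optimal k is: ', use_this_k)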

Optimal k is:  8
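
For the curious, here is roughly what a gap statistic does under the hood. This is a simplified sketch of the idea and not the library's implementation: for each k it compares the log of the K-means dispersion on the real data with the average log-dispersion on uniformly random reference data drawn from the same bounding box, and it simply picks the k with the largest gap (the original method uses a more careful stopping rule). The function name `gap_statistic_k`, its parameters and the toy blobs are all my own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic_k(data, n_refs=5, max_clusters=10, seed=0):
    """Simplified gap statistic: return the k with the largest gap, plus all gaps."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps = []
    for k in range(1, max_clusters + 1):
        # dispersion (inertia) of the real data for this k
        real = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_
        # average log-dispersion of uniform reference data from the bounding box
        ref_logs = []
        for _ in range(n_refs):
            ref = rng.uniform(lo, hi, size=data.shape)
            ref_km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref)
            ref_logs.append(np.log(ref_km.inertia_))
        gaps.append(np.mean(ref_logs) - np.log(real))
    return int(np.argmax(gaps)) + 1, gaps

# toy data: three well-separated blobs in 2D
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, gaps = gap_statistic_k(toy, n_refs=3, max_clusters=6)
print('Toy optimal k:', best_k)
```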


And here comes the clustering itself with the K calculated.

In [54]:
kclusters = use_this_k

bp_grouped_clustering = bp_grouped.drop(columns='District')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bp_grouped_clustering)
kmeans.labels_

array([3, 7, 2, 2, 0, 3, 3, 3, 3, 2, 0, 2, 0, 0, 2, 2, 5, 4, 5, 1, 2, 5,
       6], dtype=int32)

In [55]:
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_
#neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
bp_merged = bp_venues
bp_merged = bp_merged.join(neighborhoods_venues_sorted.set_index('District'), on='District')


So our venue data, merged with the cluster labels, looks like this:

In [56]:
bp_merged.head()

Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,I.,47.499163,19.035143,Dísz tér,47.4991,19.036163,Plaza,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop,3
1,I.,47.499163,19.035143,Halászbástya | Fisherman's Bastion (Halászbástya),47.502029,19.035058,Historic Site,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop,3
2,I.,47.499163,19.035143,Stand25 Bisztró,47.497673,19.032679,Bistro,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop,3
3,I.,47.499163,19.035143,Várnegyed,47.501195,19.032261,Scenic Lookout,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop,3
4,I.,47.499163,19.035143,Honda Dream,47.498561,19.031825,Motorcycle Shop,Bakery,Coffee Shop,Hotel,Gym / Fitness Center,Scenic Lookout,Bistro,Historic Site,Hungarian Restaurant,Italian Restaurant,Dessert Shop,3
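
Notice that the table is still one row per venue: the district-level summary (top venues and cluster label) is simply repeated on every venue of that district. Here is a toy sketch of the same `join`; the two cluster labels come from the output above, and 'Some park' is an invented venue.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'District': ['I.', 'I.', 'II.'],
    'Venue': ['Dísz tér', 'Stand25 Bisztró', 'Some park'],
})
toy_summary = pd.DataFrame({
    'District': ['I.', 'II.'],
    '1st Most Common Venue': ['Bakery', 'Scenic Lookout'],
    'Cluster Labels': [3, 7],
})

# look up each venue's district in the summary table (indexed by district)
toy_merged = toy_venues.join(toy_summary.set_index('District'), on='District')
print(toy_merged)
```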


A beautiful map is worth a thousand words, so let's see our clusters visualized with Folium! I've chosen a monochrome base map so that the colored clusters stand out even more.

In [57]:
map_clusters = folium.Map(location=[latitude, longitude],  tiles='Stamen Toner', zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bp_merged['Venue Latitude'], bp_merged['Venue Longitude'], bp_merged['Venue'], bp_merged['Cluster Labels']):
    poi_norm = (unicodedata.normalize('NFKD', poi).encode('ASCII', 'ignore')).decode('utf-8')
    label = folium.Popup(str(poi_norm) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels are 0-based, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

We can examine the clusters if we want. I've put an example below; feel free to modify the number in the 'Cluster Labels' condition to inspect other clusters as well!

In [58]:
bp_merged.loc[bp_merged['Cluster Labels'] == 0, bp_merged.columns[[0] + list(range(7, bp_merged.shape[1]))]]

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
721,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place,0
722,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place,0
723,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place,0
724,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place,0
725,IX.,Park,Music Venue,Bakery,Fast Food Restaurant,Coffee Shop,Gym / Fitness Center,Burger Joint,Nightclub,Soccer Stadium,Sandwich Place,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1257,XIV.,Bakery,Dessert Shop,Gym / Fitness Center,Chinese Restaurant,Supermarket,Ice Cream Shop,Grocery Store,Gym,Electronics Store,Italian Restaurant,0
1258,XIV.,Bakery,Dessert Shop,Gym / Fitness Center,Chinese Restaurant,Supermarket,Ice Cream Shop,Grocery Store,Gym,Electronics Store,Italian Restaurant,0
1259,XIV.,Bakery,Dessert Shop,Gym / Fitness Center,Chinese Restaurant,Supermarket,Ice Cream Shop,Grocery Store,Gym,Electronics Store,Italian Restaurant,0
1260,XIV.,Bakery,Dessert Shop,Gym / Fitness Center,Chinese Restaurant,Supermarket,Ice Cream Shop,Grocery Store,Gym,Electronics Store,Italian Restaurant,0


Okay, but I don't want to dig through endless data manually. How can I determine which cluster fits me best?
This brings us back to our check-in data:

In [59]:
checkins

Unnamed: 0,Checkins
0,Café
1,Supermarket
2,Café
3,Café
4,Diner
5,Butcher
6,Café
7,Diner
8,Café
9,Café


For simplicity, we consider only the top 3 venue categories in our clusters.

In [61]:
d_top3 = bp_merged[['District', '1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue', 'Cluster Labels']]
d_top3.sort_values('Cluster Labels').head() 

Unnamed: 0,District,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,Cluster Labels
1199,XIV.,Bakery,Dessert Shop,Gym / Fitness Center,0
1134,XIII.,Coffee Shop,Gym / Fitness Center,Park,0
1133,XIII.,Coffee Shop,Gym / Fitness Center,Park,0
1132,XIII.,Coffee Shop,Gym / Fitness Center,Park,0
1131,XIII.,Coffee Shop,Gym / Fitness Center,Park,0


And here comes the Sun: we're cross-checking our Foursquare check-in data with the venue hit list.

In [62]:
venueList = list(set(checkins['Checkins']))
venueCol = ['1st Most Common Venue', '2nd Most Common Venue', '3rd Most Common Venue']

# work on an explicit copy, so the assignments below don't trigger a SettingWithCopyWarning
d_top3 = d_top3.copy()
for c in venueCol:
    d_top3[c] = d_top3[c].str.strip()

# True wherever one of the top 3 categories also appears in our check-ins
filter_Fav = d_top3[venueCol].isin(venueList)

# number of matching cells per cluster
fav_K = []
for k in range(kclusters):
    filter_Cluster = (d_top3['Cluster Labels'] == k)
    fav_K.append(int(filter_Fav[filter_Cluster].values.sum()))

# the cluster with the most matches wins (ties go to the lowest cluster number)
winner_Cluster = fav_K.index(max(fav_K))


So now we have the cluster whose top venues match our previous check-ins the most.

In [65]:
print('The selected cluster is: Cluster #{}'.format(winner_Cluster))

The selected cluster is: Cluster #2
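
The matching logic is easier to follow on a handful of rows. The sketch below uses invented data and only two venue columns, and it counts the hits with `isin` plus a `groupby` instead of the loop. One thing to keep in mind: the table has one row per venue, so a district with many venues contributes many rows to its cluster's score.

```python
import pandas as pd

toy_checkins = ['Café', 'Supermarket', 'Diner']
venue_cols = ['1st Most Common Venue', '2nd Most Common Venue']
toy_top = pd.DataFrame({
    '1st Most Common Venue': ['Café', 'Bus Stop', 'Bus Stop', 'Hotel'],
    '2nd Most Common Venue': ['Bar', 'Supermarket', 'Supermarket', 'Park'],
    'Cluster Labels': [0, 1, 1, 0],
})

# True where a top venue category is also one of our check-in categories
matches = toy_top[venue_cols].isin(toy_checkins)

# hits per row, summed up per cluster
hits_per_cluster = matches.sum(axis=1).groupby(toy_top['Cluster Labels']).sum()
winner = hits_per_cluster.idxmax()
print(hits_per_cluster.to_dict(), winner)
```

Cluster 0 gets one hit (Café) and cluster 1 gets two (Supermarket, twice), so cluster 1 wins here.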


For a more specific visualization, we have to determine which districts belong to our selected cluster.

In [66]:
# districts that have venue rows labelled with the winning cluster
winner_names = d_top3.loc[d_top3['Cluster Labels'] == winner_Cluster, 'District'].unique()
winner_Districts = bp_districts[bp_districts['District'].isin(winner_names)]
winner_Districts

Unnamed: 0,District,Latitude,Longitude
2,III.,47.568691,19.027668
3,IV.,47.577779,19.093164
9,X.,47.482235,19.156494
11,XII.,47.5048,18.982815
14,XV.,47.562714,19.140218
18,XIX.,47.449333,19.14412
20,XXI.,47.424317,19.069214


We can combine these data with the average real estate prices and show the lowest first.

In [69]:
winning_districts_with_prices = pd.merge(winner_Districts, df_bp_prices, on=['District'], how='inner')
winning_districts_with_prices.sort_values('AvgPrice')

Unnamed: 0,District,Latitude,Longitude,AvgPrice
6,XXI.,47.424317,19.069214,493899
5,XIX.,47.449333,19.14412,518294
4,XV.,47.562714,19.140218,534810
2,X.,47.482235,19.156494,577233
1,IV.,47.577779,19.093164,591152
0,III.,47.568691,19.027668,754493
3,XII.,47.5048,18.982815,937865
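
The same merge-and-sort step in miniature, as a sketch: an inner merge keeps only the districts that appear in both tables, and sorting by `AvgPrice` puts the cheapest one first. The prices are taken from the table loaded earlier; the `toy_` names are mine.

```python
import pandas as pd

toy_winners = pd.DataFrame({'District': ['III.', 'IV.', 'X.']})
toy_prices = pd.DataFrame({
    'District': ['I.', 'III.', 'IV.', 'X.'],
    'AvgPrice': [1023321, 754493, 591152, 577233],
})

# inner merge drops district I., which is not among the winners
toy_ranked = (pd.merge(toy_winners, toy_prices, on='District', how='inner')
                .sort_values('AvgPrice')
                .reset_index(drop=True))
cheapest = toy_ranked.loc[0, 'District']
print(toy_ranked)
print('Cheapest:', cheapest)
```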


We're saving the minimum price for marker colorization.

In [78]:
min_price = min(winning_districts_with_prices['AvgPrice'])
min_price

493899

And a map again, with a bit of spice. We now have a handful of districts to choose from, but that alone may not be enough to decide. So why don't we combine the average real estate prices with the selected districts? A choropleth map does the job, with the markers of the winning districts superimposed on top and the one with the lowest price highlighted:

In [80]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

bp_geo = r'data/bp_districts.json' # geojson file

# Map.choropleth() is gone from recent folium releases; the Choropleth class takes the same arguments
folium.Choropleth(
    geo_data=bp_geo,
    data=df_bp_prices,
    columns=['District', 'AvgPrice'],
    key_on='feature.properties.name',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Real Estate Prices (avg HUF / sqm)'
).add_to(map_clusters)

markers_colors = []
for lat, lon, dist, price in zip(winning_districts_with_prices['Latitude'], winning_districts_with_prices['Longitude'], winning_districts_with_prices['District'], winning_districts_with_prices['AvgPrice']):
    label = folium.Popup(str(dist), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill_color=('red' if price == min_price else 'lightblue'),
        fill=True,
        color='darkblue',
        fill_opacity=0.7,
        line_opacity=0.6).add_to(map_clusters)
 

map_clusters
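
A common reason for a choropleth to come out blank or patchy is a mismatch between the names in the GeoJSON and the names in the data column that `key_on` is supposed to link. Below is a sketch of a quick sanity check. It assumes the district name sits under `properties.name` in every feature (which is what `key_on='feature.properties.name'` implies); `unmatched_keys` is a hypothetical helper and `toy_geo` is a tiny stand-in for `data/bp_districts.json`, with one deliberately misspelled district in the data.

```python
import pandas as pd

def unmatched_keys(geojson, data, column, prop='name'):
    """Return (names only in the GeoJSON, names only in the data)."""
    geo_names = {f['properties'][prop] for f in geojson['features']}
    data_names = set(data[column])
    return sorted(geo_names - data_names), sorted(data_names - geo_names)

# tiny stand-in for the real GeoJSON file (geometry left empty on purpose)
toy_geo = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'name': 'I.'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'name': 'II.'}, 'geometry': None},
    ],
}
# 'II' is missing its trailing dot, and 'III.' has no polygon at all
toy_prices = pd.DataFrame({'District': ['I.', 'II', 'III.'],
                           'AvgPrice': [1023321, 892150, 754493]})

only_geo, only_data = unmatched_keys(toy_geo, toy_prices, 'District')
print(only_geo, only_data)
```

With the real files you would load the GeoJSON with `json.load` and pass `df_bp_prices`; two empty lists mean every district can be colored.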

Of course, we could consider other factors here, such as crime rates, education and so on, but we have to save something for the next version too ;)


### References
- Gap statistic algorithm: Miles Granger (https://github.com/milesgranger/gap_statistic)
- Real estate price data: https://www.ingatlannet.hu/statisztika/Budapest
- District polygon data: https://www.openstreetmap.org/
- Polygon to GeoJSON conversion: http://polygons.openstreetmap.fr/
- GeoJSON validation: http://geojson.io/

###### Gergely Karacsonyi 