WARNING! To run this notebook, it is better to download it and run it outside of GitHub, because the GitHub notebook viewer cannot render some outputs (for example, Folium maps). You also need to install the packages geocoder, folium, requests, pandas, numpy, and scikit-learn (imported as sklearn).

In [4]:
#install these libraries if you do not have them
!pip install geocoder folium requests pandas numpy scikit-learn





# Retirement home project in Prague
Prague is the capital of the Czech Republic, a country in central Europe. As in other European countries, the average age of the population has been slowly increasing for decades. This ongoing change creates demand for more services for elderly people, such as pharmacies, hospitals, and parks. It also affects housing: many apartments are in buildings without elevators, and sometimes there are no stores or public transport stops nearby.

Fortunately, there are projects, companies, and people who want to offer decent living possibilities for seniors. There are many retirement homes in the country, and they make life much easier for elderly people (especially when there is no family to take care of them).

Let's say there is a fictional real estate development company which wants to build a new retirement home in Prague. __In this project, I will analyze different parts of Prague, using the Foursquare API and information from Wikipedia, to find potentially good places for building a new retirement home__. I will use several criteria which I consider important.

_Please take into account that this project is simplified and includes only some basic criteria, so it does not reflect the whole reality._

## Usage of data in planning
For our project, I decided to set several criteria which may be important for the analysis of different districts in Prague. But before that, we need to decide what counts as a _part of Prague_.

### Dividing city into different neighborhoods / districts

On Wikipedia (https://en.wikipedia.org/wiki/Districts_of_Prague) we can find a list of districts in Prague. According to it, Prague can be divided into districts or neighborhoods in three different ways:
 - Administrative districts: Prague is divided into 22 administrative districts, named simply Prague 1, Prague 2, Prague 3, ..., Prague 22. This division is sometimes only formal, and each district covers a large area, so it is not very precise.
 - Municipal districts: Prague can also be divided into 57 parts, which is more precise.
 - Cadastral areas: There are 111 cadastral areas in Prague, which is even more precise, but sometimes there is not enough information about each of them, because they are defined more historically.
 
In general, each Municipal district is part of one Administrative district. For our purpose, it is best to use Municipal districts, as a lot of information is available for them (unlike Cadastral areas) and they are smaller (thus more precise) than Administrative districts.

### Selecting our criteria
I decided to use the following criteria (everything is based just on my knowledge and my personal experience of living in Prague):

 - How good is the district for relaxation? Elderly people usually prefer a calm and clean environment.
     - How many parks are in the district? (available on Foursquare)
     - How many clubs, bars, and hotels are in the district? These places usually attract people who want to enjoy their free time, so there could be problems with noise and conflicts in streets, parks, and so on. This data is also available on Foursquare.
 - How suitable is the district in terms of public services? For elderly people it is especially important to have everything they need in their district.
     - How many hospitals, pharmacies, and clinics are in the district? (available on Foursquare)
     - How many stores and supermarkets are in the district? (available on Foursquare)
     - Is there any public transport station (bus, tram, or subway)? (available on Foursquare)
     
There are more criteria which could be considered, but for the purpose of our project these should be enough.

----------------------------------------------------

## Technical part of the project
In this part, I do all the work needed to reach a conclusion. At the end, there should be a list of districts which are the most suitable for building a new retirement home.

### 1) Import necessary libraries
First we need to import the basic libraries we are going to use.


In [5]:
import folium
import requests
import pandas as pd
import numpy as np
import geocoder
import re
from sklearn.cluster import KMeans

### 2)  Getting list of districts from wiki
We need to get the data. If you open the Wikipedia link given above, you can see a table with all municipal districts in Prague. We will use it.

In [6]:
PrgDistrictsLink = 'https://en.wikipedia.org/wiki/Districts_of_Prague'
data = pd.read_html(PrgDistrictsLink)

#because read_html returns list of dataframes, we want to select the first dataframe which represents table with districts
data = data[0]
data.head()

Unnamed: 0,Former district,Current administrative district,Current municipal districts
0,Prague 1,Prague 1,Prague 1
1,Prague 2,Prague 2,Prague 2
2,Prague 3,Prague 3,Prague 3
3,Prague 4,Prague 4,"Prague 4, Kunratice"
4,Prague 4,Prague 11 (part),"Prague 11, Šeberov, Újezd u Průhonic"


Because one row can contain several municipal districts, we need to split them. We also want to remove the parts in brackets.

In [7]:
#getting districts
districts = data['Current municipal districts']

#we join all items with comma - this will help us to split later
districts = ','.join(districts)

#now we need to remove all brackets and everything in them. For that we use regex (a raw string keeps the backslashes intact)
districts = re.sub(r'\(.*?\)','',districts)

#here we split our string by comma again, and also we should strip ending spaces from each municipality
districts = [mun.strip() for mun in districts.split(',')]
districts

['Prague 1',
 'Prague 2',
 'Prague 3',
 'Prague 4',
 'Kunratice',
 'Prague 11',
 'Šeberov',
 'Újezd u Průhonic',
 'Prague 12',
 'Libuš',
 'Prague 5',
 'Slivenec',
 'Prague 13',
 'Řeporyje',
 'Prague 16',
 'Lipence',
 'Lochkov',
 'Velká Chuchle',
 'Zbraslav',
 'Zličín',
 'Prague 6',
 'Lysolaje',
 'Nebušice',
 'Přední Kopanina',
 'Suchdol',
 'Prague 17',
 'Prague 7',
 'Troja',
 'Prague 8',
 'Březiněves',
 'Dolní Chabry',
 'Ďáblice',
 'Prague 9',
 'Prague 14',
 'Dolní Počernice',
 'Prague 18',
 'Čakovice',
 'Prague 19',
 'Miškovice',
 'Satalice',
 'Vinoř',
 'Třeboradice',
 'Prague 20',
 'Prague 21',
 'Běchovice',
 'Klánovice',
 'Koloděje',
 'Prague 10',
 'Křeslice',
 'Prague 15',
 'Dolní Měcholupy',
 'Dubeč',
 'Petrovice',
 'Štěrboholy',
 'Prague 22',
 'Benice',
 'Kolovraty',
 'Královice',
 'Nedvězí']
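
The cleanup steps above can be checked on a few hand-written rows without fetching the Wikipedia page. The following is a minimal, self-contained sketch of the same logic; the function name `split_districts` and the sample rows are made up for illustration (the "(part)" suffix imitates the bracketed notes that the regex removes).

```python
import re

def split_districts(rows):
    """Flatten comma-separated district names and drop bracketed notes."""
    joined = ','.join(rows)
    # raw string so that \( and \) reach the regex engine unchanged
    cleaned = re.sub(r'\(.*?\)', '', joined)
    # strip surrounding whitespace and skip empty leftovers
    return [name.strip() for name in cleaned.split(',') if name.strip()]

# hypothetical sample rows in the same shape as the table column
sample_rows = ['Prague 4, Kunratice', 'Prague 11 (part), Šeberov', 'Prague 12']
result = split_districts(sample_rows)
print(result)
```

Skipping empty strings is an extra safety step that the notebook cell does not need for this table, but it avoids blank district names if a row ever ends with a comma.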

### 3) Adding coordinates to each district
Now we need to get the latitude, longitude, and bounding box (defined by its north-east and south-west corners) of each district. For that we will use the geocoder library, which can fetch this data from the ArcGIS service (because Foursquare is not as good at determining district coordinates).

In [8]:
latitudes,longitudes,nes,sws = [],[],[],[]

#get coordinates for all districts
for district in districts:
    g = geocoder.arcgis('{}, Prague'.format(district)) #this will send a request to get geolocation
    
    #save latitude, longitude and bounding box of district, defined by northeast and southwest corners
    lat,lng = g.json['lat'],g.json['lng']
    ne = ','.join(str(round(i,5)) for i in g.json['bbox']['northeast'])
    sw = ','.join(str(round(i,5)) for i in g.json['bbox']['southwest'])
    
    latitudes.append(lat)
    longitudes.append(lng)
    nes.append(ne)
    sws.append(sw)

#put everything to dataframe together, swap columns for rows, and set columns and indexes
districts_coordinates = pd.DataFrame([districts,latitudes,longitudes,nes,sws]).transpose()
districts_coordinates.columns = ['district','latitude','longitude','northeast','southwest']
districts_coordinates.set_index('district',inplace=True)

#show result
districts_coordinates

Unnamed: 0_level_0,latitude,longitude,northeast,southwest
district,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Prague 1,50.0873,14.4174,"50.09728,14.42742","50.07728,14.40742"
Prague 2,50.0739,14.4396,"50.08394,14.44956","50.06394,14.42956"
Prague 3,50.0826,14.4554,"50.0926,14.46537","50.0726,14.44537"
Prague 4,50.0423,14.4481,"50.05231,14.45805","50.03231,14.43805"
Kunratice,50.0137,14.4853,"50.02371,14.49528","50.00371,14.47528"
Prague 11,50.0309,14.5241,"50.04094,14.53406","50.02094,14.51406"
Šeberov,50.0131,14.5139,"50.02314,14.52392","50.00314,14.50392"
Újezd u Průhonic,50.0095,14.5443,"50.01449,14.54934","50.00449,14.53934"
Prague 12,50.002,14.4181,"50.01201,14.4281","49.99201,14.4081"
Libuš,50.01,14.4608,"50.01999,14.47084","49.99999,14.45084"


### 4) Getting the data about each district
Here we use the Foursquare API to get the number of venues of different kinds in each district. As mentioned at the beginning, we are looking for:
 - Bars, Clubs, and night life venues
 - Parks, forests, and other nature
 - Pharmacies, hospitals, stores, public transport stations
 
 
---


First, we need to define the categories of venues we want to search for. Fortunately, the Foursquare documentation lists a category ID for each type of venue, and a single API call can request several categories at once. We will therefore create custom groups of categories.

In [9]:
#define categories from foursquare, by the API ID
park = '4bf58dd8d48988d163941735'
lake = '4bf58dd8d48988d161941735'
garden = '4bf58dd8d48988d15a941735'
botanical_garden = '52e81612bcbc57f1066b7a22'
dog_run = '4bf58dd8d48988d1e5941735'
nature_preserve = '52e81612bcbc57f1066b7a13'
pedestrian_plaza = '52e81612bcbc57f1066b7a25'
plaza = '4bf58dd8d48988d164941735'
river = '4eb1d4dd4b900d56c88a45fd'
forest = '52e81612bcbc57f1066b7a23'
nature = ','.join([park,lake,garden,forest])

Now we create a function which finds these venues in a given location, using the Foursquare API.

In [10]:
#prepare foursquare client id and client secret (replace these with your own credentials)
client_id = 'AJTAE4Q4HU5CGMLDWUJI1EAZMC1UMQ1K24MRTVQRXXWYZSYS'
client_secret = 'SIDVGKLWPPXTGQ0JJCLHUA220CXYITS2CO534PGVCHKFP2H2'

#lets create function which returns all these facilities
def get_venues(lat,lon,northeast,southwest,catid):
    command = "https://api.foursquare.com/v2/venues/search?ll={},{}&categoryId={}&client_id={}&limit=200&\
client_secret={}&v=20200309&intent=browse&ne={}&sw={}".format(lat,lon,catid,client_id,client_secret,northeast,southwest)
    return requests.get(command)
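
Building the query string by hand works, but it is easy to misplace an `&` or to leave credentials in the source. The sketch below builds the same URL with `urllib.parse.urlencode` and reads the credentials from environment variables when they are not passed in. The function name `build_search_url` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical; the endpoint and parameter names are copied from `get_venues` above.

```python
import os
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(lat, lon, northeast, southwest, catid,
                     client_id=None, client_secret=None):
    """Build the same search URL as get_venues, with encoded parameters."""
    params = {
        'll': '{},{}'.format(lat, lon),
        'categoryId': catid,
        # hypothetical environment variable names, used when no value is passed
        'client_id': client_id or os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': client_secret or os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'limit': 200,
        'v': '20200309',
        'intent': 'browse',
        'ne': northeast,
        'sw': southwest,
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)

# demo credentials are placeholders, not real keys
url = build_search_url(50.0826, 14.4554, '50.0926,14.46537', '50.0726,14.44537',
                       '4bf58dd8d48988d163941735',
                       client_id='demo-id', client_secret='demo-secret')
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['ne'], query['sw'])
```

The resulting URL can be passed to `requests.get` in the same way as the `command` string in `get_venues`.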

Let's try our function and search for some parks in the district _Prague 3_.

In [11]:
#get coordinates and borders of Prague 3
lat,lng,ne,sw = districts_coordinates.loc['Prague 3',:].tolist()

#get parks in Prague 3
parksInPrague3 = get_venues(lat,lng,ne,sw,nature)

Now let's take a look at the first venue which was found in this location. We see that it contains its latitude and longitude, so we could show it on a map.

In [12]:
parksInPrague3.json()['response']['venues'][0]

{'id': '4c932d0a332c37045c738e64',
 'name': 'Náměstí Jiřího z Poděbrad',
 'location': {'address': 'nám. Jiřího z Poděbrad',
  'lat': 50.0780016927659,
  'lng': 14.449823332933661,
  'distance': 647,
  'postalCode': '130 00',
  'cc': 'CZ',
  'neighborhood': 'Vinohrady',
  'city': 'Praha',
  'state': 'Hlavní město Praha',
  'country': 'Česká republika',
  'formattedAddress': ['nám. Jiřího z Poděbrad',
   '130 00 Praha',
   'Česká republika']},
 'categories': [{'id': '4bf58dd8d48988d164941735',
   'name': 'Plaza',
   'pluralName': 'Plazas',
   'shortName': 'Plaza',
   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/plaza_',
    'suffix': '.png'},
   'primary': True}],
 'referralId': 'v-1584309348',
 'hasPerk': False}
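
Only a few fields of this response are needed later. The sketch below pulls the name, coordinates, and primary category out of one venue dictionary; the helper name `summarize_venue` is hypothetical, and the sample is a shortened copy of the output above.

```python
def summarize_venue(venue):
    """Pick name, coordinates and primary category name from one venue dict."""
    location = venue.get('location', {})
    categories = venue.get('categories', [])
    # prefer the category flagged as primary, fall back to the first one
    primary = next((c for c in categories if c.get('primary')),
                   categories[0] if categories else {})
    return {
        'name': venue.get('name'),
        'lat': location.get('lat'),
        'lng': location.get('lng'),
        'category': primary.get('name'),
    }

# shortened copy of the venue shown above
sample_venue = {
    'name': 'Náměstí Jiřího z Poděbrad',
    'location': {'lat': 50.0780016927659, 'lng': 14.449823332933661},
    'categories': [{'name': 'Plaza', 'primary': True}],
}
summary = summarize_venue(sample_venue)
print(summary)
```

Note that this first result is categorised as a Plaza, although the `nature` group only lists park, lake, garden, and forest, so the category filter may match more broadly than the group suggests.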

Okay, now let's show all parks in Prague 3 on a map, using folium.

In [13]:
fmap = folium.Map(location=(lat,lng),zoom_start=14)

#a function which takes as input list of foursquare venues and folium map instance and adds venues to that folium instance
def showFolium(venues_response,fmap,markerColor='blue'):
    for venue in venues_response.json()['response']['venues']:

        #let's find out the venue name and its location
        name = venue['name']
        lat,lng = venue['location']['lat'],venue['location']['lng']
        folium.Marker([lat,lng], popup=name, tooltip=name, icon=folium.Icon(color=markerColor)).add_to(fmap)
    
    return fmap

#show final map with parks in Prague 3
mapParksPrague3 = showFolium(parksInPrague3,fmap)
mapParksPrague3

Good! In the same way we can find the other venues we are looking for. Let's create several groups of venues as mentioned above. For health we select only some types of venues (a maternity clinic, for example, is of little use to elderly people), so we do not use the whole category. For nightlife we use the whole category, which covers all types of bars and nightlife spots, and we also include hotels as potentially noisy venues. For stores we look separately at food stores and at the whole category (which includes many more subcategories such as gas stations, ATMs, and clothes stores). For public transport we look at bus, tram, and subway stops.

In [14]:
#health (having this venue around is advantage)
pharmacy = '4bf58dd8d48988d10f951735'
medical_center = '4bf58dd8d48988d104941735'
hospital = '4bf58dd8d48988d196941735'
rehab_center = '56aa371be4b08b9a8d57351d'
doctors_office = '4bf58dd8d48988d177941735'
dentist = '4bf58dd8d48988d178941735'
health = ','.join([pharmacy,medical_center,hospital,rehab_center,doctors_office,dentist])

#night life and hotels (this is considered to be negative in our case)
nightlife_spot = '4d4b7105d754a06376d81259'
hotel = '4bf58dd8d48988d1fa931735'
noisy_venues = ','.join([nightlife_spot,hotel])

#stores (having these venues around is advantage)
food = '4bf58dd8d48988d1f9941735'
store = '4d4b7105d754a06378d81259'

#public transport (this is something that we want to have in selected district)
bus_station = '4bf58dd8d48988d1fe931735'
bus_stop = '52f2ab2ebcbc57f1066b8b4f'
metro_station = '4bf58dd8d48988d1fd931735'
tram_station = '52f2ab2ebcbc57f1066b8b51'
public_transport = ','.join([bus_station,bus_stop,metro_station,tram_station])

Now we will repeat what we already did above - print the folium map, but this time, we will show different colors for different groups of venues.

In [15]:
health_data = get_venues(lat,lng,ne,sw,health)
noisy_data = get_venues(lat,lng,ne,sw,noisy_venues)
food_data = get_venues(lat,lng,ne,sw,food)
store_data = get_venues(lat,lng,ne,sw,store)
public_transport_data = get_venues(lat,lng,ne,sw,public_transport)

#now create new folium layer and plot data
fmap = folium.Map(location=(lat,lng),zoom_start=14)
fmap = showFolium(health_data,fmap,markerColor='lightblue')
fmap = showFolium(noisy_data,fmap,markerColor='red')
fmap = showFolium(food_data,fmap,markerColor='green')
fmap = showFolium(store_data,fmap,markerColor='darkgreen')
fmap = showFolium(public_transport_data,fmap,markerColor='blue')
fmap

#### Downloading all data
Now that we have a way to download the data for each district, we will download it and append it to the original districts dataframe.

In [16]:
#copy original df (an explicit copy keeps districts_coordinates unchanged)
districts_all_data = districts_coordinates.copy()

#prepare two lists of group names and group codes
groups_names = ['nature','health','stores','foodstores','transport','nightlife']
groups_codes = [nature,health,store,food,public_transport,noisy_venues]

#add venues types iteratively (each iterations adds one column with number of venues in each district)
for group_name,group_code in zip(groups_names,groups_codes):
    num_v = districts_all_data.apply(axis=1,func=lambda x: len(get_venues(*x[:4],group_code).json()['response']['venues']))
    districts_all_data[group_name] = num_v

districts_all_data

Unnamed: 0_level_0,latitude,longitude,northeast,southwest,nature,health,stores,foodstores,transport,nightlife
district,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Prague 1,50.0873,14.4174,"50.09728,14.42742","50.07728,14.40742",37,50,50,25,19,50
Prague 2,50.0739,14.4396,"50.08394,14.44956","50.06394,14.42956",38,50,50,30,25,50
Prague 3,50.0826,14.4554,"50.0926,14.46537","50.0726,14.44537",39,50,50,25,15,50
Prague 4,50.0423,14.4481,"50.05231,14.45805","50.03231,14.43805",11,50,40,50,31,50
Kunratice,50.0137,14.4853,"50.02371,14.49528","50.00371,14.47528",3,14,50,15,8,11
Prague 11,50.0309,14.5241,"50.04094,14.53406","50.02094,14.51406",11,50,50,28,20,46
Šeberov,50.0131,14.5139,"50.02314,14.52392","50.00314,14.50392",1,12,15,4,8,7
Újezd u Průhonic,50.0095,14.5443,"50.01449,14.54934","50.00449,14.53934",0,0,8,1,2,0
Prague 12,50.002,14.4181,"50.01201,14.4281","49.99201,14.4081",7,50,50,19,16,20
Libuš,50.01,14.4608,"50.01999,14.47084","49.99999,14.45084",4,35,50,15,14,20
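
Many counts in the table are exactly 50 and none is higher, even though the request asks for `limit=200`. This suggests that the number of results per request is capped, so a count of 50 may only be a lower bound. The sketch below counts venues in a parsed response and flags such saturated counts; the helper name `count_venues` is hypothetical, and the cap of 50 is inferred from the table, not taken from the API documentation.

```python
ASSUMED_CAP = 50  # inferred from the table above, not from API documentation

def count_venues(payload, cap=ASSUMED_CAP):
    """Return (count, saturated) for a parsed search response."""
    venues = payload.get('response', {}).get('venues', [])
    count = len(venues)
    # a count that reaches the cap is only a lower bound on the true number
    return count, count >= cap

# made-up payloads in the same shape as the Foursquare response used above
full = {'response': {'venues': [{'id': str(i)} for i in range(50)]}}
partial = {'response': {'venues': [{'id': 'a'}, {'id': 'b'}]}}
print(count_venues(full))     # (50, True)
print(count_venues(partial))  # (2, False)
```

Districts with saturated counts in several columns (for example Prague 1 to Prague 3) cannot be distinguished from each other on those columns, which is worth keeping in mind when reading the clustering results.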


### 5) Clustering analysis
Now that we have both kinds of data (the location of each district and its venue counts), we can do the clustering. We will use the KMeans algorithm. Let's try to split our data into 3 clusters.

Let's set up the clustering settings and data, and run it.

In [18]:
#we create KMeans with 3 clusters
clustering = KMeans(n_clusters=3)

#now we fit it to data. But we want to fit only to columns with venues numbers, so we use columns 4 to 9
clustering.fit(districts_all_data.iloc[:,4:10])

#Now we save clusters information to new column
districts_all_data['cluster'] = clustering.labels_

#lets print the data with their clusters
districts_all_data.sort_values(by='cluster')

Unnamed: 0_level_0,latitude,longitude,northeast,southwest,nature,health,stores,foodstores,transport,nightlife,cluster
district,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Březiněves,50.1659,14.4839,"50.17591,14.49389","50.15591,14.47389",0,0,11,2,3,4,0
Suchdol,50.137,14.3693,"50.14701,14.37932","50.12701,14.35932",12,6,26,6,5,11,0
Královice,50.0382,14.6373,"50.04824,14.64727","50.02824,14.62727",2,0,3,1,0,0,0
Dolní Chabry,50.1463,14.4483,"50.15625,14.45826","50.13625,14.43826",1,8,37,5,10,10,0
Ďáblice,50.1484,14.4864,"50.15843,14.49643","50.13843,14.47643",0,10,20,1,6,7,0
Dolní Počernice,50.0897,14.5843,"50.09967,14.59432","50.07967,14.57432",1,3,26,3,7,10,0
Miškovice,50.1574,14.5428,"50.16735,14.55278","50.14735,14.53278",1,0,6,1,2,1,0
Satalice,50.1255,14.572,"50.13649,14.583","50.11449,14.561",3,2,15,2,12,6,0
Vinoř,50.1452,14.5811,"50.15824,14.59408","50.13224,14.56808",10,5,39,9,10,12,0
Přední Kopanina,50.1176,14.2971,"50.12855,14.30814","50.10655,14.28614",0,0,5,2,3,4,0


Now we will visualize our clusters on a map. Each point on the map represents the center of one Prague district, and the color of the point represents its cluster (if you hover over it, it shows the cluster number).

In [19]:
#specify colors and map
colors = ['red','green','blue']
fmap = folium.Map(location=(50.07,14.45),zoom_start=11)

#iterate over clusters and colors
for colr,cluster in zip(colors,sorted(set(clustering.labels_))):
    
    #select all districts matching to current cluster
    clst_dist = districts_all_data[districts_all_data['cluster'] == cluster]
    
    #iteratively add all districts of current cluster to map, with current color
    for dist,lat,lng in zip(clst_dist.reset_index()['district'],clst_dist['latitude'],clst_dist['longitude']):
        folium.Marker([lat,lng], popup=dist, tooltip=cluster, icon=folium.Icon(color=colr)).add_to(fmap)
    
#show map
fmap

### 6) Conclusion
Please note that the "first", "second", and "third" cluster here do not necessarily correspond to cluster labels 0, 1, 2 (for example, cluster label 2 can be the "first" cluster), because the KMeans clustering algorithm is initialized randomly.
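
One way to make the naming independent of the random labels is to rank the clusters by a summary value. The sketch below ranks them by the mean total number of venues per district, so that rank 1 is the busiest cluster. The function name `rank_clusters`, the choice of ranking key, and the sample numbers are all assumptions made for illustration; the notebook itself assigns the names by inspecting the map and the data.

```python
def rank_clusters(labels, totals):
    """Map raw cluster labels to ranks 1..k, where rank 1 has the highest mean total."""
    sums, sizes = {}, {}
    for label, total in zip(labels, totals):
        sums[label] = sums.get(label, 0) + total
        sizes[label] = sizes.get(label, 0) + 1
    means = {label: sums[label] / sizes[label] for label in sums}
    # labels sorted from the highest to the lowest mean total
    ordered = sorted(means, key=means.get, reverse=True)
    return {label: rank for rank, label in enumerate(ordered, start=1)}

# made-up labels and total venue counts per district, for illustration only
labels = [2, 2, 0, 0, 1]
totals = [230, 240, 30, 20, 120]
ranking = rank_clusters(labels, totals)
print(ranking)  # {2: 1, 1: 2, 0: 3}
```

In this made-up example, raw label 2 becomes the "first" cluster, label 1 the "second", and label 0 the "third", whatever labels a particular KMeans run happens to produce.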

What can we see from these three clusters? At first glance it is not clear, but if you look more closely at the map and the data, you can see the following:
 - first cluster: mostly districts closer to the city center. These districts usually have many venues of all kinds, so you can find more stores and medical centers there, but unfortunately also more nightlife spots such as bars and hotels. _This is not very good for elderly people, as they usually do not need many different stores, and although there are many parks in the center, the map shows that they are usually smaller than those on the outskirts of the city._
 - second cluster: its districts are usually farther from the city center than the districts of the first cluster. They are worse in terms of health care venues (pharmacies, clinics, ...) and public transport stations, but the numbers show that there are still enough of them. Also, even though there are fewer parks, they are usually larger than those in the first cluster. Shopping possibilities are very similar to the first cluster. _This cluster might be very good, as public transport is still no problem in its districts, parks are larger, and health care and shopping are within reach. Best of all, there are fewer hotels, bars, and other noisy places than in the first cluster._
 - third cluster: the map shows that districts in this cluster are usually the farthest from the city center. The good thing is that the number of parks is similar to the second cluster, and the parks are large. There are also fewer hotels, bars, and other noisy places, not just in comparison with the first cluster but also with the second. Unfortunately, the number of useful venues is not very high either. _This cluster does not seem very good. It is the exact opposite of the first cluster: it lies on the edge of the city, with no noise, but also with not enough public transport, stores, and health care services._

--------------------------------------------------------------------------------

Finally, we can say that __the second cluster is the most suitable for building a new retirement home.__ While the first and third clusters are rather extreme (the first is in the city center, noisy and with not enough nature; the third lacks stores, public transport stations, and other important facilities), the second cluster is well balanced: services are still reachable in its districts, and it is much less noisy than the first cluster.