# Best place to stay while travelling to Istanbul

### Introduction

When planning a visit to an unfamiliar place, it is often hard to decide which area is best to stay in for a short time. The main concerns are usually the price of accommodation and safety. As tourists, we also want to be close to tourist attractions and places to eat out.

The purpose of this notebook is to help make that decision for a trip to Istanbul by classifying its districts based on the criteria mentioned above.

At the end districts are classified into 3 categories:

class: 2 -> **high** (red) rent and crime, but best attractions and food distance-score<br>
class: 0 -> **optimal** (green) rent and attractions/food distance-score, and low crime index<br>
class: 1 -> **low** (blue) rent, medium crime but worst attractions and food distance-score
    


The classification, together with other statistics and location of attractions and restaurants, is shown on the map at the end of this notebook.

The calculations and assumptions used to determine the classification criteria are explained in each section.

# Table of contents

1. [Gather data](#get_data)<br>
    1.1 [Districts of Istanbul](#districts)<br>
    1.2 [Geojson data](#geoj_data)<br>
    1.3 [Rent prices](#rent_data)<br>
    1.4 [Crime data](#crime_data)<br>
    1.5 [Top attractions](#attractions_data)<br>
    1.6 [Top eating-out places](#food_data)<br>
2. [Classification](#classif)
3. [Map](#map)<br>
    3.1 [Show map](#show_map)

## 1. Gather data<a name='get_data'></a>

### 1.1 Districts of Istanbul<a name='districts'></a>

Download the list of Istanbul districts, with basic statistics, from Wikipedia:
[https://en.wikipedia.org/wiki/List_of_districts_of_Istanbul](https://en.wikipedia.org/wiki/List_of_districts_of_Istanbul)

In [1]:
import pandas as pd
import numpy as np

In [2]:
# don't show warnings about chained assignments
pd.options.mode.chained_assignment = None  # default='warn'

Use pandas `read_html` to scrape the table.

In [3]:
url_districts = 'https://en.wikipedia.org/wiki/List_of_districts_of_Istanbul'
districts = pd.read_html(url_districts)
df_districts = districts[0].loc[:38].copy()  # keep rows 0-38; copy so later assignments do not act on a slice
df_districts.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²)
0,Adalar,15238,11.05,1379
1,Arnavutköy,282488,450.35,627
2,Ataşehir,425094,25.23,16849
3,Avcılar,448882,42.01,10685
4,Bağcılar,745125,22.36,33324


#### District coordinates

Using geopy's Nominatim geocoder (with the user agent `foursquare_agent`), get the coordinates of each district.

In [4]:
from geopy.geocoders import Nominatim

In [5]:
geolocator = Nominatim(user_agent="foursquare_agent")

for idx, district in enumerate(df_districts['District']):
    location = geolocator.geocode(district + ', Istanbul, Turkey')
    df_districts.loc[idx, 'latitude'] = location.latitude
    df_districts.loc[idx, 'longitude'] = location.longitude
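
Nominatim's `geocode` returns `None` when it finds no match, in which case the loop above would fail on `location.latitude`. A minimal sketch of a more defensive lookup follows. The helper name `geocode_districts` and the fake geocoder used to exercise it offline are illustrative assumptions, not part of the original notebook.

```python
from collections import namedtuple

# Stand-in for geopy's Location object (only the two attributes used here).
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def geocode_districts(names, geocode, suffix=", Istanbul, Turkey"):
    """Return ({name: (lat, lon)}, [names that could not be geocoded]).

    `geocode` is any callable with the same contract as
    geolocator.geocode: it returns an object with .latitude/.longitude,
    or None when nothing is found.
    """
    coords, not_found = {}, []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            not_found.append(name)
            continue
        coords[name] = (location.latitude, location.longitude)
    return coords, not_found


# Hypothetical geocoder used only to demonstrate the helper offline.
def fake_geocode(query):
    known = {"Adalar, Istanbul, Turkey": FakeLocation(40.876259, 29.091027)}
    return known.get(query)


coords, not_found = geocode_districts(["Adalar", "Nowhere"], fake_geocode)
```

With the real geocoder, the equivalent call would be `geocode_districts(df_districts['District'], geolocator.geocode)`, and `not_found` lists the districts that need manual attention.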

In [6]:
df_districts.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude
0,Adalar,15238,11.05,1379,40.876259,29.091027
1,Arnavutköy,282488,450.35,627,41.184182,28.740729
2,Ataşehir,425094,25.23,16849,40.984749,29.10672
3,Avcılar,448882,42.01,10685,40.980135,28.717547
4,Bağcılar,745125,22.36,33324,41.033899,28.857898


#### Unify district names

District names from different sources may differ in their use of Turkish letters, so we transliterate all of them to the Latin alphabet.

In [7]:
from unidecode import unidecode

- replace district names in the dataframe

In [8]:
for i, dist in enumerate(df_districts['District']):
    df_districts.loc[i, 'District'] = unidecode(dist)

In [9]:
df_districts[['District']].head()

Unnamed: 0,District
0,Adalar
1,Arnavutkoy
2,Atasehir
3,Avcilar
4,Bagcilar


### 1.2 Geojson data for districts<a name='geoj_data'></a>

The geospatial data for the borders of districts in Turkey was downloaded from [gadm.org](https://gadm.org).

In [10]:
import json

- import coordinate data for district borders

In [None]:
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\turkey_dist.json", 'r', encoding='utf-8') as f:
    turkey_dist_geo = json.load(f)

- extract coordinate data for the districts of Istanbul

In [None]:
istanbul_dist_geo_list = []
dist_names = []
for dist in turkey_dist_geo['features']:
    if dist['properties']['NAME_1'] == 'Istanbul':
        istanbul_dist_geo_list.append(dist)
        dist_names.append(dist['properties']['NAME_2'])

istanbul_dist_geo = {}
istanbul_dist_geo['type'] = turkey_dist_geo['type']
istanbul_dist_geo['features'] = istanbul_dist_geo_list
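
The same filtering can be written as a small reusable function. This is a sketch that assumes the GADM property names used above (`NAME_1` for the province, `NAME_2` for the district); the function name and the toy input are made up for illustration.

```python
def filter_features(feature_collection, province, key="NAME_1"):
    """Return a new FeatureCollection with only the features of one province."""
    selected = [
        feature
        for feature in feature_collection["features"]
        if feature["properties"].get(key) == province
    ]
    return {"type": feature_collection["type"], "features": selected}


# Toy input with empty geometries, just to show the shape of the data.
toy = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NAME_1": "Istanbul", "NAME_2": "Adalar"}, "geometry": None},
        {"type": "Feature", "properties": {"NAME_1": "Ankara", "NAME_2": "Çankaya"}, "geometry": None},
    ],
}

istanbul_only = filter_features(toy, "Istanbul")
district_names = [f["properties"]["NAME_2"] for f in istanbul_only["features"]]
```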

- Save json file with district borders for Istanbul

In [None]:
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\istanbul.json", "w", encoding='utf-8') as f:
    json.dump(istanbul_dist_geo, f, ensure_ascii=False)

- it turned out that the Silivri and Üsküdar districts were both stored under the Üsküdar district. Using https://geojson.io/ I separated the two districts and saved the whole dataset as a geojson file

#### Unify district names

District names from different sources may differ in their use of Turkish letters, so we transliterate all of them to the Latin alphabet.

- replace names in the geojson file

In [11]:
#open geojson file
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\istanbul_map.geojson", "r", encoding='utf-8') as f:
    istanbul_dist_geo = json.load(f)

In [12]:
istanbul_dist_geo['features'][4]['properties']['NAME_2']

'Bağcılar'

In [13]:
# replace all district names
names = []
for i, dist in enumerate(istanbul_dist_geo['features']):
    dist_name = dist['properties']['NAME_2']
    istanbul_dist_geo['features'][i]['properties']['NAME_2'] = unidecode(dist_name)
    names.append(unidecode(dist_name))

In [14]:
istanbul_dist_geo['features'][4]['properties']['NAME_2']

'Bagcilar'

In [15]:
# save geojson file
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\istanbul_map2.geojson", "w", encoding='utf-8') as f:
    json.dump(istanbul_dist_geo, f, ensure_ascii=False)

- check if all names match for both

In [16]:
names == list(df_districts['District'])
filtered_list = [i for i, dist in enumerate(names) if dist not in list(df_districts['District'])]
filtered_list

[18]

In [17]:
print(names[18])
print(df_districts.loc[18,'District'])

Eyup
Eyupsultan


\* *the name of one district differs slightly*
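
The check above only reports names from the geojson file that are missing in the dataframe. A two-way comparison with sets shows both sides of a mismatch at once. The helper below is an illustrative sketch, not code from the notebook.

```python
def name_mismatches(left, right):
    """Return (names only in left, names only in right), each sorted."""
    left_set, right_set = set(left), set(right)
    return sorted(left_set - right_set), sorted(right_set - left_set)


geojson_names = ["Adalar", "Eyup", "Fatih"]
dataframe_names = ["Adalar", "Eyupsultan", "Fatih"]

only_in_geojson, only_in_dataframe = name_mismatches(geojson_names, dataframe_names)
```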

- change the name in the dataframe to match the one in the geojson file

In [18]:
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\istanbul_map2.geojson", "r", encoding='utf-8') as f:
    name = json.load(f)

In [19]:
df_districts.loc[filtered_list, 'District'] = name['features'][filtered_list[0]]['properties']['NAME_2']

### 1.3 Rent prices<a name="rent_data"></a>

I was unable to find a free API or dataset (for personal use) with short-term rental prices. Instead, it is assumed here that prices of short-stay accommodation are correlated with long-term rent prices. Because only the comparison between districts matters, not the exact amounts, this should be sufficient.

We scrape rent prices from [https://www.realtygroup.com.tr/average-rent-price-in-istanbul-is-1486-tl/](https://www.realtygroup.com.tr/average-rent-price-in-istanbul-is-1486-tl/)

- use BeautifulSoup library to get average rent price per $m^{2}$ for each district

In [20]:
from bs4 import BeautifulSoup
import requests

In [21]:
# url for the website
url_rent = 'https://www.realtygroup.com.tr/average-rent-price-in-istanbul-is-1486-tl/'
html_content = requests.get(url_rent)
content = html_content.text
soup = BeautifulSoup(content, 'lxml')

# find table on the website
table = soup.find("table", attrs = {"width": "0"})

In [22]:
# get table headers
t_headers = []
for th in table.find_all("strong"):
    t_headers.append(th.text.replace("\n",' ').strip())

t_headers

['Payback Period (years)', 'Rent (TL/m2)', 'District']

In [23]:
# get data from table
table_data = []
for tr in table.tbody.find_all("tr"): # each row in tbody of table is tr
    t_row = {}
    for td, th in zip(tr.find_all("td"), t_headers): # each cell in row is td
        t_row[th] = td.text.replace('\n','').strip()
    table_data.append(t_row)
table_data = table_data[1:]

In [24]:
df_rent = pd.DataFrame(table_data)
df_rent.drop(['Payback Period (years)'], axis=1, inplace=True)
df_rent.head()

Unnamed: 0,Rent (TL/m2),District
0,18,Adalar
1,8,Arnavutköy
2,16,Ataşehir
3,12,Avcılar
4,12,Bağcılar


#### Unify district names

District names in from different sources might differ in Turkish letters, therefore, we replace all to latin alphabet.

- remove Turkish characters

In [25]:
for i, dist in enumerate(df_rent['District']):
    df_rent.loc[i, 'District'] = unidecode(dist)

In [26]:
df_rent.head()

Unnamed: 0,Rent (TL/m2),District
0,18,Adalar
1,8,Arnavutkoy
2,16,Atasehir
3,12,Avcilar
4,12,Bagcilar


- check if district names match those from before

In [27]:
filtered_list = [i for i, dist in enumerate(df_rent['District']) if dist not in list(df_districts['District'])]
filtered_list

[]

- add rent prices to districts dataframe

In [28]:
df_districts_rent = df_districts.join(df_rent.set_index('District'), on='District', how='left')
df_districts_rent.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2)
0,Adalar,15238,11.05,1379,40.876259,29.091027,18
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12


## 1.4 Crime data<a name='crime_data'></a>

Because some crimes are more serious than others, the safety of each district is determined using the Crime Severity Index (https://www150.statcan.gc.ca/n1/pub/85-004-x/2009001/part-partie1-eng.htm).

It is calculated by multiplying the number of crimes committed of each type by a weight reflecting the seriousness of that type (more serious crimes have higher weights), summing over the types, and dividing by the population of the district.

The weights will be based on the weights given for Canada (https://www150.statcan.gc.ca/n1/pub/85-004-x/2009001/t001-eng.htm).
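
Written out, the index described above for a district $d$ can be expressed as follows. The symbols are introduced here for illustration only: $n_{c,d}$ is the number of recorded crimes of type $c$ in district $d$, $w_c$ is the weight of that crime type, and $P_d$ is the population of the district.

$$
\mathrm{CSI}_d = \frac{\sum_{c} w_c \, n_{c,d}}{P_d}
$$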

The only crime statistics for Istanbul I could find were from 2003.<br>
*Ergun, N., & Yirmibeşoğlu, F. (2007). Distribution of Crime Rates in Different Districts in Istanbul. Turkish Studies, 8(3), 435–455. doi:10.1080/14683840701489324*

The crime types reported for Istanbul are coarser than those listed with the Canadian weights, so for some categories the weight is the average of several Canadian offence types.

- load data from pdf file

In [29]:
import tabula

In [30]:
# Read the tables on pages 5 and 6 of the PDF; the cells below assume the result is a single DataFrame
pdf_path = "C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\ergun2007.pdf"
dfs = tabula.read_pdf(pdf_path, pages='5,6')

In [31]:
dfs.head()

Unnamed: 0,Eminönü,27,244,110,91,513,346,1331,55635,"0,024"
0,Beyoğlu,21,497,236,183,2059,1096,4092,231900,17
1,Fatih,74,504,378,116,4377,699,6148,403508,15
2,Sişli,17,216,42,108,3107,284,3774,270674,14
3,Beşiktaş,7,99,37,79,2322,268,2812,190813,15
4,Üsküdar,16,223,196,49,2015,224,2723,495118,5


In [32]:
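# tabula parsed the first data row (Eminönü) as the header row.
# Renaming the columns below discards that row, so Eminönü is not included in dfs.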
dfs.rename(columns={'Eminönü':'District',
                    '27':'Homocide', 
                    '244':'Attempted Homocide', 
                    '110':'Assault', 
                    '91':'Aggravated Assult',
                    '513':'Theft',
                    '346':'Pickpocketing_Snatching',
                    '1331':'Total crime 2003',
                    '55635':'Population 2000',
                    '0,024':'Crime %00'}, inplace=True)
dfs.head()           

Unnamed: 0,District,Homocide,Attempted Homocide,Assault,Aggravated Assult,Theft,Pickpocketing_Snatching,Total crime 2003,Population 2000,Crime %00
0,Beyoğlu,21,497,236,183,2059,1096,4092,231900,17
1,Fatih,74,504,378,116,4377,699,6148,403508,15
2,Sişli,17,216,42,108,3107,284,3774,270674,14
3,Beşiktaş,7,99,37,79,2322,268,2812,190813,15
4,Üsküdar,16,223,196,49,2015,224,2723,495118,5


In [33]:
dfs.tail()

Unnamed: 0,District,Homocide,Attempted Homocide,Assault,Aggravated Assult,Theft,Pickpocketing_Snatching,Total crime 2003,Population 2000,Crime %00
28,Sultanbeyli,7,95,124,8,524,22,780,175700,4
29,Avcılar,5,102,39,33,883,60,1122,233749,5
30,Tuzla,4,45,84,11,451,71,666,107883,6
31,Average,144,165,76,49,13618,1795,1846063,283925,7
32,TOTAL,462,5281,2437,1573,43578,5743,59074,9085599,7


In [34]:
# remove the last 2 rows (Average and TOTAL)
dfs.drop([31,32], inplace=True)

- check dtypes of constructed dataframe

In [35]:
dfs.dtypes

District                   object
Homocide                   object
Attempted Homocide          int64
Assault                     int64
Aggravated Assult           int64
Theft                      object
Pickpocketing_Snatching    object
Total crime 2003           object
Population 2000             int64
Crime %00                  object
dtype: object

\* some of the values in dataframe are of type object instead of int or float

- change object type to int in dataframe

In [36]:
cols = dfs.columns.drop('District')
dfs[cols] = dfs[cols].apply(pd.to_numeric, errors='coerce')

- get weights for crime types

In [37]:
url_crime_weights = 'https://www150.statcan.gc.ca/n1/pub/85-004-x/2009001/t001-eng.htm'
crime_weights = pd.read_html(url_crime_weights)
df_crime_weights = crime_weights[0][2:]
df_crime_weights.rename(columns={'Unnamed: 0': 'Offence'}, inplace=True)
df_crime_weights.reset_index(inplace=True, drop=True)
df_crime_weights

Unnamed: 0,Offence,Weight
0,Murder 1 st and 2 nd degree,7042
1,Manslaughter,1822
2,Attempted murder,1411
3,Sexual assault - level 3,1047
4,Discharging firearm with intent,988
5,Sexual assault - level 2,678
6,Robbery,583
7,Assault - level 3,405
8,Using firearm in commission of an offence,267
9,Sexual assault - level 1,211


Check data type in crime weights dataframe

In [38]:
df_crime_weights.dtypes

Offence    object
Weight     object
dtype: object

Change object to numeric in 'Weight' column

In [39]:
df_crime_weights['Weight'] = pd.to_numeric(df_crime_weights['Weight'])

#### Calculate the Crime Severity Index for each district
- we assign crime types for Istanbul with weights as:
    - Homocide: Murder 1 st and 2 nd degree
    - Attempted Homocide: Attempted murder
    - Assault: average of Assault - level 1, Assault - level 2, Sexual assault - level 1
    - Aggravated Assult: average of Assault - level 3, Sexual assault - level 2, Sexual assault - level 3
    - Theft: average of Theft of a motor vehicle, Theft over \$5,000
    - Pickpocketing_Snatching: Theft under \$5,000
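
As a worked example of the averaging, the weight for `Aggravated Assult` can be computed from the three Canadian weights visible in the table above (405, 678 and 1047). The dictionary layout is an assumption made for this sketch; the notebook itself looks the weights up by row index.

```python
from statistics import mean

# Weights copied from the crime-weights table shown earlier in this section.
canadian_weights = {
    "Assault - level 3": 405,
    "Sexual assault - level 2": 678,
    "Sexual assault - level 3": 1047,
}

# Istanbul category -> Canadian offences whose weights are averaged.
category_map = {
    "Aggravated Assult": [
        "Assault - level 3",
        "Sexual assault - level 2",
        "Sexual assault - level 3",
    ],
}

category_weights = {
    category: mean(canadian_weights[offence] for offence in offences)
    for category, offences in category_map.items()
}
```

Keeping the mapping in a dictionary makes it easy to see which offences feed each category without having to cross-check row numbers.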

In [40]:
crime_idx = []
for i in range(dfs.shape[0]):
    
    homocide = (dfs.loc[i,'Homocide'] * df_crime_weights.loc[0,'Weight'])/dfs.loc[i,'Population 2000']
    
    attemp_homocide = (dfs.loc[i,'Attempted Homocide'] * df_crime_weights.loc[2,'Weight'])/dfs.loc[i,'Population 2000']
    
    assault = (dfs.loc[i,'Assault'] * df_crime_weights.loc[[23,16,9],'Weight'].mean())/dfs.loc[i,'Population 2000']
    
    agg_assault = (dfs.loc[i,'Aggravated Assult'] * df_crime_weights.loc[[7,5,3],'Weight'].mean())/dfs.loc[i,'Population 2000']
    
    theft = (dfs.loc[i,'Theft'] * df_crime_weights.loc[[15,12],'Weight'].mean())/dfs.loc[i,'Population 2000']
                                                   
    snatching = (dfs.loc[i,'Pickpocketing_Snatching'] * df_crime_weights.loc[21,'Weight'])/dfs.loc[i,'Population 2000']
    
    
    crime_idx.append(homocide+attemp_homocide+assault+agg_assault+theft+snatching)

- scale values so that the highest index equals 100

In [41]:
crime_idx_ = list(np.round((crime_idx/max(crime_idx))*100))
crime_idx_100 = [int(i) for i in crime_idx_]
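
To make the two steps concrete (weighted crime counts per inhabitant, then rescaling so that the highest value becomes 100), here is a small self-contained sketch. The district labels, counts, populations and weights below are invented for the example and are not taken from the data.

```python
def severity_index(counts, weights, population):
    """Weighted number of crimes per inhabitant for one district."""
    return sum(counts[crime] * weights[crime] for crime in counts) / population


def scale_to_100(values):
    """Rescale so that the largest value becomes 100, rounded to integers."""
    largest = max(values)
    return [int(round(value / largest * 100)) for value in values]


# Hypothetical weights and districts, for illustration only.
weights = {"theft": 10, "assault": 50}
districts = {
    "A": ({"theft": 100, "assault": 10}, 10_000),  # (1000 + 500) / 10000 = 0.15
    "B": ({"theft": 30, "assault": 0}, 10_000),    # 300 / 10000 = 0.03
}

raw = [severity_index(counts, weights, pop) for counts, pop in districts.values()]
scaled = scale_to_100(raw)
```

Because the index is divided by its maximum, the resulting numbers are only meaningful relative to each other within this set of districts.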

- crime idx for districts

In [42]:
df_crime_idx = dfs[['District']]
df_crime_idx['Crime index'] = crime_idx_100
df_crime_idx.head()

Unnamed: 0,District,Crime index
0,Beyoğlu,100
1,Fatih,84
2,Sişli,58
3,Beşiktaş,49
4,Üsküdar,26


#### Unify district names

District names in from different sources might differ in Turkish letters, therefore, we replace all to latin alphabet.

In [43]:
for i, dist in enumerate(df_crime_idx['District']):
    df_crime_idx.loc[i, 'District'] = unidecode(dist)

In [44]:
filtered_list = [i for i, dist in enumerate(df_crime_idx['District']) if dist not in list(df_districts['District'])]
filtered_list

[]

- join crime data with the rest

In [45]:
df_districts_rent_crime = df_districts_rent.join(df_crime_idx.set_index('District'), on='District', how='left')
df_districts_rent_crime.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index
0,Adalar,15238,11.05,1379,40.876259,29.091027,18,30.0
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8,
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16,
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12,24.0
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12,19.0


### 1.5 Top attractions<a name='attractions_data'></a>

Using the Triposo API (free for personal use), find the top 100 sightseeing places in Istanbul and get their names, scores and coordinates.

- use Triposo API to find sightseeing places

In [46]:
# get my credentials
with open("C:\\Users\\Pawel\\Documents\\Github\\triposo_creds.json", 'r') as f:
    triposo_creds = json.load(f)
triposo_id = triposo_creds['ID']
triposo_token = triposo_creds['API_token']

In [47]:
# determine place and number of results
place = 'Istanbul'
count = '100'

In [48]:
url = ('https://www.triposo.com/api/20200405/poi.json?location_id={}' \
    + '&tag_labels=sightseeing' \
    + '&count={}' \
    + '&fields=name,score,tag_labels,coordinates' \
    + '&order_by=-score' \
    + '&account={}' \
    + '&token={}').format(place, count, triposo_id, triposo_token)

results = requests.get(url).json()
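
String concatenation works, but the query string can also be assembled from a dictionary, which keeps each parameter on its own line and handles escaping. The sketch below uses only the standard library and the same endpoint and parameters as the cell above; the function name is hypothetical and the credential values are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.triposo.com/api/20200405/poi.json"


def build_poi_url(place, tag, count, account, token):
    """Build the POI query URL for one tag label (e.g. 'sightseeing')."""
    params = {
        "location_id": place,
        "tag_labels": tag,
        "count": count,
        "fields": "name,score,tag_labels,coordinates",
        "order_by": "-score",
        "account": account,
        "token": token,
    }
    # safe=',' keeps the comma-separated field list readable in the URL
    return BASE_URL + "?" + urlencode(params, safe=",")


# Placeholder credentials, not real values.
example_url = build_poi_url("Istanbul", "sightseeing", 100, "MY_ID", "MY_TOKEN")
```

The resulting string can be passed to `requests.get` exactly like the `url` variable above.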

In [49]:
istanbul_attractions = pd.DataFrame(results['results'])
istanbul_attractions.head()

Unnamed: 0,name,coordinates,score,tag_labels,sightseeing_score
0,Topkapı Palace,"{'latitude': 41.0132564, 'longitude': 28.984852}",9.979321,"[museums, district, character, sightseeing, to...",9.979321
1,Sultan Ahmed Mosque,"{'latitude': 41.0052619, 'longitude': 28.9768725}",9.921485,"[district, character, sightseeing, poitype-Mos...",9.921485
2,Hagia Sophia,"{'latitude': 41.008337, 'longitude': 28.9792172}",9.878436,"[district, character, sightseeing, poitype-Chu...",9.878436
3,Dolmabahçe Palace,"{'latitude': 41.0391994360295, 'longitude': 28...",9.841813,"[district, sightseeing, poitype-Palace, topatt...",9.841813
4,İstanbul Archaeology Museums,"{'latitude': 41.011661893672574, 'longitude': ...",9.69159,"[museums, district, sightseeing, architectural...",9.69159


In [50]:
istanbul_attractions.drop(['tag_labels', 'sightseeing_score'], axis=1, inplace=True)
istanbul_attractions.head()

Unnamed: 0,name,coordinates,score
0,Topkapı Palace,"{'latitude': 41.0132564, 'longitude': 28.984852}",9.979321
1,Sultan Ahmed Mosque,"{'latitude': 41.0052619, 'longitude': 28.9768725}",9.921485
2,Hagia Sophia,"{'latitude': 41.008337, 'longitude': 28.9792172}",9.878436
3,Dolmabahçe Palace,"{'latitude': 41.0391994360295, 'longitude': 28...",9.841813
4,İstanbul Archaeology Museums,"{'latitude': 41.011661893672574, 'longitude': ...",9.69159


In [51]:
latitude = []
longitude = []
for coordinate in istanbul_attractions['coordinates']:
    latitude.append(coordinate['latitude'])
    longitude.append(coordinate['longitude'])

In [52]:
istanbul_attractions['Latitude'] = latitude
istanbul_attractions['Longitude'] = longitude
istanbul_attractions.drop(['coordinates'], axis=1, inplace=True)
istanbul_attractions.head()

Unnamed: 0,name,score,Latitude,Longitude
0,Topkapı Palace,9.979321,41.013256,28.984852
1,Sultan Ahmed Mosque,9.921485,41.005262,28.976872
2,Hagia Sophia,9.878436,41.008337,28.979217
3,Dolmabahçe Palace,9.841813,41.039199,28.999731
4,İstanbul Archaeology Museums,9.69159,41.011662,28.981299


- Change names of attractions to not contain Turkish characters

In [53]:
for i, name in enumerate(istanbul_attractions['name']):
    istanbul_attractions.loc[i, 'name'] = unidecode(name)

#### District-attraction score

Calculate the district-attraction score: the average distance from the center of the district to each attraction, weighted by the inverse of the score the attraction obtained on Triposo. A lower value is better.<br>
This way, an attraction that is farther away than another but has a much higher score can still contribute a better (lower) value to the district-attraction score.

\* real travel time would be more appropriate; however, I could not find a free API or dataset (for personal use) providing such information.

- based on the coordinates, we determine the distance between each attraction and the center of each district
- then we weight that distance by the inverse of the attraction's score and average the result for each district
    - you would rather walk a bit more if the attraction is much better
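
A minimal, dependency-free sketch of the score for a single district is shown below. It uses the haversine distance and the weighting described above (distance divided by score, then averaged). The two district centers are rough illustrative coordinates; the two attraction entries reuse values from the attractions table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # mean radius of the Earth


def haversine_km(point1, point2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(radians, point1)
    lat2, lon2 = map(radians, point2)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def district_score(center, places):
    """Mean of distance / score over all places; lower means closer or better."""
    weighted = [haversine_km(center, (lat, lon)) / score for lat, lon, score in places]
    return sum(weighted) / len(weighted)


# (latitude, longitude, score)
attractions = [
    (41.013256, 28.984852, 9.979321),  # Topkapi Palace
    (41.008337, 28.979217, 9.878436),  # Hagia Sophia
]
near_center = (41.01, 28.98)  # illustrative point close to both attractions
far_center = (41.18, 28.74)   # illustrative point far from both

near_score = district_score(near_center, attractions)
far_score = district_score(far_center, attractions)
```

Dividing by the score means a highly rated place counts as if it were closer, which is the trade-off the bullet points describe.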

In [54]:
def distance(point1, point2):
    # uses the ‘haversine’ formula to calculate the great-circle distance between two points
    
    # input is a list with location in form point=[latitude, longitude]
        
    # volumetric mean radius of Earth [km]
    R = 6371
    
    point1_rad = np.array(point1)*np.pi/180
    point2_rad = np.array(point2)*np.pi/180
    d_lat = point1_rad[0]-point2_rad[0]
    d_lon = point1_rad[1]-point2_rad[1]
    
    a = np.sin((d_lat)/2)**2 + np.cos(point1_rad[0])*np.cos(point2_rad[0])*np.sin((d_lon)/2)**2
    c = 2*np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = R*c
    return d

In [55]:
df_districts_rent_crime.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index
0,Adalar,15238,11.05,1379,40.876259,29.091027,18,30.0
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8,
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16,
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12,24.0
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12,19.0


In [56]:
istanbul_attractions.head()

Unnamed: 0,name,score,Latitude,Longitude
0,Topkapi Palace,9.979321,41.013256,28.984852
1,Sultan Ahmed Mosque,9.921485,41.005262,28.976872
2,Hagia Sophia,9.878436,41.008337,28.979217
3,Dolmabahce Palace,9.841813,41.039199,28.999731
4,Istanbul Archaeology Museums,9.69159,41.011662,28.981299


In [57]:
mean_distance_att = []
for dist, lat, lon in zip(df_districts_rent_crime['District'],df_districts_rent_crime['latitude'], df_districts_rent_crime['longitude']):
    position_dist = [lat, lon]
    distance_att = []
    for lat_att, lon_att, score_att in zip(istanbul_attractions['Latitude'], istanbul_attractions['Longitude'], istanbul_attractions['score']):
        position_att = [lat_att, lon_att]
        distance_att.append(distance(position_dist, position_att)*(1/score_att))
    
    mean_distance_att.append(sum(distance_att)/len(distance_att))

In [58]:
df_istanbul = df_districts_rent_crime
df_istanbul['District-attractions score'] = mean_distance_att
df_istanbul.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index,District-attractions score
0,Adalar,15238,11.05,1379,40.876259,29.091027,18,30.0,2.297994
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8,,3.141996
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16,,1.390679
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12,24.0,2.739199
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12,19.0,1.308855


### 1.6 Top eating-out places<a name='food_data'></a>

Using the Triposo API (free for personal use), find the top 100 eating-out places in Istanbul and get their names, scores and coordinates.

In [59]:
url = ('https://www.triposo.com/api/20200405/poi.json?location_id={}' \
    + '&tag_labels=eatingout' \
    + '&count={}' \
    + '&fields=name,score,tag_labels,coordinates' \
    + '&order_by=-score' \
    + '&account={}' \
    + '&token={}').format(place, count, triposo_id, triposo_token)

results = requests.get(url).json()

In [60]:
istanbul_restaurants = pd.DataFrame(results['results'])
istanbul_restaurants.head()

Unnamed: 0,name,coordinates,score,tag_labels,eatingout_score
0,Mikla Restaurant,"{'latitude': 41.0310866, 'longitude': 28.9740165}",9.735005,"[lunch, dinner, cuisine, feature, district, ch...",9.735005
1,Beyti,"{'latitude': 40.97334205996415, 'longitude': 2...",9.6342,"[lunch, dinner, cuisine, district, eatingout, ...",9.6342
2,Babylonia Garden Terrace Restaurant,"{'latitude': 41.005844763358574, 'longitude': ...",9.501239,"[lunch, dinner, cuisine, feature, district, ea...",9.501239
3,Cozy Pub & Restaurant,"{'latitude': 41.0082598, 'longitude': 28.9743634}",8.994464,"[lunch, coffee, dinner, cuisine, feature, dist...",8.994464
4,Arch Bistro,"{'latitude': 41.0046271, 'longitude': 28.9712651}",7.328758,"[lunch, coffee, dinner, cuisine, district, bre...",7.328758


In [61]:
latitude = []
longitude = []
for coordinate in istanbul_restaurants['coordinates']:
    latitude.append(coordinate['latitude'])
    longitude.append(coordinate['longitude'])

In [62]:
istanbul_restaurants['Latitude'] = latitude
istanbul_restaurants['Longitude'] = longitude
istanbul_restaurants.drop(['coordinates', 'tag_labels', 'eatingout_score'], axis=1, inplace=True)
istanbul_restaurants.head()

Unnamed: 0,name,score,Latitude,Longitude
0,Mikla Restaurant,9.735005,41.031087,28.974017
1,Beyti,9.6342,40.973342,28.793959
2,Babylonia Garden Terrace Restaurant,9.501239,41.005845,28.980645
3,Cozy Pub & Restaurant,8.994464,41.00826,28.974363
4,Arch Bistro,7.328758,41.004627,28.971265


- Change names of restaurants to not contain Turkish characters

In [63]:
for i, name in enumerate(istanbul_restaurants['name']):
    istanbul_restaurants.loc[i, 'name'] = unidecode(name)

#### District-food score

Calculate the district-food score: the average distance from the center of the district to each restaurant, weighted by the inverse of the score the restaurant obtained on Triposo. A lower value is better.<br>
This way, a restaurant that is farther away than another but has a much higher score can still contribute a better (lower) value to the district-food score.

\* real travel time would be more appropriate; however, I could not find a free API or dataset (for personal use) providing such information.

- based on the coordinates, we determine the distance between each restaurant and the center of each district
- then we weight that distance by the inverse of the restaurant's score and average the result for each district
    - you would rather walk a bit more if the restaurant is much better

In [64]:
istanbul_restaurants.head()

Unnamed: 0,name,score,Latitude,Longitude
0,Mikla Restaurant,9.735005,41.031087,28.974017
1,Beyti,9.6342,40.973342,28.793959
2,Babylonia Garden Terrace Restaurant,9.501239,41.005845,28.980645
3,Cozy Pub & Restaurant,8.994464,41.00826,28.974363
4,Arch Bistro,7.328758,41.004627,28.971265


In [65]:
mean_distance_food = []
for dist, lat, lon in zip(df_districts_rent_crime['District'],df_districts_rent_crime['latitude'], df_districts_rent_crime['longitude']):
    position_dist = [lat, lon]
    distance_food = []
    for lat_food, lon_food, score_food in zip(istanbul_restaurants['Latitude'], istanbul_restaurants['Longitude'], istanbul_restaurants['score']):
        position_food = [lat_food, lon_food]
        distance_food.append(distance(position_dist, position_food)*(1/score_food))
    
    mean_distance_food.append(sum(distance_food)/len(distance_food))

In [66]:
df_istanbul['District-food score'] = mean_distance_food
df_istanbul.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index,District-attractions score,District-food score
0,Adalar,15238,11.05,1379,40.876259,29.091027,18,30.0,2.297994,3.0443
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8,,3.141996,4.438903
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16,,1.390679,1.882177
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12,24.0,2.739199,3.689194
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12,19.0,1.308855,1.741132


## 2. Classification<a name='classif'></a>

Implement the K-means algorithm to cluster the districts of Istanbul into 3 groups based on the parameters described in the introduction.

- the clustering will be based on rent, crime score, district-attraction score and district-food score

In [67]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

- before clustering make sure to change all values to numeric

In [68]:
cols = df_istanbul.drop('District',axis=1).columns
df_istanbul[cols] = df_istanbul[cols].apply(pd.to_numeric, errors='coerce')
df_istanbul.dtypes

District                       object
Population (2019)               int64
Area (km²)                    float64
Density (per km²)               int64
latitude                      float64
longitude                     float64
Rent (TL/m2)                    int64
Crime index                   float64
District-attractions score    float64
District-food score           float64
dtype: object

- Choose features for classification

In [69]:
features = df_istanbul[['Rent (TL/m2)', 'Crime index', 'District-attractions score', 'District-food score']]

- crime data is missing for a few districts
    - to deal with this, we perform the clustering iteratively:
        - first, NaN values are replaced with the mean of the corresponding feature
        - with that we run K-means
        - we take the centroid of the cluster to which each element with a NaN value was assigned
        - for the next iteration, we replace the NaN with the corresponding centroid value instead of the mean
        - we run K-means again and get the new centroids
        - if they did not change, we stop iterating

In [70]:
def kmeans_missing(X, n_clusters, max_iter=10):

    # Initialize missing values to their column means
    # locate missing data
    missing = ~np.isfinite(X)
    # calculate mean for each column excluding nan rows
    mu = np.nanmean(X, 0, keepdims=1)
    # replace nan with mean
    X_hat = np.where(missing, mu, X)

    for i in range(max_iter):
        if i == 0:
            # for first iteration calculate Kmeans where nan are replaced by mean values
            cls = KMeans(n_clusters, init='k-means++', n_init=20, max_iter=500, random_state=4)
        else:
            # for next iterations instead of mean nan are replaced by center of centroid for the feature corresponding to cluster the element was classified 
            cls = KMeans(n_clusters, n_init=20, random_state=4)

        # find cluster centers for each cluster
        labels = cls.fit_predict(X_hat)
        centroids = cls.cluster_centers_

        # centroid values for the entries that were originally missing
        new_fill = centroids[labels][missing]

        # stop when the imputed values (the assigned centroids) no longer change
        if i > 0 and np.allclose(X_hat[missing], new_fill):
            break

        # fill in the missing values based on their cluster centroids
        X_hat[missing] = new_fill

    return labels, centroids, X_hat, i

Normalize the features so that each one is equally important when classifying districts.

In [71]:
cluster_dataset = StandardScaler().fit_transform(features)
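
Standardization rescales each feature to zero mean and unit variance, so a feature measured in large units does not dominate one measured in small units. The sketch below reproduces that calculation for a single column with the standard library, using the population standard deviation and skipping missing values. The function name is illustrative; the sample values are the first rows of the rent and crime columns shown earlier.

```python
from math import isnan
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores; NaN entries are ignored in the statistics and kept as NaN."""
    present = [v for v in values if not isnan(v)]
    mu = fmean(present)
    sigma = pstdev(present)
    return [(v - mu) / sigma if not isnan(v) else v for v in values]


rent = [18.0, 8.0, 16.0, 12.0, 12.0]
crime = [30.0, float("nan"), float("nan"), 24.0, 19.0]  # contains missing values

rent_z = standardize(rent)
crime_z = standardize(crime)
```

After this step both columns are expressed in standard deviations from their own mean, so Euclidean distances in K-means treat them on the same scale.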

- Perform clustering

In [72]:
labels, centroids, X_hat, i = kmeans_missing(cluster_dataset, 3, 100)

In [73]:
df_istanbul_class = df_istanbul
df_istanbul_class['Class'] = labels
df_istanbul_class.head()

Unnamed: 0,District,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index,District-attractions score,District-food score,Class
0,Adalar,15238,11.05,1379,40.876259,29.091027,18,30.0,2.297994,3.0443,0
1,Arnavutkoy,282488,450.35,627,41.184182,28.740729,8,,3.141996,4.438903,0
2,Atasehir,425094,25.23,16849,40.984749,29.10672,16,,1.390679,1.882177,0
3,Avcilar,448882,42.01,10685,40.980135,28.717547,12,24.0,2.739199,3.689194,0
4,Bagcilar,745125,22.36,33324,41.033899,28.857898,12,19.0,1.308855,1.741132,0


- check how districts were divided into clusters

In [74]:
df_istanbul_class.groupby(by=['Class']).mean(numeric_only=True)

Class,Population (2019),Area (km²),Density (per km²),latitude,longitude,Rent (TL/m2),Crime index,District-attractions score,District-food score
0,461831.76,70.6044,17023.28,41.01417,28.97112,12.48,28.333333,1.862418,2.550404
1,165318.6,605.834,865.2,41.044706,28.843554,10.4,45.4,5.403565,7.528982
2,349653.333333,60.993333,15056.888889,41.041964,28.954692,22.0,48.875,0.971887,1.269735


class: 2 -> **high** (red) rent and crime, but best attractions and food distance-score<br>
class: 0 -> **optimal** (green) rent and attractions/food distance-score, and low crime index<br>
class: 1 -> **low** (blue) rent, medium crime but worst attractions and food distance-score
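
The cluster numbers returned by k-means are arbitrary, so a rerun with different settings could permute them. As a hedged sketch, assuming the mean rent alone is enough to order the three clusters, the names could be derived from the group means in the table above instead of being tied to fixed numbers (`mean_rent`, `names_by_rank` and `cluster_names` are illustrative names, not part of the notebook):

```python
# mean rent per cluster, copied from the groupby table above
mean_rent = {0: 12.48, 1: 10.4, 2: 22.0}

# order cluster ids from cheapest to most expensive and name them accordingly
names_by_rank = ['Low', 'Optimal', 'High']
ordered_ids = sorted(mean_rent, key=mean_rent.get)
cluster_names = {cluster_id: name for cluster_id, name in zip(ordered_ids, names_by_rank)}

print(cluster_names)
```

For the means shown here this reproduces the assignment used in this notebook: cluster 1 is low, cluster 0 is optimal and cluster 2 is high.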

## 3. Map<a name='map'></a>

Create an interactive map showing:<pre>
    a. classification of districts
    b. population of districts
    c. population density of districts
    d. average rent prices for districts
    e. crime scores for districts
    f. markers for the top 100 attractions
    g. markers for the top 100 eating-out places
</pre>

In [75]:
import folium
import branca
from folium import plugins

- create a base map

In [76]:
# create an empty map centered on Istanbul (tiles are added as a separate layer)
istanbul_map = folium.Map(
    [df_districts['latitude'].median(), df_districts['longitude'].median()],
    zoom_start=9,
    tiles=None)

base_map = folium.FeatureGroup(name='Basemap', overlay=True, control=False)
folium.TileLayer(tiles='OpenStreetMap').add_to(base_map)
base_map.add_to(istanbul_map)

# geojson file
with open("C:\\Users\\Pawel\\Documents\\Github\\projects\\Coursera_Capstone\\notebooks\\Istanbul\\istanbul_map2.geojson", "r", encoding='utf-8') as f:
    istanbul_geo = json.load(f)
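
The layers below match rows of `df_istanbul` to GeoJSON features through the `NAME_2` property, so a single spelling difference between the two sources would leave a lookup empty. A quick consistency check could look like the following sketch, which uses small hypothetical stand-ins (`district_names`, `geo_demo`) rather than the real dataframe and file:

```python
# hypothetical stand-ins for df_istanbul['District'] and the loaded GeoJSON
district_names = ['Adalar', 'Arnavutkoy', 'Atasehir']
geo_demo = {'features': [{'properties': {'NAME_2': 'Adalar'}},
                         {'properties': {'NAME_2': 'Arnavutköy'}},
                         {'properties': {'NAME_2': 'Atasehir'}}]}

# names present in one source but not in the other
geo_names = {f['properties']['NAME_2'] for f in geo_demo['features']}
missing_in_geo = sorted(set(district_names) - geo_names)
missing_in_table = sorted(geo_names - set(district_names))

print(missing_in_geo, missing_in_table)
```

Both lists should be empty before the layers are built; here the differing spelling of one district shows up in each list.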

- create color schemes for layers

In [77]:
# color schemes for different layers
colors = [['#CCD7FF', '#9788FF', '#A248FA', '#D50DF1', '#CD06AD', '#A8025E', '#800022'],
          ['#DEEDCF', '#99D492', '#56B870', '#1D9A6C', '#16837A', '#0F596B', '#0A2F51'],
          ['#DFCCFF', '#E888FF', '#FF44E9', '#FF007C', '#D50028', '#AA0700', '#802C00'],
          ['#DAFAF1', '#ACE9F0', '#7FB4E5', '#5568D7', '#5444B6', '#5D3593', '#5A2670']]

names = ['Population (2019)','Density (per km²)','Rent (TL/m2)', 'Crime index']
colorscales=[]
for i in range(len(names)):
    colorscale = branca.colormap.LinearColormap(
                                colors=colors[i]
                                ).scale(0, df_istanbul[names[i]].max())
    colorscales.append(colorscale)
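
Each color scale maps a value between 0 and the column maximum onto a gradient through the seven listed colors. The sketch below illustrates the idea with standard-library code only. It is a conceptual approximation that assumes evenly spaced colors, not branca's actual implementation, and the helper names and the maximum of 30 are hypothetical:

```python
def hex_to_rgb(h):
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers."""
    h = h.lstrip('#')
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def linear_color(value, vmax, palette):
    """Interpolate linearly between evenly spaced palette colors on [0, vmax]."""
    t = min(max(value / vmax, 0.0), 1.0) * (len(palette) - 1)
    lo = min(int(t), len(palette) - 2)      # index of the left neighbour color
    frac = t - lo                           # position between the two neighbours
    a, b = hex_to_rgb(palette[lo]), hex_to_rgb(palette[lo + 1])
    rgb = [round(x + (y - x) * frac) for x, y in zip(a, b)]
    return '#{:02x}{:02x}{:02x}'.format(*rgb)

# the rent palette from the cell above
rent_palette = ['#DFCCFF', '#E888FF', '#FF44E9', '#FF007C', '#D50028', '#AA0700', '#802C00']

print(linear_color(0, 30, rent_palette))    # #dfccff
print(linear_color(15, 30, rent_palette))   # #ff007c
print(linear_color(30, 30, rent_palette))   # #802c00
```

Because every scale starts at 0 rather than at the column minimum, districts with similar, high values end up close together at the dark end of the gradient.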

- empty layer

In [78]:
style_function = lambda x: {'color': '#000000',
                            'opacity': 0.3,
                            'weight': 2,
                            'fillOpacity': 0
                           }

highlight_function = lambda x: {'fillColor': '#000000',
                                'fillOpacity': 0.2}

empty_layer = folium.FeatureGroup(name='Empty', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                fields=['NAME_2'],
                aliases=['District:'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(empty_layer)

empty_layer.add_to(istanbul_map)

#### a. classification map

In [79]:
# add cluster names to geospatial data
class_name_by_district = df_istanbul_class.set_index('District')['Class'].map({2: 'High', 0: 'Optimal', 1: 'Low'})
for i in range(df_istanbul_class.shape[0]):
    istanbul_geo['features'][i]['properties']['Class'] = class_name_by_district[istanbul_geo['features'][i]['properties']['NAME_2']]

In [80]:
style_function = lambda x: {'fillColor': 'red' if df_istanbul_class[df_istanbul_class['District']==x['properties']['NAME_2']]['Class'].values==2
                            else ('green' if df_istanbul_class[df_istanbul_class['District']==x['properties']['NAME_2']]['Class'].values==0
                            else 'blue'),
                            'fillOpacity': 0.7,
                            'color': '#000000',
                            'opacity': 0.5,
                            'weight': 2
                           }

highlight_function = lambda x: {'fillColor': '#00FF00'}

layer = folium.FeatureGroup(name='Clusters', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                #fields=['NAME_2', ('high' if 'Class'== 0 else ('optimal' if 'Class'==1 else 'low'))],
                fields=['NAME_2', 'Class'],
                aliases=['District:', 'Cluster:'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(layer)

layer.add_to(istanbul_map)

#### b. population layer

In [81]:
# add population number for districts to geospatial data
for i in range(df_istanbul.shape[0]):
    istanbul_geo['features'][i]['properties']['Population'] = int(df_istanbul[df_istanbul['District']==istanbul_geo['features'][i]['properties']['NAME_2']]['Population (2019)'].values)
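
The same "copy one column into the GeoJSON properties" loop is repeated for population, density, rent and crime. A small helper could express the pattern once. This is only a sketch: `add_property` and the miniature inputs are hypothetical and not part of the notebook.

```python
def add_property(geo, values_by_district, key, name_field='NAME_2'):
    """Copy one value per district into each GeoJSON feature's properties."""
    for feature in geo['features']:
        props = feature['properties']
        # districts without a value get None instead of raising an error
        props[key] = values_by_district.get(props[name_field])
    return geo

# hypothetical miniature inputs
geo_demo = {'features': [{'properties': {'NAME_2': 'Adalar'}},
                         {'properties': {'NAME_2': 'Avcilar'}}]}
population_demo = {'Adalar': 15238, 'Avcilar': 448882}

add_property(geo_demo, population_demo, 'Population')
print(geo_demo['features'][1]['properties'])
```

With the notebook's dataframe, the lookup dictionary could be built with `df_istanbul.set_index('District')['Population (2019)'].to_dict()`.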

In [82]:
style_function = lambda x: {'fillColor': colorscales[0](int(df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Population (2019)'].values)),
                            'fillOpacity': 0.7,
                            'color': '#000000',
                            'opacity': 0.5,
                            'weight': 2
                           }

highlight_function = lambda x: {'fillColor': '#00FF00'}

layer = folium.FeatureGroup(name='Population', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                fields=['NAME_2', 'Population'],
                aliases=['District:', 'Population:'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(layer)

layer.add_to(istanbul_map)

#### c. density layer

In [83]:
# add population density data to geospatial data
for i in range(df_istanbul.shape[0]):
    istanbul_geo['features'][i]['properties']['Density'] = int(df_istanbul[df_istanbul['District']==istanbul_geo['features'][i]['properties']['NAME_2']]['Density (per km²)'].values)

In [84]:
style_function = lambda x: {'fillColor': colorscales[1](int(df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Density (per km²)'].values)),
                            'fillOpacity': 0.7,
                            'color': '#000000',
                            'opacity': 0.5,
                            'weight': 2
                           }

highlight_function = lambda x: {'fillColor': '#00FF00'}

layer = folium.FeatureGroup(name='Density', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                fields=['NAME_2', 'Density'],
                aliases=['District:','Density [km<sup>-2</sup>]:'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(layer)

layer.add_to(istanbul_map)

#### d. rent layer

In [85]:
# add rent price data to geospatial data
for i in range(df_istanbul.shape[0]):
    istanbul_geo['features'][i]['properties']['Rent'] = int(df_istanbul[df_istanbul['District']==istanbul_geo['features'][i]['properties']['NAME_2']]['Rent (TL/m2)'].values)

In [86]:
style_function = lambda x: {'fillColor': colorscales[2](int(df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Rent (TL/m2)'].values)),
                            'fillOpacity': 0.7,
                            'color': '#000000',
                            'opacity': 0.5,
                            'weight': 2
                           }

highlight_function = lambda x: {'fillColor': '#00FF00'}

layer = folium.FeatureGroup(name='Rent', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                fields=['NAME_2','Rent'],
                aliases=['District:','Rent [TL/m<sup>2</sup>]:'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(layer)

layer.add_to(istanbul_map)

#### e. crime layer

In [87]:
# add crime score data to geospatial data
for i in range(df_istanbul.shape[0]):
    istanbul_geo['features'][i]['properties']['Crime'] = float(df_istanbul[df_istanbul['District']==istanbul_geo['features'][i]['properties']['NAME_2']]['Crime index'].values)

In [88]:
style_function = lambda x: {'fillColor': '#000000' if np.isnan(df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Crime index'].values)
                            else colorscales[3]((df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Crime index'].values)),
                            'fillOpacity': 0 if np.isnan(df_istanbul[df_istanbul['District']==x['properties']['NAME_2']]['Crime index'].values)
                                            else 0.7,
                            'color': '#000000',
                            'opacity': 0.5,
                            'weight': 2
                           }

highlight_function = lambda x: {'fillColor': '#00FF00'}

layer = folium.FeatureGroup(name='Crime', overlay=False)

folium.GeoJson(
            istanbul_geo,
            style_function=style_function,
            tooltip=folium.features.GeoJsonTooltip(
                fields=['NAME_2', 'Crime'],
                aliases=['District:', 'Crime index (0-100):'],
                style=('background-color: grey; color: white;')
                ),
            highlight_function=highlight_function
        ).add_to(layer)

layer.add_to(istanbul_map)

#### f. Markers for attractions

In [89]:
layer_attr = folium.FeatureGroup(name='Attractions', overlay=True, show=False)

# create clusters for markers
clusters = plugins.MarkerCluster().add_to(layer_attr)

for latitude, longitude, name, score in zip(istanbul_attractions['Latitude'],
                                            istanbul_attractions['Longitude'],
                                            istanbul_attractions['name'], 
                                            istanbul_attractions['score']):
    text = folium.Html('<b>{}</b><br>({}/10)'.format(name, round(score,2)), script=True)
    popup = folium.Popup(text)
    folium.Marker(
        [latitude, longitude],
        popup= popup,
        #tooltip='click me!',   #hover msg
        icon= folium.Icon(color='red', icon='glyphicon-map-marker')
    ).add_to(clusters)
    
layer_attr.add_to(istanbul_map)

#### g. Markers for eating-out places

In [90]:
layer_rest = folium.FeatureGroup(name='Food', overlay=True, show=False)

# create clusters for markers
clusters_rest = plugins.MarkerCluster().add_to(layer_rest)

for latitude, longitude, name, score in zip(istanbul_restaurants['Latitude'],
                                            istanbul_restaurants['Longitude'],
                                            istanbul_restaurants['name'], 
                                            istanbul_restaurants['score']):
    text = folium.Html('<b>{}</b><br>({}/10)'.format(name, round(score,2)), script=True)
    popup = folium.Popup(text)
    folium.Marker(
        [latitude, longitude],
        popup= popup,
        #tooltip='click me!',   #hover msg
        icon= folium.Icon(color='blue', icon='glyphicon-map-marker')
    ).add_to(clusters_rest)
    
layer_rest.add_to(istanbul_map)

### 3.1 Show map <a name='show_map'></a>

In [91]:
folium.LayerControl(collapsed=False).add_to(istanbul_map)

istanbul_map

In [92]:
istanbul_map.save('istanbul_map.html')